<span class="gmail_quote"></span><span class="gmail_quote"></span>Hi, <br><div><span><br>I&#39;ve put together a simple 2-node cluster using Debian etch, OpenMPI, FAI &amp; Cfengine. <div><span class="e" id="q_114baaa454d580b7_1">

<br>I&#39;m looking for ideas that can help me build a better self-healing cluster. Right now I&#39;m writing rule files for cfengine and would appreciate any input on sample files and important configurations that need to be in place for the cluster&#39;s health. (Although such files are site-specific, I&#39;m sure I can get good hints out of them.)

<br><br>I&#39;d also be glad to hear whether you have any monitoring system in mind that can cooperate with cfengine in the maintenance job. I&#39;ve looked briefly into Ganglia and Nagios so far. It seems Ganglia is mostly meant for large clusters (or groups of clusters) and focuses on hardware resources. Nagios seems better suited to my job, but the gurus on the cfengine mailing list believe that cfenvd &amp; cfexecd can provide equal monitoring &amp; recovery capability (in terms of response time).

<br>What&#39;s your take on either of them?<br><br>Thanks in advance to anyone sharing their experience.<br><br><br>

</span></div></span></div>