We commonly reboot entire clusters at once (around 10,000 servers in larger clus... | Hacker News


fintler on Sept 11, 2013 | on: CoreOS: Boot on Bare Metal with PXE

We commonly reboot entire clusters at once (around 10,000 servers in larger clusters -- each running a full Linux OS) over PXE without a problem. We have a configuration management machine that creates an image, then we push that down to a small cluster of TFTP servers that serve it out.

The strain on NFS (we keep parts of the OS in RAM, and load other parts on demand over NFS) after we kexec from the PXE kernel into the production kernel causes more problems than the initial TFTP traffic (but it usually works fine as well). Btw, after booting, we use PanFS (DirectFlow) or Lustre for computing stuff, not NFS.

Although it's not what we use, here's a program that does a similar type of management: http://warewulf.lbl.gov/trac

If you take the time to combine Warewulf with something like Puppet or Chef, you'll have a nice system for managing 100s of thousands of machines (I could easily see this scaling to over a million servers if you have the cash to build something like that).

If you're wondering about dynamic libraries in an environment like this, take a look at https://github.com/hpc/Spindle

And yes, I still get giddy when I type one command to reboot 10,000 servers.
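To give a rough feel for the "small cluster of TFTP servers" part, here is a minimal Python sketch of one way boot traffic could be spread across such a cluster. Everything in it is an assumption for illustration: the comment does not say how nodes are mapped to TFTP servers, and the hash-based mapping, the names `pick_tftp_server` and `boot_load`, the host names, and the count of 8 servers are all hypothetical.

```python
import hashlib
from collections import Counter


def pick_tftp_server(node_name, tftp_servers):
    """Map a node to one TFTP server using a stable hash of its name.

    Assumption: a static hash mapping; the real setup may use DHCP
    options, DNS round robin, or something else entirely.
    """
    if not tftp_servers:
        raise ValueError("need at least one TFTP server")
    digest = hashlib.sha256(node_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(tftp_servers)
    return tftp_servers[index]


def boot_load(node_names, tftp_servers):
    """Count how many nodes would fetch their image from each server."""
    return Counter(pick_tftp_server(n, tftp_servers) for n in node_names)


# Hypothetical host names: 10,000 compute nodes and 8 TFTP servers.
servers = [f"tftp{i}" for i in range(8)]
nodes = [f"node{i:05d}" for i in range(10000)]
load = boot_load(nodes, servers)

for server in servers:
    print(server, load[server])
```

Because the mapping depends only on the node name and the server list, a node asks the same server on every reboot, and the per-server counts give a quick estimate of how many simultaneous image fetches each server would see during a full-cluster reboot.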

ajdecon on Sept 12, 2013

Since you mention kexec and TFTP+NFS, are you currently using Perceus? Or is there another system out there with that combo?

fintler on Sept 15, 2013

We're using a modified Perceus tied into cfengine.
