ITS Research Computing has decommissioned the Killdevil computing cluster.
Users migrated their workflows from Killdevil to Research Computing’s more modern, more powerful clusters, Longleaf and Dogwood.
Killdevil served the University for seven years
Research Computing turned off Killdevil on December 17, 2018. Killdevil had served UNC-Chapel Hill researchers since summer 2011. Its 8,976 CPUs were seven years old, and it had 48 terabytes of memory.
Killdevil was a general-purpose cluster used widely across campus. The groups it served included genetics, biostatistics, chemistry, economics, environmental science, physics, biology, pharmacy, biochemistry and biophysics, computer science, marine sciences and the business school.
During its heyday, Killdevil ran jobs around the clock, except during infrequent maintenance windows and outages. By design, a compute cluster with a job scheduler runs as many jobs at once as it can and keeps a list of pending jobs waiting for resources to become available.
“As we transitioned users to Longleaf and then Dogwood, this eased a little although maybe it says something about the appetite for computing in the research community that we had users doing opportunistic computing up until we pulled the plug, even though we had better — and cheaper — options by then,” said Mark Reed, Scientific Engagement Manager with Research Computing.
The transition went smoothly, generating few emails and service requests from users. “We did have a long lead-in to enable users to transition to the new clusters,” Reed said, “and we identified all users with significant usage and set up some special services to make it simple for them to identify and get an account on the cluster most suited for their research.”
Researchers have access to more than 15,000 modern CPUs and 112 terabytes of memory on Dogwood. On Longleaf, they have more than 6,500 CPUs and in excess of 90 terabytes of memory. Research Computing directed users with serial, single-node or data-intensive jobs to Longleaf, and users with multi-node, distributed computing workloads to Dogwood.
“The biggest win is that those workflows moved to modern CPUs and have access to more than five times the CPU resource on Dogwood compared to Killdevil,” said Liam Greenwood, IT Manager with Research Computing.
Prepped for more than a year
Research Computing staff members prepared for the decommission for more than a year. They communicated with researchers and supported them in their move to Longleaf or Dogwood. Staff began by identifying the research groups that were most constrained on Killdevil and moved them first, to the system best suited to their needs.
Research Computing has some chemistry and astrophysics workflows running on Dogwood, for instance, that would have been all but impossible to run on Killdevil.
The biggest challenge with the decommission, Greenwood said, was getting “researchers’ workflows up and running under the new scheduler on the new clusters.”
The move, though, had little impact on users. “Once they migrated their workflow, life was pretty much the same,” Greenwood said. “The researchers will take very little time to utilize the extra resource, get used to it and want more.”
In the past, Research Computing’s clusters focused on massively parallel workloads. With this new generation, it provides separate clusters for massively parallel workloads and for serial, or high-throughput, workloads.
The Dogwood cluster is optimized for programs that can use hundreds or thousands of CPUs, more than Research Computing can make available on a single compute node.
To this end, Dogwood has a low-latency, high-throughput InfiniBand interconnect between the compute nodes, and the software and hardware are tuned for Message Passing Interface (MPI) jobs that span multiple compute nodes. The Longleaf cluster is for more data-intensive workloads that do not span multiple nodes. Each program on Longleaf runs on a single compute node and does not use CPUs on other nodes.
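The split between the two clusters can be summed up as a simple rule of thumb. The Python sketch below is only an illustration of that rule as described here, not a tool Research Computing provides; the Job class, its fields and the choose_cluster function are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical description of a workload (not a real scheduler object)."""
    name: str
    nodes: int  # number of compute nodes a single program needs to span


def choose_cluster(job: Job) -> str:
    """Return the cluster suggested by the rule of thumb in the article.

    Assumption: the only deciding factor is whether one program spans
    more than one compute node. Real placement decisions may weigh more.
    """
    if job.nodes < 1:
        raise ValueError("a job needs at least one compute node")
    if job.nodes > 1:
        # Multi-node, distributed (for example MPI) workloads go to Dogwood.
        return "Dogwood"
    # Serial, single-node or data-intensive workloads go to Longleaf.
    return "Longleaf"


if __name__ == "__main__":
    for job in (Job("serial-analysis", 1), Job("mpi-simulation", 16)):
        print(f"{job.name}: {choose_cluster(job)}")
```

A program that fits on one compute node lands on Longleaf under this rule, while one that must spread across many nodes lands on Dogwood.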
Research Computing’s previous cluster decommission was Kure, in March 2017, the year before Killdevil was retired.
Moving away from total decommissions
Going forward, complete retirement of clusters will no longer be the norm. One side effect of MPI, Greenwood said, is that homogeneous nodes are needed to enable a single program to run across multiple compute nodes as if they were a single computer. This requirement drives the big-bang replacement of MPI clusters. Research Computing expects that the Dogwood cluster will have at least a five-year lifespan.
Because individual programs do not span multiple compute nodes, the Longleaf cluster does not have this constraint and can be made up of a heterogeneous set of compute nodes. This enables Research Computing to grow the cluster and manage the hardware lifecycle at the node level. As a result, “we shouldn’t need to do an entire cluster replacement,” Greenwood said.