UNC-Chapel Hill recognizes that computational resources are an important part of research endeavors, and that research varies with respect to its data and processing demands and with respect to the need to compute, modify theories/codes, re-compute, etc. UNC-Chapel Hill is also committed to providing a base computational resource to help build research programs, to extend the value of extramural contracts/grants/awards, and to help sustain programs. The university acknowledges, too, that some projects may take weeks to realize, while others may take decades. Research problems are not one-sized; therefore, computational demands are not one-sized.
The Research Computing division of UNC-Chapel Hill provides expert scientific and information technology consultants and cyberinfrastructure to the scholarly and research community of the university. The consultation staff includes nine scientists and scholars who have experience across a wide range of disciplinary communities, from the physical sciences to the life sciences, from the computational sciences to clinical research, and from the social/behavioral sciences to the humanities. Cyberinfrastructure includes two large computational clusters. One cluster, Dogwood, is designed specifically for high-performance computing needs, with more than 11000 cores (8052 conventional cores at 2.4GHz on nodes with 512-GB memory each, 2000 Skylake cores, and 1440 Knights Landing cores), a parallel scratch filesystem, and a low-latency interconnect fabric (Infiniband EDR). The second cluster, Longleaf, is designed specifically for high-throughput and data-intensive processing needs: it contains more than 6000 cores (each node has a minimum of 256-GB memory), including five (5) large 3-TB memory nodes, thirty (30) Skylake nodes each with 750-GB memory, and nodes for “Big Data” workloads, all accessing 3-PB of shared high performance storage.
With respect to GPUs, Longleaf includes both consumer-grade and enterprise-grade cards. For consumer-grade GPUs, Longleaf includes five (5) nodes each with 8 Nvidia GeForce GTX1080 cards, comprising over 100,000 CUDA cores. For enterprise-grade GPUs, Longleaf includes sixteen (16) nodes each with four (4) Nvidia Volta GPUs with NVLink, totaling 480 double precision TFLOP/s, 960 single precision TFLOP/s, or 7680 Tensor TFLOP/s. In their own special purpose cluster, there are three Nvidia DGX-1 boxes (each with 8 Voltas with NVLink) and a DGX workstation (with 4 Voltas with NVLink), adding 210 double precision TFLOP/s, 420 single precision TFLOP/s, or 3360 Tensor TFLOP/s.
For permanent storage, Research Computing offers more than 5-PB of cluster storage mounted via NFS and 11-PB of active archive.
For smaller scale needs, Research Computing provides two virtual desktop solutions: (i) VCL, a self-service private cloud for virtual scientific workstations; and (ii) SRW, a Secure Research Workspace enclave for computing on sensitive and regulated data per NIST-800 controls (with secure file transfer solutions). Cyberinfrastructure administration (i.e., nine systems administrators) and consultation are available at no cost to researchers. With respect to cyberinfrastructure, Research Computing provides an institutional allocation for each element and incremental charges for resources above that allocation. The division’s aim is to ensure that research efforts have a stable, consistent, available, and expert resource for all phases of the research lifecycle.
I. Consultation and Engagement
Research Computing includes an “Engagement Team” of experienced scientists who are also adept with various computational, information-processing, and data management techniques.
The Engagement Team is loosely organized by disciplinary families:
- Physical, Information, Mathematical, Computer Science
- Life and Environmental Science
- Health Outcomes and Clinical Research
- Economics, Social and Behavioral Science, Business
If a project does not fit one of the above families easily, we assign an engagement member as appropriate. Engagement team members perform three general functions: (i) user/group onboarding, (ii) disciplinary/project outreach, (iii) advanced consultations. The Engagement Team also conducts select short course training.
Contributions by engagement team members range from co-investigation and article co-authorship to assisting lab teams with job submission scripts, to collaborating on scientific workshops.
II. Institutional Research CyberInfrastructure
These resources, and their lifecycles, are institutionally supported and provided.
A. “Cluster-scale” Computation and Information-Processing
High Performance Computation:
In August 2017, Research Computing brought into production a new high-performance compute cluster, Dogwood. Acquired from Lenovo, Dogwood is designed to routinely execute scalable computational jobs, e.g., 4224-way jobs and 1056-way non-blocking jobs. Dogwood’s initial implementation comprised 8052 cores on 183 nodes, each with 44 physical cores at 2.4GHz and 512GB RAM, on EDR Infiniband. In Summer 2018, Research Computing added 2000 Skylake cores and 1440 Knights Landing cores, bringing Dogwood’s total core count to more than 11000. Dogwood has a 250-TB high-performance GPFS scratch filesystem.
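The core counts and job sizes quoted for Dogwood can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not an official sizing tool: the constant and function names are illustrative, and the node counts per job assume one rank per physical core on the original 44-core nodes, which the text does not state explicitly.

```python
# Figures quoted in the Dogwood description.
NODES = 183              # nodes in the initial implementation
CORES_PER_NODE = 44      # physical cores per node
SKYLAKE_CORES = 2000     # added in Summer 2018
KNL_CORES = 1440         # Knights Landing cores added in Summer 2018

initial_cores = NODES * CORES_PER_NODE
total_cores = initial_cores + SKYLAKE_CORES + KNL_CORES


def nodes_for_job(ways: int, cores_per_node: int = CORES_PER_NODE) -> int:
    """Whole nodes needed for an N-way job.

    Assumption (not from the text): one rank per physical core.
    """
    return -(-ways // cores_per_node)  # ceiling division


print(initial_cores)         # 8052
print(total_cores)           # 11492
print(nodes_for_job(4224))   # 96
print(nodes_for_job(1056))   # 24
```

Under that assumption, a 4224-way job fills exactly 96 of the original nodes and a 1056-way job fills exactly 24.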
A permanent 5-PB high performance scale-out NFS storage cluster on Dell/EMC Isilon X-series is presented to all Dogwood and Longleaf (see below) nodes.
Research groups, programmes, investigators, and users in general whose typical workloads are MPI and/or OpenMP+MPI hybrid (or relevantly similar) will be provided access to, and resource allocations on, Dogwood.
High Throughput, data-intensive, regulated-data, and big-data computation:
Longleaf is a new cluster explicitly designed to address the computational, data-intensive, memory-intensive, and big data needs of researchers and research programmes that require scalable information-processing capabilities that are not of the MPI and/or OpenMP+MPI hybrid variety. Longleaf includes:
- 160 “General-Purpose Type-I” nodes (24 cores each; 256-GB RAM; 2x10Gbps NIC)
- 30 “General-Purpose Type-II” nodes (24 cores each; 768-GB RAM; 2x10Gbps NIC)
- 30 “Big-Data” nodes (12 cores each; 256-GB RAM; 2x10Gbps; 2x40Gbps)
- 5 large memory nodes (3-TB RAM each)
- 5 “GPU” nodes, each with 8 GeForce GTX1080 cards (102,400 CUDA cores in total)
Longleaf also includes zero-hop connections to a high-performance and high-throughput parallel filesystem (GPFS, a.k.a. “IBM SpectrumScale”) and a storage subsystem with 10 controllers, over 225-TB of high-performance SSD storage, and approximately 2-PB of high-performance SAS disk. All nodes include local SSD disks for a GPFS Local Read-Only Cache (“LRoC”) that keeps the most frequently requested metadata and data/files on the node itself, thus eliminating traversals of the network fabric and disk subsystem. Both General-Purpose and Big-Data nodes have 68-Gigabytes/second of memory bandwidth. General-Purpose nodes have 10.67GB of memory per core and 53.34-Megabytes/second of network bandwidth per core. Big-Data nodes have 21.34GB of memory per core and 213.34-Megabytes/second of network bandwidth per core. Longleaf uses the SLURM resource management and batch scheduling system. Longleaf’s total conventional compute core count is 6,496 cores, delivering 12,992 threads (hyperthreading is enabled). In Summer 2018, Research Computing added sixteen (16) nodes each with four (4) Nvidia Volta GPUs with NVLink.
In addition, a permanent 5-PB high performance scale-out NFS storage cluster on Dell/EMC Isilon X-series is presented to all Dogwood (see above) and Longleaf nodes.
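The per-core memory figures and the thread count quoted for Longleaf follow directly from the node specifications. The sketch below recomputes them; the dictionary layout and function name are illustrative rather than part of any Research Computing tooling. Network bandwidth per core is left out because the text does not say which links or unit conventions its figures assume.

```python
# Node inventory as quoted in the text:
# (node count, cores per node, RAM per node in GB).
NODE_TYPES = {
    "General-Purpose Type-I": (160, 24, 256),
    "General-Purpose Type-II": (30, 24, 768),
    "Big-Data": (30, 12, 256),
}

TOTAL_CORES = 6496        # total conventional cores quoted in the text
THREADS_PER_CORE = 2      # hyperthreading is enabled


def memory_per_core(cores: int, ram_gb: float) -> float:
    """RAM in GB available per core on a single node."""
    return ram_gb / cores


for name, (count, cores, ram_gb) in NODE_TYPES.items():
    print(f"{name}: {count * cores} cores, "
          f"{memory_per_core(cores, ram_gb):.2f} GB per core")

# Cores on the three node types whose core counts the text gives.
listed_cores = sum(count * cores for count, cores, _ in NODE_TYPES.values())
total_threads = TOTAL_CORES * THREADS_PER_CORE
print(listed_cores, total_threads)
```

The 256-GB, 24-core nodes come out at 10.67 GB per core and the Big-Data nodes at 21.33 GB per core, matching the quoted figures to within rounding. The three listed node types account for 4,920 cores; the rest of the 6,496 total sits on nodes whose core counts the text does not give.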
Longleaf includes two types of GPU nodes. It contains five (5) consumer-grade GPU nodes, each with 8 GeForce GTX1080 cards (102,400 CUDA cores in total). Longleaf also includes sixteen (16) nodes, each with four (4) Nvidia Volta GPUs with NVLink, totaling 480 double precision TFLOP/s, 960 single precision TFLOP/s, or 7680 Tensor TFLOP/s. Research Computing also provides a special purpose Nvidia DGX cluster: three Nvidia DGX-1 boxes (each with 8 Voltas with NVLink) and a DGX workstation (with 4 Voltas with NVLink), adding 210 double precision TFLOP/s, 420 single precision TFLOP/s, or 3360 Tensor TFLOP/s.
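The aggregate GPU figures can be reproduced from the node counts. In the sketch below, the per-Volta rates are not vendor specifications: they are inferred by dividing the quoted Longleaf totals by its 64 GPUs, and the same rates are then applied to the DGX systems. The 2,560 CUDA cores per GTX 1080 card is Nvidia's published figure, and the 8 cards per node comes from the overview. All names are illustrative.

```python
GTX1080_CUDA_CORES = 2560          # per card (Nvidia's published figure)
GTX_NODES, GTX_CARDS_PER_NODE = 5, 8

# Per-GPU rates in TFLOP/s, inferred from the quoted totals (480/64, 960/64,
# 7680/64). These are assumptions derived from the text, not a datasheet.
PER_VOLTA_TFLOPS = {"double": 7.5, "single": 15.0, "tensor": 120.0}


def aggregate_tflops(gpu_count: int) -> dict:
    """Total TFLOP/s per precision for a given number of Volta GPUs."""
    return {kind: gpu_count * rate for kind, rate in PER_VOLTA_TFLOPS.items()}


gtx_cuda_cores = GTX_NODES * GTX_CARDS_PER_NODE * GTX1080_CUDA_CORES
longleaf_voltas = 16 * 4           # sixteen nodes, four GPUs each
dgx_voltas = 3 * 8 + 4             # three DGX-1 boxes plus one DGX workstation

print(gtx_cuda_cores)                      # 102400
print(aggregate_tflops(longleaf_voltas))   # 480 / 960 / 7680
print(aggregate_tflops(dgx_voltas))        # 210 / 420 / 3360
```

That the same per-GPU rates reproduce both sets of totals suggests the Longleaf and DGX figures were computed with one convention.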
Research groups, programmes, investigators, and users in general whose typical workloads are best satisfied by Longleaf are provided access to, and resource allocations on, Longleaf.
B. Permanent storage systems and data management
For comparatively large capacity permanent storage, Research Computing presents a 5-PB high performance scale-out NFS storage cluster on Dell/EMC Isilon X-series. Researchers whose research requires it may receive a 5-TB institutional allocation upon request. On a project-by-project basis, researchers may request additional storage space (usually not to exceed 25-TB of added space) for the duration of a time-delimited project (usually not to exceed 3 years), pending available capacity.
Network Attached Storage (NAS):
Researchers have access to Netapp filer storage providing predominantly NFS (and also CIFS for specific use cases). High-performance storage is delivered via SATA disks; extreme-performance storage is delivered via SAS disks. All storage is configured with large controller caches and redundant hardware components to protect against single points of failure. This storage space is “snapshotted” in order to support file recovery in the event of accidental deletions. Faculty receive an institutional allocation of 10-GB per person; additional storage is available for incremental cost.
For active archive, Research Computing offers Quantum StorNext active archive with 600TB of disk cache and in excess of 11-PB of tape storage. Data are protected against media failure via two copies and are encrypted on tape. Faculty receive an institutional allocation of 2-TB per person; laboratories and project teams receive an institutional allocation of 10-TB per person. Additional capacity is available for incremental cost.
Research Computing supports Globus (http://www.globus.org) for secure data/file transfer amongst participating institutions. Globus is the preferred file transfer method.
To facilitate the deposition of files/data from external organizations into UNC-Chapel Hill, Research Computing offers a secure file-transfer-protocol service that allows files/data to be uploaded but prohibits downloading. This file transfer service meets additional IT-Security requirements for sensitive data.
Research Computing offers schemas on managed Oracle databases (delivered by Oracle Database Appliances) sufficient for many small to medium sized research projects. These include patching, general database administration, and transparent database/datafile encryption.
MySQL and PostgreSQL are available where there is an ongoing engagement project and the request fits within available resources and projects. These are provided on a case-by-case basis.
C. Secure Research Workspace
The Secure Research Workspace (SRW) contains computational and storage resources specifically designed for management of and interaction with high-risk data. The SRW is used for storage of and access to Electronic Health Records (EHR) and other highly sensitive or regulated data; it includes technical and administrative controls that satisfy applicable institutional policies. The SRW is specifically designed to be an enclave that minimizes the risk of storing and computing on regulated or sensitive data. It is designed to satisfy NIST-800-53.r4 controls at the “Moderate” level.
Technically, the SRW is an advanced implementation of a Virtual Desktop Infrastructure (VDI) system based on VMWare Horizon View, Cisco Unified Computing System, and Netapp Clustered Data ONTAP comprised of standard disk and flash arrays, with network segmentation and protection guaranteed by design, by adaptive Palo Alto enterprise firewalls, and by enterprise TippingPoint Intrusion Prevention System appliances. Access controls and permissions are managed via centrally administered systems and technologies appropriate to ensure security practices and procedures are correctly and consistently applied.
ITS-Research Computing consults with the investigator or research group to arrive at a reasonable initial configuration suitable for their respective project(s).
The default software installed is:
- ActivePerl
- Adobe Reader
- ArcGIS Workflow Manager
- ERD Concepts 6
- Google Chrome
- Internet Explorer
- Java Runtime
- Java Development Kit
- Microsoft Accessories Bundle
- Microsoft Sharepoint Workspace
- Microsoft Silverlight
- Microsoft-SQL Server 2008
- Notepad++
- Oracle Client
- Stata 13
In addition, Data Leakage Prevention software is available for install on systems that enable data ingress and egress but require detailed access and transfer logging, or that require additional server-level controls. Two-step (or “two-factor”) authentication is also available as required or requested.
D. Virtual Computing Lab
Research Computing provides a self-service private cloud virtualization service called “Virtual Computing Lab” (VCL) to UNC-Chapel Hill researchers at http://vcl.unc.edu. Originally developed by NC State University in collaboration with IBM, VCL (see http://vcl.apache.org) provides researchers with anytime, anywhere access to custom application environments created specifically for their use.
With only a web-browser, users can make a reservation for an application, either in advance or immediately, and the VCL will provision that application on a centrally maintained server, and provide the user with remote access to that server.
VCL provides users remote access to hardware and software that they would otherwise have to install themselves on their own systems, or visit a computer lab to use. It also reduces the burden on computer labs to maintain large numbers of applications on individual lab computers, where in many cases it’s difficult for some applications to coexist on the same machine. In the VCL, operating system images with the desired applications and custom configurations are stored in an image library, and deployed to a server on-demand when a user requests it.
E. Select Commercial Scientific Software
Research Computing licenses commercial software to support the research community at UNC-Chapel Hill. Notable software includes:
- Biovia (DiscoveryStudio, MaterialsStudio); formerly “Accelrys”
- Cambridge Crystallographic
- Globus Connect
- Harris Geospatial Solutions (ENVI+IDL); formerly “Exelis”
- Intel Compilers
- KEGG Database
- nQuery (Statistical Solutions)
- Portland Group (Fortran/C/C++)
- RogueWave (TotalView and IMSL)
- Scientific Computing Modeling (ADF and BAND Modeling Suite)
- StataCorp (Stata/SE)
- Certara (SYBYL)
- Wolfram (Mathematica)
The above list is not exhaustive.
III. Short Courses
Research Computing offers short courses during the Summer, Fall and Spring terms. Courses are:
- Linux: Intermediate
- Linux: Introduction
- Matlab: Intermediate
- Matlab: Introduction
- Python for Scientific Computing
- Python Workshop
- Scientific Computing: Gaussian and GaussView
- Scientific Computing: Introduction to Computational Chemistry
- Shell Scripting
- TarHeel Linux
- Using Research Computing Clusters
- Web Scraping
IV. Costs to Researchers
Research Computing’s suite of services and support has broadened significantly since the most recent charging structure was established (in 2013). The cost structure is due for review and refactoring.
Research Computing is actively re-orienting our cyberinfrastructure to the needs of the research programs at UNC-Chapel Hill. Coarse and piecemeal approaches like “core hours” charges and/or “storage” charges assume a mostly homogeneous workload, but the resource demands of different disciplines are sufficiently varied that these approaches unintentionally disadvantage some domains of inquiry. Worse still, they make us least likely to observe the capacity demands that mixed workloads and emerging workloads present, and thus reduce our ability to respond to the needs of the research community. Additional kinds of cyberinfrastructure exhibit similar complexity.
Given the newness of our approach, and the fact that we are re-orienting overtly and intentionally to respond to a broader suite of demands from a broader array of research pursuits, we do not yet know which dimension (or dimensions) of resource will drive cost. Nor do we know precisely how many materially different “workload profiles” we will observe. In short, we need to see what happens, measure what happens, and perform some analysis, in order to know enough to frame a costing structure (or: to know enough to justify it).
Our approach is to apply technical resource limits that will (i) help us to measure the relevant dimensions of resource demands, (ii) facilitate incremental adjustment as our observations of the actual job streams suggest, and of course (iii) protect against over-consuming users or runaway tasks/workload/resource-use.
With this in view, we encourage investigators to consult with us on project proposals so we can bring to bear the full suite of Research Computing capabilities, services, and cyberinfrastructure initiatives/projects.