During fiscal year 2016-2017, ITS Research Computing procured and installed the new Dogwood computer cluster, decommissioned the Kure cluster and moved much work onto the Longleaf cluster. For details about these accomplishments and many more, please keep reading.
Research Computing procured and installed Dogwood cluster
Research Computing procured a new high-performance compute cluster, Dogwood. Acquired from Lenovo, Dogwood initially comprises 8,052 cores on 183 nodes, each with 44 physical 2.4GHz cores and 512GB of RAM, connected by a low-latency InfiniBand EDR fabric.
Dogwood is designed to routinely execute computational jobs that would not have been practical, or even possible, to run on Research Computing’s Kill Devil cluster: e.g., 4,224-way jobs and 1,056-way non-blocking jobs.
Research Computing plans to add more than 2,000 additional cores. Among the many things Research Computing does, high-performance computing is its bread and butter.
Research Computing decommissioned the Kure cluster
Research Computing decommissioned the Kure cluster in March 2017.
Kure’s lifetime coincided with the explosion of data produced by next-generation sequencing technologies. Its initial 500 TB Isilon file system was expanded to nearly 4 PB. Kure processed 82 percent of the samples processed locally for The Cancer Genome Atlas (TCGA) effort: 26 percent of the 13,665 samples run under the V1 pipeline and, as the main production system for the V2 pipeline, 98 percent of the remaining 53,870 samples.
TCGA (https://cancergenome.nih.gov/) is a multi-university, international effort with the stated mission: “to accelerate our understanding of the molecular basis of cancer.” At UNC-Chapel Hill, the TCGA effort was a six-year endeavor that exceeded $20 million. It continues under the Leidos TCGA3 project.
Leidos TCGA3 Pipeline work moves to Longleaf/Pine
In the wake of Kure’s decommissioning, much of the type of workload that Kure served so ably is now handled on Longleaf/Pine.
A successor to TCGA, the Leidos TCGA3 Pipeline is a large-scale sequencing project associated with the Lineberger Bioinformatics Group and the High Throughput Sequencing Facility. About 10,000 samples are expected over several years. The pipeline set up on Longleaf can process more than 500 samples per week.
This work is a sub-contract with Leidos Corp. and covers sequencing only; a quality control pipeline is nevertheless necessary to guarantee the quality of the sequencing results. More than 500 samples were successfully processed this year.
Improving the performance of Amber with GPUs on Longleaf
Processing with Graphics Processing Units (GPUs) continues to grow. Longleaf included five GPU nodes, each with eight Nvidia 1080X GPUs, and Research Computing plans to add more. Here’s an example of why: Amber is a popular molecular dynamics suite, and its performance was measured with Amber’s standard benchmark job, dihydrofolate reductase (DHFR) in water with 23,558 atoms, run in the GPU partition of the Longleaf cluster. Using one GPU plus one CPU core yields a 125-fold speedup over a single CPU core.
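Benchmark jobs like these are submitted through the cluster’s scheduler. As a minimal sketch, assuming Longleaf’s SLURM scheduler and a GPU partition (the partition, module and input file names here are illustrative assumptions, not the actual site configuration):

```shell
#!/bin/bash
#SBATCH --job-name=amber-dhfr    # DHFR benchmark in water, 23,558 atoms
#SBATCH --partition=gpu          # hypothetical GPU partition name
#SBATCH --ntasks=1               # one CPU core ...
#SBATCH --gres=gpu:1             # ... plus one GPU
#SBATCH --time=02:00:00
#SBATCH --mem=8g

module load amber                # module name is an assumption

# pmemd.cuda is Amber's GPU-accelerated molecular dynamics engine
pmemd.cuda -O -i mdin -p prmtop -c inpcrd -o mdout
```

Submitted with `sbatch`, this runs the same single-GPU/single-core configuration whose 125-fold speedup is quoted above.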
Discovering a little more about how stars form
Fabian Heitsch, Associate Professor in the UNC-Chapel Hill Department of Physics and Astronomy, and his team want to know how gas behaves during star formation. The problem is that these simulations must run at low resolutions to be computationally feasible. Low resolutions make it seem as though gases mix less as time goes on, which is counterintuitive but is nevertheless the result of typical low-resolution runs.
Running a higher-resolution job (using 2,048 processor cores) on Kill Devil, the Heitsch team demonstrated that this counterintuitive conclusion is due to averaging errors at low resolution. The higher-resolution simulation gives a better picture of what is actually happening (within the constraints of the simulation parameters). Of course, running the higher-resolution job required more computational resources: about 20 percent of Kill Devil’s cores (effectively 30 percent, given the RAM requirement) running flat out for about 11 days. See https://doi.org/10.1093/mnras/stx720, especially Figure 11.
Provided advanced consulting and engagement for microbiome analysis
Irinotecan-related microbiome analysis
Irinotecan is a drug commonly used to treat colon and other cancers, but debilitating diarrhea as a side effect limits the duration of treatment in most patients. Microbiome analysis identified particular genera of bacteria whose growth is limited by irinotecan; the resulting bloom of gut proteobacteria causes the diarrhea side effect.
Research Computing analyzed the pilot study results. The follow-up studies, which Research Computing also analyzed, confirmed the pilot study’s results and suggest approaches to improved cancer patient care. Mice given additional treatments to control proteobacteria growth were able to maintain the irinotecan regimen longer, resulting in reduced tumor size.
Research Computing worked in collaboration with UNC-Chapel Hill Chemistry (A. BHATT, M. REDINBO).
Statistical techniques of Genome-Wide Association Studies (GWAS) applied to microbiome analysis
Fanconi anemia is a rare genetic blood disease that ultimately leads to bone marrow failure. The analysis identified the genus Neisseria as associated with increased neutrophil count in a Fanconi anemia data set from the UNC-Chapel Hill School of Dentistry (F. TELES). Low neutrophil count, or neutropenia, is an indication of the anemia disease state.
The statistical method used was developed in collaboration with UNC-Chapel Hill Department of Genetics (W. VALDAR).
Mouse RNAseq and microbiome analysis
In order to study the relationship between interleukin 10 (IL10) and inflammatory bowel diseases such as ulcerative colitis and irritable bowel disease, competent mast cells from wild type mice were introduced into IL10-receptor knock-out transgenic mice.
Systematic variations in both RNA expression and the gut microbiome were analyzed. Interestingly, male and female mice showed completely opposite responses in the expression of certain genes and in the abundance of certain gut microbial taxa.
At this point, it remains unclear whether the RNA expression variation is a response to the differential in microbial abundance or vice versa; however, results indicate systematic differences in response of female and male mice to mast cell treatment.
Research Computing worked in collaboration with the University of Tennessee – Knoxville (L. LENNON), formerly of North Carolina State University.
Supported the Microbiome Core Facility on Longleaf/Pine
Supporting the production Bioinformatics Microbiome Core Facility, Longleaf successfully processed 78 Illumina flowcells and 6 Ion Torrent runs this year.
Extra-mural customers included Ritter Pharmaceutical, the Environmental Protection Agency’s Green Housing Study and the National Institute of Environmental Health Sciences (NIEHS). Intra-mural customers included the Department of Genetics, Marine Sciences, the Center for Gastrointestinal Biology and Disease (CGIBD) and the School of Dentistry.
Assumed provisioning and management of the infrastructure of ibiblio
As part of a new partnership between ITS and the School of Information and Library Science, Research Computing took over provisioning and management of the infrastructure of ibiblio.org.
Research Computing took on this work with ibiblio, a digital archive, in May 2017. People all over the world take advantage of ibiblio’s hosting and sharing services. It is one of ITS’ largest network customers, reaching millions of users every day.
Ibiblio was founded at UNC-Chapel Hill in September 2000 as one of the world’s first online libraries and a way to share and support all kinds of free software.
Paul Jones, the director of ibiblio.org and a clinical professor in the School of Information and Library Science, has been instrumental in the formation and evolution of the service. Ibiblio’s new, closer relationship with Research Computing, Jones said, brings “access to new minds trying to solve new problems in new ways by working with real professionals who really know what they’re doing and are willing to take risk and accept failure as part of the learning.”
Research Computing is excited to engage more with information sciences, said Michael Barker, Assistant Vice Chancellor for Research Computing.
Preparing to retire Kill Devil
With Dogwood coming into service, Research Computing will soon need to retire Kill Devil. Research Computing’s approach will be similar to the method it used for moving customers from Kure onto Longleaf or Kill Devil, as appropriate. With Longleaf, Research Computing identified the research groups that were most constrained on Kure (or on Kill Devil) and moved them to the best-fitting system, Longleaf, first. Research Computing continued in this vein iteratively, with consultation, engagement and technology orientation opportunities throughout.
Similarly, Research Computing’s approach with Kill Devil is to identify the research groups whose workloads have been most constrained on Kill Devil and to move them to Dogwood. Research Computing already has some chemistry and astrophysics running on Dogwood that would have been all but impossible to run on Kill Devil.
Once a substantial segment of those research groups has successfully moved, Research Computing will open online sign-ups while keeping a more substantive engagement process, and will set a date to turn off Kill Devil, probably in the Spring 2018 term.
Added capabilities through the use of Apache cTAKES
The Cardiovascular Epidemiology Program of the Department of Epidemiology at the Gillings School of Public Health has been using Research Computing’s Secure Research Workspace (SRW) for the past three years. The Cardiovascular Epidemiology Research System (CERES), one of the environments on SRW, enables 83 users to securely access 5.38 TB of highly confidential data that contain multiple personal health identifiers and to use sophisticated analytical programming tools stored within the SRW.
Analyses include routine regression models, joint trajectory modeling, multiple imputation of outcomes and exposures, spatio-temporal assessment of environmental pollutant and weather data, and epigenome-wide association studies. Most recently, Research Computing has added capabilities for natural language processing of information stored in electronic health records through the use of Apache cTAKES (clinical Text Analysis and Knowledge Extraction System).
With the needed data security, users have virtual Windows desktops within a cluster of designated servers. They are able to use software such as SAS, R and the standard MS Windows suite. With the use of ACLs (Access Control Lists), access can be fine-tuned by node and file system, and even down to individual files. Control of data access is handled within the Department of Epidemiology’s Cardiovascular Disease (CVD) program.
Secure Research Workspace enables commercial real estate research
Jacob Sagi, a Kenan-Flagler Business School professor, uses the Secure Research Workspace to aggregate commercial real estate data from disparate sources into something unified and cohesive.
Given the sensitive nature of the data, the data providers insisted on a secure environment, which is where the Secure Research Workspace comes in. The environment also needed to be user-friendly, with access to a wide variety of software, including Matlab, Stata and ArcGIS, all of which SRW was well suited to accommodate.
The project is a multi-institutional collaboration involving people from outside of UNC-Chapel Hill, all of whom have needed access to and have used this platform.
Supported an economist’s transition to cluster-scale computation
Economics Associate Professor Jon Williams’ research and that of his PhD students focuses mainly on numerical solutions to complex models of economic decision-making. Williams describes his group’s activities:
“Research Computing staff at UNC, in particular Sandeep Sarangi, has been a tremendous asset and helped make the last few years the most productive of my career. My research and that of my PhD students focuses mainly on numerical solutions to complex models of economic decision-making.
Over the past year, Sandeep has helped us in so many ways. This began with migrating our work to Research Computing resources, including more than 20 TBs of data and providing access to the new Longleaf cluster for our computational work. This was a tremendous improvement in our everyday work environment by offering an integrated data and computational environment that could readily provide access to many TBs of storage and 1,000s of CPUs.
In addition, Sandeep provided assistance in adapting specialized optimization software (NOMAD) to a Linux environment and our specific problem, and is assisting in translating some of our code to Julia, a new language that we expect to yield substantial improvements. Beyond simply assisting me, he has also offered the same to multiple Ph.D. students of mine. This has included working on projects related to net neutrality in the telecommunications industry and solving complex dynamic programming problems related to optimal pricing in the airline industry.
Further, Sandeep offers a seminar to our entire Ph.D. student body to help them take advantage of all that UNC Research Computing has to offer. I expect it to be tremendously helpful as these students pursue their dissertation research. UNC seems to really understand that research computing goes beyond simply hardware resources, as it is the people that help researchers make the most of those resources.”
Who “does science”? Jeffrey Roach, Ph.D.
The Research Computing engagement team works with researchers in a wide array of disciplines across the entire institution. Some of that work is sheer technological onboarding. Much is knee-deep in the science. Although every member of the Research Computing engagement team works with faculty, research staff and students at high levels, this year it is worth shining a spotlight on Jeffrey Roach’s publications to get a flavor of these contributions.
A A Seyerle, …, J Roach, …, C L Avery: Pharmacogenomics study of thiazide diuretics and QT interval in multi-ethnic populations: the cohorts for heart and aging research in genomic epidemiology. The Pharmacogenomics Journal 07/2017; DOI:10.1038/tpj.2017.10
Raymond Noordam, …, Jeffrey Roach, …, Eric A Whitsel: A genome-wide interaction analysis of tricyclic/tetracyclic antidepressants and RR and QT intervals: A pharmacogenomics study from the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) consortium. Journal of Medical Genetics 12/2016; DOI:10.1136/jmedgenet-2016-104112
James S Floyd, …, Jeffrey Roach, …, Bruno H Stricker: Large-scale pharmacogenomic study of sulfonylureas and the QT, JT and QRS intervals: CHARGE Pharmacogenomics Working Group. The Pharmacogenomics Journal 12/2016; DOI:10.1038/tpj.2016.90
A G Robertson, …, J Roach, …, S E Woodman: Integrative Analysis Identifies Four Molecular and Clinical Subsets in Uveal Melanoma. Cancer Cell 08/2017; DOI:10.1016/j.ccell.2017.07.003
Farshad Farshidfar, …, Jeffrey Roach, …, Erik Zmuda: Integrative Genomic Analysis of Cholangiocarcinoma Identifies Distinct IDH-Mutant Molecular Profiles. Cell Reports 06/2017; 19(13):2878-2880, DOI:10.1016/j.celrep.2017.06.008
Adrian Ally, …, Jeffrey Roach, …, Peter W. Laird: Comprehensive and Integrative Genomic Characterization of Hepatocellular Carcinoma. Cell 06/2017. DOI:10.1016/j.cell.2017.05.046
Andrew D. Cherniack, …, Jeffrey Roach, …, Erik Zmuda: Integrated Molecular Characterization of Uterine Carcinosarcoma. Cancer Cell 03/2017; 31(3):411-423, DOI:10.1016/j.ccell.2017.02.010
Lauren Fishbein, …, Jeffrey Roach, …, Erik Zmuda: Comprehensive Molecular Characterization of Pheochromocytoma and Paraganglioma. Cancer Cell 02/2017; 31(2), DOI:10.1016/j.ccell.2017.01.001
C.P. Furquim, …, J Roach, …, F.R.F. Teles: The Salivary Microbiome and Oral Cancer Risk: A Pilot Study in Fanconi Anemia. Journal of Dental Research 11/2016; 96(3), DOI:10.1177/0022034516678169
Jason W. Arnold, Jeffrey Roach, M. Andrea Azcarate-Peril: Emerging Technologies for Gut Microbiome Research. Trends in Microbiology 07/2016; 24(11), DOI:10.1016/j.tim.2016.06.008
By the Numbers
- Two new cluster systems (Longleaf and Dogwood)
- More than 5.8 million jobs on Longleaf
- More than 15.7 million processor hours on Longleaf
- More than 11 petabytes of research data storage