Research Computing has wrapped up four years of work on a nationwide cancer genome research effort, during which the Information Technology Services unit reached its highest-ever storage usage for a single project.
From early June 2011 until January of this year, Research Computing provided production bioinformatics expertise and computational infrastructure, enabling the fundamental translational research mission of UNC-Chapel Hill’s Lineberger Comprehensive Cancer Center. Lineberger was one of more than a dozen centers in the United States and Canada that were funded for five years to develop a deeper and more systematic understanding of the mechanisms responsible for cancer and to improve the ability to diagnose, treat and prevent it.
Investment into The Cancer Genome Atlas (TCGA) has totaled $275 million and brought $20 million to the University. The landmark research project is supported by the National Cancer Institute and the National Human Genome Research Institute at the National Institutes of Health. Researchers are identifying the genomic changes in more than 20 types of cancer.
UNC-Chapel Hill completed its part in the project in January after processing more than 70,000 samples extracted from 10,000 tumors, said Jeff Roach, Senior Scientific Research Associate at Research Computing.
The Lineberger Comprehensive Cancer Center and the Lineberger Bioinformatics Group led the effort, and Research Computing, which got involved a year into the project, was just one of several campus units that took part. Nor was the work limited to the medical school. Researchers in both Computer Science and Statistics in the College of Arts and Sciences, along with another ITS unit, Networking, made substantial contributions to the overall success.
Kure processed more than 80 percent of information
Lineberger labs prepared the samples that were sequenced at the UNC-Chapel Hill High Throughput Sequencing Facility. The initial results of these sequencing experiments were deposited in Research Computing’s tape archive and transferred by the Center for Bioinformatics to Kure, where the production bioinformatics pipelines were run. Kure processed a little more than 80 percent of all that information, while Lineberger’s own computing environment handled the rest. Some 40 to 50 percent of the data was kept on Research Computing’s Isilon storage space. The results of these pipelines were finally passed back to the Lineberger Bioinformatics Group for analysis, deposition in national repositories and ultimately publication.
“The goal is to eventually affect patient treatment in the long run,” Roach said. Research Computing’s role seems very technical, but clinicians are eager to obtain and use this information. In fact, both research results and production pipeline developments are beginning to be applied in pilot clinical sequencing efforts here at UNC-Chapel Hill.
For Research Computing, the effort represented the largest amount of storage ever needed for a single project.
Project used a quarter of Kure’s capacity
Over the four years, Lineberger purchased about 775 terabytes of disk space on the ITS research cluster Kure. This represents about one quarter of Kure’s capacity. At the high-water point, 1.2 petabytes of cancer sequencing data were stored here. For context, all other projects combined total less than 2 petabytes.
“This is by far the biggest amount of data we’ve ever dealt with,” Roach said. “Before the high-throughput sequencing, we never had anything this large. In fact, at one point in the late summer of 2011 approximately 70 percent of the campus network traffic was coming in and out of our storage.”
Previously, a project requiring tens of terabytes would have been considered large. Now that’s a medium-sized sequencing effort, he said. The Cancer Genome Atlas work was also particularly challenging because of the rapid development in sequencing technology. In one year, Roach noted, the amount of data derived from a physical sample doubled. This technology change created a sudden large-scale computing and storage need, not just for the large projects, but for the smaller labs as well. Small labs have found themselves without the capacity they need as high-throughput sequencing becomes increasingly important to them. These labs benefit immensely from the infrastructure enhancements made by ITS Networking and Research Computing to support Lineberger’s Cancer Genome Atlas effort. In particular, Research Computing now distributes the disk space allotted for cancer sequencing to other sequencing and non-sequencing efforts on campus, such as the Department of Marine Sciences’ biodiversity studies of North Carolina rivers.
Computation technology is changing rapidly
This cancer genome research was a large, high-impact project for the University and for Research Computing. The project also “changed the way we do things at Research Computing,” Roach said. “We’re at the confluence of a number of trends that are forcing us to look at a different and wider class of problems.”
The demand for large-scale computing for life sciences is increasing as research grant funding for physics and chemistry decreases, he said. That is not to say that the computational needs in physics and chemistry are going away, just that the number of demands placed on a computational infrastructure has increased as the life sciences and social sciences develop into more quantitative practices.
Furthermore, the rapid changes in computational technology are changing the types of problems that are of interest and how solutions are approached. As computer memory, speed and bandwidth increase tremendously, computational time has become cheap compared to analyst time. That has resulted in a shift in how researchers approach complex problems. Solutions composed of a relatively small number of complex, heavily connected and closely related pieces are disappearing in favor of a more granular approach that decomposes large problems into an enormous number of small, loosely coupled tasks.
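For readers curious what that granular approach can look like in practice, here is a minimal Python sketch. It is an illustration only, not the pipeline used on Kure: the function names, the toy reads and the choice of GC content as the per-task calculation are all assumptions made for the example. Each task depends on nothing but its own input, so the tasks can run in any order and on any number of workers.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_fraction(read: str) -> float:
    """Fraction of G and C bases in one read; uses only its own input."""
    if not read:
        return 0.0
    return sum(base in "GC" for base in read) / len(read)


def process_all(reads, max_workers=4):
    """Run one small, independent task per read; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(gc_fraction, reads))


if __name__ == "__main__":
    # Hypothetical toy reads for illustration, not project data.
    reads = ["GATTACA", "GGCC", "ATAT", "GCGCAT"]
    for read, fraction in zip(reads, process_all(reads)):
        print(read, round(fraction, 3))
```

A thread pool keeps the sketch short. On a research cluster the same pattern would more likely be expressed as many independent jobs handed to a batch scheduler, but the principle of small tasks with no shared state is the same.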