From DNA To Big Data
Broad scientists generate 24 terabytes of data every day. That’s the equivalent of 7.4 million photos, 4.8 million pop songs, or 12,000 hours of movies.
How to handle that data is a major hurdle in genomics. How to extract meaning from that data—and turn it into something a physician can use in the clinic—is an even bigger one.
Unveiled in November 2016, the Intel-Broad Center for Genomic Data Engineering takes aim at this challenge by streamlining, speeding up, and disseminating leading-edge genome analysis methods and tools.
The seeds of this collaboration were planted three years earlier. At the time, the Broad’s Genome Analysis Toolkit (GATK) was already a world-leading software suite. But the increasing ubiquity of low-cost cloud computing meant that the needs of researchers—and the means through which software developers could meet them—were evolving. To optimize GATK for distributed cloud computing, the Broad’s Data Sciences Platform (DSP) teamed up with Intel. Together, they made sure GATK ran well on major cloud platforms like Google’s and Amazon’s. Intel also collaborated with DSP to create GenomicsDB, a nimble and powerful way to store vast amounts of patient variant data and process it quickly—an improvement that makes large-scale variant analysis dramatically more efficient.
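The core idea behind GenomicsDB is to treat variant calls as a sparse two-dimensional array indexed by sample and genomic position, so that queries over a genomic region touch only the data that actually varies. The toy sketch below illustrates that sparse-storage idea in plain Python; the class, method names, and sample IDs are hypothetical stand-ins, not the real GenomicsDB API.

```python
# Illustrative sketch only: a toy sparse "variant store" showing the idea
# behind GenomicsDB (variants as a sparse 2-D array indexed by genomic
# position and sample). All names here are hypothetical, not the real API.
from collections import defaultdict


class ToyVariantStore:
    def __init__(self):
        # position -> {sample: genotype}; sparse, because most samples
        # match the reference at most positions
        self._by_position = defaultdict(dict)

    def add(self, sample, position, genotype):
        self._by_position[position][sample] = genotype

    def query_region(self, start, end):
        """Return all variants with start <= position < end."""
        return {pos: calls
                for pos, calls in self._by_position.items()
                if start <= pos < end}


store = ToyVariantStore()
store.add("NA12878", 10_042, "A/G")
store.add("NA12891", 10_042, "A/A")
store.add("NA12878", 55_100, "C/T")

print(store.query_region(10_000, 11_000))
# {10042: {'NA12878': 'A/G', 'NA12891': 'A/A'}}
```

A region query returns only the positions where someone varies, which is what keeps joint analysis over thousands of samples tractable.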
The partnership between the Broad and a Fortune 50 technology company might seem odd at first. But because of the sheer size and scale of its data, genomics has long outgrown more traditional approaches to scientific computing.
Indeed, Intel’s key insight was that the Broad could process academic biomedical genomics data by deploying techniques similar to those Amazon, Google, and Facebook use to target ads on the internet. Think about it: when a researcher uses GATK to analyze her data, she is generating an individualized readout from a massive global database of genome or exome information. Genomics software has to be dynamic enough to yield these rapid readouts while continuing to add new points of reference based on what it learns from researchers. In this way, the technology behind GATK is analogous to the technology behind the personalized ad on, say, your Facebook page—an individualized output algorithmically generated (and unendingly updated) from a massive database of global users.
This approach—enabling the wide and rapid distribution of genomic data—has already proven fruitful for the scientific community. For example, when researchers began using GenomicsDB, the time to perform variant discovery on 100 genomes shrank from eight days to 18 hours. That speed and processing power helped the Broad’s Daniel MacArthur assemble the world’s largest collection of human exome data, the Exome Aggregation Consortium (ExAC), which houses data for more than 60,000 exomes—and has bloomed into a resource used by scientists worldwide.
At Intel’s suggestion, GATK was rebuilt on Apache Spark, an open-source cluster-computing framework that global web leaders like eBay and Netflix use to rapidly process thousands of terabytes flowing to and from users in thousands of locations. Spark gives GATK that same level of processing power. Tellingly, GATK’s newfound speed and scale have already become points of pride for the Spark community. At the 2017 Spark Summit, the keynote speaker, Spark creator Matei Zaharia, told a crowd of 3,000 techies that GATK is one of the three projects he is most excited about. The other two he cited were projects at Facebook and Riot Games—placing the Broad in the ranks of the high-tech titans.
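What Spark contributes is data parallelism: the genome is split into independent intervals, each interval is processed on a separate worker, and the per-interval results are merged back in genomic order. The sketch below shows that scatter-gather pattern using only the Python standard library as a simplified stand-in for Spark itself; the interval-splitting logic and the `call_variants` function are hypothetical illustrations, not GATK code.

```python
# Simplified scatter-gather sketch of Spark-style distributed processing,
# using only the Python standard library. A thread pool stands in for a
# Spark cluster; call_variants is a hypothetical stand-in for real work.
from concurrent.futures import ThreadPoolExecutor


def split_into_intervals(chrom_length, n_parts):
    """Scatter: divide a chromosome into roughly equal intervals."""
    step = -(-chrom_length // n_parts)  # ceiling division
    return [(start, min(start + step, chrom_length))
            for start in range(0, chrom_length, step)]


def call_variants(interval):
    """Stand-in for per-interval work (e.g., calling variants)."""
    start, end = interval
    return f"variants for [{start}, {end})"


def run_pipeline(chrom_length, n_parts):
    intervals = split_into_intervals(chrom_length, n_parts)
    with ThreadPoolExecutor() as pool:
        # Gather: pool.map preserves input order, so results come back
        # in genomic order regardless of which worker finished first
        return list(pool.map(call_variants, intervals))


results = run_pipeline(1_000, 4)
print(results)
```

Because each interval is independent, doubling the number of workers roughly halves the wall-clock time—the same property that let GenomicsDB-backed pipelines cut 100-genome variant discovery from days to hours.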