Today, a teaspoon of spit and a hundred bucks is all you need to get a snapshot of your DNA. But getting the full picture—all 3 billion base pairs of your genome—requires a much more laborious process. One that, even with the aid of sophisticated statistics, scientists still struggle over. It’s exactly the kind of problem that makes sense to outsource to artificial intelligence.
On Monday, Google released a tool called DeepVariant that uses deep learning—the machine learning technique that now dominates AI—to assemble full human genomes. Modeled loosely on the networks of neurons in the human brain, these massive mathematical models have learned how to do things like identify faces posted to your Facebook news feed, transcribe your inane requests to Siri, and even fight internet trolls. And now, engineers at Google Brain and Verily (Alphabet’s life sciences spin-off) have taught one to take raw sequencing data and line up the billions of As, Ts, Cs, and Gs that make you you.
And oh yeah, it’s more accurate than all the existing methods out there. Last year, DeepVariant took first prize in an FDA contest promoting improvements in genetic sequencing. The open source version the Google Brain/Verily team introduced to the world Monday reduced the error rates even further, by more than 50 percent. Looks like grandmaster Ke Jie isn’t the only one getting bested by Google’s AI neural networks this year.
DeepVariant arrives at a time when healthcare providers, pharma firms, and medical diagnostic manufacturers are all racing to capture as much genomic information as they can. To meet the need, Google rivals like IBM and Microsoft are all moving into the healthcare AI space, with speculation about whether Apple and Amazon will follow suit. While DeepVariant’s code comes at no cost, that isn’t true of the computing power required to run it. Scientists say that expense is going to prevent it from becoming the standard anytime soon, especially for large-scale projects.
But DeepVariant is just the front end of a much wider deployment; genomics is about to go deep learning. And once you go deep learning, you don’t go back.
It’s been nearly two decades since high-throughput sequencing escaped the labs and went commercial. Today, you can get your whole genome for just $1,000 (quite a steal compared to the $1.5 million it cost to sequence James Watson’s in 2008).
But today’s machines still produce only incomplete, patchy, and glitch-riddled genomes. Errors can get introduced at each step of the process, and that makes it difficult for scientists to distinguish the natural mutations that make you you from random artifacts, especially in repetitive sections of a genome.
See, most modern sequencing technologies work by taking a sample of your DNA, chopping it up into millions of short snippets, and then using fluorescently-tagged nucleotides to produce reads—the list of As, Ts, Cs, and Gs that correspond to each snippet. Then those millions of reads have to be grouped into abutting sequences and aligned with a reference genome.
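To make the chop-and-align idea concrete, here is a minimal toy sketch in Python. It is not how any real sequencer or aligner works: the function names `shred` and `align`, the 30-letter reference, and the brute-force mismatch count are all assumptions made for illustration, and real reads also carry errors, insertions, and deletions that this ignores.

```python
import random

def shred(genome, read_len, n_reads, seed=0):
    """Cut n_reads random snippets of length read_len out of a genome string."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(0, len(genome) - read_len + 1)
        reads.append(genome[start:start + read_len])
    return reads

def align(read, reference):
    """Slide the read along the reference (no gaps allowed) and return
    (offset, mismatches) for the first placement with the fewest mismatches."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    return best

# Hypothetical 30-letter "reference genome" and a sample that differs
# from it by a single letter at position 13 (A -> G).
reference = "ACGTTGCAAGCTTAGGCTAACGTCCGATAG"
sample = reference[:13] + "G" + reference[14:]

for read in shred(sample, read_len=8, n_reads=5):
    offset, mismatches = align(read, reference)
    print(read, "best fit at", offset, "with", mismatches, "mismatch(es)")
```

Even in this toy, a read that covers the changed letter lines up with one mismatch, and nothing in the read itself says whether that mismatch is a real mutation or a machine glitch.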
That’s the part that gives scientists so much trouble. Assembling those fragments into a usable approximation of the actual genome is still one of the biggest rate-limiting steps for genetics. A number of software programs exist to help put the jigsaw pieces together. FreeBayes, VarDict, Samtools, and the most widely used, GATK, depend on sophisticated statistical approaches to spot mutations and filter out errors. Each tool has strengths and weaknesses, and scientists often wind up having to use them in conjunction.
No one knows the limitations of the existing technology better than Mark DePristo and Ryan Poplin. They spent five years creating GATK from whole cloth. This was 2008: no tools, no bioinformatics formats, no standards. “We didn’t even know what we were trying to compute!” says DePristo. But they had a north star: an exciting paper that had just come out, written by a Silicon Valley celebrity named Jeff Dean. As one of Google’s earliest engineers, Dean had helped design and build the fundamental computing systems that underpin the tech titan’s vast online empire. DePristo and Poplin used some of those ideas to build GATK, which became the field’s gold standard.
But by 2013, the work had plateaued. “We tried almost every standard statistical approach under the sun, but we never found an effective way to move the needle,” says DePristo. “It was unclear after five years whether it was even possible to do better.” DePristo left to pursue a Google Ventures-backed start-up called SynapDx that was developing a blood test for autism. When that folded two years later, one of its board members, Andrew Conrad (of Google X, then Google Life Sciences, then Verily), convinced DePristo to join the Google/Alphabet fold. He was reunited with Poplin, who had joined up the month before.
And this time, Dean wasn’t just a citation; he was their boss.
As the head of Google Brain, Dean is the man behind the explosion of neural nets that now prop up all the ways you search and tweet and snap and shop. With his help, DePristo and Poplin wanted to see if they could teach one of these neural nets to piece together a genome more accurately than their baby, GATK.
The network wasted no time in making them feel obsolete. After being trained on benchmark datasets of just seven human genomes, DeepVariant was able to accurately identify single-nucleotide swaps 99.9587 percent of the time. “It was shocking to see how fast the deep learning models outperformed our old tools,” says DePristo. Their team published the results on bioRxiv in December of 2016, and the next summer it went on to win a top performance award at the PrecisionFDA Truth Challenge.
DeepVariant works by transforming the task of variant calling (figuring out which base pairs actually belong to you and not to an error or other processing artifact) into an image classification problem. It takes layers of data and turns them into channels, like the colors on your television set. In the first working model they used three channels: the first held the actual bases, the second a quality score assigned by the sequencer the reads came off of, and the third other metadata. By compressing all that data into an image file of sorts, and training the model on tens of millions of these multi-channel “images,” DeepVariant began to be able to figure out the likelihood that any given A or T or C or G either matched the reference genome completely, varied in one of your two copies, or varied in both.
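The channel idea can be sketched in a few lines of Python. This is a hypothetical illustration, not DeepVariant’s actual encoding: the function `encode_pileup`, the numeric base codes, the quality cap of 40, and the use of a “matches the reference” flag as a stand-in for the unspecified third metadata channel are all assumptions.

```python
# Hypothetical numeric codes for the four bases (channel 1).
BASE_CODE = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}

def encode_pileup(reference, reads, width):
    """Turn aligned reads into a small multi-channel "image".

    reads is a list of (offset, bases, quals) tuples, one per read.
    Each read becomes one row; each pixel is a 3-tuple:
      (base code, scaled quality score, 1.0 if the base matches the reference).
    Positions a read does not cover stay (0.0, 0.0, 0.0).
    """
    empty = (0.0, 0.0, 0.0)
    image = []
    for offset, bases, quals in reads:
        row = [empty] * width
        for i, (base, qual) in enumerate(zip(bases, quals)):
            col = offset + i
            if 0 <= col < width:
                matches = 1.0 if base == reference[col] else 0.0
                # Assumed scaling: cap quality at 40, then map to 0..1.
                row[col] = (BASE_CODE[base], min(qual, 40) / 40.0, matches)
        image.append(row)
    return image

reference = "ACGTAC"
reads = [
    (0, "ACGT", [30, 30, 30, 30]),   # agrees with the reference everywhere
    (1, "CTTA", [40, 20, 10, 40]),   # disagrees at column 2 (T instead of G)
]
image = encode_pileup(reference, reads, width=6)
print(image[1][2])  # the disagreeing pixel: (1.0, 0.5, 0.0)
```

A classifier trained on stacks like this (not shown here) would then output three probabilities for the position in question: matches the reference, differs in one copy, or differs in both.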
But they didn’t stop there. After the FDA contest they transitioned the model to TensorFlow, Google’s artificial intelligence engine, and continued tweaking its parameters by changing the three compressed data channels into seven raw data channels. That allowed them to reduce the error rate by a further 50 percent. In an independent analysis conducted this week by the genomics computing platform DNAnexus, DeepVariant vastly outperformed GATK, FreeBayes, and Samtools, sometimes reducing errors by as much as 10-fold.
“That shows that this technology really has an important future in the processing of bioinformatic data,” says DNAnexus CEO Richard Daly. “But it’s only the opening chapter in a book that has 100 chapters.” Daly says he expects this kind of AI to one day actually find the mutations that cause disease. His company received a beta version of DeepVariant, and is now testing the current model with a limited number of its clients, including pharma firms, big health care providers, and medical diagnostic companies.
To run DeepVariant effectively for these customers, DNAnexus has had to invest in newer generation GPUs to support its platform. The same is true for Canadian competitor DNAStack, which plans to offer two different versions of DeepVariant: one tuned for low cost and one tuned for speed. Google’s Cloud Platform already supports the tool, and the company is exploring using the TPUs (tensor processing units) that power things like Google Search, Street View, and Translate to accelerate the genomics calculations as well.
DeepVariant’s code is open-source so anyone can run it, but doing so at scale will likely require paying for a cloud computing platform. And it’s this cost, both in computation and in actual dollars, that has researchers hedging on DeepVariant’s utility.
“It’s a promising first step, but it isn’t currently scalable to a very large number of samples because it’s just too computationally expensive,” says Daniel MacArthur, a Broad/Harvard human geneticist who has built one of the largest libraries of human DNA to date. For projects like his, which deal in tens of thousands of genomes, DeepVariant is just too costly. And, just like current statistical models, it can only work with the limited reads produced by today’s sequencers.
Still, he thinks deep learning is here to stay. “It’s just a matter of figuring out how to combine better quality data with better algorithms and eventually we’ll converge on something pretty close to perfect,” says MacArthur. But even then, it’ll still just be a list of letters. At least for the foreseeable future, we’ll still need talented humans to tell us what it all means.