Graphiti
Bases to Bytes
Cheap sequencing technology is flooding the world with genomic data. Can we handle the deluge?
The cost of sequencing human genomes is plunging—in the most advanced genomics centers, it’s falling five times faster than the cost of computing. Increasingly, people are getting their DNA sequenced by companies and research labs in a search for clues about genetic variation and disease.
But the industry must figure out how to cheaply store all the resulting data. Each of the 3.2 billion DNA base pairs in a human genome can be encoded in two bits, which works out to 800 megabytes for the entire genome. In practice, though, considerable data about each base is usually collected, and genes are often sequenced many times to ensure accuracy, so it’s common to save around 100 gigabytes when sequencing a human genome with a machine made by industry leader Illumina. Keeping this much data about every person on the planet would require about as much digital storage as was available in the whole world in 2010.
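The arithmetic behind these figures can be checked with a few lines of Python. This is a rough sketch, not part of the original graphic: the function names are illustrative, megabytes and gigabytes are taken as decimal units, and the world population of 7 billion is an assumption rather than a number from the article.

```python
# Figures taken from the article.
BASE_PAIRS = 3_200_000_000        # base pairs in a human genome
BITS_PER_BASE = 2                 # four bases (A, C, G, T) fit in two bits
SAVED_PER_GENOME = 100 * 10**9    # ~100 gigabytes typically saved per genome

# Assumption, not from the article: roughly 7 billion people.
WORLD_POPULATION = 7_000_000_000


def packed_genome_bytes(base_pairs=BASE_PAIRS, bits_per_base=BITS_PER_BASE):
    """Bytes needed to store the bare sequence at a fixed number of bits per base."""
    return base_pairs * bits_per_base // 8


def total_storage_bytes(per_genome_bytes, population=WORLD_POPULATION):
    """Bytes needed to keep one genome's worth of data for every person."""
    return per_genome_bytes * population


if __name__ == "__main__":
    print(packed_genome_bytes() / 10**6, "megabytes per bare genome")
    print(total_storage_bytes(SAVED_PER_GENOME) / 10**18, "exabytes for everyone")
```

Under these assumptions the bare sequence comes to 800 megabytes, and 100 gigabytes per person adds up to about 700 exabytes.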
The trick, then, will be to save less. Harvard geneticist George Church says that eventually only the differences between a newly sequenced genome and a reference genome will need to be stored. That information could be encoded in as little as four megabytes. Then your genome might be just another e-mail attachment.
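The idea of keeping only the differences from a reference can be sketched in miniature. The toy example below is an illustration only, not the method Church or any sequencing company uses: it assumes two sequences of equal length and records single-base substitutions, while real genomes also differ by insertions and deletions. The function names and the short sequences are invented for the example.

```python
def diff_against_reference(reference, sample):
    """Return (position, base) pairs where the sample differs from the reference.

    Toy assumption: both sequences have the same length, so only
    substitutions are possible.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def rebuild_from_diff(reference, diffs):
    """Reconstruct the sample by applying stored differences to the reference."""
    bases = list(reference)
    for position, base in diffs:
        bases[position] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTAC"   # made-up reference sequence
    sample = "ACGTTCGTAA"      # made-up newly sequenced genome
    diffs = diff_against_reference(reference, sample)
    print(diffs)
    print(rebuild_from_diff(reference, diffs) == sample)
```

Only the list of differences needs to be saved. Anyone holding the same reference can rebuild the full sequence from it, which is why the stored file can be so much smaller than the genome itself.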
Information graphics by Infographics.com