![]() |
Broad Area Colloquium for Artificial Intelligence,
|
A method of sequencing that is quickly becoming the method of choice, is "whole genome shotgun" sequencing. This involves breaking several copies of the genome into random fragments, sequencing these fragments (reads), and computationally reconstructing the full genome from the reads. Fragment assembly is the computational task of piecing together the genome from these "shotgun" reads. It becomes difficult as the length and repeat content of the genome increases. Genomes of mammals are very long (~3 billion bases) and contain a lot of repeats. As a result, assembling such a genome from whole genome shotgun reads overwhelms traditional computational assemblers. For that reason the human genome was sequenced with the help of physical maps, which typically take a long time to build.
We present Arachne, a new system for whole genome shotgun assembly. Arachne uses a series of sub-quadratic algorithms designed to process the initial reads and piece them together accurately so as to reconstruct the genome. The key elements enabling Arachne to handle large, repeat-rich genomes, are (i) an O(N) algorithm for comparing all pairs of reads to find the ones that overlap; (ii) a multiple alignment procedure for correcting errors in the reads; (ii) a series of heuristics designed to assemble the reads into longer genomic pieces, while avoiding wrongly joins that are the typical result of repeats in the genome.
We test our system using real whole genome shotgun data from Neurospora and the mouse, and simulated whole genome shotgun data from a variety of publicly available genomes.