stanford.seal64.gif (1,768 bytes)

Broad Area Colloquium for Artificial Intelligence,
Geometry, Graphics, Robotics and Vision

Arachne: A Whole Genome Shotgun Assembly System

Serafim Batzoglou

Monday, October 29th, 2001, 4:15PM
Gates B01


Every living organism has a genome which is a DNA sequence of four letters (bases): A, C, G, and T. Finding this sequence, i.e. "sequencing" the DNA of an organism, is an important step towards studying the organism's biology.

A method of sequencing that is quickly becoming the method of choice, is "whole genome shotgun" sequencing. This involves breaking several copies of the genome into random fragments, sequencing these fragments (reads), and computationally reconstructing the full genome from the reads. Fragment assembly is the computational task of piecing together the genome from these "shotgun" reads. It becomes difficult as the length and repeat content of the genome increases. Genomes of mammals are very long (~3 billion bases) and contain a lot of repeats. As a result, assembling such a genome from whole genome shotgun reads overwhelms traditional computational assemblers. For that reason the human genome was sequenced with the help of physical maps, which typically take a long time to build.

We present Arachne, a new system for whole genome shotgun assembly. Arachne uses a series of sub-quadratic algorithms designed to process the initial reads and piece them together accurately so as to reconstruct the genome. The key elements enabling Arachne to handle large, repeat-rich genomes, are (i) an O(N) algorithm for comparing all pairs of reads to find the ones that overlap; (ii) a multiple alignment procedure for correcting errors in the reads; (ii) a series of heuristics designed to assemble the reads into longer genomic pieces, while avoiding wrongly joins that are the typical result of repeats in the genome.

We test our system using real whole genome shotgun data from Neurospora and the mouse, and simulated whole genome shotgun data from a variety of publicly available genomes.

About the Speaker

Serafim Batzoglou received his S.B. in Mathematics and in Computer Science, and his M.Eng. in Computer Science from MIT in 1996. He received his PhD in Computer Science from MIT in 2000. During his PhD he was affiliated with the Whitehead Institute/ MIT Center for Genome Research in Cambridge, USA, where he was a Research Scientist until August 2001. In September 2001 he joined Stanford as an Assistant Professor of Computer Science. His research work focuses on comparative genomics, gene finding, genome sequencing, and other applications of computer science to biology.


Back to the Colloquium Page