Lucio Cetto
Analysis of DNA Chromatograms for Base Calling Using Unsupervised Statistical Learning Methods
Date: July 9, 2007
ABSTRACT
DNA sequencing remains an intensively used analytical method even after the completion of the Human Genome Project. It is now used to determine the sequence of cloned DNA in avenues of inquiry that range from sequencing the genomes of other model organisms to making quantitative comparisons of gene expression. The cost of undertaking such comparative genomics is currently quite high and could be reduced substantially if the length of the sequence that can be accurately base called from the same raw data were extended significantly.
We have developed an end-to-end unsupervised statistical learning framework for modeling the underlying process that generates the DNA sequencing data (observed as fluorescence-based chromatograms), and have used it to develop new algorithms that address effectively, and in a unified manner, three interrelated and challenging problems: (i) the robust standardization and time-warping of DNA sequencing traces of varying characteristics and quality, (ii) the accurate identification of bases in extended regions of the chromatograms, without the need for costly recalibration for emerging sequencing technologies, and (iii) the assignment of statistically meaningful measures of confidence to all base-calling decisions. The particularities of the base-calling problem lead to an interesting probabilistic graphical model in which the structure of the dependencies between the random variables itself depends on hidden variables, thus complicating exact probabilistic inference. The end result of this research is a new base-caller (called FOVD) that matches and often exceeds the accuracy of Phred, the most widely used base-calling software program.
Dissertation Committee:
Prof. David Brady
Prof. Dana Brooks
Prof. Jennifer Dy
Prof. Elias Manolakos (advisor)