Sameh El-Difrawy

A Soft-Computing System for accurate DNA base-calling

Date: November 15, 2001

DNA base-calling is the operation of identifying the ordered sequence of nucleotides (bases) by analyzing the electrophoretic data output (signal) produced by a DNA sequencing machine. Accurate base-calling becomes extremely difficult towards the end of a sequencing run (at long "readlength") because both the signal-to-noise ratio (SNR) of the electropherogram and the resolution of its peaks (which denote the presence of bases) become very low. Maintaining high base-calling accuracy at long readlengths is a very desirable property for automated sequencers and can contribute greatly to reducing the cost of DNA sequencing.

In this dissertation we approach the base-calling problem from a pattern recognition point of view. After appropriate and extensive pre-processing, the signal is segmented into time intervals of interest that correspond to potential base-call events. Each event forms a pattern characterized by a low-dimensional feature vector. Unsupervised, non-parametric, fuzzy clustering methods are then employed to assign the events to classes, where class-i contains events that are thought to contain i consecutive base-calls. Each of the four channels (A, C, G, T) of the DNA electropherogram is processed independently, and the resulting four partial sequences are then merged. The use of soft-computing methods for event clustering allows for the automatic generation of base-call confidence scores, which are very useful when partially overlapping short DNA sequences must be combined by assembly tools to produce a long consensus sequence prior to submission to an HGP database.
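
To illustrate how fuzzy clustering of events can produce both a class assignment and a confidence score, the following minimal sketch uses the textbook fuzzy c-means algorithm. This is a stand-in chosen for illustration only: the abstract does not name the specific clustering method, and fuzzy c-means (which needs the number of classes up front) is not claimed to be the method used in the dissertation. The two features (normalized peak width and peak area), the sample values, the initial centers, and the use of the largest membership as a confidence score are all hypothetical assumptions.

```python
import math

def memberships(points, centers, m=2.0):
    """Fuzzy membership of each point in each cluster (each row sums to 1)."""
    rows = []
    for p in points:
        # Clamp distances away from zero to avoid division by zero.
        d = [max(math.dist(p, c), 1e-12) for c in centers]
        w = [dist ** (-2.0 / (m - 1.0)) for dist in d]
        total = sum(w)
        rows.append([x / total for x in w])
    return rows

def fuzzy_c_means(points, init_centers, m=2.0, iters=100, tol=1e-9):
    """Textbook fuzzy c-means; returns (centers, memberships)."""
    centers = [list(c) for c in init_centers]
    dim = len(points[0])
    for _ in range(iters):
        u = memberships(points, centers, m)
        new_centers = []
        for i in range(len(centers)):
            # Each center is the mean of all points weighted by u^m.
            weights = [row[i] ** m for row in u]
            total = sum(weights)
            new_centers.append([
                sum(w * p[j] for w, p in zip(weights, points)) / total
                for j in range(dim)
            ])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships(points, centers, m)

# Hypothetical event features: (normalized peak width, normalized peak area).
events = [(1.0, 0.9), (1.1, 1.0), (0.9, 1.1),   # look like single bases
          (2.0, 2.1), (2.1, 1.9), (1.9, 2.0)]   # look like two merged bases

# Assumed initial centers, ordered so cluster index k maps to class-(k+1).
centers, u = fuzzy_c_means(events, [(1.0, 1.0), (2.0, 2.0)])

# Class i = number of consecutive base-calls thought to be in the event.
labels = [row.index(max(row)) + 1 for row in u]
# Illustrative confidence score: the largest membership value of the event.
confidence = [max(row) for row in u]
```

In this sketch an event that sits between two cluster centers receives memberships close to 0.5 in each, and therefore a low confidence value, which is the kind of information an assembly tool could use when weighing conflicting base-calls.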

An extensive evaluation, using different pools of data sets, reveals that the system can achieve average base identification performance that often exceeds that of commercially available software. Furthermore, the same algorithms can be used to uniformly process data sets generated by slab-gel or capillary electrophoresis sequencers, using primer or terminator dye chemistries, without any noticeable performance degradation.

Advisor: Prof. Elias Manolakos