Software for Missing Data Estimation in High Throughput Typing Studies


American Public Health Association 133rd Annual Meeting & Exposition, December 10-14, 2005, Philadelphia, PA

5193.1: Wednesday, December 14, 2005 - 3:10 PM

Abstract #107089


Joel S. Parker, MS¹, Venetia Raheja, MS¹, Ivan Rusyn, MD, PhD², and David W. Threadgill, PhD³. (1) Center for Health Research, Constella Health Sciences, 2605 Meridian Parkway, Durham, NC 27713, 919-313-7721, jparker@constellagroup.com, (2) Department of Environmental Sciences and Engineering, University of North Carolina, School of Public Health, 357 Rosenau Hall, Chapel Hill, NC 27599-7431, (3) Department of Genetics, University of North Carolina, Campus Box 7264, Chapel Hill, NC 27599

High throughput typing of SNPs has generated much interest as a means of finding causative genes in complex diseases. These experiments may allow development of screens for patients at elevated risk for certain diseases, and optimization of treatment regimens for specific patients. Current laboratory methods for high throughput typing produce missed calls for 5-25% of a dataset. Subsequent typing of these missing data points may double the cost of an experiment, but it is necessary in order to use many statistical data analysis tools.

To address this problem, we have developed a general strategy for estimating missed calls. The algorithm is based on a k-nearest neighbors (k-NN) approach. Three variants of this methodology have been tested and characterized using two datasets that contain genome-wide SNP types in a variety of mouse strains. The k-NN based method estimated missed calls with up to 95% accuracy overall, and with greater than 99% accuracy in some cases. Further, the imputation accuracy varies only slightly across different strains within the same species of mouse, suggesting that these results will translate to other model organisms and to humans.
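The general idea of k-NN imputation for missed calls can be illustrated with a minimal Python sketch. It is not the authors' software, and it does not reproduce any of the three variants mentioned above. It assumes that rows are strains and columns are SNPs, that a missed call is stored as None, that the neighbors are other strains, that distance is the fraction of mismatched calls over SNPs typed in both strains, and that the imputed call is the majority vote of the k nearest strains that have a call at that SNP. The names mismatch_distance and knn_impute are hypothetical.

```python
from collections import Counter

MISSING = None  # assumed marker for a missed call


def mismatch_distance(a, b):
    """Fraction of differing calls over the SNPs typed in both strains."""
    shared = [(x, y) for x, y in zip(a, b)
              if x is not MISSING and y is not MISSING]
    if not shared:
        return float("inf")  # no overlap: treat as maximally distant
    return sum(x != y for x, y in shared) / len(shared)


def knn_impute(calls, k=3):
    """Return a copy of `calls` (strains x SNPs) with missed calls filled in.

    Assumption for illustration: neighbors are other strains, and each
    missed call is set to the majority call among the k nearest strains
    that were successfully typed at that SNP.
    """
    imputed = [row[:] for row in calls]
    for i, row in enumerate(calls):
        # Rank all other strains from nearest to farthest.
        ranked = sorted(
            (j for j in range(len(calls)) if j != i),
            key=lambda j: mismatch_distance(row, calls[j]),
        )
        for snp, call in enumerate(row):
            if call is not MISSING:
                continue
            # Calls of the k nearest strains that are typed at this SNP.
            votes = [calls[j][snp] for j in ranked
                     if calls[j][snp] is not MISSING][:k]
            if votes:
                # On a tied vote, the call seen first (nearest strain) wins.
                imputed[i][snp] = Counter(votes).most_common(1)[0][0]
    return imputed


# Toy data: four strains typed at five SNPs, one missed call in strain 1.
strains = [
    ["A", "A", "B", "B", "A"],
    ["A", "A", "B", MISSING, "A"],
    ["B", "B", "A", "A", "B"],
    ["A", "A", "B", "B", "B"],
]
filled = knn_impute(strains, k=2)
```

In this toy example the two strains nearest to strain 1 both carry "B" at the missed SNP, so the missed call is imputed as "B". The choice of k and of the distance measure are the kind of parameters whose trade-offs would need to be evaluated on real data, for example by masking known calls and measuring how often they are recovered.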

Imputation of missed calls can therefore significantly reduce the cost of high throughput typing experiments. Here we will describe the algorithm and follow with a tutorial on how to use our publicly available software that implements this solution. At the conclusion, participants will understand the algorithms behind imputation and be able to use the software effectively in their research.

Learning Objectives: At the conclusion of the session, participants will be able to:

Understand the algorithms behind imputation of genetic data
Recognize the trade-offs of the imputation parameters
Utilize the software in their research

Keywords: Genetics, Statistics

Presenting author's disclosure statement:

I wish to disclose that I have NO financial interests or other relationship with the manufacturers of commercial products, suppliers of commercial services or commercial supporters.


Statistical Software and Science
