Computers have transformed nearly every aspect of modern life, and research in the biological sciences is no exception. In fact, computers have spawned a new field of biology called bioinformatics: the science of creating, managing, and mining biological information from sources as diverse as medical records, images, and sequence data. My own laboratory is particularly interested in sequence data. DNA usually consists of just four nucleotides, denoted by the letters A, T, G, and C. The human genome sequence, for example, is some 3 billion characters long. At 500 characters per page, that's a book 6,000,000 pages long. Manipulating, managing, and deciphering a sequence of this size are largely computer-based endeavors, and creating tools to aid these ends is the central focus of research in my laboratory.
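The page count quoted above works out if each page is taken to hold about 500 characters of sequence. The short sketch below reproduces that arithmetic; the constant and function names are illustrative assumptions, and the genome length is the rounded figure used in the text, not an exact value.

```python
# Back-of-the-envelope check of the "genome as a book" figure.
# Both constants are rounded, illustrative assumptions.
GENOME_LENGTH = 3_000_000_000  # approximate human genome length, in characters
CHARS_PER_PAGE = 500           # assumed number of characters printed per page


def pages_needed(length: int, per_page: int) -> int:
    """Return the number of pages needed to print `length` characters."""
    # Ceiling division, so a partially filled final page still counts.
    return -(-length // per_page)


pages = pages_needed(GENOME_LENGTH, CHARS_PER_PAGE)
print(f"{pages:,} pages")
```

Changing CHARS_PER_PAGE to a denser, more realistic page would shrink the book, but it would still run to hundreds of thousands of pages.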
Lab Website: www.yandell-lab.org
Sequenced genomes contain a treasure trove of information about how genes function and evolve. Getting at this information, however, is challenging and requires novel approaches that combine computer science and experimental molecular biology. My lab works at the intersection of both domains, and research in our group can be summarized as follows: generate hypotheses concerning gene function and evolution by computational means, and then test these hypotheses at the bench. This is easier said than done, as serious barriers still exist to using sequenced genomes and their annotations as starting points for experimental work. Some of these barriers lie in the computational domain, others in the experimental. Though challenging, overcoming these barriers offers exciting training opportunities in both computer science and molecular genetics, especially for those seeking a future at the intersection of both fields. Ongoing projects in the lab are centered on genome annotation and comparative genomics. New areas of inquiry include high-throughput biological image analysis, and exploring the relationships between sequence variation and human disease.
Genome annotation. One of the great ironies of the DNA sequencing revolution is that genome annotation, not genome sequencing, has become the bottleneck in genomics today. New genomes are being sequenced at a far faster rate than they are being annotated. As of 2007, there are nearly 700 eukaryotic genomes in the sequencing pipeline. Many of these genomes are associated with relatively small research communities who are finding themselves left in the lurch when it comes to annotating their genomes.
Over the past year my lab has been working on an easy-to-use genome annotation pipeline called MAKER. Our goal is to provide research communities without extensive bioinformatics expertise the means to independently annotate their genomes and to distribute the results to the larger biomedical community. For proof of principle, we have collaborated with the S. mediterranea genome project led by Prof. Alejandro Sánchez Alvarado, Dept. of Neurobiology & Anatomy, University of Utah School of Medicine. To date, our successful annotation of this genome has produced three papers: one describing MAKER, one describing the genome database that we constructed from MAKER's outputs, and one describing our analyses of the S. mediterranea genome and its contents. The first two papers are now in press at Genome Research and Nucleic Acids Research, respectively; the third is under review at Science. Going forward, we plan to use the S. mediterranea genome annotations for functional genomics screens. This work will provide many opportunities for research with both computational and experimental components.
High-throughput biological image analysis. The production and analysis of large numbers of digital images is an emerging field of bioinformatics. High-throughput imaging screens typically involve placing living cells or embryos in 96-well plates, and then adding different RNAi constructs or small molecules to each well. An automated microscope is then used to capture the results as digital images. These screens combine computation, genomics, and molecular biology in new ways: genome annotations are used to design RNAi constructs; cell lines and embryos expressing various fluorescent markers must be constructed; and software must be written to process the results. My lab is currently engaged in active collaborations with other groups on campus working in this area, as there is a pressing need to develop image-processing pipelines to analyze the data these screens produce.
In 2006, I helped to organize an R21 large-equipment grant to purchase an automated confocal microscope for high-throughput image-based screens. The application was successful, and the university has now acquired a BD Pathway Bioimager. This instrument will provide a basic resource for university researchers carrying out high-throughput image-based screens.
In a continuation of my collaboration with the S. mediterranea genome project, Prof. Sánchez Alvarado and I are using the S. mediterranea genome annotations for a genome-wide, image-based RNAi screen for genes involved in cellular regeneration and wound healing. The Bioimager is essential equipment for this work. Our results to date demonstrate that S. mediterranea is an ideal organism for high-throughput image-based screening, in part because it is literally a flatworm. This fact allows us to circumvent some of the technological problems that limit the scope and power of image-based screens of (not so flat) D. melanogaster and C. elegans.
Sequence variation and human disease. The Utah Population Database (UTPD) and the associated phenotype and clinical data collected through the Utah Genetic Reference Project (UGRP) offer unique resources for human genomics research. Tying the clinical and phenotypic data contained within these databases to the genome and genome annotations, however, is a challenging task. My lab is interested in characterizing large-scale trends in the UTPD and UGRP data, both with respect to sequence variation and demographics; developing methods to identify cohorts for clinical studies; and developing diagnostic devices for purposes of personalized medicine.
References to Publications:
Korf I., Yandell M. & Bedell J. BLAST. O’Reilly & Associates. July 2003, 360pp. ISBN:0596002998.
Kovach A, Wegrzyn JL, Parra G, Holt C, Bruening GE, Loopstra CA, Hartigan J, Yandell M, Langley CH, Korf I, Neale DB. (2010). The Pinus taeda genome is characterized by diverse and highly diverged repetitive sequences. BMC Genomics. 2010 Jul 7;11(1):420. [Epub ahead of print]
Levesque CA, Brouwer H, Cano L, Hamilton JP, Holt C, Huitema E, Raffaele S, Robideau GP, Thines M, Win J, Zerillo MM, Beakes GW, Boore JL, Busam D, Dumas B, Ferriera S, Fuerstenberg SI, Gachon CM, Gaulin E, Govers F, Grenville-Briggs L, Horner N, Hostetler J, Jiang RH, Johnson J, Krajaejun T, Lin H, Meijer HJ, Moore B, Morris P, Phuntmart V, Puiu D, Shetty J, Stajich JE, Tripathy S, Wawra S, van West P, Whitty BR, Coutinho PM, Henrissat B, Martin F, Thomas PD, Tyler BM, De Vries RP, Kamoun S, Yandell M, Tisserat N, Buell CR. (2010). Genome sequence of the necrotrophic plant pathogen, Pythium ultimum, reveals original pathogenicity mechanisms and effector repertoire. Genome Biol. 2010 Jul 13;11(7):R73. [Epub ahead of print]
Reese MG, Moore B, Batchelor C, Salas F, Cunningham F, Marth G, Stein L, Flicek P, Yandell M, and Eilbeck K. (2010) A standard variation file format for human genome sequences. In press, Genome Biology. Manuscript # 2142790872386656
Eilbeck K., Moore B., Holt C., Yandell M. (2009). Quantitative Measures for the Management and Comparison of Annotated Genomes. BMC Bioinformatics, 10(67), doi:10.1186/147.
Yandell M, Moore B, Salas F, Mungall C, MacBride A, White C, Reese MG. (2008). Genome-wide analysis of human disease alleles reveals that their locations are correlated in paralogous proteins. PLoS Comput Biol. 2008 Nov;4(11):e1000218. Epub 2008 Nov 7.
Cantarel B, Korf I, Robb SMC, Parra G, Ross E, Moore B, Holt C, Sanchez Alvarado A, Yandell M. MAKER: An Easy-to-use Annotation Pipeline Designed for Emerging Model Organism Genomes. In press, Genome Research.
Yandell MD, Mungall CJ, Prochnik S, Smith C, Kaminker J, Hartzell G, Lewis S, Rubin GM. Large-Scale Trends in the Evolution of Gene Structures within 11 Animal Genomes PLoS Comput Biol 2006 2(3): e15 doi:10.1371/journal.pcbi.0020015
Yandell M, Bailey AM, Misra S, Shu S, Wiel C, Evans-Holm M, Celniker SE, and Rubin GM, (2005). A computational and experimental approach to validating annotations and gene predictions in the Drosophila melanogaster genome. PNAS 102:5, 1566-1571
Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, Ashburner M., (2005). The Sequence Ontology: a tool for the unification of genome annotations. Genome Biology 6:R44
Majoros WH, Subramanian GM, Yandell MD., (2003) Identification of key concepts in biomedical literature using a modified Markov heuristic. Bioinformatics 19(3):402-7.
Zdobnov EM, et al., (2002). Comparative genome and proteome analysis of Anopheles gambiae and Drosophila melanogaster. Science 298(5591):149-59.
Holt RA, Subramanian GM, Halpern A, Sutton GG, Charlab R, Nusskern DR, Wincker P, Clark AG, Ribeiro JMC, Wides R, Salzberg SL, Loftus B, Yandell MD, et al., (2002). The genome sequence of the malaria mosquito Anopheles gambiae. Science 298(5591):129-49.
Yandell MD, Majoros WH., (2002) Genomics and natural language processing. Nat Rev Genet. (8):601-10. Review
Mural, et al., (2002). A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science 296(5573):1661-71.
Kerlavage A, Bonazzi V, di Tommaso M, Lawrence C, Li P, Mayberry F, Mural R, Nodell M, Yandell M, Zhang J, Thomas P., (2002). The Celera Discovery System. Nucleic Acids Res. 30(1):129-36.
Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M. et al., (2001). The sequence of the human genome. Science 291(5507):1304-51.
Jin S, Martinek S, Joo WS, Wortman JR, Mirkovic N, Sali A, Yandell MD, Pavletich NP, Young MW, Levine AJ., (2000). Identification and characterization of a p53 homologue in Drosophila melanogaster. PNAS 97(13):7301-6.
Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF, George RA, Lewis SE, Richards S, Ashburner M, Henderson SN, Sutton GG, Wortman JR, Yandell MD, et al., (2000). The genome sequence of Drosophila melanogaster. Science 287(5461):2185-95.
Rubin GM, Yandell MD, et al. (2000). Comparative genomics of the eukaryotes. Science 287(5461):2204-15.
Marth GT, Korf I, Yandell MD, Yeh RT, Gu Z, Zakeri H, Stitziel NO, Hillier L, Kwok PY, Gish WR. (1999). A general approach to single-nucleotide polymorphism discovery. Nat Genet. (4):452-6.
Mark Yandell, Ph.D.
Department of Human Genetics
University of Utah
15 N 2030 E RM 6160B
Salt Lake City, Utah 84112-5330