Acknowledgements
Project and Graduate School: Initially, thanks to Ms. Marsha Christiansen at Glendora High School for introducing me to matrix algebra. Thanks to Mikeal Roose, Timothy Close, Janet and Fred Lyon, Heather Lyon, the Citrus Research Board, the UC Biostar/Discovery Grant, UCR Graduate Division, and UCR Department of Botany and Plant Sciences for procuring financial support. Also, thanks are extended to Thomas Girke for computing cluster support and some software advice, and to Linda Walling and Julia Bailey-Serres for serving on the Dissertation and Test Committees. Linda Walling for plant genome class (awesome) and Julia Bailey-Serres for the proper pseudo-code for Mendelian inheritance, Patty Springer for Arabidopsis genetics background and positive encouragement. Norman Ellstrand for constant advice. Darleen DeMason for test committee presence and some anatomy and microscopy background. David Hull from Paracel for sparking my interest in choosing the best algorithms, Tracy Kahn for Citrus nomenclature help and access to the UC Riverside Citrus Variety Collection, Karolin Luger at Colorado State University for sparking my interest in bioinformatics, Dave Meyer from Paracel for teaching me Perl, and Mike Sievers from Paracel for initiating my interest in pipelined processes, Matthew Gregory David for countless instant messages for help on computer related topics, Jennifer Crowley for defense support and the mandarin genotyping project help.
Linear Model System: Justin Borevitz, Xinping Cui for the modeling help and ANOVA background. Shizong Xu for completing the ANOVA story. Ben Bolstad for normalization method. Mikeal Roose for teaching Advanced Plant Breeding Class (this was my favorite class, ever). Ivo Welch (Yale) for writing the regression engine. Eugene Bolotin for help prioritizing linear model inputs. Close and Roose for help with visualization. The coloring by grouping in the graphic SFP/AMP output was inspired by Caroline Ridley’s presentation on the radish work.
Database: Gene Tanimoto for citrus sequences and coordinates on the Affymetrix GeneChip. Neil Stone for teaching me MsAccess. Steve Wanamaker for live links to HarvEST:Citrus and annotation verifications.
Map: Mikeal Roose who is an expert in the examination of Citrus data. Also, Kim Bowman, Jose Chaparro, Chunxian Chen, Young Choi, California Citrus Research Board, Florida Citrus Production Research Advisory Council, Donna Clevenger, Timothy Close, Phat Dang, Claire Federici, Fred Gmitter, Julie Gmitter, Jude Grosser, Misty Holt, Shu Huang, Yildiz Kacar, Peter McClure, T. Greg McCollum, Lisa Mu, David O’Malley, Madhugiri Nageswara, Jaya Soneji, Barbara Walter, Steve Wanamaker, Margie Wendell.
Dissertation: Mikeal Roose, Mike Sievers, Timothy Close, Kara Oswood, Mike Calise.
Dedicated to the memory of Michael S. Williams,
who passed away in February, 2008.
His natural insight and leadership in the area of in silico procedural computation
and processing have inspired this and similar life pursuits for both myself and others.
He will be missed and we will not disappoint him as we proceed forward in his absence.
ABSTRACT OF THE DISSERTATION
A Genomic Genetic Mapping of the Common
Sweet Orange and Poncirus trifoliata
Matthew Peter Lyon
Doctor of Philosophy, Graduate Program in Plant Biology, Plant Genetics
University of California, Riverside, June 2008
Dr. Mikeal L. Roose, Chairperson
Initially part of an international plan to sequence the sweet orange genome, this work has produced genetic maps of sweet orange (Citrus sinensis) and Poncirus trifoliata. These maps have at least doubled the density of existing maps. A custom Affymetrix microarray was produced toward this goal, integrating microarray expression and mapping technologies. The mapping portion of this chip was composed of highly redundant probe sets, which allowed bi-allelic, co-dominant SNP (Single Nucleotide Polymorphism) markers to be mapped. The sequence source was an extensive EST (Expressed Sequence Tag) assembly from which SNP sequences were selected, and a filtering methodology incorporating the known biological sequences of citrus ESTs and Arabidopsis genomic sequence was applied in an information-theory-based sieve (chapter 2). A storage database (“LISA”) handled SSR (Simple Sequence Repeat) and microarray data and allowed mapping populations differing in parentage to be merged (chapter 3). Quality control features built into the Affymetrix chip were utilized to maximize the quality of data from this platform in an unsupervised data re-characterization technique with a linear model at its core (chapter 4). This provided SFP (Single Feature Polymorphism) markers derived from the expression probesets of the chip. Markers from GTYPE, an Affymetrix genotyping software product used to analyze tiled SNP markers, were also integrated into the final maps. In total, the final sweet orange map includes 988 markers in 972 loci of combined SSR, GTYPE, and SFP data and is shared on a website with hyperlinks to full gene annotation data. An analysis of the two maps is provided, comparing their marker density and amount of segregation distortion with each other and with established Aurantioideae maps (chapter 5). Further, documentation describing the database backbone for the project is given in appendix A, presented in the form of a brief user manual with figures.
Also, a comprehensive re-annotation (full sequence search) of the Affymetrix expression and SNP chip content has been completed and stored in the LISA database (appendix A).
Table of Contents
High Throughput Chip Technologies
List of Tables
Table 1.1. Numerical comparison of previous Citrus maps
Table 2.1. Known citrus sequence masking
Table 3.1. Scalability of the database, with various numbers of records shared
Table 4.1. Terms included in model
Table 4.2. Kolmogorov-Smirnov tests for normality and uniformity
Table 4.3. Pooled linear model terms weighting for chip genotyping
Table 4.4. Single chip linear model terms weighting for chip genotyping
Table 4.5. Linear model terms significance for mapping
Table 5.1. Numerical comparison of previous Citrus maps
Table 5.2. Reference table of constituent individuals
Table 5.3. Pre-marker counts by stage for both maps
Table 5.4. Citrus sinensis map summary
Table 5.5. Poncirus trifoliata map summary
Table 5.6. Final map Monte Carlo noise rejection simulation
Table 5.7. Maternal versus paternal distortion in the Citrus sinensis map
Table 5.8. Summary of segregation distortion in split Poncirus trifoliata maps
List of Figures
Figure A.8. Cartoon illustration of information output flow
Figure A.9. Screen capture of table relationships
Figure A.10. Illustration of tree names source tables
Figure A.11. Screen capture of marker names source table
Figure A.12. Screen capture of bandingAndGenotypes table
Figure A.13. Breeding diagram illustrating MAS
Figure B.1. log2(signal) for binding controls versus log2(signal) for binding controls
Figure B.2. Tm (degC) versus log2(signal) for binding controls
Figure B.3. log2(signal) for binding versus log2(signal) for mismatch probes
Chapter 1 - Introduction
The Importance of Citrus
Citrus is an important crop and nutrition source, as recent world agriculture statistics demonstrate. In 2005, there were 7,603,431 hectares of citrus harvested worldwide. Oranges alone had a worldwide import value of 3,189,251,000 USD in 2004, and 5,101,414 metric tons were traded worldwide. In 2004, the top importers of oranges and mandarin oranges were Germany, France, and the United Kingdom. The top exporters of oranges and mandarin oranges were Spain, South Africa, and the United States of America (FAOSTAT data, 2006). Citrus is also traded as fruit juice and fruit juice concentrate, with 4,234,111 metric tons of juice traded in 2004 at a value of 3,431,279,000 USD. Top importers of citrus juice in 2004, by world market value, were the Netherlands, Germany, and Japan, and the top exporters were the Netherlands, Italy, and Spain (FAOSTAT data, 2006). In 2004, in the United States, the prime locations for citrus cultivation were Florida, California, Texas, and Arizona (USDA, 2004). Citrus scion varieties (the scion being the fruit-bearing portion of a citrus tree) and rootstock varieties are genetically diverse and demonstrate a wide range of reproducible, genotype-specific responses to issues such as soil mineral composition (a special issue in developing nations). Responses to viral infections and changing climatic conditions also vary notably by genotype. The flavor and fruit quality of citrus scion varieties also vary with both genotype and environment (Roose et al., 1995). These underlying differences suggest that genetic selection and directed crossing within citrus can improve varieties by combining the optimum set of genes to create superior new types (Luro et al., 1995). Through these recurrent methods, citrus genetic improvements will continue by extending the growing seasons, ranges, and yields, reducing fertilizer and pesticide inputs, providing pathogen resistances, and custom tailoring fruit composition for health benefits (Soost and Roose, 1996).
The Benefits of Citrus
A brief discussion of the currently studied health benefits of citrus outlines the current understanding of citrus as a global dietary component. Citrus consumption has an accepted place in the diets of healthy individuals. It is enjoyed as a palatable table fruit in the form of oranges, Citrus sinensis (L.) Osbeck, lemons, C. limon (L.) Burm., limes, C. aurantifolia Swingle, and grapefruit, C. paradisi Macfad. Citrus is a prime source of vitamin C, an essential chemical compound that is not produced by the human body and is obtained solely through dietary sources. Historically, a scarcity of vitamin C in the human diet produced a disease condition known as scurvy. This was noted as early as the 1700s, and the inclusion of lemon and lime juices in the diets of British sailors allowed the British to make long overseas voyages (Lind, 1716; Moore, 1909). The chemical name of this vitamin, ascorbic acid, reflects its anti-scorbutic, or anti-scurvy, activity.
In modern times, the value of citrus in a healthy diet is reinforced by popular magazines and television advertisements that promote health products containing citrus by name. Beyond its health and culinary importance, citrus also serves as a natural or “green” source of chemical solvents for the plastics and electronics industries. Further, the chemical constituents of citrus are being investigated for human medicinal use. The terpenes and carotenoid compounds from citrus have been explored as cancer fighting agents (Chug-Ahuja et al., 1993; Holden et al., 1999). In the research literature, citrus bioflavonoids have demonstrated inflammation reduction, blood lipid level reduction, and antioxidant actions (Hasegawa and Miyake, 1996; Lam and Hasegawa, 1989; Ogawa et al., 2000; Tanaka et al., 1998; Bok et al., 1999; Choi et al., 2001). Also, grapefruit (C. paradisi) compounds have been shown to enhance the effects of certain pharmaceuticals through inhibition of a major human detoxification enzyme, CYP3A4 (Fukuda et al., 1997). Of the fruit available in a typical market, citrus has the highest number of carotenoids, the colored plant compounds responsible for antioxidant activities (Gross, 1987). Additionally, furanocoumarin compounds from citrus have been implicated in preventing HIV replication in studies conducted in vitro (Murray et al., 1982; Sancho et al., 2004). Further research into these and other citrus compounds may yield additional therapeutic insights.
A Brief Citrus History
Citrus originated in Southeast Asia. Cultivation of this fruit crop has been occurring since at least 2100 BC (Webber, 1967; Scora, 1988), and varieties of cultivated Citrus species consist of blends of genes from various ancestral citrus types, which are selected by farmers and breeders (Federici et al., 1998; Pedrosa et al., 2000; Nicolosi et al., 2000). Citrus genetic improvement programs currently exist in countries with significant citrus production, including Australia, Brazil, Spain, Japan, China, and Turkey. Within the United States, such programs exist in California, Florida, and Texas. At the University of California, Riverside, an extensive history of citrus improvement began in the early 1900s.
Current genetic improvement efforts in the crop include both selection of better output traits, such as fruit sweetness, color, and seedlessness, and the inclusion of input traits such as nematode, fungal, bacterial, and insect pathogen resistances, yield increases, and pollen sterility. The genetic improvement of citrus proceeds by directed crosses or identification of mutants, followed by temporally expansive and expensive evaluation of selections on irrigated land. Genetic improvement in citrus also makes use of ploidy changes and irradiation to impart sterility, seedlessness, and improved fruit quality. Further, alleles from citrus ancestral types including citrons, pummelos, mandarins, and wild relatives are introgressed into modern cultivated citrus. For the most part, and exceptions certainly do exist, all sexual forms of citrus are crossable with one another and it is thus, biologically, one species (Soost and Cameron, 1975; Cameron and Soost, 1984).
It is generally acknowledged that the breeding acceleration provided by marker aided selection, MAS, will become increasingly useful (Young, 1999), and recent work in maize has demonstrated this well (Ribaut and Ragot, 2007). This may become especially true in citrus, where minimum generation times are five to seven years and time until evaluation can be up to seven years (Davies and Albrigo, 1994). Marker aided approaches need genotypic descriptions of the examined individuals, however, and these descriptions can be made more accurately with mapped genetic markers. Such genomic approaches have already been utilized to understand the genetics of polygenic QTLs, quantitative trait loci, for salt tolerance in Citrus and Poncirus (Sykes, 1992; Tozlu et al., 1999). Further, molecular marker data have helped elucidate the genetic derivation of modern citrus varieties, including sweet orange. In Figure 1.1, the large arrows represent the passage of time, with the ancestral species on the left transitioning into the popular modern types on the right. A diversity of molecular marker types was examined by others, and their results were combined to derive this understanding (Federici et al., 1998; Pedrosa et al., 2000; Nicolosi et al., 2000). This evidence points to a hybrid origin for sweet orange resulting from an asexually preserved cross between the ancestral species pummelo, Citrus maxima (L.) or Citrus grandis (L.) Osbeck, and mandarin, Citrus reticulata (L.). This hybrid nature of Citrus sinensis enables the creation of genetic maps using alleles that segregate away from each other when sweet orange is a parent in a mapping population. Further, detailed maps may enable the recreation of sweet orange from parents similar to its ancestors, creating a sweet orange engineered to have improved contributions of desired traits.
The Importance of Maps
Genetic maps are a fundamental tool of most advanced plant genetics or genomic efforts. They describe the statistical likelihood that, after meiosis, two segments of DNA will be inherited together or will separate. Maps are typically presented as ‘linkage groups’, each an ordered list of marker names separated by numbers that represent the map distances between adjacent entries. The distances are estimated via a numerical transform (the mapping function) of an observed or inferred value, the 'recombination frequency'. Therefore, between each pair of adjacent loci on a genetic map there is a recombination frequency value that generated the map distance (Russell, 2006). These recombination frequencies range from 0 to 50 percent; a value of 50 percent is what independent assortment of unlinked loci produces by chance, so only values below 50 percent indicate linkage.
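The transform from recombination frequency to map distance can be sketched in a few lines. The text above does not say which mapping function was used, so the two shown here, Haldane and Kosambi, are assumptions chosen only because they are the two most commonly taught; the function names are hypothetical. Recombination frequency is passed as a fraction (0.10 for 10 percent) and the result is in centimorgans.

```python
import math


def haldane_cm(r: float) -> float:
    """Haldane map distance in centimorgans (assumes no crossover interference)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r: float) -> float:
    """Kosambi map distance in centimorgans (allows for partial interference)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


if __name__ == "__main__":
    # Illustrative recombination fractions only, not values from the maps.
    for r in (0.01, 0.10, 0.30):
        print(f"r={r:.2f}  Haldane={haldane_cm(r):6.2f} cM  Kosambi={kosambi_cm(r):6.2f} cM")
```

For small recombination fractions both functions return a distance close to 100 times the fraction, and both grow without bound as the fraction approaches 0.5, which is why a value of 0.5 cannot be placed on a map.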
These recombination frequency values represent the observed frequency at which two or more unique DNA sequences, or allelic states, are found in different non-parental combinations in the progeny population. This is a valuable piece of information because the DNA sequences of an organism determine which genes and traits are present. Different combinations of original DNA sequences, then, define individuals with different genes and traits. Therefore, these differing progeny individuals represent novel individuals with combinations of traits that differ from those of their parents.
The recombination frequency is expressed as the proportion of new combinations of these alleles found in the progeny population. Any occurrence of a novel allelic combination, that is, differing DNA sequences occurring together in a new way as compared to the parents, is recorded by counting the number of progeny that have this condition. The source of these novel combinations of DNA sequences is meiosis. In understanding meiosis it is imperative to understand that every diploid, sexual organism has two versions of the DNA sequences that define which genes and traits it has. Meiosis occurs in the parents before a sexual mating or cross and includes a stochastic recombination of the neighboring DNA sequences via the breaking and reconnection of these two large DNA molecules in a process called crossing over. Recombination frequency, then, is a measure of how often this crossing over event occurs in each parent for a given section of the DNA molecule. It is measured by counting the progeny that have recombinant states and dividing this count by the total number of progeny in the population analyzed. While this analysis appears simple, in certain crosses estimating the number of recombination events is considerably more complex. This is because, when some alleles are dominant, the observed phenotype may obscure the classification of progeny as parental or recombinant (Mendel, 1866).
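The counting procedure described above can be illustrated with a minimal sketch. It assumes the simplest case: co-dominant markers with known linkage phase, so that the two-locus haplotype each progeny received from one parent can be read directly. The function name, allele labels, and counts are hypothetical and are not data from this project.

```python
def recombination_frequency(progeny_haplotypes, parental_haplotypes):
    """Return the fraction of progeny carrying a non-parental allele combination."""
    if not progeny_haplotypes:
        raise ValueError("at least one progeny is required")
    parental = set(parental_haplotypes)
    # A progeny is recombinant when its haplotype matches neither parental type.
    recombinants = sum(1 for hap in progeny_haplotypes if hap not in parental)
    return recombinants / len(progeny_haplotypes)


# Hypothetical population of 100 progeny scored at two loci (A/a and B/b).
gametes = (
    [("A", "B")] * 42    # parental
    + [("a", "b")] * 40  # parental
    + [("A", "b")] * 10  # recombinant
    + [("a", "B")] * 8   # recombinant
)
rf = recombination_frequency(gametes, [("A", "B"), ("a", "b")])
print(f"recombination frequency = {rf:.2f}")  # 18 recombinants out of 100
```

With dominant markers or unknown phase, the recombinant class cannot be counted directly in this way and the frequency has to be estimated statistically instead.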
A genetic map, then, consists of a record of these recombination frequencies, often presented as a graphic (although it can be just a list of distances) and organized by neighboring DNA sequence elements, or loci. A separate map exists for each parent, as the set of available unique DNA sequences differs in each parent, which affects the possible outcomes of this stochastic process (Russell, 2006). Stated in other words, genetic maps of individuals describe the linkage, or distance, between loci or markers in groups. The overall maps are useful to the geneticist because they allow prediction of how often two loci (DNA sequences) will separate if a given number of progeny are produced using that parentage. This is of importance when one is trying to predict the probability of obtaining progeny with specific combinations of alleles. A small map distance, and thus a small recombination frequency, between two loci means that the chance of producing a progeny individual in which the two allelic states are not in agreement with the parent is relatively low. If the breeder desires that two loci of known states, separated by a small map distance, take on a recombined state, then the breeder knows that a large population of progeny will be needed to find these rare individuals. This is the simplest, most fundamental use of genetic maps in breeding operations. Thus, in plant breeding, the map is important for predicting whether useful DNA will be found in a particular seed. This is the aforementioned MAS, or marker aided selection, breeding technique. Under this technique, undesired traits are identified and seedlings are eliminated using DNA information (Hanson, 1959; Tanksley et al., 1989; Paterson et al., 1991; Dudley, 1993).
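The statement that a large population is needed to find rare recombinants can be made concrete with a small calculation. The sketch below assumes that each progeny is an independent trial with the same probability of being the desired recombinant; the function name and the 95 percent confidence default are hypothetical choices for illustration. It returns the smallest population size n for which 1 - (1 - p)^n reaches the requested confidence.

```python
import math


def progeny_needed(p_recombinant: float, confidence: float = 0.95) -> int:
    """Smallest n such that P(at least one desired recombinant in n progeny) >= confidence."""
    if not 0.0 < p_recombinant < 1.0:
        raise ValueError("p_recombinant must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_recombinant))


if __name__ == "__main__":
    # Illustrative probabilities only: tightly linked versus loosely linked loci.
    for p in (0.02, 0.20):
        print(f"p={p:.2f}: grow at least {progeny_needed(p)} progeny")
```

Under these assumptions a 2 percent chance per progeny calls for 149 seedlings, while a 20 percent chance calls for only 14. If one specific recombinant haplotype is wanted rather than either of the two reciprocal classes, the per-progeny probability is roughly half the recombination frequency and the required population grows accordingly.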
Although not uniform over all chromosomal regions, recombination frequency is related to the physical proximity along the DNA molecule of two portions of DNA within the organism. Therefore, a genetic map also assists in the cloning of neighboring DNA, which can be eukaryotic genes, into bacteria (Wallace et al., 1990). In this process, DNA from the source organism is cloned into bacteria and the resulting colonies are plated (as a large-insert library such as a BAC, bacterial artificial chromosome, library), and then the sequence of interest is located using a marker for, or near, a gene of interest. This marker serves as a labeled probe sequence. Often radiolabeled nucleotides are used to synthesize the labeled probe, which is applied to a flexible membrane to which DNA from the bacterial colonies has been bound. The membrane is then examined for the presence of the radioactive probe, which indicates matching DNA. The identification of such colonies allows for selection of a section of cloned DNA that has an increased likelihood of containing the DNA for the gene of interest.
For whole-genome sequencing or anchoring sequences to maps, a similar approach is followed. In this process each identified sequence corresponds to successive markers along the map, either overlapping or separated by gaps. Thus, the identified linear sequence of colonies has an increased likelihood of having the same ordering as the original DNA sequence, or chromosome, from which the original markers were mapped. This is important in work involving whole chromosomes in which the rough order of constituent DNA sequences must be known, such as the assembly process in whole-genome sequencing.
The Citrus Map
Prior to this work, citrus maps had generally not focused on gene-specific markers or on SSR markers applicable to different citrus populations. These maps consisted of different marker types, including ISSR (inter-simple sequence repeat), SSR (simple sequence repeat), isozyme, RAPD (random amplification of polymorphic DNA), cDNA (complementary DNA), and gDNA (genomic DNA) based markers. Of these, corresponding DNA sequences are known only for SSR and some of the cDNA and gDNA markers. The map presented in the dissertation of Barkley (2003) showed 180 molecular markers distributed over sixteen large and small linkage groups, based on a population of 57 progeny (Barkley, 2003; Barkley et al., 2006). A LOD score, or base 10 logarithm of odds, of 3.0 was employed to identify linkage groups in these maps. The map had mean and median counts of eleven and twenty five markers across sixteen groups. Kepiro (2003) analyzed segregation in 84 progeny to produce a map of Chandler pummelo containing 257 AFLP (amplified fragment length polymorphism) markers distributed in nine linkage groups, with mean and median counts of 29 and 27 markers per group. Also produced were AFLP maps of the trifoliate orange. Other recent, notable maps in Citrus include 310 ISSR, RAPD, RFLP, and isozyme markers by Sankar and Moore (2001) and 168 RAPD, SSR, and IRAP (inter-retrotransposon amplified polymorphism) markers in Poncirus and C. aurantium, the sour orange (Ruiz and Asins, 2003). The map of Chen et al. (2008), with 141 markers, was the first published citrus map to focus on gene-specific markers. The mapping population used in this dissertation includes some of the plants and markers studied by Chen et al. (2008), allowing these maps to be compared. For an excellent review of Aurantioideae mapping efforts, see Chen et al. (2008) and table 1.1.
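To show what a LOD threshold of 3.0 means in practice, the following sketch computes a two-point LOD score for the simplest case: phase-known data in which every progeny can be classified as parental or recombinant. This is an assumption made for illustration; the cited maps were built with dedicated mapping software whose exact likelihood model is not described here, and the function name is hypothetical. The score compares the likelihood of the data at the estimated recombination fraction with the likelihood under no linkage (a fraction of 0.5).

```python
import math


def two_point_lod(recombinants: int, total: int) -> float:
    """LOD at the maximum-likelihood recombination fraction versus r = 0.5."""
    if total <= 0 or not 0 <= recombinants <= total:
        raise ValueError("need 0 <= recombinants <= total and total > 0")
    # Estimates above 0.5 are capped, since 0.5 already means no linkage.
    r = min(recombinants / total, 0.5)
    parentals = total - recombinants
    log_likelihood = 0.0
    if recombinants:
        log_likelihood += recombinants * math.log10(r)
    if parentals:
        log_likelihood += parentals * math.log10(1.0 - r)
    # Under no linkage every progeny class has probability 0.5.
    return log_likelihood - total * math.log10(0.5)


if __name__ == "__main__":
    # Ten progeny with no recombinants give a LOD of about 3.01,
    # the smallest fully linked sample that clears a threshold of 3.0.
    print(f"{two_point_lod(0, 10):.2f}")
```

A LOD of 3.0 corresponds to odds of 1000 to 1 in favor of linkage, which is why small populations can only join markers that show very few recombinants.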
The primary objective of this research project was the creation of a genetic linkage map for Citrus on which the markers correspond to known sequences from genes. Secondarily, tools were developed to support this primary goal. Prior to this study, there was no relatively dense and publicly available gene-specific map of Citrus. Similar well-populated maps will be useful for efforts such as QTL analysis (Flint and Mott, 2001), and the gene-specific nature of these maps can help map-based cloning and whole-genome sequencing efforts. Molecular linkage maps are the standard roadmap used in genetic studies and have been developed for many commercially important organisms such as tomato and potato, pig, cow, and sheep (Tanksley et al., 1992; Rohrer et al., 1996; Kappes et al., 1997; Groenen et al., 2000). Further, studies of taxonomy, evolution, and systematics are greatly aided by such maps (Whitkus et al., 1994) because gene ordering is typically well preserved within, and even across, genera, and this provides for comparisons at a larger scale of examination than the basepair-sequence level.