Genetic diversity and population structure of the endangered marsupial Sarcophilus harrisii (Tasmanian devil)

Webb Miller a,1, Vanessa M. Hayes b,c,1,2, Aakrosh Ratan a, Desiree C. Petersen b,c, Nicola E. Wittekindt a, Jason Miller c, Brian Walenz c, James Knight d, Ji Qi a, Fangqing Zhao a, Qingyu Wang a, Oscar C. Bedoya-Reina a, Neerja Katiyar a, Lynn P. Tomsho a, Lindsay McClellan Kasson a, Rae-Anne Hardie b, Paula Woodbridge b, Elizabeth A. Tindall b, Mads Frost Bertelsen e, Dale Dixon f, Stephen Pyecroft g, Kristofer M. Helgen h, Arthur M. Lesk a, Thomas H. Pringle i, Nick Patterson j, Yu Zhang a, Alexandre Kreiss k, Gregory M. Woods k,l, Menna E. Jones k, and Stephan C. Schuster a,1,2

a Pennsylvania State University, Center for Comparative Genomics and Bioinformatics, University Park, PA 16802; b Children’s Cancer Institute Australia and University of New South Wales, Lowy Cancer Research Centre, Randwick, NSW 2031, Australia; c The J. Craig Venter Institute, Rockville, MD 20850; d 454 Life Sciences, Branford, CT 06405; e Center for Zoo and Wild Animal Health, Copenhagen Zoo, 2000 Frederiksberg, Denmark; f Museum and Art Gallery of the Northern Territory, Darwin 0801, Australia; g Department of Primary Industries and Water, Mt. Pleasant Animal Health Laboratories, Kings Meadows, Tasmania 7249, Australia; h National Museum of Natural History, Smithsonian Institution, Washington, DC 20013-7012; i The Sperling Foundation, Eugene, OR 97405; j Broad Institute of Massachusetts Institute of Technology and Harvard University, Cambridge Center, Cambridge, MA 02142; k University of Tasmania, Hobart, TAS 7001, Australia; and l Immunology, Menzies Research Institute, Hobart, Tasmania 7000, Australia

Edited* by Luis Herrera Estrella, Center for Research and Advanced Studies, Irapuato, Mexico, and approved May 23, 2011 (received for review February 24, 2011)

The Tasmanian devil (Sarcophilus harrisii) is threatened with extinction because of a contagious cancer known as Devil Facial Tumor Disease. The inability to mount an immune response and to reject these tumors might be caused by a lack of genetic diversity within a dwindling population. Here we report a whole-genome analysis of two animals originating from extreme northwest and southeast Tasmania, the maximal geographic spread, together with the genome from a tumor taken from one of them. A 3.3-Gb de novo assembly of the sequence data from two complementary next-generation sequencing platforms was used to identify 1 million polymorphic genomic positions, roughly one-quarter of the number observed between two genetically distant human genomes. Analysis of 14 complete mitochondrial genomes from current and museum specimens, as well as mitochondrial and nuclear SNP markers in 175 animals, suggests that the observed low genetic diversity in today’s population preceded the Devil Facial Tumor Disease outbreak by at least 100 y. Using a genetically characterized breeding stock based on the genome sequence will enable preservation of the extant genetic diversity in future Tasmanian devil populations.

wildlife conservation | ancient DNA | population genetics | semiconductor sequencing | selective breeding

Global estimates are that 25% of all land mammals are at risk for extinction (1). Endemic Australian mammals are no exception, with 49 currently named on the International Union for Conservation of Nature (IUCN) Red List of Threatened Species. Carnivorous marsupials provide striking examples of recent extinction and critical population declines. After the loss of the thylacine (Thylacinus cynocephalus), also known as the Tasmanian tiger or Tasmanian wolf, in 1936, the Tasmanian devil (Sarcophilus harrisii) inherited the title of the world’s largest surviving carnivorous marsupial. Confined, in the wild, to the island of Tasmania, it too is under threat of extinction because of a naturally occurring infectious transmissible cancer known as Devil Facial Tumor Disease (DFTD).

First observed in 1996 in the far northeastern corner of the island state of Tasmania, DFTD has resulted in continuing population declines of up to 90% in areas of the longest disease persistence (2, 3). This rapidly metastasizing cancer is transferred physically as an allograft between animals (4), with a 100% mortality rate. It is predicted that in as little as 5 y DFTD will have spread across the entire Tasmanian devil native habitat, making imminent extinction a real possibility (5).

Cloning and sequencing of MHC antigens has suggested that low genetic diversity may be contributing to the devastating success of DFTD (6, 7). Because MHC antigens can be in common between each individual host and the tumor, which initially arose from Schwann cells in a long-deceased individual (8), the host’s immune system may be unable to recognize the tumor as “nonself.” On the other hand, a recent study demonstrated a functional humoral immune response against horse red blood cells, although cytotoxic T-cell immunity has not been evaluated to date (9).

An extensive effort is underway to maintain a captive population of Tasmanian devils until DFTD has run its course in the wild population, whereupon animals can be returned to the species’ original home range. The strategy for selecting animals for the captive population follows traditional conservation principles (10), without the potential benefits of applying contemporary methods for measuring and using actual species diversity. In hopes of helping efforts to conserve this iconic species, we are making available a preliminary assembly of the Tasmanian devil genome, along with data concerning intraspecies diversity, including a large set of SNPs.

Author contributions: W.M., V.M.H., and S.C.S. designed research; M.F.B., D.D., S.P., K.M.H., A.K., G.M.W., and M.E.J. directed field studies and provided samples; W.M., V.M.H., A.R., D.C.P., N.E.W., J.M., B.W., J.K., J.Q., F.Z., Q.W., O.C.B.-R., N.K., L.P.T., L.M.K., R.-A.H., P.W., E.A.T., M.F.B., D.D., S.P., K.M.H., A.M.L., T.H.P., N.P., Y.Z., A.K., G.M.W., M.E.J., and S.C.S. analyzed data; and S.C.S. wrote the paper.
The authors declare no conflict of interest.
*This Direct Submission article had a prearranged editor.
Data deposition: The sequences reported in this paper have been deposited in the GenBank database (accession no. AFEY00000000).
1 W.M., V.M.H., and S.C.S. contributed equally to this work.
2 To whom correspondence may be addressed.
This article contains supporting information online.
PNAS | July 26, 2011 | vol. 108 | no. 30 | 12348–12353

Results

To better assess the genetic diversity of the S. harrisii population, we have sequenced the nuclear genomes of two individuals. One animal, named Cedric, was an offspring of parents from northwest Tasmania and survived multiple experimental infections with different strains of tumor, although he eventually succumbed. The other animal, a female named Spirit, came from southeastern Tasmania and was close to death from DFTD when captured. Cedric’s genome was sequenced to sixfold coverage on the Roche GS FLX platform with Titanium chemistry, as well as an experimental version of the upcoming XL+ chemistry of Roche/454 Life Sciences, with read lengths ranging up to 800 base pairs. Roche/454 long read pairs (with inserts up to 17 kb) were used for contig assembly and scaffolding. In addition, Cedric was sequenced on an Illumina platform (GA IIx) to 16.7-fold coverage using paired-end sequencing with short inserts (around 300 bp). Spirit was sequenced to twofold on the Roche GS FLX Titanium platform and to 32.2-fold on the Illumina platform. We also sequenced a tumor taken from Spirit to 19.7-fold coverage. The distributions of coverage depths (determined by aligning reads to the assembly described next) are shown in Fig. 1.

As an intermediate step for measuring intraspecies diversity, we created a de novo genome assembly using the CABOG software package (11); the alternative approach of basing the analysis on comparison with a fully sequenced genome was less attractive because Sarcophilus is so evolutionarily distant from the available sequenced marsupial genomes [wallaby, opossum (12)] that many of its genomic regions cannot be accurately compared among those species. The assembly took advantage of the four data types: 454 Titanium paired reads, 454 Titanium unpaired reads, 454 XL+ unpaired reads, and Illumina GA IIx reads, and used reads from both Cedric and Spirit (but not the tumor). See Table 1 for summary statistics and SI Appendix for assembly details. The total size of the assembly, about 3.3 Gb (billion bases), is slightly larger than the average for mammalian genomes, but this is to be expected given earlier estimations that the Sarcophilus genome size “C-value” is 3.63 (13). Although it was not a main goal of the project to evaluate methods for assembling next-generation sequence data, our project provided an opportunity to compare the performance of two of the better current methods in a real-world setting (SI Appendix).

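Table 1 summarizes the assembly with N50 values for contigs and scaffolds. As a reminder of what that statistic measures, the following is a minimal sketch of the usual N50 definition. The function name and the toy lengths are hypothetical illustrations, not data or code from the paper, and this is not a description of how CABOG reports its statistics.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total number of assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy contig lengths in bp (not taken from the paper).
toy_contigs = [100, 80, 60, 40, 20]
print(n50(toy_contigs))
```

A larger N50 means that more of the assembly sits in long pieces, which is why the scaffold N50 in Table 1 is much larger than the contig N50.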
Our belief is that the field is not sufficiently mature to allow creation of a definitive reference assembly from data like ours. On the other hand, for assessing genetic diversity and providing a catalog of nucleotide variants, the method works well. It is important to note that by design, the draft assembly resulted from sequencing two individuals to yield a haploid sequence with no variant information. In a subsequent step, Illumina reads were mapped to the assembly and SNPs were called based on differences among the reads, rather than a difference between the reads and the assembly; thus, the SNP calls are largely resilient to assembly errors.

Mapping the Illumina reads to the assembled contigs let us identify the genetic diversity among the three samples, as well as within each genome (i.e., heterozygosity). We detected 1,057,507 SNPs (i.e., genomic positions where distinct nucleotides can be called with confidence). It is difficult to interpret the SNP count except by comparison with analogous results for species with which we are more familiar. Humans are the only species for which directly comparable data have been published. To avoid effects of methodological differences, we determined SNP counts for several pairs of human individuals exactly as we found Cedric-Spirit differences. Between Cedric and Spirit we found 914,827 substitutions; a southern African Bushman (14) and a Japanese individual (15) contain 4,800,466 SNPs, compared with 3,256,979 for a Chinese individual (16) and the Japanese individual.
Surprisingly (given the small number of remaining individuals), lower-coverage Illumina data (5×) indicate that divergence in each of the two threatened orangutan species is about twice that of humans (17).

Classification of nucleotide variants between Cedric and Spirit showed striking differences that indicate a historical mixing of the devil population, in contrast to the ancient separation of the Bushman and Japanese populations or the more recent separation of the Chinese and Japanese populations (Table 2 and SI Appendix). In a perfectly mixed population (i.e., matching the hypothesis of “random mating”), there should be twice as many biallelic positions where both individuals are heterozygous as positions where both are homozygous (for different nucleotides). In some sense the departure from the theoretical ratio 2 (see the last row of Table 2) measures stratification between the populations represented by the two individuals. This inference can also be made by considering only heterozygous positions in individuals (Fig. 2A) (see SI Appendix for details). Although the population subdivision in Tasmanian devils appears to be less deep than that for humans, below we show that a substructure exists and has relevance for efforts to conserve the species.

By sequencing one of five tumors removed from Spirit, we investigated tumor-specific alleles. Using the Galaxy Web site (18) (see Materials and Methods), we found 118,575 SNPs that are unique to the tumor: that is, where Cedric and Spirit appear homozygous for the same allele. (By comparison, 198,953 variants are unique to Cedric.) This large number of variants seen only in the tumor confirms that the tumor’s source was not a cell from the host, Spirit; rather, the tumor cells contain chromosomes from a different individual. Interestingly, only 20,822 variants were unique to Spirit, which we believe is a result of the presence of Spirit DNA in the tumor sample.

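The theoretical ratio of 2 quoted above can be checked with a short calculation. The sketch below assumes two unrelated individuals drawn from one randomly mating population, with allele frequencies p and q = 1 - p at a biallelic site; under that assumption the probability that both are heterozygous is (2pq)^2 and the probability that they are homozygous for different alleles is 2p^2q^2, so the ratio is 2 whatever the value of p. The function names are hypothetical, and the percentages are the category i and iii values printed in Table 2.

```python
def classify(g1, g2):
    """Classify a variant position from two diploid genotypes.

    Genotypes are two-letter strings such as "AG" or "GG".
    Returns "i" (heterozygous in both), "ii" (heterozygous in one),
    "iii" (homozygous in both, for different alleles), or None when the
    pair is not a biallelic difference of one of these kinds.
    """
    a, b = set(g1), set(g2)
    if len(a | b) != 2:
        return None  # identical homozygotes, or more than two alleles
    het1, het2 = len(a) == 2, len(b) == 2
    if het1 and het2:
        return "i"
    if het1 or het2:
        return "ii"
    return "iii"


def expected_ratio(p):
    """Expected (both heterozygous) / (homozygous for different alleles)
    under random mating, for allele frequency p with 0 < p < 1."""
    q = 1.0 - p
    both_het = (2 * p * q) ** 2
    both_hom_different = 2 * (p * p) * (q * q)
    return both_het / both_hom_different


def table2_ratio(pct_i, pct_iii):
    """Observed ratio of category i to category iii, to two decimals."""
    return round(pct_i / pct_iii, 2)


# Category i and iii percentages as printed in Table 2.
pairs = {
    "Cedric-Spirit": (23.8, 18.3),
    "Bushman-Japanese": (10.1, 19.4),
    "Chinese-Japanese": (17.1, 14.5),
}
for name, (pct_i, pct_iii) in pairs.items():
    print(name, table2_ratio(pct_i, pct_iii))
```

All three observed ratios fall below 2, with the Cedric-Spirit pair closest to the random-mating expectation, which is the sense in which the text describes devil population subdivision as less deep than that for humans.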
As tumors are likely to contain DNA from both normal and tumor tissue, we estimated the respective amounts by determining the ratio of mitochondrial and nuclear markers that are specific for each. The predicted tumor variants were verified by amplicon sequencing on 110 alleles, thus allowing us to segregate Spirit normal vs. tumor and original host alleles at high sequencing coverage (>1,000-fold). We estimate that 30% of the nuclear DNA and 15% of the mitochondrial DNA in the tumor sample is from Spirit (see Materials and Methods). We hypothesize that the difference indicates a higher number of mitochondria per cell in cancerous tissue.

Besides “contamination” from host DNA, there is another inherent limitation to analysis of the tumor sample. Unlike normal/tumor pairings used in other genomic analyses of cancer (e.g., ref. 19), the Tasmanian devil tumors are an infectious cell line, meaning they are “grafted” onto a new host whose genome differs from the original genetic background from which the tumor evolved. Therefore, the genetic analysis must take into account the diploid genome of the present host, the diploid genome of the original host, as well as the somatic mutations of the tumor onto its respective genetic background over many host generations. Although our approach can identify differences between the genomes of Spirit and the tumor, it does not allow us to estimate which of these are somatic mutations that accumulated over time in the tumor cell line. For that identification, it will be necessary to genotype them in a number of individuals so as to identify naturally occurring variants.

Fig. 1. Sequence coverage depth used for genetic variant detection. The coverage was calculated for Illumina sequences used for our three specimens in SNP calling against a de novo assembled reference sequence (14× coverage 454/Roche and Illumina hybrid assembly), and does not include potential PCR duplicates and secondary alignments. The y axis indicates the fraction of the non-N bases in the reference sequence that have a particular coverage. Vertical lines on the x axis indicate average coverage for the three samples.

Table 1. Assembly statistics

            Count      Length or span (Gbp)   N50 (bp)
Contig      457,980    2.932 (length)         9,495
Scaffold    148,891    3.228 (span)           147,544

We estimate that the number of amino acid differences in the diploid genomes of Spirit and Cedric is roughly 3,000 to 4,000. Although it was outside the scope of this project to predict a definitive Sarcophilus gene set, we used the Monodelphis genome and its gene annotations to identify 1,141 putative intraspecies protein variants. See SI Appendix for more information, including a discussion of how this information might be used to study DFTD.

To estimate the extent and trajectory of Sarcophilus genetic diversity since Europeans colonized Tasmania, we sequenced the mitochondrial genomes of seven modern and six historic samples, along with the tumor taken from Spirit. The genomes each contain 16,940 bases of nonrepetitive DNA, together with a short hypervariable region that we did not analyze. The 13 mitochondrial differences between Cedric and Spirit are roughly half the average number for two Europeans and, we estimate, one-sixth the number between two Bushmen (14), an unusually variable human population. Fig. 2C compares the number of mitochondrial differences in several species and populations, and indicates that the mitochondrial diversity of Sarcophilus is low in absolute terms.

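The mitochondrial comparisons above rest on counting the positions at which two aligned genomes differ, and Fig. 2C averages such counts over pairs of individuals. The following is a minimal sketch of that bookkeeping. It assumes the sequences are already aligned to equal length and in upper case, and it skips any position where either sequence has a gap or ambiguous base; those choices, the function names, and the toy sequences are illustrative assumptions, since this section does not describe the authors' pipeline.

```python
from itertools import combinations

VALID_BASES = set("ACGT")


def count_differences(seq_a, seq_b):
    """Count aligned positions where two sequences carry different bases.

    Positions where either sequence has a gap or an ambiguous base
    (anything other than A, C, G, T) are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for x, y in zip(seq_a, seq_b)
        if x != y and x in VALID_BASES and y in VALID_BASES
    )


def mean_pairwise_differences(sequences):
    """Average count_differences over all unordered pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(count_differences(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy alignment (not sequence data from the paper).
toy_alignment = ["ACGTACGT", "ACGTACGA", "ACGAACGA"]
print(mean_pairwise_differences(toy_alignment))
```

Applied to two full mitochondrial alignments with the hypervariable region removed, `count_differences` would yield the kind of per-pair figure quoted in the text for Cedric and Spirit.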
On the other hand, the rate at which this diversity is decreasing may also be low, as we did not detect much increased diversity in the historic samples (Fig. 2B). Excluding the tumor mitochondrial sequence, we detected 24 variable mitochondrial positions. The tumor mitochondrial sequence contained an additional five SNPs, but was otherwise identical to that of Spirit, again consistent with the tumor’s origin in eastern Tasmania. As the five SNPs from the tumor were not found in the remainder of the population, they may have arisen as a consequence of the increased mutational activity of the tumor tissue.

As our sequencing effort progressed, we were able to construct a series of increasingly extensive genotyping arrays to explore the Sarcophilus population structure across Tasmania. We genotyped 17 informative mitochondrial SNPs in 87 wild animals, identifying four persistent mitochondrial haplogroups (denoted A, B, C, and E) (SI Appendix, Table S14). Screening an additional 81 wild and 7 captive animals (SI Appendix) confirmed region-specific haplogrouping, and identified a fifth minor haplogroup, D (Fig. 3A). A specimen collected between 1870 and 1910 (OUM5286) showed a unique ancient haplogrouping (denoted hF), but otherwise all of the mitochondrial diversity found in historic samples persists in the extant population.

To provide an opportunity for a higher-resolution analysis of the population structure, we computationally inferred nuclear-genome nucleotide substitutions (20) between Spirit and Cedric as soon as we achieved 0.5× and, later, 2× sequence coverage, generating 96 and 1,536 SNP genome-wide genotyping arrays, respectively. Analysis of 1,532 potential SNP positions identified 702 informative variants used to genotype the 87 wild animals. Using this larger number of SNPs and EIGENSTRAT (21) to draw a principal-components analysis scatter plot (Fig. 3B) allows for inferences based on smaller population sizes (in this case, an average of eight per subpopulation) to quantify ancestry. Together with fixation index (F_ST) estimates (SI Appendix, Table S15) from the 12 geographical locations, nonsex-biased analysis reveals additional subpopulation structure. We note that the plot of Fig. 3B roughly recapitulates the geography of the devil samples in a way reminiscent of how human genes have been reported to mirror geography in Europe (22).

Discussion

Although most of the capacity of advanced sequencing instruments is currently devoted to resequencing humans (23) and human cancers, interest in sequencing other vertebrates remains alive and well (24). This interest has spawned a growing effort to develop de novo genome-assembly methods that can be applied to data from the so-called next-generation sequencing instruments (25). However, although deep coverage of a vertebrate genome can now be generated in 1 wk on a single instrument, methods for effectively using the data have not kept pace. For example, although the final assembly of the orangutan genome was released in July 2007, the analysis of the data, by a large consortium, was not published until January 2011 (17). Currently, it is not feasible to fully analyze genomes in such depth as quickly as the data can be produced; rather, to keep pace it is necessary to focus the analysis on particular issues. One possibility is to investigate intraspecies diversity, without attempting a definitive analysis of the species’ protein sequences.

Although the Sarcophilus population is prone to boom-or-bust fluctuations in size (26), the observed near-constancy of mitochondrial diversity over the last 100 y justifies guarded optimism that the species can survive, assuming adequate habitat areas and population numbers and that current diversity can be maintained with the help of a captive breeding program.
With the increased sensitivity of using larger numbers of biallelic nuclear markers (vs. only mitochondrial markers), we were able to identify additional population substructure, providing an ideal starting position and rationale for evaluating the ongoing breeding program. An alternative to a retrospective analysis of the established breeding population could be random selection of insurance animals guided by the population structure. Our data suggest equal selection from seven zones across Tasmania (Fig. 3C), including the diseased region, to ensure adequate capturing of current genetic diversity to supplement and boost current insurance breeding. Indeed, sampling healthy animals in a disease-impacted region may even enrich for alleles offering some protection against DFTD. A third possible use of our data is to genotype a large number of healthy wild animals and select a subset of specified size and sex composition whose overall allele frequencies are as close as possible to a desired distribution; see ref. 27, which also presents a method for optimal selection of ungenotyped individuals from genetically characterized subpopulations (e.g., Fig. 3 A and B).

Rather than planning a traditional genome-analysis project, our goal is to provide genomic resources to aid conservation efforts for the Tasmanian devil. We are making freely available (i) the Sarcophilus genomic contigs, (ii) alignments of the reads to those contigs, (iii) our complete set of 1,057,507 SNP predictions, with allele calls for the three individual samples, and (iv) alignments of 121,265 annotated Monodelphis protein-coding exons to Sarcophilus contigs, covering 17.2 million base pairs, including 1,134 amino acid differences and 1,891 synonymous substitutions among the three Sarcophilus genomes (see Materials and Methods).
Those exons exhibit 91.1% nucleotide identity and 94.7% amino acid identity between Monodelphis and Sarcophilus, although it should be kept in mind that our procedure strongly favors well-conserved regions.

Table 2. Major categories of variant positions between two individuals

Type                                             Cedric-Spirit   Bushman-Japanese   Chinese-Japanese
SNPs (in millions)                               0.91            4.80               3.26
i    Heterozygous in both (e.g., AG and AG)      23.8%           10.1%              17.1%
ii   Heterozygous in one (e.g., AG and GG)       57.9%           70.5%              68.4%
iii  Heterozygous in neither (e.g., AA and GG)   18.3%           19.4%              14.5%
Ratio i/iii                                      1.30            0.52               1.18

Minor categories (such as putative triallelic sites) are reported in SI Appendix.

Fig. 2. Genetic diversity of Sarcophilus. (A) The numbers of heterozygous sites in Cedric and Spirit (in millions), and the number shared between them, compared with two human pairs (the only other vertebrate species for which strictly comparable data are available). Sarcophilus has far fewer such sites. In addition, a much higher fraction is shared between individuals, indicating less population stratification than in humans (see SI Appendix). (B) Mitochondrial diversity covering the last 100 y. Locations of single nucleotide variations (neglecting the hypervariable region) are indicated as vertical lines in the seven modern and six museum specimens relative to the eastern-derived animal, Spirit. Diversity ranges from the geographically most western animal (Cedric) to the most distant eastern animal (Spirit). (C) Average numbers of mitochondrial genome differences between pairs of individuals, ignoring hypervariable regions. Species designated by the 2008 IUCN Red List of Threatened Species as “endangered” or “critically endangered” are indicated in red, and extinct species are in black. Species and populations in blue are thriving. †Species represented by only two sequences. *Whales are averaged over five species. Woolly mammoths are divided into two mitochondrial clades (30). The gorillas may be from separate subspecies, Gorilla gorilla and Gorilla beringei. It is apparent that mitochondrial diversity is not the only factor affecting species endangerment; habitat loss and other factors are often critical.

A potential follow-up study is to search for protein polymorphisms possibly related to an individual’s ability to resist or delay the onset of DFTD. One speculative case, the ERN2 gene, is discussed in the SI Appendix to illustrate computational methods that can be applied to winnow candidates down in preparation for laboratory experiments. Another line of study, starting with our data, could be to look for differences between the tumor and normal tissues, perhaps using as clues the 138 amino acid variants that we observed only in the tumor (SI Appendix, Table S10). In this regard we have validated 110 variants