Comparative description of ten transcriptomes of newly sequenced invertebrates and efficiency estimation of genomic sampling in non-model taxa

RESEARCH Open Access

Ana Riesgo 1,2*, Sónia C S Andrade 1, Prashant P Sharma 1, Marta Novo 1,3, Alicia R Pérez-Porro 1,2, Varpu Vahtera 1,4, Vanessa L González 1, Gisele Y Kawauchi 1 and Gonzalo Giribet 1

* Correspondence: anariesgogil@gmail.com
1 Museum of Comparative Zoology, Department of Organismic and Evolutionary Biology, Harvard University, 26 Oxford Street, Cambridge, MA 02138, USA
2 Centro de Estudios Avanzados de Blanes, CSIC, c/ Accés a la Cala St. Francesc 14, Blanes, Girona 17300, Spain
Full list of author information is available at the end of the article

Riesgo et al. Frontiers in Zoology 2012, 9:33
http://www.frontiersinzoology.com/content/9/1/33

© 2012 Riesgo et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Introduction: Traditionally, genomic or transcriptomic data have been restricted to a few model or emerging model organisms, and to a handful of species of medical and/or environmental importance. Next-generation sequencing techniques have the capability of yielding massive amounts of gene sequence data for virtually any species at a modest cost. Here we provide a comparative analysis of de novo assembled transcriptomic data for ten non-model species of previously understudied animal taxa.

Results: cDNA libraries of ten species belonging to five animal phyla (2 Annelida [including Sipuncula], 2 Arthropoda, 2 Mollusca, 2 Nemertea, and 2 Porifera) were sequenced in different batches with an Illumina Genome Analyzer II (read length 100 or 150 bp), rendering between ca. 25 and 52 million reads per species. Read thinning, trimming, and de novo assembly were performed under different parameters to optimize output. Between 67,423 and 207,559 contigs were obtained across the ten species, post-optimization. Of those, 9,069 to 25,681 contigs retrieved blast hits against the NCBI non-redundant database, and approximately 50% of these were assigned Gene Ontology terms, covering all major categories, and with similar percentages in all species. Local blasts against our datasets, using selected genes from major signaling pathways and housekeeping genes, revealed high efficiency in gene recovery compared to available genomes of closely related species. Intriguingly, our transcriptomic datasets detected multiple paralogues in all phyla and in nearly all gene pathways, including housekeeping genes that are traditionally used in phylogenetic applications for their purported single-copy nature.

Conclusions: We generated the first study of comparative transcriptomics across multiple animal phyla (comparing two species per phylum in most cases), established the first Illumina-based transcriptomic datasets for sponge, nemertean, and sipunculan species, and generated a tractable catalogue of annotated genes (or gene fragments) and protein families for ten newly sequenced non-model organisms, some of commercial importance (i.e., Octopus vulgaris). These comprehensive sets of genes can be readily used for phylogenetic analysis, gene expression profiling, and developmental analysis, and can also be a powerful resource for gene discovery. The characterization of the transcriptomes of such a diverse array of animal species permitted the comparison of sequencing depth, functional annotation, and efficiency of genomic sampling using the same pipelines, which proved to be similar for all considered species. In addition, the datasets revealed their potential as a resource for paralogue detection, a recurrent concern in various aspects of biological inquiry, including phylogenetics, molecular evolution, development, and cellular biochemistry.

Keywords: Annelida, Arthropoda, Illumina, Mollusca, Nemertea, Next-generation sequencing, Porifera, Sipuncula

Background

Genetic studies in non-model organisms have been hindered by the lack of reference genomes, requiring researchers to adopt time-consuming and/or expensive experimental approaches. The advent of next-generation sequencing platforms (e.g., 454, Illumina, and SOLiD), with concomitant decreases in sequencing costs due to escalating technological development, has made genomic and transcriptomic data increasingly accessible to research groups. To date, most de novo transcriptomes have been generated using Roche/454 (e.g. [1-5]) and have focused on single species. More recently, Illumina short reads have been used to build transcriptomic datasets in non-model species [6-11], or combined with 454 data to assemble whole genomes [12], offering promising prospects for the availability of such data for taxa of biological significance.

The advantages of transcriptomic data over genome sequencing range from their tractable size (ten to a hundred times smaller than genomes) to their rapid procurement via large numbers of reads (from tens to a few hundred million short reads per lane, 100-150 bp) to facile assembly with intuitive software [13-15]. Transcriptomic sequencing offers advantages in the detection of rare transcripts with regulatory roles, given the enormous number of reads covering each base pair (generally from 100 to 1,000x per bp) [16]. Also, transcriptomes contain fewer repetitive elements than genomes, reducing analytical burden during post-sequencing assembly.
De novo assembled transcriptomes have been employed for gene discovery [3], phylogenomic analysis (e.g., [8,11,17-19]), microRNA and piRNA detection [16], detecting selection in closely related species [20], as well as for studies of differential gene expression (e.g. [2,7,21-23]), among other applications. Disadvantages of using transcriptomes for de novo assembly include issues with gene duplication, genetic polymorphism, alternative splicing, and transcription noise (e.g. [24,25]).

Many invertebrate phyla have been overlooked for genome and transcriptome sequencing priority, and for some groups, genomic data are particularly scarce. Among them, sponges (Porifera), ribbon worms (Nemertea), and peanut and segmented worms (Annelida) are particularly poorly studied with regard to genomics. The significance of such taxa stems from their utility for investigation of fundamental questions in evolutionary biology, such as the origins of metazoan organogenesis (e.g. [26]), the evolution and loss of segmentation (e.g. [27-29]), and the evolution of terrestriality [30,31]. Lack of genomic data for these lineages is often accompanied by poor knowledge of basal relationships and evolutionary history. Furthermore, currently available genomic resources are often insufficient for studying a broad diversity of organisms, given the phylogenetic distance between the lineage of interest and the available model organisms. For example, among arthropods, traditional model organisms are restricted to Holometabola (the lineage of insects with complete metamorphosis), although many questions of evolutionary significance involve lineages outside of this derived group, such as the origin of flight at the base of Palaeoptera, and the evolution of terrestriality at the base of Hexapoda.

A comparative characterization of transcriptomic data across phyla in non-model species has not been carried out yet, and would be desirable for two reasons.
First, generating such data enables estimating the efficacy of short-read data in sampling gene transcripts among distantly related lineages and with genomes of variable size. To date, Illumina data for comparative biology of multiple species have only been published for a few groups [8,11,32], but little has been done to compare libraries across different phyla. Second, this characterization is anticipated to guide future efforts to obtain transcriptomic data for non-model metazoan lineages, particularly those for which such efforts have not been previously undertaken. To abet forthcoming studies of development, phylogenomics, molecular evolution, and toxicology (among other applications of interest to us), we report here de novo assembled transcriptomes from 10 non-model invertebrate species belonging to five animal phyla: Porifera (Petrosia ficiformis, Crella elegans), Nemertea (Cephalothrix hongkongiensis, Cerebratulus marginatus), Annelida (Hormogaster samnitica, Sipunculus nudus), Mollusca (Chiton olivaceus, Octopus vulgaris) and Arthropoda (Metasiro americanus, Alipes grandidieri). Two species per phylum were selected to allow comparisons within and among phyla. We grouped the annelid and the sipunculan species for comparison; although the relationships between these lineages are not well established, most studies favor either a sister relationship of the two or a paraphyletic Annelida that includes Sipuncula [18,29,33,34]. Among the species selected, one is important for fisheries (the common octopus, Octopus vulgaris) and another has medical significance due to its potent venom (the African centipede Alipes grandidieri).

In this article we characterized the effectiveness of the Illumina platform transcriptome sequencing strategy across these selected species with respect to data yield and quality.
We compared the completeness of the datasets obtained for each taxon by assessing the sequencing depth and the recovery of gene ontology identifications, as well as protein families. Also, searches of targeted genes (e.g., elements of conserved signaling pathways as well as housekeeping genes) in our datasets and their counterparts in three fully sequenced invertebrate genomes were used to compare and assess the suitability of our transcriptome datasets for gene discovery. Our study should thus contribute towards assessing the use of Illumina sequencing for de novo transcriptome assembly in non-model organisms as a cost-effective and efficient way to obtain vast amounts of comparable data for application in a broad array of downstream procedures.

Results and discussion

Transcriptome analysis

Assembling reads and selecting optimal assemblies

cDNA libraries were obtained from high quality mRNA (Additional file 1) for the ten species (Figure 1) and yielded between ca. 25 and 52 million short reads using Illumina GAII (Table 1 and Additional file 2). After adaptor removal, thinning and trimming, we were left with ca. 15 to 45 million high quality reads per species, which were assembled using de novo assembly algorithms (Table 2 and Additional file 2). De novo assembly of either genomic or transcriptomic data poses substantial computational challenges [16,35,36]. Several short-read assemblers are now available, such as Velvet [13], ABySS [14], Trinity [36], and CLC Genomics Workbench (CLCbio, Aarhus, Denmark), among others. Most of these use de Bruijn graphs to assemble the reads, although there are slight variations among them, and a few have been reported to be more efficient than the rest [9,16,37-40]. We selected CLC for its desktop application with a graphical user interface, which facilitates analysis of the transcriptomic data.

We processed the sequences obtained following the workflow shown in Figure 2.
Filtering reads on quality with 0.005 as the limit removed a larger portion of each read when low quality was detected, and in many instances removed an entire low-quality read. Trimming with 0.005 as the limit was preferred if the initial quality of the reads was not very high. Otherwise, the least stringent value was preferred. Mean read length ranged from 65.4 bp in Petrosia ficiformis to 134.8 bp in Alipes grandidieri (Additional file 2). Although one may expect longer contigs with higher numbers of reads (Table 2), contig size did not correlate directly with the number of input reads. The length of the reads used for the assembly appeared to have an effect on the length of the assembled contigs: the longest contigs appeared when the read length was greater than 120 bp (Table 2 and Additional files 2 and 3). Assemblies performed with reads originally sequenced at 101 bp had an average maximum contig length of 6,939 ± 1,744.9 bp, whereas those obtained with reads originally sequenced at 150 bp had longer maximum contigs on average (9,809 ± 5,505.1 bp).

Of the two resulting assemblies for each species (A and B, see Methods section; Additional file 2), we selected one (Table 2) based on combinations of optimality criteria (Additional file 4). The assemblies performed with the largest numbers of reads were not always the optimal ones (see Table 2 and Additional file 2). Parameters that affected the final decision were: number of contigs, number of bases, N50, number of contigs longer than 2 Kb, and maximum contig length (Additional file 4). In all cases, the selected assembly was the one containing the largest number of contigs over 2 Kb (Additional file 2). Only the selected assemblies are discussed below (Table 2 and Additional file 2).
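As an illustration of the selection step described above, the following minimal Python sketch summarizes two candidate assemblies and picks the one with the most contigs over 2 Kb. It is not the procedure used in CLC Genomics Workbench; the function names (summarize, select_assembly), the dictionary layout, and the toy contig lengths are assumptions made only for this example.

```python
# Hypothetical sketch: compare candidate assemblies by simple descriptors
# and select the one with the most contigs longer than 2 Kb.
# All names and the toy data are illustrative assumptions, not values from the study.

def summarize(lengths, long_cutoff=2000):
    """Return basic descriptors for a list of contig lengths (in bp)."""
    return {
        "n_contigs": len(lengths),
        "n_bases": sum(lengths),
        "max_length": max(lengths),
        "n_long": sum(1 for n in lengths if n > long_cutoff),
    }

def select_assembly(assemblies, long_cutoff=2000):
    """Return the name of the assembly with the most contigs over long_cutoff."""
    return max(
        assemblies,
        key=lambda name: summarize(assemblies[name], long_cutoff)["n_long"],
    )

# Toy contig lengths for two assemblies of the same species (made-up numbers).
assemblies = {
    "A": [350, 2500, 4100, 900, 2100],
    "B": [300, 1800, 2600, 950, 700, 400],
}

best = select_assembly(assemblies)
print(best, summarize(assemblies[best]))
```

In this toy case assembly A is selected even though assembly B contains more contigs, which mirrors the observation that the assembly built from the most data is not necessarily the preferred one.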
Transcriptome descriptors: number and length of contigs

More than 40% of the reads were successfully assembled into contigs in all cases (Table 2), with more than 85% of the reads matching the resulting contigs in P. ficiformis (Table 2). Coverage values for our transcriptomes (defined as the number of reads covering a single base in each contig) varied from 36.2 in Cerebratulus marginatus to 92.1 in Sipunculus nudus (see Table 3). In general, the longer the contig, the higher the coverage for each base (Additional file 5), although in some cases, such as Chiton olivaceus and Sipunculus nudus, coverage values were much higher in shorter contigs (Additional file 5). Coverage values are usually higher for Illumina than for other NGS platforms, ranging from around 5 to 7 for 454 datasets [1,41,42], to more than 30 for Illumina [9,39,43]. The average number of reads building each contig varied greatly, ranging from 124.3 reads for Chiton olivaceus to 421.7 reads for Petrosia ficiformis (see Table 3). The maximum number of reads used to build a contig ranged from 65,985 in Octopus vulgaris to 543,848 in Hormogaster samnitica, and the minimum was 1 or 2 reads for each species (Table 3). Since very short contigs could be built with 1 paired-end read, we removed all contigs below 300 bp for each species prior to subsequent analyses. The minimum coverage for the sub-selections was highly variable: between 2 and 10 reads per contig (see Table 3). Our coverage results suggested the possibility of redundancy in the sequencing process (i.e., a great number of reads assembling into one contig, meaning a much deeper sequencing of some DNA fragments).
This redundancy was tolerated because the downstream applications for these datasets include gene expression and/or population genetics, for which redundancy can be addressed at a later analytical step [44].

An average of 47.1 Mb was assembled into contigs in our datasets (ranging from 26.7 Mb for Crella elegans to 75.9 Mb for Chiton olivaceus and Hormogaster samnitica; Table 2), with results falling in a range comparable to other previous studies with non-model species using 454 [41,45], although in many cases the assemblies were smaller [1]. Likewise, prior assemblies performed with Illumina reads ranged from 20 to 30 Mb [24,43,46-48], values lower than ours, probably because they used shorter sequencing lengths.

Figure 1 Phylogenetic position of the higher taxonomic ranks of the species selected for this study, and accessory pictures of the living animals. a. Petrosia ficiformis. b. Crella elegans. c. Cerebratulus marginatus. d. Cephalothrix hongkongiensis. e. Chiton olivaceus. f. Octopus vulgaris. g. Sipunculus nudus. h. Hormogaster samnitica. i. Metasiro americanus. j. Alipes grandidieri. (Pictures taken by Ana Riesgo (a), Alicia R. Pérez-Porro (b), Gonzalo Giribet (c, f, j), Sichun Sun (d), Jiri Nóvak (e), Gisele Kawauchi (g), Marta Novo (h), and Prashant Sharma (i).) Clade abbreviations used in the tree: M: Metazoa; Pf: Porifera; B: Bilateria; P: Protostomia; E: Ecdysozoa; S: Spiralia; T: Trochozoa; Sc: Scalidophora; Ne: Nematozoa.

Contig N50 is a weighted median statistic such that 50% of the entire assembly is contained in contigs equal to or larger than this value (in bp).
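The N50 definition above can be made concrete with a short Python sketch. This is a generic implementation of the statistic as defined in the text, not the routine used by the assembler in this study; the function name n50 and the toy lengths are illustrative assumptions.

```python
# Minimal sketch of the contig N50 statistic as defined in the text:
# the length L such that contigs of length >= L contain at least 50%
# of all assembled bases. Names and toy data are illustrative only.

def n50(lengths):
    """Return the N50 (in bp) of a non-empty list of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)  # longest contigs first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that brings the running sum to half of the
        # assembly defines the N50.
        if running >= half_total:
            return length

toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]  # 32 bases in total
print(n50(toy_lengths))
```

For the toy list the two longest contigs (8 + 8 = 16 bases) already hold half of the 32 assembled bases, so the N50 is 8, well above the mean contig length of 4. The statistic is therefore sensitive to how the longest contigs are distributed, not only to the average.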
N50 for a genome is usually around 1 Kb, which represents the average size of an exon for animals [49]. The lowest N50 recovered among our selected datasets was that of Chiton olivaceus (372, with an average length of 627.0 ± 305.3 bp) and the highest was for Octopus vulgaris (599, with an average length of 1,122.9 ± 660.5 bp) (see Table 2). These values are smaller than those observed for transcriptomes assembled from 454 pyrosequencing data (e.g., 900 bp for the chickpea [39]; 893 bp for Oncopeltus [41]; 693 bp for Acropora [1]) but similar to N50s obtained with Illumina RNAseq (e.g. [24,48]).

Our datasets contained a larger number of short contigs when compared to data obtained with 454 pyrosequencers (e.g. [2,4,50]), with only 4.7% to 15.7% of our assemblies constituted by contigs > 1 Kb (Additional file 3). However, the proportion of contigs over 1 Kb found in our data was surprisingly high for transcriptomic data (Additional files 2 and 6), surpassing that of 454 sequencing in other invertebrates with comparable sequencing effort, and similar to assemblies built with equal numbers of Illumina reads [8,46]. For instance, the transcriptome of the deep-sea mollusk Bathymodiolus azoricus (sequenced with 454) contained 3,071 contigs over 1 Kb [45], a smaller number than the > 5,000 contigs longer than 1 Kb in our mollusks, Chiton olivaceus and Octopus vulgaris (Additional file 6). Similarly, our results for arthropods (Additional file 6) outperform those obtained with 454 for several arthropod species [2,4,50]. Interestingly, our results for the number of contigs over 1 Kb (and also contigs > 500 bp) in the sponges Petrosia ficiformis and Crella elegans (Additional file 6) are similar to those found for the coral Acropora millepora, using 454 [22], indicating a similar sequencing depth.
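The contig-level descriptors discussed in this section (per-base coverage, removal of contigs below 300 bp, and the share of contigs over 1 Kb) can be sketched in a few lines of Python. This is an assumption-laden illustration: coverage is approximated here as reads times mean read length divided by contig length, the share is computed over contig counts, and the record layout, function names, and toy values are hypothetical rather than taken from the study.

```python
# Hypothetical sketch of three contig descriptors discussed in the text.
# The coverage formula is a common approximation and an assumption here;
# the study's software may compute coverage differently.

def mean_coverage(n_reads, mean_read_len, contig_len):
    """Approximate average number of reads covering each base of a contig."""
    return n_reads * mean_read_len / contig_len

def drop_short(contigs, min_len=300):
    """Keep only contigs of at least min_len bp (contigs below it are removed)."""
    return [c for c in contigs if c["length"] >= min_len]

def share_longer_than(contigs, threshold=1000):
    """Fraction of contigs (by count) that are longer than threshold bp."""
    return sum(1 for c in contigs if c["length"] > threshold) / len(contigs)

# Toy contigs with made-up lengths (bp) and numbers of assembled reads.
contigs = [
    {"length": 250, "reads": 2},
    {"length": 400, "reads": 10},
    {"length": 1200, "reads": 120},
    {"length": 3000, "reads": 600},
]

kept = drop_short(contigs)
print(len(kept), share_longer_than(kept))
print(mean_coverage(n_reads=120, mean_read_len=100, contig_len=1200))
```

Filtering before computing the share matters: in the toy data the 250 bp contig, built from only two reads, is dropped first, so the share of long contigs is taken over the three remaining contigs rather than all four.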
Table 1 Collecting information for the 10 species used for this study

Phylum | Species | Class, Order | Collection site | Voucher number | Body part | Preservation
Porifera | Petrosia ficiformis | Demospongiae, Haplosclerida | Punta Santa Anna, Blanes, Girona, Spain | DNA105722* | Entire animal | LN2 / -80°C
Porifera | Crella elegans | Demospongiae, Poecilosclerida | Tossa de Mar, Girona, Spain | DNA105740* | Entire animal | RNAlater
Nemertea | Cephalothrix hongkongiensis | Anopla, Paleonemertea | Akkeshi, Hokkaido, Japan | DNA106145* | Entire animal | RNAlater
Nemertea | Cerebratulus marginatus | Anopla, Heteronemertea | False Bay, San Juan Island, Washington, USA | DNA105590* | Entire animal | LN2 / -80°C
Mollusca | Chiton olivaceus | Polyplacophora, Chitonida | Tossa de Mar, Girona, Spain | DNA106012* | Entire animal | RNAlater
Mollusca | Octopus vulgaris | Cephalopoda, Octopoda | Blanes Bay, Blanes, Girona, Spain | DNA106283* | Fragment of arm | RNAlater
Sipuncula | Sipunculus nudus | Sipunculidae | Fort Pierce, Florida, USA | DNA106878* | Distal fragment of animal | LN2 / -80°C
Annelida | Hormogaster samnitica | Oligochaeta, Opisthopora | Gello, Toscana, Italy | GEL6** | Distal fragment of animal | RNAlater
Arthropoda | Metasiro americanus | Arachnida, Opiliones | Kingfisher Pond, Savannah, Georgia, USA | DNA101532* | Entire animal | LN2 / -80°C
Arthropoda | Alipes grandidieri | Chilopoda, Scolopendromorpha | Tanzania; pet supplier (www.kenthebugguy.com) | DNA106771* | Mid part of body | LN2 / -80°C

Voucher numbers refer to specimens collected in the same area as the one used for the nucleic acid extraction, since most of the time the entire animal (or the entire collected piece of animal) was processed. A single asterisk refers to voucher numbers in the Museum of Comparative Zoology, Harvard University, and a double asterisk to those deposited in the Department of Zoology and Physical Anthropology, Universidad Complutense de Madrid. In all cases only one specimen was used for extraction, except for Metasiro americanus, which also had embryos in several developmental stages.

Detection of chimeric sequences

The maximum contig length for each species varied greatly, ranging from 3,032 bp for Sipunculus nudus (the library with the lowest values for most metrics of data quality) to 16,472 bp for Octopus vulgaris (Table 2). The appearance of very long contigs in transcriptomic assemblies can be due to the existence of chimeric or misassembled sequences. Therefore, to check for putative chimeras (assembly artifacts), we translated the longest contig for each assembly to all 6 possible reading frames, took the longest open reading frame, and re-blasted it using the blastp program in NCBI. We also blasted the first and last 500 bases of each contig to