
Improved annotation of the domestic pig genome through integration of Iso-Seq and RNA-seq data


Summary

- On average, 90% of the splice junctions were supported by RNA-seq within tissue.
- We validated a large proportion of these extensions by independent pig poly(A) selected 3′-RNA-seq data, or human FANTOM5 Cap Analysis of Gene Expression data.
- Indeed, Iso-seq data have been used for genome annotation of different species from maize to human [24–26].
- A recent study on the pig transcriptome [28] used PacBio Iso-seq data from 38 porcine tissues to improve the previous pig genome assembly (Sscr10.2).
- We also provide direct evidence that the predicted novel genes and transcripts are valid for creating improved annotation, by performing independent chromatin immunoprecipitation sequencing (ChIP-seq) and poly(A) selected 3′-RNA-seq experiments and by using human FANTOM5 Cap Analysis of Gene Expression (CAGE) data.
- A total of approximately 4.4 M Iso-seq reads and 499 M RNA-seq reads were collected, with a minimum of K) Iso-seq and M) RNA-seq reads from each tissue (average 491 K ± 92 K and 55 M ± 20 M, respectively) (Additional file 1: Table S1 and Additional file 1: Table S2).
- The RNA-seq data were not independently assembled; instead, transcripts and transcript isoforms were defined from the Iso-seq reads and error-corrected, validated, and quantified using the short reads.
- This approach identified a total of 67,746 unique transcripts (1.2% of total Iso-seq reads) across all nine tissues.
- An average of 90% of predicted splice junctions across the nine tissues were supported by Illumina-seq reads that spanned the splice junction (Additional file 1: Figure S2), supporting the accuracy of the transcript definition from Iso-seq reads.
- We evaluated the set of Iso-seq-defined transcripts for potential tissue-specific transcripts.
- RNA-seq data were used to test whether the absence of these transcripts from the Iso-seq reads in the other tissues was due to tissue-specificity or potentially due to lack of data.
- From the complete set of 4733 unique brain transcripts that were not observed in the Iso-seq data from any other tissue transcripts had RNA-seq reads spanning all splice junctions in at least one other tissue, and these reads represent transcripts with expression levels more than 0.1 FPKM (inflection point in expression plot of transcripts detected in more than one tissue by Iso-seq data).
- Thus, reliance on just Iso-seq data to predict tissue-specific transcripts may overestimate tissue-specificity due to a high false negative rate for transcript detection.
- been detected by Iso-seq data in that tissue or (2) it had been detected by Iso-seq data in another tissue but all of its splice junctions were validated by Illumina reads in the tissue of interest with expression level more than 0.1 FPKM (see Methods section).
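
The two-part detection rule above can be sketched as a small predicate. This is only an illustration under stated assumptions: the function name, the argument names and the per-junction list of booleans are hypothetical and are not taken from the authors' pipeline; only the 0.1 FPKM threshold comes from the text.

```python
from typing import Sequence

MIN_FPKM = 0.1  # expression threshold quoted in the text


def detected_in_tissue(
    isoseq_detected_here: bool,
    isoseq_detected_elsewhere: bool,
    junction_supported: Sequence[bool],
    fpkm: float,
) -> bool:
    """Return True if a transcript counts as detected in the tissue of interest.

    Rule (1): the transcript was seen in Iso-seq reads from this tissue.
    Rule (2): it was seen in Iso-seq reads from another tissue, every splice
    junction is supported by short reads in this tissue, and its expression
    in this tissue exceeds MIN_FPKM.
    """
    if isoseq_detected_here:
        return True
    # An empty junction list gives no short-read evidence, so it is rejected here.
    all_supported = len(junction_supported) > 0 and all(junction_supported)
    return isoseq_detected_elsewhere and all_supported and fpkm > MIN_FPKM
```

A transcript missing from a tissue's Iso-seq reads is therefore only called absent when the short-read evidence is also missing, which is how the rule counters the false negative rate of long-read detection.
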
- Reference and predicted Iso-seq transcripts are identified by black and blue color, respectively.
- This resulted in the identification of transcripts for 14,021 known genes or 57% of all Iso-seq data-associated genes (24,486) (Fig.
- Known genes (in either Ensembl or NCBI gene sets) that we did not detect in our Iso-seq data (Fig.
- i.e., predicted Iso-seq genes (see Method) that produced “s”, “x”, “i”,.
- This resulted in a total of 10,465 novel genes or 43% of all Iso-seq data-associated genes.
- Only 21% of the novel.
- All possible combinations of presence or absence in NCBI and Ensembl annotations for the Iso-seq annotated genes, as well as intragenic/intergenic location relative to those annotations, were determined and are summarized in Fig.
- For example, there were 1965 Iso-seq genes that were found in the NCBI gene set but not in the Ensembl annotation gene set (NCBI specific Iso-seq genes).
- In contrast, only 364 Iso-seq genes were found in the Ensembl gene set, but not in the NCBI gene set (Ensembl specific Iso-seq genes).
- genes that had at least one protein-coding transcript in these 364 Ensembl specific Iso-seq genes (56%) was higher than that for the 1965 NCBI specific Iso-seq genes (24%) (Fig.
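
The presence/absence categories described above amount to a set comparison. The sketch below uses made-up gene identifiers and assumes that membership in each reference gene set has already been decided upstream; the function name and labels for the shared and novel cases are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def classify_gene(gene_id: str, ncbi_genes: Set[str], ensembl_genes: Set[str]) -> str:
    """Label an Iso-seq gene by the reference gene sets that contain it."""
    in_ncbi = gene_id in ncbi_genes
    in_ensembl = gene_id in ensembl_genes
    if in_ncbi and in_ensembl:
        return "known in both"
    if in_ncbi:
        return "NCBI specific"
    if in_ensembl:
        return "Ensembl specific"
    return "novel"


def tally(isoseq_genes: Iterable[str], ncbi_genes: Set[str], ensembl_genes: Set[str]) -> Counter:
    """Count Iso-seq genes per category."""
    return Counter(classify_gene(g, ncbi_genes, ensembl_genes) for g in isoseq_genes)


# Toy identifiers, purely illustrative.
counts = tally(
    ["g1", "g2", "g3", "g4", "g5"],
    ncbi_genes={"g1", "g2", "g3"},
    ensembl_genes={"g1", "g4"},
)
```
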
- Out of 493 liver detected NCBI specific Iso-seq genes that were located in intergenic regions of the pig genome based on the Ensembl gene set, 358 genes (72%) had H3K4me3 and H3K36me3 peaks that mapped to their promoter and gene body, respectively (Fig.
- This new Iso-seq based pig gene set annotation extended (5′ end extension, 3′ end extension or both) more than 6000 known Ensembl or NCBI gene borders (Table 2).
- To validate 3′ end extensions, an independent liver poly(A) selected 3′-RNA-seq dataset (Quantseq, Lexogen.
- Out of 3228 3′ end extended Ensembl genes with transcripts detected in liver, 2902 genes (90%) had 3′-RNA-seq reads that mapped to the 3′.
- To measure the effect of these 3′ end extension events on gene expression values, we narrowed down the analysis to those liver detected Iso-seq genes with exactly the same 5′ end but an extended 3′ end compared to the reference Ensembl genes (233 genes).
- one-to-one orthologous genes with more than 90% nucleotide similarity [29], with an extended 5′ end based on Iso-seq data were selected for validation.
- The promoter region, as defined by the median length of H3K4me3 peaks (600 bp) that overlapped with both the Iso-seq and Ensembl gene set annotations, is too broad to identify the correct 5′ end.
- Iso-seq defined 5′ ends, we developed an ad hoc method as described here.
- The candidate 5′ end region predicted by Ensembl or Iso-seq genes was defined based on the gene start site plus or minus 1/3 of the Ensembl gene extended region length (Additional file 1: Figure S12a).
- Out of 1270 human-pig orthologous genes with an extended 5′ end, 320 genes had human CAGE reads that uniquely mapped from the region defined as the Iso-seq candidate 5′ end to the Ensembl candidate 5′ end (Additional file 1: Figure S12a).
- had CAGE reads that mapped to the Iso-seq candidate 5′ end, i.e. the Iso-seq 5′ end (Additional file 1: Figure S12a).
- This includes 105 genes with only validated Iso-seq 5′ end and 98 genes with both validated Iso-seq and Ensembl 5′ end (multiple promoter genes) (Fig.
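
The candidate 5′ end region defined above (gene start plus or minus one third of the extended region length) can be sketched as follows. This is a hedged reading, not the authors' code: it assumes the "extended region length" is the distance between the Ensembl and Iso-seq gene starts, and the function names and coordinates are hypothetical.

```python
from typing import Tuple


def candidate_5p_window(
    gene_start: float, ensembl_start: float, isoseq_start: float
) -> Tuple[float, float]:
    """Window around gene_start with a margin of 1/3 of the extension length."""
    extension = abs(ensembl_start - isoseq_start)  # assumed meaning of "extended region"
    margin = extension / 3.0
    return (gene_start - margin, gene_start + margin)


def in_window(position: float, window: Tuple[float, float]) -> bool:
    """True if a CAGE read position falls inside the candidate window."""
    return window[0] <= position <= window[1]


# Toy coordinates: the Iso-seq model starts 600 bp upstream of the Ensembl model.
ensembl_window = candidate_5p_window(1000, ensembl_start=1000, isoseq_start=400)
isoseq_window = candidate_5p_window(400, ensembl_start=1000, isoseq_start=400)
```

With a one-third margin the two windows cannot overlap, so a CAGE read position can be attributed to at most one of the two candidate 5′ ends.
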
- Fig. 5 Example of validation of novel intergenic Iso-seq gene using matched RNA-seq reads and independent liver ChIP-seq (H3K4me3 and H3K36me3) and 3′-RNA-seq experiments.
- Table 2 Gene border extensions in current Ssc11.1 genome annotations by PacBio Iso-seq data.
- Fig. 6 Example of validation of extended 3′ annotation using an independent liver 3′-RNA-seq experiment.
- In this study, using Iso-seq data from nine different porcine tissues, we could identify 10,465 novel genes not reported in current pig genome annotations (Ensembl release 93 and NCBI release 109).
- In addition, there were 1961 predicted Iso-seq genes reported in NCBI annotation but not in Ensembl annotation and 364 predicted Iso-seq genes reported in Ensembl annotation but not in NCBI annotation.
- These numbers are likely an underestimation as the Iso-seq transcripts used in this study were poly(A) selected and we did not have the ability to capture non-polyadenylated lncRNAs.
- Also, during the process, we removed predicted single-exon transcripts that had only been detected in a single tissue by Iso-seq data as we could not verify whether they were real transcripts or fragments resulting from a decayed transcript.
- The 3′-RNA-seq technology sequences RNA fragments close to the 3′ end of polyadenylated transcripts and, by reducing the sequencing space per sample, provides a cheap alternative tool to quantify gene expression levels [38].
- Our Iso-seq transcripts extended the 3′ end of more than 4000 known genes in either the Ensembl or NCBI annotations.
- The high validation rate of these extended regions in both Ensembl and NCBI annotations using an independent 3′-RNA-seq experiment shows improvement of gene 3′-end location compared to current pig annotations.
- In addition, our results showed the significant effect of these 3′ end extensions on improving gene expression quantification using 3′-RNA-seq data in pig genomics.
- Our novel Iso-seq-based analysis extended the 5′ end of more than 3000 known genes; however, the library preparation method used in this study did not specifically target 5′ end caps, meaning the transcript 5′.
- A recent study on the chicken transcriptome using Iso-seq data [24] reported ~8% of predicted Iso-seq genes in this species have at least one NMD transcript.
- A recent study on the pig transcriptome based on PacBio Iso-seq data [28] improved previous gene structure annotation (Sscrofa10.2) in terms of novel genes (26,881) and novel transcripts (28,127).
- Although this study used Iso-seq data sourced from 38 porcine tissues, it has five major differences compared to our study.
- Second, sequencing depth per tissue in their experiment was lower (514,659 Iso-seq reads pooled from all 38 tissues) compared to our Iso-seq dataset (4.4 M Iso-seq reads from all nine tissues;.
- Third, Illumina data used for error correction of Iso-seq reads in their study was obtained from a subset of tissues (8 tissues) with lower sequencing depth.
- Considering the high error rate of Iso-seq data, this design could increase the false positive rate for novel transcript detection.
- High base-coverage of selected single-exon transcripts by RNA-seq data implies that they are less likely to be gDNA.
- In-depth analysis of error-corrected long read Iso-seq data for nine porcine tissues provided evidence to improve the annotation of thousands of protein-coding and lncRNA genes.
- We provide direct evidence that the predicted novel genes and transcripts extended existing gene models, by verifying such extensions with independent ChIP-seq and 3′-RNA-seq experiments and human CAGE data.
- Overall, it can be concluded that the current public pig genome annotations (NCBI and Ensembl) are still far from complete and our new Iso-seq based annotation improves these annotations.
- Sequencing the transcriptomes of nine porcine tissues by using the PacBio Iso-seq and Illumina RNA-Seq.
- PacBio Iso-seq libraries were constructed per the PacBio Iso-seq protocol.
- Error-correction of PacBio Iso-seq full-length cDNA reads: the Reads of Insert (ROI) were determined by using ConsensusTools.sh in the SMRT-Analysis pipeline v2.0, with reads that were shorter than 300 bp and whose predicted accuracy was lower than 75% removed.
- Errors in the full-length, non-chimeric cDNA reads were corrected with the preprocessed RNA-Seq reads from the same tissue samples.
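
The Read of Insert filter described above can be illustrated with a few lines of Python. This is not the SMRT-Analysis implementation: the `ReadOfInsert` record and `filter_rois` function are hypothetical, and the sketch assumes a read is dropped when it fails either the 300 bp length cut-off or the 75% predicted accuracy cut-off.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ReadOfInsert:
    name: str
    sequence: str
    predicted_accuracy: float  # fraction between 0 and 1


def filter_rois(
    reads: Iterable[ReadOfInsert],
    min_length: int = 300,
    min_accuracy: float = 0.75,
) -> List[ReadOfInsert]:
    """Keep reads that pass both the length and the accuracy cut-off (assumed logic)."""
    return [
        read
        for read in reads
        if len(read.sequence) >= min_length and read.predicted_accuracy >= min_accuracy
    ]


# Toy reads, purely illustrative.
toy_reads = [
    ReadOfInsert("long_accurate", "A" * 500, 0.90),
    ReadOfInsert("short_accurate", "A" * 200, 0.95),
    ReadOfInsert("long_inaccurate", "A" * 800, 0.60),
]
kept = filter_rois(toy_reads)
```
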
- The collapsed transcripts from the different tissues were then merged using in-house Python scripts to create an Iso-seq based transcriptome annotation.
- Iso-seq transcripts were compared with annotated transcripts of Ensembl (release 93) and NCBI (release 109) by Gffcompare [53] and transcripts were classified into 10 groups based on their exon structures (splicing junctions).
- To reduce transcription noise, single tissue detected Iso-seq transcripts were required to have a minimum expression level of 0.1 FPKM (selected based on the inflection point of > 1 tissue detected Iso-seq transcripts, Fig.
- 3′-RNA-seq sample preparation.
- 3′-RNA-seq data analysis.
- 3′-RNA-seq reads uniquely mapped to the pig genome were used for downstream analysis.
- Relating reads to the extended 3′ end of annotated genes was performed using bedtools [59] so that 100% of mapped 3′-RNA-seq read length was covered by the exonic region of the extended 3′ end.
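
The 100% coverage requirement above can be expressed as a simple containment test. The sketch below is a plain-Python stand-in for illustration, not the bedtools command the authors ran; it assumes half-open genomic coordinates, treats each read as one contiguous interval (spliced reads are not handled), and uses hypothetical names.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # (start, end), half-open


def read_fully_in_exon(read: Interval, exons: Iterable[Interval]) -> bool:
    """True if the whole read lies inside a single exonic interval."""
    read_start, read_end = read
    if read_end <= read_start:
        return False  # empty or malformed read interval
    return any(
        exon_start <= read_start and read_end <= exon_end
        for exon_start, exon_end in exons
    )


# Toy exonic interval of an extended 3' end, purely illustrative.
extended_exons = [(1000, 1500)]
```

A read that only partially overlaps the extended exon is rejected, so only reads lying entirely within the extension count as support for it.
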
- (e) distribution of the number of exons per transcript.
- Expression analysis of transcripts detected in more than one tissue by Iso-seq data.
- Example of validation of novel intergenic Iso-seq gene using matched RNA-seq reads and independent liver ChIP-seq (H3K4me3 and H3K36me3) and 3′-RNA-seq experiments.
- Venn diagram of the number of liver detected Ensembl (a) and NCBI (b) genes with validated extended 3′ end across different samples of an independent liver 3′-RNA-seq experiment.
- Example of validation of extended 3′ annotation using an independent liver 3′-RNA-seq experiment.
- Effect of extended annotation on the expression level of Ensembl genes using liver 3′-RNA-seq reads.
- Genes with same expression in both Iso-seq and Ensembl annotations were marked with red color.
- Blue line in each graph shows the average of Iso-seq gene expression fold changes over their matched Ensembl genes in log2 scale, which is equal to 0.485 or a 40% expression increase.
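
The correspondence between a log2 fold change of 0.485 and a 40% increase follows from converting the log2 value back to a ratio. A minimal check, with a hypothetical helper name:

```python
def log2fc_to_percent_change(log2_fold_change: float) -> float:
    """Convert a log2 fold change into a percent change relative to the baseline."""
    return (2.0 ** log2_fold_change - 1.0) * 100.0


# 2 ** 0.485 is roughly 1.40, i.e. about a 40% increase over the Ensembl value.
percent_increase = log2fc_to_percent_change(0.485)
```
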
- PacBio Iso-seq sequence alignment statistics.
- 3′-RNA-seq sequence alignment statistics.
- Ensembl specific Iso-seq genes: Iso-seq genes that were found in the Ensembl gene set, but not in the NCBI gene set.
- Iso-seq: single-molecule long-read isoform sequencing.
- NCBI specific Iso-seq genes: Iso-seq genes that were found in the NCBI gene set but not in the Ensembl annotation gene set.
- RNA-seq: Illumina RNA sequencing.
- The authors would also like to thank Martine Shroyen for preparing RNA samples used to generate 3′-RNA-seq data.
- Error-corrected Iso-seq full-length cDNA reads and stranded RNA-seq data for nine porcine tissues are publicly available from the NCBI SRA under accession number PRJNA351265.
- ChIP-seq and 3′-RNA-seq data for porcine liver tissue are publicly available from the NCBI SRA under accession numbers PRJNA529704 and PRJNA529249, respectively.
- Our PacBio Iso-seq based transcriptome annotations, based on Ensembl Sscrofa11.1 (release 93) and NCBI Sscrofa11.1 (release 109), are publicly available in a GitHub repository (https://github.com/hamidbeiki/Porcine-PacBio).
- HL and NM performed the Iso-seq data error correction.
- A high utility integrated map of the pig genome.
- Mapping and quantifying mammalian transcriptomes by RNA-Seq.
- Identification of novel transcripts in annotated genomes using RNA-Seq.
- Full-length transcriptome assembly from RNA-Seq data without a reference genome.
- Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown.
- TopHat: discovering splice junctions with RNA-Seq.
- Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks.
