
DoGFinder: A software for the discovery and quantification of readthrough transcripts from RNA-seq


Summary

- This novel phenomenon, initially identified from analysis of RNA-seq data, suggests intriguing new levels of gene expression regulation.
- Furthermore, the readthrough response to stress has thus far not been investigated outside of mammalian species, and the occurrence of readthrough in many physiological and disease conditions remains to be explored.
- Results: To facilitate a wider investigation into transcriptional readthrough, we created the DoGFinder software package, for the streamlined identification and quantification of readthrough transcripts, also known as DoGs (Downstream of Gene-containing transcripts), from any RNA-seq dataset.
- Using DoGFinder, we explore the dependence of DoG discovery potential on RNA-seq library depth, and show that stress-induced readthrough induction discovery is robust to sequencing depth, and input parameter settings.
- We further demonstrate the use of the DoGFinder software package on a new publicly available RNA-seq dataset, and discover DoG induction in human PME cells following hypoxia, a previously unknown readthrough-inducing stress type.
- for dozens of kilobases beyond gene ends, were highly induced, and interestingly, remained in the cell nucleus [1, 2].
- Yet, the mechanism of readthrough induction and its role in gene expression and control remain elusive.
- Therefore, the ability to readily identify and quantify DoGs may provide new insights and understanding of the readthrough phenomenon and its relationship to gene expression regulation.
- 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- their expression levels from RNA-seq data.
- This package can be run on any RNA-seq dataset, single or paired-end, strand-specific or unstranded, polyA selected or non-selected, and can be used to identify readthrough transcripts in different conditions and different organisms.
- DoGFinder was designed to identify and quantify the expression levels of readthrough transcripts, DoGs, for every annotated gene, given an RNA-seq bam file and gene annotation file(s).
- The DoGFinder package is written in Python, and uses samtools [5] and bedtools [6] to search for continuous RNA-seq read density downstream of gene ends.
- Given any RNA-seq bam file, and one or more compatible gene annotation file(s), DoGFinder searches, downstream of every gene end in an input annotation, for a region of continuous read density (Additional file 1: Figure S1A).
- Candidate DoGs are identified by requiring a minimal coverage minDoGCov over a minimal initial length minDoGLen downstream of the 3′ end of every gene locus.
- Upon initial identification of candidate DoGs, DoGFinder elongates DoGs in overlapping running windows, until read coverage drops below minDoGCov.
- DoGFinder can be used to discover DoGs from any RNA-seq dataset, and produce DoG annotation files in bed format.
- It can further merge or intersect DoG annotation files (see below), and finally, quantify the expression levels of DoGs.
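The identification and elongation steps described above can be sketched as follows. This is a minimal illustration, not the package's actual code: it assumes "coverage" means the fraction of bases covered by at least one read, and the function names, the per-base depth list, and the `window` and `step` sizes are all hypothetical.

```python
from typing import Optional, Sequence


def fraction_covered(depth: Sequence[int]) -> float:
    """Fraction of positions covered by at least one read (assumed definition)."""
    if not depth:
        return 0.0
    return sum(1 for d in depth if d > 0) / len(depth)


def find_dog_length(
    depth: Sequence[int],        # per-base read depth downstream of the gene's 3' end
    min_dog_len: int = 4000,     # minDoGLen
    min_dog_cov: float = 0.6,    # minDoGCov
    window: int = 200,           # hypothetical window size
    step: int = 50,              # hypothetical step; step < window gives overlapping windows
) -> Optional[int]:
    """Return the DoG length in bases, or None if there is no candidate DoG."""
    # Candidate test: enough coverage over the initial min_dog_len bases.
    if len(depth) < min_dog_len:
        return None
    if fraction_covered(depth[:min_dog_len]) < min_dog_cov:
        return None

    # Elongation: slide overlapping windows until coverage drops below the threshold.
    end = min_dog_len
    start = max(0, min_dog_len - window + step)
    while start + window <= len(depth):
        if fraction_covered(depth[start:start + window]) < min_dog_cov:
            break
        end = start + window
        start += step
    return end
```

In this sketch the DoG end is the end of the last window that still passed the coverage threshold, so the resolution of the reported length is set by `step`.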
- Get_loci_annotation: In order to make sure that the DoG discovery procedure only considers non-genic reads, which are not related to the expression of different alternative isoforms or other non-coding RNAs, the Get_loci_annotation function (Additional file 1:.
- Pre_Process: Before identifying DoGs, several pre-processing steps of the RNA-seq bam files are required (Additional file 1: Figure S1C).
- We note that the mapped RNA-seq bam files can be the result of any mapping program; however, they should be mapped to a genome, rather than a transcriptome.
- This function is applied to all RNA-seq bam files in the dataset, and allows the use of multiple cores for parallel processing.
- Pre_Process can be run once for each dataset, and both pre-processed (in case of paired-end input datasets) and down-sampled bam files are kept, in order to save runtime for subsequent steps and to allow users to run Get_DoGs multiple times with different parameter settings.
- Get_DoGs: Given a loci annotation bed file and a pre-processed RNA-seq bam file, the user can identify DoGs using the Get_DoGs function (Additional file 1:.
- The use of pre-processed bam files is required, especially in the case of paired-end data.
- First, the function removes all genic reads, and then identifies DoG candidates based on a minimal DoG length (minDoGLen), and minimal DoG coverage (minDoGCov).
- Fig. 1: Osmotic stress DoG induction discovery using DoGFinder is robust to RNA-seq library depth.
- Performance test results of mouse NIH3T3 cells paired-end strand specific RNA-seq data before and after osmotic stress (2 h of KCl).
- Additionally, DoG boundaries are limited by the location of the nearest 3′ neighboring gene in the genome, and genes with 3′ nearest neighbor closer than minDoGLen are discarded.
- When using stranded libraries, DoG boundary limitations are set in consideration with genes on the same strand, while in the case of unstranded libraries, neighboring genes are used to constrain DoG boundaries regardless of strand.
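The neighbor-based boundary rule above can be illustrated with a short sketch. It is an assumption-laden illustration rather than the package's implementation: the `Gene` record, the `downstream_limit` function, the BED-like 0-based coordinates, and the choice to ignore overlapping genes are all hypothetical.

```python
import math
from typing import NamedTuple, Optional, Sequence


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int   # 0-based start (BED-like, assumed)
    end: int     # exclusive end
    strand: str  # "+" or "-"


def downstream_limit(
    gene: Gene,
    genes: Sequence[Gene],
    stranded: bool,
    min_dog_len: int = 4000,
) -> Optional[float]:
    """Maximum DoG length allowed for `gene`.

    Returns None when the nearest 3' neighbor is closer than min_dog_len
    (the gene is discarded), and math.inf when there is no 3' neighbor.
    """
    gaps = []
    for other in genes:
        if other.chrom != gene.chrom:
            continue
        # Stranded libraries: only genes on the same strand constrain the DoG.
        if stranded and other.strand != gene.strand:
            continue
        # "Downstream" follows the gene's own strand.
        if gene.strand == "+":
            gap = other.start - gene.end
        else:
            gap = gene.start - other.end
        if gap >= 0:  # ignores the gene itself and overlapping genes
            gaps.append(gap)
    limit = min(gaps, default=math.inf)
    return None if limit < min_dog_len else limit
```

A DoG length found from read coverage would then be capped at the returned limit.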
- The default values of minDoGLen and minDoGCov are 4000 bases and 60% coverage, respectively, which we found to be suitable for polyA-selected RNA-seq libraries (see below), the major library type generated.
- However, we note that stricter minimal length and coverage parameters could be considered when using non-polyA selected RNA-seq libraries, as DoGs have been shown to remain nuclear [1, 2], and non-polyA selected RNA-seq libraries are often enriched with nuclear RNA.
- The Get_DoGs function outputs an annotation file of the identified DoGs in bed format.
- Get_DoGs_rpkm: Finally, given a bed format DoG annotation file and a pre-processed RNA-seq bam file, the Get_DoGs_rpkm function calculates DoG expression levels by a simple RPKM metric (Reads Per Kilobase per Million mapped reads).
- Strand information is taken into consideration if the RNA-seq libraries are stranded.
- Its output is a csv format tab-delimited text file that contains the DoG annotation, DoG length and DoG RPKM values (Additional file 1: Figure S1E).
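As a worked illustration of the RPKM metric named above, the standard formula can be written in a few lines. The function name and arguments are hypothetical; how the package counts reads per DoG is not shown here.

```python
def rpkm(read_count: int, length_bases: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase per Million mapped reads.

    RPKM = reads / (length in kb) / (total mapped reads in millions)
         = reads * 1e9 / (length in bases * total mapped reads)
    """
    if length_bases <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (length_bases * total_mapped_reads)
```

For example, 500 reads on a 5000-base DoG in a library of 20 million mapped reads gives 500 / 5 kb / 20 M = 5 RPKM.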
- DoGFinder demonstrates sensitivity and robustness of readthrough detection to sequencing depth and recapitulates readthrough identification from various inputs.
- To test the performance of DoGFinder, and assess its sensitivity to varying sequencing depths, we used our published nuclear-enriched rRNA-depleted strand-specific paired-end RNA-seq data from mouse NIH3T3 cells that were exposed to osmotic stress (KCl, 2 h) [1].
- We previously found that stress both induces DoGs in numerous genes and makes DoGs much longer.
- library depth, using the minimal DoG length and minimal DoG coverage parameters from our published study, namely minDoGLen of 4500 bases and minDoGCov of 80%.
- We note that, as this dataset is enriched in nuclear fractions, and is non-polyA selected, we used stricter minimal length and coverage parameters than the default values, which are more suitable for polyA-selected RNA-seq libraries (see below).
- We performed this test by first mapping the raw RNA-seq fastq files to the mouse genome using Tophat2 [9], and subsequently down-sampling the resulting bam files of untreated and osmotic stress (KCl) samples to various depths using samtools [5], and running DoGFinder.
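One way to down-sample a bam file to a target depth with samtools is `samtools view -s`, whose argument combines a random seed (integer part) and the fraction of reads to keep (fractional part). The sketch below only builds the command line; whether the authors used this exact command is an assumption, and the function name, file names, and seed are hypothetical.

```python
from typing import List


def downsample_command(
    in_bam: str,
    out_bam: str,
    total_reads: int,
    target_reads: int,
    seed: int = 42,  # hypothetical seed
) -> List[str]:
    """Build a `samtools view -s SEED.FRACTION` command for down-sampling."""
    if not 0 < target_reads < total_reads:
        raise ValueError("target_reads must be positive and below total_reads")
    fraction = target_reads / total_reads
    # samtools reads the integer part as the seed and the decimals as the fraction.
    subsample_arg = f"{seed + fraction:.6f}"
    return ["samtools", "view", "-b", "-s", subsample_arg, "-o", out_bam, in_bam]
```

The returned list can be passed to `subprocess.run(..., check=True)` on a machine where samtools is installed.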
- We found that the number of DoGs is dependent on the library depth (Fig.
- The finding that osmotic stress (KCl) induces readthrough in more genes, however, was robust to sequencing depth, and at any given coverage, DoGFinder identified around 2.7 fold more DoGs in osmotic stress vs.
- Interestingly, while the number of DoGs did not saturate even at a library size of 200 M reads, DoG length already reached saturation at a library depth of 25 M reads, but only for the untreated cells.
- Indeed, transcriptional readthrough is considered rare in untreated cells, and these untreated DoGs probably represent natural termination events that occur more than 4.5 kb downstream of the annotated transcript end.
- To further substantiate the sensitivity of DoGFinder, we tested its capabilities in identification of DoGs from another readthrough inducing condition, HSV infection, using a different type of input data, namely 4sU-labeled newly synthesized RNA [3].
- It has been previously reported that HSV infection of human fibroblasts induces transcriptional readthrough in thousands of genes, by examining reads mapped to the 5 kb region downstream of all human gene ends [3].
- We therefore used several different initial parameter settings to test the robustness of DoGFinder in discovering the induction in the number of DoGs following HSV infection, and characterized their typical lengths.
- Using DoGFinder we found that thousands of DoGs are indeed induced at 7-8 h post HSV infection, compared to only hundreds identified in control conditions (Additional file 1: Figure S3A).
- DoGFinder reveals induction of readthrough in response to hypoxia stress.
- To illustrate the DoGFinder package usage on a new dataset, we used a publicly available polyA-selected single-end RNA-seq dataset of human pulmonary microvascular endothelial cells (HPMECs) before and after hypoxic stress (48 h of hypoxia, GEO Accession GSE53510).
- We mapped the raw RNA-seq data using Tophat2 [9] to the human genome to obtain bam files, and ran Get_loci_annotation on RefSeq, Ensembl and UCSC annotation gtf files of the human genome (build hg19) to generate the most inclusive gene annotation file.
- We then asked whether hypoxia stress induces transcriptional readthrough, similarly to osmotic, oxidative and heat stress, and ran Get_DoGs with varying minimal DoG coverage and minimal DoG length parameter settings.
- Both the number of DoGs and the DoG average length were higher in hypoxia-treated compared to control cells for all minimal DoG coverage and minimal DoG length parameter combinations tested (Fig.
- Having found that readthrough induction in hypoxia is robust to parameter settings, we chose to proceed with a minimal DoG length of 4 kb and minimal DoG coverage of 60%.
- we found 508, 509 and 543 DoGs in the three control replicates, with 420 DoGs common to all three replicates (p <.
- and 742 and 1004 DoGs in the two hypoxia replicates, with an overlap of 688 DoGs (p using a hypergeometric p-value).
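The significance of an overlap between two replicate DoG lists can be estimated with a hypergeometric tail probability. The sketch below is a generic version for two lists; the population size (the number of genes that could have produced a DoG) is not given in the text above and must be supplied by the user, and the function name is hypothetical.

```python
from math import comb


def overlap_pvalue(population: int, size_a: int, size_b: int, overlap: int) -> float:
    """P(X >= overlap) for X ~ Hypergeometric(population, size_a, size_b).

    population: number of genes that could yield a DoG (assumed, user-supplied)
    size_a, size_b: number of DoGs in each replicate
    overlap: number of DoGs shared by both replicates
    """
    if max(size_a, size_b) > population:
        raise ValueError("list sizes cannot exceed the population size")
    upper = min(size_a, size_b)
    # Exact integer arithmetic avoids rounding error in the tail sum.
    numerator = sum(
        comb(size_a, i) * comb(population - size_a, size_b - i)
        for i in range(overlap, upper + 1)
    )
    return numerator / comb(population, size_b)
```

For very strong overlaps the result can be smaller than the smallest representable float and is then reported as 0.0; a log-scale implementation would be needed to report such values.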
- The DoG length distribution demonstrated that, similarly to other proteotoxic stress conditions, DoGs tend to get significantly longer under hypoxic stress (Fig.
- We then used Union_DoGs_annotation to merge the lists of DoGs that were common to all replicates of a given condition, in order to obtain a unified DoG annotation file containing all DoGs identified in both control and hypoxia-treated cells.
- We found that, similarly to other stresses, DoG expression levels were significantly higher in hypoxia compared to the control cells (Fig.
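The intersect-then-merge logic described above can be sketched on in-memory annotations keyed by gene name. This is only an illustration: the data layout, the function names, and in particular the rule of keeping the widest extent when the same gene's DoG differs between files are assumptions, not the documented behavior of Union_DoGs_annotation.

```python
from typing import Dict, List, Sequence, Tuple

Interval = Tuple[str, int, int, str]   # chrom, start, end, strand
Annotation = Dict[str, Interval]       # DoG interval keyed by gene name


def merge_intervals(intervals: Sequence[Interval]) -> Interval:
    """Combine DoG intervals of one gene by taking the widest extent (assumed rule)."""
    chrom, _, _, strand = intervals[0]
    return (chrom, min(i[1] for i in intervals), max(i[2] for i in intervals), strand)


def common_dogs(replicates: Sequence[Annotation]) -> Annotation:
    """Keep only genes that have a DoG in every replicate."""
    shared = set(replicates[0]).intersection(*replicates[1:])
    return {g: merge_intervals([r[g] for r in replicates]) for g in sorted(shared)}


def union_dogs(annotations: Sequence[Annotation]) -> Annotation:
    """Merge annotations from several conditions into one unified annotation."""
    by_gene: Dict[str, List[Interval]] = {}
    for annotation in annotations:
        for gene, interval in annotation.items():
            by_gene.setdefault(gene, []).append(interval)
    return {g: merge_intervals(ivs) for g, ivs in by_gene.items()}
```

A unified annotation for two conditions would then be `union_dogs([common_dogs(control_replicates), common_dogs(hypoxia_replicates)])`, where both replicate lists are hypothetical inputs.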
- DoGFinder run on RNA-seq data of HPMECs before and after hypoxia stress (GEO accession GSE53510), with various initial parameter settings, shows that hypoxia generates a larger number of DoGs (a), which also become longer (b).
- To the best of our knowledge, DoGFinder is the first tool specifically designed to identify and quantify readthrough transcription, whereas most standard RNA-seq quantification methods are based on existing annotations [13], and would thus not identify DoGs.
- Due to the abundance of RNA-seq in biomedical research, scientists can use DoGFinder to ask whether transcriptional readthrough occurs in any condition of interest, and identify the genes subject to readthrough, readthrough levels and lengths.
- Readthrough can often be identified from polyA-selected RNA-seq data, as we have previously shown for heat shock in mouse [1], and as demonstrated above for the human hypoxia dataset (Figs.
- Nevertheless, non-polyA selected RNA-seq data often reveals a broader picture of DoG prevalence, length and expression.
- DoGFinder can run on any RNA-seq data of any sequenced genome.
- However, the discovery of readthrough relies on the availability of good transcript annotations.
- Osmotic stress had the largest number of DoGs identified: over 92% of the entire set, suggesting that this number, representing about 60% of expressed genes in NIH3T3 cells, is close to the actual percentage of stress-induced DoGs in mammalian genomes.
- However, since these are three proteotoxic stress conditions, we wondered if comparison between osmotic stress DoGs and HSV infection DoGs would reveal a more elaborate picture.
- Comparing osmotic stress DoGs from human SK-N-BE(2) C cells [2] with HSV-induced DoGs from HFF cells [3] using DoGFinder with minDoGLen of 4000 bases and minDoGCov of 80% (DoGs common to both replicates in each condition) showed that 30% of the ~2200 DoGs discovered were unique to HSV compared to osmotic stress.
- Furthermore, out of the 641 genes associated with these HSV-exclusive DoGs, only a third were not expressed in the SK-N-BE(2) C cells.
- Nevertheless, the exact role of DoGs in regulation of gene expression is still unresolved.
- DoGFinder can now provide streamlined identification and quantification of readthrough transcripts from any RNA-seq data, which will facilitate further research into the causes and consequences of transcriptional readthrough.
- For a single parameter setting, minimal DoG length of 4000 bases and 60% coverage: (a) the DoG length distribution shows significantly longer DoGs in hypoxia (p ranksum test); (b) DoG expression levels are significantly higher in hypoxia compared to control cells (p ranksum test).
- quantification of readthrough transcripts from any type of RNA-seq data, in any sequenced organism.
- The results of DoGFinder can provide: (1) new insights to existing RNA-seq datasets, as demonstrated for HPMECs subject to hypoxic stress.
- Additionally, the study of DoGs can shed new light on different transcription and polyadenylation regulatory programs that act to shape the transcriptome of various cell types and organisms under changing environments.
- Illustrations of the DoGFinder package workflow and key functions.
- (A) DoGFinder workflow as detailed in the Implementation section, and the documentation on Github.
- (B-E) illustrations of the methodology and logic of the main DoGFinder functions.
- DoGFinder identification of osmotic stress-induced readthrough is robust to library depth.
- DoGFinder run on untreated and osmotic stress (KCl) treated NIH3T3 RNA-seq data [1] which were downsampled to varying depths using samtools.
- DoGFinder found on average 2.7 fold more DoGs in osmotic stress than in untreated cells, regardless of the library depth.
- DoGFinder run on HSV-infected human cell 4sU-labeled RNA-seq dataset (7-8 h and control) from [3] illustrates the robustness of readthrough finding at 7-8 h post infection vs.
- control with respect to: (A) number of DoGs discovered (B) DoG length increase.
- (A) We found 508, 509 and 543 DoGs in the control replicates, with an overlap of 420 DoGs ( p <.
- Steitz and her lab for critical reading of the manuscript.
- The funding body had no role in the design of the study, collection, analysis, and interpretation of data, or in writing the manuscript.
- The osmotic stress NIH3T3 RNA-seq data can be found in GEO, accession no.
- The hypoxia RNA-seq data was downloaded from GEO, accession no.
- All authors read and approved the final version of the manuscript.
- RSeQC: quality control of RNA-seq experiments.
- Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation.
- Evaluation and comparison of computational tools for RNA-seq isoform quantification
