
RNA-QC-chain: Comprehensive and fast quality control for RNA-Seq data



- Background: RNA-Seq has become one of the most widely used applications based on next-generation sequencing technology.
- However, raw RNA-Seq data may have quality issues, which can significantly distort analytical results and lead to erroneous conclusions.
- Currently, an accurate and complete QC of RNA-Seq data requires a suite of different QC tools used consecutively, which is inefficient in terms of usability, running time, file usage, and interpretability of the results.
- Results: We developed a comprehensive, fast and easy-to-use QC pipeline for RNA-Seq data, RNA-QC-Chain, which involves three steps: (1) sequencing-quality assessment and trimming; (2) contamination filtering; and (3) alignment statistics reporting.
- This package was developed based on our previously reported tool for general QC of next-generation sequencing (NGS) data called QC-Chain, with extensions specifically designed for RNA-Seq data.
- It has several features that are not available yet in other QC tools for RNA-Seq data, such as RNA sequence trimming, automatic rRNA detection and automatic contaminating species identification.
- The three QC steps can run either sequentially or independently, enabling RNA-QC-Chain as a comprehensive package with high flexibility and usability.
- The performance of RNA-QC-Chain was evaluated with different types of datasets, including an in-house sequencing dataset, a semi-simulated dataset, and two real datasets downloaded from a public database.
- Comparisons of RNA-QC-Chain with other QC tools demonstrated its advantages in both functional versatility and processing speed.
- Conclusions: We present here a tool, RNA-QC-Chain, which can be used to handle the quality control of RNA-Seq data comprehensively, effectively and efficiently.
- RNA-Seq has become a routinely and extensively applied approach for transcriptome profiling. It relies on high-throughput sequencing (HTS) technologies and provides a far more profound and precise measurement at the transcript level than microarrays and other traditional gene expression analysis methods [1].
- However, due to intrinsic limitations of HTS technologies and RNA-Seq protocols, quality problems are quite common in raw RNA-Seq data.
- “RNA-Seq-specific” quality issues, such as ribosomal RNA (rRNA) residual, RNA degradation and varied read coverage.
- Therefore, before downstream analysis, raw RNA-Seq data must be checked and processed by quality control (QC) procedures to ensure accurate transcript measurements and correct conclusions from the data.
- QC-Chain [5], NGS QC Toolkit [7].
- However, most of them focus mainly on trimming general HTS data rather than on RNA-Seq-specific QC problems.
- Although some tools are designed specifically for RNA-Seq data, they suffer from various restrictions.
- Therefore, there is a pressing need for a new and powerful QC method for RNA-Seq data.
- Here we present RNA-QC-Chain, an easy-to-use, highly efficient and one-stop QC tool for RNA-Seq data.
- With both quality-check and data-processing capabilities, RNA-QC-Chain includes three related functional components, called Parallel-QC, rRNA-filter and SAM-stats.
- In addition to covering most types of quality assessment offered by currently available tools, RNA-QC-Chain can filter out poor-quality reads and contaminations, and generate ready-to-use data for downstream analysis.
- Notably, parallel computation is embedded in RNA-QC-Chain, which significantly accelerates processing and makes it an extremely fast QC tool.
- The workflow of RNA-QC-chain.
- RNA-QC-Chain has three sequential QC procedures, with parallel computation as the backbone to provide a complete and high-performance QC solution for RNA-Seq data (Fig.
- Fig. 1 The workflow and functions of RNA-QC-Chain.
- Finally, based on the results of read alignment to a reference genome, multiple mapping metrics are provided by another embedded module, SAM-stats, to evaluate the RNA-Seq data and experiment.
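
As a rough illustration of alignment-based metrics of this kind, the Python sketch below counts mapped, unmapped and properly paired reads from SAM records using the bit flags defined in the SAM format specification. It is not the SAM-stats implementation: the function names `summarize_sam` and `mapping_rate`, and the choice of metrics, are assumptions made for this example.

```python
from collections import Counter

# Bit flags defined by the SAM format specification.
FLAG_PAIRED = 0x1
FLAG_PROPER_PAIR = 0x2
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_SUPPLEMENTARY = 0x800


def summarize_sam(lines):
    """Count basic mapping metrics from an iterable of SAM text lines.

    Hypothetical helper for illustration; not part of RNA-QC-Chain.
    """
    stats = Counter()
    for line in lines:
        # Skip blank lines and header lines (which start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # column 2 of a SAM record is the FLAG
        # Count each read once: ignore secondary and supplementary records.
        if flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY):
            continue
        stats["total"] += 1
        if flag & FLAG_UNMAPPED:
            stats["unmapped"] += 1
        else:
            stats["mapped"] += 1
            if flag & FLAG_PAIRED and flag & FLAG_PROPER_PAIR:
                stats["properly_paired"] += 1
    return dict(stats)


def mapping_rate(stats):
    """Fraction of primary records that are mapped (0.0 if there are none)."""
    total = stats.get("total", 0)
    return stats.get("mapped", 0) / total if total else 0.0
```

In practice the lines would come from an open SAM file, for example `summarize_sam(open(path))`, where `path` is a placeholder for an alignment file produced by any read aligner.
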
- Notably, duplicated reads in RNA-Seq data should not be removed, because this information is closely related to the calculation of RNA abundance.
- In RNA-QC-Chain, a tool called “rRNA-filter” was developed to extract rRNA reads and to identify both internal and external contaminations.
- Because the HMM algorithm relies on the sequence pattern of rRNA rather than on the annotation of the source genome, RNA-QC-Chain removes rRNA fragments without requiring alignment or annotation.
- Based on the assigned classification terms, the taxonomic composition of the RNA-Seq data was produced, which indicated whether contaminating species were present and, if so, which species they were.
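
The last step, turning per-read classifications into a contamination call, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the taxon labels are assumed to be already available per rRNA read, and the helper names `taxonomic_composition` and `flag_contaminants`, as well as the `min_fraction` threshold, are hypothetical and not taken from RNA-QC-Chain.

```python
from collections import Counter


def taxonomic_composition(assignments):
    """Return the fraction of classified reads assigned to each taxon."""
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: n / total for taxon, n in counts.items()}


def flag_contaminants(composition, expected_taxa, min_fraction=0.01):
    """List unexpected taxa whose share reaches min_fraction, largest first.

    min_fraction is an arbitrary illustrative cutoff, not a published value.
    """
    unexpected = [
        taxon
        for taxon, fraction in composition.items()
        if taxon not in expected_taxa and fraction >= min_fraction
    ]
    return sorted(unexpected, key=lambda t: composition[t], reverse=True)
```

For a rat sample, `expected_taxa` would contain only the rat genus, so any yeast or bacterial taxon above the cutoff would be reported as a candidate contaminant.
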
- Furthermore, using a script called “SAM-stats”, RNA-QC-Chain provides assessment profiles based on read alignment.
- Within the RNA-QC-Chain software package, the sequencing-quality trimming and contamination filtering steps can be performed on sequencing data in either FASTA or FASTQ format.
- Output formats depend on the specific steps of RNA-QC-Chain.
- Parallel computation optimization was applied to the sequencing-quality trimming and contamination filtering steps of RNA-QC-Chain.
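
To make the idea of distributing quality trimming across workers concrete, here is a minimal Python sketch. It is not how Parallel-QC is implemented (the tool itself is written in C++ and its trimming rule is not described here). The sketch assumes Phred+33 quality encoding, a simple rule that removes low-quality bases from the 3' end, and hypothetical names (`trim_3prime`, `process_read`, `trim_reads`) and thresholds.

```python
from concurrent.futures import ThreadPoolExecutor

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ quality encoding


def trim_3prime(seq, qual, min_q=20):
    """Drop trailing bases whose Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - PHRED_OFFSET < min_q:
        end -= 1
    return seq[:end], qual[:end]


def process_read(read, min_q=20, min_len=3):
    """Trim one (name, sequence, quality) record; return None if too short."""
    name, seq, qual = read
    seq, qual = trim_3prime(seq, qual, min_q)
    return (name, seq, qual) if len(seq) >= min_len else None


def trim_reads(reads, workers=4, min_q=20, min_len=3):
    """Trim reads on a pool of worker threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda r: process_read(r, min_q, min_len), reads)
        return [r for r in results if r is not None]
```

Because of Python's global interpreter lock, threads here only illustrate how tasks are handed out to workers; a real speed-up for this CPU-bound step would need processes or native code.
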
- We used four datasets to test the performance of RNA-QC-Chain (Table 1).
- Dataset 1 was a real RNA-Seq dataset of the alga Nannochloropsis, sequenced in-house.
- Dataset 2 was a semi-simulated dataset that combined real RNA-Seq data from a Sprague–Dawley rat sample with simulated contaminating reads from the yeast Saccharomyces cerevisiae.
- Dataset 3 and Dataset 4 were human RNA-Seq datasets produced under the ENCODE project and downloaded from the Gene Expression Omnibus database (GEO accession number GSM958728), with data sizes of 9.6 Gb and 16.4 Gb, respectively.
- Compared with traditional technologies such as microarrays, RNA-Seq has higher productivity and better resolution; it has therefore become the mainstream approach for high-throughput, large-scale RNA-level studies.
- Therefore, quality control is an essential first step in the bioinformatics analysis of RNA-Seq data.
- RNA-Seq measures the abundance and structure of genes at the RNA level and requires different analytical approaches from those used for DNA-Seq data. Firstly, DNA is quite stable and DNA sequences are highly constant, whereas RNA is fragile and gene expression values are very dynamic.
- Secondly, all DNA sequences can be recovered when the sequencing depth is high enough, whereas sequencing bias may be more pronounced in RNA-Seq data.
- These different features place distinct and stringent demands on the QC of RNA-Seq data to ensure highly reliable downstream analytical results.
- RNA-QC-Chain was developed based on our published QC tool QC-Chain, which provides basic QC solutions for general HTS data.
- However, RNA-QC-Chain can identify all kinds of rRNA reads in the SILVA database, including 16S, 23S, 18S and 28S rRNA, and automatically remove them, whereas QC-Chain can only identify 16S or 18S rRNA and cannot remove the identified rRNA reads.
- Another difference is that RNA-QC-Chain can perform the alignment statistics, which are not available in QC-Chain.
- Therefore, RNA-QC-Chain has essential functional extensions that are specific to RNA-Seq data.
- Therefore, although a number of tools can perform this step, we integrated Parallel-QC into RNA-QC-Chain to make our pipeline a convenient, one-stop tool for RNA-Seq data QC.
- For RNA-Seq data, the contaminations can be classified into two types of.
- In extracted total RNA, up to 80–90% are rRNA sequences, thus a high quality RNA-Seq experiment requires an intact total RNA extraction and efficient.
- Fig. 2 Selected outputs of RNA-QC-Chain for an RNA-Seq dataset of Nannochloropsis (Dataset 1).
- In Dataset 2, we artificially added yeast data as external contamination, but we were not aware in advance that bacterial reads were also included in the downloaded rat RNA-Seq data.
- Therefore, even when the RNA-Seq data has a single end and a very.
- Fig. 3 Contamination identification for a semi-simulated RNA-Seq dataset (Dataset 2) using RNA-QC-Chain.
- On the other hand, our results demonstrated that the sequencing-quality trimming and contamination identification steps are necessary and important for the QC of RNA-Seq data, because either reads of low sequencing quality or contaminations may result in a poor yield of usable data and thus might compromise downstream analysis results.
- We should point out that RNA-QC-Chain has a limitation in external contamination identification in the very rare cases in which no rRNA sequences are present in the data, since the identification of foreign organisms is based on rRNA annotation.
- Read alignment to a reference genome is essential for most downstream analyses of RNA-Seq data.
- A quality check of the read alignment results can indicate how well the target RNA was captured, amplified and sequenced, and thus provides comprehensive insight into the quality of the RNA-Seq experiment and data.
- We took Dataset 1 as an example to demonstrate the performance of the alignment statistics reporting step of RNA-QC-Chain.
- Owing to innate features of RNA-Seq data, such as alternative splicing and different expression patterns, the indexes listed above may vary among RNA-Seq samples, and these assessments can provide a global indication of the data status.
- RNA-QC-Chain is an easy-to-use and flexible tool.
- In RNA-QC-Chain, all analytical tasks are automatically distributed to different threads with dynamic scheduling to optimize the computational load balance, and the memory space shared among all threads also significantly reduces RAM usage.
- Compared with other pipelines that integrate multiple software packages and depend on additional I/O operations for data transfer, the highly efficient I/O strategy of RNA-QC-Chain significantly decreases the overall analysis time.
- Consequently, RNA-QC-Chain can rapidly process massive amounts of RNA-Seq data.
- Comparisons with other QC tools for RNA-Seq data.
- We compared RNA-QC-Chain with two other QC tools for RNA-Seq data, RNA-SeQC and RSeQC, which were developed using different techniques, offer different features, and have been widely used for RNA-Seq quality checking [17].
- RNA-QC-Chain is.
- Firstly, in functional terms, both RSeQC and RNA-QC-Chain can evaluate the sequencing quality of raw reads, but only RNA-QC-Chain can trim poor-quality sequences and output the trimmed reads.
- For contamination filtering, all three tools can estimate how many reads possibly originate from rRNA genes. However, RNA-QC-Chain is the only one capable of removing them automatically, whereas the other tools need the user to provide an rRNA reference file.
- In addition, neither RNA-SeQC nor RSeQC can identify contaminating foreign species in the data; RNA-QC-Chain is unique in this function.
- Therefore, when using RNA-SeQC and RSeQC, users have to turn to other data processing tools to filter poor-quality reads and contaminations, whereas RNA-QC-Chain can directly produce usable data for further analysis.
- These measurements are based on reads per kilobase per million mapped reads (RPKM) values, which are not necessary for all kinds of RNA-Seq analysis.
- Therefore, we did not include measurements based on RPKM values in RNA-QC-Chain.
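
For readers unfamiliar with the unit, RPKM normalizes a gene's read count by both gene length (in kilobases) and sequencing depth (in millions of mapped reads). The short Python sketch below shows the standard formula; the function name `rpkm` and its argument names are chosen for this example and do not come from any of the tools discussed.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = read_count * 1e9 / (gene_length_bp * total_mapped_reads),
    i.e. read_count / (length in kb) / (mapped reads in millions).
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)
```

For instance, 500 reads on a 2,000 bp gene in a library of 10 million mapped reads gives 500 / 2 kb / 10 M = 25 RPKM.
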
- For convenience and flexibility, RNA-QC-Chain contains three relatively independent QC components, each run with a single command line, to perform both quality-check and data-processing functions.
- For RNA-QC-Chain, no additional files are needed.
- In particular, for both RNA-SeQC and RSeQC,
- Table 2 Comparison of functions and features of different QC tools for RNA-Seq data (columns: RNA-QC-Chain, RSeQC, RNA-SeQC).
- In contrast, RNA-QC-Chain applies a different strategy with a built-in comprehensive rRNA database, making the prediction of rRNA reads more precise, comprehensive and easy to operate.
- The output of RNA-QC-Chain presents different quality aspects of the data in different formats, including the quality-filtered sequence file in FASTA/FASTQ format that can be directly used for downstream analysis, the rRNA reads that were filtered out, an active graph in an HTML report showing the contamination information, and plots/text for the alignment metrics.
- RNA-QC-Chain was developed to meet this need for fast analyses.
- The running times of RNA-QC-Chain, RSeQC and RNA-SeQC were compared using datasets of different sizes.
- Benefiting from whole-process parallel scheduling, multi-thread memory sharing and C++ programming, RNA-QC-Chain ran approximately 7–13 times faster than the other tools (Fig.
- For example, for Dataset 3, RNA-QC-Chain took only about 6 min, while RSeQC and RNA-SeQC ran for more than 20 and 700 min, respectively (Fig.
- This high running speed demonstrated the capability of RNA-QC-Chain to analyze very large datasets and large numbers of samples, which is essential for highly efficient bioinformatics analysis.
- RNA-QC-Chain provides a comprehensive, one-stop and highly efficient solution for RNA-Seq data QC, which would be very beneficial for knowledge discovery from RNA-Seq data.
- Comparisons with other QC tools indicated that RNA-QC-Chain outperformed them in both functionality and speed.
- This tool can be used as the first step in an RNA-Seq analysis pipeline to quickly provide data quality information and filtered reads ready for downstream analysis.
- Availability and requirements.
- Project name: RNA-QC-Chain.
- Project home page: http://bioinfo.single-cell.cn/rna-qc-chain.html or http rna-qc-chain.html.
- Availability: RNA-QC-Chain, including source code, documentation, and examples, is freely available for non-commercial use with no restrictions at http://bioinfo.single-cell.cn/rna-qc-chain.html or http rna-qc-chain.html.
- Fig. 4 The running time of RNA-QC-Chain compared to RSeQC and RNA-SeQC using testing datasets.
- The datasets generated and/or analyzed during the current study are available at http://bioinfo.single-cell.cn/rna-qc-chain.html.
- RNA-Seq: a revolutionary tool for transcriptomics.
- QC-chain: fast and holistic quality control method for next-generation sequencing data.
- RSeQC: quality control of RNA-seq experiments.
- RNA-SeQC: RNA-seq metrics for quality control and process optimization.
- Comparison of RNA-Seq by poly(A) capture, ribosomal RNA depletion, and DNA microarray for expression profiling.
- A survey of best practices for RNA-seq data analysis
