- RNA-QC-chain: comprehensive and fast quality control for RNA-Seq data.
- Background: RNA-Seq has become one of the most widely used applications of next-generation sequencing technology.
- However, raw RNA-Seq data may have quality issues, which can significantly distort analytical results and lead to erroneous conclusions.
- Currently, an accurate and complete QC of RNA-Seq data requires a suite of different QC tools used consecutively, which is inefficient in terms of usability, running time, file usage, and interpretability of the results.
- Results: We developed a comprehensive, fast and easy-to-use QC pipeline for RNA-Seq data, RNA-QC-Chain, which involves three steps: (1) sequencing-quality assessment and trimming; (2) filtering of internal (rRNA) and external (foreign species) contamination; and (3) alignment statistics reporting.
- This package was developed based on our previously reported tool for general QC of next-generation sequencing (NGS) data, called QC-Chain, with extensions specifically designed for RNA-Seq data.
- It has several features that are not yet available in other QC tools for RNA-Seq data, such as RNA sequence trimming, automatic rRNA detection and automatic identification of contaminating species.
- The three QC steps can run either sequentially or independently, making RNA-QC-Chain a comprehensive package with high flexibility and usability.
- The performance of RNA-QC-Chain has been evaluated with different types of datasets, including an in-house sequencing dataset, a semi-simulated dataset, and two real datasets downloaded from a public database.
- Comparisons of RNA-QC-Chain with other QC tools have demonstrated its advantages in both functional versatility and processing speed.
- Conclusions: We present here a tool, RNA-QC-Chain, which handles the quality control of RNA-Seq data comprehensively, effectively and efficiently.
- RNA-Seq has become a routinely and extensively applied approach for transcriptome profiling that relies on high-throughput sequencing (HTS) technologies, and it provides a far more profound and precise measurement at the transcript level than microarray and other traditional gene expression analysis methods [1].
- However, due to intrinsic limitations of HTS technologies and RNA-Seq protocols, quality problems are quite common in raw RNA-Seq data.
- “RNA-Seq-specific” quality issues, such as residual ribosomal RNA (rRNA), RNA degradation and varied read coverage.
- Therefore, before downstream analysis, raw RNA-Seq data must be checked and processed by quality control (QC) procedures to ensure accurate transcript measurements and correct conclusions drawn from the data.
- QC-Chain [5], NGS QC Toolkit [7].
- However, most of them focus mainly on trimming of general HTS data rather than on RNA-Seq-specific QC problems.
- Though some tools are designed specifically for RNA-Seq data, they suffer from different kinds of restrictions.
- Therefore, there is a pressing need for a new and powerful QC method for RNA-Seq data.
- Here we present RNA-QC-Chain, an easy-to-use, highly efficient and one-stop QC tool for RNA-Seq data.
- With both quality check and data processing capability, RNA-QC-Chain includes three related functional components, called Parallel-QC, rRNA-filter and SAM-stats.
- In addition to covering most types of quality assessments offered by currently available tools, RNA-QC-Chain can filter out poor-quality reads and contaminations, and generate ready-to-use data for downstream analysis.
- Notably, parallel computation is embedded in RNA-QC-Chain, which significantly accelerates processing and makes it an extremely fast QC tool.
- The workflow of RNA-QC-chain.
- RNA-QC-Chain has three sequential QC procedures, with parallel computation as the backbone to provide a complete and high-performance QC solution for RNA-Seq data (Fig.
- Fig. 1 The workflow and functions of RNA-QC-Chain.
- Finally, based on the results of read alignment (to a reference genome), multiple mapping metrics are provided by another embedded module, called SAM-stats, to evaluate the RNA-Seq data and experiment.
- Notably, duplicate reads in RNA-Seq data should not be removed, because this information is closely related to the calculation of RNA abundance.
- In RNA-QC-Chain, a tool called “rRNA-filter” was developed to extract rRNA reads and to identify both internal and external contaminations.
- Since the hidden Markov model (HMM) algorithm relies on the pattern of the rRNA sequences rather than on the annotation of the rRNA's source genome, RNA-QC-Chain removes rRNA fragments in an alignment-free and annotation-free manner.
- Based on the assignment of the classification terms, the taxonomic composition of the RNA-Seq data is produced, which indicates whether contaminating species are present and, if so, which species they are.
- Furthermore, using a script called “SAM-stats”, RNA-QC-Chain provides assessment profiles based on read alignment.
- Within the RNA-QC-Chain software package, the sequencing-quality trimming and contamination filtering steps can be performed on sequencing data in either FASTA or FASTQ format.
- Output formats depend on the specific steps of RNA-QC-Chain.
- Parallel computation was applied in the sequencing-quality trimming and contamination filtering steps of RNA-QC-Chain.
- We used four datasets to test the performance of RNA-QC-Chain (Table 1).
- Dataset 1 was a real, in-house sequenced RNA-Seq dataset from the alga Nannochloropsis.
- Dataset 2 was a semi-simulated dataset, combining real RNA-Seq data from a Sprague–Dawley rat sample with simulated contaminating reads from the yeast Saccharomyces cerevisiae.
- Dataset 3 and Dataset 4 were human RNA-Seq datasets, which were produced under the ENCODE project and downloaded from the Gene Expression Omnibus database (GEO accession number GSM958728), with data sizes of 9.6 Gb and 16.4 Gb, respectively.
- Compared with traditional technologies such as microarrays, RNA-Seq has higher productivity and better resolution; it has therefore become the mainstream approach for high-throughput, large-scale RNA-level studies.
- Therefore, quality control is an essential first step in the bioinformatics analysis of RNA-Seq data.
- RNA-Seq measures the abundance and structure of genes at the RNA level, and employs analytical approaches different from those for DNA-Seq data. Firstly, DNA is quite stable and DNA sequences are highly constant, while RNA is fragile and gene expression values are very dynamic.
- Secondly, all DNA sequences can be recovered when the sequencing depth is high enough, while sequencing bias may occur to a greater degree in RNA-Seq data.
- These differences place distinct and high demands on accurate QC of RNA-Seq data to ensure reliable subsequent analytical results.
- RNA-QC-Chain was developed based on our published QC tool called QC-Chain, which provides basic QC solutions for general HTS data.
- However, RNA-QC-Chain can identify all kinds of rRNA reads in the SILVA database, including 16S, 23S, 18S and 28S rRNA, and automatically remove them, while QC-Chain can only identify 16S or 18S rRNA and cannot remove the identified rRNA reads.
- Another difference is that RNA-QC-Chain can report alignment statistics, which are not available in QC-Chain.
- Therefore, RNA-QC-Chain has essential functional extensions that are specific to RNA-Seq data.
- Therefore, although a number of tools can perform this step, we integrated Parallel-QC into RNA-QC-Chain to make our pipeline a convenient, one-stop tool for RNA-Seq data QC.
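
As an illustration of what a sequencing-quality trimming step does, the following is a minimal Python sketch. It is not the algorithm used by Parallel-QC, which the text does not describe. The function name `trim_3prime`, the Phred+33 quality encoding and the threshold of 20 are assumptions chosen for the example.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove trailing bases whose Phred score is below min_q.

    seq    -- read sequence from a FASTQ record
    qual   -- quality string of the same length as seq
    min_q  -- assumed quality threshold (hypothetical default)
    offset -- ASCII offset of the quality encoding (Phred+33 assumed)
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq)
    # Walk back from the 3' end until a base meets the threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTAC", "IIII##")
print(trimmed_seq, trimmed_qual)
```

A real trimming step would typically also discard reads that become too short after trimming; that rule is left out here because the text does not state one.
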
- For RNA-Seq data, the contaminations can be classified into two types of.
- In extracted total RNA, up to 80–90% are rRNA sequences, thus a high quality RNA-Seq experiment requires an intact total RNA extraction and efficient.
- Fig. 2 Selected outputs of RNA-QC-Chain for an RNA-Seq dataset of Nannochloropsis (Dataset 1).
- In Dataset 2, we artificially added some yeast data as external contamination, but were not aware in advance that bacterial reads were also present in the downloaded rat RNA-Seq data.
- Therefore, even when the RNA-Seq data has a single end and a very.
- Fig. 3 Contamination identification for a semi-simulated RNA-Seq dataset (Dataset 2) using RNA-QC-Chain.
- On the other hand, our results demonstrated that the sequencing-quality trimming and contamination identification steps are necessary for the QC of RNA-Seq data, because either reads of low sequencing quality or contaminations may reduce the yield of usable data and thus compromise downstream analysis results.
- We should point out that RNA-QC-Chain has a limitation in external contamination identification: because foreign organisms are identified from rRNA annotation, identification fails in the very rare cases when the data contain no rRNA sequences.
- Read alignment to a reference genome is essential for most downstream analyses of RNA-Seq data.
- A quality check on the read alignment results can indicate how well the target RNA was captured, amplified and sequenced, and thus provides comprehensive insight into the quality of the RNA-Seq experiment and data.
- We took Dataset 1 as an example to demonstrate the performance of the alignment statistics reporting step of RNA-QC-Chain.
- Due to the innate features of RNA-Seq data, such as alternative splicing and different expression patterns, the metrics listed above may vary among RNA-Seq samples, and these assessments provide a global indication of the data status.
- RNA-QC-Chain is an easy-to-use and flexible tool.
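
To make the idea of alignment-based quality metrics concrete, the sketch below counts mapped and properly paired reads from the FLAG column of SAM records. It is only an illustration under stated assumptions: the function name `alignment_summary` and the choice of metrics are hypothetical, and the text does not specify how SAM-stats computes its own metrics. The FLAG bit meanings (0x1 paired, 0x2 properly paired, 0x4 unmapped, 0x100 secondary, 0x800 supplementary) come from the SAM format specification.

```python
def alignment_summary(sam_lines):
    """Summarize primary alignments from an iterable of SAM text lines."""
    total = mapped = properly_paired = 0
    for line in sam_lines:
        if line.startswith("@"):      # header line, not an alignment
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:          # SAM has 11 mandatory columns
            continue
        flag = int(fields[1])
        if flag & 0x900:              # skip secondary/supplementary records
            continue
        total += 1
        if not flag & 0x4:            # bit 0x4 set means the read is unmapped
            mapped += 1
        if flag & 0x1 and flag & 0x2:  # paired and mapped in a proper pair
            properly_paired += 1
    rate = mapped / total if total else 0.0
    return {"total": total, "mapped": mapped,
            "properly_paired": properly_paired, "mapping_rate": rate}


example = [
    "@HD\tVN:1.6",
    "r1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "r1\t147\tchr1\t200\t60\t4M\t=\t100\t-104\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t256\tchr1\t300\t0\t4M\t*\t0\t0\tACGT\tIIII",
]
summary = alignment_summary(example)
print(summary)
```

Counting only primary records avoids counting a multi-mapped read more than once, which matters for RNA-Seq data where multi-mapping is common.
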
- In RNA-QC-Chain, all analytical tasks are automatically distributed across threads with dynamic scheduling to balance the computing load, and the memory space shared among all threads also significantly reduces RAM usage.
- Compared with pipelines that integrate multiple software packages and depend on additional I/O operations for data transfer, the high-efficiency I/O strategy of RNA-QC-Chain significantly decreases the overall analysis time.
- Consequently, RNA-QC-Chain can rapidly process RNA-Seq data at massive scale.
- Comparisons with other QC tools for RNA-Seq data
- We compared RNA-QC-Chain with two other QC tools for RNA-Seq data, RNA-SeQC and RSeQC, which were developed with different techniques, offer different features, and have been widely used for RNA-Seq quality checking [17].
- RNA-QC-Chain is.
- Firstly, in functional terms, both RSeQC and RNA-QC-Chain can evaluate the sequencing quality of raw reads, but only RNA-QC-Chain can trim poor-quality sequences and output the trimmed reads.
- For contamination filtering, all three tools can estimate how many reads possibly originate from rRNA genes; however, RNA-QC-Chain is the only one capable of automatically removing them, while the other tools need the user to provide an rRNA reference file.
- In addition, neither RNA-SeQC nor RSeQC can identify contaminating foreign species in the data; RNA-QC-Chain is unique in this function.
- Therefore, when using RNA-SeQC and RSeQC, users have to turn to other data processing tools to filter out poor-quality reads and contaminations, whereas RNA-QC-Chain can directly produce usable data for further analysis.
- These measurements are based on reads per kilobase per million reads (RPKM) values, which are not necessary for all kinds of RNA-Seq analysis.
- Therefore, we did not include measurements based on RPKM values in RNA-QC-Chain.
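
For reference, RPKM normalizes a gene's read count by both gene length (in kilobases) and the total number of mapped reads (in millions). The short Python sketch below shows the standard formula; the function name `rpkm` and the example numbers are illustrative only and do not come from the paper.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase per million mapped reads.

    RPKM = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    The factor 1e9 combines the per-kilobase (1e3) and
    per-million-reads (1e6) scalings.
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads, 2,000 bp long, in a library of 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)
print(value)
```

Because the value depends on an annotated gene model and a completed expression count, it belongs to expression analysis rather than to raw-data QC, which is consistent with the decision described above to leave it out.
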
- For convenience and flexibility, RNA-QC-Chain contains three relatively independent QC components, each run with a single command line, to perform both quality check and data processing functions.
- For RNA-QC-Chain, no additional files are needed.
- In particular, for both RNA-SeQC and RSeQC,
- Table 2 Comparison of functions and features of different QC tools for RNA-Seq data (columns: RNA-QC-Chain, RSeQC, RNA-SeQC).
- In contrast, RNA-QC-Chain applies a different strategy with a built-in comprehensive rRNA database, making the prediction of rRNA reads more precise, more comprehensive and easier to operate.
- The output of RNA-QC-Chain presents different quality aspects of the data in different formats, including the quality-filtered sequence file in FASTA/FASTQ format that can be used directly for downstream analysis, the rRNA reads that were filtered out, an interactive graph in an HTML report showing the contamination information, and plots and text for the alignment metrics.
- RNA-QC-Chain was developed to meet this need for fast analysis.
- The running times of RNA-QC-Chain, RSeQC and RNA-SeQC were compared using datasets of different sizes.
- Benefiting from whole-process parallel scheduling, multi-thread memory sharing and C++ programming, RNA-QC-Chain ran approximately 7–13 times faster than the other tools (Fig.
- For example, for Dataset 3, RNA-QC-Chain took only about 6 min, while RSeQC and RNA-SeQC ran for more than 20 and 700 min, respectively (Fig.
- This speed demonstrates that RNA-QC-Chain can analyze very large datasets and large numbers of samples, which is essential for highly efficient bioinformatics analysis.
- RNA-QC-Chain provides a comprehensive, one-stop and highly efficient solution for RNA-Seq data QC, which would be very beneficial for knowledge discovery from RNA-Seq data.
- Comparisons with other QC tools indicated that RNA-QC-Chain outperformed them in both function and speed.
- This tool can serve as the first step in an RNA-Seq analysis pipeline, quickly providing data quality information and filtered reads ready for downstream analysis.
- Availability and requirements
- Project name: RNA-QC-Chain.
- Project home page: http://bioinfo.single-cell.cn/rna-qc-chain.html or http rna-qc-chain.html.
- Availability: RNA-QC-Chain, including source code, documentation, and examples, is freely available for non-commercial use with no restrictions at http://bioinfo.single-cell.cn/rna-qc-chain.html or http rna-qc-chain.html.
- Fig. 4 The running time of RNA-QC-Chain compared to RSeQC and RNA-SeQC using testing datasets.
- The datasets generated and/or analyzed during the current study are available at http://bioinfo.single-cell.cn/rna-qc-chain.html.
- RNA-Seq: a revolutionary tool for transcriptomics.
- QC-chain: fast and holistic quality control method for next-generation sequencing data.
- RSeQC: quality control of RNA-seq experiments.
- RNA-SeQC: RNA-seq metrics for quality control and process optimization.
- Comparison of RNA-Seq by poly (A) capture, ribosomal RNA depletion, and DNA microarray for expression profiling.
- A survey of best practices for RNA-seq data analysis.