
Strategies for detecting and identifying biological signals amidst the variation commonly found in RNA sequencing data


Summary

- STRING database analysis of the identified genes characterizes gene affiliations involved in physiological regulatory networks that contribute to biological variability.
- Control data from an in-house data set and two archived samples revealed that 65–70% of the sequenced genes displayed trendlines with minimal variation and dispersion across the sample group after rank-ordering the samples.
- STRING database analysis of these genes identified interferon-mediated response networks in 11–20% of the individuals sampled at the time of blood collection.
- The identification of highly variable genes and their network associations within specific individuals empowers more judicious inspection of the sample group prior to differential gene expression analysis.
- Full list of author information is available at the end of the article.
- More broadly, untangling the impact of variability on each step of the RNA-seq pipeline is difficult.
- They determined that scaling normalizations performed better than other strategies because they removed the dependence of the metabolites' initial ranking on the magnitude of a quantitative response.
- Utilizing trendline analysis, we determined that 65–70% of the genes in our control data set follow a linear relationship with minimal variance when the genes were scaled and rank-ordered.
- When genes displaying the most variable and dispersed trendline expression patterns were evaluated with the STRING database [13–15], distinct biological regulatory pathways were identified in some individuals, thereby providing an explanation for some of the variability in the sample group.
- STRING-db analysis of genes displaying the most variable and dispersed trendlines identified a prominent network of interferon-stimulated genes in 11–20% of the individuals in our control sample and two archived control data sets.
- The breadth of the box illustrates the degree of count dispersion across the 35 data points for each gene.
- (a) Box plots of sequencing counts for five genes (INTS6, AKAP13, KCNJ2, IFIT3 and EIF1AY) depicting increasing levels of sample dispersion, with computed coefficient of variation values ranging from 17.9 to 171.2% of the unadjusted TPM gene counts (mean ± 1 SD).
- identifying their inflection point(s) and providing an estimate of the relative change in gene expression based on the computed slope ratio change.
- A coefficient of variation (CV) of 17.9% and a coefficient of determination (R²) of 0.9498 further support the linear profile of the INTS6 trendline.
- While these genes showed slightly more dispersion across the 35 samples (Panels a and b, with CV values of 25.26 and 29.62% and R² values of 0.8499 and 0.8418, respectively), rank-ordering the counts revealed more complex trendlines where the slope of the line for samples in quartiles 1 and/or 4 deviated from the slope of the line for samples in quartiles 2 plus 3 (Fig.
- The observed trendline profiles illustrate how rank-ordering of RNA sequencing counts can identify marked changes in gene expression variability among some of the 8746 protein-coding genes identified in our study.
- 70% of the 8000 to 10,000 evaluated genes (3 data sets) displayed trendlines where the incremental difference in gene expression across the group followed a linear pattern resulting in R² values that were ≥ 0.9 (e.g.
- Under ideal conditions with minimal within-sample variation, one might expect all of the sequenced genes in the control sample to follow this linear pattern, but this is not the case.
- however, the unique incremental sample-to-sample gene expression relationship of the 35 rank-ordered samples was maintained irrespective of the trendline profile (Fig.
- Calculations were computed for each of the 8746 genes and the results were ranked in descending order (Additional file 2, sheet 6).
- When the unadjusted gene counts were used for these calculations, parameters that measure the relative magnitude of the.
- all select highly expressed genes in Biological GO pathways associated with protein synthesis and targeting proteins to different areas of the cell (Panel 4A vs 4B).
- Therefore, the type of measurement used for gene trendline characterization prior to STRING-db analysis impacts pathway selection if the heteroscedastic nature of the raw counts is not addressed prior to pathway analysis.
- Although all three parameters demonstrated proficiency in selecting genes with “tailing” profiles, only 8 of the top 10 pathways were identical among the three calculations, and 7–14% fewer total genes were identified when either kurtosis or range/median measures were employed.
- Changes in the order of the top 10 identified pathways were driven by the number of known genes in a designated pathway and the selected measure used to identify the pathway-related genes in the sample.
- The identification of the top 300 computed trendline values, as outlined above, was also used to evaluate gene.
- These pathways involve thousands of genes and, due to the size of the pathways, much lower FDRs were observed (e.g.
- The application of the MVA scaling reduced heteroscedasticity as previously noted [9] while preserving important sample-to-sample incremental changes that contributed to the rank-ordered trendline profiles.
- To limit the size of the correlation matrix (> 78 × 10⁶ values) to a more discernible number of terms, estimated values for the highest correlation and anticorrelation ranges were used to provide a count of the number of genes displaying correlation values >.
- STRING-db analysis of the most highly correlated genes within the entire data set identified gene pathways that were activated in response to virus exposure.
- It is important to emphasize that the elevated level of gene expression of these 7 genes is confined to specific individuals in the sample group, and the non-random nature of the response is unlikely to be due to methodological variability.
- Table 1 provides an abbreviated summary of the results.
- STRING database analysis of the 13 genes found to be highly correlated (r with the IFIT3 gene.
- Eight of the highlighted genes (red, blue and green) form statistically significant groupings with False Discovery Rates ranging from E−17 to E−21 that may collectively integrate the activity of all three pathways.
- The analysis of the 35 control samples identified 6 individuals, or 17% of the sample group, with genes displaying marked “tailedness”.
- 0051607) which was identified in 4 of the 6 individuals (11.
- A Venn Plot of the genes identified in all three data sets (e.g.
- In panels a (35 in-house Controls), b (9 Controls, [24]) and c (12 Controls, [25]), the interferon-induced IFI44L and ISG15 genes were specifically elevated in approximately 12% of the individuals (gene expression levels >.
- until all of the samples received a positional gene assignment ranking (see Additional file 6).
- The positionally ranked genes were evaluated to determine if any genes with range/median, range/Q3, kurtosis and Q4/Q(2 + 3) slope values were within a group of the top 300 genes previously identified for each of the selection parameters.
- For example, in a list of 1000 positionally ranked genes, only genes with a range/median value ≥ the computed value of the 300th gene would be identified.
- A detailed summary of the results is presented in Additional files 6, 8 and 9.
- Employing the 4 parameters used in our previous study (Table 1), a list of positionally ranked genes was assembled with each of the screening parameters.
- STRING-db analysis of the positionally ranked genes identified one individual in Control group b (sample 7, Fig.
- 3 demonstrate that positional rank analysis identified from 11 to 17% of the individuals in the three control data sets with gene associations representing virus-activated immune pathways.
- 20% of the surveyed samples contained individuals in which viral-induced immune pathways were identified.
- The individuals identified as undergoing viral-induced immune responses significantly impact gene expression levels in the pathways that were identified by our analysis, thereby increasing the biological variability of the control sample groups.
- In three separate RNA-seq studies, we noted that 65–70% of the genes follow linear trendline profiles with R² values ≥ 0.9.
- This normalization was performed by calculating gene expression ratios for each of the 35 samples and evaluating the degree of count dispersion across samples when the counts are expressed in relation to a known stable gene.
- When the highly correlated and variable expression profiles of the IFIT3 and IFI44L genes, previously identified in Figs.
- Gene trendline linearity was independent of the level of gene expression.
- However, sample 7 did not display similar deviations in any of the four genes depicted in the left panel of Fig.
- This analysis demonstrates that the relative magnitude of the gene ratio responses identified in samples 9, 6 and 12 was much larger than the 2-fold range of sample-to-sample variation observed for 65–70% of the sequenced genes as depicted in panel a of Fig.
- Stable genes identified in three data sets were used to normalize gene expression (see Additional file 10).
- In panel a, the CBX3/ATG3 gene ratios of two highly stable genes are plotted for each of the 35 samples in our study.
- In contrast, when two of the interferon-regulated genes were normalized in relation to the ATG3 gene (panels b and c), samples 9, 6 and 12 were highlighted with ratios 2- to 5-fold higher than noted in panel a.
- RNA sequencing counts routinely display large differences in their relative gene expression levels, which scale with the mean of the sequenced counts.
- In our Control sample, MVA of the 8746 genes reduced the mean and standard deviation by an average of 3.9-fold.
- We note that while MVA-scaled data is suitable for trendline analysis, it is important to follow the correct scaling protocol for differential expression analysis by following the specific guidelines of the software that is being employed.
- After performing MVA scaling on our data set, we determined that ~70% of the 8746 genes in our sample group displayed trendline linearity as assessed by R² values ≥ 0.9.
- Moreover, the remaining 30% of the genes that deviate from this linear profile were easily identified and evaluated due to their increased variability and dispersion.
- 1c), contribute significantly to the variability in the control data set (see Additional file 1).
- The approach employed in our manuscript is designed to provide an explanatory (and visually inspectable) methodology that can augment existing tools and guide the decisions of the investigator.
- A dramatic example of the variability associated with these genes is depicted by the marked increase in their computed coefficient of variation (Additional file 1, Fig.
- Statistical evaluation of the trendlines identified several robust measures that provided the greatest ability to characterize the biological variability or “tailedness” of the expression values.
- STRING-db analysis of the genes exhibiting the most pronounced “tailedness” expression profile revealed that these genes were associated with important regulatory networks (Additional file 4).
- The “tailing” trendline observed for the IFIT3 gene indicates that gene expression in about 25% of the individuals was markedly different from the other members of the sample group.
- Although kurtosis and skewness calculations identified the degree of “tailedness” of the gene sample distribution, quartile slope analysis provided a more direct measure of these changes.
- Calculating the slope ratios of Q1/(Q2 + Q3) or Q4/(Q2 + Q3) identified individuals that deviated from the central core of the sample group.
- STRING-db analysis of the genes displaying these non-linear trendline profiles identified highly integrated pathway associations, as depicted in Figs.
- We note that the robustness of the slope ratio calculations is dependent on the size of the sample group (e.g.
- Intra-group identification of 11–20% of the individuals in three separate data sets as responding to a viral-induced immune response is an important observation that should be considered prior to differential expression.
- Identification of individuals that exhibit a defense-related response within an experimental group is consistent with the time-dependent activation and transition of the immune system from the detection of a foreign object to a defined immune response [27–30].
- for example, a recent infection, an immunization, or a response to one of the many latent viruses we commonly harbor in our bodies [40].
- R², a measure of the variance explained by regression, provided an estimate of the linearity of every gene in the.
- 5, may also be impacted by the initial composition of the RNA sample [3, 19].
- Therefore, with minimal overhead, it may significantly clarify how previously unrecognized sources of biological variability may have biased or confounded the experimental analysis.
- Rank-order analysis of the MVA gene expression values in conjunction with R² calculations revealed that 65–70% of the sequenced genes display a linear “baseline” level of gene expression across the data set.
- Pathways relating to viral-induced immunological responses were identified in 11–20% of the 54 individuals evaluated in our combined studies.
- Furthermore, the resulting range and magnitude of the incremental changes in gene expression following MVA were markedly similar even though the RNA was extracted and processed differently in the three data sets evaluated in our study.
- While MVA may also reduce some of the variability that is commonly introduced when data sets are sequenced in different laboratories and at different times, its designated utility in our studies is in the analysis of intra-group comparisons.
- In our control data set, an iterative correction of the length-adjusted ERCC spike-in concentration ratios was used to proportionally adjust for sample processing effects, pipetting errors, dilutional differences and other sources of methodological variability, while the archived data sets that did not contain ERCC spike-ins were limited to size factor normalization using the median-to-ratio method as previously outlined [23].
- To address the heteroscedastic nature of the raw data, we applied a Minimum Value Adjustment (MVA) scaling normalization strategy to our counts.
- After MVA, all of the incremental changes for the 8746 protein-coding genes across the 35-sample data set fall within a numeric range of 1–60 relative counts.
- This adjustment of every gene to its lowest common denominator eliminated large comparative differences in the relative magnitude of the observed counts between genes within and across individuals while maintaining the important incremental changes when the adjusted counts are rank-ordered.
- 0.5) and nonsignificant gene counts minimized the possibility of inflating the magnitude of the sample-to-sample incremental changes that can be sensitive to very small outliers during scaling adjustments [9].
- An overview of the data input and processing pipeline is summarized in sheet 1 of file 2.
- Prominent genes considered for additional analysis fell among the top 300 genes characterized by the largest or smallest measurements for any of the computed parameters.
- DGE: Differential gene expression.
- STRING db Pathway Identification Based on the Analysis of the Top 300 Genes Identified by the Designated Statistical Measures.
- KM and PC were involved in the design of the sample collection protocol and IRB approval process.
- WW, KM and PC participated in the sample extraction and processing of the RNA samples.
- WW, HE and PC contributed to the development of the data analysis rationale outlined in the manuscript.
- WW wrote the manuscript and all the authors assisted in the review, edits and final approval of the manuscript.
- Copies of the three processed unadjusted data files that were used in this study are provided in Additional file 11.
- Expansion of the Gene Ontology knowledgebase and resources
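
The trendline measures named in the summary (CV, R² of the rank-ordered counts, the Q4/(Q2 + Q3) slope ratio and range/median) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `trendline_metrics`, the way ranks are split into four consecutive groups, and the use of a least-squares line per segment are assumptions made for illustration only.

```python
import numpy as np


def trendline_metrics(counts):
    """Summarize the rank-ordered trendline of one gene across samples.

    Hypothetical helper; assumes the gene is not constant across samples.
    """
    y = np.sort(np.asarray(counts, dtype=float))  # rank-order the samples
    x = np.arange(1, y.size + 1)                  # rank position 1..n

    # Coefficient of variation in percent (sample standard deviation).
    cv = 100.0 * y.std(ddof=1) / y.mean()

    # R^2 of a straight line fitted through the rank-ordered counts.
    slope, intercept = np.polyfit(x, y, 1)
    ss_res = np.sum((y - (slope * x + intercept)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    # Assumed quartile definition: four consecutive groups of ranks.
    q1, q2, q3, q4 = np.array_split(np.arange(y.size), 4)

    def seg_slope(idx):
        # Least-squares slope of the trendline over one group of ranks.
        return np.polyfit(x[idx], y[idx], 1)[0]

    core = seg_slope(np.concatenate([q2, q3]))  # central core, Q2 + Q3
    return {
        "cv_percent": cv,
        "r2": r2,
        "q1_ratio": seg_slope(q1) / core,
        "q4_ratio": seg_slope(q4) / core,
        "range_over_median": (y[-1] - y[0]) / np.median(y),
    }
```

Under these assumptions, a gene whose counts rise steadily across the rank-ordered samples gives R² close to 1 and slope ratios close to 1, while a gene that is elevated in only a few individuals gives a low R² and a large `q4_ratio`, which is the "tailing" pattern the summary describes.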

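The summary does not spell out how MVA is computed. The sketch below assumes that each gene's counts are divided by that gene's minimum value across samples, which matches the description of adjusting every gene "to its lowest common denominator" and of scaled values starting at 1. The `floor` cutoff for excluding very small counts is a hypothetical parameter with an illustrative default, and both function names are invented for this example. The second function shows the stable-gene ratio normalization (for example, a target gene expressed relative to ATG3).

```python
import numpy as np


def mva_scale(counts, floor=0.5):
    """Assumed MVA: divide each gene (row) by its minimum across samples.

    `floor` is a hypothetical cutoff; genes whose minimum falls below it
    are dropped so that tiny minima do not inflate the scaled values.
    """
    counts = np.asarray(counts, dtype=float)  # shape: genes x samples
    gene_min = counts.min(axis=1)
    keep = gene_min >= floor                  # boolean mask of retained genes
    scaled = counts[keep] / gene_min[keep, None]
    return scaled, keep


def ratio_to_stable_gene(target, stable):
    """Per-sample ratio of a target gene to a stable reference gene."""
    return np.asarray(target, dtype=float) / np.asarray(stable, dtype=float)
```

With this definition every retained gene has a minimum of exactly 1 after scaling, so genes with very different expression levels can be compared on a common relative scale while the sample-to-sample ordering within each gene is unchanged.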