
BALLI: Bartlett-adjusted likelihood-based linear model approach for identifying differentially expressed genes with RNA-seq data


Summary

- Background: Transcriptomic profiles can improve our understanding of the phenotypic molecular basis of biological research, and many statistical methods have been proposed to identify differentially expressed genes (DEGs) under two or more conditions with RNA-seq data.
- However, statistical analyses with RNA-seq data are often limited by small sample sizes, and global variance estimates of RNA expression levels have been utilized as prior distributions for gene-specific variance estimates, making it difficult to generalize the methods to more complicated settings.
- We herein proposed a Bartlett-Adjusted Likelihood-based LInear mixed model approach (BALLI) to analyze more complicated RNA-seq data.
- The proposed method estimates the technical and biological variances with a linear mixed-effects model, with and without adjusting for small-sample bias using Bartlett’s corrections.
- Results from the simulation studies showed that BALLI correctly controlled the type-1 error rates at various nominal significance levels and produced better statistical power and precision estimates than those of other competing methods in various scenarios.
- Conclusions: BALLI is statistically more efficient and valid than existing methods, and we conclude that it is useful for identifying DEGs in RNA-seq analysis.
- Furthermore, RNA-seq is robust against systematic errors and has therefore emerged as a successful alternative to microarray analysis [4].
- RNA-seq quantifies the numbers of reads aligned to particular transcripts or genes, and various approaches have been proposed to manage RNA-seq data [5].
- Negative-binomial distributions for read counts and normal distributions for log-transformed counts per million (CPM) successfully describe the distributions of RNA-seq data.
- However, RNA-seq is relatively expensive compared with microarray, and thus, further adjustment has been made to handle the problem of small sample size.
- In this article, we present new methods for identifying DEGs with RNA-seq data, namely, BALLI and LLI.
- For our simulation studies, artificial RNA-seq data were generated based on real data and negative binomial distributions.
- Our studies showed that the proposed method performed better than existing methods.
- The proposed methods were applied to Holstein milk yield data at the false discovery rate (FDR)-adjusted 0.1 significance level and uniquely produced significant results.
- The proposed methods were implemented as an R package and are freely downloadable at CRAN (http://cran.r-project.org) or http://healthstat.snu.ac.kr/software/balli/.
- We assumed that there were M different groups, and the averages of the expressed read counts for each gene were compared among these groups.
- If we denoted the normalized R_i with the trimmed mean of the M-value [19] by R_i, the log-cpm of gene g for subject i was defined by:
- Linear mixed-effects model.
- The proposed linear mixed-effects model may be conceptually useful for understanding the variance structure of the RNA-seq data.
- However, statistical analyses with RNA-seq data often involve few samples, and the likelihood ratio test (LRT) statistic has a bias of order O_p(N^{-1}) relative to its null distribution.
- σ²_g), the likelihood for the proposed linear mixed model is.
- β_g and α̂_g can be obtained by.
- Simulation studies with RNA-seq data from Nigerian individuals.
- We applied the proposed linear mixed models to the simulated data based on Nigerian individuals’ RNA-seq data and calculated empirical type-1 error rates and statistical powers with these models.
- 1 show results from simulation studies based on Nigerian individuals’ RNA-seq data.
- Nigerian individuals’ RNA-seq data consisted of 52,580 genes, and after filtering out genes whose total read counts across samples were smaller than one-tenth of the sample size, each replicate had around genes.
- Empirical type-1 error rates and powers were estimated with 50 replicates.
- Table 1 and Additional file 3 assumed δ = 0, and thus, their estimates indicated the empirical type-1 error rates.
- For the proposed methods, we assumed that ψ_g = 1 and σ²_g1 = σ²_g2, and the proposed methods with and without Bartlett’s corrections are denoted as BALLI and LLI, respectively, for the remainder of this article.
- According to Table 1 and Additional file 3, BALLI and voom always controlled the nominal type-1 error rates correctly if N was greater than or equal to 12.
- LLI also successfully controlled the nominal type-1 error rates if N was greater than or equal to 20.
- edgeR performed worst, and its estimated type-1 error rates were always inflated at and 0.005 nominal significance levels.
- Thus, we could conclude that the proposed linear mixed model with Bartlett’s correction reasonably controlled the type-1 error, and Bartlett’s correction was required if the sample size was less than 20.
- If N = 40 and δ = 0.5σ, the estimated power and precision of BALLI were 0.342 and 0.943, which were higher than those of DESeq and voom.
- Our method was also applied to simulation data based on RNA-seq data from Holstein cows.
- The results were similar to those of the simulation data based on Nigerian people’s data.
- Additional file 4 shows that BALLI and LLI controlled the nominal type-1 error rates if N ≥ 8 and if N ≥ 20, respectively.
- Simulation studies with randomly generated RNA-seq data.
- RNA-seq data are generally known to follow a negative binomial distribution, and we conducted simulation studies with RNA-seq data generated from negative binomial distributions.
- The overall trend of the estimated type-1 error rate was similar to that of simulation studies based on real RNA-seq data.
- Estimated type-1 error rates by voom and BALLI usually maintained the nominal significance levels (Table 2 and Additional file 6).
- DESeq2 generally showed deflation of type-1 error rates at a 0.1 nominal significance level.
- Fig. 1 Estimated powers and precisions with simulation data based on Nigerian people’s RNA-seq data.
- Figure 2 shows the estimated type-1 error rates at the 0.05 significance level according to different choices of u.
- The estimated type-1 error rates of LLI were affected by sample size.
- If N was larger than or equal to 40, LLI controlled the type-1 error rates most accurately and was not affected by the library size variation.
- BALLI usually had the best estimated power and precision, as was observed in simulation studies based on real RNA-seq data.
- For example, when u = 0.2, N = 28, and δ = 0.5σ, the estimated power by BALLI was 0.438, whereas those for DESeq2 and voom were 0.317 and 0.352, respectively (Fig.
- Results when u = 0.2 and δ = 1 σ in Fig.
- Fig. 2 Effect of varying library sizes on the type-1 error rates.
- Type-1 error rates were estimated by BALLI, DESeq2, edgeR, LLI, and voom when u and 1 and sample size (N) is 12 (a), 16 (b), 20 (c), 24 (d), 28 (e), 40 (f), 64 (g), and 68 (h) at the 0.05 nominal significance level.
- a Estimated power when u = 0.2 and δ = 0.5σ.
- b Estimated precision when u = 0.2 and δ = 0.5σ.
- c Estimated power when u = 0.2 and δ = 1σ.
- d Estimated precision when u = 0.2 and δ = 1σ.
- values for most of the nine genes obtained by LLI and BALLI were smaller than those obtained by other methods.
- We also analyzed all genes by the proposed methods.
- Of the 12.
- Therefore, we concluded that the proposed method, BALLI, worked well for real data analysis.
- In this article, we suggested new methods, designated BALLI and LLI, for identifying DEGs with RNA-seq data.
- The proposed methods were compared with existing methods, such as DESeq2, edgeR, and voom, with extensive simulation studies.
- According to our results, negative-binomial-based approaches often failed to preserve the nominal type-1 error rates.
- However, the proposed method with Bartlett’s correction, BALLI, preserved the nominal type-1 error rates and was the most powerful method other than LLI.
- Unless sample sizes were small, LLI controlled the type-1 error rates as well and was the most powerful method.
- We found that library size variance could affect the estimated type-1 error rates, and the effect was the largest for voom.
- The proposed methods assumed that log-cpm values of read counts asymptotically followed a normal distribution and that their variances were approximately equal to 1/μ + ϕ with first order approximation.
- However, our simulation studies revealed the superiority of the proposed methods compared with voom, which was found to be attributable to their different variance structures.
- For the proposed methods, 1/μ + ϕ was derived from the first order approximation of the negative binomial distribution and thus may be a natural assumption for RNA-seq data.
- The proposed model assumed that the variance of log-cpm was φ/μ + ϕ and had the most generalized variance parameter space.
- For example, subjects with different ethnicities can cause φ to be larger than 1, and thus, a better model may differ according to RNA-seq data.
- φ and ϕ can be estimated with the proposed linear mixed model by implementing only a simple modification, and thus, we can
- Table 4 Significant genes in all methods of Holstein milk data.
- Furthermore, in contrast to methods based on a generalized linear mixed model such as MACAU [27], the proposed methods can be easily extended to various scenarios with a simple modification.
- Maximizing the likelihood for negative binomial or Poisson distributions with random effects is computationally intensive, but the proposed methods can easily obtain variance parameter estimates using existing R packages, such as lme4 and nlme.
- With simulation studies for various scenarios, we showed that the proposed methods were usually the most efficient.
- Our results were obtained from simulation data based on RNA-seq data from Nigerian individuals and Holstein cows and on random samples from negative binomial distributions, but any systematic differences in RNA-seq data could generate different results, depending on sequencing errors or differences in preparation steps.
- However, despite such limitations, we believe that our results illustrate the practical value of the proposed methods.
- In this article, we proposed likelihood-based linear mixed model approaches with and without Bartlett’s correction to analyze more complicated RNA-seq data.
- The proposed methods consider log-cpm values of genes as response variables, and technical and biological variances are estimated with a linear mixed model.
- According to our simulation studies and real data analysis, our methods are statistically more efficient than existing methods and correctly control the type-1 error rates.
- Additional file 3: Estimated type-1 error rates with simulation data for N.
- Additional file 4: Estimated type-1 error rates with simulation data for N.
- Additional file 6: Estimated type-1 error rates with simulation data for N.
- Additional file 7: Effect of varying library sizes on the type-1 error rates when u and 1 and N or 68 at the 0.005 nominal significance level.
- BALLI: Bartlett-Adjusted Likelihood-based LInear mixed model approach;
- RNA-seq: RNA sequencing.
- Comparison of RNA-Seq and microarray in transcriptome profiling of activated T cells.
- Mapping and quantifying mammalian transcriptomes by RNA-Seq.
- RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays.
- Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.
- Voom: precision weights unlock linear model analysis tools for RNA-seq read counts.
- Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.
- Misspecification of the covariance structure in generalized linear mixed models.
- Comparison of software packages for detecting differential expression in RNA-seq studies.
- A comparison of methods for differential expression analysis of RNA-seq data.
- A scaling normalization method for differential expression analysis of RNA-seq data.
- ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets.
- RNA-seq analysis for detecting quantitative trait-associated genes
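
The variance form 1/μ + ϕ quoted in the discussion above can be motivated with a delta-method argument. The following is a sketch under stated assumptions, not the authors' own derivation: Y denotes a read count following a negative binomial distribution in the mean-dispersion parameterization (mean μ, dispersion ϕ), the library size is treated as a fixed constant so that it does not contribute to the variance, and the natural logarithm is used (a base-2 logarithm would multiply the result by the constant 1/(ln 2)^2).

```latex
% Assumption: Y ~ NB with mean \mu and dispersion \phi (mean-dispersion form).
\begin{align*}
\operatorname{E}[Y] &= \mu, \qquad \operatorname{Var}(Y) = \mu + \phi\,\mu^{2}, \\
% First-order (delta-method) expansion of \log Y around \mu:
\operatorname{Var}\bigl(\log Y\bigr)
  &\approx \left(\left.\frac{d \log y}{dy}\right|_{y=\mu}\right)^{2} \operatorname{Var}(Y)
   = \frac{\mu + \phi\,\mu^{2}}{\mu^{2}}
   = \frac{1}{\mu} + \phi .
\end{align*}
```

Under this reading, the 1/μ term shrinks as the expected count grows, while ϕ is the part of the variance that remains at high counts. Scaling the first term by φ gives the more general form φ/μ + ϕ mentioned above, and φ = 1 recovers the negative binomial case.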
