
On the utility of RNA sample pooling to optimize cost and statistical power in RNA sequencing experiments


Summary

- The results demonstrate that pooling strategies in RNA-seq studies can be both cost-effective and powerful when the number of pools, pool size and sequencing depth are optimally defined.
- Conclusion: For high within-group gene expression variability, small RNA sample pools are effective to reduce the variability and compensate for the loss of the number of replicates.
- Unlike the typical cost-saving strategies, such as reducing sequencing depth or number of RNA samples (replicates), an adequate pooling strategy is effective in maintaining the power of testing DGE for genes with low to medium abundance levels, along with a substantial reduction of the total cost of the experiment.
- In general, pooling RNA samples or pooling RNA samples in conjunction with moderate reduction of the sequencing depth can be good options to optimize the cost and maintain the power.
- Full list of author information is available at the end of the article.
- The number of biological replicates in an RNA-seq experiment is typically small because of financial or technical constraints.
- For RNA-seq data, Rajkumar et al.
- In addition, we have defined the data generating mechanism in sample pooling strategies from a mathematical perspective for better interpretation of the empirical and simulation results.
- In pooled RNA-seq experiments, a number (q) of randomly selected RNA samples are mixed before library preparation and sequencing.
- In the subsequent sections, we formalize the RNA sample pooling procedure for better understanding of the data generating process.
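The random selection of q samples per pool can be sketched as follows. This is a minimal illustration, assuming equal pool sizes and a total number of samples that is a multiple of q; the helper name `assign_pools` and the sample labels are hypothetical and do not come from the paper.

```python
import random


def assign_pools(sample_ids, q, seed=0):
    """Randomly partition sample_ids into pools of size q (hypothetical helper)."""
    if q < 1 or len(sample_ids) % q != 0:
        raise ValueError("number of samples must be a multiple of the pool size q")
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    # Consecutive chunks of the shuffled list form the pools.
    return [shuffled[i:i + q] for i in range(0, len(shuffled), q)]


# Example: 12 RNA samples from one group, pooled three at a time.
samples = [f"S{i}" for i in range(1, 13)]
pools = assign_pools(samples, q=3)
for k, pool in enumerate(pools, start=1):
    print(f"pool {k}: {pool}")
```

Each pool then yields a single library, so 12 samples with q = 3 give 4 libraries per group instead of 12.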
- Fig. 1 Summary of the workflow.
- We further assume that the $A_{jk}$ are independent of the $U_j$.
- That is, we mix $w_A$ and $w_B$ fractions of the RNA molecules from biological samples A and B, respectively.
- We consider the mixing weights as random variables and account for their contribution to the variability of the pooled outcome.
- Model (1) indicates that $Y_k$ is the weighted sum of the virtual counts $U_j$ from the $q$ biological samples in pool $k$.
- Under the assumption that $U_j$, $A_{jk}$ and $W_{jk}$ are independent random variables, the expectation of the gene expression measures in pooled sample $k$ becomes.
- This indicates that the expected gene expression measurement in a particular pool is equal to the average of the expected expression levels from the $q$ biological samples included in that pool.
- The variability of the gene expression levels in pool k, accounting for the sampling variability.
- The proof is available in Section 1.1 of the Supplementary file, with empirical confirmation by Monte Carlo simulations (see the Supplementary figures).
- Eq. 3 indicates that $\mathrm{Var}\{Y_k\}$ is inversely proportional to the pool size $q$, suggesting that pooling reduces the variability of the gene expression measurements, given $\sigma^2$ is sufficiently small.
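A small Monte Carlo check can make the statements about the mean and variance of a pooled measurement concrete. The sketch below is a simplified stand-in for the paper's model, not a reproduction of it. It assumes normally distributed virtual counts, mixing weights drawn from a flat Dirichlet distribution (all parameters equal to 1), and no additional sequencing sampling variability. The function names and the values `mean=100`, `sd=30` are hypothetical.

```python
import random
import statistics


def simulate_pool(q, rng, mean=100.0, sd=30.0):
    """One pooled measurement: a randomly weighted sum of q virtual counts."""
    # Virtual counts U_j of the q samples (normal distribution is an assumption).
    u = [rng.gauss(mean, sd) for _ in range(q)]
    # Normalised Exp(1) draws give flat Dirichlet(1, ..., 1) mixing weights.
    g = [rng.expovariate(1.0) for _ in range(q)]
    total = sum(g)
    w = [x / total for x in g]
    return sum(wj * uj for wj, uj in zip(w, u))


def pooled_moments(q, n_sim=20000, seed=1):
    """Empirical mean and variance of the pooled measurement for pool size q."""
    rng = random.Random(seed)
    y = [simulate_pool(q, rng) for _ in range(n_sim)]
    return statistics.fmean(y), statistics.variance(y)


for q in (1, 2, 4, 8):
    m, v = pooled_moments(q)
    print(f"q={q}: mean={m:.1f}, variance={v:.1f}")
```

Under these assumptions the empirical mean stays near the per-sample mean for every q, while the empirical variance shrinks as the pool size grows.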
- The mean expression of a gene, $\bar{Y}$, from a pooled experiment is an unbiased estimator of the true mean expression, similar to that of the standard experiment.
- Furthermore, we examine the effect of pooling on the estimation of the relative abundance $\rho$ of a gene and the log-fold-change (LFC) between two independent groups.
- Although pooling results in expression levels with a lower variance, the estimates of the relative abundance ($\hat{\rho}$) and of the LFC between two independent groups ($\hat{\theta}$) have a variance that is at least $2q/(q + 1)$ times higher than that of the estimates from standard experiments (see Section 1.2 of the Supplementary file for details).
- This is the direct consequence of the reduction of the number of replicates in pooled experiments.
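The lower bound 2q/(q + 1) quoted above is easy to tabulate. The sketch below only evaluates that expression for a few pool sizes; the function name `lfc_variance_inflation` is hypothetical.

```python
def lfc_variance_inflation(q):
    """Lower bound 2q/(q+1) on the pooled-to-standard variance ratio of the estimates."""
    if q < 1:
        raise ValueError("pool size q must be at least 1")
    return 2 * q / (q + 1)


for q in (1, 2, 3, 5, 10):
    print(f"q={q}: factor >= {lfc_variance_inflation(q):.3f}")
```

The factor equals 1 for q = 1 (no pooling) and stays below 2 for any pool size, so the penalty from having fewer replicates is bounded while the number of libraries falls in proportion to q.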
- The variance of the gene expression levels from pooled and non-pooled experiments.
- That is, given the number of RNA samples in groups 1 and 2 ($n_1$ and $n_2$, respectively), pool size $q$, the LFC to be detected $\theta$, and the over-dispersion $\phi$, the power of the two-sided likelihood-ratio test at significance level $\alpha$ can be calculated.
- Here, $\Phi$ is the standard normal cumulative distribution function, $Z_{\alpha/2}$ is the $(1 - \alpha/2)100\%$ quantile of the standard normal distribution, and $V_0$ and $V_A$ are the variances of the LFC estimate.
- The details of the power calculation can be found in Section 1.3 of the Supplementary file.
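The exact power expression is given in the Supplementary file and is not reproduced in this summary. As a rough illustration of how such a calculation uses the standard normal distribution, the sketch below implements a generic normal-approximation power for a two-sided test. It is not the authors' formula. It assumes that `v0` and `va` are the variances of the LFC estimate under the null and the alternative hypothesis, respectively. The function name and the numeric inputs are hypothetical.

```python
from statistics import NormalDist


def approx_power(theta, v0, va, alpha=0.05):
    """Generic normal-approximation power of a two-sided test for an LFC of theta.

    Assumption: the test rejects when |estimate| > z * sqrt(v0), and the
    estimate is approximately N(theta, va) under the alternative.
    """
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)  # upper alpha/2 quantile
    upper = std_normal.cdf((abs(theta) - z * v0 ** 0.5) / va ** 0.5)
    lower = std_normal.cdf((-abs(theta) - z * v0 ** 0.5) / va ** 0.5)
    return upper + lower


# Hypothetical inputs: LFC of 1 with estimator variance 0.1 under both hypotheses.
print(round(approx_power(theta=1.0, v0=0.1, va=0.1), 3))
```

With `theta = 0` and equal variances the function returns the significance level, which is a quick sanity check on the approximation.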
- In Supplementary Figs. S2–S4, we presented the relationship between the power and the total cost of the data generation for different experimental designs, including RNA sample pooling.
- Further details are in Section 1.3 of the Supplementary file.
- Moderate reduction of the sequencing depth without reducing the number.
- In summary, an RNA sample pooling strategy can be a good choice to optimize the power and data generation cost, especially when many of the genes are expressed at low or medium levels.
- Strategy B is similar to the reference, except the number of samples is reduced to n.
- The relative cost is calculated as the total cost of a particular strategy divided by that of the reference design.
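The relative cost defined above can be sketched as a ratio of two total costs. The cost model below (a per-sample preparation cost, a per-library cost, and a per-million-reads sequencing cost) and all prices are hypothetical placeholders, not figures from the paper. The names `Design`, `total_cost` and `relative_cost` are also illustrative.

```python
from dataclasses import dataclass


@dataclass
class Design:
    n_samples: int                # RNA samples prepared
    n_libraries: int              # libraries prepared and sequenced
    reads_per_library_m: float    # sequencing depth, in millions of reads


def total_cost(d, cost_sample=50.0, cost_library=150.0, cost_per_m_reads=5.0):
    """Total cost under a simple additive cost model (prices are placeholders)."""
    return (d.n_samples * cost_sample
            + d.n_libraries * cost_library
            + d.n_libraries * d.reads_per_library_m * cost_per_m_reads)


def relative_cost(d, ref, **prices):
    """Total cost of design d divided by that of the reference design."""
    return total_cost(d, **prices) / total_cost(ref, **prices)


reference = Design(n_samples=40, n_libraries=40, reads_per_library_m=20.0)
pooled = Design(n_samples=40, n_libraries=20, reads_per_library_m=20.0)  # q = 2
print(round(relative_cost(pooled, reference), 3))
```

Pooling keeps the number of RNA samples but halves the number of libraries, so only the library and sequencing terms shrink.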
- like long non-coding RNAs [6] with a substantial reduction of the library and sequencing costs.
- In such scenarios, reducing the number of samples (strategy B) or pooling with a large pool size can optimize the cost with power comparable to the reference design.
- That is, let $q_k$ denote the pool size in pool $k$; then the variance of the LFC estimate in the pooled experiment, $\hat{\theta}^{*}$, becomes at least m 2n 2 m times higher than that of the estimates from the standard experiment.
- The reference scenario represents a standard tissue RNA-seq experiment without pooling consuming a maximum budget in terms of the number of samples, number of libraries, and sequencing depth.
- The 12 test scenarios each comprise a unique combination of the number of RNA samples, sequencing depth, number of libraries, and pool size (q).
- Table 1 Summary of RNA-seq experimental scenarios.
- In particular, the scenarios cover reducing the number of RNA samples (scenario A1), reducing both the number of RNA samples and sequencing depth (scenario A2), reducing the sequencing depth (scenarios A3 and A4), pooling of RNA samples (scenarios B1 and C1), pooling and reducing the number of RNA samples (scenarios B2 and C2), pooling and reducing the sequencing depth (scenarios B3 and C3), and both (i.e., pooling, reducing the sequencing depth and reducing the number of RNA samples; scenarios B4 and C4).
- In particular, we focus on comparing the distribution of the mean and variability of normalized gene expression levels, the LFC estimates, and the number and characteristics of genes called DE at the 5% nominal FDR level.
- The two-dimensional visualization of the neuroblastoma samples (for each scenario) using principal component analysis also shows that the within-group variability is smaller than the between-group variability in pooled experiments, where group is here the MYCN status (see the Supplementary figures).
- This result is in line with the theoretical result that pooling results in an unbiased estimate of the average gene expression level even for different choices of pool size.
- We also evaluated the bias of the LFC estimates in each test scenario relative to the estimates from the reference scenario.
- a: distributions of the average normalized counts per gene (log2 scale); b: distributions of the variability of normalized counts per gene (log2 scale); c: the LFC bias in terms of the mean absolute difference from the LFC estimate of the reference scenario (A0).
- The magnitude of bias caused by the reduction in the number of replicates is smaller for pooling scenarios than for non-pooling scenarios.
- On the other hand, pooling scenarios B2, B4, C2 and C4 resulted in the largest bias, which can be explained by the highest reduction of the number of replicates.
- Reducing the within-group variability by pooling RNA samples may enhance the resolution of the biological effect.
- On the other hand, the pooling scenarios B1, B3, C1 and C3 have lower variability and hence resulted in estimates close to those of the reference scenario, but with fewer libraries.
- number of replicates per group (e.g.
- Furthermore, the characteristics of the genes that are exclusively called DE in each test scenario are quite different.
- The results from the comparison of the second set of experimental scenarios with the NGP nutlin dataset (see Table 1b) show that pooling did not have much effect on the overall result.
- In particular, unlike the pooling scenarios based on the Zhang data, the variability of the gene expression data did not change across the scenarios (see the Supplementary figures).
- Consequently, in line with the theoretical results, a large pool size is required to reduce the variability of the virtual counts.
- Only a small reduction of the LFC estimation bias was observed for the pooling scenarios relative to the non-pooling scenario (see the Supplementary figures).
- The number of detected DE genes (at 5%.
- In particular, the power-cost trade-off was assessed for designs with equal number of replicates (3 replicates per group), that is, 3 individual cell line samples (scenario A), 3 pools of 2 cell line samples (q = 2, scenario B) and 3 pools of 3 cell line samples (q = 3, scenario C).
- The sensitivity of limma-voom differs substantially among scenarios, ranging over 20-75% and 55-95% at the 5% nominal FDR when the LFC of the DE genes is greater than 0.5 and 1, respectively.
- In particular, scenarios with equal number of libraries and pool size resulted in almost the same sensitivity, regardless of the sequencing depth difference.
- Results of the simulation based evaluation: The curves show the trade-off between the true positive rate (TPR) and the actual FDR evaluated at 0-40% nominal FDR level.
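The two quantities on these curves, the true positive rate and the actual FDR, can be computed from a list of genes called DE and the list of truly DE genes in a simulation. The sketch below uses the standard definitions; the function name `tpr_and_fdr` and the gene labels are hypothetical.

```python
def tpr_and_fdr(called, truly_de):
    """Return (TPR, actual FDR) for a set of DE calls against the simulated truth."""
    called, truly_de = set(called), set(truly_de)
    true_pos = len(called & truly_de)    # truly DE genes that were called
    false_pos = len(called - truly_de)   # called genes that are not truly DE
    tpr = true_pos / len(truly_de) if truly_de else 0.0
    fdr = false_pos / len(called) if called else 0.0
    return tpr, fdr


# Toy example: 4 genes called DE, 5 genes truly DE in the simulation.
tpr, fdr = tpr_and_fdr(["g1", "g2", "g3", "g7"], ["g1", "g2", "g3", "g4", "g5"])
print(tpr, fdr)
```

Repeating this at each nominal FDR threshold between 0 and 40% traces out one curve per scenario.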
- This result points at the utility of an RNA sample pooling strategy to balance the number of replicates and variability.
- In general, the magnitude of the LFC for the simulated DE genes showed considerable effect on the sensitivity (for both edgeR and limma), such that for simulations with high LFC for DE genes, the performance-gap between scenarios narrows down.
- For example, edgeR focuses on maximizing the sensitivity to detect true DE genes with liberal performance in terms of the FDR control.
- On the other hand, limma-voom guarantees control of the FDR for all design choices but its sensitivity is strongly dependent on the number of replicates.
- Therefore, if one aims at maximizing the sensitivity with the actual FDR within the tolerance range and pooling does not result in too much reduction of the sample size, then limma-voom is a good choice for pooled experiments.
- General summary of the empirical and simulation results: the experimental scenarios in Table 1a are ranked based on a score that summarizes the empirical and simulation results (see the performance ranking figure).
- This can be seen from the fact that, because of the library size reduction, scenarios A3 and A4 resulted in a substantially smaller number of genes with a sufficient level of expression than all the remaining scenarios (see the Supplementary figures).
- In general, the difference in the number of libraries appeared to be a critical factor behind the overall performance difference between scenarios (see the Supplementary figures).
- Performance ranking of RNA-seq experiment design scenarios.
- In particular, five metrics were summarized: the inverse of the LFC estimate bias, standardized LFC for MYCN geneset (absolute value), concordance with reference scenario, one minus the actual FDR (at 5% nominal FDR level), and sensitivity (at 5% nominal FDR level).
- to have the potential to optimize both the cost of the data generation process as well as the statistical power for testing DGE [9–13].
- Consequently, the statistical power of testing for DGE using pooled experiments can be lower than that of the full budget design unless a proper pool size is chosen.
- Specifically, given there is a sufficient number of RNA samples, a small pool size such as q = 2 is sufficient to stabilize the large variability among the gene expression levels and optimize the trade-off between the power and the data generation costs.
- Of note, the parameters of the Dirichlet distribution, α 1.
- Similarly, the assumption between W and U holds if the mixing weights in a pool are determined independently of the transcriptome size in each biological sample.
- However, pooling in general does not guarantee better results unless the key elements of the pooling experiment are carefully chosen.
- That is, for pooling to be as effective as the standard RNA-seq experiment, it is essential to carefully determine the pool size, the number of pools, and sequencing depth depending on the level of variability and the number of RNA samples.
- One of the apparent drawbacks of pooling experiments is the reduction of the number of replicates, which most statistical methods strongly rely on for optimal performance.
- However, our results demonstrate that pooling has the potential to compensate for the loss of the number of replicates by reducing the within-group variability unless the pooling strategy results in too much reduction of the number of replicates.
- In practice, however, pooling experiments would involve pooling of the RNA molecules before library preparation, and hence extra technical variability resulting from pooling could be anticipated.
- We have shown that the utility of an RNA sample pooling strategy depends on the choice of the pooling parameters, such as the pool size and the number of RNA samples.
- Since the cost of RNA sample preparation is relatively low, one may consider using as many RNA samples as possible to capture the heterogeneity of the population under study, and using an adequate pooling strategy, one can substantially reduce the cost of the subsequent steps, which are considerably more expensive, and maintain the power of a DGE test.
- In particular, for scenarios with a high biological variability, a small pool size such as 2 can be effective to optimize the cost of the experiment and maintain the power that one would attain without pooling.
- Unlike the typical cost-saving strategies, such as reducing the sequencing depth or number of RNA samples (replicates), an adequate pooling strategy is particularly effective for scenarios with many genes with low and moderate levels of expression.
- We have demonstrated that pooling RNA samples or pooling RNA samples in conjunction with moderate reduction of the sequencing depth can be good options to further optimize the cost of the experiment without much loss of the power of the DGE test.
- Methods: RNA-seq datasets.
- Two series of simulations were run, with an absolute LFC estimate of the simulated DE genes of at least 0.5 or 1, respectively.
- Supplementary figures directly referred to in this paper, as well as the details of the data generating model for pooled experiments and the power calculation.
- RNA-seq: RNA sequencing.
- We wish to acknowledge the technical support by Celine Everaert (Ghent University) and Jasper Anckaert (Ghent University) for processing and quality assessments of the NGP nutlin data.
- The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
- RNA-seq: a revolutionary tool for transcriptomics.
- Design and validation issues in RNA-seq experiments.
- Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data
