
Compacta: A fast contig clustering tool for de novo assembled transcriptomes



- When an organism has a well-characterized and annotated genome, reads obtained from RNA-Seq experiments can be directly mapped to that genome to estimate the number of transcripts present and relative expression levels of these transcripts.
- The user can determine the minimum coverage of the contigs to be clustered, as well as a threshold for the proportion of shared reads in the clustered contigs, thus providing a dynamic range of transcriptome compression that can be adapted according to experimental aims.
- We compared the performance of Compacta against state-of-the-art clustering algorithms on assemblies from Arabidopsis, mouse and mango, and found that Compacta produced results more rapidly and had competitive precision and recall ratios.
- The cDNA is sequenced to obtain ‘reads’ that represent parts of the original mRNA molecules.
- 2020. Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
- by temporal transcription, wherein significant parts of the genome, both coding and non-coding segments, are transcribed only at specific points during development or under specific conditions [11, 12].
- As a result, transcriptome assemblies typically produce very large contig sets that in some cases are many-fold larger than the number of genes in the entire species genome.
- For example, de novo assembly of the transcriptome for the polychaete annelid Platynereis dumerilii using Trinity gave a set of 273,087 non-redundant contigs, which were identified through a pipeline that included sequence homology to only 17,213 genes [14], nearly 16-fold fewer than the number of contigs.
- The program first filters out contigs that have a low number of mapped reads (<.
- A distance threshold for clustering can be set by the user, but the default value of 0.3 is equivalent to sharing of 70% of the reads between two entities, i.e., original contigs or clusters already obtained by the algorithm.
- The number of shared reads is also updated at each iteration and clustering of a contig set stops when either all the contigs have been grouped into a single cluster or the current minimum distance increases above the distance threshold.
- The Corset algorithm has two disadvantages: First, it uses a fixed number of reads to assess contig coverage, disregarding contig and read lengths.
- The nature and number of conditions used to obtain different transcriptome samples can.
- Also, for annotation of ongoing eukaryotic genome projects, an equimolar mixture of RNA from different tissues of the same species is sequenced [18].
- Grouper shares the same disadvantages with Corset, i.e., the program uses an arbitrary minimum number of reads to consider whether a contig is valid (in Grouper the user cannot modify this value) and contig segregation depends on the RNA-Seq experimental conditions.
- Under this scenario, the best use of information from de novo assembly is formation of a contig cluster that can be used to identify the core set of expressed genes that allows the most effective comparison of the relative expression of such entities based on the design of the RNA-Seq experiment.
- Compacta is designed to reduce the number of contigs to a smaller set of representative sequences while preserving the information about relative expression given by read abundance.
- When d = 0.3, for example, all pairs of contigs sharing 30% or more of the reads that reference the contig having fewer reads will be clustered into a single entity.
- In addition to file locations, Compacta includes options for number and names of samples and experimental groups, as well as options that allow parallelization of part of the algorithm.
- Compacta output comprises files that: (i) define the obtained clusters as sets of the original contigs.
- The following list describes the parameters of the Compacta algorithm.
- $w_{ij} = R_{ij} / \min(R_i, R_j)$, where $R_i$ and $R_j$ are the numbers of reads that independently map to contigs i and j, respectively, while $R_{ij}$ is the total number of reads that map to both contigs i and j; i.e., $R_{ij}$ is the number of reads shared by contigs i and j.
- The weight of an edge, $w_{ij}$, ranges from zero, when the edge contigs share no sequencing reads, indicating no similarity (disconnected contigs), to one, indicating that one of the contigs is a proper subset of the other.
- The value $c_i$ is defined as the length of contig i and $s_i$ is the sum of the lengths of all reads that map to that contig.
- marked as ‘pre-clusters’ that are each loaded into a heap structure self-ordered by edge weight, ensuring that the first value in the heap is always the edge having the heaviest weight, i.e., the largest value of $w_{ij}$.
- In this scenario, weights, $w_{ij}$, are re-calculated for the new conformation of the pre-cluster and the process is repeated until the first edge in the heap has a weight that is less than the threshold d or all its contigs are clustered together.
- The final content of the heap structure, which can contain one or more clusters, goes to the output.
- Once Compacta processes all pre-clusters, it produces files that include the description of each cluster (sets of the original contigs), as well as lists indicating which contig could represent each one of the clusters, either by being the longest contig in the cluster or the one that has the largest number of reads mapping to it.
- In summary, from BAM files containing the information of the original contigs and reads mapping to them, Compacta produces a set of representative contigs for use in downstream analyses.
- In step (2) of the algorithm, the graph is constructed.
- Thus, if $w_{ij} = 0$ we will consider that the corresponding contigs are completely unrelated, whereas $w_{ij} = 1$ means that the smaller contig is a proper subset of the second, or, when they are the same size, they will be some permutation of the positions of the same reads.
- In step (3) of the algorithm, Compacta uses effective contig coverage, expressed as the number of times that the full-length contig is covered by reads, as a measure.
- In contrast, Corset and Grouper allow selection of contigs only through a fixed threshold in the number of reads that map to each contig, independently of contig length.
- However, a fixed threshold number of reads is inadequate to judge contigs having different lengths.
- 0 are input into the fourth step of the algorithm, ‘pre-cluster detection’.
- If a pre-cluster graph is plotted, it is possible to go from any of the contigs to any other contig by following a path.
- The core of the Compacta algorithm is step (5), involving agglomerative clustering of connected contigs or.
- needs only to analyze the first element of the heap, thus saving valuable time.
- d during the iterations, the entire content of the heap is sent to the output, including the definitions of clusters and the number of reads that map to them.
- This process guarantees that the number of entities in the output is smaller than or at most equal to the number of input contigs.
- Another substantial way that Compacta differs from Corset and Grouper is that Compacta uses no computational methods to determine if two contigs were the product of transcription from ‘the same gene’, whereas both Corset and Grouper attempt to estimate and consider contig origin.
- In our opinion, in the absence of genomic information, accurate prediction of whether two contigs are the product of: a) different alleles of the same gene, b) alternative splicing forms produced from the same gene or c) two highly similar genes (close paralogs or two close members of the same gene family) is essentially impossible due to the high diversity and conformations of eukaryotic genomes.
- To achieve these aims, the ability to downsize the potentially very large number of contigs given by the assembler into a smaller and more manageable set of representative sequences is valuable.
- The assembly process is likely to yield many related contigs that represent transcription variants of the same gene as alternative splicing forms, alleles, or products of the transcription of close paralogs of the same gene or gene family.
- Apart from filtering low-evidence contigs with the parameter -l = l, the number of clusters given by the algorithm is a function only of the parameter d, the threshold for clustering contigs into clusters.
- z, considering only the number of input contigs, c, and the number of clusters output, z.
- 0 and thus be clustered together, giving the smallest number of clusters in the output.
- The number of clusters resulting from that operation can be termed $z_{\min}$, where $f(c, d = 0) = z_{\min}$.
- On the other hand, if d is set to 1, we will ask the algorithm to group only contigs that share all reads of the smaller contig, because in order to have $w_{ij} = R_{ij} / \min(R_i, R_j)$.
- In that case, we will have a maximum number of clusters in the output, where $f(c, d = 1) = z_{\max}$, such that Compacta will cluster only those contigs that are proper subsets of the longest contig in the group (pre-cluster) and will likely produce clusters containing only highly similar gene alleles, splicing forms that share most exons in the genes, or very close paralogs.
- z is non-decreasing follows from the fact that a larger value of d can only increase the number of output clusters, z, given that the clustering algorithm will be more stringent, i.e., if $d_1 <$.
- and ‘Contigs’ shows the number of contigs obtained from the assembly.
- 3 were obtained by running CD-HIT, Compacta, Corset, Grouper and the clustering facility of the Trinity suite on the contigs from assemblies of the Arabidopsis and mouse datasets (Table 1);.
- Table 2 shows the number of clusters output by Compacta, Corset and Grouper for the Arabidopsis, mouse and mango datasets.
- Compacta produced a larger number of contigs in the Arabidopsis and mouse real datasets, and the smaller number of contigs for the mango dataset and the simulated datasets of Arabidopsis and mouse.
- As mentioned above, Compacta uses an agglomerative algorithm with a heap that auto-sorts elements upon insertion and deletion, which reduces computation time to $O(n^2 \log n)$ [28]; this is considerably faster than the other algorithms, particularly as the size of the input data increases.
- Sources and characteristics of the RNA-.
- However, for larger and more complex assemblies, such as those for mango and mouse, input file parsing represents a much smaller fraction of the total processing time, such that Compacta is faster than Grouper (cf. Compacta was 340-fold faster than Grouper for the mouse assembly).
- Numbers in the upper bars for Corset and Grouper are the number of rounds that the execution took for the corresponding program compared with the Compacta execution time.
- red line), number of Arabidopsis sequences identified (n As .
- As shown above, Compacta can be adjusted through the d parameter to give a number of clusters within the range $[z_{\min}, z_{\max}]$.
- Test runs of differential expression programs, such as edgeR [38], can be performed with each representative set of contigs to evaluate the suitability of the results for achieving the particular aims of an RNA-Seq experiment.
- By taking the largest contig as representative of each cluster, the number of distinct Arabidopsis sequences identified (column $n_{As}$ in Table 3) varies from a minimum of 3344 when d = 0 to a maximum of 23,607 for d = 1.
- This latter value corresponds to the total number of Arabidopsis sequences identified in the entire assembly.
- Table 2 Number of representative contigs selected by each algorithm from each transcriptome when run with default parameters.
- The ratio between the maximum and minimum numbers of Arabidopsis sequences identified is similar to the ratio $z_{\max} / z_{\min}$ of ≈ 7.5. However, we will see that the proportional changes in the number of clusters and identified sequences do not follow a linear function of d.
- The percentages of clusters, z, number of identified sequences, $n_{As}$, and clustering efficiency, Ef, defined as $Ef = n_{As}/z$, are estimated as curve functions of d, which again does not have a linear relationship as shown by the grey dashed line with slope 10 (Fig.
- This inaccuracy results from not taking into account genome architecture, which in turn can confound identification of the gene from which each transcript originated to create multiple putative transcripts of which an unknown proportion could be artifacts, i.e., transcripts that do not exist physiologically.
- Analyses of a representative set of contigs at d = 0 will give the minimum resolution of the assembly, because all representative contigs at that point represent completely independent ‘genes’ or ‘gene families’ and consequently the number of identified sequences will be a minimum (Fig.
- However, for many purposes, the number of representative sequences at d = 0 will be ‘too few’ and further analyses will be needed to improve accuracy.
- On the other extreme, at d = 1, we have the largest number of representative contigs, and consequently the largest number of identified sequences (Fig.
- The d l point gives large increases in the number of contigs, z, and identified sequences, n As , when compared with the point at d = 0, say Δz(d.
- Parameter value, z - Number of clusters (representative contigs), n As - Number of Arabidopsis sequences identified.
- If the differential expression of the representative contigs at max (Ef) is satisfactory for the aims of the RNA-Seq experiment, the analysis at this point could be reported as the final result.
- When such selection is done, a simple BLAST experiment using all contigs from the assembly as queries and the full transcriptome of the known organism as the target can be easily performed (details of this procedure are presented in Section 3 of Additional file 1 for the Arabidopsis assembly).
- The general lines of the procedure are, first, to obtain all BUSCO terms that correspond to all contigs in the assembly, and upon gathering these terms, use the sets of representative contigs obtained with Compacta with a grid of d values.
- As with BLAST, with BUSCO we can obtain a point d that will correspond to the maximum efficiency, max (Ef), but in this case the Ef for each value of d is defined as the number of BUSCO terms found over the number of contigs.
- An additional advantage of the BUSCO approach is that the terms found will generally have straightforward biological interpretations, which is useful for understanding differential expression analyses.
- An ideal clustering algorithm for assemblies will correctly identify all contigs that arise from transcription of the same locus and cluster all of these contigs together.
- the cases (in each of the 4 cells of the table) is represented by a, b, c and d, whereas the sum of the frequencies, a + b + c + d, gives the total number of pair-wise contig comparisons.
- A disadvantage of CD-HIT is that the clusters it produces contain only a small number of contigs (those that are highly similar at the sequence level), and the user has little control over the degree of assembly compression (data not shown).
- The higher complexity of the mouse assembly relative to that for Arabidopsis can explain the generally lower recall ratios of the four programs and that higher numbers of false negatives can be produced with complex assemblies.
- Even when Compacta did not have the highest precision in the assemblies studied, it was faster than the other programs, and, importantly, can be adjusted to yield an optimum number of clusters with the highest efficiency (see previous section).
- Furthermore, because Compacta does not try to determine the gene origin of each contig, its results are independent of the RNA-Seq experimental design (treatments), whereas Corset and Grouper are affected by experimental factors due to the use of statistical tools to try to estimate genes from which contigs originated.
- In most cases de novo assemblies produce an excess number of contigs, many of which represent minor transcription variants from expression of the same gene.
- contigs that are subsets of the same reads are clustered together and thus represented by a single contig.
- Recommended Hardware: The memory needed to run the software is determined by the size of the largest input BAM file.
- FR-M programmed Compacta, performed the analyses and wrote the first draft of the manuscript.
- CH-K and OM conceived the research, supervised both software development and analyses and wrote the final version of the manuscript.
- FR-M acknowledges the Mexican Council of Science and Technology (CONACyT) for support from a PhD scholarship (261122) during the development of the project.
- The funding body did not contribute to the design of the study or collection, analysis and interpretation of data and in writing the manuscript.
- The source code and standalone executable of the version of Compacta used in this study are available at https://doi.org/10.5281/zenodo.3469484.
- Characterizing the major structural variant alleles of the human genome.
- Most of the human genome is transcribed.
- Genome and transcriptome analysis of the mesoamerican common bean and the role of gene duplications in establishing tissue and temporal specialization of genes.
- Genomics: The amazing complexity of the human transcriptome: Nature Publishing Group
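
The edge weight and the heap-driven agglomerative step summarized above can be illustrated with a minimal Python sketch. It is not Compacta's source code. It assumes each contig is represented by the set of IDs of the reads that map to it, and that merging two entities simply unions their read sets before weights are re-calculated; the summary does not specify how the re-calculation is done. The names `edge_weight`, `cluster_contigs` and `example_reads` are hypothetical.

```python
import heapq
from itertools import combinations


def edge_weight(reads_i, reads_j):
    """w_ij = R_ij / min(R_i, R_j) for two sets of read IDs."""
    if not reads_i or not reads_j:
        return 0.0  # an entity without reads shares nothing
    return len(reads_i & reads_j) / min(len(reads_i), len(reads_j))


def cluster_contigs(reads_by_contig, d=0.3):
    """Merge the heaviest edge repeatedly while its weight is at least d.

    Assumption: a merged cluster is described by the union of its members'
    read sets. Edges with weight 0 are never pushed, so disconnected
    contigs are never merged.
    """
    # cluster id -> (set of contig names, set of read IDs)
    clusters = {
        idx: ({name}, set(reads))
        for idx, (name, reads) in enumerate(reads_by_contig.items())
    }
    next_id = len(clusters)

    # heapq is a min-heap, so weights are negated to pop the heaviest edge.
    heap = []
    for a, b in combinations(clusters, 2):
        w = edge_weight(clusters[a][1], clusters[b][1])
        if w > 0:
            heapq.heappush(heap, (-w, a, b))

    while heap:
        neg_w, a, b = heapq.heappop(heap)
        if a not in clusters or b not in clusters:
            continue  # stale edge: one endpoint was already merged
        if -neg_w < d:
            break  # heaviest remaining edge is below the threshold
        members = clusters[a][0] | clusters[b][0]
        reads = clusters[a][1] | clusters[b][1]
        del clusters[a], clusters[b]
        # Re-calculate weights between the new cluster and the others.
        for other, (_, other_reads) in clusters.items():
            w = edge_weight(reads, other_reads)
            if w > 0:
                heapq.heappush(heap, (-w, next_id, other))
        clusters[next_id] = (members, reads)
        next_id += 1

    return sorted(sorted(members) for members, _ in clusters.values())


# Toy data (invented for illustration): read IDs mapped to four contigs.
example_reads = {
    "c1": {1, 2, 3, 4},
    "c2": {3, 4},
    "c3": {4, 5, 6, 7, 8},
    "c4": {9},
}
```

With this toy input, `cluster_contigs(example_reads, d=0.3)` groups `c1` with `c2` (all reads of the smaller contig are shared) and leaves `c3` and `c4` alone, whereas `d=0` also absorbs `c3`, because any non-zero weight is then enough to merge. Only the top of the heap is inspected at each step, which is the time saving the summary refers to.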

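The coverage filter and the clustering efficiency defined above can also be written out as a short Python sketch. It reads "effective contig coverage" as $s_i / c_i$, the summed length of the mapped reads divided by the contig length; this is an interpretation of the summary, not a formula quoted from it. The function names, the toy contigs and the threshold value of 1.0 are hypothetical.

```python
def effective_coverage(contig_length, read_lengths):
    """s_i / c_i: how many times the mapped reads could cover the contig."""
    return sum(read_lengths) / contig_length


def filter_low_coverage(contigs, min_coverage):
    """Keep contigs whose effective coverage is at least min_coverage.

    `contigs` maps a contig name to (contig_length, list_of_read_lengths).
    Using >= rather than > is an assumption.
    """
    return {
        name: (length, reads)
        for name, (length, reads) in contigs.items()
        if effective_coverage(length, reads) >= min_coverage
    }


def clustering_efficiency(n_identified, n_clusters):
    """Ef = n_As / z: identified sequences per representative contig."""
    return n_identified / n_clusters


# Two toy contigs with the same number of reads (10 reads of 100 bases each)
# but different lengths: a fixed read-count threshold cannot tell them apart.
toy_contigs = {
    "short": (500, [100] * 10),   # coverage 2.0
    "long": (5000, [100] * 10),   # coverage 0.2
}
kept = filter_low_coverage(toy_contigs, min_coverage=1.0)

# Ratio of the identified-sequence counts quoted for d = 1 and d = 0.
ratio_identified = 23607 / 3344
```

Here only `short` survives the filter, which illustrates why a length-aware measure separates contigs that a fixed read count treats as equivalent.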