
Sketch distance-based clustering of chromosomes for large genome database compression


Abstract

- Reference-based compression algorithms have exhibited outstanding performance on compressing single genomes.
- Results: We propose an efficient clustering-based reference selection algorithm for reference-based compression within separate clusters of the n genomes.
- This method clusters the genomes into subsets of highly similar genomes using MinHash sketch distance, and uses the centroid sequence of each cluster as the reference genome for an outstanding reference-based compression of the remaining genomes in each cluster.
- A final reference is then selected from these reference genomes for the compression of the remaining reference genomes.
- Our method significantly improved the performance of state-of-the-art compression algorithms on large-scale human and rice genome databases containing thousands of genome sequences.
- The compression ratio gain can reach up to 20-30%.
- Conclusions: The compression ratio of reference-based compression on large-scale genome datasets can be improved via reference selection by applying appropriate data preprocessing and clustering methods.
- Keywords: NGS data, Data compression, Reference-based compression, Unsupervised learning.
- 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
- Existing reference-based genome compression algorithms include GDC [9], GDC2 [10], iDoComp [11], ERGC [12], HiRGC [13], CoGI [14], RLZAP [15], MSC [16], RCC [17], NRGC [18].
- A straightforward way to apply these reference-based compression algorithms to the challenging problem of compressing a database of n genome sequences is to conduct one-by-one sequential reference-based compression for every genome in the database using one fixed reference genome.
- A critical issue of this straightforward approach is the performance variation: the performance of reference-based algorithms highly depends on the similarity between the target and reference sequence, which can cause non-trivial performance variation in the compression of the same target sequence when a different reference is used.
- For instance, in a set of eight genome sequences, the compression ratios for genome hg19 by GDC2 [10] using seven different reference genomes varied remarkably from 51.90 to 707.77 folds [13].
- Therefore, clustering similar genomes and specific reference identification within the clusters are of great significance in the compression of large-scale genome databases.
- We propose ECC, an Efficient Clustering-based reference selection algorithm for the Compression of genome databases.
- Instead of using one fixed reference sequence as the literature methods do, our idea is to cluster the genome sequences of the database into subsets such that genomes within one subset are more similar to each other than to the genomes in the other subsets, and then to select the centroid genome as the reference within each cluster for the compression.
- We use the MinHash technique [21, 22] to measure the distance between sequences and to construct a distance matrix of the genomes for the clustering.
- Then the q minimal hash values are sorted to form the sketch of each sequence, for a small number q.
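As a hedged illustration of this sketching step (the function name and hash choice are our own; the actual implementation, like Mash, uses MurmurHash), the bottom-q sketch of a sequence can be computed as:

```python
# Illustrative bottom-q MinHash sketch: hash every k-mer of a sequence and
# keep only the q smallest hash values, sorted.
import hashlib

def minhash_sketch(seq, k=21, q=1000):
    """Return the q smallest distinct k-mer hash values of `seq`, sorted."""
    hashes = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # A stable 64-bit hash of the k-mer (Mash itself uses MurmurHash).
        h = int.from_bytes(
            hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")
        hashes.add(h)
    return sorted(hashes)[:q]

sketch = minhash_sketch("ACGTACGTGGCTA" * 10, k=5, q=8)
```

Because only q values per genome are kept, two sketches can later be compared in O(q) time regardless of sequence length.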
- 1. Construct a distance matrix of the n genome sequences using the pairwise sketch distance method Mash [22].
- 2. Cluster the genomes from the distance matrix and select the centroid sequence of each cluster as its reference genome.
- 3. Compress the target sequences within each cluster by a reference-based compression algorithm; a final reference sequence is then selected for the compression of the remaining reference sequences.
- (ii) Our initial setting of the centroids in the clustering is not random as in RCC, but determined by an analysis of the whole database; (iii) the reference selection within the clusters is also decided by the clustering method, instead of by the reconstruction of the original target genome set as in RCC.
- The third point implies that our method compresses the sequence set without the need to record additional information in the result.
- Although GDC2 is so far the best reference-based algorithm for the compression of the Human 1000 Genomes Database, its reference was selected externally to the database.
- However, when the user is unfamiliar with the similarity between sequences in a given set, the selection of one fixed reference sequence may result in very poor performance on dissimilar target sequences and a long running time in the compression.
- In contrast, the reference selection by ECC is decided by the clustering step, and all the references are internal genomes of the database, which are required to be compressed anyway.
- More related work is discussed in detail in the next section to highlight the novelty of our method.
- In the experiments, we compared the performance on genome databases between the straightforward reference-fixed compression approach and our clustering approach ECC for the state-of-the-art reference-based compression algorithms.
- Our approach achieved 22.05% compression gain against the best case of the reference-fixed compression approach on a set of 60 human genomes collected from NCBI, where the compression ratio increases from 351.74 folds to 443.51 folds.
- On the union set of the Human 1000 Genomes Project and the 60-genome NCBI dataset, the compression ratio increases from 2919.58 folds to 3033.84 folds.
- The assembled whole genome sequencing data are in the FASTA format.
- The common idea of the existing reference-based genome compression algorithms is to map subsequences in the target genome sequence to the reference genome sequence [8].
- Firstly, an index such as a hash table or a suffix array is constructed from the reference genome to reduce the time complexity of the search process.
- Then an encoding strategy such as LZ77 [27] is applied to parse the target sequence into position and length pairs of subsequences with regard to the reference sequence, or into mismatched subsequences.
- For instance, a subsequence in the target sequence may be encoded as “102 72”, meaning that it is identical to the subsequence of length 72 starting at position 102 (i.e., positions 102 to 173) in the reference genome.
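This mapping idea can be sketched as a toy encoder; this is an illustration of the general scheme, not the algorithm of any cited tool. It greedily extends matches found through a k-mer index of the reference and falls back to single-character literals:

```python
# Toy reference-based encoder: emit ("match", position, length) tokens where
# the target matches the reference, and ("literal", char) tokens elsewhere.
def encode(target, reference, k=4):
    index = {}  # k-mer -> list of positions in the reference
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    tokens, i = [], 0
    while i < len(target):
        best = None
        for pos in index.get(target[i:i + k], []):
            length = 0
            while (i + length < len(target) and pos + length < len(reference)
                   and target[i + length] == reference[pos + length]):
                length += 1
            if best is None or length > best[1]:
                best = (pos, length)
        if best and best[1] >= k:
            tokens.append(("match", best[0], best[1]))  # e.g. "102 72"
            i += best[1]
        else:
            tokens.append(("literal", target[i]))
            i += 1
    return tokens

def decode(tokens, reference):
    out = []
    for t in tokens:
        out.append(reference[t[1]:t[1] + t[2]] if t[0] == "match" else t[1])
    return "".join(out)

ref = "AAAACGTACGTTTTT"
tgt = "GGACGTACGTTT"
tokens = encode(tgt, ref)  # two literals, then one long match
```

Decoding with the same reference reconstructs the target exactly, which is what makes the scheme lossless.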
- For a set of target genome sequences, the similarity between the reference sequence and the selected target sequence has a large effect on compression ratio.
- Existing attempts for reference selection in the compression of genome sequence databases can be categorized into three types.
- The first category selects a single reference genome to perform one-by-one sequential reference-based compression on all target genomes, which is named the straightforward reference-fixed approach as in the previous section.
- Most of the reference-based compression algorithms applied this approach to genome set compression and select the single reference sequence randomly from the genome database, such as HiRGC [13], GECO [28], ERGC [12], iDoComp [11], CoGI [14], RLZ-opt [29], RLZAP [15].
- The second category of algorithms utilizes not only one fixed reference for the compression of all sequences, but also the inter-similarity of the whole sequence set.
- MSC [16] utilizes both intra-sequence and inter-sequence similarities for compression by searching for subsequence matches in the reference sequence and in other parts of the target sequence itself; the compression order is determined by a recursive full-search algorithm.
- RCC [17] performs clustering on the local histograms of the dataset and derives a representative sequence of each cluster as the reference sequence for the corresponding cluster.
- A schematic diagram of the method is shown in Fig.
- The pairwise sketch distances of the sequences are computed to form a distance matrix.
- Denote the set of hash values of the constituent k-mers of S_i as H(S_i).
- It can also be understood that calculating the sketch distance between two long sequences is much more efficient than a direct comparison of their k-mer feature vectors.
- The efficiency becomes significant, especially in the construction of the whole distance matrix M.
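A simplified version of this sketch-distance computation can be written as follows. It uses the published Mash distance formula D = -(1/k) ln(2j / (1 + j)), but the bottom-q Jaccard estimate here is our own simplification of what Mash does:

```python
# Simplified Mash-style sketch distance: estimate the Jaccard index j from
# the bottom-q sketch of the union of two sketches, then convert it to a
# mutation-rate distance D = -(1/k) * ln(2j / (1 + j)).
import math

def sketch_distance(sketch_a, sketch_b, k=21, q=1000):
    merged = sorted(set(sketch_a) | set(sketch_b))[:q]  # bottom-q of the union
    shared = len(set(merged) & set(sketch_a) & set(sketch_b))
    j = shared / len(merged)                            # Jaccard estimate
    if j == 0.0:
        return 1.0  # maximally distant under this simplification
    return -math.log(2 * j / (1 + j)) / k

# Identical sketches give distance 0; disjoint sketches give the maximum.
d_same = sketch_distance([1, 2, 3], [1, 2, 3], k=21, q=3)
d_diff = sketch_distance([1, 2, 3], [4, 5, 6], k=21, q=3)
```

Filling the whole n-by-n matrix M then costs O(n^2 q) sketch comparisons, independent of genome length, which is the efficiency argument made above.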
- Clustering of chromosomes from the distance matrix. Clustering is the process of grouping a set of samples into a number of subgroups such that similar samples are placed in the same subgroup.
- An important step in the process of clustering is to determine the number of clusters in the data.
- Subtractive clustering is an extension of the Mountain method [34].
- It estimates cluster centroid based on the density of points in the data space.
- K-medoids clustering of the collection of n genomic sequences. K-medoids is a partition-based cluster analysis method.
- The clustering result highly depends on the setting of the initial centroids.
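The clustering step can be sketched as follows: a density-based initialization in the spirit of subtractive clustering (points with many close neighbours score highly) followed by standard K-medoids refinement over the precomputed distance matrix M. The kernel radius and parameter names are illustrative assumptions, not the paper's exact settings:

```python
# Density-initialized K-medoids over a precomputed distance matrix M.
import math

def density_init(M, K, radius=0.5):
    """Pick the K points of highest neighbourhood density as initial medoids."""
    n = len(M)
    dens = [sum(math.exp(-(M[i][j] / radius) ** 2) for j in range(n))
            for i in range(n)]
    return sorted(range(n), key=lambda i: -dens[i])[:K]

def k_medoids(M, K, iters=100):
    medoids = density_init(M, K)
    for _ in range(iters):
        # Assign every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(len(M)):
            clusters[min(medoids, key=lambda m: M[i][m])].append(i)
        # Re-pick each medoid as the member minimising total in-cluster distance.
        new = [min(members, key=lambda c: sum(M[c][j] for j in members))
               for members in clusters.values()]
        if sorted(new) == sorted(medoids):
            break
        medoids = new
    return medoids, clusters

# Two obvious groups: sequences {0, 1} are close, and so are {2, 3}.
M = [[0.0, 0.1, 0.9, 1.0],
     [0.1, 0.0, 0.8, 0.9],
     [0.9, 0.8, 0.0, 0.1],
     [1.0, 0.9, 0.1, 0.0]]
medoids, clusters = k_medoids(M, 2)
```

The density-based start addresses the sensitivity to initial centroids noted above: instead of a random draw, the initial medoids are determined by the structure of the whole distance matrix.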
- First, use O_i as the reference sequence for the other sequences in cluster C_i.
- Then, select a final reference R from the centroid set as the reference for the other centroid sequences.
- In detail, all the sequences in cluster C_i are compressed using O_i as the reference sequence, except O_i itself.
- Then all the reference sequences except R are compressed using R as the reference sequence.
- All reference-based compression algorithms can take this clustering approach to compress a set of genomic sequences.
- Then the reference sequence of each cluster is decompressed by R, and all the remaining sequences in each cluster are decompressed by the reference sequence of their cluster.
- As the process is invertible, the compression scheme is lossless as long as the used reference-based compression algorithm is lossless.
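The two-level compression order described above can be sketched as follows, with a placeholder `compress` standing in for any lossless reference-based compressor; the function and variable names are hypothetical:

```python
# Two-level reference scheme: within each cluster the centroid O_i is the
# reference (level 1); the centroids themselves are compressed against a
# final reference R (level 2). R itself is stored uncompressed.
def compress(target, reference):
    # Placeholder: a real implementation would emit match/literal tokens.
    return ("compressed", target, reference)

def compress_database(clusters, centroids, final_ref):
    archive = {}
    for cid, members in clusters.items():
        for seq in members:
            if seq != centroids[cid]:
                archive[seq] = compress(seq, centroids[cid])        # level 1
    for cid in clusters:
        if centroids[cid] != final_ref:
            archive[centroids[cid]] = compress(centroids[cid], final_ref)  # level 2
    return archive

clusters = {0: ["g1", "g2", "g3"], 1: ["g4", "g5"]}
centroids = {0: "g2", 1: "g4"}
archive = compress_database(clusters, centroids, final_ref="g2")
```

Decompression simply runs the two levels in reverse: recover the centroids from R first, then the remaining members of each cluster from their centroid.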
- To assess the performance of our proposed method ECC, we compare the compression ratio achieved via ECC with that of the reference-fixed compression approach on multiple genome databases.
- In particular, the compression ratio and running time of our algorithm are presented and discussed in comparison with the reference-fixed compression approach.
- Our algorithm was implemented in the C++11 language.
- Six state-of-the-art reference-based compression algorithms were tested on the three genome databases to understand the performance improvement achieved by our clustering approach in comparison with the reference-fixed compression approach.
- Some algorithms were not runnable for the compression of long genome sequences (such as human and rice) due to their time complexity.
- For GDC2, as its two-level compression structure tends to compress all the target sequences using the same reference, we compress the datasets using the final reference selected by ECC, and the compression order of GDC2 is also adjusted in accordance with the ECC clustering result.
- As mentioned before, the performance of a reference-based algorithm on an NGS dataset is highly dependent on the choice of the reference sequence.
- To reduce the variance from an arbitrary selection, we randomly selected multiple reference sequences from the target dataset and obtained the compression performance with each of them for the compression algorithms (the randomly selected reference file itself is not compressed, so all experiments compress the same number of genome sequences).
- To measure the performance improvement, we denote the compression ratio with a fixed single reference as C_S and the compression ratio on the same dataset with ECC as C_E, and introduce the relative compression ratio gain as G = (C_E − C_S) / C_E × 100%.
- Due to page limitations, we only report the compression gain against the best result of the reference-fixed compression approach for the reference-based compression methods.
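As a quick sanity check of the gain definition (assuming G = (C_E − C_S) / C_E, which reproduces the 3.77% figure reported for the union dataset from its ratios of 2919.58 and 3033.84 folds):

```python
# Relative compression ratio gain G = (C_E - C_S) / C_E, in percent,
# checked against the ratios reported above for the union dataset.
def gain(c_s, c_e):
    return (c_e - c_s) / c_e * 100.0

g = gain(2919.58, 3033.84)  # ≈ 3.77 (%)
```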
- Our proposed ECC method outperforms the reference-fixed compression approach in all cases on dataset-60 (see Table 1).
- The compression gains against the best results by the reference-fixed compression approach are reported in Table 1.
- Performance details are summarized in Table 2 for HiRGC, iDoComp and GDC2 which are the three algorithms of the highest compression performance on dataset-60.
- The ratio gain of GDC2 is only 3.77%, but more importantly, ECC helped GDC2 avoid 3 of the 7 time-consuming cases in the reference-fixed approach.
- On the rice genome dataset-2818, through our ECC clustering approach, HiRGC gained 13.89% compression performance against the best case by the reference-fixed compression approach, iDoComp gained 21.22%, and GDC2 gained 2.48% (Table 3).
- The compression ratio gain of HiRGC is more stable than on the first two human genome databases.
- One reason is that all the genomes in the rice database were aligned to the sequenced rice cultivar 93-11 (indica variety) [37].
- Hence this dataset has a higher inter-similarity and the variance from the random selection of the fixed reference is smaller..
- From these comparisons, we can see that our ECC clustering approach makes a significant compression improvement for most of the state-of-the-art algorithms and can avoid selecting some inappropriate references.
- Running time is an essential factor for measuring the applicability of an algorithm in the compression of large-scale genome databases. The running time of ECC includes two parts: reference selection time (only depending on the input sequence set) and the compression time (depending on the input sequence set and the reference-based compression algorithm).
- The detailed compression time of each reference-based compression algorithm with different references is listed in Additional file 1.
- As shown in Table 4, ECC took h on the reference selection part for dataset-60, dataset-1152 and rice genome dataset-2818 respectively.
- But the compression time for these three datasets are h (Table 5) by HiRGC, which is the fastest algorithm in the compression.
- The reference selection time is much shorter than the sequence compression time..
- We have also observed that the total time of reference selection and compression by ECC is highly competitive with the reference-fixed compression approach.
- In fact, the compression time via ECC after the reference selection is shorter than the compression time of the reference-fixed compression in most cases, except GDC2 on the dataset-1152 (Table 5).
- In this work, we introduced ECC, a clustering-based reference selection method for the compression of genome databases.
- The time of the reference-fixed approach is the average running time over several fixed single-reference cases for each algorithm; please see the supplementary file for the time range of all the cases and the compression time by ERGC, SCCG and NRGC.
- This algorithm is universal for genome sequence sets of the same species.
- We have demonstrated that the six state-of-the-art reference-based compression algorithms all achieved a substantial improvement after the clustering of the genome sequences, with compression times similar to those of the reference-fixed approach.
- The reference for a newly added sequence can be selected via a heuristic method.
- By making full use of the k-mer features computed in the distance matrix construction stage, it is possible to construct a universal sequence by merging k-mers with suffix-prefix overlaps.
- The compression time (in hours) for algorithms using different references on the three datasets.
- We thank Xuan Zhang of the University of Technology Sydney (UTS) for expertise on the genome sequence datasets.
- This article has been published as part of BMC Genomics Volume 20 Supplement 10, 2019: Proceedings of the Joint International GIW &.
- The full contents of the supplement are available online at https://.
- All authors contributed to the preparation and development of the manuscript.
- All data associated with this study are available in the Supplementary Data file under Additional Files.
