
QuantTB – a method to classify mixed Mycobacterium tuberculosis infections within whole genome sequencing data

- In contrast, whole genome sequencing offers sensitive views of the genetic differences between strains of M..
- Tuberculosis, one of the oldest diseases in the world, continues to devastate the lives of millions per year.
- Finally, spoligotyping analyzes a series of 43 spacer oligonucleotides in the direct repeat region [12].
- In addition, these approaches only examine a small portion of the genome, and were not originally intended for the detection of mixed infections.
- Although QuantTB can use either assemblies or raw sequencing reads for the construction of the reference database, assemblies are the preferred input.
- Assemblies represent aggregate, error-corrected versions of the corresponding read set and will yield superior results.
- Selecting high quality SNPs for each genome present in the reference database is paramount to the success of our method.
- In the analysis presented here, we extracted SNPs from the 5637 reference assemblies that passed quality filtering for our reference database.
- The complete collection of SNP sequences in the reference database is stored in a binary matrix, where rows are genomes and columns are locus/allele pairs (Fig.
- To remedy this issue, a custom SNP-based representation of the H37Rv sequence was generated, based on the.
- If the same variant is observed in almost all the genomes in the reference database, we designate this as an H37Rv specific variant, i.e.
- Therefore, QuantTB generates an “H37Rv SNP sequence” including positions where more than 75% of the genomes in the reference database have a common allele that differs from H37Rv.
- These locations are a fingerprint for H37Rv-like strains that distinguishes them from the rest of the database.
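The binary matrix and the 75% rule described above can be illustrated with a small sketch. The genome names, the toy `(position, allele)` pairs, and the helper `common_variants` are hypothetical and are not taken from QuantTB's code; the only value from the text is the 75% cutoff.

```python
# Toy SNP calls relative to H37Rv: genome -> set of (position, allele) pairs.
# All names and values are illustrative, not real data.
snp_calls = {
    "genomeA": {(100, "T"), (250, "G"), (900, "C")},
    "genomeB": {(100, "T"), (400, "A"), (900, "C")},
    "genomeC": {(100, "T"), (250, "G"), (900, "C")},
    "genomeD": {(100, "T"), (700, "G")},
}

genomes = sorted(snp_calls)
columns = sorted(set().union(*snp_calls.values()))  # locus/allele pairs

# Binary matrix: rows are genomes, columns are locus/allele pairs.
matrix = [[1 if pair in snp_calls[g] else 0 for pair in columns] for g in genomes]


def common_variants(matrix, columns, min_fraction=0.75):
    """Return locus/allele pairs carried by more than `min_fraction` of genomes."""
    n_genomes = len(matrix)
    shared = []
    for j, pair in enumerate(columns):
        carriers = sum(row[j] for row in matrix)
        if carriers / n_genomes > min_fraction:
            shared.append(pair)
    return shared


# Variants shared by more than 75% of the genomes mark positions where an
# H37Rv-like strain would instead carry the reference allele.
h37rv_markers = common_variants(matrix, columns)
```

In this toy data only `(100, "T")` is carried by more than 75% of the genomes; `(900, "C")` sits at exactly 75% and is therefore not selected.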
- First, SNPs from the sample are compared against SNP sequences in the reference database to calculate a strain presence score for every genome in the database.
- Strain presence scores are calculated for every genome in the reference database.
- two steps: 1) extracting SNPs from a sample; 2) iterative classification of strains in the sample.
- Insertions, deletions, bases with low quality (Phred less than 11) and bases within PE/PPE regions are removed, as in the construction of the reference database.
- The end result is a dictionary containing the extracted allele coverages and frequencies for every SNP position identified in the database.
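A minimal sketch of the extraction step, assuming the reads have already been reduced to per-base observations of the form `(position, base, phred)`. That input shape, the function name `allele_table`, and the example intervals are hypothetical; the Phred cutoff of 11 and the exclusion of indels and PE/PPE regions come from the text.

```python
from collections import defaultdict

MIN_PHRED = 11                      # bases with Phred below 11 are dropped
VALID_BASES = {"A", "C", "G", "T"}  # anything else (e.g. indel markers) is skipped


def allele_table(observations, db_positions, excluded_regions=()):
    """Build {position: {allele: (coverage, frequency)}} for database SNP positions.

    observations: iterable of (position, base, phred) tuples (stand-in for a pileup)
    db_positions: set of SNP positions present in the reference database
    excluded_regions: inclusive (start, end) intervals, e.g. PE/PPE genes
    """
    counts = defaultdict(lambda: defaultdict(int))
    for pos, base, phred in observations:
        if pos not in db_positions:
            continue
        if phred < MIN_PHRED or base not in VALID_BASES:
            continue
        if any(start <= pos <= end for start, end in excluded_regions):
            continue
        counts[pos][base] += 1

    table = {}
    for pos, alleles in counts.items():
        depth = sum(alleles.values())
        table[pos] = {a: (n, n / depth) for a, n in alleles.items()}
    return table


observations = [
    (100, "T", 30), (100, "T", 35), (100, "C", 30), (100, "T", 5),  # last one: low quality
    (250, "G", 40), (250, "-", 40),                                 # "-" stands for a deletion
    (500, "A", 40),                                                 # not a database position
    (900, "C", 40),                                                 # inside an excluded region
]
table = allele_table(observations, db_positions={100, 250, 900},
                     excluded_regions=[(850, 950)])
```

Here position 100 ends up with two supporting reads for `T` and one for `C`, while positions 500 and 900 are absent from the result.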
- Iterative classification of strains in the sample.
- The steps of the algorithm are as follows:
- $O_i$ represents the fraction of SNPs from a particular reference genome, $i$, that was observed in the sample.
- The higher $O_i$, the more likely the set of SNPs observed in the sample originated from genome $i$.
- This threshold, $t_a$, is dynamic and determined by the average coverage of the sample, $C_{sample}$, and the average coverage of the genome identified in the previous iteration, $C_{G_{k-1}}$.
- For each iteration $k$, the threshold is set as 5% of the average coverage of the strain identified in the previous iteration.
- This is initialized at $k = 0$ as 5% of the sample coverage ($C_{sample}$).
- Applying a coverage threshold diminishes the effect of random errors in the sample, while retaining sensitivity for true variation.
- $A_i$ represents the frequency with which a particular genome’s SNPs account for all the allelic variants present in the sample.
- The strain presence score, $s_i$, is calculated as the average of $O_i$ and $A_i$, and the genome with the highest $s_i$ is selected as being present in the sample.
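The scoring step can be sketched as below. The excerpt does not give the exact formula for the second term, so this sketch assumes it is the share of the sample's observed variants that belong to the genome's SNP set. The function name, argument names, and toy data are hypothetical; the 5% threshold and the averaging of the two terms follow the text.

```python
def presence_scores(db, sample_cov, previous_coverage, fraction=0.05):
    """Score every genome in `db` against the sample.

    db: {genome: set of (position, allele)}
    sample_cov: {(position, allele): coverage}
    previous_coverage: sample coverage at the first iteration, otherwise the
        average coverage of the previously identified strain
    """
    threshold = fraction * previous_coverage
    observed = {snp for snp, cov in sample_cov.items() if cov >= threshold}

    scores = {}
    for genome, snps in db.items():
        if not snps or not observed:
            scores[genome] = 0.0
            continue
        shared = snps & observed
        o_i = len(shared) / len(snps)      # fraction of the genome's SNPs seen in the sample
        a_i = len(shared) / len(observed)  # assumed: fraction of sample variants explained
        scores[genome] = (o_i + a_i) / 2
    return scores


db = {
    "g1": {(1, "A"), (2, "C"), (3, "G"), (4, "T")},
    "g2": {(1, "A"), (5, "G")},
}
sample_cov = {(1, "A"): 40, (2, "C"): 38, (3, "G"): 41, (5, "G"): 1, (6, "T"): 39}

scores = presence_scores(db, sample_cov, previous_coverage=40)
best = max(scores, key=scores.get)
```

With a coverage of 40 the threshold is 2 reads, so the single read supporting `(5, "G")` is ignored and `g1` scores higher than `g2`.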
- Remove the chosen genome’s SNPs from the database and sample: before the next iteration begins, SNPs corresponding to the chosen genome are 1) removed from each SNP sequence in the database and 2) removed from the sample.
- In addition, any H37Rv alleles present in the sample at positions outside of the identified genomes’ SNP sequences are also removed.
- This is because those alleles have already been accounted for by their presence in the identified genome.
- Because it is unlikely that the true strain present in the sample shares the exact collection of SNPs with its highest scoring match in the database, additional SNPs from the sample could match erroneously across multiple other genomes in the database with enough coverage to be marked as ‘observed’.
- To account for spuriously detected genomes due to higher coverages (greater than 25), we only allow strains to be detected in a sample when their prevalence accounts for at least 1% of the sample coverage.
- Before starting the next iteration, a check is performed to ensure that a sufficient number of SNPs (15) still remain in the sample and in the database for reliable classification.
- At the end of the iterations, relative abundance is calculated by taking the average coverage of unique SNPs for each genome in the sample.
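The iteration described above can be sketched as a simplified loop. This is not QuantTB's implementation: the removal of H37Rv alleles is omitted, the score uses an assumed form for the second term, the 15-SNP check is applied per genome and to the sample, and `max_iterations` is a hypothetical safety bound. The constants 5%, 1%, and 15 are the values quoted in the text.

```python
MIN_SNPS = 15              # SNPs that must remain for another iteration
MIN_PREVALENCE = 0.01      # a strain must reach 1% of the sample coverage
THRESHOLD_FRACTION = 0.05  # coverage threshold relative to the previous strain


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def classify(db, sample_cov, sample_coverage, max_iterations=10):
    """Return {genome: relative abundance} for strains detected in a sample.

    db: {genome: set of (position, allele)}
    sample_cov: {(position, allele): coverage}
    """
    original = {g: set(snps) for g, snps in db.items()}
    remaining_db = {g: set(snps) for g, snps in db.items()}
    remaining_sample = dict(sample_cov)
    previous_coverage = sample_coverage
    found = []

    for _ in range(max_iterations):
        threshold = THRESHOLD_FRACTION * previous_coverage
        observed = {s for s, c in remaining_sample.items() if c >= threshold}
        if len(observed) < MIN_SNPS:
            break

        best, best_score = None, 0.0
        for genome, snps in remaining_db.items():
            if len(snps) < MIN_SNPS:
                continue
            shared = snps & observed
            score = (len(shared) / len(snps) + len(shared) / len(observed)) / 2
            if score > best_score:
                best, best_score = genome, score
        if best is None:
            break

        coverage = mean(remaining_sample[s] for s in remaining_db[best] & observed)
        if coverage < MIN_PREVALENCE * sample_coverage:
            break

        found.append(best)
        previous_coverage = coverage
        # Remove the chosen genome's SNPs from the database and from the sample.
        chosen = remaining_db.pop(best)
        for snps in remaining_db.values():
            snps -= chosen
        for snp in chosen:
            remaining_sample.pop(snp, None)

    # Relative abundance from the average coverage of SNPs unique to each strain.
    abundance = {}
    for genome in found:
        others = set().union(*(original[o] for o in found if o != genome))
        unique = original[genome] - others
        abundance[genome] = mean(sample_cov[s] for s in unique if s in sample_cov)
    total = sum(abundance.values())
    return {g: a / total for g, a in abundance.items()} if total else {}


db = {
    "X": {(p, "A") for p in range(1, 31)},
    "Y": {(p, "A") for p in range(101, 121)},
    "Z": {(p, "A") for p in range(201, 221)},
}
sample_cov = {(p, "A"): 80 for p in range(1, 31)}
sample_cov.update({(p, "A"): 20 for p in range(101, 121)})
result = classify(db, sample_cov, sample_coverage=100)
```

In this toy mixture, strain `X` is picked first, its SNPs are removed, and `Y` is picked in the second pass; `Z` is never reported because none of its SNPs are observed.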
- In order to identify presence or absence of a resistance phenotype in the sample, QuantTB uses a curated set of SNPs conferring antibiotic resistance to 7 TB drugs generated from the previous study of Manson et al.
- QuantTB outputs the results of the resistance testing in a separate file, if the appropriate command-line flag is set.
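A resistance lookup of this kind can be sketched as a table scan. The entries below are placeholders, not real resistance loci, and the function name, the table layout, and the 5% minimum allele fraction are assumptions made for illustration only.

```python
# Placeholder entries only: positions, alleles and drug names are NOT real loci.
RESISTANCE_TABLE = [
    {"position": 1000, "reference": "C", "mutation": "T", "drug": "drug_1"},
    {"position": 2000, "reference": "G", "mutation": "A", "drug": "drug_2"},
]


def resistance_calls(allele_cov, table, min_fraction=0.05):
    """Return {drug: fraction of reads carrying a resistance allele}.

    allele_cov: {position: {allele: coverage}}
    min_fraction: assumed cutoff below which a resistance allele is ignored
    """
    calls = {}
    for entry in table:
        alleles = allele_cov.get(entry["position"], {})
        depth = sum(alleles.values())
        if depth == 0:
            continue
        fraction = alleles.get(entry["mutation"], 0) / depth
        if fraction >= min_fraction:
            calls[entry["drug"]] = max(fraction, calls.get(entry["drug"], 0.0))
    return calls


allele_cov = {1000: {"C": 87, "T": 13}, 2000: {"G": 50}}
calls = resistance_calls(allele_cov, RESISTANCE_TABLE)
```

Reporting the allele fraction rather than a yes/no call keeps mixed resistant and susceptible populations at a locus visible in the output.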
- To generate synthetic two-strain mixtures at different relative abundances, we randomly selected 100 pairs of assemblies from each of the d50 and d100 reference databases.
- Using the reference genomes from the d10 database (see Methods), we randomly selected 200 genomes such that each TB lineage was represented in proportion to its relative incidence in the overall dataset, with a minimum requirement of five representatives.
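One way to draw a lineage-proportional selection with a minimum of five representatives is sketched below. The function name, the lineage labels, and the rounding rule are assumptions; because of rounding and the minimum, the total can differ slightly from the target of 200.

```python
import random
from collections import Counter, defaultdict


def sample_by_lineage(genome_lineage, total=200, minimum=5, seed=0):
    """Pick genomes so each lineage is represented in proportion to its size."""
    rng = random.Random(seed)
    by_lineage = defaultdict(list)
    for genome, lineage in sorted(genome_lineage.items()):
        by_lineage[lineage].append(genome)

    n_all = len(genome_lineage)
    picked = []
    for lineage, members in by_lineage.items():
        quota = max(minimum, round(total * len(members) / n_all))
        quota = min(quota, len(members))  # cannot pick more than exist
        picked.extend(rng.sample(members, quota))
    return picked


# Hypothetical dataset: 1000 genomes across four lineages.
sizes = {"L4": 600, "L2": 300, "L1": 90, "L7": 10}
genome_lineage = {
    f"{lineage}_{i}": lineage for lineage, n in sizes.items() for i in range(n)
}
picked = sample_by_lineage(genome_lineage)
counts = Counter(genome_lineage[g] for g in picked)
```

In this example the smallest lineage would receive only two genomes under strict proportionality, so the minimum raises it to five.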
- False positive (FP) refers to the number of identified strains that were not present in the sample.
- False negative (FN) refers to the number of strains present in the sample that were not identified.
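These counts combine into precision, recall, and the F1 score in the standard way. The sketch below assumes predictions and ground truth are available as sets of strain identifiers; the function name and the example strain names are hypothetical.

```python
def strain_metrics(predicted, truth):
    """Return (tp, fp, fn, precision, recall, f1) for one sample."""
    tp = len(predicted & truth)  # strains correctly identified
    fp = len(predicted - truth)  # identified but not present
    fn = len(truth - predicted)  # present but not identified
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return tp, fp, fn, precision, recall, f1


tp, fp, fn, precision, recall, f1 = strain_metrics(
    predicted={"s1", "s2", "s5"},
    truth={"s1", "s2", "s3", "s4"},
)
```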
- Comprehensive TB reference database captures the breadth of the Mycobacterium tuberculosis species: QuantTB requires a reference database of known M.
- The number in the box plot is the median distance of all pairs of samples from that lineage.
- There was good concordance between the diversity represented in the complete data set (Fig.
- We used five reference databases that varied both in size and in the genetic distance between representative genomes (Table 1).
- The ability to use a large reference database is a substantial advantage for QuantTB over StrainSeeker and Sigma, since the number of publicly available TB sequences in NCBI that could be included in the database is increasing rapidly.
- 3b), achieving F1 scores above 0.9 at all coverages above 1x per strain, indicating that QuantTB was almost always able to predict all four strains in the synthetic mixes correctly.
- The decreased SNP counts in these very low-coverage simulations led QuantTB to predict only one of the strains present for.
- At 1× coverage per strain, QuantTB still performed adequately, with only a slight performance dip noticeable in the largest database containing 4933 strains differing by at least 10 SNPs.
- We observed that the lower performance occurred mostly because QuantTB would predict a genetically similar strain instead of the correct strain.
- The setup represented a more realistic scenario, where strains in the samples (sourced from the d50 database) were not already present in the database (d10small).
- QuantTB identified the correct number of strains (two) in the majority of samples (72.
- Sigma failed to predict the correct number of strains in any sample, predicting at least 9 strains for all of the samples (Fig.
- Therefore, as genomes from the d50 database were used as test samples and tested against genomes in the d10small database, we evaluated the accuracy of strain predictions by assigning a true positive to each strain in a sample if QuantTB predicted the ‘correct’ relative genome in the d10small database (i.e.
- Thus genomes in the samples were not present in the underlying database the tools were trained on.
- This lets us see how well each tool predicts the correct number of strains and the correct relative abundance between strains if the ‘correct’ strain in the sample is not already present in the database.
- strains, suggests that QuantTB is able to accurately predict the correct number of strains even in cases where a near-identical strain is not already present in the database.
- For both databases, QuantTB accurately classified the identity of each strain in the pair (F1 measure of 0.98 and 0.92 for the d100 and d50 databases, respectively, Additional file 7: Table S5) and accurately determined the relative abundance for each strain in the pair (Fig.
- The majority of relative abundances predicted were within 0.05 of the correct value (Additional file 2: Figure S2).
- Even in the few cases where QuantTB predicted the incorrect strain, QuantTB predicted it to be present in the sample at the correct relative abundance.
- Bryant et al.
- In the original study, a sample was labeled as mixed if the number of heterozygous loci exceeded a threshold, and as a reinfection if the SNP distance between pairs exceeded a threshold.
- As it is impossible to know the identity of the strains present in the real samples in advance, we limited analysis to the multiplicity, or the number of strains identified in each sample.
- QuantTB reported a consistently low (0–2) number of strains, and identified the same seven samples as mixed, irrespective of the database used as a reference, which was in agreement with the expected strain multiplicity based on Bryant et al.
- Samples 42 and 45 were identified as mixed infections in the original study.
- As the coverage for the H37Rv reference strain was high in these three samples, our analysis supports the hypothesis that the three culture-negative isolates resulted in the sequencing of the H37Rv laboratory strain.
- We observed that most sister leaves in the tree were part of the same sample isolate pair, representing relapse cases.
- 5, box C) and the two isolates of Sample 15 were found on opposite ends of the tree (Fig.
- Samples labeled as Clinically TB negative on follow up were cases in which the second isolate of the pair was assigned to the H37Rv strain by QuantTB and tested negative for TB in the original study.
- But with QuantTB this occurrence can be explained by reviewing the strain identities, because QuantTB outputs which genome has been detected in the sample.
- Overall, QuantTB and the manual curation presented in the original study resulted in agreement for 43 of the 47 sample predictions (91.
- In the remaining cases, we have presented reasons why QuantTB’s prediction may be at least as accurate as the original manual designations.
- QuantTB provides insight into antibiotic resistance: using QuantTB, we determined the antibiotic resistance genotype for each of the isolates.
- Antibiotic resistance was indicated if the sample had a SNP in one of the antibiotic-resistance-causing loci from a previously published curated list (see Methods) [24].
- Tips are labeled with the isolate number and its part of the pair (a or b), and are colored by its isolate classification as predicted by QuantTB.
- To the right of the mixed and reinfection isolates, we show the strains present in the isolate as predicted by QuantTB.
- Boxes are discussed in the main text.
- We found no relation between mixed infections and heteroresistance, nor did we find evidence of the emergence of antibiotic resistance within a relapse case.
- of alleles were of the resistance phenotype, and 87% were susceptible.
- irrespective of the relative.
- QuantTB: 1) outputs the specific identity of the strain, making the tracking of specific strains across samples possible; 2) outputs the abundances of every strain identified in the sample, enabling the quick identification of major and minor subpopulations.
- This may have occurred in two of the samples we surveyed in the data of Bryant et al.
- In addition, QuantTB predicted the closest related genome in the database for these strains in 94% of the samples.
- In order to ensure erroneous SNPs are not considered, QuantTB disregards SNPs present at less than 5% abundance relative to that of the previously identified strain.
- Therefore, QuantTB can only detect mixed infections in which the minor strain represents at least 5% of the allelic variation.
- QuantTB’s ability to detect a strain not in the database depends on how distant it is from its nearest relative in the database.
- [24] we found antibiotic resistance in five samples, one being a case of heteroresistance in the second isolate of its sample pair.
- We did not observe any relationship between antibiotic resistance and mixed infections in the clinical.
- Since Bryant et al.
- The databases vary in the number of genomes and the minimum SNP distance between strains.
- Determination of the lineage is explained in the methods section.
- Whether or not it passed QC filtering is indicated in the ‘passed’ column.
- is the H37Rv reference allele at that position, mutation is the corresponding resistance-causing mutation, and the drug initial is indicated in the ‘drug’ column.
- Lineage specification method can be found in the methods section.
- The funding bodies played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
- Patients with active tuberculosis often have different strains in the same sputum specimen
