
A data science approach for the classification of low-grade and high-grade ovarian serous carcinomas


Summary

- Ovarian serous carcinomas can be classified into two largely mutually exclusive grades, low grade and high grade, based on their histologic features.
- Based on the study of ovarian serous carcinomas, we explore the methodology of combining CNA reporting from low-coverage sequencing with machine learning techniques to stratify tumor biospecimens of different grades.
- The proposed method called Bag-of-Segments is used to summarize fixed-length CNA features predictive of tumor grades.
- High accuracy is obtained for classifying ovarian serous carcinoma into high and low grades based on leave-one-out cross-validation experiments.
- The models that are weakly influenced by the sequence coverage and the purity of the sample can also be built, which would be of higher relevance for clinical applications.
- The patterns captured by Bag-of-Segments features correlate with current clinical knowledge: low grade ovarian tumors are related to aneuploidy events associated with mitotic errors, while high grade ovarian tumors are induced by DNA repair gene malfunction.
- Conclusions: The proposed data-driven method obtains high accuracy with various parametrizations for the ovarian serous carcinoma study, indicating good potential to generalize to other CNA classification tasks.
- This method could be applied to the more difficult task of classifying ovarian serous carcinomas with ambiguous histology or in those with low grade tumor co-existing with high grade tumor.
- The closer genomic relationship of these tumor samples to low or high grade may provide important clinical value.
- Full list of author information is available at the end of the article.
- 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
- Small deletion events may target local regions of the genome harboring tumor suppressor genes, while amplifications preferentially target oncogene locations [1].
- The recurrent CNAs across tumor types have been studied in an attempt to gain a deeper understanding of the pan-cancer mechanisms driving tumorigenesis.
- Ovarian serous carcinomas, previously thought to be a disease continuum with a spectrum of differentiation from well to poorly differentiated, are now classified into two distinct categories, low grade and high grade serous carcinomas, based on their histologic features.
- Additionally, low grade serous carcinomas have a more indolent prognosis and respond less well to standard platinum-based chemotherapy than high grade serous carcinomas [4].
- Not only is the precision of CNA detection enhanced, but the number of copy changes can also be defined more accurately.
- The method can be used to assist grade classification of ovarian serous carcinomas, especially for cases with ambiguous morphology.
- The developed method, called Bag-of-Segments (referring to CNA segments), is derived from the Bag-of-Features method.
- Bag-of-Features has been extensively used in the classification of image objects [6] and time series data [7].
- Although currently surpassed by other methods such as deep learning, Bag-of-Features remains an ideal approach when dealing with small sample sizes, as is the case in our study.
- The Bag-of-Segments was used to obtain a fixed-length data representation of CNA segments, whose number varies between samples.
- The Bag-of-Segments approach was used to generate the features needed to develop a classification model for grading ovarian serous carcinoma, and was trained on CNA segments of 14 high-grade and 20 low-grade carcinoma samples.
- The analysis of the Bag-of-Segments features highlights two different underlying biological processes: one involves large-scale deletions or amplifications suggesting abnormal mitotic events, while the other involves local amplifications and deletions commonly associated with DNA repair malfunctions.
- The methodological approach includes several steps: 1) the processing of the low coverage sequencing data and reporting of CNAs using an in-house developed tool, 2) the application of the Bag-of-Segments method to extract predictive CNA patterns and 3) the training of a classification model to predict the histologic type (low or high grade) of a sample.
- Among these patients, 14 cases were diagnosed with high-grade and the remaining 20 with low-grade ovarian serous carcinoma based on the histologic review of the surgical material from tumor debulking surgery.
- Photomicrographs of low grade and high grade serous carcinoma examples are shown in Fig. 1.
- Fig. 1 Photomicrograph of low grade and high grade serous carcinoma cases. a Low grade serous carcinoma, 20X. b High grade serous carcinoma, 20X.
- classify ovarian serous tumors into low grade and high grade groups.
- This parameter is controlled by the user and can be adjusted to suit the needs of the project.
- The selection of the optimal Cp value is described in a following section.
- Bag-of-Segments.
- The Bag-of-Segments approach used for this project is derived from the bag-of-features methodology [6, 7].
- The height is measured as the log2 ratio to the median coverage of the sample, and the width is measured as a proportion of the chromosome length.
- The Bag-of-Segments is used to summarize the CNA profile of a sample in a limited set of features comparable across samples.
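The two segment measurements described above can be sketched as follows. This is a minimal illustration, not the authors' in-house tool: the function names, the per-bin coverage list, and the use of a single coverage value per segment are all assumptions made for the example.

```python
import math
from statistics import median

def segment_height(segment_coverage, sample_coverages):
    """log2 ratio of a segment's coverage to the sample's median coverage."""
    return math.log2(segment_coverage / median(sample_coverages))

def segment_width(start, end, chromosome_length):
    """Segment length expressed as a proportion of its chromosome's length."""
    return (end - start) / chromosome_length

# Hypothetical per-bin coverages for one sample (illustrative values only).
bin_coverages = [100, 100, 100, 200, 200, 50, 100]

height_amp = segment_height(200, bin_coverages)  # coverage doubled relative to the median
height_del = segment_height(50, bin_coverages)   # coverage halved relative to the median
width = segment_width(20_000_000, 45_000_000, 100_000_000)
```

With these definitions an amplified segment has a positive height, a deleted segment a negative height, and a segment close to the sample median a height near zero, while the width always lies between 0 and 1.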
- Let $h_{\alpha}$ and $h_{1-\alpha}$ denote the $\alpha$ and $1-\alpha$ quantiles of the segment heights, and let $w_{\alpha}$ and $w_{1-\alpha}$ denote the $\alpha$ and $1-\alpha$ quantiles of the segment widths.
- For each individual sample, the empirical frequency distribution of its segments over these 9 classes generates the Bag-of-Segments representation.
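A minimal sketch of this construction is given below. It rests on several assumptions that the summary does not confirm: the quantile thresholds are computed on segments pooled over all samples, a simple linear-interpolation quantile is used, every sample has at least one segment, and classes are named with a width letter (N, M, W for narrow, medium, wide) followed by a height letter (D, N, A for deleted, normal, amplified), by analogy with the NA and WN classes named in the text. All function names are hypothetical.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with 0 <= q <= 1."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

WIDTH_LABELS = ("N", "M", "W")   # Narrow, Medium, Wide
HEIGHT_LABELS = ("D", "N", "A")  # Deleted, Normal, Amplified

def bin_index(value, low, high):
    """0 below the lower threshold, 2 above the upper one, 1 otherwise."""
    if value < low:
        return 0
    if value > high:
        return 2
    return 1

def bag_of_segments(samples, alpha=0.25):
    """Map {sample_id: [(height, width), ...]} to {sample_id: {class: frequency}}."""
    # Assumption: thresholds come from the segments of all samples pooled together.
    heights = [h for segs in samples.values() for h, _ in segs]
    widths = [w for segs in samples.values() for _, w in segs]
    h_lo, h_hi = quantile(heights, alpha), quantile(heights, 1 - alpha)
    w_lo, w_hi = quantile(widths, alpha), quantile(widths, 1 - alpha)
    classes = [w + h for w in WIDTH_LABELS for h in HEIGHT_LABELS]
    features = {}
    for sample_id, segs in samples.items():
        counts = dict.fromkeys(classes, 0)
        for h, w in segs:
            label = (WIDTH_LABELS[bin_index(w, w_lo, w_hi)]
                     + HEIGHT_LABELS[bin_index(h, h_lo, h_hi)])
            counts[label] += 1
        # Frequencies rather than counts, so samples with different
        # numbers of segments remain comparable.
        features[sample_id] = {c: n / len(segs) for c, n in counts.items()}
    return features

# Hypothetical toy input: (height, width) pairs per sample.
samples = {
    "s1": [(0.0, 0.9), (0.0, 0.8), (0.05, 1.0), (0.0, 0.95)],
    "s2": [(0.8, 0.01), (0.9, 0.02), (-0.7, 0.01), (0.0, 0.5)],
}
features = bag_of_segments(samples, alpha=0.25)
```

Each sample ends up with the same 9 keys regardless of how many segments it has, which is what makes the representation fixed-length.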
- Cp and α are two parameters that respectively control the complexity of the CNA segment landscape and the CNA segment classes defined by the Bag-of-Segments approach.
- These two parameters are set by the user and are adjustable as a function of the sequencing coverage, the complexity of the genomic alterations, and the quality of the sequencing results, which largely depends on the quality of the starting material (sample degradation, …)
- Fig. 2 Bag-of-Segments representation workflow.
- For each combination, the average accuracy of the Random Forest (RF) model (discussed in the next section) was obtained from 10 repetitions of LOOCV.
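The evaluation loop behind such an average can be sketched as below. Note the substitution: to keep the example dependency-free, a nearest-centroid classifier stands in for the Random Forest, so it illustrates only the repeated leave-one-out procedure and not the model used in the study. The function names and the toy data are hypothetical.

```python
import random

def nearest_centroid_predict(train_X, train_y, test_x, rng):
    """Stand-in classifier (NOT a Random Forest): label of the closest class mean.

    `rng` is unused here; a stochastic model such as RF would draw from it,
    which is why repeating LOOCV is informative for such models.
    """
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(test_x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv_accuracy(X, y, fit_predict, rng):
    """Hold out each sample once, train on the rest, score the held-out prediction."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += fit_predict(train_X, train_y, X[i], rng) == y[i]
    return hits / len(X)

def repeated_loocv(X, y, fit_predict, repeats=10, seed=0):
    """Average LOOCV accuracy over several repetitions."""
    rng = random.Random(seed)
    total = sum(loocv_accuracy(X, y, fit_predict, rng) for _ in range(repeats))
    return total / repeats

# Hypothetical two-feature toy data, well separated by construction.
X = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.1], [0.1, 0.9], [0.2, 0.8], [0.15, 0.95]]
y = ["high", "high", "high", "low", "low", "low"]
accuracy = repeated_loocv(X, y, nearest_centroid_predict)
```

Running this function once per (Cp, α) combination and keeping the averaged accuracies would give a parameter grid of the kind the sentence above describes.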
- A model was developed to classify samples into low-grade and high-grade serous carcinomas.
- We use a Random Forest (RF) [10], an ensemble model trained on the 9-feature Bag-of-Segments representation.
- RF is an ensemble of decision tree models, each trained on a bootstrap sample of the original training data.
- In addition to its strong performance, RF provides two benefits. First, it provides a continuous probability score for each sample, indicating how likely the sample is to be high-grade, by counting the proportion of votes from the tree models.
- Second, the Gini importance score typically used in RF models enables the identification of the important features.
- More specifically, at each node split when fitting each tree model, the Gini impurity [11] calculated from the two child nodes should be smaller than that of the parent node.
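As a worked illustration of the impurity comparison at a single split, the sketch below computes the Gini impurity of a parent node and the size-weighted impurity of its two children. The function names and the toy labels are assumptions for the example; in a typical RF implementation, decreases of this kind are accumulated per feature across all trees to obtain the Gini importance.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Hypothetical split of 8 samples (4 high-grade, 4 low-grade).
parent = ["high"] * 4 + ["low"] * 4
left = ["high"] * 4 + ["low"] * 1
right = ["low"] * 3
decrease = impurity_decrease(parent, left, right)
```

Here the parent impurity is 0.5, the children have impurities 0.32 and 0, and the weighted child impurity is 0.2, so the split reduces impurity by 0.3.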
- Cp = 0.05 and α = 0.25 were identified as one of the well-performing settings and were selected to generate the data used by the classification model.
- b 2D distribution of the segment width and height for the segmentation in a.
- The corresponding 2D distribution of the segment width and height is shown in Fig.
- In Fig. 4, the 2D joint distribution of the segment height and width, as well as their marginal distributions, are obtained after aggregation over the 34 samples.
- The segments from the high- and low-grade samples are colored in red and blue, respectively.
- In Fig. 4, we can easily observe that segments from the low-grade and high-grade samples follow different joint and marginal distributions in terms of their height and width.
- The results of the two-sample Kolmogorov-Smirnov tests provided in Table 2 further confirm our observation.
- Based on the magnitude of the p-values, the width distributions differ more between the two grades than the height distributions.
- This observation lays the foundation of our Bag-of-Segments representation.
- In Fig. 4, the quantiles (α = 0.25) of the segment height and the segment width are indicated by the black horizontal and vertical lines, which delimit the 9 CNA segment classes.
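The two-sample Kolmogorov-Smirnov comparison mentioned above is based on the largest gap between the empirical distribution functions of the two groups. The sketch below computes only that statistic, not the p-values reported in Table 2, and the width values are invented for illustration.

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observation from both samples is enough to find the maximum.
    for x in sa + sb:
        cdf_a = bisect_right(sa, x) / len(sa)
        cdf_b = bisect_right(sb, x) / len(sb)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical segment widths (proportion of chromosome length).
low_widths = [0.6, 0.7, 0.8, 0.9, 1.0]
high_widths = [0.01, 0.02, 0.05, 0.1, 0.7]
d_width = ks_statistic(low_widths, high_widths)
```

A statistic near 1 means the two groups barely overlap, and a statistic near 0 means their distributions are nearly identical. Converting the statistic into a p-value requires the sample sizes and the Kolmogorov distribution, which a statistics library would normally handle.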
- The Bag-of-Segments representation is obtained based on the frequency distribution over the segment classes as.
- This representation is used as the input of the RF model.
- Our method shows high accuracy in the ovarian serous carcinoma classification study.
- performance of the model.
- We investigated the independent contribution of the 2 dominant classes (NA, WN) by performing another set of 10 repetitions of LOOCV using the same RF model.
- The model displays an average accuracy close to the previous one (99%), highlighting that these two features are sufficient to capture the determinative information from the data.
- Generalization of the model.
- We expected the risk of overfitting the model to be low, since the Bag-of-Segments representation includes only 9 features, which is very small compared to the dimensionality of the original data (over 25 thousand).
- Table 2 Test results of the two-sample Kolmogorov-Smirnov tests for the segment width and segment height.
- Finally, good model performance is obtained with multiple parameter settings and is only weakly affected by small changes in parameterization, which is also an indicator of low overfitting risk and good generalization.
- Model Interpretation and clinical relevance of the most significant classes.
- Table 3 Bag-of-Segments representation based on the distribution over the CNA segment classes.
- classes in group 1 are more frequent in the high-grade samples than in the low-grade samples, which are dominated by segment classes in group 2.
- Fig. 6 RF importance score for Bag-of-Segments features.
- Fig. 7 Correlation plot for Bag-of-Segments features.
- However, the significant contribution of the height-related group of CNA segments to the model could be of concern, since the height-related features are more affected by the sequencing coverage.
- This ratio may vary significantly as a function of the type of tissue biopsied, possibly affecting the performance of the model.
- This representation may also be considered as a linear combination of the original Bag-of-Segments representation, for example feature W = WA + WN + WD, and a similar relationship applies to feature M and feature N.
- The average accuracy of the model was 100%, suggesting that width-based features alone can accurately classify our ovarian dataset and could therefore be used to support clinical decisions.
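The width-only representation can be sketched as a sum over the height classes of each width class. The relationship W = WA + WN + WD is taken from the text; the analogous class names for M and N (MA, MN, MD and NA, NN, ND), the function name, and the example frequencies are assumptions.

```python
def width_features(bag):
    """Collapse a 9-class Bag-of-Segments vector into W, M and N.

    Each width feature is the sum of the three classes sharing that width
    letter, e.g. W = WA + WN + WD.
    """
    return {w: sum(bag[w + h] for h in "AND") for w in "WMN"}

# Hypothetical 9-class frequencies for one sample (they sum to 1).
bag = {
    "NA": 0.25, "NN": 0.05, "ND": 0.10,
    "MA": 0.10, "MN": 0.10, "MD": 0.05,
    "WA": 0.05, "WN": 0.25, "WD": 0.05,
}
collapsed = width_features(bag)
```

Because the height letter is summed out, the collapsed features depend only on how wide the segments are, which is the property that makes them less sensitive to sequencing coverage.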
- Fig. 8 Box plots for the values of Bag-of-Segments features.
- In future research, the analysis of more samples will help provide a better understanding of the utility of this method in the classification of ovarian serous carcinoma, especially those with challenging morphology and immune profile.
- In this manuscript, we describe a new data-driven approach to classify ovarian serous carcinoma into high grade and low grade types with high accuracy.
- The proposed Bag-of-Segments method is used to summarize the CNA features from sequencing coverage data.
- The Bag-of-Segments was used to derive 9 features predictive of tumor type.
- Due to the high correlation between several of the 9 features, models of lower dimensionality could be built.
- We demonstrated that Narrow Amplified (NA) and Wide Normal (WN) CNA features were sufficient to discriminate low grade from high grade ovarian tumor samples.
- The patterns captured by these two groups correlate with current clinical knowledge: low grade ovarian tumors are related to aneuploidy events associated with mitotic errors, while high grade ovarian tumors are induced by DNA repair gene malfunction.
- Beyond methodological interest, this result indicates that models can be built that are only weakly influenced by the sequence coverage and by the purity of the sample, defined by the ratio of tumor to normal cells in a sample.
- Finally, we believe that this new method could be applied to the more challenging task of classifying ovarian serous carcinomas with ambiguous histology or those with low grade tumor co-existing with high grade tumor.
- We are collecting these morphologically challenging ovarian serous carcinoma cases, and by modeling the low grade and high grade serous carcinoma, we hope to be able to characterize their molecular nature and gain a better understanding of their pathogenesis.
- Classification of ovarian serous carcinomas as low or high grade based on their genomics may provide valuable clues on how best to manage these patients in the clinic.
- The funding body had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
- Sangdi/Bag-of-Segments.
- SL and CW were the main contributors to the developed methodology.
- SL, JPAK and CW were the main contributors to the manuscript.
- All authors contributed to the scope of the work, discussed and sanitized the results, and read, revised and approved the final manuscript.
- Molecular alterations of tp53 are a defining feature of ovarian high-grade serous carcinoma: a rereview of cases lacking tp53 mutations in the.
- A bag-of-features framework to classify time series
