
Regression trees for regulatory element identification


Summary

- When applied to two data sets of the yeast Saccharomyces cerevisiae, the method successfully identifies most of the regulatory motifs that are known to control gene transcription under the given experimental conditions, and suggests several new putative motifs.
- Analysis of the tree structures also reconfirms several pairs of motifs that are known to regulate gene transcription in combination.
- The transcription of a gene is controlled by diverse regulatory proteins called transcription factors (TFs), which bind to specific DNA sequences in the promoter region of the gene.
- Each TF recognizes a unique family of binding sites based on sequence binding preferences that arise through the energetic interactions between the atoms of the TF and those of the DNA sequence.
- How a collection of TFs regulates the transcription of a gene depends to a large extent on the binding sites found in the gene’s promoter.
- Hence, identifying and characterizing regulatory motifs that serve as TF binding sites is important for our understanding of the complex regulation of gene expression.
- First, genes with similar expression patterns across experimental conditions are grouped together by applying clustering analysis to genome-wide expression data sets (Eisen et al., 1998).
- There are many algorithms and methods that can be used to search for conserved sequences (Lawrence et al., 1993; van Helden et al., 1998).
- Bussemaker et al. (2001) noted that despite the success in the identification of many motifs, this strategy has a drawback: there are genes in the cluster without the motif, and many genes with the motif do not respond.
- In a recent paper, Segal et al.
- A Bayesian score is used to evaluate how well a regulation program can explain the expression behavior of the genes in the module as a function of the expression level of a small set of regulators.
- A motif finder then searches for conserved motifs upstream of the genes in each module.
- Another approach to the identification of regulatory motifs is based on the association between gene expression values and the abundance of motifs (Bussemaker et al., 2001; Keles et al., 2002).
- A linear regression procedure is used to fit the model and select the motifs that contribute most heavily to the model.
- The underlying assumption is that if a motif represents a functional binding site for an active TF, then the presence of the motif contributes additively to the expression level of the gene under the given experimental condition.
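The additive assumption can be illustrated with a minimal sketch. It is not the procedure of any of the cited papers: it fits one motif at a time by ordinary least squares and ranks motifs by residual sum of squares. The function names and the toy data are hypothetical.

```python
from statistics import mean


def fit_single_motif(counts, expr):
    """Least-squares fit of expr = a + b * counts for a single motif.

    Returns the intercept a, the slope b (the additive contribution of one
    motif copy) and the residual sum of squares.
    """
    mx, my = mean(counts), mean(expr)
    sxx = sum((x - mx) ** 2 for x in counts)
    sxy = sum((x - mx) * (y - my) for x, y in zip(counts, expr))
    b = sxy / sxx if sxx > 0 else 0.0  # a motif with constant counts gets slope 0
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(counts, expr))
    return a, b, rss


def rank_motifs(counts_by_motif, expr):
    """Order motif names from best to worst single-motif fit (smallest RSS first)."""
    return sorted(counts_by_motif,
                  key=lambda m: fit_single_motif(counts_by_motif[m], expr)[2])


# Toy data: expression rises by one unit per copy of the hypothetical motif "A".
expr = [0.1, 1.1, 2.1, 0.1, 1.1, 3.1]
counts_by_motif = {"A": [0, 1, 2, 0, 1, 3], "B": [1, 0, 1, 0, 1, 0]}
ranking = rank_motifs(counts_by_motif, expr)
```

A full additive model would fit all selected motifs jointly; the one-at-a-time fit is used here only to keep the sketch short.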
- Conlon et al.
- In this paper, we formulate the problem in the regression framework and present a method for identifying regulatory motifs using tree-based regression models.
- The tree-based regression paradigm was introduced by Breiman et al.
- We evaluate the importance of the motifs by analyzing the structure of the tree as well as using a technique based on surrogate splits.
- In this section we give a brief introduction to the tree-based models.
- We refer the reader to Breiman et al.
- In the context of motif identification, the p variables X_i are the motifs, the response Y is the expression level for a single time point and the n learning cases are the genes.
- The left and right child nodes contain disjoint subsets of the parent content and are defined by splitting the parent node.
- The cutoff value c is in the range of observed values of x_i.
- A split is chosen so that the distributions of responses in the child nodes are as homogeneous as possible.
- Two such split functions are discussed by Breiman et al.
- The non-negativity of the split function ensures that recursive splitting will create smaller nodes with increased homogeneity.
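As a rough illustration of how a split can be chosen, the sketch below assumes the least-squares split function (reduction in the within-node sum of squares), which is one of the standard choices; the summary does not show which split function the paper actually uses. The names `sum_sq` and `best_split` are hypothetical.

```python
def sum_sq(ys):
    """Within-node sum of squared deviations from the node mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(xs, ys):
    """Search cutoffs c over the observed values of one variable.

    Cases with x <= c go left, the others go right. Returns the cutoff with
    the largest gain SS(parent) - SS(left) - SS(right), together with that
    gain. The gain is never negative, so splitting never reduces homogeneity.
    """
    parent = sum_sq(ys)
    best_cutoff, best_gain = None, 0.0
    for c in sorted(set(xs))[:-1]:  # the largest value would leave the right node empty
        left = [y for x, y in zip(xs, ys) if x <= c]
        right = [y for x, y in zip(xs, ys) if x > c]
        gain = parent - sum_sq(left) - sum_sq(right)
        if gain > best_gain:
            best_cutoff, best_gain = c, gain
    return best_cutoff, best_gain


# Toy data: genes with at least one motif copy have clearly higher expression.
motif_counts = [0, 0, 1, 2, 0, 3]
expression = [0.0, 0.2, 2.0, 2.2, 0.1, 2.1]
cutoff, gain = best_split(motif_counts, expression)
```

In a full tree builder this search would be repeated over every candidate motif at every node, keeping the motif and cutoff with the largest gain.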
- 1% of the sum of squares of the root node.
- where V is the covariance matrix of y_i in the root node and y(g.
- Segal (1992) noted that taking V to be the pooled sample covariance matrix of the responses in the root node corresponds to a two-sample Hotelling’s T^2 statistic.
- With the new generalized split functions, the recursive algorithm proceeds as in the case with a single response.
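The generalized split function itself is not reproduced in this summary, so the following is only a sketch under an assumed form: node impurity is taken to be the sum of quadratic forms (y_i - mean)' V^{-1} (y_i - mean) over the genes in the node. The inverse of V is passed in directly to keep the example free of matrix inversion; all names are hypothetical.

```python
def mean_vector(Y):
    """Column-wise mean of a list of response vectors."""
    n = len(Y)
    return [sum(col) / n for col in zip(*Y)]


def node_impurity(Y, V_inv):
    """Sum over cases of (y - mean)' V_inv (y - mean) within one node."""
    if not Y:
        return 0.0
    mu = mean_vector(Y)
    total = 0.0
    for y in Y:
        d = [a - b for a, b in zip(y, mu)]
        total += sum(d[j] * V_inv[j][k] * d[k]
                     for j in range(len(d)) for k in range(len(d)))
    return total


def split_gain(Y, go_right, V_inv):
    """Impurity reduction achieved by sending the flagged cases to the right child."""
    left = [y for y, r in zip(Y, go_right) if not r]
    right = [y for y, r in zip(Y, go_right) if r]
    return (node_impurity(Y, V_inv)
            - node_impurity(left, V_inv)
            - node_impurity(right, V_inv))


# Toy data: two time points, two well-separated groups, identity matrix as V_inv.
Y = [[0.0, 0.0], [0.0, 0.0], [2.0, 2.0], [2.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
gain = split_gain(Y, [False, False, True, True], identity)
```

With the identity matrix this reduces to the ordinary sum of squares summed over time points; a non-diagonal V_inv additionally accounts for correlation between time points.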
- 2.3 Determining the tree size.
- Breiman et al.
- 0, (4) where α is the complexity parameter that penalizes the number of terminal nodes.
- Once R(G) is defined, steps (b) and (c) are carried out as described by Breiman et al. (1984).
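To make the role of α concrete, here is a small sketch that assumes the standard cost-complexity form R(G) + α × (number of terminal nodes); equation (4) is only partly visible in this summary, so the exact form used in the paper is an assumption. The candidate subtrees and their error values are invented for illustration.

```python
def cost_complexity(error, n_terminal_nodes, alpha):
    """Penalized cost of a tree: its error plus alpha per terminal node."""
    return error + alpha * n_terminal_nodes


def select_subtree(candidates, alpha):
    """Pick the (error, n_terminal_nodes) pair with the smallest penalized cost."""
    return min(candidates, key=lambda c: cost_complexity(c[0], c[1], alpha))


# Hypothetical nested subtrees: larger trees fit better but pay a larger penalty.
candidates = [(10.0, 1), (4.0, 3), (3.5, 6)]
small_penalty = select_subtree(candidates, alpha=0.0)
medium_penalty = select_subtree(candidates, alpha=0.5)
large_penalty = select_subtree(candidates, alpha=5.0)
```

As α grows, the selected subtree shrinks toward the root, which is why α is usually chosen by cross-validation rather than fixed in advance.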
- (1) to predict the values of the response variables in the future;
- Our main purpose is not to predict but to discover those predictor variables (motifs) that are most relevant to the responses.
- Hence, a question of interest is: which variables are the most important? In other words, how do we rank those variables that, while not giving the best split of a node and thus not appearing in the tree structure, may give the second-best or third-best split?
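One way to picture the ranking of variables that do not appear in the tree is through surrogate splits. The sketch below is a simplification and not the paper's procedure: for a single node it scores each alternative motif by how closely its best cutoff reproduces the partition made by the primary split. Function names and data are hypothetical.

```python
def best_surrogate(primary_right, xs):
    """Find the cutoff on xs whose partition best agrees with the primary split.

    primary_right[i] is True when case i goes right under the primary split.
    Agreement is the fraction of cases sent the same way; the reversed
    direction is also allowed, hence the max().
    """
    n = len(xs)
    best_cutoff, best_agreement = None, 0.0
    for c in sorted(set(xs))[:-1]:
        same = sum((x > c) == r for x, r in zip(xs, primary_right))
        agreement = max(same, n - same) / n
        if agreement > best_agreement:
            best_cutoff, best_agreement = c, agreement
    return best_cutoff, best_agreement


def rank_surrogates(primary_right, counts_by_motif):
    """Return (motif, cutoff, agreement) tuples, best agreement first."""
    scored = [(name, *best_surrogate(primary_right, xs))
              for name, xs in counts_by_motif.items()]
    return sorted(scored, key=lambda t: t[2], reverse=True)


# Toy node with six genes; motif "A" mimics the primary split exactly.
primary_right = [False, False, True, True, True, False]
counts_by_motif = {"A": [0, 0, 1, 2, 1, 0], "B": [1, 0, 1, 0, 1, 0]}
ranked = rank_surrogates(primary_right, counts_by_motif)
```

An importance score in the spirit of Breiman et al. would accumulate the improvement of each variable's surrogate split over all nodes of the tree; the single-node agreement used here is only the first ingredient.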
- When a transcription factor binds to an appropriate motif, it regulates the expression level of the respective gene.
- where x_ij is the number of times motif j appears in the promoter of gene i.
- where y_ij is the logarithm base 2 of the expression level of gene i in sample point j.
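The two matrices can be assembled as in the sketch below. It assumes exact string matching against a consensus sequence, which is a deliberate simplification and not a weight-matrix scan; the promoter strings, motif strings and expression values are made up.

```python
import math


def count_occurrences(promoter, motif):
    """Count exact, possibly overlapping, occurrences of motif in promoter."""
    count = 0
    start = promoter.find(motif)
    while start != -1:
        count += 1
        start = promoter.find(motif, start + 1)
    return count


def build_matrices(promoters, motifs, expression):
    """Return X (genes x motifs counts) and Y (genes x samples log2 levels)."""
    X = [[count_occurrences(p, m) for m in motifs] for p in promoters]
    Y = [[math.log2(v) for v in row] for row in expression]
    return X, Y


# Two hypothetical genes, two hypothetical motifs, two sample points.
promoters = ["ACGCGTACGCGT", "TTTTTT"]
motifs = ["ACGCGT", "TTT"]
expression = [[2.0, 4.0], [1.0, 0.5]]
X, Y = build_matrices(promoters, motifs, expression)
```

Row i of X and row i of Y refer to the same gene, so each gene is one learning case for the tree.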
- The simplest way is to enumerate all the sequences in the genes’ promoters as potential motifs (Bussemaker et al., 2001.
- Second, most of the sequences considered are not present in the final models, yet they add noise that affects the model selection process.
- Our method is similar to that proposed by Conlon et al.
- These motifs can be found using a motif finding algorithm such as AlignACE (Hughes et al., 2000) or MEME (Bailey and Elkan, 1995).
- 3.2.1 Candidate motifs For experiments, we used the set of 356 motifs compiled by Pilpel et al.
- The motif matrices are derived by applying the motif finding program AlignACE to the upstream regions of genes in the MIPS (Mewes et al., 2000) functional categories.
- Of the 356 motifs, 25 are known motifs described in the biological literature.
- For each motif, the number of times the motif appears in the promoter regions of genes is counted by applying program ScanACE (Hughes et al., 2000).
- When applying ScanACE to the promoters of the genes in Saccharomyces cerevisiae, Pilpel et al.
- 3.2.2 Microarray data We tested the tree model on two sets of microarray data for the yeast S. cerevisiae, specifically, the cell cycle data set of Cho et al. (1998) and the sporulation data set of Chu et al.
- Cho et al.
- Following Tavazoie et al.
- In this selection, the metric of variation was the ratio between the standard deviation and the mean of the expression levels of each gene across the time points.
- Of the 3000 genes retained, only 2584 genes contain motifs from our set of candidate motifs.
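The selection step can be sketched as follows. The choice of the population standard deviation and the handling of a zero mean are assumptions made for this example, and the gene names and values are invented.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Ratio of standard deviation to mean for one gene's expression levels."""
    m = mean(values)
    if m == 0:
        return 0.0  # assumption: raw levels are positive, so this case is unexpected
    return pstdev(values) / abs(m)


def top_variable_genes(expr, k):
    """Return the k gene names with the largest coefficient of variation."""
    ranked = sorted(expr, key=lambda g: coefficient_of_variation(expr[g]),
                    reverse=True)
    return ranked[:k]


# Three hypothetical genes measured at four time points.
expr = {
    "g1": [10.0, 10.0, 10.0, 10.0],  # flat profile
    "g2": [5.0, 15.0, 5.0, 15.0],    # strongly varying
    "g3": [9.0, 11.0, 9.0, 11.0],    # mildly varying
}
selected = top_variable_genes(expr, 2)
```

Calling `top_variable_genes(expr, 3000)` on a full data set would correspond to retaining the 3000 most variable genes.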
- The missing values were replaced with the mean of the observed data over sample points.
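A minimal sketch of this imputation, assuming missing values are encoded as `None` and that the mean is taken per gene over its observed sample points:

```python
def impute_row_mean(row):
    """Replace None entries with the mean of the observed values in the row."""
    observed = [v for v in row if v is not None]
    if not observed:
        return list(row)  # nothing observed: leave the row unchanged
    m = sum(observed) / len(observed)
    return [m if v is None else v for v in row]


# One hypothetical gene with a missing measurement at the second sample point.
filled = impute_row_mean([1.0, None, 3.0])
```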
- Tavazoie et al.
- This data set consists of the expression levels of about 6200 genes over 10 sample points during meiosis and spore formation.
- Following Eisen et al.
- The final trees of the cell cycle and sporulation data sets are shown in Figures 1 and 2, respectively.
- The split used for a node is shown below the circle in the form ‘motif_i name ≤ n’, which means that genes with more than n instances of motif i in their promoters go to the right node, and other genes go to the left node.
- For example, of the 2584 genes in the first data set, only 124 genes have motif MCB in their promoters.
- The remaining 2460 genes, therefore, go to the left node 1.
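The routing rule can be written out directly. The gene names and counts below are hypothetical; only the rule itself (more than n instances go right, the rest go left) comes from the text.

```python
def split_node(counts, n):
    """Partition genes by motif count: count <= n goes left, count > n goes right.

    counts maps a gene name to the number of motif instances in its promoter.
    """
    left = [g for g, c in counts.items() if c <= n]
    right = [g for g, c in counts.items() if c > n]
    return left, right


# With n = 0, every gene carrying at least one motif instance goes right.
counts = {"geneA": 0, "geneB": 2, "geneC": 0, "geneD": 1}
left, right = split_node(counts, 0)
```

Applied to the first data set with n = 0, the same rule would send the 124 genes containing MCB to the right node and the remaining 2460 genes to the left node.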
- Among the motifs in the tree structure, MCB, SCB, ECB and MCM1 were mentioned previously by Cho et al.
- In particular, motifs MCB and ECB have a strong effect on transcription in the late G1 phase, SCB is active in the early G1 phase, and MCM1 plays a regulatory role during the G2 and M phases.
- Although the tree model does not give direct information about when a motif is active, it can be inferred by analyzing the mean temporal profile (Tavazoie et al., 1999) of the node containing the motif (the right child node after splitting a node using the motif).
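The mean temporal profile of a node is simply the average expression of its genes at each time point. The sketch below also reports the time point at which the mean peaks, as a crude proxy for when a motif is active; that proxy and the toy profiles are assumptions of this example.

```python
def mean_profile(profiles):
    """Average expression at each time point over the genes in one node."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]


def peak_time(profile):
    """Index of the time point where the profile is highest."""
    return max(range(len(profile)), key=profile.__getitem__)


# Two hypothetical genes in the right child node, three time points each.
node_profiles = [[0.0, 1.0, 0.0], [0.0, 3.0, 2.0]]
profile = mean_profile(node_profiles)
peak = peak_time(profile)
```

Mapping the peak index back to a cell cycle phase requires the sampling schedule of the experiment, which is not part of this sketch.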
- The motif AAAANGTAAACAA that has a high score in Cho et al..
- Other motifs, namely RAP1, MET31-32 and mRRPE(M3A), were found by Tavazoie et al. (1999), who applied clustering analysis to the same data set.
- 1 Motif names are taken from Pilpel et al.
- the tree structure but that may be relevant to the expression data.
- We found that most motifs that appear in the tree have high ranking in Table 1.
- Regulatory elements MSE and URS1 are known to play active roles in the regulation of gene expression during sporulation (Chu et al., 1998).
- Another motif, MCB, is one of the highly scored motifs found by Bussemaker et al. (2001).
- These combinations of motifs are consistent with those reported by Pilpel et al.
- From the tree in Figure 2, we find one motif combination, mRRPE(M3A) and PAC, which is among the findings of Pilpel et al.
- For each terminal node we plotted expression profiles of the genes in the node and calculated deviation (not shown here due to lack of space).
- We found that terminal nodes located closer to the root are more homogeneous than those located further from the root.
- In this study we have developed a new approach to the identification of regulatory elements based on regression trees.
- Tree-based methods are more suitable than parametric methods when the data set is large both in terms of the number of observations and the number of variables, which is the norm for genetics data.
- in that case, it is more similar to the approach of Bussemaker et al.
- Our approach is similar to that used in previous studies (Bussemaker et al., 2001; Keles et al., 2002; Conlon et al., 2003) in that it considers motifs as predictors and gene expression levels as responses, and then uses regression fitting to extract the most relevant predictors.
- As noted by Bussemaker et al.
- Another characteristic of the model is that it clusters genes into homogeneous groups when constructing trees.
- However, clustering here is different from that used by Eisen et al. (1999) in that it simultaneously considers expression data and motifs.
- Experiments in which the proposed approach was applied to two data sets of the yeast S. cerevisiae demonstrated the ability of our method to identify biologically verified regulatory motifs.
- For the cell cycle and sporulation data sets, we identified most of the motifs known to be active in the given experimental conditions and suggested several new putative motifs.
- Our approach of filtering motif candidates prior to tree construction reduced the computational complexity while simultaneously increasing the specificity of the results.
- If several motifs participate independently in the same regulatory event, the tree model generally picks up only one representative.
- Although in the experiments we constructed trees from the putative/known motifs compiled by Pilpel et al.
- Park for his comments on an earlier version of the paper, and the IBM SUR program for providing computing facilities for this research.
- (1998) A genome-wide transcription analysis of the mitotic cell cycle.
- Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology