- When applied to two data sets of the yeast Saccharomyces cerevisiae, the method successfully identifies most of the regulatory motifs that are known to control gene transcription under the given experimental conditions, and suggests several new putative motifs.
- Analysis of the tree structures also reconfirms several pairs of motifs that are known to regulate gene transcription in combination.
- The transcription of a gene is controlled by diverse regulatory proteins called transcriptional factors (TFs), which bind to specific DNA sequences in the promoter region of the gene.
- Each TF recognizes a unique family of binding sites based on sequence binding preferences that arise through the energetic interactions between the atoms of the TF and those of the DNA sequence.
- How a collection of TFs regulates the transcription of a gene depends to a large extent on the binding sites found in the gene’s promoter.
- Hence, identifying and characterizing regulatory motifs that serve as TF binding sites is important for our understanding of the complex regulation of gene expression.
- First, genes with similar expression patterns across experimental conditions are grouped together by applying clustering analysis to genome-wide expression data sets (Eisen et al., 1998).
- There are many algorithms and methods that can be used to search for conserved sequences (Lawrence et al., 1993; van Helden et al., 1998).
- Bussemaker et al. (2001) noted that despite the success in the identification of many motifs, this strategy has a drawback: there are genes in the cluster without the motif, and many genes with the motif do not respond.
- In a recent paper, Segal et al.
- A Bayesian score is used to evaluate how well a regulation program can explain the expression behavior of the genes in the module as a function of the expression level of a small set of regulators.
- A motif finder then searches for conserved motifs upstream of the genes in each module.
- Another approach to the identification of regulatory motifs is based on the association between gene expression values and the abundance of motifs (Bussemaker et al., 2001; Keles et al., 2002).
- A linear regression procedure is used to fit the model and select the motifs that contribute most heavily to the model.
- The underlying assumption is that if a motif represents a functional binding site for an active TF, then the presence of the motif contributes additively to the expression level of the gene under the given experimental condition.
- Conlon et al.
- In this paper, we formulate the problem in the regression framework and present a method for identifying regulatory motifs using tree-based regression models.
- The tree-based regression paradigm was introduced by Breiman et al.
- We evaluate the importance of the motifs by analyzing the structure of the tree as well as by using a technique based on surrogate splits.
- In this section we give a brief introduction to tree-based models.
- We refer the reader to Breiman et al.
- In the context of motif identification, the p variables $X_i$ are the motifs, the response Y is the expression level for a single time point and the n learning cases are the genes.
- The left and right child nodes contain disjoint subsets of the parent content and are defined by splitting the parent node.
- The cutoff value c is in the range of observed values of $x_i$.
- A split is chosen so that the distributions of responses in the child nodes are as homogeneous as possible.
- Two such split functions are discussed by Breiman et al.
- The non-negativity of the split function ensures that recursive splitting will create smaller nodes with increased homogeneity.
- 1% of the sum of squares of the root node.
- where V is the covariance matrix of $y_i$ in the root node and y(g.
- Segal (1992) noted that taking V to be the pooled sample covariance matrix of the responses in the root node corresponds to a two-sample Hotelling’s $T^2$ statistic.
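The split search described in this section can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes that the split function is the reduction in within-node sum of squares, weighted by $V^{-1}$ in the multi-response case (a Mahalanobis-type form in the spirit of Segal, 1992), and that genes with a motif count at or below the cutoff go to the left child. The names `node_ss` and `best_split` are hypothetical.

```python
import numpy as np

def node_ss(Y, V_inv):
    """Within-node sum of squares: sum_i (y_i - mean)' V^{-1} (y_i - mean)."""
    if len(Y) == 0:
        return 0.0
    D = Y - Y.mean(axis=0)
    return float(np.einsum("ij,jk,ik->", D, V_inv, D))

def best_split(x, Y, V_inv):
    """Find the cutoff c on one motif count x that maximizes the split function.

    x     : (n,) motif counts for the n genes in the node
    Y     : (n, q) responses (q = 1 for a single time point)
    V_inv : (q, q) inverse covariance used to weight deviations (assumed form)
    Genes with x <= c go left, the others go right (assumed convention).
    """
    parent = node_ss(Y, V_inv)
    best_c, best_gain = None, 0.0
    # The largest observed value is skipped: it would leave the right child empty.
    for c in np.unique(x)[:-1]:
        left = x <= c
        gain = parent - node_ss(Y[left], V_inv) - node_ss(Y[~left], V_inv)
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain

# Toy data (hypothetical): genes without the motif have low expression.
x = np.array([0, 0, 1, 2, 0, 3])
Y = np.array([[0.0], [0.1], [1.0], [1.1], [-0.1], [0.9]])
cutoff, gain = best_split(x, Y, np.eye(1))
```

With a single response and `V_inv` equal to the 1×1 identity, the gain is the ordinary least-squares reduction in sum of squares. A tree builder would call `best_split` once per candidate motif and keep the motif with the largest gain.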
- With the new generalized split functions, the recursive algorithm proceeds as in the case with a single response.

2.3 Determining the tree size

- Breiman et al.
- 0, (4) where α is the complexity parameter that penalizes the number of terminal nodes.
- Once R(G) is defined, steps (b) and (c) are carried out as described in Breiman et al. (1984).
- (1) to predict the values of the response variables in the future;
- Our main purpose is not to predict but to discover those predictor variables (motifs) that are most relevant to the responses.
- Hence, a question of interest is: which variables are the most important? In other words, how do we rank those variables that, while not giving the best split of a node and thus not appearing in the tree structure, may give the second-best or third-best split?
- When a transcription factor binds to an appropriate motif, it regulates the expression level of the respective gene.
- where $x_{ij}$ is the number of times motif j appears in the promoter of gene i.
- where $y_{ij}$ is the logarithm base 2 of the expression level of gene i at sample point j.
- The simplest way is to enumerate all the sequences in the genes’ promoters as potential motifs (Bussemaker et al., 2001).
- Second, most of the sequences considered are absent from the final models, yet they can introduce noise that affects the model selection process.
- Our method is similar to that proposed by Conlon et al.
- These motifs can be found using a motif finding algorithm such as AlignACE (Hughes et al., 2000) or MEME (Bailey and Elkan, 1995).

3.2.1 Candidate motifs

- For the experiments, we used the set of 356 motifs compiled by Pilpel et al.
- The motif matrices are derived by applying the motif finding program AlignACE to the upstream regions of genes in the MIPS (Mewes et al., 2000) functional categories.
- Of the 356 motifs, 25 are known motifs described in the biological literature.
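To make the predictor matrix concrete, the following Python sketch builds the counts $x_{ij}$ from promoter sequences. It is a deliberately simplified stand-in: it matches consensus strings exactly (with `N` as a wildcard) on one strand only, whereas the study scores position weight matrices with ScanACE. The function names and the toy sequences are hypothetical.

```python
import re
import numpy as np

def count_motif(promoter, consensus):
    """Count matches of a consensus string in a promoter, overlaps included.

    'N' in the consensus matches any of A, C, G, T. Only the given strand
    is scanned (a simplification; no reverse complement).
    """
    core = consensus.upper().replace("N", "[ACGT]")
    # A zero-width lookahead lets overlapping occurrences be counted.
    return len(re.findall("(?=" + core + ")", promoter.upper()))

def design_matrix(promoters, motifs):
    """Return X with X[i, j] = number of times motif j occurs in promoter i."""
    return np.array([[count_motif(p, m) for m in motifs] for p in promoters])

# Toy input (hypothetical sequences and consensus strings).
promoters = ["ACGCGTACGCGT", "TTTT"]
motifs = ["ACGCGT", "TT"]
X = design_matrix(promoters, motifs)
```

Each row of `X` then describes one gene and each column one candidate motif, which is the layout a regression tree expects for its predictors.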
- For each motif, the number of times the motif appears in the promoter regions of genes is counted by applying the program ScanACE (Hughes et al., 2000).
- When applying ScanACE to the promoters of the genes in Saccharomyces cerevisiae, Pilpel et al.

3.2.2 Microarray data

- We tested the tree model on two sets of microarray data for the yeast S. cerevisiae, specifically, the cell cycle data set of Cho et al. (1998) and the sporulation data set of Chu et al.
- Cho et al.
- Following Tavazoie et al.
- In this selection, the metric of variation was the ratio between the standard deviation and the mean of the expression levels of each gene across the time points.
- Of the 3000 genes retained, only 2584 genes contain motifs from our set of candidate motifs.
- The missing values were replaced with the mean of the observed data over sample points.
- Tavazoie et al.
- This data set consists of the expression levels of about 6200 genes over 10 sample points during meiosis and spore formation.
- Following Eisen et al.
- The final trees of the cell cycle and sporulation data sets are shown in Figures 1 and 2, respectively.
- The split used for a node is shown below the circle in the form ‘motif i name ≤ n’, which means that genes with more than n instances of motif i in their promoters go to the right node, and other genes go to the left node.
- For example, of the 2584 genes in the first data set, only 124 genes have motif MCB in their promoters.
- The remaining 2460 genes, therefore, go to the left node 1.
- Among the motifs in the tree structure, MCB, SCB, ECB and MCM1 were mentioned previously by Cho et al.
- In particular, motifs MCB and ECB have a strong effect on transcription in the late G1 phase, SCB is active in the early G1 phase, and MCM1 plays a regulatory role during the G2 and M phases.
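The two preprocessing steps mentioned for the microarray data (mean imputation of missing values and selection of the most variable genes) can be sketched as below. This is an illustration under assumptions: missing values are encoded as NaN, the variation metric is the standard deviation divided by the mean across time points, and expression levels are positive so that the ratio is well defined. The function names are hypothetical, and the order in which the steps were applied to each data set is not stated in the text.

```python
import numpy as np

def impute_row_means(E):
    """Replace each missing value (NaN) by the mean of that gene's observed values."""
    E = np.array(E, dtype=float)
    row_means = np.nanmean(E, axis=1, keepdims=True)
    return np.where(np.isnan(E), row_means, E)

def top_variable_genes(E, k):
    """Return indices of the k genes with the largest std/mean across time points.

    E is a genes x time-points matrix of positive expression levels
    with no missing values.
    """
    cv = E.std(axis=1) / E.mean(axis=1)
    return np.argsort(cv)[::-1][:k]

# Toy matrix (hypothetical): 3 genes, 3 time points, one missing value.
E = [[1.0, 1.0, 1.0],
     [1.0, 2.0, 3.0],
     [10.0, np.nan, 50.0]]
filled = impute_row_means(E)
selected = top_variable_genes(filled, k=2)
```

In the study the cutoff was `k = 3000`; the retained rows would then be restricted to genes that contain at least one candidate motif.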
- Although the tree model does not give direct information about when a motif is active, it can be inferred by analyzing the mean temporal profile (Tavazoie et al., 1999) of the node containing the motif (the right child node after splitting a node using the motif).
- The motif AAAANGTAAACAA that has a high score in Cho et al.
- Other motifs, namely RAP1, MET31-32 and mRRPE(M3A), were found by Tavazoie et al. (1999), who applied clustering analysis to the same data set.
- 1 Motif names are taken from Pilpel et al.
- the tree structure but that may be relevant to the expression data.
- We found that most motifs that appear in the tree rank highly in Table 1.
- Regulatory elements MSE and URS1 are known to play active roles in the regulation of gene expression during sporulation (Chu et al., 1998).
- Another motif, MCB, is one of the highly scored motifs found by Bussemaker et al. (2001).
- These combinations of motifs are consistent with those reported by Pilpel et al.
- From the tree in Figure 2, we find one motif combination, mRRPE(M3A) and PAC, which is among the findings of Pilpel et al.
- For each terminal node we plotted expression profiles of the genes in the node and calculated the deviation (not shown here due to lack of space).
- We found that terminal nodes located closer to the root are more homogeneous than those located further from the root.
- In this study we have developed a new approach to the identification of regulatory elements based on regression trees.
- Tree-based methods are more suitable than parametric methods when the data set is large both in terms of the number of observations and the number of variables, which is the norm for genetics data.
- in that case, it is more similar to the approach of Bussemaker et al.
- Our approach is similar to that used in previous studies (Bussemaker et al., 2001; Keles et al., 2002; Conlon et al., 2003) in that it considers motifs as predictors and gene expression levels as responses, and then uses regression fitting to extract the most relevant predictors.
- As noted by Bussemaker et al.
- Another characteristic of the model is that it clusters genes into homogeneous groups when constructing trees.
- However, clustering here is different from that used by Eisen et al. (1998) in that it considers expression data and motifs simultaneously;
- Experiments in which the proposed approach was applied to two data sets of the yeast S. cerevisiae demonstrated the ability of our method to identify biologically verified regulatory motifs.
- For the cell cycle and sporulation data sets, we identified most of the motifs known to be active in the given experimental conditions and suggested several new putative motifs.
- Our approach of filtering motif candidates prior to tree construction reduced the computational complexity while simultaneously increasing the specificity of the results.
- If several motifs participate independently in the same regulatory event, the tree model generally picks up only one representative.
- Although in the experiments we constructed trees from the putative/known motifs compiled by Pilpel et al.
- Park for his comments on an earlier version of the paper, and the IBM SUR program for providing computing facilities for this research.
- (1998) A genome-wide transcription analysis of the mitotic cell cycle.
- Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology