
Learning Rules to Extract Protein Interactions from Biomedical Text


Summary

- We present a method for automatic extraction of protein interactions from scientific abstracts by combining machine learning and knowledge-based strategies.
- The study of protein interactions is one of the most important issues in recent molecular biology research, especially for systems biology.
- Mining biological literature for extracting protein interactions is essential for transforming discoveries reported in the literature into a form useful for further computational analysis [5].
- However, most of the interaction data are stored in free text format with irrelevant and confusing text bodies, which makes automatic querying for specific information inefficient.
- The score is computed based on the frequencies of discriminating words found in the abstracts.
- [1] used the simple pattern “protein A…interaction verb…protein B” to detect and classify protein interactions in abstracts related to yeast.
- They simplified the task by assuming that protein names are pre-specified by users.
- However, this approach cannot automatically detect new protein names.
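The pattern “protein A…interaction verb…protein B” can be sketched with a regular expression. This is only an illustration under the assumption that protein names are supplied by the user; the function name `find_pattern_interactions`, the verb list, and the sample sentence are hypothetical and are not taken from [1].

```python
import re

def find_pattern_interactions(sentence, protein_names, verbs):
    """Return (protein A, verb, protein B) triples matching 'A ... verb ... B'."""
    # Longer names first so that a long name is preferred over a shorter prefix.
    names = "|".join(re.escape(n) for n in sorted(protein_names, key=len, reverse=True))
    verb_alt = "|".join(re.escape(v) for v in verbs)
    pattern = re.compile(rf"\b({names})\b.*?\b({verb_alt})\b.*?\b({names})\b")
    return [m.groups() for m in pattern.finditer(sentence)]

# Hypothetical sentence and user-specified names
triples = find_pattern_interactions(
    "Ash1p binds to HO in yeast.", ["Ash1p", "HO"], ["binds", "interacts with"]
)
```

Because the name list is fixed in advance, a sketch like this cannot find a protein that is missing from `protein_names`, which is the limitation noted above.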
- The authors added a form of protein name verification to the system and reported the interesting fact that stricter name verification rules lead to higher recall and precision.
- All of the mentioned approaches and systems use hand-crafted extraction rules.
- We use the link grammar parser to parse sentences.
- The learned rules are then used to detect potential protein interactions.
- We describe the heuristic rules that verify whether detected nouns and noun phrases are protein names.
- The remainder of the paper is organized as follows.
- 1.1 Link Grammar.
- The arcs between words are called “links”, with the labels showing the link types.
- For example, the verb “ran” is connected to “from school” identifying a prepositional phrase by the link “MV”.
- The uppercase letters of the link labels indicate the primary types of links (there are 107 primary link types for the link parser version 4.1), and the lowercase letters detail the relationships of words.
- The meanings of the uppercase letters of several link labels are given in the right part of Fig.1.
- Each word in the link grammar must satisfy the linking requirements specifying which types of links it can attach and how they are attached.
- We adopted the link grammar in our work because it provides a simple and effective way to express relationships between terms in a sentence.
- This feature is very important in detecting protein interactions.
- We can follow the related links to find the participants of an interaction without considering the rest of the sentence.
- 2 Extraction of Protein Interactions.
- In this section, we describe how to extract protein interactions using extraction rules and heuristic rules for protein name verification.
- First, sentences are parsed by the link parser.
- Next, our method looks for terms involved in interactions by following the links described in the extraction rules.
- Finally, the heuristic rules are applied to verify whether the terms detected in the previous step are protein names.
- Our method requires three terms for an interaction: two protein names and a keyword that indicates the type of relationship between the proteins.
- We use the following keywords: “interact”, “bind”, “associate”, and “complex”, as well as their inflections.
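Spotting the four keywords and their inflections could look like the sketch below. The suffix list and the irregular form “bound” are assumptions made for illustration; the source does not enumerate the inflections it accepts.

```python
import re

# Stems of the four keywords; the suffix list is an assumed set of common inflections.
KEYWORD_PATTERN = re.compile(
    r"\b(?:interact|bind|associat|complex)(?:s|es|e|ed|ing|ion|ions)?\b|\bbound\b",
    re.IGNORECASE,
)

def find_keywords(sentence):
    """Return every keyword occurrence found in the sentence, in order."""
    return [m.group(0) for m in KEYWORD_PATTERN.finditer(sentence)]

# Hypothetical sentence
found = find_keywords("Ash1p binds to HO and associates with the complex.")
```

A purely lexical filter like this also fires on “binding” inside a compound name such as “TATA binding protein”, so it only proposes candidate keywords; the link-based rules decide whether an interaction is actually expressed.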
- Consider a sample sentence, which has the parse result shown in Fig.2a.
- In this example, the keyword is “binds”, the pairs of proteins are “Ash1p.
- “HO” and “Ash1p.
- We can think of the parsed sentence as a graph where words are vertices and links are edges.
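A minimal sketch of this graph view, assuming each link is given as a (left index, right index, label) triple. The four-word sentence and its links are hand-written for illustration and are not actual parser output; “+” and “-” mark a neighbour to the right or to the left, following the link grammar connector notation.

```python
from collections import defaultdict

def build_link_graph(links):
    """Build an adjacency list from (left_idx, right_idx, label) triples."""
    graph = defaultdict(list)
    for left, right, label in links:
        graph[left].append((right, label, "+"))   # neighbour lies to the right
        graph[right].append((left, label, "-"))   # neighbour lies to the left
    return graph

# Hypothetical parse of "Ash1p binds to HO"
words = ["Ash1p", "binds", "to", "HO"]
links = [(0, 1, "S"), (1, 2, "MV"), (2, 3, "J")]
graph = build_link_graph(links)
```

Here `graph[1]` lists the two links leaving the keyword “binds”: one leftward to the subject and one rightward to the preposition.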
- (b) Links from the keyword to protein names.
- A sample of connections between a keyword and protein names.
- A rule for the above example is shown in Fig.
- The first and second lines specify the links to follow for detecting the first and second protein names respectively.
- For example, the second line of the sample rule in Fig.3 looks for the keyword “bind”.
- If the keyword is found, the rule looks further for a word “to” which is on the right of “bind” and which is connected to “bind” by a link.
- The next step is looking rightward for a word connected to “to” by “J”.
- The procedure for name extension will be described in the next subsection.
- the link labels represent search directions for right and left respectively.
- We call the intermediate words in link paths between keywords and protein names (for example, the word “to” in the above rule) nodes.
- Given a sentence, a rule is said to be satisfied if we can find all links and words specified in the rule within this sentence.
- For instance, applying the above rule to the sentence depicted in Fig.2 will output two candidate interactions.
- the first line looks for either “to” or “after”, connected by “MV” to “bind”, whereas the second line allows any word to the right of “bind” that is connected to “bind”.
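The matching of one rule line can be sketched as a walk over the links, starting at the keyword. The step encoding `(label, direction, allowed_words)` is an assumption of this sketch, not the notation of Fig.3: `allowed_words` is a set for a disjunction such as “to” or “after”, and `None` accepts any word. The sample parse is hand-written.

```python
def primary_type(label):
    """Uppercase prefix of a link label, e.g. 'MVp' -> 'MV'."""
    i = 0
    while i < len(label) and label[i].isupper():
        i += 1
    return label[:i]

def match_rule_line(words, links, keyword_idx, steps):
    """Follow the links of one rule line and return the word indices reached.

    links: (left_idx, right_idx, label) triples.
    steps: (label, direction, allowed_words) triples; direction is "+" for
    rightward and "-" for leftward; allowed_words is a set or None (any word).
    """
    frontier = {keyword_idx}
    for label, direction, allowed in steps:
        reached = set()
        for left, right, link_label in links:
            if primary_type(link_label) != label:
                continue
            src, dst = (left, right) if direction == "+" else (right, left)
            if src in frontier and (allowed is None or words[dst] in allowed):
                reached.add(dst)
        frontier = reached
    return sorted(frontier)

# Hypothetical parse of "Ash1p binds to HO"
words = ["Ash1p", "binds", "to", "HO"]
links = [(0, 1, "Ss"), (1, 2, "MVp"), (2, 3, "Js")]
second_name = match_rule_line(words, links, 1, [("MV", "+", {"to", "after"}), ("J", "+", None)])
first_name = match_rule_line(words, links, 1, [("S", "-", None)])
```

An empty result for either line means the rule is not satisfied for that keyword. Comparing primary types rather than string prefixes keeps “M” from matching “MV”.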
- In practice, many protein names are compound words.
- For example, in the sample sentence shown in Fig.4, protein names are “general transcription factor” and “TATA binding protein” but not “IIA” and “TATA”.
- The rule matching procedure described above can detect only the two words “IIA” and “protein”.
- An example of protein names which are compound words.
- checking whether the detected names are protein names or not.
- Examples of rules are:
- If a name ends with one of the following words (molecule, gene, bacteria, base), then reject the name.
- If a name is a single word with the suffix “-in”, then give it a score of 0.5.
- Names with scores higher than a predefined threshold are accepted as protein names.
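The two example rules and the threshold test can be sketched as follows. Only these two rules come from the text; treating scores as additive and the threshold value 0.4 are assumptions made so that the sketch runs, since the source gives neither.

```python
REJECT_LAST_WORDS = {"molecule", "gene", "bacteria", "base"}

def score_name(name):
    """Return None if the name is rejected outright, otherwise a heuristic score."""
    tokens = name.lower().split()
    if not tokens or tokens[-1] in REJECT_LAST_WORDS:
        return None                      # rejection rule
    score = 0.0
    if len(tokens) == 1 and tokens[0].endswith("in"):
        score += 0.5                     # single word with the suffix "-in"
    return score

def is_protein_name(name, threshold=0.4):  # threshold value is an assumption
    score = score_name(name)
    return score is not None and score > threshold
```

With these two rules alone, `is_protein_name("actin")` is accepted while `is_protein_name("HO gene")` is rejected; a real rule set would need many more scoring rules to cover names that do not end in “-in”.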
- For example, the sentence presented in Fig.
- Our algorithm begins by creating a rule for each interaction tagged in the training set.
- To create the first line of a rule, the algorithm looks in the link-parsed sentence for the shortest path from the keyword to any word of the first name.
- The rule line for the second name is created in a similar way.
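Finding the shortest link path from the keyword to any word of a tagged name can be done with breadth-first search over the link graph. This is a sketch under the assumption that links are (left index, right index, label) triples; the step format it returns and the sample parse are illustrative only.

```python
from collections import deque

def shortest_link_path(links, start, targets):
    """Breadth-first search from the keyword index to any target word index.

    Returns a list of (label, direction, word_index) steps, or None if no
    target can be reached.
    """
    neighbours = {}
    for left, right, label in links:
        neighbours.setdefault(left, []).append((right, label, "+"))
        neighbours.setdefault(right, []).append((left, label, "-"))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node in targets:
            return path
        for nxt, label, direction in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(label, direction, nxt)]))
    return None

# Hypothetical parse of "Ash1p binds to HO"; the keyword is at index 1
links = [(0, 1, "S"), (1, 2, "MV"), (2, 3, "J")]
path_to_second = shortest_link_path(links, 1, {3})
path_to_first = shortest_link_path(links, 1, {0})
```

Replacing each intermediate word index with the word itself, and the final index with a name placeholder, would turn such a path into one line of an initial, fully specific rule.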
- For the potential application of our algorithm – populating protein interactions into databases – precision must be emphasized.
- The learning algorithm is shown in Fig.
- Each of the procedures produces a set of more general rules, CandidateSet.
- In the case of calling GENERALIZE_FRAGMENT, the whole CandidateSet will subsume the whole RuleSet if the former performs no worse than the latter.
- We use the word “term” to refer to any word located at a node of the link path given by the first or second line of a rule.
- For example, the rule shown in Fig.3 has the term “to” in the second line.
- CandidateSet ← disjunctions of the found rules.
- In the example above a fragment can be “of J+ @NAME2” or “@NAME2”.
- The procedure looks for a pair of rules that differ from each other only by the suffixes of one line.
- If such a pair is found, the suffixes of the rules are recorded.
- Then the procedure looks for all rules in RuleSet with suffixes identical to one of the recorded suffixes.
- If such rules are found, the procedure builds more general rules from them by replacing their suffixes with the disjunction of the recorded suffixes.
- In the example below, the first two rule lines have two pairs of suffixes (to J+ @NAME2.
- The underlying assumption of fragment generalization is that if two different suffixes are found in the same position of two similar rules, the suffixes probably can appear in similar contexts and therefore can replace each other in other rules.
- As an example of the learning process, consider generalizing the rules based on the following three sentences (only fragments are shown).
- During term generalization, the second and third lines are found to differ only in the terms “between” and “of”.
- During fragment generalization, these lines give two suffixes, “@NAME2” and “domain M+ of J+ @NAME2”, which are used to build the following general rule interact M.
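The term generalization step in this example, merging two rule lines that differ only in one term, can be sketched as below. The tuple encoding of a rule line, with terms stored as frozensets of alternative words, is an assumption of this sketch, and the two sample lines are hypothetical reconstructions rather than the rules from the paper.

```python
def merge_terms(line_a, line_b):
    """Merge two rule lines that differ in exactly one term; return None otherwise.

    A line is a tuple whose items are strings (keyword, link labels, name
    placeholder) or frozensets of alternative words (terms).
    """
    if len(line_a) != len(line_b):
        return None
    diffs = [i for i, (a, b) in enumerate(zip(line_a, line_b)) if a != b]
    if len(diffs) != 1:
        return None
    i = diffs[0]
    a, b = line_a[i], line_b[i]
    if not (isinstance(a, frozenset) and isinstance(b, frozenset)):
        return None  # only terms are generalized, never links or placeholders
    return line_a[:i] + (a | b,) + line_a[i + 1:]

# Hypothetical rule lines
line_between = ("interact", "M+", frozenset({"between"}), "J+", "@NAME2")
line_of = ("interact", "M+", frozenset({"of"}), "J+", "@NAME2")
merged = merge_terms(line_between, line_of)
```

The merged line accepts either “between” or “of” at that node, which is the kind of disjunction collected into CandidateSet.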
- The abstracts were obtained by querying Medline with the following keywords: “Saccharomyces cerevisiae” and “protein” and.
- We filtered the 3343 returned abstracts and retained 550 sentences containing at least one of the four keywords “interact”, “bind”, “associate”, “complex”, or one of their inflections.
- An extracted interaction is considered correct if both extracted protein names are identical to those tagged by the user.
- The order of the protein names (which is first and which is second) is not taken into consideration, although experimental results show that the order is retained.
- We evaluate performance in terms of precision (the number of correctly extracted interactions divided by the total number of extracted interactions) and recall (the number of correctly extracted interactions divided by the total number of interactions actually mentioned in the sentences).
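The two measures can be computed as in the sketch below, which treats an interaction as an unordered pair of protein names, matching the evaluation criterion above. The function name and the sample pairs are illustrative; the sketch pools pairs into sets, so evaluating per sentence would require adding a sentence identifier to each pair.

```python
def precision_recall(extracted, tagged):
    """Compute (precision, recall) over unordered (name1, name2) pairs."""
    extracted_set = {frozenset(pair) for pair in extracted}
    tagged_set = {frozenset(pair) for pair in tagged}
    correct = len(extracted_set & tagged_set)
    precision = correct / len(extracted_set) if extracted_set else 0.0
    recall = correct / len(tagged_set) if tagged_set else 0.0
    return precision, recall

# Hypothetical data: one of two extracted pairs is correct, three pairs are tagged
p, r = precision_recall(
    [("Ash1p", "HO"), ("A", "B")],
    [("HO", "Ash1p"), ("C", "D"), ("E", "F")],
)
```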
- In order to analyze the effect of term generalization and fragment generalization on the results, four versions of the learning algorithm were tested.
- Recalls and precisions of the four versions are given in Table.
- Results of different versions of the learning algorithm.
- It is more reasonable to compare our approach with those that do not require pre-specified protein names or dictionaries.
- Both of the systems require hand-crafted patterns or rules for detecting protein interactions.
- The rules learned by our algorithm can be used in combination with heuristic rules, which verify whether a noun phrase is a protein name, to extract protein interactions from scientific abstracts.
- The learning and extraction algorithms exploit the link grammar parser to parse input sentences.
- Blaschke, A., Andrade, M.A., Ouzounis, C., Valencia, A.: Automatic extraction of biological information from scientific text: protein-protein interactions.
- In: Proceedings of the 5th Int.
- identifying protein names from biological papers.
- In: Proceedings of the Pacific Symposium on Biocomputing.
- Marcotte, E.M., Xenarios, I., Eisenberg, D.: Mining literature for protein-protein interactions.
- protein-protein interactions from the biological literature.
- In: Proceedings of the Pacific Symposium on Biocomputing (2001).
- In: Proceedings of the Pacific Symposium on Biocomputing (2000).
- Tanabe, L., Wilbur, W.: Tagging gene and protein names in biomedical text.
- Thomas, J., Milward, D., Ouzounis, C., Pulman, S., Carroll, M.: Automatic extraction of protein interactions from scientific abstracts