
Developing Tools and Building Linguistic Resources for Vietnamese Morpho-Syntactic Processing



- Thanh Bon Nguyen (1.
- are extremely difficult due to the lack of formal linguistic knowledge on one hand, and the specificities of isolating languages on the other hand.
- In this paper we present our efforts to develop a set of tools permitting the construction and management of language resources for Vietnamese in a normalized framework, whose aim is to be largely distributed and usable for research purposes in NLP.
- We first define a tagset by constructing Vietnamese morpho-syntactic descriptors that fit in a model compatible with MULTEXT, so as to account for possible multilingual applications as well as the reusability of defined tagsets.
- Our system ensures a representation format of linguistic resources that is currently considered in the framework of ISO TC37 SC4.
- Except for a few works carried out on English to Vietnamese, research in this domain had not, until recently, attracted much attention from the scientific community in Vietnam.
- There does not even exist any recognizable standard for Vietnamese word categories (Cao X..
- Another major difficulty for the automatic processing of Vietnamese comes from the fact that it is an isolating language in which almost every word is monosyllabic and there is no morphological variation.
- are composed from monosyllabic words and frequently appear in the texts.
- These specificities make the tasks of word segmentation and morpho-syntactic annotation of Vietnamese extremely difficult.
- In this paper, we present our efforts to develop a set of tools permitting the construction and management of language resources for Vietnamese in a normalized framework (cf.
- After a brief reminder of the linguistic characteristics of Vietnamese (section 2), we present our system for morpho-syntactic annotation (section 3).
- We define a tagset by constructing Vietnamese morpho-syntactic descriptors that fit in a model compatible with MULTEXT, so as to account for possible multilingual applications as well as the reusability of defined tagsets.
- Our system ensures a representation format of linguistic resources that is currently considered in the framework of ISO subcommittee TC37 SC4.
- The annotated corpus and the lexicon that we have produced are accessible from our team website.
- This work is undertaken with a view to building a Vietnamese syntactic lexicon for a TAG parser.
- Some Characteristics of Vietnamese.
- To begin with, we recall some important characteristics of Vietnamese (Thanh Bon Nguyen et al., 2004).
- In these languages, topics are coded in the surface structure and they tend to control co-referentiality (cf.
- The sentence "There is a cat in the garden" should be translated as "Có một con mèo trong vườn" / exist one <animal-classifier>.
- Corpus Morpho-Syntactic Annotation.
- Many applications in the domain of NLP make use of annotated corpora.
- As our project is one of the first in this domain, our objective is to make available to the research community a set of tools permitting the construction and management of language resources for Vietnamese in a normalized framework.
- In this section we focus on the system of morpho-syntactic annotation.
- In the next section we introduce a model of Vietnamese syntax.
- A lexical resource is one of the necessary resources for language processing.
- For the fundamental tasks of Vietnamese text analysis, i.e. text segmentation and POS tagging, we need a list of all syllables in Vietnamese and a lexicon containing morpho-syntactic information.
- The set of 8 categories used in this dictionary is a compromise accepted by the Vietnam Committee of Social Sciences (Uỷ ban KHXHVN, 1983).
- As described in (Thanh Bon Nguyen et al., 2004), we have built a morpho-syntactic lexicon containing all headwords in the above print dictionary.
- Each headword can correspond to several entries in our lexicon, depending on its number of morpho-syntactic descriptions.
- We make use of 11 main grammatical categories instead of the 8 categories of the print dictionary, while maintaining coherence with the descriptions in (Uỷ ban KHXHVN, 1983).
- The subcategorisation of each category is described by a feature structure, of which each feature is chosen from different discussions on Vietnamese grammar in the literature.
- A proposal for lexical markup normalization is being discussed in the framework of the ISO TC 37 SC 4 "Language Resource Management".
- For our morpho-syntactic lexicon, we choose for the time being a simple representation with explicit XML tags (cf. Thanh Bon Nguyen et al., 2004).
- The morpho-syntactic descriptions elaborated in the lexicon give us the capacity to define tagsets with 1-1 mappings from the description space to the tag space.
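
A 1-1 mapping between descriptions and tags can be illustrated with a minimal Python sketch. The `feature@value` notation follows the `pos@N` tags shown in the annotation sample; the `|` separator, the function names, and the example feature combination are assumptions made for illustration only and are not taken from the system described here.

```python
from typing import Dict

SEPARATOR = "|"  # hypothetical separator between feature@value pairs


def descriptor_to_tag(descriptor: Dict[str, str]) -> str:
    """Serialize a feature structure into a single tag string.

    Features are sorted by name so that equal descriptors always yield
    the same tag, which keeps the mapping one-to-one.
    """
    if not descriptor:
        raise ValueError("a descriptor needs at least one feature")
    for name, value in descriptor.items():
        if any(ch in name + value for ch in "@|"):
            raise ValueError("feature names and values must not contain '@' or '|'")
    return SEPARATOR.join(
        f"{name}@{value}" for name, value in sorted(descriptor.items())
    )


def tag_to_descriptor(tag: str) -> Dict[str, str]:
    """Recover the feature structure from a tag string."""
    return dict(pair.split("@", 1) for pair in tag.split(SEPARATOR))


# Illustrative feature values only; the real feature inventory is larger.
example = {"pos": "N", "unit": "human"}
tag = descriptor_to_tag(example)
print(tag)
print(tag_to_descriptor(tag) == example)
```

Because the encoding can be inverted, a coarser tagset (for instance one keeping only `pos`) can be derived by projecting descriptors onto fewer features before serializing them.
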
- We now present the tools developed for the task of morpho-syntactic annotation.
- Our system annotates the corpus in two principal steps: text tokenization and POS tagging.
- The tokens in question are lexical units that are assumed to be present in the system lexicon.
- In the first phase, we make use of the syllable base and the lexicon to recognize all possible segmentations for each text segment (separated by the punctuation).
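
The enumeration of all possible segmentations can be sketched as follows. This is a minimal illustration, not the system's actual tokenizer: the toy lexicon, the function name `all_segmentations`, and the `max_len` limit on syllables per word are assumptions.

```python
import unicodedata
from typing import List, Set


def _key(text: str) -> str:
    """Normalize a word for lexicon lookup (NFC form, lower case)."""
    return unicodedata.normalize("NFC", text).lower()


def all_segmentations(
    syllables: List[str], lexicon: Set[str], max_len: int = 4
) -> List[List[str]]:
    """Return every grouping of the syllables into words found in the lexicon."""
    known = {_key(word) for word in lexicon}
    results: List[List[str]] = []

    def extend(start: int, prefix: List[str]) -> None:
        if start == len(syllables):
            results.append(list(prefix))
            return
        last = min(start + max_len, len(syllables))
        for end in range(start + 1, last + 1):
            word = " ".join(syllables[start:end])
            if _key(word) in known:
                prefix.append(word)
                extend(end, prefix)
                prefix.pop()

    extend(0, [])
    return results


# Toy lexicon (hypothetical), covering the sample sentence used in this paper.
TOY_LEXICON = {"ông", "già", "ông già", "đi", "rất", "nhanh"}
segmentations = all_segmentations("Ông già đi rất nhanh".split(), TOY_LEXICON)
for segmentation in segmentations:
    print(" | ".join(segmentation))
```

With this toy lexicon the segment has two readings, one in which "Ông già" is a single lexical unit and one in which "Ông" and "già" are separate units, which is the kind of ambiguity that has to be resolved afterwards.
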
- In the second phase, equipped with a tagged corpus, we replace the user intervention of the first tokenizer with an automatic choice based on tag sequence probabilities.
- The considered tagset is the one comprising 11 main grammatical categories of the system lexicon.
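
The automatic choice among competing analyses can be sketched with a tag bigram model. This is only an illustration under stated assumptions: the candidate tags are the ones appearing in the annotation sample, but the bigram probabilities are invented numbers (in practice they would be estimated from the tagged corpus), and the exhaustive search stands in for whatever decoding the real system uses.

```python
import math
from itertools import product
from typing import Dict, List, Tuple

# Candidate tags per lexical unit (taken from the annotation sample).
TOY_TAGS: Dict[str, List[str]] = {
    "Ông già": ["N"],
    "Ông": ["P"],
    "già": ["A"],
    "đi": ["V", "R"],
    "rất": ["R"],
    "nhanh": ["A"],
}

# Invented bigram probabilities P(tag | previous tag); "<s>" marks the start.
TOY_BIGRAMS: Dict[Tuple[str, str], float] = {
    ("<s>", "N"): 0.5, ("<s>", "P"): 0.2,
    ("N", "V"): 0.4, ("N", "R"): 0.1,
    ("P", "A"): 0.05,
    ("A", "V"): 0.1, ("A", "R"): 0.2,
    ("V", "R"): 0.3, ("R", "R"): 0.05,
    ("R", "A"): 0.5,
}
UNSEEN = 1e-6  # small probability assumed for unseen tag pairs


def log_score(tags: Tuple[str, ...]) -> float:
    """Log-probability of a tag sequence under the bigram model."""
    total, previous = 0.0, "<s>"
    for tag in tags:
        total += math.log(TOY_BIGRAMS.get((previous, tag), UNSEEN))
        previous = tag
    return total


def best_analysis(candidates: List[List[str]]) -> Tuple[List[str], List[str]]:
    """Pick the segmentation and tag sequence with the highest score."""
    best_score, best_words, best_tags = -math.inf, [], []
    for words in candidates:
        for tags in product(*(TOY_TAGS[word] for word in words)):
            score = log_score(tags)
            if score > best_score:
                best_score, best_words, best_tags = score, words, list(tags)
    return best_words, best_tags


candidates = [
    ["Ông già", "đi", "rất", "nhanh"],
    ["Ông", "già", "đi", "rất", "nhanh"],
]
words, tags = best_analysis(candidates)
print(list(zip(words, tags)))
```

Note that an unnormalized product of probabilities favours segmentations with fewer units, and that enumerating every tag sequence is only practical for short segments; a dynamic-programming search such as Viterbi would be the usual replacement for longer input.
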
- As with the lexical resources, all resources produced by the system are marked up in a structure equivalent to the proposals considered by the ISO TC 37 SC 4.
- <token id="t1">Ông</token>
- <token id="t2">già</token>
- <token id="t3">đi</token>
- <token id="t4">rất</token>
- <token id="t5">nhanh</token>
- <token id="t6">.</token>
- <wordForm entry="Ông già" tokens="t1 t2" tag="pos@N" />
- <wordForm entry="đi" tokens="t3" tag="pos@V" />
- <wordForm entry="rất" tokens="t4" tag="pos@R" />
- <wordForm entry="nhanh" tokens="t5" tag="pos@A" />
- <wordForm entry="." tokens="t6" tag="pos@dot" />
- <wordForm entry="Ông" tokens="t1" tag="pos@P" />
- <wordForm entry="già" tokens="t2" tag="pos@A" />
- <wordForm entry="đi" tokens="t3" tag="pos@R" />
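
Markup of this kind can be generated with Python's standard `xml.etree.ElementTree` module, as in the sketch below. The element and attribute names (`token`, `wordForm`, `id`, `entry`, `tokens`, `tag`) follow the sample; the `annotation` wrapper element, the function name `annotate`, and the use of "." as the entry of the final punctuation token are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def annotate(syllables: List[str], analysis: List[Tuple[str, str]]) -> ET.Element:
    """Build token elements plus wordForm elements that point back to them.

    `analysis` lists (entry, tag) pairs in text order; each entry covers as
    many consecutive tokens as it has space-separated syllables.
    """
    root = ET.Element("annotation")  # hypothetical wrapper element
    for index, syllable in enumerate(syllables, start=1):
        token = ET.SubElement(root, "token", id=f"t{index}")
        token.text = syllable

    next_id = 1
    for entry, tag in analysis:
        length = len(entry.split())
        ids = " ".join(f"t{i}" for i in range(next_id, next_id + length))
        ET.SubElement(root, "wordForm", entry=entry, tokens=ids, tag=f"pos@{tag}")
        next_id += length

    if next_id - 1 != len(syllables):
        raise ValueError("the analysis does not cover every token exactly once")
    return root


syllables = ["Ông", "già", "đi", "rất", "nhanh", "."]
analysis = [("Ông già", "N"), ("đi", "V"), ("rất", "R"), ("nhanh", "A"), (".", "dot")]
root = annotate(syllables, analysis)
print(ET.tostring(root, encoding="unicode"))
```

Keeping the `wordForm` elements separate from the `token` elements lets several competing analyses refer to the same tokens, which is how the alternative readings of "Ông", "già" and "đi" can coexist in the sample.
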
- As mentioned in the introduction, Vietnamese linguists have not yet been involved in computational linguistics.
- Therefore, no valid formal grammar for Vietnamese exists to date.
- The formalism that we choose in our project for Vietnamese parsing is Tree Adjoining Grammar (TAG), which is well studied for French and English grammars (Xtag, 2001.
- In this section we present a tentative TAG model of Vietnamese noun phrases (NP).
- This work aims at building Vietnamese syntactic resources for a parser using the TAG formalism.
- Thanh Bon Nguyen et al., 2004);.
- The important differences between Vietnamese and a language such as English are the following:
- There exists a noun pair N1 and N2 as the “center” of the noun phrase.
- in the first two cases, N1 is the head of the syntagm, and in the last case the head is N2.
- sense = empty, full (an “empty” value requires the presence of C4 or C5); unit = human [classifier], thing [classifier], exact [measure], inexact [measure] (for N1.
- The constraints on the order of complements in the sequence C4 are also ignored for lack of space.
- As in (Abeillé, 1993), we make use of N (Noun) for the root node of the NP tree.
- the above description) In this case N2 is the head of the syntagm.
- In this case N1 is the head of the syntagm.
- We have presented our work, whose objective is to construct a set of basic tools for Vietnamese text analysis, such as a tokenizer, a POS tagger with validation editors, and a concordancer, as well as to build a database of linguistic resources, such as a morpho-syntactic lexicon, an annotated corpus, and a syntactic lexicon for the TAG formalism.
- We are now in the process of extending the annotated corpus with tagset evaluation and improvement.
- Proceedings of the IRCS Workshop on Linguistic Databases, Philadelphia, US.
- Lexical descriptions for Vietnamese language processing.
- Proceedings of the Asian Language Resources Workshop (to appear), IJC-NLP 2004, Hainan, CN.
- A case study of the probabilistic tagger QTAG for Tagging Vietnamese Texts
