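[PhD] - Aquisição Automática de Subcategorização Sintáctico-Semântica e sua utilização em Sistemas de Processamento de Língua Natural
(Automatic Acquisition of Syntactic-Semantic Subcategorization and its Use in Natural Language Processing Systems)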
Nov 2001 - Nov 2006
Developing robust syntactic parsers for natural language text requires resolving syntactic ambiguity. Most modern natural language processing techniques rely on a subcategorization lexicon to restrict the set of possible parses. Words combine according to specific linguistic constraints, and the constraints a particular word imposes on the words it can combine with are known as subcategorization restrictions. Subcategorization is expressed at two levels of abstraction: syntactic (subcategorization frames) and semantic (selection restrictions). Subcategorization frames constrain the morphosyntactic categories and syntactic contexts of a word's arguments. Selection restrictions, on the other hand, require arguments to belong to a specific semantic class. A parser needs both kinds of information to prefer some parses over other grammatically possible ones.
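As a minimal sketch of how the two levels can work together, the toy example below stores, for each head word, slots that combine a syntactic context, a required category, and a semantic class, and then uses them to choose an attachment site for a prepositional phrase. The lexicon entries, the relation names (`dobj`, `prep_with`), the semantic classes, and the helpers `licenses` and `attach` are all hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    relation: str        # syntactic context, e.g. "dobj" or "prep_with"
    category: str        # morphosyntactic category required of the filler
    semantic_class: str  # selection restriction on the filler


# Hypothetical toy lexicon: head word -> slots it subcategorizes for.
LEXICON = {
    "eat": [Slot("dobj", "NOUN", "food"), Slot("prep_with", "NOUN", "instrument")],
    "pizza": [Slot("prep_with", "NOUN", "food")],
}

# Hypothetical semantic classes for filler words.
SEMANTIC_CLASS = {"pizza": "food", "fork": "instrument", "anchovies": "food"}


def licenses(head, relation, filler, filler_category="NOUN"):
    """True if some slot of `head` accepts `filler` in `relation`."""
    for slot in LEXICON.get(head, []):
        if (slot.relation == relation
                and slot.category == filler_category
                and SEMANTIC_CLASS.get(filler) == slot.semantic_class):
            return True
    return False


def attach(candidates, relation, filler):
    """Keep only the candidate heads whose lexicon entry licenses the filler."""
    return [head for head in candidates if licenses(head, relation, filler)]
```

Under these assumptions, "eat pizza with a fork" attaches the prepositional phrase to the verb, while "eat pizza with anchovies" attaches it to the noun, because only one candidate head licenses each filler.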
The purpose of this work is to investigate how subcategorization information can be acquired automatically from data. To that end, we propose an unsupervised strategy for acquiring the syntactic-semantic requirements of nouns, verbs, and adjectives from partially parsed text corpora. The main aim of the learning strategy presented in this thesis is to cluster similar contexts by identifying the words that extensionally define the requirements of those contexts. This strategy allows us to learn the syntactic and semantic requirements of words in different contexts. The acquired information is used to build a subcategorization lexicon and to resolve attachment ambiguities during parsing. The results show that the learning strategy is robust both to noise in the input data and to data sparseness.
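The idea of clustering contexts by the words that fill them can be illustrated with a small sketch. It assumes each context is represented by the set of words observed in it and merges contexts greedily when their word sets overlap enough, using the Dice coefficient. The similarity measure, the threshold, the merging procedure, and the sample data are assumptions made for illustration, not the algorithm described in the thesis.

```python
from itertools import combinations


def dice(a, b):
    """Dice coefficient between two word sets (0.0 when both are empty)."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0


def cluster_contexts(contexts, threshold=0.5):
    """Greedily merge contexts whose filler-word sets are similar.

    `contexts` maps a context label to the set of words seen in it.
    Returns a list of (context_labels, words) pairs, one per cluster.
    """
    clusters = [({name}, set(words)) for name, words in contexts.items()]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            if dice(clusters[i][1], clusters[j][1]) >= threshold:
                names = clusters[i][0] | clusters[j][0]
                words = clusters[i][1] | clusters[j][1]
                # Replace the two merged clusters with their union.
                clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
                clusters.append((names, words))
                merged = True
                break
    return clusters


# Hypothetical contexts extracted from partially parsed text.
SAMPLE_CONTEXTS = {
    "drink:dobj": {"water", "wine", "beer"},
    "pour:dobj": {"water", "wine", "milk"},
    "sign:dobj": {"contract", "letter"},
}
```

With the sample data, the direct-object contexts of "drink" and "pour" share two of their three words and end up in one cluster, whose pooled word set then stands in for the semantic requirement of both contexts; the context of "sign" stays separate.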