The lexicon as a possibility: the contribution of semantic-terminological information to lexical substitution tasks in natural language processing
Description
This work investigates lexical variation in Portuguese and English in the alignment and lexical substitution steps of Natural Language Processing (NLP), focusing on the specialized domain of retail. Our theoretical grounding is interdisciplinary, drawing on postulates from both Computing and Linguistics. On the computational side, we offer a theoretical overview of the use of semantic information in the development of NLP systems and demonstrate ways of implementing semantic information in computational lexical bases such as WordNet, FrameNet and FrameNet Brasil. On the linguistic side, we rely on the definitions of Murphy (2003, 2010), L'Homme (2020) and Croft & Cruse (2004) regarding semantic relations as applied to specialized terminology. We also take into account León-Araúz & Faber's (2014) classifications and inferences regarding lexical variation and translation equivalents within the scope of Terminology. Our methodology follows the assumptions of Corpus Linguistics and uses the Sketch Engine tool to analyze English and Portuguese corpora intended to represent the terminology of the domain. The pairs of terms chosen for the lexical substitution task are “plant” – “site” and “material” – “article”. The terminology used in the monolingual analysis stage comes from the predictions generated by three lexical substitution models: the first relies on synonymy between terms, the second adds a layer of information from word embeddings, and the third adds a layer of information that retrieves semantic frames. The terminology used in the multilingual analysis stage comes from the corpus itself and from a collection of retail terminological bases.
Our monolingual analysis classifies the models' predictions according to semantic relations and yields a categorization of terms following León-Araúz & Faber's (2014) definitions of terminological variation. The bilingual analysis, in turn, classifies the translation equivalents of the term pairs according to the translation problem they represent and the types of equivalence listed by León-Araúz & Faber (2014). Finally, based on these semantic-terminological analyses, our results point to improvements for lexical substitution and machine translation models that take semantic information and terminological classification categories into account, in order to improve the quality and linguistic accuracy of their output.
CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior