Vladimir Petkevic (Czech Republic)
Charles University, Faculty of Arts, Institute of Theoretical and Computational Linguistics, Deputy Head
Lexical database of multiword expressions in Czech
Abstract. The paper describes a typology of Czech multiword expressions (MWEs) recorded in the electronic lexical database LEMUR, together with the structure of a database entry. The database is to be used (a) for coping with NLP problems (tagging and parsing, word sense disambiguation...), and (b) for tackling theoretical linguistic problems associated with MWEs: the study of variants and fragments of standard MWEs in actual usage, as attested in corpora of Czech, and the identification of MWEs in corpus texts based on the properties recorded in the database entries. The structure and content of a database entry are then described: a detailed typology, based primarily on the three-dimensional classification adopted in the PARSEME project (syntactic structure, fixedness/flexibility and idiomaticity) and enhanced with specific typological features of Czech (especially morphological ones), the style/usage of an MWE, etc. The main emphasis is laid on a classification of types of idiomaticity.
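As a rough illustration of the entry structure outlined in the abstract above, an entry could be modelled along the following lines. This is only a sketch: the class name MweEntry, all field names and the example values are assumptions made for illustration and do not reflect the actual LEMUR schema or its category labels.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MweEntry:
    """Hypothetical MWE entry; names are illustrative, not LEMUR's schema."""
    lemma: str                       # canonical form of the expression
    syntactic_structure: str         # PARSEME dimension 1
    fixedness: str                   # PARSEME dimension 2 (fixedness/flexibility)
    idiomaticity: List[str] = field(default_factory=list)   # PARSEME dimension 3
    morphology: Dict[str, str] = field(default_factory=dict)  # Czech-specific features
    style: str = ""                  # style/usage label

    def components(self) -> List[str]:
        """Split the canonical form into its component words."""
        return self.lemma.split()

    def is_multiword(self) -> bool:
        """An entry counts as multiword if it has more than one component."""
        return len(self.components()) > 1


# Placeholder values only; the labels below are not taken from the database.
entry = MweEntry(
    lemma="mít máslo na hlavě",
    syntactic_structure="verbal",
    fixedness="flexible",
    idiomaticity=["semantic"],
    style="colloquial",
)
print(entry.components(), entry.is_multiword())
```

Keeping the three PARSEME dimensions as separate fields means an entry can be queried along each dimension independently, for instance when matching candidate MWEs in corpus text.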
Robert Reynolds (USA)
Brigham Young University, Assistant Research Professor
Russian NLP for language learners: technologies and applications
Abstract. In this talk, I discuss my research in building Russian natural language processing tools intended for applications in Russian Computer-Assisted Language Learning. As might be expected, research on natural language processing is dominated by applications for native speakers using normative language, i.e. language that conforms to orthographic and grammatical standards. This means that the implicit assumptions in the design of mainstream tools can be ill-suited for applications intended for non-native speakers, whether they process normative language or learner language. These assumptions involve the content of the system, the nature of the information that the system delivers, and the confidence with which the system delivers it. I present a two-level morphological analyzer and constraint grammar for Russian, UDAR (short for udarenie), discussing the explicit design decisions that make it more amenable to language-learning applications, most notably the inclusion of word stress and the ability to analyze, diagnose, and correct some common learner errors. I also showcase a number of language-learning applications which rely on UDAR. Among these are automatic word stress placement in Russian running text, a web browser extension for automatically generating grammatical exercises in context, and a web search engine for Russian language teachers and learners.
Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
Standards for morphosyntactic tagsets
Abstract. The talk gives an overview of the projects and initiatives aimed at developing a consistent and documented set of multilingual morphological features and tagsets for facilitating natural language processing tasks. After giving motivations for this undertaking, a historical overview is presented, starting with the EAGLES and MULTEXT(-East) projects from the last century. The talk then moves to the present, concentrating on and contrasting the MULTEXT-East Version 6 morphosyntactic specifications and the Universal Dependencies project, especially as regards Slavic languages. We also mention challenges that arise when extending the scope of such specifications to historical language and user-generated content.