Presentation

This collaborative project, funded for three years (June 2015 - June 2018) by COMUE Sorbonne Paris Cité, brings together several Sorbonne-Paris-Cité laboratories (LIPN, LDI, CLILLAC-ARP, ERTIM), EMPNEO group stakeholders and the University of São Paulo (USP).

The project aims to:

  • set up a multilingual platform for monitoring and tracking neologisms from very large contemporary corpora in seven languages (French, Greek, Polish and Czech, the languages of the EMPNEO group, plus Brazilian Portuguese, Chinese and Russian);
  • use this platform to conduct a study of borrowing (including but not limited to Anglicisms) in different languages (French, Greek, Polish, Czech, Brazilian Portuguese, Chinese and Russian);
  • use this platform to study semantic innovation and propose automatic procedures to track it in monitor corpora.

General architecture

The general architecture of the system is shown in Figure 2.

In this architecture, the horizontal line separates the components where the expert linguist can intervene (lower part) from the components to which they have no access (the domain of the computational linguist).

There are six main modules:

  • The corpus manager: the expert linguist can add, delete or modify the corpora to be analyzed by the system; a corpus is currently either an RSS feed or a website. The expert can also specify metadata for each corpus: name of the journal, entry URL, category of information provided (general or specialized press at the moment), field (computer science, health, economics, fashion, etc.), language (among the seven languages of the project), country of the journal (which can later be used to study neological differences by country for the same language), type of resource (currently website or RSS feed) and publication frequency. This metadata is attached to each information unit ("article") retrieved and can be used to filter the results in the search engine (see the corresponding tab).
  • Retrieval of RSS feeds and related articles, and their linguistic analysis: this module regularly retrieves press articles from the specified RSS feeds and web pages and applies several linguistic processing steps: word segmentation, morphosyntactic analysis and syntactic analysis. It stores the following content elements for each feed item: article title, article description (a summary of the content, or a teaser), the article content itself, the morphosyntactically tagged content, the document lemmas (restricted to nouns, verbs and adjectives) and the document proper nouns.
  • Automatic identification of neologisms using a reference dictionary as an exclusion corpus: following morphosyntactic analysis, this module retains only candidate neologisms after applying several filters (proper names, typographical errors), then pre-categorizes the candidates into borrowings and 'internal' neologisms.
  • The neologism search and analysis engine: this interface lets you search the results of the previous steps through a search engine that supports filtering on different properties (see the corresponding tab).
  • The neologism manager (Neologia): this database predates the project and was developed in collaboration with Jean-François Sablayrolles at LDI. See (Cartier and Sablayrolles, 2010) for details of this module. Neologia interacts with the Neoveille engine in two main ways: on the one hand, the neologisms presented and their contexts can be exported directly to the Neologia database; on the other hand, once a neologism has been inserted in Neologia, information on its life cycle can always be obtained by returning to the Neoveille engine.
  • (To come) Identification of semantic neologisms by the combinatorial profile method: it will be run on the target lexical units, and its results will also be available in the search and analysis interface.
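
As a rough illustration of the corpus manager described above, the sketch below models the metadata attached to a source and inherited by each retrieved article, then filters articles on that metadata. All names (`CorpusSource`, `Article`, `filter_articles`) and the sample data are hypothetical; they are not the platform's actual schema or code.

```python
from dataclasses import dataclass


# Hypothetical record for one monitored source. The fields mirror the
# metadata listed in the corpus manager description, not a real schema.
@dataclass(frozen=True)
class CorpusSource:
    name: str           # name of the journal
    url: str            # entry URL
    category: str       # "general" or "specialized" press
    domain: str         # field: "health", "economics", ...
    language: str       # one of the project languages
    country: str        # country of the journal
    resource_type: str  # "rss" or "website"
    frequency: str      # publication frequency


@dataclass(frozen=True)
class Article:
    title: str
    content: str
    source: CorpusSource  # every article carries its source metadata


def filter_articles(articles, **criteria):
    """Keep articles whose source metadata matches every given criterion."""
    return [
        article for article in articles
        if all(getattr(article.source, key) == value
               for key, value in criteria.items())
    ]


# Made-up sources and articles, for illustration only.
SOURCE_FR = CorpusSource("Example Daily", "https://example.org/rss", "general",
                         "economics", "fr", "FR", "rss", "daily")
SOURCE_BE = CorpusSource("Example Weekly", "https://example.be/", "specialized",
                         "health", "fr", "BE", "website", "weekly")
SOURCE_BR = CorpusSource("Exemplo Diário", "https://example.com.br/rss", "general",
                         "fashion", "pt", "BR", "rss", "daily")

ARTICLES = [
    Article("Article A", "...", SOURCE_FR),
    Article("Article B", "...", SOURCE_BE),
    Article("Article C", "...", SOURCE_BR),
]

# Compare the same language across countries, as suggested above.
belgian_french = filter_articles(ARTICLES, language="fr", country="BE")
print([article.title for article in belgian_french])  # ['Article B']
```

Keeping the metadata on the source and referencing it from each article is one simple way to make every article filterable by language, country or field without duplicating the values.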
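
The exclusion-dictionary step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: the typo filter uses string similarity from Python's `difflib`, the proper-noun list is assumed to come from the earlier tagging step, and the borrowing versus 'internal' pre-categorization is reduced to a lookup in a hypothetical foreign-language lexicon. All function names and data are invented for the example.

```python
import difflib
import re


def tokenize(text):
    """Split text into word tokens (letters, with internal hyphens or apostrophes)."""
    return re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", text)


def is_probable_typo(word, dictionary, cutoff=0.9):
    """Crude typo filter (assumption): the word is very close to a known word."""
    return bool(difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff))


def candidate_neologisms(text, dictionary, proper_nouns, foreign_lexicon):
    """Return {candidate: 'borrowing' | 'internal'} for unknown words in text."""
    candidates = {}
    for token in tokenize(text):
        word = token.lower()
        if word in dictionary:                  # exclusion corpus: known word
            continue
        if token in proper_nouns:               # filter 1: proper names
            continue
        if is_probable_typo(word, dictionary):  # filter 2: likely typos
            continue
        # Pre-categorization (assumption): a word found in a foreign lexicon
        # is labelled a borrowing, anything else an 'internal' neologism.
        candidates[word] = "borrowing" if word in foreign_lexicon else "internal"
    return candidates


# Toy data, invented for illustration only.
DICTIONARY = {"le", "la", "et", "a", "adore", "gratuit", "nouveau", "écoute"}
PROPER_NOUNS = {"Marie"}
FOREIGN_LEXICON = {"podcast"}

text = "Marie écoute le podcast et adore le covoiturage gratuitt"
print(candidate_neologisms(text, DICTIONARY, PROPER_NOUNS, FOREIGN_LEXICON))
# {'podcast': 'borrowing', 'covoiturage': 'internal'}
```

In this toy run "Marie" is dropped as a proper name and "gratuitt" as a probable typo of "gratuit", leaving two candidates. A real exclusion dictionary would be far larger, and the similarity cutoff would need tuning per language.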

Lexicographers

Language | Authors | Institution
French | Jean-François Sablayrolles (3), Emmanuel Cartier (1), Najet Boutmgharine (2), Massimo Bertocci (1), John Humbley (2), Natalie Kübler (2), Giovanni Tallarico (5), Christine Jacquet-Pfau (4) | LIPN-RCLN (UP13) (1), CLILLAC-ARP (UP7) (2), HTL (UP7) (3), Collège de France (4), University of Verona (5)
Estonian | Jelena Kallas | Institute for the Estonian Language
Chinese | Lichao Zhu (2017) | Université Paris 13
Greek | Anna Anastassiadis-Symfonidis, Dimitra Alexandridou | University of Thessaloniki (EMPNEO group)
Italian | Jana Altmanova (1), Claudio Grimaldi (1), Silvia Zollo (1), Michela Murano (2), Maria-Teresa Zanolla (2) | University of Naples (1), Catholic University of Milan (2)
Polish | Alicja Kacprzak, Anna Bobińska and Andrzej Napieralski | Instytut Romanistyki, Uniwersytet Łódzki (EMPNEO group)
Portuguese (Brazil) | Ieda Alvès | University of São Paulo
Russian | Tatiana Iakovleva (2017) | CLILLAC-ARP (UP7)
Czech | Jan Lazar, Radka Mudrochova, Zuzana Hildenbrand | EMPNEO group

Computational Developments

Contributor | Role | Institution
Emmanuel Cartier | Project coordinator, backend and frontend development | LIPN - RCLN (UP13)
Gaël Lejeune (Sept. 2016 - Sept. 2017) | Machine learning experiments for the formal neologism detection module | LIPN - RCLN (UP13)
Loïc Galand (Nov. 2017-) | Néonaute project | LIPN - RCLN (UP13)

Néoveille general presentation (check the official website for the latest updates)

Cartier, Emmanuel (2016), « Neoveille, système de repérage et de suivi des néologismes en sept langues », Neologica 10, p. 101-131. Pre-print (this document describes the project at its outset; for a recent version, see the 2018 article).

Cartier, Emmanuel (2017), Neoveille, a Web Platform for Neologism Tracking, Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017.

Cartier, Emmanuel (2018, forthcoming), « Neoveille, plateforme de détection, de repérage et de suivi des néologismes en dix langues », pdf

Linguistic studies based on Néoveille data

Boutmgharine Idyassner, Najet (2016), « Les stratégies de glose sur l’emprunt en discours », Colloque Emprunts néologiques et équivalents autochtones. Mesure de leurs circulations respectives, Uniwersytet Łódzki, 10-12 October 2016, Łódź, Poland. http://neologie.uni.lodz.pl

Tallarico Giovanni (2016), « Cinquante nuances de board : les anglicismes néologiques et leurs équivalents dans le domaine des sports de glisse », Colloque Emprunts néologiques et équivalents autochtones. Mesure de leurs circulations respectives, Uniwersytet Łódzki, 10-12 October 2016, Łódź, Poland. http://neologie.uni.lodz.pl

Viaux Julie, Cartier Emmanuel (2016), « Étude linguistique et quantitative de la pénétration des anglicismes de type (N,ADJ)-Ving dans sept langues à partir d’un corpus contemporain journalistique », Colloque international Emprunts néologiques et équivalents autochtones. Mesure de leurs circulations respectives, Uniwersytet Łódzki, 10-12 October 2016, Łódź, Poland.

Lejeune Gaël, Cartier Emmanuel (2017), Character Based Pattern Mining for Neology Detection, Proceedings of the First Workshop on Subword and Character Level Models in NLP, EMNLP 2017, Copenhagen, p. 25-30.

Cartier E., Sablayrolles J.-F., Boutmgharine N., Humbley J., Bertocci M., Jacquet-Pfau C., Kübler N. and Tallarico G. (2018), « Détection automatique, description linguistique et suivi des néologismes en corpus : point d'étape sur les tendances du français contemporain », Actes du Congrès Mondial de Linguistique Française, Mons (Belgium), 9-13 July 2018, 20 p.

Cartier E. (2018), « Emprunts en français contemporain : étude linguistique et statistique à partir de la plateforme Néoveille », in Emprunts en question(s), Kacprzak, A.; Mudrochová, R.; Sablayrolles, J.-F. (eds.), La Lexicothèque, Limoges, Lambert-Lucas, 27 p.

Video presentation of the Néoveille platform
Video presentation of the public interface