Digital Humanities at LATTICE

Digital Humanities (DH) is an area of scholarly activity at the intersection of computing and the disciplines of the humanities. LATTICE has been active in the field for several years. We are specifically interested in developing new natural language processing techniques for the various fields of the humanities and the social sciences. The lab has developed two complementary lines of research:

  • the use of Digital Humanities techniques for linguistics. We have developed techniques to automatically annotate corpora at different levels (part-of-speech, syntax, semantics). Outputs include the SEM parser, which provides part-of-speech annotations, chunks and named entities for contemporary French texts. Adaptation of this parser to other languages is ongoing, including morphologically rich Finno-Ugric languages. The lab has also developed a series of resources, especially annotated corpora. A major recent output is the Syntactic Reference Corpus of Medieval French (SRCMF), covering a period from 842 to the end of the 13th century and containing about 251,000 words with syntactic annotations. It is the first corpus of this size for Medieval French that is syntactically annotated and manually checked. The corpus will be released and made available online soon.
  • the use of natural language processing (NLP) techniques for different Digital Humanities areas. Ongoing projects cover a wide range of topics, from the social sciences to the humanities. For example, we have been collaborating with the UCL Centre for Digital Humanities on the Transcribe Bentham corpus. In this project, we aim to analyse the texts included in the corpus, extract their main topics, cluster related texts together and provide original and meaningful visualisations of the structure of the corpus. Another project deals with the analysis of climate negotiations. In this context, our system identifies the points supported and opposed by negotiating actors and extracts key concepts from those points. The results are displayed in a dedicated interface that allows the positions of different actors to be compared. Recent projects include a collaboration called ‘Distant Rhythm’ with UNED in Madrid (the Open University of Spain). Our goal here is to automatically detect enjambments in four centuries of Spanish sonnets.

Our research has been presented and published mainly at linguistics conferences for the first line of research, and at the Digital Humanities conferences (the main forum for research in the domain) for NLP applied to DH issues. We are also preparing extended versions of these publications for submission to specialized journals. A selection of references is given below.

Recent and current projects

  • The ANR-DFG SRCMF project. The Syntactic Reference Corpus of Medieval French (SRCMF) was financed by the Agence nationale de la recherche (ANR) and the Deutsche Forschungsgemeinschaft (DFG) between 2010 and 2013 (principal investigators: Sophie Prévost, LATTICE, and Achim Stein, University of Stuttgart). The SRCMF is the first dependency treebank for Medieval French. It consists of syntactically annotated parts of two text corpora of Medieval French: the Base de Français Médiéval (BFM) and the Nouveau Corpus d’Amsterdam (NCA). Texts covering the Old French period from 842 to the end of the 13th century and containing about 251,000 words were annotated manually and published along with the tools and documentation presented on the project website.
  • LAKME is a PSL-funded project exploring new NLP techniques (esp. machine learning techniques) to annotate corpora of scholarly relevance. The project focuses on morphologically rich languages, which are especially challenging for current NLP systems. Three languages (or groups of languages) are considered: Rabbinic Hebrew, Medieval French and some Uralic languages (esp. Finnish, Komi and Udmurt). The project is a collaboration between LATTICE (PI: Thierry Poibeau), the Ecole Pratique des Hautes Etudes (Daniel Stoekl Ben Ezra) and the Ecole Nationale des Chartes (Jean-Baptiste Camps).
  • The ANR DEMOCRAT project also contributes to research in Digital Humanities by proposing new methods for the automatic annotation of co-reference chains in texts (mostly Medieval and contemporary French texts).

International collaborations

  • UCL Centre for Digital Humanities. We have collaborated with DH@UCL since 2014 through the Bentham Project. UCL has provided the Transcribe Bentham corpus and LATTICE has developed text mining and content analysis tools to extract key information from the corpus. See our publications and the online demo.
  • Digital Humanities Innovation Lab (LINHD) at the Universidad Nacional de Educación a Distancia (UNED, the Open University of Spain) in Madrid. With UNED, we have started a collaboration on a collection of Spanish poems spanning four centuries.
  • Collaborations are expected to start soon with other labs in Europe.

National collaborations

  • Within PSL, collaborations are ongoing with EPHE and ENC; see the LAKME project for more information. LATTICE is also one of the leading labs involved in the E-Philologie series of doctoral courses, which explores different facets of DH at the Master's and doctoral levels (ENS, EPHE, ENC, EHESS).
  • We are also working with AOROC, a research unit specializing in archeology at EPHE and ENS. The collaboration mainly consists of extracting key information from written documents in order to provide semantic indexing and search functionalities.
  • We are a member of the labex TRANSFERS, which also includes a Digital Humanities group mainly working on databases and maps.

Recent applications and demos

  • SEM, our part-of-speech tagger, chunker and named entity recognizer for French

Three Selected Publications

  • Estelle Tieberghien, Frédérique Mélanie-Becquet, Pablo Ruiz Fabo, Thierry Poibeau, Melissa Terras, and Tim Causer. Mapping the Bentham Corpus. Digital Humanities 2016, Jul 2016, Krakow, Poland. <hal-01378029>