Rhapsodie Project

by webmestre - published on

You can read the text of the project here.


Since the beginning of the 1980s, a number of large-scale projects have been launched to build oral corpora for widely spoken languages. Over the same period, international consortia were set up to coordinate such projects (for example, Clarin). In this vibrant context, the French authorities have become aware that there is ground to make up in the building and exploitation of oral corpora. This is no doubt why a large number of projects aimed at building large corpora of spoken French have been initiated over the last 20 years. More recently, various systems for sharing and exchanging resources have been put in place at the national level (see the Resource Centre for the Description of the Spoken Language (CRDO)).

Three basic questions arise from these efforts to collect, exploit and store oral corpora: their subdivision into representative discourse genres, the transcription conventions adopted, and the types of annotation made available. The last of these raises the associated issue of annotation standards, a major issue for prosody, which on the whole remains the poor relation. Relatively few corpora have been annotated, and where they have been, their transcription rests on theoretical assumptions that are too strong to be widely shared. This is the case with the ToBI system, which is de facto imposed as the norm for the annotation of prosody. It is also the case with C-ORAL-ROM, where the annotation is closely bound up with the notion of speech act as conceived by E. Cresti. In addition, the syntactic treatment of oral corpora remains insufficient, often boiling down to lemmatisation and part-of-speech tagging.

In this context, our project aims to build a reference corpus of spoken French, subdivided into representative discourse genres and equipped with prosodic and syntactic annotations. These annotations may be used to analyse the status of prosody in discourse as well as its relations with syntax and information structure.

The term "reference corpus" is justified here in several important respects:

- Through the subdivision into representative discourse genres, which is based on a thorough study of these types.
- Through research that builds on earlier work: we draw on the results of the last 20 years of work on macro-syntax, as well as on the research into the prototypical structures of spoken French yielded by the reference corpora C-ORAL-ROM and DELIC. Our added value lies in enriching what is known about the intonational profiling of these structures, which earlier research has documented well but whose detailed prosodic analysis remains to be done.
- Through an annotation strategy that is not tied to a narrow phonological framework. This makes it possible to envisage different interpretative approaches, developed within a variety of theoretical frameworks and around complementary research axes, which will enrich the existing grammars of spoken French.
- Through the development of robust, semi-automatic labelling and segmentation algorithms that are easy to use and freely distributed, hence open to shared use, and thus capable of being stored by the Aix annexe of the Resource Centre for the Description of the Spoken Language (CRDO).
- "Reference" also, since the corpus will have been set up on the basis of minimal charters in order to be freely distributable. As far as the sound quality of the recordings is concerned, we refer to the PFC protocol. Regarding the ethical and legal dimension (requests for authorisation and informed consent, anonymisation) and the coding and cataloguing of the metadata (sociological identification of the speakers, genres represented, etc.), we follow the recommendations set out in the DGLFL good practice guide.
- "Reference", finally, since the project will involve an in-depth discussion of the standards for the annotation and formatting of the resources (TEI, XML); hence, once again, the potential for storage in the CRDO (Paris annexe, under the direction of M. Jacobson), which ensures international visibility.