Master TAL - MSc. NLP

Course Unit

Written Corpora







Course Description

This course explores the methods and techniques used in NLP to build and use written corpora. It introduces concepts from corpus linguistics, the criteria that must be taken into account when collecting a corpus (size, variety, etc.), and the forms a corpus may take (character encoding, XML, …). Other topics include collecting corpora from the Web, handling different document formats, and normalising badly collected or heterogeneous data.


Learning Outcomes

  • Ability to design the content of a corpus of written documents
  • Ability to normalise the data in a corpus
  • Ability to identify principles and examples of corpus annotation



Prerequisites

  • The first-semester courses of the master's programme have no prerequisites other than those defined for the specialisation

Targeted Skills

  • Ability to collect, structure, and represent data (sound, text, images, …)
  • Ability to combine and apply interdisciplinary skills and know-how to create innovative solutions


Bruno Guillaume


More Information


  • To be completed

Course URL – Arche

  • To be completed

Links with other courses

  • 702-EC2, 803 and 902-EC2

Evaluation procedures

Number of Tests

  • 2

Nature of the tests

  • labs
  • final exam

Group work

  • N/A

Combined with other specialisations

  • No
