The Emille Corpus (Beta Release Version)

Title

The Emille Corpus (Beta Release Version) [Electronic resource]

Editor McEnery, A.M. (ed.); Baker, Paul (ed.); Hardie, Andrew (ed.)
Availability This resource is freely available; you should be able to download it now.
Languages

English; Gujarati; Tamil; Hindi; Panjabi; Urdu; Bengali

Editorial Practice

Encoding format: SGML

OTA keywords Linguistic corpora
Corpus
LC keywords

South Asia--Languages
Indo-Aryan languages, Modern
Linguistic analysis (Linguistics)

Extent
  • designation: Text data
  • size: (tbc)
Creation Date 2003
Source Description

Notes

Mode of access: Online. OTA website

The collection consists of: thirty million words of monolingual written data in Gujarati, Tamil, Hindi and Punjabi (news website articles); 600,000 words of monolingual spoken data in Hindi, Urdu, Punjabi, Bengali and Gujarati (radio broadcasts); and 120,000 words of parallel data in each of English, Hindi, Urdu, Punjabi, Bengali and Gujarati (U.K. government leaflets).

Further information available at: http://www.emille.lancs.ac.uk/home.htm