The Emille Corpus (Beta Release Version)

Title

The Emille Corpus (Beta Release Version) [Electronic resource]

Editor

McEnery, A.M. (ed.); Baker, Paul (ed.); Hardie, Andrew (ed.)

Availability

Distributed by the University of Oxford under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Download: zip

Languages

English; Gujarati; Tamil; Hindi; Panjabi; Urdu; Bengali

Editorial Practice

Encoding format: SGML

OTA keywords

Linguistic corpora
Corpus

LC keywords

South Asia--Languages
Indo-Aryan languages, Modern
Linguistic analysis (Linguistics)

Extent
  • designation: Text data
  • size: 6551 files : ca. 482 MB

Creation Date

2003

Source Description

no source record

Notes

The collection consists of: thirty million words of monolingual written data in Gujarati, Tamil, Hindi and Punjabi (news website articles); 600,000 words of monolingual spoken data in Hindi, Urdu, Punjabi, Bengali and Gujarati (radio broadcasts); and 120,000 words of parallel data in each of English, Hindi, Urdu, Punjabi, Bengali and Gujarati (U.K. government leaflets).

Further information is available at: http://www.emille.lancs.ac.uk/home.htm

Permanent URL

http://purl.ox.ac.uk/ota/2460