Faroese Megaword Corpus and Infrastructure for Research and Language Technology

  • Debess, Iben Nyholm (PI)
  • Simonsen, Annika (PI)
  • Sigurðsson, Einar Freyr (PI)
  • Steingrímsson, Steinthór (PI)

Project Details

Description

The aim of the project is to build a large, representative corpus of Faroese texts. The corpus should be as large as possible, preferably no smaller than 25 million words. Although much of it will consist of texts taken from the Internet, we will focus on collecting a wide variety of texts, representing as many genres as possible.
The texts will be automatically tagged with morphosyntactic tags.
Furthermore, as part of the project, we will correct OCR errors and work on a Faroese OCR model, using an existing model for Icelandic.

We aim to publish all data under an open license, CC BY 4.0. Users will be able to download the corpus or search it online.
The project is modelled after the highly successful Icelandic Gigaword Corpus (see http://malheildir.arnastofnun.is/ and https://www.aclweb.org/anthology/L18-1690), which consists of more than 2.5 billion running words.

The project will be primarily led from the University of the Faroe Islands in close collaboration with experts at the Árni Magnússon Institute for Icelandic Studies.

By combining Faroese and Icelandic data resources, tools, and expertise, we aim for a high-quality product.

The corpus will be published on a Faroese domain, where anyone can download it or search it online.

The project is funded by Nordplus Nordic Languages.
Short title: Faroese Megaword Corpus
Status: Active
Effective start/end date: 1/08/23 to 1/08/25
