Abstract
We investigate GPT-Sw3, a generative language model for the Nordic languages, to assess its understanding of Faroese, a low-resource language. Our aim is to demonstrate the advantages of using language-family-specific generative models to augment data for related languages with fewer resources. We evaluate GPT-Sw3 by prompting it for Faroese-to-English translation in zero-, one-, and few-shot settings. We assess these translations with an ensemble score, defined as the arithmetic mean of the BLEU score and a semantic similarity score (SBERT). Moreover, we challenge the model’s Faroese language understanding on a small dataset of curated Faroese trick sentences. There, we compare the model’s performance with OpenAI’s GPT-3.5 and GPT-4, demonstrating the advantages of using a language-family-specific generative model for navigating non-trivial scenarios. We evaluate the resulting pipeline and use it, as a proof of concept, to create an automatically annotated Faroese semantic textual similarity (STS) dataset.
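The ensemble score mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_similarity` and `ensemble_score` are hypothetical, the BLEU value is assumed to be precomputed and rescaled to the 0-1 range, and the semantic similarity is assumed to be the cosine similarity between two sentence embeddings (such as those an SBERT model would produce). The abstract does not state how the two scores are scaled, so the normalization here is an assumption.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("embedding vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_u * norm_v)


def ensemble_score(bleu: float, semantic_similarity: float) -> float:
    """Arithmetic mean of a BLEU score and a semantic similarity score.

    Assumption: `bleu` is already on a 0-1 scale (not 0-100), so that
    both components contribute on a comparable range.
    """
    if not 0.0 <= bleu <= 1.0:
        raise ValueError("bleu is expected on a 0-1 scale")
    return (bleu + semantic_similarity) / 2.0


# Toy usage: made-up embeddings stand in for real sentence embeddings
# of a reference translation and a model translation.
reference_embedding = [0.2, 0.7, 0.1]
hypothesis_embedding = [0.25, 0.65, 0.05]
similarity = cosine_similarity(reference_embedding, hypothesis_embedding)
score = ensemble_score(bleu=0.4, semantic_similarity=similarity)
```

Averaging the two components means a translation that diverges lexically from the reference but preserves its meaning is not penalized as heavily as it would be under BLEU alone.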
Original language | English |
---|---|
Title of host publication | Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) |
Place of Publication | Torino |
Pages | 6496–6503 |
Number of pages | 8 |
Publication status | Published - 22 May 2024 |
Event | LREC-COLING 2024 - Torino, Italy Duration: 20 May 2024 → 25 May 2024 https://lrec-coling-2024.org/ |
Conference
Conference | LREC-COLING 2024 |
---|---|
Country/Territory | Italy |
City | Torino |
Period | 20/05/24 → 25/05/24 |
Internet address | https://lrec-coling-2024.org/ |
Keywords
- Semantic Textual Similarity
- Low-resource language
- Machine translation
- Data augmentation