Accelerating global language loss, associated with elevated incidence of illicit substance use, type 2 diabetes, binge drinking, and assault, as well as sixfold higher youth suicide rates, poses a mounting challenge for minority, Indigenous, refugee, colonized, and immigrant communities. In environments where intergenerational transmission is often disrupted, artificial intelligence neural machine translation systems have the potential to revitalize heritage languages and empower new speakers by allowing them to understand and be understood via instantaneous translation. Yet, artificial intelligence solutions pose problems, such as prohibitive cost and output quality issues. A solution is to couple neural engines to classical, rule-based ones, which empower engineers to purge loanwords and neutralize interference from dominant languages. This work describes an overhaul of the engine deployed at LemkoTran.com to enable translation into and out of Lemko, a severely endangered, minority lect of Ukrainian genetic classificability indigenous to borderlands between Poland and Slovakia (where it is also referred to as Rusyn). Dictionary-based translation modules were fitted with morphologically and syntactically informed noun, verb, and adjective generators fueled by 877 lemmata together with 708 glossary entries, and the entire system was riveted by 9,518 automatic, codification-referencing, must-pass quality-control tests. The fruits of this labor are a 23% improvement since last publication in translation quality into English and 35% increase in quality translating from English into Lemko, providing translations that outperform every Google Translate service by every metric, and score 396% higher than Google’s Ukrainian service when translating into Lemko.
Please cite as: Orynycz, P. (2023). BLEU Skies for Endangered Language Revitalization: Lemko Rusyn and Ukrainian Neural AI Translation Accuracy Soars. In: Degen, H., Ntoa, S. (eds) Artificial Intelligence in HCI. HCII 2023. Lecture Notes in Computer Science, vol 14051. Springer, Cham. https://doi.org/10.1007/978-3-031-35894-4_10
This version of the contribution has been accepted for publication after peer review but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at https://doi.org/10.1007/978-3-031-35894-4_10. Use of this Accepted Version is subject to the publisher’s Accepted Manuscript terms of use: https://www.springernature.com/gp/open-research/policies/accepted-manuscript-terms.
1 Introduction
1.1 The Problem
Languages are being lost at a rate of at least one per calendar quarter, with such loss set to triple by 2062, and increase fivefold by 2100, affecting over 1,500 speaker communities [1, pp. 163 and 169]. Such outcomes are associated with elevated incidence of illicit substance use [2, p. 179], type 2 diabetes [3], binge drinking, and assault [4], as well as sixfold higher youth suicide rates when fewer than half of community members have language knowledge [5].
A recent study in the United States found that Indigenous language use has positive effects on health, regardless of proficiency level [6]. An experiment on speakers in Poland has found that use of Lemko moderates emotional, behavioral, and depressive symptoms stemming from cognitive availability of trauma [7].
Artificial intelligence machine translation might be of service in spreading the aforementioned protective effects to heritage speakers by revitalizing dying and Sleeping languages [8, p. 577]. For example, new speakers might produce correct text instantaneously and enjoy reading comprehension using automatic machine translation devices as an aid until full, independent fluency is achieved.
1.2 System Under Study
Lemko is a definitely to severely endangered [9, pp. 177–178] East Slavic lect of southwestern Ukrainian genetic classificability [10, p. 52; 11, p. 39] indigenous to borderlands between the Republic of Poland and the Slovak Republic; some have referred to it as Rusyn [11, p. 39; 12].
Eastern boundaries
A unique isogloss differentiating Lemko to the East is fixed paroxytonic (penultimate syllable) stress, a feature shared with Polish and Eastern Slovak dialects [10, pp. 161–162 and 972–973; 11, p. 50; 13, pp. 70–73], making its extent in Eastern Slovakia at least to the Laborec River, with a transitional zone extending thereafter [13, p. 70; 11, p. 50]. Meanwhile in Poland, the historical extent of Lemko reaches at least the Osławica or Wisłok rivers, with a transitional zone beyond them [11, p. 50].
Western boundaries
The historical western boundaries of Lemko are the Poprad and Dunajec rivers [14, p. 459].
Ancestral villages of the native speakers whose interviews comprise the corpus lie within the current administrative borders of Lesser Poland Province, whose capital is Cracow.
Lemko name | Transliteration | Polish name | County Seat | Commune Seat |
Ізбы | Izbŷ | Izby | Gorlice | Uście Gorlickie |
Ґлaдышiв | Gladŷšiv | Gładyszów | Gorlice | Uście Gorlickie |
Чорне | Čorne | Czarne | Gorlice | Sękowa |
Долге | Dolhe | Długie | Gorlice | Sękowa |
Білцарьова | Bilcarʹova | Binczarowa | Nowy Sącz | Grybów |
Фльоринка | Flʹorynka | Florynka | Nowy Sącz | Grybów |
Чырна | Čŷrna | Czyrna | Nowy Sącz | Krynica-Zdrój |
2 State of the Art
Last year, the world’s first quality evaluation results were published for machine translations into Lemko: BLEU 6.28, which was nearly triple that of Google Translate’s Ukrainian service[1] (BLEU 2.17) [15, p. 570]. The year before, my colleagues and I had published and presented the world’s first results for Lemko to English machine translation: BLEU 14.57 [16].
[1] Disclosure: I work as a paid Ukrainian, Polish, and Russian translation quality control specialist for the Google Translate project. My client’s headquarters are in San Francisco, California.
The engine has been deployed and made freely available at https://www.LemkoTran.com, where a transliteration engine has been in service since the autumn of 2017. The translation engine was first alluded to in print by Drs. Scherrer and Rabus in the Cambridge University Press journal Natural Language Engineering in 2019 [17].
3 Materials and Methods
3.1 Materials
The experiment was performed on a bilingual corpus comprising Lemko Cyrillic transcripts and English translations of interviews with survivors and children of forced resettlements from ancestral lands in Poland. The transcripts and their translations[1] were aligned across 3,267 segments, with Microsoft Word providing a Lemko source word count of 68,944 and an English target word count of 81,188.
[1] I was hired to produce the transcripts and translate them by the John and Helen Timo Foundation of Wilmington, Delaware, who then donated the work products to my scientific research and development endeavors.
Sources of truth included the dictionaries of Jarosław Horoszczak [18], Petro Pyrtej [19], Ihor Duda [20], and Janusz Rieger [21], as well as the grammars of Henryk Fontański and Mirosława Chomiak [22] and Petro Pyrtej [23].
3.2 Methods
Engine Upgrades
For this experiment, the engine deployed at LemkoTran.com was fitted with newly built generators informed by part of speech, grammatical case, and number for the purpose of producing grammatically and syntactically appropriate translations for 1,585 dictionary entries, about half of which do not inflect in Polish or Lemko, allowing for simple substitution.
Quality Assurance Tests
Quality was ensured by 9,518 tests cross-referenced when feasible with the Lemko codifications, grammars, and dictionaries listed above under Materials. The tests themselves assert that the system translates given utterances in the desired manner.
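The shape of such a must-pass test can be sketched as follows. This is only an illustration under stated assumptions: the `translate` function and the `GLOSSARY` dictionary are hypothetical stand-ins rather than the engine's actual interface, and the three test cases simply reuse Polish and Lemko village names from the table in Section 1.2.

```python
import unittest

# Toy glossary reusing village names from Section 1.2 (Polish name -> Lemko name).
# Hypothetical data structure; the real engine's storage format is not described.
GLOSSARY = {"Izby": "Ізбы", "Czarne": "Чорне", "Florynka": "Фльоринка"}


def translate(term: str) -> str:
    """Hypothetical stand-in for the engine: exact glossary lookup only."""
    return GLOSSARY[term]


class MustPassTests(unittest.TestCase):
    """Each case asserts that a given input yields the desired output."""

    def test_village_names(self):
        cases = [
            ("Izby", "Ізбы"),
            ("Czarne", "Чорне"),
            ("Florynka", "Фльоринка"),
        ]
        for source, expected in cases:
            # subTest reports every failing case instead of stopping at the first.
            with self.subTest(source=source):
                self.assertEqual(translate(source), expected)
```

Saved as a module, the suite could be run with `python -m unittest`; treating any failure as a blocker is what would make such tests "must-pass".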
Description | Quantity |
Noun stem | 414 |
Verb stem | 296 |
Adjective stem | 167 |
Pronoun, personal | 87 |
Pronoun, other | 178 |
Numeral | 86 |
Other dictionary entries | 357 |
Total | 1,585 |
Rule-Based Machine Translation (RBMT)
Text was given a Lemko or Polish look and feel by replacing character sequences, and especially inflectional endings.
Polish Sequence | Lemko Sequence | Position |
ować | uwaty | Final |
iami | iamy | Final |
ają | ajut | Final |
ze | zo | Initial |
pod | pid | Initial |
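The positional replacements in the table above can be sketched in a few lines of Python. This is a minimal illustration, not the deployed engine: the names `RULES`, `apply_rules`, and `convert` are hypothetical, only the five rules listed above are included, the rules are assumed to operate on lowercase Latin-script words, and applying at most one initial and one final rule per word is an assumption of this sketch.

```python
import re
import unicodedata

# Rules copied from the table above: (Polish sequence, Lemko sequence, position).
RULES = [
    ("ować", "uwaty", "final"),
    ("iami", "iamy", "final"),
    ("ają", "ajut", "final"),
    ("ze", "zo", "initial"),
    ("pod", "pid", "initial"),
]


def apply_rules(word: str) -> str:
    """Apply at most one initial and one final rule to a single word."""
    for src, dst, position in RULES:
        if position == "initial" and word.startswith(src):
            word = dst + word[len(src):]
            break
    for src, dst, position in RULES:
        if position == "final" and word.endswith(src):
            word = word[:-len(src)] + dst
            break
    return word


def convert(text: str) -> str:
    """Rewrite every word in the text, leaving spaces and punctuation intact."""
    # Normalize so that letters such as "ą" are single code points.
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\w+", lambda match: apply_rules(match.group(0)), text)


print(apply_rules("pracować"))    # pracuwaty
print(apply_rules("podpisować"))  # pidpisuwaty
```

The outputs show only the mechanics of sequence replacement; whether a given result is an attested Lemko form is exactly what the quality assurance tests are there to check.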
Translation Quality Scoring
Translation quality was measured per industry standard metrics using the default settings of the SacreBLEU tool invented at Amazon Research by Matt Post [24]. For the sake of comparability, Polish was rendered in Lemko Cyrillic in the same way as the last experiment [15, p. 573].
Bilingual Evaluation Understudy (BLEU)
This n-gram-based metric has enjoyed wide currency for decades. It was developed in the United States at the IBM T. J. Watson Research Center with support from the Defense Advanced Research Projects Agency (DARPA) and monitoring by the United States Space and Naval Warfare Systems Command (SPAWAR) [25].
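For reference, the standard definition of the metric from [25] can be written as below, where p_n is the modified n-gram precision, c is the total length of the candidate translation, r is the effective reference length, and the usual configuration is N = 4 with uniform weights w_n = 1/N. The scores in this paper are on a 0 to 100 scale, as reported by SacreBLEU, which corresponds to multiplying the value below by 100.

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r, \\
e^{\,1 - r/c} & \text{if } c \le r.
\end{cases}
```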
Translation Edit Rate (TER)
This metric reflects the number of edits necessary for output to semantically approach a correct translation, aiming to be more tolerant of phrasal shifts than BLEU and other n-gram-based metrics. It is calculated by dividing the edit distance between a hypothesis and a reference by the average reference word count, so lower scores indicate better translations. Its development in the United States was also supported by DARPA [26].
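The ratio described above can be sketched as follows. This is a deliberately simplified illustration and not the SacreBLEU implementation used in the experiment: it counts only word insertions, deletions, and substitutions, and omits the block shifts that full TER treats as single edits, so it may overestimate the real score. The function names are hypothetical.

```python
from typing import List, Sequence


def word_edit_distance(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete hyp_word
                            curr[j - 1] + 1,      # insert ref_word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def simplified_ter(hypothesis: str, references: List[str]) -> float:
    """Fewest edits to any reference, divided by average reference length, times 100."""
    hyp = hypothesis.split()
    refs = [reference.split() for reference in references]
    avg_ref_len = sum(len(ref) for ref in refs) / len(refs)
    if avg_ref_len == 0:
        raise ValueError("references must contain at least one word")
    edits = min(word_edit_distance(hyp, ref) for ref in refs)
    return 100.0 * edits / avg_ref_len


print(simplified_ter("a x c d", ["a b c d"]))  # 25.0 (one substitution over four words)
```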
Character n-gram F-score (chrF)
This European metric has been shown to correlate very well with human judgments and even outperform both BLEU and TER [27].
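The idea behind a character n-gram F-score can be sketched as follows. This is a simplified illustration rather than the SacreBLEU implementation used in the experiment: it assumes character n-grams of orders 1 to 6, removal of whitespace, a single reference, and beta = 2 (the "chrF2" label in the tables below), and its handling of short strings may differ from the real tool. The function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing whitespace (an assumption here)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simplified_chrf(hypothesis: str, reference: str,
                    max_order: int = 6, beta: float = 2.0) -> float:
    """Average n-gram precision and recall over orders, then combine as F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is too short for this order
        overlap = sum((hyp & ref).values())  # clipped matching n-gram count
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    # With beta = 2, recall is weighted more heavily than precision.
    return 100.0 * (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
```

Because matching happens below the word level, a hypothesis that gets a stem right but an inflectional ending wrong still earns partial credit, which is relevant for morphologically rich languages such as Lemko.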
4 Results and Discussion
The experimental system, LemkoTran.com, outperformed every Google Translate service by every metric. English to Lemko translation BLEU quality scores improved 35% in comparison with last published results [15], producing results four times better than Google Translate’s next-best offering, its Ukrainian service. Meanwhile, Lemko to English translation quality improved by 23% since last published results [16], achieving BLEU scores 16% higher than the best obtained by Google Translate, which automatically recognized Lemko as Ukrainian 76% of the time, as Russian 16% of the time, and as Belarusian 6% of the time.
4.1 English to Lemko Translation Quality
The engine deployed at LemkoTran.com bested Google Translate by every metric when translating from English into Lemko. The next-highest scoring system in the experiment was either the output of Google Translate’s Ukrainian service (using the BLEU or chrF metrics) or that of its Polish service (using the TER metric).
The translation quality of the system deployed at LemkoTran.com as measured by the most widespread BLEU metric rose to 8.48, a 35% improvement on results last published in 2022 [15], and now quadruple Google Translate’s highest score.

The LemkoTran.com engine achieved the best English to Lemko character n-gram f-score (chrF 37.30), which is 37% higher than the next best, Google Translate’s Ukrainian service. Meanwhile, Google Translate’s Russian service scored higher than its Polish and Belarusian counterparts when measured against the Lemko corpus by this metric.

The LemkoTran.com engine achieved the best English to Lemko Translation Edit Rate (TER), scoring 81.33. Google Translate’s Polish service scored second best, followed closely by its Ukrainian one.

Output from the translation systems when fed English is given below.
Input | Our children were smart too. But where were they supposed to study? | |||
Description | Output | Transliteration | Quality Scores | |
Lemko reference (native speaker) | В нас діти тіж были мудры, але де мали ся вчыти? | V nas dity tiž bŷly mudrŷ, ale de maly sja včŷty? | BLEU 100 chrF2 100 TER 0 | |
Translation into Lemko by LemkoTran.com | Нашы діти тіж были мудры. але де мали ся вчыти? | Našŷ dity tiž bŷly mudrŷ. ale de maly sja včŷty? | BLEU 58.34 chrF2 79.03 TER 27.27 | |
Google Translate (control) | Translation into Ukrainian | Наші діти теж були розумними. Але де вони мали вчитися? | Naši dity tež buly rozumnymy. Ale de vony maly včytysja? | BLEU 4.41 chrF2 25.80 TER 72.73 |
Translation into Russian | Наши дети тоже были умными. Но где им было учиться? | Naši deti tože byli umnymi. No gde im bylo učitʹsja? | BLEU 3.71 chrF2 16.95 TER 90.91 | |
Translation into Polish | Наше дзєці теж били мондре. Алє ґдзє мєлі сє учиць? | Naše dzjeci tež byly mondre. Alje gdzje mjeli sje učycʹ? | BLEU 3.12 chrF2 13.84 TER 100 | |
Translation into Belarusian | Разумныя былі і нашы дзеці. Але дзе яны павінны былі вучыцца? | Razumnyja byli i našy dzeci. Ale dze jany pavinny byli vučycca? | BLEU 3.09 chrF2 12.83 TER 100 |
Input | And generally speaking, Lemkos in Poland don’t have a leader, so to speak, who would say something. | |||
Description | Product | Transliteration | Quality Scores | |
Lemko reference (native speaker) | А воґулі Лемкы в Польщы не мают такого, же так повім, такого лідера, котрий бы штоси повіл. | A voguli Lemkŷ v Pol’ščŷ ne majut takoho, že tak povim, takoho lidera, kotryj bŷ štosy povil. | BLEU 100 chrF2 100 TER 0 | |
Translation into Lemko by LemkoTran.com | І генеральні Лемкы в Польщы не мают лидера, же так повім, котрий бы штоси повіл. | I heneral’ni Lemkŷ v Pol’ščŷ ne majut lydera, že tak povim, kotryj bŷ štosy povil. | BLEU 55.58 chrF2 65.32 TER 29.41 | |
Google Translate (control) | Translation into Polish | І ґенеральнє Лемковє в Польсце нє майон лідера, же так повєм, ктури би цось повєдзял. | I general’nje Lemkovje v Pol’sce nie majon lidera, že tak povjem, ktury by cos’ povjedzjal. | BLEU 9.26 chrF2 29.29 TER 82.35 |
Translation into Ukrainian | І взагалі, лемки в Польщі не мають лідера, так би мовити, який би щось сказав. | I vzahali, lemky v Pol’shchi ne mayut’ lidera, tak by movyty, yakyj by shchos’ skazav. | BLEU 5.15 chrF2 26.56 TER 82.35 | |
Translation into Russian | И вообще, у лемков в Польше нет, так сказать, лидера, который бы что-то сказал. | I voobšče, u lemkov v Polʹše net, tak skazatʹ, lidera, kotoryj by čto-to skazal. | BLEU 2.96 chrF2 25.87 TER 88.24 | |
Translation into Belarusian | І ўвогуле лэмкі ў Польшчы ня маюць лідэра, так бы мовіць, які б нешта сказаў. | I ŭvohule lèmki ŭ Pol′ščy nja majuc′ lidèra, tak by movic′, jaki b nešta skazaŭ. | BLEU 2.72 chrF2 18.05 TER 94.12 |
4.2 Lemko to English Translation Quality
For every metric, the engine deployed at LemkoTran.com outperformed Google Translate, for which translation as if from Standard Ukrainian was always second best, followed by it automatically detecting the source language, then translating as if from Belarusian, and then Polish, with Russian always coming in last place. Google Translate recognized Lemko as Ukrainian 76% of the time, as Russian 16% of the time, as Belarusian 6% of the time, and as sundry languages using Cyrillic alphabets (e.g. Mongolian) the rest of the time.
LemkoTran.com scored BLEU 17.95 when translating into English, a 23% improvement on last published results of BLEU 14.57, and 16% higher than Google Translate’s Ukrainian service’s score of BLEU 15.43.
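The percentages quoted here and in Section 4.1 are relative gains over a baseline score, which can be checked with a short calculation. The helper name `relative_gain` is only illustrative; the scores are those reported in this paper and in [15] and [16].

```python
def relative_gain(new_score: float, old_score: float) -> float:
    """Percentage by which new_score exceeds old_score."""
    return 100.0 * (new_score - old_score) / old_score


# Lemko to English: this work versus last published results [16].
print(round(relative_gain(17.95, 14.57)))  # 23
# Lemko to English: this work versus Google Translate's Ukrainian service.
print(round(relative_gain(17.95, 15.43)))  # 16
# English to Lemko: this work versus last published results [15].
print(round(relative_gain(8.48, 6.28)))    # 35
```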

The engine deployed at LemkoTran.com achieved a character n-gram F-score (chrF) of 45.89 when translating into English, which was 5% better than the score of Google Translate’s Ukrainian service.

LemkoTran.com scored a Translation Edit Rate (TER) of 70.38 translating into English, which was 7% better than the score of Google Translate’s Ukrainian service.

Output from the translation systems when fed Lemko is given below.
Description | Product | Quality Scores | |
Input transcription of Lemko spoken by a native speaker | Як розділяме языкы, то мала-м контакт з польскым, то не было так, же пішла-м до школы без польского, бо зме мали сусідів Поляків. | n/a | |
Transliteration | Jak rozdiljame jazŷkŷ, to mala-m kontakt z pol’skŷm, to ne bŷlo tak, že pišla-m do školŷ bez pol’skoho, bo zme maly susidiv Poljakiv. | n/a | |
Reference translation by a bilingual professional | When it comes to separating languages, I had contact with Polish. It wasn’t like I started school without knowing Polish because we had Polish neighbors. | BLEU 100 chrF2 100 TER 0 | |
Translation from Lemko by the system at LemkoTran.com | When we separate languages, I had contact with Polish, it wasn’t like I went to school without Polish, because we had Polish neighbors. | BLEU 45.84 chrF2 69.60 TER 32.00 | |
Google Translate (control) | from Ukrainian (autodetected with 92% confidence) | As we divide the languages, then I had contact with Polish, then it was not like that, and I went to school without Polish, because I had Poles as neighbors. | BLEU 15.87 chrF2 54.38 TER 72.00 |
from Belarusian | As we separate the languages, then I had little contact with Polish, then it was not like that, but I went to school without Polish, because we had few Polish neighbors. | BLEU 11.76 chrF2 58.92 TER 68.00 | |
from Russian | As we spread languages, then there was little contact with Polish, then it wasn’t like that, but I went to school without Polish, for the snakes were sucid in Polyakiv. | BLEU 6.87 chrF2 42.66 TER 92.00 | |
from Polish | As I spread the language, I have little contact with the Polish language, it wasn’t like that I went to school without Polish, because I will change my little Polish language. | BLEU 5.02 chrF2 45.35 TER 84.00 |
5 Conclusion
Coupling morphologically and syntactically informed generators to neural engines improved machine translation quality by 23% to 35% in this experiment, while also having the side benefit of empowering engineers to purge loanwords and counteract other dominant-language interference, as well as ensure compliance with standards, such as codifications of minority languages. Quality-score glass ceilings imposed by the imperfections inherent to artificial intelligence models can also be shattered through sound engineering. For Lemko, as well as fellow low-resource, Indigenous minority languages, the sky is now the limit for translation quality, with revitalization revolutions just over the horizon.
I would like to thank Dr. Ming Qian of Charles River Analytics for the inspiration to conduct this experiment, Michael Decerbo of Raytheon BBN Technologies and Dr. James Joshua Pennington for their insightful remarks, as well as Dr. Yves Scherrer of the University of Helsinki for his interest in the project and ideas.
