Exploring lexical diversity in Basque news: original vs. machine-translated texts [Euskarazko albisteetako aniztasun lexikoaren azterketa: jatorrizko testuak eta automatikoki itzulitakoak]

Keywords: machine translation, lexical diversity, Basque

Abstract

Owing to the rapid progress of machine translation (MT), our exposure to MT-generated texts has increased. Since this technology has the potential to transform language use, this study compares the lexical diversity of digital news machine-translated into Basque with that of news originally written in Basque. By combining established automatic metrics with detailed manual analysis, we examine several dimensions of lexical diversity. According to our findings, the two text types are markedly similar in lexical richness. Only in a few very specific respects do we find the lexical richness of the original texts to be slightly greater.
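The abstract does not say which automatic metrics were used. As an illustration only, the sketch below computes two widely used lexical diversity measures, the type-token ratio (TTR) and Yule's K, for a list of tokens. The choice of these two measures, the naive regex tokenizer, and all function names are assumptions made for this example, not the authors' method. A study on Basque would more plausibly work on lemmas produced by a morphological analyzer than on raw word forms.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer (assumption): lowercase and keep runs of word characters."""
    return re.findall(r"\w+", text.lower())


def type_token_ratio(tokens):
    """Distinct word forms divided by total tokens; 0.0 for empty input."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def yules_k(tokens):
    """Yule's K = 10^4 * (sum of squared type frequencies - N) / N^2.

    Lower values indicate a more diverse vocabulary; 0.0 for empty input.
    """
    n = len(tokens)
    if n == 0:
        return 0.0
    freqs = Counter(tokens)
    sum_sq = sum(f * f for f in freqs.values())
    return 1e4 * (sum_sq - n) / (n * n)


if __name__ == "__main__":
    # Hypothetical sample sentences, used only to show the calls.
    original = tokenize("Gaur eguraldi ona dago eta bihar ere eguzkia aterako da.")
    translated = tokenize("Gaur eguraldi ona dago eta bihar ere eguraldi ona egongo da.")
    for name, toks in (("original", original), ("translated", translated)):
        print(name, round(type_token_ratio(toks), 3), round(yules_k(toks), 1))
```

TTR is sensitive to text length, so comparisons between corpora are usually made on samples of equal size; Yule's K is less affected by length, which is one reason the two are often reported together.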

Published
2025-12-30