Matrix Similarity Analysis of Texts Written in Belarusian and Ukrainian

Authors

  • Artur Niewiarowski, Department of Computer Science, Faculty of Computer Science and Telecommunications, Cracow University of Technology, Kraków, Poland
  • Anna Plichta, Department of Computer Science, Faculty of Computer Science and Telecommunications, Cracow University of Technology, Kraków, Poland

Abstract

This publication presents the results of a study on the similarity of texts written in Belarusian and Ukrainian, using a matrix-based analysis method grounded in edit distance. A distinctive feature of this approach is that it uses no language-specific vocabulary rules, which highlights the algorithm’s linguistic universality in similarity analysis. The analyzed texts are excerpts from online encyclopedias, translated with AI-powered online translation services provided by well-known companies. The primary objective of the study is to determine whether texts written in these languages can be compared without prior translation into a common language. It also assesses whether a method that belongs neither to the large language model (LLM) family nor to the broader category of AI-based approaches can effectively compare languages within the same linguistic group. Finally, the study provides insight into the degree of similarity between Belarusian and Ukrainian and the extent to which speakers of one language might partially understand the other.
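To make the idea of a matrix-based comparison grounded in edit distance more concrete, the following is a minimal Python sketch. It is not the authors' implementation: the function names, the normalization by the longer word's length, and the aggregation of the matrix into a single score (mean of row maxima) are all assumptions chosen for illustration. The sketch uses no dictionaries or language-specific rules; it only compares character sequences.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def word_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; normalizing by the longer word is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def similarity_matrix(text_a: str, text_b: str) -> List[List[float]]:
    """Rows are words of text_a, columns are words of text_b."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    return [[word_similarity(wa, wb) for wb in words_b] for wa in words_a]


def text_similarity(text_a: str, text_b: str) -> float:
    """Hypothetical aggregate: mean of the best match found for each row."""
    matrix = similarity_matrix(text_a, text_b)
    if not matrix or not matrix[0]:
        return 0.0
    return sum(max(row) for row in matrix) / len(matrix)


if __name__ == "__main__":
    # Transliterated language names taken from the keyword list.
    matrix = similarity_matrix("bielaruskaja mova", "ukrainska mova")
    for row in matrix:
        print([round(value, 2) for value in row])
    print(round(text_similarity("bielaruskaja mova", "ukrainska mova"), 2))
```

Because the comparison works purely on characters, the same sketch applies unchanged to Cyrillic input, since Python strings are sequences of Unicode code points. A production method would likely need to handle punctuation, word order and text length, which this sketch ignores.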

Keywords:

text-mining, anti-plagiarism, text similarity analysis, Levenshtein edit distance, matrix-based text analysis, Belarusian language, Ukrainian language, East Slavic language group, Old Russian language, Indo-European language, Bielaruskaja mova, Ukrainska mova

Additional Online Resources



A1. A full excerpt from the text is available at https://antyplagius.n-dms.com/tests/Spanish-Romanian/Espania-Spanish-wikipedia-google-translate.txt
A2. A full excerpt from the text is available at https://antyplagius.n-dms.com/tests/Spanish-Romanian/Espania-Romanian-wikipedia-google-translate.txt
A3. N-DMS, Belarusian and Ukrainian – an analysis of similarities. Antyplagiat N-DMS Antyplagius [in Polish], YouTube, 06.01.2022, https://youtu.be/d6o3QAQDWPk
A4. N-DMS ANTYPLAGIUS, Belarus – Wikipedia, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/Belarus-Wikipedia.pdf
A5. Wikipedia, Belarus, https://be.wikipedia.org/wiki/%D0%91%D0%B5%D0%BB%D0%B0%D1%80%D1%83%D1%81%D1%8C
A6. N-DMS ANTYPLAGIUS, Belarusian-Belarus, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/Belarusian-Belarus.txt
A7. N-DMS ANTYPLAGIUS, Ukrainian-Belarus, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/Ukrainian-Belarus.txt
A8. N-DMS ANTYPLAGIUS, EN Fragment – Ukraine – Britannica Online Encyclopedia, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/EN%20FRAGMENT%20-%20Ukraine%20–%20Britannica%20Online%20Encyclopedia.txt
A9. N-DMS ANTYPLAGIUS, Ukraine – Britannica Online Encyclopedia, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/Ukraine%20–%20Britannica%20Online%20Encyclopedia.pdf
A10. Ukraine, Britannica, https://www.britannica.com/place/Ukraine
A11. N-DMS ANTYPLAGIUS, Chapter “Plant and animal life” in English from the encyclopaedia, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/CYR/ENG%20-%20Brit%20-%20Animals%20-%20google%20translate.txt
A12. N-DMS ANTYPLAGIUS, Belarusian, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/CYR/BY%20-%20Brit%20-%20Animals%20-%20google%20translate.txt
A13. N-DMS ANTYPLAGIUS, Ukrainian, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/CYR/UA%20-%20Brit%20-%20Animals%20-%20google%20translate.txt
A14. N-DMS ANTYPLAGIUS, Bulgarian, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/CYR/BUL%20-%20Brit%20-%20Animals%20-%20google%20translate.txt
A15. N-DMS ANTYPLAGIUS, Serbian, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/CYR/SR%20-%20Brit%20-%20Animals%20-%20google%20translate.txt
A16. N-DMS ANTYPLAGIUS, Macedonian, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/CYR/MK%20-%20Brit%20-%20Animals%20-%20google%20translate.txt
A17. N-DMS ANTYPLAGIUS, Kazakh, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/CYR/KZ%20-%20Brit%20-%20Animals%20-%20google%20translate.txt
A18. N-DMS, Antyplagius vs chatGPT-4 – Review of the film Troy [in Polish], YouTube, 10.04.2023, https://youtu.be/ ejk1xTPDDQ.
A19. N-DMS, Antyplagius vs chatGPT (part 1), YouTube, 27.04.2023, https://youtu.be/PxrVB9AwcR0
A20. N-DMS, Are the Spanish and Romanian languages similar to each other? Test using the Antyplagiat N-DMS Antyplagius [in Polish], YouTube, 22.01.2022, https://youtu.be/JhfdwbyIsFc