
PARALLEL CORPUS OF THE KAZAKH AND RUSSIAN LANGUAGES: DEVELOPMENT, OPERATION AND PROBLEMS

https://doi.org/10.55491/2411-6076-2023-2-49-61

Abstract

This paper gives a brief overview of the history of linguistic corpora and describes their classification by various criteria, as well as the types of parallel subcorpora. The original Kazakh text of M. Auezov's epic novel «Abai Zholy» and its Russian translation by A. Kim were manually aligned at the paragraph (sentence) level in a parallel subcorpus that is being developed as part of the national corpus of the Kazakh language.
The parallel subcorpus was developed with Microsoft Office Excel, Notepad++, Python, Django and MySQL. Its software architecture and order of operation can be summarized as follows: 1) texts in the two languages were collected in Excel and aligned manually at the paragraph (sentence) level; 2) the aligned texts were loaded directly from the Excel file into the MySQL database management system; 3) the loaded texts were sorted with the Notepad++ text editor and their statistics were obtained; 4) the Django web framework was used to publish the sorted texts on the Internet and serve user requests; 5) the Processing.py program, written in Python and equipped with a search function, was used to connect the Django web application to the MySQL database; 6) the software architecture of the parallel subcorpus follows the client-server and MVC (Model-View-Controller) patterns.
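The load-and-search flow in steps 2 and 5 can be sketched roughly as follows. This is an illustration only, not the authors' Processing.py: it uses Python's built-in sqlite3 module in place of MySQL so that the example is self-contained, and the table name `aligned_pair`, its columns, and the functions `create_db`, `load_pairs` and `search` are hypothetical.

```python
import sqlite3


def create_db():
    """Create an in-memory database with one table of aligned units.

    Assumption: sqlite3 stands in for MySQL; the schema is hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE aligned_pair ("
        "id INTEGER PRIMARY KEY, "
        "kk_text TEXT NOT NULL, "
        "ru_text TEXT NOT NULL)"
    )
    return conn


def load_pairs(conn, pairs):
    """Insert (kazakh_text, russian_text) tuples in their aligned order."""
    conn.executemany(
        "INSERT INTO aligned_pair (kk_text, ru_text) VALUES (?, ?)", pairs
    )
    conn.commit()


def search(conn, query, lang="kk"):
    """Return aligned pairs whose text in the chosen language contains query.

    The column name comes from a fixed whitelist, never from user input.
    SQLite's LIKE ignores case only for ASCII letters, so Cyrillic queries
    match case-sensitively here. LIKE wildcards in the query are not escaped.
    """
    column = {"kk": "kk_text", "ru": "ru_text"}[lang]
    rows = conn.execute(
        f"SELECT kk_text, ru_text FROM aligned_pair "
        f"WHERE {column} LIKE ? ORDER BY id",
        ("%" + query + "%",),
    )
    return rows.fetchall()
```

Because each row stores both sides of an aligned unit, a hit in either language returns its counterpart in the other language without a join.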
The parallel subcorpus consists of a database of aligned texts, markup, metamarkup and a search engine. The metamarkup, that is, the information recorded about each text entered into the subcorpus, includes the following parameters: author, translator, title of the work, title of the translation, publication date of the work, translation period, original language and translation language. The search engine lets users search by word, phrase, sentence and capital letters (in Kazakh and Russian). The paper describes the Kazakh and Russian interface of the parallel subcorpus and the results page shown after a search by one of these parameters. It also reports the total and unique word counts for the text in both languages, the number of sentences, and the counts and percentages of the ten most frequently used words in each language.
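Word statistics of the kind reported here (total words, distinct words, and the most frequent words with counts and percentages) could be computed along these lines. The abstract does not describe the authors' tokenization, so this sketch makes its own assumptions: tokens are runs of Unicode word characters, case is folded, and the function name `word_stats` is hypothetical.

```python
import re
from collections import Counter

# Assumption: a token is any run of Unicode word characters, so hyphenated
# words are split into their parts.
TOKEN_RE = re.compile(r"\w+")


def word_stats(text, top_n=10):
    """Return total and distinct word counts plus the top_n frequent words.

    Each entry of "top" is (word, count, percentage of all tokens).
    """
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    counts = Counter(tokens)
    total = len(tokens)
    top = [
        (word, count, round(100.0 * count / total, 2))
        for word, count in counts.most_common(top_n)
    ]
    return {"total": total, "unique": len(counts), "top": top}
```

Running the function once on the Kazakh text and once on the Russian text would give the two sets of figures side by side.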
In addition, aligning the original Kazakh text of the epic novel with its Russian translation at the paragraph (sentence) level revealed three kinds of correspondence between aligned units: 1) structural correspondence, where the paragraphs (sentences) contain approximately the same number of words; 2) correspondence in content, where the meaning approximately coincides; 3) no correspondence in either structure or content, where some paragraphs (sentences) of the Kazakh original are translated into Russian incorrectly, superficially or in abridged form, so that only their approximate meaning is conveyed.
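The first observation, that aligned units are roughly equal in word count, suggests a simple check for pairs whose lengths diverge sharply and may therefore deserve a closer look. The authors aligned the texts manually and do not describe any such automated check; the sketch below is an assumption-laden illustration in which the function names and the 0.5 threshold are hypothetical.

```python
import re


def word_count(text):
    """Count runs of Unicode word characters (assumed tokenization)."""
    return len(re.findall(r"\w+", text))


def length_ratio(kk_text, ru_text):
    """Ratio of the shorter word count to the longer one, in [0, 1]."""
    a, b = word_count(kk_text), word_count(ru_text)
    if max(a, b) == 0:
        return 1.0  # two empty units are treated as equal in length
    return min(a, b) / max(a, b)


def flag_divergent(pairs, threshold=0.5):
    """Return indices of aligned pairs whose length ratio is below threshold.

    The threshold is an arbitrary illustrative value, not a figure
    taken from the paper.
    """
    return [
        i for i, (kk, ru) in enumerate(pairs)
        if length_ratio(kk, ru) < threshold
    ]
```

A low ratio only signals a difference in length; whether the translation is actually abridged or inaccurate still has to be judged by a reader.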

About the Authors

N. M. Ashimbaeva
A. Baitursynuly Institute of Linguistics
Almaty, Kazakhstan

A. Z. Bisengali
A. Baitursynuly Institute of Linguistics
Almaty, Kazakhstan

S. K. Kulmanov
A. Baitursynuly Institute of Linguistics
Almaty, Kazakhstan

G. M. Ayazbaev
A. Baitursynuly Institute of Linguistics
Almaty, Kazakhstan

M. Nurlan
A. Baitursynuly Institute of Linguistics
Almaty, Kazakhstan



For citations:


Ashimbaeva N.M., Bisengali A.Z., Kulmanov S.K., Ayazbaev G.M., Nurlan M. PARALLEL CORPUS OF THE KAZAKH AND RUSSIAN LANGUAGES: DEVELOPMENT, OPERATION AND PROBLEMS. Tiltanym. 2023;(2):49-61. (In Kazakh) https://doi.org/10.55491/2411-6076-2023-2-49-61



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2411-6076 (Print)
ISSN 2709-135X (Online)