


The base of a national corpus is not limited to meta descriptions and metadata of the entered texts; linguistic markup must also be developed. Linguistic markup is linguistic information assigned to each lexical unit in a text according to its orthographic, phonetic, lexical and grammatical features. Developing linguistic markup for the historical subcorpus, however, is a complex task that requires in-depth research, because most of the texts in the historical subcorpus are written in Arabic script, and medieval texts written in Arabic script have been transcribed in different ways. Compared with other subcorpora, the development of a historical subcorpus poses both theoretical and technical difficulties. The purpose of the article is therefore to consider linguistic markup for the texts of the historical subcorpus, which is being developed for the first time at the Akhmet Baitursynuly Institute of Linguistics. Tasks: to define linguistic, lexical and grammatical markup for transcribed texts; to take into account the experience of other countries in developing lexical and grammatical markup; to analyze texts transcribed from Arabic script into Cyrillic script; to identify the variability of transcribed words; to describe how the lexical and grammatical markup program operates.

The study uses descriptive, historical-comparative, linguotextological and linguostatistical methods. As a result of the study, the experience of developing the historical subcorpus of the Russian language was considered in designing the markup; transcriptions of texts written in Arabic script in different periods of the Middle Ages were analyzed; the lexical and grammatical markup for transcribed texts was determined; and the mechanisms of a lexical and grammatical search system for transcribed texts were described.

Practical significance. The lexical and grammatical markup developed for the transcribed texts of the historical subcorpus will be a useful linguistic tool for studying the evolution of individual lexical units.

About the Authors

A. Seitbekova
Akhmet Baitursynuly Institute of Linguistics


A. Fazyljanova
Akhmet Baitursynuly Institute of Linguistics


Ғ. Aiazbayev
Akhmet Baitursynuly Institute of Linguistics





For citations:

Seitbekova A., Fazyljanova A., Aiazbayev Ғ. LEXICAL AND GRAMMATICAL MARKUP OF TEXTS OF THE HISTORICAL SUBCORPUS. Tiltanym. 2023;(3):163-172. (In Kazakh)


This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 2411-6076 (Print)
ISSN 2709-135X (Online)