Altaistics

On development of ontological linguistic database as a resource for language processors

Abstract

This article describes the development of an ontological linguistic database for Turkic languages that can serve as a resource for a range of linguistic processors used for text processing in Turkology. The work is relevant because, despite active research and development over the past 10-15 years, almost all Turkic languages (Turkish being the exception) remain low-resource languages: they lack linguistic resources that can be used in developing natural language processing software. Such resources include ontological databases of various types, such as WordNet, FrameNet, VerbNet, and RuTez, as well as combinations of these resources with electronic corpora. Ontological databases of this kind can be used in information and reference systems, in building syntactic, semantic, and semantic-syntactic analyzers, and in educational and scientific applications. We present an integral approach that combines frame-type and taxonomic-type ontological models within a structural-parametric model of the Turkic Morpheme. The integral model rests on the principles of multilingualism, multifunctionality, and pragmatic orientation. Multilingualism means that the model is universal across all languages of the Turkic group, and pragmatic orientation means a focus on the structural and functional features of agglutinative languages. In addition to the methods above, building the software involves technologies for designing complex databases, web programming, and client-server architecture. The resulting database can be used to generate context-free grammar rules for a semantic-syntactic analyzer that takes sentences in Turkic languages as input and produces structured data as output. Such an analyzer is applicable to the semantic-syntactic tagging of Turkic electronic corpora and to the development of semantic search software.
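
The idea of generating context-free grammar rules from morpheme records can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the authors' database: the `Morpheme` record, the category labels `N`, `PL`, and `LOC`, the fixed slot order, and the function names are all hypothetical. For brevity the sketch analyzes a single word form rather than a full sentence, and it ignores vowel harmony and consonant assimilation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    """Hypothetical database record: one morpheme with a few parameters."""
    surface: str   # written form of the morpheme
    category: str  # grammar symbol the morpheme belongs to
    gloss: str     # short English description


# Toy records (assumed for illustration only).
MORPHEMES = [
    Morpheme("kitap", "N", "book"),
    Morpheme("lar", "PL", "plural"),
    Morpheme("da", "LOC", "locative"),
]


def generate_rules(morphemes, root, slots):
    """Build a context-free grammar: a root followed by optional affix slots."""
    grammar = {}

    def add(lhs, rhs):
        grammar.setdefault(lhs, []).append(tuple(rhs))

    tails = [f"Tail{i}" for i in range(len(slots) + 1)]
    add("Word", [root, tails[0]])
    for i, slot in enumerate(slots):
        add(tails[i], [slot, tails[i + 1]])  # the slot is filled
        add(tails[i], [tails[i + 1]])        # the slot is skipped
    add(tails[-1], [])                       # end of the word
    for m in morphemes:
        add(m.category, [m.surface])         # lexical rule from a record
    return grammar


def derive(symbol, text, grammar):
    """Yield (tags, rest) for each way `symbol` can consume a prefix of `text`."""
    for rhs in grammar[symbol]:
        if len(rhs) == 1 and rhs[0] not in grammar:
            # Lexical rule: match the morpheme's surface form literally.
            surface = rhs[0]
            if text.startswith(surface):
                yield [(symbol, surface)], text[len(surface):]
            continue
        # Structural rule: expand the right-hand side left to right.
        states = [([], text)]
        for sym in rhs:
            states = [
                (tags + more, rest2)
                for tags, rest in states
                for more, rest2 in derive(sym, rest, grammar)
            ]
        yield from states


def analyze(word, grammar):
    """Return the first full segmentation of `word`, or None if there is none."""
    for tags, rest in derive("Word", word, grammar):
        if rest == "":
            return tags
    return None


grammar = generate_rules(MORPHEMES, "N", ["PL", "LOC"])
print(analyze("kitaplarda", grammar))
```

Under these assumptions, `analyze("kitaplarda", grammar)` segments the word form into a noun root, a plural affix, and a locative affix, which is one simple form of the structured output mentioned above. A form with the affixes in the wrong order, such as `kitapdalar`, is rejected because the generated rules fix the slot order.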

About the Authors

A. R. Gatiatullin
Tatarstan Academy of Sciences
Russian Federation

GATIATULLIN Ayrat Rafizovich – Candidate of Technical Sciences, Leading Researcher

Kazan



N. A. Prokopiev
Tatarstan Academy of Sciences
Russian Federation

PROKOPYEV Nikolay Arkadievich – Researcher

Kazan




For citation:

Gatiatullin A.R., Prokopiev N.A. On development of ontological linguistic database as a resource for language processors. Altaistics. 2021;1(1):77-88. (In Russ.)


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2782-6627 (Online)