Data & Research Access
What data AvarLab produces, how researchers can access it, licensing terms, and how to cite the platform.
What data AvarLab produces
Morphological paradigm database
On request
14,768 base lemmas expanded into 1,026,668 inflected forms, each annotated with part of speech, case, number, grammatical class, tense, and polarity. Includes full paradigm tables for nouns (up to 72 case forms), verbs (affirmative and negative synthetic forms, masdars, participles), adjectives, pronouns, numerals, adverbs, and postpositions.
Silver-standard POS-tagged corpus
On request
296,228 monolingual Avar sentences automatically tagged using dictionary-driven tagging (65.8% POS coverage). Sources include AvarCorpora, Telegram news channels, Wikipedia, literature, and educational materials. Includes sentence segmentation, language detection filtering, and quality control.
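As a rough illustration of what dictionary-driven tagging and a coverage figure can mean, the sketch below looks each token up in a form-to-POS dictionary and reports the share of tokens that received a tag. It assumes coverage is measured per token, which the description above does not state. The function names, the lexicon, and the tokens are hypothetical placeholders, not AvarLab code or data.

```python
from typing import Dict, List, Optional, Tuple


def tag_tokens(
    tokens: List[str], form_to_pos: Dict[str, str]
) -> List[Tuple[str, Optional[str]]]:
    """Look each token up in a form-to-POS dictionary; unknown forms get None."""
    return [(tok, form_to_pos.get(tok.lower())) for tok in tokens]


def pos_coverage(tagged: List[Tuple[str, Optional[str]]]) -> float:
    """Percentage of tokens that received a tag (token-level coverage, an assumption)."""
    if not tagged:
        return 0.0
    hits = sum(1 for _, pos in tagged if pos is not None)
    return 100.0 * hits / len(tagged)


# Hypothetical lexicon and sentence: placeholders, not real AvarLab data.
lexicon = {"form_a": "NOUN", "form_b": "VERB"}
sentence = ["form_a", "form_x", "form_b", "form_y"]

tagged = tag_tokens(sentence, lexicon)
print(pos_coverage(tagged))  # 2 of 4 tokens are in the lexicon -> 50.0
```

Tokens missing from the dictionary stay untagged rather than being guessed, which is why a corpus tagged this way is better described as silver-standard than gold-standard.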
Trilingual parallel corpus (Avar–Russian–English)
On request
18,680 aligned trilingual segments drawn from dictionary imports, academic texts, literature, folk texts, and user contributions. Supports cross-lingual research, machine translation groundwork, and semantic alignment.
Lexical database (dictionary entries)
Browsable online
29,197 total lexical units (14,768 lemmas + 14,429 multi-word expressions), with Russian and English glosses, IPA transcriptions, grammatical metadata, corpus attestation counts, and audio files where available.
Export formats
AvarLab is designed from the ground up for interoperability with standard NLP toolchains:
- CoNLL-U — Universal Dependencies format; compatible with spaCy, Stanza, UDPipe, and most dependency parsers. Optional train/validation/test splits.
- JSON — Full structured exports of lexical entries, paradigm tables, and corpus sentences with all metadata fields.
- spaCy-compatible — Directly importable as spaCy DocBin datasets for training custom NLP pipelines.
- REST API — The Django REST Framework API layer provides programmatic access to lexical entries, morphological data, and corpus examples for external interfaces and bots.
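For readers who have not worked with CoNLL-U before, the following is a minimal sketch of reading such a file using only the Python standard library. The ten column names follow the Universal Dependencies CoNLL-U specification. The sample sentence uses placeholder forms and illustrative annotations, and which columns an actual AvarLab export fills in is an assumption here. `parse_conllu` and `parse_feats` are hypothetical helper names, not part of any AvarLab tooling.

```python
from typing import Dict, List

# The ten standard CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():          # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comment or metadata
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        current.append(dict(zip(FIELDS, cols)))
    if current:                       # file may lack a trailing blank line
        sentences.append(current)
    return sentences


def parse_feats(feats: str) -> Dict[str, str]:
    """Turn 'Case=Erg|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Placeholder sentence: FORM*/LEMMA* are stand-ins, not real Avar data.
SAMPLE = "\n".join([
    "# sent_id = example-1",
    "\t".join(["1", "FORM1", "LEMMA1", "NOUN", "_",
               "Case=Erg|Number=Sing", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "FORM2", "LEMMA2", "VERB", "_",
               "Tense=Past", "0", "root", "_", "_"]),
    "",
])

sents = parse_conllu(SAMPLE)
first = sents[0][0]
print(first["upos"], parse_feats(first["feats"]))
```

For larger files or multiword-token handling, an established CoNLL-U reader is likely a better choice than this hand-rolled parser.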
Licensing
Software & architecture
The platform codebase, morphological rule scripts, and generator architecture.
MIT / Apache 2.0
Annotated text corpora
Annotated datasets, morphological paradigm exports, and aligned parallel corpora.
CC-BY 4.0
Community-contributed data (audio recordings, lexical suggestions) is subject to explicit contributor consent protocols and editorial moderation, in accordance with the ethical review guidelines of Universitat Pompeu Fabra (CIREP).
How to cite
If you use AvarLab data or the platform in your research, please cite:
BibTeX:
@misc{zagidov2025avarlab,
title = {AvarLab: An Integrated Digital Ecosystem for Avar},
author = {Zagidov, Kebed and Brochhagen, Thomas},
year = {2025},
url = {https://avardict.upf.edu}
}
Request access or collaborate
For early research access to datasets, collaboration proposals, or questions about the platform's data pipeline, contact the project lead directly.
kebed.zagidov@upf.edu