Data & Research Access

What data AvarLab produces, how researchers can access it, licensing terms, and how to cite the platform.

AvarLab is currently in active development as part of a doctoral research programme. Data deposits on public repositories (Zenodo, Hugging Face) are planned upon thesis completion. For early research access, please contact us directly.

What data AvarLab produces

Morphological paradigm database

On request

14,768 base lemmas expanded into 1,026,668 inflected forms, each annotated with part of speech, case, number, grammatical class, tense, and polarity. Includes full paradigm tables for nouns (up to 72 case forms), verbs (affirmative and negative synthetic forms, masdars, participles), adjectives, pronouns, numerals, adverbs, and postpositions.

Formats: JSON, CSV
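As a rough illustration of how the annotation dimensions listed above might be consumed, the sketch below models one inflected form as a record and filters a list of records by feature values. The class name `InflectedForm`, every field name, the tag values, and the placeholder strings are assumptions made for this example; the actual export schema is not documented here and may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class InflectedForm:
    """Hypothetical record for one inflected form (field names are assumed)."""
    lemma: str
    form: str
    pos: str
    case: Optional[str] = None
    number: Optional[str] = None
    gram_class: Optional[str] = None
    tense: Optional[str] = None
    polarity: Optional[str] = None


def select_forms(forms: Iterable[InflectedForm], **criteria: str) -> List[InflectedForm]:
    """Return the forms whose attributes equal every given criterion."""
    return [
        f for f in forms
        if all(getattr(f, name) == value for name, value in criteria.items())
    ]


# Placeholder data only: these are not real Avar forms or real tag values.
FORMS = [
    InflectedForm("LEMMA_A", "FORM_1", "NOUN", case="Erg", number="Sing"),
    InflectedForm("LEMMA_A", "FORM_2", "NOUN", case="Erg", number="Plur"),
    InflectedForm("LEMMA_B", "FORM_3", "VERB", tense="Past", polarity="Neg"),
]

plural_nouns = select_forms(FORMS, pos="NOUN", number="Plur")
print([f.form for f in plural_nouns])
```

A flat record per form keeps the JSON and CSV exports symmetrical: each CSV row or JSON object would carry the same set of optional features, with empty values where a feature does not apply to that part of speech.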

Silver-standard POS-tagged corpus

On request

296,228 monolingual Avar sentences, tagged automatically with a dictionary-driven tagger (65.8% POS coverage). Sources include AvarCorpora, Telegram news channels, Wikipedia, literature, and educational materials. Processing includes sentence segmentation, language-detection filtering, and quality control.

Formats: CoNLL-U, JSON, spaCy
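For readers who want to check a coverage figure on their own copy of the corpus, here is a minimal sketch of reading CoNLL-U with the Python standard library and computing the share of tokens that carry a POS tag. It assumes that coverage is measured per token and that untagged tokens have `_` in the UPOS column; both are assumptions, since the exact definition behind the 65.8% figure is not stated here. The sample tokens are placeholders, not real corpus data.

```python
def read_conllu(text):
    """Yield sentences from CoNLL-U text as lists of token dicts."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment or metadata
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip malformed rows, multiword ranges, empty nodes
        sentence.append(
            {"id": int(cols[0]), "form": cols[1], "lemma": cols[2], "upos": cols[3]}
        )
    if sentence:
        yield sentence


def pos_coverage(sentences):
    """Fraction of tokens whose UPOS column is filled (assumed definition)."""
    total = tagged = 0
    for sent in sentences:
        for tok in sent:
            total += 1
            if tok["upos"] != "_":
                tagged += 1
    return tagged / total if total else 0.0


def row(idx, form, lemma, upos):
    """Build one 10-column CoNLL-U row, leaving unused columns as '_'."""
    return "\t".join([str(idx), form, lemma, upos] + ["_"] * 6)


# Placeholder sample: two sentences, four tokens, one of them untagged.
SAMPLE = "\n".join([
    "# sent_id = 1",
    row(1, "w1", "l1", "NOUN"),
    row(2, "w2", "_", "_"),
    "",
    "# sent_id = 2",
    row(1, "w3", "l3", "VERB"),
    row(2, ".", ".", "PUNCT"),
    "",
])

sentences = list(read_conllu(SAMPLE))
print(f"{pos_coverage(sentences):.1%}")
```

Established CoNLL-U readers handle more of the format (enhanced dependencies, feature parsing); the hand-rolled parser above is only meant to make the coverage calculation explicit.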

Trilingual parallel corpus (Avar–Russian–English)

On request

18,680 aligned trilingual segments drawn from dictionary imports, academic texts, literature, folk texts, and user contributions. Supports cross-lingual research, machine translation groundwork, and semantic alignment.

Formats: JSON, CoNLL-U

Lexical database (dictionary entries)

Browsable online

29,197 total lexical units (14,768 lemmas + 14,429 multi-word expressions), with Russian and English glosses, IPA transcriptions, grammatical metadata, corpus attestation counts, and audio files where available.

Access: web interface, REST API

Export formats

AvarLab is designed from the ground up for interoperability with standard NLP toolchains:

  • CoNLL-U — Universal Dependencies format; compatible with spaCy, Stanza, UDPipe, and most dependency parsers. Optional train/validation/test splits.
  • JSON — Full structured exports of lexical entries, paradigm tables, and corpus sentences with all metadata fields.
  • spaCy-compatible — Directly importable as spaCy DocBin datasets for training custom NLP pipelines.
  • REST API — The Django REST Framework API layer provides programmatic access to lexical entries, morphological data, and corpus examples for external interfaces and bots.
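The optional train/validation/test splits mentioned for the CoNLL-U export can be reproduced on any list of sentences. The sketch below shows one common approach, a seeded shuffle followed by slicing. The 80/10/10 proportions, the seed value, and the function name `split_sentences` are assumptions for illustration and are not the platform's documented settings.

```python
import random


def split_sentences(sentences, percents=(80, 10, 10), seed=13):
    """Shuffle deterministically and slice into train/validation/test.

    `percents` are whole-number percentages that must sum to 100
    (assumed proportions, not AvarLab's documented ones).
    """
    if sum(percents) != 100:
        raise ValueError("percents must sum to 100")
    items = list(sentences)
    # A private Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(items)
    n_train = len(items) * percents[0] // 100
    n_val = len(items) * percents[1] // 100
    return {
        "train": items[:n_train],
        "validation": items[n_train:n_train + n_val],
        "test": items[n_train + n_val:],  # remainder goes to test
    }


# Integers stand in for parsed sentences.
splits = split_sentences(range(100))
print({name: len(part) for name, part in splits.items()})
```

Fixing the seed matters for comparability: two researchers who split the same export with the same seed and proportions get identical partitions, so reported scores refer to the same test sentences.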

Licensing

Software & architecture

The platform codebase, morphological rule scripts, and generator architecture.

MIT / Apache 2.0

Annotated text corpora

Annotated datasets, morphological paradigm exports, and aligned parallel corpora.

CC-BY 4.0

Community-contributed data (audio recordings, lexical suggestions) is subject to explicit contributor consent protocols and editorial moderation, in accordance with the ethical review guidelines of Universitat Pompeu Fabra (CIREP).

How to cite

If you use AvarLab data or the platform in your research, please cite:

Zagidov, K., & Brochhagen, T. (2025). AvarLab: An Integrated Digital Ecosystem for Avar. Universitat Pompeu Fabra, Barcelona.

BibTeX:

@misc{zagidov2025avarlab,
  title   = {AvarLab: An Integrated Digital Ecosystem for Avar},
  author  = {Zagidov, Kebed and Brochhagen, Thomas},
  year    = {2025},
  url     = {https://avardict.upf.edu},
  note    = {Universitat Pompeu Fabra, Barcelona}
}

Request access or collaborate

For early research access to datasets, collaboration proposals, or questions about the platform's data pipeline, contact the project lead directly.

kebed.zagidov@upf.edu
