About AvarLab

An integrated digital ecosystem for Avar that combines a trilingual dictionary, a morphological generator, and a growing corpus to preserve a linguistically rich endangered language.


What is AvarLab?

AvarLab is an integrated digital platform for Avar, a morphologically rich Northeast Caucasian language that is spoken by approximately 700,000 people and classified as vulnerable by UNESCO. The platform addresses a fundamental challenge in low-resource language technology: the circular dependency between linguistic resources and NLP tools. Without annotated corpora you cannot build NLP models, yet without NLP models you cannot efficiently build annotated corpora.

AvarLab breaks this deadlock with a generate–verify framework: morphological paradigms are generated from linguistic rules, verified against a growing corpus, and refined through community feedback. The result is a living resource that simultaneously serves as a dictionary for speakers, a corpus platform for researchers, and a training data source for computational linguists.

Platform at a glance

15,582 lexical entries

1,000,000+ generated inflected forms

298,794 corpus sentences

18,680 trilingual segments

The generate–verify approach

Standard NLP pipelines require large annotated corpora, which Avar lacks. AvarLab reverses this pipeline:

1. Generate: A rule-based engine produces full inflectional paradigms for every lexical entry.

2. Annotate: Forms are automatically labeled for POS, case, grammatical class, and tense.

3. Verify: Forms are cross-checked against the corpus; attested ones are stabilized.

4. Community: Unattested forms are open to validation by native speakers.
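The four steps above can be illustrated with a minimal Python sketch. It is not AvarLab's actual code: the rule table, the `Form` class, the status names, and the function names are all hypothetical, and the stems and suffixes are invented placeholders rather than real Avar morphology. It only shows the general shape of a generate, annotate, verify, and review loop.

```python
from dataclasses import dataclass

# Hypothetical toy rule table. The suffixes "-x" and "-y" are placeholders,
# NOT real Avar morphology; the feature labels are likewise illustrative.
TOY_RULES = {
    "noun": [
        ("", {"case": "absolutive"}),
        ("-x", {"case": "ergative"}),
        ("-y", {"case": "genitive"}),
    ],
}

@dataclass
class Form:
    surface: str           # generated word form
    lemma: str             # dictionary entry it was derived from
    pos: str               # part of speech
    features: dict         # grammatical labels attached at generation time
    status: str = "generated"

def generate(lexicon):
    """Steps 1-2: expand every entry into a labeled paradigm."""
    forms = []
    for lemma, pos in lexicon:
        for suffix, feats in TOY_RULES.get(pos, []):
            forms.append(Form(lemma + suffix, lemma, pos, dict(feats)))
    return forms

def verify(forms, corpus_sentences):
    """Steps 3-4: mark attested forms, queue the rest for speaker review."""
    attested = {tok for sent in corpus_sentences for tok in sent.split()}
    review_queue = []
    for form in forms:
        if form.surface in attested:
            form.status = "attested"
        else:
            form.status = "needs_review"
            review_queue.append(form)
    return review_queue

# Placeholder data: one invented lemma and one invented corpus sentence.
lexicon = [("stem", "noun")]
corpus = ["stem stem-x other"]

forms = generate(lexicon)
review_queue = verify(forms, corpus)
for form in forms:
    print(form.surface, form.features, form.status)
```

In this toy run, the two forms found in the corpus are marked as attested, while the third lands in the review queue, which stands in for the community validation step.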

Full technical description of the methodology →

Research context

AvarLab is the core platform of a doctoral research project at Universitat Pompeu Fabra (UPF) in Barcelona, within the COLT (Computational Linguistics and Linguistic Theory) research group. The thesis, "Natural Language Processing for Low-Resource Languages: A Comprehensive Study on Avar", evaluates the generate–verify framework as a sustainable, language-agnostic methodology for bootstrapping full-stack NLP infrastructure in contexts of extreme data sparsity.

Expansion & Collaboration

AvarLab is an open-source initiative. We are actively looking for researchers, linguists, and native speakers who want to collaborate on building dialectal versions of this platform or adapting its architecture for other Dagestani languages. If you share our mission of preserving and digitizing the linguistic diversity of the Caucasus, please get in touch.

Project team

Kebed Zagidov

PhD student · COLT, Universitat Pompeu Fabra, Barcelona

Native speaker of Avar and computational linguist. Lead developer and researcher of AvarLab: platform architecture, morphological models, corpus integration, and community tooling.

kebed.zagidov@upf.edu

Thomas Brochhagen

Tenure-track Professor & Ramón y Cajal Fellow · COLT, Universitat Pompeu Fabra, Barcelona

Doctoral supervisor. Researcher in language evolution, emergence, and computational linguistics.

thomas.brochhagen@upf.edu

Get in touch

For questions, feedback, data requests, or collaboration proposals, reach out by email.