About AvarLab
An integrated digital ecosystem for Avar that combines a trilingual dictionary, a morphological generator, and a growing corpus to preserve a linguistically rich endangered language.
What is AvarLab?
AvarLab is an integrated digital platform for Avar, a morphologically rich Northeast Caucasian language that is spoken by approximately 700,000 people and classified as vulnerable by UNESCO. The platform addresses a fundamental challenge in low-resource language technology: the circular dependency between linguistic resources and NLP tools. Without annotated corpora you cannot build NLP models, yet without NLP models you cannot efficiently build annotated corpora.
AvarLab breaks this deadlock with a generate–verify framework: morphological paradigms are generated from linguistic rules, verified against a growing corpus, and refined through community feedback. The result is a living resource that simultaneously serves as a dictionary for speakers, a corpus platform for researchers, and a training data source for computational linguists.
The generate–verify approach
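Standard NLP pipelines require large annotated corpora, which Avar lacks. AvarLab reverses this pipeline: it first generates morphological paradigms from linguistic rules, then verifies them against a growing corpus, and finally refines them through community feedback.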
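The generate, verify, refine loop can be illustrated with a minimal Python sketch. This is not AvarLab's actual code: the names `TOY_RULES`, `Candidate`, `generate`, `verify`, and `review_queue` are hypothetical, and the lemma and suffixes are placeholder strings, not real Avar morphology. It only shows the general shape of the idea, assuming rules can be modelled as simple suffixes and verification as a lookup in the corpus vocabulary.

```python
from dataclasses import dataclass

# Hypothetical suffix rules: placeholder strings, NOT real Avar morphology.
TOY_RULES = {"slot_a": "", "slot_b": "x", "slot_c": "y"}


@dataclass
class Candidate:
    lemma: str
    slot: str               # paradigm cell the form is meant to fill
    form: str               # rule-generated surface form
    attested: bool = False  # set to True once the form is found in the corpus


def generate(lemma, rules):
    """Generate step: apply every rule to the lemma to build a full paradigm."""
    return [Candidate(lemma, slot, lemma + suffix) for slot, suffix in rules.items()]


def verify(candidates, corpus_tokens):
    """Verify step: mark each generated form that occurs in the corpus."""
    vocabulary = set(corpus_tokens)
    for candidate in candidates:
        candidate.attested = candidate.form in vocabulary
    return candidates


def review_queue(candidates):
    """Refine step: unattested forms are the ones to hand to community review."""
    return [c for c in candidates if not c.attested]


# Toy corpus containing two of the three generated forms.
candidates = verify(generate("stem", TOY_RULES), ["stem", "stemx", "other"])
for c in candidates:
    print(c.slot, c.form, "attested" if c.attested else "needs review")
```

In this sketch, forms confirmed by the corpus could be promoted to the dictionary directly, while the output of `review_queue` is what speakers would be asked to confirm or correct. As the corpus grows, rerunning `verify` shrinks the queue without any new annotation effort.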
Research context
AvarLab is the core platform of a doctoral research project at Universitat Pompeu Fabra (UPF) in Barcelona, within the COLT (Computational Linguistics and Linguistic Theory) research group. The thesis — Natural Language Processing for Low-Resource Languages: A Comprehensive Study on Avar — evaluates the generate–verify framework as a sustainable, language-agnostic methodology for bootstrapping full-stack NLP infrastructure in contexts of extreme data sparsity.
Expansion & Collaboration
AvarLab is an open-source initiative. We are actively looking for researchers, linguists, and native speakers who want to collaborate on building dialectal versions of this platform or adapting its architecture for other Dagestani languages. If you share our mission of preserving and digitizing the linguistic diversity of the Caucasus, please get in touch.
Project team
Kebed Zagidov
PhD student · COLT, Universitat Pompeu Fabra, Barcelona
Native speaker of Avar and computational linguist. Lead developer and researcher of AvarLab: platform architecture, morphological models, corpus integration, and community tooling.
kebed.zagidov@upf.edu
Thomas Brochhagen
Tenure-track Professor & Ramón y Cajal Fellow · COLT, Universitat Pompeu Fabra, Barcelona
Doctoral supervisor. Researcher in language evolution, emergence, and computational linguistics.
Get in touch
Questions, feedback, data requests, or collaboration proposals — reach out by email.
kebed.zagidov@upf.edu