How AvarLab Works

The generate–verify framework: how AvarLab builds morphological coverage, annotates a corpus, and improves through community validation — without requiring a large manually annotated dataset to start.

The problem: a circular dependency

Building NLP tools for a language normally requires large, manually annotated corpora. But creating those corpora efficiently requires NLP tools. For a language like Avar — with approximately 700,000 speakers, no standardised digital keyboard, and only limited digitised texts — this circular dependency is not a minor obstacle. It is a structural barrier that has kept Avar almost entirely absent from modern NLP pipelines.

Standard data-driven approaches are also poorly suited to Avar's extreme morphological density. A single Avar noun can produce up to 72 distinct case forms, so even a relatively large corpus of roughly 300,000 sentences captures only a small fraction of the possible word forms. Frequency-based learning is therefore unreliable for rare but grammatically valid constructions.

The solution: generate first, verify after

AvarLab reverses the conventional pipeline. Instead of deriving morphological patterns from an annotated corpus, the system first generates complete morphological paradigms using rule-based models derived from descriptive grammars. These generated forms are then verified against empirical corpus evidence and refined through community feedback.

This approach transforms descriptive grammatical knowledge — the kind that linguists have already produced for Avar — into an active computational resource. Lexical entries, corpora, and linguistic models grow side by side through iterative interaction, rather than in sequence.

The five-step generate–verify loop

1. Generate

For each lexical entry in the dictionary, a dedicated morphological module generates the complete set of inflected forms. Each part of speech (noun, verb, adjective, adverb, pronoun, numeral, postposition) has its own generation module containing affixation rules, morphophonological transformations (vowel ablaut, consonant alternation, syncope, epenthesis), and exception lists for irregular forms. The system currently generates 1,026,668 inflected forms from 14,768 base lemmas.
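A minimal sketch of how one such generation module might be organised, assuming a plain suffix table plus an exception list. The lemmas, suffixes, and feature labels are placeholders rather than real Avar morphology, and generate_paradigm is a hypothetical name, not AvarLab's actual code. Morphophonological transformations are left out; they would sit between stem and suffix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Form:
    lemma: str
    surface: str
    features: tuple  # e.g. (("Case", "Erg"), ("Number", "Sing"))


# Placeholder affixation rules: feature bundle -> suffix (NOT real Avar suffixes).
NOUN_SUFFIXES = {
    (("Case", "Nom"), ("Number", "Sing")): "",
    (("Case", "Erg"), ("Number", "Sing")): "-ERG",
    (("Case", "Gen"), ("Number", "Sing")): "-GEN",
}

# Exception list: (lemma, feature bundle) -> irregular surface form.
EXCEPTIONS = {
    ("lemma2", (("Case", "Erg"), ("Number", "Sing"))): "irregular-ERG",
}


def generate_paradigm(lemma):
    """Return every form of `lemma`, preferring listed exceptions over rules."""
    forms = []
    for features, suffix in NOUN_SUFFIXES.items():
        surface = EXCEPTIONS.get((lemma, features), lemma + suffix)
        forms.append(Form(lemma, surface, features))
    return forms
```

Checking the exception list before applying a rule keeps irregular forms out of the rule table itself, so a rule can be patched without touching the exceptions.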

2. Annotate

Every generated form is automatically labelled with grammatical features: part of speech, case, number, grammatical class (I/II/III/Plural), tense, aspect, and polarity. For semantic features (animacy, abstractness, valency), the system uses cross-lingual alignment — mapping Russian morphological analyses via pymorphy3 and fastText semantic embeddings onto the Avar entry — since most existing resources provide Russian glosses. The result is a large structured morphological database linked directly to dictionary entries.
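The cross-lingual step can be pictured as a projection from Russian grammemes onto Avar-side semantic features. The sketch below assumes the Russian gloss has already been analysed and takes its grammemes as plain strings (anim, inan, tran, intr are OpenCorpora grammemes, the tag set pymorphy3 uses). The target feature names and the function project_features are illustrative assumptions, and the fastText similarity step is omitted.

```python
# OpenCorpora grammeme -> (feature name, value) on the Avar entry.
# The right-hand side labels are assumptions made for this sketch.
GRAMMEME_TO_FEATURE = {
    "anim": ("Animacy", "Anim"),
    "inan": ("Animacy", "Inan"),
    "tran": ("Valency", "Trans"),
    "intr": ("Valency", "Intrans"),
}


def project_features(russian_grammemes):
    """Keep only the grammemes that carry a semantic feature we project."""
    features = {}
    for grammeme in sorted(russian_grammemes):
        if grammeme in GRAMMEME_TO_FEATURE:
            name, value = GRAMMEME_TO_FEATURE[grammeme]
            features[name] = value
    return features
```

Grammemes without an entry in the table (part of speech, gender, and so on) are ignored here, since Russian grammatical categories do not map one-to-one onto Avar ones.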

3. Verify

The corpus verification module scans the growing monolingual and trilingual corpora (currently 296,228 sentences across 684 source documents) for attested occurrences of each generated form. An orthographic normalisation pipeline standardises all corpus tokens and generated forms to canonical Unicode before matching, resolving the palochka substitution problem (users write 1, I, or ! for Ӏ). Attestation counts distinguish three tiers: frequently occurring forms, rare but attested forms, and theoretically predicted but unattested forms.
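A rough sketch of the normalise-then-count idea, assuming the palochka look-alikes only need repair after the consonants that form palochka digraphs (гӀ, кӀ, тӀ, хӀ, цӀ, чӀ). The regular expressions, the function names, and the frequent_at cut-off are assumptions for illustration; the platform's real normalisation rules and thresholds are not stated in the text.

```python
import re
import unicodedata
from collections import Counter

PALOCHKA = "\u04C0"  # CYRILLIC LETTER PALOCHKA

# Look-alikes (1, I, l, |, small palochka) typed after a digraph consonant.
_LOOKALIKE = re.compile(r"(?<=[гктхцчГКТХЦЧ])[1Il|\u04CF]")
# "!" counts as a palochka only when another Cyrillic letter follows,
# so sentence-final exclamation marks are left alone.
_BANG = re.compile(r"(?<=[гктхцчГКТХЦЧ])!(?=[а-яёА-ЯЁ])")


def normalise(text):
    """Return NFC text with palochka substitutes mapped to U+04C0."""
    text = unicodedata.normalize("NFC", text)
    text = _LOOKALIKE.sub(PALOCHKA, text)
    return _BANG.sub(PALOCHKA, text)


def count_attestations(generated_forms, corpus_tokens):
    """Count corpus occurrences of each generated form after normalisation."""
    counts = Counter(normalise(token) for token in corpus_tokens)
    return {form: counts[normalise(form)] for form in generated_forms}


def attestation_tier(count, frequent_at=5):
    """Three tiers from the text; `frequent_at` is a made-up cut-off."""
    if count >= frequent_at:
        return "frequent"
    if count >= 1:
        return "rare"
    return "unattested"
```

Normalising both sides with the same function means a generated form and a corpus token match even when the writer typed a digit or a Latin letter in place of the palochka.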

4. Stabilise (lock)

Forms that meet the attestation threshold are locked — they are marked as stabilised and protected from accidental overwriting by future generator runs. Locking preserves forms that have been empirically grounded while still allowing the generator to be updated as the morphological rules are refined. This is what the Verified badge means on dictionary entries.
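One way to picture locking is as a guard applied whenever a generator run is merged into stored forms. The sketch below uses plain dictionaries in place of database rows. The record layout and the name apply_generator_run are hypothetical, and keeping locked forms that a later run no longer produces is a design assumption, not something the text states.

```python
def apply_generator_run(stored, regenerated):
    """Merge a fresh generator run into stored forms without touching locks.

    stored:      key -> {"surface": str, "locked": bool}
    regenerated: key -> surface string produced by the updated rules
    """
    merged = {}
    for key, surface in regenerated.items():
        current = stored.get(key)
        if current is not None and current["locked"]:
            merged[key] = current  # locked: keep the attested form as-is
        else:
            merged[key] = {"surface": surface, "locked": False}
    # Assumption: a locked form survives even if the new rules drop it.
    for key, record in stored.items():
        if record["locked"] and key not in merged:
            merged[key] = record
    return merged
```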

5. Community validation

Unattested or ambiguous forms — marked Unverified — are not discarded. They remain as candidates for validation by native speakers through the platform's participatory interfaces. Forms that receive ≥ 10 positive community votes are promoted to gold-standard human-validated annotations. Users can also submit new entries, suggest corrections, and upload audio pronunciations. This loop means the platform continuously converges toward greater accuracy as corpus coverage grows and community contributions accumulate.
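The promotion rule can be written down in a few lines. Only the threshold of 10 positive votes comes from the text; the status labels, the function name, and the choice to ignore negative votes are assumptions made for this sketch.

```python
GOLD_VOTE_THRESHOLD = 10  # positive votes needed, as stated in the text


def validation_status(positive_votes, corpus_verified=False):
    """Return a status label for a form (labels are illustrative)."""
    if positive_votes >= GOLD_VOTE_THRESHOLD:
        return "gold"
    if corpus_verified:
        return "verified"
    return "unverified"
```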

Dictionary-driven POS tagging

A key consequence of the generate–verify architecture is a reversed tagging pipeline. Instead of learning morphological patterns from annotated text, the system derives them directly from the morphological generator:

Dictionary → Morphological Generator → WordForm Database → Corpus Tagging

When a corpus token matches a generated form, it inherits that form's grammatical annotation automatically. This dictionary-driven tagging produces large-scale silver-standard datasets — currently achieving 65.8% POS tagging coverage across the corpus — without requiring thousands of hours of manual annotation. These datasets can be exported for training neural POS taggers and language models.
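In code, the lookup at the end of this pipeline amounts to a dictionary access per token. The sketch below assumes the WordForm database can be viewed as a mapping from surface form to a list of analyses, and that coverage means the share of tokens with at least one analysis; both are assumptions, as are the function names.

```python
def tag_tokens(tokens, wordform_db):
    """Attach every known analysis to each token (empty list if unknown)."""
    return [(token, wordform_db.get(token, [])) for token in tokens]


def tagging_coverage(tagged):
    """Fraction of tokens that matched at least one generated form."""
    if not tagged:
        return 0.0
    matched = sum(1 for _, analyses in tagged if analyses)
    return matched / len(tagged)
```

Returning a list of analyses rather than a single tag keeps syncretic forms ambiguous at this stage, so a later step can choose between them.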

To handle Avar's morphological syncretism (visually identical forms with different grammatical roles), the tagging pipeline applies contextual syntax rules: pattern-matching heuristics over POS-tagged sentences — such as identifying ergative–nominative–verb valency frames or adjacent genitive–noun pairs — that actively disambiguate syntactic roles for high-frequency constructions. This progressive formalisation of Avar's syntax lays the groundwork for a full Universal Dependencies (UD) dependency parser.
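A contextual rule of the adjacent genitive–noun kind might look like the following. This is a sketch under stated assumptions: each token carries a list of candidate analyses as dictionaries with a pos key and an optional case key, and the rule simply prefers a genitive reading when the next token can be a noun. The data layout and the function name are hypothetical.

```python
def prefer_genitive_before_noun(sentence):
    """Narrow ambiguous tokens to their genitive reading before a noun.

    sentence: list of (token, candidates) pairs, where candidates is a
    list of dicts such as {"pos": "NOUN", "case": "Gen"}.
    """
    resolved = []
    for i, (token, candidates) in enumerate(sentence):
        chosen = candidates
        if len(candidates) > 1 and i + 1 < len(sentence):
            next_candidates = sentence[i + 1][1]
            next_can_be_noun = any(c["pos"] == "NOUN" for c in next_candidates)
            genitive = [c for c in candidates if c.get("case") == "Gen"]
            if next_can_be_noun and genitive:
                chosen = genitive
        resolved.append((token, chosen))
    return resolved
```

Tokens that are unambiguous, or that have no genitive reading, pass through unchanged, so rules of this kind can be chained one after another.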

What Verified and Unverified mean

Verified

The entry's generated forms have been found in the corpus at or above the attestation threshold. The form is empirically grounded in real Avar texts and has been stabilised (locked) against accidental overwriting. This does not mean the entry is authoritative — it means it is confirmed as occurring in real usage.

Unverified

The generated forms have not yet been found in the corpus. This may reflect a rare construction, a gap in current corpus coverage, or algorithmic over-generation. Unverified entries are not wrong — they are computational hypotheses awaiting empirical grounding through corpus expansion or community validation.

The overall verification rate for generated forms is 7.4%. This reflects Avar's extreme morphological density rather than a system weakness: the generator produces many valid forms that simply do not appear in a 296,228-sentence corpus.
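The rate follows directly from the published counts, as this short check shows. The variable names are illustrative, and the forms-per-lemma figure is an average over all parts of speech, not only nouns.

```python
generated_forms = 1_026_668  # generated inflected forms
attested_forms = 76_295      # forms found in the corpus
base_lemmas = 14_768

verification_rate = attested_forms / generated_forms
forms_per_lemma = generated_forms / base_lemmas

print(f"{verification_rate:.1%}")  # 7.4%
print(f"{forms_per_lemma:.1f}")    # 69.5
```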

Current scale and coverage

  • Base lemmas: 14,768
  • Generated inflected forms: 1,026,668
  • Corpus-attested forms: 76,295
  • POS tagging coverage: 65.8%

Technical implementation

  • Backend: Django + PostgreSQL with a relational architecture centred on the Entry model; all paradigm forms linked via foreign keys for referential integrity.
  • Morphological rules: Forward-only procedural Python scripts manually engineered from descriptive grammars (Alekseev et al. 2012; Forker 2017; Khangereev 2011). Because the architecture is procedural rather than compiled, rules can be patched dynamically as new linguistic evidence emerges.
  • Search: PostgreSQL full-text indexing over both Cyrillic and normalised forms; fuzzy matching, lemma search, and keyword-in-context retrieval.
  • Cross-lingual annotation: pymorphy3 (Russian morphological analyser) and fastText semantic embeddings map Russian gloss features to Avar grammatical classes and constraints.
  • API: Django REST Framework serves both the web platform and external interfaces, with export support for JSON, CoNLL-U, and spaCy-compatible formats.
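As an illustration of the CoNLL-U export mentioned above, the sketch below serialises tagged tokens into the ten tab-separated CoNLL-U columns, leaving the syntactic columns unspecified. The function names and the input layout are assumptions; the platform's actual exporter is not described in the text.

```python
def to_conllu_line(index, form, lemma, upos, feats):
    """Build one CoNLL-U token line; unknown columns are written as "_"."""
    pairs = sorted(feats.items(), key=lambda item: item[0].lower())
    feats_str = "|".join(f"{name}={value}" for name, value in pairs) or "_"
    # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    columns = [str(index), form, lemma, upos, "_", feats_str, "_", "_", "_", "_"]
    return "\t".join(columns)


def to_conllu_sentence(tokens):
    """Serialise a list of token dicts; a blank line ends the sentence."""
    lines = [
        to_conllu_line(i, t["form"], t["lemma"], t["upos"], t.get("feats", {}))
        for i, t in enumerate(tokens, start=1)
    ]
    return "\n".join(lines) + "\n\n"
```

Because the silver-standard data carries no dependency annotation, HEAD and DEPREL stay as underscores until a parser or annotator fills them in.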
