20. Universal Dependencies (UD)
Universal Dependencies (UD) is a framework for consistent annotation of grammar across different human languages. For Avar, we adhere to the UD v2 guidelines, utilizing a subset of Universal Part-of-Speech (UPOS) tags and dependency relations, supplemented with language-specific morphological features (FEATS).
20.1 UPOS Tag Inventory
Avar uses 11 core UPOS tags mapped directly from the language-specific POS (XPOS) labels defined in the dictionary database, plus PUNCT for punctuation, which has no XPOS counterpart:
- NOUN: Nouns (noun).
- VERB: Verbs, including non-finite forms like masdars and converbs (verb).
- ADJ: Adjectives (adj).
- ADV: Adverbs (adv).
- PRON: Pronouns (pron).
- NUM: Numerals (num).
- PART: Particles (part).
- ADP: Adpositions, primarily postpositions in Avar (post).
- CCONJ: Coordinating conjunctions (conj).
- INTJ: Interjections (interj).
- X: Unclassified tokens, phraseological units, or unknown tokens (phrase).
- PUNCT: Punctuation markers.
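The tag inventory above amounts to a lookup table. A minimal sketch follows, assuming the XPOS labels are stored as lowercase strings; the names XPOS_TO_UPOS and xpos_to_upos are hypothetical and not taken from the project code. Punctuation is assumed to be tagged PUNCT by the tokenizer before this lookup runs.

```python
from typing import Optional

# Hypothetical lookup table built from the tag list in this section;
# the real project may store this mapping elsewhere.
XPOS_TO_UPOS = {
    "noun": "NOUN",
    "verb": "VERB",
    "adj": "ADJ",
    "adv": "ADV",
    "pron": "PRON",
    "num": "NUM",
    "part": "PART",
    "post": "ADP",
    "conj": "CCONJ",
    "interj": "INTJ",
    "phrase": "X",
}


def xpos_to_upos(xpos: Optional[str]) -> str:
    """Return the UPOS tag for a dictionary XPOS label, falling back to X."""
    if xpos is None:
        return "X"
    return XPOS_TO_UPOS.get(xpos.strip().lower(), "X")
```

Falling back to X for unknown labels mirrors the role X plays in the list above (unclassified or unknown tokens).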
20.2 Dependency Relations (deprel)
The syntactic structure of Avar is predominantly head-final (SOV), so dependents typically precede their heads. The primary dependency relations assigned by the Avar Dependency Parser are:
- root: The root of the sentence, strictly assigned to the main finite VERB.
- nsubj (Nominal Subject): Assigned to Ergative arguments of transitive verbs, and Absolutive arguments of intransitive verbs.
- obj (Object): Assigned to Absolutive arguments of transitive verbs.
- iobj (Indirect Object): Mapped primarily to Dative case recipients of verbs like giving or saying, as well as experiencers of affective verbs (e.g., божизе - to believe/trust).
- nmod:poss (Possessive Nominal Modifier): Assigned to Genitive case modifiers indicating possession.
- obl:instr (Instrumental Oblique): Assigned to Ergative/Instrumental case markers indicating a tool or cause.
- obl:tmod (Temporal Oblique Modifier): Used specifically for the Temporal Ergative Rule, where the Ergative case denotes duration of time (e.g., къоялъ хӀалтӀизе "to work all day").
- obl:lmod (Locative Oblique Modifier): A language-specific extension grouping all 20 spatial cases (Locative, Allative, Ablative, Translative across 5 series) into a spatial adjunct relation.
- amod (Adjectival Modifier): Assigned to ADJ tokens modifying the nearest following NOUN.
- case (Case Marker/Adposition): Assigned to postpositions (ADP), linking them to the noun they govern.
- advmod / advcl: Assigned to adverbs (ADV) and converbs modifying the verb.
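The core-argument rules above (nsubj, obj, iobj, nmod:poss) can be sketched as a small function. This is an illustration only: the function name core_deprel and its parameters are hypothetical, the case labels are the standard UD values Erg, Abs, Dat, and Gen, and the sketch assumes the verb's transitivity is already known. It deliberately returns None for an Ergative dependent of an intransitive verb, since the text assigns such forms to obl:instr or obl:tmod by separate rules.

```python
from typing import Optional


def core_deprel(case: str, verb_is_transitive: bool) -> Optional[str]:
    """Map a nominal's case to a core relation, or None if no core rule applies.

    Hypothetical helper; it covers only the core-argument rules listed
    in this section and ignores the oblique and spatial relations.
    """
    if case == "Erg":
        # Ergative marks the subject of a transitive verb; other Ergative
        # uses (instrument, duration) are left to the oblique rules.
        return "nsubj" if verb_is_transitive else None
    if case == "Abs":
        # Absolutive is the object of a transitive verb, the subject otherwise.
        return "obj" if verb_is_transitive else "nsubj"
    if case == "Dat":
        return "iobj"
    if case == "Gen":
        return "nmod:poss"
    return None
```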
20.3 Language-Specific Extensions (FEATS)
Given Avar's complex agglutinative morphology, standard UD features are extended to accurately capture grammatical realities in the CoNLL-U FEATS column.
20.3.1 Class Agreement
Avar verbs, adjectives, and some pronouns must agree with the noun class of the Absolutive argument. We encode this using the Gender or language-specific Class feature:
- Class=I (Male rational)
- Class=II (Female rational)
- Class=III (Inanimate/Animals)
20.3.2 The Locative Series Matrix
Avar possesses a 5x4 locative case matrix (5 spatial orientations × 4 movement directions, totaling 20 spatial cases).
To correctly represent this in Universal Dependencies without violating global Case validators, the Avar treebank will adopt an Orthogonal Features approach. The movement direction is mapped to the standard Case feature, while the spatial orientation (Series 1-5) is mapped to a language-specific Local feature:
1. Movement Direction (Case):
- Case=Loc (Static position)
- Case=All (Motion towards)
- Case=Abl (Motion away)
- Case=Tra (Motion through)
2. Spatial Orientation (Local):
- Local=In (Series I: In/On)
- Local=Sub (Series II: Under)
- Local=Apud (Series III: At/Near)
- Local=Post (Series IV: Behind)
- Local=Inter (Series V: Among/In a hollow)
Example: ГамачӀитӀа (On the stone, Series I Locative) → Case=Loc|Local=In
(Note: While the NounCase database model stores the case_series, the current corpus_pos_tagger prototype temporarily collapses these into just the Case tag. Extracting the Local feature is a required Phase 3 upgrade).
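The orthogonal mapping described in this subsection can be sketched as two small tables, one per axis. The names DIRECTION_TO_CASE, SERIES_TO_LOCAL, and spatial_feats are hypothetical, and the sketch assumes the series is available as an integer 1-5 and the direction as a lowercase string; how case_series is actually stored in the NounCase model is not specified here.

```python
# Hypothetical tables mirroring the two feature axes listed in this section.
DIRECTION_TO_CASE = {
    "locative": "Loc",
    "allative": "All",
    "ablative": "Abl",
    "translative": "Tra",
}

SERIES_TO_LOCAL = {1: "In", 2: "Sub", 3: "Apud", 4: "Post", 5: "Inter"}


def spatial_feats(series: int, direction: str) -> str:
    """Build the FEATS fragment for one cell of the 5x4 spatial matrix."""
    case = DIRECTION_TO_CASE[direction]
    local = SERIES_TO_LOCAL[series]
    # "Case" sorts before "Local", so the fragment is already in the
    # alphabetical order that CoNLL-U expects for FEATS.
    return f"Case={case}|Local={local}"


# Enumerating both axes yields the 20 spatial cases.
ALL_SPATIAL_FEATS = [
    spatial_feats(series, direction)
    for series in SERIES_TO_LOCAL
    for direction in DIRECTION_TO_CASE
]
```

For instance, spatial_feats(3, "allative") produces Case=All|Local=Apud. Keeping the two axes in separate tables means that adding the Local feature later does not change the Case values the tagger already emits.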
20.3.3 Non-Finite Clauses
Avar relies heavily on non-finite clauses (masdars and converbs) rather than subordinate clauses with conjunctions.
- Masdars act as verbal nouns. They receive both verbal and nominal features (VerbForm=Vnoun) and can hold nominal Case markers.
- Converbs act as adverbial clauses. They are tagged VerbForm=Conv and take dependency relations like advcl (adverbial clause modifier).
20.3.4 Clitics
Avar makes heavy use of enclitics for conjunction (-ги), interrogation (-йи), and quotation (-ин). These are detected programmatically and recorded in the host word's FEATS column rather than being split off as separate syntactic tokens:
- Clitic=Add (Additive/Conjunctive: -ги)
- Clitic=Int (Interrogative: -йи / -иш)
- Clitic=Quot (Quotative: -ин / -ан)
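Recording a clitic in FEATS comes down to merging one more name=value pair into an existing feature string. The sketch below assumes the clitic has already been identified upstream (naive suffix matching would over-generate, since endings like -ан are common in ordinary word forms). The names CLITIC_VALUES, add_feature, and annotate_clitic are hypothetical.

```python
# Hypothetical mapping from surface clitic to the Clitic feature value,
# taken from the list in this section.
CLITIC_VALUES = {"ги": "Add", "йи": "Int", "иш": "Int", "ин": "Quot", "ан": "Quot"}


def add_feature(feats: str, name: str, value: str) -> str:
    """Merge name=value into a CoNLL-U FEATS string, keeping names sorted."""
    pairs = {}
    if feats not in ("", "_"):  # "_" is the CoNLL-U marker for "no features"
        for item in feats.split("|"):
            key, val = item.split("=", 1)
            pairs[key] = val
    pairs[name] = value
    # CoNLL-U expects features ordered alphabetically by name, ignoring case.
    ordered = sorted(pairs.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{k}={v}" for k, v in ordered)


def annotate_clitic(feats: str, clitic: str) -> str:
    """Add the Clitic feature for an already-detected clitic, if it is known."""
    value = CLITIC_VALUES.get(clitic.lstrip("-"))
    return add_feature(feats, "Clitic", value) if value else feats
```

For example, annotate_clitic("Case=Abs|Number=Sing", "-ги") yields Case=Abs|Clitic=Add|Number=Sing, and an unrecognized clitic leaves the FEATS string unchanged.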
20.4 Treebank Scope and Methodology (Roadmap)
The Avar UD treebank generation is currently a planned architectural upgrade (tracked as Phase 3 / C-2 in the project roadmap). While the rule-based dependency parser (AvarDependencyParser) has its logic and relation mappings defined, the pipeline is not yet fully integrated with the database.
Currently:
- Tokenization & Morphological Tagging: Operational. The corpus_pos_tagger successfully assigns UPOS, XPOS, and FEATS based on dictionary suffix rules and ML fallbacks.
- Dependency Parsing (Pending): The AvarDependencyParser exists as a standalone prototype. It correctly maps cases to syntactic roles (nsubj, obj, obl), but the parse_avar_sentence() function is currently a stub (tokens = []).
- Database Integration (Pending): The corpus database (e.g., CorpusToken) does not yet store syntactic edges. The planned implementation will add head_token and deprel fields to persist the parsed relations.
- Validation (Pending): Once the parser is fully integrated and writing to the database, the resulting output will be serialized into standard CoNLL-U format for global UD validation.
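To make the planned persistence and serialization steps concrete, here is a minimal sketch of a token record carrying a head reference and a deprel, and its rendering as one CoNLL-U line. The TokenRow class is a hypothetical stand-in, not the actual CorpusToken model: it stores the head as a 1-based integer index (0 for the root), whereas the planned head_token field may well be a database reference. The sample uses the fragment къоялъ хӀалтӀизе from section 20.2; the lemma of the first word is left as "_" rather than guessed, and the infinitive is treated as root only because this is a two-word fragment, not a full sentence.

```python
from dataclasses import dataclass


@dataclass
class TokenRow:
    """Hypothetical in-memory stand-in for a corpus token with a syntactic edge."""
    index: int   # 1-based position in the sentence (CoNLL-U ID)
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: str
    head: int    # index of the head token, 0 for the root
    deprel: str

    def to_conllu(self) -> str:
        """Render the ten tab-separated CoNLL-U columns for this token."""
        columns = [
            str(self.index),
            self.form,
            self.lemma or "_",
            self.upos,
            self.xpos or "_",
            self.feats or "_",
            str(self.head),
            self.deprel,
            "_",  # DEPS (enhanced dependencies), not produced in this sketch
            "_",  # MISC
        ]
        return "\t".join(columns)


# Fragment from section 20.2, used purely for illustration.
fragment = [
    TokenRow(1, "къоялъ", "_", "NOUN", "noun", "Case=Erg", 2, "obl:tmod"),
    TokenRow(2, "хӀалтӀизе", "хӀалтӀизе", "VERB", "verb", "_", 0, "root"),
]
conllu_block = "\n".join(token.to_conllu() for token in fragment) + "\n\n"
```

Each sentence in a CoNLL-U file ends with a blank line, which is why the block is terminated with two newlines.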