Avar IME (Windows)
Linguistic background and usage instructions for the Avar Keyboard layout.
Downloads & Manuals
Download the installer and the user manuals to get started with typing Avar on your Windows computer.
Why This Project Exists
Avar (авар мацӀ) uses a Cyrillic alphabet with additional characters unique to the language that are not supported by standard Windows keyboard layouts:
- Palochka (Ӏ) — A vertical bar character (U+04C0) used in many Caucasian languages.
- Digraphs — Two-character combinations like гъ, гь, гӀ, къ, кь, кӀ, хъ, хь, хӀ, лӀ, тӀ, цӀ, чӀ.
While standard Russian keyboard layouts provide the base Cyrillic letters, there is no easy or standard way to type the Palochka or the many digraph combinations essential to Avar writing.
Previous Approaches & Limitations
Before developing this dedicated Input Method Editor (IME), several alternative approaches were evaluated but ultimately fell short:
- AutoHotkey: Struggled with deep integration into the Windows input system and behaved inconsistently across different applications.
- Custom Keyboard Layouts (MSKLC): Microsoft Keyboard Layout Creator is limited to 1-to-1 character mappings. It cannot output multi-character digraphs natively nor display candidate windows for disambiguation.
- Third-party Frameworks: Alternative cross-platform IME engines were either too complex to deploy, poorly documented, or inflexible for Avar's specific phonetic mapping needs.
Design Rationale: Why This Input Model?
Three deliberate decisions shaped the user experience and internal mechanics of the Avar IME:
- Candidate-Window Selection vs. Key Combinations: Avar has over 14 digraph variants plus geminates. A "one Alt+key combination per digraph" approach would force users to memorize dozens of unique shortcuts. Candidate-window selection has a near-zero learning curve: users simply look at the popup and press a number. It also integrates with the predictive text engine.
- Reusing the Russian ЙЦУКЕН Layout: Most Avar speakers type fluently in Russian. Forcing a completely new Avar-specific layout would destroy established muscle memory for no linguistic benefit. The IME maps directly over the standard Russian layout and only invokes the candidate window for consonants with Avar-specific variants. Switching between Russian and Avar requires no finger re-orientation.
- Phonetic Mapping via Virtual Keys (VKs): The engine reads Virtual Key Codes, which identify the key that was pressed, rather than interpreted text characters. If the system relied on text characters, users with their Windows OS set to 'English' would output Latin letters instead of Cyrillic. Mapping via VKs keeps the IME's output independent of the user's base OS language settings.
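The VK-based mapping described above can be sketched as a pair of lookup tables. This is a minimal illustration, not the IME's actual code: the function names, the partial key table, and the order of the candidates are all assumptions. The only fixed facts it relies on are that Windows letter keys have virtual-key codes equal to their uppercase ASCII values and that the Russian ЙЦУКЕН layout places к, г, т, and а on the R, U, N, and F keys.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative sketch only: tables are partial and all names are hypothetical.
// Letter keys have virtual-key codes equal to their uppercase ASCII value
// (for example, the R key is 0x52, which is 'R').
using VirtualKey = unsigned int;

// Key -> Russian ЙЦУКЕН letter (only a handful of keys shown).
char32_t cyrillicForKey(VirtualKey vk) {
    static const std::unordered_map<VirtualKey, char32_t> layout = {
        {'R', U'к'}, {'U', U'г'}, {'N', U'т'}, {'F', U'а'},
    };
    const auto it = layout.find(vk);
    return it == layout.end() ? U'\0' : it->second;
}

// Returns the base letter followed by its Avar variants, if it has any.
// An empty result means the key is not handled by this sketch.
std::vector<std::u32string> candidatesForKey(VirtualKey vk) {
    // \u04C0 is the palochka; the candidate order is an assumption.
    static const std::unordered_map<char32_t, std::vector<std::u32string>> variants = {
        {U'к', {U"къ", U"кь", U"к\u04C0"}},
        {U'г', {U"гъ", U"гь", U"г\u04C0"}},
        {U'т', {U"т\u04C0"}},
    };
    const char32_t base = cyrillicForKey(vk);
    if (base == U'\0') return {};
    std::vector<std::u32string> out{std::u32string(1, base)};
    const auto it = variants.find(base);
    if (it != variants.end()) {
        out.insert(out.end(), it->second.begin(), it->second.end());
    }
    return out;
}

// Usage: the R key offers four choices, the F key needs no popup.
void demoCandidates() {
    const auto k = candidatesForKey('R');  // к, къ, кь, к + palochka
    assert(k.size() == 4 && k[1] == U"къ");
    assert(candidatesForKey('F').size() == 1);
}
```

Because the lookup starts from the key code, the same key press yields the same Cyrillic candidates whether the OS would otherwise have produced a Latin or a Cyrillic character.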
How It Works (Architecture)
The Avar IME is built directly on Microsoft's Text Services Framework (TSF). TSF is the official Windows API for input methods. It provides deep integration, allowing the IME to insert text properly across all TSF-aware applications (Notepad, Word, Edge, Chrome, etc.) and handle the lifecycle of candidate windows natively.
Under the Hood
Developed in C++, the core architecture consists of several specialized components:
- Event Handlers: Intercept keyboard events (ITfKeyEventSink) before they reach the active application.
- Phonetic Rules Engine: Translates hardware keys (Virtual Key Codes) to Latin letters, maps them to the Russian ЙЦУКЕН layout, and generates the list of Avar variants.
- Composition Manager: Manages the state of text being typed (ITfCompositionSink). Crucially, text is not committed to the document until the user makes a firm selection, preventing ghost characters and text replacement bugs.
- Win32 Candidate UI: A custom overlay window that displays numbered choices next to the cursor when typing ambiguous letters.
The Typing Flow
- You type on a standard QWERTY keyboard.
- Keys are mapped to Russian Cyrillic.
- For letters with Avar variants, the IME intercepts the character and displays the candidate window.
- You press a number key (1-9) to select the desired variant, which is then committed to the document.
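The flow above can be modeled as a small state machine that holds candidates until a number key is pressed. The sketch below is a simplified stand-in and does not use the real TSF interfaces; the class name, its methods, and the choice to commit unambiguous letters immediately are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of the typing flow; not the IME's actual composition code.
class CompositionSketch {
public:
    // Called after a key has been mapped to one or more candidates.
    // A single candidate is committed at once; several open the candidate window.
    void onCandidates(std::vector<std::u32string> candidates) {
        if (candidates.size() == 1) {
            document_ += candidates[0];
            return;
        }
        pending_ = std::move(candidates);
    }

    // Called for number keys 1-9. Returns true if the key selected a candidate.
    bool onDigit(int digit) {
        if (digit < 1 || static_cast<std::size_t>(digit) > pending_.size()) {
            return false;  // no open window, or no candidate with that number
        }
        document_ += pending_[static_cast<std::size_t>(digit) - 1];
        pending_.clear();  // selection closes the window
        return true;
    }

    bool candidateWindowOpen() const { return !pending_.empty(); }
    const std::u32string& document() const { return document_; }

private:
    std::vector<std::u32string> pending_;  // shown in the candidate window
    std::u32string document_;              // text already committed
};

// Usage: nothing reaches the document until a number is pressed.
void demoFlow() {
    CompositionSketch ime;
    ime.onCandidates({U"к", U"къ", U"кь"});
    assert(ime.candidateWindowOpen() && ime.document().empty());
    assert(ime.onDigit(2));
    assert(ime.document() == U"къ");
}
```

Keeping the pending candidates separate from the committed text mirrors the idea that nothing is written to the document until the user has made a selection.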
Linguistic Model & Prediction Engine
To provide highly accurate typing suggestions, the IME models Avar's dense phonemic and morphological structure using a large-scale dictionary and n-gram language model.
Phonology and the Palochka Problem
Avar has a dense phonemic inventory (~45 consonants), including ejective stops, pharyngealized fricatives, and geminates. Most complex consonants are written as digraphs (e.g., гъ, хӀ, кӀ).
The palochka character (Ӏ, U+04C0) appears in roughly one-third of all Avar wordforms. Because users often type a Latin 'I' or the lowercase Cyrillic palochka 'ӏ' (U+04CF) in its place, the IME's prediction engine normalizes these inputs before every dictionary lookup, so that all spellings of a word resolve to the same entry.
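A minimal sketch of such a normalization step follows. It assumes that the canonical form is the capital palochka (U+04C0) and that the input is a single Avar word used as a dictionary key, so every Latin 'I' can safely be replaced; the function name is hypothetical and the real engine may apply more rules.

```cpp
#include <cassert>
#include <string>

constexpr char32_t kPalochka = U'\u04C0';       // Cyrillic letter palochka
constexpr char32_t kSmallPalochka = U'\u04CF';  // Cyrillic small letter palochka
constexpr char32_t kLatinCapitalI = U'I';       // common look-alike substitute

// Rewrites look-alike characters to the canonical palochka.
// Assumption: `word` is an Avar dictionary key, not arbitrary mixed-script text.
std::u32string normalizePalochka(std::u32string word) {
    for (char32_t& ch : word) {
        if (ch == kLatinCapitalI || ch == kSmallPalochka) {
            ch = kPalochka;
        }
    }
    return word;
}

// Usage: both common misspellings of мацӀ map to the same key.
void demoNormalize() {
    const std::u32string canonical = U"мац\u04C0";
    assert(normalizePalochka(U"мацI") == canonical);
    assert(normalizePalochka(U"мац\u04CF") == canonical);
}
```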
Morphology and the Dictionary
Avar is highly agglutinative: words take on many suffixes, and a single noun might have ~40 distinct surface forms. Instead of slowing down typing with a runtime morphological analyzer, the IME uses a prefix tree (trie) dictionary containing ~1.28 million fully inflected forms. This trades memory for predictable lookups: locating a typed prefix takes one step per character, regardless of how many words the dictionary holds.
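The idea of a prefix-tree dictionary can be shown with a small in-memory trie. This is an illustrative sketch, not the IME's storage format: the class name, the use of std::map for child links, and the sample entries are assumptions, and a dictionary of ~1.28 million forms would likely need a more compact representation.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical prefix tree over fully inflected word forms.
class TrieSketch {
    struct Node {
        std::map<char32_t, std::unique_ptr<Node>> children;
        bool isWord = false;
    };

public:
    void insert(const std::u32string& word) {
        Node* node = &root_;
        for (char32_t ch : word) {
            std::unique_ptr<Node>& child = node->children[ch];
            if (!child) child = std::make_unique<Node>();
            node = child.get();
        }
        node->isWord = true;
    }

    // Returns up to `limit` stored words that start with `prefix`,
    // in code-point order.
    std::vector<std::u32string> completions(const std::u32string& prefix,
                                            std::size_t limit) const {
        const Node* node = &root_;
        for (char32_t ch : prefix) {  // one step per typed character
            const auto it = node->children.find(ch);
            if (it == node->children.end()) return {};
            node = it->second.get();
        }
        std::vector<std::u32string> out;
        std::u32string current = prefix;
        collect(*node, current, limit, out);
        return out;
    }

private:
    static void collect(const Node& node, std::u32string& current,
                        std::size_t limit, std::vector<std::u32string>& out) {
        if (out.size() >= limit) return;
        if (node.isWord) out.push_back(current);
        for (const auto& entry : node.children) {
            if (out.size() >= limit) return;
            current.push_back(entry.first);
            collect(*entry.second, current, limit, out);
            current.pop_back();
        }
    }

    Node root_;
};

// Usage with a few sample entries (chosen only for illustration).
void demoTrie() {
    TrieSketch dict;
    dict.insert(U"авар");
    dict.insert(U"авараг");
    dict.insert(U"мац\u04C0");
    const auto hits = dict.completions(U"ава", 10);
    assert(hits.size() == 2 && hits[0] == U"авар");
}
```

In this sketch the results come back in code-point order; ranking them by likelihood is a separate step.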
Contextual Prediction (Stupid Backoff)
The prediction bar does not guess words in isolation; it uses the preceding words as context. The engine applies the "Stupid Backoff" scoring scheme over a large corpus of unigram, bigram, and trigram counts. If the two preceding words plus the candidate form a trigram seen in the corpus, that evidence ranks the candidate; otherwise the engine backs off to the bigram (the preceding word plus the candidate), and finally to raw word frequency. Stupid Backoff produces relative scores rather than normalized probabilities, which is sufficient for ranking candidates and keeps lookups fast.
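The backoff chain can be written down in a few lines. The sketch below follows the published Stupid Backoff scheme, in which each backoff step multiplies the score by a fixed factor (0.4 in the original paper); whether the IME uses that exact factor is not stated here, so treat it as an assumption. The count table, its type names, and the placeholder tokens are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Ngram = std::vector<std::string>;

// Hypothetical in-memory count table; a real model would be far more compact.
struct NgramCounts {
    std::map<Ngram, long> counts;
    long totalTokens = 0;  // sum of all unigram counts

    void add(Ngram ngram, long n) {
        if (ngram.size() == 1) totalTokens += n;
        counts[std::move(ngram)] += n;
    }
    long count(const Ngram& ngram) const {
        const auto it = counts.find(ngram);
        return it == counts.end() ? 0 : it->second;
    }
};

// Scores `word` after `context` (oldest word first). The result is a relative
// score for ranking, not a normalized probability.
double stupidBackoff(const NgramCounts& model, Ngram context,
                     const std::string& word, double alpha = 0.4) {
    assert(alpha > 0.0 && alpha <= 1.0);
    double weight = 1.0;
    for (;;) {
        if (context.empty()) {  // last resort: raw word frequency
            if (model.totalTokens == 0) return 0.0;
            return weight * model.count({word}) / model.totalTokens;
        }
        Ngram ngram = context;
        ngram.push_back(word);
        const long joint = model.count(ngram);
        const long seenContext = model.count(context);
        if (joint > 0 && seenContext > 0) return weight * joint / seenContext;
        context.erase(context.begin());  // drop the oldest word and back off
        weight *= alpha;
    }
}

// Usage with placeholder tokens.
void demoBackoff() {
    NgramCounts m;
    m.add({"a", "b", "c"}, 2);
    m.add({"a", "b"}, 4);
    m.add({"b"}, 10);
    m.add({"c"}, 6);
    const double seen = stupidBackoff(m, {"a", "b"}, "c");    // 2 / 4
    const double unseen = stupidBackoff(m, {"x", "y"}, "c");  // 0.4 * 0.4 * 6 / 16
    assert(seen > 0.49 && seen < 0.51);
    assert(unseen > 0.059 && unseen < 0.061);
}
```

Candidates would then be sorted by this score, so a word supported by a seen trigram outranks one supported only by its overall frequency.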
Send Feedback
Have you encountered a bug, or do you have a feature request? Let us know!