Academic Context
The state of Han Nôm OCR research, how NomLens compares to existing academic approaches, data sources, the full product roadmap, and partnership opportunities.
The preservation crisis
“Once a script dies, the history it carried dies with it.”
For nearly a thousand years, Han Nôm was the soul of Vietnamese culture — the script in which ancestors recorded history, poetry, law, medicine, philosophy, and everyday life. It appears on ancient stone steles, temple inscriptions, wooden couplets, imperial manuscripts, and fragile paper documents. It is not merely a writing system. It is Vietnamese history and cultural memory made visible.
Today, that memory is in mortal danger. Every year, physical inscriptions erode under rain, wind, and pollution. Manuscripts crumble. The last generation of living scholars who can fluently read Han Nôm is rapidly shrinking — perhaps no more than a hundred people in the world still possess deep, native-level mastery of the script. When they pass, and when the stones and papers finally disintegrate, a vast and irreplaceable portion of Vietnam's heritage will be lost forever.
We are in a narrow window — perhaps the last one — where the final expert readers, the surviving physical artifacts, and modern AI technology all still exist at the same time. The Vietnamese Nôm Preservation Foundation spent two decades digitizing manuscripts before dissolving in 2018. Their work was extraordinary. But the window hasn't closed yet. NomLens exists to seize what remains of it.
- Scholars worldwide who can read Han Nôm fluently: perhaps no more than 100
- Years of Vietnamese history written in this script: nearly 1,000
- Year the Nôm Preservation Foundation dissolved: 2018
- Rate of physical artifact deterioration: ongoing, year after year
No Han Nôm inscription should ever be lost again simply because no one could read it. Future generations — scholars, students, ordinary Vietnamese — deserve to still touch the words of their ancestors.
Prior academic work
NomNaOCR (2022) — VNUHCM-UIT
Most significant. 2,953 pages, 38,318 labeled character patches from woodblock prints (Truyện Kiều × 3, Lục Vân Tiên, Đại Việt Sử Ký Toàn Thư). Architecture: DBNet detection + CRNN/Transformer recognition. Server-side only.
Nom Document Digitalization (2020) — Pattern Recognition Letters
719 pages of Truyện Kiều. U-Net segmentation + CNN classifier with attention. Results: segmentation IoU 92%, character recognition 85.07%. Clean woodblock prints only.
Integrating Nôm Language Model (2022)
Tests on unseen woodblock editions (cross-edition generalization). Results: 71–74% mAP at sequence level on unseen data. This is the most realistic published figure for the best academic sequence-level work.
Scene Sino-Nom OCR (2025) — most recent
Focuses on real-world photos (temples, signs), the use case most relevant to NomLens. Combines deep learning with explicit linguistic knowledge of Nôm structure.
How NomLens compares
Note: setting the 71–74% mAP of the best academic sequence-level work on unseen data against NomLens's 97.6% per-character validation accuracy is not a fair comparison, because the two systems solve different tasks. Per-character classification is a simpler task than sequence OCR and scores higher, provided the upstream segmentation is reliable.
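A rough back-of-envelope model shows why the two figures are not interchangeable. The sketch below assumes perfect segmentation and independent per-character errors (both simplifications, not claims about NomLens or the cited papers), in which case the chance of reading a whole line without a single error is the per-character accuracy raised to the line length. The function name is illustrative.

```python
def exact_line_accuracy(per_char_accuracy: float, line_length: int) -> float:
    """Probability that every character in a line is classified correctly.

    Assumes independent per-character errors and perfect segmentation,
    which is a simplification for illustration only.
    """
    if not 0.0 <= per_char_accuracy <= 1.0:
        raise ValueError("per_char_accuracy must be in [0, 1]")
    if line_length < 0:
        raise ValueError("line_length must be non-negative")
    return per_char_accuracy ** line_length


if __name__ == "__main__":
    # Print the line-level figure for a few line lengths.
    for n in (1, 7, 14, 28):
        print(n, round(exact_line_accuracy(0.976, n), 3))
```

Under these assumptions, a per-character accuracy of 0.976 gives roughly 0.71 for a 14-character line, so line-level scores fall quickly even when each character is usually right. mAP is a different metric again, so this is an illustration rather than a conversion between the two numbers.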
| Dimension | Academic / Existing Apps | NomLens |
|---|---|---|
| Architecture | CRNN / Transformer / U-Net | EfficientNet-B0 (mobile) |
| Deployment | Server-side only | On-device, offline |
| Task | Line-level sequence OCR | Single-character classification |
| Accuracy (real data) | 71–85% depending on method | 97.6% val (Han layer) |
| Inference speed | No on-device benchmarks | <10ms on Neural Engine |
| iOS app | None with real ML | NomLens |
| Data flywheel | None | Every correction = training data |
NomNaOCR dataset
The NomNaOCR dataset (38,318 labeled character patches from real Nôm woodblock-print manuscripts) is publicly available on Kaggle. These are labeled images of real printed Nôm characters. Adding them to training would meaningfully improve the model's knowledge of what printed manuscript characters look like, beyond synthetic font renders.
Product roadmap
AI-Powered Decoder
- ✓ Claude Vision API per-character decode
- ✓ Core Image preprocessing (adaptive threshold, deskew, perspective correction)
- ✓ Vision framework segmentation + reading order sort
- ✓ SwiftData local history
- ✓ Structured JSON export
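The reading-order sort mentioned in the list above can be sketched as follows. This is a minimal illustration in Python, not the app's implementation: it assumes the traditional Han Nôm layout of vertical columns read right to left and top to bottom, and the `Box` type, the `sort_reading_order` function, and the `column_tolerance` heuristic are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned character box in image coordinates (origin at top-left)."""
    x: float
    y: float
    width: float
    height: float

    @property
    def center_x(self) -> float:
        return self.x + self.width / 2


def sort_reading_order(boxes: List[Box], column_tolerance: float = 0.5) -> List[Box]:
    """Order boxes column by column: right to left, then top to bottom.

    A box joins the current column when its horizontal center is within
    column_tolerance * mean box width of that column's first box
    (a hypothetical heuristic, chosen for illustration).
    """
    if not boxes:
        return []
    mean_width = sum(b.width for b in boxes) / len(boxes)
    threshold = column_tolerance * mean_width

    columns: List[List[Box]] = []
    # Visit boxes from rightmost to leftmost so columns appear in reading order.
    for box in sorted(boxes, key=lambda b: b.center_x, reverse=True):
        if columns and abs(columns[-1][0].center_x - box.center_x) < threshold:
            columns[-1].append(box)
        else:
            columns.append([box])

    ordered: List[Box] = []
    for column in columns:
        # Within a column, smaller y means higher on the page.
        ordered.extend(sorted(column, key=lambda b: b.y))
    return ordered
```

Grouping by horizontal center tolerates slightly skewed columns, but horizontal inscriptions or mixed layouts would need a different grouping rule.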
On-Device Core ML Model
- · EfficientNet-B0 trained (97.6% val accuracy, 10.6 MB)
- · Temperature calibration (ECE 0.0034)
- · OTA model delivery (iOS components built, manifest server pending)
- · Confidence routing: on-device → Claude fallback
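Temperature calibration and confidence routing fit together roughly as sketched below. This is an illustrative Python sketch under stated assumptions, not the shipped logic: the temperature of 1.5 and the 0.9 routing threshold are placeholder values, and all function names are hypothetical. It shows temperature-scaled softmax, a standard equal-width-bin estimate of expected calibration error (ECE), and a routing rule that falls back when the calibrated confidence is low.

```python
import math
from typing import List, Sequence, Tuple


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Softmax over logits divided by a temperature (T > 1 softens the output)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def route(
    logits: Sequence[float], temperature: float = 1.5, threshold: float = 0.9
) -> Tuple[str, int, float]:
    """Return (target, class_index, confidence); both defaults are placeholder values."""
    probs = softmax_with_temperature(logits, temperature)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    target = "on_device" if confidence >= threshold else "fallback"
    return target, best, confidence
```

A low ECE matters here because the routing threshold is only meaningful when the reported confidence tracks the actual hit rate.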
Active Learning Loop
- · In-app correction flow (user verifies low-confidence characters)
- · Supabase backend: correction store, image crops, auth
- · Data quality pipeline (outlier detection, expert review queue)
- · Threshold-based retraining trigger (500+ new verified corrections)
- · Automated evaluation + Core ML export + OTA push
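The threshold-based retraining trigger from the list above could look something like the following. Only the figure of 500 comes from the roadmap. The `Correction` record, its fields, and the `should_retrain` function are hypothetical names for this sketch and do not describe the actual backend schema.

```python
from dataclasses import dataclass
from typing import Iterable

RETRAIN_THRESHOLD = 500  # from the roadmap: 500+ new verified corrections


@dataclass
class Correction:
    """One user correction (hypothetical shape, for illustration only)."""
    character: str
    verified: bool
    used_in_training: bool = False


def should_retrain(
    corrections: Iterable[Correction], threshold: int = RETRAIN_THRESHOLD
) -> bool:
    """True once enough verified corrections have accumulated since the last run."""
    pending = sum(
        1 for c in corrections if c.verified and not c.used_in_training
    )
    return pending >= threshold
```

Counting only verified, not-yet-used corrections keeps unreviewed input from triggering a run and prevents the same data from being counted twice.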
Community & Collaboration
- · Scholar verification accounts with elevated correction weight
- · GPS-tagged decode archive (living map of Han Nôm inscriptions)
- · Public session publishing and community annotation
- · Context tagging (dynasty, text type, location, subject matter)
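One way to read "elevated correction weight" is as a weighted vote over proposed labels for the same character crop. The sketch below is only an assumption about how that might work: the weight of 3.0 for verified scholars, the tie handling, and the function name are all invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

SCHOLAR_WEIGHT = 3.0  # hypothetical weight for verified scholar accounts
DEFAULT_WEIGHT = 1.0  # hypothetical weight for everyone else


def consensus_label(votes: Iterable[Tuple[str, bool]]) -> Optional[str]:
    """Pick the label with the highest total weight.

    Each vote is (proposed_character, is_verified_scholar). Returns None when
    there are no votes or the top two labels tie, so the item can be sent to
    expert review instead of being decided arbitrarily.
    """
    totals: Dict[str, float] = defaultdict(float)
    for label, is_scholar in votes:
        totals[label] += SCHOLAR_WEIGHT if is_scholar else DEFAULT_WEIGHT
    if not totals:
        return None
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

With these placeholder weights, a single scholar vote outweighs two ordinary votes but not three, which is the kind of trade-off the real weighting would need to settle.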
Deeper NLP
- · Literary translation with cultural context (Claude or fine-tuned model)
- · Poetic form identification (lục bát, Đường thi forms)
- · Historical period estimation from vocabulary and style
- · Named entity recognition (place names, reign titles, personal names)
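As a small illustration of poetic form identification, lục bát alternates lines of six and eight syllables, and in Nôm each character writes one syllable. A first-pass check can therefore look only at line lengths. The sketch below is a deliberately crude heuristic with a hypothetical function name: it assumes lines contain characters only (no punctuation) and ignores the rhyme and tone rules that a real classifier would also need.

```python
from typing import List


def looks_like_luc_bat(lines: List[str]) -> bool:
    """True when non-empty line lengths alternate 6, 8, 6, 8, ... starting with 6.

    Length is counted in Unicode code points, so each Nôm character counts
    once. Rhyme and tone patterns are not checked.
    """
    lengths = [len(line.strip()) for line in lines if line.strip()]
    if len(lengths) < 2:
        return False  # need at least one full six-eight couplet
    expected = [6 if i % 2 == 0 else 8 for i in range(len(lengths))]
    return lengths == expected
```

A length check like this can cheaply rule out forms such as seven-character regulated verse before any more expensive analysis runs.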
Academic Integration
- · Open dataset export: verified corrections contributed to Nom Foundation and Kaggle
- · Research API for batch manuscript digitization
- · NEH / Mellon Foundation grant applications
- · Partnerships: VNUHCM-UIT, Yale, Harvard, Cornell Southeast Asian Studies
Grant & partnership opportunities
The open dataset contribution (verified user corrections exported as labeled training data) is the key that opens academic and institutional doors. We're not asking for money to build a product — we're offering infrastructure that advances shared scholarly goals.
- NEH (National Endowment for the Humanities): Open dataset contribution, manuscript digitization infrastructure
- Mellon Foundation: Cultural heritage preservation, academic access to endangered script data
- Vietnamese Ministry of Culture: National interest in preserving Nôm manuscript heritage
- University Partners: Batch digitization of institutional manuscript collections, student research access
Cite NomLens
Academic citation format (placeholder — update with publication details):
NomLens: On-Device Han Nôm Character Classification with Confidence-Routed AI Fallback. [Author]. 2026. https://nomlens.app