Research

Academic Context

The state of Han Nôm OCR research, how NomLens compares to existing academic approaches, data sources, the full product roadmap, and partnership opportunities.

The preservation crisis

“Once a script dies, the history it carried dies with it.”

For nearly a thousand years, Han Nôm was the soul of Vietnamese culture — the script in which ancestors recorded history, poetry, law, medicine, philosophy, and everyday life. It appears on ancient stone steles, temple inscriptions, wooden couplets, imperial manuscripts, and fragile paper documents. It is not merely a writing system. It is Vietnamese history and cultural memory made visible.

Today, that memory is in mortal danger. Every year, physical inscriptions erode under rain, wind, and pollution. Manuscripts crumble. The last generation of living scholars who can fluently read Han Nôm is rapidly shrinking — perhaps no more than a hundred people in the world still possess deep, native-level mastery of the script. When they pass, and when the stones and papers finally disintegrate, a vast and irreplaceable portion of Vietnam's heritage will be lost forever.

We are in a narrow window, perhaps the last one, in which the final expert readers, the surviving physical artifacts, and modern AI technology all exist at the same time. The Vietnamese Nôm Preservation Foundation spent two decades digitizing manuscripts before dissolving in 2018. Their work was extraordinary, but it has ended while the window remains open. NomLens exists to seize what remains of it.

<100

Scholars worldwide who can read Han Nôm fluently

1,000+

Years of Vietnamese history written in this script

2018

Year the Nôm Preservation Foundation dissolved

Growing

Rate of physical artifact deterioration

No Han Nôm inscription should ever be lost again simply because no one could read it. Future generations — scholars, students, ordinary Vietnamese — deserve to still touch the words of their ancestors.

Prior academic work

NomNaOCR (2022) — VNUHCM-UIT

Most significant

2,953 pages, 38,318 labeled character patches from woodblock prints (Truyện Kiều × 3, Lục Vân Tiên, Đại Việt Sử Ký Toàn Thư). Architecture: DBNet detection + CRNN/Transformer recognition. Server-side only.

Dataset on Kaggle → GitHub →

Nom Document Digitalization (2020) — Pattern Recognition Letters

719 pages of Truyện Kiều. U-Net segmentation + CNN classifier with attention. Results: segmentation IoU 92%, character recognition 85.07%. Clean woodblock prints only.

Integrating Nôm Language Model (2022)

Tests on unseen woodblock editions (cross-edition generalization). Results: 71–74% mAP at sequence level on unseen data. This is the honest real-world number for the best academic sequence-level work.

Scene Sino-Nom OCR (2025) — most recent

Focuses on real-world photos (temples, signs) — the most relevant use case to NomLens. Combines deep learning with explicit linguistic knowledge of Nôm structure.

How NomLens compares

Note: setting the 71–74% mAP of the best academic sequence-level work on unseen data against NomLens's 97.6% per-character validation accuracy is not a fair comparison, because the two solve different tasks. Per-character classification is a simpler task and scores higher than sequence OCR, provided upstream segmentation is reliable.
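As a rough illustration of why the two figures are not comparable, the sketch below assumes every character in a line is classified independently with the same accuracy. That is a simplification: real errors are correlated, and segmentation mistakes add further loss. It computes a whole-line exact-match rate, which is not the mAP metric reported in the academic work. The function name `line_exact_match_rate` is invented for this example.

```python
def line_exact_match_rate(per_char_accuracy: float, line_length: int) -> float:
    """Chance that every character in a line is correct, assuming independent errors."""
    if not 0.0 <= per_char_accuracy <= 1.0:
        raise ValueError("per_char_accuracy must be between 0 and 1")
    if line_length < 0:
        raise ValueError("line_length must be non-negative")
    # Independent events multiply: p * p * ... * p (line_length times).
    return per_char_accuracy ** line_length


if __name__ == "__main__":
    # Line lengths are arbitrary illustrative values.
    for n in (1, 7, 14, 28):
        print(f"{n:>2} characters: {line_exact_match_rate(0.976, n):.3f}")
```

Under that assumption, a 97.6% per-character rate compounds to roughly 0.71 for a 14-character line, so a per-character score tends to look higher than a line-level one even when the underlying recognizer is the same.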

Dimension	Academic / Existing Apps	NomLens
Architecture	CRNN / Transformer / U-Net	EfficientNet-B0 (mobile)
Deployment	Server-side only	On-device, offline
Task	Line-level sequence OCR	Single-character classification
Accuracy (real data)	71–85% depending on method	97.6% val (Han layer)
Inference speed	No on-device benchmarks	<10ms on Neural Engine
iOS app	None with real ML	NomLens
Data flywheel	None	Every correction = training data

The key finding: A production-quality, on-device, real-time Nôm character recognizer for iOS is genuinely novel. No direct competitor exists. Academic work is server-side and not productized. Existing App Store apps are undisclosed black boxes with no accuracy claims or on-device inference.

NomNaOCR dataset

The NomNaOCR dataset (38,318 labeled character patches from real Nôm woodblock-print manuscripts) is publicly available on Kaggle. These are labeled images of real printed Nôm characters, so adding them to training would likely improve the model's knowledge of what printed manuscript characters look like, beyond synthetic font renders.

View on Kaggle →

Evaluated for v2 / v3 training inclusion

Product roadmap

Phase 1 · Complete

AI-Powered Decoder

Complete

✓ Claude Vision API per-character decode
✓ Core Image preprocessing (adaptive threshold, deskew, perspective correction)
✓ Vision framework segmentation + reading order sort
✓ SwiftData local history
✓ Structured JSON export
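The reading-order step can be sketched as follows. This is a minimal illustration, not the app's actual implementation. It assumes the traditional vertical layout (columns read top to bottom, ordered right to left), bounding boxes with a top-left origin, and a hypothetical `column_tolerance` parameter for grouping boxes into columns. The `Box` type and `reading_order` function are names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Character bounding box; origin at top-left, y grows downward (assumed)."""
    x: float  # left edge
    y: float  # top edge
    w: float
    h: float

    @property
    def cx(self) -> float:
        return self.x + self.w / 2

    @property
    def cy(self) -> float:
        return self.y + self.h / 2


def reading_order(boxes: list[Box], column_tolerance: float = 0.5) -> list[Box]:
    """Order boxes column by column: right-to-left columns, top-to-bottom within each."""
    if not boxes:
        return []
    # Use the median character width as the scale for "same column".
    median_w = sorted(b.w for b in boxes)[len(boxes) // 2]
    threshold = column_tolerance * median_w

    columns: list[list[Box]] = []
    # Walk boxes from the rightmost centre to the leftmost.
    for box in sorted(boxes, key=lambda b: b.cx, reverse=True):
        # Stay in the current column while the centre is close to the column's first box.
        if columns and abs(columns[-1][0].cx - box.cx) <= threshold:
            columns[-1].append(box)
        else:
            columns.append([box])

    # Within each column, read from top to bottom.
    return [b for col in columns for b in sorted(col, key=lambda b: b.cy)]
```

Vision framework bounding boxes use a normalized, lower-left origin, so their y values would need flipping before a sort like this. Skewed or curved columns would need something more robust than a fixed tolerance.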

Phase 2 · Active

On-Device Core ML Model

In progress

· EfficientNet-B0 trained (97.6% val accuracy, 10.6 MB)
· Temperature calibration (ECE 0.0034)
· OTA model delivery (iOS components built, manifest server pending)
· Confidence routing: on-device → Claude fallback
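The calibration and routing items above can be illustrated with a short sketch. It shows temperature-scaled softmax, expected calibration error (ECE) with equal-width confidence bins, and a threshold-based routing decision. The temperature and threshold values, the bin count, and all function names are assumptions for illustration; the source does not state the values NomLens uses, and the shipped routing runs on-device rather than in Python.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Softmax over logits / temperature; temperature > 1 softens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(confidences: list[float], correct: list[bool],
                               n_bins: int = 10) -> float:
    """Bin-size-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and the same length")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


def route(logits: list[float], temperature: float, threshold: float) -> tuple[str, int, float]:
    """Return ("on_device" | "fallback", top class index, calibrated confidence)."""
    probs = softmax_with_temperature(logits, temperature)
    best = max(range(len(probs)), key=probs.__getitem__)
    decision = "on_device" if probs[best] >= threshold else "fallback"
    return decision, best, probs[best]
```

Calibration matters here because the routing threshold is only meaningful if a confidence of 0.9 really corresponds to being right about 90% of the time. A low ECE indicates that the calibrated confidences track observed accuracy closely.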

Phase 3 · Planned

Active Learning Loop

Future

· In-app correction flow (user verifies low-confidence characters)
· Supabase backend: correction store, image crops, auth
· Data quality pipeline (outlier detection, expert review queue)
· Threshold-based retraining trigger (500+ new verified corrections)
· Automated evaluation + Core ML export + OTA push
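The threshold-based retraining trigger could look something like the sketch below. Only the 500-correction threshold comes from the roadmap. The `Correction` record, its fields, and the `should_retrain` function are hypothetical, and a real version would query the correction store rather than scan an in-memory list.

```python
from dataclasses import dataclass

RETRAIN_THRESHOLD = 500  # roadmap figure: 500+ new verified corrections


@dataclass(frozen=True)
class Correction:
    """One user correction (hypothetical schema, for illustration only)."""
    crop_id: str
    predicted: str
    corrected: str
    verified: bool
    created_at: int  # Unix timestamp, seconds


def should_retrain(corrections: list[Correction], last_trained_at: int,
                   threshold: int = RETRAIN_THRESHOLD) -> bool:
    """True once enough verified corrections have arrived since the last training run."""
    new_verified = sum(
        1 for c in corrections
        if c.verified and c.created_at > last_trained_at
    )
    return new_verified >= threshold
```

Counting only verified corrections newer than the last run keeps unreviewed or already-used corrections from triggering a retrain.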

Phase 4 · Future

Community & Collaboration

Future

· Scholar verification accounts with elevated correction weight
· GPS-tagged decode archive (living map of Han Nôm inscriptions)
· Public session publishing and community annotation
· Context tagging (dynasty, text type, location, subject matter)

Phase 5 · Future

Deeper NLP

Future

· Literary translation with cultural context (Claude or fine-tuned model)
· Poetic form identification (lục bát, Đường thi forms)
· Historical period estimation from vocabulary and style
· Named entity recognition (place names, reign titles, personal names)

Phase 6 · Future

Academic Integration

Future

· Open dataset export: verified corrections contributed to Nom Foundation and Kaggle
· Research API for batch manuscript digitization
· NEH / Mellon Foundation grant applications
· Partnerships: VNUHCM-UIT, Yale, Harvard, Cornell Southeast Asian Studies

Grant & partnership opportunities

The open dataset contribution (verified user corrections exported as labeled training data) is the key that opens academic and institutional doors. We're not asking for money to build a product — we're offering infrastructure that advances shared scholarly goals.

NEH (National Endowment for the Humanities)

Digital Humanities Advancement Grants

Open dataset contribution, manuscript digitization infrastructure

Mellon Foundation

Scholarly Communications & Information Technology

Cultural heritage preservation, academic access to endangered script data

Vietnamese Ministry of Culture

Cultural Heritage Digitization Program

National interest in preserving Nôm manuscript heritage

University Partners

Yale, Harvard, Cornell — Southeast Asian Studies

Batch digitization of institutional manuscript collections, student research access

Cite NomLens

Academic citation format (placeholder — update with publication details):

NomLens: On-Device Han Nôm Character Classification with Confidence-Routed
AI Fallback. [Author]. 2026. https://nomlens.app
