Synthetic Dataset Generation

M-Synth: High-Fidelity Synthetic OCR Dataset Generator for Indic Scripts

A production-grade, memory-safe synthetic data generator designed to solve the chronic data scarcity problem for Indic OCR pipelines. It produces optimally balanced, physically realistic training images by combining linguistic corpus analysis with advanced image synthesis.

Malayalam script sample with skew and physical degradation
Malayalam script sample with low illumination and blur
English script sample on lined notebook paper background
English script sample with elastic distortion and shear
Mixed Malayalam-English script sample with perspective shift
Mixed-script sample with erosion, dilation, and ink bleed

Synthesized Malayalam text with severe skew, paper texture background, and physical scan artifacts.

Problem Statement

Robust Optical Character Recognition (OCR) for Brahmic scripts, and Malayalam in particular, has long been bottlenecked by a chronic scarcity of training data. Malayalam orthography is exceptionally complex: it has over 900 valid character combinations, intricate overlapping ligatures (കൂട്ടക്ഷരം), and historical lipi shifts (clashes between the legacy and reformed scripts).

Existing text corpora are heavily skewed toward common characters, so CTC-based deep recognition models suffer from alignment collapse and fail on rare glyphs. Furthermore, collecting and manually annotating a diverse set of physically realistic scanning anomalies and document degradations is prohibitively expensive.

Architecture Breakdown

M-Synth resolves this data constraint via a two-phase, memory-safe pipeline that synthesizes highly balanced, physically realistic document image strips:

  • Linguistic Corpus Balancing (Phase 1): Uses stratified reservoir sampling (Algorithm R) to stream large raw text corpora through a fixed memory budget and build a distribution-controlled manifest. This phase also handles chillu canonicalization (resolving legacy ZWJ sequences) and dynamically injects rare character templates.
  • High-Fidelity Rendering & Synthesis (Phase 2): Uses the vector-based Pango/Cairo rendering engine to shape complex Brahmic ligatures correctly, backed by fontTools coverage analysis. Renders are then passed through a photometric and geometric augmentation pipeline.
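
As an illustration of the chillu canonicalization step in Phase 1, the following is a minimal sketch, assuming that "resolving legacy ZWJ sequences" means rewriting the older consonant + virama + ZWJ encoding into the atomic chillu code points (U+0D7A to U+0D7E). The function name `canonicalize_chillus` and the mapping table are hypothetical, not taken from M-Synth itself.

```python
# Sketch only: names and mapping are illustrative, not M-Synth's actual code.
VIRAMA = "\u0d4d"  # MALAYALAM SIGN VIRAMA
ZWJ = "\u200d"     # ZERO WIDTH JOINER

# Legacy base consonant -> atomic chillu letter (Unicode 5.1 and later).
LEGACY_BASE_TO_ATOMIC_CHILLU = {
    "\u0d23": "\u0d7a",  # NNA -> CHILLU NN
    "\u0d28": "\u0d7b",  # NA  -> CHILLU N
    "\u0d30": "\u0d7c",  # RA  -> CHILLU RR
    "\u0d32": "\u0d7d",  # LA  -> CHILLU L
    "\u0d33": "\u0d7e",  # LLA -> CHILLU LL
}


def canonicalize_chillus(text: str) -> str:
    """Replace each legacy <consonant, virama, ZWJ> sequence with its atomic chillu."""
    for base, atomic in LEGACY_BASE_TO_ATOMIC_CHILLU.items():
        text = text.replace(base + VIRAMA + ZWJ, atomic)
    return text


# A word ending in a legacy-encoded chillu N.
legacy_word = "\u0d05\u0d35\u0d28\u0d4d\u200d"
canonical_word = canonicalize_chillus(legacy_word)
```

Normalizing to a single encoding before sampling keeps visually identical strings from being counted as different labels in the manifest.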

Why This Needed Custom Engineering

Brahmic text rendering is inherently non-linear: plotting glyphs character by character causes severe spatial overlap and broken conjuncts. M-Synth relies on Pango/Cairo text shaping to compose ligatures correctly, but this introduces processing challenges when scaling to millions of samples.

Moreover, naive random text generation produces highly unnatural letter distributions. M-Synth addresses this with stratified reservoir sampling and custom script classifiers that balance five script categories (Malayalam-only, English-only, mixed-script, symbol-heavy, and rare character focus) and three length buckets (short, medium, long) in a single streaming pass.
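
The single-pass balancing described above can be sketched as one Algorithm R reservoir per (script category, length bucket) stratum. This is a minimal sketch under stated assumptions: the classifier rules, the `RARE_CHARS` set, and the length thresholds are hypothetical placeholders, not M-Synth's actual values.

```python
import random
from collections import defaultdict

# Hypothetical rare-character set and length thresholds, for illustration only.
RARE_CHARS = frozenset("\u0d0b\u0d14\u0d22")
SHORT_MAX, MEDIUM_MAX = 20, 60


def classify_script(line, rare_chars=RARE_CHARS):
    """Assign one of five categories using simple character counts."""
    if any(ch in rare_chars for ch in line):
        return "rare"
    mal = lat = sym = 0
    for ch in line:
        if "\u0d00" <= ch <= "\u0d7f":      # Malayalam Unicode block
            mal += 1
        elif ch.isascii() and ch.isalpha():
            lat += 1
        elif not ch.isalnum() and not ch.isspace():
            sym += 1
    if sym > mal + lat:
        return "symbol"
    if mal and lat:
        return "mixed"
    return "malayalam" if mal else "english"


def length_bucket(line):
    n = len(line)
    return "short" if n < SHORT_MAX else "medium" if n < MEDIUM_MAX else "long"


def stratified_reservoir(lines, per_stratum, seed=0):
    """Keep a uniform sample of at most `per_stratum` lines per stratum."""
    rng = random.Random(seed)
    reservoirs = defaultdict(list)
    seen = defaultdict(int)
    for raw in lines:                        # works on any iterable or generator
        line = raw.strip()
        if not line:
            continue
        key = (classify_script(line), length_bucket(line))
        seen[key] += 1
        bucket = reservoirs[key]
        if len(bucket) < per_stratum:
            bucket.append(line)              # fill phase
        else:
            j = rng.randrange(seen[key])     # uniform in [0, seen)
            if j < per_stratum:
                bucket[j] = line             # replace with probability k / seen
    return dict(reservoirs)


sample = stratified_reservoir((f"word {i}" for i in range(1000)), per_stratum=5)
```

Memory use is bounded by the number of strata times `per_stratum`, regardless of corpus size, which is the property that makes the approach suitable for streaming input.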

Synthesized Script Showcase & Visual Capabilities

The following representative synthetic images generated by M-Synth cover each script category and its associated visual properties:

| Synthetic Preview Image | Script Category | Structural Ligature Trait | Applied Augmentations & Visual Properties |
| --- | --- | --- | --- |
| Malayalam Skewed | Malayalam | Features legacy chillu markers and complex overlapping glyph structures | Skewed layout, paper-texture background overlay, contrast degradation |
| Malayalam Illumination | Malayalam | High-density conjuncts with correct shaping | Motion-blur simulation, localized shadow casting, slight horizontal shear |
| English Lined | English | Latin numeral sequences and alphabetic strings | Lined notebook paper background, ink bleed-through effect, slight angle shift |
| English Grid | English | Latin character series | Grid paper texture, elastic character distortion, salt-and-pepper noise, dilation |
| Mixed Perspective | Mixed-Script | Code-mixed English + Malayalam token sequence | Wrinkled paper texture, light perspective distortion, localized contrast compression |
| Mixed Ink Bleed | Mixed-Script | Code-mixed Indic-Latin character sequence | Heavy ink bleed simulation, spatial erosion, Gaussian blurring, noise overlays |
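
The skewed-layout and angle-shift augmentations listed above enlarge the axis-aligned bounding box of a text strip, so the canvas needs extra room before rotation. The following is a minimal sketch of that geometry, assuming a rotation about the canvas center by an angle that has already been sampled. The function names and the `margin` parameter are hypothetical.

```python
import math


def rotated_bounds(width, height, angle_deg):
    """Axis-aligned bounding box of a width x height rectangle rotated by angle_deg."""
    theta = math.radians(angle_deg)
    c, s = abs(math.cos(theta)), abs(math.sin(theta))
    # Round before ceil so floating-point noise (e.g. cos(90 deg) ~ 6e-17)
    # does not add a spurious pixel.
    new_w = math.ceil(round(width * c + height * s, 9))
    new_h = math.ceil(round(width * s + height * c, 9))
    return new_w, new_h


def padding_for_rotation(width, height, angle_deg, margin=2):
    """Per-side padding (pad_x, pad_y) so the rotated text stays inside the canvas."""
    new_w, new_h = rotated_bounds(width, height, angle_deg)
    pad_x = max(0, math.ceil((new_w - width) / 2)) + margin
    pad_y = max(0, math.ceil((new_h - height) / 2)) + margin
    return pad_x, pad_y


# A 100 x 20 px text strip rotated by 10 degrees.
pad_x, pad_y = padding_for_rotation(100, 20, 10, margin=0)
```

The padding is computed for the specific angle applied to a sample rather than for the maximum of the sampling range, because for a wide strip the bounding-box width peaks at an intermediate angle, not at the largest one.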

Key Technical Challenges & Mitigations

| Challenge | Engineering Description | Pipeline Mitigation Strategy |
| --- | --- | --- |
| Brahmic Font Spacing Breaks | Standard PIL/Pillow libraries render Brahmic characters sequentially, separating complex joint ligatures into disjointed glyph components. | Engineered a vector-level rendering layer utilizing Pango/Cairo to perform proper HarfBuzz-backed text shaping before canvas creation. |
| System Memory Exhaustion | Processing multi-gigabyte raw text corpora to find rare character sequences crashes host systems due to RAM limits. | Built a stream loader that processes JSONL/TXT inputs incrementally via Python generators and samples a fixed-size dataset using Algorithm R. |
| Character Truncation in Crops | Applying heavy geometric distortions (skew, rotation, perspective) often cuts off start/end characters on tight text boundaries. | Integrated dynamic canvas boundary calculation in get_text_dimensions to add elastic padding to the canvas prior to rotation and crop. |
| Silent Font Deficit Corruption | Many Unicode fonts lack full glyph maps for rare Malayalam glyphs, rendering empty blocks (tofu) in synthetic datasets. | Implemented fontTools font analysis to pre-scan system fonts and dynamically rank font fallbacks based on precise sub-range Unicode coverage. |
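
The font-fallback ranking in the last row can be sketched as a coverage score over the Malayalam code points a sample actually needs. This is a minimal sketch, not M-Synth's implementation: `coverage_score`, `rank_fonts`, and `load_cmap` are hypothetical names, and the scoring rule (fraction of required code points present in the font's cmap) is an assumption. Only `load_cmap` needs fontTools, which is imported lazily so the rest runs on the standard library alone.

```python
MALAYALAM_BLOCK = range(0x0D00, 0x0D80)


def load_cmap(font_path):
    """Return the set of code points a font maps (requires third-party fontTools)."""
    from fontTools.ttLib import TTFont
    font = TTFont(font_path, lazy=True)
    try:
        return set(font.getBestCmap() or {})
    finally:
        font.close()


def coverage_score(cmap_codepoints, required):
    """Fraction of required code points that the font can map to a glyph."""
    required = set(required)
    if not required:
        return 1.0
    return len(required & set(cmap_codepoints)) / len(required)


def rank_fonts(font_cmaps, text):
    """Sort fonts by coverage of the Malayalam code points in `text`, best first."""
    required = {ord(ch) for ch in text if ord(ch) in MALAYALAM_BLOCK}
    scored = [(name, coverage_score(cps, required)) for name, cps in font_cmaps.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy cmaps standing in for load_cmap(...) results on real font files.
fonts = {"FontA": {0x0D2E, 0x0D32}, "FontB": {0x0D2E}}
ranking = rank_fonts(fonts, "\u0d2e\u0d32")
```

Note that cmap coverage is a necessary condition only: a font may map every code point and still lack the substitution rules needed for a particular conjunct, so a check like this rules out tofu but does not by itself guarantee correct ligatures.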

Engineering Outcome

The M-Synth pipeline enables rapid, deterministic creation of training datasets for OCR architectures such as PaddleOCR and TrOCR. By enforcing balanced character-level coverage and simulating real-world scanning defects, it is designed to improve the generalization of models trained on M-Synth corpora and to lower the data barrier for low-resource Indic scripts.