Phase 2 — Living Technical Report — v5.0
ALM Synthetic Data Pipeline
Engineering & Architecture Blueprint
A synthetic training data factory for MAL — the formal language powering the ALM platform.
What: 50 hand-authored seeds → auto-generated MAL training records
How: Seeds → Jinja2 expansion → GPT-4o synthesis → 5-stage quality gate
Why: So ALM can fine-tune on structurally correct, domain-rich MAL examples
Project: ALM / MAL
Phase: 2
All 4 Stages Live
50 Canonical Seeds
AI Synthesis: GPT-4o
Quality Gate: 5-stage
8 Frontend Routes
23 API Endpoints
Version: 5.0 — Apr 2026
Quick Brief — What is this and where are we?
We are building an offline data factory for MAL, the formal domain language used by the ALM platform. The factory takes 50 hand-authored seed scenarios and expands them deterministically via Jinja2 templates. GPT-4o then synthesizes diverse training records in either Jinja2-First or Intent-First mode. Every record passes through a 5-stage MAL quality gate, and validated results are stored in a browsable training registry. All 4 pipeline stages are live, with a React/Vite/TypeScript frontend dashboard, 23 mounted API endpoints, and 7 integration test modules. The system currently processes the telecom domain. Research is now underway (RE-474) to replace the Jinja2-based Stage 2 with a smarter generation layer and to make the LLM stage production-grade.
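To make the deterministic expansion step concrete, here is a minimal sketch in Python. It assumes a seed carries a Jinja2 template string and that a domain pack maps slot names to candidate values. The field names (`id`, `intent`, `slots`), the slot names, and the sample values are hypothetical, since the actual `telecom_v1.jsonl` schema and templates are not shown in this report.

```python
from itertools import product

from jinja2 import Environment, StrictUndefined

# Hypothetical seed and domain pack, for illustration only.
seed = {
    "id": "seed-001",
    "intent": "Provision a {{ service }} line for a {{ segment }} customer",
}
domain_pack = {
    "service": ["fiber", "5G"],
    "segment": ["consumer", "enterprise"],
}


def expand(seed: dict, domain_pack: dict) -> list:
    """Render the seed template once per combination of slot values."""
    # StrictUndefined makes a missing slot raise instead of rendering as "".
    env = Environment(undefined=StrictUndefined)
    template = env.from_string(seed["intent"])
    keys = sorted(domain_pack)  # fixed key order keeps the output deterministic
    records = []
    for values in product(*(domain_pack[k] for k in keys)):
        slots = dict(zip(keys, values))
        records.append(
            {"seed_id": seed["id"], "slots": slots, "text": template.render(**slots)}
        )
    return records


records = expand(seed, domain_pack)
for record in records:
    print(record["text"])
```

Because the slot keys are sorted and `itertools.product` iterates in a fixed order, the same seed and domain pack always yield the same records in the same order, which is the property the word "deterministic" refers to here.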
50 Gold Seeds: telecom_v1.jsonl, hand-authored
23 API Endpoints: 9 route groups, all mounted
2 LLM Modes: Jinja2-First + Intent-First
5 Validation Stages: syntax → schema → policy → dep → compile
8 Frontend Routes: wired in App.tsx, all live
7 Test Modules: first integration test coverage
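The five validation stages run in a fixed order. A fail-fast gate along those lines might look like the sketch below. Only the stage names and their order come from this report. The check bodies are placeholders, the record fields (`id`, `mal`, `domain`, `requires`) are hypothetical, and the repair step is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Stage names and order as listed in this report.
STAGE_ORDER = ["syntax", "schema", "policy", "dep", "compile"]


@dataclass
class GateResult:
    passed: bool
    failed_stage: Optional[str] = None
    reason: Optional[str] = None


# Each check returns None on success or a short error message on failure.
Check = Callable[[dict], Optional[str]]


def check_syntax(rec: dict) -> Optional[str]:
    # Stand-in for the real grammar parse: only rejects an empty body.
    return None if str(rec.get("mal", "")).strip() else "empty MAL body"


def check_schema(rec: dict) -> Optional[str]:
    missing = {"id", "mal", "domain"} - set(rec)
    return f"missing fields: {sorted(missing)}" if missing else None


def check_policy(rec: dict) -> Optional[str]:
    # Assumption: only the telecom domain is accepted.
    return None if rec.get("domain") == "telecom" else "domain not allowed"


def check_dep(rec: dict) -> Optional[str]:
    known = {"customer", "service_plan"}  # hypothetical known definitions
    unknown = set(rec.get("requires", [])) - known
    return f"unknown dependencies: {sorted(unknown)}" if unknown else None


def check_compile(rec: dict) -> Optional[str]:
    return None  # placeholder: a real compile step would go here


CHECKS: Dict[str, Check] = {
    "syntax": check_syntax,
    "schema": check_schema,
    "policy": check_policy,
    "dep": check_dep,
    "compile": check_compile,
}


def run_gate(rec: dict, checks: Dict[str, Check] = CHECKS) -> GateResult:
    """Run the stages in order and stop at the first failure."""
    for stage in STAGE_ORDER:
        reason = checks[stage](rec)
        if reason is not None:
            return GateResult(False, stage, reason)
    return GateResult(True)


good = {"id": "r1", "mal": "<MAL source text>", "domain": "telecom", "requires": ["customer"]}
bad = dict(good, domain="finance")
print(run_gate(good))
print(run_gate(bad))
```

Stopping at the first failing stage means the later stages never see input that an earlier stage has already rejected. Recording `failed_stage` also tells a repair step which kind of fix to attempt.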
Current System Status
| Capability | Status |
| Seed library (50 telecom seeds) | Live ✓ |
| Jinja2 expansion (4 templates, domain pack) | Live ✓ |
| LLM synthesis — Jinja2-First mode | Live ✓ |
| LLM synthesis — Intent-First mode | Live ✓ |
| 5-stage MAL quality gate + repair | Live ✓ |
| Training registry + record browser | Live ✓ |
| Diversity metrics (NLD, TTL cached) | API live, UI route pending |
| Integration test suite (7 modules) | Live ✓ — first test coverage |
| Jinja2 replacement research (RE-476) | In research |
| Production-grade LLM architecture (RE-477) | In research |
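The diversity metrics row mentions "NLD, TTL cached" without defining either term. The sketch below assumes NLD stands for normalized Levenshtein distance, averaged over all record pairs, and that "TTL cached" means results are reused until a time-to-live expires. Both readings are assumptions, and every name in the code is hypothetical rather than taken from the project.

```python
import time
from itertools import combinations
from typing import Callable, Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def nld(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def mean_pairwise_nld(texts: List[str]) -> float:
    """Average NLD over all unordered pairs; higher means more diverse."""
    pairs = list(combinations(texts, 2))
    return sum(nld(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0


class TTLCache:
    """Tiny time-to-live cache; the clock is injectable for testing."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, float]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], float]) -> float:
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: reuse the cached value
        value = compute()
        self._store[key] = (now, value)
        return value


texts = ["abc", "abd", "xyz"]
cache = TTLCache(ttl_seconds=60)
score = cache.get_or_compute("telecom", lambda: mean_pairwise_nld(texts))
print(round(score, 3))
```

The pairwise computation grows quadratically with the number of records, which is one plausible reason to cache the result instead of recomputing it on every request.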
Tech Stack — aligned with Metafore platform
| Component | Role | Metafore ✓ |
| React 19 + TypeScript ~6 | Frontend SPA — 8 routes, dashboard, pipeline controls | ✓ |
| Vite 8 + Tailwind v4 | Build tool + utility CSS — instant HMR, design tokens | ✓ |
| Radix UI + Shadcn components | Accessible, unstyled primitives — same pattern as Varnam | ✓ |
| TanStack Query v5 | Server state, caching, background refetch for all API calls | ✓ |
| Lucide React + Framer Motion | Icons + animations — same icon set as Metafore Varnam | ✓ |
| FastAPI ≥0.115 + uvicorn | Async REST API — 23 endpoints, OpenAPI docs at /docs | ✓ |
| Pydantic v2 | Request/response validation, settings — zero manual parsing | ✓ |
| PostgreSQL + SQLAlchemy 2 | 2 DB tables — full pipeline state, LLM batch tracking | ✓ |
| OpenAI SDK (GPT-4o) | Stage 3 AI synthesis — Jinja2-First + Intent-First modes | — |
| parsimonious (PEG parser) | Embedded MAL grammar validator — 5-stage, in-process | — |
| Jinja2 ≥3.1 + PyYAML | Stage 2 deterministic expansion — under research for replacement | ~ |
| Docker Compose | Local dev — API:8010, UI:8080, DB:5432 | ✓ |
| Pytest + httpx | 7 integration test modules — all mounted routes covered | — |
✓ Matches Metafore platform stack | ~ In transition | — Project-specific
Color Key:
Confirmed: Verified in working tree
In Progress: Active work or research
Planned: Approved, not yet started
Research: Active research question