Phase 2 — Living Technical Report — v5.0

ALM Synthetic Data Pipeline
Engineering & Architecture Blueprint

A synthetic training data factory for MAL — the formal language powering the ALM platform.
What: 50 hand-authored seeds → auto-generated MAL training records
How: Seeds → Jinja2 expansion → GPT-4o synthesis → 5-stage quality gate
Why: So ALM can fine-tune on structurally correct, domain-rich MAL examples
Project: ALM / MAL  |  Phase: 2  |  All 4 Stages Live  |  50 Canonical Seeds  |  AI Synthesis: GPT-4o  |  Quality Gate: 5-stage  |  8 Frontend Routes  |  23 API Endpoints  |  Version: 5.0 (Apr 2026)
Quick Brief — What is this and where are we?
We are building an offline data factory for MAL, the formal domain language used by the ALM platform. The factory takes 50 hand-authored seed scenarios and expands them deterministically via Jinja2 templates. GPT-4o then synthesizes diverse training records in either Jinja2-First or Intent-First mode. Every record is routed through a 5-stage MAL quality gate, and validated results are stored in a browsable training registry. All 4 pipeline stages are live, with a React/Vite/TypeScript frontend dashboard, 23 mounted API endpoints, and 7 integration test modules. The system currently processes the telecom domain. Research is now underway (RE-474) to replace the Jinja2-based Stage 2 with a smarter generation layer and to make the LLM stage production-grade.
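To make the "deterministic expansion" step concrete, here is a minimal sketch of how one seed could fan out into several records. It is not the project's Stage 2 code: Python's standard-library `string.Template` stands in for Jinja2 so the example has no third-party dependencies, and the seed fields (`seed_id`, `template`, `slots`) are hypothetical, not the real `telecom_v1.jsonl` schema.

```python
from itertools import product
from string import Template

# Hypothetical seed: these field names are illustrative only and are not
# taken from the real telecom_v1.jsonl schema.
seed = {
    "seed_id": "telecom-001",
    "template": "When $event occurs on $resource, notify $team.",
    "slots": {
        "event": ["link failure", "congestion"],
        "resource": ["core router", "edge switch"],
        "team": ["NOC"],
    },
}


def expand(seed: dict) -> list[dict]:
    """Render one record per combination of slot values, in a fixed order."""
    names = sorted(seed["slots"])  # sorted keys keep the output order stable
    template = Template(seed["template"])
    records = []
    combos = product(*(seed["slots"][name] for name in names))
    for index, values in enumerate(combos):
        bindings = dict(zip(names, values))
        records.append(
            {
                "record_id": f"{seed['seed_id']}-{index:03d}",
                "text": template.substitute(bindings),
                "bindings": bindings,
            }
        )
    return records


records = expand(seed)
```

Because the slot names are sorted and `itertools.product` iterates in a fixed order, running `expand` twice on the same seed yields identical records, which is the property "deterministic" refers to here.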
Seed Library → Jinja2 Expansion → AI Synthesis → Quality Gate → Registry
50 Gold Seeds: telecom_v1.jsonl, hand-authored
23 API Endpoints: 9 route groups, all mounted
2 LLM Modes: Jinja2-First + Intent-First
5 Validation Stages: syntax → schema → policy → dep → compile
8 Frontend Routes: wired in App.tsx, all live
7 Test Modules: first integration test coverage
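The validation order listed above (syntax → schema → policy → dep → compile) suggests a gate that runs its checks in sequence and stops at the first failure. The sketch below illustrates that control flow only. The stage names come from this report, but `GateResult`, `run_gate`, and the placeholder checks are hypothetical, the repair step is not modeled, and the sample strings are not real MAL.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Stage order as listed in the report; "dep" is kept as written there.
STAGE_ORDER = ("syntax", "schema", "policy", "dep", "compile")

# A check returns None on success or an error message on failure.
Check = Callable[[str], Optional[str]]


@dataclass
class GateResult:
    passed: bool
    failed_stage: Optional[str] = None
    error: Optional[str] = None


def run_gate(record: str, checks: dict[str, Check]) -> GateResult:
    """Run the checks in STAGE_ORDER and stop at the first failure."""
    for stage in STAGE_ORDER:
        error = checks[stage](record)
        if error is not None:
            return GateResult(passed=False, failed_stage=stage, error=error)
    return GateResult(passed=True)


# Placeholder checks: stand-ins for the real MAL validators.
def always_ok(record: str) -> Optional[str]:
    return None


def balanced_braces(record: str) -> Optional[str]:
    depth = 0
    for ch in record:
        depth += (ch == "{") - (ch == "}")
        if depth < 0:
            return "unbalanced braces"
    return None if depth == 0 else "unbalanced braces"


checks = {stage: always_ok for stage in STAGE_ORDER}
checks["syntax"] = balanced_braces

ok = run_gate("rule { action }", checks)   # passes every placeholder stage
bad = run_gate("rule { action", checks)    # stops at the syntax stage
```

Stopping at the first failing stage keeps later, presumably more expensive checks from running on records that are already known to be invalid, and the `failed_stage` field tells a downstream repair step where to look.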
Current System Status
Capability | Status
Seed library (50 telecom seeds) | Live ✓
Jinja2 expansion (4 templates, domain pack) | Live ✓
LLM synthesis, Jinja2-First mode | Live ✓
LLM synthesis, Intent-First mode | Live ✓
5-stage MAL quality gate + repair | Live ✓
Training registry + record browser | Live ✓
Diversity metrics (NLD, TTL cached) | API live, UI route pending
Integration test suite (7 modules) | Live ✓ (first test coverage)
Jinja2 replacement research (RE-476) | In research
Production-grade LLM architecture (RE-477) | In research
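The "Diversity metrics (NLD, TTL cached)" row can be illustrated with a small sketch. The report does not expand "NLD"; the code below assumes it means normalized Levenshtein distance, averaged over all pairs of records, and assumes "TTL cached" means the result is recomputed only after a time-to-live expires. The function and class names are hypothetical.

```python
import time
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,               # deletion
                    current[j - 1] + 1,            # insertion
                    previous[j - 1] + (ca != cb),  # substitution
                )
            )
        previous = current
    return previous[-1]


def mean_nld(texts: list[str]) -> float:
    """Mean pairwise edit distance, each normalized by the longer string."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        longest = max(len(a), len(b))
        total += levenshtein(a, b) / longest if longest else 0.0
    return total / len(pairs)


class TTLCache:
    """Keep one computed value and recompute it after ttl_seconds."""

    def __init__(self, ttl_seconds: float):
        self.ttl_seconds = ttl_seconds
        self._value = None
        self._expires_at = 0.0

    def get(self, compute):
        now = time.monotonic()
        if self._value is None or now >= self._expires_at:
            self._value = compute()
            self._expires_at = now + self.ttl_seconds
        return self._value
```

Under this assumed definition, a score near 0 means the records are near-duplicates and a score near 1 means they share almost no text. Pairwise comparison is quadratic in the number of records, which is one plausible reason to cache the result rather than recompute it on every request.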
Tech Stack — aligned with Metafore platform
Component | Role | Metafore ✓
React 19 + TypeScript ~6 | Frontend SPA: 8 routes, dashboard, pipeline controls
Vite 8 + Tailwind v4 | Build tool + utility CSS: instant HMR, design tokens
Radix UI + Shadcn components | Accessible, unstyled primitives (same pattern as Varnam)
TanStack Query v5 | Server state, caching, background refetch for all API calls
Lucide React + Framer Motion | Icons + animations (same icon set as Metafore Varnam)
FastAPI ≥0.115 + uvicorn | Async REST API: 23 endpoints, OpenAPI docs at /docs
Pydantic v2 | Request/response validation, settings (zero manual parsing)
PostgreSQL + SQLAlchemy 22 DB tables — full pipeline state, LLM batch tracking
OpenAI SDK (GPT-4o) | Stage 3 AI synthesis: Jinja2-First + Intent-First modes
parsimonious (PEG parser) | Embedded MAL grammar validator: 5-stage, in-process
Jinja2 ≥3.1 + PyYAML | Stage 2 deterministic expansion (under research for replacement) | ~
Docker Compose | Local dev (API:8010, UI:8080, DB:5432)
Pytest + httpx | 7 integration test modules: all mounted routes covered
✓ Matches Metafore platform stack  |  ~ In transition  |  — Project-specific
Contents
Part I: Current State. Everything is built and live. Research is next.
1. Executive Summary  2. Live System Dashboard  3. Next Steps & Research Agenda
Part II: Pipeline Deep Dive. Exactly how each stage works and which files run it.
6. Task Coverage & Construct Map  7. Pipeline Architecture  7b. Frontend Interface & Pages  8. Stage 2 Rationale & Transition
Part III: Research Direction. Two open questions: what replaces Jinja2, and how to make the LLM smarter.
11. Stage 2 Upgrade: Moving Beyond Templates  12. Making AI Synthesis Smarter: Research Agenda
Part IV: Engineering Reference. APIs, schemas, services, and tests, for reference.
13b. API Quick Reference  14. Data Schemas & Services  15. Tech Stack: Why Each Tool
Part V: Roadmap & Glossary. Where we are headed and what terms mean.
18. Roadmap  22. Glossary  21. Changelog
Color Key: Confirmed (verified in working tree)  |  In Progress (active work or research)  |  Planned (approved, not yet started)  |  Research (active research question)