Phase 2 — Living Technical Report — v5.0

ALM Synthetic Data Pipeline
Engineering & Architecture Blueprint

A synthetic training data factory for MAL — the formal language powering the ALM platform.

What: 50 hand-authored seeds → auto-generated MAL training records
How: Seeds → Jinja2 expansion → GPT-4o synthesis → 5-stage quality gate
Why: So ALM can fine-tune on structurally correct, domain-rich MAL examples

Project: ALM / MAL | Phase: 2 | All 4 Stages Live | 50 Canonical Seeds | AI Synthesis: GPT-4o | Quality Gate: 5-stage | 8 Frontend Routes | 23 API Endpoints | Version: 5.0, Apr 2026

Quick Brief — What is this and where are we?

We are building an offline data factory for MAL, the formal domain language used by the ALM platform. The factory takes 50 hand-authored seed scenarios and expands them deterministically via Jinja2 templates. It then synthesizes diverse training records with GPT-4o in either Jinja2-First or Intent-First mode, routes every record through a 5-stage MAL quality gate, and stores the validated results in a browsable training registry. All 4 pipeline stages are live, with a React/Vite/TypeScript frontend dashboard, 23 mounted API endpoints, and 7 integration test modules. The system currently processes the telecom domain. Research is now underway (RE-474) to replace the Jinja2-based Stage 2 with a smarter generation layer and to make the LLM stage production-grade.
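The deterministic expansion step described above can be illustrated with a small sketch. It uses Python's standard-library `string.Template` as a stand-in for Jinja2 so that it runs without dependencies. The seed layout, the `expand` helper, the slot names, and the plain-English template text are all hypothetical, since the actual seed schema and MAL syntax are not shown in this overview.

```python
from itertools import product
from string import Template

# Hypothetical seed: the real seed schema and MAL syntax are not shown in
# this overview, so the template text is plain English rather than MAL.
seed = {
    "id": "seed-001",
    "template": "Provision $service for $segment customers in $region",
    "slots": {
        "service": ["5G data", "fiber broadband"],
        "segment": ["enterprise", "consumer"],
        "region": ["north"],
    },
}


def expand(seed: dict) -> list[dict]:
    """Expand one seed into every slot combination, in a stable order."""
    names = sorted(seed["slots"])  # fixed slot order keeps output reproducible
    template = Template(seed["template"])
    records = []
    combos = product(*(seed["slots"][name] for name in names))
    for index, values in enumerate(combos):
        bindings = dict(zip(names, values))
        records.append({
            "id": f"{seed['id']}-{index:03d}",
            # substitute() raises KeyError if the template names a missing slot
            "text": template.substitute(bindings),
            "bindings": bindings,
        })
    return records


records = expand(seed)
```

Because the slot names are sorted and `itertools.product` iterates in a fixed order, calling `expand` twice on the same seed yields identical records and IDs, which is the property that makes a template stage reproducible.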

Seed Library → Jinja2 Expansion → AI Synthesis → Quality Gate → Registry

Metric	Count	Detail
Gold Seeds	50	telecom_v1.jsonl — hand-authored
API Endpoints	23	9 route groups, all mounted
LLM Modes	2	Jinja2-First + Intent-First
Validation Stages	5	syntax → schema → policy → dep → compile
Frontend Routes	8	wired in App.tsx, all live
Test Modules	7	first integration test coverage
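The stage order listed above (syntax → schema → policy → dep → compile) suggests a gate that runs checks in sequence and stops at the first failure. The following is a minimal sketch under that assumption. `run_gate`, `GateResult`, and the placeholder checks are hypothetical names; the real validators are grammar-based, and the repair step is not modelled here.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Stage names taken from the report; everything else is illustrative.
STAGE_ORDER = ("syntax", "schema", "policy", "dep", "compile")

# A check returns an error message, or None when the record passes.
Check = Callable[[str], Optional[str]]


@dataclass
class GateResult:
    passed: bool
    failed_stage: Optional[str] = None
    error: Optional[str] = None


def run_gate(record: str, checks: dict[str, Check]) -> GateResult:
    """Run the checks in stage order and stop at the first failure."""
    for stage in STAGE_ORDER:
        error = checks[stage](record)
        if error is not None:
            return GateResult(False, stage, error)
    return GateResult(True)


# Placeholder checks standing in for the real validators (assumptions).
def non_empty(record: str) -> Optional[str]:
    return None if record.strip() else "empty record"


def no_todo(record: str) -> Optional[str]:
    return "contains TODO marker" if "TODO" in record else None


def always_pass(record: str) -> Optional[str]:
    return None


checks = {
    "syntax": non_empty,
    "schema": always_pass,
    "policy": no_todo,
    "dep": always_pass,
    "compile": always_pass,
}
```

Reporting the failing stage alongside the error keeps rejected records easy to triage, for example `run_gate("TODO later", checks).failed_stage` identifies `policy` without running the later stages.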

Current System Status

Capability	Status
Seed library (50 telecom seeds)	Live ✓
Jinja2 expansion (4 templates, domain pack)	Live ✓
LLM synthesis — Jinja2-First mode	Live ✓
LLM synthesis — Intent-First mode	Live ✓
5-stage MAL quality gate + repair	Live ✓
Training registry + record browser	Live ✓
Diversity metrics (NLD, TTL cached)	API live, UI route pending
Integration test suite (7 modules)	Live ✓ — first test coverage
Jinja2 replacement research (RE-476)	In research
Production-grade LLM architecture (RE-477)	In research
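The diversity metrics row notes that results are TTL cached. One way such caching could work is sketched below: a value is recomputed only after its time-to-live expires. `TTLCache` and the injected `clock` parameter are illustrative assumptions, not the project's actual implementation, and the metric computation itself is left as a caller-supplied function.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class TTLCache(Generic[T]):
    """Cache a single computed value for ttl_seconds (hypothetical helper)."""

    def __init__(
        self,
        compute: Callable[[], T],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._compute = compute
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._value: Optional[T] = None
        self._expires_at = float("-inf")  # forces a compute on first get()

    def get(self) -> T:
        now = self._clock()
        if now >= self._expires_at:
            # Entry is missing or stale: recompute and reset the expiry.
            self._value = self._compute()
            self._expires_at = now + self._ttl
        return self._value
```

Using a monotonic clock by default avoids stale or prematurely expired entries when the system wall clock is adjusted.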

Tech Stack — aligned with Metafore platform

Component	Role	Metafore ✓
React 19 + TypeScript ~6	Frontend SPA — 8 routes, dashboard, pipeline controls	✓
Vite 8 + Tailwind v4	Build tool + utility CSS — instant HMR, design tokens	✓
Radix UI + Shadcn components	Accessible, unstyled primitives — same pattern as Varnam	✓
TanStack Query v5	Server state, caching, background refetch for all API calls	✓
Lucide React + Framer Motion	Icons + animations — same icon set as Metafore Varnam	✓
FastAPI ≥0.115 + uvicorn	Async REST API — 23 endpoints, OpenAPI docs at /docs	✓
Pydantic v2	Request/response validation, settings — zero manual parsing	✓
PostgreSQL + SQLAlchemy 2	2 DB tables — full pipeline state, LLM batch tracking	✓
OpenAI SDK (GPT-4o)	Stage 3 AI synthesis — Jinja2-First + Intent-First modes	—
parsimonious (PEG parser)	Embedded MAL grammar validator — 5-stage, in-process	—
Jinja2 ≥3.1 + PyYAML	Stage 2 deterministic expansion — under research for replacement	~
Docker Compose	Local dev — API:8010, UI:8080, DB:5432	✓
Pytest + httpx	7 integration test modules — all mounted routes covered	—

✓ Matches Metafore platform stack | ~ In transition | — Project-specific

Contents

Part I — Current State — Everything is built and live. Research is next.

1. Executive Summary
2. Live System Dashboard
3. Next Steps & Research Agenda

Part II — Pipeline Deep Dive — Exactly how each stage works and which files run it.

6. Task Coverage & Construct Map
7. Pipeline Architecture
7b. Frontend Interface & Pages
8. Stage 2 Rationale & Transition

Part III — Research Direction — Two open questions: what replaces Jinja2, how to make LLM smarter.

11. Stage 2 Upgrade — Moving Beyond Templates
12. Making AI Synthesis Smarter — Research Agenda

Part IV — Engineering Reference — APIs, schemas, services, tests — for reference.

13b. API Quick Reference
14. Data Schemas & Services
15. Tech Stack — Why Each Tool

Part V — Roadmap & Glossary — Where we are headed and what terms mean.

18. Roadmap
22. Glossary
21. Changelog

Color Key: Confirmed (verified in working tree) | In Progress (active work or research) | Planned (approved, not yet started) | Research (active research question)
