Technology — Alembic

The problem

Three approaches. Three dead ends.

Every team extracting data from documents lands on one of three paths. Each one works just well enough to seem promising — and just poorly enough to create real problems downstream.

Reading text, missing meaning

OCR tools convert pages to raw character strings — then throw away everything that made the document make sense. That table with merged cells, that handwritten annotation, that logo distinguishing an amendment from the original? Gone. You get text. You lose the document.

Brilliant, but unsupervised

Large language models are genuinely impressive at reading documents. The problem isn’t capability — it’s reliability. One prompt, one model, no validation, no memory. It works great on the demo. It hallucinates on page 47 of a real contract, and nobody catches it until the data’s already in your system.

Powerful, if you have six months

Enterprise platforms can handle complexity — after weeks of template configuration, months of training data, and a team dedicated to maintaining the rules. They’re built for organizations with dedicated ops staff and seven-figure volumes. For everyone else, the implementation cost outweighs the extraction value.

The approach

Three layers. One system that actually learns.

Alembic combines visual AI, orchestrated agents, and a learning engine into a single pipeline. Each layer solves a specific failure mode — and they compound.

Layer 1

The AI sees the page like you do

Most tools convert your PDF to plain text before the AI ever touches it. Alembic skips that lossy step entirely. Your documents go directly to the AI as visual input — the same way you’d hand a page to a smart colleague and say “pull out the key terms.”

Tables stay tables. Merged cells, nested headers, multi-page layouts — processed visually, not reconstructed from text fragments.
Handwriting and annotations included. Margin notes, stamps, signatures, and corrections are part of the input, not discarded noise.
Format-agnostic from day one. PDFs, scanned images, photos of paper — no special handling, no separate pipelines.
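
To make "visual input" concrete, here is a minimal sketch of packaging a page image, rather than extracted text, into a request body. The payload shape, the field names, and the `build_visual_request` helper are illustrative assumptions, not Alembic's actual API.

```python
import base64
import json


def build_visual_request(page_bytes: bytes, media_type: str, instruction: str) -> str:
    """Package a page image and a plain-language instruction as one JSON body.

    Hypothetical payload shape for illustration only.
    """
    # The raw image bytes travel with the request; no text-extraction step runs first.
    encoded = base64.b64encode(page_bytes).decode("ascii")
    payload = {
        "instruction": instruction,
        "page": {"media_type": media_type, "data": encoded},
    }
    return json.dumps(payload)


# Stand-in bytes; a real caller would read a scanned page or a photo from disk.
fake_page = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
body = build_visual_request(fake_page, "image/png", "Pull out the key terms.")
decoded = json.loads(body)
```

The point of the sketch is only that the page reaches the model as an image, so layout, handwriting, and stamps are still present in the input.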

Layer 2

Multiple AI agents, each with a job to do

A single AI model running a single prompt is a demo. Production extraction requires coordination — one agent to classify, another to extract, another to validate, and an orchestrator to manage the whole pipeline. Alembic assigns the right model to each task automatically. Fast models handle simple lookups. Powerful models handle the hard stuff.

Per-space orchestration. Each workspace gets its own pipeline, tuned to that document type. Your invoices don’t share a brain with your contracts.
Confidence-based routing. High confidence flows through. Low confidence gets flagged with reasoning — not a binary pass/fail, but an explanation.
Proactive monitoring. Each space gets its own AI agent that watches the pipeline, resolves routine issues, and surfaces decisions in a priority-coded standup.
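
Confidence-based routing can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `FieldResult` type, the `route` function, the 0.9 threshold, and the sample fields are all hypothetical, and a real pipeline would likely tune thresholds per workspace.

```python
from dataclasses import dataclass


@dataclass
class FieldResult:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported by the extracting agent
    reasoning: str     # why the agent is (un)sure; shown to the reviewer


def route(results, threshold=0.9):
    """Split results into auto-approved and flagged-for-review lists."""
    approved, flagged = [], []
    for result in results:
        if result.confidence >= threshold:
            approved.append(result)
        else:
            # Flagged items keep their reasoning, so the reviewer sees an
            # explanation rather than a bare pass/fail.
            flagged.append(result)
    return approved, flagged


results = [
    FieldResult("invoice_number", "INV-1042", 0.99, "Printed clearly in the header."),
    FieldResult("due_date", "2024-03-01", 0.62, "Handwritten note may override the printed date."),
]
approved, flagged = route(results)
```

Here `invoice_number` flows through, while `due_date` lands in the review queue along with the reason it was held back.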

Layer 3

Every correction makes the system permanently smarter

When you fix an extracted value, Alembic doesn’t just update the record. It creates a memory pattern — a persistent rule that ensures the same mistake never happens again. Patterns accumulate into a knowledge base specific to your documents, your terminology, your edge cases. The accuracy curve only goes up.

Zero training data required. Start extracting immediately. The system learns from production corrections, not pre-labeled datasets.
Schema built by conversation. Tell the AI what you need in plain language. It designs the schema, builds the pipeline, and tests it on your samples.
Institutional knowledge captured. When your best analyst retires, their corrections and judgment calls live on as patterns in the system.
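
One way to picture a memory pattern is as a stored correction that is replayed on later extractions. The sketch below is deliberately naive (an exact-match lookup table) and every name in it, including `PatternStore`, is hypothetical; the real learning engine is presumably richer than this.

```python
from dataclasses import dataclass, field


@dataclass
class PatternStore:
    # Maps (field name, extracted value) to the value a human corrected it to.
    patterns: dict = field(default_factory=dict)

    def record_correction(self, field_name: str, extracted: str, corrected: str) -> None:
        """Persist a human fix so it can be replayed on future documents."""
        self.patterns[(field_name, extracted)] = corrected

    def apply(self, record: dict) -> dict:
        """Return a copy of the record with every known correction applied."""
        return {
            name: self.patterns.get((name, value), value)
            for name, value in record.items()
        }


store = PatternStore()
store.record_correction("vendor", "Acme Inc", "Acme Incorporated")

# A later document with the same raw value picks up the fix automatically.
fixed = store.apply({"vendor": "Acme Inc", "total": "100.00"})
```

Fields with no recorded pattern pass through unchanged, so the store only ever affects values a person has already corrected once.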

The result

Extraction that makes decisions, not just data dumps

The point was never the extraction itself. It was always about what comes next — the approval, the payment, the flag, the decision. Alembic closes the loop between “data extracted” and “action taken.”

Automatic approvals on high-confidence extractions. When the system is certain, documents flow through without human touch — straight into your database, your ERP, your workflow.

Ambiguity surfaced with context, not just flagged. When something needs a human eye, Alembic shows you exactly which field, exactly why, and what the competing interpretations are. Decide in seconds, not minutes.

Full API for headless operation. 32 REST endpoints, webhooks, server-sent events (SSE) streaming, and batch operations. Run your entire pipeline without ever opening the UI, or build your own interface on top.
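
For teams building on the streaming side, the sketch below parses a server-sent events stream using the standard `text/event-stream` framing (`event:` and `data:` lines, a blank line to end each event, lines starting with `:` as comments). The event names and JSON bodies in the sample are invented for illustration; the actual event schema is not documented here.

```python
import json


def parse_sse(stream_text: str):
    """Parse text/event-stream content into a list of {'event', 'data'} dicts."""
    events = []
    event_type, data_lines = "message", []  # "message" is the SSE default type
    for line in stream_text.splitlines():
        if line == "":
            # A blank line ends the current event; dispatch it if it has data.
            if data_lines:
                events.append({"event": event_type, "data": "\n".join(data_lines)})
            event_type, data_lines = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, commonly used as a keep-alive
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # the format strips one leading space
        if name == "event":
            event_type = value
        elif name == "data":
            data_lines.append(value)
    return events


# Hypothetical stream contents, for illustration only.
sample = (
    'event: field_extracted\ndata: {"field": "total", "confidence": 0.97}\n\n'
    ": keep-alive\n\n"
    'event: document_done\ndata: {"status": "ok"}\n\n'
)
events = parse_sse(sample)
first_payload = json.loads(events[0]["data"])
```

In practice an HTTP client would feed this parser incrementally from a long-lived response; the string-based version keeps the framing rules easy to see.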

Cost-intelligent by default. The orchestrator matches model complexity to field difficulty. Speed where it’s easy, depth where it matters, without paying for overkill on every field.
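
The cost argument is easy to see with toy numbers. In this sketch the tier names, difficulty scores, and relative costs are all made-up values chosen for illustration, and `pick_tier` is a hypothetical helper, not the orchestrator's real selection logic.

```python
# (max difficulty handled, tier name, relative cost per field); illustrative values
MODEL_TIERS = [
    (0.3, "fast", 1),
    (0.7, "standard", 5),
    (1.0, "deep", 20),
]


def pick_tier(difficulty: float):
    """Return the cheapest tier whose difficulty ceiling covers the field."""
    for max_difficulty, name, cost in MODEL_TIERS:
        if difficulty <= max_difficulty:
            return name, cost
    # Anything above the last ceiling falls back to the most capable tier.
    return MODEL_TIERS[-1][1], MODEL_TIERS[-1][2]


# Hypothetical fields with assumed difficulty scores.
fields = {"invoice_number": 0.1, "total": 0.2, "indemnification_clause": 0.9}
plan = {name: pick_tier(difficulty) for name, difficulty in fields.items()}

routed_cost = sum(cost for _, cost in plan.values())
flat_cost = MODEL_TIERS[-1][2] * len(fields)  # every field sent to the top tier
```

With these assumed numbers, routing costs 22 units against 60 for sending every field to the most expensive tier, and only the hard clause pays the premium.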

Your documents deserve better than OCR