Chaining AI Prompts Before Agents Existed
In early 2024, there was no MCP, no reasoning models, no mature agent frameworks, and no standardized way to build multi-step AI workflows. It was the wild west.
The idea came from painful, organic iteration inside Frontly. I’d been trying to get AI to generate full application structures - pages, components, data models, layouts - from a single prompt. The output was often terrible. So I broke it into two steps. Then three. Then six. And at some point I realized: if I let the user actually see and edit the output at each step before the next one runs, the quality of the final result improved dramatically - because each step was building on validated, human-approved output instead of compounding hallucinations.
I didn’t know the term “human-in-the-loop” at the time. I just knew that single-prompt generation produced garbage, multi-step was better, and letting users course-correct between steps was better still. That progression - from one prompt to many steps to human review at each step - happened slowly over months of shipping and watching what broke.
What I was building
The platform let users define structured JSON output schemas - either manually or by pasting in raw JSON that the system would parse into a reusable template. You’d then build multi-step flows where each step generated or transformed structured data, building on the output of the previous step.
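The "paste raw JSON, get a reusable template" idea can be sketched by recursively replacing concrete values with type names. This is an illustrative reconstruction, not the product's actual code; `infer_template` is a name I'm inventing here:

```python
import json

def infer_template(value):
    """Recursively reduce a concrete JSON value to a reusable type template."""
    if isinstance(value, dict):
        return {k: infer_template(v) for k, v in value.items()}
    if isinstance(value, list):
        # Assume homogeneous lists: template the first element only.
        return [infer_template(value[0])] if value else []
    if isinstance(value, bool):  # bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if value is None:
        return "null"
    return "string"

pasted = '{"title": "Dashboard", "widgets": [{"type": "chart", "width": 6}]}'
template = infer_template(json.loads(pasted))
# template == {"title": "string", "widgets": [{"type": "string", "width": "number"}]}
```

Once a template exists, every later step can be validated against it instead of trusted blindly.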
Each step would pause, display its intermediate output in an editable UI, and let the user review, fix, and approve before the next step ran. This wasn’t just a nice-to-have - it was half the reason the product existed. Building multi-step AI flows that need to pause, request user interaction, display intermediate progress, and allow real-time editing was genuinely hard in early 2024. It probably still is.
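The pause-review-approve loop maps naturally onto a generator: the runner yields each step's raw output, and the caller sends back the human-edited version before the next step builds on it. A minimal sketch with entirely hypothetical names:

```python
def run_flow(steps, generate):
    """Run steps sequentially, pausing after each one for human review.

    `steps` is a list of (name, prompt_builder) pairs; `generate` calls the
    model. Each yield is a pause point: the caller reviews the raw output
    and sends back the approved (possibly edited) version.
    """
    context = {}
    for name, build_prompt in steps:
        raw = generate(build_prompt(context))
        approved = yield name, raw           # pause here for human review
        context[name] = approved if approved is not None else raw

# Driving the flow; this stand-in "user" approves everything unchanged.
steps = [
    ("outline", lambda ctx: "Outline a page"),
    ("fields", lambda ctx: f"List fields for: {ctx['outline']}"),
]
flow = run_flow(steps, generate=lambda prompt: f"<output for: {prompt}>")
name, raw = next(flow)
while True:
    try:
        name, raw = flow.send(raw)  # send back edits; here, unchanged
    except StopIteration:
        break
```

The real product had to persist this paused state across HTTP requests and UI sessions, which is where most of the difficulty lived; the generator only captures the control flow.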
This was the right architecture for the moment. Before reasoning models, AI did dramatically better when you broke complex tasks into small, focused steps. And the user is the best judge of quality for their own output - so instead of trying to make the AI perfect in one shot, we let users course-correct at each stage.
The engineering was harder than it looked
This wasn’t just “chain some API calls together.” Each step in the flow had to stream its AI response in real-time, parse partial JSON as tokens arrived, handle broken or hallucinated JSON gracefully with automatic retries, and render intermediate results in a UI that a non-technical user could actually understand, edit, and approve - all while maintaining session state across steps.
Live streaming with partial JSON parsing. Users needed to see output as it generated, not wait for a complete response. That meant parsing incomplete JSON on every token, displaying what was valid so far, and gracefully handling the constant stream of temporarily invalid states. When the model hallucinated a broken structure mid-stream, the system had to detect it, retry, and resume without the user losing their work.
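One standard way to parse incomplete JSON on every token is to track open strings and brackets, speculatively append the missing closers, and attempt a parse; if the prefix still isn't valid, you wait for more tokens. A simplified sketch of that technique (not the original implementation):

```python
import json

def parse_partial_json(buffer):
    """Best-effort parse of an incomplete JSON stream prefix.

    Tracks in-string state and a stack of unclosed brackets, closes them
    speculatively, and tries json.loads. Returns None while the prefix is
    still unparseable (e.g. it ends mid-keyword like `tru`).
    """
    stack, in_string, escaped = [], False, False
    for ch in buffer:
        if escaped:
            escaped = False
        elif ch == "\\" and in_string:
            escaped = True
        elif ch == '"':
            in_string = not in_string
        elif not in_string and ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif not in_string and ch in "}]":
            if stack:
                stack.pop()
    candidate = buffer + ('"' if in_string else "") + "".join(reversed(stack))
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None  # temporarily invalid state: wait for more tokens
```

Running this on every token gives the UI a valid-so-far object to render, which is what makes streaming structured output watchable rather than a spinner.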
JSON consistency was a nightmare. In early 2024, there were no structured outputs and no response schemas, and function calling was new and unreliable. You sent a prompt that said “respond with valid JSON in this format” and you got back… something. Sometimes valid JSON. Sometimes JSON wrapped in markdown code fences. Sometimes a conversational response with JSON buried in the middle. I wrote parser after parser - strip code fences, find the first {, try to parse, fall back to regex extraction, try again. A five-step chain where each step needed valid JSON meant five opportunities for the whole thing to break.
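That parser cascade looks roughly like this - a reconstruction of the idea, not the original code:

```python
import json
import re

def extract_json(text):
    """Cascade of increasingly desperate attempts to pull JSON from a reply."""
    # 1. Maybe it's already clean JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # 2. Strip markdown code fences and parse what's inside.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fenced:
        try:
            return json.loads(fenced.group(1))
        except json.JSONDecodeError:
            pass
    # 3. Take everything from the first { to the last } and hope.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            pass
    return None  # caller retries the model call
```

A `None` at any stage meant an automatic retry of the model call - which is exactly why a five-step chain multiplied the failure surface.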
Hallucination broke step-to-step data bindings. The model would generate a component that referenced a field called customer_name, but a previous step had called it clientName. Every step in the chain could hallucinate details that made its output incompatible with the steps before or after it.
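One guard against this is to validate every field a step references against what earlier steps actually produced, and suggest the nearest real name when the model has likely renamed one. A sketch using Python's difflib - the helper name and cutoff are mine, not the product's:

```python
import difflib

def check_bindings(referenced, available):
    """Flag referenced fields that no earlier step produced.

    Returns {bad_field: closest_real_name_or_None}. The loose cutoff is
    deliberate: renames like clientName vs customer_name cross naming
    conventions, so strict similarity thresholds miss them.
    """
    problems = {}
    for field in referenced:
        if field not in available:
            close = difflib.get_close_matches(field, available, n=1, cutoff=0.5)
            problems[field] = close[0] if close else None
    return problems

# The model emitted `clientName`, but an earlier step called it `customer_name`:
mismatches = check_bindings(
    {"clientName", "email"},
    {"customer_name", "email", "order_id"},
)
```

Surfacing the suggested rename in the review UI turns a silent downstream breakage into a one-click fix for the user.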
The review UI was a design problem, not just an engineering problem. Each intermediate output needed to be displayed in a way that felt natural to edit - not raw JSON, but a structured form that mapped to the underlying data. The user had to be able to modify any field, see how their changes would affect downstream steps, and approve with confidence. Making AI-generated structured data editable by non-technical users, mid-pipeline, in real-time - that was genuinely hard.
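A common way to make nested output editable (a sketch of the general approach, not the product's implementation) is to flatten it into labeled path/value pairs that a form renderer can display and write back:

```python
def to_form_fields(data, prefix=""):
    """Flatten nested step output into (path, value) pairs for a form.

    Users then edit labeled fields like "page.title" instead of raw JSON,
    and each path tells you exactly where to write the edit back.
    """
    fields = []
    for key, value in data.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            fields.extend(to_form_fields(value, f"{path}."))
        elif isinstance(value, list):
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    fields.extend(to_form_fields(item, f"{path}[{i}]."))
                else:
                    fields.append((f"{path}[{i}]", item))
        else:
            fields.append((path, value))
    return fields

output = {"page": {"title": "Home", "widgets": [{"type": "chart"}]}}
fields = to_form_fields(output)
# fields == [("page.title", "Home"), ("page.widgets[0].type", "chart")]
```

The hard part the sketch skips is the second half of the sentence above: showing the user how an edit to one field ripples into downstream steps.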
The through-line
Everything I was doing by hand in early 2024 now has a name and a framework. Chained prompts became “agentic workflows.” Structured output became function calling and response schemas. The human review steps became “human-in-the-loop.” Context management between steps became “context engineering.”
This project eventually split into two separate things. The human-in-the-loop pattern - pausing AI workflows for human review - was a hard enough problem on its own that I pulled it out and built a standalone, platform-agnostic HITL product, which I launched in November 2024. Meanwhile, the multi-step AI generation architecture fed directly back into Frontly - when I rebuilt the AI app generator as a vibe coding platform in March 2025, the pipeline had a significant human-in-the-loop process baked in, informed by everything I’d learned building these chains a year earlier.
The core insight from this period wasn’t just that context quality matters - it was that AI does better with small, focused steps, and that humans are the best quality gate between those steps. That pattern keeps showing up in everything I build.