A client came to me with a matchmaking problem. On one side: companies that need employee training in specific areas such as compliance courses, technical certifications, and specialized skills. On the other side: companies that offer exactly those trainings, built to order. Both sides were still operating through phone calls and website browsing. No platform brought them together. And the training providers, despite facing more potential demand than they could handle, were competing against each other for the same inquiries.

The idea was straightforward in concept: scrape provider websites, use an LLM to extract structured training data (topics, dates, formats, capacity), run it through a human-in-the-loop quality process, and publish it to a searchable frontend. Before committing to architecture, team structure, and a real delivery timeline, we needed to know which parts were actually hard.

So we built a prototype. Five moving parts: tech scouting, database setup and schema definition, a frontend, a backend, and an LLM pipeline for processing unstructured input. Barely a week. Not perfect code. Not the best architectural decisions. No tailored UX. None of that was the point.

What the prototype actually found

A planning document would not have found any of these four things.

The LLM needed to score its own confidence. Without human review, the pipeline produced training events with wrong dates and incorrect details. The model was filling gaps it didn't have enough signal to fill reliably. But the LLM could flag its own uncertainty: low-confidence extractions could be held for review, high-confidence ones could publish automatically. That single insight changed the entire quality architecture. Instead of a binary human/no-human decision, we had a tiered system driven by the model's own assessment of what it knew.
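The tiered routing can be sketched roughly like this. The threshold values, field names, and tier labels here are illustrative assumptions, not the production ones:

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    provider: str
    confidence: float  # self-reported by the model, 0.0-1.0
    fields: dict

# Hypothetical thresholds; the real values came out of measuring the prototype.
AUTO_PUBLISH = 0.9
AUTO_REJECT = 0.4

def route(extraction: Extraction) -> str:
    """Tiered routing driven by the model's own confidence score."""
    if extraction.confidence >= AUTO_PUBLISH:
        return "publish"   # high confidence: goes live automatically
    if extraction.confidence < AUTO_REJECT:
        return "recrawl"   # too little signal: don't waste reviewer time
    return "review"        # borderline: queue for a human
```

The point is the shape, not the numbers: one scalar from the model replaces a binary human/no-human switch with a spectrum you can tune.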

Code-only extraction was not the answer. Provider websites come in every format imaginable: PDFs, oddly structured HTML, embedded tables, content split across multiple pages. A hard-coded extraction approach would have required handling hundreds of edge cases: months of work that would still be fragile. The LLM didn't just make extraction faster; it made it feasible at all. It could reason about structure rather than pattern-match against it. That is a qualitatively different capability, not just a productivity gain.
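Schema-guided extraction of that kind can be sketched as follows. `call_llm` is a stand-in for whatever model client the pipeline actually uses, and the schema fields are illustrative, not the real ones:

```python
import json

# Hypothetical target schema; the model fills it from arbitrary page layouts.
SCHEMA = {
    "topic": "string",
    "dates": "list of ISO dates",
    "format": "online | on-site | hybrid",
    "capacity": "integer or null",
    "confidence": "float 0-1, how certain the model is overall",
}

def extract(page_text: str, call_llm) -> dict:
    """Ask the model to map unstructured provider content onto the schema."""
    prompt = (
        "Extract the training offering below into JSON matching this schema. "
        "If a field is not clearly stated, use null and lower your confidence.\n"
        f"Schema: {json.dumps(SCHEMA)}\n\nPage:\n{page_text}"
    )
    return json.loads(call_llm(prompt))
```

One prompt handles PDFs-turned-text, odd HTML, and multi-page content alike, which is exactly what a hand-written parser per layout could not do.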

Human review introduced a speed tradeoff. Data quality improved, but publishing slowed. Newly crawled providers sat in a queue waiting for review. The confidence scoring became the mechanism for managing that tradeoff, routing only the uncertain cases to a human and letting the rest through automatically. That specific design only became visible once we could measure it against real data. It wasn't in the original plan because it couldn't be.
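One way to make that tradeoff measurable is to sweep candidate auto-publish thresholds over a hand-labeled sample of extractions. This is a sketch under assumed inputs (pairs of model confidence and whether the extraction was actually correct), not the evaluation we ran:

```python
def sweep(samples, thresholds):
    """For each candidate auto-publish threshold, report what share of
    extractions a human still reviews and how many wrong extractions
    would publish automatically. `samples` is a list of
    (confidence, is_correct) pairs from a labeled sample."""
    results = {}
    for t in thresholds:
        auto = [(c, ok) for c, ok in samples if c >= t]
        queued = len(samples) - len(auto)
        errors = sum(1 for _, ok in auto if not ok)
        results[t] = {
            "review_share": queued / len(samples),
            "auto_errors": errors,
        }
    return results
```

A low threshold empties the review queue but lets errors through; a high one restores the bottleneck. Only real labeled data tells you where the curve bends, which is why this design could not have appeared in the upfront plan.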

Infrastructure was more important than anticipated. Everything processed through the pipeline touched personal and organizational data, which meant DSGVO compliance wasn't optional. The prototype revealed that Infrastructure as Code wasn't a nice-to-have; it was a first-class architectural requirement. Managing compliance at scale through manual infrastructure would have been a liability. We found this in week one instead of month four.

The conversation that followed

After that week, we were no longer debating the concept. We were discussing specific problems with specific solutions. Not "how do we handle data quality?" but "what's the right confidence threshold for auto-publishing, and how do we surface borderline cases to reviewers without creating a bottleneck?" The prototype had converted hypotheses into problems. And problems are something you can actually solve.

Code had receded into the background. Not because the technical findings didn't matter. They shaped every subsequent architectural decision. But they were no longer the loudest thing in the room. The session had found the real problems: the quality architecture, the human workflow, the infrastructure requirements, the edge cases that would define the implementation. The conversations after the prototype were grounded in a way that no amount of upfront planning had achieved.

Where this connects

I started my career at frog and later worked at Intuity Media Lab. Fifteen years and several large platforms later (Siemens, Deutsche Telekom, Mercedes-Benz), I've watched a consistent pattern: the technology conversation crowds out the customer conversation. Framework debates. Architecture arguments. Tool choices that become the main event when they should be infrastructure. I've contributed to those conversations. They're an industry-wide habit.

What changes with AI is not the principle: rapid prototyping existed long before LLMs, and the philosophy of building to learn rather than planning to certainty is decades old. What changes is the cost model. The LLM replaced months of edge-case parser development with days of prompt iteration. The confidence scoring architecture emerged from the prototype because we had real data to score against. The DSGVO infrastructure requirement surfaced early enough to be designed for, not retrofitted. That is what "barely a week" means: not that building got faster in general, but that the specific parts that needed to work did work, and everything else could be rough.

When building is expensive, the argument about approach becomes the culture. You debate frameworks because you're going to live with the choice for a long time before you know whether it was right. When building gets cheap enough, you find out. The upfront conversation can be about the right things: the problem, the user, the outcome.

The plan for real implementation improved dramatically. Not because the prototype was good code. Because it had found the hard parts.

Code in the background. That's where it belongs. It was always supposed to be there.

Related: AI Changed How I Work, Not What I Do covers the broader shift from coding companion to autonomous builder. Where AI Actually Helps in Enterprise Architecture goes deeper on which integrations create real leverage versus expensive distractions.