The dark factory
What happens when you try to build a software factory that runs with the lights off. Org theory, error math, and why the interesting problem was never the code.
I spent the last few months building a software factory that generates complete, production-ready applications without a human touching the code. Not a code assistant. Not a copilot. A factory. Lights off. Door closed. You feed it a product brief, and it hands you a deployed app.
I'm far from the only one trying. The pace right now is wild. Claude Opus 4.6 dropped, then GPT-5.3, then the next thing. Every week some new model or capability reshapes what agents can actually do. The race between Anthropic, OpenAI, Google, and the open-source crowd means the ceiling keeps moving upward while you're still building the floor. It's the most exciting time to be working on this stuff, and also the most disorienting, because the architecture you designed last month might be underpowered by next Tuesday.
The manufacturing world has a name for what I'm building. They call it a dark factory. A production line that runs without humans on the floor, without the lights on. Foxconn has been chasing this for years with their "lights-out" iPhone assembly lines. Dan Shapiro wrote a great piece mapping this to software: five levels from "spicy autocomplete" all the way to the fully autonomous software factory. Most developers plateau at Level 3, where you're reviewing AI-generated code like a manager. The real deflation, the part where small teams do extraordinary things, only kicks in at Levels 4 and 5. That's where I'm trying to get.
The interesting part isn't the automation itself. It's what breaks when you try.
The org chart problem
Here's what I didn't expect: the hardest part of building an AI software factory isn't the AI. It's the organizational design.
I started reading Ethan Mollick's work on management as an AI superpower, and it clicked. He frames it simply: the bottleneck for AI productivity isn't model capability. It's the traditional management skills. Problem-scoping. Deliverable definition. Work evaluation. The boring stuff that nobody wants to talk about at AI conferences. Then he posted a piece on LinkedIn arguing that agent systems need real organizational structure, not just more agents. That 100 subagents is too many for an orchestrator. That boundary objects beat raw text handoffs.
So I did my research. Late nights with Claude Code open, pulling up old org theory papers. Graicunas found in 1933 that a manager's coordination burden explodes beyond about 5 direct reports. I found the same threshold with AI agents: beyond 4 in a flat swarm, coordination overhead eats the productivity gains. The fix is the same one organizations discovered 90 years ago. Hierarchy. Not a single orchestrator managing everything, and not a flat swarm where everyone talks to everyone. Layers.
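Graicunas didn't just eyeball that threshold. He counted the relationships a manager has to track: direct ties to each report, plus the cross and group relationships among them. A quick sketch of his 1933 formula shows why the curve is so brutal:

```python
def graicunas_relationships(n: int) -> int:
    # Direct single (n), cross (n*(n-1)), and direct group relationships,
    # per Graicunas's 1933 formula: n * (2**(n-1) + n - 1).
    return n * (2 ** (n - 1) + n - 1)

for n in range(1, 9):
    print(f"{n} reports -> {graicunas_relationships(n)} relationships")
```

Three reports means 18 relationships. Five means 100. Eight means 1,080. Swap "reports" for "agents" and the curve explains the threshold.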
Deterministic bones, non-deterministic flesh
The factory I built runs an 8-stage pipeline (later compressed to 6). Each stage has a producer agent that generates artifacts and an evaluator agent that reviews them. The producer uses Claude; the evaluator is GPT-5.3 on xhigh reasoning, an OCD-fueled reviewer that nitpicks every field name and missing relationship. They argue until the output is good enough, up to 3 iterations.
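To make the shape of that loop concrete, here's a minimal sketch. The `produce` and `evaluate` callables stand in for the actual agent calls, and the verdict format is an assumption, not the factory's real interface:

```python
MAX_ROUNDS = 3  # the factory caps producer/evaluator argument at 3 iterations

def run_stage(brief: str, produce, evaluate) -> dict:
    """Generator-critic loop: draft, review, feed the review back, repeat."""
    feedback = None
    for _ in range(MAX_ROUNDS):
        artifact = produce(brief, feedback)   # producer agent (Claude-backed)
        verdict = evaluate(artifact)          # evaluator agent (GPT-backed)
        if verdict["approved"]:
            return artifact
        feedback = verdict["issues"]          # the nitpicks shape the next draft
    raise RuntimeError("stage did not converge within the iteration budget")
```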
But here's the thing that took me embarrassingly long to figure out: you can't build reliable software on non-deterministic foundations alone.
LLMs are probabilistic. They satisfice (Herbert Simon's term, not mine). They find "good enough" answers under information constraints, not optimal ones. That's fine for writing a product brief analysis. It's catastrophic when your domain model has a broken foreign key reference that cascades through 5 downstream stages.
So the real architecture is a hybrid. Non-deterministic generation for the creative work (writing PRDs, designing screens, generating components) layered on top of deterministic validation at every boundary. JSON schema validation. Cross-stage semantic checks. Does every entity have at least one screen? Does every screen have content? Does every component have design tokens?
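A check like "does every entity have at least one screen?" is a few lines of plain code. A minimal sketch, assuming simplified artifact shapes:

```python
# Simplified artifact shapes; the real JSON documents carry far more fields.
domain_model = {"entities": [{"name": "User"}, {"name": "Invoice"}]}
screens = {"screens": [{"name": "InvoiceList", "entities": ["Invoice"]}]}

def entities_without_screens(model: dict, screens: dict) -> list[str]:
    """Deterministic boundary check: every entity must appear on a screen."""
    defined = {e["name"] for e in model["entities"]}
    shown = {name for s in screens["screens"] for name in s.get("entities", [])}
    return sorted(defined - shown)

missing = entities_without_screens(domain_model, screens)
if missing:
    raise ValueError(f"entities with no screen: {missing}")  # fails on 'User'
```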
The non-deterministic parts dream. The deterministic parts keep them honest.
Where non-determinism fights back
That hybrid works well for structured artifacts. Domain models, screen definitions, component specs. But there's one part of the pipeline where deterministic validation can't save you: design.
The factory has a designer agent. Its core concept is what I call genesis-based metamorphosis. The genesis DNA profile (those 5 dimensions that capture the product's personality) gets fed into the design stage, and the designer has to transform that abstract identity into concrete visual decisions. Color palettes, typography scales, spacing systems, illustration styles. The metamorphosis from "expressive vibe, low density, novice proficiency" into an actual design system that feels like something a human art director would produce.
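To make "metamorphosis" less abstract, here's a toy sketch. Only three of the five dimensions are named above, and the mapping rules here are hypothetical, not the designer agent's actual logic:

```python
# Hypothetical DNA slice: "vibe", "density", and "proficiency" are named
# in this piece; the real profile has 5 dimensions.
dna = {"vibe": "expressive", "density": "low", "proficiency": "novice"}

def design_tokens(dna: dict) -> dict:
    """Toy metamorphosis: abstract identity -> concrete visual decisions."""
    return {
        # Low density buys whitespace: a roomier base spacing unit.
        "spacing.base": {"low": 8, "medium": 6, "high": 4}[dna["density"]],
        # Expressive products get a steeper typographic scale.
        "type.scale": 1.333 if dna["vibe"] == "expressive" else 1.2,
        # Novice users get labeled actions instead of bare icons.
        "controls.labels": dna["proficiency"] == "novice",
    }
```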
On top of the designer sits an artifact creator, a routing layer that orchestrates multiple generation tools depending on what's needed. Icons get generated as SVGs by Claude (code-based, tiny, supports currentColor). Illustrations go to Recraft for native vector output. Hero images route to FLUX. Photorealistic content goes to DALL-E. Logos with text go to Ideogram because it's the only provider that gets text rendering right (90%+ accuracy vs 40-60% from the others). Video and motion assets route to tools like Runway and Veo. The router analyzes each request against the genesis DNA and picks the right provider with an optimized prompt.
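The dispatch itself is the least mysterious part. A minimal sketch of the routing table, with illustrative provider keys, leaving out the per-provider prompt rewriting the real router does:

```python
# Dispatch table distilled from the routing rules above (keys are
# illustrative; the real router also consults the genesis DNA).
PROVIDER_BY_ASSET = {
    "icon": "claude-svg",          # code-based SVG, tiny, supports currentColor
    "illustration": "recraft",     # native vector output
    "hero_image": "flux",
    "photo": "dall-e",
    "logo_with_text": "ideogram",  # best text rendering of the providers tried
    "video": "runway",
    "motion": "veo",
}

def route_asset(asset_type: str) -> str:
    """Pick a generation provider for an asset request."""
    try:
        return PROVIDER_BY_ASSET[asset_type]
    except KeyError:
        raise ValueError(f"no provider for asset type {asset_type!r}") from None
```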
And this is where I'll be honest: it's the weakest link in the factory. The structured stages (domain modeling, screen architecture, component mapping) produce reliable, validatable output. The design stage produces output that's technically correct but often soulless. You can validate that a color palette has sufficient contrast ratios. You can't validate that it has personality.
This is the frontier where non-deterministic systems genuinely struggle. Generating "a design system" is easy. Generating a design system that feels like the same person made every decision, that has a point of view, that a real designer would look at and say "yes, that's intentional" rather than "that's generic"? That's a different problem entirely. And it's not a problem you can solve with more stages or better prompts. It might be the last thing the dark factory learns to do with the lights off.
Some math behind it
Ethan Mollick talks about a framework with three variables: human baseline time, probability of success, and AI process time. Simple enough. But when you chain stages together, the probability math gets ugly fast.
If each stage in your pipeline has 95% accuracy (which sounds great), and you have 8 stages, your end-to-end success rate is 0.95 to the 8th power. That's 66%. One in three runs fails somewhere. Drop to 90% per stage and you're at 43%. Less than a coin flip.
This is the error compounding problem, and it's the number one reliability killer in any multi-stage AI system. A 5% improvement at each stage doesn't give you 5% better output. It compounds. Going from 90% to 95% per stage across 8 stages takes your end-to-end from 43% to 66%. That's a 23 percentage point swing from a seemingly small per-stage improvement.
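The arithmetic fits in four lines:

```python
def end_to_end(per_stage: float, stages: int = 8) -> float:
    """A pipeline succeeds only if every stage does: rates multiply."""
    return per_stage ** stages

print(f"{end_to_end(0.95):.0%}")  # 66%
print(f"{end_to_end(0.90):.0%}")  # 43%
```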
I found a concrete example of this in my own pipeline. The evaluator agent was truncating its review at 5,000 characters. Domain models run 15-30KB. So the evaluator was approving artifacts based on the first third of the content, missing broken references and schema violations in the rest. A deterministic JSON schema validator would have caught every one of those bugs instantly and cheaply. I was using an LLM to do a computer's job.
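For comparison, here's what the cheap deterministic gate looks like, sketched with the Python `jsonschema` package and a toy slice of a schema:

```python
from jsonschema import validate, ValidationError  # pip install jsonschema

# Toy slice of a domain-model schema; the real one is much larger.
SCHEMA = {
    "type": "object",
    "required": ["entities"],
    "properties": {
        "entities": {
            "type": "array",
            "items": {"type": "object", "required": ["name", "fields"]},
        }
    },
}

domain_model = {"entities": [{"name": "User"}]}  # "fields" is missing

try:
    validate(instance=domain_model, schema=SCHEMA)
except ValidationError as err:
    # Caught in milliseconds, at any document size, no truncation.
    print(f"rejected at {list(err.absolute_path)}: {err.message}")
```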
The real product isn't code
The most counterintuitive thing I learned: the factory's real output isn't the application code. It's the intermediate artifacts.
Each stage produces structured documents. A genesis DNA profile that captures the product's personality across 5 dimensions. A domain model with explicit relationships (not abstract hasMany/belongsTo, but concrete foreign keys with cascade rules). Screen definitions. Content maps. Design tokens. Component specifications.
Star and Griesemer called these "boundary objects" in their 1989 paper. Artifacts plastic enough to adapt to local needs but robust enough to maintain identity across different groups. In my pipeline, the genesis-dna.json serves as a standardized form. The domain-model.json is a repository. The screens.json is an ideal type atlas. Each one translates intent from one stage to the next without requiring the stages to share context or agree on methodology.
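To show what "plastic but robust" means in practice, here's a hypothetical slice of the domain model's relationship format, with field names invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Relationship:
    """Hypothetical shape of one domain-model.json relationship: a concrete
    foreign key with an explicit cascade rule, not an abstract hasMany."""
    from_entity: str   # e.g. "Invoice"
    to_entity: str     # e.g. "User"
    foreign_key: str   # e.g. "invoices.user_id"
    on_delete: str     # "cascade" | "restrict" | "set_null"
```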
From what's publicly visible, none of the tools I've looked at (v0, Lovable, Bolt.new, Devin) produce structured boundary objects. I don't know their internals, but from the outside it looks like they pass raw prompts to code generators. That's the difference between a factory and a workshop.
What actually works
After months of iteration, here's what I know works and what doesn't.
Works: the generator-critic pattern. Having two different models argue over output quality produces 23% better factual accuracy than single-model generation. MetaGPT published this at ICLR 2024 with their "Code = SOP(Team)" framework, where agents communicate through structured documents, not dialogue.
Works: sequential pipeline with artifacts. Google recommends this explicitly for debugging clarity. When something breaks at stage 5, you can inspect the stage 4 output and find the root cause (see the sketch after this list). With a flat swarm, good luck tracing the failure.
Works: schema validation as a cheap quality gate. Costs almost nothing, catches the most common class of errors, and runs in milliseconds.
Doesn't work: LLM-based review of structured data. Use computers for computer work. Use LLMs for judgment calls.
Doesn't work: flat agent swarms beyond 3-4 agents. The coordination overhead kills you. Hierarchy isn't a bureaucratic relic. It's an information processing optimization.
Doesn't work: treating every run as equally complex. A simple CRUD app doesn't need the same pipeline as a real-time agentic system. Skipping unnecessary stages for simpler archetypes cuts cost and error surface.
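Here's the sketch promised above: a sequential pipeline that writes every artifact to disk before the next stage consumes it. Stage names and signatures are illustrative:

```python
import json
from pathlib import Path

def run_pipeline(stages, brief: dict, run_dir: Path) -> dict:
    """Each stage's output is persisted before the next stage runs, so a
    failure at stage 5 is debugged by reading the stage-4 artifact."""
    run_dir.mkdir(parents=True, exist_ok=True)
    artifact = brief
    for i, (name, stage_fn) in enumerate(stages, start=1):
        artifact = stage_fn(artifact)
        out = run_dir / f"{i:02d}-{name}.json"
        out.write_text(json.dumps(artifact, indent=2))
    return artifact
```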
The lights are still on
I'm not going to pretend the dark factory is running lights-off today. It's not. The visual review gate after the design stage still needs a human. The edge cases in complex domain models still trip up the evaluator. The code generation stages are still more fire-and-forget than I'd like.
But I've stopped thinking about this as an AI problem. It's an organizational design problem that happens to use AI as its workforce. The agents are the employees. The pipeline is the org chart. The structured artifacts are the memos and specifications that keep everyone aligned. And the management principles that make it work were published between 1933 and 1990.
The irony is thick. We're building the most advanced software systems in history, and the playbook that works best was written before computers existed.