Taming Code Generation: An Exploration in Scaling Myself
Gone are the days when your imagination was out of reach. Builders used to spend months typing away lines of code, inching towards their vision. We are living in a true golden age of software development. Code is cheap now. Your vision and agency are in the driver's seat.
In this moment, you have graduated from the orchestra. Now you are the composer.
I wrote recently about vision and agency in shipping books at home in two days. But this shift hasn't arrived without problems of its own: you now have to make sure your vision is actually executed the way you expect.
You. Are. The. Bottleneck. Now.
I feel the same way. I have more unreviewed implementations than unimplemented ideas. And this is my attempt to scale myself.
The Problem With Moving Fast
Here's what typically happens when you use LLMs without explicit specifications:
You ask Claude Code to build a feature. You describe what you want in the message. Claude builds it. Then you review, find ambiguities, ask for changes. Claude updates. Back and forth. Back and forth.
The real problem? Your intent only lives in that conversation. There's no single source of truth. No checklist. No way to verify "did the code actually implement what we agreed on?"
Reviews become expensive. You have to mentally reconstruct the spec from code, then check if the code matches. That's backwards.
But more importantly, this pattern doesn't scale. As you add more features, more decisions, more complexity, the conversation gets longer. The drift gets wider. Your thinking gets slower relative to the machine's speed.
You're not actually reviewing code. You're reverse-engineering intention.
The Shift: Specs Before Code
I started experimenting with a different approach. What if the spec came first? Not as documentation added after. As the actual blueprint, written before implementation, that drives everything—implementation, testing, and review.
Make specs precede code.
Not in theory. In practice. Every feature has a written spec before Claude Code touches it. Every spec has clear acceptance criteria. Every review checks the code against those criteria.
The shift is small. The impact is not.
Instead of: "Does this code implement the feature correctly?" (vague, requires context)
You ask: "Does this code satisfy acceptance criterion #1? #2? #3?" (objective, checklist-based)
Reviews go from two hours (reconstructing intent) to fifteen minutes (checking a list).
The System
I built a system around four skills, each serving a specific purpose in the workflow. I've open-sourced it at znck/spec-driven-development.
```
spec/
├── CLAUDE.md                      # Main instructions for spec-driven development
└── skills/
    ├── spec-feature-writer/
    │   └── SKILL.md               # Draft feature specifications
    ├── spec-decision-writer/
    │   └── SKILL.md               # Document architectural decisions
    ├── spec-e2e-test-generator/
    │   └── SKILL.md               # Generate E2E tests from acceptance criteria
    └── spec-maintenance/
        └── SKILL.md               # Update indices, activate features
```
The skills follow the agentskills.io format, which means they work as slash commands in Claude Code (e.g., /spec-feature-writer).
The CLAUDE.md file is the entry point. It tells Claude Code when to use spec-driven development (new user-facing functionality, behavior changes, architectural decisions, multi-component changes) and when to skip it (bug fixes, small tweaks, docs).
Then the instructions in CLAUDE.md set up the full project structure:
```
spec/
├── CLAUDE.md                      # Instructions for Claude Code
├── README.md                      # Product overview (you create this)
├── skills/
│   └── ...
├── features/
│   ├── README.md                  # Feature index (WIP, Active, Inactive)
│   └── YYYY-MM-DD-<feature-name>.md
└── decision-records/
    ├── README.md                  # Decision index (Active, Superseded)
    └── YYYY-MM-DD-<decision-name>.md
```
Four Skills, One Workflow
Skill 1: /spec-feature-writer
Before Claude Code builds anything, you use this skill. It asks clarifying questions about user value, workflows, and edge cases. Then it drafts a spec with four essential parts:
- User Value: Why does the user care? Not "implement voice interviewing", but "users practice interviews with realistic feedback, without needing a human partner"
- How It Works: User-centric description. What does the user experience?
- Key Interactions: Edge cases and behaviors. What happens when the user pauses? When connection drops?
- Acceptance Criteria: 5-8 testable requirements. These become your review checklist.
The acceptance criteria are the contract. The code either satisfies them or it doesn't.
Skill 2: /spec-decision-writer
Why did you make certain architectural choices? This skill documents them. It captures the question you were answering, the decision, the rationale, user impact, and what alternatives you rejected.
Decisions go to spec/decision-records/YYYY-MM-DD-<name>.md. This prevents repeating old arguments and helps anyone understand the "why" behind choices.
Skill 3: /spec-e2e-test-generator
Tests come from specs. Each acceptance criterion gets a corresponding test. The mapping is direct:
| Criterion | Test |
|---|---|
| [ ] User can [action] | test('user can [action]') |
| [ ] System [behavior] | test('system [behavior]') |
Test names match spec language. When a test fails, you immediately know which criterion wasn't satisfied.
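For illustration, here's roughly what a generated test might look like for a criterion such as "User can adjust timer from 5–90 minutes" (the example used later in this post). This is a sketch, not the skill's actual output; I'm assuming a Playwright setup, and the route and field label are hypothetical:

```ts
import { test, expect } from '@playwright/test';

// The test name quotes the acceptance criterion, so a failing test
// points straight at the criterion that wasn't satisfied.
test('user can adjust timer from 5-90 minutes', async ({ page }) => {
  await page.goto('/interview/setup');                // hypothetical route
  const timer = page.getByLabel('Interview length');  // hypothetical field label
  await timer.fill('90');
  await expect(timer).toHaveValue('90');
});
```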
Skill 4: /spec-maintenance
This skill keeps everything in sync. When a feature ships: update status from WIP to Active, check all acceptance criteria boxes, update the feature index, add to Recent Changes.
The maintenance skill enforces a key rule: only WIP specs can be edited. Active specs are frozen. To change behavior, you create a new spec. This prevents spec drift after implementation.
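To make that rule concrete, here's a small checker you could run yourself. It's not part of the repo; it's a sketch that assumes each spec carries a `Status:` line and GitHub-style checkboxes for acceptance criteria, and it flags any Active spec that still has unchecked boxes:

```ts
// check-specs.ts — hypothetical helper, not part of spec-driven-development.
// Assumes each spec has a "Status: WIP|Active|Inactive" line and uses
// "- [ ]" / "- [x]" checkboxes for acceptance criteria.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

const dir = "spec/features";

for (const file of readdirSync(dir).filter((f) => f.endsWith(".md"))) {
  const text = readFileSync(join(dir, file), "utf8");
  const isActive = /^Status:\s*Active/im.test(text);
  const unchecked = (text.match(/- \[ \]/g) ?? []).length;
  if (isActive && unchecked > 0) {
    console.warn(`${file}: Active but ${unchecked} acceptance criteria unchecked`);
  }
}
```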
How the Workflow Actually Works
The workflow has five phases:
1. Draft
Use /spec-feature-writer or /spec-decision-writer. Ask clarifying questions about user value, workflows, edge cases. Draft spec with acceptance criteria. Iterate based on feedback. Save to spec/features/ or spec/decision-records/.
2. Lock
Summarize acceptance criteria, confirm with user. This is the contract. From here, implementation must satisfy exactly what's written—nothing more, nothing less.
3. Implement
Claude Code reads the spec thoroughly, implements to satisfy acceptance criteria, and flags ambiguities immediately instead of guessing:
SPEC CLARIFICATION NEEDED
[Quote ambiguous part]
Interpretations: 1) ... 2) ...
Which is correct?
4. Test
Use /spec-e2e-test-generator. One test per criterion. Test names match spec language. Run tests to verify implementation.
5. Activate
When tests pass, confirm with user, then use /spec-maintenance to check acceptance criteria boxes, update status from WIP to Active, and update indices.
That's it. Specs drive code. Tests validate specs. Code passes tests.
The Three States
Specs have three states with strict editing rules:
| State | Editable | Notes |
|---|---|---|
| WIP | ✅ | Edit freely during development |
| Active | ❌ | Create new spec to change behavior |
| Inactive | ❌ | Can only append deprecation reason |
This is crucial. Once a spec goes Active, it's frozen. The code matches the spec. The tests validate the spec. Everything is aligned.
To change behavior, you don't edit the Active spec. You create a new WIP spec, go through the workflow again, and the old one gets deprecated if needed.
This prevents the drift that destroys most documentation systems.
The Review Process Changes Everything
This is where the real shift happens.
Instead of spending two hours reconstructing what should have happened, you spend fifteen minutes checking: does the code satisfy criterion #1? Yes. Criterion #2? Yes. Criterion #3? Yes. Ship it.
Here's a real example. A feature spec for "Voice-Based Interview Sessions" has acceptance criteria like:
- User can start speaking and system captures voice
- Speech is transcribed to text
- Claude generates a contextual response
- Response is played as audio
- Timer counts down
- User can pause and resume without losing state
- User can manually end interview
For each one, you ask: Does the code do this? Measurable. Objective. Done.
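In practice, the generated suite doubles as that checklist. Here's a sketch of what it could look like (again assuming Playwright; the routes and selectors are hypothetical, and most test bodies are elided):

```ts
import { test, expect } from '@playwright/test';

// One test per acceptance criterion; names quote the spec, so the
// test report reads like the review checklist.
test.describe('Voice-Based Interview Sessions', () => {
  test('user can pause and resume without losing state', async ({ page }) => {
    await page.goto('/interview/session');                    // hypothetical route
    await page.getByRole('button', { name: 'Pause' }).click();
    await page.getByRole('button', { name: 'Resume' }).click();
    await expect(page.getByText('Recording')).toBeVisible();  // hypothetical status text
  });

  test('user can manually end interview', async ({ page }) => {
    // ... remaining criteria follow the same pattern
  });
});
```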
And because Claude Code implemented to this spec, the answer is almost always yes. The code passes tests. The tests validate the criteria. Everything aligns.
Code reviews become about acceptance criteria, not style or efficiency. The spec is the arbiter.
What I'm Still Learning
I'm not claiming this is solved. I'm exploring it.
Some things I've discovered:
Specs scale better than code. When you have five features, the spec system saves time. When you have fifty features, it saves your sanity. The clarity compounds.
Acceptance criteria are the contract. They're not suggestions or aspirations. They're the definition of "done". Once all criteria pass, you're finished. This prevents scope creep.
The frozen Active state is powerful. When specs can't change after implementation, you're forced to think clearly before locking. And you never have documentation drift.
Flagging ambiguities beats guessing. When Claude Code says SPEC CLARIFICATION NEEDED instead of making assumptions, you catch misalignment early. This saves hours of rework.
The real bottleneck is intent clarity. Not execution speed. Not tool capability. How clearly you can think about what you want. That's what limits everything else.
The Deeper Pattern
I think what's really happening is this: the faster the machine gets at generating code, the more clarity matters.
Slow tools hide unclear thinking. When it takes a human two weeks to write code, small ambiguities don't matter. You figure it out as you go.
Fast tools expose unclear thinking. When Claude Code implements in minutes, ambiguity becomes immediately visible. You spend your time reconstructing intent instead of building features.
So the game changes. You're not competing with the machine on speed. You're competing with clarity of thought.
And the way you scale clarity is through specs. Through forcing yourself to think carefully before asking the machine to build.
Getting Started
If you want to try this, I've packaged everything into a repository you can drop into your project.
Install with one command:
```
curl -L https://github.com/znck/spec-driven-development/archive/main.tar.gz | tar -xz --strip-components=1 -C . spec-driven-development-main/spec
```

Or clone and copy manually:

```
git clone https://github.com/znck/spec-driven-development.git
cp -r spec-driven-development/spec ./spec
```

Once installed, Claude Code will automatically follow the spec-driven workflow when you request significant changes. But here's the full process if you want to understand it:
1. Pick a feature. Something you're about to build.
2. Use /spec-feature-writer. Let it ask clarifying questions. Draft the spec with user value, how it works, key interactions, and acceptance criteria.
3. Make sure each criterion is testable. "User can adjust timer from 5–90 minutes" is testable. "System is responsive" is not.
4. Lock the spec. Confirm acceptance criteria with yourself. This is the contract.
5. Implement. Claude Code reads the spec, implements to criteria, flags ambiguities.
6. Test. Generate tests from criteria using /spec-e2e-test-generator. Run them.
7. Activate. When tests pass, use /spec-maintenance to update status and indices.
The barrier to entry is low. The impact compounds over time.
In Closing
I started this exploration because I was drowning in code reviews. The machine moved faster than I could think.
What I've learned is that the real scaling isn't about moving faster. It's about thinking more clearly before the machine starts.
Specs do that. They force clarity. They make you write down "why" before "what". They turn vague intentions into testable contracts.
And they let you review in minutes instead of hours.
Is this a complete solution? No. I'm still exploring. I'm still learning what works and what doesn't.
But I'm confident about this: in a world where code generation is cheap and fast, the scarce resource is clear thinking. And specs are how you cultivate that.
That's what I'm attempting. That's what I'm learning.
I'd be curious to hear what you discover if you try it.