Flow: How My Thinking on AI-Assisted Development Keeps Changing


I wrote about Flow a few weeks ago. The repo is now far enough from that post that I want to write it again.

The philosophy keeps changing as I use the thing. That’s the honest update — not “we shipped these features,” but “I was partly wrong about what the system is for, and the code now reflects a sharper read.”

The repo is at github.com/jyliang/flow.

What Flow still is

Flow is a skill system for Claude Code that moves work through stages, with a document at each boundary. Idea becomes spec becomes plan becomes changes becomes findings becomes PR. Five stages, five skills, one entry point: /flow.

The original framing — documents are the interface between human and AI, and between AI and AI — is still load-bearing. Edits to a spec directly change what gets planned. Edits to findings directly change what gets fixed. The file on disk is the API.

What I missed the first time is that documents alone aren’t enough. There’s a second primitive, and it took building Flow on real work to see it.

The thing I got wrong: prose isn’t the right shape for decisions

The original Flow let the AI ask questions in free-form prose. “Which approach do you prefer — A, B, or C?” Then I’d type a reply.

This was wrong in the way that’s only obvious in hindsight. Free-form prose is the right medium for status updates and summaries — anything where the AI is telling you something. It’s the wrong medium for decisions, because the user has to parse the choice, hold the options in their head, and produce a string that the AI then has to re-parse.

Flow now treats every user-facing decision as a structured AskUserQuestion call: 2–4 named options, one of them marked (Recommended), with a one-line rationale per option. Free-form prose is reserved for narration. If there’s a choice, it’s a structured question.

The difference is bigger than it sounds. Structured questions force the AI to have actually narrowed the decision space before bothering you. They make the user’s answer machine-readable, so the next stage doesn’t need to re-interpret intent. They surface the tradeoff up front instead of after a multi-turn negotiation. The interaction collapses from a conversation into a click.

So the interface is now both: documents for state, structured questions for decisions. Two primitives, both first-class.
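
A minimal sketch of the question shape described above, written as a Python data structure. This is not Flow's code or the real AskUserQuestion schema; the class and field names (`Option`, `Question`, `default`) are assumptions made for illustration. It only encodes the stated rules: 2–4 named options, exactly one recommended, one line of rationale each.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    rationale: str              # one line explaining the tradeoff
    recommended: bool = False


@dataclass(frozen=True)
class Question:
    """Hypothetical model of a structured decision (not the real tool schema)."""
    prompt: str
    options: tuple[Option, ...]

    def __post_init__(self) -> None:
        # Rules taken from the post: 2-4 options, one marked (Recommended).
        if not 2 <= len(self.options) <= 4:
            raise ValueError("a decision needs 2-4 options")
        if sum(o.recommended for o in self.options) != 1:
            raise ValueError("exactly one option must be recommended")
        for o in self.options:
            if not o.rationale.strip() or "\n" in o.rationale:
                raise ValueError(f"option {o.name!r} needs a one-line rationale")

    def default(self) -> Option:
        """Return the option marked as recommended."""
        return next(o for o in self.options if o.recommended)
```

Because a malformed question fails at construction time, the narrowing has to happen before the user is asked anything, which is the property the section argues for.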

Workstream folders, and why revisions create new files

Each piece of work now lives in agent/workstreams/<YYYY-MM-DD>-<branch>/, with stage-numbered files inside:

agent/workstreams/2026-04-22-add-oauth/
  01-spec-r1.md
  02-plan-r1.md
  03-review-r1.md

The folder is 1:1 with the branch. After merge, the folder stays put — the spec’s frontmatter records the PR number, and /flow-reflect later reads across all shipped folders to look for cross-PR patterns.

The piece of this convention I care about most is the -rN revision suffix. When work deviates from an earlier document — implementation discovers the spec was wrong, review surfaces a missed step in the plan — Flow doesn’t edit the existing file. It creates 01-spec-r2.md. The previous -r1 is frozen forever; the new file’s ## Revisions section explains what changed and why:

## Revisions
- **implement → spec** 2026-04-16: Changed auth from JWT to session cookies
  **Why**: Existing middleware only supports sessions. Rewriting is out of scope.
  **Impact**: Plan steps 3-5 updated. No JWT dependency needed.

Editing in place was tempting because it kept the workspace tidy. But it threw away the most valuable artifact: the trail of decisions. Six months from now, when someone asks “why does the auth code look like this when the original spec said JWT?” — the answer is in 01-spec-r1.md, frozen at the moment of the call, with 01-spec-r2.md capturing exactly when and why we turned. That history is worth more than the cleanliness.
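
The never-edit-in-place rule comes down to "find the highest -rN for this stage and write N+1". Here is a sketch of that lookup, assuming the file naming shown above; the function name `next_revision` and the regex are illustrative, not taken from the Flow repo.

```python
import re
from pathlib import Path

# Matches names like 01-spec-r1.md (pattern inferred from the examples above).
REVISION = re.compile(r"^(?P<stem>\d{2}-[a-z]+)-r(?P<n>\d+)\.md$")


def next_revision(workstream: Path, stem: str) -> Path:
    """Return the path for the next -rN file of a stage document, e.g. '01-spec'.

    Existing revisions are never touched; the caller writes a new file.
    """
    revisions = [
        int(m.group("n"))
        for p in workstream.iterdir()
        if (m := REVISION.match(p.name)) and m.group("stem") == stem
    ]
    return workstream / f"{stem}-r{max(revisions, default=0) + 1}.md"
```

With `01-spec-r1.md` already in the folder, `next_revision(folder, "01-spec")` points at `01-spec-r2.md`, and the frozen r1 stays on disk as the record of the original call.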

Two reviews — keep them distinct

A subtle thing that took me time to articulate: there are two completely different “review” steps, and conflating them was making the system worse.

LLM review runs inside the pipeline. It’s bounded — one round plus one auto-fix pass, that’s the contract. It reads the diff, classifies findings into mechanical (auto-fixable), critical (auto-fixable with caveats), and judgment-required (surface to the human). It self-verifies every finding before surfacing it. The mechanical stuff gets fixed before you see anything.
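
One way to picture the routing, as a hedged sketch: the three categories come from the paragraph above, while the `Finding` type, the `verified` flag, and the choice to drop unverified findings are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    MECHANICAL = "mechanical"          # auto-fixable
    CRITICAL = "critical"              # auto-fixable with caveats
    JUDGMENT = "judgment-required"     # surface to the human


@dataclass(frozen=True)
class Finding:
    summary: str
    kind: Kind
    verified: bool   # assumed flag: did the reviewer re-check it against the diff?


def route(findings: list[Finding]) -> tuple[list[Finding], list[Finding]]:
    """Split verified findings into (auto_fix, surface_to_human).

    Assumption: findings that fail self-verification are dropped entirely.
    """
    verified = [f for f in findings if f.verified]
    auto_fix = [f for f in verified if f.kind in (Kind.MECHANICAL, Kind.CRITICAL)]
    surface = [f for f in verified if f.kind is Kind.JUDGMENT]
    return auto_fix, surface
```

Only the second list ever reaches the human; the first is handled inside the single auto-fix pass.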

Human review happens on the PR after the pipeline completes. Unbounded. Your call. Outside the LLM’s loop entirely.

Flow now refuses to use the bare word “review” anywhere in its skills — it’s always qualified. This sounds like a pedantic naming choice. It’s not. Every time I let “review” be ambiguous, the system would either over-engage the human (running LLM-review feedback past you for approval, when the contract was for the LLM to just fix it) or under-engage (treating LLM-review as the final word). Naming the two things separately fixed both.

Spike mode: the unattended pipeline

The biggest single addition since the last post is /flow-spike. This is the part of Flow I wasn’t sure I needed when I started, and now use constantly.

Spike mode runs the entire pipeline (explore, plan, implement, one round of LLM review, draft PR) without interrupting you. Wherever the normal pipeline would call AskUserQuestion, spike applies a fixed decision policy instead: pick the (Recommended) option, log the choice to a spike-log.md audit trail, and keep moving. The single human touchpoint is the draft PR at the end.
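
A sketch of that decision policy, under assumptions: the function name, the `(name, recommended)` pair representation, and the log line format are all hypothetical. The post specifies only that the recommended option is picked and every choice lands in spike-log.md.

```python
from pathlib import Path


def decide_unattended(prompt: str, options: list[tuple[str, bool]], log: Path) -> str:
    """Pick the recommended option and append the choice to the audit log.

    `options` is a list of (name, recommended) pairs; exactly one must be
    recommended. The log line format below is illustrative only.
    """
    recommended = [name for name, is_recommended in options if is_recommended]
    if len(recommended) != 1:
        raise ValueError("expected exactly one recommended option")
    choice = recommended[0]
    with log.open("a", encoding="utf-8") as fh:
        fh.write(f"- {prompt} -> {choice} (recommended; chosen unattended)\n")
    return choice
```

Appending rather than overwriting keeps the log a complete trail of every choice made on the human's behalf.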

You invoke it with a thesis: /flow-spike "session storage in localStorage is faster than IndexedDB for entries under 100KB". Spike turns that into a branch, scaffolds a workstream, plans the validation, builds it, reviews it, and opens a draft PR with a structured human-review package. You walk away. You come back to something testable.

The thing that makes spike actually useful — and not just a faster way to generate plausible-looking work — is the adversarial read. Before LLM review writes its conclusions, the reviewer must internally answer four questions: what’s the strongest evidence for the thesis, what’s the strongest evidence against, what tests would have falsified the thesis, and what would a skeptical reviewer push back on.

If the LLM can’t name strong evidence against, the PR body says so explicitly: “I could not find evidence against this thesis, which may mean the spike didn’t probe hard enough. Human reviewer should challenge this.” That admission is more useful than false confidence. The failure mode for autonomous LLM work isn’t producing nothing; it’s producing something that looks competent and is quietly wrong. The adversarial read is the guardrail that catches that second case.
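
The adversarial read can be sketched as a small record with one rule attached: an empty "evidence against" list must produce the explicit admission rather than silence. The four fields mirror the four questions above; the class name, method name, and section layout are assumptions.

```python
from dataclasses import dataclass, field

# Wording quoted from the post.
NO_COUNTER_EVIDENCE = (
    "I could not find evidence against this thesis, which may mean the spike "
    "didn't probe hard enough. Human reviewer should challenge this."
)


@dataclass
class AdversarialRead:
    """Hypothetical container for the four questions the reviewer answers."""
    evidence_for: list[str] = field(default_factory=list)
    evidence_against: list[str] = field(default_factory=list)
    falsifying_tests: list[str] = field(default_factory=list)
    skeptic_pushback: list[str] = field(default_factory=list)

    def pr_section(self) -> str:
        """Render the evidence summary; admit it when nothing argues against."""
        lines = ["Evidence for:"] + [f"- {e}" for e in self.evidence_for]
        if self.evidence_against:
            lines += ["Evidence against:"] + [f"- {e}" for e in self.evidence_against]
        else:
            lines.append(NO_COUNTER_EVIDENCE)
        return "\n".join(lines)
```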

Spike also produces a 3–5 question quiz at the top of the PR body: thesis-oriented diagnostic questions for the human reviewer, grounded in actual evidence from the run. The quiz is how spike hands you back to your judgment. Not “look it over” — “here are the specific things I’m uncertain about; check these.”

Reflection: the system learns from its own archive

The other major addition is reflection, on two axes.

The first runs at the end of every ship. If the LLM has stated the same non-obvious fact about the project twice in the session (“migrations live in db/migrations/*.sql,” “this repo uses make install, not npm install”) and that fact isn’t in CLAUDE.md, you get a single structured prompt asking whether to persist it. Hard cap of three candidates per ship; silent if nothing qualifies.
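
The selection rule can be sketched in a few lines. Two simplifications are assumed here: repeated facts arrive as identical strings, and "already in CLAUDE.md" is a plain substring check. Neither is claimed to be how Flow does it, and `persist_candidates` is a hypothetical name.

```python
from collections import Counter


def persist_candidates(stated_facts: list[str], claude_md: str, cap: int = 3) -> list[str]:
    """Return facts stated at least twice that CLAUDE.md does not already contain.

    Assumes facts are pre-normalized so that repeats compare equal.
    An empty result means the ship stays silent.
    """
    counts = Counter(stated_facts)  # keeps first-seen order
    candidates = [
        fact for fact, n in counts.items()
        if n >= 2 and fact not in claude_md
    ]
    return candidates[:cap]  # hard cap of three per ship
```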

The second is /flow-reflect, an explicit command that scans across all shipped workstreams and looks for cross-PR patterns. The same suggestion appearing in three different review docs. The same decision deferred-to-v2 in five specs. A stage consistently being skipped. Each pattern becomes a structured proposal: update CLAUDE.md, edit .flow/config.sh, tweak a skill file. Every proposal goes through AskUserQuestion; nothing lands silently.
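
As a rough sketch of the first pattern (the same suggestion showing up in three different review docs), assume the suggestions have already been extracted from each workstream's review file. The function name, the input shape, and the lowercase normalization are illustrative choices, not Flow's implementation.

```python
from collections import defaultdict


def recurring_suggestions(
    reviews: dict[str, list[str]], threshold: int = 3
) -> dict[str, list[str]]:
    """Map each suggestion to the workstreams where it appears.

    `reviews` maps a workstream folder name to the suggestions found in its
    review doc. Only suggestions seen in `threshold` or more distinct
    workstreams are kept.
    """
    seen: dict[str, set[str]] = defaultdict(set)
    for workstream, suggestions in reviews.items():
        for suggestion in suggestions:
            seen[suggestion.strip().lower()].add(workstream)
    return {s: sorted(ws) for s, ws in seen.items() if len(ws) >= threshold}
```

Each key in the result would then become one structured proposal, with the listed workstreams as its evidence.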

The reason both reflection axes matter is that the value of any workflow system compounds with use. The more you run it, the more it knows about your specific project. But that compounding only happens if there’s an explicit mechanism for capturing what’s learned. Without one, the system stays static and you accumulate frustration.

teach is still in the toolkit — quick-capture a rule, codify a recurring pattern as a skill. Reflection is teach’s longitudinal cousin: not “I noticed this in the moment,” but “I noticed this across the archive.”

Per-project config, scripted setup

A small but real change: Flow now reads .flow/config.sh at the repo root for per-project overrides. Templates, stage list, hooks. First run of /flow in an unconfigured project fires a 3-question scripted setup; defaults are frictionless.

Precedence: environment variable beats config file beats built-in default. The minimal config is one line; the full schema lives in skills/flow/references/config.md.
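
The precedence rule is simple enough to state in code. This sketch assumes the shell config has already been parsed into a mapping; the `resolve` helper and the key name in the usage note are hypothetical, and the real schema is whatever skills/flow/references/config.md defines.

```python
import os
from typing import Mapping, Optional


def resolve(
    key: str,
    config: Mapping[str, str],
    defaults: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Resolve a setting: environment variable > config file > built-in default."""
    env = os.environ if env is None else env
    if key in env:
        return env[key]
    if key in config:
        return config[key]
    return defaults[key]  # raises KeyError for settings with no default
```

For example, with a hypothetical key `FLOW_SPEC_TEMPLATE` set in both the config file and the environment, the environment value is the one returned.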

This exists because I kept wanting to say “this project uses a different spec template” or “skip the explicit-plan stage in this repo, the work is too small to warrant it,” and there was nowhere to put that knowledge. .flow/config.sh is that place.

Document depth still scales with task complexity

This was true the first time and is still true: a one-line bug fix produces a three-line spec and skips most ceremony. A complex feature gets the full treatment. The structure is constant; the depth is proportional.

I bring it up because every reaction I’ve heard to Flow that started with “this seems heavy” turned out to be a reaction to imagining the heavy case. In practice, most flows are short, and the system is fast for the simple cases by design. The cost of the ceremony has to stay lower than the value it delivers, or people stop using it. That hasn’t changed.

The slash commands, briefly

| Command | What it does |
| --- | --- |
| /flow | Single entry point. Detects current stage and advances. |
| /flow-adopt | Pull the current conversation into a workstream. For when you’re mid-chat and realize this should be a flow. |
| /flow-config | Configure (or reconfigure) .flow/config.sh. |
| /flow-reflect | Scan shipped workstreams for cross-PR patterns. |
| /flow-spike "<thesis>" | Unattended spike → draft PR. |

/flow-adopt is the one I underestimated. The most common way I start work isn’t “I have a clean idea, let me run /flow” — it’s “I’m three messages deep into a conversation and I now realize this needs a workstream.” Adopt distills the conversation into 01-spec-r1.md and advances. It turns mid-conversation realization into structured work without losing the context I just built.

Install

Same as before. Claude Code plugin:

/plugin marketplace add jyliang/flow
/plugin install flow

Or via npx skills, which works across Claude Code, Cursor, Codex, Copilot, and others:

npx skills add jyliang/flow -g -a claude-code

What I think I’m actually building

The thing I keep coming back to is that working with AI coding tools well is mostly a question of where the human spends their attention. The unstructured default — chat back and forth, eyeball the diff, hope nothing snuck in — burns attention on stuff the AI should have just decided, and starves the parts that genuinely need judgment.

Flow is an opinionated answer to where attention should go. Documents at stage boundaries, so you can review state instead of re-deriving it. Structured questions instead of free-form prose, so decisions are clicks instead of conversations. LLM review that fixes mechanical issues silently and surfaces only judgment calls. Spike mode for thesis-validation work where you genuinely don’t want to be in the loop. Reflection so the system gets sharper from your archive instead of staying static.

I don’t think the current shape is the final shape. I expect to write this post again in a few months and find I was partly wrong about something else. That’s fine — the whole point is that the system evolves with you. The skills are markdown files; the config is a shell script; the archive is plain text on disk. Anything I learn, I can fold back in.

The code is open source. Issues and PRs welcome.