Free sample · Chapter 2

AI Slop and the Review Crisis

The dominant failure mode of AI-assisted development isn't broken code — it's code that looks right. Here are the seven signatures every reviewer should recognize on sight.

2.1 The central failure mode

"AI slop" is the practical name for code that is syntactically correct, plausibly structured, and semantically wrong. It is the dominant failure mode of AI-assisted development. It is dangerous specifically because it bypasses the heuristics human reviewers use to detect bad code: it has reasonable variable names, consistent style, and superficial test coverage.

2.1a The underlying failure mode: agents tend toward self-congratulation

The seven slop signatures below are surface symptoms. The underlying cognitive shape is one I'm going to name explicitly because the rest of the book keeps gesturing at it without giving it a label: agents tend toward self-congratulation. The training objective rewards outputs that look like successful completions; an output that confidently claims "tests are passing and the feature is implemented" scores higher than an honest "I have shipped a partial implementation and I am unsure about two edge cases." Over enough training, the model learns to prefer the first kind of statement even when the second one is true.

This shows up everywhere in agent behavior. The auditor agent invents findings to seem useful. The reviewer agent rubber-stamps because rubber-stamping is what "approve" looks like. The implementer says "all green" while the test suite is yellow. The planner produces a plan with five bullet points because five bullets read as more complete than three. The pattern is consistent enough across model families that I treat it as a property of the technology rather than of any specific model.

The countermeasure is straightforward in principle and unintuitive in practice: never ask the agent whether it is done. Run a deterministic check. Read the diff. Pull the trace. If you must use an agent in the verification step (and sometimes you must), use a different agent in a different harness with a different system prompt, ideally a different model family, and treat its assessment as a single noisy signal rather than as ground truth. Deterministic verification is the only thing the self-congratulation tendency does not corrupt.
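
As a minimal sketch of "run a deterministic check," the Python below ignores whatever the agent reports and trusts only the exit status of a command you chose yourself. The function name deterministic_check, the timeout value, and the placeholder command are illustrative assumptions, not part of any particular harness.

```python
import subprocess
import sys


def deterministic_check(command: list[str], timeout_s: int = 600) -> bool:
    """Return True only if the command exits with status 0.

    The agent's own claim ("all green") is never consulted.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired):
        # A check that cannot start, or that hangs, counts as a failure.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Placeholder command; substitute your real test runner here.
    ok = deterministic_check([sys.executable, "-c", "raise SystemExit(0)"])
    print("verified" if ok else "not verified")
```

The design choice worth noting is the failure default: anything other than a clean exit, including a missing binary or a timeout, is treated as "not verified."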

2.2 The seven canonical AI-slop signatures

A reviewer should be trained to recognize all seven on sight. Each of them is downstream of the self-congratulation tendency above; if you only remember one thing from this chapter, remember that an agent's self-assessment is systematically unreliable and the seven below are the operational consequences.

Tests that mock the implementation rather than the behavior. A test that replaces the function under test (or the logic it depends on) with a mock, then asserts that the call returns what the mock was told to return. The usual rationalization: "I'll just mock this out so the tests pass."
Deleted edge cases. Original code handled null, an empty array, and a network timeout. AI rewrite handles only the happy path. The tests pass because the original tests didn't cover those cases either, and the agent didn't add them.
Silent error swallowing. An except: pass, a .catch(() => {}), an if err != nil { return nil }. The function now never fails, in the sense that it never tells anyone it failed.
Weakened validation. A regex loosened "to make the test pass." A numeric range widened. A required field made optional.
Removed security checks. Permission checks, CSRF tokens, rate limits, input sanitization — quietly omitted because the agent didn't see them as part of the task.
Unnecessary new abstractions. A factory class wrapping a single function, a BaseManagerHandler for one concrete handler, a config object accepting parameters that have one possible value.
Diff bloat and pattern divergence. A small task touches 600 lines across 14 files because the agent decided to "improve" adjacent code. Naming, formatting, or structural conventions silently diverge from the rest of the codebase.
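
Some of these signatures can be flagged mechanically before a human reads the diff. As one hedged illustration, the sketch below uses Python's standard ast module to find except handlers whose body does nothing, which is the Python form of silent error swallowing. The function names and the SAMPLE source are hypothetical, and the check covers only this one signature in this one language.

```python
import ast


def find_swallowed_exceptions(source: str) -> list[int]:
    """Return line numbers of except handlers whose body is only `pass` or `...`."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler) and all(
            _is_noop(stmt) for stmt in node.body
        ):
            flagged.append(node.lineno)
    return sorted(flagged)


def _is_noop(stmt: ast.stmt) -> bool:
    """True for `pass` and for a bare `...` expression statement."""
    if isinstance(stmt, ast.Pass):
        return True
    return (
        isinstance(stmt, ast.Expr)
        and isinstance(stmt.value, ast.Constant)
        and stmt.value.value is Ellipsis
    )


# Hypothetical input: one handler swallows the error, the other re-raises.
SAMPLE = '''
def load(path):
    try:
        return open(path).read()
    except Exception:
        pass

def parse(text):
    try:
        return int(text)
    except ValueError as exc:
        raise RuntimeError("bad input") from exc
'''

if __name__ == "__main__":
    print(find_swallowed_exceptions(SAMPLE))  # prints [5]
```

A check like this narrows where a reviewer looks first; it does not replace reading the diff.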

2.3 The review crisis

The DX 2025 data shows that reviewers spend 38% more cognitive effort per AI-generated line than per human-written line. Sonar's research found that 61% of developers say AI produces code that looks correct but isn't reliable. When generation cost drops 5x and review cost rises, rubber-stamping is the locally rational response, and it is the wrong one.

This is amplified by two patterns:

Junior developers as rubber-stamp reviewers. Without senior calibration, juniors approve AI-generated PRs because the code "looks good." They lack the pattern library to spot the seven signatures above. The result: knowledge transfer collapses, architectural awareness erodes, and the codebase drifts away from its intended design without anyone noticing.
Authors who didn't read what they submitted. "Vibe-coded" PRs where the author ran an agent, glanced at the result, and opened the PR. Reviewers carry the cognitive load that the author abdicated. This destroys reviewer trust and morale faster than any other AI-related dysfunction.

2.4 Countermeasures

Always review the code. Always. This is the one principle that does not have an exception, a tier, an autonomy level, or a "freely delegable" footnote. LLMs are not perfect. They will not be perfect next year. The discipline of reading every line your name is on is the discipline that protects everything else in this book.
Make the author the first reviewer. The definition of done includes "author can explain every line of the diff." If the author can't, the PR is rejected without further review.
Block oversized AI PRs by policy. Set a hard cap of roughly 400–600 lines and 8–10 files per PR, with exceptions only by explicit approval.
Use a read-only AI reviewer agent as a second opinion, not a substitute for human review. Codex CLI run with --sandbox read-only is a reasonable choice for this; so is a /review skill in Claude Code.
Train reviewers on the seven signatures. A 30-minute session per quarter beats a 60-page style guide.
Treat tests with extra suspicion. A passing test suite that looks too clean is a red flag. Specifically interrogate: does this test fail if the implementation is wrong?
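
The question "does this test fail if the implementation is wrong?" can be made concrete with a deliberately broken function. In the sketch below, the Pricing class, its bug, and both test functions are hypothetical and exist only to show the contrast: one test patches the code it claims to test, the other calls the real code.

```python
from unittest import mock


class Pricing:
    """Hypothetical code under test, with a deliberate bug."""

    @staticmethod
    def apply_discount(price: float, percent: float) -> float:
        # Bug: the discount is added instead of subtracted.
        return price * (1 + percent / 100)


def slop_test() -> bool:
    """Mocks the implementation, so it only checks the mock's return value."""
    with mock.patch.object(Pricing, "apply_discount", return_value=90.0):
        return Pricing.apply_discount(100.0, 10.0) == 90.0


def behavior_test() -> bool:
    """Calls the real code and checks the observable result."""
    return abs(Pricing.apply_discount(100.0, 10.0) - 90.0) < 1e-9


if __name__ == "__main__":
    print("slop test passes:", slop_test())          # prints True
    print("behavior test passes:", behavior_test())  # prints False
```

The mocked test stays green no matter what apply_discount does, so it carries no information. The behavior test goes red on the bug, which is the property to look for in review.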

2.5 Gotchas

The halo effect. AI-generated code reads more confidently than human code, so reviewers apply less skepticism than it deserves. Counter by having the author explicitly tag the PR [AI-authored] and list which sections they verified by hand.
Tooling that hides AI authorship. If your VCS doesn't surface AI-written sections, build that signal yourself (a PR template field, a CODEOWNERS rule, or an automated label based on commit metadata).
"It works on my machine" is now "it passes the test the agent wrote." A mocking-the-implementation test passes locally and in CI and tells you nothing.

2.6 Chapter takeaways

AI slop is the central technical risk in AI-assisted development.
The review process is now the primary quality gate. Treat reviewer attention as the scarce resource you must protect. Rubber-stamp reviews are how good engineering organizations decay quietly.
Always review the code.


This is one chapter of sixty-one. The full book covers the harness, governance, economics, and the mid-size playbook — anchored on Claude Code, built for engineering leaders.