codingwithaibook.com
Free sample · Chapter 2

AI Slop and the Review Crisis

The dominant failure mode of AI-assisted development isn't broken code — it's code that looks right. Here are the seven signatures every reviewer should recognize on sight.

2.1 The central failure mode

"AI slop" is the practical name for code that is syntactically correct, plausibly structured, and semantically wrong. It is the dominant failure mode of AI-assisted development. It is dangerous specifically because it bypasses the heuristics human reviewers use to detect bad code: it has reasonable variable names, consistent style, and superficial test coverage.

2.1a The underlying failure mode: agents tend toward self-congratulation

The seven slop signatures below are surface symptoms. The underlying cognitive shape is one I'm going to name explicitly because the rest of the book keeps gesturing at it without giving it a label: agents tend toward self-congratulation. The training objective rewards outputs that look like successful completions; an output that confidently claims "tests are passing and the feature is implemented" scores higher than an honest "I have shipped a partial implementation and I am unsure about two edge cases." Over enough training, the model learns to prefer the first kind of statement even when the second one is true.

This shows up everywhere in agent behavior. The auditor agent invents findings to seem useful. The reviewer agent rubber-stamps because rubber-stamping is what "approve" looks like. The implementer says "all green" while the test suite is yellow. The planner produces a plan with five bullet points because a five-bullet plan reads as more complete than a three-bullet one. The pattern is consistent enough across model families that I treat it as a property of the technology rather than of any specific model.

The countermeasure is straightforward in principle and unintuitive in practice: never ask the agent whether it is done. Run a deterministic check. Read the diff. Pull the trace. If you must use an agent in the verification step (and sometimes you must), use a different agent in a different harness with a different system prompt, ideally a different model family, and treat its assessment as a single noisy signal rather than as ground truth. Deterministic verification is the only thing the self-congratulation tendency does not corrupt.
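As a minimal sketch of what "run a deterministic check" can look like, the snippet below takes the verdict from a command's exit code and uses the agent's claim only to flag a contradiction. The names verify and Verdict are hypothetical and introduced here for illustration, and the check command is assumed to be whatever your project already treats as authoritative (a test runner, a type checker, a linter).

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Verdict:
    passed: bool        # decided only by the command's exit code
    exit_code: int
    contradicted: bool  # agent claimed success but the check failed


def verify(check_cmd: list[str], agent_claims_done: bool) -> Verdict:
    """Run a deterministic check and ignore the agent's self-assessment.

    The agent's claim never influences `passed`; it is only compared
    against the result so that a false "all green" is made visible.
    """
    result = subprocess.run(check_cmd, capture_output=True, text=True)
    passed = result.returncode == 0
    return Verdict(
        passed=passed,
        exit_code=result.returncode,
        contradicted=agent_claims_done and not passed,
    )


if __name__ == "__main__":
    # Stand-in for a real test command: a child process that exits non-zero.
    failing_check = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(verify(failing_check, agent_claims_done=True))
```

In practice check_cmd would be the project's own test or lint invocation. The design choice worth noting is that the agent's statement is an input to the report and never an input to the decision.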

2.2 The seven canonical AI-slop signatures

A reviewer should be trained to recognize all seven on sight. Each of them is downstream of the self-congratulation tendency above; if you only remember one thing from this chapter, remember that an agent's self-assessment is systematically unreliable and the seven below are the operational consequences.

  1. Tests that mock the implementation rather than the behavior. A test that imports the function under test, mocks out the logic it is supposed to exercise, and then asserts that the result equals what the mock returns. Common antipattern: "I'll just mock this out so the tests pass."
  2. Deleted edge cases. Original code handled null, an empty array, and a network timeout. AI rewrite handles only the happy path. The tests pass because the original tests didn't cover those cases either, and the agent didn't add them.
  3. Silent error swallowing. A try ... except: pass, a .catch(() => {}), an if err != nil { return nil }. The function now never fails, in the sense that it never tells anyone it failed.
  4. Weakened validation. A regex loosened "to make the test pass." A numeric range widened. A required field made optional.
  5. Removed security checks. Permission checks, CSRF tokens, rate limits, input sanitization — quietly omitted because the agent didn't see them as part of the task.
  6. Unnecessary new abstractions. A factory class wrapping a single function, a BaseManagerHandler for one concrete handler, a config object accepting parameters that have one possible value.
  7. Diff bloat and pattern divergence. A small task touches 600 lines across 14 files because the agent decided to "improve" adjacent code. Naming, formatting, or structural conventions silently diverge from the rest of the codebase.
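Some of these signatures can be caught mechanically rather than by eye. As an illustration for signature 3 only, here is a rough sketch that uses Python's standard ast module to flag except handlers whose body does nothing. The function name find_silent_handlers and the SAMPLE source are hypothetical, and the rule is deliberately narrow: it misses handlers that log and continue, and it knows nothing about the JavaScript or Go variants.

```python
import ast


def _is_noop(stmt: ast.stmt) -> bool:
    """True for `pass` or a bare constant expression such as `...`."""
    if isinstance(stmt, ast.Pass):
        return True
    return isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant)


def find_silent_handlers(source: str) -> list[int]:
    """Return line numbers of except handlers that swallow the error."""
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ExceptHandler) and all(
            _is_noop(stmt) for stmt in node.body
        ):
            hits.append(node.lineno)
    return sorted(hits)


# Hypothetical input: one handler swallows the error, one re-raises it.
SAMPLE = '''
def load(path):
    try:
        return open(path).read()
    except Exception:
        pass

def parse(text):
    try:
        return int(text)
    except ValueError as err:
        raise RuntimeError("bad input") from err
'''

if __name__ == "__main__":
    print(find_silent_handlers(SAMPLE))  # prints [5]
```

A check like this is a floor, not a ceiling: it gives the reviewer a deterministic signal for one signature so that attention can go to the ones that need judgment.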

2.3 The review crisis

The DX 2025 data shows reviewers spend 38% more cognitive effort per AI-generated line than per human-written line. Sonar's research found 61% of developers say AI produces code that looks correct but isn't reliable. When generation cost drops 5x and review cost rises, the rational response — but the wrong one — is to rubber-stamp.

This is amplified by two patterns:

2.4 Countermeasures

2.5 Gotchas

2.6 Chapter takeaways

This is one chapter of sixty-one. The full book covers the harness, governance, economics, and the mid-size playbook — anchored on Claude Code, built for engineering leaders.