We tested every major AI code reviewer on a clean PR with zero design system issues. Every single one invented problems that didn't exist.
The test: a clean PR using var(--color-gray-50) correctly. Zero real issues.

What the LLM reviewers reported:
- Hardcoded color #f9fafb: "Use a design token instead"
- Hardcoded spacing 16px: "Replace with var(--spacing-md)"
- Non-standard border radius 6px: "Should use var(--radius-sm)"

What Buoy's scanner reported:
- No design system violations found. Scanner verified: all colors use tokens, spacing follows scale, components match patterns.
The LLM claimed #f9fafb existed at line 112.
The file was 61 lines long, and the actual code used var(--color-gray-50).
The LLM never decides what's wrong. It only explains what the scanner already found.
Deterministic analysis. Exact file, exact line, exact value. No guessing, no probability.
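To make the contrast concrete, here is a minimal sketch of what a deterministic hardcoded-color check can look like. This is an illustration, not Buoy's actual implementation: the regex, the Finding shape, and the function name are all assumptions. The point is structural: line numbers come straight from the source text, so a value that isn't in the file can never be reported.

```typescript
// Minimal sketch of a deterministic hardcoded-color check (illustrative,
// not Buoy's real scanner). Matches 3- or 6-digit hex literals.
const HEX_COLOR = /#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b/g;

interface Finding {
  line: number;   // 1-based, taken directly from the source text
  value: string;  // the exact literal that was matched
}

function scanHardcodedColors(source: string): Finding[] {
  const findings: Finding[] = [];
  source.split("\n").forEach((text, i) => {
    for (const match of text.matchAll(HEX_COLOR)) {
      findings.push({ line: i + 1, value: match[0] });
    }
  });
  return findings;
}

// A clean file that uses a token produces zero findings:
const clean = ".card { background: var(--color-gray-50); }";
console.log(scanHardcodedColors(clean).length); // 0

// A hardcoded value is reported with its exact line and literal:
const dirty = ".card {\n  background: #f9fafb;\n}";
console.log(scanHardcodedColors(dirty)); // [ { line: 2, value: '#f9fafb' } ]
```

Unlike an LLM, this check cannot invent a finding at "line 112" of a 61-line file: every reported line index is derived from the split source itself.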
The AI gets scanner output as context — not raw diffs. It explains why each issue matters and how to fix it.
Every suggestion is verified against your actual token definitions. Copy, paste, done.
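A sketch of what "verified against your actual token definitions" can mean in practice. The token names and values below are illustrative assumptions, not Buoy's real token source: the idea is that a replacement is only suggested when a token genuinely resolves to the hardcoded value, and otherwise nothing is said.

```typescript
// Hypothetical token table (illustrative values, not Buoy's internals).
const tokens: Record<string, string> = {
  "--color-gray-50": "#f9fafb",
  "--spacing-md": "16px",
  "--radius-sm": "6px",
};

// Suggest a var() replacement only if a token resolves to this value.
function suggestToken(hardcoded: string): string | null {
  for (const [name, value] of Object.entries(tokens)) {
    if (value.toLowerCase() === hardcoded.toLowerCase()) {
      return `var(${name})`;
    }
  }
  return null; // no matching token: say nothing rather than guess
}

console.log(suggestToken("#F9FAFB")); // var(--color-gray-50)
console.log(suggestToken("#123456")); // null
```

Returning null for unknown values is the design choice that matters: a lookup against real definitions can decline to answer, while a language model under pressure to be helpful will fill the gap with a guess.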
LLMs are great at explaining code. They're terrible at auditing it. Buoy uses each for what it's good at: the scanner audits, the LLM explains. The result is zero hallucinations and clear, actionable feedback.
It's not a bug — it's how they work.
When you ask an LLM to "review this PR for design system violations," it wants to be helpful. If the code is clean, it will find something anyway — because saying "nothing's wrong" doesn't feel like completing the task.
LLMs have seen millions of code reviews that found issues. They're pattern-matching against that history. A diff that "looks like" it might have a hardcoded color? The model will confidently assert it does.
An LLM reviewing a diff doesn't know your token definitions. It doesn't know that var(--color-gray-50) is valid in your system. It's guessing based on the diff alone.
LLMs generate text sequentially. Line numbers are just tokens to them — not verified positions. That's how one model referenced "line 112" in a 61-line file.
Correctness scores for reviews of the same clean PR, before and after Buoy's scanner became the source of truth.
GPT-4o scored 0/5 without Buoy — it claimed hex values like #f9fafb existed at specific lines when the actual code used var(--color-gray-50). With Buoy as the source of truth, it scored 5/5.
Buoy's optional AI layer uses Claude to generate natural-language explanations and context-aware suggestions. The difference is that our AI is constrained by deterministic scanner output.
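One way to picture that constraint, sketched below. The prompt wording and the VerifiedFinding shape are hypothetical, not Buoy's actual API: what matters is that the model only ever receives verified findings to explain, and that a clean PR short-circuits before any model call, so there is nothing to hallucinate about.

```typescript
// Hypothetical sketch of constraining an LLM with scanner output
// (prompt text and types are illustrative, not Buoy's real interface).
interface VerifiedFinding {
  file: string;
  line: number;
  value: string;
  suggestion: string;
}

// Build an explain-only prompt from verified findings.
// With zero findings, return null: no model call, no invented issues.
function buildExplainPrompt(findings: VerifiedFinding[]): string | null {
  if (findings.length === 0) return null;
  const items = findings
    .map(f => `- ${f.file}:${f.line} uses ${f.value}; replace with ${f.suggestion}`)
    .join("\n");
  return [
    "Explain why each verified issue below matters and how to apply the fix.",
    "Do not report any issue that is not in this list.",
    items,
  ].join("\n");
}

console.log(buildExplainPrompt([])); // null
```

The model's job shrinks from "find problems" (where it hallucinates) to "explain these specific, verified problems" (where it excels).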
Buoy is free, open source, and runs in under a minute. Your design system gets facts, not fiction.
npx ahoybuoy dock