Based on real testing across 4 leading LLMs

Your design system deserves
facts, not hallucinations.

We tested every major AI code reviewer on a clean PR with zero design system issues. Every single one invented problems that didn't exist.

Same PR. Same code. Very different reviews.

A clean PR using var(--color-gray-50) correctly. Zero real issues.

LLM-Only Review (ai-reviewer[bot])

[error] Card.tsx, line 112 (the file is 61 lines long)
Hardcoded color #f9fafb. Use a design token instead.

[error] Card.tsx, line 34
Hardcoded spacing 16px. Replace with var(--spacing-md).

[warn] Badge.tsx, line 7
Non-standard border radius 6px. Should use var(--radius-sm).

3 issues found. None are real.
Buoy Review (buoy[bot])

No design system violations found.

Scanner verified: all colors use tokens, spacing follows scale, components match patterns.

All checks passed. 0 false positives.

The LLM claimed #f9fafb existed at line 112. The file was 61 lines long, and the actual code used var(--color-gray-50).

4: leading LLMs tested (Claude & GPT families)
100%: hallucinated on clean PRs (every model, every time)
0%: hallucinations with Buoy (scanner-verified output)
5/5: correctness with Buoy (across all 4 models)

How Buoy eliminates hallucinations

The LLM never decides what's wrong. It only explains what the scanner already found.

1. Scanner finds real issues
Deterministic analysis. Exact file, exact line, exact value. No guessing, no probability. (100% deterministic)

2. LLM explains findings
The AI gets scanner output as context, not raw diffs. It explains why each issue matters and how to fix it. (Constrained by facts)

3. Developer gets one-click fixes
Every suggestion is verified against your actual token definitions. Copy, paste, done. (Verified replacements)
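The three-step pipeline above can be sketched in a few lines. This is a minimal illustration only: the names here (Finding, scanFile, the token map shape) are assumptions for the sketch, not Buoy's actual API.

```typescript
// Hypothetical sketch of a scanner-first review pipeline.
interface Finding {
  file: string;
  line: number;   // a verified position in the real file, not a generated token
  value: string;  // the exact offending value found in source
  token: string;  // a replacement verified against real token definitions
}

// Step 1: deterministic scan. Exact file, exact line, exact value.
function scanFile(
  file: string,
  source: string,
  tokens: Record<string, string> // e.g. { "--color-gray-50": "#f9fafb" }
): Finding[] {
  const findings: Finding[] = [];
  source.split("\n").forEach((text, i) => {
    const hex = text.match(/#[0-9a-fA-F]{6}/);
    if (hex) {
      // Only report when a verified token replacement actually exists.
      const token = Object.keys(tokens).find(
        (t) => tokens[t] === hex[0].toLowerCase()
      );
      if (token) {
        findings.push({ file, line: i + 1, value: hex[0], token });
      }
    }
  });
  return findings;
}

// Step 2: the LLM would receive `findings` as its only context, never the raw diff.
// Step 3: each finding already carries a verified replacement token.
```

On a clean file that uses var(--color-gray-50), this scan returns an empty list, so the LLM has nothing to invent.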
🛟 The key insight

LLMs are great at explaining code. They're terrible at auditing it. Buoy uses each for what it's good at: the scanner audits, the LLM explains. The result is zero hallucinations and clear, actionable feedback.

Why do LLMs invent issues on clean code?

It's not a bug — it's how they work.

🎯 Task-completion bias

When you ask an LLM to "review this PR for design system violations," it wants to be helpful. If the code is clean, it will find something anyway — because saying "nothing's wrong" doesn't feel like completing the task.

🔮 Pattern hallucination

LLMs have seen millions of code reviews that found issues. They're pattern-matching against that history. A diff that "looks like" it might have a hardcoded color? The model will confidently assert it does.

🗺️ No source of truth

An LLM reviewing a diff doesn't know your token definitions. It doesn't know that var(--color-gray-50) is valid in your system. It's guessing based on the diff alone.

📍 Line number fabrication

LLMs generate text sequentially. Line numbers are just tokens to them — not verified positions. That's how one model referenced "line 112" in a 61-line file.
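By contrast, with the actual token definitions in hand, "is this value valid in our system?" is a lookup, not a guess. A minimal sketch, assuming a hypothetical token set (the names and the validation function are illustrative, not Buoy's real implementation):

```typescript
// Hypothetical token definitions a scanner would load from the design system.
const definedTokens = new Set(["--color-gray-50", "--spacing-md", "--radius-sm"]);

// Deterministic check: a value is valid only if it references a defined token.
function isValidTokenReference(value: string): boolean {
  const m = value.match(/^var\((--[\w-]+)\)$/);
  return m !== null && definedTokens.has(m[1]);
}
```

Given this, var(--color-gray-50) validates, a raw hex value like #f9fafb does not, and there is nothing left for a model to assert with misplaced confidence.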

Model-by-model results

Correctness scores for reviewing the same clean PR, before and after making Buoy's scanner the source of truth.

| Model | Hallucinated? | Correctness | With Buoy: Hallucinated? | With Buoy: Correctness |
|-------|---------------|-------------|--------------------------|------------------------|
| Claude Sonnet 4.5 | Yes | 3/5 | No | 5/5 |
| Claude Haiku 4.5 | Yes | 2/5 | No | 5/5 |
| GPT-4o | Yes | 0/5 | No | 5/5 |
| GPT-4o-mini | Yes | 1/5 | No | 5/5 |

GPT-4o scored 0/5 without Buoy: it claimed hex values like #f9fafb existed at specific lines when the actual code used var(--color-gray-50). With Buoy's scanner as the source of truth, it scored 5/5.

We're not anti-AI. We use AI.

Buoy's optional AI layer uses Claude to generate natural-language explanations and context-aware suggestions. The difference is that our AI is constrained by deterministic scanner output.

LLM-only: Give the model a diff and hope for the best
Buoy: Scanner finds facts, AI explains them clearly

Stop reviewing hallucinated issues.

Buoy is free, open source, and runs in under a minute. Your design system gets facts, not fiction.

$ npx ahoybuoy dock