We tested every major AI code reviewer on a clean PR with zero design system issues. Every single one invented problems that didn't exist.
The test: a clean PR using var(--color-gray-50) correctly. Zero real issues.

What the LLM reviewers reported:
- Hardcoded color #f9fafb: "Use a design token instead"
- Hardcoded spacing 16px: "Replace with var(--spacing-md)"
- Non-standard border radius 6px: "Should use var(--radius-sm)"

What Buoy's scanner reported:
- No design system violations found. Scanner verified: all colors use tokens, spacing follows scale, components match patterns.
The LLM claimed #f9fafb existed at line 112.
The file was 61 lines long, and the actual code used var(--color-gray-50).
The LLM never decides what's wrong. It only explains what the scanner already found.
Deterministic analysis. Exact file, exact line, exact value. No guessing, no probability.
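To make the contrast concrete, here is a minimal sketch of what a deterministic hardcoded-color check can look like. This is an illustration, not Buoy's actual implementation: the regex, the Finding shape, and the function name are all assumptions. The point is structural: line numbers come straight from the source text, so a value that isn't in the file can never be reported.

```typescript
// Minimal sketch of a deterministic hardcoded-color check (illustrative,
// not Buoy's real scanner). Matches 3- or 6-digit hex literals.
const HEX_COLOR = /#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b/g;

interface Finding {
  line: number;   // 1-based, taken directly from the source text
  value: string;  // the exact literal that was matched
}

function scanHardcodedColors(source: string): Finding[] {
  const findings: Finding[] = [];
  source.split("\n").forEach((text, i) => {
    for (const match of text.matchAll(HEX_COLOR)) {
      findings.push({ line: i + 1, value: match[0] });
    }
  });
  return findings;
}

// A clean file that uses a token produces zero findings:
const clean = ".card { background: var(--color-gray-50); }";
console.log(scanHardcodedColors(clean).length); // 0

// A hardcoded value is reported with its exact line and literal:
const dirty = ".card {\n  background: #f9fafb;\n}";
console.log(scanHardcodedColors(dirty)); // [ { line: 2, value: '#f9fafb' } ]
```

Unlike an LLM, this check cannot invent a finding at "line 112" of a 61-line file: every reported line index is derived from the split source itself.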
The AI gets scanner output as context — not raw diffs. It explains why each issue matters and how to fix it.
Every suggestion is verified against your actual token definitions. Copy, paste, done.
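A sketch of what "verified against your actual token definitions" can mean in practice. The token names and values below are illustrative assumptions, not Buoy's real token source: the idea is that a replacement is only suggested when a token genuinely resolves to the hardcoded value, and otherwise nothing is said.

```typescript
// Hypothetical token table (illustrative values, not Buoy's internals).
const tokens: Record<string, string> = {
  "--color-gray-50": "#f9fafb",
  "--spacing-md": "16px",
  "--radius-sm": "6px",
};

// Suggest a var() replacement only if a token resolves to this value.
function suggestToken(hardcoded: string): string | null {
  for (const [name, value] of Object.entries(tokens)) {
    if (value.toLowerCase() === hardcoded.toLowerCase()) {
      return `var(${name})`;
    }
  }
  return null; // no matching token: say nothing rather than guess
}

console.log(suggestToken("#F9FAFB")); // var(--color-gray-50)
console.log(suggestToken("#123456")); // null
```

Returning null for unknown values is the design choice that matters: a lookup against real definitions can decline to answer, while a language model under pressure to be helpful will fill the gap with a guess.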
LLMs are great at explaining code. They're terrible at auditing it. Buoy uses each for what it's good at: the scanner audits, the LLM explains. The result is zero hallucinations and clear, actionable feedback.
It's not a bug — it's how they work.
When you ask an LLM to "review this PR for design system violations," it wants to be helpful. If the code is clean, it will find something anyway — because saying "nothing's wrong" doesn't feel like completing the task.
LLMs have seen millions of code reviews that found issues. They're pattern-matching against that history. A diff that "looks like" it might have a hardcoded color? The model will confidently assert it does.
An LLM reviewing a diff doesn't know your token definitions. It doesn't know that var(--color-gray-50) is valid in your system. It's guessing based on the diff alone.
LLMs generate text sequentially. Line numbers are just tokens to them — not verified positions. That's how one model referenced "line 112" in a 61-line file.
Correctness scores for reviews of the same clean PR, before and after Buoy's scanner became the source of truth.
GPT-4o scored 0/5 without Buoy — it claimed hex values like #f9fafb existed at specific lines when the actual code used var(--color-gray-50). With Buoy as the source of truth, it scored 5/5.
Buoy's optional AI layer uses Claude to generate natural-language explanations and context-aware suggestions. The difference is that our AI is constrained by deterministic scanner output.
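One way to picture that constraint, sketched below. The prompt wording and the VerifiedFinding shape are hypothetical, not Buoy's actual API: what matters is that the model only ever receives verified findings to explain, and that a clean PR short-circuits before any model call, so there is nothing to hallucinate about.

```typescript
// Hypothetical sketch of constraining an LLM with scanner output
// (prompt text and types are illustrative, not Buoy's real interface).
interface VerifiedFinding {
  file: string;
  line: number;
  value: string;
  suggestion: string;
}

// Build an explain-only prompt from verified findings.
// With zero findings, return null: no model call, no invented issues.
function buildExplainPrompt(findings: VerifiedFinding[]): string | null {
  if (findings.length === 0) return null;
  const items = findings
    .map(f => `- ${f.file}:${f.line} uses ${f.value}; replace with ${f.suggestion}`)
    .join("\n");
  return [
    "Explain why each verified issue below matters and how to apply the fix.",
    "Do not report any issue that is not in this list.",
    items,
  ].join("\n");
}

console.log(buildExplainPrompt([])); // null
```

The model's job shrinks from "find problems" (where it hallucinates) to "explain these specific, verified problems" (where it excels).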
Buoy is free, open source, and runs in under a minute. Your design system gets facts, not fiction.
npx ahoybuoy dock