AI Visibility Tools (photo from Pexels)

Originally Posted On: https://wordshaveimpact.com/why-ai-visibility-tools-dont-agree-and-what-marketers-should-do-instead/?utm_source=syndication&utm_medium=article&utm_campaign=aiseo+tools

Why AI Visibility Tools Don’t Agree—And What Marketers Should Do Instead

Here’s a fun experiment: run the same brand query through three different AI visibility tools. Watch as each one spits out wildly different numbers. One says you’re everywhere, another says you barely exist, and the third seems to be grading you on a rubric only it understands.

If that feels like SEO déjà vu (remember the days of contradictory keyword trackers?), you’re not wrong. The big difference is that AI answer engines are not search engines. They’re weirder, fuzzier, and much harder to measure. Which is why the tools trying to “track” them keep disagreeing.

Here, I’ll break down some of the reasons why this happens, and share some thoughts about what’s worth using at this stage of the game.

What “AI Mention” Tools Promise

Every one of these platforms has a shiny pitch:

  • “Discover your visibility in ChatGPT!”
  • “See if Gemini cites your blog posts!”
  • “Benchmark your brand against competitors across 14 LLMs!”

If you’re the kind of marketer who loves dashboards and KPIs, this all sounds fantastic. Finally, some real numbers to see which tactics are working!

But let’s get real: Most of these tools are still in beta—in spirit if not in name.

I decided to test drive a few of these tools myself, and while I don’t pretend that this is at all comprehensive (these tools seem to be popping up faster than I can sign up for them), it might serve as a first-blush overview for the overwhelmed:

Here’s the Landscape from My Own “AI Mentions Dashboard” Test Drive:

  • Peec: Neat industry rating, shows top URLs cited. But no real sense of frequency—you can’t tell if your brand is in every other answer or just once in a blue moon.
  • Profound: Squarely enterprise-only. No free trial, no sandbox. You’ll need to book a demo before you see anything.
  • LLMRefs: Gives brand rankings and top prompts, with an excellent and extensive source list. But if you’re not in the top 30, you won’t see your placement at all—which makes it great for “winners,” frustrating for everyone else.
  • Rankability: A general SEO suite wearing an AI tracker hat. Useful, sure, but not exactly groundbreaking.
  • AthenaHQ: Requires a lengthy setup before hitting you with “please pay first.” Thanks, I guess?
  • SEMRush: The heavyweight. Share of voice, sentiment drivers, cited pages, even advice for improvement. Probably the best-rounded of the bunch.
  • Trakkr: Surprisingly good. Clean dashboard, visibility numbers, competitor benchmarks, even a breakdown by AI tool. Easily the most marketer-friendly.
  • Carma.ai: Barebones. Numbers go up or down, but no context, no graphs, no sentiment. Imagine an intern emailing you weekly saying, “We’re at 12 now.” That’s it.

Let’s Get Down to It: Why Do Answers Differ?

The short version: they’re measuring different things, in different ways, against different models, with different definitions of “visibility.”

The long version:

  • Different sources. Some tools scrape outputs from AI chats (e.g., run a batch of prompts through ChatGPT and record the responses). Others scrape citations in AI answers (e.g., Perplexity and Gemini sometimes show the sources they pulled from, and tools log those). Still others look at training or reference lists (e.g., LLMRefs gives you a list of which documents or domains show up in its testing corpus). These differences naturally produce different dashboards.
  • Different models. Ask GPT-4, Claude, and Gemini the same question—you’ll get three different answers. Tools reflect that inconsistency.
  • Different prompt sets. One tool might test 100 prompts, another 1,000, and the phrasings can vary wildly. “Best CRM software” vs. “Top CRM platforms” vs. “What’s the best system to manage customer relationships?” — those are three different questions to an LLM. Some tools build big libraries of prompts to simulate real-world queries; others lean on a smaller, more curated set. The size, wording, and frequency of those prompt banks all affect what “visibility” looks like.
  • Different ranking logic. Even if two tools are asking the same questions, they may not agree on how to score the answers. One might count any mention of your brand as visibility, another only counts citations, and a third collapses everything into a proprietary “visibility score.” Some factor in placement (are you named first, or buried in paragraph four?), while others treat all mentions equally. It’s apples, oranges, and bananas (to steal one of LLMs’ favorite phrases), which is why the dashboards rarely line up. A quick sketch below shows how much the scoring choice alone can matter.
  • Dynamic outputs. AI answers aren’t static. You don’t always get the same output twice. SERPs don’t behave this way, which is why marketers are struggling to adapt.

Put all that together and it’s no wonder your dashboards don’t line up.
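
To make the ranking-logic point concrete, here’s a rough Python sketch using made-up sample answers and invented brand names. It doesn’t mirror any particular vendor’s methodology; it just shows how two reasonable counting rules can disagree over the exact same batch of AI responses.

    # Illustrative only: two hypothetical "visibility" scores computed over the
    # same batch of AI answers, showing how scoring logic alone changes the number.

    responses = [
        # (answer_text, cited_domains) -- made-up sample data
        ("Acme CRM and BetaCRM are popular choices...", ["acmecrm.com"]),
        ("Many teams like BetaCRM for its pricing...", []),
        ("Top picks include Acme CRM, BetaCRM, and Gamma...", ["g2.com"]),
        ("BetaCRM leads on integrations...", ["betacrm.com"]),
    ]

    BRAND = "Acme CRM"
    BRAND_DOMAIN = "acmecrm.com"

    # "Tool A": any mention of the brand name counts as visible
    mention_rate = sum(BRAND.lower() in text.lower() for text, _ in responses) / len(responses)

    # "Tool B": only a citation of the brand's own domain counts
    citation_rate = sum(BRAND_DOMAIN in cites for _, cites in responses) / len(responses)

    print(f"Mention-based visibility:  {mention_rate:.0%}")   # 50%
    print(f"Citation-based visibility: {citation_rate:.0%}")  # 25%

Same four answers, two very different “visibility” numbers. Add position weighting or a proprietary composite score and you get a third and a fourth.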

Measuring a Moving Target

What all of the above gets at is the core concept behind every LLM: AI outputs are probabilistic. They’re stitched together from training data, context windows, and a dash of randomness. Measuring them is a little like measuring how often a comedian “uses irony.”

Add in training cutoffs (your shiny new blog post may not show up in GPT-4’s corpus for months, if ever), and the whole thing gets even fuzzier.

And the big question—what even counts as visibility? A citation? A passing mention? Being the first example in the answer? Each tool answers that differently.
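
If you want a feel for what measuring a probabilistic system involves, here’s a toy sketch: it repeats the same prompt many times and reports how often a brand shows up. The ask_model function is a stand-in that simulates random answers; a real version would call whichever LLM API you use, and the result would still wobble from run to run, which is exactly the point.

    import random

    # Hypothetical stand-in for a real model call; a live version would hit an
    # LLM API instead. Here we just simulate non-deterministic answers.
    def ask_model(prompt: str) -> str:
        answers = [
            "For CRMs, consider Acme CRM, BetaCRM, or Gamma.",
            "BetaCRM and Gamma are strong options.",
            "Acme CRM is a common pick for small teams.",
        ]
        return random.choice(answers)

    def mention_rate(prompt: str, brand: str, runs: int = 50) -> float:
        """Ask the same prompt repeatedly and report how often the brand appears."""
        hits = sum(brand.lower() in ask_model(prompt).lower() for _ in range(runs))
        return hits / runs

    print(mention_rate("What's the best CRM software?", "Acme CRM"))
    # Prints something near 0.66, but it shifts every run -- that's the nature of the beast.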

Long story short: We are still in the Wild West of AI citation tools. And that will mean some chaos for any marketing strategy leaning on AISEO, GEO, and related tactics.

How to Use These Tools Without Losing Your Sanity

Here’s the marketer’s guide to not going cross-eyed:

  • Don’t treat dashboards as gospel.
  • Do use them for trend-spotting. Is your visibility growing over time? Are you showing up more than Competitor X? That’s useful.
  • Don’t freak out over daily fluctuations. AI is noisy; so are these measurements.
  • Do compare relative presence across tools. If three different platforms all show you trending up, you can be reasonably confident you’re making progress (see the sketch at the end of this section).
  • Do talk to your own customers or clients. Are they seeing you mentioned by AI tools or LLMs? If so, great. If not, you might need to do some good old-fashioned brand work.

And for the love of all that is good…

  • Don’t obsess over every fluctuation.
  • Don’t screenshot a single dashboard to prove ROI.
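
Here’s a small, purely illustrative sketch of the “compare relative presence” idea: index each tool’s scores to its own starting point so the units stop mattering, then check whether the direction of travel agrees. The tool names and weekly numbers below are invented.

    # Made-up weekly scores from three hypothetical dashboards. The absolute
    # numbers disagree, but indexing each tool to its own first week shows
    # whether they agree on the *direction*.

    weekly_scores = {
        "tool_a": [12, 14, 15, 19],       # "visibility score"
        "tool_b": [3.1, 3.0, 3.6, 4.2],   # "share of voice %"
        "tool_c": [48, 55, 51, 60],       # "mentions per 100 prompts"
    }

    for tool, scores in weekly_scores.items():
        indexed = [round(s / scores[0], 2) for s in scores]  # 1.0 = starting point
        trending_up = indexed[-1] > 1.0
        print(f"{tool}: {indexed} -> {'up' if trending_up else 'flat/down'}")

    # If all three land on "up", that agreement is more meaningful than any
    # single tool's absolute number.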

Closing Thoughts

Marketers want the holy grail: “Google Analytics for AI search.” We don’t have it yet. What we have are imperfect, noisy, sometimes contradictory tools. They will likely improve with time, though, so keep an eye out.

In the meantime, there are still other things you can do. The winning strategy—for now—is to create content AI systems actually use. Until measurement catches up, that’s your best lever.

And that, conveniently, is what we do at Words Have Impact: help companies create content that isn’t just search-friendly—it’s AI-friendly.

Until then, enjoy the dashboards, laugh at their contradictions, and don’t let them drive you crazy.