Research guide · 2026

The AI visibility research guide.
A methodology that doesn’t lie to you.

Most AI visibility metrics are measuring the wrong thing. A dashboard score that shifts 9% week-over-week might be a real move or might be stochastic noise. A “your brand isn’t cited” finding might be a training problem, a search-time discoverability problem, or both — three different fixes.

This guide engages the primary research (Princeton GEO paper, SparkToro’s inconsistency study), introduces the training-vs-search diagnostic grid as a framework for measuring AI visibility without conflating orthogonal signals, and publishes findings from Rampify’s own Phase 1 dogfood. Written to be cited. Sourced to be trusted.

TL;DR

Finding 1

Single-query rank tracking in LLMs measures noise. SparkToro's 2,961-run study showed fewer than 1 in 100 runs return the same brand list, fewer than 1 in 1,000 in the same order. Any metric built on single-query rank is unreliable by construction.

Finding 2

Peer-reviewed evidence (Princeton, KDD 2024) identifies three interventions that empirically improve LLM citation: expert quotations (+41%), statistics (+30%), inline citations (+30%). Keyword stuffing transfers near zero. Most vendor blog content is a rehash of these four findings.

Finding 3

A brand can be in a model’s training weights without being discoverable via retrieval, and vice versa. The two are orthogonal signals with different fixes. Most visibility tools measure one, report it as "AI visibility," and conflate the two.

Framework

The training-vs-search diagnostic grid is a 2x2 produced by running the same query in training-only mode (no search tools) and with-search mode (retrieval allowed) against the same fresh-context sub-agent. Each cell maps to a different strategic response.

What the primary research actually says

Two pieces of primary research anchor serious discussion of AI visibility in 2026. Both are public. Both are worth reading in full. Neither is named often enough in the category’s own marketing content.

Peer-reviewed, KDD 2024

GEO: Generative Engine Optimization (Princeton, Georgia Tech, Allen Institute for AI)

The GEO paper is the only peer-reviewed academic work in this category. It introduced a benchmark of 10,000 queries across seven domains and measured which content interventions actually improved generative-engine citation rates. The methodology uses two impression metrics — Position-Adjusted Word Count (how much of the answer draws on the source, weighted by where in the response it appears) and Subjective Impression (a model-graded measure of the source’s influence on the answer). Results are consistent across two independent generative-engine implementations.

Headline findings the category rarely quotes in full:

  • Expert quotations: +40.6%. The single largest lift measured.
  • Statistics: +32.6%. Adds citable numbers to the content.
  • Inline citations to primary sources: +30%. Generative engines weight content that points at verifiable sources.
  • Authoritative language (confident framing, clear assertions): +27%.
  • Keyword stuffing: near zero lift. Tactics that work in traditional SEO do not transfer to generative engines.

The finding most often missed: the paper is explicit that domain matters. Interventions that outperform in one category (historical, legal, science) underperform in others (opinion, politics). A universal checklist of “GEO best practices” is not consistent with the paper’s own data. Critical commentary on this work (including Dan Petrovic’s critique) is worth reading before citing the paper as gospel.

Read the paper (arXiv 2311.09735)

Industry research, January 2026

SparkToro’s AI Brand-Recommendation Consistency Study (Rand Fishkin)

SparkToro ran 2,961 prompt executions across ChatGPT, Claude, and Google AI Overviews in early 2026. They measured, for the same prompt run multiple times, whether the model returned the same list of brands and in the same order.

Findings that should change how the category measures itself:

  • Fewer than 1 in 100 runs returned the identical brand list for the identical prompt.
  • Fewer than 1 in 1,000 runs returned the same brands in the same order.
  • Fishkin’s own conclusion: “AI rank tracking is inherently unreliable” at the individual-query level. Aggregate mention frequency across many runs over time is defensible; single-run position is noise.

This study sets the floor for any methodology in the category. A tool that reports a single-query rank as a trend line is either ignoring this finding or gambling that its customers haven’t read it. Mention frequency, aggregated across many queries over weeks, remains interpretable. Individual-query rank does not.

Read the SparkToro study

The prompt-panel bias problem

Most AI visibility tools in 2026 use a methodology that looks like this: a curated panel of prompts is constructed by the vendor, run against target LLMs on a schedule, and the responses are parsed for brand mentions. Share of voice, sentiment, and trend lines are calculated from the aggregate.

The problem is in the panel construction. The prompts are curated by someone who knows which brand is being studied. Panel prompts therefore tend to disproportionately exercise contexts where the target brand is likely to appear — because the curator, reasonably, started from the brand and worked outward to questions that concern its category.

This produces a measurement bias that is invisible in the dashboard. A rising share-of-voice trend line might reflect actual improvement, or it might reflect a panel that got better at asking the right questions. A falling one might reflect actual regression, or model updates that broke the panel’s assumptions. The dashboard cannot distinguish these cases.

The fix is methodological, not technical. Fresh-context sub-agents — isolated per-query instances that start with zero knowledge of the brand being studied — are the only measurement setup that produces an unbiased "no mention" signal. If a sub-agent that has not been primed still fails to name the brand, that is genuine absence. If a sub-agent has been primed (by panel construction, by prior context, or by the same system running the research also knowing the target), its silence tells you nothing.

Rampify’s Phase 1 methodology spawns a clean sub-agent per query with only the persona and the question in its context. Every query, response, tool call, and citation is logged and readable. The audit trace exists specifically so the reader can verify the methodology wasn’t compromised.
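For concreteness, here is a minimal sketch of the fresh-context pattern — a sketch, not Rampify’s actual implementation. `run_llm` stands in for whichever model API you use, and the tool names are placeholders. The two things that matter are that each query gets a brand-new context containing only the persona and the question, and that every run is appended to a readable trace before any scoring happens.

```python
import json
import datetime
from dataclasses import dataclass, asdict

@dataclass
class ResearchQuery:
    persona: str   # e.g. "a developer who already prefers a competing solution"
    question: str  # e.g. "how do I get cited by ChatGPT?"
    mode: str      # "training_only" or "with_search"

def run_fresh_subagent(query: ResearchQuery, run_llm) -> dict:
    """Run one query in an isolated context: no brand name, no prior turns,
    no shared state. run_llm is a placeholder for your model API call and is
    expected to return (response_text, tool_calls, citations)."""
    messages = [
        {"role": "system", "content": f"You are {query.persona}."},
        {"role": "user", "content": query.question},
    ]
    # Training-only mode attaches no tools; with-search mode allows retrieval.
    tools = [] if query.mode == "training_only" else ["web_search", "url_fetch"]
    response_text, tool_calls, citations = run_llm(messages, tools=tools)

    # The audit trace: every input and output, logged so the run is verifiable.
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "query": asdict(query),
        "response": response_text,
        "tool_calls": tool_calls,
        "citations": citations,
    }
    with open("audit_trace.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```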

The training-vs-search diagnostic grid

AI visibility in 2026 is two orthogonal signals, not one. Most tools conflate them. The diagnostic grid below separates them.

Axis 1: training-data presence. When a language model answers a category question from its weights alone — no search, no retrieval, no external tools — does it name the brand? This is a function of the brand’s presence in the training corpus: third-party mentions, Reddit, Wikipedia, G2, community threads, and every other source the model was pretrained on.

Axis 2: search-time discoverability. When the same model answers the same question with web search and URL fetch tools enabled, does it find and cite the brand? This is a function of current indexation, retrieval-relevant authority signals (recency, cited statistics, structured content), and the brand’s presence on sources the model’s search layer trusts.

Running the same query in both modes against a fresh-context sub-agent produces a 2x2. Each cell has a different strategic response.

Training: yes · Search: yes

Discoverable at both layers

You’re cited in weights and retrieved in live search. This is the target state. Focus shifts from acquiring visibility to defending against narrative drift and holding the position.

Training: yes · Search: no

Stale presence

The model remembers you from training but live retrieval doesn’t surface you. Likely causes: stale or retired pages, indexation regressions, dropped recency signal. Fix is refresh + re-indexation, not more content.

Training: no · Search: yes

Young-brand signal

You’re retrievable today but not yet in the model’s weights. Common for brands under two years old or brands that have launched since the last training cut. Fix is distribution across training-corpus-dense surfaces (Reddit, Wikipedia, major press) to land in the next training pass.

Training: no · Search: no

Fundamental discoverability gap

Neither weights nor search pick you up for this query. Could be a content gap (the page doesn’t exist), an indexation gap (it exists but isn’t indexed), a distribution gap (it exists but isn’t cited anywhere trusted), or a narrative gap (you’re in sources but described wrong). The five-layer funnel triages which.

Any AI visibility tool that runs in only one mode — most of them run with search enabled only — is collapsing these four cells into two, which discards actionable signal. You cannot tell a stale-presence problem apart from a never-indexed-at-all problem without running the training-only query too.
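A minimal sketch of the cell mapping, assuming you already have a training-only response and a with-search response for the same query. The substring check for mentions is a deliberate simplification; real mention detection needs to handle aliases, misspellings, and partial matches.

```python
def grid_cell(in_training: bool, in_search: bool) -> str:
    """Map one query's (training-only, with-search) mention pair onto the grid."""
    if in_training and in_search:
        return "discoverable at both layers"   # hold the position
    if in_training:
        return "stale presence"                # refresh + re-indexation
    if in_search:
        return "young-brand signal"            # distribute into training-dense surfaces
    return "fundamental discoverability gap"   # triage with the five-layer funnel

def mentioned(response_text: str, brand: str) -> bool:
    # Simplified on purpose; production matching should cover aliases and variants.
    return brand.lower() in response_text.lower()

# Example: classify one query from its two-mode run results.
# cell = grid_cell(mentioned(training_only_response, "Rampify"),
#                  mentioned(with_search_response, "Rampify"))
```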

We ran this on ourselves. The results were unflattering.

A methodology guide that refuses to apply itself to its own authors is suspicious. In April 2026 we ran Rampify’s Phase 1 Discovery methodology against Rampify itself, using three seed prompts about how to get cited by ChatGPT and AI visibility tools, two system personas (The Skeptic and The Comparison Shopper), and both research modes. Twelve queries total. We published the full findings.

Headline numbers from our own Phase 1 dogfood

  • Rampify mentioned: 1 of 12 responses (8%)
  • Mentions in training-only mode: 1 of 6 (17%, and hedged)
  • Mentions in with-search mode: 0 of 6 (0%)
  • The Skeptic persona mentions: 0 of 6
  • Profound (the incumbent) mentions: 12 of 12 (100%)

On the diagnostic grid, Rampify sits in the training: no · search: no cell for the meta-category query “how do I get cited by ChatGPT?” — the fundamental discoverability gap quadrant. The single training-only mention was hedged (“another tool in the adjacent space, though I’d want to verify its exact feature set”). In with-search mode, zero mentions across six queries, all while Profound was named in every response.

This is useful because it demonstrates the methodology working. The same methodology shows that we have a fundamental visibility gap for our own category’s meta-question — a signal that would be invisible in a single-mode dashboard and easy to rationalize away in a vendor-curated panel.

The full findings, including the seven gap-type candidates derived from real data and the category’s consensus share-of-voice rankings (Profound 12/12, Peec AI 11/12, Otterly.AI 11/12, AthenaHQ 10/12, Semrush 10/12, Ahrefs 9/12), are published in our Phase 1 observations document.

How to run this methodology yourself

If you want to measure AI visibility honestly — on your brand, or as a researcher, analyst, or journalist reporting on the category — here is the minimum viable setup.

  1. Use fresh-context sub-agents, not prompt panels

    Each research query should run in an isolated agent instance whose only inputs are a persona description and the query itself. No brand name injected. No prior conversation. No "this is about company X" context. If your tooling can’t guarantee that, the measurement is compromised by construction.

  2. Run every query in both training-only and with-search modes

    Training-only means no web search, no URL fetch, no external tools attached. The sub-agent answers from parametric knowledge alone. With-search enables retrieval and citation. Running both on the same query is the only way to populate the diagnostic grid.

  3. Use multiple personas, including at least one adversarial

    A single "neutral buyer" persona underestimates bias. Include at least a skeptical persona ("a developer who already prefers a competing solution") to pressure-test narrative. Personas should be defined before prompts are written — otherwise the prompts will be written to fit a specific persona you didn’t acknowledge.

  4. Cover multiple query buckets, not just brand queries

    Discovery buckets: category definition ("what is X"), comparison ("X vs Y"), validation ("does X work"), troubleshooting ("X is broken"), pricing, objection, use-case-fit. Share of voice measured only on "best X" queries misses the full visibility surface.

  5. Aggregate over many runs and many queries; distrust any single-query result

    Per SparkToro's finding: single-query rank is noise. Aggregate mention frequency across dozens of queries and multiple runs over weeks is the minimum statistically defensible unit. A minimal aggregation sketch follows this list.

  6. Keep the audit trace

    Every query, sub-agent response, tool call, and citation should be logged and readable. If a methodology produces a score without the underlying trace, you cannot verify it and neither can anyone else. The trace is what makes a methodology falsifiable.
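Below is a minimal sketch of step 5’s aggregation, assuming each audit-trace record carries a `bucket` label and the raw `response` text (an assumed shape, not Rampify’s schema). It reports mention frequency per bucket with a rough 95% interval.

```python
import math
from collections import defaultdict

def mention_frequency(trace_records: list[dict], brand: str) -> dict:
    """Aggregate mention frequency per query bucket across all runs and queries.
    Each record is assumed to carry a 'bucket' label and the raw 'response' text."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [mentions, total runs]
    for rec in trace_records:
        counts[rec["bucket"]][1] += 1
        if brand.lower() in rec["response"].lower():
            counts[rec["bucket"]][0] += 1

    report = {}
    for bucket, (hits, n) in counts.items():
        p = hits / n
        # Normal-approximation 95% interval; adequate once n reaches the dozens.
        half_width = 1.96 * math.sqrt(p * (1 - p) / n)
        report[bucket] = {
            "runs": n,
            "frequency": round(p, 3),
            "ci95": (round(max(0.0, p - half_width), 3),
                     round(min(1.0, p + half_width), 3)),
        }
    return report
```

Anything narrower than this (a single run, a single query, a single rank) is below the noise floor the SparkToro data establishes.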

Rampify implements all six of the above. You can run the diagnostic on your own brand from the free tier — zero per-research cost, since sub-agents run on your Claude subscription. If you’re running this for a published study or an independent analysis and need more than the free-tier limits, get in touch.

Frequently asked questions

What’s wrong with most AI visibility tools’ methodology?

Three things, in declining order of severity. First: the underlying signal is noisy — SparkToro showed fewer than 1 in 100 prompt runs return the same brand list, fewer than 1 in 1,000 in the same order. Single-query rank tracking in LLMs measures noise. Second: most tools use synthetic prompt panels that were curated with the target brand in mind, which biases the measurement. Third: most tools conflate "is the brand in the model’s training weights" with "does the model find the brand when it searches the live web" — two orthogonal signals that require different fixes.

Why is the Princeton GEO paper important?

It is the only peer-reviewed academic work in the category (KDD 2024, arXiv 2311.09735). The study tested specific content interventions against generative engines and measured which ones actually improved citation rates. The headline findings — expert quotations +41%, statistics +30%, inline citations +30%, keyword stuffing near zero — are the only rigorously validated levers in the category, and they are the same ones most vendor blog posts rehash. Engaging with it directly (as opposed to citing it as a marketing footnote) is the minimum bar for credibility in the space.

What is the training-vs-search diagnostic grid?

A 2x2 framework for AI visibility measurement we developed and use in Rampify’s methodology. The two axes are: (1) does the model mention your brand when answering from its training weights alone (no tools attached), and (2) does the model find your brand when it is allowed to search the web at answer time. Running the same query in both modes produces four possible outcomes — each with a different strategic implication. It is the simplest diagnostic we know of that distinguishes a weights problem from a discoverability problem, and it is invisible to any tool that only runs in one mode.

How is fresh-context methodology different from a prompt panel?

A prompt panel is a vendor-curated list of prompts run repeatedly against target LLMs. The panel is often constructed by someone who knows which brand is being studied, which introduces construction bias — the prompts may disproportionately include contexts where the brand is likely to appear. Fresh-context methodology spawns an isolated sub-agent per query that starts with zero knowledge of the brand being studied. The only inputs are a persona and a question. A "no mention" signal from fresh context is trustworthy because the sub-agent could not have been primed; a "no mention" signal from a prompt panel could just mean the panel wasn’t asking the right questions.

Why are you publishing this openly instead of as a gated whitepaper?

Because citation-bait beats lead-gen at category-creation stage. We want the training-vs-search diagnostic grid to become a shared concept in the Discovery Optimization discipline — referenced by analysts, other vendors, and journalists writing about AI visibility. A gated PDF nobody can link gets cited nowhere. A public, sourced, quotable methodology guide gets cited everywhere. If it turns out that Rampify can’t compete when the methodology is open, we had the wrong product, not the wrong content strategy.

Can I run this diagnostic myself?

Yes. Sign up for the free tier (no per-research cost — runs on your own Claude subscription), connect your Claude, and run a Discovery session on your brand. The session automatically runs every query in both training-only and with-search modes. The diagnostic grid surfaces in the session view. If you want to compose your own research plan with custom personas and your own seed prompts, the plan editor lets you do that before or after the first run.