The Digital Dilemma: Why Generative AI Defies Traditional Ranking Metrics
In the rapidly evolving landscape of digital search and content discovery, generative artificial intelligence tools like ChatGPT, Claude, and Google’s own AI are fundamentally changing how users find information, products, and brands. However, as marketers and SEO professionals attempt to apply familiar measurement techniques to these new platforms, they are running into a stark reality: AI is inherently random.
A groundbreaking study conducted by Rand Fishkin, CEO and co-founder of SparkToro, and Patrick O’Donnell, CTO and co-founder of Gumshoe.ai, has provided quantitative evidence of this randomness. Their extensive research reveals that when these leading AI models are asked for brand or product recommendations, they produce highly varied results. The headline finding is clear and transformative for the industry: the probability of an AI returning the exact same list of recommendations twice is under 1%—and lower still if the list must also appear in the same order.
This finding necessitates a massive reevaluation of how we approach measurement, performance tracking, and the very concept of “ranking” within generative AI systems. For those trying to integrate AI visibility into their digital marketing strategy, understanding the probabilistic nature of these models is paramount.
The Core Challenge: Measuring Generative AI Consistency
The objective of the SparkToro and Gumshoe.ai study was straightforward: to test the consistency of recommendations generated by the world’s most popular large language models (LLMs). While traditional search engine optimization (SEO) relies on the premise of relative stability—a keyword query generally yields the same search engine results page (SERP) results minute-to-minute, day-to-day—it was unclear if this stability translated to conversational AI.
A Deep Dive into the Study’s Methodology
To gather reliable data, the researchers orchestrated a massive testing environment. They enlisted 600 volunteers who collectively ran 12 distinct prompts—each submitted verbatim, word for word—through three major generative AI platforms: ChatGPT, Claude, and Google’s AI. This ambitious exercise resulted in nearly 3,000 unique responses, providing a large-scale data set for comparative analysis.
The 12 prompts were specifically designed to elicit brand or product recommendations across various categories, ensuring the results were applicable to typical consumer and business queries. Crucially, the researchers had to standardize the output. Since generative AI responses are often conversational and unstructured, each response was meticulously normalized into a simple, ordered list of recommended brands or products.
The core comparison then centered on three key areas of variation:
1. **Overlap:** How many of the same brands appeared in two different lists for the same prompt?
2. **Order:** How often did the brands appear in the exact same sequence?
3. **Repetition:** How frequently was the entire list—content and order—identical across multiple runs?
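The three comparisons above are simple to sketch in code. This is a minimal illustration, not the study’s actual tooling, and the brand lists are hypothetical examples:

```python
def overlap(list_a, list_b):
    """Fraction of brands shared between two recommendation lists,
    relative to the longer list (position is ignored)."""
    shared = set(list_a) & set(list_b)
    return len(shared) / max(len(list_a), len(list_b))

def identical_ordered(list_a, list_b):
    """True only if both lists contain the same brands in the same sequence."""
    return list_a == list_b

# Two hypothetical normalized responses to the same prompt:
run_1 = ["Bose", "Sony", "Apple", "Sennheiser"]
run_2 = ["Sony", "Bose", "Apple", "JBL", "Anker"]

print(overlap(run_1, run_2))            # 0.6 -> three shared brands out of five
print(identical_ordered(run_1, run_2))  # False -> same brands, different sequence
```

Run across thousands of response pairs, these three checks yield exactly the statistics the study reports: overlap rates, order-agreement rates, and the exact-repetition rate.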
The Stunning Finding: Randomness is the Rule
The results of the nearly 3,000 test runs were unequivocal: consistency in AI recommendations is exceptionally rare. Across all tested tools and all 12 prompts, the likelihood of receiving an entirely identical list of brands or products when asking the same question twice fell below 1 in 100.
When the requirement was tightened to include the exact same list *in the exact same order*, the probability dropped even further, settling closer to 1 in 1,000. For digital marketers accustomed to the reliable, if occasionally fluctuating, stability of Google’s “blue links” (traditional organic search results), this degree of inconsistency is jarring. It fundamentally breaks the concept of a stable “AI SERP.”
List Lengths and Order: A Chaotic Landscape
Beyond the basic repetition rate, the study highlighted significant structural inconsistencies. Even when prompted identically, the generative AI models did not adhere to a standard format or length.
Some responses were extremely concise, providing only two or three brand suggestions. Others expanded significantly, generating recommendation lists containing ten or more options, often accompanied by descriptive paragraphs explaining the choices. This wild variation in output length further complicates measurement, as a brand’s presence on a list of three carries a far different weight than its presence on a list of twelve.
The data suggests a simple but critical tactic for end-users: if a user doesn’t like the initial recommendation list they receive from an LLM, they should simply ask the question again. The high probability of variation means the next answer is almost guaranteed to be different.
Understanding the Mechanism: Why LLMs Prioritize Variation
To appreciate why AI recommendations are so erratic, one must understand the core architecture of large language models. This observed variation is not a defect; it is inherent to their design.
Large language models like the ones powering ChatGPT, Claude, and Google’s AI are, at their heart, probability engines. When generating a response, they predict the most statistically likely next word based on the vast amounts of training data they have absorbed, the prompt provided, and, crucially, a variable known as “temperature” or “creativity.”
Unlike traditional search engines, which are designed to index and retrieve the most relevant, stable set of documents for a query (a deterministic process), LLMs are designed to generate novel and contextually appropriate text. They introduce deliberate variation to avoid robotic, repetitive responses. If the models were perfectly consistent, they would lose their utility for creative writing, summarization, and, in many cases, conversational interaction.
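The temperature mechanism described above can be demonstrated with a toy softmax sampler. This is a simplified sketch of the general technique, not the decoding code of any specific model; the logits are made-up values:

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    """Sample a token index from raw scores (logits) using temperature scaling.

    Higher temperature flattens the distribution (more variety);
    temperature near zero approaches deterministic argmax (greedy) decoding.
    """
    scaled = [score / temperature for score in logits]
    max_s = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - max_s) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs)[0]

# Hypothetical scores for four candidate brand tokens:
logits = [2.0, 1.5, 0.5, 0.1]
random.seed(0)
samples = [sample_next_token(logits, temperature=1.0) for _ in range(1000)]
# The top-scoring token wins most often, but not every time --
# which is exactly why identical prompts produce different lists.
```

At temperature near zero the same prompt would always yield the same token; at typical production settings, run-to-run variation is built in.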
Trying to track generative AI results using metrics developed for deterministic, stable search rankings is, therefore, fundamentally flawed. The study argues compellingly that confusing an LLM’s probabilistic output with traditional stable search rankings—where a slight rank shift is often meaningful—produces metrics that are effectively useless for strategic decision-making.
Shifting Metrics: From Ranking to Visibility Percentage
While the study systematically demolished the utility of tracking AI position or ranking, it did identify one metric that proved surprisingly robust and informative: visibility percentage.
Visibility percentage measures how frequently a specific brand or product appears across a large number of prompt runs, regardless of its position within the resulting list. This metric captures a brand’s underlying authority and prevalence within the AI model’s knowledge base related to a specific intent.
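Computing visibility percentage from a batch of normalized responses is straightforward. A minimal sketch, with hypothetical run data:

```python
from collections import Counter

def visibility(runs):
    """Share of runs in which each brand appears at least once,
    ignoring its position within any individual list."""
    counts = Counter(brand for run in runs for brand in set(run))
    return {brand: count / len(runs) for brand, count in counts.items()}

# Hypothetical normalized outputs from four runs of the same prompt:
runs = [
    ["Bose", "Sony", "Apple"],
    ["Sony", "Bose", "Sennheiser"],
    ["Bose", "JBL"],
    ["Sony", "Apple", "Bose", "Anker"],
]

scores = visibility(runs)
print(scores["Bose"])  # 1.0  -> appears in every run, despite shifting position
print(scores["Sony"])  # 0.75 -> appears in three of four runs
```

Note that `set(run)` prevents double-counting a brand mentioned twice in one response, so the metric stays a clean per-run presence rate.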
The Power of Persistent Presence
The research found compelling instances where certain brands consistently appeared in responses for a given intent, even though their sequential position jumped wildly from the first suggestion to the eighth.
In several focused categories—particularly regional service providers (like hospitals or agencies) or established consumer brands—names surfaced in a high percentage of runs, sometimes appearing in 60% to 90% of all generated responses for a particular query. This repeat presence is meaningful. It suggests that these brands hold significant authority or relevance in the AI’s understanding of that market, making them highly probable candidates for recommendation.
The key takeaway for digital marketing professionals is this: in the world of generative AI, exact rank is a meaningless statistical anomaly, but persistent visibility is a legitimate signal of brand strength and relevance. Visibility percentage becomes the new key performance indicator (KPI) for Generative Engine Optimization (GEO).
Context Matters: How Market Size Affects AI Results
Another crucial finding concerned the relationship between market size and result stability. The study determined that the breadth of the underlying market significantly impacts the level of chaos in the AI’s output.
Niche Stability vs. Category Chaos
In tighter, more specialized markets, the results demonstrated greater stability. For niche B2B tools or regionally specific service providers, the AI models were constrained by a smaller pool of known, highly relevant entities. Consequently, the answers clustered around a more familiar set of names. The fewer available options, the less randomness the probability engine can introduce.
Conversely, in massive, saturated categories—such as general fiction novels, creative agencies in a major metropolitan area, or broad consumer electronics—the results scattered into complete chaos. When the AI has thousands of viable options to choose from, the probabilistic variation skyrockets, making the likelihood of repeated recommendations negligible.
This realization is vital for strategic focus. Marketers in highly specific, constrained industries might find slightly more utility in tracking rudimentary forms of visibility, whereas those in large consumer categories must rely on extremely large-scale data collection to glean any reliable metrics.
The Resilience of Intent Amidst Prompt Chaos
The researchers also dedicated time to studying the nature of the prompts themselves, which were submitted by real human volunteers. This testing phase uncovered a startling discrepancy: human prompts are linguistically disorganized, yet AI models still manage to interpret the core meaning effectively.
The Human Element in Prompting
The prompts generated by the 600 users were often a “mess,” characterized by highly varied phrasing, complex grammar, and inconsistent structures. When analyzing these inputs, the team observed extremely low semantic similarity among prompts intended to convey the same desire. For example, queries related to “finding noise-canceling headphones” might be phrased in dozens of unique, convoluted ways.
Intent Triumphs Over Syntax
Here lies the surprise: despite the profound linguistic chaos of the inputs, the AI tools consistently returned similar *sets* of relevant brands for the same underlying intent.
In the example of headphone recommendations, hundreds of unique, messy prompts still reliably surfaced market leaders such as Bose, Sony, Apple, and Sennheiser most of the time. However, when the underlying user intent changed—moving from general recommendations to highly specific use cases like gaming, podcasting, or extreme noise canceling—the resulting brand set shifted dramatically and accurately to reflect that new specialization.
This suggests that while the output (the recommendation list) is randomized, the input processing capabilities of LLMs are highly effective at capturing semantic intent, reinforcing their value as sophisticated information mediators.
Abandoning AI Position Tracking: A Useless Metric
The most severe conclusion derived from the research is the condemnation of tracking “position” within AI-generated recommendation lists. The instability of the rankings renders position data effectively meaningless for any long-term strategic analysis.
Marketers who rely on tools or internal dashboards that purport to track AI rank movement are measuring statistical noise. The study firmly establishes that measuring minor positional shifts (e.g., moving from rank 3 to rank 4) in an LLM output is a fruitless exercise because that position will almost certainly change again on the very next run.
The market needs to recognize that Generative Engine Optimization (GEO) requires different metrics and assumptions than traditional SEO. Stability, the foundational assumption of classical search engine marketing, simply does not apply to the recommendation lists produced by current-generation LLMs.
Practical Implications for Modern Marketing and SEO
The divergence between the consistency of web search results and the randomness of AI recommendations marks a critical inflection point for digital publishers and brands.
The focus must shift from *where* a brand appears to *how often* it appears when the AI is asked to fulfill a specific user intent.
Developing a Visibility Tracking Strategy
For businesses that rely on digital presence, the study points toward a new, albeit messy, approach to measurement:
1. **Scale Testing:** Instead of running a prompt once a week, marketers must run the same prompt hundreds or even thousands of times, across multiple AI models, to establish a reliable baseline visibility score.
2. **Intent Mapping:** Efforts should be dedicated to mapping all possible intents a user might have that could lead to a brand recommendation. Tracking should cover broad queries, niche queries, and comparative queries.
3. **Benchmark Against Competitors:** The true measure of success is not an absolute visibility percentage, but the brand’s visibility relative to its chief competitors. If a brand consistently appears in 60% of runs and a competitor appears in 30%, that represents a clear competitive advantage in the AI space.
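The competitive benchmark in step 3 reduces to a simple comparison of visibility scores. A trivial sketch; the brand names and scores are hypothetical:

```python
def competitive_gap(visibility_scores, brand, competitor):
    """Difference in visibility percentage between a brand and a competitor,
    given scores aggregated from scaled prompt testing."""
    return visibility_scores.get(brand, 0.0) - visibility_scores.get(competitor, 0.0)

# Hypothetical baseline built from a large number of runs:
scores = {"OurBrand": 0.60, "RivalCo": 0.30, "OtherCo": 0.10}
gap = competitive_gap(scores, "OurBrand", "RivalCo")
# A positive gap (here roughly 30 percentage points) signals a
# competitive advantage in the AI space; a negative gap signals the reverse.
```

The `.get(..., 0.0)` default matters: a competitor that never surfaces in any run has zero visibility, which is itself a meaningful data point.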
This new methodology, centered around scaled visibility, is imperfect and requires substantial computational resources, but it offers the closest approximation of reality available in the current AI environment.
The Path Forward: Unanswered Questions
While the study delivers powerful, actionable insights, Rand Fishkin and the research team acknowledge that several gaps in understanding remain, paving the way for future research:
1. **Reliable Sample Size:** How many prompt runs are necessary to ensure that a measured visibility percentage is statistically reliable? Do 500 runs provide a significantly different result than 5,000 runs? Establishing a statistically valid sample size is crucial for professional measurement tools.
2. **API vs. User Behavior:** Do results generated through public user interfaces (the web chat experience) behave identically to those generated via commercial APIs? Differences in model parameters or “temperature” settings could impact randomness.
3. **Prompt Representation:** How many distinct, unique prompts are truly required to accurately represent all the ways real users in a market might express a specific intent? Understanding the necessary breadth of prompt testing is essential for robust intent mapping.
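The sample-size question has a standard statistical starting point: treating each run as an independent trial, a visibility percentage behaves like a binomial proportion, and its margin of error shrinks with the square root of the run count. A back-of-the-envelope sketch using the normal approximation (this is general statistics, not a method from the study):

```python
import math

def visibility_margin(p, n, z=1.96):
    """Approximate 95% margin of error for a visibility percentage p
    measured over n independent prompt runs (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

# A 60% visibility score measured over 500 vs. 5,000 runs:
print(round(visibility_margin(0.6, 500), 3))   # 0.043 -> roughly +/- 4.3 points
print(round(visibility_margin(0.6, 5000), 3))  # 0.014 -> roughly +/- 1.4 points
```

By this rough estimate, 500 runs already pin a 60% score to within about four points; tenfold more runs only narrows it to about one and a half. The open question is whether runs are truly independent—session context, model updates, or caching could violate that assumption.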
Conclusion
The research unequivocally confirms that AI recommendation lists are inherently random, driven by the probabilistic nature of large language models. This variability is a feature of their design, not a failure.
For marketers, this means discarding the outdated concept of fixed “AI rankings.” The future of competitive intelligence in the generative AI space lies in embracing this randomness and tracking **visibility**—the consistent, aggregate presence of a brand across many queries run repeatedly and at scale. While this measurement strategy is complex and resource-intensive, it represents the most realistic approach to understanding brand performance in a world increasingly mediated by generative AI.
—
*For further details on the methodology and data, the full report is available: [NEW Research: AIs are highly inconsistent when recommending brands or products; marketers should take care when tracking AI visibility](https://sparktoro.com/blog/new-research-ais-are-highly-inconsistent-when-recommending-brands-or-products-marketers-should-take-care-when-tracking-ai-visibility/)*