How Researchers Reverse-Engineered LLMs For A Ranking Experiment via @sejournal, @martinibuster

Understanding the Shift from Search Engines to Generative Engines

The landscape of digital information retrieval is undergoing its most significant transformation since the inception of the World Wide Web. For decades, Search Engine Optimization (SEO) has been the primary vehicle for visibility, focusing on keywords, backlinks, and technical site health to appease Google’s algorithms. However, the rise of Large Language Models (LLMs) like GPT-4, Claude, and Gemini has introduced a new paradigm: Generative Engine Optimization (GEO).

As users increasingly turn to AI chatbots and generative search experiences—such as Perplexity AI or Google’s Search Generative Experience (SGE)—the goal for marketers and developers has shifted. It is no longer enough to rank on the first page of search results; brands now need to be the “chosen” answer generated by an LLM. To understand how to achieve this, researchers have begun reverse-engineering the internal ranking mechanisms of these models, exploring complex methodologies such as Shadow Models and Query-based solutions.

These experiments are crucial because LLMs operate as “black boxes.” Unlike traditional search engines that follow relatively predictable (though complex) rules, LLMs generate responses based on probabilistic weights and attention mechanisms. Understanding how to influence these outputs requires a scientific approach to reverse-engineering the logic behind LLM preferences.

The Challenge of LLM Ranking Transparency

Traditional SEOs are accustomed to tools like Ahrefs, Semrush, and Google Search Console that provide data on rankings and traffic. In the world of LLMs, this data is largely non-existent. When an LLM recommends a specific product or cites a particular source, it isn’t always clear why that source was prioritized over others. Is it because of the source’s authority, the semantic relevance of the text, or the specific way the query was phrased?

Researchers investigating this problem face the challenge of non-determinism. If you ask an LLM the same question twice, you might get two slightly different answers. This variability makes it difficult to pinpoint specific ranking factors. To combat this, researchers have developed frameworks to isolate variables and test how different inputs affect the final output. This is where the concepts of Shadow Models and Query-based solutions come into play.

Deep Dive into Shadow Models

One of the most sophisticated ways researchers are reverse-engineering LLMs is through the use of Shadow Models. A Shadow Model is essentially a smaller, more transparent model trained or fine-tuned to mimic the behavior of a larger, “target” model (like GPT-4). By observing how the target model responds to thousands of prompts, researchers can create a proxy that behaves similarly but allows for much deeper inspection.

The Architecture of a Shadow Model

Shadow Models work on the principle of knowledge distillation. Because researchers cannot see the internal weights of a proprietary model, they treat the model as an oracle. They feed the oracle a vast array of queries and record the responses. They then train a secondary model on these input-output pairs. Once the Shadow Model reaches a high level of parity with the original, researchers can analyze the Shadow Model’s decision-making process.
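The distillation loop above can be sketched at toy scale. In this illustrative example, a hidden scoring rule stands in for the proprietary oracle (a real experiment would query an LLM API), and a grid search over a simple two-threshold rule plays the role of the shadow model; all names and thresholds are assumptions for demonstration.

```python
# Toy sketch of the knowledge-distillation loop behind a "shadow model".
import random

def oracle(length, fact_density):
    """Opaque target model stand-in: prefers short, fact-dense text."""
    return 1 if fact_density > 0.5 and length < 800 else 0

# Step 1: probe the oracle with many inputs, recording input-output pairs.
random.seed(0)
samples = [(random.randint(100, 2000), random.random()) for _ in range(500)]
labels = [oracle(l, d) for l, d in samples]

# Step 2: fit a transparent proxy (here, a grid search over a simple
# two-threshold rule) to those input-output pairs.
best_acc, best_rule = 0.0, None
for max_len in range(200, 2001, 100):
    for min_density in [i / 10 for i in range(1, 10)]:
        preds = [1 if d > min_density and l < max_len else 0
                 for l, d in samples]
        acc = sum(p == t for p, t in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_acc, best_rule = acc, (max_len, min_density)

# Step 3: unlike the oracle, the fitted rule is fully inspectable.
print(f"shadow rule: length < {best_rule[0]}, density > {best_rule[1]}, "
      f"agreement {best_acc:.2f}")
```

Once the proxy agrees closely with the oracle, its thresholds can be read off directly, which is exactly the kind of inspection a black-box API never allows.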

This method allows researchers to identify “activation patterns.” For instance, they can see which parts of a prompt trigger the model to prioritize a specific type of information. This insight is invaluable for understanding how an LLM evaluates the “quality” of a piece of content before including it in a generative summary.

Advantages of Using Shadow Models

The primary advantage of a Shadow Model is control. In a live environment, testing a large-scale LLM is expensive and slow. A Shadow Model can be run locally, allowing for rapid-fire testing of different optimization strategies. Furthermore, Shadow Models help identify “biases” in the original model. If a Shadow Model consistently ranks shorter, more concise answers higher, it likely reflects a preference ingrained in the larger model’s training data.

The Role of Query-Based Solutions

While Shadow Models focus on replicating the model itself, Query-based solutions focus on the interaction between the user and the model. This approach is more practical for the average SEO professional because it doesn’t require training a secondary AI. Instead, it involves the systematic manipulation of prompts and the retrieved context (often referred to as the “context window”) to see what sticks.

Understanding Retrieval-Augmented Generation (RAG)

To understand Query-based solutions, one must understand Retrieval-Augmented Generation (RAG). Most modern LLM search experiences don’t rely solely on the model’s pre-trained knowledge. Instead, when a user asks a question, the system searches the web (or a specific database) for relevant documents, feeds those documents into the LLM, and asks the LLM to summarize them.
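The retrieval half of that pipeline can be illustrated with a minimal sketch. Word-overlap scoring stands in for a real embedding model here (an assumption for readability; production RAG systems use vector search), and the corpus is invented.

```python
# Minimal sketch of the retrieval step in a RAG pipeline, using simple
# word-overlap scoring in place of a real embedding model.
def score(query, doc):
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)

def retrieve(query, corpus, k=2):
    # Rank documents by overlap with the query and keep the top k.
    ranked = sorted(corpus, key=lambda d: score(query, d), reverse=True)
    return ranked[:k]

corpus = [
    "Cloud computing offers scalability and low latency provisioning.",
    "A recipe for sourdough bread with a long fermentation.",
    "Latency and scalability are key cloud computing trade-offs.",
]
context = retrieve("cloud computing latency", corpus)

# The retrieved documents become the context the LLM is asked to summarize.
prompt = "Answer using only this context:\n" + "\n".join(context)
print(prompt)
```

The interesting question for GEO is what happens after this step: given two equally relevant retrieved documents, which one does the model actually quote?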

Query-based experiments look at how the LLM decides which part of that retrieved text to emphasize. Researchers test different variables such as:

  • Semantic Density: Does the model prefer text that is packed with facts, or text that flows naturally?
  • Citation Placement: Does placing a brand name at the beginning of a paragraph increase the likelihood of it being mentioned in the AI’s response?
  • Authority Signals: Does the inclusion of expert quotes or statistical data within the retrieved text improve its “ranking” within the LLM’s output?
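A citation-placement experiment along these lines can be set up by generating context variants that differ only in where the brand sentence appears. In this hedged sketch, the brand name and sentences are hypothetical, and the actual LLM call is omitted; a real run would send each variant to the model and compare mention rates.

```python
# Sketch of a query-based experiment on citation placement: build context
# variants with the brand mention at the start, middle, and end.
def build_variants(sentences, brand_sentence):
    variants = {}
    for label, pos in [("start", 0),
                       ("middle", len(sentences) // 2),
                       ("end", len(sentences))]:
        body = sentences[:pos] + [brand_sentence] + sentences[pos:]
        variants[label] = " ".join(body)
    return variants

base = ["Widgets reduce assembly time.",
        "They are common in manufacturing.",
        "Costs vary by supplier."]
variants = build_variants(base, "AcmeCorp makes the fastest widgets.")
for label, text in variants.items():
    print(label, "->", text)
```

Because the variants are identical except for placement, any difference in how often the model cites the brand can be attributed to position rather than content.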

The Effectiveness of Prompt Engineering

Query-based solutions also involve “jailbreaking” or probing the model’s instructions. By using specific phrasing, researchers can force the model to reveal its prioritization logic. For example, asking the model to “Compare these three sources and explain why one is better than the others” can provide direct insight into the internal evaluation criteria the LLM is using at that moment.

Comparative Analysis: Shadow Models vs. Query-Based Solutions

The research suggests that both methods are essential but serve different purposes. Shadow Models are excellent for broad, foundational research. They help us understand the “psychology” of the AI—what it values at a structural level. This is useful for long-term content strategy and understanding the inherent limitations of different LLM architectures.

On the other hand, Query-based solutions are more tactical. They are highly effective for “live” optimization. Because LLMs are updated frequently, a Shadow Model can quickly become outdated. Query-based testing allows for real-time adjustments to content to ensure it remains visible in generative search results. Researchers found that while Shadow Models provided better theoretical insights, Query-based solutions often yielded more immediate “ranking” improvements in experimental settings.

Key Findings from the Ranking Experiment

The experiment yielded several surprising insights into how LLMs process and rank information. These findings challenge some traditional SEO assumptions and provide a roadmap for future optimization efforts.

The Power of Relevance Over Authority

In traditional search, a high-authority domain (like a major news site) can often rank for a keyword even if the content isn’t perfectly aligned with the user’s intent. The LLM experiment suggested that generative models prioritize semantic relevance over raw domain authority. If a smaller, less-known site provides a more direct and contextually accurate answer to a specific part of a query, the LLM is more likely to use that information in its response.

The “Middle-Ground” Bias

Researchers noticed an interesting phenomenon often called the “Lost in the Middle” effect. When an LLM is given a large amount of information to process, it tends to remember and prioritize information at the very beginning and the very end of the text. Information buried in the middle of a long article is frequently ignored. This has massive implications for content structure; key takeaways and brand mentions should be placed strategically at the start or conclusion of content blocks.
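One practical response to this effect is to restructure content blocks so the key claim occupies the high-attention positions. The sketch below uses a marker string to find the key sentence, which is an illustrative heuristic rather than anything from the experiment itself.

```python
# Sketch of restructuring a content block against the "Lost in the
# Middle" effect: surface the key sentence at the start and repeat it
# at the end, the two positions models attend to most.
def front_load(sentences, key_marker):
    key = [s for s in sentences if key_marker in s]
    rest = [s for s in sentences if key_marker not in s]
    return key + rest + key

block = ["Background on the market.",
         "More history and context.",
         "Key takeaway: AcmeCorp leads in speed.",
         "Further supporting detail."]
restructured = front_load(block, "Key takeaway")
print(restructured[0], "...", restructured[-1])
```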

Source Diversity and Agreement

The experiment also showed that LLMs are more likely to include information if it is corroborated by multiple sources within the retrieved context. If three different articles all state the same fact, the LLM views that fact as highly “probable” and is more likely to include it. This suggests that a “consensus-building” strategy—ensuring your information is consistent across multiple platforms—is vital for LLM ranking.
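A simple way to model this corroboration effect is a consensus score: the fraction of retrieved sources that state a given fact. This is a simplified proxy for the observed behavior, not the researchers' actual metric, and the facts below are invented.

```python
# Sketch of a consensus score: facts corroborated by more retrieved
# sources receive higher weight.
from collections import Counter

def consensus_weights(sources):
    counts = Counter(fact for facts in sources for fact in set(facts))
    n = len(sources)
    return {fact: c / n for fact, c in counts.items()}

sources = [
    {"AcmeCorp founded 2010", "Widgets cut costs"},
    {"AcmeCorp founded 2010", "Market is growing"},
    {"AcmeCorp founded 2010"},
]
weights = consensus_weights(sources)
print(weights["AcmeCorp founded 2010"])  # stated by all three sources
```

Under this scoring, a fact repeated across all sources is treated as far more "probable" than one appearing only once, which is the consensus-building behavior the experiment observed.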

Practical Strategies for Generative Engine Optimization (GEO)

Based on the researchers’ reverse-engineering efforts, how can digital publishers and marketers adapt? The following strategies are designed to align content with the ranking preferences observed in both Shadow Model and Query-based experiments.

1. Optimize for “N-gram” Relevance

LLMs predict the next token (word or part of a word) based on probability. To rank well, your content should use the language that the LLM “expects” to see when discussing a specific topic. This isn’t just about keywords; it’s about using the semantic clusters associated with a subject. If you are writing about “Cloud Computing,” using related terms like “scalability,” “latency,” and “provisioning” in close proximity helps the LLM recognize your content as a high-quality source.
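One rough way to audit a draft against this idea is a cluster-coverage check: what fraction of the terms a model would "expect" around a topic actually appear in the text. The cluster list below is illustrative, not derived from a real model.

```python
# Sketch of a semantic-cluster coverage check for a content draft.
import re

def cluster_coverage(text, cluster):
    # Tokenize on letters only so punctuation doesn't hide matches.
    words = set(re.findall(r"[a-z]+", text.lower()))
    hits = [t for t in cluster if t in words]
    return len(hits) / len(cluster), hits

draft = ("Our cloud computing platform emphasizes scalability, "
         "low latency, and automated provisioning.")
cluster = ["scalability", "latency", "provisioning", "virtualization"]
coverage, hits = cluster_coverage(draft, cluster)
print(f"coverage: {coverage:.2f}, matched: {hits}")
```

A low score suggests the draft is missing vocabulary the model associates with the topic; real tooling would use embeddings rather than exact string matches.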

2. Structure Content for Easy Extraction

Since LLMs often use RAG to pull information, your content must be easy for a machine to parse. Use clear headings, bullet points, and concise summaries. The experiment showed that models are more likely to correctly attribute information that is presented in a structured format. Think of your content as a set of modular “data nuggets” that an AI can easily pick up and move into its response.
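The "data nugget" idea can be made concrete with a small parser that splits markdown-style content into self-contained units keyed by heading, the kind of modular chunk a retrieval system can lift cleanly. The document below is a made-up example.

```python
# Sketch of splitting heading-structured content into self-contained
# "data nuggets" that a RAG system can extract independently.
def extract_nuggets(markdown):
    nuggets, heading, buf = {}, None, []
    for line in markdown.splitlines():
        if line.startswith("#"):
            if heading:
                nuggets[heading] = " ".join(buf).strip()
            heading, buf = line.lstrip("# ").strip(), []
        elif line.strip():
            buf.append(line.strip())
    if heading:
        nuggets[heading] = " ".join(buf).strip()
    return nuggets

doc = """# Pricing
AcmeCorp widgets start at $10.

# Speed
Assembly takes under five minutes."""

nuggets = extract_nuggets(doc)
print(nuggets["Speed"])
```

Content that parses cleanly into such units is easier for a machine to attribute correctly than a single undifferentiated wall of text.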

3. Increase “Citability”

One of the metrics used in the experiment was “citability”—how easy it is for the model to link a specific claim back to a source. You can improve this by using clear, declarative statements. Instead of saying, “It is often thought that our product might be the fastest,” say, “[Brand Name] is the fastest [Product Category] according to [Year] benchmarks.” This makes it much easier for the LLM to provide a direct citation.
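The before/after contrast above can be turned into a simple citability lint that flags hedged phrasing. The hedge list is an illustrative heuristic, not the validated metric from the experiment, and the claims are hypothetical.

```python
# Sketch of a citability lint: flag hedged phrasing that makes a claim
# hard for a model to attribute to a source.
HEDGES = ["it is often thought", "might be", "some say", "arguably"]

def is_citable(sentence):
    s = sentence.lower()
    return not any(h in s for h in HEDGES)

claims = [
    "It is often thought that our product might be the fastest.",
    "AcmeCorp is the fastest widget maker according to 2024 benchmarks.",
]
for c in claims:
    print(is_citable(c), "-", c)
```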

The Technical Future of LLM Reverse-Engineering

As LLMs become more complex, the methods used to reverse-engineer them will also evolve. We are likely to see the rise of “Automated SEO Agents”—AI tools specifically designed to run thousands of Query-based tests every hour to find the optimal way to phrase content for different LLMs. These agents will use the principles of the Shadow Model to predict how an update to an LLM (like moving from GPT-4 to GPT-5) will affect existing rankings.

Furthermore, the integration of multi-modal capabilities (images, video, and audio) into LLMs will add another layer of complexity. Researchers will need to determine how an LLM “ranks” an image within a generative response. Does it look at alt-text, or is it performing its own visual analysis to determine relevance?

Ethical Considerations in LLM Manipulation

Reverse-engineering LLMs for ranking purposes raises important ethical questions. If marketers can figure out exactly how to “game” the AI to always recommend their products, does the utility of the AI for the end-user decrease? Much like the early days of keyword stuffing in SEO, there is a risk that GEO could lead to a decrease in the quality of information if not managed properly.

However, the researchers behind these experiments argue that understanding these models is the only way to ensure a level playing field. If only the developers of the LLMs understand the ranking factors, they hold an absolute monopoly over digital visibility. Reverse-engineering democratizes this knowledge, allowing smaller creators and businesses to compete with larger entities.

Adapting to the New Reality

The experiment comparing Shadow Models and Query-based solutions marks a turning point in our understanding of AI-driven search. We are moving away from the era of “guessing” what works and into an era of data-driven optimization for generative models. The transition from SEO to GEO is not just a change in tactics; it is a change in how we think about the relationship between human language and machine probability.

To stay ahead, businesses must move beyond traditional keyword strategies and start thinking about how their content fits into the latent space of an LLM. By focusing on semantic relevance, structural clarity, and consensus-based authority, brands can ensure they remain visible in a world where the search box is being replaced by a conversation. The work of these researchers provides the first real blueprint for navigating this complex, ever-changing digital frontier.

As we look forward, the key to success will be agility. The “ranking factors” of an LLM are not static; they are the result of ongoing training and reinforcement learning from human feedback. Continuous experimentation—using both the deep insights of Shadow Models and the practical applications of Query-based testing—will be the hallmark of the next generation of successful digital marketing.
