Google Discover has long been a mysterious driver of massive traffic for publishers, news organizations, and tech blogs. Unlike traditional Google Search, which relies on a user entering a specific query, Discover is a proactive, personalized feed that delivers content based on what it predicts a user will want to see. This predictive nature makes it incredibly lucrative but also notoriously volatile.
Recent SDK-level research by Metehan Yesilyurt has pulled back the curtain on how this system operates. By analyzing the internal signals and telemetry within the Google Discover app framework, Yesilyurt has mapped out a sophisticated, multi-stage pipeline that dictates which articles make it to the feed and which are filtered out before they even have a chance to rank. Understanding this architecture is essential for any digital publisher looking to stabilize their traffic in an era of ever-shifting algorithms.
The Nine-Stage Lifecycle of Google Discover Content
According to the research, content does not simply “appear” in a user’s feed. It undergoes a rigorous nine-stage process on the server side before a single pixel is rendered on a smartphone screen. This pipeline ensures that content is relevant, high-quality, and compliant with Google’s strict safety and interest-matching standards.
The stages of the Discover pipeline are as follows (a simplified code sketch follows the list):
- Crawling and Semantic Understanding: Google’s bots crawl the page to index its content. Beyond simple keyword matching, the system attempts to understand the entities, topics, and overall sentiment of the piece.
- Meta Tag Extraction: The system specifically looks for structured data and meta tags, primarily focusing on Open Graph (og:) tags for titles and images.
- Content Classification: The article is categorized. Is it a breaking news story? Is it a “how-to” evergreen guide? This classification determines which “freshness” rules will apply later.
- Publisher Block Screening: Before matching content to a user, the system checks if the user has previously blocked the publisher. If a block exists, the content is discarded immediately.
- User Interest Matching: The system compares the article’s topics against the user’s documented interests, search history, and app usage.
- Predicted Click-Through Rate (pCTR) Modeling: An AI model on Google’s servers estimates the likelihood of the user clicking the story based on its title, image, and the user’s past behavior.
- Feed Layout Assembly: The system decides where the card will sit in the feed and whether it will be a large feature card or a smaller thumbnail.
- Content Delivery: The content is pushed to the user’s device.
- Feedback Recording: The system monitors whether the user clicks, dismisses, or ignores the content, using this data to refine future ranking decisions.
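To make the ordering concrete, here is a minimal Python sketch of the candidate-selection flow. Every name and data structure is an illustrative assumption based on the stages above, not Google's actual code, and stages 1 through 3 are assumed to have already populated the article's fields.

```python
from dataclasses import dataclass, field

@dataclass
class Article:
    url: str
    publisher: str
    topics: list[str]  # filled in by crawling and classification (stages 1-3)

@dataclass
class User:
    interests: set[str]
    blocked_publishers: set[str] = field(default_factory=set)

def discover_candidates(articles: list[Article], user: User) -> list[Article]:
    """Hypothetical sketch of the candidate-selection flow (stages 4-5)."""
    eligible = []
    for article in articles:
        # Stage 4: publisher block screening -- a binary filter that runs
        # BEFORE interest matching or pCTR ranking ever sees the article.
        if article.publisher in user.blocked_publishers:
            continue
        # Stage 5: user interest matching against the article's topics.
        if not user.interests.intersection(article.topics):
            continue
        eligible.append(article)
    # Stage 6 (pCTR modeling) would score `eligible` next; stages 7-9
    # (layout assembly, delivery, feedback recording) happen downstream.
    return eligible
```

Note how the block check sits ahead of everything else that is user-specific; that ordering is the subject of the next section.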
The Power of the Publisher Block: A Pre-Ranking Hurdle
One of the most significant findings in the research is the placement of the publisher block in the pipeline. Many SEOs believe that blocks are just one of many signals used in ranking. However, the data reveals that publisher-level blocks happen before interest matching and ranking even begin.
When a user selects “Don’t show content from [Site Name],” that site is effectively dead to that specific user across the entire Discover ecosystem. This is a binary filter, not a weighted signal. There is no equivalent “sitewide boost” mechanism that a user can trigger to ensure they always see a specific site. While a user can “Follow” a topic or a publisher, the “Block” function is a far more powerful technical tool used by the system to prune the candidate pool of content before the heavy lifting of AI ranking begins.
For publishers, this means that even a few days of “clickbait-y” or low-quality content can lead to a wave of user blocks that permanently shrink their potential audience size on Discover. Once a user blocks a domain, regaining that real estate is nearly impossible.
Technical Prerequisites: Meta Tags and Image Quality
Google Discover is a highly visual medium. The research highlights that the system relies heavily on specific page-level metadata to build its cards. If these elements are missing or poorly implemented, the content may be disqualified from the feed entirely.
The 1200px Image Rule
To qualify for large, high-engagement cards, Google requires images to be at least 1200 pixels wide. While smaller images might still appear, they are relegated to small thumbnails next to the headline. Data shows that large cards receive significantly higher click-through rates. If your CMS is serving low-resolution featured images or failing to specify them in the Open Graph tags, you are effectively capping your Discover potential.
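If you want to audit this on your own pages, a simple check might look like the sketch below, which uses the requests and Pillow libraries. The 1200px threshold is the documented rule; the audit logic itself is an assumption about your tooling, not a reconstruction of Google's checks.

```python
from io import BytesIO

import requests
from PIL import Image  # pip install pillow requests

MIN_LARGE_CARD_WIDTH = 1200  # minimum width for large Discover cards

def qualifies_for_large_card(image_url: str) -> bool:
    """Fetch an og:image and check it against the 1200px width rule."""
    response = requests.get(image_url, timeout=10)
    response.raise_for_status()
    width, _height = Image.open(BytesIO(response.content)).size
    return width >= MIN_LARGE_CARD_WIDTH

# Example: print(qualifies_for_large_card("https://example.com/og-image.jpg"))
```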
The “Kill” Tags: Notranslate and Nopagereadaloud
The research identified two specific meta tags that act as hard blockers for Discover eligibility: “notranslate” and “nopagereadaloud”. If Google detects either tag, the page is typically excluded from the Discover pipeline. The logic is that Google wants Discover content to be as accessible and versatile as possible within its ecosystem. If a publisher restricts Google’s ability to translate the page or read it aloud via Assistant, the system views the content as “low utility” for the Discover platform.
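A quick way to audit a page for these tags is to scan its meta elements. The sketch below assumes the common form <meta name="google" content="notranslate">; how Google parses these tags internally is not public.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Tokens the research flags as Discover blockers. The meta-tag form
# assumed here is <meta name="google" content="...">.
KILL_TOKENS = {"notranslate", "nopagereadaloud"}

def find_kill_tags(html: str) -> set[str]:
    """Return any Discover-blocking tokens found in a page's meta tags."""
    soup = BeautifulSoup(html, "html.parser")
    found = set()
    for meta in soup.find_all("meta"):
        content = (meta.get("content") or "").lower()
        for token in content.replace(",", " ").split():
            if token in KILL_TOKENS:
                found.add(token)
    return found

html = '<head><meta name="google" content="notranslate"></head>'
print(find_kill_tags(html))  # {'notranslate'}
```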
Backup Metadata Logic
Google prioritizes the og:title tag for headlines. However, the research shows a clear fallback hierarchy. If the og:title is missing, the system will look for Twitter card titles, and failing that, the standard HTML <title> tag. While the system is flexible, relying on fallbacks can lead to truncated or poorly formatted headlines that hurt your pCTR.
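A rough reconstruction of that fallback hierarchy, as a sketch using BeautifulSoup, might look like this; the resolution order comes from the research, while the parsing details are assumptions.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def resolve_card_title(html: str) -> str | None:
    """Mimic the described fallback order: og:title, then
    twitter:title, then the standard HTML <title> tag."""
    soup = BeautifulSoup(html, "html.parser")

    og = soup.find("meta", property="og:title")
    if og and og.get("content"):
        return og["content"]

    tw = soup.find("meta", attrs={"name": "twitter:title"})
    if tw and tw.get("content"):
        return tw["content"]

    return soup.title.string if soup.title else None
```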
The Ranking Model: Understanding pCTR
Once a piece of content passes the initial filters, it enters the ranking phase. The core of this phase is the Predicted Click-Through Rate (pCTR) model. This is a server-side calculation that attempts to guess the future.
The model uses several signals to calculate this probability (a toy model follows the list):
- Historical Performance: How have previous URLs from this domain performed? If your site has a history of high engagement, your new content starts with a “trust” advantage.
- Image Quality and Loading: The system checks if the image URL is valid and if the image has a history of loading successfully. Broken images are a fast track to being filtered out.
- Title Sentiment and Relevance: The model analyzes the title’s effectiveness in attracting clicks without crossing the line into forbidden clickbait territory (which triggers user dismissals).
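To illustrate how such signals might combine, here is a toy logistic model. The real model, its features, and its weights are server-side and unobservable; every coefficient below is an invented placeholder.

```python
import math

def predicted_ctr(domain_trust: float, image_ok: bool, title_score: float) -> float:
    """Toy logistic model over the three signal families above.

    All weights are illustrative assumptions, not Google's values.
    """
    z = (
        1.5 * domain_trust              # historical domain performance, 0..1
        + (0.8 if image_ok else -3.0)   # a broken image is heavily penalized
        + 1.2 * title_score             # title effectiveness, 0..1
        - 1.0                           # bias term
    )
    return 1.0 / (1.0 + math.exp(-z))   # squash to a 0..1 probability

print(f"{predicted_ctr(0.7, True, 0.6):.2f}")   # strong candidate (~0.83)
print(f"{predicted_ctr(0.7, False, 0.9):.2f}")  # broken image tanks it (~0.13)
```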
Because this model is running on Google’s servers and not within the app itself, it is difficult for publishers to see the direct “score” of their articles. However, the telemetry shows that Google is constantly tweaking these weights based on real-time user feedback.
The Freshness Decay: The Race Against Time
Google Discover is primarily a news and interest engine, which means “freshness” is a dominant ranking factor. The research mapped out a specific decay schedule that content typically follows (sketched in code after the breakdown below):
1 to 7 Days: The Peak Window
This is where the strongest “freshness boost” occurs. New content is prioritized to ensure the feed feels current. Most viral Discover traffic happens within the first 48 hours of publication.
8 to 14 Days: The Moderate Phase
Visibility begins to drop significantly. Content in this window usually appears only to users whose deep, specific interest in the topic has not yet been satisfied by newer content.
15 to 30 Days: Limited Visibility
Visibility becomes sporadic. Only extremely high-performing pieces with sustained engagement remain in the feed.
30+ Days: Gradual Decline and Evergreen Classification
After 30 days, most content falls out of the feed. However, there is a separate classification for “Evergreen” content. If a piece of content is identified as high-value and non-time-sensitive (e.g., a “Best of” list or a comprehensive guide), it can bypass the freshness decay and resurface months or even years later when a user shows a new interest in that topic.
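Expressed as code, the schedule might translate into a ranking multiplier like the one below. The time bands come from the research; the numeric multipliers are illustrative assumptions.

```python
def freshness_multiplier(age_days: int, is_evergreen: bool = False) -> float:
    """Map the decay schedule to an illustrative ranking multiplier."""
    if is_evergreen:
        return 1.0   # evergreen content bypasses the decay entirely
    if age_days <= 7:
        return 1.0   # peak window: full freshness boost
    if age_days <= 14:
        return 0.5   # moderate phase: visibility drops sharply
    if age_days <= 30:
        return 0.2   # limited visibility: sporadic appearances
    return 0.05      # 30+ days: mostly out of the feed

for day in (2, 10, 20, 45):
    print(day, freshness_multiplier(day))
```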
Real-Time Volatility and Server-Side Experiments
Many publishers express frustration over the “hero to zero” nature of Discover traffic. One day a site gets 500,000 visitors, and the next, it gets 500. The research explains why this volatility is a feature, not a bug, of the system.
At any given moment, Google is running hundreds of simultaneous experiments on the Discover feed. During the observation period, researchers noted approximately 150 server-side experiments and over 50 feature controls running at once. These experiments might change how cards are ranked, how images are displayed, or how much weight is given to a specific interest category.
This means that two users with identical interests might see completely different feeds because they are in different “experiment buckets.” For publishers, this creates an environment where performance can shift overnight without any change to the website’s content or technical setup. Success in Discover requires an understanding that volatility is the baseline.
Personalization and Permanent User Actions
The personalization layer is what makes Discover unique. It uses “NAIADES,” a system mentioned in the research that processes user behavior signals into actionable feed adjustments. Personalization is driven by:
- Active Signals: When a user “Follows” a topic in Google Search or “Saves” an article to their reading list.
- Passive Signals: The amount of time a user spends reading an article after clicking it from the feed.
- Permanent Dismissals: If a user swipes away a card or selects “Not interested,” that specific URL is permanently hidden from that user. Unlike a site-wide block, a dismissal is URL-specific, but it still feeds a negative signal back into the pCTR model, as the sketch after this list illustrates.
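The scope difference between a block and a dismissal can be captured in a small sketch. The data structures are assumptions; the domain-wide versus URL-specific distinction is the point from the research.

```python
from dataclasses import dataclass, field

@dataclass
class UserFeedState:
    """Illustrative sketch of the two permanent negative actions."""
    blocked_publishers: set[str] = field(default_factory=set)  # domain-wide
    dismissed_urls: set[str] = field(default_factory=set)      # single URL

    def is_hidden(self, url: str, publisher: str) -> bool:
        # A block hides everything from the domain; a dismissal hides one URL.
        return publisher in self.blocked_publishers or url in self.dismissed_urls

state = UserFeedState()
state.dismissed_urls.add("https://example.com/story-1")
print(state.is_hidden("https://example.com/story-1", "example.com"))  # True
print(state.is_hidden("https://example.com/story-2", "example.com"))  # False
```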
The feed is also dynamic. It isn’t a static list generated once per day; the system can add, remove, or reorder cards in real time as the user scrolls or as breaking news arrives. This “live” nature keeps the feed feeling current, but it also means content can be “bumped” out of view by newer, more relevant stories within seconds.
Actionable Takeaways for Publishers and SEOs
Based on these research findings, publishers can move away from “guessing” and toward a data-driven strategy for Google Discover optimization.
Optimize for the pCTR Model
Since the system predicts your click-through rate before showing your content to a wider audience, your headlines and images must be top-tier. Use compelling, curiosity-driven headlines that remain honest to the content. Avoid “extreme” clickbait, as the subsequent user dismissals and low dwell time will eventually tank your domain’s trust score in the Discover pipeline.
Prioritize High-Resolution Visuals
If you are not using 1200px wide images, you are losing money. Ensure your CMS automatically generates these sizes and correctly identifies them in your og:image meta tags. Test your pages in the Rich Results Test tool to ensure Google can see and process your images correctly.
Audit Your Meta Tags
Check your site for the “notranslate” and “nopagereadaloud” tags. While these might be used for specific technical reasons, they are often leftover code from old plugins that could be silently throttling your Discover reach. Additionally, ensure your Open Graph tags are fully populated so Google doesn’t have to rely on fallback HTML tags.
Focus on Topic Authority
Because interest matching is a core stage of the pipeline, sites that focus on specific niches often perform better than “general” news sites in the long run. If Google classifies your site as an authority on “Gaming Tech,” your content will have a higher baseline pCTR for users interested in that specific entity.
Expect and Plan for Volatility
Do not build a business model that relies solely on Discover traffic. Given the 150+ experiments running at any time, a sudden drop in traffic might not be a “penalty”—it might simply be a shift in the current experimental framework. Use Discover as a “bonus” traffic source while building a stable foundation through search SEO and direct audience engagement.
Google Discover remains one of the most complex components of the modern web ecosystem. It is a blend of traditional crawling, advanced AI prediction, and real-time user feedback. By understanding the multi-stage pipeline—from the initial crawl to the final user dismissal—publishers can better position their content to thrive in this high-reward environment.