What Data Does AI Overview Use? Sources Behind Google’s Generated Answers

Google’s AI Overview generates its answers from a combination of the live web index, structured data on cited pages, the Knowledge Graph, real-time retrieval at query time, and the underlying Gemini model’s pretraining data. It does not draw from a separate “AI dataset”; it draws from the same Search infrastructure that produces the classical SERP, with additional signals layered on top to decide which passages to extract and which sources to cite.

Understanding the data layer matters because the levers a publisher can actually pull are tied to it. Pretraining data is fixed; you cannot influence what the model already learned. The web index, structured data, Knowledge Graph entries, and the passages an engine can extract at query time are all influenceable by content and technical SEO work. Sites that want to be cited inside AI Overviews work on the influenceable layers; sites that misunderstand the architecture often spend effort on layers they cannot move.

This article walks through each data source AI Overview uses, what each contributes to the generated answer, and which are publisher-influenceable versus fixed inside the model.

Key Takeaways

  • Pretraining data is fixed and not directly influenceable; the live retrieval and structured data layers are where publisher work has impact.
  • Citations in AI Overview come from real-time web retrieval, not from the model’s training set – which is why fresh, well-structured content can be cited within days.
  • AI Overview blends multiple data sources at query time: the web index, structured data, the Knowledge Graph, real-time retrieval, and Gemini’s pretraining data.

The Gemini model’s pretraining data

The first data layer is the pretraining corpus of the underlying Gemini model. This corpus contains a large slice of public web content, books, code, and other text material captured up to a model-specific cutoff date. Pretraining data is what gives the model its general language ability, factual breadth, and reasoning patterns.

This layer is fixed at training time. Publishers cannot edit what is already in the corpus, cannot remove their content from past snapshots, and cannot influence which model version is in use. Pretraining data shapes how the model phrases answers, what background knowledge it brings to a query, and which entities it recognises by default – but it is not where AI Overview citations come from.

The implication for SEO is that the pretraining layer is largely background. Optimisation work targets the live layers below.

The live web index and real-time retrieval

When a user runs a query that triggers an AI Overview, Google performs real-time retrieval against the live web index – the same index that produces the classical SERP. Retrieval pulls a candidate set of pages relevant to the query, then a passage-extraction layer scans those pages for high-confidence answer spans the model can quote, paraphrase, or cite.

Citations in AI Overview come from this live retrieval step, not from the pretraining corpus. That is why a freshly published article can be cited within days – the model does not need to be retrained to surface it. It only needs to be indexed, judged relevant for the query, and contain extractable passages.

The retrieval layer is where classical SEO signals carry over. Pages that rank well organically are more likely to be in the AI Overview candidate set. Pages that do not rank are usually not retrieved, regardless of how well-written their content is. This is why classical SEO remains a prerequisite even for AI-search-focused sites.

Structured data and schema markup

Structured data on a page – Article, FAQPage, HowTo, Product, Organization, and other schema.org types – acts as an explicit signal to Google about what the page contains. AI Overview uses these signals at two stages: candidate selection (does this page have content matching the query intent?) and passage extraction (which discrete answers are marked up cleanly enough to lift?).

FAQPage markup is the key example. A page with FAQPage JSON-LD that lists a question matching the user’s query gives Google a clean Q&A pair the model can use directly. The same Q&A in plain HTML without schema is less explicit and competes with all the other passages on the page.
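As an illustration, a minimal FAQPage JSON-LD block of the kind described above looks like this – the question and answer text here are placeholder examples, and in practice they should match the visible Q&A on the page:

```json
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What data does AI Overview use?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "AI Overview combines the live web index, structured data on cited pages, the Knowledge Graph, real-time retrieval, and the Gemini model's pretraining data."
      }
    }
  ]
}
```

The block goes in a `<script type="application/ld+json">` tag in the page’s HTML. Each `Question`/`acceptedAnswer` pair is exactly the kind of clean, self-contained answer span the extraction layer can lift.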

Article schema with author, publisher, and datePublished fields establishes provenance signals that influence trust scoring. Organization schema helps tie the page back to a known entity in the Knowledge Graph. None of these guarantee citation – they raise the probability by giving the system more interpretable signals to work with.
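A sketch of Article markup carrying those provenance fields – all names, dates, and URLs below are placeholders:

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Example article title",
  "datePublished": "2026-01-15",
  "author": {
    "@type": "Person",
    "name": "Jane Doe"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Example Publisher",
    "url": "https://www.example.com"
  }
}
```

Nesting the publisher as an Organization object ties the article’s provenance to an entity the system can resolve, rather than leaving the publisher as an unstructured byline.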

The Knowledge Graph and entity grounding

Google’s Knowledge Graph is a structured database of named entities – people, companies, places, concepts, products – and the relationships between them. AI Overview uses the Knowledge Graph to ground entity mentions in the user’s query and in candidate pages, so that “Apple” the company is correctly disambiguated from “apple” the fruit and so that synonyms map to the same underlying entity.

Pages whose content references entities clearly – using the canonical entity name, surfacing entity attributes, and linking to or being linked from authoritative entity pages – inherit some of that grounding. The model can reason about the page in terms of known entities rather than treating it as an undifferentiated text blob. That recognisability often translates into more frequent citation on entity-related queries.

Publishers cannot directly edit the Knowledge Graph, but they can build the kinds of structured signals (Organization schema, sameAs links, entity-clear writing) that help their entity become recognised and well-attributed over time.
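A sketch of what those entity signals look like in markup – every value here is a placeholder:

```json
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Example Co",
  "url": "https://www.example.com",
  "sameAs": [
    "https://en.wikipedia.org/wiki/Example_Co",
    "https://www.linkedin.com/company/example-co"
  ]
}
```

The sameAs array declares which external records – Wikipedia pages, Wikidata entries, social profiles – refer to the same organisation, which over time helps Google consolidate the entity and attribute the site’s content to it.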

Other quality and freshness signals

Beyond the four primary data layers, AI Overview also weighs a set of secondary signals that influence which sources are cited:

  • Freshness – how recently the page was published or updated, especially for time-sensitive queries.
  • Source authority – signals carried over from classical ranking, such as link patterns and site-level trust.
  • Structured E-E-A-T cues – a named author with expertise signals and clear publisher information.
  • Content depth – whether the page covers the topic substantively or thinly.

The freshness signal is particularly visible in AI Overview behaviour. On news, product, and rapidly evolving topics, the model heavily favours recent sources – often within the last few months. On stable, definitional topics, older authoritative sources can hold their citation slot for years.

For publishers, the practical takeaway is that AI Overview citations are not drawn from a single source type. They are drawn from a blended set: live retrieval narrowed by relevance and authority, weighted by freshness where appropriate, with structured data and entity signals deciding which passages get extracted from the surviving candidates.

What publishers can and cannot influence

The data layers split cleanly into influenceable and fixed.

Fixed: the pretraining corpus of the Gemini model, the model’s reasoning patterns, the underlying retrieval algorithms.

Influenceable:

  • whether your page is in the live web index (publish, get crawled, get indexed)
  • whether it ranks well enough to be retrieved (classical SEO signals)
  • what structured data it exposes (schema.org markup)
  • how cleanly it references known entities (entity-first writing, Organization schema, sameAs)
  • how extractable its passages are (direct-answer leads, definitional sentences, FAQPage markup)
  • how fresh and substantive the content is (regular updates, depth on the topic)

The work that earns citations inside AI Overviews is concentrated in the influenceable layers. Time spent trying to “get into the training data” is time spent on a layer publishers do not control. Time spent on indexing, ranking, structured data, entity clarity, passage engineering, and freshness is time spent on layers where the work translates into outcomes within weeks rather than years.

Conclusion

AI Overview generates its answers from a blended data layer: the Gemini model’s pretraining corpus for general language and reasoning, Google’s live web index for candidate retrieval, structured data and schema markup for explicit content signals, the Knowledge Graph for entity grounding, and real-time retrieval at query time for the actual citations. The pretraining layer is fixed and not publisher-influenceable. The live retrieval, structured data, entity, and freshness layers are where content and SEO work translate into citation outcomes. Publishers who understand this split focus their effort on indexing, ranking, schema, entity clarity, and passage extractability – the layers where work moves the needle within weeks rather than years. Trying to “get into the training data” misallocates effort to a layer that is not how citation actually happens. A reader who can name the data layers AI Overview uses, and tell which are fixed and which are influenceable, can scope citation work accurately and skip the effort that does not pay off.

Frequently Asked Questions

What data does AI Overview use to generate answers?
AI Overview combines several data sources: the underlying Gemini model’s pretraining corpus, Google’s live web index, structured data (schema.org markup) on indexed pages, the Knowledge Graph for entity grounding, and real-time retrieval at the moment of the query. Citations come from the live retrieval and extraction layers, not from the pretraining corpus, which is why fresh content can be cited within days of publication.
Does AI Overview cite content from the model’s training data?
No. Citations come from live web retrieval at query time, not from the pretraining corpus. The pretraining data shapes the model’s language ability and background knowledge, but the specific URLs cited in an AI Overview are drawn from the live web index. This means a page published after the model’s training cutoff can still be cited – it just needs to be indexed, ranked relevant to the query, and contain extractable passages.
Does structured data affect AI Overview citation?
Yes. Schema.org markup – particularly FAQPage, Article, HowTo, and Organization – gives Google explicit signals about what a page contains and how to interpret discrete answers on it. FAQPage markup that aligns Q&A pairs with the user’s query is one of the higher-leverage signals because it gives the system clean, extractable answer spans. Structured data does not guarantee citation, but it raises the probability by making the page easier to parse and trust.
What role does the Knowledge Graph play in AI Overview?
The Knowledge Graph grounds entity mentions in the query and in candidate pages, so that names, brands, and concepts are correctly disambiguated and linked to a canonical entity record. Pages that reference entities cleanly – using canonical names, surfacing entity attributes, linking to authoritative entity pages – benefit from this grounding. The system can reason about the page in entity terms rather than as undifferentiated text, which often translates into more frequent citation on entity-related queries.
How fresh does my content need to be for AI Overview to cite it?
It depends on the query type. For news, product launches, and rapidly evolving topics, AI Overview heavily favours sources from the last few weeks or months – older content often gets displaced quickly. For stable, definitional topics, older authoritative sources can hold their citation slot for years. The practical guidance is to update high-priority pages periodically, especially for queries where the underlying answer changes over time.
Can I get my content into the AI Overview’s training data?
Not in any direct or short-term way. The pretraining corpus is fixed at the model’s training time, includes a broad slice of public web content, and is refreshed only when a new model version is trained. Publishers cannot opt content in or guarantee inclusion. The practical alternative is to focus on the live retrieval layer – indexing, ranking, structured data, entity clarity, and extractable passages – because that is where citations actually come from at query time.
Why does AI Overview sometimes cite a low-ranking page over a high-ranking one?
Ranking governs which pages enter the candidate set, but extraction governs which one gets cited. A page can rank tenth and still be cited above the first result if its passages are more directly extractable – cleaner direct-answer phrasing, better structured data, more self-contained sentences. Conversely, a top-ranked page that is comprehensive but diffuse may be skipped because no single passage is high-confidence enough to quote. Citation share and ranking position are correlated but not identical signals.

If you want a structured view of which influenceable signals on your site are weakest – retrieval ranking, structured data, entity clarity, or passage extractability – we can scope an AI Overview citation audit.


Alva Chew

We help businesses dominate AI Overviews through our specialised 90-day optimisation programme.