How AI Engines Decide Whom to Cite

When you ask ChatGPT a question, only two or three sources usually get cited at the bottom of the answer. When Perplexity gives you a synthesised reply, the same thing happens: a small handful of citations from across the open web. The same applies when Google’s AI Overviews summarise a topic. From the millions of pages that could have informed that answer, fewer than five make the list.

The rest stay invisible.

This is the new technical SEO problem, and it is not the same problem as ranking on Google. The old loop, where users search, Google ranks pages, users click, and sites get traffic, has held for twenty years. AI Search has broken it. When the AI engine answers the question directly, the click rarely happens, and the only thing that matters for visibility is whether your site was one of the few that got cited.

This article is about what those engines actually weigh when they decide which sources to cite, why traditional ranking signals are not enough on their own anymore, and what you can do about it on the technical layer underneath your content.

The shift: from ranking pages to citing sources

For most of search history, the question Google asked was a ranking question. Given a query, which pages best satisfy it, and in what order should we list them? The answer was a ranked list of ten blue links. Whether you got a click depended on where you sat in that list, the quality of your title and snippet, and your reader’s patience for scrolling.

AI engines do not ask that question.

What they ask is something closer to: given a query, which sources should I read to compose an accurate, useful answer, and which of those sources should I cite as evidence? This is retrieval, not ranking. The output is not a list of pages for the user to choose from. The output is an answer, and the citations are the receipts.

The mechanical difference matters. In the old game, ten sites won a slot on page one and competed for the click. In the new game, two or three sites win the citation slot and the rest never enter the user’s awareness at all. The cost of being twentieth is roughly the same as being two-hundredth. The cost of being fourth is enormous.

This is why Technical SEO for the AI Search era is calibrated differently. You are no longer optimising to be one of ten options. You are optimising to be one of two or three sources an engine considers trustworthy enough to put its own reputation behind.

What AI engines actually do when you ask them something

Before we talk about the signals these engines weigh, it helps to understand the basic pipeline they run when generating a cited answer. The technical name for this architecture is Retrieval-Augmented Generation, or RAG. Most consumer AI engines that produce cited answers use some flavour of it.

The pipeline runs in three stages.

1. Retrieval

The retrieval step is where the engine decides which documents from the open web are even worth reading for your query. It does not feed your whole site into the language model. It fetches a handful of candidate documents, often broken into smaller chunks, that are most likely to contain the answer.

The retrieval source varies by engine. ChatGPT (when browsing or using its search integration) fetches from Bing-style web results plus its own internal indices. Perplexity runs its own live web crawl. Google’s AI Overviews query Google’s own index. Claude retrieves through its web search and page-fetching tools when they are enabled. Common Crawl, by contrast, is a public crawl dataset and is better understood as a training-data source than as a live retrieval index.

The mechanics differ, but the principle is shared: there is a candidate pool, and your content is either in it or not.

2. Synthesis

Once the engine has its candidate documents, it reads them and composes an answer that draws on the content. This is the generative step. The language model takes what it retrieved, weighs the information against the query, and writes a coherent reply.

This step is where the engine decides what is true, what is relevant, and what is worth including in the final answer. Sources that contain clear, specific, well-structured information get used. Sources that bury their answers under fluff often get skipped, even when they were retrieved.

3. Citation

After synthesis, the engine attaches citations to the parts of the answer that came from specific sources. Not every retrieved document gets cited. The engine effectively decides which sources earned the credit and which were used as background context without attribution.

Being retrieved is necessary but not sufficient. You can be in the pool of candidate documents and still not get cited because your content was harder to extract from, less directly relevant, or less trustworthy in the engine’s judgement than another retrieved source.

The signals that move you from retrieved to cited are what the rest of this article is about.
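
The three stages above can be sketched in a few lines of Python. This is a toy illustration, not how any named engine works: word overlap stands in for real embedding-based scoring, and the `Chunk` type, the function names, the `min_overlap` threshold, and the example.com URLs are all assumptions made for the example. What it shows is the gap between being retrieved and being cited.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    url: str
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase and split on whitespace; a crude stand-in for real embeddings."""
    return {word.strip(".,?!").lower() for word in text.split()}


def retrieve(query: str, corpus: list[Chunk], k: int = 3) -> list[Chunk]:
    """Stage 1: keep the k chunks that share the most words with the query."""
    query_terms = tokenize(query)
    scored = [(len(query_terms & tokenize(c.text)), c) for c in corpus]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


def synthesise(query: str, candidates: list[Chunk], min_overlap: int = 2) -> list[Chunk]:
    """Stage 2: draw only on candidates that are relevant enough to use."""
    query_terms = tokenize(query)
    return [c for c in candidates if len(query_terms & tokenize(c.text)) >= min_overlap]


def cite(used: list[Chunk]) -> list[str]:
    """Stage 3: attribute the answer to the sources that were actually used."""
    return sorted({c.url for c in used})


corpus = [
    Chunk("https://example.com/a", "Schema markup tells engines what a page is about."),
    Chunk("https://example.com/b", "Our story began in a small office."),
    Chunk("https://example.com/c", "Markup can help."),
]
query = "what does schema markup do"

retrieved = retrieve(query, corpus)
print([c.url for c in retrieved])          # pages a and c are retrieved
print(cite(synthesise(query, retrieved)))  # ['https://example.com/a']
```

Page c makes it into the candidate pool but says too little to be used, so only page a earns the citation.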

The signals AI engines weigh

I think about citation signals in three layers. Source-level signals operate at the level of your whole domain. Content-level signals operate at the level of the specific page. Match signals operate at the level of how well your specific page fits the specific query.

Each layer has its own internal logic, and the engines apply them in roughly that order. Source-level signals determine whether you make it into the retrieval pool in the first place. Content-level signals determine whether you survive the synthesis step. Match signals determine whether you earn the citation.

1. Source-level signals

These are the signals that operate at the level of your whole domain or publication. They determine whether the engine considers your site a trustworthy place to retrieve information from at all.

The most important ones are below.

Entity strength. Does the engine know who is behind this site? Is the publisher a recognisable entity in its Knowledge Graph or internal index? Is there a named author with verifiable expertise? Sites that are clearly attributable to a real person or organisation with stated credentials are weighted higher than anonymous or thinly attributed sites. This is why Person and Organization schema, alumniOf, jobTitle, knowsAbout, and sameAs properties matter so much. They feed the entity profile that engines build about you over time.
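
As a rough illustration of the entity markup described above, the sketch below builds a JSON-LD graph that links a Person to an Organization. The property names (`jobTitle`, `alumniOf`, `knowsAbout`, `sameAs`, `worksFor`) are Schema.org vocabulary. The author, the publisher, every URL, and the `build_entity_graph` helper are hypothetical placeholders.

```python
import json


def build_entity_graph(author: dict, publisher: dict) -> str:
    """Return a JSON-LD @graph linking a Person to the Organization they work for."""
    org_id = publisher["url"] + "#organization"
    person_id = author["url"] + "#person"
    graph = {
        "@context": "https://schema.org",
        "@graph": [
            {
                "@type": "Organization",
                "@id": org_id,
                "name": publisher["name"],
                "url": publisher["url"],
            },
            {
                "@type": "Person",
                "@id": person_id,
                "name": author["name"],
                "url": author["url"],
                "jobTitle": author["jobTitle"],
                "alumniOf": {"@type": "CollegeOrUniversity", "name": author["alumniOf"]},
                "knowsAbout": author["knowsAbout"],
                "sameAs": author["sameAs"],
                # The @id reference is what connects the two entities in the graph.
                "worksFor": {"@id": org_id},
            },
        ],
    }
    return json.dumps(graph, indent=2)


# Placeholder values; replace with the real author and publisher details.
author = {
    "name": "Jane Example",
    "url": "https://example.com/about/",
    "jobTitle": "Technical SEO Consultant",
    "alumniOf": "Example University",
    "knowsAbout": ["Technical SEO", "Structured data"],
    "sameAs": ["https://www.linkedin.com/in/jane-example"],
}
publisher = {"name": "Example Library", "url": "https://example.com/"}

print(build_entity_graph(author, publisher))
```

The output would go inside a `script` element of type `application/ld+json` in the page head, or be emitted by whichever SEO plugin manages your schema.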

Topical authority. Does this site consistently publish on a coherent topic? Engines reward focus. A site that publishes a hundred articles on Technical SEO is treated as a stronger authority on Technical SEO than a site that publishes one Technical SEO article amongst a hundred articles on unrelated topics. This is partly behavioural (the engine has more material to learn from) and partly trust-based (focused sites are less likely to be content farms).

Editorial transparency. Does the site state its publishing principles, corrections policy, and disclosure standards? Engines, especially the AI ones, have started using publishingPrinciples, correctionsPolicy, and ethicsPolicy schema properties as quality signals. A site that has thought about its credibility scores higher than one that has not.
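
One way to audit this is to check whether your Organization node carries the three policy properties at all. The sketch below assumes you have already parsed the node from your page's JSON-LD. The `missing_policy_properties` helper and the example URLs are hypothetical.

```python
POLICY_PROPERTIES = ("publishingPrinciples", "correctionsPolicy", "ethicsPolicy")


def missing_policy_properties(organization: dict) -> list[str]:
    """List the transparency properties that are absent or empty on an Organization node."""
    return [prop for prop in POLICY_PROPERTIES if not organization.get(prop)]


# Placeholder node: only one of the three policy pages is referenced.
organization = {
    "@type": "Organization",
    "name": "Example Library",
    "url": "https://example.com/",
    "publishingPrinciples": "https://example.com/editorial-standards/",
}

print(missing_policy_properties(organization))  # ['correctionsPolicy', 'ethicsPolicy']
```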

Trust signals across the web. Backlinks still matter, but their weight has shifted. What engines look for now is whether other trusted entities reference this source. A link from a recognised expert, a peer-reviewed publication, or a known authority in the topic weighs more than a hundred links from low-quality directories. The directional pattern matters more than the volume.

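AI crawler accessibility. This one is easy to miss. If your robots.txt blocks the crawlers an engine relies on, that engine cannot retrieve you. The user agents to check include GPTBot, Google-Extended, and CCBot. Each engine documents its own tokens, and several use separate tokens for training crawlers and search crawlers, so check the current list for every engine you care about. Many sites are blocking themselves through plugin defaults without realising it. We will come back to this.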
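
A quick way to test this is Python's standard-library `urllib.robotparser`. The sketch below checks the three user agents named above against a robots.txt body. The sample rules and the example.com URL are made up, and the agent list is only a starting point, since each engine documents its own tokens.

```python
from urllib.robotparser import RobotFileParser

# User agents named in this article; extend with each engine's documented tokens.
AI_USER_AGENTS = ("GPTBot", "Google-Extended", "CCBot")


def blocked_agents(robots_txt: str, url: str, agents: tuple[str, ...] = AI_USER_AGENTS) -> list[str]:
    """Return the agents that the given robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Hypothetical robots.txt of the kind a plugin default might produce.
sample_robots = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /wp-admin/
"""

print(blocked_agents(sample_robots, "https://example.com/blog/post/"))  # ['GPTBot']
```

To check a live site, fetch its `/robots.txt` and pass the response text in as `robots_txt`.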

2. Content-level signals

Once your site is in the retrieval pool, the engine looks at the specific page that matched the query. The content-level signals decide whether your page survives the synthesis step.

Here is what those signals look like in practice.

Direct, specific answers. Pages that state a claim clearly and specifically get used. Pages that bury an answer under three paragraphs of preamble often get skipped even when they contain it. The engine is looking for extractable, quotable content. “X is the most important factor because Y” is more usable than “well, it depends, and there are many factors to consider…”

Structured headings and chunked sections. Modern engines chunk content into smaller passages before retrieval. A page with clear H2 and H3 structure breaks cleanly into chunks that the engine can match against specific sub-queries. A wall of text without headings is harder to parse, harder to chunk, and less likely to be cited even if the answer is in there somewhere.
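
To see why heading structure matters, here is a minimal sketch of heading-based chunking using the standard-library `html.parser`. Real engines do not publish their chunking logic, so this is only an assumption about the general approach: every H2 or H3 starts a new chunk, and the text under it becomes that chunk's body. The `HeadingChunker` class and the sample HTML are illustrative.

```python
from html.parser import HTMLParser


class HeadingChunker(HTMLParser):
    """Split an HTML document into (heading, text) chunks at every h2 and h3."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[dict] = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3"):
            self._in_heading = True
            self.chunks.append({"heading": "", "text": ""})

    def handle_endtag(self, tag):
        if tag in ("h2", "h3"):
            self._in_heading = False

    def handle_data(self, data):
        if not self.chunks:
            return  # ignore any text that appears before the first heading
        key = "heading" if self._in_heading else "text"
        self.chunks[-1][key] += data


def chunk_by_headings(html: str) -> list[dict]:
    """Return one chunk per h2/h3 section, with whitespace normalised."""
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    return [
        {"heading": c["heading"].strip(), "text": " ".join(c["text"].split())}
        for c in parser.chunks
    ]


page = (
    "<h2>What is schema?</h2><p>Schema is structured data.</p>"
    "<h3>JSON-LD</h3><p>JSON-LD is the usual format.</p>"
)
for chunk in chunk_by_headings(page):
    print(chunk)
```

A page with no headings yields no chunks in this sketch, which is a blunt version of the point: without structure there is nothing clean to match a sub-query against.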

Schema markup. TechArticle, Article, FAQPage, HowTo, and DefinedTerm schemas tell engines what kind of content they are looking at and how to parse it. FAQPage schema in particular gives engines pre-chunked question-answer pairs that are easy to retrieve and cite. Articles without schema are not invisible to engines, but they require more parsing work to use, which often means they lose the citation race to articles that did the markup work.
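
For reference, this is roughly what FAQPage markup looks like when generated from a list of question-answer pairs. The `Question`, `acceptedAnswer`, and `Answer` names are Schema.org vocabulary. The `build_faq_schema` helper is a hypothetical convenience, and the sample pair is abbreviated.

```python
import json


def build_faq_schema(pairs: list[tuple[str, str]]) -> str:
    """Return FAQPage JSON-LD for a list of (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(data, indent=2, ensure_ascii=False)


faq = [
    (
        "Is being cited by AI engines different from ranking on Google?",
        "Yes, mechanically and strategically.",
    ),
]
print(build_faq_schema(faq))
```

Each pair in the output is already a self-contained chunk, which is the property the paragraph above describes.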

Information density. How much useful information sits in each paragraph? Engines reward density over verbosity. A 1,500-word article that says something specific in every paragraph is more citation-ready than a 4,000-word article that pads the same insight across forty paragraphs. This is the opposite of the old SEO instinct to write long for ranking.

Concrete claims with evidence. “I tested this on twelve sites and saw X” is more citable than “studies have shown X.” The engine cannot verify your study claim, but it can quote your tested observation. This is where having a named author with stated expertise pays off: the engine has reason to trust the testimony.

Quotable phrasing. Sentences that are self-contained and read well out of context get cited more. A claim that needs three preceding paragraphs to make sense will rarely survive into the engine’s answer. Writing with citation in mind means writing sentences that work as pulled quotes.

3. Match signals

The third layer is about how well your specific page matches the specific user query. This is where query-level relevance scoring happens.

The factors that move the needle here are different from the first two layers.

Semantic relevance. Embedding-based matching means the engine is looking for content whose meaning aligns with the query, not just content with matching keywords. A page that discusses “how Google decides which pages to surface” can match the query “what does a search engine actually do” even without keyword overlap, because the embeddings are close in semantic space.
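
The idea of being close in semantic space usually comes down to a similarity measure between vectors, most commonly cosine similarity. The sketch below uses invented three-dimensional vectors purely for illustration. Real embeddings come from a model and have hundreds or thousands of dimensions, and the page names here are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_pages(query_vec: list[float], pages: dict[str, list[float]]) -> list[str]:
    """Order page names from most to least similar to the query vector."""
    return sorted(pages, key=lambda name: cosine_similarity(query_vec, pages[name]), reverse=True)


# Invented toy vectors; in practice these would come from an embedding model.
query_vec = [0.9, 0.1, 0.3]
pages = {
    "how-google-surfaces-pages": [0.8, 0.2, 0.4],
    "company-history": [0.1, 0.9, 0.0],
}

print(rank_pages(query_vec, pages))
```

Neither page needs to share a keyword with the query for this ranking to work. Only the direction of the vectors matters.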

Entity alignment. If the user asks about “schema markup”, the engine looks for content that demonstrates a real understanding of schema as an entity, not content that just mentions the word. Pages that link schema to its related entities (JSON-LD, Schema.org, Google’s Rich Results, structured data, semantic web) signal a deeper understanding than pages that use the word in isolation.

Recency. For time-sensitive queries (anything about current AI engine behaviour, platform updates, and recent algorithm changes), engines weigh recent content more heavily. dateModified in your schema and the visible last-updated stamp on your articles both feed into this.
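
If you keep `dateModified` values for your articles, a small script can flag the ones that are due for review. This is a sketch under assumptions: the 365-day default threshold, the `stale_articles` helper, and the sample URLs and dates are all illustrative, and the dates are expected in ISO `YYYY-MM-DD` form.

```python
from datetime import date, timedelta


def stale_articles(articles: list[dict], today: date, max_age_days: int = 365) -> list[str]:
    """Return URLs whose dateModified is older than max_age_days before today."""
    cutoff = today - timedelta(days=max_age_days)
    return [
        article["url"]
        for article in articles
        if date.fromisoformat(article["dateModified"]) < cutoff
    ]


# Placeholder records, e.g. exported from your CMS or parsed from JSON-LD.
articles = [
    {"url": "https://example.com/fresh/", "dateModified": "2025-03-01"},
    {"url": "https://example.com/old/", "dateModified": "2024-01-15"},
]

print(stale_articles(articles, today=date(2025, 6, 1)))  # ['https://example.com/old/']
```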

Specificity to the query. A page that answers the exact question being asked beats a page that mentions the topic generally. This is why a library of focused articles, each answering a specific question, outperforms one mega-post trying to cover everything.

How different engines handle this differently

The signals above describe the common ground. Each major AI engine applies them with its own emphasis, and the differences are worth understanding because they affect how you optimise.

A short tour of where the engines diverge:

ChatGPT retrieves via Bing’s index when browsing and combines that retrieved web content with what it learned in training. It is generally conservative about citations and prefers established, well-attributed sources. ChatGPT tends to weight editorial credibility (clear authorship, stated expertise, well-structured content) more than raw freshness.

Perplexity is built around citation by design. Every answer comes with sources listed prominently. Perplexity does its own live web retrieval and is more willing than ChatGPT to cite less-established sources if their content directly answers the query. This makes Perplexity a particularly important target for newer publications because the bar for entry into its citation pool is lower than for ChatGPT.

Google’s AI Overviews integrate deeply with the Google index. Sources that already rank well in traditional Google search are more likely to be cited in AI Overviews, but the citation logic is not identical to ranking. AI Overviews especially favour content with clear answers in early paragraphs, strong schema markup, and high trust signals at the domain level.

Claude retrieves through its web search and page-fetching tools when they are enabled. It tends to be cautious about citation and prefers sources with clear authorship and topical focus. Like ChatGPT, it weights editorial credibility heavily.

Gemini sits in the Google ecosystem alongside AI Overviews and shares many of the same retrieval patterns, though with its own model-side weighting.

The shared pattern across all of them: they all reward sites that are easy to retrieve, easy to extract from, easy to attribute, and easy to trust. Get those four right and you will be cited across engines, not just one.

Why traditional ranking signals are not enough on their own

If your background is in traditional SEO, you might be wondering: surely if I rank well on Google, that takes care of most of this? The answer is: partly. Some signals do transfer. Domain trust, content quality, and topical authority all matter in both worlds. But three significant gaps remain.

1. Keyword targeting falls down

AI engines work on semantic matching, not exact keyword matching. A page optimised tightly around a keyword phrase may rank well on Google but lose to a more semantically rich page when an AI engine retrieves for a related but differently phrased query. The shift is from keyword targeting to entity-and-question targeting.

2. Thin content is doubly punished

A page that ranks adequately on Google because it has a backlink advantage but contains thin content may never be cited by an AI engine because there is nothing extractable to quote. Citation requires substance, not just authority.

3. Click-optimised titles backfire

Titles that are designed to maximise click-through on a Google SERP (clickbait formulations, listicle headlines, curiosity gaps) are often weaker for AI citation because they obscure rather than describe the content. AI engines prefer titles that clearly state what the article covers.

These gaps are not arguments to abandon traditional SEO. They are arguments to build on top of it.

What to do, practically

If you are reading this and wondering where to start, here is the work that actually moves the citation needle. None of it is glamorous. All of it compounds.

Below are the priorities in roughly the order I would tackle them on a new or existing site.

Build strong entity signals. Set up Person schema for the author with name, jobTitle, alumniOf, knowsAbout, sameAs, and image. Set up Organization schema for the publisher. Connect them via the schema graph. Make sure the author and publisher are referenced consistently across every article. This is the foundation engines use to decide whether your site is a real entity with real expertise.
Use the correct schema on every post. Yoast Premium handles this if you set the schema type correctly in Content types > Posts. Add FAQPage schema where your content has clear question-answer pairs. Add HowTo schema for step-by-step content.
Structure your articles for chunking. H2 sections for major topics, H3 for subsections. Each section should be self-contained enough to make sense as an extracted chunk. Avoid massive walls of text under one heading.
Lead each section with the answer, then explain. The AI engine often reads the first paragraph after each heading carefully. State the claim clearly there, then expand. Do it at every section, not just at the top of the article.
Write for extractability. Sentences that work as pulled quotes. Self-contained claims that do not require three paragraphs of setup. Specific examples with named tools or platforms instead of vague references.
Set up Editorial Standards, Corrections Policy, and Ethics Policy pages. Reference them from your schema. These transparency signals are increasingly weighted by AI engines as trust indicators.
Do not block AI crawlers. Check your robots.txt and any Yoast or SEO plugin “crawl optimisation” settings. Make sure GPTBot, Google-Extended, and CCBot are not disallowed. Many sites are blocking themselves by accident.
Implement llms.txt at your root. This is a proposed convention that lets you publish a curated list of your most important content for AI engines to discover. Support among the major engines is not yet established, but the cost of setting it up is minimal.
Publish on a focused topic consistently. Topical authority is built by volume in a niche, not breadth across niches. Twenty articles on Technical SEO outperform two hundred articles spread across SEO, web design, content marketing, and email.
Update articles when the underlying systems change. AI engine behaviour shifts faster than traditional search did. Articles older than twelve months should be reviewed and updated. The dateModified signal carries weight.
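
For the llms.txt item in the list above, the file is plain Markdown. As I understand the proposal, it has a top-level title, a short blockquote summary, and sections of annotated links. The generator below is a sketch of that layout. The `build_llms_txt` helper, the site name, and the URLs are hypothetical, so check the current proposal before relying on the exact format.

```python
def build_llms_txt(
    site_name: str,
    summary: str,
    sections: dict[str, list[tuple[str, str, str]]],
) -> str:
    """Render an llms.txt body: H1 title, blockquote summary, H2 sections of links."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Placeholder content; list only the pages you most want engines to find.
content = build_llms_txt(
    "Example Library",
    "Technical SEO guides for the AI search era.",
    {
        "Cornerstones": [
            (
                "How AI Engines Decide Whom to Cite",
                "https://example.com/ai-citation/",
                "Signals that move a page from retrieved to cited",
            ),
        ],
    },
)
print(content)
```

The result would be saved as `llms.txt` at the site root, next to `robots.txt`.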

What not to do

A few things to actively avoid, because they look like SEO best practice in the old playbook but quietly hurt you in the new one.

These are the traps I see most often on sites I audit.

Do not keyword-stuff. Engines penalise this and human readers notice. Specificity and natural language outperform keyword density every time.
Do not block AI bots by default. The bots are not your enemy. They are the new distribution channel. Block them, and you have removed yourself from the citation pool entirely.
Do not rely on thin, AI-generated content. Engines are increasingly tuned to detect and discount low-density, machine-written content. Even if you use AI tools in your workflow (and you probably should), the final published content needs to carry the weight of your own thought and testing.
Do not chase every algorithm rumour. Most public commentary on AI engine behaviour is speculative. Test things on your own site. Watch what actually gets cited. The publicly available signal is noisy. Your own data is not.
Do not ignore schema because it feels invisible. Schema is the difference between an engine guessing at your content and an engine knowing exactly what it is looking at. Invisible to readers, foundational for citation.

Closing the loop

The library you are reading is built on the thesis that Technical SEO has three jobs today: getting your content discovered, getting it understood, and getting it cited. This article is about the third job, the one that decides whether you exist or stay invisible when AI engines answer questions.

The mechanics underneath are not mysterious. They reward sites that have done the unsexy technical work: real entity signals, clean schema, structured content, transparent editorial standards, and accessible crawl paths for the bots that now decide visibility. Most of that work is durable. You do it once, you maintain it, and it compounds over years.

The first move is to stop optimising only for the old loop. The traffic from Google clicks is not going to come back to where it was. The publications that adjust early, while the AI search layer is still being calibrated, will be the ones whose citations show up in millions of conversations a year from now.

That is the work the library is here for. The next two cornerstones will go upstream to discovery (why some sites get found at all) and understanding (how engines decide what your site is about). Together they form the foundation everything else builds on.

If you are figuring this out alongside me, you are in the right place.

Frequently asked questions

A short section of follow-up questions readers tend to ask after reading the main article. Self-contained answers, written to stand alone so AI engines and human readers can pull them out of context.

Is being cited by AI engines different from ranking on Google?

Yes, mechanically and strategically. Ranking on Google delivers a list of options for users to click. Being cited by an AI engine means your content was used to compose an answer and credited as a source. The user often never sees the original page. Some signals overlap (domain trust, content quality, schema), but optimisation for citation rewards different things: extractable answers, clear authorship, dense information, and structured chunking. A site can rank well and never get cited, and vice versa.

Which AI engine should I optimise for first?

Perplexity, if you have to pick one. The citation bar is lower than for ChatGPT or AI Overviews, which makes it the most accessible target for newer publications. Perplexity also cites more transparently and more frequently, so you can measure progress sooner. Once you are reliably cited in Perplexity, the same technical foundations carry over to ChatGPT, Gemini, and AI Overviews with diminishing additional work. Optimise for one engine well, and most of the work transfers.

How can I tell if my content is being cited by AI engines?

There is no single dashboard yet, which is one of the harder problems in this space. The practical approach is to manually test: take your most important topics, type representative queries into ChatGPT, Perplexity, and Gemini, and see if your site shows up in the citations. Tools like Otterly.AI, Profound, and Athena are starting to track AI citation visibility, but they are early. Setting up a tracking spreadsheet of target queries and checking monthly is the most reliable method right now.
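
If the tracking spreadsheet is exported as CSV, a short script can summarise it. The column names (`date`, `engine`, `query`, `cited`), the sample rows, and the `citation_rate_by_engine` helper are assumptions about how such a log might be laid out, not a standard format.

```python
import csv
import io
from collections import defaultdict


def citation_rate_by_engine(csv_text: str) -> dict[str, float]:
    """Share of checked queries, per engine, in which the site was cited."""
    checked: dict[str, int] = defaultdict(int)
    cited: dict[str, int] = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        engine = row["engine"]
        checked[engine] += 1
        if row["cited"].strip().lower() == "yes":
            cited[engine] += 1
    return {engine: cited[engine] / checked[engine] for engine in checked}


# Hypothetical log of manual checks, one row per query per engine.
log = """\
date,engine,query,cited
2025-01-05,Perplexity,what is technical seo,yes
2025-01-05,Perplexity,how do ai engines cite sources,no
2025-01-05,ChatGPT,what is technical seo,no
"""

print(citation_rate_by_engine(log))  # {'Perplexity': 0.5, 'ChatGPT': 0.0}
```

Running the same summary each month turns the manual checks into a trend you can compare across engines.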

Should I block AI crawlers to protect my content from being trained on?

Only if you have a specific reason to (paywalled content, licensed material, news under embargo). For most sites, blocking AI crawlers removes you from the citation pool entirely, which is a much larger cost than the marginal benefit of withholding your content from training data. The realistic trade is: be readable, be cited, get the visibility. Many publishers who blocked AI bots in 2024 have quietly unblocked them since.

Do AI engines penalise content created with AI?

Not directly, but indirectly yes. Engines are tuned to detect and discount low-density, generic content, which a lot of unedited AI output happens to be. The penalty is not “AI-written content is bad,” it is “low-quality content is bad,” and pure AI generation often produces low-quality content. If you use AI tools as part of your workflow but the final published article carries your own thinking, voice, and testing, citation behaviour does not penalise you. Transparency about your AI use, via your Editorial Standards page, strengthens trust signals further.

How long does it take to start getting cited by AI engines?

For a new site, expect three to six months before you start appearing in citations for niche queries, and six to twelve months for citation on broader topics. The timeline is faster than traditional SEO ranking timelines but slower than people expect. Citation also tends to compound: once you are cited a few times, your entity strength grows, and subsequent citations come more easily. The first one is the hardest.

Does my content need to rank on Google for AI engines to cite it?

For Google’s AI Overviews, ranking helps significantly because Overviews draw heavily from the Google index. For ChatGPT and Perplexity, ranking on Google matters less because those engines retrieve from their own sources (Bing-based for ChatGPT, proprietary crawl for Perplexity). A page that ranks poorly on Google can still be cited in Perplexity if it directly answers a query and has strong content-level signals. Ranking is a helpful baseline but not a prerequisite.
