The Sitemaps.org Protocol Explained

Every XML sitemap on the web follows the same spec, agreed by Google, Yahoo, and Microsoft in 2006. Here's what it actually says and why the rules matter.

By Victor Ijomah
Technical SEO Specialist
Highlights
  • The sitemaps.org protocol is the shared spec every XML sitemap follows, agreed by Google, Yahoo, and Microsoft in November 2006.
  • A single sitemap file is capped at 50,000 URLs and 50MB uncompressed, with larger sites splitting across multiple files plus a sitemap index.
  • URLs in a sitemap must be absolute, properly escaped, and on the same host as the sitemap file itself.
  • The protocol has stayed at version 0.9 since launch because changing it would break compatibility with billions of existing sitemaps across the web.
  • Almost every modern CMS and SEO plugin generates protocol-compliant sitemaps automatically, so most site owners never need to think about the spec directly.

In the last lesson, we settled the practical question of whether your site needs a sitemap. For almost everyone, the answer is yes. Now we shift gears.

From this lesson forward, Module One opens up what’s actually inside a sitemap. Before we get to the anatomy of one and the individual elements that follow, there’s something foundational to understand first: the document that defines what a sitemap is in the first place.

That document is the sitemaps.org protocol. It’s the spec every XML sitemap on the web follows, the shared agreement that lets your sitemap work across Google, Bing, DuckDuckGo, AI search crawlers, and every other engine that respects the standard. This lesson is about understanding what that protocol says, where it came from, and what the rules mean for how you build sitemaps in practice.

The history: how the protocol came to exist

The story is worth knowing, because it explains why sitemaps work the way they do across every search engine you’ll ever encounter.

In June 2005, Google launched what they called Sitemaps 0.84. It was a proprietary format Google introduced as a way for site owners to help Google’s crawler find pages on their sites. Useful, but only useful for Google. Other search engines (Yahoo and MSN as it was then) didn’t recognise the format.

That changed in November 2006, when Google, Yahoo, and Microsoft jointly announced that they were adopting a common sitemap format. The three companies launched sitemaps.org as a neutral home for the standard, dropped the Google-specific branding, bumped the version number to 0.9, and agreed to use the same format going forward. Ask.com and IBM joined the agreement a few months later in April 2007, making the standard genuinely industry-wide for the major search engines of the time.

The result is what we now call the sitemaps.org protocol, formally documented at sitemaps.org. It’s been at version 0.9 ever since, which is unusual longevity for a web spec, and it works because every major search engine, including the AI search crawlers that have emerged since, respects it. The protocol is published under a Creative Commons Attribution-ShareAlike licence, which is part of why it’s been adopted so widely. No vendor owns it, and no vendor can change it unilaterally.

Knowing this history matters because it explains a quirk that still confuses newcomers. When you Google “sitemap protocol” or “sitemap specifications”, you’ll find references to “Sitemap 0.9”, “the sitemaps.org standard”, “the Google sitemap format”, and “the XML sitemap protocol”. All of them refer to the same document. The naming inconsistency is a reflection of the standard’s path from Google-specific to industry-wide.

What the protocol actually specifies

The protocol’s job is to define what an XML sitemap looks like at the structural level. Strip it down to the essentials, and the spec covers four things.

  1. The first thing the protocol specifies is the XML structure. A sitemap is an XML document with a root element called <urlset>, declaring the sitemaps.org namespace. Every URL on the site is wrapped in a <url> element nested inside the urlset. We’ll walk through the actual structure with examples in the next lesson, but at the spec level, that’s the skeleton.
  2. The second thing it specifies is the elements you can include for each URL. The required one is <loc>, which holds the actual URL. The optional ones are <lastmod> (when the URL was last modified), <changefreq> (how often it changes), and <priority> (how important it is relative to other URLs on your site). We’ve referenced these in passing, and they each get their own treatment in Lesson 8.
  3. The third thing it specifies is the rules for URL formatting. URLs must be absolute, meaning they include the full protocol and domain. Relative URLs aren’t allowed. Special characters in URLs must be properly entity-escaped, so an ampersand becomes &amp;, for example. Every URL must follow the RFC-3986 standard for URIs, and every URL must be on the same host as the sitemap file itself, unless you have proven ownership of the other host for cross-submission, for example through Search Console or a Sitemap reference in that host’s robots.txt.
  4. The fourth thing it specifies is the constraints on sitemap files themselves. These are the rules most people bump into in practice, and they deserve their own section.
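
As a rough illustration of the first three points, here is a minimal Python sketch that builds a sitemap with the standard library. The function name build_sitemap and the example.com URLs are made up for illustration; the namespace string is the one the protocol defines. It assumes Python 3.8 or later.

```python
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org protocol (version 0.9).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap from (loc, lastmod) pairs and return UTF-8 bytes.

    `entries` is a hypothetical input shape chosen for this sketch:
    each item is (absolute_url, lastmod_string_or_None).
    """
    # Root element <urlset> declaring the sitemaps.org namespace.
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        # <loc> is the only required child; ElementTree escapes
        # &, < and > in the text when serialising.
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    xml_bytes = build_sitemap([
        ("https://www.example.com/", "2024-01-15"),
        ("https://www.example.com/view?item=12&desc=vacation", None),
    ])
    print(xml_bytes.decode("utf-8"))
```

Letting an XML library do the serialising, rather than concatenating strings by hand, is one way to avoid forgetting to escape an ampersand in a query string.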

The size and count limits of sitemaps

Two constraints in the protocol have practical consequences for how you organise your sitemaps. Both are worth knowing.

  1. The first is the URL count limit. A single sitemap file can contain a maximum of 50,000 URLs. Once you exceed that number, the protocol requires you to split your URLs across multiple sitemap files and combine them using a sitemap index file (which we covered in Lesson 2: Types of Sitemaps). Most websites never approach 50,000 URLs, but e-commerce sites with large product catalogues, news sites with deep archives, and documentation sites with thousands of pages bump into this limit regularly.
  2. The second is the file size limit. A single sitemap file can be a maximum of 50MB uncompressed (52,428,800 bytes if you want to be precise). Even if you have fewer than 50,000 URLs, if their combined size with all the metadata pushes past 50MB, you still need to split. The protocol allows sitemap files to be gzip-compressed for delivery, which is recommended because smaller files mean faster crawls, but the 50MB limit applies to the uncompressed size.
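
To make the compressed-versus-uncompressed point concrete, here is a small Python sketch that checks the size before gzipping. The helper name gzip_sitemap is hypothetical; the byte limit is the figure quoted above.

```python
import gzip

# 50MB uncompressed, as quoted in the protocol limits above.
MAX_UNCOMPRESSED_BYTES = 52_428_800


def gzip_sitemap(xml_bytes, max_bytes=MAX_UNCOMPRESSED_BYTES):
    """Return gzip-compressed sitemap bytes.

    The size check runs on the uncompressed input, because that is
    the size the limit applies to.
    """
    if len(xml_bytes) > max_bytes:
        raise ValueError(
            f"sitemap is {len(xml_bytes)} bytes uncompressed; "
            f"limit is {max_bytes}, so it needs splitting"
        )
    return gzip.compress(xml_bytes)


if __name__ == "__main__":
    sample = b"<urlset>" + b"<url><loc>https://www.example.com/p</loc></url>" * 1000 + b"</urlset>"
    packed = gzip_sitemap(sample)
    print(len(sample), "bytes uncompressed,", len(packed), "bytes gzipped")
```

A file that compresses to a few megabytes can still be over the limit, which is why the check happens before compression rather than after.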

A useful piece of context here: the 50MB limit wasn’t always 50MB. The original protocol set the file size limit at 10MB. In 2016, the major search engines agreed to raise it to 50MB to accommodate the growing size of websites and the additional metadata (like hreflang annotations for international sites) that often pushed sitemaps past the older limit. The 50,000-URL limit, by contrast, hasn’t changed since the protocol launched.

These two limits work together. You can hit the URL count first or the file size first, depending on the length of your URLs and how much metadata you include per entry. The protocol simply says, ‘Whichever you hit first, that’s when you split.’
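
The “whichever you hit first” rule can be sketched in a few lines of Python. This is an illustrative approach rather than anything the protocol prescribes: split_entries and the overhead_bytes allowance for the XML declaration and <urlset> wrapper are assumptions made for the example.

```python
MAX_URLS = 50_000
MAX_BYTES = 52_428_800  # 50MB uncompressed


def split_entries(entries, overhead_bytes=200,
                  max_urls=MAX_URLS, max_bytes=MAX_BYTES):
    """Group pre-rendered <url>...</url> strings into per-file chunks.

    A new chunk starts as soon as adding the next entry would break
    either the URL count limit or the byte size limit.
    `overhead_bytes` is an assumed allowance for the wrapper markup.
    """
    chunks = []
    current = []
    size = overhead_bytes
    for entry in entries:
        entry_size = len(entry.encode("utf-8"))
        over_count = len(current) >= max_urls
        over_size = size + entry_size > max_bytes
        if current and (over_count or over_size):
            chunks.append(current)
            current = []
            size = overhead_bytes
        current.append(entry)
        size += entry_size
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    demo = [f"<url><loc>https://www.example.com/{i}</loc></url>" for i in range(5)]
    # Tiny limits so the split is visible in a demo.
    for chunk in split_entries(demo, overhead_bytes=0, max_urls=2):
        print(len(chunk), "entries")
```

Each resulting chunk would become its own sitemap file, referenced from a sitemap index.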

There’s also a practical upper ceiling worth knowing about. A sitemap index file can itself contain up to 50,000 sitemap entries, which means a single sitemap index can theoretically reference up to 2.5 billion URLs (50,000 sitemaps × 50,000 URLs each). No real website needs this, but the architecture scales well past any practical site size you’ll encounter.
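
For reference, a sitemap index is a small XML file in the same namespace, with a <sitemapindex> root and one <sitemap> entry per child file. The Python sketch below shows the shape and the arithmetic behind the ceiling; build_sitemap_index and the example.com file names are hypothetical, and it assumes Python 3.8 or later.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_SITEMAPS_PER_INDEX = 50_000
MAX_URLS_PER_SITEMAP = 50_000

# Theoretical ceiling for one index file: 50,000 x 50,000.
MAX_URLS_PER_INDEX = MAX_SITEMAPS_PER_INDEX * MAX_URLS_PER_SITEMAP


def build_sitemap_index(sitemap_urls):
    """Return UTF-8 bytes for an index listing the given sitemap files."""
    if len(sitemap_urls) > MAX_SITEMAPS_PER_INDEX:
        raise ValueError("a sitemap index is limited to 50,000 entries")
    root = ET.Element("sitemapindex", {"xmlns": SITEMAP_NS})
    for sitemap_url in sitemap_urls:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = sitemap_url
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    index = build_sitemap_index([
        "https://www.example.com/sitemap-products-1.xml.gz",
        "https://www.example.com/sitemap-products-2.xml.gz",
    ])
    print(index.decode("utf-8"))
    print(f"{MAX_URLS_PER_INDEX:,} URLs per index at most")
```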

Why the protocol works the way it does

Some of the protocol’s design choices feel restrictive when you first encounter them. Why must URLs be absolute? Why can’t I include URLs from a different domain? Why does the namespace declaration have to be exactly that string?

The answers come back to one principle. The protocol exists so that every search engine in the world can parse every sitemap on the web without ambiguity. Strict rules make sitemaps reliable. Loose rules would make sitemaps inconsistent, and inconsistent sitemaps would be ignored by search engines that couldn’t trust their contents.

The absolute URL requirement removes ambiguity about where a URL points. The same-host restriction prevents sitemap pollution, where a site lists URLs from other domains as if they were their own. The strict namespace declaration ensures the file is unambiguously a sitemap and not some other XML document that happens to use similar tag names. Each rule has a purpose, even when it feels pedantic.
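
These three rules are simple enough to check mechanically. The sketch below is a simplified, hypothetical checker (check_sitemap is not part of any official tool), and it compares hosts only, which is a narrower check than a search engine is likely to apply.

```python
from urllib.parse import urlsplit
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def check_sitemap(xml_bytes, sitemap_url):
    """Return a list of human-readable problems found in a sitemap.

    Checks three things: the root element is <urlset> in the
    sitemaps.org namespace, every <loc> is an absolute URL, and
    every <loc> is on the same host as the sitemap itself.
    """
    problems = []
    root = ET.fromstring(xml_bytes)

    # ElementTree exposes namespaced tags as "{namespace}localname".
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        problems.append(f"unexpected root element: {root.tag}")
        return problems

    sitemap_host = urlsplit(sitemap_url).netloc.lower()
    for loc in root.iter(f"{{{SITEMAP_NS}}}loc"):
        url = (loc.text or "").strip()
        parts = urlsplit(url)
        if not parts.scheme or not parts.netloc:
            problems.append(f"not absolute: {url}")
        elif parts.netloc.lower() != sitemap_host:
            problems.append(f"different host: {url}")
    return problems


if __name__ == "__main__":
    sample = (
        b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        b"<url><loc>/relative</loc></url>"
        b"<url><loc>https://other.example.org/x</loc></url>"
        b"</urlset>"
    )
    for problem in check_sitemap(sample, "https://www.example.com/sitemap.xml"):
        print(problem)
```

A file with the wrong namespace fails at the first check, because the parser no longer sees its root as a sitemaps.org <urlset> at all.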

The same logic explains why the protocol has stayed at version 0.9 since 2006. It works. Search engines built their crawlers around it. Site owners built their tools around it. Changing the protocol now would break compatibility with billions of existing sitemap files across the web. The cost of any change would massively outweigh the benefit, so the protocol has quietly stabilised at what it always was. The only real exception is the 2016 file-size bump from 10MB to 50MB, which was a small accommodation rather than a redesign.

What this means for how you build sitemaps

For most readers, the practical implications of the protocol come down to three things.

  1. The first is that your CMS or SEO plugin is almost certainly already protocol-compliant. WordPress sitemaps generated by Yoast, Rank Math, or AIOSEO follow the spec correctly. Shopify, Wix, Squarespace, and other major platforms generate compliant sitemaps. Unless you’re hand-rolling your sitemap or using a tool from outside the mainstream ecosystem, you don’t need to worry about the protocol details at the implementation level.
  2. The second is that when you do encounter protocol issues, typically through Search Console flagging an error, the error message will usually map to a specific protocol rule. Knowing the protocol exists and what it specifies helps you understand the error rather than guessing at what went wrong. A “URL not allowed” error usually means your sitemap is listing URLs from a domain other than the sitemap’s host. An “Incorrect namespace” error means the namespace declared on the sitemap’s root element is wrong. The protocol is your reference document for these debugging moments.
  3. The third is that the protocol’s limits explain decisions other lessons in this series have referenced. The sitemap index file we covered in Lesson 2 exists because of the 50,000-URL limit. The reason large sites use multiple sitemaps comes back to the same place. Understanding the spec makes the rest of Module One make sense rather than feel like a collection of arbitrary rules.

Where this leaves us

The sitemaps.org protocol is the foundation document every XML sitemap respects. It defines the structure, the rules, the limits, and the conventions that make sitemaps work consistently across every search engine. Almost every sitemap you encounter on the web is protocol-compliant, because the platforms that generate them handle compliance automatically.

That’s the spec. Now we’ll look at what it produces in practice. The next lesson walks through the anatomy of a complete XML sitemap element by element, with a working example you can compare against your own, so you can see exactly how the protocol’s rules translate into the file your CMS generates for you. Once you’ve seen one structured properly, the elements lesson that follows will be much easier to make sense of.

Up next: The Anatomy of an XML Sitemap (with Example) →


This is Lesson 6 of The Sitemap Series, a Technical SEO series on sitemaps from first principles, built for the AI Search era.

Victor Afamefuna Ijomah is a UK-based Technical SEO Specialist focused on how Google and AI engines like ChatGPT, Perplexity, and AI Overviews decide what gets discovered, understood, and cited. He holds an M.Sc in Digital Marketing from the University of Chester and is the editor of The Technical SEO Library, a publication on crawl systems, schema, entity SEO, AI crawler management, and the technical foundations of visibility in the AI Search era.