How to Validate Your Website Sitemap

Your sitemap is built, hosted in the right location, declared in robots.txt, submitted to Google Search Console and Bing Webmaster Tools, and backed up by IndexNow for fast notifications. Everything from the last nine lessons is in place.

The previous lessons assumed one thing about your sitemap. They assumed it was correct.

Validation is the step that confirms the assumption. It is the difference between submitting a sitemap that helps the search engines understand your site and submitting one that silently misleads them. A broken sitemap is worse than no sitemap because the engines trust what you submit. If you tell them about URLs that return 404, they crawl those URLs and waste budget. If you tell them about pages with noindex tags, they get conflicting signals. If your XML has structural errors, the file gets rejected or partially parsed in ways that are hard to predict.

This lesson covers why validation matters more than people think, how to validate the XML structure, how to validate the URLs inside the file, how to use the search engine reports as validators, what validation cannot catch, and a pre-submission checklist you can run through every time.

Why validation matters more than people think

The temptation is to skip validation and assume the tool that generated the sitemap got it right. Yoast generated this XML, so it must be valid. Rank Math is well-tested, so the URLs must all work. The sitemap loads in a browser without an error, so it must be fine.

This is a reasonable assumption for the XML structure. Modern plugins rarely produce broken XML. But it is not a safe assumption for what is inside the file.

The URLs inside the sitemap are pulled from your site’s content, and that content changes constantly. A page gets deleted but stays in the sitemap until the plugin refreshes. A category gets renamed, and old category URLs return 404. A redirect gets added, but the sitemap still lists the old URL. A staging URL accidentally makes it into the production sitemap. These are not plugin bugs. They are normal site-evolution issues that the plugin has no way of catching on its own.

Validation is the step that catches them before the engines do. A search engine that crawls 100 URLs from your sitemap and finds that 30 of them return 404 is likely to treat the sitemap as low-quality. The damage is not always reversible. Once Google or Bing learns to mistrust your sitemap, getting that trust back takes time.

How to validate the XML structure

XML validation answers two questions. Is the file well-formed XML? And does it conform to the sitemaps.org protocol the engines expect?

The easiest first check is loading the sitemap in a browser. Open the URL directly (yoursite.com/sitemap.xml). A well-formed XML file loads with a visual tree showing the URL list. A broken file shows an error message at the line where parsing failed. This catches obvious structural problems in seconds and is enough for most situations.

For deeper validation, several tools are worth knowing:

xml-sitemaps.com validator: Paste in a URL or file and check it against the sitemaps.org schema. Free and fast.
W3C XML validator: Checks for general XML well-formedness, but does not know the sitemaps.org schema specifically.
Screaming Frog SEO Spider: Has a built-in sitemap validation workflow. Switch to Mode > List, then choose Upload > Download XML Sitemap and enter the sitemap URL. It crawls the file, validates the XML, and checks the URLs in one pass.
Yoast and Rank Math plugins: Both validate their own output and surface errors in the WordPress admin if something goes wrong during generation.

Common XML errors these tools catch include unescaped ampersands in URLs (& needs to be written as &amp;), missing closing tags, wrong or missing namespace declarations, and BOM characters before the XML declaration. The first two are by far the most common, especially in manually generated sitemaps. The namespace and BOM issues mostly come from misconfigured CMS exports.
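If you would rather script this check than paste the file into a tool, the errors listed above can be caught with Python's standard library. The following is a minimal sketch, not a full schema validator. The function name check_sitemap_xml is made up for illustration, and it assumes you already have the raw bytes of the sitemap file.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
UTF8_BOM = b"\xef\xbb\xbf"


def check_sitemap_xml(raw: bytes) -> list:
    """Return a list of structural problems found in raw sitemap bytes.

    Hypothetical helper: it checks for a BOM, well-formedness, and the
    root element's namespace. It does not validate against the full schema.
    """
    problems = []

    # A byte-order mark sitting before the XML declaration.
    if raw.startswith(UTF8_BOM):
        problems.append("file starts with a UTF-8 BOM")
        raw = raw[len(UTF8_BOM):]

    try:
        root = ET.fromstring(raw)
    except ET.ParseError as exc:
        # Unescaped ampersands and missing closing tags both end up here.
        problems.append(f"not well-formed XML: {exc}")
        return problems

    # ElementTree reports namespaced tags as "{namespace}localname".
    expected = {f"{{{SITEMAP_NS}}}urlset", f"{{{SITEMAP_NS}}}sitemapindex"}
    if root.tag not in expected:
        problems.append(f"unexpected root element or namespace: {root.tag}")

    return problems
```

An empty list means the file passed these three checks. It says nothing yet about the URLs inside.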

If the XML validates cleanly, that is one layer done. The URLs inside the file are the next layer.

How to validate the URLs inside the sitemap

Once the XML is clean, the URLs need their own check. For each URL in the sitemap, four things should be true.

It returns a 200 status code when fetched.
It is the canonical version of the page (not a URL that redirects to a different one).
It is not blocked by robots.txt.
It does not have a noindex tag in its HTML.

A URL that fails any of these does not belong in the sitemap. Including it sends mixed signals to the engines.
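The four conditions can be written down as one small function. This is a sketch under assumptions: FetchResult is a hypothetical record of what your crawler observed for one URL, and url_problems is an illustrative name. Only RobotFileParser comes from Python's standard library. The fetching itself is left out so the logic stays visible.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser


@dataclass
class FetchResult:
    """What a crawl observed for one sitemap URL (hypothetical shape)."""
    url: str        # URL exactly as listed in the sitemap
    final_url: str  # URL that finally answered, after following redirects
    status: int     # status code of that final response
    noindex: bool   # True if the page's robots meta tag contains noindex


def url_problems(result: FetchResult, robots: RobotFileParser, agent: str = "*") -> list:
    """Return the reasons this URL does not belong in the sitemap."""
    problems = []
    if result.status != 200:
        problems.append(f"returns {result.status}, not 200")
    if result.final_url != result.url:
        problems.append(f"redirects to {result.final_url}")
    if not robots.can_fetch(agent, result.url):
        problems.append("blocked by robots.txt")
    if result.noindex:
        problems.append("carries a noindex tag")
    return problems
```

To use it, create a RobotFileParser, pass the lines of your robots.txt to its parse method, and call url_problems once per URL. An empty list means the URL passes all four checks.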

Screaming Frog is the most thorough tool for this layer. In List mode, you upload the sitemap (or the URLs from it), Screaming Frog crawls each URL, and you get a full report with status codes, redirect chains, robots.txt blocks, canonical tags, and meta robots directives. This is the same data the engines collect when they crawl your sitemap, just in a format you can review and act on yourself.

For smaller sites, you can spot-check by hand. Pick 10 to 20 representative URLs from the sitemap and load each one. Check the response status (browser dev tools show this under Network), check whether the page redirects, and check the HTML head for <meta name="robots" content="noindex">. Five minutes of manual checking catches most issues on a small site.
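The noindex part of the spot-check is easy to automate once you have a page's HTML. The sketch below uses Python's built-in html.parser. The class and function names are illustrative, and it only looks at the generic robots meta tag, so treat it as a first pass and not a complete indexability check.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a <meta name="robots"> tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None for bare attributes, so normalise them.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
        if "noindex" in directives:
            self.noindex = True


def has_noindex(html: str) -> bool:
    """Return True if the HTML carries a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex
```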

For larger sites where manual checking is impractical and Screaming Frog feels heavy, online sitemap crawlers exist. Sitebulb, Ahrefs Site Audit, and Semrush Site Audit all have sitemap-aware modes. The principle is the same: crawl the URLs, surface the failures.

How to use Search Console and Bing Webmaster Tools as validators

Once you have submitted the sitemap (covered in Lesson 7: How to Submit Your Website Sitemap to Google Search Console and Lesson 8: How to Submit Your Website Sitemap to Bing Webmaster Tools), both engines do their own validation when they fetch it. Their feedback is the most authoritative validation you can get, because it tells you what the engines themselves see.

In Google Search Console, the Sitemaps report shows:

Status: Success, Has errors, or Couldn't fetch.

Discovered URLs: The number of URLs Google registered from your sitemap. Compare this to the count you submitted. A gap suggests URLs are being silently excluded.

Specific error messages: When something goes wrong, the report describes it. Common messages include “URL not allowed by robots.txt” and “URL has redirects”.

In Bing Webmaster Tools, the equivalent fields show:

Status: The same set of values as Google.

URLs submitted versus URLs indexed: A wider gap is normal for Bing (it indexes more conservatively than Google), but the trend should be upward over time.

Error breakdown: Bing surfaces specific errors with the affected URL counts so you can see which problems affect the most pages.

If your own validator says the sitemap is clean but the engines report errors, trust the engines. They are seeing something your local tools missed. The most common causes are a robots.txt rule that blocks the sitemap path without you realising it, and a CDN configuration that returns unexpected status codes to the engine’s crawler IP range.

What sitemap validation cannot catch

Validation tells you the file is technically correct. It does not tell you the file is strategically right.

Four issues pass every validation tool but still cause problems.

URLs missing from the sitemap entirely. Orphan pages that exist on the site but never get listed because the plugin’s discovery logic missed them. No validator can flag what is not there. The fix is a manual audit of expected URLs against the sitemap contents.
URLs that exist but should not be indexed individually. Filtered shop URLs, faceted navigation results, deep pagination. These return 200 and have no noindex tag, so validators pass them. Including them in the sitemap promotes pages that may dilute your topical focus.
URLs of low content quality that pass technical checks. Thin pages, near-duplicate content, AI-generated pages without editorial review. Listing them promotes pages that probably do not merit a place in the sitemap, regardless of what validation says.
lastmod values that are not informative. Some plugins update lastmod for every page on every site change, including changes that did not affect the page. The engines learn to ignore lastmod fields that never differentiate one page from another.

These are strategic problems, not validation problems. They need editorial judgement, not a tool. The next lesson covers how to think about them.

A pre-submission validation checklist

Run through these checks before every fresh sitemap submission and before any major site change. Most take seconds individually; together they take a few minutes for a typical site.

XML is well-formed (loads in a browser without parse errors).
The sitemaps.org namespace is declared correctly in the urlset tag.
UTF-8 encoding is declared in the XML declaration.
All URLs are absolute and use the same protocol (https throughout).
All URLs return 200 status codes when crawled.
No URLs are blocked by robots.txt.
No URLs have a noindex tag in their HTML.
All URLs are canonical, not URLs that redirect elsewhere.
lastmod values reflect actual modification times, not always today.
File is no larger than 50MB uncompressed and contains no more than 50,000 URLs (a sitemap index covers larger sites).

If anything on this list fails, the next lesson covers the specific fix.
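The file-level items on the checklist (size, URL count, absolute https URLs) need no crawling, so they are cheap to script. Here is a minimal sketch using only the standard library. The function name file_level_problems is made up for illustration, and it assumes raw holds the uncompressed sitemap bytes.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

MAX_BYTES = 50 * 1024 * 1024  # 50MB, measured uncompressed
MAX_URLS = 50_000
LOC_TAG = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"


def file_level_problems(raw: bytes) -> list:
    """Check size, URL count, and that every <loc> is an absolute https URL."""
    problems = []

    if len(raw) > MAX_BYTES:
        problems.append(f"file is {len(raw)} bytes, over the 50MB limit")

    root = ET.fromstring(raw)
    locs = [(el.text or "").strip() for el in root.iter(LOC_TAG)]

    if len(locs) > MAX_URLS:
        problems.append(f"file lists {len(locs)} URLs, over the 50,000 limit")

    for loc in locs:
        parts = urlsplit(loc)
        if not parts.netloc:
            problems.append(f"not an absolute URL: {loc}")
        elif parts.scheme != "https":
            problems.append(f"not https: {loc}")

    return problems
```

The remaining checklist items depend on fetching each URL, so they belong to a crawl and not to a check on the file itself.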

Where this leaves us

You can now confirm your sitemap is technically correct before you rely on it. The XML structure is sound, the URLs inside actually work, and the engines accept what you have submitted. Validation has done its job.

But validation only tells you something is wrong. It does not always tell you how to fix it, and some errors are common enough that a dedicated treatment is worth the time. The next lesson covers the most frequent sitemap errors people run into in practice, what each one actually means, and the specific fix for each one.

Up next: How to Fix Common Sitemap Errors →

This is Module 2: Lesson 10 of The Sitemap Series, a Technical SEO series on sitemaps from first principles, built for the AI Search era.
