Part of The Sitemap Series
You discovered how to find your existing sitemap in lesson 1, and now your sitemap is loaded on screen. Maybe it appeared as a clean-styled table, the kind Yoast and Rank Math render for you. Maybe it appeared as raw XML, hundreds of lines of angle brackets and URLs scrolling past with no formatting at all. Either way, you have the file in front of you. The question is what to do with it.
Finding a sitemap is the easy part. Reading one properly is where most people get stuck. They scroll through the URLs, recognise a lot of pages they expected to see, conclude everything looks fine, and move on. They miss the things that matter.
This lesson is the audit. By the end, you will know how to read any sitemap, your own or someone else’s, and tell whether it is genuinely doing its job. The same checks apply whether the crawler reading your sitemap is Google, Bing, or the AI search era crawlers like GPTBot and PerplexityBot. If the sitemap is junk, every crawler that visits gets junk, and you have no idea what is showing up on your behalf in their results.
First, understand what you are looking at
Before you start scrolling through URLs, take a moment to orient yourself. Three things to clock at a glance:
- The format
- The size
- The structure
These give you the lay of the land before you commit to a closer look.
The format question is simple. Is this a single sitemap, or is it a sitemap index? You can tell by skimming the first few tags. A single sitemap contains <url> tags with your actual page URLs inside. A sitemap index contains <sitemap> tags with <loc> elements pointing to other sitemap files. Most sites running a modern SEO plugin generate an index file. WordPress with Yoast, for example, gives you /sitemap_index.xml which then links to /post-sitemap.xml, /page-sitemap.xml, /category-sitemap.xml, and so on. Each child sitemap holds a different content type.
If you are looking at an index file, the actual page URLs live inside the child sitemaps. Click into each one to see what is really there. The index itself contains no page URLs, only references.
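If you would rather check the format programmatically than skim the tags, a minimal sketch along these lines could work. It assumes you already have the XML as a string; the function names (`sitemap_kind`, `child_sitemaps`) and the sample file are illustrative, not part of any plugin.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def sitemap_kind(xml_text: str) -> str:
    """Return 'index', 'urlset', or 'unknown' based on the root element."""
    name = local_name(ET.fromstring(xml_text).tag)
    if name == "sitemapindex":
        return "index"
    if name == "urlset":
        return "urlset"
    return "unknown"


def child_sitemaps(xml_text: str) -> list[str]:
    """List the <loc> of every <sitemap> entry in a sitemap index."""
    locs = []
    for entry in ET.fromstring(xml_text):
        if local_name(entry.tag) != "sitemap":
            continue
        for child in entry:
            if local_name(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    return locs


# Hypothetical index file, shaped like the Yoast example above.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://example.com/page-sitemap.xml</loc></sitemap>
</sitemapindex>"""

print(sitemap_kind(SAMPLE_INDEX))
print(child_sitemaps(SAMPLE_INDEX))
```

If `sitemap_kind` reports an index, the list from `child_sitemaps` is your to-do list: each of those files needs opening in turn.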
The size question is next. How many URLs are listed? Most styled XML viewers (the ones Yoast and Rank Math generate) show row counts at the top of the page. With plain, unstyled XML, you need to scroll through the file, or view the page source and count the <url> tags. The number itself matters less than whether it matches what you would expect for your site. A blog with 200 published posts should not have 12 URLs in its sitemap. A site with 30 pages should not have 8,000. If the number feels wildly off in either direction, that is your first finding before you have even looked at the URLs themselves.
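Counting <url> tags by hand gets old quickly. Here is a small sketch of the same check, assuming the sitemap uses the standard sitemaps.org namespace. The 50% tolerance in `size_looks_off` is an arbitrary illustrative threshold, not a rule any search engine publishes.

```python
import xml.etree.ElementTree as ET

# Standard namespace for sitemap files.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def count_urls(xml_text: str) -> int:
    """Count the <url> entries directly under the <urlset> root."""
    root = ET.fromstring(xml_text)
    return len(root.findall("sm:url", NS))


def size_looks_off(found: int, expected: int, tolerance: float = 0.5) -> bool:
    """True when the count differs from your expectation by more than
    `tolerance`, expressed as a fraction of the expected count (assumed value)."""
    if expected <= 0:
        return found > 0
    return abs(found - expected) / expected > tolerance


# Hypothetical two-URL sitemap.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>"""

print(count_urls(SAMPLE))
print(size_looks_off(found=12, expected=200))   # the 200-post blog with 12 URLs
print(size_looks_off(found=8000, expected=30))  # the 30-page site with 8,000 URLs
```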
The structure question is the third orientation check, and it only applies to sitemap index files. Are all the child sitemaps that should be there actually there? A typical WordPress site with posts, pages, categories, and tags should have separate child sitemaps for each. If a content type is missing from the index entirely, the plugin is excluding it. Sometimes that is the correct call. Often it is not.
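One way to make the structure check repeatable is to compare the child sitemap filenames against the content types you expect. The sketch below assumes Yoast-style filenames of the form `<type>-sitemap.xml` (optionally numbered); other plugins name their files differently, and the `case-study` type is a made-up example.

```python
import re
from urllib.parse import urlsplit

# Assumed naming convention: post-sitemap.xml, page-sitemap2.xml, and so on.
CHILD_NAME = re.compile(r"^(?P<type>.+)-sitemap\d*\.xml$")


def content_types_in_index(child_urls: list[str]) -> set[str]:
    """Derive content type names from the child sitemap filenames."""
    types = set()
    for url in child_urls:
        filename = urlsplit(url).path.rsplit("/", 1)[-1]
        match = CHILD_NAME.match(filename)
        if match:
            types.add(match.group("type"))
    return types


def missing_types(child_urls: list[str], expected: set[str]) -> list[str]:
    """Content types you expected to see that have no child sitemap."""
    return sorted(expected - content_types_in_index(child_urls))


children = [
    "https://example.com/post-sitemap.xml",
    "https://example.com/page-sitemap.xml",
    "https://example.com/category-sitemap.xml",
]
print(missing_types(children, {"post", "page", "category", "case-study"}))
```

Anything the function returns is a question to answer, not automatically a fault: the plugin may be excluding that type on purpose.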
Spot-check the URLs themselves
Once you have a sense of the format and size, the next move is to look at the actual URLs. You do not need to check every single one. A spot check on five to ten random URLs tells you most of what you need to know.
Pick URLs from different parts of the sitemap. The first few, the last few, a handful from the middle. For each URL, open it in a new tab and watch what happens.
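If the sitemap is long, picking "a few from each part" can be left to a short script. This is a sketch only: `pick_spot_check` is a hypothetical helper, and three URLs per section is simply one way to land inside the five-to-ten range.

```python
import random


def pick_spot_check(urls: list[str], per_section: int = 3, seed=None) -> list[str]:
    """Pick `per_section` random URLs from the first, middle, and last
    third of the sitemap. Short sitemaps are returned whole."""
    if len(urls) <= per_section * 3:
        return list(urls)
    rng = random.Random(seed)  # pass a seed to get a repeatable sample
    third = len(urls) // 3
    sections = [urls[:third], urls[third:2 * third], urls[2 * third:]]
    sample = []
    for section in sections:
        sample.extend(rng.sample(section, per_section))
    return sample


# Hypothetical sitemap of 30 URLs.
all_urls = [f"https://example.com/post-{n}/" for n in range(30)]
for url in pick_spot_check(all_urls, seed=1):
    print(url)
```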
There are three things to check on each URL you open.
- The first is whether it returns a 200 status code, meaning a normal working page. Browsers follow redirects silently, so check that the address bar still shows the URL you opened. A URL that redirects (301 or 302) somewhere else should not be in a sitemap. A URL that returns 404 (not found) or 410 (gone) definitely should not. A sitemap full of broken or redirecting URLs tells search engines that the maintenance is sloppy, and over time they trust the sitemap less.
- The second check is whether the URL is the canonical version of the page. If your sitemap lists https://example.com/about-us/ but the page itself sets https://example.com/about/ as the canonical, the sitemap is pointing to a non-canonical version. Search engines will figure this out eventually, but it is friction you do not need to add.
- The third check is whether this URL is even a page you want in the search index at all. Some sitemap-generating plugins are aggressive and include URLs you would rather hide: parameter URLs from filters, tag archive pages, search results pages, internal admin URLs. If a URL is in the sitemap, you are telling crawlers “please look at this, it is important”. Make sure that is actually true.
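The first two checks can be sketched in code as well. The example below uses only the Python standard library. `fetch_status` makes a request without following redirects, so a 301 shows up as a 301 rather than as the page it leads to. The helper names and the verdict strings are illustrative assumptions, and the canonical lookup is a simple exact-match comparison, not a full URL normaliser.

```python
import http.client
from html.parser import HTMLParser
from urllib.parse import urlsplit


def fetch_status(url: str, timeout: float = 10) -> int:
    """Request the URL once, without following redirects, and return the status."""
    parts = urlsplit(url)
    secure = parts.scheme == "https"
    conn_cls = http.client.HTTPSConnection if secure else http.client.HTTPConnection
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = (parts.path or "/") + ("?" + parts.query if parts.query else "")
        conn.request("GET", path, headers={"User-Agent": "sitemap-audit-sketch"})
        return conn.getresponse().status
    finally:
        conn.close()


def classify_status(code: int) -> str:
    """Translate a status code into an audit verdict."""
    if code == 200:
        return "ok"
    if code in (301, 302, 307, 308):
        return "redirect"
    if code in (404, 410):
        return "broken"
    return "investigate"


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> in a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        if "canonical" in (attr.get("rel") or "").lower().split() and attr.get("href"):
            self.canonical = attr["href"]


def canonical_mismatch(sitemap_url: str, html: str):
    """Return the page's canonical URL if it differs from the sitemap URL, else None."""
    finder = CanonicalFinder()
    finder.feed(html)
    if finder.canonical and finder.canonical != sitemap_url:
        return finder.canonical
    return None


page = '<html><head><link rel="canonical" href="https://example.com/about/"></head></html>'
print(classify_status(301))
print(canonical_mismatch("https://example.com/about-us/", page))
```

The third check, whether the page deserves to be indexed at all, stays a human judgment.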
Look for what should not be there
This is where most sitemap audits find their biggest problems. A sitemap that includes URLs it should not include is worse than a smaller sitemap that only includes the right ones. Junk URLs in a sitemap waste crawl budget, dilute the trust signal, and frustrate crawlers that have to work out what is actually worth indexing.
These are the most common categories of URLs you should not see in a sitemap:
- Tag archives and category archives with thin content. WordPress generates these automatically, and if your tags lead to pages with one or two posts each, those archives have no real value and should not be advertised to crawlers.
- Internal search result pages. URLs like /?s=keyword or /search/keyword/ are not pages you maintain or want indexed.
- Filter and sort URLs from e-commerce sites. Anything with ?color=red or ?sort=price is usually a parameter variation of an existing canonical URL.
- Admin or staging URLs. If you can see /wp-admin/ or staging.example.com/ anywhere in your sitemap, something is seriously misconfigured.
- Author archive pages on sites that do not need them. A solo blog gains nothing from an author archive that mirrors the main feed.
- Attachment pages on WordPress. Each image upload can generate its own dedicated page. Most of these are useless and should not be in the sitemap.
- Test pages and draft pages that accidentally got included. It happens more often than you would think.
If any of these appear in your sitemap, that is a finding to fix. Some you fix by configuring your SEO plugin to exclude them. Some you fix by changing the canonical URL on the offending pages. Some you fix by noindexing the pages and waiting for the sitemap to update on its own.
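Several of the categories above leave recognisable fingerprints in the URL, so a rough first pass can be automated. The rules in this sketch are assumptions based on the examples in the list plus common WordPress defaults (/author/, /tag/, ?attachment_id=). Your site's permalink structure may differ, and thin content or stray test pages cannot be detected from the URL alone.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed filter/sort parameter names, taken from the examples above.
FILTER_PARAMS = {"color", "sort"}


def junk_reasons(url: str) -> list[str]:
    """Return the reasons a URL looks like it does not belong in a sitemap."""
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    path = parts.path.lower()
    host = (parts.hostname or "").lower()

    reasons = []
    if "s" in query or path.startswith("/search/"):
        reasons.append("internal search results")
    if query.keys() & FILTER_PARAMS:
        reasons.append("filter or sort parameter")
    if path.startswith("/wp-admin/"):
        reasons.append("admin URL")
    if host.startswith("staging."):
        reasons.append("staging host")
    if path.startswith("/author/"):
        reasons.append("author archive")
    if path.startswith("/tag/"):
        reasons.append("tag archive")
    if "attachment_id" in query:
        reasons.append("attachment page")
    return reasons


for candidate in [
    "https://example.com/?s=keyword",
    "https://example.com/shop/?color=red",
    "https://staging.example.com/about/",
    "https://example.com/about/",
]:
    print(candidate, junk_reasons(candidate))
```

Treat a hit as a prompt to look, not a verdict. A tag archive with fifty well-curated posts may deserve its place.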
Look for what is missing
A sitemap is also wrong when important pages are absent from it. Spotting what is missing requires a mental walkthrough of your site structure. Open another tab, navigate your own site, and as you click around, ask whether each significant page is represented somewhere in the sitemap.
Start with the obvious. The homepage. The main category or hub pages. The cornerstone blog posts. The pillar service pages. If any of these are absent from the sitemap, that is a meaningful finding and you need to know why.
Then think about content types your CMS might be ignoring. Custom post types are a common omission. If your site has case studies, courses, or any custom content type, the sitemap plugin might not be including them unless you tell it to. WordPress with Yoast lets you configure which post types appear in the sitemap, and the default is usually conservative. Sometimes a content type you spent months building is sitting in the database, fully published, completely absent from the sitemap.
Pages behind logged-in areas are correctly excluded from the sitemap. Pages with noindex meta tags are correctly excluded. Pages disallowed in robots.txt should also be excluded. But normal public pages should not be missing, and when they are, the cause is almost always a plugin configuration that needs adjusting rather than a broken sitemap.
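The mental walkthrough can be backed up with a simple list comparison. Write down the pages that must be in the sitemap, then check them against what the sitemap actually lists. This sketch assumes you have both lists as plain URLs. The `normalize` helper deliberately ignores trailing slashes and letter case in the hostname, which is a lenient choice you may want to tighten.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to scheme, host, and path, ignoring any trailing slash."""
    parts = urlsplit(url.strip())
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{parts.path.rstrip('/')}"


def missing_from_sitemap(must_have: list[str], sitemap_urls: list[str]) -> list[str]:
    """Return the must-have pages that the sitemap does not list."""
    listed = {normalize(u) for u in sitemap_urls}
    return [u for u in must_have if normalize(u) not in listed]


# Hypothetical must-have list: homepage, a service page, a case study.
must_have = [
    "https://example.com/",
    "https://example.com/services",
    "https://example.com/case-studies/acme/",
]
in_sitemap = [
    "https://example.com/",
    "https://example.com/services/",
]
print(missing_from_sitemap(must_have, in_sitemap))
```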
The lastmod pattern
The lastmod element on each URL is the most under-appreciated signal in a sitemap. It tells crawlers when each URL was last meaningfully changed. Module One Lesson 9 covered why Google now ignores priority and changefreq but still uses lastmod, so this is the one element worth actually reading carefully.
Look at the lastmod dates across the sitemap and ask whether the pattern is honest. Three patterns to watch for, and only one of them is the healthy version.
The first pattern is when every URL has the same lastmod date. This usually means the plugin is setting the date to “now” whenever the sitemap regenerates, not to when each page was actually last updated. Crawlers can spot this pattern and will eventually start ignoring the lastmod entirely. It looks like noise rather than signal, and they treat it accordingly.
The second pattern is when every lastmod is far in the past. If your sitemap says everything was last modified two years ago, crawlers conclude the site is dormant. They visit less often. If pages have actually been updated since then, the sitemap is undercommunicating the freshness of your content.
The third pattern, and the healthy one, is a mix. Some URLs with recent dates because those pages were genuinely recently updated. Some with older dates because those pages have not changed in a while. A few in between. This is what crawlers expect to see, and it is the pattern that earns trust over time.
If your lastmod values look wrong, that is a plugin issue. Yoast and Rank Math both pull lastmod from the actual page modification date by default, but custom configurations can override this, and badly configured plugins are a common cause of the same-date-everywhere problem.
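The three patterns can be told apart with a few lines of code once you have the lastmod values in a list. This sketch assumes every value starts with a full YYYY-MM-DD date, which is what most plugins emit. The two-year staleness threshold is an illustrative assumption, not a figure any crawler documents.

```python
from datetime import date


def lastmod_pattern(lastmods: list[str], today: date, stale_after_days: int = 730) -> str:
    """Classify lastmod values as 'all identical', 'all stale', or 'mixed'."""
    # Keep only the calendar date so timestamps seconds apart count as identical.
    days = {value.strip()[:10] for value in lastmods}
    if not days:
        return "no lastmod values"
    if len(days) == 1 and len(lastmods) > 1:
        return "all identical"
    newest = max(date.fromisoformat(day) for day in days)
    if (today - newest).days > stale_after_days:
        return "all stale"
    return "mixed"


today = date(2025, 6, 1)  # fixed date so the example is repeatable
print(lastmod_pattern(["2025-05-30T10:00:00+00:00", "2025-05-30T10:00:05+00:00"], today))
print(lastmod_pattern(["2022-01-10", "2022-03-04"], today))
print(lastmod_pattern(["2025-05-20", "2023-02-11", "2024-09-01"], today))
```

Only "mixed" corresponds to the healthy pattern. Even then, it is worth comparing a couple of dates against when you know those pages were really edited.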
When the audit reveals problems
If you have worked through the sections above and found things you do not like, you are in one of three situations. The shape of your findings tells you what the next step in this module needs to be for you specifically.
If the audit reveals junk URLs that should not be in the sitemap, your next move is to configure your SEO plugin to exclude them or adjust which post types and taxonomies are included. The building lessons later in this module cover the plugin configuration paths for the common SEO plugins.
If the audit reveals missing pages that should be in the sitemap, your next move is also configuration. Most missing-page issues are plugin defaults being too conservative, and a single change in settings is all that is needed.
If the audit reveals deeper problems, such as plugin conflicts, lastmod values that are clearly fake, or an entire content type being ignored, the right move may be to switch sitemap sources entirely or rebuild the sitemap from scratch. The decision lesson coming up next in this module covers exactly how to make that call.
For now, you have a list of findings, and that is what an audit produces. You know what is wrong. The rest of the module shows you how to fix it.
Where this leaves us
You have now run a proper audit on your sitemap. You know the format, the size, and the structure. You have spot-checked the URLs. You have looked for what should not be there and what is missing. You have read the lastmod pattern. You have a list of findings, whatever they happen to be.
The audit result tells you which path to take next. Some readers will have a working sitemap with minor issues they can fix in a plugin settings page. Some will have a sitemap so flawed that starting over is the right call. Some will have no sitemap at all and need to build one from scratch. The next lesson is the decision: given what you have just found, should you fix the existing sitemap, rebuild it from a different source, or replace it with something entirely different?
Up next: Choosing How to Build a Sitemap →
This is Module 2: Lesson 2 of The Sitemap Series, a Technical SEO series on sitemaps from first principles, built for the AI Search era.