Clarigital · Clarity in Digital Marketing
Technical SEO · Session 4, Guide 11

XML Sitemaps & Robots.txt · Complete Technical Guide

XML sitemaps tell Google which pages exist and should be crawled. Robots.txt tells Googlebot which pages it should not crawl. Together they are the primary crawl control tools available to site owners. This guide covers correct sitemap structure, submission, when sitemaps genuinely help, robots.txt syntax and directives, and the critical difference between robots.txt (blocks crawling) and noindex (blocks indexing).

Technical SEO · 2,700 words · Updated Apr 2026

What You Will Learn

  • When XML sitemaps genuinely help and when they do not
  • Correct XML sitemap format including optional tags
  • How to submit sitemaps to Google Search Console and monitor errors
  • Robots.txt syntax, how Googlebot reads it, and common mistakes
  • All robots.txt directives — Allow, Disallow, Crawl-delay, Sitemap
  • The critical difference between robots.txt disallow and noindex meta tag

XML Sitemaps

An XML sitemap is a file listing URLs on your site that you want Google to crawl and index. Submitting a sitemap does not guarantee indexing — Google decides which submitted URLs to crawl and index based on its own quality and relevance assessments. What a sitemap does is ensure Googlebot knows these URLs exist and has a direct path to discover them.

When sitemaps genuinely help

  • Large sites. Sites with thousands of pages benefit most from sitemaps — Googlebot may not discover all pages through link crawling alone, especially deep pages with few internal links.
  • New sites. A new site with few external backlinks may not be discovered quickly through normal crawling. A sitemap submitted to Search Console accelerates initial discovery.
  • Sites with pages not linked internally. Orphan pages — those with no internal links — will not be discovered through crawling. A sitemap listing them gives Google a path to them (though fixing the orphan status is better long-term).
  • Sites with frequently updated content. News sites, blogs with high publishing frequency, and e-commerce with frequently changing inventory use sitemaps to signal freshness and prioritise recrawl.

When sitemaps are less important

For small sites (under 500 pages) with solid internal linking structures and existing external backlinks, Google typically discovers and crawls all pages without a sitemap. A sitemap adds no harm but has minimal incremental value in these cases.

Sitemap Format

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  
  <url>
    <loc>https://www.example.com/page/</loc>
    <lastmod>2026-04-04</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>

  <url>
    <loc>https://www.example.com/another-page/</loc>
    <lastmod>2026-03-15</lastmod>
  </url>

</urlset>

Tag notes:

  • <loc> (required) — must be the canonical URL, including the trailing slash if your site uses trailing slashes consistently
  • <lastmod> (recommended) — ISO 8601 date format; use each page's actual last-modified date, not today's date for every page
  • <changefreq> and <priority> (optional) — Google largely ignores these in practice
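Sitemaps are usually generated rather than hand-written. A minimal sketch using Python's standard xml.etree.ElementTree, reusing the example URLs and dates from above (not real endpoints):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """pages: list of (loc, lastmod) tuples; lastmod may be None."""
    # Register the sitemap namespace as the default so tags serialize
    # without a prefix, matching the format shown above.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        if lastmod:  # only emit the real last-modified date, never a fake one
            ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode", xml_declaration=True)

xml = build_sitemap([
    ("https://www.example.com/page/", "2026-04-04"),
    ("https://www.example.com/another-page/", None),
])
print(xml)
```

In practice most CMSs and SEO plugins do this for you; the sketch is only meant to show how little the required format actually contains.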

Sitemap index files for large sites

A single sitemap file can contain up to 50,000 URLs and must be under 50MB. Large sites use a sitemap index file that points to multiple individual sitemaps:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-posts.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-products.xml</loc>
  </sitemap>
</sitemapindex>
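Splitting a large URL set into protocol-compliant files can be sketched as follows. The 50,000-URL limit comes from the sitemap protocol; the sitemap-0.xml naming scheme is a hypothetical convention, not a requirement:

```python
MAX_URLS = 50_000  # per-file URL limit from the sitemap protocol

def chunk(urls, size=MAX_URLS):
    """Split a flat URL list into sitemap-sized chunks."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def sitemap_index(sitemap_urls):
    """Render a sitemap index file pointing at each child sitemap."""
    entries = "\n".join(
        f"  <sitemap>\n    <loc>{u}</loc>\n  </sitemap>" for u in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</sitemapindex>"
    )

urls = [f"https://www.example.com/p/{i}" for i in range(120_000)]
parts = chunk(urls)  # 120,000 URLs -> three files of at most 50,000
index = sitemap_index(
    f"https://www.example.com/sitemap-{n}.xml" for n in range(len(parts))
)
```

Submit only the index file to Search Console; Google follows it to the child sitemaps.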

Submission and Monitoring

Submit your sitemap to Google via Search Console: Sitemaps (under Indexing) → enter the sitemap URL → Submit. Once submitted, Google shows the sitemap's discovered URL count, indexed URL count, and any errors. The gap between discovered and indexed URLs is informative — a large gap suggests some URLs are being rejected due to quality issues, canonical conflicts, or crawl budget constraints.

Sitemap best practices

  • Only include canonical, indexable URLs — do not include noindex pages, URLs with canonical tags pointing elsewhere, or URLs blocked by robots.txt
  • Keep sitemaps current — remove deleted pages, add new pages promptly
  • Reference your sitemap location in robots.txt: Sitemap: https://www.example.com/sitemap.xml
  • Keep the sitemap URL consistent — changing it requires resubmission
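One way to enforce the first best practice is to cross-check sitemap entries against robots.txt before publishing. A rough sketch using only the Python standard library; the sample sitemap and rules here are illustrative, not a real site:

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/page/</loc></url>
  <url><loc>https://www.example.com/admin/settings/</loc></url>
</urlset>"""

# Feed the parser the site's robots.txt rules line by line.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /admin/"])

locs = [el.text for el in ET.fromstring(sitemap_xml).findall("sm:url/sm:loc", NS)]
blocked = [u for u in locs if not robots.can_fetch("*", u)]
print(blocked)  # URLs that should be removed from the sitemap
```

Any URL that lands in blocked contradicts the sitemap's purpose: you are asking Google to crawl a page you have told it not to fetch.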

Robots.txt

Robots.txt is a plain text file located at the root of your domain (https://www.example.com/robots.txt) that uses the Robots Exclusion Protocol to communicate crawling instructions to web robots including Googlebot. Googlebot fetches and reads robots.txt before crawling any other page on a site.

# Example robots.txt
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /search?
Allow: /public/

User-agent: Googlebot
Disallow: /staging/

User-agent: Googlebot-Image
Disallow: /proprietary-images/

Sitemap: https://www.example.com/sitemap.xml

Robots.txt is read line by line. User-agent: * applies to all crawlers. Specific user agents (Googlebot, Googlebot-Image, Googlebot-Video) can have separate rules. Rules apply to the longest matching prefix — Disallow: /admin/ blocks all URLs starting with /admin/.
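Python's standard urllib.robotparser can sanity-check rules like these. One caveat: Python's parser applies the first matching rule in file order, whereas Google picks the longest matching path, so the more specific Allow is listed first here to keep both interpretations in agreement:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Python's parser uses first-match-wins, Google uses longest-match-wins;
# putting the specific Allow before the broad Disallow satisfies both.
rp.parse([
    "User-agent: *",
    "Allow: /admin/help/",
    "Disallow: /admin/",
])

print(rp.can_fetch("*", "https://www.example.com/admin/users/"))   # False
print(rp.can_fetch("*", "https://www.example.com/admin/help/faq")) # True
print(rp.can_fetch("*", "https://www.example.com/public/page"))    # True
```

For rules you care about, also verify behaviour with Google's own robots.txt report in Search Console, since that reflects Googlebot's exact matching logic.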

Robots.txt Directives

Directive    | Supported by Google  | Meaning
User-agent   | Yes                  | Specifies which crawler the following rules apply to
Disallow     | Yes                  | Blocks the crawler from accessing URLs matching this path
Allow        | Yes                  | Explicitly allows access to a path within a broader disallowed path
Sitemap      | Yes                  | Points crawlers to the site's XML sitemap
Crawl-delay  | No                   | Not supported by Google; Googlebot adjusts its crawl rate automatically
Noindex      | No longer supported  | Was informally supported; Google officially dropped support in September 2019

Noindex vs Disallow — A Critical Distinction

This is one of the most commonly confused technical SEO concepts — and getting it wrong can have serious consequences:

  • robots.txt Disallow blocks crawling. Googlebot will not request the blocked URL. However, if external sites link to a disallowed URL, Google can still know about the URL and show it in search results — with no title or description (just the URL), because it cannot crawl the page to understand its content.
  • noindex meta tag blocks indexing. The page can be crawled normally, but Google will not include it in its search index. The page will not appear in search results. However, Googlebot must be able to crawl the page to see the noindex tag — if the page is also blocked by robots.txt, the noindex tag cannot be read.
Never disallow pages you want to noindex

A page that is both blocked by robots.txt and has a noindex meta tag creates a contradiction: Google cannot read the noindex tag because robots.txt prevents access. The page may still appear in search results as a URL-only result because external links reveal its existence. The correct approach: allow crawling (do not disallow in robots.txt) and use noindex in the page's meta robots tag.
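A quick way to audit this on your own pages is to check the meta robots tag directly. A minimal sketch with Python's standard html.parser; it covers only the meta-tag form, not the X-Robots-Tag HTTP header, which Google also honours:

```python
from html.parser import HTMLParser

class NoindexDetector(HTMLParser):
    """Flags a page as noindexed if a <meta name="robots"> tag
    (or the googlebot-specific variant) lists the noindex token."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() in ("robots", "googlebot"):
            tokens = [t.strip() for t in a.get("content", "").lower().split(",")]
            if "noindex" in tokens:
                self.noindex = True

html = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
d = NoindexDetector()
d.feed(html)
print(d.noindex)  # True -- but Googlebot only sees it if robots.txt allows the fetch
```

The comment in the last line is the whole point of this section: the detector, like Googlebot, can only read the tag if it is allowed to fetch the page in the first place.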

Authentic Sources

Official · Google Search Central — Sitemaps

Official sitemap documentation including format, submission, and best practices.

Official · Google Search Central — Robots.txt Introduction

Official documentation on robots.txt syntax, supported directives, and how Googlebot reads it.

Official · Google Search Central — Robots Meta Tag

The noindex meta tag and how it differs from robots.txt disallow.

Official · Sitemaps.org — Sitemap Protocol

The official sitemap protocol maintained by Google, Microsoft, Yahoo, and Ask.
