SEO Foundation · Guide 3 of 8

Crawlability & Indexation · The Complete Guide

How Google decides which pages to crawl, how often to crawl them, and whether to add them to the index — crawl budget, robots.txt, XML sitemaps, canonicalisation, noindex directives, and how to diagnose and fix indexing problems using Google Search Console.

SEO · 2,700 words · Updated Apr 2026

What You'll Learn

  • What crawl budget is and when it actually matters for your site
  • How to write and audit a robots.txt file correctly
  • How to create and submit XML sitemaps that help Google prioritise crawling
  • How canonicalisation works and how to avoid self-inflicted duplicate content problems
  • When to use noindex and how it differs from blocking crawling
  • How to find and fix crawl coverage issues in Google Search Console

Crawl Budget: What It Is and When It Matters

Crawl budget is the number of URLs on your site that Googlebot will crawl and process within a given time frame. Google's official documentation defines it as the product of two factors: crawl rate limit (how fast Googlebot can crawl without overloading your server) and crawl demand (how much Google wants to crawl your pages based on their perceived value and freshness).

The term "crawl budget" is frequently misunderstood in the SEO industry. Google has been explicit that for the vast majority of websites, crawl budget is not a concern. According to Google's official Search Central blog, crawl budget only becomes a meaningful factor for sites with more than one million unique URLs, sites where large portions of the site are updated on a rapid basis, or sites with significant numbers of low-quality, duplicate, or redirected pages consuming crawl resources without adding index value.

  • When budget matters: 1M+ URLs before crawl budget is a real concern
  • Crawl rate limit: ≈2s typical pause between Googlebot requests
  • Re-crawl frequency: 3–7d average interval for established pages

What Consumes Crawl Budget Inefficiently

Google identifies several categories of URLs that waste crawl budget without contributing to indexing or ranking:

  • Faceted navigation URLs. E-commerce sites often generate thousands of filter combination URLs (e.g. /shoes?colour=red&size=42&brand=nike). Most of these pages are near-duplicates and provide little unique value.
  • Session IDs in URLs. If your site appends session tokens to URLs, Googlebot sees each session as a unique URL even though the content is identical.
  • Soft 404 pages. Pages that return a 200 HTTP status but display "no results found" or similar messages trick Googlebot into treating empty pages as crawlable content.
  • Redirect chains. Long chains of redirects consume crawl budget at each hop. Google recommends keeping redirects to a single hop where possible.
  • Low-quality or thin content pages. Pages with very little unique content that Google is unlikely to index consume budget without return.
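The redirect-chain point can be made concrete in code. Below is a minimal sketch (hypothetical paths, not from any real site) that resolves every source URL in a redirect map straight to its final destination, so each redirect becomes a single hop as Google recommends:

```python
def flatten_redirects(redirects):
    """Resolve each source URL to its final destination so every
    redirect is a single hop (and detect loops along the way)."""
    flattened = {}
    for source in redirects:
        seen = {source}
        target = redirects[source]
        # Follow the chain until we reach a URL that no longer redirects.
        while target in redirects:
            if redirects[target] in seen:
                raise ValueError(f"redirect loop involving {target}")
            seen.add(target)
            target = redirects[target]
        flattened[source] = target
    return flattened

# A three-hop chain: /old -> /interim -> /new -> /final
chain = {"/old": "/interim", "/interim": "/new", "/new": "/final"}
print(flatten_redirects(chain))
# {'/old': '/final', '/interim': '/final', '/new': '/final'}
```

After flattening, every source URL should 301 directly to its final destination, so Googlebot spends one request per redirect instead of one per hop.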

Google's confirmed guidance on crawl budget

Google's Gary Illyes stated in 2017: "If you're a small or medium site (say, a few thousand URLs), you likely don't need to worry about crawl budget." The concern is primarily for large-scale sites. Source: Google Search Central Blog.

How to Improve Crawl Efficiency

Even if crawl budget is not a crisis for your site, improving crawl efficiency is good practice. The most effective actions are: consolidating duplicate content through canonicalisation and redirects, keeping crawl-wasting URL parameters out of internal links or blocking them via robots.txt (Google Search Console's old URL Parameters tool was retired in 2022), reducing redirect chains to single hops, and ensuring your server responds quickly (slow server responses cause Googlebot to crawl less aggressively to avoid overloading it).

robots.txt: Controlling Crawl Access

The robots.txt file is a plain text file placed at the root of a domain (e.g. https://example.com/robots.txt) that tells crawlers which parts of the site they should not access. It is governed by the Robots Exclusion Protocol, first introduced in 1994 and formalised as an IETF proposed standard (RFC 9309) in 2022.

It is critical to understand what robots.txt does and does not do. It controls crawling, not indexing. If you block a URL in robots.txt, Googlebot will not crawl it — but if that URL is linked to from other pages, Google may still index it (as a URL without content). This distinction is important: blocking a page in robots.txt does not guarantee it will not appear in search results. To prevent a page from appearing in search results, you need a noindex directive.

robots.txt Syntax

A robots.txt file consists of one or more groups, each starting with a User-agent line followed by Disallow or Allow directives. Google supports these directives:

# Block all crawlers from the /admin/ directory
User-agent: *
Disallow: /admin/

# Block Googlebot specifically from staging content
User-agent: Googlebot
Disallow: /staging/

# Allow all crawlers access to everything
User-agent: *
Disallow:

# Point to the XML sitemap location
Sitemap: https://example.com/sitemap.xml

Never block CSS or JavaScript in robots.txt

A common historical mistake was blocking CSS and JavaScript directories to save crawl budget. Google now renders pages with Chromium and needs access to these resources to properly evaluate and render your pages. Blocking them causes Google to see a broken version of your page, which can negatively impact rankings. Source: Google Search Central documentation.

Common robots.txt Mistakes

  • Blocking the entire site. A Disallow: / directive blocks all crawling. This is sometimes accidentally left in place from development environments.
  • Case sensitivity. robots.txt path matching is always case-sensitive, regardless of how your server treats paths. Disallow: /Admin/ does not block /admin/.
  • Using robots.txt for confidential pages. robots.txt is publicly readable. Do not list confidential URLs there hoping to hide them — this is counterproductive. Use server authentication instead.
  • Blocking pages you want ranked. Any page you want to appear in Google search results must be crawlable. Verify your robots.txt does not accidentally block important content.

Testing your robots.txt

Google Search Console provides a robots.txt report (under Settings) that shows which robots.txt files Google found for your site, when each was last crawled, and any warnings or parse errors; it replaced the standalone robots.txt Tester, which was retired in 2023. Check it after any change to your robots.txt.
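You can also sanity-check rules offline with Python's standard library, which implements the original Robots Exclusion Protocol (note it does not support every Google extension, such as wildcards inside paths). A sketch using the hypothetical rules from the example above:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules mirroring the /admin/ example earlier in this guide.
rules = """\
User-agent: *
Disallow: /admin/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Any crawler falls back to the "*" group when it has no dedicated group.
print(parser.can_fetch("Googlebot", "https://example.com/admin/panel"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))    # True
```

This is useful in CI: fail the build if a rule change would accidentally block an important URL.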

XML Sitemaps: Helping Google Discover and Prioritise

An XML sitemap is a file that lists the URLs on your site you want search engines to crawl and index. It serves as a direct communication channel to Google — telling it which pages exist, when they were last updated, and optionally how frequently they change. The XML sitemap format is standardised at sitemaps.org and is supported by Google, Bing, and other search engines.

Sitemaps are particularly valuable for large sites with deep URL structures, new sites with few external inbound links (which would otherwise limit Google's ability to discover pages through crawling), and sites with pages that are not easily discovered through internal linking (such as pages accessible only via search forms or heavy use of JavaScript).

XML Sitemap Structure

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/seo/how-google-works/</loc>
    <lastmod>2026-04-04</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

Note that Google ignores the <changefreq> and <priority> elements; it relies on <loc> and, when the values are consistently accurate, <lastmod>.

What to Include and Exclude

Your sitemap should only include URLs you want indexed. This means:

  • Include: Canonical versions of your important pages, paginated pages (the first page at minimum), and alternate language versions using hreflang.
  • Exclude: Pages blocked by robots.txt (a contradiction Google flags as an error), pages with a noindex meta tag, redirect URLs (list the final destination only), URLs with parameters that produce duplicate content, and pages you do not want indexed.

Sitemap Index Files

Sites with more than 50,000 URLs in a single sitemap, or sitemaps exceeding 50MB uncompressed, should use a sitemap index file — a sitemap of sitemaps. This allows you to split your sitemap into logical groups (e.g. one for blog posts, one for product pages) and submit a single index file to Google Search Console.
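A sketch of what a sitemap index file looks like, using the same namespace as a regular sitemap but with <sitemap> entries in place of <url> entries (the filenames are hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2026-04-04</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-03-28</lastmod>
  </sitemap>
</sitemapindex>
```

Only the index file needs to be submitted in Google Search Console; the child sitemaps are discovered through it.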

Submitting Your Sitemap

Submit your sitemap via Google Search Console under Indexing → Sitemaps. You should also include a Sitemap: directive in your robots.txt file pointing to the sitemap location — this allows any crawler, not just Google, to discover it automatically.

Sitemaps are a hint, not a command

Google treats sitemaps as a signal about what you want crawled, not as a guaranteed crawl list. Including a URL in your sitemap does not guarantee it will be crawled or indexed. Exclusion signals (noindex, robots.txt disallow, low-quality content) can still prevent indexing even for sitemapped pages.

Canonicalisation: Solving Duplicate Content

Canonicalisation is the process of selecting the preferred version of a URL when multiple URLs serve the same or very similar content. Google consolidates ranking signals such as links and PageRank onto the canonical URL and indexes that version. Without canonicalisation, sites with duplicate content dilute their own link equity and send conflicting signals to Google about which page should rank.

Common Sources of Duplicate Content

  • HTTP vs HTTPS versions of the same page
  • www vs non-www versions (e.g. www.example.com vs example.com)
  • Trailing slash vs no trailing slash (/page/ vs /page)
  • URL parameters that don't change content (e.g. tracking parameters like ?utm_source=email)
  • Printer-friendly or AMP versions of pages
  • Paginated content where page 1 and the /page/1/ URL are both accessible
  • Product pages accessible under multiple category paths
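Several of these duplicate sources can be prevented by normalising URLs before they are linked or served. A minimal Python sketch; the specific rules (forcing HTTPS, dropping www, adding a trailing slash, stripping utm_ parameters) are illustrative choices for a hypothetical site, not universal requirements:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalise(url):
    """Reduce common duplicate-URL variants to one canonical form."""
    scheme, netloc, path, query, _ = urlsplit(url)
    scheme = "https"                              # force HTTPS
    netloc = netloc.lower().removeprefix("www.")  # drop www (Python 3.9+)
    if not path.endswith("/"):
        path += "/"                               # force trailing slash
    # Drop tracking parameters that do not change content.
    params = [(k, v) for k, v in parse_qsl(query) if not k.startswith("utm_")]
    return urlunsplit((scheme, netloc, path, urlencode(params), ""))

print(normalise("http://www.Example.com/page?utm_source=email"))
# https://example.com/page/
```

Applying one normalisation function everywhere internal links are generated prevents most self-inflicted duplicates before canonical tags are even needed.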

Canonicalisation Signals Google Uses

Google uses multiple signals to determine the canonical URL. In order of typical influence:

  1. 301/308 redirects. The strongest signal. If you permanently redirect one URL to another, Google will treat the destination as canonical.

  2. rel="canonical" link element. A <link rel="canonical" href="..."> tag in the <head> of a page. Google treats this as a strong hint but not an absolute directive.

  3. Sitemap inclusion. URLs listed in your sitemap are treated as preferred versions. Including only canonical URLs in your sitemap reinforces the signal.

  4. Internal link consistency. The URL you most frequently link to internally is treated as a signal that it is the preferred version.

rel="canonical" is a hint, not a directive

Google has explicitly stated that the canonical tag is treated as a "hint" rather than a strict instruction. Google may choose a different canonical if other signals (such as internal linking patterns or external link profiles) contradict the stated canonical. If you need to enforce canonicalisation, use 301 redirects.

Self-Referential Canonicals

Best practice is to include a self-referential canonical tag on every page — a page pointing to itself as the canonical. This prevents Googlebot from treating URL parameter variants as separate pages if your server accidentally serves them. A self-referential canonical also makes your canonical declarations consistent and easier to audit.
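For a page served at the hypothetical URL https://example.com/page/, the self-referential tag in the <head> would be:

```html
<head>
  <!-- The page declares itself as the canonical version -->
  <link rel="canonical" href="https://example.com/page/">
</head>
```

Parameter variants such as /page/?utm_source=email then inherit the same tag and consolidate to the clean URL.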

Noindex and Blocking: Keeping Pages Out of the Index

Noindex is a directive that tells Google not to include a page in its search index. Unlike robots.txt, which controls crawling, a noindex directive requires Google to crawl the page — it reads the directive and then removes or declines to add the page to the index. This is an important distinction: if you both block a page in robots.txt and add a noindex meta tag, Google cannot read the noindex tag because it cannot crawl the page.

Methods for Implementing Noindex

  • <meta name="robots" content="noindex">. Added in the HTML <head>. The most common method; prevents indexing of an individual page.
  • X-Robots-Tag: noindex. Sent as an HTTP response header. Use it for non-HTML files (PDFs, images) or when you cannot edit the HTML.
  • Google Search Console removal tool. Used from the GSC interface. Temporarily removes URLs already in the index (the removal lasts about six months).
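For auditing at scale, the meta-tag variant can be detected with a few lines of standard-library Python (a sketch; a real audit would also check the X-Robots-Tag response header):

```python
from html.parser import HTMLParser

class NoindexChecker(HTMLParser):
    """Flags a page as noindexed if a robots meta tag contains 'noindex'."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == "meta"
                and attrs.get("name", "").lower() == "robots"
                and "noindex" in attrs.get("content", "").lower()):
            self.noindex = True

html = '<html><head><meta name="robots" content="noindex"></head></html>'
checker = NoindexChecker()
checker.feed(html)
print(checker.noindex)  # True
```

Running a check like this across a crawl export quickly surfaces pages that are accidentally noindexed.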

When to Use Noindex

Noindex is appropriate for pages that should exist on your site but should not appear in search results:

  • Thank-you pages after form submissions
  • Internal search results pages
  • Login and account management pages
  • Staging or preview pages accessible publicly but not intended for search
  • Low-value tag or archive pages on blog platforms
  • Paginated pages beyond page 2 (optional — depends on content value)

Noindex removes pages from the index over time

Adding noindex to a page that is already indexed does not immediately remove it. Googlebot must re-crawl the page, read the directive, and process the removal. For large sites this can take weeks. For urgent removal, use the URL Removal tool in Google Search Console alongside a noindex tag.

Index Coverage Issues: What They Mean and How to Fix Them

Google Search Console's Index Coverage report (now called the Pages report under Indexing) shows how Google has classified all discovered URLs on your site. Understanding these status categories is essential for diagnosing crawlability and indexation problems.

  • Indexed. The page is in Google's index and eligible to appear in search results. No action needed.
  • Crawled – currently not indexed. Google crawled the page but chose not to index it (a quality signal). Improve page quality or consolidate with a canonical.
  • Discovered – currently not indexed. Google knows the URL exists but has not crawled it yet. Improve internal linking; check crawl budget.
  • Duplicate, Google chose different canonical. Google selected a different canonical URL than the one you declared. Audit your canonical signals; check redirect chains and internal linking.
  • Excluded by noindex. The page was not indexed because of a noindex directive. Usually intentional; remove the noindex if the page should be indexed.
  • Blocked by robots.txt. Googlebot was prevented from crawling by robots.txt. Verify this is intentional; if not, update robots.txt.
  • Not found (404). The page returned a 404 error. Redirect to a relevant page if the content has moved; leave as 404 if genuinely deleted.
  • Soft 404. The page returns 200 but displays no useful content. Return a proper 404/410, or add meaningful content.

The "Crawled — Currently Not Indexed" Problem

This is one of the most common and misunderstood GSC statuses. When Google crawls a page but declines to index it, the most common reasons are: the content is too thin or provides little value beyond other indexed pages, the page is near-duplicate of another already-indexed page, the page has very few or no inbound internal links, or Google's quality systems have flagged the page as low-quality.

The fix depends on the cause. For thin content pages, either substantially improve the content quality or consolidate multiple thin pages into one comprehensive page. For near-duplicates, implement canonical tags pointing to the preferred version. For pages with poor internal linking, add contextually relevant internal links from high-authority pages on your site.

Running a Crawl and Indexation Audit

A crawl and indexation audit gives you a complete picture of how Google sees your site's structure. Here is a systematic approach using free tools — primarily Google Search Console.

  1. Check your robots.txt. Visit yourdomain.com/robots.txt directly. Use GSC's robots.txt report to verify that no critical pages or resources are accidentally blocked.

  2. Review the GSC Pages report. Under Indexing → Pages, review the breakdown of indexed vs non-indexed URLs. Export the full list and categorise the reasons for non-indexing.

  3. Audit your sitemap. Submit your sitemap in GSC and compare the "Submitted" and "Indexed" counts. A significant gap signals either quality issues or canonicalisation problems.

  4. Check for canonical conflicts. Look for pages where GSC shows "Duplicate, Google chose different canonical than user." These indicate a mismatch between your declared canonicals and the signals Google is reading.

  5. Crawl your site with a crawler. Tools like Screaming Frog (free up to 500 URLs) or Sitebulb can crawl your site and identify redirect chains, broken internal links, missing canonical tags, and pages returning incorrect status codes.

  6. Check server response time. Slow server responses reduce crawl rate. Use GSC's Crawl Stats report (Settings → Crawl Stats) to see Google's average response time for your server over the past 90 days.
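Steps 2 and 3 of an audit like this reduce to a set comparison between the URLs you submitted and the URLs that were actually crawled or indexed. A sketch with hypothetical URL lists:

```python
submitted = {
    "https://example.com/a/",
    "https://example.com/b/",
    "https://example.com/c/",
}
crawled = {
    "https://example.com/a/",
    "https://example.com/c/",
    "https://example.com/orphan/",
}

# In the sitemap but never crawled: candidates for quality or linking fixes.
not_crawled = submitted - crawled
# Crawled but missing from the sitemap: possible orphans or unwanted URLs.
not_submitted = crawled - submitted

print(sorted(not_crawled))    # ['https://example.com/b/']
print(sorted(not_submitted))  # ['https://example.com/orphan/']
```

In practice the two sets come from your sitemap file and a GSC Pages export or crawler output, but the gap analysis is exactly this set difference.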

GSC crawl stats report

Google Search Console's Crawl Stats report (under Settings) shows exactly how many pages Googlebot crawled per day over the last 90 days, the average response time, and the breakdown of crawl requests by file type. This is the most direct way to observe your actual crawl budget in use.

Authentic Sources Used in This Guide

Official documentation, academic standards, and verified technical sources only.

Official · Google Search Central — Crawlers Overview

Official documentation on Googlebot, crawl rate, and crawl budget.

Official · Google Search Central — robots.txt Introduction

Official guidance on robots.txt syntax and best practices.

Official · Google Search Central — Sitemaps Overview

Official sitemap documentation including format and submission guidance.

Official · Google Search Central — Canonicalisation

Official guidance on duplicate content and canonical URL selection.

Official · Google Search Central — Block Indexing

Official noindex directive documentation.

Standard · RFC 9309 — Robots Exclusion Protocol

IETF proposed standard formalising the robots.txt specification (2022).
