
Pro Robots.txt & Sitemap Generator

Draft crawl rules and a standards-friendly URL set in one static workspace—toggle between robots and XML, then copy or download. No sticky panels, no uploads.

User-agent

Choose a common crawler preset or enter a custom token.

Allow / Disallow rules

Paths are passed through as you type (leading slash recommended, e.g. /admin/).

One absolute URL per line. Emitted as Sitemap: directives at the end of the file.

Output

Live robots.txt — copy or download.

Privacy: Everything is assembled in your browser. Deploy the downloaded file to your web server root—this page does not publish or store your domains.

Knowledge base

The role of Robots.txt in crawl budget optimization

Crawl budget is not a mystical quota for every site—Google clarifies that most publishers do not need to micromanage it. Where it matters is large sites with faceted navigation, near-duplicate parameter explosions, or endless calendar archives. A precise robots.txt prevents crawlers from burning queues on low-value URLs (session IDs, admin themes, internal search facets) so discovery work concentrates on templates you want in the index.

Robots.txt is a gate, not a ranking lever: disallowing a URL stops compliant bots from fetching it but does not remove it from the index if external links still point there. Pair exclusions with canonical tags and server-side consolidation so marketing parameters redirect cleanly, and use noindex only on URLs that remain crawlable, because a crawler cannot see a noindex tag on a page it is not allowed to fetch. Review the file quarterly and after CMS migrations, because stale disallow lines often linger from staging cutovers and accidentally block CSS assets needed for rendering.
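As a rough illustration of the gate behaviour described above, the sketch below assembles a small robots.txt and checks it with Python's standard-library parser. The helper name build_robots, the example.com host, and the chosen paths are assumptions made for this example, not part of the generator on this page.

```python
from urllib.robotparser import RobotFileParser


def build_robots(groups, sitemaps):
    """Assemble robots.txt text.

    groups: list of (user_agent, [(directive, path), ...]) tuples.
    sitemaps: absolute sitemap URLs, emitted at the end of the file.
    """
    lines = []
    for agent, directives in groups:
        lines.append(f"User-agent: {agent}")
        for directive, path in directives:
            lines.append(f"{directive}: {path}")
        lines.append("")  # blank line closes the group
    for url in sitemaps:
        lines.append(f"Sitemap: {url}")
    return "\n".join(lines) + "\n"


# Hypothetical low-value paths: an admin area and internal search.
robots_txt = build_robots(
    [("*", [("Disallow", "/admin/"), ("Disallow", "/search")])],
    ["https://example.com/sitemap.xml"],
)

# Parse the generated text and ask whether a generic crawler may fetch a URL.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

admin_allowed = parser.can_fetch("*", "https://example.com/admin/login")
product_allowed = parser.can_fetch("*", "https://example.com/products/1")
```

Here admin_allowed comes back False and product_allowed comes back True. The check only says whether a compliant crawler would fetch the URL; it says nothing about whether the URL is indexed.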

Segment rules by user agent only when engineering requirements call for it. Most teams ship one User-agent: * block plus explicit groups for Googlebot or the AdSense crawler (Mediapartners-Google) when ad serving requires it. Over-splitting the file increases maintenance debt and makes Search Console results harder to reason about for junior developers rotating through SEO work.

Why XML Sitemaps are essential for fast indexing

Sitemaps broadcast the discovery graph: new locales, freshly merged product detail pages (PDPs), and isolated landing pages that do not yet have strong internal links. They do not replace hyperlinks, since Google still discovers and prioritizes most URLs through the links it follows, but they can shorten time-to-first-crawl on fresh domains and after replatforms when redirects are mid-rollout.

Enterprise catalogs shard into multiple sitemap files and reference them from a sitemap index; this keeps each file under protocol limits and accelerates diffing during deploy pipelines. Continuous integration jobs can emit XML straight from the headless CMS webhook, then verify counts against database rows to catch silent drops in nightly builds before marketing pushes paid traffic.
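The sharding and count check described above might look like the following sketch. The function names, the example.com URLs, and the products-N.xml naming scheme are hypothetical; a real pipeline would read URLs from its own data store.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit from the sitemaps.org protocol


def shard_urls(urls, size=MAX_URLS_PER_FILE):
    """Split a list of URLs into consecutive chunks of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(shard_locations):
    """Return sitemap index XML that references each shard file by URL."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for location in shard_locations:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = location
    return ET.tostring(root, encoding="unicode")


# Hypothetical catalog of 120,001 product URLs.
catalog = [f"https://example.com/p/{i}" for i in range(120_001)]
shards = shard_urls(catalog)

# Count check: every source URL must land in exactly one shard.
if sum(len(s) for s in shards) != len(catalog):
    raise RuntimeError("URL count mismatch between catalog and shards")

index_xml = build_sitemap_index(
    f"https://example.com/sitemaps/products-{n}.xml"
    for n in range(1, len(shards) + 1)
)
```

ET.tostring with encoding="unicode" returns the markup without an XML declaration, so a deploy script would typically prepend one when writing the file.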

Optional priority and changefreq fields help humans document intent inside the repo. Search engines treat them as weak hints at best (Google's documentation says it ignores both), so invest more effort in accurate <loc> values, valid XML escaping, and HTTPS consistency than in debating whether blog posts deserve 0.6 versus 0.7.
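A minimal sketch of the <loc> hygiene mentioned above, assuming a site that wants every listed URL on HTTPS. The function name build_urlset and the sample URL are illustrative only. ElementTree escapes reserved characters such as & when it serializes text, which avoids hand-rolled string escaping.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(urls):
    """Return <urlset> XML for the given absolute URLs.

    Raises ValueError for any URL that is not HTTPS (an assumed site policy).
    """
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc in urls:
        if not loc.startswith("https://"):
            raise ValueError(f"non-HTTPS URL in sitemap: {loc}")
        url_el = ET.SubElement(root, "url")
        # The serializer turns & into &amp; and < into &lt; automatically.
        ET.SubElement(url_el, "loc").text = loc
    return ET.tostring(root, encoding="unicode")


sample_url = "https://example.com/search?q=shoes&page=2"
urlset_xml = build_urlset([sample_url])

# Round-trip: parsing the output should recover the original URL unchanged.
parsed_loc = ET.fromstring(urlset_xml).find(
    f"{{{SITEMAP_NS}}}url/{{{SITEMAP_NS}}}loc"
).text
```

The round-trip at the end is a cheap way to confirm that escaping did not alter the URL a crawler will read.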

Common technical SEO mistakes in robots.txt configuration

The classic error is copying staging rules to production: a single Disallow: / from pre-launch slips onto the live host and blocks all crawling until someone notices search performance data drying up. Another frequent bug is assuming case insensitivity. Directive names are case-insensitive, but path values are matched case-sensitively, so Disallow: /Admin/ does not cover /admin/. Keep rule paths in the same case your server actually serves.
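Both mistakes above can be caught with a small pre-deploy check. This is a sketch using Python's standard-library parser, which compares paths case-sensitively; the helper names and sample files are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def blocks_site_root(robots_txt, agent="*"):
    """True if the given agent is not allowed to fetch the site root."""
    return not load_rules(robots_txt).can_fetch(agent, "/")


# A leftover staging file versus a typical production file.
staging_file = "User-agent: *\nDisallow: /\n"
production_file = "User-agent: *\nDisallow: /admin/\n"

staging_blocked = blocks_site_root(staging_file)
production_blocked = blocks_site_root(production_file)

# Case sensitivity: the rule is written with a capital A.
mixed_case = load_rules("User-agent: *\nDisallow: /Admin/\n")
upper_allowed = mixed_case.can_fetch("*", "/Admin/users")
lower_allowed = mixed_case.can_fetch("*", "/admin/users")
```

In this example staging_blocked is True and production_blocked is False, while the capitalised rule blocks /Admin/users but leaves /admin/users fetchable. A CI job could fail the build whenever blocks_site_root returns True for the production file.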

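Misuse of wildcards also hurts: an overly broad Disallow: /*? blocks every URL that contains a query string, including parameterized assets required for analytics or personalization QA, while too-narrow patterns leak duplicate faceted URLs. Robots.txt patterns are not regular expressions; the widely supported syntax has only * (any sequence of characters) and $ (end of URL). Test with URL Inspection and coverage exports instead of guessing from regex muscle memory.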
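To reason about wildcard patterns before deploying them, a small hand-rolled matcher can help. The sketch below follows the * and $ semantics described in RFC 9309 and translates a rule path into a regular expression internally. The function name rule_matches is an assumption, and the sketch deliberately leaves out precedence between competing Allow and Disallow rules.

```python
import re


def rule_matches(pattern, path):
    """Return True if a robots.txt rule path matches a URL path.

    '*' matches any run of characters; a trailing '$' anchors the end of
    the URL. Without '$', the pattern only needs to match a prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces, then join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        return re.fullmatch(regex, path) is not None
    return re.match(regex, path) is not None


# Disallow: /*? catches any URL with a query string.
query_hit = rule_matches("/*?", "/products?color=red")
query_miss = rule_matches("/*?", "/products")

# Disallow: /*.pdf$ catches URLs ending in .pdf, but not .pdf plus a query.
pdf_hit = rule_matches("/*.pdf$", "/files/report.pdf")
pdf_miss = rule_matches("/*.pdf$", "/files/report.pdf?download=1")
```

Running a list of real URLs from server logs through a matcher like this shows quickly whether a pattern is too broad or too narrow.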

Forgetting sitemap declarations in robots.txt is not fatal—Search Console accepts manual submissions—but listing canonical sitemap endpoints reduces friction for alternate crawlers and documents the intended discovery surface for security reviewers auditing your public endpoints.

Frequently asked questions

What is Crawl-delay in robots.txt and should I use it?

Crawl-delay is a non-standard directive some crawlers historically honored (for example, older Bing documentation referenced it) to suggest a minimum pause between requests in seconds. Googlebot does not support Crawl-delay in robots.txt. The Search Console crawl rate limiter that once served this purpose was retired in early 2024; Google's current guidance for an overloaded origin is to temporarily return 500, 503, or 429 responses. A misconfigured Crawl-delay can throttle helpful bots, so use it only when your origin team explicitly needs polite spacing and you have verified that the target bot supports it.
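To see which crawlers a file actually assigns a delay to, Python's standard-library parser exposes a crawl_delay method (Python 3.6 and later). The sample file below is hypothetical; the sketch only reads the directive and does not imply that any particular crawler obeys it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: a delay for one named crawler, none for everyone else.
sample_robots = """\
User-agent: bingbot
Crawl-delay: 10
Disallow: /search

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(sample_robots.splitlines())

# crawl_delay returns the value in seconds, or None when no delay applies.
bing_delay = parser.crawl_delay("bingbot")
default_delay = parser.crawl_delay("Googlebot")
```

Here bing_delay is 10 and default_delay is None, because the catch-all group sets no delay.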

What are the official limits for a single sitemap.xml file?

The sitemaps.org protocol allows a maximum of 50,000 URLs and an uncompressed file size of 50 MB (52,428,800 bytes) per sitemap. If your catalog exceeds either bound, split URLs into multiple sitemap files and reference them from a sitemap index file. CDNs and gzip transfer do not change the uncompressed limit for validators, so plan shard files before launches.
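A pre-launch check against both limits can be as small as the sketch below. The function name limit_violations is hypothetical, and the size is measured on the UTF-8 encoded, uncompressed text.

```python
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 52,428,800 bytes, uncompressed


def limit_violations(xml_text, url_count):
    """Return a list of human-readable problems; empty means within limits."""
    problems = []
    if url_count > MAX_URLS:
        problems.append(f"{url_count} URLs exceeds the {MAX_URLS} limit")
    size = len(xml_text.encode("utf-8"))
    if size > MAX_BYTES:
        problems.append(f"{size} bytes exceeds the {MAX_BYTES} byte limit")
    return problems


ok_report = limit_violations("<urlset/>", url_count=10)
too_many_report = limit_violations("<urlset/>", url_count=50_001)
```

An empty list means the file can ship as a single sitemap; anything else is a signal to shard.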

How do I validate robots.txt and sitemaps in Google Search Console?

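In Search Console, the legacy robots.txt Tester has been retired. Use the robots.txt report (under Settings) to see which robots.txt files Google found, when they were last fetched, and any errors or warnings, and use URL Inspection to check whether a specific URL is blocked. For sitemaps, open the Sitemaps report to submit the sitemap URL and review its status after processing. Fix server errors and avoid blocking URLs you want indexed. Robots.txt stops crawling but does not remove URLs that are already indexed; to drop them, leave them crawlable and serve noindex, or use the Removals tool for a temporary hide.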

Do priority and changefreq values guarantee ranking or crawl order?

No. The <priority> and <changefreq> hints in XML sitemaps are optional signals; search engines may ignore them in favor of observed popularity, internal links, and freshness. Use them to communicate relative importance within your own sitemap, not as a substitute for strong information architecture or internal linking, especially on large ecommerce catalogs.

Next steps

Ship search snippets and validate structured output before you merge.