The robots.txt file looks simple: a few lines in a text file at the root of the domain. In practice, it is one of the most sensitive parts of e-commerce SEO. One bad line can block Google from 50,000 product pages. One missing line lets AI bots scrape your catalog without permission.
Here is the clean structure for 2026, the pitfalls to avoid, and how to handle AI bots, which have become a major issue.
For a catalog with 5 filters × 10 options each, selecting one option per filter yields 10^5 = 100,000 dynamically generated URLs. Google spends its crawl budget on those combinations, finds nothing unique, and neglects the real product pages in favor of noise.
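The arithmetic behind that figure can be checked quickly. The sketch below uses the numbers from the example (5 filters, 10 options each). The second count, where a filter may also be left unset, is an added assumption for illustration and is not a figure from the article.

```python
# Count the URL variants that faceted navigation can generate.
# Figures from the example: 5 independent filters, 10 options each.
FILTER_COUNT = 5
OPTIONS_PER_FILTER = 10

def urls_all_filters_set(filters: int, options: int) -> int:
    """Every filter has exactly one option selected."""
    return options ** filters

def urls_filters_optional(filters: int, options: int) -> int:
    """Assumption: each filter is either unset or has one option selected.
    The single 'nothing selected' case is the unfiltered page, so subtract it."""
    return (options + 1) ** filters - 1

print(urls_all_filters_set(FILTER_COUNT, OPTIONS_PER_FILTER))   # 100000
print(urls_filters_optional(FILTER_COUNT, OPTIONS_PER_FILTER))  # 161050
```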
Solution: block every URL that carries a filtering query parameter, using one Disallow rule per parameter.
Alternatively, on the frontend side, use clean URLs for important filters (/collections/black-shoes/ rather than ?color=black) and block only the infinite combinations.
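As a minimal sketch of the parameter-blocking approach, the Python snippet below generates the Disallow lines for a list of filter parameters. The parameter names (color, size, sort_by) are hypothetical; replace them with the query parameters your store actually emits. The rules rely on the `*` wildcard, which Google and Bing support.

```python
# Hypothetical filter parameters; adjust to the query parameters your store uses.
FILTER_PARAMS = ["color", "size", "sort_by"]

def build_filter_rules(params: list[str]) -> str:
    """Return a robots.txt group that disallows any URL whose query string
    contains one of the given parameters (relies on the '*' wildcard)."""
    lines = ["User-agent: *"]
    for param in params:
        # "/*?*color=" matches "?color=..." and "?size=42&color=..." alike.
        lines.append(f"Disallow: /*?*{param}=")
    return "\n".join(lines)

print(build_filter_rules(FILTER_PARAMS))
```

Clean URLs such as /collections/black-shoes/ contain no query string, so these rules leave them crawlable. Note that a pattern like `/*?*color=` also matches any other parameter whose name ends in "color", so check the generated rules against your real URLs.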
Since 2023, AI bots (GPTBot, ClaudeBot, PerplexityBot, GoogleOther, Bytespider, etc.) have been actively crawling the web, either to train models or to feed AI search products. Your e-commerce catalog is a target: product pages are structured, informative, and high-volume.
Choice A — Allow all AI bots (recommended for 2026)
AI search engines (ChatGPT, Perplexity, Claude, Google AI Overviews) are citing more and more sources. Being cited there means qualified traffic in 2026 and 2027. Blocking AI bots cuts off that source.
Choice B — Block training AI bots, allow AI search
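Some AI user agents are used for training (OpenAI's GPTBot, Google-Extended), others for real-time retrieval when a user asks a question (ChatGPT-User, Perplexity-User). Google-Extended is a robots.txt control token rather than a separate crawler. You can treat the two groups separately: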
User-Agent: GPTBot # Training
Disallow: /
User-Agent: ChatGPT-User # Live search
Allow: /
This is pragmatic if you are concerned about your content feeding models without any return, while keeping visibility in AI search.
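To sanity-check that split, Python's standard-library `urllib.robotparser` can parse the rules and report what each user agent may fetch. This is a sketch, assuming the four-line Choice B configuration and a hypothetical product URL. Keep in mind that this parser does plain prefix matching and does not understand `*` wildcards in paths.

```python
from urllib.robotparser import RobotFileParser

# The Choice B rules from the text, as a list of lines.
RULES = """\
User-Agent: GPTBot # Training
Disallow: /

User-Agent: ChatGPT-User # Live search
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(RULES)

# Hypothetical URL, used only for illustration.
url = "https://shop.example.com/products/black-shoes"

# Googlebot has no group of its own here, so it falls back to "allowed".
for agent in ("GPTBot", "ChatGPT-User", "Googlebot"):
    print(agent, parser.can_fetch(agent, url))
```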
Choice C — Block everything
User-Agent: GPTBot
Disallow: /
Rarely a good idea in 2026. To block everything, repeat this group for each AI user agent. Cost: zero visibility in AI search engines. Intended benefit: compliant crawlers stop collecting your content to train AI systems for free.
For e-commerce stores, visibility in AI search has become measurable in traffic (5-10% of referrers in some verticals). Blocking AI bots does not fully prevent training, because your content can still reach models through other channels, but it does remove your chance of being cited later.
Specific URL test: for each blocked/allowed pattern, verify that a test URL returns the expected result
Crawl simulation: Screaming Frog can simulate a crawl based on your robots.txt before deployment
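The specific URL test can be scripted before deployment. The sketch below is a simplified matcher, not Google's parser: it assumes Google-style semantics (`*` wildcard, trailing `$` anchor, longest matching rule wins, Allow wins ties) for a single user-agent group. The rules and test paths are hypothetical.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex.
    '*' matches any run of characters; a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Return True if `path` (including any query string) may be crawled.
    The longest matching pattern wins; on equal length, Allow wins."""
    best = None  # (pattern length, is_allow) of the best match so far
    for directive, pattern in rules:
        if pattern and pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

# Hypothetical rules for one user-agent group, with the result expected per path.
RULES = [
    ("Disallow", "/*?*color="),
    ("Disallow", "/account"),
    ("Allow", "/account/login"),
]
EXPECTED = {
    "/collections/black-shoes/": True,
    "/collections/shoes?color=black": False,
    "/collections/shoes?size=42&color=black": False,
    "/account/orders": False,
    "/account/login": True,
}

for path, expected in EXPECTED.items():
    result = is_allowed(RULES, path)
    status = "OK" if result == expected else "MISMATCH"
    print(f"{status}: {path} -> {'allowed' if result else 'blocked'}")
```

Because the longest pattern wins, a long Allow rule can silently override a shorter wildcard Disallow, which is the kind of surprise such a test table helps catch.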
Classic mistake to avoid: pushing a Disallow: / (which blocks the entire site) to production while trying to test something. Real case: a large French store disappeared from Google for 3 weeks because this line was pushed by accident.
Shopify automatically generates a decent robots.txt. Since 2021, you can customize it through the robots.txt.liquid file in your theme (Online Store → Themes → Edit code). For a clean override:
{%- for group in robots.default_groups -%}
{{- group.user_agent }}
{%- for rule in group.rules -%}
{{ rule }}
{%- endfor -%}
{%- if group.sitemap != blank -%}
{{ group.sitemap }}
{%- endif -%}
{%- endfor -%}
# Custom rules (append)
User-Agent: GPTBot
Allow: /
Disallow: /admin/
WordPress has no physical robots.txt file by default; it serves a virtual one. To customize it, use an SEO plugin (Yoast, Rank Math) or create the file manually at the domain root. If you override the default, keep its behavior: block /wp-admin/ but allow /wp-admin/admin-ajax.php, which some frontend features require.
Create a public/robots.txt file or generate it dynamically via app/robots.ts (App Router, Next.js 13.3+). For multi-locale sites served from the same host, a single robots.txt is enough; there is no need to localize it.
Yes. robots.txt is a signal, not a firewall. Compliant bots (Google, Bing, the main AI bots) obey it. Malicious bots ignore it. If a bot appears in your logs despite a Disallow, use a WAF (Cloudflare, for example) to block it for real.
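To see whether a bot keeps crawling despite a Disallow, you can count its requests in your access logs. The sketch below is a rough illustration: the log lines are made up and simplified, and matching on the user-agent string assumes the bot identifies itself honestly.

```python
from collections import Counter

# User-agent tokens to look for (bot names taken from the article).
BOT_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Bytespider")

def count_bot_hits(log_lines, tokens=BOT_TOKENS):
    """Count log lines whose text contains one of the bot tokens (case-insensitive)."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in tokens:
            if token.lower() in lowered:
                hits[token] += 1
                break  # count each request once
    return hits

# Made-up, simplified access-log lines for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 "GET /products/black-shoes HTTP/1.1" 200 "GPTBot/1.0"',
    '203.0.113.9 "GET /collections/shoes HTTP/1.1" 200 "Mozilla/5.0"',
    '203.0.113.5 "GET /products/red-boots HTTP/1.1" 200 "GPTBot/1.0"',
    '198.51.100.7 "GET /cart HTTP/1.1" 200 "Bytespider"',
]

for bot, count in count_bot_hits(SAMPLE_LOG).most_common():
    print(f"{bot}: {count}")
```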
That is a strategic choice. Blocking Ahrefs/Semrush prevents competitors from analyzing your SEO profile. Cost: you also lose data about your own site in those tools. Recommendation: allow them if you use those tools, block them if you do not.
Yes. User-Agent: Googlebot + Disallow: / would block Google from all crawling. A real case where this is useful: a development site before launch, to avoid premature indexing. With one absolute warning: always remove that rule before launching the public production site.
Google generally caches robots.txt for up to 24 hours, so on an active site a change is usually picked up within a day or two. If you newly allow content, expect 1-2 weeks to see effects in indexing. If you block content, crawling stops almost as soon as Google has fetched the updated file.
Indirectly. When configured correctly, it focuses crawl budget on the right URLs, which speeds up indexing of new pages and improves perceived freshness. When configured badly, it can under-index your catalog and hurt organic traffic.
Nothing, that is their problem. It means their product pages can no longer be crawled, so they drop out of the index and organic traffic falls. Do not copy that strategy unless your business model justifies it (confidential products, pre-order release before official launch).