E-commerce robots.txt: best practices and pitfalls

The robots.txt file looks simple: a few lines in a text file at the root of the domain. In practice, it is one of the most sensitive areas of e-commerce SEO. One bad line can block Google from 50,000 product pages. One missing line can let AI bots scrape your catalog without permission.

Here is the clean structure for 2026, the pitfalls to avoid, and how to handle AI bots, which have become a major issue.

What robots.txt does and does not do

Does:

Tell compliant bots which URLs not to crawl
Indicate the sitemap location
Differentiate rules by user-agent

Does not:

Prevent indexing (a bot can index without crawling, through external links)
Secure confidential URLs (robots.txt is public, anyone can read it)
Force a bot to follow the rule (malicious bots ignore robots.txt)

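Critical rule: robots.txt is a signal, not a barrier. To really block access, use authentication or a firewall. To keep a page out of the index, use a meta robots noindex tag, and leave that page crawlable so bots can actually read the tag.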

The standard structure of an e-commerce robots.txt

Here is a clean robots.txt for a Shopify or WooCommerce site:

# Rules for all bots
User-Agent: *
Allow: /
Disallow: /api/
Disallow: /admin/
Disallow: /account/
Disallow: /cart
Disallow: /checkout
Disallow: /search?
Disallow: /*?variant=
Disallow: /*?utm_
Disallow: /*?preview=
Disallow: /collections/*?sort_by=
Disallow: /collections/*?*filter.*

# Special rules for AI bots (see dedicated section)
User-Agent: GPTBot
Disallow: /admin/
Allow: /

# Sitemap location
Sitemap: https://example.com/sitemap.xml

Breakdown:

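User-Agent: * = default group for all bots that do not have a specific section (a bot with its own section follows only that section and ignores this one)
Allow: / = allow everything by default (optional, since anything that is not disallowed is already allowed)
Disallow: = block sensitive sections; when an Allow and a Disallow both match a URL, the longest (most specific) path wins, not the first line in the file
Sitemap: = indicates the sitemap location (useful for AI bots and other crawlers that cannot be given a sitemap through GSC)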
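
One consequence of the User-Agent: * line is easy to miss: a bot with its own section does not inherit the general rules. The Python sketch below illustrates this with a deliberately simplified parser. parse_groups and group_for are hypothetical helper names, the user-agent comparison is a plain case-insensitive match, and real crawlers may differ in the details.

```python
def parse_groups(robots_txt):
    """Map each lowercase user-agent token to its list of (directive, path) rules."""
    groups = {}
    agents = []             # user agents of the group currently being read
    reading_agents = False  # True while we are on consecutive User-Agent lines
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and padding
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:            # a new group starts here
                agents = []
            reading_agents = True
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            reading_agents = False
            for agent in agents:
                groups[agent].append((field, value))
    return groups


def group_for(groups, bot_name):
    """Return the rules a bot follows: its own group if present, else the * group."""
    return groups.get(bot_name.lower(), groups.get("*", []))


ROBOTS = """\
User-Agent: *
Disallow: /cart
Disallow: /checkout

User-Agent: GPTBot
Disallow: /admin/
Allow: /
"""

groups = parse_groups(ROBOTS)
print(group_for(groups, "GPTBot"))    # only the GPTBot rules; /cart and /checkout are absent
print(group_for(groups, "Bingbot"))   # no dedicated section, so the * rules apply
```

In the sample file above, this means GPTBot is only kept out of /admin/. If a bot with a dedicated section should also stay out of /cart, /checkout, and the other general exclusions, those Disallow lines have to be repeated in its section.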

The 7 areas to block systematically

On any e-commerce site, block:

/api/ — internal API endpoints, never useful in SERPs
/admin/ or equivalent (example: /wp-admin/) — back office, confidential
/account/ — authenticated customer area, private
/cart and /checkout — transactional pages, no SEO value
/search?q= — internal search results (creates infinite URLs)
/*?variant= — product variants (canonical to parent)
/*?utm_ — tracking URLs (duplicates of canonical URLs)

For WooCommerce specifically, add:

Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-login.php
Disallow: /?add-to-cart=
Disallow: /?add_to_wishlist=

For Shopify, add:

Disallow: /orders
Disallow: /wishlist

Faceted filters: the #1 pitfall

This is the main cause of crawl budget waste. On a catalog with dynamic filters (size, color, brand, price), each combination generates a URL:

/collections/shoes?color=black
/collections/shoes?color=black&size=44
/collections/shoes?color=black&size=44&brand=atelier-maison
/collections/shoes?color=black&size=44&brand=atelier-maison&price=50-100

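For a catalog with 5 filters × 10 options each, that means 10^5 = 100,000 URLs with all five filters set, and 11^5 - 1 = 161,050 once partial combinations like the ones above are counted (each filter is either unset or set to one of its 10 values). Google can spend a large share of its crawl on these URLs, find nothing unique, and reach the real product pages less often.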
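
The count can be checked in a few lines of Python. This sketch assumes single-select filters, where each filter is either unset or set to exactly one of its options, and a fixed parameter order. filtered_url_count is a hypothetical helper name.

```python
def filtered_url_count(filters, options_per_filter):
    """Count distinct filter states, excluding the unfiltered collection page."""
    # Each filter has (options + 1) states: unset, or one of its options.
    return (options_per_filter + 1) ** filters - 1


print(filtered_url_count(5, 10))  # 161050 URLs when partial combinations are included
print(10 ** 5)                    # 100000 URLs when all five filters are always set
```

Multi-select filters or varying parameter order would push the number higher still.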

Solution: block all URLs with filtering query parameters:

Disallow: /collections/*?*filter.*
Disallow: /collections/*?*sort_by*
Disallow: /collections/*?*pg=

Alternatively, on the frontend side, use clean URLs for important filters (/collections/black-shoes/ rather than ?color=black) and block only the infinite combinations.
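
Before writing the Disallow lines, it can help to measure how much of a URL sample (from a crawl export or server logs) is made of filter URLs. The sketch below is one way to do that in Python. The parameter names in FILTER_PARAMS are assumptions taken from the examples in this article, and classify is a hypothetical helper.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

# Assumed filter parameter names, taken from the examples in this article.
FILTER_PARAMS = {"color", "size", "brand", "price", "sort_by"}


def classify(url):
    """Return 'clean', 'filtered', or 'other-params' for one URL."""
    query = urlsplit(url).query
    names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
    if not names:
        return "clean"
    if any(n in FILTER_PARAMS or n.startswith("filter.") for n in names):
        return "filtered"
    return "other-params"


urls = [
    "/collections/shoes",
    "/collections/shoes?color=black",
    "/collections/shoes?color=black&size=44",
    "/collections/black-shoes/",
    "/products/boot?variant=123",
]
counts = Counter(classify(u) for u in urls)
print(counts)
```

A high share of "filtered" URLs in real crawl data is a sign that the filter rules above are worth adding. Clean filter URLs such as /collections/black-shoes/ fall into the "clean" bucket and stay crawlable.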

AI bots: the new SEO reality

Since 2023, AI bots (GPTBot, ClaudeBot, PerplexityBot, GoogleOther, Bytespider, etc.) have been actively crawling the web, either to train models or to answer user queries in real time. Your e-commerce catalog is a target: product pages are structured, informative, and high-volume.

The 3 possible choices

Choice A — Allow all AI bots (recommended for 2026)

AI search engines (ChatGPT, Perplexity, Claude, Google AI Overviews) are citing more and more sources. Being included in those citations = qualified traffic in 2026 and 2027. Blocking AI bots cuts off that source.

User-Agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /account/

Repeat this pattern for: ClaudeBot, Claude-Web, anthropic-ai, PerplexityBot, Perplexity-User, ChatGPT-User, OAI-SearchBot, Google-Extended, Applebot-Extended, Bytespider, CCBot, cohere-ai, Diffbot, FacebookBot, Meta-ExternalAgent, Meta-ExternalFetcher, Amazonbot, YouBot.
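
Repeating the pattern by hand for twenty user agents invites typos, so the section can be generated instead. This is a minimal Python sketch: render_group and render_ai_section are hypothetical helpers, and AI_BOTS is shortened to three names for readability.

```python
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot"]  # shortened list for the example
PRIVATE_PATHS = ["/admin/", "/account/"]


def render_group(user_agent, disallow):
    """Build one robots.txt group that allows everything except the given paths."""
    lines = [f"User-Agent: {user_agent}", "Allow: /"]
    lines += [f"Disallow: {path}" for path in disallow]
    return "\n".join(lines)


def render_ai_section(bots, disallow):
    """Join one group per bot, separated by blank lines."""
    return "\n\n".join(render_group(bot, disallow) for bot in bots)


section = render_ai_section(AI_BOTS, PRIVATE_PATHS)
print(section)
```

The robots.txt standard (RFC 9309) also lets several User-Agent lines share one set of rules, which gives a shorter file: stack the User-Agent lines and write the Allow and Disallow lines once underneath.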

Choice B — Block training AI bots, allow AI search

Some AI bots are for training (OpenAI GPTBot, Google-Extended), others are for real-time search (ChatGPT-User, Perplexity-User). You can separate them:

User-Agent: GPTBot          # Training
Disallow: /

User-Agent: ChatGPT-User    # Live search
Allow: /

This is pragmatic if you are concerned about your content feeding models without any return, while keeping visibility in AI search.

Choice C — Block everything

User-Agent: GPTBot
Disallow: /

Rarely a good idea in 2026. To block everything, repeat this group for every AI user agent listed under choice A. Cost: zero visibility in AI search engines. Benefit: compliant crawlers stop collecting your content for training.

Why Ecomptimize recommends choice A

For e-commerce stores, visibility in AI search has become measurable in traffic (5-10% of referrers in some verticals). Blocking AI bots does not reliably prevent training, because your content can still be picked up through other means, but it does remove your chance of being cited later.

Test before deployment

Before pushing a new robots.txt, test it:

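Syntax test: via Google Search Console → Settings → robots.txt report (it replaced the retired robots.txt Tester and reads the live file, so check it again right after deployment)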
Specific URL test: for each blocked/allowed pattern, verify that a test URL returns the expected result
Crawl simulation: Screaming Frog can simulate a crawl based on your robots.txt before deployment
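
The specific URL test can be scripted so it runs before every deployment. The sketch below is a simplified evaluator, not a reference implementation: it follows the longest-match precedence described in RFC 9309 (Allow wins ties), treats * as a wildcard and a trailing $ as an end anchor, and ignores percent-encoding. to_regex and is_allowed are hypothetical helpers, and RULES is a subset of the general section shown earlier. Python's built-in urllib.robotparser is not used here because it matches plain prefixes only.

```python
import re


def to_regex(pattern):
    """Translate a robots.txt path pattern: * is a wildcard, a trailing $ anchors the end."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def is_allowed(rules, path):
    """rules: list of (directive, pattern). Longest matching pattern wins; Allow wins ties."""
    best_len, allowed = -1, True  # no matching rule means the URL is allowed
    for directive, pattern in rules:
        if not pattern or not to_regex(pattern).match(path):
            continue
        is_allow = directive.lower() == "allow"
        if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
            best_len, allowed = len(pattern), is_allow
    return allowed


RULES = [
    ("Allow", "/"),
    ("Disallow", "/admin/"),
    ("Disallow", "/cart"),
    ("Disallow", "/search?"),
    ("Disallow", "/*?variant="),
    ("Disallow", "/collections/*?*filter.*"),
]

# URL path -> should it be crawlable?
EXPECTED = {
    "/products/leather-boot": True,
    "/admin/orders": False,
    "/cart": False,
    "/products/leather-boot?variant=42": False,
    "/products/leather-boot?color=black&variant=42": True,  # variant is not the first parameter
    "/collections/shoes?filter.v.price.gte=50": False,
    "/collections/shoes": True,
}

for path, expected in EXPECTED.items():
    result = is_allowed(RULES, path)
    print("OK  " if result == expected else "FAIL", path)
```

The fifth test URL shows a limit of the /*?variant= pattern: it only matches when variant is the first query parameter. If the parameter can appear later in the query string, a second rule such as /*&variant= would be needed.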

Classic mistake to avoid: pushing a Disallow: / (which blocks the entire site) to production while trying to test something. Real case: a large French store disappeared from Google for 3 weeks because this line was pushed by accident.

By platform

Shopify

Shopify automatically generates a decent robots.txt. Since 2021, you can customize it through the robots.txt.liquid file in your theme (Online Store → Themes → Edit code). For a clean override:

{% for group in robots.default_groups %}
  {{- group.user_agent }}

  {%- for rule in group.rules -%}
    {{ rule }}
  {%- endfor -%}

  {%- if group.sitemap != blank -%}
    {{ group.sitemap }}
  {%- endif -%}
{% endfor %}

# Custom rules (append)
User-Agent: GPTBot
Allow: /
Disallow: /admin/

WooCommerce / WordPress

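There is no physical robots.txt file by default: WordPress serves a virtual one that blocks /wp-admin/ and allows /wp-admin/admin-ajax.php. To customize it, use an SEO plugin (Yoast, Rank Math) or place a file manually at the domain root, which replaces the virtual version. When you do, keep the Allow line for admin-ajax.php, because some frontend features depend on it.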

Next.js / headless

Create a public/robots.txt file or generate it dynamically via app/robots.ts (Next 13+). For multi-locale sites, a single robots.txt is enough — no need to localize it.

10-minute robots.txt audit

Essential checks:

https://example.com/robots.txt returns a 200 with Content-Type: text/plain
No accidental Disallow: / in the general section
All sensitive areas (/admin/, /account/, /api/) are blocked
Faceted filters and query parameters are blocked
The sitemap is referenced at the end of the file
AI bots have a dedicated section with your strategy (allow or disallow depending on your choice)
No blocking of /wp-admin/admin-ajax.php (if WordPress) — it breaks frontend functionality
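
Part of this checklist can be automated. The Python sketch below covers the status, Content-Type, blanket Disallow, sensitive areas, sitemap, and admin-ajax checks. It assumes the status code, Content-Type header, and body have already been fetched with an HTTP client of your choice. general_rules and audit are hypothetical helpers, and the checks compare exact paths rather than evaluating wildcards.

```python
REQUIRED_BLOCKS = ("/admin/", "/account/", "/api/")


def general_rules(body):
    """Collect (directive, path) pairs from groups that include User-Agent: *."""
    rules, agents, in_header = [], [], False
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and padding
        field, sep, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if not sep:
            continue
        if field == "user-agent":
            # Consecutive User-Agent lines belong to the same group.
            agents = agents + [value] if in_header else [value]
            in_header = True
        elif field in ("allow", "disallow"):
            in_header = False
            if "*" in agents:
                rules.append((field, value))
    return rules


def audit(status, content_type, body):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if status != 200:
        problems.append(f"status is {status}, expected 200")
    if not content_type.lower().startswith("text/plain"):
        problems.append(f"Content-Type is {content_type}, expected text/plain")
    disallowed = {path for directive, path in general_rules(body) if directive == "disallow"}
    if "/" in disallowed:
        problems.append("blanket 'Disallow: /' in the general section")
    for path in REQUIRED_BLOCKS:
        if path not in disallowed:
            problems.append(f"{path} is not disallowed in the general section")
    if "/wp-admin/admin-ajax.php" in disallowed:
        problems.append("admin-ajax.php is disallowed")
    if not any(l.strip().lower().startswith("sitemap:") for l in body.splitlines()):
        problems.append("no Sitemap line")
    return problems


# Deliberately broken sample file to show the report format.
SAMPLE = "User-Agent: *\nDisallow: /\nDisallow: /admin/\nDisallow: /account/\n"
problems = audit(200, "text/plain; charset=utf-8", SAMPLE)
for problem in problems:
    print("-", problem)
```

The faceted-filter and AI bot checks are left to manual review here, since they depend on the patterns and strategy chosen for each store.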

FAQ

Can I block a bot via robots.txt and still see it in my logs?

Yes. robots.txt is a signal, not a firewall. Compliant bots (Google, Bing, the main AI bots) obey it. Malicious bots ignore it. If a bot appears in your logs despite a Disallow, use a WAF (Cloudflare, for example) to block it for real.
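
To find out which bots ignore your rules, you can count their requests to disallowed paths in the access log. The Python sketch below assumes the common "combined" log format, with the user agent in the last quoted field. The two sample lines are made up for illustration, and disallowed_hits is a hypothetical helper. User-agent strings can be spoofed, so treat the result as a hint rather than proof.

```python
import re
from collections import Counter

BOT_TOKENS = ("GPTBot", "ClaudeBot", "Bytespider")

# Combined log format: request line in the first quoted field, user agent in the last one.
LOG_LINE = re.compile(r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" .* "(?P<ua>[^"]*)"$')


def disallowed_hits(log_lines, prefixes):
    """Count requests per bot token whose path starts with a disallowed prefix."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line.strip())
        if not match:
            continue
        ua = match["ua"].lower()
        token = next((t for t in BOT_TOKENS if t.lower() in ua), None)
        if token and match["path"].startswith(tuple(prefixes)):
            hits[token] += 1
    return hits


# Made-up sample lines (documentation IP range, invented timestamps).
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Jan/2026:10:00:00 +0000] "GET /admin/login HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    '203.0.113.9 - - [10/Jan/2026:10:00:01 +0000] "GET /products/boot HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
]

hits = disallowed_hits(SAMPLE_LOG, ["/admin/", "/account/"])
print(hits)
```

Bots that show up in this count are candidates for a WAF rule.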

Should I block competitor scraping bots (Ahrefs, Semrush)?

That is a strategic choice. Blocking AhrefsBot and SemrushBot limits what competitors can learn about your site through those tools, although links pointing to your site are still discovered when the tools crawl other sites. Cost: you also lose data about your own site in those tools. Recommendation: allow them if you use those tools, block them if you do not.

Can robots.txt block Google?

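Yes. User-Agent: Googlebot + Disallow: / would block Google from all crawling. A case where this is sometimes used: a development site before launch. Keep in mind that blocking crawling does not guarantee the URLs stay out of the index (see the first section), so password protection is the more reliable option for a staging site. With one absolute warning: always remove that rule before launching the public production site.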
