What Is Robots.txt and How Does It Control Search Engine Crawlers?

A robots.txt file is a small text file that tells search engine crawlers which parts of your website they are allowed to access. It is one of the basic tools in technical SEO, but it is also one of the most misunderstood.

Many website owners think robots.txt controls indexing. That is where serious SEO mistakes begin. Robots.txt mainly controls crawling. It tells crawlers where they can and cannot go. It does not guarantee that a page will stay out of search results.

Google explains in its official robots.txt introduction that the file tells crawlers which URLs they can access, mainly to avoid overloading the site with requests. Google also explains that robots.txt is not the right method for keeping a page out of Google.

This matters because a wrong robots.txt rule can block important pages from being crawled. A missing rule can allow crawlers to waste time on low-value URLs. A misunderstood rule can create indexing confusion, especially during migrations, redesigns, or technical SEO fixes.

For that reason, robots.txt should be reviewed as part of a wider technical SEO audit and crawl review, not treated as a file that developers set once and forget.

What is a robots.txt file?

A robots.txt file is a plain text file placed at the root of a website. It gives instructions to search engine crawlers, also called bots or spiders, about which URLs or sections of the site they may crawl.

The file usually lives at:

example.com/robots.txt

For example, if your website is:

https://example.com

your robots.txt file should usually be available at:

https://example.com/robots.txt

A simple robots.txt file may look like this:

User-agent: *
Disallow: /admin/
Disallow: /cart/
Allow: /

Sitemap: https://example.com/sitemap.xml

This example tells all crawlers not to access the admin and cart sections, while also showing the location of the XML sitemap.
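To see how a parser reads the example above, here is a minimal sketch using Python's standard-library urllib.robotparser. The two test URLs are made up for illustration. Keep in mind that this parser applies rules in file order and does not understand wildcards, so it is a rough local check, not a model of how Google evaluates a file.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/
Allow: /

Sitemap: https://example.com/sitemap.xml
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Hypothetical URLs, used only to exercise the rules.
admin_allowed = rules.can_fetch("*", "https://example.com/admin/settings")
product_allowed = rules.can_fetch("*", "https://example.com/products/shoes")
sitemaps = rules.site_maps()  # available in Python 3.8+

print(admin_allowed, product_allowed, sitemaps)
```

The admin URL should come back as not fetchable, the product URL as fetchable, and the sitemap list should contain the single sitemap location from the file.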

Robots.txt is part of the Robots Exclusion Protocol. The protocol was formalized as RFC 9309, which defines how crawlers should interpret robots.txt rules. In SEO work, the file is mostly used to manage crawler access and reduce waste on pages that do not need crawling.

How does robots.txt work?

Robots.txt works by matching crawler names with access rules. When a crawler visits a website, it usually checks the robots.txt file first. Then it reads the rules that apply to its user-agent name.

The most common directive is:

User-agent: *

The asterisk means the rule applies to all crawlers.

A rule for Googlebot may look like this:

User-agent: Googlebot
Disallow: /private/

This tells Googlebot not to crawl the /private/ section.
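The way a crawler picks its own group can be sketched with Python's urllib.robotparser. The second group, the /tmp/ path, and the crawler name ExampleBot are assumptions added for illustration; they are not part of the rule shown above.

```python
from urllib.robotparser import RobotFileParser

# Two groups: one for Googlebot, one for every other crawler.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Check a path on the placeholder host example.com for one crawler."""
    return parser.can_fetch(agent, "https://example.com" + path)


# Googlebot follows only the group that names it.
googlebot_private = allowed("Googlebot", "/private/report.html")
googlebot_tmp = allowed("Googlebot", "/tmp/cache.html")

# A crawler with no group of its own falls back to the "*" group.
other_private = allowed("ExampleBot", "/private/report.html")
other_tmp = allowed("ExampleBot", "/tmp/cache.html")

print(googlebot_private, googlebot_tmp, other_private, other_tmp)
```

In this sketch Googlebot is kept out of /private/ but not /tmp/, because a crawler that has its own group does not also inherit the rules from the `*` group.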

The main directives are simple:

Directive	What it does
User-agent	Defines which crawler the rules apply to
Disallow	Blocks crawling of a path
Allow	Allows crawling of a path, often inside a blocked section
Sitemap	Shows crawlers the location of the XML sitemap

Google’s documentation on how Google interprets robots.txt explains file location, supported rules, and how Google reads valid and invalid lines.

The key point is that robots.txt is a crawling instruction. It does not add password protection, does not hide sensitive data, and does not automatically remove URLs from search results.

Robots.txt vs noindex: what is the difference?

The difference between robots.txt and noindex is one of the most important technical SEO concepts.

Robots.txt controls crawling. Noindex controls indexing.

If you block a page in robots.txt, compliant crawlers such as Googlebot will not fetch it. If Google cannot crawl the page, it cannot see a noindex tag on that page. That is why blocking a URL in robots.txt while also using noindex can backfire: the URL may stay in the index because the instruction to remove it is never read.

Google’s guide to blocking indexing with noindex explains that Google must be able to crawl a page to see the noindex rule. If robots.txt blocks access, Google may not see the tag.

Here is the practical difference.

Goal	Use robots.txt?	Use noindex?
Stop crawlers from accessing a section	Yes	No
Keep a page out of Google results	No	Yes
Reduce crawling of low-value URL patterns	Yes	Sometimes
Hide private information	No	No, use authentication
Remove duplicate thin pages from index	Sometimes	Usually yes
Block internal search results from crawling	Often yes	Sometimes
Prevent checkout or cart crawling	Yes	Usually no

If you want a page out of search results, use noindex or protect the page with login access. If you want to stop crawlers from spending time on a section, robots.txt may help.
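One way to catch the robots.txt and noindex conflict is to compare crawl data against the rules. The sketch below assumes you already know which pages carry a noindex tag (for example from an earlier site crawl); the Page class, the function name, and the sample URLs are hypothetical.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser


@dataclass
class Page:
    url: str
    has_noindex: bool  # assumed to come from an earlier crawl of the site


def hidden_noindex(robots_txt: str, pages: list[Page], agent: str = "Googlebot") -> list[str]:
    """Return URLs whose noindex tag cannot be read because crawling is blocked."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        page.url
        for page in pages
        if page.has_noindex and not parser.can_fetch(agent, page.url)
    ]


ROBOTS_TXT = """\
User-agent: *
Disallow: /old-offers/
"""

pages = [
    Page("https://example.com/old-offers/2019", has_noindex=True),   # blocked and noindex
    Page("https://example.com/old-offers/2020", has_noindex=False),  # blocked only
    Page("https://example.com/thank-you", has_noindex=True),         # crawlable noindex
]

conflicts = hidden_noindex(ROBOTS_TXT, pages)
print(conflicts)
```

Only the first page is reported: it is the one case where a noindex tag exists but the crawler is not allowed to fetch the page and read it.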

Why does robots.txt matter for SEO?

Robots.txt matters for SEO because crawling is the first step before indexing and ranking. If search engines cannot crawl important pages, they cannot read the content, so those pages are unlikely to be indexed or ranked well. If crawlers spend too much time on low-value areas, important sections may receive less attention.

A clean robots.txt file can help with:

Crawl management
Technical SEO control
Reducing crawler access to low-value sections
Protecting crawl budget on large websites
Supporting cleaner indexation
Guiding crawlers toward sitemap files
Preventing accidental crawling of internal tools
Managing staging or parameter-heavy sections

For small websites, robots.txt may be simple. For large e-commerce websites, multilingual websites, news websites, and marketplaces, robots.txt can become more strategic.

For example, an online store may need to prevent crawlers from wasting time on cart URLs, internal search results, login pages, or endless filtered URLs. At the same time, it must allow crawlers to access product pages, category pages, important images, and key resources.

This is why e-commerce SEO should include robots.txt review, especially when product filters and parameter URLs create many crawlable variations.

What should you block in robots.txt?

You should block pages or sections that search engine crawlers do not need to access. The goal is to reduce crawl waste while keeping important pages open.

Common sections to block include:

Admin pages
Login pages
Cart pages
Checkout pages
Internal search result pages
Test folders
Staging folders
Some filtered URLs
Some parameter URLs
Backend scripts
Duplicate technical paths

A WordPress robots.txt file often includes a rule like:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

This blocks the WordPress admin area while allowing an AJAX file that some themes and plugins may need.
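The Allow line can carve an exception out of the blocked folder because RFC 9309 and Google use the most specific (longest) matching rule, with Allow preferred when two rules are equally specific. Here is a minimal sketch of that precedence, using plain prefix matching and ignoring wildcards. The function name and the test paths are hypothetical.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Longest-match precedence: the most specific matching rule wins.

    rules is a list of ("allow" | "disallow", path_prefix) pairs.
    Wildcards are deliberately not handled in this sketch.
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty rule value matches nothing
        is_allow = kind == "allow"
        # A longer prefix beats a shorter one; on a tie, allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


WORDPRESS_RULES = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
]

ajax_ok = is_allowed(WORDPRESS_RULES, "/wp-admin/admin-ajax.php")
options_ok = is_allowed(WORDPRESS_RULES, "/wp-admin/options.php")
post_ok = is_allowed(WORDPRESS_RULES, "/blog/hello-world/")

print(ajax_ok, options_ok, post_ok)
```

Parsers that simply take the first matching line in file order can give a different answer for the same file, which is one reason to test rules with the tools of the search engine you care about.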

For e-commerce websites, a robots.txt file may block internal search pages:

User-agent: *
Disallow: /search/
Disallow: /*?sort=
Disallow: /*?filter=

These rules should be used carefully. Some filters may be low value, while others may support SEO if they represent useful category landing pages. A technical SEO review should decide which URL patterns deserve crawling and which should be restricted.
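Wildcard rules are easy to misread, so it can help to test them against real URLs before deploying. The sketch below translates a rule into a regular expression, treating `*` as any run of characters and a trailing `$` as an end-of-URL anchor, which is how Google and RFC 9309 describe these characters. The helper names and sample URLs are hypothetical, and the sketch checks a single rule in isolation rather than a whole file.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path rule into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' where '*' appeared.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def rule_matches(rule: str, path_and_query: str) -> bool:
    # re.match anchors at the start, mirroring how rules match from the first character.
    return rule_to_regex(rule).match(path_and_query) is not None


first_param = rule_matches("/*?sort=", "/shoes?sort=price")
later_param = rule_matches("/*?sort=", "/shoes?color=red&sort=price")
broader_rule = rule_matches("/*sort=", "/shoes?color=red&sort=price")
search_page = rule_matches("/search/", "/search/?q=boots")
lookalike = rule_matches("/search/", "/searching-tips")

print(first_param, later_param, broader_rule, search_page, lookalike)
```

One detail this makes visible: a rule like `/*?sort=` only matches when `sort` is the first query parameter. A URL where `sort` comes after another parameter is not covered unless a broader pattern is used, and broader patterns carry a higher risk of blocking URLs you wanted crawled.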

A technical SEO audit can help avoid overblocking valuable pages.

What should you not block in robots.txt?

The biggest robots.txt mistakes usually come from blocking important resources or pages.

Avoid blocking:

Important service pages
Blog articles
Product pages
Product categories
Main landing pages
CSS files needed for rendering
JavaScript files needed for rendering
Images needed for page understanding
Canonical pages
Pages that contain noindex tags you want Google to see
Important multilingual URLs

Google needs access to key resources to understand how the page appears and functions. If CSS or JavaScript is blocked, Google may have trouble rendering the page correctly.

For example, this can be risky:

User-agent: *
Disallow: /assets/
Disallow: /scripts/
Disallow: /css/

If those folders contain files needed to render the page, blocking them may harm how search engines understand your site.

For content websites, blocking article folders by mistake can remove valuable pages from crawling. For service businesses, blocking landing pages can damage organic visibility. For multilingual websites, blocking one language folder can prevent that version from being crawled properly.

Robots.txt examples and what they mean

Robots.txt rules look simple, but one line can change how crawlers access a website.

Allow all crawlers

User-agent: *
Disallow:

This means all crawlers can access the website.

Block all crawlers from the entire site

User-agent: *
Disallow: /

This blocks all crawlers from crawling the whole site.

This rule is sometimes used on staging websites, but it can be dangerous if moved to the live website by mistake. It is one of the most serious robots.txt errors.

Block one folder

User-agent: *
Disallow: /admin/

This blocks crawling of the admin folder.

Block one crawler

User-agent: Googlebot
Disallow: /private/

This blocks Googlebot from crawling the private folder, while other crawlers may follow different rules.

Add sitemap location

User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml

The sitemap directive helps crawlers find the sitemap. It is useful, but submitting the sitemap in Search Console is still recommended.

For more on sitemap structure, see our guide to sitemaps and indexing; it is worth reviewing alongside robots.txt during a full technical review.

Common robots.txt mistakes

Robots.txt errors can be small in code and large in impact.

Blocking the whole website by accident

This is the classic migration mistake:

User-agent: *
Disallow: /

It may be correct on a staging site. It is dangerous on a live site.
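A simple guard in a deployment script can catch this before the file goes live. The sketch below is a coarse check under stated assumptions: it only looks for a literal `Disallow: /` inside a group that applies to all crawlers, and it does not weigh any Allow lines that might partly reopen the site. The function name and sample files are hypothetical.

```python
def blocks_entire_site(robots_txt: str) -> bool:
    """Return True if a group for all crawlers ('*') contains 'Disallow: /'."""
    agents: list[str] = []  # user-agents of the group currently being read
    in_rules = False        # True once the current group has a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False


STAGING = "User-agent: *\nDisallow: /\n"
LIVE = "User-agent: *\nDisallow: /admin/\n\nSitemap: https://example.com/sitemap.xml\n"
ONE_BOT_ONLY = "User-agent: Googlebot\nDisallow: /\n\nUser-agent: *\nDisallow:\n"

staging_blocked = blocks_entire_site(STAGING)
live_blocked = blocks_entire_site(LIVE)
one_bot_blocked = blocks_entire_site(ONE_BOT_ONLY)

print(staging_blocked, live_blocked, one_bot_blocked)
```

A release pipeline could run this against the file that is about to be published and stop the deployment when it returns True for the production host.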

Blocking pages that should rank

A rule like this may block all blog articles:

Disallow: /blog/

If your blog supports organic visits, this can reduce discovery and harm SEO performance.

Blocking CSS and JavaScript

Search engines need to render pages. Blocking important assets can make the website harder to understand.

Using robots.txt to hide private content

Robots.txt is public. Anyone can visit the file and see the paths listed. Sensitive content should be protected with authentication, not robots.txt.

Using robots.txt instead of noindex

If your goal is to keep a page out of search results, noindex is usually the correct method. Robots.txt controls crawler access, not guaranteed index removal.

Forgetting the sitemap directive

The sitemap directive is not mandatory, but it is useful. It helps crawlers discover your sitemap location quickly.

Leaving old rules after a redesign

Old blocked paths may not match the new website. After a redesign or CMS migration, robots.txt should always be reviewed.

Robots.txt and sitemap: how do they work together?

Robots.txt and sitemaps serve different purposes, but they support each other.

A sitemap lists important URLs that search engines should discover. Robots.txt tells crawlers which areas they can access. Together, they help shape crawling efficiency.

A good setup may look like this:

User-agent: *
Disallow: /wp-admin/
Disallow: /cart/
Disallow: /checkout/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://example.com/sitemap.xml

This file blocks low-value technical areas while pointing crawlers toward the sitemap.

However, the sitemap should not include URLs that robots.txt blocks. If your sitemap says a URL is important, while robots.txt says crawlers cannot access it, you are sending mixed signals.
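Mixed signals of this kind can be found by checking every sitemap URL against the robots.txt rules. This is a minimal sketch using only the Python standard library; the function name, the sample sitemap, and the sample rules are hypothetical, and the rules avoid wildcards because urllib.robotparser does not support them.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def blocked_sitemap_urls(robots_txt: str, sitemap_xml: str, agent: str = "Googlebot") -> list[str]:
    """Return sitemap URLs that the given crawler is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [url for url in urls if not parser.can_fetch(agent, url)]


ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /checkout/
"""

SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/cart/</loc></url>
  <url><loc>https://example.com/blog/robots-guide/</loc></url>
</urlset>
"""

conflicts = blocked_sitemap_urls(ROBOTS_TXT, SITEMAP_XML)
print(conflicts)
```

Any URL in the result should either be removed from the sitemap or unblocked in robots.txt, depending on whether the page is meant to be crawled.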

A strong SEO audit should compare robots.txt, sitemap URLs, canonical tags, noindex tags, and actual crawl behavior together.

How to check your robots.txt file

Checking robots.txt should be part of every technical SEO review.

Start with these steps:

Open your robots.txt file in the browser.
Confirm it returns a 200 status code.
Check whether important pages are blocked.
Check whether CSS and JavaScript are blocked.
Check whether the sitemap directive is present.
Compare rules with your sitemap.
Test important URLs in Google Search Console.
Review robots.txt after migrations or redesigns.
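Several of the steps above can be scripted. The sketch below assumes the file has already been fetched, so the status code and body are passed in as arguments and no network request is made. The function name and sample values are hypothetical, and the stdlib parser used here ignores wildcards, so treat the output as a first pass, not a replacement for testing in Google Search Console.

```python
from urllib.robotparser import RobotFileParser


def audit_robots(status: int, body: str, must_crawl: list[str], agent: str = "Googlebot") -> list[str]:
    """Return human-readable problems found in an already-fetched robots.txt."""
    if status != 200:
        return [f"robots.txt returned HTTP {status}, expected 200"]

    parser = RobotFileParser()
    parser.parse(body.splitlines())

    problems = []
    # Important pages and rendering assets should all be fetchable.
    for url in must_crawl:
        if not parser.can_fetch(agent, url):
            problems.append(f"blocked: {url}")
    # site_maps() returns None when the file has no Sitemap line.
    if not parser.site_maps():
        problems.append("no Sitemap directive found")
    return problems


BODY = """\
User-agent: *
Disallow: /assets/
Disallow: /admin/
"""

MUST_CRAWL = [
    "https://example.com/services/",
    "https://example.com/assets/site.css",  # CSS needed for rendering
]

problems = audit_robots(200, BODY, MUST_CRAWL)
missing_file = audit_robots(404, "", MUST_CRAWL)
print(problems)
print(missing_file)
```

In this sample the audit reports the blocked stylesheet and the missing Sitemap line, which correspond to two of the checks in the list above.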

Google Search Console can help you inspect specific URLs and understand whether Google can access them. This is especially useful when a page is discovered but not crawled, or when a page behaves differently from what the team expected.

A robots.txt review should also be repeated after:

Website launch
Redesign
CMS migration
Domain migration
Plugin changes
E-commerce filter changes
New language version launch
Technical SEO fixes
Staging to live deployment

Small robots.txt changes can affect large groups of URLs, so they should be tested carefully.

Robots.txt checklist for website owners

Use this checklist before making robots.txt changes.

Check	Why it matters
File exists at /robots.txt	Crawlers know where to find it
File returns 200 status code	Crawlers can access it properly
Important pages are allowed	Key SEO pages remain crawlable
Admin and private technical paths are blocked	Crawl waste is reduced
CSS and JavaScript are not blocked	Google can render pages correctly
Sitemap directive is included	Crawlers can find the sitemap
Noindex pages are not blocked if Google must see the tag	Indexing instructions stay visible
Staging rules are removed from live site	The live website is not blocked
Rules are reviewed after migrations	Old restrictions do not harm new URLs
Sitemap and robots.txt do not conflict	Signals stay consistent

This checklist is simple, but it can prevent serious crawling issues.

How robots.txt affects different website types

Robots.txt is not used the same way on every website. The right setup depends on the website structure.

Service websites

Service websites should usually keep main pages open for crawling. This includes the homepage, service pages, location pages, blog articles, and contact pages.

Robots.txt may block admin sections, login areas, and internal scripts. Content quality and on-page SEO remain more important than complex robots rules for most service websites.

E-commerce websites

E-commerce websites often need more careful robots.txt planning. Filters, sorting URLs, search pages, cart pages, and checkout pages can create many crawlable URLs with little SEO value.

The challenge is to block crawl waste without blocking useful category or product pages.

Content websites

Blogs and media websites should avoid blocking article sections unless there is a clear reason. Robots.txt may help block tag pages, internal search pages, or duplicate archives if they create crawl waste.

For content-heavy websites, robots.txt should support the site's article and topic structure, not be used to hide weak content problems.

Multilingual websites

Multilingual websites need extra care. Blocking a language folder by mistake can damage the visibility of that language version. Robots.txt should be checked with hreflang, canonicals, sitemap files, and internal links.

Can robots.txt improve rankings?

Robots.txt does not directly improve rankings. It helps search engines crawl the website more efficiently. That can support SEO, especially on larger websites, but it does not replace content quality, authority, internal linking, or technical health.

Robots.txt can support rankings indirectly when it:

Reduces crawl waste
Protects crawl attention for important pages
Prevents crawling of duplicate URL patterns
Supports cleaner technical structure
Works with sitemap and canonical signals
Keeps low-value sections away from crawlers

However, a robots.txt file cannot make poor content rank. It cannot fix thin pages, weak service copy, poor internal linking, or unclear search intent.

For that reason, robots.txt should be part of a wider SEO system that includes website content and landing pages, technical SEO, internal linking, and content maintenance.

Need a robots.txt and technical SEO review?

Robots.txt looks simple, but it can affect how search engines access your website. One wrong rule can block important pages. One missing rule can let crawlers waste time on low-value URLs. One misunderstood rule can create indexing confusion.

At Wordian, we help companies and teams review robots.txt, crawlability, indexing, and technical SEO.

We work with businesses that want clearer SEO foundations, cleaner crawling, and practical decisions before publishing more content or changing site structure.

FAQs about robots.txt

1. What is robots.txt in simple words?

Robots.txt is a text file that tells search engine crawlers which parts of a website they can or cannot crawl. It usually sits at the root of the domain, such as example.com/robots.txt. It helps manage crawler access, reduce crawl waste, and guide bots away from areas like admin pages, cart pages, or internal search results.

2. Does robots.txt stop pages from appearing in Google?

Robots.txt does not guarantee that a page will stay out of Google. It blocks crawling, not indexing. If Google finds the URL through other links, it may still show the URL without crawling its content. To keep a page out of search results, noindex or password protection is usually more appropriate.

3. What is the difference between robots.txt and noindex?

Robots.txt tells crawlers not to access a URL or folder. Noindex tells search engines not to include a page in search results. Google must be able to crawl the page to see the noindex tag. If robots.txt blocks the page, Google may not see the noindex instruction.

4. Where should the robots.txt file be placed?

The robots.txt file should be placed at the root of the website. For example, if the site is https://example.com, the robots.txt file should be available at https://example.com/robots.txt. Crawlers only request the file from the root, so a copy placed inside a subfolder is ignored.

5. Should I block admin pages in robots.txt?

Yes, admin areas are commonly blocked in robots.txt because search engine crawlers do not need to access them. For WordPress websites, /wp-admin/ is often blocked while /wp-admin/admin-ajax.php is allowed. Sensitive admin areas should still be protected by login security, because robots.txt is public.

6. Can a wrong robots.txt file hurt SEO?

Yes, a wrong robots.txt file can hurt SEO if it blocks important pages, resources, or entire website sections. For example, blocking /blog/ can stop crawlers from accessing articles. Blocking CSS or JavaScript can make it harder for Google to render pages correctly. Robots.txt should be tested before and after technical changes.

7. Should the sitemap be added to robots.txt?

Yes, adding the sitemap location to robots.txt is a useful practice. It helps crawlers find the sitemap quickly. A common line is Sitemap: https://example.com/sitemap.xml. This does not replace submitting the sitemap in Google Search Console, but it supports crawler discovery.

8. Do all search engines obey robots.txt?

Major search engines usually respect robots.txt rules, but robots.txt is based on crawler cooperation. Bad bots may ignore it. This is why robots.txt should not be used to protect private data, confidential files, or sensitive customer information. Use authentication and server-level security for anything private.

9. How often should robots.txt be reviewed?

Robots.txt should be reviewed after website launches, redesigns, migrations, CMS changes, plugin changes, and SEO audits. It should also be checked when Search Console shows crawling or indexing problems. For active websites, reviewing robots.txt every few months is a good technical SEO habit.

10. Is robots.txt important for small websites?

Yes, but small websites usually need a simple robots.txt file. The main goal is to avoid blocking important pages and to guide crawlers toward the sitemap. Large websites often need more advanced rules because they have more URL patterns, filters, duplicate paths, and crawl management issues.