What Is Robots.txt and How Does It Control Search Engine Crawlers?

A robots.txt file is a small text file that tells search engine crawlers which parts of your website they are allowed to access. It is one of the basic tools in technical SEO, but it is also one of the most misunderstood.
Many website owners think robots.txt controls indexing. That is where serious SEO mistakes begin. Robots.txt mainly controls crawling. It tells crawlers where they can and cannot go. It does not guarantee that a page will stay out of search results.
Google explains in its official robots.txt introduction that the file tells crawlers which URLs they can access, mainly to avoid overloading the site with requests. Google also explains that robots.txt is not the right method for keeping a page out of Google.
This matters because a wrong robots.txt rule can block important pages from being crawled. A missing rule can allow crawlers to waste time on low-value URLs. A misunderstood rule can create indexing confusion, especially during migrations, redesigns, or technical SEO fixes.
For that reason, review robots.txt as part of a wider technical SEO service or an SEO audit and crawling process. It is not a file that developers set once and forget.
What is a robots.txt file?
A robots.txt file is a plain text file placed at the root of a website. It gives instructions to search engine crawlers, also called bots or spiders, about which URLs or sections of the site they may crawl.
The file usually lives at:
example.com/robots.txt
For example, if your website is:
https://example.com
your robots.txt file should usually be available at:
https://example.com/robots.txt
A simple robots.txt file may look like this:
User-agent: *
Disallow: /admin/
Disallow: /cart/
Allow: /
Sitemap: https://example.com/sitemap.xml
This example tells all crawlers not to access the admin and cart sections, and it shows the location of the XML sitemap. The Allow: / line is optional, because any path that is not disallowed is crawlable by default.
Robots.txt is part of the Robots Exclusion Protocol. The protocol was formalized as RFC 9309, which defines how crawlers should interpret robots.txt rules. In SEO work, the file is mostly used to manage crawler access and reduce waste on pages that do not need crawling.
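To see how the sample file above is read in practice, the sketch below feeds it to Python's standard urllib.robotparser module. The two test URLs are made up for the example. This module checks rules as simple path prefixes in file order, so treat it as an approximation and not as a copy of how Googlebot evaluates rules.

```python
from urllib.robotparser import RobotFileParser

# The sample robots.txt from this section, split into lines.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/
Allow: /
Sitemap: https://example.com/sitemap.xml
""".splitlines()

parser = RobotFileParser()
parser.parse(ROBOTS_TXT)  # parsed in memory, no network request is made

# Hypothetical URLs, used only for illustration.
admin_allowed = parser.can_fetch("*", "https://example.com/admin/users")
product_allowed = parser.can_fetch("*", "https://example.com/products/blue-shoe")
sitemaps = parser.site_maps()  # requires Python 3.8 or newer

print(admin_allowed)    # False
print(product_allowed)  # True
print(sitemaps)         # ['https://example.com/sitemap.xml']
```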
How does robots.txt work?
Robots.txt works by matching crawler names with access rules. When a crawler visits a website, it usually checks the robots.txt file first. Then it reads the rules that apply to its user-agent name.
The most common directive is:
User-agent: *
The asterisk means the rule applies to all crawlers that do not have a more specific group of their own in the file.
A rule for Googlebot may look like this:
User-agent: Googlebot
Disallow: /private/
This tells Googlebot not to crawl the /private/ section.
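One detail is easy to miss: a crawler that finds a group addressed to it follows that group and ignores the * group. The sketch below shows this with Python's urllib.robotparser. The file content, the Bingbot name, and the paths are assumptions chosen for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file with a general group and a Googlebot-only group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: Googlebot
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(ROBOTS_TXT)


def allowed(agent, path):
    """Check one path on example.com for the given crawler name."""
    return parser.can_fetch(agent, "https://example.com" + path)


googlebot_private = allowed("Googlebot", "/private/report.html")
googlebot_admin = allowed("Googlebot", "/admin/")
bingbot_private = allowed("Bingbot", "/private/report.html")
bingbot_admin = allowed("Bingbot", "/admin/")

print(googlebot_private)  # False: blocked by the Googlebot group
print(googlebot_admin)    # True: the * group does not apply to Googlebot
print(bingbot_private)    # True: Bingbot falls back to the * group
print(bingbot_admin)      # False: blocked by the * group
```

If Googlebot should also stay out of /admin/, that rule has to be repeated inside the Googlebot group.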
The main directives are simple:
| Directive | What it does |
| --- | --- |
| User-agent | Defines which crawler the rules apply to |
| Disallow | Blocks crawling of a path |
| Allow | Allows crawling of a path, often inside a blocked section |
| Sitemap | Shows crawlers the location of the XML sitemap |
Google’s documentation on how Google interprets robots.txt explains file location, supported rules, and how Google reads valid and invalid lines.
The key point is that robots.txt is a crawling instruction. It does not add password protection, does not hide sensitive data, and does not automatically remove URLs from search results.
Robots.txt vs noindex: what is the difference?
The difference between robots.txt and noindex is one of the most important technical SEO concepts.
Robots.txt controls crawling. Noindex controls indexing.
If you block a page in robots.txt, Google generally will not crawl it. If Google cannot crawl the page, it cannot see a noindex tag on that page. That is why blocking a URL in robots.txt while also using noindex backfires: the noindex instruction may never be read, and the URL can stay in search results.
Google’s guide to blocking indexing with noindex explains that Google must be able to crawl a page to see the noindex rule. If robots.txt blocks access, Google may not see the tag.
Here is the practical difference.
| Goal | Use robots.txt? | Use noindex? |
| --- | --- | --- |
| Stop crawlers from accessing a section | Yes | No |
| Keep a page out of Google results | No | Yes |
| Reduce crawling of low-value URL patterns | Yes | Sometimes |
| Hide private information | No | No, use authentication |
| Remove duplicate thin pages from index | Sometimes | Usually yes |
| Block internal search results from crawling | Often yes | Sometimes |
| Prevent checkout or cart crawling | Yes | Usually no |
If you want a page out of search results, use noindex or protect the page with login access. If you want to stop crawlers from spending time on a section, robots.txt may help.
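The conflict between a robots.txt block and a noindex tag can be checked with a short script. The following is a minimal sketch, assuming the page HTML is already available as a string. The function name noindex_is_hidden, the /thank-you/ path, and the sample HTML are hypothetical. It only looks at the robots meta tag, not at the X-Robots-Tag HTTP header or crawler-specific meta tags.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives += [d.strip().lower() for d in content.split(",")]


def noindex_is_hidden(robots_lines, url, html, agent="Googlebot"):
    """True when the page carries noindex but robots.txt blocks the crawl."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    finder = RobotsMetaFinder()
    finder.feed(html)
    has_noindex = "noindex" in finder.directives
    return has_noindex and not parser.can_fetch(agent, url)


# Hypothetical page and rules, for illustration only.
PAGE = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
URL = "https://example.com/thank-you/"

hidden = noindex_is_hidden(["User-agent: *", "Disallow: /thank-you/"], URL, PAGE)
visible = noindex_is_hidden(["User-agent: *", "Disallow:"], URL, PAGE)

print(hidden)   # True: the tag exists, but the crawler is not allowed to read it
print(visible)  # False: the crawler can fetch the page and see the tag
```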
Why does robots.txt matter for SEO?
Robots.txt matters for SEO because crawling is the first step before indexing and ranking. If search engines cannot crawl important pages, they cannot read the content on them, and those pages may not be indexed or ranked properly. If crawlers spend too much time on low-value areas, important sections may receive less attention.
A clean robots.txt file can help with:
- Crawl management
- Technical SEO control
- Reducing crawler access to low-value sections
- Protecting crawl budget on large websites
- Supporting cleaner indexation
- Guiding crawlers toward sitemap files
- Preventing accidental crawling of internal tools
- Managing staging or parameter-heavy sections
For small websites, robots.txt may be simple. For large e-commerce websites, multilingual websites, news websites, and marketplaces, robots.txt can become more strategic.
For example, an online store may need to prevent crawlers from wasting time on cart URLs, internal search results, login pages, or endless filtered URLs. At the same time, it must allow crawlers to access product pages, category pages, important images, and key resources.
This is why e-commerce SEO should include robots.txt review, especially when product filters and parameter URLs create many crawlable variations.
What should you block in robots.txt?
You should block pages or sections that search engine crawlers do not need to access. The goal is to reduce crawl waste while keeping important pages open.
Common sections to block include:
- Admin pages
- Login pages
- Cart pages
- Checkout pages
- Internal search result pages
- Test folders
- Staging folders
- Some filtered URLs
- Some parameter URLs
- Backend scripts
- Duplicate technical paths
A WordPress robots.txt file often includes a rule like:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
This blocks the WordPress admin area while allowing an AJAX file that some themes and plugins may need.
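The Allow line works here because RFC 9309 and Google's documentation both describe the same precedence: the most specific (longest) matching rule wins, and an Allow wins a tie. The sketch below is a simplified illustration of that idea for plain path prefixes only, without wildcards. The function name is_allowed and the test paths are assumptions made for the example.

```python
def is_allowed(path, rules):
    """Longest-match evaluation for plain prefix rules (no wildcards).

    rules is a list of (directive, path_prefix) pairs,
    for example ("disallow", "/wp-admin/").
    """
    best = None  # (prefix length, is_allow) of the strongest match so far
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            candidate = (len(prefix), directive.lower() == "allow")
            # Longer prefixes win; on equal length, allow (True) beats disallow.
            if best is None or candidate > best:
                best = candidate
    # A path that matches no rule is crawlable.
    return True if best is None else best[1]


WORDPRESS_RULES = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
]

print(is_allowed("/wp-admin/admin-ajax.php", WORDPRESS_RULES))  # True
print(is_allowed("/wp-admin/options.php", WORDPRESS_RULES))     # False
print(is_allowed("/blog/hello-world/", WORDPRESS_RULES))        # True
```

Some parsers use the first matching rule in file order instead, so it is worth testing overlapping Allow and Disallow rules with the tool each search engine provides.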
For e-commerce websites, a robots.txt file may block internal search pages:
User-agent: *
Disallow: /search/
Disallow: /*?sort=
Disallow: /*?filter=
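These rules should be used carefully. As written, /*?sort= and /*?filter= only match URLs where the parameter comes directly after the question mark. A URL such as /shoes?color=red&sort=price is not matched and would need an extra pattern like /*&sort=. Some filters may be low value, while others may support SEO if they represent useful category landing pages. A technical SEO review should decide which URL patterns deserve crawling and which should be restricted.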
A technical SEO audit can help avoid overblocking valuable pages.
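Wildcard rules like the ones above are easier to judge when you test them against real URLs. Google's documentation describes * as matching any sequence of characters and $ as marking the end of the URL. The sketch below is a small hand-written matcher built on that description, not an official implementation. The function names and sample URLs are assumptions.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal parts and join them with '.*' where '*' appeared.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path_and_query):
    """True when the rule pattern matches from the start of the path."""
    return pattern_to_regex(pattern).match(path_and_query) is not None


print(matches("/*?sort=", "/shoes?sort=price"))            # True
print(matches("/*?sort=", "/shoes?color=red&sort=price"))  # False
print(matches("/*&sort=", "/shoes?color=red&sort=price"))  # True
print(matches("/search/", "/search/?q=boots"))             # True
```

The second line shows why parameter position matters: a rule written for ?sort= misses URLs where sort is not the first parameter.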
What should you not block in robots.txt?
The biggest robots.txt mistakes usually come from blocking important resources or pages.
Avoid blocking:
- Important service pages
- Blog articles
- Product pages
- Product categories
- Main landing pages
- CSS files needed for rendering
- JavaScript files needed for rendering
- Images needed for page understanding
- Canonical pages
- Pages that contain noindex tags you want Google to see
- Important multilingual URLs
Google needs access to key resources to understand how the page appears and functions. If CSS or JavaScript is blocked, Google may have trouble rendering the page correctly.
For example, this can be risky:
User-agent: *
Disallow: /assets/
Disallow: /scripts/
Disallow: /css/
If those folders contain files needed to render the page, blocking them may harm how search engines understand your site.
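A quick way to spot this problem is to list the files a page loads and check each one against the rules. The sketch below uses Python's urllib.robotparser for that. The function name blocked_resources and the resource URLs are hypothetical, and the check covers plain path rules only.

```python
from urllib.robotparser import RobotFileParser


def blocked_resources(robots_lines, resource_urls, agent="Googlebot"):
    """Return the resource URLs that robots.txt stops the agent from fetching."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return [url for url in resource_urls if not parser.can_fetch(agent, url)]


# The risky rules from the example above.
RISKY_RULES = [
    "User-agent: *",
    "Disallow: /assets/",
    "Disallow: /scripts/",
    "Disallow: /css/",
]

# Hypothetical files that a page template might load.
PAGE_RESOURCES = [
    "https://example.com/css/main.css",
    "https://example.com/scripts/app.js",
    "https://example.com/images/hero.jpg",
]

blocked = blocked_resources(RISKY_RULES, PAGE_RESOURCES)
print(blocked)
# ['https://example.com/css/main.css', 'https://example.com/scripts/app.js']
```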
For content websites, blocking article folders by mistake can remove valuable pages from crawling. For service businesses, blocking landing pages can damage organic visibility. For multilingual websites, blocking one language folder can prevent that version from being crawled properly.
Robots.txt examples and what they mean
Robots.txt rules look simple, but one line can change how crawlers access a website.
Allow all crawlers
User-agent: *
Disallow:
This means all crawlers can access the website.
Block all crawlers from the entire site
User-agent: *
Disallow: /
This blocks all crawlers from crawling the whole site.
This rule is sometimes used on staging websites, but it can be dangerous if moved to the live website by mistake. It is one of the most serious robots.txt errors.
Block one folder
User-agent: *
Disallow: /admin/
This blocks crawling of the admin folder.
Block one crawler
User-agent: Googlebot
Disallow: /private/
This blocks Googlebot from crawling the private folder, while other crawlers may follow different rules.
Add sitemap location
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
The sitemap directive helps crawlers find the sitemap. It is useful, but submitting the sitemap in Search Console is still recommended.
For more on sitemap structure, see our guide to sitemaps and indexing. It is worth reading alongside your robots.txt rules during a full technical review.
Common robots.txt mistakes
Robots.txt errors can be small in code and large in impact.
Blocking the whole website by accident
This is the classic migration mistake:
User-agent: *
Disallow: /
It may be correct on a staging site. It is dangerous on a live site.
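A simple guard in the deployment process can catch this before it reaches the live site. The following is a minimal sketch, assuming the deploy script can read the robots.txt content as text. The function name blocks_whole_site is hypothetical, and it uses a blocked homepage as a practical signal for a site-wide block.

```python
from urllib.robotparser import RobotFileParser


def blocks_whole_site(robots_text, agent="Googlebot"):
    """True when the homepage itself is disallowed for the given agent."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return not parser.can_fetch(agent, "https://example.com/")


STAGING_RULES = "User-agent: *\nDisallow: /\n"
LIVE_RULES = "User-agent: *\nDisallow: /admin/\n"

print(blocks_whole_site(STAGING_RULES))  # True
print(blocks_whole_site(LIVE_RULES))     # False
```

A deploy script could stop the release when this returns True for the production file.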
Blocking pages that should rank
A rule like this may block all blog articles:
Disallow: /blog/
If your blog supports organic visits, this can reduce discovery and harm SEO performance.
Blocking CSS and JavaScript
Search engines need to render pages. Blocking important assets can make the website harder to understand.
Using robots.txt to hide private content
Robots.txt is public. Anyone can visit the file and see the paths listed. Sensitive content should be protected with authentication, not robots.txt.
Using robots.txt instead of noindex
If your goal is to keep a page out of search results, noindex is usually the correct method. Robots.txt controls crawler access, not guaranteed index removal.
Forgetting the sitemap directive
The sitemap directive is not mandatory, but it is useful. It helps crawlers discover your sitemap location quickly.
Leaving old rules after a redesign
Old blocked paths may not match the new website. After a redesign or CMS migration, robots.txt should always be reviewed.
Robots.txt and sitemap: how do they work together?
Robots.txt and sitemaps serve different purposes, but they support each other.
A sitemap lists important URLs that search engines should discover. Robots.txt tells crawlers which areas they can access. Together, they help shape crawling efficiency.
A good setup may look like this:
User-agent: *
Disallow: /wp-admin/
Disallow: /cart/
Disallow: /checkout/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
This file blocks low-value technical areas while pointing crawlers toward the sitemap.
However, the sitemap should not include URLs that robots.txt blocks. If your sitemap says a URL is important, while robots.txt says crawlers cannot access it, you are sending mixed signals.
A strong SEO audit should compare robots.txt, sitemap URLs, canonical tags, noindex tags, and actual crawl behavior together.
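The sitemap and robots.txt comparison is easy to automate. The sketch below parses a sitemap with Python's xml.etree.ElementTree and reports every listed URL that the robots.txt rules block. The function name sitemap_urls_blocked and the sample data are assumptions, and the rule check covers plain path prefixes only.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls_blocked(robots_lines, sitemap_xml, agent="Googlebot"):
    """Return sitemap URLs that the given agent is not allowed to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    root = ET.fromstring(sitemap_xml)
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    urls = [loc.text.strip() for loc in locs if loc.text]
    return [url for url in urls if not parser.can_fetch(agent, url)]


# Hypothetical rules and sitemap, for illustration only.
ROBOTS_LINES = ["User-agent: *", "Disallow: /cart/", "Disallow: /checkout/"]
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/blue-shoe</loc></url>
  <url><loc>https://example.com/cart/</loc></url>
</urlset>
"""

conflicts = sitemap_urls_blocked(ROBOTS_LINES, SITEMAP_XML)
print(conflicts)  # ['https://example.com/cart/']
```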
How to check your robots.txt file
Checking robots.txt should be part of every technical SEO review.
Start with these steps:
- Open your robots.txt file in the browser.
- Confirm it returns a 200 status code.
- Check whether important pages are blocked.
- Check whether CSS and JavaScript are blocked.
- Check whether the sitemap directive is present.
- Compare rules with your sitemap.
- Test important URLs in Google Search Console.
- Review robots.txt after migrations or redesigns.
Google Search Console can help you inspect specific URLs and understand whether Google can access them. This is especially useful when a page is discovered but not crawled, or when a page behaves differently from what the team expected.
A robots.txt review should also be repeated after:
- Website launch
- Redesign
- CMS migration
- Domain migration
- Plugin changes
- E-commerce filter changes
- New language version launch
- Technical SEO fixes
- Staging to live deployment
Small robots.txt changes can affect large groups of URLs, so they should be tested carefully.
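One way to test a change is to compare the old and new files against a sample of important URLs before publishing. The sketch below is a rough illustration using Python's urllib.robotparser. The function name crawl_changes, the rule sets, and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def crawl_changes(old_text, new_text, urls, agent="Googlebot"):
    """Map each URL whose crawlability changes to (was_allowed, is_allowed)."""
    old, new = RobotFileParser(), RobotFileParser()
    old.parse(old_text.splitlines())
    new.parse(new_text.splitlines())
    changes = {}
    for url in urls:
        before = old.can_fetch(agent, url)
        after = new.can_fetch(agent, url)
        if before != after:
            changes[url] = (before, after)
    return changes


OLD_RULES = "User-agent: *\nDisallow: /admin/\n"
NEW_RULES = "User-agent: *\nDisallow: /admin/\nDisallow: /blog/\n"

SAMPLE_URLS = [
    "https://example.com/blog/robots-guide/",
    "https://example.com/admin/",
    "https://example.com/services/",
]

changes = crawl_changes(OLD_RULES, NEW_RULES, SAMPLE_URLS)
print(changes)
# {'https://example.com/blog/robots-guide/': (True, False)}
```

Here the new file would stop crawlers from reaching a blog article that was crawlable before, which is the kind of side effect worth catching before deployment.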
Robots.txt checklist for website owners
Use this checklist before making robots.txt changes.
| Check | Why it matters |
| --- | --- |
| File exists at /robots.txt | Crawlers know where to find it |
| File returns 200 status code | Crawlers can access it properly |
| Important pages are allowed | Key SEO pages remain crawlable |
| Admin and private technical paths are blocked | Crawl waste is reduced |
| CSS and JavaScript are not blocked | Google can render pages correctly |
| Sitemap directive is included | Crawlers can find the sitemap |
| Noindex pages are not blocked if Google must see the tag | Indexing instructions stay visible |
| Staging rules are removed from live site | The live website is not blocked |
| Rules are reviewed after migrations | Old restrictions do not harm new URLs |
| Sitemap and robots.txt do not conflict | Signals stay consistent |
This checklist is simple, but it can prevent serious crawling issues.
How robots.txt affects different website types
Robots.txt is not used the same way on every website. The right setup depends on the website structure.
Service websites
Service websites should usually keep main pages open for crawling. This includes the homepage, service pages, location pages, blog articles, and contact pages.
Robots.txt may block admin sections, login areas, and internal scripts. Content quality and on-page SEO remain more important than complex robots rules for most service websites.
E-commerce websites
E-commerce websites often need more careful robots.txt planning. Filters, sorting URLs, search pages, cart pages, and checkout pages can create many crawlable URLs with little SEO value.
The challenge is to block crawl waste without blocking useful category or product pages.
Content websites
Blogs and media websites should avoid blocking article sections unless there is a clear reason. Robots.txt may help block tag pages, internal search pages, or duplicate archives if they create crawl waste.
On content-heavy websites, robots.txt should support article writing and topic structure. It should not be used to hide weak content.
Multilingual websites
Multilingual websites need extra care. Blocking a language folder by mistake can damage the visibility of that language version. Robots.txt should be checked with hreflang, canonicals, sitemap files, and internal links.
Can robots.txt improve rankings?
Robots.txt does not directly improve rankings. It helps search engines crawl the website more efficiently. That can support SEO, especially on larger websites, but it does not replace content quality, authority, internal linking, or technical health.
Robots.txt can support rankings indirectly when it:
- Reduces crawl waste
- Protects crawl attention for important pages
- Prevents crawling of duplicate URL patterns
- Supports cleaner technical structure
- Works with sitemap and canonical signals
- Keeps low-value sections away from crawlers
However, a robots.txt file cannot make poor content rank. It cannot fix thin pages, weak service copy, poor internal linking, or unclear search intent.
For that reason, robots.txt should be part of a wider SEO system that includes website content and landing pages, technical SEO, internal linking, and content maintenance.
Need a robots.txt and technical SEO review?
Robots.txt looks simple, but it can affect how search engines access your website. One wrong rule can block important pages. One missing rule can let crawlers waste time on low-value URLs. One misunderstood rule can create indexing confusion.
At Wordian, we help companies and teams review robots.txt, crawlability, indexing, and technical SEO through:
- Technical SEO services
- SEO audit and crawling
- On-page SEO
- SEO consultation sessions
- SEO training for teams
- Website content and landing page writing
We work with businesses that want clearer SEO foundations, cleaner crawling, and practical decisions before publishing more content or changing site structure.
FAQs about robots.txt
1. What is robots.txt in simple words?
Robots.txt is a text file that tells search engine crawlers which parts of a website they can or cannot crawl. It usually sits at the root of the domain, such as example.com/robots.txt. It helps manage crawler access, reduce crawl waste, and guide bots away from areas like admin pages, cart pages, or internal search results.
2. Does robots.txt stop pages from appearing in Google?
Robots.txt does not guarantee that a page will stay out of Google. It blocks crawling, not indexing. If Google finds the URL through other links, it may still show the URL without crawling its content. To keep a page out of search results, noindex or password protection is usually more appropriate.
3. What is the difference between robots.txt and noindex?
Robots.txt tells crawlers not to access a URL or folder. Noindex tells search engines not to include a page in search results. Google must be able to crawl the page to see the noindex tag. If robots.txt blocks the page, Google may not see the noindex instruction.
4. Where should the robots.txt file be placed?
The robots.txt file should be placed at the root of the website. For example, if the site is https://example.com, the robots.txt file should be available at https://example.com/robots.txt. Placing it inside a subfolder will not work for the whole website.
5. Should I block admin pages in robots.txt?
Yes, admin areas are commonly blocked in robots.txt because search engine crawlers do not need to access them. For WordPress websites, /wp-admin/ is often blocked while /wp-admin/admin-ajax.php is allowed. Sensitive admin areas should still be protected by login security, because robots.txt is public.
6. Can a wrong robots.txt file hurt SEO?
Yes, a wrong robots.txt file can hurt SEO if it blocks important pages, resources, or entire website sections. For example, blocking /blog/ can stop crawlers from accessing articles. Blocking CSS or JavaScript can make it harder for Google to render pages correctly. Robots.txt should be tested before and after technical changes.
7. Should the sitemap be added to robots.txt?
Yes, adding the sitemap location to robots.txt is a useful practice. It helps crawlers find the sitemap quickly. A common line is Sitemap: https://example.com/sitemap.xml. This does not replace submitting the sitemap in Google Search Console, but it supports crawler discovery.
8. Do all search engines obey robots.txt?
Major search engines usually respect robots.txt rules, but robots.txt is based on crawler cooperation. Bad bots may ignore it. This is why robots.txt should not be used to protect private data, confidential files, or sensitive customer information. Use authentication and server-level security for anything private.
9. How often should robots.txt be reviewed?
Robots.txt should be reviewed after website launches, redesigns, migrations, CMS changes, plugin changes, and SEO audits. It should also be checked when Search Console shows crawling or indexing problems. For active websites, reviewing robots.txt every few months is a good technical SEO habit.
10. Is robots.txt important for small websites?
Yes, but small websites usually need a simple robots.txt file. The main goal is to avoid blocking important pages and to guide crawlers toward the sitemap. Large websites often need more advanced rules because they have more URL patterns, filters, duplicate paths, and crawl management issues.