
What is Robots.txt? A Beginner’s Guide to Understanding How and Why it is Written

Robots.txt files are guidelines written by the creator of a website that advise search engine crawlers which sections of the site they should crawl and which they should leave alone. Not all bots (crawling is carried out by bots) pay heed to these guidelines.

However, when a good bot crawls your website, it reads your guidelines before accessing the main content and follows them diligently. Yes, bots are much like us: some of them are programmed to follow instructions (for example Googlebot), while some do not play by the rules.

robots.txt is a file containing directives written by the designer of the website. As the extension suggests, it is a plain text file, most commonly written in a simple editor such as Notepad, and uploaded to the root of the domain or the root of the sub-domain. Crawling bots read this file before accessing the main site, but human visitors do not normally see it. Humans can view it by specifically requesting it, which is done by adding “/robots.txt” to the end of the domain’s URL. Try looking up the robots.txt files of some websites; they often have humorous messages and cool illustrations! It is also a really cool trick to show off!
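
To give you an idea of what the file looks like, here is a minimal, hypothetical robots.txt. The paths and the sitemap URL are placeholders, not rules your site needs to copy:

# Rules for every crawler
User-agent: *
Disallow: /admin/
Allow: /

# Where the full list of important URLs lives
Sitemap: http://yourdomain.com/sitemap.xml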

Fun note: Try looking up the robots.txt file of YouTube and see what it says. 

Components of a robots.txt File

robots.txt files are guidelines, so naturally they contain a set of directives. The main directives are:

  • User-agent directive
  • Disallow directive
  • Allow directive
  • Sitemap directive
  • Crawl-delay directive

User-agent Directive

A user agent is the name by which any program or person on the internet identifies itself. Crawling bots have names too (e.g. Bingbot, Googlebot-Image, Baiduspider). This directive lets you issue specific sets of instructions to specific crawlers. It is an effective way of letting particular crawlers know that you want them to crawl your site while disallowing others, and you can make that choice depending on the nature of your content and your target audience. The user agent names of search engine crawlers can be found in each search engine’s official documentation.
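
As an illustration, a file can contain a separate group of rules for each crawler. The paths below are hypothetical placeholders:

# Rules for Googlebot only
User-agent: Googlebot
Disallow: /drafts/

# Rules for Bingbot only: block the whole site
User-agent: Bingbot
Disallow: /

# Rules for every other crawler
User-agent: *
Disallow: /admin/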

Disallow Directive

This directive asks bots not to crawl specific sections or pages of a website. Let us look at an example:

User-agent: *
Disallow: /admin

The command written above will block all URLs whose path starts with “/admin”. The path of a URL begins at the first “/” after the domain. For example, it blocks all of the following URLs:

http://yourdomain.com/admin
http://yourdomain.com/admin?test=0
http://yourdomain.com/admin/somethings
http://yourdomain.com/admin-example-page-keep-them-out-of-search-results

When a Disallow directive is left empty (Disallow: with nothing after it), nothing is blocked, and search engine crawlers are invited to crawl the entire site.
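
For instance, this pair of lines, which appears in many real robots.txt files, explicitly tells every crawler that the whole site is open:

User-agent: *
Disallow: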

Allow Directive

This directive informs crawlers which sections of the website they are permitted to crawl, and it is typically used to carve out exceptions within an otherwise disallowed directory:

User-agent: *
Allow: /some-directory/important-page
Disallow: /some-directory/

The command above will block the following URLs:

http://yourdomain.com/some-directory/
http://yourdomain.com/some-directory/everything-blocked-but

But it will not block any of the following:

http://yourdomain.com/some-directory/important-page
http://yourdomain.com/some-directory/important-page-its-someting
http://yourdomain.com/some-directory/important-page/anypage-here

Sitemap Directive

As the name suggests, this directive acts as a guide that helps a bot easily find all the important URLs on your site. Have you seen that some Instagram users have a link tree pinned to their bio? It guides you to the specific links they want you to view; similarly, the sitemap helps a bot locate all the links on your website so that it can easily access and index them.

Example:

User-agent: *
Sitemap: http://yourdomain.com/sitemap.xml 

Crawl-delay Directive

The Crawl-delay directive specifies the number of seconds a crawler should wait between requests to the website. This ensures that your website does not get overwhelmed by the number of requests a crawler sends.

User-agent: *
Crawl-delay: 2

Make a note: not all web crawlers support this directive. Google, for instance, uses a different method to control its crawling rate, which you can manage in Google Search Console.

The Wildcard Asterisk (*) in robots.txt

User-agent: *

If the User-agent directive is followed by an asterisk, it means that whatever instructions come after it apply to all crawlers.

The asterisk can also be used in Allow and Disallow directives to match a pattern of URLs that crawlers should or should not be allowed to crawl.

For instance the directive:

Disallow: /names/*/details

will block the following URLs:

http://yourdomain.com/names/shoeba/details
http://yourdomain.com/names/agni/account/details
http://yourdomain.com/names/uma/details-about-something
http://yourdomain.com/names/abhay/search?q=/details

End-of-string Operator: The Dollar Sign ($)

The dollar sign is used to indicate the end of a URL. You can use it when you want to block a specific file type or extension.

User-agent: *
Disallow: /junk-page$
Disallow: /*.pdf$

This will block any URL ending in “.pdf”, as well as the URL whose path ends exactly with “junk-page”.

But it will not block any of the following:

http://yourdomain.com/junk-page-and-how-to-avoid-creating-them
http://yourdomain.com/junk-page/
http://yourdomain.com/junk-page?a=b

How do you block all the URLs that contain a dollar sign? 

Use this command:

Disallow: /*$*

Remember, if you forget the asterisk at the end and write this instead:

Disallow: /*$

It will block everything on your website. 

How to Test a robots.txt File?

Web-based tools like Google Search Console and Bing Webmaster Tools let you enter the URL whose robots.txt file you want to test and see how crawlers interpret the file. You have to specify the User-agent whose behaviour towards your file you want to check. For instance, if you choose Googlebot as the User-agent, the tool will tell you how Googlebot specifically interprets your file and which URLs it considers allowed or disallowed.

If you have the technical skills, you can also use Google’s open-source robots.txt library to test the file locally on your computer.
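
If setting up Google’s library feels like too much, here is a rough sketch of a purely local check using Python’s standard library instead. Note that urllib.robotparser implements the classic exclusion rules rather than every nuance of Google’s parser, and the rules and URLs below are hypothetical placeholders:

from urllib import robotparser

# A hypothetical set of robots.txt rules, supplied as lines of text
rules = [
    "User-agent: *",
    "Disallow: /admin",
    "Allow: /some-directory/important-page",
    "Disallow: /some-directory/",
]

parser = robotparser.RobotFileParser()
parser.parse(rules)  # parse the rules locally, nothing is fetched over the network

# Ask whether a given crawler may fetch a given URL
print(parser.can_fetch("Googlebot", "http://yourdomain.com/admin"))                         # False
print(parser.can_fetch("Googlebot", "http://yourdomain.com/some-directory/important-page")) # True
print(parser.can_fetch("Googlebot", "http://yourdomain.com/blog/any-page"))                 # True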

The Two Protocols Used in Writing a robots.txt File

A protocol is a standardized language that computers use to communicate with one another. A robots.txt file primarily follows two protocols:

  • Robots Exclusion Protocol (REP)
  • Sitemaps Protocol

Robots Exclusion Protocol (REP) – the standardized language (the Allow and Disallow directives) used to ask specific bots, or all bots, to crawl or not to crawl certain sections of the website.

Sitemaps Protocol – the standardized way of pointing crawlers to the pages you want crawled so that they do not overlook any of them. A sitemap also helps bots locate all the important URLs on your site by listing them in one place. This way, they do not have to jump from one page to another looking for URLs, which significantly reduces the number of requests they make to your site.
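
In practice, a single robots.txt file usually combines the two protocols: REP rules at the top and one or more Sitemap lines below them. Here is a hypothetical sketch; the paths and sitemap URLs are placeholders:

# Robots Exclusion Protocol rules
User-agent: *
Disallow: /admin/
Disallow: /cart/

# Sitemaps Protocol: point crawlers at the full list of important URLs
Sitemap: http://yourdomain.com/sitemap.xml
Sitemap: http://yourdomain.com/blog-sitemap.xml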

Common Mistakes to be Aware of when Composing a robots.txt File

  • The robots.txt file must be uploaded to the root of the site; if it is placed anywhere else, crawlers will not read it. If you do not have access to the root of the site, you can block pages using robots meta tags or the X-Robots-Tag response header (set, for example, in an .htaccess file or its equivalent).
  • A robots.txt file applies only to the specific domain it is placed on. Every subdomain needs its own file.
  • robots.txt is case sensitive where it matters: the file itself must be named “robots.txt” in lowercase, and the paths in Allow and Disallow rules must match the case of your URLs, so “/Admin” and “/admin” are treated as different paths.
  • Not using the “User-agent” directive, which tells crawlers who the instructions are for, may lead them to ignore your file entirely.

How Does robots.txt Improve the Health of Your Site?

Whenever you upload new information to the web, bots are attracted to it and crawl it for information. Without bots crawling your pages, your content will be neither readily accessible to internet users nor properly indexed. Here are some benefits of having a robots.txt file.

Keeping Irrelevant Information from Being Crawled

A website does not only contain information that pertains to users; it also includes technical sections that are not supposed to show up in search results. Disallowing bots from crawling these sections helps prevent unnecessary information from being presented to users or from negatively affecting how the rest of the site is indexed.
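
For example, a site might keep its back-office pages, checkout steps and internal search results out of crawlers’ way like this (the paths are hypothetical placeholders):

User-agent: *
# Back-office pages
Disallow: /admin/
# Checkout steps
Disallow: /cart/
# Internal search result pages
Disallow: /search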

Managing Crawl Traffic

The more crawlers hit your website, the slower it can get. Carefully choosing which bots you want on your site can therefore keep it from being slowed down unnecessarily. You can also use robots.txt to tell crawlers to ignore near-duplicate pages on your site, such as the same content reached through different URL parameters.
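
As a sketch, wildcard rules are often used to keep crawlers away from sorted or filtered versions of the same page. The parameter names below are hypothetical:

User-agent: *
# Block sorted and filtered duplicates of the same listings
Disallow: /*?sort=
Disallow: /*?filter=
# Block URLs that only differ by a session identifier
Disallow: /*sessionid=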

What Are the Limitations of robots.txt?

There are numerous limitations to robots.txt, especially in today’s age of generative AI, where information translates to absolute power and mere suggestions are not effective tools for keeping it from being exploited.

They are Not Enforceable Instructions

This article has described what robots.txt “might” be able to do, and “might” is the most important word for understanding its function. The guidelines in this file are merely polite requests; they are not in any way enforceable, and there are no penalties in place for bypassing them.

They are not Security Mechanisms and can Expose Sensitive Data Instead

A robots.txt file should not be considered a mechanism that protects any of your site’s information in any way. Well-behaved bots like Bingbot and Googlebot may heed it in order to maintain a good relationship with website administrators, while many other bots simply ignore it.

If you list your sensitive areas in your directives, chances are that malicious bots will specifically pursue them and try to scrape data from them. By listing your sensitive or poorly protected sections, you are basically handing them a roadmap to the treasure and asking them not to follow it because you would prefer they did not.

Other Methods of Communicating with Web Crawlers

Apart from robots.txt, the robots meta tag and the X-Robots-Tag HTTP header can also be used to issue instructions to crawlers. Unlike robots.txt, which works site-wide, these control crawler behaviour on a per-page or per-response basis.
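
For instance, to keep a single page out of search results you could add a robots meta tag to its HTML head, or send the equivalent HTTP response header for non-HTML files such as PDFs. This is a generic illustration rather than configuration for any particular server:

In the page’s HTML <head>:

<meta name="robots" content="noindex, nofollow">

As an HTTP response header:

X-Robots-Tag: noindex, nofollow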

Conclusion

If you have created a website or are planning to create one, it is important to know what improves your site’s health. robots.txt is certainly a helpful tool, but creators very often make the mistake of conflating its guidelines with security mechanisms. It is advisable to have the text file at the root of your domain or sub-domain, because every little bit of cleaning up excess crawler traffic helps.

Just make sure you do not expose your sensitive and unprotected content to malicious bots. Establish your boundaries robustly without oversharing.

Happy creation!

FAQs

What is a robots.txt file?

This is a text file uploaded to the root of your website’s domain or sub-domain, which suggests to crawlers which parts of the site to visit and which to ignore.

Can crawlers ignore the guidelines in robots.txt?

Yes, they can be ignored, and there are no penalties for ignoring them.

Is robots.txt a security mechanism?

No, robots.txt is not a security mechanism; its usage does not ensure that a website’s data will remain protected.

What is the Sitemap directive?

Sitemap is a directive in robots.txt that lists all the important URLs on your site which you want crawlers to crawl and index. This makes it easier for crawlers to locate all of them.

Is a robots.txt file mandatory?

While it is not mandatory, it is advisable to have a robots.txt file, as it reduces crawl traffic, keeps irrelevant information from showing up in search results, and helps your pages get indexed by the relevant search engines.

How can I check whether my robots.txt file is working?

The simplest way to check whether your robots.txt file is working is to use Google Search Console; once you enter your domain details, it will give you a comprehensive result. Of course, this pertains only to Googlebot’s behaviour towards your robots.txt file.