A “robots.txt” file is a text file in the root directory of a website that tells web crawlers which content on that site they are not allowed to crawl. The protocol behind it is called the “robots.txt protocol”, the “Robots Exclusion Protocol” or the “Robots Exclusion Standard”. The name “robots” indicates that the file is meant for web crawlers such as search engine bots, not for human users. Although it is up to each search engine whether to obey the request, many search engines, including Google, Bing, Baidu and Yandex, follow the rules in the robots.txt file.
How it Works?
Let us take the example of Googlebot (the crawler used by the Google search engine) visiting the page “http://example.com/visit-my-page.html”. Before it requests the page, it looks for a file named “/robots.txt” in the root directory of the domain, that is “http://example.com”, and follows the rules in that file. In other words, Googlebot reads “http://example.com/robots.txt” before trying to read the web page.
Note: It is important to place the robots.txt file in the root directory of a site; crawlers will not look for it in any other directory, so a file placed elsewhere has no effect on crawler behavior. The file name must be all lowercase, “robots.txt”, not “Robots.txt”.
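The lookup described above can be sketched in a few lines of Python. This is only an illustration of how a crawler might derive the robots.txt location from a page URL; the helper name robots_txt_url is hypothetical and is not taken from any real crawler.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that hosts page_url.

    Hypothetical helper: it keeps the scheme and host of the page
    and replaces the path with "/robots.txt".
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://example.com/visit-my-page.html"))
print(robots_txt_url("http://example.com/user/site/index.html"))
```

Both calls print http://example.com/robots.txt. The second one shows that a page inside a subdirectory still resolves to the single file in the root of the domain.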
No Robots.txt File in a Site
As far as Google is concerned, if you do not need to restrict access to any pages on a site, you do not need a robots.txt file. Google does not even need an empty file, and Googlebot will crawl all your content. This may not be true for other bots crawling a site. If there is no file in the root directory of a site, other bots may also assume that the entire content can be crawled, but your server logs will be cluttered with thousands of 404 (page not found) errors. Because each bot first looks for the file, the server has to respond with a 404 status code to tell the bot that no file is available.
Although most recent content management tools dynamically generate a robots.txt file to avoid this issue, you can also add an empty file to keep these errors out of the server logs, even if you have nothing to restrict from search engines.
Note: Server logs are an important source for finding out which robots are crawling your site and for blocking them if they affect your site’s performance.
A robots.txt file has a simple structure built from two kinds of directives: “User-agent” and “Disallow” or “Allow”. “User-agent” names the robot the rules apply to, and “Disallow” or “Allow” tells that robot not to crawl, or to crawl, the given path on the server. Below are some usages for your reference:
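As a rough illustration of the note above, the following Python sketch counts which user agents requested /robots.txt and which status code they received. It assumes the server writes the widely used "combined" access log format; the function name robots_txt_requests and the sample log lines are made up for this example.

```python
import re
from collections import Counter

# Matches the request, status and user-agent fields of a log line in the
# "combined" access log format (an assumption about your server setup).
LOG_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def robots_txt_requests(log_lines):
    """Count robots.txt requests per (user agent, status code)."""
    counts = Counter()
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match and match.group("path") == "/robots.txt":
            counts[(match.group("agent"), int(match.group("status")))] += 1
    return counts


# Made-up sample lines, for illustration only.
sample_log = [
    '192.0.2.1 - - [01/Jan/2024:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 404 153 "-" "ExampleBot/1.0"',
    '192.0.2.1 - - [01/Jan/2024:00:00:02 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '192.0.2.7 - - [01/Jan/2024:00:00:03 +0000] "GET /robots.txt HTTP/1.1" 404 153 "-" "OtherBot/2.1"',
    '192.0.2.1 - - [01/Jan/2024:00:00:09 +0000] "GET /robots.txt HTTP/1.1" 404 153 "-" "ExampleBot/1.0"',
]

for (agent, status), hits in robots_txt_requests(sample_log).items():
    print(f"{agent} requested /robots.txt {hits} time(s), status {status}")
```

A real log file can be passed in the same way, for example by iterating over an open file object instead of the sample list.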
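Allow all web crawlers to access all content: “User-agent: *” followed by an empty “Disallow:” line.
Restrict access to all content: “User-agent: *” followed by “Disallow: /”.
Restricting a directory: “Disallow:” followed by the path of the directory, ending with a slash.
Restricting a single page: “Disallow:” followed by the path of that page.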
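The four cases listed above can be checked with Python's standard urllib.robotparser module. This is a minimal sketch: the helper allowed, the bot name AnyBot and the paths /private/ and /private/report.html are assumptions chosen for illustration, and other crawlers may interpret edge cases differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules: str, url: str, agent: str = "AnyBot") -> bool:
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


ALLOW_ALL = "User-agent: *\nDisallow:"
BLOCK_ALL = "User-agent: *\nDisallow: /"
BLOCK_DIRECTORY = "User-agent: *\nDisallow: /private/"        # hypothetical directory
BLOCK_PAGE = "User-agent: *\nDisallow: /private/report.html"  # hypothetical page

print(allowed(ALLOW_ALL, "http://example.com/page.html"))
print(allowed(BLOCK_ALL, "http://example.com/page.html"))
print(allowed(BLOCK_DIRECTORY, "http://example.com/private/page.html"))
print(allowed(BLOCK_DIRECTORY, "http://example.com/public/page.html"))
print(allowed(BLOCK_PAGE, "http://example.com/private/report.html"))
print(allowed(BLOCK_PAGE, "http://example.com/private/other.html"))
```

Only the first, fourth and last checks report that the URL may be fetched: an empty “Disallow:” blocks nothing, and the directory and page rules block only the paths that start with the listed value.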
Some search engine crawlers, such as Google’s, also accept the “Allow” directive; “User-agent: *” followed by “Allow: /” allows access to all content.
Using “Disallow” and “Allow” attributes in a single file is also possible. You can provide access only to Google and block all other crawlers for a site:
User-agent: * # all robots
Disallow: / # are disallowed to crawl all pages
User-agent: Googlebot # except Googlebot
Allow: / # can crawl all content
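The effect of the rules above can be tried out with Python's standard urllib.robotparser module. This is a sketch of how one parser reads the file, not a statement about how any particular search engine implements it; the bot name OtherBot is a placeholder.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: * # all robots
Disallow: / # are disallowed to crawl all pages
User-agent: Googlebot # except Googlebot
Allow: / # can crawl all content
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

page = "http://example.com/visit-my-page.html"
print(parser.can_fetch("Googlebot", page))  # Googlebot matches its own group
print(parser.can_fetch("OtherBot", page))   # any other bot falls back to the "*" group
```

With this parser the first check prints True and the second prints False, which matches the intent of giving access only to Googlebot.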
All paths in the file are relative, except for the Sitemap. If you add a Sitemap directive to the robots.txt file, it must give the absolute URL of your XML Sitemap so that search engine crawlers can locate it, in the form “Sitemap:” followed by the full URL.
- Use # to add comments to your robots.txt file.
- Wildcards are accepted for “User-agent:” but are not defined as a standard for “Disallow:”. Hence “Disallow: *” may not be interpreted in the same way by all crawlers.
- Not all search engines support and follow the directives in a robots.txt file.
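The Sitemap directive and the use of # comments can be illustrated with Python's urllib.robotparser as well. This sketch assumes Python 3.8 or later, where the site_maps() method is available; the sitemap URL and the /private/ directory are placeholders.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
# Comment lines like this one are ignored by the parser
User-agent: *
Disallow: /private/   # hypothetical directory, relative path
Sitemap: http://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

print(parser.site_maps())  # absolute Sitemap URLs found in the file
print(parser.can_fetch("AnyBot", "http://example.com/private/page.html"))
```

The comments do not affect the result: the parser still reports the Sitemap URL and still treats the /private/ directory as disallowed.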
How to Create and Validate Robots.txt File?
Robots.txt is a simple text file that can be created with Notepad on Windows-based PCs or with TextEdit on OS X-based Macs. Save the text file in ASCII format and upload it to the root directory of the web server. You can also use a simple robots.txt generator tool to create a custom robots.txt file for your site.
If the hosting company provides a directory-based site address like “http://example.com/user/site/”, individual users cannot create a separate “/robots.txt” file for their site. Validators check the robots.txt file for correctness, including possible misuse of the slash symbol (/). The robots.txt tester is a free tool available in Google Search Console with the following features:
- View live robots.txt file.
- Edit the file and download it (you need to upload the downloaded file to your server yourself).
- Submit updated file to Google.
- Test whether any URL is blocked or allowed for Googlebot, Google-News and Google-Image.
Using Robots.txt for Security
The robots.txt file of a website can be viewed in a web browser at “http://www.yoursitename.com/robots.txt”, even though the file is not linked from the site’s navigation menu or listed in the XML Sitemap. This means anyone can view the file and try to open the disallowed content.
Because of this public visibility, we do not recommend restricting individual pages on a site with robots.txt; restricting the directory is a better option, since it makes the URLs inside the directory harder to guess.
Moreover, bots are not obliged to obey the robots.txt file, and plenty of spam bots will still crawl the content of blocked sites. If you want to keep content away from search engines, the best approach is to restrict access on the server by requiring a login and the necessary authorization.
Robots Meta Tag and rel=“Nofollow”
In addition to the robots.txt file, you can restrict content using robots meta tags. Webmasters often confuse the robots.txt file, robots meta tags and the rel=“nofollow” link attribute. Here is a short explanation of what happens when you block a webpage with each of them:
When a page is blocked by robots.txt, search engine crawlers stop after reading the robots.txt file and do not request the page. Search results may still show the page as a link with no description. Sometimes you will see messages like “We would like to show you a description here but the site won’t allow us” in Bing and “A description for this result is not available because of this site’s robots.txt” in Google.
By Robots Meta Tag
Crawlers will access the page and find the robots meta tag while crawling. When a search engine crawler finds the “noindex” value on a page, it will not index the page or show it in search results. Similarly, if the crawler finds the “nofollow” value, it will not follow the links on that page.
Note: If a page is blocked by both robots.txt and a robots meta tag, robots.txt takes precedence, since it is read before the crawler requests the page from the server.
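To make the robots meta tag behavior concrete, here is a small Python sketch of how a crawler might read the tag from a page it has already fetched. The class name RobotsMetaParser and the sample HTML are invented for this example; real crawlers have their own parsing logic.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Directives are comma separated, e.g. "noindex, nofollow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


# Made-up page for illustration only.
html = (
    '<html><head><meta name="robots" content="noindex, nofollow"></head>'
    "<body><p>Hello</p></body></html>"
)

meta_parser = RobotsMetaParser()
meta_parser.feed(html)
print("index this page:", "noindex" not in meta_parser.directives)
print("follow its links:", "nofollow" not in meta_parser.directives)
```

Note that this check can only run after the page has been downloaded, which is why a robots.txt block prevents the crawler from ever seeing the meta tag.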
The rel=“nofollow” attribute is used in the HTML anchor tag <a> to tell crawlers not to follow that particular link when considering ranking in the search results. Search engine crawlers will still crawl the page content, index it and show it in the search results as normal.
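The difference between an ordinary link and a rel="nofollow" link can be sketched in Python as follows. The class name LinkCollector and the sample links are hypothetical; the sketch only shows how a crawler could separate the two kinds of links on a page it is still free to crawl and index.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Split the links of a page into followed and nofollow lists."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # rel may hold several space-separated values, e.g. "nofollow noopener".
        rel_values = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_values:
            self.nofollow.append(href)
        else:
            self.followed.append(href)


# Made-up page for illustration only.
page_html = (
    '<p><a href="/about.html">About</a> '
    '<a href="http://example.org/ad" rel="nofollow noopener">Sponsor</a></p>'
)

collector = LinkCollector()
collector.feed(page_html)
print("followed:", collector.followed)
print("nofollow:", collector.nofollow)
```

Here /about.html ends up in the followed list and the sponsor link in the nofollow list, while the page that contains both links is processed as normal.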