A crawler, or spider, is a program that collects information about web pages available on the internet. The basic purpose of a crawler is to gather information from a page and index it in a database for later retrieval. However, there are many types of crawlers used for multiple purposes, both good and bad. Search engine providers name their crawlers for ease of identification, such as Googlebot or Bingbot. Google, for example, operates several crawlers, each for a specific purpose.
These crawlers browse the web and index new content in the search engine’s database. When a user searches for a query, content from the database is retrieved and ranked by sophisticated algorithms.
On the flip side, many bots are used to collect information for malicious purposes such as hacking.
How to Identify Crawlers?
Search engines provide Webmaster Tools or Search Console accounts to view and control crawler activity on your site if it affects performance. In addition, all crawler entries can be obtained from the site’s server log for troubleshooting and analysis. For example, if you see a bad robot crawling your site, you can block that particular one to safeguard the site’s content.
If a crawler affects user activity on the site, you can control the rate and timing of crawling from your Webmaster Tools account so that crawling happens during off-peak hours without affecting users.
Site owners decide whether a crawler should crawl a webpage by adding appropriate attributes in the page’s header section using robots meta tags, or through a robots.txt file in the site’s root directory. The User-agent directive in the robots.txt file specifies the name of the bot to allow or deny access to a specific page, directory, or the entire site. For example, if you do not want Google’s crawler to scan your site, you can deny it access with a robots.txt rule.
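The standard robots.txt directives for denying Googlebot access to the whole site look like this:

```text
# Deny Googlebot access to the entire site
User-agent: Googlebot
Disallow: /
```

A `Disallow: /` rule covers every path on the site; narrowing the path (for example `Disallow: /private/`) restricts only that directory.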
Though it is possible to instruct crawlers through robots.txt or robots meta tags, it is up to each crawler to obey these rules. Search engines generally comply, but bad robots do not.
Inform Crawlers Through Robots.txt and Nofollow Tag
As covered above, search engine providers name their crawlers, such as Googlebot or Bingbot, for ease of identification. The important part is that you, as the owner of your site, need to tell these crawlers which of your URLs should be indexed and which outgoing links from your site the search engine should consider.
What is Robots.txt?
A “robots.txt” is a text file in the root directory of a website that informs search engines whether a webpage is allowed to be crawled. This file is optional, used only when you need to instruct crawlers, and most content management systems generate it automatically. You can view your site’s robots.txt file by simply entering “www.yoursitename.com/robots.txt” in the browser’s address bar.
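As an illustration, a typical generated robots.txt might look like the sketch below (the directory names and sitemap URL here are hypothetical):

```text
# Applies to all crawlers
User-agent: *
Disallow: /admin/
Disallow: /tmp/

Sitemap: https://www.yoursitename.com/sitemap.xml
```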
Some of your site’s pages may contain confidential information, and if you do not prevent search engines from crawling those pages using the robots.txt file, those confidential details may be shown to the public in search results. Besides hiding pages from search engines, the robots.txt file is also used to specify what exactly a particular search bot is expected to do on a site.
For example, you can stop Googlebot from accessing only a specific directory on your site while giving Bingbot complete access.
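Such per-bot rules can be sketched as follows (the directory name is a placeholder):

```text
# Block Googlebot from one directory only
User-agent: Googlebot
Disallow: /private-reports/

# Give Bingbot full access (empty Disallow allows everything)
User-agent: Bingbot
Disallow:
```

Each crawler follows only the group of rules addressed to its own user-agent name.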
1. Google Search Console offers a robots.txt file generator to help you create this file, which you can then upload to your server.
2. It is recommended to have at least an empty robots.txt file even if you do not need to instruct crawlers.
Is Using Robots.txt Enough to Hide Sensitive Information?
Using robots.txt alone is definitely not a secure way to hide your sensitive content from search engines, for the following reasons:
- Since anyone can view the robots.txt file in a browser, a curious user may analyze the listed directories and guess the URLs you are hiding.
- Some search engine bots do not honor robots.txt exclusions and will continue to index your confidential pages.
- Search engines may still show a blocked URL in search results if other pages link to it, even though they cannot crawl its content.
Hypertext access, or .htaccess, is a widely supported configuration file used to control a particular directory on a web server. It controls the behavior of an individual site even though the server has its own global configuration. This file is generally used to control the authorization needed to access particular parts of a site.
For example, you can block a specific IP address or domain from accessing your site. You can also set redirect rules to inform search engines when they access a particular page.
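As a sketch of both ideas in .htaccess (Apache 2.4 syntax; the IP address and page paths are placeholders):

```apache
# Block one specific IP address while allowing everyone else
<RequireAll>
    Require all granted
    Require not ip 203.0.113.42
</RequireAll>

# Tell browsers and search engines a page has moved permanently
Redirect 301 /old-page.html /new-page.html
```

The 301 status tells search engines to transfer the old URL’s ranking signals to the new one.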
More security-sensitive settings are generally controlled directly at the server configuration level, in the httpd.conf file.
Understanding rel=”nofollow” for Links
Google introduced the PageRank mechanism, which evaluates a page based on its external links. This was later adopted by most other search engines and changed the whole game of search engine optimization. Many webmasters and SEO companies started building unnatural links solely to improve a site’s rank in search results. To ensure the quality of external links, Google then introduced the rel=”nofollow” HTML link attribute to tell search engine crawlers whether to consider a link when evaluating search ranking.
The syntax of the “nofollow” attribute is straightforward.
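In an anchor tag it looks like this:

```html
<a href="http://example.com" rel="nofollow">Anchor text</a>
```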
Where Can I Use Nofollow?
Rel=”nofollow” is an HTML link attribute used in anchor tags to tell search engine crawlers not to consider the link when evaluating search ranking. The method was initially introduced by Google and later adopted as a standard by other search engines such as Bing.
Google’s search algorithm depends heavily on the external-link weight of a page, which led to webmasters spamming other sites with their links in order to improve their own site’s search rank. One of the main targets for spam bots is blog comments, where it is easy to leave a comment with a link, since in earlier days most site owners auto-approved comments. To counter this comment spam, Google introduced the rel=”nofollow” attribute, which can be added to any individual hyperlink so that the link is not considered when calculating PageRank for search results.
The nofollow attribute is used within an HTML anchor tag as below:
<a href="http://example.com" rel="nofollow"> This is a nofollow link, don’t spam </a>
The nofollow value in a robots meta tag informs search engines that none of the links on a page are to be followed, whereas rel=”nofollow” applies to specific links, giving webmasters finer control.
Nofollow can be used in many cases; here are some of the important ones:
- It is very useful for neutralizing spammy site links entered in the comment section of your blog, since blog comment sections are highly vulnerable to comment spam.
- Using nofollow in the rel attribute of links in comments ensures that you are not passing your page’s reputation to a spammy site.
- Nofollow is also useful in forums, guest books, and shout-boards. Most blogging and forum providers add nofollow to user comments by default, to avoid having to add it manually to each comment.
- You can also use comment moderation measures such as requiring a CAPTCHA code or a social network login for commenting.
- Nofollow is also useful when you refer to a link on your site but are not interested in passing your outbound link reputation on to it.
- If you want to nofollow all the links on any of your site’s pages, use “nofollow” in the robots meta tag, which is placed inside the <head> tag of that page’s HTML.
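The page-wide variant looks like this in the page’s head section:

```html
<head>
  <title>Page Title</title>
  <!-- Do not follow any link on this page -->
  <meta name="robots" content="nofollow">
</head>
```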
Robots Meta Tags
Robots meta tags are HTML tags used within the <head> section of a web page to tell search engine crawlers whether the page should be indexed and whether the links on the page should be followed. The name “robots” indicates that these tags guide robots or crawlers, not human users.
Robots Meta Tags have the following two attributes:
- “name”, which should always be set to “robots”, and
- “content”, which takes one or more of the following four values, based on need:
- Index – allowed to index
- Noindex – Not allowed to index
- Follow – allowed to follow links in that page
- Nofollow – not allowed to follow links in that page
Robots meta tags are used in the following manner (the three meta tags below are alternative combinations; a real page would use only one of them):
<HTML>
<HEAD>
<TITLE> Page Title </TITLE>
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</HEAD>
If no robots meta tag is present, the default values “INDEX, FOLLOW” are assumed.
When robots meta tags are used in combination with a robots.txt file, crawlers give precedence to the robots.txt file. Thus, disallowing a directory in robots.txt while using a noindex meta tag on a page within that directory has no effect, because the crawler is blocked from fetching the page and never sees the meta tag.
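To illustrate this pitfall (the directory name is hypothetical): with the rule below in robots.txt, crawlers never fetch pages under /reports/, so any noindex meta tag placed on those pages is never read, and the URLs may still appear in search results if other sites link to them:

```text
# robots.txt — blocks crawling, so on-page meta tags are never seen
User-agent: *
Disallow: /reports/
```

To reliably keep a page out of the index, let it be crawled and use the noindex meta tag instead of a robots.txt block.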