It can be exciting to see Google crawling your site almost instantaneously. However, most sites do not need this, since the content may be updated only once a day or at even longer intervals. When the content is not updated frequently, it does not make sense for search engine crawlers or bots to keep looking for updates on the site. In this article, we will see why you should control Googlebot and how to control the crawl rate of Googlebot and other search engine crawlers.
Why Should You Control Googlebot Crawl Rate?
Continuous crawling by search engine bots has an adverse effect on server performance, especially when you have multiple sites or a larger site. It is therefore necessary to control the crawl rate of the bots crawling your site, and in many cases Googlebot is the first one you should control.
- Your server resources are used whether the visitor is a search engine bot or a real user.
- A high crawl rate results in high CPU utilization, and you may end up paying more for additional resources. In a shared hosting environment, your host may stop the service to safeguard other sites hosted on the same server.
- When Googlebot crawls the site, real users may notice the slowness. This is especially true for ecommerce sites, where it is essential to control Googlebot and other frequently crawling bots.
You may not see any problems with bots if your site is small and has limited traffic. When you have multiple sites attracting thousands of visitors each day, you will notice CPU usage shooting up due to crawler activity. When CPU utilization is high, you will probably receive a warning message from your hosting company asking you to take action, or your account may even get suspended.
How to Monitor Googlebot?
There are two ways to monitor the crawling activities of Googlebot. One is to check from your Google Search Console and the other is to monitor from your hosting account.
Log in to your Google Search Console account and navigate to the “Crawl > Crawl Stats” section. Here you can see Googlebot’s activities during the past 90 days. You will see three graphs: pages crawled per day, kilobytes downloaded per day and time spent downloading a page (in milliseconds). These graphs will give you an overall idea of what Googlebot does on your site.
The second and most effective way is to monitor the activities on your server from your hosting account. Log in to your hosting account and look for one of the statistics reporting tools. Here we use Awstats, which is offered by almost all cPanel hosting providers like Bluehost. If you are using a custom hosting setup like SiteGround, you may need to check the custom server statistics reports available for this purpose.
Open the Awstats app and choose your site to view the statistics. Look under the “Robots / Spider visitors” section for the list of the most active bots.
Note: You can also use plugins like Wordfence to monitor the live traffic and Googlebot activities.
How to Control Googlebot Crawl Rate?
When you notice that Googlebot is crawling your site heavily and consuming a lot of bandwidth, it is time to control the crawl rate. Some hosting companies automatically control the crawl delay by adding entries to the robots.txt file (see the example entries at the end of this section). You can also manually control the crawl rate of Googlebot from Google Search Console. Once logged in to your Search Console account, click on the gear settings icon and choose the “Site Settings” option.
You will see two options under “Crawl rate” section.
- Let Google optimize for my site (recommended)
- Limit Google’s maximum crawl rate
Choose the second radio button and drag the slider down to the desired rate. This sets the number of requests per second and the number of seconds between crawl requests.
You can check with your hosting company to get an idea of what crawl rate is desirable. Once you have saved your settings, you will receive a message confirming that the crawl rate was changed.
The new crawl rate setting remains effective for 90 days and automatically resets to the first option, “Let Google optimize for my site”, after it expires.
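For reference, when crawl delay is managed through robots.txt as mentioned above, the entries usually look like the sketch below. Keep in mind that Googlebot ignores the Crawl-delay directive, so Google’s rate can only be limited through the Search Console setting described here, while crawlers such as Bingbot do honor it. The 10-second value is only an illustration.

```
# Illustrative robots.txt crawl-delay entries (the value is an example only)
# Note: Googlebot ignores Crawl-delay - use the Search Console setting for Google.
User-agent: Bingbot
Crawl-delay: 10

User-agent: *
Crawl-delay: 10
```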
What About Bing?
Similar to Googlebot, you can also restrict Bingbot from Bing Webmaster Tools. Once you have logged in to your account, navigate to the “Configure My Site > Crawl Control” section. Choose the “Custom” option for “When do you receive the most traffic to this site for your local time of the day?”
Adjust the crawl rate by selecting the blue boxes on the graph.
Other Search Engine Crawlers
Besides Google and Bing, there are many other bots that can crawl your site. You can block all of them with a generic .htaccess directive. Add the code below to your .htaccess file to block all bots except Google, Bing, MSN, MSR, Yandex and Twitter. All other bots will be redirected to the localhost IP address 127.0.0.1.
```
# Disable bad bots
RewriteEngine On
# Match an empty user agent...
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
# ...or a user agent containing bot, crawl or robot ([NC] makes the match case-insensitive)
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|robot) [NC]
# ...but not one of the whitelisted crawlers
RewriteCond %{HTTP_USER_AGENT} !(bing|Google|msn|MSR|Twitter|Yandex) [NC]
# Redirect everything else to localhost
RewriteRule ^/?.*$ http://127.0.0.1 [R,L]
```
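Redirecting unmatched bots to 127.0.0.1 makes them request their own machine instead of consuming your bandwidth. Alternatively, you can return a 403 Forbidden response by using the [F,L] flags on the RewriteRule instead of a redirect; both approaches keep the blocked bots off your content. Test the rule carefully, since the (bot|crawl|robot) pattern is broad and will also catch legitimate crawlers that are not in the whitelist.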
You can also monitor the traffic statistics and block spam traffic by IP address.
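For example, if your statistics reports show abusive requests from specific addresses, a minimal .htaccess sketch like the one below can block them. The IP addresses shown are placeholders from the documentation ranges, so replace them with the addresses you find in your reports; the syntax assumes Apache 2.4 or later, while older versions use the Order/Deny directives instead.

```
# Block abusive IP addresses found in your traffic statistics
# (192.0.2.10 and 198.51.100.0/24 are placeholder addresses - replace with real ones)
<RequireAll>
    Require all granted
    Require not ip 192.0.2.10
    Require not ip 198.51.100.0/24
</RequireAll>
```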
Crawl Optimization for WordPress Sites
Plugins like Yoast SEO and Perfmatters offer crawl optimization by removing unnecessary bloat from the WordPress site’s header. For example, if you are using Yoast SEO, go to “Advanced > Crawl Optimization” section and disable the items that you do not want Google to crawl on your site. This will add entries in your robots.txt file and prevent Googlebot from accessing the restricted pages.
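The exact entries depend on the plugin and the options you disable. As a purely hypothetical illustration, turning off crawling of internal search results and comment feeds might produce robots.txt rules along these lines:

```
# Hypothetical robots.txt rules generated by a crawl optimization plugin
# (actual paths and directives depend on your plugin and settings)
User-agent: *
Disallow: /?s=
Disallow: /search/
Disallow: /comments/feed/
```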
Conclusion
It is necessary to monitor and control crawler activities on your site in order to keep the CPU utilization of your hosting server within the allowed limits. We have explained some of the methods, and there are many other ways to stop bad bots. It is also a good idea to check with your host to make sure you are doing the right thing and blocking only bad bots.