CheapWindowsHosting.com | In this post we explain how to control the access of web crawlers, or web robots, to your site.
There are many reasons to control the access of web robots, or web crawlers, to your site. As much as you want Googlebot to come to your site, you don’t want spam bots to come and collect private information from it. Not to mention that when a robot crawls your site, it uses the website’s bandwidth too! In this post I explain how you can control the access of web robots to your site by using a simple ‘robots.txt’ file.
Web robots (also known as bots, web spiders, web crawlers, or ants) are programs that traverse the World Wide Web in an automated manner. Search engines (like Google, Yahoo, etc.) use web crawlers to index web pages so that they can provide up-to-date data.
Googlebot may be crawling your site to provide better search results, but at the same time spam bots may be collecting personal information such as email addresses for spamming purposes. If you want to control the access of web crawlers to your site, you can do so by using the “robots.txt” file. Keep in mind that the file is advisory: well-behaved crawlers honor it, but a malicious bot can simply ignore it.
‘robots.txt’ is a plain text file. Use any text editor to create the ‘robots.txt’ file.
The entries (rules) in the robots.txt file are written as ‘field: value’ pairs, one per line.
A simple robots.txt file uses the following three fields:
User-agent: the web robot that the rules following it apply to.
Disallow: the URL path you want to block the robot from accessing.
Allow: the URL path you want to allow the robot to access.
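To make the field and value idea concrete, here is a minimal Python sketch that splits each line of a robots.txt file into its field and its value. The helper name parse_fields is made up for this illustration. It also assumes that field names are not case-sensitive and that anything after a ‘#’ is a comment.

```python
def parse_fields(text):
    """Split robots.txt text into (field, value) pairs.

    Hypothetical helper for illustration only; it does not group
    rules by robot or validate the field names.
    """
    pairs = []
    for raw_line in text.splitlines():
        # Assumption: everything after '#' is a comment.
        line = raw_line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue  # skip blank lines and lines without a field
        field, value = line.split(":", 1)
        # Assumption: field names are compared case-insensitively.
        pairs.append((field.strip().lower(), value.strip()))
    return pairs


sample = "User-agent: *\nDisallow: /private  # keep robots out\n"
for field, value in parse_fields(sample):
    print(field, "=", value)
```

A real crawler does more than this (it groups the rules under each User-agent line, for example), so treat the sketch only as a way to see the structure of the file.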
The following will stop all robots from crawling your site (‘*’ means all robots and ‘/’ is the root directory):
User-agent: *
Disallow: /
The following will stop all robots from crawling the ‘/private’ directory. Rules match by prefix, so this also blocks any other URL whose path begins with ‘/private’:
User-agent: *
Disallow: /private
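If you want to check a rule like this one before uploading it, one option is the urllib.robotparser module from Python’s standard library. The sketch below is only an illustration: the bot name ‘AnyBot’ and the page URLs are made up, and this module’s matching is simpler than what some search engines implement.

```python
from urllib.robotparser import RobotFileParser

# The same rule as above, with one field per line.
RULES = """\
User-agent: *
Disallow: /private
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())  # parse from memory, nothing is downloaded

# Hypothetical URLs on the example site.
for url in (
    "https://www.yoursite.com/private/notes.html",
    "https://www.yoursite.com/private.html",
    "https://www.yoursite.com/public/index.html",
):
    print(url, parser.can_fetch("AnyBot", url))
```

The first two URLs are reported as blocked because both paths begin with ‘/private’, and the third is allowed.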
The following stops Googlebot-Image from crawling your site, which keeps your images out of Google image search. Use this to save bandwidth if you don’t want your images to be available in Google image search:
User-agent: Googlebot-Image
Disallow: /
The following will block all robots from crawling your site except Googlebot:
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
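Rules that treat robots differently are easy to get wrong, so it can help to try them out locally. Here is a small sketch that feeds the rules above to Python’s standard urllib.robotparser module and asks what two robots may do. The helper may_crawl, the bot name ‘SomeOtherBot’ and the page URL are assumptions made for this example, and other parsers may resolve edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Block every robot, then make an exception for Googlebot.
RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""


def may_crawl(agent, url, rules=RULES):
    """Return True if `agent` may fetch `url` under `rules` (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())  # parse from memory, nothing is downloaded
    return parser.can_fetch(agent, url)


page = "https://www.yoursite.com/index.html"  # made-up page on the example site
print("Googlebot:", may_crawl("Googlebot", page))
print("SomeOtherBot:", may_crawl("SomeOtherBot", page))
```

With these rules Googlebot is allowed and the other robot is not, because a robot that has its own User-agent entry follows that entry instead of the ‘*’ one.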
Put the robots.txt file in the root directory of your website. For example, put the file at www.yoursite.com/robots.txt, not in a sub-directory like www.yoursite.com/sub-directory/robots.txt. In most cases, the root will be the “public_html” directory of your site.
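Robots look for the file at the root of the host, no matter which page they start from. As a rough sketch of that lookup, the hypothetical helper robots_url below takes any page URL and returns the address where the robots.txt file would be expected. The https scheme and the page path are assumptions made for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the root-level robots.txt address for the site of `page_url`."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.yoursite.com/sub-directory/page.html"))
```

If that address does not return your file, it is probably sitting in a sub-directory instead of the root.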
You can verify that a bot visiting your site really is Googlebot by following the instructions on this page.