

Controlling Googlebot

Wed, Sep 24, 2008





For some webmasters Google crawls too often (and consumes too much bandwidth). For others it visits too infrequently. Some complain that it doesn’t visit their entire site and others get upset when areas that they didn’t want accessible via search engines appear in the Google index.
You cannot force robots to visit. Google will crawl your site often only if it has excellent content that is updated frequently and cited often by other sites. No amount of shouting will make you popular! However, it is certainly possible to deter robots. You can control which pages Googlebot crawls and (should you wish) request a reduction in the frequency or depth of each crawl.
To prevent Google from crawling certain pages, the best method is to use a robots.txt file. This is simply an ASCII text file that you place at the root of your domain. For example, if your domain is http://www.yourdomain.com, place the file at http://www.yourdomain.com/robots.txt. You might use robots.txt to prevent Google indexing your images, running your Perl scripts (for example, any forms for your customers to fill in), or accessing pages that are copyrighted. Each block of the robots.txt file lists first the name of the spider, then, on subsequent, separate lines, the directories or files it is not allowed to access. Each entry is matched against the start of the URL path. Googlebot also understands two pattern characters: * matches any sequence of characters and $ marks the end of a URL. Other spiders may not honor these extensions.
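As a rough illustration of how Googlebot-style path patterns behave, here is a minimal Python sketch. It assumes the two pattern characters Google documents (* for any run of characters, $ for end of URL) and plain prefix matching otherwise. The function names `pattern_to_regex` and `matches` are made up for this example and are not part of any crawler.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumption: only '*' (any run of characters) and a trailing '$'
    (end of URL) are special; everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def matches(pattern, path):
    """Return True if the URL path falls under the pattern."""
    return pattern_to_regex(pattern).match(path) is not None
```

With this sketch, `matches("/images/", "/images/logo.png")` is a plain prefix match, while `matches("/*.pdf$", "/docs/report.pdf")` succeeds only because the path ends in .pdf.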
The following robots.txt file would prevent all robots from accessing your image or Perl script directories and, in addition, keep Googlebot away from your copyrighted material and copyright notice page (assuming you had placed images in an “images” directory and your copyrighted material in a “copyright” directory). A robot that finds a block naming it obeys only that block and ignores the * block, so the shared rules are repeated for Googlebot:

User-agent: *
Disallow: /images/
Disallow: /cgi-bin/

User-agent: Googlebot
Disallow: /images/
Disallow: /cgi-bin/
Disallow: /copyright/
Disallow: /content/copyright-notice.html
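
Before uploading a robots.txt file, it can help to check what it actually permits. The sketch below uses Python's standard `urllib.robotparser` module on a copy of the rules embedded as a string. In this copy the Googlebot block repeats the shared rules, because a robot that finds a block naming it typically ignores the * block. The helper names `build_parser` and `is_allowed` and the bot name "SomeOtherBot" are illustrative assumptions. Python's parser uses simple prefix matching, so it only approximates what a real crawler does.

```python
from urllib.robotparser import RobotFileParser

# Illustrative copy of the rules; the Googlebot block repeats the shared
# rules because a named block replaces the '*' block for that robot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
Disallow: /cgi-bin/

User-agent: Googlebot
Disallow: /images/
Disallow: /cgi-bin/
Disallow: /copyright/
Disallow: /content/copyright-notice.html
"""


def build_parser(text):
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def is_allowed(parser, agent, path):
    """Ask whether the named robot may fetch the given path."""
    return parser.can_fetch(agent, "http://www.yourdomain.com" + path)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for agent in ("Googlebot", "SomeOtherBot"):
        for path in ("/images/logo.png", "/copyright/essay.html", "/index.html"):
            print(agent, path, is_allowed(rules, agent, path))
```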

To control Googlebot’s crawl rate, you need to sign up for Google Webmaster Tools. You can then choose one of three settings for your crawl: faster, normal, or slower (although sometimes faster is not an available choice). Normal is the default (and recommended) crawl rate. A slower crawl will reduce Googlebot’s traffic on your server, but Google may not be able to crawl your site as often.
You should note that none of these crawl adjustment methods is 100% reliable (particularly for spiders that are less well behaved than Googlebot). Even less likely to work are metadata robot instructions, which you incorporate in the meta tags section of your web page.

However, I will include them for completeness. The meta tag to stop spiders indexing a page is:

<meta name="robots" content="NOINDEX">

The meta tag to prevent spiders following the links on your page is:

<meta name="robots" content="NOFOLLOW">

Google is known to observe both the NOINDEX and NOFOLLOW instructions, but as other search engines often do not, I would recommend the use of robots.txt as a better method.
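
To confirm that a page really carries the robots meta tag you intended, you can read the directives back out of its HTML. The following is a minimal sketch using Python's standard `html.parser` module. The class name `RobotsMetaParser` and the function `robots_directives` are made up for this example, and the sketch assumes directives are comma-separated, as in content="NOINDEX, NOFOLLOW".

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        # Directives are case-insensitive; normalize them to lower case.
        for token in (attributes.get("content") or "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

For a page whose head contains the NOINDEX tag shown above, `robots_directives(page_html)` would include "noindex".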


This post was written by:

murdaFSM - who has written 2535 posts on FSMdotCOM.

