How Google finds sites and pages
by murda the funky space monkey on Sep.24, 2008, under MISC STUFF
All major search engines use spider programs (also known as crawlers or robots) to scour the web, collect documents, give each a unique reference, scan their text, and hand them off to an indexing program. Where the scan picks up hyperlinks to other documents, those documents are fetched in turn. Google’s spider is called Googlebot, and you can see it hitting your site if you look at your web logs. A typical Googlebot entry (in the browser, or user-agent, section of your logs) might look like this:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
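If you would rather not read the logs by eye, a short script can pick out the Googlebot visits for you. The following is only a minimal sketch: it assumes your server writes the common Apache "combined" log format, and the sample lines, IP addresses, and the function name googlebot_hits are made up for illustration. Bear in mind that a user-agent string can be faked, so a match here suggests a Googlebot visit rather than proving one.

```python
import re

# Hypothetical sample lines in Apache "combined" log format (invented for this example).
SAMPLE_LOG = [
    '192.0.2.10 - - [24/Sep/2008:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [24/Sep/2008:10:16:01 +0000] "GET /about.html HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/3.0"',
]

# Assumption: the request, status, size, referrer and user agent appear in this order,
# with the user agent as the last double-quoted field on the line.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)


def googlebot_hits(lines):
    """Return the requested paths from lines whose user agent mentions Googlebot."""
    hits = []
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match and "googlebot" in match.group("agent").lower():
            hits.append(match.group("path"))
    return hits


if __name__ == "__main__":
    for path in googlebot_hits(SAMPLE_LOG):
        print("Googlebot fetched", path)
```

To try it on a real log, you could pass an open file object to googlebot_hits instead of SAMPLE_LOG, since the function only needs something it can loop over line by line.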
There are essentially four ways in which Googlebot finds your new site. The first and most obvious way is for you to submit your URL to Google for crawling, via the “Add URL” form at www.google.com/addurl.html. The second way is when Google finds a link to your site on another site that it has already indexed and sends its spider to follow the link. The third way is when you sign up for Google Webmaster Tools (more on this on page 228), verify your site, and submit a sitemap.
The fourth (and final) way is when you redirect an already indexed webpage to the new page (for example using a 301 redirect, about which there is more later).
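The second and fourth routes (following a link from an indexed page, and following a redirect from an indexed page) are easy to picture with a toy crawl. The sketch below is not how Googlebot is actually implemented; it simply walks a made-up, in-memory "web" breadth-first, starting from a page that is assumed to be indexed already. The URLs, the PAGES and REDIRECTS_301 tables, and the function name discover are all hypothetical.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
PAGES = {
    "http://indexed.example/": ["http://newsite.example/", "http://old.example/page"],
    "http://newsite.example/": ["http://newsite.example/about"],
    "http://newsite.example/about": [],
    "http://moved.example/page": [],
}

# Hypothetical permanent (301) redirects: old URL -> new URL.
REDIRECTS_301 = {"http://old.example/page": "http://moved.example/page"}


def discover(seed_urls):
    """Breadth-first crawl from already-known URLs; returns URLs in discovery order."""
    seen = []  # a list keeps discovery order; fine for a tiny example
    queue = deque(seed_urls)
    while queue:
        url = queue.popleft()
        # Follow any chain of 301 redirects to its final target, guarding against loops.
        hops = set()
        while url in REDIRECTS_301 and url not in hops:
            hops.add(url)
            url = REDIRECTS_301[url]
        # Skip pages already visited and URLs that do not exist in the toy web.
        if url in seen or url not in PAGES:
            continue
        seen.append(url)
        # Queue every hyperlink found on this page so it is fetched in its turn.
        queue.extend(PAGES[url])
    return seen


if __name__ == "__main__":
    for url in discover(["http://indexed.example/"]):
        print(url)
```

Starting from the single indexed page, the crawl reaches the new site through an ordinary link and reaches the moved page through the redirect, which is the same pair of discovery routes described above.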
In the past you could use search engine submission software, but Google now prevents this – and prevents spammers bombarding it with new sites – by using a CAPTCHA, a challenge-response test to determine whether the user is human, on its Add URL page. CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart, and typically takes the form of a distorted image of letters and/or numbers that you have to type in as part of the submission.


