SEO Site.in : Search Engine Optimization

Robots and Robots.txt file

Posted On 06 Dec, 2007

Before discussing the robots.txt file, it is important to understand the term ‘Robot’ as used on the WWW.

What is a Robot?

A Robot is an automated software program used to locate and collect data from web pages for inclusion in a search engine's database and to follow links to find new pages on the World Wide Web. Normal Web browsers are not robots, because they are operated by a human and don't automatically retrieve referenced documents (other than inline images). Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading, as they give the impression that the software itself moves between sites like a virus. This is not the case: a robot simply visits sites by requesting documents from them.
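As a rough illustration of the "follow links" step, the sketch below pulls the link targets out of a single page, which is what a robot would do before requesting those documents in turn. It is only a minimal Python sketch and not how any particular search engine works; the page content, the example.com address, and the LinkCollector name are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the targets of <a href="..."> tags found in one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the address of the page.
                self.links.append(urljoin(self.base_url, value))


# Hypothetical page content; a real robot would download this first.
page = '<p><a href="/about.html">About</a> <a href="photos/car.html">Car</a></p>'

collector = LinkCollector("http://www.example.com/index.html")
collector.feed(page)
# collector.links now holds the absolute addresses the robot could request next.
```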

Autonomous agents

These are programs that do travel between sites, deciding for themselves when to move and what to do. They can only travel between special servers and are currently not widespread on the Internet.

Intelligent agents

These are programs that help users with things such as choosing a product, filling in a form, or finding things. They generally have little to do with networking.

User-agent

This is the technical name for programs that perform networking tasks for a user, such as Web user-agents like Netscape Navigator and Microsoft Internet Explorer, and email user-agents like Qualcomm Eudora.

Spiders 
Same as robots, but the name sounds cooler in the press.

Worms
Same as robots, although technically a worm is a replicating program, unlike a robot.


Web crawlers
Same as robots, but note that WebCrawler is a specific robot.

WebAnts
Distributed cooperating robots.

 
What is a Robots.txt file?
Now let us come to the robots.txt file. It is a file that instructs robots how to behave on your site, using simple instructions that robots are expected to follow. Sometimes site owners don't want every page listed and prefer to keep some information for their own use; through the robots.txt file they can ask robots not to crawl those pages. Whenever a robot starts visiting a site, it looks for the robots.txt file first to learn what it should do.
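To make "looks for the robots.txt file first" concrete, here is a small Python sketch of how a robot might work out where that file lives for any page it is about to visit. The function name robots_url and the example.com address are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the address of robots.txt for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


location = robots_url("http://www.example.com/photos/car.html?size=large")
# location is "http://www.example.com/robots.txt"
```

Whatever page the robot starts from, the file is always requested from the top level of the site.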

How to make?
It is a plain text file named “robots.txt”, of the kind you can create in Notepad; anyone who can make a website can make this file. You need to upload it to the root of your web site. If you don't have access to the root, you will need to use a robots Meta tag in your pages instead to disallow access.
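For the Meta tag route, the instruction sits in the head of each page rather than in a file at the root. The Python sketch below shows a page carrying such a tag and how a robot might read it; the page content and the RobotsMetaReader name are hypothetical, and real robots may read the tag differently.

```python
from html.parser import HTMLParser


class RobotsMetaReader(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # The content is a comma-separated list such as "noindex, nofollow".
            for part in content.split(","):
                if part.strip():
                    self.directives.append(part.strip().lower())


# Hypothetical page that asks robots not to index it or follow its links.
page = '<html><head><meta name="robots" content="noindex, nofollow"></head><body></body></html>'

reader = RobotsMetaReader()
reader.feed(page)
# reader.directives is ["noindex", "nofollow"]
```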

The basic structure is :
 
User-agent: Robot Name here

Disallow: /Filename here

User-agent
The first part of the file structure, i.e. ‘User-agent’, names the robot that the directions are meant for. There are two ways to use this:

User-agent: *
The asterisk acts as a wildcard, so the directions that follow apply to all spiders. You may want to use this, together with Disallow lines, to stop search engines listing unfinished pages.

User-agent: Googlebot
This sentence structure means that these directions apply to just Googlebot.
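The two forms of User-agent can be tried out with Python's standard urllib.robotparser module. This is only a sketch: the rules, the /drafts/ folder, and the robot name OtherBot are made up for the example, and an individual search engine may interpret a file slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: Googlebot gets its own directions, everyone else is shut out.
rules = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot follows the directions written for it by name.
googlebot_home = parser.can_fetch("Googlebot", "http://www.example.com/index.html")
googlebot_draft = parser.can_fetch("Googlebot", "http://www.example.com/drafts/new.html")

# Any robot not named in the file falls back to the wildcard directions.
other_home = parser.can_fetch("OtherBot", "http://www.example.com/index.html")
```

Here googlebot_home is True, while googlebot_draft and other_home are False.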

Disallow
The second part of the file structure, i.e. ‘Disallow’, tells the robots which folders they should not look at. For example, if you do not want search engines to index the photos on your site, you can place those photos into one folder called "photos" and then tell all search engines not to index that folder.

In that case your sentence structure will be:

User-agent: *
Disallow: /photos
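The effect of these two lines can be checked with Python's standard urllib.robotparser module. This is a sketch under assumptions: ExampleBot and the example.com addresses are invented for illustration, and the parser only approximates what a given search engine does.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /photos
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A file inside the photos folder is off limits to every robot.
photo_ok = parser.can_fetch("ExampleBot", "http://www.example.com/photos/beach.jpg")

# Pages elsewhere on the site are unaffected.
home_ok = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")

# The rule is matched as a prefix, so this page is caught by it as well.
photos_page_ok = parser.can_fetch("ExampleBot", "http://www.example.com/photos.html")
```

Here photo_ok and photos_page_ok are False and home_ok is True. Because the match is a simple prefix, /photos also covers a page such as /photos.html; writing the rule as /photos/ limits it to the folder.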

While writing this section of the file, the following rules must be followed:

Disallow: /mydirectory/

This sentence structure disallows an entire directory.

Disallow: /file.htm

This sentence structure disallows an individual file.

You have to use a separate Disallow line for each file or folder. You also need to include both the user agent and a file or folder to disallow. Using a comma between the filenames is incorrect, e.g.

Incorrect:

Disallow: /file1.htm,file2.htm

Correct:

User-agent: *
Disallow: /file1.htm
Disallow: /file2.htm
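Why the comma form fails can be seen by feeding both versions to Python's standard urllib.robotparser module. This is a sketch only; ExampleBot and the example.com addresses are hypothetical, and other parsers may treat a malformed line in their own way.

```python
from urllib.robotparser import RobotFileParser


def can_visit(robots_txt, path):
    """Report whether a hypothetical robot may fetch path under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("ExampleBot", "http://www.example.com" + path)


# The comma version is read as ONE odd path, not as two files.
wrong = "User-agent: *\nDisallow: /file1.htm,file2.htm\n"
# One Disallow line per file, as the rule above requires.
right = "User-agent: *\nDisallow: /file1.htm\nDisallow: /file2.htm\n"

wrong_result = [can_visit(wrong, "/file1.htm"), can_visit(wrong, "/file2.htm")]
right_result = [can_visit(right, "/file1.htm"), can_visit(right, "/file2.htm")]
```

With this parser, wrong_result is [True, True], so neither file is protected, while right_result is [False, False].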

Some more rules:

User-agent: *

Disallow:

This sentence structure allows all robots to visit the whole site.

If you don’t have a robots.txt file, robots are free to access and index all of your web pages.

An empty file named robots.txt also allows robots to freely access and index all web pages.
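Both "allow everything" cases can be confirmed with Python's standard urllib.robotparser module. The sketch below assumes a made-up robot name and address, and shows only how this one parser reads the files.

```python
from urllib.robotparser import RobotFileParser


def can_visit(robots_txt, url):
    """Report whether a hypothetical robot may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("ExampleBot", url)


page = "http://www.example.com/private/report.html"

# A Disallow line with nothing after the colon blocks nothing.
empty_disallow = can_visit("User-agent: *\nDisallow:\n", page)

# A completely empty robots.txt has no rules at all.
empty_file = can_visit("", page)
```

Both empty_disallow and empty_file come out as True.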

Googlebot’s Allow:
Googlebot, the robot that Google uses to index web pages, understands a few more instructions than other robots. In addition to "User-agent" and "Disallow", Googlebot also understands "Allow:". This instruction lets you tell a robot that it is okay to see a file in a folder that has been "Disallowed" by other instructions.

In the above example of the photos folder, suppose there is a photo called myphoto.jpg that you want Googlebot to index. Then the following sentence structure

User-agent: *
Disallow: /photos
Allow: /photos/myphoto.jpg

would tell Googlebot that it can visit "myphoto.jpg" in the photos folder, even though the "photos" folder is otherwise excluded.
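One way to picture how an Allow line can override a Disallow line is the "most specific rule wins" idea, which is the precedence Google documents for its crawler. The Python sketch below is a simplified model written for this article, not Googlebot's actual code; it ignores wildcards, and the is_allowed function is hypothetical. It does not use a library, since parsers differ in how they order Allow and Disallow.

```python
def is_allowed(rules, path):
    """Decide access for path using the longest matching rule.

    rules is a list of (directive, prefix) pairs, for example
    ("disallow", "/photos"). If two matching rules have the same
    length, the "allow" one wins.
    """
    best = None  # (length of matching prefix, allowed?)
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            candidate = (len(prefix), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    # No matching rule means the path is not restricted.
    return True if best is None else best[1]


rules = [
    ("disallow", "/photos"),
    ("allow", "/photos/myphoto.jpg"),
]

one_photo = is_allowed(rules, "/photos/myphoto.jpg")
other_photo = is_allowed(rules, "/photos/holiday.jpg")
home_page = is_allowed(rules, "/index.html")
```

Here one_photo and home_page are True, while other_photo is False: the longer Allow rule wins for myphoto.jpg only.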

Important Note

Utmost care must be taken when writing a robots.txt file, because an incorrect file can block the bots that index your website. There is also a robots.txt tool that allows you to experiment a little, letting you know if there are any problems with your file prior to putting it online. If you are using a Google sitemap as part of Google's webmaster tools, you can log in and see if Google is having any issues crawling your site.
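A simple check of your own can also be run before uploading the file. The Python sketch below uses the standard urllib.robotparser module to list which of your important pages a draft file would block; the blocked_urls function, the draft rules, and the addresses are assumptions for illustration, and it does not replace the tools mentioned above.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, user_agent, urls):
    """Return the urls that user_agent may NOT fetch under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical draft with a common mistake: "/" disallows the whole site.
draft = """\
User-agent: *
Disallow: /
"""

# Pages you definitely want search engines to reach.
must_stay_visible = [
    "http://www.example.com/",
    "http://www.example.com/products.html",
]

problems = blocked_urls(draft, "Googlebot", must_stay_visible)
```

An empty problems list would mean the draft leaves those pages reachable; with this draft both pages are listed, which signals that the file should be fixed before it goes online.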

