Robots.txt Syntax Basics
Robots.txt is a text file placed in the site's root directory that tells search engine crawlers and spiders which pages and files on your website you do or do not want them to visit. Site owners usually want to be noticed by search engines, but there are cases where this is not desirable: for example, when a section holds sensitive data, or when you want to save bandwidth by keeping crawlers away from heavy pages full of images.
Google's official position on the robots.txt file:
When a crawler accesses a site, a file called '/robots.txt' is requested first. If such a file is found, the crawler checks it for website indexing instructions.
NOTE: There can be only one robots.txt file per website. A robots.txt file for an additional domain must be placed in the document root of that domain.
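To see this check in action, here is a minimal sketch using Python's standard urllib.robotparser module; the domain and paths below are placeholders, not taken from any real site.

from urllib import robotparser

# Download and parse the site's robots.txt, the way a well-behaved crawler would.
rp = robotparser.RobotFileParser()
rp.set_url("http://www.domain.com/robots.txt")  # placeholder domain
rp.read()

# Ask whether a given crawler may fetch a given URL.
print(rp.can_fetch("Googlebot", "http://www.domain.com/tmp/page.html"))
print(rp.can_fetch("*", "http://www.domain.com/index.html"))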
A robots.txt file consists of records with two fields: a User-agent line naming a search engine crawler, and one or more lines beginning with the directive
Disallow:
Robots.txt has to be created in the UNIX text format.
Robots.txt Syntax Basics
Typically, a robots.txt file contains something like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~different/
In this example, three directories, '/cgi-bin/', '/tmp/' and '/~different/', are excluded from indexing.
NOTE: Each directory is written on a separate line. You cannot write 'Disallow: /cgi-bin/ /tmp/' on one line, nor can you split a Disallow or User-agent directive across multiple lines: use a new line to separate directives from each other.
"Star" (*) in the User-agent field means "any web crawler." Consequently, directives such as 'Do not allow: * .gif' or 'User-agent: Mozilla *' are not supported; Pay attention to logical errors, as they are the most common. Other common errors are typos: misspelled directories, user agents, missing periods after user agent and reject, etc. When your robots.txt files become more and more complicated, and it is easy for an error to be introduced, there are some validations.
Examples of use
Here are some useful examples of robots.txt usage:
Prevent indexing of the entire site by all web crawlers:
User-agent: *
Disallow: /
Allow all web crawlers to index the entire site:
User-agent: *
Disallow:
Prevent multiple directories from being indexed:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Prevent site indexing by a specific web crawler:
User-agent: Googlebot
Disallow: /
You can find lists of crawler user-agent names online.
Allow indexing by a specific web crawler and prevent indexing by all others:
User-agent: Opera 9
Disallow:
User-agent: *
Disallow: /
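As a quick check that these rules really admit one crawler and block the rest, here is a small sketch with Python's urllib.robotparser; the URL is a placeholder.

from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: Opera 9",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Opera 9", "http://www.domain.com/page.html"))       # True
print(rp.can_fetch("SomeOtherBot", "http://www.domain.com/page.html"))  # False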
Prevent all files from being indexed except a single one.
This is quite awkward, since the original standard has no 'Allow' directive. Instead, you can move all the files you want to hide into a subdirectory and disallow that directory, leaving the single file that should be indexed one level above it:
User-agent: *
Disallow: /docs/
You can also use an online robots.txt file generator.
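If you prefer to script it yourself, generating the file is straightforward; the sketch below is only an illustration, and the function name and example rule mapping are made up for this article.

def build_robots_txt(rules):
    """Build robots.txt text from a mapping of user agent -> disallowed paths."""
    blocks = []
    for agent, paths in rules.items():
        lines = ["User-agent: %s" % agent]
        # An empty list of paths means "allow everything" (empty Disallow).
        lines += ["Disallow: %s" % p for p in paths] or ["Disallow:"]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"

print(build_robots_txt({"*": ["/cgi-bin/", "/tmp/"], "Googlebot": []}))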
Robots.txt and SEO
Removing image exclusion
The default robots.txt file in some CMS versions excludes your images folder. Newer CMS versions no longer do this, but older installations should be checked.
This exclusion means that your images will not be indexed or included in Google Image Search, even though having them there is something you would normally want, since it can improve your SEO rankings.
If you want to change this, open your robots.txt file and delete the line that says:
Disallow: /images/
Add reference to your sitemap.xml file
If you have a sitemap.xml file (and you should, as it helps your SEO rankings), it is a good idea to include the following line in your robots.txt file (update it with your own domain name and sitemap file name):
Sitemap: http://www.domain.com/sitemap.xml
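If you want to confirm that crawling libraries pick this line up, Python's urllib.robotparser (Python 3.8+) exposes the declared sitemaps; the domain below is a placeholder.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Sitemap: http://www.domain.com/sitemap.xml",
])

print(rp.site_maps())  # ['http://www.domain.com/sitemap.xml']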
Miscellaneous observations
· Don't block CSS, JavaScript and other resource files by default. This prevents Googlebot from correctly rendering the page and understanding that your site is mobile-friendly.
· You can also use the file to prevent specific pages from being indexed, such as login or 404 pages, but this is best done using the robots meta tag.
· Adding disallow statements to a robots.txt file does not delete the content. It simply blocks access for spiders. If there is content you want removed, it is better to use a meta noindex tag.
· As a general rule, the robots.txt file should never be used to handle duplicate content. There are better ways, such as a rel=canonical tag placed in the HTML <head> of a web page.
· Always keep in mind that robots.txt is a blunt instrument. There are often other tools at your disposal that can do a better job, such as the parameter handling tools within Google and Bing Webmaster Tools, the x-robots-tag header, and the meta robots tag.
Robots.txt for WordPress
WordPress creates a virtual robots.txt file once you publish your first post. However, if you already have a real robots.txt file on your server, WordPress won't add a virtual one.
The virtual file does not exist on the server as a physical file; you can only access it through the following link: http://www.yoursite.com/robots.txt
By default, it allows the Google Mediabot, disallows a number of known spambots, and disallows some standard WordPress folders and files.
So, in case you haven't created an actual robots.txt file yet, create one with any text editor and upload it to the root directory of your server via FTP.
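If you prefer to script the upload, a minimal sketch with Python's ftplib could look like this; the host, credentials and local file name are placeholders.

from ftplib import FTP

with FTP("ftp.yoursite.com") as ftp:           # placeholder host
    ftp.login("username", "password")          # placeholder credentials
    with open("robots.txt", "rb") as f:
        # Upload the file into the server's root (current) directory.
        ftp.storbinary("STOR robots.txt", f)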
Blocking WordPress Core Directories
Every WordPress installation has three standard directories that do not need to be indexed: wp-content, wp-admin and wp-includes.
However, don't disallow the entire wp-content folder, as it contains an 'uploads' subfolder with your site's media files, which you don't want blocked. That is why you should proceed as follows:
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
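A quick sanity check with Python's urllib.robotparser shows that these rules block the core folders while leaving the uploads folder crawlable; the domain and file names are placeholders.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /wp-admin/",
    "Disallow: /wp-includes/",
    "Disallow: /wp-content/plugins/",
    "Disallow: /wp-content/themes/",
])

base = "http://www.yoursite.com"  # placeholder domain
print(rp.can_fetch("*", base + "/wp-admin/admin.php"))            # False
print(rp.can_fetch("*", base + "/wp-content/uploads/photo.jpg"))  # True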
Block based on your site structure
Each blog can be structured in several ways:
a) Based on categories
b) Based on tags
c) Based on both, or on neither
d) Based on date-based archives
a) If your site is structured by categories, you do not need to have the tag archives indexed. Find your tag base on the Permalinks options page in the Settings menu. If the field is left blank, the tag base is simply 'tag':
Disallow: /tag/
b) If your site is structured by tags, you should block the category archives. Find your category base and use the following directive:
Disallow: /category/
c) If you use both categories and tags, you don't need any directives for them. If you use neither, you should block both:
Disallow: /tag/
Disallow: /category/
d) If your site is structured with date-based archives, you can block them as follows:
Disallow: /2022/
NOTE: You cannot use 'Disallow: /20*/': such a directive would block every blog post or page whose URL starts with the number '20'.
Duplicate content problems in WordPress
By default, WordPress produces duplicate pages that don't do your SEO rankings any good. To fix this, we recommend that you do not use robots.txt, but rather a more subtle tool: the 'rel=canonical' tag, which you place in the <head> section of a page to point to its single correct canonical URL. This way, web crawlers will only index the canonical version of a page.