The Beginners’ Guide on Robots.txt File

Generally, you want your entire site to be indexed by search engines for the sake of healthy indexing and good search engine rankings. But sometimes, you may want a portion of your content to be ignored by search spiders. In this case, you need a robots.txt file.

Note that when a search robot visits a website, it first checks http://www.example.com/robots.txt to see whether any content has been disallowed from indexing. You therefore need to make sure that your website has this special file and that it contains no errors.

That is why we have put together a guide on robots.txt. Read on to learn the basics of this file.

What is Robots.txt?

The robots.txt is a specific file that search engine crawlers look for when they come to your site. It acts as a request for certain robots to ignore specific directories or files on your site. It is a simple plain-text file with no HTML, and it is placed in the root directory of your site.

The file keeps crawlers out of your private folders and reduces how often they access the less important content on your site. This way, you improve the chances of your best content ranking high. Restricting robots to specific content on your website also saves bandwidth in the long run.

In addition, this special file can tell robots the exact location of your Sitemap, helping them discover and index your site more efficiently. To do this, you simply need to add a directive to the robots.txt file as follows.

Sitemap: https://phpmatters.com/sitemap_index.xml

The Structure of Robots.txt

The structure of the robots.txt file is simple. It mainly contains user agents and disallowed directories or files. Its basic structure looks like the following.

User-agent: *
Sitemap: https://phpmatters.com/sitemap_index.xml
Disallow: /wp-admin/

The wildcard (*) means that the rule applies to all robots. In the above example, all robots may crawl everything on your site except the /wp-admin/ directory, and the sitemap can be found at https://phpmatters.com/sitemap_index.xml.

The “User-agent” line specifies which search engine crawlers the rules apply to, and the “Disallow” lines list the content that should not be crawled. Besides these two directives, you can have comment lines, each starting with a # sign, as in the example below.
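For instance, a commented rule might look like the following (the /wp-admin/ path here is only an illustration):

# Keep crawlers out of the admin area
User-agent: *
Disallow: /wp-admin/

Crawlers that follow the robots.txt standard ignore everything from the # sign to the end of the line.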

What To Do With Robots.txt For The Sake Of SEO

When it comes to SEO, you should allow robots to crawl all the content on your site, provided you know that everything on your site is good enough to be viewed and indexed. Use the code below.

User-agent: *
Sitemap: https://phpmatters.com/sitemap_index.xml
Disallow: 

You can also disallow the search engine spiders from crawling some low-quality content on your site, or keep certain private content away from the search engines. You can use the following code.

User-agent: *
Sitemap: https://phpmatters.com/sitemap_index.xml
Disallow: /wp-admin/
Disallow: /instruction.txt

Some of the files that you may not want to be indexed include non-public files, scripts, utilities, and duplicate content. Disallowing these files lets the search engine spiders concentrate on the more important files on your website, which can result in better rankings and more traffic.

Things You Should Avoid Doing With Robots.txt

Errors in the robots.txt file may result in some or all files on your website not being indexed. There are a few things you should avoid to make sure this does not happen.

1. You should avoid adding comments to the robots.txt file. After all, you don’t know how the different search engines will interpret them.

2. You should avoid putting any text in front of a directive at the start of a line. For instance, you need to avoid writing:

{text} User-agent: *
{text} Disallow: /wp-admin/
{text} Disallow: /instruction.txt

Instead, write:

User-agent: *
Disallow: /wp-admin/
Disallow: /instruction.txt

3. You need to keep the commands in the right order. Starting with Disallow instead of User-agent can keep the robots.txt file from working correctly, as shown below.
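The following order, for instance, may be ignored by some crawlers because the Disallow line appears before any User-agent line:

Disallow: /wp-admin/
User-agent: *

Keep the User-agent line first instead:

User-agent: *
Disallow: /wp-admin/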

4. Don’t list more than one directory or file on a single Disallow line, such as:

User-agent: *
Disallow: /wp-admin/, /wp-includes/, /others/

Rather, have the commands as below.

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /others/

Note that file and directory names on the web server are case sensitive, so you must use the correct case for the directories in your rules.
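For example, assuming the directory on your server is named /wp-admin/ in lowercase, a rule written in a different case may not match it:

Disallow: /WP-Admin/

Write the rule with exactly the same case as the directory name:

Disallow: /wp-admin/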

5. Avoid relying on the “Allow” directive, as it is not part of the original robots.txt standard and not every crawler supports it. The following example may therefore not work as expected.

User-agent: *
Allow: /wp-content/

6. If you want web crawlers to ignore all the files in a directory, do not list them one by one as below.

User-agent: *
Disallow: /support/1.html
Disallow: /support/2.html
Disallow: /support/3.html

Instead, use a single command as shown in the following.

User-agent: *
Disallow: /support/