Locate A Sitemap In A Robots.txt File

If you are a webmaster or a website developer, you will want your site to be seen in search results. To be shown in search results, your website and its various web pages need to be crawled and indexed by search engine bots (robots).



Two files on your website help these bots find what they need. They are:
  1. Robots.txt
  2. Sitemap

Robots.txt and Sitemap

Robots.txt is a simple text file that is placed in your site’s root directory. It tells search engine robots what to crawl and what not to crawl on your site. It also contains directives that describe which search engine robots are allowed to crawl and which are not.

Search bots usually look for the robots.txt file as soon as they arrive at a website, so it is important to have one in the first place. Even if you want all search robots to crawl every page on your site, you still need a default robots.txt that allows this. Please read our beginner’s guide on robots.txt if you want to learn more.

Robots.txt also contains one more important piece of information: the location of your sitemaps. In this post, we are going to elaborate on this very feature of robots.txt. But before that, let’s see what a sitemap is and why it is important.

A sitemap is an XML file that contains a list of all webpages on your site. It may also contain additional information about each URL in the form of metadata. And just like robots.txt, a sitemap is a must-have. It helps search engine bots explore, crawl and index all the webpages on a site.
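To make the format concrete, here is a minimal sketch in Python that builds such a file with the standard library. The function name `build_sitemap` and the sample dates are our own illustrative choices; the `urlset`, `url`, `loc` and `lastmod` element names and the namespace come from the protocol published at Sitemaps.org.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol at Sitemaps.org.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML for an iterable of (url, last_modified) pairs.

    last_modified is optional metadata; pass None to leave it out.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page_url, last_modified in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page_url
        if last_modified:
            ET.SubElement(url, "lastmod").text = last_modified
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages and dates, for illustration only.
sitemap_xml = build_sitemap([
    ("http://www.example.com/", "2024-01-15"),
    ("http://www.example.com/about.html", None),
])
print(sitemap_xml)
```

Save the output as sitemap.xml, preceded by an XML declaration line, and you have a file a search bot can read.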

Learn some more basics of XML sitemaps from one of our previous posts.

How Are Robots.txt And Sitemaps Related?

Back in 2006, Yahoo, Microsoft and Google united to support a standardized protocol for submitting a site’s pages to search engines via sitemaps. You were required to submit your sitemaps through Google Webmaster Tools, Bing Webmaster Tools and Yahoo, while some other search engines, such as DuckDuckGo, use results from Bing/Yahoo.

About six months later, in April 2007, they joined in support of a system for finding the sitemap via robots.txt, called autodiscovery of sitemaps. This meant that it was OK even if you did not submit the sitemap to individual search engines: they would find the sitemap location from your site’s robots.txt file first. (NOTE: Sitemap submission is, however, still available on most search engines that allow URL submissions.)

And hence, the robots.txt file became even more significant for webmasters, because with it they can easily pave the way for search engine robots to discover all the pages on their website.

How To Create A Robots.txt File With Sitemap Location?

Here are three simple steps to create a robots.txt file with sitemap location:

Step #1: Locate Your Sitemap URL

If your website has been developed by a third-party developer, you need to first check if they provided your site with a sitemap. The URL to the sitemap of your site usually looks like this: http://www.example.com/sitemap.xml

So type this URL in your browser with your domain in place of ‘example’.

You can also locate your sitemap via Google search by using search operators, as shown in the examples below:

site:example.com filetype:xml

OR

filetype:xml site:example.com inurl:sitemap

But this will only work if your site has already been crawled and indexed by Google.

If you do not find a sitemap on your website, you can create one yourself using this XML Sitemap generator or follow the protocol explained at Sitemaps.org.

Step #2: Locate Your Robots.txt File

You can check whether your site has a robots.txt file by typing domain.com/robots.txt into your browser’s address bar, with your own domain in place of ‘domain.com’.

If you do not have a robots.txt file, you will have to create one and add it to the top-level directory (root directory) of your web server. You will need access to your web server to do this. Usually, the file goes in the same place as your site’s main “index.html”. The location of these files depends on the kind of web server software you have. Ask a web developer for help if you are not familiar with these files.

Just remember to name the file robots.txt, in all lower case. Do not use Robots.TXT or Robots.Txt as your filename.
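Since the file always lives at the root of the site under the same lower-case name, its address can be worked out from any page URL. Here is a small sketch of that idea; `robots_txt_url` is a helper name we made up for illustration, built on Python’s standard `urllib.parse` module.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, drop the page path, query string and
    # fragment, and point at the lower-case file name in the root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/blog/post.html?page=2"))
# prints http://www.example.com/robots.txt
```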

Step #3: Add Sitemap Location To Robots.txt File

Now, open up robots.txt at the root of your site. Again, you need access to your web server to do so. So, ask a web developer to do it for you if you do not know how to locate and open your site’s robots.txt file.

To facilitate auto-discovery of your sitemap file through your robots.txt, all you have to do is place a directive with the URL in your robots.txt, as shown in the sample below:

Sitemap: http://www.example.com/sitemap.xml

So, the robots.txt file looks like this:

Sitemap: http://www.example.com/sitemap.xml
User-agent:*
Disallow:


NOTE: The directive containing the sitemap location can be placed anywhere in the robots.txt file. It is independent of the user-agent line, so it does not matter where it is placed.
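If you would like to see this for yourself, the sketch below reads the Sitemap lines out of two robots.txt variants with Python’s built-in `urllib.robotparser` module (its `site_maps()` method needs Python 3.8 or later). The helper name `sitemaps_in` and the two sample files are ours, for illustration; other parsers may behave differently from Python’s.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_in(robots_txt):
    """Return the sitemap URLs declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


# The same directive, once at the top and once at the bottom of the file.
sitemap_first = """Sitemap: http://www.example.com/sitemap.xml
User-agent:*
Disallow:
"""

sitemap_last = """User-agent:*
Disallow:

Sitemap: http://www.example.com/sitemap.xml
"""

print(sitemaps_in(sitemap_first))
print(sitemaps_in(sitemap_last))
```

Both calls return the same one-item list, which is what you would expect from a directive that is independent of the user-agent line.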

What If You Have Multiple Sitemaps?

Every sitemap can contain no more than 50,000 URLs. So, for a larger site with many URLs, you can create multiple sitemap files. You must list the locations of these sitemap files in a sitemap index file. The XML format of the sitemap index file is similar to that of the sitemap file; in effect, it is a sitemap of sitemaps.
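As a rough sketch of how that could work, the Python below splits a long list of URLs into groups of at most 50,000 and builds an index file that points to the sitemap files. The function names and the sitemap file URLs are hypothetical; the `sitemapindex`, `sitemap` and `loc` element names follow the protocol at Sitemaps.org.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000  # limit per sitemap file


def split_into_sitemaps(urls, limit=MAX_URLS_PER_SITEMAP):
    """Split a list of page URLs into chunks that each fit in one sitemap."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]


def build_sitemap_index(sitemap_urls):
    """Return sitemap index XML listing the given sitemap file URLs."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for sitemap_url in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = sitemap_url
    return ET.tostring(index, encoding="unicode")


# A made-up site with 120,000 pages needs three sitemap files.
pages = [f"http://www.example.com/page{n}.html" for n in range(120_000)]
chunks = split_into_sitemaps(pages)
print([len(chunk) for chunk in chunks])  # prints [50000, 50000, 20000]

# Hypothetical file names, one per chunk.
index_xml = build_sitemap_index(
    f"http://www.example.com/sitemap{n}.xml" for n in range(1, len(chunks) + 1)
)
```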

When you have multiple sitemaps, you can either specify your sitemap index file URL in your robots.txt file as shown in the example below:

Sitemap: http://www.example.com/sitemap_index.xml
User-agent:*
Disallow:


Or, you can specify individual URLs of your multiple sitemap files, as shown in the example below:

Sitemap: http://www.example.com/sitemap_host1.xml
Sitemap: http://www.example.com/sitemap_host2.xml
User-agent:*
Disallow:


Finally, there is one thing you need to pay attention to when adding the Sitemap directive to the robots.txt file.

Generally, it is advised to add the ‘Sitemap’ directive along with the sitemap URL anywhere in the robots.txt file. But in some cases this has been known to cause parsing errors. You can check Google Webmaster Tools for any such errors about a week after you have updated your robots.txt file with your sitemap location.

To avoid this error, it is recommended that you leave a blank line after the sitemap URL.
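If you update robots.txt with a script rather than by hand, that advice could look something like the sketch below. It is only one way to do it: `add_sitemaps` is a helper name we invented, and it assumes you want the Sitemap lines at the top of the file, followed by a blank line, with sitemaps that are already listed left alone.

```python
def add_sitemaps(robots_txt, sitemap_urls):
    """Return robots.txt text with Sitemap lines added at the top.

    URLs that already appear in a Sitemap line are skipped, and a blank
    line separates the new lines from the rest of the file.
    """
    existing = {
        line.split(":", 1)[1].strip()
        for line in robots_txt.splitlines()
        if line.strip().lower().startswith("sitemap:")
    }
    new_lines = [f"Sitemap: {url}" for url in sitemap_urls if url not in existing]
    if not new_lines:
        return robots_txt
    return "\n".join(new_lines) + "\n\n" + robots_txt


original = "User-agent:*\nDisallow:\n"
updated = add_sitemaps(original, ["http://www.example.com/sitemap.xml"])
print(updated)
```

Running it a second time on the updated text changes nothing, so it is safe to repeat.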

I hope it is now clear how to create a robots.txt file with a sitemap location. Do it; it will help your website!




