How do I generate a robots.txt file?

How we create the perfect robots.txt for the SEO of our Joomla website

Why is the robots.txt file so important for our website?

Good search engine rankings are crucial for the success of our website, yet technical SEO is often neglected. One important SEO measure is to improve crawlability with the help of the instructions in the robots.txt. Please note, however, that the robots.txt cannot protect our website from access by web crawlers or other people. The robots.txt can only control how our website appears in a search engine's results.

What is the robots.txt file?

The robots.txt file is a text document that provides instructions for web crawlers to restrict access to all or part of our website. The robots.txt is located in the root directory of our website, i.e. the topmost directory of a host.

Are the instructions in robots.txt binding for search engine crawlers?

The Robots Exclusion Standard Protocol is not an official standard, so search engine crawlers are not forced to adhere to the rules in the robots.txt. Most search engine crawlers, however, adhere to the Robots Exclusion Standard Protocol, since search engines are usually not interested in the content of directories and files that have no added value for their users. (Example: private vacation photos, content of log files, etc.)

What is the robots.txt used for?

With the rules and instructions in the robots.txt, we determine which content is not worth indexing for search engines. This makes better use of the crawl budget. The crawl budget is the time that the web crawler has to search our Joomla website and save it in the index. The higher our website ranking, the more crawl budget is available for indexing our website.

Where is the robots.txt located?

The robots.txt must always be in the root directory of the website. The robots.txt may only appear once on the domain or sub-domain and must be named exactly like this: robots.txt

How do we create the robots.txt?

We can create the robots.txt file with almost any text editor, provided it can create files in standard ASCII and UTF-8 formats. You should refrain from using word processing programs such as Word, Open Office, etc., as these usually save files in their own format and add unexpected characters. This can lead to errors in crawling.

What is the structure of a robots.txt?

The robots.txt is processed line by line by the web crawler. An instruction can therefore override a preceding instruction that it contradicts.

robots.txt example
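The original example was shown as an image; based on the description below, it presumably looked something like this sketch:

  User-agent: *
  Allow: /
  Disallow: /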

Here the web crawler is informed in the second line that everything is allowed to be indexed, but in the third line indexing is forbidden for the entire website. The last instruction remains valid and overwrites the instruction "Allow: /" from the second line.

Every robots.txt consists of one or more rules, which in turn contain one or more instructions. In the previous example we see a rule that specifies:

  • To whom does the rule apply (user agent)
  • Which directories or files the user agent is allowed to access
  • Which directories or files the user agent is not allowed to index


It is generally assumed that a web crawler is allowed to crawl all pages and directories that are not blocked by a Disallow: instruction in robots.txt.
When creating rules and the instructions they contain, pay particular attention to upper and lower case! Each instruction must be placed on its own line, otherwise the robots.txt will not work.

Which commands are available to us for creating the robots.txt?

# - The hash at the beginning of the line indicates a comment in the robots.txt; comments are ignored by web crawlers. If a comment spans several lines, a hash "#" must appear at the beginning of each new line.

User-agent: - with this instruction we determine which web crawler the following instructions are intended for. An asterisk "*", as shown in the example above, addresses all web crawlers, whereas the name of a specific web crawler addresses only that bot:

robots.txt example
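The image here presumably illustrated the two variants roughly as follows (the bot name "Googlebot" is just one example):

  # addresses all web crawlers
  User-agent: *

  # addresses only one specific bot, e.g. the Googlebot
  User-agent: Googlebot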

Disallow: - With this instruction in our robots.txt we can tell the web crawler which directories and/or files are not worth including in the search engine's index. A slash after the "Disallow:" means that the entire website should not be included in the search engine's index.

robots.txt example
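A minimal sketch of such an entry, blocking the entire website for all web crawlers:

  User-agent: *
  Disallow: /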

If we only want to exclude a certain directory from our Joomla website from the index, then we have to enter the following:

robots.txt example
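A sketch of what this could look like; the directory name "/directory/" is only a placeholder:

  User-agent: *
  Disallow: /directory/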

It is important here that the slash is also included at the end of this instruction, otherwise other parts of the website will be unintentionally excluded from indexing:
"Disallow: /directory" would also exclude the URL www.myDomain.com/directory.html from being indexed by a search engine, which we might not even want!

Allow: - With this instruction in our robots.txt we can tell the web crawler which directories and/or files we believe are worth including in the search engine's index. However, this instruction is no guarantee that the search engine will actually save this directory or file in the index. A slash after the "Allow:" means that the entire website should be included in the search engine's index.

robots.txt example
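A minimal sketch, releasing the entire website for all web crawlers:

  User-agent: *
  Allow: /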

If we only want to enable a certain directory from our Joomla website for indexing, then we have to enter the following:

robots.txt example
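A sketch with a placeholder directory name: everything is blocked first, then the one directory is released again:

  User-agent: *
  Disallow: /
  Allow: /directory/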

Sitemap: - Optionally, we can also enter the sitemap from our Joomla website. I actually recommend doing this. However, there are a few small things to consider. It must be a fully qualified URL, which means that the search engine crawler does not check different variants of the URL. If our Joomla website can be reached via "https", then the URL to our sitemap in the robots.txt must also begin with "https". The same applies to websites with and without "www". In my case, the statement should look like this:

robots.txt example
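The actual URL from the original image is not reproduced here; with a placeholder domain, the entry could look like this:

  Sitemap: https://www.myDomain.com/sitemap.xml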

Examples that illustrate how to use the robots.txt and show what is possible!

The simplest robots.txt allows all web crawlers to access everything

robots.txt example
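Based on the description below, the example presumably looked roughly like this (the domain in the comment is a placeholder):

  # robots.txt for https://www.myDomain.com
  User-agent: *
  Disallow: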

In the first line we see a comment that states which website this robots.txt was created for. All web crawlers are addressed by the asterisk (*) after "User-agent:". Since there is no slash after the "Disallow:", everything is released for indexing. The "Allow:" instruction is mostly only used when we want to create exceptions - an example of this is given below.

Grant the Googlebot limited access, but deny all other web crawlers any access

robots.txt example
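A sketch of such a robots.txt; the blocked directories are only an assumption based on typical Joomla core directories:

  User-agent: Googlebot
  Allow: /
  Disallow: /administrator/
  Disallow: /cache/
  Disallow: /includes/
  Disallow: /logs/
  Disallow: /tmp/

  # all other web crawlers are denied any access
  User-agent: *
  Disallow: /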

A single user agent is defined here, namely the "Googlebot", which is granted access to the entire website in the second line. However, this access is restricted again from the third line onwards, since the directories listed there on a Joomla website should not find their way into the index of a search engine. Please note that every directory listed in the robots.txt has a slash "/" at the beginning and at the end.

Block a directory in the robots.txt, but use wildcards to release files with a certain file extension for indexing by web crawlers

robots.txt example
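A sketch of this rule, assuming the Joomla image directory is "/images/":

  User-agent: Googlebot-Image
  Disallow: /images/
  Allow: /images/*.jpg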

In this example from a robots.txt we see that the directory "images" has been blocked for the "Googlebot-Image", which is responsible for indexing images. However, in the next line an exception was created for images with the file extension "jpg". This means that all images with the file extension "jpg" can be saved in the index, while no approval for indexing is given for any other files that may be in this directory.
The asterisk in front of ".jpg" is called a wildcard - a placeholder. This saves us from listing every single file in the robots.txt.

Block a single file from a specific directory in the robots.txt

robots.txt example
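A sketch of this rule; the directory "/images/" is only an assumption:

  User-agent: Googlebot-Image
  Disallow: /images/familie.jpg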

Here again the "Googlebot-Image" is addressed, which is not allowed to index the "familie.jpg" file.

Use a wildcard in the robots.txt to block similarly named directories from being indexed by web crawlers

robots.txt example
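A sketch of such a wildcard entry:

  User-agent: *
  Disallow: /folder*/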

This entry in the robots.txt excludes all directories that begin with "folder" from indexing by search engines, for example "folder1", "folder2", "folderXY".

Exclude URLs with a certain ending in the robots.txt from indexing by search engines

robots.txt example
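A sketch of this rule as described below:

  User-agent: Googlebot
  Disallow: /*.asp$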

In this case we use our robots.txt to block all URLs ending in ".asp" from being indexed by the Googlebot. ".asp" stands for "Active Server Pages". The asterisk in front of ".asp" is again a wildcard and stands for the beginning of every URL or file in question. The $ sign marks the end of the URL or file name.

Exclude URLs containing a certain character in the robots.txt from indexing by search engines

robots.txt example
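Assuming the character in question is the question mark "?" (as in the sample URL below), the entry could look like this:

  User-agent: *
  Disallow: /*?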

This means that URLs that contain this character are excluded from indexing. An example would be
https://www.meineDomain.com/?view=category&id=67, which can often be seen on Joomla websites or on content management systems (CMS) in general.

List of the most important web crawlers

Google web crawler:

Googlebot - Googlebot for desktop computers and smartphones, crawls everything
Googlebot-Image - Googlebot for images
Googlebot-Video - Googlebot for videos
Googlebot-News - Googlebot for news
AdsBot-Google - Checks the ad quality on pages optimized for desktop computers
Mediapartners-Google - AdSense
AdsBot-Google-Mobile-Apps - Checks the ad quality on pages optimized for Android apps; follows the robots rules of AdsBot-Google

All currently deployed Google user agents

Bing web crawler:

Bingbot - Bingbot for desktop computers and smartphones, crawls everything
MSNBot-Media - Bingbot for pictures and videos

All Bing user agents currently in use

Yahoo web crawler:

Slurp - Yahoobot for desktop computers and smartphones

How can we check our robots.txt?

Google offers a free service to check the robots.txt. It can be found in the Google Search Console under the menu item "Crawling" and then "robots.txt Tester". However, this is only displayed in the old version of the Google Search Console. Here is a direct link to the tester: robots.txt Tester from Google

Many SEO agencies also offer this service for free on their websites!

Now it's your turn!

Do you have suggestions or additions, have you found a mistake, or is this tutorial out of date? I look forward to your comment.

Thank you very much! If you recommend my article to others, I am grateful for your support!