Correct setting of robots.txt for Bitrix

Reading time: 7 minutes


Almost every project that comes to us for an audit or promotion has an incorrect robots.txt file, and often the file is missing altogether. This happens because, when creating the file, people follow their imagination rather than the rules. Let's look at how to compose this file properly so that search robots work with it effectively.

Why do you need to configure robots.txt?

Robots.txt is a file located in the root directory of the site that tells search engine robots which sections and pages of the site they can access and which they cannot.

Setting up robots.txt is an important part of working with search results; a properly configured robots.txt also makes crawling of the site more efficient. The absence of robots.txt will not stop search engines from crawling and indexing the site, but without this file you may run into two problems:

    The search robot will read the entire site, which will eat into the crawl budget. The crawl budget is the number of pages that a crawler can crawl over a certain period of time.

    Without a robots file, the search engine gets access to drafts, hidden pages and the hundreds of pages used to administer the CMS. It will index them, and by the time it reaches the pages that actually serve content to visitors, the crawl budget will have run out.

    The site's login page and other administrative resources can end up in the index, so an attacker can easily find them and carry out a DDoS attack or hack the site.

How search bots see a site with and without robots.txt:


Robots.txt syntax

Before we start to parse the syntax and tweak robots.txt, let's take a look at what a “perfect file” should look like:
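
The original article showed the "perfect file" as a screenshot, which is missing here. As a rough sketch, it boils down to something like this (the paths and domain are illustrative assumptions):

User-agent: *
Disallow: /bitrix/
Disallow: /search/
Disallow: /auth/
Disallow: /*?print=
Allow: /bitrix/components/
Allow: /bitrix/js/
Allow: /bitrix/templates/

Host: https://www.site.ru
Sitemap: https://www.site.ru/sitemap.xml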


But don't use it as is. Each site usually needs its own settings, since every site has its own structure and may run on a different CMS. Let's analyze each directive in order.

User-agent

User-agent defines the search robot that must follow the instructions described in the file. If you need to address all robots at once, the * symbol is used. You can also address a specific search robot, for example Yandex or Google:
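
For example (a sketch; each User-agent line starts its own block of rules):

User-agent: *         # rules for all robots
User-agent: Yandex    # rules for Yandex only
User-agent: Googlebot # rules for Google only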


Disallow

The Disallow directive tells the robot which files and folders must not be indexed. If you want your entire site to be open for indexing, leave the Disallow value blank. To hide all content on the site, put "/" after Disallow.

We can deny access to a specific folder, file or file extension. In our example, we address all search robots and close access to the bitrix folder, the search folder and the pdf extension.
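
The original illustration is missing here; a sketch of what it described (one possible way to write it):

User-agent: *
Disallow: /bitrix/
Disallow: /search/
Disallow: /*.pdf$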


Allow

Allow forcibly opens pages and sections of the site for indexing. In the example above, we address the Google search robot and close access to the bitrix folder, the search folder and the pdf extension, but inside the bitrix folder we forcibly open three folders for indexing: components, js and tools.
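
A sketch of the example described (the exact folder paths are assumptions):

User-agent: Googlebot
Disallow: /bitrix/
Disallow: /search/
Disallow: /*.pdf$
Allow: /bitrix/components/
Allow: /bitrix/js/
Allow: /bitrix/tools/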


Host - site mirror

A site mirror is a duplicate of the main site. Mirrors are used for a variety of purposes: address changes, security, reducing server load, etc.

Host is one of the most important rules. If this rule is specified, the robot understands which of the site's mirrors should be taken into account for indexing. Historically the directive was honored by Yandex and Mail.ru robots; Yandex dropped support for it in 2018 (see the end of this article), so today it matters only for Mail.ru, and other robots ignore it. Host is specified only once!

For the "https: //" and "http: //" protocols, the syntax in the robots.txt file will be different.

Sitemap - sitemap

A sitemap is a form of site navigation that is used to inform search engines about new pages. Using the sitemap directive, we "forcibly" show the robot where the map is located.
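
For example (a sketch):

Sitemap: https://www.site.ru/sitemap.xml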


Symbols in robots.txt

Symbols used in the file: "/", "*", "$", "#".
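
The illustration explaining the symbols is missing here; in short, a sketch of how each one is used:

/  # separates directories and pages, e.g. Disallow: /bitrix/
*  # any sequence of characters, e.g. Disallow: /*?print=
$  # end of line, e.g. Disallow: /search$ blocks /search but not /search/map.php
#  # comment: everything after it on the line is ignored by robots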


Functionality check after setting robots.txt

After you have placed robots.txt on your site, you need to add it and check it in Yandex.Webmaster and Google Search Console.

Yandex check:

  1. Follow this link.
  2. Select: Indexing Setup - Robots.txt Analysis.

Google check:

  1. Follow this link.
  2. Select: Crawl - Robots.txt File Checker Tool.

This way you can check your robots.txt for errors and adjust it if necessary. Also keep a few general rules in mind when composing the file (a short illustration follows this list):

  1. Directive names start with a capital letter (Disallow, not DISALLOW), while the file name itself must be lowercase: robots.txt.
  2. Each Disallow directive must specify only one file or directory.
  3. The "User-agent" line must not be empty.
  4. User-agent should always come before Disallow.
  5. Do not forget to add a trailing slash if you need to block a directory from indexing.
  6. Before uploading the file to the server, be sure to check it for syntax and spelling errors.
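
A short illustration of rules 2 and 5 (the paths are hypothetical):

# wrong: two paths in one directive, no trailing slash
Disallow: /cgi-bin /search
# correct: one path per directive, trailing slash for directories
Disallow: /cgi-bin/
Disallow: /search/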

I wish you success!

Video Review of 3 Methods for Creating and Configuring a Robots.txt File

Delivering a finished Bitrix site is only half the job. As a rule, the most interesting part starts after its first indexing by the Google and Yandex search robots, when a lot of information that users do not need can get into the search results: from "technical junk" to that photo from the New Year's corporate party.

You can blame a no-name SEO specialist or a would-be programmer, but all that was really needed was a correct robots.txt for Bitrix.

For reference: robots.txt is a file located at the root of the site and restricting search robots from accessing certain sections and pages of the site.

Robots.txt for corporate websites and business card websites

The favorite phrase of aspiring copywriters, "every project is different", fits our situation best. The only exceptions are the standard robots.txt directives: User-agent, Disallow, Host and Sitemap. Consider them the mandatory minimum.

Everything else you close or open is at your discretion. Even though Bitrix is a boxed solution, the directives of projects built on it can differ a lot from one another. The question is the structure and functionality of a particular site.

Let's imagine that you have a corporate site on Bitrix with a standard set of sections: “About the Company”, “Services”, “Projects”, “Contacts”, “News”. If the content on such a site is unique, then you need to work on closing the technical part of the project.

1. Close the /bitrix and /cgi-bin folders from indexing. This is purely technical information (CSS, templates, CAPTCHAs) that nobody needs, except whoever complains about it in the Google webmaster panel. You can safely close it. The pattern is: Disallow: /example/
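
In our case, point 1 translates into the following lines (a sketch, assuming a standard Bitrix folder layout):

Disallow: /bitrix/
Disallow: /cgi-bin/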

2. The /search folder is not interesting to either search engines or users. By closing it, you protect yourself from duplicate pages, duplicate tags and duplicate titles.
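
That is (a sketch):

Disallow: /search/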

3. When compiling robots.txt for Bitrix, people sometimes forget to close the site's authorization and PHP authentication forms. We are talking about:

/auth/
/auth.php

4. If your site can print any materials, be it area maps or invoices for payment, do not forget to close the following directories in the robots.txt file:

/*?print=
/*&print=

5. Bitrix carefully stores the entire history of your site: successful user registrations, records of successful password change and recovery. However, we doubt that it will be of interest to search robots.

/*register=yes
/*forgot_password=yes
/*change_password=yes
/*login=yes
/*logout=yes
/*auth=yes

6. Imagine you are browsing a photo album on the site: you open one, a second, a third photo, but on the fourth you decide to go back one step. Something like this will appear in the address bar: ?back_url_=%2Fbitrix%2F%2F. It is removed, again, by editing the robots.txt file in the root of the 1C-Bitrix CMS.

/*BACKURL=*
/*back_url=*
/*BACK_URL=*
/*back_url_admin=*

This way we protect both the open part of the site (visible to users) and the closed part (visible to Bitrix CMS administrators).

7. The /upload folder. Bitrix stores the site's pictures and videos in it. If the content is unique, you do not need to close the folder: indexed images and videos are an additional source of traffic. It is another matter when /upload contains confidential or non-unique content.
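
If you do decide to close it, the line would look like this (a sketch):

Disallow: /upload/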


Robots.txt on Bitrix for online stores

The basis is the same as for corporate sites, but with a few amendments.

1. Unlike a small company website, an online store usually has at least a hundred pages. Pagination pages, which are responsible for the transition of a user from one product card to another, clog up search engines. The more pages, the more "garbage".

/*?PAGEN

2. Prohibition of indexing actions of users and site administrators. Traces of filtering, comparing products, adding products to the cart should also be hidden from the eyes of the search robot.

/*?count
/*?action
/*?set_filter=*

3. Finally, there are UTM tags. You can close access to them as follows:

/*openstat=
/*utm_source=
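
Putting the store-specific additions from this section together (a sketch assembled from the paths above; adapt it to your catalog):

User-agent: *
Disallow: /*?PAGEN
Disallow: /*?count
Disallow: /*?action
Disallow: /*?set_filter=*
Disallow: /*openstat=
Disallow: /*utm_source=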

Many people face problems with incorrect indexing of their site by search engines. In this article, I will explain how to create the correct robots.txt for Bitrix to avoid indexing errors.

What is robots.txt and what is it for?

Robots.txt is a text file that contains site indexing parameters for search engine robots (according to the Yandex documentation).
Basically, it is needed to close from indexing those pages and files that search engines do not need to index and, therefore, should not add to the search results.

Usually these are technical files and pages: administration panels, user accounts, duplicate information such as your site's internal search, and so on.

Creating a basic robots.txt for Bitrix

A common beginner's mistake is to compile this file manually. You don't need to do this.
Bitrix already has a module responsible for the robots.txt file. It can be found on the page "Marketing -> Search Engine Optimization -> Setting robots.txt".
On this page there is a button for creating a basic set of rules for the Bitrix system. Use it to create all the standard rules:
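
The screenshot of that screen is missing here; the generated baseline usually looks roughly like this (a sketch only - the exact set of rules depends on the Bitrix version and settings):

User-agent: *
Disallow: */index.php
Disallow: /bitrix/
Disallow: /*show_include_exec_time=
Disallow: /*bitrix_include_areas=
Disallow: /*clear_cache=
Disallow: /*?print=
Allow: /bitrix/components/
Allow: /bitrix/cache/
Allow: /bitrix/js/
Allow: /bitrix/templates/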

After the sitemap is generated, the path to it will be automatically added to robots.txt.

After that, you already have a good basic set of rules. Then you should follow the recommendations of an SEO specialist and close the necessary pages (with the "Deny file/folder" button). These are usually search pages, personal account sections and the like.
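
For example, closing the typical sections mentioned above adds lines like these (a sketch):

Disallow: /search/
Disallow: /personal/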



1C-Bitrix is the most popular commercial engine. It is widely used by many studios, although it is not ideal. And when it comes to SEO optimization, you need to be extremely careful with it.

Correct robots.txt for 1C Bitrix

In recent versions, the CMS developers ship a default robots.txt that solves almost all problems with duplicate pages. If your version has not been updated, compare your file with the new robots.txt and add the missing rules.

You should also take a closer look at robots.txt if programmers are currently still working on your project.

User-agent: *
Disallow: /bitrix/
Disallow: /search/
Allow: /search/map.php
Disallow: /club/search/
Disallow: /club/group/search/
Disallow: /club/forum/search/
Disallow: /communication/forum/search/
Disallow: /communication/blog/search.php
Disallow: /club/gallery/tags/
Disallow: /examples/my-components/
Disallow: /examples/download/download_private/
Disallow: /auth/
Disallow: /auth.php
Disallow: /personal/
Disallow: /communication/forum/user/
Disallow: /e-store/paid/detail.php
Disallow: /e-store/affiliates/
Disallow: /club/$
Disallow: /club/messages/
Disallow: /club/log/
Disallow: /content/board/my/
Disallow: /content/links/my/
Disallow: /*/search/
Disallow: /*PAGE_NAME=search
Disallow: /*PAGE_NAME=user_post
Disallow: /*PAGE_NAME=detail_slide_show
Disallow: /*/slide_show/
Disallow: /*/gallery/*order=*
Disallow: /*?print=
Disallow: /*&print=
Disallow: /*register=yes
Disallow: /*forgot_password=yes
Disallow: /*change_password=yes
Disallow: /*login=yes
Disallow: /*logout=yes
Disallow: /*auth=yes
Disallow: /*action=ADD_TO_COMPARE_LIST
Disallow: /*action=DELETE_FROM_COMPARE_LIST
Disallow: /*action=ADD2BASKET
Disallow: /*action=BUY
Disallow: /*print_course=Y
Disallow: /*bitrix_*=
Disallow: /*backurl=*
Disallow: /*BACKURL=*
Disallow: /*back_url=*
Disallow: /*BACK_URL=*
Disallow: /*back_url_admin=*
Disallow: /*index.php$

Host: www.site.ru
Sitemap: http://www.site.ru/sitemap.xml

Initial SEO site optimization on 1C Bitrix

1C-Bitrix has an SEO module, which is already included in the "Start" edition. This module has extensive capabilities that cover the needs of SEO specialists during the initial optimization of a site.

Its capabilities:

  • overall link ranking;
  • citation;
  • number of links;
  • search words;
  • indexing by search engines.

SEO module + Web analytics

Search engine optimization tools by page:

  1. all information that the user needs to modify the page is presented;
  2. in the public part, basic information on the content of the page is displayed;
  3. special information about the page is displayed: frequency of indexing by search engines, queries that lead to this page, additional statistical information;
  4. a visual assessment of the effectiveness of the page is given;
  5. the ability to immediately call the necessary dialogs and make changes on the page.

Site Search Engine Optimization Tool:

  1. all information required to modify the site is displayed;
  2. basic information on the content of the site is displayed in its public part;
  3. for the entire site, the following is displayed: total link ranking, citation, number of links, search words, indexing by search engines;
  4. visual assessment of the effectiveness of the site;
  5. the ability to immediately call the necessary dialogs and make changes on the site.

1C-Bitrix: Marketplace

Bitrix also has its own Marketplace, which offers several modules for SEO optimization of a project. They duplicate each other's features, so choose by price and functionality.

Easy Meta Tag Management for SEO

Free

A module that allows you to add unique SEO data (title, description, keywords) to any page of the site, including catalog elements.

SEO tools

Paid

  • Management of the site's human-readable (SEF) URLs from one page.
  • The ability to override page titles and meta tags.
  • The ability to install redirects.
  • Testing OpenGraph Tags.
  • The last visit of a real Google or Yandex bot (a delayed check of the bot's validity by its IP address).
  • A list of transitions to your pages and search traffic.
  • Counting the number of likes on your pages via a third-party service.

SEO Tools: Meta Tag Management PRO

Paid

A tool for automatically generating meta tags title, description, keywords, as well as the H1 heading for ANY pages of the site.

  • use of rules and templates;
  • applying the rule based on targeting;
  • the ability to customize the project for ANY number of keys;
  • centralized management of meta tags on any projects;
  • operational control of the status of meta tags on any page of the project.

SEO specialist tools

Paid

The module allows you to:

  • Set meta tags (title, keywords, description).
  • Forcibly change the H1 (page heading) set by any component on the page.
  • Set the canonical URL attribute.
  • Install up to three SEO texts anywhere on the page with or without a visual editor.
  • Multi-site.
  • Edit all of the above both from the public side of the site and from the admin panel.
  • Install and use the module on the Bitrix edition "First Site".

ASEO editor optimizer

Paid

The module allows you to set unique SEO data (title, description, keywords) and change the content for HTML blocks on any page of the site that has its own URL, or for a specific URL template based on GET parameters.

SeoONE: comprehensive search engine optimization and analysis

Paid

  1. Setting "URL without parameters".
  2. Setting "META-data pages".
  3. "Static" - here you can easily set unique meta-data (keywords and description) for the page, as well as a unique browser title and page title (usually h1).
  4. "Dynamic" - this setting is the same as the previous one. The only difference is that it is created for dynamically generated pages (for example, for a product catalog).
  5. The "Address spoofing" setting allows you to specify a secondary URL for the page.
  6. The "Express Analysis" setting. On this page you can add an unlimited number of sites for analysis.

CNCizer (set the symbolic code)

Paid

The module automatically assigns symbolic codes (SEF slugs) to elements and sections of the site.

Linemedia: SEO blocks on the site

Paid

Provides a component that allows you to add multiple SEO text blocks to any page, set meta information about the page.

Link to sections and elements of infoblocks

Paid

Using this module, it becomes possible to add and edit links to infoblock elements/sections in the standard visual editor.

Web analytics in 1C-Bitrix: Yandex.Metrica and Google Analytics

There are several options for placing counters in cms:

Option 1. Place the counter code in bitrix/templates/{template name}/header.php after the opening <body> tag.

Option 2. Use a dedicated plugin for Yandex.Metrica.

Option 3. Bitrix has its own web analytics module. It will not let you build your own reports, do segmentation and so on, but for simply keeping track of statistics it is quite adequate.

Yandex.Webmaster and Google Search Console in 1C-Bitrix

Yes, there are built-in solutions to add a website to the Webmaster service (both in Google and Yandex), but we strongly recommend working with these services directly.

Because:

  • there you can see a lot more data;
  • you will be sure that the data is up-to-date (as much as possible) and not distorted;
  • if the service releases an update, you can immediately see and use it (if you work with the plugin, you will have to wait for updates).

If you are just creating a website and wondering whether 1C-Bitrix is suitable for promotion in search engines and whether there are any problems with it, there is no need to worry. The engine has long been the leader among paid CMSs on the market; all SEO specialists (not only in our studio) have dealt with Bitrix more than once and everyone has experience with it.

Promotion on 1C-Bitrix does not differ from promotion on other CMSs or self-written engines. The differences can only be seen in the optimization tools described above.

But keep in mind that the tools alone will not promote your site. Here we need specialists who will set them up correctly.

By the way, we have many instructional articles with plenty of practical advice drawn from years of practice. We have thought about setting up a thematic mailing list, but have not got around to it yet.

ROBOTS.TXT (robot exclusion standard) is a plain-text .txt file used to restrict robots' access to site content. The file must be located at the root of the site (at /robots.txt). Using the standard is optional, but search engines follow the rules in robots.txt. The file itself consists of a set of records of the form

field: value

where field is the name of the rule (User-agent, Disallow, Allow, etc.) and value is the value of that rule.

Records are separated by one or more blank lines (line terminators: CR, CR+LF or LF).

How to set up ROBOTS.TXT correctly?

This section contains the basic requirements for configuring the file, specific configuration recommendations, and examples for popular CMSs.

  • The file size must not exceed 32 kB.
  • ASCII or UTF-8 encoding must be used.
  • A correct robots.txt file must contain at least one rule consisting of several directives. Each rule must contain the following directives:
    • for which robot is this rule (User-agent directive)
    • which resources this agent may access (Allow directive) or may not access (Disallow directive).
  • Each rule and directive must start on a new line.
  • The value of the Disallow / Allow rule must begin with either the / or *.
  • All lines beginning with the # character, or parts of lines starting with this character are considered comments and are not taken into account by agents.

Thus, the minimum content of a properly configured robots.txt file looks like this:

User-agent: * # for all agents
Disallow: # nothing is prohibited = access to all files is allowed

How to create / modify ROBOTS.TXT?

You can create the file in any text editor (for example, Notepad++). To create or modify a robots.txt file you usually need FTP/SSH access to the server; however, many CMS/CMF systems have a built-in interface for managing the file's content through the administration panel ("admin panel"), for example Bitrix, ShopScript and others.

What is the ROBOTS.TXT file on the site for?

As you can see from the definition, robots.txt allows you to control the behavior of robots when visiting a site, i.e. adjust the indexing of the site by search engines - this makes this file an important part of your site's SEO optimization. The most important feature of robots.txt is the prohibition of indexing pages / files that do not contain useful information. Or the entire site in general, which may be necessary, for example, for test versions of the site.

The main examples of what needs to be closed from indexing will be discussed below.

What should be closed from indexing?

First, you should always block indexing of a site while it is under development, to avoid having pages end up in the index that will not exist on the finished site at all, as well as pages with missing, duplicated or test content before they are filled in.

Secondly, copies of the site created as test sites for development should be hidden from indexing.

Thirdly, let's go over what content on the live site itself should be blocked from indexing (a combined sketch follows the list).

  1. Administrative part of the site, service files.
  2. User authorization / registration pages, in most cases - personal sections of users (unless public access to personal pages is provided).
  3. Cart and checkout pages, order viewing.
  4. Product comparison pages, it is possible to selectively open such pages for indexing, provided they are unique. In general, comparison tables are countless pages with duplicate content.
  5. Search and filtering pages can be left open for indexing only if they are configured correctly: separate urls, filled in unique titles, meta tags. In most cases, these pages should be closed.
  6. Pages with sorting products / records, if they have different addresses.
  7. Pages with utm- and openstat- tags in the URL (as well as all other tagged URLs).
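
Putting the list together, a combined sketch might look like this (all paths are illustrative assumptions - adapt them to your CMS):

User-agent: *
Disallow: /admin/        # administrative part, service files
Disallow: /auth/         # authorization / registration
Disallow: /personal/     # personal sections
Disallow: /cart/         # cart and checkout
Disallow: /*compare      # product comparison
Disallow: /*?sort=       # sorting pages with separate addresses
Disallow: /*utm_         # utm tags
Disallow: /*openstat=    # openstat tags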

ROBOTS.TXT syntax

Now let's dwell on the robots.txt syntax in more detail.

General Provisions:

  • each directive must start on a new line;
  • the line must not start with a space;
  • the directive value must be on one line;
  • no need to enclose directive values ​​in quotes;
  • by default, a trailing * is implied at the end of every directive value. Example:

    User-agent: Yandex
    Disallow: /cgi-bin* # blocks access to pages starting with /cgi-bin
    Disallow: /cgi-bin # the same
  • an empty line feed is interpreted as the end of the User-agent rule;
  • only one value is indicated in the directives "Allow", "Disallow";
  • the name of the robots.txt file does not allow the presence of capital letters;
  • robots.txt larger than 32 Kb is not allowed, robots will not download such a file and consider the site to be fully allowed;
  • unavailable robots.txt can be interpreted as fully permissive;
  • an empty robots.txt is considered fully permissive;
  • use Punycode to specify Cyrillic values in rules (see the example after this list);
  • only UTF-8 and ASCII encodings are allowed: the use of any national alphabets and other characters in robots.txt is not allowed.
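
For example, a Cyrillic domain in the Sitemap directive has to be converted to Punycode (a sketch):

# сайт.рф becomes:
Sitemap: http://xn--80aswg.xn--p1ai/sitemap.xml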

Special symbols:

  • #

    The comment start symbol: all text after # and before the line feed is considered a comment and is not used by robots.

  • *

    A wildcard denoting a prefix, suffix, or the entire directive value - any set of characters (including an empty one).

  • $

    Indicates the end of the line and prevents the implied * from being appended to the value. Example:

    User-agent: * # for all
    Allow: /$ # enable indexing of the main page
    Disallow: * # disable indexing of all pages, except the one allowed above
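
    A combined sketch using all of the special symbols at once (illustrative only):

    User-agent: *        # this block applies to all robots
    Disallow: /*?print=  # any URL that contains ?print=
    Disallow: /search$   # exactly /search, but not /search/map.php
    Allow: /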

List of directives

  1. User-agent

    Mandatory directive. Determines which robot the rule belongs to; a rule can contain one or several such directives. You can use the * symbol to indicate a prefix, suffix, or full robot name. Example:

     # the site is closed for Google News and Google Images
     User-agent: Googlebot-Image
     User-agent: Googlebot-News
     Disallow: /

     # for all robots whose name begins with Yandex, close the "News" section
     User-agent: Yandex*
     Disallow: /news

     # open to all others
     User-agent: *
     Disallow:

  2. Disallow

    The directive specifies which files or directories cannot be indexed. The directive value must start with / or *. By default, the value is followed by *, unless prohibited by the $ symbol.
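
     For example (a sketch with hypothetical paths):

     Disallow: /admin/    # nothing under /admin/ is indexed
     Disallow: /*?print=  # any URL containing ?print=
     Disallow: /search$   # /search exactly; the $ cancels the implied trailing *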

  3. Allow

    Each rule must have at least one Disallow: or Allow: directive.

    The directive specifies which files or directories should be indexed. The directive value must start with / or *. By default, the value is followed by *, unless prohibited by the $ symbol.

    The use of the directive is relevant only in conjunction with Disallow to enable indexing of a subset of pages prohibited for indexing by the Disallow directive.
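
     For example (a sketch): the whole /bitrix/ directory is closed, but templates and scripts inside it stay open:

     Disallow: /bitrix/
     Allow: /bitrix/templates/
     Allow: /bitrix/js/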

  4. Clean-param

     Optional, intersectional directive (it can appear anywhere in the file). Use the Clean-param directive if site page URLs contain GET parameters (shown in the URL after the ? sign) that do not affect their content (for example, UTM tags). With the help of this rule, all addresses will be brought to a single form - the original one, without parameters.

    Directive syntax:

     Clean-param: p0[&p1&p2&..&pn] [path]

     p0..pn - names of the parameters that should not be taken into account
     path - path prefix of the pages to which the rule applies


    Example.

    the site has pages like

     www.example.com/some_dir/get_book.pl?ref=site_1&book_id=123
     www.example.com/some_dir/get_book.pl?ref=site_2&book_id=123
     www.example.com/some_dir/get_book.pl?ref=site_3&book_id=123

    When specifying a rule

     User-agent: Yandex
     Disallow:
     Clean-param: ref /some_dir/get_book.pl

    the robot will reduce all page addresses to one:

     www.example.com/some_dir/get_book.pl?book_id=123

  5. Sitemap

    An optional directive, it is possible to place several such directives in one file, cross-sectional (it is enough to specify in the file once, without duplicating for each agent).

    Example:

    Sitemap: https://example.com/sitemap.xml

  6. Crawl-delay

     The directive allows you to set the search robot a minimum period of time (in seconds) between the end of loading one page and the start of loading the next. Fractional values are supported.

     The maximum value taken into account by Yandex robots is 2.0.

    Google robots do not respect this directive.

    Example:

     User-agent: Yandex
     Crawl-delay: 2.0 # sets a timeout of 2 seconds

     User-agent: *
     Crawl-delay: 1.5 # sets a timeout of 1.5 seconds

  7. Host

     The directive indicates the main mirror of the site. At the moment, among popular search engines, only Mail.ru supports it.

    Example:

     User-agent: Mail.Ru
     Host: www.site.ru # main mirror with www

Robots.txt examples for popular CMS

ROBOTS.TXT for 1C: Bitrix

CMS Bitrix provides the ability to manage the content of the robots.txt file. To do this, in the administrative interface, go to the "Setting robots.txt" tool using the search, or via Marketing -> Search Engine Optimization -> Setting robots.txt. You can also change the content of robots.txt using the built-in Bitrix file editor or via FTP.

The example below can be used as a starter robots.txt set for Bitrix-based sites, but it is not universal and requires adaptation depending on the site.

Explanations:

  1. the breakdown into rules for different agents is due to the fact that Google does not support the Clean-param directive.
User-Agent: Yandex
Disallow: */index.php
Disallow: /bitrix/
Disallow: /*filter
Disallow: /*order
Disallow: /*show_include_exec_time=
Disallow: /*show_page_exec_time=
Disallow: /*show_sql_stat=
Disallow: /*bitrix_include_areas=
Disallow: /*clear_cache=
Disallow: /*clear_cache_session=
Disallow: /*ADD_TO_COMPARE_LIST
Disallow: /*ORDER_BY
Disallow: /*?print=
Disallow: /*&print=
Disallow: /*print_course=
Disallow: /*?action=
Disallow: /*&action=
Disallow: /*register=
Disallow: /*forgot_password=
Disallow: /*change_password=
Disallow: /*login=
Disallow: /*logout=
Disallow: /*auth=
Disallow: /*backurl=
Disallow: /*back_url=
Disallow: /*BACKURL=
Disallow: /*BACK_URL=
Disallow: /*back_url_admin=
Disallow: /*?utm_source=
Disallow: /*?bxajaxid=
Disallow: /*&bxajaxid=
Disallow: /*?view_result=
Disallow: /*&view_result=
Disallow: /*?PAGEN*&
Disallow: /*&PAGEN
Allow: */?PAGEN*
Allow: /bitrix/components/*/
Allow: /bitrix/cache/*/
Allow: /bitrix/js/*/
Allow: /bitrix/templates/*/
Allow: /bitrix/panel/*/
Allow: /bitrix/components/*/*/
Allow: /bitrix/cache/*/*/
Allow: /bitrix/js/*/*/
Allow: /bitrix/templates/*/*/
Allow: /bitrix/panel/*/*/
Allow: /bitrix/components/
Allow: /bitrix/cache/
Allow: /bitrix/js/
Allow: /bitrix/templates/
Allow: /bitrix/panel/
Clean-Param: PAGEN_1 /
Clean-Param: PAGEN_2 / # if the site has more components with pagination, duplicate the rule for all variants, changing the number
Clean-Param: sort
Clean-Param: utm_source&utm_medium&utm_campaign
Clean-Param: openstat

User-Agent: *
Disallow: */index.php
Disallow: /bitrix/
Disallow: /*filter
Disallow: /*sort
Disallow: /*order
Disallow: /*show_include_exec_time=
Disallow: /*show_page_exec_time=
Disallow: /*show_sql_stat=
Disallow: /*bitrix_include_areas=
Disallow: /*clear_cache=
Disallow: /*clear_cache_session=
Disallow: /*ADD_TO_COMPARE_LIST
Disallow: /*ORDER_BY
Disallow: /*?print=
Disallow: /*&print=
Disallow: /*print_course=
Disallow: /*?action=
Disallow: /*&action=
Disallow: /*register=
Disallow: /*forgot_password=
Disallow: /*change_password=
Disallow: /*login=
Disallow: /*logout=
Disallow: /*auth=
Disallow: /*backurl=
Disallow: /*back_url=
Disallow: /*BACKURL=
Disallow: /*BACK_URL=
Disallow: /*back_url_admin=
Disallow: /*?utm_source=
Disallow: /*?bxajaxid=
Disallow: /*&bxajaxid=
Disallow: /*?view_result=
Disallow: /*&view_result=
Disallow: /*utm_
Disallow: /*openstat=
Disallow: /*?PAGEN*&
Disallow: /*&PAGEN
Allow: */?PAGEN*
Allow: /bitrix/components/*/
Allow: /bitrix/cache/*/
Allow: /bitrix/js/*/
Allow: /bitrix/templates/*/
Allow: /bitrix/panel/*/
Allow: /bitrix/components/*/*/
Allow: /bitrix/cache/*/*/
Allow: /bitrix/js/*/*/
Allow: /bitrix/templates/*/*/
Allow: /bitrix/panel/*/*/
Allow: /bitrix/components/
Allow: /bitrix/cache/
Allow: /bitrix/js/
Allow: /bitrix/templates/
Allow: /bitrix/panel/

Sitemap: http://site.com/sitemap.xml # replace with your sitemap url

ROBOTS.TXT for WordPress

There is no built-in tool for setting up robots.txt in the WordPress admin area, so access to the file is only possible using FTP, or after installing a special plugin (for example, DL Robots.txt).

The example below can be used as a starter robots.txt for WordPress sites, but it is not universal and requires adaptation to the specific site.


Explanations:

  1. the Allow directives indicate the paths to the files of styles, scripts, pictures: for correct indexing of the site, they must be available to robots;
  2. For most sites, archive pages for posts and tags only create duplicate content and no useful content, so in this example they are closed from indexing. If such pages are necessary, useful and unique on your project, remove the Disallow: /tag/ and Disallow: /author/ directives.

An example of a correct ROBOTS.TXT for a WordPress site:

User-agent: Yandex # For Yandex
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: /xmlrpc.php
Disallow: /tag/
Disallow: /readme.html
Disallow: *?replytocom
Allow: */uploads
Allow: /*/*.js
Allow: /*/*.css
Allow: /wp-*.png
Allow: /wp-*.jpg
Allow: /wp-*.jpeg
Allow: /wp-*.gif
Clean-Param: utm_source&utm_medium&utm_campaign
Clean-Param: openstat

User-agent: *
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: /xmlrpc.php
Disallow: *?utm
Disallow: *openstat=
Disallow: /tag/
Disallow: /readme.html
Disallow: *?replytocom
Allow: */uploads
Allow: /*/*.js
Allow: /*/*.css
Allow: /wp-*.png
Allow: /wp-*.jpg
Allow: /wp-*.jpeg
Allow: /wp-*.gif

Sitemap: http://site.com/sitemap.xml # replace with the url of your sitemap

ROBOTS.TXT for OpenCart

There is no built-in robots.txt tool in the OpenCart admin area, so the file can only be accessed using FTP.

The example below can be used as a starter robots.txt set for OpenCart sites, but it is not universal and requires adaptation depending on the site.


Explanations:

  1. the Allow directives indicate the paths to the files of styles, scripts, pictures: for correct indexing of the site, they must be available to robots;
  2. splitting into rules for different agents is due to the fact that Google does not support the Clean-param directive;
User-agent: *
Disallow: /*route=account/
Disallow: /*route=affiliate/
Disallow: /*route=checkout/
Disallow: /*route=product/search
Disallow: /index.php?route=product/product*&manufacturer_id=
Disallow: /admin
Disallow: /catalog
Disallow: /system
Disallow: /*?sort=
Disallow: /*&sort=
Disallow: /*?order=
Disallow: /*&order=
Disallow: /*?limit=
Disallow: /*&limit=
Disallow: /*?filter_name=
Disallow: /*&filter_name=
Disallow: /*?filter_sub_category=
Disallow: /*&filter_sub_category=
Disallow: /*?filter_description=
Disallow: /*&filter_description=
Disallow: /*?tracking=
Disallow: /*&tracking=
Disallow: /*compare-products
Disallow: /*search
Disallow: /*cart
Disallow: /*checkout
Disallow: /*login
Disallow: /*logout
Disallow: /*vouchers
Disallow: /*wishlist
Disallow: /*my-account
Disallow: /*order-history
Disallow: /*newsletter
Disallow: /*return-add
Disallow: /*forgot-password
Disallow: /*downloads
Disallow: /*returns
Disallow: /*transactions
Disallow: /*create-account
Disallow: /*recurring
Disallow: /*address-book
Disallow: /*reward-points
Disallow: /*affiliate-forgot-password
Disallow: /*create-affiliate-account
Disallow: /*affiliate-login
Disallow: /*affiliates
Disallow: /*?filter_tag=
Disallow: /*brands
Disallow: /*specials
Disallow: /*simpleregister
Disallow: /*simplecheckout
Disallow: *utm=
Disallow: /*&page
Disallow: /*?page*&
Allow: /*?page
Allow: /catalog/view/javascript/
Allow: /catalog/view/theme/*/

User-agent: Yandex
Disallow: /*route=account/
Disallow: /*route=affiliate/
Disallow: /*route=checkout/
Disallow: /*route=product/search
Disallow: /index.php?route=product/product*&manufacturer_id=
Disallow: /admin
Disallow: /catalog
Disallow: /system
Disallow: /*?sort=
Disallow: /*&sort=
Disallow: /*?order=
Disallow: /*&order=
Disallow: /*?limit=
Disallow: /*&limit=
Disallow: /*?filter_name=
Disallow: /*&filter_name=
Disallow: /*?filter_sub_category=
Disallow: /*&filter_sub_category=
Disallow: /*?filter_description=
Disallow: /*&filter_description=
Disallow: /*compare-products
Disallow: /*search
Disallow: /*cart
Disallow: /*checkout
Disallow: /*login
Disallow: /*logout
Disallow: /*vouchers
Disallow: /*wishlist
Disallow: /*my-account
Disallow: /*order-history
Disallow: /*newsletter
Disallow: /*return-add
Disallow: /*forgot-password
Disallow: /*downloads
Disallow: /*returns
Disallow: /*transactions
Disallow: /*create-account
Disallow: /*recurring
Disallow: /*address-book
Disallow: /*reward-points
Disallow: /*affiliate-forgot-password
Disallow: /*create-affiliate-account
Disallow: /*affiliate-login
Disallow: /*affiliates
Disallow: /*?filter_tag=
Disallow: /*brands
Disallow: /*specials
Disallow: /*simpleregister
Disallow: /*simplecheckout
Disallow: /*&page
Disallow: /*?page*&
Allow: /*?page
Allow: /catalog/view/javascript/
Allow: /catalog/view/theme/*/
Clean-Param: page /
Clean-Param: utm_source&utm_medium&utm_campaign /

Sitemap: http://site.com/sitemap.xml # replace with your sitemap url

ROBOTS.TXT for Joomla!

Joomla's admin panel does not have a built-in tool for setting robots.txt, so the file can only be accessed using FTP.

The example below can be used as a starter robots.txt set for Joomla sites with SEF enabled, but it is not universal and requires adaptation depending on the site.


Explanations:

  1. the Allow directives indicate the paths to the files of styles, scripts, pictures: for correct indexing of the site, they must be available to robots;
  2. splitting into rules for different agents is due to the fact that Google does not support the Clean-param directive;
User-agent: Yandex
Disallow: /*%
Disallow: /administrator/
Disallow: /bin/
Disallow: /cache/
Disallow: /cli/
Disallow: /components/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /layouts/
Disallow: /libraries/
Disallow: /logs/
Disallow: /log/
Disallow: /tmp/
Disallow: /xmlrpc/
Disallow: /plugins/
Disallow: /modules/
Disallow: /component/
Disallow: /search*
Disallow: /*mailto/
Allow: /*.css?*$
Allow: /*.less?*$
Allow: /*.js?*$
Allow: /*.jpg?*$
Allow: /*.png?*$
Allow: /*.gif?*$
Allow: /templates/*.css
Allow: /templates/*.less
Allow: /templates/*.js
Allow: /components/*.css
Allow: /components/*.less
Allow: /media/*.js
Allow: /media/*.css
Allow: /media/*.less
Allow: /index.php?*view=sitemap* # open the sitemap
Clean-param: searchword /
Clean-param: limit&limitstart /
Clean-param: keyword /

User-agent: *
Disallow: /*%
Disallow: /administrator/
Disallow: /bin/
Disallow: /cache/
Disallow: /cli/
Disallow: /components/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /layouts/
Disallow: /libraries/
Disallow: /logs/
Disallow: /log/
Disallow: /tmp/
Disallow: /xmlrpc/
Disallow: /plugins/
Disallow: /modules/
Disallow: /component/
Disallow: /search*
Disallow: /*mailto/
Disallow: /*searchword
Disallow: /*keyword
Allow: /*.css?*$
Allow: /*.less?*$
Allow: /*.js?*$
Allow: /*.jpg?*$
Allow: /*.png?*$
Allow: /*.gif?*$
Allow: /templates/*.css
Allow: /templates/*.less
Allow: /templates/*.js
Allow: /components/*.css
Allow: /components/*.less
Allow: /media/*.js
Allow: /media/*.css
Allow: /media/*.less
Allow: /index.php?*view=sitemap* # open the sitemap

Sitemap: http://your_site_map_address

List of main agents

Bot - Function
Googlebot - Google's main indexing robot
Googlebot-News - Google News
Googlebot-Image - Google Images
Googlebot-Video - Google Video
Mediapartners-Google, Mediapartners - Google AdSense, Google Mobile AdSense
AdsBot-Google - landing page quality check
AdsBot-Google-Mobile-Apps - Google robot for apps
YandexBot - Yandex's main indexing robot
YandexImages - Yandex.Images
YandexVideo - Yandex.Video
YandexMedia - multimedia data
YandexBlogs - blog search robot
YandexAddurl - robot that accesses a page when it is added through the "Add URL" form
YandexFavicons - robot that indexes favicons
YandexDirect - Yandex.Direct
YandexMetrika - Yandex.Metrica
YandexCatalog - Yandex.Catalog
YandexNews - Yandex.News
YandexImageResizer - mobile services robot
Bingbot - Bing's main indexing robot
Slurp - Yahoo!'s main indexing robot
Mail.Ru - Mail.Ru's main indexing robot

FAQ

The robots.txt file is publicly available. Keep this in mind and do not use it as a means of hiding confidential information.

Are there any differences between robots.txt for Yandex and Google?

There are no fundamental differences in the processing of robots.txt by Yandex and Google search engines, but still a number of points should be highlighted:

  • as mentioned earlier, the rules in robots.txt are advisory in nature, which is actively used by Google.

    In its robots.txt documentation, Google states that the file "is not intended to prevent web pages from being shown in Google search results" and that "if a robots.txt file prevents Googlebot from processing a web page, it can still be served in Google search." To exclude pages from Google search, you need to use the robots meta tag.

    Yandex excludes pages from the search, guided by the rules of robots.txt.

  • Yandex, unlike Google, supports the Clean-param and Crawl-delay directives.
  • AdsBot Google robots do not follow the rules for User-agent: *; separate rules need to be set for them.
  • Many sources indicate that script and style files (.js, .css) should only be opened for indexing by Google robots. In fact, this is not true and you should open these files for Yandex as well: from 11/9/2015 Yandex began to use js and css when indexing sites (post in the official blog).

How to block a site from being indexed in robots.txt?

To close a site in Robots.txt, you need to use one of the following rules:

User-agent: *
Disallow: /

User-agent: *
Disallow: *

It is possible to close the site only to one search engine (or several) while leaving it open to the rest. To do this, change the User-agent directive in the rule: replace * with the name of the agent that should be denied access.
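
For example, to close the site for Yandex only while leaving it open to everyone else (a sketch):

User-agent: Yandex
Disallow: /

User-agent: *
Disallow: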

How to open a site for indexing in robots.txt?

In the usual case, to open a site for indexing in robots.txt, you do not need to take any action, you just need to make sure that all the necessary directories are open in robots.txt. For example, if your site was previously hidden from indexing, then the following rules should be removed from robots.txt (depending on the one used):

  • Disallow: /
  • Disallow: *

Please note that indexing can be prohibited not only by the robots.txt file, but also by using the robots meta tag.

It should also be noted that the absence of a robots.txt file in the site root means that site indexing is allowed.

How to specify the main site mirror in robots.txt?

Currently, it is not possible to specify the main mirror using robots.txt. Previously, the Yandex search engine used the Host directive, which indicated the main mirror, but since March 20, 2018 Yandex has completely abandoned it. Now the main mirror can only be specified with a 301 redirect.