Robots.txt is a plain-text file created by webmasters to tell web robots how to crawl the pages of a website. Search engines use these robots (also called crawlers or spiders) to crawl websites, check whether pages follow the relevant standards, index their content, and store that data in the search engines' databases. The robots.txt file is part of the Robots Exclusion Protocol (REP), a group of web standards that instructs robots how to crawl the web. These robots go by different names depending on the search engine or service, for example:
- GoogleBot by Google
- Baidu Spider by Baidu
- MSNBot/BingBot by Bing
- YandexBot by Yandex
- Soso Spider by Soso
- ExaBot by 3ds
- Sogou Spider by Sogou
- Google Plus Share by Google
- Facebook External Hit by Facebook
- Google Feedfetcher by Google
The crawl directives "allow" or "disallow" bots to read either all sections of a website or only particular ones. This helps webmasters protect their sites from bad bots, which are generally spammers, email harvesters, and similar crawlers used to scrape information and spam sites.
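As a rough sketch of how these rules are consumed (assuming Python and its standard urllib.robotparser module; the domain and paths below are placeholders), a well-behaved crawler downloads robots.txt first and checks each URL against it before fetching:

from urllib import robotparser

# Download and parse the site's robots.txt (placeholder domain).
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Ask before fetching: True means the named bot may crawl the URL.
print(rp.can_fetch("GoogleBot", "https://www.example.com/"))
print(rp.can_fetch("GoogleBot", "https://www.example.com/example-subfolder/"))

Keep in mind that robots.txt is purely advisory: well-behaved crawlers honor it, while bad bots simply ignore it, so it is not an access-control mechanism on its own.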
Basic format:
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
Together, these two lines already form a complete robots.txt file. A robots.txt file can also contain multiple blocks of user agents and their directives, as in the example below.
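For instance, a single file might hold separate blocks for different crawlers (the paths here are hypothetical):

User-agent: GoogleBot
Disallow: /private/

User-agent: BingBot
Disallow: /drafts/

User-agent: *
Disallow: /tmp/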
Blocking all web crawlers from all content on the website:
User-agent: *
Disallow: /
Allowing all web crawlers access to all content:
User-agent: *
Disallow:
Blocking all web crawlers from a particular subfolder:
User-agent: *
Disallow: /example-subfolder/
Blocking a specific bot from crawling the entire website:
User-agent: Badbot
Disallow: /
Blocking all web crawlers from a specific page of the website:
User-agent: *
Disallow: /example-subfolder/blocked-page.html
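To check how crawlers would interpret directives like the ones above, the rules can be fed to Python's built-in parser (a quick test sketch; the domain is a placeholder):

from urllib import robotparser

# Rules taken from the examples above, held in memory for testing.
rules = """\
User-agent: Badbot
Disallow: /

User-agent: *
Disallow: /example-subfolder/blocked-page.html
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Badbot is shut out entirely; other crawlers only lose the one page.
print(rp.can_fetch("Badbot", "https://www.example.com/"))      # False
print(rp.can_fetch("GoogleBot", "https://www.example.com/"))   # True
print(rp.can_fetch("GoogleBot",
                   "https://www.example.com/example-subfolder/blocked-page.html"))  # False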