Robots.txt is a plain-text file created by webmasters to tell web robots how to crawl the pages of a website. Search engines use these robots (also called crawlers or spiders) to crawl websites, check whether pages follow the relevant standards, index their content, and store that data in the search engines' databases. The robots.txt file is part of the Robots Exclusion Protocol (REP), a group of web standards that instructs robots how to crawl the web. These robots go by different names depending on the search engine or service, for example:
- GoogleBot by Google
- Baidu Spider by Baidu
- MSNBot/BingBot by Bing
- YandexBot by Yandex
- Soso Spider by Soso
- ExaBot by 3ds
- Sogou Spider by Sogou
- Google Plus Share by Google
- Facebook External Hit by Facebook
- Google Feedfetcher by Google
The crawl directives "allow" or "disallow" bots to read either all sections of a website or only particular ones. This helps webmasters protect their sites from bad bots, which are generally spammers, email harvesters, and similar crawlers used to scrape information and spam sites.
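As a rough sketch of how these rules are consumed (assuming Python and its standard urllib.robotparser module; the domain and paths below are placeholders), a well-behaved crawler downloads robots.txt first and checks each URL against it before fetching:

from urllib import robotparser

# Download and parse the site's robots.txt (placeholder domain).
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Ask before fetching: True means the named bot may crawl the URL.
print(rp.can_fetch("GoogleBot", "https://www.example.com/"))
print(rp.can_fetch("GoogleBot", "https://www.example.com/example-subfolder/"))

Keep in mind that robots.txt is purely advisory: well-behaved crawlers honor it, while bad bots simply ignore it, so it is not an access-control mechanism on its own.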
Basic format:
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
Together, these two lines already form a complete robots.txt file. A robots.txt file can also contain multiple blocks of user agents and their directives, as in the example below.
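For instance, a single file might hold separate blocks for different crawlers (the paths here are hypothetical):

User-agent: GoogleBot
Disallow: /private/

User-agent: BingBot
Disallow: /drafts/

User-agent: *
Disallow: /tmp/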
Blocking all web crawlers from all content on the website:
User-agent: *
Disallow: /
Allowing all web crawlers access to all content:
User-agent: *
Disallow:
Blocking all web crawlers from a particular subfolder:
User-agent: *
Disallow: /example-subfolder/
Blocking a specific bot from crawling the entire website:
User-agent: Badbot
Disallow: /
Blocking all web crawlers from a specific page of the website:
User-agent: *
Disallow: /example-subfolder/blocked-page.html
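To check how crawlers would interpret directives like the ones above, the rules can be fed to Python's built-in parser (a quick test sketch; the domain is a placeholder):

from urllib import robotparser

# Rules taken from the examples above, held in memory for testing.
rules = """\
User-agent: Badbot
Disallow: /

User-agent: *
Disallow: /example-subfolder/blocked-page.html
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Badbot is shut out entirely; other crawlers only lose the one page.
print(rp.can_fetch("Badbot", "https://www.example.com/"))      # False
print(rp.can_fetch("GoogleBot", "https://www.example.com/"))   # True
print(rp.can_fetch("GoogleBot",
                   "https://www.example.com/example-subfolder/blocked-page.html"))  # False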