Documenting the format of robots.txt
In the early days of the web, the crawlers that scan public web pages to build the indexes used by search engines received complaints from webmasters who did not agree to let these crawlers index their websites or were concerned about the load that the crawlers put on their infrastructure. To cope with this problem, a simple text file called robots.txt was introduced in 1994. Webmasters can use it to specify which parts of a web site may be crawled and by which crawler. Over the years, the robots.txt file evolved, but its format was never formally specified despite its widespread usage. Twenty-five years later, Google fills this gap by releasing an Internet-Draft that finally describes the format of this text file. The IETF will likely discuss minor details of this document in the coming months or years before adopting it.
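As a rough illustration, a robots.txt file placed at the root of a web server is simply a list of directives grouped by crawler. The short example below is entirely hypothetical (the crawler name and paths are invented) and uses the common User-agent, Disallow and Allow lines:

```
# Hypothetical file served at https://www.example.org/robots.txt

# Rules that apply to every crawler
User-agent: *
Disallow: /private/
# Under the rules documented in the Internet-Draft, the most specific
# (longest) matching rule wins, so this Allow overrides the Disallow above
Allow: /private/press/

# Stricter rules for one specific crawler
User-agent: ExampleBot
Disallow: /
```

A crawler that follows this convention fetches the file before crawling the site and skips the URLs that a Disallow rule excludes for its user agent.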
It took 25 years for Google to precisely document the format of robots.txt… To celebrate this anniversary, they also released the source code of a C++ parser that reads robots.txt files.
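The released parser is written in C++, but the way a crawler uses such a parser can be sketched with the robotparser module from the Python standard library. This is only an illustration with a different, standard-library parser, not Google's code, and the robots.txt content and URLs are invented:

```python
# Sketch: checking robots.txt rules with Python's standard library
# (urllib.robotparser), not with the C++ parser released by Google.
from urllib import robotparser

# Hypothetical robots.txt content for this illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler checks each URL before fetching it.
print(parser.can_fetch("ExampleBot", "https://www.example.org/index.html"))         # True
print(parser.can_fetch("ExampleBot", "https://www.example.org/private/data.html"))  # False
```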
This blog post was written to inform the readers of Computer Networking: Principles, Protocols and Practice about the evolution of the field. You can subscribe to the Atom feed for this blog at https://obonaventure.github.io/cnp3blog/feed.xml. These notes are also posted on the ebook’s Facebook page.