The Humans Inclusion Standard

A Discount History of Website Metadata Files

The people behind humans.txt would not call it a standard. "Standard" implies there is a correct way to conform, and the humans.txt website makes it very clear that it is just an idea. The repeated use of "initiative" to describe humans.txt is surely an exercise in precision. I just happen to like "The Humans Inclusion Standard" because of the implied relation to The Robots Exclusion Standard, commonly known as robots.txt.

Robots.txt is the best-known of a group of text files that can be placed in a website's root directory to communicate information about that site. They're human-readable metadata files for website visitors. While many people are familiar with the existence of robots.txt files, there is an assortment of other, lesser-known website metadata text files. They enjoy a quiet existence in a website's root directory, largely unknown and unobserved.

First things first, though. When talking about website metadata, it's obligatory to mention the <META> tag. It can be used to include information in a web page's header along the lines of <meta name="author" content="Max Power">. A meta tag can contain any key-value pair the author cares to specify. The meta tag hasn't always been available though. In 1994 web pages were written in HTML 1.2 or HTML+, there was no meta tag, and Martijn Koster needed a way to communicate information about websites to web crawlers. So it was that the robots.txt file came to be.


robots.txt (1994)

If the HTML 2.0 standard had dropped a year and a half earlier, would Koster have used <META> tag key-value pairs instead of the now ubiquitous robots.txt? That's unknown, but the point is that it wasn't even an option. By November 1995, when RFC 1866 introduced the HTML 2.0 standard and the <META> tag, the use of a robots.txt file was already established.

A robots.txt file communicates to crawlers and spiders which parts of a website are off-limits. It's an honor-based system, as nothing about a robots.txt file actually controls what can and can't be scraped. Still, many people are happy to respect the wishes of website owners, and robots.txt provides a standardized communication channel.
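
A minimal robots.txt might look something like this (the bot name and paths are made up for illustration):

    User-agent: *
    Disallow: /private/

    User-agent: ExampleBot
    Disallow: /

The first block asks all crawlers to stay out of /private/; the second asks a hypothetical ExampleBot to stay away entirely.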


humans.txt (2002)

Humans.txt is a very human response to the existence of robots.txt. "Robots have a txt, why not us?" After all, the Internet is largely built by humans for humans. Humans.txt is intended to complement rather than replace HTML's meta tag. If all you want to do is credit yourself as the author of a website, a meta tag arguably offers all the functionality you need. Meta tags aren't appropriate for every situation though. What if a website has twenty authors? What if instead of a name you want to use ASCII art? Most importantly though, meta tags aren't intended for a human audience. Humans.txt is. The initiative suggests a standard format, but makes it clear that it's really just a suggestion.
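
As a rough illustration, a humans.txt loosely following the format suggested at humanstxt.org might read like this (the names and details are invented):

    /* TEAM */
    Developer: Max Power
    Location: Springfield, USA

    /* SITE */
    Last update: 2022/01/01
    Software: Vim, Git

But nothing enforces that layout; a haiku or a block of ASCII art would serve the purpose just as well.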


business.txt (2012)

The business.txt file was a proposal to standardize the communication of business information such as a physical address and operating hours. This information is arguably website data, not website metadata, and business.txt does not appear to have ever been widely adopted.


killer-robots.txt (2014)

From 2014 to 2018, google.com hosted a killer-robots.txt file that appeared to prohibit the T-800 and T-1000 models of Terminator from targeting the founders of the company. I'm not sure what happened in 2018, but it seems safe to assume that Alphabet Inc. acquired Cyberdyne Systems, altogether eliminating any threat to company personnel.


security.txt (2017)

The security.txt file is a means of communicating with independent security researchers. The proposed standard is still in RFC review, but it has already been well received. In 2019, the Department of Homeland Security directed all federal agencies to begin publishing a security.txt file on their websites. You'll also find a security.txt file hosted by many leading tech companies, including Google and Amazon Web Services. You don't have to be operating at that scale to use a security.txt file, though. To make it easy to get started, the security.txt website provides a simple form for generating properly formatted files and clear instructions on publishing best practices.
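
For illustration, a bare-bones security.txt might contain little more than a contact address, an expiration date, and a link to a disclosure policy (the values below are placeholders):

    Contact: mailto:security@example.com
    Expires: 2025-12-31T23:59:59.000Z
    Policy: https://example.com/security-policy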


ads.txt (2017)

The ads.txt file is an anti-fraud initiative that allows website publishers to specify their authorized digital sellers. Introduced by the Interactive Advertising Bureau Tech Lab, ads.txt allows buyers to check whether they are buying from an authorized seller by looking in the publisher's ads.txt file for an entry matching the seller's account ID. The IAB isn't expecting anyone to manually cross-reference the contents of these text files, and it provides a sample crawler written in Python as well as an ads.txt aggregator.
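
Each line of an ads.txt file is a comma-separated record: the advertising system's domain, the publisher's account ID on that system, the relationship type (DIRECT or RESELLER), and an optional certification authority ID. A publisher's file might look something like this, with made-up domains and IDs:

    # Authorized digital sellers for example.com
    adexchange.example, 1234567890, DIRECT, abc123
    adnetwork.example, pub-98765, RESELLER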


Of these standards, proposals, and initiatives, there is one that stands out as particularly interesting, at least on a humanistic level. Humans.txt is by far the loosest standard. It is meant to contain whatever a website's authors care to share, meaning it can contain nearly anything and still serve its intended purpose. Humans.txt is also the only one of these metadata files that targets a general audience. Robots.txt is meant for robots. Security.txt is intended for security researchers. Humans.txt is for whoever cares to look. So what exactly have people been putting in humans.txt files for the last twenty years? It's a question that has been on my mind for a while and one I have never found a real answer to. It seems the only way to know is to go look, but that's easier said than done when talking about the whole Internet.