Michael Jay Lissner

Support for x-robots-tag and robots HTML meta tag

Contents

  • Google, Bing
  • Yahoo, AOL
  • Ask, Yandex, Nutch
  • The Internet Archive, Alexa
  • Blekko, Baidu
  • DuckDuckGo

As part of the research for our post on how we block search engines, we looked into which search engines support which privacy standards. This information doesn’t seem to be collected anywhere else on the Internet, so our findings are below, starting with the big players and moving toward more obscure or foreign search engines.

Google, Bing

Google (whose crawler is known as Googlebot) and Bing (whose crawler is known as Bingbot) both support the x-robots-tag HTTP header and the robots HTML meta tag. Here’s Google’s page on the topic, and here’s Bing’s. Bing’s older crawler, msnbot, is retired.
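For readers who have not seen the two mechanisms side by side, here is a minimal sketch of what each one looks like on the wire. The helper names robots_header and robots_meta_tag are made up for illustration and are not part of any library; only the header name X-Robots-Tag and the meta tag name robots come from the standards discussed here.

```python
def robots_header(directives=("noindex",)):
    """Return the (name, value) pair for the x-robots-tag HTTP header.

    Hypothetical helper: attach the result to whatever response object
    your web framework uses.
    """
    return ("X-Robots-Tag", ", ".join(directives))


def robots_meta_tag(directives=("noindex",)):
    """Return the equivalent robots meta tag, to be placed inside <head>."""
    return '<meta name="robots" content="{}">'.format(", ".join(directives))


if __name__ == "__main__":
    name, value = robots_header()
    print("{}: {}".format(name, value))
    print(robots_meta_tag(("noindex", "nofollow")))
```

The header form works for any file type, including PDFs and images, while the meta tag only works in HTML pages.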

Yahoo, AOL

Yahoo!’s search results are provided by Bing, and AOL’s are provided by Google, so these are easy ones: each inherits the support of its provider.

Ask, Yandex, Nutch

Ask (whose crawler is known as teoma) and Yandex (Russia’s search engine, whose crawler is known as yandex) support the robots meta tag, but do not appear to support the x-robots-tag header. Ask’s page on the topic is here, and Yandex’s is here. The popular open source crawler, Nutch, also supports the robots HTML meta tag, but not the x-robots-tag header. Update: Newer versions of Nutch now support x-robots-tag!
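Because these crawlers only read the meta tag, the tag has to be present in the HTML itself; a header alone will not reach them. The following is a rough sketch, using only Python’s standard library, of how one might check that a page carries a noindex meta tag. The class and function names (RobotsMetaParser, has_meta_noindex) are hypothetical, and the sketch does not handle crawler-specific meta names.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None when an attribute has no value.
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def has_meta_noindex(html):
    """Return True if the page contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(has_meta_noindex(page))
```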

The Internet Archive, Alexa

The Internet Archive uses Alexa’s crawler, which is known as ia_archiver. This crawler does not seem to support either the HTML robots meta tag or the x-robots-tag HTTP header. Their page on the subject is here. I have requested more information from them, and will update this page if I hear back.

Blekko, Baidu

Blekko does not support either the robots meta tag or the x-robots-tag header, per emails I’ve exchanged with them. I also requested information from Baidu, but their response totally ignored my question and was in Chinese. They do have some information here, but it does not seem to say anything about the noindex value for the robots tag. In any case, the only way to block these crawlers seems to be via a robots.txt file.
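As an illustration of the robots.txt approach, here is a small sketch that defines a set of rules and checks them with Python’s standard urllib.robotparser module. The user-agent tokens Baiduspider and ScoutJet are my assumptions for Baidu’s and Blekko’s crawlers and should be verified against each engine’s documentation; ia_archiver is the name mentioned above. The is_blocked helper and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Example rules. Baiduspider and ScoutJet are assumed user-agent tokens;
# confirm them with each search engine before relying on this.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: Baiduspider
Disallow: /

User-agent: ScoutJet
Disallow: /
"""


def is_blocked(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given rules forbid user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ia_archiver", "Baiduspider", "ScoutJet", "Googlebot"):
        print(agent, is_blocked(agent, "https://example.com/page"))
```

Keep in mind that robots.txt asks a crawler not to fetch pages, which is a blunter tool than a noindex directive.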

DuckDuckGo

I previously stated that DDG did not support the x-robots-tag header. That was true, but it didn’t tell the entire story: DDG relies on other search engines’ crawlers for its content aggregation and uses its own crawler only for maintenance-type work. You can read more about this in my answer on StackOverflow.


Published

Jan 24, 2012

Category

Tech


  • Unless mentioned otherwise, all material on this site is licensed under a Creative Commons copyright or the GNU Affero GPL. Privacy Policy.