<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Michael Jay Lissner</title><link href="https://michaeljaylissner.com/" rel="alternate"></link><link href="https://michaeljaylissner.com/feeds/tag/x-robots-tag" rel="self"></link><id>https://michaeljaylissner.com/</id><updated>2012-01-24T23:20:20-08:00</updated><entry><title>Support for x-robots-tag and robots HTML meta tag</title><link href="https://michaeljaylissner.com/posts/2012/01/24/support-for-x-robots-tag-http-header-and-robots-html-meta-tag/" rel="alternate"></link><updated>2012-01-24T23:20:20-08:00</updated><author><name>Mike Lissner</name></author><id>tag:michaeljaylissner.com,2012-01-24:posts/2012/01/24/support-for-x-robots-tag-http-header-and-robots-html-meta-tag/</id><summary type="html">
&lt;p&gt;As part of our research for &lt;a href="/blog/respecting-privacy-while-providing-hundreds-of-thousands-of-public-documents"&gt;our post&lt;/a&gt; on how we block search engines, we 
looked into which search engines support which privacy standards. This 
information doesn’t seem to exist anywhere else on the Internet, so below are 
our findings, starting with the big guys, and moving towards more obscure or 
foreign search engines.&lt;/p&gt;
&lt;h2 id="google-bing"&gt;Google, Bing&lt;/h2&gt;
&lt;p&gt;Google (whose crawler is Googlebot) and Bing (whose crawler is Bingbot) both support the 
&lt;code&gt;x-robots-tag&lt;/code&gt; &lt;span class="caps"&gt;HTTP&lt;/span&gt; header and the robots &lt;span class="caps"&gt;HTML&lt;/span&gt; meta tag. Here’s &lt;a href="http://support.google.com/webmasters/bin/answer.py?hl=en&amp;amp;answer=79812"&gt;Google’s page&lt;/a&gt; on 
the topic, and &lt;a href="http://www.bing.com/community/site_blogs/b/webmaster/archive/2009/08/21/prevent-a-bot-from-getting-lost-in-space-sem-101.aspx"&gt;here’s Bing’s&lt;/a&gt;. The older &lt;a href="http://www.bing.com/community/site_blogs/b/webmaster/archive/2009/11/04/msnbot-1-1-is-retired.aspx"&gt;msnbot is retired&lt;/a&gt;.&lt;/p&gt;
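&lt;p&gt;To make the header concrete, here is a minimal Python sketch of what a server might send and how a well-behaved crawler could read it. The helper names &lt;code&gt;x_robots_tag&lt;/code&gt; and &lt;code&gt;forbids_indexing&lt;/code&gt; are made up for this illustration, and the parsing is deliberately simplified: it ignores the form of the header that is scoped to a single user agent.&lt;/p&gt;

```python
def x_robots_tag(*directives):
    """Build an X-Robots-Tag header as a (name, value) pair.

    Hypothetical helper for illustration; not part of any library.
    """
    return ("X-Robots-Tag", ", ".join(directives))


def forbids_indexing(headers):
    """Return True if the headers carry a noindex (or none) directive.

    Simplified assumption: the value is a plain comma-separated list,
    with no user-agent prefix such as "googlebot: noindex".
    """
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get("x-robots-tag", "")
    directives = {part.strip().lower() for part in value.split(",")}
    # "none" is shorthand for "noindex, nofollow".
    return "noindex" in directives or "none" in directives


name, value = x_robots_tag("noindex", "noarchive")
print(name + ": " + value)
print(forbids_indexing({name: value}))
```

&lt;p&gt;A crawler that does not support the header simply never runs a check like this, which is why the per-engine differences below matter.&lt;/p&gt;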
&lt;h2 id="yahoo-aol"&gt;Yahoo, &lt;span class="caps"&gt;AOL&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;Yahoo!’s search engine is provided by Bing. &lt;span class="caps"&gt;AOL&lt;/span&gt;’s is provided by Google. These 
are easy ones: each should behave the same way as the engine behind it.&lt;/p&gt;
&lt;h2 id="ask-yandex-nutch"&gt;Ask, Yandex, Nutch&lt;/h2&gt;
&lt;p&gt;Ask (known as teoma) and Yandex (Russia’s search engine, known as yandex) 
support the robots meta tag, but do not appear to support the &lt;code&gt;x-robots-tag&lt;/code&gt; header. 
Ask’s page on the topic &lt;a href="http://www.ask.com/staticcontent/about/helpcenter/about_helpcenter_webmaster#5"&gt;is here&lt;/a&gt;, and Yandex’s &lt;a href="http://help.yandex.com/webmaster/?id=1113833"&gt;is here&lt;/a&gt;. The popular 
open source crawler, &lt;a href="http://nutch.apache.org/"&gt;Nutch&lt;/a&gt;, also &lt;a href="http://nutch.sourceforge.net/docs/en/bot.html"&gt;supports the robots &lt;span class="caps"&gt;HTML&lt;/span&gt; tag&lt;/a&gt;, but 
&lt;a href="http://lucene.472066.n3.nabble.com/Support-for-x-robots-tag-td3678606.html"&gt;not the &lt;code&gt;x-robots-tag&lt;/code&gt; header&lt;/a&gt;. &lt;em&gt;Update:&lt;/em&gt; Newer versions of Nutch now 
support &lt;code&gt;x-robots-tag&lt;/code&gt;!&lt;/p&gt;
&lt;h2 id="the-internet-archive-alexa"&gt;The Internet Archive, Alexa&lt;/h2&gt;
&lt;p&gt;The Internet Archive uses Alexa’s crawler, which is known as ia_archiver. This 
crawler does not seem to support either the &lt;span class="caps"&gt;HTML&lt;/span&gt; robots meta tag or the 
&lt;code&gt;x-robots-tag&lt;/code&gt; &lt;span class="caps"&gt;HTTP&lt;/span&gt; header. Their page on the subject &lt;a href="http://www.alexa.com/help/webmasters"&gt;is here&lt;/a&gt;. I have 
requested more information from them and will update this page if I hear back.&lt;/p&gt;
&lt;h2 id="blekko-baidu"&gt;Blekko, Baidu&lt;/h2&gt;
&lt;p&gt;Blekko supports neither the robots meta tag nor the 
&lt;code&gt;x-robots-tag&lt;/code&gt; header, according to emails I exchanged with them. I also requested 
information from Baidu, but their response ignored my question and was 
in Chinese. They do have some documentation &lt;a href="http://wenku.baidu.com/view/ec4457d4b14e852458fb5793.html"&gt;here&lt;/a&gt;, but it does not seem to 
cover the noindex value for the robots tag. In any case, 
the only way to block these crawlers seems to be via a robots.txt file.&lt;/p&gt;
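&lt;p&gt;As a rough sketch of the robots.txt fallback, the Python below checks a sample file with the standard library’s &lt;code&gt;urllib.robotparser&lt;/code&gt;. The user-agent tokens (&lt;code&gt;Baiduspider&lt;/code&gt;, &lt;code&gt;ExampleBot&lt;/code&gt;) and the &lt;code&gt;/private/&lt;/code&gt; path are assumptions made for the example, so check each crawler’s own documentation for the token it actually honors.&lt;/p&gt;

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. The user-agent tokens and the path are
# illustrative assumptions, not values taken from the post.
ROBOTS_TXT = """\
User-agent: Baiduspider
Disallow: /private/

User-agent: ExampleBot
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the file as a list of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if this robots.txt lets the given agent fetch the path."""
    return parser.can_fetch(agent, path)


print(allowed("Baiduspider", "/private/doc.pdf"))
print(allowed("Baiduspider", "/public/index.html"))
# An agent with no matching entry is not restricted by this file.
print(allowed("Googlebot", "/private/doc.pdf"))
```

&lt;p&gt;Note that robots.txt asks a crawler not to fetch a path at all, which is a blunter instrument than a noindex directive on a single document.&lt;/p&gt;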
&lt;h2 id="duckduckgo"&gt;DuckDuckGo&lt;/h2&gt;
&lt;p&gt;I previously stated that &lt;span class="caps"&gt;DDG&lt;/span&gt; did not support the &lt;code&gt;x-robots-tag&lt;/code&gt; header.
That was true, but it didn’t tell the entire story:
&lt;span class="caps"&gt;DDG&lt;/span&gt; uses other search crawlers for its content aggregation and uses its 
own crawler only for maintenance-type work. You can read more about this in &lt;a href="http://stackoverflow.com/a/24089393/64911"&gt;my
answer on Stack Overflow&lt;/a&gt;.&lt;/p&gt;</summary><category term="x-robots-tag"></category><category term="search"></category><category term="robots"></category><category term="privacy"></category></entry></feed>