Keywords & Google.cn



Google.cn’s de-listing of websites (these sites do not appear in search results) has been fairly well documented. The best way to determine whether a site is de-listed is to use the “site:” modifier, which restricts results to a particular website. For example, a search for site:news.bbc.co.uk on google.cn returns no results and also indicates that the results have been censored (据当地法律法规和政策,部分搜索结果未予显示, roughly “in accordance with local laws, regulations and policies, some search results are not shown”).

Previously, Ethan and I found that searches for certain terms were restricted to Chinese webpages.

Google.cn has gone further now and appears to be restricting searches for certain terms to sites that have been whitelisted.

I started with a search for 六四 (64) that is restricted to *not* include any .cn, .com, .org, or .net sites. There are no results, and the censored message is displayed. (The censored message is always displayed when one of these special terms is searched for, whether or not any results are actually censored.)
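As a rough sketch of how such restricted queries can be put together: the snippet below assumes the exclusions are written with the negated “-site:” form, which is my guess at the syntax and not something spelled out above. The helper name build_query is made up for illustration.

```python
def build_query(term, excluded_tlds=(), site=None):
    """Compose a search query string (hypothetical helper, not an official API)."""
    parts = [term]
    if site:
        # site: restricts results to one website or suffix.
        parts.append(f"site:{site}")
    # Assumption: a leading minus negates the modifier, so -site:.com drops .com results.
    parts.extend(f"-site:{tld}" for tld in excluded_tlds)
    return " ".join(parts)


# Everything excluded, then the same search with .net allowed back in.
all_excluded = build_query("六四", [".cn", ".com", ".org", ".net"])
net_allowed = build_query("六四", [".cn", ".com", ".org"])
print(all_excluded)
print(net_allowed)
```

Dropping one suffix at a time from the exclusion list is what lets you see which domains survive for each TLD.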

I then began to remove some of the restrictions. First, I allowed .net to be included, and only one site was returned. When only .org is allowed there are only 4 indexed domains. And when .com is allowed, only 8 domains are indexed. There seemed to be a fair number of .cn sites when .cn is allowed; I didn’t bother trying to fish out the unique domains.
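Fishing out the unique domains from a list of result URLs is easy enough to script. This is a minimal sketch that assumes the result URLs have already been collected into a list by hand; the example.com and example.org URLs are placeholders, not actual results.

```python
from urllib.parse import urlsplit


def unique_domains(result_urls):
    """Return the sorted set of hostnames found in a list of result URLs."""
    hosts = set()
    for url in result_urls:
        # .hostname drops any port number and lowercases the host.
        host = urlsplit(url).hostname
        if host:
            hosts.add(host)
    return sorted(hosts)


# Placeholder URLs standing in for a page of search results.
sample = [
    "http://www.example.com/a",
    "http://www.example.com/b?x=1",
    "http://news.example.org:8080/",
]
print(unique_domains(sample))
```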

Now, this is not *the* whitelist, just the sites that are returned for the search 六四 (64). As we find more censored terms, a more definitive whitelist can be built out.

The reason I suspect these sites are whitelisted is that you cannot search other sites (sites that are indexed and not de-listed) for these special terms. My blog, for example, is indexed (site:www.nartv.org). However, if you search my blog for 六四 (64), the results are censored. (There is content on my blog containing 六四 (64) that google.com has indexed, and google.cn has indexed it too.)
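The reasoning above boils down to a small decision table. Here is a hedged sketch of it; the function name, the labels, and the idea of passing in result counts noted by hand are all my own illustration, and the outcome is only a guess about a site, not proof of a whitelist.

```python
def classify_site(site_count, term_count, term_known_elsewhere):
    """Guess a site's status from manual search observations.

    site_count: results for "site:<domain>" on google.cn
    term_count: results for "<term> site:<domain>" on google.cn
    term_known_elsewhere: True if another index (e.g. google.com)
        shows the term on that site
    """
    if site_count == 0:
        return "de-listed"
    if term_count > 0:
        return "possibly whitelisted"
    if term_known_elsewhere:
        # Indexed, has the content, yet nothing is returned for the term.
        return "indexed but blocked for this term"
    # The site may simply not contain the term at all.
    return "inconclusive"


print(classify_site(10, 0, True))
```

The third argument matters: without confirming that the term really appears on the site, an empty result says nothing about censorship.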

Not even Microsoft has been spared; it too is censored :). And, actually, it seems that results from ccTLDs other than .cn are not displayed when the special terms are searched for. (I didn’t check every single one.)

Now, there are some weirdnesses. For example, a search for “falun” with .com, .net, .org, and .cn excluded will still return results. Some of this appears to be because IP addresses obviously don’t have domain suffixes, but also because Google does not properly parse out domains that have a port number attached (this also happens on google.com). Still, it is strange nonetheless.
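To illustrate the kind of parsing slip that could produce this (purely a guess at the failure mode, not Google’s actual code), compare a suffix check against the raw host:port string with one against the hostname alone. The function names and test URLs are made up for the example.

```python
from urllib.parse import urlsplit


def naive_excluded(url, suffixes):
    """Suffix check on the raw host[:port] string; a port hides the suffix."""
    netloc = urlsplit(url).netloc
    return any(netloc.endswith(s) for s in suffixes)


def careful_excluded(url, suffixes):
    """Suffix check on the hostname only, with any port stripped."""
    host = urlsplit(url).hostname or ""
    return any(host.endswith(s) for s in suffixes)


excluded = [".com", ".net", ".org", ".cn"]
with_port = "http://example.com:8080/page"
ip_only = "http://192.0.2.1/page"

print(naive_excluded(with_port, excluded))    # the :8080 masks the .com suffix
print(careful_excluded(with_port, excluded))
print(careful_excluded(ip_only, excluded))    # an IP address has no suffix to match
```

Either way, a bare IP address slips past any suffix-based exclusion, which matches the first half of the explanation above.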

(Some more strangeness: opinion.people.com.cn is de-listed, and although news.xinhuanet.com is indexed, the censor message appears!)

De-listed domains, restricted keywords and whitelisted domains! What’s next, Google?
