A Search for Human Rights

The Search Monitor Project: China assesses how transparent search engine companies are about their self-censorship practices, as well as the mechanisms and effects of this political censorship. The following is a step-by-step walk through a search for “human rights” (人权).

The first step is to retrieve a result set from the (uncensored) Chinese version of Google. Each result is parsed to its domain name (http://www.hrw.org/chinese/ becomes “www.hrw.org”).
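The domain extraction in this step can be sketched in Python as follows. This is a minimal illustration using the standard library, not the project's actual code; the function name domain_of is hypothetical.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the host part of a result URL (lowercased), or None if there is none."""
    # urlparse only recognises the host when the URL carries a scheme or starts with "//".
    return urlparse(url).hostname


print(domain_of("http://www.hrw.org/chinese/"))  # www.hrw.org
```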

The second step is to query each censored search engine with the “site:” modifier, which restricts results to that domain. The censored versions of Google and Microsoft can be queried directly, but Yahoo and Baidu are hosted inside China and must be queried from inside China. This is because the bi-directional filtering of China’s “Great Firewall” (GFW) blocks inbound connections when “www.hrw.org” appears in the search query. Conversely, searches sent from inside China to Google or Microsoft are blocked by the GFW. Since we are interested in search engine censorship, we need to remove the effects of the GFW.
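As a rough sketch, a “site:” query for one domain could be assembled like this. The base URL and the parameter name "q" are placeholders chosen for illustration; each real search engine has its own endpoint and parameters, which are not given in this post.

```python
from urllib.parse import urlencode


def site_query(domain, keyword):
    """Build a query string that restricts results for `keyword` to `domain`."""
    return f"site:{domain} {keyword}"


def search_url(base_url, domain, keyword, param="q"):
    """Attach the percent-encoded query to a search endpoint.

    `base_url` and `param` are assumptions, not any engine's documented API.
    """
    return base_url + "?" + urlencode({param: site_query(domain, keyword)})


# Hypothetical endpoint, used only to show the shape of the request.
print(search_url("http://search.example/search", "www.hrw.org", "人权"))
```

Note that the domain name travels in the query itself, which is why a request containing “www.hrw.org” can trigger the GFW's keyword filtering when it crosses the border.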

For the top ten results from Google for the query 人权, the number of sites censored by each search engine is:

Keyword | Translation | Google | MSN | Yahoo | Baidu
人权 | human rights | 1 / 10 | 3 / 10 | 1 / 10 | 1 / 10

The common site, censored by all four search engines, is www.hrw.org, the website of Human Rights Watch. Google, acting the most transparently, provides a notification that results have been removed. Since the search was restricted with “site:”, we can conclude that www.hrw.org is censored specifically and deliberately. Yahoo also provides a notification, but it appears at the bottom of every page regardless of whether results are censored, so it tells us nothing about this query. We are left to assume that the site was never indexed, because Yahoo China operates its web crawlers from behind the GFW. Baidu also operates its crawlers from behind the GFW, so, as with Yahoo, sites blocked by the GFW are not indexed. Microsoft uses the same de-listing mechanism as Google but has removed the censorship notification it formerly displayed. We therefore assume that the site has been censored, because the “site:” query returns no results (while results do appear in the English version), but the lack of transparency reduces the accuracy of the claim.
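The reasoning in this paragraph can be written down as a small decision rule. The sketch below is my own simplification, with hypothetical names and labels: it takes the number of “site:” results in the censored version, the number in an uncensored reference version, and whether the engine showed a removal notice specific to the query (pass False for a notice that appears on every page, as with Yahoo).

```python
def classify(cn_hits, reference_hits, removal_notice=False):
    """Label one domain on one search engine from its "site:" result counts."""
    if cn_hits > 0:
        return "indexed"
    if reference_hits == 0:
        # Absent everywhere, so the censored version tells us nothing.
        return "not indexed in either version"
    if removal_notice:
        # The engine itself says results were removed.
        return "censored (notice shown)"
    # No notice: censorship and never-indexed cannot be told apart with certainty.
    return "absent without notice (assumed censored or never indexed)"


print(classify(cn_hits=0, reference_hits=120, removal_notice=True))
print(classify(cn_hits=0, reference_hits=120))
```

The result counts in the usage lines are made up; only the ordering of the checks matters.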

The two other sites from (uncensored) Google’s top ten results for 人权 (human rights) that are censored by at least one search engine are zh.wikipedia.org and www.epicbook.com.

URL | Google | MSN | Yahoo | Baidu

www.hrw.org | Censored | Censored | Censored | Censored
Meta: org | | 701 | US | UUNET – MCI Communications Services, Inc. d/b/a Verizon Business

zh.wikipedia.org | Indexed | Censored | Indexed | Indexed
Meta: org | | 14907 | US | WIKIMEDIA Wikimedia US network

www.epicbook.com | Indexed | Censored | Indexed | Indexed
Meta: com | | 4812 | CN | CHINANET-SH-AP China Telecom (Group)
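The per-engine totals reported earlier (1 / 10, 3 / 10, 1 / 10, 1 / 10) follow from the table above if the remaining seven top-ten sites are not censored by any engine, which is how I read the text. A short sketch of that tally, with hypothetical variable names:

```python
TOP_N = 10  # size of the result set taken from uncensored Google
ENGINES = ["Google", "MSN", "Yahoo", "Baidu"]

# Engines that censor each site, copied from the table above.
censored_by = {
    "www.hrw.org": {"Google", "MSN", "Yahoo", "Baidu"},
    "zh.wikipedia.org": {"MSN"},
    "www.epicbook.com": {"MSN"},
}


def tally(censored_by, engines):
    """Count, per engine, how many of the listed sites it censors."""
    return {e: sum(e in s for s in censored_by.values()) for e in engines}


for engine, count in tally(censored_by, ENGINES).items():
    print(f"{engine}: {count} / {TOP_N}")
```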

It is interesting that Microsoft censors Wikipedia while Yahoo and Baidu index it, because Wikipedia is generally blocked by the GFW. A possible explanation is that the GFW’s keyword filtering is not 100% consistent or accurate, so the crawlers were able to index sites that are normally blocked.

I am not familiar with www.epicbook.com, but it is hosted inside China and is thus an unlikely candidate to host information that the government would want to censor. The fact that it is indexed by the other three search engines supports this. This could be a case of “collateral damage”. Microsoft’s lack of transparency leaves open the possibility that the site is not censored but simply not indexed; however, it is indexed in the English version of Microsoft’s search engine.

The “magnify” component of this project matches the top ten results from Google/Yahoo against the top ten results from the China-specific versions of Google/Yahoo in order to note the similarities and differences. Each website is classed as censored, returned (the website is in the top ten of both the .com and .cn versions of the search engine) or indexed (the website is in the top ten of the .com version and is not censored, but is not in the top ten of the .cn version). It also compares the results based on whether each website is hosted in China or ends in .cn. This is taken as a measure of “authorized” content that is unlikely to present information that China would block.
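The three labels and the “authorized” test can be sketched as two small functions. This is an illustration of the definitions in the paragraph above, not the project's code; the function names and the example.* domains are hypothetical, and the country code is assumed to come from a separate hosting lookup.

```python
def magnify_label(domain, cn_top, censored):
    """Label a domain taken from the .com top ten."""
    if domain in censored:
        return "censored"
    if domain in cn_top:
        return "returned"  # in the top ten of both versions
    return "indexed"       # not censored, but outside the .cn top ten


def is_authorized(domain, country_code):
    """True when the site is hosted in China or its domain ends in .cn."""
    return country_code == "CN" or domain.endswith(".cn")


com_top = ["www.hrw.org", "a.example.org", "b.example.cn"]
cn_top = ["a.example.org", "c.example.cn"]
censored = {"www.hrw.org"}

for d in com_top:
    print(d, magnify_label(d, cn_top, censored))
```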

As noted above, there is only one censored site; the other nine results are also returned in the top ten of the censored .cn version of Google.

However, even in google.com, 4 of the top 10 sites are hosted in China or end in .cn, leaving only 6 sites to represent alternative information. When the censored site is removed, the google.cn version moves to a 50/50 split between authorized and potentially unauthorized information. While this case doesn’t show a dramatic difference, other search queries, particularly those specific to contextually relevant information, often do.

The yahoo.com vs. yahoo.cn comparison uses the top 10 results from yahoo.com. In this result set, 8 of the sites returned in yahoo.com are censored; the remaining 2 are indexed but not returned in the top 10 in yahoo.cn.

While all 10 results in yahoo.com are hosted outside of China, all 7 results in yahoo.cn are hosted in China or end in .cn. This helps show how significant the censored sites are in comparison.

Although the total number of censored sites may be low, especially when compared to the number of indexed sites, the significance of these sites in providing alternative information should not be underestimated.
