Search engines are increasingly censoring their results, often by geographic location, with a significant negative impact on the right to freedom of expression. The most advanced cases of censoring political content are in search engines that market a version of their product in China. This project aims to expose and monitor the censorship practices of search engines with a specific focus on China.
Building upon efforts to assess the level of transparency (reading this first is probably a good idea) with regard to search engine censorship, this project aims to compare the level of censorship across the China-specific search engines of Google, Yahoo, and Microsoft as well as the domestic Chinese search engine, Baidu. The goal of comparison poses some significant methodological problems as the presence or absence of censorship notification, mechanism of censorship (and irregularities therein) and physical location of the servers themselves all add additional layers of complexity.
In attempting to develop an automated system that can reasonably compare the search engines, some additional methods that would be well suited to one search engine, but for which comparable data could not be generated from the others, have been moved to separate search engine-specific projects. As a result of the focus on comparability, the methods outlined below not only build upon existing research in this area but can hopefully explain some of the anomalies previously identified. After reviewing previous reports by Reporters Without Borders and Human Rights Watch, I sketch out methods that attempt to provide an accurate, automated comparison between the search engines.
In June 2006, Reporters Without Borders (RSF) conducted a comparison of the four search engines Google, Yahoo, Microsoft, and Baidu (later updated to also include Sohu and Sina) by entering key words into the search engines and analyzing 1) the presence or absence of any results and 2) the content of the results by classifying each returned web site (URL) as either “authorized” or “unauthorized”, which presumably refers to whether the source is controlled by or supports the government of China or whether it contains critical, alternative information. While this report is an innovative attempt at comparison, it suffers from methodological issues that affect the accuracy of the results.
First, the report is actually about the ranking of results rather than censorship. The top ten results were analyzed based on their content, not on whether a web site had been censored (de-listed/removed) from the results set. While the removal of censored sites will likely affect the mix of “authorized” vs. “unauthorized” sources, it does not tell us which sites are censored or whether the “unauthorized” sites are not censored but just do not appear in the top ten results. Localized search engines often algorithmically privilege sites that are in the local language, that end in the country’s domain suffix (e.g. .cn) and possibly even that are hosted within the country, and this affects where foreign-hosted “unauthorized” content appears in the result set. Thus an “unauthorized” site may not appear in the top ten results of the localized search engine even though it does in the uncensored version. Instead, the site may appear further down in the rankings.
Second, the testing of the search engines did not account for China’s national filtering system, often labeled the Great Firewall of China (GFW). Consequently, the results concerning “no results” and “no results + user banned” should actually be seen in reverse. Since Yahoo and Baidu are physically located in China the search queries made by RSF were filtered by the GFW on their way to the Yahoo and Baidu servers. If the same search were conducted from China, the search queries would not pass through the GFW and would not be filtered. The search queries RSF made to Google and Microsoft did not pass through the GFW because those servers are not located in China and therefore results were always returned. However, had those same search queries been made from China to Google and Microsoft they would have been filtered by the GFW and would have been designated “no results” and “no results + user banned”. The failure to account for the GFW prevented RSF from accurately interrogating the filtering of the search engines because a distinction was not made between filtering by the search engines and filtering by the GFW. If the tests had been conducted from inside, rather than outside, of China the report would have captured the behaviour experienced by users in China who are censored by both the GFW and the search engines and perhaps are agnostic about which one is doing the censoring since the result is the same: censorship.
In August 2006, Human Rights Watch (HRW) released an impressive and detailed comparison of Google, Yahoo, Microsoft, and Baidu. Two approaches were used in this report: the first focused on identifying censored sites, the second on whether or not the result set returned from a search for a specific key word query was censored. The first approach involved using a list of 25 websites and searching for each website in each search engine (using the site: modifier, discussed below, when possible). If a “censorship notification” appeared and there were no results, the web site was considered censored, but the report also noted instances in which the message appeared but some partial results appeared as well. In other cases, there were no results and, since there was also no censorship notification (or a censorship notification that always appears and has no relationship with the results), it was suspected that the web site was censored. In this way, HRW was able to determine how many of the 25 sites were censored in each search engine.
HRW tested from both inside and outside of China and was thus able to isolate search engine filtering from that conducted by the GFW of China. HRW notes that queries to Yahoo from outside China generated errors (as we saw in the RSF study) and we now know that this is due to the bi-directional filtering of the GFW (see below). The partially censored results (what I call “Page Censored” below) can arise for at least two reasons. The first is that some search queries automatically trigger the censorship notification regardless of whether the results have been censored or not. The second is that the filtering algorithms of the search engines are imperfect. Google, for example, does not handle port numbers properly and fails to remove such pages, and Microsoft does not handle domains by their root (domain.com) and therefore sub-domains (www.domain.com or dom.domain.com) may not all be removed. Microsoft also does not properly handle URLs that begin with “https”. In such cases partial results may be available despite the search engines’ attempts to censor.
Another issue (which is still an issue in the methodology discussed below) concerns search engines not censoring pages directly. Both Yahoo and Baidu operate the crawlers that index websites from inside China and thus do not index sites that are blocked by the GFW. This removes the need for the search engines to censor their results, as the index itself is already censored by the GFW. This means that there is not a credible technical way to distinguish between sites that are not indexed and sites that are censored. Another issue is that the GFW is not perfect, and normally censored sites sometimes end up in Yahoo and Baidu’s index. There have also been some cases in which Yahoo has removed indexed sites (those not blocked by the GFW) and used a censorship notification as Google does and Microsoft did previously. Therefore, for the most part, Yahoo and Baidu do not need to censor their results, because their index is already censored: their crawlers operate from within China and cannot visit blocked sites to begin with.
The second approach used by HRW focused on the issue of keyword filtering by search engines. The question is simple enough: if I search for keyword “a”, will I get censored results “b”? However, the lack of transparency on the part of the search engines makes this simple question difficult to answer. HRW used a list of 25 keywords to query the search engines and inferred possible censorship by comparing the results from the censored China-specific versions of Google, Yahoo and Microsoft with those of their US counterparts, as well as by noting the appearance of a censorship message. (Baidu had no such counterpart at the time, but perhaps Baidu Japan can now be used for this purpose.)
Comparing result sets can be problematic because of the algorithmically determined rank of the results. What appears on page one in the top ten results in google.com may appear on page twenty-five in google.cn. In the case of Yahoo and Baidu, GFW-censored sites are not indexed at all and so will never appear no matter what one searches for. Another method is to use the difference in the estimated page count as an indicator of censored results. But the estimated page counts can vary considerably between servers and language/region-specific versions. Microsoft, for example, returns very few Chinese language results in their default English language search engine, making comparison virtually impossible. As noted by HRW, even the presence of the censorship notification may not be reliable. In some cases the censorship notification will appear based on the keywords in the query, not on the results returned. (You can restrict the results to a non-existent site and still get the censorship message.) In other cases, it has nothing to do with what was used as a query (for example, a non-politically sensitive term) but the censorship message appears because a URL has been removed/de-listed. In other cases, some keyword queries return results and a censor message not because results have been removed but because results are only returned from a set of “white listed” sites. Compounding the problem, the censorship message appears to be page specific (at least in the case of Google). That is, if one searches for keyword “x” and gets back ten results there may be no censorship message, but when one clicks on “Page 2” and gets results 11-20, which do contain a censored site, the censorship notification will appear. (Therefore, if one sets the preferences to retrieve 100 results, will one be more likely to encounter the censorship notification than if restricted to 10 results?)
HRW accounted for such variance by manually checking results in addition to the estimated page count comparisons and the presence of a censorship notification. Not only does this involve extensive manual labour, it also requires expertise in analyzing the content for political significance. For example, HRW manually assessed and compared the first three pages of search results for Yahoo and Yahoo China. HRW’s efforts in this regard stand out as an example of the quality needed for this line of research.
The Search Monitor Project currently contains two related but separate components. The first, Generalized Comparison : Keywords and Urls, focuses on a generalized comparison between the China-specific versions of Google, Yahoo, Microsoft and Baidu. The second, Magnify: A Google-Google, Yahoo-Yahoo comparison, focuses on comparisons between the Chinese-language “global” versions of Google and Yahoo and their special censored China-specific versions.
While the core testing methods are the same, the Magnify: A Google-Google, Yahoo-Yahoo comparison contains some additional elements that allow for a more fine grained analysis. These will be noted below when appropriate.
Generating a URL Set
A set of sixty keywords has been selected covering the broad topical categories of censorship circumvention, Falun Gong/Dafa, political sensitivities and social taboos. Search queries in “uncensored” engines (the Chinese language versions of Google or Yahoo) are used to generate lists of sites that are checked in censored search engines.
A query term, such as “人权” (human rights), is used to retrieve results from an “uncensored” search engine, such as google.com.
The websites from the “uncensored” results are parsed to retrieve the domain (including sub-domains).
A list of URL results, usually ten, is retrieved. A URL such as http://www.hrw.org/chinese/ is shortened to www.hrw.org.
Each domain name is checked in each censored search engine.
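The shortening step described above can be sketched in Python. This is only an illustration, not the project's actual code: the function names `host_of` and `unique_hosts` are hypothetical, and the sketch assumes each result is an absolute URL with a scheme.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lower-cased host name of a URL, keeping sub-domains."""
    host = urlsplit(url).hostname
    if host is None:
        # Assumption: results are absolute URLs; anything else is rejected.
        raise ValueError(f"no host name found in {url!r}")
    return host


def unique_hosts(urls):
    """Reduce result URLs to host names, keeping order and dropping repeats."""
    hosts = []
    for url in urls:
        host = host_of(url)
        if host not in hosts:
            hosts.append(host)
    return hosts


if __name__ == "__main__":
    print(unique_hosts(["http://www.hrw.org/chinese/", "http://www.hrw.org/"]))
```

Keeping the full host name rather than the registered root domain matches the description above, in which sub-domains are retained.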
Determining a Censored site
These domains are checked in the censored search engines using the “site:” modifier. The “site:” modifier restricts the results set to pages of a specific host name.
“site:www.hrw.org” (without the quotes) is used as a search term in censored search engines to restrict the results to only those from the web site www.hrw.org.
Some of the censored search engines being tested display a special message, a “censor message”, indicating that results have been censored, and that message relates to the specific search query. In those cases, domains that produce no results when queried with the “site:” modifier and that show a censor message are labelled as “Censored”, while domains that return some results but show a “censor message” are labelled as “Page Censored”.
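As a rough sketch, the query term can be built and URL-encoded as follows. The endpoint `https://search.example/search` and the parameter name `q` are placeholders chosen for illustration, not the real interface of any of the engines discussed here.

```python
from urllib.parse import urlencode


def site_query(host: str) -> str:
    """Build the search term that restricts results to one host name."""
    return f"site:{host}"


def search_url(endpoint: str, host: str, param: str = "q") -> str:
    """Attach the encoded site: query to a (hypothetical) search endpoint."""
    return f"{endpoint}?{urlencode({param: site_query(host)})}"


if __name__ == "__main__":
    # Placeholder endpoint; each real engine has its own URL and parameters.
    print(search_url("https://search.example/search", "www.hrw.org"))
```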
In the cases where there is no censor message, or the censorship message appears on every page and bears no connection to the results, domains that produce no results when queried with the “site:” modifier are labelled as “Censored.”
Depending on the current behaviour of search engines there may be ad hoc additions.
http://www.google.cn/ – censored = censor message + 0 results, pagecensored = censor message + some results
http://www.live.com/?mkt=zh-cn – censored = 0 results, pagecensored = results that only contain urls beginning with “https” (there is no longer a censor message; the failure to exclude “https” urls was noted when the censor message was in place and is thus used as “Page Censored”)
http://www.yahoo.cn/ – censored = 0 results, censor message is ignored because it appears on every page and bears no relation to search results
http://www.baidu.com/ – censored = 0 results
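The per-engine rules listed above could be expressed as a small classifier. The following is a minimal sketch under stated assumptions: the `SiteResult` container, the short engine keys, and the catch-all label “NotCensored” are all hypothetical names introduced here, and the sketch assumes the result URLs and the presence of a censor message have already been extracted from the result page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteResult:
    """Outcome of one site: query (hypothetical container)."""
    result_urls: tuple            # URLs returned for the site: query
    censor_message: bool = False  # True if a censorship notice was shown


def classify(engine: str, result: SiteResult) -> str:
    """Apply the per-engine rules; 'NotCensored' is an illustrative label."""
    count = len(result.result_urls)
    if engine == "google.cn":
        # The censor message is tied to the query, so it drives the label.
        if result.censor_message:
            return "Censored" if count == 0 else "PageCensored"
        return "NotCensored"
    if engine == "live.com":
        if count == 0:
            return "Censored"
        # Only "https" URLs surviving is treated as partial censorship.
        if all(url.startswith("https") for url in result.result_urls):
            return "PageCensored"
        return "NotCensored"
    if engine in ("yahoo.cn", "baidu.com"):
        # Any censor message is ignored; only an empty result set counts.
        return "Censored" if count == 0 else "NotCensored"
    raise ValueError(f"no rules defined for {engine!r}")


if __name__ == "__main__":
    print(classify("google.cn", SiteResult((), censor_message=True)))
```

Keeping the rules in one function makes the ad hoc, per-engine adjustments mentioned above easy to change as engine behaviour changes.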
It is important to note that sites that are simply not indexed by the search engine will appear as “Censored”, thus possibly inflating the total amount of censorship attributed to search engines that do not have a censor message that is related to the results. This can be slightly compensated for by looking at the overlap of censored sites among search engines. In addition, since this is a normative project advocating transparency, this should serve as an incentive for search engines to implement a censor message that is related to the results.
The Magnify: A Google-Google, Yahoo-Yahoo comparison project contains the classifications “Returned” and “Indexed” in addition to “Censored” and “PageCensored”. “Returned” refers to URLs from the result set of the “uncensored” search engine that are also returned in the result set of the censored search engine. “Indexed” refers to URLs from the result set of the “uncensored” search engine that are not returned in the result set of the censored search engine, but are not censored. Using this method, the top ten results of a query in Google (Chinese) can be compared with the top ten results of Google China and categorized as “Returned” (sites that are common to both result sets), “Indexed” (sites in the top ten uncensored results but not in the top ten censored results) or “(Page)Censored” (results that are actually censored). In addition, each URL in both result sets is checked to see if it is hosted in China or ends in a .cn domain suffix.
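A minimal sketch of this four-way labelling is shown below. The function names are hypothetical, and two assumptions are made that the text does not settle: a censored status from the “site:” check takes precedence over appearing in the censored result set, and only the .cn suffix is tested (whether a site is hosted in China would need an IP geolocation lookup, which is left out).

```python
def magnify_labels(uncensored_top, censored_top, site_status):
    """Label each host from the uncensored top results.

    site_status maps a host to "Censored" or "PageCensored" when the
    site: check flagged it; hosts missing from the map are not censored.
    """
    returned = set(censored_top)
    labels = {}
    for host in uncensored_top:
        status = site_status.get(host)
        if status in ("Censored", "PageCensored"):
            labels[host] = status      # assumption: censorship wins
        elif host in returned:
            labels[host] = "Returned"  # present in both result sets
        else:
            labels[host] = "Indexed"   # not censored, just ranked lower
    return labels


def has_cn_suffix(host: str) -> bool:
    """True if the host name ends in the .cn country-code suffix."""
    return host.lower().endswith(".cn")


if __name__ == "__main__":
    print(magnify_labels(
        ["a.example", "b.example", "c.example.cn"],
        ["c.example.cn"],
        {"a.example": "Censored"},
    ))
```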
The Great Firewall (GFW)
Borrowing a phrase from my colleagues Richard Clayton, Steven Murdoch and Robert Watson, it is necessary to ignore the filtering conducted by China to accurately test levels of censorship by the search engines themselves. As Clayton and Murdoch reveal, Internet traffic to and from China passes through a filtering system that is bi-directional (it affects both inbound and outbound traffic) and that disrupts connections if the presence of particular keywords is detected. (China also blocks outbound connections to IP addresses, but this does not interfere with the ability to test the search engines.) Often, China will designate a domain name as a “key word”, thus disrupting access for any request that contains that domain name. This is important as queries directed to search engines hosted in China use the “site:” modifier followed by a domain name.
In order to avoid interference from China’s filtering system, the China-specific versions of Google and MSN, which are hosted outside of China, are queried from outside of China and the China-specific versions of Yahoo and Baidu, hosted inside China, are queried from inside China.
The censored search engines, http://www.google.cn/ and http://www.live.com/?mkt=zh-cn are checked from outside of China.
The censored search engines, http://www.baidu.com/ and http://www.yahoo.cn/ are checked from inside of China.
In addition to affecting how to test each search engine, the location of the search engine relative to the GFW also affects how the search engines censor. Google and Microsoft, located outside of China, must remove, or de-list, specific sites from the results. Yahoo! and Baidu both operate their search spiders from inside China. This results in a situation where, because of China’s gateway filtering, the crawlers that index content for these search engines cannot access sites that China blocks.
184.108.40.206 - - [08/Feb/2008:08:05:40 -0500] "GET / HTTP/1.1" 200 12258 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
220.127.116.11 - - [08/Feb/2008:09:04:42 -0500] "GET / HTTP/1.1" 200 12258 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
18.104.22.168 - - [08/Feb/2008:11:46:31 -0500] "GET / HTTP/1.1" 200 12258 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
22.214.171.124 - - [07/Feb/2008:16:58:33 -0500] "GET /robots.txt/ HTTP/1.0" 200 24 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)"
126.96.36.199 - - [07/Feb/2008:16:58:35 -0500] "GET / HTTP/1.0" 200 19068 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)"
Thus Yahoo! rarely has to de-list specific websites; most are just not indexed in the first place. However, this also leads to situations in which sites blocked by China and de-listed by Google and Microsoft are indexed by Yahoo!. The GFW is not 100% effective and occasionally crawlers operating from inside China are able to index a normally blocked site, which then appears in their search results.
It is also important to note that sites indexed by the search engines that are blocked by the GFW will still be inaccessible to users in China.
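Crawler visits like those in the access-log excerpt above can be picked out of a web server log by user agent. The sketch below assumes the common “combined” log format with plain ASCII quotes; the regular expression, the marker table and the function name are illustrative, and the sample line uses the reserved documentation address 192.0.2.10 rather than a real crawler address.

```python
import re

# Combined log format: ip, two identity fields, [time], "request",
# status, size, "referrer", "user agent".
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# User-agent substrings taken from the log excerpt in this section.
CRAWLER_MARKERS = {
    "Baiduspider": "Baidu",
    "Yahoo! Slurp China": "Yahoo China",
}


def crawler_hits(lines):
    """Return (crawler, ip, request) for each line from a known crawler."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        for marker, name in CRAWLER_MARKERS.items():
            if marker in match.group("agent"):
                hits.append((name, match.group("ip"), match.group("request")))
    return hits


if __name__ == "__main__":
    sample = ('192.0.2.10 - - [08/Feb/2008:08:05:40 -0500] "GET / HTTP/1.1" '
              '200 12258 "-" '
              '"Baiduspider+(+http://www.baidu.com/search/spider.htm)"')
    print(crawler_hits([sample]))
```

The extracted addresses could then be checked against an IP geolocation source to confirm where each crawler operates from; that lookup is not shown here.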
Types of Results
Generalized Comparison : Keywords and Urls
This component shows each of the keywords used as search queries in an “uncensored” search engine and the number of the URLs from that result set that are censored in the China-specific versions of Google, Yahoo, Microsoft and Baidu. It also shows the domains that are censored by any one of the four search engines and each domain’s status with regard to the other three engines. In this way we can compare the number of censored URLs per search query across all four search engines, as well as build a list of censored domains and compare the level of censorship across all four search engines.
Magnify: A Google-Google, Yahoo-Yahoo comparison
This component focuses on two of the search engines that have comparable censored and “uncensored” versions: Google and Yahoo. Microsoft’s “global” search engine in English contains very few Chinese sites and cannot really be compared to their Chinese version. Even other versions, such as Hong Kong and Taiwan, have such drastically different results when compared to the Chinese version that Microsoft is a difficult fit for this model of testing. (At this time Baidu Japan has not been sufficiently investigated but it may offer an opportunity for comparison with Baidu China.)
The data collection for this section focuses on a direct comparison between google.com (in Chinese) and google.cn, and between yahoo.com (in Chinese) and yahoo.cn. It expands upon the collection of censored sites and pages by looking at the overlap of returned pages (pages that appear in the results set for the same query for the same number of results in both search engines) as well as indexed pages. It also tracks which sites are hosted in China or end in a .cn domain suffix.
Organized in this way, the results raise questions regarding the nature of the censorship process as well as the censored content. In terms of process, critical questions have been frequently posed concerning the specificity of the censorship requirements communicated to these search engines by the Chinese government. Are search engines given a list of keywords or a list of web sites that they are to censor? Or is there just a general reference to a type of content, leaving search engines to infer what exact content to block? How significant are the censored web sites, since they only represent a small fraction of indexed sites? How frequently are users’ search results censored in relation to the topics they search for? Which specific web sites are actually censored? What types of web sites are censored?
In an effort to provide some insight regarding the question of process, the project will measure the overlap between all the search engines as well as subsets that are functionally similar (Google/Microsoft, Yahoo/Baidu). Overlap refers to the sites that are censored by multiple search engines. Overlap is analyzed in two ways: the first focuses on sites that are censored by all search engines tested, the second focuses on search engines that censor using similar mechanisms. While this allows for a comparison among search engines, it also acts as an indicator of whether the search engines are responding to specific blocking requests, usually associated with an official order, or to a general determination on the part of the company, perhaps based on topic areas provided by officials.
Since the total number of censored sites is likely to be relatively small compared to the total number of indexed sites, this project proposes a measure of significance in order to show just how important the censored sites are in relation to those displayed to the user. Significance refers to the number of top ten sites returned from an uncensored search engine that are censored in the China-specific version in relation to those that are either hosted in China or that end in a .cn domain suffix. China could, presumably, take action against those sites under its jurisdiction without having to resort to blocking. In this context, these sites are considered to be “authorized” and are unlikely to contain information that presents an alternative perspective to that approved by the government. In this component, results that are returned in the top ten, along with those that are indexed but not displayed in the top ten, are distinguished from those that are censored. The significance is demonstrated by the absence of top ten results outside of China’s control among a majority of sites that are within the top ten.
Analyzing the web sites found to be censored is an important indicator of the type of content that the government of China wants to block (or of the interpretation of this interest by the search engine companies). Content refers to the type of websites that are targeted for censorship, not the content of individual articles contained within them. This is accomplished through the creation of categories to which censored sites are assigned. This component is the most problematic for automatic determinations. Web sites could be classified using various services that provide categorized URLs but the results may be less than desirable. Operationalizing this component properly would likely require, as suggested by Rebecca Mackinnon, analysis by “a team of near-native Chinese speakers who are highly tuned-in to what the sensitive media topics.” Thus this component remains the least developed aspect of the project.
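Both kinds of overlap described above reduce to set intersections over the censored domains recorded for each engine. The sketch below is illustrative only: the function name, the engine keys and the sample host names (all under the reserved `.example` domain) are made up for the demonstration.

```python
def overlap(censored_by_engine, engines=None):
    """Hosts censored by every engine in `engines` (default: all engines)."""
    selected = list(engines or censored_by_engine)
    sets = [set(censored_by_engine[name]) for name in selected]
    return set.intersection(*sets) if sets else set()


if __name__ == "__main__":
    # Made-up data: engine name -> hosts labelled "Censored".
    censored = {
        "google": {"a.example", "b.example", "c.example"},
        "microsoft": {"a.example", "b.example"},
        "yahoo": {"a.example", "d.example"},
        "baidu": {"a.example", "d.example", "e.example"},
    }
    print(sorted(overlap(censored)))                         # all four engines
    print(sorted(overlap(censored, ["google", "microsoft"])))  # de-listing pair
    print(sorted(overlap(censored, ["yahoo", "baidu"])))       # crawler pair
```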
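One possible reading of this significance measure is a pair of counts over the top ten uncensored results: how many are censored in the China-specific version, and how many fall under China's control. The sketch below follows that reading and should not be taken as the project's definitive formula. The `TopResult` container and function names are hypothetical, and the `hosted_in_china` flag is assumed to come from a separate lookup that is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopResult:
    """One top-ten result from the uncensored engine (hypothetical)."""
    host: str
    label: str                     # "Returned", "Indexed", "Censored", ...
    hosted_in_china: bool = False  # assumed to come from a separate lookup


def under_china_control(result: TopResult) -> bool:
    """Hosted in China or ending in the .cn suffix."""
    return result.hosted_in_china or result.host.lower().endswith(".cn")


def significance(results):
    """Count censored results against those under China's control."""
    censored = sum(
        1 for r in results if r.label in ("Censored", "PageCensored")
    )
    controlled = sum(1 for r in results if under_china_control(r))
    return {
        "total": len(results),
        "censored": censored,
        "china_controlled": controlled,
    }


if __name__ == "__main__":
    sample = [
        TopResult("a.example", "Censored"),
        TopResult("b.example.cn", "Returned"),
        TopResult("c.example", "Returned", hosted_in_china=True),
        TopResult("d.example", "Indexed"),
    ]
    print(significance(sample))
```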