Posts tagged “Search Engines”

google.cn -> google.com.hk



Yesterday Google began redirecting requests for google.cn to google.com.hk, effectively ending its years of self-censorship in China. To be clear, Google has not ended censorship in China; Google has ended its own self-censorship.

While searches within the .hk Google are not censored by Google, they will still be affected by China’s keyword filtering. This means that queries for certain terms will not get through to the google.com.hk search engine and the end user in China will not get any results.

Even if a user in China uses search queries that are not filtered by China and retrieves results from Google’s .hk version, they will still be affected by China’s filtering if they click on a link and try to view those results directly.

What’s the difference? Users in China will be affected by China’s filtering, not Google’s. The difference is in the user’s experience — instead of retrieving results and carrying on as if censorship did not exist (disclaimer aside), the user now experiences the censorship first hand.

It is true that the user will not get any results from Google for queries that are filtered by China. This may result in quantitatively less information, but not necessarily qualitatively less (see here and here). Even if a controversial site slipped through the self-censorship, it would be picked up by China’s filtering if the user tried to access it directly.

The move removes Google from an ethically challenged situation and has raised awareness globally regarding China’s censorship practices.

Remember: Microsoft and Yahoo! are still censoring their China facing search engines.

Yahoo, MSN Censor More than Baidu



China unblocked many usually censored web sites following intense international pressure and scrutiny after having promised uncensored access during the Olympics. Five days later (August 6, 2008) I tested the search engines that Google, Yahoo! and Microsoft customize for the Chinese market as well as the leading domestic search engine, Baidu. I found that all of the search engines were still censoring content that was unblocked by China. One interesting finding was that Yahoo! was censoring less than all the others and Baidu (and Google) were censoring much less than Microsoft.

For purposes of comparison, Google and Microsoft make a good match because both have to de-list web sites from search results, while Yahoo! and Baidu index from within China and thus do not (usually) index sites already censored by China. (For more, read my report on search engine comparison.)

Now, over a month later, things have changed. While these sites remain accessible in China, some are still censored by the search engines. Google has dropped to censoring only two sites and is now censoring the least content. Baidu is next with three censored sites. Microsoft remained steady, but Yahoo! has shifted from censoring the fewest sites to the most!

The divergence between Yahoo! and Baidu is very interesting. If both crawl from within China and are subject to China’s filtering why is Yahoo! censoring so much more than Baidu? It could be that the conclusion that Yahoo! and Baidu do not de-list content is not fully accurate. If the sites are accessible in China then Yahoo! is likely de-listing the sites. Because of the suboptimal method of censorship notification employed by Yahoo! (a standard disclaimer on every page regardless of whether any of the results are censored or not) I cannot fully distinguish between sites that are de-listed and sites that have not been indexed (e.g. because China blocks them).

I’m still struck by the fact that over a month later sites that are available and uncensored in China are still censored by these search engines.

DOMAINS | Google | Yahoo | Microsoft | Baidu
ip | 203.208.39.99 | 202.165.102.243 | 202.89.236.206 | 202.108.22.43
host | www.google.cn | one.cn.yahoo.com | cnweb.search.live.com | www.baidu.com
chinese.wsj.com | OK | OK | OK | OK
cn.reuters.com | OK | OK | OK | OK
news.chinatimes.com | OK | CENSORED (0) | CENSORED (0) | OK
olympics.scmp.com | OK | OK | OK | OK
udn.com | OK | OK | OK | OK
www.amnesty.org | OK | CENSORED (0) | CENSORED (0) | CENSORED (0)
www.atchinese.com | OK | CENSORED (0) | CENSORED (0) | OK
www.ftchinese.com | OK | OK | OK | OK
www.hrw.org | OK | CENSORED (0) | CENSORED (0) | CENSORED (0)
www.libertytimes.com.tw | CENSORED (0, message) | OK | OK | OK
www.mingpaomonthly.com | OK | OK | OK | OK
www.mingpaonews.com | OK | CENSORED (0) | CENSORED (0) | OK
www.rfa.org | CENSORED (0, message) | CENSORED (0) | CENSORED (0) | OK
www.rsf.org | OK | CENSORED (0) | CENSORED (0) | OK
www.scmp.com | OK | OK | OK | OK
www.voanews.com | OK | CENSORED (0) | CENSORED (0) | CENSORED (0)
www.yzzk.com | OK | CENSORED (0) | OK | OK
www1.appledaily.atnext.com | OK | CENSORED (0) | OK | OK
zh.wikipedia.org | OK | CENSORED (0) | CENSORED (0) | OK
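
The per-engine totals quoted in the text can be re-derived from the table above. What follows is a minimal Python sketch, not the script used for the tests; the names `CENSORED_BY` and `tally` are illustrative, and only rows with at least one CENSORED cell are transcribed.

```python
from collections import Counter

ENGINES = ("Google", "Yahoo", "Microsoft", "Baidu")

# Rows transcribed from the table above. Domains marked OK by all four
# engines are left out because they do not change the totals.
CENSORED_BY = {
    "news.chinatimes.com": {"Yahoo", "Microsoft"},
    "www.amnesty.org": {"Yahoo", "Microsoft", "Baidu"},
    "www.atchinese.com": {"Yahoo", "Microsoft"},
    "www.hrw.org": {"Yahoo", "Microsoft", "Baidu"},
    "www.libertytimes.com.tw": {"Google"},
    "www.mingpaonews.com": {"Yahoo", "Microsoft"},
    "www.rfa.org": {"Google", "Yahoo", "Microsoft"},
    "www.rsf.org": {"Yahoo", "Microsoft"},
    "www.voanews.com": {"Yahoo", "Microsoft", "Baidu"},
    "www.yzzk.com": {"Yahoo"},
    "www1.appledaily.atnext.com": {"Yahoo"},
    "zh.wikipedia.org": {"Yahoo", "Microsoft"},
}


def tally(censored_by):
    """Count, for each engine, how many domains it censors."""
    counts = Counter({engine: 0 for engine in ENGINES})
    for engines in censored_by.values():
        counts.update(engines)
    return dict(counts)


totals = tally(CENSORED_BY)
print(totals)
```

Run against these rows, the tally gives Google 2, Baidu 3, Microsoft 9 and Yahoo 11, which matches the ordering described in the text.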

Free Expression Principles



Major technology companies, including Google, Yahoo! and Microsoft, have agreed, in principle, to a voluntary set of principles designed to “guide businesses when they encounter laws and practices that may contravene international human rights standards or be at odds with law or culture in their home jurisdiction.” The objective is to protect and advance freedom of expression and privacy. Included in this initiative are mechanisms to provide for ongoing learning as well as the monitoring of compliance.

Google, Yahoo! and Microsoft sent letters to Sen. Durbin announcing the agreement. The letters re-state each company’s commitment to freedom of expression and highlight the core components of the initiative including the principles, the implementation guidelines and the accountability and learning framework.

Google’s letter draws on my report that compared Google, Yahoo! and Microsoft’s search engines along with the domestic Chinese company Baidu. The most significant point centers on the impact of engagement. I found that the presence of foreign search engines resulted in an increased amount of information being available to Chinese Internet users. More specifically, I found that:

When the results from Google, Microsoft and Yahoo are combined, 20% of the sites censored by Baidu are available. However, individually they provide more information, especially Google and Microsoft which provide, on average, 51% and 55% more content (content not available in Baidu) while Yahoo! averages 25% more.

Since the search engines were censoring different content, mixing searches across multiple search engines made it possible to find sites censored by the other search engines.

Also, I noted that Baidu, the leading Chinese search engine, had introduced a censorship notification following the lead of the foreign search engines. Unlike the foreign search engines, which are under pressure from their home governments, Baidu is not. While still a small step, it shows that engagement can make a difference and that industry standards are important. That is why I think the principles for free expression and privacy are so important. They present a united effort and set an industry standard.

Engagement certainly presents a series of hard choices, but is a better choice than disengagement when it comes to information and communications technologies. These technologies build the bridges that connect diverse people and places; putting up barriers is what the censors do. I find it hard to believe that the promotion of free expression is served in Iran by denying Iranians access to the Java programming language.
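
The effect of mixing searches can be illustrated with simple set arithmetic. This is only a sketch under stated assumptions: the four rows are taken from the August 6, 2008 results table elsewhere on this page, and the names `censored_by` and `engines_still_listing` are made up for the example.

```python
ENGINES = {"Google", "Yahoo", "Microsoft", "Baidu"}

# Engines that returned zero results for a "site:" search on each domain
# (four rows from the August 6, 2008 table on this page).
censored_by = {
    "www.amnesty.org": {"Yahoo", "Microsoft", "Baidu"},
    "www.hrw.org": {"Google", "Microsoft", "Baidu"},
    "www.rfa.org": {"Google", "Microsoft"},
    "www.rsf.org": {"Microsoft"},
}


def engines_still_listing(domain):
    """Return the engines that do not censor the given domain."""
    return ENGINES - censored_by.get(domain, set())


for domain in sorted(censored_by):
    print(domain, "->", sorted(engines_still_listing(domain)))
```

In this sample every domain is still listed by at least one engine, even though no single engine lists all four.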

The catch here is that this agreement and these principles are not an end point but a starting point. As I noted in my report the overall level of transparency is low — there is work to be done in this area. The process for determining what to censor is still unclear and supports the secrecy and unaccountability of China’s censorship policies. Even within a restrictive environment such as China I believe there is much more that can be done. (See below). I also showed that while the total amount of censorship may not be high, the significance of the censored sites is important.

These censored sites are often the only sources of alternative information available in the top ten results for politically sensitive search queries. Moreover, even the uncensored versions of these search engines highly rank content that is hosted in China or ends in the domain suffix .cn, both of which China retains control over and are thus unlikely to present alternative information.

China recently unblocked many censored web sites following intense international pressure and scrutiny, having promised uncensored access during the Olympics. Andrew Lih tested a sample of websites normally censored in China and found them to be accessible. The web sites of human rights groups such as Human Rights Watch, Reporters Sans Frontières and Amnesty International are all now accessible.

Andrew posted his test results on August 1st, 2008. Five days later, search engines are still censoring sites that are now unblocked in China. For example, Yahoo!, Microsoft and Baidu are still censoring www.amnesty.org while Google is not. Google, Microsoft and Baidu are still censoring www.hrw.org while Yahoo! is not. (Yahoo! has only one result, www.hrw.org/russian. I’m not sure how many Russian speakers there are in China; anyone know?) Only Microsoft is still censoring www.rsf.org; even Baidu is not. In fact, Microsoft is censoring more of these newly unblocked websites than the Chinese company Baidu! Another noteworthy observation is that Yahoo! is censoring the least of these newly unblocked sites.

DOMAINS | Google | Yahoo | Microsoft | Baidu
ip | 203.208.39.99 | 202.165.102.243 | 202.89.236.206 | 202.108.22.43
host | www.google.cn | one.cn.yahoo.com | cnweb.search.live.com | www.baidu.com
chinese.wsj.com | OK | OK | OK | OK
cn.reuters.com | OK | OK | OK | OK
news.chinatimes.com | OK | OK | CENSORED (0) | OK
olympics.scmp.com | OK | OK | OK | OK
udn.com | OK | OK | OK | OK
www.amnesty.org | OK | CENSORED (0) | CENSORED (0) | CENSORED (0)
www.atchinese.com | OK | OK | CENSORED (0) | OK
www.ftchinese.com | OK | OK | OK | OK
www.hrw.org | CENSORED (0, message) | OK | CENSORED (0) | CENSORED (0)
www.libertytimes.com.tw | CENSORED (0, message) | OK | OK | OK
www.mingpaomonthly.com | OK | OK | OK | OK
www.mingpaonews.com | OK | OK | CENSORED (0) | OK
www.rfa.org | CENSORED (0, message) | OK | CENSORED (0) | OK
www.rsf.org | OK | OK | CENSORED (0) | OK
www.scmp.com | OK | OK | OK | OK
www.voanews.com | OK | OK | CENSORED (0) | CENSORED (0)
www.yzzk.com | OK | OK | OK | OK
www1.appledaily.atnext.com | OK | CENSORED (0) | OK | OK
zh.wikipedia.org | OK | OK | CENSORED (0) | OK

* If at least one result was returned for a “site:” search on a domain, it was marked as OK.
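
The marking rule in the footnote above is simple enough to write down. The sketch below is an illustration rather than the code behind the table; the function name `mark` is hypothetical, and reading “message” as “the engine displayed a censorship notification” is my assumption about the table’s labels.

```python
def mark(result_count, notice_shown=False):
    """Label a domain from the number of results of a "site:" search.

    notice_shown is an assumed flag: True when the engine displayed a
    censorship notification alongside the empty result set.
    """
    if result_count >= 1:
        return "OK"
    if notice_shown:
        return "CENSORED (0, message)"
    return "CENSORED (0)"


print(mark(1))
print(mark(0))
print(mark(0, notice_shown=True))
```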

To be fair, it does take time for search engines to respond. They have multiple servers, and it may take time for them all to be updated. Also, there are differences in implementation between those that crawl and index the web from behind China’s filtering system and those that do not and thus have to “de-list” results. (See the report for details on this.)

Still, I find it difficult to accept that sites that are unblocked in China remain censored in these search engines.

Search Monitor in the Press



China’s Overeager American Censors – Forbes

Practically every U.S.-owned search engine has caved to the Chinese government’s demands that they censor political Web sites in China. But none of them seem to agree on just what sites need censoring.

Google, at times, blocks Chinese users’ access to the BBC while Yahoo! permits it. Yahoo! sometimes filters out Voice of America–Google doesn’t. And Microsoft removes entries from the Chinese version of Wikipedia from its results while every other search engine includes them–even the dominant Chinese search engine Baidu.com.

Confused?

Search Engines’ Chinese Self-Censorship – Technology Review and ABC

A report released last week by the Citizen Lab at the Munk Centre for International Studies at the University of Toronto found that different search engines are blocking fairly different content. “The low overlap means that companies are choosing the exact content to censor or, alternatively, to not censor,” says Nart Villeneuve, a senior research fellow at the Citizen Lab and the author of the report. “That doesn’t mean that they’re not getting guidance from the Chinese government in other ways,” he notes. But Villeneuve says that if search engines are interpreting Chinese policies to decide what to censor, that introduces the possibility that they may block more content than is strictly necessary.

Read more about the Search Monitor Project here and here.

Perspectives on Transparency



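据当地法律法规和政策,部分搜索结果未予显示。 (“In accordance with local laws, regulations and policies, some search results are not displayed.”)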

When Google first added this censorship notification to google.cn, the China-specific version of Google, its significance was largely overshadowed by the fact that they had agreed to censor their search engine at all. Following Google, Yahoo! also added a censorship notification, as did Microsoft. All three companies were grilled before Congressional Committees and by human rights organizations. Now the domestic Chinese search engine Baidu, and others including Soso, Sougou and Yodao, have introduced a censorship notification. What does this mean?

Yahoo! had been censoring their China-specific search engine for years prior to Google’s introduction of censorship, drawing criticism from human rights and free speech advocacy organizations but little from elsewhere. The open acknowledgment of censorship enabled a much broader, well-publicized discussion of the complex issues of censorship in China. These “You’ve been Censored” notifications raised considerable awareness of censorship in China. Of course, it came at the cost of these companies’ compliance with China’s censorship rules, arguably strengthening China’s control of the Internet.

In a recent study I compared the censorship practices of the search engines provided by Google, Microsoft and Yahoo! for the Chinese market along with the domestic Chinese search engine Baidu. I found that although Internet users in China are able to access more information due to the presence of foreign search engines the web sites that are censored are often the only sources of alternative information available for politically sensitive topics. I argued that the wide disparity among the actual web sites that these search engines censor suggests that these companies are determining what to (or not to) censor and that the lack of clarity in the process and the unwillingness of companies to disclose this information acts to bolster China’s current censorship policy that thrives on secrecy and unaccountability.

Since the report was finalized, the domestic Chinese search engine Baidu, following the foreign search engines, introduced a censorship notification, indicating that it is possible to make progress through engagement. Other search engines such as Soso, Sougou and Yodao also had, at least temporarily, a form of notification.

The downside is that these developments normalize censorship. Considering that this latest censorship targeted search terms and resulted in no results being available for those terms, this could be interpreted as a worsening of the situation.

But it is rather remarkable that Baidu has introduced a consistent censorship notification mechanism. Google, Microsoft and Yahoo! have to balance China’s censorship requirements with the pressure they receive from the U.S. Congress and human rights groups and thus have an incentive to be transparent. But Baidu is a domestic Chinese company that does not have such pressures. It is possible that Baidu introduced the notification simply to conform to what has become an industry norm.

It also suggests an increasing openness within China concerning censorship and informs Chinese Internet users — many of whom are not aware of censorship in China at all — that censorship is in fact occurring. While the introduction of censorship notification may seem negligible to some, and it is certainly no reason to become complacent, it is a small first step toward lifting the veil of secrecy and unaccountability that permeates China’s censorship policies. It demonstrates that the leadership of foreign companies can increase transparency even within domestic Chinese companies and, as a result, reaffirms that the further efforts to improve transparency cannot be allowed to remain stagnant.

I’ve been thinking about a range of offensive and defensive strategies that both companies and activists could pursue in order to stimulate further efforts towards transparency, on the part of companies as well as within China, that I hope to post soon.

Search Monitor: Toward a Measure of Transparency



Citizen Lab Occasional Paper #1, “Search Monitor Project: Toward a Measure of Transparency” (mirror), has been released today. This report interrogates and compares the censorship practices of the search engines provided by Google, Microsoft and Yahoo! for the Chinese market along with the domestic Chinese search engine Baidu. It is based on tests conducted between November 2007 and April 2008 focused on uncovering web sites that have been censored from search engine results.

The report finds that although Internet users in China are able to access more information due to the presence of foreign search engines the web sites that are censored are often the only sources of alternative information available for politically sensitive topics. In addition to censoring the web sites of Chinese dissidents and the Falun Gong movement, the web sites of major news organizations, such as the BBC, as well as international advocacy organizations, such as Human Rights Watch, are also censored.

The data presented in this report indicates that there is not a comprehensive system – such as a list issued by the Chinese government – in place for determining censored content. In fact, the evidence suggests that search engine companies themselves are selecting the specific web sites to be censored raising the possibility of over blocking as well as indicating that there is significant flexibility in choosing how to implement China’s censorship requirements.

This report finds that search engine companies maintain an overall low level of transparency regarding their censorship practices and concludes that independent monitoring is required to evaluate their compliance with public pledges regarding commitments to transparency and human rights. The lack of clarity in the process and the unwillingness of companies to disclose this information acts to bolster China’s current censorship policy that thrives on secrecy and unaccountability.

It is becoming increasingly clear that technology companies face a dilemma when attempting to penetrate the Chinese market. A failure to comply with China’s censorship policies can result in the wholesale blocking of a company’s entire service or significant levels of interference due to China’s filtering system. Companies that have a physical presence in China face the challenge of obtaining proper licensing and their Chinese employees may face legal threats for the foreign company’s failure to comply with China’s censorship policies. However, it is also clear that compliance with China’s censorship policies is also an unattractive option. Google, Microsoft and Yahoo! are all facing tough criticism from governments, human rights groups and civil liberties advocates as well as their shareholders for their complicity in China’s censorship policies.

While foreign search engines do provide more content than domestic search engines, the greatest benefit of having foreign search engines in China may not be increased access to information but is the potential contribution that these companies can make to further transparency and accountability in the process of censorship.

Since this report was finalized, the domestic Chinese search engine Baidu, following the foreign search engines, introduced a censorship notification indicating that it is possible to make progress through engagement. While this development may seem negligible to some and it is certainly no reason to become complacent, it is a small first step toward lifting the veil of secrecy and unaccountability that permeates China’s censorship policies.

Microsoft: Censorship Notification Returns



Microsoft now has a censorship notification in the censored version of the search engine live.com that they provide for the Chinese market. The notification appears when searches are made for particular keywords; however, it is not displayed when searches are restricted to censored domains. (See Degrading Transparency: Comparing Google, Yahoo and Microsoft for past reports.)

May 13, 2008
Engine | Presence | Placement | Specificity | Connection | Screenshot
Google | Yes | High: notification is placed under results | Low: mentions “local law” | Yes: notification only appears when results are censored | screenshot
Yahoo | Yes | Medium: notification is placed at the bottom of every page | Low: mentions “local law” | No | screenshot
Microsoft | Yes* | Medium: notification when searching for particular “key words”* | Low: mentions “local law” | Yes* | screenshot (2)

* Microsoft provides notification when searching for particular “key words”, however, no message appears when restricting the search to a censored web site.

U.S. Funded Health Search Engine Blocks ‘Abortion’



Wired reports that a health services search engine funded by the US Government blocks searches for the word “abortion” because of the possibility that funding could be denied for projects that “actively promote abortion”:

Called Popline, the search site is run by the Johns Hopkins Bloomberg School of Public Health in Maryland. It’s funded by the U.S. Agency for International Development, or USAID…

“We recently made all abortion terms stop words,” Dickson [the manager of the database at Johns Hopkins] wrote in a note to Gloria Won, the UCSF medical center librarian making the inquiry. “As a federally funded project, we decided this was best for now.”

It turns out that the block was prompted by complaints from the Bush administration:

“The items in question had to do with abortion advocacy — the two items dealing with abortion were removed following this inquiry, and the administrators made a decision to restrict abortion as a search term,” said Tim Parsons, a spokesman for the Johns Hopkins Bloomberg School of Public Health in Maryland.

Searches for “abortion” have been restored. However, it does not appear that the two removed articles were restored.

Democracy “Magnified”



The “magnify” component of the Search Monitor project attempts to match the top ten results from Google/Yahoo with the top ten results from the China-specific versions of Google/Yahoo in order to note the similarities and differences in terms of censored, returned (the website is in the top ten of both the .com and .cn versions of the search engine) and indexed (the website is in the top ten of the .com version and not in the top ten of the .cn version, but is not censored). It also compares the results based on whether or not each website is hosted in China or ends in .cn. This is taken as a measurement of “authorized” content that is unlikely to present information that China would block.
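
The three labels can be expressed as a short classification routine. This is a sketch under stated assumptions rather than the project’s actual code: the function name `classify` is illustrative, the set of censored hosts is taken as an input (it would come from separate “site:” tests), and the sample data are the hostnames of the google.com and google.cn results for 民主 listed later in this post.

```python
from collections import Counter


def classify(com_top10, cn_top10, censored):
    """Label each host from the .com top ten as censored, returned or indexed."""
    cn_hosts = set(cn_top10)
    labels = {}
    for host in com_top10:
        if host in censored:
            labels[host] = "censored"
        elif host in cn_hosts:
            labels[host] = "returned"  # in the top ten of both versions
        else:
            labels[host] = "indexed"   # not censored, but outside the .cn top ten
    return labels


google_com = [
    "zh.wikipedia.org", "asiademo.org", "www.mzyfz-news.com.cn",
    "www.usembassy-china.org.cn", "www.mj.org.cn", "www.dphk.org",
    "www.dem-league.org.cn", "www.cndca.org.cn", "tag.blog.sohu.com",
    "www.jfdaily.com.cn",
]
google_cn = [
    "zh.wikipedia.org", "www.usembassy-china.org.cn", "www.mzyfz-news.com.cn",
    "www.mj.org.cn", "www.dphk.org", "www.dem-league.org.cn",
    "theory.people.com.cn", "www.cndca.org.cn", "tag.blog.sohu.com",
    "www.jfdaily.com.cn",
]

labels = classify(google_com, google_cn, censored={"asiademo.org"})
summary = Counter(labels.values())
print(summary)
```

For this query the sketch labels nine hosts as returned and one as censored, with none falling into the indexed category.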

But, is this a worthwhile measurement?

Nine out of the top ten results for a search for 民主 in google.com and google.cn are the same. The only difference is that http://asiademo.org/ which appears as number two in google.com is censored in google.cn and, as a result, http://theory.people.com.cn/GB/49150/49152/5224247.html rounds out the top ten in google.cn.

But despite there being only one censored site, 7 of the top ten results for 民主 in google.com are either hosted in China or end in a .cn domain suffix. This number increases to 8 (80%) in google.cn.

There is no overlap between yahoo.com and yahoo.cn; drastically different results are returned. Of the top ten results in yahoo.com, 4 are censored in yahoo.cn.
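
The “authorized” measure can be sketched as follows. The helper names are hypothetical, and the sketch deliberately does not look up where a site is hosted: `hosted_in_china` is an optional set the caller would have to supply from a separate IP lookup. With that set left empty, only the .cn suffix is counted.

```python
def is_authorized(host, hosted_in_china=frozenset()):
    """True if the host ends in .cn or is in the supplied hosted-in-China set."""
    return host.endswith(".cn") or host in hosted_in_china


def authorized_count(hosts, hosted_in_china=frozenset()):
    return sum(is_authorized(h, hosted_in_china) for h in hosts)


# Hostnames of the google.com and google.cn results for 民主 listed
# later in this post.
google_com = [
    "zh.wikipedia.org", "asiademo.org", "www.mzyfz-news.com.cn",
    "www.usembassy-china.org.cn", "www.mj.org.cn", "www.dphk.org",
    "www.dem-league.org.cn", "www.cndca.org.cn", "tag.blog.sohu.com",
    "www.jfdaily.com.cn",
]
google_cn = [
    "zh.wikipedia.org", "www.usembassy-china.org.cn", "www.mzyfz-news.com.cn",
    "www.mj.org.cn", "www.dphk.org", "www.dem-league.org.cn",
    "theory.people.com.cn", "www.cndca.org.cn", "tag.blog.sohu.com",
    "www.jfdaily.com.cn",
]

suffix_only_com = authorized_count(google_com)
suffix_only_cn = authorized_count(google_cn)
print(suffix_only_com, suffix_only_cn)
```

By suffix alone the counts are 6 and 7, one below the figures in the text. The difference would be explained by one site in each list that does not end in .cn but is hosted in China, though the post does not say which site that is.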

All 10 of the results from yahoo.com are hosted outside of China and all 10 of the results from yahoo.cn are hosted inside China.

However, how does the content of the actual results match up? On this I require some help. Qualitatively, what content are users in China missing out on? How relevant are the censored sites?

Results for 民主 from google.com

http://zh.wikipedia.org/wiki/%E6%B0%91%E4%B8%BB

http://asiademo.org/

http://www.mzyfz-news.com.cn/

http://www.usembassy-china.org.cn/infousa/whatdm/GB/homepage.htm

http://www.mj.org.cn/

http://www.dphk.org/

http://www.dem-league.org.cn/

http://www.cndca.org.cn/

http://tag.blog.sohu.com/%C3%F1%D6%F7/

http://www.jfdaily.com.cn/epublish/gb/paper26/

Results for 民主 from google.cn

http://zh.wikipedia.org/wiki/%E6%B0%91%E4%B8%BB

http://www.usembassy-china.org.cn/infousa/whatdm/GB/homepage.htm

http://www.mzyfz-news.com.cn/

http://www.mj.org.cn/

http://www.dphk.org/

http://www.dem-league.org.cn/

http://theory.people.com.cn/GB/49150/49152/5224247.html

http://www.cndca.org.cn/

http://tag.blog.sohu.com/%C3%F1%D6%F7/

http://www.jfdaily.com.cn/epublish/gb/paper26/

Censored in google.cn:

asiademo.org

Results for 民主 from yahoo.com

http://zh.wikipedia.org/wiki/%E6%B0%91%E4%B8%BB

http://en.wikipedia.org/wiki/Democracy

http://www.asiademo.org/

http://www.dnc.org/

http://usinfo.state.gov/mgck/home/topics/democracy_human_rights/democracy.html

http://www.paulgraham.com/web20.html

http://www.dpj.or.jp/

http://home.computer.net/~pyd/clcb11.html

http://www.democracy.gov/dd/mgck_democracy_dialogues.html

http://www.cchere.net/tags/%C3%F1%D6%F7/

Results for 民主 from yahoo.cn

http://www.mzfz.gov.cn/

http://www.studa.net/minzhu/

http://npc.people.com.cn/GB/28320/41246/index.html

http://www.mzyfz.com/

http://www.taimeng.org.cn/

http://www.dem-league.org.cn/index.shtml

http://jb.mzfz.gov.cn/

http://cpc.people.com.cn/GB/104019/104098/6378610.html

http://www.civillaw.com.cn/Article/default.asp?id=35562

http://www.gongfa.com/minzhuzhuanti.htm

Censored in yahoo.cn:

www.dnc.org
usinfo.state.gov
www.asiademo.org
www.cchere.net

Does the hosted in China or ending in .cn measure make sense when the content itself is analyzed?

A Search for Human Rights



The Search Monitor Project: China focuses on assessing the level of transparency with regard to the self-censorship practices of search engine companies as well as the mechanisms and effects of this political censorship. (For background information, see this and this.) The following is a step by step process of a search for “human rights” (人权).

The first step is to retrieve a result set from the (uncensored) Chinese version of Google. Each result is parsed to its domain name (http://www.hrw.org/chinese/ becomes “www.hrw.org”).
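
A minimal sketch of this parsing step, using Python’s standard `urllib.parse` module. The helper names `to_domain` and `unique_domains` are illustrative and not taken from the project’s code.

```python
from urllib.parse import urlparse


def to_domain(url):
    """Reduce a result URL to its host name."""
    return urlparse(url).hostname


def unique_domains(urls):
    """Parse every URL and drop repeated domains, keeping first-seen order."""
    return list(dict.fromkeys(to_domain(u) for u in urls))


print(to_domain("http://www.hrw.org/chinese/"))
```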

The second step is to use the “site:” modifier to restrict results to the domain. The censored versions of Google and Microsoft can be queried directly, but Yahoo and Baidu must be queried from inside China because they are hosted inside China. This is because the bi-directional filtering of China’s “Great Firewall” (GFW) will block the inbound connections due to the presence of “www.hrw.org” in the search query. Conversely, searches from inside China to Google or to Microsoft will be blocked by the GFW, but since we are interested in search engine censorship it is necessary for us to remove the effects of the GFW.
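
The second step might be sketched like this. The vantage points are those described above; everything else is an assumption made for illustration, in particular the helper names and the query parameter name `q`, which is not confirmed for any of the four engines.

```python
from urllib.parse import urlencode

# Where each engine has to be queried from, as described in the text.
VANTAGE = {
    "Google": "outside China",
    "Microsoft": "outside China",
    "Yahoo": "inside China",
    "Baidu": "inside China",
}


def site_query(domain):
    """Build a query restricted to a single domain."""
    return f"site:{domain}"


def query_string(domain, param="q"):
    """Percent-encode the query; "q" is an assumed parameter name."""
    return urlencode({param: site_query(domain)})


for engine, vantage in VANTAGE.items():
    print(engine, vantage, query_string("www.hrw.org"))
```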

The censored results for the top ten results from Google for the query 人权 are:

Keyword | Translation | Google | MSN | Yahoo | Baidu
人权 | human rights | 1 / 10 | 3 / 10 | 1 / 10 | 1 / 10

The common site, censored by all four search engines, is www.hrw.org, the website of Human Rights Watch. Google, acting the most transparently, provides a notification that results have been removed, and since the search has been restricted using “site:” we can conclude that www.hrw.org is censored specifically and deliberately. Yahoo provides a notification, but since it appears at the bottom of every page regardless of whether the results are censored or not, we are left to assume that the site was never indexed because Yahoo China operates its web crawlers from behind the GFW. Baidu also operates its crawlers from behind the GFW so, like Yahoo, sites blocked by the GFW are not indexed. Microsoft uses the same de-listing mechanism as Google but has removed the censorship notification they formerly displayed. We therefore assume that the site has been censored because there are no results when using the “site:” modifier (results do appear in the English version), but the lack of transparency reduces the accuracy of the claim.

The two other sites from (uncensored) Google’s top ten results for 人权 (human rights) that are censored, in both cases only by Microsoft, are zh.wikipedia.org and www.epicbook.com.
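
The reasoning in the paragraph above amounts to a small decision procedure, sketched here in Python. It is my paraphrase of the argument, not code from the project; the function name, parameter names and result strings are all hypothetical.

```python
def interpret(cn_results, result_specific_notice, crawls_behind_gfw,
              listed_in_uncensored_version=False):
    """Interpret the outcome of a "site:" search on a China-specific engine."""
    if cn_results > 0:
        return "indexed"
    if result_specific_notice:
        # Google: the notice only appears when results were removed.
        return "censored (confirmed by notification)"
    if crawls_behind_gfw:
        # Yahoo, Baidu: de-listing and never-indexed look the same.
        return "censored or never indexed (cannot tell)"
    if listed_in_uncensored_version:
        # Microsoft: no notice, but the English version has results.
        return "assumed censored (de-listed, no notification)"
    return "unknown"


print(interpret(0, True, False))                                      # Google
print(interpret(0, False, True))                                      # Yahoo, Baidu
print(interpret(0, False, False, listed_in_uncensored_version=True))  # Microsoft
```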

URL | Google | MSN | Yahoo | Baidu
www.hrw.org | Censored | Censored | Censored | Censored
    Meta: org | 199.173.149.120 | 701 | US | UUNET – MCI Communications Services, Inc. d/b/a Verizon Business
zh.wikipedia.org | Indexed | Censored | Indexed | Indexed
    Meta: org | 208.80.152.2 | 14907 | US | WIKIMEDIA Wikimedia US network
www.epicbook.com | Indexed | Censored | Indexed | Indexed
    Meta: com | 61.152.160.205 | 4812 | CN | CHINANET-SH-AP China Telecom (Group)

It is interesting that Microsoft censors Wikipedia while Yahoo and Baidu index it, because Wikipedia is generally blocked by the GFW. A possible explanation is that, because the GFW is not 100% consistent or accurate with its keyword filtering, the crawlers were able to index normally blocked sites.

I am not familiar with www.epicbook.com but it is hosted inside China and is thus an unlikely candidate to host information that the government would want to censor. The fact that it is indexed by all three of the other search engines supports this. While this could be a case of “collateral damage” due to Microsoft’s lack of transparency (the possibility that it is not censored, it is just not indexed), it is indexed in the English version of Microsoft’s search engine.

The “magnify” component of this project attempts to match the top ten results from Google/Yahoo with the top ten results from the China-specific versions of Google/Yahoo in order to note the similarities and differences in terms of censored, returned (the website is in the top ten of both the .com and .cn versions of the search engine) and indexed (the website is in the top ten of the .com version and not in the top ten of the .cn version, but is not censored). It also compares the results based on whether or not each website is hosted in China or ends in .cn. This is taken as a measurement of “authorized” content that is unlikely to present information that China would block.

As noted above, there is only one censored site; the other nine results are also returned in the top ten of the censored .cn version of Google.

However, even in google.com, 4 of the top 10 sites are hosted in China or end in .cn, leaving only 6 sites to represent alternative information. When the censored site is removed, the google.cn version moves to a 50/50 split between authorized and potentially unauthorized information. While this case doesn’t show a dramatic difference, other search queries, particularly those specific to contextually relevant information, often do.

This yahoo.com vs. yahoo.cn comparison uses the top 10 results from yahoo.com. With this result set, 8 of the sites returned in yahoo.com are censored; the remaining 2 are indexed but not returned in the top 10 in yahoo.cn.

While all 10 results in yahoo.com are hosted outside of China, all 7 results in yahoo.cn are hosted in China or end in .cn. This helps show how significant the censored sites are in comparison.

Although the total number of censored sites may be low, especially when compared to the amount of indexed sites, the significance of these sites in providing alternative information should not be underestimated.

Search Monitor Project: China



Search engines are increasingly censoring their results, often by geographic location, with a significant, negative impact on the right to freedom of expression. The most advanced cases of censoring political content are in search engines that market a version of their product in China. This project aims to expose and monitor the censoring practises of search engines with a specific focus on China.

Building upon efforts to assess the level of transparency (reading this first is probably a good idea) with regard to search engine censorship, this project aims to compare the level of censorship across the China-specific search engines of Google, Yahoo, and Microsoft as well as the domestic Chinese search engine, Baidu. The goal of comparison poses some significant methodological problems as the presence or absence of censorship notification, mechanism of censorship (and irregularities therein) and physical location of the servers themselves all add additional layers of complexity.

In attempting to develop an automated system that can reasonably compare the search engines, some additional methods, which would be well suited to one search engine but for which comparable data could not be generated from the others, have been delegated to separate search engine-specific projects. As a result of the focus on comparability, the methods outlined below not only build upon existing research in this area but can hopefully explain some of the anomalies previously identified. After reviewing previous reports by Reporters Without Borders and Human Rights Watch, I sketch out methods that attempt to provide an accurate, automated comparison between the search engines.

Previous Research

In June 2006, Reporters Without Borders (RSF) conducted a comparison of the four search engines Google, Yahoo, Microsoft, and Baidu (later updated to also include Sohu and Sina) by entering key words into the search engines and analyzing 1) the presence or absence of any results and 2) the content of the results, classifying each returned web site (URL) as either “authorized” or “unauthorized”, which presumably refers to whether the source is controlled by or supports the government of China or whether it contains critical, alternative information. While this report is an innovative attempt at comparison, it suffers from methodological issues that affect the accuracy of the results.

First, the report is actually about the ranking of results rather than censorship. The top ten results were analyzed based on their content, not on whether a web site had been censored (de-listed/removed) from the results set. While the removal of censored sites will likely affect the combination of “authorized” vs. “unauthorized” sources, it does not tell us which sites are censored or whether the “unauthorized” sites are not censored but just do not appear in the top ten results. Localized search engines often algorithmically privilege sites that are in the local language, end in the country’s domain suffix (e.g. .cn) and possibly even are hosted within the country, and this affects where foreign-hosted “unauthorized” content appears in the result set. Thus an “unauthorized” site may not appear in the top ten results of the localized search engine even though it does in the uncensored version. Instead, the site may appear further down in the rankings.

Second, the testing of the search engines did not account for China’s national filtering system, often labeled the Great Firewall of China (GFW). Consequently, the results concerning “no results” and “no results + user banned” should actually be seen in reverse. Since Yahoo and Baidu are physically located in China the search queries made by RSF were filtered by the GFW on their way to the Yahoo and Baidu servers. If the same search were conducted from China, the search queries would not pass through the GFW and would not be filtered. The search queries RSF made to Google and Microsoft did not pass through the GFW because those servers are not located in China and therefore results were always returned. However, had those same search queries been made from China to Google and Microsoft they would have been filtered by the GFW and would have been designated “no results” and “no results + user banned”. The failure to account for the GFW prevented RSF from accurately interrogating the filtering of the search engines because a distinction was not made between filtering by the search engines and filtering by the GFW. If the tests had been conducted from inside, rather than outside, of China the report would have captured the behaviour experienced by users in China who are censored by both the GFW and the search engines and perhaps are agnostic about which one is doing the censoring since the result is the same: censorship.

In August 2006, Human Rights Watch (HRW) released an impressive and detailed comparison of Google, Yahoo, Microsoft, and Baidu. Two approaches were used in this report: the first focused on identifying censored sites, the second on whether or not the result set returned from a search for a specific key word query was censored. The first approach involved using a list of 25 websites and searching for each website in each search engine (using the site: modifier, discussed below, when possible). If a “censorship notification” appeared and there were no results, the web site was censored, but the report also noted instances in which the message appeared but some partial results appeared as well. In other cases, there were no results, and since there was also no censorship notification (or a censorship notification that always appears and has no relationship with the results), it was suspected that the web site was censored. In this way, HRW was able to determine how many of the 25 sites were censored in each search engine.

HRW tested from both inside and outside of China and was thus able to isolate search engine filtering from that conducted by the GFW of China. HRW notes that queries to Yahoo from outside China generated errors (as we saw in the RSF study) and we now know that this is due to the bi-directional filtering of the GFW (see below). The partially censored results (what I call “Page Censored” below) can arise for at least two reasons. The first is that some search queries automatically trigger the censorship notification regardless of whether the results have been censored or not; the second is that the filtering algorithms of the search engines are imperfect. Google, for example, does not handle port numbers properly and fails to remove such pages, and Microsoft does not handle domains by their root (domain.com), so sub-domains (www.domain.com or dom.domain.com) may not all be removed. Microsoft also does not properly handle URLs that begin with “https”. In such cases partial results may be available despite the search engines’ attempts to censor.

Another issue (which is still an issue in the methodology discussed below) concerns search engines not censoring pages directly. Both Yahoo and Baidu operate the crawlers that index websites from inside China and thus do not index sites that are blocked by the GFW. This removes the need for the search engines to censor their results, as the index itself is already censored by the GFW. This means that there is not a credible technical way to distinguish between sites that are not indexed and sites that are censored. Another issue is that the GFW is not perfect, and normally censored sites sometimes end up in Yahoo & Baidu’s index. There have also been some cases in which Yahoo has removed indexed sites (those not blocked by the GFW) and used a censorship notification as Google does and Microsoft did previously. Therefore, for the most part, Yahoo and Baidu do not need to censor their results: their index is already censored because their crawlers operate from within China and cannot visit blocked sites to begin with.

The second approach used by HRW focused on the issue of keyword filtering by search engines. The question is simple enough: if I search for keyword “a”, will I get censored results “b”? However, the lack of transparency on the part of the search engines makes this simple question difficult to answer. HRW used a list of 25 keywords to query the search engines and inferred possible censorship by comparing the results from the censored China-specific versions of Google, Yahoo and Microsoft with those from their US counterparts, as well as noting the appearance of a censorship message. (Baidu had no such counterpart at the time, but perhaps Baidu Japan can now be used for this purpose.)

Comparing result sets can be problematic because of the algorithmically determined rank of the results. What appears on page one in the top ten results in google.com may appear on page twenty-five in google.cn. In the case of Yahoo and Baidu, GFW-censored sites are not indexed at all and so will never appear no matter what one searches for. Another method is to use the difference in the estimated page count as an indicator of censored results. But the estimated page counts can vary considerably between servers and language/region-specific versions. Microsoft, for example, returns very few Chinese language results in their default English language search engine, making comparison virtually impossible. As noted by HRW, even the presence of the censorship notification may not be reliable. In some cases the censorship notification will appear based on the keywords in the query, not on the results returned. (You can restrict the results to a non-existent site and still get the censorship message.) In other cases, it has nothing to do with what was used as a query (for example, a non-politically sensitive term) but the censorship message appears because a URL has been removed/de-listed. In other cases, some keyword queries return results and a censor message not because results have been removed but because results are only returned from a set of “white listed” sites. Compounding the problem, the censorship message appears to be page specific (at least in the case of Google). That is, if one searches for keyword “x” and gets back ten results there may be no censorship message, but when one clicks on “Page 2” and gets results 11-20, which do contain a censored site, the censorship notification will appear. (Therefore, if one sets the preferences to retrieve 100 results, will one be more likely to encounter the censorship notification than if restricted to 10 results?)

HRW accounted for such variance through manually checking results in addition to the estimated page count comparisons and the presence of a censorship notification. Not only does this involve extensive manual labour but also an expertise in analyzing the content for political significance. For example, HRW manually assessed and compared the first three pages of search results for Yahoo and Yahoo China. HRW’s efforts in this regard stand out as an example of the quality needed for this line of research.

Methodology

The Search Monitor Project currently contains two related but separate components. The first, Generalized Comparison : Keywords and Urls, focuses on a generalized comparison between the China-specific versions of Google, Yahoo, Microsoft and Baidu. The second, Magnify: A Google-Google, Yahoo-Yahoo comparison, focuses on comparisons between the Chinese-language “global” versions of Google and Yahoo and their special censored China-specific versions.

While the core testing methods are the same, the Magnify: A Google-Google, Yahoo-Yahoo comparison contains some additional elements that allow for a more fine grained analysis. These will be noted below when appropriate.

Generating a URL Set

A set of sixty keywords has been selected covering the broad topical categories of censorship circumvention, falun gong/dafa, political sensitivities and social taboos. Search queries in “uncensored” engines (the Chinese language versions of Google or Yahoo) are used to generate lists of sites that are checked in censored search engines.

A query term, such as “人权” (human rights), is used to retrieve results from an “uncensored” search engine, such as google.com.

The websites from the “uncensored” results are parsed to retrieve the domain (including sub-domains).

A list of URL results, usually ten, is retrieved. A URL such as http://www.hrw.org/chinese/ is shortened to www.hrw.org.

Each domain name is checked in each censored search engine.
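The shortening step above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the function names host_of and unique_hosts are my own, and the sketch assumes the result URLs have already been pulled out of the "uncensored" results page.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the host name (sub-domains included) of a result URL."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"no host name in {url!r}")
    return host


def unique_hosts(urls):
    """Reduce a list of result URLs to host names, keeping first-seen order."""
    hosts = []
    for url in urls:
        host = host_of(url)
        if host not in hosts:
            hosts.append(host)
    return hosts


if __name__ == "__main__":
    print(host_of("http://www.hrw.org/chinese/"))
```

Keeping the sub-domain (www.hrw.org rather than hrw.org) matters here, because the later "site:" check is made against the specific host name.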

Determining a Censored site

These domains are checked in the censored search engines using the “site:” modifier. The “site:” modifier restricts the results set to pages of a specific host name.

“site:www.hrw.org” (without the quotes) is used as a search term in censored search engines to restrict the results to only those from the web site www.hrw.org.

In cases where the censored search engine being tested displays a special message indicating that results have been censored (a “censor message”) that relates to the specific search query, domains that produce no results when queried with the “site:” modifier and contain a censor message are labeled as “Censored”, while domains that return some results but contain a censor message are labeled as “Page Censored”.

In the cases where there is no censor message, or the censorship message appears on every page and bears no connection to the results, domains that produce no results when queried with the “site:” modifier are labelled as “Censored.”

Depending on the current behaviour of search engines there may be ad hoc additions.

http://www.google.cn/ – censored = censor message + 0 results, pagecensored = censor message + some results
http://www.live.com/?mkt=zh-cn – censored = 0 results, pagecensored = results that only contain urls beginning with “https” (no longer a censor message, failure to exclude “https” urls was noted when the censor message was in place and is thus used as “Page Censored”)
http://www.yahoo.cn/ – censored = 0 results, censor message is ignored because it appears on every page, it bears no relation to search results
http://www.baidu.com/ – censored = 0 results
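The per-engine rules listed above can be written down as a small decision function. This is a sketch under stated assumptions, not the project's implementation: SiteQueryResult, classify and the engine keys are hypothetical names, and the labels "Available" and "No results" are my own placeholders for outcomes the rules above leave unlabelled. Fetching and parsing each engine's results page is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class SiteQueryResult:
    """Outcome of one "site:" query against one censored engine."""
    result_urls: list      # URLs returned for the query (may be empty)
    censor_message: bool   # whether a censorship notification was shown


ENGINES = {"google.cn", "live.com", "yahoo.cn", "baidu.com"}


def classify(engine: str, result: SiteQueryResult) -> str:
    if engine not in ENGINES:
        raise ValueError(f"unknown engine: {engine}")
    has_results = bool(result.result_urls)

    if engine == "google.cn":
        # Google's message is tied to the results, so it is required.
        if result.censor_message:
            return "Page Censored" if has_results else "Censored"
        return "Available" if has_results else "No results"

    # live.com, yahoo.cn, baidu.com: the message is absent or appears on
    # every page, so zero results alone is treated as "Censored".
    if not has_results:
        return "Censored"
    if engine == "live.com" and all(
        url.startswith("https") for url in result.result_urls
    ):
        # Only "https" URLs survived the filter: treated as "Page Censored".
        return "Page Censored"
    return "Available"
```

Note that for the three engines without a meaningful message, a site that was simply never indexed falls into the same "Censored" bucket, which is the caveat discussed next.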

It is important to note that sites that are simply not indexed by the search engine will appear as “Censored”, thus possibly inflating the total amount of censorship attributed to search engines that do not have a censor message that is related to the results. This can be slightly compensated for by looking at the overlap of censored sites among search engines. In addition, since this is a normative project advocating transparency, this should serve as an incentive for search engines to implement a censor message that is related to the results.

The Magnify: A Google-Google, Yahoo-Yahoo comparison project contains the classifications “Returned” and “Indexed” in addition to “Censored” and “PageCensored”. “Returned” refers to URLs from the result set of the “uncensored” search engine that are also returned in the result set of the censored search engine. “Indexed” refers to URLs from the result set of the “uncensored” search engine that are not returned in the result set of the censored search engine, but are not censored. Using this method, the top ten results for a query in Google (Chinese) can be compared with the top ten results of Google China and can be categorized as “Returned”, sites that are common to both result sets; “Indexed”, sites in the top ten of the uncensored results but not in the top ten of the censored results; and “(Page)Censored”, results that are actually censored. In addition, each URL in both result sets is checked to see if it is hosted in China or ends in a .cn domain suffix.
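A rough sketch of this categorization follows. The function names, the example host names and the idea of passing in a ready-made set of China-hosted hosts are all assumptions for illustration; determining where a site is hosted (for example by IP geolocation) is outside the sketch. I also assume that a censored status from the "site:" check takes precedence over the other two labels.

```python
def magnify(uncensored_top, censored_top, censor_status):
    """Label each host from the uncensored top ten.

    censor_status maps host -> "Censored" or "Page Censored" for hosts
    found to be censored by the "site:" check; other hosts are absent.
    """
    returned = set(censored_top)
    labels = {}
    for host in uncensored_top:
        if censor_status.get(host) in ("Censored", "Page Censored"):
            labels[host] = censor_status[host]
        elif host in returned:
            labels[host] = "Returned"
        else:
            labels[host] = "Indexed"
    return labels


def is_authorized(host, hosted_in_china):
    """True if the host ends in .cn or is in the given China-hosted set."""
    return host.endswith(".cn") or host in hosted_in_china


if __name__ == "__main__":
    # Placeholder host names, not real results.
    print(magnify(
        ["a.example", "b.example", "c.example.cn", "d.example"],
        ["c.example.cn", "b.example", "e.example.cn"],
        {"a.example": "Censored"},
    ))
```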

The Great Firewall (GFW)

Borrowing a phrase from my colleagues Richard Clayton, Steven Murdoch and Robert Watson, it is necessary to ignore the filtering conducted by China to accurately test levels of censorship by the search engines themselves. As Clayton and Murdoch reveal, Internet traffic to and from China passes through a filtering system that is bi-directional (it affects both inbound and outbound traffic) and that disrupts connections if the presence of particular keywords is detected. (China also blocks outbound connections to IP addresses, but this does not interfere with the ability to test the search engines.) Often, China will designate a domain name as a “key word”, thus disrupting access for any request that contains that domain name. This is important as queries directed to search engines hosted in China use the “site:” modifier followed by a domain name.

In order to avoid interference from China’s filtering system, the China-specific versions of Google and MSN, which are hosted outside of China, are queried from outside of China and the China-specific versions of Yahoo and Baidu, hosted inside China, are queried from inside China.

The censored search engines, http://www.google.cn/ and http://www.live.com/?mkt=zh-cn are checked from outside of China.

The censored search engines, http://www.baidu.com/ and http://www.yahoo.cn/ are checked from inside of China.

In addition to affecting how to test each search engine, the location of the search engine relative to the GFW also affects how the search engines censor. Google and Microsoft, located outside of China, must remove, or de-list, specific sites from the results. Yahoo! and Baidu both operate their search spiders from inside China. This results in a situation where, because of China’s gateway filtering, the crawlers that index content for these search engines cannot access sites that China blocks.

61.135.166.102 - - [08/Feb/2008:08:05:40 -0500] "GET / HTTP/1.1" 200 12258 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"

220.181.38.169 - - [08/Feb/2008:09:04:42 -0500] "GET / HTTP/1.1" 200 12258 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"

60.28.17.38 - - [08/Feb/2008:11:46:31 -0500] "GET / HTTP/1.1" 200 12258 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"

202.160.180.184 - - [07/Feb/2008:16:58:33 -0500] "GET /robots.txt/ HTTP/1.0" 200 24 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)"

202.160.180.96 - - [07/Feb/2008:16:58:35 -0500] "GET / HTTP/1.0" 200 19068 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)"

Thus Yahoo! rarely has to de-list specific websites; most are just not indexed in the first place. However, this also leads to situations in which sites blocked by China and de-listed by Google and Microsoft are indexed by Yahoo!. The GFW is not 100% effective and occasionally crawlers operating from inside China are able to index a normally blocked site, which then appears in their search results.

It is also important to note that sites indexed by the search engines that are blocked by the GFW will still be inaccessible to users in China.
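For anyone who wants to repeat the crawler check on their own server, here is a minimal sketch that pulls the Baidu and Yahoo China crawler hits out of a web server access log like the excerpt above. It assumes the common "combined" log format; the function names are mine, and the sketch only lists the source IP addresses. Deciding whether an address is actually located inside China needs a separate lookup that is not shown.

```python
import re

# One line of the "combined" access log format.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Typographic characters that blog software substitutes for ASCII ones.
FIXES = {"\u201c": '"', "\u201d": '"', "\u2033": '"', "\u2013": "-"}


def normalize(line: str) -> str:
    """Undo curly quotes and en dashes introduced by text formatting."""
    for bad, good in FIXES.items():
        line = line.replace(bad, good)
    return line


def crawler_hits(lines):
    """Return (crawler, ip, request) for Baidu and Yahoo China crawler hits."""
    hits = []
    for line in lines:
        match = LOG_RE.match(normalize(line))
        if match is None:
            continue
        agent = match.group("agent")
        if "Baiduspider" in agent:
            name = "Baidu"
        elif "Yahoo! Slurp China" in agent:
            name = "Yahoo China"
        else:
            continue
        hits.append((name, match.group("ip"), match.group("request")))
    return hits
```

The user-agent string can be spoofed, so the IP address is the more useful field when checking where a crawler really operates from.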

Types of Results

Generalized Comparison : Keywords and Urls

This component shows each of the keywords used as search queries in an “uncensored” search engine and the number of the URLs from that result set that are censored in the China-specific versions of Google, Yahoo, Microsoft and Baidu. It also shows the domains that are censored by any one of the four search engines and that domain’s status with regard to the other three engines. In this way we can compare the number of censored URLs per search query across all four search engines as well as build a list of censored domains and compare the level of censorship across all four search engines.

Magnify: A Google-Google, Yahoo-Yahoo comparison

This component focuses on two of the search engines that have comparable censored and “uncensored” versions: Google and Yahoo. Microsoft’s “global” search engine in English contains very few Chinese sites and cannot really be compared to their Chinese version. Even other versions, such as Hong Kong and Taiwan, have such drastically different results when compared to the Chinese version that Microsoft is a difficult fit for this model of testing. (At this time Baidu Japan has not been sufficiently investigated but it may offer an opportunity for comparison with Baidu China.)

The data collection for this section focuses on a direct comparison between google.com (in Chinese) and google.cn, and between yahoo.com (in Chinese) and yahoo.cn. It expands upon the collection of censored sites and pages by looking at the overlap of returned pages (pages that appear in the result set for the same query and the same number of results in both search engines) as well as indexed pages. It also tracks which sites are hosted in China or end in a .cn domain suffix.

Organized in this way, the results raise questions regarding the nature of the censorship process as well as the censored content. In terms of process, critical questions have been frequently posed concerning the specificity of the censorship requirements communicated to these search engines by the Chinese government. Are search engines given a list of keywords or a list of web sites that they are to censor? Or is there just a general reference to a type of content, leaving search engines to infer what exact content to block? How significant are the censored web sites, since they only represent a small fraction of indexed sites? How frequently are users’ search results censored in relation to the topics they search for? Which specific web sites are actually censored? What type of web sites are censored?

In an effort to provide some insight regarding the question of process, the project will measure the overlap between all the search engines as well as subsets that are functionally similar (Google/Microsoft, Yahoo/Baidu). Overlap refers to the sites that are censored by multiple search engines. Overlap is analyzed in two ways: the first focuses on sites that are censored by all search engines tested, the second focuses on search engines that censor using similar mechanisms. While this allows for a comparison among search engines, it also acts as an indicator of whether the search engines are responding to specific blocking requests, usually associated with an official order, or to a general determination on the part of the company, perhaps based on topic areas provided by officials.
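Both kinds of overlap reduce to a set intersection over the per-engine lists of censored hosts. A minimal sketch, with a hypothetical function name and placeholder host names rather than real data:

```python
def overlap(censored_by, engines=None):
    """Return the hosts censored by every engine in `engines`.

    censored_by maps engine name -> collection of censored host names.
    If `engines` is None, all engines in the mapping are used.
    """
    names = list(censored_by) if engines is None else list(engines)
    if not names:
        return set()
    common = set(censored_by[names[0]])
    for name in names[1:]:
        common &= set(censored_by[name])
    return common


if __name__ == "__main__":
    # Placeholder data for illustration only.
    data = {
        "google": {"a.example", "b.example", "c.example"},
        "microsoft": {"a.example", "b.example"},
        "yahoo": {"a.example", "d.example"},
        "baidu": {"a.example"},
    }
    print(overlap(data))                           # censored by all four
    print(overlap(data, ["google", "microsoft"]))  # functionally similar pair
```

A large overlap between engines that censor by the same mechanism would point toward a shared list; a small one would point toward each company making its own determinations.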

Since the total number of censored sites is likely to be relatively small compared to the total number of indexed sites, this project proposes a measure of significance in order to show just how important the censored sites are in relation to those displayed to the user. Significance refers to the number of top ten sites returned from an uncensored search engine that are censored in the China-specific version in relation to those that are either hosted in China or that end in a .cn domain suffix. China could, presumably, take action against those sites under their jurisdiction without having to resort to blocking. In this context, these sites are considered to be “authorized” and are unlikely to contain information that presents an alternative perspective to that approved by the government. In this component, results that are returned in the top ten, along with those that are indexed but not displayed in the top ten, are distinguished from those that are censored. The significance is demonstrated by the absence of top ten results outside of China’s control among a majority of sites that are within the top ten.
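One way to put numbers on this is to count, for a single top ten, how many results are “authorized”, how many are not, and how many of the latter are censored. The sketch below is my own reading of the measure, not a fixed formula from the project; the function name, the dictionary keys and the placeholder hosts are assumptions, and the set of China-hosted hosts is taken as given.

```python
def significance(top_ten, censored, hosted_in_china=frozenset()):
    """Summarize one uncensored top ten against the censored engine."""
    authorized = [
        h for h in top_ten if h.endswith(".cn") or h in hosted_in_china
    ]
    alternative = [h for h in top_ten if h not in authorized]
    removed = [h for h in top_ten if h in censored]
    remaining = [h for h in alternative if h not in censored]
    return {
        "results": len(top_ten),
        "authorized": len(authorized),
        "alternative": len(alternative),
        "censored": len(removed),
        "alternative_remaining": len(remaining),
    }


if __name__ == "__main__":
    # Placeholder hosts: 4 authorized, 6 alternative, 1 of which is censored.
    top = [f"site{i}.example.cn" for i in range(4)]
    top += [f"alt{i}.example" for i in range(6)]
    print(significance(top, censored={"alt0.example"}))
```

Even a single censored site stands out once the results that are already under China’s jurisdiction are set aside, which is the point of the measure.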

Analyzing the web sites found to be censored is an important indicator of the type of content that the government of China wants to block (or of the interpretation of this interest by the search engine companies). Content refers to the type of websites that are targeted for censorship, not the content of individual articles contained within them. This is accomplished through the creation of categories to which censored sites are assigned. This component is the most problematic for automatic determinations. Web sites could be classified using various services that provide categorized URLs but the results may be less than desirable. To operationalize this component properly likely requires, as suggested by Rebecca Mackinnon, analysis by “a team of near-native Chinese speakers who are highly tuned-in to what the sensitive media topics.” Thus this component remains the least developed aspect of the project.

Degrading Transparency: Comparing Google, Yahoo and Microsoft



Google, Yahoo! and Microsoft all maintain versions of their search engines for the Chinese market that censor political content. One of the key issues that emerged concerned transparency. In 2006, all three search engines, following Google’s lead, introduced a message that informed users when the results of their searches were censored. The presence of a mechanism of notification is a critical component of transparency. This notification informs users that their search results have been censored and indicates, to a certain degree, the reason why (often unspecified “local law”) based on what the user searched for. The message appeared only when the user’s results were censored and thus it was possible to connect the censorship to specific keywords or websites.

By 2008 the level of transparency has decreased. While Google’s censorship notification has remained essentially the same as it was in 2006, Yahoo! and Microsoft have altered the way in which users are notified of censorship. Yahoo! has put its censorship message at the bottom of every page regardless of whether results are censored or not, in effect de-linking the censorship notification from the results. Microsoft has removed the text completely and buried the censorship notification within a separate “help” page. These developments represent a significant degrading of transparency and accountability.

When the censorship message, which is vague to begin with, is removed or hidden, users may be unaware that their results have been censored, and when the censor message is de-linked from what the user actually searched for, the topics and websites that are censored remain hidden from the user. The de-linking of the censorship message from the search results impacts the ability to determine what precise sites and “key words” are being censored.

The presence and placement of a censorship notification, along with the specificity of its content and its connection to the results, is an integral component of transparency. The specificity of the reason why content has been removed is an important component that is lacking in the case of China. In other cases, Google has cited specific laws, such as the DMCA, and other legal documents with which they must comply and reported the information, to some degree, to Chilling Effects.org. Yahoo maintains a list of sites it has censored for copyright violations. However, in the case of censored political content in China, nothing other than a reference to “local law” has been provided.

The presence of a notification that is directly connected to the results (notification appears only when content is actually removed in relation to what the user searches for) positively impacts the ability to accurately identify censored websites and restricted keywords. When such notifications are either absent or disconnected from the results (for example, a notification that appears on every page regardless of whether results are censored or not) the ability to determine censored sites with a high degree of confidence diminishes, as sites may simply not be indexed by the search engine. Therefore the notification is critical not only for informing users but also for the monitoring process.

June 26, 2006
| Engine | Presence | Placement | Specificity | Connection | Screenshot |
|---|---|---|---|---|---|
| Google | Yes | High: notification is placed under results | Low: results removed to comply with local law | Yes: notification only appears when results are censored | screenshot |
| Yahoo | Yes | High*: notification is placed under results | Low: results removed | Yes* | screenshot |
| Microsoft | Yes | High: notification is placed under results | Low: results removed, link to “help” page that mentions local law | Yes | screenshot |

* Yahoo China’s web crawlers operate from within China, behind the GFW, therefore sites that are blocked by China are not indexed by Yahoo (and thus do not need to be censored by Yahoo) leaving only sites that are either not blocked by China or are indexed during periods when there is variation in the capacity of China’s filtering system to actually be censored by Yahoo. The behaviour documented here refers to sites indexed by Yahoo but subsequently censored, not sites that are not indexed by Yahoo at all.

January 25, 2008
| Engine | Presence | Placement | Specificity | Connection | Screenshot |
|---|---|---|---|---|---|
| Google | Yes | High: notification is placed under results | Low: mentions “local law” | Yes: notification only appears when results are censored | screenshot |
| Yahoo | Yes | Medium: notification is placed at the bottom of every page | Low: mentions “local law” | No | screenshot |
| Microsoft | Yes** | Low: a link to a separate “help” page which contains a link to a section that contains the notification | Low: a link to a separate “help” page which contains a link to a section that mentions “local law” | No | screenshot |

** There is no notification on the actual page that returns the search results. The user must click through to a “help” page and then navigate to yet another section that states that results may be removed in compliance with local law. The notification mentions pornography as a possible reason; no mention is made of political content.

Presence: The presence of a form of notification that informs users that results may be censored.
Placement: The location of the censorship notification message, particularly its placement in relation to the results.
Specificity: The extent to which users are informed about specific laws, orders and/or regulations leading to censored results.
Connection: Notification appears only when content is actually removed in relation to what the user searches for making it possible to determine which specific web sites and keywords have actually been censored.

(The versions of the search engines tested are the specific version for China. Google (www.google.cn) and Microsoft (www.live.com/?mkt=zh-cn) have their servers located outside of China and are tested directly while Yahoo’s (www.yahoo.cn) servers are hosted in China and are tested from inside China. This is necessary to test the search engines without interference from China’s filtering system.)

This is the start of an effort to more systematically monitor transparency over time so I am asking for feedback. Is this information useful? In what ways can it be improved?

EU Wants to block searches for “bomb”



“I do intend to carry out a clear exploring exercise with the private sector … on how it is possible to use technology to prevent people from using or searching dangerous words like bomb, kill, genocide or terrorism,” Frattini told Reuters.

Wow.

Searching for such words brings up quite a number of non-bomb-making-instruction sites; forcing search engines to not allow searches for such generic terms is ridiculous. The top results for a Google search for “genocide”, for example, include a Wikipedia entry and a site dedicated to stopping genocide in Darfur, among others. That much is obvious.

Perhaps EU Justice and Security Commissioner Franco Frattini meant that specific sites, such as sites with instructions on how to make a bomb, should be removed from search engines. In this scenario it is not that a user cannot search for the word “bomb” but if such a designated web site were to appear in the results it would not be shown to the user. This is what is already done by search engines in regard to copyright violations, hate speech, libel/defamation and any other “legal” request (such as news & politics websites that the Chinese government deems illegal). It would be fairly simple for the EU to request that search engines de-list certain sites, but of course, this comes with all the baggage of filtering systems (over-blocking, under-blocking & circumvention).

The above concerns aside, the proposal is actually even more misguided. It assumes that search engines are the only way to access information. Such a policy would not take into account direct access to such sites or links from other sites, especially forums, chat rooms, IMs and so on. It is a shortsighted policy that appears to be mostly for show, in the same vein as Seth Finkelstein argues about the deployment of censorware:

…governments end up giving money to these companies for the political benefits of being able to Do Something About The Problem (no matter the flaws).

The “wanting to do something” sentiment appears strong in this case as does the lack of careful consideration.

Thailand: YouTube Ban Lifted



Thailand has decided to lift the ban on YouTube after Google agreed to “filter” videos that insult the King.

Information and Communications Technology Minister Sitthichai Pookaiyaudom this week instructed the website ban be lifted after YouTube owner Google installed filters to stop Thais from accessing clips insulting the 79-year-old monarch, a ministry official said.

The Southeast Asian Press Alliance (SEAPA) criticized the collaboration between Google and the government to censor YouTube:

“Any such collusion could potentially be open for abuse, and thereby only exacerbate concerns over free speech over the internet,” SEAPA said.

“The cooperation between Google/YouTube and the Thai government could conceivably become a template sought by other governments that have had run-ins with sensitive content on the video-sharing site,” it said.

“.yahoo.com” briefly blocked in China



For the most part* the GFW blocks in two ways:

1) IP blocking
2) Keyword-in-URL blocking

IP blocking is pretty easy to spot: traceroute will fail at the backbone level in China, and there will only be outgoing SYN packets to the IP, so the 3-way TCP handshake will never be established. (Note: all domains hosted on that IP are affected.)
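As a rough illustration of the handshake symptom described above, here is a minimal Python sketch that attempts a TCP connection and reports how it ended. The function name `classify_connect` and the 5 second timeout are my own choices for illustration. A "timeout" result is only consistent with IP blocking; an ordinary outage or a firewall at the destination looks the same from the client side, so this should be read alongside a traceroute rather than as proof.

```python
import socket


def classify_connect(ip, port=80, timeout=5.0):
    """Attempt a TCP handshake with ip:port and report how it ended."""
    try:
        sock = socket.create_connection((ip, port), timeout=timeout)
    except socket.timeout:
        # SYNs went out and nothing came back before the timeout.
        # Consistent with IP blocking, but not proof of it.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered with a RST: reachable, but nothing listening.
        return "refused"
    except OSError:
        # Anything else (no route, network unreachable, and so on).
        return "error"
    sock.close()
    return "established"
```

Calling `classify_connect` with the address of the suspect site from inside China and again from outside would give two results to compare.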

“Keyword-in-URL” blocking is different and sometimes a bit awkward. First, the keyword-in-URL filtering is bidirectional, so you can trigger it from outside China going in or from inside China going out.

Second, “keywords” can be domains themselves; I’ve even seen URLs used as a “keyword”. If these keywords appear in the HTTP Host header or in the GET request they will be “blocked”.
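To make the matching idea concrete, here is a small Python sketch that checks whether a keyword shows up in the request line or the Host header of a raw HTTP request. This is only a model of the behaviour described above, not the GFW's actual logic, which is not public. The function name `request_triggers_filter` and the case-insensitive substring matching are assumptions of mine.

```python
def request_triggers_filter(raw_request, keywords):
    """Return the first keyword found in the request line or Host header, else None.

    Assumes simple case-insensitive substring matching (an illustration only).
    """
    lines = raw_request.split(b"\r\n")
    request_line = lines[0].decode("latin-1").lower()

    host = ""
    for line in lines[1:]:
        if line == b"":
            break  # blank line marks the end of the headers
        name, _, value = line.decode("latin-1").partition(":")
        if name.strip().lower() == "host":
            host = value.strip().lower()
            break

    for keyword in keywords:
        needle = keyword.lower()
        if needle in request_line or needle in host:
            return keyword
    return None
```

Under this model a request with `Host: mail.yahoo.com` matches the keyword “.yahoo.com” because the leading period is part of the substring, which is how a single domain-style keyword could cover every subdomain.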

Third, the way the blocking works is that the 3-way TCP handshake is established, but when the GET request goes through, the GFW sends RST packets to both the requester and the host (spoofed to appear as if they were from one another) to tear down the connection; the host and the requester then respond to each other with more RST packets. (There is some additional variation, but that’s the basic version; see Steven Murdoch et al’s paper http://www.cl.cam.ac.uk/~rnc1/ignoring.pdf for more details.)

The tricky part is that, depending on the GFW (maybe related to the load), some of the transaction will go through. So, for example, you may get half (or more!) of the HTML before the RST packet. Also, part of the page may load because, for example, it is not until an image with a keyword in its file name is requested that the RST packet is sent.
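From the client side, this kind of blocking shows up as a connection that is established and then reset, sometimes after part of the response has already arrived. The sketch below sends one GET and reports both the outcome and whatever bytes came back first. The name `probe_get` and its parameters are hypothetical. A reset seen here cannot by itself be attributed to the GFW, since servers and other middleboxes also send RSTs; a packet capture is still the way to confirm it.

```python
import socket


def probe_get(ip, host_header, path="/", port=80, timeout=5.0):
    """Send one GET and return (outcome, bytes received before the end)."""
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host_header}\r\n"
        "Connection: close\r\n\r\n"
    )
    received = b""
    try:
        with socket.create_connection((ip, port), timeout=timeout) as sock:
            sock.sendall(request.encode("ascii"))
            while True:
                chunk = sock.recv(4096)
                if not chunk:
                    # The peer closed the connection in the ordinary way.
                    return "closed normally", received
                received += chunk
    except (ConnectionResetError, BrokenPipeError):
        # A RST arrived after the handshake; keep whatever was read so far.
        return "reset", received
    except socket.timeout:
        return "timeout", received
```

A result of "reset" with a non-empty second element corresponds to the half-loaded page case.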

Finally, the trickiest part. Because of the combination of additional RST packets from the GFW (and then the RSTs from the requester and host in response), further connections between the requester and host (not the whole internet, as often reported) are disrupted for some time. This means that if you are in China and you connect to Google (hosted outside of China) and you search for a banned keyword (the keyword goes into the GET request) you’ll be blocked. If you hit the back button in your browser, get the cached copy of Google and then search for a keyword that is NOT blocked, it will appear to be blocked because your connection to Google is still being subjected to RST packets. This sometimes results in reports that certain keywords are blocked when in fact they are not.
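The false-positive effect is easier to see in a toy model. The Python sketch below tracks a penalty window per (requester, host) pair: a request containing a keyword starts the window, and any request on the same pair inside the window is reset regardless of its content. The class name and the 60 second `penalty_seconds` default are invented for illustration; all I can say from observation is that the disruption lasts “some time”.

```python
class ResidualBlockModel:
    """Toy model of keyword resets followed by a per-pair penalty window."""

    def __init__(self, keywords, penalty_seconds=60.0):
        self.keywords = [k.lower() for k in keywords]
        # Hypothetical duration; the real value is not known here.
        self.penalty_seconds = penalty_seconds
        self._blocked_until = {}  # (requester, host) -> time the window ends

    def request(self, requester, host, url, now):
        pair = (requester, host)
        if now < self._blocked_until.get(pair, float("-inf")):
            # Still inside the window: reset even for an innocent URL.
            return "reset (residual)"
        if any(k in url.lower() for k in self.keywords):
            self._blocked_until[pair] = now + self.penalty_seconds
            return "reset (keyword)"
        return "ok"
```

In this model, a search for an unblocked term sent 10 seconds after a blocked one comes back as "reset (residual)", while the same search against a different host, or after the window ends, comes back "ok". That is the pattern that leads people to report harmless keywords as blocked.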

Another important point to recognize is that this is dependent upon IP address. So, if the site you connect to has multiple IP addresses the behaviour may seem even more inconsistent, as your requests may be served by different IP addresses. For testing purposes it is best to connect directly to an IP rather than a domain name to ensure that you are always connecting to the same IP.
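One way to pin a test to a single address is to resolve the name once, pick one IP, and connect to that IP while still sending the original name in the Host header. The Python sketch below shows the idea; the helper names (`resolve_ipv4`, `build_request`, `fetch_via_ip`) are my own, and it only handles plain HTTP over IPv4.

```python
import socket


def resolve_ipv4(hostname, port=80):
    """Return the sorted, de-duplicated IPv4 addresses for hostname."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def build_request(hostname, path="/"):
    """Build a GET that names the site in the Host header."""
    return (
        f"GET {path} HTTP/1.1\r\nHost: {hostname}\r\nConnection: close\r\n\r\n"
    ).encode("ascii")


def fetch_via_ip(ip, hostname, path="/", port=80, timeout=5.0):
    """Connect to one specific IP and request hostname's page from it."""
    with socket.create_connection((ip, port), timeout=timeout) as sock:
        sock.sendall(build_request(hostname, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

Looping over the list from `resolve_ipv4` and calling `fetch_via_ip` on each address separately keeps the results for each IP apart, so a reset on one address is not confused with a clean response from another.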

On June 27 2007, I captured traffic between myself and yahoo.cn (hosted in China), as well as some other hosts in China, using “.yahoo.com” (yes, that starts with a period, i.e. it affects all *.yahoo.com domains including mail.yahoo.com) and can confirm that it was subjected to the “keyword-in-URL” blocking behaviour with “.yahoo.com” as the keyword.

However, and this is my opinion, the RST packets were quite slow to arrive. In some cases the RST did not come until after the page had loaded successfully (future connections were subjected to RSTs). It is possible that many requests for “.yahoo.com” were causing the GFW to slow down; anecdotally, the RST packets were not being received as fast as they usually are.

As of June 28 2007, “.yahoo.com” is no longer blocked by China.