For the most part* the GFW blocks in two ways:
1) IP blocking
2) Keyword-in-URL blocking
IP blocking is pretty easy to spot: traceroute will fail at the backbone level in China, and there will only be outgoing SYN packets to the IP, so the 3-way TCP handshake is never completed. (Note: all domains hosted on that IP are affected.)
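The handshake symptom above can be checked with a few lines of Python. This is only a sketch: the function name, the default timeout and the result labels are illustrative choices, not part of any tool mentioned here. A timeout on its own does not prove blocking (the host may simply be down), so the result should be compared with the same check run from outside China.

```python
import socket

def handshake_outcome(ip, port=80, timeout=5.0):
    """Attempt a single TCP handshake with ip:port and report how it ended."""
    try:
        sock = socket.create_connection((ip, port), timeout=timeout)
    except socket.timeout:
        # SYNs went out but nothing came back: consistent with IP blocking,
        # but also with a host that is down or silently firewalled.
        return "timeout"
    except ConnectionRefusedError:
        # The host (or something on the path) answered with RST: port closed.
        return "refused"
    except OSError:
        # Unreachable network, no route, and similar failures.
        return "error"
    sock.close()
    return "established"
```

An "established" result from inside China rules out IP blocking for that address, which is when the keyword tests become relevant.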
“Keyword-in-URL” blocking is different and sometimes a bit awkward. First, the keyword-in-url filtering is bi-directional, so you can trigger it from outside -> to -> China or from China -> to -> outside.
Second, “keywords” can be domains themselves; I’ve even seen full URLs used as a “keyword”. If these keywords appear in the HTTP Host header or in the GET request, the connection will be “blocked”.
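As a toy model of what is being matched, the sketch below looks for keywords in the two places named above: the GET request line and the Host header. The function name is hypothetical, and plain case-insensitive substring matching is an assumption; how the GFW actually matches is not known from the outside.

```python
def triggers_filter(request: bytes, keywords):
    """Return the first keyword found in the request line or Host header, else None.

    Toy model only: case-insensitive substring matching is an assumption.
    """
    # Keep only the header section of the HTTP request.
    head = request.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    lines = head.split("\r\n")
    inspected = [lines[0]]  # request line, e.g. "GET /path HTTP/1.1"
    inspected += [line for line in lines[1:] if line.lower().startswith("host:")]
    text = "\n".join(inspected).lower()
    for keyword in keywords:
        if keyword.lower() in text:
            return keyword
    return None
```

With the keyword “.yahoo.com”, a request carrying `Host: mail.yahoo.com` matches in this model, while one carrying `Host: example.org` does not.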
Third, the way the blocking works is that the 3-way TCP handshake is established, but when the GET request passes through the GFW, it sends RST packets to both the requester and the host (spoofed to appear as if they came from one another) to tear down the connection. The host and the requester then respond to each other with more RST packets. (There is some additional variation, but that’s the basic version; see Steven Murdoch et al.’s paper http://www.cl.cam.ac.uk/~rnc1/ignoring.pdf for more details.)
The tricky part is that, depending on the GFW (maybe related to its load), part of the transaction may go through. For example, you may get half (or more!) of the HTML before the RST packet arrives. Part of a page may also load because, for example, the RST packet is not sent until an image with a keyword in its file name is requested.
Finally, the trickiest part. Because of the additional RST packets from the GFW (and the RSTs that the requester and host then send in response), further connections between the requester and host (not the whole internet, as is often reported) are disrupted for some time. This means that if you are in China, connect to Google (hosted outside of China) and search for a banned keyword (the keyword goes into the GET request), you’ll be blocked. If you then hit the back button in your browser, get the cached copy of Google and search for a keyword that is NOT blocked, it will appear to be blocked because your connection to Google is still being subjected to RST packets. This sometimes results in reports that certain keywords are blocked when in fact they are not.
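One way to avoid such false reports when logging test results is to flag any reset that follows shortly after an earlier reset on the same requester/host pair. The sketch below does that. The function name and labels are hypothetical, and the `cooldown_s` window is an assumption: the text above only says the disruption lasts “for some time”, so the value has to be chosen by the tester.

```python
def label_observations(observations, cooldown_s):
    """Label each probe result for ONE requester/host pair.

    observations: list of (time_in_seconds, keyword, was_reset), sorted by time.
    cooldown_s:   assumed length of the disruption window after a reset.
    """
    labels = []
    last_reset = None
    for t, keyword, was_reset in observations:
        if not was_reset:
            labels.append((keyword, "passed"))
            continue
        # A reset soon after an earlier reset may be left-over disruption
        # rather than evidence that this keyword is itself blocked.
        residual = last_reset is not None and (t - last_reset) < cooldown_s
        labels.append((keyword, "possibly residual" if residual else "likely keyword"))
        # Conservative choice (an assumption): any reset restarts the window.
        last_reset = t
    return labels
```

For example, a reset on a harmless keyword ten seconds after a reset on a banned one would be labelled “possibly residual” with a 120-second window, and should be retested once the connection has recovered.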
Another important point to recognize is that this is dependent upon IP address. So, if the site you connect to has multiple IP addresses, the behaviour may seem even more inconsistent, because your requests may be served by different IP addresses. For testing purposes it is best to connect directly to an IP rather than a domain name to ensure that you are always connecting to the same IP.
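A minimal sketch of such a test in Python: connect to a fixed IP, send the GET request with the Host header set by hand, and record how the connection ended and how many bytes arrived first. The function name, defaults and result labels are illustrative. A reset seen by the client does not by itself prove it came from the GFW; a packet capture is needed for that.

```python
import socket

def probe(ip, host, path="/", port=80, timeout=10.0):
    """Send one GET to a fixed IP and return (outcome, bytes_received)."""
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n\r\n"
    ).encode("ascii")
    received = 0
    try:
        with socket.create_connection((ip, port), timeout=timeout) as sock:
            sock.sendall(request)
            while True:
                chunk = sock.recv(4096)
                if not chunk:
                    # Orderly close by the server: the transfer finished.
                    return ("closed", received)
                received += len(chunk)
    except ConnectionResetError:
        # An RST arrived, possibly after part of the page was delivered.
        return ("reset", received)
    except socket.timeout:
        return ("timeout", received)
    except OSError:
        return ("error", received)
```

Because the IP is fixed, repeated probes with different `host` and `path` values all exercise the same requester/host pair, and the byte count shows whether a reset arrived before or after part of the page was received.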
On June 27, 2007, I captured traffic between myself and yahoo.cn (hosted in China), as well as some other hosts in China, using “.yahoo.com” as the test string (yes, that starts with a period, i.e. it affects all *.yahoo.com domains, including mail.yahoo.com). I can confirm that the traffic was subjected to the “keyword-in-url” blocking behaviour with “.yahoo.com” as the keyword.
However, and this is my opinion, the RST packets were quite slow to arrive. In some cases the RST did not come until after the page had loaded successfully (future connections were still subjected to RSTs). It is possible that many requests for “.yahoo.com” were causing the GFW to slow down; anecdotally, the RST packets were not being received as fast as they usually are.
As of June 28, 2007, “.yahoo.com” is no longer blocked by China.