Tuesday 10 August 2010

WhatBlock 2.0

Here I'm going to set out my proposals for WhatBlock 2.0. Please comment on how feasible/realistic my ideas are!
http://dev.dfey.org/whatblock -- for the current app
http://dev.dfey.org/whatblock/view.php -- for the current data collected by the app

What is WhatBlock?
WhatBLOCK is an http-accessible application designed to test the internal networks of schools and offices across the country to assess the wide-scale use and abuse of internet content filtering.

Many students and office workers across the country are at the mercy of their I.T. departments and content filtering software when it comes to internet access. It can be a difficult and time-consuming process to get legitimate, work-related websites unblocked on an internal network, and WhatBLOCK is an attempt to provide a level of scrutiny for this process.

It is hoped that WhatBLOCK can ultimately make the lists of banned websites accessible and the people who create them accountable. WhatBLOCK therefore needs to find out which websites are blocked on different networks - this is achieved by presenting the user within the local network with a page containing an iframe of a website which could potentially be blocked. The user is then prompted to declare the name of the School/Institution which they are in and whether or not they can see the page within the iframe; this data is then logged.

The Problems With This
One of the main problems with WhatBlock is its reliance on user interaction to collect the data, and the corresponding impact on reliability. We are effectively forced to crowd-source the data because of the restrictions browsers impose on cross-domain requests to guard against attacks such as Cross-Site Scripting (XSS - see http://en.wikipedia.org/wiki/Cross-site_scripting).

The Problem with XSS
These cross-domain restrictions (strictly speaking the browser's same-origin policy, rather than XSS itself) mean we can't make an XML HTTP connection to check whether or not we can reach the external server. They also mean that if we use an iframe to load the webpage, we can't automatically check the URL at which the iframe actually finishes loading. As a result, we are forced to present the user with an iframe containing a webpage and then ask them whether or not they can see the site. Don't forget that in a lot of these places people can't run executable binary files, so in most places it's only possible to run this test through a browser.
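
To make the restriction concrete, here is a minimal sketch of the comparison a browser effectively makes before it lets a page read a response or inspect an iframe: the scheme, host and port of the two URLs must all match. The function name isSameOrigin is my own invention and is not part of WhatBlock, and it assumes the standard URL class, which newer browsers and Node.js provide.

```javascript
// Hypothetical helper (not part of WhatBlock): two URLs share an origin
// only if their scheme, host and port all match.
function isSameOrigin(pageUrl, targetUrl) {
  return new URL(pageUrl).origin === new URL(targetUrl).origin;
}

var page = "http://dev.dfey.org/whatblock";

// Same scheme, host and port: the page may read this response.
console.log(isSameOrigin(page, "http://dev.dfey.org/whatblock/view.php"));

// Different host: the page may not read the response or the iframe's URL.
console.log(isSameOrigin(page, "http://twitter.com/"));
```

Every site on the test list falls into the second case, which is why the runtime page can't simply ask the browser what happened.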

The Problem with iFrames
This approach means that mistakes are made. Sometimes the webpage may take a while to load, and the user may mistakenly take this as a sign that the webpage cannot load. Slow pages also increase the time taken to run each test, which means we can run fewer tests and user satisfaction drops (it's their time they're giving up!). The user could also lie and give false data, or they could simply make a mistake - both of which add up to less reliable data and a headache when trying to process any of it. A lot of websites such as twitter, almost all webmail services, myspace, etc. also have JavaScript code to break out of the iframe, meaning that with the current testing methods we can't test these websites, as the user's browser just ends up pointing to twitter or Gmail. There's no quick or reliable fix for this problem, so we currently just ignore these websites when testing - but these are often some of the more interesting websites to check! So what's the solution to all of this?

Possible Solutions
Idea: We could load images from the websites and ask the user if they can see them.
Problem: Many websites use a dedicated image domain. For example, twitter uses http://*.twimg.com to serve up ALL images for the twitter.com site. A content filtering server is more likely to block twitter.com than *.twimg.com, so even if twitter.com is blocked we might still be able to reach *.twimg.com, giving us a false-positive for the test. This also doesn't remove the user interaction.

Idea: We could load a JavaScript file from the website and check it for any variables we know to be there.
Problem: This is time-consuming for the developer, as it has to be done individually for each website on the list! Some websites may not use JavaScript, they may not keep it in .js files (meaning it won't be accessible), or they may keep the files on a separate domain, as with the images. BUT this does remove the user interaction issue.

A Solution Which Could Work
Now this is where I think the project suddenly gets a lot less feasible... The only way to decrease load time, remove user interaction and reliably get data on this is to put a small JavaScript file on the remote server.

If the remote server admin were kind enough to create a http://www.remoteserver.com/whatblock.js file and put inside it something to the effect of:

whatblock = true

then it would be possible to include the remote JavaScript file in WhatBlock's runtime page and verify whether or not the user can access www.remoteserver.com. As the contents of the file are so small, it should have very little effect on the server's bandwidth usage, and it is easy to set up. The admin could then inform WhatBlock where the file is; WhatBlock adds it to its database and includes it in future content filtering tests.
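
As a rough sketch of how the runtime page might run such a check, assuming the remote file really does nothing more than set a global called whatblock to true: the page adds a script tag pointing at the remote file, then looks at whether the variable was set once the script loads, fails, or a timeout passes. The names probeUrl and probeSite, the callback shape and the timeout are all hypothetical, and the document and window objects are passed in as parameters purely so the function can be exercised outside a browser.

```javascript
// Hypothetical helper: where an opted-in site is assumed to keep its file.
function probeUrl(host) {
  return "http://" + host + "/whatblock.js";
}

// Hypothetical probe: includes the remote whatblock.js via a <script> tag
// and calls done(host, reachable) exactly once.
function probeSite(host, doc, win, timeoutMs, done) {
  var script = doc.createElement("script");
  var finished = false;

  function finish(reachable) {
    if (finished) return;        // ignore late events after the first result
    finished = true;
    win.clearTimeout(timer);
    done(host, reachable);
  }

  // Treat a probe that never answers as unreachable.
  var timer = win.setTimeout(function () { finish(false); }, timeoutMs);

  // Reset the flag so a previous probe can't leak into this one.
  win.whatblock = undefined;

  // A filter's HTML block page may still "load", but it won't set the flag.
  script.onload = function () { finish(win.whatblock === true); };
  script.onerror = function () { finish(false); };

  script.src = probeUrl(host);
  doc.getElementsByTagName("head")[0].appendChild(script);
}

// In a browser this might be called as:
// probeSite("www.remoteserver.com", document, window, 5000,
//           function (host, ok) { /* log the result */ });
```

Because every probe shares the one global flag, this sketch assumes probes run one at a time; running several in parallel would need a distinct variable name per site.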

Why would an admin choose to do this?
An admin could choose to do this primarily for ideological reasons - the admin may support the act of monitoring secretly held lists of blocked websites. Many of us have felt the pain of going to a work-related website only to find that it's blocked, and wished that there was some real, transparent scrutiny of what gets blocked and what doesn't. The admin may also find it beneficial to see which institutions can and cannot access their websites - how can a website reach out to its target audience if it is unknowingly and incorrectly being made unavailable to the people it is designed to help? It's perfectly understandable that the people writing these lists make mistakes about what's listed on them, but if no one else is allowed to check that list then how is the mistake ever going to be corrected? WhatBlock is planned so that the data collected will be accessible by anyone, not just a select few, to provide a truly open dataset of banned websites - if your website is on WhatBlock, you will be able to see if it's blocked or not.

What if WhatBlock gets Blocked?
This may well happen, so it would be advantageous to have multiple instances of WhatBlock running on different domains - releasing the project under an Open Source license should provide for this.

Where do we go from here?
Well, first things first we would need to build a fully functioning site! Once that vital step is over, public awareness needs to be drawn to this problem to both bring users to the site and to get website administrators to put our little bit of JavaScript on their servers. Then, it's all down to data collection!

Feel free to leave any comments, critiques or suggestions!
Rob

EDIT: A fresh idea was floated by @harryrickards - he suggested that we run a Flash instance which we could use as a proxy between the JS in our runtime page and the remote server. Unfortunately that will not work, as Flash also has cross-domain restrictions: you need to put an XML file on the remote server to allow it to work. (See http://www.adobe.com/devnet/articles/crossdomain_policy_file_spec.html)

The only way I can see this idea working is to get a signed Java applet running on the local user's computer. This is far from ideal, but it is the closest to an executable binary that we can possibly get. (See http://weblogs.java.net/blog/2008/05/28/java-doodle-crossdomainxml-support) That would allow us to do cross-domain data transfers, but it would prompt the user with an ugly confirmation box and leave us unable to check the content filters of computers which don't have Java installed. Does anyone know what proportion of business/school computers DON'T have Java installed on them?

EDIT 2: WhatBlock 2.0 prototype has been released for testing using the image technique described above, loading the 'favicon.ico' file from the remote server. I'm currently unsure how this will work with real-world content filtration services, so testing is necessary. If you want to test it, see http://dev.dfey.org/whatblock2.
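
For anyone curious what a favicon check of this kind might look like, here is a minimal sketch. It is not the prototype's actual code: the name probeFavicon and the callback shape are hypothetical, and the image factory is passed in as a parameter only so the function can be tried outside a browser. The idea is that the browser fires the image's load event if the file arrives and decodes as an image, and its error event otherwise, so no user interaction is needed.

```javascript
// Hypothetical sketch of an image probe; not the actual prototype code.
// `createImage` is injected so this can run outside a browser; in a page
// you would pass: function () { return new Image(); }
function probeFavicon(host, createImage, done) {
  var img = createImage();

  // The file arrived and decoded as an image: the site looks reachable.
  img.onload = function () { done(host, true); };

  // Blocked, missing, or replaced by something that isn't an image.
  img.onerror = function () { done(host, false); };

  // Setting src is what starts the request in a browser.
  img.src = "http://" + host + "/favicon.ico";
}

// In a browser this might be called as:
// probeFavicon("twitter.com", function () { return new Image(); },
//              function (host, ok) { /* log the result */ });
```

The obvious weak points are sites that have no favicon.ico at all (which would look blocked) and filters that answer with an image of their own (which would look reachable), so results from this would need checking against real content filters. A request that simply hangs would also need a timeout, which this sketch leaves out.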