Getting the Color Scheme of a Website Using Canvas and Hierarchical Clustering

Update: I now think there are better (faster) ways to do this. Check out color quantization if you don’t care about preserving the exact color values.

For a Firefox addon I’m working on, I wanted to grab the color scheme of whatever website the user is viewing. There are a few extensions that do something like this: Colorzilla and Palette Grabber get the colors used by the website’s DOM, and ColorSuckr gets the most common colors from an image. The problem with getting the colors from the DOM is that websites use images and gradients, so sometimes you can’t get the overall color scheme just from looking at the CSS colors.

Luckily, you can capture the color of every pixel on a webpage from an extension using the HTML5 canvas element. You can draw any image onto a canvas and get an array of the image’s pixel values with the getImageData function. In Firefox, you can actually draw the entire webpage onto a canvas using the drawWindow function (drawWindow is a Firefox-only canvas extension, but you can at least use drawImage in other browsers).
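Roughly, from chrome-privileged extension code, the capture looks something like this (just a sketch; I’m assuming gBrowser.contentWindow as the way to get at the page’s window from the main browser window):

```javascript
// Sketch: paint the visible page into a canvas. drawWindow is Firefox-only
// and needs chrome privileges, since it can read cross-origin content.
var win = gBrowser.contentWindow;
var canvas = document.createElementNS("http://www.w3.org/1999/xhtml", "canvas");
canvas.width = win.innerWidth;
canvas.height = win.innerHeight;
var ctx = canvas.getContext("2d");
ctx.drawWindow(win, win.scrollX, win.scrollY,
               canvas.width, canvas.height, "rgb(255,255,255)");
```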

So getting the number of occurrences of each color is as simple as drawing the page to a canvas, looping through each pixel value in the array returned by getImageData, and tallying each color’s frequency in a JavaScript hash.
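Continuing the sketch, the tally over the pixel array:

```javascript
// getImageData returns RGBA values in one flat array, four entries per pixel.
var pixels = ctx.getImageData(0, 0, canvas.width, canvas.height).data;
var counts = {};
for (var i = 0; i < pixels.length; i += 4) {
  var color = pixels[i] + "," + pixels[i + 1] + "," + pixels[i + 2]; // "r,g,b"
  counts[color] = (counts[color] || 0) + 1; // ignore the alpha component
}
```

This is what you get when performing this analysis on the Twitter homepage: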

So you can get the colors that occur most on the page pretty easily. There is, however, one big problem with this: on images and gradients, there are areas of very similar colors that might as well be the same color in the context of the overall theme. As I found out, there are usually over 10,000 different colors on each webpage, so these colors need to be grouped together.

This kind of problem is called clustering, and it comes up a lot in image analysis, computational biology, and other computational disciplines. Two common clustering algorithms are k-means and hierarchical clustering. K-means can be faster than hierarchical clustering, but the catch is that you have to know what k is before you even start: exactly how many clusters you want to end up with. That can’t be determined in this situation, so hierarchical clustering is the best bet.

The premise of hierarchical clustering is simple: each color starts out in its own cluster. On each pass of the algorithm, the two clusters that are most similar (according to a metric you define yourself) are merged. You keep doing this until there are no more clusters that are similar enough to merge.

I defined the distance between two colors to be the maximum difference between the two colors’ red, green, and blue components. Two colors were ‘similar enough’ if their distance was less than 12 (where each color component ranges from 0 to 255). When two clusters were merged, the new cluster’s representative color was a weighted average of the two clusters’ representative colors.
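In sketch form (simplified, not the actual addon code; I’m assuming here that the weights in the average are the clusters’ pixel counts):

```javascript
var THRESHOLD = 12; // clusters closer than this get merged

// Maximum difference across the red, green, and blue channels.
function distance(a, b) {
  return Math.max(Math.abs(a[0] - b[0]),
                  Math.abs(a[1] - b[1]),
                  Math.abs(a[2] - b[2]));
}

// Representative color is an average weighted by pixel count (an assumption).
function merge(c1, c2) {
  var total = c1.count + c2.count;
  var color = [0, 1, 2].map(function(i) {
    return Math.round((c1.color[i] * c1.count + c2.color[i] * c2.count) / total);
  });
  return {color: color, count: total};
}

// Each cluster is {color: [r,g,b], count: n}; every color starts on its own.
function cluster(clusters) {
  while (true) {
    // find the closest pair of clusters
    var best = {dist: Infinity, i: -1, j: -1};
    for (var i = 0; i < clusters.length; i++) {
      for (var j = i + 1; j < clusters.length; j++) {
        var d = distance(clusters[i].color, clusters[j].color);
        if (d < best.dist) {
          best = {dist: d, i: i, j: j};
        }
      }
    }
    if (best.dist >= THRESHOLD) {
      break; // nothing similar enough left to merge
    }
    var merged = merge(clusters[best.i], clusters[best.j]);
    clusters.splice(best.j, 1); // remove j first: j > i, so i stays valid
    clusters.splice(best.i, 1);
    clusters.push(merged);
  }
  return clusters;
}
```

The algorithm worked pretty well for this application; check out the results: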

The algorithm takes a long time (a few seconds) even if you just focus on the top few hundred colors, but that’s what Web Workers are for, after all. You can check out the code here. Do you know a faster clustering algorithm? How would you quantify the distance between two colors?

Update: after getting some great feedback, I refined the distance measure to be the 3D distance between the two colors in the L*a*b* color space (the CIE76 distance measure). Thanks to the developer of FireColour for the open source L*a*b* conversion code.
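For reference, CIE76 is just the Euclidean distance between the two colors in L*a*b* space:

```javascript
// CIE76: Euclidean distance in L*a*b* space. lab1 and lab2 are [L, a, b]
// triples; the RGB-to-L*a*b* conversion (FireColour's code) is omitted here.
function cie76(lab1, lab2) {
  return Math.sqrt(Math.pow(lab1[0] - lab2[0], 2) +
                   Math.pow(lab1[1] - lab2[1], 2) +
                   Math.pow(lab1[2] - lab2[2], 2));
}
```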

Regression Range Finder for Firefox Nightly Builds

[screenshot: regression range finder in action on Windows]

project page · github

UPDATE: mozregression has been polished up; see http://harthur.github.com/mozregression

Last week I came across a bug that had snuck into Firefox sometime after 3.0. I went to find the regression range using hg bisect, but quickly realized this wouldn’t work for a regression that occurred so long ago: the dependencies for Linux had changed, and building the old source was a pain. So I started pinning down the range using the mozilla-central nightlies instead. This usually takes a couple of hours, and I was tired of doing it by hand and miscalculating the bisect steps, so I wrote a Python script to do practically all the work for me.

The script takes a ‘good’ date and a ‘bad’ date as arguments and narrows down the range by running a binary search on the mozilla nightlies: it downloads each build, installs it, then pops open a new window in the nightly. You do whatever you have to do to verify the bug’s presence, then enter ‘good’ or ‘bad’ into the command prompt depending on whether the bug appeared in that nightly. It repeats this a few times to narrow down the range.
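The search itself is ordinary bisection, just over dates instead of array indices. A sketch of the idea (in JavaScript for illustration only; the script itself is Python, and testNightly here stands in for the download-install-ask cycle):

```javascript
function daysBetween(a, b) {
  return Math.round((b - a) / (24 * 60 * 60 * 1000));
}

function midpoint(a, b) {
  return new Date((a.getTime() + b.getTime()) / 2);
}

// goodDate and badDate are Date objects; testNightly(date) returns
// "good" or "bad" after the user checks that date's build.
function findRegression(goodDate, badDate, testNightly) {
  while (daysBetween(goodDate, badDate) > 1) {
    var mid = midpoint(goodDate, badDate);
    if (testNightly(mid) === "good") {
      goodDate = mid; // bug not there yet: the regression is after mid
    } else {
      badDate = mid;  // bug present: the regression is at or before mid
    }
  }
  return {lastGood: goodDate, firstBad: badDate};
}
```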

When you’ve checked enough nightlies (about log n of them, if your initial regression range is n days), you’ll see something like:

Last good nightly: 2009-06-12
First bad nightly: 2009-06-13

You can then paste that into bugs to make people very happy (-:

Check out the installation instructions. Some future plans include:

* Automatic tests. Use them to find the regression range with no interaction required (targeting Mozmill tests first, then mochitest and the others)
* [done] Other Applications. Run other Mozilla nightlies like Thunderbird (this shouldn’t be too hard because the script uses mozrunner <3)
* Other branches. Not just mozilla-central.
* [done] Mac. Get it working there.

Configure Apache To Accept Cross-Site XMLHttpRequests on Ubuntu

Update: check out http://enable-cors.org/server_apache.html for more up-to-date instructions.

1. Make sure you have the mod_headers Apache module enabled. To check, look in /etc/apache2/mods-enabled/ and see if there’s a ‘headers.load’ in there. If there isn’t, just run sudo ln -s /etc/apache2/mods-available/headers.load /etc/apache2/mods-enabled/headers.load (or, equivalently, sudo a2enmod headers).

2. Add the Access-Control-Allow-Origin header to all HTTP responses. You can do this by adding the line Header set Access-Control-Allow-Origin "*" to the desired <Directory> section in your configuration file (like the /etc/apache2/sites-available/default file); there’s a complete example after these steps. Saying "*" will allow cross-site XHR requests from anywhere. You can say "www.myothersite.com" to only accept requests from that origin.

3. Reload the Apache server: sudo /etc/init.d/apache2 reload
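Putting step 2 together, the relevant section of the configuration file might look like this (the /var/www/ path is just a placeholder; use whatever <Directory> section serves your content):

```apache
# Allow cross-site XHR from anywhere for content under /var/www/
<Directory /var/www/>
    Header set Access-Control-Allow-Origin "*"
</Directory>
```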

Maybe this is really obvious to a lot of people, but it wasn’t to me, so there you go.

Javascript API for Bugzilla XML-RPC Web Services

UPDATE: use bz.js instead!

The QAC (QA Companion) is a Firefox add-on that lets beta testers easily find bugs, run manual test cases on the browser, and get involved in the Mozilla community. Right now it’s a bit out of date and fragile, so we’re going to be completely redesigning and reimplementing it. One thing we needed for it was a way to create and search for bugs in Bugzilla, and the only way to do this is through Bugzilla’s XML-RPC interface. So my first project was to create a JavaScript library that interfaces with Bugzilla’s web services. It basically acts as a JSON-RPC library that wraps the XML-RPC functionality (in Bugzilla 3.5 there will be a JSON-RPC interface).

The API docs are here, along with a link to download: bugzillaRPC. Because the library makes HTTP requests to the Bugzilla server, you can’t use it from regular content; you can, however, use it from any extension or chrome-privileged code. One thing we had hoped to have was account-creation functionality. Unfortunately, a user with ‘editaccount’ privileges must be logged in to create an account for another user, and that would be a security hole for our project. There is, though, functionality for offering a user an account by sending them an email. I didn’t implement every function in the Bugzilla Web Services API, just the ones the project uses, but it’s extremely easy to extend. We hope other people can use this!
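For a flavor of what the library wraps: each call is an XML-RPC methodCall POSTed to Bugzilla’s xmlrpc.cgi. This is an illustrative sketch, not bugzillaRPC’s actual API:

```javascript
// Illustration only: POST an XML-RPC methodCall to Bugzilla and hand the
// raw XML response to a callback. Needs chrome privileges, as noted above.
function callBugzilla(method, callback) {
  var body = '<?xml version="1.0"?>' +
             '<methodCall><methodName>' + method + '</methodName>' +
             '<params/></methodCall>';
  var req = new XMLHttpRequest();
  req.open("POST", "https://bugzilla.mozilla.org/xmlrpc.cgi", true); // https!
  req.setRequestHeader("Content-Type", "text/xml");
  req.onload = function() { callback(req.responseXML); };
  req.send(body);
}

// Note the capitalization: 'Bugzilla.version', not 'bugzilla.version'.
callBugzilla("Bugzilla.version", function(xml) { /* parse the response */ });
```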

Bugzilla Web Services fun

So if you try to call a method from a module that doesn’t exist in Bugzilla web services (in my case, I was calling ‘bugzilla.version’ when I should have been calling ‘Bugzilla.version’), it returns a really cryptic and unhelpful fault with the text: Can’t call method “login_exempt” on an undefined value at Bugzilla/WebService/Server.pm line 25. Just in case anyone else runs into this.

Also, you must use https://bugzilla.mozilla.org/xmlrpc.cgi, not http://bugzilla.mozilla.org/xmlrpc.cgi, or it will return a 411 error.

RGBAL

There should be a new standard for colors: rgbal, where you can specify red, green, blue, alpha, and luminosity values. The luminosity (I’m probably misusing this word here, but b for brightness was taken) would control the brightness of a particular pixel on the screen (display brightness, not HSB brightness), which right now seems to be controlled in an all-or-nothing way, often using special keys on your keyboard. Can you imagine how cool (/inaccessible!) things could get?

Treeview Hang on Dragging

Dragging onto a XUL tree that has a treeview defined hangs the browser. I’ve defined the nsITreeView function canDrop, and it always returns true. /-:

UPDATE: I had defined getParentIndex (from nsITreeView) but hadn’t implemented it. This function has to be implemented for a drop operation, and it must return -1 if the row has no parent. Agh.
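In other words, something like this minimal sketch of a flat view (assuming no row has a parent):

```javascript
// Minimal sketch: besides canDrop, a flat nsITreeView must actually
// implement getParentIndex for drops to work.
var treeView = {
  rowCount: 0,
  canDrop: function(index, orientation) {
    return true;
  },
  getParentIndex: function(index) {
    return -1; // required for drop handling; -1 means "no parent"
  }
  // ...the rest of the nsITreeView members...
};
```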