Getting the Color Scheme of a Website Using Canvas and Hierarchical Clustering

by harthur

Update: brief update, I think there are better (faster) ways to do this. Check color quantization if you don’t care about preserving the exact color values.

For a Firefox addon I’m working on I wanted to grab the color scheme of whatever website the user is viewing. There are a few extensions that do something like this: Colorzilla and Palette Grabber get the colors used by the website’s DOM, and ColorSuckr gets the most common colors from an image. The problem with getting the colors from the DOM is that websites use images and gradients so sometimes you can’t get the overall color scheme just from looking at the CSS colors.

Luckily, you can capture the color of every pixel on a webpage from an extension using the HTML 5 standard canvas. You can draw any image onto a canvas and get an array of pixel values used by the image with the getImageData function. In Firefox, you can actually draw the entire webpage onto a canvas using the drawWindow function (drawWindow is a Firefox-only canvas extension, but you can at least use drawImage in other browsers).

So getting the number of occurrences of each color is as simple as drawing the page to a canvas, looping through each pixel value in the array returned by getImageData, and tallying each color’s frequency in a javascript hash. This is what you get when performing this analysis on the Twitter homepage:

So you can get the colors that occurred most on the page pretty easily, there is however, one big problem with this. On images and gradients, there are areas of very similar colors that might as well be the same color in the context of the overall theme. As I’ve found out, there are usually over 10,000 different colors on each webpage, so these colors need to be grouped together.

This kind of problem is called clustering, and it’s a problem that comes up a lot in image analysis, computational biology, and other computational disciplines. There are two common clustering algorithms, k-means and hierarchical clustering. K-means can be faster than hierarchical clustering, but the problem with k-means is that you have to know what k is before you even start — you have to know exactly how many clusters you want to end up with. That can’t be determined in this situation, so hierarchical clustering is the best bet.

The premise of hierarchical clustering is simple. Each color starts out in its own cluster. On each pass of the algorithm, the two clusters that are most similar (according to a metric you define yourself) are merged. You keep on doing this until there are no more clusters that are similar enough to merge.

I defined the distance of two colors to be the maximum difference between the two colors’ red, green, and blue components. Two colors were ‘similar enough’ if their distance was less than 12 (where each color component ranged from 0 to 255). When two clusters were merged, the new cluster’s representative color was a weighted average of the two clusters’ representative colors. The algorithm worked pretty well for this application, check out the results:

The algorithm takes a long time even if you just focus on the top few hundred colors (a few seconds), but that’s what Web Workers are for, after all. You can check out the code here. Do you know a faster clustering algorithm? How would you quantify the distance between two colors?

Update: after getting some great feedback, I refined the distance measure to be the 3d distance between the two colors in the L*a*b* color space (the CIE76 distance measure) , thanks to the developer of FireColour for the open source L*a*b* conversion code.