Getting the Color Scheme of a Website Using Canvas and Hierarchical Clustering

Update: brief update, I think there are better (faster) ways to do this. Check color quantization if you don’t care about preserving the exact color values.

For a Firefox addon I’m working on I wanted to grab the color scheme of whatever website the user is viewing. There are a few extensions that do something like this: Colorzilla and Palette Grabber get the colors used by the website’s DOM, and ColorSuckr gets the most common colors from an image. The problem with getting the colors from the DOM is that websites use images and gradients so sometimes you can’t get the overall color scheme just from looking at the CSS colors.

Luckily, you can capture the color of every pixel on a webpage from an extension using the HTML 5 standard canvas. You can draw any image onto a canvas and get an array of pixel values used by the image with the getImageData function. In Firefox, you can actually draw the entire webpage onto a canvas using the drawWindow function (drawWindow is a Firefox-only canvas extension, but you can at least use drawImage in other browsers).

So getting the number of occurrences of each color is as simple as drawing the page to a canvas, looping through each pixel value in the array returned by getImageData, and tallying each color’s frequency in a javascript hash. This is what you get when performing this analysis on the Twitter homepage:

So you can get the colors that occurred most on the page pretty easily, there is however, one big problem with this. On images and gradients, there are areas of very similar colors that might as well be the same color in the context of the overall theme. As I’ve found out, there are usually over 10,000 different colors on each webpage, so these colors need to be grouped together.

This kind of problem is called clustering, and it’s a problem that comes up a lot in image analysis, computational biology, and other computational disciplines. There are two common clustering algorithms, k-means and hierarchical clustering. K-means can be faster than hierarchical clustering, but the problem with k-means is that you have to know what k is before you even start — you have to know exactly how many clusters you want to end up with. That can’t be determined in this situation, so hierarchical clustering is the best bet.

The premise of hierarchical clustering is simple. Each color starts out in its own cluster. On each pass of the algorithm, the two clusters that are most similar (according to a metric you define yourself) are merged. You keep on doing this until there are no more clusters that are similar enough to merge.

I defined the distance of two colors to be the maximum difference between the two colors’ red, green, and blue components. Two colors were ‘similar enough’ if their distance was less than 12 (where each color component ranged from 0 to 255). When two clusters were merged, the new cluster’s representative color was a weighted average of the two clusters’ representative colors. The algorithm worked pretty well for this application, check out the results:

The algorithm takes a long time even if you just focus on the top few hundred colors (a few seconds), but that’s what Web Workers are for, after all. You can check out the code here. Do you know a faster clustering algorithm? How would you quantify the distance between two colors?

Update: after getting some great feedback, I refined the distance measure to be the 3d distance between the two colors in the L*a*b* color space (the CIE76 distance measure) , thanks to the developer of FireColour for the open source L*a*b* conversion code.

25 thoughts on “Getting the Color Scheme of a Website Using Canvas and Hierarchical Clustering

  1. As to speeding up the algorithm – the main problem is that you are merging only two clusters in each pass but you recalculate all the distances every time, that’s very inefficient. Very trivial enhancement:

    * Initialize your clusters as you already do
    * Calculate the difference for all clusters and put all triples [cluster1, clusters2, distance] into an array.
    * Sort that array by distance.
    * Create a mapping clusterNumber => clusterNumber, initially each cluster is mapped to itself.
    * Go through the triples array starting with the lowest distance until you hit your distance threshold. For each triple set mapping[cluster1] = mapping[cluster2].
    * Now you still have to account for the cases where you mapped one cluster to another but that cluster was remapped itself. So you need to go through all entries in mapping doing |if (mapping[clusterNumber] != mapping[mapping[clusterNumber]]) mapping[clusterNumber] = mapping[mapping[clusterNumber]]| until there are no more changes.
    * mapping now tells you which clusters to merge – you can merge them all in one step.

    Note that this isn’t entirely equivalent to the algorithm you are using now because you are looking at the difference to the mean color once you merged two clusters. But in reality this algorithm should be good enough.

    1. What Wladimir suggested seems like good advice – although I’d actually only store the cluster pairs where distance < threshold + 1 anyway

      Looking at your code, there's some other code that seems to run for every pixel or so and has comments about needing to be tweaked – did you profile this and prove the merging is actually the weak point of your algorithm?

      You could also choose to drawWindow at half the resolution of the original page. I doubt you'll lose much in terms of prevalent colours, and your search space will be 1/4 of its current size. I suppose it might aggrevate the anti-aliasing issue though… you'd have to experiment with that. 🙂

      In general, using .map and .every and other native JS ways to do something for every element of an array may or may not be faster than a standard for loop… you'd have to check.

      Minor tweaks: it used to be that ternary expressions were much slower than actual if/else statements. You'd want to test the current truth of that statement before redoing your code, but there you are… 🙂

      1. Great suggestions, I’ll be trying out something like Wladimir suggested for sure. I did profile the code (and by profile I mean it (-: ) across the various steps.

        drawWindow and getImageData take hardly any time, unless you’re using a really large hi-res screen. The hacked-up anti-aliasing check adds a couple seconds on to getting all the pixels for the page. I’m focusing on optimizing the clustering because right now I can only cluster the top couple hundred colors, so I’m missing the entire tail end of colors on the page. For a lot of websites this doesn’t matter, but for some it will miss groups of colors because of this.

  2. Very nice! I wanted to try doing something like this for Chromatabs way back, but never sat down to think about how.

    RGB isn’t a very good colorspace to work in for, you might try converting to HSV and try clustering on variations there… Wladimir’s link probably does something more advanced, but it makes my head hurt. 🙂

  3. @Justin Dolske: Well, it’s not *that* much more complicated 🙂 Unless of course you try CIEDE2000 variant which seems to be the most “correct” one. The main problem is converting to L*C*h color space, Wikipedia “helpfully” provides no formula for that – only stating that your data needs to be in sRGB color space (which IMO it is, color corrections are already applied to the data you get from the canvas).

  4. Wladimir, thanks for looking at my code! I re-computed the distances because I figured that the distance measure I used would be almost as fast as grabbing a cached value when the indexes into the matrix is three floating point values from averaging, but I didn’t test that at all, and I could use something other than mean here, but getting rid of the precision seemed to make it significantly less accurate.

    As for you algorithm, I think, but I’m not sure, that this would lose a lot of accuracy, and you’d have to re-sort on every pass so there might be a tradeoff there. But you’re right in that it might be just as accurate for this particular use case, it’s worth a try.

    Dolske, you’re right, doing something based on HSV would be a lot better, and the Wikipedia algorithms would be even better, they just look slow, but if I go for a distance matrix then that wouldn’t matter.

    Lots of stuff to try now!

  5. When you read a colour from the canvas, is it corrected back to sRGB? the images you get from right clicking and saving are untagged and in the monitor profile.

    1. I haven’t checked to see whether it stays on trace, I’d be really interested in learning how I could determine that. Yeah, the add-on has it, under Tools>Rainbow>Website Analyzer.

      1. I think at the moment the only good way to tell is with a debug build… I’ll try to take a look early next week.

  6. I believe colorzilla did in fact look at images (as in, it shipped an XPCOM component for the sole purpose of doing screen captures of the window because it predated canvas/drawWindow). I haven’t checked it since Mozilla 1.7-ish, though, so I have no idea if that’s still true.

  7. It is an awesome idea indeed. I need to have some color used in the page i would view. I am looking for a toll as like Photoshop color picker tools. If you have more easy idea to do that, please share with us. Thank you

Leave a Reply to Alex Cancel reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s