Comparing Images using Node and JIMP

Another post using the JIMP npm package, this time experimenting with several methods for comparing images to find duplication or plagiarism.

The full documentation for JIMP can be found at https://www.npmjs.com/package/jimp. I will be using three methods for comparing images:

  • hash: this returns a 64-bit perceptual hash of an image. Unlike the cryptographic hashing you might be familiar with, perceptual hashes vary in a way roughly proportional to the differences in input, so the hashes of similar images will also be similar.

  • distance: the Hamming distance between the hashes of two images, i.e. the number of bits which differ. JIMP reports this as a fraction of the 64 bits, so the result lies between 0 and 1.

  • diff: a pixel-by-pixel comparison of two images. Its percent property gives the proportion of pixels which differ, again as a number between 0 and 1.

The JIMP documentation linked to above recommends using both distance and diff to compare images. If either is less than 0.15 then the images can be considered to be the same. The documentation claims a 99% success rate with 1% false positives.
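The rule itself is simple enough to write as a small function. This is only a sketch of the recommendation as I understand it: the name looksLikeSameImage is my own and is not part of JIMP, and the numbers passed in below are made-up values for illustration.

```javascript
// Threshold taken from the JIMP recommendation described above.
const THRESHOLD = 0.15;

// Hypothetical helper (not part of JIMP): returns true if either
// measure falls below the threshold.
function looksLikeSameImage(distance, diffPercent, threshold = THRESHOLD) {
    return distance < threshold || diffPercent < threshold;
}

// Illustrative values only.
console.log(looksLikeSameImage(0, 0.35));    // true: the distance is under the threshold
console.log(looksLikeSameImage(0.52, 0.85)); // false: both measures are too high
```

With real images the two arguments would come from Jimp.distance(a, b) and Jimp.diff(a, b).percent.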

However, there were a few unanswered questions in my mind about this process:

  • Does it work if one of the images has been converted to black and white?

  • Does it work if the images are different sizes?

  • Does it work if one of the images has been slightly enhanced, for example sharpened?

  • Does it work with heavy editing, for example if one image is highly pixelated?

So as not to keep you in suspense, I found that it does work well in all these cases. All four edited images had exactly the same hash as the unedited original (and therefore a Hamming distance of 0 from it), although the percentage differences did vary quite a bit. However, as long as at least one measure is less than 0.15, the images are flagged as identical according to the recommended methodology.

In this post I will show the source code used to test these cases, using the images below. There is also a completely different image which I have thrown in just to see what happens.

edinburgh_original.jpg

edinburgh_sharpened.jpg

edinburgh_bw.jpg

edinburgh_pixelized.jpg

edinburgh_small.jpg

london.jpg

The source code can be downloaded as a ZIP, or you can clone the GitHub repository if you prefer.

Source Code Links

ZIP File
GitHub

This is the source code.

comparingimages.js

compare();

async function compare()
{
    const Jimp = require("jimp");

    const edinburgh_original = await Jimp.read("edinburgh_original.jpg");
    const edinburgh_sharpened = await Jimp.read("edinburgh_sharpened.jpg");
    const edinburgh_bw = await Jimp.read("edinburgh_bw.jpg");
    const edinburgh_pixelized = await Jimp.read("edinburgh_pixelized.jpg");
    const edinburgh_small = await Jimp.read("edinburgh_small.jpg");
    const london = await Jimp.read("london.jpg");

    console.log("Images compared to edinburgh_original.jpg\n=========================================");
    console.log(`hash (base 64) ${edinburgh_original.hash()}`);
    console.log(`hash (binary)  ${edinburgh_original.hash(2)}\n`);

    console.log("edinburgh_sharpened.jpg\n=======================");
    console.log(`hash (base 64) ${edinburgh_sharpened.hash()}`);
    console.log(`hash (binary)  ${edinburgh_sharpened.hash(2)}`);
    console.log(`distance       ${Jimp.distance(edinburgh_original, edinburgh_sharpened)}`);
    console.log(`diff.percent   ${Jimp.diff(edinburgh_original, edinburgh_sharpened).percent}\n`);

    console.log("edinburgh_bw.jpg\n================");
    console.log(`hash (base 64) ${edinburgh_bw.hash()}`);
    console.log(`hash (binary)  ${edinburgh_bw.hash(2)}`);
    console.log(`distance       ${Jimp.distance(edinburgh_original, edinburgh_bw)}`);
    console.log(`diff.percent   ${Jimp.diff(edinburgh_original, edinburgh_bw).percent}\n`);

    console.log("edinburgh_pixelized.jpg\n=======================");
    console.log(`hash (base 64) ${edinburgh_pixelized.hash()}`);
    console.log(`hash (binary)  ${edinburgh_pixelized.hash(2)}`);
    console.log(`distance       ${Jimp.distance(edinburgh_original, edinburgh_pixelized)}`);
    console.log(`diff.percent   ${Jimp.diff(edinburgh_original, edinburgh_pixelized).percent}\n`);

    console.log("edinburgh_small.jpg\n===================");
    console.log(`hash (base 64) ${edinburgh_small.hash()}`);
    console.log(`hash (binary)  ${edinburgh_small.hash(2)}`);
    console.log(`distance       ${Jimp.distance(edinburgh_original, edinburgh_small)}`);
    console.log(`diff.percent   ${Jimp.diff(edinburgh_original, edinburgh_small).percent}\n`);

    console.log("london.jpg\n==========");
    console.log(`hash (base 64) ${london.hash()}`);
    console.log(`hash (binary)  ${london.hash(2)}`);
    console.log(`distance       ${Jimp.distance(edinburgh_original, london)}`);
    console.log(`diff.percent   ${Jimp.diff(edinburgh_original, london).percent}\n`);
}

The compare function is async as I have used await to open the images. As this is just an experiment I have omitted error handling, although of course any production code interacting with the outside world, for example the file system, should handle errors.
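For anyone who does want basic error handling, one possible approach is to run the async function through a small wrapper. This is a minimal sketch rather than part of the project: runSafely is a hypothetical name of my own, and the message format is arbitrary.

```javascript
// Hypothetical wrapper: runs an async task and reports a failure
// (for example a missing image file) instead of letting the
// rejected promise go unhandled.
async function runSafely(task) {
    try {
        await task();
        return true;
    } catch (err) {
        console.error(`Comparison failed: ${err.message}`);
        return false;
    }
}
```

The program would then start with runSafely(compare) instead of calling compare() directly.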

After the images have been opened the hash of the original image is output. When called with no argument the hash function returns a base 64 number but you can also specify a base. Here I have also printed the binary or base 2 equivalent.

The rest of the code is repetitive, calculating the hashes, distances and percentage differences between the original image and the others.
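The repetition could be removed with a helper which builds the text for one comparison. The following is a sketch of that idea, not the code in the repository: describeComparison is a name I have made up, and the Jimp object is passed in as a parameter so that the helper only relies on the hash, distance and diff calls already used above.

```javascript
// Hypothetical helper: builds the report for one image compared
// with the original, in the same layout as the code above.
function describeComparison(jimp, original, candidate, name) {
    return [
        name,
        "=".repeat(name.length),
        `hash (base 64) ${candidate.hash()}`,
        `hash (binary)  ${candidate.hash(2)}`,
        `distance       ${jimp.distance(original, candidate)}`,
        `diff.percent   ${jimp.diff(original, candidate).percent}`,
        ""
    ].join("\n");
}

// Possible usage inside compare(), after edinburgh_original is loaded:
//
// const names = ["edinburgh_sharpened.jpg", "edinburgh_bw.jpg",
//                "edinburgh_pixelized.jpg", "edinburgh_small.jpg",
//                "london.jpg"];
// for (const name of names) {
//     const candidate = await Jimp.read(name);
//     console.log(describeComparison(Jimp, edinburgh_original, candidate, name));
// }
```

I have kept the long-hand version in this post because it makes each step easy to follow.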

The functions used here are relatively resource-intensive and running this program with even six small photos takes 2-3 seconds. Bear this in mind if you happen to be writing any code to compare large numbers of images.

Now let's run the code.

Run

node comparingimages.js

Program output

Images compared to edinburgh_original.jpg
=========================================
hash (base 64) dH20I0B00aM
hash (binary)  1101101011000010000000101100000000100101000000000000001010110000

edinburgh_sharpened.jpg
=======================
hash (base 64) dH20I0B00aM
hash (binary)  1101101011000010000000101100000000100101000000000000001010110000
distance       0
diff.percent   0.08049583333333334

edinburgh_bw.jpg
================
hash (base 64) dH20I0B00aM
hash (binary)  1101101011000010000000101100000000100101000000000000001010110000
distance       0
diff.percent   0.13681666666666667

edinburgh_pixelized.jpg
=======================
hash (base 64) dH20I0B00aM
hash (binary)  1101101011000010000000101100000000100101000000000000001010110000
distance       0
diff.percent   0.25950833333333334

edinburgh_small.jpg
===================
hash (base 64) dH20I0B00aM
hash (binary)  1101101011000010000000101100000000100101000000000000001010110000
distance       0
diff.percent   0.34801666666666664

london.jpg
==========
hash (base 64) awvjOFbaIoE
hash (binary)  1010100000011111010011110010101001001011001010101100011000101000
distance       0.515625
diff.percent   0.8483791666666667

As I mentioned above, the hashes of the Edinburgh photos are identical and their Hamming distances are all 0, although the percentage differences vary from about 0.08 to about 0.35. Note that "percent" is misleading; these numbers are actually fractions between 0 and 1 so, for example, 0.5 = 50%.

Not surprisingly, the London photo is very different by all measures, with both values well above the 0.15 threshold.
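The London distance of 0.515625 can be reproduced by hand from the two binary hashes printed above: 33 of the 64 bits differ, and 33 / 64 = 0.515625. The sketch below does that calculation. The function normalizedHamming is my own and is only meant to illustrate the arithmetic; it is not how JIMP is implemented internally.

```javascript
// Binary hashes copied from the program output above.
const edinburgh = "1101101011000010000000101100000000100101000000000000001010110000";
const london    = "1010100000011111010011110010101001001011001010101100011000101000";

// Hypothetical helper: counts the positions where two bit strings
// differ, then divides by the length to give a value between 0 and 1.
function normalizedHamming(a, b) {
    if (a.length !== b.length) {
        throw new Error("hashes must be the same length");
    }
    let differing = 0;
    for (let i = 0; i < a.length; i++) {
        if (a[i] !== b[i]) {
            differing++;
        }
    }
    return differing / a.length;
}

console.log(normalizedHamming(edinburgh, london)); // 0.515625, i.e. 33 of 64 bits differ
```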