Another post using the JIMP npm package, this time experimenting with several methods for comparing images to find duplication or plagiarism.
The full documentation for JIMP can be found at https://www.npmjs.com/package/jimp. I will be using three methods for comparing images:
- hash: this returns a 64-bit perceptual hash of an image. Unlike the cryptographic hashing you might be familiar with, perceptual hashes vary roughly in proportion to the differences in the input, so the hashes of similar images will also be similar.
- distance: the Hamming distance between the hashes of two images, i.e. the number of bits which differ. JIMP reports this as a fraction of the 64 bits, so the value runs from 0 (identical hashes) to 1.
- diff: the percentage difference between two images.
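To make the Hamming distance concrete, here is a minimal sketch that counts the differing bits between two binary hash strings and divides by the length. The function name normalisedHamming is my own, and this is an illustration of the idea rather than JIMP's actual implementation. The two hashes are copied from the program output shown later in this post.

```javascript
// Normalised Hamming distance between two equal-length binary strings:
// the fraction of positions at which the bits differ
// (0 = identical, 1 = every bit differs).
// normalisedHamming is a hypothetical helper, not part of JIMP.
function normalisedHamming(a, b) {
    if (a.length !== b.length) {
        throw new Error("hashes must be the same length");
    }
    let differing = 0;
    for (let i = 0; i < a.length; i++) {
        if (a[i] !== b[i]) {
            differing++;
        }
    }
    return differing / a.length;
}

// Binary hashes copied from the program output in this post.
const edinburghHash = "1101101011000010000000101100000000100101000000000000001010110000";
const londonHash = "1010100000011111010011110010101001001011001010101100011000101000";

console.log(normalisedHamming(edinburghHash, edinburghHash)); // 0
console.log(normalisedHamming(edinburghHash, londonHash));    // 0.515625 (33 of 64 bits differ)
```

The second result agrees with the distance JIMP reports for the London photo in the output.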
The JIMP documentation linked to above recommends using both distance and diff to compare images. If either is less than 0.15 then the images can be considered to be the same. The documentation claims a 99% success rate with 1% false positives.
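The recommended test can be sketched as a small function. The name isSameImage and the default parameter are my own choices and not part of JIMP; the inputs are assumed to be the values returned by Jimp.distance and by the percent property of Jimp.diff, and the numbers passed in below are purely illustrative.

```javascript
// Threshold taken from the recommendation in the JIMP documentation.
const THRESHOLD = 0.15;

// isSameImage is a hypothetical helper, not part of JIMP.
// The images count as the same if EITHER measure is below the threshold.
function isSameImage(distance, diffPercent, threshold = THRESHOLD) {
    return distance < threshold || diffPercent < threshold;
}

// Illustrative values only.
console.log(isSameImage(0, 0.35));    // true: the distance is below 0.15
console.log(isSameImage(0.5, 0.1));   // true: the diff is below 0.15
console.log(isSameImage(0.5, 0.85));  // false: neither is below 0.15

// With real images this would be called as:
// isSameImage(Jimp.distance(a, b), Jimp.diff(a, b).percent)
```

Because the two measures are combined with "or", a large pixel difference does not prevent a match as long as the hashes are close.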
However, there were a few unanswered questions in my mind about this process:
- Does it work if one of the images has been converted to black and white?
- Does it work if the images are different sizes?
- Does it work if one of the images has been slightly enhanced, for example sharpened?
- Does it work with heavy editing, for example if one image is highly pixelated?
So as not to keep you in suspense, I found that it does work well in all these cases. All four edited images had exactly the same hash as the unedited original, and therefore a Hamming distance of 0, although the percentage differences varied quite a bit. However, as long as at least one measure is less than 0.15 the images are flagged as the same according to the recommended methodology.
In this post I will show the source code used to test these cases, using the images below. There is also a completely different image which I have thrown in just to see what happens.
edinburgh_original.jpg
edinburgh_sharpened.jpg
edinburgh_bw.jpg
edinburgh_pixelized.jpg
edinburgh_small.jpg
london.jpg
The source code can be downloaded as a ZIP, or you can clone the GitHub repository if you prefer.
This is the source code.
comparingimages.js
compare();

async function compare()
{
    const Jimp = require("jimp");

    const edinburgh_original = await Jimp.read("edinburgh_original.jpg");
    const edinburgh_sharpened = await Jimp.read("edinburgh_sharpened.jpg");
    const edinburgh_bw = await Jimp.read("edinburgh_bw.jpg");
    const edinburgh_pixelized = await Jimp.read("edinburgh_pixelized.jpg");
    const edinburgh_small = await Jimp.read("edinburgh_small.jpg");
    const london = await Jimp.read("london.jpg");

    console.log("Images compared to edinburgh_original.jpg\n=========================================");
    console.log(`hash (base 64) ${edinburgh_original.hash()}`);
    console.log(`hash (binary) ${edinburgh_original.hash(2)}\n`);

    console.log("edinburgh_sharpened.jpg\n=======================");
    console.log(`hash (base 64) ${edinburgh_sharpened.hash()}`);
    console.log(`hash (binary) ${edinburgh_sharpened.hash(2)}`);
    console.log(`distance ${Jimp.distance(edinburgh_original, edinburgh_sharpened)}`);
    console.log(`diff.percent ${Jimp.diff(edinburgh_original, edinburgh_sharpened).percent}\n`);

    console.log("edinburgh_bw.jpg\n================");
    console.log(`hash (base 64) ${edinburgh_bw.hash()}`);
    console.log(`hash (binary) ${edinburgh_bw.hash(2)}`);
    console.log(`distance ${Jimp.distance(edinburgh_original, edinburgh_bw)}`);
    console.log(`diff.percent ${Jimp.diff(edinburgh_original, edinburgh_bw).percent}\n`);

    console.log("edinburgh_pixelized.jpg\n=======================");
    console.log(`hash (base 64) ${edinburgh_pixelized.hash()}`);
    console.log(`hash (binary) ${edinburgh_pixelized.hash(2)}`);
    console.log(`distance ${Jimp.distance(edinburgh_original, edinburgh_pixelized)}`);
    console.log(`diff.percent ${Jimp.diff(edinburgh_original, edinburgh_pixelized).percent}\n`);

    console.log("edinburgh_small.jpg\n===================");
    console.log(`hash (base 64) ${edinburgh_small.hash()}`);
    console.log(`hash (binary) ${edinburgh_small.hash(2)}`);
    console.log(`distance ${Jimp.distance(edinburgh_original, edinburgh_small)}`);
    console.log(`diff.percent ${Jimp.diff(edinburgh_original, edinburgh_small).percent}\n`);

    console.log("london.jpg\n==========");
    console.log(`hash (base 64) ${london.hash()}`);
    console.log(`hash (binary) ${london.hash(2)}`);
    console.log(`distance ${Jimp.distance(edinburgh_original, london)}`);
    console.log(`diff.percent ${Jimp.diff(edinburgh_original, london).percent}\n`);
}
The compare function is async as I have used await to open the images. As this is just an experiment I have omitted error handling although of course any production code interacting with the outside world, for example the file system, should handle errors.
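As a rough idea of what that error handling might look like, here is a minimal sketch of a wrapper that returns null instead of throwing when an image cannot be opened. The name readOrNull is hypothetical, and I am assuming that a failed read surfaces as a rejected promise. The read function is passed in as a parameter, so with JIMP you would supply something like path => Jimp.read(path).

```javascript
// readOrNull is a hypothetical helper, not part of JIMP.
// `read` is any function that returns a promise for an image,
// for example: path => Jimp.read(path)
async function readOrNull(read, path) {
    try {
        return await read(path);
    } catch (err) {
        // Report the problem and let the caller decide what to do with null.
        console.error(`Could not open ${path}: ${err.message}`);
        return null;
    }
}

// Usage, assuming jimp is installed:
// const Jimp = require("jimp");
// const image = await readOrNull(path => Jimp.read(path), "edinburgh_original.jpg");
// if (image === null) { /* skip this image or abort */ }
```

Returning null keeps the calling code simple for an experiment like this one; production code might prefer to rethrow with extra context.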
After the images have been opened the hash of the original image is output. When called with no argument the hash function returns a base 64 number but you can also specify a base. Here I have also printed the binary or base 2 equivalent.
The rest of the code is repetitive, calculating the hashes, distances and percentage differences between the original image and the others.
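If the repetition bothers you, the comparisons could be driven by a loop over file names instead. The following is only a sketch of that idea, not the code from the repository: compareAll is a hypothetical function, and the Jimp module is passed in as a parameter so that it relies on nothing beyond the calls already used in this post (Jimp.read, hash, Jimp.distance and Jimp.diff).

```javascript
// compareAll is a hypothetical helper, not part of JIMP.
// It compares one original image with each of the others and
// returns an array of results rather than printing them.
async function compareAll(Jimp, originalPath, otherPaths) {
    const original = await Jimp.read(originalPath);
    const results = [];

    for (const path of otherPaths) {
        const image = await Jimp.read(path);
        results.push({
            path: path,
            hash: image.hash(),
            distance: Jimp.distance(original, image),
            percent: Jimp.diff(original, image).percent
        });
    }

    return results;
}

// Usage, assuming jimp is installed and the images are in the working directory:
// compareAll(require("jimp"), "edinburgh_original.jpg",
//            ["edinburgh_sharpened.jpg", "edinburgh_bw.jpg", "london.jpg"])
//     .then(results => console.table(results));
```

Returning the results as data rather than printing them makes it easier to add further images or to apply the 0.15 test afterwards.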
The functions used here are relatively resource-intensive and running this program with even six small photos takes 2-3 seconds. Bear this in mind if you happen to be writing any code to compare large numbers of images.
Now let's run the code.
Run
node comparingimages.js
Program output
Images compared to edinburgh_original.jpg
=========================================
hash (base 64) dH20I0B00aM
hash (binary) 1101101011000010000000101100000000100101000000000000001010110000

edinburgh_sharpened.jpg
=======================
hash (base 64) dH20I0B00aM
hash (binary) 1101101011000010000000101100000000100101000000000000001010110000
distance 0
diff.percent 0.08049583333333334

edinburgh_bw.jpg
================
hash (base 64) dH20I0B00aM
hash (binary) 1101101011000010000000101100000000100101000000000000001010110000
distance 0
diff.percent 0.13681666666666667

edinburgh_pixelized.jpg
=======================
hash (base 64) dH20I0B00aM
hash (binary) 1101101011000010000000101100000000100101000000000000001010110000
distance 0
diff.percent 0.25950833333333334

edinburgh_small.jpg
===================
hash (base 64) dH20I0B00aM
hash (binary) 1101101011000010000000101100000000100101000000000000001010110000
distance 0
diff.percent 0.34801666666666664

london.jpg
==========
hash (base 64) awvjOFbaIoE
hash (binary) 1010100000011111010011110010101001001011001010101100011000101000
distance 0.515625
diff.percent 0.8483791666666667
As I mentioned above, the hashes of the Edinburgh photos are all identical and so the Hamming distances are all 0, although the percentage differences get progressively larger. Note that "percent" is misleading; these numbers are actually fractions between 0 and 1 so, for example, 0.5 = 50%.
Not surprisingly the London photo is very different by all measures.