The "Universal Similarity Metric"
This similarity measure was proposed by Vitányi et al. It compares the information content of two files with that of a concatenated
version of the two files. The more the two files have in common, the fewer extra bits the concatenated version needs, so the difference in size tells us how similar they are.
The USM can be applied on any computer to measure how similar two files are, using nothing more than a file compressor and a calculator:
Quick and easy to follow steps
- You have two files; let's call them File A and File B.
- Copy File A to a new file, open it, go to the very end and paste the contents of File B there. You now have File AB.
- Pull up any compressor you have (gzip, WinRAR, 7-Zip). 7-Zip and gzip work best; WinRAR and WinZip still do the trick, but don't use bzip2.
- Compress File A, File B and File AB.
- Note down the sizes of all three compressed files.
- Of the compressed versions of File A and File B, note which one is smaller and which one is bigger.
- Open up a calculator and do the following math:
- (size of compressed File AB - size of smaller compressed file) / (size of bigger compressed file)
- And there you have it! (if you get anything below 0.5, it's probably "similar" according to us :-) )
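For anyone who would rather not do this by hand, the steps above can be sketched in a few lines of Python. This is only a minimal sketch, not the implementation from the thesis: the function names `compressed_size` and `usm` are made up for illustration, and it assumes gzip at its highest compression level as the compressor.

```python
import gzip


def compressed_size(data: bytes) -> int:
    """Return the size in bytes of the gzip-compressed data."""
    # Assumption: gzip at level 9 stands in for "any compressor you have".
    return len(gzip.compress(data, compresslevel=9))


def usm(a: bytes, b: bytes) -> float:
    """Follow the manual steps: compress A, B and AB, then do the math."""
    size_a = compressed_size(a)
    size_b = compressed_size(b)
    size_ab = compressed_size(a + b)  # File AB is File B pasted after File A
    smaller, bigger = sorted((size_a, size_b))
    return (size_ab - smaller) / bigger
```

To compare two files on disk, read each one as raw bytes first, for example with `pathlib.Path("a.txt").read_bytes()` (the file name here is just a placeholder), and pass the two results to `usm`.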
Issues with the USM
There are some issues present with the measure:
- In some cases, swapping File A and File B gives a different value, so which value do we trust?
- The measure always says two files of different types (e.g. a PDF and a Word document) are very different.
- I've said that anything below 0.5 is "similar" according to us, but is this true?
- Which compressor should I use, and is there any reason why bzip2 shouldn't be used?
- When comparing a file with itself, you don't get the most-similar value (i.e. zero).
These issues will be discussed thoroughly in my thesis, where some solutions are proposed for how to fix them.
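Some of these issues are easy to poke at yourself. The sketch below is a rough probe, not a method from the thesis: it computes the measure with the files in both orders and with File A against itself, once per compressor. The names `usm`, `probe` and `COMPRESSORS` are invented for this example, and Python's built-in `gzip`, `bz2` and `lzma` modules are assumed as stand-ins for the desktop tools mentioned earlier.

```python
import bz2
import gzip
import lzma

# Assumption: these standard-library compressors stand in for gzip, bzip2 and 7-Zip.
COMPRESSORS = {
    "gzip": gzip.compress,
    "bzip2": bz2.compress,
    "lzma": lzma.compress,
}


def usm(a: bytes, b: bytes, compress=gzip.compress) -> float:
    """(compressed AB - smaller compressed file) / (bigger compressed file)."""
    size_a, size_b = len(compress(a)), len(compress(b))
    size_ab = len(compress(a + b))
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


def probe(a: bytes, b: bytes) -> dict:
    """For each compressor, report the value in both orders and for A against itself."""
    report = {}
    for name, compress in COMPRESSORS.items():
        report[name] = {
            "a_then_b": usm(a, b, compress),       # File AB
            "b_then_a": usm(b, a, compress),       # File BA, the swapped order
            "a_with_itself": usm(a, a, compress),  # ideally zero
        }
    return report
```

Printing `probe(a, b)` for a few pairs of files shows how far apart the two orderings are and how far from zero the self-comparison lands for each compressor. What counts as an acceptable gap is left open here.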