Humans can judge similarity between things rather quickly. When we use computers to determine similarity, however, we need to consider what we mean by similarity. Computers can only compare information represented in bits. Using the picture below as an example, a computer will pick up irrelevant information, such as the fact that it is an image of 294x221 pixels, rather than saying that it is an apple.
We therefore investigate Information-Theoretic approaches to measuring similarity. In particular, Vitanyi et al. have proposed an Information-Theoretic Universal Similarity Metric (USM). We consider issues such as whether a similarity measure should be universal and whether it should be a metric. Information content can be estimated via Data Compression.
Vitanyi’s USM uses universal Data Compressors to estimate information content. We explore USM, and variations of it, by applying it to the specific problem domain of documents.
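As a rough illustration of estimating similarity through compression, the sketch below computes the normalized compression distance, a widely used compressor-based approximation of this kind of measure. It is not taken from this work: the choice of zlib as the compressor, the function names compressed_size and ncd, and the sample documents are all assumptions made for the example.

```python
import zlib


def compressed_size(data: bytes, compress=zlib.compress) -> int:
    """Estimate information content as the length of the compressed data."""
    return len(compress(data))


def ncd(x: bytes, y: bytes, compress=zlib.compress) -> float:
    """Normalized compression distance between x and y.

    Assumed formula: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size under the chosen compressor.
    Values near 0 suggest high similarity, values near 1 little shared information.
    """
    cx = compressed_size(x, compress)
    cy = compressed_size(y, compress)
    cxy = compressed_size(x + y, compress)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical sample documents, invented for this illustration.
doc_a = (b"The apple is a sweet, edible fruit produced by an apple tree. "
         b"Apple trees are cultivated worldwide and are the most widely "
         b"grown species in their genus. The tree originated in Central "
         b"Asia, where its wild ancestor is still found today.")
doc_b = (b"The apple is a sweet, edible fruit produced by an apple tree. "
         b"Apple trees are grown worldwide and are the most widely "
         b"planted species in their genus. The tree originated in Central "
         b"Asia, where its wild relative is still found today.")
doc_c = (b"A data compressor encodes a message using fewer bits than the "
         b"original representation. Lossless schemes build a statistical "
         b"model of the input and assign short codes to frequent symbols, "
         b"so that decoding recovers every byte exactly.")

print("near-duplicate:", ncd(doc_a, doc_b))
print("unrelated:     ", ncd(doc_a, doc_c))
```

Because a real compressor may give different sizes for x followed by y than for y followed by x, ncd(x, y) and ncd(y, x) need not agree exactly. Any other bytes-to-bytes compressor, such as bz2.compress, can be passed as the compress argument to compare how the compressor's model changes the result.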
We found that the best data compressor for measuring USM is the one that takes the most advantage of common information, and that the measure is not always symmetric. We also found that the concepts of Degree of Similarity and Highest Overlap can help determine different properties of the things whose similarity we measure. The issue of self-similarity is investigated in relation to how the compressor’s model affects the USM.