SimRank

SimRank is a general similarity measure based on a simple and intuitive graph-theoretic model.
It is applicable in any domain with object-to-object relationships, and it measures the similarity of the structural context in which objects occur, based on their relationships with other objects.
Effectively, SimRank is a measure that says "two objects are considered to be similar if they are referenced by similar objects." Although SimRank is widely adopted, it may output unreasonable similarity scores, which are influenced by several factors. These problems can be addressed in several ways, such as introducing an evidence weight factor, inserting additional terms that are neglected by SimRank, or using PageRank-based alternatives.

Introduction

Many applications require a measure of "similarity" between objects.
One obvious example is the "find-similar-document" query,
on traditional text corpora or the World-Wide Web.
More generally, a similarity measure can be used to cluster objects, such as for collaborative filtering in a recommender system, in which “similar” users and items are grouped based on the users’ preferences.
Various aspects of objects can be used to determine similarity, usually depending on the domain and the appropriate definition of similarity for that domain.
In a document corpus, matching text may be used, and for collaborative filtering, similar users may be identified by common preferences.
SimRank is a general approach that exploits the object-to-object relationships found in many domains of interest.
On the Web, for example, two pages are related if there are hyperlinks between them.
A similar approach can be applied to scientific papers and their citations, or to any other document corpus with cross-reference information.
In the case of recommender systems, a user’s preference for an item constitutes a relationship between the user and the item.
Such domains are naturally modeled as graphs, with nodes representing objects and edges representing relationships.
The intuition behind the SimRank algorithm is that, in many domains, similar objects are referenced by similar objects.
More precisely, objects $a$ and $b$ are considered to be similar if they are pointed to from objects $c$ and $d$, respectively, and $c$ and $d$ are themselves similar.
The base case is that objects are maximally similar to themselves.
It is important to note that SimRank is a general algorithm that determines only the similarity of structural context.
SimRank applies to any domain where there are enough relevant relationships between objects to base at least some notion of similarity on relationships.
Similarity of other domain-specific aspects is important as well; these can, and should, be combined with relational structural-context similarity for an overall similarity measure.
For example, for Web pages SimRank can be combined with traditional textual similarity; the same idea applies to scientific papers or other document corpora.
For recommendation systems, there may be built-in known similarities between items, as well as similarities between users.
Again, these similarities can be combined with the similarity scores that are computed based on preference patterns, in order to produce an overall similarity measure.

Basic SimRank equation

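For a node $v$ in a directed graph, we denote by $I(v)$ and $O(v)$ the sets of in-neighbors and out-neighbors of $v$, respectively.
Individual in-neighbors are denoted as $I_i(v)$, for $1 \le i \le |I(v)|$, and individual
out-neighbors are denoted as $O_i(v)$, for $1 \le i \le |O(v)|$.
Let us denote the similarity between objects $a$ and $b$ by $s(a,b) \in [0,1]$.
Following the earlier motivation, a recursive equation is written for $s(a,b)$.
If $a = b$ then $s(a,b)$ is defined to be $1$.
Otherwise,
$$s(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s(I_i(a), I_j(b))$$
where $C$ is a constant between $0$ and $1$.
A slight technicality here is that either $a$ or $b$ may not have any in-neighbors.
Since there is no way to infer any similarity between $a$ and $b$ in this case, similarity is set to $s(a,b) = 0$, so the summation in the above equation is defined to be $0$ when $I(a) = \emptyset$ or $I(b) = \emptyset$.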
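
As a small worked instance of the recursion, consider a hypothetical graph (chosen here for illustration) in which a single node $c$ points to both $a$ and $b$ and these are their only incoming edges. Writing $I(v)$ for the in-neighbor set of $v$ and $s(a,b)$ for the similarity score, we have $I(a) = I(b) = \{c\}$, and the equation reduces to:

```latex
s(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{1} \sum_{j=1}^{1} s(I_i(a), I_j(b))
       = \frac{C}{1 \cdot 1}\, s(c,c)
       = C
```

Two distinct nodes that share their only in-neighbor therefore score exactly $C$, which illustrates how the constant acts as a decay: each step away from a common source multiplies the score by a factor smaller than $1$.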

Matrix representation of SimRank

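Let $\mathbf{S}$ be the similarity matrix whose entry $[\mathbf{S}]_{a,b}$ denotes the similarity score $s(a,b)$, and let $\mathbf{A}$ be the column-normalized adjacency matrix whose entry $[\mathbf{A}]_{a,b} = \frac{1}{|I(b)|}$ if there is an edge from $a$ to $b$, and $0$ otherwise. Then, in matrix notation, SimRank can be formulated as
$$\mathbf{S} = \max\{C \cdot (\mathbf{A}^T \cdot \mathbf{S} \cdot \mathbf{A}),\ \mathbf{I}\}$$
where $\mathbf{I}$ is an identity matrix.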
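
The matrix form can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it reads the maximum as an element-wise maximum, iterates a fixed number of times starting from the identity, and assumes an edge list without duplicates. The function names, the toy graph, and the default values of `C` and `iterations` are all choices made for this example.

```python
def column_normalized_adjacency(n, edges):
    """Build A with A[a][b] = 1/|I(b)| if there is an edge a -> b, else 0.

    Assumes nodes are numbered 0..n-1 and `edges` has no duplicates.
    """
    in_degree = [0] * n
    for _, b in edges:
        in_degree[b] += 1
    A = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        A[a][b] = 1.0 / in_degree[b]
    return A


def transpose(X):
    return [list(col) for col in zip(*X)]


def matmul(X, Y):
    # Plain triple-loop matrix product written with comprehensions.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def simrank_matrix(A, C=0.8, iterations=5):
    """Iterate S <- max(C * A^T S A, I) element-wise, starting from S = I."""
    n = len(A)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    S = identity
    At = transpose(A)
    for _ in range(iterations):
        M = matmul(matmul(At, S), A)
        S = [[max(C * M[i][j], identity[i][j]) for j in range(n)] for i in range(n)]
    return S


# Hypothetical toy graph on nodes 0, 1, 2 with edges 2 -> 0, 2 -> 1, 0 -> 2.
edges = [(2, 0), (2, 1), (0, 2)]
S = simrank_matrix(column_normalized_adjacency(3, edges))
```

In this toy graph, nodes 0 and 1 share their only in-neighbor (node 2), so `S[0][1]` equals the decay constant, while the element-wise maximum with the identity keeps every diagonal entry at 1.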

Computing SimRank

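A solution to the SimRank equations for a graph $G$ can be reached by iteration to a fixed point.
Let $n$ be the number of nodes in $G$.
For each iteration $k$, we can keep $n^2$ entries $R_k(*,*)$, where $R_k(a,b)$ gives the score between $a$ and $b$ on iteration $k$.
We successively compute $R_{k+1}(*,*)$ based on $R_k(*,*)$.
We start with $R_0(*,*)$, where each $R_0(a,b)$ is a lower bound on the actual SimRank score $s(a,b)$:
$$R_0(a,b) = \begin{cases} 1, & \text{if } a = b, \\ 0, & \text{if } a \neq b. \end{cases}$$
To compute $R_{k+1}(a,b)$ from $R_k(*,*)$, we use the basic SimRank equation to get:
$$R_{k+1}(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} R_k(I_i(a), I_j(b))$$
for $a \neq b$, and $R_{k+1}(a,b) = 1$ for $a = b$.
That is, on each iteration $k+1$, we update the similarity of $(a,b)$ using the similarity scores of the neighbours of $(a,b)$ from the previous iteration $k$ according to the basic SimRank equation.
The values $R_k(*,*)$ are nondecreasing as $k$ increases.
It has been shown that these values converge to limits satisfying the basic SimRank equation, the SimRank scores $s(*,*)$, i.e., for all nodes $a$ and $b$ in $G$, $\lim_{k \to \infty} R_k(a,b) = s(a,b)$.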
The original SimRank proposal suggested choosing the decay factor $C = 0.8$ and a fixed number $K = 5$ of iterations to perform.
However, more recent research showed that these values of $C$ and $K$ generally imply relatively low accuracy of the iteratively computed SimRank scores.
For guaranteeing more accurate computation results, the latter paper suggests either using a smaller decay factor or taking more iterations.
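
The fixed-point iteration described in this section can be sketched as follows. This is a naive illustration that recomputes every node pair on every iteration; the function name, the dictionary-of-in-neighbors input format, the toy graph, and the default values of `C` and `K` are assumptions made for the example.

```python
from itertools import product


def simrank(in_neighbors, C=0.8, K=5):
    """Naive iteration of the basic SimRank equation.

    `in_neighbors` maps every node to the list of its in-neighbors.
    Returns a dict mapping each ordered pair (a, b) to its score after K iterations.
    """
    nodes = list(in_neighbors)
    # Iteration 0: score 1 on the diagonal, 0 everywhere else.
    R = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(K):
        R_next = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                R_next[(a, b)] = 1.0
                continue
            Ia, Ib = in_neighbors[a], in_neighbors[b]
            if not Ia or not Ib:
                # No in-neighbors on one side: the score is defined to be 0.
                R_next[(a, b)] = 0.0
                continue
            total = sum(R[(i, j)] for i in Ia for j in Ib)
            R_next[(a, b)] = C * total / (len(Ia) * len(Ib))
        R = R_next
    return R


# Hypothetical toy graph with edges c -> a, c -> b, a -> c.
graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
scores = simrank(graph)
```

Each iteration reads only from the previous iteration's scores (`R`) and writes into a fresh dictionary (`R_next`), so updates within one iteration do not influence each other.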

CoSimRank

CoSimRank is a variant of SimRank with the advantage of also having a local formulation, i.e., CoSimRank can be computed for a single node pair. Let $\mathbf{S}$ be the similarity matrix whose entry $[\mathbf{S}]_{a,b}$ denotes the similarity score $s(a,b)$, and let $\mathbf{A}$ be the column-normalized adjacency matrix. Then, in matrix notation, CoSimRank can be formulated as:
$$\mathbf{S} = C \cdot (\mathbf{A}^T \cdot \mathbf{S} \cdot \mathbf{A}) + \mathbf{I}$$
where $\mathbf{I}$ is an identity matrix. To compute the similarity score of only a single node pair, let $p^{(0)}(i) = e_i$, with $e_i$ being a vector of the
standard basis, i.e., the $i$-th entry is 1 and all other entries are 0. Then, CoSimRank can be computed in two steps:

1. $p^{(k)}(i) = \mathbf{A}\, p^{(k-1)}(i)$
2. $s(i,j) = \sum_{k=0}^{\infty} C^k \langle p^{(k)}(i), p^{(k)}(j) \rangle$

Step one can be seen as a simplified version of Personalized PageRank. Step two sums up the vector similarity of each iteration. Both the matrix and the local representation compute the same similarity score. CoSimRank can also be used to compute the similarity of sets of nodes, by modifying $p^{(0)}(i)$.
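
The local two-step computation might be sketched as below. Since the sum over iterations is infinite, this example truncates it after a fixed number of terms, which is an assumption of the sketch rather than part of the definition. The function name, the toy matrix, and the default values of `C` and `iterations` are likewise illustrative.

```python
def cosimrank_pair(A, i, j, C=0.8, iterations=20):
    """Truncated local CoSimRank score for the node pair (i, j).

    `A` is a column-normalized adjacency matrix given as a list of rows.
    The sum over k is cut off after `iterations` propagation steps.
    """
    n = len(A)
    # p^(0): standard basis vectors for the two nodes.
    p_i = [1.0 if r == i else 0.0 for r in range(n)]
    p_j = [1.0 if r == j else 0.0 for r in range(n)]
    score = 0.0
    for k in range(iterations + 1):
        # Step two: add the decayed inner product of the current vectors.
        score += (C ** k) * sum(x * y for x, y in zip(p_i, p_j))
        # Step one: propagate both vectors, p^(k+1) = A p^(k).
        p_i = [sum(A[r][c] * p_i[c] for c in range(n)) for r in range(n)]
        p_j = [sum(A[r][c] * p_j[c] for c in range(n)) for r in range(n)]
    return score


# Column-normalized adjacency matrix of a hypothetical toy graph on nodes 0, 1, 2
# with edges 2 -> 0, 2 -> 1, 0 -> 2 (every node has exactly one in-neighbor).
A = [
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
]
score_01 = cosimrank_pair(A, 0, 1)
score_02 = cosimrank_pair(A, 0, 2)
```

Only the two vectors for the queried pair are propagated, so no full similarity matrix is ever built. Note that, unlike SimRank, the resulting scores are not capped at 1, because the identity matrix is added rather than combined through a maximum.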

Further research on SimRank

Fogaras and Racz suggested speeding up SimRank computation through probabilistic computation using the Monte Carlo method.
Antonellis et al. extended the SimRank equations to take into account an evidence factor for incident nodes as well as link weights.
Yu et al. further improved SimRank computation via a fine-grained memoization method to share small common parts among different partial sums.
Chen and Giles discussed the limitations and proper use cases of SimRank.

Partial Sums Memoization

Lizorkin et al. proposed three optimization techniques for speeding up the computation of SimRank:

Essential nodes selection may eliminate the computation of a fraction of node pairs with a priori zero scores.
Partial sums memoization can effectively reduce repeated calculations of the similarity among different node pairs by caching part of similarity summations for later reuse.
A threshold setting on the similarity enables a further reduction in the number of node pairs to be computed.

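In particular, the second observation of partial sums memoization plays a paramount role in greatly speeding up the computation of SimRank from $O(Kd^2n^2)$ to $O(Kdn^2)$, where $K$ is the number of iterations, $d$ is the average degree of the graph, and $n$ is the number of nodes in the graph. The central idea of partial sums memoization consists of two steps.
First, with $R_k(i,j)$ denoting the score of the pair $(i,j)$ on iteration $k$, the partial sums over $I(a)$ are memoized as
$$\text{Partial}_{I(a)}^{R_k}(j) = \sum_{i \in I(a)} R_k(i,j), \qquad (\forall j \in I(b))$$
and then $R_{k+1}(a,b)$ is iteratively computed from $\text{Partial}_{I(a)}^{R_k}(j)$ as
$$R_{k+1}(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{j \in I(b)} \text{Partial}_{I(a)}^{R_k}(j).$$
Consequently, the results of $\text{Partial}_{I(a)}^{R_k}(j)$, $\forall j \in I(b)$,
can be reused later when we compute the similarities $R_{k+1}(a,*)$ for a given vertex $a$ as the first argument.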
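
The two steps can be sketched in Python as follows. This is a simplified illustration of the memoization idea only: it omits essential nodes selection and thresholding, and for brevity it memoizes the partial sum for every node `j` rather than only for those that occur as in-neighbors. The function name, the input format, the toy graph, and the default values of `C` and `K` are assumptions of the example.

```python
def simrank_partial_sums(in_neighbors, C=0.8, K=5):
    """SimRank iteration that reuses partial sums over I(a) for every b.

    `in_neighbors` maps every node to the list of its in-neighbors.
    Returns a nested dict R with R[a][b] the score after K iterations.
    """
    nodes = list(in_neighbors)
    R = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(K):
        R_next = {a: {} for a in nodes}
        for a in nodes:
            Ia = in_neighbors[a]
            # Step 1: memoize Partial(j) = sum of R[i][j] over i in I(a), once per a.
            partial = {j: sum(R[i][j] for i in Ia) for j in nodes}
            for b in nodes:
                Ib = in_neighbors[b]
                if a == b:
                    R_next[a][b] = 1.0
                elif not Ia or not Ib:
                    R_next[a][b] = 0.0
                else:
                    # Step 2: combine the memoized sums over j in I(b).
                    R_next[a][b] = C * sum(partial[j] for j in Ib) / (len(Ia) * len(Ib))
        R = R_next
    return R


# Hypothetical toy graph with edges c -> a, d -> a, c -> b.
graph = {"a": ["c", "d"], "b": ["c"], "c": [], "d": []}
R = simrank_partial_sums(graph)
```

For a fixed `a`, the inner double sum over $I(a) \times I(b)$ is replaced by a single sum over $I(b)$ of values computed once, which is where the factor of $d$ is saved.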
