PMI (especially in its positive pointwise mutual information variant) has been described as "one of the most important concepts in NLP", where it "draws on the intuition that the best way to weigh the association between two words is to ask how much more the two words co-occur in [a] corpus than we would have expected them to appear by chance."[2]
The concept was introduced in 1961 by Robert Fano under the name of "mutual information", but today that term is instead used for a related measure of dependence between random variables:[2] The mutual information (MI) of two discrete random variables refers to the average PMI of all possible events.
The PMI of a pair of outcomes x and y belonging to discrete random variables X and Y quantifies the discrepancy between the probability of their coincidence given their joint distribution and their individual distributions, assuming independence:

\operatorname{pmi}(x;y) \equiv \log_2\frac{p(x,y)}{p(x)\,p(y)} = \log_2\frac{p(x\mid y)}{p(x)} = \log_2\frac{p(y\mid x)}{p(y)}

(with the latter two expressions being equal to the first by Bayes' theorem). The mutual information (MI) of the random variables X and Y is the expected value of the PMI (over all possible outcomes).
The measure is symmetric (\operatorname{pmi}(x;y) = \operatorname{pmi}(y;x)). It can take positive or negative values, but is zero if X and Y are independent. Note that even though PMI may be negative or positive, its expected outcome over all joint events (MI) is non-negative. PMI maximizes when X and Y are perfectly associated (i.e. p(x\mid y) = 1 or p(y\mid x) = 1), yielding the following bounds:

-\infty \le \operatorname{pmi}(x;y) \le \min\bigl(-\log_2 p(x),\, -\log_2 p(y)\bigr)
Finally, \operatorname{pmi}(x;y) will increase if p(x\mid y) is fixed but p(x) decreases.
Here is an example to illustrate:
x   y   p(x, y)
0   0   0.1
0   1   0.7
1   0   0.15
1   1   0.05
Using this table we can marginalize to get the following additional table for the individual distributions:
value   p(x)   p(y)
0       0.8    0.25
1       0.2    0.75
With this example, we can compute four values for \operatorname{pmi}(x;y). Using base-2 logarithms:

\operatorname{pmi}(x=0;y=0) = \log_2\frac{0.1}{0.8 \times 0.25} = -1
\operatorname{pmi}(x=0;y=1) = \log_2\frac{0.7}{0.8 \times 0.75} \approx 0.222
\operatorname{pmi}(x=1;y=0) = \log_2\frac{0.15}{0.2 \times 0.25} \approx 1.585
\operatorname{pmi}(x=1;y=1) = \log_2\frac{0.05}{0.2 \times 0.75} \approx -1.585
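To make the computation concrete, here is a minimal Python sketch (names are illustrative) that reproduces the four PMI values above and the resulting mutual information as the expectation of PMI under the joint distribution:

```python
import math

# Joint distribution p(x, y) from the table above.
p_xy = {
    (0, 0): 0.10,
    (0, 1): 0.70,
    (1, 0): 0.15,
    (1, 1): 0.05,
}

# Marginal distributions obtained by summing out the other variable.
p_x = {x: sum(p for (xi, _), p in p_xy.items() if xi == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yi), p in p_xy.items() if yi == y) for y in (0, 1)}

def pmi(x, y):
    """Pointwise mutual information in bits (base-2 logarithm)."""
    return math.log2(p_xy[(x, y)] / (p_x[x] * p_y[y]))

for x, y in p_xy:
    print(f"pmi(x={x}; y={y}) = {pmi(x, y):+.3f}")

# MI is the expectation of PMI over all joint events; it is non-negative
# even though individual PMI values above are negative.
mi = sum(p * pmi(x, y) for (x, y), p in p_xy.items())
print(f"MI(X; Y) = {mi:.3f}")
```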
Several variations of PMI have been proposed, in particular to address what has been described as its "two main limitations":[3]
- PMI can take both positive and negative values and has no fixed bounds, which makes it harder to interpret.[3]
- PMI has "a well-known tendency to give higher scores to low-frequency events", but in applications such as measuring word similarity, it is preferable to have "a higher score for pairs of words whose relatedness is supported by more evidence."[3]
Positive PMI
The positive pointwise mutual information (PPMI) measure is defined by setting negative values of PMI to zero:[2]

\operatorname{ppmi}(x;y) \equiv \max\bigl(\operatorname{pmi}(x;y),\, 0\bigr)
This definition is motivated by the observation that "negative PMI values (which imply things are co-occurring less often than we would expect by chance) tend to be unreliable unless our corpora are enormous" and also by a concern that "it's not clear whether it's even possible to evaluate such scores of 'unrelatedness' with human judgment".[2] It also avoids having to deal with -\infty values for events that never occur together (p(x, y) = 0), by setting PPMI for these to 0.[2]
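In NLP practice, PPMI is typically applied cell-wise to a word-context co-occurrence matrix. Below is a minimal sketch of that use, assuming a toy count matrix whose words, contexts, and counts are invented purely for illustration:

```python
import numpy as np

# Toy word-context co-occurrence counts (rows: words, columns: contexts).
# All values are invented for illustration.
counts = np.array([
    [10, 0, 3],
    [ 2, 8, 1],
    [ 0, 1, 6],
], dtype=float)

total = counts.sum()
p_xy = counts / total                  # joint probabilities
p_x = p_xy.sum(axis=1, keepdims=True)  # word marginals
p_y = p_xy.sum(axis=0, keepdims=True)  # context marginals

# PMI; zero-count cells give -inf, which PPMI maps to 0 anyway.
with np.errstate(divide="ignore"):
    pmi = np.log2(p_xy / (p_x * p_y))

ppmi = np.maximum(pmi, 0)              # clip negative (and -inf) values to 0
print(np.round(ppmi, 3))
```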
Normalized pointwise mutual information (npmi)
Pointwise mutual information can be normalized between [−1, +1], resulting in −1 (in the limit) for never occurring together, 0 for independence, and +1 for complete co-occurrence:[4]

\operatorname{npmi}(x;y) = \frac{\operatorname{pmi}(x;y)}{h(x,y)}

where h(x,y) = -\log_2 p(x,y) is the joint self-information.
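A small sketch (probabilities invented for illustration) showing the three boundary behaviours of npmi:

```python
import math

def npmi(p_xy, p_x, p_y):
    """Normalized PMI: pmi(x;y) divided by the joint self-information h(x,y)."""
    pmi = math.log2(p_xy / (p_x * p_y))
    h = -math.log2(p_xy)
    return pmi / h

print(npmi(0.25, 0.5, 0.5))  # independence: p(x,y) = p(x)p(y)        -> 0.0
print(npmi(0.5, 0.5, 0.5))   # complete co-occurrence: p(x,y)=p(x)=p(y) -> 1.0
print(npmi(1e-9, 0.5, 0.5))  # almost never together -> approaches -1 in the limit
```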
PMIk family
The PMI^k measure (for k = 2, 3, etc.), which was introduced by Béatrice Daille around 1994 and as of 2011 was described as being "among the most widely used variants", is defined as[5][3]

\operatorname{pmi}^k(x;y) \equiv \log_2\frac{p(x,y)^k}{p(x)\,p(y)} = \operatorname{pmi}(x;y) + (k-1)\log_2 p(x,y)
In particular, \operatorname{pmi}^1(x;y) = \operatorname{pmi}(x;y). The additional factors of p(x,y) inside the logarithm are intended to correct the bias of PMI towards low-frequency events, by boosting the scores of frequent pairs.[3] A 2011 case study demonstrated the success of PMI^3 in correcting this bias on a corpus drawn from English Wikipedia. Taking x to be the word "football", its most strongly associated words y according to the PMI measure (i.e. those maximizing \operatorname{pmi}(x;y)) were domain-specific ("midfielder", "cornerbacks", "goalkeepers"), whereas the terms ranked most highly by PMI^3 were much more general ("league", "clubs", "england").[3]
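A toy numeric sketch of this reranking effect (the probabilities below are invented for illustration): a barely attested pair can beat a frequent pair under plain PMI, while PMI^3 favours the pair supported by more evidence.

```python
import math

def pmi_k(p_xy, p_x, p_y, k=1):
    """PMI^k: log2 of p(x,y)^k / (p(x) p(y)); k = 1 gives ordinary PMI."""
    return math.log2(p_xy**k / (p_x * p_y))

# Hypothetical pairs given as (p(x,y), p(x), p(y)); values are invented.
rare     = (1e-6, 2e-6, 2e-6)  # almost always together, but barely attested
frequent = (1e-3, 4e-3, 4e-3)  # very common pair with plenty of evidence

for name, (pxy, px, py) in [("rare", rare), ("frequent", frequent)]:
    print(f"{name:8s}  pmi = {pmi_k(pxy, px, py, 1):6.2f}"
          f"   pmi^3 = {pmi_k(pxy, px, py, 3):7.2f}")
# Under PMI the rare pair ranks higher; under PMI^3 the frequent pair wins.
```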
Specific Correlation
Total correlation is an extension of mutual information to multiple variables. Analogously to the definition of total correlation, the extension of PMI to multiple variables is "specific correlation".[6]
The specific correlation (SI) of the results of n random variables is expressed as the following:

\operatorname{SI}(x_1, x_2, \ldots, x_n) = \log_2\frac{p(x_1, x_2, \ldots, x_n)}{p(x_1)\,p(x_2)\cdots p(x_n)}
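Following this definition, here is a minimal three-variable sketch (the joint distribution is invented for illustration):

```python
import math
from itertools import product

# Joint distribution over three binary variables; values invented for illustration.
p_xyz = {t: 1 / 8 for t in product((0, 1), repeat=3)}  # start from independence
p_xyz[(1, 1, 1)] = 0.25                                 # then skew one outcome
z = sum(p_xyz.values())                                 # renormalize to sum to 1
p_xyz = {t: p / z for t, p in p_xyz.items()}

def marginal(i, v):
    """Marginal probability that variable i takes value v."""
    return sum(p for t, p in p_xyz.items() if t[i] == v)

def si(*xs):
    """Specific correlation: log2 of the joint over the product of marginals."""
    denom = math.prod(marginal(i, v) for i, v in enumerate(xs))
    return math.log2(p_xyz[xs] / denom)

print(si(1, 1, 1))  # positive: (1,1,1) co-occurs more than independence predicts
```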
Applications
PMI is used in various disciplines, e.g. in information theory, linguistics, or chemistry (in profiling and analysis of chemical compounds).[8] In computational linguistics, PMI has been used for finding collocations and associations between words. For instance, counts of occurrences and co-occurrences of words in a text corpus can be used to approximate the probabilities p(x) and p(x, y) respectively. The following table shows counts of pairs of words getting the most and the least PMI scores in the first 50 million words in Wikipedia (dump of October 2015),[citation needed] filtering by 1,000 or more co-occurrences. The frequency of each count can be obtained by dividing its value by 50,000,952. (Note: natural log is used to calculate the PMI values in this example, instead of log base 2.)
word 1   word 2      count word 1   count word 2   count of co-occurrences   PMI
puerto   rico        1938           1311           1159                      10.0349081703
hong     kong        2438           2694           2205                      9.72831972408
los      angeles     3501           2808           2791                      9.56067615065
carbon   dioxide     4265           1353           1032                      9.09852946116
prize    laureate    5131           1676           1210                      8.85870710982
san      francisco   5237           2477           1779                      8.83305176711
nobel    prize       4098           5131           2498                      8.68948811416
ice      hockey      5607           3002           1933                      8.6555759741
star     trek        8264           1594           1489                      8.63974676575
car      driver      5578           2749           1384                      8.41470768304
it       the         283891         3293296        3347                      -1.72037278119
are      of          234458         1761436        1019                      -2.09254205335
this     the         199882         3293296        1211                      -2.38612756961
is       of          565679         1761436        1562                      -2.54614706831
and      of          1375396        1761436        2949                      -2.79911817902
a        and         984442         1375396        1457                      -2.92239510038
in       and         1187652        1375396        1537                      -3.05660070757
to       and         1025659        1375396        1286                      -3.08825363041
to       in          1025659        1187652        1066                      -3.12911348956
of       and         1761436        1375396        1190                      -3.70663100173
Good collocation pairs have high PMI because the probability of co-occurrence is only slightly lower than the probabilities of occurrence of each word. Conversely, a pair of words whose probabilities of occurrence are considerably higher than their probability of co-occurrence gets a small PMI score.
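As a sanity check, the scores in the table can be reproduced directly from the counts. A minimal sketch, using the corpus size of 50,000,952 words given above and natural log, as in the table:

```python
import math

N = 50_000_952  # total number of words in the corpus sample

def pmi_from_counts(c_x, c_y, c_xy, n=N):
    """PMI (natural log) estimated from corpus counts."""
    return math.log((c_xy / n) / ((c_x / n) * (c_y / n)))

# Top and bottom rows of the table above.
print(pmi_from_counts(1938, 1311, 1159))        # puerto / rico  -> ~10.0349
print(pmi_from_counts(1761436, 1375396, 1190))  # of / and       -> ~-3.7066
```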
[6] Tim Van de Cruys. 2011. "Two Multivariate Generalizations of Pointwise Mutual Information". In Proceedings of the Workshop on Distributional Semantics and Compositionality, pages 16–20, Portland, Oregon, USA. Association for Computational Linguistics.