This post is a summary of the work [1] of Jiwei Li et al.

Problem

A Seq2Seq model trained with the maximum likelihood estimation (MLE) objective tends to generate safe, commonplace responses such as "I don’t know", regardless of the input (Table 1).

Table 1: Responses generated by a standard SEQ2SEQ neural model trained on the OpenSubtitles dataset (60 million conversation pairs). Decoding is implemented with beam size set to 200.

| Input (message) | Output (response) | log(p) |
|---|---|---|
| What are you doing? | I don’t know. | -0.862 |
| | Nothing. | -1.062 |
| | I’m talking to you. | -1.100 |
| What is your name? | I don’t know. | -0.913 |
| | Oh, my god! | -0.974 |
| | My name is Robert. | -1.553 |
| How old are you? | I don’t know. | -0.794 |
| | I’m not sure. | -1.170 |
| | Twenty-five. | -1.637 |

Contribution

• Apply Maximum Mutual Information (MMI) as the decoding objective of a Seq2Seq model, in place of maximum likelihood.

Motivation

Li et al. argue that a response depends on its message and, conversely, that the message depends on the response. MMI is proposed to capture this mutual dependence.

Pointwise Mutual Information (PMI)

PMI quantifies the discrepancy between the probability that a pair of outcomes co-occur under their joint distribution and the probability expected from their individual distributions if they were independent [2]. Mathematically,
$$pmi(x,y) = \log\frac{p(x,y)}{p(x)p(y)} = \log\frac{p(x|y)}{p(x)}$$
If X and Y are independent, p(x,y) = p(x)p(y) and pmi(x,y) is 0. With p(x) fixed, pmi(x,y) increases as p(x|y) goes up.
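As a small numeric illustration of the definition, the sketch below computes PMI from a toy joint distribution. The distribution `joint` and the helper names `marginal` and `pmi` are made up for this example and do not come from the paper.

```python
import math

# Hypothetical joint distribution p(x, y) over x in {a, b} and y in {c, d}.
joint = {
    ("a", "c"): 0.4,
    ("a", "d"): 0.1,
    ("b", "c"): 0.1,
    ("b", "d"): 0.4,
}

def marginal(joint, axis):
    """Marginal distribution of one variable (axis 0 for x, axis 1 for y)."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

def pmi(joint, x, y):
    """pmi(x, y) = log( p(x, y) / (p(x) p(y)) ), using the natural log."""
    px = marginal(joint, 0)[x]
    py = marginal(joint, 1)[y]
    return math.log(joint[(x, y)] / (px * py))

pmi_ac = pmi(joint, "a", "c")  # pair co-occurs more often than under independence
pmi_ad = pmi(joint, "a", "d")  # pair co-occurs less often than under independence
```

Here `pmi_ac` is positive and `pmi_ad` is negative, which matches the reading of PMI as a measure of how far a pair deviates from independence.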

Mutual Information (MI)

The mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables [3]. Mathematically,
$$mi(X,Y) = \sum_{x \in X}\sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$$
where mi(X,Y) is nonnegative and symmetric. It is the expected value of pmi(x,y) under the joint distribution p(x,y).
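The two stated properties can be checked numerically. The sketch below is only an illustration on made-up toy distributions (`dependent`, `independent`); the function name `mutual_information` is an assumption of this example.

```python
import math

def mutual_information(joint):
    """mi(X, Y) = sum over (x, y) of p(x, y) * log( p(x, y) / (p(x) p(y)) )."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    total = 0.0
    for (x, y), p in joint.items():
        if p > 0:  # pairs with zero probability contribute nothing
            total += p * math.log(p / (px[x] * py[y]))
    return total

# Hypothetical toy distributions.
dependent = {("a", "c"): 0.4, ("a", "d"): 0.1, ("b", "c"): 0.1, ("b", "d"): 0.4}
independent = {("a", "c"): 0.25, ("a", "d"): 0.25, ("b", "c"): 0.25, ("b", "d"): 0.25}
swapped = {(y, x): p for (x, y), p in dependent.items()}  # roles of X and Y exchanged

mi_dep = mutual_information(dependent)    # positive: X and Y are dependent
mi_ind = mutual_information(independent)  # zero: X and Y are independent
mi_swapped = mutual_information(swapped)  # same as mi_dep: MI is symmetric
```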

MMI in Seq2Seq

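Let S denote a message sequence
$$S=\{s_1,s_2,\ldots,s_N\}$$
where N is the number of words in the sequence. Similarly, T denotes a response sequence
$$T=\{t_1,t_2,\ldots,t_M,\mathrm{EOS}\}$$
where M is the number of words in the response and EOS marks the end of the sentence.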
The standard objective function of a Seq2Seq model is the log-likelihood, which leads to overly generic responses at test time:
$$\hat{T} = \mathop{\arg\max}_T{\log p(T|S)}$$
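The MMI objective instead maximizes the pointwise mutual information between S and T:
$$\hat{T} = \mathop{\arg\max}_T{\log\frac{p(S,T)}{p(S)p(T)}} = \mathop{\arg\max}_T{\log p(T|S) - \log p(T)}$$
Adding a weight λ to the prior term gives the generalized form
$$\hat{T} = \mathop{\arg\max}_T{\log p(T|S) - \lambda\log p(T)}$$
which penalizes any T that has a high prior probability. According to Bayes’ theorem, this objective can be rewritten as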
$$\hat{T} = \mathop{\arg\max}_T{(1-\lambda)\log p(T|S) + \lambda\log p(S|T)}$$
which reflects the hypothesis that T and S depend on each other: a good response should be likely given the message, and the message should be likely given the response.
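The Bayes step can be spelled out as follows. This is a reconstruction from the two forms of the objective given above; the only added observation is that log p(S) does not depend on T, so it can be dropped from the arg max. Starting from Bayes’ theorem,
$$\log p(T) = \log p(T|S) + \log p(S) - \log p(S|T)$$
and substituting into the weighted objective,
$$\log p(T|S) - \lambda\log p(T) = (1-\lambda)\log p(T|S) + \lambda\log p(S|T) - \lambda\log p(S)$$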

Obstacles

Neither form of MMI can be used directly for decoding: the first form, log p(T|S) - λ log p(T), tends to produce ungrammatical responses, and the second form, which contains log p(S|T), makes decoding intractable.

In theory, log p(T|S) keeps T grammatical while -λ log p(T) penalizes overly common sentences. In practice, however, Li et al. observe that decoding tends to select ungrammatical outputs that escape the penalty from log p(T|S).

Computing p(S|T) requires the complete response T, so it cannot be evaluated while T is still being generated word by word, and scoring every possible T is intractable.

Workaround

For the first form of the objective, the language model p(T) is replaced by a weighted version U(T):
$$U(T) = \prod_{i=1}^{M} p(t_i|t_1,\ldots,t_{i-1})\cdot g(i)$$
where the weight g(i) decreases monotonically with the token position i. The penalty therefore acts mainly on the first few tokens, and p(T|S) dominates the later part of the generation process.
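To make the effect of g(i) concrete, here is a minimal sketch that scores a candidate response token by token. It assumes one particular reading of the weighting, in which g(i) scales the language-model log-probability of token i and is a hard threshold (1 for the first `gamma` tokens, 0 afterwards); the literal product formula above and the exact schedule used by Li et al. may differ. The per-token log-probabilities, `lam`, `gamma`, and the function names are all made up for illustration.

```python
def g(i, gamma):
    """Assumed weight: full penalty for the first `gamma` tokens, none after."""
    return 1.0 if i <= gamma else 0.0

def anti_lm_score(cond_logps, lm_logps, lam, gamma):
    """Sum over tokens of log p(t_i | S, t_<i) - lam * g(i) * log p(t_i | t_<i)."""
    assert len(cond_logps) == len(lm_logps)
    score = 0.0
    for i, (cond, lm) in enumerate(zip(cond_logps, lm_logps), start=1):
        score += cond - lam * g(i, gamma) * lm
    return score

# Hypothetical per-token log-probabilities for two candidate responses.
# "generic" is likely under both models; "specific" has a low prior.
generic = {"cond": [-0.3, -0.3, -0.3], "lm": [-0.2, -0.2, -0.2]}
specific = {"cond": [-0.5, -0.4, -0.4], "lm": [-2.0, -1.5, -1.0]}

# Plain likelihood prefers the generic response.
mle_generic = sum(generic["cond"])
mle_specific = sum(specific["cond"])

# Penalizing only the first token (gamma=1) already flips the ranking.
mmi_generic = anti_lm_score(generic["cond"], generic["lm"], lam=0.5, gamma=1)
mmi_specific = anti_lm_score(specific["cond"], specific["lm"], lam=0.5, gamma=1)
```

With `gamma=0` the penalty vanishes and the score reduces to the plain log-likelihood, so `gamma` controls how many leading tokens are affected by the prior.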

For the second form of the objective, N-best candidates are first generated by the standard Seq2Seq model and then reranked using the new objective function.
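The reranking step can be sketched as below. This is an illustration rather than the authors' implementation: the values of log p(T|S) are taken from Table 1, while the values of log p(S|T) are invented and are assumed to come from a second model that scores the message given the response. The field names and the function `rerank` are hypothetical.

```python
def rerank(candidates, lam):
    """Sort candidates by (1 - lam) * log p(T|S) + lam * log p(S|T), best first."""
    def score(c):
        return (1.0 - lam) * c["logp_t_given_s"] + lam * c["logp_s_given_t"]
    return sorted(candidates, key=score, reverse=True)

# N-best list for the message "What is your name?".
# logp_t_given_s: values from Table 1. logp_s_given_t: made-up numbers.
n_best = [
    {"text": "I don't know.", "logp_t_given_s": -0.913, "logp_s_given_t": -6.0},
    {"text": "Oh, my god!", "logp_t_given_s": -0.974, "logp_s_given_t": -7.0},
    {"text": "My name is Robert.", "logp_t_given_s": -1.553, "logp_s_given_t": -2.0},
]

mle_best = rerank(n_best, lam=0.0)[0]["text"]  # lam = 0 is the standard objective
mmi_best = rerank(n_best, lam=0.5)[0]["text"]  # lam = 0.5 weights both directions equally
```

Because the full response is available for every candidate in the list, log p(S|T) can be computed exactly here, which is what makes this workaround tractable.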

References

[1] Jiwei Li et al. A Diversity-Promoting Objective Function for Neural Conversation Models
[2] Wikipedia, Pointwise mutual information, https://en.wikipedia.org/wiki/Pointwise_mutual_information
[3] Wikipedia, Mutual information, https://en.wikipedia.org/wiki/Mutual_information