Sloppy Papers

Introduction

I get it—human beings make mistakes. I make mistakes all the time, and I will continue making mistakes for the rest of my time on this planet. With that said, sometimes other people's mistakes can become very frustrating time sinks for me and may or may not inadvertently prompt me to ask some very interesting questions.

My most recent example for this class of mistakes transpired as part of a course on Estimation Theory that I am currently taking at Purdue. Like in other graduate-level courses, I am required to give a talk on some appropriate related topic instead of writing a final exam. Out of the options that were available, I chose to give a talk on the Expectation-maximization (EM) algorithm, simply because I was interested in learning the details of this technique and writing some cool simulations using it.

The Wikipedia article and the various references cited in it are probably a lot better at providing a high level overview of the algorithm (and I have not reached 100% familiarity with it, partly due to the reasons explained below), but I will try regardless: Under my current understanding, the EM algorithm enables us to obtain a maximum-likelihood (ML) estimate of some parameter $$\theta$$ when the underlying distribution of the data is known, but the actual observations of the data are missing some information. The gist of the algorithm is that it iterates the following two steps:

1. In the expectation step, we compute the conditional expectation of the unknown (or missing) data based on the incomplete observed data
2. In the maximization step, we compute the ML estimate $$\widehat{\theta}_{k}$$ by treating the data computed in the previous step as actual measured data

From what I can tell as of right now, it can be shown that by repeating the above steps our estimate converges to the true ML estimate (The ML estimate we would compute if we had access to all of the data).

Problems

One of the provided references was T. K. Moon's introductory paper The Expectation-Maximization Algorithm, which was published in the November 1996 issue of the IEEE Signal Processing Magazine. You can download it from behind the IEEE Xplore paywall here or for free here. According to Google Scholar, it has been cited 1634 times. I like the paper—it is a light read that explains its concepts in a very clear and concise manner and it at least seems to give a very intuitive overview of the underlying mathematics.

The problem is just that it took me a substantial amount of time to get past the motivating example, which is given on the first few pages. It comes with a funny story—I recommend reading the article for some more details on the tale of D. T. Ector and S. Tim Hatter—and it's supposed to give the reader some sort of intuition about the workings of the algorithm. The problem is posed as the following:

Suppose that in an image pattern-recognition problem, there are two general classes to be distinguished: a class of dark objects and a class of light objects. The class of dark objects may be further subdivided into two shapes: round and square. Using a pattern recognizer, it is desired to determine the probability of a dark object.

The author then goes ahead to inform us that the data samples are assumed to be drawn from a multinomial distribution, parameterized by some parameter $$p$$, which we wish to estimate. For now, all that is interesting to us is that we can describe the distribution of the data with some pdf $$f(x_1, x_2, x_3\, |\, p)$$, where, again, $$p$$ is the unknown parameter. The problem statement continues with

A feature extractor is employed that can distinguish which objects are light and which are dark, but cannot distinguish shape. Let $$[y_1,\ y_2]^{\mathsf{T}} =\, \mathbf{y}$$ be the number of dark objects and number of light objects detected, respectively, so that $$y_1 = x_1 + x_2$$ and $$y_2 = x_3$$.

You get the issue: We want to obtain the ML estimate of the parameter $$p$$, but the data we're observing is missing some information:

There is a many-to-one mapping between $$\{x_1,\ x_2\}$$ and $$y_1$$. For example, if $$y_1 = 3$$, there is no way to tell from the measurements whether $$x_1 = 1$$ and $$x_2 = 2$$ or $$x_1 = 2$$ and $$x_2 = 1$$.

The EM algorithm is now supposed to help us obtain an estimate for the parameter $$p$$, even though we cannot observe the $$x$$ 's directly. Based on the definition of the algorithm and the known underlying multinomial distribution of the data, the author then derives equations for the two iteration steps and provides us with a table of the estimated values of $$x_1$$, $$x_2$$, and $$p$$ for the first 10 steps of the iteration. In the E-step, we are supposed to compute estimates for $$x_1$$ and $$x_2$$ as

$$x_{1}^{[k+1]} = y_{1} \frac{1/4}{1/2 + p^{[k]}/2}$$

and

$$x_{2}^{[k+1]} = y_{1} \frac{1/4 + p^{[k]}/2}{1/2 + p^{[k]}/2}$$

and in the M-step, we are supposed to compute the estimate of the parameter $$p$$ as

$$p^{[k+1]} = \frac{2x_{2}^{[k+1]} - x_{3}}{x_{2}^{[k+1]} + x_{3}}.$$

So far so good. We are also informed that the algorithm is initialized with $$p^{[0]} = 0$$ and that out of 100 drawn samples, 100 are dark, i.e. $$y_1 = 100$$, and that the true values of $$x_1$$ and $$x_2$$ are 25 and 38, respectively.

What?

This is where it starts to get messy. Of course, if we draw a 100 samples, 25 of them are dark and square, and 38 of them are dark and round, it would be hard to get me to agree to the fact that the number of dark objects is 100. The right answer is of course that $$y_1= 63$$. Obviously a typo, but a very disruptive typo. I had to go back and forth to reassure myself that it actually was an error, which was obviously distruptive to my reading flow.

We are also given the following table of results, in which we can clearly see how the estimated values of $$x_1$$ and $$x_2$$ converge to 25 and 38.

Given this data and the equations used to obtain it, I figured that it would be a good idea to reproduce those results by myself, since it would not take more than a few lines of Python to write out this algorithm. So I did, and you can inspect the code below.

# observed data
n = 100
y1 = 63
x3 = n-y1
# iteration steps
kk = 10
# initialize
x1_k, x2_k, p_k = 0, 0, 0
table = [['k', 'x1', 'x2', 'p']] # to print org-mode table
# iterate
for k in range(1, kk+1):
x1_k = y1 * (1/4)/(1/2 + p_k/2)
x2_k = y1 * (1/4 + p_k/2)/(1/2 + p_k/2)
p_k = (2*x2_k - x3)/(x2_k + x3)
table.append([str(k)] + ['{:10.6f}'.format(s) for s in [x1_k, x2_k, p_k]])
return table


Runnning this small simulation gives us the following results:

k x1 x2 p
1 31.500000 31.500000 0.379562
2 22.833333 40.166667 0.561555
3 20.172199 42.827801 0.609507
4 19.571211 43.428789 0.619897
5 19.445679 43.554321 0.622048
6 19.419896 43.580104 0.622489
7 19.414618 43.585382 0.622579
8 19.413539 43.586461 0.622597
9 19.413318 43.586682 0.622601
10 19.413273 43.586727 0.622602

This doesn't look like the table from the article at all! The numbers are converging to some values, but definitely not the values that we are expecting. Something is obviously wrong here. So what's the problem?

The problem is that while the simulation results that were presented are correct, the equations that are claimed to have been used to obtain those results are wrong. Take a look at the article. A few pages later (in Box 2), the author describes how to derive the expressions to calculate the values for $$x_{1}^{[k+1]}$$ and $$x_{2}^{[k+1]}$$ (the E-step) from the underlying trinomial distribution.

I don't have to reiterate the entire derivation here, but the end result of it is that given three discrete random variables $$X_1, X_2, X_3$$ drawn from a trinomial distribution with probabilities $$p_1, p_2, p_3$$, we can compute the conditional expectation $$E[X_1\, |\, X_1 + X_2 = y]$$ as

$$E[X_1\, |\, X_1 + X_2 = y] = y\frac{p_2}{p_1 - p_2}.$$

When this expression is used in the motivating example, we arrive at the equations for $$x_{1}^{[k+1]}$$ and $$x_{2}^{[k+1]}$$ that I gave earlier in the post (and which also appear in the paper).

This is incorrect. After going through his derivation by hand and verifying each of the steps, I realized that the cause of my frustration—and I was very frustrated at this point, because as a student my first assumption is that I am in the wrong—was due to a simple sign error that seemed to have propagated from here back to the example given in the introduction. You can check this yourself by working through the given derivation, but it turns out that an evil ghost must have flipped the plus sign in the denominator of the expression given above. The correct answer is

$$E[X_1\, |\, X_1 + X_2 = y] = y\frac{p_2}{p_1 + p_2}.$$

Cool. After almost losing my mind over this for a short while, I finally regained my sanity and plugged the correct results into my simulation:

$$x_{1}^{[k+1]} = y_{1} \frac{1/4}{1/2 + p^{[k]}/4}$$

and

$$x_{2}^{[k+1]} = y_{1} \frac{1/4 + p^{[k]}/4}{1/2 + p^{[k]}/4}.$$

For the sake of completeness: The underlying trinomial distribution given in the example was

$$P(X_1=x_1, X_2=x_2, X_3=x_3\, |\,p) = \left(\frac{n!}{x_1!x_2!x_3!}\right)\left(\frac{1}{4}\right)^{x_1} \left(\frac{1}{4}+\frac{p}{4}\right)^{x_2}\left(\frac{1}{2}-\frac{p}{4}\right)^{x_3}.$$

Great! The correct simulation and its results can be seen below.

# observed data
n = 100
y1 = 63
x3 = n-y1
# iteration steps
kk = 10
# initialize
x1_k, x2_k, p_k = 0, 0, 0
table = [['k', 'x1', 'x2', 'p']]
# iterate
for k in range(1, kk+1):
x1_k = y1 * (1/4)/(1/2 + p_k/4) # fixed the equation here
x2_k = y1 * (1/4 + p_k/4)/(1/2 + p_k/4) # fixed the equation here
p_k = (2*x2_k - x3)/(x2_k + x3)
table.append([str(k)] + ['{:10.6f}'.format(s) for s in [x1_k, x2_k, p_k]])
return table


Lo and behold, these results look a lot better:

k x1 x2 p
1 31.500000 31.500000 0.379562
2 26.475460 36.524540 0.490300
3 25.298157 37.701843 0.514093
4 25.058740 37.941260 0.518840
5 25.011514 37.988486 0.519773
6 25.002255 37.997745 0.519956
7 25.000441 37.999559 0.519991
8 25.000086 37.999914 0.519998
9 25.000017 37.999983 0.520000
10 25.000003 37.999997 0.520000

The Big Picture

It seems like the peer-review process should catch most of the problems of the kind that I encountered here. I'm aware that a few errors will always slip through the cracks and that forcing a PhD student to work out the math for (him|her)self might not be a bad thing after all. However, I do see this experience as a nod to the problem of reproducibility in science.

I acknowledge the issues with a lack of reproducibility of results from a high-level perspective: If you don't give people your code or your data and open your methods up for scrutiny, can you be 100% sure that you did not make a mistake? (There is an interesting connection between this question and the importance of open-source crypto here).

The angle that is a little more personal to me (and maybe a little underappreciated in general) is the educational cost of this. I'm a student—I don't know what I'm doing! I love reading all of the engineering papers with all of the pretty MATLAB plots, but it has happened numerous times that authors leave out some crucial detail that is part of their simulation code, making it impossible for me to reproduce their results. This is a frustrating experience. It slows down the process of learning and provides quite a high barrier to the entry in the world of academic research for aspiring scientists.

I hope that I will be able to lead by example and host the simulation code that comes with my papers (they will be submitted once I stop wasting time writing blog posts) on GitHub. I encourage the academic readers of this blog to consider doing something similar, maybe you'll save a poor PhD student's Thursday night at some point.

I would appreciate it.

TL;DR

Almost lost my mind about mistake in survey paper; fixed it; regained sanity; would bet money this isn't the only case this happens.

P.S.—some details for Emacs people

I hope everyone enjoyed this post. Here are some cool nerdy details on the code blocks and the tables they produce: I mentioned a while ago that this blog is built using Emacs Org-mode. Org has extensive literate programming capabilities using its source code block environment Babel. Using Babel, I can write a short python program inside of a source code block in my blog post, evaluate it while writing the post, and produce tabulated results that get rendered as HTML tables during exports.

Personally, I think that's pretty cool. You can find more information on making Org and Python play nice here. This is what the first block above looks like as I'm editing it in Emacs:

#+NAME:wrong_sim
#+BEGIN_SRC python :results table replace :exports both
# observed data
n = 100
y1 = 63
x3 = n-y1
# iteration steps
kk = 10
# initialize
x1_k, x2_k, p_k = 0, 0, 0
table = [['k', 'x1', 'x2', 'p']] # to print org-mode table
# iterate
for k in range(1, kk+1):
x1_k = y1 * (1/4)/(1/2 + p_k/2)
x2_k = y1 * (1/4 + p_k/2)/(1/2 + p_k/2)
p_k = (2*x2_k - x3)/(x2_k + x3)
table.append([str(k)] + ['{:10.6f}'.format(s) for s in [x1_k, x2_k, p_k]])
return table
#+END_SRC

Runnning this small simulation gives us the following results:
#+RESULTS: wrong_sim
|  k |        x1 |        x2 |        p |
|  1 | 31.500000 | 31.500000 | 0.379562 |
|  2 | 22.833333 | 40.166667 | 0.561555 |
|  3 | 20.172199 | 42.827801 | 0.609507 |
|  4 | 19.571211 | 43.428789 | 0.619897 |
|  5 | 19.445679 | 43.554321 | 0.622048 |
|  6 | 19.419896 | 43.580104 | 0.622489 |
|  7 | 19.414618 | 43.585382 | 0.622579 |
|  8 | 19.413539 | 43.586461 | 0.622597 |
|  9 | 19.413318 | 43.586682 | 0.622601 |
| 10 | 19.413273 | 43.586727 | 0.622602 |