# Imputation by Predictive Mean Matching: Promise & Peril

Paul Allison
March 5, 2015

Predictive mean matching (PMM) is an attractive way to do multiple imputation for missing data, especially for imputing quantitative variables that are not normally distributed. But, as I explain below, it’s also easy to do it the wrong way.

Compared with standard methods based on linear regression and the normal distribution, PMM produces imputed values that are much more like real values. If the original variable is skewed, the imputed values will also be skewed. If the original variable is bounded by 0 and 100, the imputed values will also be bounded by 0 and 100. And if the real values are discrete (like number of children), the imputed values will also be discrete. That’s because the imputed values are real values that are “borrowed” from individuals with real data.

PMM has been around for a long time (Rubin 1986, Little 1988), but only recently has it become widely available and practical to use. Originally, it could only be used in situations where a single variable had missing data or, more broadly, when the missing data pattern was monotonic. Now, however, the PMM method is embedded in many software packages that implement an approach to multiple imputation variously known as multiple imputation by chained equations (MICE), sequential generalized regression, or the fully conditional specification (FCS). It’s available in many statistical packages, including SAS, Stata, SPSS, and R, all of which allow you to use PMM for virtually any missing data pattern.

LEARN MORE IN A SEMINAR WITH PAUL ALLISON

There are two major pitfalls to PMM, however. First, only a handful of studies have evaluated its performance, so it’s not clear how well it compares with alternative methods. Second, at least two statistical packages, SPSS and Stata, have implemented PMM with a default setting that actually invalidates the method. If you use either of those packages, you must override the default.

Before explaining that problem, I first need to provide a brief description of how PMM works. Suppose there is a single variable x that has some cases with missing data, and a set of variables z (with no missing data) that are used to impute x. Do the following:

1. For cases with no missing data, estimate a linear regression of x on z, producing a set of coefficients b.
2. Make a random draw from the “posterior predictive distribution” of b, producing a new set of coefficients b*. Typically this would be a random draw from a multivariate normal distribution with mean b and the estimated covariance matrix of b (with an additional random draw for the residual variance). This step is necessary to produce sufficient variability in the imputed values, and is common to all “proper” methods for multiple imputation.
3. Using b*, generate predicted values for x for all cases, both those with data missing on x and those with data present.
4. For each case with missing x, identify a set of cases with observed x whose predicted values are close to the predicted value for the case with missing data.
5. From among those close cases, randomly choose one and assign its observed value to substitute for the missing value.
6. Repeat steps 2 through 5 for each completed data set.

Unlike many methods of imputation, the purpose of the linear regression is not to actually generate imputed values. Rather, it serves to construct a metric for matching cases with missing data to similar cases with data present.

There are several variations to this method (Morris et al. 2014), but the most important issue to settle is how many cases (k) should be in each match set. The default in the SAS procedure MI and in the MICE package for R is k=5. That is, each case with missing data on x is matched to the 5 cases (with data present) that have the closest predicted values. One of the 5 is chosen at random and its x value is assigned to the case with missing data. Solas and the user-written ice command for Stata set the default at k=10.

On the other hand, for the SPSS missing values module and for the built-in mi command in Stata the default is k=1. That is, each case with missing data is matched to the single case whose predicted value is closest to the predicted value for the case with missing data. With only one matched case, there is no random draw at Step 5 in the scheme above.

That’s a serious error. With no random draw at Step 5, the only source of random variation in the imputation process is the random draw of regression coefficients in Step 2. That’s not nearly enough to produce proper imputations. As a result, estimated standard errors tend to be much too low, leading to inflated test statistics and confidence intervals that are much too narrow (Morris et al. 2014).

Why did SPSS and Stata get it so wrong? Well, I’m guessing that they relied on Don Rubin’s classic 1987 book Multiple Imputation for Nonresponse in Surveys. In his description of PMM (p. 168), he proposed matching to a single case. But later work makes it clear that this is not the way to go.

So, if not k=1, then how many? That’s not clear. Schenker and Taylor (1996) did simulations with k=3 and k=10. Differences in performance were small, but with k=3, there was less bias and more sampling variation. Based on their simulations, Morris et al. (2014) recommended k=10 for most situations. But a lot depends on sample size. With large samples, k=10 is probably the better choice. But with smaller samples, k=10 will probably include too many cases that are rather unlike the case to which they are matched. Personally, I’m reasonably happy with the k=5 default of SAS and MICE.

The other major drawback of PMM is that there’s no mathematical theory to justify it (which is also true of MICE methods more generally). We have to rely on Monte Carlo simulations, and no simulation can study all the possibilities. Results reported by Schenker and Taylor (1996) and Morris et al. (2014) are very encouraging, but hardly definitive. In brief, it appears that PMM does almost as well as parametric methods for a correctly specified model, and a little better than parametric methods in certain misspecified models. So the current consensus seems to be that this is an acceptable and potentially useful method. But–as they say–more research is needed.

REFERENCES

Little, Roderick J. A. (1988) “Missing-data adjustments in large surveys.” Journal of Business & Economic Statistics 6: 287-296.

Morris, Tim P., Ian R. White and Patrick Royston (2014) “Tuning multiple imputation by predictive mean matching and local residual draws.” BMC Medical Research Methodology 14: 75-87.

Rubin, Donald B. (1986) “Statistical matching using file concatenation with adjusted weights and multiple imputations.” Journal of Business & Economic Statistics 4: 87-94.

Rubin, Donald B. (1987) Multiple Imputation for Nonresponse in Surveys. Wiley.

Schenker, Nathaniel and Jeremy M.G. Taylor (1996) “Partially parametric techniques for multiple imputation.” Computational Statistics & Data Analysis 22: 425-446.

Share

1. Stef van Buuren says:

Paul, wonderful post. Another pitfall is that PMM may not work well in areas where the predictions are sparse, e.g. when imputing extremes. But I have always been impressed what it can do. I have found that a well-implemented PMM also performs well for semi-continuous data, for categorical data, for multilevel data, for compositions, and it will be a matter of time before more evidence on such applications will appear. It’s a great allround method. Hopefully, SPSS and Stata will improve their implementations. Stef.

1. Paul Allison says:

Thanks Stef. Your confidence in PMM increases my confidence.

2. Paul Hoon says:

Dear Drs. Allison, Royston, Morris, Van Burren, Rubin, White, Shafer and others I have not named.

First, thanks to all of you who have published fine imputation papers and books which are guiding my efforts.

Paul, I appreciate your excellent posts here at Statistical Horizons.

Am tackling a big imputation problem and I’m running into a speed problem.

I have three questions numbered below. First, some introductory information.

I read with interest Morris et al. recent paper entitled: “Tuning multiple imputation by PMM and LRD.”

Have a large health data base, approaching 3/4 million, with 100 variables and many variables have 40% missing. The missing value mechanism is MAR, and univariate tests show variables are non-normal.

I plan to “improve” normality with linskew(0) or bcskew(0) in Stata first, before imputation, and then back-transfer to raw score after imputation. In tests, these methods greatly improve skew, but many variables remain with kurtoses above 3 or 4, which I have to live with. Several papers show, and Royston recommends, that “transformation toward normality” before imputation will improve imputation point estimates for all methods, even chained PMM (which relies on a linear model to find “donors:).

Test runs with an 11 variable subset, show that Schafer’s MVN imputation method in Stata will run in about 5 minutes on a hex core xenon machine with 22 gigs of ECC ram, but PMM with k = 1 and iterations = 10 takes about nine hours. All data, program and op sys calls are in RAM on a machine I built myself. So I’m getting max speed. Speed under Mac OSX and Windows 8.1 (same hardware) are virtually the same.

Schafer’s MVM “imposes a normal distribution” when filling in missing values, and distorts or changes the character of the distribution, especially max and min. Chained PMM creates filled in data which faithfully reproduces the original, unfilled distribution, to the extent, that there is almost an exact “finger-print” match based on histograms. This holds for variables which have 40% missing. I am concerned about bias in regression and covariance structure after Schafer’s MVN. Applying Meng’s congeniality principle, chained PMM so far seems the best approach.

Contact with tech support at Stata, and further test runs with timing code have shown that the Stata implementation of PMM has not been parallelized, that is, computation load cannot be distributed across multiple processors, and speed improvements are marginal after four cores (processors). Since ICE was the progenitor for implementations in SPSS and other packages, I doubt that there is any existing implementation of chained PMM that has been parallelized. It may well be that a parallelized chained PMM algorithm is an oxymoron, and that it is not possible.

Am trying to look at all my options, but “hot deck” methods so far seem to be my best bet.

1) Do you know of an implementation of Chained PMM or Local Residual Draw (LRD) that has been “sped up”, that is, which can take advantage of multiple processors? That is, an implementation that has been parallelized?

Otherwise I’m looking at an ninety hour run to get a single imputation set under chained PMM for ten iterations. I will need m = 20 or twenty imputation sets for aggregation under Rubin’s rules, or 1800 hours of computation, or 75 days.

Dr. Allison, since you argue cogently for a k = 3 or more, this could be longer! Dr. Van Burren suggests k can be small, but I don’t yet have timing tests to see if larger values of k imply substantially longer run times.

I can probably cut this down to 8 weeks if I run several instances of Stata simultaneously on each machine, cut chained PMM iterations to five (convergence is reached after 4 to 5 iterations).

This is doable if you use marine batteries with 1000 cold cranking amps to power UPS backups in case of a power failure that lasts 24 hours. Am going to have to build in my electronics workshop, a new, fast, black cylinder Mac Pro Xenon. ECC (error checking ram) on a xenon motherboard is a necessity for an eight week run. Must also monitor CPU temperatures.

2) I cannot find LRD under Stata. Do you know if LRD has been parallelized? Does it run faster than PMM?

3) Are there other “quasi-parametric” methods, hot deck methods, that run fast? I would appreciate any and all suggestions and ideas from you gentlemen. Criticisms welcomed. I would like to use a method such as PMM which does an excellent job of maintaining distribution shape (even U or Poison), max, min, skew, kurtosis, mean, SD in the “filled in” distribution. but which will run faster.

Thank you for taking the time to read this email, and I look forward to your comments. Please contact me directly: phoon at fas dot harvard dot edu Please feel free to share this email with other researchers whom you believe may be able to offer suggestions.

Best,

Paul Hoon
Visiting Fellow
Harvard

1. Paul Allison says:

I’m really sorry, but this is just way more content than I have time to address. Others are welcome to comment and suggest answers.

3. Stef van Buuren says:

Paul, there is a C code routine for finding matches in the mice package that gave a speed up of 5-6 times compared to the older R version. It’s fairly dumb, and I am sure it could be made much faster than this. It can also be made parallel by someone with proper skills. Hope this helps, Stef.

1. Paul Allison says:

Good to know. Thanks for sharing.

4. Daniel Klein says:

Stata has already changed the default in the latest release (14), and forces users to specify the number of closest values (neighbors) to be used.

Paul, thanks a lot for your very interesting and inspiring blog entries. Much appreciated!

Best
Daniel

1. Paul Allison says:

Thanks Daniel. I’ll check out version 14 which I’ve just ordered.

2. Douglas AMULI says:

Yes Daniel,
I confirm. It is the same in stata 15
Users are forced to specify the value of k.

Best,
Douglas

5. Christian Czubacki says:

Hi,
great posting about the pmm-procedure in SPSS. Is there a way to change the k to like 10, as suggest?

best regards,
chris

1. Paul Allison says:

Good question but I don’t know. Any SPSSers out there who can check?

1. A. Janse says:

Thanks Christian for this question. With the PMM procedure, it seems not possible to set k. You can change the “Model type for scale variables:” under the tab “Method” in the ‘Impute missing data value screen’ to ‘Linear Regression’

In the following tab ‘Constraints’ one can choose to fill in ‘maximum case draws’ and the ‘maximum parameter draws’. But I’m statistically not well educated to evaluate if one of this is similar to k???

6. Michael McIsaac says:

Thank you for the interesting post. I wonder how your point about the need for a random variation at step 5 relates to your discussion of discrete data and to Stef van Buuren’s comment that the method works well for categorical data; surely in some settings the k nearest observed neighbours will always be identical (for, say, k1. Could you comment on the use of PMM in such a setting?

1. Paul Allison says:

Good point, but one I haven’t thought about.

7. David Disabato says:

Hi Paul and others,

I am having an interesting situation where the fraction of missing information from pooled analyses (e.g., OLS regression) increases the more auxiliary variables I add to my pmm imputation models. I was taught to include all available auxiliary variables in a dataset, so I have 30 auxiliary variables in my pmm imputation models. However, when I decrease the number of auxiliary variables to 20 or even 10, my fraction of missing information significantly decreases (from .80 to .60 and .50, respectively; I have about 50% missing data for analysis variables). Has anybody else experienced this? I have noticed it with traditional parametric imputation models as well, but it seems to be more common and extreme with pmm (in my experience).

Thanks,
David

1. Paul Allison says:

This can happen. Auxiliary variables often introduce more noise than help. Concentrate on auxiliaries that have high correlations with the variables that have missing data.

8. Laurent De Vriendt says:

Thank you very much for your post. I understand ppm a lot better.

However, I would like some precision on one aspect. When you describe the pmm procedure:

“For cases with no missing data, estimate a linear regression of x on z, producing a set of coefficients b”

Does that mean:

1)cases of no missing data for variable x (but there can be missing data for other variables in z)
or
2)cases of no missing data (as for rows with no missing data across all variables)

This is a problem to me since, in my dataset, all my rows have at least one missing data, but the missing data is not necessarily for the same variable. (i.e. time-series data with some rows systematically having one year missing, while other rows systematically have a different year missing)

Thank you very much,
Laurent

1. Paul Allison says:

I was describing the situation in which only one variable has missing data, the one that needs to be imputed. When other variables also have missing data, PMM must be implemented in a FCS (also known as MICE) framework.

9. Joachim says:

Paul, thank you very much for this post!

You are mentioning in your post that PMM is lacking comparisons with other imputation methods. I recently wrote an article, in which I’m showing some evidence that PMM outperforms stochastic regression imputation in respect to the plausibility of imputed values as well as in case of heteroscedastic data.

In case you are interested: https://statistical-programming.com/predictive-mean-matching-imputation-method/#pmm-vs-sri

Regards,
Joachim

10. Matthieu says:

Hi Paul,

Thank you for this blog post, very helpful !
In step 2, could you elaborate what you mean when you say “(with an additional random draw for the residual variance)” ?
I understand that there are residuals for the model b, residuals for the model b* (which is chosen at random), and that we can compute a variance for these 2 sets for residuals, but I don’t understand how and where you can “draw” from the residual variance.
Is this related to taking the residual of the donor in LRD ?

For step 6, “Repeat steps 2 through 5 for each completed data set.”, I’m not sure I understand it completely either, but maybe it’s just vocabulary.
I understand it as “Repeat steps 2 through 5 for each row of the data frame that is missing an x”, is that it ?

1. Paul Allison says:

The purpose of step 2 is to generate a different set of parameters for the imputation model for each data set that is produced. The parameters are the regression coefficients and the residual variance. All of these are first estimated by ordinary least squares. Then, for each data set, one makes random draws from the “posterior distribution” of these parameters. For the residual variance, a random draw is made from a transformation of a variable with a chi-square distribution.
Once you’ve got the random draws of the parameters, you then use those to generate predicted values of the variable with missing data for all the observations, whether the variable is missing or not.
When I say repeat steps 2-5, I’m not talking about each row of a data set. Rather I’m saying that steps 2-5 are repeated for each data set that you want to produce.

1. Matthieu says:

Oh okay.
So to be clear, the “data sets you want to produce” are data sets where the the missing values have been imputed using different b* models, right ?

But that means that for each data set you produce, all imputed values are perfectly aligned (they all lie exactly on the plane b*), aren’t they ?
Isn’t that problematic since we want to add randomness to the imputed values ? I know that b* is already perturbed by random noise compared to b, but is that enough?

Also, if you estimate a new model c on the complete (imputed) dataset, c will be midway between b and b* (closer to b because I guess you don’t have more than 50% missing data), but all imputed values push the distribution of your data in the same direction (towards b*), which seems like it adds a systematic bias. I would think that when you impute data, it shouldn’t change the distribution, so c should be almost the same as b no ? (I might be wrong on that, I’m just guessing).

Also, for step 3, on this website https://statisticsglobe.com/predictive-mean-matching-imputation-method/, the step 3 seems to say that not all values should be predicted from b*, but non-missing ones from b and missing ones from b*.

I’m sorry, I’m quite new to all of this and I’m trying to make sense out of it.

1. Paul Allison says:

The thing to keep in mind is that the predicted values produced in step 3 are NOT directly used for imputation. Rather, they are used to match each case with a missing value to several other cases that do not have missing values. Then, there is a single random draw from that set of matched cases. The observed value for the chosen case becomes the imputed value. That random draw is where the additional random variation comes in.

11. This post is awesome! I’m giving a presentation on pmm and other imputation methods for a programming/analytics club, and I definitely appreciated your attention to detail. Thank you!