I love Stata and always jump at the chance to teach it. But I was honestly a bit terrified. I had never been outside North America and Western Europe and don’t really consider myself much of a traveler. What would Dhaka be like? I had no real reference point for trying to predict the experience. Even as I was boarding the flight from Boston to Dubai (the second leg of a 28-hour trip), there was a part of me still asking, “What have you gotten yourself into?”
Fortunately my love of teaching Stata was greater than my anxiety because the week I spent in Dhaka ended up being one of the most satisfying of my professional life.
Most of the students I taught were participants in the CDC’s Field Epidemiology Training Program (FETP). This worldwide program takes government doctors who’ve been doing clinical work and trains them to be field epidemiologists. All of the doctors had received basic biostatistics training before I arrived. So my primary role was to teach them how to use Stata to implement the techniques they’d already learned in class.
The most surprising aspect of this experience was how much we ended up focusing on data cleaning. In most of my career, I’ve used very well-behaved data sets like the General Social Survey. In Dhaka, most of the students were dealing with what initially seemed to me very disorganized spreadsheets. Their data had all the numbers stored as text, inconsistent capitalization, and in many cases were half “long” format and half “wide” format.
I found it amazingly satisfying to teach these doctors how to take messy, complicated epidemiological surveillance data and turn them into something they could really use. There were one or two students who had been unable to make any progress on their projects for months because of data management difficulties. I had never met data management problems that complicated before. One particular reshape had me pulling my hair out and took me almost two hours to figure out. But I have rarely—maybe never—been as satisfied in front of a computer screen as the moment I finally got it.
Over the five days of the course, we covered a great deal, from t-tests to logistic regression to complex survey analysis. I especially enjoyed teaching these eager, energetic young doctors because they knew exactly how they were going to use these skills to make a difference in their country.
Fortunately I also had a little time to experience Dhaka. I got to ride a rickshaw, take a boat on the Ganges, walk down Hindu Street, and see buildings built by the East India Company. Just walking down the streets of Dhaka was one of the most incredible experiences of my life. The sights and sounds were beyond anything I had ever experienced. As I packed up my laptop on the last day of class, with the sound of the call to prayer in the background, my thoughts were no longer “what have I gotten myself into?” but instead “is this really over?”
Boarding the flight to return home, I was so grateful that my love of Stata had taken me to such an unexpected place. I hope I made as much of an impact on my students as they and their country made on me.
Stephen Vaisey is an Associate Professor of Sociology at Duke University. For Statistical Horizons, he regularly teaches a seminar on Treatment Effects Analysis.
Which probability model is better, the linear or the logistic? It depends. While there are situations where the linear model is clearly problematic, there are many common situations where the linear model is just fine, and even has advantages.
Let’s start by comparing the two models explicitly. If the outcome Y is a dichotomy with values 1 and 0, define p = E(Y|X), which is just the probability that Y is 1, given some value of the regressors X. Then the linear and logistic probability models are:
p = a0 + a1X1 + a2X2 + … + akXk   (linear)

ln[p/(1−p)] = b0 + b1X1 + b2X2 + … + bkXk   (logistic)
The linear model assumes that the probability p is a linear function of the regressors, while the logistic model assumes that the natural log of the odds p/(1−p) is a linear function of the regressors.
The major advantage of the linear model is its interpretability. In the linear model, if a1 is (say) .05, that means that a one-unit increase in X1 is associated with a 5 percentage point increase in the probability that Y is 1. Just about everyone has some understanding of what it would mean to increase by 5 percentage points their probability of, say, voting, or dying, or becoming obese.
The logistic model is less interpretable. In the logistic model, if b1 is .05, that means that a one-unit increase in X1 is associated with a .05 increase in the log odds that Y is 1. And what does that mean? I’ve never met anyone with any intuition for log odds.
Because the log odds scale is so hard to interpret, it is common to report logistic regression results as odds ratios. To do this, we exponentiate both sides of the logistic regression equation and obtain a new equation that looks like this:
p/(1−p) = d0 × d1^X1 × d2^X2 × … × dk^Xk

On the left side we have the odds and on the right side we have a product involving the odds ratios d1 = exp(b1), d2 = exp(b2), etc.
Odds ratios seem like they should be intuitive. If d1 = 2, for example, that means that a one-unit increase in X1 doubles the odds that Y is 1. That sounds like something we should understand.
But we don’t understand, really. We think we understand odds because in everyday speech we use the word “odds” in a vague and informal way. Journalists commonly use “odds” interchangeably with a variety of other words, such as “chance,” “risk,” “probability,” and “likelihood”—and academics are often just as sloppy when interpreting results. But in statistics these words aren’t synonyms. The word odds has a very specific meaning—p/(1−p)—and so does the odds ratio.
Still think you have an intuition for odds ratios? Let me ask you a question. Suppose a get-out-the-vote campaign can double your odds of voting. If your probability of voting was 40% before the campaign, what is it after? 80%? No, it’s 57%.
If you got that wrong, don’t feel bad. You’ve got a lot of company. And if you got it right, I bet you had to do some mental arithmetic[1], or even use a calculator, before answering. The need for arithmetic should tell you that odds ratios aren’t intuitive.
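The arithmetic in the footnote comes down to two conversions; here is a quick sketch in plain Python (the function names are mine, not from the post):

```python
def p_to_odds(p):
    """Convert a probability to odds, p/(1 - p)."""
    return p / (1 - p)

def odds_to_p(odds):
    """Convert odds back to a probability, odds/(1 + odds)."""
    return odds / (1 + odds)

# Doubling the odds of a 40% probability:
p_after = odds_to_p(2 * p_to_odds(0.40))
print(round(p_after, 2))  # 0.57, not 0.80
```

The detour through the odds scale is exactly the mental arithmetic that makes odds ratios non-intuitive.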
Here’s a table that shows what doubling the odds does to various initial probabilities:

Before doubling        After doubling
Probability    Odds    Odds    Probability
   10%         0.11    0.22       18%
   20%         0.25    0.50       33%
   30%         0.43    0.86       46%
   40%         0.67    1.33       57%
   50%         1.00    2.00       67%
   60%         1.50    3.00       75%
   70%         2.33    4.67       82%
   80%         4.00    8.00       89%
   90%         9.00    18.0       95%
It isn’t simple. The closest I’ve come to developing an intuition for odds ratios is this: If p is close to 0, then doubling the odds is approximately the same as doubling p. If p is close to 1, then doubling the odds is approximately the same as halving 1p. But if p is in the middle—not too close to 0 or 1—then I don’t really have much intuition and have to resort to arithmetic.
That’s why I’m not crazy about odds ratios.
The logistic model is unavoidable if it fits the data much better than the linear model. And sometimes it does. But in many situations the linear model fits just as well, or almost as well, as the logistic model. In fact, in many situations, the linear and logistic model give results that are practically indistinguishable except that the logistic estimates are harder to interpret (Hellevik 2007).
For the logistic model to fit better than the linear model, it must be the case that the log odds are a linear function of X, but the probability is not. And for that to be true, the relationship between the probability and the log odds must itself be nonlinear. But how nonlinear is the relationship between probability and log odds? If the probability is between .20 and .80, then the log odds are almost a linear function of the probability (cf. Long 1997).
It’s only when you have a really wide range of probabilities—say .01 to .99—that the linear approximation totally breaks down.
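One way to see this numerically, in a sketch of my own (the tangent-line check is mine, not from the post): over the moderate range the log odds stay close to the line 4(p − .5), whose slope comes from the derivative of the logit at p = .5, but the approximation collapses at extreme probabilities.

```python
import math

def logit(p):
    """The log odds, log[p/(1-p)]."""
    return math.log(p / (1 - p))

def tangent(p):
    """Tangent-line approximation to the logit at p = .5 (slope 1/(.5*.5) = 4)."""
    return 4 * (p - 0.5)

# Worst absolute error over the "moderate" range .20 to .80
moderate = [0.2 + 0.01 * i for i in range(61)]
err_mod = max(abs(logit(p) - tangent(p)) for p in moderate)

# Error at an extreme probability, p = .01
err_ext = abs(logit(0.01) - tangent(0.01))
print(round(err_mod, 2), round(err_ext, 2))
```

The error never exceeds about 0.2 on the moderate range, but is more than ten times larger at p = .01.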
When the true probabilities are extreme, the linear model can also yield predicted probabilities that are greater than 1 or less than 0. Those out-of-bounds predicted probabilities are the Achilles heel of the linear model.
These considerations suggest a rule of thumb. If the probabilities that you’re modeling are extreme—close to 0 or 1—then you probably have to use logistic regression. But if the probabilities are more moderate—say between .20 and .80, or a little beyond—then the linear and logistic models fit about equally well, and the linear model should be favored for its ease of interpretation.
Both situations occur with some frequency. If you’re modeling the probability of voting, or of being overweight, then nearly all the modeled probabilities will be between .20 and .80, and a linear probability model should fit nicely and offer a straightforward interpretation. On the other hand, if you’re modeling the probability that a bank transaction is fraudulent—as I used to do—then the modeled probabilities typically range between .000001 and .20. In that situation, the linear model just isn’t viable, and you have to use a logistic model or another nonlinear model (such as a neural net).
Keep in mind that the logistic model has problems of its own when probabilities get extreme. The log odds ln[p/(1−p)] are undefined when p is equal to 0 or 1. When p gets close to 0 or 1, logistic regression can suffer from complete separation, quasi-complete separation, and rare events bias (King & Zeng, 2001). These problems are less likely to occur in large samples, but they occur frequently in small ones. Users should be aware of available remedies. See Paul Allison’s post on this topic.
Interpretability is not the only advantage of the linear probability model. Another advantage is computing speed. Fitting a logistic model is inherently slower because the model is fit by an iterative process of maximum likelihood. The slowness of logistic regression isn’t noticeable if you are fitting a simple model to a small or moderate-sized dataset. But if you are fitting a very complicated model or a very large data set, logistic regression can be frustratingly slow.[2]
The linear probability model is fast by comparison because it can be estimated non-iteratively using ordinary least squares (OLS). OLS ignores the fact that the linear probability model is heteroscedastic with residual variance p(1−p), but the heteroscedasticity is minor if p is between .20 and .80, which is the only situation where I recommend using the linear probability model anyway. OLS estimates can be improved by using heteroscedasticity-consistent standard errors or weighted least squares. In my experience these improvements make little difference, but they are quick and reassuring.
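As an illustration of that workflow (my own numpy sketch, not code from the post), here is a linear probability model fit by closed-form OLS with HC0 heteroscedasticity-consistent standard errors on simulated data whose probabilities stay in the moderate range:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)

# True model: p = .5 + .1*x, kept well inside the .20-.80 range
p = np.clip(0.5 + 0.1 * x, 0.05, 0.95)
y = rng.binomial(1, p)

X = np.column_stack([np.ones(n), x])

# OLS: a non-iterative, closed-form fit
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y

# HC0 "sandwich" standard errors to allow for p(1-p) heteroscedasticity
resid = y - X @ beta
meat = X.T @ (X * resid[:, None] ** 2)
se = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print(np.round(beta, 3), np.round(se, 3))
```

The slope estimate lands near the true .1, and the robust standard errors differ only slightly from the conventional ones, consistent with the point that the correction matters little in the moderate range.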
Paul von Hippel is an Assistant Professor in the LBJ School of Public Affairs at the University of Texas at Austin, with affiliations and courtesy appointments in Sociology, Population Research, and Statistics and Data Science.
Hellevik, O. (2007) Linear versus logistic regression when the dependent variable is a dichotomy. Quality & Quantity, 43(1), 59–74. http://doi.org/10.1007/s11135-007-9077-3
King, G., & Zeng, L. (2001) Logistic Regression in Rare Events Data. Political Analysis, 9(2), 137–163. http://doi.org/10.2307/25791637
Long, J. S. (1997) Regression Models for Categorical and Limited Dependent Variables (1st ed.). Sage Publications, Inc.
[1] Here’s the mental arithmetic that I did. A probability of 40% is equivalent to odds of 2/3. Doubling those odds gives odds of 4/3. And odds of 4/3 are equivalent to a probability of 4/7, which in my head I figured was about 56%. When I wrote this footnote, though, I checked my mental arithmetic using Excel, which showed me that 4/7 is 57%.
[2] In current work, my colleagues and I are using a hierarchical, spatially correlated model to estimate the probability of obesity among 376,576 adults in approximately 2,400 US counties. The computational methods are demanding, and switching from a logistic to a linear probability model reduced our runtime from days to less than an hour.
Including a lagged value of the dependent variable among the predictors can work well for some kinds of models, but not for mixed models, otherwise known as random effects models or multilevel models. Nowadays, mixed modeling is probably the most popular approach to longitudinal data analysis. But including a lagged dependent variable in a mixed model usually leads to severe bias.
In economics, models with lagged dependent variables are known as dynamic panel data models. Economists have known for many years that lagged dependent variables can cause major estimation problems, but researchers in other disciplines are often unaware of these issues.
The basic argument is pretty straightforward. Let y_it be the value of the dependent variable for individual i at time t. Here’s a random intercepts model (the simplest mixed model) that includes a lagged value of the dependent variable, as well as a set of predictor variables represented by the vector x_it:
y_it = b0 + b1 y_i(t−1) + b2 x_it + u_i + e_it
The random intercept u_i represents the combined effect on y of all unobserved variables that do not change over time. It is typically assumed to be normally distributed with a mean of 0, constant variance, and independent of the other variables on the right-hand side.
That’s where the problem lies. Because the model applies to all time points, u_i has a direct effect on y_i(t−1). But if u_i affects y_i(t−1), it can’t also be statistically independent of y_i(t−1). The violation of this assumption can bias both the coefficient for the lagged dependent variable (usually too large) and the coefficients for other variables (usually too small).
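A small simulation makes the bias concrete. This is my own sketch, not from the post: with a true lag coefficient of .4 and a random intercept with the same variance as the noise, a pooled regression that ignores u_i recovers something far larger.

```python
import numpy as np

rng = np.random.default_rng(42)
n_people, n_periods, b1 = 2000, 7, 0.4

u = rng.normal(size=n_people)   # random intercepts, one per person
y = rng.normal(size=n_people)   # rough starting values
lagged, current = [], []
for t in range(n_periods):
    y_new = b1 * y + u + rng.normal(size=n_people)
    lagged.append(y)
    current.append(y_new)
    y = y_new

x = np.concatenate(lagged)
yv = np.concatenate(current)
X = np.column_stack([np.ones_like(x), x])

# Pooled OLS that ignores the random intercept u_i
bhat = np.linalg.lstsq(X, yv, rcond=None)[0][1]
print(round(bhat, 2))  # far above the true value of 0.4
```

The lag coefficient comes out closer to .8 than to .4, exactly the "usually too large" direction described above.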
Later I’ll discuss some solutions to this problem, but first let’s consider an example. I use the wages data set that is available on this website. It contains information on annual wages of 595 people for seven consecutive years. The data are in “long form”, so there’s a total of 4,165 records in the data set. I use Stata for the examples because there are good Stata commands for solving the problem.
Using the xtreg command, let’s first estimate a random intercepts model for lwage (log of wage) with the dependent variable lagged by one year, along with two predictors that do not change over time: ed (years of education) and fem (1 for female, 0 for male).
Here’s the Stata code:
use “http://statisticalhorizons.com/wp-content/uploads/wages.dta”, clear
xtset id t
xtreg lwage L.lwage ed fem t
The xtset command tells Stata that this is a “cross-section time-series” data set with identification numbers for persons stored in the variable id and a time variable t that ranges from 1 to 7. The xtreg command fits a random-intercepts model by default, with lwage as the dependent variable and the subsequent four variables as predictors. L.lwage specifies the one-year lag of lwage.
Here’s the output:

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       lwage |
         L1. |   .8747517   .0085886   101.85   0.000     .8579183    .8915851
          ed |   .0108335   .0011933     9.08   0.000     .0084947    .0131724
         fem |   -.06705     .010187    -6.58   0.000    -.0870162   -.0470839
           t |   .0071965   .0019309     3.73   0.000     .0034119    .0109811
       _cons |   .7624068   .0491383    15.52   0.000     .6660974    .8587161
------------------------------------------------------------------------------
When the dependent variable is logged and the coefficients are small, multiplying them by 100 gives approximate percentage changes in the dependent variable. So this model says that each additional year of schooling is associated with a 1 percent increase in wages and females make about 6 percent less than males. Each additional year is associated with about a 0.7 percent increase in wages. All these effects are dominated by the lagged effect of wages on itself, which amounts to approximately a 0.9 percent increase in this year’s wages for a 1 percent increase in last year’s wages.
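The "100 times the coefficient" reading is itself an approximation; the exact percentage change implied by a log-wage coefficient b is 100 × (exp(b) − 1). A quick check for the coefficients above (the helper function is mine):

```python
import math

def pct_change(b):
    """Exact percent change in wages implied by a log-wage coefficient."""
    return 100 * (math.exp(b) - 1)

for name, b in [("ed", 0.0108335), ("fem", -0.06705), ("t", 0.0071965)]:
    print(name, round(100 * b, 2), "approx. vs", round(pct_change(b), 2), "exact")
```

For coefficients this small the two versions barely differ, which is why the shortcut is safe here.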
As I explained above, the lagged dependent variable gives us strong reasons to be skeptical of these estimates. Economists have developed a variety of methods for solving the problem, most of them relying on some form of instrumental variable (IV) analysis. For a discussion of how to implement IV methods for lagged dependent variables in Stata, see pp. 274–278 in Rabe-Hesketh and Skrondal (2012).
Personally, I prefer the maximum likelihood approach pioneered by Bhargava and Sargan (1983), which incorporates all the restrictions implied by the model in an optimally efficient way. Their method has recently been implemented by Kripfganz (2015) in a Stata command called xtdpdqml. This unwieldy set of letters stands for “cross-section time-series dynamic panel data estimation by quasi-maximum likelihood.”
Here’s how to apply xtdpdqml to the wage data:
xtset id t
xtdpdqml lwage ed fem t, re initval(0.1 0.1 0.2 0.5)
The re option specifies a random effects (random intercepts) model. By default, the command includes the dependent variable, lagged by one time unit, as a predictor. The initval option sets the starting values for the four variance parameters that are part of the model. Here is the output:

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       lwage |
         L1. |   .4142827   .0230843    17.95   0.000     .3690383     .459527
          ed |   .0403258   .0031841    12.66   0.000     .0340851    .0465666
         fem |  -.2852665   .0271688   -10.50   0.000    -.3385164   -.2320166
           t |   .0533413   .0027533    19.37   0.000     .0479449    .0587378
       _cons |    3.25368   .1304816    24.94   0.000      2.99794    3.509419
------------------------------------------------------------------------------
Results are markedly different from those produced above by xtreg. The coefficient of the lagged dependent variable is greatly reduced, while the others show substantial increases in magnitude. An additional year of schooling now produces a 4 percent increase in wages rather than 1 percent. Females now make about 24 percent less than males (calculated as 100×(exp(−.28)−1)), compared to 6 percent less. And the annual increase in wages is 5 percent instead of 1 percent.
So doing it right can make a big difference. Unfortunately, xtdpdqml has a lot of limitations. For example, it can’t handle missing data except by listwise deletion. With Richard Williams and Enrique Moral-Benito, I have been developing a new Stata command, xtdpdml, that removes many of these limitations. (Note that the only difference in the names for the two commands is the q in the middle). It’s not quite ready for release, but we expect it out by the end of 2015.
To estimate a model for the wage data with xtdpdml, use
xtset id t
xtdpdml lwage, inv(ed fem blk) errorinv
The inv option is for time-invariant variables. The errorinv option forces the error variance to be the same at all points in time. Like xtdpdqml, this command automatically includes a one-time-unit lag of the dependent variable. Unlike xtdpdqml, xtdpdml can include longer lags and/or multiple lags.
Here is the output:

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lwage2       |
      lwage1 |   .4088803   .0229742    17.80   0.000     .3638517     .453909
          ed |   .0406719   .0032025    12.70   0.000     .0343951    .0469486
         fem |  -.2878266    .027345   -10.53   0.000    -.3414218   -.2342315
------------------------------------------------------------------------------
Results are very similar to those for xtdpdqml. They are slightly different because xtdpdml always treats time as a categorical variable, but time was a quantitative variable in the earlier model for xtdpdqml.
If you’re not a Stata user, you can accomplish the same thing with any linear structural equation modeling software, as explained in my unpublished paper. As a matter of fact, the xtdpdml command is just a front-end to the sem command in Stata. But it’s a lot more tedious and error-prone to set up the equations yourself. That’s why we wrote the command.
By the way, although I’ve emphasized random effects models in this post, the same problem occurs in standard fixed-effects models. You can’t put a lagged dependent variable on the right-hand side. Both xtdpdqml and xtdpdml can handle this situation also.
If you’d like to learn more about dynamic panel data models, check out my 2day course on Longitudinal Data Analysis Using SEM. It will be offered again October 1617, 2015, in Los Angeles.
References
Bhargava, A. and J. D. Sargan (1983) “Estimating dynamic random effects models from panel data covering short time periods.” Econometrica 51 (6): 1635–1659.
Rabe-Hesketh, Sophia, and Anders Skrondal (2012) Multilevel and Longitudinal Modeling Using Stata. Volume 1: Continuous Responses. Third Edition. StataCorp LP.
Incidentally, the use of ML for handling missing data is often referred to as “full information maximum likelihood” or FIML.
What I didn’t mention in that 2012 post (but which I discussed in the paper on which it was based) is that ML is also asymptotically efficient. Roughly speaking, that means that in large samples, the standard errors of ML estimates are as small as possible—you can’t do any better with other methods.
With MI, on the other hand, the only way to get asymptotic efficiency is to do an infinite number of imputations, something that is clearly not possible. You can get pretty close to full efficiency for the parameter estimates with a relatively small number of imputations (say, 10), but efficient estimation of standard errors and confidence intervals typically requires a much larger number of imputations.
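The "pretty close" claim can be quantified with Rubin's classic large-sample approximation: with m imputations and fraction of missing information gamma, the relative efficiency of the MI point estimates is (1 + gamma/m)^(−1). A quick calculation (my own sketch of that formula):

```python
def mi_relative_efficiency(gamma, m):
    """Rubin's large-sample relative efficiency of m imputations
    compared with an infinite number of imputations."""
    return 1 / (1 + gamma / m)

# Even with half the information missing, 10 imputations
# retain about 95% efficiency for the point estimates.
for m in (5, 10, 50):
    print(m, round(mi_relative_efficiency(0.5, m), 3))
```

Note that this formula covers the point estimates only; as the text says, stable standard errors and confidence intervals typically need many more imputations.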
So for large samples, ML seems to have the clear advantage. But what about small samples? For ML, the problem is that statistical inference is based on large-sample approximations that may not be accurate in smaller samples. By contrast, statistical inference for MI is typically based on a t-distribution which adjusts for small sample size. That means that MI is better than ML when working with small samples, right?
Wrong! In a paper that will be published soon in Structural Equation Modeling, Paul von Hippel assesses the performance of ML and MI in small samples drawn from a bivariate normal distribution. He shows, analytically, that ML estimates have less bias than MI estimates. By simulation, he also shows that ML estimates have smaller sampling variance than MI estimates.
What about confidence intervals and p-values? To address that issue, von Hippel introduces a novel method for calculating degrees of freedom for a t-distribution that can be used with ML estimation in small samples. He demonstrates by simulation that confidence intervals based on this t-distribution have approximately the correct coverage and are narrower, on average, than the usual confidence intervals for MI.
Problem solved? Well, not quite. Von Hippel’s DF formula requires some computational work, and that will discourage some researchers. In principle, the method could easily be programmed into structural equation modeling packages and, hopefully, that will happen. Until it does, however, the method probably won’t be widely used.
Bottom line is that ML seems like the better way to go for handling missing data in both large and small samples. But there’s still a big niche for MI. ML requires a parametric model that can be estimated by maximizing the likelihood. And to do that, you usually need specialized software. Most structural equation modeling packages can do FIML for linear models, but not for nonlinear models. As far as I know, Mplus is the only commercial package that can do FIML for logistic, Poisson, and Cox regression.
MI, on the other hand, can be readily applied to these and many other models, without the need for specialized software. Another attraction of MI is that you can easily do a sensitivity analysis for the possibility that data are not missing at random. So if you really want to be a skilled handler of missing data, you need to be adept at both approaches.
If you want to learn more about both multiple imputation and maximum likelihood, check out my twoday course on Missing Data that will be offered this fall. Dates and location have not yet been set.
In a 1944 article in the Journal of the American Statistical Association, Joseph Berkson introduced the logit model (aka logistic regression model) and argued for its superiority to the probit model. The logistic method really began to take off with the publication of David Cox’s 1970 book Analysis of Binary Data. Except for toxicology applications, probit has pretty much disappeared from the biomedical world. There are other options as well (like complementary log-log) but logistic regression is the overwhelming favorite.
So what is it about logistic regression that makes it so popular? In this post, I’m going to detail several things about logistic regression that make it more attractive than its competitors. And I’m also going to explain why logistic regression has some of these properties. It turns out that there is something special about the logit link that gives it a natural advantage over alternative link functions.
First, a brief review. For binary data, the goal is to model the probability p that one of two outcomes occurs. The logit function is log[p/(1−p)], which varies between −∞ and +∞ as p varies between 0 and 1. The logistic regression model says that
log[p/(1−p)] = b0 + b1x1 + … + bkxk
or, equivalently,
p = 1/(1 + exp{−(b0 + b1x1 + … + bkxk)})
Estimation of the b coefficients is usually accomplished by maximum likelihood.
In this context, the logit function is called the link function because it “links” the probability to the linear function of the predictor variables. (In the probit model, the link function is the inverse of the cumulative distribution function of a standard normal variable.)
What’s most important about the logit link is that it guarantees that p is bounded by 0 and 1, no matter what the b’s and the x’s are. However, that property is hardly unique to the logit link. It’s also true for the probit link, the complementary log-log link, and an infinite number of other possible link functions. But there are a few things that are special about logit:
1. If you exponentiate the coefficients, you get adjusted odds ratios. These have a remarkably intuitive interpretation, one that is even used in the popular media to convey the results of logistic regression to non-statisticians. Coefficients from probit regression are not nearly so interpretable.
2. With logit, you can do disproportionate stratified random sampling on the dependent variable without biasing the coefficients (other than the intercept). For example, you could construct a sample that includes all of the events, and a 10% random sample of the nonevents. This property is the justification for the widely used case-control method in epidemiology. It’s also extremely useful when dealing with very large samples with rare events. No other link function has this property.
3. You can do exact logistic regression. With conventional maximum likelihood estimation, the p-values and confidence intervals are large-sample approximations. These approximations may not be very accurate when the number of events is small. With exact logistic regression (a generalization of Fisher’s exact test for contingency tables), exact p-values are obtained by enumeration of all possible data permutations that produce the same “sufficient statistics” (more below). Again, this is not possible with any other link function.
4. You can do conditional logistic regression. Suppose your binary data are clustered in some way, for example, repeated measurements clustered within persons or persons clustered within neighborhoods. Suppose, further, that you want to estimate a model that allows for between-cluster differences, but you don’t want to put any restrictions on the distribution of those differences or their relationship with predictor variables. In that case, conditional maximum likelihood estimation of the logistic model is the way to go. What it conditions on is the total number of events in each cluster. When you do that, the between-cluster effects cancel out of the likelihood function. This doesn’t work with probit or any other link function.
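Property 2 is easy to check by simulation. The sketch below is my own (it fits the logit by plain Newton-Raphson rather than any packaged routine): keep every event plus a 10% sample of non-events, and the slope estimate stays near its true value while the intercept shifts by log(1/0.10).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(-3 + 1.0 * x)))   # true intercept -3, slope 1
y = rng.binomial(1, p)

# Case-control style subsample: all events, 10% of non-events
keep = (y == 1) | (rng.random(n) < 0.10)
Xs = np.column_stack([np.ones(keep.sum()), x[keep]])
ys = y[keep]

# Logistic regression by Newton-Raphson
b = np.zeros(2)
for _ in range(25):
    mu = 1 / (1 + np.exp(-Xs @ b))
    W = mu * (1 - mu)
    b += np.linalg.solve(Xs.T @ (Xs * W[:, None]), Xs.T @ (ys - mu))

print(np.round(b, 2))  # slope near the true 1.0; intercept shifted up by log(1/0.10)
```

The same experiment run with a probit link would distort the slope as well, which is the sense in which this property is unique to logit.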
What is it about the logit link that makes it possible to do these useful variations of logistic regression? The answer is somewhat esoteric, but I’ll do my best to explain it. For binary data, the most appropriate probability distribution is the binomial (or its special case, the Bernoulli distribution for single trial data). The binomial distribution happens to be a member of the very important exponential family of distributions. In general form, the probability distribution for the exponential family can be written as
f(x|b) = h(x)exp{T(x)'g(b) − A(b)}
In this formula, x is a vector of the data, b is a vector of parameters, and h, g, T, and A are known functions. If g(b) = b, then b is said to be the natural parameter (or canonical parameter) of the distribution. T(x) is a vector of sufficient statistics of the data. These are summary statistics that contain all the available information in the data about the parameters.
For the binomial distribution, this formula specializes to
f(x|p) = (N choose x) exp{x log[p/(1−p)] + N log(1−p)}

where N is the number of trials, x is the number of events, and (N choose x) is the binomial coefficient. We see immediately that log[p/(1−p)] is the natural parameter of the binomial distribution. Because the natural parameter directly multiplies the sufficient statistic (in this case, the number of events), all sorts of mathematical operations become much easier and more straightforward. If you work with something other than the natural parameter, things are more difficult. That’s why the logit link has a special place among the infinite set of possible link functions.
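That the binomial really does factor this way can be confirmed numerically (a quick sketch of my own, writing the binomial coefficient as math.comb(N, x)):

```python
import math

def binom_pmf(x, N, p):
    """Ordinary binomial probability."""
    return math.comb(N, x) * p**x * (1 - p)**(N - x)

def binom_expfam(x, N, p):
    """The same probability in exponential-family form, with the natural
    parameter log[p/(1-p)] multiplying the sufficient statistic x."""
    theta = math.log(p / (1 - p))
    return math.comb(N, x) * math.exp(x * theta + N * math.log(1 - p))

for x in range(11):
    assert abs(binom_pmf(x, 10, 0.3) - binom_expfam(x, 10, 0.3)) < 1e-12
```

The algebra behind the check: exp{x log[p/(1−p)] + N log(1−p)} = (p/(1−p))^x (1−p)^N = p^x (1−p)^(N−x).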
Of course, mathematical convenience does not imply that the logit link is more likely to be a realistic representation of the real world than some other link. But in the absence of compelling reasons to use something else, you might as well go with a method that’s convenient and interpretable. It’s the same reason why we often prefer linear models with normally distributed errors.
Incidentally, many social scientists, especially economists, still prefer to use linear probability models for binary data for exactly these reasons: mathematical convenience and interpretability. Check out next month’s post for arguments in favor of linear models for binary outcomes.
There are also some situations in which probit has the mathematical advantage. For example, suppose you want to do a factor analysis of a set of binary variables. You want a model in which the binary variables depend on one or more continuous latent variables. If the distributions of the binary variables are expressed as probit functions of the latent variables, then the multivariate normal distribution can be used as a basis for estimation. There is no comparable multivariate logistic distribution.
If you’d like to learn more about these and other methods for binary data, see my book Logistic Regression Using SAS: Theory and Application.
Compared with standard methods based on linear regression and the normal distribution, predictive mean matching (PMM) produces imputed values that are much more like real values. If the original variable is skewed, the imputed values will also be skewed. If the original variable is bounded by 0 and 100, the imputed values will also be bounded by 0 and 100. And if the real values are discrete (like number of children), the imputed values will also be discrete. That’s because the imputed values are real values that are “borrowed” from individuals with real data.
PMM has been around for a long time (Rubin 1986, Little 1988), but only recently has it become widely available and practical to use. Originally, it could only be used in situations where a single variable had missing data or, more broadly, when the missing data pattern was monotonic. Now, however, the PMM method is embedded in many software packages that implement an approach to multiple imputation variously known as multiple imputation by chained equations (MICE), sequential generalized regression, or the fully conditional specification (FCS). It’s available in many statistical packages, including SAS, Stata, SPSS, and R, all of which allow you to use PMM for virtually any missing data pattern.
There are two major pitfalls to PMM, however. First, only a handful of studies have evaluated its performance, so it’s not clear how well it compares with alternative methods. Second, at least two statistical packages, SPSS and Stata, have implemented PMM with a default setting that actually invalidates the method. If you use either of those packages, you must override the default.
Before explaining that problem, I first need to provide a brief description of how PMM works. Suppose there is a single variable x that has some cases with missing data, and a set of variables z (with no missing data) that are used to impute x. Do the following:

1. For cases with no missing data, estimate a linear regression of x on z, producing a set of coefficients b.
2. Make a random draw from the “posterior predictive distribution” of b, producing a new set of coefficients b*. This step is needed to produce sufficient variability in the imputed values.
3. Using b*, generate predicted values of x for the cases with missing data. Using b, generate predicted values of x for the cases with observed data.
4. For each case with missing data, identify a set of k cases with observed x whose predicted values are closest to the predicted value for the case with missing data.
5. From among those k close cases, randomly choose one and assign its observed value of x as the imputed value.
Unlike many methods of imputation, the purpose of the linear regression is not to actually generate imputed values. Rather, it serves to construct a metric for matching cases with missing data to similar cases with data present.
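The matching scheme can be sketched in a few lines of Python. This is a simplified, hypothetical single-variable implementation of my own, not the algorithm as coded in any of the packages mentioned here:

```python
import numpy as np

def pmm_impute(x, Z, k=5, rng=None):
    """Single-variable predictive mean matching (simplified sketch).

    x : 1-D array with np.nan marking the missing values
    Z : 2-D array of complete predictors
    k : size of the match set (number of candidate donors)
    """
    rng = np.random.default_rng(rng)
    obs = ~np.isnan(x)
    Zc = np.column_stack([np.ones(len(x)), Z])          # add intercept

    # Step 1: regress x on z for the cases with observed x
    beta, *_ = np.linalg.lstsq(Zc[obs], x[obs], rcond=None)

    # Step 2 (simplified): draw perturbed coefficients to add
    # between-imputation variability
    resid = x[obs] - Zc[obs] @ beta
    sigma2 = resid @ resid / (obs.sum() - Zc.shape[1])
    cov = sigma2 * np.linalg.inv(Zc[obs].T @ Zc[obs])
    beta_star = rng.multivariate_normal(beta, cov)

    # Step 3: predicted values -- observed cases use beta,
    # missing cases use the perturbed beta_star
    pred_obs = Zc[obs] @ beta
    pred_mis = Zc[~obs] @ beta_star

    # Steps 4-5: for each missing case, find the k observed cases with
    # the closest predicted values and borrow one donor's real x value
    x_imp = x.copy()
    donors = x[obs]
    for i, p in zip(np.flatnonzero(~obs), pred_mis):
        nearest = np.argsort(np.abs(pred_obs - p))[:k]
        x_imp[i] = donors[rng.choice(nearest)]
    return x_imp
```

Because every imputed value is borrowed from a donor, the imputations automatically respect the bounds and discreteness of the observed data.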
There are several variations on this method (Morris et al. 2014), but the most important issue to settle is how many cases (k) should be in each match set. The default in the SAS procedure MI and in the MICE package for R is k=5. That is, each case with missing data on x is matched to the 5 cases (with data present) that have the closest predicted values. One of the 5 is chosen at random and its x value is assigned to the case with missing data. Solas and the user-written ice command for Stata set the default at k=10.
On the other hand, for the SPSS missing values module and for the built-in mi command in Stata the default is k=1. That is, each case with missing data is matched to the single case whose predicted value is closest to the predicted value for the case with missing data. With only one matched case, there is no random draw at Step 5 in the scheme above.
That’s a serious error. With no random draw at Step 5, the only source of random variation in the imputation process is the random draw of regression coefficients in Step 2. That’s not nearly enough to produce proper imputations. As a result, estimated standard errors tend to be much too low, leading to inflated test statistics and confidence intervals that are much too narrow (Morris et al. 2014).
Why did SPSS and Stata get it so wrong? Well, I’m guessing that they relied on Don Rubin’s classic 1987 book Multiple Imputation for Nonresponse in Surveys. In his description of PMM (p. 168), he proposed matching to a single case. But later work makes it clear that this is not the way to go.
So, if not k=1, then how many? That’s not clear. Schenker and Taylor (1996) did simulations with k=3 and k=10. Differences in performance were small, but with k=3, there was less bias and more sampling variation. Based on their simulations, Morris et al. (2014) recommended k=10 for most situations. But a lot depends on sample size. With large samples, k=10 is probably the better choice. But with smaller samples, k=10 will probably include too many cases that are rather unlike the case to which they are matched. Personally, I’m reasonably happy with the k=5 default of SAS and MICE.
The other major drawback of PMM is that there’s no mathematical theory to justify it (which is also true of MICE methods more generally). We have to rely on Monte Carlo simulations, and no simulation can study all the possibilities. Results reported by Schenker and Taylor (1996) and Morris et al. (2014) are very encouraging, but hardly definitive. In brief, it appears that PMM does almost as well as parametric methods for a correctly specified model, and a little better than parametric methods in certain misspecified models. So the current consensus seems to be that this is an acceptable and potentially useful method. But, as they say, more research is needed.
REFERENCES
Little, Roderick J. A. (1988) “Missing-data adjustments in large surveys.” Journal of Business & Economic Statistics 6: 287-296.
Morris, Tim P., Ian R. White and Patrick Royston (2014) “Tuning multiple imputation by predictive mean matching and local residual draws.” BMC Medical Research Methodology 14: 75-87.
Rubin, Donald B. (1986) “Statistical matching using file concatenation with adjusted weights and multiple imputations.” Journal of Business & Economic Statistics 4: 87-94.
Rubin, Donald B. (1987) Multiple Imputation for Nonresponse in Surveys. Wiley.
Schenker, Nathaniel and Jeremy M.G. Taylor (1996) “Partially parametric techniques for multiple imputation.” Computational Statistics & Data Analysis 22: 425-446.
Despite the potential for this method, I recently learned that it’s vulnerable to a very troubling kind of bias when the lag structure is misspecified. In the latest issue of Sociological Methods and Research, Stephen Vaisey and Andrew Miles showed by both simulation and formal proof that a positive contemporaneous effect will often show up as a negative effect when estimating a fixed effects model with a predictor that is lagged by one time unit. They concluded that, for most social science applications, “artifactual negative ‘effects’ will likely be the rule rather than the exception.”
Vaisey and Miles investigated this problem only for the case of three periods of data, no lagged effect of the dependent variable y on itself, and no true effect of y on x. In that case, maximum likelihood reduces to OLS regression using difference scores: regressing y_{3} – y_{2} on x_{2} – x_{1}. They showed that the coefficient for x_{2} – x_{1} has an expected value that is exactly –.5 times the true contemporaneous coefficient.
My own simulations suggest that a sign reversal can also happen with four or more periods and a lagged dependent variable. And the effect of one variable on the other doesn’t have to be exactly contemporaneous. The reversal of sign can also occur if the correct lag is one week, but the estimated model specifies a lag of one year. Note that this artifact does not arise with random effects models. It’s specific to fixed effects models with lagged predictors. That should not be interpreted as an endorsement of random effects models, however, because they are much more prone to bias from omitted variables.
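The artifact is easy to reproduce. In the following Python simulation (my own illustration, not Vaisey and Miles’ code), x has a purely contemporaneous effect of +1 on y, yet the three-period estimator, regressing y3 – y2 on x2 – x1, recovers a coefficient of about –.5:

```python
import numpy as np

rng = np.random.default_rng(42)
n, b = 100_000, 1.0          # true effect of x on y is contemporaneous: +1

# three periods; x is i.i.d. across periods, plus an individual fixed effect c
x = rng.normal(size=(n, 3))
c = rng.normal(size=(n, 1))
y = b * x + c + rng.normal(size=(n, 3))

# a fixed effects model with x lagged one period reduces, for T=3,
# to OLS of y3 - y2 on x2 - x1
dy = y[:, 2] - y[:, 1]
dx = x[:, 1] - x[:, 0]
coef = (dx @ dy) / (dx @ dx)   # comes out near -0.5: wrong sign, half the size
```

With x independent across periods, Cov(x3 – x2, x2 – x1) = –Var(x2), which is what drives the sign reversal.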
As noted by Vaisey and Miles, a 2011 article in the Journal of Quantitative Criminology may exemplify the problem of misspecified lags. Following my advice, Ousey, Wilcox and Fisher used the fixed effects SEM method to examine the relationship between victimization and offending. Numerous studies have found a positive, cross-sectional relationship between these variables: people who report being victims of crimes are also more likely to commit crimes. But Ousey et al. found negative effects of each variable on the other. Respondents who reported higher levels of offending in year t had lower levels of victimization in year t+1, after adjusting for fixed effects. And respondents with higher levels of victimization in year t had lower levels of offending in year t+1.
This surprising result could be real. But it could also occur if there is a positive effect of victimization on offending that is almost instantaneous rather than lagged by one year. And, finally, it could also occur if there is a positive, instantaneous effect of offending on victimization.
What can be done about this problem? Well, one implication is that more thought should go into the design of panel surveys. If you expect that changes in x will produce changes in y a month later, then collecting monthly data would be much better than collecting annual data. This could have the added advantage of reducing the total time for data collection, although it might also increase certain kinds of response bias.
What if your data have already been collected? Here’s a tentative recommendation that worked well in a few simulations. As a robustness check, estimate models that include both contemporaneous and lagged predictors. If a one-year lag is the correct specification, then the contemporaneous effect should be small and not statistically significant. If, on the other hand, the contemporaneous effect is large and significant, it should raise serious doubts about the validity of the method and the kinds of conclusions that can be drawn. It may be that the data are simply not suitable for separating the effect of x on y from the effect of y on x.
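This check can itself be demonstrated by simulation (again a sketch of my own, with a purely contemporaneous true effect of +1): including both difference terms yields a large contemporaneous coefficient and a near-zero lagged one, which is exactly the warning sign described above.

```python
import numpy as np

rng = np.random.default_rng(7)
n, b = 100_000, 1.0                       # true effect is contemporaneous
x = rng.normal(size=(n, 3))
c = rng.normal(size=(n, 1))               # individual fixed effect
y = b * x + c + rng.normal(size=(n, 3))

dy = y[:, 2] - y[:, 1]
X = np.column_stack([x[:, 2] - x[:, 1],   # contemporaneous difference
                     x[:, 1] - x[:, 0]])  # one-period lagged difference
coefs, *_ = np.linalg.lstsq(X, dy, rcond=None)
# coefs[0] is near the true +1; coefs[1] is near 0
```

A large, significant contemporaneous term like this is the signal that a lagged-only specification would be misleading.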
I tried this strategy on a subset of the data used by Ousey et al. to study victimization and offending. When both contemporaneous and lagged predictors were included, I found a strong positive effect of victimization on offending in the same year. The oneyear lagged effect was negative but small and nonsignificant. The same thing happened in the reverse direction. Offending had a strong positive effect on victimization in the same year, but the lagged effect was negative and not significant. My take: these data don’t allow one to draw any firm conclusions about whether victimization affects offending or offending affects victimization. They certainly don’t provide a basis for claiming negative effects of each variable on the other.
Clearly this is a problem that needs a great deal more study. There is a substantial econometric literature on determining the number of lags needed for autoregressive models but, as far as I know, Vaisey and Miles are the first to identify this particular phenomenon.
By the way, Steve Vaisey teaches a highly-rated course for Statistical Horizons called Treatment Effects Analysis.
Quick review: The basic approach is to assume a cross-lagged linear model, with y at time t affected by both x and y at time t-1, and x at time t also affected by both lagged variables. The equations are
y_{it} = b_{1}x_{i(t-1)} + b_{2}y_{i(t-1)} + c_{i} + e_{it}
x_{it} = a_{1}x_{i(t-1)} + a_{2}y_{i(t-1)} + f_{i} + d_{it}
for i = 1,…, n, and t = 1,…, T.
The terms c_{i} and f_{i} represent individualspecific unobserved heterogeneity in both x and y. They are treated as “fixed effects”, thereby allowing one to control for all unchanging characteristics of the individuals, a key factor in arguing for a causal interpretation of the coefficients. Finally, e_{it} and d_{it} are assumed to represent pure random noise, independent of any variables measured at earlier time points. Additional exogenous variables could also be added to these equations.
Conventional estimation methods are biased because of the lagged dependent variable and because of the reciprocal relationship between the two variables. The most popular solution is the Arellano-Bond (AB) method (or one of its cousins), but I have previously argued for the use of maximum likelihood (ML) as implemented in structural equation modeling (SEM) software.
Last month I presented very preliminary simulation results showing that ML-SEM had substantially lower mean squared error (MSE) than AB under a few conditions. Since then I’ve done simulations for 31 different sets of parameter values and data configurations. For each condition, I generated 1,000 samples, ran the two methods on each sample, and then calculated bias, mean squared error, and coverage for confidence intervals. Since the two equations are symmetrical, the focus is on the coefficients in the first equation: b_{1} for the effect of x on y, and b_{2} for the effect of y on itself.
The simulations for ML were done with PROC CALIS in SAS. I originally started with the sem command in Stata, but it had a lot of convergence problems for the smaller sample sizes. The AB simulations were done in Stata with the xtabond command. I tried PROC PANEL in SAS, but couldn’t find any combination of options that produced approximately unbiased estimates.
Here are some of the things I’ve learned:
Under every condition, ML showed little bias and quite accurate confidence interval coverage. That means that about 95% of the nominal 95% confidence intervals included the true value.
Except under “extreme” conditions, AB also had little bias and reasonably accurate confidence interval coverage.
However, compared with AB, ML-SEM always showed less bias and smaller sampling variance. My standard of comparison is relative efficiency, which is the ratio of the MSE for ML to the MSE for AB. (MSE is the sampling variance plus the squared bias.) Across 31 different conditions, the relative efficiency of the two estimators ranged from .02 to .96, with a median of .50. To translate: if the relative efficiency is .50, you’d need twice as large a sample to get the same accuracy with AB as with ML.
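In code, these two quantities are (a small Python helper of my own, mirroring the definitions in this paragraph, not the code used for the simulations):

```python
import numpy as np

def mse(estimates, truth):
    """Mean squared error = sampling variance + squared bias."""
    estimates = np.asarray(estimates, dtype=float)
    return estimates.var() + (estimates.mean() - truth) ** 2

def relative_efficiency(est_ml, est_ab, truth):
    """Ratio of MSE for ML to MSE for AB; values below 1 favor ML."""
    return mse(est_ml, truth) / mse(est_ab, truth)
```

With replicated estimates from each method, a relative efficiency of .50 means AB needs roughly twice the sample size to match ML.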
Relative efficiency of the two estimators is strongly affected by the value of the parameter b_{2}, the effect of y_{t-1} on y_{t}. As b_{2} gets close to 1, the AB estimators for both b_{1} and b_{2} become badly biased (toward 0), and the sample variance increases, which is consistent with previous literature on the AB estimator. For ML, on the other hand, bias and variance are rather insensitive to the value of b_{2}. Here are the numbers:
            Rel Eff b1    Rel Eff b2
b2=0         0.546207      .8542228
b2=.25       0.509384      .6652079
b2=.50       0.462959      .5163349
b2=.75       0.202681      .2357591
b2=.90       0.022177      .0269079
b2=1.0       0.058521      .0820448
b2=1.25      0.248683      .4038526
Relative efficiency is strongly affected by the number of time points, but in the opposite direction for the two coefficients. Thus, relative efficiency for b_{1} increases almost linearly as the number of time points goes from 3 to 10. But for b_{2}, relative efficiency is highest at T=3, declines markedly for T=4 and T=5, and then remains stable.
            Rel Eff b1    Rel Eff b2
T=3          0.243653      .9607868
T=4          0.398391      .8189295
T=5          0.509384      .6652079
T=7          0.696802      .6444535
T=10         0.821288      .6459828
Relative efficiency is also strongly affected by the ratio of the variance of c_{i} (the fixed effect) to the variance of e_{it} (the pure random error). In the next table, I hold constant the variance of c and vary the standard deviation of e.
              Rel Eff b1    Rel Eff b2
SD(e)=.25      0.234526      .3879175
SD(e)=1.0      0.509384      .6652079
SD(e)=1.5      0.551913      .7790358
SD(e)=2        0.613148      .7737681
Relative efficiency was not strongly affected by the other factors I varied.
Because ML is based on the assumption of multivariate normality, one might suspect that AB would do better than ML if the distributions were not normal. To check that out, I generated all the variables using a chi-square variable with 2 df, which is highly skewed to the right. ML still did great in this situation, and was still about twice as efficient as AB.
In sum, ML-SEM outperforms AB in every situation studied, by a very substantial margin.
In this post, I review some earlier work I’ve done on these questions, and I report new simulation results comparing the Arellano-Bond method with maximum likelihood (ML) using structural equation modeling (SEM) software. Arellano-Bond is hugely popular among economists, but not widely known in other disciplines. ML with SEM is a method that I’ve been advocating for almost 15 years (Allison 2000, 2005a, 2005b, 2009). Long story short: ML rules.
I focus on panel data in which we observe y_{it} and x_{it} for i =1,…, n and t =1,…, T. The proposed linear model allows for reciprocal, lagged effects of these two variables on each other:
y_{it} = b_{1}x_{i(t-1)} + b_{2}y_{i(t-1)} + c_{i} + e_{it}
x_{it} = a_{1}x_{i(t-1)} + a_{2}y_{i(t-1)} + f_{i} + d_{it}
The terms c_{i} and f_{i} represent individualspecific unobserved heterogeneity in both x and y. They are treated as “fixed effects”, thereby allowing one to control for all unchanging characteristics of the individuals, a key factor in arguing for a causal interpretation of the coefficients. Finally, e_{it} and d_{it} are assumed to represent pure random noise, independent of any variables measured at earlier time points.
If all the assumptions are met, b_{1} can be interpreted as the causal effect of x on y, and a_{2} can be interpreted as the causal effect of y on x. This model can be elaborated in various ways to include, for example, other predictor variables, different lags, and coefficients that change over time.
Estimation of the model is not straightforward for reasons that are well known in the econometric literature. First, the presence of a lagged dependent variable as a predictor in each equation means that conventional fixed effects methods yield biased estimates of the coefficients under almost any condition. But even if the lagged dependent variables were excluded from the equations, the error term in each equation would still be correlated with all future values of both x and y. For example, e_{2} → y_{2} → x_{3}. So, again, conventional fixed effects will produce biased coefficients.
Arellano and Bond (1991) solved these problems by using earlier lagged values of x and y as instrumental variables and by applying a generalized method of moments (GMM) estimator. Several software packages now implement this method, including SAS, Stata, LIMDEP, and the plm package for R.
My solution to the problems has been to estimate each equation separately by ML using any SEM package (e.g., LISREL, Mplus, PROC CALIS in SAS, or sem in Stata). Two “tricks” are necessary. Focusing on the first equation, fixed effects are accommodated by allowing c to be correlated with all measurements of x (as well as the initial measurement of y). Second, the error term e is allowed to be correlated with all future measurements of x. Analogous methods are used to estimate the second equation. For details, see the SEM chapters in my 2005 and 2009 books.
In my 2005 paper, I presented simulation evidence that the ML-SEM method produces approximately unbiased estimates of the coefficients under a variety of conditions. For years, I’ve been promising to do a head-to-head comparison of ML with Arellano-Bond, but I’ve just now gotten around to doing it.
What I’m going to report here are some very preliminary but dramatic results. The model used to generate the data was one in which x has a positive effect on y, but y has a negative effect on x:
y_{it} = .5x_{i(t-1)} + .5y_{i(t-1)} + c_{i} + e_{it}
x_{it} = .5x_{i(t-1)} – .5y_{i(t-1)} + f_{i} + d_{it}
All variables have normal distributions, c has a positive correlation with x, f has a positive correlation with y, and c and f are positively correlated with each other. The baseline model had 5 time points (T=5), with sample sizes of 50, 100, 400, and 1600. Then, keeping the sample size at 400, I examined T=4 and T=10. For each condition I did 1,000 replications.
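For readers who want to experiment with this design, here is a sketch of the data-generating process in Python. My actual simulations were run in SAS and Stata, and the specific correlation values below are illustrative assumptions, not the exact ones used:

```python
import numpy as np

def generate_panel(n, T, b1=.5, b2=.5, a1=.5, a2=-.5, seed=None):
    """Generate data from the two-equation cross-lagged panel model."""
    rng = np.random.default_rng(seed)
    # positively correlated individual-level effects c and f
    # (the .5 correlation is an assumption for illustration)
    cf = rng.multivariate_normal([0, 0], [[1, .5], [.5, 1]], size=n)
    c, f = cf[:, 0], cf[:, 1]
    x = np.empty((n, T))
    y = np.empty((n, T))
    x[:, 0] = c + rng.normal(size=n)   # makes c correlate with x
    y[:, 0] = f + rng.normal(size=n)   # makes f correlate with y
    for t in range(1, T):
        y[:, t] = b1 * x[:, t-1] + b2 * y[:, t-1] + c + rng.normal(size=n)
        x[:, t] = a1 * x[:, t-1] + a2 * y[:, t-1] + f + rng.normal(size=n)
    return x, y
```

With the defaults, x has a positive lagged effect on y and y a negative lagged effect on x, matching the equations above.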
I focus here on the coefficient for the effect of x on y in the first equation. For each condition, I calculated the mean squared error (MSE), which is the variance of the estimator plus its squared bias. There was little bias in either estimator, so the MSE primarily reflects sampling variance.
Here are the preliminary results:
Mean Squared Error for Two Estimators

Condition        ML-SEM      Arellano-Bond   Relative efficiency
N=50, T=5        .0057128    .0110352        .5176833
N=100, T=5       .0027484    .0058433        .4703557
N=400, T=5       .0006348    .0014961        .4242679
N=1600, T=5      .0001556    .0003682        .4226466
N=400, T=4       .0011632    .0039785        .2923685
N=400, T=10      .0001978    .0002503        .7902897
The last column, relative efficiency, is the ratio of the MSE for ML to the MSE for AB. With 5 time points, AB is only about half as efficient as ML-SEM, for any sample size. But the number of time points has a dramatic effect: AB is only 29% efficient for T=4 but 79% efficient for T=10.
The next steps are to vary such things as the magnitudes of the coefficients, the variances of the error terms, and the correlations between c and f with each other and with the predictor variables.
Besides its efficiency advantage, the ML-SEM framework also makes a number of other modeling tasks easier to accomplish than AB does.
Before concluding, I must mention that Hsiao et al. (2002) also did a simulation study to compare ML with a variety of other estimators for the panel model, including AB. However, their approach to ML was very different from mine, and it has not been implemented in any commercial software packages. Hsiao et al. found that ML did better with respect to both bias and efficiency than any of the other estimators, under almost all conditions. Nevertheless, the differences between ML and AB were much smaller than those reported here.
If you’re reading this post, you should definitely read next month’s follow-up by clicking here.
To learn more about these and other methods for panel data, check out my seminars, Longitudinal Data Analysis Using SAS and Longitudinal Data Analysis Using Stata. Both will be offered in the spring of 2015. Plus, I am offering a new, more advanced seminar titled Longitudinal Data Analysis Using SEM in Fort Myers, Florida, January 23-24.
References
Allison, Paul D. (2000) “Inferring Causal Order from Panel Data.” Paper presented at the Ninth International Conference on Panel Data, June 22, Geneva, Switzerland.
Allison, Paul D. (2005a) “Causal Inference with Panel Data.” Paper presented at the Annual Meeting of the American Sociological Association, August, Philadelphia, PA.
Allison, Paul D. (2005b) Fixed Effects Regression Methods for Longitudinal Data Using SAS. Cary, NC: SAS Institute.
Allison, Paul D. (2009) Fixed Effects Regression Models. Thousand Oaks, CA: Sage Publications.
Arellano, M. and S. Bond (1991) “Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations.” The Review of Economic Studies 58: 277-297.
Hsiao, Cheng, M. Hashem Pesaran, and A. Kamil Tahmiscioglu (2002) “Maximum likelihood estimation of fixed effects dynamic panel data models covering short time periods.” Journal of Econometrics 109: 107-150.
I’m happy to report that PROC MI in SAS 9.4 has several options for doing a sensitivity analysis based on multiple imputation. I’ve recently had a chance to read the documentation and do a few test runs. A little later in this post, I’ll tell you what I’ve learned.
But first, some background. There are two widely used “modern” methods for handling missing data: multiple imputation and maximum likelihood. In virtually all implementations of these methods in commercial software, the underlying assumption is that data are missing at random (MAR). Roughly speaking, this means that the probability that data are missing on a particular variable does not depend on the value of that variable, after adjusting for observed variables. This assumption would be violated, for example, if people with high income were less likely to report their income.
The MAR assumption does allow missingness to depend on anything that you observe; it just can’t depend on things that you don’t observe. MAR is not a testable assumption. You may suspect that your data are not missing at random, but nothing in your data will tell you whether or not that’s the case.
It’s possible to do multiple imputation or maximum likelihood when data are missing not at random (MNAR), but to do that, you first need to specify a model for the missing data mechanism—that is, a model of how missingness depends on both observed and unobserved quantities. That raises three issues: there is usually little basis for choosing a particular model, nothing in the data can tell you whether the model you chose is right, and the results may depend heavily on which model you choose.
That’s a dangerous combination. And it’s why a sensitivity analysis is so important. The basic idea is to try out a bunch of plausible MNAR models, and then see how consistent the results are across the different models. If results are reasonably consistent, then you can feel pretty confident that, even if data are not missing at random, that would not compromise your conclusions. On the other hand, if the results are not consistent across models, you would have to worry about whether any of the results are trustworthy.
Keep in mind that this is not a test. Inconsistency of results does not tell you that your data are MNAR. It simply gives you some idea of what would happen if the data are MNAR in particular ways.
There’s nothing very deep about this. The hard part is figuring out how to come up with a reasonable set of models. It’s particularly hard if you’re using maximum likelihood to handle the missing data. Elsewhere I’ve argued for the advantages of maximum likelihood over multiple imputation. But one attraction of multiple imputation is that it’s easier to do a decent sensitivity analysis.
That’s where the new options for PROC MI come in. I think they’re easiest to explain by way of an example. In my Missing Data seminar, I use an example data set called COLLEGE, which contains information on 1302 four-year colleges and universities in the U.S. The goal is to estimate a linear regression in which the dependent variable is graduation rate, the percentage of students who graduate among those who enrolled four years earlier.
There are lots of missing data for the five predictor variables, but we’re going to focus on the 98 colleges that did not report their graduation rates. It’s plausible that colleges with low graduation rates would be less likely to report those rates in order to avoid adverse publicity. If so, that would probably entail a violation of the MAR assumption. It would also imply that colleges with missing data on graduation rates would tend to have lower (unobserved) graduation rates than those colleges that report their graduation rates, controlling for other variables.
PROC MI allows us to build that supposition into the multiple imputation model. We can, for example, specify an imputation model that says that the imputed values of GRADRAT are only 80% of what they would be if the data were actually missing at random. Here’s the SAS code for doing that:
PROC MI DATA=MY.COLLEGE OUT=MIOUT;
VAR GRADRAT CSAT LENROLL STUFAC PRIVATE RMBRD ACT;
FCS ;
MNAR ADJUST(GRADRAT / SCALE=.80);
RUN;
This program produces five data sets, with missing data imputed by linear regression. For a sensitivity analysis, the essential ingredient here is the MNAR statement. The ADJUST option says to multiply the imputed values of GRADRAT by .80 at each step of the iterative process. To do a proper sensitivity analysis, we would redo both the imputation and the analysis for several different values of the SCALE parameter, ranging between 0 and 1.
The MNAR statement only works if you specify the MONOTONE method or the FCS method, which is what I used here. FCS stands for fully conditional specification, and it’s equivalent to the chained equations or sequential regression method used in many other packages. The MNAR statement does not work if you use the default MCMC method. (It could probably be done for MCMC, but that would mess up the elegant computational algorithm. FCS is already a “messy” algorithm, so a little more mess is no big deal.)
Instead of multiplying the imputed values by some constant, we could add or subtract a constant, for example,
MNAR ADJUST(GRADRAT / SHIFT = -20);
This would subtract 20 points from any imputed graduation rates. Again, to do a sensitivity analysis, you’d want to try out a range of different SHIFT values to see what effect that would have on your results.
The SHIFT and SCALE options can be combined. The SHIFT option can also be used for adjusting the imputations of categorical outcomes (binary, ordinal or nominal), except that the changes are applied on the log-odds scale.
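For a binary variable, applying the SHIFT on the log-odds scale amounts to the following (a sketch of the arithmetic only, not SAS’s internal code):

```python
from math import exp, log

def shift_on_log_odds(p, shift):
    """Shift an imputed probability p on the log-odds scale.

    Equivalent to logit(p') = logit(p) + shift; a negative shift
    lowers the imputed probability of the category.
    """
    return 1.0 / (1.0 + exp(-(log(p / (1.0 - p)) + shift)))
```

So a shift of 0 leaves the probability unchanged, while a negative shift pulls it toward 0 without ever leaving the (0, 1) interval.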
Another option allows you to restrict the adjustments to certain subsets of the data, e.g.,
MNAR ADJUST(GRADRAT / SHIFT = -20 ADJUSTOBS=(PRIVATE='1'));
This says to subtract 20 points from the imputed values of graduation rates, but only for private colleges, not for public colleges. If you use the ADJUSTOBS option, the subsetting variable (PRIVATE in this case) should be listed in a CLASS statement.
There are also other options, which you can read about here. An introductory article written by the guy who developed PROC MI, Yang Yuan, can be downloaded here.
If you don’t use SAS, you can do adjustments like this using other multiple imputation software along with a little programming. You first produce data sets under the MAR assumption and then you modify imputed values by adding or multiplying by the desired constants. But the SAS method is more elegant because the adjustments are made at each iteration, and the adjusted imputations are used in imputing other variables with missing data in later steps of the algorithm.
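A post-hoc adjustment of that kind might look like this in Python (a hypothetical sketch; the function and variable names are mine):

```python
import numpy as np

def mnar_adjust(values, was_missing, shift=0.0, scale=1.0,
                lower=None, upper=None):
    """Shift/scale only the imputed cells of one imputed data set.

    A rough post-hoc analogue of PROC MI's ADJUST option -- less
    elegant, because the adjustment is applied after imputation is
    complete rather than at each iteration of the algorithm.
    """
    out = np.asarray(values, dtype=float).copy()
    out[was_missing] = out[was_missing] * scale + shift
    if lower is not None or upper is not None:
        # keep adjusted imputations within the variable's bounds
        out[was_missing] = np.clip(out[was_missing], lower, upper)
    return out

# Toy example: subtract 20 points from imputed graduation rates,
# keeping them within the 0-100 bounds
rates = np.array([85.0, 60.0, 72.0, 15.0])
imputed_mask = np.array([False, True, False, True])
adjusted = mnar_adjust(rates, imputed_mask, shift=-20, lower=0, upper=100)
```

For a sensitivity analysis you would repeat the downstream analysis over a range of shift (or scale) values and compare the resulting estimates.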
This particular way of doing a sensitivity analysis is based on something called pattern-mixture models for MNAR. You can read more about pattern-mixture models in Chapter 10 of the book Multiple Imputation and Its Application by James Carpenter and Michael Kenward.
Finally, it’s worth noting that the inclusion of appropriate auxiliary variables into the imputation model can go a long way toward reducing the likelihood of MNAR. The best auxiliary variables are those that are highly correlated with both the variable that has missing data and the probability that the variable is missing. For more on auxiliary variables, see this recent paper by Tenko Raykov, one of Statistical Horizons’ instructors.