## Discussion Board for Missing Data

##### March 10, 2018 By Paul Allison

This page is for participants in Paul Allison’s online course “Missing Data.” Please post any questions or comments you have about the course. Dr. Allison will respond to questions, but you should also feel free to respond to other people’s posts.

Hi everyone! My name is Jenil Patel, and I am currently a doctoral candidate in Epidemiology at UT School of Public Health in Houston. I work as a research assistant on several studies involving big datasets, and my mentor also recommended the course, so I found it interesting and decided to join. Looking forward to developing some awesome statistical skillsets!

Hello, my name is Simon Brauer. I’m a Ph.D. candidate in sociology at Duke University, studying religion. I’ve been using missing data techniques a bit more lately, and while I could explain the logic behind them, the practice of using them well is still unclear to me. This has become particularly relevant as I’ve been using more longitudinal data sets with a significant amount of missing data at different points in time.

Hi all, my name is Ben Fisher and I’m an Assistant Professor of Criminal Justice at the University of Louisville (no, I didn’t go to the Derby this year). I’ve picked up a few tips and tricks for handling missing data over the years, but haven’t been through any formal instruction on it, so this seems like a great opportunity.

Hello – my name is Veronika Shabanova and I am an assistant professor in the department of Pediatrics at Yale. I have a PhD in Biostatistics from Yale as well. There wasn’t a formal course on how to handle missing data (as part of our curriculum), but as a practicing biostatistician, I gained practical knowledge from my experience with handling data that had missing observations. Why did I sign up for the course? I love to learn and re-learn new and familiar statistical concepts. It’s nice to have some dedicated time to spend on this topic. Plus, Dr. Allison has a great reputation, so I am looking forward to the course!

Hi everyone, my name is Eyal Oren and I’m an Associate Professor in Epidemiology & Biostatistics at San Diego State University. Like some others posting here, I use and encounter missing data frequently but have never had ‘formal’ training in this regard. Looking forward to the class!

Hello! My name is Megan Gilligan, and I am an assistant professor in Human Development and Family Studies at Iowa State University.

Hi everyone. My name is SeungYong Han, and I am an assistant research scientist at Arizona State University. My main duties are data analysis and publication, and missing data has always been a big issue in my work. I have used FIML and MI for some papers, but I am trying to understand more about the logic behind them so that I can do a better job in the future. Looking forward to a productive and interesting discussion with all of you for a month!

Hi, my name is Phil Ender. I do consulting in the Southern California area. I am interested in the best practices for data with missing values.

Hi everyone, my name is Diana Sonntag and I am the lead health economist at the Mannheim Institute of Public Health, Heidelberg University. I work in the field of the economics of childhood obesity. Like some others posting here, I love to learn and apply statistical concepts and gain insight into the logic behind them. Looking forward to the course.

Hi everyone, my name is Kwabena Boakye and I am an assistant professor at Georgia Southern University. I really want to learn the best practices in handling missing data for the research that I do. I have heard wonderful reviews from friends about Dr. Allison, and I look forward to a fruitful interaction and learning experience with you all.

Hi everyone, my name is Anna Egalite and I’m a faculty member at NC State University. I teach qualitative methods courses and am always looking for ways to improve my instruction. Hoping to get some good resources and build knowledge by taking this course!

Hello everyone,

My name is Fengsong Gao. I am an adjunct research fellow with Menzies Health Institute Queensland. I am interested in the development and application of sophisticated quantitative methodologies for the analysis of secondary data sets. This course will be highly valuable in helping me better handle missing data.

Hi everyone! My name is Elaine Wei, and I am an assistant professor of Educational Psychology at Mississippi State University. While I teach all levels of statistics/quantitative methodology courses in my department, I always feel that there are gaps, small or huge, to fill on certain topics of my stats knowledge, so…here I am!

Hi everyone, my name is Andreas Wahl. I am an assistant research scientist at the University of Stuttgart in Germany. For my dissertation and various research projects, I am currently working on simulation studies in which I compare different missing data methods. I’d like to take the opportunity to validate the work I have done so far with the help of the topics we are discussing here (as I never had any professional training in this field and had to gather my knowledge from scratch). Basically, I am taking this course to improve and deepen the knowledge I already have. Looking forward to the next few weeks.

Hello, my name is Mihaela Henderson and I’m a doctoral candidate in the higher education program at NC State University. I would like to learn about missing data so I can be better prepared to work with NCES longitudinal studies.

Hi Everyone–I am Julie Gaither, and I’m an epidemiologist in the Department of Pediatrics at the Yale School of Medicine. I’m currently working on a study examining child mortality reports, which have a lot of missing data.

Years ago, when I was completing my dissertation, I took Dr. Allison’s course on survival analysis; it was a great course. I’m sure this one will be as well.

Hi everyone,

My name is Hai Pham, and I’m a PhD student at the QIMR Berghofer Medical Research Institute, Brisbane, Australia. I’m currently working on a randomised controlled trial, and dealing with missing data is a big part of my PhD. Looking forward to interacting with all of you and learning more about missing data.

Hello — My name is Jonathan Mayo, I’m a biostatistician working in perinatal epidemiology at Stanford University School of Medicine. I’m here to learn about imputing missing data for some of my projects.

Hello everyone, my name is Anders Garlid and I am a Ph.D. candidate under the mentorship of Dr. Peipei Ping in the Molecular, Cellular, and Integrative Physiology program at the University of California, Los Angeles.

I come from a background of basic research in mitochondrial and cardiovascular physiology and have recently transitioned to a focus on data science and bioinformatics approaches. I am interested to learn about missing data to improve my ability to contend with multi-omic data and other multi-dimensional or complex data types.

Looking forward to taking this course with everyone!

Best,

Anders

Hi everyone,

My name is Bilal Mirza, and I am a data science postdoc at UCLA under the supervision of Dr. Peipei Ping. We work with large-scale molecular datasets in our lab. I am interested in learning new statistical methods for tackling missing data, which is a common problem in omics datasets. Looking forward to attending this course and interacting with everyone.

Hello everyone,

My name is Howard Choi, I am a graduate student in Bioinformatics at UCLA working under the mentorship of Dr. Peipei Ping. I am interested in learning advanced data imputation methods and the assumptions behind those techniques, so I can correctly select and apply a method that works best for our time-series multi-omics datasets. Looking forward to taking this course with you.

Hello! Thank you for offering this course online. My name is Brooks Bowden. I am an assistant professor in education policy and the economics of education. Similar to others, my formal coursework stopped short of missing data so I am excited to have this opportunity.

Hi all, I am a research associate at the National Institute of Education in Singapore. As we work with survey data, we have lots of missing data. Right now, my supervisor and I are writing a paper about how we deal with missing data in our study, and this course will be helpful for me as I have not had any formal coursework on missing data.

Dear all,

I am David Gimeno, faculty at the UTHealth School of Public Health in San Antonio. I teach applied epidemiology in my school. Always looking to refresh my skills. You can never learn too much!

Hello,

My name is Pascale Dubois. I am a doctoral student in education at Laval University (Quebec, Canada). I have to deal with missing data in my data set, so I look forward to this online course.

Hello,

My name is Jill Suitor. I am a professor of sociology at Purdue University. I am very excited that this course has become available online. I lead a panel study of later-life families; with an increasing number of data points, missing data is becoming a larger issue on my project.

Dr. Allison,

In module 3, probably the second slide, when you are covering the general principle of maximum likelihood, I am not sure I am entirely following your explanation. I believe what I am missing is where PI falls in the formula referred to in the slide. I am using a screenreader and it doesn’t read symbols.

Would it be possible for you to type out the formula or formulas for me?

I would be happy to do so, but it would help if you could tell me what format would work best for you.

Dr. Allison, I am not sure I have an absolute answer as to what format would work; screenreaders never read images or symbols unless they are tagged. How about if we try this out in Word and I’ll see if I can make that work? I wish I had all the answers, but post-secondary mathematics and statistics, especially advanced statistics, have always been a bit of an experiment where I figure things out as I go.

I have also run into something strange in the PDF file Missing Data Using Stata. The PDF appears to have 87 pages in it, and I’m assuming this is where the computer exercises are, or at least the instructions for downloading the data, but after page 41, my screenreader tells me the pages are blank. Do you know if there is some change in the format of the PDF document that might cause a difference in the last 36 or 37 pages?

Please let me know if you would rather me email you.

Thank you

Hi Prof Allison,

I have just completed exercise 2, and the standard errors in direct ML are larger than those given using EM. If direct ML produces approximately unbiased estimates of the standard errors, shouldn’t the standard errors given by direct ML be smaller?

When you use the variances and covariances produced by the EM algorithm as input to a regression program, the standard errors will be incorrect–usually too small. That’s because the regression program needs a sample size to compute the standard errors. No single number for the sample size will give the right standard errors for all the regression coefficients. On the other hand, direct ML gives standard errors that correctly account for all the missing data.

Hi everyone,

I’m Nicolas Van der Linden, part-time lecturer in social psychology at the Université libre de Bruxelles, in Belgium, and also part-time Evaluation Manager in a non-profit organization, Modus Vivendi. Missing data is part of my life at the university and at Modus Vivendi, and up to now, not much has ever been done about it. I’m joining late, so I have some catching up to do. I’m looking forward to this online course.

Best,

Nicolas

Dear Paul,

On page 79 of your chapter, you wrote “The principal output from this algorithm is the set of maximum likelihood estimates of the means, variances and covariances. Although imputed values are generated as part of the estimation process, it is not recommended that these values be used in any other analysis. They are not designed for that purpose, and they will yield biased estimates of many parameters”.

In other words, if I use EM in order to replace missing values, I should not use the dataset with complete values in predictive models. If my understanding is correct, this would be a major limitation since I’m usually not interested in absolute values like means and intercepts.

Thanking you in advance for your clarification,

Nicolas

That is correct. You should not use the imputed values generated by standard EM packages for analysis using other methods. However, there is a way to use EM to generate useful imputed values by combining it with bootstrapping. That method has been implemented in the program Amelia II: https://gking.harvard.edu/amelia

Another question:

In my understanding, hot deck imputation is very similar to regression imputation in cases where the variables used to choose the donors are known empirically or theoretically to be related to the variable with missing values. I detect some differences, however. First, with hot deck imputation, imputed values are within the range of possible values, something which is rarely the case with regression imputation. Second, it seems to me that imputed values do not always lead to underestimation of parameter estimates like variances. It depends on the values taken from the donors. If imputed values come from donors with values distant from the mean, the variance could even be overestimated, right? In such cases, couldn’t the standard error estimates be biased upward, leading to an increase in Type II error? Is my understanding correct, and could you recommend one or two references on the problems of hot deck imputation if the problems are indeed partially different from the ones associated with other imputation methods?

Thanks,

Nicolas

A lot depends on how the hot deck method is implemented. I didn’t focus on this method in the course because traditional hot deck methods tend to sharply limit the number of variables that can be used as a basis for implementation. And it’s very important to be able to include all the variables in the model of interest. Toward the end of the course, I will discuss the method of predictive mean matching, which is very much like hot deck but removes the limitation on the number of variables. With this method, the imputed values are like real values in every way, because they ARE real values, just ones borrowed from other cases. If this method is correctly implemented, it should not lead to biases in variance estimates.
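For anyone who wants to see the core idea in code, here is a minimal single-donor sketch of predictive mean matching in Python. It is purely illustrative and uses made-up data; real implementations (e.g., in Stata or the R mice package) draw randomly among several nearest donors and embed this step in a full multiple-imputation procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: y depends on x; some y values are set to missing.
n = 100
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)
miss = rng.random(n) < 0.2
y_obs = y.copy()
y_obs[miss] = np.nan

# Fit a regression of y on x using the complete cases only.
obs = ~miss
X_obs = np.column_stack([np.ones(obs.sum()), x[obs]])
beta = np.linalg.lstsq(X_obs, y_obs[obs], rcond=None)[0]

# Predicted values for all cases, observed and missing.
pred = beta[0] + beta[1] * x

# For each missing case, borrow the observed y from the donor whose
# predicted value is closest (single nearest donor for simplicity).
imputed = y_obs.copy()
donor_pred = pred[obs]
donor_y = y_obs[obs]
for i in np.flatnonzero(miss):
    j = np.argmin(np.abs(donor_pred - pred[i]))
    imputed[i] = donor_y[j]
```

Note that every imputed value is an actual observed value borrowed from a donor case, which is why the imputations always stay within the range of the data.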

Dear Dr. Allison,

I am using Stata.

On slide 73, you wrote that the choice for mi set (e.g., mi set wide or mi set) is unimportant. Is that only relevant for the MCMC?

In exercise 3, both variables momage and childage are auxiliary variables. I tried to check this by using pwcorr self pov black hispanic momwork divorce gender momage childage, obs, yet momage and childage are not well correlated with other variables (r<0.4). What is a good way to identify auxiliary variables?

Many thanks for your advice.

The choice for mi set is unimportant for any of the methods that mi impute can implement.

You’re right that childage and momage are not ideal auxiliary variables because they are not highly correlated with the variables that have missing data. That’s the main criterion to look for: correlations with the variables that have missing data. It doesn’t hurt (at least not much) to include variables with low correlations. But it typically won’t help either. I asked you to use those variables just so you could get some practice with the mechanics.
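The screening criterion described here, correlations with the variables that have missing data, can be checked mechanically. Below is a small Python sketch with simulated data and illustrative variable names (not the actual course dataset); pairwise-complete correlations are what Stata's pwcorr reports.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Made-up data: 'selfesteem' has missing values; 'income' and 'momage'
# are candidate auxiliary variables (names are illustrative only).
n = 500
income = rng.normal(size=n)
momage = rng.normal(size=n)                  # unrelated to selfesteem
selfesteem = 0.6 * income + rng.normal(size=n)
selfesteem[rng.random(n) < 0.3] = np.nan     # about 30% missing

df = pd.DataFrame({"selfesteem": selfesteem,
                   "income": income,
                   "momage": momage})

# Pairwise-complete correlations of each candidate with the variable
# that has missing data; a useful auxiliary correlates strongly with it.
corrs = df.corr()["selfesteem"].drop("selfesteem")
print(corrs)
```

In this toy example, income would be worth including as an auxiliary variable, while momage would add little, mirroring the point about childage and momage above.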

Dear Dr. Paul,

I want to impute my dataset so that it can later be fed to different analysis pipelines. Due to certain limitations, it is not feasible to run those analyses on multiple imputed datasets. Is it reasonable to employ EM with bootstrap (as you suggested in an earlier comment) to generate a single imputed dataset and use that for downstream analyses? Also, do you suggest any other method that outputs reasonable imputed values in a single dataset?

Thanks,

Bilal

If you use the multiple imputation methods covered in the remainder of the course, you could generate a single data set that would give you approximately unbiased estimates of any parameters you wanted to estimate. However, standard errors and p-values would be too low, possibly much too low.

20 years ago, Schafer and Schenker published a paper that detailed a way to use single imputation methods effectively:

https://amstat.tandfonline.com/doi/abs/10.1080/01621459.2000.10473910#.WwRFIkgvxPZ

However, to my knowledge, there is no available software that implements this approach, and no one has extended it to cover more general situations than the limited ones that they deal with.

In Module 7, slide 5 (and elsewhere), you state that the multivariate normal assumption works well for dichotomous variables so long as one group is not very small (less than 5%). Does this hold true for a set of dummy-coded variables which represent more than 2 categories? And, if so, does it hold true for effect-coded variables?

Thanks!

Yes and yes.

Hi All,

I am a little late to the game and will be spending this week catching up on the assignments. I am a Research Psychologist at the VA in New Orleans. In my line of work (treatment outcome studies), I often deal with a lot of missing data and have yet to receive formal training on how to handle it. As such, I am excited about this course and what it has to offer.

Amanda

I wanted to share a solution to a small but potentially frustrating issue that R users might run into with this week’s exercises.

Some popular packages such as `dplyr` convert `data.frame`s into `tibble`s. However, the `jomo` package apparently does not work with objects in the `tibble` format. If you try to pass a `tibble` to `jomo()`, you will likely get the error “Object ‘imp’ not found.”

You can solve this with the `as.data.frame()` function, which can convert a `tibble` to a `data.frame`.

Thanks Simon. This is also documented on page A12 of the R handbook. There the offending command is `read_sas`, which reads a SAS data set and outputs a `tibble`. The `as.data.frame()` function is necessary before passing the data to `jomo()`.

Dear Prof Allison,

You explained in chapter 7 that we should include the dependent variable in the multiple imputation so as not to cause bias.

I performed multiple imputation on my dataset leaving out the dependent variable, as its correlations with the independent variables were less than 0.40.

What do you advise in this case? Should I rerun the multiple imputation, including the dependent variable?

Thanks!

Yes, I advise that you rerun the imputation, including the dependent variable. Leaving it out can severely bias the coefficient estimates, even if the correlations with the independent variables are not large.

Dr. Allison,

I am analyzing longitudinal data with 4 waves, and I would really appreciate your advice.

More than 1300 respondents were recruited at time 1, but about half didn’t participate at times 2, 3, and 4. So the data are missing on all variables at times 2, 3, and 4 except some time-invariant variables measured at time 1. As a result, the number of participants who completed all 4 surveys and are available for my analysis is about 350, which is quite small considering the full sample size of over 1300.

I am using a FE model within a SEM framework (learned from your seminar on longitudinal data analysis with SEM), so I am using FIML in Mplus.

Would you advise using FIML for my analysis? The model runs without any problem, but I am concerned because more than half have no data, particularly on the outcome variables and the main predictors, at times 2, 3, and 4.

I would appreciate any of your advice on this!

Well, there’s no harm in doing FIML in this situation. However, the people who were only measured at time 1 will contribute nothing to a fixed-effects analysis. My guess is that your results won’t be much different than doing listwise deletion.

Thank you for your quick response. I wonder why the people who were only measured at time 1 will not contribute at all to the FE analysis. Is it because they have no variation over time? If so, I wonder why: isn’t making use of the partial data from those people the main goal of FIML?

Yes, because they have no variation over time, most importantly on the dependent variable. There are serious limits to how much you can accomplish with FIML (or multiple imputation) when data are missing on the dependent variable.

By coincidence, I have a quite similar dataset on doctoral persistence, with about 1600 participants at wave 1, 865 at wave 2, 621 at wave 3, and 402 at wave 4. If I understand the discussion thread correctly: if I want to analyse the data of participants who completed our questionnaire at all 4 waves, I should just omit (i.e., listwise deletion) the data of participants who stopped short of completing the final questionnaire, and whatever method for handling missing data I choose should apply only to participants who completed the final questionnaire partially? And if I notice that more motivated and persistent doctoral students continue participating in our study more often, we’re clearly in an NMAR situation if doctoral persistence is the DV, and there’s pretty much nothing I can do about it, right, except maybe highlighting this limitation in the discussion of our paper? Is my understanding correct?

First, I don’t agree that “any method for handling missing data I’ll choose should only apply to participants who completed the final questionnaire only partially.” You should use participants’ data up to the point where they drop out.

Second, you say “If I notice that more motivated and persistent doctoral students continue participating in our study more often”. This suggests that you have other variables measuring motivation and persistence. If that’s the case, you can use those variables in the MI or ML process to reduce or eliminate the NMAR problem.

I have a question about #6 in exercise 1.

My Stata code works fine without causing any warning, but SAS code gives me a warning message without any results. Dr. Allison, could you let me know if you see any error in my code? Thanks!

Here is my SAS code:

proc surveyselect data=miss.nlsymiss method=urs N=581 reps=1000 out=bootsamp;
run;

proc mi data=bootsamp nimpute=0 noprint;
  var anti self pov black hispanic divorce gender momwork;
  em outem=nlsyem2;
  by replicate;
run;

proc reg data=nlsyem2 outest=a noprint;
  model anti=self pov black hispanic divorce gender momwork;
  by replicate;
run;

proc means data=a std;
  var self pov black hispanic divorce gender momwork;
run;

This gives me the following warning message:

WARNING: The data set WORK.NLSYEM2 does not indicate how many observations were used to compute the COV matrix. The number of observations has been set to 10000. Statistics that depend on the number of observations (such as p-values) are not interpretable.

I ran your exact code and I got results. The warning message occurs because when you pass the EM estimates of the covariance matrix to PROC REG, it doesn’t know how many observations it was based on. So it computes standard errors based on a sample size of 10,000 (I have no idea why they chose that number). But that’s irrelevant because you’re not using the standard errors produced by PROC REG. Instead, you are estimating the standard errors by calculating the standard deviations of the coefficients across the 1000 samples.

So, as far as I understand your answer, the code is correct, but SAS is not running properly for some reason? How do I fix this? Just in case, I asked about this on a SAS forum and am waiting for responses from SAS experts.

From what you told me, it seems like it’s running PROC REG, but not producing the final estimates from PROC MEANS. If you send me your log file, I might be able to figure out what’s going on.

It seems that I get the warning message whenever I run PROC REG, bootstrapping or not. In fact, I found that you mention this exact warning message in this document: https://statisticalhorizons.com/wp-content/uploads/MissingDataByML.pdf

My understanding is that bootstrapping is one solution to this problem as you explained in the document. Let me know if I am wrong!

You are correct.
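The logic of the bootstrap standard errors discussed in this thread (resample cases with replacement, refit the regression, and take the standard deviation of each coefficient across replicates) can be sketched in Python. This is illustrative only: it uses simulated complete data and omits the EM step that the SAS code runs on each bootstrap sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated complete data: outcome y regressed on two covariates.
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

def ols_coefs(X, y):
    """Ordinary least squares coefficient estimates."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Bootstrap: resample cases with replacement, refit the regression,
# and take the SD of each coefficient across the replicates.
reps = 1000
coefs = np.empty((reps, X.shape[1]))
for r in range(reps):
    idx = rng.integers(0, n, size=n)
    coefs[r] = ols_coefs(X[idx], y[idx])

boot_se = coefs.std(axis=0, ddof=1)
print(boot_se)
```

These bootstrap standard deviations, not anything PROC REG reports, play the role of the standard error estimates, which is why the warning about the unknown number of observations can safely be ignored.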

Just a comment on PROC MI in SAS 9.4.

It seems that the default number of imputations in PROC MI has changed from 5 to 25 in SAS 9.4. You can change the number with the NIMPUTE option.

http://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_mi_syntax01.htm

This is the warning message I get after running PROC MI in exercise 3.

WARNING: The default number of imputations has been changed from 5 to 25

Dear Dr Allison,

As part of my PhD I will be analysing data from a randomised trial. This will include testing for interactions between the treatment group (i.e., active and placebo) and selected covariates, some of which have missing data (e.g., BMI).

I understand that your advice (in Module 10) is to impute the missing covariate(s) separately within each treatment group, and then estimate a model which includes the product term(s). But I was wondering how to conduct any subsequent stratified analyses. For example, suppose that there was a significant interaction between treatment group and BMI (categorised into two groups: < 25 and ≥ 25 kg/m^2), and that I wanted to estimate the effect of treatment within each stratum of BMI. Is it appropriate to use a “where statement” to restrict to the stratum of interest when performing the estimation step? E.g.:

PROC REG DATA=miout OUTEST=a COVOUT;
  WHERE BMI = 1;
  MODEL y = treatmentGroup;
  BY _IMPUTATION_;
RUN;

PROC MIANALYZE DATA=a;
  MODELEFFECTS INTERCEPT treatmentGroup;
RUN;

(And then repeat for BMI = 2.)

Or should another approach be used, given that the number of people in each level of BMI would fluctuate between imputed datasets?

Yours sincerely,

Hai Pham

I think this is a reasonable approach. One potential complication is that if you’re using normal MCMC to impute, the imputed BMIs will not be exactly 1 or 2. So your WHERE statements would have to specify ranges of values.

Dear Prof Allison,

I ran multiple imputation with mice in R. The model I specified for my variables was the 2-level normal model. As my variables are from a survey with a range of 1-5, I also tried the 2-level predictive mean matching model for comparison.

However, I encountered some warnings when I did predictive mean matching. For both models, I imputed 15 datasets with 15 iterations, and the dataset contains about 5% missing data.

These are the warning messages:

11: In checkConv(attr(opt, “derivs”), opt$par, ctrl = control$checkConv, … :Model failed to converge with max|grad| = 0.862765 (tol = 0.002, component 1

32: In checkConv(attr(opt, “derivs”), opt$par, ctrl = control$checkConv, … :Model failed to converge: degenerate Hessian with 1 negative eigenvalues

33: In checkConv(attr(opt, “derivs”), opt$par, ctrl = control$checkConv, … :unable to evaluate scaled gradient.

I'm not sure what to make of them; it would be great if you could help me out with this.

Thanks.

I wish I could help, but my experience with the ‘mice’ package is limited. In any case, these kinds of errors are very difficult to diagnose in any package. I’m not sure what options are available in ‘mice’ for predictive mean matching, but you might try tweaking these.

Dear Prof. Allison,

I am working with the German Socio-Economic Panel, a large longitudinal dataset including Body-Mass-Index (BMI). Data on BMI are not missing at random, and I found a new Stata command, mibmi, that imputes BMI values using adjacent BMI values (instead of covariates). Here is the link for everyone who is interested: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5234260/

How does this approach compare with the methods that we learned in module 12? And would you recommend this approach? Many thanks for your advice.

An interesting article that I had not seen before. As the authors repeatedly note, this is a method designed for imputing variables that have a lot of between-person variability but not much within-person variability. It has potential, but I have a few concerns:

– The imputation process only uses other values of the same variable. This could bias relationships with other variables.

– It is not obvious that the “right” amount of variability is introduced into the imputations.

– The performance tests are measures of how close the imputed values are to the real values. As I have stressed, trying to optimize this metric can make things worse rather than better.

Thanks a lot for your advice. I was also concerned about using only other values of BMI. Which of the methods that we learned in module 12 would you recommend for imputing BMI data? Or would you recommend applying other methods, like survival analysis, which consider censored data?

Many thanks for your advice.

I would need to know more about the analysis model that you want to estimate, and a little more about the study design.

Thank you very much for taking the time; it is much appreciated! It's my pleasure to provide some more information about the analysis model and the study design.

We are planning two separate analyses. Firstly, we want to estimate trajectories of body-mass-index (BMI) depending on sex, age, and socio-economic status over the lifetime, using data from the German Socio-Economic Panel, a longitudinal panel from 1984-2016 in which BMI has been measured every second year (2002, 2004, …, 2016). BMI has been categorised into normal weight, overweight, and obese (according to WHO cut-off points). I assume that the probability of moving between BMI states (e.g., from normal weight to overweight) depends not only on individual characteristics such as age and sex but also on how long an individual has stayed in a specific BMI state (e.g., slow or rapid weight change in a given time period). To evaluate this, I use the time to BMI change as the primary outcome, using flexible parametric models.

Using a survival approach with BMI as the dependent variable, I am not planning to impute BMI data. An alternative analysis model would be a random-effects model using the mixed command (as discussed on slide 119), but without imputing BMI data. What do you think?

In the second analysis, I am planning to estimate the causal relationship between BMI and health-care utilization, again using data from the German Socio-Economic Panel. Here, I am not quite sure whether it would make sense to impute BMI data (here, a control variable) and would be glad to hear your advice.

Sincere thanks,

Diana

This is very helpful. One more question. Which of the following software packages do you prefer to use or are capable of using: Stata, Mplus, R?

For the survival analysis, if data are missing only because of drop out (with no return), I think it makes sense to just censor observations at the time of the drop out. If there is intermittent missing data, there could be some benefit from imputation, although probably not a lot.

If you estimate a mixed model (using either -mixed- for a continuous outcome or -melogit- for dichotomous outcome), there’s no need for imputation, unless you have cases with bmi observed but missing covariates.

For your causal relationship analysis, I’m not sure exactly what you have in mind, but I would probably go with sem (with FIML).

Thank you so much for taking the time, and for all your support! Could I please ask a final question, just to make sure that I understood you correctly?

For the mixed model, you recommend imputing for cases with BMI observed but covariates missing. When I am using BMI as the dependent variable and have no good auxiliary variables, is your recommendation to impute still valid? I am just asking to make sure that I have understood slide 81 (Should Missing Data on the Dependent Variable Be Imputed?) correctly.

Thanks a lot for all your help! Much appreciated!

1. It’s worth imputing missing values on the covariates for cases in which the dependent variable is observed.

2. If you are estimating a mixed model via maximum likelihood, it’s not worth imputing cases with missing values on the dependent variable.

3. If you are estimating some other kind of model (e.g., GEE) it may be worth imputing cases with missing values on the dependent variable using values of the dependent variable at other time points as auxiliary variables.
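Point 3 can be sketched in miniature. In the Python sketch below, scikit-learn's IterativeImputer stands in for a full multiple-imputation procedure, and the data and variable names are hypothetical; the point is simply that, with the data in wide form, the outcome at other waves serves as an auxiliary predictor when imputing a wave with missing values:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(42)

# Hypothetical wide-format panel: one row per person, the outcome
# (here called cesd) measured at 3 waves and correlated across waves.
n = 200
base = rng.normal(size=n)
wide = pd.DataFrame({
    "cesd1": base + rng.normal(scale=0.5, size=n),
    "cesd2": base + rng.normal(scale=0.5, size=n),
    "cesd3": base + rng.normal(scale=0.5, size=n),
})

# Make some wave-2 values missing.
wide.loc[rng.choice(n, 40, replace=False), "cesd2"] = np.nan

# Because the waves are correlated, cesd1 and cesd3 act as
# auxiliary variables when imputing cesd2.
imputed = IterativeImputer(random_state=0).fit_transform(wide)
print(np.isnan(imputed).sum())  # 0: no missing values remain
```

A real analysis would generate multiple imputed datasets and combine the results, but the role of the other waves as predictors is the same.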

Could I ask you please for a further question?

On slide 119, you recommend dummy variables when imputing panel data for a random-effects model. I am wondering whether this approach is also suitable for fixed-effects models, and I tried the following code (the first seven lines are from our script):

use “C:\data\hip3.dta”, clear

gen copycesd=cesd

tab sid, gen(id)

mi set flong

mi register impute adl walk pain srh cesd

mi impute mvn walk adl pain srh cesd = id* i.wave, add(60)

drop if copycesd==.

mi xtset sid wave

mi estimate: xtreg cesd walk srh pain adl, fe

Is that an appropriate way to use imputed data in a fixed-effects model? Or, are you recommending another approach?

Many thanks for taking the time, and for your advice.

For this particular example, I would prefer to use the method of imputing in the wide form because I know it has good properties. The use of cluster fixed effects imputation has not been extensively studied, and there is some evidence that it may tend to overfit the data. However, if you use this method, there is no reason to think that it would be any worse if the goal was to estimate a fixed effects model rather than a random effects model. In fact, one could argue that it is a more “congenial” imputation model for fixed than for random effects.
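The wide-form approach can be sketched end to end. The Python sketch below (pandas and scikit-learn stand in for the corresponding Stata commands; the data and names are hypothetical) reshapes long to wide, imputes so that each wave's values inform the others, and reshapes back for the analysis model:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)

# Hypothetical long-format panel with missing outcome values.
long = pd.DataFrame({
    "sid": np.repeat(np.arange(100), 3),
    "wave": np.tile([1, 2, 3], 100),
    "cesd": rng.normal(size=300),
})
long.loc[rng.choice(300, 50, replace=False), "cesd"] = np.nan

# 1) Reshape to wide: one row per person, one column per wave.
wide = long.pivot(index="sid", columns="wave", values="cesd")

# 2) Impute in wide form, so each wave's values help impute the others.
wide[:] = IterativeImputer(random_state=0).fit_transform(wide)

# 3) Reshape back to long for the analysis model.
long_imputed = wide.stack().rename("cesd").reset_index()
print(long_imputed["cesd"].isna().sum())  # 0
```

In Stata the same steps would be reshape wide, mi impute, reshape long; a single imputation is shown here only to keep the sketch short.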

Thank you so much for your detailed explanation. You mentioned that there is some evidence that fixed-effects imputation may overfit the data.

Could I ask you please to kindly send one reference/link/paper that I could read a bit more about it? Thanks a lot!

Here’s one:

Speidel, Matthias, Jörg Drechsler, and Joseph W. Sakshaug. “Biases in multilevel analyses caused by cluster-specific fixed-effects imputation.” Behavior research methods (2017): 1-17.

Dr. Allison,

I am using Mplus to examine the association between BMI and food insecurity. All measured at four time points.

I am using a FE model within a SEM framework as in the Ousey example in your longitudinal data analysis seminar.

There are a lot of missing cases due to missingness at t2, t3, and t4. So I am trying to use FIML with auxiliary variables, which are self-measured BMI / weight / height / body image at four time points, plus three dichotomous variables: gender, race, and income.

I would really appreciate it if you can give me some comments on two questions below.

1) What does (M) do in my Mplus code below? With (M), I get a warning and an error message; without it, the code runs. FYI, I was advised to use WLSMV since the food insecurity variable is dichotomous.

*** WARNING in ANALYSIS command

Estimator WLSMV is not allowed when there are variables in the AUXILIARY option with the ‘m’ specifier. Default will be used.

*** ERROR in VARIABLE command

Analysis with categorical variables is not available with the ‘m’ specifier in the AUXILIARY option.

2) What should I do to use dichotomous variables as auxiliary variables in the Mplus code?

Here is my Mplus code

VARIABLE:

…

Auxiliary= (M) bmiself_t1 bmiself_t2 bmiself_t3 bmiself_t4 wtself_t1 wtself_t2 wtself_t3 wtself_t4 htself_t1 htself_t2 htself_t3 htself_t4 bc_t1 bc_t2 bc_t3 bc_t4 male white pell_status

;

Categorical are fi_t2 fi_t3 fi_t4;

Cluster=cluster3;

ANALYSIS:

estimator=WLSMV;

parameterization=theta;

type=complex;

1. The (M) option says to use the variables that follow as auxiliary variables for missing data analysis. It is essential to use this option. So you might want to use MLMV instead of WLSMV. MLMV can, in principle, handle categorical variables.

2. To include dichotomous variables as auxiliary variables, don’t specify them as categorical.

I used

– MLMV, instead of WLSMV

– a few continuous and dichotomous auxiliary variables with (m) option.

And some other specifications.

– “cluster=” is used, so “type=complex”

– “categorical are” is used because one of the two main factors is a dichotomous variable; thus, parameterization=theta

Note: I am interested in the bi-directional relationships between two factors (one continuous and the other dichotomous) over four time points.

I ran the code and got a warning message and an error message.

*** WARNING in ANALYSIS command

Estimator MLMV is only available when LISTWISE=ON is specified in the DATA command. Default estimator will be used.

*** ERROR in VARIABLE command

Analysis with categorical variables is not available with the ‘m’ specifier in the AUXILIARY option.

q1) It seems that I cannot use MLMV when the LISTWISE option is off? I don’t understand this. Would the default estimator be fine?

q2) So it seems I have to delete the “categorical are” statement to make it run without the error message. But one of my main predictors is dichotomous, so I think I need that statement. Is there any other way to avoid this error?

Thank you!

1. You might try MLR or just ML.

2. A variable only needs to be declared categorical if it appears as a dependent variable in a model equation.

Mplus advised using estimator=MLR instead of ML when TYPE=complex, so I used MLR and it worked.

I can also include dichotomous auxiliary variables without any warning message.

Thanks!

Glad to hear it!

I apologize if the question is really basic, but on page 35 of the handbook for Stata users, how do I get standardized coefficients?

I presume you’re referring to slide 35. For the regress command, the option is beta, as in

regress gradrat lenroll rmbrd stufac csat private, beta

For the sem command, the option is stand. But note that if you use this option, you’ll only get the standardized coefficients, not the unstandardized. So I usually run sem without the stand option, and then issue the command

sem, stand
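For readers who want to see what the beta option actually computes: a fully standardized coefficient is just the unstandardized slope rescaled by sd(x)/sd(y), which equals the slope from regressing the z-scored variables. A quick numerical check (Python, hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)

# Unstandardized slope from a simple regression of y on x.
b = np.polyfit(x, y, 1)[0]

# A fully standardized coefficient rescales b by sd(x)/sd(y),
# which is what Stata's -beta- option reports.
beta_std = b * x.std(ddof=1) / y.std(ddof=1)

# Equivalently, regress the z-scored variables directly.
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
beta_check = np.polyfit(zx, zy, 1)[0]
print(np.isclose(beta_std, beta_check))  # True
```

With multiple predictors the same rescaling is applied to each coefficient.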

I tried unsuccessfully to replicate the analyses with the college database (pp. 37 and 38 of the handbook for Stata users). I copy-pasted the following code, both as a block and line by line, and nothing happened. I also entered the code manually line by line and still nothing happened. What am I doing wrong? I’d like to know so that I can do Exercise 1 with Stata.

program define bootem

use “/Users/nicolasvanderlinden/Documents/Personnel/Formations/Missing data/college.dta”, clear

bsample

mi set wide

mi register impute gradrat lenroll rmbrd stufac csat act private

mi impute mvn gradrat lenroll rmbrd stufac csat act private, emonly

matrix Sigma=r(Sigma_em)

matrix M=r(Beta_em)

_getcovcorr Sigma, corr

matrix C = r(C)

matlist C

corr2data gradrat lenroll rmbrd stufac csat act private, ///

cov(Sigma) mean(M) clear

reg gradrat lenroll rmbrd stufac csat private

end

simulate _b, reps(1000): bootem

summarize _b_lenroll _b_rmbrd _b_stufac _b_csat _b_private _b_cons

When I copied your code into Stata, many of the lines terminated with a line-end character rather than a “carriage return”. When I corrected those, the program worked fine.

And I found another mistake in the code I was using: I was keeping the /// signs in the code when entering it line by line. This is a bit embarrassing. I didn’t understand that the signs mean the command continues on the next line. I didn’t get the same results as on slide 38, but I guess this is normal because the replications are never exactly the same.

Yes, the code is intended to be run from a do file window in which the /// is necessary when a command extends to another line. And you are correct that bootstrap results will not replicate exactly unless you set the same seed number (with the set seed command). However, setting the seed to be the same is an artificial constraint. There’s no reason to prefer the results from one seed over another.

Dear Paul,

Lall (2016) has some recommendations on when to use auxiliary variables. As I understand it, he specifically advises against including auxiliary variables that have too much missing data (> 24%). What is your take on this? (I don’t recall you mentioning it in the videos I’ve seen so far.)

Thanking you in advance for your answer,

Nicolas

Missing data on auxiliary variables is certainly not desirable, but it’s usually not harmful. Even with a large fraction of missing data, an auxiliary variable can be useful if (a) it has a high correlation with the variable being imputed, and (b) it is generally observed when the imputed variable is not. That’s the case in the college example, where ACT is used as an auxiliary for CSAT.

Another question. On page 426, Lall (2016) writes “I add only three features to the imputation model. First, for TSCS datasets I include a sequence of third-order time polynomials—a new capability in Amelia II—to better model smooth temporal variation within cross-section units. Second, I include lags of the dependent and key explanatory variables—or leads if they are already lagged—since data for one period tend to be highly correlated with data for the previous (or subsequent) period. Third, I add a ridge prior of 1% of the number of observations in the dataset, which addresses computational problems caused by high levels of missing data and multicollinearity as well as increasing the numerical stability of the imputation process”.

I don’t understand the first and third features and, although, I’m familiar with time series analysis, I don’t know what leads are.

Best,

Nicolas

1. Third order polynomial: Letting t be time, the linear imputation model includes t, t^2, and t^3.

2. If you are trying to impute y(t), a lag would be y(t-1) and a lead would be y(t+1).

3. Most multiple imputation methods are explicitly or implicitly built on Bayesian principles, which require a prior distribution. Usually, the prior is non-informative, either a flat prior or the Jeffreys prior. But informative priors can be used, and a ridge prior is a type of informative prior. I’ve never used an informative prior of any kind, but it might be helpful in some cases.
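The first two features are easy to illustrate. A minimal sketch (Python/pandas, hypothetical single-unit data) of adding polynomial time terms and lag/lead versions of the variable to be imputed:

```python
import pandas as pd

# Hypothetical time series for one cross-section unit.
df = pd.DataFrame({"t": [1, 2, 3, 4, 5],
                   "y": [2.0, 2.5, 3.1, 3.0, 3.8]})

# 1) Third-order time polynomial: include t, t^2, t^3
#    as predictors in the imputation model.
df["t2"] = df["t"] ** 2
df["t3"] = df["t"] ** 3

# 2) Lag and lead of the variable being imputed.
df["y_lag"] = df["y"].shift(1)    # y(t-1)
df["y_lead"] = df["y"].shift(-1)  # y(t+1)
print(df[["y", "y_lag", "y_lead"]])
```

With panel data you would shift within units, e.g. df.groupby("id")["y"].shift(1), so a lag never crosses from one unit into the next.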

Dear Paul,

In his paper, Little proposes a way of more rigorously testing the MCAR assumption using logistic models. If I understand correctly, he fitted separate logistic models for the two different IVs. I guess it makes little difference whether one conducts such tests separately for each IV or in a single model with all IVs together (partialling out the effects of the other IVs).

Related to this, I wanted to ask whether you have any recommendations or examples you could send us on how to report ML and MI methods in an empirical paper.

Thanks in advance,

Nicolas

I’m not a big fan of Little’s MCAR test. That’s because ML and MI rely on MAR, not MCAR. So rejecting MCAR doesn’t impugn those methods. And failing to reject in no way justifies those methods.

As for reporting, not that much is needed. Here’s what I say in a recent slide:

If you use ML, just say, e.g., “To handle missing data, we used full information maximum likelihood (Allison 2001; Enders 2010), as implemented in Mplus version 7.2.”

For multiple imputation, you should report

-Method used, e.g., “To handle missing data, we did multiple imputation using the MCMC method under the assumption of multivariate normality”

-Software used, e.g., “PROC MI in SAS 9.4”.

-Number of data sets.

-Other stuff you COULD report: % of data missing for each variable, fraction of missing information for each parameter, missing at random assumption, what variables were in the imputation model, etc.
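The percent-missing-per-variable figure in the last bullet is easy to produce in any package. A one-line sketch (Python/pandas; the data frame and variable names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical analysis file with some missing values.
df = pd.DataFrame({
    "anti": [1.0, 2.0, 3.0, 4.0],
    "self": [10.0, np.nan, 12.0, np.nan],
    "pov":  [0.0, 1.0, np.nan, 0.0],
})

# Percent missing for each variable: mean of the missingness
# indicator, times 100.
pct_missing = df.isna().mean() * 100
print(pct_missing)  # anti 0.0, self 50.0, pov 25.0
```

In Stata, misstable summarize gives the same information; in SAS, PROC MEANS NMISS does.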

When the analysis model is a multilevel model with random slopes, I assume the imputation model should include those random slopes in order to be compatible with the analysis model?

If so, do you have recommendations on how this can be practically achieved when you have missing data on a variable you would like to have a random slope? (for example, is this achieved by passing the random-slope variables to the Y2 argument in jomo()?)

Thank you.

Yes, it would certainly be desirable to include random slopes in the imputation model. However, I don’t have any experience in doing this. It’s not clear whether jomo can do it. At least I couldn’t find anything in the documentation that suggests it can. However, there is another R package ‘pan’ that will do it. Here’s an article that explains how: http://journals.sagepub.com/doi/pdf/10.1177/2158244016668220

Dr. Allison,

I am running a spline regression model to account for some of the nonlinearity between my dependent variable and my predictor of interest. In doing this, knots are placed at specified intervals of this variable to fit the piecewise regressions into a smoother shape. I also have covariates that have some missing data.

We’ve learned that the imputation model should correspond closely to the analysis model, and in module 10, when there are interactions or nonlinearities they should be present in both.

What I’ve done so far is:

1. listwise deletion and spline regression

2. multiple imputation not accounting for the nonlinear portion (i.e., PROC MI with all dependent and independent variables, followed by spline regression, and then PROC MIANALYZE).

Results of 1 and 2 were very similar.

I’m curious how you might approach multiple imputation in this setting. Would it be preferred to divide the data at the specified knots (i.e. run PROC MI with a BY statement)? Any other approach you might take?

Thank you,

Jonathan

I haven’t considered this problem before. But I like your proposed solution. Have you tried it?
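To make the BY-segment idea concrete, here is a schematic Python sketch. The knot locations and variable names are hypothetical, and a single mean imputation stands in for the full PROC MI step, just to show the segmentation mechanics:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical data: impute a covariate z separately within segments
# of the predictor of interest x, defined by the spline knots.
df = pd.DataFrame({"x": rng.uniform(0, 10, 300),
                   "z": rng.normal(size=300)})
df.loc[rng.choice(300, 60, replace=False), "z"] = np.nan

# Segments bounded by the (hypothetical) knot locations.
knots = [0, 3.3, 6.7, 10]
df["segment"] = pd.cut(df["x"], bins=knots, include_lowest=True)

# Impute within each segment (mean imputation as a stand-in for
# running PROC MI with a BY statement on each segment).
df["z"] = df.groupby("segment", observed=True)["z"].transform(
    lambda s: s.fillna(s.mean()))
print(df["z"].isna().sum())  # 0
```

One caveat of splitting at the knots: each segment's imputation model sees only that segment's cases, so small segments may yield unstable imputations.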

Dr. Allison,

I am using Stata to do Exercise 4, and I have a question. It seems that there is no option in Stata’s mi commands comparable to the ROUND option in SAS. So I used mi passive to round the imputations for SELF to whole numbers. Please let me know if this is not the right way to do it. Thank you.

* step 4;

…

mi impute mvn self pov black hispanic momwork=anti divorce gender, add(15) prior(jeffreys) burnin(100) burnbetween(25) rseed(06062018)

mi passive: gen byte newself=round(self, 1)

mi estimate, dots: regress anti newself pov black hispanic divorce gender momwork

…

This seems like a good way to round imputed values. However, I do not recommend this when imputing dichotomous variables.
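In other software the same rounding step is a one-liner. A sketch (Python; the imputed values are hypothetical):

```python
import numpy as np

# Hypothetical continuous imputations for an integer-scaled
# scale score (SELF).
imputed_self = np.array([11.3, 14.8, 9.6])

# Rounding to whole numbers mirrors the Stata command
#   mi passive: gen byte newself = round(self, 1)
rounded = np.round(imputed_self)
print(rounded)  # [11. 15. 10.]

# For a dichotomous variable, by contrast, rounding imputed values
# to 0/1 can bias the estimates; leave the continuous imputations
# as-is, or impute that variable with a logistic model instead.
```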

Dear Paul,

Please let me go back to SeungYong Han’s post above from May 24, 2.22pm.

If I have participants who completed my questionnaire only at wave 1, others who did so at waves 1 and 2, others who did so at waves 1-3, and still others who completed the questionnaire at all waves, optimal missing data methods like ML and MI will not fare better than listwise deletion. Did I correctly understand your answer to SeungYong Han?

But what if I have participants who completed the questionnaire at waves 1, 2, and 4, others at waves 1 and 4, still others at waves 2, 3, and 4, and so on. Would the answer be different?

Many thanks for your time and consideration,

Nicolas

The first situation is sometimes described as pure drop out. The second is described as intermittent missing data. In either case, if you are estimating a mixed model by maximum likelihood, then listwise deletion by person-time will be optimal. In other words, you use all the data that you have for each person and just estimate the model in the usual way. On the other hand, if you are using any other method (like GEE) then multiple imputation can be useful.

Hi Prof Allison,

I have tried multiple imputation using both MCMC with jomo package in r and FCS with MICE package also in r.

It seems to me that with MCMC, one does not specify the auxiliary variables for each of the imputed variables. Instead, all the variables are used to estimate the missing data.

On the other hand, with FCS, we can specify a separate regression model for each variable and also different auxiliary variables for each variable.

Therefore, would the recommendation that auxiliary variables have a correlation of at least 0.40 with the imputed variable apply more to imputation using FCS?

If my understanding of how auxiliary variables work in FCS and MCMC is correct, are there any instances when it is preferable to specify a separate set of auxiliary variables for each of the variables?

Thanks!

For MCMC, you usually just specify a set of variables, and then everything is used to impute everything else. With FCS, on the other hand, you do have the option to specify a different set of predictors for each variable with missing data. Nevertheless, when I do FCS, I usually follow the same strategy as with MCMC: let every variable serve as a predictor for every variable with missing data. That’s because you usually want your imputed values to reflect all the associations among all the variables in the model, as well as with any auxiliary variables. Once you start imposing restrictions, it’s easy to get into trouble without realizing it.

Dr. Allison,

Would the approach you described in Module 11 work for hierarchical models with binary outcomes? I am looking at hospital + patient level data where outcome is patients’ in-hospital events (Yes/No). Could proc mixed be replaced with proc glimmix in the multiple imputation process described on p.115 (SAS_AllSlides) ?

Thank you!

Yes, absolutely.

Dear Dr. Paul,

If the missing data percentage is more than 50%, will imputation methods provide reliable results? In general, is there a rule of thumb for the maximum fraction of missing values in a dataset that still allows reliable estimates?

Thanks,

Bilal

If the assumptions are met, even with more than 50% missing you will get reliable results from multiple imputation. However, the greater the fraction of missing information, the more vulnerable you are to violations of assumptions. The key assumptions are missing at random and any distributional assumptions (e.g., multivariate normality). There is no rule of thumb. Each situation must be evaluated on its own merits.

Dr. Allison

Thank you so much for offering a wonderful and very informative, not forgetting practical, course on missing data.

Duke