Statistical Associates Publishers
Missing Values & Data Imputation: 10 Worst Pitfalls and Mistakes
- Using listwise deletion to drop cases with missing values.
Listwise deletion (LD) may be tolerated in a large sample when the number of missing values is small (< 5%), but LD is now derogated. Multiple imputation (MI) and expectation maximization (EM) estimates are almost always at least as good as LD estimates.
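As a rough illustration of why the share of missing values and the share of cases lost under LD should both be checked, the sketch below uses a small made-up data set (the variable names and values are hypothetical). A modest share of missing cells can still remove a large share of cases, because each incomplete case is dropped entirely.

```python
# Toy data set: None marks a missing value (all values are hypothetical).
rows = [
    {"age": 34, "income": 52000, "score": 7.1},
    {"age": 29, "income": None, "score": 6.4},
    {"age": 41, "income": 61000, "score": None},
    {"age": 38, "income": 58000, "score": 8.0},
    {"age": None, "income": 47000, "score": 5.9},
    {"age": 45, "income": 66000, "score": 7.7},
]


def missing_cell_share(rows):
    """Fraction of all cells (cases x variables) that are missing."""
    cells = [value for row in rows for value in row.values()]
    return sum(value is None for value in cells) / len(cells)


def listwise_delete(rows):
    """Keep only cases with no missing value on any variable."""
    return [row for row in rows if all(v is not None for v in row.values())]


complete = listwise_delete(rows)
cell_share = missing_cell_share(rows)
case_loss = 1 - len(complete) / len(rows)
print(f"missing cells: {cell_share:.1%}, cases lost to LD: {case_loss:.1%}")
```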
- Using mean substitution to replace missing values.
Though still supported by most statistical packages, this once-popular method is now derogated and obsolete.
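One commonly cited reason for abandoning mean substitution is that it shrinks the variance of the filled-in variable, since every imputed case sits exactly at the mean. The sketch below shows this with hypothetical numbers; it is an illustration of that one effect, not a full account of the method's problems.

```python
import statistics

observed = [4.0, 6.0, 8.0, 10.0, 12.0]  # hypothetical observed values
n_missing = 5                            # hypothetical number of missing cases

# Mean substitution: every missing case receives the observed mean.
mean_obs = statistics.mean(observed)
filled = observed + [mean_obs] * n_missing

var_observed = statistics.variance(observed)  # sample variance, observed cases
var_filled = statistics.variance(filled)      # sample variance after substitution

print(f"variance before: {var_observed:.2f}, after: {var_filled:.2f}")
```

The mean is unchanged, but the variance of the completed variable is understated, which in turn distorts standard errors and correlations with other variables.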
- Using multiple linear regression to replace missing values.
Once a popular advance over mean substitution and listwise deletion, this method is now derogated due to bias and inefficiency compared to MI and EM methods. However, regression methods are preferred when missingness is monotone.
- Not checking for missing completely at random.
If data are missing completely at random (MCAR), imputation of missing values is not needed to avoid bias, though dropping incomplete cases still reduces the sample size.
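A simple informal check on MCAR is to compare a fully observed variable between cases that are missing and cases that are not missing on another variable. The sketch below computes Welch's t statistic for that comparison on hypothetical data; it is not Little's MCAR test, and the variable names are assumptions made for illustration.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


# Hypothetical (age, income) pairs; income is sometimes missing (None).
data = [(25, 30000), (31, None), (38, 42000), (44, None), (52, 58000),
        (29, 33000), (61, None), (35, 40000), (57, None), (27, 31000)]

age_income_missing = [age for age, income in data if income is None]
age_income_observed = [age for age, income in data if income is not None]

t = welch_t(age_income_missing, age_income_observed)
print(f"Welch t for age by income-missingness: {t:.2f}")
```

A t statistic far from zero suggests that missingness on income is related to age, which is evidence against MCAR. A value near zero on every observed variable is consistent with MCAR but does not prove it.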
- Not checking for missing at random.
MI and EM assume data are missing at random (MAR). Exploratory checks on whether data are MAR help validate the imputation process.
- Not assessing the effect size and effectiveness of variables in the imputation model.
It is not enough that data are MAR. Observed variables must have a strong enough relationship to missingness to predict it.
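One way to gauge how strongly a candidate variable relates to missingness is to correlate it with a 0/1 missingness indicator (a point-biserial correlation, which is Pearson's r with a binary variable). The sketch below does this for two hypothetical candidate variables; the data and names are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical incomplete variable and two fully observed candidates.
income = [30000, None, 42000, None, 58000, 33000, None, 40000, None, 31000]
candidates = {
    "age": [25, 31, 38, 44, 52, 29, 61, 35, 57, 27],
    "household_size": [2, 3, 2, 2, 2, 3, 2, 3, 3, 2],
}

# 1 where income is missing, 0 where it is observed.
missing_flag = [1 if value is None else 0 for value in income]

strength = {name: pearson(values, missing_flag)
            for name, values in candidates.items()}
for name, r in strength.items():
    print(f"{name}: r with missingness = {r:.2f}")
```

In this toy example age is the more useful predictor of missingness. In practice, the correlation of each candidate with the incomplete variable itself (among observed cases) is worth checking as well.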
- Not having the same variables in the imputation model as in the analysis model in MI.
In multiple imputation, every variable in the analytic model should also be in the imputation model. Auxiliary variables useful for imputation should be part of the set even if not in the original analysis model.
- Not checking for convergence of the multiple imputation process.
The iterative process must converge, as indicated by a plot of the worst linear function coefficient or some other test.
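As one example of "some other test", the sketch below computes the potential scale reduction factor (often written R-hat) across several chains of a single monitored parameter. This is a swapped-in diagnostic, not the worst-linear-function plot named above, and the chain values are hypothetical. Values close to 1 are consistent with convergence; values well above 1 suggest the chains have not mixed.

```python
import statistics


def r_hat(chains):
    """Potential scale reduction factor for equal-length chains of one parameter."""
    n = len(chains[0])
    # Average within-chain variance.
    within = statistics.mean([statistics.variance(chain) for chain in chains])
    # Between-chain variance, scaled by chain length.
    between = n * statistics.variance([statistics.mean(chain) for chain in chains])
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


# Hypothetical traces of an imputed-variable mean across iterations.
mixed_chains = [[1.0, 1.2, 0.9, 1.1], [1.1, 0.9, 1.0, 1.2]]
stuck_chains = [[1.0, 1.2, 0.9, 1.1], [3.1, 2.9, 3.0, 3.2]]

print(f"mixed: {r_hat(mixed_chains):.2f}, stuck: {r_hat(stuck_chains):.2f}")
```

Real chains would be far longer than four iterations; the short lists here only keep the example readable.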
- Having too few imputations in multiple imputation.
Earlier recommendations of 3-5 imputations prior to pooling have been superseded by recommendations of a much larger number. Some 20-100 imputations may be needed to obtain convergence and to avoid fall-off of MI estimates compared to maximum likelihood estimates.
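To make the effect of the number of imputations concrete, the sketch below pools hypothetical per-imputation estimates with Rubin's rules and evaluates Rubin's approximate relative efficiency, 1 / (1 + fmi / m), where fmi is the fraction of missing information and m is the number of imputations. The efficiency here is relative to an infinite number of imputations, which is an assumption of this illustration rather than the maximum likelihood comparison mentioned above, and all input values are invented.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors (Rubin's rules)."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)       # pooled point estimate
    within = statistics.mean(variances)      # average within-imputation variance
    between = statistics.variance(estimates) # between-imputation variance
    total = within + (1 + 1 / m) * between   # total variance of the pooled estimate
    return q_bar, total


def relative_efficiency(fmi, m):
    """Approximate efficiency of m imputations relative to infinitely many."""
    return 1 / (1 + fmi / m)


# Hypothetical coefficient estimates and variances from m = 5 imputed data sets.
estimates = [2.0, 2.2, 1.8, 2.1, 1.9]
variances = [0.04, 0.04, 0.04, 0.04, 0.04]
q_bar, total_var = pool_rubin(estimates, variances)
print(f"pooled estimate: {q_bar:.2f}, total variance: {total_var:.3f}")

for m in (5, 20, 100):
    print(m, round(relative_efficiency(0.5, m), 4))
```

With half the information missing, efficiency rises from about 0.91 at m = 5 to above 0.99 at m = 100, which is one rationale for using more imputations when missingness is heavy.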
- Not meeting the assumption of multivariate normality.
Our book, listed below, discusses this assumption of data imputation.
Want to learn more about all this and much more?
"Missing Values Analysis & Data Imputation" on Amazon, Kindle format
"Missing Values Analysis & Data Imputation" Preview, PDF format
"Missing Values Analysis & Data Imputation" Information and table of contents
"Statistical Associates Library" of 50 Statistics E-books on Amazon, no-password .PDF format