Statistical Associates Publishers

## Multiple Regression: 10 Worst Pitfalls and Mistakes

1. Having a binary dependent variable.
A dichotomous dependent variable violates the assumption that the dependent variable be normally distributed. For dichotomies for which there is an underlying normal distribution (e.g., high vs. low income coded 1 and 0, but with continuous income as the basis), probit regression is appropriate. Otherwise, binary logistic regression should be used instead of OLS regression.
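As a minimal sketch (not from the source, using synthetic data), a binary outcome can be fit by logistic regression via gradient ascent on the log-likelihood rather than by OLS:

```python
import numpy as np

# Hypothetical illustration: binary logistic regression fit by plain
# gradient ascent, instead of running OLS on a 0/1 outcome.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-2.0 * x))            # true model: logit(p) = 2x
y = (rng.random(n) < p).astype(float)         # dichotomous dependent variable

X = np.column_stack([np.ones(n), x])          # intercept + predictor
beta = np.zeros(2)
for _ in range(2000):                         # gradient ascent on the log-likelihood
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    beta += 0.5 * X.T @ (y - mu) / n

pred = (1.0 / (1.0 + np.exp(-X @ beta))) > 0.5
accuracy = (pred == y).mean()
```

In practice a statistical package's logistic (or probit) routine would be used; the point is that the fitted model predicts probabilities bounded in [0, 1], which OLS on a dummy outcome does not.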

2. Having heterogeneous effects across levels of the dependent variable.
OLS regression predicts average effects. If effects differ across the range of the dependent variable, the mean will be a "bad average," and the researcher should consider quantile regression.
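A minimal sketch of the median case (the 0.5 quantile), not from the source: least-absolute-deviations regression targets the conditional median rather than the OLS conditional mean, here approximated by iteratively reweighted least squares on synthetic heavy-tailed data:

```python
import numpy as np

# Hypothetical illustration: median (0.5-quantile) regression via
# iteratively reweighted least squares on the absolute-error loss.
rng = np.random.default_rng(1)
n = 400
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=n)   # heavy-tailed errors
X = np.column_stack([np.ones(n), x])

beta = np.linalg.lstsq(X, y, rcond=None)[0]        # start from the OLS solution
for _ in range(50):
    r = y - X @ beta
    w = 1.0 / np.maximum(np.abs(r), 1e-6)          # LAD as weighted least squares
    W = X * w[:, None]
    beta = np.linalg.solve(X.T @ W, W.T @ y)
```

Full quantile regression generalizes this to any quantile via an asymmetric (pinball) loss; dedicated routines in statistical packages should be preferred in real analyses.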

3. Not having linear relationships.
The right-hand predictor side of the equation must be linearly related to the left-hand outcome side. If not, some of the predictor variables may need to be transformed (e.g., adding age-squared alongside age), and/or a non-linear link function that transforms the outcome variable may be needed, as is done in generalized linear models.
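The age/age-squared transformation above can be sketched as follows (synthetic data, not from the source); the model stays linear in its parameters even though the relationship is curvilinear:

```python
import numpy as np

# Hypothetical illustration: capturing a curvilinear age effect by
# adding a squared term to an otherwise linear model.
rng = np.random.default_rng(2)
n = 300
age = rng.uniform(20, 70, size=n)
y = 50 + 2.0 * age - 0.02 * age**2 + rng.normal(scale=2.0, size=n)

X_lin = np.column_stack([np.ones(n), age])
X_quad = np.column_stack([np.ones(n), age, age**2])

b_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)
b_quad, *_ = np.linalg.lstsq(X_quad, y, rcond=None)
rss_lin = ((y - X_lin @ b_lin) ** 2).sum()     # residual sum of squares, linear only
rss_quad = ((y - X_quad @ b_quad) ** 2).sum()  # with the age-squared term
```

The model with the squared term recovers the curvature and leaves a smaller residual sum of squares.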

4. Ignoring heteroscedastic residuals.
If error variance is not homogeneous along the range of the dependent variable, the researcher needs to use robust standard errors, robust regression, quantile regression, or some other approach.
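A minimal sketch of heteroscedasticity-robust (White/HC0) standard errors via the sandwich formula, on synthetic data with non-constant error variance (not from the source):

```python
import numpy as np

# Hypothetical illustration: classical vs. heteroscedasticity-robust
# (sandwich) standard errors when error variance depends on x.
rng = np.random.default_rng(3)
n = 500
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3 * np.abs(x - 5) + 0.05, size=n)
X = np.column_stack([np.ones(n), x])

b, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ b
XtX_inv = np.linalg.inv(X.T @ X)
se_classical = np.sqrt(np.diag(XtX_inv) * (r @ r) / (n - 2))
meat = X.T @ (X * (r**2)[:, None])             # X' diag(r^2) X
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
```

Here the classical formula understates the slope's standard error; statistical packages expose the same correction via options such as HC0–HC3 robust covariance estimators.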

5. Misspecifying the model.
If causally significant variables are omitted, or causally spurious but correlated variables are included in the model, all coefficients may change, even in direction and/or significance.
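Omitted-variable bias can be demonstrated with a short simulation (synthetic data, not from the source): dropping a correlated, causally relevant predictor inflates the remaining coefficient.

```python
import numpy as np

# Hypothetical illustration: omitting x2, which is correlated with x1
# and causally relevant, biases the coefficient on x1.
rng = np.random.default_rng(4)
n = 2000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)       # x2 correlated with x1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)   # both truly have coefficient 1

X_full = np.column_stack([np.ones(n), x1, x2])
X_short = np.column_stack([np.ones(n), x1])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
b_short, *_ = np.linalg.lstsq(X_short, y, rcond=None)
# b_short[1] absorbs x2's effect: roughly 1 + 0.8 = 1.8 instead of 1.0
```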

6. Asserting that a regression model is true.
Multiple models may fit the data as well or better. It is much more sound to compare two or more models to determine which fits the data better than to assert that one is "correct".
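One common way to compare competing models is an information criterion such as AIC; a minimal sketch on synthetic data (not from the source), using the Gaussian-likelihood form AIC = n·log(RSS/n) + 2k:

```python
import numpy as np

# Hypothetical illustration: comparing two candidate models by AIC
# rather than asserting that either one is "true".
rng = np.random.default_rng(5)
n = 300
x = rng.normal(size=n)
y = 1.0 + x + 0.5 * x**2 + rng.normal(size=n)  # data actually quadratic

def aic(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = ((y - X @ b) ** 2).sum()
    k = X.shape[1]                             # number of estimated coefficients
    return len(y) * np.log(rss / len(y)) + 2 * k

aic_linear = aic(np.column_stack([np.ones(n), x]), y)
aic_quadratic = aic(np.column_stack([np.ones(n), x, x**2]), y)
```

The lower-AIC model is preferred; here the comparison favors the quadratic specification without claiming it is the true model.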

7. Using stepwise regression for confirmatory purposes.
Stepwise regression is a data-driven approach that may overfit to noise in the data and may not replicate on future datasets. If used, it should be for exploratory research and, ideally, results should be cross-validated using a validation dataset. Moreover, modern model-selection methods offer more sophisticated procedures for automated modeling than traditional stepwise regression.
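The value of a validation split can be sketched as follows (synthetic data, not from the source): training R-squared always favors the bigger model, while held-out R-squared shows whether the extra predictors, here pure noise, actually generalize.

```python
import numpy as np

# Hypothetical illustration: in-sample fit always rewards added
# predictors; a held-out validation set exposes overfitting to noise.
rng = np.random.default_rng(6)
n = 200
x = rng.normal(size=n)
X_noise = rng.normal(size=(n, 10))             # 10 irrelevant predictors
y = 2.0 * x + rng.normal(size=n)

def r2(X, y, b):
    r = y - X @ b
    return 1 - (r @ r) / ((y - y.mean()) @ (y - y.mean()))

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, X_noise])
train, valid = slice(0, 100), slice(100, 200)

b_small, *_ = np.linalg.lstsq(X_small[train], y[train], rcond=None)
b_big, *_ = np.linalg.lstsq(X_big[train], y[train], rcond=None)
r2_train_small = r2(X_small[train], y[train], b_small)
r2_train_big = r2(X_big[train], y[train], b_big)
r2_valid_small = r2(X_small[valid], y[valid], b_small)
r2_valid_big = r2(X_big[valid], y[valid], b_big)
```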

8. Lack of sampling adequacy in factor space.
It is not just that the researcher needs an adequate sample size. It is also necessary to have an adequate count in each cell formed by the factors in the analysis. While often discussed with regard to analysis of variance, sampling adequacy also applies to categorical variables used in regression analysis (both are part of the general linear model). All cell frequencies should be greater than 1, and 80% or more of cells should have counts greater than 5. Small or empty cells may cause the regression model to become unstable, reporting implausibly large b coefficients for dummy variables.
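The cell-count rule above can be checked mechanically before fitting; a minimal sketch (the function name and thresholds are illustrative, not from the source):

```python
from collections import Counter

# Hypothetical helper: check cell counts for two categorical factors
# before entering them as dummy variables in a regression.
def cells_adequate(factor_a, factor_b, min_count=1, pct_over_5=0.80):
    """All cells must exceed min_count, and >= 80% of cells must exceed 5."""
    levels_a = sorted(set(factor_a))
    levels_b = sorted(set(factor_b))
    counts = Counter(zip(factor_a, factor_b))
    cells = [counts.get((a, b), 0) for a in levels_a for b in levels_b]
    all_over_min = all(c > min_count for c in cells)
    share_over_5 = sum(c > 5 for c in cells) / len(cells)
    return all_over_min and share_over_5 >= pct_over_5

a = ["low"] * 30 + ["high"] * 30
b = (["urban"] * 15 + ["rural"] * 15) * 2
ok = cells_adequate(a, b)                      # all four cells have 15 cases
```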

9. Using significance tests when you do not have a random sample.
If you have a random sample, a significant regression coefficient can be generalized to the population from which the sample was drawn. If you have an enumeration (all the cases in the population to which you wish to generalize), significance testing is irrelevant. If you have a non-random sample, significance tests will be in error to an unknown degree. Bootstrapped significance tests will not solve this inability to generalize, though they do help when the sample is random but the distribution is non-normal or unknown.
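A minimal sketch of the bootstrap case where it does help (synthetic data, not from the source): a pairs-bootstrap percentile interval for a slope relaxes distributional assumptions, but it still presumes the sample was randomly drawn.

```python
import numpy as np

# Hypothetical illustration: pairs-bootstrap confidence interval for an
# OLS slope with skewed, non-normal errors. This handles non-normality,
# not non-random sampling.
rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
y = 2.0 * x + rng.exponential(size=n) - 1.0    # skewed errors, mean zero
X = np.column_stack([np.ones(n), x])

slopes = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)           # resample cases with replacement
    b, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    slopes.append(b[1])
lo, hi = np.percentile(slopes, [2.5, 97.5])    # percentile interval for the slope
```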

10. Not meeting the assumptions of multiple linear regression.
Our book, listed below, enumerates 22 assumptions of multiple linear regression, clearly listed in the "Assumptions" section.