Logistic Regression: 10 Worst Pitfalls and Mistakes
Not having truly binary data for the dependent variable in binary logistic regression.
If your dichotomous variable is a cut on an underlying normally distributed variable, as it would be for income coded 0 = low and 1 = high, probit regression is more appropriate.
Not having unordered categories for the dependent variable in multinomial logistic regression.
If there is an ascending or descending order to your dependent variable, ordinal regression is more appropriate as it has more power than multinomial logistic regression.
Not having linearity in the logit.
The right-hand (predictor) side of the equation must be linearly related to the left-hand (outcome) side, which in logistic regression is the logit. You must test for linearity in the logit. This is commonly done with the Box-Tidwell transformation (test): add to the logistic model interaction terms that are the product of each continuous independent variable and its natural logarithm, (X)ln(X). If these terms are significant, there is nonlinearity in the logit. This method is not sensitive to small nonlinearities.
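The Box-Tidwell idea can be sketched in plain Python. This is only an illustration under stated assumptions: the data are simulated with a deliberately curved logit, the small Newton-Raphson fitter and the helper names (fit_logit, box_tidwell_lr) are hypothetical, and the added (X)ln(X) term is judged with a likelihood-ratio test rather than the coefficient significance test a statistics package would print. X must be strictly positive for ln(X) to exist.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_lik(beta, rows, y):
    # Bernoulli log-likelihood, written as y*z - log(1 + exp(z)) in a stable form.
    total = 0.0
    for r, yi in zip(rows, y):
        z = sum(b * v for b, v in zip(beta, r))
        total += yi * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total

def solve(a, b):
    # Solve a small linear system by Gaussian elimination with partial pivoting.
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def fit_logit(rows, y, iters=50):
    # Newton-Raphson with step halving; returns (coefficients, log-likelihood).
    k = len(rows[0])
    beta = [0.0] * k
    ll = log_lik(beta, rows, y)
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for r, yi in zip(rows, y):
            p = sigmoid(sum(b * v for b, v in zip(beta, r)))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (yi - p) * r[i]
                for j in range(k):
                    hess[i][j] += w * r[i] * r[j]
        step = solve(hess, grad)
        t = 1.0
        trial = [b + t * s for b, s in zip(beta, step)]
        while log_lik(trial, rows, y) < ll and t > 1e-6:
            t /= 2.0
            trial = [b + t * s for b, s in zip(beta, step)]
        new_ll = log_lik(trial, rows, y)
        converged = abs(new_ll - ll) < 1e-10
        beta, ll = trial, new_ll
        if converged:
            break
    return beta, ll

def box_tidwell_lr(x, y):
    # Compare logit ~ x against logit ~ x + x*ln(x) with a 1-df likelihood-ratio test.
    if min(x) <= 0:
        raise ValueError("x must be strictly positive to take ln(x)")
    _, ll_base = fit_logit([[1.0, xi] for xi in x], y)
    _, ll_aug = fit_logit([[1.0, xi, xi * math.log(xi)] for xi in x], y)
    stat = max(0.0, 2.0 * (ll_aug - ll_base))
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square upper tail, 1 df
    return stat, p_value

# Simulated data (an assumption for illustration): the true logit is U-shaped in x.
rng = random.Random(0)
x = [rng.uniform(0.5, 5.5) for _ in range(500)]
y = [1 if rng.random() < sigmoid((xi - 3.0) ** 2 - 2.0) else 0 for xi in x]

stat, p_value = box_tidwell_lr(x, y)
print(f"LR statistic = {stat:.2f}, p = {p_value:.4g}")
```

A small p-value for the added term points to nonlinearity in the logit for that predictor; with several continuous predictors, one (X)ln(X) term would be added per predictor.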
Using classification tables to report model strength when the research purpose is causal analysis.
Classification tables are used when the research purpose is prediction, not causal analysis. This is because classification tables reward only prediction and not near-miss estimates, unlike pseudo R-squared measures.
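A toy comparison may make the "near-miss" point concrete. The outcomes and the two sets of predicted probabilities below are made up for illustration, and McFadden's pseudo R-squared is used only as one example of a likelihood-based measure; it is computed here directly from the supplied probabilities rather than from a fitted model.

```python
import math

# Hypothetical outcomes and predicted probabilities from two imaginary models.
y = [1, 1, 1, 0, 0, 0]
confident = [0.95, 0.90, 0.45, 0.05, 0.10, 0.55]
timid = [0.55, 0.55, 0.45, 0.45, 0.45, 0.55]

def percent_correct(y, probs, cutoff=0.5):
    # Share of cases whose predicted class (p >= cutoff) matches the outcome.
    hits = sum((p >= cutoff) == (yi == 1) for yi, p in zip(y, probs))
    return hits / len(y)

def log_likelihood(y, probs):
    return sum(math.log(p if yi == 1 else 1.0 - p) for yi, p in zip(y, probs))

def mcfadden_r2(y, probs):
    # 1 - LL(model) / LL(null), where the null predicts the base rate for everyone.
    base_rate = sum(y) / len(y)
    ll_null = log_likelihood(y, [base_rate] * len(y))
    return 1.0 - log_likelihood(y, probs) / ll_null

for name, probs in [("confident", confident), ("timid", timid)]:
    print(name, round(percent_correct(y, probs), 3), round(mcfadden_r2(y, probs), 3))
```

Both sets of probabilities classify the same four of six cases correctly, so the classification table cannot tell them apart, while the likelihood-based measure credits the model whose probabilities sit closer to the observed outcomes.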
Using ROC curves to compare models when the research purpose is causal analysis.
ROC curves are based on classification table results and have the same problem as mentioned above from the point of view of causal rather than predictive analysis.
In classification tables, comparing percent correct against the wrong baseline.
In predictive analysis, the baseline for percent correct is not 1/L, where L is the number of levels of the dependent variable (so it is not necessarily 1/2 = .50 for a binary dependent, for example). The common baseline for percent correct by chance is the proportion of the total that falls in the most numerous category. Thus a percent correct of 75% does not improve prediction at all if the most numerous category is 75% or more of the total.
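The baseline arithmetic is easy to check in a few lines of Python. The outcome vector below is invented for illustration, and the function names are hypothetical.

```python
from collections import Counter

def majority_baseline(y):
    # Percent correct obtained by always predicting the most numerous category.
    counts = Counter(y)
    return max(counts.values()) / len(y)

def improvement_over_baseline(percent_correct, y):
    return percent_correct - majority_baseline(y)

# Hypothetical binary outcome: 75 cases in one category, 25 in the other.
y = [0] * 75 + [1] * 25

naive_baseline = 1 / len(set(y))   # the 1/L figure the text warns against
baseline = majority_baseline(y)    # the majority-category baseline
gain = improvement_over_baseline(0.75, y)

print(naive_baseline, baseline, gain)
```

Against the 1/L figure, a model with 75% correct looks like a 25-point gain; against the majority-category baseline, the gain is zero.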
Reporting pseudo R-squared measures as percent of variance explained in the dependent variable.
This is just incorrect. Report such measures in terms of weak, moderate, or strong. Common cutoffs are 0 - .3, .3 - .6, and .6 - 1.0 respectively.
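As a rough sketch of such reporting, the snippet below computes two widely used pseudo R-squared measures (Cox and Snell, and Nagelkerke) from log-likelihoods and attaches a verbal label using the cutoffs above. The log-likelihood values and sample size are invented, and assigning the boundary values .3 and .6 to the higher band is an assumption, since the cutoffs as stated overlap at their edges.

```python
import math

def cox_snell_r2(ll_null, ll_model, n):
    # 1 - (L_null / L_model)^(2/n), computed from log-likelihoods.
    return 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)

def nagelkerke_r2(ll_null, ll_model, n):
    # Cox and Snell rescaled by its maximum attainable value.
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell_r2(ll_null, ll_model, n) / max_cox_snell

def describe(value):
    # Verbal label; boundary handling at .3 and .6 is an assumption.
    if value < 0.3:
        return "weak"
    if value < 0.6:
        return "moderate"
    return "strong"

# Hypothetical fit: n = 200 cases, null and fitted model log-likelihoods.
n, ll_null, ll_model = 200, -138.6, -100.0
cs = cox_snell_r2(ll_null, ll_model, n)
nk = nagelkerke_r2(ll_null, ll_model, n)
label = describe(nk)
print(f"Cox-Snell = {cs:.3f}, Nagelkerke = {nk:.3f} ({label})")
```

The label, not a "percent of variance explained" reading, is what would go in the write-up.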
Lack of sampling adequacy in factor space.
It is not just that you need to have adequate sample size. You also need to have adequate count in each cell formed by the factors in your analysis. All cell frequencies should be greater than 1, and 80% or more of cells should have counts greater than 5. The presence of small or empty cells may cause the logistic model to become unstable, reporting implausibly large b coefficients and odds ratios for independent variables.
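A cell-count check along these lines can be sketched as follows. The records, factor names, and the helper cell_adequacy are all hypothetical; cells are formed by crossing every observed level of each listed factor, so combinations that never occur count as empty cells.

```python
from collections import Counter
from itertools import product

def cell_adequacy(records, factors):
    # Cross all observed levels of the factors and count cases in each cell.
    levels = [sorted({r[f] for r in records}) for f in factors]
    counts = Counter(tuple(r[f] for f in factors) for r in records)
    sizes = [counts[cell] for cell in product(*levels)]
    share_over_5 = sum(n > 5 for n in sizes) / len(sizes)
    return {
        "min_count": min(sizes),
        "share_over_5": share_over_5,
        # Rule of thumb from the text: every cell > 1 and at least 80% of cells > 5.
        "adequate": min(sizes) > 1 and share_over_5 >= 0.8,
    }

# Hypothetical data: two factors with very uneven cell counts.
records = (
    [{"sex": "f", "region": "north"}] * 10
    + [{"sex": "f", "region": "south"}] * 8
    + [{"sex": "m", "region": "north"}] * 7
    + [{"sex": "m", "region": "south"}] * 1
)

report = cell_adequacy(records, ["sex", "region"])
print(report)
```

Here one cell holds a single case and only three of four cells exceed 5, so the check fails on both counts.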
Using significance tests when you do not have a random sample.
If you have a random sample, you can generalize to the population from which it is drawn if a logistic coefficient is significant. If you have an enumeration (all the cases in the population to which you wish to generalize), significance testing is irrelevant. If you have a non-random sample, significance tests will be in error to an unknown degree. No, bootstrapped significance tests will not solve this problem of inability to generalize, though they do help when the sample is random but the distribution is non-normal or unknown.
For reporting effect size, relying on the odds ratio alone.
Odds ratios are summary measures of effect size. It is necessary to use marginal analysis (discussed in our book) to understand how effect size is conditioned on the range of values of the covariates.
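One simple way to see why the odds ratio alone is not enough: for a single-predictor logistic model the odds ratio exp(b) is the same everywhere, but the change in predicted probability per unit of X, b·p·(1 − p), depends on where you are on the curve. The intercept and coefficient below are invented for illustration, and this is only a sketch of the general idea, not the specific marginal analysis procedure the book describes.

```python
import math

def predicted_probability(intercept, b, x):
    return 1.0 / (1.0 + math.exp(-(intercept + b * x)))

def marginal_effect(intercept, b, x):
    # Derivative of the predicted probability with respect to x: b * p * (1 - p).
    p = predicted_probability(intercept, b, x)
    return b * p * (1.0 - p)

# Hypothetical fitted model: logit(p) = -2.0 + 0.8 * x
intercept, b = -2.0, 0.8
odds_ratio = math.exp(b)  # constant across all values of x

effects = {x: marginal_effect(intercept, b, x) for x in (0.0, 2.5, 6.0)}
print(f"odds ratio = {odds_ratio:.3f}")
for x, effect in effects.items():
    print(f"x = {x}: p = {predicted_probability(intercept, b, x):.3f}, dp/dx = {effect:.3f}")
```

The same coefficient moves the predicted probability by about 0.2 per unit of X near p = .5 but by far less near either tail, which is the kind of conditioning a single odds ratio hides.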
Not meeting the assumptions of logistic regression.
Our book, listed below, enumerates 16 assumptions of logistic regression, clearly listed in the "Assumptions" section.