In an earlier article, we looked at the meaning of the P value.1 This time we will look at another crucial statistical concept: that of confounding.
Confounding, as the name implies, is the recognition that crude associations may not reflect reality, but may instead be the result of outside factors. To illustrate, imagine that you want to study whether smoking increases the risk of death (in statistical terms, smoking is the exposure, and death is the outcome). You follow 5,000 people who smoke and 5,000 people who do not smoke for 10 years. At the end of the follow-up you find that about 40% of nonsmokers died, compared with only 10% of smokers. What do you conclude? At face value it would seem that smoking prevents death. However, before reaching this conclusion you might want to look at other factors. A look at the dataset shows that the average baseline age among nonsmokers was 60 years, whereas among smokers it was 40 years. Could this explain the results? You repeat the analysis within strata of age (i.e., you compare smokers who were aged 60-70 years at baseline with nonsmokers who were aged 60-70 years, smokers who were aged 50-60 years with nonsmokers who were aged 50-60 years, and so on). What you find is that, within each age category, the percentage of deaths among smokers was higher. Hence, you now reach the opposite conclusion, namely that smoking does increase the risk of death.
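To make this concrete, here is a minimal sketch in Python; the stratum counts are invented to roughly reproduce the figures above, and the pandas library is assumed to be available:

```python
import pandas as pd

# One row per (age band, smoking status) stratum; the counts are made up
# to roughly match the numbers in the text.
cohort = pd.DataFrame({
    "age_band": ["40-50", "40-50", "50-60", "50-60", "60-70", "60-70"],
    "smoker":   [True, False, True, False, True, False],
    "n":        [4000, 200, 800, 800, 200, 4000],
    "deaths":   [320, 10, 160, 120, 110, 1800],
})

# Crude (unadjusted) analysis: pool everyone and ignore age.
crude = cohort.groupby("smoker")[["deaths", "n"]].sum()
print(crude["deaths"] / crude["n"])  # ~39% of nonsmokers died vs ~12% of smokers

# Stratified analysis: compare within each age band.
cohort["death_rate"] = cohort["deaths"] / cohort["n"]
print(cohort.pivot(index="age_band", columns="smoker", values="death_rate"))
# Within every age band the smokers' death rate is the higher one
# (8% vs 5%, 20% vs 15%, 55% vs 45%): the crude association reverses.
```

The same counts drive both analyses; only the grouping changes, which is exactly why the crude and stratified results can point in opposite directions.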
What happened? Why the different result? The answer is that, in this case, age was a confounder. What we initially thought was the effect of smoking was, in reality, at least in part, the effect of age. Overall, more deaths occurred among nonsmokers in the first analysis because they were older at baseline. When we compare people of similar age who differ in smoking status, any difference in mortality between them is not because of age (they have the same age) but because of smoking. Thus, in the second analysis we took age into account, or, in statistical terms, we adjusted for age, whereas the first analysis was, in statistical terms, an unadjusted or crude analysis. We should always be wary of studies that report only crude results, because they might be biased/misleading.2
In the example above, age is not the only factor that might influence mortality. Alcohol or drug use, cancer or heart disease, body mass index, and physical activity can also influence death, independently of smoking. How can we adjust for all these factors? We cannot do stratified analyses as we did above, because there would be too many strata. The solution is to do a multivariable regression analysis. This is a statistical tool to adjust for multiple factors (or variables) at the same time. When we adjust for all these factors, we are comparing the effect of smoking in people who are the same with regard to all these factors but who differ in smoking status. In statistical terms, we study the effect of smoking keeping everything else constant. In this way we “isolate” the effect of smoking on death by taking all other factors into account, or, in statistical terms, we study the effect of smoking independently of other factors.
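As an illustration, here is a sketch of such an analysis in Python using the statsmodels library; the variable names, effect sizes, and the simulated cohort are all invented for the example:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000

# Simulated cohort (all names and effect sizes are invented for illustration):
# younger people are more likely to smoke, and both age and smoking raise
# the risk of death, so age confounds the smoking-death association.
age = rng.normal(50, 10, n)
smoker = (rng.random(n) < np.clip(1.5 - age / 50, 0.05, 0.95)).astype(int)
bmi = rng.normal(26, 4, n)
true_logit = -10 + 0.15 * age + 0.5 * smoker + 0.02 * (bmi - 26)
died = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)

df = pd.DataFrame({"age": age, "smoker": smoker, "bmi": bmi, "died": died})

# Crude model: smoking alone. Its coefficient absorbs the age imbalance
# and, with these made-up numbers, comes out negative (smoking looks protective).
print(smf.logit("died ~ smoker", data=df).fit(disp=0).params)

# Multivariable model: the smoking coefficient is now estimated while
# holding age and body mass index constant, and turns positive (harmful).
print(smf.logit("died ~ smoker + age + bmi", data=df).fit(disp=0).params)
```

The adjusted coefficient for smoking is the "everything else constant" effect described above: it compares people who are alike on the other variables in the model but differ in smoking status.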
How many factors should be included in a multivariable analysis? As a general rule, the more the better, to reduce confounding. However, the number of variables we can include in a regression model is limited by the sample size. The general rule of thumb is that, for every 10 events (for dichotomous outcomes) or 10 people (for continuous outcomes), we can add one variable to the model. If we add more variables than that, then in statistical terms the model becomes overfitted (i.e., it gives results that are specific to that dataset but may not be applicable to other datasets). Overfitted models can be as biased/misleading as crude models.3
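As a quick back-of-the-envelope check, the rule of thumb translates directly into code; the event counts below are hypothetical:

```python
def max_model_variables(n_events: int, events_per_variable: int = 10) -> int:
    """Rule-of-thumb ceiling on covariates; a heuristic, not a hard law."""
    return n_events // events_per_variable

# For a dichotomous outcome, count events (e.g., deaths), not people:
print(max_model_variables(600))  # hypothetical cohort with 600 deaths -> 60
print(max_model_variables(45))   # a small study with 45 deaths -> only 4
```

Note that the limit is driven by the number of events, not the total sample size: a study of 10,000 people with only 45 deaths can still support just a handful of adjustment variables.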