“RISK FACTOR (Syn: determinant) A factor that is causally related to a change in the RISK of a relevant health process, outcome or condition. The causal nature of the relationship is established on the basis of scientific evidence (including evidence from EPIDEMIOLOGICAL RESEARCH) and CAUSAL INFERENCE. The causal relationship is inherently probabilistic, as it happens in many other spheres of nature and human life. If the relationship is noncausal, the factor is just a RISK MARKER.”
In general, the term remains ambiguous, and it is left to the reader to interpret a study’s findings as predictive, causal or etiologic. The headlines accompanying two recent equine studies imply that the press interprets these risk factors as causal and as warranting an intervention. As will be shown, this is a misinterpretation.
Unfortunately, these studies often conflate prediction (prognosis) and causal explanation (prevention/treatment), leading to confusion and misinterpretation.
Prediction vs Explanation (Causal) Models

| Type of Model | Purpose | Parameter of Interest | Exchangeability Required | Variable Selection |
| --- | --- | --- | --- | --- |
| Predictive | To accurately predict the value of the dependent variable Y based upon \(\ge 1\) predictor variables | \(\hat{Y}\), the predicted outcome of interest | No | Variables are selected solely for their ability to predict the outcome. Definitions of the variables as confounders, mediators or colliders are not essential (therefore algorithmic methods of model development can be used) |
| Explanatory | To understand causes and identify their impact on outcome variable Y | \(\beta_x\), the effect estimate for the exposure of interest | Yes | Variables selected to maintain exchangeability, a priori, based upon DAG (directed acyclic graph) analysis and domain expert input |
Defining the goal via a question
Risk factor studies are uninterpretable if the authors do not define what “risk factor” means in their particular study. This can be achieved by posing a specific question. The aforementioned equine studies illustrate the distinction.
Predictive questions would be:
Which thoroughbred racehorses are most likely to experience sudden death?
Which eventing horses are most likely to experience falls on the cross-country course?
Causal or explanatory questions would be:
To what degree does Lasix increase the risk of non-musculoskeletal sudden death in thoroughbred racehorses?
To what degree does a low dressage score increase a horse’s risk of falling on the cross-country course?
GLM (generalized linear models) and confusion
The generalized linear model allows the expected value of various types of outcome to be related linearly to a vector of covariates through a link function \(f(\cdot)\), frequently the logit function.
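Written out for a binary outcome with the logit link, the model takes the form below. This is the standard textbook formulation rather than anything specific to the equine studies; the symbols \(p\) and \(x_1, \ldots, x_n\) are introduced here for illustration.

```latex
f\bigl(E[Y \mid x_1, \ldots, x_n]\bigr) = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n

% With the logit link and p = P(Y = 1 \mid x_1, \ldots, x_n):
\operatorname{logit}(p) = \log\frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n

% Equivalently, on the probability scale:
p = \frac{1}{1 + \exp\bigl(-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)\bigr)}
```

Each \(\beta_j\) is therefore a difference in log odds per unit change in \(x_j\), holding the other covariates fixed; whether that difference is causal is a separate question.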
The coefficients \(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_n\) are usually obtained by maximum likelihood estimation (MLE), which finds the parameter values that make the observed data most likely. Although the idea is easy to understand, it requires substantial computation, which modern software makes trivial. With the advent of this software, GLMs became widely established. The methods, however, were associated with prediction rather than causal explanation. These models cannot be used to make any causal inferences without significant assumptions.
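As a minimal sketch of what the software does, the following base R code fits a logistic GLM by maximum likelihood. The data are simulated, and the variable names (`x1`, `x2`, `y`) and the "true" coefficients are arbitrary assumptions chosen only for illustration.

```r
# Simulated (hypothetical) data: a binary outcome driven by two covariates.
set.seed(42)
n  <- 5000
x1 <- rnorm(n)
x2 <- rbinom(n, size = 1, prob = 0.4)

# "True" coefficients on the logit scale, chosen arbitrarily for this sketch.
beta <- c(intercept = -1, x1 = 0.8, x2 = 0.5)
p <- plogis(beta[["intercept"]] + beta[["x1"]] * x1 + beta[["x2"]] * x2)
y <- rbinom(n, size = 1, prob = p)

# glm() maximises the binomial likelihood
# (numerically, via iteratively reweighted least squares).
fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"))

beta_hat <- coef(fit)        # maximum likelihood estimates of the coefficients
print(round(beta_hat, 2))
print(logLik(fit))           # the maximised log-likelihood
```

The estimates recover the simulated coefficients only because the data were generated from exactly this model; with real data, nothing in the fitting step tells us whether a coefficient is causal.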
The result has been that individual coefficients in GLMs have been given causal meanings, particularly on the basis of statistical significance. This is commonly seen in equine risk factor studies, where the press and researchers imply causality from a “significant statistical association” and make recommendations based upon these results: for example, that banning furosemide could reduce sudden deaths in thoroughbreds, or that eliminating horses with low dressage scores would decrease falls on the cross-country course.
The confusion between predictor variables and causal variables and their association with the dependent variable is particularly tenacious when it comes to risk factor studies. Correlational and causal relationships are frequently conflated in these studies leading to potentially erroneous conclusions.
Predictive and Causal Generalized Linear Models
GLMs for prediction and for causal inference differ in:
The covariates considered for inclusion or exclusion
How the set of covariates to include is determined
Selection process
Model evaluation
Model interpretation
Prediction modelling
A prediction model aims to offer a probability of an outcome based on a set of predictor variables. These predictions may allow preparation for the event or potentially inform an intervention to prevent its occurrence.
A prediction model for sudden death in Thoroughbred racehorses could be used to inform the trainer or owner that a particular horse is at increased risk of the event. This may aid in decision-making. The model would likely contain several covariates which are associated with the outcome but not necessarily causal. It is possible that furosemide administration is associated with an increase in sudden death but is not a direct cause (perhaps it is a marker of EIPH, which is the proximate cause). Since a predictive model is uninterested in cause, the included variables are not interpretable as causal factors. In fact, the individual coefficients in a predictive model have no meaningful interpretation on their own; the model can only be interpreted as a whole. This is one of many reasons why univariable selection of variables for a prediction model is not recommended.
Variables are often selected for prediction models using an algorithm such as stepwise regression. The best-fitting model is then chosen, and its discrimination and calibration are measured. This is known as internal validation. A prediction model must then be externally validated on a new dataset. This is not what is done with risk factor models in equine medicine.
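The workflow described above can be sketched in base R as follows. The data are simulated, all variable names are hypothetical, and a simple split sample stands in for internal validation purely for brevity (other approaches, such as bootstrapping or cross-validation, exist). The `auc()` helper is defined here and is not a library function.

```r
# Simulated (hypothetical) data: two real predictors and one pure-noise variable.
set.seed(1)
n <- 4000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
d$y <- rbinom(n, 1, plogis(-1.5 + 1.0 * d$x1 + 0.5 * d$x2))

# Split sample: develop the model on one half, evaluate it on the other.
idx     <- sample(n, n / 2)
develop <- d[idx, ]
holdout <- d[-idx, ]

# Algorithmic variable selection: AIC-based backward stepwise regression.
full     <- glm(y ~ x1 + x2 + noise, family = binomial, data = develop)
selected <- step(full, direction = "backward", trace = 0)

# Discrimination: area under the ROC curve (c-statistic),
# computed with the Mann-Whitney rank formula.
auc <- function(y, score) {
  r  <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

lp_holdout  <- predict(selected, newdata = holdout, type = "link")
auc_holdout <- auc(holdout$y, lp_holdout)

# Calibration slope: regress the observed outcome on the linear predictor.
# A slope near 1 suggests predictions are neither too extreme nor too timid.
cal_slope <- coef(glm(holdout$y ~ lp_holdout, family = binomial))[["lp_holdout"]]

print(c(auc = auc_holdout, calibration_slope = cal_slope))
```

Note that every quantity reported here describes the model as a whole; none of the steps asks what any single coefficient means.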
A prediction model cannot provide information on how to change the expected outcome. It also cannot indicate which of the predictors are most relevant, because predictions depend on the set of predictors as a whole, not on the individual predictors. It is unlikely that the individual predictor coefficients have a sensible interpretation in isolation.
Causal Modelling
A causal model estimates the true causal association between an exposure, such as the administration of furosemide, and an outcome, such as sudden death, by removing the hypothesized noncausal associations that distort the true relationship. Causal information can be used to alter the outcome by intervening on the exposure.
In the GLM, a causal association is measured by the coefficient of the exposure variable, such as the administration of furosemide. Spurious associations are removed by including, as controls, the set of variables that could distort the relationship. For example, if furosemide is identified as a risk factor for sudden death in thoroughbreds, the set of covariates which could create spurious associations should be identified. The most familiar source of spurious association is a variable that is a common cause of the exposure and the outcome (such as EIPH), known as a confounder.
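A small simulation can make confounding concrete. In the sketch below the data are entirely simulated and the probabilities are arbitrary assumptions: EIPH raises both the chance of receiving furosemide and the chance of sudden death, while furosemide itself has no effect on death at all.

```r
# Simulated (hypothetical) data in which furosemide has NO effect on death.
set.seed(2024)
n <- 20000
eiph       <- rbinom(n, 1, 0.3)                      # confounder
furosemide <- rbinom(n, 1, plogis(-1 + 2 * eiph))    # exposure depends on EIPH
death      <- rbinom(n, 1, plogis(-3 + 1.5 * eiph))  # outcome depends on EIPH only

crude    <- glm(death ~ furosemide,        family = binomial)
adjusted <- glm(death ~ furosemide + eiph, family = binomial)

# Odds ratios for furosemide, without and with control for the confounder.
or_crude    <- exp(coef(crude)[["furosemide"]])
or_adjusted <- exp(coef(adjusted)[["furosemide"]])

print(round(c(crude = or_crude, adjusted = or_adjusted), 2))
```

The crude odds ratio sits well above 1 even though the simulated drug does nothing, and it moves close to 1 once EIPH is controlled. This works here only because the simulation contains a single, fully measured confounder.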
Variables which are a common descendant (a common effect) of both the exposure and the outcome are known as colliders. Controlling for these variables will induce a noncausal association between the exposure and outcome. Mediators, which transmit part of the causal association, should also be excluded from the model when the total effect of the exposure is of interest.
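Collider bias can be demonstrated with a few lines of base R. This is a sketch on simulated data with generic, hypothetical variable names; a linear model (a GLM with an identity link) is used so that the numbers are easy to read.

```r
# Simulated (hypothetical) data: exposure and outcome are truly independent.
set.seed(7)
n <- 10000
exposure <- rnorm(n)
outcome  <- rnorm(n)                        # true effect of exposure is 0
collider <- exposure + outcome + rnorm(n)   # common effect of both

# Without the collider, the exposure coefficient is close to zero.
b_unadjusted <- coef(lm(outcome ~ exposure))[["exposure"]]

# Controlling for the collider induces a spurious (here negative) association.
b_collider <- coef(lm(outcome ~ exposure + collider))[["exposure"]]

print(round(c(unadjusted = b_unadjusted, collider_adjusted = b_collider), 2))
```

Adding a variable to the model made the estimate worse, not better, which is why the choice of controls cannot be left to an algorithm that only looks at fit.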
These relationships should be rigorously thought out and variables that are not in the dataset should be considered if they are thought to be confounders.
library(ggdag)
library(ggplot2)
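library(patchwork)  # allows ggplot objects to be combined, e.g. dag_plot + paths_plot

theme_set(theme_dag())

# Hypothesized DAG: EIPH grade and age are common causes of both
# furosemide administration (exposure) and sudden death (outcome).
lasix_ca_dag <- dagify(
  suddendeath ~ furosemide,
  furosemide ~ EIPH + age,
  suddendeath ~ age,
  suddendeath ~ EIPH,
  labels = c(
    "suddendeath" = "Sudden\n Death",
    "furosemide" = "Furosemide",
    "EIPH" = "EIPH grade",
    "age" = "Age",
    "weight" = "Weight"
  ),
  exposure = "furosemide",
  outcome = "suddendeath"
)

# The DAG itself, its open paths, and the minimal adjustment set.
dag_plot <- ggdag(lasix_ca_dag, text = FALSE, use_labels = "label")
paths_plot <- ggdag_paths(lasix_ca_dag, text = FALSE, use_labels = "label", shadow = TRUE)
adjustment_plot <- ggdag_adjustment_set(lasix_ca_dag, text = FALSE, use_labels = "label", shadow = TRUE)

adjustment_plot  # display the adjustment set plot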
The simple DAG above includes two confounders of the relationship between furosemide administration and sudden death, both of which must be controlled for in a causal model in which furosemide administration is the exposure of interest. There will be more confounders, both known and unknown. Adding variables that are mediators or colliders will bias the model. Simply reading off the odds ratio for furosemide administration from a mutually adjusted multivariable model that did not define furosemide administration as the exposure of interest does not reveal the causal relationship between furosemide and sudden death, and will lead to misleading conclusions.
Risk factor studies frequently report all \(\beta\) coefficients in the model as though each represents a causal effect. This interpretation, whether implicit or explicit, is known as the Table 2 fallacy (Westreich and Greenland 2013) and is incorrect.
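The fallacy can be illustrated with a simulation in which the true effects are known. This is a sketch only: the data are simulated, the variable names are generic stand-ins, the effect sizes (0.8, 0.5, 0.3) are arbitrary assumptions, and a linear model is used so that the coefficients can be compared directly.

```r
# Simulated (hypothetical) data with known effects:
#   confounder -> exposure (0.8), exposure -> outcome (0.5),
#   confounder -> outcome directly (0.3).
set.seed(11)
n <- 20000
confounder <- rnorm(n)
exposure   <- 0.8 * confounder + rnorm(n)
outcome    <- 0.5 * exposure + 0.3 * confounder + rnorm(n)

# The mutually adjusted model, as typically reported in a "Table 2".
mutual <- lm(outcome ~ exposure + confounder)
b_exposure   <- coef(mutual)[["exposure"]]    # total effect of the exposure
b_confounder <- coef(mutual)[["confounder"]]  # only the DIRECT effect of the confounder

# The total effect of the confounder also flows through the exposure:
# 0.3 + 0.8 * 0.5 = 0.7. In this simple DAG it is recovered by not
# adjusting for the exposure, which is a mediator for the confounder.
total_confounder <- coef(lm(outcome ~ confounder))[["confounder"]]

print(round(c(exposure = b_exposure,
              confounder_in_table = b_confounder,
              confounder_total = total_confounder), 2))
```

The two coefficients in the same table answer different causal questions, and the second one only has an interpretation because the simulation told us the structure in advance.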
The highlighted odds ratios cannot be interpreted.
Reporting the \(\beta\) coefficients as though they carry the same causal information as the coefficient of the primary exposure is misleading and can lead to headlines like these.
This is a misinterpretation of the analysis. Such a causal claim could only be made if furosemide were analyzed as the primary exposure, with the appropriate confounders included in the model and inappropriate controls excluded. There is no statistical method for determining confounders or colliders. Unfortunately, interpreting a mutually adjusted model in this way is a very common approach to modelling and interpreting risk factors. This misunderstanding risks wasting financial resources, creates confusion in public discourse, and could lead to ineffectual and wasteful policy decisions.
Huitfeldt, Anders. 2016. “Is Caviar a Risk Factor for Being a Millionaire?: Table 1.” BMJ, December, i6536. https://doi.org/10.1136/bmj.i6536.
Westreich, D., and S. Greenland. 2013. “The Table 2 Fallacy: Presenting and Interpreting Confounder and Modifier Coefficients.” American Journal of Epidemiology 177 (4): 292–98. https://doi.org/10.1093/aje/kws412.