In this notebook I will demonstrate several techniques for understanding the relationship between a single continuous outcome and a set of correlated variables. The dependent variable in this case study is Proton Density Fat Fraction (PDFF), defined as
“a measure of the proportion of a tissue which is composed of fat”, in this case derived from MR imaging of the liver (source: https://perspectum.com/media/1361/understanding-pdff.pdf).
source("R/0_setup.R")   # assumed to load tidyverse and skimr, which are used below
library(caret)
library(pls)

IP    <- readRDS("DemoSet2.rds")
tests <- names(IP)[27:71]   # the 45 biochemistry / blood-count variables

# Keep PDFF plus the test variables, dropping rows with any missing values
sett <- IP %>%
  select(PDFF, all_of(tests)) %>%
  filter(complete.cases(.)) %>%
  data.frame()

# Compact numeric summary of each variable
sk <- skim(sett) %>%
  select(variable = skim_variable, mean = numeric.mean, min = numeric.p0,
         median = numeric.p50, max = numeric.p100, numeric.hist)
sk[, 2:5] <- apply(sk[, 2:5], 2, function(x) round(x, 2))   # round mean/min/median/max
print(sk)
With missing data removed, we have PDFF and biochemistry data for 2515 participants. Since PDFF is highly skewed and truncated at zero, a transformation is applied to make it easier for linear models to produce sensible predictions of PDFF.
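The exact transformation is not stated in this notebook; as an illustrative sketch only, if PDFF were still on its raw percentage scale, a shifted log is one common choice for a zero-truncated, right-skewed measure:

# Illustrative only -- the notebook does not state which transform was used.
# A shifted log tames right skew while keeping zero values defined:
PDFF_t <- log1p(IP$PDFF)                    # log(1 + PDFF)
hist(PDFF_t, breaks = 50, main = "log(1 + PDFF)")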
The first method that comes to mind is a multiple linear regression model, where we might be tempted to use all our biochemistry data as X variables. However, as we will see below, this is a bad idea when our X variables are correlated.
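To see why, here is a toy simulation (not part of the case study) showing how strongly correlated predictors inflate coefficient standard errors and destabilise their signs:

# Toy simulation: two nearly-identical predictors fighting over one signal
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # x2 is almost perfectly correlated with x1
y  <- x1 + rnorm(n)
summary(lm(y ~ x1 + x2))         # huge standard errors, unstable coefficients
summary(lm(y ~ x1))              # drop one: small standard error, clear effect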
# Naive multiple regression using all 45 biochemistry variables
m1 <- lm(PDFF ~ ., data = sett)
m1 %>% summary() %>% print()
Call:
lm(formula = PDFF ~ ., data = sett)
Residuals:
Min 1Q Median 3Q Max
-1.6305 -0.1865 -0.0440 0.1339 1.6955
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3091563 0.0062625 209.047 < 2e-16 ***
haematocrit_percentage 0.2844307 0.1995569 1.425 0.15419
red_blood_cell_erythrocyte_count 0.0571971 0.1602537 0.357 0.72119
haemoglobin_concentration -0.3313760 0.1476659 -2.244 0.02491 *
mean_corpuscular_volume -0.7168277 0.2527603 -2.836 0.00461 **
red_blood_cell_erythrocyte_distribution_width -0.0130327 0.0074308 -1.754 0.07957 .
mean_corpuscular_haemoglobin 0.8560445 0.2686840 3.186 0.00146 **
platelet_count 0.0267325 0.0481077 0.556 0.57848
white_blood_cell_leukocyte_count -0.5970714 0.2847785 -2.097 0.03613 *
mean_corpuscular_haemoglobin_concentration -0.3008519 0.1273568 -2.362 0.01824 *
mean_platelet_thrombocyte_volume 0.0127022 0.0247567 0.513 0.60794
platelet_crit -0.0175185 0.0433749 -0.404 0.68633
platelet_distribution_width -0.0087618 0.0073703 -1.189 0.23463
basophill_percentage 0.0474270 0.0409708 1.158 0.24715
eosinophill_percentage 0.2854742 0.2276292 1.254 0.20992
lymphocyte_percentage 1.6541060 1.1583432 1.428 0.15342
monocyte_percentage 0.5316702 0.3479571 1.528 0.12665
neutrophill_percentage 1.8229093 1.2894847 1.414 0.15758
basophill_count 0.0078867 0.0105680 0.746 0.45557
eosinophill_count 0.0699992 0.0276716 2.530 0.01148 *
lymphocyte_count 0.2450480 0.1164926 2.104 0.03552 *
monocyte_count 0.0270995 0.0375491 0.722 0.47054
neutrophill_count 0.4936365 0.2307846 2.139 0.03254 *
reticulocyte_count -0.0503671 0.1384470 -0.364 0.71604
reticulocyte_percentage 0.0288779 0.1344721 0.215 0.82998
mean_reticulocyte_volume 0.0107052 0.0103554 1.034 0.30134
high_light_scatter_reticulocyte_percentage -0.1906402 0.1236990 -1.541 0.12341
mean_sphered_cell_volume -0.0153129 0.0111490 -1.373 0.16973
high_light_scatter_reticulocyte_count 0.2961738 0.1238703 2.391 0.01688 *
immature_reticulocyte_fraction -0.0102351 0.0147230 -0.695 0.48701
alkaline_phosphatase -0.0024304 0.0068407 -0.355 0.72241
cholesterol -0.0162209 0.0224310 -0.723 0.46966
cystatin_c 0.0001916 0.0080584 0.024 0.98103
alanine_aminotransferase 0.0836863 0.0100436 8.332 < 2e-16 ***
creatinine -0.0202820 0.0091358 -2.220 0.02651 *
gamma_glutamyltransferase 0.0005962 0.0080063 0.074 0.94064
urea 0.0093931 0.0070524 1.332 0.18301
triglycerides 0.0405936 0.0078859 5.148 2.85e-07 ***
urate 0.0375807 0.0088955 4.225 2.48e-05 ***
ldl_direct -0.0057776 0.0372489 -0.155 0.87675
c_reactive_protein 0.0505941 0.0072055 7.022 2.82e-12 ***
aspartate_aminotransferase -0.0478666 0.0087086 -5.496 4.27e-08 ***
total_bilirubin -0.0052989 0.0071012 -0.746 0.45562
apolipoprotein_b 0.0111629 0.0255332 0.437 0.66201
igf_1 -0.0134079 0.0066528 -2.015 0.04397 *
glycated_haemoglobin_hb_a1c 0.0311873 0.0069783 4.469 8.21e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3141 on 2469 degrees of freedom
Multiple R-squared: 0.3059, Adjusted R-squared: 0.2932
F-statistic: 24.18 on 45 and 2469 DF, p-value: < 2.2e-16
vifs <- tibble(Measure = names(car::vif(m1)),
               VIF = car::vif(m1)) %>%
  print()
A number of biochemistry and blood markers show as significant in the multiple linear model for PDFF, and together the model has an adjusted R^2 of 29.3%. However, looking at the variance inflation factors (VIFs), we can see that many are much larger than 10, meaning that multicollinearity is a serious issue in this model.
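As a reminder, VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on all the other predictors. A quick sketch, computing one VIF by hand for the first variable in the model:

# VIF by hand: regress one predictor on all the others (PDFF excluded)
r2 <- summary(lm(haematocrit_percentage ~ . - PDFF, data = sett))$r.squared
1 / (1 - r2)   # should match car::vif(m1)["haematocrit_percentage"]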
We might try applying a stepwise algorithm to the linear model, to remove some of the redundant X variables.
stepwise_linear <- step(m1, trace = 0)   # AIC-based stepwise selection
summary(stepwise_linear)
vifs <- tibble(Measure = names(car::vif(stepwise_linear)),
               VIF = car::vif(stepwise_linear)) %>%
  print()
In the stepwise linear model we still have an adjusted R^2 of about 29%, but with only 26 of the 45 variables. The coefficients of these are illustrated below; however, it is clear from the VIFs that even the stepwise linear model retains enough serious multicollinearity for us to doubt the results.
Another option might be to apply lasso regularisation to select a useful set of variables and avoid multicollinearity.
# Apply lasso / regularised regression / feature selection
Y <- sett$PDFF
X <- sett %>% select(-PDFF) %>% as.matrix()
g1 <- glmnet::glmnet(X, Y, family = "gaussian", alpha = 1)   # alpha = 1 is the lasso
# Coefficient matrix across the lambda path (rows = variables, columns = lambda values)
mmat <- as.matrix(coef(g1))
mmat <- mmat[-1, ]   # drop the intercept row
Zmat <- mmat == 0    # TRUE where a coefficient is shrunk to exactly zero
# Take the variables still non-zero at the 13th lambda on the path -- the largest
# model where, on refitting below, all p-values remain significant
trialSet <- row.names(Zmat)[!Zmat[, 13]]
formula <- as.formula(paste("PDFF ~", str_c(trialSet, collapse = " + "))) %>% print()
PDFF ~ red_blood_cell_erythrocyte_count + high_light_scatter_reticulocyte_count +
alanine_aminotransferase + triglycerides + urate + c_reactive_protein
m3 <- lm(formula, data = sett)
m3 %>% summary()
Call:
lm(formula = formula, data = sett)
Residuals:
Min 1Q Median 3Q Max
-1.5673 -0.1882 -0.0456 0.1323 1.7532
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.309156 0.006389 204.892 < 2e-16 ***
red_blood_cell_erythrocyte_count 0.023925 0.007332 3.263 0.00112 **
high_light_scatter_reticulocyte_count 0.083482 0.007001 11.924 < 2e-16 ***
alanine_aminotransferase 0.056588 0.007193 7.867 5.36e-15 ***
triglycerides 0.050112 0.007153 7.005 3.15e-12 ***
urate 0.033466 0.007543 4.437 9.54e-06 ***
c_reactive_protein 0.060057 0.006692 8.974 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3204 on 2508 degrees of freedom
Multiple R-squared: 0.266, Adjusted R-squared: 0.2643
F-statistic: 151.5 on 6 and 2508 DF, p-value: < 2.2e-16
m3 %>% car::vif()
red_blood_cell_erythrocyte_count high_light_scatter_reticulocyte_count
1.316087 1.200101
alanine_aminotransferase triglycerides
1.266921 1.252863
urate c_reactive_protein
1.393215 1.096639
m3terms <- names(m3$coefficients)
Here 6 variables are selected, with an adjusted R^2 that is somewhat lower at 26%, but with VIFs all close to 1, so multicollinearity is no longer an issue.
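Note that column 13 of the coefficient path above was a manual choice of lambda; a more systematic sketch is to let cross-validation choose it via glmnet's cv.glmnet (the selected set may differ slightly from the hand-picked one):

# Let cross-validation pick lambda instead of hand-picking a path column
set.seed(42)
cvfit <- glmnet::cv.glmnet(X, Y, family = "gaussian", alpha = 1)
sel   <- coef(cvfit, s = "lambda.1se")   # the sparser of the two standard choices
setdiff(rownames(sel)[as.vector(sel != 0)], "(Intercept)")   # surviving variables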
Rather than selecting features, a method such as partial least squares can be used to gather correlated X variables together into components, relating all of the variables to Y without collinearity. The first method here is Principal Components Regression (PCR), which is essentially equivalent to using the components of a PCA as predictors in a linear model.
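To make that concrete, here is a minimal sketch of PCR by hand (k = 5 is arbitrary, purely for illustration):

# PCR by hand: PCA on the scaled X block, then a linear model on the first k scores
pca <- prcomp(sett[, -1], center = TRUE, scale. = TRUE)   # column 1 is PDFF
k   <- 5
summary(lm(sett$PDFF ~ pca$x[, 1:k]))$adj.r.squared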
# Apply test-train split
set.seed(123)
idx        <- sample(1:nrow(sett), floor(nrow(sett) * 0.7))
train.data <- sett[idx, ]
test.data  <- sett[-idx, ]
# Principal Components Regression, tuning the number of components by 5-fold CV
model <- train(
  PDFF ~ ., data = train.data, method = "pcr",
  scale = TRUE,
  trControl = trainControl("cv", number = 5),
  tuneLength = 14
)
# Plot model RMSE vs different values of components
plot(model)
# Summarize the final model
summary(model$finalModel)
Data: X dimension: 1760 45
Y dimension: 1760 1
Fit method: svdpc
Number of components considered: 12
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
X 13.75 22.84 30.55 37.63 44.17 50.00 54.33 58.46 62.28 65.88 69.17 72.43
.outcome 21.23 22.63 22.73 23.27 23.37 23.41 23.56 23.57 23.62 23.92 24.00 24.10
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance metrics
data.frame(
RMSE = caret::RMSE(predictions, test.data$PDFF),
Rsquare = caret::R2(predictions, test.data$PDFF)
)
Here PCR has extracted 12 components, which together explain 72% of the variance in X and achieve an R^2 with Y of 25% on the held-out test set.
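caret chooses the number of components as the value minimising cross-validated RMSE; the choice and the full tuning grid can be inspected directly:

model$bestTune   # ncomp selected by 5-fold CV
model$results    # RMSE / R^2 for each candidate ncomp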
Partial least squares is similar; however, rather than seeking to maximise the variance explained in X, components are extracted to maximise the covariance between the X components and Y.
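A minimal sketch of that objective, assuming PDFF is the first column of train.data as constructed above: the first PLS weight vector is proportional to X'y on the scaled data.

# First PLS component by hand: weights proportional to cov(X, y)
Xs <- scale(as.matrix(train.data[, -1]))   # scaled predictor block
ys <- as.numeric(scale(train.data$PDFF))
w  <- crossprod(Xs, ys)                    # t(Xs) %*% ys, i.e. ~ cov(X, y)
w  <- w / sqrt(sum(w^2))                   # normalise to unit length
t1 <- Xs %*% w                             # component-1 scores
cor(t1, ys)^2                              # ~ the 1-component Y variance shown below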
### Computing partial least squares
# Build the model on training set
set.seed(1234)
model <- train(
  PDFF ~ ., data = train.data, method = "pls",
  scale = TRUE,
  trControl = trainControl("cv", number = 10),
  tuneLength = 14
)
# Plot model RMSE vs different values of components
plot(model)
# Summarize the final model
summary(model$finalModel)
Data: X dimension: 1760 45
Y dimension: 1760 1
Fit method: oscorespls
Number of components considered: 9
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps
X 13.49 19.56 24.70 30.21 34.18 37.15 42.52 46.45 50.47
.outcome 24.56 27.28 28.67 29.00 29.26 29.43 29.47 29.51 29.55
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance metrics
data.frame(
RMSE = caret::RMSE(predictions, test.data$PDFF),
Rsquare = caret::R2(predictions, test.data$PDFF)
)
Here, PLS has extracted 9 components, which together explain 50% of the variation in X. These 9 components achieve an R^2 with Y of 29.2% on the held-out test set.
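caret's varImp() also works on PLS fits (its importance score is a weighted sum of the absolute coefficients across components), which complements the per-component loading plot built next:

# Overall variable importance across the retained PLS components
plot(caret::varImp(model), top = 15)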
We can extract the component loadings below to understand which variables have the most influence in each component.
loadings <- model$finalModel$loadings %>%
  unclass() %>% as.matrix()
tl <- loadings %>% data.frame() %>%
  rownames_to_column(var = "Measure") %>%
  gather(Component, value, -Measure) %>%
  arrange(Measure, desc(abs(value))) %>%
  group_by(Measure) %>%
  mutate(Order = row_number()) %>%   # rank loadings within each measure by size
  filter(Order == 1) %>%             # keep each measure's single largest loading
  mutate(Direction = ifelse(value > 0, "Positive", "Negative"))
g <- ggplot(tl, aes(x = value, y = Measure, fill = Direction)) +
  geom_col() +
  facet_wrap(~ Component, scales = "free", ncol = 2) +
  theme_bw() +
  theme(legend.position = "none",   # must be lower-case "none" to suppress the legend
        text = element_text(size = 9),
        strip.background = element_rect(fill = "grey30"),
        strip.text = element_text(color = "white", face = "bold"))
print(g)
Lastly, let’s plug the components calculated by the above PLS into an ordinary linear model, to understand the impact of each component on PDFF.
scores <- model$finalModel$scores %>%
  unclass() %>%
  data.frame()
newdf <- bind_cols(train.data, scores)
# Named m4 so we don't overwrite the lasso model m3 above
m4 <- lm(PDFF ~ Comp.1 + Comp.2 + Comp.3 + Comp.4 + Comp.5 + Comp.6 + Comp.7 + Comp.8 + Comp.9, data = newdf)
summary(m4)
Call:
lm(formula = PDFF ~ Comp.1 + Comp.2 + Comp.3 + Comp.4 + Comp.5 +
Comp.6 + Comp.7 + Comp.8 + Comp.9, data = newdf)
Residuals:
Min 1Q Median 3Q Max
-1.61942 -0.19375 -0.04196 0.12675 1.72864
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.310667 0.007594 172.591 < 2e-16 ***
Comp.1 0.077387 0.003133 24.701 < 2e-16 ***
Comp.2 0.047244 0.005746 8.222 3.87e-16 ***
Comp.3 0.032024 0.005466 5.859 5.55e-09 ***
Comp.4 0.017366 0.006038 2.876 0.00408 **
Comp.5 0.018101 0.007103 2.548 0.01090 *
Comp.6 0.015680 0.007635 2.054 0.04015 *
Comp.7 0.005771 0.006017 0.959 0.33759
Comp.8 0.008243 0.008489 0.971 0.33165
Comp.9 0.008507 0.008223 1.035 0.30102
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3186 on 1750 degrees of freedom
Multiple R-squared: 0.2955, Adjusted R-squared: 0.2919
F-statistic: 81.55 on 9 and 1750 DF, p-value: < 2.2e-16
As expected, the adjusted R^2 is sitting at the 29.2% mark. Here we see that only the first 6 components have p-values below 0.05, and all of the component coefficients relate positively to PDFF. This gives us good insight into which elements of the biochemistry are most interesting when it comes to differences in PDFF. …
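If we wanted the same component-level view on the held-out data, the pls package can project new observations onto the fitted components; a sketch (predict.mvr accepts type = "scores", and the predictor columns must match the training order):

# Project test observations onto the trained PLS components
test.scores <- predict(model$finalModel,
                       newdata = as.matrix(test.data[, -1]),
                       type = "scores") %>%
  unclass() %>% data.frame()
head(test.scores)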
…
…