In my last published analysis, I tried to ascertain whether the baseline value of hemoglobin affected treatment effectiveness in the CAMP teaching dataset. In this new analysis, we’re going to build a model that takes all the measured baseline characteristics into account, to see whether they can explain the differences in treatment effectiveness among the groups.
In order to learn how to do this, I’ve been watching Brandon Foltz’s Statistics 101: Multiple Linear Regression Playlist, as this is a subject I haven’t been exposed to before. I’m going to try to follow all the advice he gives.
To start, we’re going to read the data from the CAMP teaching dataset. As I’ve detailed this step before, I’m not going to repeat the code here.
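For readers who haven’t seen that earlier post, here is a minimal sketch of the import step (the file name is an assumption, not the actual path):

require(tidyverse)

# Hypothetical file name; substitute the actual path to the CAMP teaching file
read_csv("camp_teach.csv") -> df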
Let’s then create a new variable that measures the impact of treatment and select only the variables that we’re going to use in our analysis.
require(tidyverse)

# Treatment effect: change in PREFEVPP between baseline and 72 months
df %>%
  mutate(EffectPREFEVPP = PREFEVPP_72MO - PREFEVPP_BAS) %>%
  select(id, TG, GENDER, ETHNIC, hemog, wbc, agehome, anypet, woodstove,
         dehumid, parent_smokes, any_smokes, EffectPREFEVPP) -> df2
head(df2,10)
## # A tibble: 10 × 13
## id TG GENDER ETHNIC hemog wbc agehome anypet woodst…¹ dehumid paren…²
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2 1 3 12.5 65 50 1 2 2 1
## 2 2 2 1 1 12.5 82 34 1 2 1 2
## 3 4 2 2 4 13.6 54 34 2 2 2 2
## 4 5 2 1 2 13.8 66 14 1 2 2 2
## 5 6 2 1 4 13 114 20 2 2 2 1
## 6 9 1 2 4 12.6 52 5 1 2 2 2
## 7 10 3 1 4 14.4 97 25 1 2 2 2
## 8 12 3 1 4 12.6 60 13 1 2 2 1
## 9 13 3 1 4 11.9 62 86 2 2 2 1
## 10 14 2 2 4 12 64 38 1 2 2 2
## # … with 2 more variables: any_smokes <dbl>, EffectPREFEVPP <dbl>, and
## # abbreviated variable names ¹woodstove, ²parent_smokes
Unfortunately, many of the variables aren’t formatted in a way that allows us to use them in linear regression models. We will need to transform the categorical variables into dummy variables, coding them with 0s and 1s.
Looking at the documentation, we can see that, for dichotomous variables, they have used 1 for “Yes” and 2 for “No”. Let’s change this to 0 for “No” and 1 for “Yes”.
df2 %>% mutate(
  # Dichotomous variables: recode from 1 = Yes / 2 = No to 1 = Yes / 0 = No
  woodstove = case_when(woodstove == 2 ~ 0, woodstove == 1 ~ 1),
  dehumid = case_when(dehumid == 2 ~ 0, dehumid == 1 ~ 1),
  anypet = case_when(anypet == 2 ~ 0, anypet == 1 ~ 1),
  parent_smokes = case_when(parent_smokes == 2 ~ 0, parent_smokes == 1 ~ 1),
  ismale = case_when(GENDER == 2 ~ 0, GENDER == 1 ~ 1),
  # Ethnicity: one dummy per category; the remaining category is the reference
  iswhite = case_when(ETHNIC == 1 ~ 1, .default = 0),
  isblack = case_when(ETHNIC == 2 ~ 1, .default = 0),
  ishispanic = case_when(ETHNIC == 3 ~ 1, .default = 0),
  any_smokes = case_when(any_smokes == 2 ~ 0, any_smokes == 1 ~ 1),
  # Treatment group: dummies for the two active treatments; the remaining group is the reference
  isBudesonide = case_when(TG == 1 ~ 1, .default = 0),
  isNedocromil = case_when(TG == 2 ~ 1, .default = 0),
  .after = agehome) %>%
  select(-id, -TG, -GENDER, -ETHNIC) %>%
  relocate(EffectPREFEVPP) -> df3
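As a quick sanity check (my addition, not a step from the playlist), we can confirm that a recoded column now contains only 0s, 1s and possibly NAs:

# Each dummy should now contain only 0, 1 and possibly NA
df3 %>% count(anypet)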
Now that this is done, we need to evaluate the relationships of the independent variables among themselves and with the dependent variable. To do that, we’re going to use the base R function cor() to create a correlation matrix, and then we’re going to use a package called “corrplot” to plot it in a way that is easier to visualize.
We’re going to create a plot where the size of each circle expresses the strength of a correlation, the color indicates its direction, and, where the correlation is statistically significant, the correlation coefficient is printed inside the circle.
require(corrplot)

# Correlation matrix, using all pairwise-complete observations
cor(df3, use = "pairwise.complete.obs") -> cors

# p-values (and confidence intervals) for each pairwise correlation
cor.mtest(df3, conf.level = 0.95) -> confcors

# Circle size encodes strength, color encodes direction; coefficients are
# printed only for significant correlations (insignificant cells left blank)
corrplot(cors, p.mat = confcors$p, method = 'circle', type = 'lower', insig = 'blank',
         addCoef.col = 'black', number.cex = 0.5, order = 'AOE', diag = FALSE)
With this, we can see that, individually, only having a pet in the house is significantly correlated with our dependent variable, and even that correlation is weak. None of the independent variables, on its own, is a good predictor of the dependent variable.
This kind of result is to be expected, as the data has been altered to preserve anonymity. If more variables had reached significance in their correlations with the dependent variable, it would be wise to build a multiple regression model that included them. To do so, we would need to test the assumptions of no multicollinearity (the correlation matrix above is enough for that) and homoscedasticity. We’ve previously tested the normality of our dependent variable.
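For illustration, here is a sketch of how the homoscedasticity check could look, assuming we fit a simple model with the one significant predictor (the choice of anypet and the use of the lmtest package are my own, not part of the analysis above):

require(lmtest)

# Hypothetical model using the only significant predictor
lm(EffectPREFEVPP ~ anypet, data = df3) -> fit

# Visual check: residuals vs. fitted values should show no funnel shape
plot(fit, which = 1)

# Breusch-Pagan test: a small p-value would suggest heteroscedasticity
bptest(fit)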