This markdown serves as a tutorial on the complete process of building a thorough linear regression model, with an example data analysis. This document covers the following fundamentals:
- Data Preparation
- Preliminary Exploratory Analysis
- Model and Variable Selection in Linear Regression
- Residual and Model Diagnostics
- Interpretation of Results
Note: I am very comfortable with R, so that is the language used in these analyses. It is possible and practical to use other scripting/analysis languages.
This section covers how we read and prepare the data for analysis.
The data was provided in .xlsx format. However, I am more comfortable with the .csv format and the data.table package in R, which reads .csv files. It's trivial to convert .xlsx to .csv.
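If you prefer to do the conversion in R itself, the readxl package can read .xlsx files directly; a minimal sketch (the file names here are placeholders for your own):
library(readxl) # reads .xlsx files
sheet <- read_excel("all data.xlsx") # read the first sheet of the Excel file
write.csv(sheet, "all data.csv", row.names = FALSE) # re-save it as .csv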
Although the built-in R data.frame type is adequate for most operations, the data.table package uses multi-threaded CPU operations and fast text parsing. This package is excellent for reading and manipulating large datasets.
# 0) Loading the library
library(data.table) # Excellent for large data operations
# 1) Defining the file path
file_path <- "C:/Users/William/Desktop/Data Analysis/Mohreet/all data.csv"
# 2) Load the data
data <- fread(file_path)
The data has been loaded. We can take a look at the data structure:
str(data) #structure of the data
## Classes 'data.table' and 'data.frame': 209 obs. of 11 variables:
## $ ID : chr "SL25-1006" "SL25-1006" "SL25-1006" "SL25-1006" ...
## $ Type_num: int 1 1 1 1 1 1 1 1 1 1 ...
## $ ROI : chr "LacingulateVt" "RacingulateVt" "RDLPFCVt" "LDLPFCVt" ...
## $ Vt : num 34.1 33.2 24.1 24 33.4 ...
## $ fGM : num 0.628 0.628 0.256 0.296 0.512 0.542 0.542 0.425 0.613 0.613 ...
## $ fWM : num 0.186 0.186 0.732 0.682 0.328 0.141 0.141 0.48 0.17 0.17 ...
## $ fCSF : num 0.186 0.186 0.012 0.022 0.161 0.316 0.316 0.095 0.217 0.217 ...
## $ Glu : num 0 0 5.32 5.07 7.35 ...
## $ Sex : chr "F" "F" "F" "F" ...
## $ Age : num 42 42 42 42 42 21 21 21 28 28 ...
## $ Sex# : num 1 1 1 1 1 1 1 1 2 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
head(data) #display first few rows
## ID Type_num ROI Vt fGM fWM fCSF Glu Sex Age
## 1: SL25-1006 1 LacingulateVt 34.08057 0.628 0.186 0.186 0.000000 F 42
## 2: SL25-1006 1 RacingulateVt 33.20815 0.628 0.186 0.186 0.000000 F 42
## 3: SL25-1006 1 RDLPFCVt 24.07263 0.256 0.732 0.012 5.321472 F 42
## 4: SL25-1006 1 LDLPFCVt 24.03108 0.296 0.682 0.022 5.072136 F 42
## 5: SL25-1006 1 cingulateVt 33.43579 0.512 0.328 0.161 7.351965 F 42
## 6: SL25-1007 1 LacingulateVt 44.35559 0.542 0.141 0.316 4.918276 F 21
## Sex#
## 1: 1
## 2: 1
## 3: 1
## 4: 1
## 5: 1
## 6: 1
There are 11 variables and 209 observations.
Next, we check for missing data:
data[rowSums(is.na(data)) > 0]
## ID Type_num ROI Vt fGM fWM fCSF Glu Sex Age Sex#
## 1: SL25-1046 2 cingulateVt 28.91756 NA NA NA NA F 19 1
Patient SL25-1046 has missing data. Since the nature of the missingness is unclear, we exclude this patient from the analysis. There are ways to impute missing values, but that goes well beyond the scope of this tutorial, and I suspect this is just a clerical error.
data <- data[ID != "SL25-1046"] # This excludes the patient with missing data (all 5 of their rows)
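Side note: if you would rather drop only the incomplete rows instead of the whole patient, complete.cases() does that; a sketch (data_cc is a name made up for illustration):
data_cc <- data[complete.cases(data)] # keeps only rows with no NA values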
Next, we need to see the nature of each variable (what kind of data is each column?):
str(data) #structure of the data
## Classes 'data.table' and 'data.frame': 204 obs. of 11 variables:
## $ ID : chr "SL25-1006" "SL25-1006" "SL25-1006" "SL25-1006" ...
## $ Type_num: int 1 1 1 1 1 1 1 1 1 1 ...
## $ ROI : chr "LacingulateVt" "RacingulateVt" "RDLPFCVt" "LDLPFCVt" ...
## $ Vt : num 34.1 33.2 24.1 24 33.4 ...
## $ fGM : num 0.628 0.628 0.256 0.296 0.512 0.542 0.542 0.425 0.613 0.613 ...
## $ fWM : num 0.186 0.186 0.732 0.682 0.328 0.141 0.141 0.48 0.17 0.17 ...
## $ fCSF : num 0.186 0.186 0.012 0.022 0.161 0.316 0.316 0.095 0.217 0.217 ...
## $ Glu : num 0 0 5.32 5.07 7.35 ...
## $ Sex : chr "F" "F" "F" "F" ...
## $ Age : num 42 42 42 42 42 21 21 21 28 28 ...
## $ Sex# : num 1 1 1 1 1 1 1 1 2 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
unique(data$Type_num) #checking Type_num
## [1] 1 2 3
This is stored as an integer; we need to convert it to a factor.
data[, Type_num := factor(Type_num)]
unique(data$ROI) #checking ROI
## [1] "LacingulateVt" "RacingulateVt" "RDLPFCVt" "LDLPFCVt"
## [5] "cingulateVt"
We have 5 levels. We convert ROI to a factor as well, along with Sex:
data[, ROI := factor(ROI)]
data[, Sex := factor(Sex)]
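Equivalently, all three factor conversions can be done in a single data.table call, which scales better if you have many categorical columns:
cols <- c("Type_num", "ROI", "Sex") # columns to convert
data[, (cols) := lapply(.SD, factor), .SDcols = cols] # convert each to a factor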
Now we look at the data structure again:
str(data)
## Classes 'data.table' and 'data.frame': 204 obs. of 11 variables:
## $ ID : chr "SL25-1006" "SL25-1006" "SL25-1006" "SL25-1006" ...
## $ Type_num: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ ROI : Factor w/ 5 levels "cingulateVt",..: 2 4 5 3 1 2 4 3 2 4 ...
## $ Vt : num 34.1 33.2 24.1 24 33.4 ...
## $ fGM : num 0.628 0.628 0.256 0.296 0.512 0.542 0.542 0.425 0.613 0.613 ...
## $ fWM : num 0.186 0.186 0.732 0.682 0.328 0.141 0.141 0.48 0.17 0.17 ...
## $ fCSF : num 0.186 0.186 0.012 0.022 0.161 0.316 0.316 0.095 0.217 0.217 ...
## $ Glu : num 0 0 5.32 5.07 7.35 ...
## $ Sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 2 2 ...
## $ Age : num 42 42 42 42 42 21 21 21 28 28 ...
## $ Sex# : num 1 1 1 1 1 1 1 1 2 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
We can see that all of the numeric variables are numeric, and all of the categorical variables are now factors.
The last wrangling step is to drop the columns we are not interested in. Sex# appears to be a numeric coding that duplicates Sex, so we remove it:
data[,`Sex#`:= NULL]
Before any analysis is completed, you need to summarize the descriptive characteristics of your study cohort. In health science research, this summary is typically referred to as Table One. You summarize only the independent variables, not the outcome.
Thankfully, in R, the tableone package does this for us. I picked Sex, Age, ROI, and Glu to summarize. You can add more variables if you want.
library(tableone)
CreateTableOne(
vars = c("Sex", "Age", "ROI", "Glu"), #variables to include
data = data #data is data
)
##
## Overall
## n 204
## Sex = M (%) 103 (50.5)
## Age (mean (SD)) 28.73 (7.73)
## ROI (%)
## cingulateVt 40 (19.6)
## LacingulateVt 42 (20.6)
## LDLPFCVt 42 (20.6)
## RacingulateVt 41 (20.1)
## RDLPFCVt 39 (19.1)
## Glu (mean (SD)) 5.81 (1.60)
Conclusion:
- 50.5% of patients are male; the cohort is nicely balanced on sex.
- Mean age is 28.73 with a standard deviation of 7.73 (quite high, to be honest).
- The ROI group percentages are nicely summarized.
- Mean Glu concentration is 5.81 with an SD of 1.60.
No striking imbalance or issue from this exploratory data analysis. We can move on to regression.
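One optional extension: if your cohort has groups (for example, the three Type_num levels), CreateTableOne can stratify the table through its strata argument; a sketch:
CreateTableOne(
vars = c("Sex", "Age", "ROI", "Glu"), # same variables as above
strata = "Type_num", # one column per group, plus p-values
data = data
)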
Now the fun begins. We will do a comprehensive linear regression.
The first thing to do is to fit a base model, so we have something to work with.
model <- lm(Vt ~ Glu + ROI + fGM + fWM + fCSF + Sex + Age, data = data)
summary(model)
##
## Call:
## lm(formula = Vt ~ Glu + ROI + fGM + fWM + fCSF + Sex + Age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.2422 -5.3111 -0.5451 3.9053 28.2416
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.172e+03 1.057e+03 -1.109 0.26892
## Glu -5.013e-01 3.807e-01 -1.317 0.18947
## ROILacingulateVt -3.146e+00 2.152e+00 -1.461 0.14551
## ROILDLPFCVt -5.233e+00 1.733e+00 -3.019 0.00288 **
## ROIRacingulateVt -2.627e+00 2.260e+00 -1.162 0.24653
## ROIRDLPFCVt -3.896e+00 2.068e+00 -1.884 0.06112 .
## fGM 1.202e+03 1.058e+03 1.136 0.25732
## fWM 1.211e+03 1.058e+03 1.145 0.25364
## fCSF 1.222e+03 1.059e+03 1.154 0.24973
## SexM 7.175e-01 1.141e+00 0.629 0.53036
## Age 2.347e-02 7.314e-02 0.321 0.74865
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.692 on 193 degrees of freedom
## Multiple R-squared: 0.08208, Adjusted R-squared: 0.03452
## F-statistic: 1.726 on 10 and 193 DF, p-value: 0.07736
Then we need to check the linearity of this model. In other words, does this linear model actually represent a linear relationship?
The best way to check is to plot the residuals (the distances between the observations and the fitted values) against the fitted values themselves. This is a purely graphical check. Ideally, the points should scatter randomly around 0, and the fitted line should be roughly flat around 0.
plot(model, which = 1)
Conclusion: the linearity assumption holds and the model is appropriate. There is slight curvature around the extremities, but it could be due to outliers, which we will check next.
We can also check linearity variable by variable with component-plus-residual plots from the car package:
library(car)
## Loading required package: carData
crPlots(model)
Conclusion: the linearity assumption holds and a linear model is appropriate. There is slight curvature around the extremities of the data, but it could be due to outliers, which we will check next.
In linear regression, the influential outliers in your data can be identified by calculating Cook's distance.
The rule of thumb for flagging an influential observation is Cook's D > 4/n, where n is the number of observations.
cooksd <- cooks.distance(model)
which(cooksd > 4/length(cooksd))
## 83 88 90 101 103 104 132 137 177 201
## 83 88 90 101 103 104 132 137 177 201
Conclusion: There are 10 influential observations (83, 88, 90, 101, 103, 104, 132, 137, 177, 201). Let's remove them from the data set to obtain a clean dataset. However, you might want to keep them; use your own best judgement. Usually, you run one model with the outliers and one model without. This is called a sensitivity analysis, and whether you need it is something you should discuss with your prof. A sketch of that comparison follows the removal code below.
influential <- c(83, 88, 90, 101, 103, 104, 132, 137, 177, 201) # indices of the influential observations
data_clean <- data[-influential, ] # remove the influential observations
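As promised, a minimal sketch of the sensitivity analysis (the names model_all and model_noout are made up here for illustration):
# Same formula fitted with and without the influential observations
model_all <- lm(Vt ~ Glu + ROI + fGM + fWM + fCSF + Sex + Age, data = data)
model_noout <- lm(Vt ~ Glu + ROI + fGM + fWM + fCSF + Sex + Age, data = data_clean)
# Side-by-side coefficients; large shifts mean the outliers drive the results
round(cbind(with_outliers = coef(model_all), without = coef(model_noout)), 3)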
Next we check for multicollinearity, i.e., whether the independent variables are correlated with each other. It's a standard check.
The score to calculate is the variance inflation factor (VIF). Essentially, any VIF above 5 requires investigation, and above 10 is very concerning.
library(car)
vif(model)
## GVIF Df GVIF^(1/(2*Df))
## Glu 1.274729e+00 1 1.129039
## ROI 3.074314e+00 4 1.150717
## fGM 6.311103e+04 1 251.219079
## fWM 1.652170e+05 1 406.468985
## fCSF 3.964789e+04 1 199.117786
## Sex 1.122961e+00 1 1.059699
## Age 1.098074e+00 1 1.047890
Conclusion: fGM, fWM, and fCSF are very problematic; they are all collinear with each other. This makes sense: the three tissue fractions appear to sum to 1 in each row (look at the output of head(data) above), so any one of them is a linear combination of the other two. For this exercise we remove these variables from our model.
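A quick way to verify this suspicion is to check whether the three fractions always sum to (approximately) 1; if so, the collinearity is structural:
data_clean[, range(fGM + fWM + fCSF)] # a range of ~1 to ~1 confirms it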
After the above diagnostic tests, we now have the following model:
model <- lm(Vt ~ Glu + ROI + Sex + Age, data = data_clean)
summary(model)
##
## Call:
## lm(formula = Vt ~ Glu + ROI + Sex + Age, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7636 -5.0470 -0.2673 4.1303 15.9254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.423092 2.701769 13.481 < 2e-16 ***
## Glu -0.431711 0.320378 -1.348 0.179457
## ROILacingulateVt -3.525574 1.548742 -2.276 0.023961 *
## ROILDLPFCVt -5.407170 1.500213 -3.604 0.000402 ***
## ROIRacingulateVt -5.271128 1.518445 -3.471 0.000644 ***
## ROIRDLPFCVt -4.353497 1.534329 -2.837 0.005053 **
## SexM 1.213178 1.004588 1.208 0.228720
## Age 0.002149 0.063346 0.034 0.972970
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.659 on 186 degrees of freedom
## Multiple R-squared: 0.09688, Adjusted R-squared: 0.06289
## F-statistic: 2.85 on 7 and 186 DF, p-value: 0.007568
Now we have a preliminary base model. We have to explore the following question: "Does the effect of Glu vary with other variables?" This is called effect modification, or interaction.
You typically only check interactions that are biologically plausible. Here, we will check for an interaction between Glu and ROI.
The way to do this is to fit the model once for each candidate interaction with the main regressor Glu, then compare each fit against the base model. Typically, this comparison uses the Akaike Information Criterion (AIC); this is standard practice.
The lower the AIC, the better the model. So if the AIC for the interaction model is lower than the base model's, use the interaction model, and vice versa. One caveat: both models must be fitted to the same observations, or their AICs are not comparable.
# base model
model <- lm(Vt ~ Glu + ROI + Sex + Age, data=data_clean)
# interaction model (fitted to the same data as the base model, otherwise the AICs are not comparable)
model_int <- lm(Vt ~ Glu*ROI + Sex + Age, data=data_clean)
# compare
AIC(model, model_int)
Conclusion: The AIC does not improve with the interaction terms. We detected no statistical interaction, so we do not need to consider effect modification.
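For nested models like these, an alternative to AIC is a direct F-test via anova(); a quick sketch:
anova(model, model_int) # tests whether the interaction terms significantly improve the fit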
The last step is to refine the model. Basically, you eliminate the insignificant covariates one at a time, refitting after each removal, until the remaining covariates are significant; the main exposure (Glu) stays in throughout because it is the question of interest. This is known as backward elimination.
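Aside: R can automate backward selection with the built-in step() function. Note that step() drops terms by AIC rather than by p-value, so it can settle on a different model than the manual procedure below (model_auto is a name made up for illustration):
model_auto <- step(lm(Vt ~ Glu + ROI + Sex + Age, data = data_clean),
direction = "backward", trace = 0) # trace = 0 silences the step-by-step log
formula(model_auto) # the formula step() settles on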
Iteration 0: Base Model
# base model
model <- lm(Vt ~ Glu + ROI + Sex + Age, data=data_clean)
summary(model)
##
## Call:
## lm(formula = Vt ~ Glu + ROI + Sex + Age, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7636 -5.0470 -0.2673 4.1303 15.9254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.423092 2.701769 13.481 < 2e-16 ***
## Glu -0.431711 0.320378 -1.348 0.179457
## ROILacingulateVt -3.525574 1.548742 -2.276 0.023961 *
## ROILDLPFCVt -5.407170 1.500213 -3.604 0.000402 ***
## ROIRacingulateVt -5.271128 1.518445 -3.471 0.000644 ***
## ROIRDLPFCVt -4.353497 1.534329 -2.837 0.005053 **
## SexM 1.213178 1.004588 1.208 0.228720
## Age 0.002149 0.063346 0.034 0.972970
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.659 on 186 degrees of freedom
## Multiple R-squared: 0.09688, Adjusted R-squared: 0.06289
## F-statistic: 2.85 on 7 and 186 DF, p-value: 0.007568
Conclusion: Age has the highest p-value, so let's remove it.
Iteration 1: Age Removed
model_1 <- lm(Vt ~ Glu + ROI + Sex, data=data_clean)
summary(model_1)
##
## Call:
## lm(formula = Vt ~ Glu + ROI + Sex, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7820 -5.0455 -0.2624 4.1152 15.9316
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.4819 2.0675 17.645 < 2e-16 ***
## Glu -0.4319 0.3195 -1.352 0.177992
## ROILacingulateVt -3.5257 1.5446 -2.283 0.023581 *
## ROILDLPFCVt -5.4075 1.4962 -3.614 0.000387 ***
## ROIRacingulateVt -5.2717 1.5143 -3.481 0.000621 ***
## ROIRDLPFCVt -4.3526 1.5300 -2.845 0.004939 **
## SexM 1.2220 0.9678 1.263 0.208305
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.642 on 187 degrees of freedom
## Multiple R-squared: 0.09688, Adjusted R-squared: 0.0679
## F-statistic: 3.343 on 6 and 187 DF, p-value: 0.003753
Conclusion: Sex has a high p-value, so let's remove it as well.
Iteration 2: Sex removed
model_2 <- lm(Vt ~ Glu + ROI, data=data_clean)
summary(model_2)
##
## Call:
## lm(formula = Vt ~ Glu + ROI, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.2025 -5.3495 -0.2803 4.1126 15.3282
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.7350 2.0610 17.824 < 2e-16 ***
## Glu -0.3651 0.3155 -1.157 0.248699
## ROILacingulateVt -3.5992 1.5459 -2.328 0.020967 *
## ROILDLPFCVt -5.4004 1.4985 -3.604 0.000401 ***
## ROIRacingulateVt -5.2937 1.5166 -3.491 0.000601 ***
## ROIRDLPFCVt -4.4417 1.5308 -2.902 0.004156 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.652 on 188 degrees of freedom
## Multiple R-squared: 0.08918, Adjusted R-squared: 0.06495
## F-statistic: 3.681 on 5 and 188 DF, p-value: 0.003348
We stop the backward elimination here: all remaining covariates (the ROI terms) are significant, and Glu stays in regardless because it is the main exposure.
Our final model is: Vt ~ Glu + ROI
Recall the analysis result:
summary(model_2)
##
## Call:
## lm(formula = Vt ~ Glu + ROI, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.2025 -5.3495 -0.2803 4.1126 15.3282
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.7350 2.0610 17.824 < 2e-16 ***
## Glu -0.3651 0.3155 -1.157 0.248699
## ROILacingulateVt -3.5992 1.5459 -2.328 0.020967 *
## ROILDLPFCVt -5.4004 1.4985 -3.604 0.000401 ***
## ROIRacingulateVt -5.2937 1.5166 -3.491 0.000601 ***
## ROIRDLPFCVt -4.4417 1.5308 -2.902 0.004156 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.652 on 188 degrees of freedom
## Multiple R-squared: 0.08918, Adjusted R-squared: 0.06495
## F-statistic: 3.681 on 5 and 188 DF, p-value: 0.003348
The main exposure Glu is not statistically significantly associated with changes in Vt levels (p = 0.25).
ROI location was significantly associated with lower Vt levels relative to the cingulate region (cingulateVt is the reference category):
- LacingulateVt is associated with a -3.5992 unit change in Vt
- LDLPFCVt is associated with a -5.4004 unit change in Vt
- RacingulateVt is associated with a -5.2937 unit change in Vt
- RDLPFCVt is associated with a -4.4417 unit change in Vt
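For reporting, point estimates are usually presented with 95% confidence intervals; a quick sketch using the final model:
round(cbind(estimate = coef(model_2), confint(model_2)), 2) # estimates with 95% CIs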
Keep in mind, there could be other factors that impact these results:
- Low sample size
- Biases
- Random effects: each patient contributes several ROI rows, so the observations are not fully independent
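If the within-patient correlation just mentioned is a concern, one common remedy is a linear mixed model with a random intercept per patient; a minimal sketch, assuming the lme4 package is installed:
library(lme4) # mixed-effects models
model_mixed <- lmer(Vt ~ Glu + ROI + (1 | ID), data = data_clean) # random intercept per patient
summary(model_mixed)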
Treat this analysis as a tutorial, not as your actual data analysis.