This markdown serves as a tutorial on the complete process of a thorough linear regression analysis, with an example data set. This document covers the following fundamentals:
Data Preparation
Preliminary Exploratory Analysis
Model and Variable Selection in Linear Regression
Residual and Model Diagnostics
Interpretation of Results
Note: I am very comfortable with R, so that is the language used in these analyses. It is possible and practical to use other scripting/analysis languages.
This section covers how we read the data and prepare it for the analysis.
The data was provided in .xlsx format. However, I am more comfortable with the .csv format and the data.table package in R, which reads .csv files. It's trivial to convert .xlsx to .csv.
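A minimal sketch of that conversion, assuming the readxl package is available and using hypothetical file names:
library(readxl) # assumed package for reading .xlsx files
xlsx_data <- read_excel("all data.xlsx") # hypothetical source file
write.csv(xlsx_data, "all data.csv", row.names = FALSE) # save as .csv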
Although the built-in R data.frame type is adequate for most operations, the data.table package uses multi-threaded CPU operations and fast, vectorized text parsing. This package is excellent for manipulating large data sets.
# 0) Load the library
library(data.table) # excellent for large data operations
# 1) Define the file path
file_path <- "C:/Users/William/Desktop/Data Analysis/Mohreet/all data.csv"
# 2) Load the data
data <- fread(file_path)
The data has been loaded. We can take a look at the data structure:
str(data) #structure of the data
## Classes 'data.table' and 'data.frame': 209 obs. of 11 variables:
## $ ID : chr "SL25-1006" "SL25-1006" "SL25-1006" "SL25-1006" ...
## $ Type_num: int 1 1 1 1 1 1 1 1 1 1 ...
## $ ROI : chr "LacingulateVt" "RacingulateVt" "RDLPFCVt" "LDLPFCVt" ...
## $ Vt : num 34.1 33.2 24.1 24 33.4 ...
## $ fGM : num 0.628 0.628 0.256 0.296 0.512 0.542 0.542 0.425 0.613 0.613 ...
## $ fWM : num 0.186 0.186 0.732 0.682 0.328 0.141 0.141 0.48 0.17 0.17 ...
## $ fCSF : num 0.186 0.186 0.012 0.022 0.161 0.316 0.316 0.095 0.217 0.217 ...
## $ Glu : num 0 0 5.32 5.07 7.35 ...
## $ Sex : chr "F" "F" "F" "F" ...
## $ Age : num 42 42 42 42 42 21 21 21 28 28 ...
## $ Sex# : num 1 1 1 1 1 1 1 1 2 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
head(data) #display first few rows
## ID Type_num ROI Vt fGM fWM fCSF Glu Sex Age
## 1: SL25-1006 1 LacingulateVt 34.08057 0.628 0.186 0.186 0.000000 F 42
## 2: SL25-1006 1 RacingulateVt 33.20815 0.628 0.186 0.186 0.000000 F 42
## 3: SL25-1006 1 RDLPFCVt 24.07263 0.256 0.732 0.012 5.321472 F 42
## 4: SL25-1006 1 LDLPFCVt 24.03108 0.296 0.682 0.022 5.072136 F 42
## 5: SL25-1006 1 cingulateVt 33.43579 0.512 0.328 0.161 7.351965 F 42
## 6: SL25-1007 1 LacingulateVt 44.35559 0.542 0.141 0.316 4.918276 F 21
## Sex#
## 1: 1
## 2: 1
## 3: 1
## 4: 1
## 5: 1
## 6: 1
There appear to be 11 variables and 209 observations.
Next, we check for missing data:
data[rowSums(is.na(data)) > 0]
## ID Type_num ROI Vt fGM fWM fCSF Glu Sex Age Sex#
## 1: SL25-1046 2 cingulateVt 28.91756 NA NA NA NA F 19 1
Patient SL25-1046 has missing data. Since the nature of the missingness is unclear, we exclude this patient from the analysis. There are ways to impute missing values, but that goes well beyond our scope, and I suspect this is just a clerical error.
data <- data[ID != "SL25-1046"] # this excludes the patient with missing data
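A quick sanity check, using base R, that no missing values remain:
anyNA(data) # should now return FALSE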
Next, we need to see the nature of each variable (what kind of data are they?)
str(data) #structure of the data
## Classes 'data.table' and 'data.frame': 204 obs. of 11 variables:
## $ ID : chr "SL25-1006" "SL25-1006" "SL25-1006" "SL25-1006" ...
## $ Type_num: int 1 1 1 1 1 1 1 1 1 1 ...
## $ ROI : chr "LacingulateVt" "RacingulateVt" "RDLPFCVt" "LDLPFCVt" ...
## $ Vt : num 34.1 33.2 24.1 24 33.4 ...
## $ fGM : num 0.628 0.628 0.256 0.296 0.512 0.542 0.542 0.425 0.613 0.613 ...
## $ fWM : num 0.186 0.186 0.732 0.682 0.328 0.141 0.141 0.48 0.17 0.17 ...
## $ fCSF : num 0.186 0.186 0.012 0.022 0.161 0.316 0.316 0.095 0.217 0.217 ...
## $ Glu : num 0 0 5.32 5.07 7.35 ...
## $ Sex : chr "F" "F" "F" "F" ...
## $ Age : num 42 42 42 42 42 21 21 21 28 28 ...
## $ Sex# : num 1 1 1 1 1 1 1 1 2 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
unique(data$Type_num) #checking Type_num
## [1] 1 2 3
This is stored as an integer; we need to convert it to a factor.
data[, Type_num := factor(Type_num)]
unique(data$ROI) #checking ROI
## [1] "LacingulateVt" "RacingulateVt" "RDLPFCVt" "LDLPFCVt"
## [5] "cingulateVt"
We have 5 levels. We convert ROI to a factor as well, along with Sex:
data[, ROI := factor(ROI)]
data[, Sex := factor(Sex)]
Now we look at the data structure again:
str(data)
## Classes 'data.table' and 'data.frame': 204 obs. of 11 variables:
## $ ID : chr "SL25-1006" "SL25-1006" "SL25-1006" "SL25-1006" ...
## $ Type_num: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ ROI : Factor w/ 5 levels "cingulateVt",..: 2 4 5 3 1 2 4 3 2 4 ...
## $ Vt : num 34.1 33.2 24.1 24 33.4 ...
## $ fGM : num 0.628 0.628 0.256 0.296 0.512 0.542 0.542 0.425 0.613 0.613 ...
## $ fWM : num 0.186 0.186 0.732 0.682 0.328 0.141 0.141 0.48 0.17 0.17 ...
## $ fCSF : num 0.186 0.186 0.012 0.022 0.161 0.316 0.316 0.095 0.217 0.217 ...
## $ Glu : num 0 0 5.32 5.07 7.35 ...
## $ Sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 2 2 ...
## $ Age : num 42 42 42 42 42 21 21 21 28 28 ...
## $ Sex# : num 1 1 1 1 1 1 1 1 2 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
We can see that all of the numeric variables are numeric, and all of the categorical variables are converted to factors.
The last wrangling step is to keep only the columns we are interested in. Sex# duplicates Sex as a numeric code, so we drop it:
data[, `Sex#` := NULL]
Before any analysis is completed, you need to summarize the descriptive characteristics of your study cohort. This summary is typically referred to as Table One in health science research. You summarize only the independent variables, not the outcome.
Thankfully, in R, the tableone package does this for us. I picked Sex, Age, ROI, and Glu to summarize. You can add additional variables if you want.
library(tableone)
CreateTableOne(
vars = c("Sex", "Age", "ROI", "Glu"), # variables to include
data = data # the prepared data
)
##
## Overall
## n 204
## Sex = M (%) 103 (50.5)
## Age (mean (SD)) 28.73 (7.73)
## ROI (%)
## cingulateVt 40 (19.6)
## LacingulateVt 42 (20.6)
## LDLPFCVt 42 (20.6)
## RacingulateVt 41 (20.1)
## RDLPFCVt 39 (19.1)
## Glu (mean (SD)) 5.81 (1.60)
Conclusion:
50.5% of patients are male; the patient sex of this cohort is nicely balanced.
Mean age is 28.73 with a standard deviation of 7.73 (quite high, to be honest).
The ROI group percentages are nicely summarized.
Mean Glu concentration is 5.81 with an SD of 1.60.
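In practice, Table One is often also stratified by exposure group. A sketch using tableone's strata argument, here stratifying by Type_num:
CreateTableOne(
vars = c("Sex", "Age", "ROI", "Glu"), # same variables as above
strata = "Type_num", # one summary column per Type_num group
data = data
)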
No striking imbalance or issue from this exploratory data analysis. We can move on to regression.
Now the fun begins. We will do a comprehensive linear regression:
The first thing to do is to fit a base model, so we have something to work on.
model <- lm(Vt ~ Type_num + ROI + fGM + fWM + fCSF + Sex + Age, data = data)
summary(model)
##
## Call:
## lm(formula = Vt ~ Type_num + ROI + fGM + fWM + fCSF + Sex + Age,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.871 -5.066 -0.380 4.061 26.698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -927.77716 1059.76044 -0.875 0.38242
## Type_num2 -2.35550 1.19478 -1.971 0.05011 .
## Type_num3 0.49780 1.49175 0.334 0.73897
## ROILacingulateVt -3.25671 2.13573 -1.525 0.12894
## ROILDLPFCVt -5.08325 1.71853 -2.958 0.00349 **
## ROIRacingulateVt -2.33782 2.23586 -1.046 0.29706
## ROIRDLPFCVt -3.90381 2.05508 -1.900 0.05899 .
## fGM 955.83521 1059.90847 0.902 0.36829
## fWM 964.46302 1060.12684 0.910 0.36409
## fCSF 971.56415 1060.78693 0.916 0.36087
## SexM 0.42573 1.11638 0.381 0.70337
## Age 0.03060 0.07271 0.421 0.67431
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.641 on 192 degrees of freedom
## Multiple R-squared: 0.09872, Adjusted R-squared: 0.04709
## F-statistic: 1.912 on 11 and 192 DF, p-value: 0.03992
Then we need to check the linearity of this model. In other words, does this linear model actually represent a linear relationship?
The best way to check is to plot the residuals (the distances between the observations and their fitted values) against the fitted values themselves. This is a purely graphical check. Ideally, we should see the observations scatter around 0, and the fitted line should be roughly flat around 0.
plot(model, which = 1)
Conclusion: the linearity assumption holds and this model is appropriate. There is slight curvature around the extremities, but that could be due to outliers, which we will check next. As a second linearity check, the car package's component + residual plots examine each predictor individually:
library(car)
## Loading required package: carData
crPlots(model)
Conclusion: the linearity assumption holds and the linear model is appropriate. There is slight curvature around the extremities of the data, but it could be due to outliers, which we check next.
In linear regression, influential outliers in your data can be identified by calculating Cook's distance. The rule of thumb for flagging influential outliers with Cook's distance is:
Cook's D > 4/n, where n is the number of observations.
cooksd <- cooks.distance(model)
which(cooksd > 4/length(cooksd))
## 83 88 90 104 132 137 177 201
## 83 88 90 104 132 137 177 201
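As a visual companion to the numeric check, base R's built-in diagnostic plot shows Cook's distance per observation; a sketch that also draws the 4/n threshold as a dashed line:
plot(model, which = 4) # Cook's distance for each observation
abline(h = 4/length(cooksd), lty = 2) # rule-of-thumb threshold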
Conclusion: there are 10 influential observations (83, 88, 90, 101, 103, 104, 132, 137, 177, 201). Let's remove them from the data set to have clean data. However, you might want to keep them; use your own best judgement. Usually, you run one model with outliers and one model without outliers. This is called a sensitivity analysis, and whether you need it is something you should discuss with your prof.
influential <- c(83, 88, 90, 101, 103, 104, 132, 137, 177, 201) # row indices of the influential observations
data_clean <- data[-influential, ] # remove the influential observations
Next we check for multicollinearity, i.e., whether the independent variables are correlated among themselves. It's a standard check.
The score to calculate is called the variance inflation factor (VIF). Essentially, any VIF above 5 requires investigation; above 10 is very concerning.
library(car)
vif(model)
## GVIF Df GVIF^(1/(2*Df))
## Type_num 1.029448e+00 2 1.007282
## ROI 2.966310e+00 4 1.145584
## fGM 6.422003e+04 1 253.416711
## fWM 1.681278e+05 1 410.033947
## fCSF 4.033827e+04 1 200.843895
## Sex 1.088414e+00 1 1.043271
## Age 1.099535e+00 1 1.048587
Conclusion: fGM, fWM, and fCSF are very problematic. The reason is visible in the data itself: the three tissue fractions sum to 1 in every row, so they are perfectly collinear with each other. For this exercise we remove these variables from our model.
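We can verify that directly; a quick check of the row-wise sums using data.table syntax:
data[, range(fGM + fWM + fCSF)] # both values should be approximately 1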
After the above diagnostic tests, we now have the following model:
model <- lm(Vt ~ Type_num + ROI + Sex + Age, data = data_clean)
summary(model)
##
## Call:
## lm(formula = Vt ~ Type_num + ROI + Sex + Age, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5180 -4.9118 -0.4532 4.3749 15.4011
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.934522 2.115718 16.512 < 2e-16 ***
## Type_num2 -2.208457 1.062025 -2.079 0.038952 *
## Type_num3 -0.886501 1.310477 -0.676 0.499586
## ROILacingulateVt -3.966006 1.502424 -2.640 0.009005 **
## ROILDLPFCVt -5.162124 1.483815 -3.479 0.000628 ***
## ROIRacingulateVt -5.276430 1.513147 -3.487 0.000610 ***
## ROIRDLPFCVt -4.468439 1.523473 -2.933 0.003781 **
## SexM 0.980771 0.987036 0.994 0.321690
## Age 0.007976 0.063111 0.126 0.899572
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.633 on 185 degrees of freedom
## Multiple R-squared: 0.109, Adjusted R-squared: 0.07043
## F-statistic: 2.828 on 8 and 185 DF, p-value: 0.005538
Now we have a preliminary base model. We have to explore the following question: does the effect of Type_num vary with other variables? This is called effect modification, or interaction.
You typically only check interactions that are biologically plausible. Here, we will check for an interaction between Type_num and ROI.
The way to do this is to run the model once for each candidate interaction with the main regressor Type_num, and then compare against the base model. Typically, this is done through the Akaike Information Criterion (AIC). This is standard practice.
The lower the AIC, the better the model. So if the AIC for the interaction model is lower than the base model's, use the interaction model, and vice versa.
# base model
model <- lm(Vt ~ Type_num + ROI + Sex + Age, data = data_clean)
# interaction model (must be fitted to the same data as the base model)
model_int <- lm(Vt ~ Type_num * ROI + Sex + Age, data = data_clean)
# compare
AIC(model, model_int)
## df AIC
## model 10 1295.429
## model_int 18 1416.000
Conclusion: the interaction model has a higher AIC than the base model, so the interaction terms do not improve the fit. We detected no meaningful interaction and do not need to consider effect modification.
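Since the base and interaction models are nested, an F-test is another standard comparison; a minimal sketch using base R's anova():
anova(model, model_int) # F-test for the added interaction terms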
The last step now is to refine the model. Basically, you eliminate the insignificant covariates one at a time until all variables (main exposure + covariates) are significant. This is called backward elimination; R can automate it, as sketched below, but here we will also walk through it by hand.
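A sketch of the automated version using base R's step(). Note that step() minimizes AIC rather than p-values, so it may keep or drop different variables than the manual procedure that follows:
step(model, direction = "backward") # automated backward elimination by AIC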
Iteration 0: Base Model
# base model
model <- lm(Vt ~ Type_num + ROI + Sex + Age, data=data_clean)
summary(model)
##
## Call:
## lm(formula = Vt ~ Type_num + ROI + Sex + Age, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5180 -4.9118 -0.4532 4.3749 15.4011
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.934522 2.115718 16.512 < 2e-16 ***
## Type_num2 -2.208457 1.062025 -2.079 0.038952 *
## Type_num3 -0.886501 1.310477 -0.676 0.499586
## ROILacingulateVt -3.966006 1.502424 -2.640 0.009005 **
## ROILDLPFCVt -5.162124 1.483815 -3.479 0.000628 ***
## ROIRacingulateVt -5.276430 1.513147 -3.487 0.000610 ***
## ROIRDLPFCVt -4.468439 1.523473 -2.933 0.003781 **
## SexM 0.980771 0.987036 0.994 0.321690
## Age 0.007976 0.063111 0.126 0.899572
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.633 on 185 degrees of freedom
## Multiple R-squared: 0.109, Adjusted R-squared: 0.07043
## F-statistic: 2.828 on 8 and 185 DF, p-value: 0.005538
Conclusion: Age has the highest p-value; let's remove it.
Iteration 1: Age Removed
model_1 <- lm(Vt ~ Type_num + ROI + Sex, data=data_clean)
summary(model_1)
##
## Call:
## lm(formula = Vt ~ Type_num + ROI + Sex, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5052 -4.9049 -0.4656 4.3497 15.4246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.1468 1.2829 27.396 < 2e-16 ***
## Type_num2 -2.2043 1.0587 -2.082 0.038705 *
## Type_num3 -0.8859 1.3070 -0.678 0.498747
## ROILacingulateVt -3.9673 1.4984 -2.648 0.008800 **
## ROILDLPFCVt -5.1629 1.4799 -3.489 0.000606 ***
## ROIRacingulateVt -5.2787 1.5090 -3.498 0.000586 ***
## ROIRDLPFCVt -4.4656 1.5193 -2.939 0.003707 **
## SexM 1.0131 0.9508 1.065 0.288050
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.615 on 186 degrees of freedom
## Multiple R-squared: 0.1089, Adjusted R-squared: 0.07534
## F-statistic: 3.247 on 7 and 186 DF, p-value: 0.002832
Conclusion: Sex has a high p-value; let's remove it.
Iteration 2: Sex Removed
model_2 <- lm(Vt ~ Type_num + ROI, data=data_clean)
summary(model_2)
##
## Call:
## lm(formula = Vt ~ Type_num + ROI, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9921 -5.2949 -0.2151 4.7763 14.9398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.6587 1.1900 29.965 < 2e-16 ***
## Type_num2 -2.1986 1.0591 -2.076 0.039262 *
## Type_num3 -0.8585 1.3072 -0.657 0.512161
## ROILacingulateVt -3.9668 1.4989 -2.646 0.008830 **
## ROILDLPFCVt -5.1880 1.4802 -3.505 0.000572 ***
## ROIRacingulateVt -5.2906 1.5095 -3.505 0.000572 ***
## ROIRDLPFCVt -4.5200 1.5190 -2.976 0.003309 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.617 on 187 degrees of freedom
## Multiple R-squared: 0.1034, Adjusted R-squared: 0.07467
## F-statistic: 3.596 on 6 and 187 DF, p-value: 0.002134
We stop the backward elimination here; Type_num is our main exposure, so we keep it even though one of its levels (Type_num3) is not significant.
Our final model is: Vt ~ Type_num + ROI
Recall the analysis result:
summary(model_2)
##
## Call:
## lm(formula = Vt ~ Type_num + ROI, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9921 -5.2949 -0.2151 4.7763 14.9398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.6587 1.1900 29.965 < 2e-16 ***
## Type_num2 -2.1986 1.0591 -2.076 0.039262 *
## Type_num3 -0.8585 1.3072 -0.657 0.512161
## ROILacingulateVt -3.9668 1.4989 -2.646 0.008830 **
## ROILDLPFCVt -5.1880 1.4802 -3.505 0.000572 ***
## ROIRacingulateVt -5.2906 1.5095 -3.505 0.000572 ***
## ROIRDLPFCVt -4.5200 1.5190 -2.976 0.003309 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.617 on 187 degrees of freedom
## Multiple R-squared: 0.1034, Adjusted R-squared: 0.07467
## F-statistic: 3.596 on 6 and 187 DF, p-value: 0.002134
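For interpretation, 95% confidence intervals are often more informative than p-values alone; a quick addition using base R:
confint(model_2) # 95% CIs for the final model's coefficients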
Keep in mind, there could be other factors that impact the result:
Low sample size
Biases (e.g., selection or measurement bias)
Correlated repeated measures (each patient contributes five ROI rows; see the sketch below)
Treat this analysis as a tutorial, not your actual data analysis.
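On that last point: because each patient contributes multiple rows, the observations are not independent, and a mixed-effects model with a random intercept per patient would account for this. A minimal sketch, assuming the lme4 package:
library(lme4) # assumed package for mixed-effects models
model_mixed <- lmer(Vt ~ Type_num + ROI + (1 | ID), data = data_clean)
summary(model_mixed)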