This markdown serves as a tutorial on the complete process of a thorough linear regression analysis, with an example data set. This document covers the following fundamentals:
Data Preparation
Preliminary Exploratory Analysis
Model and Variable Selection in Linear Regression
Residual and Model Diagnostics
Interpretation of Results
Note: I am very comfortable with R, so that is the language used in these analyses. It is possible and practical to use other scripting/analysis languages.
This section covers how we read the data and prepare it for the analysis.
The data was provided in .xlsx format. However, I am more comfortable with the .csv format and the data.table package in R, which reads .csv files. It's trivial to convert .xlsx to .csv.
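A minimal sketch of that conversion, assuming the readxl package is available and using hypothetical file names:
library(readxl) # assumed package for reading .xlsx files
xlsx_data <- read_excel("all data.xlsx") # hypothetical source file
write.csv(xlsx_data, "all data.csv", row.names = FALSE) # save as .csv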
Although the built-in R data.frame type is adequate for most operations, the data.table package uses multi-threaded CPU operations and fast, vectorized text parsing. This package is excellent for manipulating large data sets.
# 0) Load the library
library(data.table) # excellent for large data operations
# 1) Define the file path
file_path <- "C:/Users/William/Desktop/Data Analysis/Mohreet/all data.csv"
# 2) Load the data
data <- fread(file_path)
The data has been loaded. We can take a look at the data structure:
str(data) #structure of the data
## Classes 'data.table' and 'data.frame': 209 obs. of 11 variables:
## $ ID : chr "SL25-1006" "SL25-1006" "SL25-1006" "SL25-1006" ...
## $ Type_num: int 1 1 1 1 1 1 1 1 1 1 ...
## $ ROI : chr "LacingulateVt" "RacingulateVt" "RDLPFCVt" "LDLPFCVt" ...
## $ Vt : num 34.1 33.2 24.1 24 33.4 ...
## $ fGM : num 0.628 0.628 0.256 0.296 0.512 0.542 0.542 0.425 0.613 0.613 ...
## $ fWM : num 0.186 0.186 0.732 0.682 0.328 0.141 0.141 0.48 0.17 0.17 ...
## $ fCSF : num 0.186 0.186 0.012 0.022 0.161 0.316 0.316 0.095 0.217 0.217 ...
## $ Glu : num 0 0 5.32 5.07 7.35 ...
## $ Sex : chr "F" "F" "F" "F" ...
## $ Age : num 42 42 42 42 42 21 21 21 28 28 ...
## $ Sex# : num 1 1 1 1 1 1 1 1 2 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
head(data) #display first few rows
## ID Type_num ROI Vt fGM fWM fCSF Glu Sex Age
## 1: SL25-1006 1 LacingulateVt 34.08057 0.628 0.186 0.186 0.000000 F 42
## 2: SL25-1006 1 RacingulateVt 33.20815 0.628 0.186 0.186 0.000000 F 42
## 3: SL25-1006 1 RDLPFCVt 24.07263 0.256 0.732 0.012 5.321472 F 42
## 4: SL25-1006 1 LDLPFCVt 24.03108 0.296 0.682 0.022 5.072136 F 42
## 5: SL25-1006 1 cingulateVt 33.43579 0.512 0.328 0.161 7.351965 F 42
## 6: SL25-1007 1 LacingulateVt 44.35559 0.542 0.141 0.316 4.918276 F 21
## Sex#
## 1: 1
## 2: 1
## 3: 1
## 4: 1
## 5: 1
## 6: 1
There appear to be 11 variables and 209 observations.
Next, we check for missing data:
data[rowSums(is.na(data)) > 0]
## ID Type_num ROI Vt fGM fWM fCSF Glu Sex Age Sex#
## 1: SL25-1046 2 cingulateVt 28.91756 NA NA NA NA F 19 1
Patient SL25-1046 has missing data. Since the nature of the missingness is unclear, we exclude this patient from the analysis. There are ways to impute missing values, but that goes well beyond our scope, and I suspect this is just a clerical error.
data <- data[ID != "SL25-1046"] # this excludes the patient with missing data
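A quick sanity check, using base R, that no missing values remain:
anyNA(data) # should now return FALSE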
Next, we need to see the nature of each variable (what kind of data are they?)
str(data) #structure of the data
## Classes 'data.table' and 'data.frame': 204 obs. of 11 variables:
## $ ID : chr "SL25-1006" "SL25-1006" "SL25-1006" "SL25-1006" ...
## $ Type_num: int 1 1 1 1 1 1 1 1 1 1 ...
## $ ROI : chr "LacingulateVt" "RacingulateVt" "RDLPFCVt" "LDLPFCVt" ...
## $ Vt : num 34.1 33.2 24.1 24 33.4 ...
## $ fGM : num 0.628 0.628 0.256 0.296 0.512 0.542 0.542 0.425 0.613 0.613 ...
## $ fWM : num 0.186 0.186 0.732 0.682 0.328 0.141 0.141 0.48 0.17 0.17 ...
## $ fCSF : num 0.186 0.186 0.012 0.022 0.161 0.316 0.316 0.095 0.217 0.217 ...
## $ Glu : num 0 0 5.32 5.07 7.35 ...
## $ Sex : chr "F" "F" "F" "F" ...
## $ Age : num 42 42 42 42 42 21 21 21 28 28 ...
## $ Sex# : num 1 1 1 1 1 1 1 1 2 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
unique(data$Type_num) #checking Type_num
## [1] 1 2 3
This is stored as an integer; we need to convert it to a factor.
data[, Type_num := factor(Type_num)]
unique(data$ROI) #checking ROI
## [1] "LacingulateVt" "RacingulateVt" "RDLPFCVt" "LDLPFCVt"
## [5] "cingulateVt"
We have 5 levels. We convert ROI to a factor as well, along with Sex:
data[, ROI := factor(ROI)]
data[, Sex := factor(Sex)]
Now we look at the data structure again:
str(data)
## Classes 'data.table' and 'data.frame': 204 obs. of 11 variables:
## $ ID : chr "SL25-1006" "SL25-1006" "SL25-1006" "SL25-1006" ...
## $ Type_num: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ ROI : Factor w/ 5 levels "cingulateVt",..: 2 4 5 3 1 2 4 3 2 4 ...
## $ Vt : num 34.1 33.2 24.1 24 33.4 ...
## $ fGM : num 0.628 0.628 0.256 0.296 0.512 0.542 0.542 0.425 0.613 0.613 ...
## $ fWM : num 0.186 0.186 0.732 0.682 0.328 0.141 0.141 0.48 0.17 0.17 ...
## $ fCSF : num 0.186 0.186 0.012 0.022 0.161 0.316 0.316 0.095 0.217 0.217 ...
## $ Glu : num 0 0 5.32 5.07 7.35 ...
## $ Sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 2 2 ...
## $ Age : num 42 42 42 42 42 21 21 21 28 28 ...
## $ Sex# : num 1 1 1 1 1 1 1 1 2 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
We can see that all of the numeric variables are numeric, and all of the categorical variables are converted to factors.
The last wrangling step is to keep only the columns we are interested in. Sex# duplicates Sex as a numeric code, so we drop it:
data[, `Sex#` := NULL]
Before any analysis is completed, you need to summarize the descriptive characteristics of your study cohort. This summary is typically referred to as Table One in health science research. You summarize only the independent variables, not the outcome.
Thankfully, in R, the tableone package does this for us. I picked Sex, Age, ROI, and Glu to summarize. You can add additional variables if you want.
library(tableone)
CreateTableOne(
vars = c("Sex", "Age", "ROI", "Glu"), # variables to include
data = data # the prepared data
)
##
## Overall
## n 204
## Sex = M (%) 103 (50.5)
## Age (mean (SD)) 28.73 (7.73)
## ROI (%)
## cingulateVt 40 (19.6)
## LacingulateVt 42 (20.6)
## LDLPFCVt 42 (20.6)
## RacingulateVt 41 (20.1)
## RDLPFCVt 39 (19.1)
## Glu (mean (SD)) 5.81 (1.60)
Conclusion:
50.5% of patients are male; the patient sex of this cohort is nicely balanced.
Mean age is 28.73 with a standard deviation of 7.73 (quite high, to be honest).
The ROI group percentages are nicely summarized.
Mean Glu concentration is 5.81 with an SD of 1.60.
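In practice, Table One is often also stratified by exposure group. A sketch using tableone's strata argument, here stratifying by Type_num:
CreateTableOne(
vars = c("Sex", "Age", "ROI", "Glu"), # same variables as above
strata = "Type_num", # one summary column per Type_num group
data = data
)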
No striking imbalance or issue from this exploratory data analysis. We can move on to regression.
Now the fun begins. We will do a comprehensive linear regression:
The first thing to do is to fit a base model, so we have something to work on.
model <- lm(Vt ~ Type_num + ROI + fGM + fWM + fCSF + Sex + Age, data = data)
summary(model)
##
## Call:
## lm(formula = Vt ~ Type_num + ROI + fGM + fWM + fCSF + Sex + Age,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.871 -5.066 -0.380 4.061 26.698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -927.77716 1059.76044 -0.875 0.38242
## Type_num2 -2.35550 1.19478 -1.971 0.05011 .
## Type_num3 0.49780 1.49175 0.334 0.73897
## ROILacingulateVt -3.25671 2.13573 -1.525 0.12894
## ROILDLPFCVt -5.08325 1.71853 -2.958 0.00349 **
## ROIRacingulateVt -2.33782 2.23586 -1.046 0.29706
## ROIRDLPFCVt -3.90381 2.05508 -1.900 0.05899 .
## fGM 955.83521 1059.90847 0.902 0.36829
## fWM 964.46302 1060.12684 0.910 0.36409
## fCSF 971.56415 1060.78693 0.916 0.36087
## SexM 0.42573 1.11638 0.381 0.70337
## Age 0.03060 0.07271 0.421 0.67431
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.641 on 192 degrees of freedom
## Multiple R-squared: 0.09872, Adjusted R-squared: 0.04709
## F-statistic: 1.912 on 11 and 192 DF, p-value: 0.03992
Then we need to check the linearity of this model. In other words, does this linear model actually represent a linear relationship?
The best way to check is to plot the residuals (the distances between the observations and their fitted values) against the fitted values themselves. This is a purely graphical check. Ideally, we should see the observations scatter around 0, and the fitted line should be roughly flat around 0.
plot(model, which = 1)
Conclusion: the linearity assumption holds and this model is appropriate. There is slight curvature around the extremities, but that could be due to outliers, which we will check next. As a second linearity check, the car package's component + residual plots examine each predictor individually:
library(car)
## Loading required package: carData
crPlots(model)
Conclusion: the linearity assumption holds and the linear model is appropriate. There is slight curvature around the extremities of the data, but it could be due to outliers, which we check next.
In linear regression, influential outliers in your data can be identified by calculating Cook's distance. The rule of thumb for flagging influential outliers with Cook's distance is:
Cook's D > 4/n, where n is the number of observations.
cooksd <- cooks.distance(model)
which(cooksd > 4/length(cooksd))
## 83 88 90 104 132 137 177 201
## 83 88 90 104 132 137 177 201
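As a visual companion to the numeric check, base R's built-in diagnostic plot shows Cook's distance per observation; a sketch that also draws the 4/n threshold as a dashed line:
plot(model, which = 4) # Cook's distance for each observation
abline(h = 4/length(cooksd), lty = 2) # rule-of-thumb threshold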
Conclusion: there are 10 influential observations (83, 88, 90, 101, 103, 104, 132, 137, 177, 201). Let's remove them from the data set to have clean data. However, you might want to keep them; use your own best judgement. Usually, you run one model with outliers and one model without outliers. This is called a sensitivity analysis, and whether you need it is something you should discuss with your prof.
influential <- c(83, 88, 90, 101, 103, 104, 132, 137, 177, 201) # row indices of the influential observations
data_clean <- data[-influential, ] # remove the influential observations
Next we check for multicollinearity, i.e., whether the independent variables are correlated among themselves. It's a standard check.
The score to calculate is called the variance inflation factor (VIF). Essentially, any VIF above 5 requires investigation; above 10 is very concerning.
library(car)
vif(model)
## GVIF Df GVIF^(1/(2*Df))
## Type_num 1.029448e+00 2 1.007282
## ROI 2.966310e+00 4 1.145584
## fGM 6.422003e+04 1 253.416711
## fWM 1.681278e+05 1 410.033947
## fCSF 4.033827e+04 1 200.843895
## Sex 1.088414e+00 1 1.043271
## Age 1.099535e+00 1 1.048587
Conclusion: fGM, fWM, and fCSF are very problematic. The reason is visible in the data itself: the three tissue fractions sum to 1 in every row, so they are perfectly collinear with each other. For this exercise we remove these variables from our model.
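We can verify that directly; a quick check of the row-wise sums using data.table syntax:
data[, range(fGM + fWM + fCSF)] # both values should be approximately 1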
After the above diagnostic tests, we now have the following model:
model <- lm(Vt ~ Type_num + ROI + Sex + Age, data = data_clean)
summary(model)
##
## Call:
## lm(formula = Vt ~ Type_num + ROI + Sex + Age, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5180 -4.9118 -0.4532 4.3749 15.4011
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.934522 2.115718 16.512 < 2e-16 ***
## Type_num2 -2.208457 1.062025 -2.079 0.038952 *
## Type_num3 -0.886501 1.310477 -0.676 0.499586
## ROILacingulateVt -3.966006 1.502424 -2.640 0.009005 **
## ROILDLPFCVt -5.162124 1.483815 -3.479 0.000628 ***
## ROIRacingulateVt -5.276430 1.513147 -3.487 0.000610 ***
## ROIRDLPFCVt -4.468439 1.523473 -2.933 0.003781 **
## SexM 0.980771 0.987036 0.994 0.321690
## Age 0.007976 0.063111 0.126 0.899572
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.633 on 185 degrees of freedom
## Multiple R-squared: 0.109, Adjusted R-squared: 0.07043
## F-statistic: 2.828 on 8 and 185 DF, p-value: 0.005538
Now we have a preliminary base model. We have to explore the following question: does the effect of Type_num vary with other variables? This is called effect modification, or interaction.
You typically only check interactions that are biologically plausible. Here, we will check for an interaction between Type_num and ROI.
The way to do this is to run the model once for each candidate interaction with the main regressor Type_num, and then compare against the base model. Typically, this is done through the Akaike Information Criterion (AIC). This is standard practice.
The lower the AIC, the better the model. So if the AIC for the interaction model is lower than the base model's, use the interaction model, and vice versa.
# base model
model <- lm(Vt ~ Type_num + ROI + Sex + Age, data = data_clean)
# interaction model (must be fitted to the same data as the base model)
model_int <- lm(Vt ~ Type_num * ROI + Sex + Age, data = data_clean)
# compare
AIC(model, model_int)
## df AIC
## model 10 1295.429
## model_int 18 1416.000
Conclusion: the interaction model has a higher AIC than the base model, so the interaction terms do not improve the fit. We detected no meaningful interaction and do not need to consider effect modification.
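Since the base and interaction models are nested, an F-test is another standard comparison; a minimal sketch using base R's anova():
anova(model, model_int) # F-test for the added interaction terms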
The last step now is to refine the model. Basically, you eliminate the insignificant covariates one at a time until all variables (main exposure + covariates) are significant. This is called backward elimination; R can automate it, as sketched below, but here we will also walk through it by hand.
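A sketch of the automated version using base R's step(). Note that step() minimizes AIC rather than p-values, so it may keep or drop different variables than the manual procedure that follows:
step(model, direction = "backward") # automated backward elimination by AIC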
Iteration 0: Base Model
# base model
model <- lm(Vt ~ Type_num + ROI + Sex + Age, data=data_clean)
summary(model)
##
## Call:
## lm(formula = Vt ~ Type_num + ROI + Sex + Age, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5180 -4.9118 -0.4532 4.3749 15.4011
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.934522 2.115718 16.512 < 2e-16 ***
## Type_num2 -2.208457 1.062025 -2.079 0.038952 *
## Type_num3 -0.886501 1.310477 -0.676 0.499586
## ROILacingulateVt -3.966006 1.502424 -2.640 0.009005 **
## ROILDLPFCVt -5.162124 1.483815 -3.479 0.000628 ***
## ROIRacingulateVt -5.276430 1.513147 -3.487 0.000610 ***
## ROIRDLPFCVt -4.468439 1.523473 -2.933 0.003781 **
## SexM 0.980771 0.987036 0.994 0.321690
## Age 0.007976 0.063111 0.126 0.899572
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.633 on 185 degrees of freedom
## Multiple R-squared: 0.109, Adjusted R-squared: 0.07043
## F-statistic: 2.828 on 8 and 185 DF, p-value: 0.005538
Conclusion: Age has the highest p-value; let's remove it.
Iteration 1: Age Removed
model_1 <- lm(Vt ~ Type_num + ROI + Sex, data=data_clean)
summary(model_1)
##
## Call:
## lm(formula = Vt ~ Type_num + ROI + Sex, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5052 -4.9049 -0.4656 4.3497 15.4246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.1468 1.2829 27.396 < 2e-16 ***
## Type_num2 -2.2043 1.0587 -2.082 0.038705 *
## Type_num3 -0.8859 1.3070 -0.678 0.498747
## ROILacingulateVt -3.9673 1.4984 -2.648 0.008800 **
## ROILDLPFCVt -5.1629 1.4799 -3.489 0.000606 ***
## ROIRacingulateVt -5.2787 1.5090 -3.498 0.000586 ***
## ROIRDLPFCVt -4.4656 1.5193 -2.939 0.003707 **
## SexM 1.0131 0.9508 1.065 0.288050
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.615 on 186 degrees of freedom
## Multiple R-squared: 0.1089, Adjusted R-squared: 0.07534
## F-statistic: 3.247 on 7 and 186 DF, p-value: 0.002832
Conclusion: Sex has a high p-value; let's remove it.
Iteration 2: Sex Removed
model_2 <- lm(Vt ~ Type_num + ROI, data=data_clean)
summary(model_2)
##
## Call:
## lm(formula = Vt ~ Type_num + ROI, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9921 -5.2949 -0.2151 4.7763 14.9398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.6587 1.1900 29.965 < 2e-16 ***
## Type_num2 -2.1986 1.0591 -2.076 0.039262 *
## Type_num3 -0.8585 1.3072 -0.657 0.512161
## ROILacingulateVt -3.9668 1.4989 -2.646 0.008830 **
## ROILDLPFCVt -5.1880 1.4802 -3.505 0.000572 ***
## ROIRacingulateVt -5.2906 1.5095 -3.505 0.000572 ***
## ROIRDLPFCVt -4.5200 1.5190 -2.976 0.003309 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.617 on 187 degrees of freedom
## Multiple R-squared: 0.1034, Adjusted R-squared: 0.07467
## F-statistic: 3.596 on 6 and 187 DF, p-value: 0.002134
We stop the backward elimination here; Type_num is our main exposure, so we keep it even though one of its levels (Type_num3) is not significant.
Our final model is: Vt ~ Type_num + ROI
Recall the analysis result:
summary(model_2)
##
## Call:
## lm(formula = Vt ~ Type_num + ROI, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9921 -5.2949 -0.2151 4.7763 14.9398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.6587 1.1900 29.965 < 2e-16 ***
## Type_num2 -2.1986 1.0591 -2.076 0.039262 *
## Type_num3 -0.8585 1.3072 -0.657 0.512161
## ROILacingulateVt -3.9668 1.4989 -2.646 0.008830 **
## ROILDLPFCVt -5.1880 1.4802 -3.505 0.000572 ***
## ROIRacingulateVt -5.2906 1.5095 -3.505 0.000572 ***
## ROIRDLPFCVt -4.5200 1.5190 -2.976 0.003309 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.617 on 187 degrees of freedom
## Multiple R-squared: 0.1034, Adjusted R-squared: 0.07467
## F-statistic: 3.596 on 6 and 187 DF, p-value: 0.002134
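For interpretation, 95% confidence intervals are often more informative than p-values alone; a quick addition using base R:
confint(model_2) # 95% CIs for the final model's coefficients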
Keep in mind, there could be other factors that impact the result:
Low sample size
Biases (e.g., selection or measurement bias)
Correlated repeated measures (each patient contributes five ROI rows; see the sketch below)
Treat this analysis as a tutorial, not your actual data analysis.
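On that last point: because each patient contributes multiple rows, the observations are not independent, and a mixed-effects model with a random intercept per patient would account for this. A minimal sketch, assuming the lme4 package:
library(lme4) # assumed package for mixed-effects models
model_mixed <- lmer(Vt ~ Type_num + ROI + (1 | ID), data = data_clean)
summary(model_mixed)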