This markdown serves as a tutorial on the complete process of building a thorough linear regression model, with an example data analysis. This document covers the following fundamentals:
- Data Preparation
- Preliminary Exploratory Analysis
- Model and Variable Selection in Linear Regression
- Residual and Model Diagnostics
- Interpretation of Results
Note: I am very comfortable with R, so that is the language used in these analyses. It is possible and practical to use other scripting/analysis languages.
This section covers how we read and prepare the data for analysis.
The data was provided in .xlsx format. However, I am more comfortable with the .csv format and the data.table package in R, which reads .csv files. It's trivial to convert .xlsx to .csv.
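If you prefer to do the conversion in R itself, the readxl package can read .xlsx files directly; a minimal sketch (the file names here are placeholders for your own):
library(readxl) # reads .xlsx files
sheet <- read_excel("all data.xlsx") # read the first sheet of the Excel file
write.csv(sheet, "all data.csv", row.names = FALSE) # re-save it as .csv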
Although the built-in R data.frame type is adequate for most operations, the data.table package uses multi-threaded CPU operations and fast text parsing. This package is excellent for reading and manipulating large datasets.
# 0) Loading the library
library(data.table) # Excellent for large data operations
# 1) Defining the file path
file_path <- "C:/Users/William/Desktop/Data Analysis/Mohreet/all data.csv"
# 2) Load the data
data <- fread(file_path)
The data has been loaded. We can take a look at the data structure:
str(data) #structure of the data
## Classes 'data.table' and 'data.frame': 209 obs. of 11 variables:
## $ ID : chr "SL25-1006" "SL25-1006" "SL25-1006" "SL25-1006" ...
## $ Type_num: int 1 1 1 1 1 1 1 1 1 1 ...
## $ ROI : chr "LacingulateVt" "RacingulateVt" "RDLPFCVt" "LDLPFCVt" ...
## $ Vt : num 34.1 33.2 24.1 24 33.4 ...
## $ fGM : num 0.628 0.628 0.256 0.296 0.512 0.542 0.542 0.425 0.613 0.613 ...
## $ fWM : num 0.186 0.186 0.732 0.682 0.328 0.141 0.141 0.48 0.17 0.17 ...
## $ fCSF : num 0.186 0.186 0.012 0.022 0.161 0.316 0.316 0.095 0.217 0.217 ...
## $ Glu : num 0 0 5.32 5.07 7.35 ...
## $ Sex : chr "F" "F" "F" "F" ...
## $ Age : num 42 42 42 42 42 21 21 21 28 28 ...
## $ Sex# : num 1 1 1 1 1 1 1 1 2 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
head(data) #display first few rows
## ID Type_num ROI Vt fGM fWM fCSF Glu Sex Age
## 1: SL25-1006 1 LacingulateVt 34.08057 0.628 0.186 0.186 0.000000 F 42
## 2: SL25-1006 1 RacingulateVt 33.20815 0.628 0.186 0.186 0.000000 F 42
## 3: SL25-1006 1 RDLPFCVt 24.07263 0.256 0.732 0.012 5.321472 F 42
## 4: SL25-1006 1 LDLPFCVt 24.03108 0.296 0.682 0.022 5.072136 F 42
## 5: SL25-1006 1 cingulateVt 33.43579 0.512 0.328 0.161 7.351965 F 42
## 6: SL25-1007 1 LacingulateVt 44.35559 0.542 0.141 0.316 4.918276 F 21
## Sex#
## 1: 1
## 2: 1
## 3: 1
## 4: 1
## 5: 1
## 6: 1
There are 11 variables and 209 observations.
Next, we check for missing data:
data[rowSums(is.na(data)) > 0]
## ID Type_num ROI Vt fGM fWM fCSF Glu Sex Age Sex#
## 1: SL25-1046 2 cingulateVt 28.91756 NA NA NA NA F 19 1
Patient SL25-1046 has missing data. Since the nature of the missingness is unclear, we exclude this patient from the analysis. There are ways to impute missing values, but that goes well beyond the scope of this tutorial, and I suspect this is just a clerical error.
data <- data[ID != "SL25-1046"] # This excludes the patient with missing data (all 5 of their rows)
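Side note: if you would rather drop only the incomplete rows instead of the whole patient, complete.cases() does that; a sketch (data_cc is a name made up for illustration):
data_cc <- data[complete.cases(data)] # keeps only rows with no NA values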
Next, we need to see the nature of each variable (what kind of data is each column?):
str(data) #structure of the data
## Classes 'data.table' and 'data.frame': 204 obs. of 11 variables:
## $ ID : chr "SL25-1006" "SL25-1006" "SL25-1006" "SL25-1006" ...
## $ Type_num: int 1 1 1 1 1 1 1 1 1 1 ...
## $ ROI : chr "LacingulateVt" "RacingulateVt" "RDLPFCVt" "LDLPFCVt" ...
## $ Vt : num 34.1 33.2 24.1 24 33.4 ...
## $ fGM : num 0.628 0.628 0.256 0.296 0.512 0.542 0.542 0.425 0.613 0.613 ...
## $ fWM : num 0.186 0.186 0.732 0.682 0.328 0.141 0.141 0.48 0.17 0.17 ...
## $ fCSF : num 0.186 0.186 0.012 0.022 0.161 0.316 0.316 0.095 0.217 0.217 ...
## $ Glu : num 0 0 5.32 5.07 7.35 ...
## $ Sex : chr "F" "F" "F" "F" ...
## $ Age : num 42 42 42 42 42 21 21 21 28 28 ...
## $ Sex# : num 1 1 1 1 1 1 1 1 2 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
unique(data$Type_num) #checking Type_num
## [1] 1 2 3
This is stored as an integer; we need to convert it to a factor.
data[, Type_num := factor(Type_num)]
unique(data$ROI) #checking ROI
## [1] "LacingulateVt" "RacingulateVt" "RDLPFCVt" "LDLPFCVt"
## [5] "cingulateVt"
We have 5 levels. We convert ROI to a factor as well, along with Sex:
data[, ROI := factor(ROI)]
data[, Sex := factor(Sex)]
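Equivalently, all three factor conversions can be done in a single data.table call, which scales better if you have many categorical columns:
cols <- c("Type_num", "ROI", "Sex") # columns to convert
data[, (cols) := lapply(.SD, factor), .SDcols = cols] # convert each to a factor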
Now we look at the data structure again:
str(data)
## Classes 'data.table' and 'data.frame': 204 obs. of 11 variables:
## $ ID : chr "SL25-1006" "SL25-1006" "SL25-1006" "SL25-1006" ...
## $ Type_num: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ ROI : Factor w/ 5 levels "cingulateVt",..: 2 4 5 3 1 2 4 3 2 4 ...
## $ Vt : num 34.1 33.2 24.1 24 33.4 ...
## $ fGM : num 0.628 0.628 0.256 0.296 0.512 0.542 0.542 0.425 0.613 0.613 ...
## $ fWM : num 0.186 0.186 0.732 0.682 0.328 0.141 0.141 0.48 0.17 0.17 ...
## $ fCSF : num 0.186 0.186 0.012 0.022 0.161 0.316 0.316 0.095 0.217 0.217 ...
## $ Glu : num 0 0 5.32 5.07 7.35 ...
## $ Sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 2 2 ...
## $ Age : num 42 42 42 42 42 21 21 21 28 28 ...
## $ Sex# : num 1 1 1 1 1 1 1 1 2 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
We can see that all of the numeric variables are numeric, and all of the categorical variables are now factors.
The last wrangling step is to drop the columns we are not interested in. Sex# appears to be a numeric coding that duplicates Sex, so we remove it:
data[,`Sex#`:= NULL]
Before any analysis is completed, you need to summarize the descriptive characteristics of your study cohort. In health science research, this summary is typically referred to as Table One. You summarize only the independent variables, not the outcome.
Thankfully, in R, the tableone package does this for us. I picked Sex, Age, ROI, and Glu to summarize. You can add more variables if you want.
library(tableone)
CreateTableOne(
vars = c("Sex", "Age", "ROI", "Glu"), #variables to include
data = data #data is data
)
##
## Overall
## n 204
## Sex = M (%) 103 (50.5)
## Age (mean (SD)) 28.73 (7.73)
## ROI (%)
## cingulateVt 40 (19.6)
## LacingulateVt 42 (20.6)
## LDLPFCVt 42 (20.6)
## RacingulateVt 41 (20.1)
## RDLPFCVt 39 (19.1)
## Glu (mean (SD)) 5.81 (1.60)
Conclusion:
- 50.5% of patients are male; the cohort is nicely balanced on sex.
- Mean age is 28.73 with a standard deviation of 7.73 (quite high, to be honest).
- The ROI group percentages are nicely summarized.
- Mean Glu concentration is 5.81 with an SD of 1.60.
No striking imbalance or issue from this exploratory data analysis. We can move on to regression.
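One optional extension: if your cohort has groups (for example, the three Type_num levels), CreateTableOne can stratify the table through its strata argument; a sketch:
CreateTableOne(
vars = c("Sex", "Age", "ROI", "Glu"), # same variables as above
strata = "Type_num", # one column per group, plus p-values
data = data
)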
Now the fun begins. We will do a comprehensive linear regression.
The first thing to do is to fit a base model, so we have something to work with.
model <- lm(Vt ~ Glu + ROI + fGM + fWM + fCSF + Sex + Age, data = data)
summary(model)
##
## Call:
## lm(formula = Vt ~ Glu + ROI + fGM + fWM + fCSF + Sex + Age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.2422 -5.3111 -0.5451 3.9053 28.2416
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.172e+03 1.057e+03 -1.109 0.26892
## Glu -5.013e-01 3.807e-01 -1.317 0.18947
## ROILacingulateVt -3.146e+00 2.152e+00 -1.461 0.14551
## ROILDLPFCVt -5.233e+00 1.733e+00 -3.019 0.00288 **
## ROIRacingulateVt -2.627e+00 2.260e+00 -1.162 0.24653
## ROIRDLPFCVt -3.896e+00 2.068e+00 -1.884 0.06112 .
## fGM 1.202e+03 1.058e+03 1.136 0.25732
## fWM 1.211e+03 1.058e+03 1.145 0.25364
## fCSF 1.222e+03 1.059e+03 1.154 0.24973
## SexM 7.175e-01 1.141e+00 0.629 0.53036
## Age 2.347e-02 7.314e-02 0.321 0.74865
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.692 on 193 degrees of freedom
## Multiple R-squared: 0.08208, Adjusted R-squared: 0.03452
## F-statistic: 1.726 on 10 and 193 DF, p-value: 0.07736
Then we need to check the linearity of this model. In other words, does this linear model actually represent a linear relationship?
The best way to check is to plot the residuals (the distances between the observations and the fitted values) against the fitted values themselves. This is a purely graphical check. Ideally, the points should scatter randomly around 0, and the fitted line should be roughly flat around 0.
plot(model, which = 1)
Conclusion: the linearity assumption holds and the model is appropriate. There is slight curvature around the extremities, but it could be due to outliers, which we will check next.
We can also check linearity variable by variable with component-plus-residual plots from the car package:
library(car)
## Loading required package: carData
crPlots(model)
Conclusion: the linearity assumption holds and a linear model is appropriate. There is slight curvature around the extremities of the data, but it could be due to outliers, which we will check next.
In linear regression, the influential outliers in your data can be identified by calculating Cook's distance.
The rule of thumb for flagging an influential observation is Cook's D > 4/n, where n is the number of observations.
cooksd <- cooks.distance(model)
which(cooksd > 4/length(cooksd))
## 83 88 90 101 103 104 132 137 177 201
## 83 88 90 101 103 104 132 137 177 201
Conclusion: There are 10 influential observations (83, 88, 90, 101, 103, 104, 132, 137, 177, 201). Let's remove them from the data set to obtain a clean dataset. However, you might want to keep them; use your own best judgement. Usually, you run one model with the outliers and one model without. This is called a sensitivity analysis, and whether you need it is something you should discuss with your prof. A sketch of that comparison follows the removal code below.
influential <- c(83, 88, 90, 101, 103, 104, 132, 137, 177, 201) # indices of the influential observations
data_clean <- data[-influential, ] # remove the influential observations
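As promised, a minimal sketch of the sensitivity analysis (the names model_all and model_noout are made up here for illustration):
# Same formula fitted with and without the influential observations
model_all <- lm(Vt ~ Glu + ROI + fGM + fWM + fCSF + Sex + Age, data = data)
model_noout <- lm(Vt ~ Glu + ROI + fGM + fWM + fCSF + Sex + Age, data = data_clean)
# Side-by-side coefficients; large shifts mean the outliers drive the results
round(cbind(with_outliers = coef(model_all), without = coef(model_noout)), 3)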
Next we check for multicollinearity, i.e., whether the independent variables are correlated with each other. It's a standard check.
The score to calculate is the variance inflation factor (VIF). Essentially, any VIF above 5 requires investigation, and above 10 is very concerning.
library(car)
vif(model)
## GVIF Df GVIF^(1/(2*Df))
## Glu 1.274729e+00 1 1.129039
## ROI 3.074314e+00 4 1.150717
## fGM 6.311103e+04 1 251.219079
## fWM 1.652170e+05 1 406.468985
## fCSF 3.964789e+04 1 199.117786
## Sex 1.122961e+00 1 1.059699
## Age 1.098074e+00 1 1.047890
Conclusion: fGM, fWM, and fCSF are very problematic; they are all collinear with each other. This makes sense: the three tissue fractions appear to sum to 1 in each row (look at the output of head(data) above), so any one of them is a linear combination of the other two. For this exercise we remove these variables from our model.
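A quick way to verify this suspicion is to check whether the three fractions always sum to (approximately) 1; if so, the collinearity is structural:
data_clean[, range(fGM + fWM + fCSF)] # a range of ~1 to ~1 confirms it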
After the above diagnostic tests, we now have the following model:
model <- lm(Vt ~ Glu + ROI + Sex + Age, data = data_clean)
summary(model)
##
## Call:
## lm(formula = Vt ~ Glu + ROI + Sex + Age, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7636 -5.0470 -0.2673 4.1303 15.9254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.423092 2.701769 13.481 < 2e-16 ***
## Glu -0.431711 0.320378 -1.348 0.179457
## ROILacingulateVt -3.525574 1.548742 -2.276 0.023961 *
## ROILDLPFCVt -5.407170 1.500213 -3.604 0.000402 ***
## ROIRacingulateVt -5.271128 1.518445 -3.471 0.000644 ***
## ROIRDLPFCVt -4.353497 1.534329 -2.837 0.005053 **
## SexM 1.213178 1.004588 1.208 0.228720
## Age 0.002149 0.063346 0.034 0.972970
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.659 on 186 degrees of freedom
## Multiple R-squared: 0.09688, Adjusted R-squared: 0.06289
## F-statistic: 2.85 on 7 and 186 DF, p-value: 0.007568
Now we have a preliminary base model. We have to explore the following question: "Does the effect of Glu vary with other variables?" This is called effect modification, or interaction.
You typically only check interactions that are biologically plausible. Here, we will check for an interaction between Glu and ROI.
The way to do this is to fit the model once for each candidate interaction with the main regressor Glu, then compare each fit against the base model. Typically, this comparison uses the Akaike Information Criterion (AIC); this is standard practice.
The lower the AIC, the better the model. So if the AIC for the interaction model is lower than the base model's, use the interaction model, and vice versa. One caveat: both models must be fitted to the same observations, or their AICs are not comparable.
# base model
model <- lm(Vt ~ Glu + ROI + Sex + Age, data=data_clean)
# interaction model (fitted to the same data as the base model, otherwise the AICs are not comparable)
model_int <- lm(Vt ~ Glu*ROI + Sex + Age, data=data_clean)
# compare
AIC(model, model_int)
Conclusion: The AIC does not improve with the interaction terms. We detected no statistical interaction, so we do not need to consider effect modification.
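For nested models like these, an alternative to AIC is a direct F-test via anova(); a quick sketch:
anova(model, model_int) # tests whether the interaction terms significantly improve the fit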
The last step is to refine the model. Basically, you eliminate the insignificant covariates one at a time, refitting after each removal, until the remaining covariates are significant; the main exposure (Glu) stays in throughout because it is the question of interest. This is known as backward elimination.
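Aside: R can automate backward selection with the built-in step() function. Note that step() drops terms by AIC rather than by p-value, so it can settle on a different model than the manual procedure below (model_auto is a name made up for illustration):
model_auto <- step(lm(Vt ~ Glu + ROI + Sex + Age, data = data_clean),
direction = "backward", trace = 0) # trace = 0 silences the step-by-step log
formula(model_auto) # the formula step() settles on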
Iteration 0: Base Model
# base model
model <- lm(Vt ~ Glu + ROI + Sex + Age, data=data_clean)
summary(model)
##
## Call:
## lm(formula = Vt ~ Glu + ROI + Sex + Age, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7636 -5.0470 -0.2673 4.1303 15.9254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.423092 2.701769 13.481 < 2e-16 ***
## Glu -0.431711 0.320378 -1.348 0.179457
## ROILacingulateVt -3.525574 1.548742 -2.276 0.023961 *
## ROILDLPFCVt -5.407170 1.500213 -3.604 0.000402 ***
## ROIRacingulateVt -5.271128 1.518445 -3.471 0.000644 ***
## ROIRDLPFCVt -4.353497 1.534329 -2.837 0.005053 **
## SexM 1.213178 1.004588 1.208 0.228720
## Age 0.002149 0.063346 0.034 0.972970
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.659 on 186 degrees of freedom
## Multiple R-squared: 0.09688, Adjusted R-squared: 0.06289
## F-statistic: 2.85 on 7 and 186 DF, p-value: 0.007568
Conclusion: Age has the highest p-value, so let's remove it.
Iteration 1: Age Removed
model_1 <- lm(Vt ~ Glu + ROI + Sex, data=data_clean)
summary(model_1)
##
## Call:
## lm(formula = Vt ~ Glu + ROI + Sex, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7820 -5.0455 -0.2624 4.1152 15.9316
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.4819 2.0675 17.645 < 2e-16 ***
## Glu -0.4319 0.3195 -1.352 0.177992
## ROILacingulateVt -3.5257 1.5446 -2.283 0.023581 *
## ROILDLPFCVt -5.4075 1.4962 -3.614 0.000387 ***
## ROIRacingulateVt -5.2717 1.5143 -3.481 0.000621 ***
## ROIRDLPFCVt -4.3526 1.5300 -2.845 0.004939 **
## SexM 1.2220 0.9678 1.263 0.208305
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.642 on 187 degrees of freedom
## Multiple R-squared: 0.09688, Adjusted R-squared: 0.0679
## F-statistic: 3.343 on 6 and 187 DF, p-value: 0.003753
Conclusion: Sex has a high p-value, so let's remove it as well.
Iteration 2: Sex removed
model_2 <- lm(Vt ~ Glu + ROI, data=data_clean)
summary(model_2)
##
## Call:
## lm(formula = Vt ~ Glu + ROI, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.2025 -5.3495 -0.2803 4.1126 15.3282
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.7350 2.0610 17.824 < 2e-16 ***
## Glu -0.3651 0.3155 -1.157 0.248699
## ROILacingulateVt -3.5992 1.5459 -2.328 0.020967 *
## ROILDLPFCVt -5.4004 1.4985 -3.604 0.000401 ***
## ROIRacingulateVt -5.2937 1.5166 -3.491 0.000601 ***
## ROIRDLPFCVt -4.4417 1.5308 -2.902 0.004156 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.652 on 188 degrees of freedom
## Multiple R-squared: 0.08918, Adjusted R-squared: 0.06495
## F-statistic: 3.681 on 5 and 188 DF, p-value: 0.003348
We stop the backward elimination here: all remaining covariates (the ROI terms) are significant, and Glu stays in regardless because it is the main exposure.
Our final model is: Vt ~ Glu + ROI
Recall the analysis result:
summary(model_2)
##
## Call:
## lm(formula = Vt ~ Glu + ROI, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.2025 -5.3495 -0.2803 4.1126 15.3282
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.7350 2.0610 17.824 < 2e-16 ***
## Glu -0.3651 0.3155 -1.157 0.248699
## ROILacingulateVt -3.5992 1.5459 -2.328 0.020967 *
## ROILDLPFCVt -5.4004 1.4985 -3.604 0.000401 ***
## ROIRacingulateVt -5.2937 1.5166 -3.491 0.000601 ***
## ROIRDLPFCVt -4.4417 1.5308 -2.902 0.004156 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.652 on 188 degrees of freedom
## Multiple R-squared: 0.08918, Adjusted R-squared: 0.06495
## F-statistic: 3.681 on 5 and 188 DF, p-value: 0.003348
The main exposure Glu is not statistically significantly associated with changes in Vt levels (p = 0.25).
ROI location was significantly associated with lower Vt levels relative to the cingulate region (cingulateVt is the reference category):
- LacingulateVt is associated with a -3.5992 unit change in Vt
- LDLPFCVt is associated with a -5.4004 unit change in Vt
- RacingulateVt is associated with a -5.2937 unit change in Vt
- RDLPFCVt is associated with a -4.4417 unit change in Vt
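For reporting, point estimates are usually presented with 95% confidence intervals; a quick sketch using the final model:
round(cbind(estimate = coef(model_2), confint(model_2)), 2) # estimates with 95% CIs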
Keep in mind, there could be other factors that impact these results:
- Low sample size
- Biases
- Random effects: each patient contributes several ROI rows, so the observations are not fully independent
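If the within-patient correlation just mentioned is a concern, one common remedy is a linear mixed model with a random intercept per patient; a minimal sketch, assuming the lme4 package is installed:
library(lme4) # mixed-effects models
model_mixed <- lmer(Vt ~ Glu + ROI + (1 | ID), data = data_clean) # random intercept per patient
summary(model_mixed)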
Treat this analysis as a tutorial, not as your actual data analysis.