Preamble

This markdown serves as a tutorial on the complete process of building a thorough linear regression model, with an example data analysis. This document covers the following fundamentals:

  1. Data Preparation

  2. Preliminary Exploratory Analysis

  3. Model and Variable Selection in Linear Regression

  4. Residual and Model Diagnostics

  5. Interpretation of Results

Note: I am very comfortable with R, so that is the language used in this analysis. It is possible and practical to use other scripting/analysis languages.

Section 1: Data Preparation

This section covers how we read the data and prepare it for analysis.

Loading the Data

The data was provided in .xlsx format. However, I am more comfortable with the .csv format and the data.table package in R, which reads .csv files. It’s trivial to convert .xlsx to .csv, as sketched below.
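
For example, a minimal sketch of the conversion using the readxl package (the .xlsx file name here is an assumption; any spreadsheet program’s “Save As .csv” works just as well):

library(readxl)     # reads .xlsx workbooks
library(data.table) # fwrite() is a fast .csv writer

xlsx_data <- read_excel("all data.xlsx")         # hypothetical .xlsx file name
fwrite(as.data.table(xlsx_data), "all data.csv") # write it back out as .csv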

Although the built-in R data.frame type is adequate for most operations, the data.table package uses multi-threaded operations and provides a very fast file reader (fread()). This package is excellent for large data sets and their manipulation.

# 0) Load the library
library(data.table) # excellent for large data operations

# 1) Define the file path
file_path <- "C://Users//William//Desktop//Data Analysis//Mohreet//all data.csv"

# 2) Load the data
data <- fread(file_path)

Data Structure

The data has been loaded. We can take a look at the data structure:

str(data) #structure of the data
## Classes 'data.table' and 'data.frame':   209 obs. of  11 variables:
##  $ ID      : chr  "SL25-1006" "SL25-1006" "SL25-1006" "SL25-1006" ...
##  $ Type_num: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ ROI     : chr  "LacingulateVt" "RacingulateVt" "RDLPFCVt" "LDLPFCVt" ...
##  $ Vt      : num  34.1 33.2 24.1 24 33.4 ...
##  $ fGM     : num  0.628 0.628 0.256 0.296 0.512 0.542 0.542 0.425 0.613 0.613 ...
##  $ fWM     : num  0.186 0.186 0.732 0.682 0.328 0.141 0.141 0.48 0.17 0.17 ...
##  $ fCSF    : num  0.186 0.186 0.012 0.022 0.161 0.316 0.316 0.095 0.217 0.217 ...
##  $ Glu     : num  0 0 5.32 5.07 7.35 ...
##  $ Sex     : chr  "F" "F" "F" "F" ...
##  $ Age     : num  42 42 42 42 42 21 21 21 28 28 ...
##  $ Sex#    : num  1 1 1 1 1 1 1 1 2 2 ...
##  - attr(*, ".internal.selfref")=<externalptr>
head(data) #display first few rows
##           ID Type_num           ROI       Vt   fGM   fWM  fCSF      Glu Sex Age
## 1: SL25-1006        1 LacingulateVt 34.08057 0.628 0.186 0.186 0.000000   F  42
## 2: SL25-1006        1 RacingulateVt 33.20815 0.628 0.186 0.186 0.000000   F  42
## 3: SL25-1006        1      RDLPFCVt 24.07263 0.256 0.732 0.012 5.321472   F  42
## 4: SL25-1006        1      LDLPFCVt 24.03108 0.296 0.682 0.022 5.072136   F  42
## 5: SL25-1006        1   cingulateVt 33.43579 0.512 0.328 0.161 7.351965   F  42
## 6: SL25-1007        1 LacingulateVt 44.35559 0.542 0.141 0.316 4.918276   F  21
##    Sex#
## 1:    1
## 2:    1
## 3:    1
## 4:    1
## 5:    1
## 6:    1

There appear to be 11 variables and 209 observations.

Missing Data

Next, we check for missing data:

data[rowSums(is.na(data)) > 0]
##           ID Type_num         ROI       Vt fGM fWM fCSF Glu Sex Age Sex#
## 1: SL25-1046        2 cingulateVt 28.91756  NA  NA   NA  NA   F  19    1

Patient ID SL25-1046 has missing data. Since the nature of this missingness is unclear, we exclude this patient from the analysis. There are ways to impute the missing values, but that goes well beyond our scope here, and I suspect this is just a clerical error.

data <- data[!ID == "SL25-1046"] #This excludes the patient with missing data

Variable types

Next, we need to see the nature of each variable (what kind of data is each one?):

str(data) #structure of the data
## Classes 'data.table' and 'data.frame':   204 obs. of  11 variables:
##  $ ID      : chr  "SL25-1006" "SL25-1006" "SL25-1006" "SL25-1006" ...
##  $ Type_num: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ ROI     : chr  "LacingulateVt" "RacingulateVt" "RDLPFCVt" "LDLPFCVt" ...
##  $ Vt      : num  34.1 33.2 24.1 24 33.4 ...
##  $ fGM     : num  0.628 0.628 0.256 0.296 0.512 0.542 0.542 0.425 0.613 0.613 ...
##  $ fWM     : num  0.186 0.186 0.732 0.682 0.328 0.141 0.141 0.48 0.17 0.17 ...
##  $ fCSF    : num  0.186 0.186 0.012 0.022 0.161 0.316 0.316 0.095 0.217 0.217 ...
##  $ Glu     : num  0 0 5.32 5.07 7.35 ...
##  $ Sex     : chr  "F" "F" "F" "F" ...
##  $ Age     : num  42 42 42 42 42 21 21 21 28 28 ...
##  $ Sex#    : num  1 1 1 1 1 1 1 1 2 2 ...
##  - attr(*, ".internal.selfref")=<externalptr>
  1. Type_num
unique(data$Type_num) #checking Type_num
## [1] 1 2 3

This is stored as an integer; we need to convert it to a factor.

data[, Type_num := factor(Type_num)]
  2. ROI
unique(data$ROI) #checking ROI
## [1] "LacingulateVt" "RacingulateVt" "RDLPFCVt"      "LDLPFCVt"     
## [5] "cingulateVt"

We have 5 levels. We also convert this to a factor.

data[, ROI := factor(ROI)]
  3. Sex (as labeled by letter)
data[, Sex := factor(Sex)]

Now we look at the data structure again:

str(data)
## Classes 'data.table' and 'data.frame':   204 obs. of  11 variables:
##  $ ID      : chr  "SL25-1006" "SL25-1006" "SL25-1006" "SL25-1006" ...
##  $ Type_num: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ROI     : Factor w/ 5 levels "cingulateVt",..: 2 4 5 3 1 2 4 3 2 4 ...
##  $ Vt      : num  34.1 33.2 24.1 24 33.4 ...
##  $ fGM     : num  0.628 0.628 0.256 0.296 0.512 0.542 0.542 0.425 0.613 0.613 ...
##  $ fWM     : num  0.186 0.186 0.732 0.682 0.328 0.141 0.141 0.48 0.17 0.17 ...
##  $ fCSF    : num  0.186 0.186 0.012 0.022 0.161 0.316 0.316 0.095 0.217 0.217 ...
##  $ Glu     : num  0 0 5.32 5.07 7.35 ...
##  $ Sex     : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 2 2 ...
##  $ Age     : num  42 42 42 42 42 21 21 21 28 28 ...
##  $ Sex#    : num  1 1 1 1 1 1 1 1 2 2 ...
##  - attr(*, ".internal.selfref")=<externalptr>

We can see that all of the numeric variables are numeric, and all of the categorical variables are converted to factors.

The last wrangling step for this data is to subset to only the columns we are interested in:

  • Sex#: we can get rid of this duplicate column (it encodes the same information as Sex)
data[,`Sex#`:= NULL]

Section 2: Exploratory Data Analysis

This is typically referred to as Table One in health science research. You need to summarize the descriptive characteristics of your study cohort before any analysis is completed. You summarize only the independent variables, not the outcome.

Thankfully, in R, the tableone package does this for us. I picked Sex, Age, ROI, and Glu to summarize. You can add additional variables if you want.

library(tableone)

CreateTableOne(
  vars = c("Sex", "Age", "ROI", "Glu"), #variables to include
  data = data #data is data
)
##                   
##                    Overall      
##   n                  204        
##   Sex = M (%)        103 (50.5) 
##   Age (mean (SD))  28.73 (7.73) 
##   ROI (%)                       
##      cingulateVt      40 (19.6) 
##      LacingulateVt    42 (20.6) 
##      LDLPFCVt         42 (20.6) 
##      RacingulateVt    41 (20.1) 
##      RDLPFCVt         39 (19.1) 
##   Glu (mean (SD))   5.81 (1.60)

Conclusion:

  • 50.5% of patients are male, so patient sex in this cohort is nicely balanced.

  • Mean age is 28.73 with a standard deviation of 7.73 (quite high, to be honest).

  • The ROI group percentage is nicely summarized.

  • Mean Glu concentration is 5.81, SD is 1.60

  • No striking imbalance or issue from this exploratory data analysis. We can move on to regression.

Section 3: Linear Regression

Now the fun begins. We will do a comprehensive linear regression:

Base Model

The first thing to do is to fit a base model, so we have something to work on.

model <- lm(Glu ~ Vt + Type_num + ROI + fGM + fWM + fCSF + Sex + Age, data = data)
summary(model)
## 
## Call:
## lm(formula = Glu ~ Vt + Type_num + ROI + fGM + fWM + fCSF + Sex + 
##     Age, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9266 -0.5431  0.1928  0.7172  2.9326 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)      -39.897906 197.794069  -0.202   0.8404  
## Vt                -0.022104   0.013443  -1.644   0.1018  
## Type_num2         -0.127920   0.224793  -0.569   0.5700  
## Type_num3          0.683953   0.277948   2.461   0.0148 *
## ROILacingulateVt   0.234847   0.400222   0.587   0.5580  
## ROILDLPFCVt       -0.398229   0.327322  -1.217   0.2252  
## ROIRacingulateVt  -0.592528   0.417656  -1.419   0.1576  
## ROIRDLPFCVt       -0.218809   0.386378  -0.566   0.5718  
## fGM               45.137885 197.845726   0.228   0.8198  
## fWM               45.352430 197.893880   0.229   0.8190  
## fCSF              51.448605 198.022843   0.260   0.7953  
## SexM               0.526587   0.208025   2.531   0.0122 *
## Age               -0.000206   0.013550  -0.015   0.9879  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.423 on 191 degrees of freedom
## Multiple R-squared:  0.2564, Adjusted R-squared:  0.2097 
## F-statistic: 5.487 on 12 and 191 DF,  p-value: 5.184e-08

Linearity Check

Then we need to check the linearity of this model. In other words, is this linear model actually representing a linear relationship?

The best way to check this is to plot the residuals (the distance between each observation and its fitted value) against the fitted values themselves. This is a purely statistical check. Ideally, the observations should scatter around 0 and the fitted line should be roughly flat around 0.

plot(model, which = 1)

Conclusion: the linearity assumption holds and the model is appropriate. There is slight curvature around the extremities, but this could be due to outliers, which we will check next. As a second check, the component + residual (partial residual) plots from the car package examine linearity for each predictor individually:

library(car)
## Loading required package: carData
crPlots(model)

Conclusion: the linearity assumption holds and a linear model is appropriate. There is slight curvature around the extremities of the data, but this could be due to outliers, which we will check next.

Outliers: Cook’s Distance

In linear regression, the influential outliers in your data can be identified by calculating Cook’s distance.

The rule of thumb for flagging influential outliers using Cook’s distance is:

Cook’s D > 4/n

cooksd <- cooks.distance(model)
which(cooksd > 4/length(cooksd))
##   1   2  38  42  91  92  94 162 163 164 177 
##   1   2  38  42  91  92  94 162 163 164 177

Conclusion: Cook’s distance flags the influential observations listed above. Let’s remove them from the data set to obtain a clean data set. However, you might want to keep them; use your own best judgement. Usually, you run one model with the outliers and one model without them. This is called a sensitivity analysis (a minimal sketch is given after the next code block), and whether you need it is something you should discuss with your prof.

influential <- c(1,2,27,38,42,91,92,94,141,162,163,164) #labelling which observation is influential

data_clean <- data[-influential, ] #removing the influential
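
If you do want to run the sensitivity analysis mentioned above, a minimal sketch follows: it fits the same base model to the data with and without the flagged observations and lines the coefficient estimates up side by side (output not shown):

# sensitivity analysis: same model, with and without the influential observations
model_all   <- lm(Glu ~ Vt + Type_num + ROI + fGM + fWM + fCSF + Sex + Age, data = data)
model_clean <- lm(Glu ~ Vt + Type_num + ROI + fGM + fWM + fCSF + Sex + Age, data = data_clean)

# compare coefficient estimates side by side
round(cbind(with_outliers = coef(model_all), without_outliers = coef(model_clean)), 3)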

Multi-Collinearity: VIF

Next we check for multi-collinearity, i.e., whether the independent variables are correlated with each other. This is a standard check.

The score to calculate is called the variance inflation factor (VIF). Essentially, any VIF above 5 requires investigation, and anything above 10 is very concerning.

library(car) 
vif(model)
##                  GVIF Df GVIF^(1/(2*Df))
## Vt       1.109539e+00  1        1.053346
## Type_num 1.057884e+00  2        1.014167
## ROI      3.121471e+00  4        1.152909
## fGM      6.449205e+04  1      253.952846
## fWM      1.688526e+05  1      410.916777
## fCSF     4.051451e+04  1      201.282164
## Sex      1.089239e+00  1        1.043666
## Age      1.100550e+00  1        1.049071

Conclusion: fGM, fWM, and fCSF are very problematic; they are all collinear with each other. This is likely because the three tissue fractions sum to (approximately) 1 for each observation, so any one of them is a linear combination of the other two. For this exercise we remove these variables from the model.

Preliminary model

After the above diagnostic tests, we now have the following model:

model <- lm(Glu ~ Vt + Type_num + ROI + Sex + Age, data = data_clean)
summary(model)
## 
## Call:
## lm(formula = Glu ~ Vt + Type_num + ROI + Sex + Age, data = data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4913 -0.6643 -0.1085  0.5377  2.7371 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.403055   0.416571  12.970  < 2e-16 ***
## Vt               -0.011218   0.008656  -1.296  0.19661    
## Type_num2         0.101261   0.150088   0.675  0.50074    
## Type_num3         0.494189   0.179110   2.759  0.00639 ** 
## ROILacingulateVt  1.331394   0.212725   6.259 2.71e-09 ***
## ROILDLPFCVt      -0.339718   0.212095  -1.602  0.11095    
## ROIRacingulateVt  0.512695   0.212527   2.412  0.01684 *  
## ROIRDLPFCVt       0.438467   0.214026   2.049  0.04193 *  
## SexM             -0.073291   0.137927  -0.531  0.59581    
## Age               0.016610   0.008916   1.863  0.06409 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9158 on 182 degrees of freedom
## Multiple R-squared:  0.321,  Adjusted R-squared:  0.2874 
## F-statistic: 9.558 on 9 and 182 DF,  p-value: 6.904e-12

Interactions/Effect Modification

Now we have a preliminary base model. Next, we have to explore the following question:

“Does the effect of a regressor on Glu vary with the level of another variable?” This is called effect modification, or an interaction.

You typically only check interactions that are biologically plausible. However, here we will check for an interaction between Type_num and Vt, and between Type_num and ROI.

The way to do this is to refit the model once for each candidate interaction term. Then you compare each interaction model against the base model, typically using the Akaike Information Criterion (AIC). This is standard practice.

The lower the AIC, the better the model: if the AIC of the interaction model is lower than that of the base model, use the interaction model; otherwise keep the base model.

  1. For Type_num and Vt
# base model
model <- lm(Glu ~ Vt + Type_num + ROI + Sex + Age, data=data_clean)

# interaction model
model_int <- lm(Glu ~ Type_num*Vt + ROI + Sex + Age, data=data)

# compare
AIC(model, model_int)
## Warning in AIC.default(model, model_int): models are not all fitted to the same
## number of observations
##           df      AIC
## model     11 522.8371
## model_int 13 753.9422

Conclusion: the interaction model does not achieve a lower AIC than the base model, so we keep the base model; we do not need to consider this effect modification. (Note the warning: the two models were fitted to different data sets, data_clean vs data, so the AIC values are not strictly comparable; a corrected comparison is sketched after the second check below.)

  2. For Type_num and ROI
# base model
model <- lm(Glu ~ Vt + Type_num + ROI + Sex + Age, data=data_clean)

# interaction model
model_int <- lm(Glu ~ Type_num*ROI + Vt + Sex + Age, data=data)

# compare
AIC(model, model_int)
## Warning in AIC.default(model, model_int): models are not all fitted to the same
## number of observations
##           df      AIC
## model     11 522.8371
## model_int 19 755.8443
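
The same warning appears again: the interaction model was fitted to the full data rather than data_clean, so the two AIC values are computed on different observations and are not strictly comparable. Either way, the interaction model does not achieve the lower AIC, so we keep the base model. For a like-for-like check, both models should be refitted on the same data; a minimal sketch (output not shown):

# refit both models on the same cleaned data so the AIC values are comparable
model     <- lm(Glu ~ Vt + Type_num + ROI + Sex + Age, data = data_clean)
model_int <- lm(Glu ~ Type_num*ROI + Vt + Sex + Age,   data = data_clean)

AIC(model, model_int) # lower AIC = preferred model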

Variable Selection: Backward Selection

The last step now is to refine the model. Basically, you eliminate insignificant covariates one at a time until all remaining variables (the main exposure plus the covariates) are significant.
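
As an aside, R’s built-in step() function can automate backward elimination. Note that step() selects by AIC rather than by p-values, so it may keep or drop different terms than the manual procedure shown below. A minimal sketch:

# automated backward elimination by AIC (an alternative to the manual p-value steps below)
model_full <- lm(Glu ~ Vt + Type_num + ROI + Sex + Age, data = data_clean)
model_step <- step(model_full, direction = "backward", trace = 0)
summary(model_step)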

Iteration 0: Base Model

# base model
model <- lm(Glu ~ Vt + Type_num + ROI + Sex + Age, data=data_clean)
summary(model)
## 
## Call:
## lm(formula = Glu ~ Vt + Type_num + ROI + Sex + Age, data = data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4913 -0.6643 -0.1085  0.5377  2.7371 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.403055   0.416571  12.970  < 2e-16 ***
## Vt               -0.011218   0.008656  -1.296  0.19661    
## Type_num2         0.101261   0.150088   0.675  0.50074    
## Type_num3         0.494189   0.179110   2.759  0.00639 ** 
## ROILacingulateVt  1.331394   0.212725   6.259 2.71e-09 ***
## ROILDLPFCVt      -0.339718   0.212095  -1.602  0.11095    
## ROIRacingulateVt  0.512695   0.212527   2.412  0.01684 *  
## ROIRDLPFCVt       0.438467   0.214026   2.049  0.04193 *  
## SexM             -0.073291   0.137927  -0.531  0.59581    
## Age               0.016610   0.008916   1.863  0.06409 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9158 on 182 degrees of freedom
## Multiple R-squared:  0.321,  Adjusted R-squared:  0.2874 
## F-statistic: 9.558 on 9 and 182 DF,  p-value: 6.904e-12

Conclusion: Sex has the highest p-value, so let’s remove that.

Iteration 1: Sex Removed

model_1 <- lm(Glu ~ Vt + Type_num + ROI + Age, data=data_clean)
summary(model_1)
## 
## Call:
## lm(formula = Glu ~ Vt + Type_num + ROI + Age, data = data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4459 -0.6315 -0.1081  0.5337  2.7000 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.406868   0.415692  13.007  < 2e-16 ***
## Vt               -0.011343   0.008636  -1.314  0.19065    
## Type_num2         0.098402   0.149698   0.657  0.51179    
## Type_num3         0.494846   0.178754   2.768  0.00621 ** 
## ROILacingulateVt  1.327989   0.212211   6.258  2.7e-09 ***
## ROILDLPFCVt      -0.340933   0.211666  -1.611  0.10897    
## ROIRacingulateVt  0.509394   0.212019   2.403  0.01728 *  
## ROIRDLPFCVt       0.439879   0.213589   2.059  0.04087 *  
## Age               0.015330   0.008568   1.789  0.07523 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.914 on 183 degrees of freedom
## Multiple R-squared:  0.3199, Adjusted R-squared:  0.2902 
## F-statistic: 10.76 on 8 and 183 DF,  p-value: 2.291e-12

Conclusion: Age still has a relatively high p-value (p = 0.075), so we remove it as well. For the final model we also add ROI:Vt and ROI:Type_num interaction terms to check whether the Vt and Type_num effects differ by region.

Final Model:

model_1 <- lm(Glu ~ Vt + Type_num + ROI + ROI:Vt + ROI:Type_num, data=data_clean)
summary(model_1)
## 
## Call:
## lm(formula = Glu ~ Vt + Type_num + ROI + ROI:Vt + ROI:Type_num, 
##     data = data_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.36069 -0.57774 -0.06964  0.45890  2.42198 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 7.31319    0.82170   8.900  7.6e-16 ***
## Vt                         -0.04429    0.02242  -1.976  0.04981 *  
## Type_num2                  -0.24488    0.33278  -0.736  0.46282    
## Type_num3                  -0.45503    0.39615  -1.149  0.25231    
## ROILacingulateVt            0.04640    0.99046   0.047  0.96269    
## ROILDLPFCVt                -2.17586    1.09675  -1.984  0.04886 *  
## ROIRacingulateVt           -1.02604    0.99635  -1.030  0.30455    
## ROIRDLPFCVt                -1.40609    1.11499  -1.261  0.20899    
## Vt:ROILacingulateVt         0.02188    0.02874   0.761  0.44745    
## Vt:ROILDLPFCVt              0.05176    0.03078   1.682  0.09445 .  
## Vt:ROIRacingulateVt         0.02255    0.02875   0.784  0.43392    
## Vt:ROIRDLPFCVt              0.04075    0.03140   1.298  0.19613    
## Type_num2:ROILacingulateVt  0.44399    0.46991   0.945  0.34607    
## Type_num3:ROILacingulateVt  1.44776    0.58106   2.492  0.01367 *  
## Type_num2:ROILDLPFCVt       0.05926    0.48926   0.121  0.90374    
## Type_num3:ROILDLPFCVt       0.52483    0.56250   0.933  0.35211    
## Type_num2:ROIRacingulateVt  0.90479    0.46684   1.938  0.05425 .  
## Type_num3:ROIRacingulateVt  1.81246    0.57876   3.132  0.00204 ** 
## Type_num2:ROIRDLPFCVt       0.54964    0.49015   1.121  0.26369    
## Type_num3:ROIRDLPFCVt       1.32471    0.56220   2.356  0.01958 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9036 on 172 degrees of freedom
## Multiple R-squared:  0.3753, Adjusted R-squared:  0.3062 
## F-statistic: 5.437 on 19 and 172 DF,  p-value: 2.827e-10
  • Statistically significant decrease in glutamate level associated with an increase in Vt.

    • For a 1-unit increase in Vt, glutamate decreases by about 0.044 units.

    • The effect is, in fact, the same across the 5 regions in this model: none of the Vt:ROI interaction terms is statistically significant, so no region-specific difference was identified.

    • You realized a better approach could be to divide the data up by ROI and run 5 separate models (a minimal sketch follows this list).

      • You will do that and let them know the results later, but there was no time for it here.
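
A minimal sketch of that follow-up analysis (not run here; the covariates kept in each per-region model are an assumption and should be re-checked with the same diagnostics as above):

# fit one model per ROI; the Vt effect is then estimated separately for each region
models_by_roi <- lapply(split(data_clean, data_clean$ROI), function(d) {
  lm(Glu ~ Vt + Type_num + Age, data = d) # ROI is constant within each subset, so it is dropped
})

lapply(models_by_roi, summary) # one summary per region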