The Classical Model

The General Linear Model refers to a broad family of models for continuous, normally distributed outcomes. It includes familiar methods such as simple linear regression, multiple regression, ANOVA, and ANCOVA.

The general model equation for an outcome variable \(Y\) modeled by \(m\) variables \(X_{1\ldots m}\) is:

\[ Y = \beta_0 + \beta_1X_1 + \beta_1X_2 + \ldots + \beta_mX_m + \varepsilon \]

where

Fitting the model means calculating the regression parameters \(\beta_{0\ldots m}\) which can then be used to estimate the outcome:

\[ \hat{Y} = Y_{\text{est}} = \beta_0 + \beta_1X_1 + \beta_1X_2 + \ldots + \beta_mX_m \] Note that \(Y\) denotes observed outcome. Estimated outcome is written as \(\hat{Y}\) or \(Y_\text{est}\), you can use these interchangeably. The estimated outcome is equal to the pure model prediciton without the error term.

In case of a single predictor we often write

\[ Y = \alpha + \beta X + \varepsilon \hspace{1em} \text{or}\hspace{1em} \hat{Y} = \alpha + \beta X \]

instead.

Let’s introduce an example to illustrate this concept. In the following, we will work with data collected from 3 schools, specifically Hillcrest High School, Oakwood High School, and St Mary’s Church of England School. For each student, we have recorded the average test scores (ranging from 0% to 100%) and the average number of hours they studied shortly before their exams.

Here is the dataset:

school_long_name school test_score study_time
Hillcrest High School Hillcrest 54.59454 13.092146
Hillcrest High School Hillcrest 45.90443 7.285176
Hillcrest High School Hillcrest 46.07368 10.068656
Hillcrest High School Hillcrest 53.15568 10.877858
Hillcrest High School Hillcrest 51.89308 10.192075
Hillcrest High School Hillcrest 45.88416 8.660897
Hillcrest High School Hillcrest 49.43284 13.513836
Hillcrest High School Hillcrest 49.04923 8.695293
Hillcrest High School Hillcrest 55.00619 15.034541
Hillcrest High School Hillcrest 49.30719 8.791128
Hillcrest High School Hillcrest 49.59065 12.893879
Hillcrest High School Hillcrest 54.12954 15.839206
Hillcrest High School Hillcrest 47.03255 4.812688
Hillcrest High School Hillcrest 48.98281 8.142904
Hillcrest High School Hillcrest 50.05463 8.579306
Hillcrest High School Hillcrest 56.51023 10.887121
Hillcrest High School Hillcrest 48.13828 8.126511
Hillcrest High School Hillcrest 33.20910 1.009904
Hillcrest High School Hillcrest 40.99077 1.657870
Hillcrest High School Hillcrest 57.04151 12.939610
Oakwood High School Oakwood 44.29733 8.059354
Oakwood High School Oakwood 25.14011 3.635345
Oakwood High School Oakwood 35.47447 8.463518
Oakwood High School Oakwood 45.12912 12.623294
Oakwood High School Oakwood 48.16753 14.664851
Oakwood High School Oakwood 35.43849 7.687863
Oakwood High School Oakwood 37.85879 8.207462
Oakwood High School Oakwood 25.77590 3.689781
Oakwood High School Oakwood 48.83004 10.359562
Oakwood High School Oakwood 36.44190 7.059286
Oakwood High School Oakwood 42.43892 10.345621
Oakwood High School Oakwood 45.67442 11.093782
Oakwood High School Oakwood 46.28141 12.084581
Oakwood High School Oakwood 35.90872 7.152491
Oakwood High School Oakwood 47.41863 10.494136
Oakwood High School Oakwood 29.00540 3.828244
Oakwood High School Oakwood 31.40098 6.625893
Oakwood High School Oakwood 36.01010 6.426547
Oakwood High School Oakwood 24.41876 1.736647
Oakwood High School Oakwood 38.60989 9.087638
Oakwood High School Oakwood 37.99014 9.597266
Oakwood High School Oakwood 39.07729 7.896098
Oakwood High School Oakwood 43.98547 11.253760
St Mary’s Church of England School St Mary’s 42.94447 6.799156
St Mary’s Church of England School St Mary’s 34.92019 4.874427
St Mary’s Church of England School St Mary’s 49.85766 10.277724
St Mary’s Church of England School St Mary’s 44.38598 6.545091
St Mary’s Church of England School St Mary’s 60.11943 13.311574
St Mary’s Church of England School St Mary’s 46.89865 7.684932
St Mary’s Church of England School St Mary’s 50.23389 10.946214
St Mary’s Church of England School St Mary’s 49.43679 9.945046
St Mary’s Church of England School St Mary’s 40.07521 6.627753
St Mary’s Church of England School St Mary’s 60.94646 13.706453
St Mary’s Church of England School St Mary’s 57.77003 10.907968
St Mary’s Church of England School St Mary’s 48.67736 9.248552
St Mary’s Church of England School St Mary’s 51.15944 9.808922
St Mary’s Church of England School St Mary’s 51.22429 11.017137
St Mary’s Church of England School St Mary’s 46.55865 9.248769
St Mary’s Church of England School St Mary’s 23.99421 0.000000
St Mary’s Church of England School St Mary’s 54.27720 9.833919
St Mary’s Church of England School St Mary’s 48.37929 7.877566
Hillcrest High School Hillcrest 45.30004 9.534962
Hillcrest High School Hillcrest 57.01957 10.724741
Hillcrest High School Hillcrest 56.86480 13.178481
Hillcrest High School Hillcrest 46.07672 6.797394
Hillcrest High School Hillcrest 55.57510 12.886898
Hillcrest High School Hillcrest 47.07002 9.986815
Hillcrest High School Hillcrest 49.22528 12.094788
Hillcrest High School Hillcrest 52.23690 11.741456
Hillcrest High School Hillcrest 47.77480 11.141905
Hillcrest High School Hillcrest 45.58995 5.849913
Hillcrest High School Hillcrest 52.34357 8.708711
Hillcrest High School Hillcrest 47.91817 10.849825
Hillcrest High School Hillcrest 43.12712 6.118700
Hillcrest High School Hillcrest 46.96063 7.350784
Hillcrest High School Hillcrest 47.81392 10.722260
Hillcrest High School Hillcrest 50.39072 11.283807
Hillcrest High School Hillcrest 53.06295 10.370573
Hillcrest High School Hillcrest 48.49496 6.321941
Hillcrest High School Hillcrest 45.96745 5.679928
Hillcrest High School Hillcrest 48.66620 13.517391
Oakwood High School Oakwood 41.34408 9.753035
Oakwood High School Oakwood 43.68350 9.244591
Oakwood High School Oakwood 41.67275 8.616581
Oakwood High School Oakwood 32.22012 5.396284
Oakwood High School Oakwood 35.53073 10.815261
Oakwood High School Oakwood 38.83860 8.327851
Oakwood High School Oakwood 40.58326 8.431000
Oakwood High School Oakwood 45.69603 11.779309
Oakwood High School Oakwood 45.36142 11.444590
Oakwood High School Oakwood 49.60593 13.155619
Oakwood High School Oakwood 35.91185 7.550749
Oakwood High School Oakwood 47.79057 10.930316
Oakwood High School Oakwood 49.71638 13.152602
Oakwood High School Oakwood 29.56580 5.646904
Oakwood High School Oakwood 38.93851 6.396892
Oakwood High School Oakwood 36.78149 5.584054
Oakwood High School Oakwood 33.67548 4.601628
Oakwood High School Oakwood 35.45055 9.219218
Oakwood High School Oakwood 42.16985 10.938883
Oakwood High School Oakwood 49.07087 12.582166
Oakwood High School Oakwood 46.35821 12.113523
Oakwood High School Oakwood 34.98333 5.969644
Oakwood High School Oakwood 58.42821 14.524716
St Mary’s Church of England School St Mary’s 39.48171 6.978950
St Mary’s Church of England School St Mary’s 42.54783 9.295812
St Mary’s Church of England School St Mary’s 44.95859 7.712503
St Mary’s Church of England School St Mary’s 44.77387 8.612220
St Mary’s Church of England School St Mary’s 50.96967 9.543849
St Mary’s Church of England School St Mary’s 46.57311 9.336753
St Mary’s Church of England School St Mary’s 54.34814 8.903993
St Mary’s Church of England School St Mary’s 48.53935 9.303488
St Mary’s Church of England School St Mary’s 42.13689 7.522964
St Mary’s Church of England School St Mary’s 42.90107 7.466619
St Mary’s Church of England School St Mary’s 35.57561 3.995973
St Mary’s Church of England School St Mary’s 44.78883 7.832269
St Mary’s Church of England School St Mary’s 38.44711 7.441319
St Mary’s Church of England School St Mary’s 72.24097 17.084943
St Mary’s Church of England School St Mary’s 37.95949 4.892922
St Mary’s Church of England School St Mary’s 49.29009 9.391039
St Mary’s Church of England School St Mary’s 36.70040 4.498395
St Mary’s Church of England School St Mary’s 34.26447 4.567963
Hillcrest High School Hillcrest 49.05039 9.353377
Hillcrest High School Hillcrest 46.63433 5.989353
Hillcrest High School Hillcrest 51.74739 8.973802
Hillcrest High School Hillcrest 43.49421 7.694494
Hillcrest High School Hillcrest 46.46544 7.138255
Hillcrest High School Hillcrest 41.27340 2.905237
Hillcrest High School Hillcrest 42.08927 5.305026
Hillcrest High School Hillcrest 46.31831 9.517820
Hillcrest High School Hillcrest 48.92636 10.682132
Hillcrest High School Hillcrest 48.76119 7.500638
Hillcrest High School Hillcrest 47.52638 8.979459
Hillcrest High School Hillcrest 50.46286 12.347939
Hillcrest High School Hillcrest 54.44885 13.298837
Hillcrest High School Hillcrest 41.11537 5.687929
Hillcrest High School Hillcrest 51.49040 8.627312
Hillcrest High School Hillcrest 51.64673 12.583766
Hillcrest High School Hillcrest 47.65148 7.570081
Hillcrest High School Hillcrest 48.73925 8.821862
Hillcrest High School Hillcrest 48.46442 8.720948
Hillcrest High School Hillcrest 51.00763 6.316233
Oakwood High School Oakwood 34.81445 7.645218
Oakwood High School Oakwood 43.21828 8.890936
Oakwood High School Oakwood 37.57005 7.737664
Oakwood High School Oakwood 44.13324 12.319428
Oakwood High School Oakwood 36.86629 7.536292
Oakwood High School Oakwood 39.59984 7.679763
Oakwood High School Oakwood 42.86316 11.069858
Oakwood High School Oakwood 31.30408 5.810165
Oakwood High School Oakwood 40.17264 8.857175
Oakwood High School Oakwood 33.61506 4.324636
Oakwood High School Oakwood 46.74118 12.480779
Oakwood High School Oakwood 34.15559 8.158333
Oakwood High School Oakwood 33.23144 7.575734
Oakwood High School Oakwood 30.22384 5.264513
Oakwood High School Oakwood 38.33064 8.955984
Oakwood High School Oakwood 35.09257 6.578424
Oakwood High School Oakwood 38.76908 7.378793
Oakwood High School Oakwood 46.38074 12.842296
Oakwood High School Oakwood 35.56375 8.452693
Oakwood High School Oakwood 35.34916 5.763923
Oakwood High School Oakwood 41.76415 9.468891
Oakwood High School Oakwood 41.25415 7.891055
Oakwood High School Oakwood 38.45118 10.749311
St Mary’s Church of England School St Mary’s 61.09157 13.276536
St Mary’s Church of England School St Mary’s 43.06366 6.001193
St Mary’s Church of England School St Mary’s 54.20218 10.343221
St Mary’s Church of England School St Mary’s 46.20423 9.233964
St Mary’s Church of England School St Mary’s 58.19549 11.665967
St Mary’s Church of England School St Mary’s 43.25403 8.289936
St Mary’s Church of England School St Mary’s 54.10719 11.489128
St Mary’s Church of England School St Mary’s 35.79491 3.744103
St Mary’s Church of England School St Mary’s 62.27251 14.047647
St Mary’s Church of England School St Mary’s 58.20645 11.573604
St Mary’s Church of England School St Mary’s 45.70714 8.526942
St Mary’s Church of England School St Mary’s 30.16766 4.632249
St Mary’s Church of England School St Mary’s 51.17844 10.908296
St Mary’s Church of England School St Mary’s 49.02100 10.428852
St Mary’s Church of England School St Mary’s 46.42774 8.960203
St Mary’s Church of England School St Mary’s 48.29198 9.433638
St Mary’s Church of England School St Mary’s 42.22076 7.226943
St Mary’s Church of England School St Mary’s 50.52733 10.085690


Simple linear regression analysis

We can now model how students’ test scores depend on the length of study. As a first step we estimate a simple linear regression model:

# Fit a simple linear regression model (fixed intercept, fixed slope model)
model = lm(test_score ~ study_time, data = df)
# show mode summary
summary(model)
## 
## Call:
## lm(formula = test_score ~ study_time, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.1068  -4.5792   0.5577   4.1368  11.4546 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.7993     1.2427   21.57   <2e-16 ***
## study_time    2.0192     0.1326   15.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.266 on 181 degrees of freedom
## Multiple R-squared:  0.5616, Adjusted R-squared:  0.5592 
## F-statistic: 231.9 on 1 and 181 DF,  p-value: < 2.2e-16
#model

With the formula

formula(model)
## test_score ~ study_time

we tell R to fit a regression line for the point cloud defined by study_time (as single continuous predictor) and test_score (as continuous output). We see that the resulting basic regression parameters

# Collect "classic" regression parameter 
coef(model)
## (Intercept)  study_time 
##   26.799339    2.019198

are significantly different from zero and that the residuals are roughly symmetrically distributed around 0, which is a good sign. The test score \(Score_\text{est}\) is estimated by the regression equation

\[ Test\,\,Score_\text{est} = 26.799 + 2.019\,Study\,\,Time\,(\text{hours}) \]

According to this model, each additional minute of study increases the test score by 2.019 units. In case students do not study at all their test score is estimated 26.799 on average.

The model output is visualized as regression line, i.e. the closest linear (“straight”) approximation of all dots in the scatter plot (point cloud) where each dot represents a single student.

basic_ggplot = ggplot(df, aes(x = study_time, y = test_score)) +
  geom_point(alpha = alpha_dots, color="#678") +
  theme_minimal() + theme(panel.grid = element_blank()) + 
  labs(x = "Study Time", y = "Score")

# Assign intercept and slope to variables
intercept = coef(model)[1]
slope = coef(model)[2]

basic_ggplot +
  labs(title = "Association of student's study time and scoring") +
  geom_abline(intercept = intercept, slope = slope, color = "#345", linetype = "longdash")


Residual analysis

In order to have valid estimates the error term (residuals) must be normally distributed. As seen in the model output above, the residuals are approximately symmetrically distributed. We can, of course, calculate the residual variable directly to show its distribution in more detail:

# get residual variable
test_score_resid = resid(model)

# show how residual are distributed 
ggplot(df, aes(x = test_score_resid)) +
  geom_histogram(binwidth = 5, fill = "#678", color = "white", alpha=.6) +
  theme_minimal() + theme(panel.grid = element_blank()) + 
  labs(title = "Histogram of residuals", x = "Score residual", y = "Count") 

Residuals are slightly left skewed, which can be ignored in practice.


In brief, the classical model

  • assumes a linear relationship between predictors and outcome.
  • assumes normally distributed residuals with constant variance.
  • is estimated using ordinary least squares (OLS).
  • is used when the outcome is continuous and unbounded (e.g., test scores, weight).


Generalized Linear Models (GLM): Summary

Generalized Linear Models (GLMs) extend traditional linear models to handle non-normal outcome variables, such as binary, count, proportion, or skewed continuous data. They provide a flexible framework that combines linear modeling with appropriate transformations and error distributions.

GLMs build upon a link function, i.e., a transformation that links the expected outcome \(Y\) to the linear predictor:

The general model equation for a GLM is:

\[ g(Y) = \beta_0 + \beta_1 X_1 + \ldots + \beta_m X_m + \varepsilon \]

This equation links the expected value of the outcome \(Y\) to a linear combination of predictors through the link function \(g(\ldots)\).

Example 1: Non-linear Associations

In the following example we analyze how salaries (€) in an organization depend on years of employment and gender. Data is collected from 200 employees, specifically 100 female and 100 male employees.

Here is the dataset:

years_empl salary gender
27.4441813 274372.94 Male
28.1122624 272584.60 Male
8.5841860 83252.41 Male
24.9134288 229785.25 Male
19.2523656 141310.60 Male
15.5728785 108422.85 Male
22.0976494 185929.98 Male
4.0399979 42793.74 Male
19.7097687 100284.31 Male
21.1519435 167208.49 Male
13.7322533 84488.60 Male
21.5733675 171300.54 Male
28.0401674 291434.29 Male
7.6628647 76376.50 Male
13.8687847 80076.12 Male
28.2004357 305893.12 Male
29.3467929 318895.77 Male
3.5246208 55349.75 Male
14.2499124 107613.32 Male
16.8099824 125935.58 Male
27.1209416 247016.44 Male
4.1613050 40497.62 Male
29.6667519 331348.26 Male
28.4000470 276661.61 Male
2.4731267 31579.04 Male
15.4263535 111774.36 Male
11.7061040 88052.90 Male
27.1721439 270697.87 Male
13.4090888 74413.58 Male
25.0801278 206600.50 Male
22.1278685 198856.63 Male
24.3316542 213999.52 Male
11.6432485 77472.96 Male
20.5550919 153524.76 Male
0.1184502 47629.30 Male
24.9874824 230629.76 Male
0.2200244 32724.36 Male
6.2297692 46640.30 Male
27.1980422 278288.55 Male
18.3533593 142577.96 Male
11.3867772 95481.67 Male
13.0731475 78232.04 Male
1.1229310 42575.02 Male
29.2061974 331214.32 Male
12.9525375 67893.01 Male
28.7272979 285770.63 Male
26.6326472 235624.41 Male
19.1993631 117483.76 Male
29.1289983 309636.63 Male
18.5651462 142275.07 Male
10.0028163 84795.75 Male
10.4024474 84622.06 Male
11.9545623 63018.48 Male
23.5407833 224974.88 Male
1.1680947 37063.01 Male
22.4638616 182548.19 Male
20.3183049 146089.50 Male
5.1379299 43416.06 Male
7.8326389 58960.65 Male
15.4323880 104896.59 Male
20.2682182 151437.42 Male
29.4845159 318956.30 Male
22.7863280 178413.12 Male
16.9946527 109272.55 Male
25.4906916 205630.04 Male
5.6842181 41537.78 Male
8.1385984 49839.02 Male
24.8447546 259463.99 Male
20.7961446 137931.11 Male
7.2163422 55495.92 Male
1.2896639 49143.88 Male
4.2143728 40028.07 Male
6.4915625 52297.32 Male
14.3819569 79848.95 Male
5.9223103 48154.50 Male
21.5806751 162196.75 Male
0.2365422 38631.97 Male
11.2646989 43504.75 Male
15.4322312 84736.67 Male
0.0471166 32806.04 Male
17.4481201 129666.43 Male
4.7371562 36430.33 Male
10.7708492 71014.11 Male
19.3689564 158119.13 Male
23.2747009 214691.13 Male
16.9094052 99585.02 Male
7.0111020 50807.05 Male
2.6994155 55253.81 Male
2.5683619 30202.92 Male
9.1565511 61622.70 Male
20.0227954 147570.59 Male
0.0071669 43297.98 Male
6.2570987 42819.48 Male
27.9910238 281155.98 Male
27.7693425 270439.65 Male
22.0228290 191392.66 Male
9.9921595 59509.47 Male
15.4518999 96772.72 Male
22.3192394 189336.76 Male
18.5747772 116733.59 Male
27.4441813 204242.00 Female
28.1122624 191386.91 Female
8.5841860 72219.94 Female
24.9134288 167490.37 Female
19.2523656 108435.26 Female
15.5728785 70663.49 Female
22.0976494 140781.09 Female
4.0399979 32199.05 Female
19.7097687 111206.97 Female
21.1519435 151187.35 Female
13.7322533 75816.61 Female
21.5733675 119743.62 Female
28.0401674 216027.60 Female
7.6628647 45854.16 Female
13.8687847 88053.06 Female
28.2004357 237475.42 Female
29.3467929 219145.13 Female
3.5246208 45214.51 Female
14.2499124 82617.81 Female
16.8099824 110742.94 Female
27.1209416 196822.69 Female
4.1613050 52693.92 Female
29.6667519 213160.54 Female
28.4000470 244370.12 Female
2.4731267 48641.89 Female
15.4263535 86065.03 Female
11.7061040 46340.89 Female
27.1721439 210633.58 Female
13.4090888 83942.68 Female
25.0801278 173513.78 Female
22.1278685 143467.72 Female
24.3316542 155985.73 Female
11.6432485 73309.23 Female
20.5550919 130895.97 Female
0.1184502 33939.11 Female
24.9874824 152443.33 Female
0.2200244 40976.86 Female
6.2297692 54712.10 Female
27.1980422 188808.55 Female
18.3533593 84492.50 Female
11.3867772 69645.56 Female
13.0731475 69735.95 Female
1.1229310 36242.49 Female
29.2061974 212333.47 Female
12.9525375 59894.91 Female
28.7272979 240390.18 Female
26.6326472 199596.33 Female
19.1993631 123822.69 Female
29.1289983 257722.99 Female
18.5651462 111962.82 Female
10.0028163 30410.55 Female
10.4024474 67145.34 Female
11.9545623 86840.21 Female
23.5407833 186767.75 Female
1.1680947 48096.85 Female
22.4638616 127293.27 Female
20.3183049 113809.76 Female
5.1379299 32825.78 Female
7.8326389 42222.32 Female
15.4323880 85583.32 Female
20.2682182 105943.37 Female
29.4845159 266857.26 Female
22.7863280 149472.85 Female
16.9946527 97313.91 Female
25.4906916 186105.23 Female
5.6842181 45221.87 Female
8.1385984 51050.88 Female
24.8447546 192923.96 Female
20.7961446 125372.93 Female
7.2163422 30462.69 Female
1.2896639 38619.32 Female
4.2143728 35021.34 Female
6.4915625 39430.34 Female
14.3819569 66077.73 Female
5.9223103 51836.70 Female
21.5806751 133279.58 Female
0.2365422 38235.89 Female
11.2646989 62489.24 Female
15.4322312 78485.47 Female
0.0471166 48852.66 Female
17.4481201 97678.32 Female
4.7371562 56015.18 Female
10.7708492 45738.23 Female
19.3689564 109407.30 Female
23.2747009 148957.97 Female
16.9094052 92124.57 Female
7.0111020 69238.16 Female
2.6994155 35898.28 Female
2.5683619 39572.20 Female
9.1565511 42813.41 Female
20.0227954 110912.02 Female
0.0071669 44986.09 Female
6.2570987 65365.23 Female
27.9910238 231578.99 Female
27.7693425 188859.09 Female
22.0228290 170910.92 Female
9.9921595 75632.53 Female
15.4518999 88084.00 Female
22.3192394 153654.18 Female
18.5747772 95533.93 Female


The scatterplot shows a clear positive association: The longer employees are employed, the higher salaries they earn:

p = ggplot(df, aes(x = years_empl, y = salary, color = gender)) +
  geom_point(alpha = .6) +
  theme_minimal() +
  scale_x_continuous(name="Years of employment", labels = scales::comma) + 
  scale_y_continuous(name="Salary (€)", labels = scales::comma) +
  scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender")
  #scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender")
p + theme(legend.position="none")


We can fit a simple linear regression model to model this association:

linear_model = lm(salary ~ years_empl, data = df)
summary(linear_model)
## 
## Call:
## lm(formula = salary ~ years_empl, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -58615 -27281  -3463  19327 101896 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2684.3     4717.3  -0.569     0.57    
## years_empl    7943.6      260.2  30.535   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33160 on 198 degrees of freedom
## Multiple R-squared:  0.8248, Adjusted R-squared:  0.8239 
## F-statistic: 932.4 on 1 and 198 DF,  p-value: < 2.2e-16
#linear_model

p + theme(legend.position="none") + 
  geom_abline(intercept = coef(linear_model)[1], slope = coef(linear_model)[2], 
              color = "#345", alpha = 1, linetype = "longdash") 


We are particularly interested in the differences of salaries between gender. A simple comparison of average salaries reveals that female have lower salaries on average:

tapply(df$salary, df$gender, mean)
##   Female     Male 
## 109140.8 135466.1
salary_female = tapply(df$salary, df$gender, mean)["Female"]
salary_male = tapply(df$salary, df$gender, mean)["Male"]

By adding gender as dummy variable into our linear model, we come to the conclusion, that salaries increase by 7943.61 € per year and that male salaries exceed female salaries by NA € on average, independent of employment time.

linear_model2 = lm(salary ~ years_empl+gender, data = df)
summary(linear_model2)
## 
## Call:
## lm(formula = salary ~ years_empl + gender, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -66761 -21296  -6514  20747  88733 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15846.9     4842.8  -3.272  0.00126 ** 
## years_empl    7943.6      239.2  33.215  < 2e-16 ***
## genderMale   26325.4     4311.0   6.107 5.33e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30480 on 197 degrees of freedom
## Multiple R-squared:  0.8527, Adjusted R-squared:  0.8512 
## F-statistic: 570.3 on 2 and 197 DF,  p-value: < 2.2e-16
#linear_model2

p = p + theme(legend.position.inside=c(.28,.63)) + 
  scale_color_manual(values = gender_colors, name = "Gender")

p + 
  geom_abline(intercept = coef(linear_model)[1], slope = coef(linear_model)[2], 
              color = "#666", alpha = 1, linetype = "dotted") + 
  geom_abline(intercept = coef(linear_model2)[1], slope = coef(linear_model2)[2], 
              color = gender_colors[1], alpha = 1, linetype = "longdash") + 
  geom_abline(intercept = coef(linear_model2)[1]+coef(linear_model2)[3], slope = coef(linear_model2)[2], 
              color = gender_colors[2], alpha = 1, linetype = "longdash") 

It is, however, clearly visible that the association between years of employment and salaries is not linear. A linear model, thus, is not appropriate, and our interpretation is flawed. In order to fit a linear model, we first have to transform our dependent variable so that the association becomes linear.

When looking at selected data points it becomes obvious that the association is exponential in nature:

Years of Employment Salary Salary log2(Salary)
0 32.768 215 15
10 +10 65.536 x2 216 16 +1
20 +10 131.072 x2 217 17 +1
30 +10 262.144 x2 218 18 +1
p + theme(legend.position="none") + 
  scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender") + 
  geom_point(x=0, y=2**15, color="#f52", size=5, alpha=.5) +
  geom_point(x=10, y=2**16, color="#f52", size=5, alpha=.5) +
  geom_point(x=20, y=2**17, color="#f52", size=5, alpha=.5) +
  geom_point(x=30, y=2**18, color="#f52", size=5, alpha=.5) 

Therefore, we use a logarithmic transformation to linearize the association. Instead of fitting

\[ Salary = \beta_0 + \beta_1Years + \varepsilon \]

we choose \(\text{log}_2(Y)\) as dependent variable:

\[ \text{log}_2(Salary) = \beta_0 + \beta_1Years + \varepsilon \]

log_model = lm(log2(salary) ~ years_empl, data = df)
summary(log_model)
## 
## Call:
## lm(formula = log2(salary) ~ years_empl, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1115 -0.1760 -0.0016  0.2198  0.5921 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.979177   0.039675  377.54   <2e-16 ***
## years_empl   0.102428   0.002188   46.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2789 on 198 degrees of freedom
## Multiple R-squared:  0.9171, Adjusted R-squared:  0.9167 
## F-statistic:  2191 on 1 and 198 DF,  p-value: < 2.2e-16
#log_model

beta0 = round(coef(log_model)[1],3)
beta1 = round(coef(log_model)[2],3)

pl = ggplot(df, aes(x = years_empl, y = log2(salary), color = gender)) +
  geom_point(alpha = .6) +
  theme_minimal() + 
  scale_x_continuous(name="Years of employment", labels = scales::comma) + 
  scale_y_continuous(name="Log2(Salary)", labels = scales::comma) +
  scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender")
  #scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender")

pl + theme(legend.position="none") + 
  geom_abline(intercept = coef(log_model)[1], slope = coef(log_model)[2], 
              color = "#345", alpha = 1, linetype = "longdash") 

According to this model, the log2 of Salary increases by 0.102 each year :

\[ \text{log}_2(Salary) = 14.979 + 0.102 Years \]

As this description is a little hard to understand we have to get the model parameter and model predictions back on original scale. Raising the predictive form of the regression equation

\[ \text{log}_2(Salary_\text{est}) = \beta_0 + \beta_1Years \]

to the power of 2 we get

\[ Salary_\text{est} = 2^{(\beta_0 + \beta_1Years)} = 2^{\beta_0} \cdot 2^{\beta_1Years} \] Now we have a multiplicative (rather than additive) model indicating the underlying non-linear structure .

\[ Salary_\text{est} = 2^{14.979} \cdot 2^{0.102\cdot Years} \] Because we applied a logarithmic transformation, we can now interpret the model parameters as growth rates. The model predictions are as follows:

df$log_pred = 2**(beta0 + beta1*df$years_empl)

p + 
  geom_abline(intercept = coef(linear_model)[1], slope = coef(linear_model)[2], 
              color = "#345", alpha = 1, linetype = "dotted")  + 
  geom_line(aes(x=years_empl, 
                y=2**(coef(log_model)[1] + coef(log_model)[2]*years_empl)),
              color = "#345", alpha = 1, linetype = "longdash") 

But if we look at the gender differences in more detail it appears as if gender do not differ by intercepts only but have different growth rates. This can only be modeled by fitting separate models for male and female.

# adjust for Male and Female
log_model_male = lm(log2(salary) ~ years_empl, data = df[df$gender=="Male",])
summary(log_model_male)
## 
## Call:
## lm(formula = log2(salary) ~ years_empl, data = df[df$gender == 
##     "Male", ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8088 -0.1247  0.0048  0.1004  0.5500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.97655    0.04442  337.15   <2e-16 ***
## years_empl   0.11018    0.00245   44.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2208 on 98 degrees of freedom
## Multiple R-squared:  0.9538, Adjusted R-squared:  0.9533 
## F-statistic:  2023 on 1 and 98 DF,  p-value: < 2.2e-16
#log_model_male
log_model_female = lm(log2(salary) ~ years_empl, data = df[df$gender=="Female",])
summary(log_model_female)
## 
## Call:
## lm(formula = log2(salary) ~ years_empl, data = df[df$gender == 
##     "Female", ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.03654 -0.11005  0.02057  0.15374  0.58988 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.981809   0.052984   282.8   <2e-16 ***
## years_empl   0.094675   0.002922    32.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2634 on 98 degrees of freedom
## Multiple R-squared:  0.9146, Adjusted R-squared:  0.9138 
## F-statistic:  1050 on 1 and 98 DF,  p-value: < 2.2e-16
#log_model_female
# log_model_dummy = lm(log2(salary) ~ years_empl + gender, data = df)
# summary(log_model_dummy)
# log_model_male
# log_model_female
# log_model_dummy
betam0 = round(coef(log_model_male)[1],3)
betam1 = round(coef(log_model_male)[2],3)
betaf0 = round(coef(log_model_female)[1],3)
betaf1 = round(coef(log_model_female)[2],3)

p + 
  geom_line(aes(x=years_empl, 
                y=2**(coef(log_model_male)[1] + coef(log_model_male)[2]*years_empl)),
              color = "steelblue", alpha = 1, linetype = "longdash")   + 
  geom_line(aes(x=years_empl, 
                y=2**(coef(log_model_female)[1] + coef(log_model_female)[2]*years_empl)),
              color = "darkorange", alpha = 1, linetype = "longdash")  

Now it becomes obvious, that gender salaries differ by their growth rates rather than being dependent on starting salaries.

Female \[ Salary_\text{est Female} = 2^{14.982} \cdot 2^{0.095\cdot Years} \] Male \[ Salary_\text{est Male} = 2^{14.977} \cdot 2^{0.11\cdot Years} \] While starting salaries are more or less equal between gender (female: \(2^{14.982} = 32361.71\)€, male: \(2^{14.977} = 32249.74\)€), the growth rates differ significantly (female: \(2^{0.095} = 6.81\)%, male: \(2^{0.11} = 7.92\)%).

In the terminology of GLM the logarithm log2() served as link function \(g\)() that allowed us to linearize the association. Interpretation of model output, though, is much easier if output is transformed back into real measures.

Example 2: Binary Outcome, Logit Transformation

df = read.xlsx("./data/AdherenceData.xlsx") 

Regression analysis requires a metric outcome variable. If we aim at modeling a binary (or categorical) outcome variable we have to choose a link function \(g\)() that transforms the binary outcome into a metric.

The following example models participation in, or adherence to, a preventive program as part of the health management offers in a company. Variables in the data set are:

The data set comprises a binary outcome (adherence) and binary (app_use), multitom (stress_level), and metric (weekly_activity) explanatory variables:

id adherence app_use stress_level weekly_activity
1 1 1 1 210
2 1 0 3 180
3 0 0 2 60
4 0 1 3 150
5 0 0 1 30
6 1 1 2 140
7 0 0 3 80
8 1 1 3 90
9 1 0 2 50
10 1 1 1 220
11 0 0 3 90
12 1 1 3 160
13 0 0 2 70
14 1 1 1 220
15 0 1 3 85
16 1 1 2 140
17 0 0 3 55
18 1 1 1 230
19 0 0 3 95
20 1 1 3 175
21 0 1 2 65
22 1 1 1 210
23 0 0 3 180
24 0 1 3 155
25 0 0 2 160
26 1 1 1 225
27 0 0 3 110
28 1 1 3 165
29 0 0 2 75
30 1 1 1 205


The core research question asks for the effect of physical activity on the probability of adherence. We hypothesize:

Physical activity and adherence to the program are positively associated.

The linear model

A linear regression model can be written as:

\[ \hat{Y} = \beta_0 + \beta_1 X \]

Applied to a binary outcome such as adherence, this model describes a straight-line relationship between weekly_activity and the expected value of adherence.

glm = ggplot(df, aes(x = weekly_activity, y = adherence)) +
  geom_point(color = "darkblue", shape = 1) +
  scale_x_continuous(limits = c(0, 280)) +
  scale_y_continuous(limits = c(-0.2, 1.2)) +
  labs(x = "Weekly activity",
       y = "Adherence") +
  theme_minimal()
glm + geom_smooth(method = "lm",
              se = FALSE,
              linetype = "dashed",
              color = "steelblue",
              linewidth = 0)

m = lm(adherence ~ weekly_activity, data=df)
m
## 
## Call:
## lm(formula = adherence ~ weekly_activity, data = df)
## 
## Coefficients:
##     (Intercept)  weekly_activity  
##       -0.210823         0.005227
b = round(coefficients(m), 3)

\[ \text{adherence } = -0.211 + 0.005 \cdot \text{ weekly_activity} \]

It is obvious to interpret the model outcome as probabilities, at least in the value range of the predictor weekly_activity. However, this is problematic because predicted values outside this range may fall below 0 or above 1, although probabilities must lie between 0 and 1.

The logistic model

Logistic regression solves this problem by modeling an S-shaped probability curve (which is, by the way, also characteristic of logistic growth). This model is the so-called logistic model:

\[ p = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \]

Here, \(p\) denotes the probability that the outcome variable (here: adherence) takes the value 1.

glm + geom_smooth(method = "glm",
            method.args = list(family = "binomial"),
            se = FALSE,
            linetype = "solid",
            color = "darkorange",
            linewidth = 0) +
  labs(
    x = "Weekly activity",
    y = "Probability of Adherence (p)"
  )

# use glm here
m = glm(adherence ~ weekly_activity, data = df, family = binomial)
m
## 
## Call:  glm(formula = adherence ~ weekly_activity, family = binomial, 
##     data = df)
## 
## Coefficients:
##     (Intercept)  weekly_activity  
##        -3.86451          0.02835  
## 
## Degrees of Freedom: 29 Total (i.e. Null);  28 Residual
## Null Deviance:       41.59 
## Residual Deviance: 27.65     AIC: 31.65
b = round(coefficients(m), 3)

This model assumes that the probability of adherence changes non-linearly with weekly_activity: the curve is flatter close to 0 and 1, and steepest in the middle range. This ‘trick’ solves the issue of probability estimates beyond [0,1]. It is, however, obvious that this model is non-linear. To linearize, we calculate the odds of adherence, i.e., the ratio of the probability of adherence to the probability of non-adherence. The term (1 - p) represents the probability of the alternative outcome (non-adherence).

\[ \frac{p}{1-p} = \frac{\frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}}{1-\frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}} = e^{\beta_0 + \beta_1 X} \]

If we apply the logarithm to the odds we get the so called linear logit model:

\[ \text{ln}\frac{p}{1-p} = \text{logit}(p) = \text{ln } e^{\beta_0 + \beta_1 X} = \beta_0 + \beta_1 X \]

Applied to our data the linear logit model is estimated as

\[ \text{logit}(p) = -3.865 + 0.028 \cdot \text{weekly_activity} \]

The coefficient for weekly_activity (0.028) is positive, indicating that higher levels of physical activity are associated with a higher probability of adherence. Overall, the model implies a monotonic, non-linear increase in adherence probability with increasing physical activity.

Note that the regression coefficient \(\beta_1\) = 0.028 itself is not a probability. It describes the change in the log-odds of adherence for a one-unit increase in weekly_activity. After exponentiation, it becomes an odds ratio:

\[ e^{\beta_1} = e^{0.028} = 1.028 \]

The odds ratio indicates by which factor the odds of adherence change when weekly_activity increases by one unit. In odds terms, a one-unit increase in weekly activity (i.e., +1 minute) multiplies the odds of adherence by 1.028. This corresponds to an increase of about 2.8% in the odds per additional minute of activity. More meaningful increments are, e.g., a 10-minute [30-minute] increase in weekly activity multiplies the odds of adherence by approximately \(e\)0.28 \(\approx\) 1.32 [\(e\)0.84 \(\approx\) 2.32], corresponding to about a 32% [132%] increase in the odds.

It is important to note that this refers to odds, not probabilities. A 32% increase in odds does not translate into a 32% increase in probability!

The intercept (-3.865) represents the log-odds of adherence when weekly_activity = 0. In probability terms this means \[ p = \frac{1}{1 + e^{3.865}} \approx 0.0205325 \] Thus, with zero activity, adherence is essentially close to zero.

The graphical representation of the model on the logit scale appears as a straight line, closely resembling a linear regression. However, a closer inspection shows that the logit model implies a steeper slope than the linear model. Substantively, this representation is of limited interpretive value, as log-odds are not directly intuitive compared to probabilities.

x_seq = seq(min(df$weekly_activity), max(df$weekly_activity), length.out = 100)
logit_hat = predict(m, newdata = data.frame(weekly_activity = x_seq), type = "link")
df_plot = data.frame(weekly_activity = x_seq, logit = logit_hat)

glm +
  geom_line(data = df_plot,
            aes(x = weekly_activity, y = logit),
            color = "darkorange",
            linetype = "solid",
            linewidth = 0) +
  labs(
    x = "Weekly Activity",
    y = "Logit(p) (Log-odds of adherence)"
  )

On odds and odds ratios

Odds ratios are often easier to understand when illustrated with a categorical predictor rather than a metric one. In the categorical case, the odds of an outcome can be directly compared between clearly defined groups (e.g., low-activity vs. high-activity users), making the interpretation more intuitive. This simplified setting helps to introduce the basic logic of odds and odds ratios and to illustrate how they differ from probabilities.

Let’s look at 200 repondents classified into two grous, namely those with low weelky activity and those with high weelky activity.

t = data.frame(c1 = c("no", "yes"), c2 = c(80,20), c3=c(40,60))
names(t) = c("Adherence", "Low activity", "High activity")
kable_styling(kable(t, align = "lrr"), full_width = FALSE)
Adherence Low activity High activity
no 80 40
yes 20 60

The odds for adherence rather than non-adherence in the low activity group is

\[ \text{odds}_{low} = \frac{20}{80} = 0.25 \]

The odds for adherence rather than non-adherence in the high activity group are

\[ \text{odds}_{high} = \frac{60}{40} = 1.5 \]

The odds ratio (OR) for adherence for the high activity vs the low activity group are

\[ \text{OR} = \frac{60/40}{20/80} = \frac{1.5}{0.25} = 6 \]

The odds of adherence are 6 times higher for the high-activity group than for the low-activity group, indicating substantially greater adherence among highly active individuals.

This mini-example illustrates that, with categorical predictors, odds ratios can be computed directly from contingency tables because the grouping already exists in the data. With metric predictors, however, there are infinitely many possible comparisons, so we need a model that describes how odds change systematically across the predictor range. This is achieved by modeling the odds of the probabilities,

\[ \frac{p}{1-p} \]

or, more precisely, their logarithm (the logit).

On odds ratios and relative probabilities

Note that in terms of probabilities the numbers change:

The probability for adherence in the low activity group is \[ p(\text{adherence=yes | low activity} )= \frac{20}{80+20}=0.2 \] The probability for adherence in the high activity group is \[ p(\text{adherence=yes | high activity} )= \frac{60}{60+40}=0.6 \] and the relative probability for adherence of high-activity vs. low-activity users is

\[ \text{relative probability for adherence} = \frac{60/(60+40)}{20/(80+20)} = \frac{0.6}{0.2} = 3 \]

Thus, odds ratios are not equivalent to probability ratios often referred to as relative risks.

It is important to note that odds ratios are invariant to the direction of calculation: the odds ratio obtained row-wise is identical to the one obtained column-wise.

\[ \text{OR} = \frac{60/40}{20/80} = \frac{1.5}{0.25} = \frac{60/20}{40/80} = \frac{3}{0.5} = 6 \]

It is evident that this does not hold for probabilities, because \[ \text{relative likelihood for high activity} = \frac{60/(60+20)}{40/(40+80)} = \frac{0.75}{0.33} = 2.25 \ne 3 \] So the relative likelihood is not invariant to the direction of calculation.

As a side note, the invariance of odds ratios has important implications for research design. Unlike relative probabilities, odds ratios remain unchanged when the direction of conditioning is reversed. This is particularly important in case-control studies, where researchers sample individuals based on the outcome rather than the predictor. In such designs, direct probability comparisons are often not meaningful or not estimable, whereas odds ratios still provide a consistent measure of association. Transferred to the current example, this would mean comparing activity levels between adherent and non-adherent individuals instead of examining how adherence depends on activity. This perspective is substantively much less meaningful than the original direction.

The multivariate logistic model

In order to model the (mutually independent) effects of all three predictors, we hypothezise that

  • Physical activity and adherence to the program are positively associated.
  • Use of a digital health management app and adherence to the program are positively associated.
  • Perceived stress and adherence to the program are negatively associated.

the multivariate logistic model estimates are:

# use glm here
m = glm(adherence ~ weekly_activity + app_use + stress_level, data = df, family = binomial)
b = round(coefficients(m), 3)
summary(m)
## 
## Call:
## glm(formula = adherence ~ weekly_activity + app_use + stress_level, 
##     family = binomial, data = df)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)  
## (Intercept)     -1.46997    2.42241  -0.607   0.5440  
## weekly_activity  0.02248    0.01175   1.914   0.0556 .
## app_use          1.92007    1.09928   1.747   0.0807 .
## stress_level    -1.08695    0.92624  -1.174   0.2406  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 41.589  on 29  degrees of freedom
## Residual deviance: 22.558  on 26  degrees of freedom
## AIC: 30.558
## 
## Number of Fisher Scoring iterations: 6

\[ \text{logit}(p) = -1.47 + 0.022 \cdot \text{weekly_activity} + 1.92 \cdot \text{app_use} -1.087 \cdot \text{stress_level} \]

Translated into odds ratios the model reads

# use glm here
b = round(exp(coefficients(m)), 3)

\[ \frac{p}{1-p} = 0.23 + 1.023 \cdot \text{weekly_activity} + 6.821 \cdot \text{app_use} + 0.337 \cdot \text{stress_level} \]

Due to the small sample size, the effects are not significant, though.

In addition to the estimated regression coefficients b, logistic regression reports usually include estimates of odds ratios OR = eb and their 95% confidence intervals CI(95%) = [low, high]:

b = round(coef(m), 3)
OR = round(exp(coef(m)), 3)
CI = round(exp(confint(m)), 3)
CI = paste0("[",CI[,1], ",", CI[,2], "]")
t = data.frame(cbind(b, OR, CI))
kable_styling(kable(t, align = "rrr"), full_width = FALSE)
b OR CI
(Intercept) -1.47 0.23 [0.001,18.703]
weekly_activity 0.022 1.023 [1.001,1.051]
app_use 1.92 6.821 [0.852,75.536]
stress_level -1.087 0.337 [0.036,1.685]

The model shows that adherence is positively associated with physical activity and especially app use, while higher stress substantially reduces the likelihood of adherence. In detail, the odds ratios indicate the direction and strength of association with adherence, while the extreme width of the CIs indicate great uncertainty due to the small sample size.

  • weekly_activity (OR = 1.023): Each additional minute of activity increases the odds of adherence by about 2.3%. This is a small but cumulative effect (e.g., larger increases over 10–30 minutes become more substantial).
  • app_use (OR = 6.821): Individuals using the app have nearly 7 times higher odds of adherence compared to non-users. This is a strong positive association.
  • stress_level (OR = 0.337): Higher stress is associated with substantially lower adherence. Specifically, a one-unit increase in stress reduces the odds by about 66%. This indicates a strong negative association.
  • Intercept (OR = 0.23): At the reference level (all predictors = 0), the odds of adherence are low. This mainly serves as a baseline and is often of limited substantive interest unless the reference point is meaningful.

Regression coefficients b can take both negative and positive values, whereas odds ratios (ORs), defined as (eb), are strictly positive. Positive coefficients translate into (OR > 1) (increased odds), while negative coefficients yield (OR< 1) (decreased odds). An OR of 1 indicates no difference in odds, i.e., no effect, since (e0 = 1).

Accordingly, at the 5% significance level, an effect is typically considered statistically significant if the corresponding confidence interval for the OR does not include 1.

Metric vs. categorical predictors

We can easily extend the model by replacing metric predictors with dummy variables. This is particularly relevant for variables such as stress_level, which is ordinal and may not satisfy the assumption of a linear effect. Instead of treating stress_level as a numeric variable (implying equal and linear changes between levels), we can represent each category (e.g., 1=low, 2=moderate, 3=high) using dummy variables. This allows the model to estimate separate effects for each level relative to a reference category (the lowest by default).

This step is important because it relaxes the linearity assumption and enables a more flexible and potentially more accurate representation of how stress is associated with adherence.

#df = read.xlsx("./data/AdherenceData.xlsx") 
m = glm(adherence ~ weekly_activity + as.factor(app_use) + as.factor(stress_level), data = df, family = binomial)
#m
b = round(coef(m), 3)
OR = round(exp(coef(m)), 3)
CI = round(exp(confint(m)), 3)
CI = paste0("[",CI[,1], ",", CI[,2], "]")
t = data.frame(cbind(b, OR, CI))
kable_styling(kable(t, align = "rrr"), full_width = FALSE)
b OR CI
(Intercept) -2.736 0.065 [0,4.866]
weekly_activity 0.023 1.023 [1.001,1.051]
as.factor(app_use)1 1.926 6.864 [0.853,76.812]
as.factor(stress_level)2 -0.881 0.414 [0.003,57.166]
as.factor(stress_level)3 -2.046 0.129 [0.001,6.884]

Note that the results for the binary variable app_use remain unchanged (reference category: 0 = no app use). Users have substantially higher odds of adherence, although the wide confidence interval indicates considerable uncertainty.

For stress_level, the use of dummy variables provides a more differentiated view. Relative to the reference category (low stress, level 1), both medium (level 2) and high stress (level 3) are associated with lower odds of adherence.

Specifically, individuals with high stress have odds that are about 0.129 times those of low-stress individuals (i.e., a reduction of roughly 87%). Compared to medium stress, high stress corresponds to an additional reduction by a factor of about 0.129 / 0.414 ≈ 0.312, indicating substantially lower adherence at the highest stress level.

However, the confidence intervals are very wide and include 1, so these effects are estimated with high uncertainty and should be interpreted cautiously.

Common GLM Examples

GLMs allow us to apply linear modeling techniques to a wide range of outcome types. The key is the link function, which transforms the expected value of the outcome into a continuous scale suitable for a linear predictor. This enables valid modeling even when the raw outcome is binary, count-based, bounded, or non-linear.

Outcome Type Distribution Link Function What the Link Does
Non-linear Skewed normal Log, Exp, … Maps metric values to a linear function
Binary (0/1) Binomial Logit Maps probabilities (0–1) to the real line
Count Poisson Log Maps positive counts to the real line
Proportion Binomial Logit or Probit Models bounded outcomes on a continuous scale
Positive continuous Gamma Inverse Ensures predictions remain strictly positive