Generalized Linear Models (GLM)

The Classical Model

The General Linear Model refers to a broad family of models for continuous, normally distributed outcomes. It includes familiar methods such as simple linear regression, multiple regression, ANOVA, and ANCOVA.

The general model equation for an outcome variable \(Y\) modeled by \(m\) variables \(X_{1\ldots m}\) is:

\[ Y = \beta_0 + \beta_1X_1 + \beta_1X_2 + \ldots + \beta_mX_m + \varepsilon \]

where

\(\beta_0\) denotes the intercept, i.e. the estimate of \(Y\) if all \(X_{1\ldots m}\) are \(0\)
\(\beta_{1\ldots m}\) are the partial regression coefficients
\(\varepsilon\) denotes the residual variable, i.e. the differences between observed and expected outcome values.

Fitting the model means calculating the regression parameters \(\beta_{0\ldots m}\) which can then be used to estimate the outcome:

\[ \hat{Y} = Y_{\text{est}} = \beta_0 + \beta_1X_1 + \beta_1X_2 + \ldots + \beta_mX_m \] Note that \(Y\) denotes observed outcome. Estimated outcome is written as \(\hat{Y}\) or \(Y_\text{est}\), you can use these interchangeably. The estimated outcome is equal to the pure model prediciton without the error term.

In case of a single predictor we often write

\[ Y = \alpha + \beta X + \varepsilon \hspace{1em} \text{or}\hspace{1em} \hat{Y} = \alpha + \beta X \]

instead.

Let’s introduce an example to illustrate this concept. In the following, we will work with data collected from 3 schools, specifically Hillcrest High School, Oakwood High School, and St Mary’s Church of England School. For each student, we have recorded the average test scores (ranging from 0% to 100%) and the average number of hours they studied shortly before their exams.

Here is the dataset:

school_long_name	school	test_score	study_time
Hillcrest High School	Hillcrest	54.59454	13.092146
Hillcrest High School	Hillcrest	45.90443	7.285176
Hillcrest High School	Hillcrest	46.07368	10.068656
Hillcrest High School	Hillcrest	53.15568	10.877858
Hillcrest High School	Hillcrest	51.89308	10.192075
Hillcrest High School	Hillcrest	45.88416	8.660897
Hillcrest High School	Hillcrest	49.43284	13.513836
Hillcrest High School	Hillcrest	49.04923	8.695293
Hillcrest High School	Hillcrest	55.00619	15.034541
Hillcrest High School	Hillcrest	49.30719	8.791128
Hillcrest High School	Hillcrest	49.59065	12.893879
Hillcrest High School	Hillcrest	54.12954	15.839206
Hillcrest High School	Hillcrest	47.03255	4.812688
Hillcrest High School	Hillcrest	48.98281	8.142904
Hillcrest High School	Hillcrest	50.05463	8.579306
Hillcrest High School	Hillcrest	56.51023	10.887121
Hillcrest High School	Hillcrest	48.13828	8.126511
Hillcrest High School	Hillcrest	33.20910	1.009904
Hillcrest High School	Hillcrest	40.99077	1.657870
Hillcrest High School	Hillcrest	57.04151	12.939610
Oakwood High School	Oakwood	44.29733	8.059354
Oakwood High School	Oakwood	25.14011	3.635345
Oakwood High School	Oakwood	35.47447	8.463518
Oakwood High School	Oakwood	45.12912	12.623294
Oakwood High School	Oakwood	48.16753	14.664851
Oakwood High School	Oakwood	35.43849	7.687863
Oakwood High School	Oakwood	37.85879	8.207462
Oakwood High School	Oakwood	25.77590	3.689781
Oakwood High School	Oakwood	48.83004	10.359562
Oakwood High School	Oakwood	36.44190	7.059286
Oakwood High School	Oakwood	42.43892	10.345621
Oakwood High School	Oakwood	45.67442	11.093782
Oakwood High School	Oakwood	46.28141	12.084581
Oakwood High School	Oakwood	35.90872	7.152491
Oakwood High School	Oakwood	47.41863	10.494136
Oakwood High School	Oakwood	29.00540	3.828244
Oakwood High School	Oakwood	31.40098	6.625893
Oakwood High School	Oakwood	36.01010	6.426547
Oakwood High School	Oakwood	24.41876	1.736647
Oakwood High School	Oakwood	38.60989	9.087638
Oakwood High School	Oakwood	37.99014	9.597266
Oakwood High School	Oakwood	39.07729	7.896098
Oakwood High School	Oakwood	43.98547	11.253760
St Mary’s Church of England School	St Mary’s	42.94447	6.799156
St Mary’s Church of England School	St Mary’s	34.92019	4.874427
St Mary’s Church of England School	St Mary’s	49.85766	10.277724
St Mary’s Church of England School	St Mary’s	44.38598	6.545091
St Mary’s Church of England School	St Mary’s	60.11943	13.311574
St Mary’s Church of England School	St Mary’s	46.89865	7.684932
St Mary’s Church of England School	St Mary’s	50.23389	10.946214
St Mary’s Church of England School	St Mary’s	49.43679	9.945046
St Mary’s Church of England School	St Mary’s	40.07521	6.627753
St Mary’s Church of England School	St Mary’s	60.94646	13.706453
St Mary’s Church of England School	St Mary’s	57.77003	10.907968
St Mary’s Church of England School	St Mary’s	48.67736	9.248552
St Mary’s Church of England School	St Mary’s	51.15944	9.808922
St Mary’s Church of England School	St Mary’s	51.22429	11.017137
St Mary’s Church of England School	St Mary’s	46.55865	9.248769
St Mary’s Church of England School	St Mary’s	23.99421	0.000000
St Mary’s Church of England School	St Mary’s	54.27720	9.833919
St Mary’s Church of England School	St Mary’s	48.37929	7.877566
Hillcrest High School	Hillcrest	45.30004	9.534962
Hillcrest High School	Hillcrest	57.01957	10.724741
Hillcrest High School	Hillcrest	56.86480	13.178481
Hillcrest High School	Hillcrest	46.07672	6.797394
Hillcrest High School	Hillcrest	55.57510	12.886898
Hillcrest High School	Hillcrest	47.07002	9.986815
Hillcrest High School	Hillcrest	49.22528	12.094788
Hillcrest High School	Hillcrest	52.23690	11.741456
Hillcrest High School	Hillcrest	47.77480	11.141905
Hillcrest High School	Hillcrest	45.58995	5.849913
Hillcrest High School	Hillcrest	52.34357	8.708711
Hillcrest High School	Hillcrest	47.91817	10.849825
Hillcrest High School	Hillcrest	43.12712	6.118700
Hillcrest High School	Hillcrest	46.96063	7.350784
Hillcrest High School	Hillcrest	47.81392	10.722260
Hillcrest High School	Hillcrest	50.39072	11.283807
Hillcrest High School	Hillcrest	53.06295	10.370573
Hillcrest High School	Hillcrest	48.49496	6.321941
Hillcrest High School	Hillcrest	45.96745	5.679928
Hillcrest High School	Hillcrest	48.66620	13.517391
Oakwood High School	Oakwood	41.34408	9.753035
Oakwood High School	Oakwood	43.68350	9.244591
Oakwood High School	Oakwood	41.67275	8.616581
Oakwood High School	Oakwood	32.22012	5.396284
Oakwood High School	Oakwood	35.53073	10.815261
Oakwood High School	Oakwood	38.83860	8.327851
Oakwood High School	Oakwood	40.58326	8.431000
Oakwood High School	Oakwood	45.69603	11.779309
Oakwood High School	Oakwood	45.36142	11.444590
Oakwood High School	Oakwood	49.60593	13.155619
Oakwood High School	Oakwood	35.91185	7.550749
Oakwood High School	Oakwood	47.79057	10.930316
Oakwood High School	Oakwood	49.71638	13.152602
Oakwood High School	Oakwood	29.56580	5.646904
Oakwood High School	Oakwood	38.93851	6.396892
Oakwood High School	Oakwood	36.78149	5.584054
Oakwood High School	Oakwood	33.67548	4.601628
Oakwood High School	Oakwood	35.45055	9.219218
Oakwood High School	Oakwood	42.16985	10.938883
Oakwood High School	Oakwood	49.07087	12.582166
Oakwood High School	Oakwood	46.35821	12.113523
Oakwood High School	Oakwood	34.98333	5.969644
Oakwood High School	Oakwood	58.42821	14.524716
St Mary’s Church of England School	St Mary’s	39.48171	6.978950
St Mary’s Church of England School	St Mary’s	42.54783	9.295812
St Mary’s Church of England School	St Mary’s	44.95859	7.712503
St Mary’s Church of England School	St Mary’s	44.77387	8.612220
St Mary’s Church of England School	St Mary’s	50.96967	9.543849
St Mary’s Church of England School	St Mary’s	46.57311	9.336753
St Mary’s Church of England School	St Mary’s	54.34814	8.903993
St Mary’s Church of England School	St Mary’s	48.53935	9.303488
St Mary’s Church of England School	St Mary’s	42.13689	7.522964
St Mary’s Church of England School	St Mary’s	42.90107	7.466619
St Mary’s Church of England School	St Mary’s	35.57561	3.995973
St Mary’s Church of England School	St Mary’s	44.78883	7.832269
St Mary’s Church of England School	St Mary’s	38.44711	7.441319
St Mary’s Church of England School	St Mary’s	72.24097	17.084943
St Mary’s Church of England School	St Mary’s	37.95949	4.892922
St Mary’s Church of England School	St Mary’s	49.29009	9.391039
St Mary’s Church of England School	St Mary’s	36.70040	4.498395
St Mary’s Church of England School	St Mary’s	34.26447	4.567963
Hillcrest High School	Hillcrest	49.05039	9.353377
Hillcrest High School	Hillcrest	46.63433	5.989353
Hillcrest High School	Hillcrest	51.74739	8.973802
Hillcrest High School	Hillcrest	43.49421	7.694494
Hillcrest High School	Hillcrest	46.46544	7.138255
Hillcrest High School	Hillcrest	41.27340	2.905237
Hillcrest High School	Hillcrest	42.08927	5.305026
Hillcrest High School	Hillcrest	46.31831	9.517820
Hillcrest High School	Hillcrest	48.92636	10.682132
Hillcrest High School	Hillcrest	48.76119	7.500638
Hillcrest High School	Hillcrest	47.52638	8.979459
Hillcrest High School	Hillcrest	50.46286	12.347939
Hillcrest High School	Hillcrest	54.44885	13.298837
Hillcrest High School	Hillcrest	41.11537	5.687929
Hillcrest High School	Hillcrest	51.49040	8.627312
Hillcrest High School	Hillcrest	51.64673	12.583766
Hillcrest High School	Hillcrest	47.65148	7.570081
Hillcrest High School	Hillcrest	48.73925	8.821862
Hillcrest High School	Hillcrest	48.46442	8.720948
Hillcrest High School	Hillcrest	51.00763	6.316233
Oakwood High School	Oakwood	34.81445	7.645218
Oakwood High School	Oakwood	43.21828	8.890936
Oakwood High School	Oakwood	37.57005	7.737664
Oakwood High School	Oakwood	44.13324	12.319428
Oakwood High School	Oakwood	36.86629	7.536292
Oakwood High School	Oakwood	39.59984	7.679763
Oakwood High School	Oakwood	42.86316	11.069858
Oakwood High School	Oakwood	31.30408	5.810165
Oakwood High School	Oakwood	40.17264	8.857175
Oakwood High School	Oakwood	33.61506	4.324636
Oakwood High School	Oakwood	46.74118	12.480779
Oakwood High School	Oakwood	34.15559	8.158333
Oakwood High School	Oakwood	33.23144	7.575734
Oakwood High School	Oakwood	30.22384	5.264513
Oakwood High School	Oakwood	38.33064	8.955984
Oakwood High School	Oakwood	35.09257	6.578424
Oakwood High School	Oakwood	38.76908	7.378793
Oakwood High School	Oakwood	46.38074	12.842296
Oakwood High School	Oakwood	35.56375	8.452693
Oakwood High School	Oakwood	35.34916	5.763923
Oakwood High School	Oakwood	41.76415	9.468891
Oakwood High School	Oakwood	41.25415	7.891055
Oakwood High School	Oakwood	38.45118	10.749311
St Mary’s Church of England School	St Mary’s	61.09157	13.276536
St Mary’s Church of England School	St Mary’s	43.06366	6.001193
St Mary’s Church of England School	St Mary’s	54.20218	10.343221
St Mary’s Church of England School	St Mary’s	46.20423	9.233964
St Mary’s Church of England School	St Mary’s	58.19549	11.665967
St Mary’s Church of England School	St Mary’s	43.25403	8.289936
St Mary’s Church of England School	St Mary’s	54.10719	11.489128
St Mary’s Church of England School	St Mary’s	35.79491	3.744103
St Mary’s Church of England School	St Mary’s	62.27251	14.047647
St Mary’s Church of England School	St Mary’s	58.20645	11.573604
St Mary’s Church of England School	St Mary’s	45.70714	8.526942
St Mary’s Church of England School	St Mary’s	30.16766	4.632249
St Mary’s Church of England School	St Mary’s	51.17844	10.908296
St Mary’s Church of England School	St Mary’s	49.02100	10.428852
St Mary’s Church of England School	St Mary’s	46.42774	8.960203
St Mary’s Church of England School	St Mary’s	48.29198	9.433638
St Mary’s Church of England School	St Mary’s	42.22076	7.226943
St Mary’s Church of England School	St Mary’s	50.52733	10.085690

Simple linear regression analysis

We can now model how students’ test scores depend on the length of study. As a first step we estimate a simple linear regression model:

# Fit a simple linear regression model (fixed intercept, fixed slope model)
model = lm(test_score ~ study_time, data = df)
# show mode summary
summary(model)

## 
## Call:
## lm(formula = test_score ~ study_time, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.1068  -4.5792   0.5577   4.1368  11.4546 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.7993     1.2427   21.57   <2e-16 ***
## study_time    2.0192     0.1326   15.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.266 on 181 degrees of freedom
## Multiple R-squared:  0.5616, Adjusted R-squared:  0.5592 
## F-statistic: 231.9 on 1 and 181 DF,  p-value: < 2.2e-16

#model

With the formula

formula(model)

## test_score ~ study_time

we tell R to fit a regression line for the point cloud defined by study_time (as single continuous predictor) and test_score (as continuous output). We see that the resulting basic regression parameters

# Collect "classic" regression parameter 
coef(model)

## (Intercept)  study_time 
##   26.799339    2.019198

are significantly different from zero and that the residuals are roughly symmetrically distributed around 0, which is a good sign. The test score \(Score_\text{est}\) is estimated by the regression equation

\[ Test\,\,Score_\text{est} = 26.799 + 2.019\,Study\,\,Time\,(\text{hours}) \]

According to this model, each additional minute of study increases the test score by 2.019 units. In case students do not study at all their test score is estimated 26.799 on average.

The model output is visualized as regression line, i.e. the closest linear (“straight”) approximation of all dots in the scatter plot (point cloud) where each dot represents a single student.

basic_ggplot = ggplot(df, aes(x = study_time, y = test_score)) +
  geom_point(alpha = alpha_dots, color="#678") +
  theme_minimal() + theme(panel.grid = element_blank()) + 
  labs(x = "Study Time", y = "Score")

# Assign intercept and slope to variables
intercept = coef(model)[1]
slope = coef(model)[2]

basic_ggplot +
  labs(title = "Association of student's study time and scoring") +
  geom_abline(intercept = intercept, slope = slope, color = "#345", linetype = "longdash")

Residual analysis

In order to have valid estimates the error term (residuals) must be normally distributed. As seen in the model output above, the residuals are approximately symmetrically distributed. We can, of course, calculate the residual variable directly to show its distribution in more detail:

# get residual variable
test_score_resid = resid(model)

# show how residual are distributed 
ggplot(df, aes(x = test_score_resid)) +
  geom_histogram(binwidth = 5, fill = "#678", color = "white", alpha=.6) +
  theme_minimal() + theme(panel.grid = element_blank()) + 
  labs(title = "Histogram of residuals", x = "Score residual", y = "Count")

Residuals are slightly left skewed, which can be ignored in practice.

In brief, the classical model

assumes a linear relationship between predictors and outcome.
assumes normally distributed residuals with constant variance.
is estimated using ordinary least squares (OLS).
is used when the outcome is continuous and unbounded (e.g., test scores, weight).

Generalized Linear Models (GLM): Summary

Generalized Linear Models (GLMs) extend traditional linear models to handle non-normal outcome variables, such as binary, count, proportion, or skewed continuous data. They provide a flexible framework that combines linear modeling with appropriate transformations and error distributions.

GLMs build upon a link function, i.e., a transformation that links the expected outcome \(Y\) to the linear predictor:

The link function \(g(Y)\) transforms the expected value of the outcome variable to a scale compatible with linear modeling.
The right-hand side of the model equation — the linear predictor — is continuous.
Therefore, \(g(Y)\) must also be continuous, even if \(Y\) itself is bounded (e.g., a probability).
This allows us to model probabilities, counts, or other non-normal outcomes using a linear equation on the transformed scale.

The general model equation for a GLM is:

\[ g(Y) = \beta_0 + \beta_1 X_1 + \ldots + \beta_m X_m + \varepsilon \]

This equation links the expected value of the outcome \(Y\) to a linear combination of predictors through the link function \(g(\ldots)\).

Example 1: Non-linear Associations

In the following example we analyze how salaries (€) in an organization depend on years of employment and gender. Data is collected from 200 employees, specifically 100 female and 100 male employees.

Here is the dataset:

years_empl	salary	gender
27.4441813	274372.94	Male
28.1122624	272584.60	Male
8.5841860	83252.41	Male
24.9134288	229785.25	Male
19.2523656	141310.60	Male
15.5728785	108422.85	Male
22.0976494	185929.98	Male
4.0399979	42793.74	Male
19.7097687	100284.31	Male
21.1519435	167208.49	Male
13.7322533	84488.60	Male
21.5733675	171300.54	Male
28.0401674	291434.29	Male
7.6628647	76376.50	Male
13.8687847	80076.12	Male
28.2004357	305893.12	Male
29.3467929	318895.77	Male
3.5246208	55349.75	Male
14.2499124	107613.32	Male
16.8099824	125935.58	Male
27.1209416	247016.44	Male
4.1613050	40497.62	Male
29.6667519	331348.26	Male
28.4000470	276661.61	Male
2.4731267	31579.04	Male
15.4263535	111774.36	Male
11.7061040	88052.90	Male
27.1721439	270697.87	Male
13.4090888	74413.58	Male
25.0801278	206600.50	Male
22.1278685	198856.63	Male
24.3316542	213999.52	Male
11.6432485	77472.96	Male
20.5550919	153524.76	Male
0.1184502	47629.30	Male
24.9874824	230629.76	Male
0.2200244	32724.36	Male
6.2297692	46640.30	Male
27.1980422	278288.55	Male
18.3533593	142577.96	Male
11.3867772	95481.67	Male
13.0731475	78232.04	Male
1.1229310	42575.02	Male
29.2061974	331214.32	Male
12.9525375	67893.01	Male
28.7272979	285770.63	Male
26.6326472	235624.41	Male
19.1993631	117483.76	Male
29.1289983	309636.63	Male
18.5651462	142275.07	Male
10.0028163	84795.75	Male
10.4024474	84622.06	Male
11.9545623	63018.48	Male
23.5407833	224974.88	Male
1.1680947	37063.01	Male
22.4638616	182548.19	Male
20.3183049	146089.50	Male
5.1379299	43416.06	Male
7.8326389	58960.65	Male
15.4323880	104896.59	Male
20.2682182	151437.42	Male
29.4845159	318956.30	Male
22.7863280	178413.12	Male
16.9946527	109272.55	Male
25.4906916	205630.04	Male
5.6842181	41537.78	Male
8.1385984	49839.02	Male
24.8447546	259463.99	Male
20.7961446	137931.11	Male
7.2163422	55495.92	Male
1.2896639	49143.88	Male
4.2143728	40028.07	Male
6.4915625	52297.32	Male
14.3819569	79848.95	Male
5.9223103	48154.50	Male
21.5806751	162196.75	Male
0.2365422	38631.97	Male
11.2646989	43504.75	Male
15.4322312	84736.67	Male
0.0471166	32806.04	Male
17.4481201	129666.43	Male
4.7371562	36430.33	Male
10.7708492	71014.11	Male
19.3689564	158119.13	Male
23.2747009	214691.13	Male
16.9094052	99585.02	Male
7.0111020	50807.05	Male
2.6994155	55253.81	Male
2.5683619	30202.92	Male
9.1565511	61622.70	Male
20.0227954	147570.59	Male
0.0071669	43297.98	Male
6.2570987	42819.48	Male
27.9910238	281155.98	Male
27.7693425	270439.65	Male
22.0228290	191392.66	Male
9.9921595	59509.47	Male
15.4518999	96772.72	Male
22.3192394	189336.76	Male
18.5747772	116733.59	Male
27.4441813	204242.00	Female
28.1122624	191386.91	Female
8.5841860	72219.94	Female
24.9134288	167490.37	Female
19.2523656	108435.26	Female
15.5728785	70663.49	Female
22.0976494	140781.09	Female
4.0399979	32199.05	Female
19.7097687	111206.97	Female
21.1519435	151187.35	Female
13.7322533	75816.61	Female
21.5733675	119743.62	Female
28.0401674	216027.60	Female
7.6628647	45854.16	Female
13.8687847	88053.06	Female
28.2004357	237475.42	Female
29.3467929	219145.13	Female
3.5246208	45214.51	Female
14.2499124	82617.81	Female
16.8099824	110742.94	Female
27.1209416	196822.69	Female
4.1613050	52693.92	Female
29.6667519	213160.54	Female
28.4000470	244370.12	Female
2.4731267	48641.89	Female
15.4263535	86065.03	Female
11.7061040	46340.89	Female
27.1721439	210633.58	Female
13.4090888	83942.68	Female
25.0801278	173513.78	Female
22.1278685	143467.72	Female
24.3316542	155985.73	Female
11.6432485	73309.23	Female
20.5550919	130895.97	Female
0.1184502	33939.11	Female
24.9874824	152443.33	Female
0.2200244	40976.86	Female
6.2297692	54712.10	Female
27.1980422	188808.55	Female
18.3533593	84492.50	Female
11.3867772	69645.56	Female
13.0731475	69735.95	Female
1.1229310	36242.49	Female
29.2061974	212333.47	Female
12.9525375	59894.91	Female
28.7272979	240390.18	Female
26.6326472	199596.33	Female
19.1993631	123822.69	Female
29.1289983	257722.99	Female
18.5651462	111962.82	Female
10.0028163	30410.55	Female
10.4024474	67145.34	Female
11.9545623	86840.21	Female
23.5407833	186767.75	Female
1.1680947	48096.85	Female
22.4638616	127293.27	Female
20.3183049	113809.76	Female
5.1379299	32825.78	Female
7.8326389	42222.32	Female
15.4323880	85583.32	Female
20.2682182	105943.37	Female
29.4845159	266857.26	Female
22.7863280	149472.85	Female
16.9946527	97313.91	Female
25.4906916	186105.23	Female
5.6842181	45221.87	Female
8.1385984	51050.88	Female
24.8447546	192923.96	Female
20.7961446	125372.93	Female
7.2163422	30462.69	Female
1.2896639	38619.32	Female
4.2143728	35021.34	Female
6.4915625	39430.34	Female
14.3819569	66077.73	Female
5.9223103	51836.70	Female
21.5806751	133279.58	Female
0.2365422	38235.89	Female
11.2646989	62489.24	Female
15.4322312	78485.47	Female
0.0471166	48852.66	Female
17.4481201	97678.32	Female
4.7371562	56015.18	Female
10.7708492	45738.23	Female
19.3689564	109407.30	Female
23.2747009	148957.97	Female
16.9094052	92124.57	Female
7.0111020	69238.16	Female
2.6994155	35898.28	Female
2.5683619	39572.20	Female
9.1565511	42813.41	Female
20.0227954	110912.02	Female
0.0071669	44986.09	Female
6.2570987	65365.23	Female
27.9910238	231578.99	Female
27.7693425	188859.09	Female
22.0228290	170910.92	Female
9.9921595	75632.53	Female
15.4518999	88084.00	Female
22.3192394	153654.18	Female
18.5747772	95533.93	Female

The scatterplot shows a clear positive association: The longer employees are employed, the higher salaries they earn:

p = ggplot(df, aes(x = years_empl, y = salary, color = gender)) +
  geom_point(alpha = .6) +
  theme_minimal() +
  scale_x_continuous(name="Years of employment", labels = scales::comma) + 
  scale_y_continuous(name="Salary (€)", labels = scales::comma) +
  scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender")
  #scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender")
p + theme(legend.position="none")

We can fit a simple linear regression model to model this association:

linear_model = lm(salary ~ years_empl, data = df)
summary(linear_model)

## 
## Call:
## lm(formula = salary ~ years_empl, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -58615 -27281  -3463  19327 101896 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2684.3     4717.3  -0.569     0.57    
## years_empl    7943.6      260.2  30.535   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33160 on 198 degrees of freedom
## Multiple R-squared:  0.8248, Adjusted R-squared:  0.8239 
## F-statistic: 932.4 on 1 and 198 DF,  p-value: < 2.2e-16

#linear_model

p + theme(legend.position="none") + 
  geom_abline(intercept = coef(linear_model)[1], slope = coef(linear_model)[2], 
              color = "#345", alpha = 1, linetype = "longdash")

We are particularly interested in the differences of salaries between gender. A simple comparison of average salaries reveals that female have lower salaries on average:

tapply(df$salary, df$gender, mean)

##   Female     Male 
## 109140.8 135466.1

salary_female = tapply(df$salary, df$gender, mean)["Female"]
salary_male = tapply(df$salary, df$gender, mean)["Male"]

By adding gender as dummy variable into our linear model, we come to the conclusion, that salaries increase by 7943.61 € per year and that male salaries exceed female salaries by NA € on average, independent of employment time.

linear_model2 = lm(salary ~ years_empl+gender, data = df)
summary(linear_model2)

## 
## Call:
## lm(formula = salary ~ years_empl + gender, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -66761 -21296  -6514  20747  88733 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15846.9     4842.8  -3.272  0.00126 ** 
## years_empl    7943.6      239.2  33.215  < 2e-16 ***
## genderMale   26325.4     4311.0   6.107 5.33e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30480 on 197 degrees of freedom
## Multiple R-squared:  0.8527, Adjusted R-squared:  0.8512 
## F-statistic: 570.3 on 2 and 197 DF,  p-value: < 2.2e-16

#linear_model2

p = p + theme(legend.position.inside=c(.28,.63)) + 
  scale_color_manual(values = gender_colors, name = "Gender")

p + 
  geom_abline(intercept = coef(linear_model)[1], slope = coef(linear_model)[2], 
              color = "#666", alpha = 1, linetype = "dotted") + 
  geom_abline(intercept = coef(linear_model2)[1], slope = coef(linear_model2)[2], 
              color = gender_colors[1], alpha = 1, linetype = "longdash") + 
  geom_abline(intercept = coef(linear_model2)[1]+coef(linear_model2)[3], slope = coef(linear_model2)[2], 
              color = gender_colors[2], alpha = 1, linetype = "longdash")

It is, however, clearly visible that the association between years of employment and salaries is not linear. A linear model, thus, is not appropriate, and our interpretation is flawed. In order to fit a linear model, we first have to transform our dependent variable so that the association becomes linear.

When looking at selected data points it becomes obvious that the association is exponential in nature:

Years of Employment		Salary		Salary	log₂(Salary)
0		32.768		2¹⁵	15
10	+10	65.536	x2	2¹⁶	16	+1
20	+10	131.072	x2	2¹⁷	17	+1
30	+10	262.144	x2	2¹⁸	18	+1

p + theme(legend.position="none") + 
  scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender") + 
  geom_point(x=0, y=2**15, color="#f52", size=5, alpha=.5) +
  geom_point(x=10, y=2**16, color="#f52", size=5, alpha=.5) +
  geom_point(x=20, y=2**17, color="#f52", size=5, alpha=.5) +
  geom_point(x=30, y=2**18, color="#f52", size=5, alpha=.5)

Therefore, we use a logarithmic transformation to linearize the association. Instead of fitting

\[ Salary = \beta_0 + \beta_1Years + \varepsilon \]

we choose \(\text{log}_2(Y)\) as dependent variable:

\[ \text{log}_2(Salary) = \beta_0 + \beta_1Years + \varepsilon \]

log_model = lm(log2(salary) ~ years_empl, data = df)
summary(log_model)

## 
## Call:
## lm(formula = log2(salary) ~ years_empl, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1115 -0.1760 -0.0016  0.2198  0.5921 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.979177   0.039675  377.54   <2e-16 ***
## years_empl   0.102428   0.002188   46.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2789 on 198 degrees of freedom
## Multiple R-squared:  0.9171, Adjusted R-squared:  0.9167 
## F-statistic:  2191 on 1 and 198 DF,  p-value: < 2.2e-16

#log_model

beta0 = round(coef(log_model)[1],3)
beta1 = round(coef(log_model)[2],3)

pl = ggplot(df, aes(x = years_empl, y = log2(salary), color = gender)) +
  geom_point(alpha = .6) +
  theme_minimal() + 
  scale_x_continuous(name="Years of employment", labels = scales::comma) + 
  scale_y_continuous(name="Log2(Salary)", labels = scales::comma) +
  scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender")
  #scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender")

pl + theme(legend.position="none") + 
  geom_abline(intercept = coef(log_model)[1], slope = coef(log_model)[2], 
              color = "#345", alpha = 1, linetype = "longdash")

According to this model, the log₂ of Salary increases by 0.102 each year :

\[ \text{log}_2(Salary) = 14.979 + 0.102 Years \]

As this description is a little hard to understand we have to get the model parameter and model predictions back on original scale. Raising the predictive form of the regression equation

\[ \text{log}_2(Salary_\text{est}) = \beta_0 + \beta_1Years \]

to the power of 2 we get

\[ Salary_\text{est} = 2^{(\beta_0 + \beta_1Years)} = 2^{\beta_0} \cdot 2^{\beta_1Years} \] Now we have a multiplicative (rather than additive) model indicating the underlying non-linear structure .

\[ Salary_\text{est} = 2^{14.979} \cdot 2^{0.102\cdot Years} \] Because we applied a logarithmic transformation, we can now interpret the model parameters as growth rates. The model predictions are as follows:

Employees in year 0 earn \(2^{14.979} \cdot 2^0 = 32294.48\) € on average.
Every year, income increases by factor \(2^{0.102} = 1.0733\), i.e., by 7.33%, independent of years of employment.

df$log_pred = 2**(beta0 + beta1*df$years_empl)

p + 
  geom_abline(intercept = coef(linear_model)[1], slope = coef(linear_model)[2], 
              color = "#345", alpha = 1, linetype = "dotted")  + 
  geom_line(aes(x=years_empl, 
                y=2**(coef(log_model)[1] + coef(log_model)[2]*years_empl)),
              color = "#345", alpha = 1, linetype = "longdash")

But if we look at the gender differences in more detail it appears as if gender do not differ by intercepts only but have different growth rates. This can only be modeled by fitting separate models for male and female.

# adjust for Male and Female
log_model_male = lm(log2(salary) ~ years_empl, data = df[df$gender=="Male",])
summary(log_model_male)

## 
## Call:
## lm(formula = log2(salary) ~ years_empl, data = df[df$gender == 
##     "Male", ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8088 -0.1247  0.0048  0.1004  0.5500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.97655    0.04442  337.15   <2e-16 ***
## years_empl   0.11018    0.00245   44.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2208 on 98 degrees of freedom
## Multiple R-squared:  0.9538, Adjusted R-squared:  0.9533 
## F-statistic:  2023 on 1 and 98 DF,  p-value: < 2.2e-16

#log_model_male
log_model_female = lm(log2(salary) ~ years_empl, data = df[df$gender=="Female",])
summary(log_model_female)

## 
## Call:
## lm(formula = log2(salary) ~ years_empl, data = df[df$gender == 
##     "Female", ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.03654 -0.11005  0.02057  0.15374  0.58988 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.981809   0.052984   282.8   <2e-16 ***
## years_empl   0.094675   0.002922    32.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2634 on 98 degrees of freedom
## Multiple R-squared:  0.9146, Adjusted R-squared:  0.9138 
## F-statistic:  1050 on 1 and 98 DF,  p-value: < 2.2e-16

#log_model_female
# log_model_dummy = lm(log2(salary) ~ years_empl + gender, data = df)
# summary(log_model_dummy)
# log_model_male
# log_model_female
# log_model_dummy
betam0 = round(coef(log_model_male)[1],3)
betam1 = round(coef(log_model_male)[2],3)
betaf0 = round(coef(log_model_female)[1],3)
betaf1 = round(coef(log_model_female)[2],3)

p + 
  geom_line(aes(x=years_empl, 
                y=2**(coef(log_model_male)[1] + coef(log_model_male)[2]*years_empl)),
              color = "steelblue", alpha = 1, linetype = "longdash")   + 
  geom_line(aes(x=years_empl, 
                y=2**(coef(log_model_female)[1] + coef(log_model_female)[2]*years_empl)),
              color = "darkorange", alpha = 1, linetype = "longdash")

Now it becomes obvious, that gender salaries differ by their growth rates rather than being dependent on starting salaries.

Female \[ Salary_\text{est Female} = 2^{14.982} \cdot 2^{0.095\cdot Years} \] Male \[ Salary_\text{est Male} = 2^{14.977} \cdot 2^{0.11\cdot Years} \] While starting salaries are more or less equal between gender (female: \(2^{14.982} = 32361.71\)€, male: \(2^{14.977} = 32249.74\)€), the growth rates differ significantly (female: \(2^{0.095} = 6.81\)%, male: \(2^{0.11} = 7.92\)%).

In the terminology of GLM the logarithm log₂() served as link function \(g\)() that allowed us to linearize the association. Interpretation of model output, though, is much easier if output is transformed back into real measures.

Example 2: Binary Outcome, Logit Transformation

df = read.xlsx("./data/AdherenceData.xlsx")

Regression analysis requires a metric outcome variable. If we aim at modeling a binary (or categorical) outcome variable we have to choose a link function \(g\)() that transforms the binary outcome into a metric.

The following example models participation in, or adherence to, a preventive program as part of the health management offers in a company. Variables in the data set are:

adherence — whether a patient adheres to the preventive program (0 = no, 1 = yes)
app_use — use of a digital health management app (0 = no, 1 = yes)
stress_level — perceived stress (1 = low, 2 = moderate, 3 = high)
weekly_activity — minutes of physical activity per week

The data set comprises a binary outcome (adherence) and binary (app_use), multitom (stress_level), and metric (weekly_activity) explanatory variables:

id	adherence	app_use	stress_level	weekly_activity
1	1	1	1	210
2	1	0	3	180
3	0	0	2	60
4	0	1	3	150
5	0	0	1	30
6	1	1	2	140
7	0	0	3	80
8	1	1	3	90
9	1	0	2	50
10	1	1	1	220
11	0	0	3	90
12	1	1	3	160
13	0	0	2	70
14	1	1	1	220
15	0	1	3	85
16	1	1	2	140
17	0	0	3	55
18	1	1	1	230
19	0	0	3	95
20	1	1	3	175
21	0	1	2	65
22	1	1	1	210
23	0	0	3	180
24	0	1	3	155
25	0	0	2	160
26	1	1	1	225
27	0	0	3	110
28	1	1	3	165
29	0	0	2	75
30	1	1	1	205

The core research question asks for the effect of physical activity on the probability of adherence. We hypothesize:

Physical activity and adherence to the program are positively associated.

The linear model

A linear regression model can be written as:

\[ \hat{Y} = \beta_0 + \beta_1 X \]

Applied to a binary outcome such as adherence, this model describes a straight-line relationship between weekly_activity and the expected value of adherence.

glm = ggplot(df, aes(x = weekly_activity, y = adherence)) +
  geom_point(color = "darkblue", shape = 1) +
  scale_x_continuous(limits = c(0, 280)) +
  scale_y_continuous(limits = c(-0.2, 1.2)) +
  labs(x = "Weekly activity",
       y = "Adherence") +
  theme_minimal()
glm + geom_smooth(method = "lm",
              se = FALSE,
              linetype = "dashed",
              color = "steelblue",
              linewidth = 0)

m = lm(adherence ~ weekly_activity, data=df)
m

## 
## Call:
## lm(formula = adherence ~ weekly_activity, data = df)
## 
## Coefficients:
##     (Intercept)  weekly_activity  
##       -0.210823         0.005227

b = round(coefficients(m), 3)

\[ \text{adherence } = -0.211 + 0.005 \cdot \text{ weekly_activity} \]

It is obvious to interpret the model outcome as probabilities, at least in the value range of the predictor weekly_activity. However, this is problematic because predicted values outside this range may fall below 0 or above 1, although probabilities must lie between 0 and 1.

The logistic model

Logistic regression solves this problem by modeling an S-shaped probability curve (which is, by the way, also characteristic of logistic growth). This model is the so-called logistic model:

\[ p = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \]

Here, \(p\) denotes the probability that the outcome variable (here: adherence) takes the value 1.

glm + geom_smooth(method = "glm",
            method.args = list(family = "binomial"),
            se = FALSE,
            linetype = "solid",
            color = "darkorange",
            linewidth = 0) +
  labs(
    x = "Weekly activity",
    y = "Probability of Adherence (p)"
  )

# use glm here
m = glm(adherence ~ weekly_activity, data = df, family = binomial)
m

## 
## Call:  glm(formula = adherence ~ weekly_activity, family = binomial, 
##     data = df)
## 
## Coefficients:
##     (Intercept)  weekly_activity  
##        -3.86451          0.02835  
## 
## Degrees of Freedom: 29 Total (i.e. Null);  28 Residual
## Null Deviance:       41.59 
## Residual Deviance: 27.65     AIC: 31.65

b = round(coefficients(m), 3)

This model assumes that the probability of adherence changes non-linearly with weekly_activity: the curve is flatter close to 0 and 1, and steepest in the middle range. This ‘trick’ solves the issue of probability estimates beyond [0,1]. It is, however, obvious that this model is non-linear. To linearize, we calculate the odds of adherence, i.e., the ratio of the probability of adherence to the probability of non-adherence. The term (1 - p) represents the probability of the alternative outcome (non-adherence).

\[ \frac{p}{1-p} = \frac{\frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}}{1-\frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}} = e^{\beta_0 + \beta_1 X} \]

If we apply the logarithm to the odds we get the so called linear logit model:

\[ \text{ln}\frac{p}{1-p} = \text{logit}(p) = \text{ln } e^{\beta_0 + \beta_1 X} = \beta_0 + \beta_1 X \]

Applied to our data the linear logit model is estimated as

\[ \text{logit}(p) = -3.865 + 0.028 \cdot \text{weekly_activity} \]

The coefficient for weekly_activity (0.028) is positive, indicating that higher levels of physical activity are associated with a higher probability of adherence. Overall, the model implies a monotonic, non-linear increase in adherence probability with increasing physical activity.

Note that the regression coefficient \(\beta_1\) = 0.028 itself is not a probability. It describes the change in the log-odds of adherence for a one-unit increase in weekly_activity. After exponentiation, it becomes an odds ratio:

\[ e^{\beta_1} = e^{0.028} = 1.028 \]

The odds ratio indicates by which factor the odds of adherence change when weekly_activity increases by one unit. In odds terms, a one-unit increase in weekly activity (i.e., +1 minute) multiplies the odds of adherence by 1.028. This corresponds to an increase of about 2.8% in the odds per additional minute of activity. More meaningful increments are, e.g., a 10-minute [30-minute] increase in weekly activity multiplies the odds of adherence by approximately \(e\)^0.28 \(\approx\) 1.32 [\(e\)^0.84 \(\approx\) 2.32], corresponding to about a 32% [132%] increase in the odds.

It is important to note that this refers to odds, not probabilities. A 32% increase in odds does not translate into a 32% increase in probability!

The intercept (-3.865) represents the log-odds of adherence when weekly_activity = 0. In probability terms this means \[ p = \frac{1}{1 + e^{3.865}} \approx 0.0205325 \] Thus, with zero activity, adherence is essentially close to zero.

The graphical representation of the model on the logit scale appears as a straight line, closely resembling a linear regression. However, a closer inspection shows that the logit model implies a steeper slope than the linear model. Substantively, this representation is of limited interpretive value, as log-odds are not directly intuitive compared to probabilities.

x_seq = seq(min(df$weekly_activity), max(df$weekly_activity), length.out = 100)
logit_hat = predict(m, newdata = data.frame(weekly_activity = x_seq), type = "link")
df_plot = data.frame(weekly_activity = x_seq, logit = logit_hat)

glm +
  geom_line(data = df_plot,
            aes(x = weekly_activity, y = logit),
            color = "darkorange",
            linetype = "solid",
            linewidth = 0) +
  labs(
    x = "Weekly Activity",
    y = "Logit(p) (Log-odds of adherence)"
  )

On odds and odds ratios

Odds ratios are often easier to understand when illustrated with a categorical predictor rather than a metric one. In the categorical case, the odds of an outcome can be directly compared between clearly defined groups (e.g., low-activity vs. high-activity users), making the interpretation more intuitive. This simplified setting helps to introduce the basic logic of odds and odds ratios and to illustrate how they differ from probabilities.

Let’s look at 200 repondents classified into two grous, namely those with low weelky activity and those with high weelky activity.

t = data.frame(c1 = c("no", "yes"), c2 = c(80,20), c3=c(40,60))
names(t) = c("Adherence", "Low activity", "High activity")
kable_styling(kable(t, align = "lrr"), full_width = FALSE)

Adherence	Low activity	High activity
no	80	40
yes	20	60

The odds for adherence rather than non-adherence in the low activity group is

\[ \text{odds}_{low} = \frac{20}{80} = 0.25 \]

The odds for adherence rather than non-adherence in the high activity group are

\[ \text{odds}_{high} = \frac{60}{40} = 1.5 \]

The odds ratio (OR) for adherence for the high activity vs the low activity group are

\[ \text{OR} = \frac{60/40}{20/80} = \frac{1.5}{0.25} = 6 \]

The odds of adherence are 6 times higher for the high-activity group than for the low-activity group, indicating substantially greater adherence among highly active individuals.

This mini-example illustrates that, with categorical predictors, odds ratios can be computed directly from contingency tables because the grouping already exists in the data. With metric predictors, however, there are infinitely many possible comparisons, so we need a model that describes how odds change systematically across the predictor range. This is achieved by modeling the odds of the probabilities,

\[ \frac{p}{1-p} \]

or, more precisely, their logarithm (the logit).

On odds ratios and relative probabilities

Note that in terms of probabilities the numbers change:

The probability for adherence in the low activity group is \[ p(\text{adherence=yes | low activity} )= \frac{20}{80+20}=0.2 \] The probability for adherence in the high activity group is \[ p(\text{adherence=yes | high activity} )= \frac{60}{60+40}=0.6 \] and the relative probability for adherence of high-activity vs. low-activity users is

\[ \text{relative probability for adherence} = \frac{60/(60+40)}{20/(80+20)} = \frac{0.6}{0.2} = 3 \]

Thus, odds ratios are not equivalent to probability ratios often referred to as relative risks.

It is important to note that odds ratios are invariant to the direction of calculation: the odds ratio obtained row-wise is identical to the one obtained column-wise.

\[ \text{OR} = \frac{60/40}{20/80} = \frac{1.5}{0.25} = \frac{60/20}{40/80} = \frac{3}{0.5} = 6 \]

It is evident that this does not hold for probabilities, because \[ \text{relative likelihood for high activity} = \frac{60/(60+20)}{40/(40+80)} = \frac{0.75}{0.33} = 2.25 \ne 3 \] So the relative likelihood is not invariant to the direction of calculation.

As a side note, the invariance of odds ratios has important implications for research design. Unlike relative probabilities, odds ratios remain unchanged when the direction of conditioning is reversed. This is particularly important in case-control studies, where researchers sample individuals based on the outcome rather than the predictor. In such designs, direct probability comparisons are often not meaningful or not estimable, whereas odds ratios still provide a consistent measure of association. Transferred to the current example, this would mean comparing activity levels between adherent and non-adherent individuals instead of examining how adherence depends on activity. This perspective is substantively much less meaningful than the original direction.

The multivariate logistic model

In order to model the (mutually independent) effects of all three predictors, we hypothezise that

Physical activity and adherence to the program are positively associated.
Use of a digital health management app and adherence to the program are positively associated.
Perceived stress and adherence to the program are negatively associated.

the multivariate logistic model estimates are:

# use glm here
m = glm(adherence ~ weekly_activity + app_use + stress_level, data = df, family = binomial)
b = round(coefficients(m), 3)
summary(m)

## 
## Call:
## glm(formula = adherence ~ weekly_activity + app_use + stress_level, 
##     family = binomial, data = df)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)  
## (Intercept)     -1.46997    2.42241  -0.607   0.5440  
## weekly_activity  0.02248    0.01175   1.914   0.0556 .
## app_use          1.92007    1.09928   1.747   0.0807 .
## stress_level    -1.08695    0.92624  -1.174   0.2406  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 41.589  on 29  degrees of freedom
## Residual deviance: 22.558  on 26  degrees of freedom
## AIC: 30.558
## 
## Number of Fisher Scoring iterations: 6

\[ \text{logit}(p) = -1.47 + 0.022 \cdot \text{weekly_activity} + 1.92 \cdot \text{app_use} -1.087 \cdot \text{stress_level} \]

Translated into odds ratios the model reads

# use glm here
b = round(exp(coefficients(m)), 3)

\[ \frac{p}{1-p} = 0.23 + 1.023 \cdot \text{weekly_activity} + 6.821 \cdot \text{app_use} + 0.337 \cdot \text{stress_level} \]

Due to the small sample size, the effects are not significant, though.

In addition to the estimated regression coefficients b, logistic regression reports usually include estimates of odds ratios OR = e^b and their 95% confidence intervals CI(95%) = [low, high]:

b = round(coef(m), 3)
OR = round(exp(coef(m)), 3)
CI = round(exp(confint(m)), 3)
CI = paste0("[",CI[,1], ",", CI[,2], "]")
t = data.frame(cbind(b, OR, CI))
kable_styling(kable(t, align = "rrr"), full_width = FALSE)

	b	OR	CI
(Intercept)	-1.47	0.23	[0.001,18.703]
weekly_activity	0.022	1.023	[1.001,1.051]
app_use	1.92	6.821	[0.852,75.536]
stress_level	-1.087	0.337	[0.036,1.685]

The model shows that adherence is positively associated with physical activity and especially app use, while higher stress substantially reduces the likelihood of adherence. In detail, the odds ratios indicate the direction and strength of association with adherence, while the extreme width of the CIs indicate great uncertainty due to the small sample size.

weekly_activity (OR = 1.023): Each additional minute of activity increases the odds of adherence by about 2.3%. This is a small but cumulative effect (e.g., larger increases over 10–30 minutes become more substantial).
app_use (OR = 6.821): Individuals using the app have nearly 7 times higher odds of adherence compared to non-users. This is a strong positive association.
stress_level (OR = 0.337): Higher stress is associated with substantially lower adherence. Specifically, a one-unit increase in stress reduces the odds by about 66%. This indicates a strong negative association.
Intercept (OR = 0.23): At the reference level (all predictors = 0), the odds of adherence are low. This mainly serves as a baseline and is often of limited substantive interest unless the reference point is meaningful.

Regression coefficients b can take both negative and positive values, whereas odds ratios (ORs), defined as (e^b), are strictly positive. Positive coefficients translate into (OR > 1) (increased odds), while negative coefficients yield (OR< 1) (decreased odds). An OR of 1 indicates no difference in odds, i.e., no effect, since (e⁰ = 1).

Accordingly, at the 5% significance level, an effect is typically considered statistically significant if the corresponding confidence interval for the OR does not include 1.

Metric vs. categorical predictors

We can easily extend the model by replacing metric predictors with dummy variables. This is particularly relevant for variables such as stress_level, which is ordinal and may not satisfy the assumption of a linear effect. Instead of treating stress_level as a numeric variable (implying equal and linear changes between levels), we can represent each category (e.g., 1=low, 2=moderate, 3=high) using dummy variables. This allows the model to estimate separate effects for each level relative to a reference category (the lowest by default).

This step is important because it relaxes the linearity assumption and enables a more flexible and potentially more accurate representation of how stress is associated with adherence.

#df = read.xlsx("./data/AdherenceData.xlsx") 
m = glm(adherence ~ weekly_activity + as.factor(app_use) + as.factor(stress_level), data = df, family = binomial)
#m
b = round(coef(m), 3)
OR = round(exp(coef(m)), 3)
CI = round(exp(confint(m)), 3)
CI = paste0("[",CI[,1], ",", CI[,2], "]")
t = data.frame(cbind(b, OR, CI))
kable_styling(kable(t, align = "rrr"), full_width = FALSE)

	b	OR	CI
(Intercept)	-2.736	0.065	[0,4.866]
weekly_activity	0.023	1.023	[1.001,1.051]
as.factor(app_use)1	1.926	6.864	[0.853,76.812]
as.factor(stress_level)2	-0.881	0.414	[0.003,57.166]
as.factor(stress_level)3	-2.046	0.129	[0.001,6.884]

Note that the results for the binary variable app_use remain unchanged (reference category: 0 = no app use). Users have substantially higher odds of adherence, although the wide confidence interval indicates considerable uncertainty.

For stress_level, the use of dummy variables provides a more differentiated view. Relative to the reference category (low stress, level 1), both medium (level 2) and high stress (level 3) are associated with lower odds of adherence.

Specifically, individuals with high stress have odds that are about 0.129 times those of low-stress individuals (i.e., a reduction of roughly 87%). Compared to medium stress, high stress corresponds to an additional reduction by a factor of about 0.129 / 0.414 ≈ 0.312, indicating substantially lower adherence at the highest stress level.

However, the confidence intervals are very wide and include 1, so these effects are estimated with high uncertainty and should be interpreted cautiously.

Common GLM Examples

GLMs allow us to apply linear modeling techniques to a wide range of outcome types. The key is the link function, which transforms the expected value of the outcome into a continuous scale suitable for a linear predictor. This enables valid modeling even when the raw outcome is binary, count-based, bounded, or non-linear.

Outcome Type	Distribution	Link Function	What the Link Does
Non-linear	Skewed normal	Log, Exp, …	Maps metric values to a linear function
Binary (0/1)	Binomial	Logit	Maps probabilities (0–1) to the real line
Count	Poisson	Log	Maps positive counts to the real line
Proportion	Binomial	Logit or Probit	Models bounded outcomes on a continuous scale
Positive continuous	Gamma	Inverse	Ensures predictions remain strictly positive