The General Linear Model refers to a broad family of models for continuous, normally distributed outcomes. It includes familiar methods such as simple linear regression, multiple regression, ANOVA, and ANCOVA.
The general model equation for an outcome variable \(Y\) modeled by \(m\) variables \(X_{1\ldots m}\) is:
\[ Y = \beta_0 + \beta_1X_1 + \beta_1X_2 + \ldots + \beta_mX_m + \varepsilon \]
where
Fitting the model means calculating the regression parameters \(\beta_{0\ldots m}\) which can then be used to estimate the outcome:
\[ \hat{Y} = Y_{\text{est}} = \beta_0 + \beta_1X_1 + \beta_1X_2 + \ldots + \beta_mX_m \] Note that \(Y\) denotes observed outcome. Estimated outcome is written as \(\hat{Y}\) or \(Y_\text{est}\), you can use these interchangeably. The estimated outcome is equal to the pure model prediciton without the error term.
In case of a single predictor we often write
\[ Y = \alpha + \beta X + \varepsilon \hspace{1em} \text{or}\hspace{1em} \hat{Y} = \alpha + \beta X \]
instead.
Let’s introduce an example to illustrate this concept. In the following, we will work with data collected from 3 schools, specifically Hillcrest High School, Oakwood High School, and St Mary’s Church of England School. For each student, we have recorded the average test scores (ranging from 0% to 100%) and the average number of hours they studied shortly before their exams.
Here is the dataset:
| school_long_name | school | test_score | study_time |
|---|---|---|---|
| Hillcrest High School | Hillcrest | 54.59454 | 13.092146 |
| Hillcrest High School | Hillcrest | 45.90443 | 7.285176 |
| Hillcrest High School | Hillcrest | 46.07368 | 10.068656 |
| Hillcrest High School | Hillcrest | 53.15568 | 10.877858 |
| Hillcrest High School | Hillcrest | 51.89308 | 10.192075 |
| Hillcrest High School | Hillcrest | 45.88416 | 8.660897 |
| Hillcrest High School | Hillcrest | 49.43284 | 13.513836 |
| Hillcrest High School | Hillcrest | 49.04923 | 8.695293 |
| Hillcrest High School | Hillcrest | 55.00619 | 15.034541 |
| Hillcrest High School | Hillcrest | 49.30719 | 8.791128 |
| Hillcrest High School | Hillcrest | 49.59065 | 12.893879 |
| Hillcrest High School | Hillcrest | 54.12954 | 15.839206 |
| Hillcrest High School | Hillcrest | 47.03255 | 4.812688 |
| Hillcrest High School | Hillcrest | 48.98281 | 8.142904 |
| Hillcrest High School | Hillcrest | 50.05463 | 8.579306 |
| Hillcrest High School | Hillcrest | 56.51023 | 10.887121 |
| Hillcrest High School | Hillcrest | 48.13828 | 8.126511 |
| Hillcrest High School | Hillcrest | 33.20910 | 1.009904 |
| Hillcrest High School | Hillcrest | 40.99077 | 1.657870 |
| Hillcrest High School | Hillcrest | 57.04151 | 12.939610 |
| Oakwood High School | Oakwood | 44.29733 | 8.059354 |
| Oakwood High School | Oakwood | 25.14011 | 3.635345 |
| Oakwood High School | Oakwood | 35.47447 | 8.463518 |
| Oakwood High School | Oakwood | 45.12912 | 12.623294 |
| Oakwood High School | Oakwood | 48.16753 | 14.664851 |
| Oakwood High School | Oakwood | 35.43849 | 7.687863 |
| Oakwood High School | Oakwood | 37.85879 | 8.207462 |
| Oakwood High School | Oakwood | 25.77590 | 3.689781 |
| Oakwood High School | Oakwood | 48.83004 | 10.359562 |
| Oakwood High School | Oakwood | 36.44190 | 7.059286 |
| Oakwood High School | Oakwood | 42.43892 | 10.345621 |
| Oakwood High School | Oakwood | 45.67442 | 11.093782 |
| Oakwood High School | Oakwood | 46.28141 | 12.084581 |
| Oakwood High School | Oakwood | 35.90872 | 7.152491 |
| Oakwood High School | Oakwood | 47.41863 | 10.494136 |
| Oakwood High School | Oakwood | 29.00540 | 3.828244 |
| Oakwood High School | Oakwood | 31.40098 | 6.625893 |
| Oakwood High School | Oakwood | 36.01010 | 6.426547 |
| Oakwood High School | Oakwood | 24.41876 | 1.736647 |
| Oakwood High School | Oakwood | 38.60989 | 9.087638 |
| Oakwood High School | Oakwood | 37.99014 | 9.597266 |
| Oakwood High School | Oakwood | 39.07729 | 7.896098 |
| Oakwood High School | Oakwood | 43.98547 | 11.253760 |
| St Mary’s Church of England School | St Mary’s | 42.94447 | 6.799156 |
| St Mary’s Church of England School | St Mary’s | 34.92019 | 4.874427 |
| St Mary’s Church of England School | St Mary’s | 49.85766 | 10.277724 |
| St Mary’s Church of England School | St Mary’s | 44.38598 | 6.545091 |
| St Mary’s Church of England School | St Mary’s | 60.11943 | 13.311574 |
| St Mary’s Church of England School | St Mary’s | 46.89865 | 7.684932 |
| St Mary’s Church of England School | St Mary’s | 50.23389 | 10.946214 |
| St Mary’s Church of England School | St Mary’s | 49.43679 | 9.945046 |
| St Mary’s Church of England School | St Mary’s | 40.07521 | 6.627753 |
| St Mary’s Church of England School | St Mary’s | 60.94646 | 13.706453 |
| St Mary’s Church of England School | St Mary’s | 57.77003 | 10.907968 |
| St Mary’s Church of England School | St Mary’s | 48.67736 | 9.248552 |
| St Mary’s Church of England School | St Mary’s | 51.15944 | 9.808922 |
| St Mary’s Church of England School | St Mary’s | 51.22429 | 11.017137 |
| St Mary’s Church of England School | St Mary’s | 46.55865 | 9.248769 |
| St Mary’s Church of England School | St Mary’s | 23.99421 | 0.000000 |
| St Mary’s Church of England School | St Mary’s | 54.27720 | 9.833919 |
| St Mary’s Church of England School | St Mary’s | 48.37929 | 7.877566 |
| Hillcrest High School | Hillcrest | 45.30004 | 9.534962 |
| Hillcrest High School | Hillcrest | 57.01957 | 10.724741 |
| Hillcrest High School | Hillcrest | 56.86480 | 13.178481 |
| Hillcrest High School | Hillcrest | 46.07672 | 6.797394 |
| Hillcrest High School | Hillcrest | 55.57510 | 12.886898 |
| Hillcrest High School | Hillcrest | 47.07002 | 9.986815 |
| Hillcrest High School | Hillcrest | 49.22528 | 12.094788 |
| Hillcrest High School | Hillcrest | 52.23690 | 11.741456 |
| Hillcrest High School | Hillcrest | 47.77480 | 11.141905 |
| Hillcrest High School | Hillcrest | 45.58995 | 5.849913 |
| Hillcrest High School | Hillcrest | 52.34357 | 8.708711 |
| Hillcrest High School | Hillcrest | 47.91817 | 10.849825 |
| Hillcrest High School | Hillcrest | 43.12712 | 6.118700 |
| Hillcrest High School | Hillcrest | 46.96063 | 7.350784 |
| Hillcrest High School | Hillcrest | 47.81392 | 10.722260 |
| Hillcrest High School | Hillcrest | 50.39072 | 11.283807 |
| Hillcrest High School | Hillcrest | 53.06295 | 10.370573 |
| Hillcrest High School | Hillcrest | 48.49496 | 6.321941 |
| Hillcrest High School | Hillcrest | 45.96745 | 5.679928 |
| Hillcrest High School | Hillcrest | 48.66620 | 13.517391 |
| Oakwood High School | Oakwood | 41.34408 | 9.753035 |
| Oakwood High School | Oakwood | 43.68350 | 9.244591 |
| Oakwood High School | Oakwood | 41.67275 | 8.616581 |
| Oakwood High School | Oakwood | 32.22012 | 5.396284 |
| Oakwood High School | Oakwood | 35.53073 | 10.815261 |
| Oakwood High School | Oakwood | 38.83860 | 8.327851 |
| Oakwood High School | Oakwood | 40.58326 | 8.431000 |
| Oakwood High School | Oakwood | 45.69603 | 11.779309 |
| Oakwood High School | Oakwood | 45.36142 | 11.444590 |
| Oakwood High School | Oakwood | 49.60593 | 13.155619 |
| Oakwood High School | Oakwood | 35.91185 | 7.550749 |
| Oakwood High School | Oakwood | 47.79057 | 10.930316 |
| Oakwood High School | Oakwood | 49.71638 | 13.152602 |
| Oakwood High School | Oakwood | 29.56580 | 5.646904 |
| Oakwood High School | Oakwood | 38.93851 | 6.396892 |
| Oakwood High School | Oakwood | 36.78149 | 5.584054 |
| Oakwood High School | Oakwood | 33.67548 | 4.601628 |
| Oakwood High School | Oakwood | 35.45055 | 9.219218 |
| Oakwood High School | Oakwood | 42.16985 | 10.938883 |
| Oakwood High School | Oakwood | 49.07087 | 12.582166 |
| Oakwood High School | Oakwood | 46.35821 | 12.113523 |
| Oakwood High School | Oakwood | 34.98333 | 5.969644 |
| Oakwood High School | Oakwood | 58.42821 | 14.524716 |
| St Mary’s Church of England School | St Mary’s | 39.48171 | 6.978950 |
| St Mary’s Church of England School | St Mary’s | 42.54783 | 9.295812 |
| St Mary’s Church of England School | St Mary’s | 44.95859 | 7.712503 |
| St Mary’s Church of England School | St Mary’s | 44.77387 | 8.612220 |
| St Mary’s Church of England School | St Mary’s | 50.96967 | 9.543849 |
| St Mary’s Church of England School | St Mary’s | 46.57311 | 9.336753 |
| St Mary’s Church of England School | St Mary’s | 54.34814 | 8.903993 |
| St Mary’s Church of England School | St Mary’s | 48.53935 | 9.303488 |
| St Mary’s Church of England School | St Mary’s | 42.13689 | 7.522964 |
| St Mary’s Church of England School | St Mary’s | 42.90107 | 7.466619 |
| St Mary’s Church of England School | St Mary’s | 35.57561 | 3.995973 |
| St Mary’s Church of England School | St Mary’s | 44.78883 | 7.832269 |
| St Mary’s Church of England School | St Mary’s | 38.44711 | 7.441319 |
| St Mary’s Church of England School | St Mary’s | 72.24097 | 17.084943 |
| St Mary’s Church of England School | St Mary’s | 37.95949 | 4.892922 |
| St Mary’s Church of England School | St Mary’s | 49.29009 | 9.391039 |
| St Mary’s Church of England School | St Mary’s | 36.70040 | 4.498395 |
| St Mary’s Church of England School | St Mary’s | 34.26447 | 4.567963 |
| Hillcrest High School | Hillcrest | 49.05039 | 9.353377 |
| Hillcrest High School | Hillcrest | 46.63433 | 5.989353 |
| Hillcrest High School | Hillcrest | 51.74739 | 8.973802 |
| Hillcrest High School | Hillcrest | 43.49421 | 7.694494 |
| Hillcrest High School | Hillcrest | 46.46544 | 7.138255 |
| Hillcrest High School | Hillcrest | 41.27340 | 2.905237 |
| Hillcrest High School | Hillcrest | 42.08927 | 5.305026 |
| Hillcrest High School | Hillcrest | 46.31831 | 9.517820 |
| Hillcrest High School | Hillcrest | 48.92636 | 10.682132 |
| Hillcrest High School | Hillcrest | 48.76119 | 7.500638 |
| Hillcrest High School | Hillcrest | 47.52638 | 8.979459 |
| Hillcrest High School | Hillcrest | 50.46286 | 12.347939 |
| Hillcrest High School | Hillcrest | 54.44885 | 13.298837 |
| Hillcrest High School | Hillcrest | 41.11537 | 5.687929 |
| Hillcrest High School | Hillcrest | 51.49040 | 8.627312 |
| Hillcrest High School | Hillcrest | 51.64673 | 12.583766 |
| Hillcrest High School | Hillcrest | 47.65148 | 7.570081 |
| Hillcrest High School | Hillcrest | 48.73925 | 8.821862 |
| Hillcrest High School | Hillcrest | 48.46442 | 8.720948 |
| Hillcrest High School | Hillcrest | 51.00763 | 6.316233 |
| Oakwood High School | Oakwood | 34.81445 | 7.645218 |
| Oakwood High School | Oakwood | 43.21828 | 8.890936 |
| Oakwood High School | Oakwood | 37.57005 | 7.737664 |
| Oakwood High School | Oakwood | 44.13324 | 12.319428 |
| Oakwood High School | Oakwood | 36.86629 | 7.536292 |
| Oakwood High School | Oakwood | 39.59984 | 7.679763 |
| Oakwood High School | Oakwood | 42.86316 | 11.069858 |
| Oakwood High School | Oakwood | 31.30408 | 5.810165 |
| Oakwood High School | Oakwood | 40.17264 | 8.857175 |
| Oakwood High School | Oakwood | 33.61506 | 4.324636 |
| Oakwood High School | Oakwood | 46.74118 | 12.480779 |
| Oakwood High School | Oakwood | 34.15559 | 8.158333 |
| Oakwood High School | Oakwood | 33.23144 | 7.575734 |
| Oakwood High School | Oakwood | 30.22384 | 5.264513 |
| Oakwood High School | Oakwood | 38.33064 | 8.955984 |
| Oakwood High School | Oakwood | 35.09257 | 6.578424 |
| Oakwood High School | Oakwood | 38.76908 | 7.378793 |
| Oakwood High School | Oakwood | 46.38074 | 12.842296 |
| Oakwood High School | Oakwood | 35.56375 | 8.452693 |
| Oakwood High School | Oakwood | 35.34916 | 5.763923 |
| Oakwood High School | Oakwood | 41.76415 | 9.468891 |
| Oakwood High School | Oakwood | 41.25415 | 7.891055 |
| Oakwood High School | Oakwood | 38.45118 | 10.749311 |
| St Mary’s Church of England School | St Mary’s | 61.09157 | 13.276536 |
| St Mary’s Church of England School | St Mary’s | 43.06366 | 6.001193 |
| St Mary’s Church of England School | St Mary’s | 54.20218 | 10.343221 |
| St Mary’s Church of England School | St Mary’s | 46.20423 | 9.233964 |
| St Mary’s Church of England School | St Mary’s | 58.19549 | 11.665967 |
| St Mary’s Church of England School | St Mary’s | 43.25403 | 8.289936 |
| St Mary’s Church of England School | St Mary’s | 54.10719 | 11.489128 |
| St Mary’s Church of England School | St Mary’s | 35.79491 | 3.744103 |
| St Mary’s Church of England School | St Mary’s | 62.27251 | 14.047647 |
| St Mary’s Church of England School | St Mary’s | 58.20645 | 11.573604 |
| St Mary’s Church of England School | St Mary’s | 45.70714 | 8.526942 |
| St Mary’s Church of England School | St Mary’s | 30.16766 | 4.632249 |
| St Mary’s Church of England School | St Mary’s | 51.17844 | 10.908296 |
| St Mary’s Church of England School | St Mary’s | 49.02100 | 10.428852 |
| St Mary’s Church of England School | St Mary’s | 46.42774 | 8.960203 |
| St Mary’s Church of England School | St Mary’s | 48.29198 | 9.433638 |
| St Mary’s Church of England School | St Mary’s | 42.22076 | 7.226943 |
| St Mary’s Church of England School | St Mary’s | 50.52733 | 10.085690 |
We can now model how students’ test scores depend on the length of study. As a first step we estimate a simple linear regression model:
# Fit a simple linear regression model (fixed intercept, fixed slope model)
model = lm(test_score ~ study_time, data = df)
# show mode summary
summary(model)
##
## Call:
## lm(formula = test_score ~ study_time, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.1068 -4.5792 0.5577 4.1368 11.4546
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.7993 1.2427 21.57 <2e-16 ***
## study_time 2.0192 0.1326 15.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.266 on 181 degrees of freedom
## Multiple R-squared: 0.5616, Adjusted R-squared: 0.5592
## F-statistic: 231.9 on 1 and 181 DF, p-value: < 2.2e-16
#model
With the formula
formula(model)
## test_score ~ study_time
we tell R to fit a regression line for the point cloud defined by study_time (as single continuous predictor) and test_score (as continuous output). We see that the resulting basic regression parameters
# Collect "classic" regression parameter
coef(model)
## (Intercept) study_time
## 26.799339 2.019198
are significantly different from zero and that the residuals are roughly symmetrically distributed around 0, which is a good sign. The test score \(Score_\text{est}\) is estimated by the regression equation
\[ Test\,\,Score_\text{est} = 26.799 + 2.019\,Study\,\,Time\,(\text{hours}) \]
According to this model, each additional minute of study increases the test score by 2.019 units. In case students do not study at all their test score is estimated 26.799 on average.
The model output is visualized as regression line, i.e. the closest linear (“straight”) approximation of all dots in the scatter plot (point cloud) where each dot represents a single student.
basic_ggplot = ggplot(df, aes(x = study_time, y = test_score)) +
geom_point(alpha = alpha_dots, color="#678") +
theme_minimal() + theme(panel.grid = element_blank()) +
labs(x = "Study Time", y = "Score")
# Assign intercept and slope to variables
intercept = coef(model)[1]
slope = coef(model)[2]
basic_ggplot +
labs(title = "Association of student's study time and scoring") +
geom_abline(intercept = intercept, slope = slope, color = "#345", linetype = "longdash")
In order to have valid estimates the error term (residuals) must be normally distributed. As seen in the model output above, the residuals are approximately symmetrically distributed. We can, of course, calculate the residual variable directly to show its distribution in more detail:
# get residual variable
test_score_resid = resid(model)
# show how residual are distributed
ggplot(df, aes(x = test_score_resid)) +
geom_histogram(binwidth = 5, fill = "#678", color = "white", alpha=.6) +
theme_minimal() + theme(panel.grid = element_blank()) +
labs(title = "Histogram of residuals", x = "Score residual", y = "Count")
Residuals are slightly left skewed, which can be ignored in practice.
In brief, the classical model
Generalized Linear Models (GLMs) extend traditional linear models to handle non-normal outcome variables, such as binary, count, proportion, or skewed continuous data. They provide a flexible framework that combines linear modeling with appropriate transformations and error distributions.
GLMs build upon a link function, i.e., a transformation that links the expected outcome \(Y\) to the linear predictor:
The general model equation for a GLM is:
\[ g(Y) = \beta_0 + \beta_1 X_1 + \ldots + \beta_m X_m + \varepsilon \]
This equation links the expected value of the outcome \(Y\) to a linear combination of predictors through the link function \(g(\ldots)\).
In the following example we analyze how salaries (€) in an organization depend on years of employment and gender. Data is collected from 200 employees, specifically 100 female and 100 male employees.
Here is the dataset:
| years_empl | salary | gender |
|---|---|---|
| 27.4441813 | 274372.94 | Male |
| 28.1122624 | 272584.60 | Male |
| 8.5841860 | 83252.41 | Male |
| 24.9134288 | 229785.25 | Male |
| 19.2523656 | 141310.60 | Male |
| 15.5728785 | 108422.85 | Male |
| 22.0976494 | 185929.98 | Male |
| 4.0399979 | 42793.74 | Male |
| 19.7097687 | 100284.31 | Male |
| 21.1519435 | 167208.49 | Male |
| 13.7322533 | 84488.60 | Male |
| 21.5733675 | 171300.54 | Male |
| 28.0401674 | 291434.29 | Male |
| 7.6628647 | 76376.50 | Male |
| 13.8687847 | 80076.12 | Male |
| 28.2004357 | 305893.12 | Male |
| 29.3467929 | 318895.77 | Male |
| 3.5246208 | 55349.75 | Male |
| 14.2499124 | 107613.32 | Male |
| 16.8099824 | 125935.58 | Male |
| 27.1209416 | 247016.44 | Male |
| 4.1613050 | 40497.62 | Male |
| 29.6667519 | 331348.26 | Male |
| 28.4000470 | 276661.61 | Male |
| 2.4731267 | 31579.04 | Male |
| 15.4263535 | 111774.36 | Male |
| 11.7061040 | 88052.90 | Male |
| 27.1721439 | 270697.87 | Male |
| 13.4090888 | 74413.58 | Male |
| 25.0801278 | 206600.50 | Male |
| 22.1278685 | 198856.63 | Male |
| 24.3316542 | 213999.52 | Male |
| 11.6432485 | 77472.96 | Male |
| 20.5550919 | 153524.76 | Male |
| 0.1184502 | 47629.30 | Male |
| 24.9874824 | 230629.76 | Male |
| 0.2200244 | 32724.36 | Male |
| 6.2297692 | 46640.30 | Male |
| 27.1980422 | 278288.55 | Male |
| 18.3533593 | 142577.96 | Male |
| 11.3867772 | 95481.67 | Male |
| 13.0731475 | 78232.04 | Male |
| 1.1229310 | 42575.02 | Male |
| 29.2061974 | 331214.32 | Male |
| 12.9525375 | 67893.01 | Male |
| 28.7272979 | 285770.63 | Male |
| 26.6326472 | 235624.41 | Male |
| 19.1993631 | 117483.76 | Male |
| 29.1289983 | 309636.63 | Male |
| 18.5651462 | 142275.07 | Male |
| 10.0028163 | 84795.75 | Male |
| 10.4024474 | 84622.06 | Male |
| 11.9545623 | 63018.48 | Male |
| 23.5407833 | 224974.88 | Male |
| 1.1680947 | 37063.01 | Male |
| 22.4638616 | 182548.19 | Male |
| 20.3183049 | 146089.50 | Male |
| 5.1379299 | 43416.06 | Male |
| 7.8326389 | 58960.65 | Male |
| 15.4323880 | 104896.59 | Male |
| 20.2682182 | 151437.42 | Male |
| 29.4845159 | 318956.30 | Male |
| 22.7863280 | 178413.12 | Male |
| 16.9946527 | 109272.55 | Male |
| 25.4906916 | 205630.04 | Male |
| 5.6842181 | 41537.78 | Male |
| 8.1385984 | 49839.02 | Male |
| 24.8447546 | 259463.99 | Male |
| 20.7961446 | 137931.11 | Male |
| 7.2163422 | 55495.92 | Male |
| 1.2896639 | 49143.88 | Male |
| 4.2143728 | 40028.07 | Male |
| 6.4915625 | 52297.32 | Male |
| 14.3819569 | 79848.95 | Male |
| 5.9223103 | 48154.50 | Male |
| 21.5806751 | 162196.75 | Male |
| 0.2365422 | 38631.97 | Male |
| 11.2646989 | 43504.75 | Male |
| 15.4322312 | 84736.67 | Male |
| 0.0471166 | 32806.04 | Male |
| 17.4481201 | 129666.43 | Male |
| 4.7371562 | 36430.33 | Male |
| 10.7708492 | 71014.11 | Male |
| 19.3689564 | 158119.13 | Male |
| 23.2747009 | 214691.13 | Male |
| 16.9094052 | 99585.02 | Male |
| 7.0111020 | 50807.05 | Male |
| 2.6994155 | 55253.81 | Male |
| 2.5683619 | 30202.92 | Male |
| 9.1565511 | 61622.70 | Male |
| 20.0227954 | 147570.59 | Male |
| 0.0071669 | 43297.98 | Male |
| 6.2570987 | 42819.48 | Male |
| 27.9910238 | 281155.98 | Male |
| 27.7693425 | 270439.65 | Male |
| 22.0228290 | 191392.66 | Male |
| 9.9921595 | 59509.47 | Male |
| 15.4518999 | 96772.72 | Male |
| 22.3192394 | 189336.76 | Male |
| 18.5747772 | 116733.59 | Male |
| 27.4441813 | 204242.00 | Female |
| 28.1122624 | 191386.91 | Female |
| 8.5841860 | 72219.94 | Female |
| 24.9134288 | 167490.37 | Female |
| 19.2523656 | 108435.26 | Female |
| 15.5728785 | 70663.49 | Female |
| 22.0976494 | 140781.09 | Female |
| 4.0399979 | 32199.05 | Female |
| 19.7097687 | 111206.97 | Female |
| 21.1519435 | 151187.35 | Female |
| 13.7322533 | 75816.61 | Female |
| 21.5733675 | 119743.62 | Female |
| 28.0401674 | 216027.60 | Female |
| 7.6628647 | 45854.16 | Female |
| 13.8687847 | 88053.06 | Female |
| 28.2004357 | 237475.42 | Female |
| 29.3467929 | 219145.13 | Female |
| 3.5246208 | 45214.51 | Female |
| 14.2499124 | 82617.81 | Female |
| 16.8099824 | 110742.94 | Female |
| 27.1209416 | 196822.69 | Female |
| 4.1613050 | 52693.92 | Female |
| 29.6667519 | 213160.54 | Female |
| 28.4000470 | 244370.12 | Female |
| 2.4731267 | 48641.89 | Female |
| 15.4263535 | 86065.03 | Female |
| 11.7061040 | 46340.89 | Female |
| 27.1721439 | 210633.58 | Female |
| 13.4090888 | 83942.68 | Female |
| 25.0801278 | 173513.78 | Female |
| 22.1278685 | 143467.72 | Female |
| 24.3316542 | 155985.73 | Female |
| 11.6432485 | 73309.23 | Female |
| 20.5550919 | 130895.97 | Female |
| 0.1184502 | 33939.11 | Female |
| 24.9874824 | 152443.33 | Female |
| 0.2200244 | 40976.86 | Female |
| 6.2297692 | 54712.10 | Female |
| 27.1980422 | 188808.55 | Female |
| 18.3533593 | 84492.50 | Female |
| 11.3867772 | 69645.56 | Female |
| 13.0731475 | 69735.95 | Female |
| 1.1229310 | 36242.49 | Female |
| 29.2061974 | 212333.47 | Female |
| 12.9525375 | 59894.91 | Female |
| 28.7272979 | 240390.18 | Female |
| 26.6326472 | 199596.33 | Female |
| 19.1993631 | 123822.69 | Female |
| 29.1289983 | 257722.99 | Female |
| 18.5651462 | 111962.82 | Female |
| 10.0028163 | 30410.55 | Female |
| 10.4024474 | 67145.34 | Female |
| 11.9545623 | 86840.21 | Female |
| 23.5407833 | 186767.75 | Female |
| 1.1680947 | 48096.85 | Female |
| 22.4638616 | 127293.27 | Female |
| 20.3183049 | 113809.76 | Female |
| 5.1379299 | 32825.78 | Female |
| 7.8326389 | 42222.32 | Female |
| 15.4323880 | 85583.32 | Female |
| 20.2682182 | 105943.37 | Female |
| 29.4845159 | 266857.26 | Female |
| 22.7863280 | 149472.85 | Female |
| 16.9946527 | 97313.91 | Female |
| 25.4906916 | 186105.23 | Female |
| 5.6842181 | 45221.87 | Female |
| 8.1385984 | 51050.88 | Female |
| 24.8447546 | 192923.96 | Female |
| 20.7961446 | 125372.93 | Female |
| 7.2163422 | 30462.69 | Female |
| 1.2896639 | 38619.32 | Female |
| 4.2143728 | 35021.34 | Female |
| 6.4915625 | 39430.34 | Female |
| 14.3819569 | 66077.73 | Female |
| 5.9223103 | 51836.70 | Female |
| 21.5806751 | 133279.58 | Female |
| 0.2365422 | 38235.89 | Female |
| 11.2646989 | 62489.24 | Female |
| 15.4322312 | 78485.47 | Female |
| 0.0471166 | 48852.66 | Female |
| 17.4481201 | 97678.32 | Female |
| 4.7371562 | 56015.18 | Female |
| 10.7708492 | 45738.23 | Female |
| 19.3689564 | 109407.30 | Female |
| 23.2747009 | 148957.97 | Female |
| 16.9094052 | 92124.57 | Female |
| 7.0111020 | 69238.16 | Female |
| 2.6994155 | 35898.28 | Female |
| 2.5683619 | 39572.20 | Female |
| 9.1565511 | 42813.41 | Female |
| 20.0227954 | 110912.02 | Female |
| 0.0071669 | 44986.09 | Female |
| 6.2570987 | 65365.23 | Female |
| 27.9910238 | 231578.99 | Female |
| 27.7693425 | 188859.09 | Female |
| 22.0228290 | 170910.92 | Female |
| 9.9921595 | 75632.53 | Female |
| 15.4518999 | 88084.00 | Female |
| 22.3192394 | 153654.18 | Female |
| 18.5747772 | 95533.93 | Female |
The scatterplot shows a clear positive association: The longer employees are employed, the higher salaries they earn:
p = ggplot(df, aes(x = years_empl, y = salary, color = gender)) +
geom_point(alpha = .6) +
theme_minimal() +
scale_x_continuous(name="Years of employment", labels = scales::comma) +
scale_y_continuous(name="Salary (€)", labels = scales::comma) +
scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender")
#scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender")
p + theme(legend.position="none")
We can fit a simple linear regression model to model this association:
linear_model = lm(salary ~ years_empl, data = df)
summary(linear_model)
##
## Call:
## lm(formula = salary ~ years_empl, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58615 -27281 -3463 19327 101896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2684.3 4717.3 -0.569 0.57
## years_empl 7943.6 260.2 30.535 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33160 on 198 degrees of freedom
## Multiple R-squared: 0.8248, Adjusted R-squared: 0.8239
## F-statistic: 932.4 on 1 and 198 DF, p-value: < 2.2e-16
#linear_model
p + theme(legend.position="none") +
geom_abline(intercept = coef(linear_model)[1], slope = coef(linear_model)[2],
color = "#345", alpha = 1, linetype = "longdash")
We are particularly interested in the differences of salaries between gender. A simple comparison of average salaries reveals that female have lower salaries on average:
tapply(df$salary, df$gender, mean)
## Female Male
## 109140.8 135466.1
salary_female = tapply(df$salary, df$gender, mean)["Female"]
salary_male = tapply(df$salary, df$gender, mean)["Male"]
By adding gender as dummy variable into our linear model, we come to the conclusion, that salaries increase by 7943.61 € per year and that male salaries exceed female salaries by NA € on average, independent of employment time.
linear_model2 = lm(salary ~ years_empl+gender, data = df)
summary(linear_model2)
##
## Call:
## lm(formula = salary ~ years_empl + gender, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66761 -21296 -6514 20747 88733
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15846.9 4842.8 -3.272 0.00126 **
## years_empl 7943.6 239.2 33.215 < 2e-16 ***
## genderMale 26325.4 4311.0 6.107 5.33e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30480 on 197 degrees of freedom
## Multiple R-squared: 0.8527, Adjusted R-squared: 0.8512
## F-statistic: 570.3 on 2 and 197 DF, p-value: < 2.2e-16
#linear_model2
p = p + theme(legend.position.inside=c(.28,.63)) +
scale_color_manual(values = gender_colors, name = "Gender")
p +
geom_abline(intercept = coef(linear_model)[1], slope = coef(linear_model)[2],
color = "#666", alpha = 1, linetype = "dotted") +
geom_abline(intercept = coef(linear_model2)[1], slope = coef(linear_model2)[2],
color = gender_colors[1], alpha = 1, linetype = "longdash") +
geom_abline(intercept = coef(linear_model2)[1]+coef(linear_model2)[3], slope = coef(linear_model2)[2],
color = gender_colors[2], alpha = 1, linetype = "longdash")
It is, however, clearly visible that the association between years of employment and salaries is not linear. A linear model, thus, is not appropriate, and our interpretation is flawed. In order to fit a linear model, we first have to transform our dependent variable so that the association becomes linear.
When looking at selected data points it becomes obvious that the association is exponential in nature:
| Years of Employment | Salary | Salary | log2(Salary) | |||
|---|---|---|---|---|---|---|
| 0 | 32.768 | 215 | 15 | |||
| 10 | +10 | 65.536 | x2 | 216 | 16 | +1 |
| 20 | +10 | 131.072 | x2 | 217 | 17 | +1 |
| 30 | +10 | 262.144 | x2 | 218 | 18 | +1 |
p + theme(legend.position="none") +
scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender") +
geom_point(x=0, y=2**15, color="#f52", size=5, alpha=.5) +
geom_point(x=10, y=2**16, color="#f52", size=5, alpha=.5) +
geom_point(x=20, y=2**17, color="#f52", size=5, alpha=.5) +
geom_point(x=30, y=2**18, color="#f52", size=5, alpha=.5)
Therefore, we use a logarithmic transformation to linearize the association. Instead of fitting
\[ Salary = \beta_0 + \beta_1Years + \varepsilon \]
we choose \(\text{log}_2(Y)\) as dependent variable:
\[ \text{log}_2(Salary) = \beta_0 + \beta_1Years + \varepsilon \]
log_model = lm(log2(salary) ~ years_empl, data = df)
summary(log_model)
##
## Call:
## lm(formula = log2(salary) ~ years_empl, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1115 -0.1760 -0.0016 0.2198 0.5921
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.979177 0.039675 377.54 <2e-16 ***
## years_empl 0.102428 0.002188 46.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2789 on 198 degrees of freedom
## Multiple R-squared: 0.9171, Adjusted R-squared: 0.9167
## F-statistic: 2191 on 1 and 198 DF, p-value: < 2.2e-16
#log_model
beta0 = round(coef(log_model)[1],3)
beta1 = round(coef(log_model)[2],3)
pl = ggplot(df, aes(x = years_empl, y = log2(salary), color = gender)) +
geom_point(alpha = .6) +
theme_minimal() +
scale_x_continuous(name="Years of employment", labels = scales::comma) +
scale_y_continuous(name="Log2(Salary)", labels = scales::comma) +
scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender")
#scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender")
pl + theme(legend.position="none") +
geom_abline(intercept = coef(log_model)[1], slope = coef(log_model)[2],
color = "#345", alpha = 1, linetype = "longdash")
According to this model, the log2 of Salary increases by 0.102 each year :
\[ \text{log}_2(Salary) = 14.979 + 0.102 Years \]
As this description is a little hard to understand we have to get the model parameter and model predictions back on original scale. Raising the predictive form of the regression equation
\[ \text{log}_2(Salary_\text{est}) = \beta_0 + \beta_1Years \]
to the power of 2 we get
\[ Salary_\text{est} = 2^{(\beta_0 + \beta_1Years)} = 2^{\beta_0} \cdot 2^{\beta_1Years} \] Now we have a multiplicative (rather than additive) model indicating the underlying non-linear structure .
\[ Salary_\text{est} = 2^{14.979} \cdot 2^{0.102\cdot Years} \] Because we applied a logarithmic transformation, we can now interpret the model parameters as growth rates. The model predictions are as follows:
df$log_pred = 2**(beta0 + beta1*df$years_empl)
p +
geom_abline(intercept = coef(linear_model)[1], slope = coef(linear_model)[2],
color = "#345", alpha = 1, linetype = "dotted") +
geom_line(aes(x=years_empl,
y=2**(coef(log_model)[1] + coef(log_model)[2]*years_empl)),
color = "#345", alpha = 1, linetype = "longdash")
But if we look at the gender differences in more detail it appears as if gender do not differ by intercepts only but have different growth rates. This can only be modeled by fitting separate models for male and female.
# adjust for Male and Female
log_model_male = lm(log2(salary) ~ years_empl, data = df[df$gender=="Male",])
summary(log_model_male)
##
## Call:
## lm(formula = log2(salary) ~ years_empl, data = df[df$gender ==
## "Male", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8088 -0.1247 0.0048 0.1004 0.5500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.97655 0.04442 337.15 <2e-16 ***
## years_empl 0.11018 0.00245 44.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2208 on 98 degrees of freedom
## Multiple R-squared: 0.9538, Adjusted R-squared: 0.9533
## F-statistic: 2023 on 1 and 98 DF, p-value: < 2.2e-16
#log_model_male
log_model_female = lm(log2(salary) ~ years_empl, data = df[df$gender=="Female",])
summary(log_model_female)
##
## Call:
## lm(formula = log2(salary) ~ years_empl, data = df[df$gender ==
## "Female", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.03654 -0.11005 0.02057 0.15374 0.58988
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.981809 0.052984 282.8 <2e-16 ***
## years_empl 0.094675 0.002922 32.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2634 on 98 degrees of freedom
## Multiple R-squared: 0.9146, Adjusted R-squared: 0.9138
## F-statistic: 1050 on 1 and 98 DF, p-value: < 2.2e-16
#log_model_female
# log_model_dummy = lm(log2(salary) ~ years_empl + gender, data = df)
# summary(log_model_dummy)
# log_model_male
# log_model_female
# log_model_dummy
betam0 = round(coef(log_model_male)[1],3)
betam1 = round(coef(log_model_male)[2],3)
betaf0 = round(coef(log_model_female)[1],3)
betaf1 = round(coef(log_model_female)[2],3)
p +
geom_line(aes(x=years_empl,
y=2**(coef(log_model_male)[1] + coef(log_model_male)[2]*years_empl)),
color = "steelblue", alpha = 1, linetype = "longdash") +
geom_line(aes(x=years_empl,
y=2**(coef(log_model_female)[1] + coef(log_model_female)[2]*years_empl)),
color = "darkorange", alpha = 1, linetype = "longdash")
Now it becomes obvious, that gender salaries differ by their growth rates rather than being dependent on starting salaries.
Female \[ Salary_\text{est Female} = 2^{14.982} \cdot 2^{0.095\cdot Years} \] Male \[ Salary_\text{est Male} = 2^{14.977} \cdot 2^{0.11\cdot Years} \] While starting salaries are more or less equal between gender (female: \(2^{14.982} = 32361.71\)€, male: \(2^{14.977} = 32249.74\)€), the growth rates differ significantly (female: \(2^{0.095} = 6.81\)%, male: \(2^{0.11} = 7.92\)%).
In the terminology of GLM the logarithm log2() served as link function \(g\)() that allowed us to linearize the association. Interpretation of model output, though, is much easier if output is transformed back into real measures.
df = read.xlsx("./data/AdherenceData.xlsx")
Regression analysis requires a metric outcome variable. If we aim at modeling a binary (or categorical) outcome variable we have to choose a link function \(g\)() that transforms the binary outcome into a metric.
The following example models participation in, or adherence to, a preventive program as part of the health management offers in a company. Variables in the data set are:
adherence — whether a patient adheres to the preventive
program (0 = no, 1 = yes)app_use — use of a digital health management app (0 =
no, 1 = yes)stress_level — perceived stress (1 = low, 2 = moderate,
3 = high)weekly_activity — minutes of physical activity per
weekThe data set comprises a binary outcome (adherence) and
binary (app_use), multitom (stress_level), and
metric (weekly_activity) explanatory variables:
| id | adherence | app_use | stress_level | weekly_activity |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 210 |
| 2 | 1 | 0 | 3 | 180 |
| 3 | 0 | 0 | 2 | 60 |
| 4 | 0 | 1 | 3 | 150 |
| 5 | 0 | 0 | 1 | 30 |
| 6 | 1 | 1 | 2 | 140 |
| 7 | 0 | 0 | 3 | 80 |
| 8 | 1 | 1 | 3 | 90 |
| 9 | 1 | 0 | 2 | 50 |
| 10 | 1 | 1 | 1 | 220 |
| 11 | 0 | 0 | 3 | 90 |
| 12 | 1 | 1 | 3 | 160 |
| 13 | 0 | 0 | 2 | 70 |
| 14 | 1 | 1 | 1 | 220 |
| 15 | 0 | 1 | 3 | 85 |
| 16 | 1 | 1 | 2 | 140 |
| 17 | 0 | 0 | 3 | 55 |
| 18 | 1 | 1 | 1 | 230 |
| 19 | 0 | 0 | 3 | 95 |
| 20 | 1 | 1 | 3 | 175 |
| 21 | 0 | 1 | 2 | 65 |
| 22 | 1 | 1 | 1 | 210 |
| 23 | 0 | 0 | 3 | 180 |
| 24 | 0 | 1 | 3 | 155 |
| 25 | 0 | 0 | 2 | 160 |
| 26 | 1 | 1 | 1 | 225 |
| 27 | 0 | 0 | 3 | 110 |
| 28 | 1 | 1 | 3 | 165 |
| 29 | 0 | 0 | 2 | 75 |
| 30 | 1 | 1 | 1 | 205 |
The core research question asks for the effect of physical activity on the probability of adherence. We hypothesize:
Physical activity and adherence to the program are positively associated.
A linear regression model can be written as:
\[ \hat{Y} = \beta_0 + \beta_1 X \]
Applied to a binary outcome such as adherence, this
model describes a straight-line relationship between
weekly_activity and the expected value of
adherence.
glm = ggplot(df, aes(x = weekly_activity, y = adherence)) +
geom_point(color = "darkblue", shape = 1) +
scale_x_continuous(limits = c(0, 280)) +
scale_y_continuous(limits = c(-0.2, 1.2)) +
labs(x = "Weekly activity",
y = "Adherence") +
theme_minimal()
glm + geom_smooth(method = "lm",
se = FALSE,
linetype = "dashed",
color = "steelblue",
linewidth = 0)
m = lm(adherence ~ weekly_activity, data=df)
m
##
## Call:
## lm(formula = adherence ~ weekly_activity, data = df)
##
## Coefficients:
## (Intercept) weekly_activity
## -0.210823 0.005227
b = round(coefficients(m), 3)
\[ \text{adherence } = -0.211 + 0.005 \cdot \text{ weekly_activity} \]
It is obvious to interpret the model outcome as probabilities, at
least in the value range of the predictor weekly_activity.
However, this is problematic because predicted values outside this range
may fall below 0 or above 1, although probabilities must lie between 0
and 1.
Logistic regression solves this problem by modeling an S-shaped probability curve (which is, by the way, also characteristic of logistic growth). This model is the so-called logistic model:
\[ p = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \]
Here, \(p\) denotes the probability that the outcome variable (here: adherence) takes the value 1.
glm + geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = FALSE,
linetype = "solid",
color = "darkorange",
linewidth = 0) +
labs(
x = "Weekly activity",
y = "Probability of Adherence (p)"
)
# use glm here
m = glm(adherence ~ weekly_activity, data = df, family = binomial)
m
##
## Call: glm(formula = adherence ~ weekly_activity, family = binomial,
## data = df)
##
## Coefficients:
## (Intercept) weekly_activity
## -3.86451 0.02835
##
## Degrees of Freedom: 29 Total (i.e. Null); 28 Residual
## Null Deviance: 41.59
## Residual Deviance: 27.65 AIC: 31.65
b = round(coefficients(m), 3)
This model assumes that the probability of adherence changes
non-linearly with weekly_activity: the curve is flatter
close to 0 and 1, and steepest in the middle range. This ‘trick’ solves
the issue of probability estimates beyond [0,1]. It is, however, obvious
that this model is non-linear. To linearize, we calculate the
odds of adherence, i.e., the ratio of the probability of
adherence to the probability of non-adherence. The term (1 - p)
represents the probability of the alternative outcome
(non-adherence).
\[ \frac{p}{1-p} = \frac{\frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}}{1-\frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}} = e^{\beta_0 + \beta_1 X} \]
If we apply the logarithm to the odds we get the so called linear logit model:
\[ \text{ln}\frac{p}{1-p} = \text{logit}(p) = \text{ln } e^{\beta_0 + \beta_1 X} = \beta_0 + \beta_1 X \]
Applied to our data the linear logit model is estimated as
\[ \text{logit}(p) = -3.865 + 0.028 \cdot \text{weekly_activity} \]
The coefficient for weekly_activity (0.028) is positive,
indicating that higher levels of physical activity are associated with a
higher probability of adherence. Overall, the model implies a monotonic,
non-linear increase in adherence probability with increasing physical
activity.
Note that the regression coefficient \(\beta_1\) = 0.028 itself is not a
probability. It describes the change in the log-odds of adherence for a
one-unit increase in weekly_activity. After exponentiation,
it becomes an odds ratio:
\[ e^{\beta_1} = e^{0.028} = 1.028 \]
The odds ratio indicates by which factor the odds of adherence change
when weekly_activity increases by one unit. In odds terms,
a one-unit increase in weekly activity (i.e., +1 minute) multiplies the
odds of adherence by 1.028. This corresponds to an increase of about
2.8% in the odds per additional minute of activity. More meaningful
increments are, e.g., a 10-minute [30-minute] increase in weekly
activity multiplies the odds of adherence by approximately \(e\)0.28 \(\approx\) 1.32 [\(e\)0.84 \(\approx\) 2.32], corresponding to about a
32% [132%] increase in the odds.
It is important to note that this refers to odds, not probabilities. A 32% increase in odds does not translate into a 32% increase in probability!
The intercept (-3.865) represents the log-odds of adherence when
weekly_activity = 0. In probability terms this means \[
p = \frac{1}{1 + e^{3.865}} \approx 0.0205325
\] Thus, with zero activity, adherence is essentially close to
zero.
The graphical representation of the model on the logit scale appears as a straight line, closely resembling a linear regression. However, a closer inspection shows that the logit model implies a steeper slope than the linear model. Substantively, this representation is of limited interpretive value, as log-odds are not directly intuitive compared to probabilities.
x_seq = seq(min(df$weekly_activity), max(df$weekly_activity), length.out = 100)
logit_hat = predict(m, newdata = data.frame(weekly_activity = x_seq), type = "link")
df_plot = data.frame(weekly_activity = x_seq, logit = logit_hat)
glm +
geom_line(data = df_plot,
aes(x = weekly_activity, y = logit),
color = "darkorange",
linetype = "solid",
linewidth = 0) +
labs(
x = "Weekly Activity",
y = "Logit(p) (Log-odds of adherence)"
)
Odds ratios are often easier to understand when illustrated with a categorical predictor rather than a metric one. In the categorical case, the odds of an outcome can be directly compared between clearly defined groups (e.g., low-activity vs. high-activity users), making the interpretation more intuitive. This simplified setting helps to introduce the basic logic of odds and odds ratios and to illustrate how they differ from probabilities.
Let’s look at 200 repondents classified into two grous, namely those with low weelky activity and those with high weelky activity.
t = data.frame(c1 = c("no", "yes"), c2 = c(80,20), c3=c(40,60))
names(t) = c("Adherence", "Low activity", "High activity")
kable_styling(kable(t, align = "lrr"), full_width = FALSE)
| Adherence | Low activity | High activity |
|---|---|---|
| no | 80 | 40 |
| yes | 20 | 60 |
The odds for adherence rather than non-adherence in the low activity group is
\[ \text{odds}_{low} = \frac{20}{80} = 0.25 \]
The odds for adherence rather than non-adherence in the high activity group are
\[ \text{odds}_{high} = \frac{60}{40} = 1.5 \]
The odds ratio (OR) for adherence for the high activity vs the low activity group are
\[ \text{OR} = \frac{60/40}{20/80} = \frac{1.5}{0.25} = 6 \]
The odds of adherence are 6 times higher for the high-activity group than for the low-activity group, indicating substantially greater adherence among highly active individuals.
This mini-example illustrates that, with categorical predictors, odds ratios can be computed directly from contingency tables because the grouping already exists in the data. With metric predictors, however, there are infinitely many possible comparisons, so we need a model that describes how odds change systematically across the predictor range. This is achieved by modeling the odds of the probabilities,
\[ \frac{p}{1-p} \]
or, more precisely, their logarithm (the logit).
Note that in terms of probabilities the numbers change:
The probability for adherence in the low activity group is \[ p(\text{adherence=yes | low activity} )= \frac{20}{80+20}=0.2 \] The probability for adherence in the high activity group is \[ p(\text{adherence=yes | high activity} )= \frac{60}{60+40}=0.6 \] and the relative probability for adherence of high-activity vs. low-activity users is
\[ \text{relative probability for adherence} = \frac{60/(60+40)}{20/(80+20)} = \frac{0.6}{0.2} = 3 \]
Thus, odds ratios are not equivalent to probability ratios often referred to as relative risks.
It is important to note that odds ratios are invariant to the direction of calculation: the odds ratio obtained row-wise is identical to the one obtained column-wise.
\[ \text{OR} = \frac{60/40}{20/80} = \frac{1.5}{0.25} = \frac{60/20}{40/80} = \frac{3}{0.5} = 6 \]
It is evident that this does not hold for probabilities, because \[ \text{relative likelihood for high activity} = \frac{60/(60+20)}{40/(40+80)} = \frac{0.75}{0.33} = 2.25 \ne 3 \] So the relative likelihood is not invariant to the direction of calculation.
As a side note, the invariance of odds ratios has important implications for research design. Unlike relative probabilities, odds ratios remain unchanged when the direction of conditioning is reversed. This is particularly important in case-control studies, where researchers sample individuals based on the outcome rather than the predictor. In such designs, direct probability comparisons are often not meaningful or not estimable, whereas odds ratios still provide a consistent measure of association. Transferred to the current example, this would mean comparing activity levels between adherent and non-adherent individuals instead of examining how adherence depends on activity. This perspective is substantively much less meaningful than the original direction.
In order to model the (mutually independent) effects of all three predictors, we hypothezise that
the multivariate logistic model estimates are:
# use glm here
m = glm(adherence ~ weekly_activity + app_use + stress_level, data = df, family = binomial)
b = round(coefficients(m), 3)
summary(m)
##
## Call:
## glm(formula = adherence ~ weekly_activity + app_use + stress_level,
## family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.46997 2.42241 -0.607 0.5440
## weekly_activity 0.02248 0.01175 1.914 0.0556 .
## app_use 1.92007 1.09928 1.747 0.0807 .
## stress_level -1.08695 0.92624 -1.174 0.2406
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 41.589 on 29 degrees of freedom
## Residual deviance: 22.558 on 26 degrees of freedom
## AIC: 30.558
##
## Number of Fisher Scoring iterations: 6
\[ \text{logit}(p) = -1.47 + 0.022 \cdot \text{weekly_activity} + 1.92 \cdot \text{app_use} -1.087 \cdot \text{stress_level} \]
Translated into odds ratios the model reads
# use glm here
b = round(exp(coefficients(m)), 3)
\[ \frac{p}{1-p} = 0.23 + 1.023 \cdot \text{weekly_activity} + 6.821 \cdot \text{app_use} + 0.337 \cdot \text{stress_level} \]
Due to the small sample size, the effects are not significant, though.
In addition to the estimated regression coefficients b, logistic regression reports usually include estimates of odds ratios OR = eb and their 95% confidence intervals CI(95%) = [low, high]:
b = round(coef(m), 3)
OR = round(exp(coef(m)), 3)
CI = round(exp(confint(m)), 3)
CI = paste0("[",CI[,1], ",", CI[,2], "]")
t = data.frame(cbind(b, OR, CI))
kable_styling(kable(t, align = "rrr"), full_width = FALSE)
| b | OR | CI | |
|---|---|---|---|
| (Intercept) | -1.47 | 0.23 | [0.001,18.703] |
| weekly_activity | 0.022 | 1.023 | [1.001,1.051] |
| app_use | 1.92 | 6.821 | [0.852,75.536] |
| stress_level | -1.087 | 0.337 | [0.036,1.685] |
The model shows that adherence is positively associated with physical activity and especially app use, while higher stress substantially reduces the likelihood of adherence. In detail, the odds ratios indicate the direction and strength of association with adherence, while the extreme width of the CIs indicate great uncertainty due to the small sample size.
Regression coefficients b can take both negative and positive values, whereas odds ratios (ORs), defined as (eb), are strictly positive. Positive coefficients translate into (OR > 1) (increased odds), while negative coefficients yield (OR< 1) (decreased odds). An OR of 1 indicates no difference in odds, i.e., no effect, since (e0 = 1).
Accordingly, at the 5% significance level, an effect is typically considered statistically significant if the corresponding confidence interval for the OR does not include 1.
We can easily extend the model by replacing metric predictors with
dummy variables. This is particularly relevant for variables such as
stress_level, which is ordinal and may not satisfy the
assumption of a linear effect. Instead of treating
stress_level as a numeric variable (implying equal and
linear changes between levels), we can represent each category (e.g.,
1=low, 2=moderate, 3=high) using dummy variables. This allows the model
to estimate separate effects for each level relative to a reference
category (the lowest by default).
This step is important because it relaxes the linearity assumption and enables a more flexible and potentially more accurate representation of how stress is associated with adherence.
#df = read.xlsx("./data/AdherenceData.xlsx")
m = glm(adherence ~ weekly_activity + as.factor(app_use) + as.factor(stress_level), data = df, family = binomial)
#m
b = round(coef(m), 3)
OR = round(exp(coef(m)), 3)
CI = round(exp(confint(m)), 3)
CI = paste0("[",CI[,1], ",", CI[,2], "]")
t = data.frame(cbind(b, OR, CI))
kable_styling(kable(t, align = "rrr"), full_width = FALSE)
| b | OR | CI | |
|---|---|---|---|
| (Intercept) | -2.736 | 0.065 | [0,4.866] |
| weekly_activity | 0.023 | 1.023 | [1.001,1.051] |
| as.factor(app_use)1 | 1.926 | 6.864 | [0.853,76.812] |
| as.factor(stress_level)2 | -0.881 | 0.414 | [0.003,57.166] |
| as.factor(stress_level)3 | -2.046 | 0.129 | [0.001,6.884] |
Note that the results for the binary variable app_use
remain unchanged (reference category: 0 = no app use). Users have
substantially higher odds of adherence, although the wide confidence
interval indicates considerable uncertainty.
For stress_level, the use of dummy variables provides a
more differentiated view. Relative to the reference category (low
stress, level 1), both medium (level 2) and high stress (level 3) are
associated with lower odds of adherence.
Specifically, individuals with high stress have odds that are about 0.129 times those of low-stress individuals (i.e., a reduction of roughly 87%). Compared to medium stress, high stress corresponds to an additional reduction by a factor of about 0.129 / 0.414 ≈ 0.312, indicating substantially lower adherence at the highest stress level.
However, the confidence intervals are very wide and include 1, so these effects are estimated with high uncertainty and should be interpreted cautiously.
GLMs allow us to apply linear modeling techniques to a wide range of outcome types. The key is the link function, which transforms the expected value of the outcome into a continuous scale suitable for a linear predictor. This enables valid modeling even when the raw outcome is binary, count-based, bounded, or non-linear.
| Outcome Type | Distribution | Link Function | What the Link Does |
|---|---|---|---|
| Non-linear | Skewed normal | Log, Exp, … | Maps metric values to a linear function |
| Binary (0/1) | Binomial | Logit | Maps probabilities (0–1) to the real line |
| Count | Poisson | Log | Maps positive counts to the real line |
| Proportion | Binomial | Logit or Probit | Models bounded outcomes on a continuous scale |
| Positive continuous | Gamma | Inverse | Ensures predictions remain strictly positive |