The General Linear Model refers to a broad family of models for continuous, normally distributed outcomes. It includes familiar methods such as simple linear regression, multiple regression, ANOVA, and ANCOVA.
The general model equation for an outcome variable \(Y\) modeled by \(m\) variables \(X_{1\ldots m}\) is:
\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_mX_m + \varepsilon \]
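where \(\beta_0\) is the intercept, \(\beta_1, \ldots, \beta_m\) are the regression coefficients of the predictors, and \(\varepsilon\) is the error term.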
Fitting the model means estimating the regression parameters \(\beta_{0\ldots m}\), which can then be used to estimate the outcome:
\[ \hat{Y} = Y_{\text{est}} = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_mX_m \]
Note that \(Y\) denotes the observed outcome. The estimated outcome is written as \(\hat{Y}\) or \(Y_\text{est}\); the two notations are interchangeable. The estimated outcome equals the pure model prediction without the error term.
In case of a single predictor we often write
\[ Y = \alpha + \beta X + \varepsilon \]
instead.
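To make the fitting step concrete, here is a minimal sketch in R that estimates \(\beta_0\) and \(\beta_1\) by ordinary least squares on simulated data and compares the result with `lm()`. The data, the seed, and all variable names (`x`, `y`, `X`, `beta_hat`) are illustrative assumptions and are not part of the school example that follows.

```r
# Simulate a small data set (hypothetical values, for illustration only)
set.seed(1)
n <- 50
x <- runif(n, min = 0, max = 15)
y <- 25 + 2 * x + rnorm(n, sd = 5)

# Design matrix with a column of ones for the intercept
X <- cbind(1, x)

# Least-squares estimates via the normal equations: (X'X) beta = X'y
beta_hat <- drop(solve(crossprod(X), crossprod(X, y)))

# lm() arrives at the same estimates
fit <- lm(y ~ x)

# Fitted values are the model prediction without the error term
y_hat <- drop(X %*% beta_hat)

print(beta_hat)
print(coef(fit))
```

The residuals `y - y_hat` are the sample counterpart of the error term \(\varepsilon\).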
Let’s introduce an example to illustrate this concept. In the following, we will work with data collected from 3 schools, specifically Hillcrest High School, Oakwood High School, and St Mary’s Church of England School. For each student, we have recorded the average test scores (ranging from 0% to 100%) and the average number of hours they studied shortly before their exams.
Here is the dataset:
school_long_name | school | test_score | study_time |
---|---|---|---|
Hillcrest High School | Hillcrest | 54.59454 | 13.092146 |
Hillcrest High School | Hillcrest | 45.90443 | 7.285176 |
Hillcrest High School | Hillcrest | 46.07368 | 10.068656 |
Hillcrest High School | Hillcrest | 53.15568 | 10.877858 |
Hillcrest High School | Hillcrest | 51.89308 | 10.192075 |
Hillcrest High School | Hillcrest | 45.88416 | 8.660897 |
Hillcrest High School | Hillcrest | 49.43284 | 13.513836 |
Hillcrest High School | Hillcrest | 49.04923 | 8.695293 |
Hillcrest High School | Hillcrest | 55.00619 | 15.034541 |
Hillcrest High School | Hillcrest | 49.30719 | 8.791128 |
Hillcrest High School | Hillcrest | 49.59065 | 12.893879 |
Hillcrest High School | Hillcrest | 54.12954 | 15.839206 |
Hillcrest High School | Hillcrest | 47.03255 | 4.812688 |
Hillcrest High School | Hillcrest | 48.98281 | 8.142904 |
Hillcrest High School | Hillcrest | 50.05463 | 8.579306 |
Hillcrest High School | Hillcrest | 56.51023 | 10.887121 |
Hillcrest High School | Hillcrest | 48.13828 | 8.126511 |
Hillcrest High School | Hillcrest | 33.20910 | 1.009904 |
Hillcrest High School | Hillcrest | 40.99077 | 1.657870 |
Hillcrest High School | Hillcrest | 57.04151 | 12.939610 |
Oakwood High School | Oakwood | 44.29733 | 8.059354 |
Oakwood High School | Oakwood | 25.14011 | 3.635345 |
Oakwood High School | Oakwood | 35.47447 | 8.463518 |
Oakwood High School | Oakwood | 45.12912 | 12.623294 |
Oakwood High School | Oakwood | 48.16753 | 14.664851 |
Oakwood High School | Oakwood | 35.43849 | 7.687863 |
Oakwood High School | Oakwood | 37.85879 | 8.207462 |
Oakwood High School | Oakwood | 25.77590 | 3.689781 |
Oakwood High School | Oakwood | 48.83004 | 10.359562 |
Oakwood High School | Oakwood | 36.44190 | 7.059286 |
Oakwood High School | Oakwood | 42.43892 | 10.345621 |
Oakwood High School | Oakwood | 45.67442 | 11.093782 |
Oakwood High School | Oakwood | 46.28141 | 12.084581 |
Oakwood High School | Oakwood | 35.90872 | 7.152491 |
Oakwood High School | Oakwood | 47.41863 | 10.494136 |
Oakwood High School | Oakwood | 29.00540 | 3.828244 |
Oakwood High School | Oakwood | 31.40098 | 6.625893 |
Oakwood High School | Oakwood | 36.01010 | 6.426547 |
Oakwood High School | Oakwood | 24.41876 | 1.736647 |
Oakwood High School | Oakwood | 38.60989 | 9.087638 |
Oakwood High School | Oakwood | 37.99014 | 9.597266 |
Oakwood High School | Oakwood | 39.07729 | 7.896098 |
Oakwood High School | Oakwood | 43.98547 | 11.253760 |
St Mary’s Church of England School | St Mary’s | 42.94447 | 6.799156 |
St Mary’s Church of England School | St Mary’s | 34.92019 | 4.874427 |
St Mary’s Church of England School | St Mary’s | 49.85766 | 10.277724 |
St Mary’s Church of England School | St Mary’s | 44.38598 | 6.545091 |
St Mary’s Church of England School | St Mary’s | 60.11943 | 13.311574 |
St Mary’s Church of England School | St Mary’s | 46.89865 | 7.684932 |
St Mary’s Church of England School | St Mary’s | 50.23389 | 10.946214 |
St Mary’s Church of England School | St Mary’s | 49.43679 | 9.945046 |
St Mary’s Church of England School | St Mary’s | 40.07521 | 6.627753 |
St Mary’s Church of England School | St Mary’s | 60.94646 | 13.706453 |
St Mary’s Church of England School | St Mary’s | 57.77003 | 10.907968 |
St Mary’s Church of England School | St Mary’s | 48.67736 | 9.248552 |
St Mary’s Church of England School | St Mary’s | 51.15944 | 9.808922 |
St Mary’s Church of England School | St Mary’s | 51.22429 | 11.017137 |
St Mary’s Church of England School | St Mary’s | 46.55865 | 9.248769 |
St Mary’s Church of England School | St Mary’s | 23.99421 | 0.000000 |
St Mary’s Church of England School | St Mary’s | 54.27720 | 9.833919 |
St Mary’s Church of England School | St Mary’s | 48.37929 | 7.877566 |
Hillcrest High School | Hillcrest | 45.30004 | 9.534962 |
Hillcrest High School | Hillcrest | 57.01957 | 10.724741 |
Hillcrest High School | Hillcrest | 56.86480 | 13.178481 |
Hillcrest High School | Hillcrest | 46.07672 | 6.797394 |
Hillcrest High School | Hillcrest | 55.57510 | 12.886898 |
Hillcrest High School | Hillcrest | 47.07002 | 9.986815 |
Hillcrest High School | Hillcrest | 49.22528 | 12.094788 |
Hillcrest High School | Hillcrest | 52.23690 | 11.741456 |
Hillcrest High School | Hillcrest | 47.77480 | 11.141905 |
Hillcrest High School | Hillcrest | 45.58995 | 5.849913 |
Hillcrest High School | Hillcrest | 52.34357 | 8.708711 |
Hillcrest High School | Hillcrest | 47.91817 | 10.849825 |
Hillcrest High School | Hillcrest | 43.12712 | 6.118700 |
Hillcrest High School | Hillcrest | 46.96063 | 7.350784 |
Hillcrest High School | Hillcrest | 47.81392 | 10.722260 |
Hillcrest High School | Hillcrest | 50.39072 | 11.283807 |
Hillcrest High School | Hillcrest | 53.06295 | 10.370573 |
Hillcrest High School | Hillcrest | 48.49496 | 6.321941 |
Hillcrest High School | Hillcrest | 45.96745 | 5.679928 |
Hillcrest High School | Hillcrest | 48.66620 | 13.517391 |
Oakwood High School | Oakwood | 41.34408 | 9.753035 |
Oakwood High School | Oakwood | 43.68350 | 9.244591 |
Oakwood High School | Oakwood | 41.67275 | 8.616581 |
Oakwood High School | Oakwood | 32.22012 | 5.396284 |
Oakwood High School | Oakwood | 35.53073 | 10.815261 |
Oakwood High School | Oakwood | 38.83860 | 8.327851 |
Oakwood High School | Oakwood | 40.58326 | 8.431000 |
Oakwood High School | Oakwood | 45.69603 | 11.779309 |
Oakwood High School | Oakwood | 45.36142 | 11.444590 |
Oakwood High School | Oakwood | 49.60593 | 13.155619 |
Oakwood High School | Oakwood | 35.91185 | 7.550749 |
Oakwood High School | Oakwood | 47.79057 | 10.930316 |
Oakwood High School | Oakwood | 49.71638 | 13.152602 |
Oakwood High School | Oakwood | 29.56580 | 5.646904 |
Oakwood High School | Oakwood | 38.93851 | 6.396892 |
Oakwood High School | Oakwood | 36.78149 | 5.584054 |
Oakwood High School | Oakwood | 33.67548 | 4.601628 |
Oakwood High School | Oakwood | 35.45055 | 9.219218 |
Oakwood High School | Oakwood | 42.16985 | 10.938883 |
Oakwood High School | Oakwood | 49.07087 | 12.582166 |
Oakwood High School | Oakwood | 46.35821 | 12.113523 |
Oakwood High School | Oakwood | 34.98333 | 5.969644 |
Oakwood High School | Oakwood | 58.42821 | 14.524716 |
St Mary’s Church of England School | St Mary’s | 39.48171 | 6.978950 |
St Mary’s Church of England School | St Mary’s | 42.54783 | 9.295812 |
St Mary’s Church of England School | St Mary’s | 44.95859 | 7.712503 |
St Mary’s Church of England School | St Mary’s | 44.77387 | 8.612220 |
St Mary’s Church of England School | St Mary’s | 50.96967 | 9.543849 |
St Mary’s Church of England School | St Mary’s | 46.57311 | 9.336753 |
St Mary’s Church of England School | St Mary’s | 54.34814 | 8.903993 |
St Mary’s Church of England School | St Mary’s | 48.53935 | 9.303488 |
St Mary’s Church of England School | St Mary’s | 42.13689 | 7.522964 |
St Mary’s Church of England School | St Mary’s | 42.90107 | 7.466619 |
St Mary’s Church of England School | St Mary’s | 35.57561 | 3.995973 |
St Mary’s Church of England School | St Mary’s | 44.78883 | 7.832269 |
St Mary’s Church of England School | St Mary’s | 38.44711 | 7.441319 |
St Mary’s Church of England School | St Mary’s | 72.24097 | 17.084943 |
St Mary’s Church of England School | St Mary’s | 37.95949 | 4.892922 |
St Mary’s Church of England School | St Mary’s | 49.29009 | 9.391039 |
St Mary’s Church of England School | St Mary’s | 36.70040 | 4.498395 |
St Mary’s Church of England School | St Mary’s | 34.26447 | 4.567963 |
Hillcrest High School | Hillcrest | 49.05039 | 9.353377 |
Hillcrest High School | Hillcrest | 46.63433 | 5.989353 |
Hillcrest High School | Hillcrest | 51.74739 | 8.973802 |
Hillcrest High School | Hillcrest | 43.49421 | 7.694494 |
Hillcrest High School | Hillcrest | 46.46544 | 7.138255 |
Hillcrest High School | Hillcrest | 41.27340 | 2.905237 |
Hillcrest High School | Hillcrest | 42.08927 | 5.305026 |
Hillcrest High School | Hillcrest | 46.31831 | 9.517820 |
Hillcrest High School | Hillcrest | 48.92636 | 10.682132 |
Hillcrest High School | Hillcrest | 48.76119 | 7.500638 |
Hillcrest High School | Hillcrest | 47.52638 | 8.979459 |
Hillcrest High School | Hillcrest | 50.46286 | 12.347939 |
Hillcrest High School | Hillcrest | 54.44885 | 13.298837 |
Hillcrest High School | Hillcrest | 41.11537 | 5.687929 |
Hillcrest High School | Hillcrest | 51.49040 | 8.627312 |
Hillcrest High School | Hillcrest | 51.64673 | 12.583766 |
Hillcrest High School | Hillcrest | 47.65148 | 7.570081 |
Hillcrest High School | Hillcrest | 48.73925 | 8.821862 |
Hillcrest High School | Hillcrest | 48.46442 | 8.720948 |
Hillcrest High School | Hillcrest | 51.00763 | 6.316233 |
Oakwood High School | Oakwood | 34.81445 | 7.645218 |
Oakwood High School | Oakwood | 43.21828 | 8.890936 |
Oakwood High School | Oakwood | 37.57005 | 7.737664 |
Oakwood High School | Oakwood | 44.13324 | 12.319428 |
Oakwood High School | Oakwood | 36.86629 | 7.536292 |
Oakwood High School | Oakwood | 39.59984 | 7.679763 |
Oakwood High School | Oakwood | 42.86316 | 11.069858 |
Oakwood High School | Oakwood | 31.30408 | 5.810165 |
Oakwood High School | Oakwood | 40.17264 | 8.857175 |
Oakwood High School | Oakwood | 33.61506 | 4.324636 |
Oakwood High School | Oakwood | 46.74118 | 12.480779 |
Oakwood High School | Oakwood | 34.15559 | 8.158333 |
Oakwood High School | Oakwood | 33.23144 | 7.575734 |
Oakwood High School | Oakwood | 30.22384 | 5.264513 |
Oakwood High School | Oakwood | 38.33064 | 8.955984 |
Oakwood High School | Oakwood | 35.09257 | 6.578424 |
Oakwood High School | Oakwood | 38.76908 | 7.378793 |
Oakwood High School | Oakwood | 46.38074 | 12.842296 |
Oakwood High School | Oakwood | 35.56375 | 8.452693 |
Oakwood High School | Oakwood | 35.34916 | 5.763923 |
Oakwood High School | Oakwood | 41.76415 | 9.468891 |
Oakwood High School | Oakwood | 41.25415 | 7.891055 |
Oakwood High School | Oakwood | 38.45118 | 10.749311 |
St Mary’s Church of England School | St Mary’s | 61.09157 | 13.276536 |
St Mary’s Church of England School | St Mary’s | 43.06366 | 6.001193 |
St Mary’s Church of England School | St Mary’s | 54.20218 | 10.343221 |
St Mary’s Church of England School | St Mary’s | 46.20423 | 9.233964 |
St Mary’s Church of England School | St Mary’s | 58.19549 | 11.665967 |
St Mary’s Church of England School | St Mary’s | 43.25403 | 8.289936 |
St Mary’s Church of England School | St Mary’s | 54.10719 | 11.489128 |
St Mary’s Church of England School | St Mary’s | 35.79491 | 3.744103 |
St Mary’s Church of England School | St Mary’s | 62.27251 | 14.047647 |
St Mary’s Church of England School | St Mary’s | 58.20645 | 11.573604 |
St Mary’s Church of England School | St Mary’s | 45.70714 | 8.526942 |
St Mary’s Church of England School | St Mary’s | 30.16766 | 4.632249 |
St Mary’s Church of England School | St Mary’s | 51.17844 | 10.908296 |
St Mary’s Church of England School | St Mary’s | 49.02100 | 10.428852 |
St Mary’s Church of England School | St Mary’s | 46.42774 | 8.960203 |
St Mary’s Church of England School | St Mary’s | 48.29198 | 9.433638 |
St Mary’s Church of England School | St Mary’s | 42.22076 | 7.226943 |
St Mary’s Church of England School | St Mary’s | 50.52733 | 10.085690 |
We can now model how students’ test scores depend on the length of study. As a first step we estimate a simple linear regression model:
# Fit a simple linear regression model (fixed intercept, fixed slope model)
model = lm(test_score ~ study_time, data = df)
# show model summary
summary(model)
##
## Call:
## lm(formula = test_score ~ study_time, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.1068 -4.5792 0.5577 4.1368 11.4546
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.7993 1.2427 21.57 <2e-16 ***
## study_time 2.0192 0.1326 15.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.266 on 181 degrees of freedom
## Multiple R-squared: 0.5616, Adjusted R-squared: 0.5592
## F-statistic: 231.9 on 1 and 181 DF, p-value: < 2.2e-16
With the formula
formula(model)
## test_score ~ study_time
we tell R to fit a regression line for the point cloud defined by study_time (as a single continuous predictor) and test_score (as the continuous outcome). We see that the resulting basic regression parameters
# Collect "classic" regression parameters
coef(model)
## (Intercept) study_time
## 26.799339 2.019198
are significantly different from zero and that the residuals are roughly symmetrically distributed around 0, which is a good sign. The test score \(Score_\text{est}\) is estimated by the regression equation
\[ Test\,\,Score_\text{est} = 26.799 + 2.019\,Study\,\,Time\,(\text{hours}) \]
According to this model, each additional hour of study increases the estimated test score by 2.019 points. For students who do not study at all, the estimated test score is 26.799 on average.
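The regression equation can be applied directly to obtain estimated scores. The sketch below uses the rounded coefficients from the output above; the helper name `predict_score` is hypothetical and introduced only for illustration.

```r
# Coefficients copied from the model output above (rounded to 3 decimals)
intercept <- 26.799
slope <- 2.019

# Hypothetical helper: estimated test score for a given number of study hours
predict_score <- function(study_hours) {
  intercept + slope * study_hours
}

# Estimated scores for 0, 5 and 10 hours of study
print(predict_score(c(0, 5, 10)))
```

With the fitted model object at hand, `predict(model, newdata = data.frame(study_time = c(0, 5, 10)))` should give the same predictions based on the unrounded coefficients.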
The model output is visualized as a regression line, i.e. the straight line that best fits (in the least-squares sense) the dots in the scatter plot (point cloud), where each dot represents a single student.
basic_ggplot = ggplot(df, aes(x = study_time, y = test_score)) +
geom_point(alpha = alpha_dots, color="#678") +
theme_minimal() + theme(panel.grid = element_blank()) +
labs(x = "Study Time", y = "Score")
# Assign intercept and slope to variables
intercept = coef(model)[1]
slope = coef(model)[2]
basic_ggplot +
labs(title = "Association of student's study time and scoring") +
geom_abline(intercept = intercept, slope = slope, color = "#345", linetype = "longdash")
For the standard errors and p-values of the model to be valid, the error term (residuals) should be approximately normally distributed. As seen in the model output above, the residuals are approximately symmetrically distributed. We can, of course, calculate the residual variable directly to show its distribution in more detail:
# get residual variable
test_score_resid = resid(model)
# show how the residuals are distributed
ggplot(df, aes(x = test_score_resid)) +
geom_histogram(binwidth = 5, fill = "#678", color = "white", alpha=.6) +
theme_minimal() + theme(panel.grid = element_blank()) +
labs(title = "Histogram of residuals", x = "Score residual", y = "Count")
The residuals are slightly left-skewed; a deviation this small can usually be ignored in practice.
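Beyond the histogram, skewness and normality of residuals can also be checked numerically. The following is a minimal sketch on simulated data, not on the school data set; the seed, the simulated values, and the variable names are assumptions made for illustration.

```r
# Simulated example data (hypothetical; replace with your own fitted model)
set.seed(42)
x <- runif(200, 0, 15)
y <- 27 + 2 * x + rnorm(200, sd = 5)
fit <- lm(y ~ x)
res <- resid(fit)

# Sample skewness: negative values indicate a left skew, positive a right skew
skewness <- mean((res - mean(res))^3) / mean((res - mean(res))^2)^1.5

# Shapiro-Wilk test: a small p-value speaks against normality
sw <- shapiro.test(res)

print(skewness)
print(sw$p.value)
```

A Q-Q plot (`qqnorm(res)` followed by `qqline(res)`) offers a complementary visual check: points close to the line support the normality assumption.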
In brief, the classical model
Generalized Linear Models (GLMs) extend traditional linear models to handle non-normal outcome variables, such as binary, count, proportion, or skewed continuous data. They provide a flexible framework that combines linear modeling with appropriate transformations and error distributions.
GLMs build upon a link function, i.e., a transformation \(g\) that links the expected outcome \(E(Y)\) to the linear predictor. The general model equation for a GLM is:
\[ g(E(Y)) = \beta_0 + \beta_1 X_1 + \ldots + \beta_m X_m \]
This equation links the expected value of the outcome \(Y\) to a linear combination of predictors through the link function \(g(\ldots)\).
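In R, the link function and its inverse are stored in the family objects that `glm()` accepts. The short sketch below only inspects two of these objects; the variable names `logit_family`, `log_family`, and `eta` are chosen here for illustration.

```r
# Family objects in base R carry the link function g and its inverse
logit_family <- binomial(link = "logit")
log_family <- poisson(link = "log")

# g maps an expected value to the scale of the linear predictor
print(logit_family$linkfun(c(0.1, 0.5, 0.9)))  # log-odds of the probabilities
print(log_family$linkfun(c(1, 10, 100)))       # natural log of the expected counts

# The inverse link maps the linear predictor back to the outcome scale
eta <- c(-2, 0, 2)
print(logit_family$linkinv(eta))  # probabilities strictly between 0 and 1
print(log_family$linkinv(eta))    # strictly positive expected counts
```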
In the following example we analyze how salaries (€) in an organization depend on years of employment and gender. Data is collected from 200 employees, specifically 100 female and 100 male employees.
Here is the dataset:
years_empl | salary | gender |
---|---|---|
27.4441813 | 274372.94 | Male |
28.1122624 | 272584.60 | Male |
8.5841860 | 83252.41 | Male |
24.9134288 | 229785.25 | Male |
19.2523656 | 141310.60 | Male |
15.5728785 | 108422.85 | Male |
22.0976494 | 185929.98 | Male |
4.0399979 | 42793.74 | Male |
19.7097687 | 100284.31 | Male |
21.1519435 | 167208.49 | Male |
13.7322533 | 84488.60 | Male |
21.5733675 | 171300.54 | Male |
28.0401674 | 291434.29 | Male |
7.6628647 | 76376.50 | Male |
13.8687847 | 80076.12 | Male |
28.2004357 | 305893.12 | Male |
29.3467929 | 318895.77 | Male |
3.5246208 | 55349.75 | Male |
14.2499124 | 107613.32 | Male |
16.8099824 | 125935.58 | Male |
27.1209416 | 247016.44 | Male |
4.1613050 | 40497.62 | Male |
29.6667519 | 331348.26 | Male |
28.4000470 | 276661.61 | Male |
2.4731267 | 31579.04 | Male |
15.4263535 | 111774.36 | Male |
11.7061040 | 88052.90 | Male |
27.1721439 | 270697.87 | Male |
13.4090888 | 74413.58 | Male |
25.0801278 | 206600.50 | Male |
22.1278685 | 198856.63 | Male |
24.3316542 | 213999.52 | Male |
11.6432485 | 77472.96 | Male |
20.5550919 | 153524.76 | Male |
0.1184502 | 47629.30 | Male |
24.9874824 | 230629.76 | Male |
0.2200244 | 32724.36 | Male |
6.2297692 | 46640.30 | Male |
27.1980422 | 278288.55 | Male |
18.3533593 | 142577.96 | Male |
11.3867772 | 95481.67 | Male |
13.0731475 | 78232.04 | Male |
1.1229310 | 42575.02 | Male |
29.2061974 | 331214.32 | Male |
12.9525375 | 67893.01 | Male |
28.7272979 | 285770.63 | Male |
26.6326472 | 235624.41 | Male |
19.1993631 | 117483.76 | Male |
29.1289983 | 309636.63 | Male |
18.5651462 | 142275.07 | Male |
10.0028163 | 84795.75 | Male |
10.4024474 | 84622.06 | Male |
11.9545623 | 63018.48 | Male |
23.5407833 | 224974.88 | Male |
1.1680947 | 37063.01 | Male |
22.4638616 | 182548.19 | Male |
20.3183049 | 146089.50 | Male |
5.1379299 | 43416.06 | Male |
7.8326389 | 58960.65 | Male |
15.4323880 | 104896.59 | Male |
20.2682182 | 151437.42 | Male |
29.4845159 | 318956.30 | Male |
22.7863280 | 178413.12 | Male |
16.9946527 | 109272.55 | Male |
25.4906916 | 205630.04 | Male |
5.6842181 | 41537.78 | Male |
8.1385984 | 49839.02 | Male |
24.8447546 | 259463.99 | Male |
20.7961446 | 137931.11 | Male |
7.2163422 | 55495.92 | Male |
1.2896639 | 49143.88 | Male |
4.2143728 | 40028.07 | Male |
6.4915625 | 52297.32 | Male |
14.3819569 | 79848.95 | Male |
5.9223103 | 48154.50 | Male |
21.5806751 | 162196.75 | Male |
0.2365422 | 38631.97 | Male |
11.2646989 | 43504.75 | Male |
15.4322312 | 84736.67 | Male |
0.0471166 | 32806.04 | Male |
17.4481201 | 129666.43 | Male |
4.7371562 | 36430.33 | Male |
10.7708492 | 71014.11 | Male |
19.3689564 | 158119.13 | Male |
23.2747009 | 214691.13 | Male |
16.9094052 | 99585.02 | Male |
7.0111020 | 50807.05 | Male |
2.6994155 | 55253.81 | Male |
2.5683619 | 30202.92 | Male |
9.1565511 | 61622.70 | Male |
20.0227954 | 147570.59 | Male |
0.0071669 | 43297.98 | Male |
6.2570987 | 42819.48 | Male |
27.9910238 | 281155.98 | Male |
27.7693425 | 270439.65 | Male |
22.0228290 | 191392.66 | Male |
9.9921595 | 59509.47 | Male |
15.4518999 | 96772.72 | Male |
22.3192394 | 189336.76 | Male |
18.5747772 | 116733.59 | Male |
27.4441813 | 204242.00 | Female |
28.1122624 | 191386.91 | Female |
8.5841860 | 72219.94 | Female |
24.9134288 | 167490.37 | Female |
19.2523656 | 108435.26 | Female |
15.5728785 | 70663.49 | Female |
22.0976494 | 140781.09 | Female |
4.0399979 | 32199.05 | Female |
19.7097687 | 111206.97 | Female |
21.1519435 | 151187.35 | Female |
13.7322533 | 75816.61 | Female |
21.5733675 | 119743.62 | Female |
28.0401674 | 216027.60 | Female |
7.6628647 | 45854.16 | Female |
13.8687847 | 88053.06 | Female |
28.2004357 | 237475.42 | Female |
29.3467929 | 219145.13 | Female |
3.5246208 | 45214.51 | Female |
14.2499124 | 82617.81 | Female |
16.8099824 | 110742.94 | Female |
27.1209416 | 196822.69 | Female |
4.1613050 | 52693.92 | Female |
29.6667519 | 213160.54 | Female |
28.4000470 | 244370.12 | Female |
2.4731267 | 48641.89 | Female |
15.4263535 | 86065.03 | Female |
11.7061040 | 46340.89 | Female |
27.1721439 | 210633.58 | Female |
13.4090888 | 83942.68 | Female |
25.0801278 | 173513.78 | Female |
22.1278685 | 143467.72 | Female |
24.3316542 | 155985.73 | Female |
11.6432485 | 73309.23 | Female |
20.5550919 | 130895.97 | Female |
0.1184502 | 33939.11 | Female |
24.9874824 | 152443.33 | Female |
0.2200244 | 40976.86 | Female |
6.2297692 | 54712.10 | Female |
27.1980422 | 188808.55 | Female |
18.3533593 | 84492.50 | Female |
11.3867772 | 69645.56 | Female |
13.0731475 | 69735.95 | Female |
1.1229310 | 36242.49 | Female |
29.2061974 | 212333.47 | Female |
12.9525375 | 59894.91 | Female |
28.7272979 | 240390.18 | Female |
26.6326472 | 199596.33 | Female |
19.1993631 | 123822.69 | Female |
29.1289983 | 257722.99 | Female |
18.5651462 | 111962.82 | Female |
10.0028163 | 30410.55 | Female |
10.4024474 | 67145.34 | Female |
11.9545623 | 86840.21 | Female |
23.5407833 | 186767.75 | Female |
1.1680947 | 48096.85 | Female |
22.4638616 | 127293.27 | Female |
20.3183049 | 113809.76 | Female |
5.1379299 | 32825.78 | Female |
7.8326389 | 42222.32 | Female |
15.4323880 | 85583.32 | Female |
20.2682182 | 105943.37 | Female |
29.4845159 | 266857.26 | Female |
22.7863280 | 149472.85 | Female |
16.9946527 | 97313.91 | Female |
25.4906916 | 186105.23 | Female |
5.6842181 | 45221.87 | Female |
8.1385984 | 51050.88 | Female |
24.8447546 | 192923.96 | Female |
20.7961446 | 125372.93 | Female |
7.2163422 | 30462.69 | Female |
1.2896639 | 38619.32 | Female |
4.2143728 | 35021.34 | Female |
6.4915625 | 39430.34 | Female |
14.3819569 | 66077.73 | Female |
5.9223103 | 51836.70 | Female |
21.5806751 | 133279.58 | Female |
0.2365422 | 38235.89 | Female |
11.2646989 | 62489.24 | Female |
15.4322312 | 78485.47 | Female |
0.0471166 | 48852.66 | Female |
17.4481201 | 97678.32 | Female |
4.7371562 | 56015.18 | Female |
10.7708492 | 45738.23 | Female |
19.3689564 | 109407.30 | Female |
23.2747009 | 148957.97 | Female |
16.9094052 | 92124.57 | Female |
7.0111020 | 69238.16 | Female |
2.6994155 | 35898.28 | Female |
2.5683619 | 39572.20 | Female |
9.1565511 | 42813.41 | Female |
20.0227954 | 110912.02 | Female |
0.0071669 | 44986.09 | Female |
6.2570987 | 65365.23 | Female |
27.9910238 | 231578.99 | Female |
27.7693425 | 188859.09 | Female |
22.0228290 | 170910.92 | Female |
9.9921595 | 75632.53 | Female |
15.4518999 | 88084.00 | Female |
22.3192394 | 153654.18 | Female |
18.5747772 | 95533.93 | Female |
The scatterplot shows a clear positive association: the longer employees have been with the organization, the higher their salaries.
p = ggplot(df, aes(x = years_empl, y = salary, color = gender)) +
geom_point(alpha = .6) +
theme_minimal() +
scale_x_continuous(name="Years of employment", labels = scales::comma) +
scale_y_continuous(name="Salary (€)", labels = scales::comma) +
scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender")
#scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender")
p + theme(legend.position="none")
We can fit a simple linear regression model to model this association:
linear_model = lm(salary ~ years_empl, data = df)
summary(linear_model)
##
## Call:
## lm(formula = salary ~ years_empl, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58615 -27281 -3463 19327 101896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2684.3 4717.3 -0.569 0.57
## years_empl 7943.6 260.2 30.535 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33160 on 198 degrees of freedom
## Multiple R-squared: 0.8248, Adjusted R-squared: 0.8239
## F-statistic: 932.4 on 1 and 198 DF, p-value: < 2.2e-16
p + theme(legend.position="none") +
geom_abline(intercept = coef(linear_model)[1], slope = coef(linear_model)[2],
color = "#345", alpha = 1, linetype = "longdash")
We are particularly interested in the salary differences between genders. A simple comparison of average salaries reveals that female employees earn less on average:
by(df$salary, df$gender, mean)
## df$gender: Female
## [1] 109140.8
## ------------------------------------------------------------
## df$gender: Male
## [1] 135466.1
salary_female = by(df$salary, df$gender, mean)["Female"]
salary_male = by(df$salary, df$gender, mean)["Male"]
By adding gender as a dummy variable to our linear model, we conclude that salaries increase by 7943.61 € per year of employment and that male salaries exceed female salaries by 26325.4 € on average, holding years of employment constant.
linear_model2 = lm(salary ~ years_empl+gender, data = df)
summary(linear_model2)
##
## Call:
## lm(formula = salary ~ years_empl + gender, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66761 -21296 -6514 20747 88733
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15846.9 4842.8 -3.272 0.00126 **
## years_empl 7943.6 239.2 33.215 < 2e-16 ***
## genderMale 26325.4 4311.0 6.107 5.33e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30480 on 197 degrees of freedom
## Multiple R-squared: 0.8527, Adjusted R-squared: 0.8512
## F-statistic: 570.3 on 2 and 197 DF, p-value: < 2.2e-16
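# Place the legend inside the plot panel (ggplot2 >= 3.5.0) and color the points by gender
p = p + theme(legend.position = "inside", legend.position.inside = c(.28,.63)) +
scale_color_manual(values = gender_colors, name = "Gender")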
p +
geom_abline(intercept = coef(linear_model)[1], slope = coef(linear_model)[2],
color = "#666", alpha = 1, linetype = "dotted") +
geom_abline(intercept = coef(linear_model2)[1], slope = coef(linear_model2)[2],
color = gender_colors[1], alpha = 1, linetype = "longdash") +
geom_abline(intercept = coef(linear_model2)[1]+coef(linear_model2)[3], slope = coef(linear_model2)[2],
color = gender_colors[2], alpha = 1, linetype = "longdash")
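Written out with the coefficients from the output above, the dummy-variable model corresponds to two parallel lines, one per gender. The indicator \(Male\) (1 for male, 0 for female employees) is notation introduced here for illustration:

\[ Salary_\text{est} = -15846.9 + 7943.6\,Years + 26325.4\,Male \]
\[ Salary_\text{est Female} = -15846.9 + 7943.6\,Years \]
\[ Salary_\text{est Male} = (-15846.9 + 26325.4) + 7943.6\,Years = 10478.5 + 7943.6\,Years \]

Both lines share the same slope and differ only by a vertical shift of 26325.4 €.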
It is, however, clearly visible that the association between years of employment and salaries is not linear. A linear model, thus, is not appropriate, and our interpretation is flawed. In order to fit a linear model, we first have to transform our dependent variable so that the association becomes linear.
When looking at selected data points it becomes obvious that the association is exponential in nature:
Years of Employment | Change | Salary (€) | Change | Salary as Power of 2 | log2(Salary) | Change |
---|---|---|---|---|---|---|
0 | | 32,768 | | \(2^{15}\) | 15 | |
10 | +10 | 65,536 | ×2 | \(2^{16}\) | 16 | +1 |
20 | +10 | 131,072 | ×2 | \(2^{17}\) | 17 | +1 |
30 | +10 | 262,144 | ×2 | \(2^{18}\) | 18 | +1 |
p + theme(legend.position="none") +
scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender") +
geom_point(x=0, y=2**15, color="#f52", size=5, alpha=.5) +
geom_point(x=10, y=2**16, color="#f52", size=5, alpha=.5) +
geom_point(x=20, y=2**17, color="#f52", size=5, alpha=.5) +
geom_point(x=30, y=2**18, color="#f52", size=5, alpha=.5)
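The pattern in the selected data points can be reproduced numerically. The sketch below uses idealized salaries that double exactly every 10 years, an assumption that mirrors the highlighted points rather than the observed data; the variable names are illustrative.

```r
# Idealized salaries that double every 10 years, starting at 2^15
years <- c(0, 10, 20, 30)
salary <- 2^(15 + years / 10)

# Ratio of consecutive salaries: constant factor of 2 (multiplicative growth)
ratios <- salary[-1] / salary[-length(salary)]

# Differences of consecutive log2 values: constant step of 1 (additive growth)
log_steps <- diff(log2(salary))

print(data.frame(years, salary, log2_salary = log2(salary)))
print(ratios)
print(log_steps)
```

Constant ratios on the original scale turn into constant differences on the log2 scale, which is what makes a straight line appropriate after the transformation.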
Therefore, we use a logarithmic transformation to linearize the association. Instead of fitting
\[ Salary = \beta_0 + \beta_1Years + \varepsilon \]
we choose \(\text{log}_2(Y)\) as dependent variable:
\[ \text{log}_2(Salary) = \beta_0 + \beta_1Years + \varepsilon \]
log_model = lm(log2(salary) ~ years_empl, data = df)
summary(log_model)
##
## Call:
## lm(formula = log2(salary) ~ years_empl, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1115 -0.1760 -0.0016 0.2198 0.5921
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.979177 0.039675 377.54 <2e-16 ***
## years_empl 0.102428 0.002188 46.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2789 on 198 degrees of freedom
## Multiple R-squared: 0.9171, Adjusted R-squared: 0.9167
## F-statistic: 2191 on 1 and 198 DF, p-value: < 2.2e-16
beta0 = round(coef(log_model)[1],3)
beta1 = round(coef(log_model)[2],3)
pl = ggplot(df, aes(x = years_empl, y = log2(salary), color = gender)) +
geom_point(alpha = .6) +
theme_minimal() +
scale_x_continuous(name="Years of employment", labels = scales::comma) +
scale_y_continuous(name="Log2(Salary)", labels = scales::comma) +
scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender")
#scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender")
pl + theme(legend.position="none") +
geom_abline(intercept = coef(log_model)[1], slope = coef(log_model)[2],
color = "#345", alpha = 1, linetype = "longdash")
According to this model, the log2 of salary increases by 0.102 per year of employment:
\[ \text{log}_2(Salary) = 14.979 + 0.102 Years \]
As this description is a little hard to interpret, we transform the model parameters and predictions back to the original scale. Taking the predictive form of the regression equation
\[ \text{log}_2(Salary_\text{est}) = \beta_0 + \beta_1Years \]
and raising 2 to the power of both sides, we get
\[ Salary_\text{est} = 2^{(\beta_0 + \beta_1Years)} = 2^{\beta_0} \cdot 2^{\beta_1Years} \]
Now we have a multiplicative (rather than additive) model, which reflects the underlying non-linear structure. With the estimated coefficients it reads:
\[ Salary_\text{est} = 2^{14.979} \cdot 2^{0.102\cdot Years} \]
Because we applied a logarithmic transformation, we can now interpret the slope as a growth rate: each additional year multiplies the estimated salary by \(2^{0.102}\). The model predictions are as follows:
df$log_pred = 2**(beta0 + beta1*df$years_empl)
p +
geom_abline(intercept = coef(linear_model)[1], slope = coef(linear_model)[2],
color = "#345", alpha = 1, linetype = "dotted") +
geom_line(aes(x=years_empl,
y=2**(coef(log_model)[1] + coef(log_model)[2]*years_empl)),
color = "#345", alpha = 1, linetype = "longdash")
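The back-transformed parameters can be computed in a few lines. This sketch uses the rounded coefficients from the log2 model output above; the names `start_salary`, `growth_factor`, `growth_rate`, and `doubling_time` are introduced here for illustration.

```r
# Rounded coefficients copied from the log2 model output above
beta0 <- 14.979
beta1 <- 0.102

# Back-transformed parameters
start_salary <- 2^beta0           # estimated salary at 0 years of employment
growth_factor <- 2^beta1          # multiplicative change per additional year
growth_rate <- growth_factor - 1  # relative change per year
doubling_time <- 1 / beta1        # years until the estimated salary doubles

print(round(start_salary))
print(round(100 * growth_rate, 2))
print(round(doubling_time, 1))
```

With these rounded values, the estimated salary grows by roughly 7.3 % per year and doubles about every 9.8 years.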
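But if we look at the gender differences in more detail, it appears that the genders differ not only in their intercepts but also in their growth rates. One way to model this is to fit separate models for male and female employees (a single model with an interaction between years of employment and gender would be an alternative).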
# adjust for Male and Female
log_model_male = lm(log2(salary) ~ years_empl, data = df[df$gender=="Male",])
summary(log_model_male)
##
## Call:
## lm(formula = log2(salary) ~ years_empl, data = df[df$gender ==
## "Male", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8088 -0.1247 0.0048 0.1004 0.5500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.97655 0.04442 337.15 <2e-16 ***
## years_empl 0.11018 0.00245 44.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2208 on 98 degrees of freedom
## Multiple R-squared: 0.9538, Adjusted R-squared: 0.9533
## F-statistic: 2023 on 1 and 98 DF, p-value: < 2.2e-16
log_model_female = lm(log2(salary) ~ years_empl, data = df[df$gender=="Female",])
summary(log_model_female)
##
## Call:
## lm(formula = log2(salary) ~ years_empl, data = df[df$gender ==
## "Female", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.03654 -0.11005 0.02057 0.15374 0.58988
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.981809 0.052984 282.8 <2e-16 ***
## years_empl 0.094675 0.002922 32.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2634 on 98 degrees of freedom
## Multiple R-squared: 0.9146, Adjusted R-squared: 0.9138
## F-statistic: 1050 on 1 and 98 DF, p-value: < 2.2e-16
betam0 = round(coef(log_model_male)[1],3)
betam1 = round(coef(log_model_male)[2],3)
betaf0 = round(coef(log_model_female)[1],3)
betaf1 = round(coef(log_model_female)[2],3)
p +
geom_line(aes(x=years_empl,
y=2**(coef(log_model_male)[1] + coef(log_model_male)[2]*years_empl)),
color = "steelblue", alpha = 1, linetype = "longdash") +
geom_line(aes(x=years_empl,
y=2**(coef(log_model_female)[1] + coef(log_model_female)[2]*years_empl)),
color = "darkorange", alpha = 1, linetype = "longdash")
Now it becomes obvious that male and female salaries differ in their growth rates rather than in their starting salaries.
Female
\[ Salary_\text{est Female} = 2^{14.982} \cdot 2^{0.095\cdot Years} \]
Male
\[ Salary_\text{est Male} = 2^{14.977} \cdot 2^{0.11\cdot Years} \]
While starting salaries are more or less equal between genders (female: \(2^{14.982} = 32361.71\) €, male: \(2^{14.977} = 32249.74\) €), the annual growth rates differ markedly (female: \(2^{0.095} - 1 = 6.81\%\), male: \(2^{0.11} - 1 = 7.92\%\)).
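As an alternative to two separate fits, a single model with an interaction term estimates the same gender-specific slopes and additionally gives a direct test of their difference. The sketch below uses simulated data, not the salary data set; the data frame `sim`, the seed, and the assumed true slopes are hypothetical.

```r
# Simulated data with gender-specific growth (hypothetical values)
set.seed(123)
n <- 100
sim <- data.frame(
  years  = rep(runif(n, 0, 30), times = 2),
  gender = factor(rep(c("Female", "Male"), each = n))
)
true_slope <- ifelse(sim$gender == "Male", 0.11, 0.095)
sim$log2_salary <- 15 + true_slope * sim$years + rnorm(2 * n, sd = 0.25)

# One model with an interaction term: intercept and slope may differ by gender
int_model <- lm(log2_salary ~ years * gender, data = sim)

# Separate fits per gender, for comparison
fit_f <- lm(log2_salary ~ years, data = sim[sim$gender == "Female", ])
fit_m <- lm(log2_salary ~ years, data = sim[sim$gender == "Male", ])

# Gender-specific slopes recovered from the interaction model
slope_f <- unname(coef(int_model)["years"])
slope_m <- unname(coef(int_model)["years"] + coef(int_model)["years:genderMale"])

print(c(female = slope_f, male = slope_m))
print(summary(int_model)$coefficients["years:genderMale", ])
```

The row `years:genderMale` of the coefficient table tests whether the two slopes differ, which the two separate fits do not provide directly.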
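In GLM terminology, the logarithm log2() played the role of the link function \(g(\cdot)\) that allowed us to linearize the association. Strictly speaking, a GLM applies the link to the expected value of the outcome rather than to the observed values, but the idea of linearizing the association is the same. Interpretation of the model output, though, is much easier once it is transformed back to the original scale.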
Regression analysis requires a metric outcome variable. To model a binary (or categorical) outcome, we have to choose a link function \(g(\cdot)\) that maps the expected outcome, e.g. the probability of an event, onto a continuous, unbounded scale.
Consider the following example …
Outcome Type | Distribution | Link Function | What the Link Does |
---|---|---|---|
Binary (0/1) | Binomial | Logit | Maps probabilities (0–1) to the real line |
Count | Poisson | Log | Maps positive expected counts to the real line |
Proportion | Binomial | Logit or Probit | Models bounded outcomes on a continuous scale |
Positive continuous | Gamma | Inverse (canonical) or Log | Relates the positive mean to the linear predictor; the log link keeps predictions strictly positive |
GLMs allow us to apply linear modeling techniques to a wide range of outcome types. The key is the link function, which transforms the expected value of the outcome into a continuous scale suitable for a linear predictor. This enables valid modeling even when the raw outcome is binary, count-based, or bounded.
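As an illustration of the first table row, the following minimal sketch fits a logistic regression with `glm()`. The data are simulated; the scenario (passing an exam depending on hours studied), the seed, and all variable names are assumptions made for this example.

```r
# Simulated binary outcome (hypothetical): passing an exam vs. hours studied
set.seed(7)
hours <- runif(300, 0, 15)
p_true <- plogis(-3 + 0.5 * hours)  # inverse logit of the linear predictor
passed <- rbinom(300, size = 1, prob = p_true)

# Logistic regression: binomial distribution with logit link
logit_model <- glm(passed ~ hours, family = binomial(link = "logit"))

# Coefficients are on the log-odds scale
print(coef(logit_model))

# type = "response" applies the inverse link and returns probabilities
new_hours <- data.frame(hours = c(2, 6, 10))
p_hat <- predict(logit_model, newdata = new_hours, type = "response")
print(p_hat)
```

As with the log2 example, the coefficients are easiest to interpret after back-transformation: `exp(coef(logit_model))` gives odds ratios, and the predicted probabilities stay between 0 and 1 by construction.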