The Classical Model

The General Linear Model refers to a broad family of models for continuous, normally distributed outcomes. It includes familiar methods such as simple linear regression, multiple regression, ANOVA, and ANCOVA.

The general model equation for an outcome variable \(Y\) modeled by \(m\) variables \(X_{1\ldots m}\) is:

\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_mX_m + \varepsilon \]

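where \(\beta_0\) is the intercept, \(\beta_1, \ldots, \beta_m\) are the regression coefficients (slopes) of the predictors \(X_1, \ldots, X_m\), and \(\varepsilon\) is the error term.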

Fitting the model means estimating the regression parameters \(\beta_{0\ldots m}\), which can then be used to estimate the outcome:

\[ \hat{Y} = Y_{\text{est}} = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_mX_m \]

Note that \(Y\) denotes the observed outcome. The estimated outcome is written as \(\hat{Y}\) or \(Y_\text{est}\); the two notations are interchangeable. The estimated outcome equals the pure model prediction without the error term.
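To make "fitting" more concrete, the following minimal sketch estimates the parameters by solving the OLS normal equations and compares the result with lm(). The data are simulated, and all names and values (x1, x2, the true coefficients) are assumptions made for illustration only; they are not part of the school dataset used below.

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(1)
n = 100
x1 = runif(n, 0, 15)
x2 = rnorm(n)
y = 25 + 2 * x1 - 1.5 * x2 + rnorm(n, sd = 5)

# Design matrix with a leading column of ones for the intercept
X = cbind(Intercept = 1, x1 = x1, x2 = x2)

# OLS estimate: solve the normal equations (X'X) b = X'y
beta_hat = drop(solve(crossprod(X), crossprod(X, y)))

# Estimated outcome: the linear predictor without the error term
y_hat = drop(X %*% beta_hat)

# lm() arrives at the same estimates
fit = lm(y ~ x1 + x2)
print(cbind(normal_equations = beta_hat, lm = coef(fit)))
```

In practice lm() is preferable because it uses a numerically more stable decomposition; the explicit solution is shown only to illustrate what is being computed.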

In case of a single predictor we often write

\[ Y = \alpha + \beta X + \varepsilon \]

instead.

Let’s introduce an example to illustrate this concept. In the following, we will work with data collected from 3 schools, specifically Hillcrest High School, Oakwood High School, and St Mary’s Church of England School. For each student, we have recorded the average test scores (ranging from 0% to 100%) and the average number of hours they studied shortly before their exams.

Here is the dataset:

school_long_name school test_score study_time
Hillcrest High School Hillcrest 54.59454 13.092146
Hillcrest High School Hillcrest 45.90443 7.285176
Hillcrest High School Hillcrest 46.07368 10.068656
Hillcrest High School Hillcrest 53.15568 10.877858
Hillcrest High School Hillcrest 51.89308 10.192075
Hillcrest High School Hillcrest 45.88416 8.660897
Hillcrest High School Hillcrest 49.43284 13.513836
Hillcrest High School Hillcrest 49.04923 8.695293
Hillcrest High School Hillcrest 55.00619 15.034541
Hillcrest High School Hillcrest 49.30719 8.791128
Hillcrest High School Hillcrest 49.59065 12.893879
Hillcrest High School Hillcrest 54.12954 15.839206
Hillcrest High School Hillcrest 47.03255 4.812688
Hillcrest High School Hillcrest 48.98281 8.142904
Hillcrest High School Hillcrest 50.05463 8.579306
Hillcrest High School Hillcrest 56.51023 10.887121
Hillcrest High School Hillcrest 48.13828 8.126511
Hillcrest High School Hillcrest 33.20910 1.009904
Hillcrest High School Hillcrest 40.99077 1.657870
Hillcrest High School Hillcrest 57.04151 12.939610
Oakwood High School Oakwood 44.29733 8.059354
Oakwood High School Oakwood 25.14011 3.635345
Oakwood High School Oakwood 35.47447 8.463518
Oakwood High School Oakwood 45.12912 12.623294
Oakwood High School Oakwood 48.16753 14.664851
Oakwood High School Oakwood 35.43849 7.687863
Oakwood High School Oakwood 37.85879 8.207462
Oakwood High School Oakwood 25.77590 3.689781
Oakwood High School Oakwood 48.83004 10.359562
Oakwood High School Oakwood 36.44190 7.059286
Oakwood High School Oakwood 42.43892 10.345621
Oakwood High School Oakwood 45.67442 11.093782
Oakwood High School Oakwood 46.28141 12.084581
Oakwood High School Oakwood 35.90872 7.152491
Oakwood High School Oakwood 47.41863 10.494136
Oakwood High School Oakwood 29.00540 3.828244
Oakwood High School Oakwood 31.40098 6.625893
Oakwood High School Oakwood 36.01010 6.426547
Oakwood High School Oakwood 24.41876 1.736647
Oakwood High School Oakwood 38.60989 9.087638
Oakwood High School Oakwood 37.99014 9.597266
Oakwood High School Oakwood 39.07729 7.896098
Oakwood High School Oakwood 43.98547 11.253760
St Mary’s Church of England School St Mary’s 42.94447 6.799156
St Mary’s Church of England School St Mary’s 34.92019 4.874427
St Mary’s Church of England School St Mary’s 49.85766 10.277724
St Mary’s Church of England School St Mary’s 44.38598 6.545091
St Mary’s Church of England School St Mary’s 60.11943 13.311574
St Mary’s Church of England School St Mary’s 46.89865 7.684932
St Mary’s Church of England School St Mary’s 50.23389 10.946214
St Mary’s Church of England School St Mary’s 49.43679 9.945046
St Mary’s Church of England School St Mary’s 40.07521 6.627753
St Mary’s Church of England School St Mary’s 60.94646 13.706453
St Mary’s Church of England School St Mary’s 57.77003 10.907968
St Mary’s Church of England School St Mary’s 48.67736 9.248552
St Mary’s Church of England School St Mary’s 51.15944 9.808922
St Mary’s Church of England School St Mary’s 51.22429 11.017137
St Mary’s Church of England School St Mary’s 46.55865 9.248769
St Mary’s Church of England School St Mary’s 23.99421 0.000000
St Mary’s Church of England School St Mary’s 54.27720 9.833919
St Mary’s Church of England School St Mary’s 48.37929 7.877566
Hillcrest High School Hillcrest 45.30004 9.534962
Hillcrest High School Hillcrest 57.01957 10.724741
Hillcrest High School Hillcrest 56.86480 13.178481
Hillcrest High School Hillcrest 46.07672 6.797394
Hillcrest High School Hillcrest 55.57510 12.886898
Hillcrest High School Hillcrest 47.07002 9.986815
Hillcrest High School Hillcrest 49.22528 12.094788
Hillcrest High School Hillcrest 52.23690 11.741456
Hillcrest High School Hillcrest 47.77480 11.141905
Hillcrest High School Hillcrest 45.58995 5.849913
Hillcrest High School Hillcrest 52.34357 8.708711
Hillcrest High School Hillcrest 47.91817 10.849825
Hillcrest High School Hillcrest 43.12712 6.118700
Hillcrest High School Hillcrest 46.96063 7.350784
Hillcrest High School Hillcrest 47.81392 10.722260
Hillcrest High School Hillcrest 50.39072 11.283807
Hillcrest High School Hillcrest 53.06295 10.370573
Hillcrest High School Hillcrest 48.49496 6.321941
Hillcrest High School Hillcrest 45.96745 5.679928
Hillcrest High School Hillcrest 48.66620 13.517391
Oakwood High School Oakwood 41.34408 9.753035
Oakwood High School Oakwood 43.68350 9.244591
Oakwood High School Oakwood 41.67275 8.616581
Oakwood High School Oakwood 32.22012 5.396284
Oakwood High School Oakwood 35.53073 10.815261
Oakwood High School Oakwood 38.83860 8.327851
Oakwood High School Oakwood 40.58326 8.431000
Oakwood High School Oakwood 45.69603 11.779309
Oakwood High School Oakwood 45.36142 11.444590
Oakwood High School Oakwood 49.60593 13.155619
Oakwood High School Oakwood 35.91185 7.550749
Oakwood High School Oakwood 47.79057 10.930316
Oakwood High School Oakwood 49.71638 13.152602
Oakwood High School Oakwood 29.56580 5.646904
Oakwood High School Oakwood 38.93851 6.396892
Oakwood High School Oakwood 36.78149 5.584054
Oakwood High School Oakwood 33.67548 4.601628
Oakwood High School Oakwood 35.45055 9.219218
Oakwood High School Oakwood 42.16985 10.938883
Oakwood High School Oakwood 49.07087 12.582166
Oakwood High School Oakwood 46.35821 12.113523
Oakwood High School Oakwood 34.98333 5.969644
Oakwood High School Oakwood 58.42821 14.524716
St Mary’s Church of England School St Mary’s 39.48171 6.978950
St Mary’s Church of England School St Mary’s 42.54783 9.295812
St Mary’s Church of England School St Mary’s 44.95859 7.712503
St Mary’s Church of England School St Mary’s 44.77387 8.612220
St Mary’s Church of England School St Mary’s 50.96967 9.543849
St Mary’s Church of England School St Mary’s 46.57311 9.336753
St Mary’s Church of England School St Mary’s 54.34814 8.903993
St Mary’s Church of England School St Mary’s 48.53935 9.303488
St Mary’s Church of England School St Mary’s 42.13689 7.522964
St Mary’s Church of England School St Mary’s 42.90107 7.466619
St Mary’s Church of England School St Mary’s 35.57561 3.995973
St Mary’s Church of England School St Mary’s 44.78883 7.832269
St Mary’s Church of England School St Mary’s 38.44711 7.441319
St Mary’s Church of England School St Mary’s 72.24097 17.084943
St Mary’s Church of England School St Mary’s 37.95949 4.892922
St Mary’s Church of England School St Mary’s 49.29009 9.391039
St Mary’s Church of England School St Mary’s 36.70040 4.498395
St Mary’s Church of England School St Mary’s 34.26447 4.567963
Hillcrest High School Hillcrest 49.05039 9.353377
Hillcrest High School Hillcrest 46.63433 5.989353
Hillcrest High School Hillcrest 51.74739 8.973802
Hillcrest High School Hillcrest 43.49421 7.694494
Hillcrest High School Hillcrest 46.46544 7.138255
Hillcrest High School Hillcrest 41.27340 2.905237
Hillcrest High School Hillcrest 42.08927 5.305026
Hillcrest High School Hillcrest 46.31831 9.517820
Hillcrest High School Hillcrest 48.92636 10.682132
Hillcrest High School Hillcrest 48.76119 7.500638
Hillcrest High School Hillcrest 47.52638 8.979459
Hillcrest High School Hillcrest 50.46286 12.347939
Hillcrest High School Hillcrest 54.44885 13.298837
Hillcrest High School Hillcrest 41.11537 5.687929
Hillcrest High School Hillcrest 51.49040 8.627312
Hillcrest High School Hillcrest 51.64673 12.583766
Hillcrest High School Hillcrest 47.65148 7.570081
Hillcrest High School Hillcrest 48.73925 8.821862
Hillcrest High School Hillcrest 48.46442 8.720948
Hillcrest High School Hillcrest 51.00763 6.316233
Oakwood High School Oakwood 34.81445 7.645218
Oakwood High School Oakwood 43.21828 8.890936
Oakwood High School Oakwood 37.57005 7.737664
Oakwood High School Oakwood 44.13324 12.319428
Oakwood High School Oakwood 36.86629 7.536292
Oakwood High School Oakwood 39.59984 7.679763
Oakwood High School Oakwood 42.86316 11.069858
Oakwood High School Oakwood 31.30408 5.810165
Oakwood High School Oakwood 40.17264 8.857175
Oakwood High School Oakwood 33.61506 4.324636
Oakwood High School Oakwood 46.74118 12.480779
Oakwood High School Oakwood 34.15559 8.158333
Oakwood High School Oakwood 33.23144 7.575734
Oakwood High School Oakwood 30.22384 5.264513
Oakwood High School Oakwood 38.33064 8.955984
Oakwood High School Oakwood 35.09257 6.578424
Oakwood High School Oakwood 38.76908 7.378793
Oakwood High School Oakwood 46.38074 12.842296
Oakwood High School Oakwood 35.56375 8.452693
Oakwood High School Oakwood 35.34916 5.763923
Oakwood High School Oakwood 41.76415 9.468891
Oakwood High School Oakwood 41.25415 7.891055
Oakwood High School Oakwood 38.45118 10.749311
St Mary’s Church of England School St Mary’s 61.09157 13.276536
St Mary’s Church of England School St Mary’s 43.06366 6.001193
St Mary’s Church of England School St Mary’s 54.20218 10.343221
St Mary’s Church of England School St Mary’s 46.20423 9.233964
St Mary’s Church of England School St Mary’s 58.19549 11.665967
St Mary’s Church of England School St Mary’s 43.25403 8.289936
St Mary’s Church of England School St Mary’s 54.10719 11.489128
St Mary’s Church of England School St Mary’s 35.79491 3.744103
St Mary’s Church of England School St Mary’s 62.27251 14.047647
St Mary’s Church of England School St Mary’s 58.20645 11.573604
St Mary’s Church of England School St Mary’s 45.70714 8.526942
St Mary’s Church of England School St Mary’s 30.16766 4.632249
St Mary’s Church of England School St Mary’s 51.17844 10.908296
St Mary’s Church of England School St Mary’s 49.02100 10.428852
St Mary’s Church of England School St Mary’s 46.42774 8.960203
St Mary’s Church of England School St Mary’s 48.29198 9.433638
St Mary’s Church of England School St Mary’s 42.22076 7.226943
St Mary’s Church of England School St Mary’s 50.52733 10.085690


Simple linear regression analysis

We can now model how students’ test scores depend on the length of study. As a first step we estimate a simple linear regression model:

# Fit a simple linear regression model (fixed intercept, fixed slope model)
model = lm(test_score ~ study_time, data = df)
# Show model summary
summary(model)
## 
## Call:
## lm(formula = test_score ~ study_time, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.1068  -4.5792   0.5577   4.1368  11.4546 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.7993     1.2427   21.57   <2e-16 ***
## study_time    2.0192     0.1326   15.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.266 on 181 degrees of freedom
## Multiple R-squared:  0.5616, Adjusted R-squared:  0.5592 
## F-statistic: 231.9 on 1 and 181 DF,  p-value: < 2.2e-16

With the formula

formula(model)
## test_score ~ study_time

we tell R to fit a regression line to the point cloud defined by study_time (the single continuous predictor) and test_score (the continuous outcome). We see that the resulting regression parameters

# Extract the regression parameters (intercept and slope)
coef(model)
## (Intercept)  study_time 
##   26.799339    2.019198

are significantly different from zero and that the residuals are roughly symmetrically distributed around 0, which is a good sign. The test score \(Score_\text{est}\) is estimated by the regression equation

\[ Test\,\,Score_\text{est} = 26.799 + 2.019\,Study\,\,Time\,(\text{hours}) \]

According to this model, each additional hour of study is associated with an increase of 2.019 points in the test score. For students who do not study at all, the estimated average test score is 26.799.
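The regression equation can be applied directly to obtain estimated scores. The sketch below copies the rounded coefficients from the model output above; the study times are hypothetical values chosen for illustration.

```r
# Rounded coefficients copied from the model summary above
b0 = 26.799   # estimated score at 0 hours of study
b1 = 2.019    # points per additional hour of study

# Hypothetical study times in hours
study_hours = c(0, 5, 10)

# Estimated test score for each study time
score_est = b0 + b1 * study_hours
print(data.frame(study_hours, score_est))
```

With the fitted model object at hand, predict(model, newdata = data.frame(study_time = c(0, 5, 10))) should give the same values up to rounding of the coefficients.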

The fitted model can be visualized as a regression line, i.e., the straight line that minimizes the sum of squared vertical distances to the dots in the scatter plot (point cloud), where each dot represents a single student.

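library(ggplot2)

# alpha_dots is not defined in the code shown here; 0.6 is an assumed value
alpha_dots = 0.6

basic_ggplot = ggplot(df, aes(x = study_time, y = test_score)) +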
  geom_point(alpha = alpha_dots, color="#678") +
  theme_minimal() + theme(panel.grid = element_blank()) + 
  labs(x = "Study Time", y = "Score")

# Assign intercept and slope to variables
intercept = coef(model)[1]
slope = coef(model)[2]

basic_ggplot +
  labs(title = "Association between students' study time and test score") +
  geom_abline(intercept = intercept, slope = slope, color = "#345", linetype = "longdash")


Residual analysis

Valid inference (standard errors, t-tests, and confidence intervals) relies on the error term being approximately normally distributed with constant variance; the residuals serve as estimates of this error term. As seen in the model output above, the residuals are approximately symmetrically distributed. We can, of course, extract the residuals directly to show their distribution in more detail:

# Extract the residuals
test_score_resid = resid(model)

# Show how the residuals are distributed
ggplot(df, aes(x = test_score_resid)) +
  geom_histogram(binwidth = 5, fill = "#678", color = "white", alpha=.6) +
  theme_minimal() + theme(panel.grid = element_blank()) + 
  labs(title = "Histogram of residuals", x = "Score residual", y = "Count") 

The residuals are slightly left-skewed; a deviation this small is usually unproblematic in practice.
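Beyond a histogram, a Q-Q plot, the sample skewness, and a Shapiro-Wilk test are common ways to inspect residuals. The sketch below uses simulated data as a stand-in for the school data; the object names (sim_model, res) and the simulation parameters are assumptions made for illustration.

```r
# Simulated stand-in for the school data (hypothetical values)
set.seed(42)
study_time = runif(183, 0, 17)
test_score = 27 + 2 * study_time + rnorm(183, sd = 5)
sim_model = lm(test_score ~ study_time)
res = resid(sim_model)

# Q-Q plot: points close to the line indicate approximate normality
qqnorm(res)
qqline(res)

# Sample skewness: negative values indicate a left skew
skewness = mean((res - mean(res))^3) / sd(res)^3

# Shapiro-Wilk test as a formal check (sensitive to sample size)
sw = shapiro.test(res)
print(c(skewness = skewness, shapiro_p = sw$p.value))
```

Applied to the model from the text, res would simply be resid(model).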


In brief, the classical model

  • assumes a linear relationship between predictors and outcome.
  • assumes normally distributed residuals with constant variance.
  • is estimated using ordinary least squares (OLS).
  • is used when the outcome is continuous and at least approximately unbounded (e.g., test scores, weight).


Generalized Linear Models (GLM): Summary

Generalized Linear Models (GLMs) extend traditional linear models to handle non-normal outcome variables, such as binary, count, proportion, or skewed continuous data. They provide a flexible framework that combines linear modeling with appropriate transformations and error distributions.

GLMs build upon a link function, i.e., a transformation \(g\) that links the expected outcome \(E(Y)\) to the linear predictor. The general model equation for a GLM is:

\[ g(E(Y)) = \beta_0 + \beta_1 X_1 + \ldots + \beta_m X_m \]

This equation links the expected value of the outcome \(Y\) to a linear combination of predictors through the link function \(g(\ldots)\).
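In base R, a link function and its inverse are stored in the family object that is passed to glm(). The following short sketch shows the logit and log links at work; the input values are arbitrary and chosen for illustration.

```r
# Family objects carry the link function g() and its inverse
logit_family = binomial(link = "logit")
log_family = poisson(link = "log")

# Logit link: probabilities (0-1) -> real line, log(p / (1 - p))
prob = c(0.1, 0.5, 0.9)
eta = logit_family$linkfun(prob)

# Inverse link: real line -> probabilities
prob_back = logit_family$linkinv(eta)

# Log link: positive expected values -> real line
mu = c(0.5, 1, 20)
log_mu = log_family$linkfun(mu)

print(rbind(prob, eta, prob_back))
```

The linear predictor lives on the scale of eta; applying the inverse link brings predictions back to the scale of the outcome.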

Example 1: Non-linear Associations

In the following example we analyze how salaries (€) in an organization depend on years of employment and gender. Data is collected from 200 employees, specifically 100 female and 100 male employees.

Here is the dataset:

years_empl salary gender
27.4441813 274372.94 Male
28.1122624 272584.60 Male
8.5841860 83252.41 Male
24.9134288 229785.25 Male
19.2523656 141310.60 Male
15.5728785 108422.85 Male
22.0976494 185929.98 Male
4.0399979 42793.74 Male
19.7097687 100284.31 Male
21.1519435 167208.49 Male
13.7322533 84488.60 Male
21.5733675 171300.54 Male
28.0401674 291434.29 Male
7.6628647 76376.50 Male
13.8687847 80076.12 Male
28.2004357 305893.12 Male
29.3467929 318895.77 Male
3.5246208 55349.75 Male
14.2499124 107613.32 Male
16.8099824 125935.58 Male
27.1209416 247016.44 Male
4.1613050 40497.62 Male
29.6667519 331348.26 Male
28.4000470 276661.61 Male
2.4731267 31579.04 Male
15.4263535 111774.36 Male
11.7061040 88052.90 Male
27.1721439 270697.87 Male
13.4090888 74413.58 Male
25.0801278 206600.50 Male
22.1278685 198856.63 Male
24.3316542 213999.52 Male
11.6432485 77472.96 Male
20.5550919 153524.76 Male
0.1184502 47629.30 Male
24.9874824 230629.76 Male
0.2200244 32724.36 Male
6.2297692 46640.30 Male
27.1980422 278288.55 Male
18.3533593 142577.96 Male
11.3867772 95481.67 Male
13.0731475 78232.04 Male
1.1229310 42575.02 Male
29.2061974 331214.32 Male
12.9525375 67893.01 Male
28.7272979 285770.63 Male
26.6326472 235624.41 Male
19.1993631 117483.76 Male
29.1289983 309636.63 Male
18.5651462 142275.07 Male
10.0028163 84795.75 Male
10.4024474 84622.06 Male
11.9545623 63018.48 Male
23.5407833 224974.88 Male
1.1680947 37063.01 Male
22.4638616 182548.19 Male
20.3183049 146089.50 Male
5.1379299 43416.06 Male
7.8326389 58960.65 Male
15.4323880 104896.59 Male
20.2682182 151437.42 Male
29.4845159 318956.30 Male
22.7863280 178413.12 Male
16.9946527 109272.55 Male
25.4906916 205630.04 Male
5.6842181 41537.78 Male
8.1385984 49839.02 Male
24.8447546 259463.99 Male
20.7961446 137931.11 Male
7.2163422 55495.92 Male
1.2896639 49143.88 Male
4.2143728 40028.07 Male
6.4915625 52297.32 Male
14.3819569 79848.95 Male
5.9223103 48154.50 Male
21.5806751 162196.75 Male
0.2365422 38631.97 Male
11.2646989 43504.75 Male
15.4322312 84736.67 Male
0.0471166 32806.04 Male
17.4481201 129666.43 Male
4.7371562 36430.33 Male
10.7708492 71014.11 Male
19.3689564 158119.13 Male
23.2747009 214691.13 Male
16.9094052 99585.02 Male
7.0111020 50807.05 Male
2.6994155 55253.81 Male
2.5683619 30202.92 Male
9.1565511 61622.70 Male
20.0227954 147570.59 Male
0.0071669 43297.98 Male
6.2570987 42819.48 Male
27.9910238 281155.98 Male
27.7693425 270439.65 Male
22.0228290 191392.66 Male
9.9921595 59509.47 Male
15.4518999 96772.72 Male
22.3192394 189336.76 Male
18.5747772 116733.59 Male
27.4441813 204242.00 Female
28.1122624 191386.91 Female
8.5841860 72219.94 Female
24.9134288 167490.37 Female
19.2523656 108435.26 Female
15.5728785 70663.49 Female
22.0976494 140781.09 Female
4.0399979 32199.05 Female
19.7097687 111206.97 Female
21.1519435 151187.35 Female
13.7322533 75816.61 Female
21.5733675 119743.62 Female
28.0401674 216027.60 Female
7.6628647 45854.16 Female
13.8687847 88053.06 Female
28.2004357 237475.42 Female
29.3467929 219145.13 Female
3.5246208 45214.51 Female
14.2499124 82617.81 Female
16.8099824 110742.94 Female
27.1209416 196822.69 Female
4.1613050 52693.92 Female
29.6667519 213160.54 Female
28.4000470 244370.12 Female
2.4731267 48641.89 Female
15.4263535 86065.03 Female
11.7061040 46340.89 Female
27.1721439 210633.58 Female
13.4090888 83942.68 Female
25.0801278 173513.78 Female
22.1278685 143467.72 Female
24.3316542 155985.73 Female
11.6432485 73309.23 Female
20.5550919 130895.97 Female
0.1184502 33939.11 Female
24.9874824 152443.33 Female
0.2200244 40976.86 Female
6.2297692 54712.10 Female
27.1980422 188808.55 Female
18.3533593 84492.50 Female
11.3867772 69645.56 Female
13.0731475 69735.95 Female
1.1229310 36242.49 Female
29.2061974 212333.47 Female
12.9525375 59894.91 Female
28.7272979 240390.18 Female
26.6326472 199596.33 Female
19.1993631 123822.69 Female
29.1289983 257722.99 Female
18.5651462 111962.82 Female
10.0028163 30410.55 Female
10.4024474 67145.34 Female
11.9545623 86840.21 Female
23.5407833 186767.75 Female
1.1680947 48096.85 Female
22.4638616 127293.27 Female
20.3183049 113809.76 Female
5.1379299 32825.78 Female
7.8326389 42222.32 Female
15.4323880 85583.32 Female
20.2682182 105943.37 Female
29.4845159 266857.26 Female
22.7863280 149472.85 Female
16.9946527 97313.91 Female
25.4906916 186105.23 Female
5.6842181 45221.87 Female
8.1385984 51050.88 Female
24.8447546 192923.96 Female
20.7961446 125372.93 Female
7.2163422 30462.69 Female
1.2896639 38619.32 Female
4.2143728 35021.34 Female
6.4915625 39430.34 Female
14.3819569 66077.73 Female
5.9223103 51836.70 Female
21.5806751 133279.58 Female
0.2365422 38235.89 Female
11.2646989 62489.24 Female
15.4322312 78485.47 Female
0.0471166 48852.66 Female
17.4481201 97678.32 Female
4.7371562 56015.18 Female
10.7708492 45738.23 Female
19.3689564 109407.30 Female
23.2747009 148957.97 Female
16.9094052 92124.57 Female
7.0111020 69238.16 Female
2.6994155 35898.28 Female
2.5683619 39572.20 Female
9.1565511 42813.41 Female
20.0227954 110912.02 Female
0.0071669 44986.09 Female
6.2570987 65365.23 Female
27.9910238 231578.99 Female
27.7693425 188859.09 Female
22.0228290 170910.92 Female
9.9921595 75632.53 Female
15.4518999 88084.00 Female
22.3192394 153654.18 Female
18.5747772 95533.93 Female


The scatterplot shows a clear positive association: the longer employees have been with the organization, the higher their salaries:

p = ggplot(df, aes(x = years_empl, y = salary, color = gender)) +
  geom_point(alpha = .6) +
  theme_minimal() +
  scale_x_continuous(name="Years of employment", labels = scales::comma) + 
  scale_y_continuous(name="Salary (€)", labels = scales::comma) +
  scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender")
p + theme(legend.position="none")


We can fit a simple linear regression model to model this association:

linear_model = lm(salary ~ years_empl, data = df)
summary(linear_model)
## 
## Call:
## lm(formula = salary ~ years_empl, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -58615 -27281  -3463  19327 101896 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2684.3     4717.3  -0.569     0.57    
## years_empl    7943.6      260.2  30.535   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33160 on 198 degrees of freedom
## Multiple R-squared:  0.8248, Adjusted R-squared:  0.8239 
## F-statistic: 932.4 on 1 and 198 DF,  p-value: < 2.2e-16
p + theme(legend.position="none") + 
  geom_abline(intercept = coef(linear_model)[1], slope = coef(linear_model)[2], 
              color = "#345", alpha = 1, linetype = "longdash") 


We are particularly interested in salary differences between genders. A simple comparison of average salaries reveals that female employees earn less on average:

by(df$salary, df$gender, mean)
## df$gender: Female
## [1] 109140.8
## ------------------------------------------------------------ 
## df$gender: Male
## [1] 135466.1
salary_female = by(df$salary, df$gender, mean)["Female"]
salary_male = by(df$salary, df$gender, mean)["Male"]

By adding gender as a dummy variable to our linear model, we come to the conclusion that salaries increase by 7943.61 € per year of employment and that male salaries exceed female salaries by 26325.4 € on average, holding years of employment constant.

linear_model2 = lm(salary ~ years_empl+gender, data = df)
summary(linear_model2)
## 
## Call:
## lm(formula = salary ~ years_empl + gender, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -66761 -21296  -6514  20747  88733 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15846.9     4842.8  -3.272  0.00126 ** 
## years_empl    7943.6      239.2  33.215  < 2e-16 ***
## genderMale   26325.4     4311.0   6.107 5.33e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30480 on 197 degrees of freedom
## Multiple R-squared:  0.8527, Adjusted R-squared:  0.8512 
## F-statistic: 570.3 on 2 and 197 DF,  p-value: < 2.2e-16
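# gender_colors is not defined in the code shown here; the values below are
# assumed (Female, Male) and match the colors used in the later plots
gender_colors = c("darkorange", "steelblue")

# legend.position.inside only takes effect when legend.position = "inside"
p = p + theme(legend.position = "inside", legend.position.inside = c(.28,.63)) + 
  scale_color_manual(values = gender_colors, name = "Gender")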

p + 
  geom_abline(intercept = coef(linear_model)[1], slope = coef(linear_model)[2], 
              color = "#666", alpha = 1, linetype = "dotted") + 
  geom_abline(intercept = coef(linear_model2)[1], slope = coef(linear_model2)[2], 
              color = gender_colors[1], alpha = 1, linetype = "longdash") + 
  geom_abline(intercept = coef(linear_model2)[1]+coef(linear_model2)[3], slope = coef(linear_model2)[2], 
              color = gender_colors[2], alpha = 1, linetype = "longdash") 
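To see what the dummy variable does, the fitted equation can be evaluated for both genders at the same number of years. The sketch below copies the coefficients from the output of summary(linear_model2); the value of 10 years is a hypothetical example.

```r
# Coefficients copied from summary(linear_model2)
b0 = -15846.9       # intercept (Female is the reference category)
b_years = 7943.6    # increase per year of employment
b_male = 26325.4    # shift of the regression line for male employees

# Hypothetical employees with 10 years of employment
years = 10
salary_female_est = b0 + b_years * years            # dummy = 0
salary_male_est = b0 + b_years * years + b_male     # dummy = 1

print(c(female = salary_female_est, male = salary_male_est,
        difference = salary_male_est - salary_female_est))
```

Whatever value is chosen for years, the difference between the two estimates stays at the dummy coefficient, which is why the two regression lines in the plot are parallel.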

It is, however, clearly visible that the association between years of employment and salaries is not linear. A linear model, thus, is not appropriate, and our interpretation is flawed. In order to fit a linear model, we first have to transform our dependent variable so that the association becomes linear.

When looking at selected data points it becomes obvious that the association is exponential in nature:

Years of Employment   Change   Salary (€)   Change   Salary as power of 2   log2(Salary)   Change
0                              32,768                \(2^{15}\)             15
10                    +10      65,536       ×2       \(2^{16}\)             16             +1
20                    +10      131,072      ×2       \(2^{17}\)             17             +1
30                    +10      262,144      ×2       \(2^{18}\)             18             +1
p + theme(legend.position="none") + 
  scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender") + 
  # annotate() draws each highlighted point exactly once; geom_point() with
  # fixed x and y would redraw it for every row of df and cancel the transparency
  annotate("point", x = c(0, 10, 20, 30), y = 2^(15:18),
           color = "#f52", size = 5, alpha = .5)
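The doubling pattern of the highlighted points can be checked with a few lines of R. This is a small sketch using only the four idealized points from the table, not the actual salary data.

```r
# The four idealized points: salary doubles every 10 years
years = c(0, 10, 20, 30)
salary = 2^(15 + years / 10)

# Ratio of each salary to the previous one (multiplicative change)
ratios = salary[-1] / salary[-length(salary)]

# Difference of consecutive log2 salaries (additive change)
log_steps = diff(log2(salary))

print(data.frame(years, salary, log2_salary = log2(salary)))
```

A constant ratio on the original scale corresponds to a constant difference on the log scale, which is exactly what makes the logarithm useful for linearizing exponential growth.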

Therefore, we use a logarithmic transformation to linearize the association. Instead of fitting

\[ Salary = \beta_0 + \beta_1Years + \varepsilon \]

we choose \(\text{log}_2(Salary)\) as the dependent variable:

\[ \text{log}_2(Salary) = \beta_0 + \beta_1Years + \varepsilon \]

log_model = lm(log2(salary) ~ years_empl, data = df)
summary(log_model)
## 
## Call:
## lm(formula = log2(salary) ~ years_empl, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1115 -0.1760 -0.0016  0.2198  0.5921 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.979177   0.039675  377.54   <2e-16 ***
## years_empl   0.102428   0.002188   46.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2789 on 198 degrees of freedom
## Multiple R-squared:  0.9171, Adjusted R-squared:  0.9167 
## F-statistic:  2191 on 1 and 198 DF,  p-value: < 2.2e-16
beta0 = round(coef(log_model)[1],3)
beta1 = round(coef(log_model)[2],3)

pl = ggplot(df, aes(x = years_empl, y = log2(salary), color = gender)) +
  geom_point(alpha = .6) +
  theme_minimal() + 
  scale_x_continuous(name="Years of employment", labels = scales::comma) + 
  scale_y_continuous(name="Log2(Salary)", labels = scales::comma) +
  scale_color_manual(values = c("steelblue", "steelblue"), name = "Gender")

pl + theme(legend.position="none") + 
  geom_abline(intercept = coef(log_model)[1], slope = coef(log_model)[2], 
              color = "#345", alpha = 1, linetype = "longdash") 

According to this model, \(\text{log}_2(Salary)\) increases by 0.102 with each additional year of employment:

\[ \text{log}_2(Salary) = 14.979 + 0.102 Years \]

Because a change in \(\text{log}_2(Salary)\) is hard to interpret, we transform the model parameters and predictions back to the original scale. Starting from the predictive form of the regression equation

\[ \text{log}_2(Salary_\text{est}) = \beta_0 + \beta_1Years \]

and raising 2 to the power of both sides, we get

\[ Salary_\text{est} = 2^{(\beta_0 + \beta_1Years)} = 2^{\beta_0} \cdot 2^{\beta_1Years} \]

Now we have a multiplicative (rather than additive) model, which reflects the underlying non-linear structure:

\[ Salary_\text{est} = 2^{14.979} \cdot 2^{0.102\cdot Years} \]

Because we applied a logarithmic transformation, we can now interpret the slope as a growth rate: each additional year multiplies the estimated salary by \(2^{0.102} \approx 1.073\). The model predictions are as follows:

df$log_pred = 2**(beta0 + beta1*df$years_empl)

p + 
  geom_abline(intercept = coef(linear_model)[1], slope = coef(linear_model)[2], 
              color = "#345", alpha = 1, linetype = "dotted")  + 
  geom_line(aes(x=years_empl, 
                y=2**(coef(log_model)[1] + coef(log_model)[2]*years_empl)),
              color = "#345", alpha = 1, linetype = "longdash") 
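The back-transformation can also be carried out numerically. The following sketch copies the rounded coefficients from summary(log_model) and derives the starting salary, the annual growth factor, and the doubling time; the variable names are chosen for illustration.

```r
# Rounded coefficients copied from summary(log_model)
b0 = 14.979
b1 = 0.102

# Back-transformed parameters on the original (euro) scale
start_salary = 2^b0                      # estimated salary at 0 years
growth_factor = 2^b1                     # multiplicative change per year
growth_pct = (growth_factor - 1) * 100   # annual growth in percent
doubling_time = 1 / b1                   # years until the salary doubles

# Estimated salary after 10 years, written in multiplicative form
pred_10 = start_salary * growth_factor^10

print(c(start_salary = start_salary, growth_pct = growth_pct,
        doubling_time = doubling_time, pred_10 = pred_10))
```

With a base-2 logarithm the doubling time is simply the reciprocal of the slope, which is one reason why log2 is a convenient choice here.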

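But if we look at the gender differences in more detail, it appears that the genders differ not only in their intercepts but also in their growth rates. One straightforward way to model this is to fit separate models for male and female employees (a single model with an interaction between years of employment and gender would be an alternative).

# Fit separate models for male and female employees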
log_model_male = lm(log2(salary) ~ years_empl, data = df[df$gender=="Male",])
summary(log_model_male)
## 
## Call:
## lm(formula = log2(salary) ~ years_empl, data = df[df$gender == 
##     "Male", ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8088 -0.1247  0.0048  0.1004  0.5500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.97655    0.04442  337.15   <2e-16 ***
## years_empl   0.11018    0.00245   44.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2208 on 98 degrees of freedom
## Multiple R-squared:  0.9538, Adjusted R-squared:  0.9533 
## F-statistic:  2023 on 1 and 98 DF,  p-value: < 2.2e-16
log_model_female = lm(log2(salary) ~ years_empl, data = df[df$gender=="Female",])
summary(log_model_female)
## 
## Call:
## lm(formula = log2(salary) ~ years_empl, data = df[df$gender == 
##     "Female", ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.03654 -0.11005  0.02057  0.15374  0.58988 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.981809   0.052984   282.8   <2e-16 ***
## years_empl   0.094675   0.002922    32.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2634 on 98 degrees of freedom
## Multiple R-squared:  0.9146, Adjusted R-squared:  0.9138 
## F-statistic:  1050 on 1 and 98 DF,  p-value: < 2.2e-16
betam0 = round(coef(log_model_male)[1],3)
betam1 = round(coef(log_model_male)[2],3)
betaf0 = round(coef(log_model_female)[1],3)
betaf1 = round(coef(log_model_female)[2],3)

p + 
  geom_line(aes(x=years_empl, 
                y=2**(coef(log_model_male)[1] + coef(log_model_male)[2]*years_empl)),
              color = "steelblue", alpha = 1, linetype = "longdash")   + 
  geom_line(aes(x=years_empl, 
                y=2**(coef(log_model_female)[1] + coef(log_model_female)[2]*years_empl)),
              color = "darkorange", alpha = 1, linetype = "longdash")  

Now it becomes obvious that the salaries of male and female employees differ in their growth rates rather than in their starting salaries.

Female:

\[ Salary_\text{est Female} = 2^{14.982} \cdot 2^{0.095\cdot Years} \]

Male:

\[ Salary_\text{est Male} = 2^{14.977} \cdot 2^{0.11\cdot Years} \]

While starting salaries are more or less equal between genders (female: \(2^{14.982} = 32361.71\) €, male: \(2^{14.977} = 32249.74\) €), the annual growth rates differ noticeably (female: \(2^{0.095} - 1 = 6.81\)%, male: \(2^{0.11} - 1 = 7.92\)%).
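Gender-specific intercepts and growth rates can also be estimated in a single model by adding an interaction between years of employment and gender, which additionally provides a test of the difference in slopes. The sketch below uses simulated data; the data frame sim and the true parameter values are assumptions made for illustration and are not the salary data from the text.

```r
# Simulated data with gender-specific growth rates (hypothetical values)
set.seed(7)
n = 100
years = runif(n, 0, 30)
sim = data.frame(
  years_empl = rep(years, 2),
  gender = factor(rep(c("Female", "Male"), each = n))
)
rate = ifelse(sim$gender == "Male", 0.110, 0.095)
sim$salary = 2^(15 + rate * sim$years_empl + rnorm(2 * n, sd = 0.25))

# One model with interaction: separate intercepts and slopes per gender
int_model = lm(log2(salary) ~ years_empl * gender, data = sim)

# Separate model for male employees, for comparison
sep_male = lm(log2(salary) ~ years_empl, data = sim[sim$gender == "Male", ])

# Male intercept and slope reconstructed from the interaction model
b = coef(int_model)
male_from_interaction = c(b["(Intercept)"] + b["genderMale"],
                          b["years_empl"] + b["years_empl:genderMale"])

print(rbind(interaction = unname(male_from_interaction),
            separate = unname(coef(sep_male))))
```

The coefficient years_empl:genderMale in summary(int_model) is the difference between the male and the female slope, together with its standard error and p-value.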

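In the terminology of GLMs, the logarithm log2() played the role of the link function \(g\)() that allowed us to linearize the association. Strictly speaking, we transformed the observed outcome itself, whereas a GLM applies the link to the expected value of the outcome; the idea of linearizing through a transformation is the same. Interpretation of the model output, though, is much easier if it is transformed back to the original scale.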
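The difference between transforming the outcome and using a link function can be illustrated by fitting both variants to the same data. The sketch below uses simulated salaries; the choice of the Gamma family with a log link is an assumption made for this illustration, and the natural logarithm is used because glm() offers a log link to base e.

```r
# Simulated salaries with multiplicative noise (hypothetical values)
set.seed(3)
years = runif(200, 0, 30)
salary = 2^(15 + 0.1 * years) * exp(rnorm(200, sd = 0.2))
sim = data.frame(years, salary)

# Transformed outcome: models the expected value of log(salary)
lm_log = lm(log(salary) ~ years, data = sim)

# GLM with log link: models the log of the expected salary
glm_log = glm(salary ~ years, family = Gamma(link = "log"), data = sim)

# Both slopes estimate growth on the natural-log scale
print(rbind(lm_log = coef(lm_log), glm_log = coef(glm_log)))

# Annual growth factor implied by each model
print(exp(c(lm_log = unname(coef(lm_log)[2]),
            glm_log = unname(coef(glm_log)[2]))))
```

The slopes of the two models are typically close, while the intercepts differ slightly because the mean of the logarithm is not the logarithm of the mean.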

Example 2: Binary Outcome, Logit Transformation

Classical regression analysis requires a metric outcome variable. If we aim to model a binary (or categorical) outcome variable, we have to choose a link function \(g\)() that maps the expected outcome, here the probability of the event, onto a metric scale.

Consider the following example …

Common GLM Examples

Outcome Type          Distribution   Link Function                What the Link Does
Binary (0/1)          Binomial       Logit                        Maps probabilities (0–1) to the real line
Count                 Poisson        Log                          Maps positive expected counts to the real line
Proportion            Binomial       Logit or Probit              Maps bounded expected proportions (0–1) to the real line
Positive continuous   Gamma          Inverse (canonical) or Log   Relates the positive mean to the linear predictor; the log link keeps predictions strictly positive
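As an illustration of the first row of the table, the following sketch fits a logistic regression with glm(). The data are simulated: the outcome passed, the predictor study_time, and the true coefficients are hypothetical and serve only to show the mechanics of the logit link.

```r
# Simulated binary outcome (hypothetical values)
set.seed(11)
study_time = runif(300, 0, 15)
# Assumed true model: logit(P(pass)) = -3 + 0.5 * study_time
passed = rbinom(300, size = 1, prob = plogis(-3 + 0.5 * study_time))

# Logistic regression: binomial family with logit link
logit_model = glm(passed ~ study_time, family = binomial(link = "logit"))

# Predictions on the link scale (log-odds) and on the probability scale
new = data.frame(study_time = c(2, 6, 10))
eta = predict(logit_model, newdata = new, type = "link")
prob = predict(logit_model, newdata = new, type = "response")

print(data.frame(new, log_odds = eta, probability = prob))
```

The linear predictor can take any real value, while the inverse logit (plogis() in R) maps it back into the interval from 0 to 1.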

Bottom Line

GLMs allow us to apply linear modeling techniques to a wide range of outcome types. The key is the link function, which maps the expected value of the outcome onto an unbounded continuous scale suitable for a linear predictor. This enables valid modeling even when the raw outcome is binary, count-based, or bounded.