Homework 3

1. Sample description

The dataset includes 200 public service employees: 100 male and 100 female. Salaries range from €30,203 to €331,348, with a mean of €122,304, a median of €97,496, and a standard deviation of €79,030. The variable years_empl records each employee's years of employment and is used as the predictor to estimate how salary changes with experience.

# Assumption: read.xlsx() comes from the openxlsx package
library(openxlsx)

SalaryData = read.xlsx("SalaryData.xlsx")

# nrow() returns the number of observations; row() would print a matrix of row indices
nrow(SalaryData)

## [1] 200

table(SalaryData$gender)

## 
## Female   Male 
##    100    100

summary(SalaryData$salary)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30203   54208   97496  122304  179447  331348

sd(SalaryData$salary)

## [1] 79030.12

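# The experience column is named years_empl (see names(SalaryData) in section 2).
# SalaryData$years_exp does not exist, so summary() and sd() on it return NULL and NA.
summary(SalaryData$years_empl)

sd(SalaryData$years_empl)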

2. Scatterplot of years of employment and salary

The scatterplot reveals a clear positive, but non-linear, relationship between years of employment and salary. Salary grows slowly at first and then increases more steeply with more experience. This suggests an exponential pattern, which justifies the log transformation used later in the analysis.

names(SalaryData)

## [1] "years_empl" "salary"     "gender"

plot_data = na.omit(SalaryData[, c("years_empl", "salary")])

# Plot
plot(plot_data$years_empl, plot_data$salary,
     main = "Scatterplot: Years of Employment vs. Salary",
     xlab = "Years of Employment",
     ylab = "Salary (€)",
     pch = 19,
     col = "steelblue")

3. Estimate salary by years of employment

To account for the non-linear relationship observed in the scatterplot, a logarithmic transformation was applied to the salary variable. If salary grows roughly exponentially with years of employment, then log-salary grows roughly linearly, which makes the data suitable for linear regression. The model shows a very strong fit, with an R² of 0.917, indicating that years of employment explain about 91.7% of the variance in log-salary.

SalaryData$log_salary = log(SalaryData$salary)
model = lm(log_salary ~ years_empl, data = SalaryData)
summary(model)

## 
## Call:
## lm(formula = log_salary ~ years_empl, data = SalaryData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77041 -0.12197 -0.00111  0.15234  0.41044 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.382774   0.027501  377.54   <2e-16 ***
## years_empl   0.070998   0.001517   46.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1933 on 198 degrees of freedom
## Multiple R-squared:  0.9171, Adjusted R-squared:  0.9167 
## F-statistic:  2191 on 1 and 198 DF,  p-value: < 2.2e-16
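
Because the model is fitted on the log scale, its predictions have to be back-transformed with exp() to be read in euros. The sketch below does this by hand with the two coefficients reported above; predict_salary is a hypothetical helper introduced only for illustration. Note that exp() of a log-scale prediction is better read as a typical (median-like) salary than as the arithmetic mean.

```r
# Coefficients copied from the summary(model) output above
intercept <- 10.382774
slope     <- 0.070998

# Hypothetical helper: back-transform a log-salary prediction to euros
predict_salary <- function(years) {
  exp(intercept + slope * years)
}

# Predicted salary at 0, 10 and 20 years of employment
predicted <- predict_salary(c(0, 10, 20))
print(round(predicted))
```

With the fitted model object available, exp(predict(model, newdata = data.frame(years_empl = c(0, 10, 20)))) should give the same values up to rounding of the coefficients.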

4. Interpretation

The estimated coefficient for years_empl is 0.071. Since the salary variable was log-transformed, each additional year of employment multiplies the expected salary by exp(0.071) ≈ 1.074, which is an average increase of approximately 7.4 percent per year. The model fits the data very well, with an R-squared of 0.917. This means that about 91.7 percent of the variation in salary (on the log scale) can be explained by years of employment.
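
The conversion from a log-scale coefficient to a percentage change can be checked directly. This is a minimal sketch that uses only the years_empl estimate from the regression output; the variable names are illustrative.

```r
# Slope for years_empl, copied from the regression output
b <- 0.070998

# Exact percentage change in salary per additional year
pct_exact <- (exp(b) - 1) * 100

# Common shortcut: read the coefficient itself as a percentage
pct_approx <- b * 100

print(c(exact = pct_exact, approx = pct_approx))
```

The shortcut is close for small coefficients, but it slightly understates the effect, and the gap grows as the coefficient gets larger.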

5. (Voluntary) Gender effects

Fitting the same log-linear model separately for men and women gives nearly identical intercepts but different slopes.

model_male = lm(log_salary ~ years_empl, data = subset(SalaryData, gender == "Male"))
model_female = lm(log_salary ~ years_empl, data = subset(SalaryData, gender == "Female"))

summary(model_male)

## 
## Call:
## lm(formula = log_salary ~ years_empl, data = subset(SalaryData, 
##     gender == "Male"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56063 -0.08644  0.00333  0.06960  0.38121 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.380951   0.030790  337.15   <2e-16 ***
## years_empl   0.076372   0.001698   44.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.153 on 98 degrees of freedom
## Multiple R-squared:  0.9538, Adjusted R-squared:  0.9533 
## F-statistic:  2023 on 1 and 98 DF,  p-value: < 2.2e-16

summary(model_female)

## 
## Call:
## lm(formula = log_salary ~ years_empl, data = subset(SalaryData, 
##     gender == "Female"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71847 -0.07628  0.01426  0.10656  0.40887 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.384598   0.036725   282.8   <2e-16 ***
## years_empl   0.065623   0.002025    32.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1825 on 98 degrees of freedom
## Multiple R-squared:  0.9146, Adjusted R-squared:  0.9138 
## F-statistic:  1050 on 1 and 98 DF,  p-value: < 2.2e-16

For men, the coefficient is 0.076, which corresponds to an average salary increase of about 7.9 percent per year of employment. For women, the coefficient is 0.066, meaning an average increase of around 6.8 percent per year. Both models show a strong fit: the R-squared is 0.954 for men and 0.915 for women. This indicates that experience is closely linked to salary in both groups, and that the yearly increase is higher for men in this sample. The two separate models do not by themselves test whether this difference in slopes is statistically significant.
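
One rough way to judge whether the two slopes differ is a z-test on the difference, using only the estimates and standard errors reported above. This is a sketch under the assumption that the male and female subsamples are independent and that a normal approximation is acceptable with 98 residual degrees of freedom per group; it is not part of the original analysis.

```r
# Slopes and standard errors copied from the two summary() outputs
b_male   <- 0.076372; se_male   <- 0.001698
b_female <- 0.065623; se_female <- 0.002025

# Difference in slopes and its standard error (independent samples assumed)
diff_b  <- b_male - b_female
se_diff <- sqrt(se_male^2 + se_female^2)

# z statistic and two-sided p-value from the normal approximation
z_stat  <- diff_b / se_diff
p_value <- 2 * pnorm(-abs(z_stat))

print(c(difference = diff_b, z = z_stat, p = p_value))
```

An alternative that works on the raw data is a single model with an interaction term, for example lm(log_salary ~ years_empl * gender, data = SalaryData), where the years_empl:gender coefficient estimates the difference in slopes directly.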