1. Sample description

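This dataset contains information on public service employees, including their salary (€), years of employment, and gender. The aim of the analysis is to describe the sample, examine the association between years of employment and salary, and estimate salary from years of employment.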

# salary
mean(df$salary)
## [1] 122303.5
median(df$salary)
## [1] 97496.12
sd(df$salary)
## [1] 79030.12
min(df$salary)
## [1] 30202.92
max(df$salary)
## [1] 331348.3
hist(df$salary,
     main = "Histogram of Salary",
     xlab = "Salary (€)",
     col = "lightblue",
     breaks = 20)

# employment
mean(df$years_empl)
## [1] 15.73436
median(df$years_empl)
## [1] 16.19143
sd(df$years_empl)
## [1] 9.035618
min(df$years_empl)
## [1] 0.007166897
max(df$years_empl)
## [1] 29.66675
hist(df$years_empl,
     main = "Histogram of Years of Employment",
     xlab = "Years of Employment",
     col = "lightgreen",
     breaks = 20)
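
The same five statistics can be collected in one table rather than one call at a time. The following is a minimal sketch under assumptions: `describe()` is a hypothetical helper, and `df_demo` is a small made-up data frame standing in for `df`, which is not reproduced here.

```r
# Hypothetical stand-in for `df`; the values are invented for illustration only.
df_demo <- data.frame(
  salary     = c(32000, 45000, 61000, 98000, 150000),
  years_empl = c(0.5, 4, 9, 16, 22)
)

# Hypothetical helper: the five statistics reported above for one numeric vector.
describe <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), min = min(x), max = max(x))
}

# One column of statistics per variable, one row per statistic.
stats_table <- sapply(df_demo[c("salary", "years_empl")], describe)
print(round(stats_table, 2))
```

Applied to the real data, the call would be `sapply(df[c("salary", "years_empl")], describe)`.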

2. Association between years of employment and salary (scatterplot)

To explore the relationship between years of employment and salary, a scatterplot was created. Each point represents one employee, showing their years of employment on the x-axis and salary in euros on the y-axis.

The plot reveals a positive, non-linear association between years of employment and salary.

plot(df$years_empl, df$salary,
     main = "Salary vs. Years of Employment",
     xlab = "Years of Employment",
     ylab = "Salary (€)",
     col = "blue", pch = 19)
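
One way to make the curvature easier to judge is to overlay a smoothed trend line on the scatterplot. The sketch below uses base R's `lowess()` for that purpose. It runs on simulated data (`years_demo`, `salary_demo`), which is an assumption made only so the example is self-contained; the simulation parameters are illustrative and are not estimates from the original sample.

```r
# Hypothetical stand-in for `df`: simulated data, not the original sample.
set.seed(42)
years_demo  <- runif(200, min = 0, max = 30)
salary_demo <- exp(10.4 + 0.07 * years_demo + rnorm(200, sd = 0.2))

# Null graphics device so the sketch also runs non-interactively;
# remove this line (and dev.off() below) in an interactive session.
pdf(NULL)

plot(years_demo, salary_demo,
     main = "Salary vs. Years of Employment (simulated)",
     xlab = "Years of Employment",
     ylab = "Salary (€)",
     col = "blue", pch = 19)

# Locally weighted smoother: a curved red line indicates non-linearity.
smooth_fit <- lowess(years_demo, salary_demo)
lines(smooth_fit, col = "red", lwd = 2)

dev.off()
```

With the real data, `lowess(df$years_empl, df$salary)` would take the place of the simulated vectors.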


3. Estimate salary by years of employment

The scatterplot in Step 2 showed a non-linear relationship between years of employment and salary. To linearize the association and improve model fit, a logarithmic transformation is applied to the dependent variable, salary. This helps stabilize the variance and makes the relationship closer to linear, as linear regression assumes.
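
The reasoning behind the transformation can be written out as follows. The symbols \beta_0, \beta_1 and \varepsilon_i are notation introduced here for the intercept, the slope and the error term; they do not appear in the original analysis.

```latex
\log(\text{salary}_i) = \beta_0 + \beta_1 \,\text{years\_empl}_i + \varepsilon_i
\quad\Longleftrightarrow\quad
\text{salary}_i = e^{\beta_0} \, e^{\beta_1 \,\text{years\_empl}_i} \, e^{\varepsilon_i}
```

Under this model, each additional year of employment multiplies salary by e^{\beta_1}, holding the error term fixed, so a curve that is exponential in years becomes a straight line on the log scale.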

model <- lm(log(salary) ~ years_empl, data = df)

summary(model)
## 
## Call:
## lm(formula = log(salary) ~ years_empl, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77041 -0.12197 -0.00111  0.15234  0.41044 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.382774   0.027501  377.54   <2e-16 ***
## years_empl   0.070998   0.001517   46.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1933 on 198 degrees of freedom
## Multiple R-squared:  0.9171, Adjusted R-squared:  0.9167 
## F-statistic:  2191 on 1 and 198 DF,  p-value: < 2.2e-16

4. Interpretation

After log-transforming salary, we estimated a linear model to predict log salary from years of employment. The slope of 0.071 on the log scale corresponds to a salary increase of approximately 7.4% per year of employment (exp(0.071) − 1 ≈ 0.074). The model explains 91.7% of the variation in log salary (R² = 0.917), indicating a strong relationship.
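
The percentage figure comes from back-transforming the slope. The sketch below reproduces that arithmetic using the coefficient printed in the `summary(model)` output; the variable names are illustrative.

```r
# Slope on the log scale, copied from the summary(model) output above.
slope_log <- 0.070998

# Exact percentage change in salary per additional year: 100 * (exp(b) - 1).
pct_per_year <- 100 * (exp(slope_log) - 1)
print(round(pct_per_year, 1))  # 7.4

# The shortcut 100 * b is only an approximation and understates the effect.
approx_pct <- 100 * slope_log
print(round(approx_pct, 1))    # 7.1
```

With the fitted object available, `100 * (exp(coef(model)["years_empl"]) - 1)` gives the same result without copying the number by hand.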

5. (Voluntary) Gender effects

To explore whether the relationship between years of employment and salary differs by gender, we estimated separate log-linear models for male and female employees. This allows us to compare both the growth rate of salary and the overall level (intercept) across genders.

male_df <- subset(df, gender == "Male")
female_df <- subset(df, gender == "Female")

male_model <- lm(log(salary) ~ years_empl, data = male_df)
female_model <- lm(log(salary) ~ years_empl, data = female_df)

summary(male_model)
## 
## Call:
## lm(formula = log(salary) ~ years_empl, data = male_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56063 -0.08644  0.00333  0.06960  0.38121 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.380951   0.030790  337.15   <2e-16 ***
## years_empl   0.076372   0.001698   44.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.153 on 98 degrees of freedom
## Multiple R-squared:  0.9538, Adjusted R-squared:  0.9533 
## F-statistic:  2023 on 1 and 98 DF,  p-value: < 2.2e-16
summary(female_model)
## 
## Call:
## lm(formula = log(salary) ~ years_empl, data = female_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71847 -0.07628  0.01426  0.10656  0.40887 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.384598   0.036725   282.8   <2e-16 ***
## years_empl   0.065623   0.002025    32.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1825 on 98 degrees of freedom
## Multiple R-squared:  0.9146, Adjusted R-squared:  0.9138 
## F-statistic:  1050 on 1 and 98 DF,  p-value: < 2.2e-16

For male employees, the model shows an average salary increase of approximately 7.9% per year of employment (exp(0.0764) − 1), with an R² of 95.4%.

For female employees, salary increases by around 6.8% per year (exp(0.0656) − 1), with an R² of 91.5%.

The intercepts are nearly identical (10.381 for men and 10.385 for women on the log scale), so men and women appear to start at a similar salary, but salary grows faster for men with increasing years of employment. The model fits well for both groups.

This difference could suggest a gender gap in salary progression over time, although two separate models do not by themselves formally test whether the slopes differ.
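
Whether the two growth rates differ by more than sampling noise can be checked roughly from the reported coefficients. The sketch below is an approximate z-test on the difference between the two slopes. It assumes the male and female subsamples are independent and uses the normal distribution as a large-sample approximation; it is not part of the original analysis.

```r
# Slopes and standard errors copied from the two summary() outputs above.
b_male   <- 0.076372
se_male  <- 0.001698
b_female <- 0.065623
se_female <- 0.002025

# Difference in slopes and its standard error (independent samples assumed).
diff_slope <- b_male - b_female
se_diff    <- sqrt(se_male^2 + se_female^2)

# Approximate z statistic and two-sided p-value.
z_stat  <- diff_slope / se_diff
p_value <- 2 * pnorm(-abs(z_stat))

print(round(z_stat, 2))  # 4.07
print(p_value < 0.001)   # TRUE
```

An alternative that tests the same question within a single model is an interaction term, for example `lm(log(salary) ~ years_empl * gender, data = df)`, where the `years_empl:gender` coefficient estimates the difference in slopes.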