This dataset contains information on public service employees, including their salary (€), years of employment, and gender. The aim of the analysis is to:
Describe the sample using basic statistics.
Visualize the relationship between years of employment and salary.
Build a regression model to estimate salary based on years of employment (including transformation if needed).
Interpret the model and its coefficients.
# salary
mean(df$salary)
## [1] 122303.5
median(df$salary)
## [1] 97496.12
sd(df$salary)
## [1] 79030.12
min(df$salary)
## [1] 30202.92
max(df$salary)
## [1] 331348.3
hist(df$salary,
     main = "Histogram of Salary",
     xlab = "Salary (€)",
     col = "lightblue",
     breaks = 20)
# employment
mean(df$years_empl)
## [1] 15.73436
median(df$years_empl)
## [1] 16.19143
sd(df$years_empl)
## [1] 9.035618
min(df$years_empl)
## [1] 0.007166897
max(df$years_empl)
## [1] 29.66675
hist(df$years_empl,
     main = "Histogram of Years of Employment",
     xlab = "Years of Employment",
     col = "lightgreen",
     breaks = 20)
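The same five statistics can also be collected into one table rather than computed call by call. The following is a minimal sketch, not part of the original analysis: `demo_df` is a simulated stand-in for `df`, and `describe()` is a hypothetical helper introduced only for illustration.

```r
# Simulated stand-in for the real data (assumption); replace demo_df with df.
set.seed(1)
demo_df <- data.frame(years_empl = runif(200, min = 0, max = 30))
demo_df$salary <- exp(10.4 + 0.07 * demo_df$years_empl + rnorm(200, sd = 0.2))

# Hypothetical helper: the five statistics used above, as a named vector.
describe <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), min = min(x), max = max(x))
}

# One column per variable, one row per statistic.
desc_table <- sapply(demo_df[c("salary", "years_empl")], describe)
round(desc_table, 2)
```

With the real data, `sapply(df[c("salary", "years_empl")], describe)` should reproduce the values reported above in a single matrix.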
To explore the relationship between years of employment and salary, a scatterplot was created. Each point represents one employee, showing their years of employment on the x-axis and salary in euros on the y-axis.
The plot reveals a positive, non-linear association:
In general, employees with more years of employment tend to earn higher salaries.
However, the increase in salary is not constant: salaries appear to rise more steeply at higher levels of experience, suggesting a curved rather than a straight-line relationship.
plot(df$years_empl, df$salary,
     main = "Salary vs. Years of Employment",
     xlab = "Years of Employment",
     ylab = "Salary (€)",
     col = "blue", pch = 19)
The scatterplot in Step 2 showed a non-linear relationship between years of employment and salary. To linearize the association and improve model fit, we apply a logarithmic transformation to the dependent variable, salary. This helps stabilize the variance and makes the relationship more linear, which suits linear regression.
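One way to check whether the transformation actually linearizes the relationship is to plot salary on the raw and the log scale side by side. This is only a sketch under stated assumptions: `demo_df` is simulated data that mimics the structure of `df`, and `log_fit` is an illustrative name, not an object from the original analysis.

```r
# Simulated stand-in for the real data (assumption); replace demo_df with df.
set.seed(42)
demo_df <- data.frame(years_empl = runif(200, min = 0, max = 30))
demo_df$salary <- exp(10.4 + 0.07 * demo_df$years_empl + rnorm(200, sd = 0.2))

# Two panels: raw salary (curved) and log salary (roughly straight).
old_par <- par(mfrow = c(1, 2))
plot(demo_df$years_empl, demo_df$salary,
     main = "Raw scale",
     xlab = "Years of Employment", ylab = "Salary (€)",
     col = "blue", pch = 19)
plot(demo_df$years_empl, log(demo_df$salary),
     main = "Log scale",
     xlab = "Years of Employment", ylab = "log(Salary)",
     col = "blue", pch = 19)

# Overlay the fitted line on the log-scale panel.
log_fit <- lm(log(salary) ~ years_empl, data = demo_df)
abline(log_fit, col = "red", lwd = 2)
par(old_par)
```

If the points in the right-hand panel scatter evenly around the red line, a linear model on the log scale is a reasonable choice.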
model <- lm(log(salary) ~ years_empl, data = df)
summary(model)
##
## Call:
## lm(formula = log(salary) ~ years_empl, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77041 -0.12197 -0.00111 0.15234 0.41044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.382774 0.027501 377.54 <2e-16 ***
## years_empl 0.070998 0.001517 46.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1933 on 198 degrees of freedom
## Multiple R-squared: 0.9171, Adjusted R-squared: 0.9167
## F-statistic: 2191 on 1 and 198 DF, p-value: < 2.2e-16
After log-transforming salary, we estimated a linear model of log salary on years of employment. Because the dependent variable is on the log scale, the slope of 0.071 corresponds to a multiplicative effect: salary increases by approximately 7.4% per additional year of employment (exp(0.071) − 1). The model explains 91.7% of the total variation in log salary, indicating a strong relationship.
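The percentage figure follows from back-transforming the slope. As a short worked example, using the `years_empl` estimate copied from the summary output above (`b1` is just an illustrative name):

```r
# Slope of years_empl from the summary(model) output above.
b1 <- 0.070998

# Exact percentage change in salary per additional year of employment.
pct_exact <- (exp(b1) - 1) * 100
round(pct_exact, 1)
## [1] 7.4

# The common shortcut 100 * b1 slightly understates the effect.
pct_approx <- 100 * b1
round(pct_approx, 1)
## [1] 7.1
```

The shortcut is adequate for small coefficients, but the gap widens as the slope grows, so the exact form is the safer one to report.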
To explore whether the relationship between years of employment and salary differs by gender, we estimated separate log-linear models for male and female employees. This allows us to compare both the growth rate of salary and the overall level (intercept) across genders.
male_df <- subset(df, gender == "Male")
female_df <- subset(df, gender == "Female")
male_model <- lm(log(salary) ~ years_empl, data = male_df)
female_model <- lm(log(salary) ~ years_empl, data = female_df)
summary(male_model)
##
## Call:
## lm(formula = log(salary) ~ years_empl, data = male_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.56063 -0.08644 0.00333 0.06960 0.38121
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.380951 0.030790 337.15 <2e-16 ***
## years_empl 0.076372 0.001698 44.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.153 on 98 degrees of freedom
## Multiple R-squared: 0.9538, Adjusted R-squared: 0.9533
## F-statistic: 2023 on 1 and 98 DF, p-value: < 2.2e-16
summary(female_model)
##
## Call:
## lm(formula = log(salary) ~ years_empl, data = female_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.71847 -0.07628 0.01426 0.10656 0.40887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.384598 0.036725 282.8 <2e-16 ***
## years_empl 0.065623 0.002025 32.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1825 on 98 degrees of freedom
## Multiple R-squared: 0.9146, Adjusted R-squared: 0.9138
## F-statistic: 1050 on 1 and 98 DF, p-value: < 2.2e-16
The separate models give the following results.
For male employees, salary increases by approximately 7.9% per year of employment (exp(0.0764) − 1), and the model explains 95.4% of the variation in log salary.
For female employees, salary increases by approximately 6.8% per year (exp(0.0656) − 1), with an R² of 91.5%.
The intercepts are almost identical (10.381 for men and 10.385 for women on the log scale, roughly €32,000 in both cases), so men and women appear to start at a similar salary, and the model fits well for both groups.
Salary grows faster for men with increasing years of employment, which could suggest a gender gap in salary progression over time. Because the two models were fitted separately, this difference in slopes has not been formally tested here.
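A pooled model with an interaction term is one way to test the slope difference formally. The sketch below is not part of the original analysis: it runs on simulated data (`demo_df`) whose slopes are merely set close to the estimates above, and `int_model` is a hypothetical name.

```r
# Simulated stand-in for the real data (assumption); replace demo_df with df.
set.seed(123)
n <- 200
demo_df <- data.frame(
  years_empl = runif(n, min = 0, max = 30),
  gender = factor(rep(c("Female", "Male"), each = n / 2))
)
# Group-specific growth rates, chosen to resemble the estimates above.
slope <- ifelse(demo_df$gender == "Male", 0.076, 0.066)
demo_df$salary <- exp(10.38 + slope * demo_df$years_empl + rnorm(n, sd = 0.17))

# years_empl * gender expands to both main effects plus their interaction.
int_model <- lm(log(salary) ~ years_empl * gender, data = demo_df)
coef(summary(int_model))
```

In this parameterization, with "Female" as the reference level, the `years_empl:genderMale` row estimates the difference in slopes (male minus female) and its t-test assesses whether the growth rates differ, while `genderMale` estimates the difference in intercepts. Unlike the separate fits, the pooled model assumes a common residual variance for both groups.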