This dataset contains salary (in euros), years of employment, and gender for public sector employees. We explore the number of observations, the gender distribution, and the central tendency and variation of salary and years of employment.
# Clean column names
names(df) <- trimws(names(df))
# Convert data types
df$salary <- as.numeric(df$salary)
df$years <- as.numeric(df$years)
df$gender <- as.factor(df$gender)
# Descriptive statistics
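nrow(df) # Number of observations
## [1] 200
table(df$gender) # Gender distribution
##
## Female Male
## 100 100
mean(df$salary) # Mean salary (€)
## [1] 122303.5
mean(df$years) # Mean years of employment
## [1] 15.73436
sd(df$salary) # Standard deviation of salary (€)
## [1] 79030.12
sd(df$years) # Standard deviation of years of employment
## [1] 9.035618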
# Salary summary
salary_stats <- summary(df$salary)
# Build a two-column table; as.data.frame(t(summary(...))) yields uninformative Var1/Var2/Freq columns
salary_summary <- data.frame(Statistic = names(salary_stats), Value = as.numeric(salary_stats))
kbl(salary_summary, digits = 2, caption = "Salary Summary (€)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE)
Statistic | Value |
---|---|
Min. | 30202.92 |
1st Qu. | 54207.56 |
Median | 97496.12 |
Mean | 122303.45 |
3rd Qu. | 179446.88 |
Max. | 331348.26 |
The plot below shows the relationship between years of employment and salary. As years of employment increase, salaries tend to rise, but the trend looks curved rather than straight, suggesting a non-linear relationship.
# Scatterplot of years vs salary
plot(x = df$years, y = df$salary,
xlab = "Years of Employment",
ylab = "Salary (€)",
main = "Scatterplot: Years of Employment vs Salary",
pch = 19, col = "blue")
# Add a least-squares line fitted on the raw (not yet log-transformed) salary scale
abline(lm(salary ~ years, data = df), col = "red", lwd = 2)
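The curvature described above can also be checked numerically rather than by eye. The sketch below is only an illustration: `sim_df` is simulated stand-in data with made-up parameters, not the report's dataset, and the names `fit_line` and `fit_curve` are hypothetical. It compares a straight-line fit with one that adds a squared term; a positive and significant squared term suggests an upward-bending trend.

```r
# Simulated stand-in data (hypothetical values, NOT the report's dataset)
set.seed(1)
sim_df <- data.frame(years = runif(200, min = 0, max = 35))
sim_df$salary <- exp(10.4 + 0.07 * sim_df$years + rnorm(200, sd = 0.2))

# Straight-line fit on the raw salary scale
fit_line <- lm(salary ~ years, data = sim_df)

# Same fit with an added squared term to capture curvature
fit_curve <- lm(salary ~ years + I(years^2), data = sim_df)

# F-test: does the squared term improve on the straight line?
curvature_test <- anova(fit_line, fit_curve)
print(curvature_test)

# Sign of the squared term: positive means the curve bends upward
quad_coef <- coef(fit_curve)[["I(years^2)"]]
print(quad_coef)
```

On real data the same two calls to `lm()` would be run with `data = df`; a clearly significant squared term is one reason to look for a transformation that straightens the relationship.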
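To make the relationship more linear, I applied a natural-log transformation to salary, because salary appears to grow by a roughly constant percentage per year rather than by a constant amount. I then fitted a linear model predicting log-salary from years of employment. Finally, I transformed the predictions back to the original euro scale by exponentiating them.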
# Log-transform salary and model
df$log_salary <- log(df$salary)
model <- lm(log_salary ~ years, data = df)
# Model summary table
model_summary <- as.data.frame(summary(model)$coefficients)
kbl(model_summary, caption = "Regression Model Summary (log-salary ~ years)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE)
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 10.3827745 | 0.0275009 | 377.54280 | 0 |
years | 0.0709977 | 0.0015166 | 46.81294 | 0 |
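The back-transformation step mentioned in the approach is not shown in the code above. A minimal sketch of how it could look follows, assuming simulated stand-in data: `sim_df`, `sim_model`, and `new_years` are hypothetical names, and the simulation parameters are made up for illustration.

```r
# Simulated stand-in data (hypothetical values, NOT the report's dataset)
set.seed(2)
sim_df <- data.frame(years = runif(200, min = 0, max = 35))
sim_df$log_salary <- 10.4 + 0.07 * sim_df$years + rnorm(200, sd = 0.2)

# Linear model on the log scale, same form as the report's model
sim_model <- lm(log_salary ~ years, data = sim_df)

# Years of employment at which to predict (illustrative choices)
new_years <- data.frame(years = c(0, 10, 20, 30))

# Predictions and 95% prediction intervals on the log scale
pred_log <- predict(sim_model, newdata = new_years, interval = "prediction")

# Back-transform to euros by exponentiating fit, lower, and upper bounds
pred_eur <- exp(pred_log)
print(cbind(new_years, round(pred_eur)))
```

Exponentiating a log-scale prediction gives an estimate of the median salary at that number of years rather than the mean, so back-transformed values tend to sit somewhat below the arithmetic average when salaries are right-skewed.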
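The model shows that salary increases with each additional year of employment. The slope coefficient (about 0.071 on the log scale) means that each extra year worked is associated with a salary that is about 7.4% higher on average, since exp(0.071) - 1 ≈ 0.074. The p-value is displayed as 0 only because it is smaller than the table's display precision. This indicates a positive and statistically clear relationship between years of employment and salary.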
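The arithmetic behind reading a log-scale slope as a percentage can be worked through directly. The sketch below uses the estimates printed in the regression table; the variable names are hypothetical and chosen only for this illustration.

```r
# Estimates copied from the regression table (log-salary ~ years)
slope     <- 0.0709977
intercept <- 10.3827745

# Exact percent change in salary per additional year: (exp(b) - 1) * 100
pct_per_year <- (exp(slope) - 1) * 100

# Common shortcut: b * 100, reasonable only when b is small
approx_pct <- slope * 100

# Predicted salary at zero years of employment, back on the euro scale
start_salary <- exp(intercept)

# Multiplicative salary growth implied over ten years
growth_10y <- exp(slope * 10)

print(c(pct_per_year = pct_per_year, approx_pct = approx_pct,
        start_salary = start_salary, growth_10y = growth_10y))
```

The exact figure (about 7.4%) is slightly larger than the shortcut (about 7.1%) because percentage growth compounds; the implied starting salary is roughly €32,300 and salary roughly doubles over ten years.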
Here we check if the relationship between years of employment and salary is different for males and females. The separate models below show the results for each gender.
# Convert gender column to lowercase for consistency
df$gender <- tolower(df$gender)
# Model for males
model_male <- lm(log(salary) ~ years, data = df[df$gender == "male", ])
summary_male <- as.data.frame(summary(model_male)$coefficients)
kbl(summary_male, caption = "Male Model Summary (log-salary ~ years)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE)
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 10.3809505 | 0.0307899 | 337.15408 | 0 |
years | 0.0763721 | 0.0016980 | 44.97743 | 0 |
# Model for females
model_female <- lm(log(salary) ~ years, data = df[df$gender == "female", ])
summary_female <- as.data.frame(summary(model_female)$coefficients)
kbl(summary_female, caption = "Female Model Summary (log-salary ~ years)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE)
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 10.3845984 | 0.0367255 | 282.76299 | 0 |
years | 0.0656234 | 0.0020253 | 32.40113 | 0 |
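Both models show a positive relationship between years of employment and salary, but the growth rate differs by gender. The slope is 0.0764 for males and 0.0656 for females, which corresponds to salary growth of about 7.9% versus 6.8% per additional year of employment. The intercepts are nearly identical (10.381 and 10.385), so predicted starting salaries are similar and the gap widens as years of employment accumulate. Because the two models are fitted separately, they do not by themselves test whether the difference in slopes is statistically significant.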
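One way to judge whether the two slopes really differ is sketched below. The first part uses only the estimates and standard errors printed in the male and female tables, treating the two fits as independent because they use disjoint subsets of the data. The second part shows the more usual single-model alternative with a `years * gender` interaction; it runs on simulated stand-in data (`sim_df`, with made-up parameters), so its output says nothing about the real dataset. All variable names here are hypothetical.

```r
# Part 1: approximate z-statistic from the reported tables
b_male   <- 0.0763721; se_male   <- 0.0016980
b_female <- 0.0656234; se_female <- 0.0020253

slope_diff <- b_male - b_female
se_diff    <- sqrt(se_male^2 + se_female^2)  # independent subsets assumed
z_diff     <- slope_diff / se_diff
print(c(slope_diff = slope_diff, se_diff = se_diff, z_diff = z_diff))

# Part 2: interaction model on simulated data (NOT the report's dataset)
set.seed(3)
sim_df <- data.frame(
  years  = runif(200, min = 0, max = 35),
  gender = rep(c("female", "male"), each = 100)
)
sim_slope <- ifelse(sim_df$gender == "male", 0.076, 0.066)
sim_df$salary <- exp(10.38 + sim_slope * sim_df$years + rnorm(200, sd = 0.2))

# years:gendermale estimates how much the male slope exceeds the female slope
sim_inter <- lm(log(salary) ~ years * gender, data = sim_df)
print(coef(summary(sim_inter)))
```

With the reported values the slope difference is about four times its standard error, which points to a real difference in growth rates. On the actual data, the interaction model would be fitted with `data = df`, and the `years:gendermale` row would give the formal test.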