1. Sample description

This dataset contains salary (in euros), years of employment, and gender for public sector employees. Below we report the number of observations, the gender distribution, and the central tendency and variation of salary and years of employment.

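# Load kableExtra, which provides kbl() and kable_styling() used for the tables below
library(kableExtra)

# Clean column names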
names(df) <- trimws(names(df))

# Convert data types
df$salary <- as.numeric(df$salary)
df$years <- as.numeric(df$years)
df$gender <- as.factor(df$gender)

# Descriptive statistics
nrow(df)         # Number of observations
## [1] 200
table(df$gender) # Frequency of gender
## 
## Female   Male 
##    100    100
mean(df$salary)  # Average salary
## [1] 122303.5
mean(df$years)   # Average years
## [1] 15.73436
sd(df$salary)    # Standard deviation salary
## [1] 79030.12
sd(df$years)     # Standard deviation years
## [1] 9.035618

# Salary summary
salary_summary <- as.data.frame(t(summary(df$salary)))
kbl(salary_summary, caption = "Salary Summary") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE)

Salary Summary
Var1  Var2     Freq
A     Min.      30202.92
A     1st Qu.   54207.56
A     Median    97496.12
A     Mean     122303.45
A     3rd Qu.  179446.88
A     Max.     331348.26
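
The `Var1` / `Var2` / `Freq` headers in the table above come from calling `as.data.frame()` on a transposed `summary()` object. A minimal sketch of one way to get labelled columns instead, using simulated salaries as a stand-in for `df$salary` (the simulated data and the column names `Statistic` and `Value` are assumptions made for illustration):

```r
# Simulated stand-in for df$salary (assumption: positive, right-skewed values)
set.seed(1)
salary <- exp(rnorm(200, mean = 11.5, sd = 0.6))

# summary() returns a named object; unpack the names and values explicitly
s <- summary(salary)
salary_summary <- data.frame(
  Statistic = names(s),
  Value     = as.numeric(s)
)

salary_summary
```

A data frame built this way can be passed to `kbl()` in the same way as before. In the reported table the mean (122303.45) lies well above the median (97496.12), which points to a right-skewed salary distribution.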


2. Association between years and salary as a scatterplot

The plot below shows the relationship between years of employment and salary. As years of employment increase, salaries tend to rise, but the trend looks curved rather than straight, suggesting a non-linear relationship.

# Scatterplot of years vs salary
plot(x = df$years, y = df$salary, 
     xlab = "Years of Employment", 
     ylab = "Salary (€)", 
     main = "Scatterplot: Years of Employment vs Salary", 
     pch = 19, col = "blue")

# Add a straight-line fit on the raw salary scale (not log-transformed yet)
abline(lm(salary ~ years, data = df), col = "red", lwd = 2)
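
The visual impression of curvature can be backed up with a quick numeric check. The sketch below compares the Pearson correlation of years with salary against the correlation of years with log-salary. It uses simulated data, not the report's `df`; the exponential growth rate and noise level are assumptions chosen only to mimic the shape seen in the plot.

```r
# Simulated stand-in for the data (assumption: salary grows exponentially with years)
set.seed(42)
years  <- runif(200, min = 0, max = 35)
salary <- exp(10.4 + 0.07 * years + rnorm(200, sd = 0.2))

# Pearson correlation only measures linear association
r_raw <- cor(years, salary)       # years vs salary in euros
r_log <- cor(years, log(salary))  # years vs log-salary

c(raw = r_raw, log = r_log)
```

If the correlation is noticeably higher on the log scale, that is consistent with a relationship that is closer to linear after a log transformation of salary.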


3. Estimate salary by years of employment

To make the relationship more linear, I applied a log transformation to the salary variable. Then I used a linear model to predict log-salary from years of employment. Finally, I transformed the predictions back to the original scale.

# Log-transform salary and model
df$log_salary <- log(df$salary)
model <- lm(log_salary ~ years, data = df)

# Model summary table
model_summary <- as.data.frame(summary(model)$coefficients)
kbl(model_summary, caption = "Regression Model Summary (log-salary ~ years)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE)

Regression Model Summary (log-salary ~ years)
              Estimate    Std. Error  t value    Pr(>|t|)
(Intercept)   10.3827745  0.0275009   377.54280  0
years          0.0709977  0.0015166    46.81294  0

# Predicted salary (back-transformed)
df$salary_pred <- exp(predict(model))
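
One caveat about the back-transformation: `exp()` of a log-scale prediction targets the median (geometric mean) salary at a given number of years rather than the arithmetic mean, assuming roughly symmetric errors on the log scale. If a mean prediction is wanted, one common option is Duan's smearing correction. The sketch below illustrates it on simulated data; the data frame `d` and all parameter values are assumptions, and this step is not part of the original analysis.

```r
# Simulated stand-in for the data (assumed parameters, not the report's df)
set.seed(7)
years  <- runif(200, min = 0, max = 35)
salary <- exp(10.4 + 0.07 * years + rnorm(200, sd = 0.3))
d <- data.frame(years, salary)

model <- lm(log(salary) ~ years, data = d)

# Naive back-transform: approximates the median salary at each value of years
pred_median <- exp(fitted(model))

# Duan's smearing factor: the average of the exponentiated residuals
smear <- mean(exp(residuals(model)))

# Smearing-corrected prediction of the mean salary
pred_mean <- smear * pred_median

smear
```

The smearing factor is at least 1, so the corrected predictions are never below the naive ones; the gap grows with the residual variance on the log scale.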


4. Interpretation

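The model shows that salary increases with each additional year of employment. The slope coefficient is about 0.071 on the log scale, so each extra year of employment multiplies predicted salary by exp(0.071) ≈ 1.074, an increase of about 7.4% per year. The intercept (10.38) corresponds to a predicted salary of roughly €32,300 at zero years of employment. The slope is estimated precisely (standard error 0.0015, t ≈ 46.8), which indicates a positive and statistically clear relationship between years of employment and salary.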
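
The conversion from log-scale coefficients to the euro scale can be reproduced directly from the reported estimates. This is a small arithmetic sketch; the variable names are made up for illustration, and the doubling time is an extra derived quantity that the report itself does not state.

```r
# Coefficients copied from the regression model summary (log-salary ~ years)
intercept <- 10.3827745
slope     <- 0.0709977

# Percentage change in predicted salary per additional year of employment
pct_per_year <- (exp(slope) - 1) * 100

# Predicted salary (in euros) at zero years of employment
salary_at_0 <- exp(intercept)

# Years of employment needed for the predicted salary to double
doubling_years <- log(2) / slope

round(c(pct_per_year = pct_per_year,
        salary_at_0 = salary_at_0,
        doubling_years = doubling_years), 2)
```

For small slopes, `slope * 100` is a reasonable approximation of the percentage change, but `exp(slope) - 1` is the exact multiplicative effect implied by the model.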


5. (Voluntary) Gender effects

Here we check if the relationship between years of employment and salary is different for males and females. The separate models below show the results for each gender.

# Convert gender column to lowercase for consistency
df$gender <- tolower(df$gender)

# Model for males
model_male <- lm(log(salary) ~ years, data = df[df$gender == "male", ])
summary_male <- as.data.frame(summary(model_male)$coefficients)
kbl(summary_male, caption = "Male Model Summary (log-salary ~ years)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE)

Male Model Summary (log-salary ~ years)
              Estimate    Std. Error  t value    Pr(>|t|)
(Intercept)   10.3809505  0.0307899   337.15408  0
years          0.0763721  0.0016980    44.97743  0

# Model for females
model_female <- lm(log(salary) ~ years, data = df[df$gender == "female", ])
summary_female <- as.data.frame(summary(model_female)$coefficients)
kbl(summary_female, caption = "Female Model Summary (log-salary ~ years)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE)

Female Model Summary (log-salary ~ years)
              Estimate    Std. Error  t value    Pr(>|t|)
(Intercept)   10.3845984  0.0367255   282.76299  0
years          0.0656234  0.0020253    32.40113  0

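Both models show a positive relationship between years and salary, and the growth rate differs by gender. The slope is 0.0764 for males and 0.0656 for females, so predicted salary grows by about 7.9% per year of employment for males and about 6.8% per year for females. The intercepts are nearly identical (10.381 vs 10.385), so predicted starting salaries are similar and the gap widens with years of employment. Separate models do not by themselves test whether the difference in slopes is statistically significant.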
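
One way to test the difference in slopes formally is a single model with a years-by-gender interaction. The sketch below runs on simulated data rather than the report's `df`; the data frame `d`, the sample layout, and the data-generating slopes (set close to the reported estimates) are all assumptions made for illustration.

```r
# Simulated stand-in for the data (assumed slopes: 0.066 female, 0.076 male)
set.seed(123)
n      <- 200
gender <- factor(rep(c("female", "male"), each = n / 2))
years  <- runif(n, min = 0, max = 35)
slope  <- ifelse(gender == "male", 0.076, 0.066)
salary <- exp(10.38 + slope * years + rnorm(n, sd = 0.2))
d <- data.frame(salary, years, gender)

# One model with an interaction: "years:gendermale" is the male-minus-female slope
model_int <- lm(log(salary) ~ years * gender, data = d)
coef(summary(model_int))

# Cross-check: the interaction estimate equals the gap between the separate slopes
slope_f <- coef(lm(log(salary) ~ years, data = d[d$gender == "female", ]))["years"]
slope_m <- coef(lm(log(salary) ~ years, data = d[d$gender == "male", ]))["years"]
unname(slope_m - slope_f)
```

The t-test on the `years:gendermale` row then addresses the question of whether salary growth per year differs between the two groups, which the two separate tables cannot answer directly.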