Homework 3

1. Sample description

This dataset covers 200 public service employees (100 female, 100 male) and records each person's salary (in euros), years of employment, and gender.

# Clean column names to avoid issues with spaces
names(df) <- trimws(names(df))

# Convert data types 
df$salary <- as.numeric(df$salary)
df$years <- as.numeric(df$years)
df$gender <- as.factor(df$gender)

# Number of rows (observations)
nrow(df)

## [1] 200

# Frequency table for gender
table(df$gender)

## 
## Female   Male 
##    100    100

# Means
mean_salary <- mean(df$salary)
mean_years <- mean(df$years)

# Standard deviations
sd_salary <- sd(df$salary)
sd_years <- sd(df$years)

# Five-number summary and mean of salary
summary(df$salary)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30203   54208   97496  122304  179447  331348
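The means and standard deviations above are stored but never displayed. A minimal sketch of how they could be collected into one small table is shown here. The data frame `toy` and its values are invented for illustration and are not the homework data.

```r
# Toy stand-in for df; the values are made up for illustration only
toy <- data.frame(
  salary = c(32000, 41000, 58000, 97000, 150000, 210000),
  years  = c(1, 5, 9, 16, 22, 27)
)

# One row per variable, one column per statistic
desc <- data.frame(
  mean = sapply(toy, mean),
  sd   = sapply(toy, sd)
)

print(round(desc, 1))
```

With the real data, the same two `sapply()` calls could be applied to `df[, c("salary", "years")]`.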

2. Association between years and salary as scatterplot.

The scatterplot shows the relationship between years of employment and salary among public service employees. Each point represents one employee, with years of employment on the x-axis and salary on the y-axis. The upward-sloping pink regression line indicates a positive association: salary tends to rise with the number of years worked, so longer employment is associated with higher pay.

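# Scatterplot of years (independent) vs. salary (dependent)
plot(x = df$years, y = df$salary,
     xlab = "Years of employment", ylab = "Salary (euros)")
# Overlay the least-squares line from a simple linear fit
abline(lm(salary ~ years, data = df), col = "pink", lwd = 2)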
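One way to check whether a straight line describes such a scatterplot adequately is to redraw it with salary on a log scale and compare the correlations on both scales. The sketch below does this on simulated data. The data frame `toy`, the seed, and the growth parameters are assumptions chosen for illustration, not values taken from the homework data.

```r
# Simulated stand-in for df: salary grows exponentially with years (assumption)
set.seed(1)
toy <- data.frame(years = runif(200, min = 0, max = 35))
toy$salary <- exp(10.4 + 0.07 * toy$years + rnorm(200, sd = 0.2))

# Correlation with years on the raw scale and on the log scale
cor_raw <- cor(toy$years, toy$salary)
cor_log <- cor(toy$years, log(toy$salary))
print(c(raw = cor_raw, log = cor_log))

# Same scatterplot, but with a logarithmic y-axis
plot(toy$years, toy$salary, log = "y",
     xlab = "Years of employment", ylab = "Salary (euros, log scale)")
```

If the points fall along a straight band only on the log-scale plot, that points towards modelling log(salary) rather than salary.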


3. Estimate salary by years of employment

We noticed that the relationship between salary and years of employment was not linear, so we log-transformed salary to straighten the pattern and make it suitable for linear regression.

df$log_salary <- log(df$salary)

model <- lm(log_salary ~ years, data = df)
df$log_salary_pred <- predict(model)  # predicted log(salary)
df$salary_pred <- exp(df$log_salary_pred)  # back-transform to the original scale (euros)
summary(model)

## 
## Call:
## lm(formula = log_salary ~ years, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77041 -0.12197 -0.00111  0.15234  0.41044 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.382774   0.027501  377.54   <2e-16 ***
## years        0.070998   0.001517   46.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1933 on 198 degrees of freedom
## Multiple R-squared:  0.9171, Adjusted R-squared:  0.9167 
## F-statistic:  2191 on 1 and 198 DF,  p-value: < 2.2e-16
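A log-linear model like this one can also be used to estimate salaries for employees who are not in the sample. The following sketch refits the same kind of model on simulated data and back-transforms the predictions to euros. The data frames `toy` and `new_staff`, the seed, and the simulation parameters are assumptions for illustration only.

```r
# Simulated stand-in for df (assumed parameters, not the homework data)
set.seed(42)
toy <- data.frame(years = runif(200, min = 0, max = 35))
toy$salary <- exp(10.4 + 0.07 * toy$years + rnorm(200, sd = 0.2))

# Same model form as above: log(salary) regressed on years
fit <- lm(log(salary) ~ years, data = toy)

# Hypothetical new employees with 5 and 20 years of employment
new_staff <- data.frame(years = c(5, 20))

# Point predictions and 95% prediction intervals on the log scale
pred_log <- predict(fit, newdata = new_staff, interval = "prediction")

# Back-transform every column (fit, lwr, upr) to euros
pred_eur <- exp(pred_log)
print(round(pred_eur))
```

Note that exponentiating a predicted log salary estimates the median salary at that number of years, not the mean, so back-transformed predictions tend to sit below the arithmetic average when the residuals are roughly normal on the log scale.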

4. Interpretation

The regression model shows a strong positive relationship between years of employment and the log of salary. The coefficient for years is 0.071. Because the outcome is log(salary), each additional year of employment is associated with an average salary increase of about 7.4% (exp(0.071) - 1 ≈ 0.074; reading the coefficient directly as 7.1% is the usual small-change approximation). The intercept of 10.38 is the expected log salary at years = 0, which corresponds to a salary of about 32,300 euros. The model fits the data very well: an R-squared of 0.917 means that about 92% of the variation in log salary is explained by years of employment, and the residual standard error is 0.193 on the log scale. The very low p-values (both < 2e-16) show that the intercept and the slope differ significantly from zero.
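The conversion from log-scale coefficients to percentages and euros can be made explicit with a few lines of arithmetic. This sketch uses the two estimates printed in the `summary(model)` output. The variable names are chosen here for illustration.

```r
# Estimates copied from the summary(model) output above
b_years   <- 0.070998
intercept <- 10.382774

# Exact percentage change in salary per additional year
pct_per_year <- 100 * (exp(b_years) - 1)

# Percentage change over ten years (growth compounds on the log scale)
pct_per_decade <- 100 * (exp(10 * b_years) - 1)

# Expected salary in euros at years = 0 (back-transformed intercept)
salary_at_zero <- exp(intercept)

print(round(c(pct_per_year   = pct_per_year,
              pct_per_decade = pct_per_decade,
              salary_at_zero = salary_at_zero), 2))
```

The yearly figure comes out slightly above the 7.1% obtained by reading the coefficient directly, and the ten-year figure shows that under this model salary roughly doubles per decade of employment.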


2025-05-23
