This dataset includes public service employees’ salaries (in euros), years of employment, and gender.
# Clean column names to avoid issues with spaces
names(df) <- trimws(names(df))
# Convert data types
df$salary <- as.numeric(df$salary)
df$years <- as.numeric(df$years)
df$gender <- as.factor(df$gender)
# Number of rows (observations)
nrow(df)
## [1] 200
# Frequency table for gender
table(df$gender)
## 
## Female   Male 
##    100    100 
# Means
mean_salary <- mean(df$salary)
mean_years <- mean(df$years)
# Standard deviations
sd_salary <- sd(df$salary)
sd_years <- sd(df$years)
summary(df$salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30203   54208   97496  122304  179447  331348 
The scatter plot shows the relationship between years of employment and salary among public service employees. Each point represents one employee, with years on the x-axis and salary on the y-axis. The pink line is a fitted least-squares regression line; its upward slope indicates a positive association: employees with more years of service tend to earn more.
# Scatterplot of Years (independent) vs Salary (dependent)
plot(x = df$years, y = df$salary,
     xlab = "Years of employment", ylab = "Salary (EUR)")
abline(lm(salary ~ years, data = df), col = "pink", lwd = 2)
We noticed that the relationship between salary and years of employment was not linear, so we applied a natural-logarithm transformation to salary. This straightens the pattern and makes it suitable for modelling with linear regression.
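One way to check for this kind of nonlinearity is to fit the model on both scales and compare the fits. The sketch below does this on simulated data, not on the report's dataset: the data frame `sim` and its generating parameters are assumptions chosen only for illustration. Note that the two R-squared values are measured on different response scales, so they are a rough guide rather than a formal test.

```r
# Simulated (hypothetical) data: salary grows exponentially with tenure
set.seed(1)
n <- 200
sim <- data.frame(years = runif(n, min = 0, max = 35))
sim$salary <- exp(10.4 + 0.07 * sim$years + rnorm(n, sd = 0.2))

# Fit salary on the raw scale and on the log scale
fit_raw <- lm(salary ~ years, data = sim)
fit_log <- lm(log(salary) ~ years, data = sim)

# R-squared of each fit, measured on its own response scale
r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
c(raw = r2_raw, log = r2_log)

# Residuals vs fitted values: a curved band in the raw fit signals nonlinearity
par(mfrow = c(1, 2))
plot(fitted(fit_raw), resid(fit_raw), main = "Raw salary")
plot(fitted(fit_log), resid(fit_log), main = "log(salary)")
```

If the residuals of the raw-scale fit show a curved band while those of the log-scale fit scatter evenly around zero, the log transformation is a reasonable choice.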
df$log_salary <- log(df$salary)
model <- lm(log_salary ~ years, data = df)
df$log_salary_pred <- predict(model) # predicted log(salary)
df$salary_pred <- exp(df$log_salary_pred) # retransform to original scale
summary(model)
##
## Call:
## lm(formula = log_salary ~ years, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77041 -0.12197 -0.00111  0.15234  0.41044 
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.382774   0.027501  377.54   <2e-16 ***
## years        0.070998   0.001517   46.81   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1933 on 198 degrees of freedom
## Multiple R-squared: 0.9171, Adjusted R-squared: 0.9167
## F-statistic: 2191 on 1 and 198 DF, p-value: < 2.2e-16
The regression model shows a strong positive relationship between years of employment and the log of salary. The coefficient for years is 0.071. Because the response is log(salary), each additional year of employment multiplies the expected salary by exp(0.071) ≈ 1.074, an increase of about 7.4% per year (the coefficient itself, 7.1%, is only a first-order approximation). The intercept is 10.38, the expected log salary when years = 0, which corresponds to a starting salary of roughly exp(10.38) ≈ 32,300 euros. The model fits the data very well: the R-squared of 0.917 means that about 92% of the variation in log salary is explained by years of employment. The very low p-values (both < 2e-16) show that the intercept and the slope are statistically significant. The residual standard error of 0.193 is on the log scale, so individual salaries typically deviate from the fitted value by roughly 20%.
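The percentage interpretation of a log-linear coefficient can be worked through directly. The following is a minimal sketch that copies the two coefficient estimates from the `summary(model)` output; the tenure values in `years_example` are hypothetical and chosen only for illustration. Exponentiating a fitted log value gives, under roughly normal residuals, an estimate of the median rather than the mean salary.

```r
# Coefficients copied from the summary(model) output
b0 <- 10.382774  # intercept: expected log(salary) at years = 0
b1 <- 0.070998   # slope: change in log(salary) per additional year

# Exact percentage change in salary per additional year
pct_per_year <- (exp(b1) - 1) * 100
round(pct_per_year, 2)  # 7.36

# Back-transformed salary predictions at hypothetical tenures (assumption)
years_example <- c(0, 10, 20)
salary_hat <- exp(b0 + b1 * years_example)
round(salary_hat)
```

With the fitted model object available, `exp(predict(model, newdata = data.frame(years = years_example)))` should give the same predictions up to rounding of the coefficients.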