This dataset contains salary (€), years of employment, and gender for public service employees.
#Clean column names to avoid issues with spaces
names(df) <- trimws(names(df))
#Convert data types
df$salary <- as.numeric(df$salary)
df$years <- as.numeric(df$years_empl)
df$gender <- as.factor(df$gender)
#Number of rows (observations)
nrow(df)
## [1] 200
#Frequency table for gender
table(df$gender)
##
## Female Male
## 100 100
#Means
mean_salary <- mean(df$salary)
mean_years_empl <- mean(df$years)
#Standard deviation
sd_salary <- sd(df$salary)
sd_years <- sd(df$years)  # use the numeric column created above, as in mean(df$years)
summary(df$salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30203 54208 97496 122303 179447 331348
The scatterplot below visualizes the relationship between years of employment (x-axis) and salary in euros (y-axis). It shows a clear upward trend, indicating a positive association: as years of employment increase, salary tends to rise. However, the spread of data points also suggests some variability in salaries for employees with the same length of employment. A linear trend line is added in green to highlight the general positive pattern.
# Scatterplot of years (independent) vs Salary (dependent)
plot(x = df$years, y = df$salary, xlab = "Years of employment", ylab = "Salary (€)")
abline(lm(salary ~ years, data = df), col = "green", lwd = 2)
Initial visual inspection suggested a non-linear relationship between salary and years of employment. Salaries tend to rise more steeply in later years, hinting at an exponential or multiplicative growth pattern rather than a constant additive increase. To address this, we apply a logarithmic transformation to the salary variable. This log transformation helps stabilize the variance and linearize the relationship, making it suitable for linear regression. The model below estimates the log of salary as a function of years of employment.
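The reasoning behind the transformation can be written out as a short derivation. The symbols $a$ (salary at zero years) and $b$ (growth rate) are notation introduced here for illustration and do not appear in the original analysis. Assuming salary grows by a constant factor per year of employment:

$$
\text{salary} = a \, e^{b \cdot \text{years}}
\quad\Longrightarrow\quad
\log(\text{salary}) = \log(a) + b \cdot \text{years}
$$

Under this assumption the log of salary is a linear function of years, with intercept $\log(a)$ and slope $b$, and each additional year multiplies salary by $e^{b}$.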
df$log_salary <- log(df$salary)
model <- lm(log_salary ~ years, data = df)
summary(model)
##
## Call:
## lm(formula = log_salary ~ years, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77041 -0.12197 -0.00111 0.15234 0.41044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.382774 0.027501 377.54 <2e-16 ***
## years 0.070998 0.001517 46.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1933 on 198 degrees of freedom
## Multiple R-squared: 0.9171, Adjusted R-squared: 0.9167
## F-statistic: 2191 on 1 and 198 DF, p-value: < 2.2e-16
The model indicates a clear and statistically significant positive relationship between years of employment and salary (slope = 0.071, p < 2e-16). Because salary is log-transformed, the slope describes a proportional rather than an absolute change: each additional year of employment is associated with a salary that is about 7.4% higher (exp(0.071) - 1 ≈ 0.074). The intercept corresponds to a typical salary of roughly €32,300 at zero years of employment (exp(10.38)). The fit is strong: years of employment explains about 92% of the variance in log salary (R² = 0.917), with a residual standard error of 0.19 on the log scale.
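The proportional interpretation can be checked by back-transforming the reported coefficients. The following is a minimal sketch that copies the two estimates from the `summary(model)` output rather than refitting the model; the helper `predict_salary()` is a hypothetical name introduced here for illustration and is not part of the original analysis.

```r
# Estimates copied from the summary(model) output
intercept <- 10.382774
slope     <- 0.070998

# Multiplicative change in salary per additional year of employment
growth_factor <- exp(slope)
pct_per_year  <- (growth_factor - 1) * 100

# Salary implied by the model at 0 years (back-transformed intercept)
baseline_salary <- exp(intercept)

# Hypothetical helper: back-transformed prediction on the euro scale
predict_salary <- function(years) exp(intercept + slope * years)

cat("Percent increase per year:", round(pct_per_year, 2), "\n")
cat("Implied salary at 0 years:", round(baseline_salary), "\n")
cat("Implied salary at 10 years:", round(predict_salary(10)), "\n")
```

With the fitted object available, `exp(coef(model))` gives the same back-transformed values directly. Note that exponentiating a prediction from a log-scale model estimates a typical (median-like) salary rather than the arithmetic mean.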