1. Sample description

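This dataset contains salary (€), years of employment, and gender for 200 public service employees (100 female, 100 male). Salaries range from €30,203 to €331,348. The mean salary (€122,303) lies well above the median (€97,496), which indicates a right-skewed distribution.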

#Clean column names to avoid issues with spaces
names(df) <- trimws(names(df))

#Convert data types
df$salary <- as.numeric(df$salary)
df$years <- as.numeric(df$years_empl)
df$gender <- as.factor(df$gender)

#Number of rows (observations)
nrow(df)
## [1] 200
#Frequency table for gender
table(df$gender)
## 
## Female   Male 
##    100    100
#Means
mean_salary <- mean(df$salary)
mean_years_empl <- mean(df$years)

#Standard deviation
sd_salary <- sd(df$salary)
sd_years <- sd(df$years)  # use the numeric column created above, consistent with mean(df$years)
summary(df$salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30203   54208   97496  122303  179447  331348


2. Association between years of employment and salary (scatterplot)

The scatterplot below visualizes the relationship between years of employment (x-axis) and salary in euros (y-axis). It shows a clear upward trend, indicating a positive association: as years of employment increase, salary tends to rise. However, the spread of data points also suggests some variability in salaries for employees with the same length of employment. A linear trend line is added in green to highlight the general positive pattern.

# Scatterplot of years of employment (independent) vs. salary (dependent)
plot(x = df$years, y = df$salary,
     xlab = "Years of employment", ylab = "Salary (€)")
# Add the fitted linear trend line in green
abline(lm(salary ~ years, data = df), col = "green", lwd = 2)


3. Estimate salary by years of employment

Visual inspection of the scatterplot suggests a non-linear relationship between salary and years of employment: salaries rise more steeply in later years, which points to multiplicative (exponential) growth rather than a constant additive increase. To account for this, we take the natural logarithm of salary. The log transformation linearizes an exponential relationship and tends to stabilize the variance, which makes the data suitable for linear regression. The model below estimates log(salary) as a linear function of years of employment.

# Log-transform salary (natural log) and fit a linear model on the log scale
df$log_salary <- log(df$salary)

model <- lm(log_salary ~ years, data = df)
summary(model)
## 
## Call:
## lm(formula = log_salary ~ years, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77041 -0.12197 -0.00111  0.15234  0.41044 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.382774   0.027501  377.54   <2e-16 ***
## years        0.070998   0.001517   46.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1933 on 198 degrees of freedom
## Multiple R-squared:  0.9171, Adjusted R-squared:  0.9167 
## F-statistic:  2191 on 1 and 198 DF,  p-value: < 2.2e-16
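
To read a log-scale model in euros, predictions have to be transformed back with exp(). The following is a minimal sketch of that step, not the analysis above: `sim` is synthetic data invented for illustration (the real `df` is not reproduced here), and the names `fit`, `new_years`, `growth_pct`, and `pred_eur` are hypothetical.

```r
# Sketch only: `sim` is synthetic data standing in for the real `df`.
# It is generated so that log(salary) is roughly linear in years.
set.seed(1)
sim <- data.frame(years = runif(200, min = 0, max = 35))
sim$salary <- exp(10.4 + 0.07 * sim$years + rnorm(200, sd = 0.2))

# Same model form as in the report: log(salary) as a linear function of years
fit <- lm(log(salary) ~ years, data = sim)

# Slope on the log scale converted to percentage growth per additional year
growth_pct <- (exp(coef(fit)[["years"]]) - 1) * 100

# Predictions at 0, 10 and 20 years, back-transformed from log scale to euros.
# exp() of a log-scale prediction approximates the median salary, not the mean.
new_years <- data.frame(years = c(0, 10, 20))
pred_eur <- exp(predict(fit, newdata = new_years))

print(round(growth_pct, 2))
print(round(pred_eur))
```

The same two lines (`exp(coef(...)) - 1` and `exp(predict(...))`) could be applied to the fitted `model` object from the report, provided `newdata` contains a column named `years`.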


4. Interpretation

The model indicates a statistically significant positive relationship between years of employment and salary (slope = 0.071, p < 2e-16). Because salary is modeled on the natural-log scale, the slope describes a proportional rather than an absolute change: each additional year of employment is associated with a salary that is higher by a factor of exp(0.071) ≈ 1.074, i.e. roughly 7.4%. Salaries therefore grow at an approximately constant percentage rate per year. The fit is strong: years of employment explain about 92% of the variance in log salary (R² = 0.917). Note that R² refers to the log scale, and that the model describes an association, not a causal effect.
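
As a worked check of the percentage reading, the fitted coefficients from the regression output (rounded to four decimals) can be back-transformed to the euro scale. The rounded figures are approximations, and the exponentiated prediction is better read as a typical (median) salary than as a mean.

```latex
\begin{align*}
\widehat{\log(\text{salary})} &= 10.3828 + 0.0710 \cdot \text{years} \\
\widehat{\text{salary}} &\approx e^{10.3828} \cdot \left(e^{0.0710}\right)^{\text{years}}
  \approx 32{,}300 \cdot 1.0736^{\,\text{years}} \\
e^{0.0710} - 1 &\approx 0.0736 \quad \text{(about 7.4\% per additional year)}
\end{align*}
```

Under this reading, the intercept corresponds to an estimated salary of roughly €32,300 at zero years of employment, which is close to the observed minimum of €30,203.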