This dataset includes information on employee salaries (€), duration of employment in years, and gender for individuals working in the public sector. Below, we provide basic descriptive statistics to understand the composition of the sample, including its size, central tendency, variability, and gender distribution.
# Clean column names to avoid issues with spaces
names(df) <- trimws(names(df))
# Convert data types
df$salary <- as.numeric(df$salary)
df$years <- as.numeric(df$years)
df$gender <- as.factor(df$gender)
# Number of rows (observations)
nrow(df)
## [1] 200
# Frequency table for gender
table(df$gender)
##
## Female   Male
##    100    100
# Means
mean_salary <- mean(df$salary)
mean_years <- mean(df$years)
# Standard deviations
sd_salary <- sd(df$salary)
sd_years <- sd(df$years)
summary(df$salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   30203   54208   97496  122304  179447  331348
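The means and standard deviations computed above are stored but not displayed, and the gender split is only shown as counts. As a minimal sketch of how these statistics could be printed overall and by gender, the following uses a simulated stand-in data frame called `demo` (an assumption for illustration; its values are not the real data and its numbers should not be read as results).

```r
# Simulated stand-in data (assumption): the real `df` is not reproduced here.
set.seed(1)
demo <- data.frame(
  salary = round(exp(rnorm(200, mean = 11.4, sd = 0.6))),
  years  = round(runif(200, min = 0, max = 35), 1),
  gender = factor(rep(c("Female", "Male"), each = 100))
)

# Mean and standard deviation of salary and years for the whole sample
overall <- sapply(demo[c("salary", "years")],
                  function(x) c(mean = mean(x), sd = sd(x)))
print(round(overall, 2))

# The same statistics split by gender
by_gender <- aggregate(cbind(salary, years) ~ gender, data = demo,
                       FUN = function(x) c(mean = mean(x), sd = sd(x)))
print(by_gender)
```

Replacing `demo` with `df` would apply the same summary to the actual sample, provided the columns are named `salary`, `years`, and `gender` as above.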
The following scatterplot visualizes the relationship between years of employment and salary. A positive correlation is evident: employees with more years of service tend to earn higher salaries. However, the relationship may not be strictly linear, as salary growth seems to taper off at higher levels of experience.
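# Scatterplot of years (independent) vs. salary (dependent)
plot(x = df$years, y = df$salary,
     xlab = "Years of employment", ylab = "Salary (€)")
# Overlay the least-squares regression line
abline(lm(salary ~ years, data = df), col = "pink", lwd = 2)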
Due to the apparent non-linearity in the observed data, a log transformation is applied to the salary variable. This adjustment helps stabilize variance and linearize the relationship, allowing for more accurate prediction of salary based on years of employment using a linear model.
# Log-transform salary and regress it on years of employment
df$log_salary <- log(df$salary)
model <- lm(log_salary ~ years, data = df)
df$log_salary_pred <- predict(model) # predicted log(salary)
df$salary_pred <- exp(df$log_salary_pred) # retransform to original scale
summary(model)
##
## Call:
## lm(formula = log_salary ~ years, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.77041 -0.12197 -0.00111  0.15234  0.41044
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.382774   0.027501  377.54   <2e-16 ***
## years        0.070998   0.001517   46.81   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1933 on 198 degrees of freedom
## Multiple R-squared: 0.9171, Adjusted R-squared: 0.9167
## F-statistic: 2191 on 1 and 198 DF, p-value: < 2.2e-16
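One caveat about the retransformation step: exponentiating a predicted log salary estimates something closer to the conditional median than the conditional mean, so it tends to understate average salary. A common correction is Duan's smearing factor, the mean of the exponentiated residuals. The sketch below illustrates the idea on simulated stand-in data (`demo`, `fit`, and the simulation parameters are assumptions for illustration, not the real data or the fitted model above).

```r
# Simulated stand-in data (assumption): the real `df` is not reproduced here.
set.seed(42)
years  <- runif(200, min = 0, max = 35)
salary <- exp(10.4 + 0.07 * years + rnorm(200, sd = 0.2))
demo   <- data.frame(years, salary)

# Log-linear model, analogous to the one fitted above
fit <- lm(log(salary) ~ years, data = demo)

# Naive retransformation: exp() of the fitted log values
naive_pred <- exp(fitted(fit))

# Duan's smearing factor: the average of the exponentiated residuals
smear <- mean(exp(residuals(fit)))
smeared_pred <- smear * naive_pred

# Compare average predictions with the observed average salary
print(c(observed = mean(demo$salary),
        naive    = mean(naive_pred),
        smeared  = mean(smeared_pred)))
```

Because the residuals of a model with an intercept average to zero, the smearing factor is never below 1, so the corrected predictions are at least as large as the naive ones.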
The model indicates that salary increases with years of employment. The coefficient for years is approximately 0.071, which in a log-linear model corresponds to an estimated average salary increase of about 7.4% for each additional year of employment (exp(0.071) - 1 ≈ 0.074; reading the coefficient directly as a percentage gives the rougher figure of 7.1%). The relationship is statistically significant (t = 46.81, p < 2e-16), and years of employment account for roughly 92% of the variance in log salary (R² = 0.917), indicating a strong association between tenure and earnings.
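The conversion from a log-linear coefficient to a percentage change can be checked directly. This short sketch takes the `years` estimate from the `summary(model)` output above and compares the approximate reading (100 × b) with the exact one (100 × (exp(b) - 1)); the variable names are illustrative.

```r
# Coefficient for `years`, copied from the summary(model) output
b_years <- 0.070998

# Approximate reading: treat the coefficient itself as a proportion
approx_pct <- 100 * b_years

# Exact reading for a log-linear model: 100 * (exp(b) - 1)
exact_pct <- 100 * (exp(b_years) - 1)

print(round(c(approximate = approx_pct, exact = exact_pct), 2))
```

The two readings come out at about 7.10% and 7.36% per additional year; the gap widens as the coefficient grows, so the exact form is the safer one to report.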