1. Sample description

For a brief overview, I calculated how many of the employees in the data set were female and male, and looked at the distributions of years of employment and salary. Additionally, I calculated the average years of employment, av_yrs, and the average salary, av_slry.

library(kableExtra)
library(knitr)
# Sample description: basic summaries via table(), mean(), sd() and summary()

# print all variables / column names
names(df)
## [1] "years_empl" "salary"     "gender"
# gender distribution
gnd = df$gender
gnd_kable = as.data.frame(table(gnd))

# show number of female and male employees in a table
kable_styling(
  kable(gnd_kable,
        col.names = c("Gender", "Count"),
        caption = "Gender Distribution in the Dataset"
        ), full_width = FALSE, font_size = 13, bootstrap_options = c("hover", "condensed"))
Gender Distribution in the Dataset
Gender Count
Female 100
Male 100
sum(df$gender == 'Male')
## [1] 100
sum(df$gender == 'Female')
## [1] 100

The gender distribution is perfectly even in this data set: there are 100 female and 100 male employees.

# Descriptive statistics for years of employment
hist(df$years_empl, 
     breaks=8, 
     col = "lightblue", 
     border = "white",
     main = "Histogram of Years of Employment",
     xlab = "Years")

# Mean
av_yrs = mean(df$years_empl)
# Summary (min, 1st quartile, median, mean, 3rd quartile, max)
summary_yrs = summary(df$years_empl)
# Standard deviation
sd_yrs = sd(df$years_empl, na.rm = FALSE)

The average length of employment is approximately 15.73 years, with a standard deviation of 9.04 years. The values are therefore widely spread around the mean, suggesting a diverse range of employment durations among individuals in the dataset.
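To put the spread in relation to the mean, one option is the coefficient of variation (standard deviation divided by mean). The sketch below uses the rounded values reported above rather than the data itself; the variable names are illustrative only.

```r
# Rounded values taken from the text above (not recomputed from df)
mean_yrs_reported <- 15.73
sd_yrs_reported <- 9.04

# Coefficient of variation: spread relative to the mean
cv_yrs <- sd_yrs_reported / mean_yrs_reported
round(cv_yrs, 2)
```

A value above one half means the standard deviation is more than half of the mean, which is consistent with the wide spread described above.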

# Descriptive statistics for salary
hist(df$salary, 
     breaks=8, 
     col = "lightblue", 
     border = "white",
     main = "Histogram of Salary",
     xlab = "Salary Amount")

# Mean
av_slry = mean(df$salary)
# Summary (min, 1st quartile, median, mean, 3rd quartile, max)
summary_slry = summary(df$salary)
# Standard deviation
sd_slry = sd(df$salary, na.rm = FALSE)

The average salary is approximately 122,303.45, with a standard deviation of approximately 79,030.12, indicating considerable variability in income across the dataset. The median is slightly lower than the mean, suggesting a right-skewed distribution with some high earners. The histogram supports this by showing a longer tail to the right.
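Large values such as these are easier to read in fixed notation with thousands separators. The following is a minimal sketch using base R's formatC(); the inputs are the rounded mean and standard deviation of salary from this section, and the variable names are illustrative.

```r
# Rounded values from this section (not recomputed from df)
mean_salary_reported <- 122303.45
sd_salary_reported <- 79030.12

# Fixed notation, two decimals, comma as thousands separator
mean_txt <- formatC(mean_salary_reported, format = "f", digits = 2, big.mark = ",")
sd_txt <- formatC(sd_salary_reported, format = "f", digits = 2, big.mark = ",")
c(mean_txt, sd_txt)
```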

2. Association between years and salary as scatterplot.

# Scatterplot: Years of employment vs. Salary
# years_empl = independent variable, salary = dependent variable
# the salary changes with the years of employment
plot(df$years_empl, df$salary,
     main = "Scatterplot of Years of Employment and Salary",
     xlab = "Years of Employment",
     ylab = "Salary",
     pch = 19,               # solid circles
     col = "lightblue")

# Add a linear regression line
abline(lm(salary ~ years_empl, data = df), col = "darkred", lwd = 2)

The scatterplot shows the relationship between years of employment and salary. Each point represents one individual. A linear regression line was added to indicate the trend. Visually, the plot suggests a positive association: as the number of years employed increases, salary tends to increase. To support this, a linear model was fitted (see below).

# Correlation
cor(df$years_empl, df$salary)
## [1] 0.908204
# Linear regression model
model = lm(salary ~ years_empl, data = df)
summary(model)
## 
## Call:
## lm(formula = salary ~ years_empl, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -58615 -27281  -3463  19327 101896 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2684.3     4717.3  -0.569     0.57    
## years_empl    7943.6      260.2  30.535   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33160 on 198 degrees of freedom
## Multiple R-squared:  0.8248, Adjusted R-squared:  0.8239 
## F-statistic: 932.4 on 1 and 198 DF,  p-value: < 2.2e-16

A scatterplot with a regression line illustrates a strong positive relationship between years of employment and salary. The Pearson correlation coefficient is 0.91, indicating a very strong linear association.
A linear regression model was fitted: salary = −2,684.3 + 7,943.6 × years_empl

The slope of the model suggests that for each additional year of employment, salary increases on average by $7,944. The model explains about 82.5% of the variability in salary (R² = 0.825), and the relationship is statistically significant (p < 0.001).
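For a simple regression with a single predictor, R² equals the squared Pearson correlation, so the two reported figures can be cross-checked against each other. A small sketch using the correlation printed above (the variable names are illustrative):

```r
# Correlation as printed by cor(df$years_empl, df$salary)
r_reported <- 0.908204

# Squared correlation; compare with "Multiple R-squared" in summary(model)
r_squared <- r_reported^2
round(r_squared, 4)
```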

3. Estimate salary by years of employment

To estimate salary based on years of employment, I use the regression equation obtained in the section above: salary = −2,684.3 + 7,943.6 × years_empl

# Calculate estimated salary
df$estimated_salary = -2684.3 + 7943.6 * df$years_empl

plot(df$years_empl, df$salary,
     main = "Actual vs. Estimated Salary",
     xlab = "Years of Employment",
     ylab = "Salary",
     pch = 19, col = "lightblue")

lines(df$years_empl[order(df$years_empl)],
      df$estimated_salary[order(df$years_empl)], 
      col = "red", lwd = 2)
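As a worked example of the equation, the sketch below wraps the rounded coefficients in a small helper function. The name estimate_salary is hypothetical and not part of the original analysis; with the fitted model object, predict(model, newdata = data.frame(years_empl = 10)) would give the corresponding value based on the unrounded coefficients.

```r
# Hypothetical helper using the rounded coefficients from the regression output
estimate_salary <- function(years) {
  -2684.3 + 7943.6 * years
}

# Estimated salaries after 5, 10 and 20 years of employment
estimates <- estimate_salary(c(5, 10, 20))
estimates
```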


4. Interpretation

The linear regression model shows a strong and statistically significant positive relationship between years of employment and salary. For each additional year of employment, salary increases on average by approximately $7,944. The model explains 82.5% of the variation in salary (R² = 0.825), indicating that years of employment is a very strong predictor of salary in this dataset. The correlation coefficient (r = 0.91) further supports this strong linear association.


5. (Voluntary) Gender effects

To explore potential gender effects, separate regression models were estimated for males and females, and the average salary of each group was calculated.

# Separate models by gender
model_female = lm(salary ~ years_empl, data = df[df$gender == "Female", ])
model_male = lm(salary ~ years_empl, data = df[df$gender == "Male", ])

# Summarize models
summary(model_female)
## 
## Call:
## lm(formula = salary ~ years_empl, data = df[df$gender == "Female", 
##     ])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42048 -20067  -5730  18051  66365 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4607       4895   0.941    0.349    
## years_empl      6644        270  24.610   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24330 on 98 degrees of freedom
## Multiple R-squared:  0.8607, Adjusted R-squared:  0.8593 
## F-statistic: 605.7 on 1 and 98 DF,  p-value: < 2.2e-16
summary(model_male)
## 
## Call:
## lm(formula = salary ~ years_empl, data = df[df$gender == "Male", 
##     ])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -71929 -22398  -1396  22959  71220 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -9975.4     6348.8  -1.571    0.119    
## years_empl    9243.6      350.1  26.401   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31560 on 98 degrees of freedom
## Multiple R-squared:  0.8767, Adjusted R-squared:  0.8755 
## F-statistic:   697 on 1 and 98 DF,  p-value: < 2.2e-16

Both groups showed a positive association between years of employment and salary, but the slope and intercept differed:
- For males, the average increase in salary per year of employment was approximately $9,244.
- For females, it was approximately $6,644.
These results suggest that gender may influence salary progression over time.
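A rough way to gauge whether the two slopes differ by more than chance is to compare their difference with its standard error. The sketch below uses the estimates and standard errors from the two summaries above and assumes that the female and male subsamples are independent; it is an approximation, not a formal test taken from the original analysis.

```r
# Slopes and standard errors copied from summary(model_female) and summary(model_male)
slope_f <- 6644
se_f <- 270
slope_m <- 9243.6
se_m <- 350.1

# Difference in slopes and its approximate standard error
slope_diff <- slope_m - slope_f
se_diff <- sqrt(se_f^2 + se_m^2)   # assumes independent subsamples
z <- slope_diff / se_diff
c(difference = slope_diff, se = se_diff, z = z)
```

An alternative would be a single model with an interaction term, for example lm(salary ~ years_empl * gender, data = df), where the interaction coefficient estimates the difference in slopes directly.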

# salary averages by gender
av_F = mean(df$salary[df$gender == "Female"])
av_M = mean(df$salary[df$gender == "Male"])

# Create the table
avg_slry_table = data.frame(
  Gender = c("Female", "Male"),
  Average_Salary = c(av_F, av_M))

kable_styling(kable(avg_slry_table,
      col.names = c("Gender","Average Salary"),
      caption = "Average Salary by Gender"), full_width = FALSE, font_size = 13, bootstrap_options = c("hover", "condensed"))
Average Salary by Gender
Gender Average Salary
Female 109140.8
Male 135466.1
# Scatterplot with both groups
plot(df$years_empl, df$salary,
     col = ifelse(df$gender == "Female", "lightgreen", "orange"),
     pch = 19,
     xlab = "Years of Employment",
     ylab = "Salary",
     main = "Salary vs. Years of Employment by Gender")

# Add regression lines
abline(model_female, col = "lightgreen", lwd = 2)
abline(model_male, col = "orange", lwd = 2)

legend("topleft", legend = c("Female", "Male"),
       col = c("lightgreen", "orange"), lwd = 2, bty = "n")

The average salaries also differ: female employees have an average salary of approximately 109,140.76, while male employees have an average salary of approximately 135,466.14.

A visual comparison of the two regression lines supports the observations above.
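The size of the gap can be expressed in absolute and relative terms. The short sketch below uses the averages from the table above (variable names are illustrative); note that this raw comparison does not account for years of employment.

```r
# Average salaries as shown in the table above (not recomputed from df)
av_F_reported <- 109140.8
av_M_reported <- 135466.1

# Absolute gap and female average as a share of the male average
gap <- av_M_reported - av_F_reported
ratio <- av_F_reported / av_M_reported
round(c(gap = gap, ratio = ratio), 3)
```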