1. Sample description

library(kableExtra)
library(knitr)
# Sample description: variable names, gender distribution, and descriptive
# statistics (mean, standard deviation, summary) for years_empl and salary

# print all variables / column names
names(df)
## [1] "years_empl" "salary"     "gender"
# gender distribution
gnd = df$gender
gnd_kable = as.data.frame(table(gnd))

# show number of female and male employees in a table
# (the table has one column for the category and one for its count)
kable_styling(
  kable(gnd_kable,
        col.names = c("Gender", "Count"),
        caption = "Gender Distribution in the Dataset"
        ), full_width = FALSE, font_size = 13, bootstrap_options = c("hover", "condensed"))
Gender Distribution in the Dataset
Gender Count
Female 100
Male 100
sum(df$gender == 'Male')
## [1] 100
sum(df$gender == 'Female')
## [1] 100

Gender distribution: 100 Female and 100 Male employees.
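The counts can also be expressed as proportions. A minimal sketch, assuming a hypothetical vector `gender` that mirrors the 100/100 split reported above (it is a stand-in, not the original `df$gender`):

```r
# Hypothetical stand-in for df$gender with the split reported above
gender <- rep(c("Female", "Male"), each = 100)

# Absolute counts per category
gender_counts <- table(gender)

# Relative frequencies (each count divided by the total)
gender_props <- prop.table(gender_counts)

print(gender_counts)
print(gender_props)
```

With the real data, the same two calls would be applied to `df$gender` directly.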

# Descriptive statistics for years of employment

df$years_empl <- as.numeric(as.character(df$years_empl))
str(df$years_empl)
##  num [1:200] 27.44 28.11 8.58 24.91 19.25 ...
hist(df$years_empl, 
     breaks=8, 
     col = "pink", 
     border = "orange",
     main = "Histogram of Years of Employment",
     xlab = "Years")

# Mean
av_yrs = mean(df$years_empl)
# Summary (min, 1st quartile, median, mean, 3rd quartile, max)
summary_yrs = summary(df$years_empl)
# Standard deviation
sd_yrs = sd(df$years_empl, na.rm = FALSE)

The mean length of employment is approximately 15.73 years, with a standard deviation of 9.04 years. Relative to the mean, this is a wide spread, suggesting a diverse range of employment durations among individuals in the dataset.
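One way to quantify "spread relative to the mean" is the coefficient of variation (standard deviation divided by mean). The sketch below uses the rounded values reported in the text rather than the original data, so the variables are set by hand here:

```r
# Rounded values as reported in the text (not recomputed from df)
av_yrs <- 15.73  # mean years of employment
sd_yrs <- 9.04   # standard deviation of years of employment

# Coefficient of variation: standard deviation as a fraction of the mean
cv_yrs <- sd_yrs / av_yrs
round(cv_yrs, 2)
```

The result is about 0.57, meaning the standard deviation is more than half the mean.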

# Descriptive statistics for salary
hist(df$salary, 
     breaks=8, 
     col = "lightblue", 
     border = "white",
     main = "Histogram of Salary",
     xlab = "Salary Amount")

# Mean
av_slry = mean(df$salary)
# Summary (min, 1st quartile, median, mean, 3rd quartile, max)
summary_slry = summary(df$salary)
# Standard deviation
sd_slry = sd(df$salary, na.rm = FALSE)

Average salary: approximately 122,303 (1.2230345 × 10^5)
Standard deviation: approximately 79,030 (7.9030117 × 10^4)
Median: slightly lower than the mean, suggesting a right-skewed distribution (some high earners)
Histogram: supports this by showing a longer tail to the right
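Large values such as the salary mean tend to be rendered in scientific notation when inserted inline in an R Markdown report. A minimal sketch of formatting them with thousands separators instead; `fmt_amount` is a hypothetical helper name, and the two values are copied from the text above rather than recomputed:

```r
# Values reported above (mean and standard deviation of salary)
av_slry <- 1.2230345e5
sd_slry <- 7.9030117e4

# Hypothetical helper: round to whole units and add thousands separators
fmt_amount <- function(x) format(round(x), big.mark = ",", scientific = FALSE)

fmt_amount(av_slry)  # "122,303"
fmt_amount(sd_slry)  # "79,030"
```

The helper returns character strings, so it is meant for display only, not for further arithmetic.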


2. Association between years and salary as scatterplot

# Scatterplot: Years of employment vs. Salary
# years_empl = independent variable, salary = dependent variable
# the salary changes with the years of employment
plot(df$years_empl, df$salary,
     main = "Scatterplot of Years of Employment and Salary",
     xlab = "Years of Employment",
     ylab = "Salary",
     pch = 19,               # solid circles
     col = "lightblue")

# Add a linear regression line
abline(lm(salary ~ years_empl, data = df), col = "yellow", lwd = 2)

Scatterplot: shows the relationship between years of employment and salary.
- Each point represents one individual.
- The plot suggests a positive association: as the number of years of employment increases, salary tends to increase.
To support this, a linear model was fitted (see below).

# Correlation
cor(df$years_empl, df$salary)
## [1] 0.908204
# Linear regression model
model = lm(salary ~ years_empl, data = df)
summary(model)
## 
## Call:
## lm(formula = salary ~ years_empl, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -58615 -27281  -3463  19327 101896 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2684.3     4717.3  -0.569     0.57    
## years_empl    7943.6      260.2  30.535   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33160 on 198 degrees of freedom
## Multiple R-squared:  0.8248, Adjusted R-squared:  0.8239 
## F-statistic: 932.4 on 1 and 198 DF,  p-value: < 2.2e-16

A scatterplot with a regression line illustrates a strong positive relationship between years of employment and salary. The Pearson correlation coefficient is 0.91, indicating a very strong linear association.
A linear regression model was fitted: salary = −2,684.3 + 7,943.6 × years_empl

The slope of the model suggests that for each additional year of employment, salary increases on average by $7,944. The model explains about 82.5% of the variability in salary (R² = 0.825), and the relationship is statistically significant (p < 0.001).
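As a consistency check on the reported output: in a simple linear regression with a single predictor, R² equals the squared Pearson correlation, the t value of the slope is its estimate divided by its standard error, and the F-statistic equals the squared t value. The symbol \hat{\beta}_1 for the slope is notation introduced here. The figures are the rounded ones printed above, so small rounding differences are expected:

```latex
\begin{align*}
R^2 &= r^2 = 0.908204^2 \approx 0.8248 \\
t &= \frac{\hat{\beta}_1}{\mathrm{SE}(\hat{\beta}_1)} = \frac{7943.6}{260.2} \approx 30.53 \\
F &= t^2 = 30.535^2 \approx 932.4
\end{align*}
```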

3. Estimate salary by years of employment

# Calculate estimated salary from the fitted model
# (avoids hard-coding the rounded coefficients)
df$estimated_salary = predict(model, newdata = df)

plot(df$years_empl, df$salary,
     main = "Actual vs. Estimated Salary",
     xlab = "Years of Employment",
     ylab = "Salary",
     pch = 19, col = "blue")

lines(df$years_empl[order(df$years_empl)],
      df$estimated_salary[order(df$years_empl)], 
      col = "green", lwd = 2)
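The fitted line can also be used to estimate salary for specific lengths of employment. A small sketch using the rounded coefficients from the regression output; `estimate_salary` is a hypothetical helper, and the example inputs (5, 10, and 20 years) are chosen only for illustration:

```r
# Coefficients as reported in the regression output (rounded)
intercept <- -2684.3
slope <- 7943.6

# Hypothetical helper: estimated salary for a given number of years
estimate_salary <- function(years) intercept + slope * years

estimate_salary(10)        # 76751.7
estimate_salary(c(5, 20))  # 37033.7 156187.7
```

With the fitted `model` object available, `predict(model, newdata = data.frame(years_empl = 10))` would give the same kind of estimate using the unrounded coefficients.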


4. Interpretation

The linear regression model shows a strong and statistically significant positive relationship between years of employment and salary. For each additional year of employment, salary increases on average by approximately $7,944. The model explains 82.5% of the variation in salary (R² = 0.825), indicating that years of employment is a very strong predictor of salary in this dataset. The correlation coefficient (r = 0.91) further supports this strong linear association.