library(kableExtra)
library(knitr)
# sample description:
# replace this by a basic sample description (by applying row(), table(), means(), sd(), summary(), ... (whatever applies best)
# print all variables / column names
names(df)
## [1] "years_empl" "salary" "gender"
# gender distribution
gnd = df$gender
gnd_kable = as.data.frame(table(gnd))
# show number of female and male employees in a table
# (columns of gnd_kable are the gender level and its frequency)
kable_styling(
kable(gnd_kable,
col.names = c("Gender", "Count"),
caption = "Gender Distribution in the Dataset"
), full_width = FALSE, font_size = 13, bootstrap_options = c("hover", "condensed"))
| Gender | Count |
|---|---|
| Female | 100 |
| Male | 100 |
sum(df$gender == 'Male')
## [1] 100
sum(df$gender == 'Female')
## [1] 100
Gender distribution: 100 Female and 100 Male employees.
# Descriptive statistics for years of employment
df$years_empl <- as.numeric(as.character(df$years_empl))
str(df$years_empl)
## num [1:200] 27.44 28.11 8.58 24.91 19.25 ...
hist(df$years_empl,
breaks=8,
col = "pink",
border = "orange",
main = "Histogram of Years of Employment",
xlab = "Years")
# Mean
av_yrs = mean(df$years_empl)
# Summary (min, 1st quartile, median, mean, 3rd quartile, max)
summary_yrs = summary(df$years_empl)
# Standard deviation
sd_yrs = sd(df$years_empl, na.rm = FALSE)
The average length of employment is approximately 15.73 years, with a standard deviation of 9.04 years. Because the standard deviation is more than half the mean, the values are widely spread around it, suggesting a diverse range of employment durations among individuals in the dataset.
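One way to make "widely spread" concrete is the coefficient of variation (standard deviation divided by the mean). The sketch below uses only the two rounded values reported in the text; the variable names are illustrative and not part of the original analysis.

```r
# Rounded values reported in the text (not recomputed from df)
mean_years <- 15.73
sd_years   <- 9.04

# Coefficient of variation: spread relative to the mean
cv_years <- sd_years / mean_years
round(cv_years, 2)
```

On the real data the same quantity would be `sd(df$years_empl) / mean(df$years_empl)`.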
# Descriptive statistics for salary
hist(df$salary,
breaks=8,
col = "lightblue",
border = "white",
main = "Histogram of Salary",
xlab = "Salary Amount")
# Mean
av_slry = mean(df$salary)
# Summary (min, 1st quartile, median, mean, 3rd quartile, max)
summary_slry = summary(df$salary)
# Standard deviation
sd_slry = sd(df$salary, na.rm = FALSE)
The average salary is about 122,303 (1.2230345 × 10^5), with a standard deviation of about 79,030 (7.9030117 × 10^4). The median is slightly lower than the mean, which suggests a right-skewed distribution (some high earners). The histogram supports this by showing a longer tail to the right.
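The mean-versus-median argument can be illustrated with a small toy example. The salaries below are hypothetical values chosen for illustration, not rows from `df`; a single high earner is enough to pull the mean above the median.

```r
# Hypothetical toy salaries (NOT taken from df): one very high earner
toy_salary <- c(30000, 40000, 45000, 50000, 60000, 250000)

mean_toy   <- mean(toy_salary)    # pulled upward by the 250000 value
median_toy <- median(toy_salary)  # midpoint of 45000 and 50000

# TRUE when the mean exceeds the median, a rough sign of right skew
mean_toy > median_toy
```

For the actual data the same comparison would be `mean(df$salary) > median(df$salary)`.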
# Scatterplot: Years of employment vs. Salary
# years_empl = independent variable, salary = dependent variable
# the salary changes with the years of employment
plot(df$years_empl, df$salary,
main = "Scatterplot of Years of Employment and Salary",
xlab = "Years of Employment",
ylab = "Salary",
pch = 19, # solid circles
col = "lightblue")
# Add a linear regression line
abline(lm(salary ~ years_empl, data = df), col = "yellow", lwd = 2)
The scatterplot shows the relationship between years of employment and
salary, where each point represents one individual. The plot suggests a
positive association: as the number of years of employment increases,
salary tends to increase. To support this, a linear model was fitted (see
below).
# Correlation
cor(df$years_empl, df$salary)
## [1] 0.908204
# Linear regression model
model = lm(salary ~ years_empl, data = df)
summary(model)
##
## Call:
## lm(formula = salary ~ years_empl, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58615 -27281 -3463 19327 101896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2684.3 4717.3 -0.569 0.57
## years_empl 7943.6 260.2 30.535 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33160 on 198 degrees of freedom
## Multiple R-squared: 0.8248, Adjusted R-squared: 0.8239
## F-statistic: 932.4 on 1 and 198 DF, p-value: < 2.2e-16
A scatterplot with a regression line illustrates a strong positive
relationship between years of employment and salary. The Pearson
correlation coefficient is 0.91, indicating a very strong linear
association.
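As a quick consistency check, the figures in the regression output can be tied back to the correlation. In a simple linear regression with one predictor, R² equals the squared Pearson correlation, and the F-statistic equals the squared t value of the slope. The sketch below only re-uses numbers printed in the output; the variable names are illustrative.

```r
# Values copied from the printed output (not recomputed from df)
r         <- 0.908204   # cor(df$years_empl, df$salary)
r_squared <- 0.8248     # Multiple R-squared from summary(model)
t_slope   <- 30.535     # t value of years_empl
f_stat    <- 932.4      # F-statistic

# R-squared should match the squared correlation
round(r^2, 4)

# F-statistic should match the squared t value of the slope
round(t_slope^2, 1)
```

Both rounded results agree with the reported R² and F-statistic, up to the rounding of the printed values.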
A linear regression model was fitted:
salary = −2,684.3 + 7,943.6 × years_empl
The slope of the model suggests that for each additional year of
employment, salary increases on average by $7,944. The model explains
about 82.5% of the variability in salary (R² = 0.825), and the
relationship is statistically significant (p < 0.001).
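To show how the fitted equation is read, here is a minimal sketch that plugs values into it. `predict_salary` is a hypothetical helper introduced for illustration; the coefficients are the rounded estimates from the equation.

```r
# Rounded coefficients taken from the fitted equation
intercept <- -2684.3
slope     <- 7943.6

# Hypothetical helper (not in the original analysis): predicted salary
predict_salary <- function(years) intercept + slope * years

# Estimated salary after 10 years of employment
predict_salary(10)

# The difference for one additional year equals the slope
predict_salary(11) - predict_salary(10)
```

The intercept is not statistically distinguishable from zero in the output (p = 0.57), so estimates close to zero years of employment should be read with caution.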
# Calculate estimated salary from the fitted model
# (uses the full-precision coefficients instead of rounded values typed by hand)
df$estimated_salary = predict(model, newdata = df)
plot(df$years_empl, df$salary,
main = "Actual vs. Estimated Salary",
xlab = "Years of Employment",
ylab = "Salary",
pch = 19, col = "blue")
lines(df$years_empl[order(df$years_empl)],
df$estimated_salary[order(df$years_empl)],
col = "green", lwd = 2)
The linear regression model shows a strong and statistically significant positive relationship between years of employment and salary. For each additional year of employment, salary increases on average by approximately $7,944. The model explains 82.5% of the variation in salary (R² = 0.825), indicating that years of employment is a very strong predictor of salary in this dataset. The correlation coefficient (r = 0.91) further supports this strong linear association.
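Estimated salaries can also be read directly from a fitted model object, which keeps the coefficients at full precision. The sketch below demonstrates this on simulated data, since the original dataset is not reproduced here; `sim`, `sim_model`, and the simulation parameters are assumptions made for illustration only.

```r
set.seed(1)  # reproducible simulated data, NOT the original dataset

# Hypothetical data frame with the same column names as df
sim <- data.frame(years_empl = runif(50, min = 0, max = 30))
sim$salary <- 8000 * sim$years_empl + rnorm(50, sd = 30000)

sim_model <- lm(salary ~ years_empl, data = sim)

# Estimates taken from the model object
est_fitted <- fitted(sim_model)

# The same estimates computed by hand from the stored coefficients
b <- coef(sim_model)
est_manual <- b["(Intercept)"] + b["years_empl"] * sim$years_empl

# Both approaches give the same values
all.equal(unname(est_fitted), unname(est_manual))
```

The residuals of such a model are available through `residuals(sim_model)`, which is a convenient starting point for checking how far individual points lie from the line.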