For a brief overview, I calculated how many of the employees in the
data set were female and male, as well as the distributions of years of
employment and salary. Additionally, I calculated the average years of
employment, av_yrs, and the average salary, av_slry.
library(kableExtra)
library(knitr)
# sample description:
# print all variables / column names
names(df)
## [1] "years_empl" "salary" "gender"
# gender distribution
gnd = df$gender
gnd_kable = as.data.frame(table(gnd))
# show number of female and male employees in a table
# as.data.frame(table(gnd)) has two columns: the gender level and its frequency
kable_styling(
kable(gnd_kable,
col.names = c("Gender", "Count"),
caption = "Gender Distribution in the Dataset"
), full_width = FALSE, font_size = 13, bootstrap_options = c("hover", "condensed"))
| Gender | Count |
|---|---|
| Female | 100 |
| Male | 100 |
sum(df$gender == 'Male')
## [1] 100
sum(df$gender == 'Female')
## [1] 100
The gender distribution is perfectly balanced: the data contain 100 female and 100 male employees.
# Descriptive statistics for years of employment
hist(df$years_empl,
breaks=8,
col = "lightblue",
border = "white",
main = "Histogram of Years of Employment",
xlab = "Years")
# Mean
av_yrs = mean(df$years_empl)
# Summary (min, 1st quartile, median, mean, 3rd quartile, max)
summary_yrs = summary(df$years_empl)
# Standard deviation
sd_yrs = sd(df$years_empl, na.rm = FALSE)
The mean length of employment is approximately 15.73 years, with a standard deviation of 9.04 years. Relative to the mean, this is a wide spread, indicating a diverse range of employment durations among individuals in the dataset.
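To put the spread in relation to the mean, the coefficient of variation can be computed from the two values reported above. This is a minimal sketch that uses the rounded figures from the text rather than the data frame itself; the names `mean_yrs_reported`, `sd_yrs_reported` and `cv_yrs` are introduced here only for illustration.

```r
# Rounded values taken from the text above (not recomputed from the data)
mean_yrs_reported = 15.73
sd_yrs_reported   = 9.04

# Coefficient of variation: standard deviation relative to the mean
cv_yrs = sd_yrs_reported / mean_yrs_reported
round(cv_yrs, 2)
```

The result is roughly 0.57, i.e. the standard deviation is more than half the mean.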
# Descriptive statistics for salary
hist(df$salary,
breaks=8,
col = "lightblue",
border = "white",
main = "Histogram of Salary",
xlab = "Salary Amount")
# Mean
av_slry = mean(df$salary)
# Summary (min, 1st quartile, median, mean, 3rd quartile, max)
summary_slry = summary(df$salary)
# Standard deviation
sd_slry = sd(df$salary, na.rm = FALSE)
The average salary is approximately 122,303, with a standard deviation
of approximately 79,030, indicating considerable variability in income
across the dataset. The median is slightly lower than the mean,
suggesting a right-skewed distribution with some high earners. The
histogram supports this by showing a longer tail to the right.
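The mean-versus-median argument can be illustrated with a small sketch. The vector `example_salary` below is hypothetical and is not taken from the dataset; it only shows how a few large values pull the mean above the median.

```r
# Hypothetical salaries for illustration only (NOT values from the dataset)
example_salary = c(40000, 52000, 61000, 75000, 90000, 180000, 320000)

# In a right-skewed sample the mean is pulled above the median
skew_gap = mean(example_salary) - median(example_salary)
skew_gap > 0
```

Replacing `example_salary` with the salary column of the data would give the corresponding check for this report.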
# Scatterplot: Years of employment vs. Salary
# years_empl = independent variable, salary = dependent variable
# the salary changes with the years of employment
plot(df$years_empl, df$salary,
main = "Scatterplot of Years of Employment and Salary",
xlab = "Years of Employment",
ylab = "Salary",
pch = 19, # solid circles
col = "lightblue")
# Add a linear regression line
abline(lm(salary ~ years_empl, data = df), col = "darkred", lwd = 2)
The scatterplot shows the relationship between years of employment and salary. Each point represents one individual. A linear regression line was added to indicate the trend. Visually, the plot suggests a positive association: as the number of years employed increases, salary tends to increase. To support this, a linear model was fitted (see below).
# Correlation
cor(df$years_empl, df$salary)
## [1] 0.908204
# Linear regression model
model = lm(salary ~ years_empl, data = df)
summary(model)
##
## Call:
## lm(formula = salary ~ years_empl, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58615 -27281 -3463 19327 101896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2684.3 4717.3 -0.569 0.57
## years_empl 7943.6 260.2 30.535 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33160 on 198 degrees of freedom
## Multiple R-squared: 0.8248, Adjusted R-squared: 0.8239
## F-statistic: 932.4 on 1 and 198 DF, p-value: < 2.2e-16
A scatterplot with a regression line illustrates a strong positive
relationship between years of employment and salary. The Pearson
correlation coefficient is 0.91, indicating a very strong linear
association.
A linear regression model was fitted:
salary = −2,684.3 + 7,943.6 × years_empl
The slope of the model suggests that for each additional year of
employment, salary increases on average by $7,944. The model explains
about 82.5% of the variability in salary (R² = 0.825), and the
relationship is statistically significant (p < 0.001).
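As a worked example of how the fitted equation is read, the sketch below plugs a hypothetical employee with 10 years of employment into the equation, using the rounded coefficients from the output above. It also checks that, for a regression with a single predictor, R² equals the squared Pearson correlation. The names `years_example` and `predicted_example` are illustrative only.

```r
# Coefficients copied from the summary(model) output above (rounded)
intercept = -2684.3
slope     = 7943.6

# Predicted salary for a hypothetical employee with 10 years of employment
years_example = 10
predicted_example = intercept + slope * years_example
predicted_example

# With a single predictor, R-squared is the squared Pearson correlation
r = 0.908204
round(r^2, 4)
```

The prediction is 76,751.7, and the squared correlation rounds to 0.8248, matching the Multiple R-squared in the model output.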
To estimate salary based on years of employment, I will use the
regression equation obtained in the section above:
salary = −2,684.3 + 7,943.6 × years_empl
# Calculate estimated salary
df$estimated_salary = -2684.3 + 7943.6 * df$years_empl
plot(df$years_empl, df$salary,
main = "Actual vs. Estimated Salary",
xlab = "Years of Employment",
ylab = "Salary",
pch = 19, col = "lightblue")
lines(df$years_empl[order(df$years_empl)],
df$estimated_salary[order(df$years_empl)],
col = "red", lwd = 2)
The linear regression model shows a strong and statistically significant positive relationship between years of employment and salary. For each additional year of employment, salary increases on average by approximately $7,944. The model explains 82.5% of the variation in salary (R² = 0.825), indicating that years of employment is a very strong predictor of salary in this dataset. The correlation coefficient (r = 0.91) further supports this strong linear association.
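Estimated salaries computed from hand-typed coefficients inherit the rounding of the printed output. An alternative is to let R compute the fitted values with `predict()`, which uses the unrounded coefficients. The sketch below demonstrates this on hypothetical toy data (the data frame `toy` and its values are invented for illustration and are not the dataset analysed here).

```r
# Hypothetical toy data for illustration only (NOT the dataset analysed here)
set.seed(1)
toy = data.frame(years_empl = 1:20)
toy$salary = 5000 + 8000 * toy$years_empl + rnorm(20, sd = 10000)

toy_model = lm(salary ~ years_empl, data = toy)

# Fitted values for the rows that were used to fit the model
toy$estimated_salary = predict(toy_model)

# The same values computed by hand from the unrounded coefficients
by_hand = coef(toy_model)[1] + coef(toy_model)[2] * toy$years_empl
all.equal(unname(toy$estimated_salary), unname(by_hand))
```

Applied to this report, the analogous call would be `predict(model)` on the model fitted earlier.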
To explore potential gender effects, separate regression models were estimated for males and females, and the average salary of each group was calculated.
# Separate models by gender
model_female = lm(salary ~ years_empl, data = df[df$gender == "Female", ])
model_male = lm(salary ~ years_empl, data = df[df$gender == "Male", ])
# Summarize models
summary(model_female)
##
## Call:
## lm(formula = salary ~ years_empl, data = df[df$gender == "Female",
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -42048 -20067 -5730 18051 66365
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4607 4895 0.941 0.349
## years_empl 6644 270 24.610 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24330 on 98 degrees of freedom
## Multiple R-squared: 0.8607, Adjusted R-squared: 0.8593
## F-statistic: 605.7 on 1 and 98 DF, p-value: < 2.2e-16
summary(model_male)
##
## Call:
## lm(formula = salary ~ years_empl, data = df[df$gender == "Male",
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -71929 -22398 -1396 22959 71220
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9975.4 6348.8 -1.571 0.119
## years_empl 9243.6 350.1 26.401 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31560 on 98 degrees of freedom
## Multiple R-squared: 0.8767, Adjusted R-squared: 0.8755
## F-statistic: 697 on 1 and 98 DF, p-value: < 2.2e-16
Both groups showed a positive association between years of employment
and salary, but the slope and intercept differed:
- For males, the average increase in salary per year of employment was
approximately $9,244.
- For females, it was approximately $6,644.
These results suggest that gender may influence salary progression over
time.
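The two slopes above differ by about 2,600 per year (9243.6 − 6644 = 2599.6), but separate models do not test whether that difference is larger than chance. One common approach, sketched here as a possible extension rather than as part of the original analysis, is a single model with an interaction term. The toy data below are hypothetical, with an arbitrary built-in slope difference, and only demonstrate the mechanics.

```r
# Hypothetical toy data for illustration only (NOT the dataset analysed here)
set.seed(42)
n = 100
toy = data.frame(
  years_empl = runif(2 * n, min = 0, max = 30),
  gender     = factor(rep(c("Female", "Male"), each = n))
)
# Built-in slope difference of 2500 per year for the "Male" group (arbitrary)
toy$salary = 5000 + 6500 * toy$years_empl +
  ifelse(toy$gender == "Male", 2500 * toy$years_empl, 0) +
  rnorm(2 * n, sd = 20000)

# years_empl * gender expands to both main effects plus their interaction
toy_interaction = lm(salary ~ years_empl * gender, data = toy)

# The years_empl:genderMale row estimates the difference between the slopes
coef(summary(toy_interaction))["years_empl:genderMale", ]
```

The estimate, standard error and p-value in that row describe the slope difference directly, which the two separate summaries cannot do.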
# salary averages by gender
av_F = mean(df$salary[df$gender == "Female"])
av_M = mean(df$salary[df$gender == "Male"])
# Create the table
avg_slry_table = data.frame(
Gender = c("Female", "Male"),
Average_Salary = c(av_F, av_M))
kable_styling(kable(avg_slry_table,
col.names = c("Gender","Average Salary"),
caption = "Average Salary by Gender"), full_width = FALSE, font_size = 13, bootstrap_options = c("hover", "condensed"))
| Gender | Average Salary |
|---|---|
| Female | 109140.8 |
| Male | 135466.1 |
# Scatterplot with both groups
plot(df$years_empl, df$salary,
col = ifelse(df$gender == "Female", "lightgreen", "orange"),
pch = 19,
xlab = "Years of Employment",
ylab = "Salary",
main = "Salary vs. Years of Employment by Gender")
# Add regression lines
abline(model_female, col = "lightgreen", lwd = 2)
abline(model_male, col = "orange", lwd = 2)
legend("topleft", legend = c("Female", "Male"),
col = c("lightgreen", "orange"), lwd = 2, bty = "n")
The average salaries also differ: female employees have an average salary of approximately 109,141, while male employees have an average salary of approximately 135,466.
A visual comparison of the regression lines supports the observations above.
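The size of the gap in average salary can be made concrete with the two averages from the table above. This is a minimal sketch using the rounded table values; the names `av_F_reported`, `av_M_reported`, `gap_abs` and `ratio_F_to_M` are introduced only for illustration.

```r
# Average salaries copied from the table above (rounded to one decimal)
av_F_reported = 109140.8
av_M_reported = 135466.1

# Absolute gap between the two group means
gap_abs = av_M_reported - av_F_reported

# Female average expressed as a share of the male average
ratio_F_to_M = av_F_reported / av_M_reported
round(ratio_F_to_M, 3)
```

With these figures the gap is about 26,325, and the female average is roughly 81% of the male average.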