To understand how my sample is composed, I'll:
- identify the variables under investigation, i.e., the parameters assessed through the questionnaire;
- estimate the total sample size;
- analyze the distribution of these variables within the sample population.
sample_size <- nrow(respondents)
sample_size
## [1] 200
table(df$gender)
##
## Female Male
## 100 100
mean(df$years_empl)
## [1] 15.73436
mean(df$salary)
## [1] 122303.5
sd(df$salary)
## [1] 79030.12
summary(df$salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30203 54208 97496 122304 179447 331348
My sample is composed of 200 people, gender-balanced between 100
females and 100 males.
Participants have, on average, about 15.7 years of working experience.
The average salary is about 122,304, but
there's a high standard deviation of about 79,030, indicating large
variability in income levels. The median (97,496) is well below the mean, which points to a right-skewed salary distribution.
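To put the "large variability" in relative terms, one option is the coefficient of variation (standard deviation divided by the mean). The sketch below is only an illustration: it reuses the values printed in the output above instead of recomputing them from `df`, and the variable names are hypothetical.

```r
# Values copied from the output above (not recomputed from df)
salary_mean <- 122303.5
salary_sd   <- 79030.12

# Coefficient of variation: spread relative to the mean
cv <- salary_sd / salary_mean
round(cv, 2)  # about 0.65
```

A value of roughly 0.65 means the standard deviation is about two thirds of the mean salary.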
Now I transform gender into a numeric variable (1 = Male, 0 = Female) so that I can compute the correlation table for all three variables:
df$gender_numeric <- ifelse(df$gender == "Male", 1, 0)
cor_data <- df[, c("gender_numeric", "salary", "years_empl")]
cor_matrix <- cor(cor_data, use = "complete.obs")
cor_matrix
## gender_numeric salary years_empl
## gender_numeric 1.000000e+00 0.1669708 4.927207e-18
## salary 1.669708e-01 1.0000000 9.082040e-01
## years_empl 4.927207e-18 0.9082040 1.000000e+00
kable(cor_matrix, digits = 2, caption = "Correlation Matrix: Gender, Salary, and Years Employed")
| | gender_numeric | salary | years_empl |
|---|---|---|---|
| gender_numeric | 1.00 | 0.17 | 0.00 |
| salary | 0.17 | 1.00 | 0.91 |
| years_empl | 0.00 | 0.91 | 1.00 |
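Correlating a 0/1 dummy with a continuous variable gives what is usually called the point-biserial correlation, and its sign depends only on which group is coded as 1. The following is a minimal sketch on simulated data; the `sim` data frame and all its values are made up for illustration and are not the questionnaire data.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Hypothetical toy data: 50 women and 50 men with simulated salaries
sim <- data.frame(
  gender = rep(c("Female", "Male"), each = 50),
  salary = c(rnorm(50, mean = 100000, sd = 30000),
             rnorm(50, mean = 130000, sd = 30000))
)

# Same coding as in the text: Male = 1, Female = 0
male_dummy <- ifelse(sim$gender == "Male", 1, 0)
# Reversed coding: Female = 1, Male = 0
female_dummy <- 1 - male_dummy

r_male   <- cor(male_dummy, sim$salary)
r_female <- cor(female_dummy, sim$salary)

# Same magnitude, opposite sign
c(r_male = r_male, r_female = r_female)
```

Read this way, the positive 0.17 between `gender_numeric` and `salary` only says that the group coded 1 (men) has the higher mean salary.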
At this point, I would like to show that a gender pay gap exists even within a random dataset. I expect that, on average, men have a higher salary than women. :)
t.test(salary ~ gender, data = df)
##
## Welch Two Sample t-test
##
## data: salary by gender
## t = -2.3829, df = 180.59, p-value = 0.01821
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
## -48124.05 -4526.72
## sample estimates:
## mean in group Female mean in group Male
## 109140.8 135466.1
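Indeed, I was right. There is a statistically significant difference in salary between genders (t = -2.38, p = 0.018): in this sample, women earn on average about 26,325 less than men, with a 95% confidence interval for the gap of roughly 4,527 to 48,124.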
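Because the expectation stated before the test is directional (men earn more), a one-sided Welch test is a possible alternative to the two-sided default. The sketch below uses simulated data, not the questionnaire data; `sim` and its parameters are assumptions for illustration. With the formula interface the groups are ordered alphabetically (Female, Male), so `alternative = "less"` tests whether the Female mean is lower than the Male mean.

```r
set.seed(42)  # arbitrary seed

# Hypothetical data: 100 women and 100 men with simulated salaries
sim <- data.frame(
  gender = rep(c("Female", "Male"), each = 100),
  salary = c(rnorm(100, mean = 100000, sd = 30000),
             rnorm(100, mean = 130000, sd = 30000))
)

two_sided <- t.test(salary ~ gender, data = sim)
one_sided <- t.test(salary ~ gender, data = sim, alternative = "less")

# Estimated gap, Male minus Female
gap <- unname(two_sided$estimate[2] - two_sided$estimate[1])
gap

# Two-sided 95% CI for Female minus Male (note the opposite sign)
two_sided$conf.int

# When t is negative, the one-sided p-value is half the two-sided one
c(two_sided = two_sided$p.value, one_sided = one_sided$p.value)
```

Applied to the result reported above (t = -2.3829, two-sided p = 0.01821), the same logic would give a one-sided p-value of about 0.009.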
Another interesting question is the association between years of employment and salary. In other words, I want to understand whether salary grows proportionally as years of employment increase.
plot(df$years_empl, df$salary, xlab = "Years of Employment", ylab = "Salary", main = "Scatterplot of Salary vs Years of Employment", pch = 19, col = "steelblue")
The graph shows that as years of employment increase, salary tends to
increase as well, so the relationship is strongly positive.
However, it is not perfectly linear: in the early
career stages salary increases slowly, but this growth may accelerate
with more years of experience. All in all, this visual finding aligns
with the correlation matrix result (r = 0.91), which indicates a very
strong linear association between salary and years of employment.
Now I remove rows with missing values (NAs), because leaving them in caused an error (variables of different lengths).
df_clean <- df[complete.cases(df$salary, df$years_empl), ]
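To make the filtering step concrete, here is a small sketch of how `complete.cases()` behaves when given two columns. The `toy` data frame is hypothetical and only meant to show the mechanics.

```r
# Hypothetical toy data with missing values in both columns
toy <- data.frame(
  salary     = c(50000, NA, 70000, 65000),
  years_empl = c(2, 5, NA, 8)
)

# TRUE only for rows where both columns are observed
keep <- complete.cases(toy$salary, toy$years_empl)
keep

# Keep the complete rows; all columns are retained
toy_clean <- toy[keep, ]
nrow(toy_clean)  # 2 rows remain (rows 1 and 4)
```

Note that `lm()` with R's default `na.action` setting also omits incomplete rows on its own, so the explicit filter mainly helps keep plots, transformed columns, and the model on the same set of rows.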
To develop a more accurate model for estimating salary based on years
of employment, I have to make the relationship between the dependent and
independent variables as linear as possible.
To do that, I transform the dependent variable by taking the natural logarithm of
the salary, which should improve the model's fit.
df_clean$log_salary <- log(df_clean$salary)
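One way to check whether a log transformation actually linearizes a relationship is to add a quadratic term and see whether it is still needed. The sketch below does this on simulated data in which salary grows exponentially with experience; the data-generating values are assumptions loosely based on the scale of this analysis, not the real data, and this check is an addition rather than part of the original workflow.

```r
set.seed(123)  # arbitrary seed

# Hypothetical data: log salary is linear in years, plus noise
n <- 200
years <- runif(n, min = 0, max = 35)
salary <- exp(10.4 + 0.07 * years + rnorm(n, sd = 0.2))
sim <- data.frame(years = years, salary = salary)

# Same model on the raw and on the log scale, each with a quadratic term
fit_raw <- lm(salary ~ years + I(years^2), data = sim)
fit_log <- lm(log(salary) ~ years + I(years^2), data = sim)

# p-values of the quadratic term in each model
p_quad_raw <- summary(fit_raw)$coefficients["I(years^2)", "Pr(>|t|)"]
p_quad_log <- summary(fit_log)$coefficients["I(years^2)", "Pr(>|t|)"]

c(raw = p_quad_raw, log = p_quad_log)
```

A very small p-value on the raw scale together with a large one on the log scale would suggest that the curvature has been removed by the transformation.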
Finally, I can build the regression model with the logarithmic
salary. This is relevant because the log compresses large values, which reduces the
variability and makes the right-skewed salary data more symmetric.
model <- lm(log_salary ~ years_empl , data = df_clean)
summary(model)
##
## Call:
## lm(formula = log_salary ~ years_empl, data = df_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77041 -0.12197 -0.00111 0.15234 0.41044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.382774 0.027501 377.54 <2e-16 ***
## years_empl 0.070998 0.001517 46.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1933 on 198 degrees of freedom
## Multiple R-squared: 0.9171, Adjusted R-squared: 0.9167
## F-statistic: 2191 on 1 and 198 DF, p-value: < 2.2e-16
plot(df_clean$years_empl, log(df_clean$salary), xlab = "Years of Employment", ylab = "Log(Salary)", main = "Log(Salary) vs Years of Employment", pch = 19, col = "steelblue")
The slope for years_empl (0.071) means that each
additional year of employment is associated with an average increase in salary of about 7.36%,
since exp(0.071) - 1 ≈ 0.0736.
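The conversion from a slope on the log scale to a percentage change can be reproduced directly. The sketch below copies the slope, its standard error, and the residual degrees of freedom from the regression output above; the confidence interval is an addition for illustration, and the variable names are hypothetical.

```r
# Values copied from the regression output above
slope    <- 0.070998
slope_se <- 0.001517
resid_df <- 198

# Percentage change in salary per additional year of employment
pct_change <- (exp(slope) - 1) * 100
round(pct_change, 2)  # 7.36

# Approximate 95% confidence interval, back-transformed the same way
t_crit <- qt(0.975, df = resid_df)
pct_ci <- (exp(slope + c(-1, 1) * t_crit * slope_se) - 1) * 100
round(pct_ci, 2)
```

For small slopes the percentage is close to 100 times the coefficient itself (7.1%), but the exponentiated version is the exact one.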
The model explains approximately 91.7% of the variance in log salary,
which indicates an excellent fit and a very strong relationship between
years of employment and salary growth.
Both the intercept and slope coefficients are highly statistically
significant (p-values < 2e-16), meaning these relationships are very
unlikely to be due to chance.
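The residuals are centered on zero (median -0.001) with similar quartiles (-0.12 and 0.15), which is consistent with a good fit. The log transformation
likely helps by stabilizing variance and reducing skewness in the salary data, although the summary alone does not confirm this.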
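Claims about residuals can be checked more directly than from the five-number summary. The sketch below shows a few base R diagnostics on simulated data; the `sim` data frame and its parameters are assumptions for illustration, and in the real analysis `fit` would be replaced by the fitted `model`. The heteroscedasticity check is a rough Glejser-style regression of absolute residuals on fitted values, not a formal packaged test.

```r
set.seed(7)  # arbitrary seed

# Hypothetical data resembling the log-linear model above
n <- 200
years_empl <- runif(n, 0, 35)
log_salary <- 10.4 + 0.07 * years_empl + rnorm(n, sd = 0.2)
sim <- data.frame(years_empl, log_salary)

fit <- lm(log_salary ~ years_empl, data = sim)
res <- resid(fit)

# Normality of residuals (Shapiro-Wilk test)
sw <- shapiro.test(res)
sw$p.value

# Rough heteroscedasticity check: does |residual| trend with the fitted value?
spread_fit <- lm(abs(res) ~ fitted(fit))
spread_p <- summary(spread_fit)$coefficients[2, "Pr(>|t|)"]
spread_p

# Standard diagnostic plots: residuals vs fitted, and normal Q-Q
plot(fit, which = 1:2)
```

Large p-values in both checks, together with a flat residuals-vs-fitted plot, would support the statement that the variance is stable on the log scale.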
In summary, this analysis shows a strong, statistically significant relationship between years of employment and salary when using the log-transformed salary. Specifically, each additional year of employment is associated with an average salary increase of about 7.36%. The model explains approximately 91.7% of the variation in log salary, indicating a very good fit.
The logarithmic transformation helps address skewness and heteroscedasticity in the salary data, providing a more reliable and interpretable model. Overall, the findings suggest that salary grows exponentially with years of employment rather than linearly.
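The exponential growth implied by the model can be illustrated by back-transforming its predictions. The sketch below uses the intercept and slope copied from the regression output; the experience levels are hypothetical, and the doubling time is an added illustration rather than a result reported above.

```r
# Coefficients copied from the regression output above
intercept <- 10.382774
slope     <- 0.070998

# Years of employment needed for the predicted salary to double
doubling_time <- log(2) / slope
round(doubling_time, 2)  # 9.76

# Back-transformed predictions at a few hypothetical experience levels
years <- c(0, 10, 20, 30)
predicted_salary <- exp(intercept + slope * years)
data.frame(years, predicted_salary = round(predicted_salary))
```

Each 10-year step multiplies the predicted salary by the same factor, which is what exponential growth means here. Exponentiating a prediction made on the log scale gives a value closer to the median than to the mean salary, so these figures are indicative only.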