1. Sample description

To understand how my sample is composed, I will:
- identify the variables under investigation, i.e., the parameters assessed through the questionnaire;
- estimate the total sample size;
- analyze the distribution of these variables within the sample population.

sample_size <- nrow(respondents)
sample_size
## [1] 200
table(df$gender)
## 
## Female   Male 
##    100    100
mean(df$years_empl)
## [1] 15.73436
mean(df$salary)
## [1] 122303.5
sd(df$salary)
## [1] 79030.12
summary(df$salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30203   54208   97496  122304  179447  331348

My sample is composed of 200 people and is gender-balanced: 100 females and 100 males.
On average, participants have about 15.7 years of work experience.
The average salary is about 122,304, but the standard deviation is high (about 79,030), indicating large variability in income levels. The median (97,496) is well below the mean, which points to a right-skewed salary distribution.

Next, I convert gender into a numeric variable (Male = 1, Female = 0) so that I can compute the correlation matrix for all three variables.

df$gender_numeric <- ifelse(df$gender == "Male", 1, 0)
cor_data <- df[, c("gender_numeric", "salary", "years_empl")]
cor_matrix <- cor(cor_data, use = "complete.obs")
cor_matrix
##                gender_numeric    salary   years_empl
## gender_numeric   1.000000e+00 0.1669708 4.927207e-18
## salary           1.669708e-01 1.0000000 9.082040e-01
## years_empl       4.927207e-18 0.9082040 1.000000e+00
kable(cor_matrix, digits = 2, caption = "Correlation Matrix: Gender, Salary, and Years Employed")
Correlation Matrix: Gender, Salary, and Years Employed

                 gender_numeric   salary   years_empl
gender_numeric             1.00     0.17         0.00
salary                     0.17     1.00         0.91
years_empl                 0.00     0.91         1.00

At this point, I would like to show that a gender pay gap exists even within a random dataset. I expect that, on average, men have a higher salary than women. :)

t.test(salary ~ gender, data = df)
## 
##  Welch Two Sample t-test
## 
## data:  salary by gender
## t = -2.3829, df = 180.59, p-value = 0.01821
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -48124.05  -4526.72
## sample estimates:
## mean in group Female   mean in group Male 
##             109140.8             135466.1

Indeed, I was right. There is a statistically significant difference in salary between genders (t = -2.38, p = 0.018): in this sample, women earn about 109,141 on average, against about 135,466 for men.
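The numbers quoted from the test output can also be pulled directly from the object returned by t.test(), which avoids copying them by hand. The following is a minimal sketch on simulated stand-in data: the data frame `sim` and its parameters are assumptions made for illustration, not the report's `df`.

```r
# Hypothetical stand-in data (NOT the report's df): two groups of 100
# with log-normally distributed salaries.
set.seed(42)
sim <- data.frame(
  gender = rep(c("Female", "Male"), each = 100),
  salary = c(rlnorm(100, meanlog = 11.3, sdlog = 0.6),
             rlnorm(100, meanlog = 11.5, sdlog = 0.6))
)

# Welch two-sample t-test, same call shape as in the report.
tt <- t.test(salary ~ gender, data = sim)

# Extract the pieces needed for the write-up.
mean_diff <- unname(tt$estimate[1] - tt$estimate[2])  # Female minus Male
ci        <- tt$conf.int                              # 95% CI for that difference
p_value   <- tt$p.value

mean_diff
ci
p_value
```

With the real data, replacing `sim` by `df` would give the difference in means together with its confidence interval, which is often more informative than the p-value alone.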

2. Association between years of employment and salary

Another interesting question is the association between years of employment and salary. In other words, I want to understand whether salary grows proportionally as years of employment increase.

plot(df$years_empl, df$salary, xlab = "Years of Employment", ylab = "Salary", main = "Scatterplot of Salary vs Years of Employment", pch = 19, col = "steelblue")

The graph shows that as years of employment increase, salary tends to increase as well, so the relationship is strongly positive. However, it is not perfectly linear.
In the early career stages salary increases slowly, but growth appears to accelerate with more years of experience. All in all, this visual finding is consistent with the correlation matrix result (r = 0.91), which indicates a very strong linear association between salary and years of employment, even though the scatterplot suggests some curvature.
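One way to make the curvature easier to see is to overlay a smoothed trend line on the scatterplot. The sketch below uses base R's lowess() on simulated stand-in data; the vectors `years` and `salary` and the parameters that generate them are assumptions for illustration, not the report's data.

```r
# Hypothetical stand-in data (NOT the report's df): salary grows
# exponentially with years, plus multiplicative noise.
set.seed(1)
years  <- runif(200, min = 0, max = 35)
salary <- exp(10.4 + 0.07 * years + rnorm(200, sd = 0.2))

plot(years, salary,
     xlab = "Years of Employment", ylab = "Salary",
     main = "Salary vs Years of Employment with LOWESS trend",
     pch = 19, col = "steelblue")

# lowess() returns a list with sorted x values and smoothed y values.
smooth <- lowess(years, salary)
lines(smooth, col = "firebrick", lwd = 2)
```

If the smoothed line bends upward instead of following a straight path, that supports transforming salary before fitting a linear model.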

3. Estimate salary by years of employment

Now I remove missing values (NAs), because leaving them in caused an error (variables of different lengths).

df_clean <- df[complete.cases(df$salary, df$years_empl), ]

To develop a more accurate model for estimating salary based on years of employment, I have to make the relationship between the dependent and independent variables as linear as possible.
To do that, I transform the dependent variable by taking the natural logarithm of salary.
This should improve the model’s fit.

df_clean$log_salary <- log(df_clean$salary)

Finally, I can build the regression model with the log of salary. This matters because the log transformation compresses the spread of the highest salaries and reduces the right skew of the data.

model <- lm(log_salary ~ years_empl , data = df_clean)
summary(model)
## 
## Call:
## lm(formula = log_salary ~ years_empl, data = df_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77041 -0.12197 -0.00111  0.15234  0.41044 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.382774   0.027501  377.54   <2e-16 ***
## years_empl   0.070998   0.001517   46.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1933 on 198 degrees of freedom
## Multiple R-squared:  0.9171, Adjusted R-squared:  0.9167 
## F-statistic:  2191 on 1 and 198 DF,  p-value: < 2.2e-16
plot(df_clean$years_empl, log(df_clean$salary), xlab = "Years of Employment", ylab = "Log(Salary)", main = "Log(Salary) vs Years of Employment",  pch = 19, col = "steelblue")
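Because the model is fitted on the log scale, estimating an actual salary requires back-transforming the prediction with exp(). The following is a minimal sketch on simulated stand-in data: `sim`, `fit`, and the chosen years in `new_years` are assumptions for illustration, and the report's own model would be used in their place.

```r
# Hypothetical stand-in data (NOT the report's df_clean).
set.seed(123)
sim <- data.frame(years_empl = runif(200, min = 0, max = 35))
sim$log_salary <- 10.4 + 0.07 * sim$years_empl + rnorm(200, sd = 0.2)

fit <- lm(log_salary ~ years_empl, data = sim)

# Predict on the log scale (with a 95% prediction interval),
# then back-transform to the salary scale with exp().
new_years   <- data.frame(years_empl = c(1, 10, 20, 30))
pred_log    <- predict(fit, newdata = new_years, interval = "prediction")
pred_salary <- exp(pred_log)  # columns: fit, lwr, upr

cbind(new_years, round(pred_salary))
```

Note that exp() of a predicted log salary estimates a typical (median-like) salary rather than the arithmetic mean, so it will tend to sit below the mean when the residual variance is not negligible.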

4. Interpretation and conclusions

The slope for years_empl (0.071) means that each additional year of employment is associated with an average 7.36% increase in salary, since exp(0.071) - 1 ≈ 0.0736.
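The conversion from a log-scale slope to a percentage change can be checked with a short calculation. The sketch below uses the slope estimate reported in the model summary (0.070998); the variable names are chosen here for illustration.

```r
# Slope of years_empl from the model summary above.
slope <- 0.070998

# In a log-linear model, one extra unit of x multiplies y by exp(slope).
pct_increase <- (exp(slope) - 1) * 100
round(pct_increase, 2)
## [1] 7.36

# The same logic compounds over several years, e.g. a 10-year multiplier.
ten_year_multiplier <- exp(10 * slope)
ten_year_multiplier
```

Reading the raw coefficient as "7.1%" is only an approximation; the exact figure comes from exp(slope) - 1, and the gap between the two widens as the coefficient grows.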

The model explains approximately 91.7% of the variance in log salary, which indicates an excellent fit and a very strong relationship between years of employment and salary growth.

Both the intercept and slope coefficients are highly statistically significant (p-values < 2e-16), meaning these relationships are very unlikely to be due to chance.

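The residuals are consistent with a good fit: their median is close to zero (-0.001) and the quartiles are roughly symmetric (-0.12 and 0.15), which is what I would expect if the log transformation stabilizes the variance and reduces the skewness of the salary data.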
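The residual summary alone does not show whether the variance is constant, so a visual check can help. The following is a minimal sketch of two standard diagnostic plots on simulated stand-in data; `sim` and `fit` are assumptions for illustration, and the report's own model object would be used instead.

```r
# Hypothetical stand-in data (NOT the report's df_clean).
set.seed(7)
sim <- data.frame(years_empl = runif(200, min = 0, max = 35))
sim$log_salary <- 10.4 + 0.07 * sim$years_empl + rnorm(200, sd = 0.2)
fit <- lm(log_salary ~ years_empl, data = sim)

op <- par(mfrow = c(1, 2))  # two plots side by side

# Residuals vs fitted: a funnel shape would suggest heteroscedasticity,
# a curved band would suggest remaining non-linearity.
plot(fitted(fit), resid(fit),
     xlab = "Fitted log(salary)", ylab = "Residuals",
     main = "Residuals vs Fitted", pch = 19, col = "steelblue")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points far from the line suggest non-normal residuals.
qqnorm(resid(fit), pch = 19, col = "steelblue")
qqline(resid(fit))

par(op)  # restore the previous plotting layout
```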

All in all, this analysis demonstrates a strong, statistically significant relationship between years of employment and salary when using the log-transformed salary. Specifically, each additional year of employment is associated with an average salary increase of about 7.36%. The model explains approximately 91.7% of the variation in log salary, indicating a very good fit.

The logarithmic transformation helps address the skewness of the salary data and should reduce heteroscedasticity, which makes the model more reliable and easier to interpret. Overall, the findings suggest that salary grows exponentially with years of employment rather than linearly.