The dataset consists of three variables: “years of employment”, “gender” and “salary”. The sample is first summarized with simple descriptive statistics:
##
## Female Male
## 100 100
The sample is balanced by gender: it contains 100 women and 100 men.
## Group Mean_Salary
## 1 Overall 122303.5
## 2 Female 109140.8
## 3 Male 135466.1
## [1] "Compared to females, who have a mean salary of 109140.76 men have a higher salary of 135466.14"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.007167 7.790195 16.191430 15.734362 22.908421 29.666752
Years of employment range from about 0.007 (effectively 0) to about 29.67 years. The mean is 15.73 years and the median is 16.19 years.
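The summaries above can be produced with a few base R calls. The following is a minimal sketch, not the original analysis code: the data frame `df` is simulated here so that the block runs on its own, the column name `gender` is an assumption, and the simulated numbers will not match the reported output.

```r
# Hypothetical stand-in for the report's data frame `df`.
# The values are simulated for illustration and are NOT the original data.
set.seed(1)
n <- 200
df <- data.frame(
  gender     = rep(c("Female", "Male"), each = n / 2),  # assumed column name
  years_empl = runif(n, min = 0, max = 30)
)
df$salary <- exp(10.4 + 0.07 * df$years_empl + rnorm(n, sd = 0.2))

# Number of observations per gender
gender_counts <- table(df$gender)

# Mean salary overall and per gender, collected in one small table
mean_salary <- data.frame(
  Group       = c("Overall", names(gender_counts)),
  Mean_Salary = c(mean(df$salary),
                  as.numeric(tapply(df$salary, df$gender, mean)))
)

# Minimum, quartiles, mean and maximum of years of employment
years_summary <- summary(df$years_empl)

print(gender_counts)
print(mean_salary)
print(years_summary)
```

Because the two groups have the same size, the overall mean salary equals the average of the two group means, which is a quick consistency check on the table.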
The scatterplot shows the relationship between the variables “years of employment” and “salary”. There is a clear upward trend: as years_empl increases, salary increases too, which suggests a strong positive association between the two variables. However, the trend is not linear; salary rises more steeply after about 15-20 years of employment.
# Scatterplot of salary against years of employment
plot(df$years_empl, df$salary, xlab = "Years of employment", ylab = "Salary")
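The visual impression from the scatterplot can be backed up with a number. As a rough sketch, the Pearson correlation can be computed on the raw scale and again with salary on the log scale; if the curve is roughly exponential, the log-scale correlation tends to be the higher of the two. The data below are simulated placeholders, not the report's data, so the printed values are illustrative only.

```r
# Simulated placeholder data (NOT the original data set)
set.seed(42)
years_empl <- runif(200, min = 0, max = 30)
salary     <- exp(10.4 + 0.07 * years_empl + rnorm(200, sd = 0.2))

# Pearson correlation on the raw scale and with log-transformed salary
r_raw <- cor(years_empl, salary)
r_log <- cor(years_empl, log(salary))

print(c(raw = r_raw, log = r_log))
```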
Because salary rises faster at longer tenures, a log transformation is applied to the variable “salary” so that its relationship with years of employment becomes approximately linear.
model <- lm(log(salary) ~ years_empl, data = df)
summary(model)
##
## Call:
## lm(formula = log(salary) ~ years_empl, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77041 -0.12197 -0.00111 0.15234 0.41044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.382774 0.027501 377.54 <2e-16 ***
## years_empl 0.070998 0.001517 46.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1933 on 198 degrees of freedom
## Multiple R-squared: 0.9171, Adjusted R-squared: 0.9167
## F-statistic: 2191 on 1 and 198 DF, p-value: < 2.2e-16
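When years of employment is 0, meaning someone has just started the job, the expected log-salary is about 10.3828, which corresponds to a salary of roughly 32,300 on the original scale. Each additional year of employment raises the expected log-salary by 0.071, which multiplies salary by exp(0.071) ≈ 1.074, an increase of around 7.4% per year. Both coefficients are highly significant (p < 2e-16), and years of employment explains about 92% of the variance in log-salary (R-squared = 0.917).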
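The conversion from log-scale coefficients to a percentage change can be checked with a short calculation. This sketch copies the two estimates from the regression output above rather than refitting the model; the variable names are chosen here for illustration.

```r
# Estimates copied from the summary(model) output
b0 <- 10.382774   # intercept: expected log-salary at 0 years
b1 <- 0.070998    # slope: change in log-salary per additional year

# Back-transform to the original salary scale
baseline_salary <- exp(b0)              # salary implied at 0 years of employment
pct_per_year    <- (exp(b1) - 1) * 100  # percent change per additional year

print(round(baseline_salary))
print(round(pct_per_year, 2))
```

With the fitted object available, the same quantities would come from `exp(coef(model))`. For small slopes the coefficient itself (7.1%) is already close to the exact percentage change.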