Table 1 presents the total number of persons in the dataset, with both genders equally represented. The mean years of employment of the participants is 15.73, with a standard deviation of 9.04. The mean salary of the persons in the dataset is 1.2230345 × 10^5 (about 122,303), with a standard deviation of 7.9030117 × 10^4 (about 79,030).
| Gender | Count |
|---|---|
| Female | 100 |
| Male | 100 |
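A minimal sketch of how the counts, means and standard deviations above could be computed. The original data are not included in this report, so the data frame below is a synthetic stand-in: only the column names `salary`, `years_empl` and `gender` come from the model output, and all values are simulated for illustration.

```r
# Synthetic stand-in for the report's data frame `df` (values are made up;
# only the column names are taken from the report).
set.seed(1)
df <- data.frame(
  gender     = rep(c("Female", "Male"), each = 100),
  years_empl = runif(200, min = 0, max = 30)
)
df$salary <- 10^(4.5 + 0.03 * df$years_empl + rnorm(200, sd = 0.07))

# Number of persons per gender (the content of Table 1).
gender_counts <- table(df$gender)

# Mean and standard deviation of both numeric variables.
stats <- data.frame(
  variable = c("years_empl", "salary"),
  mean     = c(mean(df$years_empl), mean(df$salary)),
  sd       = c(sd(df$years_empl), sd(df$salary))
)

print(gender_counts)
print(stats, digits = 4)
```

Printing with a fixed number of significant digits avoids output such as 15.7343624, which suggests more precision than the data support.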
I treat years of employment as the independent variable and salary as the dependent variable, because the salary someone earns may depend on their years of employment. The scatterplot shows that salary increases as the years of employment increase, which is logical given real-world circumstances. Additionally, I marked the outliers in red: salaries above 250,000 or below 50,000.
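The outlier marking described above can be sketched as follows. The thresholds 250000 and 50000 are the ones stated in the text; the data are again a simulated stand-in, and the plotting style (base R, filled points) is my assumption rather than the exact plot used in this report.

```r
# Synthetic stand-in data (illustrative values only).
set.seed(1)
df <- data.frame(years_empl = runif(200, min = 0, max = 30))
df$salary <- 10^(4.5 + 0.03 * df$years_empl + rnorm(200, sd = 0.07))

# A point counts as an outlier if the salary is above the upper
# threshold OR below the lower threshold.
is_outlier <- df$salary > 250000 | df$salary < 50000

# Scatterplot with outliers in red and all other points in black.
plot(df$years_empl, df$salary,
     col  = ifelse(is_outlier, "red", "black"),
     pch  = 19,
     xlab = "Years of employment",
     ylab = "Salary")
```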
First I fitted a regression model to the untransformed data, but this model does not represent reality. I therefore linearized the association between salary and years of employment by transforming the dependent variable, salary, with log10. The resulting model is log10(salary) = 4.51 + 0.03 * years_empl.
##
## Call:
## lm(formula = log10(df$salary) ~ df$years_empl)
##
## Coefficients:
##   (Intercept)  df$years_empl
##       4.50918        0.03083
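A sketch of the two fitting steps described above, on simulated stand-in data (the values are made up, so the coefficients will not match the output above exactly). The residual plots are my own suggestion for checking whether the transformation helped; they are not taken from the original analysis.

```r
# Synthetic stand-in data (illustrative values only).
set.seed(1)
df <- data.frame(years_empl = runif(200, min = 0, max = 30))
df$salary <- 10^(4.5 + 0.03 * df$years_empl + rnorm(200, sd = 0.07))

# Step 1: model on the untransformed salary.
fit_raw <- lm(salary ~ years_empl, data = df)

# Step 2: model on log10(salary), which linearizes the association.
fit_log <- lm(log10(salary) ~ years_empl, data = df)
print(coef(fit_log))

# Residuals against fitted values for both models; a curved or
# fan-shaped pattern indicates that a straight line fits poorly.
par(mfrow = c(1, 2))
plot(fitted(fit_raw), resid(fit_raw), main = "salary",
     xlab = "Fitted", ylab = "Residual")
plot(fitted(fit_log), resid(fit_log), main = "log10(salary)",
     xlab = "Fitted", ylab = "Residual")
```

Passing `data = df` instead of writing `df$salary` in the formula gives shorter coefficient names (`years_empl` rather than `df$years_empl`) and lets `predict()` work with new data.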
The model is log10(salary) = 4.51 + 0.03 * years_empl. The intercept, 4.51, is the predicted log10(salary) at zero years of employment, which corresponds to a starting salary of about 10^4.51 ≈ 32,400. The slope, 0.03, means that each additional year of employment increases log10(salary) by 0.03. In terms of actual salary, a 1-unit increase in years_empl multiplies the salary by 10^0.03 ≈ 1.072, which is an increase of about 7.2%. With the unrounded slope of 0.03083, the factor is 10^0.03083 ≈ 1.074, or about 7.4%.
So each extra year of employment gives a salary increase of roughly 7.2%. Because this is a percentage of the current salary, the absolute increase per year grows with the years of employment.
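The back-transformation from the log10 scale to a percentage can be checked with a few lines of R. The slopes are the values reported above; the helper name `pct_increase` is hypothetical and only used for this illustration.

```r
b_rounded <- 0.03     # slope as rounded in the text
b_full    <- 0.03083  # slope as printed in the lm() output

# An increase of b in log10(salary) multiplies the salary by 10^b,
# so the percentage increase per year is (10^b - 1) * 100.
pct_increase <- function(b) (10^b - 1) * 100

print(round(pct_increase(b_rounded), 1))  # about 7.2
print(round(pct_increase(b_full), 1))     # about 7.4

# The effect compounds: after 10 years the salary is multiplied
# by 10^(10 * b), which is close to a doubling for b = 0.03.
factor_10_years <- 10^(10 * b_rounded)
print(round(factor_10_years, 2))
```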
I then fitted models for males and females: first a single model with gender as an additional predictor, then a separate model for each gender.
##
## Call:
## lm(formula = log10(df$salary) ~ df$years_empl + df$gender)
##
## Coefficients:
##   (Intercept)  df$years_empl  df$genderMale
##       4.47325        0.03083        0.07187
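The gender coefficient in this model can be back-transformed in the same way as the slope. The sketch below assumes that `df$genderMale` is a 0/1 indicator with Female as the reference level, as the coefficient name suggests; the value 0.07187 is taken from the output above.

```r
b_gender <- 0.07187  # coefficient of df$genderMale from the output above

# At the same years of employment, the model predicts that the male
# salary is 10^b_gender times the female salary.
ratio <- 10^b_gender
print(round(ratio, 2))              # about 1.18
print(round((ratio - 1) * 100, 0))  # about 18 (percent)
```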
##
## Call:
## lm(formula = log10(salary) ~ years_empl, data = df_male)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.243480 -0.037542  0.001445  0.030229  0.165556
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.5083895  0.0133719  337.15   <2e-16 ***
## years_empl  0.0331680  0.0007374   44.98   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06647 on 98 degrees of freedom
## Multiple R-squared: 0.9538, Adjusted R-squared: 0.9533
## F-statistic: 2023 on 1 and 98 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = log10(salary) ~ years_empl, data = df_female)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.312028 -0.033127  0.006192  0.046280  0.177572
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.5099738  0.0159497   282.8   <2e-16 ***
## years_empl  0.0284999  0.0008796    32.4   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07928 on 98 degrees of freedom
## Multiple R-squared: 0.9146, Adjusted R-squared: 0.9138
## F-statistic: 1050 on 1 and 98 DF, p-value: < 2.2e-16
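To compare the two separate models on the salary scale, the reported coefficients can be back-transformed as in the sketch below. The numbers are copied from the two summaries above; the variable names are hypothetical. This is only a descriptive comparison of the point estimates, not a formal test of whether the slopes differ.

```r
# Coefficients copied from the male and female model summaries above.
slopes     <- c(Male = 0.0331680, Female = 0.0284999)
intercepts <- c(Male = 4.5083895, Female = 4.5099738)

# Percentage salary increase per additional year of employment.
growth_pct <- (10^slopes - 1) * 100
print(round(growth_pct, 1))

# Predicted salary at zero years of employment.
start_salary <- 10^intercepts
print(round(start_salary, -2))
```

Under these estimates the predicted starting salaries are nearly identical, while the yearly growth is higher in the male model than in the female model.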