1. Sample description

Table 1 shows the number of persons in the dataset by gender: 200 persons in total, with both genders equally represented (100 each). The participants have a mean of 15.73 years of employment (standard deviation 9.04). The mean salary is 122,303 (standard deviation 79,030).

Table 1: Gender Distribution
Gender    Count
Female    100
Male      100
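
The figures in this section could be computed along the following lines. This is only a minimal sketch: the column names salary, years_empl and gender are taken from the model calls later in this report, but the data frame below is simulated for illustration and is not the original dataset, so its numbers will differ from those reported above.

```r
# Simulated stand-in for the real data (assumption: the original file is not shown here).
set.seed(1)
df <- data.frame(
  gender     = rep(c("Female", "Male"), each = 100),
  years_empl = runif(200, min = 0, max = 30)
)
df$salary <- 10^(4.5 + 0.03 * df$years_empl + rnorm(200, sd = 0.07))

# Table 1: number of persons per gender
gender_counts <- table(df$gender)
print(gender_counts)

# Mean and standard deviation of both variables, rounded for reporting
desc <- sapply(df[c("years_empl", "salary")],
               function(x) c(mean = mean(x), sd = sd(x)))
print(round(desc, 2))

# Print a large number without scientific notation
format(round(mean(df$salary)), big.mark = ",", scientific = FALSE)
```

Rounding and formatting the values before they are placed in the text avoids long decimal tails and scientific notation in the written summary.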


2. Association between years of employment and salary (scatterplot)

I treat years of employment as the independent variable and salary as the dependent variable, since the salary someone earns may depend on their years of employment. The scatterplot shows that salary increases as years of employment increase, which is plausible given real-world circumstances. In addition, I highlighted as outliers (in red) the salaries above 250,000 or below 50,000.
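
A scatterplot of this kind could be drawn with base R roughly as follows. The thresholds 250000 and 50000 come from the text above; the data frame is simulated for illustration (an assumption, not the original data), and the vector name is_outlier is hypothetical.

```r
# Simulated stand-in data (assumption), using the column names from this report.
set.seed(1)
df <- data.frame(years_empl = runif(200, min = 0, max = 30))
df$salary <- 10^(4.5 + 0.03 * df$years_empl + rnorm(200, sd = 0.07))

# Flag salaries above 250000 or below 50000
is_outlier <- df$salary > 250000 | df$salary < 50000

# Years of employment on the x-axis, salary on the y-axis, flagged points in red
plot(df$years_empl, df$salary,
     col  = ifelse(is_outlier, "red", "black"),
     pch  = 19,
     xlab = "Years of employment",
     ylab = "Salary",
     main = "Salary by years of employment")
```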


3. Estimate salary by years of employment

I first fitted a linear regression on the untransformed salary, but that model did not represent the data realistically. I therefore transformed the dependent variable, salary, with log10 to linearize its association with years of employment. The resulting model is: log10(salary) = 4.51 + 0.03 * years_empl.

## 
## Call:
## lm(formula = log10(df$salary) ~ df$years_empl)
## 
## Coefficients:
##   (Intercept)  df$years_empl  
##       4.50918        0.03083
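
As a hedged sketch of how such a model can be fitted and used, the code below fits the log10 model on simulated data (an assumption; the original data are not reproduced here) and back-transforms a prediction to the salary scale. The object names fit_log, new_person and pred_salary are hypothetical. Passing data = df instead of writing df$salary in the formula is a design choice that gives shorter coefficient names and makes predict() work with new data.

```r
# Simulated stand-in data (assumption), using the column names from this report.
set.seed(1)
df <- data.frame(years_empl = runif(200, min = 0, max = 30))
df$salary <- 10^(4.5 + 0.03 * df$years_empl + rnorm(200, sd = 0.07))

# Linear model on the log10-transformed salary
fit_log <- lm(log10(salary) ~ years_empl, data = df)
print(coef(fit_log))

# Predicted salary for a person with 10 years of employment:
# predict() returns log10(salary), so raise 10 to that power
new_person  <- data.frame(years_empl = 10)
pred_salary <- 10^predict(fit_log, newdata = new_person)
print(pred_salary)
```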


4. Interpretation

The model is log10(salary) = 4.51 + 0.03 * years_empl. The intercept, 4.51, is the predicted log10(salary) at zero years of employment, which corresponds to a salary of roughly 32,000. The slope, 0.03, means that each additional year of employment increases log10(salary) by 0.03. On the salary scale, a 1-unit increase in years_empl multiplies the salary by 10^0.03 ≈ 1.072, which is a 7.2% increase. (With the unrounded slope 0.03083 the factor is 10^0.03083 ≈ 1.074, or about 7.4%.)

So, each extra year of employment gives a salary increase of a little over 7%. Because this increase is a percentage of the current salary, the absolute raise per year grows with the years of employment.
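
The back-transformation from the log10 scale to a percentage can be checked with a few lines of R. The coefficients are copied from the model output in section 3; the helper name pct_increase is hypothetical.

```r
b_rounded <- 0.03      # slope as quoted in the text
b_fitted  <- 0.03083   # slope from the lm() output
intercept <- 4.50918   # intercept from the lm() output

# A change of b in log10(salary) multiplies the salary by 10^b
pct_increase <- function(b) (10^b - 1) * 100

round(pct_increase(b_rounded), 1)  # 7.2
round(pct_increase(b_fitted), 1)   # 7.4

# Predicted salary at zero years of employment, to 3 significant digits
signif(10^intercept, 3)            # 32300
```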


5. Gender effects

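I fitted three models to examine gender effects: a joint model with gender as an additional predictor, followed by separate models for males and females.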

## 
## Call:
## lm(formula = log10(df$salary) ~ df$years_empl + df$gender)
## 
## Coefficients:
##   (Intercept)  df$years_empl  df$genderMale  
##       4.47325        0.03083        0.07187
## 
## Call:
## lm(formula = log10(salary) ~ years_empl, data = df_male)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.243480 -0.037542  0.001445  0.030229  0.165556 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.5083895  0.0133719  337.15   <2e-16 ***
## years_empl  0.0331680  0.0007374   44.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06647 on 98 degrees of freedom
## Multiple R-squared:  0.9538, Adjusted R-squared:  0.9533 
## F-statistic:  2023 on 1 and 98 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = log10(salary) ~ years_empl, data = df_female)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.312028 -0.033127  0.006192  0.046280  0.177572 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.5099738  0.0159497   282.8   <2e-16 ***
## years_empl  0.0284999  0.0008796    32.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07928 on 98 degrees of freedom
## Multiple R-squared:  0.9146, Adjusted R-squared:  0.9138 
## F-statistic:  1050 on 1 and 98 DF,  p-value: < 2.2e-16
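
To read these outputs on the salary scale, the same back-transformation as in section 4 can be applied. The sketch below only converts the coefficients printed above; the helper name pct is hypothetical.

```r
# Coefficients copied from the outputs above
gender_male  <- 0.07187    # df$genderMale in the joint model
slope_male   <- 0.0331680  # years_empl in the male-only model
slope_female <- 0.0284999  # years_empl in the female-only model

# A change of b in log10(salary) multiplies the salary by 10^b
pct <- function(b) (10^b - 1) * 100

round(pct(gender_male), 1)   # 18
round(pct(slope_male), 1)    # 7.9
round(pct(slope_female), 1)  # 6.8
```

Under these numbers, the joint model estimates that men earn roughly 18% more than women with the same years of employment, and the separate models estimate a salary growth of about 7.9% per year for men versus about 6.8% per year for women.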