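library(haven)  # read_dta()
library(dplyr)  # %>% and mutate()
library(car)    # crPlots() and carPalette()

sa <- read_dta('SA.dta')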
head(sa)
summary(sa)

sa <- sa %>% 
  mutate(logwage = log(wage))

Density distributions

The ‘raw’ wage variable has a long right ‘tail’ (a positive skew), which is consistent with an approximately log-normal distribution. After the log transformation, the distribution is much closer to normal.
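The effect of the log transformation on skewness can be checked numerically. The sketch below is only an illustration: it uses simulated log-normal draws (not the SA data), and `wage_sim` and the `skewness()` helper are names introduced here.

```r
# Synthetic wages for illustration only; these are NOT the SA data.
set.seed(42)
wage_sim <- rlnorm(10000, meanlog = 2, sdlog = 1)

# Hypothetical helper: moment-based sample skewness.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(wage_sim)       # strongly positive: long right tail
skew_log <- skewness(log(wage_sim))  # close to zero: roughly symmetric

# Kernel density estimates; pass either object to plot() to draw the curve.
dens_raw <- density(wage_sim)
dens_log <- density(log(wage_sim))

round(c(raw = skew_raw, logged = skew_log), 2)
```

With the real data, the same check would use `sa$wage` and `sa$logwage` in place of the simulated vector.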

Model 1: logged wage ~ education plus work experience

Interpretation: both predictors have a positive, statistically significant effect. Holding the other predictor constant, a one-unit increase in education raises the logarithm of wages by 0.15, and a one-unit increase in experience raises it by 0.015.

summary(mod1 <- lm(logwage ~ educ + exper, data = sa))
## 
## Call:
## lm(formula = logwage ~ educ + exper, data = sa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2228 -0.4612  0.0557  0.5014  3.9545 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.2001317  0.0207020   9.667   <2e-16 ***
## educ        0.1503643  0.0017467  86.086   <2e-16 ***
## exper       0.0154358  0.0005441  28.369   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7832 on 19945 degrees of freedom
## Multiple R-squared:  0.2766, Adjusted R-squared:  0.2765 
## F-statistic:  3813 on 2 and 19945 DF,  p-value: < 2.2e-16

How many years should a worker work on average to double her wage?

To answer the question, we exponentiate the coefficient: exp(0.0154) ≈ 1.016, so each additional year of work experience raises the wage by about 1.6%, which the calculation below rounds to 2%. With the rounded factor, the number of years needed to double the wage is log1.02(2) ≈ 35 (the rule of 69 gives 69/2 = 34.5). This rounding is coarse: with the unrounded coefficient, the doubling time is log(2)/0.0154 ≈ 45 years.

exp(coef(mod1))
## (Intercept)        educ       exper 
##    1.221564    1.162258    1.015556
# print(betaeduc <- (1.16-1)*100)
print(betaexper <- (1.02-1)*100)
## [1] 2
log(2, 1.02)
## [1] 35.00279
# or
69/2
## [1] 34.5
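The doubling time is sensitive to how the growth rate is rounded. As a minimal sketch, the snippet below repeats the calculation with the unrounded `exper` estimate, copied by hand from the `summary(mod1)` output so that it runs without the data; the variable names are introduced here for illustration.

```r
b_exper <- 0.0154358               # exper estimate copied from summary(mod1)

growth       <- exp(b_exper)       # multiplicative wage change per year
pct_per_year <- (growth - 1) * 100 # percentage change per year

years_exact   <- log(2) / b_exper  # equivalent to log(2, growth)
years_rounded <- log(2, 1.02)      # version based on the rounded 2%

round(c(pct_per_year = pct_per_year,
        exact = years_exact,
        rounded = years_rounded), 2)
```

With `mod1` in memory, `log(2) / coef(mod1)["exper"]` would give the same unrounded result.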

(Non)linearity

The effect of work experience on logged wage looks linear.

crPlots(mod1,
        ~ exper,
        ylab = "Partial residuals",
        col=carPalette()[1], col.lines=carPalette()[3:4])

Model 2: logged wage ~ education plus a squared term of work experience

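Adding a second-degree polynomial in experience shows that the relationship is mildly non-linear: the quadratic term is negative and significant, so the curve is concave (an inverted U, only weakly visible). Because poly() uses orthogonal polynomials by default, the fitted equation is logwage = 0.54 + 0.15(educ) + 25.3 P1(exper) - 7.3 P2(exper), where P1 and P2 are the orthogonal linear and quadratic terms rather than exper and exper^2. This means that logged wage first grows with experience (the higher the experience, the higher the wage), then the growth slows and may eventually turn negative. The turning point cannot be read off these coefficients; it requires refitting with raw terms, e.g. poly(exper, 2, raw = TRUE), and computing -b1/(2*b2) from the raw linear (b1) and quadratic (b2) coefficients.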

summary(mod2 <- lm(logwage ~ educ + poly(exper,2), data = sa))
## 
## Call:
## lm(formula = logwage ~ educ + poly(exper, 2), data = sa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2563 -0.4620  0.0525  0.4996  3.9251 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.542403   0.012434   43.62   <2e-16 ***
## educ             0.150200   0.001743   86.17   <2e-16 ***
## poly(exper, 2)1 25.275032   0.890463   28.38   <2e-16 ***
## poly(exper, 2)2 -7.292723   0.781605   -9.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7816 on 19944 degrees of freedom
## Multiple R-squared:  0.2797, Adjusted R-squared:  0.2796 
## F-statistic:  2582 on 3 and 19944 DF,  p-value: < 2.2e-16
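Coefficients from `poly(exper, 2)` are on an orthogonal scale, so a turning point in years has to come from a raw-polynomial fit. The sketch below illustrates this on simulated data (not the SA data); the data frame `toy` and its assumed "true" coefficients are invented for the example.

```r
# Synthetic data for illustration only; these are NOT the SA data.
set.seed(1)
n <- 5000
toy <- data.frame(
  educ  = sample(0:16, n, replace = TRUE),
  exper = runif(n, 0, 40)
)
# Assumed "true" curve peaks at 0.04 / (2 * 0.0008) = 25 years.
toy$logwage <- 0.5 + 0.15 * toy$educ + 0.04 * toy$exper -
  0.0008 * toy$exper^2 + rnorm(n, sd = 0.3)

fit_orth <- lm(logwage ~ educ + poly(exper, 2), data = toy)      # orthogonal basis
fit_raw  <- lm(logwage ~ educ + exper + I(exper^2), data = toy)  # raw basis

# Both parameterisations give the same fitted values; only the
# coefficient scales differ.
same_fit <- isTRUE(all.equal(fitted(fit_orth), fitted(fit_raw)))

# Vertex of the fitted parabola, in years of experience.
b <- coef(fit_raw)
turning_point <- unname(-b["exper"] / (2 * b["I(exper^2)"]))
turning_point
```

Applied to the real data, the same two lines for `b` and `turning_point` would work on a model fitted as `lm(logwage ~ educ + exper + I(exper^2), data = sa)`.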

Model 3: logged wage ~ experience plus interaction between education and gender

The main effect of female is negative. Because the model includes an interaction, this coefficient is the gap at zero education: there, women's wages are 0.25 log points (about 22%) lower than men's (the reference category). At the same time, education matters more for women's wages than for men's: the effect of one unit of education is higher by 0.01 log points (about 1%) among women, so the gap narrows as education rises.

sa$female <- as.factor(sa$female)

summary(mod3 <- lm(logwage ~ educ*female + exper, data = sa))
## 
## Call:
## lm(formula = logwage ~ educ * female + exper, data = sa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1197 -0.4595  0.0527  0.5029  3.8961 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.2694726  0.0217713  12.377  < 2e-16 ***
## educ          0.1502080  0.0020571  73.021  < 2e-16 ***
## female1      -0.2472828  0.0251589  -9.829  < 2e-16 ***
## exper         0.0151128  0.0005415  27.911  < 2e-16 ***
## educ:female1  0.0107683  0.0032883   3.275  0.00106 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7787 on 19943 degrees of freedom
## Multiple R-squared:  0.285,  Adjusted R-squared:  0.2849 
## F-statistic:  1987 on 4 and 19943 DF,  p-value: < 2.2e-16
exp(coef(mod3))
##  (Intercept)         educ      female1        exper educ:female1 
##    1.3092738    1.1620759    0.7809199    1.0152276    1.0108265
print(betafemale <- (0.78-1)*100)
## [1] -22
print(betainteraction <- (1.01-1)*100)
## [1] 1
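With an interaction in the model, the gender gap depends on the level of education. As a rough sketch, the snippet below evaluates the predicted gap at a few education levels using the `female1` and `educ:female1` estimates copied by hand from the `summary(mod3)` output; the chosen education levels are illustrative and not taken from the data.

```r
b_female <- -0.2472828   # female1 estimate from summary(mod3)
b_inter  <-  0.0107683   # educ:female1 estimate from summary(mod3)

educ_levels <- c(0, 8, 12, 16)  # illustrative values, not taken from the data

gap_log <- b_female + b_inter * educ_levels  # gap in log points
gap_pct <- (exp(gap_log) - 1) * 100          # gap in percent

data.frame(educ    = educ_levels,
           gap_log = round(gap_log, 3),
           gap_pct = round(gap_pct, 1))

# Education level at which the predicted gap would reach zero;
# this may lie outside the range observed in the data.
break_even <- -b_female / b_inter
break_even
```

The gap stays negative at every level shown, which is consistent with the reading that the interaction narrows the gap without closing it over typical education values.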