On the text website http://www.pearsonglobaleditions.com, you will find a data file CPS2015, which contains data for full-time, full-year workers, ages 25–34, with a high school diploma or B.A./B.S. as their highest degree. A detailed description is given in CPS2015_Description, also available on the website. (These are the same data as in CPS96_15, used in Empirical Exercise 3.1, but are limited to the year 2015.) In this exercise, you will investigate the relationship between a worker’s age and earnings. (Generally, older workers have more job experience, leading to higher productivity and higher earnings.)
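Start the project by clearing the workspace. Then load the R package openxlsx (to read the data), the packages car, lmtest, and sandwich (for hypothesis tests and robust standard errors), and the data CPS2015.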
rm(list=ls())
library(openxlsx)
## Warning: package 'openxlsx' was built under R version 4.3.3
library(car)
## Loading required package: carData
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(sandwich)
id <- "1VhpAVM4U7fDOuZu-tVS8ZNy9KcVFafNz"
cps_15 <- read.xlsx(sprintf("https://docs.google.com/uc?id=%s&export=download",id),
sheet=1,startRow=1,colNames=TRUE,rowNames=FALSE)
attach(cps_15)
First, we generate three variables that will be used later:
lahe <- log(ahe) #generate log of ahe
lage <- log(age) #generate log of age
agesq <- age^2 #generate squared age
- Run a regression of average hourly earnings (AHE) on age (Age), sex (Female), and education (Bachelor). If Age increases from 25 to 26, how are earnings expected to change? If Age increases from 33 to 34, how are earnings expected to change?
fit_a <- lm(ahe~age+female+bachelor)
summary(fit_a)
##
## Call:
## lm(formula = ahe ~ age + female + bachelor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.913 -6.647 -1.865 4.252 83.908
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.04481 1.35465 1.509 0.131
## age 0.53128 0.04507 11.788 <2e-16 ***
## female -4.14354 0.26590 -15.583 <2e-16 ***
## bachelor 9.84564 0.26242 37.519 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.92 on 7094 degrees of freedom
## Multiple R-squared: 0.1896, Adjusted R-squared: 0.1893
## F-statistic: 553.4 on 3 and 7094 DF, p-value: < 2.2e-16
[Ans] According to the regression results shown above, if Age increases from 25 to 26, earnings are predicted to increase by \(0.531\) dollars per hour. If Age increases from 33 to 34, earnings are also predicted to increase by \(0.531\) dollars per hour. These values are the same because the regression is a linear function relating \(AHE\) and \(Age\), so the marginal effect of Age on AHE is constant.
- Run a regression of the logarithm of average hourly earnings, ln(AHE), on Age, Female, and Bachelor. If Age increases from 25 to 26, how are earnings expected to change? If Age increases from 33 to 34, how are earnings expected to change?
fit_b <- lm(lahe~age+female+bachelor)
summary(fit_b)
##
## Call:
## lm(formula = lahe ~ age + female + bachelor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5980 -0.2878 0.0078 0.3008 2.0631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.027359 0.059245 34.22 <2e-16 ***
## age 0.024191 0.001971 12.27 <2e-16 ***
## female -0.177622 0.011629 -15.27 <2e-16 ***
## bachelor 0.461503 0.011477 40.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4774 on 7094 degrees of freedom
## Multiple R-squared: 0.2084, Adjusted R-squared: 0.208
## F-statistic: 622.4 on 3 and 7094 DF, p-value: < 2.2e-16
[Ans] The regression results for this question are shown above. If Age increases from 25 to 26, ln(AHE) is predicted to increase by \(0.024\), so earnings are predicted to increase by about \(2.4\%\). If Age increases from 33 to 34, ln(AHE) is predicted to increase by \(0.024\), so earnings are again predicted to increase by about \(2.4\%\). These values, in percentage terms, are the same because the regression is a linear function relating ln(AHE) and Age.
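The \(2.4\%\) figure relies on the approximation that a small change in ln(AHE) is roughly the same proportional change in AHE. A minimal sketch of how close that approximation is, assuming the age coefficient is typed in by hand from the output above (with the fitted object available, `coef(fit_b)["age"]` would serve the same purpose):

```r
# Age coefficient copied by hand from the summary(fit_b) output above
b_age <- 0.024191

# Log approximation: one more year of age changes ln(AHE) by b_age
approx_pct <- 100 * b_age

# Exact percentage change implied by the log-linear model
exact_pct <- 100 * (exp(b_age) - 1)

round(c(approx = approx_pct, exact = exact_pct), 2)
```

The two numbers differ only in the second decimal place, so reporting \(2.4\%\) is reasonable here.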
- Run a regression of the logarithm of average hourly earnings, ln(AHE), on ln(Age), Female, and Bachelor. If Age increases from 25 to 26, how are earnings expected to change? If Age increases from 33 to 34, how are earnings expected to change?
fit_c <- lm(lahe~lage+female+bachelor)
summary(fit_c)
##
## Call:
## lm(formula = lahe ~ lage + female + bachelor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.59410 -0.28720 0.01002 0.30250 2.06247
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.32325 0.19608 1.649 0.0993 .
## lage 0.71537 0.05784 12.368 <2e-16 ***
## female -0.17753 0.01163 -15.268 <2e-16 ***
## bachelor 0.46152 0.01147 40.220 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4774 on 7094 degrees of freedom
## Multiple R-squared: 0.2086, Adjusted R-squared: 0.2083
## F-statistic: 623.4 on 3 and 7094 DF, p-value: < 2.2e-16
[Ans] If Age increases from 25 to 26, then ln(Age) increases by \(ln(26)-ln(25)=0.0392\) (or \(3.92\%\)). The predicted increase in \(ln(AHE)\) is \(0.715 \times 0.0392 = 0.028\). This means that earnings are predicted to increase by \(2.8\%\). If \(Age\) increases from 33 to 34, then ln(Age) increases by \(ln(34) - ln(33) = 0.0299\) (or \(2.99\%\)). The predicted increase in \(ln(AHE)\) is \(0.715 \times 0.0299 = 0.021\). This means that earnings are predicted to increase by \(2.1\%\).
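The same arithmetic can be sketched in R. The helper `pct_change_c` is a hypothetical name introduced for this illustration, and the coefficient on ln(Age) is copied by hand from the `summary(fit_c)` output above:

```r
# Coefficient on ln(Age), copied by hand from the summary(fit_c) output above
b_lage <- 0.71537

# Hypothetical helper: predicted change in ln(AHE), in percent, when age
# moves from age_from to age_to in the log-log specification (c)
pct_change_c <- function(age_from, age_to, b = b_lage) {
  d_lage <- log(age_to) - log(age_from)  # change in ln(Age)
  100 * b * d_lage
}

round(c(`25 to 26` = pct_change_c(25, 26),
        `33 to 34` = pct_change_c(33, 34)), 2)
```

Because the change in ln(Age) shrinks as age rises, the predicted percentage increase in earnings is smaller at 33 than at 25.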
- Run a regression of the logarithm of average hourly earnings, ln(AHE), on Age, \(Age^2\), Female, and Bachelor. If Age increases from 25 to 26, how are earnings expected to change? If Age increases from 33 to 34, how are earnings expected to change?
fit_d <- lm(lahe~age+agesq+female+bachelor)
summary(fit_d)
##
## Call:
## lm(formula = lahe ~ age + agesq + female + bachelor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.57645 -0.28683 0.01264 0.30410 2.05961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4187449 0.6720879 0.623 0.53327
## age 0.1341152 0.0457906 2.929 0.00341 **
## agesq -0.0018603 0.0007742 -2.403 0.01630 *
## female -0.1773644 0.0116256 -15.256 < 2e-16 ***
## bachelor 0.4616293 0.0114731 40.236 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4773 on 7093 degrees of freedom
## Multiple R-squared: 0.209, Adjusted R-squared: 0.2086
## F-statistic: 468.6 on 4 and 7093 DF, p-value: < 2.2e-16
[Ans] When \(Age\) increases from 25 to 26, the predicted change in \(ln(AHE)\) is: \[ (0.1341 \times 26 - 0.00186 \times 26^2) - (0.1341 \times 25 - 0.00186 \times 25^2) = 0.039. \] This means that earnings are predicted to increase by \(3.9\%\).
When \(Age\) increases from 33 to 34, the predicted change in \(ln(AHE)\) is: \[ (0.1341 \times 34 - 0.00186 \times 34^2) - (0.1341 \times 33 - 0.00186 \times 33^2) = 0.009. \] This means that earnings are predicted to increase by \(0.9\%\). The coefficients are kept to more digits here because the result is a small difference between two large terms and is sensitive to rounding (using \(0.134\) and \(0.0019\) instead gives \(0.037\) and \(0.007\)).
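As a check on the quadratic calculation, the sketch below evaluates the change in predicted ln(AHE) with the coefficients copied by hand at full printed precision from the `summary(fit_d)` output above. The function name `d_lahe` is hypothetical. Results can differ from a hand calculation depending on how many digits of the coefficients are kept.

```r
# Coefficients on Age and Age^2, copied by hand from summary(fit_d) above
b_age   <- 0.1341152
b_agesq <- -0.0018603

# Hypothetical helper: change in predicted ln(AHE) when age moves from
# age_from to age_to, holding Female and Bachelor fixed
d_lahe <- function(age_from, age_to) {
  (b_age * age_to + b_agesq * age_to^2) -
    (b_age * age_from + b_agesq * age_from^2)
}

# Predicted changes expressed in percent
round(100 * c(`25 to 26` = d_lahe(25, 26),
              `33 to 34` = d_lahe(33, 34)), 2)
```

The intercept and the terms in Female and Bachelor cancel in the difference, which is why they do not appear in the helper.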
- Do you prefer the regression in (c) to the regression in (b)? Explain.
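[Ans] The regressions differ in their choice of one of the regressors (\(Age\) in (b) and \(ln(Age)\) in (c)). Since both have the same dependent variable and the same number of regressors, they can be compared on the basis of the \(\bar{R}^2\). The regression in (c) has a (marginally) higher \(\bar{R}^2\) (\(0.2083\) versus \(0.2080\)), so it is preferred.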
- Do you prefer the regression in (d) to the regression in (b)? Explain.
coeftest(fit_d, vcov=vcovHC, type="HC3")
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.41874493 0.66994642 0.6250 0.531963
## age 0.13411516 0.04563522 2.9389 0.003305 **
## agesq -0.00186029 0.00077175 -2.4105 0.015957 *
## female -0.17736438 0.01150356 -15.4182 < 2.2e-16 ***
## bachelor 0.46162932 0.01145983 40.2824 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[Ans] The regression in (d) adds the variable \(Age^2\) to regression (b). The coefficient on \(Age^2\) is statistically significant (\(t = -2.41\)). This suggests that (d) is preferred to (b).
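The robust t-statistic and p-value reported by `coeftest` can be reproduced from the estimate and its standard error. This is a sketch that assumes the values are copied by hand from the output above and that the p-value is based on a t distribution with the residual degrees of freedom of `fit_d`:

```r
# Estimate and HC3 standard error for agesq, copied from the coeftest output
est_agesq <- -0.00186029
se_agesq  <- 0.00077175
df_resid  <- 7093  # residual degrees of freedom reported for fit_d

# t-statistic for H0: coefficient on Age^2 equals zero
t_stat <- est_agesq / se_agesq

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df = df_resid)

round(c(t = t_stat, p = p_val), 4)
```

Since the p-value is below \(0.05\), the null hypothesis that the coefficient on \(Age^2\) is zero is rejected at the \(5\%\) level.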
- Do you prefer the regression in (d) to the regression in (c)? Explain.
[Ans] The regressions differ in their choice of the regressors (\(ln(Age)\) in (c) and \(Age\) and \(Age^2\) in (d)). Since both regressions have the same dependent variable, they can be compared on the basis of the \(\bar{R}^2\). The regression in (d) has a (marginally) higher \(\bar{R}^2\), so it is preferred.
- Plot the regression relation between \(Age\) and \(ln(AHE)\) from (b), (c), and (d) for males with a high school diploma. Describe the similarities and differences between the estimated regression functions. Would your answer change if you plotted the regression function for females with college degrees?
# Step-1: Define a new sequence of age
x <- seq(min(age), max(age), length.out=length(age))
# Step-2: Obtain predicted lahe using model (b), (c), (d), respectively.
# As the question asks, use males with a high school diploma: female=0 and bachelor=0.
lahe_hat_b <- coef(fit_b)["(Intercept)"] + x*(coef(fit_b)["age"]) # model (b)
lahe_hat_c <- coef(fit_c)["(Intercept)"] + log(x)*(coef(fit_c)["lage"]) # model (c)
lahe_hat_d <- coef(fit_d)["(Intercept)"] + x*(coef(fit_d)["age"]) + (x^2)*(coef(fit_d)["agesq"]) # model (d)
# Step-3: Plot
plot(x=x, y=lahe_hat_b, type="l", ylab="ln(ahe)", xlab="age")
lines(x=x, y=lahe_hat_c, type="l", col="blue")
lines(x=x, y=lahe_hat_d, type="l", col="red")
legend(25,2.85, legend=c("(b)", "(c)", "(d)"),
col=c("black", "blue", "red"), lty=1)
[Ans] The regression functions from (b) and (c) are similar. The quadratic regression (d) shows more curvature. The answer would not change for other groups: because (b), (c), and (d) contain no interactions between the age variables and \(Female\) or \(Bachelor\), the regression functions have the same shape for every group and differ only in the intercept. The regression functions for females are shifted by the coefficient on the binary regressor \(Female\), those for workers with a bachelor's degree are shifted by the coefficient on the binary regressor \(Bachelor\), and those for females with college degrees are shifted by the sum of the two coefficients.
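The parallel-shift claim can be illustrated numerically for model (d). The sketch below assumes the coefficients are copied by hand from the `summary(fit_d)` output, and `pred_d` is a hypothetical helper name:

```r
# Coefficients copied by hand from the summary(fit_d) output above
b0         <- 0.4187449
b_age      <- 0.1341152
b_agesq    <- -0.0018603
b_female   <- -0.1773644
b_bachelor <- 0.4616293

# Hypothetical helper: predicted ln(AHE) from model (d)
pred_d <- function(age, female, bachelor) {
  b0 + b_age * age + b_agesq * age^2 +
    b_female * female + b_bachelor * bachelor
}

ages      <- 25:34
male_hs   <- pred_d(ages, female = 0, bachelor = 0)
female_ba <- pred_d(ages, female = 1, bachelor = 1)

# The gap between the two curves is the same at every age
gap <- female_ba - male_hs
range(gap)
```

The gap equals the sum of the coefficients on Female and Bachelor at every age, so a plot for females with college degrees would show the same curves moved up by that amount.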
- Run a regression of \(ln(AHE)\) on \(Age\), \(Age^2\), \(Female\), \(Bachelor\), and the interaction term \(Female \times Bachelor\). What does the coefficient on the interaction term measure? Alexis is a 30-year-old female with a bachelor’s degree. What does the regression predict for her value of \(ln(AHE)\)? Jane is a 30-year-old female with a high school diploma. What does the regression predict for her value of \(ln(AHE)\)? What is the predicted difference between Alexis’s and Jane’s earnings? Bob is a 30-year-old male with a bachelor’s degree. What does the regression predict for his value of \(ln(AHE)\)? Jim is a 30-year-old male with a high school diploma. What does the regression predict for his value of \(ln(AHE)\)? What is the predicted difference between Bob’s and Jim’s earnings?
fit_i <- lm(lahe~age+agesq+female+bachelor+I(female*bachelor))
summary(fit_i)
##
## Call:
## lm(formula = lahe ~ age + agesq + female + bachelor + I(female *
## bachelor))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.57150 -0.28595 0.01284 0.30422 2.06833
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4119039 0.6721221 0.613 0.54000
## age 0.1348145 0.0457959 2.944 0.00325 **
## agesq -0.0018710 0.0007743 -2.416 0.01570 *
## female -0.1903241 0.0173730 -10.955 < 2e-16 ***
## bachelor 0.4521137 0.0148824 30.379 < 2e-16 ***
## I(female * bachelor) 0.0234742 0.0233840 1.004 0.31548
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4773 on 7092 degrees of freedom
## Multiple R-squared: 0.2091, Adjusted R-squared: 0.2086
## F-statistic: 375.1 on 5 and 7092 DF, p-value: < 2.2e-16
[Ans] The coefficient on the interaction term \(Female \times Bachelor\) shows the “extra effect” of \(Bachelor\) on \(ln(AHE)\) for women relative to the effect for men.
Predicted values of \(ln(AHE)\): \[ Alexis: 0.135 \times 30 - 0.0019 \times 30^2 - 0.19 \times 1 + 0.45 \times 1 + 0.023 \times 1 + 0.41 = 3.03. \] \[ Jane: 0.135 \times 30 - 0.0019 \times 30^2 - 0.19 \times 1 + 0.45 \times 0 + 0.023 \times 0 + 0.41 = 2.56. \] \[ Bob: 0.135 \times 30 - 0.0019 \times 30^2 - 0.19 \times 0 + 0.45 \times 1 + 0.023 \times 0 + 0.41 = 3.20. \] \[ Jim: 0.135 \times 30 - 0.0019 \times 30^2 - 0.19 \times 0 + 0.45 \times 0 + 0.023 \times 0 + 0.41 = 2.75. \]
Difference in \(ln(AHE)\): \(Alexis - Jane = 3.03 - 2.56 = 0.47\). Difference in \(ln(AHE)\): \(Bob - Jim = 3.20 - 2.75 = 0.45\).
Notice that the difference between these two predicted differences is \(0.47 - 0.45 = 0.02\), which is the value of the coefficient on the interaction term (\(0.023\), up to rounding).
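The four predictions and the difference in differences can be sketched in R as follows. The coefficients are copied by hand from the `summary(fit_i)` output, and the names `b`, `pred_lahe`, and `people` are hypothetical. Because the coefficients are kept at full printed precision, the predicted levels come out about \(0.02\) higher than the hand calculation with rounded coefficients, while the differences are essentially unchanged.

```r
# Coefficients copied by hand from the summary(fit_i) output above
b <- c(intercept = 0.4119039, age = 0.1348145, agesq = -0.0018710,
       female = -0.1903241, bachelor = 0.4521137,
       female_bachelor = 0.0234742)

# Hypothetical helper: predicted ln(AHE) from the interaction model
pred_lahe <- function(age, female, bachelor) {
  unname(b["intercept"] + b["age"] * age + b["agesq"] * age^2 +
           b["female"] * female + b["bachelor"] * bachelor +
           b["female_bachelor"] * female * bachelor)
}

# The four 30-year-old workers described in the question
people <- data.frame(
  name     = c("Alexis", "Jane", "Bob", "Jim"),
  female   = c(1, 1, 0, 0),
  bachelor = c(1, 0, 1, 0)
)
people$lahe_hat <- pred_lahe(30, people$female, people$bachelor)
people

# Effect of a bachelor's degree for women and for men, and their difference
gap_women    <- people$lahe_hat[1] - people$lahe_hat[2]
gap_men      <- people$lahe_hat[3] - people$lahe_hat[4]
diff_in_diff <- gap_women - gap_men
round(c(women = gap_women, men = gap_men, diff = diff_in_diff), 3)
```

The last value reproduces the coefficient on the interaction term, since every other term cancels in the double difference.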
- Is the effect of \(Age\) on earnings different for men than for women? Specify and estimate a regression that you can use to answer this question.
fit_j <- lm(lahe~age+agesq+female+bachelor+I(female*bachelor)
+I(female*age)+I(female*agesq))
summary(fit_j)
##
## Call:
## lm(formula = lahe ~ age + agesq + female + bachelor + I(female *
## bachelor) + I(female * age) + I(female * agesq))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.58759 -0.28747 0.00577 0.30606 2.05400
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2904817 0.8821495 0.329 0.7419
## age 0.1391579 0.0600275 2.318 0.0205 *
## agesq -0.0018792 0.0010136 -1.854 0.0638 .
## female -0.0344858 1.3628566 -0.025 0.9798
## bachelor 0.4514003 0.0148824 30.331 <2e-16 ***
## I(female * bachelor) 0.0230741 0.0233802 0.987 0.3237
## I(female * age) -0.0012515 0.0929059 -0.013 0.9893
## I(female * agesq) -0.0001339 0.0015716 -0.085 0.9321
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4772 on 7090 degrees of freedom
## Multiple R-squared: 0.2097, Adjusted R-squared: 0.2089
## F-statistic: 268.8 on 7 and 7090 DF, p-value: < 2.2e-16
linearHypothesis(fit_j, c("I(female * age)=0", "I(female * agesq)=0"))
[Ans] In the regression above, we include two additional regressors: the interactions of \(Female\) with the age variables, \(Age\) and \(Age^2\). The F-statistic testing the restriction that the coefficients on these two interaction terms are both equal to zero is \(F = 2.63\) with a p-value of \(0.07\). This implies that there is statistically significant evidence at the \(10\%\) level, but not at the \(5\%\) level, that the effect of \(Age\) on \(ln(AHE)\) differs between men and women.
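The p-value quoted above follows from the F distribution with 2 numerator degrees of freedom (one per restriction) and the residual degrees of freedom of `fit_j`. A minimal sketch, assuming the F-statistic is taken from the answer text and the degrees of freedom from the regression output:

```r
# F-statistic reported in the answer; degrees of freedom from fit_j
f_stat   <- 2.63
q        <- 2      # number of restrictions tested
df_resid <- 7090   # residual degrees of freedom of fit_j

# Upper-tail probability of the F(q, df_resid) distribution
p_val <- pf(f_stat, df1 = q, df2 = df_resid, lower.tail = FALSE)
round(p_val, 3)
```

The result lies between \(0.05\) and \(0.10\), which is why the null hypothesis is rejected at the \(10\%\) level but not at the \(5\%\) level.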
- Is the effect of \(Age\) on earnings different for high school graduates than for college graduates? Specify and estimate a regression that you can use to answer this question.
fit_k <- lm(lahe~age+agesq+female+bachelor+I(female*bachelor)
+I(bachelor*age)+I(bachelor*agesq))
summary(fit_k)
##
## Call:
## lm(formula = lahe ~ age + agesq + female + bachelor + I(female *
## bachelor) + I(bachelor * age) + I(bachelor * agesq))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.58695 -0.28901 0.01077 0.30390 2.06019
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0783349 0.9662679 0.081 0.9354
## age 0.1604331 0.0658512 2.436 0.0149 *
## agesq -0.0023512 0.0011136 -2.111 0.0348 *
## female -0.1902750 0.0173732 -10.952 <2e-16 ***
## bachelor 1.0932895 1.3449839 0.813 0.4163
## I(female * bachelor) 0.0241893 0.0233928 1.034 0.3011
## I(bachelor * age) -0.0491851 0.0916408 -0.537 0.5915
## I(bachelor * agesq) 0.0009206 0.0015494 0.594 0.5524
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4773 on 7090 degrees of freedom
## Multiple R-squared: 0.2094, Adjusted R-squared: 0.2086
## F-statistic: 268.2 on 7 and 7090 DF, p-value: < 2.2e-16
linearHypothesis(fit_k, c("I(bachelor * age)=0", "I(bachelor * agesq)=0"))
[Ans] In this regression, we include two additional regressors that are interactions of \(Bachelor\) with the age variables, \(Age\) and \(Age^2\). The F-statistic testing the restriction that the coefficients on these two interaction terms are both zero is \(1.05\) with a p-value of \(0.35\). This implies that there is no statistically significant evidence (even at the \(10\%\) level) that the effect of \(Age\) on \(ln(AHE)\) differs between high school and college graduates.
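To see what the point estimates imply, the sketch below computes the estimated slope of ln(AHE) with respect to Age at age 30 for each education group, using the derivative of the quadratic as an approximation to a one-year change. The coefficients are copied by hand from the `summary(fit_k)` output, and `age_slope` is a hypothetical helper. These are point estimates only; the F-test above indicates that the gap between the two groups is not statistically distinguishable from zero.

```r
# Coefficients copied by hand from the summary(fit_k) output above
b_age        <- 0.1604331
b_agesq      <- -0.0023512
b_bach_age   <- -0.0491851
b_bach_agesq <- 0.0009206

# Hypothetical helper: derivative of predicted ln(AHE) with respect to Age
age_slope <- function(age, bachelor) {
  (b_age + bachelor * b_bach_age) +
    2 * (b_agesq + bachelor * b_bach_agesq) * age
}

slopes_at_30 <- c(high_school = age_slope(30, 0),
                  bachelor    = age_slope(30, 1))

# Approximate percentage change in earnings per additional year of age
round(100 * slopes_at_30, 2)
```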
- After running all these regressions (and any others that you want to run), summarize the effect of age on earnings for young workers.
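[Ans] The estimated regressions suggest that earnings increase as workers age from 25 to 34, the range of ages in this sample, and the quadratic specification suggests that they do so at a decreasing rate. Education and sex are significant predictors of earnings. There is only weak evidence that the effect of age differs between men and women (significant at the \(10\%\) level but not at the \(5\%\) level) and no statistically significant evidence that it differs between high school and college graduates.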