This will be an empirical problem set examining the cps dataset that we have often referred to in class. You may be reading this in a pdf file, which was created using RMarkdown. The RMarkdown file used to create this file is posted on CCLE and contains code boxes to help you get started. You may want to load the RMarkdown file in RStudio and work on it directly to obtain your answers and display your code.
To get started: (i) Clear the workspace, (ii) Load the
PoEdata, and (iii) Import the cps dataset. The
description of all the variables contained in the cps
dataset can be found at the following website: http://www.principlesofeconometrics.com/poe4/data/def/cps.def
# Tell R to clear the workspace
rm(list = ls())
# Tell R to load the cps dataset directly from the PoEdata GitHub repository
cps_url <- "https://raw.githubusercontent.com/ccolonescu/PoEdata/master/data/cps.rda"
load(url(cps_url))
(25 pts) Consider a basic model in which we regress wages on education: \[{\rm wage} = \beta_1 + \beta_2\text{educ} + \epsilon.\]
**+ What is the description of the variable wage in the
data? “wage” is the earnings per hour
**+ What is the description of the variable educ? “educ”
is the years of education
##
## Call:
## lm(formula = wage ~ educ, data = cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.282 -3.728 -1.188 2.382 63.088
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.20260 0.46549 -11.18 <2e-16 ***
## educ 1.15692 0.03446 33.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.585 on 4731 degrees of freedom
## Multiple R-squared: 0.1924, Adjusted R-squared: 0.1923
## F-statistic: 1127 on 1 and 4731 DF, p-value: < 2.2e-16
(25 pts) We want to examine whether expected wages depend on
education in a different manner for men and women. To this end, let
female = 1 for women, female = 0 otherwise,
and consider the model: \[{\rm wage} =
\beta_1 + \delta_1 \text{female} + \beta_2 \text{educ} + \delta_2
\text{female}\times \text{educ} + \epsilon\]
**+ What are the linear regression estimates for \(\delta_1\) and \(\delta_2\)? What do these estimates say about how expected wages conditional on education differ between men and women?
The estimate of \(\delta_1\) is -3.311 and the estimate of \(\delta_2\) is 0.061. At zero years of education, women are estimated to earn about $3.31 less per hour than men. Each additional year of education raises a woman's expected wage by about $0.06 more per hour than a man's, so the estimated gap narrows with education, but only slowly. The gap \(\delta_1 + \delta_2\,\text{educ}\) would reach zero only at roughly 54 years of education, so the estimates do not imply that women out-earn men at any realistic level of schooling. The estimate of \(\delta_2\) is also not statistically different from zero on its own (p-value 0.368).
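As a quick check on how slowly the gap closes, the reported coefficients can be plugged into the estimated gap \(\delta_1 + \delta_2\,\text{educ}\). The sketch below is only arithmetic on the rounded estimates printed in the regression output, not a new estimation; the names `d1`, `d2`, `gap`, and `breakeven` are illustrative.

```r
# Estimates copied from the regression output (female and educ:female rows)
d1 <- -3.31073   # shift in the intercept for women
d2 <-  0.06091   # shift in the slope on educ for women

# Estimated female-minus-male gap in expected hourly wage at a given education level
gap <- function(educ) d1 + d2 * educ

# Gap at 12, 16, and 20 years of education
gap(c(12, 16, 20))

# Years of education at which the estimated gap would reach zero
breakeven <- -d1 / d2
breakeven
```

At all three education levels the estimated gap remains larger than $2 per hour in favor of men, and the breakeven point lies far outside any plausible range of schooling.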
** + Use an F-test to examine the null hypothesis that expected wages conditional on education are the same for men and women. What is the p-value for your test? (Note: This test is known as a Chow-test). What does the p-value tell you that you could not conclude from the estimates of \(\delta_1\) and \(\delta_2\) you obtained in the previous part?
The F-statistic is 124.92, and R prints the p-value as 0. This means the p-value is too small to be represented in the calculation used, not that it is literally zero. We reject the null hypothesis \(\delta_1 = \delta_2 = 0\), that is, that expected wages conditional on education are the same for men and women.
The point estimates of \(\delta_1\) and \(\delta_2\) show only the size and direction of the estimated gaps; by themselves they say nothing about whether the gaps could be due to sampling variation. The individual t-tests do not settle the question either, since the estimate of \(\delta_2\) is not significant on its own (p-value 0.368). The F-test evaluates both restrictions jointly, and its p-value shows that the wage-education relationship differs between men and women by far more than chance would explain.
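The Chow F-statistic can be reproduced approximately from the two regression summaries, because it compares the residual sum of squares of the restricted model (wage on educ only) with that of the unrestricted model. The sketch below uses the rounded residual standard errors from the printed output, so it should land near the reported 124.92 rather than match it exactly; the variable names are illustrative.

```r
# Residual standard errors and degrees of freedom from the two summaries
sigma_r <- 5.585; df_r <- 4731   # restricted:   wage ~ educ
sigma_u <- 5.444; df_u <- 4729   # unrestricted: wage ~ educ + female + educ:female

# Residual sums of squares implied by sigma^2 = RSS / df
rss_r <- sigma_r^2 * df_r
rss_u <- sigma_u^2 * df_u

# F-statistic for the J = 2 restrictions delta_1 = 0 and delta_2 = 0
J <- df_r - df_u
F_stat <- ((rss_r - rss_u) / J) / (rss_u / df_u)
F_stat

# Upper-tail p-value; lower.tail = FALSE avoids the rounding to 0
# that 1 - pf(...) would produce for such a large F
p_value <- pf(F_stat, J, df_u, lower.tail = FALSE)
p_value
```

With the data loaded, calling `anova()` on the two fitted `lm` objects performs the same nested-model comparison directly.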
##
## Call:
## lm(formula = wage ~ educ + female + educ:female, data = cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.446 -3.372 -1.064 2.256 64.139
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.85895 0.60441 -6.385 1.88e-10 ***
## educ 1.14693 0.04492 25.533 < 2e-16 ***
## female -3.31073 0.91506 -3.618 0.0003 ***
## educ:female 0.06091 0.06769 0.900 0.3683
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.444 on 4729 degrees of freedom
## Multiple R-squared: 0.233, Adjusted R-squared: 0.2325
## F-statistic: 478.8 on 3 and 4729 DF, p-value: < 2.2e-16
## female
## -3.310733
## educ:female
## 0.06090984
## [1] 124.9239
## [1] 0
(25 pts) In this question we will graphically examine whether we should be concerned about the homoskedasticity assumption not holding.
**+ Make a plot with wages on the Y axis and education on the X axis using only observations for women. Clearly label the axes. (Hint: See RMarkdown file for a hint on how to extract observations for women only.)
**+ Redo the plot using observations only for men.
**+ What do the graphs suggest about the homoskedasticity assumption?
The scatter plots for both men and women show that wages start off pretty clustered when education is low, but the spread gets a lot wider as education increases. Since the variation in wages grows with education, this points to heteroskedasticity and suggests that the homoskedasticity assumption probably does not hold. People with more years of schooling tend to have a much bigger range of possible wages.
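The plotting steps can be sketched as follows. To keep the example self-contained it uses a simulated stand-in data frame `toy` (an assumption, not the cps data) whose wage spread is constructed to grow with education; with the real data, `cps` would take the place of `toy`. The last lines add a numeric companion to the visual check by computing the standard deviation of wages at each education level.

```r
# Stand-in data (NOT the cps data): wage spread is built to grow with educ
set.seed(1)
n <- 400
toy <- data.frame(
  educ   = sample(8:18, n, replace = TRUE),
  female = rbinom(n, 1, 0.5)
)
toy$wage <- pmax(1, 2 + 1.2 * (toy$educ - 8) +
                   rnorm(n, sd = 0.5 * (toy$educ - 6)))

# Split the sample by the female indicator
women <- subset(toy, female == 1)
men   <- subset(toy, female == 0)

# One scatter plot per group, with labeled axes
par(mfrow = c(1, 2))
plot(women$educ, women$wage, main = "Women",
     xlab = "Years of education", ylab = "Hourly wage")
plot(men$educ, men$wage, main = "Men",
     xlab = "Years of education", ylab = "Hourly wage")

# Numeric companion to the plots: wage standard deviation by education level
sd_by_educ <- tapply(toy$wage, toy$educ, sd)
round(sd_by_educ, 2)
```

A fan shape in the scatter plots, or standard deviations that rise with education, is the pattern that signals heteroskedasticity.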
(25 pts) For simplicity we next drop the female variable and examine the role of experience in wages.
**+ What is the name of the variable in the cps dataset that stores the experience data? “exper”
**+ Consider a simple model in which we suppose that \({\rm wage} = \beta_1 + \beta_2 {\rm educ} + \beta_3 {\rm experience} + \epsilon\). What does the estimate for \(\beta_3\) say about how an additional year of experience affects expected wages? Does the impact depend on the level of experience or education?
The estimate of \(\beta_3\) is about 0.1223. Holding education constant, an additional year of experience is associated with an increase in expected wage of about $0.12 per hour. Because the model is linear in experience and has no interaction terms, \(\partial E[{\rm wage}]/\partial\,{\rm experience} = \beta_3\) is a constant, so the estimated effect is the same at every level of experience and education.
**+ We want to examine whether the linear specification is adequate or we should be using a more complex function of education and experience. Let \(\widehat{\rm wage} = b_1 + b_2 {\rm education} + b_3 {\rm experience}\) be the fitted value from your regression. Next run the regression \[{\rm wage} = \beta_1 + \beta_2 {\rm education} + \beta_3 {\rm experience} + \gamma (\widehat{\rm wage})^2 + \epsilon\] What is your estimate for \(\gamma\)?
The estimate of \(\gamma\) is 0.04727.
**+ What is the p-value for the null hypothesis that \(\gamma = 0\) against the alternative that \(\gamma\neq 0\)? What does this suggest about whether we should be using a more complicated functional form than the linear model? (Note: This is called a RESET test).
The p-value is about 2.1e-20 (shown as < 2e-16 in the summary table), so we reject the null hypothesis that \(\gamma = 0\) at any conventional significance level. This suggests that the simple linear model in education and experience is misspecified and that a more flexible functional form in these two variables should be considered.
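The three steps of this version of the RESET test can be sketched as follows. The example runs on a simulated stand-in data frame `toy` (an assumption, not the cps data) in which wage is deliberately nonlinear in education, so the test should reject; with the real data, `cps` would take the place of `toy`. The column name `wagehat` follows the regression formula in the printed output.

```r
# Stand-in data (NOT the cps data): wage is built to be nonlinear in educ
set.seed(42)
n <- 1000
toy <- data.frame(
  educ  = sample(8:20, n, replace = TRUE),
  exper = sample(0:40, n, replace = TRUE)
)
toy$wage <- 1 + 0.05 * toy$educ^2 + 0.1 * toy$exper + rnorm(n)

# Step 1: fit the linear model and store its fitted values
fit_linear  <- lm(wage ~ educ + exper, data = toy)
toy$wagehat <- fitted(fit_linear)

# Step 2: re-estimate with the squared fitted value as an extra regressor
fit_reset <- lm(wage ~ educ + exper + I(wagehat^2), data = toy)

# Step 3: read off the estimate of gamma and its two-sided p-value
reset_row <- coef(summary(fit_reset))["I(wagehat^2)", ]
gamma_hat <- reset_row["Estimate"]
p_value   <- reset_row["Pr(>|t|)"]
c(gamma_hat, p_value)
```

Reading the p-value from the coefficient table, as in step 3, gives the exact value rather than the `< 2e-16` floor that `summary()` prints.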
##
## Call:
## lm(formula = wage ~ educ + exper, data = cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.428 -3.338 -1.011 2.262 60.235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.69571 0.49321 -17.63 <2e-16 ***
## educ 1.24449 0.03377 36.86 <2e-16 ***
## exper 0.12230 0.00698 17.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.412 on 4730 degrees of freedom
## Multiple R-squared: 0.2417, Adjusted R-squared: 0.2413
## F-statistic: 753.7 on 2 and 4730 DF, p-value: < 2.2e-16
## exper
## 0.1222984
## [1] 8.561796 10.295476 6.152118 9.007986 14.194238 8.928692
##
## Call:
## lm(formula = wage ~ educ + exper + I(wagehat^2), data = cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.867 -3.216 -0.972 2.152 57.537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.316953 1.482531 2.912 0.00361 **
## educ 0.036315 0.134189 0.271 0.78669
## exper 0.002138 0.014659 0.146 0.88405
## I(wagehat^2) 0.047269 0.005084 9.297 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.364 on 4729 degrees of freedom
## Multiple R-squared: 0.2553, Adjusted R-squared: 0.2548
## F-statistic: 540.3 on 3 and 4729 DF, p-value: < 2.2e-16
## I(wagehat^2)
## 0.04726874
## [1] 2.148954e-20