The dataset you will be analyzing for this assignment is an extract from the Panel Study of Income Dynamics (1997). The file psid97.dta is included. Please include all relevant Stata output with your copy of the solutions.
The dataset includes five variables:

• Age of the household head (age)
• Years of education of the household head (educ)
• Annual earnings of the household head in 1996 (earnings)
• Annual hours worked by the household head in 1996 (hours)
• An indicator for the marital status of the household head (married; 1 = married, 0 = single)
Before doing any analysis, we first set up the proper working environment.
# Loading the dataset
library(haven) # Package for reading Stata files
data <- read_dta('psid97.dta') # Read in the dataset
summary(data) # Summary statistics for the dataset
## age educ earnings hours
## Min. :18.00 Min. : 0.00 Min. : 0 Min. : 0
## 1st Qu.:34.00 1st Qu.:12.00 1st Qu.: 7280 1st Qu.:1568
## Median :42.50 Median :13.00 Median : 26000 Median :2070
## Mean :44.77 Mean :13.31 Mean : 32965 Mean :1880
## 3rd Qu.:53.00 3rd Qu.:16.00 3rd Qu.: 45001 3rd Qu.:2480
## Max. :95.00 Max. :17.00 Max. :700021 Max. :5307
## NA's :3359 NA's :3359 NA's :3359 NA's :3359
## married
## Min. :0.000
## 1st Qu.:1.000
## Median :1.000
## Mean :0.824
## 3rd Qu.:1.000
## Max. :1.000
## NA's :3359
Create an hourly wage variable, which is earnings divided by hours worked.
# Create the new column named "hourly_wage"
data$hourly_wage <- data$earnings / data$hours
# Check it out
summary(data$hourly_wage)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 8.789 14.331 Inf 22.520 Inf 3829
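Note the Inf values in the summary: for household heads with zero hours, earnings / hours divides by zero, producing Inf (or 0/0 = NaN when earnings are also zero; summary() counts NaN among the NA's, which is why the count rises from 3359 to 3829). We leave these as-is because the analysis below is restricted to working household heads, but a minimal sketch of recoding them to missing would be:

# Optional (not applied here): recode non-finite wages to missing
# data$hourly_wage[!is.finite(data$hourly_wage)] <- NA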
Define an indicator working = 1 if the household head's annual hours worked exceed zero.
library(tidyverse) # Load tidyverse
# Create the new column "working" conditional on column "hours"
data <- data %>%
  mutate(working = case_when(hours > 0 ~ 1,
                             hours == 0 ~ 0))
# Check it out
summary(data$working)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 1.000 1.000 0.858 1.000 1.000 3359
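As a quick sanity check, one could tabulate the new indicator (the mean of 0.858 above already implies that about 86% of household heads with non-missing hours are working):

table(data$working, useNA = "ifany") # Counts of non-workers, workers, and NAs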
Create a measure of labor market experience, which is typically age minus years of education minus 6.
# Create a new column called "experience"
data$experience <- data$age - data$educ - 6
# Check it out
summary(data$experience)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1.00 14.00 23.00 25.46 33.00 83.00 3359
Warning: the summary statistics show that the way we constructed the labor market experience variable can be problematic, or that the data contain invalid entries, because at least one observation has -1 years of experience.
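To gauge how widespread the issue is, one could count and inspect the affected observations, e.g.:

sum(data$experience < 0, na.rm = TRUE) # How many entries with negative experience?
subset(data, experience < 0, select = c(age, educ, experience)) # Inspect those rows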
First, we wish to study the relationship between wages and experience. For this section, limit your analysis to household heads that are working.
Plot wages against experience (put experience on the horizontal axis and wages on the vertical axis). Describe the correlation between wages and experience. Does the correlation appear to be strong or weak? Explain.
The following graph is a scatter plot describing the relationship between wages and experience in our dataset. The scatter plot suggests that there is not a strong correlation between wages and experience.
To make things clearer, I added a trend line based on LOESS (in orange) and a trend line based on linear estimation (in blue). Both lines indicate only a weak positive relationship between experience and wages.
ggplot(subset(data, working == 1), aes(experience, hourly_wage)) + # Working heads only
  geom_point(color = '#00AFBB', size = 1) + # Scatter plot
  geom_smooth(color = "steelblue", method = 'lm') + # Add the linear smooth line
  geom_smooth(color = '#D55E00', method = 'loess') + # Add the LOESS smooth line
  scale_y_continuous(labels = scales::comma)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
What is the sample correlation between wages and experience? Did you expect to find this result? Why or why not?
The sample correlation between the two is about 0.12 (rounded to the nearest hundredth). We should expect this result, because the scatter plot above already suggested a weak relationship between the two variables.
# Correlation matrix for working household heads
subset(data, working == 1) %>%
select(experience, hourly_wage) %>%
cor(.)
## experience hourly_wage
## experience 1.0000000 0.1212883
## hourly_wage 0.1212883 1.0000000
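Though not asked for, we could also test formally whether this correlation differs from zero (a sketch; with n = 2882 working heads, even r of about 0.12 is highly significant, consistent with the regression below):

with(subset(data, working == 1), cor.test(experience, hourly_wage)) # Test H0: rho = 0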
Run the following regression #1:
\(Wage_{i} = \beta_{0} + \beta_{1} Experience_{i} + \mu_{i}\)
What is the OLS estimate of β1? Is it significantly different from zero at the α = 0.05 level of significance?
The OLS estimate of \(\beta_{1}\) is 0.21 (rounded to the nearest hundredth). It is significant at the \(\alpha\) = .05 level.
wage_exp_lm <- subset(data, working == 1) %>%
select(experience, hourly_wage) %>%
lm(hourly_wage ~ experience, data=.)
summary(wage_exp_lm)
##
## Call:
## lm(formula = hourly_wage ~ experience, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.128 -8.986 -3.548 4.229 224.602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.84929 0.78090 17.735 < 2e-16 ***
## experience 0.20998 0.03202 6.557 6.46e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.01 on 2880 degrees of freedom
## Multiple R-squared: 0.01471, Adjusted R-squared: 0.01437
## F-statistic: 43 on 1 and 2880 DF, p-value: 6.465e-11
What is the F-statistic for the null hypothesis H0: β1 = 0 in the regression? What critical value from the F-distribution table should you use when testing the null hypothesis at the α = 0.05 level of significance? Would you reject this null hypothesis according to the F test at the α = 0.05 level of significance?
From the regression table for question c), we can see that the F-statistic is 43. We also know that the numerator degrees of freedom are 1 and the denominator degrees of freedom are 2880 in our case, so we can calculate the critical F value using the following syntax:
qf(.05, 1, 2880, lower.tail = FALSE)
## [1] 3.84469
Since our F-statistic of 43 is larger than 3.84, the critical value at \(\alpha\) = .05, we reject the null hypothesis.
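Equivalently, we could compute the p-value implied by this F-statistic directly (a sketch; it should reproduce the 6.465e-11 reported in the regression summary above):

pf(summary(wage_exp_lm)$fstatistic[1], 1, 2880, lower.tail = FALSE) # p-value of the F test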
Notice that you got the same answer in parts C and D. In both cases, you are testing the null hypothesis H0: β1 = 0, so you should get the same answer no matter which method you use. When testing the significance of a single parameter, it can be shown that the F-statistic is equal to the square of the t-statistic. Does this hold in your output?
We can see below that the F-statistic equals the square of the t-statistic in our case.
summary(wage_exp_lm)$fstatistic[1] # Extract the F-statistic
## value
## 42.99983
coef(summary(wage_exp_lm))[, "t value"][2]^2 # Extract the t-statistic for experience and square it
## experience
## 42.99983
Now we wish to study the relationship between wages and experience controlling for education. Again, please limit your analysis to household heads that are working.
What is the sample correlation between education and wages? What is the sample correlation between education and experience? Did you expect to find this latter correlation? Why or why not?
subset(data, working == 1) %>%
select(experience, hourly_wage, educ) %>%
cor(.)
## experience hourly_wage educ
## experience 1.0000000 0.1212883 -0.1258688
## hourly_wage 0.1212883 1.0000000 0.2513287
## educ -0.1258688 0.2513287 1.0000000
From the correlation matrix above, we can see that the sample correlation between education and wages is about 0.25. The sample correlation between education and experience is about -0.13.
We should expect a negative correlation between education and experience, for two reasons: more years of schooling delay entry into the labor market, and our experience measure is constructed as age - educ - 6, so at any given age it falls mechanically as education rises.
Now run regression #2:
\(Wage_{i} = \beta_{0} + \beta_{1} Experience_{i} + \beta_{2} Educ_{i} + \mu_{i}\)
What is the OLS estimate of β1? Is it significantly different from zero at the α = 0.05 level of significance? Why do you think the estimate of β1 differs from the result you found in problem 1? Do you think the result you found here in problem 2 is a ‘better’ estimate of the effect of one more year of experience on earnings than the estimate in problem 1? Why? (This last question is not a statistical question.)
From the regression output below, we can see that the OLS estimate of \(\beta_{1}\) is about 0.27 (rounded to the nearest hundredth). It’s significantly different from zero at the \(\alpha\) = .05 level.
The \(\beta_{1}\) here differs from the \(\beta_{1}\) in the previous model because it now represents the estimated relationship between experience and wages holding education constant.
The estimate here should "better" represent the real-world effect of one more year of experience: theory and intuition tell us that education strongly affects wages, and we saw above that education is also correlated with experience, so omitting it from regression #1 contaminates the experience coefficient. After controlling for education, we see a stronger relationship between wages and experience, which is also what we expected.
w_e_edu_lm <- subset(data, working == 1) %>%
select(experience, hourly_wage, educ) %>%
lm(hourly_wage ~ experience + educ, data=.)
summary(w_e_edu_lm)
##
## Call:
## lm(formula = hourly_wage ~ experience + educ, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.720 -7.665 -2.027 3.879 233.253
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.95945 2.30054 -8.241 2.56e-16 ***
## experience 0.26901 0.03108 8.656 < 2e-16 ***
## educ 2.32344 0.15397 15.090 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.26 on 2879 degrees of freedom
## Multiple R-squared: 0.08693, Adjusted R-squared: 0.08629
## F-statistic: 137 on 2 and 2879 DF, p-value: < 2.2e-16
Test the null hypothesis H0: β2 = 0 at the α = 0.05 level of significance using two methods. First, use the t-statistic from regression #2. Second, use the F-test for the incremental effect of adding an independent variable. You will need to use the results from both regression #1 and regression #2. Show all of your work for constructing this F-test. Do you get the same answer from each of these methods? Are the two methods related? (Please state your decision rule for each of these methods.)
From the regression table presented above for question b), we can see that \(\beta_{2}\) is significant at the \(\alpha\) = .05 level. The t-statistic is 15.090, much larger than the cut-off value for \(\alpha\) = .05, which we calculate below:
qt(.05, 2879, lower.tail=FALSE)
## [1] 1.645383
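Strictly speaking, the alternative here is two-sided (\(\beta_{2} \neq 0\)), so the appropriate critical value uses \(\alpha/2\) in each tail; it is about 1.96 and is still easily exceeded:

qt(.025, 2879, lower.tail = FALSE) # Two-sided critical value at alpha = .05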
To verify what we found above, we perform a partial F-test. In this case, our full model is w_e_edu_lm, the regression object we constructed for this section, and our reduced model is wage_exp_lm, the regression object we constructed for section 1.
We know that the formula for the partial F-test is:
\(F = \frac{\frac{RSS_{reduced} - RSS_{full}}{p}}{\frac{RSS_{full}}{n-k}}\)
From the two regressions we performed, we know that:
\(RSS_{reduced}\) is:
sum(resid(wage_exp_lm)^2)
## [1] 1152928
\(RSS_{full}\) is:
sum(resid(w_e_edu_lm)^2)
## [1] 1068424
p is the number of predictors removed from the full model to obtain the reduced model, which is just 1 in our case (educ).
n is the number of observations in the sample used to estimate the regressions; in our case it is:
nrow(subset(data, working == 1))
## [1] 2882
k is the number of coefficients in the full model; in our case it is 3 (the intercept plus the two slope coefficients).
So the F-statistic is:
\(F = \frac{\frac{1152928 - 1068424}{1}}{\frac{1068424}{2882-3}}\)
This gives F = 227.71 (rounded to the nearest hundredth), far above the \(\alpha\) = .05 critical value of about 3.84 for an F(1, 2879) distribution.
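We can plug these numbers in directly (a sketch mirroring the formula above; it should match the anova() output that follows):

rss_reduced <- sum(resid(wage_exp_lm)^2) # RSS of the reduced model
rss_full <- sum(resid(w_e_edu_lm)^2) # RSS of the full model
((rss_reduced - rss_full) / 1) / (rss_full / (2882 - 3)) # Partial F-statistic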
Alternatively, we can just let R do the tough work for us:
anova(wage_exp_lm, w_e_edu_lm)
## Analysis of Variance Table
##
## Model 1: hourly_wage ~ experience
## Model 2: hourly_wage ~ experience + educ
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2880 1152928
## 2 2879 1068424 1 84504 227.71 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the table above, we can see that F is also 227.71, and it is significant at the \(\alpha\) = .05 level. We reject the null hypothesis (\(\beta_{2}=0\)) and conclude that education contributes significant information to wages.
The t-test and the partial F-test yield the same conclusion. In fact, the square of the t-statistic for educ (\(15.090^2 \approx 227.7\)) equals the F-statistic from our partial F-test.
Our decision rule is the same for both tests: compare the test statistic with the critical value at the \(\alpha = .05\) level, and reject the null hypothesis (\(\beta_{2}=0\)) if the statistic exceeds it.
Test the joint null hypothesis H0: β1 = β2 = 0 at the α = 0.05 level. Be sure to state your decision rule.
To test this hypothesis, we simply refer to the F-statistic from our full model:
summary(w_e_edu_lm)
##
## Call:
## lm(formula = hourly_wage ~ experience + educ, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.720 -7.665 -2.027 3.879 233.253
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.95945 2.30054 -8.241 2.56e-16 ***
## experience 0.26901 0.03108 8.656 < 2e-16 ***
## educ 2.32344 0.15397 15.090 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.26 on 2879 degrees of freedom
## Multiple R-squared: 0.08693, Adjusted R-squared: 0.08629
## F-statistic: 137 on 2 and 2879 DF, p-value: < 2.2e-16
The F-statistic is 137, significant at the .05 level. Our decision rule: reject H0 if the F-statistic exceeds the \(\alpha\) = .05 critical value from the F(2, 2879) distribution, which is about 3.00. Since 137 far exceeds that, we clearly reject the null hypothesis that experience and educ collectively have no effect on wages.
The results also show that experience is significant (p < 2e-16) controlling for educ, as is educ (p < 2e-16) controlling for experience.
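The critical value in this decision rule can be computed the same way as before (a sketch; the value is roughly 3.00):

qf(.05, 2, 2879, lower.tail = FALSE) # Critical value for F(2, 2879) at alpha = .05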
Test the null hypothesis H0: β1 + β2 = 2 at the α = 0.05 level.
To test this, we pull out the 95% confidence intervals for \(\beta_{1}\) and \(\beta_{2}\) and add their endpoints to form an approximate 95% confidence interval for \(\beta_{1} + \beta_{2}\) (a rough approach, since it ignores the covariance between the two estimates; see the sketch after the result):
# Confidence interval for beta1 + beta2 Lower bound
confint(w_e_edu_lm)[2,1] + confint(w_e_edu_lm)[3,1]
## [1] 2.229604
# Confidence interval for beta1 + beta2 Upper bound
confint(w_e_edu_lm)[2,2] + confint(w_e_edu_lm)[3,2]
## [1] 2.955296
So our approximate 95% confidence interval for \(\beta_{1} + \beta_{2}\) is (2.229604, 2.955296). Since even the lower bound exceeds 2, we reject the null hypothesis \(\beta_{1} + \beta_{2} = 2\).
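For completeness, here is a minimal sketch of the standard t-test of the single linear restriction H0: \(\beta_{1} + \beta_{2} = 2\), using the estimated coefficient covariance matrix (this is not part of the original solution, but the conclusion should be the same):

b <- coef(w_e_edu_lm) # Estimated coefficients
V <- vcov(w_e_edu_lm) # Estimated covariance matrix of the coefficients
est <- b["experience"] + b["educ"] # Point estimate of beta1 + beta2
se <- sqrt(V["experience", "experience"] + V["educ", "educ"] +
  2 * V["experience", "educ"]) # SE of the sum, including the covariance term
t_stat <- (est - 2) / se # t-statistic for H0: beta1 + beta2 = 2
2 * pt(abs(t_stat), df = df.residual(w_e_edu_lm), lower.tail = FALSE) # Two-sided p-value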