The dataset you will be analyzing for this assignment is an extract from the Panel Study of Income Dynamics (1997). The file psid97.dat is included. Please include all relevant Stata output with you copy of the solutions.

The dataset includes five variables: • Age of the household head (age) • Years of education of the household head (educ) • Annual earnings for the household head in 1996 (earnings) • Annual hours work for the household head in 1996 (hours) • An indicator for the marital status of the household head (1=married, 0=single)

Setting up the workspace

Before doing any analysis, we first set up the proper working environment.

# Loading the dataset
library(haven)  # Package for reading the Stata file
data <- read_dta('psid97.dta') # Read into the dataset

summary(data) # General info of the dataset
##       age             educ          earnings          hours     
##  Min.   :18.00   Min.   : 0.00   Min.   :     0   Min.   :   0  
##  1st Qu.:34.00   1st Qu.:12.00   1st Qu.:  7280   1st Qu.:1568  
##  Median :42.50   Median :13.00   Median : 26000   Median :2070  
##  Mean   :44.77   Mean   :13.31   Mean   : 32965   Mean   :1880  
##  3rd Qu.:53.00   3rd Qu.:16.00   3rd Qu.: 45001   3rd Qu.:2480  
##  Max.   :95.00   Max.   :17.00   Max.   :700021   Max.   :5307  
##  NA's   :3359    NA's   :3359    NA's   :3359     NA's   :3359  
##     married     
##  Min.   :0.000  
##  1st Qu.:1.000  
##  Median :1.000  
##  Mean   :0.824  
##  3rd Qu.:1.000  
##  Max.   :1.000  
##  NA's   :3359

Section 0

a)

Create an hourly wage variable, which is earnings divided by hours worked.

# Create the new column named "hourly_wage"
data$hourly_wage <- data$earnings / data$hours

# Check it out
summary(data$hourly_wage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   8.789  14.331     Inf  22.520     Inf    3829

b)

Define an indicator working = 1 if their annual hours worked exceeds zero.

library(tidyverse) # Load tidyverse
# Create the new column "working" conditional on column "hours"
data <- data %>% 
  mutate(working = case_when(hours > 0 ~ 1,
                             hours == 0 ~ 0
  ))

# Check it out 
summary(data$working)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   1.000   1.000   0.858   1.000   1.000    3359

c)

Create a measure of labor market experience, which is typically age minus years of education minus 6.

# Create a new column called "experience"
data$experience <- data$age - data$educ - 6

# Check it out
summary(data$experience)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   -1.00   14.00   23.00   25.46   33.00   83.00    3359

Warning: From summary statistics we know the way we constructed the variable measuring labor market experience can be problematic, or we may have some invalid data entry, because there is an entry with -1 year of experience.


Section 1

First, we wish to study the relationship between wages and experience. For this section, limit your analysis to household heads that are working.

a)

Plot wages against experience (put experience on the horizontal axis and wages on the vertical axis). Describe the correlation between wages and experience. Does the correlation appear to be strong or weak? Explain.

The following graph is a scatter plot describing the relationship between wages and experience in our dataset. The scatter plot seems to show that there is not a strong correlation between wages and experience.

To make things clearer, I added a trend line based on the LOESS (in orange) and a trend line based on the linear estimation (in blue). Both lines indicate a very weak correlation between experience and wages.

ggplot(subset(data, working == 1), aes(experience, hourly_wage)) + # Subset
  geom_point(color = '#00AFBB', size = 1) + # Scatter plot 
  geom_smooth(color = "steelblue", method = 'lm') +# Add the linear Smooth Line
  geom_smooth(color = '#D55E00', method = 'loess') + # add the loess smooth line
  scale_y_continuous(labels = scales::comma) 
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

b)

What is the sample correlation between wages and experience? Did you expect to find this result? Why or why not?

The sample correlation between the two is about 0.12 (rounded to the nearest hundredth). We should expect to see this result, because we just saw a weak relationship between the two from the scatter plot above.

# Load the relevant package
subset(data, working == 1) %>%
  select(experience, hourly_wage) %>% 
  cor(.)
##             experience hourly_wage
## experience   1.0000000   0.1212883
## hourly_wage  0.1212883   1.0000000

c)

Run the following regression #1:

\(Wage = \beta_{0} + \beta_{1} * Experience + \mu_{i}\)

What is the OLS estimate of β1? Is it significantly different from zero at the α = 0.05 level of significance?

The OLS estimate of \(\beta_{1}\) is 0.21 (rounded to the nearest hundredth). It is significant at the \(\alpha\) = .05 level.

wage_exp_lm <- subset(data, working == 1) %>%
  select(experience, hourly_wage) %>%
  lm(hourly_wage ~ experience, data=.)

summary(wage_exp_lm)
## 
## Call:
## lm(formula = hourly_wage ~ experience, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.128  -8.986  -3.548   4.229 224.602 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.84929    0.78090  17.735  < 2e-16 ***
## experience   0.20998    0.03202   6.557 6.46e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.01 on 2880 degrees of freedom
## Multiple R-squared:  0.01471,    Adjusted R-squared:  0.01437 
## F-statistic:    43 on 1 and 2880 DF,  p-value: 6.465e-11

d)

What is the F-statistic for the null hypothesis H0: β1 = 0 in the regression? What critical value from the F-distribution table should you use when testing the null hypothesis at the α = 0.05 level of significance? Would you reject this null hypothesis according to the F test at the α = 0.05 level of significance?

From the regression table for question c), we can see that the F-statistic is 43. We also know that the numerator degree of freedom is 1, and the denominator degree of freedom is 2880 in our case, so we can calculate the cut-off F-statistics using the following syntax:

qf(.05, 1, 2880, lower.tail = FALSE)
## [1] 3.84469

Since our F-statistics is 43, which is larger than 3.85, the critical value at \(\alpha\) = .05. We should reject the null hypothesis.

e)

Notice that you got the same answer in parts C and D. In both cases, you are testing the null hypothesis H0: β1 = 0, so you should get the same answer no matter which method you use. When testing the significance of a single parameter, it can be shown that the F-statistic is equal to the square of the t-statistic. Does this hold in your output?

We can see from below that the F-statistics equals to the square of the t-statistics in our case.

summary(wage_exp_lm)$fstatistic[1] # Extract f statistics 
##    value 
## 42.99983
coef(summary(wage_exp_lm))[, "t value"][2]^2 # Extract t statistics and square
## experience 
##   42.99983


Section 2

Now we wish to study the relationship between wages and experience controlling for education. Again, please limit your analysis to household heads that are working.

a)

What is the sample correlation between education and wages? What is the sample correlation between education and experience? Did you expect to find this latter correlation? Why or why not?

subset(data, working == 1) %>%
  select(experience, hourly_wage, educ) %>% 
  cor(.)
##             experience hourly_wage       educ
## experience   1.0000000   0.1212883 -0.1258688
## hourly_wage  0.1212883   1.0000000  0.2513287
## educ        -0.1258688   0.2513287  1.0000000

From the correlation matrix above, we can see that the sample correlation between education and wages is about 0.25. The sample correlation between education and experience is about -0.13.

We should expect a negative correlation between education and experience, because the more years you invest into schooling, the less likely you enter the labor market early.

b)

Now run regression #2:

Wage = \(\beta_{0}\) + \(\beta_{1}\)experience + \(\beta_{2}\) educ + \(\mu_{i}\)

What is the OLS estimate of β1? Is it significantly different from zero at the α = 0.05 level of significance? Why do you think the estimate of β1 differs from the result you found in problem 1? Do you think the result you found here in problem 2 is a ‘better’ estimate of the effect of one more year of experience on earnings than the estimate in problem 1? Why? (This last question is not a statistical question.)

From the regression output below, we can see that the OLS estimate of \(\beta_{1}\) is about 0.27 (rounded to the nearest hundredth). It’s significantly different from zero at the \(\alpha\) = .05 level.

The \(\beta_{1}\) here is different from the \(\beta_{1}\) in the previous model because here in our model, \(\beta_{1}\) represents the linearly estimated relationship between experience and wages, holding education constant.

The results we have here should “better” represent the relationship in the real world, because education affects wages greatly according to theory and our intuition. After controlling education, we did see a stronger relationship between wages and experience, to which we’re also expecting.

w_e_edu_lm <- subset(data, working == 1) %>%
  select(experience, hourly_wage, educ) %>%
  lm(hourly_wage ~ experience + educ, data=.)

summary(w_e_edu_lm)
## 
## Call:
## lm(formula = hourly_wage ~ experience + educ, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.720  -7.665  -2.027   3.879 233.253 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -18.95945    2.30054  -8.241 2.56e-16 ***
## experience    0.26901    0.03108   8.656  < 2e-16 ***
## educ          2.32344    0.15397  15.090  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.26 on 2879 degrees of freedom
## Multiple R-squared:  0.08693,    Adjusted R-squared:  0.08629 
## F-statistic:   137 on 2 and 2879 DF,  p-value: < 2.2e-16

c)

Test the null hypothesis H0: β2 = 0 at the α = 0.05 level of significance using two methods. First, use the t-statistic from regression #2. Second, use the F-test for the incremental effect of adding an independent variable. You will need to use the results from both regression #1 and regression #2. Show all of your work for constructing this F-test. Do you get the same answer from each of these methods? Are the two methods related? (Please state your decision rule for each of these methods.)

From the regression table presented above for question b), we can see that \(\beta_{2}\) is significant at the \(\alpha\) = .05 level. The t-statistics is 15.090, much larger than the cut-off t-statistics for \(\alpha\) = .05, which we calculated below:

qt(.05, 2879, lower.tail=FALSE)
## [1] 1.645383

To verify what we achieved above, we perform a partial F-test. In this case, our full model is: w_e_edu_lm, the regression object we constructed for this section. Meanwhile, our reduced model is wage_exp_lm, the regression object we constructed for section 1.

We know that the formula for partial F-test is:

\(F = \frac{\frac{RSS_{reduced} - RSS_{full}}{p}}{\frac{RSS_{full}}{n-k}}\)

From the two regressions we performed, we know that:

\(RSS_{reduced}\) is:

sum(resid(wage_exp_lm)^2)
## [1] 1152928

\(RSS_{full}\) is:

sum(resid(w_e_edu_lm)^2)
## [1] 1068424

p is the number of predictors removed from the full model, which is just 1 in our case.

n is total observations in the sample used to perform the regression, in our case is:

nrow(subset(data, working == 1))
## [1] 2882

k is the number of coefficients in the full model, in our case, it is 3.

So the F-statistics is:

\(F = \frac{\frac{1152928 - 1068424}{1}}{\frac{1068424}{2882-3}}\)

This is : F = 227.71 (rounded to the nearest hundredth)

Alternatively, we can just let R do the tough work for us:

anova(wage_exp_lm, w_e_edu_lm)
## Analysis of Variance Table
## 
## Model 1: hourly_wage ~ experience
## Model 2: hourly_wage ~ experience + educ
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1   2880 1152928                                  
## 2   2879 1068424  1     84504 227.71 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the table above, we can see that F is also 227.71, and it is significant at the \(\alpha\)=.05 level. We reject the null hypothesis (\(\beta_{2}=0\)) and conclude that education contributes significant information to wages.

The t-test and partial F-test here render the same conclusion. Actually, we can see that the square of the t-statistics for education equals to the F-statistics in our partial F-test.

We have demonstrated our decision rules for both two tests in the above paragraphs explicitly. For both of them, we compare the statistics with the cut-off value at the \(\alpha=.05\) level. If our statistics is larger than the cut-off value, we reject the null hypothesis (\(\beta_{2}=0\)).

d)

Test the joint null hypothesis H0: β1 = β2 = 0 at the α = 0.05 level. Be sure to state your decision rule.

To test this hypothesis, we simply refer to the F-statistics in our full model:

summary(w_e_edu_lm)
## 
## Call:
## lm(formula = hourly_wage ~ experience + educ, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.720  -7.665  -2.027   3.879 233.253 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -18.95945    2.30054  -8.241 2.56e-16 ***
## experience    0.26901    0.03108   8.656  < 2e-16 ***
## educ          2.32344    0.15397  15.090  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.26 on 2879 degrees of freedom
## Multiple R-squared:  0.08693,    Adjusted R-squared:  0.08629 
## F-statistic:   137 on 2 and 2879 DF,  p-value: < 2.2e-16

The F-statistics is 137, significant at the .05 level, indicating that we should clearly reject the null hypothesis that the variables experience and educ collectively have no effect on wages.

The results also show that the variable experience is significant (p < 2e-16) controlling for the variable educ, as is educ (p < 2e-16) controlling for the variable experience.

e)

Test the null hypothesis H0: β1 + β2 = 2 at the α = 0.05 level.

To test this, we simply pull out the 95% confidence interval for both \(\beta_{1}\) and \(\beta_{2}\), and add them up to construct the 95% confidence interval for \(\beta_{1} + \beta_{2}\):

# Confidence interval for beta1 + beta2 Lower bound 
confint(w_e_edu_lm)[2,1] + confint(w_e_edu_lm)[3,1]
## [1] 2.229604
# Confidence interval for beta1 + beta2 Upper bound 
confint(w_e_edu_lm)[2,2] + confint(w_e_edu_lm)[3,2]
## [1] 2.955296

So the 95% confidence interval for \(\beta_{1} + \beta_{2}\) is: (2.229604, 2.955296). Since the lower bound of our 95% CI for \(\beta_{1} + \beta_{2}\) exceeds 2, we reject the null hypothesis: \(\beta_{1} + \beta_{2} = 2\).