Setup

library(pacman)
p_load(tidyverse, stargazer, broom)  # p_load() loads each package, installing it first if missing

wages <- read_csv("wages.csv")

Let’s look at the data frame:

head(wages,10)
## # A tibble: 10 x 34
##       id nearc2 nearc4  educ   age fatheduc motheduc weight momdad14 sinmom14
##    <dbl>  <dbl>  <dbl> <dbl> <dbl>    <dbl>    <dbl>  <dbl>    <dbl>    <dbl>
##  1     3      0      0    12    27        8        8 380166        1        0
##  2     4      0      0    12    34       14       12 367470        1        0
##  3     5      1      1    11    27       11       12 380166        1        0
##  4     6      1      1    12    34        8        7 367470        1        0
##  5     7      1      1    12    26        9       12 380166        1        0
##  6     8      1      1    18    33       14       14 367470        1        0
##  7     9      1      1    14    29       14       14 496635        1        0
##  8    10      1      1    12    28       12       12 367772        1        0
##  9    11      1      1    12    29       12       12 480445        1        0
## 10    12      1      1     9    28       11       12 380166        1        0
## # … with 24 more variables: step14 <dbl>, reg661 <dbl>, reg662 <dbl>,
## #   reg663 <dbl>, reg664 <dbl>, reg665 <dbl>, reg666 <dbl>, reg667 <dbl>,
## #   reg668 <dbl>, reg669 <dbl>, south66 <dbl>, black <dbl>, smsa <dbl>,
## #   south <dbl>, smsa66 <dbl>, wage <dbl>, enroll <dbl>, KWW <dbl>, IQ <dbl>,
## #   married <dbl>, libcrd14 <dbl>, exper <dbl>, lwage <dbl>, expersq <dbl>

Multiple Regression in R

We already know how to run simple OLS, and multiple OLS is just one easy step further! Let’s run a regression of log wages on education and experience:

reg1 <- lm(lwage ~ educ + exper, data = wages)
summary(reg1)
## 
## Call:
## lm(formula = lwage ~ educ + exper, data = wages)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.57440 -0.22238  0.02188  0.25802  1.17969 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.834500   0.092866   52.06   <2e-16 ***
## educ        0.080330   0.005259   15.27   <2e-16 ***
## exper       0.046395   0.003306   14.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3877 on 1601 degrees of freedom
## Multiple R-squared:  0.1458, Adjusted R-squared:  0.1447 
## F-statistic: 136.6 on 2 and 1601 DF,  p-value: < 2.2e-16

R gives us the t-stat and p-value for each coefficient, so we can easily conduct hypothesis tests. Suppose we want to test \[H_0\text{: } \beta_1 = 0\] vs \[H_1\text{: } \beta_1 \neq 0\] at the 5% level, where \(\beta_1\) is the coefficient on education. Looking at the p-value, we see that \(p < \alpha\), so we reject the null hypothesis at the 5% level.
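If you’d rather not read the p-value off the printout, you can pull it out of the summary object directly (a quick sketch; p_educ is just an illustrative name):

# extract the p-value on educ from the coefficient table and compare it to alpha
p_educ <- summary(reg1)$coefficients["educ", "Pr(>|t|)"]
p_educ < 0.05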

Like last lab, we can use stargazer to summarize our regression results. This time, let’s keep the adjusted \(R^2\) in our table.

stargazer(reg1, keep.stat = c("rsq", "adj.rsq", "n"), type = "text")
## 
## ========================================
##                  Dependent variable:    
##              ---------------------------
##                         lwage           
## ----------------------------------------
## educ                  0.080***          
##                        (0.005)          
##                                         
## exper                 0.046***          
##                        (0.003)          
##                                         
## Constant              4.835***          
##                        (0.093)          
##                                         
## ----------------------------------------
## Observations            1,604           
## R2                      0.146           
## Adjusted R2             0.145           
## ========================================
## Note:        *p<0.1; **p<0.05; ***p<0.01

Confidence Intervals

How do we get confidence intervals in R? Let’s do a 95% confidence interval.

Option 1: Construct by hand. We have the standard error and we have \(\hat{\beta}_1\). We need to get the critical value of t.

# We can use the qt() function to get the critical value of t. First put in the significance level;
# for a two-tailed test, make sure to divide it by 2.
# Then give the degrees of freedom: n - k - 1 = 1604 - 2 - 1 = 1601.
qt(.05/2, 1601)
## [1] -1.961447

The critical value of t is 1.96 (in absolute value). Now we can plug that into our confidence interval formula: \[ .080330 - 1.96(.005259) \leq \beta_1 \leq .080330 + 1.96(.005259) \] Which simplifies to: \[ .0700 \leq \beta_1 \leq .0906\]
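The same construction can be done in R using the pieces stored in reg1 (a minimal sketch; beta_hat, se_hat, and t_crit are just illustrative names):

# estimate and standard error for educ, plus the critical value of t
beta_hat <- coef(reg1)["educ"]
se_hat   <- summary(reg1)$coefficients["educ", "Std. Error"]
t_crit   <- qt(.975, df.residual(reg1))
c(beta_hat - t_crit * se_hat, beta_hat + t_crit * se_hat)  # should match the interval above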

Option 2: Let R tell us. The tidy() function from broom will give us the confidence interval if we ask for it!

tidy(reg1, conf.int = T)
## # A tibble: 3 x 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   4.83     0.0929       52.1 0          4.65      5.02  
## 2 educ          0.0803   0.00526      15.3 2.86e-49   0.0700    0.0906
## 3 exper         0.0464   0.00331      14.0 2.80e-42   0.0399    0.0529

The confidence interval matches our by-hand solution!

But this is a 95% interval, the default. How do we get tidy() to give us a different confidence level? What if we want a 90% confidence interval?

tidy(reg1, conf.int = T, conf.level = .9)
## # A tibble: 3 x 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   4.83     0.0929       52.1 0          4.68      4.99  
## 2 educ          0.0803   0.00526      15.3 2.86e-49   0.0717    0.0890
## 3 exper         0.0464   0.00331      14.0 2.80e-42   0.0410    0.0518

Omitted Variable Bias

reg1 <- lm(lwage ~ educ, data = wages)          # short regression (note: this overwrites the earlier reg1)
reg2 <- lm(lwage ~ educ + exper, data = wages)  # long regression
stargazer(reg1, reg2, keep.stat = c("rsq", "adj.rsq", "n"), type = "text")
## 
## =========================================
##                  Dependent variable:     
##              ----------------------------
##                         lwage            
##                   (1)            (2)     
## -----------------------------------------
## educ            0.037***      0.080***   
##                 (0.005)        (0.005)   
##                                          
## exper                         0.046***   
##                                (0.003)   
##                                          
## Constant        5.816***      4.835***   
##                 (0.065)        (0.093)   
##                                          
## -----------------------------------------
## Observations     1,604          1,604    
## R2               0.041          0.146    
## Adjusted R2      0.040          0.145    
## =========================================
## Note:         *p<0.1; **p<0.05; ***p<0.01

What is the sign of the OVB? Use the formula: \[ OVB = \beta_1^{short} - \beta_1^{long} = .037 - .080 = -.043\] There is a negative bias from omitting experience. What does this tell us about how the variables are correlated? Since the coefficient on experience is positive, a negative bias means education and experience are negatively correlated. We can check this using the cor() function, too.

cor(wages$educ, wages$exper)
## [1] -0.5827916
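We can also verify the OVB formula directly: the bias equals the coefficient on experience in the long model times the slope from an auxiliary regression of the omitted variable on education (a quick check that is not part of the lab; delta is just an illustrative name):

# auxiliary regression of the omitted variable (exper) on the included regressor (educ)
delta <- coef(lm(exper ~ educ, data = wages))["educ"]
coef(reg2)["exper"] * delta  # should be roughly .037 - .080 = -.043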

Let’s do another regression.

reg3 <- lm(lwage ~ educ + exper + IQ, data = wages) # positive bias from omitting IQ (imperfect proxy for intelligence)
stargazer(reg1, reg2, reg3, keep.stat = c("rsq", "adj.rsq", "n"), type = "text")
## 
## ==========================================
##                   Dependent variable:     
##              -----------------------------
##                          lwage            
##                 (1)       (2)       (3)   
## ------------------------------------------
## educ         0.037***  0.080***  0.069*** 
##               (0.005)   (0.005)   (0.006) 
##                                           
## exper                  0.046***  0.048*** 
##                         (0.003)   (0.003) 
##                                           
## IQ                               0.004*** 
##                                   (0.001) 
##                                           
## Constant     5.816***  4.835***  4.587*** 
##               (0.065)   (0.093)   (0.104) 
##                                           
## ------------------------------------------
## Observations   1,604     1,604     1,604  
## R2             0.041     0.146     0.159  
## Adjusted R2    0.040     0.145     0.158  
## ==========================================
## Note:          *p<0.1; **p<0.05; ***p<0.01

How are IQ and education correlated? Check the OVB formula again. \[ OVB = \beta_1^{short} - \beta_1^{long} = .080 - .069 = .011\] There is a positive bias from omitting IQ, and since the coefficient on IQ is positive, that suggests education and IQ are positively correlated.

cor(wages$educ, wages$IQ)
## [1] 0.4992129

The correlation confirms it: education and IQ are positively correlated.
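The OVB formula works the same way here, except the auxiliary regression of IQ has to include every regressor kept in the short model (again just a quick check; delta_iq is an illustrative name):

# auxiliary regression of the omitted variable (IQ) on the included regressors (educ and exper)
delta_iq <- coef(lm(IQ ~ educ + exper, data = wages))["educ"]
coef(reg3)["IQ"] * delta_iq  # should be roughly .080 - .069 = .011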

We can keep adding variables to our regression:

reg4 <- lm(lwage ~ educ + exper + IQ + KWW, data = wages) 
reg5 <- lm(lwage ~ educ + exper + IQ + KWW + fatheduc, data = wages) 
reg6 <- lm(lwage ~ educ + exper + IQ + KWW + fatheduc + motheduc, data = wages) 
stargazer(reg1, reg2, reg3, reg4, reg5, reg6, keep.stat = c("rsq", "adj.rsq", "n"), type = "text")
## 
## ==================================================================
##                               Dependent variable:                 
##              -----------------------------------------------------
##                                      lwage                        
##                (1)      (2)      (3)      (4)      (5)      (6)   
## ------------------------------------------------------------------
## educ         0.037*** 0.080*** 0.069*** 0.058*** 0.058*** 0.057***
##              (0.005)  (0.005)  (0.006)  (0.006)  (0.006)  (0.006) 
##                                                                   
## exper                 0.046*** 0.048*** 0.041*** 0.042*** 0.042***
##                       (0.003)  (0.003)  (0.004)  (0.004)  (0.004) 
##                                                                   
## IQ                             0.004*** 0.003*** 0.003*** 0.003***
##                                (0.001)  (0.001)  (0.001)  (0.001) 
##                                                                   
## KWW                                     0.006*** 0.006*** 0.006***
##                                         (0.002)  (0.002)  (0.002) 
##                                                                   
## fatheduc                                          0.003   -0.0003 
##                                                  (0.003)  (0.004) 
##                                                                   
## motheduc                                                   0.008* 
##                                                           (0.004) 
##                                                                   
## Constant     5.816*** 4.835*** 4.587*** 4.672*** 4.664*** 4.634***
##              (0.065)  (0.093)  (0.104)  (0.106)  (0.106)  (0.108) 
##                                                                   
## ------------------------------------------------------------------
## Observations  1,604    1,604    1,604    1,604    1,604    1,604  
## R2            0.041    0.146    0.159    0.167    0.168    0.170  
## Adjusted R2   0.040    0.145    0.158    0.165    0.165    0.167  
## ==================================================================
## Note:                                  *p<0.1; **p<0.05; ***p<0.01

Let’s look a little closer at the models with father’s education included. The coefficient is not statistically significant. Should we still include this variable in our regression?

YES! Remember:

- If we exclude variables that should be in the model and they are correlated with the regressors we include, then we will have omitted variable bias.
- Even if they aren’t correlated with the included regressors, they help us tighten up our standard errors.
- If theory or intuition says a variable should be in the model, then it should be in there!

F-tests

Let’s suppose we want to test whether the coefficients on father’s education and mother’s education are jointly equal to zero at the 5% level. The null hypothesis is: \[H_0\text{: } \beta_5 = \beta_6 = 0\] The alternative hypothesis is: \[H_1\text{: either } \beta_5 \neq 0 \text{ or } \beta_6 \neq 0\] where \(\beta_5\) and \(\beta_6\) are the coefficients on fatheduc and motheduc in reg6.

To do the test, we need to identify our restricted model and our unrestricted model. reg4 is the restricted model and reg6 is the unrestricted.

First let’s do this the long way. Recall, the F-stat is equal to: \[F = \frac{(R^2_u - R^2_r)/q}{(1-R^2_u)/(n-k-1)}\] where \(q\) is the number of restrictions and \(n-k-1\) is the residual degrees of freedom of the unrestricted model.

First, gather the necessary pieces:

# R-squared from the restricted and unrestricted models
r2_restrict <- summary(reg4)$r.squared
r2_unrestrict <- summary(reg6)$r.squared

# number of restrictions, q (count the equal signs in the null hypothesis)
n_restrict <- 2

# residual degrees of freedom in the unrestricted model (n - k - 1)
dof <- summary(reg6)$df[2]

# significance level
alpha <- 0.05

# calculate the F-stat
f_stat <- ((r2_unrestrict - r2_restrict)/n_restrict)/((1 - r2_unrestrict)/dof)

# find the critical value of F
f_crit <- qf(1 - alpha, n_restrict, dof)

We reject the null hypothesis if the F-stat is greater than the F-crit:

f_stat > f_crit
## [1] FALSE

It is not, so we fail to reject the null hypothesis.
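We can also turn the F-stat into a p-value with pf() (a small aside; the result should match the Pr(>F) that linearHypothesis() reports below):

# p-value for the F-stat: area to the right of f_stat under the F(q, n - k - 1) distribution
pf(f_stat, n_restrict, dof, lower.tail = FALSE)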

Now let’s do this the easy way and have R do it for us. Load the car package to use this function:

p_load(car)
linearHypothesis(reg6, c("fatheduc=0", "motheduc=0"))
## Linear hypothesis test
## 
## Hypothesis:
## fatheduc = 0
## motheduc = 0
## 
## Model 1: restricted model
## Model 2: lwage ~ educ + exper + IQ + KWW + fatheduc + motheduc
## 
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1   1599 234.57                           
## 2   1597 233.93  2   0.64137 2.1892 0.1123

Notice that the F-statistic reported here is exactly the same as the one we calculated above:

f_stat
## [1] 2.189213
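An equivalent shortcut for a joint null like this one is to compare the restricted and unrestricted models directly with anova(), which runs the same nested-model F-test:

# nested-model comparison: restricted (reg4) vs. unrestricted (reg6)
anova(reg4, reg6)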