Chapter 7-1

Load the wooldridge package and dataset:

# Install the package if not already installed
if (!require(wooldridge)) install.packages("wooldridge")
## Loading required package: wooldridge
# Load the library
library(wooldridge)

# Load the SLEEP75 dataset
data("sleep75")

Fit the regression model given in the question:

# Fit the regression model
model <- lm(sleep ~ totwrk + educ + age + I(age^2) + male, data = sleep75)

# Summary of the model
summary(model)
## 
## Call:
## lm(formula = sleep ~ totwrk + educ + age + I(age^2) + male, data = sleep75)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2378.00  -243.29     6.74   259.24  1350.19 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3840.83197  235.10870  16.336   <2e-16 ***
## totwrk        -0.16342    0.01813  -9.013   <2e-16 ***
## educ         -11.71332    5.86689  -1.997   0.0463 *  
## age           -8.69668   11.20746  -0.776   0.4380    
## I(age^2)       0.12844    0.13390   0.959   0.3378    
## male          87.75243   34.32616   2.556   0.0108 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 417.7 on 700 degrees of freedom
## Multiple R-squared:  0.1228, Adjusted R-squared:  0.1165 
## F-statistic: 19.59 on 5 and 700 DF,  p-value: < 2.2e-16

Answer the questions:

  1. Evidence that men sleep more than women: The coefficient for male (gender dummy) in the model represents the difference in sleep for men compared to women. Check its value and statistical significance

    coef(summary(model))["male", ] # Extract details for 'male'
    ##    Estimate  Std. Error     t value    Pr(>|t|) 
    ## 87.75242861 34.32616153  2.55642999  0.01078518

    The estimate (87.75) indicates that men, on average, sleep 87.75 more minutes per week than women, holding all other factors constant. The p-value (0.0108) is less than 0.05, which means the difference is statistically significant at the 5% level.

    There is statistically significant evidence that men sleep more than women: the difference is approximately 87.75 minutes per week, and it is unlikely to be due to chance.
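
    As a complementary check, confint() gives a 95% confidence interval for the male coefficient (a quick sketch; the interval should exclude zero, consistent with the p-value above):

    # 95% confidence interval for the coefficient on male (sketch)
    confint(model, "male", level = 0.95)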

  2. Tradeoff between working and sleeping: The coefficient for totwrk (total work time) reflects the tradeoff between working and sleeping. It shows how many minutes of sleep decrease for each additional minute of work

    coef(summary(model))["totwrk", ] # Extract details for 'totwrk'
    ##      Estimate    Std. Error       t value      Pr(>|t|) 
    ## -1.634232e-01  1.813210e-02 -9.012918e+00  1.891272e-18

    The coefficient of totwrk (−0.1634) indicates the tradeoff between working and sleeping:

    • Tradeoff: For every additional minute spent working, sleep decreases by approximately 0.1634 minutes (or 9.8 seconds). This demonstrates a negative relationship between working and sleeping.

    • Statistical Significance: The p-value (1.891×10^{-18}) is extremely small, well below the typical threshold of 0.05, indicating that this tradeoff is highly statistically significant.

    There is strong evidence of a statistically significant tradeoff between working and sleeping. For every extra minute of work, the expected reduction in sleep is approximately 0.1634 minutes.
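
    To make the magnitude easier to read, the tradeoff can be rescaled from minutes to hours (a small sketch):

    # Minutes of sleep lost per additional hour of work (sketch)
    60 * coef(model)["totwrk"]   # about -9.8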

  3. Test the null hypothesis regarding age: To test whether age has no effect on sleeping (both age and age^2 are jointly insignificant), run an F-test

    # Fit a reduced model without age and age^2
    reduced_model <- lm(sleep ~ totwrk + educ + male, data = sleep75)
    
    # Perform an F-test
    anova(reduced_model, model)
    ## Analysis of Variance Table
    ## 
    ## Model 1: sleep ~ totwrk + educ + male
    ## Model 2: sleep ~ totwrk + educ + age + I(age^2) + male
    ##   Res.Df       RSS Df Sum of Sq      F Pr(>F)
    ## 1    702 122631662                           
    ## 2    700 122147777  2    483885 1.3865 0.2506
    • Null hypothesis (H0): age has no effect on sleep; the coefficients on age and age^2 are jointly zero.

    • Alternative hypothesis (H1): age has an effect on sleep.

    • The p-value for the F-test is 0.2506, which is greater than common significance levels (e.g., α = 0.05), so we fail to reject the null hypothesis. In other words, there is no strong evidence that age (through age and age^2) has a statistically significant effect on sleep, after controlling for the other factors (work time, education, and gender).
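
    The same joint test can be run directly with car::linearHypothesis, which later chapters use (a sketch):

    # Equivalent joint F-test on age and age^2 (sketch)
    library(car)
    linearHypothesis(model, c("age = 0", "I(age^2) = 0"))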

Chapter 7-3

Load the Wooldridge Dataset and Perform Analysis

# Install and load the wooldridge package
if (!require(wooldridge)) install.packages("wooldridge")
library(wooldridge)

# Load the GPA2 dataset
data("gpa2")

# Fit the regression model
model <- lm(sat ~ hsize + I(hsize^2) + female + black + female:black, data = gpa2)

# Summary of the model
summary(model)
## 
## Call:
## lm(formula = sat ~ hsize + I(hsize^2) + female + black + female:black, 
##     data = gpa2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -570.45  -89.54   -5.24   85.41  479.13 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1028.0972     6.2902 163.445  < 2e-16 ***
## hsize          19.2971     3.8323   5.035 4.97e-07 ***
## I(hsize^2)     -2.1948     0.5272  -4.163 3.20e-05 ***
## female        -45.0915     4.2911 -10.508  < 2e-16 ***
## black        -169.8126    12.7131 -13.357  < 2e-16 ***
## female:black   62.3064    18.1542   3.432 0.000605 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 133.4 on 4131 degrees of freedom
## Multiple R-squared:  0.08578,    Adjusted R-squared:  0.08468 
## F-statistic: 77.52 on 5 and 4131 DF,  p-value: < 2.2e-16

Answer the Questions

  1. Evidence for hsize^2 and optimal high school size

    To check if hsize^2 is significant and calculate the optimal high school size:

    # Extract the coefficients for hsize and hsize^2
    coef_hsize <- coef(model)["hsize"]
    coef_hsize2 <- coef(model)["I(hsize^2)"]
    
    # Test the significance of hsize^2
    coef(summary(model))["I(hsize^2)", ] # Check the p-value
    ##      Estimate    Std. Error       t value      Pr(>|t|) 
    ## -2.194828e+00  5.271635e-01 -4.163468e+00  3.198417e-05
    # Calculate the optimal high school size (vertex of the parabola)
    optimal_hsize <- -coef_hsize / (2 * coef_hsize2)
    optimal_hsize
    ##   hsize 
    ## 4.39603

    Since the p-value is less than 0.05, hsize^2 is statistically significant. This suggests that hsize^2 should be included in the model because it captures a non-linear relationship between high school size and SAT scores.

    The optimal high school size is approximately 4.40 (in hundreds of students), which translates to 440 students.
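
    As a sanity check on the vertex formula, one can scan the fitted quadratic profile over a grid of school sizes (a sketch):

    # The hsize profile should peak near 4.4 (hundreds of students)
    hs <- seq(1, 10, by = 0.01)
    profile <- coef_hsize * hs + coef_hsize2 * hs^2
    hs[which.max(profile)]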

  2. Difference in SAT score between nonblack females and nonblack males

    The coefficient for female represents the difference in SAT score between nonblack females and nonblack males:

    # Extract the coefficient and its significance for 'female'
    coef(summary(model))["female", ]
    ##      Estimate    Std. Error       t value      Pr(>|t|) 
    ## -4.509145e+01  4.291063e+00 -1.050822e+01  1.656301e-25

    There is strong evidence that nonblack females score significantly lower on the SAT (by about 45.09 points) than nonblack males, holding other factors constant.

  3. Difference in SAT score between nonblack males and black males

    The coefficient for black represents the difference in SAT score between nonblack males and black males:

    # Extract the coefficient and its significance for 'black'
    coef(summary(model))["black", ]
    ##      Estimate    Std. Error       t value      Pr(>|t|) 
    ## -1.698126e+02  1.271315e+01 -1.335725e+01  7.140387e-40
    # Test the null hypothesis (H0: No difference)
    t_stat <- coef(model)["black"] / coef(summary(model))["black", "Std. Error"]
    p_value <- 2 * pt(abs(t_stat), df = nrow(gpa2) - length(coef(model)), lower.tail = FALSE)
    p_value
    ##        black 
    ## 7.140387e-40

    There is very strong evidence to reject the null hypothesis that there is no difference in SAT scores between nonblack males and black males. Black males score significantly lower than nonblack males on the SAT in this dataset.

    On average, black males score 169.81 points lower on the SAT than nonblack males, holding all other factors constant.

  4. Difference in SAT score between black females and nonblack females

    The difference in SAT scores between black females and nonblack females is captured by the interaction term female:black plus the black term:

    # Calculate the combined effect for black females
    black_female_effect <- coef(model)["black"] + coef(model)["female:black"]
    black_female_effect
    ##     black 
    ## -107.5063
    # Test its significance
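    # Note: the SE below ignores the covariance between the two coefficient
    # estimates; see the fuller sketch at the end of this answer.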
    se_combined <- sqrt(
      coef(summary(model))["black", "Std. Error"]^2 +
      coef(summary(model))["female:black", "Std. Error"]^2
    )
    t_stat_combined <- black_female_effect / se_combined
    p_value_combined <- 2 * pt(abs(t_stat_combined), df = nrow(gpa2) - length(coef(model)), lower.tail = FALSE)
    p_value_combined
    ##        black 
    ## 1.275493e-06

    Estimated difference: The estimated difference in SAT scores between black females and nonblack females is -107.51. This means, on average, black females score 107.51 points less on the SAT compared to nonblack females, holding all other variables constant.

    Since the p-value is very small, we reject the null hypothesis. This means the difference in SAT scores between black females and nonblack females is statistically significant. In other words, the data provides strong evidence that black females have lower SAT scores compared to nonblack females, on average, when controlling for other variables in the model.
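
    One caveat on the calculation above: the standard error formula used omits the covariance between the two coefficient estimates. A fuller version pulls the covariance from the model's variance-covariance matrix (a sketch; the corrected SE can differ somewhat from the one above):

    # Combined SE including the covariance term (sketch)
    V <- vcov(model)
    se_full <- sqrt(V["black", "black"] + V["female:black", "female:black"] +
                      2 * V["black", "female:black"])
    se_full
    # Equivalent single test: linearHypothesis(model, "black + female:black = 0")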

Chapter 7-C1

Load Dataset

# Load the wooldridge package
if (!require(wooldridge)) install.packages("wooldridge")
library(wooldridge)

# Load the GPA1 dataset
data("gpa1")

Answer the questions

  1. Add mothcoll and fathcoll to the existing regression model

    # Original model in equation (7.6)
    original_model <- lm(colGPA ~ hsGPA + ACT + PC, data = gpa1)
    
    # Model with mothcoll and fathcoll
    model_i <- lm(colGPA ~ hsGPA + ACT + PC + mothcoll + fathcoll, data = gpa1)
    
    # Compare the summary of the two models
    summary(original_model)
    ## 
    ## Call:
    ## lm(formula = colGPA ~ hsGPA + ACT + PC, data = gpa1)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -0.7901 -0.2622 -0.0107  0.2334  0.7570 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) 1.263520   0.333125   3.793 0.000223 ***
    ## hsGPA       0.447242   0.093647   4.776 4.54e-06 ***
    ## ACT         0.008659   0.010534   0.822 0.412513    
    ## PC          0.157309   0.057287   2.746 0.006844 ** 
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.3325 on 137 degrees of freedom
    ## Multiple R-squared:  0.2194, Adjusted R-squared:  0.2023 
    ## F-statistic: 12.83 on 3 and 137 DF,  p-value: 1.932e-07
    summary(model_i)
    ## 
    ## Call:
    ## lm(formula = colGPA ~ hsGPA + ACT + PC + mothcoll + fathcoll, 
    ##     data = gpa1)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -0.78149 -0.25726 -0.02121  0.24691  0.74432 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)  1.255554   0.335392   3.744 0.000268 ***
    ## hsGPA        0.450220   0.094280   4.775 4.61e-06 ***
    ## ACT          0.007724   0.010678   0.723 0.470688    
    ## PC           0.151854   0.058716   2.586 0.010762 *  
    ## mothcoll    -0.003758   0.060270  -0.062 0.950376    
    ## fathcoll     0.041800   0.061270   0.682 0.496265    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.3344 on 135 degrees of freedom
    ## Multiple R-squared:  0.2222, Adjusted R-squared:  0.1934 
    ## F-statistic: 7.713 on 5 and 135 DF,  p-value: 2.083e-06

    Adding mothcoll and fathcoll does not substantially improve the model fit or contribute significantly to explaining college GPA. The effect of PC ownership remains statistically significant even after controlling for these parental education variables.

  2. Test for joint significance of mothcoll and fathcoll

    To test if mothcoll and fathcoll are jointly significant, conduct an F-test:

    # Reduced model without mothcoll and fathcoll
    reduced_model <- lm(colGPA ~ hsGPA + ACT + PC, data = gpa1)
    
    # Perform F-test
    anova(reduced_model, model_i)
    ## Analysis of Variance Table
    ## 
    ## Model 1: colGPA ~ hsGPA + ACT + PC
    ## Model 2: colGPA ~ hsGPA + ACT + PC + mothcoll + fathcoll
    ##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
    ## 1    137 15.149                           
    ## 2    135 15.094  2  0.054685 0.2446 0.7834

    Since the p-value (0.7834) is much greater than any common significance level (e.g., α = 0.05), we fail to reject the null hypothesis. This means there is no strong evidence that mothcoll and fathcoll are jointly significant in explaining variations in college GPA (colGPA), holding other variables constant.

    Adding mothcoll and fathcoll to the model does not provide a statistically significant improvement in explaining college GPA. This suggests that parents’ college education is not an important predictor in this context once other factors (high school GPA, ACT score, and PC ownership) are already included in the model.

  3. Add hsGPA^2 to the model from part (i)

    To check whether adding hsGPA^2 improves the model:

    # Model with hsGPA^2
    model_iii <- lm(colGPA ~ hsGPA + I(hsGPA^2) + ACT + PC + mothcoll + fathcoll, data = gpa1)
    
    # Summary of the model
    summary(model_iii)
    ## 
    ## Call:
    ## lm(formula = colGPA ~ hsGPA + I(hsGPA^2) + ACT + PC + mothcoll + 
    ##     fathcoll, data = gpa1)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -0.78998 -0.24327 -0.00648  0.26179  0.72231 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)  
    ## (Intercept)  5.040328   2.443038   2.063   0.0410 *
    ## hsGPA       -1.802520   1.443552  -1.249   0.2140  
    ## I(hsGPA^2)   0.337341   0.215711   1.564   0.1202  
    ## ACT          0.004786   0.010786   0.444   0.6580  
    ## PC           0.140446   0.058858   2.386   0.0184 *
    ## mothcoll     0.003091   0.060110   0.051   0.9591  
    ## fathcoll     0.062761   0.062401   1.006   0.3163  
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.3326 on 134 degrees of freedom
    ## Multiple R-squared:  0.2361, Adjusted R-squared:  0.2019 
    ## F-statistic: 6.904 on 6 and 134 DF,  p-value: 2.088e-06
    # Compare model_i and model_iii using an F-test
    anova(model_i, model_iii)
    ## Analysis of Variance Table
    ## 
    ## Model 1: colGPA ~ hsGPA + ACT + PC + mothcoll + fathcoll
    ## Model 2: colGPA ~ hsGPA + I(hsGPA^2) + ACT + PC + mothcoll + fathcoll
    ##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
    ## 1    135 15.094                           
    ## 2    134 14.823  1   0.27054 2.4457 0.1202

    The F-test p-value (0.1202) exceeds 0.05, so adding hsGPA^2 does not yield a statistically significant improvement over the model from part (i).
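
    For interpretation, the estimated relationship is U-shaped in hsGPA (negative linear term, positive squared term); its turning point can be computed directly (a sketch):

    # Turning point of the estimated hsGPA quadratic (sketch)
    -coef(model_iii)["hsGPA"] / (2 * coef(model_iii)["I(hsGPA^2)"])   # about 2.67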

Chapter 7-C2

Load necessary libraries and data

# Install the required packages if not already installed
if (!require(wooldridge)) install.packages("wooldridge")
if (!require(car)) install.packages("car") # For hypothesis tests
## Loading required package: car
## Loading required package: carData
# Load libraries
library(wooldridge)

# Load the dataset
data("wage2")

Answer the questions

  1. Estimate the base model

    # Base model
    model1 <- lm(log(wage) ~ educ + exper + tenure + married + black + south + urban, data = wage2)
    
    # Summary of the model
    summary(model1)
    ## 
    ## Call:
    ## lm(formula = log(wage) ~ educ + exper + tenure + married + black + 
    ##     south + urban, data = wage2)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.98069 -0.21996  0.00707  0.24288  1.22822 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)  5.395497   0.113225  47.653  < 2e-16 ***
    ## educ         0.065431   0.006250  10.468  < 2e-16 ***
    ## exper        0.014043   0.003185   4.409 1.16e-05 ***
    ## tenure       0.011747   0.002453   4.789 1.95e-06 ***
    ## married      0.199417   0.039050   5.107 3.98e-07 ***
    ## black       -0.188350   0.037667  -5.000 6.84e-07 ***
    ## south       -0.090904   0.026249  -3.463 0.000558 ***
    ## urban        0.183912   0.026958   6.822 1.62e-11 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.3655 on 927 degrees of freedom
    ## Multiple R-squared:  0.2526, Adjusted R-squared:  0.2469 
    ## F-statistic: 44.75 on 7 and 927 DF,  p-value: < 2.2e-16
    # Difference in monthly salary between blacks and nonblacks
    # Coefficient of "black" represents the percentage difference in wages

    Holding other factors fixed, black workers are estimated to earn approximately 18.8% less per month than nonblack workers, and the difference is statistically significant at any conventional level (t ≈ −5.0).
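
    Because the dependent variable is log(wage), the coefficient is only an approximate percentage difference; the exact differential is 100·(exp(b) − 1) (a sketch):

    # Exact percentage wage differential implied by the black coefficient (sketch)
    100 * (exp(coef(model1)["black"]) - 1)   # about -17.2%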

  2. Add exper^2 and tenure^2 and test joint insignificance

    # Extended model with quadratic terms
    model2 <- lm(log(wage) ~ educ + exper + I(exper^2) + tenure + I(tenure^2) + married + black + south + urban, data = wage2)
    
    # Summary of the extended model
    summary(model2)
    ## 
    ## Call:
    ## lm(formula = log(wage) ~ educ + exper + I(exper^2) + tenure + 
    ##     I(tenure^2) + married + black + south + urban, data = wage2)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.98236 -0.21972 -0.00036  0.24078  1.25127 
    ## 
    ## Coefficients:
    ##               Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)  5.3586756  0.1259143  42.558  < 2e-16 ***
    ## educ         0.0642761  0.0063115  10.184  < 2e-16 ***
    ## exper        0.0172146  0.0126138   1.365 0.172665    
    ## I(exper^2)  -0.0001138  0.0005319  -0.214 0.830622    
    ## tenure       0.0249291  0.0081297   3.066 0.002229 ** 
    ## I(tenure^2) -0.0007964  0.0004710  -1.691 0.091188 .  
    ## married      0.1985470  0.0391103   5.077 4.65e-07 ***
    ## black       -0.1906636  0.0377011  -5.057 5.13e-07 ***
    ## south       -0.0912153  0.0262356  -3.477 0.000531 ***
    ## urban        0.1854241  0.0269585   6.878 1.12e-11 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.3653 on 925 degrees of freedom
    ## Multiple R-squared:  0.255,  Adjusted R-squared:  0.2477 
    ## F-statistic: 35.17 on 9 and 925 DF,  p-value: < 2.2e-16
    # Test joint significance of exper^2 and tenure^2
    linearHypothesis(model2, c("I(exper^2) = 0", "I(tenure^2) = 0"))
    ## 
    ## Linear hypothesis test:
    ## I(exper^2) = 0
    ## I(tenure^2) = 0
    ## 
    ## Model 1: restricted model
    ## Model 2: log(wage) ~ educ + exper + I(exper^2) + tenure + I(tenure^2) + 
    ##     married + black + south + urban
    ## 
    ##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
    ## 1    927 123.82                           
    ## 2    925 123.42  2   0.39756 1.4898  0.226

Coefficients for I(exper^2) and I(tenure^2):

  • I(exper^2) has an estimated coefficient of -0.0001138, with a p-value of 0.831, indicating it is not significant.

  • I(tenure^2) has an estimated coefficient of -0.0007964, with a p-value of 0.091, which is marginally significant at the 10% level but not at stricter levels.

Joint Insignificance Test (Linear Hypothesis Test):

  • The null hypothesis for the test is that both I(exper^2) = 0 and I(tenure^2) = 0.

  • F-statistic: 1.4898, with a p-value of 0.226, which is greater than 0.20.

  • Conclusion: The test fails to reject the null hypothesis, so the quadratic terms for exper and tenure are jointly insignificant, even at the 20% significance level.

  3. Allow the return to education to depend on race

    # Interaction between education and race
    model3 <- lm(log(wage) ~ educ * black + exper + tenure + married + south + urban, data = wage2)
    
    # Summary of the model
    summary(model3)
    ## 
    ## Call:
    ## lm(formula = log(wage) ~ educ * black + exper + tenure + married + 
    ##     south + urban, data = wage2)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.97782 -0.21832  0.00475  0.24136  1.23226 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)  5.374817   0.114703  46.859  < 2e-16 ***
    ## educ         0.067115   0.006428  10.442  < 2e-16 ***
    ## black        0.094809   0.255399   0.371 0.710561    
    ## exper        0.013826   0.003191   4.333 1.63e-05 ***
    ## tenure       0.011787   0.002453   4.805 1.80e-06 ***
    ## married      0.198908   0.039047   5.094 4.25e-07 ***
    ## south       -0.089450   0.026277  -3.404 0.000692 ***
    ## urban        0.183852   0.026955   6.821 1.63e-11 ***
    ## educ:black  -0.022624   0.020183  -1.121 0.262603    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.3654 on 926 degrees of freedom
    ## Multiple R-squared:  0.2536, Adjusted R-squared:  0.2471 
    ## F-statistic: 39.32 on 8 and 926 DF,  p-value: < 2.2e-16
    # Test if the return to education depends on race
    linearHypothesis(model3, "educ:black = 0")
    ## 
    ## Linear hypothesis test:
    ## educ:black = 0
    ## 
    ## Model 1: restricted model
    ## Model 2: log(wage) ~ educ * black + exper + tenure + married + south + 
    ##     urban
    ## 
    ##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
    ## 1    927 123.82                           
    ## 2    926 123.65  1   0.16778 1.2565 0.2626

    The interaction between education and race (educ:black) does not significantly affect wages. This implies that the return to education does not statistically differ between black and non-black individuals in this model.
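
    For reference, the point estimates imply a return to education of about 6.7% per year for nonblacks and about 6.7 − 2.3 ≈ 4.4% for blacks, though the gap is imprecisely estimated (a sketch):

    # Implied return to education for black workers (sketch)
    coef(model3)["educ"] + coef(model3)["educ:black"]   # about 0.0445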

  4. Allow wages to differ across marital status and race

    # Create interaction terms for marital status and race
    model4 <- lm(log(wage) ~ married * black + exper + tenure + educ + south + urban, data = wage2)
    
    # Summary of the model
    summary(model4)
    ## 
    ## Call:
    ## lm(formula = log(wage) ~ married * black + exper + tenure + educ + 
    ##     south + urban, data = wage2)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.98013 -0.21780  0.01057  0.24219  1.22889 
    ## 
    ## Coefficients:
    ##                Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)    5.403793   0.114122  47.351  < 2e-16 ***
    ## married        0.188915   0.042878   4.406 1.18e-05 ***
    ## black         -0.240820   0.096023  -2.508 0.012314 *  
    ## exper          0.014146   0.003191   4.433 1.04e-05 ***
    ## tenure         0.011663   0.002458   4.745 2.41e-06 ***
    ## educ           0.065475   0.006253  10.471  < 2e-16 ***
    ## south         -0.091989   0.026321  -3.495 0.000497 ***
    ## urban          0.184350   0.026978   6.833 1.50e-11 ***
    ## married:black  0.061354   0.103275   0.594 0.552602    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.3656 on 926 degrees of freedom
    ## Multiple R-squared:  0.2528, Adjusted R-squared:  0.2464 
    ## F-statistic: 39.17 on 8 and 926 DF,  p-value: < 2.2e-16
    # Wage differential between married blacks and married nonblacks
    # Coefficient of "married:black" represents the differential

    The estimated wage differential between married blacks and married nonblacks is the sum of the black and married:black coefficients: −0.2408 + 0.0614 ≈ −0.179, i.e., married black men earn roughly 17.9% less than comparable married nonblack men. The interaction term itself is small and statistically insignificant (p = 0.55), so there is no evidence that the black wage gap differs by marital status; the combined differential and its standard error are sketched below.
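
    A sketch of the combined point estimate and its standard error (including the covariance term between the two coefficients):

    # Married-black vs married-nonblack differential (sketch)
    b <- coef(model4)
    V <- vcov(model4)
    diff_mb <- b["black"] + b["married:black"]   # about -0.179
    se_mb <- sqrt(V["black", "black"] + V["married:black", "married:black"] +
                    2 * V["black", "married:black"])
    c(estimate = unname(diff_mb), se = se_mb)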

Chapter 8-1

(i) The OLS estimators, β̂_j, are inconsistent.

  • Answer: False.

  • Reason: Heteroskedasticity does not cause the OLS estimators to be inconsistent. The OLS estimators remain consistent as long as the model is correctly specified, and the errors have zero mean and are uncorrelated with the regressors. However, heteroskedasticity affects efficiency (see point iii).

(ii) The usual F-statistic no longer has an F-distribution.

  • Answer: True.

  • Reason: Under heteroskedasticity, the usual F-statistic does not follow an exact F-distribution because the standard error estimates are biased. This can lead to incorrect inference unless robust standard errors are used.

(iii) The OLS estimators are no longer BLUE.

  • Answer: True.

  • Reason: The OLS estimators are no longer BLUE (Best Linear Unbiased Estimators) under heteroskedasticity because they are no longer efficient. In the presence of heteroskedasticity, OLS does not produce the smallest variance among linear unbiased estimators. Weighted Least Squares (WLS) or robust standard errors are typically used to address this issue.
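
A minimal simulated sketch of these points (not from the textbook): under heteroskedasticity the OLS point estimates remain sensible, but inference should use robust standard errors, or WLS when the variance function is known.

# Simulate heteroskedastic errors: Var(u|x) grows with x
set.seed(1)
n <- 500
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = sqrt(x))
fit <- lm(y ~ x)

library(lmtest)
library(sandwich)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))  # robust inference

fit_wls <- lm(y ~ x, weights = 1 / x)            # WLS if Var(u|x) = sigma^2 * x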

Chapter 8-5

Load Required Libraries and Data

# Load required packages
if (!require(wooldridge)) install.packages("wooldridge")
library(wooldridge)

# Load the SMOKE dataset
data("smoke")

Estimate the Linear Probability Model (LPM)

# Define a binary smoking indicator: smoke = 1 if the person smokes at all
smoke$smoke <- ifelse(smoke$cigs > 0, 1, 0)
# Fit the linear probability model
lpm <- lm(smoke ~ log(cigpric) + log(income) + educ + age + I(age^2) + restaurn + white, data = smoke)

# Summary of the model
summary(lpm)
## 
## Call:
## lm(formula = smoke ~ log(cigpric) + log(income) + educ + age + 
##     I(age^2) + restaurn + white, data = smoke)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6715 -0.3958 -0.2724  0.5344  0.9340 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.6560624  0.8549619   0.767 0.443095    
## log(cigpric) -0.0689259  0.2041088  -0.338 0.735684    
## log(income)   0.0121620  0.0257244   0.473 0.636498    
## educ         -0.0288802  0.0059008  -4.894 1.19e-06 ***
## age           0.0198158  0.0056660   3.497 0.000496 ***
## I(age^2)     -0.0002598  0.0000617  -4.210 2.84e-05 ***
## restaurn     -0.1007616  0.0394431  -2.555 0.010815 *  
## white        -0.0256801  0.0515172  -0.498 0.618286    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4734 on 799 degrees of freedom
## Multiple R-squared:  0.062,  Adjusted R-squared:  0.05378 
## F-statistic: 7.544 on 7 and 799 DF,  p-value: 8.035e-09

Answer the questions

  1. Are there important differences between the usual and heteroskedasticity-robust standard errors?

    To calculate robust standard errors:

    # Install and load the sandwich package for robust SE
    if (!require(sandwich)) install.packages("sandwich")
    ## Loading required package: sandwich
    if (!require(lmtest)) install.packages("lmtest")
    ## Loading required package: lmtest
    ## Loading required package: zoo
    ## 
    ## Attaching package: 'zoo'
    ## The following objects are masked from 'package:base':
    ## 
    ##     as.Date, as.Date.numeric
    library(sandwich)
    library(lmtest)
    
    # Compute robust standard errors
    robust_se <- coeftest(lpm, vcov = vcovHC(lpm, type = "HC1"))
    robust_se
    ## 
    ## t test of coefficients:
    ## 
    ##                 Estimate  Std. Error t value  Pr(>|t|)    
    ## (Intercept)   6.5606e-01  8.6408e-01  0.7593 0.4479200    
    ## log(cigpric) -6.8926e-02  2.0803e-01 -0.3313 0.7404890    
    ## log(income)   1.2162e-02  2.5823e-02  0.4710 0.6377879    
    ## educ         -2.8880e-02  5.6421e-03 -5.1187 3.857e-07 ***
    ## age           1.9816e-02  5.4115e-03  3.6618 0.0002669 ***
    ## I(age^2)     -2.5979e-04  5.7133e-05 -4.5470 6.284e-06 ***
    ## restaurn     -1.0076e-01  3.7690e-02 -2.6734 0.0076615 ** 
    ## white        -2.5680e-02  5.0690e-02 -0.5066 0.6125706    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    The robust standard errors are very close to the usual ones (for most coefficients they are slightly smaller, e.g. educ: 0.0059 usual vs 0.0056 robust), so there are no practically important differences: every significance conclusion is unchanged.

  2. Effect of a four-year increase in education on smoking probability

    The coefficient of educ represents the change in the probability of smoking for each additional year of education. Multiply the coefficient by 4:

    # Coefficient of education
    educ_coef <- coef(lpm)["educ"]
    
    # Effect of 4 additional years of education
    effect_4_years <- 4 * educ_coef
    effect_4_years
    ##       educ 
    ## -0.1155209

    A four-year increase in education is estimated to reduce the probability of smoking by about 0.116, i.e., 11.6 percentage points.

  3. At what age does another year of age reduce the probability of smoking?

    # Coefficients for age and age^2
    age_coef <- coef(lpm)["age"]
    age_sq_coef <- coef(lpm)["I(age^2)"]
    
    # Turning point age
    turning_point <- -age_coef / (2 * age_sq_coef)
    turning_point
    ##      age 
    ## 38.13861

    This gives the age at which the estimated probability of smoking begins to decline: roughly age 38.

  4. Interpret the coefficient on restaurn

    The coefficient on restaurn indicates the average change in the probability of smoking for individuals living in states with restaurant smoking restrictions compared to those without. To extract it:

    # Coefficient for restaurn
    restaurn_coef <- coef(lpm)["restaurn"]
    restaurn_coef
    ##   restaurn 
    ## -0.1007616

    Holding other factors fixed, living in a state with restaurant smoking restrictions is associated with a 10.1 percentage point lower probability of smoking.

  5. Predicted probability of smoking for person #206

    For the given individual (cigpric=67.44, income=6500, educ=16, age=77, restaurn=0, white=0)

    # Create a data frame with the individual's characteristics
    individual <- data.frame(
      cigpric = 67.44,
      income = 6500,
      educ = 16,
      age = 77,
      restaurn = 0,
      white = 0
    )
    
    # Predicted probability
    predicted_probability <- predict(lpm, newdata = individual)
    predicted_probability
    ##            1 
    ## -0.003965468

    Here the prediction is slightly negative (−0.004), which is impossible for a true probability; predictions outside [0, 1] are a well-known limitation of the linear probability model.
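
    This individual need not be an isolated case; a quick sketch counts how often the fitted values leave the unit interval:

    # Share of LPM fitted values outside [0, 1] (sketch)
    p_hat <- fitted(lpm)
    mean(p_hat < 0 | p_hat > 1)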

Chapter 8-C4

Load necessary libraries

library(wooldridge)
library(lmtest)   # For heteroskedasticity tests
library(sandwich) # For robust covariance matrix estimation
# Load the dataset
data("vote1")

Answer the questions

  1. Estimate the model

    model <- lm(voteA ~ prtystrA + democA + log(expendA) + log(expendB), data = vote1)
    
    # Obtain the residuals and regress them on the independent variables
    u_hat <- residuals(model)   # named to avoid masking the residuals() function
    residual_model <- lm(u_hat ~ prtystrA + democA + log(expendA) + log(expendB), data = vote1)
    
    # Check R-squared
    r_squared <- summary(residual_model)$r.squared
    cat("R-squared of residual regression:", r_squared, "\n")
    ## R-squared of residual regression: 1.198131e-32
    # Explanation: The R-squared is 0 because residuals from an OLS regression are uncorrelated with the regressors by construction.

    This near-zero R^2 is the expected result, confirming the properties of OLS.
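
    An equivalent check (a sketch): the sample correlations between the residuals and each regressor are numerically zero.

    # Residuals are orthogonal to the regressors by construction
    X <- model.matrix(model)[, -1]   # drop the intercept column
    round(cor(X, u_hat), 12)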

  2. Breusch-Pagan test

    bp_test <- bptest(model)
    cat("Breusch-Pagan Test:\n")
    ## Breusch-Pagan Test:
    print(bp_test)
    ## 
    ##  studentized Breusch-Pagan test
    ## 
    ## data:  model
    ## BP = 9.0934, df = 4, p-value = 0.05881

    The Breusch-Pagan test checks for heteroskedasticity (whether the variance of the residuals is not constant across observations).

    • Null hypothesis (H0): residuals have constant variance (homoskedasticity).

    • Alternative hypothesis (H1): residual variance is not constant (heteroskedasticity).

    With a p-value of 0.05881, slightly above the common significance level of 0.05, we fail to reject the null hypothesis at the 5% level. The evidence of heteroskedasticity is therefore weak at that level.

    However, if we use a 10% significance level (α=0.10), the null hypothesis would be rejected, indicating possible heteroskedasticity.

    The evidence for heteroskedasticity is borderline, depending on the significance level chosen:

    • At 5%: No heteroskedasticity detected.

    • At 10%: Weak evidence of heteroskedasticity.
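
    The bptest() result can be reproduced by hand via the auxiliary regression of squared residuals on the regressors, where LM = n · R² (a sketch):

    # Manual Breusch-Pagan LM statistic (sketch)
    u2 <- u_hat^2
    aux <- lm(u2 ~ prtystrA + democA + log(expendA) + log(expendB), data = vote1)
    LM <- length(u2) * summary(aux)$r.squared
    pchisq(LM, df = 4, lower.tail = FALSE)   # should be close to bptest()'s p-value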

  3. White test for heteroskedasticity

    white_test <- bptest(model, ~ fitted(model) + I(fitted(model)^2), data = vote1)
    cat("White Test:\n")
    ## White Test:
    print(white_test)
    ## 
    ##  studentized Breusch-Pagan test
    ## 
    ## data:  model
    ## BP = 5.49, df = 2, p-value = 0.06425

    The null hypothesis of the White test is that the variance of the residuals is constant (i.e., no heteroskedasticity). If the p-value is below the significance level (commonly α = 0.05), we reject the null hypothesis, indicating evidence of heteroskedasticity.

    In this case:

    • The p-value = 0.06425, which is slightly greater than 0.05.

    • At the 5% significance level, we fail to reject the null hypothesis, meaning there is no strong evidence of heteroskedasticity.

    • However, at a less stringent significance level (e.g., 10%), the test might suggest mild evidence of heteroskedasticity.

    The evidence for heteroskedasticity is weak. At the 5% level, the residuals can be considered homoskedastic, but there may be slight heteroskedasticity at a 10% level.

Chapter 8-C13

Load necessary libraries and data

# Load necessary libraries
library(wooldridge)
library(lmtest)   # For hypothesis tests
library(sandwich) # For robust standard errors
library(car)

# Load the dataset
data("fertil2")

Answer the questions

  1. Estimate the model

    model1 <- lm(children ~ age + I(age^2) + educ + electric + urban, data = fertil2)
    
    # Usual standard errors
    summary(model1)
    ## 
    ## Call:
    ## lm(formula = children ~ age + I(age^2) + educ + electric + urban, 
    ##     data = fertil2)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -5.9012 -0.7136 -0.0039  0.7119  7.4318 
    ## 
    ## Coefficients:
    ##               Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) -4.2225162  0.2401888 -17.580  < 2e-16 ***
    ## age          0.3409255  0.0165082  20.652  < 2e-16 ***
    ## I(age^2)    -0.0027412  0.0002718 -10.086  < 2e-16 ***
    ## educ        -0.0752323  0.0062966 -11.948  < 2e-16 ***
    ## electric    -0.3100404  0.0690045  -4.493 7.20e-06 ***
    ## urban       -0.2000339  0.0465062  -4.301 1.74e-05 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 1.452 on 4352 degrees of freedom
    ##   (3 observations deleted due to missingness)
    ## Multiple R-squared:  0.5734, Adjusted R-squared:  0.5729 
    ## F-statistic:  1170 on 5 and 4352 DF,  p-value: < 2.2e-16
    # Heteroskedasticity-robust standard errors
    robust_se <- coeftest(model1, vcov = vcovHC(model1, type = "HC1"))
    cat("Heteroskedasticity-robust standard errors:\n")
    ## Heteroskedasticity-robust standard errors:
    print(robust_se)
    ## 
    ## t test of coefficients:
    ## 
    ##                Estimate  Std. Error  t value  Pr(>|t|)    
    ## (Intercept) -4.22251623  0.24385099 -17.3160 < 2.2e-16 ***
    ## age          0.34092552  0.01917466  17.7800 < 2.2e-16 ***
    ## I(age^2)    -0.00274121  0.00035051  -7.8206 6.549e-15 ***
    ## educ        -0.07523232  0.00630771 -11.9270 < 2.2e-16 ***
    ## electric    -0.31004041  0.06394815  -4.8483 1.289e-06 ***
    ## urban       -0.20003386  0.04547093  -4.3992 1.113e-05 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    # Compare usual vs robust standard errors
    cat("Are robust SEs always larger? Compare the standard errors above.\n")
    ## Are robust SEs always larger? Compare the standard errors above.

Coefficient    Non-Robust SE   Robust SE   Robust > Non-Robust?
(Intercept)    0.2402          0.2439      Yes
age            0.0165          0.0192      Yes
I(age^2)       0.00027         0.00035     Yes
educ           0.00630         0.00631     Yes
electric       0.0690          0.0640      No
urban          0.0465          0.0455      No

Robust standard errors are not always larger than the usual ones. For most coefficients here they are larger, but for electric and urban they are slightly smaller. This is expected: the heteroskedasticity-robust variance estimator does not assume constant error variance, so in any given sample it can be larger or smaller than the usual formula.
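
The side-by-side comparison in the table can be generated directly (a sketch):

# Usual vs heteroskedasticity-robust standard errors for model1 (sketch)
cbind(usual  = sqrt(diag(vcov(model1))),
      robust = sqrt(diag(vcovHC(model1, type = "HC1"))))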

  2. Add religious dummy variables and test joint significance

    model2 <- lm(children ~ age + I(age^2) + educ + electric + urban + spirit + protest + catholic, data = fertil2)
    # Non-robust F-test
    non_robust_f_test <- linearHypothesis(model2, c("protest = 0", "catholic = 0", "spirit = 0"))
    cat("Non-robust F-test p-value:\n")
    ## Non-robust F-test p-value:
    print(non_robust_f_test)
    ## 
    ## Linear hypothesis test:
    ## protest = 0
    ## catholic = 0
    ## spirit = 0
    ## 
    ## Model 1: restricted model
    ## Model 2: children ~ age + I(age^2) + educ + electric + urban + spirit + 
    ##     protest + catholic
    ## 
    ##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
    ## 1   4352 9176.4                              
    ## 2   4349 9162.5  3     13.88 2.1961 0.08641 .
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    # Robust F-test
    robust_f_test <- linearHypothesis(model2, c("protest = 0", "catholic = 0", "spirit = 0"), vcov = vcovHC(model2, type = "HC1"))
    cat("Robust F-test p-value:\n")
    ## Robust F-test p-value:
    print(robust_f_test)
    ## 
    ## Linear hypothesis test:
    ## protest = 0
    ## catholic = 0
    ## spirit = 0
    ## 
    ## Model 1: restricted model
    ## Model 2: children ~ age + I(age^2) + educ + electric + urban + spirit + 
    ##     protest + catholic
    ## 
    ## Note: Coefficient covariance matrix supplied.
    ## 
    ##   Res.Df Df      F Pr(>F)  
    ## 1   4352                   
    ## 2   4349  3 2.1559 0.0911 .
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    The null hypothesis for both tests is that the coefficients of the three religious dummy variables (protest, catholic, spirit) are jointly equal to zero.

    • At the 5% significance level (α=0.05), we fail to reject the null hypothesis because the p-values are greater than 0.05.

    • At the 10% significance level (α = 0.10), both p-values (0.0864 and 0.0911) fall just below 0.10, so the religious dummies are marginally jointly significant.

    While the religious dummies may not be statistically significant at the 5% level, they could be considered marginally significant at the 10% level. This means religion might have a weak joint effect on the number of children, but the evidence is not strong.

  3. Test heteroskedasticity with fitted values

    fitted_vals <- fitted(model2)
    u_hat2 <- residuals(model2)   # named to avoid masking the residuals() function
    residuals_sq <- u_hat2^2
    
    # Auxiliary regression (these vectors live in the workspace, so no data= argument is needed)
    aux_model <- lm(residuals_sq ~ fitted_vals + I(fitted_vals^2))
    summary(aux_model)
    ## 
    ## Call:
    ## lm(formula = residuals_sq ~ fitted_vals + I(fitted_vals^2))
    ## 
    ## Residuals:
    ##    Min     1Q Median     3Q    Max 
    ## -8.873 -1.290 -0.302  0.357 47.684 
    ## 
    ## Coefficients:
    ##                  Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)        0.3126     0.1114   2.807  0.00503 ** 
    ## fitted_vals       -0.1489     0.1018  -1.462  0.14381    
    ## I(fitted_vals^2)   0.2668     0.0196  13.607  < 2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 3.632 on 4355 degrees of freedom
    ## Multiple R-squared:  0.2501, Adjusted R-squared:  0.2497 
    ## F-statistic: 726.1 on 2 and 4355 DF,  p-value: < 2.2e-16
    # Joint significance of fitted values
    hetero_test <- linearHypothesis(aux_model, c("fitted_vals = 0", "I(fitted_vals^2) = 0"))
    cat("Joint significance of regressors (heteroskedasticity test):\n")
    ## Joint significance of regressors (heteroskedasticity test):
    print(hetero_test)
    ## 
    ## Linear hypothesis test:
    ## fitted_vals = 0
    ## I(fitted_vals^2) = 0
    ## 
    ## Model 1: restricted model
    ## Model 2: residuals_sq ~ fitted_vals + I(fitted_vals^2)
    ## 
    ##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
    ## 1   4357 76589                                  
    ## 2   4355 57436  2     19153 726.11 < 2.2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    There is strong evidence of heteroskedasticity in the regression equation for children. The significance of the fitted values and their squared terms in explaining the squared residuals confirms the presence of heteroskedasticity.

  4. Practical importance of heteroskedasticity

    # The p-value sits in the second row of the linearHypothesis table
    hetero_p_value <- hetero_test$`Pr(>F)`[2]
    
    if (hetero_p_value < 0.05) {
      cat("Heteroskedasticity is statistically significant.\n")
    } else {
      cat("Heteroskedasticity is not statistically significant.\n")
    }
    ## Heteroskedasticity is statistically significant.
    print(hetero_test)
    ## 
    ## Linear hypothesis test:
    ## fitted_vals = 0
    ## I(fitted_vals^2) = 0
    ## 
    ## Model 1: restricted model
    ## Model 2: residuals_sq ~ fitted_vals + I(fitted_vals^2)
    ## 
    ##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
    ## 1   4357 76589                                  
    ## 2   4355 57436  2     19153 726.11 < 2.2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Statistically, then, heteroskedasticity is clearly present. Its practical importance is more modest: as the comparison of usual and robust standard errors in part 1 shows, the two sets of standard errors are close and no significance conclusions change, so heteroskedasticity matters little for inference here.

Chapter 9-C13

Load necessary libraries and data

# Load necessary libraries
library(wooldridge)

# Load the data
data("infmrt")
infmrt_1990 <- subset(infmrt, year == 1990)

  1. Model with DC dummy

    model_dc <- lm(log(infmort) ~ lpcinc + lphysic + lpopul + log(afdcprt) + DC, 
                   data = infmrt_1990)
    
    print("Model with DC dummy:")
    ## [1] "Model with DC dummy:"
    print(summary(model_dc))
    ## 
    ## Call:
    ## lm(formula = log(infmort) ~ lpcinc + lphysic + lpopul + log(afdcprt) + 
    ##     DC, data = infmrt_1990)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -0.36232 -0.08058  0.00000  0.10126  0.24078 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)   3.69679    1.36010   2.718  0.00929 ** 
    ## lpcinc        0.03388    0.18825   0.180  0.85799    
    ## lphysic      -0.38282    0.13766  -2.781  0.00789 ** 
    ## lpopul       -0.03649    0.08127  -0.449  0.65558    
    ## log(afdcprt)  0.09766    0.07027   1.390  0.17140    
    ## DC            1.29440    0.19709   6.567 4.48e-08 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.1365 on 45 degrees of freedom
    ## Multiple R-squared:  0.5619, Adjusted R-squared:  0.5132 
    ## F-statistic: 11.54 on 5 and 45 DF,  p-value: 3.324e-07

  2. Model without DC dummy

    model_original <- lm(log(infmort) ~ lpcinc + lphysic + lpopul + log(afdcprt), 
                         data = infmrt_1990)
    
    print("Model without DC dummy:")
    ## [1] "Model without DC dummy:"
    print(summary(model_original))
    ## 
    ## Call:
    ## lm(formula = log(infmort) ~ lpcinc + lphysic + lpopul + log(afdcprt), 
    ##     data = infmrt_1990)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -0.42525 -0.09194  0.00366  0.11385  0.62067 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)  
    ## (Intercept)    4.4420     1.8760   2.368   0.0222 *
    ## lpcinc        -0.2194     0.2550  -0.860   0.3941  
    ## lphysic        0.1015     0.1609   0.631   0.5313  
    ## lpopul        -0.1874     0.1079  -1.737   0.0891 .
    ## log(afdcprt)   0.1827     0.0956   1.911   0.0623 .
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.1889 on 46 degrees of freedom
    ## Multiple R-squared:  0.1419, Adjusted R-squared:  0.06734 
    ## F-statistic: 1.902 on 4 and 46 DF,  p-value: 0.1261

  3. Compare Coefficients

    coef_comparison <- data.frame(
      Variable = names(coef(model_original)),
      Original = coef(model_original),
      With_DC = coef(model_dc)[1:length(coef(model_original))],
      Difference = coef(model_dc)[1:length(coef(model_original))] - coef(model_original)
    )
    
    print("Coefficient Comparison:")
    ## [1] "Coefficient Comparison:"
    print(coef_comparison)
    ##                  Variable   Original     With_DC  Difference
    ## (Intercept)   (Intercept)  4.4419560  3.69679148 -0.74516453
    ## lpcinc             lpcinc -0.2194078  0.03387905  0.25328686
    ## lphysic           lphysic  0.1014942 -0.38281822 -0.48431244
    ## lpopul             lpopul -0.1874178 -0.03649137  0.15092645
    ## log(afdcprt) log(afdcprt)  0.1826712  0.09766312 -0.08500808

  4. Compare R-squared Values

    r2_comparison <- data.frame(
      Model = c("Original", "With DC"),
      R_squared = c(summary(model_original)$r.squared, summary(model_dc)$r.squared)
    )
    
    print("R-squared Comparison:")
    ## [1] "R-squared Comparison:"
    print(r2_comparison)
    ##      Model R_squared
    ## 1 Original 0.1419499
    ## 2  With DC 0.5618755

    Including the DC dummy changes the results dramatically: the other coefficients shift substantially (lphysic even flips sign), and R^2 jumps from 0.14 to 0.56. The District of Columbia is an extreme outlier for infant mortality, and the dummy variable effectively removes its influence from the estimation.

Chapter 9-1

Evidence of Functional Form Misspecification

When ceoten^2 and comten^2 are added to the model, R^2 increases from 0.353 to 0.375. This increase indicates that the quadratic terms explain additional variation in the dependent variable log(salary).

To determine whether this suggests functional form misspecification:

  • If the omitted quadratic terms (nonlinear relationships) are significant, their exclusion from the original model would lead to biased or inconsistent estimates.

  • A joint F-test could formally assess whether the added terms significantly improve the model’s fit; a back-of-the-envelope version based on the reported R-squared values is sketched below.

In this case, the increase in R^2 suggests that the original model may indeed have been misspecified, as it did not account for possible nonlinear relationships involving ceoten and comten.
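
A hedged sketch of that F-test, computed from the reported R-squared values. The sample size and parameter count are assumptions taken from the CEOSAL2 example in the textbook (n = 177, with 7 slope parameters in the expanded model), not stated above:

# R-squared form of the F statistic for q = 2 exclusion restrictions
r2_r <- 0.353; r2_ur <- 0.375      # restricted and unrestricted R-squared
n <- 177; k <- 7; q <- 2           # assumed sample size and slope count
F_stat <- ((r2_ur - r2_r) / q) / ((1 - r2_ur) / (n - k - 1))
c(F = F_stat, p = pf(F_stat, q, n - k - 1, lower.tail = FALSE))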

Chapter 9-5

Exogenous Sample Selection

The issue relates to whether the decision of some colleges not to report campus crimes is correlated with the dependent variable (number of campus crimes) or with other covariates in the model. If the decision to report is independent of crime rates or other factors in the model, this can be considered exogenous sample selection. However:

  1. If colleges with higher crime rates are systematically less likely to report, this creates endogeneity and biased estimates.

  2. If non-reporting is purely random, then the sample selection can be treated as exogenous.

In this case, the likely non-random reporting behavior (e.g., colleges with higher crime rates might deliberately underreport) suggests endogenous sample selection, which would lead to biased estimation and potential misrepresentation of the relationship between student enrollment and campus crimes.

Chapter 9-C5

# Load necessary libraries
library(wooldridge)  # To access Wooldridge's datasets
library(quantreg)    # For LAD (quantile regression)
## Loading required package: SparseM
# Load the data
data("rdchem")  # Load the RDCHEM dataset

# Transform sales to billions of dollars
rdchem$sales_billion <- rdchem$sales / 1000

Answer the questions

  1. OLS estimation with and without the largest firm

    # Full sample
    ols_full <- lm(rdintens ~ sales_billion + I(sales_billion^2) + profmarg, data = rdchem)
    summary(ols_full)
    ## 
    ## Call:
    ## lm(formula = rdintens ~ sales_billion + I(sales_billion^2) + 
    ##     profmarg, data = rdchem)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -2.0371 -1.1238 -0.4547  0.7165  5.8522 
    ## 
    ## Coefficients:
    ##                     Estimate Std. Error t value Pr(>|t|)   
    ## (Intercept)         2.058967   0.626269   3.288  0.00272 **
    ## sales_billion       0.316611   0.138854   2.280  0.03041 * 
    ## I(sales_billion^2) -0.007390   0.003716  -1.989  0.05657 . 
    ## profmarg            0.053322   0.044210   1.206  0.23787   
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 1.774 on 28 degrees of freedom
    ## Multiple R-squared:  0.1905, Adjusted R-squared:  0.1037 
    ## F-statistic: 2.196 on 3 and 28 DF,  p-value: 0.1107
    # Exclude the firm with sales > $40 billion
    rdchem_no_outlier <- subset(rdchem, sales_billion < 40)
    ols_no_outlier <- lm(rdintens ~ sales_billion + I(sales_billion^2) + profmarg, data = rdchem_no_outlier)
    summary(ols_no_outlier)
    ## 
    ## Call:
    ## lm(formula = rdintens ~ sales_billion + I(sales_billion^2) + 
    ##     profmarg, data = rdchem_no_outlier)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -2.0371 -1.1238 -0.4547  0.7165  5.8522 
    ## 
    ## Coefficients:
    ##                     Estimate Std. Error t value Pr(>|t|)   
    ## (Intercept)         2.058967   0.626269   3.288  0.00272 **
    ## sales_billion       0.316611   0.138854   2.280  0.03041 * 
    ## I(sales_billion^2) -0.007390   0.003716  -1.989  0.05657 . 
    ## profmarg            0.053322   0.044210   1.206  0.23787   
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 1.774 on 28 degrees of freedom
    ## Multiple R-squared:  0.1905, Adjusted R-squared:  0.1037 
    ## F-statistic: 2.196 on 3 and 28 DF,  p-value: 0.1107

    The OLS estimates are identical in the two runs, but this is a red flag rather than evidence of robustness: the largest firm’s sales are just under $40 billion, so the filter sales_billion < 40 never actually drops it and both regressions use the same sample. The low R^2 also suggests the model explains only a small portion of the variation in rdintens.
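
    A safer exclusion drops the maximum-sales observation directly (a sketch; rerunning the OLS and LAD fits on this subsample would give the comparison the exercise intends):

    # Drop the largest firm by excluding the maximum-sales observation
    rdchem_no_outlier <- subset(rdchem, sales < max(sales))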

  2. LAD estimation with and without the largest firm

    # Full sample
    lad_full <- rq(rdintens ~ sales_billion + I(sales_billion^2) + profmarg, data = rdchem, tau = 0.5)
    summary(lad_full)
    ## 
    ## Call: rq(formula = rdintens ~ sales_billion + I(sales_billion^2) + 
    ##     profmarg, tau = 0.5, data = rdchem)
    ## 
    ## tau: [1] 0.5
    ## 
    ## Coefficients:
    ##                    coefficients lower bd upper bd
    ## (Intercept)         1.40428      0.87031  2.66628
    ## sales_billion       0.26346     -0.13508  0.75753
    ## I(sales_billion^2) -0.00600     -0.01679  0.00344
    ## profmarg            0.11400      0.01376  0.16427
    # Exclude the firm with sales > $40 billion
    lad_no_outlier <- rq(rdintens ~ sales_billion + I(sales_billion^2) + profmarg, data = rdchem_no_outlier, tau = 0.5)
    summary(lad_no_outlier)
    ## 
    ## Call: rq(formula = rdintens ~ sales_billion + I(sales_billion^2) + 
    ##     profmarg, tau = 0.5, data = rdchem_no_outlier)
    ## 
    ## tau: [1] 0.5
    ## 
    ## Coefficients:
    ##                    coefficients lower bd upper bd
    ## (Intercept)         1.40428      0.87031  2.66628
    ## sales_billion       0.26346     -0.13508  0.75753
    ## I(sales_billion^2) -0.00600     -0.01679  0.00344
    ## profmarg            0.11400      0.01376  0.16427

    The coefficients for LAD estimation with and without the outlier are identical:

    Coefficient          With Outlier   Without Outlier
    Intercept            1.40428        1.40428
    sales_billion        0.26346        0.26346
    I(sales_billion^2)   -0.00600       -0.00600
    profmarg             0.11400        0.11400

    Analysis:

    • LAD (Least Absolute Deviations) minimizes the sum of absolute residuals, so extreme observations receive far less influence than under OLS, making it robust to outliers.

    • Here, however, the identical coefficients mainly reflect the filtering issue noted in part 1: the sales_billion < 40 subset never removed the largest firm, so both fits use the same sample.

  3. Compare OLS and LAD results

    # Check the coefficients and their sensitivity to outliers
    cat("OLS with outlier:\n")
    ## OLS with outlier:
    print(coef(ols_full))
    ##        (Intercept)      sales_billion I(sales_billion^2)           profmarg 
    ##        2.058966676        0.316610977       -0.007389624        0.053322172
    cat("OLS without outlier:\n")
    ## OLS without outlier:
    print(coef(ols_no_outlier))
    ##        (Intercept)      sales_billion I(sales_billion^2)           profmarg 
    ##        2.058966676        0.316610977       -0.007389624        0.053322172
    cat("LAD with outlier:\n")
    ## LAD with outlier:
    print(coef(lad_full))
    ##        (Intercept)      sales_billion I(sales_billion^2)           profmarg 
    ##        1.404279415        0.263463789       -0.006001127        0.114003661
    cat("LAD without outlier:\n")
    ## LAD without outlier:
    print(coef(lad_no_outlier))
    ##        (Intercept)      sales_billion I(sales_billion^2)           profmarg 
    ##        1.404279415        0.263463789       -0.006001127        0.114003661
    # Discussion:
    # LAD (Least Absolute Deviations) is typically less sensitive to outliers compared to OLS (Ordinary Least Squares).
    # Examine the changes in coefficients for both methods with and without the outlier to draw conclusions.

    OLS Results

    The coefficients for OLS with and without the outlier are identical:

    Coefficient          With Outlier   Without Outlier
    Intercept            2.05897        2.05897
    sales_billion        0.31661        0.31661
    I(sales_billion^2)   -0.00739       -0.00739
    profmarg             0.05332        0.05332

    LAD Results

    The coefficients for LAD with and without the outlier are also identical:

    Coefficient          With Outlier   Without Outlier
    Intercept            1.40428        1.40428
    sales_billion        0.26346        0.26346
    I(sales_billion^2)   -0.00600       -0.00600
    profmarg             0.11400        0.11400

    Key Observations

    1. Both the OLS and LAD coefficients are literally identical with and without the “outlier”, but this is an artifact: the sales_billion < 40 filter never removed the largest firm, so the two samples coincide.

    2. LAD is nevertheless theoretically robust to outliers, while OLS is generally sensitive to extreme values; with the largest firm genuinely excluded, the OLS estimates would be expected to move considerably more than the LAD estimates.

    Based on these statistical properties, LAD remains the preferred method in situations where influential outliers are expected.

Chapter 10-1

Decide if you agree or disagree with the statements

  1. Like cross-sectional observations, we can assume that most time series observations are independently distributed.
    • Disagree: In time series data, observations are often correlated over time (autocorrelation). For example, the value of a variable at time t is often influenced by its value at t−1, which violates the independence assumption. Cross-sectional analysis, in contrast, typically assumes independence across observations. (See the sketch after this list.)
  2. The OLS estimator in a time series regression is unbiased under the first three Gauss-Markov assumptions.
    • Agree: If the model is linear in parameters, there is no perfect collinearity, and the error has zero conditional mean given the explanatory variables in all time periods (strict exogeneity), then OLS is unbiased. Strict exogeneity is a stronger requirement than the corresponding cross-sectional assumption.
  3. A trending variable cannot be used as the dependent variable in multiple regression analysis.
    • Disagree: A trending variable can be used as the dependent variable, but care must be taken to account for the trend (e.g., by detrending the data or including a time trend variable in the regression). Failing to address the trend can lead to spurious regression results.
  4. Seasonality is not an issue when using annual time series observations.
    • Agree: Annual time series data typically do not exhibit seasonality because seasonal patterns (e.g., quarterly or monthly fluctuations) are averaged out over the year. However, trends and other patterns may still need to be addressed.
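
A quick illustration of point 1 (a sketch): macroeconomic series are typically strongly autocorrelated. Using the 3-month T-bill rate from the intdef dataset (loaded here for self-containment):

# First-order sample autocorrelation of the T-bill rate
library(wooldridge)
data("intdef")
acf(intdef$i3, plot = FALSE)$acf[2]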

Chapter 10-5

# Create the 'time' variable based on 't' (if not already created)
hseinv$time <- hseinv$t

# Create quarterly dummy variables based on the 'time' variable
# (note: hseinv is annual data, so these 'quarters' are artificial groups
# of years; including all four dummies plus an intercept also triggers the
# dummy-variable trap, which is why Q4 is dropped in the output below)
hseinv$Q1 <- ifelse((hseinv$time %% 4) == 1, 1, 0)
hseinv$Q2 <- ifelse((hseinv$time %% 4) == 2, 1, 0)
hseinv$Q3 <- ifelse((hseinv$time %% 4) == 3, 1, 0)
hseinv$Q4 <- ifelse((hseinv$time %% 4) == 0, 1, 0)

# Model specification
model_housing <- lm(inv ~ price + linv + time + Q1 + Q2 + Q3 + Q4, data = hseinv)

# Summary of the model
summary(model_housing)
## 
## Call:
## lm(formula = inv ~ price + linv + time + Q1 + Q2 + Q3 + Q4, data = hseinv)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5199  -3055  -1446   1676  14248 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1193475.8    66292.1 -18.003   <2e-16 ***
## price          24945.3    24204.9   1.031    0.310    
## linv          111177.9     5407.4  20.560   <2e-16 ***
## time            -251.3      174.5  -1.440    0.159    
## Q1               672.1     2092.5   0.321    0.750    
## Q2               599.0     2101.1   0.285    0.777    
## Q3               504.1     2142.7   0.235    0.815    
## Q4                  NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4779 on 35 degrees of freedom
## Multiple R-squared:  0.9801, Adjusted R-squared:  0.9767 
## F-statistic: 287.4 on 6 and 35 DF,  p-value: < 2.2e-16
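
The "(1 not defined because of singularities)" note appears because Q1 + Q2 + Q3 + Q4 = 1 in every row, which duplicates the intercept (the dummy-variable trap), so R drops Q4 automatically. Keeping only three of the four dummies gives an equivalent, trap-free specification with Q4 as the baseline category; a minimal sketch:

# Equivalent specification without the dummy-variable trap: Q4 is the
# omitted baseline, so the Q1-Q3 coefficients are contrasts against it
model_housing2 <- lm(inv ~ price + linv + time + Q1 + Q2 + Q3, data = hseinv)
summary(model_housing2)

Note also that linv is log(inv), so the right-hand side includes a deterministic function of the dependent variable, which mechanically inflates the R-squared; a more standard specification would put log investment (or investment per capita) on the left-hand side only.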

Chapter 10-C1

# Load the intdef dataset
data("intdef")

# Define a dummy variable for years after 1979
intdef$dummy <- ifelse(intdef$year > 1979, 1, 0)

# Model specification: Interest rate as a function of the dummy variable
model_policy_shift <- lm(i3 ~ dummy, data = intdef)

# Summary of the model
summary(model_policy_shift)
## 
## Call:
## lm(formula = i3 ~ dummy, data = intdef)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1979 -1.9449 -0.4569  1.3616  7.8121 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.9259     0.4692   8.367 2.53e-11 ***
## dummy         2.2920     0.7167   3.198  0.00232 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.654 on 54 degrees of freedom
## Multiple R-squared:  0.1592, Adjusted R-squared:  0.1437 
## F-statistic: 10.23 on 1 and 54 DF,  p-value: 0.002317
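
Because the regression contains only an intercept and the post-1979 dummy, the fitted values are just the two sub-period means of i3: about 3.93 before 1980 and 3.93 + 2.29 ≈ 6.22 afterwards, and the dummy's p-value (0.0023) indicates a statistically significant upward shift in short-term interest rates after the 1979 policy change. A one-line check of that equivalence:

# Sub-period means of the 3-month T-bill rate; these reproduce the
# intercept and intercept + dummy estimates from the regression
tapply(intdef$i3, intdef$dummy, mean)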

Chapter 10-C9

# Load the volat dataset
data("volat")

# Dependent variable: rsp500 (S&P 500 monthly returns)
# Explanatory variables: pcip (percentage change in industrial production) and i3 (return on 3-month T-bills)

# Model specification
model_rsp500 <- lm(rsp500 ~ pcip + i3, data = volat)

# Summary of the model
summary(model_rsp500)
## 
## Call:
## lm(formula = rsp500 ~ pcip + i3, data = volat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -157.871  -22.580    2.103   25.524  138.137 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.84306    3.27488   5.754 1.44e-08 ***
## pcip         0.03642    0.12940   0.281   0.7785    
## i3          -1.36169    0.54072  -2.518   0.0121 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.13 on 554 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.01189,    Adjusted R-squared:  0.008325 
## F-statistic: 3.334 on 2 and 554 DF,  p-value: 0.03637

Statistical significance: check the p-values of pcip and i3.

pcip is not statistically significant, even at the 10% level, because its p-value (0.7785) is far greater than 0.05.

i3 is statistically significant at the 5% level because its p-value (0.0121) is less than 0.05, though it is not significant at the 1% level.
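
A minimal check, reading both p-values straight from the fitted model:

# p-values for pcip and i3 from the coefficient table
summary(model_rsp500)$coefficients[c("pcip", "i3"), "Pr(>|t|)"]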

Chapter 11-C7

# Load required packages
library(wooldridge)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Load the 'consump' dataset
data("consump", package = "wooldridge")

# (i) Test the PIH hypothesis: gc = β0 + β1 * gc_1 + ut
model1 <- lm(gc ~ gc_1, data = consump, na.action = na.omit)
cat("C7 - Model 1 Summary:\n")
## C7 - Model 1 Summary:
summary(model1)
## 
## Call:
## lm(formula = gc ~ gc_1, data = consump, na.action = na.omit)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.027878 -0.005974 -0.001450  0.007142  0.020227 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 0.011431   0.003778   3.026  0.00478 **
## gc_1        0.446133   0.156047   2.859  0.00731 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01161 on 33 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.1985, Adjusted R-squared:  0.1742 
## F-statistic: 8.174 on 1 and 33 DF,  p-value: 0.007311
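Under the PIH, consumption growth should be unpredictable from its own past, so β1 ought to be zero. Here the estimate on gc_1 is 0.446 with p = 0.0073, rejecting the PIH at the 1% level. Pulling out the relevant row, in the style used earlier:

# t-test of H0: beta1 = 0 for lagged consumption growth
coef(summary(model1))["gc_1", ]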
# (ii) Add gy_1, i3 and inf to the equation:
# gc = β0 + β1 * gc_1 + β2 * gy_1 + β3 * i3 + β4 * inf + ut
model2 <- lm(gc ~ gc_1 + gy_1 + i3 + inf, data = consump, na.action = na.omit)
cat("\nC7 - Model 2 Summary:\n")
## 
## C7 - Model 2 Summary:
summary(model2)
## 
## Call:
## lm(formula = gc ~ gc_1 + gy_1 + i3 + inf, data = consump, na.action = na.omit)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0262462 -0.0076662  0.0009066  0.0063953  0.0174975 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  2.007e-02  5.839e-03   3.438  0.00174 **
## gc_1         5.398e-01  2.604e-01   2.073  0.04689 * 
## gy_1        -1.017e-01  1.792e-01  -0.567  0.57463   
## i3          -2.523e-05  1.028e-03  -0.025  0.98059   
## inf         -1.701e-03  8.825e-04  -1.928  0.06341 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01069 on 30 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.3819, Adjusted R-squared:  0.2995 
## F-statistic: 4.635 on 4 and 30 DF,  p-value: 0.004938
# Perform joint significance test using ANOVA
cat("\nC7 - ANOVA to compare Model 1 and Model 2:\n")
## 
## C7 - ANOVA to compare Model 1 and Model 2:
anova(model1, model2)
## Analysis of Variance Table
## 
## Model 1: gc ~ gc_1
## Model 2: gc ~ gc_1 + gy_1 + i3 + inf
##   Res.Df       RSS Df Sum of Sq      F  Pr(>F)  
## 1     33 0.0044447                              
## 2     30 0.0034276  3 0.0010171 2.9675 0.04767 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
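The comparison F-statistic (2.97, p ≈ 0.048) shows that gy_1, i3 and inf are jointly significant at the 5% level, so consumption growth is predictable and the pure PIH is again rejected, if narrowly. The same joint test can be written explicitly with car::linearHypothesis (the car package is already attached, as the dplyr masking message shows); a minimal sketch:

# Joint test that the three added regressors have zero coefficients;
# reproduces the F and p-value of the anova() comparison above
car::linearHypothesis(model2, c("gy_1 = 0", "i3 = 0", "inf = 0"))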
# (iii) Check the p-value for gc_1 in the new model
cat("\nC7 - P-Value for gc_1 in Model 2:\n")
## 
## C7 - P-Value for gc_1 in Model 2:
summary(model2)$coefficients
##                  Estimate   Std. Error    t value    Pr(>|t|)
## (Intercept)  0.0200712603 0.0058386779  3.4376379 0.001741936
## gc_1         0.5397639068 0.2604192372  2.0726729 0.046890481
## gy_1        -0.1016586120 0.1791503385 -0.5674486 0.574631669
## i3          -0.0000252261 0.0010282141 -0.0245339 0.980589214
## inf         -0.0017011440 0.0008824782 -1.9276895 0.063408242
# (iv) Sequential (Type I) ANOVA for Model 2; see the note after the
# output for the overall joint F-statistic
cat("\nC7 - F-statistic and p-value for joint significance in Model 2:\n")
## 
## C7 - F-statistic and p-value for joint significance in Model 2:
anova(model2)
## Analysis of Variance Table
## 
## Response: gc
##           Df    Sum Sq    Mean Sq F value   Pr(>F)   
## gc_1       1 0.0011009 0.00110089  9.6356 0.004141 **
## gy_1       1 0.0000647 0.00006471  0.5664 0.457558   
## i3         1 0.0005279 0.00052785  4.6201 0.039787 * 
## inf        1 0.0004246 0.00042456  3.7160 0.063408 . 
## Residuals 30 0.0034276 0.00011425                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
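
Note that anova(model2) reports sequential (Type I) tests, adding one regressor at a time, rather than the joint significance of all regressors. The overall F-statistic for the hypothesis that all four slopes are zero is the one printed by summary(model2), F = 4.635 on 4 and 30 df with p ≈ 0.0049; it can also be recovered directly:

# Overall F-test that all slope coefficients are zero; matches the
# F-statistic line in summary(model2)
fstat <- summary(model2)$fstatistic
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)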

Chapter 11-C12

# Load required packages
library(wooldridge)
library(dplyr)

# Load the 'minwage' dataset
data("minwage", package = "wooldridge")

# Handle missing values
# Caveat: creating lags after na.omit() can misalign months if interior
# rows were dropped; it is worth verifying that the missing values sit
# only at the start of the sample before relying on these lags
minwage_clean <- na.omit(minwage)  # Remove rows with missing values

# Create lagged variables (gwage232_1 and gemp232_1)
minwage_clean <- minwage_clean %>%
  mutate(
    gwage232_1 = lag(gwage232),  # Lagged wage growth
    gemp232_1 = lag(gemp232)     # Lagged employment growth
  )

# (i) Check for autocorrelation in gwage232
acf(minwage_clean$gwage232, main = "Autocorrelation of gwage232", na.action = na.pass)

# (ii) Estimate the dynamic model
# gwage232 = β0 + β1 * gwage232_1 + β2 * gmwage + β3 * gcpi + ut
model_dynamic <- lm(gwage232 ~ gwage232_1 + gmwage + gcpi, data = minwage_clean, na.action = na.omit)
cat("\nC12 - Dynamic Model Summary:\n")
## 
## C12 - Dynamic Model Summary:
summary(model_dynamic)
## 
## Call:
## lm(formula = gwage232 ~ gwage232_1 + gmwage + gcpi, data = minwage_clean, 
##     na.action = na.omit)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.044651 -0.004120 -0.001272  0.004487  0.041568 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.002371   0.000433   5.475 6.45e-08 ***
## gwage232_1  -0.068681   0.034467  -1.993  0.04676 *  
## gmwage       0.151749   0.009519  15.941  < 2e-16 ***
## gcpi         0.257423   0.086571   2.974  0.00306 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.007781 on 594 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3067, Adjusted R-squared:  0.3032 
## F-statistic: 87.58 on 3 and 594 DF,  p-value: < 2.2e-16
# (iii) Add lagged employment growth (gemp232_1) to the equation
# gwage232 = β0 + β1 * gwage232_1 + β2 * gmwage + β3 * gcpi + β4 * gemp232_1 + ut
model_dynamic_lagged <- lm(gwage232 ~ gwage232_1 + gmwage + gcpi + gemp232_1, data = minwage_clean, na.action = na.omit)
cat("\nC12 - Dynamic Model with Lagged Employment Summary:\n")
## 
## C12 - Dynamic Model with Lagged Employment Summary:
summary(model_dynamic_lagged)
## 
## Call:
## lm(formula = gwage232 ~ gwage232_1 + gmwage + gcpi + gemp232_1, 
##     data = minwage_clean, na.action = na.omit)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.043902 -0.004319 -0.000965  0.004265  0.042429 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0023987  0.0004289   5.593 3.41e-08 ***
## gwage232_1  -0.0657751  0.0341411  -1.927 0.054512 .  
## gmwage       0.1525456  0.0094293  16.178  < 2e-16 ***
## gcpi         0.2531434  0.0857361   2.953 0.003276 ** 
## gemp232_1    0.0606449  0.0169864   3.570 0.000386 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.007705 on 593 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3213, Adjusted R-squared:  0.3167 
## F-statistic: 70.17 on 4 and 593 DF,  p-value: < 2.2e-16
# (iv) Compare models to check the effect on gmwage coefficient
cat("\nC12 - Comparison of gmwage Coefficient:\n")
## 
## C12 - Comparison of gmwage Coefficient:
cat("Without lagged employment:\n")
## Without lagged employment:
summary(model_dynamic)$coefficients["gmwage", ]
##     Estimate   Std. Error      t value     Pr(>|t|) 
## 1.517485e-01 9.519372e-03 1.594102e+01 6.890581e-48
cat("With lagged employment:\n")
## With lagged employment:
summary(model_dynamic_lagged)$coefficients["gmwage", ]
##     Estimate   Std. Error      t value     Pr(>|t|) 
## 1.525456e-01 9.429267e-03 1.617789e+01 4.953489e-49
# (v) Run regression of gmwage on gwage232_1 and gemp232_1, and report R-squared
model_gmwage <- lm(gmwage ~ gwage232_1 + gemp232_1, data = minwage_clean, na.action = na.omit)
cat("\nC12 - Regression of gmwage Summary:\n")
## 
## C12 - Regression of gmwage Summary:
summary(model_gmwage)
## 
## Call:
## lm(formula = gmwage ~ gwage232_1 + gemp232_1, data = minwage_clean, 
##     na.action = na.omit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.01984 -0.00511 -0.00385 -0.00290  0.62190 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.003491   0.001469   2.376   0.0178 *
## gwage232_1   0.211609   0.147152   1.438   0.1510  
## gemp232_1   -0.042848   0.073827  -0.580   0.5619  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0335 on 595 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.004094,   Adjusted R-squared:  0.0007463 
## F-statistic: 1.223 on 2 and 595 DF,  p-value: 0.2951
# Report R-squared
cat("\nC12 - R-squared for Regression of gmwage:\n")
## 
## C12 - R-squared for Regression of gmwage:
summary(model_gmwage)$r.squared
## [1] 0.00409388
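
The tiny R-squared in part (v) explains the stability of the gmwage coefficient between parts (ii) and (iii): gmwage is essentially uncorrelated with the other regressors (R-squared ≈ 0.004), so adding gemp232_1 barely moves its estimate (0.15175 versus 0.15255) or its standard error. A VIF-style calculation makes this concrete (an exact variance inflation factor would also include gcpi in the auxiliary regression):

# Implied variance inflation for gmwage: 1 / (1 - R^2) is about 1.004,
# i.e., essentially no collinearity penalty on its standard error
1 / (1 - summary(model_gmwage)$r.squared)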

Chapter 12-C11

# Load required packages
library(wooldridge)
library(tseries)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(dplyr)

# Load the 'nyse' dataset
data("nyse", package = "wooldridge")

# Remove rows with missing data
nyse_clean <- nyse %>% na.omit()

# Create the lagged variable returnt_1 and remove the first row with NA
# (nyse already contains a return_1 column holding the same lag; it is
# recreated here explicitly for clarity)
nyse_clean <- nyse_clean %>%
  mutate(returnt_1 = lag(return, n = 1)) %>%
  na.omit()  # Remove the row with NA due to the lag

# Fit the OLS model
model_ols <- lm(return ~ returnt_1, data = nyse_clean)

# Obtain the squared residuals (ut^2)
nyse_clean <- nyse_clean %>%
  mutate(ut_sq = resid(model_ols)^2)

# Summary of the OLS model
cat("\nC11 (i) - OLS Model Summary:\n")
## 
## C11 (i) - OLS Model Summary:
summary(model_ols)
## 
## Call:
## lm(formula = return ~ returnt_1, data = nyse_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2633  -1.2997   0.1011   1.3203   8.0688 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.17862    0.08084   2.210   0.0275 *
## returnt_1    0.05806    0.03811   1.523   0.1281  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.112 on 686 degrees of freedom
## Multiple R-squared:  0.003372,   Adjusted R-squared:  0.001919 
## F-statistic: 2.321 on 1 and 686 DF,  p-value: 0.1281
# Compute average, minimum, and maximum of ut^2
cat("\nAverage of squared residuals:\n", mean(nyse_clean$ut_sq, na.rm = TRUE))
## 
## Average of squared residuals:
##  4.446345
cat("\nMinimum of squared residuals:\n", min(nyse_clean$ut_sq, na.rm = TRUE))
## 
## Minimum of squared residuals:
##  3.591354e-06
cat("\nMaximum of squared residuals:\n", max(nyse_clean$ut_sq, na.rm = TRUE))
## 
## Maximum of squared residuals:
##  232.9687
# (ii) Fit the model of heteroskedasticity:
# Var(ut | returnt-1) = δ0 + δ1 * returnt_1 + δ2 * returnt_1^2
nyse_clean <- nyse_clean %>%
  mutate(returnt_1_sq = returnt_1^2)

model_arch <- lm(ut_sq ~ returnt_1 + returnt_1_sq, data = nyse_clean)

cat("\nC11 (ii) - ARCH Model Summary:\n")
## 
## C11 (ii) - ARCH Model Summary:
summary(model_arch)
## 
## Call:
## lm(formula = ut_sq ~ returnt_1 + returnt_1_sq, data = nyse_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.545  -3.021  -1.974   0.695 221.544 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.25771    0.44131   7.382 4.54e-13 ***
## returnt_1    -0.78556    0.19626  -4.003 6.95e-05 ***
## returnt_1_sq  0.29749    0.03558   8.361 3.46e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.67 on 685 degrees of freedom
## Multiple R-squared:  0.1305, Adjusted R-squared:  0.128 
## F-statistic: 51.41 on 2 and 685 DF,  p-value: < 2.2e-16
# Report coefficients, standard errors, R-squared, and adjusted R-squared
cat("\nCoefficients of ARCH model:\n")
## 
## Coefficients of ARCH model:
print(coef(model_arch))
##  (Intercept)    returnt_1 returnt_1_sq 
##    3.2577065   -0.7855573    0.2974907
cat("\nStandard Errors of ARCH model:\n")
## 
## Standard Errors of ARCH model:
print(summary(model_arch)$coefficients[, 2])
##  (Intercept)    returnt_1 returnt_1_sq 
##    0.4413140    0.1962611    0.0355787
cat("\nR-squared of ARCH model:\n", summary(model_arch)$r.squared)
## 
## R-squared of ARCH model:
##  0.1305106
cat("\nAdjusted R-squared of ARCH model:\n", summary(model_arch)$adj.r.squared)
## 
## Adjusted R-squared of ARCH model:
##  0.127972
# (iii) Conditional variance as a function of returnt-1
# Conditional variance: δ0 + δ1 * returnt-1 + δ2 * returnt-1^2
nyse_clean <- nyse_clean %>%
  mutate(conditional_variance = coef(model_arch)[1] +
                                coef(model_arch)[2] * returnt_1 +
                                coef(model_arch)[3] * returnt_1_sq)

# Find the smallest variance and the corresponding returnt-1
smallest_variance <- nyse_clean %>%
  filter(conditional_variance == min(conditional_variance, na.rm = TRUE))

cat("\nSmallest variance and corresponding returnt_1:\n")
## 
## Smallest variance and corresponding returnt_1:
print(smallest_variance)
##   price   return return_1   t price_1 price_2    cprice  cprice_1 returnt_1
## 1 56.83 1.572835 1.321984 168   55.95   55.22 0.8800011 0.7299995  1.321984
##      ut_sq returnt_1_sq conditional_variance
## 1 1.735706     1.747642             2.739119
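Since the coefficient on returnt_1_sq is positive, the fitted variance quadratic has an interior minimum at returnt_1 = -δ1/(2δ2), with minimized variance δ0 - δ1^2/(4δ2). A quick analytic check, which matches the filtered row above (minimum variance of about 2.739 near returnt_1 of about 1.32) and, because that minimum is positive, also anticipates the answer to (iv) that the fitted model never produces a negative variance:

# Analytic vertex of the fitted variance quadratic
d <- coef(model_arch)
r_star  <- -d[2] / (2 * d[3])           # return that minimizes the variance (~ 1.32)
var_min <- d[1] - d[2]^2 / (4 * d[3])   # minimized conditional variance (~ 2.74)
c(r_star = unname(r_star), var_min = unname(var_min))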
# (iv) Check if the model produces negative variance
cat("\nNegative variances produced? ", any(nyse_clean$conditional_variance < 0, na.rm = TRUE))
## 
## Negative variances produced?  FALSE
# (v) Compare to ARCH(1) model
# Fit an ARCH(1) model using 'garch()' from the 'tseries' package
arch1_model <- garch(nyse_clean$return, order = c(0, 1), trace = FALSE)
cat("\nC11 (v) - ARCH(1) Model Summary:\n")
## 
## C11 (v) - ARCH(1) Model Summary:
summary(arch1_model)
## 
## Call:
## garch(x = nyse_clean$return, order = c(0, 1), trace = FALSE)
## 
## Model:
## GARCH(0,1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5451 -0.5314  0.1544  0.7527  3.3353 
## 
## Coefficient(s):
##     Estimate  Std. Error  t value Pr(>|t|)    
## a0    3.2615      0.2404   13.566  < 2e-16 ***
## a1    0.2623      0.0330    7.949 1.78e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Diagnostic Tests:
##  Jarque Bera Test
## 
## data:  Residuals
## X-squared = 76.247, df = 2, p-value < 2.2e-16
## 
## 
##  Box-Ljung test
## 
## data:  Squared.Residuals
## X-squared = 0.188, df = 1, p-value = 0.6646
# (vi) Add the second lag (ARCH(2) model)
arch2_model <- garch(nyse_clean$return, order = c(0, 2), trace = FALSE)
cat("\nC11 (vi) - ARCH(2) Model Summary:\n")
## 
## C11 (vi) - ARCH(2) Model Summary:
summary(arch2_model)
## 
## Call:
## garch(x = nyse_clean$return, order = c(0, 2), trace = FALSE)
## 
## Model:
## GARCH(0,2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5890 -0.5387  0.1526  0.7527  3.3325 
## 
## Coefficient(s):
##     Estimate  Std. Error  t value Pr(>|t|)    
## a0   2.97841     0.28199   10.562  < 2e-16 ***
## a1   0.26885     0.03258    8.253 2.22e-16 ***
## a2   0.06260     0.05377    1.164    0.244    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Diagnostic Tests:
##  Jarque Bera Test
## 
## data:  Residuals
## X-squared = 80.464, df = 2, p-value < 2.2e-16
## 
## 
##  Box-Ljung test
## 
## data:  Squared.Residuals
## X-squared = 0.2925, df = 1, p-value = 0.5886
# Compare ARCH(1) and ARCH(2)
cat("\nComparison of ARCH(1) and ARCH(2) models:\n")
## 
## Comparison of ARCH(1) and ARCH(2) models:
cat("AIC of ARCH(1):", AIC(arch1_model), "\n")
## AIC of ARCH(1): 2929.782
cat("AIC of ARCH(2):", AIC(arch2_model), "\n")
## AIC of ARCH(2): 2921.784
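
By AIC, the ARCH(2) model fits slightly better (2921.8 versus 2929.8), but its second-lag coefficient a2 is individually insignificant (t = 1.164, p = 0.244), so the evidence for the extra lag is weak and ARCH(1) remains a defensible choice. The AIC gap can be read off directly:

# Positive difference favors the ARCH(2) specification
AIC(arch1_model) - AIC(arch2_model)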