Load the wooldridge package and the SLEEP75 dataset:
# Install the package if not already installed
if (!require(wooldridge)) install.packages("wooldridge")
## Loading required package: wooldridge
# Load the library
library(wooldridge)
# Load the SLEEP75 dataset
data("sleep75")
# Fit the regression model
model <- lm(sleep ~ totwrk + educ + age + I(age^2) + male, data = sleep75)
# Summary of the model
summary(model)
##
## Call:
## lm(formula = sleep ~ totwrk + educ + age + I(age^2) + male, data = sleep75)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2378.00 -243.29 6.74 259.24 1350.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3840.83197 235.10870 16.336 <2e-16 ***
## totwrk -0.16342 0.01813 -9.013 <2e-16 ***
## educ -11.71332 5.86689 -1.997 0.0463 *
## age -8.69668 11.20746 -0.776 0.4380
## I(age^2) 0.12844 0.13390 0.959 0.3378
## male 87.75243 34.32616 2.556 0.0108 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 417.7 on 700 degrees of freedom
## Multiple R-squared: 0.1228, Adjusted R-squared: 0.1165
## F-statistic: 19.59 on 5 and 700 DF, p-value: < 2.2e-16
Evidence that men sleep more than women: The coefficient for male (the gender dummy) represents the difference in sleep for men compared to women. Check its value and statistical significance:
coef(summary(model))["male", ] # Extract details for 'male'
## Estimate Std. Error t value Pr(>|t|)
## 87.75242861 34.32616153 2.55642999 0.01078518
The estimate (87.75) indicates that men, on average, sleep 87.75 more minutes per week than women, holding all other factors constant. The p-value (0.0108) is less than 0.05, which means the difference is statistically significant at the 5% level.
There is strong evidence that men sleep significantly more than women. The difference is approximately 87.75 minutes per week, and this result is unlikely due to random chance.
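As a quick check on this conclusion, a 95% confidence interval for the male coefficient can be computed directly with base R's confint (a small sketch using the model object fitted above):
# 95% confidence interval for the coefficient on male
confint(model, "male", level = 0.95)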
Tradeoff between working and sleeping: The coefficient for totwrk (total work time) reflects the tradeoff between working and sleeping. It shows how many minutes of sleep decrease for each additional minute of work:
coef(summary(model))["totwrk", ] # Extract details for 'totwrk'
## Estimate Std. Error t value Pr(>|t|)
## -1.634232e-01 1.813210e-02 -9.012918e+00 1.891272e-18
The coefficient of totwrk (−0.1634) indicates the tradeoff between working and sleeping:
Tradeoff: For every additional minute spent working, sleep decreases by approximately 0.1634 minutes (or 9.8 seconds). This demonstrates a negative relationship between working and sleeping.
Statistical Significance: The p-value (1.891×10^{-18}) is extremely small, well below the typical threshold of 0.05, indicating that this tradeoff is highly statistically significant.
There is strong evidence of a statistically significant tradeoff between working and sleeping. For every extra minute of work, the expected reduction in sleep is approximately 0.1634 minutes.
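To express the magnitude in more familiar units, the coefficient can simply be rescaled (a small sketch; the arithmetic uses the estimate reported above):
# Minutes of sleep lost per additional hour of work
60 * abs(coef(model)["totwrk"])  # roughly 9.8 minutes per extra hour worked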
Test the null hypothesis regarding age: To test whether age has no effect on sleeping (both age and age^2 are jointly insignificant), run an F-test:
# Fit a reduced model without age and age^2
reduced_model <- lm(sleep ~ totwrk + educ + male, data = sleep75)
# Perform an F-test
anova(reduced_model, model)
## Analysis of Variance Table
##
## Model 1: sleep ~ totwrk + educ + male
## Model 2: sleep ~ totwrk + educ + age + I(age^2) + male
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 702 122631662
## 2 700 122147777 2 483885 1.3865 0.2506
Null hypothesis (H0): Age (both age and age^2) has no effect on sleeping.
Alternative hypothesis (H1): Age has an effect on sleeping.
Since the F-test p-value (0.2506) is well above 0.05, we fail to reject the null hypothesis: there is no evidence that age (age and age^2) has a statistically significant effect on sleep, after controlling for other factors (e.g., work time, education, and gender).
# Install and load the wooldridge package
if (!require(wooldridge)) install.packages("wooldridge")
library(wooldridge)
# Load the GPA2 dataset
data("gpa2")
# Fit the regression model
model <- lm(sat ~ hsize + I(hsize^2) + female + black + female:black, data = gpa2)
# Summary of the model
summary(model)
##
## Call:
## lm(formula = sat ~ hsize + I(hsize^2) + female + black + female:black,
## data = gpa2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -570.45 -89.54 -5.24 85.41 479.13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1028.0972 6.2902 163.445 < 2e-16 ***
## hsize 19.2971 3.8323 5.035 4.97e-07 ***
## I(hsize^2) -2.1948 0.5272 -4.163 3.20e-05 ***
## female -45.0915 4.2911 -10.508 < 2e-16 ***
## black -169.8126 12.7131 -13.357 < 2e-16 ***
## female:black 62.3064 18.1542 3.432 0.000605 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 133.4 on 4131 degrees of freedom
## Multiple R-squared: 0.08578, Adjusted R-squared: 0.08468
## F-statistic: 77.52 on 5 and 4131 DF, p-value: < 2.2e-16
Evidence for hsize^2 and optimal high school size
To check if hsize^2 is significant and calculate the optimal high school size:
# Extract the coefficients for hsize and hsize^2
coef_hsize <- coef(model)["hsize"]
coef_hsize2 <- coef(model)["I(hsize^2)"]
# Test the significance of hsize^2
coef(summary(model))["I(hsize^2)", ] # Check the p-value
## Estimate Std. Error t value Pr(>|t|)
## -2.194828e+00 5.271635e-01 -4.163468e+00 3.198417e-05
# Calculate the optimal high school size (vertex of the parabola)
optimal_hsize <- -coef_hsize / (2 * coef_hsize2)
optimal_hsize
## hsize
## 4.39603
Since the p-value is less than 0.05, hsize^2 is statistically significant. This suggests that hsize^2 should be included in the model because it captures a non-linear relationship between high school size and SAT scores.
The optimal high school size is approximately 4.40 (in hundreds of students), which translates to 440 students.
Difference in SAT score between nonblack females and nonblack males
The coefficient for female represents the difference in SAT score between nonblack females and nonblack males:
# Extract the coefficient and its significance for 'female'
coef(summary(model))["female", ]
## Estimate Std. Error t value Pr(>|t|)
## -4.509145e+01 4.291063e+00 -1.050822e+01 1.656301e-25
There is strong evidence that nonblack females have a statistically significant lower SAT score (by 45.09 points) compared to nonblack males, holding other factors constant.
Difference in SAT score between nonblack males and black males
The coefficient for black represents the difference in SAT score between nonblack males and black males:
# Extract the coefficient and its significance for 'black'
coef(summary(model))["black", ]
## Estimate Std. Error t value Pr(>|t|)
## -1.698126e+02 1.271315e+01 -1.335725e+01 7.140387e-40
# Test the null hypothesis (H0: No difference)
t_stat <- coef(model)["black"] / coef(summary(model))["black", "Std. Error"]
p_value <- 2 * pt(abs(t_stat), df = nrow(gpa2) - length(coef(model)), lower.tail = FALSE)
p_value
## black
## 7.140387e-40
There is very strong evidence to reject the null hypothesis of no difference in SAT scores between nonblack males and black males: on average, black males score 169.81 points lower on the SAT than nonblack males, holding all other factors constant.
Difference in SAT score between black females and nonblack females
The difference in SAT scores between black females and nonblack females is captured by the black coefficient plus the interaction term female:black:
# Calculate the combined effect for black females
black_female_effect <- coef(model)["black"] + coef(model)["female:black"]
black_female_effect
## black
## -107.5063
# Test its significance
se_combined <- sqrt(
coef(summary(model))["black", "Std. Error"]^2 +
coef(summary(model))["female:black", "Std. Error"]^2
)
t_stat_combined <- black_female_effect / se_combined
p_value_combined <- 2 * pt(abs(t_stat_combined), df = nrow(gpa2) - length(coef(model)), lower.tail = FALSE)
p_value_combined
## black
## 1.275493e-06
Estimated difference: The estimated difference in SAT scores between black females and nonblack females is -107.51. This means, on average, black females score 107.51 points less on the SAT compared to nonblack females, holding all other variables constant.
Since the p-value is very small, we reject the null hypothesis. This means the difference in SAT scores between black females and nonblack females is statistically significant. In other words, the data provides strong evidence that black females have lower SAT scores compared to nonblack females, on average, when controlling for other variables in the model.
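Note that the combined standard error above adds the two variances but omits the covariance between the black and female:black estimates. A sketch that includes the covariance term, using the model's estimated covariance matrix, is:
# Combined standard error including the covariance term
V <- vcov(model)
se_combined_full <- sqrt(
  V["black", "black"] + V["female:black", "female:black"] +
    2 * V["black", "female:black"]
)
t_stat_full <- black_female_effect / se_combined_full
2 * pt(abs(t_stat_full), df = df.residual(model), lower.tail = FALSE)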
# Load the wooldridge package
if (!require(wooldridge)) install.packages("wooldridge")
library(wooldridge)
# Load the GPA1 dataset
data("gpa1")
Add mothcoll and fathcoll to the existing regression model
# Original model in equation (7.6)
original_model <- lm(colGPA ~ hsGPA + ACT + PC, data = gpa1)
# Model with mothcoll and fathcoll
model_i <- lm(colGPA ~ hsGPA + ACT + PC + mothcoll + fathcoll, data = gpa1)
# Compare the summary of the two models
summary(original_model)
##
## Call:
## lm(formula = colGPA ~ hsGPA + ACT + PC, data = gpa1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7901 -0.2622 -0.0107 0.2334 0.7570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.263520 0.333125 3.793 0.000223 ***
## hsGPA 0.447242 0.093647 4.776 4.54e-06 ***
## ACT 0.008659 0.010534 0.822 0.412513
## PC 0.157309 0.057287 2.746 0.006844 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3325 on 137 degrees of freedom
## Multiple R-squared: 0.2194, Adjusted R-squared: 0.2023
## F-statistic: 12.83 on 3 and 137 DF, p-value: 1.932e-07
summary(model_i)
##
## Call:
## lm(formula = colGPA ~ hsGPA + ACT + PC + mothcoll + fathcoll,
## data = gpa1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78149 -0.25726 -0.02121 0.24691 0.74432
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.255554 0.335392 3.744 0.000268 ***
## hsGPA 0.450220 0.094280 4.775 4.61e-06 ***
## ACT 0.007724 0.010678 0.723 0.470688
## PC 0.151854 0.058716 2.586 0.010762 *
## mothcoll -0.003758 0.060270 -0.062 0.950376
## fathcoll 0.041800 0.061270 0.682 0.496265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3344 on 135 degrees of freedom
## Multiple R-squared: 0.2222, Adjusted R-squared: 0.1934
## F-statistic: 7.713 on 5 and 135 DF, p-value: 2.083e-06
Adding mothcoll and fathcoll does not substantially improve the model fit or contribute significantly to explaining college GPA. The effect of PC ownership remains statistically significant even after controlling for these parental education variables.
Test for joint significance of mothcoll and fathcoll
To test if mothcoll and fathcoll are jointly significant, conduct an F-test:
# Reduced model without mothcoll and fathcoll
reduced_model <- lm(colGPA ~ hsGPA + ACT + PC, data = gpa1)
# Perform F-test
anova(reduced_model, model_i)
## Analysis of Variance Table
##
## Model 1: colGPA ~ hsGPA + ACT + PC
## Model 2: colGPA ~ hsGPA + ACT + PC + mothcoll + fathcoll
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 137 15.149
## 2 135 15.094 2 0.054685 0.2446 0.7834
Since the p-value (0.7834) is much greater than any common significance level (e.g., α = 0.05), we fail to reject the null hypothesis. This means there is no strong evidence that mothcoll and fathcoll are jointly significant in explaining variations in college GPA (colGPA), holding other variables constant.
Adding mothcoll and fathcoll to the model does not provide a statistically significant improvement in explaining college GPA. This suggests that the parents' college education might not be critical predictors in this context when other factors (like high school GPA, ACT score, and PC ownership) are already included in the model.
Add hsGPA^2 to the model from part (i)
To check whether adding hsGPA^2 improves the model:
# Model with hsGPA^2
model_iii <- lm(colGPA ~ hsGPA + I(hsGPA^2) + ACT + PC + mothcoll + fathcoll, data = gpa1)
# Summary of the model
summary(model_iii)
##
## Call:
## lm(formula = colGPA ~ hsGPA + I(hsGPA^2) + ACT + PC + mothcoll +
## fathcoll, data = gpa1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78998 -0.24327 -0.00648 0.26179 0.72231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.040328 2.443038 2.063 0.0410 *
## hsGPA -1.802520 1.443552 -1.249 0.2140
## I(hsGPA^2) 0.337341 0.215711 1.564 0.1202
## ACT 0.004786 0.010786 0.444 0.6580
## PC 0.140446 0.058858 2.386 0.0184 *
## mothcoll 0.003091 0.060110 0.051 0.9591
## fathcoll 0.062761 0.062401 1.006 0.3163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3326 on 134 degrees of freedom
## Multiple R-squared: 0.2361, Adjusted R-squared: 0.2019
## F-statistic: 6.904 on 6 and 134 DF, p-value: 2.088e-06
# Compare model_i and model_iii using an F-test
anova(model_i, model_iii)
## Analysis of Variance Table
##
## Model 1: colGPA ~ hsGPA + ACT + PC + mothcoll + fathcoll
## Model 2: colGPA ~ hsGPA + I(hsGPA^2) + ACT + PC + mothcoll + fathcoll
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 135 15.094
## 2 134 14.823 1 0.27054 2.4457 0.1202
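The F-statistic for adding hsGPA^2 is 2.4457 with a p-value of 0.1202, so we fail to reject the null hypothesis at the 5% level: including hsGPA^2 does not yield a statistically significant improvement over the model from part (i).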
# Install the required packages if not already installed
if (!require(wooldridge)) install.packages("wooldridge")
if (!require(car)) install.packages("car") # For hypothesis tests
## Loading required package: car
## Loading required package: carData
# Load libraries
library(wooldridge)
# Load the dataset
data("wage2")
Estimate the base model
# Base model
model1 <- lm(log(wage) ~ educ + exper + tenure + married + black + south + urban, data = wage2)
# Summary of the model
summary(model1)
##
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure + married + black +
## south + urban, data = wage2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.98069 -0.21996 0.00707 0.24288 1.22822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.395497 0.113225 47.653 < 2e-16 ***
## educ 0.065431 0.006250 10.468 < 2e-16 ***
## exper 0.014043 0.003185 4.409 1.16e-05 ***
## tenure 0.011747 0.002453 4.789 1.95e-06 ***
## married 0.199417 0.039050 5.107 3.98e-07 ***
## black -0.188350 0.037667 -5.000 6.84e-07 ***
## south -0.090904 0.026249 -3.463 0.000558 ***
## urban 0.183912 0.026958 6.822 1.62e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3655 on 927 degrees of freedom
## Multiple R-squared: 0.2526, Adjusted R-squared: 0.2469
## F-statistic: 44.75 on 7 and 927 DF, p-value: < 2.2e-16
# Difference in monthly salary between blacks and nonblacks
# Coefficient of "black" represents the percentage difference in wages
Holding other factors constant, the coefficient on black implies that Black workers earn approximately 18.8% less per month than non-Black workers (using the usual approximation of multiplying the log-wage coefficient by 100). This difference is statistically significant at a very high level.
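The log-wage coefficient is only an approximation to the percentage gap; the exact percentage difference can be obtained by transforming the coefficient (a small sketch based on the estimate above):
# Exact percentage wage difference implied by the black coefficient
100 * (exp(coef(model1)["black"]) - 1)  # approximately -17.2%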
Add exper^2 and tenure^2 and test joint insignificance
# Extended model with quadratic terms
model2 <- lm(log(wage) ~ educ + exper + I(exper^2) + tenure + I(tenure^2) + married + black + south + urban, data = wage2)
# Summary of the extended model
summary(model2)
##
## Call:
## lm(formula = log(wage) ~ educ + exper + I(exper^2) + tenure +
## I(tenure^2) + married + black + south + urban, data = wage2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.98236 -0.21972 -0.00036 0.24078 1.25127
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3586756 0.1259143 42.558 < 2e-16 ***
## educ 0.0642761 0.0063115 10.184 < 2e-16 ***
## exper 0.0172146 0.0126138 1.365 0.172665
## I(exper^2) -0.0001138 0.0005319 -0.214 0.830622
## tenure 0.0249291 0.0081297 3.066 0.002229 **
## I(tenure^2) -0.0007964 0.0004710 -1.691 0.091188 .
## married 0.1985470 0.0391103 5.077 4.65e-07 ***
## black -0.1906636 0.0377011 -5.057 5.13e-07 ***
## south -0.0912153 0.0262356 -3.477 0.000531 ***
## urban 0.1854241 0.0269585 6.878 1.12e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3653 on 925 degrees of freedom
## Multiple R-squared: 0.255, Adjusted R-squared: 0.2477
## F-statistic: 35.17 on 9 and 925 DF, p-value: < 2.2e-16
# Test joint significance of exper^2 and tenure^2
linearHypothesis(model2, c("I(exper^2) = 0", "I(tenure^2) = 0"))
##
## Linear hypothesis test:
## I(exper^2) = 0
## I(tenure^2) = 0
##
## Model 1: restricted model
## Model 2: log(wage) ~ educ + exper + I(exper^2) + tenure + I(tenure^2) +
## married + black + south + urban
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 927 123.82
## 2 925 123.42 2 0.39756 1.4898 0.226
Coefficients for I(exper^2) and I(tenure^2):
I(exper^2) has an estimated coefficient of -0.0001138, with a p-value of 0.831, indicating it is not significant.
I(tenure^2) has an estimated coefficient of -0.0007964, with a p-value of 0.091, which is marginally significant at the 10% level but not at stricter levels.
Joint insignificance test (linear hypothesis test):
The null hypothesis is that both I(exper^2) = 0 and I(tenure^2) = 0.
The F-statistic is 1.4898, with a p-value of 0.226, which is greater than 0.20.
Conclusion: The test fails to reject the null hypothesis, so the quadratic terms for exper and tenure are jointly insignificant, even at the 20% significance level.
Allow return to education to depend on race
# Interaction between education and race
model3 <- lm(log(wage) ~ educ * black + exper + tenure + married + south + urban, data = wage2)
# Summary of the model
summary(model3)
##
## Call:
## lm(formula = log(wage) ~ educ * black + exper + tenure + married +
## south + urban, data = wage2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.97782 -0.21832 0.00475 0.24136 1.23226
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.374817 0.114703 46.859 < 2e-16 ***
## educ 0.067115 0.006428 10.442 < 2e-16 ***
## black 0.094809 0.255399 0.371 0.710561
## exper 0.013826 0.003191 4.333 1.63e-05 ***
## tenure 0.011787 0.002453 4.805 1.80e-06 ***
## married 0.198908 0.039047 5.094 4.25e-07 ***
## south -0.089450 0.026277 -3.404 0.000692 ***
## urban 0.183852 0.026955 6.821 1.63e-11 ***
## educ:black -0.022624 0.020183 -1.121 0.262603
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3654 on 926 degrees of freedom
## Multiple R-squared: 0.2536, Adjusted R-squared: 0.2471
## F-statistic: 39.32 on 8 and 926 DF, p-value: < 2.2e-16
# Test if the return to education depends on race
linearHypothesis(model3, "educ:black = 0")
##
## Linear hypothesis test:
## educ:black = 0
##
## Model 1: restricted model
## Model 2: log(wage) ~ educ * black + exper + tenure + married + south +
## urban
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 927 123.82
## 2 926 123.65 1 0.16778 1.2565 0.2626
The interaction between education and race (educ:black) does not significantly affect wages. This implies that the return to education does not statistically differ between black and non-black individuals in this model.
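For reference, the implied return to an extra year of education for each group can be read directly off the estimates (a small sketch using the fitted model above):
# Implied return to education by race
coef(model3)["educ"]                               # nonblack: about 6.7% per year
coef(model3)["educ"] + coef(model3)["educ:black"]  # black: about 4.4% per year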
Allow wages to differ across marital status and race
# Create interaction terms for marital status and race
model4 <- lm(log(wage) ~ married * black + exper + tenure + educ + south + urban, data = wage2)
# Summary of the model
summary(model4)
##
## Call:
## lm(formula = log(wage) ~ married * black + exper + tenure + educ +
## south + urban, data = wage2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.98013 -0.21780 0.01057 0.24219 1.22889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.403793 0.114122 47.351 < 2e-16 ***
## married 0.188915 0.042878 4.406 1.18e-05 ***
## black -0.240820 0.096023 -2.508 0.012314 *
## exper 0.014146 0.003191 4.433 1.04e-05 ***
## tenure 0.011663 0.002458 4.745 2.41e-06 ***
## educ 0.065475 0.006253 10.471 < 2e-16 ***
## south -0.091989 0.026321 -3.495 0.000497 ***
## urban 0.184350 0.026978 6.833 1.50e-11 ***
## married:black 0.061354 0.103275 0.594 0.552602
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3656 on 926 degrees of freedom
## Multiple R-squared: 0.2528, Adjusted R-squared: 0.2464
## F-statistic: 39.17 on 8 and 926 DF, p-value: < 2.2e-16
# Wage differential between married blacks and married nonblacks
# Coefficient of "married:black" represents the differential
The estimated wage differential between married blacks and married nonblacks is the sum of the black and married:black coefficients, -0.2408 + 0.0614 ≈ -0.179, or roughly -16.4% in exact percentage terms (exp(-0.179) - 1 ≈ -0.164): married Black men earn about 16.4% less than married non-Black men, holding other factors constant. The interaction term itself (0.061, p = 0.553) is not statistically significant, so there is no strong evidence that the marriage premium differs by race.
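The combined differential (black + married:black) can also be tested directly. A sketch using car::linearHypothesis, which is already loaded for this section, is:
# Test H0: black + married:black = 0
# (no wage difference between married blacks and married nonblacks)
linearHypothesis(model4, "black + married:black = 0")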
Answer: False.
Reason: Heteroskedasticity does not cause the OLS estimators to be inconsistent. The OLS estimators remain consistent as long as the model is correctly specified, and the errors have zero mean and are uncorrelated with the regressors. However, heteroskedasticity affects efficiency (see point iii).
Answer: True.
Reason: Under heteroskedasticity, the usual F-statistic does not follow an exact F-distribution because the standard error estimates are biased. This can lead to incorrect inference unless robust standard errors are used.
Answer: True.
Reason: The OLS estimators are no longer BLUE (Best Linear Unbiased Estimators) under heteroskedasticity because they are no longer efficient. In the presence of heteroskedasticity, OLS does not produce the smallest variance among linear unbiased estimators. Weighted Least Squares (WLS) or robust standard errors are typically used to address this issue.
# Load required packages
if (!require(wooldridge)) install.packages("wooldridge")
library(wooldridge)
# Load the SMOKE dataset
data("smoke")
smoke$smoke <- ifelse(smoke$cigs > 0, 1, 0)
# Fit the linear probability model
lpm <- lm(smoke ~ log(cigpric) + log(income) + educ + age + I(age^2) + restaurn + white, data = smoke)
# Summary of the model
summary(lpm)
##
## Call:
## lm(formula = smoke ~ log(cigpric) + log(income) + educ + age +
## I(age^2) + restaurn + white, data = smoke)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6715 -0.3958 -0.2724 0.5344 0.9340
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6560624 0.8549619 0.767 0.443095
## log(cigpric) -0.0689259 0.2041088 -0.338 0.735684
## log(income) 0.0121620 0.0257244 0.473 0.636498
## educ -0.0288802 0.0059008 -4.894 1.19e-06 ***
## age 0.0198158 0.0056660 3.497 0.000496 ***
## I(age^2) -0.0002598 0.0000617 -4.210 2.84e-05 ***
## restaurn -0.1007616 0.0394431 -2.555 0.010815 *
## white -0.0256801 0.0515172 -0.498 0.618286
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4734 on 799 degrees of freedom
## Multiple R-squared: 0.062, Adjusted R-squared: 0.05378
## F-statistic: 7.544 on 7 and 799 DF, p-value: 8.035e-09
Are there important differences between the usual and heteroskedasticity-robust standard errors?
To calculate robust standard errors:
# Install and load the sandwich package for robust SE
if (!require(sandwich)) install.packages("sandwich")
## Loading required package: sandwich
if (!require(lmtest)) install.packages("lmtest")
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(sandwich)
library(lmtest)
# Compute robust standard errors
robust_se <- coeftest(lpm, vcov = vcovHC(lpm, type = "HC1"))
robust_se
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5606e-01 8.6408e-01 0.7593 0.4479200
## log(cigpric) -6.8926e-02 2.0803e-01 -0.3313 0.7404890
## log(income) 1.2162e-02 2.5823e-02 0.4710 0.6377879
## educ -2.8880e-02 5.6421e-03 -5.1187 3.857e-07 ***
## age 1.9816e-02 5.4115e-03 3.6618 0.0002669 ***
## I(age^2) -2.5979e-04 5.7133e-05 -4.5470 6.284e-06 ***
## restaurn -1.0076e-01 3.7690e-02 -2.6734 0.0076615 **
## white -2.5680e-02 5.0690e-02 -0.5066 0.6125706
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Effect of a four-year increase in education on smoking probability
The coefficient of educ represents the change in the probability of smoking for each additional year of education. Multiply the coefficient by 4:
# Coefficient of education
educ_coef <- coef(lpm)["educ"]
# Effect of 4 additional years of education
effect_4_years <- 4 * educ_coef
effect_4_years
## educ
## -0.1155209
A four-year increase in education is therefore estimated to reduce the probability of smoking by about 0.116, i.e., roughly 11.6 percentage points, holding the other variables fixed.
At what age does another year of age reduce the probability of smoking?
# Coefficients for age and age^2
age_coef <- coef(lpm)["age"]
age_sq_coef <- coef(lpm)["I(age^2)"]
# Turning point age
turning_point <- -age_coef / (2 * age_sq_coef)
turning_point
## age
## 38.13861
Setting the derivative of the smoking probability with respect to age equal to zero, age* = -β_age / (2 · β_age²), gives a turning point of about 38.1 years. In other words, the estimated probability of smoking begins to decrease at roughly age 38; beyond that age, another year of age reduces the probability of smoking.
Interpret the coefficient on restaurn:
The coefficient on restaurn indicates the average change in the probability of smoking for individuals living in states with restaurant smoking restrictions compared to those without. To extract this:
# Coefficient for restaurn
restaurn_coef <- coef(lpm)["restaurn"]
restaurn_coef
## restaurn
## -0.1007616
Living in a state with restaurant smoking restrictions is associated with a probability of smoking that is about 0.101 (10.1 percentage points) lower, holding the other variables fixed.
Predicted probability of smoking for person #206
For the given individual (cigpric=67.44, income=6500, educ=16, age=77, restaurn=0, white=0)
# Create a data frame with the individual's characteristics
individual <- data.frame(
cigpric = 67.44,
income = 6500,
educ = 16,
age = 77,
restaurn = 0,
white = 0
)
# Predicted probability
predicted_probability <- predict(lpm, newdata = individual)
predicted_probability
## 1
## -0.003965468
The predicted value of about -0.004 lies below zero, outside the [0, 1] range for a probability, which highlights a well-known limitation of the linear probability model.
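To gauge how widespread this problem is, one could count the fitted values that fall outside the unit interval (a small sketch; the resulting count is not reported here):
# Count fitted probabilities outside [0, 1]
fitted_lpm <- fitted(lpm)
sum(fitted_lpm < 0 | fitted_lpm > 1)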
library(wooldridge)
library(lmtest) # For heteroskedasticity tests
library(sandwich) # For robust covariance matrix estimation
# Load the dataset
data("vote1")
Estimate the model
model <- lm(voteA ~ prtystrA + democA + log(expendA) + log(expendB), data = vote1)
# Obtain the residuals and regress them on the independent variables
residuals <- residuals(model)
residual_model <- lm(residuals ~ prtystrA + democA + log(expendA) + log(expendB), data = vote1)
# Check R-squared
r_squared <- summary(residual_model)$r.squared
cat("R-squared of residual regression:", r_squared, "\n")
## R-squared of residual regression: 1.198131e-32
# Explanation: The R-squared is 0 because residuals from an OLS regression are uncorrelated with the regressors by construction.
This near-zero R^2 is the expected result, confirming the properties of OLS.
Breusch-Pagan test
bp_test <- bptest(model)
cat("Breusch-Pagan Test:\n")
## Breusch-Pagan Test:
print(bp_test)
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 9.0934, df = 4, p-value = 0.05881
The Breusch-Pagan test checks for heteroskedasticity (whether the variance of the residuals is not constant across observations).
Null Hypothesis (H0): Residuals have constant variance (homoskedasticity).
Alternative Hypothesis (H1): Residual variance is not constant (heteroskedasticity).
With a p-value of 0.05881, which is slightly above the common significance level of 0.05, we fail to reject the null hypothesis at the 5% level. This suggests that there is weak evidence of heteroskedasticity.
However, if we use a 10% significance level (α=0.10), the null hypothesis would be rejected, indicating possible heteroskedasticity.
The evidence for heteroskedasticity is borderline, depending on the significance level chosen:
At 5%: No heteroskedasticity detected.
At 10%: Weak evidence of heteroskedasticity.
White test for heteroskedasticity
white_test <- bptest(model, ~ fitted(model) + I(fitted(model)^2), data = vote1)
cat("White Test:\n")
## White Test:
print(white_test)
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 5.49, df = 2, p-value = 0.06425
The null hypothesis of the White test is that the variance of the residuals is constant (i.e., no heteroskedasticity). If the p-value is less than the significance level (commonly α = 0.05), we reject the null hypothesis, indicating evidence of heteroskedasticity.
In this case:
The p-value = 0.06425, which is slightly greater than 0.05.
At the 5% significance level, we fail to reject the null hypothesis, meaning there is no strong evidence of heteroskedasticity.
However, at a less stringent significance level (e.g., 10%), the test might suggest mild evidence of heteroskedasticity.
The evidence for heteroskedasticity is weak. At the 5% level, the residuals can be considered homoskedastic, but there may be slight heteroskedasticity at a 10% level.
# Load necessary libraries
library(wooldridge)
library(lmtest) # For hypothesis tests
library(sandwich) # For robust standard errors
library(car)
# Load the dataset
data("fertil2")
Estimate the model
model1 <- lm(children ~ age + I(age^2) + educ + electric + urban, data = fertil2)
# Usual standard errors
summary(model1)
##
## Call:
## lm(formula = children ~ age + I(age^2) + educ + electric + urban,
## data = fertil2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9012 -0.7136 -0.0039 0.7119 7.4318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.2225162 0.2401888 -17.580 < 2e-16 ***
## age 0.3409255 0.0165082 20.652 < 2e-16 ***
## I(age^2) -0.0027412 0.0002718 -10.086 < 2e-16 ***
## educ -0.0752323 0.0062966 -11.948 < 2e-16 ***
## electric -0.3100404 0.0690045 -4.493 7.20e-06 ***
## urban -0.2000339 0.0465062 -4.301 1.74e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.452 on 4352 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.5734, Adjusted R-squared: 0.5729
## F-statistic: 1170 on 5 and 4352 DF, p-value: < 2.2e-16
# Heteroskedasticity-robust standard errors
robust_se <- coeftest(model1, vcov = vcovHC(model1, type = "HC1"))
cat("Heteroskedasticity-robust standard errors:\n")
## Heteroskedasticity-robust standard errors:
print(robust_se)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.22251623 0.24385099 -17.3160 < 2.2e-16 ***
## age 0.34092552 0.01917466 17.7800 < 2.2e-16 ***
## I(age^2) -0.00274121 0.00035051 -7.8206 6.549e-15 ***
## educ -0.07523232 0.00630771 -11.9270 < 2.2e-16 ***
## electric -0.31004041 0.06394815 -4.8483 1.289e-06 ***
## urban -0.20003386 0.04547093 -4.3992 1.113e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Compare usual vs robust standard errors
cat("Are robust SEs always larger? Compare the standard errors above.\n")
## Are robust SEs always larger? Compare the standard errors above.
Coefficient | Non-Robust SE | Robust SE | Robust SE > Non-Robust SE? |
---|---|---|---|
Intercept | 0.2402 | 0.2439 | Yes |
age | 0.0165 | 0.0192 | Yes |
I(age²) | 0.00027 | 0.00035 | Yes |
educ | 0.00630 | 0.00631 | Yes |
electric | 0.0690 | 0.0640 | No |
urban | 0.0465 | 0.0455 | No |
Robust standard errors are not always larger than the non-robust standard errors. For most coefficients here the robust SEs are larger, but for electric and urban they are slightly smaller. This is expected: the heteroskedasticity-robust variance estimator allows the error variance to differ across observations, so the resulting standard errors can come out either larger or smaller than the usual homoskedasticity-based ones.
Add religious dummy variables and test joint significance
model2 <- lm(children ~ age + I(age^2) + educ + electric + urban + spirit + protest + catholic, data = fertil2)
# Non-robust F-test
non_robust_f_test <- linearHypothesis(model2, c("protest = 0", "catholic = 0", "spirit = 0"))
cat("Non-robust F-test p-value:\n")
## Non-robust F-test p-value:
print(non_robust_f_test)
##
## Linear hypothesis test:
## protest = 0
## catholic = 0
## spirit = 0
##
## Model 1: restricted model
## Model 2: children ~ age + I(age^2) + educ + electric + urban + spirit +
## protest + catholic
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4352 9176.4
## 2 4349 9162.5 3 13.88 2.1961 0.08641 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Robust F-test
robust_f_test <- linearHypothesis(model2, c("protest = 0", "catholic = 0", "spirit = 0"), vcov = vcovHC(model2, type = "HC1"))
cat("Robust F-test p-value:\n")
## Robust F-test p-value:
print(robust_f_test)
##
## Linear hypothesis test:
## protest = 0
## catholic = 0
## spirit = 0
##
## Model 1: restricted model
## Model 2: children ~ age + I(age^2) + educ + electric + urban + spirit +
## protest + catholic
##
## Note: Coefficient covariance matrix supplied.
##
## Res.Df Df F Pr(>F)
## 1 4352
## 2 4349 3 2.1559 0.0911 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis for both tests is that the coefficients of the three religious dummy variables (protest, catholic, spirit) are jointly equal to zero.
At the 5% significance level (α = 0.05), we fail to reject the null hypothesis because both p-values (0.0864 non-robust, 0.0911 robust) are greater than 0.05.
At the 10% significance level (α = 0.10), both p-values fall just below 0.10, so there is weak evidence that the religious dummies are jointly significant.
While the religious dummies are not statistically significant at the 5% level, they are marginally significant at the 10% level. This means religion might have a weak joint effect on the number of children, but the evidence is not strong.
Test heteroskedasticity with fitted values
fitted_vals <- fitted(model2)
residuals <- residuals(model2)
residuals_sq <- residuals^2
# Auxiliary regression
aux_model <- lm(residuals_sq ~ fitted_vals + I(fitted_vals^2), data = fertil2)
summary(aux_model)
##
## Call:
## lm(formula = residuals_sq ~ fitted_vals + I(fitted_vals^2), data = fertil2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.873 -1.290 -0.302 0.357 47.684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3126 0.1114 2.807 0.00503 **
## fitted_vals -0.1489 0.1018 -1.462 0.14381
## I(fitted_vals^2) 0.2668 0.0196 13.607 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.632 on 4355 degrees of freedom
## Multiple R-squared: 0.2501, Adjusted R-squared: 0.2497
## F-statistic: 726.1 on 2 and 4355 DF, p-value: < 2.2e-16
# Joint significance of fitted values
hetero_test <- linearHypothesis(aux_model, c("fitted_vals = 0", "I(fitted_vals^2) = 0"))
cat("Joint significance of regressors (heteroskedasticity test):\n")
## Joint significance of regressors (heteroskedasticity test):
print(hetero_test)
##
## Linear hypothesis test:
## fitted_vals = 0
## I(fitted_vals^2) = 0
##
## Model 1: restricted model
## Model 2: residuals_sq ~ fitted_vals + I(fitted_vals^2)
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4357 76589
## 2 4355 57436 2 19153 726.11 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is strong evidence of heteroskedasticity in the regression equation for children. The joint significance of the fitted values and their squares in explaining the squared residuals confirms the presence of heteroskedasticity.
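An equivalent summary is the LM form of this test, n times the R-squared of the auxiliary regression, compared with a chi-squared distribution with 2 degrees of freedom (a sketch; it leads to the same conclusion as the F version above):
# LM statistic for the special form of the White test
n <- nobs(aux_model)
lm_stat <- n * summary(aux_model)$r.squared
lm_stat
pchisq(lm_stat, df = 2, lower.tail = FALSE)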
Practical importance of heteroskedasticity
# Check whether the p-value indicates significant heteroskedasticity
# Note: row 1 of the linearHypothesis table is the restricted model and has an
# NA p-value; the test's p-value is in row 2
hetero_p_value <- hetero_test$`Pr(>F)`[2]
if (!is.na(hetero_p_value)) {
  if (hetero_p_value < 0.05) {
    cat("Heteroskedasticity is statistically significant.\n")
  } else {
    cat("Heteroskedasticity is not statistically significant.\n")
  }
} else {
  cat("The p-value for the heteroskedasticity test is missing (NA). Unable to determine significance.\n")
}
## Heteroskedasticity is statistically significant.
print(hetero_test)
##
## Linear hypothesis test:
## fitted_vals = 0
## I(fitted_vals^2) = 0
##
## Model 1: restricted model
## Model 2: residuals_sq ~ fitted_vals + I(fitted_vals^2)
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4357 76589
## 2 4355 57436 2 19153 726.11 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Load necessary libraries
library(wooldridge)
# Load the data
data("infmrt")
infmrt_1990 <- subset(infmrt, year == 1990)
Model with DC dummy
model_dc <- lm(log(infmort) ~ lpcinc + lphysic + lpopul + log(afdcprt) + DC,
data = infmrt_1990)
print("Model with DC dummy:")
## [1] "Model with DC dummy:"
print(summary(model_dc))
##
## Call:
## lm(formula = log(infmort) ~ lpcinc + lphysic + lpopul + log(afdcprt) +
## DC, data = infmrt_1990)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.36232 -0.08058 0.00000 0.10126 0.24078
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.69679 1.36010 2.718 0.00929 **
## lpcinc 0.03388 0.18825 0.180 0.85799
## lphysic -0.38282 0.13766 -2.781 0.00789 **
## lpopul -0.03649 0.08127 -0.449 0.65558
## log(afdcprt) 0.09766 0.07027 1.390 0.17140
## DC 1.29440 0.19709 6.567 4.48e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1365 on 45 degrees of freedom
## Multiple R-squared: 0.5619, Adjusted R-squared: 0.5132
## F-statistic: 11.54 on 5 and 45 DF, p-value: 3.324e-07
Model without DC dummy
model_original <- lm(log(infmort) ~ lpcinc + lphysic + lpopul + log(afdcprt),
data = infmrt_1990)
print("Model without DC dummy:")
## [1] "Model without DC dummy:"
print(summary(model_original))
##
## Call:
## lm(formula = log(infmort) ~ lpcinc + lphysic + lpopul + log(afdcprt),
## data = infmrt_1990)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42525 -0.09194 0.00366 0.11385 0.62067
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4420 1.8760 2.368 0.0222 *
## lpcinc -0.2194 0.2550 -0.860 0.3941
## lphysic 0.1015 0.1609 0.631 0.5313
## lpopul -0.1874 0.1079 -1.737 0.0891 .
## log(afdcprt) 0.1827 0.0956 1.911 0.0623 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1889 on 46 degrees of freedom
## Multiple R-squared: 0.1419, Adjusted R-squared: 0.06734
## F-statistic: 1.902 on 4 and 46 DF, p-value: 0.1261
Compare Coefficients
coef_comparison <- data.frame(
Variable = names(coef(model_original)),
Original = coef(model_original),
With_DC = coef(model_dc)[1:length(coef(model_original))],
Difference = coef(model_dc)[1:length(coef(model_original))] - coef(model_original)
)
print("Coefficient Comparison:")
## [1] "Coefficient Comparison:"
print(coef_comparison)
## Variable Original With_DC Difference
## (Intercept) (Intercept) 4.4419560 3.69679148 -0.74516453
## lpcinc lpcinc -0.2194078 0.03387905 0.25328686
## lphysic lphysic 0.1014942 -0.38281822 -0.48431244
## lpopul lpopul -0.1874178 -0.03649137 0.15092645
## log(afdcprt) log(afdcprt) 0.1826712 0.09766312 -0.08500808
Compare R-squared Values
r2_comparison <- data.frame(
Model = c("Original", "With DC"),
R_squared = c(summary(model_original)$r.squared, summary(model_dc)$r.squared)
)
print("R-squared Comparison:")
## [1] "R-squared Comparison:"
print(r2_comparison)
## Model R_squared
## 1 Original 0.1419499
## 2 With DC 0.5618755
When ceoten^2 and comten^2 are added to the model, the R^2 increases from 0.353 to 0.375. This increase in R^2 indicates that the quadratic terms explain additional variation in the dependent variable, log(salary).
To determine whether this suggests functional form misspecification:
If the omitted quadratic terms (nonlinear relationships) are significant, their exclusion from the original model would lead to biased or inconsistent estimates.
A hypothesis test (e.g., F-test) could formally assess whether the added terms significantly improve the model’s fit.
In this case, the increase in R^2 suggests that the original model may indeed have been misspecified, as it did not account for possible nonlinear relationships involving ceoten and comten.
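The F-test mentioned above can be computed from the two R-squared values. A rough sketch follows; the sample size and regressor count are assumptions (CEOSAL2 is taken to have 177 firms and the unrestricted model 7 regressors), so they should be adjusted to the actual specification:
# R-squared form of the F test for the two added quadratic terms
r2_r  <- 0.353   # restricted R^2 (from the text)
r2_ur <- 0.375   # unrestricted R^2 (from the text)
q <- 2           # number of restrictions: ceoten^2 and comten^2
n <- 177         # assumed sample size (CEOSAL2)
k <- 7           # assumed number of regressors in the unrestricted model
F_stat <- ((r2_ur - r2_r) / q) / ((1 - r2_ur) / (n - k - 1))
F_stat
pf(F_stat, q, n - k - 1, lower.tail = FALSE)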
The issue relates to whether the decision of some colleges not to report campus crimes is correlated with the dependent variable (number of campus crimes) or with other covariates in the model. If the decision to report is independent of crime rates or other factors in the model, this can be considered exogenous sample selection. However:
If colleges with higher crime rates are systematically less likely to report, this creates endogeneity and biased estimates.
If non-reporting is purely random, then the sample selection can be treated as exogenous.
In this case, the likely non-random reporting behavior (e.g., colleges with higher crime rates might deliberately underreport) suggests endogenous sample selection, which would lead to biased estimation and potential misrepresentation of the relationship between student enrollment and campus crimes.
# Load necessary libraries
library(wooldridge) # To access Wooldridge's datasets
library(quantreg) # For LAD (quantile regression)
## Loading required package: SparseM
# Load the data
data("rdchem") # Load the RDCHM dataset
# Transform sales to billions of dollars
rdchem$sales_billion <- rdchem$sales / 1000
OLS estimation with and without the largest firm
# Full sample
ols_full <- lm(rdintens ~ sales_billion + I(sales_billion^2) + profmarg, data = rdchem)
summary(ols_full)
##
## Call:
## lm(formula = rdintens ~ sales_billion + I(sales_billion^2) +
## profmarg, data = rdchem)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0371 -1.1238 -0.4547 0.7165 5.8522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.058967 0.626269 3.288 0.00272 **
## sales_billion 0.316611 0.138854 2.280 0.03041 *
## I(sales_billion^2) -0.007390 0.003716 -1.989 0.05657 .
## profmarg 0.053322 0.044210 1.206 0.23787
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.774 on 28 degrees of freedom
## Multiple R-squared: 0.1905, Adjusted R-squared: 0.1037
## F-statistic: 2.196 on 3 and 28 DF, p-value: 0.1107
# Exclude the firm with sales > $40 billion
rdchem_no_outlier <- subset(rdchem, sales_billion < 40)
ols_no_outlier <- lm(rdintens ~ sales_billion + I(sales_billion^2) + profmarg, data = rdchem_no_outlier)
summary(ols_no_outlier)
##
## Call:
## lm(formula = rdintens ~ sales_billion + I(sales_billion^2) +
## profmarg, data = rdchem_no_outlier)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0371 -1.1238 -0.4547 0.7165 5.8522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.058967 0.626269 3.288 0.00272 **
## sales_billion 0.316611 0.138854 2.280 0.03041 *
## I(sales_billion^2) -0.007390 0.003716 -1.989 0.05657 .
## profmarg 0.053322 0.044210 1.206 0.23787
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.774 on 28 degrees of freedom
## Multiple R-squared: 0.1905, Adjusted R-squared: 0.1037
## F-statistic: 2.196 on 3 and 28 DF, p-value: 0.1107
The OLS estimates are identical with and without the "large firm" subset. Note, however, that because the largest firm's sales are just under $40 billion, the filter sales_billion < 40 does not actually drop it, so the two regressions use exactly the same observations and the comparison says nothing about that firm's influence. The low R^2 suggests the model explains only a small portion of the variation in rdintens.
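If the goal is to be certain the largest firm is excluded, the filter should remove the maximum observation rather than rely on a $40 billion cutoff (a small sketch; results for this alternative sample are not shown here):
# Drop the firm with the largest sales before re-estimating
rdchem_drop_max <- subset(rdchem, sales_billion < max(sales_billion))
# ols_drop_max <- lm(rdintens ~ sales_billion + I(sales_billion^2) + profmarg,
#                    data = rdchem_drop_max)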
LAD estimation with and without the largest firm
# Full sample
lad_full <- rq(rdintens ~ sales_billion + I(sales_billion^2) + profmarg, data = rdchem, tau = 0.5)
summary(lad_full)
##
## Call: rq(formula = rdintens ~ sales_billion + I(sales_billion^2) +
## profmarg, tau = 0.5, data = rdchem)
##
## tau: [1] 0.5
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 1.40428 0.87031 2.66628
## sales_billion 0.26346 -0.13508 0.75753
## I(sales_billion^2) -0.00600 -0.01679 0.00344
## profmarg 0.11400 0.01376 0.16427
# Exclude the firm with sales > $40 billion
lad_no_outlier <- rq(rdintens ~ sales_billion + I(sales_billion^2) + profmarg, data = rdchem_no_outlier, tau = 0.5)
summary(lad_no_outlier)
##
## Call: rq(formula = rdintens ~ sales_billion + I(sales_billion^2) +
## profmarg, tau = 0.5, data = rdchem_no_outlier)
##
## tau: [1] 0.5
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 1.40428 0.87031 2.66628
## sales_billion 0.26346 -0.13508 0.75753
## I(sales_billion^2) -0.00600 -0.01679 0.00344
## profmarg 0.11400 0.01376 0.16427
The coefficients for LAD estimation with and without the outlier are identical:
Coefficients | With Outlier | Without Outlier |
---|---|---|
Intercept | 1.40428 | 1.40428 |
sales_billion | 0.26346 | 0.26346 |
I(sales_billion^2) | -0.00600 | -0.00600 |
profmarg | 0.11400 | 0.11400 |
Analysis:
LAD (Least Absolute Deviations) minimizes the sum of absolute residuals, which makes it robust to outliers.
The inclusion or exclusion of the largest firm has no visible impact on the coefficients here, though (as noted above for OLS) the sales_billion < 40 filter likely keeps that firm in the sample; LAD's robustness comes from giving extreme observations less influence than OLS does.
Compare OLS and LAD results
# Check the coefficients and their sensitivity to outliers
cat("OLS with outlier:\n")
## OLS with outlier:
print(coef(ols_full))
## (Intercept) sales_billion I(sales_billion^2) profmarg
## 2.058966676 0.316610977 -0.007389624 0.053322172
cat("OLS without outlier:\n")
## OLS without outlier:
print(coef(ols_no_outlier))
## (Intercept) sales_billion I(sales_billion^2) profmarg
## 2.058966676 0.316610977 -0.007389624 0.053322172
cat("LAD with outlier:\n")
## LAD with outlier:
print(coef(lad_full))
## (Intercept) sales_billion I(sales_billion^2) profmarg
## 1.404279415 0.263463789 -0.006001127 0.114003661
cat("LAD without outlier:\n")
## LAD without outlier:
print(coef(lad_no_outlier))
## (Intercept) sales_billion I(sales_billion^2) profmarg
## 1.404279415 0.263463789 -0.006001127 0.114003661
# Discussion:
# LAD (Least Absolute Deviations) is typically less sensitive to outliers compared to OLS (Ordinary Least Squares).
# Examine the changes in coefficients for both methods with and without the outlier to draw conclusions.
OLS Results
The coefficients for OLS with and without the outlier are identical:
Coefficients | With Outlier | Without Outlier |
---|---|---|
Intercept | 2.05897 | 2.05897 |
sales_billion | 0.31661 | 0.31661 |
I(sales_billion^2) | -0.00739 | -0.00739 |
profmarg | 0.05332 | 0.05332 |
LAD Results
The coefficients for LAD with and without the outlier are also identical:
Coefficients | With Outlier | Without Outlier |
---|---|---|
Intercept | 1.40428 | 1.40428 |
sales_billion | 0.26346 | 0.26346 |
I(sales_billion^2) | -0.00600 | -0.00600 |
profmarg | 0.11400 | 0.11400 |
Key Observations
In this specific case, both the OLS and LAD coefficients are identical with and without the "outlier" subset. The most likely explanation is that the sales_billion < 40 filter never actually removes the firm with sales just under $40 billion, so both samples are the same; OLS on its own is generally sensitive to extreme values.
LAD remains consistent, which aligns with its theoretical robustness to outliers.
Even though OLS coefficients in this case are identical with and without the outlier, LAD is theoretically more resilient to outliers. Therefore, based on statistical properties and robustness, LAD is still the preferred method in situations where outliers are expected.
# Load the HSEINV dataset
data("hseinv")
# Create the 'time' variable based on 't' (if not already created)
hseinv$time <- hseinv$t
# Create quarterly dummy variables based on the 'time' variable
hseinv$Q1 <- ifelse((hseinv$time %% 4) == 1, 1, 0)
hseinv$Q2 <- ifelse((hseinv$time %% 4) == 2, 1, 0)
hseinv$Q3 <- ifelse((hseinv$time %% 4) == 3, 1, 0)
hseinv$Q4 <- ifelse((hseinv$time %% 4) == 0, 1, 0)
# Model specification
model_housing <- lm(inv ~ price + linv + time + Q1 + Q2 + Q3 + Q4, data = hseinv)
# Summary of the model
summary(model_housing)
##
## Call:
## lm(formula = inv ~ price + linv + time + Q1 + Q2 + Q3 + Q4, data = hseinv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5199 -3055 -1446 1676 14248
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1193475.8 66292.1 -18.003 <2e-16 ***
## price 24945.3 24204.9 1.031 0.310
## linv 111177.9 5407.4 20.560 <2e-16 ***
## time -251.3 174.5 -1.440 0.159
## Q1 672.1 2092.5 0.321 0.750
## Q2 599.0 2101.1 0.285 0.777
## Q3 504.1 2142.7 0.235 0.815
## Q4 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4779 on 35 degrees of freedom
## Multiple R-squared: 0.9801, Adjusted R-squared: 0.9767
## F-statistic: 287.4 on 6 and 35 DF, p-value: < 2.2e-16
# Load the intdef dataset
data("intdef")
# Define a dummy variable for years after 1979
intdef$dummy <- ifelse(intdef$year > 1979, 1, 0)
# Model specification: Interest rate as a function of the dummy variable
model_policy_shift <- lm(i3 ~ dummy, data = intdef)
# Summary of the model
summary(model_policy_shift)
##
## Call:
## lm(formula = i3 ~ dummy, data = intdef)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1979 -1.9449 -0.4569 1.3616 7.8121
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.9259 0.4692 8.367 2.53e-11 ***
## dummy 2.2920 0.7167 3.198 0.00232 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.654 on 54 degrees of freedom
## Multiple R-squared: 0.1592, Adjusted R-squared: 0.1437
## F-statistic: 10.23 on 1 and 54 DF, p-value: 0.002317
# Load the volat dataset
data("volat")
# Dependent variable: rsp500 (S&P 500 monthly returns)
# Explanatory variables: pcip (percentage change in industrial production) and i3 (return on 3-month T-bills)
# Model specification
model_rsp500 <- lm(rsp500 ~ pcip + i3, data = volat)
# Summary of the model
summary(model_rsp500)
##
## Call:
## lm(formula = rsp500 ~ pcip + i3, data = volat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -157.871 -22.580 2.103 25.524 138.137
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.84306 3.27488 5.754 1.44e-08 ***
## pcip 0.03642 0.12940 0.281 0.7785
## i3 -1.36169 0.54072 -2.518 0.0121 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40.13 on 554 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.01189, Adjusted R-squared: 0.008325
## F-statistic: 3.334 on 2 and 554 DF, p-value: 0.03637
Statistical significance: check the p-values of pcip and i3.
pcip is not statistically significant at conventional levels, because its p-value (0.779) is far greater than 0.05.
i3 is statistically significant at the 5% level (p-value 0.012), but not at the 1% level.
# Load required packages
library(wooldridge)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load the 'consump' dataset
data("consump", package = "wooldridge")
# (i) Test the PIH hypothesis: gc = β0 + β1 * gc_1 + ut
model1 <- lm(gc ~ gc_1, data = consump, na.action = na.omit)
cat("C7 - Model 1 Summary:\n")
## C7 - Model 1 Summary:
summary(model1)
##
## Call:
## lm(formula = gc ~ gc_1, data = consump, na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.027878 -0.005974 -0.001450 0.007142 0.020227
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.011431 0.003778 3.026 0.00478 **
## gc_1 0.446133 0.156047 2.859 0.00731 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01161 on 33 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.1985, Adjusted R-squared: 0.1742
## F-statistic: 8.174 on 1 and 33 DF, p-value: 0.007311
# (ii) Add additional variables (gc_1, gy_1, i3, inf):
# gc = β0 + β1 * gc_1 + β2 * gy_1 + β3 * i3 + β4 * inf + ut
model2 <- lm(gc ~ gc_1 + gy_1 + i3 + inf, data = consump, na.action = na.omit)
cat("\nC7 - Model 2 Summary:\n")
##
## C7 - Model 2 Summary:
summary(model2)
##
## Call:
## lm(formula = gc ~ gc_1 + gy_1 + i3 + inf, data = consump, na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0262462 -0.0076662 0.0009066 0.0063953 0.0174975
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.007e-02 5.839e-03 3.438 0.00174 **
## gc_1 5.398e-01 2.604e-01 2.073 0.04689 *
## gy_1 -1.017e-01 1.792e-01 -0.567 0.57463
## i3 -2.523e-05 1.028e-03 -0.025 0.98059
## inf -1.701e-03 8.825e-04 -1.928 0.06341 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01069 on 30 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.3819, Adjusted R-squared: 0.2995
## F-statistic: 4.635 on 4 and 30 DF, p-value: 0.004938
# Perform joint significance test using ANOVA
cat("\nC7 - ANOVA to compare Model 1 and Model 2:\n")
##
## C7 - ANOVA to compare Model 1 and Model 2:
anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: gc ~ gc_1
## Model 2: gc ~ gc_1 + gy_1 + i3 + inf
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 33 0.0044447
## 2 30 0.0034276 3 0.0010171 2.9675 0.04767 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# (iii) Check the p-value for gc_1 in the new model
cat("\nC7 - P-Value for gc_1 in Model 2:\n")
##
## C7 - P-Value for gc_1 in Model 2:
summary(model2)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0200712603 0.0058386779 3.4376379 0.001741936
## gc_1 0.5397639068 0.2604192372 2.0726729 0.046890481
## gy_1 -0.1016586120 0.1791503385 -0.5674486 0.574631669
## i3 -0.0000252261 0.0010282141 -0.0245339 0.980589214
## inf -0.0017011440 0.0008824782 -1.9276895 0.063408242
# (iv) F-statistic and p-value for joint significance in Model 2
cat("\nC7 - F-statistic and p-value for joint significance in Model 2:\n")
##
## C7 - F-statistic and p-value for joint significance in Model 2:
anova(model2)
## Analysis of Variance Table
##
## Response: gc
## Df Sum Sq Mean Sq F value Pr(>F)
## gc_1 1 0.0011009 0.00110089 9.6356 0.004141 **
## gy_1 1 0.0000647 0.00006471 0.5664 0.457558
## i3 1 0.0005279 0.00052785 4.6201 0.039787 *
## inf 1 0.0004246 0.00042456 3.7160 0.063408 .
## Residuals 30 0.0034276 0.00011425
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Load required packages
library(wooldridge)
library(dplyr)
# Load the 'minwage' dataset
data("minwage", package = "wooldridge")
# Handle missing values
minwage_clean <- na.omit(minwage) # Remove rows with missing values
# Create lagged variables (gwage232_1 and gemp232_1)
minwage_clean <- minwage_clean %>%
mutate(
gwage232_1 = lag(gwage232), # Lagged wage growth
gemp232_1 = lag(gemp232) # Lagged employment growth
)
# (i) Check for autocorrelation in gwage232
acf(minwage_clean$gwage232, main = "Autocorrelation of gwage232", na.action = na.pass)
# (ii) Estimate the dynamic model
# gwage232 = β0 + β1 * gwage232_1 + β2 * gmwage + β3 * gcpi + ut
model_dynamic <- lm(gwage232 ~ gwage232_1 + gmwage + gcpi, data = minwage_clean, na.action = na.omit)
cat("\nC12 - Dynamic Model Summary:\n")
##
## C12 - Dynamic Model Summary:
summary(model_dynamic)
##
## Call:
## lm(formula = gwage232 ~ gwage232_1 + gmwage + gcpi, data = minwage_clean,
## na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.044651 -0.004120 -0.001272 0.004487 0.041568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.002371 0.000433 5.475 6.45e-08 ***
## gwage232_1 -0.068681 0.034467 -1.993 0.04676 *
## gmwage 0.151749 0.009519 15.941 < 2e-16 ***
## gcpi 0.257423 0.086571 2.974 0.00306 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.007781 on 594 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3067, Adjusted R-squared: 0.3032
## F-statistic: 87.58 on 3 and 594 DF, p-value: < 2.2e-16
# (iii) Add lagged employment growth (gemp232_1) to the equation
# gwage232 = β0 + β1 * gwage232_1 + β2 * gmwage + β3 * gcpi + β4 * gemp232_1 + ut
model_dynamic_lagged <- lm(gwage232 ~ gwage232_1 + gmwage + gcpi + gemp232_1, data = minwage_clean, na.action = na.omit)
cat("\nC12 - Dynamic Model with Lagged Employment Summary:\n")
##
## C12 - Dynamic Model with Lagged Employment Summary:
summary(model_dynamic_lagged)
##
## Call:
## lm(formula = gwage232 ~ gwage232_1 + gmwage + gcpi + gemp232_1,
## data = minwage_clean, na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.043902 -0.004319 -0.000965 0.004265 0.042429
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0023987 0.0004289 5.593 3.41e-08 ***
## gwage232_1 -0.0657751 0.0341411 -1.927 0.054512 .
## gmwage 0.1525456 0.0094293 16.178 < 2e-16 ***
## gcpi 0.2531434 0.0857361 2.953 0.003276 **
## gemp232_1 0.0606449 0.0169864 3.570 0.000386 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.007705 on 593 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3213, Adjusted R-squared: 0.3167
## F-statistic: 70.17 on 4 and 593 DF, p-value: < 2.2e-16
# (iv) Compare models to check the effect on gmwage coefficient
cat("\nC12 - Comparison of gmwage Coefficient:\n")
##
## C12 - Comparison of gmwage Coefficient:
cat("Without lagged employment:\n")
## Without lagged employment:
summary(model_dynamic)$coefficients["gmwage", ]
## Estimate Std. Error t value Pr(>|t|)
## 1.517485e-01 9.519372e-03 1.594102e+01 6.890581e-48
cat("With lagged employment:\n")
## With lagged employment:
summary(model_dynamic_lagged)$coefficients["gmwage", ]
## Estimate Std. Error t value Pr(>|t|)
## 1.525456e-01 9.429267e-03 1.617789e+01 4.953489e-49
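To make the comparison concrete, the change in the gmwage coefficient when gemp232_1 is added can be computed directly (b_without and b_with are illustrative helper names):
# Sketch: absolute and percentage change in the gmwage coefficient across the two models
b_without <- coef(model_dynamic)["gmwage"]
b_with    <- coef(model_dynamic_lagged)["gmwage"]
c(difference = unname(b_with - b_without),
  percent_change = unname(100 * (b_with - b_without) / b_without))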
# (v) Run regression of gmwage on gwage232_1 and gemp232_1, and report R-squared
model_gmwage <- lm(gmwage ~ gwage232_1 + gemp232_1, data = minwage_clean, na.action = na.omit)
cat("\nC12 - Regression of gmwage Summary:\n")
##
## C12 - Regression of gmwage Summary:
summary(model_gmwage)
##
## Call:
## lm(formula = gmwage ~ gwage232_1 + gemp232_1, data = minwage_clean,
## na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.01984 -0.00511 -0.00385 -0.00290 0.62190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.003491 0.001469 2.376 0.0178 *
## gwage232_1 0.211609 0.147152 1.438 0.1510
## gemp232_1 -0.042848 0.073827 -0.580 0.5619
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0335 on 595 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.004094, Adjusted R-squared: 0.0007463
## F-statistic: 1.223 on 2 and 595 DF, p-value: 0.2951
# Report R-squared
cat("\nC12 - R-squared for Regression of gmwage:\n")
##
## C12 - R-squared for Regression of gmwage:
summary(model_gmwage)$r.squared
## [1] 0.00409388
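The very small R-squared indicates that gwage232_1 and gemp232_1 explain almost none of the variation in gmwage; the overall F test of this auxiliary regression makes the same point and can be pulled out of the summary object (f is an illustrative helper name):
# Sketch: overall F statistic and p-value for the regression of gmwage on the lags
f <- summary(model_gmwage)$fstatistic
c(F = unname(f["value"]),
  p_value = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)))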
# Load required packages
library(wooldridge)
library(tseries)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(dplyr)
# Load the 'nyse' dataset
data("nyse", package = "wooldridge")
# Remove rows with missing data
nyse_clean <- nyse %>% na.omit()
# Create the lagged variable returnt_1, and remove the first row with NA
nyse_clean <- nyse_clean %>%
mutate(returnt_1 = lag(return, n = 1)) %>%
na.omit() # Remove the row with NA due to the lag
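The nyse dataset already ships with a one-period lag of return (the column return_1), so building returnt_1 by hand is largely redundant. Assuming the rows kept by na.omit() are consecutive periods, the two columns should agree; a quick sanity check:
# Sketch: confirm the hand-built lag matches the dataset's return_1
# (assumes the retained rows are consecutive in time)
all.equal(nyse_clean$returnt_1, nyse_clean$return_1)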
# Fit the OLS model
model_ols <- lm(return ~ returnt_1, data = nyse_clean)
# Obtain the squared residuals (ut^2)
nyse_clean <- nyse_clean %>%
mutate(ut_sq = resid(model_ols)^2)
# Summary of the OLS model
cat("\nC11 (i) - OLS Model Summary:\n")
##
## C11 (i) - OLS Model Summary:
summary(model_ols)
##
## Call:
## lm(formula = return ~ returnt_1, data = nyse_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2633 -1.2997 0.1011 1.3203 8.0688
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.17862 0.08084 2.210 0.0275 *
## returnt_1 0.05806 0.03811 1.523 0.1281
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.112 on 686 degrees of freedom
## Multiple R-squared: 0.003372, Adjusted R-squared: 0.001919
## F-statistic: 2.321 on 1 and 686 DF, p-value: 0.1281
# Compute average, minimum, and maximum of ut^2
cat("\nAverage of squared residuals:\n", mean(nyse_clean$ut_sq, na.rm = TRUE))
##
## Average of squared residuals:
## 4.446345
cat("\nMinimum of squared residuals:\n", min(nyse_clean$ut_sq, na.rm = TRUE))
##
## Minimum of squared residuals:
## 3.591354e-06
cat("\nMaximum of squared residuals:\n", max(nyse_clean$ut_sq, na.rm = TRUE))
##
## Maximum of squared residuals:
## 232.9687
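As a cross-check, the average of the squared OLS residuals is RSS/n, which should be close to (slightly below) the squared residual standard error RSS/(n - 2) reported in the model summary:
# Sketch: compare the mean squared residual with the bias-adjusted residual variance
c(mean_u_sq = mean(nyse_clean$ut_sq), sigma_sq_hat = sigma(model_ols)^2)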
# (ii) Fit the heteroskedasticity model:
# Var(ut | returnt_1) = δ0 + δ1 * returnt_1 + δ2 * returnt_1^2
nyse_clean <- nyse_clean %>%
mutate(returnt_1_sq = returnt_1^2)
model_arch <- lm(ut_sq ~ returnt_1 + returnt_1_sq, data = nyse_clean)
cat("\nC11 (ii) - ARCH Model Summary:\n")
##
## C11 (ii) - ARCH Model Summary:
summary(model_arch)
##
## Call:
## lm(formula = ut_sq ~ returnt_1 + returnt_1_sq, data = nyse_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.545 -3.021 -1.974 0.695 221.544
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.25771 0.44131 7.382 4.54e-13 ***
## returnt_1 -0.78556 0.19626 -4.003 6.95e-05 ***
## returnt_1_sq 0.29749 0.03558 8.361 3.46e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.67 on 685 degrees of freedom
## Multiple R-squared: 0.1305, Adjusted R-squared: 0.128
## F-statistic: 51.41 on 2 and 685 DF, p-value: < 2.2e-16
# Report coefficients, standard errors, R-squared, and adjusted R-squared
cat("\nCoefficients of ARCH model:\n")
##
## Coefficients of ARCH model:
print(coef(model_arch))
## (Intercept) returnt_1 returnt_1_sq
## 3.2577065 -0.7855573 0.2974907
cat("\nStandard Errors of ARCH model:\n")
##
## Standard Errors of ARCH model:
print(summary(model_arch)$coefficients[, 2])
## (Intercept) returnt_1 returnt_1_sq
## 0.4413140 0.1962611 0.0355787
cat("\nR-squared of ARCH model:\n", summary(model_arch)$r.squared)
##
## R-squared of ARCH model:
## 0.1305106
cat("\nAdjusted R-squared of ARCH model:\n", summary(model_arch)$adj.r.squared)
##
## Adjusted R-squared of ARCH model:
## 0.127972
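Since this auxiliary regression uses the squared OLS residuals, an LM-style heteroskedasticity test (in the spirit of the Breusch-Pagan/ARCH tests) can be read off it as n times R-squared, compared with a chi-squared distribution with 2 degrees of freedom. A minimal sketch (n_obs and lm_stat are illustrative helper names):
# Sketch: LM statistic n * R^2 from the auxiliary regression and its chi-squared(2) p-value
n_obs <- nobs(model_arch)
lm_stat <- n_obs * summary(model_arch)$r.squared
c(LM = lm_stat, p_value = pchisq(lm_stat, df = 2, lower.tail = FALSE))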
# (iii) Fitted conditional variance as a function of returnt_1
# Conditional variance: δ0 + δ1 * returnt_1 + δ2 * returnt_1^2
nyse_clean <- nyse_clean %>%
mutate(conditional_variance = coef(model_arch)[1] +
coef(model_arch)[2] * returnt_1 +
coef(model_arch)[3] * returnt_1_sq)
# Find the smallest variance and the corresponding returnt-1
smallest_variance <- nyse_clean %>%
filter(conditional_variance == min(conditional_variance, na.rm = TRUE))
cat("\nSmallest variance and corresponding returnt_1:\n")
##
## Smallest variance and corresponding returnt_1:
print(smallest_variance)
## price return return_1 t price_1 price_2 cprice cprice_1 returnt_1
## 1 56.83 1.572835 1.321984 168 55.95 55.22 0.8800011 0.7299995 1.321984
## ut_sq returnt_1_sq conditional_variance
## 1 1.735706 1.747642 2.739119
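Because the fitted variance is a quadratic in returnt_1 with a positive coefficient on the squared term, its minimum can also be found analytically at returnt_1 = -δ1/(2·δ2), with minimum value δ0 - δ1^2/(4·δ2); the search over observed values above should land near this point, and a positive minimum also means no negative fitted variances can occur, consistent with the check below. A minimal sketch (d, r_star, min_var are illustrative helper names):
# Sketch: analytic minimizer and minimum of the fitted quadratic variance function
d <- coef(model_arch)
r_star  <- -d[2] / (2 * d[3])
min_var <- d[1] - d[2]^2 / (4 * d[3])
c(return_at_min = unname(r_star), minimum_variance = unname(min_var))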
# (iv) Check if the model produces negative variance
cat("\nNegative variances produced? ", any(nyse_clean$conditional_variance < 0, na.rm = TRUE))
##
## Negative variances produced? FALSE
# (v) Compare to ARCH(1) model
# Fit an ARCH(1) model using 'garch()' from the 'tseries' package
arch1_model <- garch(nyse_clean$return, order = c(0, 1), trace = FALSE)
cat("\nC11 (v) - ARCH(1) Model Summary:\n")
##
## C11 (v) - ARCH(1) Model Summary:
summary(arch1_model)
##
## Call:
## garch(x = nyse_clean$return, order = c(0, 1), trace = FALSE)
##
## Model:
## GARCH(0,1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5451 -0.5314 0.1544 0.7527 3.3353
##
## Coefficient(s):
## Estimate Std. Error t value Pr(>|t|)
## a0 3.2615 0.2404 13.566 < 2e-16 ***
## a1 0.2623 0.0330 7.949 1.78e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Diagnostic Tests:
## Jarque Bera Test
##
## data: Residuals
## X-squared = 76.247, df = 2, p-value < 2.2e-16
##
##
## Box-Ljung test
##
## data: Squared.Residuals
## X-squared = 0.188, df = 1, p-value = 0.6646
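The garch() fit above estimates the ARCH(1) model by maximum likelihood. A textbook-style alternative is to regress the squared residual on its own first lag by OLS; a sketch under that assumption, reusing ut_sq from part (i) (ut_sq_1 is an illustrative helper name):
# Sketch: OLS-based ARCH(1) check, regressing squared residuals on their first lag
nyse_clean <- nyse_clean %>% mutate(ut_sq_1 = dplyr::lag(ut_sq))
summary(lm(ut_sq ~ ut_sq_1, data = nyse_clean))$coefficients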
# (vi) Add the second lag (ARCH(2) model)
arch2_model <- garch(nyse_clean$return, order = c(0, 2), trace = FALSE)
cat("\nC11 (vi) - ARCH(2) Model Summary:\n")
##
## C11 (vi) - ARCH(2) Model Summary:
summary(arch2_model)
##
## Call:
## garch(x = nyse_clean$return, order = c(0, 2), trace = FALSE)
##
## Model:
## GARCH(0,2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5890 -0.5387 0.1526 0.7527 3.3325
##
## Coefficient(s):
## Estimate Std. Error t value Pr(>|t|)
## a0 2.97841 0.28199 10.562 < 2e-16 ***
## a1 0.26885 0.03258 8.253 2.22e-16 ***
## a2 0.06260 0.05377 1.164 0.244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Diagnostic Tests:
## Jarque Bera Test
##
## data: Residuals
## X-squared = 80.464, df = 2, p-value < 2.2e-16
##
##
## Box-Ljung test
##
## data: Squared.Residuals
## X-squared = 0.2925, df = 1, p-value = 0.5886
# Compare ARCH(1) and ARCH(2)
cat("\nComparison of ARCH(1) and ARCH(2) models:\n")
##
## Comparison of ARCH(1) and ARCH(2) models:
cat("AIC of ARCH(1):", AIC(arch1_model), "\n")
## AIC of ARCH(1): 2929.782
cat("AIC of ARCH(2):", AIC(arch2_model), "\n")
## AIC of ARCH(2): 2921.784
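Since both garch() fits are estimated by maximum likelihood on the same series, the AIC comparison can be supplemented with a likelihood-ratio check of the second ARCH lag. This sketch assumes logLik() returns the maximized log-likelihood for these fits (as the AIC values above suggest), and the single-restriction chi-squared reference is only an approximation here (lr_stat is an illustrative helper name):
# Sketch: likelihood-ratio comparison of ARCH(2) against ARCH(1)
lr_stat <- 2 * (as.numeric(logLik(arch2_model)) - as.numeric(logLik(arch1_model)))
c(LR = lr_stat, p_value = pchisq(lr_stat, df = 1, lower.tail = FALSE))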