library(dplyr)
library(wooldridge)
library(car)
library(quantreg)
data("sleep75")
summary(lm(sleep ~ totwrk + educ + age + agesq + male,data = sleep75))
##
## Call:
## lm(formula = sleep ~ totwrk + educ + age + agesq + male, data = sleep75)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2378.00 -243.29 6.74 259.24 1350.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3840.83197 235.10870 16.336 <2e-16 ***
## totwrk -0.16342 0.01813 -9.013 <2e-16 ***
## educ -11.71332 5.86689 -1.997 0.0463 *
## age -8.69668 11.20746 -0.776 0.4380
## agesq 0.12844 0.13390 0.959 0.3378
## male 87.75243 34.32616 2.556 0.0108 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 417.7 on 700 degrees of freedom
## Multiple R-squared: 0.1228, Adjusted R-squared: 0.1165
## F-statistic: 19.59 on 5 and 700 DF, p-value: < 2.2e-16
Yes. Holding the other factors fixed, men are estimated to sleep about 87.75 minutes per week more than women. The p-value for male is 0.0108, so the difference is statistically significant at the 5% level (though not at the 1% level).
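Yes, there is a significant tradeoff. The coefficient on totwrk is -0.16342. Because sleep and totwrk are both measured in minutes per week, each additional minute of work is associated with about 0.16 fewer minutes of sleep, so one more hour of work per week lowers predicted sleep by roughly 0.163 × 60 ≈ 9.8 minutes. The p-value is extremely small (< 2e-16), indicating strong evidence of a tradeoff.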
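To test whether age affects sleep, test the joint null hypothesis that the coefficients on age and agesq are both zero: estimate the restricted model that excludes age and agesq and compare it with the unrestricted model above using an F test with 2 and 700 degrees of freedom.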
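A minimal sketch of that joint test, assuming the wooldridge package is installed and using the same sleep75 data. The object names unrestricted, restricted, and age_test are illustrative and do not come from the original solution.

```r
library(wooldridge)
data("sleep75")

# Unrestricted model: includes age and its square
unrestricted <- lm(sleep ~ totwrk + educ + age + agesq + male, data = sleep75)

# Restricted model: imposes H0 (coefficients on age and agesq are both zero)
restricted <- lm(sleep ~ totwrk + educ + male, data = sleep75)

# F test of the two exclusion restrictions
age_test <- anova(restricted, unrestricted)
age_test
```

If the p-value in the last column exceeds the chosen significance level, the data give no evidence that age affects sleep once the other factors are controlled for.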
data("gpa2")
summary(lm(sat ~ hsize + hsizesq + female + black + female:black,data = gpa2))
##
## Call:
## lm(formula = sat ~ hsize + hsizesq + female + black + female:black,
## data = gpa2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -570.45 -89.54 -5.24 85.41 479.13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1028.0972 6.2902 163.445 < 2e-16 ***
## hsize 19.2971 3.8323 5.035 4.97e-07 ***
## hsizesq -2.1948 0.5272 -4.163 3.20e-05 ***
## female -45.0915 4.2911 -10.508 < 2e-16 ***
## black -169.8126 12.7131 -13.357 < 2e-16 ***
## female:black 62.3064 18.1542 3.432 0.000605 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 133.4 on 4131 degrees of freedom
## Multiple R-squared: 0.08578, Adjusted R-squared: 0.08468
## F-statistic: 77.52 on 5 and 4131 DF, p-value: < 2.2e-16
Since the p-value is very small (p-value<0.001), there is strong evidence that hsize^2 should be included in the model. This indicates that the relationship between SAT scores and high school size is nonlinear.
To find the optimal high school size: The optimal size is determined by finding the turning point of the quadratic equation:
\(SAT = \beta_0+\beta_1hsize+\beta_2hsizesq\)
The turning point is given by:
\(hsize_\text{optimal} = \frac{-\beta_1}{2\times\beta_2}\)
Substituting the values:
\(hsize_\text{optimal} = \frac{-19.2971}{2\times(-2.1948)} \approx 4.4\)
The optimal high school size is approximately 440 students (since hsize is measured in hundreds).
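The turning-point arithmetic can be checked with a few lines of R. This is only a sketch that copies the two coefficients from the regression output above; the variable names are illustrative.

```r
# Coefficients copied from the regression output above
b_hsize   <- 19.2971
b_hsizesq <- -2.1948

# Turning point of the quadratic: -b1 / (2 * b2)
hsize_opt <- -b_hsize / (2 * b_hsizesq)

# hsize is measured in hundreds of students
students_opt <- 100 * hsize_opt

# A negative coefficient on the squared term means the turning point is a maximum
is_maximum <- b_hsizesq < 0

round(c(hsize_opt = hsize_opt, students_opt = students_opt), 1)
```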
The difference in SAT scores between nonblack females and nonblack males is captured by the coefficient on female: nonblack females score, on average, 45.09 points lower than nonblack males, holding other factors constant. The p-value for the female coefficient is very small, indicating that this difference is highly statistically significant.
The difference in SAT scores between nonblack males and black males is captured by the coefficient on black: black males score, on average, 169.81 points lower than nonblack males, holding other factors constant. The very small p-value for this coefficient indicates that the difference is statistically significant.
The difference in SAT scores between black females and nonblack females is the sum of the coefficients on black and female:black:
\(\text{Difference} = \beta_\text{black}+\beta_\text{female:black}\)
\(\text{Difference} = -169.8126 + 62.3064 = -107.5062\)
Black females score, on average, 107.51 points lower than nonblack females, holding other factors constant.
To test the significance of this difference, perform a hypothesis test for the sum of the coefficients:
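One way to carry out this test is to build the standard error of the sum from the estimated covariance matrix, since Var(b1 + b2) = Var(b1) + Var(b2) + 2 Cov(b1, b2). The sketch below assumes the wooldridge package is installed; the object names (sat_fit, diff_est, and so on) are illustrative and not part of the original solution.

```r
library(wooldridge)
data("gpa2")

# Same regression as above
sat_fit <- lm(sat ~ hsize + hsizesq + female + black + female:black, data = gpa2)

b <- coef(sat_fit)
V <- vcov(sat_fit)
terms_used <- c("black", "female:black")

# Point estimate of beta_black + beta_female:black
diff_est <- sum(b[terms_used])

# Summing the 2x2 covariance block gives Var(b1) + Var(b2) + 2*Cov(b1, b2)
diff_se <- sqrt(sum(V[terms_used, terms_used]))

# Two-sided t test of H0: beta_black + beta_female:black = 0
t_stat <- diff_est / diff_se
p_val  <- 2 * pt(abs(t_stat), df = df.residual(sat_fit), lower.tail = FALSE)

c(estimate = diff_est, se = diff_se, t = t_stat, p = p_val)
```

The same hypothesis could also be tested with linearHypothesis() from the car package, which is already loaded at the top of the document.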
data("gpa1")
summary(lm(colGPA ~ PC + hsGPA + ACT + mothcoll + fathcoll ,data = gpa1))
##
## Call:
## lm(formula = colGPA ~ PC + hsGPA + ACT + mothcoll + fathcoll,
## data = gpa1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78149 -0.25726 -0.02121 0.24691 0.74432
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.255554 0.335392 3.744 0.000268 ***
## PC 0.151854 0.058716 2.586 0.010762 *
## hsGPA 0.450220 0.094280 4.775 4.61e-06 ***
## ACT 0.007724 0.010678 0.723 0.470688
## mothcoll -0.003758 0.060270 -0.062 0.950376
## fathcoll 0.041800 0.061270 0.682 0.496265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3344 on 135 degrees of freedom
## Multiple R-squared: 0.2222, Adjusted R-squared: 0.1934
## F-statistic: 7.713 on 5 and 135 DF, p-value: 2.083e-06
The estimated effect of PC ownership remains positive (0.151854) and is still statistically significant (p-value = 0.010762), meaning that students who own a PC tend to have higher GPAs on average, after accounting for the other variables in the model.
full_model <- lm(colGPA ~ PC + hsGPA + ACT + mothcoll + fathcoll, data = gpa1)
reduced_model <- lm(colGPA ~ PC + hsGPA + ACT, data = gpa1)
anova(reduced_model, full_model)
## Analysis of Variance Table
##
## Model 1: colGPA ~ PC + hsGPA + ACT
## Model 2: colGPA ~ PC + hsGPA + ACT + mothcoll + fathcoll
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 137 15.149
## 2 135 15.094 2 0.054685 0.2446 0.7834
The p-value for the test is 0.7834, indicating that there is no statistically significant evidence that mothcoll and fathcoll jointly affect college GPA. Based on this F test, we conclude that mothcoll and fathcoll do not contribute significantly to explaining the variation in college GPA.
gpa1_iii <- gpa1 %>%
mutate(hsGPAsq = hsGPA^2)
summary(lm(colGPA ~ PC + hsGPA + ACT + mothcoll + fathcoll + hsGPAsq,data = gpa1_iii))
##
## Call:
## lm(formula = colGPA ~ PC + hsGPA + ACT + mothcoll + fathcoll +
## hsGPAsq, data = gpa1_iii)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78998 -0.24327 -0.00648 0.26179 0.72231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.040328 2.443038 2.063 0.0410 *
## PC 0.140446 0.058858 2.386 0.0184 *
## hsGPA -1.802520 1.443552 -1.249 0.2140
## ACT 0.004786 0.010786 0.444 0.6580
## mothcoll 0.003091 0.060110 0.051 0.9591
## fathcoll 0.062761 0.062401 1.006 0.3163
## hsGPAsq 0.337341 0.215711 1.564 0.1202
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3326 on 134 degrees of freedom
## Multiple R-squared: 0.2361, Adjusted R-squared: 0.2019
## F-statistic: 6.904 on 6 and 134 DF, p-value: 2.088e-06
Adding hsGPAsq (high school GPA squared) to the model does not significantly improve the fit. The p-value for hsGPAsq is 0.1202, which is greater than 0.05, indicating it is not statistically significant.
data("wage2")
model1 <- lm(log(wage) ~ educ + exper + tenure + married + black + south + urban, data = wage2)
summary(model1)
##
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure + married + black +
## south + urban, data = wage2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.98069 -0.21996 0.00707 0.24288 1.22822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.395497 0.113225 47.653 < 2e-16 ***
## educ 0.065431 0.006250 10.468 < 2e-16 ***
## exper 0.014043 0.003185 4.409 1.16e-05 ***
## tenure 0.011747 0.002453 4.789 1.95e-06 ***
## married 0.199417 0.039050 5.107 3.98e-07 ***
## black -0.188350 0.037667 -5.000 6.84e-07 ***
## south -0.090904 0.026249 -3.463 0.000558 ***
## urban 0.183912 0.026958 6.822 1.62e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3655 on 927 degrees of freedom
## Multiple R-squared: 0.2526, Adjusted R-squared: 0.2469
## F-statistic: 44.75 on 7 and 927 DF, p-value: < 2.2e-16
Holding other factors fixed, the approximate difference in monthly salary between blacks and nonblacks is -18.8% (100 times the coefficient of -0.188350). In other words, black workers earn approximately 18.8% less per month than nonblack workers, holding other factors fixed. The p-value (6.84e-07) indicates that this difference is statistically significant.
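Because the dependent variable is log(wage), 100 times the coefficient is only an approximation to the percentage difference; the exact figure is 100 × (exp(b) - 1). A short sketch using the black coefficient from the output above (the variable names are illustrative):

```r
# Coefficient on black from the log(wage) regression above
b_black <- -0.188350

# Log-point approximation to the percentage difference
approx_pct <- 100 * b_black

# Exact percentage difference implied by a log-level model
exact_pct <- 100 * (exp(b_black) - 1)

round(c(approx_pct = approx_pct, exact_pct = exact_pct), 2)
```

The exact differential is somewhat smaller in magnitude than the approximation, which is typical for a negative coefficient of this size.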
model2 <- lm(log(wage) ~ educ + exper + tenure + married + black + south + urban + I(exper^2) + I(tenure^2), data = wage2)
summary(model2)
##
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure + married + black +
## south + urban + I(exper^2) + I(tenure^2), data = wage2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.98236 -0.21972 -0.00036 0.24078 1.25127
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3586756 0.1259143 42.558 < 2e-16 ***
## educ 0.0642761 0.0063115 10.184 < 2e-16 ***
## exper 0.0172146 0.0126138 1.365 0.172665
## tenure 0.0249291 0.0081297 3.066 0.002229 **
## married 0.1985470 0.0391103 5.077 4.65e-07 ***
## black -0.1906636 0.0377011 -5.057 5.13e-07 ***
## south -0.0912153 0.0262356 -3.477 0.000531 ***
## urban 0.1854241 0.0269585 6.878 1.12e-11 ***
## I(exper^2) -0.0001138 0.0005319 -0.214 0.830622
## I(tenure^2) -0.0007964 0.0004710 -1.691 0.091188 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3653 on 925 degrees of freedom
## Multiple R-squared: 0.255, Adjusted R-squared: 0.2477
## F-statistic: 35.17 on 9 and 925 DF, p-value: < 2.2e-16
anova(model2, lm(log(wage) ~ educ + exper + tenure + married + black + south + urban, data = wage2))
## Analysis of Variance Table
##
## Model 1: log(wage) ~ educ + exper + tenure + married + black + south +
## urban + I(exper^2) + I(tenure^2)
## Model 2: log(wage) ~ educ + exper + tenure + married + black + south +
## urban
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 925 123.42
## 2 927 123.82 -2 -0.39756 1.4898 0.226
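The F statistic in the table can be reproduced from the residual sums of squares, which makes the logic of the joint test explicit. The sketch below copies the numbers from the ANOVA output above; the variable names are illustrative.

```r
# Values copied from the ANOVA table above
ssr_diff <- 0.39756  # restricted RSS minus unrestricted RSS ("Sum of Sq")
rss_ur   <- 123.42   # RSS of the unrestricted model (with both quadratics)
q        <- 2        # number of restrictions: I(exper^2) and I(tenure^2)
df_ur    <- 925      # residual degrees of freedom of the unrestricted model

# F = [(RSS_r - RSS_ur) / q] / [RSS_ur / df_ur]
f_stat  <- (ssr_diff / q) / (rss_ur / df_ur)
p_value <- pf(f_stat, q, df_ur, lower.tail = FALSE)

round(c(F = f_stat, p = p_value), 4)
```

With F of about 1.49 and a p-value of about 0.226, the squared terms in exper and tenure are jointly insignificant even at the 20% level.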
model3 <- lm(log(wage) ~ educ + exper + tenure + married + black + south + urban + educ*black, data = wage2)
summary(model3)
##
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure + married + black +
## south + urban + educ * black, data = wage2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.97782 -0.21832 0.00475 0.24136 1.23226
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.374817 0.114703 46.859 < 2e-16 ***
## educ 0.067115 0.006428 10.442 < 2e-16 ***
## exper 0.013826 0.003191 4.333 1.63e-05 ***
## tenure 0.011787 0.002453 4.805 1.80e-06 ***
## married 0.198908 0.039047 5.094 4.25e-07 ***
## black 0.094809 0.255399 0.371 0.710561
## south -0.089450 0.026277 -3.404 0.000692 ***
## urban 0.183852 0.026955 6.821 1.63e-11 ***
## educ:black -0.022624 0.020183 -1.121 0.262603
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3654 on 926 degrees of freedom
## Multiple R-squared: 0.2536, Adjusted R-squared: 0.2471
## F-statistic: 39.32 on 8 and 926 DF, p-value: < 2.2e-16
anova(model3, lm(log(wage) ~ educ + exper + tenure + married + black + south + urban, data = wage2))
## Analysis of Variance Table
##
## Model 1: log(wage) ~ educ + exper + tenure + married + black + south +
## urban + educ * black
## Model 2: log(wage) ~ educ + exper + tenure + married + black + south +
## urban
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 926 123.65
## 2 927 123.82 -1 -0.16778 1.2565 0.2626
The coefficient on educ:black is -0.0226 with a p-value of 0.263, so the return to education does not differ significantly by race in these data.
model4 <- lm(log(wage) ~ educ + exper + tenure + married + black + south + urban + married:black, data = wage2)
summary(model4)
##
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure + married + black +
## south + urban + married:black, data = wage2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.98013 -0.21780 0.01057 0.24219 1.22889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.403793 0.114122 47.351 < 2e-16 ***
## educ 0.065475 0.006253 10.471 < 2e-16 ***
## exper 0.014146 0.003191 4.433 1.04e-05 ***
## tenure 0.011663 0.002458 4.745 2.41e-06 ***
## married 0.188915 0.042878 4.406 1.18e-05 ***
## black -0.240820 0.096023 -2.508 0.012314 *
## south -0.091989 0.026321 -3.495 0.000497 ***
## urban 0.184350 0.026978 6.833 1.50e-11 ***
## married:black 0.061354 0.103275 0.594 0.552602
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3656 on 926 degrees of freedom
## Multiple R-squared: 0.2528, Adjusted R-squared: 0.2464
## F-statistic: 39.17 on 8 and 926 DF, p-value: < 2.2e-16
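Holding other factors constant, the estimated wage differential between married blacks and married nonblacks is the sum of the coefficients on black and married:black: -0.2408 + 0.0614 = -0.1795, so married blacks are estimated to earn about 17.9% less than married nonblacks. The interaction coefficient itself (0.0614, p-value 0.5526) measures how the black-nonblack differential differs between married and unmarried workers, and it is not statistically significant. Testing whether the differential among married workers is different from zero requires a test of the sum of the two coefficients.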
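The differential between married blacks and married nonblacks combines the black coefficient with the married:black interaction, because both groups share the married coefficient. A small sketch of that arithmetic, using the estimates from the output above (variable names are illustrative):

```r
# Estimates copied from the model4 output above
b_black         <- -0.240820
b_married_black <-  0.061354

# Married blacks vs. married nonblacks, in log points
diff_log <- b_black + b_married_black

# Approximate and exact percentage differentials
approx_pct <- 100 * diff_log
exact_pct  <- 100 * (exp(diff_log) - 1)

round(c(diff_log = diff_log, approx_pct = approx_pct, exact_pct = exact_pct), 3)
```

A standard error for this sum needs the covariance between the two estimates, which the summary table does not report; it can be obtained from vcov(model4).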
No. Inconsistency arises from correlation between the error and the regressors, caused for example by omitted variables or measurement error in the regressors; conditional heteroskedasticity does not cause it.
Yes. For the F statistic to have a Fisher-F distribution under the null, we require both A.MLR5 (conditional homoskedasticity) and A.MLR6 (conditionally normal errors).
Yes. B means best, i.e., OLS is the most efficient (minimum variance) estimator among all linear unbiased estimators.
The heteroskedasticity-robust standard errors are generally similar to the usual ones, with slight differences (e.g., for age, restaurn, and white), indicating minor heteroskedasticity.
The coefficient on educ is -0.029. If education increases by 4 years, the estimated probability of smoking decreases by 4 × 0.029 = 0.116, that is, by about 11.6 percentage points.
At the turning point, the marginal effect of age is zero:
\(age=\frac{0.02}{2\times0.00026}\approx38.46\) years
After approximately age 38, the probability of smoking begins to decrease.
The coefficient on restaurn is -0.101. This indicates that living in a state with restaurant smoking restrictions reduces the probability of smoking by approximately 10.1 percentage points, holding other factors fixed.
Substitute into the equation:
\(\widehat{smokes}=0.656-0.069\times\log(67.44)+0.012\times\log(6500)-0.029\times16+0.02\times77-0.00026\times77^2-0.101\times0-0.026\times0\approx0.0052\)
The predicted smoking probability is about 0.0052 (roughly 0.5%), which indicates a very low likelihood of smoking for this individual and aligns with the actual value of 0 for smokes for this person.
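The substitution can be verified numerically. The sketch below uses the coefficients that appear in the equation, with the educ coefficient of -0.029 quoted earlier; the labels lcigpric and lincome for the two logged regressors are illustrative names, not taken from the text.

```r
# Coefficients from the estimated linear probability model
coefs <- c(intercept = 0.656, lcigpric = -0.069, lincome = 0.012,
           educ = -0.029, age = 0.020, agesq = -0.00026,
           restaurn = -0.101, white = -0.026)

# Regressor values for this person, in the same order as coefs
x <- c(1, log(67.44), log(6500), 16, 77, 77^2, 0, 0)

# Predicted probability of smoking
p_hat <- sum(coefs * x)
round(p_hat, 4)
```

Note that log() in R is the natural logarithm, which matches the log terms in the model.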
data("vote1")
modelc4 <- lm(voteA ~ prtystrA + democA + lexpendA + lexpendB,data = vote1)
summary(modelc4)
##
## Call:
## lm(formula = voteA ~ prtystrA + democA + lexpendA + lexpendB,
## data = vote1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.576 -4.864 -1.146 4.903 24.566
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.66142 4.73604 7.952 2.56e-13 ***
## prtystrA 0.25192 0.07129 3.534 0.00053 ***
## democA 3.79294 1.40652 2.697 0.00772 **
## lexpendA 5.77929 0.39182 14.750 < 2e-16 ***
## lexpendB -6.23784 0.39746 -15.694 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.573 on 168 degrees of freedom
## Multiple R-squared: 0.8012, Adjusted R-squared: 0.7964
## F-statistic: 169.2 on 4 and 168 DF, p-value: < 2.2e-16
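The R-squared is 0 when the OLS residuals are regressed on all of the regressors, because the residuals are uncorrelated with every regressor (and have a sample mean of zero) by construction in OLS.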
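This property holds for any OLS fit, so it can be illustrated without the election data. The sketch below uses the built-in mtcars data purely as a stand-in (it is not part of the original exercise); the object names are illustrative.

```r
# Any OLS fit with an intercept will do; mtcars ships with base R
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Regress the residuals on the same regressors
aux <- lm(resid(fit) ~ wt + hp, data = mtcars)

# R-squared of the auxiliary regression (zero up to floating-point error)
r2_aux <- summary(aux)$r.squared
r2_aux

# The residuals are orthogonal to each regressor and sum to zero
ortho <- crossprod(model.matrix(fit), resid(fit))
ortho
```

Replacing mtcars and the formula with vote1 and the modelc4 specification should give the same result for the regression above.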
# Breusch-Pagan test
library(lmtest)
bptest(modelc4)
##
## studentized Breusch-Pagan test
##
## data: modelc4
## BP = 9.0934, df = 4, p-value = 0.05881
The p-value (0.0588) is slightly above 0.05, so homoskedasticity is not rejected at the 5% level but is rejected at the 10% level. The evidence of heteroskedasticity is therefore weak.
# White test for heteroskedasticity
bptest(modelc4, ~ prtystrA + democA + log(expendA) + log(expendB) +
I(prtystrA^2) + I(democA^2) + I(log(expendA)^2) + I(log(expendB)^2), data = vote1)
##
## studentized Breusch-Pagan test
##
## data: modelc4
## BP = 19.581, df = 7, p-value = 0.00655
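The p-value (0.0066) is well below 0.05, providing strong evidence of heteroskedasticity. This version of the White test adds the squares of the regressors to the auxiliary regression (cross-product terms are not included here), so it can detect nonlinear forms of heteroskedasticity that the Breusch-Pagan test misses. The test has 7 rather than 8 degrees of freedom because democA is a dummy variable, so its square is identical to democA itself.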
data("fertil2")
library(sandwich)
modelc13_1 <- lm(children ~ age + agesq + educ + electric + urban, data = fertil2)
# Usual standard errors
summary(modelc13_1)
##
## Call:
## lm(formula = children ~ age + agesq + educ + electric + urban,
## data = fertil2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9012 -0.7136 -0.0039 0.7119 7.4318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.2225162 0.2401888 -17.580 < 2e-16 ***
## age 0.3409255 0.0165082 20.652 < 2e-16 ***
## agesq -0.0027412 0.0002718 -10.086 < 2e-16 ***
## educ -0.0752323 0.0062966 -11.948 < 2e-16 ***
## electric -0.3100404 0.0690045 -4.493 7.20e-06 ***
## urban -0.2000339 0.0465062 -4.301 1.74e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.452 on 4352 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.5734, Adjusted R-squared: 0.5729
## F-statistic: 1170 on 5 and 4352 DF, p-value: < 2.2e-16
# Robust standard errors
coeftest(modelc13_1, vcov = vcovHC(modelc13_1, type = "HC1"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.22251623 0.24385099 -17.3160 < 2.2e-16 ***
## age 0.34092552 0.01917466 17.7800 < 2.2e-16 ***
## agesq -0.00274121 0.00035051 -7.8206 6.549e-15 ***
## educ -0.07523232 0.00630771 -11.9270 < 2.2e-16 ***
## electric -0.31004041 0.06394815 -4.8483 1.289e-06 ***
## urban -0.20003386 0.04547093 -4.3992 1.113e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The robust standard errors are not always larger than the nonrobust ones: for the electric and urban variables, the robust standard errors are smaller.
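The comparison is easier to see as a ratio of robust to usual standard errors. The sketch below copies the slope standard errors from the two outputs above; the variable names are illustrative.

```r
# Usual (nonrobust) standard errors from summary(modelc13_1)
se_usual <- c(age = 0.0165082, agesq = 0.0002718, educ = 0.0062966,
              electric = 0.0690045, urban = 0.0465062)

# Heteroskedasticity-robust (HC1) standard errors from coeftest()
se_robust <- c(age = 0.01917466, agesq = 0.00035051, educ = 0.00630771,
               electric = 0.06394815, urban = 0.04547093)

# Ratios above 1 mean the robust standard error is larger
ratio <- se_robust / se_usual
round(ratio, 3)

# Variables whose robust standard error is smaller than the usual one
smaller <- names(ratio)[ratio < 1]
smaller
```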
# Add religious dummies
modelc13_2 <- lm(children ~ age + agesq + educ + electric + urban + spirit + protest + catholic, data = fertil2)
# Nonrobust test
linearHypothesis(modelc13_2, c("spirit = 0", "protest = 0", "catholic = 0"))
##
## Linear hypothesis test:
## spirit = 0
## protest = 0
## catholic = 0
##
## Model 1: restricted model
## Model 2: children ~ age + agesq + educ + electric + urban + spirit + protest +
## catholic
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4352 9176.4
## 2 4349 9162.5 3 13.88 2.1961 0.08641 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value for the nonrobust test is 0.08641.
# Robust test
linearHypothesis(modelc13_2, c("spirit = 0", "protest = 0", "catholic = 0"), vcov = vcovHC(modelc13_2, type = "HC1"))
##
## Linear hypothesis test:
## spirit = 0
## protest = 0
## catholic = 0
##
## Model 1: restricted model
## Model 2: children ~ age + agesq + educ + electric + urban + spirit + protest +
## catholic
##
## Note: Coefficient covariance matrix supplied.
##
## Res.Df Df F Pr(>F)
## 1 4352
## 2 4349 3 2.1559 0.0911 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value for the robust test is 0.0911.
Both p-values exceed 0.05, so the three religious dummy variables are not jointly significant at the 5% level under either test, although they are jointly significant at the 10% level.
# Obtain fitted values and residuals
fitted_vals <- fitted(modelc13_2)
residuals_sq <- residuals(modelc13_2)^2
# Regress u^2 on fitted values and fitted values squared
hetero_test <- lm(residuals_sq ~ fitted_vals + I(fitted_vals^2))
# Joint significance test for heteroskedasticity
linearHypothesis(hetero_test, c("fitted_vals = 0", "I(fitted_vals^2) = 0"))
##
## Linear hypothesis test:
## fitted_vals = 0
## I(fitted_vals^2) = 0
##
## Model 1: restricted model
## Model 2: residuals_sq ~ fitted_vals + I(fitted_vals^2)
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4357 76589
## 2 4355 57436 2 19153 726.11 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The small p-value indicates overwhelming evidence against the null hypothesis that the coefficients on fitted_vals and I(fitted_vals^2) are jointly zero. This result confirms that heteroskedasticity is present in the model.
The test confirms heteroskedasticity, but its practical importance depends on whether robust standard errors significantly change inference. If robust and non-robust results align, the impact is minimal; otherwise, it requires correction.
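The same auxiliary regression also yields the LM form of this test, LM = n × R-squared, which is compared with a chi-squared distribution with 2 degrees of freedom. The sketch below recovers the R-squared from the two residual sums of squares in the table above (the restricted model contains only an intercept, so its RSS is the total sum of squares); variable names are illustrative.

```r
# Values copied from the linearHypothesis() table above
rss_restricted   <- 76589   # intercept-only model: total sum of squares
rss_unrestricted <- 57436   # model with fitted values and their squares
n <- 4355 + 3               # residual df plus three estimated parameters

# R-squared of the auxiliary regression
r2_aux <- 1 - rss_unrestricted / rss_restricted

# LM statistic and its chi-squared p-value (2 restrictions)
lm_stat <- n * r2_aux
p_value <- pchisq(lm_stat, df = 2, lower.tail = FALSE)

c(r2_aux = r2_aux, lm_stat = lm_stat, p_value = p_value)
```

Both the F and LM forms lead to the same conclusion here: the null of homoskedasticity is rejected decisively.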
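Adding \(ceoten^2\) and \(comten^2\) increases \(R^2\) from 0.353 to 0.375. Because \(R^2\) never falls when regressors are added, the increase alone does not establish misspecification; a joint F test of the two quadratic terms is needed. If they are jointly significant, that indicates functional form misspecification in the original model.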
The failure of some colleges to report crimes in 1992 may not be exogenous since underreporting could be correlated with unobserved factors like college policies or safety standards. This creates a sample selection issue.
data("infmrt")
infmrt_90 <- infmrt %>%
filter(year == 1990)
summary(lm(infmort ~ lpcinc + lphysic + lpopul + DC, data = infmrt_90))
##
## Call:
## lm(formula = infmort ~ lpcinc + lphysic + lpopul + DC, data = infmrt_90)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4964 -0.8076 0.0000 0.9358 2.6077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.9548 12.4195 1.929 0.05994 .
## lpcinc -0.5669 1.6412 -0.345 0.73134
## lphysic -2.7418 1.1908 -2.303 0.02588 *
## lpopul 0.6292 0.1911 3.293 0.00191 **
## DC 16.0350 1.7692 9.064 8.43e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.246 on 46 degrees of freedom
## Multiple R-squared: 0.691, Adjusted R-squared: 0.6641
## F-statistic: 25.71 on 4 and 46 DF, p-value: 3.146e-11
The coefficient on the DC dummy (16.035, p < 0.001) is highly significant: holding income, physicians, and population fixed, the infant mortality rate in DC is about 16 deaths per 1,000 live births higher than the model would otherwise predict.
summary(lm(infmort ~ lpcinc + lphysic + lpopul, data = infmrt_90))
##
## Call:
## lm(formula = infmort ~ lpcinc + lphysic + lpopul, data = infmrt_90)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0811 -1.2064 -0.0521 1.0639 7.9589
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.85931 20.42785 1.658 0.10408
## lpcinc -4.68466 2.60412 -1.799 0.07845 .
## lphysic 4.15326 1.51266 2.746 0.00853 **
## lpopul -0.08782 0.28725 -0.306 0.76116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.058 on 47 degrees of freedom
## Multiple R-squared: 0.1391, Adjusted R-squared: 0.08413
## F-statistic: 2.531 on 3 and 47 DF, p-value: 0.06841
Including DC raises R-squared from 0.139 to 0.691, a large improvement in model fit. It also changes the lphysic coefficient from positive to negative, makes lpopul significant, and renders lpcinc insignificant, showing how strongly the DC observation, an outlier, influences the estimates.
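data("rdchem")
# Rescale sales from millions to billions of dollars.
# Note: salessq is divided by 1000 (not 1000^2), so it is no longer the square
# of the rescaled sales; this only changes the scale of its coefficient.
rdchem_c5 <- rdchem %>%
mutate(sales = sales/1000) %>%
mutate(salessq = salessq/1000)
# Drop the largest firm (the only one with rescaled sales above 39)
rdchem_without <- rdchem_c5 %>%
filter(sales < 39)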
#With largest firm
summary(lm(rdintens ~ sales + salessq + profmarg, data = rdchem_c5))
##
## Call:
## lm(formula = rdintens ~ sales + salessq + profmarg, data = rdchem_c5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0371 -1.1238 -0.4547 0.7165 5.8522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.059e+00 6.263e-01 3.288 0.00272 **
## sales 3.166e-01 1.389e-01 2.280 0.03041 *
## salessq -7.390e-06 3.716e-06 -1.989 0.05657 .
## profmarg 5.332e-02 4.421e-02 1.206 0.23787
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.774 on 28 degrees of freedom
## Multiple R-squared: 0.1905, Adjusted R-squared: 0.1037
## F-statistic: 2.196 on 3 and 28 DF, p-value: 0.1107
#Without largest firm
summary(lm(rdintens ~ sales + salessq + profmarg, data = rdchem_without))
##
## Call:
## lm(formula = rdintens ~ sales + salessq + profmarg, data = rdchem_without)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0843 -1.1354 -0.5505 0.7570 5.7783
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.984e+00 7.176e-01 2.764 0.0102 *
## sales 3.606e-01 2.389e-01 1.510 0.1427
## salessq -1.025e-05 1.308e-05 -0.784 0.4401
## profmarg 5.528e-02 4.579e-02 1.207 0.2378
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.805 on 27 degrees of freedom
## Multiple R-squared: 0.1912, Adjusted R-squared: 0.1013
## F-statistic: 2.128 on 3 and 27 DF, p-value: 0.1201
When the largest firm is included, sales has a significant positive effect on R&D intensity (coefficient = 0.3166, p-value = 0.030), while the quadratic term salessq is marginally significant (p-value = 0.057). Without the largest firm, the sales coefficient (0.3606) becomes insignificant (p-value = 0.143), and the quadratic term is also insignificant (p-value = 0.440). The profit margin (profmarg) is not significant in either model.
The largest firm drives the significance of sales in the model.
Quadratic sales relationship weakens without the largest firm.
Profit margin remains insignificant in both models.
#With largest firm
summary(rq(rdintens ~ sales + salessq + profmarg, data = rdchem_c5))
##
## Call: rq(formula = rdintens ~ sales + salessq + profmarg, data = rdchem_c5)
##
## tau: [1] 0.5
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 1.40428 0.87031 2.66628
## sales 0.26346 -0.13508 0.75753
## salessq -0.00001 -0.00002 0.00000
## profmarg 0.11400 0.01376 0.16427
#Without largest firm
summary(rq(rdintens ~ sales + salessq + profmarg, data = rdchem_without))
##
## Call: rq(formula = rdintens ~ sales + salessq + profmarg, data = rdchem_without)
##
## tau: [1] 0.5
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 2.61047 0.58936 2.81404
## sales -0.22364 -0.23542 0.87607
## salessq 0.00002 -0.00003 0.00003
## profmarg 0.07594 0.00578 0.16392
The intercept is noticeably lower in the model with the largest firm (1.404 vs. 2.610), although the two confidence intervals overlap substantially.
The sales coefficient changes from positive (0.263) with the largest firm to negative (-0.224) without it, but both confidence intervals include zero, so the coefficient is statistically insignificant in both cases.
The quadratic term (salessq) shows minimal effect in both models. Its sign flips from negative to positive when the largest firm is excluded, though the confidence intervals for both models include zero.
The point estimate for profmarg is larger in the model with the largest firm (0.114 vs. 0.076). In both models the confidence interval excludes zero, so profit margin has a significant positive effect on median R&D intensity.
In conclusion, excluding the largest firm changes the sign of the sales coefficient and raises the intercept, but the confidence intervals are wide, so these changes are imprecisely estimated.
LAD is generally more resilient to outliers than OLS because it minimizes the sum of absolute residuals rather than squared residuals, so extreme observations receive less weight. Here, the statistical significance of the OLS sales coefficient depends on the largest firm, whereas the LAD sales coefficient is insignificant with or without it. The LAD point estimates still move when the firm is dropped, which is not surprising in a sample of only 32 firms.
Disagree: Time series observations often exhibit autocorrelation, violating independence.
Agree: Under the first three Gauss-Markov assumptions, OLS remains unbiased.
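Disagree: A trending variable can be used as the dependent variable. To avoid a spurious regression, include a time trend in the model (or detrend the variables), especially when the regressors are also trending.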
Agree: Seasonality issues are minimized when using annual time series data.
\(\text{housing_starts}_t=\beta_0+\beta_1\text{interest_rate}_t+\beta_2\text{per_capita_income}_t+\beta_3time_t+u_t\)
Where:
\(\text{housing_starts}_t\): Number of housing starts at time t
\(\text{interest_rate}_t\): Interest rate at time t
\(\text{per_capita_income}_t\): Real per capita income at time t
\(time_t\): Time trend variable to account for trends over time
\(u_t\): Error term capturing unexplained variation at time t
data("intdef")
intdef_c1 <- intdef %>%
mutate(after1979 = ifelse(year>1979,1,0))
summary(lm(i3 ~ inf + def + after1979, data = intdef_c1))
##
## Call:
## lm(formula = i3 ~ inf + def + after1979, data = intdef_c1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4674 -0.8407 0.2388 1.0148 3.9654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.29623 0.42535 3.047 0.00362 **
## inf 0.60842 0.07625 7.979 1.37e-10 ***
## def 0.36266 0.12025 3.016 0.00396 **
## after1979 1.55877 0.50577 3.082 0.00329 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.711 on 52 degrees of freedom
## Multiple R-squared: 0.6635, Adjusted R-squared: 0.6441
## F-statistic: 34.18 on 3 and 52 DF, p-value: 2.408e-12
Yes, there is a significant shift in the interest rate equation around 1979. The coefficient on after1979 is 1.55877 with a p-value of 0.00329, indicating that, holding inflation and the deficit constant, the 3-month T-bill rate was on average about 1.56 percentage points higher after 1979. This is consistent with an effect of the policy change. Additionally, inflation (inf) and the federal deficit (def) both significantly affect the interest rate, with higher inflation and larger deficits associated with higher rates.
\(\beta_1\) should have a positive sign and \(\beta_2\) should have a negative sign.
data("volat")
summary(lm(rsp500 ~ pcip + i3, data = volat))
##
## Call:
## lm(formula = rsp500 ~ pcip + i3, data = volat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -157.871 -22.580 2.103 25.524 138.137
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.84306 3.27488 5.754 1.44e-08 ***
## pcip 0.03642 0.12940 0.281 0.7785
## i3 -1.36169 0.54072 -2.518 0.0121 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40.13 on 554 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.01189, Adjusted R-squared: 0.008325
## F-statistic: 3.334 on 2 and 554 DF, p-value: 0.03637
Intercept (18.84306): When both predictors (pcip and i3) are zero, the expected return on the S&P 500 (rsp500) is 18.84 percent.
pcip (0.03642): For a one-percentage-point increase in the percentage change in industrial production (pcip), the return on the S&P 500 (rsp500) is expected to increase by 0.036 percentage points, holding i3 constant. The effect is very small and statistically insignificant (p-value = 0.7785).
i3 (-1.36169): For a one-percentage-point increase in the 3-month T-bill rate (i3), the return on the S&P 500 is expected to decrease by about 1.36 percentage points, holding pcip constant. This effect is statistically significant at the 5% level (p-value = 0.0121).
No, the model has a very low R-squared (0.01189), indicating that the predictors explain only a tiny portion of the variation in the S&P 500 returns. The statistical significance of i3 suggests it has a small, but potentially useful, effect, but overall, the predictability of S&P 500 returns from these variables is weak.
data("consump")
modelc7_i <- lm(gc[3:37] ~ gc_1[3:37], data = consump)
summary(modelc7_i)
##
## Call:
## lm(formula = gc[3:37] ~ gc_1[3:37], data = consump)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.027878 -0.005974 -0.001450 0.007142 0.020227
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.011431 0.003778 3.026 0.00478 **
## gc_1[3:37] 0.446133 0.156047 2.859 0.00731 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01161 on 33 degrees of freedom
## Multiple R-squared: 0.1985, Adjusted R-squared: 0.1742
## F-statistic: 8.174 on 1 and 33 DF, p-value: 0.007311
Null Hypothesis (H₀): \(\beta_1=0\)
The null hypothesis suggests that there is no relationship between \(gc_t\) and its lagged value \(gc_\text{t-1}\), which would be consistent
with the Permanent Income Hypothesis (PIH). According
to PIH, the growth rate of consumption at time t should be
independent of past consumption growth, implying that past consumption
growth does not provide useful information for predicting current
consumption growth.
Alternative Hypothesis (H₁): \(\beta_1\ne0\)
The alternative hypothesis suggests that past consumption growth \(gc_\text{t-1}\) is significantly related to
current consumption growth \(gc_t\).
This would indicate that consumption growth follows a pattern over time,
contradicting the PIH.
The coefficient for \(gc_\text{t-1}\) is significant and positive, with a value of 0.44613, suggesting that past consumption growth is a significant predictor of current consumption growth. This rejects the null hypothesis that \(\beta_1=0\), which means the data does not support the Permanent Income Hypothesis (PIH), where consumption growth should be independent of past consumption growth.
The p-value for \(\beta_1\) is 0.00731, which is less than 0.05, indicating strong evidence against the null hypothesis.
Thus, we conclude that there is significant autocorrelation in consumption growth, and the PIH does not hold for this data.
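The t-test behind this conclusion can be reproduced by hand. The following is a minimal sketch that uses the coefficient, standard error, and residual degrees of freedom exactly as printed in the output above; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the summary of modelc7_i above
b1  <- 0.446133   # estimate on lagged consumption growth
se1 <- 0.156047   # its standard error
df  <- 33         # residual degrees of freedom

# t-statistic for H0: beta_1 = 0 and its two-sided p-value
t_stat <- b1 / se1
p_val  <- 2 * pt(-abs(t_stat), df)

# 5% two-sided critical value for comparison
crit <- qt(0.975, df)

c(t_stat = t_stat, p_val = p_val, crit = crit)
```

Because the absolute t-statistic exceeds the critical value, the null hypothesis is rejected at the 5% level, in line with the p-value reported by `summary()`.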
modelc7_ii <- lm(gc[3:37] ~ gc_1[3:37] + gy_1[3:37] + i3[2:36] + inf[2:36], data = consump)
summary(modelc7_ii)
##
## Call:
## lm(formula = gc[3:37] ~ gc_1[3:37] + gy_1[3:37] + i3[2:36] +
## inf[2:36], data = consump)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0249090 -0.0075867 0.0000855 0.0087231 0.0188620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0225944 0.0070892 3.187 0.00335 **
## gc_1[3:37] 0.4335777 0.2896546 1.497 0.14487
## gy_1[3:37] -0.1079113 0.1946394 -0.554 0.58340
## i3[2:36] -0.0007467 0.0011107 -0.672 0.50653
## inf[2:36] -0.0008281 0.0010041 -0.825 0.41606
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01134 on 30 degrees of freedom
## Multiple R-squared: 0.3038, Adjusted R-squared: 0.211
## F-statistic: 3.273 on 4 and 30 DF, p-value: 0.02431
The p-values for \(gy_\text{t-1},i3_\text{t-1}, inf_\text{t-1}\) are 0.58, 0.51, and 0.42 respectively. This indicates that these new variables are not individually significant at the 5% level.
anova(modelc7_i, modelc7_ii)
## Analysis of Variance Table
##
## Model 1: gc[3:37] ~ gc_1[3:37]
## Model 2: gc[3:37] ~ gc_1[3:37] + gy_1[3:37] + i3[2:36] + inf[2:36]
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 33 0.0044447
## 2 30 0.0038609 3 0.00058386 1.5122 0.2315
The p-value (0.2315) is greater than 0.05, so we fail to reject the null hypothesis that the coefficients on the additional variables (\(gy_\text{t-1},i3_\text{t-1}, inf_\text{t-1}\)) are jointly zero.
The additional variables \(gy_\text{t-1},i3_\text{t-1}, inf_\text{t-1}\) are not jointly significant at the 5% level in explaining the growth of per capita consumption.
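The F-statistic reported by `anova()` can be recovered from the two residual sums of squares. This is a small sketch using the RSS values and degrees of freedom as printed in the table above; the variable names are assumptions made for illustration.

```r
# Values copied from the anova table above
rss_r <- 0.0044447   # restricted model (gc_1 only)
rss_u <- 0.0038609   # unrestricted model (adds gy_1, i3, inf)
q     <- 3           # number of restrictions tested
df_u  <- 30          # residual degrees of freedom, unrestricted model

# F-statistic for the joint exclusion restrictions and its p-value
f_stat <- ((rss_r - rss_u) / q) / (rss_u / df_u)
p_val  <- pf(f_stat, q, df_u, lower.tail = FALSE)

c(f_stat = f_stat, p_val = p_val)
```

Up to rounding of the printed RSS values, this should agree with the F and Pr(>F) columns of the table.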
The p-value for \(gc_\text{t-1}\) rises to 0.14487, so it is no longer individually significant at the 5% level once \(gy_\text{t-1},i3_\text{t-1}, inf_\text{t-1}\) are included. The point estimate barely changes (0.434 versus 0.446), but its standard error nearly doubles (0.290 versus 0.156), which is consistent with \(gc_\text{t-1}\) being correlated with the added regressors. The loss of individual significance therefore reflects less precise estimation rather than a smaller effect, and it should not be read as support for the PIH, which implies that current consumption growth cannot be predicted from any variable dated t-1 or earlier.
modelc7_iv <- lm(gc[3:37] ~ 1, data = consump)
anova(modelc7_iv, modelc7_ii)
## Analysis of Variance Table
##
## Model 1: gc[3:37] ~ 1
## Model 2: gc[3:37] ~ gc_1[3:37] + gy_1[3:37] + i3[2:36] + inf[2:36]
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 34 0.0055456
## 2 30 0.0038609 4 0.0016848 3.2728 0.02431 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value (0.02431) is less than 0.05, we reject the null hypothesis and conclude that the four explanatory variables (\(gc_\text{t-1},gy_\text{t-1},i3_\text{t-1}, inf_\text{t-1}\)) are jointly significant at the 5% level.
This is evidence against the PIH: under the PIH, current consumption growth should not be predictable from information available at t-1, yet the lagged variables jointly have predictive power.
data("minwage")
minwage232 <- minwage %>%
select(gwage232, gemp232, gmwage, gcpi) %>%
na.omit()
acf(minwage232$gwage232)
The ACF plot suggests that the gwage232 series is weakly dependent.
summary(lm(gwage232[2:nrow(minwage232)] ~ gwage232[1:(nrow(minwage232)-1)] + gmwage[2:nrow(minwage232)] + gcpi[2:nrow(minwage232)], data = minwage232))
##
## Call:
## lm(formula = gwage232[2:nrow(minwage232)] ~ gwage232[1:(nrow(minwage232) -
## 1)] + gmwage[2:nrow(minwage232)] + gcpi[2:nrow(minwage232)],
## data = minwage232)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.044642 -0.004134 -0.001312 0.004482 0.041612
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0024003 0.0004308 5.572 3.79e-08 ***
## gwage232[1:(nrow(minwage232) - 1)] -0.0779092 0.0342851 -2.272 0.02341 *
## gmwage[2:nrow(minwage232)] 0.1518459 0.0096485 15.738 < 2e-16 ***
## gcpi[2:nrow(minwage232)] 0.2630876 0.0824457 3.191 0.00149 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.007889 on 606 degrees of freedom
## Multiple R-squared: 0.2986, Adjusted R-squared: 0.2951
## F-statistic: 85.99 on 3 and 606 DF, p-value: < 2.2e-16
The regression results indicate that an increase in the federal minimum wage leads to a contemporaneous increase in gwage232: the coefficient on gmwage is 0.1518 with a t-statistic of 15.74, and the very small p-value (< 2e-16) supports this conclusion.
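The index-shifted formula used above is correct but hard to read. As a sketch of an alternative, the lag can be stored as its own column so that the formula names it directly. The example below uses simulated data rather than `minwage`; the names `toy`, `y`, `x`, and `y_lag1` are hypothetical and chosen only for illustration.

```r
set.seed(1)
n <- 200

# Simulated series: y depends on its own first lag and on x
x <- rnorm(n)
y <- numeric(n)
for (t in 2:n) {
  y[t] <- 0.3 * y[t - 1] + 0.5 * x[t] + rnorm(1, sd = 0.1)
}
toy <- data.frame(y = y, x = x)

# Approach used in this document: shift the index ranges inside the formula
fit_idx <- lm(y[2:n] ~ y[1:(n - 1)] + x[2:n], data = toy)

# Alternative: build an explicit lag column (first value is NA)
toy$y_lag1 <- c(NA, toy$y[-n])
fit_lag <- lm(y ~ y_lag1 + x, data = toy)  # lm drops the NA row by default

# Both fits use the same observations, so the estimates coincide
cbind(index_shift = unname(coef(fit_idx)), lag_column = unname(coef(fit_lag)))
```

The explicit column gives shorter coefficient labels in the summary output and makes it harder to misalign the dependent variable and its lag.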
summary(lm(gwage232[2:nrow(minwage232)] ~ gwage232[1:(nrow(minwage232)-1)] + gmwage[2:nrow(minwage232)] + gcpi[2:nrow(minwage232)] + gemp232[1:(nrow(minwage232)-1)] , data = minwage232))
##
## Call:
## lm(formula = gwage232[2:nrow(minwage232)] ~ gwage232[1:(nrow(minwage232) -
## 1)] + gmwage[2:nrow(minwage232)] + gcpi[2:nrow(minwage232)] +
## gemp232[1:(nrow(minwage232) - 1)], data = minwage232)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.043842 -0.004378 -0.001034 0.004321 0.042548
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.002451 0.000426 5.753 1.4e-08 ***
## gwage232[1:(nrow(minwage232) - 1)] -0.074546 0.033901 -2.199 0.028262 *
## gmwage[2:nrow(minwage232)] 0.152707 0.009540 16.007 < 2e-16 ***
## gcpi[2:nrow(minwage232)] 0.252296 0.081544 3.094 0.002066 **
## gemp232[1:(nrow(minwage232) - 1)] 0.066131 0.016962 3.899 0.000108 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.007798 on 605 degrees of freedom
## Multiple R-squared: 0.3158, Adjusted R-squared: 0.3112
## F-statistic: 69.8 on 4 and 605 DF, p-value: < 2.2e-16
The coefficient on lagged gemp232 is 0.0661 with a t-statistic of 3.899 (p-value 0.000108), so it is statistically significant at the 1% level.
summary(lm(gwage232[2:nrow(minwage232)] ~ gmwage[2:nrow(minwage232)] + gcpi[2:nrow(minwage232)], data = minwage232))
##
## Call:
## lm(formula = gwage232[2:nrow(minwage232)] ~ gmwage[2:nrow(minwage232)] +
## gcpi[2:nrow(minwage232)], data = minwage232)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.044464 -0.004095 -0.001352 0.004545 0.041188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0021904 0.0004222 5.188 2.9e-07 ***
## gmwage[2:nrow(minwage232)] 0.1505574 0.0096648 15.578 < 2e-16 ***
## gcpi[2:nrow(minwage232)] 0.2427430 0.0822388 2.952 0.00328 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.007916 on 607 degrees of freedom
## Multiple R-squared: 0.2926, Adjusted R-squared: 0.2903
## F-statistic: 125.5 on 2 and 607 DF, p-value: < 2.2e-16
The estimated coefficients on gmwage in the models with and without the lagged variables are 0.152707 and 0.1505574, respectively. Adding the two lagged variables has little effect on the gmwage coefficient.
summary(lm(gmwage[2:nrow(minwage232)] ~ gwage232[1:(nrow(minwage232)-1)] + gemp232[1:(nrow(minwage232)-1)] , data = minwage232))
##
## Call:
## lm(formula = gmwage[2:nrow(minwage232)] ~ gwage232[1:(nrow(minwage232) -
## 1)] + gemp232[1:(nrow(minwage232) - 1)], data = minwage232)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.01914 -0.00500 -0.00379 -0.00287 0.62208
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.003433 0.001440 2.384 0.0174 *
## gwage232[1:(nrow(minwage232) - 1)] 0.203167 0.143140 1.419 0.1563
## gemp232[1:(nrow(minwage232) - 1)] -0.041706 0.072110 -0.578 0.5632
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03318 on 607 degrees of freedom
## Multiple R-squared: 0.00392, Adjusted R-squared: 0.0006377
## F-statistic: 1.194 on 2 and 607 DF, p-value: 0.3036
The R-squared from regressing gmwage on lagged gwage232 and lagged gemp232 is 0.00392, so the lagged variables explain almost none of the variation in gmwage. Because gmwage is nearly uncorrelated with the two lagged variables, adding them to the regression barely changes the gmwage coefficient.
data("nyse")
modelc11_i <- lm(return ~ return_1, data = nyse)
residuals <- modelc11_i$residuals
squared_residuals <- residuals^2
# Calculate the average, minimum, and maximum of squared residuals
avg_squared_residual <- mean(squared_residuals, na.rm = TRUE)
min_squared_residual <- min(squared_residuals, na.rm = TRUE)
max_squared_residual <- max(squared_residuals, na.rm = TRUE)
# Output the results
avg_squared_residual
## [1] 4.440839
min_squared_residual
## [1] 7.35465e-06
max_squared_residual
## [1] 232.8946
nyse_ii <- nyse %>%
select(return, return_1) %>%
na.omit() %>%
mutate(return_1_sq = return_1^2)
nyse_ii$residual_sq <- squared_residuals
summary(lm(residual_sq ~ return_1 + return_1_sq, data = nyse_ii))
##
## Call:
## lm(formula = residual_sq ~ return_1 + return_1_sq, data = nyse_ii)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.459 -3.011 -1.975 0.676 221.469
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.25734 0.44085 7.389 4.32e-13 ***
## return_1 -0.78946 0.19569 -4.034 6.09e-05 ***
## return_1_sq 0.29666 0.03552 8.351 3.75e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.66 on 686 degrees of freedom
## Multiple R-squared: 0.1303, Adjusted R-squared: 0.1278
## F-statistic: 51.4 on 2 and 686 DF, p-value: < 2.2e-16
delta_0 <- 3.25734
delta_1 <- -0.78946
delta_2 <- 0.29666
f = function(x) {
delta_0 + delta_1*x + delta_2*x^2
}
x = 0:20
plot(x, f(x), type = 'l')
abline(h = 0)
abline(v = 0)
find.vertex = function(delta_2, delta_1, delta_0) {
x_vertex = -delta_1/(2 * delta_2)
y_vertex = f(x_vertex)
c(x_vertex, y_vertex)
}
V = find.vertex(delta_2, delta_1, delta_0)
V
## [1] 1.33058 2.73212
When \(return_\text{t-1}\) is 1.33058, the predicted variance is at its smallest, 2.73212.
Since the smallest predicted variance is 2.73212, which is positive, the model never produces a negative variance estimate.
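As a cross-check on the vertex calculation, the minimum can also be located numerically and the sign of the quadratic confirmed through its discriminant. This is a minimal sketch using the coefficient estimates printed above; the function name `h` and the search interval are assumptions made for illustration.

```r
# Coefficients copied from the regression of squared residuals above
d0 <- 3.25734
d1 <- -0.78946
d2 <- 0.29666

# Fitted conditional variance as a function of lagged return
h <- function(x) d0 + d1 * x + d2 * x^2

# Analytic minimum of an upward-opening parabola
x_min <- -d1 / (2 * d2)
h_min <- h(x_min)

# Numerical minimum over an assumed search interval
num <- optimize(h, interval = c(-10, 10))

# Negative discriminant: the parabola never crosses zero
disc <- d1^2 - 4 * d2 * d0

c(x_min = x_min, h_min = h_min, numeric_x = num$minimum, disc = disc)
```

A negative discriminant together with a positive leading coefficient means the fitted variance is positive for every value of the lagged return, not only over the plotted range.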
The R-squared for the model in part (ii) is 0.1303, higher than the R-squared of the ARCH(1) model (0.114). This suggests that the model in part (ii) fits slightly better than the ARCH(1) model.
summary(lm(residual_sq[3:689] ~ residual_sq[2:688] + residual_sq[1:687], data = nyse_ii))
##
## Call:
## lm(formula = residual_sq[3:689] ~ residual_sq[2:688] + residual_sq[1:687],
## data = nyse_ii)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.934 -3.298 -2.158 0.600 224.296
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.82950 0.45495 6.219 8.69e-10 ***
## residual_sq[2:688] 0.32284 0.03820 8.450 < 2e-16 ***
## residual_sq[1:687] 0.04179 0.03820 1.094 0.274
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.76 on 684 degrees of freedom
## Multiple R-squared: 0.1151, Adjusted R-squared: 0.1125
## F-statistic: 44.47 on 2 and 684 DF, p-value: < 2.2e-16
The p-value for the coefficient on the second lag is 0.274, so it is not statistically significant. The R-squared of this ARCH(2) model is 0.1151, lower than that of the model in part (ii), so it does not fit better than the model in part (ii).