Chapter 7

Chap 7.1

We have the estimated model: sleep = 3840.83 - 0.163*totwrk - 11.71*educ - 8.70*age + 0.128*age^2 + 87.75*male

From the model, the coefficient on male is 87.75, which means that, holding the other factors fixed, men sleep about 87.75 more minutes per week than women.

2*(1-pt(87.75/34.33,700))  # two-sided p-value for male (t distribution, 700 df)
## [1] 0.01079613

The t statistic for this variable is 87.75/34.33 = 2.56 and the p-value is 0.011; so at the 5% significance level, the evidence of a sleep difference between men and women is strong.

2*(1-pt(0.163/0.018,700))  # two-sided p-value for totwrk
## [1] 0

The p-value for totwrk is essentially 0, so there is a statistically significant tradeoff between working and sleeping. The estimated tradeoff is 0.163: each additional minute of work reduces sleep by about 0.163 minutes, or roughly 10 minutes of sleep per additional hour of work.

The other regression we need to run, to test the null hypothesis that age has no effect on sleeping, is the restricted model:

sleep ~ totwrk + educ + male

Null hypothesis: H0: beta(age) = 0, beta(age^2) = 0. The restricted and unrestricted models are then compared with an F test; see the sketch below.
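
A minimal sketch of this F test, assuming the sleep75 data from the wooldridge package (the data behind this equation):

library(wooldridge)
model_71_ur <- lm(sleep ~ totwrk + educ + age + I(age^2) + male, data = sleep75)
model_71_r <- lm(sleep ~ totwrk + educ + male, data = sleep75)
anova(model_71_r, model_71_ur)  # F test of H0: beta(age) = 0, beta(age^2) = 0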

Chap 7.3

2*(1-pt(2.19/0.53,4131))  # two-sided p-value for hsize^2
## [1] 3.666238e-05

The p-value for hsize^2 is about 0.00004, which is statistically significant. Hence, there is strong evidence that hsize^2 should be included in the model.

The optimal high school size maximizes sat where the derivative with respect to hsize is zero: 19.30 - 2(2.19)*hsize = 0, so hsize = 19.30/(2*2.19) = 4.406393 (hundreds), or roughly 441 students.
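
A quick arithmetic check of this turning point:

19.3 / (2 * 2.19)  # = 4.406393 hundreds, i.e., about 441 students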

Non-black females are represented by female = 1, black = 0, and non-black males by female = 0, black = 0. Hence, the difference between these two groups is the coefficient on female, which is -45.09: non-black females score, on average, about 45.09 points lower on the SAT than non-black males.

2*(1-pt(45.09/4.29,4131))
## [1] 0

The p-value is nearly 0, so this estimated difference is statistically significant.

Non-black males are represented by female = 0, black = 0, and black males by female = 0, black = 1. Hence, the difference between these two groups is the coefficient on black, which is -169.81: black males score, on average, about 169.81 points lower on the SAT than non-black males.

2*(1-pt(169.81/12.71,4131))
## [1] 0

The p-value is nearly 0, so this estimated difference is statistically significant.

Non-black females are represented by female = 1, black = 0, and black females by female = 1, black = 1. The difference between these two groups is therefore -(beta(black) + beta(female*black)) = -(-169.81 + 62.31) = 169.81 - 62.31 = 107.5: non-black females score, on average, about 107.5 points higher on the SAT than black females.

This difference involves two coefficients, so a single t test from the original regression cannot measure its significance. To test it, we can define dummy variables for three of the four gender-race groups, with non-black females as the base group, and read the estimate and its t statistic off the coefficient on the black-female dummy.
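
A minimal sketch of this reparameterization, assuming the gpa2 dataset from the wooldridge package (the data behind this SAT equation):

library(wooldridge)
data_73 <- gpa2
# dummies for three of the four groups; non-black females are the base group
data_73$nonblack_male <- as.numeric(data_73$female == 0 & data_73$black == 0)
data_73$black_male    <- as.numeric(data_73$female == 0 & data_73$black == 1)
data_73$black_female  <- as.numeric(data_73$female == 1 & data_73$black == 1)
model_73 <- lm(sat ~ hsize + I(hsize^2) + nonblack_male + black_male + black_female,
               data = data_73)
summary(model_73)  # the t statistic on black_female tests the difference directly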

Chap 7.C1

From the model below, we can see that two variables are significant: PC and hsGPA. Compared with the result estimated in (7.6), the PC coefficient decreases slightly from 0.157 to 0.152, and it is still statistically significant.

data_7C1 <- gpa1
model_7C1 <- lm(colGPA~PC+hsGPA+ACT+mothcoll+fathcoll,data= data_7C1)
summary(model_7C1)
## 
## Call:
## lm(formula = colGPA ~ PC + hsGPA + ACT + mothcoll + fathcoll, 
##     data = data_7C1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78149 -0.25726 -0.02121  0.24691  0.74432 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.255554   0.335392   3.744 0.000268 ***
## PC           0.151854   0.058716   2.586 0.010762 *  
## hsGPA        0.450220   0.094280   4.775 4.61e-06 ***
## ACT          0.007724   0.010678   0.723 0.470688    
## mothcoll    -0.003758   0.060270  -0.062 0.950376    
## fathcoll     0.041800   0.061270   0.682 0.496265    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3344 on 135 degrees of freedom
## Multiple R-squared:  0.2222, Adjusted R-squared:  0.1934 
## F-statistic: 7.713 on 5 and 135 DF,  p-value: 2.083e-06

The F statistic for the joint significance of mothcoll and fathcoll is 0.24, far below the 5% critical value of about 3.06; hence, we cannot reject H0, i.e., these variables are jointly insignificant.

model_7C1_ur <- lm(colGPA~PC+hsGPA+ACT,data= data_7C1)
summary(model_7C1_ur)
## 
## Call:
## lm(formula = colGPA ~ PC + hsGPA + ACT, data = data_7C1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7901 -0.2622 -0.0107  0.2334  0.7570 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.263520   0.333125   3.793 0.000223 ***
## PC          0.157309   0.057287   2.746 0.006844 ** 
## hsGPA       0.447242   0.093647   4.776 4.54e-06 ***
## ACT         0.008659   0.010534   0.822 0.412513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3325 on 137 degrees of freedom
## Multiple R-squared:  0.2194, Adjusted R-squared:  0.2023 
## F-statistic: 12.83 on 3 and 137 DF,  p-value: 1.932e-07
F= ((0.2222-0.2194)/2)/((1-0.2222)/135)
F
## [1] 0.2429931
critical_value = qf(0.95,2,135)
critical_value
## [1] 3.063204
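
Equivalently, the same F test can be obtained directly from the two fitted models, without rounding the R-squared values (a sketch using anova() on nested models):

anova(model_7C1_ur, model_7C1)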

This generalization is not needed: after adding hsGPA^2, neither hsGPA (p = 0.214) nor hsGPA^2 (p = 0.120) is individually significant in the model.

model_7C1_3 <- lm(colGPA~PC+hsGPA+I(hsGPA^2)+ACT+mothcoll+fathcoll,data= data_7C1)
summary(model_7C1_3)
## 
## Call:
## lm(formula = colGPA ~ PC + hsGPA + I(hsGPA^2) + ACT + mothcoll + 
##     fathcoll, data = data_7C1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78998 -0.24327 -0.00648  0.26179  0.72231 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  5.040328   2.443038   2.063   0.0410 *
## PC           0.140446   0.058858   2.386   0.0184 *
## hsGPA       -1.802520   1.443552  -1.249   0.2140  
## I(hsGPA^2)   0.337341   0.215711   1.564   0.1202  
## ACT          0.004786   0.010786   0.444   0.6580  
## mothcoll     0.003091   0.060110   0.051   0.9591  
## fathcoll     0.062761   0.062401   1.006   0.3163  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3326 on 134 degrees of freedom
## Multiple R-squared:  0.2361, Adjusted R-squared:  0.2019 
## F-statistic: 6.904 on 6 and 134 DF,  p-value: 2.088e-06

Chap 7.C2

The estimated difference between black and non-black workers is -0.18835, which means that, holding the other factors fixed, black workers earn approximately 18.8% less per month than non-black workers (the exact difference implied by the log specification is 100*(exp(-0.188) - 1), about -17.2%). This difference is statistically significant, with a p-value near 0.

data_7C2 <- wage2
model_7C2 <- lm(log(wage)~educ+exper+tenure+married+black+south+urban,data= data_7C2)
summary(model_7C2)
## 
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure + married + black + 
##     south + urban, data = data_7C2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.98069 -0.21996  0.00707  0.24288  1.22822 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.395497   0.113225  47.653  < 2e-16 ***
## educ         0.065431   0.006250  10.468  < 2e-16 ***
## exper        0.014043   0.003185   4.409 1.16e-05 ***
## tenure       0.011747   0.002453   4.789 1.95e-06 ***
## married      0.199417   0.039050   5.107 3.98e-07 ***
## black       -0.188350   0.037667  -5.000 6.84e-07 ***
## south       -0.090904   0.026249  -3.463 0.000558 ***
## urban        0.183912   0.026958   6.822 1.62e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3655 on 927 degrees of freedom
## Multiple R-squared:  0.2526, Adjusted R-squared:  0.2469 
## F-statistic: 44.75 on 7 and 927 DF,  p-value: < 2.2e-16

The p-value of the F test is 0.2259, so the added variables exper^2 and tenure^2 are jointly insignificant even at the 20% level.

model_7C2_2 <- lm(log(wage)~educ+exper+tenure+married+black+south+urban+I(exper^2)+I(tenure^2),data= data_7C2)
summary(model_7C2_2)
## 
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure + married + black + 
##     south + urban + I(exper^2) + I(tenure^2), data = data_7C2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.98236 -0.21972 -0.00036  0.24078  1.25127 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.3586756  0.1259143  42.558  < 2e-16 ***
## educ         0.0642761  0.0063115  10.184  < 2e-16 ***
## exper        0.0172146  0.0126138   1.365 0.172665    
## tenure       0.0249291  0.0081297   3.066 0.002229 ** 
## married      0.1985470  0.0391103   5.077 4.65e-07 ***
## black       -0.1906636  0.0377011  -5.057 5.13e-07 ***
## south       -0.0912153  0.0262356  -3.477 0.000531 ***
## urban        0.1854241  0.0269585   6.878 1.12e-11 ***
## I(exper^2)  -0.0001138  0.0005319  -0.214 0.830622    
## I(tenure^2) -0.0007964  0.0004710  -1.691 0.091188 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3653 on 925 degrees of freedom
## Multiple R-squared:  0.255,  Adjusted R-squared:  0.2477 
## F-statistic: 35.17 on 9 and 925 DF,  p-value: < 2.2e-16
F= ((0.255-0.2526)/2)/((1-0.255)/925)
F
## [1] 1.489933
1-pf(F,2,925)
## [1] 0.2259282
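
As a cross-check, the same joint test can be obtained directly from the two fitted models (a sketch):

anova(model_7C2, model_7C2_2)  # F test of joint significance of exper^2 and tenure^2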

The p-value for the interaction between educ and black is 0.2626, which is insignificant. Hence, there is no evidence that the return to education depends on race.

model_7C2_3 <- lm(log(wage)~educ+exper+tenure+married+black+south+urban+educ*black,data= data_7C2)
summary(model_7C2_3)
## 
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure + married + black + 
##     south + urban + educ * black, data = data_7C2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.97782 -0.21832  0.00475  0.24136  1.23226 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.374817   0.114703  46.859  < 2e-16 ***
## educ         0.067115   0.006428  10.442  < 2e-16 ***
## exper        0.013826   0.003191   4.333 1.63e-05 ***
## tenure       0.011787   0.002453   4.805 1.80e-06 ***
## married      0.198908   0.039047   5.094 4.25e-07 ***
## black        0.094809   0.255399   0.371 0.710561    
## south       -0.089450   0.026277  -3.404 0.000692 ***
## urban        0.183852   0.026955   6.821 1.63e-11 ***
## educ:black  -0.022624   0.020183  -1.121 0.262603    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3654 on 926 degrees of freedom
## Multiple R-squared:  0.2536, Adjusted R-squared:  0.2471 
## F-statistic: 39.32 on 8 and 926 DF,  p-value: < 2.2e-16

Married black workers are represented by married = 1 and black = 1, and married non-black workers by married = 1 and black = 0. Hence, the difference between these two groups is the sum of the black coefficient and the married:black coefficient: -0.2408 + 0.0614 = -0.1794. That is, married black workers earn, on average, about 17.9% less per month than married non-black workers.

model_7C2_4 <- lm(log(wage)~educ+exper+tenure+married+black+south+urban+married*black,data= data_7C2)
summary(model_7C2_4)
## 
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure + married + black + 
##     south + urban + married * black, data = data_7C2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.98013 -0.21780  0.01057  0.24219  1.22889 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.403793   0.114122  47.351  < 2e-16 ***
## educ           0.065475   0.006253  10.471  < 2e-16 ***
## exper          0.014146   0.003191   4.433 1.04e-05 ***
## tenure         0.011663   0.002458   4.745 2.41e-06 ***
## married        0.188915   0.042878   4.406 1.18e-05 ***
## black         -0.240820   0.096023  -2.508 0.012314 *  
## south         -0.091989   0.026321  -3.495 0.000497 ***
## urban          0.184350   0.026978   6.833 1.50e-11 ***
## married:black  0.061354   0.103275   0.594 0.552602    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3656 on 926 degrees of freedom
## Multiple R-squared:  0.2528, Adjusted R-squared:  0.2464 
## F-statistic: 39.17 on 8 and 926 DF,  p-value: < 2.2e-16
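
The estimate above sums two coefficients, so its standard error must account for their covariance. A minimal sketch using the coefficient covariance matrix of the fitted model:

b <- coef(model_7C2_4)
V <- vcov(model_7C2_4)
est <- b["black"] + b["married:black"]
se <- sqrt(V["black", "black"] + V["married:black", "married:black"] +
             2 * V["black", "married:black"])
c(estimate = est, t_stat = est / se)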

Chapter 8

Chap 8.1

The consequences of heteroskedasticity are:

  1. The usual F statistic no longer has an F distribution.

  2. The OLS estimators are no longer BLUE.

This is because the usual OLS standard errors are estimated assuming homoskedasticity; without that assumption they are biased, so they are no longer valid for constructing confidence intervals or t and F statistics. Likewise, the Gauss-Markov theorem, which makes OLS BLUE, relies on the homoskedasticity assumption, so it no longer applies. Note that heteroskedasticity causes neither bias nor inconsistency in the OLS coefficient estimates themselves.

Chap 8.5

  1. The standard errors in the two sets are quite similar; there are no practically important differences between them.

  2. If education increases by four years, the probability of smoking falls by 4*0.029 = 0.116, i.e., about 11.6 percentage points.

  3. The age at which the probability of smoking starts to fall is where the marginal effect of age changes from positive to negative: 0.020/(2*0.00026) = 38.46. Hence, from about 38 to 39 years old onward, the probability of smoking decreases with age.

  4. Holding the other factors in the equation fixed, a person in a state with restaurant smoking restrictions has a 0.101 lower probability of smoking.

  5. The predicted probability of smoking for person number 206 is: p = 0.656 - 0.069*log(67.44) + 0.012*log(6500) - 0.029*16 + 0.020*77 - 0.00026*(77^2) - 0.101*0 - 0.026*0 = 0.0052, i.e., about 0.52%.
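
A quick arithmetic check of parts 3 and 5, using the coefficient values quoted above:

0.020 / (2 * 0.00026)  # turning point of the age quadratic, in years: 38.46154
0.656 - 0.069*log(67.44) + 0.012*log(6500) - 0.029*16 + 0.020*77 - 0.00026*77^2  # approx 0.0052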

Chap 8.C4

First, we obtain the residuals from the linear regression model, square them, and add them to the dataset.

data_8C4 <- vote1
model_8C4_1 <- lm(voteA~prtystrA+democA+log(expendA)+log(expendB),data= data_8C4)
summary(model_8C4_1)
## 
## Call:
## lm(formula = voteA ~ prtystrA + democA + log(expendA) + log(expendB), 
##     data = data_8C4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.576  -4.864  -1.146   4.903  24.566 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.66141    4.73604   7.952 2.56e-13 ***
## prtystrA      0.25192    0.07129   3.534  0.00053 ***
## democA        3.79294    1.40652   2.697  0.00772 ** 
## log(expendA)  5.77929    0.39182  14.750  < 2e-16 ***
## log(expendB) -6.23784    0.39746 -15.694  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.573 on 168 degrees of freedom
## Multiple R-squared:  0.8012, Adjusted R-squared:  0.7964 
## F-statistic: 169.2 on 4 and 168 DF,  p-value: < 2.2e-16
data_8C4 <- data_8C4 %>% mutate(residual_squared=(model_8C4_1$residuals)^2)
head(data_8C4,5)
##   state district democA voteA expendA expendB prtystrA lexpendA lexpendB
## 1    AL        7      1    68 328.296   8.737       41 5.793916 2.167567
## 2    AK        1      0    62 626.377 402.477       60 6.439952 5.997638
## 3    AZ        2      1    73  99.607   3.065       55 4.601233 1.120048
## 4    AZ        3      0    69 319.690  26.281       64 5.767352 3.268846
## 5    AR        3      0    75 159.221  60.054       66 5.070293 4.095244
##     shareA residual_squared
## 1 97.40767        14.038465
## 2 60.88104        88.688079
## 3 97.01476         3.667330
## 4 92.40370         5.176379
## 5 72.61247       287.464317

Second, we regress the squared residuals on the independent variables:

model_8C4_1_r <- lm(residual_squared~prtystrA+democA+log(expendA)+log(expendB),data= data_8C4)
summary(model_8C4_1_r)
## 
## Call:
## lm(formula = residual_squared ~ prtystrA + democA + log(expendA) + 
##     log(expendB), data = data_8C4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -94.88 -46.16 -23.51  15.84 508.32 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  113.9635    50.8150   2.243   0.0262 *
## prtystrA      -0.2993     0.7649  -0.391   0.6961  
## democA        15.6192    15.0912   1.035   0.3022  
## log(expendA) -10.3057     4.2040  -2.451   0.0153 *
## log(expendB)  -0.0514     4.2645  -0.012   0.9904  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81.25 on 168 degrees of freedom
## Multiple R-squared:  0.05256,    Adjusted R-squared:   0.03 
## F-statistic:  2.33 on 4 and 168 DF,  p-value: 0.05806

Looking at the result, the R-squared is small (0.0526). Note that if we regressed the residuals themselves, rather than their squares, on the independent variables, the R-squared would be exactly zero: by construction, OLS residuals are uncorrelated with the explanatory variables. The squared residuals, however, need not be uncorrelated with the regressors, and it is this auxiliary regression that the Breusch-Pagan test builds on.

The Breusch-Pagan test should be applied to the original model, not to the auxiliary regression. Its studentized (Koenker) form is LM = n*R^2 from the regression of the squared residuals on the regressors: LM = 173 * 0.0526, about 9.09, which against a chi-squared distribution with 4 degrees of freedom gives a p-value of about 0.059. Hence, we do not reject homoskedasticity at the 5% significance level, although there is some evidence of heteroskedasticity at the 10% level.

bptest(model_8C4_1)
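
The statistic can also be computed by hand from the auxiliary regression reported above (a sketch; the chi-squared degrees of freedom equal the number of regressors in the auxiliary regression):

LM <- nobs(model_8C4_1) * summary(model_8C4_1_r)$r.squared  # n * R^2 = 173 * 0.0526
c(LM = LM, p_value = 1 - pchisq(LM, df = 4))  # LM approx 9.09, p approx 0.059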

The reported p-value of the White test for heteroskedasticity is about 0.000015, essentially 0. Hence, heteroskedasticity is significant at the 5%, 1%, and even 0.1% significance levels; the evidence for heteroskedasticity from this test is very strong.

white_test(model_8C4_1)  # White test applied to the original model
## White's test results
## 
## Null hypothesis: Homoskedasticity of the residuals
## Alternative hypothesis: Heteroskedasticity of the residuals
## Test Statistic: 22.2
## P-value: 1.5e-05

Chap 8.C13

Comparing the two sets of results (nonrobust first, robust second), most, but not all, of the robust standard errors are larger than the nonrobust ones.

data_8C13 <- fertil2 %>% select(children, age, educ, electric, urban,spirit,protest,catholic) %>% na.omit()
model_8C13_1 <- lm(children~ age+I(age^2)+educ+electric+urban, data= data_8C13)
summary(model_8C13_1)
## 
## Call:
## lm(formula = children ~ age + I(age^2) + educ + electric + urban, 
##     data = data_8C13)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9012 -0.7136 -0.0039  0.7119  7.4318 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.2225162  0.2401888 -17.580  < 2e-16 ***
## age          0.3409255  0.0165082  20.652  < 2e-16 ***
## I(age^2)    -0.0027412  0.0002718 -10.086  < 2e-16 ***
## educ        -0.0752323  0.0062966 -11.948  < 2e-16 ***
## electric    -0.3100404  0.0690045  -4.493 7.20e-06 ***
## urban       -0.2000339  0.0465062  -4.301 1.74e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.452 on 4352 degrees of freedom
## Multiple R-squared:  0.5734, Adjusted R-squared:  0.5729 
## F-statistic:  1170 on 5 and 4352 DF,  p-value: < 2.2e-16
robust_8C13_1 <- coeftest(model_8C13_1, vcov = vcovHC(model_8C13_1, type = "HC0"))  # robust vcov changes only standard errors; coefficients are identical to OLS
y_robust_1 <- robust_8C13_1[1,1]+robust_8C13_1[2,1]*data_8C13$age+robust_8C13_1[3,1]*(data_8C13$age^2)+robust_8C13_1[4,1]*data_8C13$educ+robust_8C13_1[5,1]*data_8C13$electric+robust_8C13_1[6,1]*data_8C13$urban

First, we conduct the F test for the nonrobust model. The F statistic is 2.04, below the 5% critical value of about 2.61; hence, the three religious dummy variables are jointly insignificant at the 5% significance level. The p-value of the F test is 0.106.

model_8C13_2 <- lm(children~ age+I(age^2)+educ+electric+urban+spirit+protest+catholic, data= data_8C13)
summary(model_8C13_2)
## 
## Call:
## lm(formula = children ~ age + I(age^2) + educ + electric + urban + 
##     spirit + protest + catholic, data = data_8C13)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9514 -0.7041 -0.0101  0.7213  7.4435 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.3147175  0.2433391 -17.731  < 2e-16 ***
## age          0.3418668  0.0165186  20.696  < 2e-16 ***
## I(age^2)    -0.0027596  0.0002722 -10.139  < 2e-16 ***
## educ        -0.0762130  0.0064608 -11.796  < 2e-16 ***
## electric    -0.3057189  0.0690241  -4.429 9.69e-06 ***
## urban       -0.2034131  0.0465864  -4.366 1.29e-05 ***
## spirit       0.1404603  0.0558052   2.517   0.0119 *  
## protest      0.0753961  0.0652158   1.156   0.2477    
## catholic     0.1173539  0.0834249   1.407   0.1596    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.451 on 4349 degrees of freedom
## Multiple R-squared:  0.574,  Adjusted R-squared:  0.5733 
## F-statistic: 732.6 on 8 and 4349 DF,  p-value: < 2.2e-16
F= ((0.574-0.5734)/3)/((1-0.574)/4349)
F
## [1] 2.041784
critical_value = round(qf(0.95,3,4349), 3)
critical_value
## [1] 2.607
p_value = 1- pf(F,3,4349)
p_value
## [1] 0.1058349

Second, we conduct the F test using the robust coefficient estimates. The F statistic is 2.196, still below the 5% critical value of about 2.61; hence, the three religious dummy variables are jointly insignificant at the 5% significance level. The p-value of the F test is 0.086, so the three dummy variables are jointly significant at the 10% significance level.

robust_8C13_2 <- coeftest(model_8C13_2,vcov= vcovHC(model_8C13_2, type="HC0"))
y_robust <- robust_8C13_2[1,1]+robust_8C13_2[2,1]*data_8C13$age+robust_8C13_2[3,1]*(data_8C13$age^2)+robust_8C13_2[4,1]*data_8C13$educ+robust_8C13_2[5,1]*data_8C13$electric+robust_8C13_2[6,1]*data_8C13$urban+robust_8C13_2[7,1]*data_8C13$spirit+robust_8C13_2[8,1]*data_8C13$protest+robust_8C13_2[9,1]*data_8C13$catholic
data_8C13 <- data_8C13 %>% mutate(residual_squared=(model_8C13_2$residuals)^2) %>% 
  mutate(y1= fitted(model_8C13_2)) %>%  mutate(y2=y_robust)
ssr_r <- sum((y_robust_1 -data_8C13$children)^2)
ssr_ur <- sum((y_robust -data_8C13$children)^2)
F= ((ssr_r-ssr_ur)/3)/(ssr_ur/4349)
F
## [1] 2.196093
p_value = 1- pf(F,3,4349)
p_value
## [1] 0.08640607
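
The SSR comparison above reuses the OLS point estimates (robust estimation changes only the standard errors), so it is not a fully robust test. A heteroskedasticity-robust joint test can instead be run directly with waldtest() from the lmtest package (a sketch):

waldtest(model_8C13_1, model_8C13_2, vcov = vcovHC(model_8C13_2, type = "HC0"))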

From the result below, we can see that the p-value of the F test for the significance of the fitted values is nearly 0. This indicates that the fitted values of the model help explain the variation in the squared residuals; in other words, the error variance moves with the explanatory variables, which indicates heteroskedasticity. (Note that y2 is dropped from the regression because it is identical to y1: robust estimation changes only the standard errors, not the coefficients, so the two sets of fitted values coincide.)

head(data_8C13,5)
##   children age educ electric urban spirit protest catholic residual_squared
## 1        0  24   12        1     1      0       0        0        0.7688962
## 2        3  32   13        1     1      0       0        0        0.4909979
## 3        1  30    5        1     1      1       0        0        2.9169527
## 4        2  42    4        1     1      0       0        0        5.5779494
## 5        2  43   11        1     1      0       1        0        4.0440167
##          y1        y2
## 1 0.8768673 0.8768673
## 2 2.2992876 2.2992876
## 3 2.7079089 2.7079089
## 4 4.3617683 4.3617683
## 5 4.0109741 4.0109741
sum(is.na(data_8C13$y2))
## [1] 0
model_8C13_3 <- lm(residual_squared~y1+y2, data= data_8C13)
summary(model_8C13_3)
## 
## Call:
## lm(formula = residual_squared ~ y1 + y2, data = data_8C13)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.386 -1.909 -0.342  0.689 49.477 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.53547    0.09423  -5.683 1.41e-08 ***
## y1           1.16334    0.03337  34.866  < 2e-16 ***
## y2                NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.708 on 4356 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.218 
## F-statistic:  1216 on 1 and 4356 DF,  p-value: < 2.2e-16

In terms of statistical inference, heteroskedasticity is practically important because it invalidates the usual standard errors and test statistics and can therefore lead to wrong conclusions. For obtaining the point estimates themselves it matters less, since OLS remains unbiased and consistent without the homoskedasticity assumption. This is the same conclusion as in Problem 8.1.

Chapter 9

Chap 9.1

From the RESET F test, the p-value is 0.054. Hence, at the 5% significance level, there is no significant evidence of functional form misspecification in the model; however, at any significance level above 5.4% (for example, 10%), the evidence would be significant.

F= ((0.375-0.353)/2)/((1-0.375)/169)
F
## [1] 2.9744
p_value = 1-pf(F,2,169)
p_value
## [1] 0.05375882
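
Had the underlying data been available, the same test could be run directly with lmtest's resettest() (a sketch; `model` stands for the hypothetical fitted lm object from this problem):

library(lmtest)
resettest(model, power = 2:3, type = "fitted")  # RESET: add squared and cubed fitted values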

Chap 9.5

In this case, when students are deciding which college to attend, they might consider the level of crime on campus. If a college has a high crime rate, it may prefer not to share its crime data openly, in order to attract more students. In other words, the colleges that do not report their campus crimes are likely to be precisely those with high crime. Because whether a college appears in the sample is related to the dependent variable itself, a college's failure to report crimes is endogenous sample selection, and estimates based only on reporting colleges can be biased.

Chap 9.C3

There are several reasons why the unobserved factors in u might be correlated with grant, such as: the resources of the company (plausibly related to receiving a grant), industry characteristics, geographic location, the ability of the trainees, etc.

The coefficient on grant is 0.0566, but its p-value is 0.8895, which indicates that the coefficient is not significantly different from 0. Hence, we cannot conclude that receiving a job training grant lowers a firm's scrap rate.

data_9C3 <- jtrain %>% filter(year==1988)
model_9C3_2 <- lm(log(scrap)~grant,data=data_9C3)
nobs(model_9C3_2)
## [1] 54
summary(model_9C3_2)
## 
## Call:
## lm(formula = log(scrap) ~ grant, data = data_9C3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4043 -0.9536 -0.0465  0.9636  2.8103 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   0.4085     0.2406   1.698   0.0954 .
## grant         0.0566     0.4056   0.140   0.8895  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.423 on 52 degrees of freedom
##   (103 observations deleted due to missingness)
## Multiple R-squared:  0.0003744,  Adjusted R-squared:  -0.01885 
## F-statistic: 0.01948 on 1 and 52 DF,  p-value: 0.8895

After adding the explanatory variable log(scrap1987), the significance of grant1988 increases and the estimate becomes negative. Specifically, the coefficient on grant1988 is -0.254, which means that, conditional on the 1987 scrap rate, firms that received a grant have scrap rates roughly 25.4% lower than firms without a grant, on average.

This conclusion is statistically significant at the 5% level against the one-sided alternative H1: beta(grant) < 0, since the one-sided p-value obtained from the model is 0.045 < 0.05.

data_9C3_3 <- jtrain %>% select(year,fcode,scrap,grant)
data_9C3_3_tran<- data_9C3_3 %>%
  pivot_wider(names_from = year, values_from = c(scrap, grant), names_sep = "")
model_9C3_3 <- lm(log(scrap1988)~grant1988+log(scrap1987),data=data_9C3_3_tran)
summary(model_9C3_3)
## 
## Call:
## lm(formula = log(scrap1988) ~ grant1988 + log(scrap1987), data = data_9C3_3_tran)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9146 -0.1763  0.0057  0.2308  1.5991 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.02124    0.08910   0.238   0.8126    
## grant1988      -0.25397    0.14703  -1.727   0.0902 .  
## log(scrap1987)  0.83116    0.04444  18.701   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5127 on 51 degrees of freedom
##   (103 observations deleted due to missingness)
## Multiple R-squared:  0.8728, Adjusted R-squared:  0.8678 
## F-statistic: 174.9 on 2 and 51 DF,  p-value: < 2.2e-16
t_stat <- summary(model_9C3_3)$coefficients["grant1988", "t value"]
p_value <- pt(t_stat, df = length(model_9C3_3$residuals) - length(model_9C3_3$coefficients), lower.tail = TRUE)
print("One-sided p-value:")
## [1] "One-sided p-value:"
print(p_value)
## [1] 0.04508135

According to the model above, the p-value for log(scrap1987) is nearly 0, indicating strong evidence that the 1987 scrap rate predicts the 1988 scrap rate.

From the result below, using heteroskedasticity-robust standard errors, the one-sided p-value for grant1988 (0.040) is even lower than in the previous test, while the p-value for log(scrap1987) (about 6e-16) is higher but still highly significant.

coeftest(model_9C3_3,  vcov = vcovHC(model_9C3_3, type = "HC0"))
## 
## t test of coefficients:
## 
##                 Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)     0.021237   0.097032  0.2189   0.82763    
## grant1988      -0.253970   0.142249 -1.7854   0.08014 .  
## log(scrap1987)  0.831161   0.071469 11.6297 5.862e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
t_stat_grant <- -0.253970/0.142249
p_value_grant <- pt(t_stat_grant, df = 51, lower.tail = TRUE)
print("One-sided p-value for H1: beta_grant < 0:")
## [1] "One-sided p-value for H1: beta_grant < 0:"
print(p_value_grant)
## [1] 0.04007254
t_stat_scrap <- 0.831161/0.071469 
p_value_scrap <- 2 * pt(abs(t_stat_scrap), df = 51, lower.tail = FALSE)
print("Two-sided p-value for H1: beta_scrap != 0:")
## [1] "Two-sided p-value for H1: beta_scrap != 0:"
print(p_value_scrap)
## [1] 5.862668e-16

Chap 9.C4

The coefficient on DC is 11.801, which is economically large, and its p-value is nearly 0, so it is statistically significant. The coefficient indicates that, holding per capita income and physicians fixed, the infant mortality rate in DC is about 11.8 deaths per 1,000 live births higher than in the other states.

data_9C4 <- infmrt
model_9C4_1 <- lm(infmort~log(pcinc)+log(physic)+DC, data=data_9C4)
summary(model_9C4_1)
## 
## Call:
## lm(formula = infmort ~ log(pcinc) + log(physic) + DC, data = data_9C4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1313 -1.0042 -0.1868  1.0718  3.0486 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.1855     7.2753   5.111 1.58e-06 ***
## log(pcinc)   -2.6302     0.9576  -2.747  0.00716 ** 
## log(physic)  -0.4273     0.7853  -0.544  0.58765    
## DC           11.8012     1.2297   9.597 9.06e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.359 on 98 degrees of freedom
## Multiple R-squared:  0.5779, Adjusted R-squared:  0.565 
## F-statistic: 44.72 on 3 and 98 DF,  p-value: < 2.2e-16

The R-squared and adjusted R-squared are much larger when the DC dummy is included because the dummy fits the DC observation perfectly: DC is a severe outlier, so absorbing it with its own dummy variable mechanically improves the fit substantially.

model_9C4_2 <- lm(infmort~log(pcinc)+log(physic)+log(popul), data=data_9C4)
summary(model_9C4_2)
## 
## Call:
## lm(formula = infmort ~ log(pcinc) + log(physic) + log(popul), 
##     data = data_9C4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0204 -1.1897 -0.1152  0.9083  8.1890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.22614   10.13466   3.574 0.000547 ***
## log(pcinc)  -4.88415    1.29319  -3.777 0.000273 ***
## log(physic)  4.02783    0.89101   4.521 1.73e-05 ***
## log(popul)  -0.05356    0.18748  -0.286 0.775712    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.891 on 98 degrees of freedom
## Multiple R-squared:  0.1818, Adjusted R-squared:  0.1568 
## F-statistic:  7.26 on 3 and 98 DF,  p-value: 0.0001898

Chapter 10

Chap 10.1

  1. Like cross-sectional observations, we can assume that most time series observations are independently distributed.

I disagree with this statement: in time series, observations are typically correlated over time, often strongly so. Hence, we cannot assume that most time series observations are independently distributed.

  2. The OLS estimator in a time series regression is unbiased under the first three GM assumptions.

I agree with this statement (Theorem 10.1).

  3. A trending variable cannot be used as the dependent variable in multiple regression analysis.

I disagree with this statement: a trending variable can be used as the dependent variable. However, we should be careful about issues that can arise, such as spurious regression.

  4. Seasonality is not an issue when using annual time series observations.

I agree with this statement: seasonality in time series data refers to regular, predictable fluctuations that occur at specific intervals within a year, such as patterns associated with particular seasons, months, or days of the week. With annual observations, such within-year patterns are averaged out, so seasonality is generally not a concern.

Chap 10.5

We can specify a model in which housing starts is the dependent variable, explained by the interest rate, real per capita income, and three quarterly dummy variables Q1, Q2, Q3 (with the fourth quarter as the base group):

housing starts ~ interest rate + real per capita income + Q1 + Q2 + Q3
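
A minimal sketch of the estimating equation, assuming a quarterly data frame `d` containing housing starts (hstarts), the interest rate (intrate), real per capita income (pcinc), and quarter dummies Q1-Q3 (all names hypothetical):

model_10_5 <- lm(hstarts ~ intrate + pcinc + Q1 + Q2 + Q3, data = d)
summary(model_10_5)  # Q4 is the base quarter; Q1-Q3 capture seasonality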

Chap 10.C1

Based on the model, the coefficient on later1979 is 1.56, which is economically large, and its p-value is 0.003, which is statistically significant. Hence, there is strong evidence of a shift in the interest rate equation after 1979: other things equal, the interest rate after 1979 is about 1.56 percentage points higher than in the earlier period.

data_10C1 <- intdef %>% mutate(later1979= ifelse(year>1979,1,0))
head(data_10C1,5)
##   year   i3  inf  rec  out        def i3_1 inf_1      def_1        ci3 cinf
## 1 1948 1.04  8.1 16.2 11.6 -4.6000004   NA    NA         NA         NA   NA
## 2 1949 1.10 -1.2 14.5 14.3 -0.1999998 1.04   8.1 -4.6000004 0.06000006 -9.3
## 3 1950 1.22  1.3 14.4 15.6  1.2000008 1.10  -1.2 -0.1999998 0.12000000  2.5
## 4 1951 1.55  7.9 16.1 14.2 -1.9000006 1.22   1.3  1.2000008 0.32999992  6.6
## 5 1952 1.77  1.9 19.0 19.4  0.3999996 1.55   7.9 -1.9000006 0.22000003 -6.0
##        cdef y77 later1979
## 1        NA   0         0
## 2  4.400001   0         0
## 3  1.400001   0         0
## 4 -3.100001   0         0
## 5  2.300000   0         0
model_10C1 <- lm(i3~inf+def+later1979, data=data_10C1)
summary(model_10C1)
## 
## Call:
## lm(formula = i3 ~ inf + def + later1979, data = data_10C1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4674 -0.8407  0.2388  1.0148  3.9654 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.29623    0.42535   3.047  0.00362 ** 
## inf          0.60842    0.07625   7.979 1.37e-10 ***
## def          0.36266    0.12025   3.016  0.00396 ** 
## later1979    1.55877    0.50577   3.082  0.00329 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.711 on 52 degrees of freedom
## Multiple R-squared:  0.6635, Adjusted R-squared:  0.6441 
## F-statistic: 34.18 on 3 and 52 DF,  p-value: 2.408e-12

Chap 10.C6

  1. Following the requirement, I regress gfr on t and t^2 and save the residuals as gf in the dataset.
data_10C6 <- fertil3
model_10C6_1 <- lm(gfr~t+I(t^2),data=data_10C6)
data_10C6 <- data_10C6 %>% mutate(gf=model_10C6_1$residuals)
head(data_10C6)
##     gfr    pe year t tsq  pe_1 pe_2 pe_3 pe_4 pill ww2 tcu      cgfr   cpe
## 1 124.7  0.00 1913 1   1    NA   NA   NA   NA    0   0   1        NA    NA
## 2 126.6  0.00 1914 2   4  0.00   NA   NA   NA    0   0   8  1.900002  0.00
## 3 125.0  0.00 1915 3   9  0.00    0   NA   NA    0   0  27 -1.599998  0.00
## 4 123.4  0.00 1916 4  16  0.00    0    0   NA    0   0  64 -1.599998  0.00
## 5 121.0 19.27 1917 5  25  0.00    0    0    0    0   0 125 -2.400002 19.27
## 6 119.8 23.94 1918 6  36 19.27    0    0    0    0   0 216 -1.199997  4.67
##   cpe_1 cpe_2 cpe_3 cpe_4 gfr_1    cgfr_1    cgfr_2    cgfr_3   cgfr_4 gfr_2
## 1    NA    NA    NA    NA    NA        NA        NA        NA       NA    NA
## 2    NA    NA    NA    NA 124.7        NA        NA        NA       NA    NA
## 3  0.00    NA    NA    NA 126.6  1.900002        NA        NA       NA 124.7
## 4  0.00     0    NA    NA 125.0 -1.599998  1.900002        NA       NA 126.6
## 5  0.00     0     0    NA 123.4 -1.599998 -1.599998  1.900002       NA 125.0
## 6 19.27     0     0     0 121.0 -2.400002 -1.599998 -1.599998 1.900002 123.4
##         gf
## 1 17.58000
## 2 19.43218
## 3 17.80028
## 4 16.18430
## 5 13.78423
## 6 12.60009

The R-squared for the detrended dependent variable gf is 0.6015, which is smaller than that for the trending dependent variable gfr (0.727). Although the explanatory power decreases, the explanatory variables still explain a fair amount of the variation in the detrended series.

model_10C6_2 <- lm(gf~pe+ww2+pill+t+I(t^2),data=data_10C6)
summary(model_10C6_2)
## 
## Call:
## lm(formula = gf ~ pe + ww2 + pill + t + I(t^2), data = data_10C6)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.9791  -6.9775  -0.2713   7.7975  19.9861 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.035673   4.360738   3.907 0.000223 ***
## pe            0.347813   0.040260   8.639 1.91e-12 ***
## ww2         -35.880277   5.707921  -6.286 2.95e-08 ***
## pill        -10.119723   6.336094  -1.597 0.115008    
## t            -2.603123   0.389386  -6.685 5.86e-09 ***
## I(t^2)        0.027572   0.004971   5.546 5.53e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.74 on 66 degrees of freedom
## Multiple R-squared:  0.6015, Adjusted R-squared:  0.5713 
## F-statistic: 19.92 on 5 and 66 DF,  p-value: 4.719e-12

After re-estimating the equation with t^3 added, the additional term is statistically significant (t = -6.8, p < 0.001). Hence, including t^3 in the model is warranted. As a visual check, I plot gfr against t to inspect the shape of the trend.

model_10C6_3 <- lm(gfr~pe+ww2+pill+t+I(t^2)+I(t^3),data=data_10C6)
summary(model_10C6_3)
## 
## Call:
## lm(formula = gfr ~ pe + ww2 + pill + t + I(t^2) + I(t^3), data = data_10C6)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4463  -6.0271   0.3312   6.5957  17.7281 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.428e+02  4.338e+00  32.919  < 2e-16 ***
## pe           1.619e-01  4.131e-02   3.920 0.000216 ***
## ww2         -1.905e+01  5.042e+00  -3.778 0.000346 ***
## pill        -2.501e+01  5.346e+00  -4.679 1.51e-05 ***
## t           -5.612e+00  5.428e-01 -10.340 2.32e-15 ***
## I(t^2)       1.554e-01  2.030e-02   7.653 1.21e-10 ***
## I(t^3)      -1.290e-03  1.894e-04  -6.809 3.77e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.267 on 65 degrees of freedom
## Multiple R-squared:  0.8405, Adjusted R-squared:  0.8257 
## F-statistic: 57.07 on 6 and 65 DF,  p-value: < 2.2e-16
plot(data_10C6$t, data_10C6$gfr)  # gfr against the time trend t