We have the estimated model: sleep = 3840.83 - 0.163 totwrk - 11.71 educ - 8.70 age + 0.128 age^2 + 87.75 male
From the model, the coefficient on male is 87.75, which means that, holding the other factors fixed, men are estimated to sleep about 87.75 more minutes per week than women.
2 * (1 - pt(87.75/34.33, df = 700))
## [1] 0.01079613
The t statistic for this variable is 87.75/34.33 = 2.56 and the p-value is 0.011; so at the 5% significance level, there is strong evidence of a gender difference in sleep.
2 * (1 - pt(0.163/0.018, df = 700))
## [1] 0
The p-value for totwrk is essentially zero, so there is a statistically significant tradeoff between working and sleeping. The estimated tradeoff is 0.163: one more minute of work is associated with about 0.163 fewer minutes of sleep.
To test the null hypothesis that age has no effect on sleeping, the other regression we need to run is the restricted model:
sleep ~ totwrk + educ + male
Null hypothesis: H0: beta(age) = 0, beta(age^2) = 0. The restricted and unrestricted models are then compared with an F test, as sketched below.
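A minimal sketch of this joint test, assuming the sleep75 data from the wooldridge package are loaded like the other datasets used here:
model_ur <- lm(sleep ~ totwrk + educ + age + I(age^2) + male, data = sleep75)
model_r <- lm(sleep ~ totwrk + educ + male, data = sleep75)
anova(model_r, model_ur)  # F statistic and p-value for H0: beta(age) = beta(age^2) = 0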
2 * (1 - pt(2.19/0.53, df = 4131))
## [1] 3.666238e-05
The p-value for hsize^2 is about 0.00004, which is statistically significant at any conventional level. Hence, there is strong evidence that hsize^2 should be included in the model.
The optimal high school size for maximizing sat solves the first-order condition 19.30 - 2(2.19)hsize = 0, giving hsize = 19.30/(2 x 2.19) = 4.406 (hundreds), i.e., about 441 students.
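A quick check of this arithmetic:
19.30 / (2 * 2.19)
## [1] 4.406393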
Non-black females are represented by female = 1 and black = 0, and non-black males by female = 0 and black = 0. Hence, the difference between these two groups is the coefficient on female, which is -45.09: on average, non-black females score about 45.09 points lower on the SAT than non-black males.
2 * (1 - pt(45.09/4.29, df = 4131))
## [1] 0
The p-value is essentially zero, so the estimated difference is statistically significant.
Non-black males are represented by female = 0 and black = 0, and black males by female = 0 and black = 1. Hence, the difference between these two groups is the coefficient on black, which is -169.81: on average, black males score about 169.81 points lower on the SAT than non-black males.
2 * (1 - pt(169.81/12.71, df = 4131))
## [1] 0
The p-value is essentially zero, so the estimated difference is statistically significant.
Non-black females are represented by female = 1 and black = 0, and black females by female = 1 and black = 1. The difference (non-black female minus black female) is therefore minus the sum of the black coefficient and the female x black interaction: -(-169.81 + 62.31) = 107.5. On average, non-black females score about 107.5 points higher on the SAT than black females.
Because this difference involves two coefficients, it cannot be tested with a single t statistic from the reported regression. Instead, we can re-estimate the model with non-black females as the base group, including dummy variables for the other three groups, and read the test directly off the t statistic on the black-female dummy, as sketched below.
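A minimal sketch of this re-parameterization, assuming the SAT data are gpa2 from the wooldridge package:
data_gpa2 <- gpa2 %>%
  mutate(nonblack_male = as.numeric(female == 0 & black == 0),
    black_male = as.numeric(female == 0 & black == 1),
    black_female = as.numeric(female == 1 & black == 1))
model_base <- lm(sat ~ hsize + I(hsize^2) + nonblack_male + black_male + black_female, data = data_gpa2)
summary(model_base)  # the t statistic on black_female tests the 107.5 difference directly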
From the model, PC and hsGPA are the statistically significant variables. Compared with the result estimated in (7.6), the PC coefficient decreases slightly from 0.157 to 0.151 and remains statistically significant.
data_7C1 <- gpa1
model_7C1 <- lm(colGPA~PC+hsGPA+ACT+mothcoll+fathcoll,data= data_7C1)
summary(model_7C1)
##
## Call:
## lm(formula = colGPA ~ PC + hsGPA + ACT + mothcoll + fathcoll,
## data = data_7C1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78149 -0.25726 -0.02121 0.24691 0.74432
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.255554 0.335392 3.744 0.000268 ***
## PC 0.151854 0.058716 2.586 0.010762 *
## hsGPA 0.450220 0.094280 4.775 4.61e-06 ***
## ACT 0.007724 0.010678 0.723 0.470688
## mothcoll -0.003758 0.060270 -0.062 0.950376
## fathcoll 0.041800 0.061270 0.682 0.496265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3344 on 135 degrees of freedom
## Multiple R-squared: 0.2222, Adjusted R-squared: 0.1934
## F-statistic: 7.713 on 5 and 135 DF, p-value: 2.083e-06
The F test for joint significance of mothcoll and fathcoll gives F = 0.24 < 3.06, the 5% critical value of F(2, 135); hence, we cannot reject H0, i.e., these variables are jointly insignificant.
model_7C1_r <- lm(colGPA ~ PC + hsGPA + ACT, data = data_7C1)  # restricted model
summary(model_7C1_r)
##
## Call:
## lm(formula = colGPA ~ PC + hsGPA + ACT, data = data_7C1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7901 -0.2622 -0.0107 0.2334 0.7570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.263520 0.333125 3.793 0.000223 ***
## PC 0.157309 0.057287 2.746 0.006844 **
## hsGPA 0.447242 0.093647 4.776 4.54e-06 ***
## ACT 0.008659 0.010534 0.822 0.412513
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3325 on 137 degrees of freedom
## Multiple R-squared: 0.2194, Adjusted R-squared: 0.2023
## F-statistic: 12.83 on 3 and 137 DF, p-value: 1.932e-07
F= ((0.2222-0.2194)/2)/((1-0.2222)/135)
F
## [1] 0.2429931
critical_value = qf(0.95, df1 = 2, df2 = 135)
critical_value
## [1] 3.063204
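As a cross-check, the same joint test can be obtained with anova(); it reproduces this F statistic up to the rounding of the R-squared values used above:
anova(model_7C1_r, model_7C1)  # joint F test of mothcoll and fathcoll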
This generalization is not needed: after adding hsGPA^2, neither hsGPA nor hsGPA^2 is individually significant in the model, so the quadratic term adds little.
model_7C1_3 <- lm(colGPA~PC+hsGPA+I(hsGPA^2)+ACT+mothcoll+fathcoll,data= data_7C1)
summary(model_7C1_3)
##
## Call:
## lm(formula = colGPA ~ PC + hsGPA + I(hsGPA^2) + ACT + mothcoll +
## fathcoll, data = data_7C1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78998 -0.24327 -0.00648 0.26179 0.72231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.040328 2.443038 2.063 0.0410 *
## PC 0.140446 0.058858 2.386 0.0184 *
## hsGPA -1.802520 1.443552 -1.249 0.2140
## I(hsGPA^2) 0.337341 0.215711 1.564 0.1202
## ACT 0.004786 0.010786 0.444 0.6580
## mothcoll 0.003091 0.060110 0.051 0.9591
## fathcoll 0.062761 0.062401 1.006 0.3163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3326 on 134 degrees of freedom
## Multiple R-squared: 0.2361, Adjusted R-squared: 0.2019
## F-statistic: 6.904 on 6 and 134 DF, p-value: 2.088e-06
The estimated difference between black and non-black men is -0.18835; holding the other factors fixed, black men are predicted to earn roughly 18.8% less per month than non-black men. This difference is statistically significant, with a p-value near zero.
data_7C2 <- wage2
model_7C2 <- lm(log(wage)~educ+exper+tenure+married+black+south+urban,data= data_7C2)
summary(model_7C2)
##
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure + married + black +
## south + urban, data = data_7C2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.98069 -0.21996 0.00707 0.24288 1.22822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.395497 0.113225 47.653 < 2e-16 ***
## educ 0.065431 0.006250 10.468 < 2e-16 ***
## exper 0.014043 0.003185 4.409 1.16e-05 ***
## tenure 0.011747 0.002453 4.789 1.95e-06 ***
## married 0.199417 0.039050 5.107 3.98e-07 ***
## black -0.188350 0.037667 -5.000 6.84e-07 ***
## south -0.090904 0.026249 -3.463 0.000558 ***
## urban 0.183912 0.026958 6.822 1.62e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3655 on 927 degrees of freedom
## Multiple R-squared: 0.2526, Adjusted R-squared: 0.2469
## F-statistic: 44.75 on 7 and 927 DF, p-value: < 2.2e-16
The p-value of the F test is 0.2259, so exper^2 and tenure^2 are jointly insignificant even at the 20% level.
model_7C2_2 <- lm(log(wage)~educ+exper+tenure+married+black+south+urban+I(exper^2)+I(tenure^2),data= data_7C2)
summary(model_7C2_2)
##
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure + married + black +
## south + urban + I(exper^2) + I(tenure^2), data = data_7C2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.98236 -0.21972 -0.00036 0.24078 1.25127
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3586756 0.1259143 42.558 < 2e-16 ***
## educ 0.0642761 0.0063115 10.184 < 2e-16 ***
## exper 0.0172146 0.0126138 1.365 0.172665
## tenure 0.0249291 0.0081297 3.066 0.002229 **
## married 0.1985470 0.0391103 5.077 4.65e-07 ***
## black -0.1906636 0.0377011 -5.057 5.13e-07 ***
## south -0.0912153 0.0262356 -3.477 0.000531 ***
## urban 0.1854241 0.0269585 6.878 1.12e-11 ***
## I(exper^2) -0.0001138 0.0005319 -0.214 0.830622
## I(tenure^2) -0.0007964 0.0004710 -1.691 0.091188 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3653 on 925 degrees of freedom
## Multiple R-squared: 0.255, Adjusted R-squared: 0.2477
## F-statistic: 35.17 on 9 and 925 DF, p-value: < 2.2e-16
F= ((0.255-0.2526)/2)/((1-0.255)/925)
F
## [1] 1.489933
1-pf(F,2,925)
## [1] 0.2259282
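As a cross-check, lmtest's waldtest() compares the two fitted models directly and reproduces this F statistic without the rounding of the R-squared values:
waldtest(model_7C2, model_7C2_2)  # joint test of exper^2 and tenure^2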
The p-value on the interaction between educ and black is 0.2626, which is insignificant. Hence, there is no evidence that the return to education depends on race.
model_7C2_3 <- lm(log(wage)~educ+exper+tenure+married+black+south+urban+educ*black,data= data_7C2)
summary(model_7C2_3)
##
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure + married + black +
## south + urban + educ * black, data = data_7C2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.97782 -0.21832 0.00475 0.24136 1.23226
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.374817 0.114703 46.859 < 2e-16 ***
## educ 0.067115 0.006428 10.442 < 2e-16 ***
## exper 0.013826 0.003191 4.333 1.63e-05 ***
## tenure 0.011787 0.002453 4.805 1.80e-06 ***
## married 0.198908 0.039047 5.094 4.25e-07 ***
## black 0.094809 0.255399 0.371 0.710561
## south -0.089450 0.026277 -3.404 0.000692 ***
## urban 0.183852 0.026955 6.821 1.63e-11 ***
## educ:black -0.022624 0.020183 -1.121 0.262603
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3654 on 926 degrees of freedom
## Multiple R-squared: 0.2536, Adjusted R-squared: 0.2471
## F-statistic: 39.32 on 8 and 926 DF, p-value: < 2.2e-16
Married black workers are represented by married = 1 and black = 1, and married non-black workers by married = 1 and black = 0. Hence, the difference between these two groups is the sum of the black coefficient and the married x black interaction: -0.2408 + 0.0614 = -0.1794. On average, married black workers are predicted to earn about 17.94% less per month than married non-black workers. A standard error for this two-coefficient difference can be obtained with a linear-hypothesis test, as sketched after the output below.
model_7C2_4 <- lm(log(wage)~educ+exper+tenure+married+black+south+urban+married*black,data= data_7C2)
summary(model_7C2_4)
##
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure + married + black +
## south + urban + married * black, data = data_7C2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.98013 -0.21780 0.01057 0.24219 1.22889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.403793 0.114122 47.351 < 2e-16 ***
## educ 0.065475 0.006253 10.471 < 2e-16 ***
## exper 0.014146 0.003191 4.433 1.04e-05 ***
## tenure 0.011663 0.002458 4.745 2.41e-06 ***
## married 0.188915 0.042878 4.406 1.18e-05 ***
## black -0.240820 0.096023 -2.508 0.012314 *
## south -0.091989 0.026321 -3.495 0.000497 ***
## urban 0.184350 0.026978 6.833 1.50e-11 ***
## married:black 0.061354 0.103275 0.594 0.552602
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3656 on 926 degrees of freedom
## Multiple R-squared: 0.2528, Adjusted R-squared: 0.2464
## F-statistic: 39.17 on 8 and 926 DF, p-value: < 2.2e-16
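A minimal sketch of testing this two-coefficient difference directly, assuming the car package is available:
library(car)
# H0: beta_black + beta_married:black = 0 (no difference between married black
# and married non-black men); this attaches inference to the -0.1794 estimate
linearHypothesis(model_7C2_4, "black + married:black = 0")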
The consequences of heteroskedasticity are:
The usual F statistic no longer has an F distribution.
The OLS estimators are no longer BLUE.
This is because the usual OLS standard errors are based on variance estimators that are biased when the homoskedasticity assumption fails; hence they are no longer valid for constructing confidence intervals or t and F statistics. Likewise, the Gauss-Markov theorem (OLS is BLUE) relies on the homoskedasticity assumption, so that property is also lost. OLS remains unbiased and consistent, however, so the point estimates themselves are still usable.
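A minimal simulated illustration of the inference problem, assuming the lmtest and sandwich packages:
library(lmtest)
library(sandwich)
set.seed(1)
x <- runif(200)
y <- 1 + 2 * x + rnorm(200, sd = x)  # error sd grows with x: heteroskedastic errors
fit <- lm(y ~ x)
coeftest(fit)                                    # usual (nonrobust) standard errors
coeftest(fit, vcov = vcovHC(fit, type = "HC0"))  # heteroskedasticity-robust standard errors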
The standard errors in the two sets are quite similar; there are no substantial differences between them.
If education increases by 4 years, the predicted probability of smoking falls by 4 x 0.029 = 0.116, i.e., by about 11.6 percentage points.
The marginal effect of age on the smoking probability is 0.020 - 2(0.00026)age, which switches from positive to negative at age = 0.020/(2 x 0.00026) = 38.46. So from about age 38-39 onward, an additional year of age reduces the probability of smoking.
Holding the other factors in the equation fixed, a person in a state with restaurant smoking restrictions has a 0.101 (about 10 percentage points) lower estimated probability of smoking.
The estimated probability of smoking for person number 206 is: p = 0.656 - 0.069 log(67.44) + 0.012 log(6500) - 0.029(16) + 0.020(77) - 0.00026(77^2) - 0.101(0) - 0.026(0) = 0.0052, i.e., about 0.52%.
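A quick check of this arithmetic (all values taken from the text):
p206 <- 0.656 - 0.069 * log(67.44) + 0.012 * log(6500) - 0.029 * 16 +
  0.020 * 77 - 0.00026 * 77^2 - 0.101 * 0 - 0.026 * 0
p206  # approximately 0.0052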
First, we obtain the residuals from the linear regression model and add their squares to the dataset.
data_8C4 <- vote1
model_8C4_1 <- lm(voteA~prtystrA+democA+log(expendA)+log(expendB),data= data_8C4)
summary(model_8C4_1)
##
## Call:
## lm(formula = voteA ~ prtystrA + democA + log(expendA) + log(expendB),
## data = data_8C4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.576 -4.864 -1.146 4.903 24.566
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.66141 4.73604 7.952 2.56e-13 ***
## prtystrA 0.25192 0.07129 3.534 0.00053 ***
## democA 3.79294 1.40652 2.697 0.00772 **
## log(expendA) 5.77929 0.39182 14.750 < 2e-16 ***
## log(expendB) -6.23784 0.39746 -15.694 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.573 on 168 degrees of freedom
## Multiple R-squared: 0.8012, Adjusted R-squared: 0.7964
## F-statistic: 169.2 on 4 and 168 DF, p-value: < 2.2e-16
data_8C4 <- data_8C4 %>% mutate(residual_squared=(model_8C4_1$residuals)^2)
head(data_8C4,5)
## state district democA voteA expendA expendB prtystrA lexpendA lexpendB
## 1 AL 7 1 68 328.296 8.737 41 5.793916 2.167567
## 2 AK 1 0 62 626.377 402.477 60 6.439952 5.997638
## 3 AZ 2 1 73 99.607 3.065 55 4.601233 1.120048
## 4 AZ 3 0 69 319.690 26.281 64 5.767352 3.268846
## 5 AR 3 0 75 159.221 60.054 66 5.070293 4.095244
## shareA residual_squared
## 1 97.40767 14.038465
## 2 60.88104 88.688079
## 3 97.01476 3.667330
## 4 92.40370 5.176379
## 5 72.61247 287.464317
Second, we regress the squared residuals on the independent variables:
model_8C4_1_r <- lm(residual_squared~prtystrA+democA+log(expendA)+log(expendB),data= data_8C4)
summary(model_8C4_1_r)
##
## Call:
## lm(formula = residual_squared ~ prtystrA + democA + log(expendA) +
## log(expendB), data = data_8C4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.88 -46.16 -23.51 15.84 508.32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 113.9635 50.8150 2.243 0.0262 *
## prtystrA -0.2993 0.7649 -0.391 0.6961
## democA 15.6192 15.0912 1.035 0.3022
## log(expendA) -10.3057 4.2040 -2.451 0.0153 *
## log(expendB) -0.0514 4.2645 -0.012 0.9904
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 81.25 on 168 degrees of freedom
## Multiple R-squared: 0.05256, Adjusted R-squared: 0.03
## F-statistic: 2.33 on 4 and 168 DF, p-value: 0.05806
The R-squared of this regression is small (0.053) but not exactly zero. If we regressed the residuals themselves, rather than their squares, on the independent variables, the R-squared would be exactly zero: OLS residuals are by construction uncorrelated with every regressor. The squared residuals, however, need not be uncorrelated with the regressors, and this auxiliary regression is precisely the basis of the Breusch-Pagan test below.
Note that bptest() should be applied to the original regression; the call below passes the auxiliary model model_8C4_1_r instead, so the reported p-value of 0.01817 does not test the equation of interest. The BP test of model_8C4_1 corresponds to the F test of the auxiliary regression above, whose p-value is 0.058: the evidence of heteroskedasticity is marginal at the 5% level but significant at the 10% level.
bptest(model_8C4_1_r)  # note: the test of interest would pass model_8C4_1
##
## studentized Breusch-Pagan test
##
## data: model_8C4_1_r
## BP = 11.893, df = 4, p-value = 0.01817
The same caveat applies to White's test: the call below passes the auxiliary model, whereas the equation of interest is model_8C4_1 (i.e., white_test(model_8C4_1)). As reported, the p-value of about 1.5e-05 rejects homoskedasticity very strongly.
white_test(model_8C4_1_r)  # note: the test of interest would pass model_8C4_1
## White's test results
##
## Null hypothesis: Homoskedasticity of the residuals
## Alternative hypothesis: Heteroskedasticity of the residuals
## Test Statistic: 22.2
## P-value: 1.5e-05
Comparing the nonrobust standard errors below with the robust ones computed afterwards, most (but not all) of the robust standard errors are larger than the nonrobust ones.
data_8C13 <- fertil2 %>% select(children, age, educ, electric, urban,spirit,protest,catholic) %>% na.omit()
model_8C13_1 <- lm(children~ age+I(age^2)+educ+electric+urban, data= data_8C13)
summary(model_8C13_1)
##
## Call:
## lm(formula = children ~ age + I(age^2) + educ + electric + urban,
## data = data_8C13)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9012 -0.7136 -0.0039 0.7119 7.4318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.2225162 0.2401888 -17.580 < 2e-16 ***
## age 0.3409255 0.0165082 20.652 < 2e-16 ***
## I(age^2) -0.0027412 0.0002718 -10.086 < 2e-16 ***
## educ -0.0752323 0.0062966 -11.948 < 2e-16 ***
## electric -0.3100404 0.0690045 -4.493 7.20e-06 ***
## urban -0.2000339 0.0465062 -4.301 1.74e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.452 on 4352 degrees of freedom
## Multiple R-squared: 0.5734, Adjusted R-squared: 0.5729
## F-statistic: 1170 on 5 and 4352 DF, p-value: < 2.2e-16
robust_8C13_1 <- coeftest(model_8C13_1, vcov = vcovHC(model_8C13_1, type = "HC0"))
# robust estimation changes only the standard errors, not the point estimates,
# so these fitted values are identical to fitted(model_8C13_1)
y_robust_1 <- as.numeric(model.matrix(model_8C13_1) %*% robust_8C13_1[, 1])
First, we conduct the F test based on the nonrobust (OLS) results. The F statistic is 2.04 < 2.61, the 5% critical value of F(3, 4349), so the three religious dummy variables are jointly insignificant at the 5% significance level. The p-value of the F test is 0.106.
model_8C13_2 <- lm(children~ age+I(age^2)+educ+electric+urban+spirit+protest+catholic, data= data_8C13)
summary(model_8C13_2)
##
## Call:
## lm(formula = children ~ age + I(age^2) + educ + electric + urban +
## spirit + protest + catholic, data = data_8C13)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9514 -0.7041 -0.0101 0.7213 7.4435
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.3147175 0.2433391 -17.731 < 2e-16 ***
## age 0.3418668 0.0165186 20.696 < 2e-16 ***
## I(age^2) -0.0027596 0.0002722 -10.139 < 2e-16 ***
## educ -0.0762130 0.0064608 -11.796 < 2e-16 ***
## electric -0.3057189 0.0690241 -4.429 9.69e-06 ***
## urban -0.2034131 0.0465864 -4.366 1.29e-05 ***
## spirit 0.1404603 0.0558052 2.517 0.0119 *
## protest 0.0753961 0.0652158 1.156 0.2477
## catholic 0.1173539 0.0834249 1.407 0.1596
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.451 on 4349 degrees of freedom
## Multiple R-squared: 0.574, Adjusted R-squared: 0.5733
## F-statistic: 732.6 on 8 and 4349 DF, p-value: < 2.2e-16
F= ((0.574-0.5734)/3)/((1-0.574)/4349)
F
## [1] 2.041784
critical_value = qf(0.95, df1 = 3, df2 = 4349)
critical_value
## [1] 2.606951
p_value = 1- pf(F,3,4349)
p_value
## [1] 0.1058349
Second, we recompute the F statistic from the sums of squared residuals. Because robust estimation changes only the standard errors, the fitted values below are identical to the OLS fitted values, so this is simply the SSR form of the same F test (the small discrepancy with 2.04 above comes from rounding the R-squared values). F = 2.196 < 2.61, so the three religious dummies remain jointly insignificant at the 5% level; the p-value of 0.086 means they are jointly significant at the 10% level. A genuinely robust joint test would use a heteroskedasticity-robust Wald test, as sketched after the computation below.
robust_8C13_2 <- coeftest(model_8C13_2, vcov = vcovHC(model_8C13_2, type = "HC0"))
# as before, the robust point estimates equal the OLS ones, so y_robust is
# identical to fitted(model_8C13_2)
y_robust <- as.numeric(model.matrix(model_8C13_2) %*% robust_8C13_2[, 1])
data_8C13 <- data_8C13 %>% mutate(residual_squared=(model_8C13_2$residuals)^2) %>%
mutate(y1= fitted(model_8C13_2)) %>% mutate(y2=y_robust)
ssr_r <- sum((y_robust_1 -data_8C13$children)^2)
ssr_ur <- sum((y_robust -data_8C13$children)^2)
F= ((ssr_r-ssr_ur)/3)/(ssr_ur/4349)
F
## [1] 2.196093
p_value = 1- pf(F,3,4349)
p_value
## [1] 0.08640607
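A minimal sketch of a genuinely robust joint test, assuming the car package is available:
library(car)
# heteroskedasticity-robust Wald test of H0: spirit = protest = catholic = 0
linearHypothesis(model_8C13_2, c("spirit = 0", "protest = 0", "catholic = 0"),
  vcov = vcovHC(model_8C13_2, type = "HC0"))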
In the regression of the squared residuals on the two sets of fitted values below, y2 is dropped ("1 not defined because of singularities") because it is identical to y1: robust estimation does not change the coefficients, so the two fitted-value series coincide. The regression is still informative: the slope on y1 is highly significant (p near 0), so the error variance is related to the expected value of the dependent variable, which is evidence of heteroskedasticity. The special form of the White test would use both the fitted values and their squares, as sketched after the output.
head(data_8C13,5)
## children age educ electric urban spirit protest catholic residual_squared
## 1 0 24 12 1 1 0 0 0 0.7688962
## 2 3 32 13 1 1 0 0 0 0.4909979
## 3 1 30 5 1 1 1 0 0 2.9169527
## 4 2 42 4 1 1 0 0 0 5.5779494
## 5 2 43 11 1 1 0 1 0 4.0440167
## y1 y2
## 1 0.8768673 0.8768673
## 2 2.2992876 2.2992876
## 3 2.7079089 2.7079089
## 4 4.3617683 4.3617683
## 5 4.0109741 4.0109741
sum(is.na(data_8C13$y2))
## [1] 0
model_8C13_3 <- lm(residual_squared~y1+y2, data= data_8C13)
summary(model_8C13_3)
##
## Call:
## lm(formula = residual_squared ~ y1 + y2, data = data_8C13)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.386 -1.909 -0.342 0.689 49.477
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.53547 0.09423 -5.683 1.41e-08 ***
## y1 1.16334 0.03337 34.866 < 2e-16 ***
## y2 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.708 on 4356 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.218
## F-statistic: 1216 on 1 and 4356 DF, p-value: < 2.2e-16
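A sketch of the special form of the White test for this model, regressing the squared residuals on the fitted values and their squares:
data_8C13 <- data_8C13 %>% mutate(y1_squared = y1^2)
white_special <- lm(residual_squared ~ y1 + y1_squared, data = data_8C13)
summary(white_special)  # n * R-squared is asymptotically chi-squared with df = 2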
For statistical inference, heteroskedasticity is practically important because it invalidates the usual standard errors and can make our conclusions wrong. For obtaining the point estimates, however, it is not really important, since OLS remains unbiased and consistent. This matches the answer to Question 8.1.
The F test gives a p-value of 0.054. Hence, at the 5% significance level there is no significant evidence of functional form misspecification, although the evidence would be significant at any level above about 5.4% (for example, at 10%).
F= ((0.375-0.353)/2)/((1-0.375)/169)
F
## [1] 2.9744
p_value = 1-pf(F,2,169)
p_value
## [1] 0.05375882
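The same kind of test is available directly as Ramsey's RESET in the lmtest package; a sketch, assuming the fitted model of interest is stored in a (hypothetical) object model_9:
resettest(model_9, power = 2:3, type = "fitted")  # adds yhat^2 and yhat^3 and tests them jointly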
In this case, when students are deciding which college to attend, they might consider the level of crime on campus. A college with a high crime rate may prefer not to share its crime data openly, in order to attract more students. In other words, colleges that do not report campus crimes are disproportionately likely to have high crime. Because selection into the sample is then related to the dependent variable itself, a college's failure to report crimes cannot be viewed as exogenous sample selection, and estimates based only on the reporting colleges will be biased.
There are several reasons why the unobserved factors in u might be correlated with grant, such as the resources of the company (plausibly positively related to receiving a grant), industry characteristics, geographic location, and the ability of the trainees.
The coefficient on grant is 0.0566, but its p-value is 0.8895, so the coefficient is not significantly different from 0. Hence, we cannot conclude that receiving a job training grant lowers a firm's scrap rate.
data_9C3 <- jtrain %>% filter(year==1988)
model_9C3_2 <- lm(log(scrap)~grant,data=data_9C3)
nobs(model_9C3_2)
## [1] 54
summary(model_9C3_2)
##
## Call:
## lm(formula = log(scrap) ~ grant, data = data_9C3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4043 -0.9536 -0.0465 0.9636 2.8103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4085 0.2406 1.698 0.0954 .
## grant 0.0566 0.4056 0.140 0.8895
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.423 on 52 degrees of freedom
## (103 observations deleted due to missingness)
## Multiple R-squared: 0.0003744, Adjusted R-squared: -0.01885
## F-statistic: 0.01948 on 1 and 52 DF, p-value: 0.8895
After adding log(scrap1987) as an explanatory variable, the grant1988 coefficient becomes negative and more precisely estimated: -0.254, implying that, conditional on the 1987 scrap rate, firms that received a grant have scrap rates roughly 25.4% lower than firms that did not.
This conclusion is statistically significant at the 5% level against the one-sided alternative H1: beta_grant < 0, because the one-sided p-value obtained from the model is 0.045 < 0.05.
data_9C3_3 <- jtrain %>% select(year,fcode,scrap,grant)
data_9C3_3_tran<- data_9C3_3 %>%
pivot_wider(names_from = year, values_from = c(scrap, grant), names_sep = "")
model_9C3_3 <- lm(log(scrap1988)~grant1988+log(scrap1987),data=data_9C3_3_tran)
summary(model_9C3_3)
##
## Call:
## lm(formula = log(scrap1988) ~ grant1988 + log(scrap1987), data = data_9C3_3_tran)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9146 -0.1763 0.0057 0.2308 1.5991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.02124 0.08910 0.238 0.8126
## grant1988 -0.25397 0.14703 -1.727 0.0902 .
## log(scrap1987) 0.83116 0.04444 18.701 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5127 on 51 degrees of freedom
## (103 observations deleted due to missingness)
## Multiple R-squared: 0.8728, Adjusted R-squared: 0.8678
## F-statistic: 174.9 on 2 and 51 DF, p-value: < 2.2e-16
t_stat <- summary(model_9C3_3)$coefficients["grant1988", "t value"]
p_value <- pt(t_stat, df = length(model_9C3_3$residuals) - length(model_9C3_3$coefficients), lower.tail = TRUE)
print("One-sided p-value:")
## [1] "One-sided p-value:"
print(p_value)
## [1] 0.04508135
In the model above, the p-value for log(scrap1987) is essentially zero, indicating strong evidence that the 1987 scrap rate predicts the 1988 scrap rate.
From the result below, the one-sided robust p-value for grant1988 (0.040) is even lower than in the previous test, while the p-value for log(scrap1987) (about 5.9e-16) is larger than before but still highly significant.
coeftest(model_9C3_3, vcov = vcovHC(model_9C3_3, type = "HC0"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.021237 0.097032 0.2189 0.82763
## grant1988 -0.253970 0.142249 -1.7854 0.08014 .
## log(scrap1987) 0.831161 0.071469 11.6297 5.862e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
t_stat_grant <- -0.253970/0.142249
p_value_grant <- pt(t_stat_grant, df = 51, lower.tail = TRUE)
print("One-sided p-value for H1: beta_grant < 0:")
## [1] "One-sided p-value for H1: beta_grant < 0:"
print(p_value_grant)
## [1] 0.04007254
t_stat_scrap <- 0.831161/0.071469
p_value_scrap <- 2 * pt(abs(t_stat_scrap), df = 51, lower.tail = FALSE)
print("Two-sided p-value for H1: beta_scrap != 0:")
## [1] "Two-sided p-value for H1: beta_scrap != 0:"
print(p_value_scrap)
## [1] 5.862668e-16
The coefficient on DC is 11.801, which is economically very large: holding per capita income and physicians fixed, the infant mortality rate in DC is estimated to be about 11.8 deaths per 1,000 live births higher than in the states. Its p-value is near zero, so the difference is also statistically significant.
data_9C4 <- infmrt
model_9C4_1 <- lm(infmort~log(pcinc)+log(physic)+DC, data=data_9C4)
summary(model_9C4_1)
##
## Call:
## lm(formula = infmort ~ log(pcinc) + log(physic) + DC, data = data_9C4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1313 -1.0042 -0.1868 1.0718 3.0486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.1855 7.2753 5.111 1.58e-06 ***
## log(pcinc) -2.6302 0.9576 -2.747 0.00716 **
## log(physic) -0.4273 0.7853 -0.544 0.58765
## DC 11.8012 1.2297 9.597 9.06e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.359 on 98 degrees of freedom
## Multiple R-squared: 0.5779, Adjusted R-squared: 0.565
## F-statistic: 44.72 on 3 and 98 DF, p-value: < 2.2e-16
The R-squared and adjusted R-squared are much larger when the DC dummy is included because the dummy fits the DC observation exactly (its residual becomes zero), so adding it mechanically removes a large outlier from the residuals and inflates R-squared.
model_9_44 <- lm(infmort~log(pcinc)+log(physic)+log(popul), data=data_9C4)
summary(model_9_44)
##
## Call:
## lm(formula = infmort ~ log(pcinc) + log(physic) + log(popul),
## data = data_9C4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0204 -1.1897 -0.1152 0.9083 8.1890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.22614 10.13466 3.574 0.000547 ***
## log(pcinc) -4.88415 1.29319 -3.777 0.000273 ***
## log(physic) 4.02783 0.89101 4.521 1.73e-05 ***
## log(popul) -0.05356 0.18748 -0.286 0.775712
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.891 on 98 degrees of freedom
## Multiple R-squared: 0.1818, Adjusted R-squared: 0.1568
## F-statistic: 7.26 on 3 and 98 DF, p-value: 0.0001898
I disagree with this statement: in time series data, observations are correlated over time, and many are strongly correlated. Hence, we cannot assume that time series observations are independently distributed.
I agree with this statement (see Theorem 10.1: under assumptions TS.1-TS.3, OLS is unbiased).
I disagree with this statement: a trending variable can be used as the dependent variable. However, we should be careful about issues such as spurious regression, for example by including a time trend.
I agree with this statement: seasonality refers to regular, predictable fluctuations that occur at specific intervals within a year, often associated with seasons, months, or days of the week. With annual time series observations, such within-year patterns are averaged out, so seasonality is generally not a concern.
We can specify a model in which housing starts are explained by the interest rate, real per capita income, and three quarterly dummy variables Q1, Q2, Q3 (with Q4 as the base season), as sketched below:
housing starts ~ interest rate + real per capita income + Q1 + Q2 + Q3
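A minimal sketch of this specification, assuming a quarterly data frame housing_data with (hypothetical) variables hstarts, intrate, pcinc and dummies Q1, Q2, Q3:
model_housing <- lm(hstarts ~ intrate + pcinc + Q1 + Q2 + Q3, data = housing_data)
summary(model_housing)  # the Q1-Q3 coefficients measure seasonal shifts relative to Q4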
Based on the model, the coefficient on later1979 is 1.56, which is economically large, and its p-value is 0.004, which is statistically significant. Hence, there is strong evidence of a shift in the interest rate equation after 1979: other things equal, the interest rate after 1979 is about 1.56 percentage points higher than in the earlier period.
data_10C1 <- intdef %>% mutate(later1979= ifelse(year>1979,1,0))
head(data_10C1,5)
## year i3 inf rec out def i3_1 inf_1 def_1 ci3 cinf
## 1 1948 1.04 8.1 16.2 11.6 -4.6000004 NA NA NA NA NA
## 2 1949 1.10 -1.2 14.5 14.3 -0.1999998 1.04 8.1 -4.6000004 0.06000006 -9.3
## 3 1950 1.22 1.3 14.4 15.6 1.2000008 1.10 -1.2 -0.1999998 0.12000000 2.5
## 4 1951 1.55 7.9 16.1 14.2 -1.9000006 1.22 1.3 1.2000008 0.32999992 6.6
## 5 1952 1.77 1.9 19.0 19.4 0.3999996 1.55 7.9 -1.9000006 0.22000003 -6.0
## cdef y77 later1979
## 1 NA 0 0
## 2 4.400001 0 0
## 3 1.400001 0 0
## 4 -3.100001 0 0
## 5 2.300000 0 0
model_10C1 <- lm(i3~inf+def+later1979, data=data_10C1)
summary(model_10C1)
##
## Call:
## lm(formula = i3 ~ inf + def + later1979, data = data_10C1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4674 -0.8407 0.2388 1.0148 3.9654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.29623 0.42535 3.047 0.00362 **
## inf 0.60842 0.07625 7.979 1.37e-10 ***
## def 0.36266 0.12025 3.016 0.00396 **
## later1979 1.55877 0.50577 3.082 0.00329 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.711 on 52 degrees of freedom
## Multiple R-squared: 0.6635, Adjusted R-squared: 0.6441
## F-statistic: 34.18 on 3 and 52 DF, p-value: 2.408e-12
data_10C6 <- fertil3
model_10C6_1 <- lm(gfr~t+I(t^2),data=data_10C6)
data_10C6 <- data_10C6 %>% mutate(gf=model_10C6_1$residuals)
head(data_10C6)
## gfr pe year t tsq pe_1 pe_2 pe_3 pe_4 pill ww2 tcu cgfr cpe
## 1 124.7 0.00 1913 1 1 NA NA NA NA 0 0 1 NA NA
## 2 126.6 0.00 1914 2 4 0.00 NA NA NA 0 0 8 1.900002 0.00
## 3 125.0 0.00 1915 3 9 0.00 0 NA NA 0 0 27 -1.599998 0.00
## 4 123.4 0.00 1916 4 16 0.00 0 0 NA 0 0 64 -1.599998 0.00
## 5 121.0 19.27 1917 5 25 0.00 0 0 0 0 0 125 -2.400002 19.27
## 6 119.8 23.94 1918 6 36 19.27 0 0 0 0 0 216 -1.199997 4.67
## cpe_1 cpe_2 cpe_3 cpe_4 gfr_1 cgfr_1 cgfr_2 cgfr_3 cgfr_4 gfr_2
## 1 NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA 124.7 NA NA NA NA NA
## 3 0.00 NA NA NA 126.6 1.900002 NA NA NA 124.7
## 4 0.00 0 NA NA 125.0 -1.599998 1.900002 NA NA 126.6
## 5 0.00 0 0 NA 123.4 -1.599998 -1.599998 1.900002 NA 125.0
## 6 19.27 0 0 0 121.0 -2.400002 -1.599998 -1.599998 1.900002 123.4
## gf
## 1 17.58000
## 2 19.43218
## 3 17.80028
## 4 16.18430
## 5 13.78423
## 6 12.60009
The R-squared for the detrended dependent variable gf is 0.6015, which is smaller than that for the trending dependent variable gfr (0.727). Although the explanatory power falls, the explanatory variables still explain a fair amount of the variation in the detrended series.
model_10C6_2 <- lm(gf~pe+ww2+pill+t+I(t^2),data=data_10C6)
summary(model_10C6_2)
##
## Call:
## lm(formula = gf ~ pe + ww2 + pill + t + I(t^2), data = data_10C6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.9791 -6.9775 -0.2713 7.7975 19.9861
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.035673 4.360738 3.907 0.000223 ***
## pe 0.347813 0.040260 8.639 1.91e-12 ***
## ww2 -35.880277 5.707921 -6.286 2.95e-08 ***
## pill -10.119723 6.336094 -1.597 0.115008
## t -2.603123 0.389386 -6.685 5.86e-09 ***
## I(t^2) 0.027572 0.004971 5.546 5.53e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.74 on 66 degrees of freedom
## Multiple R-squared: 0.6015, Adjusted R-squared: 0.5713
## F-statistic: 19.92 on 5 and 66 DF, p-value: 4.719e-12
After re-estimating the equation with t^3 added, the cubic term is statistically significant (t = -6.81), so there is evidence that t^3 belongs in the model. As a visual check, I plot gfr against t below; the trend is clearly nonlinear.
model_10C6_3 <- lm(gfr~pe+ww2+pill+t+I(t^2)+I(t^3),data=data_10C6)
summary(model_10C6_3)
##
## Call:
## lm(formula = gfr ~ pe + ww2 + pill + t + I(t^2) + I(t^3), data = data_10C6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.4463 -6.0271 0.3312 6.5957 17.7281
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.428e+02 4.338e+00 32.919 < 2e-16 ***
## pe 1.619e-01 4.131e-02 3.920 0.000216 ***
## ww2 -1.905e+01 5.042e+00 -3.778 0.000346 ***
## pill -2.501e+01 5.346e+00 -4.679 1.51e-05 ***
## t -5.612e+00 5.428e-01 -10.340 2.32e-15 ***
## I(t^2) 1.554e-01 2.030e-02 7.653 1.21e-10 ***
## I(t^3) -1.290e-03 1.894e-04 -6.809 3.77e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.267 on 65 degrees of freedom
## Multiple R-squared: 0.8405, Adjusted R-squared: 0.8257
## F-statistic: 57.07 on 6 and 65 DF, p-value: < 2.2e-16
plot(data_10C6$t, data_10C6$gfr, xlab = "t", ylab = "gfr")  # gfr against t