Proof problem

i) Show that M’ = M.

\[M' = (I -X(X'X)^{-1}X')^T = I-{X'}^T(X'X)^{-1}X^T\] \[\implies M' = I-X(X'X)^{-1}X^T = M.\]

ii) Show that \(M^2 = M\).

\[M^2 = (I -X(X'X)^{-1}X')^2\] \[= I-2X(X'X)^{-1}X'+ X(X'X)^{-1}X'X(X'X)^{-1}X'\] \[ = I - 2X(X'X)^{-1}X' + X(X'X)^{-1}X' = I - X(X'X)^{-1}X' = M\]

iii) Show that \(MX = O\) and \(X'M = O\).

\[MX = (I - X(X'X)^{-1}X')X = X- X(X'X)^{-1}X'X = X-X = O\] \[X'M = X'(I-X(X'X)^{-1}X') = X'- X'X(X'X)^{-1}X' = X'-X' = O\]

iv) Show that \(M_1M = MM_1 = M.\)

Before we prove this argument, we first state a fact about projection matrices:

Fact: let \(P = X(X'X)^{-1}X'\) and \(P_1 = X_1(X_1'X_1)^{-1}X_1'\), we have \[PP_1 = P_1P = P_1.\]

The proof of this fact is intuitive: \[PP_1 = PX_1(X_1'X_1)^{-1}X_1' = X_1(X_1'X_1)^{-1}X_1' = P_1.\] The first equality holds since all the columns of \(X_1\) are in \(\text{Span}(X)\), and so are left unchanged by projection matrix \(P\). (Similar for \(P_1P = P1\)).

Then we can apply this fact in \(M_1M\) and \(MM_1\)

\[M_1M = (I-X_1(X_1'X_1)^{-1}X_1)(I-X(X'X)^{-1}X') = (I-P_1)(I-P)\] \[= I-P-P_1+P_1P = I-P-P_1+P_1 = I-P\] \[= M\]

Similarly, we have

\[MM_1 = (I-X(X'X)^{-1}X')(I-X_1(X_1'X_1)^{-1}X_1) = (I-P)(I-P_1)\] \[=I - P_1 -P +PP_1 = I-P_1+P+P_1 = I-P \] \[= M\]

Q1

i)

The code is shown below:

lm1 = lm(lchnimp ~ lchempi + lgas + lrtwex + befile6 + affile6 + afdec6 + t, data = barium)
tidy(lm1)
## # A tibble: 8 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  -2.37    20.8        -0.114 0.909  
## 2 lchempi      -0.686    1.24       -0.554 0.581  
## 3 lgas          0.466    0.876       0.531 0.596  
## 4 lrtwex        0.0782   0.472       0.166 0.869  
## 5 befile6       0.0905   0.251       0.360 0.719  
## 6 affile6       0.0970   0.257       0.377 0.707  
## 7 afdec6       -0.352    0.283      -1.24  0.216  
## 8 t             0.0127   0.00384     3.31  0.00124

As we see, only the trend variable is statistically significant, implying a positive effect on \(log(chempi)\).

ii)

library("car")
lm2 = lm(lchnimp ~ t , data = barium)
## method1
tidy(linearHypothesis(lm1, c("lchempi=0","lgas = 0","lrtwex=0","befile6 = 0","affile6 = 0","afdec6 =0")))
## # A tibble: 2 x 6
##   res.df   rss    df sumsq statistic p.value
##    <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl>
## 1    129  41.7    NA NA       NA      NA    
## 2    123  40.6     6  1.07     0.540   0.777
## method2:  or by use R^2 of full model and restricted model: [(R_unre^2 - R_res^2)/q]/[(1-R^2)/n-k-1]
## method3:  or by anova()
# anova(lm1, lm2)

As we see, the \(F\) statistics = 0.540 with p-value = 0.777. We can conclude that the regressors other than t are insignificant.

iii)

lm3 = lm(lchnimp ~ lchempi + lgas + lrtwex + befile6 + affile6 + afdec6 + t  + feb + mar+ apr+ may+ jun+ jul+ aug+ sep + oct+ nov + dec, data = barium)
lm4 <-lm(lchnimp ~lchempi + lgas + lrtwex + befile6 + affile6 + afdec6 + t, data = barium)
tidy(anova(lm3,lm4))
## # A tibble: 2 x 6
##   res.df   rss    df sumsq statistic p.value
##    <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl>
## 1    112  37.5    NA NA       NA      NA    
## 2    123  40.6   -11 -3.12     0.847   0.594

F test with p-value = 0.59 shows that the dummy variables are not jointly significant, so nothing change importantly when including dummies.

Q2

i)

lm5 <- lm(gfr ~ t + tsq, data = fertil3)
tidy(lm5)
## # A tibble: 3 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) 107.        6.05       17.7   2.38e-27
## 2 t             0.0717    0.382       0.187 8.52e- 1
## 3 tsq          -0.00796   0.00508    -1.57  1.22e- 1
## save the residuals
resi <- residuals(lm5)

ii)

lm6 <- lm(resi~ fertil3$pe + fertil3$ww2 + fertil3$pill + fertil3$t + fertil3$tsq)
summary(lm6)
## 
## Call:
## lm(formula = resi ~ fertil3$pe + fertil3$ww2 + fertil3$pill + 
##     fertil3$t + fertil3$tsq)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.9791  -6.9775  -0.2713   7.7975  19.9861 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.035673   4.360738   3.907 0.000223 ***
## fertil3$pe     0.347813   0.040260   8.639 1.91e-12 ***
## fertil3$ww2  -35.880277   5.707921  -6.286 2.95e-08 ***
## fertil3$pill -10.119723   6.336094  -1.597 0.115008    
## fertil3$t     -2.603123   0.389386  -6.685 5.86e-09 ***
## fertil3$tsq    0.027572   0.004971   5.546 5.53e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.74 on 66 degrees of freedom
## Multiple R-squared:  0.6015, Adjusted R-squared:  0.5713 
## F-statistic: 19.92 on 5 and 66 DF,  p-value: 4.719e-12

The \(R^2\) becomes \(\approx 0.602\) compared with the \(R^2\) of regression without de-trend is \(0.727\). It implies that this de-trended regression still explain certain variation of \(gfr\).

iii)

lm7 <- lm(fertil3$gfr~ fertil3$pe + fertil3$ww2 + fertil3$pill + fertil3$t + fertil3$tsq + fertil3$tcu)
tidy(lm7)
## # A tibble: 7 x 5
##   term          estimate std.error statistic  p.value
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  143.       4.34         32.9  2.95e-42
## 2 fertil3$pe     0.162    0.0413        3.92 2.16e- 4
## 3 fertil3$ww2  -19.0      5.04         -3.78 3.46e- 4
## 4 fertil3$pill -25.0      5.35         -4.68 1.51e- 5
## 5 fertil3$t     -5.61     0.543       -10.3  2.32e-15
## 6 fertil3$tsq    0.155    0.0203        7.65 1.21e-10
## 7 fertil3$tcu   -0.00129  0.000189     -6.81 3.77e- 9

The coef of \(t^3\) is quite significant as its p.value is quite small.