\[M' = (I -X(X'X)^{-1}X')^T = I-{X'}^T(X'X)^{-1}X^T\] \[\implies M' = I-X(X'X)^{-1}X^T = M.\]
\[M^2 = (I -X(X'X)^{-1}X')^2\] \[= I-2X(X'X)^{-1}X'+ X(X'X)^{-1}X'X(X'X)^{-1}X'\] \[ = I - 2X(X'X)^{-1}X' + X(X'X)^{-1}X' = I - X(X'X)^{-1}X' = M\]
\[MX = (I - X(X'X)^{-1}X')X = X- X(X'X)^{-1}X'X = X-X = O\] \[X'M = X'(I-X(X'X)^{-1}X') = X'- X'X(X'X)^{-1}X' = X'-X' = O\]
Before we prove this argument, we first state a fact about projection matrices:
Fact: let \(P = X(X'X)^{-1}X'\) and \(P_1 = X_1(X_1'X_1)^{-1}X_1'\), we have \[PP_1 = P_1P = P_1.\]
The proof of this fact is intuitive: \[PP_1 = PX_1(X_1'X_1)^{-1}X_1' = X_1(X_1'X_1)^{-1}X_1' = P_1.\] The first equality holds since all the columns of \(X_1\) are in \(\text{Span}(X)\), and so are left unchanged by projection matrix \(P\). (Similar for \(P_1P = P1\)).
Then we can apply this fact in \(M_1M\) and \(MM_1\)
\[M_1M = (I-X_1(X_1'X_1)^{-1}X_1)(I-X(X'X)^{-1}X') = (I-P_1)(I-P)\] \[= I-P-P_1+P_1P = I-P-P_1+P_1 = I-P\] \[= M\]
Similarly, we have
\[MM_1 = (I-X(X'X)^{-1}X')(I-X_1(X_1'X_1)^{-1}X_1) = (I-P)(I-P_1)\] \[=I - P_1 -P +PP_1 = I-P_1+P+P_1 = I-P \] \[= M\]
The code is shown below:
lm1 = lm(lchnimp ~ lchempi + lgas + lrtwex + befile6 + affile6 + afdec6 + t, data = barium)
tidy(lm1)
## # A tibble: 8 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.37 20.8 -0.114 0.909
## 2 lchempi -0.686 1.24 -0.554 0.581
## 3 lgas 0.466 0.876 0.531 0.596
## 4 lrtwex 0.0782 0.472 0.166 0.869
## 5 befile6 0.0905 0.251 0.360 0.719
## 6 affile6 0.0970 0.257 0.377 0.707
## 7 afdec6 -0.352 0.283 -1.24 0.216
## 8 t 0.0127 0.00384 3.31 0.00124
As we see, only the trend variable is statistically significant, implying a positive effect on \(log(chempi)\).
library("car")
lm2 = lm(lchnimp ~ t , data = barium)
## method1
tidy(linearHypothesis(lm1, c("lchempi=0","lgas = 0","lrtwex=0","befile6 = 0","affile6 = 0","afdec6 =0")))
## # A tibble: 2 x 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 129 41.7 NA NA NA NA
## 2 123 40.6 6 1.07 0.540 0.777
## method2: or by use R^2 of full model and restricted model: [(R_unre^2 - R_res^2)/q]/[(1-R^2)/n-k-1]
## method3: or by anova()
# anova(lm1, lm2)
As we see, the \(F\) statistics = 0.540 with p-value = 0.777. We can conclude that the regressors other than t are insignificant.
lm3 = lm(lchnimp ~ lchempi + lgas + lrtwex + befile6 + affile6 + afdec6 + t + feb + mar+ apr+ may+ jun+ jul+ aug+ sep + oct+ nov + dec, data = barium)
lm4 <-lm(lchnimp ~lchempi + lgas + lrtwex + befile6 + affile6 + afdec6 + t, data = barium)
tidy(anova(lm3,lm4))
## # A tibble: 2 x 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 112 37.5 NA NA NA NA
## 2 123 40.6 -11 -3.12 0.847 0.594
F test with p-value = 0.59 shows that the dummy variables are not jointly significant, so nothing change importantly when including dummies.
lm5 <- lm(gfr ~ t + tsq, data = fertil3)
tidy(lm5)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 107. 6.05 17.7 2.38e-27
## 2 t 0.0717 0.382 0.187 8.52e- 1
## 3 tsq -0.00796 0.00508 -1.57 1.22e- 1
## save the residuals
resi <- residuals(lm5)
lm6 <- lm(resi~ fertil3$pe + fertil3$ww2 + fertil3$pill + fertil3$t + fertil3$tsq)
summary(lm6)
##
## Call:
## lm(formula = resi ~ fertil3$pe + fertil3$ww2 + fertil3$pill +
## fertil3$t + fertil3$tsq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.9791 -6.9775 -0.2713 7.7975 19.9861
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.035673 4.360738 3.907 0.000223 ***
## fertil3$pe 0.347813 0.040260 8.639 1.91e-12 ***
## fertil3$ww2 -35.880277 5.707921 -6.286 2.95e-08 ***
## fertil3$pill -10.119723 6.336094 -1.597 0.115008
## fertil3$t -2.603123 0.389386 -6.685 5.86e-09 ***
## fertil3$tsq 0.027572 0.004971 5.546 5.53e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.74 on 66 degrees of freedom
## Multiple R-squared: 0.6015, Adjusted R-squared: 0.5713
## F-statistic: 19.92 on 5 and 66 DF, p-value: 4.719e-12
The \(R^2\) becomes \(\approx 0.602\) compared with the \(R^2\) of regression without de-trend is \(0.727\). It implies that this de-trended regression still explain certain variation of \(gfr\).
lm7 <- lm(fertil3$gfr~ fertil3$pe + fertil3$ww2 + fertil3$pill + fertil3$t + fertil3$tsq + fertil3$tcu)
tidy(lm7)
## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 143. 4.34 32.9 2.95e-42
## 2 fertil3$pe 0.162 0.0413 3.92 2.16e- 4
## 3 fertil3$ww2 -19.0 5.04 -3.78 3.46e- 4
## 4 fertil3$pill -25.0 5.35 -4.68 1.51e- 5
## 5 fertil3$t -5.61 0.543 -10.3 2.32e-15
## 6 fertil3$tsq 0.155 0.0203 7.65 1.21e-10
## 7 fertil3$tcu -0.00129 0.000189 -6.81 3.77e- 9
The coef of \(t^3\) is quite significant as its p.value is quite small.