You may discuss homework problems with other students, but you have to prepare the written assignments yourself.
Please combine all your answers, the computer code and the figures into one HTML file.
Please use the file name as: 20200160XX_Name_04.html.
Grading scheme: \(\left\{ 0,~ 1,~ 2~ \right\}\) points per question.
2: submitted on time and more or less correct answer
1: submitted on time and partially correct answer
0: submitted with a completely incorrect answer or late submission (any day after the due date for more than one homework assignment).
Send your answers to: 0205@cupk.edu.cn. Due date: 24:00 October 22, 2022 (Saturday evening).
The data set \(\tt state.x77\) in \(\Re\) contains the following statistics (among others) related to the 50 states of the United States of America:
\(\tt Population\): population estimate (1975)
\(\tt Income\): per capita income (1974)
\(\tt Illiteracy\): illiterracy (1970, percent of population)
\(\tt HS.Grad\): percent high school graduates (1970)
state.data <- data.frame(state.x77)
We are interested in the relation between \(\tt Income\) and other 3 variables (\(\tt Population\), \(\tt Illiteracy\), \(\tt HS.Grad\)).
(2 points)head(state.data)
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## California 21198 5114 1.1 71.71 10.3 62.6 20 156361
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
str(state.data)
## 'data.frame': 50 obs. of 8 variables:
## $ Population: num 3615 365 2212 2110 21198 ...
## $ Income : num 3624 6315 4530 3378 5114 ...
## $ Illiteracy: num 2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
## $ Life.Exp : num 69 69.3 70.5 70.7 71.7 ...
## $ Murder : num 15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
## $ HS.Grad : num 41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
## $ Frost : num 20 152 15 65 20 166 139 103 11 60 ...
## $ Area : num 50708 566432 113417 51945 156361 ...
a = c(2,3,4,7)
plot(state.data[,a], pch=23, bg='orange')
(2 points)income.lm = lm(Income~Population+Illiteracy+HS.Grad,data = state.data )
summary(income.lm)
##
## Call:
## lm(formula = Income ~ Population + Illiteracy + HS.Grad, data = state.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -987.30 -213.67 -50.68 219.74 1430.99
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1940.74115 710.64526 2.731 0.008924 **
## Population 0.03786 0.01500 2.524 0.015129 *
## Illiteracy -73.57563 145.07584 -0.507 0.614470
## HS.Grad 45.57445 10.93778 4.167 0.000135 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 465.8 on 46 degrees of freedom
## Multiple R-squared: 0.4606, Adjusted R-squared: 0.4254
## F-statistic: 13.09 on 3 and 46 DF, p-value: 2.612e-06
(2 points)par(mfrow=c(2, 2))
plot(income.lm)
(2 points)plot(abs(dffits(income.lm)), pch=23, bg='orange',cex=2,ylab="DFFITS")
(2 points)plot(cooks.distance(income.lm), pch=23, bg='orange', cex=2, ylab="Cook's distance")
state.data[which(cooks.distance(income.lm) > 0.1),]
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## California 21198 5114 1.1 71.71 10.3 62.6 20 156361
## New Mexico 1144 3601 2.2 70.32 9.7 55.2 120 121412
## Utah 1203 4022 0.6 72.90 4.5 67.3 137 82096
(2 points)state.data[which(hatvalues(income.lm) > 0.3), ]
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## California 21198 5114 1.1 71.71 10.3 62.6 20 156361
(2 points)n = nrow(state.data)
cutoff = qt(0.05 / (2 * n), (n - 5), lower.tail = FALSE)
state.data[which(abs(rstudent(income.lm)) > cutoff), ]
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
income.lm1 = lm(Income ~ Population + Illiteracy + HS.Grad, data = state.data[-2,])
summary(income.lm1)
##
## Call:
## lm(formula = Income ~ Population + Illiteracy + HS.Grad, data = state.data[-2,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -806.64 -223.85 -94.83 228.49 895.23
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2922.00458 669.84639 4.362 7.42e-05 ***
## Population 0.04463 0.01322 3.375 0.00153 **
## Illiteracy -248.72103 134.46160 -1.850 0.07092 .
## HS.Grad 29.75007 10.38054 2.866 0.00630 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 407 on 45 degrees of freedom
## Multiple R-squared: 0.4997, Adjusted R-squared: 0.4663
## F-statistic: 14.98 on 3 and 45 DF, p-value: 6.723e-07
(2 points)Consider the Scottish hills races data.
url = 'http://www.statsci.org/data/general/hills.txt'
races.table = read.table(url, header=TRUE, sep='\t')
Choose an observation index \(i\) (e.g. \(i = 33\), which corresponds to the outlying observation number 33) and create an dummy variable \(U\), where all the values of \(U\) are zero except for its \(i\)-th (33rd value) value which is one.
U <- rep(0, nrow(races.table))
U[33] <- 1
Now consider comparing the following models:
\[ H_0: \text{Time} = \beta_0 + \beta_1 \text{Distance} + \beta_2 \text{Climb} + \varepsilon , \quad \text{say, } \quad \text{Model 1} \]
\[ H_0: \text{Time} = \beta_0 + \beta_1 \text{Distance} + \beta_2 \text{Climb} + \beta_3 U + \varepsilon , \quad \text{say, } \quad \text{Model 2} \]
Let \(t_i^* = \dfrac{\hat{Y}_i - \hat{Y}_{(i),i}}{\hat{\sigma}_{(i)}\sqrt{1 - H_{ii}}}\) be the \(i\)-th externally standardized residual (studentized residuals), where \(\hat{\sigma}_{(i)}^2\) is the mean square error of fit without the \(i\)-th observation, \(H_{ii}\) is the \(i\)-th leverage value (That is the \(i\)-th diagonal element of \(\boldsymbol{X}\left( \boldsymbol{X}^{\rm T} \boldsymbol{X} \right)^{-1} \boldsymbol{X}^{\rm T}\)). Verify using the 33rd observation that
(2 points)races.table$U=U
races.lm = lm(Time ~ Distance + Climb, data=races.table)
races.lm2 = lm(Time ~ Distance + Climb + U, data=races.table)
summary(races.lm2)
##
## Call:
## lm(formula = Time ~ Distance + Climb + U, data = races.table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.898 -7.009 -1.321 1.764 64.724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.306261 4.433809 -1.873 0.0705 .
## Distance 6.159210 0.610807 10.084 2.63e-11 ***
## Climb 0.010726 0.002112 5.079 1.71e-05 ***
## U 11.913956 16.241521 0.734 0.4687
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.78 on 31 degrees of freedom
## Multiple R-squared: 0.9204, Adjusted R-squared: 0.9127
## F-statistic: 119.5 on 3 and 31 DF, p-value: < 2.2e-16
rstudent(races.lm)[33]
## 33
## 0.7335493
(2 points)anova(races.lm,races.lm2)
## Analysis of Variance Table
##
## Model 1: Time ~ Distance + Climb
## Model 2: Time ~ Distance + Climb + U
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 32 6891.9
## 2 31 6774.3 1 117.59 0.5381 0.4687
rstudent(races.lm)[33]^2
## 33
## 0.5380945
(2 points)races.lm3 = lm(Time ~ Distance + Climb, data=races.table[-33,])
races.lm3
##
## Call:
## lm(formula = Time ~ Distance + Climb, data = races.table[-33,
## ])
##
## Coefficients:
## (Intercept) Distance Climb
## -8.30626 6.15921 0.01073
(2 points)races.lm2$coefficients
## (Intercept) Distance Climb U
## -8.30626092 6.15921043 0.01072625 11.91395619
races.lm3$coefficients
## (Intercept) Distance Climb
## -8.30626092 6.15921043 0.01072625
This question will review some of the fundamental concepts of the multiple linear regression model.
(2 points)(2 points)\(\epsilon \sim N(0,\sigma^{2})\)
Observations forms a independent sample.
(2 points)(2 points)(2 points)(2 points)