1. For the prostate data, fit a model with lpsa as the response and the other variables as predictors:
library(faraway)
## Warning: package 'faraway' was built under R version 4.3.1
data(prostate)
r<-lm(lpsa~.,data=prostate)
summary(r)
##
## Call:
## lm(formula = lpsa ~ ., data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7331 -0.3713 -0.0170 0.4141 1.6381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.669337 1.296387 0.516 0.60693
## lcavol 0.587022 0.087920 6.677 2.11e-09 ***
## lweight 0.454467 0.170012 2.673 0.00896 **
## age -0.019637 0.011173 -1.758 0.08229 .
## lbph 0.107054 0.058449 1.832 0.07040 .
## svi 0.766157 0.244309 3.136 0.00233 **
## lcp -0.105474 0.091013 -1.159 0.24964
## gleason 0.045142 0.157465 0.287 0.77503
## pgg45 0.004525 0.004421 1.024 0.30886
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7084 on 88 degrees of freedom
## Multiple R-squared: 0.6548, Adjusted R-squared: 0.6234
## F-statistic: 20.86 on 8 and 88 DF, p-value: < 2.2e-16
(a) Compute 95% CIs for the parameter associated with age. Using just these intervals, what could we have deduced about the p-value for age in the regression summary?
confint(r)
## 2.5 % 97.5 %
## (Intercept) -1.906960983 3.245634379
## lcavol 0.412298699 0.761744954
## lweight 0.116603435 0.792331414
## age -0.041840618 0.002566267
## lbph -0.009101499 0.223209561
## svi 0.280644232 1.251670420
## lcp -0.286344443 0.075395916
## gleason -0.267786053 0.358069248
## pgg45 -0.004260932 0.013311395
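The 95% CI for the age coefficient is (-0.042, 0.003), which contains 0. Therefore, at significance level \(\alpha = 0.05\), the test of \(H_0\): the age coefficient is 0 against \(H_1\): otherwise fails to reject the null hypothesis, and we can deduce that the p-value for age is greater than 0.05.
The regression summary confirms this: the p-value for age is 0.082 > 0.05, so the null hypothesis that the age coefficient is zero cannot be rejected.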
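The link between the interval and the p-value can be checked directly. The following is a minimal sketch, refitting the same model under the illustrative names `fit`, `ci_age` and `p_age` (not part of the original solution):

```r
library(faraway)
data(prostate)

# Same full model as above, refitted under an illustrative name
fit <- lm(lpsa ~ ., data = prostate)

# 95% confidence interval for the age coefficient only
ci_age <- confint(fit, "age", level = 0.95)

# p-value for age taken from the coefficient table
p_age <- summary(fit)$coefficients["age", "Pr(>|t|)"]

# The 95% interval contains 0 exactly when the two-sided p-value exceeds 0.05
contains_zero <- ci_age[1, 1] < 0 && ci_age[1, 2] > 0
contains_zero == (p_age > 0.05)
```

Both quantities come from the same t statistic, which is why the two conditions always agree.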
(b) Which variables are statistically significant at the 5% level?
Three variables are significant: lcavol, lweight, and svi. Each has a p-value below 0.05.
(c) Fit a model with age and lcavol as predictors and use an F-test to compare it to the full model.
r1<-lm(lpsa~age+lcavol, data=prostate)
anova(r1,r)
## Analysis of Variance Table
##
## Model 1: lpsa ~ age + lcavol
## Model 2: lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason +
## pgg45
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 94 58.912
## 2 88 44.163 6 14.749 4.8983 0.0002308 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
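The p-value is 0.00023 < 0.05, so we reject the null hypothesis that the coefficients of the six omitted predictors are all zero. The model containing only age and lcavol differs significantly from the full model.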
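The F statistic in the ANOVA table can be reproduced by hand from the residual sums of squares and degrees of freedom. This is a sketch for illustration; the names `fit_full`, `fit_small`, `f_stat` and `p_val` are introduced here and do not appear in the original solution:

```r
library(faraway)
data(prostate)

fit_full  <- lm(lpsa ~ ., data = prostate)
fit_small <- lm(lpsa ~ age + lcavol, data = prostate)

# Residual sums of squares and residual degrees of freedom
rss_full  <- deviance(fit_full)
rss_small <- deviance(fit_small)
df_full   <- df.residual(fit_full)
df_small  <- df.residual(fit_small)

# F = [(RSS_small - RSS_full) / (df_small - df_full)] / (RSS_full / df_full)
f_stat <- ((rss_small - rss_full) / (df_small - df_full)) / (rss_full / df_full)

# Upper-tail probability under the F distribution
p_val <- pf(f_stat, df_small - df_full, df_full, lower.tail = FALSE)

c(F = f_stat, p = p_val)
```

With the values printed above, this is (14.749 / 6) / (44.163 / 88), matching the F value reported by `anova()`.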
2. Using the teengamb data, fit a model with gamble as the response and the other variables as predictors. Predict the amount that a male with average (given these data) status, income and verbal score would gamble along with an appropriate 95% CI.
data(teengamb)
r2<-lm(gamble~.,data=teengamb)
new.data<-data.frame("sex"=0,t(colMeans(teengamb[,2:4])))
predict(r2,new.data,interval="prediction")
## fit lwr upr
## 1 28.24252 -18.51536 75.00039
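The predicted amount is 28.243, with a 95% prediction interval of (-18.515, 75.000). A prediction interval is used because the question concerns a single individual rather than the mean response.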
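For comparison, the sketch below computes both the prediction interval (for one new individual) and the confidence interval (for the mean response at the same predictor values). The names `fit`, `new_obs`, `pred_int` and `conf_int` are illustrative, and the confidence interval is an addition not asked for in the original answer:

```r
library(faraway)
data(teengamb)

fit <- lm(gamble ~ ., data = teengamb)

# A male (sex = 0) with average status, income and verbal score
new_obs <- data.frame(
  sex    = 0,
  status = mean(teengamb$status),
  income = mean(teengamb$income),
  verbal = mean(teengamb$verbal)
)

# Interval for a single new observation
pred_int <- predict(fit, new_obs, interval = "prediction", level = 0.95)

# Interval for the mean response at these predictor values
conf_int <- predict(fit, new_obs, interval = "confidence", level = 0.95)

rbind(prediction = pred_int[1, ], confidence = conf_int[1, ])
```

The point estimate is identical in both rows. The confidence interval is narrower because it omits the variance of the new observation's own error term.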
3. Using the sat dataset, fit a model with the total SAT score as the response and expend, salary, ratio and takers as predictors. Perform regression diagnostics on this model to answer the following questions. Display any plots that are relevant. Do not provide any plots about which you have nothing to say.
data(sat)
r3<-lm(total~expend+salary+ratio+takers,data=sat)
summary(r3)
##
## Call:
## lm(formula = total ~ expend + salary + ratio + takers, data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.531 -20.855 -1.746 15.979 66.571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1045.9715 52.8698 19.784 < 2e-16 ***
## expend 4.4626 10.5465 0.423 0.674
## salary 1.6379 2.3872 0.686 0.496
## ratio -3.6242 3.2154 -1.127 0.266
## takers -2.9045 0.2313 -12.559 2.61e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.7 on 45 degrees of freedom
## Multiple R-squared: 0.8246, Adjusted R-squared: 0.809
## F-statistic: 52.88 on 4 and 45 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(r3)
(a) Check the constant variance assumption for the errors.
In the “Residuals vs Fitted” plot, the spread of the residuals shows no obvious change across the range of fitted values, so the constant variance assumption does not appear to be violated.
(b) Check the normality assumption.
In the “Q-Q Residuals” plot, the sample points all lie close to the 45-degree line, so the normality assumption does not appear to be violated.
(c) Check for the largest leverage points.
hatv <- hatvalues(r3)
head(sort(hatv, decreasing = TRUE))
## Utah California Connecticut New Jersey New York Alaska
## 0.2921128 0.2821179 0.2254519 0.2220978 0.1915752 0.1803061
Utah has the largest leverage, 0.292. The “Residuals vs Leverage” plot also shows that Utah has the largest leverage.
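One common rule of thumb, used here as an assumption and not taken from the original answer, flags observations whose leverage exceeds 2p/n, where p is the number of coefficients and n the number of observations. A minimal sketch with illustrative names (`fit`, `hat_values`, `threshold`, `flagged`):

```r
library(faraway)
data(sat)

fit <- lm(total ~ expend + salary + ratio + takers, data = sat)
hat_values <- hatvalues(fit)

# Rule-of-thumb cutoff 2p/n (an assumed convention, not a formal test)
p <- length(coef(fit))  # 5 coefficients, including the intercept
n <- nrow(sat)          # 50 states
threshold <- 2 * p / n

# States whose leverage exceeds the cutoff
flagged <- names(hat_values)[hat_values > threshold]
flagged
```

With p = 5 and n = 50 the cutoff is 0.2, so among the values listed above Utah, California, Connecticut and New Jersey exceed it.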
(d) Check for the possible outlier.
In the “Q-Q Residuals” plot, West Virginia deviates the most from the 45-degree line, so it is a possible outlier.
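A more formal check, offered here as a swapped-in technique and not the method used in the original answer, compares the externally studentized residuals with a Bonferroni-corrected t critical value. The names `fit`, `stud_res` and `crit` are illustrative:

```r
library(faraway)
data(sat)

fit <- lm(total ~ expend + salary + ratio + takers, data = sat)

# Externally studentized (jackknife) residuals
stud_res <- rstudent(fit)

# Bonferroni critical value: two-sided level 0.05 shared across n tests,
# with n - p - 1 degrees of freedom
n <- nrow(sat)
p <- length(coef(fit))
crit <- qt(1 - 0.05 / (2 * n), df = n - p - 1)

# Largest studentized residuals in absolute value, next to the cutoff
head(sort(abs(stud_res), decreasing = TRUE), 3)
crit
```

An observation would be declared an outlier under this test only if its absolute studentized residual exceeds `crit`. The correction is conservative, so a point that looks unusual on the Q-Q plot may still fall below the cutoff.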
(e) Check for the possible influential points.
In the “Residuals vs Leverage” plot, Utah has the largest Cook’s distance.
halfnorm(cooks.distance(r3), 3, labs=rownames(sat), ylab="Cook's distances")
The half-normal plot also shows that Utah’s Cook’s distance stands apart from those of the other observations. Utah is therefore a possible influential point.
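One way to gauge how much Utah actually matters is to refit the model without it and compare the coefficients. This is a sketch of that follow-up step, which the original answer does not perform; `fit_all`, `fit_no_utah` and `cook` are illustrative names:

```r
library(faraway)
data(sat)

fit_all <- lm(total ~ expend + salary + ratio + takers, data = sat)

# Observation with the largest Cook's distance
cook <- cooks.distance(fit_all)
names(which.max(cook))

# Refit after dropping Utah
fit_no_utah <- lm(total ~ expend + salary + ratio + takers,
                  data = sat[rownames(sat) != "Utah", ])

# Side-by-side coefficients with and without Utah
round(cbind(all = coef(fit_all), no_utah = coef(fit_no_utah)), 3)
```

Large shifts in the estimates between the two columns would support treating Utah as influential. Small shifts would suggest that its high leverage has little practical effect on the fit.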