6.15
part (a)
setwd("C:/Users/Dongrui/Desktop/21 Fall/Linear Regression Models/HW/HW5")
options(scipen = 200)
data=read.table("CH06PR15.txt",col.names = c("y","x1","x2","x3"))
stem(data$x1)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 2 | 23
## 2 | 58899999
## 3 | 012233344
## 3 | 66678
## 4 | 0012233344
## 4 | 557779
## 5 | 0233
## 5 | 55
stem(data$x2)
##
## The decimal point is at the |
##
## 40 | 0
## 42 | 00
## 44 | 0
## 46 | 00000
## 48 | 0000000000
## 50 | 000000000000
## 52 | 000000
## 54 | 0000
## 56 | 00
## 58 | 0
## 60 | 0
## 62 | 0
stem(data$x3)
##
## The decimal point is 1 digit(s) to the left of the |
##
## 18 | 000000
## 20 | 00000000
## 22 | 0000000000000
## 24 | 00000000000
## 26 | 0000
## 28 | 0000
From the stem-and-leaf plots, \(x_1\) and \(x_2\) are integer-valued while \(x_3\) is recorded to one decimal place; \(x_2\) and \(x_3\) appear roughly symmetric and normally distributed, while \(x_1\) is somewhat bimodal, with clusters of values in the low 30s and low 40s.
part (b)
library(GGally)
ggpairs(data,title="scatter plot matrix and correlation matrix")
The ggpairs output suggests that \(y\) has a significant negative correlation with each of \(x_1\), \(x_2\), and \(x_3\), and that the three predictors are all positively correlated with one another.
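For reference, the numeric correlation matrix underlying the ggpairs panels can also be printed directly:
round(cor(data), 2)   # pairwise correlations among y, x1, x2, x3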
part (c)
model1 = lm(y~x1+x2+x3,data = data)
model1
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = data)
##
## Coefficients:
## (Intercept) x1 x2 x3
## 158.491 -1.142 -0.442 -13.470
The estimated regression function is \(\hat{y} = 158.491 - 1.142x_1 - 0.442x_2 - 13.470x_3\). Here \(b_2 = -0.442\) means that, holding \(x_1\) and \(x_3\) fixed, a one-unit increase in \(x_2\) is associated with an estimated decrease of 0.442 in the mean response.
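As a quick check of this interpretation (the two covariate points below are hypothetical values chosen within the observed ranges), the fitted values of two cases that differ only by one unit in \(x_2\) differ by exactly \(b_2\):
# hypothetical points differing only in x2; the fitted values differ by b2 = -0.442
new = data.frame(x1 = c(40, 40), x2 = c(50, 51), x3 = c(2.3, 2.3))
diff(predict(model1, new))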
part (d)
boxplot(model1$residuals)
No, there do not appear to be any outliers: the box plot flags no points beyond the whiskers, and the distribution of the residuals looks roughly symmetric.
part (e)
par(mfrow=c(2, 2))
# residuals against the fitted values and against each predictor
plot(model1$fitted.values,model1$residuals)
abline(0,0)
plot(data$x1,model1$residuals)
abline(0,0)
plot(data$x2,model1$residuals)
abline(0,0)
plot(data$x3,model1$residuals)
abline(0,0)
# residuals against each two-factor interaction term
par(mfrow=c(3, 1))
plot(data$x1*data$x2,model1$residuals)
abline(0,0)
plot(data$x1*data$x3,model1$residuals)
abline(0,0)
plot(data$x2*data$x3,model1$residuals)
abline(0,0)
n = length(data$y)
# expected values of the ordered residuals under normality:
# sqrt(MSE) * z[(k - 0.375)/(n + 0.25)]
Ee = sqrt(anova(model1)$`Mean Sq`[4]) * qnorm((1:n - 0.375)/(n + 0.25))
qqnorm(model1$residuals)
In the plot of the residuals against the interaction term \(x_2 x_3\), the points are more concentrated on the left side of the plot; otherwise, the residual plots and the normal probability plot look unremarkable.
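As a numerical companion to the normal probability plot, the coefficient of correlation between the ordered residuals and their expected values under normality can be computed and compared against the critical value in Table B.6 of the textbook:
cor(sort(model1$residuals), Ee)   # correlation test for normality of the errors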
part (f)
# a formal lack-of-fit test requires replicate observations, i.e. rows
# that share an identical (x1, x2, x3) combination
replicates = duplicated(data[, c("x1", "x2", "x3")])
sum(replicates)
Since there are no replicate observations, that is, no two cases share the same \((x_1, x_2, x_3)\) combination, we cannot perform a formal test for lack of fit here.
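For completeness, if replicates were present, the lack-of-fit test would compare the regression model against the full model that fits a separate mean at each distinct \((x_1, x_2, x_3)\) combination. A minimal sketch is below; with these data the full model is saturated (zero error df), so the F statistic cannot actually be computed:
# full model: one cell mean per distinct (x1, x2, x3) combination;
# with no replicates it has zero error df, so the test is degenerate here
full = lm(y ~ factor(paste(x1, x2, x3)), data = data)
anova(model1, full)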
part (g)
# Breusch-Pagan test: regress the squared residuals on the predictors
dt1 = data.frame(e2 = model1$residuals^2, data[, c("x1", "x2", "x3")])
model2 = lm(e2 ~ x1 + x2 + x3, data = dt1)
anova1 = anova(model1)
anova2 = anova(model2)
q = 3                                # df: number of predictors tested
n = length(data$y)
SSRstar = sum(anova2$`Sum Sq`[1:3])  # SSR* from the regression of e^2 on the X's
SSE = anova1$`Sum Sq`[4]             # SSE from the original fit
X2BP = (SSRstar / 2) / (SSE / n)^2   # X^2_BP = (SSR*/2) / (SSE/n)^2
chi_crit = qchisq(0.99, q)           # chi-square critical value at alpha = 0.01
Because \(X^2_{BP}\) is less than the critical value \(\chi^2(0.99;\,3) = 11.3448667\), we fail to reject the null hypothesis, which states that the error variance is constant; there is no evidence of nonconstant error variance at \(\alpha = 0.01\).
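As a cross-check, the same statistic is available from the lmtest package (assuming it is installed); studentize = FALSE requests the original Breusch-Pagan form used above:
library(lmtest)
bptest(model1, studentize = FALSE)   # original (non-studentized) BP statistic, 3 df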
7.5
part (a)
model2 = lm(y~x2,data = data)              # y on x2 alone
model12 = lm(y~x2 + x1,data = data)        # x2, then x1
model312 = lm(y~x2 + x1 + x3,data = data)  # x2, x1, then x3
ssrx2 = anova(model2)$`Sum Sq`[1]                                  # SSR(X2)
ssrx12 = anova(model2)$`Sum Sq`[2] - anova(model12)$`Sum Sq`[3]    # SSR(X1|X2) = SSE(X2) - SSE(X1,X2)
ssrx312 = anova(model12)$`Sum Sq`[3] - anova(model312)$`Sum Sq`[4] # SSR(X3|X1,X2) = SSE(X1,X2) - SSE(X1,X2,X3)
\(SSR(X_2) = 4860.26\), \(SSR(X_1|X_2) = 3896.0441424\), \(SSR(X_3|X_1,X_2) = 364.159521\); the degrees of freedom associated with each extra sum of squares is 1, and the residual df is 42.
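Note that these three extra sums of squares are just the sequential (Type I) sums of squares of the model fitted in the order \(x_2, x_1, x_3\), so they can be read off a single ANOVA table:
anova(model312)   # rows give SSR(X2), SSR(X1|X2), SSR(X3|X1,X2), and SSE in order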
part(b)
f1 = ssrx312/anova(model312)$`Sum Sq`[4]*42   # F* = [SSR(X3|X1,X2)/1] / [SSE/42]
f = qf(.975,1,42)                             # critical value for alpha = 0.025
p = pf(f1,1,42,lower.tail = FALSE)            # P-value of the test
We reject the null hypothesis \(H_0: \beta_3 = 0\) if \(f_1 > f\); otherwise we fail to reject it. In this case, \(f_1 = 3.5997349\) is less than \(f = 5.4038588\), so we fail to reject \(H_0\) and conclude that \(X_3\) can be dropped from the model that already contains \(X_1\) and \(X_2\). The P-value of the test is 0.0646781.
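Equivalently, the same partial F test can be obtained by comparing the reduced and full models directly; the F statistic and P-value agree with \(f_1\) and \(p\) above:
anova(model12, model312)   # general linear test of H0: beta3 = 0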
7.6
model1 = lm(y~x1,data = data)              # reduced model containing x1 only (reuses the name model1)
model321 = lm(y~x2 + x1 + x3,data = data)  # full model
ssrx321 = anova(model1)$`Sum Sq`[2] - anova(model321)$`Sum Sq`[4]  # SSR(X2,X3|X1) = SSE(X1) - SSE(X1,X2,X3)
f1 = ssrx321/2/anova(model321)$`Sum Sq`[4]*42  # F* = [SSR(X2,X3|X1)/2] / [SSE/42]
f = qf(.975,2,42)                              # critical value for alpha = 0.025
p = pf(f1,2,42,lower.tail = FALSE)             # P-value of the test
We reject the null hypothesis \(H_0: \beta_2 = \beta_3 = 0\) if \(f_1 > f\); otherwise we fail to reject it. In this case, \(f_1 = 4.1768031\) is greater than \(f = 4.0327099\), so we reject \(H_0\) and conclude that \(X_2\) and \(X_3\) cannot both be dropped from the model that contains \(X_1\). The P-value of the test is 0.0221612.
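The same conclusion follows from comparing the reduced model (with \(X_1\) only) against the full model directly:
anova(model1, model321)   # general linear test of H0: beta2 = beta3 = 0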