6.15
part (a)
setwd("C:/Users/Dongrui/Desktop/21 Fall/Linear Regression Models/HW/HW5")
options(scipen = 200)
data=read.table("CH06PR15.txt",col.names = c("y","x1","x2","x3"))
stem(data$x1)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 2 | 23
## 2 | 58899999
## 3 | 012233344
## 3 | 66678
## 4 | 0012233344
## 4 | 557779
## 5 | 0233
## 5 | 55
stem(data$x2)
##
## The decimal point is at the |
##
## 40 | 0
## 42 | 00
## 44 | 0
## 46 | 00000
## 48 | 0000000000
## 50 | 000000000000
## 52 | 000000
## 54 | 0000
## 56 | 00
## 58 | 0
## 60 | 0
## 62 | 0
stem(data$x3)
##
## The decimal point is 1 digit(s) to the left of the |
##
## 18 | 000000
## 20 | 00000000
## 22 | 0000000000000
## 24 | 00000000000
## 26 | 0000
## 28 | 0000
From the stem-and-leaf plots, \(x_1\) and \(x_2\) are integer-valued while \(x_3\) is recorded to one decimal place; \(x_2\) and \(x_3\) appear roughly symmetric and normally distributed, while \(x_1\) is somewhat bimodal, with clusters of values in the low 30s and low 40s.
part (b)
library(GGally)
ggpairs(data,title="scatter plot matrix and correlation matrix")
The ggpairs output suggests that \(y\) has a significant negative correlation with each of \(x_1\), \(x_2\), and \(x_3\), and that the three predictors are all positively correlated with one another.
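For reference, the numeric correlation matrix underlying the ggpairs panels can also be printed directly:
round(cor(data), 2)   # pairwise correlations among y, x1, x2, x3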
part (c)
model1 = lm(y~x1+x2+x3,data = data)
model1
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = data)
##
## Coefficients:
## (Intercept) x1 x2 x3
## 158.491 -1.142 -0.442 -13.470
The estimated regression function is \(\hat{y} = 158.491 - 1.142x_1 - 0.442x_2 - 13.470x_3\). Here \(b_2 = -0.442\) means that, holding \(x_1\) and \(x_3\) fixed, a one-unit increase in \(x_2\) is associated with an estimated decrease of 0.442 in the mean response.
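As a quick check of this interpretation (the two covariate points below are hypothetical values chosen within the observed ranges), the fitted values of two cases that differ only by one unit in \(x_2\) differ by exactly \(b_2\):
# hypothetical points differing only in x2; the fitted values differ by b2 = -0.442
new = data.frame(x1 = c(40, 40), x2 = c(50, 51), x3 = c(2.3, 2.3))
diff(predict(model1, new))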
part (d)
boxplot(model1$residuals)
No, there do not appear to be any outliers: the box plot flags no points beyond the whiskers, and the distribution of the residuals looks roughly symmetric.
part (e)
par(mfrow=c(2, 2))
# residuals against the fitted values and against each predictor
plot(model1$fitted.values,model1$residuals)
abline(0,0)
plot(data$x1,model1$residuals)
abline(0,0)
plot(data$x2,model1$residuals)
abline(0,0)
plot(data$x3,model1$residuals)
abline(0,0)
# residuals against each two-factor interaction term
par(mfrow=c(3, 1))
plot(data$x1*data$x2,model1$residuals)
abline(0,0)
plot(data$x1*data$x3,model1$residuals)
abline(0,0)
plot(data$x2*data$x3,model1$residuals)
abline(0,0)
n = length(data$y)
# expected values of the ordered residuals under normality:
# sqrt(MSE) * z[(k - 0.375)/(n + 0.25)]
Ee = sqrt(anova(model1)$`Mean Sq`[4]) * qnorm((1:n - 0.375)/(n + 0.25))
qqnorm(model1$residuals)
In the plot of the residuals against the interaction term \(x_2 x_3\), the points are more concentrated on the left side of the plot; otherwise, the residual plots and the normal probability plot look unremarkable.
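As a numerical companion to the normal probability plot, the coefficient of correlation between the ordered residuals and their expected values under normality can be computed and compared against the critical value in Table B.6 of the textbook:
cor(sort(model1$residuals), Ee)   # correlation test for normality of the errors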
part (f)
# a formal lack-of-fit test requires replicate observations, i.e. rows
# that share an identical (x1, x2, x3) combination
replicates = duplicated(data[, c("x1", "x2", "x3")])
sum(replicates)
Since there are no replicate observations, that is, no two cases share the same \((x_1, x_2, x_3)\) combination, we cannot perform a formal test for lack of fit here.
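For completeness, if replicates were present, the lack-of-fit test would compare the regression model against the full model that fits a separate mean at each distinct \((x_1, x_2, x_3)\) combination. A minimal sketch is below; with these data the full model is saturated (zero error df), so the F statistic cannot actually be computed:
# full model: one cell mean per distinct (x1, x2, x3) combination;
# with no replicates it has zero error df, so the test is degenerate here
full = lm(y ~ factor(paste(x1, x2, x3)), data = data)
anova(model1, full)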
part (g)
# Breusch-Pagan test: regress the squared residuals on the predictors
dt1 = data.frame(e2 = model1$residuals^2, data[, c("x1", "x2", "x3")])
model2 = lm(e2 ~ x1 + x2 + x3, data = dt1)
anova1 = anova(model1)
anova2 = anova(model2)
q = 3                                # df: number of predictors tested
n = length(data$y)
SSRstar = sum(anova2$`Sum Sq`[1:3])  # SSR* from the regression of e^2 on the X's
SSE = anova1$`Sum Sq`[4]             # SSE from the original fit
X2BP = (SSRstar / 2) / (SSE / n)^2   # X^2_BP = (SSR*/2) / (SSE/n)^2
chi_crit = qchisq(0.99, q)           # chi-square critical value at alpha = 0.01
Because \(X^2_{BP}\) is less than the critical value \(\chi^2(0.99;\,3) = 11.3448667\), we fail to reject the null hypothesis, which states that the error variance is constant; there is no evidence of nonconstant error variance at \(\alpha = 0.01\).
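As a cross-check, the same statistic is available from the lmtest package (assuming it is installed); studentize = FALSE requests the original Breusch-Pagan form used above:
library(lmtest)
bptest(model1, studentize = FALSE)   # original (non-studentized) BP statistic, 3 df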
7.5
part (a)
model2 = lm(y~x2,data = data)              # y on x2 alone
model12 = lm(y~x2 + x1,data = data)        # x2, then x1
model312 = lm(y~x2 + x1 + x3,data = data)  # x2, x1, then x3
ssrx2 = anova(model2)$`Sum Sq`[1]                                  # SSR(X2)
ssrx12 = anova(model2)$`Sum Sq`[2] - anova(model12)$`Sum Sq`[3]    # SSR(X1|X2) = SSE(X2) - SSE(X1,X2)
ssrx312 = anova(model12)$`Sum Sq`[3] - anova(model312)$`Sum Sq`[4] # SSR(X3|X1,X2) = SSE(X1,X2) - SSE(X1,X2,X3)
\(SSR(X_2) = 4860.26\), \(SSR(X_1|X_2) = 3896.0441424\), \(SSR(X_3|X_1,X_2) = 364.159521\); the degrees of freedom associated with each extra sum of squares is 1, and the residual df is 42.
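Note that these three extra sums of squares are just the sequential (Type I) sums of squares of the model fitted in the order \(x_2, x_1, x_3\), so they can be read off a single ANOVA table:
anova(model312)   # rows give SSR(X2), SSR(X1|X2), SSR(X3|X1,X2), and SSE in order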
part(b)
f1 = ssrx312/anova(model312)$`Sum Sq`[4]*42   # F* = [SSR(X3|X1,X2)/1] / [SSE/42]
f = qf(.975,1,42)                             # critical value for alpha = 0.025
p = pf(f1,1,42,lower.tail = FALSE)            # P-value of the test
We reject the null hypothesis \(H_0: \beta_3 = 0\) if \(f_1 > f\); otherwise we fail to reject it. In this case, \(f_1 = 3.5997349\) is less than \(f = 5.4038588\), so we fail to reject \(H_0\) and conclude that \(X_3\) can be dropped from the model that already contains \(X_1\) and \(X_2\). The P-value of the test is 0.0646781.
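Equivalently, the same partial F test can be obtained by comparing the reduced and full models directly; the F statistic and P-value agree with \(f_1\) and \(p\) above:
anova(model12, model312)   # general linear test of H0: beta3 = 0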
7.6
model1 = lm(y~x1,data = data)              # reduced model containing x1 only (reuses the name model1)
model321 = lm(y~x2 + x1 + x3,data = data)  # full model
ssrx321 = anova(model1)$`Sum Sq`[2] - anova(model321)$`Sum Sq`[4]  # SSR(X2,X3|X1) = SSE(X1) - SSE(X1,X2,X3)
f1 = ssrx321/2/anova(model321)$`Sum Sq`[4]*42  # F* = [SSR(X2,X3|X1)/2] / [SSE/42]
f = qf(.975,2,42)                              # critical value for alpha = 0.025
p = pf(f1,2,42,lower.tail = FALSE)             # P-value of the test
We reject the null hypothesis \(H_0: \beta_2 = \beta_3 = 0\) if \(f_1 > f\); otherwise we fail to reject it. In this case, \(f_1 = 4.1768031\) is greater than \(f = 4.0327099\), so we reject \(H_0\) and conclude that \(X_2\) and \(X_3\) cannot both be dropped from the model that contains \(X_1\). The P-value of the test is 0.0221612.
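The same conclusion follows from comparing the reduced model (with \(X_1\) only) against the full model directly:
anova(model1, model321)   # general linear test of H0: beta2 = beta3 = 0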