STAT GU4205/GR5205 (Section 004) Linear Regression Models

HW3 by Dongrui Liu, UNI: dl3390

3.4

First, import the data.

setwd("C:/Users/Dongrui/Desktop/21 Fall/Linear Regression Models/HW/HW3")
options(scipen = 200)
data=read.table("CH03PR04.txt")
#dotplot from Minitab

Part (a)

The dot plot (prepared in Minitab) shows that the number of copiers serviced is concentrated at 2, 4, 5, and 7, with no clear outlying cases.
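
For completeness, a rough R equivalent of the Minitab dot plot can be drawn with base graphics; this is a sketch, not the plot actually submitted:

# Stacked-dot analogue of the Minitab dot plot
stripchart(data$V2, method = "stack", pch = 16,
           xlab = "Number of copiers serviced")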

Part (b)

library(ggplot2)
Number_of_copiers_serviced=data$V2
Time=1:length(data$V2)
plot(Time,Number_of_copiers_serviced)

dt = data.frame(Number_of_copiers_serviced, Time)
ggplot(dt, aes(x=Time, y=Number_of_copiers_serviced)) + 
  geom_point()+
  geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'

The cases are given in time order; however, the plot does not suggest any relationship between the number of copiers serviced and time.
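
A quick numerical check of this reading, using the variables defined above:

# Correlation between the time index and X; a value near 0 is
# consistent with the absence of a time trend in the plot
cor(Time, Number_of_copiers_serviced)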

Part (c)

model1 = lm(V1 ~ V2, data = data)  # regress service time (V1) on number of copiers (V2)
stem(residuals(model1))
## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   -2 | 30
##   -1 | 
##   -1 | 3110
##   -0 | 99997
##   -0 | 44333222111
##    0 | 001123334
##    0 | 5666779
##    1 | 112234
##    1 | 5

The noteworthy feature of this display is the two large negative residuals of magnitude around 20 or more; otherwise the distribution of the residuals appears roughly symmetric about zero.
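
To identify those two cases, the residuals can be sorted; a minimal sketch:

e = residuals(model1)
sort(e)[1:2]   # the two most negative residuals
order(e)[1:2]  # their case indices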

Part (d)

plot(data$V2,residuals(model1))

dt = data.frame(X1 = data$V2, X2 = residuals(model1))
ggplot(dt, aes(x=X1, y=X2)) + 
  xlab("X")+
  ylab("residuals")+
  geom_point()+
  geom_smooth(method=lm)

plot(model1$fitted.values,residuals(model1))

dt = data.frame(X1 = model1$fitted.values, X2 = residuals(model1))
ggplot(dt, aes(x=X1, y=X2)) + 
  xlab("Y_fitted")+
  ylab("residuals")+
  geom_point()+
  geom_smooth(method=lm)

The two plots provide essentially the same information. There appears to be a non-linear relationship between the residuals and both X and \(\hat{Y}\), which is a departure from the simple linear regression model.
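
To make any curvature easier to judge, a lowess smooth can be overlaid on the residual plot; a base-graphics sketch:

plot(data$V2, residuals(model1), xlab = "X", ylab = "Residuals")
lines(lowess(data$V2, residuals(model1)), col = "red")  # lowess smooth
abline(h = 0, lty = 2)                                  # zero reference line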

Part (e)

qqnorm(model1$res,ylab="Raw Residuals")
qqline(model1$res)

library(olsrr)
cor1 = ols_test_correlation(model1)
cor1  # correlation between ordered residuals and their expected normal values
## [1] 0.9891021

For n = 45 and \(\alpha = 0.05\), interpolation in Table B.6 gives the critical value (0.977 + 0.981)/2 = 0.979. This is less than the observed coefficient of correlation, 0.9891, between the ordered residuals and their expected values under normality, so we fail to reject the null hypothesis that the error terms are normally distributed. The normality assumption appears tenable.
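
The correlation reported by ols_test_correlation can be reproduced by hand from the textbook's approximation for the expected values of the ordered residuals under normality, \(\sqrt{MSE}\, z[(k - 0.375)/(n + 0.25)]\); a minimal sketch:

n = length(residuals(model1))  # n = 45
MSE = sum(residuals(model1)^2) / (n - 2)
e_sorted = sort(residuals(model1))
# expected value of the k-th smallest residual under normality
e_expected = sqrt(MSE) * qnorm(((1:n) - 0.375) / (n + 0.25))
cor(e_sorted, e_expected)  # should agree with ols_test_correlation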

Part (f)

Time=1:45
plot(Time,model1$residuals)

There does not appear to be any correlation between the error terms over time.

Part (g)

# Regress ln(e^2) on X (base-R log() is the natural logarithm)
dt = data.frame(X1 = data$V2, X2 = log((model1$residuals)^2))
model2 = lm(X2 ~ X1, data = dt)
a = anova(model1)  # supplies SSE of the original fit
b = anova(model2)  # supplies SSR* of the regression of ln(e^2) on X
# Breusch-Pagan statistic: (SSR*/2) / (SSE/n)^2, with n = 45
X2BP = round((b$`Sum Sq`[1] / 2) / (a$`Sum Sq`[2])^2 * 45^2, 8)
c = qchisq(0.95, 1)  # chi-square critical value, 1 df, alpha = 0.05

The null hypothesis is that there is no relationship between X and the error variance, i.e., that the error variance is constant; the alternative is that such a relationship exists and the error variance is not constant. If the computed statistic exceeds the chi-square critical value at the chosen level (here \(\alpha = 0.05\), so \(\chi^2(0.95; 1) = 3.8414588\)), we reject the null hypothesis; otherwise we fail to reject it. Since 0.0000119 < 3.8414588, we fail to reject the null hypothesis of constant error variance.
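
As a cross-check (assuming the lmtest package is installed), lmtest::bptest with studentize = FALSE computes the classical Breusch-Pagan statistic; note that it regresses the squared residuals themselves, rather than \(\ln e^2\), on X, so its value need not match X2BP above:

library(lmtest)
# Classical (non-studentized) Breusch-Pagan test on the original fit
bptest(model1, studentize = FALSE)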

Part (h)

plot(data$V3, model1$residuals, xlab = "X2 (mean operational age)", ylab = "Residuals")

plot(data$V4, model1$residuals, xlab = "X3 (years of experience)", ylab = "Residuals")

The plots show a clear relationship between the residuals and X2, the mean operational age of the copiers serviced on the call, while there is no such relationship between the residuals and X3, the years of experience of the service person making the call. Hence the model can be improved by including X2 as a predictor.
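
As a rough illustration of that suggestion (a sketch only; model_h is a name introduced here, and multiple regression is treated formally later), X2 can be added as a second predictor:

# Tentative fit adding mean operational age (column V3) as a second predictor
model_h = lm(V1 ~ V2 + V3, data = data)
summary(model_h)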

3.17

Part (a)

x=0:9
y=c(98,135,162,178,221,232,283,300,374,395)
dt=data.frame(cbind(x,y))
ggplot(dt, aes(x=x, y=y)) + 
  xlab("X")+
  ylab("Y")+
  geom_point()+
  geom_smooth(method=lm)

A linear relation appears to be adequate.

Part (b)

library(MASS)
model1 = lm(y ~ x, data = dt)  # dt still holds the sales data from Part (a)
# Standardized Box-Cox transformation: W = (Y^lambda - 1) / (lambda * K2^(lambda - 1)),
# where K2 = (prod(Y))^(1/n) is the geometric mean of Y
bc = function(y, lambda) {
  (y^lambda - 1) / (lambda * (prod(y)^(1 / length(y)))^(lambda - 1))
}
lambdas = c(.3, .4, .5, .6, .7)
SSEs = c()
for (i in 1:5) {
  lambda = lambdas[i]
  ys = bc(y, lambda)                # transform Y for this candidate lambda
  dt = data.frame(cbind(x, ys))
  model = lm(ys ~ x, data = dt)
  SSEs[i] = sum(model$residuals^2)  # SSE on the standardized scale
}
# use lambdas[which.min(SSEs)] to retrieve the lambda with the smallest SSE

The SSE is smallest at \(\lambda = 0.5\), so the suggested transformation is \(Y' = \sqrt{Y}\).
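
As a cross-check on this grid search (MASS is already loaded above), MASS::boxcox profiles the Box-Cox log-likelihood over \(\lambda\); its peak should fall near 0.5:

# Profile log-likelihood for lambda; the maximizing value should be
# close to 0.5, agreeing with the SSE grid search above
boxcox(lm(y ~ x), lambda = seq(0, 1, 0.05))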

Part (c)

y1=sqrt(y)
dt=data.frame(cbind(x,y1))
model2=lm(y1~x,data=dt)

The estimated regression function is \(\hat{Y}' = 10.26093 + 1.07629X\), where \(Y' = \sqrt{Y}\).
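
Those estimates are just the fitted coefficients of model2:

coef(model2)  # intercept and slope of the regression of sqrt(Y) on X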

Part (d)

dt=data.frame(cbind(x,y1))
ggplot(dt, aes(x=x, y=y1)) + 
  xlab("X")+
  ylab("Y1")+
  geom_point()+
  geom_smooth(method=lm)

The regression line appears to be a good fit to the transformed data.

Part (e)

plot(model2$fitted.values,model2$residuals)

qqnorm(model2$residuals)
qqline(model2$residuals)

ols_test_correlation(model2)
## [1] 0.9686797

The plot of the residuals against the fitted values shows no obvious pattern, so there does not appear to be any systematic relationship between them. The normal probability plot is somewhat jagged, but with a correlation of 0.9687 between the ordered residuals and their expected values under normality, it is safe to say the error terms are approximately normally distributed.

Part (f)

Since we made the transformation \(Y' = \sqrt{Y}\), we square both sides of the estimated regression function to return to the original units. The estimated function in the original units is \(\hat{Y} = (10.26093 + 1.07629X)^2\).
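
To visualize the back-transformed fit in the original units, the squared regression function can be overlaid on the raw data; a minimal sketch using the coefficients of model2:

b = coef(model2)
plot(x, y, xlab = "X", ylab = "Y")
# back-transformed fit: Y-hat = (b0 + b1 * X)^2
curve((b[1] + b[2] * x)^2, add = TRUE, col = "red")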