This data frame contains the following columns:

Y Miles/gallon

x1 Displacement (cubic in)

x2 Horsepower (ft-lb)

x3 Torque (ft-lb)

x4 Compression ratio

x5 Rear axle ratio

x6 Carburetor (barrels)

x7 No. of transmission speeds

x8 Overall length (in)

x9 Width (in)

x10 Weight (lb)

x11 Type of transmission (1=automatic, 0=manual)

  1. Extract the data from the following text file: gmp.txt.
df <- read.delim("gmp.txt", sep = "")
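
As a quick check (a sketch, not part of the original tasks), the structure of the imported data frame can be inspected to confirm that y and x1 to x11 were read in correctly.
str(df)   # confirm that y and x1-x11 are present with the expected types
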
  2. Develop a multiple linear regression model by considering all variables.
model.lm <- lm(y~., data =df)
model.lm
## 
## Call:
## lm(formula = y ~ ., data = df)
## 
## Coefficients:
## (Intercept)           x1           x2           x3           x4           x5  
##   17.339838    -0.075588    -0.069163     0.115117     1.494737     5.843495  
##          x6           x7           x8           x9          x10          x11  
##    0.317583    -3.205390     0.180811    -0.397945    -0.005115     0.638483

Part A

  i. Construct a normal probability plot of the residuals. Do there seem to be any problems with the normality assumption?
model.stdres <- rstandard(model.lm)   # standardized residuals
qqnorm(model.stdres, ylab = "Standardized Residuals", xlab = "Normal Scores",
       main = "Normal Q-Q Plot of Standardized Residuals")
qqline(model.stdres)

Based on the normal probability plot of the residuals, the points fall approximately on a straight line, with the exception of a few observations whose normal scores exceed 1. Overall, the residuals appear to be normally distributed.

However, a more rigorous, formal test of normality may be used to verify this assumption.

  ii. Support your answer in (i) by using an appropriate normality test.
shapiro.test(model.stdres)
## 
##  Shapiro-Wilk normality test
## 
## data:  model.stdres
## W = 0.96009, p-value = 0.3113

\(H_0\): The errors follow a normal distribution

\(H_1\): The errors do not follow a normal distribution

Since the p-value = 0.3113 > 0.05, we do not reject \(H_0\).

At \(\alpha = 0.05\), we conclude that the errors follow a normal distribution, which supports the answer in (i).
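
As an additional informal check (a sketch, not required by the question), a histogram of the standardized residuals can be compared against the standard normal density.
# histogram of standardized residuals with a standard normal curve overlaid
hist(model.stdres, breaks = 10, freq = FALSE,
     main = "Histogram of Standardized Residuals",
     xlab = "Standardized Residuals")
curve(dnorm(x), add = TRUE, lty = 2)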

  iii. Construct and interpret a plot of residuals versus the predicted response.
# Save the predicted values
yhat<- predict(model.lm)   
# Save the residual values
epsilon <- residuals(model.lm) 

# plot residuals against the predicted response
plot(yhat, epsilon, xlab = "Predicted response", ylab = "Residuals")
# add horizontal line at 0
abline(h = 0, lty = 2)

Based on the plot of residuals versus the predicted response, there is no visible non-random pattern, which suggests that the residuals are unrelated to the predicted response.

Hence, we can assume that the residuals have constant variance.
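
For a more formal check of the constant-variance assumption (a sketch, assuming the lmtest package is installed), the Breusch-Pagan test can be applied to the fitted model; a large p-value would support constant error variance.
library(lmtest)    # assumed to be installed; provides bptest()
bptest(model.lm)   # H0: the error variance is constant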

Part B

Detect any outliers that occur using Cook’s distance.

  i. Plot the influential observations by Cook’s distance and comment.
cooksd <- cooks.distance(model.lm)
# plot cook's distance
plot(cooksd, pch="*", cex=2, main="Influential Obs by Cooks distance") 


# add cutoff line
abline(h = 4*mean(cooksd, na.rm=T), col="red") 
# greater than 4 times the mean may be classified as influential

# label the points that exceed the cutoff (offset slightly to the right for readability)
text(x = (1:length(cooksd)) + 1, y = cooksd,
     labels = ifelse(cooksd > 4*mean(cooksd, na.rm = TRUE), names(cooksd), ""),
     col = "red")

Based on the plot, the 14th and 17th data rows exceed the 4*mean cutoff. Hence, these two observations are flagged as influential points that may have a negative impact on the model.
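
To see the actual Cook’s distance values behind the two flagged points (a minimal sketch using the objects already defined above):
# Cook's distance values that exceed the 4*mean cutoff
cooksd[cooksd > 4*mean(cooksd, na.rm = TRUE)]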

  ii. Examine and list the observation(s) which are considered outlier(s). Justify your answer.
# influential row numbers 
influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=T))])
head(df[influential, ]) # influential observations.

Since the Cook’s distances of the 14th and 17th rows exceed the threshold of four times the mean, these observations are considered influential points and are treated as outliers.

Based on the output above, the values of x1, x3 and x8 in row 17 are very high compared with row 14, while the values in row 14 look reasonable.

This suggests that the influential data points should be removed to prevent them from having a negative impact on the model.
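
If the flagged rows are removed, the model can be refitted on the remaining observations (a sketch only; in practice the points should be investigated before deletion).
# refit the full model without the influential rows stored in `influential`
df.clean <- df[-influential, ]
model.clean <- lm(y ~ ., data = df.clean)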

Part C

  i. By considering only x1, x2, x3, x8, x9 and x10, construct the lack of fit test. Interpret and justify your answer.
# let model1 be the full regression model (all regressors)

model1 <- lm(y~., data =df)

model2 <- lm(y~x1+x2+x3+x8+x9+x10, data =df)


#lack of fit test
anova(model1, model2)

Since the p-value = 0.478 > 0.05, we do not reject \(H_0\). At \(\alpha = 0.05\), the full model (model1) does not offer a significantly better fit than the reduced model (model2); that is, x4, x5, x6, x7 and x11 contribute little once x1, x2, x3, x8, x9 and x10 are in the model.
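
For reference, the F statistic reported by anova() can also be computed directly from the two fitted models via the extra-sum-of-squares formula (a sketch; it should reproduce the same p-value).
# partial F statistic: ((SSE_reduced - SSE_full)/(df_reduced - df_full)) / (SSE_full/df_full)
sse.full <- deviance(model1)       # SSE of the full model
sse.red  <- deviance(model2)       # SSE of the reduced model
df.full  <- df.residual(model1)
df.red   <- df.residual(model2)
F.stat <- ((sse.red - sse.full)/(df.red - df.full)) / (sse.full/df.full)
p.val  <- pf(F.stat, df.red - df.full, df.full, lower.tail = FALSE)
c(F = F.stat, p.value = p.val)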