```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Homework #4

## Question 1
```{r}
library(readr)
Airfreight <- read_csv("C:/Users/2015m/Downloads/HW4_Datasets/Airfreight.csv")
head(Airfreight)
```
```{r}
lmairf <- lm(Y ~ X, data = Airfreight)
summary(lmairf)
```
```{r}
Airfreight$ID <- seq.int(nrow(Airfreight))
res.lmairf <- resid(lmairf)
plot(res.lmairf ~ Airfreight$ID, ylab="Residual", xlab="Time",
     main="Residual plot against Time")
abline(h=0)
```
I don't see a pattern popping out at me.
```{r}
# Runs test
# install.packages("lawstat")
library(lawstat)
runs.test(res.lmairf)
```

```{r}
# Durbin-Watson
# install.packages("lmtest")
library(lmtest)
dwtest(lmairf, alternative = "two.sided")
```
Neither test rejects the null, which suggests that the residuals of the time series data are uncorrelated and the error terms are independent, just as I thought.
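For intuition, the Durbin-Watson statistic can also be computed directly from the residuals. A minimal sketch using `res.lmairf` from above; the value should match the one reported by `dwtest`:

```{r}
# Durbin-Watson statistic: sum of squared successive differences of the
# residuals, divided by the residual sum of squares.
e <- res.lmairf
n <- length(e)
sum((e[-1] - e[-n])^2) / sum(e^2)
```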
```{r}
# When did we start calling the residuals "e"?
plot(res.lmairf ~ Airfreight$X, ylab="Residual", xlab="Number of Transfers",
     main="Residual plot against X")
abline(h=0)
```
The residual variance looks inconsistent, leading me to believe a linear regression model may be inappropriate.
```{r}
# qq plot
qqnorm(res.lmairf); qqline(res.lmairf)
# Shapiro-Wilk test
shapiro.test(res.lmairf)
```
The residuals in the qq plot do seem to follow a bit of a waved pattern, but I wouldn't consider it strong evidence. In the Shapiro-Wilk normality test, we cannot reject the null, suggesting the error terms are indeed normally distributed.
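Another check along the same lines is the correlation test for normality: compare the residuals with their approximate expected values under normality and see how close the correlation is to 1. A sketch, assuming the usual $\sqrt{MSE}\, z[(k - 0.375)/(n + 0.25)]$ approximation for the expected value of the $k$-th smallest residual:

```{r}
# Correlation between the residuals and their approximate expected
# values under normality; values near 1 support the normality assumption.
n <- length(res.lmairf)
mse <- summary(lmairf)$sigma^2
exp.res <- sqrt(mse) * qnorm((rank(res.lmairf) - 0.375) / (n + 0.25))
cor(res.lmairf, exp.res)
```

The result would then be compared against a critical-value table for this test.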
```{r}
bptest(lmairf)
```
The alternative hypothesis for this test is that the error variance is not constant. At the 5% significance level, we cannot reject the null hypothesis, which means we cannot rule out constant error variance. Thus, we conclude that the error variance is constant.
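Note that `bptest` defaults to the studentized (Koenker) version of the test; the textbook statistic can be reproduced with `studentize = FALSE`, or by hand. A sketch of the hand computation, assuming the form $X^2_{BP} = \frac{SSR^*/2}{(SSE/n)^2}$, where $SSR^*$ is the regression sum of squares from regressing $e_i^2$ on $X_i$:

```{r}
# Breusch-Pagan statistic by hand (non-studentized, textbook form).
e2 <- res.lmairf^2                       # squared residuals
aux <- lm(e2 ~ X, data = Airfreight)     # auxiliary regression of e^2 on X
SSR.star <- sum((fitted(aux) - mean(e2))^2)
SSE <- sum(res.lmairf^2)
n <- nrow(Airfreight)
X2.BP <- (SSR.star/2) / (SSE/n)^2
X2.BP
pchisq(X2.BP, df = 1, lower.tail = FALSE)   # chi-square p-value, 1 df
```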
## Question 2
```{r}
library(readr)
Crime <- read_csv("C:/Users/2015m/Downloads/HW4_Datasets/Crime.csv")
head(Crime)
```
```{r}
lmcrime <- lm(rate ~ pcnt, data = Crime)
summary(lmcrime)
```
```{r}
res.lmcrime <- resid(lmcrime)
boxplot(res.lmcrime)
```
The boxplot seems to be heavier on the higher residuals, and there seems to be a positive residual outlier.
```{r}
# residual plot against Y hat
lmcrime.fitted <- fitted(lmcrime)
plot(res.lmcrime ~ lmcrime.fitted, xlab="Fitted Values for Crime Rate", ylab="Residual",
     main="Plot of residuals against fitted values")
abline(h=0, lwd=2)
```

```{r}
# residual plot against X
plot(res.lmcrime ~ Crime$pcnt, ylab="Residual", xlab="Percentage Having a High School Diploma",
     main="Residual plot against X")
abline(h=0)
```
The residuals versus X look relatively normal and indicate that the variance is fairly consistent as X changes. The residuals versus the fitted values show a potential inconsistency in the error variance, but I won't rule out a linear regression model yet.
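As a quick cross-check, R's built-in diagnostics for `lm` objects produce a comparable residuals-versus-fitted display:

```{r}
# Built-in residuals-vs-fitted diagnostic; should mirror the manual plot above.
plot(lmcrime, which = 1)
```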
```{r}
# qq plot
qqnorm(res.lmcrime); qqline(res.lmcrime)
# Shapiro-Wilk test
shapiro.test(res.lmcrime)
```
Judging from the qq plot, the left tail looks heavy, but otherwise it looks generally normal. The Shapiro-Wilk normality test cannot reject the null at the 5% significance level, suggesting the error terms are indeed normally distributed.
```{r}
# Brown-Forsythe test: split the data at the mean of X and compare the
# absolute deviations of the residuals from each group's median.
mean.pcnt <- mean(Crime$pcnt)
# sort the X values and order the residuals the same way
sort.pcnt <- sort(Crime$pcnt)
order.pcnt <- order(Crime$pcnt)
sort.res <- res.lmcrime[order.pcnt]
# divide the data set into two groups
group1.index <- which(sort.pcnt <= mean.pcnt)
group2.index <- which(sort.pcnt > mean.pcnt)
group1.res <- sort.res[group1.index]
group2.res <- sort.res[group2.index]
# absolute deviations from the group medians
abd1 <- abs(group1.res - median(group1.res))
abd2 <- abs(group2.res - median(group2.res))
# test statistic based on the two-sample t-test
t.test(abd1, abd2)
```
With a p-value of 0.513, we cannot reject the null, which suggests that the true difference in means is zero and the error terms have constant variance. This supports the conclusion in part (c) not to rule out the linear regression model.
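The same Brown-Forsythe idea is also implemented in `car::leveneTest` (Levene's test with `center = median`). A sketch as a cross-check, assuming the car package is installed; the grouping reproduces the split at the mean of X, and the F statistic should agree closely with the two-sample t-test above:

```{r}
library(car)
# group indicator: at-or-below vs. above the mean of pcnt
grp <- factor(Crime$pcnt <= mean(Crime$pcnt), labels = c("above", "below"))
leveneTest(res.lmcrime, grp, center = median)
```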
## Question 3
```{r}
library(readr)
Machine <- read_table2("C:/Users/2015m/Downloads/HW4_Datasets/Machine.txt",
                       col_names = FALSE)
head(Machine)
```
```{r}
# linear regression function
lmmach <- lm(X1 ~ X2, data = Machine)
summary(lmmach)
```
```{r}
# obtain the residuals
res.lmmach <- resid(lmmach)
# residual plot against X
plot(res.lmmach ~ Machine$X2, ylab="Residual", xlab="Speed Setting of Machine",
     main="Residual plot against X")
abline(h=0)
```
The residual plot seems to show an increase in residual variance as X increases. The constancy of variance assumption might not be met.
```{r}
bptest(lmmach)
```
The Breusch-Pagan test fits $\log_e \sigma_i^2 = \gamma_0 + \gamma_1 X_i$; the null hypothesis is $H_0: \gamma_1 = 0$ (constant error variance) and the alternative is $H_a: \gamma_1 \neq 0$. At $\alpha = 0.10$, we can reject the null hypothesis, meaning we can conclude the error variance is not constant.
```{r}
plot(res.lmmach ~ Machine$X2, ylab="Residual", xlab="Speed Setting of Machine",
     main="Residual plot against X")
abline(h=0)
plot(res.lmmach^2 ~ Machine$X2, ylab="Residual Squared", xlab="Speed Setting of Machine",
     main="Squared Residual plot against X")
abline(h=0)
```
As stated before, it looks like the variance of the error term is not constant as the value of X changes.
```{r}
# estimate the SD function by regressing |residuals| on X,
# then weight by the inverse of the squared fitted values
lmmach.res <- lm(abs(res.lmmach) ~ X2, data=Machine)
W <- diag(1/fitted(lmmach.res)^2)
Y <- matrix(Machine$X1, ncol=1)
X <- matrix(cbind(1, Machine$X2), ncol=2)
inv.XWX <- solve(t(X)%*%W%*%X)
XWY <- t(X)%*%W%*%Y
# weighted least squares estimates (b must be computed before the fitted values)
b <- inv.XWX%*%XWY
b
# weighted residual plot
Yw <- sqrt(W)%*%Y
Xw <- sqrt(W)%*%X
fit.Yw <- sqrt(W)%*%X%*%b
w.res <- Yw - fit.Yw
plot(w.res ~ Xw[,2], xlab="Xw", ylab="weighted residuals", main="weighted residual plot against X")
abline(h=0)
```
The intercept is a little smaller (more negative) than the ordinary least squares estimate, but the slope is almost the same. It's not a huge difference.
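As a sanity check on the matrix algebra, `lm()` accepts a `weights` argument and should return the same coefficients:

```{r}
# WLS via lm(); the weights are the inverse squared fitted SD values.
w <- 1/fitted(lmmach.res)^2
lm(X1 ~ X2, data = Machine, weights = w)
```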
```{r}
# confidence intervals based on the ordinary least squares standard errors
# (standard errors of 16.73052 and 0.05381 from part a)
qt(0.975, 10)*16.73052
qt(0.975, 10)*0.05381
# standard errors of the weighted least squares estimates
s.b0 <- sqrt(diag(inv.XWX))[1]
s.b1 <- sqrt(diag(inv.XWX))[2]
qt(0.975, 10)*s.b0
qt(0.975, 10)*s.b1
```
The standard deviations of the weighted least squares estimates were considerably smaller for both $b_0$ and $b_1$.
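For comparison, the OLS intervals can also be pulled straight from the fitted model; `confint()` applies the same t multiplier with the model's residual degrees of freedom (10 here, as used above):

```{r}
# t-based 95% confidence intervals for the OLS coefficients.
confint(lmmach, level = 0.95)
```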
## Question 4
```{r}
library(readr)
lung_pressure <- read_csv("C:/Users/2015m/Downloads/HW4_Datasets/lung_pressure.csv")
head(lung_pressure)
```
```{r}
lmlp <- lm(Y ~ X1 + X2 + X3, data=lung_pressure)
# scatter plot matrix
pairs(lung_pressure[,1:4], pch = 19)
# correlation matrix of Y and the X variables
res <- cor(lung_pressure)
round(res, 2)
```
The scatter plot matrix suggests all X variables have a negative correlation with Y. There also seems to be evidence of a multicollinearity problem in how highly correlated X3 is with both X1 and X2: both correlation coefficients are greater than 0.7, which is considered strong correlation.
```{r}
std.res.lmlp <- rstudent(lmlp)
fitted.y <- fitted(lmlp)
par(mfrow=c(2,2))
plot(std.res.lmlp ~ fitted.y, xlab="fitted values", ylab="residuals", main="residual plot against fitted values")
plot(std.res.lmlp ~ lung_pressure$X1, xlab="X1", ylab="residuals", main="residual plot against X1")
plot(std.res.lmlp ~ lung_pressure$X2, xlab="X2", ylab="residuals", main="residual plot against X2")
plot(std.res.lmlp ~ lung_pressure$X3, xlab="X3", ylab="residuals", main="residual plot against X3")
```
At first glance, I'd say no. I don't see any real trend, but I could be convinced that there is some inconsistency in the variance.
```{r}
# qq plot
qqnorm(std.res.lmlp); qqline(std.res.lmlp)
# Shapiro-Wilk test
shapiro.test(std.res.lmlp)
```
The qq plot gives me pause based on how it veers off on the right. However, the Shapiro-Wilk test (at $\alpha = 0.05$) does not reject the null, so the normality assumption is reasonable.
#install.packages("car")
library(car)
vif(lmlp)
The maximum VIF value is 22.474, which is greater than 10, indicating there could be a serious multicollinearity problem.
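The VIF for each predictor is just $1/(1 - R_k^2)$, where $R_k^2$ comes from regressing that predictor on the others. A sketch for X3, which should match the `vif(lmlp)` output:

```{r}
# VIF for X3 computed by hand from the auxiliary R-squared.
r2.x3 <- summary(lm(X3 ~ X1 + X2, data = lung_pressure))$r.squared
1/(1 - r2.x3)
```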
```{r}
X <- as.matrix(cbind(1, lung_pressure[,-1]))
inv.XX <- solve(t(X)%*%X)
H <- X%*%inv.XX%*%t(X)
lev <- diag(H)
lev
```
The outlying X observations, with leverage values greater than 0.2, are the cases with the highest values: 0.27569243, 0.53886673, 0.21775095, 0.87827870, and 0.47982100 (the third, seventh, eighth, and fifteenth cases).
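The same leverage values are available from `hatvalues()`, and a common rule of thumb flags cases with $h_{ii} > 2p/n$. A sketch comparing that rule with the 0.2 cutoff used above:

```{r}
# Leverage via hatvalues(), equivalent to diag(H) above.
h <- hatvalues(lmlp)
p <- length(coef(lmlp))   # number of parameters
n <- nrow(lung_pressure)
2*p/n                     # rule-of-thumb cutoff
which(h > 0.2)            # cases flagged by the 0.2 cutoff
which(h > 2*p/n)          # cases flagged by the 2p/n rule
```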
#install.packages("olsrr")
library(olsrr)
#DFFITS values
ols_plot_dffits(lmlp)
#DFBETAS values
ols_plot_dfbetas(lmlp)
#Cook's distance
ols_plot_cooksd_chart(lmlp)
Cases 1, 7, and 8 were consistently out of range and are definitely outliers. Case 3 occasionally acted as an outlier, but not often enough to convince me it isn't just one somewhat odd case.
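The same flags can be checked numerically with the base-R influence measures. A sketch using conventional cutoffs ($2\sqrt{p/n}$ for DFFITS and the 50th percentile of $F(p, n-p)$ for Cook's distance), which are assumptions of this check rather than part of the olsrr output:

```{r}
# Numeric influence diagnostics with conventional cutoffs.
p <- length(coef(lmlp))
n <- nrow(lung_pressure)
which(abs(dffits(lmlp)) > 2*sqrt(p/n))            # influential by DFFITS
which(cooks.distance(lmlp) > qf(0.5, p, n - p))   # influential by Cook's D
```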