{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

Homework #4

Question 1

library(readr)
Airfreight <- read_csv("C:/Users/2015m/Downloads/HW4_Datasets/Airfreight.csv")
head(Airfreight)
  a. Fit a linear regression function using X and Y.
lmairf <- lm(Y~X, data = Airfreight)
summary(lmairf)
  b. The cases are given in time order. Prepare a residual plot against the time (𝑖) to ascertain whether the error terms are correlated over time. Is any systematic pattern evident in your plot?
Airfreight$ID <- seq.int(nrow(Airfreight))
res.lmairf <- resid(lmairf)
plot(res.lmairf ~ Airfreight$ID, ylab="Residual", xlab="Time", 
     main="Residual plot against Time")
abline(h=0)

I don’t see any systematic pattern popping out at me.

  c. Conduct a Runs test and a Durbin-Watson test to check whether the error terms are independent. The significance level 𝛼 is .05.
#Runs test
#install.packages("lawstat")
library(lawstat)
runs.test(res.lmairf)

#Durbin-Watson
#install.packages("lmtest")
library(lmtest)
dwtest(lmairf, alternative = "two.sided")

Neither test rejects the null, which suggests that the residuals of this time-ordered data are uncorrelated and the error terms are independent. Just as I thought.

  d. Plot the residuals 𝑒 against X to ascertain whether the regression model (a) is appropriate for the data. What is your conclusion?
#When did we start calling the residuals "e"?
plot(res.lmairf ~ Airfreight$X, ylab="Residual", xlab="Number of Transfers", 
     main="Residual plot against X")
abline(h=0)

Residual variance looks inconsistent, leading me to believe a linear regression model may be inappropriate.

  e. Draw a normal probability plot of the residuals. Does the normality assumption appear to be reasonable here? Conduct a Shapiro-Wilk test for normality under 𝛼 = 0.05.
#qq plot
qqnorm(res.lmairf);qqline(res.lmairf)

#Shapiro-Wilk test
shapiro.test(res.lmairf)

The residuals in the qq plot do seem to follow a bit of a waved pattern, but I wouldn’t consider it strong evidence against normality. The Shapiro-Wilk normality test cannot reject the null, suggesting the error terms are indeed normally distributed.

  f. Conduct the Breusch-Pagan test to determine whether or not the error variance varies with the level of X. Use α = .05. State the alternatives, decision rule, and conclusion.
bptest(lmairf)

The alternative hypothesis for this test is that the error variance is not constant. At significance level α = 0.05, we cannot reject the null hypothesis, so we cannot rule out the error variance being constant. Thus, we conclude there is no evidence that the error variance varies with X.
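A minimal sketch of applying that decision rule programmatically, assuming the hypothetical name bp.out; bptest() returns an htest object whose p-value can be compared with α directly.

#decision rule sketch: reject H0 (constant error variance) when p-value < alpha
bp.out <- bptest(lmairf)
alpha <- 0.05
if (bp.out$p.value < alpha) "Reject H0: error variance varies with X" else "Fail to reject H0: no evidence against constant error variance"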

Question 2

library(readr)
Crime <- read_csv("C:/Users/2015m/Downloads/HW4_Datasets/Crime.csv")
head(Crime)
  a. Fit a linear regression function using X and Y.
#Y is the crime rate, X is the percentage having a high school diploma
lmcrime <- lm(rate~pcnt, data = Crime)
summary(lmcrime)
  b. Obtain the residuals 𝑒𝑖 and prepare a box plot of the residuals. Does the distribution of the residuals appear to be symmetrical?
res.lmcrime <- resid(lmcrime)
boxplot(res.lmcrime)

The box plot looks heavier on the positive side, and there seems to be a positive residual outlier, so the distribution does not appear entirely symmetrical.
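A quick numeric check of that impression, a minimal sketch comparing how far each quartile sits from the median; for a symmetric distribution the two distances would be roughly equal.

#distance from the median to each quartile; a larger upper distance matches
#the heavier positive side seen in the box plot
q <- quantile(res.lmcrime, c(0.25, 0.5, 0.75))
c(lower = unname(q[2] - q[1]), upper = unname(q[3] - q[2]))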

  c. Draw residual plots of 𝑒𝑖 versus Ŷᵢ and 𝑒𝑖 versus 𝑋𝑖 on separate graphs. Do these plots provide the same information? Is the linear regression model (a) appropriate for the data? State your findings.
# residual plot against Y hat
lmcrime.fitted <- fitted(lmcrime)
plot(res.lmcrime ~ lmcrime.fitted, xlab="Fitted Values for Crime Rate", ylab="Residual", 
     main="Plot of residuals against fitted values")
abline(h=0,lwd=2)

# residual plots for X
plot(res.lmcrime ~ Crime$pcnt, ylab="Residual", xlab="Percentage Having a Highschool Diploma", 
     main="Residual plot against X")
abline(h=0)

Since Ŷᵢ is a linear function of 𝑋𝑖, the two plots carry the same information up to a rescaling of the horizontal axis. The residuals versus X look relatively well behaved and indicate the variance is fairly consistent as X changes; the residuals versus Ŷ hint at a potential inconsistency in the error variance, but I won’t rule out the linear regression model yet.

  d. Draw a normal probability plot of the residuals. Does the normality assumption appear to be reasonable here? Conduct a Shapiro-Wilk test for normality under 𝛼 = 0.05.
#qq plot
qqnorm(res.lmcrime);qqline(res.lmcrime)

#Shapiro-Wilk test
shapiro.test(res.lmcrime)

Judging from the qq plot, the left tail looks heavy, but otherwise it looks generally normal. The Shapiro-Wilk normality test cannot reject the null at α = 0.05, suggesting the error terms are indeed normally distributed.

  e. Conduct the Brown-Forsythe test to determine whether or not the error variance varies with the level of X. Divide the data into two groups, Group 1: X ≤ X̄ and Group 2: X > X̄, and use α = .05. State the decision rule and conclusion. Does your conclusion support your preliminary findings in part (c)?
mean.pcnt <- mean(Crime$pcnt)

#sort the X values
sort.pcnt <- sort(Crime$pcnt)
order.pcnt <- order(Crime$pcnt)
res.lmcrime <- resid(lmcrime)
sort.res <- res.lmcrime[order.pcnt]

#Divide the data set into two groups
group1.index <- which(sort.pcnt <= mean.pcnt)
group1 <- sort.pcnt[group1.index]

group2.index <- which(sort.pcnt > mean.pcnt)
group2 <- sort.pcnt[group2.index]

group1.res <- sort.res[group1.index]
group2.res <- sort.res[group2.index]

abd1 <- abs(group1.res - median(group1.res))
abd2 <- abs(group2.res - median(group2.res))

#test statistic based on the two-sample t-test
t.test(abd1,abd2)

With a p-value of 0.513, we cannot reject the null. This suggests the true difference in means is 0 and the error terms have consistent variance across the two groups. This supports the conclusion in part (c) not to rule out the linear regression model.
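As a cross-check on the hand-rolled version above, a minimal sketch using car::leveneTest, whose center = median option is the Brown-Forsythe variant; the grouping factor bf.group is a hypothetical name built to match the X ≤ X̄ split.

#install.packages("car")
library(car)
bf.group <- factor(Crime$pcnt <= mean(Crime$pcnt), labels = c("Group2", "Group1"))
leveneTest(res.lmcrime, bf.group, center = median)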

Question 3

library(readr)
Machine <- read_table2("C:/Users/2015m/Downloads/HW4_Datasets/Machine.txt", 
    col_names = FALSE)
head(Machine)
  a. Fit a linear regression function by ordinary least squares, obtain the residuals, and plot the residuals against X. What does the residual plot suggest?
#linear regression function (columns were read without names: X1 is the
#response, X2 is the speed setting)
lmmach <- lm(X1~X2, data = Machine) 
summary(lmmach)

#obtain the residuals
res.lmmach <- resid(lmmach)

# residual plots for X
plot(res.lmmach ~ Machine$X2, ylab="Residual", xlab="Speed Setting of Machine", 
     main="Residual plot against X")
abline(h=0)

The residual plot seems to show an increase in residual variance as X increases. The constancy of variance assumption might not be met.

  b. Conduct the Breusch-Pagan test for constancy of the error variance, assuming loge(σᵢ²) = γ0 + γ1Xᵢ; use α = 0.10. State the alternatives, decision rule, and conclusion.
bptest(lmmach)

The alternative hypothesis for the Breusch-Pagan test is that, in the equation loge(σᵢ²) = γ0 + γ1Xᵢ, γ1 is not equal to zero. At significance level α = 0.10, we can reject the null hypothesis, so we conclude the error variance is not constant.
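One caveat worth noting: lmtest::bptest defaults to the studentized (Koenker) version of the test; under the stated log-variance model, the textbook form of the statistic can be requested with studentize = FALSE, a minimal cross-check.

#non-studentized version, matching the textbook Breusch-Pagan statistic
bptest(lmmach, studentize = FALSE)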

  c. Plot the squared residuals against X. What does the plot suggest about the relation between the variance of the error term and X?
#re-display the residual plot from part (a) for comparison
plot(res.lmmach ~ Machine$X2, ylab="Residual", xlab="Speed Setting of Machine", 
     main="Residual plot against X")
abline(h=0)

plot(res.lmmach^2 ~ Machine$X2, ylab="Residual Squared", xlab="Speed Setting of Machine", 
     main="Squared Residual plot against X")
abline(h=0)

As stated before, the squared residual plot suggests the variance of the error term is not constant; it appears to increase with X.

  d. Estimate the variance function (lm(e² ~ X)) by regressing the squared residuals against X, and then calculate the estimated weight for each case using wᵢ = 1/v̂ᵢ, where v̂ᵢ is the ith fitted value from the variance function.
#the absolute residuals are regressed on X to estimate the standard
#deviation function, so the weights 1/fitted^2 play the role of 1/v-hat_i
lmmach.res <- lm(abs(res.lmmach) ~ X2, data=Machine)
W <- diag(1/fitted(lmmach.res)^2)
Y <- matrix(Machine$X1, ncol=1)
X <- matrix(cbind(1,Machine$X2), ncol=2)

inv.XWX <- solve(t(X)%*%W%*%X)
XWY <- t(X)%*%W%*%Y

#the WLS coefficients (reported in part (e)) are needed here for the
#weighted fitted values
b <- inv.XWX%*%XWY

fit.Yw <- sqrt(W)%*%X%*%b
Yw <- sqrt(W)%*%Y
Xw <- sqrt(W)%*%X
w.res <- Yw - fit.Yw

plot(w.res ~ Xw[,2], xlab="Xw", ylab="weighted residuals", main="weighted residual plot against X")
abline(h=0)
  e. Using the estimated weights, obtain the weighted least squares estimates of β0 and β1. Are the weighted least squares estimates similar to the ones obtained with ordinary least squares in part (a)?
#b was computed in part (d): b = (X'WX)^(-1) X'WY
b

The intercept is a little smaller (more negative) than the ordinary least squares estimate, but the slope is almost the same. It’s not a huge difference.
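As a cross-check on the matrix algebra, lm() accepts the same weights directly; wls.fit is a hypothetical name.

#weighted least squares with the weights 1/v-hat_i^2 from part (d)
wls.fit <- lm(X1 ~ X2, data = Machine, weights = 1/fitted(lmmach.res)^2)
coef(wls.fit)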

  f. Compare the estimated standard deviations of the weighted least squares estimates b_w0 and b_w1 in part (e) with those for the ordinary least squares estimates in part (a). What do you find?
#ordinary least squares standard errors from part (a): 16.73052 (b0) and
#0.05381 (b1), scaled by the same t critical value for comparison
qt(0.975, 10)*16.73052
qt(0.975, 10)*0.05381

#weighted least squares standard errors from the diagonal of (X'WX)^(-1)
s.b0 <- sqrt(diag(inv.XWX))[1]
s.b1 <- sqrt(diag(inv.XWX))[2]

qt(0.975, 10)*s.b0
qt(0.975, 10)*s.b1

The standard deviations of the weighted least squares estimates are substantially smaller than the ordinary least squares ones for both b0 and b1.
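The same comparison can be read straight off the coefficient tables, a minimal sketch assuming wls.fit from the cross-check above.

#standard errors of the OLS and WLS estimates, side by side
cbind(OLS = summary(lmmach)$coefficients[, "Std. Error"],
      WLS = summary(wls.fit)$coefficients[, "Std. Error"])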

Question 4

library(readr)
lung_pressure <- read_csv("C:/Users/2015m/Downloads/HW4_Datasets/lung_pressure.csv")
head(lung_pressure)
  a. Obtain the scatter plot matrix. Also obtain the correlation matrix of the X variables. What do the scatter plots suggest about the nature of the functional relationship between Y and each of the predictor variables? Are any serious multicollinearity problems evident? Explain.
lmlp <- lm(Y ~ X1 + X2 + X3, data=lung_pressure)

#scatter plot matrix
pairs(lung_pressure[,1:4], pch = 19)

#correlation matrix of the X variables
cor.x <- cor(lung_pressure[, c("X1","X2","X3")])
round(cor.x, 2)

The scatter plot matrix suggests all X variables have a negative correlation with Y. There also seems to be evidence of a multicollinearity problem in how highly correlated X3 is with both X1 and X2: both correlation coefficients are greater than 0.7, which is considered strong correlation.

  b. Obtain the jackknifed residuals and plot them separately against Ŷ and each of the three predictor variables. On the basis of these plots, should any further modifications of the regression model be attempted? Identify any outlying Y observations.
#rstudent() returns the studentized deleted (jackknife) residuals
std.res.lmlp <- rstudent(lmlp)
fitted.y <- fitted(lmlp)

par(mfrow=c(2,2))
plot(std.res.lmlp ~ fitted.y, xlab="fitted values", ylab="residuals", main="residual plot against fitted values")
plot(std.res.lmlp ~ lung_pressure$X1, xlab="X1", ylab="residuals", main="residual plot against X1")
plot(std.res.lmlp ~ lung_pressure$X2, xlab="X2", ylab="residuals", main="residual plot against X2")
plot(std.res.lmlp ~ lung_pressure$X3, xlab="X3", ylab="residuals", main="residual plot against X3")

At first glance, I’d say no further modifications are needed. I don’t see any real trend, but I could be convinced that there is inconsistency in the variance. Case 7 stands out in the residual plots as a potential outlying Y observation (consistent with part (f)).

  c. Prepare a normal probability plot of the residuals. Also, conduct a Shapiro-Wilk test for normality. Does the normality assumption appear to be reasonable here?
#qq plot
qqnorm(std.res.lmlp);qqline(std.res.lmlp)

#Shapiro-Wilk test
shapiro.test(std.res.lmlp)

The qq plot gives me pause based on how it veers off on the right. However, the Shapiro-Wilk test (assuming α = 0.05) does not reject the null, so the normality assumption looks reasonable.

  d. Obtain the variance inflation factors. Are there any indications that serious multicollinearity problems are present? Explain.
#install.packages("car")
library(car)
vif(lmlp)

The maximum VIF value is 22.474, which is greater than 10, indicating there could be a serious multicollinearity problem.
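The textbook rule of thumb also considers the mean VIF (values considerably greater than 1 suggest multicollinearity); a quick sketch.

#maximum and mean VIF for the rule-of-thumb checks
max(vif(lmlp))
mean(vif(lmlp))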

  e. Obtain the diagonal elements of the hat matrix. Using the criterion with the rule of thumb, identify any outlying X observations.
X <- as.matrix(cbind(1,lung_pressure[,-1]))
inv.XX <- solve(t(X)%*%X)
H <- X%*%inv.XX%*%t(X)
lev <- diag(H)
lev

The outlying X observations are those with leverage values greater than 0.2: the five largest values, 0.27569243, 0.53886673, 0.21775095, 0.87827870, and 0.47982100 (cases 1, 3, 7, 8, and 15).
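A sketch of the rule of thumb applied programmatically, flagging any leverage above 2p/n and cross-checking against base R's hatvalues(); note that 2p/n need not equal the 0.2 cutoff used above.

#rule of thumb: flag cases with leverage greater than 2p/n
p <- length(coef(lmlp))
n <- nrow(lung_pressure)
which(lev > 2*p/n)

#cross-check with base R
which(hatvalues(lmlp) > 2*p/n)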

  f. Cases 1, 3, 8, and 15 are moderately far outlying with respect to their X values, and case 7 is relatively far outlying with respect to its Y value. Obtain DFFITS, DFBETAS, and Cook’s distance values for these cases to assess their influence. What do you conclude?
#install.packages("olsrr")
library(olsrr)

#DFFITS values
ols_plot_dffits(lmlp)

#DFBETAS values
ols_plot_dfbetas(lmlp)

#Cook's distance
ols_plot_cooksd_chart(lmlp)

Cases 1, 7, and 8 were consistently out of range and are clearly influential outliers. Case 3 occasionally acted as an outlier, but not often enough to convince me it is more than one somewhat odd case.
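To put numbers behind the charts, the base R influence functions can be filtered to the cases in question, a minimal sketch (flagged is a hypothetical name).

#numeric influence measures for the cases named in the question
flagged <- c(1, 3, 7, 8, 15)
round(dffits(lmlp)[flagged], 3)
round(cooks.distance(lmlp)[flagged], 3)
round(dfbetas(lmlp)[flagged, ], 3)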