408 HW4

Question 1

2.1 a. Yes this conclusion is warranted because 0 is not in the confidence interval of the slope with a 95% significance level (alpha = 0.05)

b. This value is meaningless in this example because a population of zero is outside the scope of the model.

2.10 a. A prediction interval because it says “what will,” therefore we are predicting the temperature for tomorrow.

b. A confidemce interval because this is based off of information we already have.

c. A prediction interval because because we are predicting for next month (in the future).

2.17 Alpha > than 0.033 since they rejected. If the alpha level was 0.01, we would reject the null and conclude that the slope is not = 0 (therefore, there is an association between x and y).

2.18

The t-test is more versatile because the F test only works on slopes = 0 or not, while the t-test can be used on a hypothesized slope of any number.

2.66a.

ci.fun = function(m){
X = c(4,8,12,16,20)
Error = rnorm(5,0,5)
Y = 20 + 4*X + Error
LRM = lm(Y~X)
#LRM$coefficients
b0=LRM$coefficients["(Intercept)"]
b1=LRM$coefficients["X"]
Xh = data.frame(X = 10)
alpha = 0.05
CI = predict.lm(LRM, Xh, interval = "confidence", 1-alpha)
lower = CI$fit[,2]
upper = CI$fit[,3]
expectedYHath=20+4*Xh
containyesno <- (expectedYHath>lower) && (expectedYHath<upper)
return(list(b1,lower,upper, containyesno))
}

m = 200
A = sapply(1:m, ci.fun)
CIMatrix = matrix(unlist(A), nrow=4)

#PART C.
sim.slopes = CIMatrix[1,]
hist(sim.slopes)

mean(sim.slopes)

## [1] 4.055776

sd(sim.slopes)

## [1] 0.3965975

#These results are consistent with the theory because the histogram appears normally distributed with a mean close to 4 and standard deviation 

#PART D.
mean(CIMatrix[4,])*100

## [1] 97

This is the proportion of the 200 confidence intervals that contained the expected YHath (60). It is close to the theoretical value of 95% (alpha = 0.05).

Question 2

uadata = read.table("univadmissions.txt", header=TRUE)
attach(uadata)
fit = lm(gpa.endyr1~act) # Fit a SLR model of GPA at the end of freshmen year on ACT score.
plot(fit$res~act, pch=25) #residual plot
abline(0,0)

#(a) Draw the diagnostics plots using the plot() function. Check all assumptions only based graphical diagnostics.

#plots of act score
#X:ACT Score, Y:GPA at the end of fresman year
hist(act,breaks=10) #histogram

stem(act) #stem--and-leaf plot

## 
##   The decimal point is at the |
## 
##   13 | 000
##   14 | 0000
##   15 | 000
##   16 | 00000000
##   17 | 000000000000
##   18 | 0000000000000000000000000
##   19 | 000000000000000000000000
##   20 | 0000000000000000000000000000000000
##   21 | 0000000000000000000000000000000000000000
##   22 | 00000000000000000000000000000000000000000000000000000000000000
##   23 | 000000000000000000000000000000000000000000000000000000000000000000
##   24 | 000000000000000000000000000000000000000000000000000000000000000
##   25 | 000000000000000000000000000000000000000000000000000000000000000000
##   26 | 0000000000000000000000000000000000000000000000000000000
##   27 | 0000000000000000000000000000000000000000000000000000000000000
##   28 | 00000000000000000000000000000000000000000000000000000000000000
##   29 | 000000000000000000000000000000000000000000
##   30 | 00000000000000000000000000000
##   31 | 000000000000000000000000000
##   32 | 00000000
##   33 | 00000000
##   34 | 00
##   35 | 0

dotchart(act) #dot chart

boxplot(act, horizontal = TRUE)  #box-plot

plot(act,gpa.endyr1) #Scatterplot
abline(fit) #Linear regression

#i. Is the linearity assumption met? Why? ii. Is the normality assumption met? Why?
plot(fit,which =1)  #Residuals vs Fitted

plot(act,fit$residuals)  #Residuals vs act
abline(h=0,col="red")

The residuals vs. fitted shows a slight curve and the residual plot (residuals vs. act) shows the same. The data has a slight “U” shape.

Is the constant variance assumption met? Why?

#2c. Constant variance
plot(act,abs(fit$residuals))  #absolute value of Residuals vs act

plot(fit,which=3)  #Squared root of e** vs Fitted

The absolute value of the residuals vs. act plot shows a slightly larger variance of act scores from 15-25, and then a slightly smaller variance with act scores 25 to 35 (a very small variance with act scores above 30).

Are there outliers? Why?

plot(fit,which=5)
abline(h=3, col="green")
abline(h=-3, col="green")

There are a few standardized residuals less than -3 which are outliers.

In addition, check the normality assumption by the Shapiro-Wilk normality test. Is the normality assumption met?

hist(fit$residuals)   #histgram of residuals

#The histogram of residuals appears to be skewed left and therefore not normal.
plot(density(fit$residuals),ylim=c(0,0.2))  #plot density
curve(dnorm(x,0,3), col="blue", add = TRUE) #plot of N(0,3)

boxplot(fit$residuals) #box plot of resuduals

#The density curve and boxplots show the same thing.  The residuals do NOT appear to be normal

plot(fit,which=2) #Normal probability plots(QQplot)

qqnorm(fit$residuals)
qqline(fit$residuals,col="red")

#The normal probability plot (Q-Q Plot) is NOT linear.

#Shapiro-Wilk Test
shapiro.test(fit$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  fit$residuals
## W = 0.97167, p-value = 1.925e-10

#The Shapiro-Wilk test gives a p-value of 1.925 e-10, thus we reject the null hypothesis and conclude that the distribution of residuals is not normal.

#Anderson-Darling test
library(nortest)
ad.test(fit$residuals)

## 
##  Anderson-Darling normality test
## 
## data:  fit$residuals
## A = 5.3723, p-value = 2.893e-13

#The Anderson-Darling test gives a p-value of 2.893 e-13, so we draw the same conclusion.

In addition, check the constant variance assumption by the Breusch-Pagan test. Is the constant variance assumption met?

library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(zoo)
bptest(fit)

## 
##  studentized Breusch-Pagan test
## 
## data:  fit
## BP = 11.837, df = 1, p-value = 0.0005808

#The Breusch-Pagan test gave up a p-value of 0.0005808 which is very small, therefore we would reject the null and conclude that there is evidence of nonconstant variance.

3. Next, use the same data as in question 1, based on the result you have in the diagnostics check, we may fix some of them by transforming the response.

Find a proper power transformation of the response using the Box-Cox procedure. Need to install and load “MASS” library first.

library(MASS)
boxcox(fit) # need to change the range of λ.

boxcox(fit,lambda = seq(1, 3, 1/10)) # draw a Box-Cox plot where λ is between 1 and 3.

#We can see that the best lamda is 2.

State the SLR model with the transformed response variable.

fit2 = lm(gpa.endyr1^2~act) # Fit a SLR model of GPA squared at the end of freshmen year on ACT score (lambda = 2)
#(c) Fit the new model in (b) to the data and check the normality assumption and constant variance assumption using graphical diagnostics. Are the assumptions met? look better?
#Linearity
plot(fit2,which =1)

#The line is very close horizontal at 0.  They both look fine.
par(mfrow=c(1,2))
plot(fit,which=1)
plot(fit2,which=1)

#(d) Check the normality assumption by the Shapiro-Wilk normality test. Is the normality assumption met? In this problem, is the normality assumption necessary to draw statistical inference other than constructing a prediction interval?

#Normality

par(mfrow=c(1,2))
plot(fit,which=3)
plot(fit2,which=3)

shapiro.test(fit2$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  fit2$residuals
## W = 0.99267, p-value = 0.00153

# The Shapiro-Wilk test yields a low p-value of the transformed data set, so it seems like the errors are not normally distributed (however this is okay because of the large sample size).

Check the constant variance assumption by the Breusch-Pagan test. Is the constant variance assumption met?

library(lmtest)
library(zoo)
bptest(fit2)  # Constant variance

## 
##  studentized Breusch-Pagan test
## 
## data:  fit2
## BP = 0.49429, df = 1, p-value = 0.482

par(mfrow=c(1,2))
plot(fit,which=3)
plot(fit2,which=3)

# The constant-variance assumption is met because the B-P test yields a large p-value.