ci.fun = function(m){
X = c(4,8,12,16,20)
Error = rnorm(5,0,5)
Y = 20 + 4*X + Error
LRM = lm(Y~X)
#LRM$coefficients
b0=LRM$coefficients["(Intercept)"]
b1=LRM$coefficients["X"]
Xh = data.frame(X = 10)
alpha = 0.05
CI = predict.lm(LRM, Xh, interval = "confidence", 1-alpha)
lower = CI$fit[,2]
upper = CI$fit[,3]
expectedYHath=20+4*Xh
containyesno <- (expectedYHath>lower) && (expectedYHath<upper)
return(list(b1,lower,upper, containyesno))
}
m = 200
A = sapply(1:m, ci.fun)
CIMatrix = matrix(unlist(A), nrow=4)
#PART C.
sim.slopes = CIMatrix[1,]
hist(sim.slopes)
mean(sim.slopes)
## [1] 4.055776
sd(sim.slopes)
## [1] 0.3965975
#These results are consistent with the theory because the histogram appears normally distributed with a mean close to 4 and standard deviation
#PART D.
mean(CIMatrix[4,])*100
## [1] 97
This is the proportion of the 200 confidence intervals that contained the expected YHath (60). It is close to the theoretical value of 95% (alpha = 0.05).
uadata = read.table("univadmissions.txt", header=TRUE)
attach(uadata)
fit = lm(gpa.endyr1~act) # Fit a SLR model of GPA at the end of freshmen year on ACT score.
plot(fit$res~act, pch=25) #residual plot
abline(0,0)
#(a) Draw the diagnostics plots using the plot() function. Check all assumptions only based graphical diagnostics.
#plots of act score
#X:ACT Score, Y:GPA at the end of fresman year
hist(act,breaks=10) #histogram
stem(act) #stem--and-leaf plot
##
## The decimal point is at the |
##
## 13 | 000
## 14 | 0000
## 15 | 000
## 16 | 00000000
## 17 | 000000000000
## 18 | 0000000000000000000000000
## 19 | 000000000000000000000000
## 20 | 0000000000000000000000000000000000
## 21 | 0000000000000000000000000000000000000000
## 22 | 00000000000000000000000000000000000000000000000000000000000000
## 23 | 000000000000000000000000000000000000000000000000000000000000000000
## 24 | 000000000000000000000000000000000000000000000000000000000000000
## 25 | 000000000000000000000000000000000000000000000000000000000000000000
## 26 | 0000000000000000000000000000000000000000000000000000000
## 27 | 0000000000000000000000000000000000000000000000000000000000000
## 28 | 00000000000000000000000000000000000000000000000000000000000000
## 29 | 000000000000000000000000000000000000000000
## 30 | 00000000000000000000000000000
## 31 | 000000000000000000000000000
## 32 | 00000000
## 33 | 00000000
## 34 | 00
## 35 | 0
dotchart(act) #dot chart
boxplot(act, horizontal = TRUE) #box-plot
plot(act,gpa.endyr1) #Scatterplot
abline(fit) #Linear regression
#i. Is the linearity assumption met? Why? ii. Is the normality assumption met? Why?
plot(fit,which =1) #Residuals vs Fitted
plot(act,fit$residuals) #Residuals vs act
abline(h=0,col="red")
The residuals vs. fitted shows a slight curve and the residual plot (residuals vs. act) shows the same. The data has a slight “U” shape.
#2c. Constant variance
plot(act,abs(fit$residuals)) #absolute value of Residuals vs act
plot(fit,which=3) #Squared root of e** vs Fitted
The absolute value of the residuals vs. act plot shows a slightly larger variance of act scores from 15-25, and then a slightly smaller variance with act scores 25 to 35 (a very small variance with act scores above 30).
plot(fit,which=5)
abline(h=3, col="green")
abline(h=-3, col="green")
There are a few standardized residuals less than -3 which are outliers.
hist(fit$residuals) #histgram of residuals
#The histogram of residuals appears to be skewed left and therefore not normal.
plot(density(fit$residuals),ylim=c(0,0.2)) #plot density
curve(dnorm(x,0,3), col="blue", add = TRUE) #plot of N(0,3)
boxplot(fit$residuals) #box plot of resuduals
#The density curve and boxplots show the same thing. The residuals do NOT appear to be normal
plot(fit,which=2) #Normal probability plots(QQplot)
qqnorm(fit$residuals)
qqline(fit$residuals,col="red")
#The normal probability plot (Q-Q Plot) is NOT linear.
#Shapiro-Wilk Test
shapiro.test(fit$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit$residuals
## W = 0.97167, p-value = 1.925e-10
#The Shapiro-Wilk test gives a p-value of 1.925 e-10, thus we reject the null hypothesis and conclude that the distribution of residuals is not normal.
#Anderson-Darling test
library(nortest)
ad.test(fit$residuals)
##
## Anderson-Darling normality test
##
## data: fit$residuals
## A = 5.3723, p-value = 2.893e-13
#The Anderson-Darling test gives a p-value of 2.893 e-13, so we draw the same conclusion.
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(zoo)
bptest(fit)
##
## studentized Breusch-Pagan test
##
## data: fit
## BP = 11.837, df = 1, p-value = 0.0005808
#The Breusch-Pagan test gave up a p-value of 0.0005808 which is very small, therefore we would reject the null and conclude that there is evidence of nonconstant variance.
library(MASS)
boxcox(fit) # need to change the range of λ.
boxcox(fit,lambda = seq(1, 3, 1/10)) # draw a Box-Cox plot where λ is between 1 and 3.
#We can see that the best lamda is 2.
fit2 = lm(gpa.endyr1^2~act) # Fit a SLR model of GPA squared at the end of freshmen year on ACT score (lambda = 2)
#(c) Fit the new model in (b) to the data and check the normality assumption and constant variance assumption using graphical diagnostics. Are the assumptions met? look better?
#Linearity
plot(fit2,which =1)
#The line is very close horizontal at 0. They both look fine.
par(mfrow=c(1,2))
plot(fit,which=1)
plot(fit2,which=1)
#(d) Check the normality assumption by the Shapiro-Wilk normality test. Is the normality assumption met? In this problem, is the normality assumption necessary to draw statistical inference other than constructing a prediction interval?
#Normality
par(mfrow=c(1,2))
plot(fit,which=3)
plot(fit2,which=3)
shapiro.test(fit2$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit2$residuals
## W = 0.99267, p-value = 0.00153
# The Shapiro-Wilk test yields a low p-value of the transformed data set, so it seems like the errors are not normally distributed (however this is okay because of the large sample size).
library(lmtest)
library(zoo)
bptest(fit2) # Constant variance
##
## studentized Breusch-Pagan test
##
## data: fit2
## BP = 0.49429, df = 1, p-value = 0.482
par(mfrow=c(1,2))
plot(fit,which=3)
plot(fit2,which=3)
# The constant-variance assumption is met because the B-P test yields a large p-value.