gpa <- read.table("/Users/kajalchokshi/desktop/CH03PR03.txt", header = FALSE)
head(gpa)
## V1 V2 V3 V4
## 1 3.897 21 122 99
## 2 3.885 14 132 71
## 3 3.778 28 119 95
## 4 2.540 22 99 75
## 5 3.028 21 131 46
## 6 3.865 31 139 77
It does not seem like there are any outliers in the ACT scores (V2); they would appear as points beyond the whiskers of the boxplot if there were. Based on the histogram, though, there may be a slight left skew.
boxplot(gpa$V2, main="Boxplot of ACT Scores")
hist(gpa$V2, main="Histogram of ACT scores")
From the Residuals vs. Fitted plot, the red smoothing line is fairly flat and shows no strong pattern (only a slight upward curve), so the linearity assumption is upheld. The departures from the regression model are the residuals themselves, i.e., the observed GPA minus the fitted value, examined in the plots below.
We assume independence.
The constant variance assumption appears to hold overall, but when looking at the absolute residuals plotted against ACT, we notice a shift around ACT = 26. When ACT < 26, the constant variance assumption is not as clear; when ACT >= 26, it is more clearly met.
Looking at the boxplot and histogram of the residuals, we see that the residuals are fairly normal with the exception of a few outliers. There is a bit of left skew as well.
lm_gpa <- lm(gpa$V1 ~ gpa$V2)
#Linearity
plot(lm_gpa, which=1)
#Constant Variance
plot(gpa$V2, abs(lm_gpa$residuals)) # Absolute residuals vs. ACT
plot(lm_gpa, which = 3) # Scale-Location: sqrt(|standardized residuals|) vs. fitted
# Departures from the model are the residuals: gpa$V1 - lm_gpa$fitted.values
#Normality
boxplot(lm_gpa$residuals)
hist(lm_gpa$residuals)
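For one more view of normality, and of the apparent shift in spread at ACT = 26 noted above, an optional short sketch:
plot(lm_gpa, which = 2) # normal Q-Q plot of the standardized residuals
plot(gpa$V2, abs(lm_gpa$residuals), main = "Absolute Residuals vs ACT")
abline(v = 26, lty = 2) # marks the apparent change point in spread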
### Part E. We will conduct a Breusch-Pagan (BP) test to check the constant variance assumption.
Hypothesis Test:
H0: homoskedasticity (constant variance)
Ha: heteroskedasticity (non-constant variance)
Decision Rule:
If p-value < alpha, then reject H0.
Conclusion:
From our bptest, the p-value for the ACT >= 26 subset is 0.7562 > alpha = 0.01, so we fail to reject H0: there is no evidence against constant variance, which agrees with the conclusion from part C.
From our bptest, the p-value for the ACT < 26 subset is 0.0109. This is just above alpha = 0.01, so we narrowly fail to reject H0, but such a borderline p-value shows the evidence for constant variance is much weaker here, which agrees with the observation from part C that constant variance is less clear when ACT < 26.
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
a <- 0.01 # significance level for the BP tests
sub1 <- gpa[gpa$V2 >= 26, ] # subset with ACT >= 26
sub1_lm <- lm(sub1$V1 ~ sub1$V2)
bptest(sub1_lm)
##
## studentized Breusch-Pagan test
##
## data: sub1_lm
## BP = 0.096399, df = 1, p-value = 0.7562
sub2 <- gpa[gpa$V2 < 26, ] # subset with ACT < 26
sub2_lm <- lm(sub2$V1 ~ sub2$V2)
bptest(sub2_lm)
##
## studentized Breusch-Pagan test
##
## data: sub2_lm
## BP = 6.4818, df = 1, p-value = 0.0109
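For intuition about what bptest computes: the studentized (Koenker) BP statistic equals n times the R-squared from regressing the squared residuals on the predictor. A by-hand sketch for the ACT < 26 subset, which should closely reproduce the BP = 6.4818 above:
e2 <- sub2_lm$residuals^2
bp_by_hand <- nrow(sub2) * summary(lm(e2 ~ sub2$V2))$r.squared
bp_by_hand
pchisq(bp_by_hand, df = 1, lower.tail = FALSE) # p-value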
From the first plot of intelligence (V3) vs. the residuals from our original model, there appears to be an upward trend, which implies that intelligence may be a valuable predictor. I would add intelligence to the model.
From the second plot, there is no obvious pattern, which does not give clear insight into whether class rank (V4) would be a good variable to add to the regression equation.
plot(gpa$V3, lm_gpa$residuals, main = "Intelligence vs Residuals of Old Model") # residuals vs. V3
plot(gpa$V3, gpa$V1, main = "GPA vs Intelligence")
plot(gpa$V4, lm_gpa$residuals, main = "Class Rank vs Residuals of Old Model") # residuals vs. V4
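As a quick follow-up sketch (not required by the problem), one could fit the expanded model and check whether intelligence adds significant explanatory power:
lm_gpa2 <- lm(V1 ~ V2 + V3, data = gpa) # GPA on ACT and intelligence
summary(lm_gpa2)$coefficients # check the t-test on the V3 coefficient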
We are 95% confident that the true slope is in the interval (1.8405, 2.22825), and 95% confident that the true intercept is in the interval (162.9013, 174.29875). Because each interval is computed at the Bonferroni-adjusted level 1 - 0.10/2 = 0.95, we are at least 90% confident that both true values, slope and intercept, lie in their respective intervals simultaneously.
phard <- read.table("/Users/kajalchokshi/desktop/CH01PR22.txt", header = FALSE)
head(phard)
## V1 V2
## 1 199 16
## 2 205 16
## 3 196 16
## 4 200 16
## 5 218 24
## 6 220 24
time <- phard[,2] # elapsed time (hours)
hardness <- phard[,1] # hardness of the plastic
#Bonferroni CI
lm_phard <- lm(hardness ~ time)
lm_phard
##
## Call:
## lm(formula = hardness ~ time)
##
## Coefficients:
## (Intercept) time
## 168.600 2.034
joint_fam <- 2 # number of coefficients in the joint family (b0 and b1)
alpha <- 0.1
joint_level <- alpha / joint_fam # Bonferroni: split alpha across the family
bonf_int <- confint(lm_phard, level = 1 - joint_level)
bonf_int
## 2.5 % 97.5 %
## (Intercept) 162.9013 174.29875
## time 1.8405 2.22825
# B = t(0.975; 14) = 2.145
# b0 = 168.6000, s{b0} = 2.6570
# b1 = 2.0344,  s{b1} = 0.0904
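These hand values reproduce the confint() output above; a quick arithmetic check:
B <- qt(0.975, 14) # 2.145
168.6000 + c(-1, 1) * B * 2.6570 # intercept CI: roughly (162.90, 174.30)
2.0344 + c(-1, 1) * B * 0.0904 # slope CI: roughly (1.8405, 2.2283)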
Dr. Kong explained we do not need to know whether b0 and b1 are positively or negatively correlated. Based on the formula in the book, s{b0, b1} = -xbar * s^2{b1}, and since xbar = mean(time) is positive here, I believe that b0 and b1 are negatively correlated. The Bonferroni confidence intervals above do not account for this correlation.
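For a quick numeric check of the sign (a small sketch using the model already fit above):
vcov(lm_phard)[1, 2] # estimated cov(b0, b1); negative since xbar > 0
cov2cor(vcov(lm_phard)) # the same information as a correlation matrix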
The family confidence coefficient of 90 percent means that, in repeated sampling where every estimate in the family is recalculated each time, the proportion of samples for which every member of the family of estimates is simultaneously correct is 90 percent.
The meaning of the family confidence coefficient here is that we are 90% confident that the mean hardness values for the elapsed times of 20, 30, and 40 hours all fall in their respective intervals below.
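To illustrate the repeated-sampling interpretation, here is a small simulation sketch; purely for illustration it treats the fitted line and fitted error standard deviation above as the truth, and estimates how often all g = 3 Bonferroni intervals cover their targets:
set.seed(1)
xh <- c(20, 30, 40)
b <- coef(lm_phard) # pretend the fitted line is the true line
s <- summary(lm_phard)$sigma # and the fitted sigma is the true sigma
truth <- b[1] + b[2] * xh
covered <- replicate(2000, {
  y_new <- b[1] + b[2] * time + rnorm(length(time), sd = s)
  ci <- predict(lm(y_new ~ time), newdata = data.frame(time = xh),
                interval = "confidence", level = 1 - 0.10/3)
  all(ci[, "lwr"] <= truth & truth <= ci[, "upr"])
})
mean(covered) # family coverage; should be at least 0.90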
# Bonferroni CIs for mean hardness at X = 20, 30, 40
alphaF <- 0.10 # family significance level (90% family confidence)
g <- 3 # number of simultaneous intervals
predict(lm_phard, newdata = data.frame(time = c(20, 30, 40)),
        interval = "confidence", level = 1 - alphaF/g)
## fit lwr upr
## 1 209.2875 208.1200 210.4550
## 2 229.6312 228.7396 230.5229
## 3 249.9750 248.5189 251.4311
We used the Bonferroni method because its critical value t is smaller than the Scheffé multiplier W; I computed the Scheffé intervals as well and they are wider.
Using the Bonferroni method, we are 90% confident that both new hardness observations (these are prediction intervals, not intervals for the mean) fall in (222.4710, 236.7915) and (242.4562, 257.4938) for 30 and 40 hours of elapsed time, respectively.
g <- 2
alphaF <- 0.1
# Scheffe simultaneous PIs for X = 30, 40
n <- nrow(phard)
W1 <- sqrt(g * qf(1 - alphaF, g, n - 2)) # Scheffe multiplier for g simultaneous PIs
MSE <- sum((lm_phard$residuals)^2) / (n - 2)
xbar <- mean(time)
Sxx <- sum((time - xbar)^2)
s_Yhpred <- sqrt(MSE * (1 + 1/n + (c(30, 40) - xbar)^2 / Sxx)) # s{pred}
Yh_hat <- predict(lm_phard, newdata = data.frame(time = c(30, 40)))
simulPI <- cbind(Yh_hat, Yh_hat - W1 * s_Yhpred, Yh_hat + W1 * s_Yhpred)
colnames(simulPI) <- c("fitted", "lower", "upper")
rownames(simulPI) <- c("30", "40")
simulPI
## fitted lower upper
## 30 229.6312 221.8354 237.4271
## 40 249.9750 241.7889 258.1611
g <- 2
alphaF <- 0.1
# Bonferroni simultaneous PIs for X = 30, 40
predict(lm_phard, newdata = data.frame(time = c(30, 40)),
        interval = "prediction", level = 1 - alphaF/g)
## fit lwr upr
## 1 229.6312 222.4710 236.7915
## 2 249.9750 242.4562 257.4938
t <- qt(1 - alphaF/(2*g), n - 2) # Bonferroni multiplier t(0.975; 14)
list(W1,t)
## [[1]]
## [1] 2.335152
##
## [[2]]
## [1] 2.144787
# t is slightly smaller, so we use the Bonferroni method.