2.1

A.

Yes, the conclusion is warranted based on these results, as the confidence interval for the slope does not contain 0. The level of significance is 0.05, and at this level the slope is significantly different from 0.

B.

It is important to understand the scope of a research experiment and its model before focusing on interpretation. X = 0 (a population of 0 million) is not within our scope, so we cannot attempt to interpret the intercept at this value, nor at negative values of X.

2.10

A.

A prediction interval estimates the value of the response variable for a new observation outside the data set. A prediction interval would be used here to predict tomorrow's humidity level, with temperature as the predictor variable and humidity as the response.

B.

A confidence interval for the mean response is the correct interval to use, as we are trying to estimate an average within our observed data.

C.

A prediction interval is better suited to this situation, as we are trying to predict next month's energy consumption, which is a new observation not in our observed data. The sketch below illustrates the distinction between the two interval types.
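A minimal sketch of the distinction in R, using made-up temperature and humidity values (the data and variable names here are hypothetical, purely for illustration):

# Hypothetical data: temperature as predictor, humidity as response
temp <- c(60, 65, 70, 75, 80)
humidity <- c(40, 38, 35, 30, 28)
fit <- lm(humidity ~ temp)
new_day <- data.frame(temp = 72)
predict(fit, new_day, interval = "prediction")  # for a single new observation (parts A, C)
predict(fit, new_day, interval = "confidence")  # for the mean response (part B)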

2.17

With a p-value = 0.033 and the fact that we rejected \(H_0\), we conclude that the p-value < \(\alpha\), so \(\alpha\) must exceed 0.033 (consistent with, e.g., \(\alpha = 0.05\)). If instead \(\alpha = 0.01\), then the p-value > \(\alpha\) and we would fail to reject \(H_0\).
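The decision rule at each level, spelled out as a quick check:

# Decision rule: reject H0 when the p-value is below alpha
p_value <- 0.033
p_value < 0.05  # TRUE: reject H0 at alpha = 0.05
p_value < 0.01  # FALSE: fail to reject H0 at alpha = 0.01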

2.18

A t-test is more versatile than an F-test for the parameter \(\beta_1\) because the t-test accommodates both one-sided and two-sided alternatives, whereas the F-test applies only to the two-sided case.
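A quick sketch of this on simulated data (the data here are made up purely for illustration): for the two-sided test of \(H_0: \beta_1 = 0\), the square of the t statistic equals the F statistic, while a one-sided p-value is available only from the t distribution.

# Simulated illustration: t^2 equals F for the two-sided test of beta_1
set.seed(1)
x <- 1:20
y <- 2 + 0.5*x + rnorm(20)
m <- lm(y ~ x)
t_stat <- summary(m)$coefficients["x", "t value"]
f_stat <- summary(m)$fstatistic["value"]
c(t_squared = t_stat^2, F = unname(f_stat))  # identical values
pt(t_stat, df = m$df.residual, lower.tail = FALSE)  # one-sided p-value for H1: beta_1 > 0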

2.66

A.

Without the error term, the true regression model gives \(Y_1, Y_2, Y_3, Y_4, Y_5\) = 36, 52, 68, 84, 100, respectively. With error terms drawn from a normal distribution with mean 0 and variance 25, the simulated values are \(Y_1, \ldots, Y_5\) = 29.96467, 53.38715, 73.42221, 72.27151, 102.14562.

\(b_0 = 17.2644\), \(b_1 = 4.0812\)

The true mean response is \(E[Y_h] = 20 + 4(10) = 60\) when \(X_h = 10\); the fitted value from the estimated model is \(\hat{Y}_h = 58.076\).

A 95% confidence interval for \(E[Y_h]\) when \(X_h = 10\) is (45.95738, 70.19445).

set.seed(1234)
X <- c(4,8,12,16,20) 
#True Regression line is E[Y] = 20 + 4*X
Y <- 20 + 4*X + rnorm(5,0,5)  # errors ~ N(0, sd = 5), i.e. variance 25
Y 
## [1]  29.96467  53.38715  73.42221  72.27151 102.14562
lin_mod <- lm(Y~X)
summary(lin_mod)
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##       1       2       3       4       5 
##  -3.624   3.474   7.184 -10.291   3.258 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  17.2644     8.4197   2.050  0.13270   
## X             4.0812     0.6347   6.431  0.00762 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.028 on 3 degrees of freedom
## Multiple R-squared:  0.9324, Adjusted R-squared:  0.9098 
## F-statistic: 41.35 on 1 and 3 DF,  p-value: 0.007624
new <- data.frame(X = 10)  # '=' (not '<-') inside data.frame()
a<-0.05
predict(lin_mod, new, interval = "confidence", level = 1-a, se.fit = TRUE)  # 'level' must be passed by name
## $fit
##        fit      lwr      upr
## 1 58.07592 45.95738 70.19445
## 
## $se.fit
## [1] 3.807931
## 
## $df
## [1] 3
## 
## $residual.scale
## [1] 8.027824

B.

sim <- function () {
  X <- c(4,8,12,16,20) 
  #True Regression line is E[Y] = 20 + 4*X
  Y <- 20 + 4*X + rnorm(5,0,5)
  lin_mod <- lm(Y~X)
  new <- data.frame(X = 10)  # '=' (not '<-') inside data.frame()
  pred <- predict(lin_mod, new, interval = "confidence")
  pred <- setNames(c(pred), dimnames(pred)[[2]])
  ## return simulation result
  c(coef(lin_mod), pred)
}

set.seed(1234)
sim()
## (Intercept)           X         fit         lwr         upr 
##   17.264351    4.081157   58.075918   45.957382   70.194455
set.seed(0)
z<-t(replicate(200, sim()))
head(z)
##      (Intercept)        X      fit      lwr      upr
## [1,]   24.100535 3.987755 63.97808 57.61262 70.34354
## [2,]    6.417639 5.101501 57.43265 52.44263 62.42267
## [3,]   20.652355 3.797991 58.63227 52.74861 64.51593
## [4,]   20.349829 3.816426 58.51409 52.59115 64.43702
## [5,]   19.891873 4.095140 60.84327 57.49911 64.18742
## [6,]   24.586749 3.589483 60.48158 53.64574 67.31743
mean(z[,2])  # empirical mean of b1 over 200 simulations
## [1] 3.976249
sd(z[,2])  # empirical sd of b1
## [1] 0.4263377

C.

From this frequency distribution, mean\((b_1)\) = 3.9762 and sd\((b_1)\) = 0.4263. The theoretical values are \(E(b_1) = 4\) and sd\((b_1) = \sqrt{\sigma^2 / S_{XX}} = \sqrt{25/160} = 0.3953\). The values from our frequency distribution and the theoretical distribution are fairly similar.

sqrt(25/160)  # theoretical sd(b1) = sqrt(sigma^2 / Sxx)
## [1] 0.3952847
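The 160 in the denominator is \(S_{XX} = \sum (X_i - \bar{X})^2\) for the design points; a quick check:

# Sxx for the design points X = 4, 8, 12, 16, 20
X <- c(4, 8, 12, 16, 20)
sum((X - mean(X))^2)  # 160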

D.

Our expected coverage proportion is 0.95, and our observed proportion is 0.925. This should be closer to 0.95, and as we increase the number of simulation draws it does get closer (see the sketch below).

mean(z[, "lwr"] < 60 & z[, "upr"] > 60)
## [1] 0.925
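As a check of the claim that coverage approaches 0.95 with more draws, the simulation can be rerun with more replicates (a sketch; the exact proportion will depend on the seed):

# More replicates: the empirical coverage should move closer to 0.95
z_big <- t(replicate(2000, sim()))
mean(z_big[, "lwr"] < 60 & z_big[, "upr"] > 60)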

Written Problem 2

Part A.i

I would argue that the linearity assumption is met. In the Residuals vs. Fitted plot, the red line is fairly flat and does not show much of a pattern.

Part A.ii

It is difficult to tell whether normality is met, but I would argue that the normality assumption is not met. In the QQ plot in R, the points show a slight upward curve, suggesting a slight left skew. The distribution is fairly symmetric but does not fall along a straight line.

Part A.iii

The constant variance assumption is not met, though it could be argued the other way. In the Residuals vs. Fitted plot there is evidence of heteroskedasticity: the spread of the residuals decreases as the fitted values change, so the variance is not exactly constant. In the scale-location plot (square root of the absolute standardized residuals vs. fitted values), the red line starts high, slopes downward, turns back upward, and then levels off. This is not as flat as we would like, though it is generally straight.

Part A. iv

Based on the plot, there do not appear to be any unusual observations with standardized residuals \(|e_i| > 3\) (a numeric check follows the diagnostic code below).

uadata<-read.table("/Users/kajalchokshi/desktop/univ.txt", header=TRUE)
attach(uadata)
ua_lm<-lm(gpa.endyr1 ~ act) # Fit a SLR model of GPA at the end of freshmen year on ACT score.
#2i. Linearity
plot(ua_lm,which = 1)  #Residuals vs Fitted

plot(act,ua_lm$residuals)  #Residuals vs ACT
abline(h=0,col="red")

#2ii. Normality
hist(ua_lm$residuals)   # histogram of residuals

plot(density(ua_lm$residuals),ylim=c(0,0.2))  #plot density
curve(dnorm(x,0,3), col="blue", add = TRUE) # overlay N(0, sd = 3) density

boxplot(ua_lm$residuals) # box plot of residuals

plot(ua_lm,which=2) #Normal probability plots(QQplot)

qqnorm(ua_lm$residuals) 

#2iii Constant Variance
plot(act,ua_lm$residuals)  # Residuals vs ACT

plot(act,abs(ua_lm$residuals))  # |Residuals| vs ACT

plot(ua_lm,which=3)  # scale-location plot: sqrt(|standardized residuals|) vs fitted

#2iv. Unusual observations
plot(ua_lm,which=5)  # Residuals vs Leverage
abline(h=3,col="green")  # reference line at standardized residual = 3

Part B.

Based on the Shapiro-Wilk test, the p-value = 1.925e-10 < \(\alpha = 0.05\). We reject the null hypothesis that the residuals are normally distributed.

#Shapiro-Wilk Test for normality
shapiro.test(ua_lm$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  ua_lm$residuals
## W = 0.97167, p-value = 1.925e-10

Part C.

From our Breusch-Pagan test, the p-value = 0.0005808 < \(\alpha = 0.05\). We reject \(H_0\) that the errors are homoscedastic; there is evidence of nonconstant variance.

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(zoo)
bptest(ua_lm)
## 
##  studentized Breusch-Pagan test
## 
## data:  ua_lm
## BP = 11.837, df = 1, p-value = 0.0005808
head(rstandard(ua_lm))
##         1         2         3         4         5         6 
## -2.939890 -2.685678 -2.385417 -2.655405 -2.875069 -2.232061
summary(lm(log(rstandard(ua_lm)^2)~act))  # auxiliary regression: log squared standardized residuals on act
## 
## Call:
## lm(formula = log(rstandard(ua_lm)^2) ~ act)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6240  -0.9172   0.5208   1.5450   3.5954 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.59606    0.52699  -1.131    0.258
## act         -0.02761    0.02119  -1.303    0.193
## 
## Residual standard error: 2.257 on 703 degrees of freedom
## Multiple R-squared:  0.002409,   Adjusted R-squared:  0.0009899 
## F-statistic: 1.698 on 1 and 703 DF,  p-value: 0.193

Written Problem 3

Part A.

The \(\lambda\) that maximizes the log-likelihood is 2.010101. Thus, we choose \(\lambda = 2\).

library(MASS)
boxcox(ua_lm) # need to change the range of λ.

bc <- boxcox(ua_lm,lambda = seq(1, 3, 1/10)) # draw a Box-Cox plot where λ is between 1 and 3.

bc$x[which.max(bc$y)]
## [1] 2.010101

Part B.

We chose \(\lambda = 2\). For the new linear model, \(b_0 = 1.20510\) and \(b_1 = 0.32846\), so the new regression equation is \(\widehat{Y^2} = 1.20510 + 0.32846X\).

ua_lm2 <- lm((gpa.endyr1)^2 ~ act)
summary(ua_lm2)
## 
## Call:
## lm(formula = (gpa.endyr1)^2 ~ act)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2114 -2.2793  0.1727  2.3741  9.2345 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.20510    0.76782    1.57    0.117    
## act          0.32846    0.03087   10.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.288 on 703 degrees of freedom
## Multiple R-squared:  0.1387, Adjusted R-squared:  0.1374 
## F-statistic: 113.2 on 1 and 703 DF,  p-value: < 2.2e-16
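Since the model is fit on \(Y^2\), predictions on the original GPA scale can be recovered by taking square roots (a sketch; the ACT score of 25 is just a hypothetical example value):

# Hypothetical example: predict squared GPA at ACT = 25, then back-transform
pred2 <- predict(ua_lm2, data.frame(act = 25), interval = "confidence")
sqrt(pred2)  # rough interval for GPA on the original scale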

Part C.

Normality assumption: based on the QQ plot, I would argue that the normality assumption is met better than in the first model. The points lie straighter along the diagonal and are not as curved upward, though some curvature still exists.

Constant variance assumption: in comparison to the first model, constant variance is more clearly satisfied. The scale-location line is much straighter, though it is still not as flat as we would like.

#2ii. Normality
hist(ua_lm2$residuals)   # histogram of residuals

plot(density(ua_lm2$residuals),ylim=c(0,0.2))  #plot density
curve(dnorm(x,0,3), col="blue", add = TRUE) # overlay N(0, sd = 3) density

boxplot(ua_lm2$residuals) # box plot of residuals

plot(ua_lm2,which=2) #Normal probability plots(QQplot)

qqnorm(ua_lm2$residuals) 

#2iii Constant Variance
plot(act,ua_lm2$residuals)  # Residuals vs ACT

plot(act,abs(ua_lm2$residuals))  # |Residuals| vs ACT

plot(ua_lm2,which=3)  # scale-location plot: sqrt(|standardized residuals|) vs fitted

Part D.

From the Shapiro-Wilk test, we see that the p-value (0.00153) is much greater than the p-value of the first model, though it is still fairly small. From this p-value we would still reject \(H_0\) that the residuals are normally distributed.

#Shapiro-Wilk Test
shapiro.test(ua_lm2$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  ua_lm2$residuals
## W = 0.99267, p-value = 0.00153

Part E.

From the BP test, we see that our p-value is 0.482, which is much higher than that of the first model. With p-value > \(\alpha = 0.05\), we fail to reject our null hypothesis of homoscedasticity.

library(lmtest)
library(zoo)
bptest(ua_lm2)
## 
##  studentized Breusch-Pagan test
## 
## data:  ua_lm2
## BP = 0.49429, df = 1, p-value = 0.482
head(rstandard(ua_lm2))
##         1         2         3         4         5         6 
## -2.075704 -1.979288 -1.793027 -2.135931 -2.500509 -1.916012
summary(lm(log(rstandard(ua_lm2)^2)~act))  # auxiliary regression for the transformed model
## 
## Call:
## lm(formula = log(rstandard(ua_lm2)^2) ~ act)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.414  -1.004   0.490   1.509   3.449 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.74300    0.50051  -3.482 0.000528 ***
## act          0.02458    0.02013   1.221 0.222429    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.143 on 703 degrees of freedom
## Multiple R-squared:  0.002117,   Adjusted R-squared:  0.0006973 
## F-statistic: 1.491 on 1 and 703 DF,  p-value: 0.2224