Yes, the conclusion is warranted based on these results, as the confidence interval for the slope does not contain 0. The level of significance is 0.05, and at this level the slope is significantly different from 0.
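A minimal sketch of this check in R (the fitted model object `fit` is a hypothetical placeholder name):

confint(fit, level = 0.95) #reject H0: beta1 = 0 at alpha = 0.05 if the slope's interval excludes 0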
It is important to understand the scope of a research experiment and its model before focusing on interpretation. X = 0 (a population of 0 million) is outside our scope, as is any negative value of X, so we cannot meaningfully interpret the model at those values.
A prediction interval helps you estimate what the response variable will be for a new observation outside of the data set. A prediction interval would be used here to predict tomorrow's humidity level, with temperature as the predictor variable and humidity as the response.
A confidence interval for the mean response would be the correct interval to use, as we are trying to estimate an average within our observed data.
A prediction interval would be better in this situation, as we are trying to predict next month's energy consumption, which is a new observation not in our observed data.
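A rough sketch of this distinction in R (the model `fit` and data frame `new_obs` are hypothetical names): a confidence interval bounds the mean response \(E[Y_h]\), while a prediction interval bounds a single new response and is wider.

predict(fit, new_obs, interval = "confidence") #CI for E[Y_h] (an average, e.g. mean energy consumption)
predict(fit, new_obs, interval = "prediction") #PI for a single new Y_h (tomorrow's humidity, next month's usage)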
With a p-value of 0.033 and the fact that we rejected \(H_0\), we conclude that our p-value < \(\alpha\); that is, \(\alpha\) must be greater than 0.033. If instead \(\alpha = 0.01\), then our p-value > \(\alpha\) and we would fail to reject \(H_0\).
A t-test is more versatile than an F-test for the parameter \(\beta_1\) because the t-test can handle both one-sided and two-sided alternatives, while the F-test only handles the two-sided case.
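For the two-sided test of \(\beta_1 = 0\), the two approaches agree: the slope's t statistic squared equals the model F statistic. A small sketch with simulated data (all names here are illustrative):

set.seed(1)
x <- 1:20
y <- 2 + 3*x + rnorm(20)
m <- lm(y ~ x)
summary(m)$coefficients["x", "t value"]^2 #squared t statistic for the slope...
summary(m)$fstatistic["value"] #...matches the overall F statistic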
Without the error term, the true regression model gives \(Y_1, \dots, Y_5\) = 36, 52, 68, 84, 100. With the error term \(\varepsilon \sim N(0, 25)\), the simulated values are \(Y_1, \dots, Y_5\) = 29.96467, 53.38715, 73.42221, 72.27151, 102.14562.
Our estimates are \(b_0 = 17.2644\) and \(b_1 = 4.0812\).
The true mean response is \(E[Y_h] = 20 + 4(10) = 60\) when \(X_h = 10\).
A 95% confidence interval for \(E[Y_h]\) is (45.9574, 70.1945) when \(X_h = 10\).
set.seed(1234)
X <- c(4,8,12,16,20)
#True Regression line is E[Y] = 20 + 4*X
Y <- 20 + 4*X + rnorm(5,0,5)
Y
## [1] 29.96467 53.38715 73.42221 72.27151 102.14562
lin_mod <- lm(Y~X)
summary(lin_mod)
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## 1 2 3 4 5
## -3.624 3.474 7.184 -10.291 3.258
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.2644 8.4197 2.050 0.13270
## X 4.0812 0.6347 6.431 0.00762 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.028 on 3 degrees of freedom
## Multiple R-squared: 0.9324, Adjusted R-squared: 0.9098
## F-statistic: 41.35 on 1 and 3 DF, p-value: 0.007624
new <- data.frame(X = 10) #prediction point X_h = 10
a <- 0.05
predict(lin_mod, new, se.fit = TRUE, interval = "confidence", level = 1 - a) #95% CI for E[Y_h]; se.fit = TRUE also returns the SE, df, and residual scale
## $fit
## fit lwr upr
## 1 58.07592 45.95738 70.19445
##
## $se.fit
## [1] 3.807931
##
## $df
## [1] 3
##
## $residual.scale
## [1] 8.027824
sim <- function () {
X <- c(4,8,12,16,20)
#True Regression line is E[Y] = 20 + 4*X
Y <- 20 + 4*X + rnorm(5,0,5)
lin_mod <- lm(Y~X)
new <- data.frame(X = 10)
pred <- predict(lin_mod, new, interval = "confidence")
pred <- setNames(c(pred), dimnames(pred)[[2]])
## return simulation result
c(coef(lin_mod), pred)
}
set.seed(1234)
sim()
## (Intercept) X fit lwr upr
## 17.264351 4.081157 58.075918 45.957382 70.194455
set.seed(0)
z<-t(replicate(200, sim()))
head(z)
## (Intercept) X fit lwr upr
## [1,] 24.100535 3.987755 63.97808 57.61262 70.34354
## [2,] 6.417639 5.101501 57.43265 52.44263 62.42267
## [3,] 20.652355 3.797991 58.63227 52.74861 64.51593
## [4,] 20.349829 3.816426 58.51409 52.59115 64.43702
## [5,] 19.891873 4.095140 60.84327 57.49911 64.18742
## [6,] 24.586749 3.589483 60.48158 53.64574 67.31743
mean(z[,2])
## [1] 3.976249
sd(z[,2])
## [1] 0.4263377
From this frequency distribution, mean\((b_1) = 3.9762\) and sd\((b_1) = 0.4263\). The theoretical values are \(E(b_1) = 4\) and sd\((b_1) = \sqrt{\sigma^2/S_{xx}} = 0.3953\). The values from our frequency distribution and the theoretical distribution are fairly similar.
sqrt(25/160)
## [1] 0.3952847
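The 0.3953 above is \(\sqrt{25/160}\), where \(S_{xx} = \sum_i (X_i - \bar{X})^2 = 160\) for our design points. A quick check:

X <- c(4, 8, 12, 16, 20)
Sxx <- sum((X - mean(X))^2) #Sxx = 160 for these design points
sqrt(25 / Sxx) #theoretical sd(b1) = sigma / sqrt(Sxx) = 0.3952847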
Our expected coverage proportion is 0.95; the observed proportion below is 0.925. This should be closer to 0.95, and it moves closer as we increase the number of replications (see the sketch after the check below).
mean(z[, "lwr"] < 60 & z[, "upr"] > 60)
## [1] 0.925
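As a hedged follow-up, rerunning the simulation with more replications (2000 here, an arbitrary choice) should move the empirical coverage closer to the nominal 0.95; output not shown.

z2 <- t(replicate(2000, sim())) #more replications of the same simulation
mean(z2[, "lwr"] < 60 & z2[, "upr"] > 60) #empirical coverage; should be closer to 0.95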
I would argue that the linearity assumption is met: in the Residuals vs. Fitted plot, the red line is fairly flat and does not show much of a pattern.
It is difficult to tell if normality is met, but I would argue that the normality assumption is not met. The QQ plot in R has a slight upward curve, suggesting mild left skew. The residuals are fairly symmetric but do not fall along a straight line.
The constant variance assumption is not met, though it could be argued either way. From the Residuals vs. Fitted plot, we can see heteroskedasticity: the spread of the residuals decreases as the fitted values change, so the variance is not exactly constant. From the scale-location plot (\(\sqrt{|\text{standardized residuals}|}\) vs. fitted values), the red line starts high, slopes downward, turns back upward, and then levels off. This is not as constant as we would like, though it is generally straight.
Based on the plot, there do not appear to be any unusual observations: no standardized residuals satisfy \(|e_i| > 3\) (a quick numeric check follows the plotting code below).
uadata<-read.table("/Users/kajalchokshi/desktop/univ.txt", header=TRUE)
attach(uadata)
ua_lm<-lm(gpa.endyr1 ~ act) # Fit a SLR model of GPA at the end of freshman year on ACT score.
#2i. Linearity
plot(ua_lm,which = 1) #Residuals vs Fitted
plot(act,ua_lm$residuals) #Residuals vs ACT
abline(h=0,col="red")
#2ii. Normality
hist(ua_lm$residuals) #histogram of residuals
plot(density(ua_lm$residuals),ylim=c(0,0.2)) #plot density
curve(dnorm(x,0,3), col="blue", add = TRUE) #plot of N(0,3)
boxplot(ua_lm$residuals) #box plot of residuals
plot(ua_lm,which=2) #Normal probability plots(QQplot)
qqnorm(ua_lm$residuals)
#2iii Constant Variance
plot(act,ua_lm$residuals) #Residuals vs ACT
plot(act,abs(ua_lm$residuals)) #|Residuals| vs ACT
plot(ua_lm,which=3) #scale-location plot: sqrt(|standardized residuals|) vs fitted
#2iv. Unusual observations
plot(ua_lm,which=5) #Residuals vs Leverage
abline(h=3,col="green") #reference line at standardized residual = 3
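A quick numeric version of the \(|e_i| > 3\) screen on the standardized residuals (a sketch using `rstandard()`, which also appears below):

sum(abs(rstandard(ua_lm)) > 3) #count of flagged observations
which(abs(rstandard(ua_lm)) > 3) #their row indices, if any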
Based on the Shapiro-Wilk test, the p-value = 1.925e-10 < \(\alpha\) = 0.05, so we reject the null hypothesis that the residuals are normally distributed.
#Shapiro-Wilk Test for normality
shapiro.test(ua_lm$residuals)
##
## Shapiro-Wilk normality test
##
## data: ua_lm$residuals
## W = 0.97167, p-value = 1.925e-10
From the Breusch-Pagan test, our p-value = 0.0005808 < \(\alpha\) = 0.05, so we reject \(H_0\) that the errors are homoscedastic. There is evidence of nonconstant variance.
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(zoo)
bptest(ua_lm)
##
## studentized Breusch-Pagan test
##
## data: ua_lm
## BP = 11.837, df = 1, p-value = 0.0005808
head(rstandard(ua_lm))
## 1 2 3 4 5 6
## -2.939890 -2.685678 -2.385417 -2.655405 -2.875069 -2.232061
summary(lm(log(rstandard(ua_lm)^2)~act))
##
## Call:
## lm(formula = log(rstandard(ua_lm)^2) ~ act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.6240 -0.9172 0.5208 1.5450 3.5954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.59606 0.52699 -1.131 0.258
## act -0.02761 0.02119 -1.303 0.193
##
## Residual standard error: 2.257 on 703 degrees of freedom
## Multiple R-squared: 0.002409, Adjusted R-squared: 0.0009899
## F-statistic: 1.698 on 1 and 703 DF, p-value: 0.193
The \(\lambda\) that maximizes the log-likelihood is 2.010101. Thus, we choose \(\lambda = 2\).
library(MASS)
boxcox(ua_lm) # need to change the range of λ.
bc <- boxcox(ua_lm,lambda = seq(1, 3, 1/10)) # draw a Box-Cox plot where λ is between 1 and 3.
bc$x[which.max(bc$y)]
## [1] 2.010101
We chose \(\lambda = 2\). For the new linear model, \(b_0 = 1.20510\) and \(b_1 = 0.32846\), so the fitted regression equation is \(\widehat{Y^2} = 1.20510 + 0.32846X\).
ua_lm2 <- lm((gpa.endyr1)^2 ~ act)
summary(ua_lm2)
##
## Call:
## lm(formula = (gpa.endyr1)^2 ~ act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2114 -2.2793 0.1727 2.3741 9.2345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.20510 0.76782 1.57 0.117
## act 0.32846 0.03087 10.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.288 on 703 degrees of freedom
## Multiple R-squared: 0.1387, Adjusted R-squared: 0.1374
## F-statistic: 113.2 on 1 and 703 DF, p-value: < 2.2e-16
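As a hypothetical usage sketch, a prediction on the original GPA scale can be recovered by taking the square root of the fitted \(Y^2\) (a naive back-transform that ignores retransformation bias; ACT = 25 is an arbitrary example value).

sqrt(predict(ua_lm2, data.frame(act = 25))) #predicted GPA at ACT = 25, via the naive back-transform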
Normality assumption: based on the QQ plot, I would argue that the normality assumption is met better than in the first model. The points lie straighter along the diagonal and are not as curved upward, although some curvature still exists.
Constant variance assumption: in comparison to the first model, the constant variance is definitely clearer. The red line is much straighter, though it still is not as flat as we would like.
#2ii. Normality
hist(ua_lm2$residuals) #histogram of residuals
plot(density(ua_lm2$residuals),ylim=c(0,0.2)) #plot density
curve(dnorm(x,0,3), col="blue", add = TRUE) #plot of N(0,3)
boxplot(ua_lm2$residuals) #box plot of residuals
plot(ua_lm2,which=2) #Normal probability plots(QQplot)
qqnorm(ua_lm2$residuals)
#2iii Constant Variance
plot(act,ua_lm2$residuals) #Residuals vs ACT
plot(act,abs(ua_lm2$residuals)) #|Residuals| vs ACT
plot(ua_lm2,which=3) #scale-location plot: sqrt(|standardized residuals|) vs fitted
From the Shapiro-Wilk test, the p-value (0.00153) is much greater than the first model's, though still fairly small. At \(\alpha = 0.05\) we would still reject \(H_0\) that the residuals are normal.
#Shapiro-Wilk Test
shapiro.test(ua_lm2$residuals)
##
## Shapiro-Wilk normality test
##
## data: ua_lm2$residuals
## W = 0.99267, p-value = 0.00153
From the BP test, our p-value is 0.482, which is much higher than in the first model. With p-value > \(\alpha = 0.05\), we fail to reject our null hypothesis of homoscedasticity.
library(lmtest)
library(zoo)
bptest(ua_lm2)
##
## studentized Breusch-Pagan test
##
## data: ua_lm2
## BP = 0.49429, df = 1, p-value = 0.482
head(rstandard(ua_lm2))
## 1 2 3 4 5 6
## -2.075704 -1.979288 -1.793027 -2.135931 -2.500509 -1.916012
summary(lm(log(rstandard(ua_lm2)^2)~act))
##
## Call:
## lm(formula = log(rstandard(ua_lm2)^2) ~ act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.414 -1.004 0.490 1.509 3.449
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.74300 0.50051 -3.482 0.000528 ***
## act 0.02458 0.02013 1.221 0.222429
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.143 on 703 degrees of freedom
## Multiple R-squared: 0.002117, Adjusted R-squared: 0.0006973
## F-statistic: 1.491 on 1 and 703 DF, p-value: 0.2224