ch19<-read.table("CH01PR19.txt")
head(ch19)
## V1 V2
## 1 3.897 21
## 2 3.885 14
## 3 3.778 28
## 4 2.540 22
## 5 3.028 21
## 6 3.865 31
n <- nrow(ch19)
ACT <- ch19[,2]
GPA <- ch19[,1]
ch19_lm <- lm(GPA ~ ACT)
\(b_0 = 2.114049\) and \(b_1 = 0.03882713\)
The regression function is as follows: \(\hat{y} = 2.114049 + 0.03882713x\)
ch19_lm$coefficients
## (Intercept) ACT
## 2.11404929 0.03882713
The point estimate of the mean freshman GPA for students with an ACT score of 30 is 3.2789.
new <- data.frame(ACT = 30)
new
## ACT
## 1 30
predict(ch19_lm, new)
## 1
## 3.278863
The sum of the residuals is -2.942091e-15, which is effectively 0 (the difference from zero is floating-point error).
sum(ch19_lm$residuals)
## [1] -2.942091e-15
The point estimate of \(\sigma^{2}\) is \(MSE = 0.3882848\), and the point estimate of \(\sigma\) is \(\sqrt{MSE} = 0.623125\). \(\sigma\) is expressed in GPA points.
MSE <- sum(ch19_lm$residuals^2)/(n-2)
sigma <- sigma(ch19_lm)
sigma^2
## [1] 0.3882848
The 99% confidence interval is \((0.005386, 0.072269)\). This means we are 99% confident that \(\beta_1\) lies within the constructed interval; i.e., if this experiment were repeated N times, for N large enough, approximately 99% of the intervals constructed this way would contain \(\beta_1\).
This interval does not contain 0. The admissions director might be interested in whether the confidence interval includes 0, because an interval containing 0 would suggest that the linear relationship between ACT score and GPA is not significant. If ACT scores had no impact on GPA, the director would not need to set a minimum score requirement and could increase the number of applicants.
confint(ch19_lm, level = 0.99)
## 0.5 % 99.5 %
## (Intercept) 1.273902675 2.95419590
## ACT 0.005385614 0.07226864
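As a check, the ACT row of this interval can be reproduced by hand from the standard error that summary() reports. A minimal sketch, using ch19_lm and n from above:
# Sketch: 99% CI for beta1 by hand; should match confint() above
b1_hat <- coef(ch19_lm)["ACT"]
s_b1 <- coef(summary(ch19_lm))["ACT", "Std. Error"]
m99 <- qt(1 - 0.01/2, n - 2) # t multiplier for 99% confidence
c(b1_hat - m99*s_b1, b1_hat + m99*s_b1)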
Hypothesis Test:
\(H_0: \beta_1 = 0\)
\(H_a: \beta_1 \neq 0\)
Decision Rule:
If \(\mid T_{H_0}\mid = \mid\frac{b_1 - 0}{s(b_1)}\mid > t(1-\frac{\alpha}{2}, n-2)\), then reject \(H_0\).
Conclusion:
The t-statistic for ACT in the summary() output is 3.040, which exceeds the critical value 2.618 computed below.
At \(\alpha = 0.01\), we reject \(H_0: \beta_1 = 0\), since \(\mid T_{H_0}\mid > t(1-\frac{\alpha}{2}, n-2)\). There is sufficient evidence of a significant linear relationship between ACT score and GPA.
summary(ch19_lm)
##
## Call:
## lm(formula = GPA ~ ACT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.74004 -0.33827 0.04062 0.44064 1.22737
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.11405 0.32089 6.588 1.3e-09 ***
## ACT 0.03883 0.01277 3.040 0.00292 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6231 on 118 degrees of freedom
## Multiple R-squared: 0.07262, Adjusted R-squared: 0.06476
## F-statistic: 9.24 on 1 and 118 DF, p-value: 0.002917
a <-0.01
m <- qt(1-(a/2), n-2) #Multiplier
m
## [1] 2.618137
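As a sketch, the test statistic itself can be computed by hand and compared against the multiplier m:
# Sketch: t* = (b1 - 0)/s(b1), compared to the critical value m
b1_hat <- coef(ch19_lm)["ACT"]
s_b1 <- coef(summary(ch19_lm))["ACT", "Std. Error"]
t_star <- (b1_hat - 0)/s_b1
c(t_star, m) # |t*| = 3.040 > 2.618, so reject H0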
Another way to reach the same conclusion is to verify that \(p < \alpha\). The p-value from summary() is 0.00292, and since \(0.00292 < \alpha = 0.01\), this supports our decision to reject \(H_0\).
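The same p-value can be recovered from the t distribution; a quick sketch using t_star from the sketch above:
# Sketch: two-sided p-value; ~0.00292 < alpha = 0.01
2*(1 - pt(abs(t_star), n - 2))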
A 95% interval estimate of the mean freshman GPA for students whose ACT score is 28 is (3.0614, 3.3410). This means we are 95% confident that the mean GPA of students who score a 28 on the ACT lies within this interval.
n <- nrow(ch19)
X <- ch19[,2]
Y <- ch19[,1]
Xbar <- mean(X)
Ybar <- mean(Y)
b1 <- sum((X-Xbar)*(Y-Ybar))/((n-1)*var(X)) #Formula
b0 <- Ybar-b1*Xbar #Formula
b0
## [1] 2.114049
b1
## [1] 0.03882713
X_h <- 28
Yhath <- b0 + b1 * X_h # predicted value for X_h=28
Yhath
## [1] 3.201209
Yhat <- b0+b1*X
Res <- Y-Yhat
MSE <- sum(Res^2)/(n-2) #MSE
sxx<- sum((X-Xbar)^2) #Sxx
s2 <- MSE*((1/n) + (((X_h-Xbar)^2))/sxx) #S^2(y_hat)
s <- sqrt(s2) #s(y_hat)
a<-0.05
m <- qt(1-(a/2), n-2) #multiplier
Yhath + (m*s)
## [1] 3.341033
Yhath - (m*s)
## [1] 3.061384
Using R, we obtain the same confidence interval of (3.0614, 3.3410).
summary(ch19_lm)
##
## Call:
## lm(formula = GPA ~ ACT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.74004 -0.33827 0.04062 0.44064 1.22737
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.11405 0.32089 6.588 1.3e-09 ***
## ACT 0.03883 0.01277 3.040 0.00292 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6231 on 118 degrees of freedom
## Multiple R-squared: 0.07262, Adjusted R-squared: 0.06476
## F-statistic: 9.24 on 1 and 118 DF, p-value: 0.002917
new <- data.frame(ACT = 28)
predict.lm(ch19_lm, new, interval = "confidence", level = 1-a, se.fit = TRUE)
## $fit
## fit lwr upr
## 1 3.201209 3.061384 3.341033
##
## $se.fit
## [1] 0.07060873
##
## $df
## [1] 118
##
## $residual.scale
## [1] 0.623125
We are 95% confident that, with Mary's ACT score of 28, her GPA will lie within the interval (1.959355, 4.443063).
s2_new <- MSE*(1+(1/n) + (((X_h-Xbar)^2))/sxx) #Prediction interval S^2(Y_hat_new)
s_new <- sqrt(s2_new)
s_new
## [1] 0.6271128
Yhath + (m*s_new)
## [1] 4.443063
Yhath - (m*s_new)
## [1] 1.959355
Using R, we obtain the same prediction interval, (1.959355, 4.443063).
predict.lm(ch19_lm, new, interval = "prediction", level = 1-a, se.fit = TRUE)
## $fit
## fit lwr upr
## 1 3.201209 1.959355 4.443063
##
## $se.fit
## [1] 0.07060873
##
## $df
## [1] 118
##
## $residual.scale
## [1] 0.623125
Yes, the prediction interval in part B is wider than the confidence interval found in part A. The confidence interval accounts only for the uncertainty in estimating the mean response at \(X_h\), whereas the prediction interval must also cover the scatter of an individual observation around that mean, which adds MSE to the variance.
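This decomposition can be checked numerically with the quantities computed above: the prediction variance exceeds the mean-response variance by exactly MSE.
# Sketch: s^2(pred) - s^2(Yhat_h) = MSE, the individual-observation variance
s2_new - s2 # equals MSE = 0.3882848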
The 95% confidence band at \(X_h = 28\) is \(3.026159 \leq \beta_0 + \beta_1 X_h \leq 3.376258\).
This confidence band is wider than our pointwise confidence interval. This is expected, since the band must cover the entire regression line simultaneously, not just the mean response at a single \(X_h\).
W <- sqrt(2*qf(0.95, 2, length(GPA)-2)) # Working-Hotelling multiplier
gpa_CI <- predict.lm(ch19_lm, new, interval = "confidence", level = 1-a, se.fit = TRUE)
cbu <- gpa_CI$fit[,1] + W*gpa_CI$se.fit # upper band limit
cbl <- gpa_CI$fit[,1] - W*gpa_CI$se.fit # lower band limit
cbu
## [1] 3.376258
cbl
## [1] 3.026159
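A quick sketch of why the band is wider: the Working-Hotelling multiplier W exceeds the pointwise t multiplier at the same confidence level.
# Sketch: compare the simultaneous and pointwise multipliers
c(W = W, t = qt(1 - a/2, n - 2)) # W ~ 2.48 vs t ~ 1.98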
SSE <- sum(Res^2) # Sum of squared error/residuals
SSR = sum((Yhat-Ybar)^2) # Sum of squared regression
SSTO = (n-1)*var(Y) # Sum of squared Total
c(SSE, SSR, SSTO, SSTO-SSE) # SSTO=SSE+SSR
## [1] 45.817608 3.587846 49.405454 3.587846
MSE = SSE/(n-2)
MSR = SSR /1
ss <- c(SSR, SSE, SSTO)
df <-c(1,118,119)
ms <- c(MSR, MSE, NA)
types <-factor(c('Regression','Error','Total'))
ch19_anova <- data.frame(Source=types,SS=ss, df=df, MS=ms)
ch19_anova
## Source SS df MS
## 1 Regression 3.587846 1 3.5878459
## 2 Error 45.817608 118 0.3882848
## 3 Total 49.405454 119 NA
anova(ch19_lm)
## Analysis of Variance Table
##
## Response: GPA
## Df Sum Sq Mean Sq F value Pr(>F)
## ACT 1 3.588 3.5878 9.2402 0.002917 **
## Residuals 118 45.818 0.3883
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
MSR estimates \(\sigma^2 + \beta_1^2\sum(X_i-\bar{X})^2\) and MSE estimates \(\sigma^2\). MSR and MSE estimate the same quantity when \(\beta_1 = 0\).
Here, MSR is 3.5878 and MSE is 0.3883.
Hypothesis Test:
\(H_0: \beta_1 = 0\)
\(H_a: \beta_1 \neq 0\)
Decision Rule:
If \(F^* > F(1-\alpha; 1, n-2)\), then reject \(H_0\).
Conclusion:
From our ANOVA table, \(F^* = 9.2402\), and below we compute the critical value \(F(0.99; 1, 118) = 6.854641\). Thus \(F^* > F\) and we reject the null hypothesis \(H_0: \beta_1 = 0\).
# Test Hypothesis H0: beta1=0 vs Ha: beta1 != 0
Fstar = MSR/MSE # ANOVA F test statistic
Fstar
## [1] 9.240243
sb1 <- sqrt(MSE/sum((X-Xbar)^2))
t1 <- b1/sb1
t1^2 # t1^2=Fstar
## [1] 9.240243
cri_f = qf(1-a,1,n-2) # critical value (note: a is still 0.05 here)
cri_f
## [1] 3.921478
a <- 0.01
n <- nrow(ch19)
F_test <- qf((1-a),1,n-2) # critical value F(0.99; 1, 118)
F_test
## [1] 6.854641
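Equivalently, the p-value of the F test can be computed directly; a sketch using Fstar from above:
# Sketch: p-value of the F test; matches Pr(>F) = 0.002917 in the ANOVA table
1 - pf(Fstar, 1, n - 2)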
The absolute reduction in the variation of Y when X is introduced into the regression model is SSR = 3.588.
The relative reduction is \(\frac{3.588}{49.406} = 7.262\%\).
The relative reduction is also called the coefficient of determination, or \(R^2\).
# Coefficient of Determination
Rsq = SSR/SSTO # R square
Rsq
## [1] 0.07262044
The summary() output lists Multiple R-squared as 0.07262. Additionally, we can verify this by computing SSR/SSTO from the ANOVA quantities.
3.588/49.406
## [1] 0.07262276
r = +0.269481 # r takes the sign of b1, which is positive here
sqrt(SSR/SSTO)
## [1] 0.2694818
sqrt(0.07262)
## [1] 0.269481
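As a quick sketch, cor() returns the same value directly:
# Sketch: direct computation of r; 0.2694818, positive since b1 > 0
cor(X, Y)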
\(R^2\) has the clearer operational interpretation: it gives the proportion of the variation in Y explained by the linear regression on X.
The regression function is as follows: \(\hat{y} = 168.600 + 2.034x\)
Yes, the estimated regression function appears to fit the data well: for every unique X = Hours, the estimated regression line passes roughly through the average of the Y observations (see the group-mean check after the plot below).
# Read Data
ch22<-read.table("CH01PR22.txt")
n <- nrow(ch22) # number of observations (rows)
X <- ch22[,2] # 2nd column: elapsed time in hours
Y <- ch22[,1] # 1st column: hardness in Brinell units
Xbar <- mean(X)
Ybar <- mean(Y)
b1 <- sum((X-Xbar)*(Y-Ybar))/((n-1)*var(X))
b1 #Formula
## [1] 2.034375
b0 <- Ybar-b1*Xbar
b0 #Formula
## [1] 168.6
c(b0,b1) # print b0 and b1 out
## [1] 168.600000 2.034375
Yhat <- b0+b1*X # Predicted values
Res <- Y-Yhat # Residuals
Res[1:5] # First five residuals
## [1] -2.150 3.850 -5.150 -1.150 0.575
MSE <- sum(Res^2)/(n-2) # Estimated Variance MSE=SSE/(n-2)
sd <- sqrt(MSE) # Estimated Standard Deviation S
ch22<-read.table("CH01PR22.txt")
hours <- ch22$V2
hardness <-ch22$V1
ch22_lm <- lm(hardness ~ hours)
ch22_lm
##
## Call:
## lm(formula = hardness ~ hours)
##
## Coefficients:
## (Intercept) hours
## 168.600 2.034
plot(hardness ~ hours, main = "Time vs Hardness", xlab = "Time in hours",
ylab = "Hardness in Brinell", col = "purple", pch=16)
abline(ch22_lm, col = "red")
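One informal way to check this claim: with four replicate observations at each X, the group means of Y should lie close to the fitted values. A quick sketch:
# Sketch: group means of hardness at each hour level vs the fitted line
tapply(hardness, hours, mean) # observed means at X = 16, 24, 32, 40
predict(ch22_lm, data.frame(hours = c(16, 24, 32, 40))) # fitted values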
The point estimate of the mean hardness when X = 40 is 249.96 using the rounded coefficients (with full precision, \(168.6 + 2.034375 \times 40 = 249.975\)).
Y_hat <- 168.600 + (2.034*40)
Y_hat
## [1] 249.96
\(b_1 = 2.034\) is the slope of the estimated regression line; it gives the change in the mean response when X increases by one unit. Thus, for every additional hour of elapsed time, the estimated mean hardness increases by 2.034 Brinell units.
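A one-line sketch confirming this interpretation: the difference between fitted values one hour apart is exactly \(b_1\).
# Sketch: fitted mean response at 31 hours minus that at 30 hours = b1
diff(predict(ch22_lm, data.frame(hours = c(30, 31)))) # 2.034375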
The sum of the residuals is nearly 0.

###### Method 1 & 2 - Formulas by Hand and Using R
# The two sums differ slightly (-3.4e-13 vs -2.0e-15), but both are
# effectively zero; the gap is floating-point error from the hand-computed
# b0 and b1 versus lm()'s more numerically stable fit.
sum(Res)
## [1] -3.410605e-13
sum(ch22_lm$residuals)
## [1] -1.998401e-15
The estimates are \(\sqrt{MSE} = 3.234027\) for \(\sigma\) and \(MSE = 10.45893\) for \(\sigma^2\). \(\sigma\) is expressed in Brinell units.
MSE <- sum(Res^2)/(n-2) # Estimated Variance MSE=SSE/(n-2)
MSE
## [1] 10.45893
sqrt(MSE)
## [1] 3.234027
sigma<-sigma(ch22_lm)
sigma
## [1] 3.234027
sigma^2
## [1] 10.45893
The 99% confidence interval is (1.765287, 2.303463). We are 99% confident that the change in mean hardness when the elapsed time increases by 1 hour lies between 1.765287 and 2.303463 Brinell units.
a <- 0.01 # alpha for a 99% confidence interval
# Confidence Interval for beta1
sb1 <- sqrt(MSE/sum((X-Xbar)^2)) # Estimated SD of b1
sb1
## [1] 0.09039379
m <- qt(1-(a/2), n-2) #multiplier
m
## [1] 2.976843
CI_beta1 <- c(b1-m*sb1, b1+m*sb1) #CI for beta1
CI_beta1
## [1] 1.765287 2.303463
confint(ch22_lm, level=0.99)
## 0.5 % 99.5 %
## (Intercept) 160.690457 176.509543
## hours 1.765287 2.303463
Hypothesis Test:
\(H_0: \beta_1 = 2\)
\(H_a: \beta_1 \neq 2\)
Decision Rule:
If \(\mid T_{H_0}\mid = \mid\frac{b_1 - 2}{s(b_1)}\mid > t(1-\frac{\alpha}{2}, n-2)\), then reject \(H_0\).
Conclusion:
With \(\mid t^*\mid = 0.380531 < 2.9768\), we fail to reject \(H_0\). Additionally, our p-value of \(0.7092627\) exceeds \(\alpha = 0.01\), supporting the same conclusion.
summary(ch22_lm)
##
## Call:
## lm(formula = hardness ~ hours)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1500 -2.2188 0.1625 2.6875 5.5750
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 168.60000 2.65702 63.45 < 2e-16 ***
## hours 2.03438 0.09039 22.51 2.16e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.234 on 14 degrees of freedom
## Multiple R-squared: 0.9731, Adjusted R-squared: 0.9712
## F-statistic: 506.5 on 1 and 14 DF, p-value: 2.159e-12
t <- (b1 - 2)/sb1 # test statistic from full-precision values
t
## [1] 0.3802805
b1_r <- 2.0344 # b1, rounded from the summary table
se <- 0.0904 # s(b1)
n <- 16
a <- 0.01
t_star <- (b1_r - 2)/se
t_star
## [1] 0.380531
m <- qt(1-(a/2), n-2)
# t* = (2.0344 - 2)/0.0904 = 0.381. If |t*| <= 2.977, conclude H0; otherwise Ha.
p_val <- 2*(1-pt(abs(t_star),n-2))
p_val
## [1] 0.7092627
# Conclude H0. P-value = 0.71
We can extract the quantities we need directly from the summary() table and apply the same formulas as above, bypassing the individual hand calculations.
summary(ch22_lm)
##
## Call:
## lm(formula = hardness ~ hours)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1500 -2.2188 0.1625 2.6875 5.5750
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 168.60000 2.65702 63.45 < 2e-16 ***
## hours 2.03438 0.09039 22.51 2.16e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.234 on 14 degrees of freedom
## Multiple R-squared: 0.9731, Adjusted R-squared: 0.9712
## F-statistic: 506.5 on 1 and 14 DF, p-value: 2.159e-12
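Rather than retyping the rounded values, the same quantities can also be pulled out of the summary table programmatically; a sketch:
# Sketch: extract b1 and s(b1) from the coefficient matrix
est <- coef(summary(ch22_lm))
est["hours", c("Estimate", "Std. Error")] # 2.03438 and 0.09039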
t <- 2.0344 # b1, rounded (not used below)
se <- 0.0904 # s(b1) (not used below)
n <- 16
a <- 0.01
SSE <- sum(Res^2) # Sum of squared error/residuals
SSR = sum((Yhat-Ybar)^2) # Sum of squared regression
SSTO = (n-1)*var(Y) # Sum of squared Total
c(SSE, SSR, SSTO, SSTO-SSE) # SSTO=SSE+SSR
## [1] 146.425 5297.512 5443.938 5297.512
MSE = SSE/(n-2)
MSR = SSR /1
ss <- c(SSR, SSE, SSTO)
df <-c(1,14,15)
ms <- c(MSR, MSE, NA)
types <-factor(c('Regression','Error','Total'))
ch22_anova <- data.frame(Source=types,SS=ss, df=df, MS=ms)
ch22_anova
## Source SS df MS
## 1 Regression 5297.512 1 5297.51250
## 2 Error 146.425 14 10.45893
## 3 Total 5443.938 15 NA
anova(ch22_lm)
## Analysis of Variance Table
##
## Response: hardness
## Df Sum Sq Mean Sq F value Pr(>F)
## hours 1 5297.5 5297.5 506.51 2.159e-12 ***
## Residuals 14 146.4 10.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hypothesis Test:
\(H_0: \beta_1 = 0\)
\(H_a: \beta_1 \neq 0\)
Decision Rule:
If \(F^* > F(1-\alpha; 1, n-2)\), then reject \(H_0\).
Conclusion:
From our ANOVA table and calculations, \(F^* = 506.51\), and below we compute the critical value \(F(0.99; 1, 14) = 8.861593\). Thus \(F^* > F\) and we reject the null hypothesis \(H_0: \beta_1 = 0\).
# Test Hypothesis H0: beta1=0 vs Ha: beta1 != 0
Fstar = MSR/MSE # ANOVA F test statistic
Fstar
## [1] 506.5062
sb1 <- sqrt(MSE/sum((X-Xbar)^2))
t1 <- b1/sb1
t1^2 # t1^2=Fstar
## [1] 506.5062
cri_f = qf(1-a,1,n-2) # Critical point to reject H0: beta1=0
cri_f
## [1] 8.861593
a <- 0.01
n <- nrow(ch22)
F_test <- qf((1-a),1,n-2)
F_test
## [1] 8.861593
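As with the GPA data, the p-value of the F test can be computed directly; a sketch using Fstar from above:
# Sketch: p-value of the F test; matches Pr(>F) = 2.159e-12 in the ANOVA table
1 - pf(Fstar, 1, n - 2)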
The plot is shown below.
From the graph, SSR appears to be the larger component of the total variation: the regression deviations \(\hat{Y}_i - \bar{Y}\) (red) have a much larger spread than the residuals \(Y_i - \hat{Y}_i\) (blue).
Based on this, the magnitude of \(R^2\) should be close to 1. The coefficient of determination is defined as SSR/SSTO, so the larger SSR is relative to SSTO, the closer the ratio is to 1.
Yhat <- b0+b1*X # Predicted values
se <- Y-Yhat # deviations Yi - Yhat_i (residual components of SSE)
sr <- Yhat-Ybar # deviations Yhat_i - Ybar (regression components of SSR)
se
## [1] -2.150 3.850 -5.150 -1.150 0.575 2.575 -2.425 5.575 3.300 0.300
## [11] 1.300 -3.700 0.025 -1.975 3.025 -3.975
sr
## [1] -24.4125 -24.4125 -24.4125 -24.4125 -8.1375 -8.1375 -8.1375
## [8] -8.1375 8.1375 8.1375 8.1375 8.1375 24.4125 24.4125
## [15] 24.4125 24.4125
X
## [1] 16 16 16 16 24 24 24 24 32 32 32 32 40 40 40 40
plot(X, sr, type='p', col = 'red', pch=16, xlab="Xi", ylab="SSE,SSR")
points(X, se, type = 'p', col='blue', pch=16)
legend(16, 20, legend=c("Yhat-Ybar", "Yi-Yhat"), col=c("red","blue"), pch=16, cex=1)
From our formulas by hand and the summary of the linear model, \(R^2 = 0.9731\) and \(r = 0.9865\).
Rsq = SSR/SSTO # R square
Rsq
## [1] 0.9731031
# Coefficient of Correlation
r=sqrt(Rsq)
r
## [1] 0.9864599
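As a sketch, cor() agrees with this value; r takes the sign of b1, which is positive here.
# Sketch: direct computation of r; 0.9864599, positive since b1 > 0
cor(X, Y)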
summary(ch22_lm)
##
## Call:
## lm(formula = hardness ~ hours)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1500 -2.2188 0.1625 2.6875 5.5750
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 168.60000 2.65702 63.45 < 2e-16 ***
## hours 2.03438 0.09039 22.51 2.16e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.234 on 14 degrees of freedom
## Multiple R-squared: 0.9731, Adjusted R-squared: 0.9712
## F-statistic: 506.5 on 1 and 14 DF, p-value: 2.159e-12