Problem 1 (Question 7 a and c, and 8 a and b from HW 1)

ch19<-read.table("CH01PR19.txt")
head(ch19)
##      V1 V2
## 1 3.897 21
## 2 3.885 14
## 3 3.778 28
## 4 2.540 22
## 5 3.028 21
## 6 3.865 31
n <- nrow(ch19)
ACT <- ch19[,2]
GPA <- ch19[,1]

ch19_lm <- lm(GPA ~ ACT)

7 A

\(b_0 = 2.114049\) and \(b_1 = 0.03882713\)

The regression function is as follows: \(\hat{y} = 2.114049 + 0.03882713x\)

ch19_lm$coefficients
## (Intercept)         ACT 
##  2.11404929  0.03882713

7 C

The point estimate of the mean freshman GPA for students with an ACT score of 30 is 3.2789.

new <- data.frame(ACT = 30)  # use `=` here; `ACT <- 30` would overwrite the ACT vector
new
##   ACT
## 1  30
predict(ch19_lm, new)
##        1 
## 3.278863

8 A

The sum of the residuals is -2.942091e-15, which is zero up to floating-point error; for least squares with an intercept, the residuals always sum to exactly 0.

sum(ch19_lm$residuals)
## [1] -2.942091e-15

8 B

The estimate of \(\sigma^{2}\) is \(MSE = 0.3882848\), and the estimate of \(\sigma\) is \(\sqrt{MSE} = 0.623125\). \(\sigma\) is expressed in GPA points.

MSE <- sum(ch19_lm$residuals^2)/(n-2)

sigma <- sigma(ch19_lm)
sigma^2
## [1] 0.3882848

Problem 2 (Question 4 from HW 2)

4 A

The 99% confidence interval for \(\beta_1\) is \((0.0053856, 0.07226864)\). This means we are 99% confident that \(\beta_1\) lies within the constructed interval; i.e., if this experiment were repeated many times, approximately 99% of the intervals constructed this way would contain \(\beta_1\).

This interval does not contain 0. The admissions director might be interested in whether the confidence interval includes 0, because an interval containing 0 would suggest that the linear relationship between ACT score and GPA is not significant. If ACT scores had no impact on a student's GPA, the school would not have to list a minimum requirement and could increase the number of applicants.

confint(ch19_lm, level = 0.99)
##                   0.5 %     99.5 %
## (Intercept) 1.273902675 2.95419590
## ACT         0.005385614 0.07226864
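
As a sanity check on this repeated-sampling interpretation, here is a minimal simulation sketch (not part of the homework): regenerate Y from a model with a slope we choose ourselves, and count how often the 99% interval captures it. The "true" parameter values below are simply the fitted estimates reused as a plausible truth; ACT is the data vector loaded above.

set.seed(1)
beta0 <- 2.114; beta1 <- 0.0388; sig <- 0.6231   # assumed true values (taken from the fit)
covered <- replicate(2000, {
  y_sim <- beta0 + beta1 * ACT + rnorm(length(ACT), sd = sig)
  ci <- confint(lm(y_sim ~ ACT), "ACT", level = 0.99)
  ci[1] <= beta1 && beta1 <= ci[2]
})
mean(covered)   # proportion of intervals covering beta1; should be near 0.99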

4 B

Hypothesis Test:

\(H_0: \beta_1 = 0\)

\(H_a: \beta_1 \neq 0\)

Decision Rule:

If \(\mid T_{H_0}\mid = \mid \frac{b_1 - 0}{s(b_1)} \mid > t(1-\frac{\alpha}{2}, n-2)\), then reject \(H_0\).

Conclusion:

From the summary() output, the t-statistic for ACT is 3.040; the critical value computed below is \(t(0.995, 118) = 2.618\).

At an \(\alpha = 0.01\), we reject \(H_0: \beta_1 = 0\), since \(\mid T_{H_0}\mid = 3.040 > 2.618\). There is sufficient evidence against \(\beta_1 = 0\), which implies a significant linear relationship between ACT score and GPA.

summary(ch19_lm)
## 
## Call:
## lm(formula = GPA ~ ACT)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.74004 -0.33827  0.04062  0.44064  1.22737 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.11405    0.32089   6.588  1.3e-09 ***
## ACT          0.03883    0.01277   3.040  0.00292 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6231 on 118 degrees of freedom
## Multiple R-squared:  0.07262,    Adjusted R-squared:  0.06476 
## F-statistic:  9.24 on 1 and 118 DF,  p-value: 0.002917
a <-0.01

m <- qt(1-(a/2), n-2) #Multiplier
m
## [1] 2.618137
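
As a cross-check, the t-statistic can also be recomputed by hand from the coefficient table; a short sketch using ch19_lm from above:

ct <- coef(summary(ch19_lm))                     # coefficient table from summary()
ct["ACT", "Estimate"] / ct["ACT", "Std. Error"]  # about 3.040, which exceeds m = 2.618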

4 C

The p-value from summary() is 0.00292. Since \(0.00292 < \alpha = 0.01\), we again reject \(H_0\). Verifying that \(p < \alpha\) is an equivalent way to carry out the test, and it supports our conclusion to reject \(H_0\).
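
As a quick sketch, the same p-value can be read off the coefficient table or recomputed from the t-statistic (using ch19_lm and n from above):

coef(summary(ch19_lm))["ACT", "Pr(>|t|)"]   # p-value from the coefficient table
2 * (1 - pt(3.040, df = n - 2))             # recomputed; both are about 0.00292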

Problem 3

2.13

Part A.
Method 1 - Formulas by Hand

A 95% interval estimate of the mean freshman GPA for students whose ACT score is 28 is (3.0614, 3.3410). This means we are 95% confident that the mean GPA of students who score a 28 on the ACT lies within this interval.

n <- nrow(ch19)
X <- ch19[,2]
Y <- ch19[,1]
Xbar <- mean(X)
Ybar <- mean(Y)
b1 <- sum((X-Xbar)*(Y-Ybar))/((n-1)*var(X)) #Formula
b0 <- Ybar-b1*Xbar #Formula

b0
## [1] 2.114049
b1
## [1] 0.03882713
X_h <- 28

Yhath <- b0 + b1 * X_h # predicted value for X_h=28
Yhath
## [1] 3.201209
Yhat <- b0+b1*X
Res <- Y-Yhat
MSE <- sum(Res^2)/(n-2) #MSE

sxx<- sum((X-Xbar)^2) #Sxx


s2 <- MSE*((1/n) + (((X_h-Xbar)^2))/sxx) #S^2(y_hat)
s <- sqrt(s2) #s(y_hat)


a<-0.05

m <- qt(1-(a/2), n-2) #multiplier

Yhath + (m*s)
## [1] 3.341033
Yhath - (m*s)
## [1] 3.061384
Method 2 - Using R

Using R, we obtain the same confidence interval of (3.0614, 3.3410).

summary(ch19_lm)
## 
## Call:
## lm(formula = GPA ~ ACT)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.74004 -0.33827  0.04062  0.44064  1.22737 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.11405    0.32089   6.588  1.3e-09 ***
## ACT          0.03883    0.01277   3.040  0.00292 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6231 on 118 degrees of freedom
## Multiple R-squared:  0.07262,    Adjusted R-squared:  0.06476 
## F-statistic:  9.24 on 1 and 118 DF,  p-value: 0.002917
new <- data.frame(ACT = 28)

# se.fit = TRUE also returns s(Yhat_h), the df, and the residual scale
predict(ch19_lm, new, interval = "confidence", level = 1 - a, se.fit = TRUE)
## $fit
##        fit      lwr      upr
## 1 3.201209 3.061384 3.341033
## 
## $se.fit
## [1] 0.07060873
## 
## $df
## [1] 118
## 
## $residual.scale
## [1] 0.623125
Part B.
Method 1 - Formulas by Hand

We are 95% confident that, with Mary's ACT score of 28, her freshman GPA will lie within the interval (1.959355, 4.443063).

s2_new <- MSE*(1+(1/n) + (((X_h-Xbar)^2))/sxx) #Prediction interval S^2(Y_hat_new)
s_new <- sqrt(s2_new)
s_new
## [1] 0.6271128
Yhath + (m*s_new)
## [1] 4.443063
Yhath - (m*s_new)
## [1] 1.959355
Method 2 - Using R

Using R, we obtain the same prediction interval, (1.959355, 4.443063).

predict.lm(ch19_lm, new, interval = "prediction", 1-a)
## $fit
##        fit      lwr      upr
## 1 3.201209 1.959355 4.443063
## 
## $se.fit
## [1] 0.07060873
## 
## $df
## [1] 118
## 
## $residual.scale
## [1] 0.623125
Part C.

Yes, the prediction interval in part B is wider than the confidence interval found in part A. The confidence interval accounts only for the uncertainty in estimating the mean response at \(X_h\), while the prediction interval must also account for the scatter of an individual observation around that mean; this extra variability enters as the additional MSE term in the prediction-interval variance, making the interval wider.
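
A quick numeric comparison of the two widths, as a sketch (assuming ch19_lm, new, and a from above):

conf_int <- predict(ch19_lm, new, interval = "confidence", level = 1 - a)
pred_int <- predict(ch19_lm, new, interval = "prediction", level = 1 - a)
conf_int[, "upr"] - conf_int[, "lwr"]   # CI width, about 0.2796
pred_int[, "upr"] - pred_int[, "lwr"]   # PI width, about 2.4837, roughly 9 times wider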

Part D.

The 95% confidence band at \(X_h = 28\) is \(3.026159 \leq \beta_0 + \beta_1 X_h \leq 3.376258\).

This confidence band is wider than our confidence interval. This is expected, since the band must hold simultaneously for the entire regression line rather than for the mean response at a single \(X_h\).

W <- sqrt(2*qf(0.95,2,length(GPA)-2))

gpa_CI <- predict(ch19_lm, new, interval = "confidence", level = 1 - a, se.fit = TRUE)

cbu <- gpa_CI$fit[,1]+W*gpa_CI$se.fit
cbl <- gpa_CI$fit[,1]-W*gpa_CI$se.fit

cbu
## [1] 3.376258
cbl
## [1] 3.026159
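
Because the Working-Hotelling band holds simultaneously over the whole regression line, the same multiplier W can be reused at any \(X_h\). A sketch evaluating the band on a small grid of ACT scores (assumes ch19_lm and W from above; the grid values are arbitrary):

grid <- data.frame(ACT = c(20, 24, 28, 32))
fit_g <- predict(ch19_lm, grid, se.fit = TRUE)   # fitted values and s(Yhat_h)
cbind(ACT = grid$ACT,
      lower = fit_g$fit - W * fit_g$se.fit,
      upper = fit_g$fit + W * fit_g$se.fit)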

2.23

Part A.
Method 1 - Formulas by Hand
SSE <- sum(Res^2) # Sum of squared error/residuals
SSR = sum((Yhat-Ybar)^2)        # Sum of squared regression
SSTO = (n-1)*var(Y)             # Sum of squared Total
c(SSE, SSR, SSTO, SSTO-SSE)     # SSTO=SSE+SSR
## [1] 45.817608  3.587846 49.405454  3.587846
MSE = SSE/(n-2)
MSR = SSR /1

ss <- c(SSR, SSE, SSTO)
df <-c(1,118,119)
ms <- c(MSR, MSE, NA)
types <-factor(c('Regression','Error','Total'))

ch19_anova <- data.frame(Source=types,SS=ss, df=df, MS=ms)
ch19_anova
##       Source        SS  df        MS
## 1 Regression  3.587846   1 3.5878459
## 2      Error 45.817608 118 0.3882848
## 3      Total 49.405454 119        NA
Method 2 - Using R
anova(ch19_lm)
## Analysis of Variance Table
## 
## Response: GPA
##            Df Sum Sq Mean Sq F value   Pr(>F)   
## ACT         1  3.588  3.5878  9.2402 0.002917 **
## Residuals 118 45.818  0.3883                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Part B.
Method 1 & 2 - Formulas by Hand (answers taken from the ANOVA table above)

MSR estimates \(\sigma^2 + \beta_1^2\sum(X_i-\bar{X})^2\) and MSE estimates \(\sigma^2\). MSR and MSE estimate the same quantity, \(\sigma^2\), when \(\beta_1 = 0\).

MSR is 3.588 and MSE is 0.388.
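
A minimal simulation sketch of this fact (not part of the homework): when Y is generated with \(\beta_1 = 0\), the long-run averages of MSR and MSE should both be near the true \(\sigma^2\). The intercept 3 and \(\sigma = 0.6\) below are arbitrary choices; X is the ACT vector from above.

set.seed(2)
sims <- replicate(2000, {
  y0 <- rnorm(length(X), mean = 3, sd = 0.6)   # beta1 = 0 by construction
  anova(lm(y0 ~ X))[["Mean Sq"]]               # returns c(MSR, MSE)
})
rowMeans(sims)   # both entries should be close to 0.6^2 = 0.36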

Part C.

Hypothesis Test:

\(H_0: \beta_1 = 0\)

\(H_a: \beta_1 \neq 0\)

Decision Rule:

If \(F^* > F(1-\alpha; 1, n-2)\), then reject \(H_0\).

Conclusion:

From our ANOVA table, \(F^* = 9.2402\). At \(\alpha = 0.01\), the critical value is \(F(0.99; 1, 118) = 6.854641\), computed in Method 2 below (the Method 1 code, still running with the earlier \(a = 0.05\), prints the \(\alpha = 0.05\) critical value 3.921478). Since \(F^*\) exceeds the critical value at either level, we reject the null hypothesis \(H_0: \beta_1 = 0\).

Method 1 - Formulas by Hand
# Test Hypothesis H0: beta1=0 vs Ha: beta1 != 0
Fstar = MSR/MSE                 # ANOVA F test statistic
Fstar  
## [1] 9.240243
sb1 <- sqrt(MSE/sum((X-Xbar)^2))
t1 <- b1/sb1 
t1^2                             # t1^2=Fstar
## [1] 9.240243
cri_f = qf(1-a,1,n-2)       # Critical point to reject H0: beta1=0 (a is still 0.05 here)
cri_f                       # F* = 9.2402 > 3.9215, so reject H0
## [1] 3.921478
Method 2 - Using R
a <- 0.01
n <- nrow(ch19)
F_test <- qf((1-a),1,n-2)
F_test
## [1] 6.854641
Part D.

The absolute reduction in the variation of Y when X is introduced into the regression model is SSR = 3.588.

The relative reduction is \(\frac{3.588}{49.406} = 0.07262\), i.e., about 7.26\%.

The relative reduction is also called the coefficient of determination, or \(R^2\).

Method 1 - Formulas by Hand
# Coefficient of Determination
Rsq = SSR/SSTO    # R square
Rsq
## [1] 0.07262044
Method 2 - Using R

The summary() output lists Multiple R-squared as 0.07262. Alternatively, we can compute SSR/SSTO directly from the ANOVA table values.

3.588/49.406
## [1] 0.07262276
Part E.

r = +0.269481; the sign is positive because \(b_1 > 0\).

Method 1 - Formulas by Hand
sqrt(SSR/SSTO)
## [1] 0.2694818
Method 2 - Using R
sqrt(0.07262)
## [1] 0.269481
Part F.

\(R^2\) has the more clear-cut operational interpretation: it is the proportion of the variation in Y explained by the linear relationship with X.
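
A quick check of this interpretation in R: in simple linear regression, \(R^2\) equals the squared sample correlation between X and Y (using the ACT and GPA vectors from above):

cor(ACT, GPA)     # r, about 0.2695; its sign agrees with b1
cor(ACT, GPA)^2   # should match Multiple R-squared, 0.07262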

Problem 4

1.22

Part A.

The regression function is as follows: \(\hat{y} = 168.600 + 2.034x\)

Yes, the estimated regression function appears to fit well: for each unique X (hours), the estimated regression line passes roughly through the average of the Y observations at that X. A numeric check follows the plot below.

Method 1 - Formulas by Hand
# Read Data
ch22<-read.table("CH01PR22.txt")
n <- nrow(ch22)   # number of objects (rows)
X <- ch22[,2]     # 2nd variable (column)
Y <- ch22[,1]     # 1st variable (column)

Xbar <- mean(X)   
Ybar <- mean(Y)
b1 <- sum((X-Xbar)*(Y-Ybar))/((n-1)*var(X)) 
b1                  #Formula
## [1] 2.034375
b0 <- Ybar-b1*Xbar      
b0                  #Formula
## [1] 168.6
c(b0,b1)          # print b0 and b1 out
## [1] 168.600000   2.034375
Yhat <- b0+b1*X         # Predicted values
Res <- Y-Yhat           # Residuals
Res[1:5]                # First five residuals
## [1] -2.150  3.850 -5.150 -1.150  0.575
MSE <- sum(Res^2)/(n-2) # Estimated Variance MSE=SSE/(n-2) 
sd <- sqrt(MSE)         # Estimated Standard Deviation S
Method 2 - Using R
ch22<-read.table("CH01PR22.txt")

hours <- ch22$V2
hardness <-ch22$V1

ch22_lm <- lm(hardness ~ hours)
ch22_lm
## 
## Call:
## lm(formula = hardness ~ hours)
## 
## Coefficients:
## (Intercept)        hours  
##     168.600        2.034
plot(hardness ~ hours, main = "Time vs Hardness", xlab = "Time in hours", 
    ylab = "Hardness in Brinell", col = "purple", pch=16)
abline(ch22_lm, col = "red")
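
To check the good-fit claim numerically, a short sketch comparing the average hardness at each unique time with the fitted value there (using hardness, hours, and ch22_lm from above); the two sets of values should be close:

tapply(hardness, hours, mean)                            # average hardness at each X
predict(ch22_lm, data.frame(hours = c(16, 24, 32, 40)))  # fitted values at each X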

Part B.

The point estimate of the mean hardness when X = 40 is 249.96 using the rounded coefficients (the full-precision estimate is \(168.6 + 2.034375 \times 40 = 249.975\)).

Method 1 & 2 - Formulas by Hand and Using R
Y_hat <- 168.600 + (2.034*40)   # coefficients rounded to three decimals
Y_hat 
## [1] 249.96
Part C.

\(b_1 = 2.034\) is the slope of the estimated regression line. The slope captures the change in the estimated mean response when X increases by one unit. Thus, for each additional hour of elapsed time, the estimated mean hardness increases by 2.034 Brinell units.
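
As a one-line sanity check of this interpretation (a sketch using ch22_lm from above), the difference between fitted means at times one hour apart equals \(b_1\):

diff(predict(ch22_lm, data.frame(hours = c(40, 41))))   # equals b1 = 2.034375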

1.26

Part A.

The sum of the residuals is nearly 0.

Method 1 & 2 - Formulas by Hand and Using R

# The two sums differ slightly: Res was computed from the hand-calculated b0 and b1,
# while ch22_lm$residuals come from lm()'s internal fit. Both sums are zero up to
# floating-point rounding error.

sum(Res) 
## [1] -3.410605e-13
sum(ch22_lm$residuals)
## [1] -1.998401e-15
Part B.

The estimate of \(\sigma\) is \(\sqrt{MSE} = 3.234027\), and the estimate of \(\sigma^2\) is \(MSE = 10.45893\). \(\sigma\) is expressed in Brinell units.

Method 1 - Using Formulas by Hand
MSE <- sum(Res^2)/(n-2) # Estimated Variance MSE=SSE/(n-2) 
MSE
## [1] 10.45893
sqrt(MSE)
## [1] 3.234027
Method 2 - Using R
sigma<-sigma(ch22_lm)
sigma
## [1] 3.234027
sigma^2
## [1] 10.45893

2.7

Part A.

The confidence interval is (1.765287, 2.303463). We are 99% confident that the change in mean hardness when the elapsed time increases by 1 hour is between 1.765287 and 2.303463 Brinell units.

Method 1 - Using Formulas by Hand
a <- 0.01   # 99% confidence level

# Confidence Interval for beta1
sb1 <- sqrt(MSE/sum((X-Xbar)^2))  # Estimated SD of b1
sb1
## [1] 0.09039379
m <- qt(1-(a/2), n-2) #multiplier
m
## [1] 2.976843
CI_beta1 <- c(b1-m*sb1, b1+m*sb1)  #CI for beta1
CI_beta1  
## [1] 1.765287 2.303463
Method 2 - Using R
confint(ch22_lm, level=0.99)
##                  0.5 %     99.5 %
## (Intercept) 160.690457 176.509543
## hours         1.765287   2.303463
Part B.

Hypothesis Test:

\(H_0: \beta_1 = 2\)

\(H_a: \beta_1 \neq 2\)

Decision Rule:

If \(\mid T_{H_0}\mid = \mid \frac{b_1 - 2}{s(b_1)} \mid > t(1-\frac{\alpha}{2}, n-2)\), then reject \(H_0\).

Conclusion:

With \(t^* = T_{H_0} = 0.380531 < 2.9768\), we fail to reject \(H_0\). Additionally, our p-value of \(0.7093\) exceeds \(\alpha = 0.01\), which supports the same conclusion.

Method 1 - Using Formulas by Hand
summary(ch22_lm)
## 
## Call:
## lm(formula = hardness ~ hours)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1500 -2.2188  0.1625  2.6875  5.5750 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 168.60000    2.65702   63.45  < 2e-16 ***
## hours         2.03438    0.09039   22.51 2.16e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.234 on 14 degrees of freedom
## Multiple R-squared:  0.9731, Adjusted R-squared:  0.9712 
## F-statistic: 506.5 on 1 and 14 DF,  p-value: 2.159e-12
t_full <- (b1 - 2)/sb1   # test statistic using full-precision b1 and s(b1)
t_full
## [1] 0.3802805
b1_round <- 2.0344       # b1 rounded as printed by summary()
se <- 0.0904             # s(b1) rounded as printed by summary()
n <- 16
a <- 0.01

t_star <- (b1_round - 2)/se
t_star
## [1] 0.380531
m <- qt(1-(a/2), n-2)

# t* = (2.0344 - 2)/0.0904 = 0.381. If |t*| <= 2.977, conclude H0; otherwise conclude Ha.

p_val <- 2*(1-pt(abs(t_star),n-2))  
p_val
## [1] 0.7092627
# Conclude H0. P-value = 0.71
Method 2 - Using R

We extract the quantities we need from the summary() table and apply the same formulas as above; this conveniently bypasses calculating the intermediate values by hand.

summary(ch22_lm)
## 
## Call:
## lm(formula = hardness ~ hours)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1500 -2.2188  0.1625  2.6875  5.5750 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 168.60000    2.65702   63.45  < 2e-16 ***
## hours         2.03438    0.09039   22.51 2.16e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.234 on 14 degrees of freedom
## Multiple R-squared:  0.9731, Adjusted R-squared:  0.9712 
## F-statistic: 506.5 on 1 and 14 DF,  p-value: 2.159e-12
b1_round <- 2.0344   # Estimate for hours from the summary() table
se <- 0.0904         # Std. Error for hours, s(b1)
n <- 16
a <- 0.01
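
Rather than retyping the printed values, the same quantities can be pulled from the coefficient table programmatically; a sketch reusing ch22_lm (the variable names here are my own):

ct <- coef(summary(ch22_lm))
b1_hat  <- ct["hours", "Estimate"]     # 2.034375
sb1_hat <- ct["hours", "Std. Error"]   # 0.09039379
(b1_hat - 2) / sb1_hat                 # t* = 0.3802805, matching Method 1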

2.26

Part A.
Method 1 - Using Formulas by Hand
SSE <- sum(Res^2) # Sum of squared error/residuals
SSR = sum((Yhat-Ybar)^2)        # Sum of squared regression
SSTO = (n-1)*var(Y)             # Sum of squared Total
c(SSE, SSR, SSTO, SSTO-SSE)     # SSTO=SSE+SSR
## [1]  146.425 5297.512 5443.938 5297.512
MSE = SSE/(n-2)
MSR = SSR /1

ss <- c(SSR, SSE, SSTO)
df <-c(1,14,15)
ms <- c(MSR, MSE, NA)
types <-factor(c('Regression','Error','Total'))

ch22_anova <- data.frame(Source=types,SS=ss, df=df, MS=ms)
ch22_anova
##       Source       SS df         MS
## 1 Regression 5297.512  1 5297.51250
## 2      Error  146.425 14   10.45893
## 3      Total 5443.938 15         NA
Method 2 - Using R
anova(ch22_lm)
## Analysis of Variance Table
## 
## Response: hardness
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## hours      1 5297.5  5297.5  506.51 2.159e-12 ***
## Residuals 14  146.4    10.5                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Part B.

Hypothesis Test:

\(H_0: \beta_1 = 0\)

\(H_a: \beta_1 \neq 0\)

Decision Rule:

If \(F^* > F(1-\alpha; 1, n-2)\), then reject \(H_0\).

Conclusion:

From our ANOVA table and calculations, \(F^* = 506.51\), and below we calculated the critical value \(F(0.99; 1, 14) = 8.861593\). Thus \(F^* > F(0.99; 1, 14)\), and we reject the null hypothesis \(H_0: \beta_1 = 0\).

Method 1 - Formulas by Hand
# Test Hypothesis H0: beta1=0 vs Ha: beta1 != 0
Fstar = MSR/MSE                 # ANOVA F test statistic
Fstar  
## [1] 506.5062
sb1 <- sqrt(MSE/sum((X-Xbar)^2))
t1 <- b1/sb1 
t1^2                             # t1^2=Fstar
## [1] 506.5062
cri_f = qf(1-a,1,n-2)       # Critical point to reject H0: beta1=0
cri_f  
## [1] 8.861593
Method 2 - Using R
a <- 0.01
n <- nrow(ch22)
F_test <- qf((1-a),1,n-2)
F_test
## [1] 8.861593
Part C.

The plot is shown below.

From the graph below, SSR appears to be the larger component of the total variation: the regression deviations \(\hat{Y}_i - \bar{Y}\) (red) are much larger in magnitude than the residuals \(Y_i - \hat{Y}_i\) (blue).

Based on this, I would argue that \(R^2\) is close to 1. The coefficient of determination is defined as SSR/SSTO, so the larger SSR is relative to SSTO, the closer the ratio is to 1.

Yhat <- b0+b1*X         # Predicted values
se <- Y-Yhat            # residual deviations, Yi - Yhat
sr <- Yhat-Ybar         # regression deviations, Yhat - Ybar
se
##  [1] -2.150  3.850 -5.150 -1.150  0.575  2.575 -2.425  5.575  3.300  0.300
## [11]  1.300 -3.700  0.025 -1.975  3.025 -3.975
sr
##  [1] -24.4125 -24.4125 -24.4125 -24.4125  -8.1375  -8.1375  -8.1375
##  [8]  -8.1375   8.1375   8.1375   8.1375   8.1375  24.4125  24.4125
## [15]  24.4125  24.4125
X
##  [1] 16 16 16 16 24 24 24 24 32 32 32 32 40 40 40 40
plot(X, sr, type='p', col = 'red', pch=16, xlab="Xi", ylab="SSE,SSR")
points(X, se, type = 'p', col='blue', pch=16)
legend(16, 20, legend=c("Yhat-Ybar", "Yi-Yhat"), col=c("red","blue"), pch=16, cex=1)
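
To quantify the visual impression, a one-line check using the se and sr vectors from the plot above:

sum(sr^2) / (sum(sr^2) + sum(se^2))   # SSR/SSTO, about 0.973; compare Part D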

Part D.

From our formulas by hand and our summary of the linear model, \(R^2 = 0.9731\) and \(r = 0.9865\).

Method 1 - Formulas by Hand
Rsq = SSR/SSTO    # R square
Rsq
## [1] 0.9731031
# Coefficient of Correlation
r=sqrt(Rsq)       
r   
## [1] 0.9864599
Method 2 - Using R
summary(ch22_lm)
## 
## Call:
## lm(formula = hardness ~ hours)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1500 -2.2188  0.1625  2.6875  5.5750 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 168.60000    2.65702   63.45  < 2e-16 ***
## hours         2.03438    0.09039   22.51 2.16e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.234 on 14 degrees of freedom
## Multiple R-squared:  0.9731, Adjusted R-squared:  0.9712 
## F-statistic: 506.5 on 1 and 14 DF,  p-value: 2.159e-12