*** 2.13 – refer to the grade point average problem. ***
GPA.data <- read.delim("HW1_exercise2.txt", header = FALSE, sep = " ")
GPA <- GPA.data[,2]
ACT <- GPA.data[,6]
GPA.lm <- lm(GPA~ACT)
*** a. ***
Obtain a 95% interval estimate of the mean freshman GPA for students whose ACT test score is 28. Interpret your confidence interval.
freshman.gpa <- data.frame(ACT=28)
gpa.confidence.int <- predict(GPA.lm, freshman.gpa, interval = "confidence", level = 0.95, se.fit = TRUE)
gpa.confidence.int
## $fit
## fit lwr upr
## 1 3.201209 3.061384 3.341033
##
## $se.fit
## [1] 0.07060873
##
## $df
## [1] 118
##
## $residual.scale
## [1] 0.623125
Based on the fitted linear model, we are 95% confident that the mean freshman GPA of students with an ACT score of 28 is between 3.06 and 3.34. This is a statement about the mean response at ACT = 28, not about individual students; it does not say that 95% of such students will have GPAs in this range.
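As a sanity check (a minimal sketch using only the quantities predict() returned above), the interval can be reproduced by hand as \({\hat{Y}_h \pm t(1-\alpha/2;\,n-2)\,s\{\hat{Y}_h\}}\):
# Hand computation of the 95% CI for the mean response at ACT = 28
yhat.h <- gpa.confidence.int$fit[, "fit"]
yhat.h + c(-1, 1) * qt(0.975, gpa.confidence.int$df) * gpa.confidence.int$se.fit  # matches lwr and upr above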
*** b. ***
Mary Jones obtained a score of 28 on the entrance test. Predict her freshman GPA using a 95% prediction interval. Interpret your prediction interval.
gpa.prediction.int <- predict(GPA.lm, freshman.gpa, interval = "prediction", level = 0.95, se.fit = TRUE)
gpa.prediction.int
## $fit
## fit lwr upr
## 1 3.201209 1.959355 4.443063
##
## $se.fit
## [1] 0.07060873
##
## $df
## [1] 118
##
## $residual.scale
## [1] 0.623125
With 95% confidence, Mary Jones's freshman GPA is predicted to fall between 1.96 and 4.44.
*** c. ***
Is the prediction interval in part (b) wider than the confidence interval in part (a)? Should it be?
Yes, the prediction interval is much wider than the confidence interval, and it should be. The confidence interval is an inference on a parameter (the mean response at \({X_h}\)), so it only needs to cover the uncertainty in the estimated regression line. The prediction interval describes a single new observation, so it must also account for the random error of that observation around the line, which makes it wider.
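In terms of the quantities predict() returns, the only difference is the standard error: the prediction interval uses \({s^2\{pred\} = s^2\{\hat{Y}_h\} + MSE}\), where MSE is the squared residual scale. A minimal check, reusing gpa.confidence.int from part (a):
# Prediction-interval standard error adds the residual variance (MSE = residual.scale^2)
yhat.h <- gpa.confidence.int$fit[, "fit"]
se.pred <- sqrt(gpa.confidence.int$se.fit^2 + gpa.confidence.int$residual.scale^2)
yhat.h + c(-1, 1) * qt(0.975, gpa.confidence.int$df) * se.pred  # matches lwr and upr in part (b)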
*** d. ***
Determine the boundary values of the 95% confidence band for the regression line when \({X_h=28}\). Is your confidence band wider at this point than the confidence interval in part (a)? Should it be?
W <- sqrt(2*qf(0.95,2,length(GPA)-2))
conf.band.upper <- gpa.confidence.int$fit[,1]+W*gpa.confidence.int$se.fit
conf.band.lower <- gpa.confidence.int$fit[,1]-W*gpa.confidence.int$se.fit
conf.band.upper
## [1] 3.376258
conf.band.lower
## [1] 3.026159
The 95% confidence band at \({X_h=28}\) is \({3.026159 \le \beta_0 + \beta_1 X_h \le 3.376258}\). The confidence band is somewhat wider than the confidence interval at \({X_h=28}\), as it should be, because the band must cover the entire regression line simultaneously, not just the mean response at a single \({X_h}\).
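The boundary values above come from the Working–Hotelling band \({\hat{Y}_h \pm W\,s\{\hat{Y}_h\}}\) with \({W^2 = 2F(1-\alpha;\,2,\,n-2)}\), which is exactly what the code for W, conf.band.upper, and conf.band.lower computes.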
*** 2.23 – refer to GPA problem ***
*** a. ***
Set up the ANOVA table.
GPA.vartable <- anova(GPA.lm)
GPA.vartable
## Analysis of Variance Table
##
## Response: GPA
## Df Sum Sq Mean Sq F value Pr(>F)
## ACT 1 3.588 3.5878 9.2402 0.002917 **
## Residuals 118 45.818 0.3883
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
*** b. ***
i) What is estimated by MSR in your ANOVA table?
MSR is the regression sum of squares divided by its degrees of freedom (here, 1); it estimates \({\sigma^2 + \beta_1^2\sum(X_i-\bar{X})^2}\).
ii) By MSE?
MSE is the mean square due to error; it is an unbiased estimator of the error variance \({\sigma^2}\).
iii) Under what conditions do MSR and MSE estimate the same quantity?
When \({\beta_1=0}\), since then \({E\{MSR\} = E\{MSE\} = \sigma^2}\).
*** c. ***
Conduct an F test of whether or not \({\beta_1=0}\). Control the alpha risk at 0.01. State the alternatives, decision rule, and conclusion.
The null hypothesis is that \({\beta_1=0}\).
The alternative hypothesis is that \({\beta_1 \ne 0}\).
alpha <- 0.01
n <- length(GPA)
F.test.gpa <- qf((1-alpha),1,n-2)
F.test.gpa
## [1] 6.854641
\({F^*=9.2402}\), \({F(0.99; 1, n-2)=6.8546}\)
Decision rule: if \({F^* \le 6.8546}\), conclude \({H_0}\); if \({F^* > 6.8546}\), conclude \({H_a}\). Since \({F^* = 9.2402 > 6.8546}\), reject the null hypothesis and conclude \({\beta_1 \ne 0}\).
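The same decision can be checked programmatically from the ANOVA table in part (a) (a small sketch):
# F* from the ANOVA table, compared with the critical value F(0.99; 1, n-2)
F.star <- GPA.vartable[1, "F value"]  # 9.2402
F.star > F.test.gpa                   # TRUE, so reject H0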
*** d. ***
What is the absolute magnitude of the reduction in the variation of Y when X is introduced into the regression model?
summary(GPA.lm)
##
## Call:
## lm(formula = GPA ~ ACT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.74004 -0.33827 0.04062 0.44064 1.22737
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.11405 0.32089 6.588 1.3e-09 ***
## ACT 0.03883 0.01277 3.040 0.00292 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6231 on 118 degrees of freedom
## Multiple R-squared: 0.07262, Adjusted R-squared: 0.06476
## F-statistic: 9.24 on 1 and 118 DF, p-value: 0.002917
The absolute reduction is the regression sum of squares, \({SSR = 3.5878}\), from the ANOVA table in part (a).
What is the relative reduction?
The relative reduction in the variation of Y when X is introduced into the regression model is the \({R^2}\) value. \({R^2=0.07262}\).
What is the name of the latter measure?
The relative reduction is called the coefficient of determination, \({R^2}\).
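The relative reduction can also be computed directly from the ANOVA table in part (a), since \({R^2 = SSR/SSTO}\) (a small sketch reusing GPA.vartable):
# R^2 = SSR / (SSR + SSE), taken from the ANOVA table
SSR <- GPA.vartable["ACT", "Sum Sq"]
SSE <- GPA.vartable["Residuals", "Sum Sq"]
SSR / (SSR + SSE)  # about 0.0726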
*** e. ***
Obtain r and attach the appropriate sign.
r <- sqrt(0.07262)
r
## [1] 0.269481
The r value for GPA.lm is +0.269481. The sign is positive because the estimated slope of the linear model is positive, indicating a positive correlation between ACT score and GPA.
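Equivalently, the sign can be attached programmatically from the fitted slope (a small sketch using the model object):
# r = sign(b1) * sqrt(R^2)
sign(coef(GPA.lm)["ACT"]) * sqrt(summary(GPA.lm)$r.squared)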
*** f. ***
Which measure, \({R^2}\) or r, has the more clear-cut operational interpretation? Explain.
\({R^2}\) has the more clear-cut operational interpretation. \({R^2}\) is a value between 0 and 1 that gives the proportion of the total variation in Y explained by the regression on X, whereas r only conveys the direction and strength of the linear association and has no such direct variance-accounting interpretation. For this reason \({R^2}\) is more frequently used to describe the relationship between the two variables.
*** 2.28 – Muscle Mass ***
MuscMass.dat <- read.table("HW1_exercise3.txt", header = FALSE, sep = ",")
Age <- MuscMass.dat[,2]
MuscMass <- MuscMass.dat[,1]
MuscMass.lm <- lm(MuscMass~Age)
*** a. ***
Obtain a 95% confidence interval for the mean muscle mass for women of age 60.
womens.age <- data.frame(Age=60)
muscmass.confidence.int <- predict(MuscMass.lm, womens.age, interval = "confidence", level = 0.95, se.fit = TRUE)
muscmass.confidence.int
## $fit
## fit lwr upr
## 1 84.94683 82.83471 87.05895
##
## $se.fit
## [1] 1.055154
##
## $df
## [1] 58
##
## $residual.scale
## [1] 8.173177
Interpret your confidence interval.
We are 95% confident that the mean muscle mass of women aged 60 is between 82.83 and 87.06.
*** b. ***
Obtain a 95% prediction interval for the muscle mass of a woman whose age is 60.
muscmass.prediction.int <- predict(MuscMass.lm, womens.age, interval = "prediction", level = 0.95, se.fit = TRUE)
muscmass.prediction.int
## $fit
## fit lwr upr
## 1 84.94683 68.45067 101.443
##
## $se.fit
## [1] 1.055154
##
## $df
## [1] 58
##
## $residual.scale
## [1] 8.173177
Is the prediction interval relatively precise?
No, the prediction interval is not relatively precise: it runs from about 68.5 to 101.4, roughly 33 units wide, which is large relative to the fitted muscle mass of about 85.
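As a rough gauge of the precision (a quick sketch using the prediction output above):
# Half-width of the 95% prediction interval, and its size relative to the fitted value (~19%)
pi.fit <- muscmass.prediction.int$fit
(pi.fit[, "upr"] - pi.fit[, "lwr"]) / 2
((pi.fit[, "upr"] - pi.fit[, "lwr"]) / 2) / pi.fit[, "fit"]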
*** c. ***
Determine the boundary values of the 95% confidence band for the regression line when \({X_h=60}\).
W <- sqrt(2*qf(0.95,2,length(MuscMass)-2))
mm.conf.band.upper <- muscmass.confidence.int$fit[,1]+W*muscmass.confidence.int$se.fit
mm.conf.band.lower <- muscmass.confidence.int$fit[,1]-W*muscmass.confidence.int$se.fit
mm.conf.band.upper
## [1] 87.59774
mm.conf.band.lower
## [1] 82.29593
Is your confidence band wider at this point than the confidence interval in part (a)? Should it be?
The 95% confidence band at \({X_h=60}\) is \({82.29593 \le \beta_0 + \beta_1 X_h \le 87.59774}\). The confidence band is somewhat wider than the confidence interval at \({X_h=60}\), as it should be, because the band must cover the entire regression line simultaneously, not just the mean response at a single \({X_h}\).
*** 2.29 – Muscle mass ***
*** a. ***
Plot the deviations \({Y_i - {\hat{Y_i}}}\) against \({X_i}\) on one graph.
regress.devs <- MuscMass - predict(MuscMass.lm)
plot(Age, regress.devs, xlab = "Women's Age", ylab = "Deviation around fitted regression line (Y_i - Yhat_i)", xlim = c(40,80), ylim = c(-20,20))
Plot the deviations \({\hat{Y_i}} - \bar{Y}\) against \({X_i}\) on another graph, using the same scales as in the first graph.
regress.mean.devs <- predict(MuscMass.lm) - mean(MuscMass)
plot(Age, regress.mean.devs, xlab = "Women's Age", ylab = "Deviation of fitted value around the mean (Yhat_i - Ybar)", xlim = c(40,80), ylim = c(-20,20))
From your two graphs, does SSE or SSR appear to be the large component of SSTO? What does this imply about the magnitude of \({R^2}\)?
SSR appears to be the larger component of SSTO: the deviations \({\hat{Y_i} - \bar{Y}}\) are generally larger in magnitude than the residuals \({Y_i - \hat{Y_i}}\). This implies that \({R^2 = SSR/SSTO}\) will be fairly large, since most of the total variation in Y is explained by the regression.
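The visual impression can be checked numerically from the two deviation vectors, whose sums of squares are SSE and SSR:
# Sums of squares of the deviations plotted above
sum(regress.devs^2)       # SSE, about 3874
sum(regress.mean.devs^2)  # SSR, about 11628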
*** b. ***
Set up the ANOVA table.
anova(MuscMass.lm)
## Analysis of Variance Table
##
## Response: MuscMass
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 1 11627.5 11627.5 174.06 < 2.2e-16 ***
## Residuals 58 3874.4 66.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
*** c. ***
Test whether or not \({\beta_1=0}\) using an F test with \({\alpha=0.05}\).
n <- length(Age)
alpha <- 0.05
F.test.muscmass <- qf(1-alpha,1,n-2)
F.test.muscmass
## [1] 4.006873
State the alternatives, decision rule, and conclusion.
The null hypothesis is that \({\beta_1=0}\).
The alternative hypothesis is that \({\beta_1\ne0}\).
\({F^*=174.06}\), \({F(0.95; 1, n-2)=4.006873}\)
Decision rule: if \({F^* \le 4.006873}\), conclude \({H_0}\); if \({F^* > 4.006873}\), conclude \({H_a}\). Since \({F^* = 174.06 > 4.006873}\), reject the null hypothesis and conclude \({\beta_1 \ne 0}\).
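As a programmatic check of the decision rule (a small sketch reusing the ANOVA table from part (b)):
# F* from the ANOVA table, compared with the critical value F(0.95; 1, n-2)
F.star.mm <- anova(MuscMass.lm)[1, "F value"]  # 174.06
F.star.mm > F.test.muscmass                    # TRUE, so reject H0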
*** d. ***
What proportion of the total variation in muscle mass remains “unexplained” when age is introduced into the analysis?
summary(MuscMass.lm)
##
## Call:
## lm(formula = MuscMass ~ Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.1368 -6.1968 -0.5969 6.7607 23.4731
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.3466 5.5123 28.36 <2e-16 ***
## Age -1.1900 0.0902 -13.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.173 on 58 degrees of freedom
## Multiple R-squared: 0.7501, Adjusted R-squared: 0.7458
## F-statistic: 174.1 on 1 and 58 DF, p-value: < 2.2e-16
The proportion of the total variation in muscle mass that remains “unexplained” when age is introduced is approximately 25%.
Is this proportion relatively small or large?
This proportion is relatively small: about 75% of the variation in muscle mass is explained by age, so age is a good predictor of muscle mass.
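Equivalently, the unexplained proportion is \({1 - R^2}\), which can be read off the fitted model object:
# Proportion of SSTO left unexplained: SSE/SSTO = 1 - R^2 (about 0.25)
1 - summary(MuscMass.lm)$r.squared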
*** e. ***
Obtain \({R^2}\) and \(r\).
R <- sqrt(0.7501)
R
## [1] 0.8660831
\({R^2 = 0.7501}\) and \({r = -0.8661}\); r takes the negative sign because the slope of the fitted regression is negative.
*** 2.34 – refer to GPA problem. ***
*** a. ***
Would it be more reasonable to consider \({X_i}\) as known constants or as random variables here? Explain.
It would be more reasonable to consider the \({X_i}\) as random variables than as known constants, because the ACT scores are observed rather than set by the analyst: we cannot control which ACT scores new applicants will bring, so the \({X_i}\) are best treated as random.
*** b. ***
If the \({X_i}\) were considered to be random variables, would this have any effect on prediction intervals for new applicants? Explain.
No. If the \({Y_i}\) have conditional distributions given \({X_i}\) that are normal and independent, and the distribution \({g(X_i)}\) of the \({X_i}\) does not involve the regression parameters (\({\beta_0}\), \({\beta_1}\), and \({\sigma^2}\)), then all of the prediction-interval procedures remain appropriate; the intervals are simply interpreted conditionally on the observed \({X_i}\), so treating the \({X_i}\) as random has no practical effect on the prediction intervals for new applicants.
*** 2.45 – Water flow. ***
An engineer, desiring to estimate the coefficient of correlation \({\rho_{12}}\) between the rate of water flow at point A in a stream (\({Y_1}\)) and the concurrent rate of flow at point B (\({Y_2}\)), obtained \({r_{12}=0.83}\) in a sample of 147 cases. Assume that the bivariate normal model is appropriate.
*** a. ***
Obtain a 99% confidence interval for \({\rho_{12}}\).
r.12 <- 0.83
n.waterflow <- 147
# Fisher z transformation: z' = (1/2)*log((1 + r)/(1 - r)); Table B.8 gives z' = 1.188
z.prime <- 1.188
sigma <- 1/sqrt(n.waterflow - 3)
# a two-sided 99% interval uses z(1 - alpha/2) = z(0.995) = 2.576
z <- qnorm(0.995)
gamma.upper <- z.prime + z*sigma
gamma.lower <- z.prime - z*sigma
gamma.upper
## [1] 1.402652
gamma.lower
## [1] 0.9733476
Transforming back to the correlation scale (Table B.8 in reverse), the 99% confidence interval is approximately \({0.75 \le \rho_{12} \le 0.89}\).
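As a check on the table lookup, the limits can be back-transformed directly, since the inverse of the Fisher z transformation is \({r = (e^{2z'}-1)/(e^{2z'}+1) = \tanh(z')}\):
# Back-transform the Fisher-z limits to the correlation scale; approximately 0.750 and 0.886
tanh(c(gamma.lower, gamma.upper))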
*** b. ***
Convert the confidence interval in part (a) to a 99% confidence interval for \({\rho_{12}^2}\).
# Since both limits for rho are positive, square the limits of the interval for rho itself
# (not the Fisher-z limits); tanh() inverts the Fisher z transformation
rho.lower <- tanh(gamma.lower)
rho.upper <- tanh(gamma.upper)
rho.lower^2
## [1] 0.5627575
rho.upper^2
## [1] 0.7848607
The confidence interval is approximately \({0.56 \le \rho_{12}^2 \le 0.78}\).
*** 2.47 – Muscle mass ***
Assume that the normal bivariate model is appropriate.
*** a. ***
Compute the Pearson product-moment correlation coefficient for \({r_{12}}\).
pearson.coef <- cor(MuscMass.dat, method = "pearson")
pearson.coef
## V1 V2
## V1 1.000000 -0.866064
## V2 -0.866064 1.000000
The \({r_{12}=-0.866}\).
*** b. ***
Test whether muscle mass and age are statistically independent in the population; use \({\alpha = 0.05}\).
n.muscmass <- length(MuscMass)
t.pearson <- (pearson.coef[1,2]*(sqrt(n.muscmass-2)))/sqrt(1-(pearson.coef[1,2])^2)
t.pearson
## [1] -13.19326
t.stat <- qt(0.975,n.muscmass-2)
t.stat
## [1] 2.001717
State the alternatives, decision rule, and conclusion.
\({H_0 : \rho_{12}=0}\)
\({H_a : \rho_{12}\ne0}\)
Given \({\alpha = 0.05}\), the critical value is \({t(1-\frac\alpha2;n-2)=t(0.975;58)=2.001717}\).
If \({|t^*| \le 2.001717}\), conclude \({H_0}\).
If \({|t^*| > 2.001717}\), conclude \({H_a}\).
Since \({|t^*| = 13.19326 > 2.001717}\), reject \({H_0}\) and conclude \({H_a}\): \({\rho_{12}\ne0}\), so muscle mass and age are not statistically independent.
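As a cross-check (not part of the textbook procedure), R's built-in cor.test reproduces the same t statistic and conclusion:
# Built-in test of H0: rho = 0; should give t = -13.193 on 58 df
cor.test(MuscMass, Age, method = "pearson")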
*** c. ***
The bivariate normal model assumption is possibly inappropriate here. Compute the Spearman rank correlation coefficient \({r_s}\).
spearman.coef <- cor(MuscMass.dat, method = "spearman")
spearman.coef
## V1 V2
## V1 1.0000000 -0.8657217
## V2 -0.8657217 1.0000000
The Spearman rank correlation coefficient, \({r_s=-0.8657}\).
*** d. ***
Repeat part (b), this time basing the test of independence on the Spearman rank correlation computed in part (c) and test statistic (2.101). Use \({\alpha=0.05}\).
n.muscmass <- length(MuscMass)
t.spearman <- (spearman.coef[1,2]*(sqrt(n.muscmass-2)))/sqrt(1-(spearman.coef[1,2])^2)
t.spearman
## [1] -13.17243
t.stat.spear <- qt(0.975,n.muscmass-2)
t.stat.spear
## [1] 2.001717
State the alternatives, decision rule, and conclusion.
\({H_0}\): there is no association between muscle mass and age.
\({H_a}\): there is an association between muscle mass and age.
Given \({\alpha = 0.05}\), the critical value is \({t(1-\frac\alpha2;n-2)=t(0.975;58)=2.001717}\).
If \({|t^*| \le 2.001717}\), conclude \({H_0}\).
If \({|t^*| > 2.001717}\), conclude \({H_a}\).
Since \({|t^*| = 13.17243 > 2.001717}\), reject \({H_0}\) and conclude \({H_a}\): muscle mass and age are associated.
*** e. ***
How do your estimates and conclusions in parts (a) and (b) compare to those obtained in parts (c) and (d)?
The estimates and conclusions in parts (c) and (d) are very similar to those in parts (a) and (b); the Pearson product-moment correlation coefficient would therefore be sufficient for these data.