Question 1:

Using the LungCap data - the first column (LungCap) and the second column (Age), assume that errors are independent and normally distributed \(\epsilon_i \stackrel{iid}{\sim} N(0,\sigma^2)\).

Import the data:

lung <- read.csv("/Users/seankrinik/Documents/CSULB/STAT510/HW1/LungCapData.csv", 
header=T)
attach(lung)

Part A

Compute the 90% confidence limits for \(\beta_1\).

model <- lm(LungCap~Age)
m_smry <- summary(model)
ci_beta1 <- as.data.frame(confint(model, level=.9));ci_beta1[2,]
##          5 %      95 %
## Age 0.521526 0.5681708

We see the confidence interval is \([.522, .568]\) which tells us any values within that range are likely values for \(\beta_1\) at our 90% confidence level.

Part B

Compute the 90% confidence limits for \(\sigma^2\)

ci_sig <- function(mdl, lvl) {
  mse <- sigma(mdl)^2
  n <- summary(mdl)$df[2]
  alpha <- (1-lvl)/2
  lower <- ((n)*mse)/qchisq(1-alpha,n)
  upper <- ((n)*mse)/qchisq(alpha,n)
  x <- data.frame(lower,upper,mse)
  colnames(x) <- c("lower", "upper", "estimate");row.names(x) <- "Sigma_2"
  x
}
ci_sig(model, .9)
##            lower    upper estimate
## Sigma_2 2.140021 2.544517 2.328461

The confidence interval for \(\sigma^2\) is \([2.14, 2.54]\) meaning 90% of time, estimates for \(\sigma^2\) will fall in this interval. Note, \(\sigma^2\) is the variance of the errors of our regression model.

Part C

Perform a size \(\alpha = 0.10\) test of \[ \begin{aligned} H_0: \beta_1 = 1\\ H_1: \beta_1 \neq 1 \end{aligned} \]

B_hat <- m_smry$coef[2,1]
sd_Bhat <- m_smry$coef[2,2]
ts <- (B_hat-1)/sd_Bhat;ts
## [1] -32.1415
pval <- pt(ts, as.numeric(m_smry$df[2]));pval
## [1] 9.163101e-142

After testing \(H_0: \beta_1 = 1\), we will reject the null hypothesis at \(\alpha = .1\). This makes sense since our estimate for \(\beta_1\) is .545 (not close to 1). Additionally, from our result in part A, the confidence interval for \(\beta_1\) doesn’t come close to including 1.

Question 2

Part A (3.8, #3.44)

Find and interpret the values of \(r\) and \(r^2\) for the simple linear regression relating the proportion of names recalled (y) and position (order) of the student (x) during the ‘name game’.

First, let’s find the coefficent of correlation (\(r\)) for our data:

namegame <- read.table("/Users/seankrinik/Documents/CSULB/STAT510/HW2/NAMEGAME2.txt",
header = T)
attach(namegame)
model2 <- lm(RECALL~POSITION)
msmry2 <- summary(model2)
r <- as.data.frame(cor(POSITION,RECALL))
colnames(r) <- c("r (coef of correlation)");r
##   r (coef of correlation)
## 1               0.2332561

The coefficient of correlation is a scaleless quantitative measure of the linear relationship between our response and preditor. With a value of .233, we notice a slightly positive linear relationship in our data as seen by the regression line overlay in red.

plot(POSITION,RECALL, main = "Recall (y) v. Position (x)")
abline(model2, col="red")

Next, we can find \(r^2\) by squaring the coefficient of correlation, or look at our model summary and get the Multiple \(R^2\) value. This value is our coefficient of determination, or proportion of variance that is explained by using our dependent variable to predict values for our response.

r_2 <- as.data.frame(msmry2$r.squared)
colnames(r_2) <- c("r_2 (coef of determination)"); r_2
##   r_2 (coef of determination)
## 1                  0.05440839

With a small \(r^2\), we see that very little variance (only 5.44%) is explained by the linear relationship between Recall and Position.

Part B (3.9 #3.54)

A) Find a 99% confidence interval for the mean recall proportion for students in the fifth position during the ‘name game’. Interpret the result.

We want to find \(E[y|x=5]\), so we need to run a prediction on our model while letting the Age value be constant at 5.

new_data <- data.frame(POSITION=5)
pred_conf <- predict(model2, newdata = new_data, interval = "confidence", level=.99)
pred_conf
##         fit       lwr       upr
## 1 0.7025529 0.6459537 0.7591521

Our interval is [.646, .759], meaning we can expect on average for the mean of recall value to be within this interval 99% of the time if we are in the fifth position.

B) Find a 99% prediction interval for the recall proportion of a particular student in the fifth position during the ‘name game’. Interpret the result.

We need to find \(E[y_p|X=5]\), or the prediction interval for a response given we are in the fifth level.

pred_pred <- predict(model2, newdata = new_data, interval = "predict", level=.99)
pred_pred
##         fit        lwr      upr
## 1 0.7025529 0.03656847 1.368537

Similar to part B, on average we can expect that 99% of the time, an individual value of recall will fall within our prediction interval given we are in the fifth level.

To visualize the data, I include the confidence and prediction intervals (\(\alpha = .01\)) across all values of ‘position’. Note, the blue lines on this graph are the confidence interval for our predicted values, and the green lines represent our prediction interval.

seq_conf <- seq(range(POSITION)[1],range(POSITION)[2],by=1)
conf_int_plot <- predict(model2, newdata = data.frame(POSITION=seq_conf),
interval = "confidence", level=.99)
pred_int_plot <- predict(model2, newdata = data.frame(POSITION=seq_conf), 
interval = "predict", level=.99)
plot(POSITION,RECALL, 
     main = "Recall (y) v. Position (x)",
     ylim = c(-.1,1.5))
abline(model2, col="red")
lines(seq_conf, conf_int_plot[,2], lty=2, col="blue")
lines(seq_conf, conf_int_plot[,3], lty=2, col="blue")
lines(seq_conf, pred_int_plot[,2], lty=2, col="green")
lines(seq_conf, pred_int_plot[,3], lty=2, col="green")