Using the LungCap data - the first column (LungCap) and the second column (Age), assume that errors are independent and normally distributed \(\epsilon_i \stackrel{iid}{\sim} N(0,\sigma^2)\).
Import the data:
lung <- read.csv("/Users/seankrinik/Documents/CSULB/STAT510/HW1/LungCapData.csv",
header=T)
attach(lung)
Compute the 90% confidence limits for \(\beta_1\).
model <- lm(LungCap~Age)
m_smry <- summary(model)
ci_beta1 <- as.data.frame(confint(model, level=.9));ci_beta1[2,]
## 5 % 95 %
## Age 0.521526 0.5681708
We see the confidence interval is \([.522, .568]\) which tells us any values within that range are likely values for \(\beta_1\) at our 90% confidence level.
Compute the 90% confidence limits for \(\sigma^2\)
ci_sig <- function(mdl, lvl) {
mse <- sigma(mdl)^2
n <- summary(mdl)$df[2]
alpha <- (1-lvl)/2
lower <- ((n)*mse)/qchisq(1-alpha,n)
upper <- ((n)*mse)/qchisq(alpha,n)
x <- data.frame(lower,upper,mse)
colnames(x) <- c("lower", "upper", "estimate");row.names(x) <- "Sigma_2"
x
}
ci_sig(model, .9)
## lower upper estimate
## Sigma_2 2.140021 2.544517 2.328461
The confidence interval for \(\sigma^2\) is \([2.14, 2.54]\) meaning 90% of time, estimates for \(\sigma^2\) will fall in this interval. Note, \(\sigma^2\) is the variance of the errors of our regression model.
Perform a size \(\alpha = 0.10\) test of \[ \begin{aligned} H_0: \beta_1 = 1\\ H_1: \beta_1 \neq 1 \end{aligned} \]
B_hat <- m_smry$coef[2,1]
sd_Bhat <- m_smry$coef[2,2]
ts <- (B_hat-1)/sd_Bhat;ts
## [1] -32.1415
pval <- pt(ts, as.numeric(m_smry$df[2]));pval
## [1] 9.163101e-142
After testing \(H_0: \beta_1 = 1\), we will reject the null hypothesis at \(\alpha = .1\). This makes sense since our estimate for \(\beta_1\) is .545 (not close to 1). Additionally, from our result in part A, the confidence interval for \(\beta_1\) doesn’t come close to including 1.
Find and interpret the values of \(r\) and \(r^2\) for the simple linear regression relating the proportion of names recalled (y) and position (order) of the student (x) during the ‘name game’.
First, let’s find the coefficent of correlation (\(r\)) for our data:
namegame <- read.table("/Users/seankrinik/Documents/CSULB/STAT510/HW2/NAMEGAME2.txt",
header = T)
attach(namegame)
model2 <- lm(RECALL~POSITION)
msmry2 <- summary(model2)
r <- as.data.frame(cor(POSITION,RECALL))
colnames(r) <- c("r (coef of correlation)");r
## r (coef of correlation)
## 1 0.2332561
The coefficient of correlation is a scaleless quantitative measure of the linear relationship between our response and preditor. With a value of .233, we notice a slightly positive linear relationship in our data as seen by the regression line overlay in red.
plot(POSITION,RECALL, main = "Recall (y) v. Position (x)")
abline(model2, col="red")
Next, we can find \(r^2\) by squaring the coefficient of correlation, or look at our model summary and get the Multiple \(R^2\) value. This value is our coefficient of determination, or proportion of variance that is explained by using our dependent variable to predict values for our response.
r_2 <- as.data.frame(msmry2$r.squared)
colnames(r_2) <- c("r_2 (coef of determination)"); r_2
## r_2 (coef of determination)
## 1 0.05440839
With a small \(r^2\), we see that very little variance (only 5.44%) is explained by the linear relationship between Recall and Position.
We want to find \(E[y|x=5]\), so we need to run a prediction on our model while letting the Age value be constant at 5.
new_data <- data.frame(POSITION=5)
pred_conf <- predict(model2, newdata = new_data, interval = "confidence", level=.99)
pred_conf
## fit lwr upr
## 1 0.7025529 0.6459537 0.7591521
Our interval is [.646, .759], meaning we can expect on average for the mean of recall value to be within this interval 99% of the time if we are in the fifth position.
We need to find \(E[y_p|X=5]\), or the prediction interval for a response given we are in the fifth level.
pred_pred <- predict(model2, newdata = new_data, interval = "predict", level=.99)
pred_pred
## fit lwr upr
## 1 0.7025529 0.03656847 1.368537
Similar to part B, on average we can expect that 99% of the time, an individual value of recall will fall within our prediction interval given we are in the fifth level.
To visualize the data, I include the confidence and prediction intervals (\(\alpha = .01\)) across all values of ‘position’. Note, the blue lines on this graph are the confidence interval for our predicted values, and the green lines represent our prediction interval.
seq_conf <- seq(range(POSITION)[1],range(POSITION)[2],by=1)
conf_int_plot <- predict(model2, newdata = data.frame(POSITION=seq_conf),
interval = "confidence", level=.99)
pred_int_plot <- predict(model2, newdata = data.frame(POSITION=seq_conf),
interval = "predict", level=.99)
plot(POSITION,RECALL,
main = "Recall (y) v. Position (x)",
ylim = c(-.1,1.5))
abline(model2, col="red")
lines(seq_conf, conf_int_plot[,2], lty=2, col="blue")
lines(seq_conf, conf_int_plot[,3], lty=2, col="blue")
lines(seq_conf, pred_int_plot[,2], lty=2, col="green")
lines(seq_conf, pred_int_plot[,3], lty=2, col="green")