Include test and confidence interval for one of the regression coefficients of either mussels data or another data set. Include the full process (hypothesis, test stat, pval, conclusion in context of problem) for hypothesis test. Include confidence interval interpretation. Include confidence interval for a mean value of y given values of x. Include corresponding prediction interval. Interpret both.
For this learning log I will be using the CASchools data (California Test Score Data) from the AER package, which has one row per school district. I will be looking at how reading test scores and income levels relate to math test scores.
data("CASchools", package = "AER")
mod.cal <- lm(math~income + read, data = CASchools)
summary(mod.cal)
##
## Call:
## lm(formula = math ~ income + read, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.8890 -4.4845 -0.0684 4.5075 25.0698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 131.28869 15.05972 8.718 < 2e-16 ***
## income 0.28016 0.06693 4.186 3.47e-05 ***
## read 0.79051 0.02405 32.867 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.091 on 417 degrees of freedom
## Multiple R-squared: 0.8577, Adjusted R-squared: 0.857
## F-statistic: 1257 on 2 and 417 DF, p-value: < 2.2e-16
This summary of the multiple linear regression gives us a lot of information about our model. For starters, we obtain the estimated coefficients for the multiple regression equation. In this case our estimate of beta(zero) is 131.28869, our estimate of beta(one) for income is 0.28016, and our estimate of beta(two) for reading scores is 0.79051. We can interpret these in order to learn more about how these predictors relate to math scores. For example, our reading coefficient is 0.79051. This means that for every one point increase in reading score, we expect the mean math score to go up by 0.79051 points, assuming income is held constant.
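As a quick illustration of how the fitted equation is used, the sketch below plugs a hypothetical district (income of 10 thousand and a reading score of 616) into the rounded coefficients from the summary. The variable names are my own, and because the coefficients are rounded, the result can differ slightly from what predict() returns on the fitted model.

```r
# Rounded coefficient estimates copied from summary(mod.cal)
b0       <- 131.28869  # intercept
b_income <- 0.28016    # income (in thousands)
b_read   <- 0.79051    # reading score

# Hypothetical district chosen for illustration
income <- 10
read   <- 616

# Fitted math score from the regression equation
math_hat <- b0 + b_income * income + b_read * read
math_hat
```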
Another piece of information we can take away from this summary is the test statistic for each coefficient, which can be found under t value. We can use this t value to test the null hypothesis that the coefficient equals 0 (i.e. that the predictor does not improve our linear regression once the other predictor is in the model) against the alternative that it does not equal 0. We compare our test statistic to a t distribution with n - (number of predictors) - 1 degrees of freedom. In this case we have 420 data points and 2 predictors, so our degrees of freedom will be 420 - 2 - 1 = 417, which is confirmed in our summary. If we move over one column, we can see the p-value for this test statistic, which tells us the probability of getting a test statistic at least this extreme if the null hypothesis were true. Looking at income as a predictor, our hypotheses are H0: beta(one) = 0 and Ha: beta(one) ≠ 0. Income has a t value of 4.186 and a p-value of 3.47*10^-5. This is a very small p-value (well below 0.05), so we reject our null hypothesis and conclude that there is strong evidence of a linear relationship between income and math scores after accounting for reading scores. Thus we will keep income in our model.
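To make the test above concrete, here is a small sketch that rebuilds the t value and two-sided p-value for income by hand. It only uses the rounded estimate and standard error printed in the summary, so the results should match the summary up to rounding. The variable names are my own.

```r
# Values copied from the income row of summary(mod.cal)
est <- 0.28016   # estimated income coefficient
se  <- 0.06693   # its standard error
df  <- 417       # residual degrees of freedom: 420 - 2 - 1

# Test statistic for H0: beta_income = 0
t_stat <- (est - 0) / se

# Two-sided p-value from the t distribution with 417 df
p_val <- 2 * pt(-abs(t_stat), df = df)

c(t = t_stat, p = p_val)
```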
We can also create confidence and prediction intervals for math scores by inputting an income level and a reading score.
new.data <- data.frame(income = 10, read = 616)
confy <- predict(mod.cal, new.data , interval = "confidence")
confy
## fit lwr upr
## 1 621.0465 619.4503 622.6426
First we will create a confidence interval for the mean math score when the average income is 10 (thousand) and the reading score is 616. From the output, we see that the estimate for the math score is 621.0465, and that the confidence interval is [619.4503, 622.6426]. We interpret this as meaning that we are 95% confident that the mean math score for districts with an average income of 10 (thousand) and a reading score of 616 is between 619.4503 and 622.6426. We can also create a prediction interval, which answers a slightly different question than a confidence interval.
predy <- predict(mod.cal, new.data, interval = "predict")
predy
## fit lwr upr
## 1 621.0465 607.017 635.0759
One thing to notice is that both the confidence interval and the prediction interval give us the same point estimate. However, the prediction interval is an interval for the math score of one individual district, while the confidence interval is an interval for the mean math score of all districts with those predictor values. Here we are 95% confident that a single district with an average income of 10 (thousand) and a reading score of 616 will have a math score between 607.017 and 635.0759. The prediction interval is much wider, even though we input the same predictor values. There is much more variability when it comes to predicting just one individual math score, since an individual district varies around the mean, so the interval must be wider to account for that variability.
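The extra width can be checked with a rough calculation. The sketch below assumes the standard formulas, where the confidence interval half-width is t* times the standard error of the fitted mean and the prediction interval half-width is t* times sqrt(se_fit^2 + sigma^2). It backs out se_fit from the reported confidence interval and then rebuilds the prediction interval half-width using the residual standard error (7.091) from the summary. Variable names are my own, and small differences are expected because the inputs are rounded.

```r
# Critical value for a 95% interval with 417 residual df
t_crit <- qt(0.975, df = 417)

# Residual standard error copied from summary(mod.cal)
sigma_hat <- 7.091

# Half-widths of the intervals reported by predict()
ci_half <- (622.6426 - 619.4503) / 2
pi_half <- (635.0759 - 607.017) / 2

# Standard error of the fitted mean, backed out from the confidence interval
se_fit <- ci_half / t_crit

# Prediction interval half-width adds the residual variance for one new district
pi_half_rebuilt <- t_crit * sqrt(se_fit^2 + sigma_hat^2)

c(reported = pi_half, rebuilt = pi_half_rebuilt)
```

The residual standard error is much larger than se_fit, which is why the prediction interval ends up so much wider than the confidence interval.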
confint(mod.cal)
## 2.5 % 97.5 %
## (Intercept) 101.6862588 160.8911223
## income 0.1485942 0.4117207
## read 0.7432355 0.8377911
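Here we create confidence intervals for the coefficients with 95% confidence. What these tell us is that we are 95% confident that the true coefficient for each predictor is between those two numbers. For example, our confidence interval for income is [0.1486, 0.4117]. We can interpret this to mean that we are 95% confident that each additional unit (thousand) of income is associated with an increase in mean math score of between 0.1486 and 0.4117 points, holding reading score constant. Since this interval does not contain 0, it agrees with our decision to reject the null hypothesis for income.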
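For reference, the income interval can be reproduced by hand as estimate plus or minus t* times the standard error. This sketch uses the rounded values from the summary, so it should agree with confint() only up to rounding. The variable names are my own.

```r
# Values copied from the income row of summary(mod.cal)
est <- 0.28016
se  <- 0.06693

# Critical value for 95% confidence with 417 residual df
t_crit <- qt(0.975, df = 417)

# Confidence interval: estimate +/- t* x standard error
ci_income <- c(lower = est - t_crit * se, upper = est + t_crit * se)
ci_income
```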