Include test and confidence interval for one of the regression coefficients of either mussels data or another data set. Include the full process (hypothesis, test stat, pval, conclusion in context of problem) for hypothesis test. Include confidence interval interpretation. Include confidence interval for a mean value of y given values of x. Include corresponding prediction interval. Interpret both.
For this regression, I will analyze how average school district income and percent of English learners relate to average math score. The data cover all 420 K-6 and K-8 California school districts from 1998-1999 and come from the CASchools data set in the AER package.
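# Load the California school district data (requires the AER package)
data("CASchools", package = "AER")
# Regress average math score on district income and percent of English learners
mymod <- lm(math ~ income + english, data = CASchools)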
summary(mymod)
##
## Call:
## lm(formula = math ~ income + english, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.918 -7.626 0.369 7.494 30.294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 636.62931 1.58800 400.90 <2e-16 ***
## income 1.50359 0.08154 18.44 <2e-16 ***
## english -0.40059 0.03222 -12.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.48 on 417 degrees of freedom
## Multiple R-squared: 0.6273, Adjusted R-squared: 0.6255
## F-statistic: 350.9 on 2 and 417 DF, p-value: < 2.2e-16
The output above gives the fitted coefficients, their t statistics and p-values, the multiple R-squared, and more. The regression equation is: y(hat) = 636.629 + 1.504*x1 - 0.401*x2
where the response, y, is average math score, x1 is average district income (in thousands of USD), and x2 is the percent of English learners. The estimates 1.504 and -0.401 are the fitted values of the coefficients B1 and B2 (Beta 1 and Beta 2). It's important to explain what this equation tells us. For every 1000 USD increase in average district income, the average math score is estimated to increase by 1.504 points, holding the percent of English learners constant. Likewise, for every one percentage point increase in English learners, the average math score is estimated to decrease by 0.401 points, holding income constant.
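To make the "holding constant" interpretation concrete, here is a minimal sketch that hard-codes the rounded coefficient estimates from the summary output instead of refitting the model. The function name predict_math and the income and English-learner values passed to it are hypothetical, chosen only for illustration.

```r
# Rounded coefficient estimates copied from the summary() output above
b0 <- 636.62931
b1 <- 1.50359    # income (thousands of USD)
b2 <- -0.40059   # percent of English learners

# Hypothetical helper: fitted average math score for given predictor values
predict_math <- function(income, english) {
  b0 + b1 * income + b2 * english
}

# Raise income by 1 (that is, 1000 USD) at two arbitrary levels of
# English learners; the change in the fitted score is b1 both times
diff_low  <- predict_math(16, 5)  - predict_math(15, 5)
diff_high <- predict_math(16, 40) - predict_math(15, 40)
c(diff_low, diff_high)
```

Both differences equal the income coefficient, which is what "for any percent of English learners" means in the interpretation above.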
The null hypothesis is H0: B2 = 0 and the alternative hypothesis is HA: B2 ≠ 0. That is, we are testing whether the percent of English learners provides information about average math score in a linear model that already includes income. From the output above, the test statistic for the t-test is -12.43 and the p-value is reported as < 2e-16 (less than 2*10^-16), which is R's way of telling us that the p-value is too small to display precisely. Because the p-value is far below any conventional significance level, we have sufficient evidence to reject the null hypothesis that B2 = 0 and conclude that percent of English learners has a significant linear relationship with average math score after accounting for district income. The degrees of freedom are 417 = 420 - 3, the number of data points minus the 3 estimated coefficients (the intercept and the two slopes).
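The test statistic and p-value can be reproduced by hand from the estimate and standard error in the coefficient table. This is a sketch using the rounded values printed above, so the last digit may differ slightly from R's own output.

```r
est <- -0.40059   # estimate for english, from summary()
se  <- 0.03222    # its standard error, from summary()
df  <- 420 - 3    # observations minus estimated coefficients

t_stat <- est / se                    # t statistic for H0: B2 = 0
p_val  <- 2 * pt(-abs(t_stat), df)    # two-sided p-value
c(t_stat = t_stat, p_val = p_val)
```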
We will find a 95% confidence interval for the mean math score given average district income and % of English Learners.
new.data <- data.frame(income = 10.4150, english = 12.4087591)
confy <- predict(mymod, new.data, interval = "confidence")
confy
## fit lwr upr
## 1 647.3184 645.9123 648.7245
We are 95% confident that the mean math score for districts with an average income of 10.4150 (thousand USD) and 12.409% English learners is between 645.9123 and 648.7245. The point estimate of the mean math score is 647.3184. Note that we have a data point with these values, and its math score of 605.4 lies outside this interval. This does not contradict the interval, which describes the mean score over all such districts rather than the score of any single district.
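The point estimate can be checked by plugging the predictor values into the regression equation. The sketch below uses the rounded coefficients from the summary output, and the observed score of 605.4 is the value quoted in the text.

```r
# Rounded coefficient estimates copied from the summary() output
b0 <- 636.62931
b1 <- 1.50359
b2 <- -0.40059

income  <- 10.4150
english <- 12.4087591

fit       <- b0 + b1 * income + b2 * english   # fitted mean math score
resid_obs <- 605.4 - fit                       # observed minus fitted
c(fit = fit, residual = resid_obs)
```

The residual of about -41.92 agrees with the minimum residual in the summary output (-41.918), which suggests this district is the one the model overpredicts the most.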
We can also create a prediction interval for a single district average score:
predy <- predict(mymod, new.data, interval = "prediction")
predy
## fit lwr upr
## 1 647.3184 624.7151 669.9216
We are 95% confident that the math score of a single district with these predictor values is between 624.7151 and 669.9216. The prediction interval is wider than the confidence interval because it must account for the variability of an individual district around the mean in addition to the uncertainty in the estimated mean. Notice that the point estimate, 647.3184, is the same for both intervals.
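The two intervals are linked: both are centered at the fit, but the prediction interval adds the residual variance to the variance of the estimated mean. The sketch below backs out the standard error of the fit from the confidence interval printed earlier and then rebuilds the prediction interval. Because it relies on rounded printed values (including the residual standard error of 11.48), it reproduces the reported limits only approximately.

```r
df     <- 417
t_crit <- qt(0.975, df)        # critical value for a 95% interval

fit    <- 647.3184             # point estimate from predict()
ci_upr <- 648.7245             # upper limit of the confidence interval
s      <- 11.48                # residual standard error from summary()

se_fit  <- (ci_upr - fit) / t_crit          # implied SE of the estimated mean
pi_half <- t_crit * sqrt(s^2 + se_fit^2)    # prediction interval half-width
c(lwr = fit - pi_half, upr = fit + pi_half)
```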
We can create confidence intervals for the coefficients Beta 1 and Beta 2.
confint(mymod)
## 2.5 % 97.5 %
## (Intercept) 633.5078316 639.7507977
## income 1.3433053 1.6638718
## english -0.4639263 -0.3372509
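We are 95% confident that Beta 1 (the income coefficient) is between 1.343 and 1.664. That is, holding percent of English learners constant, each additional 1000 USD of average district income is associated with an increase of between 1.343 and 1.664 points in mean math score. We are 95% confident that Beta 2 (the % English learners coefficient) is between -0.464 and -0.337. That is, holding income constant, each additional percentage point of English learners is associated with a decrease of between 0.337 and 0.464 points in mean math score. As a check, each interval contains the corresponding point estimate from the regression equation (1.504 and -0.401), and neither interval contains 0, which agrees with the small p-values in the summary.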
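The same intervals follow from the formula estimate ± (t critical value) × (standard error). This sketch uses the rounded estimates and standard errors from the coefficient table, so it should match confint() to about three decimal places.

```r
est <- c(income = 1.50359, english = -0.40059)   # estimates from summary()
se  <- c(income = 0.08154, english = 0.03222)    # standard errors from summary()
t_crit <- qt(0.975, df = 417)                    # 95% critical value

ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
round(ci, 3)
```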