library(tidyverse)
library(Stat2Data)
library(skimr)
3.1 Predicting a statistics final exam grade. A statistics professor assigned various grades during the semester including a midterm exam (out of 100 points) and a logistic regression project (out of 30 points). The prediction equation below was fit, using data from 24 students in the class, to predict the final exam score (out of 100 points) based on the midterm and project grades:
\(\hat{Final} = 11.0 +0.53 * Midterm +1.20 * Project\)
a. What would this tell you about a student who got perfect scores on the midterm and project?
11.0 +0.53 * 100 +1.20 * 30
## [1] 100
Comments: The prediction equation predicts that a student with perfect scores on the midterm (100/100) and the project (30/30) would also earn a perfect score (100/100) on the final.
b. Michael got a grade of 87 on his midterm, 21 on the project, and an 80 on the final. Compute his residual and write a sentence to explain what that value means in Michael’s case.
expectedScore=11+.53*87+1.2*21
expectedScore
## [1] 82.31
residualOfScores=80-expectedScore
residualOfScores
## [1] -2.31
Comments: Michael’s residual is -2.31. This means he scored 2.31 points below what the prediction equation expected on the final exam: he earned 80 points, while the model predicted 82.31.
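The two calculations in parts a and b can also be wrapped in small helper functions. This is only a sketch: the names `predict_final` and `final_residual` are hypothetical and not part of the exercise, and the coefficients are copied from the prediction equation given above.

```r
# Prediction equation from Exercise 3.1 (coefficients taken from the exercise text)
predict_final <- function(midterm, project) {
  11.0 + 0.53 * midterm + 1.20 * project
}

# Residual = observed final score minus predicted final score
final_residual <- function(final, midterm, project) {
  final - predict_final(midterm, project)
}

predict_final(midterm = 100, project = 30)              # part a: perfect scores
final_residual(final = 80, midterm = 87, project = 21)  # part b: Michael
```

These calls reproduce the values computed above (100 for part a and -2.31 for part b). A negative residual means the observed score fell below the prediction.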
3.3 Breakfast cereals. A regression model was fit to a sample of breakfast cereals. The response variable Y is calories per serving. The predictor variables are X1, grams of sugar per serving, and X2, grams of fiber per serving.
The fitted regression model is \(\hat{Y} = 109.3 + 1.0 * X1 - 3.7 * X2\)
In the context of this setting, interpret -3.7, the coefficient of X2. That is, describe how fiber is related to calories per serving, in the presence of the sugar variable.
109.3+1
## [1] 110.3
109.3+1-3.7
## [1] 106.6
Holding grams of sugar constant, each additional gram of fiber per serving is associated with a predicted decrease of 3.7 calories per serving.
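To see what "holding sugar constant" means in practice, the fitted model can be written as a function and evaluated at two fiber values with the same sugar value. This is a minimal sketch that assumes the fitted model is 109.3 + 1.0 * X1 - 3.7 * X2; the function name `predict_calories` and the sugar and fiber values used are illustrative choices, not part of the exercise.

```r
# Fitted cereal model: X1 = grams of sugar, X2 = grams of fiber (per serving)
predict_calories <- function(sugar, fiber) {
  109.3 + 1.0 * sugar - 3.7 * fiber
}

# Same sugar value, fiber increased by 1 gram
predict_calories(sugar = 1, fiber = 1) - predict_calories(sugar = 1, fiber = 0)

# The difference does not depend on which sugar value is held fixed
predict_calories(sugar = 10, fiber = 4) - predict_calories(sugar = 10, fiber = 3)
```

Both differences equal the fiber coefficient, -3.7, which is why that coefficient is read as the change in predicted calories per extra gram of fiber when sugar does not change.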
data(MathEnrollment)
head(MathEnrollment)
## Ayear Fall Spring
## 1 2001 259 246
## 2 2002 301 206
## 3 2003 343 288
## 4 2004 307 215
## 5 2005 286 230
## 6 2006 273 247
3.13 Enrollments in mathematics courses.
In Exercise 2.23 on page 85, we consider a model to predict spring enrollment in mathematics courses based on the fall enrollment. The residuals for that model showed a pattern of growing over the years in the data. Thus, it might be beneficial to add the academic year variable AYear to our model and fit a multiple regression. The data are provided in the file MathEnrollment.
a. Fit a multiple regression model for predicting spring enrollment (Spring) from fall enrollment (Fall) and academic year (AYear), after removing the data from 2003 that had special circumstances. Report the fitted prediction equation.
# Row indices for every year except row 3, the 2003 academic year
No2003=c(1:2,4:11)
modSpring<-lm(MathEnrollment$Spring[No2003]~MathEnrollment$Fall[No2003]+MathEnrollment$Ayear[No2003])
modSpring
##
## Call:
## lm(formula = MathEnrollment$Spring[No2003] ~ MathEnrollment$Fall[No2003] +
## MathEnrollment$Ayear[No2003])
##
## Coefficients:
## (Intercept) MathEnrollment$Fall[No2003]
## -11715.784 -1.007
## MathEnrollment$Ayear[No2003]
## 6.107
summary(modSpring)
##
## Call:
## lm(formula = MathEnrollment$Spring[No2003] ~ MathEnrollment$Fall[No2003] +
## MathEnrollment$Ayear[No2003])
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.1945 -9.3982 0.3212 5.8503 18.2036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.172e+04 2.686e+03 -4.361 0.00331 **
## MathEnrollment$Fall[No2003] -1.007e+00 2.041e-01 -4.933 0.00169 **
## MathEnrollment$Ayear[No2003] 6.107e+00 1.337e+00 4.566 0.00258 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.37 on 7 degrees of freedom
## Multiple R-squared: 0.871, Adjusted R-squared: 0.8342
## F-statistic: 23.64 on 2 and 7 DF, p-value: 0.0007704
Comments: The fitted prediction equation is \(\hat{Spring}=-11715.784-1.007*Fall+6.107*Ayear\)
This means that, holding the academic year (Ayear) constant, predicted spring enrollment decreases by about 1.007 students for each additional student enrolled in the fall (with 2003 excluded from the model).
It also means that, holding fall enrollment constant, predicted spring enrollment increases by about 6.107 students for each one-year increase in the academic year (Ayear), again with 2003 excluded from the model.
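The signs of the two slopes can be checked by evaluating the fitted equation directly. The sketch below hard-codes the rounded coefficients reported by `summary(modSpring)`; the function name `predict_spring` is hypothetical, and because the coefficients are rounded, its predictions will differ slightly from what `predict()` would return for the fitted model.

```r
# Fitted equation with the rounded coefficients from the model summary
predict_spring <- function(fall, ayear) {
  -11715.784 - 1.007 * fall + 6.107 * ayear
}

# One more fall student, same academic year
predict_spring(fall = 260, ayear = 2001) - predict_spring(fall = 259, ayear = 2001)

# One year later, same fall enrollment
predict_spring(fall = 259, ayear = 2002) - predict_spring(fall = 259, ayear = 2001)
```

The first difference is the Fall coefficient (about -1.007) and the second is the Ayear coefficient (about +6.107), so predicted spring enrollment falls with higher fall enrollment and rises with each later academic year.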
b. Prepare appropriate residual plots and comment on the conditions for inference. Did the slight problems with the residual plots (e.g., increasing residuals over time) that we noticed for the simple linear model disappear?
plot(modSpring)
The Residuals vs Fitted plot shows that the residuals average around 0, which is what we want, since this plot checks linearity. A clear pattern in this plot would be a cause for concern, because it would mean that a linear model is not appropriate for these variables. The plot does show a slight V shape, but this may simply reflect how little data we have (10 points). As such, we can say that linearity holds for the most part. This plot also checks for constant variance: the points should fall in roughly equal numbers above and below the 0 line, with about the same vertical spread across the whole range of fitted values. Since the plot looks like this, we can say that the residuals have roughly constant variance.
The Normal Q-Q plot shows that the majority of the points are extremely close to the line. Since this plot checks normality, we can say that the residuals are approximately normal, which is also what we want.
The last condition is independence: are the observations independent of one another? We can’t tell this from the plots; instead we have to think about the setting. Is each year independent of the others? Is each fall semester separate from the others? Thinking it through, I would say yes, they are independent of one another, since students are not required to enroll in math during any particular semester and are free to choose whether or not to take another math class. So unless the school requires its students to take a math class once a year, the observations are independent.
So yes, the conditions for inference are met for this data and model, and the slight problems we noticed with the simple linear model have disappeared.
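One way to look directly at the question of residuals growing over time is to plot the residuals against the academic year. The sketch below is an assumption-laden variant of the analysis above, not the original code: it refits the same model using a `data` argument, drops 2003 by value instead of by row position, and uses the hypothetical names `enroll` and `mod_spring2`. It assumes the Stat2Data package is installed and that `MathEnrollment` has the columns `Ayear`, `Fall`, and `Spring` shown by `head()`.

```r
library(Stat2Data)
data(MathEnrollment)

# Drop the 2003 academic year by value rather than by row index
enroll <- subset(MathEnrollment, Ayear != 2003)

# Same model as modSpring, written with a data argument for shorter term names
mod_spring2 <- lm(Spring ~ Fall + Ayear, data = enroll)

# Residuals against academic year: a trend or fan shape here would suggest
# that a time pattern is still left in the residuals
plot(enroll$Ayear, resid(mod_spring2),
     xlab = "Academic year (Ayear)", ylab = "Residual",
     main = "Residuals vs academic year")
abline(h = 0, lty = 2)

# Standard diagnostic plots, requested one at a time
plot(mod_spring2, which = 1)  # Residuals vs Fitted
plot(mod_spring2, which = 2)  # Normal Q-Q
```

Fitting with `data = enroll` should give the same coefficients as `modSpring`, and it also makes it possible to call `predict()` with new data later on.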