Predicting how new college students will do during their first year is of major interest to colleges and high schools. If we can identify factors that might prove helpful, we can make adjustments to help more students succeed.
Our data today was collected with this purpose in mind. We have information on 219 college students. Our response variable is GPA, the GPA of the student at the end of their first year of college.
To load the data, run the following three lines of code:
install.packages("Stat2Data")
library(Stat2Data)
data("FirstYearGPA")
Once you have run these lines, immediately put a # in front of install.packages("Stat2Data") so the package is not re-installed every time your code runs; the line should read #install.packages("Stat2Data").
In addition to the response variable GPA, we have information on:
HSGPA - the student's high school GPA.
SATV - the student's score on the Verbal/Reading SAT.
SATM - the student's score on the Math SAT.
Male - a binary indicator for biological sex (1 = Male).
HU - the total number of hours of humanities classes the student took in high school.
SS - the total number of hours of social science classes the student took in high school.
FirstGen - a binary indicator for whether the student was the first in their family to go to college.
White - a binary indicator for whether the student self-identifies as white.
CollegeBound - a binary indicator for whether the student went to a high school where at least 50% of the students were intending to go to college.
Our first question of interest is the relationship between the score on the math portion of the SAT and first year college GPA. In general, is a higher score on the math SAT related to a higher GPA in the first year of college?
Build an appropriate LSLR model (Model 1) to explore this question. Based on your model, write out the LSLR line. How would you respond to the question of interest? Explain your reasoning. Hint: Notice that this question asks about a general (population) relationship, not just in the sample.
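As a sketch of how a model like Model 1 could be fit (assuming the data have been loaded as above; the object name is just a suggestion), lm() and summary() are the usual tools:
# Fit Model 1: first year GPA explained by math SAT score
Model1 <- lm(GPA ~ SATM, data = FirstYearGPA)
# The summary gives the estimated intercept and slope, plus the t-test for the slope
summary(Model1)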
Next, we want to add SAT verbal and high school GPA into the model along with SAT math. Why? Because all three of these things are generally used by colleges as a way to determine college acceptance! Is there actually a relationship between these things and first year college GPA?
Build an appropriate LSLR model (Model 2) to explore this question. After accounting for verbal SAT score and high school GPA, do we have evidence of a linear relationship between math SAT score and first year college GPA?
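One possible way to fit Model 2 (a sketch, assuming the same naming style as Model 1) is:
# Fit Model 2: add verbal SAT score and high school GPA alongside math SAT score
Model2 <- lm(GPA ~ SATM + SATV + HSGPA, data = FirstYearGPA)
# The t-test for SATM now reflects its contribution after accounting for SATV and HSGPA
summary(Model2)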
Looking at these two models, we notice something interesting. If we do not put high school GPA and the verbal SAT score in the model, that means we assume they do not explain any meaningful variation in first year GPA. Under this assumption, it looks like there is a relationship between math SAT score and first year GPA. As soon as we account for (i.e., put in the model) verbal SAT score and high school GPA, we no longer have evidence of such a relationship. Why is that??
To explore this, let's use the code anova(Model1). This allows us to break down the variance of first year GPA into two parts: the part explained by the model, and the part the model cannot explain. In other words, we analyze the variance.
Based on the analysis of variance, how much of the variance in first year GPA is explained by the math SAT score? Is this a large or small amount relative to how much is left over in the residuals (the RSS)?
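If you want to turn the sums of squares into a proportion, a small sketch (assuming Model1 is the object fit above) is:
# Break the variance of GPA into a model piece and a residual piece
aovModel1 <- anova(Model1)
aovModel1
# Proportion of the total variation in GPA explained by SATM
aovModel1$"Sum Sq"[1] / sum(aovModel1$"Sum Sq")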
Run anova(Model2). Based on this second analysis of variance, which of the explanatory variables explains the most variation in first year GPA? The least?
Generally, we say that explanatory variables that can explain more variance in Y are more predictive than other explanatory variables. Once we have explained all we can with that variable, we figure out how much more we can explain with the next most predictive explanatory variable. In other words, after accounting for the first variable, we check for a relationship between Y and the next X. This means that while an X may seem to have a relationship with Y when it is on its own in the model, that effect may go away or change when you add in more variables!
This is why it is so important to consider potential sources of variation in Y when fitting a model, and before drawing any conclusions.
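To see that the variance is handed out sequentially, one illustrative sketch (assuming the FirstYearGPA data frame is loaded) is to fit the same model with the predictors in two different orders and compare the ANOVA tables:
# Sequential sums of squares depend on the order of the predictors in the formula
anova(lm(GPA ~ SATM + SATV + HSGPA, data = FirstYearGPA))
anova(lm(GPA ~ HSGPA + SATV + SATM, data = FirstYearGPA))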
Create a new model called Model 3 that uses SAT verbal and high school GPA as explanatory variables. Do we have convincing evidence that Model 2 explains more variation in first year GPA than Model 3? Explain.
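Because Model 3 is nested inside Model 2, one way to make this comparison (a sketch, assuming both model objects exist as named above) is a nested F-test:
# Fit Model 3: verbal SAT score and high school GPA only
Model3 <- lm(GPA ~ SATV + HSGPA, data = FirstYearGPA)
# Nested F-test: does adding SATM (Model 2) explain significantly more variation?
anova(Model3, Model2)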
As we add extra predictors into the model, we are accounting for more possible sources of the variation in first year college GPA. This is a good thing!! We want to make sure statistical models consider all possible sources of variation in Y, or at least all plausible sources. However, as we can see, when we add in more predictors the story an individual predictor tells may change.
To note this, when we are formally writing out the interpretation of a coefficient for models with multiple predictors, we say something like "After accounting for high school GPA, we predict that every additional point in verbal SAT score is associated with between a 0.00056 to 0.0019 increase in first year GPA."
Using Model 3, and the style of interpretation given above, interpret the 95% confidence interval for high school GPA. Hint: Remember that the code confint(Model3) will be helpful here.
The interpretation in the previous question assumes high school GPA increases by 1, which is a HUGE change in GPA. Using Model 3, and the style of interpretation given above, interpret the coefficient for high school GPA in terms of an increase of 0.1, not 1.
So, if you don't like the scale of your predictor, you can change it! If your X is recorded in dollars, but you want to interpret in terms of changes of a hundred dollars, you can!
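As a sketch of the rescaling idea (assuming the Model3 object from above), you can either divide the estimate and confidence interval for HSGPA by 10, or refit the model with a rescaled predictor:
# Interpret per 0.1 increase in HSGPA: divide the interval for a 1-unit increase by 10
confint(Model3)["HSGPA", ] / 10
# Equivalently, refit with HSGPA measured in tenths of a GPA point
Model3_tenths <- lm(GPA ~ SATV + I(HSGPA * 10), data = FirstYearGPA)
confint(Model3_tenths)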
Now we are given our actual research objective. We want to build a model that explains as much of the variation in First Year GPA as possible. In other words, we want to maximize the adjusted R-squared.
How can we get started with this?? We have 9 possible explanatory variables in our data set, which means there are a grand total of 512 possible models we could build...and that does not include any possibility of interactions or polynomials!!! Crazy!!
To start, let's look for any needed transformations. This only applies to the numeric variables. Create 5 plots (one for each numeric explanatory variable) to check for linearity and stack them using grid.arrange(). An example of how to do this is below!! These are just example graphs, you will be replacing them with the ones we need.
g1 <- ggplot(cars, aes(x = speed)) + geom_histogram(bins = 10)
g2 <- ggplot(cars, aes(x = dist)) + geom_histogram(bins = 10)
gridExtra::grid.arrange(g1, g2, g1, g2, ncol = 2)
Just change ncol = to match the number of columns you want in your output. If you want two graphs side by side, use ncol = 2. If you want them one on top of the other, use ncol = 1.
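One possible version of the plots we actually need (a sketch; scatterplots of GPA against each numeric explanatory variable are one common way to check linearity) is:
library(ggplot2)
library(gridExtra)
# Scatterplots of first year GPA against each numeric explanatory variable
g1 <- ggplot(FirstYearGPA, aes(x = HSGPA, y = GPA)) + geom_point()
g2 <- ggplot(FirstYearGPA, aes(x = SATV, y = GPA)) + geom_point()
g3 <- ggplot(FirstYearGPA, aes(x = SATM, y = GPA)) + geom_point()
g4 <- ggplot(FirstYearGPA, aes(x = HU, y = GPA)) + geom_point()
g5 <- ggplot(FirstYearGPA, aes(x = SS, y = GPA)) + geom_point()
# Stack the five plots in a two-column grid
grid.arrange(g1, g2, g3, g4, g5, ncol = 2)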
Based on the graphs, I don't see any need for polynomials or any transformations at this stage...phew! But, we still have 512 possible models to consider. What do we do??
We will be using best subset selection (BSS) to help us. This technique can help us identify the combination of predictors that yields the highest value of adjusted R-squared.
The first step in running BSS in R is to load the library we need.
library(leaps)
This library contains all the code we will need to run BSS. Once we have it loaded, the command we need to run the first stage of BSS is regsubsets. For example, suppose I wanted to run BSS on three variables X1, X2, and X3. The code is then:
BSSout <- regsubsets(Y ~ X1 + X2 + X3, data = FirstYearGPA, nvmax = 3)
What does the nvmax do? It tells R the maximum number of predictors we want to allow in our model. So, right now we consider models of 1, 2, or 3 predictors.
Using the categorical explanatory variables only, run the first stage of BSS and call the output BSScat. Then, use the code plot(BSScat, scale = "adjr2") to plot the results.
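A sketch of this step (assuming the four categorical indicators listed above, Male, FirstGen, White, and CollegeBound, are the ones used) could look like:
library(leaps)
# First stage of BSS using only the categorical (indicator) predictors
BSScat <- regsubsets(GPA ~ Male + FirstGen + White + CollegeBound,
                     data = FirstYearGPA, nvmax = 4)
# Plot the candidate models, ordered by adjusted R-squared
plot(BSScat, scale = "adjr2")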
What is the highest value of adjusted R-squared you obtained, and what predictors were used to obtain that model?
What is the lowest value of adjusted R-squared you obtained, and what predictors were used to obtain that model?
BSS looks specifically at adjusted R-squared values, and ranks the models by this metric. Even tiny changes in adjusted R-squared will be reflected in the ordering of the table. This means that after running BSS, we have to decide whether these small changes are enough to motivate us to choose one model over another. This depends on your goal!
Is there evidence that the model in Question 10 explains more variance in GPA than the model in Question 11?
Now, use all the possible explanatory variables (categorical and numeric) to run the first stage of BSS and call the output BSSall. Then, use the code plot(BSSall, scale = "adjr2") to plot the results.
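One possible version of this step (a sketch using all nine explanatory variables in the data set) is:
# First stage of BSS with all nine explanatory variables
BSSall <- regsubsets(GPA ~ HSGPA + SATV + SATM + Male + HU + SS +
                       FirstGen + White + CollegeBound,
                     data = FirstYearGPA, nvmax = 9)
plot(BSSall, scale = "adjr2")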
Based on the results, explain which model you would choose to use to model first year GPA. There is more than one correct answer here!!
Based on your choice, build your model, and run an analysis of variance. Show your results. Which variables seem to explain the most variation in first year GPA?
Are your conditions for LSLR met? Justify your answers.
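For checking the conditions, a sketch (the object name ModelFinal and the predictors shown are only placeholders; substitute the model you chose above) is to look at the standard residual diagnostics:
# Placeholder only: replace the right-hand side with the predictors you chose
ModelFinal <- lm(GPA ~ HSGPA + SATV + SATM, data = FirstYearGPA)
# Residuals vs fitted values (checks linearity and constant variance)
plot(ModelFinal, which = 1)
# Normal QQ plot of the residuals (checks normality)
plot(ModelFinal, which = 2)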