In this R guide we will be working with the GPAGender data from the Lock5withR package.
library(Lock5withR)
## Warning: package 'Lock5withR' was built under R version 3.4.3
data("GPAGender")
The best way to know what we are working with is to run some basic functions for our exploratory data analysis, including head(), names(), and dim(). But first, we have to attach the data.
attach(GPAGender)
head(GPAGender)
## Exercise SAT GPA Pulse Piercings GenderCode Sex
## 1 10 1210 3.13 54 0 1 Male
## 2 4 1150 2.50 66 3 0 Female
## 3 14 1110 2.55 130 0 1 Male
## 4 3 1120 3.10 78 0 1 Male
## 5 3 1170 2.70 40 6 0 Female
## 6 5 1150 3.20 80 4 0 Female
Using the head() function, we are able to see the first 6 rows of the GPAGender data set. We learn that the variables are Exercise, SAT, GPA, Pulse, Piercings, GenderCode (1 for male, 0 for female), and Sex.
names(GPAGender)
## [1] "Exercise" "SAT" "GPA" "Pulse" "Piercings"
## [6] "GenderCode" "Sex"
The names() function gives us the name of each variable.
dim(GPAGender)
## [1] 343 7
The dim() function tells us two numbers: first, the number of observations (343), and second, the number of variables (7).
Chapter 3 focuses on simple linear regression, modeling the linear relationship between two variables. Considering the GPAGender data, we will be looking at the relationship between two quantitative variables. The dependent variable will be a student’s college GPA (on a 4.00 scale). The independent variable is combined SAT score, out of a maximum of 1600. For reference, one has to score at least a 1273 in order to be in the top 10% of students who take the test. Using this data, we will figure out whether SAT score is a good predictor of college GPA.
First, we want to find the slope and y-intercept of the line. We can do this in two ways. To start, we will use the lm() function to find the coefficients. In its formula, the dependent variable goes first (before the ~), and the independent variable goes second.
The model mod1 will capture the relationship between SAT score and GPA.
mod1 <- lm(GPA~SAT)
mod1
##
## Call:
## lm(formula = GPA ~ SAT)
##
## Coefficients:
## (Intercept) SAT
## 1.680514 0.001226
This gives us the y-intercept and the slope of the regression line describing the relationship between SAT score and GPA. The fitted equation is \[\widehat{GPA}_i = 1.6805 + 0.001226 \, SAT_i\] The interpretation of the y-intercept: if a person were to get a 0 on their SAT, their predicted GPA would be 1.68. The interpretation of the slope: for every additional point on the SAT a student earns, their GPA increases, on average, by 0.001226 points.
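If we would rather not read the coefficients off the printout, we can extract them from the model object; a quick sketch using base R’s coef() (the name b is just illustrative):

b <- coef(mod1)  # named vector of the fitted coefficients
b[1]             # y-intercept, about 1.6805
b["SAT"]         # slope, about 0.001226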
The second way we can find this regression equation is by using the function summary().
summary(mod1)
##
## Call:
## lm(formula = GPA ~ SAT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.17600 -0.21841 0.03723 0.26030 1.00788
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.6805139 0.2006508 8.375 1.45e-15 ***
## SAT 0.0012258 0.0001656 7.402 1.06e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3696 on 341 degrees of freedom
## Multiple R-squared: 0.1384, Adjusted R-squared: 0.1359
## F-statistic: 54.78 on 1 and 341 DF, p-value: 1.06e-12
The “Estimate” column is where one can find the coefficients: the y-intercept and the slope.
We can use this line to predict a GPA for any SAT score we want. If we choose an SAT score of 1000, we can plug it into the equation.
1.6805 + (1000*.00123)
## [1] 2.9105
This model says that, with a 1000 score on the SAT, a student should earn, on average, roughly a 2.91 GPA.
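Rather than typing rounded coefficients by hand, we can also let predict() do the arithmetic at full precision; a minimal sketch (the name newdf is just illustrative):

newdf <- data.frame(SAT = 1000)
predict(mod1, newdf)  # point prediction, roughly 2.91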
To see how well the line fits, we can plot the data points and the regression line we found on the same scatterplot.
plot(SAT, GPA, col = "pink", pch = 16, xlab = "SAT (800 - 1600)", ylab = "GPA (2.00 - 4.00)")
abline(1.6805,0.0012258)
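As a shortcut, abline() also accepts a fitted model directly and reads the intercept and slope out of it for us:

abline(mod1)  # same line as above, but using the full-precision coefficients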
Something to note: the axes of the graph start at 800 for the SAT and 2.0 for the GPA. This is because the minimum GPA in the data set is 2.00 and the minimum SAT score in the data set is 820. Both of these numbers can easily be determined using the min() function.
min(GPAGender$GPA)
## [1] 2
min(GPAGender$SAT)
## [1] 820
So, even though the actual scales of both GPA and SAT extend lower than the scatterplot shows, that area doesn’t concern us. The scatterplot zooms in because the rest would be extra, blank space that we don’t need.
To check the accuracy of this line, we can use certain statistics to understand it. First is R², the coefficient of determination, which measures the usefulness of the linear regression model. It is reported in the summary(mod1) output above as the “Multiple R-squared.”
Our R² is 0.1384; a perfectly fitted line would have an R² of 1.0000. R² tells us the percentage of variability in the response that is linearly attributed to the predictor variables of the model. 13.84% is not very high, but I’m not surprised: there could be a huge combination of predictors influencing college GPA that SAT score doesn’t cover. Some examples might be a student’s major, how many hours they study, or whether they are taking classes that interest them.
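If we want the number itself rather than the printout, it can be extracted from the summary object; a quick sketch:

summary(mod1)$r.squared  # about 0.1384, the Multiple R-squared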
The second statistic is the correlation coefficient, r. To find r you take the square root of R²: the positive square root if the slope of the line is positive, the negative square root if the slope is negative. This value is always between -1 and 1. The closer it is to -1, the more strongly negatively related the two variables are (y decreases as x increases, or vice versa). The closer it is to 1, the more strongly positively related they are (both increase or decrease together). When r is zero, there is no linear relationship between the independent and dependent variables.
sqrt(.1384)
## [1] 0.3720215
Our r is 0.3720, so there is a moderate positive relationship between SAT scores and GPA. This says that as SAT scores increase, students’ GPAs tend to increase as well.
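We can automate the sign rule from above by borrowing the sign of the fitted slope; a small sketch (the name r is just illustrative):

r <- unname(sign(coef(mod1)["SAT"])) * sqrt(summary(mod1)$r.squared)
r  # about 0.3720, positive because the slope is positive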
We can also use the cor() function to find the correlation coefficient directly, instead of taking the square root of the coefficient of determination.
cor(GPA, SAT, method = "pearson")
## [1] 0.3720446
We get roughly the same number; the last few decimal places differ because we rounded R² before taking the square root.
Even though we have correlation between the two variables, we have to make sure we have not violated any of the assumptions of linear regression. One of the assumptions is that the error terms (the residuals) follow a normal distribution. First, we calculate the residuals of the model in R. Second, we can use the hist() function to separate the residuals into bins and see whether they follow a normal distribution curve.
resids <- mod1$residuals
hist(resids)
Looking at this histogram, the residuals appear to mostly follow a normal distribution. Luckily, if we have any lingering doubts after looking at the histogram, we can use a different method to double-check: a normal quantile-quantile (Q-Q) plot, which plots the quantiles of the residuals against the quantiles of the normal distribution, with a reference line, to see whether they line up.
qqnorm(resids)
qqline(resids)
The points fall close to the reference line, so it looks like our residuals follow the normal distribution.
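For anyone who wants a formal check to go alongside the plots, base R also provides the Shapiro-Wilk normality test; interpreting it is an extra step beyond the plots above, but a small p-value would be evidence against normality:

shapiro.test(resids)  # tests H0: the residuals are normally distributed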
The next assumption we have to check is constant variance; the fancier way of saying this is that we are checking for homoscedasticity. There are two ways we can do this: 1) plot the relationship and check whether the points are evenly spread out, or 2) plot the residuals against the independent variable, with a horizontal line at 0, and see whether they are evenly spread around it.
plot(SAT, GPA, ylab = "GPA Scores (2.0 - 4.0)", xlab = "SAT Scores (800 - 1600)", col="violetred", pch=20)
This plot still looks a little unclear, so we can instead graph the residuals against SAT with a horizontal line at 0.
plot(mod1$residuals ~ SAT, col = "purple", pch = 20, xlab = "SAT (800 - 1600)", ylab = "Residuals")
abline(h = 0)
After looking at this graph, it seems there is a little bit of evidence of non-constant variance. However, with the SAT and GPA comparison, this isn’t all that surprising: SAT scores are designed to clump around the center, with lower and lower occurrence as scores move away from the middle.
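Base R’s built-in regression diagnostics give a similar picture; asking plot() for the model’s first diagnostic draws the residuals against the fitted values with a smooth trend line:

plot(mod1, which = 1)  # residuals vs. fitted values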
The last assumption is the no-outliers condition. In order to fit a correct linear model, there should be no outliers in the data: outliers can severely impact the slope and intercept, so for an accurate model there should be none. This model does violate the assumption; there are a few outliers. However, they are not very extreme, so they won’t affect our model that much.
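One common way to spot them is to flag standardized residuals larger than about 2 in absolute value; the cutoff of 2 is a rule of thumb, not a hard rule. A quick sketch:

stres <- rstandard(mod1)  # standardized residuals of the model
which(abs(stres) > 2)     # row numbers of potential outliers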
Next, we want to find the residual standard error, which is reported in the summary(mod1) output above. We want this number to be low, because it measures the typical size of the difference between the model’s predicted values and the actual values.
Our residual standard error is 0.3696, which is okay. The residual standard error is in the units of our response variable; considering that GPA is only out of 4.00, 0.37 is slightly high.
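The definition becomes concrete if we compute it from the residuals ourselves: it is the square root of the sum of squared residuals divided by the residual degrees of freedom. A quick sketch (sigma() is the built-in shortcut):

sqrt(sum(resids^2) / df.residual(mod1))  # about 0.3696
sigma(mod1)                              # same value, extracted directly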
Next we will compute confidence intervals for the model coefficients. We use the confint() function to find the 90% confidence intervals.
confint(mod1, level = .9)
## 5 % 95 %
## (Intercept) 1.3495736190 2.011454191
## SAT 0.0009526512 0.001498957
The interpretation of this 90% confidence interval is: for every additional point on the SAT, I am 90% confident that the average GPA increases by between 0.0009527 and 0.0014990 points. The intercept row is interpreted as: I am 90% confident that, with a 0 on the SAT, the GPA would be between 1.35 and 2.01.
Our next task is finding the prediction and confidence intervals. The two intervals have similar R code, but differ in some of their basic traits. Prediction intervals are for one person: for example, if one person wanted to see what their GPA would be based on their SAT score. Confidence intervals are for the mean of all people with the specified SAT score. In general, prediction intervals are wider than confidence intervals: beyond the uncertainty in the estimated mean, they must also account for the variability of a single individual around that mean, so they are less precise and thus wider. With a 95% confidence interval, I am 95% confident that the mean response falls between the two given points. With a 95% prediction interval, I am 95% confident that an individual’s response falls between the two points given.
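The standard textbook formulas make the difference visible. With \(s_e\) the residual standard error, \(x^*\) the chosen SAT score, and \(t^*_{n-2}\) the critical t-value, the confidence interval for the mean response is \[\hat{y}^* \pm t^*_{n-2}\, s_e \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}\] while the prediction interval for an individual adds a 1 under the square root: \[\hat{y}^* \pm t^*_{n-2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}\] That extra 1 represents the individual’s own variability, and it is why the prediction interval is always wider.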
We can also compute the prediction and confidence intervals for a chosen SAT score by putting that score in a new data frame. We will then be able to find the prediction and confidence intervals for that exact number. To choose a number, we can look at the summary of the data and pick a mid- to high-range SAT score. By choosing this value, we find the confidence/prediction intervals given an SAT score of 1270.
summary(GPAGender)
## Exercise SAT GPA Pulse
## Min. : 0.000 Min. : 820 Min. :2.000 Min. : 35.0
## 1st Qu.: 5.000 1st Qu.:1130 1st Qu.:2.900 1st Qu.: 62.0
## Median : 8.000 Median :1200 Median :3.200 Median : 70.0
## Mean : 8.882 Mean :1206 Mean :3.158 Mean : 69.9
## 3rd Qu.:12.000 3rd Qu.:1270 3rd Qu.:3.400 3rd Qu.: 78.0
## Max. :30.000 Max. :1550 Max. :4.000 Max. :130.0
## Piercings GenderCode Sex
## Min. : 0.000 Min. :0.0000 Female:161
## 1st Qu.: 0.000 1st Qu.:0.0000 Male :182
## Median : 0.000 Median :1.0000
## Mean : 1.694 Mean :0.5306
## 3rd Qu.: 3.000 3rd Qu.:1.0000
## Max. :10.000 Max. :1.0000
After looking at the SAT section of the output, I will choose 1270, the third quartile of SAT scores.
newdata <- data.frame(SAT = 1270)
We have defined the new data frame, so now we can calculate the prediction interval.
predictint <- predict(mod1, newdata, interval="predict")
predictint
## fit lwr upr
## 1 3.237285 2.508897 3.965673
This tells us that for a 1270 SAT score, the GPA will most likely (with 95% confidence, the default level for predict()) be between 2.51 and 3.97. The fit column gives the point prediction, which says the GPA will most likely be around 3.24 when the SAT score is 1270.
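If we actually wanted a 90% interval to match the confint() call earlier, predict() takes a level argument:

predict(mod1, newdata, interval = "prediction", level = 0.90)  # narrower than the 95% default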
Next we will do the confidence interval.
confidentint <- predict(mod1, newdata, interval="confidence")
confidentint
## fit lwr upr
## 1 3.237285 3.192768 3.281803
This tells us that for the 1270 SAT score, the mean GPA should be between 3.19 and 3.28. The mean estimate (fit) says that the GPA will most likely be around 3.24. In other words, when the SAT score is 1270, I am 95% confident that the mean GPA is between 3.19 and 3.28.
Next we want to compare the widths of the two intervals. The prediction interval should be wider, because it must also account for an individual’s variability. Multiplying each interval’s (fit, lwr, upr) row by c(0, -1, 1) computes upr - lwr, the width of the interval.
predictint %*% c(0, -1, 1)
## [,1]
## 1 1.456776
confidentint %*% c(0, -1, 1)
## [,1]
## 1 0.08903531
R tells us that the prediction interval is wider, which is exactly what we expect!
Next, we will make sure they are centered at the same point.
predictint[1] == confidentint[1]
## [1] TRUE
Yes, both intervals are centered at the same point, the fitted value. That is good; we don’t want our intervals to be centered around different points.
Lastly, there are F-tests and t-tests. Both can be found in the summary(mod1) output of our linear model from the beginning. For both, we need a null and an alternative hypothesis: the null is \(H_0: \beta_1 = 0\) and the alternative is \(H_a: \beta_1 \neq 0\). Putting these equations into words, the null hypothesis says that the slope (which shows how much the predictor changes the response) is 0, meaning the predictor does not change the response.
The F-statistic is 54.78, with a p-value of \(1.06 \times 10^{-12}\). Using the usual alpha of 0.05, this p-value is smaller than alpha, which means we reject the null hypothesis. We know that the slope is not zero, which shows that we have a line with a non-zero slope.
The test statistic for the t-test and its p-value are found in the coefficients section. The test statistic is 7.402, and the p-value is \(1.06 \times 10^{-12}\), which is less than 0.05, so we are able to reject the null hypothesis. This agreement is no coincidence: in simple linear regression the slope t-test and the overall F-test are equivalent, with \(F = t^2\) (here, \(7.402^2 \approx 54.8\)).
Both the t-test and the F-test show that we can reject the null, meaning that \(\beta_1 \neq 0\).
In conclusion, with the right data it can be quite easy to create a simple linear regression. However, there are many other components for testing the validity and accuracy of the model. One should use the coefficient of determination and the correlation coefficient to see how much of the variability is linearly attributed to the predictor and how strongly the two variables are related. One also needs to make sure that none of the assumptions are violated: residuals following the normal distribution, constant variance, and no outliers. Once the model is made, one can create confidence and prediction intervals using data points from the set. Lastly, F-tests and t-tests can be used to make sure the slope is not equal to zero, meaning that when the predictor changes, it influences the response to change as well.