In this R guide we will cover the concepts of simple linear regression. First, we will choose a dataset, and then we will identify and plot our predictor and response variables. Once we’re familiar with our data and have analyzed our scatter plot, we will create a simple linear regression model. We will add the line of best fit to our scatter plot, interpret the results, and test the significance of our predictor. Then we will use our model to calculate predictions and residuals. Once we understand the basics of our graph and regression model, we will test whether the linear regression assumptions hold for our model. In addition, we will calculate the mean square error, coefficient of determination, and correlation coefficient for our model. Finally, we will create confidence intervals and prediction intervals.
For this document we will be using the “nlschools” dataset, which is in the MASS library. Let’s load and attach our dataset. Once it’s attached, we won’t have to reference the data frame explicitly every time we use one of its variables.
library(MASS)
data(nlschools)
attach(nlschools)
Now that our dataset is attached, let’s become more familiar with it. Use the “head” command to see the first six rows of our dataset. Use the “names” command to view all of the variables, and use the “summary” command to see an overview of the data. We will revisit the “summary” command later in this R guide.
head(nlschools)
## lang IQ class GS SES COMB
## 1 46 15.0 180 29 23 0
## 2 45 14.5 180 29 10 0
## 3 33 9.5 180 29 15 0
## 4 46 11.0 180 29 23 0
## 5 20 8.0 180 29 10 0
## 6 30 9.5 180 29 10 0
names(nlschools)
## [1] "lang" "IQ" "class" "GS" "SES" "COMB"
summary(nlschools)
## lang IQ class GS
## Min. : 9.00 Min. : 4.00 15580 : 33 Min. :10.00
## 1st Qu.:35.00 1st Qu.:10.50 5480 : 31 1st Qu.:23.00
## Median :42.00 Median :12.00 15980 : 31 Median :27.00
## Mean :40.93 Mean :11.83 16180 : 31 Mean :26.51
## 3rd Qu.:48.00 3rd Qu.:13.00 18380 : 31 3rd Qu.:31.00
## Max. :58.00 Max. :18.00 5580 : 30 Max. :39.00
## (Other):2100
## SES COMB
## Min. :10.00 0:1658
## 1st Qu.:20.00 1: 629
## Median :27.00
## Mean :27.81
## 3rd Qu.:35.00
## Max. :50.00
##
Becoming familiar with the data is a necessary first step to working with the data and interpreting the results. However, there is no need to submit the results from the previous three commands.
Now that we’re familiar with the data, we can create a scatter plot to illustrate the relationship between two quantitative variables. For this R guide, we are interested in the relationship between verbal IQ and language test score. The population of our dataset consists of students; verbal IQ is the student’s verbal IQ, and language test score is the score that the student earned on a language test. Our predictor is verbal IQ (IQ), and our response is language test score (lang). The rationale is that students with higher verbal IQs will tend to have higher language test scores.
Let’s create the scatter plot. Use the “plot” command and list the two variables, separated by a comma, with the independent variable listed first.
plot(IQ, lang)
To make the graph easier to interpret, we can rename the axis labels by typing:
plot(IQ, lang, ylab= "Language Test Score", xlab= "Verbal IQ")
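Because verbal IQ is recorded in half-point steps, many students share the same value and their points overlap. As an optional refinement, base R’s “jitter” function adds a small amount of random noise to the x-values so that overlapping points become visible:
plot(jitter(IQ), lang, ylab= "Language Test Score", xlab= "Verbal IQ")  # optional: spread overplotted points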
By looking at our scatter plot, we see that there might be a positive, linear relationship between verbal IQ and language test scores. Students with higher verbal IQs tend to have higher language test scores. However, the points in our plot aren’t packed very tightly, so it’s hard to determine the strength of this potential linear relationship just by looking at the scatter plot.
Now that we’ve created our scatter plot, let’s create our simple linear regression model. This model will be in the following form: \[\hat{y_i}= \hat{\beta_0}+\hat{\beta_1} x_i\] For our model regarding the relationship between verbal IQ (the predictor) and language test score (the response), the variables are as follows:
\(\hat{y_i}\) = point prediction for language test score when verbal IQ is \(x_i\)
\(\hat{\beta_0}\) = estimate of the intercept = estimated language test score when verbal IQ is 0
\(\hat{\beta_1}\) = estimate of the slope = estimated change in language test score associated with a one unit increase in verbal IQ
Let’s name our model “nls.mod” so that we can call on it easily. To create the model, use the “lm” command with our dependent variable, followed by “~”, and then our independent variable. Then, call on the model to get the estimates of our intercept and slope.
nls.mod <- lm(lang ~ IQ)
nls.mod
##
## Call:
## lm(formula = lang ~ IQ)
##
## Coefficients:
## (Intercept) IQ
## 9.528 2.654
The output tells us that the intercept estimate is 9.528. Literally, this is the predicted language test score for a student with a verbal IQ of 0; however, this interpretation isn’t meaningful because none of the observations in our dataset have a verbal IQ near or equal to 0. Our slope estimate is 2.654, which means that for every one-unit increase in a student’s verbal IQ, we would expect their language test score to increase by 2.654 points.
We can write out our specific model in the following form:\[\hat{y_i}= 9.528+2.654 x_i\]
Now that we’ve created our simple linear regression model, let’s graph our line of best fit on our scatter plot. First, recall the scatter plot that we created earlier. Then use the “abline” command with a comma-separated list of our intercept and slope estimates; this will create the line of best fit.
plot(IQ, lang, ylab= "Language Test Score", xlab= "Verbal IQ")
abline(9.528, 2.654)
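As a shortcut, the “abline” command also accepts the fitted model object directly, which uses the full-precision coefficient estimates instead of our rounded values:
abline(nls.mod)  # draws the same line from the unrounded coefficients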
With our line of best fit included on our graph, it looks like a linear model could fit our data. However, since the points aren’t tightly packed, it’s not certain that a linear model fits our data well. We will calculate the correlation coefficient later in this R guide, which measures the strength of the linear relationship between our variables.
Now that we’ve created our model, let’s test the significance of our predictor, verbal IQ. We can do this with a hypothesis test based on the t-distribution. Our null hypothesis states that verbal IQ does not have a linear relationship with language test score; in other words, the true coefficient on verbal IQ in our linear regression model is 0. Our alternative hypothesis states that verbal IQ does have a linear relationship with our response, so the coefficient is not equal to 0. Our hypotheses can be written in the following form: \[H_0:\beta_1=0\] \[H_a:\beta_1 \neq 0\]
If the test shows that verbal IQ is significant, we will reject the null hypothesis in favor of the alternative and conclude that there is sufficient evidence to show that verbal IQ does have a linear relationship with language test score.
We can use the “summary” function to test the significance of our predictor.
summary(nls.mod)
##
## Call:
## lm(formula = lang ~ IQ)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.7022 -4.3944 0.6056 5.2595 26.2212
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.52848 0.86682 10.99 <2e-16 ***
## IQ 2.65390 0.07215 36.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.137 on 2285 degrees of freedom
## Multiple R-squared: 0.3719, Adjusted R-squared: 0.3716
## F-statistic: 1353 on 1 and 2285 DF, p-value: < 2.2e-16
The summary function tells us a lot about our model, but to test the significance of our predictor we care about the p-values from our hypothesis test. Using the t-distribution, the Pr(>|t|) column tells us the significance of our intercept and our predictor. The p-value for verbal IQ is <2e-16, which is approximately zero. Since our p-value is so small, regardless of our significance level, we can reject the null hypothesis in favor of the alternative. Therefore, we can conclude that verbal IQ does have a linear relationship with language test score.
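If we want these numbers programmatically instead of reading them off the printout, the coefficient table can be pulled out of the summary object as a matrix:
summary(nls.mod)$coefficients           # estimates, standard errors, t values, p-values
summary(nls.mod)$coefficients["IQ", 4]  # the Pr(>|t|) entry for the slope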
If we wanted to, we could also use the F-distribution to test the significance of our predictor. The hypotheses would be the same as above. The summary function gives us the output for the F-test: the p-value for our model is < 2.2e-16, which is effectively zero. Therefore, using the F-test we would also reject the null hypothesis and conclude that verbal IQ has a linear relationship with language test score.
If we wanted to, we could test the significance of our intercept estimate in the same way that we tested the significance of our slope estimate. We would write our hypotheses using \(\beta_0\) instead of \(\beta_1\), and we would find the corresponding p-value in the summary output to make our conclusion.
Our estimates for the intercept and slope are also given in the “Estimate” column of the model summary. In addition, the summary command tells us the residual degrees of freedom for our model, 2285: there are 2287 observations in our dataset, and we estimate two regression coefficients (df = number of observations - number of regression coefficients).
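We can verify this count directly in R; the residual degrees of freedom should equal the number of rows minus the number of estimated coefficients:
nrow(nlschools) - length(coef(nls.mod))  # 2287 observations minus 2 coefficients
df.residual(nls.mod)                     # the same value, 2285, stored in the model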
Next, we can use our simple linear regression model to make predictions about our data. We can select a student from our dataset and enter their verbal IQ into our model to get a predicted value of their language test score.
For example, let’s predict the language test score for the 1000th student in our data. The verbal IQ and language test score for this student are in the 1000th row of our dataset, and to access their verbal IQ we use indexing. Specifically, we list the name of the dataset, followed by a set of brackets. In the brackets we list the row and column numbers of the value we want. Since we want the verbal IQ for the student in the 1000th row, we list 1000 as the row number and 2 as the column number, because verbal IQ is the variable in the second column of our data.
nlschools[1000, 2]
## [1] 15
Our output tells us that the verbal IQ for the student in the 1000th row is 15. Thus, we will plug 15 into our regression equation as the value for our independent variable, and this will allow us to predict the value of this student’s language test score.
One way to do so would be to type out the value of our intercept, slope, and verbal IQ to have R calculate it.
9.52848 + 2.65390 * 15
## [1] 49.33698
Here, R tells us that a prediction for the 1000th student’s language test score is 49.33698.
Instead of typing everything out, we could use the “coef” command. This command returns the intercept and slope of our model as a vector of length two, with the intercept listed first.
coef(nls.mod)
## (Intercept) IQ
## 9.528484 2.653896
If we matrix multiply this vector by another vector of length 2 containing 1 and the value of our independent variable (verbal IQ), we get the predicted value of our dependent variable (language test score). In our example, we will multiply our coefficient vector by c(1, 15), since the verbal IQ for the student in the 1000th row is 15. To matrix multiply, use the %*% operator.
coef(nls.mod)%*%c(1, 15)
## [,1]
## [1,] 49.33692
Our output tells us that the predicted language test score for the student in the 1000th row is 49.33692. This is essentially the same as the output from the first method; the two don’t match exactly because we rounded the intercept and slope values in the first method.
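Instead of matrix multiplication, we could also use R’s “predict” command, which handles the calculation for us. A quick sketch, supplying the verbal IQ value in a small data frame:
predict(nls.mod, newdata = data.frame(IQ = 15))  # should match 49.33692
This is the same machinery we will use later to build confidence and prediction intervals.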
Since we’re able to identify the observed values in our dataset and calculate predicted values, we can calculate residuals for our data. A residual is the observed value minus the predicted (expected) value. Let’s calculate the residual for the language test score of the 1000th student. First, let’s identify this student’s actual language test score. Since language test score is listed in the first column of the data, we can use the following indexing to find the student’s actual score.
nlschools[1000, 1]
## [1] 52
We see that the observed value for the student’s test score is 52. Earlier we learned how to calculate the predicted value, so let’s use what we’ve learned to calculate the residual.
nls.obs <- nlschools[1000, 1]
IQ.1000 <- nlschools[1000, 2]
nls.pred <- coef(nls.mod)%*%c(1, IQ.1000)
nls.resid <- nls.obs - nls.pred
nls.resid
## [,1]
## [1,] 2.663082
Our residual is 2.663082, which means that our predicted value for the student’s language test score was 2.663082 points lower than the student’s observed language test score.
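Since the fitted model stores every residual, we can also check our hand calculation against R’s stored value:
nls.mod$residuals[1000]  # residual for the 1000th observation; should match 2.663082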
Now that we’re able to create and plot our model and use it for predictions, we should check to see if our linear regression assumptions hold for our model. There are four assumptions we make for our model, and if they hold we can rely on our model to produce accurate estimates and intervals.
First we assume that the population of potential errors has a mean value equal to zero for any given verbal IQ. To check this assumption, we call on all of our residuals and calculate the mean.
all.resids <- nls.mod$residuals
mean(all.resids)
## [1] 1.522946e-15
The mean of the residuals is essentially zero, so this check raises no concerns. (The tiny nonzero value is just floating-point rounding error; whenever the model includes an intercept, least squares forces the residuals to average exactly zero.)
The second assumption is the normality assumption, which states that the population of potential errors has a normal distribution for any given value of our independent variable, verbal IQ. We can check this assumption by first looking at a histogram of our residuals.
hist(all.resids)
The histogram is somewhat helpful, and it suggests that our errors are roughly normally distributed. However, it is easier to judge deviations from a straight line than from the curved shape of a histogram. To do this, we can plot the quantiles of our residuals against the quantiles of a normal distribution; if the residuals from our model are normally distributed, they will follow a straight line.
qqnorm(all.resids)
qqline(all.resids)
From our graph we can see that our residuals fall very close to the straight line. Therefore, we can conclude that our residuals (errors) are approximately normally distributed.
The third assumption is the constant variance assumption, also called homoscedasticity. This assumption states that the variance of the residuals will be equal for all values of our independent variable, verbal IQ. We could check this assumption by looking at our original data.
plot(IQ, lang, ylab= "Language Test Score", xlab= "Verbal IQ")
We can try to see if the vertical spread of our points is equal for each value of our independent variable (verbal IQ). From this graph it looks like there might not be equal variances because the spread of the points looks wider for verbal IQ values between 6 and 12 than it does for higher or lower values.
However, a better way to test homoscedasticity is to plot the residuals against the values of our independent variable, verbal IQ. We can also add a horizontal line at zero to make it easier to compare the variance across verbal IQ values.
plot(nls.mod$residuals ~ IQ)
abline(0,0)
From this graph it is much easier to see that there is not homoscedasticity; rather, there is heteroscedasticity, because the variance differs across verbal IQ values: the vertical spread of the points isn’t consistent from one verbal IQ value to the next. It appears that the variance is lower for high verbal IQs than it is for mid-range or low verbal IQs.
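For another view of the same issue, R’s built-in diagnostic plots for “lm” objects can be used; the scale-location plot in particular is designed for checking constant variance:
plot(nls.mod, which = 1)  # residuals vs. fitted values
plot(nls.mod, which = 3)  # scale-location plot: a flat trend suggests constant variance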
As we continue to use our simple linear regression model to produce estimates and intervals, we will need to keep in mind that not all of the linear regression assumptions hold. The residual plot also suggests that the model tends to under-predict language test scores for low verbal IQs and over-predict them for high verbal IQs, and the variance for high verbal IQs is lower than the variance for mid-range and low verbal IQs. This heteroscedasticity will directly affect our confidence and prediction intervals, which we will note when we create and interpret those intervals.
The final assumption is the independence assumption, which states that any value of our error term is statistically independent of any other error value. This assumption is assessed by considering how our data was collected. The dataset’s help page (?nlschools) tells us that information about the students was collected from 132 classes in 131 schools throughout the Netherlands. Since information was gathered from a wide variety of classes and schools, we will treat our data as independent.
Another way that we can use our model is to find the point estimate for the constant variance of our residuals (errors), which is the mean square error (MSE). The “summary” command gives us the information we need.
summary(nls.mod)
##
## Call:
## lm(formula = lang ~ IQ)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.7022 -4.3944 0.6056 5.2595 26.2212
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.52848 0.86682 10.99 <2e-16 ***
## IQ 2.65390 0.07215 36.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.137 on 2285 degrees of freedom
## Multiple R-squared: 0.3719, Adjusted R-squared: 0.3716
## F-statistic: 1353 on 1 and 2285 DF, p-value: < 2.2e-16
In our output, the “Residual standard error” is 7.137. Note that this is the point estimate of the constant standard deviation of our errors, not the variance: the residual standard error is the square root of the mean square error. Squaring it, \(7.137^2 \approx 50.94\), gives us the MSE, the point estimate for the constant variance of our errors.
Remember, earlier we showed that the constant variance assumption doesn’t hold for our model, so this single pooled MSE is only a rough summary of the error variance.
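To compute the MSE directly, we can pull the residual standard error out of the summary object and square it, or equivalently divide the residual sum of squares by the residual degrees of freedom:
rse <- summary(nls.mod)$sigma                 # residual standard error, 7.137
rse^2                                         # MSE, approximately 50.94
sum(resid(nls.mod)^2) / df.residual(nls.mod)  # same value, computed by hand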
We can measure the usefulness of our model by calculating the coefficient of determination, commonly denoted \(r^2\). The coefficient of determination is the percentage of the variability of the response that is explained by its linear relationship with the predictor. In other words, it equals the explained variance divided by the total variance. We can use the “summary” command to find the coefficient of determination, which is called “Multiple R-squared” in the output.
summary(nls.mod)
##
## Call:
## lm(formula = lang ~ IQ)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.7022 -4.3944 0.6056 5.2595 26.2212
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.52848 0.86682 10.99 <2e-16 ***
## IQ 2.65390 0.07215 36.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.137 on 2285 degrees of freedom
## Multiple R-squared: 0.3719, Adjusted R-squared: 0.3716
## F-statistic: 1353 on 1 and 2285 DF, p-value: < 2.2e-16
From our output we see that our coefficient of determination is 0.3719, which means that our model accounts for 37.19% of the total variance of language test scores.
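Rather than reading it off the full printout, we can also extract this value directly from the summary object:
summary(nls.mod)$r.squared  # coefficient of determination, about 0.3719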
While the coefficient of determination is important, its square root, the correlation coefficient, can be more helpful when analyzing a linear regression model. Correlation, denoted \(r\), describes the linear relationship between our predictor and response variables. If \(r = 0\) there is no linear relationship between the two variables, and if \(|r| = 1\) there is a perfect linear relationship. Generally, \(|r|\) values closer to 1 indicate a stronger linear relationship than \(|r|\) values closer to 0. Also, if \(r < 0\) the linear relationship is negative, meaning that as the independent variable increases, the dependent variable decreases; if \(r > 0\) the linear relationship is positive, meaning that as the independent variable increases, the dependent variable increases.
Let’s calculate the correlation for our model to analyze the linear relationship between our predictor, verbal IQ, and our response, language test score. We can use the “cor” command with each of our variables listed to do so.
cor(IQ, lang)
## [1] 0.6098195
Our correlation between verbal IQ and language test score is 0.6098195. Since the correlation is positive, we know that as verbal IQ increases, we can expect language test score to increase as well. This positive correlation is moderately strong, but not extremely strong, because it isn’t very close to 1. In the context of our dataset, this means that higher verbal IQs are generally associated with higher language test scores, but individual scores vary quite a bit around that trend.
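As a quick consistency check, squaring the correlation should reproduce the coefficient of determination from the model summary, since in simple linear regression \(r^2\) is literally the square of \(r\):
cor(IQ, lang)^2  # should match Multiple R-squared, about 0.3719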
Next, let’s calculate confidence intervals for our slope and intercept estimates. The “confint” command will give us the lower and upper bounds for each of these confidence intervals. To start, let’s calculate 95% confidence intervals for each of the parameters in our model. The “confint” command defaults to a 95% confidence level, so the only argument we need to provide is our model. Also, keep in mind that since not all of the regression assumptions hold for our model, these confidence intervals might not be entirely accurate.
confint(nls.mod)
## 2.5 % 97.5 %
## (Intercept) 7.828646 11.22832
## IQ 2.512401 2.79539
Our output tells us that a 95% confidence interval for our intercept estimate is [7.828646, 11.22832]. This means that we can be 95% certain that the value of our intercept is between 7.828646 and 11.22832. Equivalently, if many confidence intervals were constructed for our intercept estimate, 95% of the intervals would include the true intercept value. The confidence interval for our slope estimate is [2.512401, 2.79539], which means that we can be 95% certain that the value of our slope is between 2.512401 and 2.79539. Similarly, if many confidence intervals were constructed for our slope estimate, 95% of the intervals would include the true value for the slope.
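If we wanted a different confidence level, the “confint” command accepts a “level” argument; for example, 90% intervals for both coefficients:
confint(nls.mod, level = 0.90)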
Confidence intervals can also be constructed for other quantities related to our dataset. For example, let’s construct a 90% confidence interval for the average language test score of students with a verbal IQ of 10. To do so, we create a new data frame containing the single verbal IQ value we care about. We will call it “new.nls.data”.
new.nls.data <- data.frame(IQ = 10)
Now we can create our confidence interval for the mean language test score of students with a verbal IQ of 10. We will use the “predict” command with four arguments: the model we want to use, the new data frame, the type of interval we want (“confidence”), and the confidence level (0.90).
confIQ <- predict(nls.mod, new.nls.data, interval= "confidence", level=.9)
confIQ
## fit lwr upr
## 1 36.06744 35.73921 36.39567
Our output gives us the fitted value along with the lower and upper bounds of our 90% confidence interval. The fitted value, 36.06744, is our estimate of the mean language test score of all students with a verbal IQ of 10. The 90% confidence interval for this mean is [35.73921, 36.39567], meaning we are 90% confident that the mean language test score of all students with a verbal IQ of 10 lies between 35.73921 and 36.39567. Equivalently, if many confidence intervals were constructed in this way, 90% of them would contain the true mean language test score for students with a verbal IQ of 10.
Again, keep in mind that this estimate and interval may not be exactly right because not all of the linear regression assumptions hold for our model. Specifically, the interval may be too narrow, because the pooled MSE appears to under-estimate the variance of language test scores for mid-range verbal IQs like 10.
In addition to constructing confidence intervals for parameters, we can create prediction intervals for individual observations (provided the linear regression assumptions hold). To demonstrate this, let’s now create a 90% prediction interval for a student whose verbal IQ is 10. Notice that instead of estimating the average language test score of all students with a verbal IQ of 10, we now want to predict the language test score of a single student with a verbal IQ of 10. Since we are still using a verbal IQ of 10, we can reuse the data frame that we created for the last calculation. However, this time we will tell R that we want a prediction interval, not a confidence interval.
predIQ <- predict(nls.mod, new.nls.data, interval= "prediction", level = .90)
predIQ
## fit lwr upr
## 1 36.06744 24.31822 47.81666
Similar to when we created the confidence interval, our output gives us a fitted value and the lower and upper bounds for the 90% prediction interval. The fitted value, 36.06744, is now a prediction of an individual student’s language test score if that student has a verbal IQ of 10. Notice that this value is the same as the estimated mean from our confidence interval. However, the lower and upper bounds of our prediction interval are very different from the bounds of our confidence interval. Our 90% prediction interval is [24.31822, 47.81666], which means that we can be 90% confident that a student with a verbal IQ of 10 will have a language test score between 24.31822 and 47.81666. It also means that if many prediction intervals were created in this manner, 90% of them would contain the student’s actual test score.
Keep in mind that the estimate and prediction interval might not be accurate because not all of the regression assumptions hold for our model. As with the confidence interval, the pooled MSE appears to under-estimate the variance of test scores at a verbal IQ of 10, so this interval may be too narrow.
Clearly our prediction interval is wider than our confidence interval, but how much wider? We can use R to calculate the width of each interval by matrix multiplying it by the vector (0, -1, 1).
confIQ %*% c(0, -1, 1)
## [,1]
## 1 0.6564533
predIQ %*% c(0, -1, 1)
## [,1]
## 1 23.49845
Our output tells us that the width of our confidence interval is 0.6564533 and the width of our prediction interval is 23.49845. It makes sense that the prediction interval is wider: the variance of the estimated mean language test score is much smaller than the variance of any individual language test score, since averaging cancels out much of the individual variation. Plus, the fact that the points in our scatter plot weren’t tightly packed illustrates that there is fairly high variance in language test scores for a given verbal IQ.
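Equivalently, since the “predict” output is a matrix with named columns, we could subtract the columns directly:
confIQ[, "upr"] - confIQ[, "lwr"]  # width of the confidence interval
predIQ[, "upr"] - predIQ[, "lwr"]  # width of the prediction interval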
In review, throughout this R guide we learned how to do the following: create and analyze a scatter plot from our dataset; create and interpret a simple linear regression model; plot and interpret the line of best fit on our scatter plot; test the significance of our predictor; calculate and interpret point predictions and residuals using our model; test the linear regression assumptions; calculate and interpret the mean square error, the coefficient of determination, and the correlation coefficient; create and interpret confidence intervals for our regression coefficients; create and interpret a confidence interval for the mean of our response; and create and interpret a prediction interval for our response.