In this R guide we will be covering concepts regarding simple linear regression. A simple linear regression model is used to analyze the relationship between a quantitative response and one quantitative predictor. The simple linear regression equation is in the following form: \[{y_i}= {\beta_0}+{\beta_1} x_i + {\epsilon_i}\] where \({y_i}\) is the response, \({\beta_0}\) is the intercept, \({\beta_1}\) is the slope, and \({\epsilon_i}\) is ther error term.
Our simple linear regression model will allow us to determine whether or not there is a linear relationship between the response and predictor from our data set, which are described in the follwoing section.
For this document we will be using the “nlschools” dataset. This dataset is in the MASS library. Let’s call and attach our dataset. Once our dataset is attached we won’t have to continuously reference it throughout our work. Also, use the “nlschools” command to learn about how and why the data was collected.
library(MASS)
data(nlschools)
attach(nlschools)
?nlschools
Now our dataset is attached. Also, we can see that this dataset includes information from the study of 2287 eighth-grade pupils (age about 11) in 132 classes from 131 schools in the Netherlands. Those conducting the study gathered information regarding the students’ language test score, verbal IQ, class ID, class size, family social-economic status, and whether the students’ classes are multi-grade classes or not.
For this R guide, we will be focusing on the relationship between the students’ verbal IQ and their language test scores. Verbal IQ (denoted “IQ”) will be our predictor, and language test score (denoted “lang”) will be our response. The argument for this is that students with higher verbal IQs will have higher language test scores.
To learn more about our data we can use the “head” command to see the first six rows of our dataset, and use the “summary” command to see an overview of the data. We will revisit the “summary” command later in this R guide.
head(nlschools)
## lang IQ class GS SES COMB
## 1 46 15.0 180 29 23 0
## 2 45 14.5 180 29 10 0
## 3 33 9.5 180 29 15 0
## 4 46 11.0 180 29 23 0
## 5 20 8.0 180 29 10 0
## 6 30 9.5 180 29 10 0
summary(nlschools)
## lang IQ class GS
## Min. : 9.00 Min. : 4.00 15580 : 33 Min. :10.00
## 1st Qu.:35.00 1st Qu.:10.50 5480 : 31 1st Qu.:23.00
## Median :42.00 Median :12.00 15980 : 31 Median :27.00
## Mean :40.93 Mean :11.83 16180 : 31 Mean :26.51
## 3rd Qu.:48.00 3rd Qu.:13.00 18380 : 31 3rd Qu.:31.00
## Max. :58.00 Max. :18.00 5580 : 30 Max. :39.00
## (Other):2100
## SES COMB
## Min. :10.00 0:1658
## 1st Qu.:20.00 1: 629
## Median :27.00
## Mean :27.81
## 3rd Qu.:35.00
## Max. :50.00
##
From our first output we are able to see how our data is arranged. The output from the summary command tells us that students’ verbal IQs range from 4 to 18, and laguage test scores range from 9 to 58.
Now that we’re familiar with the data, we can create a scatter plot to illustrate the relationship between our two quantitative variables, verbal IQ and language test score.
Let’s create the scatter plot. Use the “plot” command and list the two variables, separated by a comma, with the independent variable listed first.
plot(IQ, lang)
To make the graph easier to interpret, we can remane the axes labels. Also, since there are many overlapping data points, let’s use the “smoothScatter” commnad to create a new type of plot. This plot will have a darker color where there are more data points and a light color where there are few data points.
smoothScatter(IQ, lang, ylab= "Language Test Score", xlab= "Verbal IQ")
By looking at this plot we see that there seems be a positive, linear relationship between verbal IQ and language test score. Students with higher verbal IQs tend to have higher language test scores. However, the data points are spread out quite a bit, especially for verbal IQs less than 10, so it’s difficult to determine the strength of this potential linear relationship just by looking at the plot.
Now that we’ve created our scatter plot, let’s create our simple linear regression model. Our model will be an approximation of the simple linear regression equation mentioned in the intrduction becasue we are creating this model based on our limited sample of 2287 data points. Therefore, we will drop the error term in our model, and \({\beta_0}\) and \({\beta_1}\) will be approximated by \(\hat{\beta_0}\) and \(\hat{\beta_1}\) respectively. Thus, our model, expressing the relationship between verbal IQ and language test score will be in the follwoing form: \[\hat{y_i}= \hat{\beta_0}+\hat{\beta_1} x_i\] where
\(\hat{y_i}\) = point prediciton for language test score when verbal IQ is \(x_i\)
\(\hat{\beta_0}\) = estimate of the intercept = estimated language test score when verbal IQ is 0
\(\hat{\beta_1}\) = estimate of the slope = estimated change in language test score associated with a one unit increase in verbal IQ
\(x_i\) = observed verbal IQ value
Let’s name our model “nls.mod” so that we can call upon it easily. To create the model, use the “lm” command including our dependent variable followed by “~” and our independent variable. Then, call on the model to get the estimates of our intercept and slope.
nls.mod <- lm(lang ~ IQ)
nls.mod
##
## Call:
## lm(formula = lang ~ IQ)
##
## Coefficients:
## (Intercept) IQ
## 9.528 2.654
The output tells us that the intercept estimate is 9.528. However, this intercept can’t be interpretted because none of the observations in our dataset have a verbal IQ near or equal to 0. Our slope estimate is 2.654, which means that for every one unit increase in a student’s verbal IQ, we would expect their language test score to increase by 2.654 points.
Now, let’s use the equation for model to graph the line of best-fit on our scatter plot. First, recall the scatter plot that we created earlier. Then use the “abline” command with a comma spearated list of our intercept and slope estimates; this will create the line of best fit.
smoothScatter(IQ, lang, ylab= "Language Test Score", xlab= "Verbal IQ")
abline(9.528, 2.654)
With our line of best fit included on our graph, it looks like a linear model could fit our data. However, since the points aren’t tightly packed, it’s difficult to determine how well this linear model fits our data. Later in thsi R guide we will analyze the usefulnes and strength of our model.
Before we move on to our analysis, we should check to see if the linear regression assumptions hold for our model. There are four assumptions we make for our model, and if they hold we can rely on our model to produce accurate estimates and intervals. However, if one or more of our assumptions don’t hold, our model may not produce reliable outputs.
First we assume that the population of potential errors has a mean value equal to zero for any given verbal IQ. If this assumption doesn’t hold ???(waiting for notes from Knudson)???
To check this assumption, we call on all of our residuals and calculate the mean.
all.resids <- nls.mod$residuals
mean(all.resids)
## [1] 1.522946e-15
The mean of the errors is very close to zero, so we can conclude that this assumption holds.
The second assumption is the normality assumption, which states that the population of potential errors has a normal distribution for any given value of our independent variable, verbal IQ. IF this assumption doesn’t hold ???(waiting for notes from Knudson)??? We can check this assumption by first looking at a histogram of our residuals.
hist(all.resids)
The histogram is somewhat helpful, and it looks like our errors are normally distributed. However, it would be easier to compare our data against a straight line rather than the curved histogram. To compare our data against a straight line, we can plot the quantiles of a normal distribution and then plot the quantiles of our residuals against them. If the residuals from our model are normally distributed they will follow a straight line.
qqnorm(all.resids)
qqline(all.resids)
From our graph we can see that our residuals are packed pretty close to the straight line. Our residuals are slightly spread out at the edges fot their distribution, but the amount is negligent. Therefore, we can conclude that our residuals are normally distributed.
The third assumption is the constant variance assumption, also called homoscedasticity. This assumption states that the variance of the residuals will be equal for all values of our independent variable, verbal IQ. If this assumption doesn’t hold ???(waiting for notes from Knudson)???
We could check this assumption by looking at our original data.
plot(IQ, lang, ylab= "Language Test Score", xlab= "Verbal IQ")
We can try to see if the vertical spread of our points is equal for each value of our independent variable (verbal IQ). From this graph it looks like there might not be equal variances because the spread of the points looks wider for verbal IQ values between 6 and 12 than it does for higher or lower values. However, our plot has a lot of overlapping points, so it is difficult to clearly analyze this plot.
A better way to test homoscedasticity is to plot the residuals against the values of our response, language test score (lang). We can also plot a staight line to make it easier to compare the variance between different verbal IQ values.
smoothScatter(nls.mod$residuals ~ nls.mod$fitted, xlab = "Fitted Values", ylab= "Residuals")
abline(0,0)
From this graph we can see that most of the residuals seem to be evenly spread. The smooth scatter plot is helpful in showing us that while some outliars might not have equal varince, the dark blue area on our plot, where the majority of the data points are, does seem to have equal variance. If we were using our model for something more high-staks we might want to further check this assumption. However, for our purposes we will conclude that this assumption holds.
The final assumption is the independence assumption, which states that any value of our error term is statistically independent of any other error value. If this assumption doesn’t hold ???(waiting for notes from Knudson)???
This assumption is tested by analyzing how our data was collected. R tells us that information about the students was collected from 132 classes in 131 schools throughout the Netherlands. Since information was gathered from a wide variety of classes and schools, we can assume that our data is independent.
Therefore, we can conclude that all of the regression assumptions hold for our model.
Now that we know that our assumptions hold, we can use our model is to calculate the point estimate for the constant variance of our residuals, which is the mean square error (MSE). As its name implies, the MSE is the average of all of the squared errors for our model. The “summary” command can be used to find the mean squre error.
summary(nls.mod)
##
## Call:
## lm(formula = lang ~ IQ)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.7022 -4.3944 0.6056 5.2595 26.2212
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.52848 0.86682 10.99 <2e-16 ***
## IQ 2.65390 0.07215 36.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.137 on 2285 degrees of freedom
## Multiple R-squared: 0.3719, Adjusted R-squared: 0.3716
## F-statistic: 1353 on 1 and 2285 DF, p-value: < 2.2e-16
In our output, the “Residual statndard error” gives us the the standard error of our model, which is a point estimate for the constant standard deviation of our errors. The MSE is the square of this, so the MSE for our model is \(7.137^2 = 50.93677\). Thus, the point estimate for the constant variance of our residuals is 50.96677.
We can measure the usefulness of our model by calculating the coefficient of determination, commonly denoted \(r^2\). The coefficient of determination is the percentage of the variability of the response that is explained by its linear relationship with the predictor. In other words, it equals the explained variance divided by the total variance. We can use the “summary” command to find the coefficient of determination, which is called “Multiple R-squared” in the output.
summary(nls.mod)
##
## Call:
## lm(formula = lang ~ IQ)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.7022 -4.3944 0.6056 5.2595 26.2212
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.52848 0.86682 10.99 <2e-16 ***
## IQ 2.65390 0.07215 36.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.137 on 2285 degrees of freedom
## Multiple R-squared: 0.3719, Adjusted R-squared: 0.3716
## F-statistic: 1353 on 1 and 2285 DF, p-value: < 2.2e-16
From our output we see that our coefficient of determination is 0.3719, so our model only accounts for 37.19% of the total variance of language test scores. Thus, most of the variation in langauge test scores is not due to the linear relationship between verbal IQ and language test scores.
While the coefficient of determination is importnat, its square-root, the correlation coefficient, can be more helpful when analyzing a linear regression model. Correlation, denoted \(r\), describes the linear relationship between our predictor and response variables. If \(r = 0\) there is no linear relationship between the two variables, but if \(|r| = 1\), there is strong linear relationship between the two variables. Also, if \(r < 0\), the linear relarionship is negative, meaning that as the independent variable increases, the dependent variable decreases. If \(r > 0\), the linear relationship is positive, meaning that as the independent variable increases, the dependent variable increases.
Let’s calcuclate the correlation for our model to analyze the linear relationship between our independent variable, verbal IQ, and our dependent variable, language test score. We can use the “cor” command with each of our variables listed to do so.
cor(IQ, lang)
## [1] 0.6098195
Our correlation between verbal IQ and language test score is 0.6098195. Since our correlation is a positive number, we know that there is a positive correlation between verbal IQ and language test score. In other words, as verbal IQ increases, we can expect language test score to increase as well, which confirms the trend we observed in our scatter plot. This positive correlation is moderately strong, but it’s not extremely strong because 0.6098195 isn’t very close to 1. In the context of our dataset this means that as verbal IQ increases we would expect to see language test score increase, but we shouldn’t be shocked if the langauge test score doesn’t increase.
Now that we know how strong the linear relationship is between verbal IQ and language test score, let’s check to see if there is a significant relationship between our predictor, verbal IQ, and our response, language test score. This isimportant because our linear regression model isn’t useful unless there is a significant relationship between our two variables.
For the following tests we will use a significance level of \({\alpha} = 0.05\).
One way that we can test the significance of our predictor is to conduct a two-sided t-test. Our null hypothesis states that verbal IQ does not have a linear relationship with language test score. In other words, our independent variable, verbal IQ, has a coefficient of 0 in our linear regression model. Our alternative hypothesis states that verbal IQ does have a linear relationship with our response, so verbal IQ has a coefficient not equal to 0. Our hypotheses can be written in the following form: \[H_0:\hat{\beta_i}=0\] \[H_0:\hat{\beta_i} ≠ 0\]
If the test shows that verbal IQ is significant, we will reject the null hypothesis in favor of the alternative and conclude that there is sufficient evidence to show that verbal IQ does have a linear relationship with language test score.
We can use the “summary” function to test the significance of our predictor.
summary(nls.mod)
##
## Call:
## lm(formula = lang ~ IQ)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.7022 -4.3944 0.6056 5.2595 26.2212
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.52848 0.86682 10.99 <2e-16 ***
## IQ 2.65390 0.07215 36.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.137 on 2285 degrees of freedom
## Multiple R-squared: 0.3719, Adjusted R-squared: 0.3716
## F-statistic: 1353 on 1 and 2285 DF, p-value: < 2.2e-16
The summary function tells us a lot about our model, but right now we want to focus on the output from the t-test. Our t-distributon has 2285 dgrees of freedom because there are 2287 entries in our dataset, and we are assuming two regression coefficients in our model (df = number of observations - number of regression coefficients). The dgrees of freedom from t-distribution is listed in the “Residual standard error” row of our output. Our test statistic for verbal IQ, listed in the “t value” column, is 36.78, and the corresponding p-value, listed in the “Pr(>|t|)” column, is less than \(2 * 10^-16\). The p-value for verbal IQ expresses the significance of that predictor, and since our p-value is less than our significance level of 0.05, we see that verbal IQ is significant. In other words, since the p-value is less than our significance level, we can reject the null hypothesis in favor of the alternative. Therefore, we can conclude that verbal IQ does have a linear relationship with language test score.
If we wanted to, we could test the significance of our intercept estimate in the same way that we tested the significane of our slope estimate. We would write our hypotheses out using \(\hat{B_0}\) instead of \(\hat{B_1}\), and we would find the corresponding p-value in our summary function to make our conclusion.
We could also use an F-test to test the significance of our predictor. Our hypotheses and significance level would be the same as above. The summary function gives us the output for the F-test. Our F-test statistic is listed in the last row of the output, and is equal to 1353; the degrees of freedom for our F-distribution are 1 and 2285. The p-value for our model is less than \(2 * 10^-16\), which is less than our significance level of 0.05. Therefore, using the F-test we would also reject the null hyothessis in favor of the alternative and conclude that verbal IQ does have a significant linear relationship with langauge test score.
Now that we know that our prediway that we can use our model is to find the point estimate for the constant variance of our residuals (errors), which is the mean square error (MSE). AS its name implies, the MSE is the average of all of teh squared errors for our model. The “summary” command can be used to find the mean squre error.
summary(nls.mod)
##
## Call:
## lm(formula = lang ~ IQ)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.7022 -4.3944 0.6056 5.2595 26.2212
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.52848 0.86682 10.99 <2e-16 ***
## IQ 2.65390 0.07215 36.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.137 on 2285 degrees of freedom
## Multiple R-squared: 0.3719, Adjusted R-squared: 0.3716
## F-statistic: 1353 on 1 and 2285 DF, p-value: < 2.2e-16
In our output, the “Residual statndard error” gives us our mean square error. In this case our MSE is 7.137, which is the point estimate for the constant variance of our residuals. If we wanted a point estimate for the constant standard deviation of our errors, we would just take the square-root of our mean square error, which is called the standard error.
Remember, earlier we showed that the constant variance assumption doesn’t hold for our model, so this MSE isn’t totlaly accurate. Specifically, the real MSE is greater than 7.137 because our model systematically overestimates low language test scores and underestimates high language test scores. Thus, the real variance is greater than what the model shows.
Next, we can use our simple linear regression model to make predicitons about our data. We can select a student from our dataset and enter their verbal IQ into our model to get a predicted value of their language test score.
For example, let’s predict the language test score for 1000th student in our data. The verbal IQ and language test score for this student are in the 1000th row of our dataset, and to access their verbal IQ we use indexing. Specifically, we list the name of the dataset, followed by a set of brackets. In the brackets we need to list the row and column numbers for the number that we are concerned about. Since we want to know the verbal IQ for the student in the 1000th row, we will list 1000 as the row and 2 as the column number, because verbal IQ is the variable in the second column of our data. Let’s call this value obsIQ, and surround our code with parentheses to get an immediate output.
(obsIQ <- nlschools[1000, 2])
## [1] 15
Our output tells us that the verbal IQ for the student in the 1000th row is 15.
Now, we can use this verbal IQ value to predict the language test score for this student. To do so, we will use the “coef” command and matrix multiplication. The “coef” command outputs the intercept and slope for our model in a vector of length two, with the intercept listed first, followed by the slope. If we matrix multiply this vector by another vector of length two with 1 as the first entry and our verbal IQ value as the second entry, we will get the estimate for the language test score.
(predLang <- coef(nls.mod)%*%c(1, 15))
## [,1]
## [1,] 49.33692
Our output tells us that the the predicted language test score for the student in the 1000th row is 49.33692.
Let’s use residuals to see how close this student’s predicted langauge test score is to their actual score. First, let’s find our what the student’s acutal test score was.
(obsLang <- nlschools[1000, 1])
## [1] 52
The students actual test score was 52.
The residual is the observed value minus the expected value, or observed minus predicted. Thus, to find the residual we will subtract the predicted test score for this student from their actual test score. To do so, we will use more indexing to reference the student’s actual test score. Then we will subtract our predicted value from the actual value.
(resid <- obsLang - predLang)
## [,1]
## [1,] 2.663082
Our residual is 2.663082, which means that we underestimated this student’s test score by 2.663082.
Next let’s calculate a confidence interval for our slope and intercept estimates. The “confint” command will give us the lower and upper bounds for each of these confidence intervals. To start, let’s calculate 95% confidence intervals for each of our parameters in our model. The “confint” command assumes a 95% confidence interval, so the only argument that we need to provide is the name of our model.
confint(nls.mod)
## 2.5 % 97.5 %
## (Intercept) 7.828646 11.22832
## IQ 2.512401 2.79539
Our output tells us that a 95% confidence interval for our intercept estimate is [7.828646, 11.22832]. This means that we can be 95% certain that the value of our intercept is between 7.828646 and 11.22832. Equivalently, if many confidence intervals were constructed for our intercept estimate, 95% of the intervals would include the true intercept value. The confidence interval for our slope estimate is [2.512401, 2.79539], which means that we can be 95% certain that the value of our slope is between 2.512401 and 2.79539. Similarly, if many confidence intervals were constructed for our slope estimate, 95% of the intervals would include the true value for the slope.
Confidence intervals can also be constructed for other parameters related to our dataset. For example, let’s construct a 90% confidence interval for the average language test score of students with a verbal IQ of 10. To do so, we will need to create a new dataset that only includes students with a verbal IQ of 10. We will call this our “new.nls.data”
new.nls.data <- data.frame(IQ = 10)
Now we can create our confidence interval for the mean language test score of students with a verbal IQ of 10. We will use the “predict” command with four arguments. First we need to tell R what model we want to use and what dataset we want to use. Then we tell R that we want to construct a confidence interval, and we tell R that we want to make a 90% confidence interval.
(confIQ <- predict(nls.mod, new.nls.data, interval= "confidence", level=.9))
## fit lwr upr
## 1 36.06744 35.73921 36.39567
Our output gives us an estimate of the mean test score of all students with a verbal IQ of 10. It also tells us the lower and upper bounds of our 90% confidence interval. The estimated value, 36.06744, tells us that the average language test score of all students with a verbal IQ of 10 is 36.06744. In addition, the 90% confidence interval for the mean language test score is [35.73921, 36.39567]. This means that we are 90% confident that the mean language test score of all students with a verbal IQ of 10 is between 35.73921 and 36.39567. Also, if multiple confidence intervals were constructed in this way, 90% of them would include the true mean language test score for students with a verbal IQ of 10.
In addition to constructing confidence intervals for parameters, we can create prediction intervals for any individual point in our data. To demonstrate this, let’s now create a 90% prediction interval for a student whose verbal IQ is 10. Notice that instead of caring about the average language test score of all students with a verbal IQ of 10, we now only want to estimate the language test score for any one student with a verbal IQ of 10. Since we are still using a verbal IQ of 10, we can use the new dataset that we created in the last calculation. However, this time we will tell R that we want to create a prediciton interval, not a confidence interval.
(predIQ <- predict(nls.mod, new.nls.data, interval= "prediction", level = .90))
## fit lwr upr
## 1 36.06744 24.31822 47.81666
Similar to when we created the confidence interval, our output gives us an esitmated value and the lower and upper bounds for the 90% prediction interval. We see that the predicted value, 36.06744, is now an estimate for an individual student’s language test score if that student has an IQ of 10. Notice that this estimated value is the same as the estimated average value from our confidence interval. However, our lower and upper bounds for our prediciton interval are very different than the bounds from our confidence interval. Our 90% prediciton interval is [24.31822, 47.81666], which means that we can be 90% confident that if a student has a verbal IQ of 10, their language test score will be between 24.31822 and 47.81666. It also means that if multiple prediciton intervals were created in this manner, 90% of them would include the correct point prediciton (actual test score).
Clearly our prediciton interval is wider than our confidence interval, but how much wider? We can use R to calculate the width of each of our intervals by matrix multiplying our intervals with the vector (0, -1, 1).
confIQ %*% c(0, -1, 1)
## [,1]
## 1 0.6564533
predIQ %*% c(0, -1, 1)
## [,1]
## 1 23.49845
Our output tells us that the width of our confidence interval is 0.6564533, and the width of our prediciton interval is 23.49845. It makes sense that our preditiction interval is wider than our confidence interval because, as proven by the central limit theorem, the variance for the mean of the language test scores is smaller than the variance of any individual language test score.