In thi R guide we will be covering concepts regarding simple linear regression. First, we will choose a dataset, and then we will identify and plot our predictor and response variables. Once we’re familiar with our data and have analyzed our scatter plot, we will create a simple linear regression model. We will add the line of best fit to our scatter plot, interpret the results, and use our model to calculate predictions and residuals. Once we understand the basics of our graph and regression model, we will be able to test the significance of our slope and intercept, create confidence intervals and prediciton intervals, and calculate the coefficient of determination and correlation coefficient for our model. Finally, we will trst the linear regression assumptions to see if they hold for our model.

Calling Data

For this document we will be using the “nlschools” dataset. This dataset is in the MASS library. Let’s call and attach our dataset. Once our dataset is attached we won’t have to continuously reference it throughout our work.

library(MASS)
data(nlschools)
attach(nlschools)

Now that our dataset is attached, let’s become more familar with it. Use the “head” command to see the first six rows of our dataset. Use the “name” command to view all of the variables, and use the “summary” command to see an overview of the data. We will revisit the “summary” command later in this R guide.

head(nlschools)
##   lang   IQ class GS SES COMB
## 1   46 15.0   180 29  23    0
## 2   45 14.5   180 29  10    0
## 3   33  9.5   180 29  15    0
## 4   46 11.0   180 29  23    0
## 5   20  8.0   180 29  10    0
## 6   30  9.5   180 29  10    0
names(nlschools)
## [1] "lang"  "IQ"    "class" "GS"    "SES"   "COMB"
summary(nlschools)
##       lang             IQ            class            GS       
##  Min.   : 9.00   Min.   : 4.00   15580  :  33   Min.   :10.00  
##  1st Qu.:35.00   1st Qu.:10.50   5480   :  31   1st Qu.:23.00  
##  Median :42.00   Median :12.00   15980  :  31   Median :27.00  
##  Mean   :40.93   Mean   :11.83   16180  :  31   Mean   :26.51  
##  3rd Qu.:48.00   3rd Qu.:13.00   18380  :  31   3rd Qu.:31.00  
##  Max.   :58.00   Max.   :18.00   5580   :  30   Max.   :39.00  
##                                  (Other):2100                  
##       SES        COMB    
##  Min.   :10.00   0:1658  
##  1st Qu.:20.00   1: 629  
##  Median :27.00           
##  Mean   :27.81           
##  3rd Qu.:35.00           
##  Max.   :50.00           
## 

Becoming familiar with the data is a necessary first step to working with the data and interpretting the results. However, there is no need to submit the results from the previous three commands.

Scatter Plot

Now that we’re familiar with the data, we can create a scatter plot to illustrate the relationship between two quantitative variables. For this R guide, we are interested in the relationship between verbal IQ, which is the student’s verbal IQ, and language test score, which is the score the student earned on a language test. Our predictor is verbal IQ (IQ), and our response is language test score (lang). The argument for this is that students with higher verbal IQs will have higher language test scores.

Let’s create the scatter plot. Use the “plot” command and list the two varibales, separated by a comma, with the independent variable listed first.

plot(IQ, lang)

To make the graph easier to interpret, we can remane the axis labels by typing:

plot(IQ, lang, ylab= "Language Test Score", xlab= "Verbal IQ")

By looking at our scatter plot, we see that there might be a positive, linear relationship between verbal IQ and language test scores. Students with higher verbal IQs tend to have higher language test scores. However, the points aren’t packed very tightly, so if there is a linear relationship, it doesn’t look like it will be very strong.

Creating our Model

Now that we’ve created our scatter plot, let’s create our simple linear regression model. This model will be in the following form: \[\hat{y_i}= \hat{\beta_0}+\hat{\beta_1} x_i\] where: \(\hat{y_i}\) is our our predicted response; \(\hat{\beta_0}\) is our intercept estimate; \(\hat{\beta_1}\) is our slope estimate.

Let’s name our model “nls.mod” so that we can call upon it easily. To create the model, use the “lm” command including our dependent variable followed by “~” and our independent variable. Then, call on the model to get the estimates of our intercept and slope.

nls.mod <- lm(lang ~ IQ)
nls.mod
## 
## Call:
## lm(formula = lang ~ IQ)
## 
## Coefficients:
## (Intercept)           IQ  
##       9.528        2.654

The output tells us that the intercept estimate is 9.528, which means that we would expect a student with a verbal IQ of 0 to have a language test score of 9.528. However, this intercept can’t be interpretted because none of the observations in our dataset have a verbal IQ near or equal to 0. Our slope estimate is 2.654, which means that for every one unit increase in a student’s verbal IQ, we would expect their language test score to increase by 2.654 points.

Plotting the Line of Best Fit

Now that we’ve created our simple linear regression model, let’s plot our line of best fit on our scatter plot. First, recall the scatter plot that we created earlier. Then use the “abline” command with a comma spearated list of our intercept and slope estimates; this will create the line of best fit.

plot(IQ, lang, ylab= "Language Test Score", xlab= "Verbal IQ")
abline(9.528, 2.654)

With our line of best fit plotted on our graph, it looks like a linear model could fit our data. However, since the points aren’t tightly packed, it’s not certain that a linear model fits our data well. We will calculate the correlation coefficient later in this R guide, which measures the strength of the linear relationship between our variables.

Testing the Significance of Predictors

Now that we’ve created our model, let’s test the significance of our predictor, verbal IQ. We can do this by using a hypothesis test. Our null hypothesis states that verbal IQ does not have a linear relationship with language test score. In other words, our independent variable, verbal IQ, has a coefficient of 0 in our linear regression model. Our alternative hypothesis states that verbal IQ does have a linear relationship with our response, so our independent variable, verbal IQ, has a coefficient not equal to 0. If the test shows that verbal IQ is significant, we will reject the null hypothesis in favor of the alternative and conclude that there is sufficient evidence to show that verbal IQ does have a linear relationship with language test score.

We can use the “summary” function to test the significance of our predictor.

summary(nls.mod)
## 
## Call:
## lm(formula = lang ~ IQ)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.7022  -4.3944   0.6056   5.2595  26.2212 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.52848    0.86682   10.99   <2e-16 ***
## IQ           2.65390    0.07215   36.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.137 on 2285 degrees of freedom
## Multiple R-squared:  0.3719, Adjusted R-squared:  0.3716 
## F-statistic:  1353 on 1 and 2285 DF,  p-value: < 2.2e-16

The summary function tells us a lot about our model, but to test the significance of our predictor we care about the p-values from our hypothesis test. The Pr(>|t|) column tells us the significance of our intercept and our predictor. The p-value for IQ is <2e-16, which is approximately zero. Since our p-value is so small, regardless of our significance level, we can reject the null hypothesis in favor of the alternative. Therefore, we can clonclude that verbal IQ does have a linear relationship with language test score.

The summary command also tells us the number of degrees of freedom for our model, which is 2285. Our number of degrees of freedom is 2285 because there are 2287 entries in our dataset, and we are assuming two regression coefficients in our model (df = number of observations - number of regression coefficients). Our estimates for our intercept and slope are also given in the model summary in the “Estimate” column.

Predictions

Next, we can use our linear model to make predicitons about our data. We can select a student from our dataset and enter their verbal IQ into our model to get a predicted value of their language test score.

For example, let’s predict the language test score for 1000th student in our data. The verbal IQ and language test score for this student are in the 1000th row of our dataset, and to access their verbal IQ we use indexing. Specifically, we list the name of the dataset, followed by a set of brackets. In the brackets we need to list the row and column numbers for the number that we are concerned about. Since we want to know the verbal IQ for the student in the 1000th row, we will list 1000 as the row, and 2 as the column number, because verbal IQ is the variable in the second column of our data.

nlschools[1000, 2]
## [1] 15

Our output tells us that the verbal IQ for the student in the 1000th row is 15. Thus, we will plug 15 into our regression equation as the value for our independent variable, and this will allow us to predict the value of this students language test score.

One way to do so would be to type out the value of our intercept, slope, and verbal IQ and have R calculate it.

9.52848 + 2.65390 * 15
## [1] 49.33698

Here, R tells us that a prediction for the 1000th student’s language test score is 49.33698.

Instead of typing everything out, we could use the “coef” command. This command outputs the intercept and slope for our model in a vector of length two, with the intercept listed first, followed by the slope of our model.

coef(nls.mod)
## (Intercept)          IQ 
##    9.528484    2.653896

If we matrix multiply this vector by another vector of length 2 containing 1 and the value of our independent variable (verbal IQ), we get the predicted value of our dependent variable (language test score). In our example, we will multiple our coef vaector by c(1, 15) since the verbal IQ for the student in the 1000th row is 15. To matix multiply use the %*% function

coef(nls.mod)%*%c(1, 15)
##          [,1]
## [1,] 49.33692

Our output tells us that the the predicted language test score for the student in the 1000th row is 49.33692. This is the same as the ourtput from the alternative method; they don’t match exactly due rounding error for our intercept and slope values that we used in our first method.

Residuals

Since we’re able to identify the observed values in our dataset and calculate predicted values, we can calculate residuals for our data. Residual is the observed value minus the expected value, or observed minus predicted. Let’s calculate the residual for the language test score for the 1000th student. First, let’s identify this student’s actual language test score. Since language test score is listed in the first column of the data, we can use the following indexing to find the student’s actual score.

nlschools[1000, 1]
## [1] 52

We see that the observed value for the students test score is 52. Earlier we leanred how to calculate the expected value, so let’s use what we’ve leanred to caluculate the residual.

nls.obs <- nlschools[1000, 1]
IQ.1000 <- nlschools[1000, 2]
nls.pred <- coef(nls.mod)%*%c(1, IQ.1000)
nls.resid <- nls.obs - nls.pred
nls.resid
##          [,1]
## [1,] 2.663082

Our residual is 2.663082, which means that our predicted value for the student’s language test score was 2.663082 points lower than the student’s observed language test score.

Testing Linear Regression Assumptions

Now that we’re able to create and plot our model and use it for predictions, we should check to see if our linear regression assumptions hold for our model. If the assumptions hold for our model we can rely on our model to produce accurate estimates and intervals.

The first assumption is the normality assumption, which states that all of the residuals (errors) for our model are normally distributed. To get all of our residuals, use the follwoing command:

all.resids <- nls.mod$residuals

Now we can test whether or not these residuals are normally distributed. To do so, we can first look at a histogram of our residuals.

hist(all.resids)

The histogram is somewhat helpful, but it would be easier to compare our data against a straight line ratehr than the curved histogram. To compare our data against a straight line, we can plot the quantiles of a normal distribution and then plot the quantiles of our residuals against them. If the residuals from our model are normally distributed they will follow a straight line.

qqnorm(all.resids)
qqline(all.resids)

From our graph we can see that our residuals are packed very tightly and are very close to the straight line. Therefore, we can conclude that our residuals (errors) are normally distributed.

Another assumption is the constant variance assumption, also called homoscedasticity. This assumption states that the variance of the residuals will be equal for all values of our independent variable. We could check this assumption by looking at our original data.

plot(IQ, lang, ylab= "Language Test Score", xlab= "Verbal IQ")

We can try to see if the vertical spread of our points is equal for each value of our independt variable (verbal IQ). From this graph it looks like there might not be equal variances becasue the spread of the points looks wider for verbal IQ values between 6 and 12 than it does for higher or lower values. However, a better way to test homoscedasticity is to plot the residuals against the values of our independet variable, verbal IQ. We can also plot a staight line to make it easier to compare the variance between different verbal IQ values.

plot(nls.mod$residuals ~ IQ)
abline(0,0)

From this graph it is much easier to see that there is not homoscedasticity; rather, there is heteroscedasticity because the variance differs for different verbal IQ values. For example, the vertical spread between points for a verbal IQ of 12 seems to be about 40 units, but the vertical spread for a verbal IQ of 17 is only about 15 units.

As we continue to use our simple linear regression model to produce various estimates and intervals we will need to keep in mind that not all of the linear regression assumptionss hold for our model. Thus, we will need to be more cautious to accept the outputs that our model gives us.

Mean Square Error

Another way that we can use our model is to find the point estimate for the constant variance of our residuals (errors), which is the mean square error (MSE). The “summary” command can be used to find the mean squre error.

summary(nls.mod)
## 
## Call:
## lm(formula = lang ~ IQ)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.7022  -4.3944   0.6056   5.2595  26.2212 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.52848    0.86682   10.99   <2e-16 ***
## IQ           2.65390    0.07215   36.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.137 on 2285 degrees of freedom
## Multiple R-squared:  0.3719, Adjusted R-squared:  0.3716 
## F-statistic:  1353 on 1 and 2285 DF,  p-value: < 2.2e-16

In our output, the “Residual statndard error” gives us our mean square error. In this case our MSE is 7.137, which is the point estimate for the constant variance of our residuals. If we wanted a point estimate for the constant standard deviation of our errors, we would just take the square-root of our mean square error, which is called the standard error. However, earlier we showed that the constant variance assumption doesn’t hold for our model, so for thsi specific model this point prediciton isn’t entirely accurate.

Coefficient of Determination and Correlation

We can measure the usefulness of our model by calculating the coefficient of determination, commonly denoted “r^2”. The coefficient of determination is the percentage of the variability of the response that is explained by its linear relationship with the predictor. In other words, it equals the explained variance divided by the total variance. We can use the “summary” command to find the coefficient of determination, which is called “Multiple R-squared” in the output.

summary(nls.mod)
## 
## Call:
## lm(formula = lang ~ IQ)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.7022  -4.3944   0.6056   5.2595  26.2212 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.52848    0.86682   10.99   <2e-16 ***
## IQ           2.65390    0.07215   36.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.137 on 2285 degrees of freedom
## Multiple R-squared:  0.3719, Adjusted R-squared:  0.3716 
## F-statistic:  1353 on 1 and 2285 DF,  p-value: < 2.2e-16

From our output we see that our coefficient of determination is 0.3719, which means that our model accounts for 37.19% of the total variance of language test scores.

While the coefficient of determination is importnat, its square-root, the correlation coefficient, can be more helpful when analyzing a linear regression model. Correlation, denoted “r”, describes the linear relationship between our predictor and response variables. If |r| = 0.5 there is no linear relationship between the two variables, but if |r| = 1, there is strong linear relationship between the two variables. Generally, |r| values closer to 1 express a stronger linear relationship than |r| values closer to 0.5. Also, if r < 0, the linear relarionship is negative, meaning that as the independent variable increases, the dependent variable decreases. If r > 0, the linear relationship is positive, meaning that as the independent variable increases, the dependent variable increases.

Let’s calcuclate the correlation for our model to analyze the linear relationship between our predictor, verbal IQ, and our response, language test score. We can use the “cor” command with each of our variables listed to do so.

cor(IQ, lang)
## [1] 0.6098195

Our correlation between verbal IQ and language test score is 0.6098195. Since our correlation is a positive number, we know that there is a positive correlation between verbal IQ and language test score. In other words, as verbal IQ increases, we can expect language test score to increase as well. However, this positive correlation isn’t very strong. Our correlation is greater than 0.5, so there is some relationship between our two varibales. However, our correlation is not close to 1, so the linear relationship is weak.

Confidence Intervals for Coefficients

Next let’s calculate a confidence interval for our slope and intercept estimates. The “confint” command will give us the lower and upper bounds for each of these confidence intervals. To start, let’s calculate 95% confidence intervals for each of our parameters in our model. The “confint” command assumes a 95% confidence interval, so the only arguament that we need to provide is our model. Also, keep in mind that since not all of the regression assumptions hold for our model, this confidence interval might not be entirely accurate.

confint(nls.mod)
##                2.5 %   97.5 %
## (Intercept) 7.828646 11.22832
## IQ          2.512401  2.79539

Our output tells us that a 95% confidence interval for our intercept estimate is [7.828646, 11.22832]. This means that we can be 95% certain that the value of our intercept is between 7.828646 and 11.22832. Equivalently, if many confidence intervals were constructed for our intercept estimate, 95% of the intervals would include the best value for the estimate. The confidence interval for our slope estimate is [2.512401, 2.79539], which means that we can be 95% certain that the value of our slope is between 2.512401 and 2.79539. similarly, if many confidence intervals were constructed for our slope estimate, 95% of the intervals would include the best value for the estimate.

Confidence Interval for the Mean

Confidence intervals can also be constructed for other parameters related to our dataset. For example, let’s construct a 99% confidence interval for the average language test score of students with a verbal IQ of 10. To do so, we will need to create a new dataset that only includes students with a verbal IQ of 10. We will call this our “new.nls.data”

new.nls.data <- data.frame(IQ = 10)

Now we can create our confidence interval for the mean language test score of students with a verbal IQ of 10. We will use the “predict” command with four arguments. First we need to tell R what model we want to use and what dataset we want to use. Then we tell R that we want to construct a confidence interval, and we tell R that we want to make a 99% confidence interval.

confIQ <- predict(nls.mod, new.nls.data, interval= "confidence", level = .99)
confIQ
##        fit      lwr      upr
## 1 36.06744 35.55322 36.58166

Our output gives us an estimate of the mean test score of all students with a verbal IQ of 10. It also tells us the lower and upper bounds of our 99% confidence interval. The estimated value, 36.06744, tells us that the avergae language test score of all students with a verbal IQ of 10 is 36.06744. In addition, the 99% confidence interval for the mean language test score is [35.55322 36.58166]. This means that we are 99% confident that the mean language test score of all students with a verbal IQ of 10 is between 35.55322 and 36.58166. Also, if multiple confidence intervals were constructed in this way, 99% of them would include the true mean language test score.

Again, keep in mind that this prediciton and interval may not be correct because not all of the linear regession assumptions hold for our model.

Prediction Intervals

In addition to constructing confidence intervals for parameters, we can create prediction intervals for any individual point in our data if the linear regression assumptions hold. To demonstrate this, let’s now create a 99% prediction interval for a student whose verbal IQ is 10. Notice that instead of caring about the average language test score of all students with a verbal IQ of 10, we now only want to estimate the language test score for any one student with a verbal IQ of 10. Since we are still using a verbal IQ of 10, we can use the new dataset that we created in the last calculation. However, this time we will tell R that we want to create a prediciton interval, not a confidence interval.

predIQ <- predict(nls.mod, new.nls.data, interval= "prediction", level = .99)
predIQ
##        fit      lwr      upr
## 1 36.06744 17.66032 54.47456

Similar to when we created the confidence interval, our output gives us an esitmated value and the lower and upper bounds for the 99% prediction interval. We see that the predicted value, 36.06744, is now an estimate for an indivisual student’s language test score if that student has an IQ of 10. Notice that this estimated value is the same as the estimated average value from our confidence interval. However, our lower and upper bounds for our prediciton interval are very different than the bounds from our confidence interval. Our 99% prediciton interval is [17.66032, 54.47456] which means that we can be 99% confident that if a student has a verbal IQ of 10, their language test score would be between 17.66032 and 54.47456. It also means that if multiple prediciton intervals were created in this manner, 99% of them would include the correct point prediction.

Keep in mind that the estimate and prediciton interval might not be acurate ecause not all of the regression assumptions hold for our model.

Clearly our prediciton interval is wider than our confidence interval, but how much wider? We can use R to calculate the width of each of our intervals by matrix multiplying our intervals with the vector (0, -1, 1).

confIQ %*% c(0, -1, 1)
##       [,1]
## 1 1.028444
predIQ %*% c(0, -1, 1)
##       [,1]
## 1 36.81423

Our output tells us that the width of our confidence interval is 1.028444, and the width of our prediciton interval is 36.81423. It makes sense that our preditiction interval is wider than our confidence interval because, as proven by the cental limit theorem, the variance for the mean of the language test scores is smaller than the variance of any individual language test score.

Summary

In review, throughout this R guide we learned how to do the following: create and analyze a scatter plot from our dataset, create and interpret a simple linear regression model, plot and interpret the line of best fit from our model on our scatter plot, test the significance of our predictor, calculate and interpret point predictions using our model, calculate residuals, test the linear regression assumptions, calculate and interpret residuals, calculate and interpret the coefficint of determination and correlation coefficient, create and interpret confidence intervals for our regression coefficients, create and interpret a confidence interval for the mean of our response, create and interpret a prediciton interval for our response.