Simple linear regression approximately models the relationship between the dependent variable (y) and the independent variable (x) with a straight line. This R guide will refer to the independent variable as the predictor and the dependent variable as the response. Because our model is linear, the parameters are the y-intercept and slope. The simple linear regression model is: \[ y=\mu_{y|x} + \epsilon = \beta_0 + \beta_1x + \epsilon \] where \(\mu_{y|x} = \beta_0 + \beta_1x\) is the mean value of the response when the predictor is x, \(\beta_0\) is the y-intercept, \(\beta_1\) is the slope, and \(\epsilon\) is the error term. Since we don’t know the true values of the y-intercept and slope, we use our data to estimate them; these estimates are our least squares point estimates.
In this R guide we will use data from an R package to plot our regression model, determine the strength of the linear relationship between variables, and predict values. Then we will compute confidence intervals for our parameters, and compute confidence and prediction intervals for the response given a predictor value.
The data we will be using is the CASchools data in the AER R package. The data is from all 420 K-6 and K-8 districts in California with data for 1998 and 1999. We will use the regression model to try to predict the reading scores (response) using the % of students who qualify for reduced-price lunch (predictor). I think that as the % of students qualifying for reduced-price lunch increases, reading scores decrease.
Here are some helpful R commands to load the AER package, call the CASchools data, and attach the variables so that we can use them in our R code:
library(AER)
data(CASchools)
attach(CASchools)
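If the AER package is not already installed, it can be installed once beforehand with the standard install.packages command (this step is assumed here and was not part of the original session):
install.packages("AER")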
Before digging into this data, it is a good idea to explore a little more by displaying some of the data and the variable names.
head(CASchools)
## district school county grades students
## 1 75119 Sunol Glen Unified Alameda KK-08 195
## 2 61499 Manzanita Elementary Butte KK-08 240
## 3 61549 Thermalito Union Elementary Butte KK-08 1550
## 4 61457 Golden Feather Union Elementary Butte KK-08 243
## 5 61523 Palermo Union Elementary Butte KK-08 1335
## 6 62042 Burrel Union Elementary Fresno KK-08 137
## teachers calworks lunch computer expenditure income english read
## 1 10.90 0.5102 2.0408 67 6384.911 22.690001 0.000000 691.6
## 2 11.15 15.4167 47.9167 101 5099.381 9.824000 4.583333 660.5
## 3 82.90 55.0323 76.3226 169 5501.955 8.978000 30.000002 636.3
## 4 14.00 36.4754 77.0492 85 7101.831 8.978000 0.000000 651.9
## 5 71.50 33.1086 78.4270 171 5235.988 9.080333 13.857677 641.8
## 6 6.40 12.3188 86.9565 25 5580.147 10.415000 12.408759 605.7
## math
## 1 690.0
## 2 661.9
## 3 650.9
## 4 643.5
## 5 639.9
## 6 605.4
names(CASchools)
## [1] "district" "school" "county" "grades" "students"
## [6] "teachers" "calworks" "lunch" "computer" "expenditure"
## [11] "income" "english" "read" "math"
We can visualize the CASchools reading and % reduced-price lunch data in a scatterplot because both of our variables are quantitative. Before building the model and doing calculations, it’s important to first use a scatterplot to determine whether there appears to be a linear relationship. Meaning, is there a straight-line increase or decrease in reading scores as the % of students qualifying for reduced-price lunch increases? The general form for the plot R code is plot(predictor, response), or equivalently plot(response ~ predictor).
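A minimal version of the scatterplot, without labels or a title, can be produced with:
plot(lunch, read)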
Note that axis labels and a title can be added by modifying the above code as follows:
plot(lunch, read, xlab="Students Qualifying for Reduced-Price Lunch (%)", ylab="Average Reading Score", main="Reduced-Price Lunch and Reading Score")
It is also important to observe whether our data points are closely packed or scattered apart. Later we will talk about how this is part of our model assumptions.
Now we are ready to build the regression model. Use the command lm(response ~ predictor) to obtain the y-intercept and slope least squares point estimates.
mymod<-lm(read ~ lunch)
mymod
##
## Call:
## lm(formula = read ~ lunch)
##
## Coefficients:
## (Intercept) lunch
## 684.0962 -0.6515
The y-intercept is 684.0962 and the slope (under lunch) is -0.6515. We can interpret this output to mean that for every 1% increase in students qualifying for reduced-price lunch, we expect reading scores to decrease by 0.6515 points. We should be careful when interpreting the y-intercept in simple linear regression problems. We don’t interpret the y-intercept in the context of this problem because 0% of students qualifying for reduced-price lunch is not in our data.
The linear regression equation is: \[ \hat{y}= 684.0962 - 0.6515x \]
Now that we have our parameter estimates, we can see what our regression line looks like plotted with our data. Use the same scatterplot as above along with the command abline(intercept, slope).
plot(lunch, read, xlab="Students Qualifying for Reduced-Price Lunch (%)", ylab="Average Reading Score", main="Reduced-Price Lunch and Reading Score")
abline(684.0962, -0.6515)
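Alternatively, abline() also accepts the fitted model object itself, which draws the same line without retyping the estimates:
abline(mymod)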
We can better visualize the linear relationship between the response and predictor as well as see how tightly packed our data is around the regression line.
The correlation coefficient, denoted r, is a measure of the strength of the linear relationship between the response and the predictor. When r = 1 we have perfect positive correlation, and when r = -1 we have perfect negative correlation. Use cor(response, predictor) to calculate the correlation coefficient.
cor(read,lunch)
## [1] -0.8788077
We have strong negative correlation of -0.8788077 between the reading score and % of students qualifying for reduced-price lunch. Remember correlation does not imply causation, so we can only state the tendency that as % of students qualifying for reduced-price lunch increases, reading scores decrease.
Now that we have built our model, we can make predictions about reading scores based on the % of students qualifying for reduced-price lunch. Here we predict the reading score when the % qualifying is 76.3226%. Although we could pick any percent, we would like to compare the prediction with an actual data point. One option to find a data point would be to look at the data. Let’s instead do some indexing and have R do the work for us:
Looking back at the output for names() we can see that the lunch variable values are in the 8th column and the read variable values are in the 13th column. Let’s look at the data collected from the third school district:
CASchools[3,8]
## [1] 76.3226
actualread<-CASchools[3, 13]
predread<-coef(mymod)%*%c(1,CASchools[3,8])
actualread
## [1] 636.3
predread
## [,1]
## [1,] 634.3716
The 3rd school district in the data had 76.3226% of students qualifying for reduced-priced lunch and a reading score of 636.3. Based on our model, we predict the reading score to be 634.3716. This is equivalent to plugging 76.3226% into our regression equation:
684.0962-0.6515*76.3226
## [1] 634.372
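Equivalently, the built-in predict() function returns the same fitted value when given a data frame containing the predictor value (the column name must match the predictor name, lunch):
predict(mymod, data.frame(lunch=CASchools[3,8]))  # approximately 634.3716, matching the calculation above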
We can then calculate the residual, or error, by finding the difference between the actual score and the predicted score:
actualread-predread
## [,1]
## [1,] 1.928416
The residual value is 1.9284, so our prediction is 1.9284 points too low for the reading score.
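As a quick check, R already stores the residual for every observation in the model object, so the same value can be pulled out directly; the third element corresponds to the third school district:
residuals(mymod)[3]  # approximately 1.9284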
There are several assumptions that must hold in order to perform hypothesis testing and to create intervals.
Homoscedasticity means constant variance. We check that the population of y-values has a variance that doesn’t depend on the x-values by first reviewing our scatterplot. Equivalently, the residuals have a variance that doesn’t depend on the x-values.
plot(lunch, read, xlab="Students Qualifying for Reduced-Price Lunch (%)", ylab="Average Reading Score", main="Reduced-Price Lunch and Reading Score")
It can be difficult to make this determination by simply looking at the scatterplot, but here we don’t see evidence of a homoscedasticity violation. Let’s check this assumption by plotting the residuals against reading score, along with a horizontal line at 0 to aid in visualization:
myresids<-mymod$residuals
plot(myresids ~ read)
abline(0,0)
We can see better that we aren’t too concerned with violating the assumption.
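Residuals are also commonly plotted against the fitted values rather than the observed response; R’s built-in diagnostic plot does this directly, and it is a reasonable alternative check (not part of the original analysis):
plot(mymod, which=1)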
Another assumption is that the residuals have a normal distribution. An easy check for this assumption is to make a histogram and see if it appears normal.
hist(myresids)
This histogram looks to follow a normal distribution (bell-shaped). We can again get a better look by using a normal Q-Q plot:
qqnorm(myresids)
qqline(myresids)
The points are all very close to the line, so we are not violating the normality assumption.
The residuals should also have a mean equal to 0.
mean(myresids)
## [1] 4.078913e-16
The computed mean is on the order of \(10^{-16}\), which is effectively zero, so this assumption is satisfied. Another assumption is that the residuals should be statistically independent of each other. It is worth noting that regression assumptions rarely hold exactly in practice, but we should only worry when these assumptions are greatly violated.
Knowing our model does not appear to violate these assumptions, we can display a summary of our model, along with some useful additional information, with the summary command.
summary(mymod)
##
## Call:
## lm(formula = read ~ lunch)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.996 -6.560 -0.002 6.574 39.594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 684.0962 0.9045 756.35 <2e-16 ***
## lunch -0.6515 0.0173 -37.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.607 on 418 degrees of freedom
## Multiple R-squared: 0.7723, Adjusted R-squared: 0.7718
## F-statistic: 1418 on 1 and 418 DF, p-value: < 2.2e-16
One piece of information we can find is the pair of coefficient estimates we found earlier, listed under “Estimate” next to “(Intercept)” and “lunch”.
The t value and Pr(>|t|) found in the summary give us information about testing the significance of the y-intercept and slope. Our simple linear regression model is not much use unless there is a significant relationship between y and x. The null and alternative hypotheses are: \[H_0:\beta_1=0 \quad H_A:\beta_1\neq0\] We can reject the null hypothesis and accept the alternative hypothesis when \(|t|>t_{\alpha/2}\) with n-2 degrees of freedom. The p-value is then twice the area under the t curve to the right of |t|. A larger |t| yields a smaller p-value, so as the significance of the relationship between the response and predictor increases, we expect a smaller p-value.
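As a quick illustration, the critical value and the two-sided p-value for the slope can be computed directly in R, assuming a significance level of \(\alpha = 0.05\) (an assumption for this sketch) and using the slope’s t statistic of -37.65 from the summary above:
alpha<-0.05
qt(1-alpha/2, df=418)       # critical value t_(alpha/2) with n-2 = 418 df, roughly 1.97
2*pt(-abs(-37.65), df=418)  # two-sided p-value, essentially 0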
Let’s take a look at the summary to determine if there’s a significant relationship between reading scores and the % of students qualifying for reduced-price lunch. That is, does a change in the % qualifying change the reading score in our model? According to the summary, the test statistic is -37.65 and the p-value is < 2e-16. For a significance level \(\alpha= 0.05\), we have evidence to reject the null and accept the alternative hypothesis that the % of students qualifying for reduced-price lunch is a significant predictor of reading score. A similar process is carried out to test the significance of the y-intercept.
We can also find the multiple R-squared value in the summary output. This value tells us how well the data fit our regression model. R-squared ranges from 0 to 1, with a higher value indicating a stronger fit. Our R-squared value is 0.7723. The adjusted R-squared is 0.7718; this value accounts for the number of predictors.
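In simple linear regression, R-squared is simply the square of the correlation coefficient, which gives a quick sanity check:
cor(read, lunch)^2  # approximately 0.7723, matching the summary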
The F-test is used to test the significance of the regression relationship between x and y; that is, we are testing the significance of the simple linear regression model as a whole. Again we are testing the null and alternative hypotheses: \[H_0:\beta_1=0 \quad H_A:\beta_1\neq0\] The p-value is the area under the curve of the F-distribution (1 numerator and n-2 denominator degrees of freedom) to the right of the F test statistic.
Toward the bottom of the summary we see that we have 418 degrees of freedom, our F test statistic is 1418, and our p-value is < 2.2e-16. Thus we have evidence to reject the null and accept the alternative that the regression relationship between the % of students qualifying for reduced-price lunch and reading scores is significant.
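Because the model has a single predictor, the F statistic is just the square of the slope’s t statistic, which we can verify up to rounding:
(-37.65)^2  # approximately 1418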
We can also use the summary to find information about our constant variance assumption. A point estimate for the error variance is the mean square error. The residual standard error is 9.607, which is the square root of the mean square error. Now that we have a point estimate for the variance, we can look back at our scatterplot to again consider whether our constant variance assumption is violated.
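These quantities can also be extracted from the summary object directly (a small sketch):
s<-summary(mymod)$sigma  # residual standard error, about 9.607
s^2                      # mean square error, the point estimate of the error variance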
It is often useful to calculate a confidence interval for the slope. This helps us get a better idea of what the true value of the slope is. Use the confint() command to find the endpoints of the interval and to specify the level.
confint(mymod, level=0.9)
## 5 % 95 %
## (Intercept) 682.6051854 685.5872349
## lunch -0.6800298 -0.6229824
We are 90% confident that the true slope lies between -0.6800298 and -0.6229824.
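If no level is specified, confint() defaults to a 95% confidence interval:
confint(mymod)  # returns the 2.5 % and 97.5 % endpoints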
Our regression line \(\hat{y}= 684.0962 - 0.6515x\) is not likely to be exactly equal to the mean value of reading scores for a given value of the % of students qualifying for reduced-price lunch. It is also not likely to be equal to a particular reading score for a given value of the % of students qualifying for reduced-price lunch. When the regression assumptions hold, we can place bounds on how far \(\hat{y}\), our reading score prediction, might be from these values by creating confidence and prediction intervals. First create a new data frame.
newdata1<-data.frame(lunch=76.3226)
I have chosen 76.3226% for the % of students qualifying for reduced-price lunch so that we can compare with the earlier prediction.
Use the predict() command and be sure to specify confidence for the interval type.
confy<-predict(mymod,newdata1,interval="confidence")
confy
## fit lwr upr
## 1 634.3716 632.9555 635.7877
We are 95% confident that the average reading score is between 632.9555 and 635.7877 when the % qualifying for reduced-price lunch is 76.3226%.
The key difference between a confidence interval and a prediction interval is that a prediction interval is for an individual value of y, whereas a confidence interval is for the mean value of y.
predy<-predict(mymod, newdata1, interval="predict")
predy
## fit lwr upr
## 1 634.3716 615.4354 653.3077
When 76.3226% of students qualify for reduced-price lunch, we are 95% sure that a district’s reading score will be between 615.4354 and 653.3077.
Because there is more uncertainty in predicting a single reading score than in estimating the mean reading score, we expect the prediction interval to be wider than the confidence interval. Here’s some code to calculate the interval lengths:
predy%*%c(0, -1,1)
## [,1]
## 1 37.87226
confy%*%c(0,-1,1)
## [,1]
## 1 2.832207
The prediction interval length is 37.87226 and the confidence interval length is 2.832207. It’s also good practice to double check these intervals have the same center.
confy[1]==predy[1]
## [1] TRUE
The output returns TRUE so our intervals are centered at the same value.
A simple linear regression model is useful when we are interested in predicting response values and we assume the predictor and response have a linear relationship. It’s important to visualize the data as well as check assumptions before building the model and creating intervals. Overall, the simple linear regression model is a good way to predict reading scores based on the % of students qualifying for reduced-price lunch because there’s a strong negative correlation. However, there are likely outside influences, such as socioeconomic factors, underlying this data.