An Introduction to Simple Linear Regression

Chapter 3 of “Forecasing, Time Series, and Regression” discusses simple linear regression models and how to use them to show the strength of the linear relationship between the predictor and response variables.
We will use the simple linear regression model equation,

\(\hat{y_i}= \hat{\beta_0}+\hat{\beta_1} x_i\) .

This term can also be written including the error term,

\({y_i}= \hat{\beta_0}+\hat{\beta_1} x_i +{e_i}\) .

We can interpret \(\hat{\beta_0}\) as the y-intercept of the equation, but for some independent variables, it would be inappropriate to interpret the y-intercept in the context of our data set. For measuring a variable such as a person’s weight, we would be extrapolating the data if we assumed there were a point \({x_i}\) = 0. However, for many variables (number of pages read, amount of snowfall, count of attendees at an event, etc), 0 would be a valid possible value for \({x_i}\).

To understand the formula, we should first understand each of the components:
* \(\hat{\beta_1}\) is the slope of the line.
* \(\hat{y_i}\) is the output from the given predictor variable x.
* \({\epsilon_i}\) is the error term or residual, which describes the effects of all other predictor values other than the predictor value x on the dependent y variable. It is calculated as \({\epsilon_i} = {y_i} - \hat{y_i}\) .

Simple linear regression includes multiple assumptions we must consider for the residual term:

  1. at any \({x_i}\) population of potential error term values has a mean equal to 0. R already verifiess this for us, so we do not have to.
  2. at any \({x_i}\), the error has a variance that does not depend on x (also known as homoscedasticity).
  3. at any \({x_i}\), the potential error terms have a normal distribution.
  4. at any \({x_i}\) and \({y_i}\), the residual error term is statistically independent and identically distributed from any other error.

We will examine the residual error term in greater detail later in this guide.

A Motivating Example

Using the data set state.x77, we will plot a linear regression model and analyze the data. Most of the data we are using is cross-sectional - where the data is observed at a single moment in time, rather than time series data - where the data is observed multiple times over a period.

Exploratory Data Analysis

First, we can get an idea of what our data set looks like by using the command head(data_name) which will display the first few lines of our data set, where data_name is the name of your data set. Next, summary(data_name) will show statistics of each variable.

head(state.x77)
##            Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area
## Alabama     50708
## Alaska     566432
## Arizona    113417
## Arkansas    51945
## California 156361
## Colorado   103766
summary(state.x77)
##    Population        Income       Illiteracy       Life Exp    
##  Min.   :  365   Min.   :3098   Min.   :0.500   Min.   :67.96  
##  1st Qu.: 1080   1st Qu.:3993   1st Qu.:0.625   1st Qu.:70.12  
##  Median : 2838   Median :4519   Median :0.950   Median :70.67  
##  Mean   : 4246   Mean   :4436   Mean   :1.170   Mean   :70.88  
##  3rd Qu.: 4968   3rd Qu.:4814   3rd Qu.:1.575   3rd Qu.:71.89  
##  Max.   :21198   Max.   :6315   Max.   :2.800   Max.   :73.60  
##      Murder          HS Grad          Frost             Area       
##  Min.   : 1.400   Min.   :37.80   Min.   :  0.00   Min.   :  1049  
##  1st Qu.: 4.350   1st Qu.:48.05   1st Qu.: 66.25   1st Qu.: 36985  
##  Median : 6.850   Median :53.25   Median :114.50   Median : 54277  
##  Mean   : 7.378   Mean   :53.11   Mean   :104.46   Mean   : 70736  
##  3rd Qu.:10.675   3rd Qu.:59.15   3rd Qu.:139.75   3rd Qu.: 81163  
##  Max.   :15.100   Max.   :67.30   Max.   :188.00   Max.   :566432

Now we can see the data contains information for all 50 states: the 1975 population estimate, income per capita in 1974, percent illiteracy, life expectancy, murder rates per 100 thousand population, percent of high school graduates in 1970, mean number of days below freezing, and the area in square miles.

Next, we will create a scatterplot using the two variables that we choose to examine for a linear relationship. I propose the question: does per capita income for each state have a linear relationship with the murder rate in that state? The command to plot these two variables includes the dependent variable, Murder, based on the independent variable, Income. We must also indicate the data set we want to use, and rename the labels on the graph to be more informative, using xlab for the x axis, ylab for the y axis, and main for the title. The commands for this are shown below:

plot(Murder~Income, data= state.x77,
     main="Figure 1", xlab ="Per Capita State Income", ylab = "Murders per 100 Thousand")

This scatterplot provides all 50 states’ number of murders plotted against the per capita state income. I don’t see a super clear pattern, although it does look like there are higher number of murders with lower state income (displayed in the top left side of the scatterplot). I also notice one outlier data point, Alaska, at a per capita income of 6315 and 11.3 murders. This may disrupt any linear relationship we may observe, which we should be note.

Creating the Linear Regression Model

I want to look at the relationship between these two quantitative variables see if there is a simple linear correlation. Using state.77, we will define average income per capita as a predictor and murder rates (per 100,000 population) as a response.

We must create a linear model and name it Statemodel, with the response based on the predictor using the ~. However, since state.x77 is formatted as a matrix, we must first create a table with the matrix data, using the as.data.frame(data_name) command:

statetable <- as.data.frame(state.x77)
Statemodel <- lm(Murder ~ Income, data= statetable)

The best way to start interpreting the model is to use the summary(model_name) command.

summary(Statemodel)
## 
## Call:
## lm(formula = Murder ~ Income, data = statetable)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0495 -2.8033 -0.2727  3.0730  6.5999 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.5093092  3.7782753   3.576  0.00081 ***
## Income      -0.0013822  0.0008439  -1.638  0.10797    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.63 on 48 degrees of freedom
## Multiple R-squared:  0.05294,    Adjusted R-squared:  0.03321 
## F-statistic: 2.683 on 1 and 48 DF,  p-value: 0.108

The summary provides a great deal of helpful information. We can see that we have an intercept \(\hat{\beta_0}\) = 13.5093092 and a slope of \(\hat{\beta_1}\) = -.0013822. A negative slope should mean that as average income increases, the number of murders should decline. In this example, for any $100 increase of income per capita, we expect the murder rate (per 100,000 population) to decline by .13822. We cannot interpret the intercept in our case because the per capita state income would never be 0.

Now that we have the components of our linear regression line, let’s plot it over Figure 1, using the intercept and slope given. The command abline(intercept,slope) draws the line using the slope and intercept provided from our summary.

plot(Murder~Income, data= state.x77,
     main="Figure 2", xlab ="Per Capita State Income", ylab = "Number of Murders")
abline(13.5093092,-0.0013822)

This data looks fairly spread out and some points are fairly far from the line which means there would be a high error term. I’m not convinced that the per capita state income and the number of murders are linearly correlated because the data points do not seem to closely follow the pattern of the line. However, we can investigate further using other methods.

Exploring the Error Term

A big indicator for interpreting the fit of the linear model is the residual error component. We will investigate the residuals of the data points from the line; if they are large, our model does not indicate a strong linear relationship.

Point Estimates

Most data points do not lie directly on our linear regression line. The deviation from this line, our error term, is important to interpreting how well the line fits the data.
To interpret this, we will compare the line point estimate of the linear model to the actual data point for Alabama. First, we will use the data in the matrix to tell us the per capita state income and number of murders for Alabama, respectively:

state.x77[1,2]
## [1] 3624
state.x77[1,5]
## [1] 15.1

Here, the actual number of murders is 15.1 for a state with per capita income of 3624. Now, using the coefficients we found from before, we will generate the prediction based on the linear regression line for the number of murders, given a per capita income of 3624:

13.509309+-0.001382*3624
## [1] 8.500941

The number of murders predicted is 8.500941, which is a pretty big underestimate of 15.1. Ideally, a data set should have a small difference between the predicted value and the actual value to indicate it is a good model.

Mean Square Error

We can use MSE to estimate \({\sigma}^2\), which is shown in the Linear Model Summary (listed next to residual standard error), or we can calculate it by hand using the formula given in the textbook. R provides the residuals with the command model_name$residuals.

resids <- Statemodel$residuals

As you can see looking back at the linear model summary, the MSE is 3.63. We would prefer an estimate of \({\sigma}^2\) much closer to 0 because smaller residuals indicate a higher linear correlation.

Graphing the Residuals

If you remember at the beginning of this R guide, we touched on the assumptions about errors, including their distributions. The way to visually test the assumption that the residuals are normally distributed is through a qq-plot. If the residuals are normally distributed, they will follow the qqline(residuals_name) fairly closely. We plot this by using the command qqplot(residuals_name) with the residuals we named above.

qqnorm(resids, main="Normal Q-Q Plot of Residuals from Statemodel")
qqline(resids)

In my opinion, the qq-plot points follow the line relatively closely between (-1,1), although toward the edges we can see the data points moving further from the line. This means the tail ends of our data (very low income and very high income) are not as spread out as should be for normally distributed errors.

Heteroscedasticity

It is also important to check the error assumption that each error is independent or constant, and does not depend on what the x is, and the fancy word for this is homoscedasticity. Heteroscedasticity is demonstrated when there is a pattern to the residual errors, which is bad news for any model, because you don’t want systematic overprediction or underprediction. The command we use is plot() with the residuals we defined from our model plotted against the independent variable Income. We also add the abline(0,0) to show that the points should be equally spread out.

plot(resids~Income, data=state.x77, main="Residuals Against Income", ylab="Residuals of Statemodel")
abline(0,0)

Heteroscedasticity is bad for the model because it violates one of the assumptions about simple linear regression models. Luckily for us, this model’s residuals do not appear to have a strong pattern, so it does not violate the error assumption we stated above.

Hypothesis Testing

The purpose of hypothesis testing is to question the significance of the linear relationship between our independent and dependent variables.
We will provide a null hypothesis and an alternate to test against. Typically, we will test two-sided, but you can test one-sided if you so choose. For this simple linear regression example, the null hypothesis is \({\beta_1}\) = 0 and we will test against \({\beta_1}\) \(\neq\) 0.

T-Test

The purpose of a T test is to test the significance of the slope.
We use the test statistic \(\hat{\beta_1} / ( s/\sqrt{SS_(xx_)} )\)
If we decide to reject the null, we are saying the \(\hat{\beta_1}\) (the slope) is nonzero and that there is a significant linear relationship between the two variables. My guess is that we will not reject the null hypothesis.
In the Linear Model summary we can see that the slope is very close to 0, and the p value is .10797, which is not significant enough to reject the null hypothesis.

F-Test

We use an F test to determine the significance of the linear relationship between the predictor and response variable. We will use the test statistic:
F stat = explained error / (unexplained error / \((n - 2))\)
Our Linear Model summary gave us the p value = .108, displayed in the last line of the summary. Again, that is a high p-value, which means that we are probably correct in thinking they are not linearly correlated. Right next to the F test, we can note that the linear model summary provided us with the number of degrees of freedom, which is 48. In general, a simple linear regression model has n - 2 degrees of freedom, which makes sense with our model of n = 50 states.

Confidence and Prediction Intervals

We have two tools available for our testing: confidence and prediction intervals. These two provide upper and lower boundaries for a response variable, given values of the predictor variable. We know that the regression line will have errors (some quite large, for this example) and we might want to know how far from the line the actual response variables lie.

Distance Value

The distance value can be used in calculating the confidence and prediction intervals for simple linear regression. We don’t need to use it because R will calculate the intervals for us. However, for those interested, our distance value can be calculated by using the formula: \(DV=1/n+(x_0-\bar{x})^2/(SS_(xx_))\)

Creating the Intervals

Confidence intervals are used for a mean value of the response variable. Prediction intervals are used for an individual value of y. Logically, we know the individual value will have a larger deviation from the regression line than a mean variable would, because the mean is an average of all the individual values. This results in the prediction interval being larger than the confidence interval. We can check this is true for our data set, by defining newdat to be the case that Income is 5000 and using the predict(model_name) and specifying either a confidence or prediction interval:

newdat <- data.frame(Income=5000)
conf <-predict(Statemodel, newdat, interval = "confidence")
conf
##        fit      lwr     upr
## 1 6.598144 5.190439 8.00585

Looking at all the states with per capita income of 5000, we can be 95% confident that the true population mean number of murders is between 5.190439 and 8.00585.

pred <-predict(Statemodel, newdat, interval = "prediction")
pred
##        fit       lwr      upr
## 1 6.598144 -0.834448 14.03074

We are 95% confident that a given state with per capita income of 5000 will have between -0.834448 and 14.03074 murders. Obviously, a negative number here is illogical: this is a case where we are extrapolating past the valid data range.
We will compute the width of each interval to check that the prediction interval is indeed larger. We use the command interval_name %*% c(0,-1,1):

conf %*% c(0, -1, 1)
##       [,1]
## 1 2.815411
pred %*% c(0, -1, 1)
##       [,1]
## 1 14.86518

The width of the confidence interval is 2.815411 and the width of the prediction interval is 14.86518, so the prediction interval is indeed larger than the confidence interval.
We can also check they are centered in the same place:

conf[1] == pred[1]
## [1] TRUE

R produces the result TRUE, so they are centered in the same place.

Confidence Intervals for the Parameters

We can find a 95% confidence interval (which is the default \(1-{\alpha}\)%) for the parameters \(\hat{\beta_0}\) and \(\hat{\beta_1}\) using the function confint(model_name) command.

confint(Statemodel)
##                    2.5 %       97.5 %
## (Intercept)  5.912577550 2.110604e+01
## Income      -0.003078949 3.144832e-04

This provides us with the upper and lower boundaries: 2.5% on each side. The biggest takeaway from this result is that the slope’s confidence interval, next to “Income”, can be interpreted that we are 95% confident that the slope will be between -0.003078949 and 3.144832e-04.

Simple Coefficient of Determination and Correlation

\({r^2}\) is the simple coefficient of determination, and measures the proportion of variability in the response that is explained by its linear relationship with the predictor.
For our model, using the Linear Model Summary we can see that our R squared is 0.05294, which is very low. A model with a low \({r^2}\) would mean there is much more unexplained than explained variation, which means the model is not very useful.
We can also interpret \({r}\), the simple correlation coefficient using the cor(data_name) function:

cor(state.x77)
##             Population     Income  Illiteracy    Life Exp     Murder
## Population  1.00000000  0.2082276  0.10762237 -0.06805195  0.3436428
## Income      0.20822756  1.0000000 -0.43707519  0.34025534 -0.2300776
## Illiteracy  0.10762237 -0.4370752  1.00000000 -0.58847793  0.7029752
## Life Exp   -0.06805195  0.3402553 -0.58847793  1.00000000 -0.7808458
## Murder      0.34364275 -0.2300776  0.70297520 -0.78084575  1.0000000
## HS Grad    -0.09848975  0.6199323 -0.65718861  0.58221620 -0.4879710
## Frost      -0.33215245  0.2262822 -0.67194697  0.26206801 -0.5388834
## Area        0.02254384  0.3633154  0.07726113 -0.10733194  0.2283902
##                HS Grad      Frost        Area
## Population -0.09848975 -0.3321525  0.02254384
## Income      0.61993232  0.2262822  0.36331544
## Illiteracy -0.65718861 -0.6719470  0.07726113
## Life Exp    0.58221620  0.2620680 -0.10733194
## Murder     -0.48797102 -0.5388834  0.22839021
## HS Grad     1.00000000  0.3667797  0.33354187
## Frost       0.36677970  1.0000000  0.05922910
## Area        0.33354187  0.0592291  1.00000000

As we can see from this linear correlation chart, the correlation of Murder and Income is -.2300776. It is very slightly negatively linearly correlated.

Conclusion

Our results from the model would leave me to believe there is very little linear correlation between the Per Capita State Income and Number of Murders. I would be interested to see if this model provides similar results year-to-year. I am interested to see if there is any linear correlation between other variables and State Income, as well.