An Introduction to Simple Linear Regression

Chapter 3 discusses simple linear regression models and how to use them to show the strength of the relationship between the predictor and response variables.
We will use the Simple Linear Regression model equation, \[\hat{y_i}= \hat{\beta_0}+\hat{\beta_1} x_i\] This term can also be written including the error term, \[{y_i}= \hat{\beta_0}+\hat{\beta_1} x_i +{e_i} \] For many equations, we can interpret \(\hat{\beta_0}\) as the y-intercept of the equation, but for some independent variables, this would be inappropriate. For measuring a variable such as a person’s weight, we would be extrapolating the data if we assumed there were a point \({x_i}\) = 0. However, for many variables (number of pages read, amount of snowfall, count of attendees at an event, etc), 0 would be a valid possible value for \({x_i}\).

To understand the formula, we should first understand each of the components:
* \(\hat{\beta_1}\) is the slope of the line.
* \({y_i}\) is the output from the given predictor variable x.
* \({e_i}\) is the error term, which describes the effects of all other predictor values other than the predictor value x on the dependent y variable.
There is an assumption that the error term for a given \({x_i}\) and \({y_i}\) is statistically independent and identically distributed from any other error. Other assumptions we must consider are:

  1. at any \({x_i}\) population of potential error term values has a mean equal to 0
  2. at any \({x_i}\), the error has a variance that doesn’t depend on x (also known as homoscedasticity)
  3. at any \({x_i}\), the potential error terms have a normal distribution

A Motivating Example

Using the data set “state.x77”, we will plot a linear regression model and analyze the data. Most of the data we are using is cross-sectional, where the data is observed at a single moment in time, rather than time series data, where the data is observed multiple times over a period.
First, we can get an idea of what our data set looks like by using the command head(), which will show the first few lines of our data set, and summary(), which will show where most of the data in each column lies.

head(state.x77)
##            Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area
## Alabama     50708
## Alaska     566432
## Arizona    113417
## Arkansas    51945
## California 156361
## Colorado   103766
summary(state.x77)
##    Population        Income       Illiteracy       Life Exp    
##  Min.   :  365   Min.   :3098   Min.   :0.500   Min.   :67.96  
##  1st Qu.: 1080   1st Qu.:3993   1st Qu.:0.625   1st Qu.:70.12  
##  Median : 2838   Median :4519   Median :0.950   Median :70.67  
##  Mean   : 4246   Mean   :4436   Mean   :1.170   Mean   :70.88  
##  3rd Qu.: 4968   3rd Qu.:4814   3rd Qu.:1.575   3rd Qu.:71.89  
##  Max.   :21198   Max.   :6315   Max.   :2.800   Max.   :73.60  
##      Murder          HS Grad          Frost             Area       
##  Min.   : 1.400   Min.   :37.80   Min.   :  0.00   Min.   :  1049  
##  1st Qu.: 4.350   1st Qu.:48.05   1st Qu.: 66.25   1st Qu.: 36985  
##  Median : 6.850   Median :53.25   Median :114.50   Median : 54277  
##  Mean   : 7.378   Mean   :53.11   Mean   :104.46   Mean   : 70736  
##  3rd Qu.:10.675   3rd Qu.:59.15   3rd Qu.:139.75   3rd Qu.: 81163  
##  Max.   :15.100   Max.   :67.30   Max.   :188.00   Max.   :566432

Now we can see the data contains information for all 50 states: the 1975 population estimate, income per capita in 1974, percent illiteracy, life expectancy, murder rates per 100 thousand population, percent of high school graduates in 1970, mean number of days below freezing, and the area in square miles.

I want to look at the realationship between two categories of data in our set and see if there is a correlation.

Let’s say we want to compare average income per capita and murder rates (per 100,000 population as defined in the data set); we want to find out if a change in a ‘predictor’ variable results in a change in a ‘response’ variable.
We must create a linear model:

statetable <- as.data.frame(state.x77)
Statemodel <- lm(Murder ~ Income, data= statetable)
summary(Statemodel)
## 
## Call:
## lm(formula = Murder ~ Income, data = statetable)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0495 -2.8033 -0.2727  3.0730  6.5999 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.5093092  3.7782753   3.576  0.00081 ***
## Income      -0.0013822  0.0008439  -1.638  0.10797    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.63 on 48 degrees of freedom
## Multiple R-squared:  0.05294,    Adjusted R-squared:  0.03321 
## F-statistic: 2.683 on 1 and 48 DF,  p-value: 0.108

Looking at the summary we can see that we have an intercept of 13.5093092. This intercept is \(\hat{\beta_0}\) and our slope, or \(\hat{\beta_1}\) is -.0013822. A negative slope should mean that as average income increases, the number of murders should decline. We cannot interpret the intercept in our case because the per capita state income would never be 0.

We will start out with creating a scatterplot:

plot(Murder~Income, data= state.x77,
     main="Scatterplot", xlab ="Per Capita State Income", ylab = "Number of Murders")

Now that we have a scatterplot, let’s plot a line through the data, using the intercept and slope given. The command abline() draws the line using the slope and intercept provided from our summary.

plot(Murder~Income, data= state.x77,
     main="Scatterplot", xlab ="Per Capita State Income", ylab = "Number of Murders")
abline(13.5093092,-0.0013822)

This data looks fairly spread out and some points are fairly far from the line which means there would be a high error term. I’m not convinced that the Per Capita State Income and the Number of Murders are correlated. However, we can investigate further using other methods.

Exploring the Error Term

A big indicator for interpreting the fit of the linear model is the error component. We will investigate the distances of the data points from the line; if the distances are large, our model does not fit as well as one with a small error term.

Point Estimates

Most data points do not lie directly on our linear regression line. The deviation from this line, our error term, is important to interpreting how well the line fits the data.
To interpret this, we will compare the line point estimate of the linear model to the actual data point for Alabama. First, we will use the data in the matrix to tell us the per capita state income and number of murders for Alabama, respectively:

state.x77[1,2]
## [1] 3624
state.x77[1,5]
## [1] 15.1

Here, the actual number of murders is 15.1. Now, using the coefficients we found from before, we will generate the prediction based on the regression line for the number of murders:

13.509309+0.001382*3624
## [1] 18.51768

The number of murders predicted is 18.51, which is a pretty big overestimate Ideally, a data set should have a small difference between the predicted value and the actual value to indicate it is a good model.

Mean Square Error

We can use MSE to estimate \(\hat{\sigma}^2\), which is shown in the Linear Model Summary (listed next to residual standard error), or we can calculate it by hand using the formula given in the book. We can define the residuals as resids using R, then calculating:

resids <- Statemodel$residuals
sqrt(sum((Statemodel$residuals)^2)/48)
## [1] 3.629733

As you can see, they both are 3.63. We would prefer an estimate of \(\hat{\sigma}^2\) much closer to 0.
Another note about the vast information the Linear Model Summary provided us: this model has 48 degrees of freedom. In general, a model has * n - 2 * degrees of freedom, which makes sense with our model of n = 50 states.

Graphing the Residuals

If you remember the beginning of this R guide, we touched on the assumptions about errors, including their distributions. We can look at how the errors are distributed and plot with a qq-plot to see how they fit to our line.

hist(resids)

qqnorm(resids)
qqline(resids)

My intuition says that these errors don’t look relatively normal in the histogram, which is one of the assumptions. However, the QQ plot looks relatively reasonable.

Hypothesis Testing

The purpose of hypothesis testing is to question the significance of the relationship between our independent and dependent variables.
We will provide a null hypothesis and an alternate to test against. Typically, we will test two-sided, but you can test one-sided if you so choose.

T-Test

The purpose of a T test is to test the significance of the slope.
Null hypothesis: \(\hat{\beta_1}\) = 0
to test against \(\hat{\beta_1}\) =/= 0
We use the test statistic \(\hat{\beta_1}\) / (s / \({\sqrt(SS_x)}\))
If we decide to reject the null, we are saying the \(\hat{\beta_1}\) (the slope) is nonzero and that there is a significant relationship between the two variables. My guess is that we will not reject the null hypothesis.
In the Linear Model summary we can see that the slope is very close to 0, and the p value is .10797, which is not significant enough to reject the null hypothesis.

F-Test

We use an F test to determine the significance of the relationship between the predictor and response variable. We will use the test statistic:
F stat = explained error/(unexplained error/(n-2))
Our Linear Model summary gave us the p-value = .108, displayed in the last line of the summary. Again, that is a high p-value, which means that we are probably correct in thinking they are not correlated.

Confidence and Prediction Intervals

We have two tools available for our testing: confidence and prediction intervals. These two provide upper and lower boundaries for a response variable, given values of the predictor variable. We know that the regression line will have errors (some quite large, for this example) and we might want to know how far from the line the actual response variables lie.

Distance Value

The distance value can be used in calculating the confidence and prediction intervals for simple linear regression. We don’t need to use it because R will calculate the intervals for us. However, for those interested, our distance value for can be calculated by using the formula: \(DV=1/n+(x_0-\bar{x})^2/(SS_x)\)

Creating the Intervals

Confidence intervals are used for a mean value of the response variable. Prediction intervals are used for an individual value of y. Logically, we know the individual variable will have a larger deviation from the regression line than a mean variable would, because the mean is an average of all the individual variables. This results in the prediction interval being larger than the confidence interval. We can check this is true for our data set:

newdat <- data.frame(Income=5000)
conf <-predict(Statemodel, newdat, interval = "confidence")
conf
##        fit      lwr     upr
## 1 6.598144 5.190439 8.00585

Looking at all the states with per capita income of 5000, we can be 95% confident that the true population mean number of murders is between 5.190439 and 8.00585.

pred <-predict(Statemodel, newdat, interval = "prediction")
pred
##        fit       lwr      upr
## 1 6.598144 -0.834448 14.03074

We are 95% confident that a state with per capita income of 5000 will have between -0.834448 and 14.03074 murders. Obviously, a negative number here is illogical: this is a case where we are extrapolating past the valid data range.
We will compute the width of each just to check that the prediction interval is indeed larger.

conf %*% c(0, -1, 1)
##       [,1]
## 1 2.815411
pred %*% c(0, -1, 1)
##       [,1]
## 1 14.86518

We can also check they are centered in the same place:

conf[1] == pred[1]
## [1] TRUE

Simple Coefficient of Determination and Correlation

\({r^2}\) is the simple coefficient of determination, and it measures how useful a linear regression model is.
\({r^2}\) = explained variation / total variation
For our model, using the Linear Model Summary we can see that our R squared is 0.05294, which is very low. A model with a low \({r^2}\) would mean there is much more unexplained than explained variation, which means the model is not very useful.
We can also interpret \({r}\), the simple correlation coefficient using the cor() function:

cor(state.x77)
##             Population     Income  Illiteracy    Life Exp     Murder
## Population  1.00000000  0.2082276  0.10762237 -0.06805195  0.3436428
## Income      0.20822756  1.0000000 -0.43707519  0.34025534 -0.2300776
## Illiteracy  0.10762237 -0.4370752  1.00000000 -0.58847793  0.7029752
## Life Exp   -0.06805195  0.3402553 -0.58847793  1.00000000 -0.7808458
## Murder      0.34364275 -0.2300776  0.70297520 -0.78084575  1.0000000
## HS Grad    -0.09848975  0.6199323 -0.65718861  0.58221620 -0.4879710
## Frost      -0.33215245  0.2262822 -0.67194697  0.26206801 -0.5388834
## Area        0.02254384  0.3633154  0.07726113 -0.10733194  0.2283902
##                HS Grad      Frost        Area
## Population -0.09848975 -0.3321525  0.02254384
## Income      0.61993232  0.2262822  0.36331544
## Illiteracy -0.65718861 -0.6719470  0.07726113
## Life Exp    0.58221620  0.2620680 -0.10733194
## Murder     -0.48797102 -0.5388834  0.22839021
## HS Grad     1.00000000  0.3667797  0.33354187
## Frost       0.36677970  1.0000000  0.05922910
## Area        0.33354187  0.0592291  1.00000000

As we can see from this correlation chart, the correlation of Murder and Income is -.2300776. It is slightly negatively correlated.

Conclusion

Our results from the model would leave me to believe there is very little correlation between the Per Capita State Income and Number of Murders. I would be interested to see if this model provides similar results year-to-year. I am interested to see if there is any correlation between other variables and State Income, as well.