Chapter 3 discusses simple linear regression models and how to use them to show the strength of the relationship between the predictor and response variables.
We will use the Simple Linear Regression model equation, \[\hat{y_i}= \hat{\beta_0}+\hat{\beta_1} x_i\] This term can also be written including the error term, \[{y_i}= \hat{\beta_0}+\hat{\beta_1} x_i +{e_i} \] For many equations, we can interpret \(\hat{\beta_0}\) as the y-intercept of the equation, but for some independent variables, this would be inappropriate. For measuring a variable such as a person’s weight, we would be extrapolating the data if we assumed there were a point \({x_i}\) = 0. However, for many variables (number of pages read, amount of snowfall, count of attendees at an event, etc), 0 would be a valid possible value for \({x_i}\).
To understand the formula, we should first understand each of the components:
* \(\hat{\beta_1}\) is the slope of the line.
* \({y_i}\) is the output from the given predictor variable x.
* \({e_i}\) is the error term, which describes the effects of all other predictor values other than the predictor value x on the dependent y variable.
There is an assumption that the error term for a given \({x_i}\) and \({y_i}\) is statistically independent and identically distributed from any other error. Other assumptions we must consider are:
Using the data set “state.x77”, we will plot a linear regression model and analyze the data. Most of the data we are using is cross-sectional, where the data is observed at a single moment in time, rather than time series data, where the data is observed multiple times over a period.
First, we can get an idea of what our data set looks like by using the command head(), which will show the first few lines of our data set, and summary(), which will show where most of the data in each column lies.
head(state.x77)
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## Alaska 365 6315 1.5 69.31 11.3 66.7 152
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## California 21198 5114 1.1 71.71 10.3 62.6 20
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## Area
## Alabama 50708
## Alaska 566432
## Arizona 113417
## Arkansas 51945
## California 156361
## Colorado 103766
summary(state.x77)
## Population Income Illiteracy Life Exp
## Min. : 365 Min. :3098 Min. :0.500 Min. :67.96
## 1st Qu.: 1080 1st Qu.:3993 1st Qu.:0.625 1st Qu.:70.12
## Median : 2838 Median :4519 Median :0.950 Median :70.67
## Mean : 4246 Mean :4436 Mean :1.170 Mean :70.88
## 3rd Qu.: 4968 3rd Qu.:4814 3rd Qu.:1.575 3rd Qu.:71.89
## Max. :21198 Max. :6315 Max. :2.800 Max. :73.60
## Murder HS Grad Frost Area
## Min. : 1.400 Min. :37.80 Min. : 0.00 Min. : 1049
## 1st Qu.: 4.350 1st Qu.:48.05 1st Qu.: 66.25 1st Qu.: 36985
## Median : 6.850 Median :53.25 Median :114.50 Median : 54277
## Mean : 7.378 Mean :53.11 Mean :104.46 Mean : 70736
## 3rd Qu.:10.675 3rd Qu.:59.15 3rd Qu.:139.75 3rd Qu.: 81163
## Max. :15.100 Max. :67.30 Max. :188.00 Max. :566432
Now we can see the data contains information for all 50 states: the 1975 population estimate, income per capita in 1974, percent illiteracy, life expectancy, murder rates per 100 thousand population, percent of high school graduates in 1970, mean number of days below freezing, and the area in square miles.
I want to look at the realationship between two categories of data in our set and see if there is a correlation.
Let’s say we want to compare average income per capita and murder rates (per 100,000 population as defined in the data set); we want to find out if a change in a ‘predictor’ variable results in a change in a ‘response’ variable.
We must create a linear model:
statetable <- as.data.frame(state.x77)
Statemodel <- lm(Murder ~ Income, data= statetable)
summary(Statemodel)
##
## Call:
## lm(formula = Murder ~ Income, data = statetable)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0495 -2.8033 -0.2727 3.0730 6.5999
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.5093092 3.7782753 3.576 0.00081 ***
## Income -0.0013822 0.0008439 -1.638 0.10797
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.63 on 48 degrees of freedom
## Multiple R-squared: 0.05294, Adjusted R-squared: 0.03321
## F-statistic: 2.683 on 1 and 48 DF, p-value: 0.108
Looking at the summary we can see that we have an intercept of 13.5093092. This intercept is \(\hat{\beta_0}\) and our slope, or \(\hat{\beta_1}\) is -.0013822. A negative slope should mean that as average income increases, the number of murders should decline. We cannot interpret the intercept in our case because the per capita state income would never be 0.
We will start out with creating a scatterplot:
plot(Murder~Income, data= state.x77,
main="Scatterplot", xlab ="Per Capita State Income", ylab = "Number of Murders")
Now that we have a scatterplot, let’s plot a line through the data, using the intercept and slope given. The command abline() draws the line using the slope and intercept provided from our summary.
plot(Murder~Income, data= state.x77,
main="Scatterplot", xlab ="Per Capita State Income", ylab = "Number of Murders")
abline(13.5093092,-0.0013822)
This data looks fairly spread out and some points are fairly far from the line which means there would be a high error term. I’m not convinced that the Per Capita State Income and the Number of Murders are correlated. However, we can investigate further using other methods.
A big indicator for interpreting the fit of the linear model is the error component. We will investigate the distances of the data points from the line; if the distances are large, our model does not fit as well as one with a small error term.
Most data points do not lie directly on our linear regression line. The deviation from this line, our error term, is important to interpreting how well the line fits the data.
To interpret this, we will compare the line point estimate of the linear model to the actual data point for Alabama. First, we will use the data in the matrix to tell us the per capita state income and number of murders for Alabama, respectively:
state.x77[1,2]
## [1] 3624
state.x77[1,5]
## [1] 15.1
Here, the actual number of murders is 15.1. Now, using the coefficients we found from before, we will generate the prediction based on the regression line for the number of murders:
13.509309+0.001382*3624
## [1] 18.51768
The number of murders predicted is 18.51, which is a pretty big overestimate Ideally, a data set should have a small difference between the predicted value and the actual value to indicate it is a good model.
We can use MSE to estimate \(\hat{\sigma}^2\), which is shown in the Linear Model Summary (listed next to residual standard error), or we can calculate it by hand using the formula given in the book. We can define the residuals as resids using R, then calculating:
resids <- Statemodel$residuals
sqrt(sum((Statemodel$residuals)^2)/48)
## [1] 3.629733
As you can see, they both are 3.63. We would prefer an estimate of \(\hat{\sigma}^2\) much closer to 0.
Another note about the vast information the Linear Model Summary provided us: this model has 48 degrees of freedom. In general, a model has * n - 2 * degrees of freedom, which makes sense with our model of n = 50 states.
If you remember the beginning of this R guide, we touched on the assumptions about errors, including their distributions. We can look at how the errors are distributed and plot with a qq-plot to see how they fit to our line.
hist(resids)
qqnorm(resids)
qqline(resids)
My intuition says that these errors don’t look relatively normal in the histogram, which is one of the assumptions. However, the QQ plot looks relatively reasonable.
The purpose of hypothesis testing is to question the significance of the relationship between our independent and dependent variables.
We will provide a null hypothesis and an alternate to test against. Typically, we will test two-sided, but you can test one-sided if you so choose.
The purpose of a T test is to test the significance of the slope.
Null hypothesis: \(\hat{\beta_1}\) = 0
to test against \(\hat{\beta_1}\) =/= 0
We use the test statistic \(\hat{\beta_1}\) / (s / \({\sqrt(SS_x)}\))
If we decide to reject the null, we are saying the \(\hat{\beta_1}\) (the slope) is nonzero and that there is a significant relationship between the two variables. My guess is that we will not reject the null hypothesis.
In the Linear Model summary we can see that the slope is very close to 0, and the p value is .10797, which is not significant enough to reject the null hypothesis.
We use an F test to determine the significance of the relationship between the predictor and response variable. We will use the test statistic:
F stat = explained error/(unexplained error/(n-2))
Our Linear Model summary gave us the p-value = .108, displayed in the last line of the summary. Again, that is a high p-value, which means that we are probably correct in thinking they are not correlated.
We have two tools available for our testing: confidence and prediction intervals. These two provide upper and lower boundaries for a response variable, given values of the predictor variable. We know that the regression line will have errors (some quite large, for this example) and we might want to know how far from the line the actual response variables lie.
The distance value can be used in calculating the confidence and prediction intervals for simple linear regression. We don’t need to use it because R will calculate the intervals for us. However, for those interested, our distance value for can be calculated by using the formula: \(DV=1/n+(x_0-\bar{x})^2/(SS_x)\)
Confidence intervals are used for a mean value of the response variable. Prediction intervals are used for an individual value of y. Logically, we know the individual variable will have a larger deviation from the regression line than a mean variable would, because the mean is an average of all the individual variables. This results in the prediction interval being larger than the confidence interval. We can check this is true for our data set:
newdat <- data.frame(Income=5000)
conf <-predict(Statemodel, newdat, interval = "confidence")
conf
## fit lwr upr
## 1 6.598144 5.190439 8.00585
Looking at all the states with per capita income of 5000, we can be 95% confident that the true population mean number of murders is between 5.190439 and 8.00585.
pred <-predict(Statemodel, newdat, interval = "prediction")
pred
## fit lwr upr
## 1 6.598144 -0.834448 14.03074
We are 95% confident that a state with per capita income of 5000 will have between -0.834448 and 14.03074 murders. Obviously, a negative number here is illogical: this is a case where we are extrapolating past the valid data range.
We will compute the width of each just to check that the prediction interval is indeed larger.
conf %*% c(0, -1, 1)
## [,1]
## 1 2.815411
pred %*% c(0, -1, 1)
## [,1]
## 1 14.86518
We can also check they are centered in the same place:
conf[1] == pred[1]
## [1] TRUE
\({r^2}\) is the simple coefficient of determination, and it measures how useful a linear regression model is.
\({r^2}\) = explained variation / total variation
For our model, using the Linear Model Summary we can see that our R squared is 0.05294, which is very low. A model with a low \({r^2}\) would mean there is much more unexplained than explained variation, which means the model is not very useful.
We can also interpret \({r}\), the simple correlation coefficient using the cor() function:
cor(state.x77)
## Population Income Illiteracy Life Exp Murder
## Population 1.00000000 0.2082276 0.10762237 -0.06805195 0.3436428
## Income 0.20822756 1.0000000 -0.43707519 0.34025534 -0.2300776
## Illiteracy 0.10762237 -0.4370752 1.00000000 -0.58847793 0.7029752
## Life Exp -0.06805195 0.3402553 -0.58847793 1.00000000 -0.7808458
## Murder 0.34364275 -0.2300776 0.70297520 -0.78084575 1.0000000
## HS Grad -0.09848975 0.6199323 -0.65718861 0.58221620 -0.4879710
## Frost -0.33215245 0.2262822 -0.67194697 0.26206801 -0.5388834
## Area 0.02254384 0.3633154 0.07726113 -0.10733194 0.2283902
## HS Grad Frost Area
## Population -0.09848975 -0.3321525 0.02254384
## Income 0.61993232 0.2262822 0.36331544
## Illiteracy -0.65718861 -0.6719470 0.07726113
## Life Exp 0.58221620 0.2620680 -0.10733194
## Murder -0.48797102 -0.5388834 0.22839021
## HS Grad 1.00000000 0.3667797 0.33354187
## Frost 0.36677970 1.0000000 0.05922910
## Area 0.33354187 0.0592291 1.00000000
As we can see from this correlation chart, the correlation of Murder and Income is -.2300776. It is slightly negatively correlated.
Our results from the model would leave me to believe there is very little correlation between the Per Capita State Income and Number of Murders. I would be interested to see if this model provides similar results year-to-year. I am interested to see if there is any correlation between other variables and State Income, as well.