Simple Linear Regression Model

In this R guide, we will discuss and analyze simple linear regression with two variables. We will examine the relationship between a dependent variable (y) and an independent variable (x), compute predictions and intervals, and explain the utility of simple linear regression.

For this R guide, we will use the USArrests dataset.

It's important to begin by attaching the data so we don't need to refer to it by the dataset name every time, and it's also important to look at what the data looks like. We can look at the first few rows and the column names to get an idea of what our data is about.

data(USArrests)
attach(USArrests)
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
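
We can also confirm the size of the dataset; each row is one state:

dim(USArrests)
## [1] 50  4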

We are going to be interested in the relationship between two variables, UrbanPop and Assault. UrbanPop is the percent of urban population in the state and Assault is the number of assault arrests per 100,000 residents.

In our simple linear regression model, UrbanPop is the explanatory variable and Assault is the response variable. I use urban population as the independent variable and assault arrests as the dependent variable because I believe that if a state has a higher urban population, it will have a higher number of assault arrests due to denser populations and larger social inequalities in urban areas.

Scatter Plots

Scatter plots are useful for displaying the relationship between two variables. We can use the “plot” command with the two quantitative variables separated by a comma; the first variable is the explanatory variable and the second is the response. Below we graph UrbanPop against Assault; notice that UrbanPop is the explanatory variable and Assault is the response.

Furthermore, we can add axis labels and a title to make the plot clearer and easier to understand.

plot(UrbanPop, Assault,
     ylab = "Assault Arrests per 100,000 Residents",
     xlab = "State's Percent Urban Population",
     main = "Urban Population vs Assault Arrests")

If Assault increases or decreases in a straight line as UrbanPop increases or decreases, that would give us evidence of a relationship between the two quantitative variables. However, with this data it is hard to tell whether there is a linear relationship, since the points are varied and spread out along the x-axis. The next step is to create a regression model to statistically describe the relationship between UrbanPop and Assault.

Creating the Linear Regression Model

To create a linear model that describes the relationship between the variables, we can use the “lm” command. In the formula, the response variable comes first, followed by a tilde (~) and then the explanatory variable.

In our data, the response is assault arrests per 100,000 residents and our explanatory variable is the state’s urban population percent:

mymod <- lm(Assault ~ UrbanPop, data = USArrests)
mymod
## 
## Call:
## lm(formula = Assault ~ UrbanPop, data = USArrests)
## 
## Coefficients:
## (Intercept)     UrbanPop  
##       73.08         1.49
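
If we want the unrounded estimates, we can pull them out of the fitted model with the coef command:

coef(mymod)   # named vector containing the intercept and the UrbanPop slope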

Now that we have our slope and intercept, we need to interpret them. In our data, we'd say: for every one percent increase in a state's urban population, the model predicts an increase of about 1.49 assault arrests per 100,000 residents. Whether this slope reflects a statistically significant relationship is something we will test further down.

We will call our intercept β0 and our slope β1. Our simple linear regression model is:

y = β0 + β1x

or

y = 73.08 + 1.49x

Plotting the Regression Line

After finding the regression line with the “lm” command, we are going to add the line to our scatter plot. This can help us see the linear trend.

To do this we draw the scatterplot as normal and then follow it up with the “abline” command. In the “abline” command, the intercept and the slope are separated by a comma.

plot(UrbanPop, Assault)
abline(73.08, 1.49)
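
Since mymod already stores the fitted coefficients, we could equivalently pass the model itself to abline and avoid retyping rounded values:

plot(UrbanPop, Assault)
abline(mymod)   # uses the model's stored intercept and slope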

At first glance, the line appears to fit the data. However, we want to know how well the line fits the data.

Regression Assumptions

Normality Assumption

First, we need to see if our residuals are normally distributed.

Let’s look at a histogram of the residuals:

myresids <- mymod$residuals
hist(myresids)

From the histogram, it is hard to tell if the residuals are normally distributed. We can double-check with a normal Q-Q plot, which compares the quantiles of the residuals against those of a normal distribution:

qqnorm(myresids)
qqline(myresids)

From the plot, we can see that the residuals mostly follow the straight reference line. This suggests the residuals are approximately normally distributed.
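
If we want a formal check to go along with the visual one, one option (a supplement to the plots above, not part of the original workflow) is a Shapiro-Wilk test, where a large p-value is consistent with normality:

shapiro.test(myresids)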

Constant Variance Assumption

Second, we need to make sure the variance of the residuals is constant. If it is, there is no evidence of heteroscedasticity. We can check for equal variance by plotting the residuals against urban population percent:

plot(mymod$residuals ~ UrbanPop)
abline(h = 0)

The residual variance looks roughly constant across urban population percent, so the constant variance assumption is satisfied.

Mean of Error Term Values

The last two regression assumptions we need to check are: 1) The mean of the error term values is equal to 0:

mean(myresids)
## [1] 1.46273e-15

This is effectively zero; the tiny nonzero value is just floating-point rounding, since for least squares with an intercept the residuals sum to exactly zero in exact arithmetic.

Independence

2) Each residual is independent of the other residuals. This is reasonable here because each observation is a separate state, so there is no obvious ordering or pairing that would make one residual depend on another.

Therefore, the regression assumptions are reasonably met, so we can trust the results of the tests that follow.

Prediction

Since the regression assumptions are satisfied, we can use the simple regression line to predict assault arrests based on urban population.

As a reminder, our regression line is:

y = 73.08 + 1.49x

To predict, we plug a value of our explanatory variable (a state's urban population) into our regression line. First, we need to grab a data point. To do that, we index by the row number of a state and the variable we are looking for (in our case, UrbanPop). For example, let's use California.

USArrests[5, "UrbanPop"]
## [1] 91

To determine what our regression line predicts for California's assault arrests per 100,000 residents, we plug in California's urban population percent of 91:

73.08 + 1.49 * 91
## [1] 208.67

The regression line predicts that at an urban population percent of 91, there will be roughly 209 assault arrests per 100,000 residents. In reality, California had 276 assault arrests. Further down, we will test whether there is a linear relationship between the two variables to see if this line gives us accurate predictions.
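
Rather than plugging rounded coefficients in by hand, we can also let the predict command (which we will use again later for intervals) do this arithmetic with the model's unrounded coefficients:

predict(mymod, newdata = data.frame(UrbanPop = 91))   # close to our hand calculation of 208.67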

Next, we will look at the mean square error and the residual standard error.

Linear Regression Performance

Mean Square Error and Residual Standard Error

Instead of computing these by hand, we can use the summary command! :)

summary(mymod)
## 
## Call:
## lm(formula = Assault ~ UrbanPop, data = USArrests)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -150.78  -61.85  -18.68   58.05  196.85 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  73.0766    53.8508   1.357   0.1811  
## UrbanPop      1.4904     0.8027   1.857   0.0695 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81.33 on 48 degrees of freedom
## Multiple R-squared:  0.06701,    Adjusted R-squared:  0.04758 
## F-statistic: 3.448 on 1 and 48 DF,  p-value: 0.06948

Looking at the bottom, the output reports a residual standard error of 81.33. The residual standard error is the square root of the mean square error (MSE), which is the sum of the squared residuals divided by the residual degrees of freedom; both measure the performance of the regression line. A typical residual size of 81.33 is large relative to the Assault values in the data, which means the regression line does not fit the data points closely.
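
To make the connection explicit, here is a quick sketch that computes the MSE from the residuals and recovers the residual standard error:

# MSE = sum of squared residuals / residual degrees of freedom (n - 2 = 48)
mse <- sum(myresids^2) / mymod$df.residual
sqrt(mse)   # should match the 81.33 reported by summary(mymod)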

R Squared

R squared is also called the coefficient of determination. It measures the percent of variation in the y variable that can be explained by the independent variable. We can find R squared near the bottom of the summary(mymod) output. Our data gives us an R squared value of 0.06701, which agrees with the large residual standard error. The small R squared value tells us that percent urban population in our regression model explains very little of the variability in the assault arrest data. By taking the square root of R squared we can get the absolute value of the correlation; its sign matches the sign of the slope.

Correlation

The correlation coefficient is used in statistics to measure how strong a linear relationship is between two variables. It ranges from -1 to 1; values close to -1 or 1 indicate a strong relationship. Let’s calculate the correlation between our response and predictor variables.

cor(UrbanPop, Assault)
## [1] 0.2588717
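
As a quick check, taking the square root of R squared recovers this same value (since our slope is positive):

sqrt(summary(mymod)$r.squared)
## [1] 0.2588717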

Our correlation of 0.25887 tells us that there is only a weak linear relationship between assault arrests and urban population percent. This is consistent with the hypothesis test below, which fails to find a significant relationship between y and x.

T-Test

Our simple linear regression model is not likely to be useful unless there is a significant relationship between our response variable (assault arrests) and our explanatory variable (urban population). To assess the significance of this relationship, we test the null hypothesis

H0: β1 = 0, there is no relationship between assault arrests and urban population

against

Ha: β1 ≠ 0, there is a relationship between assault arrests and urban population

We can run this test using the “summary” command:

summary(mymod)
## 
## Call:
## lm(formula = Assault ~ UrbanPop, data = USArrests)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -150.78  -61.85  -18.68   58.05  196.85 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  73.0766    53.8508   1.357   0.1811  
## UrbanPop      1.4904     0.8027   1.857   0.0695 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81.33 on 48 degrees of freedom
## Multiple R-squared:  0.06701,    Adjusted R-squared:  0.04758 
## F-statistic: 3.448 on 1 and 48 DF,  p-value: 0.06948

Using an alpha level of 0.05, the p-value (0.06948) is greater than alpha (0.05). Therefore, we fail to reject the null hypothesis: there is not sufficient evidence at the 0.05 level that assault arrests and the percent of a state’s population living in urban areas have a linear relationship.

Confidence Interval for the Slope

We can use the “confint” command to create a confidence interval for the coefficients at a given confidence level. For this example, we will use 95%.

confint(mymod, level=.95)
##                   2.5 %     97.5 %
## (Intercept) -35.1977859 181.350949
## UrbanPop     -0.1234722   3.104352
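
We can reproduce the slope interval by hand as estimate ± t-quantile × standard error, using the values from the summary output and the 48 residual degrees of freedom:

# qt(0.975, 48) is about 2.01; gives roughly (-0.123, 3.104), matching confint
1.4904 + c(-1, 1) * qt(0.975, df = 48) * 0.8027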

This tells us we are 95% confident the true slope is between -0.123 and 3.104: that is, for every 1 percent increase in a state's urban population, the estimated change in assault arrests per 100,000 residents falls in that range. Notice the slope interval contains 0, which agrees with our hypothesis test above. Likewise, we are 95% confident the true intercept is between -35.20 and 181.35; since there can't be a negative number of assault arrests in a state, the intercept is not meaningful to interpret on its own.

F-Test

An F-test is another way to test the significance of the regression relationship between x and y. Since this is a simple linear regression with a single predictor, the F-test and the t-test are equivalent: the F-statistic is the square of the slope's t-statistic. Looking at the “summary” command output, we can see the F-statistic and p-value.

summary(mymod)
## 
## Call:
## lm(formula = Assault ~ UrbanPop, data = USArrests)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -150.78  -61.85  -18.68   58.05  196.85 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  73.0766    53.8508   1.357   0.1811  
## UrbanPop      1.4904     0.8027   1.857   0.0695 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81.33 on 48 degrees of freedom
## Multiple R-squared:  0.06701,    Adjusted R-squared:  0.04758 
## F-statistic: 3.448 on 1 and 48 DF,  p-value: 0.06948

The F-statistic is 3.448 with 1 and 48 degrees of freedom. Since our p-value (0.06948) is greater than alpha (0.05), we fail to reject the null: there is not significant evidence of a linear relationship between assault arrests per 100,000 residents and a state's urban population percent.
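
We can verify this equivalence directly: squaring the slope's t value of 1.857 from the summary output recovers the F-statistic.

1.857^2
## [1] 3.448449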

Confidence and Prediction Intervals

Now we are interested in a confidence interval for the mean value of y when x equals a particular value x0. In this case, we will find the confidence interval for mean assault arrests per 100,000 residents among states with an urban population percent of 80.

newdata <- data.frame(UrbanPop=80)
confy <- predict(mymod, newdata, interval="confidence", level = .95)
confy
##        fit      lwr      upr
## 1 192.3118 159.4569 225.1667

This tells us we are 95% confident that the mean assault arrests per 100,000 residents for states with 80 percent urban population is between 159.4569 and 225.1667.

This next interval applies to an individual value of y when x equals a particular value x0. Again, I will use an urban population percent of 80.

predy <- predict(mymod, newdata, interval = "prediction", level = .95)
predy
##        fit      lwr      upr
## 1 192.3118 25.51695 359.1066

This prediction interval is wider because it is for an individual y value instead of the mean value of y: a single observation carries the variability of the error term on top of the uncertainty in the estimated line, so its variance is larger than the variance of the mean.
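
We can quantify the difference by comparing the widths of the two intervals:

# Each interval is returned as (fit, lwr, upr), so width = upr - lwr
confy[3] - confy[2]   # confidence interval width, about 66
predy[3] - predy[2]   # prediction interval width, about 334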

The prediction and confidence intervals should be centered at the same fitted value. Let’s run a quick calculation to check:

confy[1] == predy[1]
## [1] TRUE

Yes they are. Perfect :)

That’s it. Thanks for reading!!