Introduction to Linear Regression

This document walks through fitting a simple linear regression model. We assume the two variables chosen are quantitative, one predictor and one response. We will plot the data and learn how to interpret the scatterplot, calculate the correlation, fit the linear regression model, and interpret the regression line. Then we will use the model for prediction and check the model's assumptions.

For this document, I’ve decided to use the BaseballHits dataset from the Lock5withR library. To start off, load the package, attach the data, and look at the variable names.

library(Lock5withR)
data(BaseballHits)
attach(BaseballHits)
names(BaseballHits)
##  [1] "Team"           "League"         "Wins"           "Runs"          
##  [5] "Hits"           "Doubles"        "Triples"        "HomeRuns"      
##  [9] "RBI"            "StolenBases"    "CaughtStealing" "Walks"         
## [13] "Strikeouts"     "BattingAvg"

Before using a dataset, we need to learn about it and the variables contained inside.

This dataset is a collection of season statistics, including hits and wins, for Major League Baseball teams. There are 30 observations and 14 variables to choose from. Look over the variables and decide which pair might have a linear relationship; any of them could, so it’s fine to change your mind after you look at the plots.
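
Before picking variables, a few standard base R commands give a quick overview of the dataset:

head(BaseballHits)  # first six rows of the data
str(BaseballHits)   # each variable's name and type
dim(BaseballHits)   # number of rows and columns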

I chose to work with BattingAvg as the predictor and HomeRuns as the response. BattingAvg is the team’s batting average for the season, a proportion (hits divided by at-bats, so a value like 0.250). HomeRuns is the number of home runs the team hit during the season.

Plotting our Data

Scatterplots

We can make a scatterplot to see what the data look like; we want to see a roughly linear relationship. The plot command takes two quantitative variables: the first is the predictor and the second is the response.

Since the team’s batting average is our predictor, it goes first in the list. The number of home runs in the season goes second, since we are using the batting average to predict home runs. We could set this up either way, but a team with a higher batting average seems likely to hit more home runs.

plot(BattingAvg, HomeRuns)

R draws a scatterplot. Looking at it, we can see a generally positive association between the two variables. There is one outlier worth keeping in mind, though it shouldn’t affect the model too much.

The graph makes sense to us because we know the data, but for it to make sense to anyone else, we need labels: the y-axis shows the team’s number of home runs and the x-axis shows the team’s batting average.

With the plot command, we add xlab for the x-axis label, ylab for the y-axis label, and main for the title of the graph.

plot(BattingAvg, HomeRuns,
     xlab = "Batting Average",
     ylab = "Home Runs (season total)",
     main = "Batting Average and Home Runs in a Season")

The new plot will now make sense to everyone, regardless of whether they saw the data or not.

Correlation

Correlation is worth looking at before we move on. Correlation shows both the strength and the direction of the linear relationship between the response and the predictor. The closer it is to 1, the stronger the positive linear relationship; 0 means no linear relationship; and the closer it is to -1, the stronger the negative linear relationship.
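
For reference, the correlation between a predictor x and a response y is defined as \[r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \sum_{i=1}^{n}(y_i-\bar{y})^2}}\] where \(\bar{x}\) and \(\bar{y}\) are the sample means.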

Let’s look at the cor command in R, which takes two quantitative variables. The predictor and response can be given in either order; the result is the same.

cor(HomeRuns, BattingAvg)
## [1] 0.3171791

A correlation of 0.317 is closer to 0 than to 1, so the linear relationship is not strong. It does show, however, some positive association between a team’s batting average and its number of home runs.

Linear Regression Model

Creating the Model

Once we know there is some kind of linear relationship, we can fit the linear regression model. To get the model from R, we use the lm command, which takes the response, followed by a tilde (~), and then the predictor.

mymod <- lm(HomeRuns ~ BattingAvg)
mymod
## 
## Call:
## lm(formula = HomeRuns ~ BattingAvg)
## 
## Coefficients:
## (Intercept)   BattingAvg  
##      -109.5       1023.3

We can also look at the summary of the model, which gives a few more important pieces of information.

summary(mymod)
## 
## Call:
## lm(formula = HomeRuns ~ BattingAvg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.890 -17.051  -7.064  13.087 112.716 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   -109.5      148.9  -0.736   0.4681  
## BattingAvg    1023.3      578.2   1.770   0.0877 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.35 on 28 degrees of freedom
## Multiple R-squared:  0.1006, Adjusted R-squared:  0.06848 
## F-statistic: 3.132 on 1 and 28 DF,  p-value: 0.08766

Interpreting the Model

So, what does the model tell us?

The output from the model tells us the intercept and the slope of the line we want. Using the output, we can write out the fitted regression line: \[\widehat{HomeRuns} = -109.5 + 1023.3 \times BattingAvg\]

The slope is the easier of the two to interpret: for every one-unit increase in the explanatory variable, the predicted response increases by the slope. A one-unit increase in batting average (from .000 to 1.000) is not realistic, though, so it is more useful to scale it down: a 0.01 increase in batting average is associated with about 10.2 more home runs (1023.3 × 0.01 ≈ 10.2).

The intercept is harder to interpret, especially in our case. In general, it is the predicted response when the explanatory variable is 0. With our data, it wouldn’t make sense to predict home runs for a batting average at or near 0: our batting averages only run from about 0.24 to 0.27, and predictions outside the range of the data (extrapolation) are unreliable.
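
We can check the observed range of the predictor directly; any batting average outside this range would be extrapolation:

range(BattingAvg)  # smallest and largest batting averages in the data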

What about the summary of the model?

The first thing we should look at is the p-value for the t-test on the predictor, located in the Coefficients section. This tells us whether the predictor is useful.

For our example, the p-value is 0.0877, which is fairly small. At the 0.1 significance level, there is significant evidence of a linear relationship between a team’s batting average and its number of home runs (though we would not reject at the stricter 0.05 level).

R-squared appears toward the bottom of the summary as “Multiple R-squared”. R-squared is the explained variation (variation the model accounts for) divided by the total variation (which also includes variation the model cannot explain). It tells us the proportion of the variation in the response that can be explained by its linear relationship with the predictor.

In our example, 10.06% of the variation in the number of home runs is explained by the team’s batting average, so this is not a strong model for predicting home runs. It would be better to add more predictors, which can be done with multiple linear regression models that we’ll learn about in another guide.
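
In simple linear regression, R-squared is just the square of the correlation, so we can double-check the summary output with the value we computed earlier:

cor(HomeRuns, BattingAvg)^2  # 0.3171791^2, about 0.1006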

We can also find the F-test down at the bottom, labeled “F-statistic” and “p-value”. The F-test asks whether there is a linear relationship between the response and the predictor: the null hypothesis is that there is no relationship (the slope \(\beta_1\) equals 0) and the alternative is that there is one (\(\beta_1\) does not equal 0). The F statistic is the explained variation divided by the unexplained variation per degree of freedom: \[F = \frac{SSR/1}{SSE/(n-2)}\] Under the null hypothesis, this statistic has an F distribution with degrees of freedom 1 equal to 1 and degrees of freedom 2 equal to n-2. The p-value is the area to the right of the observed statistic.

In our example, we have an F statistic of 3.132, with degrees of freedom 1 equal to 1 and degrees of freedom 2 equal to 28. The 28 makes sense because there are 30 observations in the dataset and n - 2 = 28. The p-value is 0.08766, so at the 0.1 significance level there is significant evidence of a linear relationship between a team’s batting average and its number of home runs. (With a single predictor, this is the same test as the t-test on the slope, which is why the two p-values match.)
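
As a quick check, we can reproduce this p-value from the F distribution ourselves. Note also that with one predictor the F statistic is the square of the slope’s t statistic (1.770^2 is about 3.132):

pf(3.132, df1 = 1, df2 = 28, lower.tail = FALSE)  # area to the right, about 0.0877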

Plotting the Regression Line

Now, we can show the linear trend in our scatterplot. It makes the plot easier for other people to read.

To do this, we make the same plot as before and then draw the fitted line on top of it by calling abline(intercept, slope) after the plot command.

plot(BattingAvg, HomeRuns,
     xlab = "Batting Average",
     ylab = "Home Runs (season total)",
     main = "Batting Average and Home Runs in a Season")
abline(-109.5, 1023.3)
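
If you would rather not type (and round) the coefficients by hand, abline also accepts a fitted model and pulls the intercept and slope out of it:

abline(mymod)  # same line, using the stored coefficients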

Using the Model

Predicting the Response

Now that we have the model, we can use it to predict the number of home runs from a batting average. Predicting with the model is easy: plug a value for the predictor (the team’s batting average in our example) into the equation and compute the response.

I chose to use row 5 of our data. To pull a single value out of the data frame, use data[row number, "column name"] (or the column number).

-109.5 + 1023.3*BaseballHits[5, "BattingAvg"]
## [1] 153.4881

We can also look at the actual number of home runs for comparison. This time when indexing the data, we use “HomeRuns” as the column name.

BaseballHits[5, "HomeRuns"]
## [1] 149

We can see that the prediction and the actual value are not the same. We don’t expect them to be, since the relationship is not perfectly linear.
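
Instead of plugging rounded coefficients in by hand, we can also let R do the work with predict. The result may differ slightly from our hand calculation because the stored coefficients are not rounded:

predict(mymod, newdata = BaseballHits[5, ])  # model's prediction for row 5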

Residuals

The residuals measure how far off the predictions are: residual = observed minus predicted. We can quickly find one before we check the model assumptions.

The predicted value for row 5 is 153.4881.

The observed value for row 5 is 149.

The residual for this data point is 149 - 153.4881 = -4.4881.

These come in handy when checking some of the assumptions we made for this model.
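
R also stores every residual in the model object, so we can confirm the hand calculation (again, expect a slightly different value because the stored coefficients are not rounded):

residuals(mymod)[5]  # residual for row 5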

Prediction and Confidence Intervals for the Response

We can also compute confidence intervals and prediction intervals for the response. First, we create a new data frame containing only a value for the predictor. I picked a batting average of 0.25.

mydata <- data.frame(BattingAvg = .25)

Once we have the new data frame, we can find the confidence and prediction intervals for home runs with predict(model, new data, interval). To get the two different intervals, set interval to "confidence" or "prediction" (the abbreviation "predict" used below also works).

Prediction Interval

A prediction interval looks at a single team with that batting average and tries to predict its number of home runs.

predictionint <- predict(mymod, mydata, interval="predict")
predictionint
##        fit      lwr      upr
## 1 146.3305 78.42245 214.2386

We are 95% confident that a single team with a batting average of 0.25 will hit between 78 and 214 home runs in the season.

Confidence Interval

A confidence interval looks at all the teams with that batting average and estimates their mean number of home runs for the season.

confidenceint <- predict(mymod, mydata, interval="confidence") 
confidenceint
##        fit      lwr      upr
## 1 146.3305 131.4829 161.1781

Looking at all the teams with a batting average of 0.25, we are 95% confident that their mean number of home runs in a season is between 131 and 161.

Confidence Interval for the Regression Coefficients

The last type of interval we can look at is the confidence interval for the regression coefficients.

confint(mymod)
##                 2.5 %    97.5 %
## (Intercept) -414.4633  195.4625
## BattingAvg  -161.1393 2207.7866

The intercept interval says that if the batting average were 0, the number of home runs would be between -414 and 195. This doesn’t make sense for our data: a team with a batting average of 0 has no hits at all, so it cannot have any home runs. The interval works mathematically but not practically.

The slope interval makes more sense once we scale it. We are 95% confident that a one-unit increase in batting average corresponds to a change of between -161 and 2208 home runs; for a more realistic 0.01 increase in batting average, that works out to between about -1.6 and 22 home runs. Notice that the interval contains 0, which agrees with the slope’s p-value of 0.0877 being above 0.05.
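
Since batting averages only move by hundredths, it can help to rescale the slope interval to a 0.01 increase; a minimal sketch:

confint(mymod)["BattingAvg", ] * 0.01  # CI for the effect of a 0.01 increase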

Checking Model Assumptions

The regression model rests on some assumptions. There are three big ones: independence of the errors, normal distribution of the errors, and homoscedasticity (constant error variance).

Independence of Errors

The first assumption is that the errors are independent of each other. This dataset is cross-sectional, taken from 30 different teams at the same point in time, so independence is most likely not a concern. Time-series data, where observations come from the same place over a span of time, is where we would have to worry about dependence between the errors.

Normal Distribution of Errors

The errors are supposed to be normally distributed. Roughly speaking, this means the errors are centered at zero and symmetric, with overpredictions about as common as underpredictions and small errors more common than large ones.

To start, we can collect the residuals. We computed an individual residual earlier, but the $residuals component of the model object gives all of them at once.

myresids <- mymod$residuals
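
As a quick sanity check, the residuals of a least-squares model that includes an intercept always average to zero:

mean(myresids)  # essentially 0, up to floating-point error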

We could look at a histogram of the residuals to check normality, but an easier way is to compare the quantiles of the residuals to the quantiles of the normal distribution. If the errors are normally distributed, the points should follow a straight line.

qqnorm(myresids)
qqline(myresids)

Overall, only a few points fall very far from the line. The fit is not perfect, but it is not too far off, so this assumption appears reasonable.

Homoscedasticity

The last assumption is homoscedasticity: the variance of the errors is constant across predictor values. In other words, we assume the size of the error term is about the same across all of the predicted values.

To check this, we can plot the residuals against the fitted values to compare the variances. We’ll add in a line at 0 so that we can easily see if the variances are off.

plot(mymod$residuals ~ mymod$fitted.values)
abline(h = 0)

From this plot, we can check homoscedasticity: the points should show a roughly even vertical spread across the fitted values, with no funnel shape.

Our data has one outlier, but otherwise the spread looks fairly even, so we can treat the data as homoscedastic.

Wrapping Up Linear Regression

This document went through a linear regression model with two quantitative variables. We learned how to plot the data and interpret the graph, then fit the linear regression model and interpreted the resulting line. We learned how to use the model for prediction and, lastly, how to check the assumptions of the model.

This was just simple linear regression. Many of these same R commands work for multiple linear regression and other types of models.