This document will walk through a linear regression model. We assume that the two variables chosen will be quantitative, one predictor and one response. We will plot the data and learn how to interpret the scatterplot. We will calculate correlation and then the linear regression model. We will find the regression line and interpret. Then we will use it for prediction and check the assumptions from the model.
For this document, I’ve decided to use the dataset BaseballHits from the Lock5withR Library. To start off, attach the data and look at the variable names.
library(Lock5withR)
data(BaseballHits)
attach(BaseballHits)
names(BaseballHits)
## [1] "Team" "League" "Wins" "Runs"
## [5] "Hits" "Doubles" "Triples" "HomeRuns"
## [9] "RBI" "StolenBases" "CaughtStealing" "Walks"
## [13] "Strikeouts" "BattingAvg"
Before using a dataset, we need to learn about it and the variable contained inside.
This data is a collection of the number of hits and wins for Major League Baseball teams. There are 30 observations in the dataset. We can see that there are 14 variables to choose from. You can look at the variables and decide which ones might have a linear relationship. Any of them could potentially have linear relationships so it’s okay to change your mind after you look at the plots.
I chose to work with BattingAvg as the predictor and HomeRuns as the response. BattingAvg is the team batting average for the season, measured in percentage. HomeRuns is the number of home runs hit in the season by the team.
We can make a scatterplot to see what the data looks like. We will want it to have somewhat of a linear-looking relationship. The plot command takes two variables, first a predictor and then a response.
Since BattingAvg is our predictor, that will go first in the list. The second will be HomeRuns since we are using BattingAvg to predict HomeRuns. We could have this go either way but I think a team with a higher batting average will have more home runs.
plot(BattingAvg, HomeRuns)
A scatterplot is made. Looking at it, we can see a generally positive correlation between the two variables. There is one outlier that we should remember when looking at the data.
The graph makes sense to us since we know what the data is but for it to make sense to another person, we’ll need labels. The y-axis is for the number of home runs for the team and the x-axis is the batting average for the team. The graph itself shows the batting average and the number of home runs for the team.
With the plot command, we now add xlab with the x-axis label, ylab with the y-axis label and a title for the graph as main.
plot(BattingAvg, HomeRuns, xlab = "Batting Average (in %)", ylab = "Home Runs (total in the Season)", main = "Batting Average and Home Runs in a Season")
Correlation is something we can look at before we move on. Correlations talks about the strength and direction of the linear relationship between the response and the predictor. The closer it is to 1, the stronger the positive relationship. 0 means no relationship and -1 is a strong, negative relationship.
Let’s look at the cor command in R that takes two quantitative variables. The predictor and response can be in either order and it will give the same result.
cor(HomeRuns, BattingAvg)
## [1] 0.3171791
Once we know there is some type of linear relationship, we can calculate the linear regression model. To get the model from R, we use the lm command that takes the response followed by a ~ and then the predictor.
mymod <- lm(HomeRuns ~ BattingAvg)
mymod
##
## Call:
## lm(formula = HomeRuns ~ BattingAvg)
##
## Coefficients:
## (Intercept) BattingAvg
## -109.5 1023.3
We can also look at the summary of the model that gives a few more important pieces of data.
summary(mymod)
##
## Call:
## lm(formula = HomeRuns ~ BattingAvg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.890 -17.051 -7.064 13.087 112.716
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -109.5 148.9 -0.736 0.4681
## BattingAvg 1023.3 578.2 1.770 0.0877 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.35 on 28 degrees of freedom
## Multiple R-squared: 0.1006, Adjusted R-squared: 0.06848
## F-statistic: 3.132 on 1 and 28 DF, p-value: 0.08766
So, what does the model tell us?
The output from the model tells us the intercept and the slope for the line we want. Using the output, we can write out the regression line. \[{HomeRuns}= -109.5+1023.3*BattingAvg\]
The slope is the easier of the two to interpret. It usually means that for every increase of 1 for the explanatory variable, the response will increase by that amount. For our example, it means that for every 0.01 increase in the batting average, home runs will increase by 1023.3.
The intercept is harder to interpret, especially in our case. Usually, it means that when the explanatory variable is 0, the response will be the intercept. With our data, it wouldn’t make sense to predict home runs based on a batting average near 0. Our data only goes from around 0.24 to 0.27 for batting average. We can’t use our data to predict outside of that range because it won’t be as accurate.
What about the summary?
R-squared is at towards the bottom of the summary, titled “Multiple R-squared”. R-squared is the explained variance (variance that can be explained by the model) over the total variation (which includes variance that can’t be explained by the model). It tells us the proportion of the variation in the response that can be explained by its linear relationship with the predictor.
In our example, 10.06% of the variation in the number of home runs can be explained by the batting average of the team.
We can also find the F-test down at the bottom, titled “F-statistic” and “p-value”. The F-test asks if there is a relationship between the response and the predictor. The null hypothesis is that there is no relationship (which would mean beta not equals 0) and the alternative is yes (which means beta not does not equal 0). The F-teststat, which can be found in the summary, is the explained variation divided by the unexplained variation divided by n-2 and under the null hypothesis, the teststat has a F distribution with degrees of freedom 1 equaling 1 and the degrees of freedom 2 equaling n-2. The p-value is the portion to the right of the teststat.
In our example, we have a F-teststat of 3.132, degrees of freedom 1 of 1 and degrees of freedom 2 of 28. The degrees of freedom 2 make sense because there are 30 observations in the data set. The p-value is 0.08766. At the 0.1 significance level, there is significant evidence to say that there is a linear relationship between the batting average and the number of home runs of a team.
Now, we can show the linear trend in our scatterplot. It makes the plot easier for other people to read.
To do this, we simply take the plot from before and add to it. At the end of the command, add abline(intercept, slope).
plot(BattingAvg, HomeRuns, xlab = "Batting Average (in %)", ylab = "Home Runs (total in the Season)", main = "Batting Average and Home Runs in a Season", abline(-109.5,1023.3))
Predicting with the model is easy. All you had to do is plug in a value for the predictor (BattingAvg in our example) and find the response.
I chose to use line 5 of our data. To do that, BattingAvg will now be data[row number, column number or “column name”]
-109.5 + 1023.3*BaseballHits[5, "BattingAvg"]
## [1] 153.4881
We can also look at the actual number of home runs to compare to.
BaseballHits[5, "HomeRuns"]
## [1] 149
We can see that the prediction and the actual results are not the same. We don’t expect them to be since the data is not a perfect linear relationship.
To calculate residuals, we takes observed minus predicted. We can quickly find one before we check the model assumptions.
The predicted value for row 5 is 153.4881.
The observed value for row 5 is 149.
The residual for this data point is 149 - 153.4881 = -4.4881.
The regression model has some assumptions. Can we check some without R?
The first assumption is that the errors are independent of each other. This dataset is cross-sectional so we most likely don’t have to worry about the independence. Time-series data is the data we’d have to worry about independence of errors.
To start with, we can find the residuals.
myresids <- mymod$residuals
We could look at a histogram for the residuals to see if they are normally distributed but an easier way is looking at the quantiles. We can compare the quantiles of the residuals to the quantiles of the normal distribution. They should follow a straight line if the errors are normally distributed.
qqnorm(myresids)
qqline(myresids)
Overall, there are few that are very far off the line. It is not perfect but it is not too far off.
The last assumption is homoscedasticity. This means that we assume the variance of the residuals are equal for the predictor values.
To check this, we can plot the residuals against the BattingAvg to compare the variances. We’ll add in a line at 0 so that we can easily see if the variances are off.
plot(mymod$residuals ~ BattingAvg)
abline(0,0)
From this plot, we can check the homoscedasticity. If the data points are evenly spread out, the data is homoscedastic. Our data has one outlier but besides that, the data is homoscedastic.
We can also look at confidence intervals and prediction intervals for the response. First, we have to get a new data set. This data set will include only the data for a certain batting average I picked 0.25 BattingAvg.
mydata <- data.frame(BattingAvg = .25)
Once we have the new data set, we can find the confidence and prediction intervals for the home runs. We will use predict (lm model, new data set, interval). To get the two different intervals, interval is set to confidence or predict.
predictionint <- predict(mymod, mydata, interval="predict")
predictionint
## fit lwr upr
## 1 146.3305 78.42245 214.2386
We are 95% sure a team with a batting average of 0.25, hits between 78 and 214 home runs in the season.
confidenceint <- predict(mymod, mydata, interval="confidence")
confidenceint
## fit lwr upr
## 1 146.3305 131.4829 161.1781
If we look at all the teams with a batting average of 0.25, we are 95% confident their mean is between 131 and 161 home runs in the season.
This document went through a linear regression model with two quantitative variables. We learned how to plot the data and interpret the graph shown. We then calculate the linear regression model and interpreted the line given. We learned how to use the model to predict. Lastly, we learned how to check the assumptions of the model.
This was just for simple linear regression. Many of these same R commands work for multiple linear regression and other types of models.