Getting Started

This guide covers the techniques used in Chapter 3, focusing on simple linear regression.

For this guide we will use the "BaseballHits" data from the "Lock5withR" package. Once the package is installed, we can start working with the data. We will attach the data so we do not have to keep prefixing each variable with the data frame's name.
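
If you have not installed the package yet, it can be installed from CRAN first. Here is a minimal sketch, with the install line commented out so it only runs when you actually need it:

# install.packages("Lock5withR")   # run once to download and install the package from CRAN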

The BaseballHits data contains observations of the number of home runs and wins in the 2010 season for Major League Baseball teams.

library(Lock5withR)
data("BaseballHits")
attach(BaseballHits)

Plotting our Data

Let's start by using a scatterplot to take a look at our data. A scatterplot lets us examine the relationship between two quantitative variables; in this case we are looking at Wins and HomeRuns.

We will use the command "plot", which takes two arguments: the explanatory variable, followed by the response variable (both should be quantitative). The additional arguments included here are not necessary, but they create helpful labels for our plot.

Our explanatory variable is HomeRuns, or how many home runs a given MLB team hit in the season. Our response variable is Wins, or how many games that team won in the season.

plot(HomeRuns,Wins, ylab = "Wins", xlab="Home Runs", main = "Number of Home Runs and Wins for MLB Teams (2010)")

We can see a positive trend between home runs and wins: the more home runs a team hits, the more likely it is to have a large number of wins. The trend is not followed very closely, though, and the points are somewhat spread out. This spread means that a team with a high number of home runs could have either a high number of wins or a more moderate number of wins. Even so, there appears to be a linear relationship between home runs and wins.

Creating the Linear Model

After seeing that there is a linear relationship between home runs and wins, we will want to create a linear model, finding the “line of best fit” or “regression line”.

We will name our model mod1 and use the {lm} command to create it, listing first the response variable and then the explanatory variable, separated by "~". Both variables should be quantitative. If we then run mod1, it gives us the intercept and slope of our line of best fit.

mod1<-lm(Wins~HomeRuns)
mod1
## 
## Call:
## lm(formula = Wins ~ HomeRuns)
## 
## Coefficients:
## (Intercept)     HomeRuns  
##     58.8974       0.1437

We can use the intercept and slope to interpret our line of best fit. In this case we get the equation: \[\hat{Wins} = 58.8974 + 0.1437 \times HomeRuns\]

This means that for every additional home run hit, a team is expected to win about 0.1437 more games. We are unable to interpret our intercept in this case because we do not have any data points near 0. **Remember: we cannot interpret the intercept if we do not have data points at or around 0, because that would be extrapolation.**
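
If you would rather not retype the rounded numbers, the fitted coefficients can be pulled straight from the model with the coef() function; a quick sketch:

b <- coef(mod1)   # named vector with elements "(Intercept)" and "HomeRuns"
b[1]              # intercept, about 58.90
b[2]              # slope, about 0.1437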

To better see how this line lies on our data, we can plot it over our original plot with the command {abline}, listing first our intercept and then our slope. I added a third, unnecessary, argument {col=4}, which changes the color of our regression line to blue, just for fun.

plot(HomeRuns,Wins, ylab = "Wins", xlab="Home Runs", main = "Number of Home Runs and Wins for MLB Teams (2010)")
abline(58.8974,.1437, col=4)
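
As a side note, {abline} can also take the fitted model object directly, which draws the same line from the unrounded coefficients:

abline(mod1, col=4)   # equivalent to abline(58.8974, .1437, col=4), using mod1's coefficients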

Making Predictions

Let’s make a prediction using our model. To do this, we simply plug in a value of the explanatory variable (home runs) into our regression line equation.

Let's predict how many wins a team that hits 150 home runs will have. I created a variable {HRpred} for the number of home runs we are plugging into the equation and a variable {win.prediction} to store our predicted number of wins.

HRpred<- 150
win.prediction<- 58.8974 + .1437*HRpred
win.prediction
## [1] 80.4524

Using our model, we predict that a team that hits 150 home runs in a season will have about 80 wins.

Looking at Residuals

Remember, we find a residual by calculating (observed - predicted). In this case we will compare how many wins the team that hit 150 home runs actually had with the number we predicted using our regression line. I created a variable {resid1} to store the residual. Then all we have to do is subtract!

To find our observed value we will use indexing, which uses brackets to pull out a specific piece of data. This is helpful because you do not have to search through your data to find something; you can just call it up. In this case we are looking for the number of Wins where HomeRuns equals 150. This will look like Wins[HomeRuns==150] and will return the observed value where HomeRuns is equal to 150.

resid1<-Wins[HomeRuns==150]-win.prediction
resid1
## [1] 5.5476

We get the value 5.5476 for our residual. This means that our prediction was 5.5476 wins lower than the observed value.
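
The same residual can also be pulled straight from the model object; it will differ slightly from our hand calculation because we rounded the coefficients above. A quick check:

residuals(mod1)[HomeRuns == 150]   # about 5.54, slightly different from 5.5476 due to rounding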

Checking Model Assumptions

Since we are using a linear regression model, we need to make sure we check the assumptions of these models.

  1. Errors will have a normal distribution. Let’s look at this by making a histogram of the residuals. We can also plot the quantiles of the residuals against the quantiles of a normal distribution. If the residuals are normally distributed, they’ll follow a straight line.
bballresid<- mod1$residuals
hist(bballresid, xlab= "Residuals", main = NULL)

qqnorm(bballresid)
qqline(bballresid, col=4)

Looking at the histogram, the distribution does not look very normal. However, in the plot of the residual quantiles against the quantiles of a normal distribution, the points follow the straight line fairly well; the more closely they follow the line, the more normally distributed the residuals are.

  2. Homoscedasticity (constant variance)

We can check this by looking at the original scatterplot: we want to see that the variance in wins stays roughly the same as home runs increase.

We can also check this by plotting the residuals and adding a horizontal line at 0 using the {abline} command again. We want the spread of the residuals to stay roughly constant across the whole line.

plot(HomeRuns,Wins, ylab = "Wins", xlab="Home Runs", main = "Number of Home Runs and Wins for MLB Teams (2010)")

plot(bballresid)
abline(0,0, col=4)

Looking at our original plot, the variance seems to stay pretty constant, though it is hard to tell since the points are spread out. Using the second method, we see that the variance seems to increase slightly, but still remains fairly constant throughout.
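
Another common way to check constant variance is to plot the residuals against the fitted values instead of against their position in the data set; the vertical spread should look roughly even across the range of fitted values. A sketch using the same model:

plot(mod1$fitted.values, bballresid, xlab="Fitted values", ylab="Residuals")
abline(0, 0, col=4)   # horizontal reference line at 0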

Mean Square Error

Let's look at the mean square error. Rather than calculating it by hand, we can let R do the work for us. It can be found from the output of the command {summary(mod1)}, which gives us a summary of our model.

summary(mod1)
## 
## Call:
## lm(formula = Wins ~ HomeRuns)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.0088  -8.2368   0.9163   7.0167  14.6914 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 58.89738    8.77274   6.714 2.74e-07 ***
## HomeRuns     0.14374    0.05579   2.577   0.0155 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.07 on 28 degrees of freedom
## Multiple R-squared:  0.1917, Adjusted R-squared:  0.1628 
## F-statistic: 6.639 on 1 and 28 DF,  p-value: 0.01554

In the output, the value 10.07 labeled "Residual standard error" is the square root of the mean square error, so the mean square error itself is about 10.07^2, or roughly 101.4. The mean square error is a way of averaging the squared errors (residuals) in our model.
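
To see the connection, we can compute the mean square error directly: it is the sum of squared residuals divided by the residual degrees of freedom, and its square root should match the residual standard error above. A quick check:

mse <- sum(residuals(mod1)^2) / df.residual(mod1)   # sum of squared residuals divided by 28
mse        # about 101.4
sqrt(mse)  # about 10.07, the residual standard error reported by summary(mod1)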

Confidence Intervals for Regression Coefficients

If we want to create a confidence interval for our regression coefficients, we can use the command {confint}. Its arguments are our model and the confidence level we want for the interval.

confint(mod1,level=.90)
##                     5 %       95 %
## (Intercept) 43.97380649 73.8209573
## HomeRuns     0.04884278  0.2386398

This means we are 90% confident that our intercept is between 43.97380649 and 73.8209573 and that our slope is between 0.04884278 and 0.2386398. In other words, we are 90% confident that for every additional home run hit, the number of wins a team has that season goes up by some amount in the interval (0.04884278, 0.2386398). We are still unable to interpret our intercept, as we do not have data points around 0.

Correlation

Next, let's look at the correlation between our response and explanatory variables, in this case Wins and HomeRuns. We will use the command {cor} with the two variables as its arguments (here the order does not matter). This gives us the simple correlation coefficient (r). r measures the strength of the linear relationship between x and y and falls in the interval [-1,1]: the closer to 1, the stronger the positive correlation, and the closer to -1, the stronger the negative correlation. **Remember: r can only be used for SIMPLE linear regression.**

cor(Wins, HomeRuns)
## [1] 0.4377998

In this case our simple correlation coefficient is about .4378. This does not represent an extremely strong positive correlation between variables, but a positive relationship nonetheless.
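
For simple linear regression, squaring r gives the R-squared value that {summary(mod1)} reports, which is a quick way to see that the two agree:

cor(Wins, HomeRuns)^2   # about 0.1917, matching Multiple R-squared in summary(mod1)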

Prediction and Confidence Intervals

Let's make prediction and confidence intervals. First, we will create a new data frame for the point we are interested in (HomeRuns=150). To do this we will use the command {data.frame(HomeRuns=150)}. If we were interested in another point, we could simply change the value inside the command.

new.data<-data.frame(HomeRuns=150)

Now we can use our new data frame to make some intervals. Both our prediction interval and our confidence interval use the command {predict(our model, new data frame, interval="")}; the default confidence level is 95%. The only difference between the two calls is the type of interval we specify, which you can see below.

predsl <- predict(mod1, new.data, interval="prediction")
confsl <- predict(mod1, new.data, interval="confidence") 
predsl
##        fit      lwr      upr
## 1 80.45857 59.48749 101.4297
confsl
##        fit      lwr      upr
## 1 80.45857 76.66833 84.24882

The prediction interval says that, with 95% confidence, an individual team that hits 150 home runs will have between about 60 and 102 wins. The confidence interval says that, with 95% confidence, the mean number of wins for all teams that hit 150 home runs is between about 77 and 85.

Let's make R compare these intervals for us by looking at the width of each interval.

predsl %*% c(0, -1, 1)
##       [,1]
## 1 41.94217
confsl %*% c(0, -1, 1)  
##       [,1]
## 1 7.580488

We can see that the prediction interval is much wider than the confidence interval. This is expected: predicting the wins of an individual team involves more uncertainty than estimating the mean number of wins, because the prediction interval must account for the variability of individual observations around the line in addition to the uncertainty in the estimated mean.
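
An equivalent, perhaps more readable, way to get the widths is to subtract the interval columns directly:

predsl[, "upr"] - predsl[, "lwr"]   # width of the prediction interval, about 41.9
confsl[, "upr"] - confsl[, "lwr"]   # width of the confidence interval, about 7.6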

Next, let’s check if both intervals are centered at the same place.

confsl[1] == predsl[1]
## [1] TRUE

This tells us they are centered at the same place. This makes sense because both should be centered at the value our model predicts when HomeRuns=150.
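
We can confirm that this common center is just the fitted value from the model:

predict(mod1, new.data)   # the point estimate both intervals are centered on, about 80.46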

Summary Command

We have already used the command {summary} to get some information about our model, but let’s look a little more at some of the information it gives us.

summary(mod1)
## 
## Call:
## lm(formula = Wins ~ HomeRuns)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.0088  -8.2368   0.9163   7.0167  14.6914 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 58.89738    8.77274   6.714 2.74e-07 ***
## HomeRuns     0.14374    0.05579   2.577   0.0155 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.07 on 28 degrees of freedom
## Multiple R-squared:  0.1917, Adjusted R-squared:  0.1628 
## F-statistic: 6.639 on 1 and 28 DF,  p-value: 0.01554

You will see the intercept and slope values we found earlier. Another thing you can find here is the degrees of freedom. Remember, the degrees of freedom should be n minus the number of regression coefficients; in this case the degrees of freedom are 28 because we have 30 data points and 2 regression coefficients. We can also look at the t-tests for the intercept and slope. Both p-values are less than the typical alpha level of .05, which means we can reject the null hypotheses H0: intercept = 0 and H0: slope = 0. In other words, the number of home runs hit has a significant linear relationship with the number of wins a team has in a season.

Another thing we can find in this summary is "Multiple R-squared", which gives us the R-squared statistic, or the proportion of the total variation in wins that is explained by our simple linear regression model. The bottom line also gives the F-statistic, its degrees of freedom, and its p-value. That p-value is the area under the F distribution curve to the right of the observed F-statistic, and it is again less than .05, so we reach the same conclusion: the number of home runs hit has a significant linear relationship with the number of wins a team has in a season. For a simple linear regression model, the F-test is equivalent to the t-test for the slope (the F-statistic is the square of the slope's t statistic). When we get to multiple linear regression models this will no longer be the case.
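
We can check this equivalence from the summary object itself: the squared t statistic for the slope should equal the F-statistic, up to rounding. A quick sketch:

s <- summary(mod1)
s$coefficients["HomeRuns", "t value"]^2   # about 6.64
s$fstatistic["value"]                     # also about 6.64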

Conclusion

In this guide we looked mainly at simple linear regression. Many of the techniques we used will also be helpful when we begin to look at multiple linear regression.