Contents: Downloading and Attaching Data Sets; Scatterplots; Axis Labels; Linear Regression Model Creation; Plotting Linear Regression Model; Prediction Using Linear Regression; Mean Square Error; Significance of the Slope and y-Intercept; Confidence Intervals for Regression Coefficients; Correlation; Confidence and Prediction Intervals; F-Test
In this R guide for Chapter 3 we discuss simple linear regression with two variables. We will analyze the relationship between a dependent variable (y) and an independent variable (x), form predictions and intervals, and explain the usefulness of simple linear regression.
First we need to download the data set. Lock5Stat.com hosts the data for the 2016 NBA Standings, which we can read directly into R:
Team<-read.csv(url("http://www.lock5stat.com/datasets/NBAStandings2016.csv"))
After downloading the data, we want to see what it looks like. Since we are dealing with the NBA Standings for 2016, we should look at the first few rows and columns to get a good feel for what the data is about and which variables are worth comparing.
head(Team)
## Team Wins Losses WinPct PtsFor PtsAgainst
## 1 Golden State Warriors 73 9 0.890 114.9 104.1
## 2 San Antonio Spurs 67 15 0.817 103.5 92.9
## 3 Cleveland Cavaliers 57 25 0.695 104.3 98.3
## 4 Toronto Raptors 56 26 0.683 102.7 98.2
## 5 Oklahoma City Thunder 55 27 0.671 110.2 102.9
## 6 Los Angeles Clippers 53 29 0.646 104.5 100.2
Now let’s attach the data so we don’t have to refer to the data frame by name every time.
attach(Team)
## The following object is masked _by_ .GlobalEnv:
##
## Team
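The message printed after attach(Team) appears because the data frame and one of its columns are both named Team, so the global object hides the attached column. A minimal sketch of two ways to reach columns without attaching is shown here. It uses only the six rows printed by head(Team), and the name nba6 is a hypothetical one chosen for illustration, so the numbers will not match results from the full data set.

```r
# First six rows only, copied from the head(Team) output above.
# The name nba6 is hypothetical and not part of the original guide.
nba6 <- data.frame(
  Team = c("Golden State Warriors", "San Antonio Spurs", "Cleveland Cavaliers",
           "Toronto Raptors", "Oklahoma City Thunder", "Los Angeles Clippers"),
  Wins = c(73, 67, 57, 56, 55, 53),
  Losses = c(9, 15, 25, 26, 27, 29),
  WinPct = c(0.890, 0.817, 0.695, 0.683, 0.671, 0.646),
  PtsFor = c(114.9, 103.5, 104.3, 102.7, 110.2, 104.5),
  PtsAgainst = c(104.1, 92.9, 98.3, 98.2, 102.9, 100.2)
)

# with() evaluates an expression inside the data frame without attaching it
mean_losses <- with(nba6, mean(Losses))

# The $ operator reaches a single column directly
max_allowed <- max(nba6$PtsAgainst)
```

Model functions such as lm also take a data argument, which is why the later lm call works whether or not the data frame is attached.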
We are most interested in the relationship between two variables: a team’s losses (Losses) and its points against (PtsAgainst), which is a measure of defense. Losses are counted out of the 82 games played in one season, and Points Against is the average number of points a team gives up per game.
In our linear regression model, Points Against is our explanatory variable and Losses is our response variable. My reasoning has to do with the limits of an offense: you can only score so many baskets in a game of four 12-minute quarters, so the more points you give up, the less likely your team is to edge out a win.
Scatterplots are useful for illustrating the relationship between two variables. One tool we can use is the “plot” command, which takes quantitative variables separated by commas. The first variable is the explanatory variable (x-axis), and the second is the response (y-axis).
We will make the plot using Losses and PtsAgainst. Notice that PtsAgainst is the explanatory variable and Losses is the response:
plot(PtsAgainst,Losses)
If you look at the plot you will notice the axis labels aren’t the prettiest, so let’s change them to be more presentable and intuitive.
plot(PtsAgainst, Losses, ylab = "Losses in One Season",
xlab = "Average Points given up Per Game",
main = "NBA Standings Loss and Points Against")
If Losses (Losses in One Season) rises or falls along a roughly straight line as PtsAgainst (Average Points Given up Per Game) increases, that gives us evidence of a linear relationship between the two quantitative variables. Just looking at this plot, though, it is not crystal clear that the relationship is linear. The two variables look correlated, but there is a lot of variability and the points are spread widely along the x-axis.
We need to create a regression model to describe statistically the relationship of PtsAgainst (Average Points given up Per Game) and Losses (Losses in One Season).
We will use the “lm” command to create a linear model that describes the relationship between the variables. In the “lm” formula the response variable goes first (to the left of the ~) and the explanatory variable goes second.
In our data, the response is Losses in One Season and the explanatory variable is a team’s Average Points Given up Per Game.
mymod <- lm(Losses~ PtsAgainst, data=Team)
mymod
##
## Call:
## lm(formula = Losses ~ PtsAgainst, data = Team)
##
## Coefficients:
## (Intercept) PtsAgainst
## -212.64 2.47
Now that we have the slope and intercept for our model we need to interpret them. For our data, each one-point increase in Average Points Given up Per Game is associated with about 2.47 more predicted losses. When you take into account that there are only 82 games in total, this is a pretty big change.
We will call our intercept A and our slope B. Our simple linear regression model is: y = A + Bx, or y = -212.64 + 2.47x
After finding the regression line using the “lm” command we are going to add the line to our scatterplot. This will help us see a linear trend if there is one.
To do this we draw the scatterplot as normal and then follow it up with the “abline” command. When using “abline”, give the intercept first and the slope second, separated by a comma.
plot(PtsAgainst, Losses)
abline(-212.64, 2.47)
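Typing rounded coefficients by hand works, but abline can also read them from a fitted model. The sketch below shows both forms on a small stand-in data set: the six rows printed by head(Team). The names pts_against, losses, and fit6 are hypothetical, and the fitted line will differ from the one for the full data.

```r
# Stand-in data: the six rows shown in the head(Team) output
pts_against <- c(104.1, 92.9, 98.3, 98.2, 102.9, 100.2)
losses <- c(9, 15, 25, 26, 27, 29)

fit6 <- lm(losses ~ pts_against)

plot(pts_against, losses)

# abline() accepts a fitted lm object and takes the intercept and slope from it
abline(fit6)

# Equivalent: pass the unrounded coefficients explicitly (intercept, then slope)
b <- coef(fit6)
abline(b[1], b[2], lty = 2)
```

Passing the model object avoids rounding and keeps the plot in sync if the model is refit.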
Looking at the new plot, it seems the line follows the trend in the data. But we need to know how “well” the line fits the data. First let’s see if our residual values are normally distributed.
Let’s look at a histogram of the residuals:
myresids <- mymod$residuals
hist(myresids)
Looking at the histogram, the residuals do not look clearly normal. We should double check with a normal Q-Q plot, which plots the residuals against a normal reference line:
qqnorm(myresids)
qqline(myresids)
From the plot we see that the residuals mostly follow the straight reference line. This suggests that the residuals are approximately normally distributed.
Next, we need to make sure the variance of the residuals is constant. If it is, there is no evidence of heteroscedasticity. We can check equal variance by plotting the residuals against PtsAgainst:
plot(mymod$residuals ~ PtsAgainst)
abline(0,0)
The residual variance looks to be roughly constant across PtsAgainst, so the equal variance assumption appears to hold.
Another regression assumption we need to pay attention to is that the mean of the potential error term values is equal to 0:
mean(myresids)
## [1] 3.842991e-16
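A value like 3.84e-16 is zero up to floating-point rounding. It is worth knowing that least-squares residuals average to zero by construction whenever the model includes an intercept, so this check describes the fit rather than testing the assumption about the unobservable errors. The sketch below illustrates this with made-up data; every number and name in it is hypothetical.

```r
set.seed(1)                      # arbitrary seed so the sketch is reproducible
x <- runif(30, 90, 110)          # made-up predictor values
y <- 50 + rnorm(30, sd = 10)     # made-up response that ignores x entirely

fit <- lm(y ~ x)                 # least-squares fit with an intercept
resid_mean <- mean(resid(fit))   # zero up to rounding, even for a useless model
```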
So in conclusion, the regression assumptions appear to be reasonably satisfied and we can trust the results of the tests.
Because the residuals are approximately normally distributed, we can use our simple regression line to predict losses based on points against.
Remember that our regression line:
y=-212.64 + 2.47x
To make a prediction we plug a value of our explanatory variable (Points Against) into the regression line. First let’s grab a data point. To do this we index the data frame by the row number of a team and the name of the variable we are looking for (in this case PtsAgainst). For example, let’s use Sacramento.
Team[22, "PtsAgainst"]
## [1] 109.1
We will predict Sacramento’s losses by plugging Sacramento’s points against of 109.1 into the regression line:
-212.64+ 2.47*(109.1)
## [1] 56.837
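The same calculation can be done with predict, which uses the unrounded coefficients stored in the model. The sketch below checks that the two approaches agree, using a stand-in model fit to the six rows printed by head(Team). The names nba6 and fit6 are hypothetical, and the predicted value differs from the one for the full data.

```r
# Stand-in data: the six rows shown in the head(Team) output
nba6 <- data.frame(
  PtsAgainst = c(104.1, 92.9, 98.3, 98.2, 102.9, 100.2),
  Losses = c(9, 15, 25, 26, 27, 29)
)
fit6 <- lm(Losses ~ PtsAgainst, data = nba6)

# By hand: intercept + slope * x, with the stored (unrounded) coefficients
by_hand <- unname(coef(fit6)[1] + coef(fit6)[2] * 109.1)

# With predict(): the new data frame must use the same column name as the model
by_predict <- unname(predict(fit6, newdata = data.frame(PtsAgainst = 109.1)))

same_answer <- isTRUE(all.equal(by_hand, by_predict))
```

With the full model, predict(mymod, data.frame(PtsAgainst = 109.1)) would give a value very close to 56.837, differing only because the hand calculation rounds the coefficients.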
Now we will look at the mean square error and standard error, which are used in later statistical formulas.
Instead of computing these by hand, we can use R’s own command:
summary(mymod)
##
## Call:
## lm(formula = Losses ~ PtsAgainst, data = Team)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.533 -5.095 0.749 3.717 18.821
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -212.6391 53.9467 -3.942 0.000491 ***
## PtsAgainst 2.4704 0.5251 4.705 6.22e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.56 on 28 degrees of freedom
## Multiple R-squared: 0.4415, Adjusted R-squared: 0.4215
## F-statistic: 22.13 on 1 and 28 DF, p-value: 6.217e-05
The summary reports a residual standard error of 10.56. This is the square root of the mean square error, so the MSE is 10.56^2, or about 111.5. The MSE measures the performance of the regression line: it is the sum of the squared residuals (errors) divided by the residual degrees of freedom (28 here). A typical error of about 10.56 losses is large in the context of an 82-game season, so the regression line does not fit the data points very precisely.
summary(mymod) also gives us the R-squared value of 0.4415. This goes along with the large MSE: the model explains only about 44% of the variability in losses, leaving more than half unexplained.
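The link between the residual standard error printed by summary and the mean square error can be sketched with plain arithmetic. The inputs below are copied from the summary output above; the variable names are hypothetical.

```r
s <- 10.56            # residual standard error from summary(mymod)
df_resid <- 28        # residual degrees of freedom (n - 2, with n = 30 teams)

mse <- s^2            # mean square error = SSE / (n - 2)  -> 111.5136
sse <- mse * df_resid # sum of squared residuals implied by the rounded s

# On a fitted model, sigma(mymod)^2 returns the MSE without rounding
```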
Our simple linear regression model is not useful unless there is a significant relationship between our response variable (losses) and explanatory variable (points against). To find the significance of this relationship we need to test our null hypothesis.
Ho: B = 0; there isn’t a relationship between losses and points against
against
Ha: B ≠ 0; there is a relationship between losses and points against
We can use the same data from the “summary” command or summary(mymod)
Using an alpha level of 0.05, the p-value (6.217e-05) is less than alpha (0.05). Therefore, we reject the null hypothesis; there is evidence that losses and points against have a relationship.
We use the “confint” command to make a confidence interval at a chosen confidence level. For this example, we will use 95%:
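The t value and p-value in the PtsAgainst row of the summary can be rebuilt from the estimate and its standard error. This is a sketch using the rounded numbers printed above, so the results match the summary only approximately; the variable names are hypothetical.

```r
estimate <- 2.4704    # slope estimate from summary(mymod)
std_err <- 0.5251     # its standard error
df_resid <- 28        # residual degrees of freedom

t_stat <- estimate / std_err                    # test statistic for Ho: B = 0
p_value <- 2 * pt(-abs(t_stat), df = df_resid)  # two-sided p-value

reject_null <- p_value < 0.05                   # decision at alpha = 0.05
```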
confint(mymod, level=.95)
## 2.5 % 97.5 %
## (Intercept) -323.143870 -102.134279
## PtsAgainst 1.394807 3.546053
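This tells us we are 95% confident that the true slope is between 1.394807 and 3.546053. In other words, each additional point against is associated with between about 1.39 and 3.55 additional losses. Likewise, we are 95% confident that the true intercept is between -323.14 and -102.13. The intercept has no practical interpretation here: it is the predicted number of losses for a team that gives up 0 points per game, which is far outside the range of the data, and a negative number of losses is impossible.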
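The slope interval reported by confint is the estimate plus or minus a t critical value times the standard error. The sketch below rebuilds it from the rounded summary values, so it agrees with the confint output only to about three decimal places; the variable names are hypothetical.

```r
estimate <- 2.4704               # slope estimate from summary(mymod)
std_err <- 0.5251                # its standard error
t_crit <- qt(0.975, df = 28)     # critical value for a 95% interval, 28 df

lower <- estimate - t_crit * std_err
upper <- estimate + t_crit * std_err
```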
Now we calculate the correlation between our response and predictor variables.
cor(PtsAgainst, Losses)
## [1] 0.6644513
The correlation of 0.6644513 tells us that there is a moderate positive linear relationship between points against and losses. Our rejection of the null hypothesis agrees with this: there is a relationship between x and y, our comparison variables.
Next we are going to look at a confidence interval for the mean value of y when x equals a particular value x0. For this example, we will find the confidence interval for the mean losses of teams with a points against of 100.
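In simple linear regression the squared correlation equals the R-squared value reported by summary. A quick arithmetic check with the two numbers printed above (variable names are hypothetical):

```r
r <- 0.6644513          # correlation from cor(PtsAgainst, Losses)
r_squared <- r^2        # share of variability in losses explained by the line

# Compare with the Multiple R-squared printed by summary(mymod)
matches_summary <- abs(r_squared - 0.4415) < 1e-4
```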
newdata <- data.frame(PtsAgainst=100)
confy <- predict(mymod, newdata, interval="confidence", level = .95)
confy
## fit lwr upr
## 1 34.40395 29.52101 39.28689
This output tells us we are 95% confident that the mean number of losses for teams with a points against of 100 is between 29.5 and 39.3 games. Since these are games, you might round this to roughly 30 to 39 games. Obviously Golden State is an outlier here because they are one of the best teams of all time.
The next interval is a prediction interval for an individual value of y when x equals a particular value x0. Like the last interval, we will use 100 for our points against.
predy <- predict(mymod, newdata, interval="predict", level=.95)
predy
## fit lwr upr
## 1 34.40395 12.22964 56.57826
This prediction interval is wider than the confidence interval because it is for an individual y value instead of the mean value of y. An individual value varies more than a mean does, so there is more uncertainty to account for.
The prediction and confidence intervals are logically centered at the same point. We can run a quick comparison to check this:
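The extra width has a simple form: the squared half-width of the prediction interval equals the squared half-width of the confidence interval plus (t critical value times residual standard error) squared. The sketch below checks this with the numbers printed above. Because the residual standard error is rounded to 10.56, the rebuilt value agrees only approximately; the variable names are hypothetical.

```r
# Interval limits copied from the predict() output at PtsAgainst = 100
conf_half <- (39.28689 - 29.52101) / 2   # half-width of the confidence interval
pred_half <- (56.57826 - 12.22964) / 2   # half-width of the prediction interval

t_crit <- qt(0.975, df = 28)             # 95% critical value, 28 df
s <- 10.56                               # residual standard error (rounded)

# Rebuild the prediction half-width from the confidence half-width
pred_half_rebuilt <- sqrt(conf_half^2 + (t_crit * s)^2)

gap <- abs(pred_half_rebuilt - pred_half)  # small, due to rounding of s
```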
confy[1] == predy[1]
## [1] TRUE
As you can see, it is true.
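The == comparison works here because both fitted values come from the same calculation. In general, exact equality on decimal numbers can fail because of floating-point rounding, and all.equal is a safer check. A minimal sketch with a well-known example (the variable names are hypothetical):

```r
a <- 0.1 + 0.2

exact_match <- (a == 0.3)                  # FALSE: the two differ in the last bits
close_match <- isTRUE(all.equal(a, 0.3))   # TRUE: equal within a small tolerance
```

For the intervals above, isTRUE(all.equal(confy[1], predy[1])) would be the tolerant version of the same check.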
The F-test is another tool we have if we want to test the significance of the regression relationship between x and y. Let’s again use the command “summary” so we can see the F-stat and p value.
summary(mymod)
##
## Call:
## lm(formula = Losses ~ PtsAgainst, data = Team)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.533 -5.095 0.749 3.717 18.821
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -212.6391 53.9467 -3.942 0.000491 ***
## PtsAgainst 2.4704 0.5251 4.705 6.22e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.56 on 28 degrees of freedom
## Multiple R-squared: 0.4415, Adjusted R-squared: 0.4215
## F-statistic: 22.13 on 1 and 28 DF, p-value: 6.217e-05
The F-statistic is 22.13 with 1 and 28 degrees of freedom. Since our p-value (6.217e-05) is less than alpha (0.05), we reject the null. There is evidence to support a relationship between losses and points against (at least for the NBA) :)
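With a single explanatory variable, the F-test and the t-test for the slope are the same test: the F-statistic is the square of the slope’s t value, which is why both report the same p-value. A sketch using the rounded values from the summary output (variable names are hypothetical, and the match is approximate because of rounding):

```r
t_stat <- 4.705                     # t value for PtsAgainst from summary(mymod)
f_from_t <- t_stat^2                # should be close to the F-statistic of 22.13

# Upper-tail probability of the F distribution with 1 and 28 degrees of freedom
p_from_f <- pf(22.13, df1 = 1, df2 = 28, lower.tail = FALSE)

reject_null <- p_from_f < 0.05
```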