Introduction

What is Linear Regression?

This R Guide will cover the topic of Simple Linear Regression. Simple Linear Regression compares two variables and measures dependence. In the case of Simple Linear Regression, these variables are always quantitative variables, meaning that each variable holds a numerical or measurable value, such as height and weight. The goal of Linear Regression is to approximate this relationship using a straight line and perform significance tests on the elements of the line: slope and intercept.

Installation

Before we get started, let’s download some data to use. The following command will download a multitude of data sets from fivethirtyeight.com if pasted into R. If you would like, you can install it for yourself, but for the purposes of this R Guide, this step is unnecessary.

#if (require(fivethirtyeight)) {install.packages("fivethirtyeight", repos="https://cran.rstudio.com")}
library(fivethirtyeight)
## Warning: package 'fivethirtyeight' was built under R version 3.3.3

Data Selection

I would like to begin with the “avengers” data. Displaying a summary of the data set provides some context for the data. There are 21 variables, some quantitative and some qualitative. To summarize, the data includes the name of the Avenger superhero, the number of comic book appearances as an Avenger, status as an Avenger, gender, length of time since induction to the Avengers team (I will refer to it as “tenure.”), and information on the deaths and returns of the Avenger.

summary(avengers)
##      url             name_alias         appearances      current       
##  Length:173         Length:173         Min.   :   2.0   Mode :logical  
##  Class :character   Class :character   1st Qu.:  58.0   FALSE:91       
##  Mode  :character   Mode  :character   Median : 132.0   TRUE :82       
##                                        Mean   : 414.1   NA's :0        
##                                        3rd Qu.: 491.0                  
##                                        Max.   :4333.0                  
##     gender          probationary_intro full_reserve_avengers_intro
##  Length:173         Length:173         Length:173                 
##  Class :character   Class :character   Class :character           
##  Mode  :character   Mode  :character   Mode  :character           
##                                                                   
##                                                                   
##                                                                   
##       year      years_since_joining   honorary           death1       
##  Min.   :1900   Min.   :  0.00      Length:173         Mode :logical  
##  1st Qu.:1979   1st Qu.:  5.00      Class :character   FALSE:104      
##  Median :1996   Median : 19.00      Mode  :character   TRUE :69       
##  Mean   :1988   Mean   : 26.55                         NA's :0        
##  3rd Qu.:2010   3rd Qu.: 36.00                                        
##  Max.   :2015   Max.   :115.00                                        
##   return1          death2         return2         death3       
##  Mode :logical   Mode :logical   Mode :logical   Mode:logical  
##  FALSE:23        FALSE:1         FALSE:8         TRUE:2        
##  TRUE :46        TRUE :16        TRUE :8         NA's:171      
##  NA's :104       NA's :156       NA's :157                     
##                                                                
##                                                                
##   return3         death4        return4         death5       
##  Mode :logical   Mode:logical   Mode:logical   Mode:logical  
##  FALSE:1         TRUE:1         TRUE:1         TRUE:1        
##  TRUE :1         NA's:172       NA's:172       NA's:172      
##  NA's :171                                                   
##                                                              
##                                                              
##  return5           notes          
##  Mode:logical   Length:173        
##  TRUE:1         Class :character  
##  NA's:172       Mode  :character  
##                                   
##                                   
## 
attach(avengers)

I think that it would be interesting to first look at two variables that should show some dependence: “years_since_joining” as the predictor and “appearances” as the response. Will the superhero’s tenure as an Avenger explain the number of comic book appearances? Lets see!

Linear Regression on Appearances and Tenure

Output

SLR_AppearancesTenure <- lm(appearances ~ years_since_joining)
summary(SLR_AppearancesTenure)
## 
## Call:
## lm(formula = appearances ~ years_since_joining)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -544.9 -344.7 -273.7  108.3 3921.3 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          374.166     68.599   5.454  1.7e-07 ***
## years_since_joining    1.502      1.703   0.882    0.379    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 678.4 on 171 degrees of freedom
## Multiple R-squared:  0.004528,   Adjusted R-squared:  -0.001293 
## F-statistic: 0.7778 on 1 and 171 DF,  p-value: 0.379

T-test

Unfortunately, our expectation was false. We can gather from the high p-value of 0.379 in the years_since_joining row that the t-test’s null hypothesis that our slope is not significant cannot be rejected.

H0: B1=0 (The null hypothesis H0 is rejected for p-values below the chosen significance level. P-values higher than the significance level fail to reject the null hypothesis)

H1: B1<>0

Because we have not rejected the null hypothesis, we should not be using tenure in our prediction of appearances.

F-test

This output also gives us the results of an F-test, which also tests the significance of the regression relationship between tenure and appearances. As you can see in the bottom line of the output, the F-statistic is low and the p-value is high (0.379). The F-test’s null hypothesis states that, collectively, our predictor variables have no effect on appearances. Wait! We only have one predictor variable. Notice the p-value for the t-test: It is also 0.379. In Simple Linear Regression, the p-values and thus the choice to reject or fail to reject the null hypothesis are the same for both tests. You will see in the next R Guide, that the story is not the same in Multiple Linear Regression.

Scatterplot

Let’s take a look at a scatterplot to see the poor relationship visually. Scatterplots are a great way to show the relationship (or lack thereof) of two quantitative variables such as appearances and tenure. If the points seem to follow a trend, there is a good chance that they will pass Linear Regression tests.

plot(years_since_joining,appearances, xlab="Tenure (yrs)", ylab="Appearances (# of comics)", main="Scatterplot")

Interesting! The scatterplot does not show an obvious linear relationship. Because each point represents one Avenger, it makes sense that the bottom of the plot is more condensed with Avengers. Apparently, there have been a lot of unpopular, new, or short-lived Avengers. After all, I cannot name 173 Avengers!

Linear Regression on Appearances and Tenure (Filtered)

Data Filtration

Perhaps if we filtered the data to only include the more well-known and currently living Avengers, we would see a more linear relationship. We can do that in R! Let’s do the same linear regression test, but only for living Avengers with 1000 or more appearances in the comics.

Output

SLR_AppearancesTenure_Filtered <- lm(appearances ~ years_since_joining, data=subset(avengers, appearances>1000 & death1=="FALSE"))
summary(SLR_AppearancesTenure_Filtered)
## 
## Call:
## lm(formula = appearances ~ years_since_joining, data = subset(avengers, 
##     appearances > 1000 & death1 == "FALSE"))
## 
## Residuals:
##      18      58      59     105     141     142 
## -136.65  312.27  -51.73 -198.07  156.67  -82.48 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1392.889    146.502   9.508 0.000683 ***
## years_since_joining   16.148      6.594   2.449 0.070522 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 217.6 on 4 degrees of freedom
## Multiple R-squared:  0.5999, Adjusted R-squared:  0.4999 
## F-statistic: 5.998 on 1 and 4 DF,  p-value: 0.07052

The p-value of 0.07052 is much nicer, but only under the significance level of 0.1 will the null hypothesis of the t and f tests be rejected. Another red flag in this linear regression is the degrees of freedom value of 4. Degrees of freedom are calculated as the number of data points (n) minus the number of predictor variables + 1 (1+1). For us, that means n=6. Only 6 Avengers haven’t died and also were featured in over 1000 comics. We should not be comfortable with that low of a sample size. As a general rule, aim for at least 25 data points when conducting statistical tests.

However, let’s ignore this for the moment so we can do some interpretation.

Interpretation

r & r-squared

The r-squared value is perhaps the next most important output after the p-value. The r-squared value is called the simple coefficient of determination and basically tells us what percentage of our response variable is explained by our predictor, so in this case, 60% of the variability in our appearances value can be explained by tenure. The r-squared value becomes increasingly important in Multiple Linear Regression, when multiple predictors are working together to build up the explained percentage. r-squared may prompt you to ask, what is r? r is the simple correlation coefficient. So if you want to quantify how correlated your linear relationship is, you could take the square root of r-squared and that would return you the right number, but some intuition would be required to set the +/- sign for r. You can calculate r in R using the cor() function. The r-value does not always mean that there is a linear relationship present. Correlation is not causation!

r <- cor(appearances[appearances>1000 & death1=="FALSE"],years_since_joining[appearances>1000 & death1=="FALSE"])
r
## [1] 0.7745326

The Prediction Equation

The simple linear regression prediction equation is:

\[\hat{y_i}=\hat{\beta_0}+\hat{\beta_1}x_i\]

The true value of the response variable can be given by the prediction equation + error.

\[{y_i}=\hat{\beta_0}+\hat{\beta_1}x_i+Error_i\]

Yes, in doing linear regression by hand, this equation is more useful, but it is still important in the interpretation of R’s linear regression output. y-hat is the value we are attempting to explain. In this case y-hat will be the predicted number of appearances of an Avenger with x years since joining the team. Let’s use the equation as an equation of the Least Squares Regression Line.

plot(appearances[appearances>1000 & death1=="FALSE"]~years_since_joining[appearances>1000 & death1=="FALSE"],xlab="Tenure (yrs)", ylab="Appearances (# of comics)", main="Scatterplot")
abline(1392.889,16.148)

Error

The error term at the end of the equation represents the difference between the predicted value and the real value of the response variable. At any given x, we expect the error to be equal to zero, because it follows a standard N(0,1) normal distribution, but no prediction is going to be perfect with an error of exactly zero, but the idea is to minimize error. Each error term is statistically independent, meaning if we get a super high error for one x, we can make no assumptions about the next error term based upon that fact. We also assume that error terms will have a constant variance, which does not depend on x. We will get into an analysis of error later in the R Guide, through residuals, but for now, just understand that predictions are just that: predictions.

Beta0

Beta0 is the intercept value, estimated above in the R output as 1392.889.

\[\hat{Appearances}=1392.889+\hat{\beta_1}*Tenure\]

Lazily interpreted, we could say that the intercept is the number of appearances we predict when x (years since joining) is equal to 0. But does that really make sense? A brand new Avenger, just signed to the team, could not possibly have 1393 comic book appearances as an Avenger! The intercept can only be interpreted within the experimental region of the data, so for example, while it is not acceptable to interpret the intercept at x = 0, it would be acceptable to use the intercept when predicting the appearances of Mr. Fantastic, who has been an Avenger for 26 years.

Beta1

Beta1 is the coefficient for the predictor variable, estimated in the R output as 16.148.

\[\hat{Appearances}=1392.889+16.148*Tenure\]

This means that for every year that a superhero is a member of the Avengers, you can expect them to appear in 16 more comics. That sounds reasonable! How confident are we in that number?

Confidence Intervals on the Slope and Intercept

Confidence Intervals for regression coefficients are a way of visualizing the Beta0 intercept and Beta1 slope in a different way. Instead of seeing the point estimate of the intercept and slope (as seen in the prediction equation) Confidence Intervals show the estimate as a range, which decreases when confidence is high and increases when confidence is low.

confint(SLR_AppearancesTenure_Filtered)
##                          2.5 %     97.5 %
## (Intercept)         986.133457 1799.64439
## years_since_joining  -2.159174   34.45477

We are 95% confident that for every year that a superhero is a member of the Avengers, we can expect them to appear in between [-2, 34] more comics. Like I said earlier, it is tricky to interpret the intercept in this model, but you could also (lazily) interpret the confidence interval for the intercept as: We are 95% confident that when an Avenger’s tenure is equal to zero, they will have appeared in between [986, 1800] comics.

Prediction Interval for a given Tenure

Putting all of the coefficients into the prediction equation, let’s predict an appearances value for Mr. Fantastic, who is one of the 6 Avengers who has not yet died and has also been featured in 1000 comics.

MisterFantasticAppearancesPREDICT = 1392.889 + 16.148 * 26 + 0
MisterFantasticAppearancesPREDICT
## [1] 1812.737

Without having to always be jotting down coefficient values, we can predict right here in R, using prediction intervals.

Fantastic <- data.frame(years_since_joining = 26)
FantasticPredy <- predict(SLR_AppearancesTenure_Filtered, Fantastic , interval = "prediction", level=0.95)
FantasticPredy
##        fit      lwr     upr
## 1 1812.732 1142.503 2482.96

We predict Mr. Fantastic to have been in 1813 comics, with 95% confidence that the value lies between [1142.5, 2482.96]. Let’s see how close we were to the true value!

which(name_alias=="Reed Richards")
## [1] 58

Mr. Fantastic is the 58th row of the data. Let’s use that to see his appearances value.

avengers[58,"appearances"]
## [1] 2125

Mr. Fantastic has appeared in 2125 comics, about 312 more than we predicted for him. We could be dealing with a significant amount of error. Lets take a look at our residuals.

Residuals

Like I said before, residuals are another word for error terms. Mr. Fantastic’s residual is around 312 because his true appearances value was 312 higher than his predicted. If you were to use the prediction equation already knowing this, you could substitute 312 for the error term. \[{y}=1393+16x+312\]

However, when you have many data points, you don’t want to be calculating residuals by hand.

Error Assumptions

E~N(0,1)

A QQ plot can help us check if our residuals follow the standard normal distribution.

AvengerResiduals <- SLR_AppearancesTenure_Filtered$residuals
qqnorm(AvengerResiduals)
qqline(AvengerResiduals)

In a Q-Q plot, we like to see the points close to the line. These residuals are not too far off from the line. We has reason to believe that our data does indeed follow a nomrmal distribution. We should also check to make sure the mean error is close to zero, as the standard normal distribution suggests.

Zero = mean(AvengerResiduals)
Zero
## [1] -1.421085e-14

Perfect!

Statistically Independent

If any of these assumptions are going to give us trouble, it is this one. Our data is not exactly independent. For a superhero to be included in a comic, another superhero has to be excluded. This does not necessarily mean that our errors will be just as dependent, but it is something to keep your eye on.

Homoscedasticity (Equal Variance)

Let’s see the residuals plotted against the predictor variable to check for equal variance, or homoscedasticity. We are looking for a random distribution of residuals here.

plot(AvengerResiduals ~ years_since_joining[appearances>1000 & death1=="FALSE"],xlab="Tenure (yrs)",ylab="Residuals", main="Heteroscedasticity Check")
abline(0,0)

The residuals do not appear to have greater variance as years since joining increases, thus there is no evidence of heteroscedasticity. However, we again are only working with 6 data points. Measures of error that take into account the amount of data points are Mean Square Error and Standard Error.

Measures of Error

MSE <- sum((AvengerResiduals)^2)/4
MSE
## [1] 47360.43
SE <- sqrt(sum((AvengerResiduals)^2)/4)
SE
## [1] 217.6245

The Mean Square Error is 47360.4 and the Standard Error is 217.6, which is also listed in the summary under Residual Standard Error. Mean Square Error is a point estimate of the variance and the Standard Error is a point estimate of the standard deviation. Of course we are dealing with appearances with values in the thousands, but I am not super comfortable with a Standard Error of 217.6.

Mr. Fantastic’s residual was the greatest of his 6 peers and we have previously calculated his prediction interval, which takes error into account. Just for fun, let’s imagine all six of his peers joined the Avengers in the same year that he did. What will the mean of the appearance values be?

Confidence Interval for the Mean

We can create a confidence interval for the mean by fixing the years since joining at 26.

Fantastic6 <- predict(SLR_AppearancesTenure_Filtered, Fantastic , interval = "confidence", level=0.95)
Fantastic6
##        fit      lwr      upr
## 1 1812.732 1522.695 2102.769

We can expect the same mean for the group that we did for only Mr. Fantastic, because we are essentially finding the expected mean for 6 Mr. Fantastics. We are 95% confident that the mean appearances value lies between [1523, 2103].

Conclusion

Concerns

The data we used didn’t show the linear relationship that we would have liked to see. We had to tinker with the data to get even semi-usable data for linear regression. I kept moving along with the Avengers data because it was not perfect. I wanted to explain on multiple occasions why we should be suspicious of our results, rather than picking a perfect data set and running regression without a more in depth look at small sample size issues and large standard error issues. Another explanation that I would like to make clear is that Linear Regression is best suited for cross-sectional data, rather than time-series data. The Avengers data I chose could be misunderstood to be time series data because of the use of time. However, cross-sectional data means data displayed “at a moment in time”, rather than “over time,” and in this case we are sitting at a point in time and counting the total number of years since an event and the total number of events since that date, which means we are dealing with a cross-sectional data set, rather than a time series. This data would be interesting to revisit in a later R Guide when we are dealing with time series.