Linear Regression predicts a quantitative response. It is used to predict the value of an outcome variable Y based on the input predictor vaiable X. The goal is to establish a linear relationship between the predictor variable and the response variable in order to estimate the value of the response Y.
Linear regression has been around since Sir Frances Galton in 1885, but it still serves as a good starting point for statistical analysis. It is essential to have a good understanding of linear regression before you move toward a more complex method.
This tutoiral uses a package called nycflights13 and it will need to be installed prior to completing this tutorial. This is a package with multiple data sets and this tutorial utilizes the flights data set. This data set makes it convenient to demonstrate linear regression and easier to understand. It is a large data set with 336,776 entries.
#Packages
library(nycflights13) #Data
library(modelr) #Pipeline Modeling Functions
library(tidyverse) #Data Manipulation
library(broom) #Tidy Model Outputs
#View Data
flights
The two columns we will use in flights are distance and air_time. The initial relationship between these two variables will be studied prior to starting a statistical analysis. The first thing we want to do is see if the relationship is linear.
Linear regression is used to model variable Y as a mathematical function of variable X, so that we can use this regression model to predict Y when we only know X.
\[Y = \beta 0 + \beta 1X + \epsilon \]
Y = Air Time
X = Distance
\(\beta 0\) = Intercept
\(\beta 1\) = Coefficient (slope)
\(\epsilon\) = is a mean-zero random error term (variability in Y the model is unable to explain)
To build this model in R you start with Y ~ X. In order to add a regression line we will use the function lm which stands for “linear model” and add the dependent and independent variables. The “linear model” function produces the best-fit linear relationship by minimizing the least squares criterion. It is expressed in terms of “standard error”.
regmodel <- lm(air_time ~ distance, data = flights) #Build Linear Regression Model
plot(flights$air_time, flights$distance,
main = "Scatterplot",
xlab = "Distance",
ylab = "Time In Air",
col = "blue")
In order to make sure our model fits“” we need to look at the summary.
summary(regmodel)
##
## Call:
## lm(formula = air_time ~ distance, data = flights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82.397 -7.334 -1.320 6.513 145.389
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.847e+01 3.888e-02 474.9 <2e-16 ***
## distance 1.261e-01 3.036e-05 4154.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.78 on 327344 degrees of freedom
## (9430 observations deleted due to missingness)
## Multiple R-squared: 0.9814, Adjusted R-squared: 0.9814
## F-statistic: 1.726e+07 on 1 and 327344 DF, p-value: < 2.2e-16
\(\beta 1\) = Coefficient (slope)
\[Air Time = 18.47 + 0.1261(Distance) + \epsilon \] The intercept estimate is 18.47 so when the distance is zero we can expect the air time to be 18.47 minutes. That doesn’t make a lot of sense here, as we wouldn’t expect a flight of 0 miles to take 18 minutes and 28 seconds, but the intercept is kept so that it fits the other data points. The slope of the line is 0.1261. And for every 1 minute increase in air time we can expect the average increase in distance to be 0.1261 units. These are also called the beta coefficients.
We also need to see if the coefficients are statistically significant. We do this by looking at the standard error. This is mathematically shown in the equations below:
\[SE(\beta_0)^2 = \sigma^2\bigg[\frac{1}{n}+\frac{\bar{x}^2}{\sum^n_{i=1}(x_i - \bar{x})^2} \bigg], \quad SE(\beta_1)^2 = \frac{\sigma^2}{\sum^n_{i=1}(x_i - \bar{x})^2} \]
where \[\sigma^2 = Var(\epsilon)\].
A coefficient will be within 2 standard errors of its estimate about 95% of the time.
\[\beta_1 \pm 2 \cdot SE(\beta_1)\]
confint(regmodel)
## 2.5 % 97.5 %
## (Intercept) 18.3903654 18.5427908
## distance 0.1260598 0.1261788
The results show that our confidence interval concludes we are 95% sure that the coefficient is 0.1260 to 0.1261. This is great! We are 95% sure that the expected increase in time for an increase of 1 mile in distance is between 0.1260 and 0.1261 minutes.
\[t=\frac{\beta_1 - 0}{SE(\beta_1)}\]
A large t-statistic will provide a small p-value. This will conclude that there is a relationship between distance and time in the air.
According to the summary our t-statistic = 4154.4
Let’s check the Residual Standard Error (RSE) or how much the reponse will deviate from the true regression line. This is a measure of goodness of fit.
\[RSE = \sqrt{\frac{1}{n-2}\sum^n_{i=1}(y_i - \hat{y}_i)^2}\]
sigma(regmodel)
## [1] 12.78196
This means the distance will deviate from the true regression line by 12.78196. So what? Well, when you compare it to the average distance over all the flights the percentage error is 1%.
sigma(regmodel)/mean(flights$distance)
## [1] 0.01229138
Next, we can check the R2 value.
\[R^2 = 1 - \frac{RSS}{TSS}= 1 - \frac{\sum^n_{i=1}(y_i-\hat{y}_i)^2}{\sum^n_{i=1}(y_i-\bar{y}_i)^2}\]
rsquare(regmodel, data = flights)
## [1] 0.9742638
R-Squared tells us the proportion of variation in the dependent variable. This suggests that 97% of the variability in our air time data can be expained. It doesn’t get much better than that!
The last piece is to look at the F-statistic.
\[F = \frac{(TSS-RSS)/p}{RSS/(n-p-1)} \]
We can see this at the bottom of our summary.
summary(regmodel)
##
## Call:
## lm(formula = air_time ~ distance, data = flights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82.397 -7.334 -1.320 6.513 145.389
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.847e+01 3.888e-02 474.9 <2e-16 ***
## distance 1.261e-01 3.036e-05 4154.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.78 on 327344 degrees of freedom
## (9430 observations deleted due to missingness)
## Multiple R-squared: 0.9814, Adjusted R-squared: 0.9814
## F-statistic: 1.726e+07 on 1 and 327344 DF, p-value: < 2.2e-16
The F-statistic is a measure of goodness-of-fit. A larger F-statistic will produce a statistically significant p-value (p<0.05). In our case we see at the bottom of our summary statement that the F-statistic is 1.726e+07 producing a p-value of p<2.2e-16. The p-values are important because we can only say this model is statistically significant when they are less than the pre-determined level that we have set at 0.05.
This model could not get any better! I made sure to pick an example that would have a solid relationship for the tutorial.
Let’s start by going back to the lm function. Using this with ggplot allows us to see our model accuracy. This allows us to compare the linearity of our model (blue line with the 95% confidence interval in shaded region) with a non-linear model. You can see that as the distance gets higher there is less of a relationship.
ggplot(flights, aes(distance, air_time)) +
geom_point() +
geom_smooth(method = "lm") +
geom_smooth(se = FALSE, color = "red") +
scale_y_continuous(name = "Time in Air") +
scale_x_continuous(name = "Distance")
There are a number of visualizations we need in order to make sure our model is performing the way it should.
What do we use this for?
It is a scatterplot of residuals on the y axes and fitted values on the x axes. This plot is used to detect non-linearity by looking at the blue line for questionable patterns. It will also address heteroskedasticity or unequal error variances. We want to make sure there is not a funnel shape with our residuals. This can be resolved sometimes with a log or square root transformation of Y.
The characteristics of a well-behaved residual vs fitted plot and what they suggest about the appropriateness of the simple linear regression model is as follows:
regmodel_results <- augment(regmodel, flights)
ggplot(regmodel_results, aes(.fitted, .resid)) +
geom_ref_line(h = 0) +
geom_point() +
geom_smooth(se = FALSE) +
ggtitle("Residuals vs Fitted")
This tells us if the residuals apread equally along the ranges of predictors. This is how you can check the assumption of equal variance (homoscedasticity). It’s good if you see a horizontal line with equally spread points.
ggplot(regmodel_results, aes(.fitted, sqrt(.std.resid))) +
geom_ref_line(h = 0) +
geom_point() +
geom_smooth(se = FALSE) +
ggtitle("Scale-Location")
This scatterplot shows if residuals are normally distributed. A Q-Q plot plots the distribution of our residuals against the theoretical normal distribution. The closer the points are to falling directly on the diagonal line then the more we can interpret the residuals as normally distributed. If both sets of quantiles came from the same distribution, we should see the points forming a line that’s roughly straight
qq_plot <- qqnorm(regmodel_results$.resid)
This addresses the outliers and influential cases. However, not all outliers are influential to determine a regression line but we need to see if these are or not. If they are not, then the results wouldn’t be much different if we either include or exclude them from analysis.
On the other hand, some cases could be very influential even if they look to be within a reasonable range of the values. They could be extreme cases against a regression line and can alter the results if we exclude them from analysis. Another way to put it is that they don’t get along with the trend in the majority of the cases.
When cases have high Cook’s distance scores and are to the upper or lower right of our leverage plot they have leverage meaning they are influential to the regression results. The regression results will be altered if we exclude those cases.
par(mar = rep(2, 4))
plot(regmodel, which = 4, id.n = 5)
plot(regmodel, which = 5, id.n = 5)
If you want to look at these top 5 observations with the highest Cook’s distance in case you want to assess them further you can use the following.
regmodel_results %>%
top_n(5, wt = .cooksd)
This is a lot of different steps that are necessary when designing a model. The most common metrics to look at while selecting the model are:
Later we will look at how we can extend a simple linear regression model into a multiple regression.