Robbie Beane
Simple linear regression is a supervised learning method that can be used to model a single quantitative response \(Y\) as a function of a single quantitative predictor \(X\).
When performing simple linear regression to describe a relationship between two variables \(X\) and \(Y\), we begin by assuming that the relationship is determined by an unknown hypothetical model (or population model) of the form:
\[Y = \beta_0 + \beta_1 X + \varepsilon\]
We collect a sample of paired observations \((x_i, y_i)\) to use as a training set. We use the training set to create a fitted model. The fitted model is an approximation of the hypothetical model, and can be written in either of the following (equivalent) forms:
\[\hat Y = \hat \beta_0 + \hat \beta_1 X\] \[Y = \hat \beta_0 + \hat \beta_1 X + \hat \varepsilon\]
Given a paired observation \(\left(x_i, y_i \right)\), the fitted value of \(y\) given \(x_i\) is given by:
\[\hat y_i = \hat\beta_0 + \hat\beta_1 x_i\]
The residual associated with the observation is given by:
\[\hat \varepsilon_i = y_i - \hat y_i\]
As discussed in previous lessons, we will select our fitted model \(\hat Y = \hat \beta_0 + \hat \beta_1 X\) by choosing parameter estimates \(\hat\beta_0\) and \(\hat\beta_1\) so that the following quantities are minimized on the training set:
\[SSE = \sum\limits_{i=1}^n \hat \varepsilon_i^2 \hspace{2 em} MSE = \frac{1}{n}\sum\limits_{i=1}^n \hat \varepsilon_i^2\] At present, we will work with \(SSE\). To emphasize that \(SSE\) is a function of the proposed parameter estimates \(\hat\beta_0\) and \(\hat\beta_1\), we sometimes write:
\[SSE\left ( \hat\beta_0, \hat\beta_1 \right) = \sum\limits_{i=1}^n \hat \varepsilon_i^2 = \sum\limits_{i=1}^n \left( y_i - \hat\beta_0 - \hat\beta_1 x_i \right) ^2\]
Through the magic of calculus, we can show that for a given training set \(\left(x_i, y_i \right)\), the quantity \(SSE\left ( \hat\beta_0, \hat\beta_1 \right)\) has a unique minimum which is obtained as follows:
Define quantities \(S_{XX}\) and \(S_{XY}\) by:
\[S_{XX} = \sum\limits_{i=1}^n \left(x_i - \bar x \right)^2 = \sum\limits_{i=1}^n x_i^2 - n\bar x^2\]
\[S_{XY} = \sum\limits_{i=1}^n \left(x_i - \bar x \right)\left(y_i - \bar y \right) = \sum\limits_{i=1}^n x_i y_i - n\bar x \bar y\]
The parameter estimates that minimize SSE on the training set are given by:
\[\hat\beta_1= \frac{S_{XY}}{S_{XX}} \hspace{20px} \mathrm{and} \hspace{20px}\hat\beta_0 = \bar y - \hat\beta_1 \bar x\]
The derivation of the results above is shown in the accompanying notebook titled 3.1.a - Derivation of Parameter Estimates.
There are many different ways to write the formula for \(\hat\beta_1\). One commonly encountered formula is \(\hat\beta_1 = \frac{\mathrm{cov}[X,Y]}{s_X^2}\). This formula can be derived from our previous formula for \(\hat\beta_1\) by multiplying the top and bottom of the expression by \(1/(n-1)\).
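To see these formulas in action, we could compute the estimates by hand in R. The vectors x and y below are a small made-up data set used only for illustration.

x <- c(1, 2, 3, 4, 5)                      # toy predictor values (made up for illustration)
y <- c(2.0, 4.1, 5.9, 8.2, 9.9)            # toy response values (made up for illustration)

Sxx <- sum((x - mean(x))^2)                # S_XX
Sxy <- sum((x - mean(x)) * (y - mean(y)))  # S_XY

b1 <- Sxy / Sxx                            # slope estimate
b0 <- mean(y) - b1 * mean(x)               # intercept estimate
c(b0, b1)

cov(x, y) / var(x)                         # covariance form; gives the same slope estimate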
The quantities \(SSE\) and \(MSE\) provide measures for the quality of a fitted model. However, these metrics are measured in the square of the units in which \(Y\) is measured. This makes it difficult to use these quantities to assess the fitted model. Fortunately, there is a unitless measure, called the R-Squared statistic, that we can use to score our models.
The R-Squared statistic is given by:
\[R^2 = 1 - \frac{SSE}{SST}\]
Where:
\[SST = \sum\limits_{i=1}^n \left(y_i - \bar y \right)^2 \]
Note that the sample variance of the response variable \(Y\) is given by \(s_Y^2 = SST / (n-1)\). As such, \(SST\) is, in a sense, a measure of the variability of the response variable.
It can be shown that \(R^2\) is always between 0 and 1. Furthermore, we can show that \(R^2\) measures the proportion of variability in the response variable \(Y\) that the fitted model is able to explain using the predictor \(X\).
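As a quick check, we could compute \(R^2\) by hand for the toy example above and compare it to the value reported by lm.

y_hat <- b0 + b1 * x                       # fitted values for the toy data
SSE <- sum((y - y_hat)^2)                  # sum of squared residuals
SST <- sum((y - mean(y))^2)                # total sum of squares
1 - SSE / SST                              # R-squared, computed by hand

summary(lm(y ~ x))$r.squared               # should agree with the value above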
In the following example, we will be interested in studying the relationship between the price and mileage of used cars.
Assume that we gather a sample of 20 recently sold used 2016 Ford Fictus automobiles. For each vehicle, we record the sales price (in thousands of dollars) and mileage (in thousands of miles) of the vehicle. In the code chunk below, we create vectors mileage and price to store these values.
Before constructing our model, let’s calculate the sample mean and standard deviation for both the mileage and price of the automobiles in the sample.
xbar <- mean(mileage)
sx <- sd(mileage)
x_stats <- c(xbar, sx)
names(x_stats) <- c('mean', 'stdev')
ybar <- mean(price)
sy <- sd(price)
y_stats <- c(ybar, sy)
names(y_stats) <- c('mean', 'stdev')
rbind(x_stats, y_stats)

##            mean    stdev
## x_stats  51.605 33.99704
## y_stats  28.085 16.17273
Notice that the standard deviation of the price is very large with respect to its mean. This indicates that if we were to use only the sample mean price as an estimate for the price of one of these automobiles, there would be a lot of uncertainty in our estimate.
To study the relationship between price and mileage of this automobile model, we might create a scatterplot from the paired observations in our sample.
plot(price ~ mileage, pch=21, bg='orange', col='black', cex=1.5,
xlab = 'Mileage (in 1000s of Miles)', ylab = 'Price (in 1000s of Dollars)',
main = 'Relationship between Mileage and Price')

We see from this plot that the cars with greater mileage tend to have a lower sales price (as you might expect). In fact, it appears that the relationship between the price and mileage of the vehicles might be roughly linear.
We will use the lm function in R to find a linear model that attempts to capture the relationship between price and mileage. We will store the resulting model in a variable called model and will then use the summary function to get some information about the model.
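model <- lm(price ~ mileage)
summary(model)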
##
## Call:
## lm(formula = price ~ mileage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.329 -4.961 -1.335 5.862 10.259
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.54834 2.77369 18.224 4.76e-13 ***
## mileage -0.43529 0.04523 -9.625 1.60e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.702 on 18 degrees of freedom
## Multiple R-squared: 0.8373, Adjusted R-squared: 0.8283
## F-statistic: 92.63 on 1 and 18 DF, p-value: 1.604e-08
There is a lot of information in this summary. We will eventually learn how to interpret all of the information presented here. For now, let's focus on one piece of information: the coefficient estimates.
The coefficient estimates are shown in the summary above, but we can access them directly as follows:
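model$coefficients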
## (Intercept) mileage
## 50.5483433 -0.4352939
These are the coefficients that determine the slope and intercept of our linear model. This tells us that our fitted model has the following form:
\[\hat{\mathrm{Price}} = 50.5483 - 0.4353 \cdot \mathrm{Mileage}\]
This relationship between price and mileage can also be represented as follows:
\[\mathrm{Price} = 50.5483 - 0.4353 \cdot \mathrm{Mileage} + \hat\varepsilon\]
The summary of the model also reports the value of \(R^2\) under the listing “Multiple R-squared”. It can also be accessed directly as follows:
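summary(model)$r.squared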
## [1] 0.8372993
This means that our model is able to explain roughly 83.7% of the variability in the price of a used 2016 Ford Fictus in terms of the mileage of the vehicle.
Let’s add our regression line to the scatter plot of price and mileage.
plot(price ~ mileage, pch=21, bg='orange', col='black', cex=1.5,
xlab = 'Mileage (in 1000s of Miles)', ylab = 'Price (in 1000s of Dollars)',
main = 'Relationship between Mileage and Price')
abline(model$coefficients, col='cadetblue', lwd=2)
The fitted values for the cars in our sample are stored within the model variable.
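We can view them as follows, rounding to one decimal place for readability:

round(model$fitted.values, 1)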
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 49.2 48.8 48.2 47.5 42.1 38.2 34.5 34.4 32.1 27.8 27.3 27.3 23.0 20.8 14.7
## 16 17 18 19 20
## 14.0 12.1 8.1 7.1 4.5
To illustrate how the fitted values relate to the regression line, we will add them to our scatterplot.
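A chunk along the following lines would redraw the scatterplot, the regression line, and the fitted values; the styling options here are simply one choice.

plot(price ~ mileage, pch=21, bg='orange', col='black', cex=1.5,
     xlab = 'Mileage (in 1000s of Miles)', ylab = 'Price (in 1000s of Dollars)',
     main = 'Relationship between Mileage and Price')
abline(model$coefficients, col='cadetblue', lwd=2)
points(mileage, model$fitted.values, pch=16, col='cadetblue', cex=1.2)  # fitted values fall on the line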
Residuals reflect the uncertainty remaining in our model.
Our model object model contains the residuals for the observations in our sample.
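We store them in a vector called res and display them, again rounded to one decimal place:

res <- model$residuals
round(res, 1)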
## 1 2 3 4 5 6 7 8 9 10 11 12
## 4.5 8.0 10.3 -5.5 6.8 -5.0 -12.3 -1.8 -1.8 -8.0 -1.2 -2.4
## 13 14 15 16 17 18 19 20
## -4.9 -9.1 -1.4 9.4 1.1 5.5 7.7 0.1
Since the residuals represent the uncertainty in our predictions, we would like to get a sense as to how they are distributed. To that end, we generate a histogram of the residuals.
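One way to generate the histogram is shown below; the particular styling options are just one choice.

hist(res, col='cadetblue', xlab='Residual', main='Histogram of Residuals')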
It seems reasonable to assume that the residuals might be normally distributed with a mean of zero. Let’s calculate their sample mean and standard deviation.
res_mean <- mean(res)
res_sd <- sd(res)
res_stats <- c(res_mean, res_sd)
names(res_stats) <- c('mean', 'stdev')
round(res_stats, 4)

##   mean  stdev
## 0.0000 6.5235
Assume that we are interested in purchasing a used 2016 Ford Fictus with 45,000 miles. We would like to use our model to determine a fair price for the car. We could calculate this by plugging 45 into the equation for our model. That gives:
\[\hat{\mathrm{Price}} = 50.5483 - 0.4353 \cdot 45 \approx 30.96\]
In other words, our model predicts that such a vehicle should cost (on average) $30,960.
We can use the R function predict to calculate this predicted value.
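predict(model, data.frame(mileage=45))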
## 1
## 30.96012
We know that the predictions made by the model are not 100% accurate. We expect there to be some error in the predictions. To better understand how much the actual price of a car with 45,000 miles might vary, we will use predict to create an interval that we are 95% certain contains the true price of the car. This interval is called a prediction interval.
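predict(model, data.frame(mileage=45), interval='prediction', level=0.95)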
## fit lwr upr
## 1 30.96012 16.51791 45.40232
We can use the predict function to generate predictions for several new observations at once. Let's say that we would like to construct 95% prediction intervals for the prices of 3 vehicles: one with 40,000 miles, one with 50,000 miles, and one with 70,000 miles. We can do so as follows:
newdata = data.frame(mileage=c(40, 50, 70))
predict(model, newdata, interval='prediction', level=0.95)

##        fit       lwr      upr
## 1 33.13659 18.665948 47.60722
## 2 28.78365 14.354278 43.21302
## 3 20.07777 5.543722 34.61181