The cars dataset is part of the datasets package. It contains two columns, speed and dist, recorded in 1920. The speed column stores the car's traveling speed (mph), and the dist column stores the distance in feet the car travels after the brakes are applied. Using this data, we will construct a regression model and evaluate its quality, so that for any given speed we can estimate the distance the car travels before stopping.
library(knitr)
library(datasets)
cars.data <- cars
kable(head(cars.data, n = 10), align = 'l', caption = "Cars Data - R package")
| speed | dist |
|---|---|
| 4 | 2 |
| 4 | 10 |
| 7 | 4 |
| 7 | 22 |
| 8 | 16 |
| 9 | 10 |
| 10 | 18 |
| 10 | 26 |
| 10 | 34 |
| 11 | 17 |
summary(cars.data)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
plot(x = cars.data$speed, y = cars.data$dist, main="Cars Data - R package", xlab = "Speed(mph)", ylab = "Distance(feet)")
The dependent variable, distance (the response we want to predict), is plotted on the y-axis, and the independent variable, speed (the predictor), is plotted on the x-axis.
From the plot we can see that as speed increases, distance also increases, so we can reasonably treat distance as a function of speed.
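To put a number on the strength of this linear association (a small addition not in the original write-up), we can compute the Pearson correlation between the two columns; for the cars data it comes out around 0.81, a fairly strong positive relationship.

cor(cars.data$speed, cars.data$dist)   # roughly 0.81, strong positive correlation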
A simple linear regression model is a straight line, written as \(y = a_0 + a_1x_1\), where \(a_0\) is the y-intercept of the line, \(a_1\) is the slope, \(x_1\) is the input variable speed, and \(y\) is the output dist. Using the lm function we can fit the model.
cars.lm <- lm(cars.data$dist ~ cars.data$speed)
cars.lm
##
## Call:
## lm(formula = cars.data$dist ~ cars.data$speed)
##
## Coefficients:
## (Intercept) cars.data$speed
## -17.579 3.932
The y-intercept \(a_0\) is -17.579 and the slope \(a_1\) is 3.932.
The regression model is therefore \(y = -17.579 + 3.932x_1\), i.e. \(Distance = -17.579 + 3.932 \times Speed\).
The model says that for every one-unit increase in speed, the distance the car travels after the brakes are applied increases by 3.932 units. In this case, for every one mph increase in speed, the car travels an extra 3.932 feet before stopping.
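As a quick sanity check, the sketch below plugs a hypothetical speed of 20 mph into the fitted equation using coef(); the 20 mph value is only an illustration, not part of the original exploration.

a0 <- coef(cars.lm)[1]   # intercept, about -17.579
a1 <- coef(cars.lm)[2]   # slope, about 3.932
a0 + a1 * 20             # about -17.579 + 3.932 * 20 = 61 feet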
plot(x = cars.data$speed, y = cars.data$dist, main="Cars Data - R package", xlab = "Speed(mph)", ylab = "Distance(feet)")
abline(h=mean(cars.data$dist))
abline(cars.lm, col="red")
The plot shows the regression model superimposed on the data. The black horizontal line marks the average distance and the red line is the fitted regression model; it shows that as speed increases, the distance the car travels after the brakes are applied also increases.
Using the summary function, we can assess the quality of the resulting model.
summary(cars.lm)
##
## Call:
## lm(formula = cars.data$dist ~ cars.data$speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## cars.data$speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
residuals(cars.lm)
## 1 2 3 4 5 6
## 3.849460 11.849460 -5.947766 12.052234 2.119825 -7.812584
## 7 8 9 10 11 12
## -3.744993 4.255007 12.255007 -8.677401 2.322599 -15.609810
## 13 14 15 16 17 18
## -9.609810 -5.609810 -1.609810 -7.542219 0.457781 0.457781
## 19 20 21 22 23 24
## 12.457781 -11.474628 -1.474628 22.525372 42.525372 -21.407036
## 25 26 27 28 29 30
## -15.407036 12.592964 -13.339445 -5.339445 -17.271854 -9.271854
## 31 32 33 34 35 36
## 0.728146 -11.204263 2.795737 22.795737 30.795737 -21.136672
## 37 38 39 40 41 42
## -11.136672 10.863328 -29.069080 -13.069080 -9.069080 -5.069080
## 43 44 45 46 47 48
## 2.930920 -2.933898 -18.866307 -6.798715 15.201285 16.201285
## 49 50
## 43.201285 4.268876
A residual is the difference between an observed value and the corresponding value on the fitted regression line. A positive residual indicates the observed value is above the fitted line, and a negative residual means it is below the fitted line.
For a well-fitted model, the residuals have mean zero and are roughly normally distributed, with observed values scattered fairly evenly above and below the fitted line.
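As a small check (not in the original write-up), the residual definition can be verified directly: subtracting the fitted values from the observed distances reproduces residuals(cars.lm), and the mean of the residuals is essentially zero.

head(cars.data$dist - fitted(cars.lm))   # same as head(residuals(cars.lm))
mean(residuals(cars.lm))                 # ~ 0, up to floating point error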
summary(residuals(cars.lm))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -29.070 -9.525 -2.272 0.000 9.215 43.200
In this case the mean is zero; however, there is a large gap between the minimum and maximum values. Using the 1.5 * IQR rule we can identify outliers.
res <- residuals(cars.lm)
Q1 <- quantile(res, 0.25)
Q3 <- quantile(res, 0.75)
IQR <- Q3 - Q1
Lower.Bound <- Q1 - (1.5 * IQR)
Upper.Bound <- Q3 + (1.5 * IQR)
No observations fall below the lower bound of about -37.6, so none are flagged as outliers on the low side.
res[which(res < Lower.Bound)]
## named numeric(0)
Two observations exceed the upper bound of about 37.3 and can be considered outliers.
res[which(res > Upper.Bound)]
## 23 49
## 42.52537 43.20128
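Putting the two checks together (a small addition to the original flow), the outliers can be counted and listed in one step:

outliers <- res[res < Lower.Bound | res > Upper.Bound]
length(outliers)   # 2 observations flagged (rows 23 and 49)
outliers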
plot(cars.lm, which = 1)
The Residuals vs. Fitted plot shows how the residuals are spread around zero. In this case they are distributed fairly evenly, with no strong increasing or decreasing pattern from left to right. The red line is a smoothed trend through the residuals; it sits slightly below zero in the middle and curves upward toward the right, hinting at mild non-linearity, with the linear fit slightly underestimating distances at higher speeds.
qqnorm(residuals(cars.lm))
qqline(residuals(cars.lm))
Another way to assess the quality of the regression model is the Q-Q plot. If the residuals are normally distributed, the points hug the fitted line. In this case the upper right tail diverges from the line, indicating the distribution of the residuals is not quite normal; the two outliers identified above are visible in the top right corner of the plot.
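As a complementary check (not part of the original analysis), a Shapiro-Wilk test on the residuals can back up the visual impression from the Q-Q plot; a small p-value would indicate a departure from normality.

shapiro.test(residuals(cars.lm))   # small p-value suggests non-normal residuals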
summary(cars.lm)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.579095 6.7584402 -2.601058 1.231882e-02
## cars.data$speed 3.932409 0.4155128 9.463990 1.489836e-12
The Estimate column gives the y-intercept, -17.579, and the slope of the fitted regression line, 3.932. The Std. Error measures the uncertainty of each estimate; the speed coefficient has a standard error of 0.4155, meaning the estimate would typically vary by about that much across repeated samples.
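A rough way to turn a standard error into a range (a small addition, not in the original write-up) is a confidence interval of roughly the estimate plus or minus two standard errors; confint() computes this precisely.

confint(cars.lm)   # 95% confidence intervals, roughly estimate +/- 2 * std. error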
\(Pr(>|t|)\) is the p-value computed from the t value: the probability of observing a t value at least this extreme if the variable had no real relationship with the response. A high p-value suggests the variable is not relevant to the model; a low p-value suggests it is. In this case the p-value for speed, about 1.49e-12, is extremely small, so speed is highly significant to the model.
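For reference (a small addition), the t value is simply the estimate divided by its standard error, which can be checked directly from the coefficient table:

coefs <- summary(cars.lm)$coefficients
coefs[, "Estimate"] / coefs[, "Std. Error"]   # reproduces the "t value" column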
Multiple R-squared of 0.651 means the regression model explains about 65.1% of the variation in stopping distance; the remaining 34.9% is due to factors the model does not capture.
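To make the R-squared definition concrete (a small addition), it can be recomputed from the residual and total sums of squares:

sse <- sum(residuals(cars.lm)^2)                        # residual sum of squares
sst <- sum((cars.data$dist - mean(cars.data$dist))^2)   # total sum of squares
1 - sse / sst                                           # about 0.6511, matches Multiple R-squared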