The cars dataset is part of R's datasets package. It contains two columns, speed and dist, and the data were recorded in the 1920s. The speed column stores the speed of the car in miles per hour, and the dist column stores the distance in feet that the car travels after the brakes are applied. Using this data, we will construct a regression model and evaluate its quality, so that for a given speed we can estimate the distance the car travels before stopping.

Visualize Data

library(knitr)
library(datasets)

cars.data <- cars

kable(head(cars[1:2], n=10), align='l', caption = "Cars Data - R package")

Cars Data - R package

speed  dist
4      2
4      10
7      4
7      22
8      16
9      10
10     18
10     26
10     34
11     17

summary(cars.data)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
plot(x = cars.data$speed, y = cars.data$dist, main="Cars Data - R package", xlab = "Speed(mph)", ylab = "Distance(feet)")

The dependent variable, distance (also known as the response or output), is plotted on the y-axis, and the independent variable, speed (also known as the predictor or input), is plotted on the x-axis.

The plot shows that distance tends to increase as speed increases, so it is reasonable to model distance as a function of speed.
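
One way to put a number on this visual impression is the Pearson correlation between the two columns. The sketch below works on the built-in cars data frame directly; the variable name r is only illustrative.

```r
# Pearson correlation between speed and stopping distance
r <- cor(cars$speed, cars$dist)
r      # values close to 1 indicate a strong positive linear association

# For a model with a single predictor, the squared correlation
# equals the Multiple R-squared reported by summary(lm(...))
r^2
```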

The Linear Model Function

A simple linear regression model is a straight line, written as \(y = a_0 + a_1x_1\), where \(a_0\) is the y-intercept of the line, \(a_1\) is the slope, \(x_1\) is the input variable speed, and \(y\) is the output dist. The lm function fits this model.

cars.lm <- lm(cars.data$dist ~ cars.data$speed)
cars.lm
## 
## Call:
## lm(formula = cars.data$dist ~ cars.data$speed)
## 
## Coefficients:
##     (Intercept)  cars.data$speed  
##         -17.579            3.932

The y-intercept \(a_0\) is -17.579 and the slope \(a_1\) is 3.932.

The regression model is \(y = -17.579 + 3.932x_1\), that is, \(\text{dist} = -17.579 + 3.932 \times \text{speed}\).
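
For a single predictor, the least-squares coefficients can also be computed by hand, which shows where the numbers come from: \(a_1 = \operatorname{cov}(x, y) / \operatorname{var}(x)\) and \(a_0 = \bar{y} - a_1\bar{x}\). A minimal sketch, with x, y, a0 and a1 as illustrative names:

```r
x <- cars$speed
y <- cars$dist

# Slope: covariance of x and y divided by the variance of x
a1 <- cov(x, y) / var(x)

# Intercept: the fitted line passes through the point (mean(x), mean(y))
a0 <- mean(y) - a1 * mean(x)

c(intercept = a0, slope = a1)
```

The two values should agree with the coefficients that lm reports.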

The model says that for every one-unit increase in speed, the distance the car travels before stopping increases by 3.932 units. In this case, each additional 1 mph of speed adds about 3.932 feet to the distance travelled after the brakes are applied.
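
To estimate the stopping distance for a given speed, predict() can be applied to the fitted model. One caveat: predict() matches new data by column name, so this sketch refits the same model with the data argument instead of the cars.data$... form. The names fit, new.speeds, pred and by.hand are illustrative, and the speeds are arbitrary examples.

```r
# Refit with the formula/data interface so predict() can match column names
fit <- lm(dist ~ speed, data = cars)

# Example speeds (mph) for which to estimate the stopping distance
new.speeds <- data.frame(speed = c(10, 20))

pred <- predict(fit, newdata = new.speeds)
pred

# The same estimates computed directly from the coefficients
by.hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["speed"]] * new.speeds$speed
by.hand
```

Because the intercept is negative, the line gives negative distances at very low speeds, so estimates are only meaningful within or near the observed range of 4 to 25 mph.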

plot(x = cars.data$speed, y = cars.data$dist, main="Cars Data - R package", xlab = "Speed(mph)", ylab = "Distance(feet)")
abline(h=mean(cars.data$dist))
abline(cars.lm, col="red")

The plot shows the regression model superimposed on the data. The black horizontal line marks the mean distance, and the red line is the fitted regression line. The red line shows that the distance the car travels after the brakes are applied increases with speed.

Evaluating the Quality of the Model

The summary function reports the statistics needed to judge the quality of the fitted model.

summary(cars.lm)
## 
## Call:
## lm(formula = cars.data$dist ~ cars.data$speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -17.5791     6.7584  -2.601   0.0123 *  
## cars.data$speed   3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Residual Analysis

residuals(cars.lm)
##          1          2          3          4          5          6 
##   3.849460  11.849460  -5.947766  12.052234   2.119825  -7.812584 
##          7          8          9         10         11         12 
##  -3.744993   4.255007  12.255007  -8.677401   2.322599 -15.609810 
##         13         14         15         16         17         18 
##  -9.609810  -5.609810  -1.609810  -7.542219   0.457781   0.457781 
##         19         20         21         22         23         24 
##  12.457781 -11.474628  -1.474628  22.525372  42.525372 -21.407036 
##         25         26         27         28         29         30 
## -15.407036  12.592964 -13.339445  -5.339445 -17.271854  -9.271854 
##         31         32         33         34         35         36 
##   0.728146 -11.204263   2.795737  22.795737  30.795737 -21.136672 
##         37         38         39         40         41         42 
## -11.136672  10.863328 -29.069080 -13.069080  -9.069080  -5.069080 
##         43         44         45         46         47         48 
##   2.930920  -2.933898 -18.866307  -6.798715  15.201285  16.201285 
##         49         50 
##  43.201285   4.268876

A residual is the difference between an observed value and the corresponding value on the fitted regression line. A positive residual means the observed value lies above the fitted line, and a negative residual means it lies below the fitted line.
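
The definition can be checked directly by subtracting the fitted values from the observed distances. This is a small sketch; fit and manual.res are illustrative names, and the model is refit here so the block runs on its own.

```r
fit <- lm(dist ~ speed, data = cars)

# Residual = observed value minus fitted value on the regression line
manual.res <- cars$dist - fitted(fit)

# Should agree with what residuals() returns
all.equal(manual.res, residuals(fit))

# Number of points below (-1) and above (1) the fitted line
table(sign(residuals(fit)))
```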

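For a least-squares model that includes an intercept, the mean of the residuals is always zero by construction, so a zero mean does not by itself show that the fit is good or that the residuals are normally distributed. What matters is how the residuals are spread: in a well-fitted model they scatter evenly above and below the fitted line with no visible pattern.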
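
The zero mean is a property of the least-squares method itself: when the model has an intercept, the residuals sum to zero and are uncorrelated with the predictor, whatever their distribution. A quick numerical check, using the illustrative names fit and res:

```r
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

# Both sums are zero up to floating-point rounding error
sum(res)               # residuals sum to zero
sum(res * cars$speed)  # residuals are orthogonal to the predictor
```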

summary(residuals(cars.lm))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -29.070  -9.525  -2.272   0.000   9.215  43.200

In this case the mean is zero, but the residuals are not symmetric: the minimum is -29.07 while the maximum is 43.20. The 1.5 × IQR rule can be used to flag outliers.

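# Quartiles of the residuals, rounded from the summary above
Q1 <- -9.53
Q3 <- 9.21
iqr <- Q3 - Q1   # named iqr so it does not mask stats::IQR()

# 1.5 * IQR fences
Lower.Bound <- Q1 - (1.5 * iqr)
Upper.Bound <- Q3 + (1.5 * iqr)

res <- residuals(cars.lm)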

No residuals fall below the lower bound of -37.64, so there are no outliers on the low side.

res[which(res < Lower.Bound)]
## named numeric(0)

Two residuals exceed the upper bound of 37.32 and can be considered outliers.

res[which(res > Upper.Bound)]
##       23       49 
## 42.52537 43.20128
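
The quartiles above were typed in by hand from rounded summary values. As a sketch of an alternative, the fences can be derived from the residuals themselves with quantile(); the names fit, q, iqr, lower, upper and outliers are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

# First and third quartiles of the residuals
q <- quantile(res, probs = c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# 1.5 * IQR fences
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

# Residuals outside the fences, labelled by row number
outliers <- res[res < lower | res > upper]
outliers
```

The bounds differ slightly from the hand-typed ones because the quartiles are not rounded first, but the same two observations are flagged.
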
plot(cars.lm, which = 1)

The Residuals vs. Fitted plot shows how the residuals are spread around zero across the range of fitted values. For a well-specified linear model, the points scatter evenly around the horizontal zero line with no pattern from left to right, and the red smoothed line stays close to zero. Here more residuals fall below zero than above it (the median residual is -2.272), balanced by a few large positive residuals. Any bend in the red line suggests that a straight line does not capture the relationship completely; it does not describe how distance changes with speed, which is already accounted for by the fitted slope.

qqnorm(residuals(cars.lm))
qqline(residuals(cars.lm))

Another way to check the quality of the regression model is a Q-Q plot. If the residuals are normally distributed, the points lie close to the reference line. In this case the upper-right tail diverges from the line, suggesting that the residuals are not quite normally distributed. The two outliers are visible in the top-right corner of the plot.
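
A formal test can complement the visual check. The sketch below applies the Shapiro-Wilk test from base R's stats package to the residuals; using this particular test is an assumption of the example, not something the analysis above relies on, and fit and sw are illustrative names.

```r
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(fit))

sw$statistic  # W, close to 1 for normal-looking data
sw$p.value    # a small p-value is evidence against normality
```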

summary(cars.lm)$coefficients
##                   Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)     -17.579095  6.7584402 -2.601058 1.231882e-02
## cars.data$speed   3.932409  0.4155128  9.463990 1.489836e-12

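The estimate -17.579 is the y-intercept and 3.932 is the slope of the fitted regression line. Std. Error measures the uncertainty of each estimate: the smaller it is relative to the estimate, the more precisely the coefficient is determined. The t value is the estimate divided by its standard error, so the slope estimate for speed is 3.932 / 0.4155 = 9.464 standard errors away from zero.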
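
The relationship between these columns can be verified from the coefficient table: the t value is the estimate divided by its standard error. The sketch also calls confint(), which turns the standard errors into 95% confidence intervals. The names fit, coefs and t.by.hand are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)
coefs <- summary(fit)$coefficients

# t value = estimate / standard error
t.by.hand <- coefs[, "Estimate"] / coefs[, "Std. Error"]
cbind(t.by.hand, reported = coefs[, "t value"])

# 95% confidence intervals for the intercept and the slope
confint(fit)
```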

\(Pr(>|t|)\) is the p-value computed from the t value. It is the probability of seeing a t value at least this extreme if the true coefficient were zero, that is, if the variable had no linear relationship with the response. A large p-value means the data are consistent with the variable being irrelevant to the model. In this case the p-value for speed, 0.0000000000015 (1.49e-12), is very small, which is strong evidence that speed is relevant to the model.
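
The p-value can be reproduced from the t value and the residual degrees of freedom using the t distribution. A minimal sketch, with fit, coefs, t.speed, df and p.by.hand as illustrative names:

```r
fit <- lm(dist ~ speed, data = cars)
coefs <- summary(fit)$coefficients

t.speed <- coefs["speed", "t value"]
df <- df.residual(fit)  # 50 observations minus 2 estimated coefficients

# Two-sided p-value: probability of |t| at least this large
# if the true slope were zero
p.by.hand <- 2 * pt(abs(t.speed), df = df, lower.tail = FALSE)
p.by.hand
```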

Multiple R-squared: 0.651 means that the regression model explains 65.1% of the variation in stopping distance. It does not mean that the model predicts correctly 65.1% of the time; the remaining 34.9% of the variation is not accounted for by speed alone.
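
R-squared compares the variation left over after fitting the line with the total variation of dist around its mean, \(R^2 = 1 - SSE / SST\). The sketch below computes it by hand, along with the residual standard error from the summary output; fit, sst, sse, r.squared and rse are illustrative names.

```r
fit <- lm(dist ~ speed, data = cars)

# Total variation of dist around its mean
sst <- sum((cars$dist - mean(cars$dist))^2)

# Variation left over after fitting the regression line
sse <- sum(residuals(fit)^2)

r.squared <- 1 - sse / sst
r.squared

# Residual standard error: typical size of a residual, in feet
rse <- sqrt(sse / df.residual(fit))
rse
```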
