Instruction

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

Steps to follow:

  1. Visualize the Data
  2. The Linear Model Function
  3. Evaluating the Quality of the Model
  4. Residual Analysis

The data give the speed of cars and the distances taken to stop. Note that the data were recorded in the 1920s. The dataset is predefined in R, so we first load it and look at the available observations and attributes. The dataset consists of 50 rows and 2 columns: speed and dist.

[,1] speed numeric Speed (mph)
[,2] dist numeric Stopping distance (ft)

(Running ?cars gives a description of the dataset.)

dim(cars)
## [1] 50  2

The first six rows of the data are shown below.

head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

A summary of the data is shown below.

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Check whether there are any missing values in the dataset:

# list rows of data that have missing values
cars[!complete.cases(cars),]
## [1] speed dist 
## <0 rows> (or 0-length row.names)

No missing values in the dataset.

Visualize the Data

For paired numeric data, a scatter plot is the best way to show the relationship. The strength of the linear relationship between two variables/columns is called their correlation. If one variable tends to decrease as the other increases, the variables have a negative correlation; a perfect positive correlation has the value 1.

Let’s plot a scatter plot of dist against speed.

library(ggplot2)
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(color = 'blue') +
  ggtitle('Speed vs stopping distance') +
  xlab('speed (mph)') +
  ylab('stopping distance (ft)') +
  theme(plot.title = element_text(hjust = 0.5))

From this initial view, there appears to be a positive linear relationship between the speed of the car and the stopping distance: as speed increases, the stopping distance also tends to increase.

cor(cars$speed, cars$dist)
## [1] 0.8068949

The correlation is positive and fairly strong, about 0.81.

The Linear Model Function

We define a linear model relating stopping distance to speed using the lm() function, and then evaluate the model.

Linear regression is a way to model the relationship between two variables. The equation has the form \(Y = c + mX\), where \(Y\) is the dependent variable, \(X\) is the independent variable, \(m\) is the slope of the line, and \(c\) is the y-intercept.

fit <- lm(dist ~ speed, data = cars)
intercept <- coef(fit)[1]
intercept
## (Intercept) 
##   -17.57909
slope <- coef(fit)[2]
slope
##    speed 
## 3.932409

The model above shows that the y-intercept is −17.5791 and the slope is 3.9324. The fitted equation can therefore be written as
\[ \text{Stopping Distance} = -17.58 + 3.93 \times \text{speed} \]
The plot below shows the original data along with the fitted line.

ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x) +
  ggtitle('Speed vs stopping distance') +
  xlab('speed (mph)') +
  ylab('stopping distance (ft)') +
  theme(plot.title = element_text(hjust = 0.5))
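
As a quick sanity check on this equation (not part of the textbook steps), we can ask the model for a predicted stopping distance at an arbitrary speed, say 20 mph, and compare it with the value worked out by hand from the coefficients above.

# predicted stopping distance at 20 mph (an illustrative speed, chosen arbitrarily)
predict(fit, newdata = data.frame(speed = 20))
# by hand: -17.5791 + 3.9324 * 20 is roughly 61.1 ft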

Evaluating the Quality of the Model

Fitting the regression model does not, by itself, tell us anything about the model’s quality. The summary() function extracts additional information that we can use to determine how well the data fit the resulting model.

# summary of a linear model
summary(fit)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Let’s examine each of the items presented in this summary.

Residuals: The residuals are the differences between the actual measured values and the corresponding values on the fitted regression line. If the line is a good fit with the data, we would expect residual values that are normally distributed around a mean of zero. This we will see in the residual analysis.
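
As a small illustration of this definition, the residuals can be recomputed by hand as the observed distances minus the fitted values (manual_resid is just an illustrative name); they should match what resid() returns, and their mean should be essentially zero.

# residuals computed manually from the fitted values
manual_resid <- cars$dist - fitted(fit)
all.equal(as.numeric(manual_resid), as.numeric(resid(fit)))
# sample mean of the residuals (essentially zero for a least-squares fit)
mean(resid(fit))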

Std. Error: The Std. Error column shows the statistical standard error for each coefficient. For a good model, we typically want to see a standard error that is at least five to ten times smaller than the corresponding coefficient. Here the standard error of the speed coefficient is about 9.5 times smaller than the estimate, which is good.
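
This ratio can be verified directly from the coefficient table returned by summary(); the ratio of each estimate to its standard error is exactly the reported t value (est and se are just illustrative names).

# ratio of each coefficient estimate to its standard error
est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]
est / se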

Residual standard error: The residual standard error is a measure of the total variation in the residual values; here it is about 15.38 ft, in the units of the response.

Degrees of freedom: The number of degrees of freedom is the total number of measurements or observations used to generate the model, minus the number of coefficients in the model. Here: 50 − 2 = 48 degrees of freedom.
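
As a check, both the residual standard error and the degrees of freedom can be recomputed from the fitted object using their standard definitions (df_resid is just an illustrative name).

# residual degrees of freedom: observations minus estimated coefficients
n <- nrow(cars)
df_resid <- n - length(coef(fit))   # 50 - 2 = 48
df.residual(fit)                    # same value stored in the model object
# residual standard error: should match the 15.38 reported by summary()
sqrt(sum(resid(fit)^2) / df_resid)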

Multiple R-squared: This is a statistical measure of how well the model describes the measured data. In general, values of \(R^2\) closer to 1 indicate a better-fitting model. However, a good model does not necessarily require a large \(R^2\) value; it may still accurately predict future observations even with a small \(R^2\). This model explains about 65% of the variation in stopping distance.

Adjusted R-squared: The adjusted \(R^2\) corrects for the number of coefficients in the model and is always smaller than the \(R^2\) value; higher values indicate a better-fitting model.
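
Both quantities can be recomputed from their definitions as a check: \(R^2\) is one minus the ratio of residual variation to total variation, and the adjusted version penalizes the number of estimated coefficients (rss, tss, r2, n and p are illustrative names).

# R-squared from its definition
rss <- sum(resid(fit)^2)
tss <- sum((cars$dist - mean(cars$dist))^2)
r2  <- 1 - rss / tss
r2                                  # should match Multiple R-squared (~0.65)
# adjusted R-squared penalizes the number of coefficients
n <- nrow(cars)
p <- length(coef(fit))
1 - (1 - r2) * (n - 1) / (n - p)
summary(fit)$adj.r.squared          # value reported by summary()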

F-statistic: A low p-value on the F-statistic indicates a better model. Here the p-value is very small, indicating that the overall relationship between speed and stopping distance is statistically significant.

Residual Analysis

In this section we examine the distribution of the residuals using a histogram and a Q-Q plot.

residuals <- residuals(fit)
hist(residuals, col = "steelblue")

The histogram above shows that the residuals are centered roughly around zero but are right-skewed rather than perfectly normally distributed.

qqnorm(resid(fit))
qqline(resid(fit))

If the residuals were normally distributed, we would expect the points plotted in this figure to follow a straight line. However, the two ends diverge noticeably from that line, which indicates that the residuals are not normally distributed.
This check further confirms that using only speed as a predictor in the model is insufficient to fully explain the data.
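
As an optional extra check, not part of the textbook analysis, a formal normality test such as the Shapiro-Wilk test can be applied to the residuals; a small p-value would agree with the Q-Q plot in suggesting that the residuals are not normally distributed.

# Shapiro-Wilk test of normality on the residuals
shapiro.test(resid(fit))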

Future work

The model could be improved; adding more input factors, or a more flexible functional form, might help it explain the data better.
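
As one possible sketch of such an improvement (not part of the assignment), physics suggests that stopping distance grows roughly with the square of speed, so a quadratic term in speed could be compared against the straight-line model (fit2 is just an illustrative name).

# candidate refinement: add a quadratic term in speed
fit2 <- lm(dist ~ speed + I(speed^2), data = cars)
summary(fit2)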