ASSIGNMENT 11

Problem

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

Answer

Data Exploration

help(cars)

# Summary of the data
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
# Number of Columns of the data
paste("Number of Columns of the data:",ncol(cars))
## [1] "Number of Columns of the data: 2"
# Number of Rows of the data
paste("Number of Rows of the data:",nrow(cars))
## [1] "Number of Rows of the data: 50"
# Columns of the data
paste("Column Names of the data:",colnames(cars))
## [1] "Column Names of the data: speed" "Column Names of the data: dist"
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
str(cars)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
# Sample of the data
dplyr::glimpse(cars)
## Observations: 50
## Variables: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13,...
## $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26,...
# Entire dataset
DT::datatable(cars)

This data comes from measurements of car speeds and stopping distances recorded in the 1920s. It contains 50 observations of only two variables: the speed of the car and the distance it took to stop, with stopping distance measured in feet (ft).

  • speed numeric Speed (mph)
  • dist numeric Stopping distance (ft)

Checking missing values in the dataset

sapply(cars, function(y) sum(is.na(y))) / nrow(cars) * 100
## speed  dist 
##     0     0

Data does not have missing values.
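
As an additional quick check, base R's anyNA() and colSums() give the same answer more directly; a minimal sketch:

anyNA(cars)            # TRUE if any value anywhere in the data is NA
colSums(is.na(cars))   # NA count per column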

Data Visualization

ggplot(data = cars, aes(x = speed)) + 
  geom_histogram(aes(fill = ..count..)) +
  scale_fill_gradient("Count", low = "green", high = "red") +
  labs(title = "Histogram - Speed") +
  labs(x = "speed") +
  labs(y = "Count")

ggplot(data = cars, aes(x = dist)) + 
  geom_histogram(aes(fill = ..count..)) +
  scale_fill_gradient("Count", low = "green", high = "red") +
  labs(title = "Histogram - Distance") +
  labs(x = "distance") +
  labs(y = "Count")

ggplot(cars, aes(x=speed, y=dist)) +
  geom_point(size=2, shape=23)

ggplot(cars, aes(speed, dist)) + 
  geom_point(size = 2, alpha = .4) +
  geom_smooth(method = "lm", se = FALSE, alpha = .2) +
  labs(title = "Speed vs Stopping Distance", 
       x = "Speed (mph)", 
       y = "Stopping distance (ft)") 

Correlation

cor(cars$dist,cars$speed)
## [1] 0.8068949

A correlation of about 0.81 indicates a strong, positive (uphill) linear relationship.
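
To confirm that this correlation is statistically distinguishable from zero, a formal test such as cor.test() could be run (a minimal sketch; the test output is not reproduced here):

# Pearson correlation test between stopping distance and speed
cor.test(cars$dist, cars$speed)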

#Plot the spread
plot(x = cars$speed, y = cars$dist, main="Cars Data", xlab = "Speed(mph)", ylab = "Distance(feet)")

We can see that as speed increases, stopping distance also increases, so it is reasonable to model distance as a function of speed.

Density plot

par(mfrow=c(1, 2))  
plot(density(cars$speed), main="Density Plot: Speed", ylab="Density") # The distribution of speed looks roughly normal.
plot(density(cars$dist), main="Density Plot: Distance", ylab="Density") # The distribution of distance is skewed to the right.

Checking outliers

par(mfrow=c(1,2))
boxplot(cars$speed)
boxplot(cars$dist)

The variable “dist” does have one outlier (row 49, value 120 ft), but nothing suggests the measurement is wrong or unusual, so I keep it.
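
The flagged value can be confirmed programmatically with boxplot.stats(), which applies the same 1.5 * IQR rule the boxplot uses; a minimal sketch:

# Stopping-distance values flagged as outliers by the boxplot rule
boxplot.stats(cars$dist)$out
# Row indices of those values
which(cars$dist %in% boxplot.stats(cars$dist)$out)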

The model summary (shown in the next section) indicates the following:

  • The residual distribution is approximately normal, which is a positive indication that the model is valid.
  • The coefficients indicate a relationship between speed and distance. With a p-value well below 0.01, this relationship is highly significant.
  • The intercept is not physically meaningful (a stopping distance cannot be negative), but this is OK. It simply means that we cannot extend the predictive ability of the regression far past the upper and lower bounds of the independent variable.
  • The R-squared of ~65% indicates a strong, positive relationship modeled by this regression.

Linear Regression Model

linear_model <- lm(cars$dist ~ cars$speed)
summary(linear_model)
## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Summary: The R-squared for the model is 0.6511, so the model explains around 65% of the variation in stopping distance through speed. Also, the standard errors are small relative to the coefficient estimates (the slope is roughly nine to ten times its standard error), which is a good sign for the model.
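
Confidence intervals for the coefficients give a complementary view of that uncertainty (a minimal sketch; output not reproduced here):

# 95% confidence intervals for the intercept and the speed coefficient
confint(linear_model, level = 0.95)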

plot(cars$speed, cars$dist, xlab = "Speed (mph)", ylab = "Distance (feet)", main = "Speed vs Stopping Distance", col = "blue")
abline(linear_model)
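
Base R also provides a standard set of diagnostic plots for an lm fit (residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage), which complement the residual analysis below; a minimal sketch:

# Standard lm diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(linear_model)
par(mfrow = c(1, 1))  # reset the plotting layout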

Residuals

residual <- residuals(linear_model)
summary(residual)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -29.069  -9.525  -2.272   0.000   9.215  43.201

The mean of the residuals is zero, which is what we expect from a least-squares fit.
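
This is expected: for a least-squares fit that includes an intercept, the residuals sum (and therefore average) to zero up to floating-point error. A quick check:

sum(residual)    # should be essentially zero
mean(residual)   # should be essentially zero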

plot(linear_model$residuals ~ cars$speed, xlab='Speed (mph)', ylab='Residuals', main="Speed vs Linear Model Residuals", col = "red")
abline(0,0, col="yellow4") #abline(h=0, lty=3)

The Q-Q plot below compares the sample quantiles of the residuals with the theoretical quantiles of a normal distribution.

qqnorm(linear_model$residuals)
qqline(linear_model$residuals)

The residual plot shows roughly constant variability and no obvious pattern. The Q-Q plot also looks good: the points in the middle follow the reference line closely, indicating an approximately normal spread, while a few points at either tail drift slightly off the line.
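
A formal normality check such as the Shapiro-Wilk test could back up this visual impression (a minimal sketch; a small p-value would indicate a departure from normality):

# Shapiro-Wilk normality test on the model residuals
shapiro.test(residuals(linear_model))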

# augment() comes from the broom package; %>% comes from dplyr/magrittr
library(broom)
library(dplyr)
augment(linear_model) %>%
  ggplot(aes(x = .std.resid)) + 
  geom_histogram(aes(y = ..density..), bins = 10, colour = "black") + 
  geom_density(alpha = .2, fill = "blue") + 
  ggtitle('Histogram of Residuals')

The histogram looks nearly normal, with a roughly symmetric, bell-shaped curve.

Model Interpretation

From the model, the stopping distance can be expressed as: distance = -17.579 + 3.932 * speed (a worked prediction with these coefficients is sketched after the list below).

This implies that:

  1. Each 1 mph increase in speed is associated with an increase of about 3.93 ft in stopping distance.
  2. Speed is clearly relevant in this model: its p-value is essentially zero (1.49e-12), while the intercept’s p-value is about 1.2 percent.
  3. The model produced a Multiple R-squared of 0.6511, implying that about 65% of the variation in stopping distance is accounted for by the least-squares line.
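
To make the equation concrete, the fitted coefficients can be used directly for a point prediction (a minimal sketch; the 21 mph input is just an arbitrary example speed):

b <- coef(linear_model)     # b[1] = intercept, b[2] = slope for speed
unname(b[1] + b[2] * 21)    # about -17.58 + 3.93 * 21, roughly 65 ft
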
ggplot(data = cars, aes(x=speed, y=linear_model$residuals)) + 
  geom_point(size = 2, alpha = .3) + 
  geom_abline(intercept = 0, slope = 0, color = "blue") +
  theme(panel.grid.major = element_line(color = "green")) +
  labs(title = "Car speed vs Model Residuals", 
       x = "Car Speed (mph)", 
       y = "Model Residuals") 

The residuals appear approximately normally distributed, though some heavier tails can be observed. Reviewing further using a histogram:

hist(linear_model$residuals, main="Histogram of Linear model Residuals", xlab="Residuals")

The histogram suggests an approximately normal distribution of the residuals.

Testing further using inference.

Inference:

\(H_0\): There is no relationship between speed and stopping distance.
\(H_A\): There is a positive relationship (correlation) between speed and stopping distance.

As already noted: distance = -17.579 + 3.932 * speed

Here, we can reject \(H_0\) in favor of \(H_A\): the p-value for the slope is essentially zero, so the model describes a significant positive relationship. Each 1 mph increase in speed is associated with an increase of about 3.93 ft in stopping distance.
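
The numbers behind this decision can be pulled directly from the coefficient table of the model summary (a minimal sketch):

# Estimate, standard error, t value and p-value for the speed coefficient
coef(summary(linear_model))[2, ]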

Conclusion

A car’s speed appears to be a fairly good predictor of stopping distance, which makes intuitive sense. The correlation between stopping distance and speed is positive and the relationship is linear. The residuals are approximately normally distributed. There may be a few outliers, but they are not far from the rest of the data points. Overall, the linear model appears to be a good fit for the data.