Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa

Loading cars dataset and getting a glimpse of data in it.

glimpse(cars)
## Observations: 50
## Variables: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 1…
## $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 3…
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
dim(cars)
## [1] 50  2

Cars dataset has 50 observations with 2 variables (dist & speed). Now we will explore the relationship between these 2 variables.

In the next step, we will use ggpairs() function from the GGally package that allows to build a scatterplot matrix. It draws scatterplots of each pair of numeric variable on the left part of the figure. The right part has Pearson correlation and the diagonal contains variable distribution.

ggpairs(data = cars, title = "Cars Data")

Lets see correlation matrix plot using ggcorr(). The ggcorr() function visualizes the correlation of each pair of variable as a square. As shown below it shows strong correlation.

ggcorr(data = cars, label = TRUE)

The correlation coefficients provide information about the relationship between 2 variables; the closer it’s value to 1, the stronger the relationship is. In this case the correlation between both variables (speed and dist ) is quiet strong. The scatter plots shown above visualize the relationships between the two.

Lets form the hypothesis.

\(H_0\) : Speed and Dist are not related to each other. \(H_A\) : Speed and Dist have some relationship.

m_dist_spd <- lm(dist ~ speed, data = cars)

# model summary
summary(m_dist_spd)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The intercept here is -17.5791 which is expected car distance having speed as 0. It’s obvious we cant think of a car having -ve distance traveled. The slope shows that for each additional increase of speed, the distance travelled by car increases by a factor of 3.9324. Std. Error shows the average variation of the estimated coefficients from the actual average of response variable.

A p-value is used in hypothesis testing to support or reject the null hypothesis. The smaller the p-value, the stronger the evidence that one should reject the null hypothesis. In our case it is less than 0.05 so we reject Null hypothesis described above.

The R2 value is a measure for how close our data is to the linear regression model. In this case the R2 value is 0.6511 which means our data fits the linear regression model. Simialrly higher F-statistic value indicates close linear relationship.

Below is the plot between speed and dist relationship.

# draw model
m_dist_spd %>% 
  ggplot(aes(speed, dist)) +
  geom_point() +
  geom_smooth(method = lm, se = F)
## `geom_smooth()` using formula 'y ~ x'

Residuals describes about how well our model fits our data. The residuals should have a pretty symmetrical distribution around zero. Lets visualize residuals in our case.

# residuals
m_dist_spd %>% 
  ggplot(aes(fitted(m_dist_spd), resid(m_dist_spd))) +
  geom_point() +
  geom_smooth(method = lm, se =F) +
  labs(title = "Residual Analysis",
       x = "Fitted Line", y = "Residuals") +
  theme_minimal()
## `geom_smooth()` using formula 'y ~ x'

Below is the histogram plot for residuals.

# residuals histogram
m_dist_spd %>% 
  ggplot(aes(m_dist_spd$residuals)) + 
  geom_histogram(binwidth = 1)

Our residuals look pretty symmetrical around 0, suggesting that our model fits the data well. The histogram plot shows a little left skewness.

Fianlly lets draw the QQ plot to compare two probability distributions by plotting their quantiles against each other. Departures from the straight line at the extreme ends of the plot might indicate skewness. In this case it looks quite consistent.

# qq plot
m_dist_spd %>% 
  ggplot(aes(sample = resid(m_dist_spd))) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Q-Q Plot") +
  theme_minimal()

Conclusion

As described above, seeing the p-value, R-squared value and all the above graphs we can conclude the model seems good. There could some other factors to further consider which might influence this relationship.