Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

Answer:

Load the libraries

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.3
## -- Attaching packages ---------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0     v purrr   0.2.5
## v tibble  2.0.1     v dplyr   0.7.8
## v tidyr   0.8.2     v stringr 1.3.1
## v readr   1.3.1     v forcats 0.3.0
## Warning: package 'ggplot2' was built under R version 3.5.3
## -- Conflicts ------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Get an overview of the data

glimpse(cars)
## Observations: 50
## Variables: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13...
## $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28...

There are 50 observations and 2 variables. Both speed and dist are stored as doubles.

Summary Statistics

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

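The summary shows no implausible values. By the 1.5 × IQR rule, however, the maximum dist of 120 lies above the upper fence of 56 + 1.5 × 30 = 101, so it is a mild outlier. Speed has no values outside its fences.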
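A quick way to check for outliers is the boxplot (1.5 × IQR) rule. The sketch below is one possible check, not part of the original analysis; `iqr_outliers` is a hypothetical helper name introduced here for illustration.

```r
# Flag values that lie more than 1.5 * IQR beyond the quartiles (the boxplot rule).
# `iqr_outliers` is a hypothetical helper, not a base R function.
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))   # first and third quartiles
  fence <- 1.5 * IQR(x)             # distance from each quartile to its fence
  x[x < q[1] - fence | x > q[2] + fence]
}

speed_outliers <- iqr_outliers(cars$speed)
dist_outliers  <- iqr_outliers(cars$dist)

speed_outliers   # values of speed outside the fences, if any
dist_outliers    # values of dist outside the fences, if any
```

Any value listed in the output falls outside the fences for that variable and is worth a second look before fitting the model.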

Correlation between the variables

cor(cars)
##           speed      dist
## speed 1.0000000 0.8068949
## dist  0.8068949 1.0000000

Plot the data

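# Scatter plot of stopping distance (ft) against speed (mph)
plot(cars$speed, cars$dist, type = "p", main = "Speed vs. Distance",
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")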

The correlation between speed and distance is strong and positive (r ≈ 0.81).
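To go one step beyond the point estimate, the correlation can also be tested formally. The following is a minimal sketch using `cor.test()` from base R's stats package; it is an addition for illustration, and the object name `ct` is arbitrary.

```r
# Pearson correlation test between speed and stopping distance
ct <- cor.test(cars$speed, cars$dist)   # method = "pearson" is the default

ct$estimate   # sample correlation coefficient r
ct$conf.int   # 95% confidence interval for the population correlation
ct$p.value    # p-value for H0: true correlation is zero
```

A confidence interval that excludes zero supports the reading of a positive linear association.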

Linear Model

Fit the model

We fit a linear model of stopping distance as a function of speed. Here dist is the dependent (response) variable and speed is the independent (predictor) variable.

fit <- lm(dist ~ speed, data = cars)
summary(fit)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Fitted regression model: \(\widehat{distance}_{i} = -17.5791 + 3.9324 \times speed_{i}\)
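The fitted equation can be used directly for prediction and drawn over the scatter plot. The sketch below assumes two illustrative speeds (10 and 20 mph) chosen here for demonstration; they are not from the original analysis.

```r
fit <- lm(dist ~ speed, data = cars)

# Illustrative speeds (mph), chosen only for demonstration
new_speeds <- data.frame(speed = c(10, 20))
pred <- predict(fit, newdata = new_speeds)
pred   # predicted stopping distances (ft)

# Scatter plot with the fitted regression line and the two predictions
plot(dist ~ speed, data = cars, main = "Speed vs. Distance",
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
abline(fit, col = "red")
points(new_speeds$speed, pred, pch = 19, col = "red")
```

Plugging 20 mph into the equation by hand gives -17.5791 + 3.9324 × 20 ≈ 61.07 ft, which should agree with the second value returned by `predict()`.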

For each 1 mph increase in speed, the predicted stopping distance increases by 3.9324 ft.

Model Summary

1. The model explains 65.11% of the variability in stopping distance.
2. Speed is a significant predictor of distance at the 5% significance level, since its p-value (1.49e-12) is less than 0.05.
3. The model as a whole is statistically significant at the 5% level, since the p-value of the F-statistic (1.49e-12) is less than 0.05.
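The quantities quoted from the summary can also be extracted programmatically. The sketch below is an illustrative addition: it pulls R-squared from the summary object, checks that it equals the squared correlation (which holds for simple linear regression), and computes 95% confidence intervals for the coefficients.

```r
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

r2 <- s$r.squared                               # proportion of variance explained
r2_from_cor <- cor(cars$speed, cars$dist)^2     # squared correlation, same value here
ci <- confint(fit, level = 0.95)                # 95% CIs for intercept and slope

r2
r2_from_cor
ci
coef(s)   # coefficient table: estimates, standard errors, t values, p-values
```

A slope interval that lies entirely above zero is consistent with the significance result for speed.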

Residual Analysis

par(mfrow=c(2,2))
plot(fit)

From the residuals vs. fitted values plot we can see that there is no clear pattern, so the assumptions of randomly scattered residuals and constant variance (homoscedasticity) appear to be satisfied.
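The visual check of constant variance can be complemented with a formal test. The sketch below computes the studentized (Koenker) form of the Breusch-Pagan statistic by hand in base R; this test is not used in the original analysis, and the names `aux_data`, `aux`, `bp_stat`, and `bp_p` are introduced here for illustration.

```r
fit <- lm(dist ~ speed, data = cars)

# Auxiliary regression: squared residuals on the predictor
aux_data <- data.frame(e2 = residuals(fit)^2, speed = cars$speed)
aux <- lm(e2 ~ speed, data = aux_data)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression,
# compared with a chi-squared distribution with 1 degree of freedom
bp_stat <- nrow(aux_data) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

bp_stat
bp_p
```

A small p-value would suggest that the residual variance changes with speed; a large one is consistent with homoscedasticity.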

From the normal Q-Q plot we can see that the residuals are approximately normally distributed, with the largest residuals deviating somewhat from the line.
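A formal normality test can back up the reading of the Q-Q plot. The sketch below applies the Shapiro-Wilk test from base R to the model residuals; it is an illustrative addition rather than part of the original analysis.

```r
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test; H0: the residuals come from a normal distribution
sw <- shapiro.test(residuals(fit))

sw$statistic   # W statistic (values close to 1 indicate normality)
sw$p.value
```

A p-value below 0.05 would indicate a departure from normality, in which case the Q-Q plot helps show where the residuals deviate.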

Conclusion

From the overall analysis we can say that speed is a significant predictor of stopping distance. The model explains about 65% of the variability in distance, and the residual plots suggest that the assumptions of linear regression are approximately satisfied.