DATA605_HW11_ChunjieNan

Assignment 11

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

Library

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(ggplot2)
library(corrplot)

## corrplot 0.92 loaded

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

Data Overview

str(cars)

## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

head(cars)

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

As the str() output shows, the data frame has 50 observations of 2 numeric variables: speed (in mph) and dist (stopping distance, in ft).

Scatter plot

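# speed is the predictor (x) and stopping distance is the response (y), matching the model dist ~ speed
ggplot(cars, aes(x = speed, y = dist)) + geom_point() + ggtitle("Association between speed and stopping distance")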

Correlation

cor(cars)

##           speed      dist
## speed 1.0000000 0.8068949
## dist  0.8068949 1.0000000

corrplot(cor(cars), type = 'upper')
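
The correlation of about 0.81 can also be checked with a formal test. The following is a minimal sketch in base R, not part of the original analysis; the names `ct`, `r`, `r_squared`, and `fit` are chosen here for illustration. It also checks the general fact that, in a simple linear regression with one predictor, the squared Pearson correlation equals the multiple R-squared.

```r
# Pearson correlation test between speed and stopping distance
ct <- cor.test(cars$speed, cars$dist)
print(ct)

# Extract the correlation coefficient and square it
r <- unname(ct$estimate)
r_squared <- r^2

# With a single predictor, r^2 should match the model's multiple R-squared
fit <- lm(dist ~ speed, data = cars)
print(all.equal(r_squared, summary(fit)$r.squared))
```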

Simple Linear Regression

smod <- lm(dist ~ speed, data = cars)
summary(smod)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

As the summary of the linear regression model shows,

slope = 3.9324

intercept = -17.5791

Therefore, the fitted regression line is dist = -17.5791 + 3.9324 × speed, where speed is in mph and dist is in ft.
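
As a worked example of the fitted line: at a speed of 15 mph, the rounded coefficients give -17.5791 + 3.9324 × 15 ≈ 41.4 ft. The sketch below, which is an addition and not part of the original report, reproduces this with predict(). The speeds 10, 15, and 20 are arbitrary illustrative values inside the observed range, and `new_speeds`, `pred`, and `manual` are names chosen here.

```r
# Refit the model so this block is self-contained
smod <- lm(dist ~ speed, data = cars)

# Illustrative speeds (mph) within the range observed in cars
new_speeds <- data.frame(speed = c(10, 15, 20))

# Point predictions with 95% prediction intervals for a single new stop
pred <- predict(smod, newdata = new_speeds, interval = "prediction")
print(cbind(new_speeds, pred))

# The same point predictions computed by hand from the coefficients
manual <- coef(smod)[1] + coef(smod)[2] * new_speeds$speed
print(unname(manual))
```

The prediction interval is wider than a confidence interval for the mean response because it also accounts for the scatter of individual observations around the line.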

Plots

plot(cars, xlab = "Speed", ylab = "Distance")
abline(smod) # overlay the fitted regression line

ols_plot_resid_hist(smod) # histogram of the residuals

ols_plot_resid_fit(smod) # residuals vs. fitted values, to check for heteroscedasticity

ols_plot_cooksd_chart(smod) # Cook's distance, for identifying influential observations

The residual histogram suggests that the residuals are slightly right-skewed.
The p-value for speed is 1.49e-12, which means that stopping distance is statistically significantly associated with speed.
The adjusted R-squared is 0.6438, which means that the model explains about 64% of the variation in stopping distance after adjusting for the number of predictors.
Distance and speed are strongly correlated (r ≈ 0.81), and the slope indicates that each 1 mph increase in speed is associated with an increase of about 3.93 ft in stopping distance.
The residuals-versus-fitted plot from ols_plot_resid_fit() suggests that the model may have a heteroscedasticity problem.
The Cook’s distance chart flags two potentially influential observations, located at the 23rd and 49th rows.
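
The visual impressions above can be complemented with numeric checks. The following is a rough sketch in base R, added here as an illustration and not part of the original analysis. It assumes a Cook's distance cutoff of 4/n, which is a common rule of thumb and may differ from the threshold the chart uses. The heteroscedasticity check is a hand-computed studentized (Koenker) Breusch-Pagan statistic, not a call into olsrr. All variable names are chosen here.

```r
# Refit the model so this block is self-contained
smod <- lm(dist ~ speed, data = cars)
res <- residuals(smod)
n <- nrow(cars)

# Normality of the residuals: Shapiro-Wilk test
sw <- shapiro.test(res)
print(sw)

# Heteroscedasticity: studentized Breusch-Pagan statistic, computed by hand.
# Regress the squared residuals on the predictor; n * R^2 is compared
# against a chi-square distribution with 1 degree of freedom.
sq_res <- res^2
aux <- lm(sq_res ~ cars$speed)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
cat("BP statistic:", bp_stat, " p-value:", bp_p, "\n")

# Influence: Cook's distance with an assumed 4/n rule-of-thumb cutoff
cd <- cooks.distance(smod)
flagged <- which(cd > 4 / n)
print(cd[flagged])
```

A small p-value from the Shapiro-Wilk or Breusch-Pagan check would support the corresponding concern, while a large one would suggest the pattern seen in the plot is mild.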