HW11

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

Data Exploration

help(cars)
data(cars)
head(cars)

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

str(cars)

## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

The data give the speed of cars and the distances taken to stop. Note that the data were recorded in the 1920s.

The data frame has 50 observations on 2 variables:

[,1] speed: numeric, speed (mph)

[,2] dist: numeric, stopping distance (ft)

Checking missing values.

# Percentage of missing values in each column
colMeans(is.na(cars)) * 100

## speed  dist 
##     0     0

The data have no missing values.

hist(cars$dist, main = "Distance Variable Distribution")

hist(cars$speed, main = "Speed Variable Distribution")

The dist variable is right-skewed, while the speed variable is slightly left-skewed. The model may benefit from a data transformation.
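The visual impression of skewness can be checked numerically. The sketch below computes a moment-based sample skewness in base R; the helper name `skewness_moment` is hypothetical (it is not part of the original analysis), and packages that provide a skewness function may use slightly different estimators.

```r
# Hypothetical helper: moment-based sample skewness (third standardized moment)
skewness_moment <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

data(cars)
skew <- sapply(cars, skewness_moment)
print(skew)  # positive = right-skewed, negative = left-skewed
```

A value near zero indicates a roughly symmetric distribution, so the sign and size of each entry give a quick cross-check on the histograms.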

plot(cars$speed, cars$dist, main = "Speed vs Distance")

cor(cars)

##           speed      dist
## speed 1.0000000 0.8068949
## dist  0.8068949 1.0000000

There is a clear linear relationship between speed and distance. The two variables are strongly positively correlated (r = 0.807).
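As a sketch of how the strength of this association could be tested formally, base R's `cor.test` reports the Pearson correlation together with a confidence interval and a p-value. This step goes beyond what the original analysis ran.

```r
data(cars)
# Pearson correlation test between speed and stopping distance
ct <- cor.test(cars$speed, cars$dist, method = "pearson")
r  <- unname(ct$estimate)  # same value as cor(cars)["speed", "dist"]
ci <- ct$conf.int          # 95% confidence interval for the correlation
print(round(c(r = r, lower = ci[1], upper = ci[2]), 3))
```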

Checking outliers

boxplot(cars$speed)

boxplot(cars$dist)

which.max(cars$dist)

## [1] 49

cars$dist[which.max(cars$dist)]

## [1] 120

The dist variable does have an outlier (row 49, value 120), but nothing suggests that the observation is erroneous or unusual, so I keep it. I also tried removing the outlier and replacing it with the median, but both options resulted in worse model fit and performance.
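The points that a boxplot draws as outliers can also be listed directly. A minimal sketch using base R's `boxplot.stats`, which flags values lying more than 1.5 times the interquartile range beyond the hinges:

```r
data(cars)
# Values flagged as outliers by the boxplot rule (beyond 1.5 * IQR from the hinges)
out_dist  <- boxplot.stats(cars$dist)$out
out_speed <- boxplot.stats(cars$speed)$out
# Row indices of the flagged stopping distances
out_rows <- which(cars$dist %in% out_dist)
print(list(dist = out_dist, speed = out_speed, rows = out_rows))
```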

Data Preparation

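library(caret)
# Estimate Box-Cox, centering, and scaling parameters for both columns (speed and dist)
pp <- preProcess(cars, method = c("BoxCox", "center", "scale"))
# Apply the estimated transformations to obtain the modeling data
cars_trans <- predict(pp, cars)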

Building model

set.seed(123)
# Fit a linear model and evaluate it with 5-fold cross-validation
model_1 <- train(dist ~ speed, data = cars_trans, method = "lm", trControl = trainControl(method = "cv", number = 5))
summary(model_1)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.02193 -0.34501 -0.08888  0.29192  1.55798 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.262e-17  7.702e-02    0.00        1    
## speed       8.423e-01  7.781e-02   10.82 1.77e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5446 on 48 degrees of freedom
## Multiple R-squared:  0.7094, Adjusted R-squared:  0.7034 
## F-statistic: 117.2 on 1 and 48 DF,  p-value: 1.773e-14

model_1

## Linear Regression 
## 
## 50 samples
##  1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 41, 39, 40, 40, 40 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.5467803  0.6992205  0.4410613
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

model_1 has an adjusted R-squared of 0.7034 on the full data and an average RMSE of 0.547 under 5-fold cross-validation. Both figures are on the transformed scale.

The cross-validated R-squared is 0.699, which means that approximately 70% of the variation in (transformed) stopping distance is explained by the speed of the car.
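To make the resampling figure concrete, here is a minimal base-R sketch of what 5-fold cross-validation computes. It is an illustration rather than caret's implementation: the fold assignment differs from caret's, and it uses the untransformed `cars` data, so the numbers will not match the output above.

```r
data(cars)
set.seed(123)
k <- 5
# Randomly assign each row to one of k folds of equal size
fold <- sample(rep(seq_len(k), length.out = nrow(cars)))

rmse_by_fold <- sapply(seq_len(k), function(i) {
  train_set <- cars[fold != i, ]
  test_set  <- cars[fold == i, ]
  fit  <- lm(dist ~ speed, data = train_set)  # fit on the other k - 1 folds
  pred <- predict(fit, newdata = test_set)    # predict the held-out fold
  sqrt(mean((test_set$dist - pred)^2))        # held-out RMSE
})

cv_rmse <- mean(rmse_by_fold)  # average RMSE across the k folds
print(cv_rmse)
```

Because every observation is held out exactly once, the averaged RMSE estimates how the model performs on data it was not fitted to.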

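Because both variables were Box-Cox transformed, centered, and scaled, the regression equation is expressed on the transformed scale. With transformed distance as the response variable (y) and transformed speed as the explanatory variable (x), it is

y = 0.842x (the fitted intercept, 8.262e-17, is effectively zero)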
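The coefficient table reports an intercept of 8.262e-17, which is numerically zero. This follows from centering, and when both variables are also scaled to unit variance the slope of a simple regression equals their correlation. A minimal sketch of this property using base R's `scale` on the untransformed data (an assumption for illustration: the Box-Cox step is omitted, so the slope here equals the raw correlation rather than 0.842):

```r
data(cars)
# Center and scale both columns to mean 0 and standard deviation 1
cars_std <- as.data.frame(scale(cars))
fit_std  <- lm(dist ~ speed, data = cars_std)

intercept <- unname(coef(fit_std)[1])  # numerically zero after centering
slope     <- unname(coef(fit_std)[2])  # equals cor(speed, dist)
print(c(intercept = intercept, slope = slope, r = cor(cars$speed, cars$dist)))
```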

Checking residuals

residuals<-resid(model_1)
plot(residuals)

qqnorm(residuals) 
qqline(residuals)

The residuals look approximately normally distributed and randomly scattered, which suggests that no useful information remains hidden in the residuals for the model to extract.
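A visual check can be complemented by a formal one. The sketch below applies base R's Shapiro-Wilk normality test to the residuals of a plain `lm` fit on the untransformed data. That choice is an assumption made to keep the example free of caret, so its result does not describe model_1 directly.

```r
data(cars)
fit_raw <- lm(dist ~ speed, data = cars)
res <- resid(fit_raw)

# Shapiro-Wilk test: a small p-value suggests a departure from normality
sw <- shapiro.test(res)
print(sw)

# Residuals of a least-squares fit with an intercept sum to zero (up to rounding)
print(sum(res))
```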

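Based on the data used in this analysis, the stopping distance of a car does depend on the speed at which the car was traveling. Because model_1 was fitted to transformed and standardized variables, its slope of 0.842 means that a one-standard-deviation increase in transformed speed is associated with a 0.842-standard-deviation increase in transformed stopping distance. On the original scale, a linear fit to the untransformed data has a slope of about 3.93, that is, roughly 4 additional feet of stopping distance for every additional mph of speed.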
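The per-mph figure can be read off a fit on the original scale. A minimal sketch, assuming a plain `lm` on the untransformed data rather than the caret model above:

```r
data(cars)
fit_raw <- lm(dist ~ speed, data = cars)
# Slope: additional feet of stopping distance per additional mph
print(round(coef(fit_raw), 2))

# Predicted stopping distance (ft) at 10 and 20 mph
pred <- predict(fit_raw, newdata = data.frame(speed = c(10, 20)))
print(pred)
```

The difference between the two predictions is ten times the slope, which is the sense in which each extra mph adds about 4 feet.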

Olga Shiligin

09/11/2019
