Data 605 HW 11

library(knitr)
library(rmdformats)

## Global options
options(max.print="85")
opts_chunk$set(cache=TRUE,
               prompt=FALSE,
               tidy=TRUE,
               comment=NA,
               message=FALSE,
               warning=FALSE)
opts_knit$set(width=31)

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:olsrr':
## 
##     cement
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

Cars data summary:

attach(cars)
summary(cars)
     speed           dist       
 Min.   : 4.0   Min.   :  2.00  
 1st Qu.:12.0   1st Qu.: 26.00  
 Median :15.0   Median : 36.00  
 Mean   :15.4   Mean   : 42.98  
 3rd Qu.:19.0   3rd Qu.: 56.00  
 Max.   :25.0   Max.   :120.00  

Visualize the relationsip between the response variable, stopping distance and speed.

plot(cars$speed, cars$dist, xlab = "Speed (mph)", ylab = "Stopping Distance (ft)", 
    main = "Stopping Distance vs. Speed")

Create a linear model

lm.fit <- lm(dist ~ speed, data = cars)
summary(lm.fit)

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Visualizing the linear model:

plot(cars$speed, cars$dist, xlab = "Speed (mph)", ylab = "Stopping Distance (ft)", 
    main = "Stopping Distance vs. Speed")
abline(lm.fit)

It seems the linear model (lm.fit) was a decent fit. It has an adjusted R-squared of 0.64.

Let’s check for the assumptions.

hist <- hist(cars$dist, breaks = 10, col = "pink", xlab = "Distance", main = "Histogram with Normal Curve")

xfit <- seq(min(cars$dist), max(cars$dist), length = 40)
yfit <- dnorm(xfit, mean = mean(cars$dist), sd = sd(cars$dist))
yfit <- yfit * diff(hist$mids[1:2]) * length(cars$dist)
lines(xfit, yfit, col = "cadetblue1", lwd = 2)

It doesn’t look like the distribution of dist forms a normal distribution.

Examining the Residuals

hist(lm.fit$residuals)

We don’t have normality in residuals.

qqnorm(lm.fit$residuals)
qqline(lm.fit$residuals)

Residuals don’t have constant variance as shown below.

plot(fitted(lm.fit), residuals(lm.fit), xlab = "fitted", ylab = "residuals")
abline(h = 0)

Breusch Pagan Test for Heteroskedasticity

The test results show that we reject the null hypothesis. Therefore, the variance is not constant.

ols_test_breusch_pagan(lm.fit)

 Breusch Pagan Test for Heteroskedasticity
 -----------------------------------------
 Ho: the variance is constant            
 Ha: the variance is not constant        

              Data               
 --------------------------------
 Response : dist 
 Variables: fitted values of dist 

        Test Summary          
 -----------------------------
 DF            =    1 
 Chi2          =    4.650233 
 Prob > Chi2   =    0.03104933 
bc <- boxcox(dist ~ log(speed), data = cars)

# to find lambda value with the highest log-likelihood
lambda <- bc$x[which.max(bc$y)]
lambda
[1] 0.2626263

Let’s do a transformation by raising distance to lambda.

lm.fit2 <- lm(dist^lambda ~ log(speed), data = cars)
summary(lm.fit2)

Call:
lm(formula = dist^lambda ~ log(speed), data = cars)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.45300 -0.17549 -0.01524  0.16402  0.60400 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.02630    0.23456   0.112    0.911    
log(speed)   0.95888    0.08708  11.011 9.86e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2529 on 48 degrees of freedom
Multiple R-squared:  0.7164,    Adjusted R-squared:  0.7105 
F-statistic: 121.2 on 1 and 48 DF,  p-value: 9.856e-15

Adjust R-squared improves to .71.

Let’s examine the residuals.

hist(lm.fit2$residuals)

Normality in residuals ?

qqnorm(lm.fit2$residuals)
qqline(lm.fit2$residuals)

Yes. It becomes normal.

plot(fitted(lm.fit2), residuals(lm.fit2), xlab = "fitted", ylab = "residuals")
abline(h = 0)

Residuals do have constant variance as shown above.

ols_test_breusch_pagan(lm.fit2)

 Breusch Pagan Test for Heteroskedasticity
 -----------------------------------------
 Ho: the variance is constant            
 Ha: the variance is not constant        

                 Data                   
 ---------------------------------------
 Response : dist^lambda 
 Variables: fitted values of dist^lambda 

        Test Summary         
 ----------------------------
 DF            =    1 
 Chi2          =    1.877362 
 Prob > Chi2   =    0.1706343 

P-value is greater than .05. We don’t reject H0 hypothesis. We do have constant variance.

So a BoxCox transformation of lambda = 0.2626263 was a good choice. It all checks out.