Libraries

library(rvest) 
library(dplyr)
library(knitr)
library(rcompanion)
library(MASS)
library(tidyverse)
library(caret)

Summary of data; Histograms and Density Plots

  print(kable(head(cars)))

## 
## 
##  speed   dist
## ------  -----
##      4      2
##      4     10
##      7      4
##      7     22
##      8     16
##      9     10

  print(kable(summary(cars)))

## 
## 
##          speed           dist      
## ---  -------------  ---------------
##      Min.   : 4.0   Min.   :  2.00 
##      1st Qu.:12.0   1st Qu.: 26.00 
##      Median :15.0   Median : 36.00 
##      Mean   :15.4   Mean   : 42.98 
##      3rd Qu.:19.0   3rd Qu.: 56.00 
##      Max.   :25.0   Max.   :120.00

  hist(cars$speed, main='Histogram of Car Speed')

  plot(density(cars$speed),main = "Density of Car Speed")

Car Speed distribution is roughly symmetric: the mean (15.4) and the median (15.0) nearly coincide.

  hist(cars$dist, main='Histogram of Car Distance')

  plot(density(cars$dist),main = "Density of Car Distance")

Car Distance is right skewed: the mean (42.98) exceeds the median (36.00), and the upper tail extends out to 120.
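The skewness of both variables can also be checked numerically rather than read off the plots. The following is a minimal sketch, assuming the moment-based definition of sample skewness; the helper name `skew` is hypothetical and not part of any package loaded above.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5.
# `skew` is an illustrative helper, not a function from a loaded package.
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_speed <- skew(cars$speed)  # a value near 0 means roughly symmetric
skew_dist  <- skew(cars$dist)   # a positive value means right skewed

print(c(speed = skew_speed, dist = skew_dist))
```

A positive value indicates a longer right tail and a negative value a longer left tail.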

Correlation

  cor(cars)

##           speed      dist
## speed 1.0000000 0.8068949
## dist  0.8068949 1.0000000

Correlation between Speed and Distance is strong and positive (about 0.81).
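To back the correlation estimate with a confidence interval and a significance test, base R's `cor.test` can be used. This is a sketch of one possible check, not part of the original analysis; the variable names are illustrative.

```r
# Pearson correlation test between speed and stopping distance.
ct <- cor.test(cars$speed, cars$dist, method = "pearson")

r_hat   <- unname(ct$estimate)  # sample correlation coefficient
r_ci    <- ct$conf.int          # 95% confidence interval for the correlation
p_value <- ct$p.value           # test of H0: true correlation is 0

print(ct)
```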

Simple Regression Model

  ourMod<-lm(dist~speed, data=cars)
  
  summary(ourMod)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Median of residuals (-2.27) is reasonably close to 0.

For the speed coefficient, the standard error is almost 10 times smaller than the estimate (0.42 vs 3.93), so the estimate is fairly precise.

Based on the p-value (1.49e-12), speed is a statistically significant predictor of distance.

R-squared indicates that the model explains ~65% of the variability in distance.
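The quantities discussed above can be pulled out of the fitted model directly instead of being read from the printed summary. A short sketch, refitting the same model so the block is self-contained; the names `coefs`, `t_speed`, `ci`, and `r2` are illustrative.

```r
# Refit the simple regression so this block runs on its own.
ourMod <- lm(dist ~ speed, data = cars)

# Coefficient table as a numeric matrix.
coefs <- coef(summary(ourMod))

# Ratio of estimate to standard error (the t value) for speed.
t_speed <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]

# 95% confidence intervals for both coefficients.
ci <- confint(ourMod, level = 0.95)

# Share of variance in dist explained by the model.
r2 <- summary(ourMod)$r.squared

print(t_speed)
print(ci)
print(r2)
```

A confidence interval for the slope that excludes 0 tells the same story as the small p-value.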

  plot(ourMod)

The residuals vs fitted graph shows that the residuals are not uniformly spread.

The residuals are not normally distributed (the right tail is skewed).

Variance is relatively constant.
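The visual impressions from the diagnostic plots can be supplemented with numbers. The sketch below uses the Shapiro-Wilk test from base R for normality and, as a deliberately crude assumption of mine, the correlation between absolute residuals and fitted values as an indicator of changing spread. A formal heteroscedasticity test would need a package that is not loaded in this report.

```r
# Refit the simple regression so this block runs on its own.
ourMod <- lm(dist ~ speed, data = cars)
res <- residuals(ourMod)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals.
sw <- shapiro.test(res)
print(sw)

# Crude spread check (an informal indicator, not a formal test):
# a clearly positive value suggests residual spread grows with fitted values.
spread_trend <- cor(abs(res), fitted(ourMod))
print(spread_trend)
```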

Log Transformed Model

# Log transformation

logMod<-lm(log(dist)~log(speed),data=cars)

summary(logMod)

## 
## Call:
## lm(formula = log(dist) ~ log(speed), data = cars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00215 -0.24578 -0.02898  0.20717  0.88289 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.7297     0.3758  -1.941   0.0581 .  
## log(speed)    1.6024     0.1395  11.484 2.26e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4053 on 48 degrees of freedom
## Multiple R-squared:  0.7331, Adjusted R-squared:  0.7276 
## F-statistic: 131.9 on 1 and 48 DF,  p-value: 2.259e-15

plot(logMod)

A simple log transformation makes the residuals nearly normally distributed, but the non-constant variance becomes more pronounced.
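Because both variables are logged, the slope is an elasticity, and predictions have to be back-transformed to be read in the original units. A minimal sketch follows; the speeds 10 and 20 are arbitrary example values. Note that `exp()` of a predicted log value is not an unbiased estimate of the mean distance, and that R-squared on the log scale is not directly comparable with R-squared of the untransformed model.

```r
# Refit the log-log model so this block runs on its own.
logMod <- lm(log(dist) ~ log(speed), data = cars)

# Slope on the log-log scale: approximate % change in dist per 1% change in speed.
elasticity <- unname(coef(logMod)["log(speed)"])

# Predict at two example speeds (arbitrary values) and back-transform.
new_speed <- data.frame(speed = c(10, 20))
pred_log  <- predict(logMod, newdata = new_speed)
pred_dist <- exp(pred_log)

# Doubling speed multiplies the back-transformed prediction by 2^elasticity.
ratio <- unname(pred_dist[2] / pred_dist[1])

print(elasticity)
print(pred_dist)
print(ratio)
```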

Robust Model

 # for comparison - Robust regression

  rMod<-rlm(log(dist)~log(speed),data=cars)
  
  summary(rMod)

## 
## Call: rlm(formula = log(dist) ~ log(speed), data = cars)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.04336 -0.24864 -0.02244  0.19720  0.87525 
## 
## Coefficients:
##             Value   Std. Error t value
## (Intercept) -0.5942  0.3914    -1.5182
## log(speed)   1.5540  0.1453    10.6942
## 
## Residual standard error: 0.3608 on 48 degrees of freedom

  plot(rMod)

Applying robust regression to the log-transformed data suggests that there may be some outliers.
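One way to see which observations the robust fit treats as unusual is to inspect the final weights stored in the `w` component of the `rlm` object. This is a sketch under the default Huber settings; the names `down` and `flagged` are illustrative, and a weight below 1 only marks a point as down-weighted, not necessarily as an outlier.

```r
library(MASS)

# Refit the robust log-log model so this block runs on its own.
rMod <- rlm(log(dist) ~ log(speed), data = cars)

# Final iteratively reweighted least squares weights, one per observation.
w <- rMod$w

# Observations that were down-weighted (weight below 1).
down <- which(w < 1)
flagged <- cbind(cars[down, ], weight = round(w[down], 3))

# Show the most heavily down-weighted observations first.
print(flagged[order(flagged$weight), ])
```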

Regression Plots

  plot(cars, main="Regression Model: Car Speed vs Car Distance")
  abline(ourMod,col="red")

Even from plotting our first model, we can see that the residuals are not uniformly, symmetrically, and randomly distributed around the fitted line.

plot(log(cars$speed),log(cars$dist),main="Log Transformed Model vs Robust Model")
  abline(logMod,col="red")
  abline(rMod,col="blue")

The log-transformed models are very similar, and their residuals look better. But the residuals still have issues: they are more widely spread on the left of the plot than on the right.

Conclusion

In conclusion, the first/original model is not optimal: it does not meet the residual normality criterion. The log-transformed models are better, but they would benefit from more work. The possibility of outliers should be considered, and some other or additional transformation could help. Additional independent variables could also make the model better.
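As one possible starting point for choosing another transformation, the Box-Cox profile likelihood from `MASS::boxcox` can suggest a power for the response. This is a sketch only and not something done in the analysis above: it transforms `dist` alone (unlike the log-log model, which transforms both variables), and the lambda grid is an arbitrary choice.

```r
library(MASS)

# Refit the original model so this block runs on its own.
ourMod <- lm(dist ~ speed, data = cars)

# Profile log-likelihood over a grid of candidate powers for dist.
bc <- boxcox(ourMod, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

# Power with the highest profile log-likelihood on the grid.
# lambda near 0 points to log(dist), near 0.5 to sqrt(dist), near 1 to no change.
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)
```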

Data 605. Assignment for Week 11. 04/13/2019.

Mikhail Groysman

April 13, 2019