library(MASS)        # rlm(); loaded first so MASS::select() does not mask dplyr::select()
library(rvest)
library(dplyr)
library(knitr)       # kable()
library(rcompanion)
library(tidyverse)
library(caret)
print(kable(head(cars)))
##
##
##  speed   dist
## ------  -----
##      4      2
##      4     10
##      7      4
##      7     22
##      8     16
##      9     10
print(kable(summary(cars)))
##
##
##          speed            dist
## ---  -------------   ---------------
##      Min.   : 4.0    Min.   :  2.00
##      1st Qu.:12.0    1st Qu.: 26.00
##      Median :15.0    Median : 36.00
##      Mean   :15.4    Mean   : 42.98
##      3rd Qu.:19.0    3rd Qu.: 56.00
##      Max.   :25.0    Max.   :120.00
hist(cars$speed, main='Histogram of Car Speed')
plot(density(cars$speed),main = "Density of Car Speed")
The car speed distribution is roughly symmetric (mean 15.4 vs. median 15.0), with no pronounced skew.
hist(cars$dist, main='Histogram of Car Distance')
plot(density(cars$dist),main = "Density of Car Distance")
The car distance distribution is right skewed: the mean (42.98) exceeds the median (36.00), and the tail extends to a maximum of 120.
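The direction of the skew can also be checked numerically rather than read off the plots. The sketch below uses a hand-written, moment-based helper called `skew` (a hypothetical name introduced here, not part of the original analysis); a positive value indicates a longer right tail, a negative value a longer left tail.

```r
# Moment-based sample skewness; `skew` is an illustrative helper, not a package function
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_speed <- skew(cars$speed)
skew_dist  <- skew(cars$dist)
print(c(speed = skew_speed, dist = skew_dist))
```

Values close to 0 point to an approximately symmetric distribution.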
cor(cars)
##           speed      dist
## speed 1.0000000 0.8068949
## dist  0.8068949 1.0000000
The correlation between speed and distance is strong and positive (r ≈ 0.81).
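To judge whether a correlation of this size could plausibly arise by chance, one option is `cor.test()` from base R, which reports a confidence interval and a p-value for the Pearson correlation. This is a minimal sketch; the variable name `ct` is illustrative.

```r
# Test whether the Pearson correlation between speed and dist differs from zero
ct <- cor.test(cars$speed, cars$dist)

print(ct$estimate)   # sample correlation
print(ct$conf.int)   # 95% confidence interval
print(ct$p.value)    # p-value for H0: correlation = 0
```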
ourMod<-lm(dist~speed, data=cars)
summary(ourMod)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The median of the residuals (-2.27) is close to 0 relative to the residual standard error (15.38).
For the speed coefficient, the standard error (0.4155) is almost 10 times smaller than the estimate (3.9324), so the estimate is reasonably precise.
Based on the p-value (1.49e-12), speed is a statistically significant predictor of distance.
R-squared indicates that the model explains ~65% of the variability in distance.
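The quantities discussed above can be pulled out of the fitted model directly instead of being read off the printed summary. The sketch below refits the same model under the illustrative name `fit` and recomputes the t value as estimate divided by standard error; it also checks that, for a single predictor, R-squared equals the squared correlation.

```r
fit  <- lm(dist ~ speed, data = cars)
ctab <- coef(summary(fit))   # matrix: Estimate, Std. Error, t value, Pr(>|t|)

# t value is the estimate divided by its standard error
t_speed <- ctab["speed", "Estimate"] / ctab["speed", "Std. Error"]

# Share of variance in dist explained by the model
r2 <- summary(fit)$r.squared

# 95% confidence interval for the slope
ci <- confint(fit, "speed", level = 0.95)

print(t_speed)
print(r2)
print(ci)

# With one predictor, R-squared is the squared Pearson correlation
print(all.equal(r2, cor(cars$speed, cars$dist)^2))
```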
plot(ourMod)
The Residuals vs Fitted plot shows that the residuals are not uniformly spread around zero.
The Normal Q-Q plot shows that the residuals are not normally distributed (the right tail is skewed).
The variance is relatively constant.
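Visual diagnostics can be complemented with formal checks. The sketch below is one possible approach, not part of the original analysis: a Shapiro-Wilk test on the residuals for normality, and a hand-rolled Breusch-Pagan-style check (studentized form) that regresses the squared residuals on speed and compares n times R-squared with a chi-square distribution with 1 degree of freedom. All object names are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

# Shapiro-Wilk test: a small p-value suggests a departure from normality
sw <- shapiro.test(res)
print(sw)

# Breusch-Pagan-style check by hand (studentized form):
# regress squared residuals on the predictor, then use n * R^2 ~ chi-square(1)
aux     <- lm(I(res^2) ~ speed, data = cars)
bp_stat <- nrow(cars) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(statistic = bp_stat, p.value = bp_p))
```

A small p-value in the second check would point to variance that changes with speed.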
# Log transformation
logMod<-lm(log(dist)~log(speed),data=cars)
summary(logMod)
##
## Call:
## lm(formula = log(dist) ~ log(speed), data = cars)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.00215 -0.24578 -0.02898  0.20717  0.88289
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.7297     0.3758  -1.941   0.0581 .
## log(speed)    1.6024     0.1395  11.484 2.26e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4053 on 48 degrees of freedom
## Multiple R-squared: 0.7331, Adjusted R-squared: 0.7276
## F-statistic: 131.9 on 1 and 48 DF, p-value: 2.259e-15
plot(logMod)
A simple log transformation of both variables makes the residuals nearly normally distributed, but the non-constant variance becomes more pronounced.
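In a log-log model the slope is read in relative terms: a given percentage change in speed corresponds to a fixed percentage change in fitted distance. The sketch below illustrates this and shows how to bring predictions back to the original units with `exp()`. The names `log_fit`, `new_speeds` and the speeds 10 and 20 are illustrative choices, and the back-transformed values are better read as estimates of the median distance than of the mean.

```r
log_fit <- lm(log(dist) ~ log(speed), data = cars)
b <- coef(log_fit)[["log(speed)"]]

# A 10% increase in speed multiplies the fitted distance by 1.10^b
mult <- 1.10^b
print(mult)

# predict() returns values on the log scale; exp() converts them to original units
new_speeds <- data.frame(speed = c(10, 20))   # illustrative speeds
pred_dist  <- unname(exp(predict(log_fit, newdata = new_speeds)))
print(pred_dist)
```

Doubling the speed multiplies the fitted distance by `2^b`, which is what the ratio of the two predictions returns.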
# for comparison - Robust regression
rMod<-rlm(log(dist)~log(speed),data=cars)
summary(rMod)
##
## Call: rlm(formula = log(dist) ~ log(speed), data = cars)
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.04336 -0.24864 -0.02244  0.19720  0.87525
##
## Coefficients:
##             Value   Std. Error t value
## (Intercept) -0.5942  0.3914    -1.5182
## log(speed)   1.5540  0.1453    10.6942
##
## Residual standard error: 0.3608 on 48 degrees of freedom
plot(rMod)
Applying robust regression to the log-transformed data suggests that there may be some outliers.
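One way to see which observations the robust fit treats as unusual is to inspect the final weights stored in the `w` component of the `rlm` object: with the default Huber estimator, points with large residuals receive weights below 1. The sketch below lists the five lowest-weighted rows; the cutoff of five and the names `rob_fit` and `low` are arbitrary illustrative choices.

```r
library(MASS)

rob_fit <- rlm(log(dist) ~ log(speed), data = cars)

# Final IWLS weights: 1 means fully trusted, smaller values mean down-weighted
w <- rob_fit$w

# Show the five most down-weighted observations (the cutoff is arbitrary)
low <- order(w)[1:5]
print(data.frame(cars[low, ], weight = round(w[low], 3)))
```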
plot(cars, main="Regression Model: Car Speed vs Car Distance")
abline(ourMod,col="red")
Even the plot of the first model shows that the residuals are not spread uniformly, symmetrically, and randomly around the fitted line.
plot(log(cars$speed),log(cars$dist),main="Log Transformed Model vs Robust Model")
abline(logMod,col="red")
abline(rMod,col="blue")
The two log-transformed fits (least squares in red, robust in blue) are very similar, and their residuals look better than those of the original model. The residuals still have issues, though: they are more widely spread on the left of the plot than on the right.
In conclusion, the original model is not optimal: its residuals do not meet the normality criterion. The log-transformed models are clearly better, but they would benefit from more work. The possibility of outliers should be considered, a different or additional transformation could help, and additional independent variables could improve the model.
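As one possible way to look for another transformation of the response, the Box-Cox procedure in `MASS::boxcox()` profiles the log-likelihood over a grid of power parameters. This is a sketch under assumptions, not a step from the original analysis: the grid from -2 to 2 and the names `bc` and `lambda_hat` are illustrative. A value of lambda near 0 corresponds to a log transformation, near 0.5 to a square root, and near 1 to no transformation.

```r
library(MASS)

fit <- lm(dist ~ speed, data = cars)

# Profile log-likelihood over a grid of lambda values (grid chosen for illustration)
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

# Grid value with the highest profile log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)
```

The selected lambda is only a guide; the residual plots of the refitted model still need to be checked.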