CUNY MSDA DATA 605 - Simple Linear Regression

Nicholas Schettini

November 7, 2018

library(tidyverse)
library(MASS)
library(fueleconomy)
library(lmtest)
library(car)
library(gvlma)

Question

Can hp in a vehicle be predictive of MPG?

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Model Summary

model <- lm(mtcars$mpg ~ mtcars$hp, data = mtcars)
summary(model)
## 
## Call:
## lm(formula = mtcars$mpg ~ mtcars$hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## mtcars$hp   -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07
intercept <- coef(model)[1]
slope <- coef(model)[2]
ggplot(model, aes(mtcars$hp, mtcars$mpg)) +
  geom_point() +
  geom_abline(slope = slope, intercept = intercept, show.legend = TRUE)

The simple linear model is expressed as:

\[\widehat{mpg} = 30.09886 + -0.06823{*hp}\] The p-values seem to indicate that the variables are siginificant for the model. The model shows a -0.06823 decrease in mpg per HP increase. The least-squares line accounts for \(R^2\) of \(0.6024\).

Model Validation

res <- residuals(model)
res <- as.data.frame(res)
ggplot(res, aes(res)) +
  geom_histogram(fill = 'blue', alpha=0.5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plot(fitted(model),resid(model))

qqnorm(resid(model))
qqline(resid(model))

crPlots(model)

The Q-Q plot, histogram, and CR plots seem to show that the residuals are not normally distributed.

Outlier Analysis:

outlierTest(model)
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##               rstudent unadjusted p-value Bonferonni p
## Maserati Bora 2.568387           0.015632      0.50021
cutoff <- 4/((nrow(mtcars)-length(model$coefficients)-2)) 
plot(model, which=4, cook.levels=cutoff)

gvlma(model)
## 
## Call:
## lm(formula = mtcars$mpg ~ mtcars$hp, data = mtcars)
## 
## Coefficients:
## (Intercept)    mtcars$hp  
##    30.09886     -0.06823  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = model) 
## 
##                        Value   p-value                   Decision
## Global Stat        18.845445 0.0008428 Assumptions NOT satisfied!
## Skewness            2.977924 0.0844075    Assumptions acceptable.
## Kurtosis            0.005681 0.9399169    Assumptions acceptable.
## Link Function      12.369182 0.0004365 Assumptions NOT satisfied!
## Heteroscedasticity  3.492659 0.0616415    Assumptions acceptable.

According to gvlma - conditions for this linear model have not been met.

Ref:

https://stackoverflow.com/questions/43252474/using-and-interpreting-output-from-gvlma https://conservancy.umn.edu/bitstream/handle/11299/189222/LinearRegression_fulltext.pdf https://ademos.people.uic.edu/Chapter12.html#42_testing_for_outliers:_outliertest()