An analysis of the Auto dataset, from the ISLR package (http://www-bcf.usc.edu/~gareth/ISL/data.html) on 7 March 2015.
The dataset covers miles per gallon, cylinders, displacement, horsepower, weight, acceleration, year, origin of 397 cars.
The aim of the analysis is to answer the following question: Is miles per gallon (mpg) a function of the remaining variables in the dataset?
setwd("~/MDSI/STATS")
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(corrplot)
library(coefplot)
library(car)
#assumes auto.csv is in your working directory.
auto = read.csv("auto.csv")
str(auto)
## 'data.frame': 397 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : Factor w/ 94 levels "?","100","102",..: 17 35 29 29 24 42 47 46 48 40 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
summary(auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.0 Min. :3.00 Min. : 68 150 : 22 Min. :1613
## 1st Qu.:17.5 1st Qu.:4.00 1st Qu.:104 90 : 20 1st Qu.:2223
## Median :23.0 Median :4.00 Median :146 88 : 19 Median :2800
## Mean :23.5 Mean :5.46 Mean :194 110 : 18 Mean :2970
## 3rd Qu.:29.0 3rd Qu.:8.00 3rd Qu.:262 100 : 17 3rd Qu.:3609
## Max. :46.6 Max. :8.00 Max. :455 75 : 14 Max. :5140
## (Other):287
## acceleration year origin name
## Min. : 8.0 Min. :70 Min. :1.00 ford pinto : 6
## 1st Qu.:13.8 1st Qu.:73 1st Qu.:1.00 amc matador : 5
## Median :15.5 Median :76 Median :1.00 ford maverick : 5
## Mean :15.6 Mean :76 Mean :1.57 toyota corolla: 5
## 3rd Qu.:17.1 3rd Qu.:79 3rd Qu.:2.00 amc gremlin : 4
## Max. :24.8 Max. :82 Max. :3.00 amc hornet : 4
## (Other) :368
head(auto, n=10)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## 7 14 8 454 220 4354 9.0 70 1
## 8 14 8 440 215 4312 8.5 70 1
## 9 14 8 455 225 4425 10.0 70 1
## 10 15 8 390 190 3850 8.5 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
## 7 chevrolet impala
## 8 plymouth fury iii
## 9 pontiac catalina
## 10 amc ambassador dpl
Summaries indicate no missing data.
Horsepower was read in as a factor - convert to an integer given the number of levels (94).
auto$horsepower = as.integer(auto$horsepower)
colnames(auto)
## [1] "mpg" "cylinders" "displacement" "horsepower"
## [5] "weight" "acceleration" "year" "origin"
## [9] "name"
#select the columns we are interested in - I have assumed that names is not a factor that should influence mpg.
#While origin is also potentially a predictive variable, I have been unable to ascertain what the individual
#levels represent and have therefore excluded them.
auto = select(auto, mpg, cylinders, displacement, horsepower, weight, acceleration, year)
head(auto, n = 10)
## mpg cylinders displacement horsepower weight acceleration year
## 1 18 8 307 17 3504 12.0 70
## 2 15 8 350 35 3693 11.5 70
## 3 18 8 318 29 3436 11.0 70
## 4 16 8 304 29 3433 12.0 70
## 5 17 8 302 24 3449 10.5 70
## 6 15 8 429 42 4341 10.0 70
## 7 14 8 454 47 4354 9.0 70
## 8 14 8 440 46 4312 8.5 70
## 9 14 8 455 48 4425 10.0 70
## 10 15 8 390 40 3850 8.5 70
plot(auto)
Examine the correlations between variables and then examine relationships 1 by 1.
correlations = cor(auto)
corrplot(correlations)
Displacement and cylinders are highly correlated. Weight is highly positively correlated with cylinders and displacement.
Exploring some of the relationships between variables:
#Some plots between independent variables
ggplot(auto, aes(x = displacement, y =cylinders)) + geom_point()
ggplot(auto, aes(x = horsepower, y =cylinders)) + geom_point()
ggplot(auto, aes(x = weight, y =cylinders)) + geom_point()
#Quick plots between mpg and single variables.
ggplot(auto, aes(x = mpg, y = weight)) + geom_point(aes(color = cylinders))
ggplot(auto, aes(x = mpg, y = weight)) + geom_point(aes(color = horsepower))
ggplot(auto, aes(x = mpg, y = cylinders)) + geom_point()
ggplot(auto, aes(x = mpg, y = displacement)) + geom_point()
#Check similarity of cylinders and displacement
ggplot(auto, aes(x = mpg, y = acceleration)) + geom_point(aes(color = cylinders))
ggplot(auto, aes(x = mpg, y = acceleration)) + geom_point(aes(color = displacement))
The similarity of the last two charts combined with the correlation of cylinders and displacement indicates that cylinders and displacement are unlikely to both be required in the final model.
The weight ~ mpg scatterplot indicates that the relationship may not be linear and the variable may require some transformation.
I have used all variables in the initial model.
autoModel1 = lm(mpg ~ cylinders + horsepower + weight + displacement + year + acceleration, data = auto)
summary(autoModel1)
##
## Call:
## lm(formula = mpg ~ cylinders + horsepower + weight + displacement +
## year + acceleration, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.750 -2.421 -0.087 1.982 14.309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.64e+01 4.27e+00 -3.84 0.00014 ***
## cylinders -1.57e-01 3.47e-01 -0.45 0.65179
## horsepower 8.98e-03 7.01e-03 1.28 0.20113
## weight -6.83e-03 5.97e-04 -11.43 < 2e-16 ***
## displacement 6.30e-03 7.23e-03 0.87 0.38435
## year 7.64e-01 5.08e-02 15.03 < 2e-16 ***
## acceleration 8.38e-02 7.85e-02 1.07 0.28641
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.43 on 390 degrees of freedom
## Multiple R-squared: 0.811, Adjusted R-squared: 0.808
## F-statistic: 278 on 6 and 390 DF, p-value: <2e-16
Overall, autoModel1 is predictive, with an adjusted r squared of 80.8% and a low p value for the model. This is interpreted to mean that 80.8% of the variance in mpg is explained by the predictors in the model. The small p value of the overall model indicates that R squared is significantly different from zero.
Weight and Year are the only two significant variables at the 5% level (both are predictive at a less than .01% level).
We can also plot the coefficients to see a visual representation of the confidence intervals for the coefficients:
coefplot(autoModel1)
Remove the variables with high p values to see if the overall explanatory power of the model is lowered. Adding them individually to the base model to see if the coefficients are significant.
autoModel2 = lm(mpg ~ weight + year + cylinders, data = auto)
summary(autoModel2)
##
## Call:
## lm(formula = mpg ~ weight + year + cylinders, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.991 -2.304 -0.081 2.016 14.328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.43e+01 4.04e+00 -3.54 0.00045 ***
## weight -6.45e-03 4.61e-04 -14.01 < 2e-16 ***
## year 7.58e-01 4.99e-02 15.20 < 2e-16 ***
## cylinders -1.17e-01 2.33e-01 -0.50 0.61583
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.43 on 393 degrees of freedom
## Multiple R-squared: 0.809, Adjusted R-squared: 0.807
## F-statistic: 554 on 3 and 393 DF, p-value: <2e-16
autoModel2 = lm(mpg ~ weight + year + horsepower, data = auto)
summary(autoModel2)
##
## Call:
## lm(formula = mpg ~ weight + year + horsepower, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.059 -2.382 -0.141 1.884 14.342
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.58e+01 4.04e+00 -3.91 0.00011 ***
## weight -6.48e-03 2.42e-04 -26.83 < 2e-16 ***
## year 7.64e-01 4.90e-02 15.58 < 2e-16 ***
## horsepower 1.00e-02 6.58e-03 1.52 0.12827
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.43 on 393 degrees of freedom
## Multiple R-squared: 0.81, Adjusted R-squared: 0.808
## F-statistic: 558 on 3 and 393 DF, p-value: <2e-16
autoModel2 = lm(mpg ~ weight + year + displacement, data = auto)
summary(autoModel2)
##
## Call:
## lm(formula = mpg ~ weight + year + displacement, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.835 -2.284 -0.135 2.056 14.307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.47e+01 3.99e+00 -3.68 0.00026 ***
## weight -6.75e-03 5.70e-04 -11.84 < 2e-16 ***
## year 7.64e-01 5.07e-02 15.09 < 2e-16 ***
## displacement 8.07e-04 4.74e-03 0.17 0.86482
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.44 on 393 degrees of freedom
## Multiple R-squared: 0.809, Adjusted R-squared: 0.807
## F-statistic: 554 on 3 and 393 DF, p-value: <2e-16
autoModel2 = lm(mpg ~ weight + year + acceleration, data = auto)
summary(autoModel2)
##
## Call:
## lm(formula = mpg ~ weight + year + acceleration, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.691 -2.357 -0.139 2.013 14.253
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.26252 4.03360 -3.78 0.00018 ***
## weight -0.00658 0.00023 -28.66 < 2e-16 ***
## year 0.75415 0.04992 15.11 < 2e-16 ***
## acceleration 0.06463 0.07020 0.92 0.35786
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.43 on 393 degrees of freedom
## Multiple R-squared: 0.809, Adjusted R-squared: 0.808
## F-statistic: 555 on 3 and 393 DF, p-value: <2e-16
The coefficients of cylinders, horsepower, displacement and acceleration are not statistically significant when added to the model.
Final model:
autoModel3 = lm(mpg ~ weight + year, data = auto)
summary(autoModel3)
##
## Call:
## lm(formula = mpg ~ weight + year, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.865 -2.305 -0.128 2.037 14.306
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.47e+01 3.98e+00 -3.68 0.00026 ***
## weight -6.66e-03 2.14e-04 -31.14 < 2e-16 ***
## year 7.62e-01 4.91e-02 15.52 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.43 on 394 degrees of freedom
## Multiple R-squared: 0.809, Adjusted R-squared: 0.808
## F-statistic: 833 on 2 and 394 DF, p-value: <2e-16
Removing cylinders, horsepower and acceleration has no effect on the overall explanatory power of the model. Adjusted r squared for autoModel2 is 80.8%. Again, weight and year are highly significant.
The model is interpreted to mean that mpg = -14.650237 - 0.006655 * weight + 0.762334 * year.
Plot the coefficients:
coefplot(autoModel3)
par(mfrow=c(2,2))
plot(autoModel3)
The normal Q-Q plot would be a 45 degree line if the residual values are normally distributed with a mean of zero. The plot indicates that this is not the case.
If the dependent variables are linearly related to the independent variables, then the Residuals vs Fitted Values chart (upper left chart) should show no systematic relationship between the fitted values and the residuals. The shape of the graph indicates that this is not the case, with the shape of the curve indicating that a transformation of one or more variables would be appropriate.
Evidence of non-linearity between the dependent variable and independent variables can be assessed using component plus residual plots:
#using the crPlots function in the "car" package:
crPlots(autoModel3)
The above chart indicates that the weight variable may need to be transformed in order to improve the appropriateness of the model.