Data extracted from the public UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.html)
Dataset: Auto MPG (https://archive.ics.uci.edu/ml/datasets/Auto+MPG)
Load the data from the repository and add the field names manually, since the data file has no header row. Note: this is a narrow dataset, so typing the names by hand is feasible. For a wider dataset, I’d read the field names from a separate file and assign them to the column names.
fileUrl <- "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
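# Missing horsepower values are coded as "?" in the raw file; read them as NA so the column stays numeric
auto <- read.table(fileUrl, na.strings = "?")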
names(auto) <- c("mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model year", "origin", "car name")
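For a wider dataset, the column names could be read from a separate file instead of being typed by hand. A minimal sketch, assuming a hypothetical plain-text file with one field name per line (written to a temporary file here so the example is self-contained); the single inline row only stands in for the downloaded data:

```r
# Hypothetical names file: one field name per line (an assumption for
# illustration; it is created here rather than downloaded).
names_file <- tempfile(fileext = ".txt")
writeLines(c("mpg", "cylinders", "displacement", "horsepower", "weight",
             "acceleration", "model year", "origin", "car name"),
           names_file)

field_names <- readLines(names_file)

# Stand-in for read.table(fileUrl): one row in the same layout as the data file.
sample_rows <- read.table(
  text = '18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"',
  na.strings = "?"
)

# Fail early if the names file and the data disagree on the column count.
stopifnot(length(field_names) == ncol(sample_rows))
names(sample_rows) <- field_names
```

The column-count check is the main design point: with many fields, a silent mismatch between the names file and the data is easy to miss.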
Look at the following car features and their impact on MPG: Horsepower, Cylinders, Displacement, Weight
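library(ggplot2)
# Convert horsepower via character so that, if it was read as a factor or string
# (the raw file uses "?" for missing values), the actual values are plotted rather than factor codes
ggplot(auto, aes(as.numeric(as.character(horsepower)), mpg)) + geom_point() + geom_smooth()
ggplot(auto, aes(cylinders, mpg)) + geom_point() + geom_smooth()
ggplot(auto, aes(displacement, mpg)) + geom_point() + geom_smooth()
ggplot(auto, aes(weight, mpg)) + geom_point() + geom_smooth()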
Analysis: All of these variables appear to be related to MPG, but the plots suggest that displacement and weight have the strongest relationship with it.
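The visual impression can be checked numerically with Pearson correlations. A minimal sketch: the helper name `cor_with_mpg` and the toy data frame are assumptions made for illustration (the values are made up, not taken from the dataset).

```r
# Correlation of each candidate feature with mpg.
# `cor_with_mpg` is a hypothetical helper, not part of the original analysis.
cor_with_mpg <- function(df, features) {
  sapply(features, function(f) {
    # Coerce via character so a column read as factor/string still works;
    # non-numeric placeholders such as "?" become NA.
    x <- suppressWarnings(as.numeric(as.character(df[[f]])))
    # Drop incomplete pairs before computing the correlation.
    cor(x, df$mpg, use = "complete.obs")
  })
}

# Toy data with made-up values, only to show the call shape.
toy <- data.frame(
  mpg          = c(30, 25, 20, 15, 12),
  displacement = c(100, 150, 250, 350, 400),
  weight       = c(2000, 2500, 3000, 3800, 4300),
  horsepower   = c("70", "90", "?", "150", "180")
)

cors <- cor_with_mpg(toy, c("displacement", "weight", "horsepower"))
print(round(cors, 2))
```

On the real data, the equivalent call would be `cor_with_mpg(auto, c("horsepower", "cylinders", "displacement", "weight"))`.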
Next, fit a linear model of MPG on each of displacement and weight and look at its parameters.
auto.lm1 <- lm(mpg ~ displacement, data = auto)
summary(auto.lm1)
##
## Call:
## lm(formula = mpg ~ displacement, data = auto)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12.9550  -3.0569  -0.4928   2.3277  18.6192
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  35.174750   0.491824   71.52   <2e-16 ***
## displacement -0.060282   0.002239  -26.93   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.651 on 396 degrees of freedom
## Multiple R-squared: 0.6467, Adjusted R-squared: 0.6459
## F-statistic: 725 on 1 and 396 DF, p-value: < 2.2e-16
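We can describe the relationship between MPG and displacement as: MPG = -0.06 * displacement + 35.17, with an R-squared of 0.65. That is, displacement alone explains about 65% of the variance in MPG, which corresponds to a correlation of about -0.80.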
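As a worked example of using the fitted line, the coefficients printed in the summary above can be plugged in directly. This is only a sketch: the coefficient values are hard-coded from that output, the helper name is hypothetical, and the displacement of 200 is an arbitrary illustrative input.

```r
# Coefficients copied from the summary(auto.lm1) output above.
intercept <- 35.174750
slope     <- -0.060282

# Hypothetical helper: predicted MPG for a given displacement.
predict_mpg_from_displacement <- function(displacement) {
  intercept + slope * displacement
}

# Arbitrary example input (an assumption, not a row from the dataset).
pred <- predict_mpg_from_displacement(200)
print(round(pred, 2))
# [1] 23.12
```

With the fitted model object available, `predict(auto.lm1, newdata = data.frame(displacement = 200))` gives the same kind of estimate without copying numbers by hand.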
auto.lm2 <- lm(mpg ~ weight, data = auto)
summary(auto.lm2)
##
## Call:
## lm(formula = mpg ~ weight, data = auto)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -12.012  -2.801  -0.351   2.114  16.480
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.3173644  0.7952452   58.24   <2e-16 ***
## weight      -0.0076766  0.0002575  -29.81   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.345 on 396 degrees of freedom
## Multiple R-squared: 0.6918, Adjusted R-squared: 0.691
## F-statistic: 888.9 on 1 and 396 DF, p-value: < 2.2e-16
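We can describe the relationship between MPG and weight as: MPG = -0.0077 * weight + 46.32, with an R-squared of 0.69. That is, weight alone explains about 69% of the variance in MPG, which corresponds to a correlation of about -0.83.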
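Note that the summary output reports R-squared rather than a correlation coefficient. For a linear model with a single predictor, the Pearson correlation can be recovered as the square root of R-squared, carrying the sign of the slope. A minimal sketch, with the R-squared and slope values hard-coded from the two summaries above and `r_from_r2` as a hypothetical helper name:

```r
# Valid only for simple (one-predictor) linear regression.
r_from_r2 <- function(r_squared, slope) sign(slope) * sqrt(r_squared)

# Values copied from the summary(auto.lm1) and summary(auto.lm2) outputs above.
r_displacement <- r_from_r2(0.6467, -0.060282)
r_weight       <- r_from_r2(0.6918, -0.0076766)

print(round(c(displacement = r_displacement, weight = r_weight), 2))
```

Both values come out negative (roughly -0.80 and -0.83), which matches the downward trend in the scatter plots: heavier cars and larger engines go with lower MPG.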
From this quick analysis, two characteristics, displacement and weight, appear to have a stronger relationship with MPG than the others, and each relationship is described by the corresponding equation above.