Regression Analysis for Cars MPG Performance

Data come from the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.html)

Dataset: Auto MPG (https://archive.ics.uci.edu/ml/datasets/Auto+MPG)

Step 1: Setup

Load the data from the repository and add the field names manually, since the source file has no header row. Note: this is a narrow dataset, so typing the names by hand is feasible. For a wider one, I'd read the field names from a separate file and assign them to the data frame's column names.

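library(ggplot2)  # needed for the plots in Step 2
fileUrl <- "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
# The source file marks missing horsepower values with "?"; read them as NA
auto <- read.table(fileUrl, na.strings = "?", stringsAsFactors = FALSE)
names(auto) <- c("mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year", "origin", "car_name")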
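
For a wider dataset, typing the names by hand stops being practical. The following is a minimal sketch of the alternative mentioned above, assuming a plain-text file with one field name per line. The file name `field_names.txt` and the two-row sample are hypothetical, created here only so the example runs on its own.

```r
# Hypothetical names file: one field name per line (written to a temp dir for illustration)
names_file <- file.path(tempdir(), "field_names.txt")
writeLines(c("mpg", "cylinders", "displacement"), names_file)

# Hypothetical header-less data with the same number of columns (illustrative values)
sample_data <- read.table(text = "18 8 307\n15 8 350", header = FALSE)

# Read the names and assign them, checking first that the counts match
field_names <- readLines(names_file)
stopifnot(length(field_names) == ncol(sample_data))
names(sample_data) <- field_names
```

The length check guards against a names file that has drifted out of sync with the data, which would otherwise mislabel columns silently.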

Step 2: Exploratory Data Analysis

Look at the following car features and their relationship with MPG: horsepower, cylinders, displacement, and weight.

# Convert explicitly in case horsepower was read as text (the source file marks missing values with "?")
ggplot(auto, aes(as.numeric(as.character(horsepower)), mpg)) + geom_point() + geom_smooth()

ggplot(auto, aes(cylinders, mpg)) + geom_point() + geom_smooth()

ggplot(auto, aes(displacement, mpg)) + geom_point() + geom_smooth()

ggplot(auto, aes(weight, mpg)) + geom_point() + geom_smooth()

Analysis: In the plots, MPG falls as each of these variables increases. Displacement and weight appear to show the strongest relationships with MPG, so the models below focus on them.
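
The visual impression can be checked numerically. The following is a rough sketch, not part of the original analysis: the helper `mpg_correlations` is a hypothetical name, and R's built-in `mtcars` data set stands in for `auto` so the example runs without a download.

```r
# Pearson correlation of each predictor with the response, skipping missing values
mpg_correlations <- function(df, predictors, response = "mpg") {
  sapply(predictors, function(p) {
    cor(df[[p]], df[[response]], use = "complete.obs")
  })
}

# mtcars is only a stand-in; with the auto data the call would be
# mpg_correlations(auto, c("cylinders", "displacement", "weight"))
# (horsepower can be included once it is stored as a numeric column)
r <- mpg_correlations(mtcars, c("hp", "cyl", "disp", "wt"))
round(r, 2)
```

Values closer to -1 indicate a stronger negative linear relationship, which gives a simple way to rank the candidate predictors.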

Step 3: Defining the Model

Fit a simple linear model of MPG on each of the two variables singled out above, displacement and weight, and inspect the estimated parameters.

auto.lm1 <- lm(mpg ~ displacement, data = auto)
summary(auto.lm1)
## 
## Call:
## lm(formula = mpg ~ displacement, data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9550  -3.0569  -0.4928   2.3277  18.6192 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  35.174750   0.491824   71.52   <2e-16 ***
## displacement -0.060282   0.002239  -26.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.651 on 396 degrees of freedom
## Multiple R-squared:  0.6467, Adjusted R-squared:  0.6459 
## F-statistic:   725 on 1 and 396 DF,  p-value: < 2.2e-16

We can define the relationship between MPG and displacement as: MPG = -0.06 * displacement + 35.17, with an R-squared of 0.65. That is, displacement alone explains about 65% of the variance in MPG.
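
As a worked example of the fitted line, the sketch below plugs the coefficients from the summary output into a small prediction function. The name `predict_mpg` and the input value of 200 are illustrative assumptions. It also converts the reported R-squared into a correlation coefficient, which is valid for a model with a single predictor.

```r
# Coefficients copied from the summary(auto.lm1) output above
intercept <- 35.174750
slope     <- -0.060282

# Predicted MPG for a given displacement (hypothetical helper)
predict_mpg <- function(displacement) intercept + slope * displacement
predict_mpg(200)   # 200 is an arbitrary illustrative displacement

# In a one-predictor model, R-squared is the squared correlation,
# and the correlation takes the sign of the slope
r_squared <- 0.6467
r <- sign(slope) * sqrt(r_squared)
round(r, 2)
```

With the fitted model object available, `predict(auto.lm1, data.frame(displacement = 200))` would give the same prediction without copying coefficients by hand.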

auto.lm2 <- lm(mpg ~ weight, data = auto)
summary(auto.lm2)
## 
## Call:
## lm(formula = mpg ~ weight, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.012  -2.801  -0.351   2.114  16.480 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 46.3173644  0.7952452   58.24   <2e-16 ***
## weight      -0.0076766  0.0002575  -29.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.345 on 396 degrees of freedom
## Multiple R-squared:  0.6918, Adjusted R-squared:  0.691 
## F-statistic: 888.9 on 1 and 396 DF,  p-value: < 2.2e-16

We can define the relationship between MPG and weight as: MPG = -0.0077 * weight + 46.32, with an R-squared of 0.69. Weight alone explains about 69% of the variance in MPG, slightly more than displacement does.
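
Rather than copying numbers out of the printed summary, the equation and R-squared can be pulled from the model object. The following is a minimal sketch: `describe_fit` is a hypothetical helper, and the built-in `mtcars` data set stands in for `auto` so the code runs on its own.

```r
# Summarise a one-predictor lm fit as an equation string plus R-squared
describe_fit <- function(fit) {
  b <- coef(fit)  # named vector: intercept first, then slope
  list(
    equation  = sprintf("%s = %.4f * %s + %.2f",
                        all.vars(formula(fit))[1], b[[2]], names(b)[2], b[[1]]),
    r_squared = summary(fit)$r.squared
  )
}

# mtcars stand-in; with the auto data the call would be describe_fit(auto.lm2)
fit <- lm(mpg ~ wt, data = mtcars)
describe_fit(fit)
```

Building the equation from `coef()` keeps the variable name and the coefficients in step with the model that was actually fitted.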

Step 4: Conclusion

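From this quick analysis, I found that two characteristics, displacement and weight, appear to be more strongly related to MPG than the others I examined. Each of the simple linear models defined above explains roughly two thirds of the variance in MPG (R-squared of 0.65 and 0.69, respectively), with larger engines and heavier cars getting fewer miles per gallon.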