Abstract:

As climate change becomes an ever more real threat to the Earth in the minds of leaders across the world, many governments are taking action to mitigate greenhouse gas emissions. The automobile industry in particular is targeted for regulations and in many instances fuel economy standards are set in place for car manufacturers to meet. If manufacturers continue to produce cars with poor gas mileage in spite of the standard, they are often made to pay a tax. In order to meet these standards and avoid excess taxation, engineers and businesses must come up with creative solutions so as to reduce fuel consumption without compromising on performance.

Since this issue is so relevant to today’s engineering challenges and protecting the environment, our team has decided to analyze data in order to explore which avenues may be considered by engineering teams when attempting to meet fuel economy standards. To do this, we have used the mtcars data set, which has data on the design, performance and fuel economy of 32 automobiles (1973-74 models). All of the data therein was extracted from the 1974 Motor Trend US magazine.

Our exploratory analysis examines factors that may influence fuel economy (miles per gallon, MPG). The variables we have chosen to compare to MPG are horsepower and number of cylinders. We hypothesize that these variables will have a strong relationship to a car's fuel economy. Our hope is that the analysis will identify which components of a car do the most to reduce fuel economy.

In order to execute the analysis, we will use both Python and R. We begin by describing the data and then show the relationships between the design of a car and the miles per gallon it is able to achieve. A detailed report, including our code and methods, can be viewed below.

Analysis of mtcars

Our question is an exploratory one: which of the two data fields, number of cylinders or horsepower, correlates more closely with mpg? We want to understand how each correlates with the miles per gallon data.

Before choosing these variables, we considered others, such as weight, transmission type, and displacement. We recognize that some of these variables may depend on each other, such as displacement and weight. While we considered that other variables could also impact mpg, we chose to limit our analysis to two variables, number of cylinders and horsepower. Both are commonly used benchmarks for comparing vehicle performance.

We tested our question against the SMART criteria. Specific - we chose two specific variables for analysis on mpg. Measurable - the data is directly measurable. Answerable - we have sufficient data to determine the level of correlation and answer our question. Relevant - fuel economy is an important consideration in car comparison. Time-bound - we are using a data set from 1974 that is time-bound.

Source Data

We used the mtcars data set that is built into the R distribution.

The mtcars data comes from the 1974 Motor Trend US magazine. It includes fuel consumption and ten aspects of car design and performance for then-current car models.

First we look at the structure of the data set:

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

We find that it contains 32 rows and 11 variables. Now we look at some of the actual data - first few rows and last few rows.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
tail(mtcars)
##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

We see that the data appears tidy. Now we look at the descriptive statistics for each field (min, 1st quartile, median, mean, 3rd quartile, max):

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Base R has no built-in function for the statistical mode (the built-in mode() returns an object's storage mode instead), so we calculate it for each field.

mode_mpg <- names(sort(-table(mtcars$mpg)))[1]
mode_cyl <- names(sort(-table(mtcars$cyl)))[1]
mode_disp <- names(sort(-table(mtcars$disp)))[1]
mode_hp <- names(sort(-table(mtcars$hp)))[1]
mode_drat <- names(sort(-table(mtcars$drat)))[1]
mode_wt <- names(sort(-table(mtcars$wt)))[1]
mode_qsec <- names(sort(-table(mtcars$qsec)))[1]
mode_vs <- names(sort(-table(mtcars$vs)))[1]
mode_am <- names(sort(-table(mtcars$am)))[1]
mode_gear <- names(sort(-table(mtcars$gear)))[1]
mode_carb <- names(sort(-table(mtcars$carb)))[1]

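Then we print each mode. Where several values tie for the highest count, as happens for mpg, this approach reports only one of them.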

paste("The mode of the miles per gallon data is", mode_mpg)
## [1] "The mode of the miles per gallon data is 10.4"
paste("The mode of the number of cylinders data is", mode_cyl)
## [1] "The mode of the number of cylinders data is 8"
paste("The mode of the displacement data is", mode_disp)
## [1] "The mode of the displacement data is 275.8"
paste("The mode of the horsepower data is", mode_hp)
## [1] "The mode of the horsepower data is 110"
paste("The mode of the rear axle ratio data is", mode_drat)
## [1] "The mode of the rear axle ratio data is 3.07"
paste("The mode of the weight (1000 lbs) data is", mode_wt)
## [1] "The mode of the weight (1000 lbs) data is 3.44"
paste("The mode of the quarter mile time data is", mode_qsec)
## [1] "The mode of the quarter mile time data is 17.02"
paste("The mode of the V/S data is", mode_vs)
## [1] "The mode of the V/S data is 0"
paste("The mode of the transmission data is", mode_am)
## [1] "The mode of the transmission data is 0"
paste("The mode of the number of forward gears data is", mode_gear)
## [1] "The mode of the number of forward gears data is 3"
paste("The mode of the number of carburetors data is", mode_carb)
## [1] "The mode of the number of carburetors data is 2"
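
The eleven near-identical assignments above can be condensed with a small helper. The following is a minimal sketch rather than the method used in this report: `stat_mode` is a hypothetical name (not a base R function), and unlike the approach above it returns every value that ties for the highest count.

```r
# Hypothetical helper (not part of base R): returns all values that
# share the highest frequency, so ties are visible rather than hidden.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

# Apply the helper to every column of mtcars. The result is a named list
# because different columns can have different numbers of tied modes.
modes <- lapply(mtcars, stat_mode)
modes$cyl   # a single mode
modes$mpg   # several values tie for the highest count
```

Returning all tied values makes it clear when a "mode" is not unique, which is common for continuous fields such as mpg or qsec.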

To get a feel for the distribution of the data to be analyzed, we plot three histograms: the first for mpg, the second for the number of cylinders, and the third for hp.

library(ggplot2)
ggplot(mtcars, aes(mpg)) +
  geom_histogram(binwidth = 4) + xlab('Miles per Gallon') + ylab('Number of Cars') + 
   ggtitle('Distribution of Cars by Mileage')

Now we show the histogram for number of cylinders:

ggplot(mtcars, aes(cyl)) +
  geom_histogram(binwidth=1) + xlab('Cylinders') + ylab('Number of Cars') +
   ggtitle('Distribution of Cars by Cylinders')

Finally, we show the histogram for horsepower:

ggplot(mtcars, aes(hp)) +
  geom_histogram(binwidth=20) + xlab('Horsepower') + ylab('Number of Cars') +
  ggtitle('Distribution of Cars by Horsepower')

We see that the data spans a wide range of mpg, all three cylinder counts (4, 6, and 8), and a wide range of horsepower.
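
Because cyl takes only three distinct values, a frequency table gives the same information as the cylinder histogram in exact numbers. A minimal sketch using base R's table() and prop.table(); the variable name `cyl_counts` is our own:

```r
# Count how many cars fall into each cylinder class.
cyl_counts <- table(mtcars$cyl)
cyl_counts

# Express the same counts as a share of all cars in the data set.
round(prop.table(cyl_counts), 2)
```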

Now we look at correlation of hp and mpg.

cor(mtcars$mpg, mtcars$hp)
## [1] -0.7761684

We find a fairly strong negative correlation.
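
By default, cor() computes Pearson's correlation coefficient. To make explicit what that number measures, the sketch below rebuilds it from its definition (the sum of products of deviations from the means, divided by the square root of the product of the summed squared deviations) and compares it with cor(). The helper name `pearson_r` is hypothetical.

```r
# Hypothetical helper: Pearson's r computed from its definition.
pearson_r <- function(x, y) {
  dx <- x - mean(x)   # deviations of x from its mean
  dy <- y - mean(y)   # deviations of y from its mean
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual <- pearson_r(mtcars$mpg, mtcars$hp)
r_manual

# Check that the hand-computed value agrees with cor().
all.equal(r_manual, cor(mtcars$mpg, mtcars$hp))
```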

Plotting the Data - HP vs. MPG

Below is the effect that horsepower has on mpg. We have also shown transmission type (manual = 1, auto = 0) as a point of reference, but it is not a primary part of our analysis.
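
The code for this plot is not reproduced in the report. The following is a minimal sketch of one way such a plot could be drawn with ggplot2, assuming transmission type is shown by point colour; the legend labels, the title, and the object name `hp_plot` are our own choices.

```r
library(ggplot2)

# Scatter plot of mpg against hp. The am column is converted to a labelled
# factor (0 = automatic, 1 = manual) so ggplot2 uses a discrete colour scale.
hp_plot <- ggplot(mtcars, aes(hp, mpg,
                              colour = factor(am, labels = c("Automatic", "Manual")))) +
  geom_point() +
  labs(x = "Horsepower", y = "Miles per Gallon", colour = "Transmission",
       title = "Horsepower vs. MPG")
hp_plot
```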

Fitting the Data - HP vs. MPG

We then apply linear regression to fit the data to a line. We use geom_smooth with the linear model method.

ggplot(mtcars, aes(hp, mpg)) + geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  ylab("Miles per Gallon") +
  xlab("Horsepower") +
  ggtitle("Impact of Horsepower on MPG")
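
geom_smooth(method = "lm") draws an ordinary least-squares line, which lm() can also fit directly. The sketch below extracts the intercept and slope of that line and the horsepower at which it would predict 0 mpg; the variable names `fit_hp` and `zero_mpg_hp` are our own.

```r
# Fit the same straight line that geom_smooth(method = "lm") draws.
fit_hp <- lm(mpg ~ hp, data = mtcars)
coef(fit_hp)                 # intercept and slope (change in mpg per unit of hp)
summary(fit_hp)$r.squared    # share of the variance in mpg explained by hp

# Horsepower at which the fitted line would cross 0 mpg.
zero_mpg_hp <- unname(-coef(fit_hp)["(Intercept)"] / coef(fit_hp)["hp"])
zero_mpg_hp
```

For a single-predictor linear model, the R-squared value equals the square of the Pearson correlation between the two variables.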

Since mpg cannot fall to zero as hp increases, we would expect the relationship to flatten out rather than follow a straight line. So let’s apply stat_smooth, which by default fits a loess curve to a data set of this size, to get a better fit.

# apply smoothing since mpg is unlikely to hit zero
ggplot(mtcars, aes(hp, mpg)) +
  stat_smooth() + geom_point() +
  ylab("Miles per Gallon") +
  xlab("Horsepower") +
  ggtitle("Impact of Horsepower on MPG")
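
With no method argument, stat_smooth() uses a loess smoother for data sets with fewer than 1,000 observations. The sketch below fits a loess curve directly with loess() and compares its predictions with the straight-line fit at a few horsepower values. The three hp values are arbitrary illustration points, and the variable names are our own.

```r
# Fit a loess curve directly, using loess() with its default settings.
fit_loess <- loess(mpg ~ hp, data = mtcars)

# Predicted mpg at a few illustrative horsepower values (chosen arbitrarily,
# all inside the observed hp range so loess does not need to extrapolate).
new_hp <- data.frame(hp = c(100, 200, 300))
pred_loess <- predict(fit_loess, newdata = new_hp)
pred_loess

# Straight-line predictions at the same points, for comparison.
pred_lm <- predict(lm(mpg ~ hp, data = mtcars), newdata = new_hp)
pred_lm
```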

Effect of Number of Cylinders on MPG

Now we do a similar analysis as above, but instead by looking at the number of cylinders and its effect on miles per gallon.

The correlation of mpg and cyl is shown below.

cor(mtcars$mpg, mtcars$cyl)
## [1] -0.852162

This is an even stronger negative correlation (-0.85) than the one for horsepower.

Doing a quick scatter plot yields the following.

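# Note: qplot() is deprecated as of ggplot2 3.4.0; ggplot() + geom_point() is the current equivalent.
qplot(cyl, mpg, data = mtcars, colour = cyl, geom = "point",
  ylab = "Miles per Gallon", xlab = "No. of Cylinders",
  main = "Impact of Number of Cylinders on MPG")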

Fitting the Data - Cyl vs. MPG

We then apply linear regression to fit the data to a line. We use geom_smooth with the linear model method.

ggplot(mtcars, aes(cyl, mpg)) + geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  ylab("Miles per Gallon") + xlab("No. of Cylinders") +
  ggtitle("Impact of Number of Cylinders on MPG")
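
Since the cars fall into only three cylinder classes, the fitted line is in effect summarising three groups. A minimal sketch that computes the mean mpg of each group with tapply(), to show how far apart the groups sit; the variable name `mpg_by_cyl` is our own.

```r
# Mean mpg for each cylinder class (4, 6, and 8).
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
round(mpg_by_cyl, 1)

# Change in mean mpg when moving from one cylinder class to the next.
diff(mpg_by_cyl)
```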

Summary

Our analysis shows a strong negative correlation between miles per gallon and both horsepower (-0.78) and number of cylinders (-0.85).

As horsepower or the number of cylinders increases, miles per gallon decreases. While both correlations are strong and negative, the number of cylinders correlates more closely with miles per gallon than horsepower does, which answers our exploratory question.
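
Horsepower and number of cylinders are also correlated with each other, so the two pairwise correlations with mpg are not independent pieces of evidence. A minimal sketch that shows all three pairwise correlations in one matrix; the variable name `cor_mat` is our own.

```r
# Pairwise Pearson correlations among the three fields of interest.
cor_mat <- cor(mtcars[, c("mpg", "hp", "cyl")])
round(cor_mat, 2)

# Correlation between the two explanatory variables themselves.
cor_mat["hp", "cyl"]
```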

Also, more than half of the 1974 models achieved 20 mpg or less (median 19.2, mean 20.1), whereas the average for cars in 2016 is 27.8 mpg. This is consistent with our premise that fuel efficiency matters to the car industry, regulators, and consumers alike.

References:

Motor Trend Car Road Tests. October 7th, 2016. https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html

Energy Policy and Conservation Act. October 7th, 2016. http://legcounsel.house.gov/Comps/EPCA.pdf

Automobiles on Steroids: Product Attribute Trade-Offs and Technological Progress in the Automobile Sector; Christopher R. Knittel, 2009 http://web.mit.edu/knittel/www/papers/steroids_latest.pdf

A Brief History of U.S. Fuel Efficiency Standards; Union of Concerned Scientists http://www.ucsusa.org/clean-vehicles/fuel-efficiency/fuel-economy-basics.html#.V_v9-pMrL-Z

Oil Embargo, 1973-1974 U.S. State Dept. https://history.state.gov/milestones/1969-1976/oil-embargo

For calculating the mode:

[R] find the mode of a dataset; T Lumley, December 24, 1999 https://stat.ethz.ch/pipermail/r-help/1999-December/005668.html

For ggplot2 suggestions:

ggplot2 - An implementation of the grammar of graphics; Hadley Wickham, 2007 http://ggplot2.org/resources/2007-vanderbilt.pdf

Blog: 10 Reasons to switch to ggplot. Mandy Mejia; Nov 13, 2013 https://mandymejia.wordpress.com/2013/11/13/10-reasons-to-switch-to-ggplot-7/

Package ‘ggplot2’, August 29, 2016 https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf

Data Visualization with ggplot2 Cheat Sheet; March 2015 https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

For ggplot fitting and smoothing:

Add a smoother. ggplot2 help documents http://docs.ggplot2.org/0.9.3.1/stat_smooth.html

Lines: horizontal, vertical, and specified by slope and intercept. ggplot2 help documents http://docs.ggplot2.org/current/geom_abline.html