Executive Summary

The period between 1970 and 1982 marked a significant shift in the United States car industry. American production shifted from heavy, powerful six- and eight-cylinder cars with poor gas mileage to lighter, less powerful, four-cylinder cars with higher fuel efficiency. The global auto industry–including Americans and their European and Japanese competitors–raised overall miles per gallon (MPG) by focusing on four-cylinder cars and making them more fuel efficient.

Methodology

In this project, I used R to examine the “cars_multi” and “cars_price” datasets. I chose R because it is my preferred tool for…

  1. Examining a new data set
  2. Cleaning the data
  3. Performing exploratory data analysis, including descriptive statistics and preliminary visualizations

While not my preferred tool for more elaborate visualizations, R does facilitate static explanatory visualizations. If I were to create a dynamic, interactive visualization, I would likely opt for D3.js. That said, R seems like a good tool for the purpose of this project.

I will use the following R libraries to assist in my analysis:

library(magrittr)
library(plyr)
library(dplyr)
library(ggplot2)
library(grid)
library(gridExtra)

Loading the Data

The first step in the process of analyzing the datasets is loading them into R dataframes, which I will call “cars” and “prices”, and then joining prices with cars based on the ID.

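# Load CSV files
# stringsAsFactors = TRUE reproduces the factor columns shown in the output
# below; since R 4.0 read.csv imports text columns as character by default.
cars = read.csv('DS Engineering project/cars_multi.csv', header=TRUE,
                stringsAsFactors=TRUE)
prices = read.csv('DS Engineering project/cars_price.csv', header=TRUE)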

# Merge the two dataframes together using the ID field
cars = cars %>%
  left_join(prices, by = "ID")
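A left join keeps every row of the left table, so a quick sanity check after merging is worthwhile. The sketch below uses base R's merge with all.x = TRUE (the base equivalent of a left join) on two small, made-up data frames; cars_demo and prices_demo are hypothetical stand-ins, not the project files.

```r
# Hypothetical miniature versions of the two tables (not the real CSVs).
cars_demo <- data.frame(ID = 1:4, mpg = c(18, 15, 24, 31))
prices_demo <- data.frame(ID = 1:3, price = c(25000, 24000, 27000))

# Left join: keep every car, attach a price where one exists.
joined <- merge(cars_demo, prices_demo, by = "ID", all.x = TRUE)

# Check 1: the join must not add or drop cars.
stopifnot(nrow(joined) == nrow(cars_demo))

# Check 2: which cars ended up without a price?
missing_ids <- joined$ID[is.na(joined$price)]

# Check 3: duplicated IDs in the price table would multiply rows.
dup_ids <- prices_demo$ID[duplicated(prices_demo$ID)]

print(missing_ids)
print(length(dup_ids))
```

The same three checks would apply unchanged to the result of left_join on the real data.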

Descriptive Statistics

R’s str function gives me a look at the data types in the “cars” dataset. The summary function lets me see basic summary statistics for each column.

str(cars)
## 'data.frame':    398 obs. of  11 variables:
##  $ ID          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : Factor w/ 94 levels "?","100","102",..: 17 35 29 29 24 42 47 46 48 40 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ model       : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ car_name    : Factor w/ 305 levels "amc ambassador brougham",..: 50 37 232 15 162 142 55 224 242 2 ...
##  $ price       : num  25562 24221 27241 33685 20000 ...
summary(cars)
##        ID             mpg          cylinders      displacement  
##  Min.   :  1.0   Min.   : 9.00   Min.   :3.000   Min.   : 68.0  
##  1st Qu.:100.2   1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.2  
##  Median :199.5   Median :23.00   Median :4.000   Median :148.5  
##  Mean   :199.5   Mean   :23.51   Mean   :5.455   Mean   :193.4  
##  3rd Qu.:298.8   3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0  
##  Max.   :398.0   Max.   :46.60   Max.   :8.000   Max.   :455.0  
##                                                                 
##    horsepower      weight      acceleration       model      
##  150    : 22   Min.   :1613   Min.   : 8.00   Min.   :70.00  
##  90     : 20   1st Qu.:2224   1st Qu.:13.82   1st Qu.:73.00  
##  88     : 19   Median :2804   Median :15.50   Median :76.00  
##  110    : 18   Mean   :2970   Mean   :15.57   Mean   :76.01  
##  100    : 17   3rd Qu.:3608   3rd Qu.:17.18   3rd Qu.:79.00  
##  75     : 14   Max.   :5140   Max.   :24.80   Max.   :82.00  
##  (Other):288                                                 
##      origin                car_name       price      
##  Min.   :1.000   ford pinto    :  6   Min.   : 1598  
##  1st Qu.:1.000   amc matador   :  5   1st Qu.:23110  
##  Median :1.000   ford maverick :  5   Median :30000  
##  Mean   :1.573   toyota corolla:  5   Mean   :29684  
##  3rd Qu.:2.000   amc gremlin   :  4   3rd Qu.:36430  
##  Max.   :3.000   amc hornet    :  4   Max.   :53746  
##                  (Other)       :369

Cleaning and Prepping the Data

Based on the results of the str(cars) function above, I see several issues with how the read.csv function imported the data that need to be cleaned up before going in-depth with the analysis. I will fix those in the code below:

# Cylinders came in as an integer, when it should be a multi-valued discrete, 
# otherwise known as a "factor" in R. 
cars$cylinders = cars$cylinders %>%
  factor(labels = sort(unique(cars$cylinders)))

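# Horsepower was imported as a factor because missing values are coded as "?".
# It should be a continuous numerical variable. Converting via as.character 
# works whether the column is a factor or plain text; "?" entries become NA.
cars$horsepower = suppressWarnings(as.numeric(as.character(cars$horsepower)))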

# I will change the model (year) column from an integer to a categorical factor.
model_years = sort(unique(cars$model))
cars$model = cars$model %>%
  factor(labels = model_years)

# I am converting the origin column from an integer to a descriptive categorical variable. By looking at the data and cross-referencing it with my knowledge of which brands are based in which countries, I can discern that the origins in this dataset are 
# 1 = United States ("USA") 
# 2 = Europe
# 3 = Japan
head(cars[,c('car_name','origin')], 30)
##                        car_name origin
## 1     chevrolet chevelle malibu      1
## 2             buick skylark 320      1
## 3            plymouth satellite      1
## 4                 amc rebel sst      1
## 5                   ford torino      1
## 6              ford galaxie 500      1
## 7              chevrolet impala      1
## 8             plymouth fury iii      1
## 9              pontiac catalina      1
## 10           amc ambassador dpl      1
## 11          dodge challenger se      1
## 12           plymouth 'cuda 340      1
## 13        chevrolet monte carlo      1
## 14      buick estate wagon (sw)      1
## 15        toyota corona mark ii      3
## 16              plymouth duster      1
## 17                   amc hornet      1
## 18                ford maverick      1
## 19                 datsun pl510      3
## 20 volkswagen 1131 deluxe sedan      2
## 21                  peugeot 504      2
## 22                  audi 100 ls      2
## 23                     saab 99e      2
## 24                     bmw 2002      2
## 25                  amc gremlin      1
## 26                    ford f250      1
## 27                    chevy c20      1
## 28                   dodge d200      1
## 29                     hi 1200d      1
## 30                 datsun pl510      3
origins <- c('USA', 'Europe', 'Japan')
cars$origin <- factor(cars$origin, labels = origins)
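The horsepower conversion above deserves a short illustration, because calling as.numeric directly on a factor returns the internal level codes rather than the values the labels spell out. The vector below is a made-up three-element example, not the real column.

```r
# Hypothetical example values, including the "?" placeholder for missing data.
hp <- factor(c("130", "?", "95"))

# Pitfall: this returns the integer level codes, NOT the horsepower values.
codes <- as.numeric(hp)

# Safe route: go through the labels first; "?" cannot be parsed and becomes NA.
values <- suppressWarnings(as.numeric(as.character(hp)))

print(codes)
print(values)  # 130 NA 95
```

The as.numeric(levels(x))[x] idiom gives the same values for a factor; the as.character route has the advantage of also working on a plain character column.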

Univariate Plots

In this section I will take a look at the distribution of values for each variable in the dataset by creating histograms using ggplot2’s qplot function. I am trying to find out if there is more data to clean up, including outliers or extraneous values. This also might help me begin to identify any relationships between variables that are worth investigating further.

# Miles Per Gallon
qplot(cars$mpg, xlab = 'Miles Per Gallon', ylab = 'Count', binwidth = 2, 
      main='Frequency Histogram: Miles per Gallon')

# Number of Cylinders
qplot(cars$cylinders, xlab = 'Cylinders', ylab = 'Count', 
      main='Frequency Histogram: Number of Cylinders')

table(cars$cylinders)
## 
##   3   4   5   6   8 
##   4 204   3  84 103
# Based on the relatively tiny counts of three- and five-cylinder cars (4 and 3, respectively), I am removing those completely because they end up being a distraction in later plots
cars = cars[!cars$cylinders %in% c(3, 5),]
qplot(cars$cylinders, ylab = 'Count', xlab = 'Cylinders')

There are roughly twice as many four-cylinder cars (204) in the sample as there are six-cylinder (84) or eight-cylinder (103) cars.

# Displacement
qplot(cars$displacement, xlab = 'Displacement', ylab = 'Count', binwidth = 20,
      main='Frequency Histogram: Displacement')

# Horsepower
qplot(cars$horsepower, xlab = 'Horsepower', ylab = 'Count', binwidth = 10,
      main='Frequency Histogram: Horsepower')

# Weight
qplot(cars$weight, xlab = 'Weight', ylab = 'Count', binwidth = 200,
      main='Frequency Histogram: Weight')

The distributions for MPG, displacement, horsepower, and weight are all skewed right, with a longer tail toward the higher end of the scale. There are also many more four-cylinder cars than six- or eight-cylinder cars. Similar shapes do not prove a relationship, but they are consistent with my intuition that these variables are strongly correlated with one another.

# Acceleration
qplot(cars$acceleration, xlab = 'Acceleration', ylab = 'Count', binwidth = 1,
      main='Frequency Histogram: Acceleration')

Acceleration has a roughly symmetric, approximately normal distribution, so perhaps it is less correlated with the other variables.
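Skew can also be put into a number instead of being read off a histogram. The sketch below defines a simple moment-based skewness; skewness_simple is a hypothetical helper of my own (base R has no built-in skewness function), and the two vectors are made-up examples.

```r
# Moment-based sample skewness:
#   positive = longer right tail, negative = longer left tail, 0 = symmetric.
# 'skewness_simple' is a hypothetical helper, not part of base R.
skewness_simple <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

right_skewed <- c(1, 1, 2, 2, 3, 10)  # one large value stretches the right tail
symmetric <- c(1, 2, 3, 4, 5)

print(skewness_simple(right_skewed))  # positive
print(skewness_simple(symmetric))     # 0

# On the real data this could be applied column by column, for example:
# sapply(cars[, c('mpg', 'weight', 'acceleration')], skewness_simple)
```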

# Model Year
qplot(cars$model, xlab = 'Model Year', ylab = 'Count', 
      main='Frequency Histogram: Model Year')

The counts from each model year are roughly equivalent across the sample.

# Rows with a missing horsepower value are dropped before computing correlations
cor(cars[ , c('weight', 'displacement', 'horsepower', 'acceleration')], 
     use='complete.obs')
##                  weight displacement horsepower acceleration
## weight        1.0000000    0.9352956  0.8689058   -0.4310667
## displacement  0.9352956    1.0000000  0.9028305   -0.5606220
## horsepower    0.8689058    0.9028305  1.0000000   -0.6920508
## acceleration -0.4310667   -0.5606220 -0.6920508    1.0000000

From the matrix, three pairwise correlations are above 0.8: weight and displacement (0.94), displacement and horsepower (0.90), and weight and horsepower (0.87).

These values support the guess above that weight, displacement, and horsepower are strongly correlated. Therefore, I can use weight as my primary variable to plot as a driver of MPG, with the knowledge that displacement and horsepower would show similar results. Also, as expected, acceleration’s pairwise correlations with weight, displacement, and horsepower are weaker, and all of them are negative. Acceleration’s strongest correlation is with horsepower at around -0.69.
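Picking the strong pairs out of a correlation matrix by eye gets error-prone as the number of variables grows. The sketch below filters the upper triangle of the matrix against a threshold. It uses R's built-in mtcars dataset purely as a stand-in for the cars data, and the 0.8 cutoff mirrors the one used above.

```r
# mtcars ships with R; it is only a stand-in for the project's cars data.
cm <- cor(mtcars[, c("wt", "disp", "hp", "qsec")])

# Keep each pair once (upper triangle) and only where |r| exceeds the cutoff.
threshold <- 0.8
idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)

strong <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r = cm[idx]
)
print(strong)
```

Using abs() matters here: a strong negative correlation, such as the one between acceleration and horsepower, would otherwise be filtered out.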

# Origin
qplot(cars$origin, xlab = 'Origin', ylab = 'Count', main='Frequency Histogram: Origin')

table(cars$origin)
## 
##    USA Europe  Japan 
##    249     67     75

There are more than three times as many cars from the USA as from either Europe or Japan in the sample.

# Price
qplot(cars$price, xlab = 'Price', ylab = 'Count', main='Frequency Histogram: Price')

summary(cars$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1598   23010   30000   29580   36170   53750

Prices are roughly normally distributed, with the mean and median both around $30,000. The histogram would look smoother but for the spikes at the round numbers of $20,000, $30,000, and $40,000. Apparently, automakers and/or their customers prefer round numbers.

Bivariate Plots

In this section I will use more ggplot2 charting techniques to visualize how one variable relates to another. I am starting with how weight relates to MPG, using a scatter plot overlaid with a linear best-fit line.

ggplot(data = cars, aes(x = weight, y = mpg)) +
  geom_point() +
  geom_smooth(method='lm') +
  xlab('Weight') +
  ylab('MPG') +
  ggtitle('MPG vs. Weight: Entire Sample')

The data clearly shows that weight and MPG are inversely related: as weight increases, MPG decreases. The R-squared of the linear best-fit line, as shown below, is about 0.71. This means that a linear function of weight accounts for roughly 71% of the variance in MPG across the sample.

fit = lm(mpg ~ weight, data=cars)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ weight, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.6770 -2.7567 -0.3636  2.1120 16.3712 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 46.600189   0.779849   59.76   <2e-16 ***
## weight      -0.007759   0.000252  -30.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.239 on 389 degrees of freedom
## Multiple R-squared:  0.709,  Adjusted R-squared:  0.7083 
## F-statistic: 947.9 on 1 and 389 DF,  p-value: < 2.2e-16
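The coefficients reported above can be read as a simple prediction rule. The sketch below plugs them into a small function; predict_mpg is a hypothetical helper name, and treating weight as pounds is my assumption, since the dataset does not state its units.

```r
# Coefficients copied from the summary(fit) output above.
intercept <- 46.600189
slope <- -0.007759   # change in MPG per one unit of weight

# Hypothetical helper: fitted MPG for a given weight
# (weight assumed to be in pounds; the source does not state units).
predict_mpg <- function(weight) intercept + slope * weight

print(predict_mpg(3000))                      # roughly 23.3
print(predict_mpg(4000) - predict_mpg(3000))  # about -7.8 per extra 1000 units
```

With the fitted model object in hand, predict(fit, newdata = data.frame(weight = 3000)) should give the same number without retyping coefficients.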

The next plot uses boxplots to show the median and distribution of MPG measurements for each year in the sample.

ggplot(data = cars, aes(x = model, y = mpg)) +
  geom_boxplot() +
  xlab('Model Year') +
  ylab('MPG') +
  ggtitle('MPG Comparison by Model Year')

The trend over time shows a meaningful increase in MPG from the lows of the early 70s to the highs of the early 80s.

Next I will use the same boxplot method to show how the distribution of MPG values compares across the region of origin.

ggplot(data = cars, aes(x = origin, y = mpg)) +
  geom_boxplot() +
  xlab('Region of Origin') +
  ylab('MPG') +
  ggtitle('MPG Comparison by Region of Origin')

The sample MPG values for the USA are noticeably below those of Europe and Japan. Based on the inverse relationship observed above between MPG and weight, I would guess that USA cars weigh considerably more than European and Japanese cars. I will use boxplots to verify my guess.

ggplot(data = cars, aes(x = origin, y = weight)) +
  geom_boxplot() +
  xlab('Region of Origin') +
  ylab('Weight') +
  ggtitle('Weight Comparison by Region of Origin') 

Indeed, my intuition was correct. American cars weigh more, on average. Now I will again use boxplots to compare MPG values for cars with different numbers of cylinders.

ggplot(data = cars, aes(x = cylinders, y = mpg)) +
  geom_boxplot() +
  xlab('Number of Cylinders') +
  ylab('MPG') +
  ggtitle('MPG Comparison by Number of Cylinders') 

The more cylinders a car has, the worse its fuel efficiency. Four-cylinder cars are probably lighter too. A boxplot can help us visualize those differences.

ggplot(data = cars, aes(x = cylinders, y = weight)) +
  geom_boxplot() +
  xlab('Number of Cylinders') +
  ylab('Weight') +
  ggtitle('Weight Comparison by Number of Cylinders') 

Yes, four-cylinder cars weigh less. A story is emerging. American cars weigh more and get worse gas mileage. It probably stands to reason that the USA automakers are producing more 6- and 8-cylinder cars. Let’s plot the breakdown of cars in the sample by cylinder count.

ggplot(data = cars, aes(x = cylinders, fill = origin)) +
  geom_bar() +
  xlab('Number of Cylinders') +
  ylab('Count') +
  ggtitle('Cars from Each Region by Number of Cylinders')

This chart does a good job of illustrating the types of cars in each region. The United States produces all of the eight-cylinder cars in the sample, and most of the six-cylinder cars. However, the number of four cylinder cars is roughly equal across regions.
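Stacked bars show counts, but the product mix is easier to compare as shares within each region. The sketch below cross-tabulates origin against cylinder count and converts each row to proportions. The mix_demo data frame is a made-up miniature, not the real sample.

```r
# Hypothetical miniature sample (not the real data).
mix_demo <- data.frame(
  origin = c("USA", "USA", "USA", "USA", "Europe", "Europe", "Japan", "Japan"),
  cylinders = c(4, 6, 8, 8, 4, 4, 4, 6)
)

# Counts per region and cylinder class.
counts <- table(mix_demo$origin, mix_demo$cylinders)

# margin = 1 normalizes within each row, so every region sums to 1.
shares <- prop.table(counts, margin = 1)

print(counts)
print(round(shares, 2))
```

On the real data, table(cars$origin, cars$cylinders) would take the place of the toy counts.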

Price Analysis

What about prices as they relate to weight?

ggplot(data = cars, aes(x = weight, y = price)) +
  geom_point() +
  xlab('Weight') +
  ylab('Price') +
  ggtitle('Price vs. Weight')

There is no obvious correlation between weight and price; indeed, there are car models at every price point up and down the range of weights. Now let’s look at price vs. acceleration.

ggplot(data = cars, aes(x = acceleration, y = price)) +
  geom_point() +
  xlab('Acceleration') +
  ylab('Price') +
  ggtitle('Price vs. Acceleration')

There is no obvious correlation between acceleration and price either.

fit = lm(price ~ acceleration, data=cars)
summary(fit)
## 
## Call:
## lm(formula = price ~ acceleration, data = cars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27356.6  -6528.5    294.5   6371.6  24385.4 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   26418.0     2840.7   9.300   <2e-16 ***
## acceleration    202.9      179.7   1.129    0.259    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9772 on 389 degrees of freedom
## Multiple R-squared:  0.003268,   Adjusted R-squared:  0.0007057 
## F-statistic: 1.275 on 1 and 389 DF,  p-value: 0.2594

Another way of illustrating the weakness of the price-to-acceleration relationship is the linear regression model’s R-squared of just 0.003. Very little of the variance in price is explained by acceleration. Let’s take a last look at prices for different numbers of cylinders:

ggplot(data = cars, aes(x = cylinders, y = price)) +
  geom_boxplot() +
  xlab('Cylinders') +
  ylab('Price') +
  ggtitle('Price vs. Number of Cylinders')

And let’s look at prices over time:

ggplot(data = cars, aes(x = model, y = price)) +
  geom_boxplot() +
  xlab('Year') +
  ylab('Price') +
  ggtitle('Price over Time')

And let’s look at prices from the different regions of origin:

ggplot(data = cars, aes(x = origin, y = price)) +
  geom_boxplot() +
  xlab('Origin') +
  ylab('Price') +
  ggtitle('Prices by Region')

And, finally, let’s look at prices based on fuel-efficiency.

ggplot(data = cars, aes(x = mpg, y = price)) +
  geom_point() +
  xlab('MPG') +
  ylab('Price') +
  ggtitle('Prices by MPG')

After looking at prices from every angle, I see no meaningful difference in prices along any dimension. It is therefore reasonable to infer that the ability to sell cars at higher prices was not what drove automakers to raise fuel efficiency industry-wide.
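The visual impression that price is unrelated to the other variables could be backed with one number per predictor. The sketch below fits a one-variable regression of price on each predictor and collects the R-squared values. The price_demo data frame is randomly generated stand-in data with hypothetical means and spreads, so its results say nothing about the real sample.

```r
# Randomly generated stand-in data (assumed means/spreads, not the real sample).
set.seed(1)
n <- 200
price_demo <- data.frame(
  price = rnorm(n, mean = 30000, sd = 10000),
  weight = rnorm(n, mean = 3000, sd = 800),
  acceleration = rnorm(n, mean = 15.5, sd = 2.7),
  mpg = rnorm(n, mean = 23.5, sd = 7.8)
)

# One simple regression per predictor; keep only the R-squared of each fit.
predictors <- c("weight", "acceleration", "mpg")
r2 <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "price"), data = price_demo)
  summary(fit)$r.squared
})

print(round(r2, 3))
```

Values near zero for every predictor would support the "no meaningful difference" reading; a single large value would flag a dimension worth a closer look.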

Multivariate Plots

This section includes charts that involve three or more variables simultaneously, to give us a more complete look at the questions that presented themselves in the previous sections. Building on the earlier chart of cylinder counts by region, I want to see how each region’s product mix has evolved over time. The best way to illustrate this is with a stacked bar chart over time for each region.

ggplot(data = cars, aes(x = model, fill = cylinders)) +
  geom_bar() +
  facet_wrap(~ origin, ncol = 1) +
  xlab('Model Year') +
  ylab('Count') +
  ggtitle('Each Region of Origin\'s Product Mix Over Time')

As shown in the top section, while the number of four-cylinder cars increases over time, six- and eight-cylinder cars make up the majority of the United States’ product mix until 1980. Europe and Japan almost exclusively produce four-cylinder cars, with just a few exceptions, over the entire 13-year period. We can see this phenomenon illustrated when we compare each region’s weight distributions per year using boxplots.

ggplot(data = cars, aes(x = model, y = weight)) +
  geom_boxplot() +
  facet_wrap(~ origin) +
  xlab('Model Year') +
  ylab('Weight') +
  ggtitle('Weight Distributions Over Time by Region of Origin')

As we can see, US cars show much higher average weights than Europe and, especially, Japan, until about 1980, when US weight distribution comes down considerably. From above we know that 1980 is when the US converted to a higher percentage of four-cylinder cars. Note that average weights stay more constant for Europe and Japan over the same time period.

Now we can create a similar comparative boxplot for MPG over time.

ggplot(data = cars, aes(x = model, y = mpg)) +
  geom_boxplot() +
  facet_wrap(~ origin) +
  xlab('Model Year') +
  ylab('MPG') +
  ggtitle('Evolution of MPG Over Time by Region of Origin: All Cars')

The average MPG for each region shows an upward trend, especially towards the end of the 70’s and into the early 80’s. Since Europe and Japan also increased MPG, it is apparent that increasing overall fuel economy was not solely about changing the product mix away from six- and eight-cylinder cars. Indeed, the fuel economy of four-cylinder cars increased over time. We can see that more clearly by restricting our analysis to include only four-cylinder cars.

ggplot(data = subset(cars, cylinders==4), aes(x = model, y = mpg)) +
  geom_boxplot() +
  facet_wrap(~ origin) +
  xlab('Model Year') +
  ylab('MPG') +
  ggtitle('Evolution of MPG Over Time by Region of Origin: 4-Cylinder Cars Only') 

This plot shows that the USA, Europe, and Japan all started out with low fuel efficiency in the four-cylinder category in the early 70s. Over time, all three were able to increase the average fuel efficiency of their four-cylinder cars, with Europe and Japan doing so at a faster rate than the Americans. Plotting this as a scatterplot with a smoothed trend line (geom_smooth’s default smoother rather than a straight-line fit) helps illustrate this point a little more clearly.

ggplot(data = subset(cars, cylinders==4), aes(x = model, y = mpg, group = 1)) +
  geom_point() +
  facet_wrap(~ origin) +
  geom_smooth() +
  xlab('Model Year') +
  ylab('MPG') +
  ggtitle('Evolution of MPG Over Time by Region of Origin: 4-Cylinder Cars Only')

In digging into why American fuel-efficiency in its four-cylinder cars did not increase as quickly as that of its rivals, it helps to look at the evolution of the average weight of American four-cylinder cars.

ggplot(data = subset(cars, cylinders==4), aes(x = model, y = weight, group = 1)) +
  geom_point() +
  facet_wrap(~ origin) +
  geom_smooth() +
  xlab('Model Year') +
  ylab('Weight') +
  ggtitle('Evolution of Weight Over Time by Region of Origin: 4-Cylinder Cars Only')

This plot shows that the Americans increased the average weight of a 4-cylinder car while Europe and Japan kept weight pretty constant. So while American fuel-efficiency technology may have been on par with that of Europe and Japan, the US simultaneously increased the weight of its four-cylinder cars, thus creating a drag on improvements in fuel efficiency. This can be illustrated by breaking each year into comparative boxplots.

ggplot(data = subset(cars, cylinders==4), aes(x = origin, y = mpg)) +
  geom_boxplot() +
  facet_wrap(~ model, ncol = 7) +
  ggtitle('MPG Comparison by Region of Origin: 4-Cylinder Cars Only') +
  xlab('Origin') + 
  ylab('MPG') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Japan and Europe demonstrate clear advantages over the USA in both 1980 and 1982, though 1981 is less clear.
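The claim that some regions improved faster than others could be quantified with a per-region slope of MPG against model year. The sketch below does this on made-up data built from exact straight lines, so the recovered slopes are known in advance. The slope_demo data frame and its numbers are hypothetical.

```r
# Hypothetical data: each region follows an exact linear trend by construction.
slope_demo <- data.frame(
  origin = rep(c("USA", "Japan"), each = 5),
  year = rep(c(70, 73, 76, 79, 82), times = 2)
)
slope_demo$mpg <- ifelse(slope_demo$origin == "USA",
                         22 + 0.5 * (slope_demo$year - 70),  # +0.5 MPG per year
                         24 + 1.0 * (slope_demo$year - 70))  # +1.0 MPG per year

# Fit one regression per region and keep the year coefficient (MPG per year).
slopes <- sapply(split(slope_demo, slope_demo$origin), function(d) {
  coef(lm(mpg ~ year, data = d))[["year"]]
})

print(round(slopes, 3))
```

On the real data the model column is a factor, so it would first need converting back to a number, for example with as.numeric(as.character(cars$model)).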

Conclusions

The weight of a car is a strong determinant of its fuel-efficiency, as expressed by MPG. Four-cylinder cars are the lightest, and eight-cylinder cars are the heaviest. Therefore, four-cylinder cars get the best gas mileage.

American, European, and Japanese car makers increased average MPG from 1970 to 1982.

One way that US automakers increased average MPG was by shifting their product mix to include more four-cylinder cars, mimicking their European and Japanese competitors.

The other way that US automakers increased average MPG was by getting more efficiency out of those four-cylinder cars. American efficiency improvements were not as pronounced as those of the foreign competition. A key driver of that gap is that the weight of American four-cylinder cars increased over the time period, while that of European and Japanese four-cylinder cars remained relatively constant.