Problem set 5 - Multivariate data analysis

This is a study of fuel consumption data obtained from a Canadian government website, http://data.gc.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64. It’s a great dataset for multivariate analysis, since it contains fuel consumption and CO2 emission numbers for many makes and models of cars, and provides details such as the number of cylinders and transmission type. The file I studied was the most up-to-date list provided, “Original Fuel Consumption Ratings - 2000-2014.” There are 14357 rows in the file! It’s a nice clean dataset too. I’m hoping that it will tell me which manufacturer does the best job of producing efficient cars. Here’s a description of the various fields from the table:

# Model     4WD/4X4 = Four-wheel drive
#           AWD = All-wheel drive
#           CNG = Compressed natural gas
#           FFV = Flexible-fuel vehicle
#           NGV = Natural gas vehicle
#           # = High output engine that provides more power than the standard engine of the same size
# Transmission  A = Automatic
#               AM = Automated manual
#               AS = Automatic with select shift
#               AV = Continuously variable
#               M = Manual
#               3 – 10 = Number of gears
# Fuel Type X = Regular gasoline
#           Z = Premium gasoline
#           D = Diesel
#           E = Ethanol (E85)
#           N = Natural Gas
# Fuel Consumption  City and highway fuel consumption ratings are shown in litres per 100 kilometres (L/100 km) - combined rating (55% city, 45% hwy) is shown in L/100 km and in miles per imperial gallon (mpg)
# CO2 Emissions (g/km)  Estimated tailpipe carbon dioxide emissions (in grams per kilometre) are based on fuel type and the combined fuel consumption rating.
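
The combined rating and the mpg column described above follow from simple arithmetic. Here is a minimal sketch, assuming the 55%/45% weighting stated in the field description and the standard definitions of the imperial gallon (4.54609 L) and the mile (1.609344 km). The function names and the example ratings are my own, not part of the dataset, and the published figures may be rounded differently.

```r
# Combined rating as a 55/45 weighted average of city and highway (L/100 km).
combined_l100km <- function(city, highway) {
  0.55 * city + 0.45 * highway
}

# Convert L/100 km to miles per imperial gallon.
# 1 imperial gallon = 4.54609 L, 1 mile = 1.609344 km.
l100km_to_mpg_imp <- function(l100km) {
  (100 / 1.609344) * 4.54609 / l100km
}

city <- 10.0     # hypothetical city rating, L/100 km
highway <- 7.0   # hypothetical highway rating, L/100 km

combined <- combined_l100km(city, highway)
mpg <- l100km_to_mpg_imp(combined)

print(combined)
print(round(mpg, 1))
```

Note that a lower L/100 km figure means a higher mpg figure, so the two columns move in opposite directions.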

library(ggplot2) # ggplot and friends
library(GGally)  # ggpairs
library(dplyr)   # %>%, group_by, summarise
setwd('/Users/christopherkaalund/Documents/Study/Udacity Data Science/Data Analysis in R/Problem Set 5')
dffuel = read.csv('Original MY2000-2014 Fuel Consumption Ratings (2-cycle)_edited.csv',header=TRUE,sep=',') # edited out second row of header, as it ruined import
#dffuel = dffuel[-1,] # delete the first row, since it is part of a two row header
dffuel['X.3'] = NULL # delete empty column
dffuel['X.4'] = NULL # delete empty column
dffuel = dffuel[complete.cases(dffuel),] # remove rows at the end that are not data
names(dffuel) = c('Year','Make','Model','Class','Engine_size','Cylinders','Transmission','Fuel_type','Fuel_consumption_city','Fuel_consumption_highway','Fuel_consumption_combined','Fuel_consumption_combined_mpg','CO2_emissions') # Rename columns
dffuel$Cylinders = factor(dffuel$Cylinders) # Make cylinders a factor
dffuel$Engine_size = as.double(dffuel$Engine_size) # Make double
dffuel$Fuel_consumption_city = as.double(dffuel$Fuel_consumption_city)
dffuel$Fuel_consumption_highway = as.double(dffuel$Fuel_consumption_highway)
dffuel$Fuel_consumption_combined = as.double(dffuel$Fuel_consumption_combined)
dffuel$Fuel_consumption_combined_mpg = as.double(dffuel$Fuel_consumption_combined_mpg)
dffuel$CO2_emissions = as.double(dffuel$CO2_emissions)
dffuel_sub = dffuel[1:1000,] # A subset for faster plotting for test purposes
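
One thing worth checking about the as.double() conversions above: if read.csv() imported a numeric column as a factor (the default for text columns before R 4.0.0), as.double() returns the internal level codes rather than the printed values. Running str(dffuel) shows how each column was read. The sketch below uses a made-up factor to illustrate the difference; it is not taken from the real file.

```r
# Hypothetical column that was read in as a factor rather than as numbers.
engine_size <- factor(c("2.0", "3.5", "1.6", "3.5"))

# Converting the factor directly gives the level codes (positions in the
# sorted level list), not the engine sizes.
codes <- as.double(engine_size)

# Going through character first recovers the actual values.
values <- as.double(as.character(engine_size))

print(codes)
print(values)
```

If the columns are already numeric, as.double() leaves them unchanged, so the conversion through as.character() is harmless in either case.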

Firstly, I’ll make a scatterplot matrix to find out whether any variables are correlated. I’ll take a random sample of 1000 rows from the dataset so as not to overload the CPU on my computer. There are four columns for fuel consumption, the first three in units of L/100 km: ‘Fuel_consumption_city’, ‘Fuel_consumption_highway’, ‘Fuel_consumption_combined’, and ‘Fuel_consumption_combined_mpg’. For my scatterplot matrix, I’ll use only the third of these, ‘Fuel_consumption_combined’. I’ll also drop year, make, and model, since the general categories are more useful for this kind of plot.

set.seed(1)
dffuel_subset = dffuel[,c(4:8,11,13)]
names(dffuel_subset)

## [1] "Class"                     "Engine_size"              
## [3] "Cylinders"                 "Transmission"             
## [5] "Fuel_type"                 "Fuel_consumption_combined"
## [7] "CO2_emissions"

ggpairs(dffuel_subset[sample.int(nrow(dffuel_subset),1000),]) + 
  theme(axis.text=element_text(size=10)) # subset dataframe and sample from subset

Here are some of my observations of the scatterplot matrix:

1. CO2 emissions are estimates based on fuel type and fuel consumption ratings. A graph of CO2 emissions versus Fuel_consumption_combined could be fitted by three lines corresponding to three different fuel types. (There are six different fuel types, but some types must affect CO2 emissions similarly.) The correlation coefficient between CO2 and Fuel_consumption_combined is 0.931. Given this high value, I’ll focus on fuel consumption and ignore CO2 emissions in the following analysis.
2. There is a high correlation coefficient (0.807) between Fuel_consumption_combined and Engine_size. A scatterplot of Fuel_consumption_combined vs. Engine_size could probably be fitted well by a parabolic curve.
3. Fuel consumption depends on fuel type, although according to the box plots it is similar for some types of fuel.
4. The mode (most frequent value) of Fuel_consumption_combined is around 10 L/100 km, and it ranges from around 5 to 25 L/100 km.
5. There are many different types of transmissions, and they affect fuel consumption. For some transmissions engine size can vary greatly, but for others the engine size varies over a narrow range. Certain transmissions are used for many classes of cars (which may differ in weight), and other transmissions are used only for a few classes. It is difficult, therefore, to determine directly how transmission type affects fuel consumption.
6. Engine size is strongly correlated with the number of cylinders. Therefore, I’ll focus on engine size in the following analysis.
7. Class is a categorization and not something that can be measured or counted. The weight of the car would be more useful. There are significant differences in fuel consumption and engine size between classes, possibly due to weight differences. Fuel consumption varies between classes, but different classes also tend to have different numbers of cylinders, transmissions, and fuel types.
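
The correlation coefficients quoted from the scatterplot matrix can also be computed directly with cor(). The sketch below uses made-up data (the coefficients 5 and 23 and the noise levels are arbitrary assumptions), so its numbers will not match the 0.931 and 0.807 reported above. On the real data the equivalent call would be cor() on the numeric columns of dffuel.

```r
set.seed(1)

# Made-up data for illustration only; not the real ratings.
engine_size <- runif(200, min = 1, max = 7)
fuel <- 5 * sqrt(engine_size) + rnorm(200, sd = 0.8)
co2 <- 23 * fuel + rnorm(200, sd = 10)

toy <- data.frame(
  Engine_size = engine_size,
  Fuel_consumption_combined = fuel,
  CO2_emissions = co2
)

# Pairwise Pearson correlations between the numeric columns.
cor_matrix <- cor(toy)
print(round(cor_matrix, 3))
```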

The situation is quite complex, even with this small number of variables. There are many interdependencies, as noted above. To keep the analysis brief, I’ll focus on the following more fundamental variables, which should capture the most important aspects of the data:

* Engine_size
* Transmission
* Fuel_type
* Fuel_consumption_combined

Also, buyers tend to look for a certain class of car, so I’ll compare the fuel consumption of various manufacturers for a single class.

I’ll now plot fuel consumption vs. engine size, and compare the graphs for different transmissions and fuel types.

#dffuel = dffuel[order(dffuel$Engine_size),]

ggplot(aes(x=Engine_size,y=Fuel_consumption_combined),data=dffuel) +
  geom_point(alpha=0.2,aes(color=Transmission)) + 
  ylab('Fuel consumption combined')

ggplot(aes(x=Engine_size,y=Fuel_consumption_combined),data=dffuel) +
  geom_point(alpha=0.2,aes(color=Fuel_type)) + 
  ylab('Fuel consumption combined')

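To my eye, the curves look somewhat parabolic (like a parabola on its side, with fuel consumption rising roughly as the square root of engine size), so I’ll also plot the curves using facet_wrap and fit square-root curves with the formula y ~ I(x^0.5). (I don’t intend to do a detailed regression analysis here.)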
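
To make the choice of curve concrete, here is a small sketch comparing a square-root fit with an ordinary quadratic fit using lm(). The data are simulated under the assumption that consumption really does follow a square-root law, so this only illustrates the mechanics. It says nothing about which form fits the real ratings better.

```r
set.seed(2)

# Made-up data: consumption grows like the square root of engine size.
x <- runif(300, min = 1, max = 8)
y <- 5 * sqrt(x) + rnorm(300, sd = 0.5)

# Square-root model: one slope term on sqrt(x).
fit_sqrt <- lm(y ~ I(x^0.5))

# A parabola in x, for comparison: linear and squared terms.
fit_quad <- lm(y ~ x + I(x^2))

print(coef(fit_sqrt))
print(summary(fit_sqrt)$r.squared)
print(summary(fit_quad)$r.squared)
```

The square-root model has one fewer parameter than the quadratic, which is one reason to prefer it when both describe the data about equally well.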

ggplot(aes(x=Engine_size,y=Fuel_consumption_combined),data=dffuel) +
  geom_point(alpha=0.2,aes(color=Fuel_type)) +
  geom_smooth(method='lm',formula = y ~ I(x^0.5),size=1) +
  facet_wrap(~Fuel_type) +
  ylab('Fuel consumption combined')

ggplot(aes(x=Engine_size,y=Fuel_consumption_combined),data=dffuel) +
  geom_smooth(method='lm',formula = y ~ I(x^0.5),size=1,aes(group=Fuel_type,color=Fuel_type)) + 
  ylab('Fuel consumption combined')

# Summary statistics - mean fuel consumption for different fuel types
# Subset the data so that the engine size varies over a range that is common
# to all fuel types.
dffuel_sub2=subset(dffuel,Engine_size>3 & Engine_size<5)
by(dffuel_sub2$Fuel_consumption_combined,dffuel_sub2$Fuel_type,function(x) mean(x))

## dffuel_sub2$Fuel_type: 
## [1] NA
## -------------------------------------------------------- 
## dffuel_sub2$Fuel_type: D
## [1] 7.5
## -------------------------------------------------------- 
## dffuel_sub2$Fuel_type: E
## [1] 15.9323
## -------------------------------------------------------- 
## dffuel_sub2$Fuel_type: N
## [1] 12.9
## -------------------------------------------------------- 
## dffuel_sub2$Fuel_type: X
## [1] 11.83868
## -------------------------------------------------------- 
## dffuel_sub2$Fuel_type: Z
## [1] 11.42251

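The above graphs and summary means show that Fuel_type E (ethanol) results in the highest fuel consumption for a given engine size (a mean of 15.9 L/100 km for engine sizes between 3 and 5), and D (diesel) the lowest (7.5 L/100 km). X (regular gasoline, 11.8 L/100 km) and Z (premium gasoline, 11.4 L/100 km) are similar to each other and sit in between.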
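
The NA in the first group of the output above most likely comes from a blank factor level that has no rows left after cleaning (this is my assumption; I have not checked it against the file). A minimal sketch of how droplevels() removes such a level before computing group means, using made-up values:

```r
# Made-up rows; the factor carries an unused blank level "".
toy <- data.frame(
  Fuel_type = factor(c("X", "X", "Z", "D", "E"),
                     levels = c("", "D", "E", "X", "Z")),
  Fuel_consumption_combined = c(11.0, 12.0, 11.5, 7.5, 16.0)
)

# With the unused level present, its group mean is NA.
raw_means <- tapply(toy$Fuel_consumption_combined, toy$Fuel_type, mean)

# Dropping unused levels removes the empty group.
toy_clean <- droplevels(toy)
means <- tapply(toy_clean$Fuel_consumption_combined, toy_clean$Fuel_type, mean)

print(raw_means)
print(means)
```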

Now, a comparison of transmission types.

ggplot(aes(x=Engine_size,y=Fuel_consumption_combined),data=dffuel) +
  geom_point(alpha=0.2,aes(color=Transmission)) +
  facet_wrap(~Transmission) + 
  ylab('Fuel consumption combined')

ggplot(aes(x=Engine_size,y=Fuel_consumption_combined),data=dffuel) +
  geom_smooth(method='lm',formula = y ~ I(x^0.5),size=1,aes(group=Transmission,color=Transmission),se=FALSE) +
  theme(legend.text=element_text(size=12)) +
  scale_color_hue(h=c(90,360)) +
  scale_x_continuous(limits=c(0,8)) +
  scale_y_continuous(limits=c(0,20)) + 
  ylab('Fuel consumption combined')

## Warning: Removed 67 rows containing missing values (stat_smooth).

## Warning: Removed 20 rows containing missing values (stat_smooth).

## Warning: Removed 32 rows containing missing values (stat_smooth).

## Warning: Removed 1 rows containing missing values (stat_smooth).

## Warning: Removed 3 rows containing missing values (stat_smooth).

## Warning: Removed 26 rows containing missing values (stat_smooth).

There’s too much data on this graph, so now I’ll compare different transmission types for the same number of gears. I arbitrarily chose six gears for the comparison. Therefore, I’ll compare A6, AM6, AS6, AV6, M6.

# http://stackoverflow.com/questions/7963898/extracting-the-last-n-characters-from-a-string-in-r
substrRight <- function(x){
  substr(x, nchar(x), nchar(x))
}

dffuel$Transmission_string = as.character(dffuel$Transmission) # plain character vector rather than a list column

dffuel_6 = subset(dffuel,substrRight(Transmission_string)=='6')
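
Taking the last character works for six gears, but it would misread a ten-gear code such as A10 as ending in 0. An alternative sketch splits each code into its letters and its gear count with regular expressions. The example codes follow the format in the field description at the top; whether codes such as plain AV or A10 actually occur in the file is an assumption on my part.

```r
# Example transmission codes in the format described at the top.
transmission <- c("A4", "AS6", "AV", "M6", "AM7", "A10", "AV6")

# Letters only: strip any trailing digits.
trans_type <- sub("[0-9]+$", "", transmission)

# Digits only: strip the leading letters; codes without a gear count become NA.
trans_gears <- as.integer(sub("^[A-Z]+", "", transmission))

# Select the six-gear transmissions.
six_speed <- transmission[!is.na(trans_gears) & trans_gears == 6]

print(data.frame(transmission, trans_type, trans_gears))
print(six_speed)
```

With the two parts in separate columns, subsetting by type (for example trans_type == 'AV') or by gear count needs no further string handling.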


ggplot(aes(x=Engine_size,y=Fuel_consumption_combined),data=dffuel_6) +
  geom_smooth(method='lm',formula = y ~ I(x^0.5),size=1,aes(group=Transmission,color=Transmission),se=FALSE) +
  theme(legend.text=element_text(size=12)) +
  scale_color_hue(h=c(90,360)) +
  scale_x_continuous(limits=c(0,8)) +
  scale_y_continuous(limits=c(0,20)) + 
  ylab('Fuel consumption combined')

## Warning: Removed 32 rows containing missing values (stat_smooth).

## Warning: Removed 1 rows containing missing values (stat_smooth).

## Warning: Removed 26 rows containing missing values (stat_smooth).

The above chart shows that, at least for six-gear transmissions, the transmission type does not make much difference to fuel consumption, with the possible exception of the continuously variable transmission (AV). Therefore, I’ll plot the fuel consumption for the different continuously variable transmissions.

substrLeft <- function(x){
  substr(x,1,2)
}

dffuel_AV = subset(dffuel,substrLeft(Transmission_string)=='AV')

ggplot(aes(x=Engine_size,y=Fuel_consumption_combined),data=dffuel_AV) +
  geom_smooth(method='lm',formula = y ~ I(x^0.5),size=1,aes(group=Transmission,color=Transmission),se=FALSE) +
  theme(legend.text=element_text(size=12)) +
  scale_color_hue(h=c(90,360)) +
  scale_x_continuous(limits=c(0,8)) +
  scale_y_continuous(limits=c(0,20)) + 
  ylab('Fuel consumption combined')

Continuously variable transmissions do indeed seem to improve fuel efficiency.

Finally, which manufacturer does the best job overall? I’ll be buying a mid-sized car in future, so I’ll study this class only. I’ll plot fuel consumption vs. engine size to compare makes.

dffuel_mid_size = subset(dffuel,Class=='MID-SIZE')

ggplot(aes(x=Engine_size,y=Fuel_consumption_combined),data=dffuel_mid_size) +
  geom_point(aes(group=Make,color=Make)) + 
  ylab('Fuel consumption combined')

There is a huge range of makes, and so it’s difficult to choose the best make from this graph. I’ll create some summary data using dplyr.

dffuel_makes = dffuel_mid_size %>%
  group_by(Make) %>%
  summarise(
    fuel_consumption_mean=mean(Fuel_consumption_combined)
    )

ggplot(aes(x=Make,y=fuel_consumption_mean),data=dffuel_makes) +
  geom_bar(stat='identity') +
  theme(axis.text.x=element_text(angle=90,hjust=1)) + 
  ylab('Mean fuel consumption')
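
A mean based on only one or two models can be misleading, so it may help to keep a count next to each mean and to sort the makes by their mean. The sketch below does this in base R on made-up makes and values. In the dplyr pipeline above, the same idea would be to add n = n() inside summarise() and to use reorder(Make, fuel_consumption_mean) for the x aesthetic.

```r
# Made-up makes and ratings for illustration only.
toy <- data.frame(
  Make = c("A", "A", "A", "B", "C", "C"),
  Fuel_consumption_combined = c(9.0, 10.0, 11.0, 15.0, 8.0, 9.0)
)

# Mean consumption and number of models per make.
means <- tapply(toy$Fuel_consumption_combined, toy$Make, mean)
counts <- tapply(toy$Fuel_consumption_combined, toy$Make, length)

summary_by_make <- data.frame(
  Make = names(means),
  mean = as.vector(means),
  n = as.vector(counts)
)

# Sort from lowest to highest mean consumption.
summary_by_make <- summary_by_make[order(summary_by_make$mean), ]
print(summary_by_make)
```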

Bentley, Ferrari, and Rolls-Royce do not do well in the fuel economy stakes. The Korean manufacturers do well, but on average Toyota is tops for mid-sized cars. Is this because Toyota engine sizes are smaller than average? Let’s check it:

ggplot(aes(x=Make,y=Engine_size),data=dffuel_mid_size) +
  geom_boxplot()  +
  theme(axis.text.x=element_text(angle=90,hjust=1)) + 
  ylab('Engine size')

Toyota engine size is indeed on the lower end of the scale. One way of comparing fuel economy between manufacturers is to compare the ratios of fuel consumption to engine size between manufacturers. This gives us an indication of which manufacturer makes the best engines, although it is certainly not a definitive measure.

ggplot(aes(x=Make,y=Fuel_consumption_combined/Engine_size),data=dffuel_mid_size) +
  geom_boxplot()  +
  theme(axis.text.x=element_text(angle=90,hjust=1)) + 
  ylab('Fuel consumption / Engine size')

dffuel_makes_ratio = dffuel_mid_size %>%
  group_by(Make) %>%
  summarise(
    fuel_consumption_div_size_mean=mean(Fuel_consumption_combined/Engine_size)
    )

ggplot(aes(x=Make,y=fuel_consumption_div_size_mean),data=dffuel_makes_ratio) +
  geom_bar(stat='identity') +
  theme(axis.text.x=element_text(angle=90,hjust=1)) + 
  ylab('Mean fuel consumption / Engine size')

The European luxury manufacturers do better according to this measure, with Rolls-Royce the best on average. Of course, Rolls-Royce cars are usually quite heavy, which accounts for their overall high fuel consumption. As shown by the box plot, particular models from some manufacturers, such as Infiniti and Dodge, perform better than Rolls-Royce. Fuel type likely has something to do with this, but I’ll finish the analysis here since I’ve produced more than enough graphs.
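
One caveat about the ratio measure: if fuel consumption grows roughly as the square root of engine size, as the fitted curves earlier suggest, then dividing by engine size automatically favours large engines. The sketch below shows the arithmetic under that assumption. The coefficient k is hypothetical and not fitted to the real data.

```r
# Assumed relationship for illustration: consumption = k * sqrt(engine size).
k <- 5   # hypothetical coefficient, not fitted to the real data

engine_size <- c(1.5, 3.0, 6.0)
consumption <- k * sqrt(engine_size)

# The ratio simplifies to k / sqrt(engine_size), so it falls as size grows.
ratio <- consumption / engine_size

print(round(data.frame(engine_size, consumption, ratio), 2))
```

Under this assumption, two equally well-designed engines of different sizes would score differently, which may partly explain why makes with very large engines come out ahead.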

Christopher Kaalund

20 March 2015