Exploratory Analysis of Auto Data

Summary

This document is intended to be a concise report to explain a few takeaways of a dataset containing information about cars (available here). The analysis was created as part of the Data Science Certificate in the class Methods for Data Analysis at University of Washington.

The idea is to show some findings regarding information related to the price of the cars. Some functions created for this purpose are included in the appendix.

The report starts with data loading/cleaning, followed by the exploratory analysis with three sections with takeaways for price related to weight, make and drive wheel of cars. Finally, some initial basic modeling for exploration is done to close the report.

Data Loading and Cleaning

First we load the file with the functions created for this report (included in the appendix). Then the data is loaded using the function read.auto, that also coerces some character columns to numeric and adjusts the columns num.of.doors and num.of.cylinders, that have numerical data in the form of text (see the function read.auto in the appendix at the end of this report for more info).

# load created functions
source('Functions.R')

# read.auto function loads and cleans the data
Auto.Price = read.auto(path = '.') # function read.auto is included in the appendix
summary(Auto.Price)

##    symboling       normalized.losses         make      fuel.type   aspiration   num.of.doors  
##  Min.   :-2.0000   Min.   : 65       toyota    : 32   diesel: 20   std  :168   Min.   :2.000  
##  1st Qu.: 0.0000   1st Qu.: 94       nissan    : 18   gas   :185   turbo: 37   1st Qu.:2.000  
##  Median : 1.0000   Median :115       mazda     : 17                            Median :4.000  
##  Mean   : 0.8341   Mean   :122       honda     : 13                            Mean   :3.123  
##  3rd Qu.: 2.0000   3rd Qu.:150       mitsubishi: 13                            3rd Qu.:4.000  
##  Max.   : 3.0000   Max.   :256       subaru    : 12                            Max.   :4.000  
##                    NA's   :41        (Other)   :100                            NA's   :2      
##        body.style drive.wheels engine.location   wheel.base         length          width      
##  convertible: 6   4wd:  9      front:202       Min.   : 86.60   Min.   :141.1   Min.   :60.30  
##  hardtop    : 8   fwd:120      rear :  3       1st Qu.: 94.50   1st Qu.:166.3   1st Qu.:64.10  
##  hatchback  :70   rwd: 76                      Median : 97.00   Median :173.2   Median :65.50  
##  sedan      :96                                Mean   : 98.76   Mean   :174.0   Mean   :65.91  
##  wagon      :25                                3rd Qu.:102.40   3rd Qu.:183.1   3rd Qu.:66.90  
##                                                Max.   :120.90   Max.   :208.1   Max.   :72.30  
##                                                                                                
##      height       curb.weight   engine.type num.of.cylinders  engine.size     fuel.system      bore     
##  Min.   :47.80   Min.   :1488   dohc : 12   Min.   : 2.00    Min.   : 61.0   mpfi   :94   Min.   :2.54  
##  1st Qu.:52.00   1st Qu.:2145   dohcv:  1   1st Qu.: 4.00    1st Qu.: 97.0   2bbl   :66   1st Qu.:3.15  
##  Median :54.10   Median :2414   l    : 12   Median : 4.00    Median :120.0   idi    :20   Median :3.31  
##  Mean   :53.72   Mean   :2556   ohc  :148   Mean   : 4.38    Mean   :126.9   1bbl   :11   Mean   :3.33  
##  3rd Qu.:55.50   3rd Qu.:2935   ohcf : 15   3rd Qu.: 4.00    3rd Qu.:141.0   spdi   : 9   3rd Qu.:3.59  
##  Max.   :59.80   Max.   :4066   ohcv : 13   Max.   :12.00    Max.   :326.0   4bbl   : 3   Max.   :3.94  
##                                 rotor:  4                                    (Other): 2   NA's   :4     
##      stroke      compression.ratio   horsepower       peak.rpm       city.mpg      highway.mpg   
##  Min.   :2.070   Min.   : 7.00     Min.   : 48.0   Min.   :4150   Min.   :13.00   Min.   :16.00  
##  1st Qu.:3.110   1st Qu.: 8.60     1st Qu.: 70.0   1st Qu.:4800   1st Qu.:19.00   1st Qu.:25.00  
##  Median :3.290   Median : 9.00     Median : 95.0   Median :5200   Median :24.00   Median :30.00  
##  Mean   :3.255   Mean   :10.14     Mean   :104.3   Mean   :5125   Mean   :25.22   Mean   :30.75  
##  3rd Qu.:3.410   3rd Qu.: 9.40     3rd Qu.:116.0   3rd Qu.:5500   3rd Qu.:30.00   3rd Qu.:34.00  
##  Max.   :4.170   Max.   :23.00     Max.   :288.0   Max.   :6600   Max.   :49.00   Max.   :54.00  
##  NA's   :4                         NA's   :2       NA's   :2                                     
##      price      
##  Min.   : 5118  
##  1st Qu.: 7775  
##  Median :10295  
##  Mean   :13207  
##  3rd Qu.:16500  
##  Max.   :45400  
##  NA's   :4

We can see that a few columns have NA values, and we have features like make and fuel.system that are categorical variables with a high number of levels.

Also, although not seen in the summary, some features have inter-relationships. For example, the diesel level of fuel.type is only found in the level idi of fuel.system and vice-versa, as we can see below.

table(Auto.Price$fuel.type, Auto.Price$fuel.system)

##         
##          1bbl 2bbl 4bbl idi mfi mpfi spdi spfi
##   diesel    0    0    0  20   0    0    0    0
##   gas      11   66    3   0   1   94    9    1

These will only be treated later with modeling purposes, but it is interesting and vital to notice these cases early on.

Exploratory Analysis

Following we have three sections with the takeaways found for price related to drive wheels, weight and make of cars.

Drive Wheels

Drive wheels essentially dictates the traction of the cars, into 4 wheels (4wd), two forward wheels (fwd) or two rear wheels (rwd). Observing the variable drive.wheels, we see it has exactly three levels: 4wd, fwd and rwd.

table(Auto.Price$drive.wheels)

## 
## 4wd fwd rwd 
##   9 120  76

sort(tapply(Auto.Price$price, Auto.Price$drive.wheels, mean, na.rm=TRUE), decreasing=TRUE)

##      rwd      4wd      fwd 
## 19757.61 10241.00  9244.78

We can also notice that the level 4wd has very few observations (9 out of 205, ~4%) and therefore may be hard to account while making sure to not overfit. Also, by analyzing the price for each level, we see that rwd has a higher mean price than fwd. The following plot also seems to confirm this hypothesis.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.2.4

ggplot(Auto.Price[Auto.Price$drive.wheels %in% c('fwd','rwd'),], aes(price)) + 
    geom_histogram(binwidth=1000, na.rm=TRUE) + facet_grid(. ~ drive.wheels) + 
    labs(title = "Histogram of Price by Drive Wheels fwd and rwd") + labs(x = "Price (US$)", y = "Frequency")

To assert this hypothesis, we do a Welch Two Sample t-test, defining the following null and alternative hypothesis:

Null Hypothesis - H0: Mean price of rwd is equal to the mean price of fwd+4wd
Alternative Hypothesis - H1: Mean price of rwd is greater than the mean price of fwd+4wd

t.test(Auto.Price$price[Auto.Price$drive.wheels == 'rwd'], 
       Auto.Price$price[Auto.Price$drive.wheels != 'rwd'], "greater", 0, FALSE, FALSE, 0.95)

## 
##  Welch Two Sample t-test
## 
## data:  Auto.Price$price[Auto.Price$drive.wheels == "rwd"] and Auto.Price$price[Auto.Price$drive.wheels != "rwd"]
## t = 9.5858, df = 86.049, p-value = 1.571e-15
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  8636.988      Inf
## sample estimates:
## mean of x mean of y 
## 19757.613  9308.032

With a very small p-value, we can reject the null hypothesis, confirming our finding that the mean price of rwd drive wheel is greater than the mean price of the other two levels.

Curb Weight

The Curb Weight is the total weight of a vehicle with all of its standard equipment. We found this feature by looking at the numerical variables that have the highest correlation with price, as we can see below.

# Obtain numeric-like columns of Auto.Price
numerical.cols <- getColsOfClass(Auto.Price, "num") # function getColsOfClass is included in the appendix

# Obtain three highest correlated (absolute correlation) features with Price
featCorPrice() # function featCorPrice is included in the appendix

## curb.weight       width engine.size 
##   0.8936391   0.8433705   0.8414956

# Obtain three highest correlated (absolute correlation) features with LOG of Price
featCorPrice(log) # function featCorPrice is included in the appendix

## curb.weight       width engine.size 
##   0.9174756   0.8524911   0.8351468

Basically, we looked at the correlation of the numerical features with price and then with log(price). First we see that curb.weight is the highest correlated variable with price (absolute correlation).

We can also notice that by applying the log in price, we get higher correlation. To observe this relationship we can see below a plot of the variable curb.weight by the log(price) with a linear regression line added.

ggplot(Auto.Price, aes(curb.weight, log(price))) + geom_point(aes(alpha = 0.2, size=2)) + 
    guides(alpha=F,size=F) + stat_smooth(method = "lm", level = 0.95, colour = "red") + 
    labs(title = "Scatterplot of Curb Weight by Log of Price with Linear Regression Line") + 
    labs(x = "Curb Weight (lbs)", y = "Log of Price")

The image confirms the linear relationship between curb.weight and log(price).

Make

Intuitively, make should be a variable that influences the price of the cars. We already saw that the feature has many levels, so to try to avoid reaching conclusions based on data that is not representative, we will limit make for those that have more than 10 observations among the 205 rows (~5%).

make.ten.or.more <- names(sort(table(Auto.Price$make)[table(Auto.Price$make) > 10], decreasing = TRUE))
sort(with(Auto.Price[Auto.Price$make %in% make.ten.or.more,], tapply(price, make, mean, na.rm=TRUE)), decreasing = TRUE)

##      volvo     peugot      mazda     nissan volkswagen     toyota mitsubishi     subaru      honda 
##  18063.182  15489.091  10652.882  10415.667  10077.500   9885.812   9239.769   8541.250   8184.692

Sorting the price for each make with more than 10 observations we see that volvo and peugeot (misspelled as peugot) have higher mean prices. Now we can look at the density plots for these makes.

ggplot(Auto.Price[Auto.Price$make %in% make.ten.or.more,], aes(price,fill=make)) + 
    geom_density(na.rm = TRUE) + facet_grid(. ~ make) + guides(fill=F) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5)) + 
    labs(title = "Density Plot of Price by Make") + labs(x = "Price (US$)", y = "Density")

One make that has a high mean price and is considerably skewed is peugeot. Below we compare its price with the other makes to assess the difference.

library(gridExtra)
bp1 = ggplot(Auto.Price[Auto.Price$make == 'peugot',], aes(1:11, price)) + geom_boxplot() + 
    theme(axis.ticks=element_blank(), axis.text.x = element_blank()) + labs(x = "Peugeot", y = "Price (US$)")
bp2 = ggplot(Auto.Price[Auto.Price$make != 'peugot',], aes(1:194, price)) + geom_boxplot() + 
    theme(axis.ticks=element_blank(), axis.text.x = element_blank()) + labs(x = "Others", y = "Price (US$)")
grid.arrange(bp1, bp2, nrow = 1, top = "Boxplots of Price for Peugeot and other makes")

The plots confirm our hypothesis that Peugeot stands out amongst the makes by having a higher mean price.

Modeling for Exploration

As a final essay, still for exploration purposes, we create a model for the dataset and analyze the results. To control the features and each of the levels of the categorical variables individually, we first transform the factor columns into binary or dummy columns.

library(dummies)
cols_to_dummy <- getColsOfClass(Auto.Price, "char") # function getColsOfClass is included in the appendix
dummy.sel <- dummy.data.frame(Auto.Price[,cols_to_dummy], sep="_", drop=T)
auto.dummy <- cbind(Auto.Price[,-cols_to_dummy], dummy.sel)

After transforming the factor features, we build a simple model with the variables we found in our three takeaways.

# Simple Model
simple.model <- lm(log(price) ~ curb.weight + drive.wheels_rwd + make_peugot, data=auto.dummy)
summary(simple.model)

## 
## Call:
## lm(formula = log(price) ~ curb.weight + drive.wheels_rwd + make_peugot, 
##     data = auto.dummy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65661 -0.11483 -0.00953  0.10152  0.82957 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       7.270e+00  8.928e-02  81.428  < 2e-16 ***
## curb.weight       7.932e-04  3.818e-05  20.772  < 2e-16 ***
## drive.wheels_rwd  1.991e-01  4.076e-02   4.885 2.14e-06 ***
## make_peugot      -3.860e-01  6.763e-02  -5.707 4.17e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2051 on 197 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.8358, Adjusted R-squared:  0.8333 
## F-statistic: 334.2 on 3 and 197 DF,  p-value: < 2.2e-16

Above we see that with only the three findings we managed to obtain, the model is already giving quite satisfactory results with R-squared of approximatelly 0.84.

Next, we build the model with all of the available features and output the R-squared.

# Full Model
full.model <- lm(log(price) ~ ., data=auto.dummy)
summary(full.model)$r.squared

## [1] 0.9672693

The R-squared above shows us that there is still room to improve, and an even more in-depth analysis can lead to greater results.

It is worth noticing that for modeling purposes there are several other techniques that should be addressed and were skipped in this report (for example, the data should be split into train and test to start with).

Appendix

Functions loaded using the file Functions.R are showed below.

##  Read the csv file into a data frame
read.auto <- function(path = 'SET-YOUR-PATH-HERE'){
    ## Function to read the csv file
    filePath <- file.path(path, 'Automobile price data _Raw_.csv')
    auto.price <- read.csv(filePath, header = TRUE, 
                           stringsAsFactors = TRUE, na.strings = "?")
    
    ## Coerce some character columns to numeric
    numcols <- c('price', 'bore', 'stroke', 'horsepower', 'peak.rpm',
                 'highway.mpg', 'city.mpg', 'compression.ratio',
                 'engine.size', 'curb.weight', 'height', 'width',
                 'length', 'wheel.base', 'normalized.losses',
                 'symboling')
    auto.price[, numcols] <- lapply(auto.price[, numcols], as.numeric)
    
    ## Clean and tidy num.of.doors
    auto.price$num.of.doors <- as.character(auto.price$num.of.doors)
    auto.price$num.of.doors[auto.price$num.of.doors == 'four'] <- 4
    auto.price$num.of.doors[auto.price$num.of.doors == 'two'] <- 2
    auto.price$num.of.doors <- as.integer(auto.price$num.of.doors)
    
    ## Clean and tidy num.of.cylinders
    auto.price$num.of.cylinders <- as.character(auto.price$num.of.cylinders)
    auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'eight'] <- 8
    auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'five'] <- 5
    auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'four'] <- 4
    auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'six'] <- 6
    auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'three'] <- 3
    auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'twelve'] <- 12
    auto.price$num.of.cylinders[auto.price$num.of.cylinders == 'two'] <- 2
    auto.price$num.of.cylinders <- as.integer(auto.price$num.of.cylinders)
    
    auto.price
}

# Automatically obtain numeric-like or character-like columns of data
getColsOfClass <- function(full.data, class='num'){
    if(class=='num'){
        classesChosen <- c('integer','numeric','double','float')
    }
    else{
        classesChosen <- c('factor','character')
    }
    return(sort(unique(sapply(1:ncol(full.data), 
                              function(x){
                                  if(class(full.data[,x]) %in% classesChosen){
                                      x
                                  }
                                  else{
                                      0
                                  }
                              }
    )
    )
    )[-1]
    )
}

# Obtain three highest correlated (absolute correlation) features of Auto.Price
featCorPrice <- function(price.function=I, n=3){
    cor.mat <- cor(Auto.Price[,numerical.cols], price.function(Auto.Price[,'price']), use = 'complete.obs')
    feat.cor <- abs(cor.mat)
    feat.cor <- feat.cor[order(feat.cor, decreasing = T),][-1]
    feat.cor[1:n]
}