Below are all the R packages and option settings needed to reproduce this analysis.
options(warn=-1)
options("scipen"=100, "digits"=4)
suppressMessages(library(readr))
suppressMessages(library(Hmisc))
suppressMessages(library(gbm))
suppressMessages(library(dplyr))
suppressMessages(library(caret))
suppressMessages(library(RColorBrewer))
suppressMessages(library(ggplot2)) # used for all the plots below (caret also attaches it)
#library(doMC)
#registerDoMC(cores = 8)
cars_multi <- read_csv("cars_multi.csv", na = "?")
cars_price <- read_csv("cars_price.csv")
The fte_theme function, based on the FiveThirtyEight style, is not displayed here. There are no missing values or idiosyncrasies in any of the variables except horsepower, which has 6 missing observations. One could fix this with kNN imputation (a quick sketch follows the merge below), but I won't bother for the time being.
merged_data <- merge(cars_multi, cars_price, by = "ID")
#describe(merged_data)
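For completeness, here is a minimal sketch of that kNN imputation using caret's preProcess; the choice of k = 5 is only illustrative, and note that knnImpute also centers and scales the columns it touches.
# Hedged sketch: impute the 6 missing horsepower values with k nearest neighbours.
# k = 5 is illustrative, not a tuned value; knnImpute also centers/scales.
num_cols <- sapply(merged_data, is.numeric)
pre <- preProcess(merged_data[, num_cols], method = "knnImpute", k = 5)
imputed_numeric <- predict(pre, merged_data[, num_cols])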
As the scatter plot below shows, there is no clear relationship between the mpg and price variables. Small mpg values cover a wide price range, although this is less so for the highest mpg values.
ggplot(data=merged_data, aes(x=mpg, y=price)) +
geom_point(alpha=.4, size=4, color="#880011") +
ggtitle("MPG vs. Price")
But we noticed something that stood out: there seem to be strips of points around price values of 20000, 30000, and 40000. What's going on there? Let's investigate with a histogram of the price variable by itself.
As one can see below, there are spikes in vehicle counts around prices of 20000, 30000, and 40000. If you're a car dealership, you had better use this info to set the price of your cars :-)
ggplot(data=merged_data, aes(x=price)) +
geom_histogram(fill="#880011") +
ggtitle("Price Histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
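As an illustrative check on those spikes, one can round prices to the nearest 1000 and count vehicles per bin (the bin width of 1000 is an assumption, not a value used elsewhere in this analysis).
# Count vehicles by price rounded to the nearest 1000; the largest counts
# should sit near the 20000, 30000, and 40000 marks if the spikes are real.
merged_data %>%
mutate(price_bin = round(price / 1000) * 1000) %>%
count(price_bin, sort = TRUE) %>%
head()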
Let’s dig even deeper and see how the above has changed over the years. The year-by-year picture is even muddier, although the correlation between mpg and price is somewhat less weak in 1979. It is still a very weak correlation of 32%, compared with a near-zero correlation (1%) between these two variables for the whole data set.
ggplot(data=merged_data, aes(x=mpg, y=price)) +
geom_point(alpha=.4, size=4, color="#880011") +
ggtitle("MPG vs. Price vs. Year") +
facet_wrap(~model)
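As a quick sanity check on the correlations quoted above, they can be computed directly; the exact figures of course depend on the data at hand.
# Overall mpg-price correlation, then the same correlation per model year.
cor(merged_data$mpg, merged_data$price)
merged_data %>%
group_by(model) %>%
summarise(mpg_price_cor = cor(mpg, price)) %>%
arrange(desc(mpg_price_cor))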
There’s a strong inverse relationship between displacement and mpg, and heavier vehicles tend to have higher fuel consumption and larger displacement. There’s nothing mind-blowing in this graph, actually. The dot color encodes vehicle weight.
ggplot(data=merged_data, aes(x=mpg, y=displacement, color=weight)) +
geom_point(position = "jitter") + geom_smooth(color="red", size=0.5) +
ggtitle("MPG vs. Displacment vs. Weight") +
fte_theme()
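A quick numeric check of these relationships (none of these three columns has missing values, so a plain correlation matrix is enough):
# Pairwise correlations: weight vs. displacement should be strongly positive,
# weight vs. mpg strongly negative.
cor(merged_data[, c("mpg", "displacement", "weight")])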
In the 70’s, US-made cars consistently had more horsepower than foreign-made ones, but that started to change somewhat in the early 80’s. The horsepower of foreign-made cars stayed more or less constant during this time. Maybe people’s attitudes about what they were looking for in a car started to change in the US around the 80’s. We would need more recent data to answer that question for sure.
ggplot(data=merged_data, aes(x=origin, y=horsepower)) +
geom_point(color="orange") +
facet_wrap(~model) +
ggtitle("MPG vs. Origin vs. Horsepower") +
fte_theme()
German-made vehicles are among the most expensive on average, followed by Japanese ones. Vehicle prices have gone up over time, and inflation may explain this. A chart can help visualize this; a rough sketch follows the table below.
merged_data %>% group_by(origin, model) %>% select(origin, model, price) %>% summarise(average= mean(price)) %>% ungroup() %>% arrange(desc(average))
## Source: local data frame [39 x 3]
##
## origin model average
## (int) (int) (dbl)
## 1 2 80 37477
## 2 2 78 36455
## 3 2 82 35236
## 4 3 73 34474
## 5 3 72 33441
## 6 1 78 32630
## 7 3 80 32412
## 8 3 82 31545
## 9 2 72 31294
## 10 3 78 31281
## .. ... ... ...
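Here is a rough sketch of such a chart: average price per model year, one line per origin code (origin is left as a coded factor here rather than mapped to country names).
# Rough sketch: average price by model year, one line per origin code.
merged_data %>%
group_by(origin, model) %>%
summarise(average = mean(price)) %>%
ggplot(aes(x = model, y = average, color = factor(origin))) +
geom_line() +
ggtitle("Average Price by Year and Origin")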
In order to answer this question, we will fit a regression model. Given the time constraint, we will forgo data pre-processing, feature engineering, and performance tuning of the model. And given the small size of our data set, we will not do a train/test split; instead we will rely on repeated cross-validation to evaluate the model. Our goal here is merely to get a quick insight into how these variables impact the price tag.
Our model of choice for the task at hand is Gradient Tree Boosting, probably one of the best off-the-shelf machine learning algorithms. It will handle the few missing values and the mixture of categorical and numerical variables in our data set; that last statement is largely true of tree-based machine learning algorithms in general.
#drop useless variables for our purpose
merged_data$ID <- NULL
merged_data$car_name <- NULL
#perform 10 folds cross-validation and repeat it 10 times.
control <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
set.seed(701) #from random.org
fit <- train(price ~ ., data = merged_data,
method = "gbm",
trControl = control,
verbose = FALSE)
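Before turning to variable importance, it is worth a glance at the cross-validated error. fit$results holds the resampled RMSE and R-squared for every combination in caret's default gbm tuning grid (output not reproduced here).
# Resampled performance for each tuning-grid combination, best RMSE first.
fit$results %>% arrange(RMSE) %>% head()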
imp <- varImp(fit, scale=TRUE)
imp
## gbm variable importance
##
## Overall
## acceleration 100.0
## model 73.7
## mpg 69.0
## displacement 55.6
## weight 49.7
## horsepower 42.5
## origin 0.0
## cylinders 0.0
plot(imp, main = "Variable Importance")
Based on our model, acceleration, model year, and fuel economy seem to factor the most into the vehicle price.
Please keep in mind that although cylinders and origin have zero importance values in our model, they cannot be discarded without further analysis. The number of cylinders and fuel economy are somewhat correlated (78%), which means that if one throws out the mpg variable, the cylinders variable may become important.
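A one-liner confirms that correlation (the 78% quoted refers to its magnitude; the sign is negative, since more cylinders generally means lower mpg):
# Correlation between number of cylinders and fuel economy.
cor(merged_data$cylinders, merged_data$mpg)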
Conclusion
I spent a couple of hours or so this morning getting this done, so this should be considered a quick hack more than anything else. A good data scientist is also a good hacker :-)