Below are all the R packages and option settings needed to reproduce this analysis.
options(warn=-1)
options("scipen"=100, "digits"=4)
suppressMessages(library(readr))
suppressMessages(library(Hmisc))
suppressMessages(library(gbm))
suppressMessages(library(dplyr))
suppressMessages(library(caret))
suppressMessages(library(RColorBrewer))
suppressMessages(library(ggplot2)) # used for all the plots below (caret also attaches it)
#library(doMC)
#registerDoMC(cores = 8)
cars_multi <- read_csv("cars_multi.csv", na = "?")
cars_price <- read_csv("cars_price.csv")
The fte_theme function, based on the FiveThirtyEight style, is not displayed here. There are no missing values or idiosyncrasies in any of the variables except horsepower, which has 6 missing observations. One could fix this with kNN imputation (a quick sketch follows the merge below), but I won't bother for the time being.
merged_data <- merge(cars_multi, cars_price, by = "ID")
#describe(merged_data)
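For completeness, here is a minimal sketch of that kNN imputation using caret's preProcess; the choice of k = 5 is only illustrative, and note that knnImpute also centers and scales the columns it touches.
# Hedged sketch: impute the 6 missing horsepower values with k nearest neighbours.
# k = 5 is illustrative, not a tuned value; knnImpute also centers/scales.
num_cols <- sapply(merged_data, is.numeric)
pre <- preProcess(merged_data[, num_cols], method = "knnImpute", k = 5)
imputed_numeric <- predict(pre, merged_data[, num_cols])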
As the scatter plot below shows, there is no clear relationship between the mpg and price variables. Small mpg values cover a wide price range, although this is less so for the highest mpg values.
ggplot(data=merged_data, aes(x=mpg, y=price)) +
geom_point(alpha=.4, size=4, color="#880011") +
ggtitle("MPG vs. Price")
But we noticed something that stood out: there seem to be strips of points around price values of 20000, 30000, and 40000. What's going on there? Let's investigate with a histogram of the price variable by itself.
As one can see below, there are spikes in vehicle counts around prices of 20000, 30000, and 40000. If you're a car dealership, you had better use this info to set the price of your cars :-)
ggplot(data=merged_data, aes(x=price)) +
geom_histogram(fill="#880011") +
ggtitle("Price Histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
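As an illustrative check on those spikes, one can round prices to the nearest 1000 and count vehicles per bin (the bin width of 1000 is an assumption, not a value used elsewhere in this analysis).
# Count vehicles by price rounded to the nearest 1000; the largest counts
# should sit near the 20000, 30000, and 40000 marks if the spikes are real.
merged_data %>%
mutate(price_bin = round(price / 1000) * 1000) %>%
count(price_bin, sort = TRUE) %>%
head()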
Let’s dig even deeper and see how the above has changed over the years. The year-by-year picture is even muddier, although the correlation between mpg and price is somewhat less weak in 1979. It is still a very weak correlation of 32%, compared with a near-zero correlation (1%) between these two variables for the whole data set.
ggplot(data=merged_data, aes(x=mpg, y=price)) +
geom_point(alpha=.4, size=4, color="#880011") +
ggtitle("MPG vs. Price vs. Year") +
facet_wrap(~model)
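As a quick sanity check on the correlations quoted above, they can be computed directly; the exact figures of course depend on the data at hand.
# Overall mpg-price correlation, then the same correlation per model year.
cor(merged_data$mpg, merged_data$price)
merged_data %>%
group_by(model) %>%
summarise(mpg_price_cor = cor(mpg, price)) %>%
arrange(desc(mpg_price_cor))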
There’s a strong inverse relationship between displacement and mpg, and heavier vehicles tend to have higher fuel consumption and larger displacement. There’s nothing mind-blowing in this graph, actually. The dot color encodes vehicle weight.
ggplot(data=merged_data, aes(x=mpg, y=displacement, color=weight)) +
geom_point(position = "jitter") + geom_smooth(color="red", size=0.5) +
ggtitle("MPG vs. Displacment vs. Weight") +
fte_theme()
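A quick numeric check of these relationships (none of these three columns has missing values, so a plain correlation matrix is enough):
# Pairwise correlations: weight vs. displacement should be strongly positive,
# weight vs. mpg strongly negative.
cor(merged_data[, c("mpg", "displacement", "weight")])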
In the 70’s, US-made cars consistently had more horsepower than foreign-made ones, but that started to change somewhat in the early 80’s. The horsepower of foreign-made cars stayed more or less constant during this time. Maybe people’s attitudes about what they were looking for in a car started to change in the US around the 80’s. We would need more recent data to answer that question for sure.
ggplot(data=merged_data, aes(x=origin, y=horsepower)) +
geom_point(color="orange") +
facet_wrap(~model) +
ggtitle("MPG vs. Origin vs. Horsepower") +
fte_theme()
German-made vehicles are among the most expensive on average, followed by Japanese ones. Vehicle prices have gone up over time, and inflation may explain this. A chart can help visualize this; a rough sketch follows the table below.
merged_data %>% group_by(origin, model) %>% select(origin, model, price) %>% summarise(average= mean(price)) %>% ungroup() %>% arrange(desc(average))
## Source: local data frame [39 x 3]
##
## origin model average
## (int) (int) (dbl)
## 1 2 80 37477
## 2 2 78 36455
## 3 2 82 35236
## 4 3 73 34474
## 5 3 72 33441
## 6 1 78 32630
## 7 3 80 32412
## 8 3 82 31545
## 9 2 72 31294
## 10 3 78 31281
## .. ... ... ...
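Here is a rough sketch of such a chart: average price per model year, one line per origin code (origin is left as a coded factor here rather than mapped to country names).
# Rough sketch: average price by model year, one line per origin code.
merged_data %>%
group_by(origin, model) %>%
summarise(average = mean(price)) %>%
ggplot(aes(x = model, y = average, color = factor(origin))) +
geom_line() +
ggtitle("Average Price by Year and Origin")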
In order to answer this question, we will fit a regression model. Given the time constraint, we will forgo data pre-processing, feature engineering, and performance tuning of the model. And given the small size of our data set, we will not do a train/test split; instead we will rely on repeated cross-validation to evaluate the model. Our goal here is merely to get a quick insight into how these variables impact the price tag.
Our model of choice for the task at hand is Gradient Tree Boosting, probably one of the best off-the-shelf machine learning algorithms. It will handle the few missing values and the mixture of categorical and numerical variables in our data set; that last statement is largely true of tree-based machine learning algorithms in general.
#drop useless variables for our purpose
merged_data$ID <- NULL
merged_data$car_name <- NULL
#perform 10 folds cross-validation and repeat it 10 times.
control <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
set.seed(701) #from random.org
fit <- train(price ~ ., data = merged_data,
method = "gbm",
trControl = control,
verbose = FALSE)
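Before turning to variable importance, it is worth a glance at the cross-validated error. fit$results holds the resampled RMSE and R-squared for every combination in caret's default gbm tuning grid (output not reproduced here).
# Resampled performance for each tuning-grid combination, best RMSE first.
fit$results %>% arrange(RMSE) %>% head()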
imp <- varImp(fit, scale=TRUE)
imp
## gbm variable importance
##
## Overall
## acceleration 100.0
## model 73.7
## mpg 69.0
## displacement 55.6
## weight 49.7
## horsepower 42.5
## origin 0.0
## cylinders 0.0
plot(imp, main = "Variable Importance")
Based on our model, acceleration, model year, and fuel economy seem to factor the most into the vehicle price.
Please keep in mind that although cylinders and origin have zero importance values in our model, they cannot be discarded without further analysis. The number of cylinders and fuel economy are somewhat correlated (78%), which means that if one throws out the mpg variable, the cylinders variable may become important.
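A one-liner confirms that correlation (the 78% quoted refers to its magnitude; the sign is negative, since more cylinders generally means lower mpg):
# Correlation between number of cylinders and fuel economy.
cor(merged_data$cylinders, merged_data$mpg)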
Conclusion
I spent a couple of hours or so this morning getting this done, so this should be considered a quick hack more than anything else. A good data scientist is also a good hacker :-)