Introduction

Used cars are hard to price. Overwelming information of individual cars, mileage, year, make, condition, all play important roles in determining the car value. Furthermore, car prices changes consistantly when new model comes out and regulation changes. Most consumer like me has limit knowledge about car and find information online overwelming. As a result, when consumers trying to sell or buy a car, they often put themselves at a vulnerable position and bear the risk of misprice.

Research Overview

  1. Information on factors that affect used car values

Based on the research, Make, Model, Year, Mileage, Condition, and Whether accident happened are the main determinates for the values of a used vehicle. Among these criteria, condition is subjective and hard to quantify without taking an in-person look. Although accident is a huge factor in value, a lot of posting does not disclose this information. Also, it is hard to access if the statement is truthful.

  1. select a website that contain large amount of information. An ideal website needs to fulfill the following criteria:

The biggest used auto trading website being used in Alberta include Autotrader, Kijiji, Cargurus. This analysis uses data from Autotrader. This is because it contains more than 25,000 posts for Alberta alone and important vehicle information are collectively presented on a search web pages.

  1. An overview of Autotrader website

After entering searching criteria into its searching engine, in this case, Alberta as location, one can find about 25,000 used vehicles. On the search result page, there are three areas that contain the information we need. Each search title is always written as “Year Make Model OtherInfo”. Prices and Milleage can be found below and beside the title. However, since some posting has no mileage or prices shown on the search result page, we eliminate those by changing both the minimum price and minimum mileage to 1.

The maximium number of listing per page is 100. Taking a look at the HTTP address of the page, we can see that the pages number is controlled by change the number after “rcs =” to a multiple of 100. This means that we can access the information for all pages by changes that number. “https://www.autotrader.ca/cars/ab/?rcp=100&rcs=0“&srt=3&pRng=1%2C&oRng=1000%2C&prx=-2&prv=Alberta&loc=alberta&hprc=True&wcp=True&sts=Used&showcpo=1&inMarket=advancedSearch”

Work Plan

Goal: My goal is to build a Machine Learning model and alone with other tools to explore the relationship between make, model, year, mileage and price. Ideally, this model can be applied to other search result from Autotrader. This model can be used to price any car that is traded on Autotrader, with some exception on luxury makes such as Ferrari.

Plan:

  1. Web scrapping
  2. Data wraggling - organize data, eliminate data
  3. Building ML Model
  4. Test Goodness of Fit
  5. Price my own car
  6. Visualization to exlore relationships

Implementation


Data: Web scrapping

The first step to accomplish the goal is to web scrap information from Autotrader. To begin with, I put in search result that would include all used car posted in Alberta. I also include some criteria such as mileage > 1000 and price > $1, because some posting has no price or mileage. The url for the website is “https://www.autotrader.ca/cars/ab/?rcp=100&rcs=0&srt=3&pRng=1%2C&oRng=1000%2C&prx=-2&prv=Alberta&loc=alberta&hprc=True&wcp=True&sts=Used&showcpo=1&inMarket=advancedSearch”.

After using SelectorGadget to obtain the Xpath for the price, milleage and title which includes year, make and model, a loop was used to scrap these data from 250 pages by changing the number after “rcs=” to multiplies of 100. Price, mileage and title were organized into a dataframe. The package I used to web scrap was ‘rvest’.

## [1] "Number of postings collected: 27888"

Data: tidying

In order to run analysis for the data, the data needs to be cleaned and organized. Taking a look at the first data frame straight from the website. Year, make, model can be isolated from the title. I need to also remove words, unit, and commas from price and milleage columns.

In addition, I decided to only focus on cars that are not luxury, because brands that are too expensive can affect the accuraccy of the analysis on regular cars. To achieve this, I removed brands that have average price aboved population average. The removed brands include Ferrari, Lamborghini, etc. Also, it is important to remove very rarely traded brand. This is because these type of car tends to carriy too much of other value other than simply mileage and year. To do this, I filter out the brands that have less than 30 postings, such as Tesla, Genesis, Hummer, etc. As a result, the data we are left with are non-luxury cars that are traded on a regular basis.

Raw Data Table (df_base)
title price mileage
2016 Ford Focus SE $15,200 Mileage 27,839 km
2018 BMW X3 xDrive30i Premium Essential $38,850 Mileage 37,468 km
2018 Chevrolet SILVERADO 2500HD High Country Fully Loaded Diesel - 4x4 $66,887 Mileage 22,642 km
2018 GMC Yukon Denali ECOTEC3 6.2L V8 420 HP 460 LB-FT TORQUE $64,750 Mileage 23,260 km
2019 Acura TLX Tech A-Spec, Certified Pre-Owned, Navi, Bluetooth $39,999 Mileage 17,546 km
2017 Cadillac XT5 Luxury $34,900 Mileage 45,400 km

This is the result data frame. At this point, we are ready to start preparing the data frame for the Machine Learning analysis.

Data Table After Tidying (dfr)
price mileage year make car
15200 27839 2016 Ford Ford Focus
38850 37468 2018 BMW BMW X3
66887 22642 2018 Chevrolet Chevrolet SILVERADO
64750 23260 2018 GMC GMC Yukon
39999 17546 2019 Acura Acura TLX
34900 45400 2017 Cadillac Cadillac XT5

Analysis: Machine Learning Linear Regression

The first thing is to split the data frame into a training set and a testing set. The ratio was decided to be 80/20. I splited 80% into training set and the remaining 20% into testing set. This process was done using ‘rsample’ package with allows random splitting.

Head of Training Set
price mileage year make car
2 38850 37468 2018 BMW BMW X3
3 66887 22642 2018 Chevrolet Chevrolet SILVERADO
5 39999 17546 2019 Acura Acura TLX
6 34900 45400 2017 Cadillac Cadillac XT5
8 41500 70700 2016 GMC GMC Sierra
9 34999 50816 2018 Jeep Jeep Wrangler

Second, both the training set and testing set needs to be prepared and transformed into ones that are suitable for the machine learning process. I did this using ‘recipes’ package and transform the make and car variable into dummy (binary) variables. Both sets are baked and ready for the analysis. This is what the finished table looks like. The two table both have 519 columns.

Sample of Baked training Set
price mileage year make_Audi make_BMW make_Buick make_Cadillac make_Chevrolet
38850 37468 2018 0 1 0 0 0
66887 22642 2018 0 0 0 0 1
39999 17546 2019 0 0 0 0 0
34900 45400 2017 0 0 0 1 0
41500 70700 2016 0 0 0 0 0
34999 50816 2018 0 0 0 0 0

Third, we are ready to model using ‘parsnip’ package. The training set was fed into a linear regression model. We can see that the P-values for mileage and year are extremely closed to zero, indicating strong negative relationships with price. P-value for different make and model varies dramatically ranging from 0.9 to 0. Most make have very low P-value. However, there are several makes that have very high P-values. The lower the p-value for certain make is, the more useful this model is to price vehicles from this make.

P-values for Year, Mileage and Makes
term estimate p.value
year 1.789148e+03 0.0000000
mileage -8.622600e-03 0.0000000
make_Chrysler 1.029409e+05 0.0000000
make_Dodge 5.892985e+04 0.0000000
make_Volkswagen 3.692203e+04 0.0016319
make_Ram 3.042079e+04 0.0094352
make_Jeep 2.196423e+04 0.0304408
make_Mercedes.Benz 2.069141e+04 0.0305790
make_Honda 1.931998e+04 0.0569042
make_Infiniti 1.610635e+04 0.0578671
make_Kia 1.670240e+04 0.0998571
make_Volvo 1.328998e+04 0.1163953
make_BMW 1.411880e+04 0.1399776
make_Buick -1.577847e+04 0.1780993
make_Toyota -1.138733e+04 0.1798599
make_Scion -1.538433e+04 0.1891998
make_Fiat -1.151325e+04 0.2288770
make_Audi 1.097710e+04 0.2512572
make_Mitsubishi -8.752058e+03 0.2933196
make_Hyundai -7.387704e+03 0.3952224
make_Subaru -6.353383e+03 0.4599910
make_Pontiac -6.623530e+03 0.4886970
make_Lexus 5.499675e+03 0.6388463
make_GMC -4.547940e+03 0.6539856
make_Mazda -4.999468e+03 0.6695810
make_Lincoln -4.377804e+03 0.7086567
make_Nissan -3.959777e+03 0.7353789
make_MINI 2.729928e+03 0.7561089
make_Cadillac -1.874488e+03 0.8263012
make_Ford -1.766719e+03 0.8801359
make_Chevrolet -1.065141e+03 0.9164093

Result

The final step is to test the performance of the model by feeding the model with the testing set.

Model Prediction vs. Actual
.pred price
16124.083 15200
50901.016 64750
53844.242 49000
60362.241 138000
91800.109 134900
-1491.688 119500
Goodness of Fit
.metric .estimator .estimate
mae standard 5253.6088189
rmse standard 8985.3194450
rsq standard 0.6627653

Test on my car

To get a realistic feel of how the model is doing, I want to use it to price the car I own. I own a 2017 Toyota RAV4 with a mileage of 35,000 km. The result is within my estimation range but a bit higher.

Data of My car
price mileage year car make
27429.32 35000 2017 Toyota RAV4 Toyota

Exploration with Visualization

From this graph (Mileage;s effect on Prices - Different Make), we can see from the slope of the best fit lines about which how make affects the depreciation of mileage. Makes that have a steep slope such as BMW, Mercedes-Benz, Cadillac depreciate fastly with mileage increase, compared to other makes. For makes that have relativelt flattened best of fit line, their cars remain more of their value when mileage increases.

We can explore the relationship between year, price and how the makes effect it using a similar style of graph. From the graph, we can see that different make have different effect on the depreciation of price from year. For makes that have a very flattened best of fit line, such as Subaru, Mitsubishi, and Buick, the increase in the vehicle’s ages has little impact on its value. In addition, I also wanted to know what is the most popular vehicle for sales in Alberta. Therefore, I made this chart to answer the question.

Conclusion

This model uses massive data (>10,000) from Autotrader to enable automatic pricing for the majority of vehicle. It applies to other search result in Autotrader by changing the url address. By using this model, consumers would have more knowledge and confidence in the price of vehicles. With more well-informed consumer, the used car market can become more efficient.