Introduction

Used cars are hard to price. Overwelming information of individual cars, mileage, year, make, condition, all play important roles in determining the car value. Furthermore, car prices changes consistantly when new model comes out and regulation changes. Most consumer like me has limit knowledge about car and find information online overwelming. As a result, when consumers trying to sell or buy a car, they often put themselves at a vulnerable position and bear the risk of misprice.

research overview

  1. Information on factors that affect used car values

Based on the research, Make, Model, Year, Mileage, Condition, and Whether accident happened are the main determinates for the values of a used vehicle. Among these criteria, condition is subjective and hard to quantify without taking an in-person look. Although accident is a huge factor in value, a lot of posting does not disclose this information. Also, it is hard to access if the statement is truthful.

  1. select a website that contain large amount of information. An ideal website needs to fulfill the following criteria:

The biggest used auto trading website being used in Alberta include Autotrader, Kijiji, Cargurus. This analysis uses data from Autotrader. This is because it contains more than 25,000 posts for Alberta alone and important vehicle information are collectively presented on a search web pages.

  1. An overview of Autotrader website

After entering searching criteria into its searching engine, in this case, Alberta as location, one can find about 25,000 used vehicles. On the search result page, there are three areas that contain the information we need. Each search title is always written as “Year Make Model OtherInfo”. Prices and Milleage can be found below and beside the title. However, since some posting has no mileage or prices shown on the search result page, we eliminate those by changing both the minimum price and minimum mileage to 1.

The maximium number of listing per page is 100. Taking a look at the HTTP address of the page, we can see that the pages number is controlled by change the number after “rcs =” to a multiple of 100. This means that we can access the information for all pages by changes that number. “https://www.autotrader.ca/cars/ab/?rcp=100&rcs=0“&srt=3&pRng=1%2C&oRng=1000%2C&prx=-2&prv=Alberta&loc=alberta&hprc=True&wcp=True&sts=Used&showcpo=1&inMarket=advancedSearch”

Work Plan

Goal: My goal is to build a Machine Learning model and alone with other tools to explore the relationship between make, model, year, mileage and price. Ideally, this model can be applied to other search result from Autotrader. 

Plan:

  1. Web scrapping
  2. Data wraggling - organize data, eliminate data
  3. Building ML Model
  4. Test Goodness of Fit
  5. Price my own car
  6. Visualization to exlore relationships

Implementation


Data: Web scrapping

The first step to accomplish the goal is to web scrap information from Autotrader. To begin with, I put in search result that would include all used car posted in Alberta. I also include some criteria such as mileage > 1000 and price > $1, because some posting has no price or mileage. The url for the website is “https://www.autotrader.ca/cars/ab/?rcp=100&rcs=0&srt=3&pRng=1%2C&oRng=1000%2C&prx=-2&prv=Alberta&loc=alberta&hprc=True&wcp=True&sts=Used&showcpo=1&inMarket=advancedSearch”.

After using SelectorGadget to obtain the Xpath for the price, milleage and title which includes year, make and model, a loop was used to scrap these data from 250 pages by changing the number after “rcs=” to multiplies of 100. Price, mileage and title were organized into a dataframe. The package I used to web scrap was ‘rvest’.

## [1] "Number of postings collected: 27867"

Data: tidying

In order to run analysis for the data, the data needs to be cleaned and organized. Taking a look at the first data frame straight from the website. Year, make, model can be isolated from the title. I need to also remove words, unit, and commas from price and milleage columns.

In addition, I decided to only focus on cars that are not luxury, because brands that are too expensive can affect the accuraccy of the analysis on regular cars. To achieve this, I removed brands that have average price aboved population average. The removed brands include Ferrari, Lamborghini, etc. Also, it is important to remove very rarely traded brand. This is because these type of car tends to carriy too much of other value other than simply mileage and year. To do this, I filter out the brands that have less than 30 postings, such as Tesla, Genesis, Hummer, etc. As a result, the data we are left with are non-luxury cars that are traded on a regular basis.

Raw Data Table (df_base)
title price mileage
2019 Buick Enclave Premium $46,888 Mileage 21,933 km
2015 Nissan Pathfinder S $21,278 Mileage 21,762 km
2008 Lamborghini Gallardo SUPERLEGGERA, NEW CLUTCH, LEASE TODAY $139,999 Mileage 16,250 km
2018 Audi A5 2.0T Technik quattro 7sp S Tronic Cpe $50,016 Mileage 3,456 km
2018 Ram 1500 SLT - Pentastar V6 $35,999 Mileage 4,613 km
2015 Nissan Murano Platinum AWD Panoramic Sunroof, Navigation, Heated $26,388 Mileage 76,647 km

This is the result data frame. At this point, we are ready to start preparing the data frame for the Machine Learning analysis.

Data Table After Tidying (dfr)
price mileage year make car
46888 21933 2019 Buick Buick Enclave
21278 21762 2015 Nissan Nissan Pathfinder
139999 16250 2008 Lamborghini Lamborghini Gallardo
50016 3456 2018 Audi Audi A5
35999 4613 2018 Ram Ram 1500
26388 76647 2015 Nissan Nissan Murano

Analysis: Machine Learning Linear Regression

The first thing is to split the data frame into a training set and a testing set. The ratio was decided to be 80/20. I splited 80% into training set and the remaining 20% into testing set. This process was done using ‘rsample’ package with allows random splitting.

Head of Training Set
price mileage year make car
46888 21933 2019 Buick Buick Enclave
21278 21762 2015 Nissan Nissan Pathfinder
50016 3456 2018 Audi Audi A5
35999 4613 2018 Ram Ram 1500
26388 76647 2015 Nissan Nissan Murano
49999 29756 2019 Jeep Jeep Grand

Second, both the training set and testing set needs to be prepared and transformed into ones that are suitable for the machine learning process. I did this using ‘recipes’ package and transform the make and car variable into dummy (binary) variables. Both sets are baked and ready for the analysis. This is what the finished table looks like. The two table both have 519 columns.

Sample of Baked traning Set
price mileage year make_Audi make_BMW make_Buick make_Cadillac make_Chevrolet
46888 21933 2019 0 0 1 0 0
21278 21762 2015 0 0 0 0 0
50016 3456 2018 1 0 0 0 0
35999 4613 2018 0 0 0 0 0
26388 76647 2015 0 0 0 0 0
49999 29756 2019 0 0 0 0 0

Third, we are ready to model using ‘parsnip’ package. The training set was fed into a linear regression model. We can see that the P-values for mileage and year are extremely closed to zero, indicating strong negative relationships with price. P-value for different make and model varies dramatically ranging from 0.9 to 0. Most make have very low P-value. However, there are several makes that have a P-value higher than 0.15. The lower the p-value for certain make is, the more useful this model is to price vehicles from this make.

P-values for Year, Mileage and Makes
term estimate p.value
year 1776.609852 0.0000000
mileage -0.008811 0.0000000
make_Chrysler 102232.437387 0.0000000
make_Dodge 57858.110355 0.0000000
make_Volkswagen 36707.640023 0.0016948
make_Jeep 21422.892452 0.0248236
make_Mercedes.Benz 20818.724458 0.0291961
make_Infiniti 18262.272651 0.0311225
make_Honda 19329.329949 0.0562171
make_Volvo 13730.414710 0.1036750
make_Ram 15576.239772 0.1239666
make_BMW 14139.903581 0.1384784
make_Toyota -11785.973418 0.1636503
make_Kia 15864.245132 0.1748360
make_Buick -15669.471880 0.1801130
make_Scion -15270.522103 0.1914709
make_Fiat -11495.889350 0.2285014
make_Audi 11460.141348 0.2299271
make_Mitsubishi -8789.118073 0.2904915
make_Lexus 9683.711263 0.3389125
make_Mazda -8210.563857 0.3743013
make_Hyundai -6909.276423 0.4219340
make_Pontiac -6567.147103 0.4914096
make_GMC 7567.232244 0.5173962
make_Subaru -5602.360464 0.5202876
make_Nissan -4991.254722 0.6693787
make_Lincoln -4341.506870 0.7103283
make_Cadillac -1976.988895 0.8172992
make_Ford -1342.696695 0.9085494
make_MINI 625.187649 0.9460701
make_Chevrolet -58.104000 0.9960344

Result

The final step is to test the performance of the model by feeding the model with the testing set.

Model Prediction vs. Actual
.pred price
38259.05 47993
27920.46 31995
64404.60 130000
89871.29 129900
67874.42 128000
95002.39 119980
Goodness of Fit
.metric .estimator .estimate
mae standard 5377.1689087
rmse standard 9022.2485714
rsq standard 0.6931893

Test on my car

To get a realistic feel of how the model is doing, I want to use it to price the car I own. I own a 2017 Toyota RAV4 with a mileage of 35,000 km. The result is within my estimation range but a bit higher.

Data of My car
price mileage year car make
27183.06 35000 2017 Toyota RAV4 Toyota

Exploration with Visualization

From this graph (Mileage;s effect on Prices - Different Make), we can see from the slope of the best fit lines about which how make affects the depreciation of mileage. Makes that have a steep slope such as BMW, Mercedes-Benz, Cadillac depreciate fastly with mileage increase, compared to other makes. For makes that have relativelt flattened best of fit line, their cars remain more of their value when mileage increases.

We can explore the relationship between year, price and how the makes effect it using a similar style of graph. From the graph, we can see that different make have different effect on the depreciation of price from year. For makes that have a very flattened best of fit line, such as Subaru, Mitsubishi, and Buick, the increase in the vehicle’s ages has little impact on its value. In addition, I also wanted to know what is the most popular vehicle for sales in Alberta. Therefore, I made this chart to answer the question.

Conclusion

This model uses massive data (>10,000) from Autotrader to enable automatic pricing for the majority of vehicle. It applies to other search result in Autotrader by changing the url address. By using this model, consumers would have more knowledge and confidence in the price of vehicles. With more well-informed consumer, the used car market can become more efficient.