Introduction

Used cars are hard to price. Overwelming information of individual cars, mileage, year, make, condition, all play important roles in determining the car value. Furthermore, car prices changes consistantly when new model comes out and regulation changes. Most consumer like me has limit knowledge about car and find information online overwelming. As a result, when consumers trying to sell or buy a car, they often put themselves at a vulnerable position and bear the risk of misprice.

Research Overview

  1. Information on factors that affect used car values

Based on the research, Make, Model, Year, Mileage, Condition, and Whether accident happened are the main determinates for the values of a used vehicle. Among these criteria, condition is subjective and hard to quantify without taking an in-person look. Although accident is a huge factor in value, a lot of posting does not disclose this information. Also, it is hard to access if the statement is truthful.

  1. select a website that contain large amount of information. An ideal website needs to fulfill the following criteria:

The biggest used auto trading website being used in Alberta include Autotrader, Kijiji, Cargurus. This analysis uses data from Autotrader. This is because it contains more than 25,000 posts for Alberta alone and important vehicle information are collectively presented on a search web pages.

  1. An overview of Autotrader website

After entering searching criteria into its searching engine, in this case, Alberta as location, one can find about 25,000 used vehicles. On the search result page, there are three areas that contain the information we need. Each search title is always written as “Year Make Model OtherInfo”. Prices and Milleage can be found below and beside the title. However, since some posting has no mileage or prices shown on the search result page, we eliminate those by changing both the minimum price and minimum mileage to 1.

The maximium number of listing per page is 100. Taking a look at the HTTP address of the page, we can see that the pages number is controlled by change the number after “rcs =” to a multiple of 100. This means that we can access the information for all pages by changes that number. “https://www.autotrader.ca/cars/ab/?rcp=100&rcs=0“&srt=3&pRng=1%2C&oRng=1000%2C&prx=-2&prv=Alberta&loc=alberta&hprc=True&wcp=True&sts=Used&showcpo=1&inMarket=advancedSearch”

Work Plan

Goal: My goal is to build a Machine Learning model and alone with other tools to explore the relationship between make, model, year, mileage and price. Ideally, this model can be applied to other search result from Autotrader. This model can be used to price any car that is traded on Autotrader, with some exception on luxury makes such as Ferrari.

Plan:

  1. Web scrapping
  2. Data wraggling - organize data, eliminate data
  3. Building ML Model
  4. Test Goodness of Fit
  5. Price my own car
  6. Visualization to exlore relationships

Implementation


Data: Web scrapping

The first step to accomplish the goal is to web scrap information from Autotrader. To begin with, I put in search result that would include all used car posted in Alberta. I also include some criteria such as mileage > 1000 and price > $1, because some posting has no price or mileage. The url for the website is “https://www.autotrader.ca/cars/ab/?rcp=100&rcs=0&srt=3&pRng=1%2C&oRng=1000%2C&prx=-2&prv=Alberta&loc=alberta&hprc=True&wcp=True&sts=Used&showcpo=1&inMarket=advancedSearch”.

After using SelectorGadget to obtain the Xpath for the price, milleage and title which includes year, make and model, a loop was used to scrap these data from 250 pages by changing the number after “rcs=” to multiplies of 100. Price, mileage and title were organized into a dataframe. The package I used to web scrap was ‘rvest’.

## [1] "Number of postings collected: 27886"

Data: tidying

In order to run analysis for the data, the data needs to be cleaned and organized. Taking a look at the first data frame straight from the website. Year, make, model can be isolated from the title. I need to also remove words, unit, and commas from price and milleage columns.

In addition, I decided to only focus on cars that are not luxury, because brands that are too expensive can affect the accuraccy of the analysis on regular cars. To achieve this, I removed brands that have average price aboved population average. The removed brands include Ferrari, Lamborghini, etc. Also, it is important to remove very rarely traded brand. This is because these type of car tends to carriy too much of other value other than simply mileage and year. To do this, I filter out the brands that have less than 30 postings, such as Tesla, Genesis, Hummer, etc. As a result, the data we are left with are non-luxury cars that are traded on a regular basis.

Raw Data Table (df_base)
title price mileage
2016 BMW 340i xDrive xDrive Sedan $38,795 Mileage 64,306 km
2018 Lexus GS 350 F Sport Series 2 $48,900 Mileage 13,970 km
2014 Audi R8 Gated 6 Speed, V10, Lease or Finance Today $134,999 Mileage 27,798 km
2017 GMC Yukon XL SLT $52,888 Mileage 65,603 km
2018 Ford Explorer Platinum - Running Boards - $283 B/W $41,912 Mileage 35,672 km
2017 Dodge Journey SXT - Low Mileage $24,995 Mileage 4,521 km

This is the result data frame. At this point, we are ready to start preparing the data frame for the Machine Learning analysis.

Data Table After Tidying (dfr)
price mileage year make car
38795 64306 2016 BMW BMW 340i
48900 13970 2018 Lexus Lexus GS
134999 27798 2014 Audi Audi R8
52888 65603 2017 GMC GMC Yukon
41912 35672 2018 Ford Ford Explorer
24995 4521 2017 Dodge Dodge Journey

Analysis: Machine Learning Linear Regression

The first thing is to split the data frame into a training set and a testing set. The ratio was decided to be 80/20. I splited 80% into training set and the remaining 20% into testing set. This process was done using ‘rsample’ package with allows random splitting.

Head of Training Set
price mileage year make car
38795 64306 2016 BMW BMW 340i
48900 13970 2018 Lexus Lexus GS
134999 27798 2014 Audi Audi R8
52888 65603 2017 GMC GMC Yukon
41912 35672 2018 Ford Ford Explorer
24995 4521 2017 Dodge Dodge Journey

Second, both the training set and testing set needs to be prepared and transformed into ones that are suitable for the machine learning process. I did this using ‘recipes’ package and transform the make and car variable into dummy (binary) variables. Both sets are baked and ready for the analysis. This is what the finished table looks like. The two table both have 519 columns.

Sample of Baked traning Set
price mileage year make_Audi make_BMW make_Buick make_Cadillac make_Chevrolet
38795 64306 2016 0 1 0 0 0
48900 13970 2018 0 0 0 0 0
134999 27798 2014 1 0 0 0 0
52888 65603 2017 0 0 0 0 0
41912 35672 2018 0 0 0 0 0
24995 4521 2017 0 0 0 0 0

Third, we are ready to model using ‘parsnip’ package. The training set was fed into a linear regression model. We can see that the P-values for mileage and year are extremely closed to zero, indicating strong negative relationships with price. P-value for different make and model varies dramatically ranging from 0.9 to 0. Most make have very low P-value. However, there are several makes that have very high P-values. The lower the p-value for certain make is, the more useful this model is to price vehicles from this make.

P-values for Year, Mileage and Makes
term estimate p.value
year 1.698221e+03 0.0000000
mileage -9.122800e-03 0.0000000
make_Chrysler 9.788989e+04 0.0000000
make_Dodge 5.495336e+04 0.0000000
make_Volkswagen 3.531535e+04 0.0032061
make_Infiniti 1.952593e+04 0.0243454
make_Mercedes.Benz 2.170187e+04 0.0265193
make_Jeep 2.056581e+04 0.0355161
make_Honda 2.331794e+04 0.0515760
make_Volvo 1.625919e+04 0.0602690
make_BMW 1.754187e+04 0.0639774
make_Kia 1.815151e+04 0.0802521
make_Ram 1.660487e+04 0.1095143
make_Audi 1.230597e+04 0.2083673
make_Buick -1.497034e+04 0.2113998
make_Toyota -1.060547e+04 0.2192344
make_Scion -1.196431e+04 0.2212684
make_Fiat -1.035949e+04 0.2643235
make_Lexus 1.144138e+04 0.3394770
make_Mitsubishi -7.992626e+03 0.3480063
make_Hyundai -6.977240e+03 0.4273328
make_Pontiac -7.153865e+03 0.4904258
make_Subaru -4.732336e+03 0.5927501
make_GMC -4.290171e+03 0.6791815
make_Nissan -4.811056e+03 0.6879346
make_Mazda -4.801439e+03 0.6885247
make_MINI 3.198003e+03 0.7267216
make_Lincoln -4.108524e+03 0.7315927
make_Ford -1.747386e+03 0.8840122
make_Cadillac -6.755647e+02 0.9384567
make_Chevrolet -8.292376e+02 0.9448136

Result

The final step is to test the performance of the model by feeding the model with the testing set.

Model Prediction vs. Actual
.pred price
36673.36 45016
39310.11 44999
40853.97 31888
59741.02 138000
137973.15 135890
126924.94 134999
Goodness of Fit
.metric .estimator .estimate
mae standard 5287.563360
rmse standard 8543.401095
rsq standard 0.739535

Test on my car

To get a realistic feel of how the model is doing, I want to use it to price the car I own. I own a 2017 Toyota RAV4 with a mileage of 35,000 km. The result is within my estimation range but a bit higher.

Data of My car
price mileage year car make
26998.14 35000 2017 Toyota RAV4 Toyota

Exploration with Visualization

From this graph (Mileage;s effect on Prices - Different Make), we can see from the slope of the best fit lines about which how make affects the depreciation of mileage. Makes that have a steep slope such as BMW, Mercedes-Benz, Cadillac depreciate fastly with mileage increase, compared to other makes. For makes that have relativelt flattened best of fit line, their cars remain more of their value when mileage increases.

We can explore the relationship between year, price and how the makes effect it using a similar style of graph. From the graph, we can see that different make have different effect on the depreciation of price from year. For makes that have a very flattened best of fit line, such as Subaru, Mitsubishi, and Buick, the increase in the vehicle’s ages has little impact on its value. In addition, I also wanted to know what is the most popular vehicle for sales in Alberta. Therefore, I made this chart to answer the question.

Conclusion

This model uses massive data (>10,000) from Autotrader to enable automatic pricing for the majority of vehicle. It applies to other search result in Autotrader by changing the url address. By using this model, consumers would have more knowledge and confidence in the price of vehicles. With more well-informed consumer, the used car market can become more efficient.