Used cars are hard to price. Overwelming information of individual cars, mileage, year, make, condition, all play important roles in determining the car value. Furthermore, car prices changes consistantly when new model comes out and regulation changes. Most consumer like me has limit knowledge about car and find information online overwelming. As a result, when consumers trying to sell or buy a car, they often put themselves at a vulnerable position and bear the risk of misprice.
Based on the research, Make, Model, Year, Mileage, Condition, and Whether accident happened are the main determinates for the values of a used vehicle. Among these criteria, condition is subjective and hard to quantify without taking an in-person look. Although accident is a huge factor in value, a lot of posting does not disclose this information. Also, it is hard to access if the statement is truthful.
The biggest used auto trading website being used in Alberta include Autotrader, Kijiji, Cargurus. This analysis uses data from Autotrader. This is because it contains more than 25,000 posts for Alberta alone and important vehicle information are collectively presented on a search web pages.
After entering searching criteria into its searching engine, in this case, Alberta as location, one can find about 25,000 used vehicles. On the search result page, there are three areas that contain the information we need. Each search title is always written as “Year Make Model OtherInfo”. Prices and Milleage can be found below and beside the title. However, since some posting has no mileage or prices shown on the search result page, we eliminate those by changing both the minimum price and minimum mileage to 1.
The maximium number of listing per page is 100. Taking a look at the HTTP address of the page, we can see that the pages number is controlled by change the number after “rcs =” to a multiple of 100. This means that we can access the information for all pages by changes that number. “https://www.autotrader.ca/cars/ab/?rcp=100&rcs=”0“&srt=3&pRng=1%2C&oRng=1000%2C&prx=-2&prv=Alberta&loc=alberta&hprc=True&wcp=True&sts=Used&showcpo=1&inMarket=advancedSearch”
Goal: My goal is to build a Machine Learning model and alone with other tools to explore the relationship between make, model, year, mileage and price. Ideally, this model can be applied to other search result from Autotrader.
Plan:
The first step to accomplish the goal is to web scrap information from Autotrader. To begin with, I put in search result that would include all used car posted in Alberta. I also include some criteria such as mileage > 1000 and price > $1, because some posting has no price or mileage. The url for the website is “https://www.autotrader.ca/cars/ab/?rcp=100&rcs=0&srt=3&pRng=1%2C&oRng=1000%2C&prx=-2&prv=Alberta&loc=alberta&hprc=True&wcp=True&sts=Used&showcpo=1&inMarket=advancedSearch”.
After using SelectorGadget to obtain the Xpath for the price, milleage and title which includes year, make and model, a loop was used to scrap these data from 250 pages by changing the number after “rcs=” to multiplies of 100. Price, mileage and title were organized into a dataframe. The package I used to web scrap was ‘rvest’.
## [1] "Number of postings collected: 27867"
In order to run analysis for the data, the data needs to be cleaned and organized. Taking a look at the first data frame straight from the website. Year, make, model can be isolated from the title. I need to also remove words, unit, and commas from price and milleage columns.
In addition, I decided to only focus on cars that are not luxury, because brands that are too expensive can affect the accuraccy of the analysis on regular cars. To achieve this, I removed brands that have average price aboved population average. The removed brands include Ferrari, Lamborghini, etc. Also, it is important to remove very rarely traded brand. This is because these type of car tends to carriy too much of other value other than simply mileage and year. To do this, I filter out the brands that have less than 30 postings, such as Tesla, Genesis, Hummer, etc. As a result, the data we are left with are non-luxury cars that are traded on a regular basis.
| title | price | mileage |
|---|---|---|
| 2019 Buick Enclave Premium | $46,888 | Mileage 21,933 km |
| 2015 Nissan Pathfinder S | $21,278 | Mileage 21,762 km |
| 2008 Lamborghini Gallardo SUPERLEGGERA, NEW CLUTCH, LEASE TODAY | $139,999 | Mileage 16,250 km |
| 2018 Audi A5 2.0T Technik quattro 7sp S Tronic Cpe | $50,016 | Mileage 3,456 km |
| 2018 Ram 1500 SLT - Pentastar V6 | $35,999 | Mileage 4,613 km |
| 2015 Nissan Murano Platinum AWD Panoramic Sunroof, Navigation, Heated | $26,388 | Mileage 76,647 km |
This is the result data frame. At this point, we are ready to start preparing the data frame for the Machine Learning analysis.
| price | mileage | year | make | car |
|---|---|---|---|---|
| 46888 | 21933 | 2019 | Buick | Buick Enclave |
| 21278 | 21762 | 2015 | Nissan | Nissan Pathfinder |
| 139999 | 16250 | 2008 | Lamborghini | Lamborghini Gallardo |
| 50016 | 3456 | 2018 | Audi | Audi A5 |
| 35999 | 4613 | 2018 | Ram | Ram 1500 |
| 26388 | 76647 | 2015 | Nissan | Nissan Murano |
The first thing is to split the data frame into a training set and a testing set. The ratio was decided to be 80/20. I splited 80% into training set and the remaining 20% into testing set. This process was done using ‘rsample’ package with allows random splitting.
| price | mileage | year | make | car |
|---|---|---|---|---|
| 46888 | 21933 | 2019 | Buick | Buick Enclave |
| 21278 | 21762 | 2015 | Nissan | Nissan Pathfinder |
| 50016 | 3456 | 2018 | Audi | Audi A5 |
| 35999 | 4613 | 2018 | Ram | Ram 1500 |
| 26388 | 76647 | 2015 | Nissan | Nissan Murano |
| 49999 | 29756 | 2019 | Jeep | Jeep Grand |
Second, both the training set and testing set needs to be prepared and transformed into ones that are suitable for the machine learning process. I did this using ‘recipes’ package and transform the make and car variable into dummy (binary) variables. Both sets are baked and ready for the analysis. This is what the finished table looks like. The two table both have 519 columns.
| price | mileage | year | make_Audi | make_BMW | make_Buick | make_Cadillac | make_Chevrolet |
|---|---|---|---|---|---|---|---|
| 46888 | 21933 | 2019 | 0 | 0 | 1 | 0 | 0 |
| 21278 | 21762 | 2015 | 0 | 0 | 0 | 0 | 0 |
| 50016 | 3456 | 2018 | 1 | 0 | 0 | 0 | 0 |
| 35999 | 4613 | 2018 | 0 | 0 | 0 | 0 | 0 |
| 26388 | 76647 | 2015 | 0 | 0 | 0 | 0 | 0 |
| 49999 | 29756 | 2019 | 0 | 0 | 0 | 0 | 0 |
Third, we are ready to model using ‘parsnip’ package. The training set was fed into a linear regression model. We can see that the P-values for mileage and year are extremely closed to zero, indicating strong negative relationships with price. P-value for different make and model varies dramatically ranging from 0.9 to 0. Most make have very low P-value. However, there are several makes that have a P-value higher than 0.15. The lower the p-value for certain make is, the more useful this model is to price vehicles from this make.
| term | estimate | p.value |
|---|---|---|
| year | 1776.609852 | 0.0000000 |
| mileage | -0.008811 | 0.0000000 |
| make_Chrysler | 102232.437387 | 0.0000000 |
| make_Dodge | 57858.110355 | 0.0000000 |
| make_Volkswagen | 36707.640023 | 0.0016948 |
| make_Jeep | 21422.892452 | 0.0248236 |
| make_Mercedes.Benz | 20818.724458 | 0.0291961 |
| make_Infiniti | 18262.272651 | 0.0311225 |
| make_Honda | 19329.329949 | 0.0562171 |
| make_Volvo | 13730.414710 | 0.1036750 |
| make_Ram | 15576.239772 | 0.1239666 |
| make_BMW | 14139.903581 | 0.1384784 |
| make_Toyota | -11785.973418 | 0.1636503 |
| make_Kia | 15864.245132 | 0.1748360 |
| make_Buick | -15669.471880 | 0.1801130 |
| make_Scion | -15270.522103 | 0.1914709 |
| make_Fiat | -11495.889350 | 0.2285014 |
| make_Audi | 11460.141348 | 0.2299271 |
| make_Mitsubishi | -8789.118073 | 0.2904915 |
| make_Lexus | 9683.711263 | 0.3389125 |
| make_Mazda | -8210.563857 | 0.3743013 |
| make_Hyundai | -6909.276423 | 0.4219340 |
| make_Pontiac | -6567.147103 | 0.4914096 |
| make_GMC | 7567.232244 | 0.5173962 |
| make_Subaru | -5602.360464 | 0.5202876 |
| make_Nissan | -4991.254722 | 0.6693787 |
| make_Lincoln | -4341.506870 | 0.7103283 |
| make_Cadillac | -1976.988895 | 0.8172992 |
| make_Ford | -1342.696695 | 0.9085494 |
| make_MINI | 625.187649 | 0.9460701 |
| make_Chevrolet | -58.104000 | 0.9960344 |
The final step is to test the performance of the model by feeding the model with the testing set.
| .pred | price |
|---|---|
| 38259.05 | 47993 |
| 27920.46 | 31995 |
| 64404.60 | 130000 |
| 89871.29 | 129900 |
| 67874.42 | 128000 |
| 95002.39 | 119980 |
| .metric | .estimator | .estimate |
|---|---|---|
| mae | standard | 5377.1689087 |
| rmse | standard | 9022.2485714 |
| rsq | standard | 0.6931893 |
To get a realistic feel of how the model is doing, I want to use it to price the car I own. I own a 2017 Toyota RAV4 with a mileage of 35,000 km. The result is within my estimation range but a bit higher.
| price | mileage | year | car | make |
|---|---|---|---|---|
| 27183.06 | 35000 | 2017 | Toyota RAV4 | Toyota |
From this graph (Mileage;s effect on Prices - Different Make), we can see from the slope of the best fit lines about which how make affects the depreciation of mileage. Makes that have a steep slope such as BMW, Mercedes-Benz, Cadillac depreciate fastly with mileage increase, compared to other makes. For makes that have relativelt flattened best of fit line, their cars remain more of their value when mileage increases.
We can explore the relationship between year, price and how the makes effect it using a similar style of graph. From the graph, we can see that different make have different effect on the depreciation of price from year. For makes that have a very flattened best of fit line, such as Subaru, Mitsubishi, and Buick, the increase in the vehicle’s ages has little impact on its value. In addition, I also wanted to know what is the most popular vehicle for sales in Alberta. Therefore, I made this chart to answer the question.
This model uses massive data (>10,000) from Autotrader to enable automatic pricing for the majority of vehicle. It applies to other search result in Autotrader by changing the url address. By using this model, consumers would have more knowledge and confidence in the price of vehicles. With more well-informed consumer, the used car market can become more efficient.