Used cars are hard to price. Overwelming information of individual cars, mileage, year, make, condition, all play important roles in determining the car value. Furthermore, car prices changes consistantly when new model comes out and regulation changes. Most consumer like me has limit knowledge about car and find information online overwelming. As a result, when consumers trying to sell or buy a car, they often put themselves at a vulnerable position and bear the risk of misprice.
Based on the research, Make, Model, Year, Mileage, Condition, and Whether accident happened are the main determinates for the values of a used vehicle. Among these criteria, condition is subjective and hard to quantify without taking an in-person look. Although accident is a huge factor in value, a lot of posting does not disclose this information. Also, it is hard to access if the statement is truthful.
The biggest used auto trading website being used in Alberta include Autotrader, Kijiji, Cargurus. This analysis uses data from Autotrader. This is because it contains more than 25,000 posts for Alberta alone and important vehicle information are collectively presented on a search web pages.
After entering searching criteria into its searching engine, in this case, Alberta as location, one can find about 25,000 used vehicles. On the search result page, there are three areas that contain the information we need. Each search title is always written as “Year Make Model OtherInfo”. Prices and Milleage can be found below and beside the title. However, since some posting has no mileage or prices shown on the search result page, we eliminate those by changing both the minimum price and minimum mileage to 1.
The maximium number of listing per page is 100. Taking a look at the HTTP address of the page, we can see that the pages number is controlled by change the number after “rcs =” to a multiple of 100. This means that we can access the information for all pages by changes that number. “https://www.autotrader.ca/cars/ab/?rcp=100&rcs=”0“&srt=3&pRng=1%2C&oRng=1000%2C&prx=-2&prv=Alberta&loc=alberta&hprc=True&wcp=True&sts=Used&showcpo=1&inMarket=advancedSearch”
Goal: My goal is to build a Machine Learning model and alone with other tools to explore the relationship between make, model, year, mileage and price. Ideally, this model can be applied to other search result from Autotrader. This model can be used to price any car that is traded on Autotrader, with some exception on luxury makes such as Ferrari.
Plan:
The first step to accomplish the goal is to web scrap information from Autotrader. To begin with, I put in search result that would include all used car posted in Alberta. I also include some criteria such as mileage > 1000 and price > $1, because some posting has no price or mileage. The url for the website is “https://www.autotrader.ca/cars/ab/?rcp=100&rcs=0&srt=3&pRng=1%2C&oRng=1000%2C&prx=-2&prv=Alberta&loc=alberta&hprc=True&wcp=True&sts=Used&showcpo=1&inMarket=advancedSearch”.
After using SelectorGadget to obtain the Xpath for the price, milleage and title which includes year, make and model, a loop was used to scrap these data from 250 pages by changing the number after “rcs=” to multiplies of 100. Price, mileage and title were organized into a dataframe. The package I used to web scrap was ‘rvest’.
## [1] "Number of postings collected: 27888"
In order to run analysis for the data, the data needs to be cleaned and organized. Taking a look at the first data frame straight from the website. Year, make, model can be isolated from the title. I need to also remove words, unit, and commas from price and milleage columns.
In addition, I decided to only focus on cars that are not luxury, because brands that are too expensive can affect the accuraccy of the analysis on regular cars. To achieve this, I removed brands that have average price aboved population average. The removed brands include Ferrari, Lamborghini, etc. Also, it is important to remove very rarely traded brand. This is because these type of car tends to carriy too much of other value other than simply mileage and year. To do this, I filter out the brands that have less than 30 postings, such as Tesla, Genesis, Hummer, etc. As a result, the data we are left with are non-luxury cars that are traded on a regular basis.
| title | price | mileage |
|---|---|---|
| 2016 Ford Focus SE | $15,200 | Mileage 27,839 km |
| 2018 BMW X3 xDrive30i Premium Essential | $38,850 | Mileage 37,468 km |
| 2018 Chevrolet SILVERADO 2500HD High Country Fully Loaded Diesel - 4x4 | $66,887 | Mileage 22,642 km |
| 2018 GMC Yukon Denali ECOTEC3 6.2L V8 420 HP 460 LB-FT TORQUE | $64,750 | Mileage 23,260 km |
| 2019 Acura TLX Tech A-Spec, Certified Pre-Owned, Navi, Bluetooth | $39,999 | Mileage 17,546 km |
| 2017 Cadillac XT5 Luxury | $34,900 | Mileage 45,400 km |
This is the result data frame. At this point, we are ready to start preparing the data frame for the Machine Learning analysis.
| price | mileage | year | make | car |
|---|---|---|---|---|
| 15200 | 27839 | 2016 | Ford | Ford Focus |
| 38850 | 37468 | 2018 | BMW | BMW X3 |
| 66887 | 22642 | 2018 | Chevrolet | Chevrolet SILVERADO |
| 64750 | 23260 | 2018 | GMC | GMC Yukon |
| 39999 | 17546 | 2019 | Acura | Acura TLX |
| 34900 | 45400 | 2017 | Cadillac | Cadillac XT5 |
The first thing is to split the data frame into a training set and a testing set. The ratio was decided to be 80/20. I splited 80% into training set and the remaining 20% into testing set. This process was done using ‘rsample’ package with allows random splitting.
| price | mileage | year | make | car | |
|---|---|---|---|---|---|
| 2 | 38850 | 37468 | 2018 | BMW | BMW X3 |
| 3 | 66887 | 22642 | 2018 | Chevrolet | Chevrolet SILVERADO |
| 5 | 39999 | 17546 | 2019 | Acura | Acura TLX |
| 6 | 34900 | 45400 | 2017 | Cadillac | Cadillac XT5 |
| 8 | 41500 | 70700 | 2016 | GMC | GMC Sierra |
| 9 | 34999 | 50816 | 2018 | Jeep | Jeep Wrangler |
Second, both the training set and testing set needs to be prepared and transformed into ones that are suitable for the machine learning process. I did this using ‘recipes’ package and transform the make and car variable into dummy (binary) variables. Both sets are baked and ready for the analysis. This is what the finished table looks like. The two table both have 519 columns.
| price | mileage | year | make_Audi | make_BMW | make_Buick | make_Cadillac | make_Chevrolet |
|---|---|---|---|---|---|---|---|
| 38850 | 37468 | 2018 | 0 | 1 | 0 | 0 | 0 |
| 66887 | 22642 | 2018 | 0 | 0 | 0 | 0 | 1 |
| 39999 | 17546 | 2019 | 0 | 0 | 0 | 0 | 0 |
| 34900 | 45400 | 2017 | 0 | 0 | 0 | 1 | 0 |
| 41500 | 70700 | 2016 | 0 | 0 | 0 | 0 | 0 |
| 34999 | 50816 | 2018 | 0 | 0 | 0 | 0 | 0 |
Third, we are ready to model using ‘parsnip’ package. The training set was fed into a linear regression model. We can see that the P-values for mileage and year are extremely closed to zero, indicating strong negative relationships with price. P-value for different make and model varies dramatically ranging from 0.9 to 0. Most make have very low P-value. However, there are several makes that have very high P-values. The lower the p-value for certain make is, the more useful this model is to price vehicles from this make.
| term | estimate | p.value |
|---|---|---|
| year | 1.789148e+03 | 0.0000000 |
| mileage | -8.622600e-03 | 0.0000000 |
| make_Chrysler | 1.029409e+05 | 0.0000000 |
| make_Dodge | 5.892985e+04 | 0.0000000 |
| make_Volkswagen | 3.692203e+04 | 0.0016319 |
| make_Ram | 3.042079e+04 | 0.0094352 |
| make_Jeep | 2.196423e+04 | 0.0304408 |
| make_Mercedes.Benz | 2.069141e+04 | 0.0305790 |
| make_Honda | 1.931998e+04 | 0.0569042 |
| make_Infiniti | 1.610635e+04 | 0.0578671 |
| make_Kia | 1.670240e+04 | 0.0998571 |
| make_Volvo | 1.328998e+04 | 0.1163953 |
| make_BMW | 1.411880e+04 | 0.1399776 |
| make_Buick | -1.577847e+04 | 0.1780993 |
| make_Toyota | -1.138733e+04 | 0.1798599 |
| make_Scion | -1.538433e+04 | 0.1891998 |
| make_Fiat | -1.151325e+04 | 0.2288770 |
| make_Audi | 1.097710e+04 | 0.2512572 |
| make_Mitsubishi | -8.752058e+03 | 0.2933196 |
| make_Hyundai | -7.387704e+03 | 0.3952224 |
| make_Subaru | -6.353383e+03 | 0.4599910 |
| make_Pontiac | -6.623530e+03 | 0.4886970 |
| make_Lexus | 5.499675e+03 | 0.6388463 |
| make_GMC | -4.547940e+03 | 0.6539856 |
| make_Mazda | -4.999468e+03 | 0.6695810 |
| make_Lincoln | -4.377804e+03 | 0.7086567 |
| make_Nissan | -3.959777e+03 | 0.7353789 |
| make_MINI | 2.729928e+03 | 0.7561089 |
| make_Cadillac | -1.874488e+03 | 0.8263012 |
| make_Ford | -1.766719e+03 | 0.8801359 |
| make_Chevrolet | -1.065141e+03 | 0.9164093 |
The final step is to test the performance of the model by feeding the model with the testing set.
| .pred | price |
|---|---|
| 16124.083 | 15200 |
| 50901.016 | 64750 |
| 53844.242 | 49000 |
| 60362.241 | 138000 |
| 91800.109 | 134900 |
| -1491.688 | 119500 |
| .metric | .estimator | .estimate |
|---|---|---|
| mae | standard | 5253.6088189 |
| rmse | standard | 8985.3194450 |
| rsq | standard | 0.6627653 |
To get a realistic feel of how the model is doing, I want to use it to price the car I own. I own a 2017 Toyota RAV4 with a mileage of 35,000 km. The result is within my estimation range but a bit higher.
| price | mileage | year | car | make |
|---|---|---|---|---|
| 27429.32 | 35000 | 2017 | Toyota RAV4 | Toyota |
From this graph (Mileage;s effect on Prices - Different Make), we can see from the slope of the best fit lines about which how make affects the depreciation of mileage. Makes that have a steep slope such as BMW, Mercedes-Benz, Cadillac depreciate fastly with mileage increase, compared to other makes. For makes that have relativelt flattened best of fit line, their cars remain more of their value when mileage increases.
We can explore the relationship between year, price and how the makes effect it using a similar style of graph. From the graph, we can see that different make have different effect on the depreciation of price from year. For makes that have a very flattened best of fit line, such as Subaru, Mitsubishi, and Buick, the increase in the vehicle’s ages has little impact on its value. In addition, I also wanted to know what is the most popular vehicle for sales in Alberta. Therefore, I made this chart to answer the question.
This model uses massive data (>10,000) from Autotrader to enable automatic pricing for the majority of vehicle. It applies to other search result in Autotrader by changing the url address. By using this model, consumers would have more knowledge and confidence in the price of vehicles. With more well-informed consumer, the used car market can become more efficient.