The dataset used in this analysis was found on www.kaggle.com. It contains 16 columns of descriptive and numeric data for 3,552,912 unique records, which represent individual automobiles available for sale on an unnamed internet site from 11/14/2015 to 03/16/2017.
While the 16 columns of data contain descriptive details such as: engine displacement, enginge power, transmission type, etc. - this analysis focuses on only five key categorizations:
Dimensions:
Make
Model
Manufacture Year
Price (EUR)
Mileage (km)
Diving into the data, it is apparent that there are several inconsistencies that cause a large portion of the records to be incomplete.
When limiting the dataset to the above 5 categories, per the table below it can be seen that nearly half of the records are incomplete.
Further, some of the data is obviously incorrect. For example, a 2009 Citroen Berlingo with 245,966 kilometers is listed for EUR1.4B. Although it would be a nice sale for the owner, this is probably not a reasonable or accurate listing.
This trend of wildly inaccurate data exists in each of the other data dimensions, as well, and will need to be cleaned.
As the dataset is over 3.5M records, it would be near impossible to clean the data on a record by record basis, so I have chosen to remove all blank or N/A vales from the dataset and limit the dataset by constraining it within parameters that are relevant to me (and will eliminate the dramatic highs from each dimension), namely:
While this does not prevent erroneous entries, it does prevent the impact of catastrophic datapoints that would materially skew the analysis. The remaining values per the parameters defined above yield a dataset that has far fewer erroneous entries.
This limits the dataset to the below valid entries.
## [1] 1747436
With the data limited to mostly only valid and relevant records, we can take a look at some descriptive statistics on the numeric fields of the dataset.
Statistic | Price | Mileage |
---|---|---|
Min | 0.04 | 0 |
Max | 2990000 | 1000000 |
Median | 7504.44 | 91000 |
Mean/Standard Deviation | 10,656.28 ± 14,793.76 | 100,596.54 ± 80,302.68 |
The price ranges from a minimum of EUR0.04 to a maximum of EUR2.9M, representing the wide range of nearly worthless cars to nearly priceless exotic and luxury cars. While the EUR0.04 listing is likely to not be valid, there are several listings near the EUR2.9M range for the Porsche Carrera GT model, which is often seen in the EUR1M+ range. The median of the dataset is a modest EUR7,504, while the mean is EUR10,656 with a standard deviation of EUR14,793. The implication of a median lower than the mean is that there is a right skew of the data with a long tail on the higher end of the price spectrum. This is consistent with what we know about car listings as there are many utility cars grouped around modest price points to serve the masses and few highly priced listings that range into the million plus category.
Mileage behaves similarly to Price in this dataset. The range (limited by filtering to less than 1,000,000km) is from 0 to 1,000,000km, with a median of 91,000km and a mean of 100,597km. Again, the mean is higher than the median, which is indicative of a right skew in the dataset. The standard deviation may not be relevant for mileage given the skew.
Next, we can break down the count of cars available for sale by Manufacturer. This will inform us as to which cars are most popular on the resale market in Europe. Additionally, as the sample of listings is so large with more than 1.5M listings, it also informs us as to which manufacturers produce the most cars for the European market.
Per the graph above, the leader is Volkswagen with 200,000+ car listings. Next, utilitarian brands occupy the 2-4 spots with Skoda, Ford, and Opel each exceeding 150,000+ car listings. Interestingly, Audi, a luxury brand, ranks in the top 5 as number 5 with just over 150,000 cars available for sale. The next closest luxury brand is BMW, which has just over 50,000 car listings available. My expectation prior to this analysis was that Audi, BMW, and Mercedes would have similar sales figures, and therefore a similar number of listings in the European market; however, Audi appears to significantly outpace their luxury car manufacturer rivals.
Given Audi’s dominance, and the inclination of this analysis’ author for Audi sports sedans, this analysis will now focus specifically on Audi sports sedans manufactured from 2013 to 2015 (filters also shown in the table below) with the intention of understanding the differences in price by model, and to ultimately find the perfect Audi sports sedan for the author. This means finding a model that fits the price and mileage preference of the author, while also representing the best deal.
The Audi sports sedan model range consists of the S4, S6, RS4, and RS6. The S4 and RS4 are mid-size sedans, while the S6 and RS6 are large sedans. The S moniker implies sport, while the RS stands for RennSport, or Racing Sport, and therefore offers an increase in performance, which is met with an increase in price. It can be observed that the RS6 occupies the right end of the histogram above. Specifically, no other model has a listing in excess of roughly EUR80,000, but the average RS6 is over this threshold. While the RS6 is the best sports sedan from a performance perspective, it does not appear attainable for the author.
A second look at the histogram reveals that the S6 and RS4 models occupy the same price range, which is rather wide and ranges from roughly EUR40,000 to EUR65,000. Here the author sees an opportunity to prioritize performance over size and select an RS4 as the desired model.
The above density plots, separated by year further demonstrate the overlap of the S6 and RS4, and show that the overlap is the strongest for the 2013 model year. Consistent with depreciation, these listings are also significantly cheaper than the 2015 and 2014 listings for the RS4. One interesting observation is the high impact of depreciation on the RS6 model. For 2015 models (which are likely all new or have low miles), the density is highest around EUR90,000, and the range grouping is rather tight. When observing the 2013 graph, the grouping gets much wider and values range down closer to 60,000 in near uniformity. This indicates that mileage is likely a significant factor in the depreciation of these pricey luxury cars, and it is likely that mileage positively correlates to a drop in price for all cars, but especially so for higher priced cars.
Next, the author limits the dataset to RS4 only, as that represents the right balance between performance, size, and price. Additionally, the author excludes listings with over 100,000kms and a price of less than EUR20,000, as these are likely to be not in good condition.
These violin plots on price and mileage, combined with the price by mileage scatter plot below confirm the depreciative impact of mileage on price, and further confirm that a 2013 RS4 is the most desirable option for the author.
Based solely on a delta from the EUR1.3697/km factor, the second listing appears to offer the best value. With a price of EUR49,990 and mileage of only 49,800km, the price per km factor of EUR1.0038/km is the lowest, and indicates a savings of nearly EUR20,000 over the average of this subset of listings. While features and condition of the car are not considered, the author of this analysis could feel reasonably assured that this listing would be a good place to start the car search if price and mileage are the primary factors influencing his purchase decision.