The dataset that I am using in this project was found on Kaggle, the well-known Machine Learning Competition website. Click here for a full description of the dataset, or read the description file.
I worked in the automotive industry for 12 years and I remain a devoted pistonhead, so getting a better understanding of the used car market was very appealing.
This project focuses on the exploratory data analysis phase of the dataset. In particular, I will try to detect associations between variables, especially against price. The end-goal of such a project would be to build a price-prediction model for vehicles sold by eBay users.
The dataset is well structured but there are some free text fields and many missing values. Moreover, some of the data is in German and needs to be translated. Most of the translations are straightforward, and Google Translate comes to the rescue where required! The “name” column is problematic: It is free text which causes all sorts of issues, and although a German NLP engineer could perhaps find interesting information in it, I chose to simply drop it.
library(data.table)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, mday, month, quarter, wday, week, yday, year
# Load the dataset in a data.table, excluding "names":
cars <- data.table::fread('./autos.csv', na.strings = "", stringsAsFactors = TRUE,
drop = 2)
# Convert date columns to POSIXct:
date_cols <- c("dateCrawled", "dateCreated", "lastSeen")
for(c in date_cols) set(cars,
j = c,
value = parse_date_time(cars[[c]], "%Y-%m-%d %H:%M:%S"))
head(cars)
## dateCrawled seller offerType price abtest vehicleType
## 1: 2016-03-24 11:52:17 privat Angebot 480 test NA
## 2: 2016-03-24 10:58:45 privat Angebot 18300 test coupe
## 3: 2016-03-14 12:52:21 privat Angebot 9800 test suv
## 4: 2016-03-17 16:54:04 privat Angebot 1500 test kleinwagen
## 5: 2016-03-31 17:25:20 privat Angebot 3600 test kleinwagen
## 6: 2016-04-04 17:36:23 privat Angebot 650 test limousine
## yearOfRegistration gearbox powerPS model kilometer
## 1: 1993 manuell 0 golf 150000
## 2: 2011 manuell 190 NA 125000
## 3: 2004 automatik 163 grand 125000
## 4: 2001 manuell 75 golf 150000
## 5: 2008 manuell 69 fabia 90000
## 6: 1995 manuell 102 3er 150000
## monthOfRegistration fuelType brand notRepairedDamage dateCreated
## 1: 0 benzin volkswagen NA 2016-03-24
## 2: 5 diesel audi ja 2016-03-24
## 3: 8 diesel jeep NA 2016-03-14
## 4: 6 benzin volkswagen nein 2016-03-17
## 5: 7 diesel skoda nein 2016-03-31
## 6: 10 benzin bmw ja 2016-04-04
## nrOfPictures postalCode lastSeen
## 1: 0 70435 2016-04-07 03:16:57
## 2: 0 66954 2016-04-07 01:46:50
## 3: 0 90480 2016-04-05 12:47:46
## 4: 0 91074 2016-03-17 17:40:17
## 5: 0 60437 2016-04-06 10:17:21
## 6: 0 33775 2016-04-06 19:17:07
str(cars)
## Classes 'data.table' and 'data.frame': 371824 obs. of 19 variables:
## $ dateCrawled : POSIXct, format: "2016-03-24 11:52:17" "2016-03-24 10:58:45" ...
## $ seller : Factor w/ 2 levels "gewerblich","privat": 2 2 2 2 2 2 2 2 2 2 ...
## $ offerType : Factor w/ 2 levels "Angebot","Gesuch": 1 1 1 1 1 1 1 1 1 1 ...
## $ price : int 480 18300 9800 1500 3600 650 2200 0 14500 999 ...
## $ abtest : Factor w/ 2 levels "control","test": 2 2 2 2 2 2 2 2 1 2 ...
## $ vehicleType : Factor w/ 9 levels NA,"andere","bus",..: 1 5 9 6 6 8 4 8 3 6 ...
## $ yearOfRegistration : int 1993 2011 2004 2001 2008 1995 2004 1980 2014 1998 ...
## $ gearbox : Factor w/ 3 levels NA,"automatik",..: 3 3 2 3 3 3 3 3 3 3 ...
## $ powerPS : int 0 190 163 75 69 102 109 50 125 101 ...
## $ model : Factor w/ 252 levels NA,"100","145",..: 120 1 121 120 105 13 10 42 63 120 ...
## $ kilometer : int 150000 125000 125000 150000 90000 150000 150000 40000 30000 150000 ...
## $ monthOfRegistration: int 0 5 8 6 7 10 8 7 8 0 ...
## $ fuelType : Factor w/ 8 levels NA,"andere","benzin",..: 3 5 5 3 5 3 3 3 3 1 ...
## $ brand : Factor w/ 40 levels "alfa_romeo","audi",..: 39 2 15 39 32 3 26 39 11 39 ...
## $ notRepairedDamage : Factor w/ 3 levels NA,"ja","nein": 1 2 1 3 3 2 3 3 1 1 ...
## $ dateCreated : POSIXct, format: "2016-03-24" "2016-03-24" ...
## $ nrOfPictures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ postalCode : int 70435 66954 90480 91074 60437 33775 67112 19348 94505 27472 ...
## $ lastSeen : POSIXct, format: "2016-04-07 03:16:57" "2016-04-07 01:46:50" ...
## - attr(*, ".internal.selfref")=<externalptr>
The column abtest seems to be internal to E-Bay, probably the control or test groups for some internal A/B testing. I don’t believe we will need it. The column nrOfPictures only contains zeros, probably a data collection issue. We don’t need it either. I will also drop the postalCode column, because I don’t intend to cross-reference the data with a postal map of Germany although that could be another interesting project.
The values in the different factors are fairly straightforward. I translate them into English; at the same time I drop the 12 ads from people looking to purchase a car (offerType == Gesuch), as I don’t have confidence that they would have accurate information about car specifications, nor sensible asking prices. As a result, we no longer need this column.
The seller column contains only 3 professional traders, which is insignificant in comparison to the total number of observations. Therefore I drop the column.
Finally, I also noticed that some zeros should really be NAs: In price, monthOfRegistration, powerPS.
The dateCrawled and dateCreated columns might not be very useful in themselves, but they allow us to calculate how long an ad has been up for on the website, and thus gives us an approximate lower bound for selling time (see below for a discussion on this). By default this value is calculated in minutes, I convert it to days. With this new ad_up_time variable, we no longer need the other date variables.
We now have a clean and usable dataset:
str(cars)
## Classes 'data.table' and 'data.frame': 371812 obs. of 12 variables:
## $ price : int 480 18300 9800 1500 3600 650 2200 NA 14500 999 ...
## $ vehicleType : Factor w/ 8 levels "other","people carrier",..: NA 4 8 5 5 7 3 7 2 5 ...
## $ yearOfRegistration : int 1993 2011 2004 2001 2008 1995 2004 1980 2014 1998 ...
## $ gearbox : Factor w/ 2 levels "automatic","manual": 2 2 1 2 2 2 2 2 2 2 ...
## $ powerPS : int NA 190 163 75 69 102 109 50 125 101 ...
## $ model : Factor w/ 251 levels "100","145","147",..: 119 NA 120 119 104 12 9 41 62 119 ...
## $ kilometer : int 150000 125000 125000 150000 90000 150000 150000 40000 30000 150000 ...
## $ monthOfRegistration: int NA 5 8 6 7 10 8 7 8 NA ...
## $ fuelType : Factor w/ 7 levels "other","petrol",..: 2 4 4 2 4 2 2 2 2 NA ...
## $ brand : Factor w/ 40 levels "alfa_romeo","audi",..: 39 2 15 39 32 3 26 39 11 39 ...
## $ notRepairedDamage : Factor w/ 2 levels "yes","no": NA 1 NA 2 2 1 2 2 NA NA ...
## $ ad_up_time : num 818.53 817 1319.81 1.93 342.17 ...
## - attr(*, ".internal.selfref")=<externalptr>
## - attr(*, "index")= int
We have 12 variables which I can plot individually.
summary(cars$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000e+00 1.250e+03 3.000e+03 1.780e+04 7.490e+03 2.147e+09 10779
From the summary above, we see that prices go up to over \(€2.10^9\)! This is obviously wrong. While looking at all cars over €100,000 in more detail, I noticed that many of these prices seemed either entered at random or confused with kilometers: I found patterns such as ‘111111’ or ‘12345678’, or mainstream models over €150,000. To try and filter out most of these issues, I assumed that such high-end cars would most likely be coupes, convertibles or SUVs. I also dropped any observation above €200,000, assuming the majority of them would be input errors. I then dropped any row that did not match these criteria and looked at the brands of cars above €75,000:
# Filter out most likely price errors:
cars <- cars[price <= 200000 &
!(price > 100000 &
vehicleType %in% c('sedan', 'small car',
'estate', 'people carrier', 'other')), ]
cars[price > 75000, unique(brand)]
## [1] chevrolet porsche other mercedes_benz bmw
## [6] volkswagen audi jaguar seat ford
## [11] land_rover opel renault smart nissan
## [16] mitsubishi
## 40 Levels: alfa_romeo audi bmw chevrolet chrysler citroen dacia ... volvo
Some of these brands are not considered premium and it is surprising to find them here. Let’s see the models and prices for the non-premium brands (Volkswagen, Seat, Ford, Opel, Renault, Smart, Nissan, Mitsubishi):
# Examine non premium brands with prices over €75k:
cars[price > 75000 &
brand %in% c("volkswagen", "seat", "ford", "opel", "renault",
"smart", "nissan", "mitsubishi")]
## price vehicleType yearOfRegistration gearbox powerPS model
## 1: 99999 NA 1910 NA NA NA
## 2: 99999 NA 1970 NA NA other
## 3: 123456 NA 2000 manual 75 golf
## 4: 154651 NA 2005 NA NA altea
## 5: 130000 coupe 1968 NA NA mustang
## 6: 100000 coupe 1968 manual 131 other
## 7: 93000 coupe 1971 manual NA other
## 8: 100000 small car 2013 automatic NA other
## 9: 100000 NA 2000 NA NA NA
## 10: 78000 SUV 2015 automatic 340 touareg
## 11: 99999 SUV 2001 manual 114 x_trail
## 12: 99999 NA 1910 NA NA NA
## 13: 130000 NA 2000 NA NA NA
## 14: 120000 coupe 1967 manual 550 mustang
## 15: 99999 NA 2017 manual 204 golf
## 16: 85000 people carrier 1967 manual 44 transporter
## 17: 123456 NA 1985 NA NA golf
## 18: 79499 SUV 2015 automatic 340 touareg
## 19: 99999 small car 2009 automatic 71 fortwo
## kilometer monthOfRegistration fuelType brand notRepairedDamage
## 1: 150000 NA NA volkswagen NA
## 2: 150000 NA NA volkswagen NA
## 3: 150000 7 NA volkswagen NA
## 4: 150000 NA NA seat NA
## 5: 50000 7 petrol ford NA
## 6: 100000 5 petrol opel no
## 7: 30000 2 petrol renault no
## 8: 150000 11 electric smart no
## 9: 150000 NA NA volkswagen NA
## 10: 50000 4 diesel volkswagen no
## 11: 150000 9 diesel nissan no
## 12: 150000 NA NA volkswagen NA
## 13: 150000 NA NA mitsubishi NA
## 14: 100000 9 petrol ford no
## 15: 150000 2 NA volkswagen NA
## 16: 150000 1 petrol volkswagen no
## 17: 125000 NA NA volkswagen NA
## 18: 20000 5 diesel volkswagen no
## 19: 50000 10 petrol smart no
## ad_up_time
## 1: 1341.10625
## 2: 89.24861
## 3: 207.49583
## 4: 12.56250
## 5: 932.87153
## 6: 0.00000
## 7: 664.18125
## 8: 0.00000
## 9: 1791.43472
## 10: 1817.59167
## 11: 1441.67153
## 12: 125.89375
## 13: 1087.41875
## 14: 904.70556
## 15: 802.15972
## 16: 1592.41250
## 17: 854.05069
## 18: 1642.35347
## 19: 744.62014
So we find a variety of models there, some unnamed. The VW Touareg and Ford Mustang seem legitimate. There are also some really old cars, for which a high price might be justified to a collector, but in most cases the model name is not mentioned for these so it is difficult to say whether they are genuinely expensive cars or errors. Since we have more than enough observations overall, I decide to drop them and only retain the VW Touareg and Ford Mustang.
There are also cars below €100, which I assume are also errors or sellers not wanting to filter themselves out of the price selector on the website. I drop these too.
# Further tidying up of price:
cars <- cars[price <= 75000 |
brand %in% c("chevrolet", "porsche", "other", "mercedes_benz",
"bmw", "audi", "jaguar", "land-rover") |
model %in% c("mustang", "touareg"),
]
cars <- cars[price >= 100,]
I then plot the variable again with log10 transformation on the x-axis:
require(ggplot2)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.5
library(scales)
## Warning: package 'scales' was built under R version 3.2.5
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.2.5
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.2.5
library(GGally)
## Warning: package 'GGally' was built under R version 3.2.5
ggplot(data = subset(cars, !is.na(price)), aes(x = price)) +
geom_histogram(fill = 'deepskyblue4', bins = 200) +
scale_x_continuous(trans = "log10",
breaks = c(100, 300, 1000, 3.e3, 1.e4, 3.e4, 1.e5, 3.e5)) +
geom_vline(xintercept = median(cars$price), colour = 'blue', size = 1) +
geom_vline(xintercept = mean(cars$price), colour = 'brown', size = 1)
Red bar indicates mean, blue bar indicates median
With the scale transformation, we have a roughly normal distribution. We notice that some bins have a much higher count than their neighbours, presumably corresponding to round values or “psychological” price points (e.g. €9,900).
This categorical variable has 8 levels and indicates the body style of the car (sedan, coupe, SUV etc.).
ggplot(data = cars, aes(x = vehicleType)) +
geom_bar(fill = 'deepskyblue4') + coord_flip()
This tends to reflect the general Western European market, with a prominence of “family” vehicles and smaller volumes of “niche” products (although a sample of new car registrations would probably show a higher proportion of SUVs considering the rise of this body style in recent years). Also note the large number of NAs – about 20,000. E-Bay could definitely do a better job at encouraging their customers to write their ads properly.
This will basically tell us about the age of the vehicle. From the data summary, we saw that the minimum year is 1,000, which will come as a surprise to most historians. The maximum year is 9,999, which is obviously wrong as we will all be teleporting by then. So we clearly need to tidy up this variable. To keep things simple, I select only vehicles registered since 1960. As the data was collected in 2016, we set that year as our upper bound. Then we plot a histogram.
summary(cars$yearOfRegistration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 1999 2004 2004 2008 9999
# Leave out cars first registered before 1960 and, supposedly, after 2016:
cars <- cars[yearOfRegistration >= 1960 & yearOfRegistration <= 2016]
ggplot(data = subset(cars, !is.na(yearOfRegistration)),
aes(x = yearOfRegistration)) +
geom_histogram(binwidth = 1, fill = 'deepskyblue4') +
scale_x_continuous(breaks = seq(1960, 2016, 5)) +
geom_vline(xintercept = mean(cars$yearOfRegistration),
size = 1, colour = 'brown') +
geom_vline(xintercept = median(cars$yearOfRegistration),
size = 1, colour = 'blue')
Red bar indicates mean, blue bar indicates median
Note that the mean and the median are identical. The distribution is close to normal, with the following exceptions:
I did some research and it turns out that 1999, 2000, 2005, 2006 were all among the strongest years for new car registrations in Germany in the last 20 years. As they are also in the heart of the used car market in terms of age, it makes sense that they would translate into these peaks.
As for 2016, the explanation is less obvious, especially as the data was collected in March and April, which is quite ealy in the year. The peak could be due to some listing errors (on purpose or not) where owners enter a date at random or to attract visitors to their ad. It could also be linked to the website’s features when creating an ad (eg. default value in drop-down menu).
But there could also be a number of genuine 2016 cars suddenly arriving on the market. Employees in the automotive industry have often access to cheap car leasing schemes, whereby they can change their vehicle every 6 months or so. More importantly, most manufacturers register large numbers of demonstrators, press units and self-registered cars (new vehicles registered by the manufacturer or its dealers, in order to artificially boost market share and / or create cheaper opportunities to capture some customers over the competition).
Maybe further investigation will tell us whether this theory holds.
We should note from the summary in the beginning that monthOfRegistration contains nearly 38,000 NAs. Again, it is surprising that year seems mandatory (although its value is clearly not controlled) but month is not. This variable is really a categorical variable.
summary(cars$monthOfRegistration)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 4.000 6.000 6.387 9.000 12.000 28445
ggplot(data = subset(cars, !is.na(monthOfRegistration)),
aes(x = monthOfRegistration)) +
geom_histogram(binwidth = 0.5, fill = 'deepskyblue4') +
scale_x_continuous(breaks = 1:12)
March and June are the strongest months for vehicle registrations. A quick research on the internet confirmed that this is consistent with the car registration seasonality that we observe in Germany (incidently, there is a similar effect in France and the UK).
The gearbox variable can only take two values: manual or automatic.
ggplot(data = cars, aes(x = gearbox)) +
geom_bar(fill = 'deepskyblue4')
The European market is primarily a manual transmission market, so no surprises here. Again, there are about 15,000 NAs.
The engine power is measured using the metric PS (1PS = 736 Watts). Again, we know from the summary that there are some nonsensical values in the data. I was prepared to remove anything below 40PS, but then I realised that there are Trabants in the dataset! This venerable left-over from the East-German Communist era is now widely used in Berlin as a rental car for people looking for a different experience of the city. Its 2-stroke engine managed 26PS!. Out of respect for such an antiquity, I decided to set the lower bound at 25PS.
As for the excessively high PS values, it looks like most of them are due to people confusing power output and engine capacity (in cm\(^3\)). I decided to set the limit at 600PS, a more than respectable value.
# Drop cars with output below 25 or over 600:
cars <- cars[powerPS >= 25 & powerPS <= 600, ]
ggplot(data = subset(cars, !is.na(powerPS)),
aes(x = powerPS)) +
geom_histogram(binwidth = 10, fill = 'deepskyblue4') +
scale_x_continuous(breaks = seq(0, 600, 20)) +
geom_vline(xintercept = mean(cars$powerPS), size = 1, colour = 'brown') +
geom_vline(xintercept = median(cars$powerPS), size = 1, colour = 'blue') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Red bar indicates mean, blue bar indicates median
The distribution is postively skewed with a long tail (that may contain errors, as we just saw).
There are some prefered values – around 60, 100, 120, 140 for instance. These are values that have become some sort of “market standards”: Most manufacturers will offer engines around these values. It makes it easier for the consumer to compare products. Real, measurable power output of a car is never exactly equal to its rated horsepower due to variances in industrial processes. Differences of around 5% are not uncommon for a same model.
This factor variable contails 251 levels, far too many to plot. But we can select the top 20:
# Create a data.table of unique model names sorted by their respective counts:
model_count <- cars[, .N, by = model]
model_count <- model_count[order(model_count$N, decreasing = TRUE), ]
ggplot(data = model_count[1:10, ],
aes(x = reorder(model, -N), y = N)) +
geom_bar(stat = 'identity', fill = 'deepskyblue4') + coord_flip()
Without surprise, the Volkswagen Golf (the most popular car in Europe) is also number one in the dataset. Note the very large number of vehicles designated as “other” – while some of them are probably models that exist but cannot be selected on the eBay website, it is unlikely that their number would be that high so we have to assume that once again, they are mostly due to human error. Analysing the namevariable (the free-text variable that we dropped at the very begining of the study) might help identify these cars, but this would be a whole project in itself.
The next car on the list is the BMW 3-Series, which is by no means a cheap car. This fact alone shows that we are indeed working on the German market – the most high-end market in Europe.
… or more accurately, “kilometreage”.
summary(cars$kilometer)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5000 100000 150000 125500 150000 150000
ggplot(data = cars, aes(x = kilometer)) +
geom_histogram(binwidth = 5000, fill = 'deepskyblue4') +
geom_vline(xintercept = mean(cars$kilometer), size = 1, colour = 'brown') +
geom_vline(xintercept = median(cars$kilometer), size = 1, colour = 'blue')
Red bar indicates mean, blue bar indicates median
This histogram clearly shows that mileage is not a free input field. This is surprising, because mileage is one of the most important pieces of information when it comes to used cars, so maximum accuracy would have been desirable. Moreover, when I went to E-Bay myself to try out the used car ad generator, I was able to enter any value. So maybe there is some aggregating mechanism during data extract or the functionality changed recently?
The second thing that is striking with this chart is the predominance of 150,000km cars. Given the nature of the data, there is no point trying to apply scale transformations to improve the plot. This variable should actually be considered as a categorical variable more than a continuous one. I therefore add a variable in the dataset called km_cat. We will see later which one is more appropriate.
cars <- cars[, km_cat := factor(kilometer)]
This factor variable also contains many NAs (over 33,000).
ggplot(data = cars, aes(x = fuelType)) +
geom_bar(fill = 'deepskyblue4')
“Alternative”" sources of energy are almost negligible in this dataset, which is not surprising considering that over 50% of the vehicles were 12 to 13 years old when this data was extracted.
Petrol is roughly twice as prominent as Diesel.
This is another factor with many levels (40) so we will take the same approach as with model.
# Create a data.table of unique model names sorted by their respective counts:
brand_count <- cars[, .N, by = brand]
brand_count <- brand_count[order(brand_count$N, decreasing = TRUE), ]
ggplot(data = brand_count[1:10, ],
aes(x = reorder(brand, -N), y = N)) +
geom_bar(stat = 'identity', fill = 'deepskyblue4') + coord_flip()
The top 5 brands are German. Number 6 is Ford, which in Europe is largely perceived as German as it has its European headquarters in Cologne and many of its European products are actually designed and built in Germany. The next two brands are French, then Fiat is Italian. Seat is Spanish but it is actually part of the VW Group and their cars share almost all their components with VW products.
In other words, German manufacturers are hugely dominant on their home turf.
One issue we will have when looking for associations between variables is that with brand containing 40 levels, plots will be really hard to read (in addition to causing long processing times). We can improve this by grouping brands into categories based on brand perception. We could simply use mean prices to make these distinctions, but we would then create a correlation to price where there isn’t necessarily one. Moreover, brand perception involves a lot more than just price – there is history, perceived quality, marketing etc.
So I chose another approach: Use my domain-knowledge to manually classify the 40 levels into clusters. I didn’t plan on an exact number of clusters beforehand, just something manageable. I then intuitively grouped brands together and came up with 8 clusters which I then named.
Althought this intuitive approach is subjective, I believe it actually adds information to the dataset, unlike the “group by price” method which removes some.
# Define brand groups based on market perception:
other <- c('lada', 'trabant', 'other')
budget <- c('chevrolet', 'daewoo', 'dacia')
budget_plus <- c('hyundai', 'kia', 'skoda', 'daihatsu')
mid_minus <- c('chrysler', 'fiat', 'ford', 'citroen', 'mitsubishi', 'opel',
'rover', 'seat', 'suzuki')
mid_range <- c('nissan', 'peugeot', 'renault', 'toyota')
mid_plus <- c('honda', 'mazda', 'smart', 'subaru', 'volkswagen')
premium_minus <- c('alfa_romeo', 'lancia', 'saab', 'jeep', 'volvo', 'mini')
premium <- c('audi', 'bmw', 'jaguar', 'land_rover', 'mercedes_benz', 'porsche')
cars[brand %in% other, brand_cat := 'other']
cars[brand %in% budget, brand_cat := 'budget']
cars[brand %in% budget_plus, brand_cat := 'budget_plus']
cars[brand %in% mid_minus, brand_cat := 'mid_minus']
cars[brand %in% mid_range, brand_cat := 'mid_range']
cars[brand %in% mid_plus, brand_cat := 'mid_plus']
cars[brand %in% premium_minus, brand_cat := 'premium_minus']
cars[brand %in% premium, brand_cat := 'premium']
cars$brand_cat <- ordered(cars$brand_cat,
levels = c('budget', 'budget_plus',
'mid_minus', 'mid_range',
'mid_plus', 'premium_minus',
'premium', 'other'),
labels = c('budget', 'budget_plus',
'mid_minus', 'mid_range',
'mid_plus', 'premium_minus',
'premium', 'other'))
ggplot(data = cars, aes(x = brand_cat)) +
geom_bar(fill = 'deepskyblue4') + coord_flip()
There are large disparities between brand categories in terms of count. The biggest surprise is the predominance of the premium brands, which are supposedly the most expensive. But the analysis of the top 10 brands above explains it: This category contains BMW, Audi and Mercedes which are all among the most common brands in Germany. Clearly, this plot would look very different in most other countries.
ggplot(data = cars, aes(x = notRepairedDamage)) +
geom_bar(fill = 'deepskyblue4')
The variable notRepairedDamage can only take two values: “yes” or “no”. But it does have NAs – about 72,000, which is twice as many as the number of “yes”. It does not seem like this is a mandatory field (and I could not find it on EBay). As I understand it, it refers to potential unrepaired damage on the vehicle being sold.
This is a composite variable that we created by substracting the dateCreated from the dateCrawled date. It is measured in days. Of course the idea is to look for potential correlations to other variables, especially price. From that point of view, there are important limitations associated with ad_up_time:
summary(cars$ad_up_time)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 121.0 358.5 528.8 819.9 1982.0
ggplot(data = cars, aes(x = ad_up_time)) +
geom_histogram(binwidth = 7, fill = 'deepskyblue4') +
scale_x_continuous(breaks = seq(0, 2000, 120)) +
geom_vline(xintercept = mean(cars$ad_up_time), colour = 'brown', size = 1) +
geom_vline(xintercept = median(cars$ad_up_time), colour = 'blue', size = 1)
Red bar indicates mean, blue car indicates median
Here we have observations up to 2,000 days (nearly 5.5 years). The long tail contains a significant number of observations and there seems to be a lot of variance in the data, so it is difficult to just drop observations after an arbitrary number of days posted.
The other thing I notice is that there seems to be some prevailing values at roughly three months intervals. I am not sure why this is – it might be related to the pricing structure used by eBay…?
In this section, I will examine pairs of variables to look for associations between variables, and then zoom in on some multi-variable combinations that seem particularly interesting.
First, let’s visualize potential variable associations with a plot matrix (to reduce computing time, I use a sample of 10,000 observations):
cars$log_price <- log10(cars$price)
ggpairs(select(cars[sample(nrow(cars), 10000)], -c(price, model, km_cat, monthOfRegistration, brand)),
cardinality_threshold = 40,
axisLabels = 'internal',
lower = list(combo = wrap('facethist', bins = 100),
continuous = wrap('points', alpha = .2)))
#ggsave('plot_matrix.png', width = 50, height = 50, units = 'cm', dpi = 400)
Even so the plot matrix is very clutered. For continuous variables we can try a correlation plot:
ggcorr(cars[, .SD, .SDcols = c("yearOfRegistration", "powerPS",
"kilometer", "ad_up_time", "log_price")],
geom = "tile", nbreaks = 5, label = TRUE, label_color = "deepskyblue4",
label_size = 5, hjust = .5, size = 3, angle = -45)
We notice fairly strong correlations between log(price) and power output, year of registration and mileage (negative in the latter case). Other correlations are weak. We will build individual plots for the most promising variable pairs. Eventually I would like to be able to predict prices based on the other variables, so let’s focus on price as one of our variables.
Intuitively, it would seem obvious that there should be a pretty strong relationship between power and price. Is it really the case?
ggplot(data = cars, aes(x = powerPS, y = log_price)) +
geom_point(fill = "deepskyblue1",
shape = 21, alpha = .15, size = 1, position = 'jitter') +
geom_smooth()
## `geom_smooth()` using method = 'gam'
So there is a correlation indeed, but there is also a lot of noise, The relationship does not seem quite linear but looks more like a log function.
Let’s try to break this down but fuel type. The reason for doing this is that from experience, I know that Diesel engines are typically more expensive (on the new car market at least) than petrol for a same value of power output. Therefore we have reason to believe that the relationship between power and price will be different, or at the very least, shifted.
I plot price vs PowerPS with a colour-coding for fuelType (and keeping only the two most popular: petrol and Diesel – upper-case D is intentional as Diesel is the name of the inventor, Rudolf Diesel):
ggplot(data = subset(cars, fuelType %in% c("petrol", "diesel")),
aes(x = powerPS, y = log_price, fill = fuelType)) +
geom_point(shape = 21, alpha = .15, size = 1, position = 'jitter') +
scale_fill_brewer(type = 'qual',
guide = guide_legend(reverse = TRUE,
override.aes = list(alpha = 1,
size = 2))) +
geom_smooth(aes(colour = fuelType))
## `geom_smooth()` using method = 'gam'
This plot confirms that for a given power output, Diesel cars are more expensive (not a great surprise considering that modern Diesel powertrains are usually more technologically advanced). The shape of the smoothers is different too. The petrol smoother looks closer to a regular root or log function, whereas for Diesel cars, it looks more erratic.
ggplot(data = subset(cars, fuelType %in% c("petrol", "diesel")),
aes(x = log(powerPS), y = log_price, fill = fuelType)) +
geom_point(shape = 21, alpha = .15, size = 1, position = 'jitter') +
scale_fill_brewer(type = 'qual',
guide = guide_legend(reverse = TRUE,
override.aes = list(alpha = 1,
size = 2))) +
geom_smooth(aes(colour = fuelType))
## `geom_smooth()` using method = 'gam'
Taking the log of powerPS definitely helps make the relationship look more linear for petrol cars, but it does not have very convincing results on Diesel cars – the curve only looks linear towards the centre of the distribution. This means that when building the linear model, we may need a different set of parameters for each fuel type. Nevertheless, over the power range where the Diesel smoother looks linear, I notice that both curves are almost parallel. This indicates that at least for the bulk of observations, price varies in a similar way on petrol and Diesel cars, but the Diesel prices are shifted up by a constant (in log x log representation).
Let’s try to plot these variables together:
ggplot(data = subset(cars,
fuelType %in% c("petrol", "diesel") & !is.na(gearbox)),
aes(x = log(powerPS), y = log_price)) +
geom_point(shape = 21, fill = 'deepskyblue1',
alpha = .05, position = 'jitter') +
geom_smooth() +
facet_grid(gearbox ~ fuelType)
## `geom_smooth()` using method = 'gam'
We see that petrol cars are more spread out in terms of price and power than Diesel cars. Using the log of power, the smoothers are close to linear in the first category but not in the second one. We also notice that automatic cars tend to be both more powerful and more expensive than manual, but this will need to be confirmed later.
cars[, .(log_price_cor = cor(log_price, log(powerPS)))]
## log_price_cor
## 1: 0.5837276
cars[fuelType %in% c("petrol", "diesel") & !is.na(gearbox),
.(log_price_cor = cor(log_price, log(powerPS))), by = .(fuelType, gearbox)]
## fuelType gearbox log_price_cor
## 1: diesel manual 0.4795269
## 2: diesel automatic 0.4939476
## 3: petrol manual 0.4917666
## 4: petrol automatic 0.5857267
So when grouped by transmission and fuel types, the correlations are all around the .5 mark, except the Petrol / Automatic combination which is around .6. We see that the associations are similar in shape although for manual Diesel cars, the data points form a much tighter group so it is harder to tell.
So far, we have found that there is a positive correlation of price with power, and associations to gearbox and fuel type.
Another continuous variable with a significant correlation to price (according to our correlation matrix) is yearOfRegistration.
ggplot(data = subset(cars, !is.na(yearOfRegistration)),
aes(x = yearOfRegistration, y = price)) +
geom_point(fill = 'deepskyblue4', shape = 21, alpha = .05,
position = 'jitter') +
scale_y_continuous(
breaks = c(300, 1000, 3000, 10000, 30000, 100000, 300000)) +
coord_trans(x = 'identity', y = 'log10', limx = c(1960, 2016))
The distribution looks very strange.
yearOfRegistration alone. On the scatter plot, we can see that this line includes cars at just about any price from a couple hundred Euros to maybe €20,000. Strangely, the bulk of cars just older than this seems to be more expensive, which suggests that they are most probably errors (intentional or not) or people just picking the first year available on the drop-down menu when listing their car. If these cars were really 2016 cars, it would be illogical for them to be advertised at cheaper prices than cars 5 years older. This strongly advocates in favour of filtering 2016 cars out of the data.Let’s first split the data in two with a cut-off in 1995 (21 years ago) and remove the 2016 cars from the dataset (there is only 4 months’ worth of data for that year and it seems highly inaccurate). I then plot the new data with a distinction beween vintage and modern cars.
cars <- cars[yearOfRegistration < 2016, ]
ggplot(data = subset(cars, !is.na(yearOfRegistration)),
aes(x = yearOfRegistration, y = log_price)) +
geom_point(aes(fill = collector_status),
shape = 21, alpha = .05,
position = 'jitter') +
scale_fill_brewer(type = 'qual',
guide = guide_legend(reverse = TRUE,
override.aes = list(alpha = 1,
size = 5))) +
geom_smooth(data = subset(cars,
!is.na(yearOfRegistration) &
yearOfRegistration < 2016 &
collector_status == "modern"),
aes(colour = collector_status)) +
geom_smooth(data = subset(cars,
!is.na(yearOfRegistration) &
yearOfRegistration < 2016 &
collector_status == "vintage"),
aes(colour = collector_status)) +
scale_x_continuous(breaks = seq(1960, 2015, 5))
## `geom_smooth()` using method = 'gam'
## `geom_smooth()` using method = 'gam'
Now I would like to take a look at the correlation values:
print("Vintage cars:")
## [1] "Vintage cars:"
cars[collector_status == "vintage", cor.test(price, yearOfRegistration)]
##
## Pearson's product-moment correlation
##
## data: price and yearOfRegistration
## t = -69.163, df = 28266, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3903722 -0.3704315
## sample estimates:
## cor
## -0.3804461
print("Modern cars:")
## [1] "Modern cars:"
cars[collector_status == "modern", cor.test(price, yearOfRegistration)]
##
## Pearson's product-moment correlation
##
## data: price and yearOfRegistration
## t = 415.1, df = 276410, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6173822 0.6219750
## sample estimates:
## cor
## 0.6196839
The correlation is much weaker for vintage cars than for modern cars because of the greater variance. This variance comes partly from the lower count, but also probably because vintage car prices can vary enormously according to seemingly irrational factors. They no longer compete with each other so market pressure is different. Moreover their value depends on factors such as: rarity, quality of restoration, use of original parts only, authenticity (faithfulness to the exact specifications of the car when it came out of the factory), historical value, part usage, car’s history etc.
Two cars built exactly the same year, 40 years ago, and selling at the same price back then, can nowadays have orders of magnitude between their current values.
In the previous section, I noticed that the variable kilometer behaves more like a categorical variable than a continuous one. However I want to try plotting both it to see which representation works best.
ggplot(data = subset(cars,
!is.na(kilometer)),
aes(x = kilometer, y = log_price)) +
geom_point(colour = 'lightskyblue3', shape = '.', alpha = .25,
position = 'jitter') +
scale_x_continuous(breaks = c(0, as.numeric(levels(cars$km_cat)))) +
geom_smooth()
## `geom_smooth()` using method = 'gam'
There is an association to log_price that looks roughly linear, except between 5,000 and 20,000km. The clearest feature is that the variance increases with mileage. Let’s compare with a boxplot, this time treating mileage as a categorical variable:
ggplot(data = subset(cars, !is.na(km_cat)),
aes(x = km_cat, y = log_price)) +
geom_boxplot(colour = 'deepskyblue4', fill = 'deepskyblue1') +
geom_point(aes(x = km_cat, y = log_price), stat = 'summary', fun.y = mean,
shape = 18,
size = 4, colour = 'brown')
Brown diamonds represent means
Despite kilometer being grouped in bins, I believe the scatter plot is a marginally better representation because the bins are not regularly spaced, so the box plot gives a distorted view.
The 5000km data looks highly suspicious - we could be observing the same effect as with yearOfRegistration == 2016. Let’s check for that in the next set of analyses:
ggplot(data = subset(cars, !is.na(kilometer) & yearOfRegistration < 2016),
aes(x = yearOfRegistration, y = kilometer)) +
geom_point(colour = 'lightskyblue3', position = 'jitter', shape = '.',
alpha = 0.25) +
scale_x_continuous(breaks = seq(1960, 2015, 5), minor_breaks = NULL) +
scale_y_continuous(breaks = seq(0, 150000, 10000)) +
geom_smooth()
## `geom_smooth()` using method = 'gam'
The smoother seems to indicate that, similarly to what we observed on price, kilometers tend to be positively correlated to the age of the car up to 15-20 years old, then negatively correlated after that. This could also be linked to the fact that vintage cars don’t tend to run as much, as they are rarely the household’s main car and are often of questionable reliability. However considering the dispersion in the early years of the dataset, I am not sure I should lend much credit to this observation. Modern cars, on the other hand, generally have a higher mileage as they get older, as common sense would have it.
We also notice a group of observations at 5,000 kilometers between 1995 and 2005 approximately, that do not seem to follow the general distribution. Since that period is precisely the one with the highest density of observations in the dataset, I suspect these cars actually have a much higher mileage and that the data is wrong. These are probably the same suspicious observations that we noticed just before, for which kilometers are most likely severely under-valued.
From the plots below, it doesn’t look like ad_up_timeis going to be very informative. The matrix plot also reports that there is virtually no correlation to any other continuous variable.
p1 <- ggplot(data = cars, aes(x = ad_up_time, y = log_price)) +
geom_point(alpha = .15, color = 'lightskyblue3', shape = ".",
position = 'jitter') +
geom_smooth()
p2 <- ggplot(data = cars, aes(x = ad_up_time, y = powerPS)) +
geom_point(alpha = .15, color = 'lightskyblue3', shape = ".",
position = 'jitter') +
geom_smooth()
p3 <- ggplot(data = cars, aes(x = ad_up_time, y = yearOfRegistration)) +
geom_point(alpha = .15, color = 'lightskyblue3', shape = '.',
position = 'jitter') +
geom_smooth()
p4 <- ggplot(data = cars, aes(x = ad_up_time, y = kilometer)) +
geom_point(alpha = .15, color = 'lightskyblue3', shape = ".",
position = 'jitter') +
geom_smooth()
grid.arrange(p1, p2, p3, p4, ncol = 2)
## `geom_smooth()` using method = 'gam'
## `geom_smooth()` using method = 'gam'
## `geom_smooth()` using method = 'gam'
## `geom_smooth()` using method = 'gam'
If anything, the very tenuous trends we observe are rather counter-intuitive: cars with a low ad_up_time seem to be generally cheaper and to have higher mileage than the others.
In this section, we are going to look at combinations of discrete variables, with a stronger focus on price as this is the variable that I would like to explain.
p1 <- ggplot(data = cars, aes(x = fuelType, y = price)) +
geom_boxplot(colour = 'deepskyblue4', fill = 'deepskyblue1') +
scale_y_log10(breaks = c(100, 300, 1000, 3000, 10000, 30000,
100000, 300000)) +
geom_point(aes(x = fuelType, y = price), stat = 'summary', fun.y = mean,
shape = 18,
size = 4, colour = 'brown') +
coord_flip()
p2 <- ggplot(data = cars, aes(x = fuelType, y = powerPS)) +
geom_boxplot(colour = 'deepskyblue4', fill = 'deepskyblue1') +
scale_y_continuous() +
geom_point(aes(x = fuelType, y = powerPS), stat = 'summary', fun.y = mean,
shape = 18,
size = 4, colour = 'brown') +
coord_flip()
grid.arrange(p1, p2, nrow = 2)
Brown diamonds indicate mean
These boxplots show that there are numerous outliers in the values of both powerPS and price. We can clearly see that different values of fuel type have different prices, with hybrid and electric cars being significantly more expensive than other types of vehicles.
Similarly, power outputs vary depending on fuel type. Perhaps surprisingly, Diesel cars tend to be more powerful than petrol, although their outliers don’t extend quite as far. This would be explained by the fact that for cost reasons, all low-end models have petrol engines.
To check whether the difference in mean prices is significant between petrol and Diesel cars (the 2 most popular groups), I run a t-test. My null hypothesis is \(price_{petrol} = price_{Diesel}\) and the alternative hypothesis is \(price_{petrol} < price_{Diesel}\). It’s an independent 2-group test.
petrol_diesel <- cars[fuelType %in% c("petrol", "diesel"),
.(price, powerPS, fuelType)]
petrol_diesel$fuelType <- factor(petrol_diesel$fuelType)
with(petrol_diesel, t.test(price ~ fuelType, alternative = "less"))
##
## Welch Two Sample t-test
##
## data: price by fuelType
## t = -115.19, df = 189150, p-value < 2.2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -3712.24
## sample estimates:
## mean in group petrol mean in group diesel
## 5214.650 8980.668
So \(p-value < .01\). We are very confident that there is a real difference in mean prices between the two groups.
gearbox is another variable that relates to the cars’ technical specifications. I would like to check whether it has an association to price.
ggplot(data = cars, aes(x = gearbox, y = price)) +
geom_boxplot(colour = 'deepskyblue4', fill = 'deepskyblue1') +
scale_y_log10(breaks = c(100, 300, 1000, 3000, 10000, 30000,
100000, 300000)) +
geom_point(aes(x = gearbox, y = price), stat = 'summary', fun.y = mean,
shape = 18,
size = 4, colour = 'brown') +
coord_flip()
Brown diamonds represent mean
I am not sure how to interpret the fact that where gearbox is NA, prices seem lower. Automatic cars seem much more expensive overall, especially bearing in mind that prices are plotted on a logarithmic scale.
Again, let’s run a t-test to make sure the diffrence is significant. My null hypothesis is \(price_{automatic} = price_{manual}\) and the alternative hypothesis is \(price_{automatic} > price_{manual}\). It’s an independent 2-group test.
with(subset(cars, !is.na(gearbox)), t.test(price ~ gearbox, alternative = "greater"))
##
## Welch Two Sample t-test
##
## data: price by gearbox
## t = 129.99, df = 81053, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 5945.265 Inf
## sample estimates:
## mean in group automatic mean in group manual
## 11005.039 4983.581
\(p-value < .01\) and we are 95% confident that there is at least a €5,945 difference in the means. The on-cost of an auto transmission is not enough to explain such a difference, but the population profile would be. Auto transmissions are typically found in larger, more powerful cars, whereas almost all small cars have a manual gearbox.
Combining these three variables, we obtain the following boxplot:
ggplot(data = subset(cars,
fuelType %in% c("petrol", "diesel") & !is.na(gearbox)),
aes(x = fuelType, y = price)) +
geom_boxplot(colour = 'deepskyblue4', fill = 'deepskyblue1') +
geom_point(aes(x = fuelType, y = price), stat = 'summary', fun.y = mean,
shape = 18,
size = 4, colour = 'brown') +
scale_y_continuous(limits = c(0, 50000)) +
facet_wrap(~ gearbox)
## Warning: Removed 1124 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1124 rows containing non-finite values (stat_summary).
Brown diamonds represent means
This boxplot confirms that prices are generally higher for auto transmissions than for manuals, and also higher for Diesel cars than petrol. However there is more dispersion for auto and Diesel than for manual and petrol, probably due to their lower count in the data. We also notice that there are many outliers. Here I used a linear scale for price but cut off at €50,000, and there are many more outliers above the cut-off point that cannot be seen on the plotting area.
p1 <- ggplot(data = subset(cars, !is.na(vehicleType)),
aes(x = price)) +
geom_histogram(fill = 'deepskyblue4', bins = 100) +
scale_x_log10(breaks = c(1.e2, 1.e3, 1.e4, 1.e5)) +
facet_wrap(~ vehicleType, ncol = 1)
p2 <- ggplot(data = subset(cars, !is.na(vehicleType)),
aes(y = price, x = vehicleType)) +
geom_boxplot(colour = 'deepskyblue4', fill = 'deepskyblue1') +
scale_y_log10(breaks = c(1.e2, 1.e3, 1.e4, 1.e5)) +
geom_point(aes(x = vehicleType, y = price), stat = 'summary', fun.y = mean,
shape = 18,
size = 4, colour = 'brown') +
scale_x_discrete(limits = rev(levels(cars$vehicleType))) +
coord_flip()
p3 <- ggplot(data = cars,
aes(x = price)) +
geom_freqpoly(aes(colour = vehicleType)) +
scale_x_log10(breaks = c(1.e2, 1.e3, 1.e4, 1.e5))
lay <- rbind(c(1, 2), c(3, 3))
grid.arrange(grobs = list(p1, p2, p3), layout_matrix = lay, heights = c(3, 2))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Brown diamonds represent means
As expected, different vehicle types have different price distributions, most of them approaching normal when viewed on a logarithmic scale. The highest mean and median prices are found with SUVs, followed by convertibles, coupés and people carriers (minivans in the US). The variances of these distributions are quite large, coupés in particular. However this looks like a good contributor in explaining price differences.
Since I took the time to manually classify brands by perceived “premiumness”, let’s have a look at potential associations with price:
p1 <- ggplot(data = cars,
aes(x = price)) +
geom_histogram(fill = 'deepskyblue4', bins = 100) +
scale_x_log10(breaks = c(1.e2, 1.e3, 1.e4, 1.e5)) +
facet_wrap(~ brand_cat, ncol = 1)
p2 <- ggplot(data = cars,
aes(y = price, x = brand_cat)) +
geom_boxplot(colour = 'deepskyblue4', fill = 'deepskyblue1') +
scale_y_log10(breaks = c(1.e2, 1.e3, 1.e4, 1.e5)) +
geom_point(aes(x = brand_cat, y = price), stat = 'summary', fun.y = mean,
shape = 18,
size = 4, colour = 'brown') +
scale_x_discrete(limits = rev(levels(cars$brand_cat))) +
coord_flip()
p3 <- ggplot(data = cars,
aes(x = price)) +
geom_freqpoly(aes(colour = brand_cat)) +
scale_x_log10(breaks = c(1.e2, 1.e3, 1.e4, 1.e5))
lay <- rbind(c(1, 2), c(3, 3))
grid.arrange(grobs = list(p1, p2, p3), layout_matrix = lay, heights = c(3, 2))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Brown diamonds represent means
In general, we observe a fairly logical pattern with higher prices for more premium brands. The only surpise is that the “budget” and “budget_plus” categories seem almost as expensive overall as the “premium_minus” brands. I am not sure why, maybe this has to do with the fact that some of the brands that make up these two categories are fairly recent and therefore have a younger population?
ggplot(data = cars, aes(x = brand_cat, y = yearOfRegistration)) +
geom_boxplot(colour = 'deepskyblue4', fill = 'deepskyblue1') +
geom_point(aes(x = brand_cat, y = yearOfRegistration), stat = 'summary', fun.y = mean,
shape = 18,
size = 4, colour = 'brown') +
coord_trans(limy = c(1980, 2016))
Brown diamonds represent means
print("Median year of registration by brand category:")
## [1] "Median year of registration by brand category:"
cars[, str(yearOfRegistration)]
## int [1:304677] 2011 2004 2001 2008 1995 2004 2014 1998 2004 2005 ...
## NULL
cars[, .(Median_year = as.numeric(median(yearOfRegistration))), by = brand_cat]
## brand_cat Median_year
## 1: premium 2004
## 2: premium_minus 2004
## 3: mid_plus 2003
## 4: budget_plus 2007
## 5: mid_range 2003
## 6: mid_minus 2003
## 7: budget 2008
## 8: other 1999
It seems that these two categories are indeed younger than the rest (with a 4-5 year difference in the medians) which would explain at least part of the observation we made previously.
p1 <- ggplot(data = subset(cars, !is.na(notRepairedDamage)),
aes(x = price)) +
geom_histogram(fill = 'deepskyblue4', bins = 100) +
scale_x_log10(breaks = c(1.e2, 1.e3, 1.e4, 1.e5)) +
facet_wrap(~ notRepairedDamage, ncol = 1)
p2 <- ggplot(data = cars,
aes(y = price, x = notRepairedDamage)) +
geom_boxplot(colour = 'deepskyblue4', fill = 'deepskyblue1') +
scale_y_log10(breaks = c(1.e2, 1.e3, 1.e4, 1.e5)) +
geom_point(aes(x = notRepairedDamage, y = price), stat = 'summary', fun.y = mean,
shape = 18,
size = 4, colour = 'brown') +
scale_x_discrete(limits = rev(levels(cars$notRepairedDamage))) +
coord_flip()
grid.arrange(p1, p2, ncol = 2)
## Warning: Removed 40366 rows containing non-finite values (stat_boxplot).
## Warning: Removed 40366 rows containing non-finite values (stat_summary).
So cars with unrepaired damage are much cheaper on average than cars in good condition. The log10 scale is slightly deceiving here, but in reality there is a 1-to-3 to 1-to-4 difference. We can confirm that the difference is significant by runing an independent 2-sample t-test. My \(H_0\) is \(price_{unrepaired \, damage} = price_{no \, unrepaired \, damage}\). My \(H_a\) is \(price_{unrepaired \, damage} < price_{no \, unrepaired \, damage}\).
with(subset(cars, !is.na(notRepairedDamage)), t.test(price ~ notRepairedDamage,
alternative = 'less') )
##
## Welch Two Sample t-test
##
## data: price by notRepairedDamage
## t = -166.11, df = 72067, p-value < 2.2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -4797.876
## sample estimates:
## mean in group yes mean in group no
## 2395.173 7241.033
So again, \(p < .01\). We are 95% confident that there is a €4,800 difference on average between cars with unrepaired damage, and cars without.
However we should note that the notion of unrepaired damage is somewhat vague. No one expect 10- or 15-year-old cars to be in immaculate condition: They will always have some scratches and bumps. However these are unlikely to drop the car’s price by a factor 3 so I conclude that to most sellers, unrepaired damage means serious damage, potentially preventing the vehicule from operating normally. Is this definition presented in the eBay guidelines? Or is it just an implicit understanding from sellers? I tried to navigate the eBay website for this information but was unable to find it.
Based on the knowledge that we gained on the previous phases of the project, I would like to attempt a linear regression to predict prices based on the most useful variables. To clean up the data further, I decided to remove all observations that are in the 5,000km group and were registered between 1990 and 2010. I decided to select the following variables:
log_pricepowerPS in interaction with fuelType, following our observations on the shape of the correlationpowerPS in interaction with fuelTypeyearOfRegistration in interaction with collector_statuskilometer in interaction with collector_statusvehicleTypebrand_catnotRepairedDamage in interaction with everything else. The reason for this is that we saw that this particular variable is associated with a 3- or 4-fold price difference, therefore for all intents and purposes it defines two separate markets (much like collector_status)require(dplyr)
cars_no_nas <- na.omit(cars[!(yearOfRegistration %in% seq(1990, 2010)
& kilometer == 5000),
])
fit <- lm(log_price ~ (powerPS : fuelType +
I(log(powerPS)) : fuelType +
gearbox +
yearOfRegistration:collector_status +
kilometer:collector_status +
vehicleType +
brand_cat) : notRepairedDamage,
data = cars_no_nas)
cbind(Predictor = summary(fit)$dimnames, p_value = summary(fit)$coefficients[, 4])
## p_value
## (Intercept) 0.000000e+00
## gearboxautomatic:notRepairedDamageyes 3.125360e-137
## gearboxmanual:notRepairedDamageyes 3.958991e-137
## gearboxautomatic:notRepairedDamageno 1.390427e-08
## vehicleTypepeople carrier:notRepairedDamageyes 1.317575e-05
## vehicleTypeconvertible:notRepairedDamageyes 8.630971e-28
## vehicleTypecoupe:notRepairedDamageyes 4.465875e-05
## vehicleTypesmall car:notRepairedDamageyes 1.693015e-03
## vehicleTypeestate:notRepairedDamageyes 1.636337e-08
## vehicleTypesedan:notRepairedDamageyes 3.471787e-01
## vehicleTypeSUV:notRepairedDamageyes 8.027336e-23
## vehicleTypepeople carrier:notRepairedDamageno 2.420756e-17
## vehicleTypeconvertible:notRepairedDamageno 9.568651e-94
## vehicleTypecoupe:notRepairedDamageno 2.372919e-13
## vehicleTypesmall car:notRepairedDamageno 5.460557e-03
## vehicleTypeestate:notRepairedDamageno 6.389341e-19
## vehicleTypesedan:notRepairedDamageno 2.492941e-08
## vehicleTypeSUV:notRepairedDamageno 3.401488e-39
## brand_cat.L:notRepairedDamageyes 1.461562e-04
## brand_cat.Q:notRepairedDamageyes 4.137184e-02
## brand_cat.C:notRepairedDamageyes 8.535466e-07
## brand_cat^4:notRepairedDamageyes 7.136172e-07
## brand_cat^5:notRepairedDamageyes 4.177829e-01
## brand_cat^6:notRepairedDamageyes 1.361594e-39
## brand_cat^7:notRepairedDamageyes 4.476149e-89
## brand_cat.L:notRepairedDamageno 6.020996e-31
## brand_cat.Q:notRepairedDamageno 2.603099e-08
## brand_cat.C:notRepairedDamageno 6.510218e-34
## brand_cat^4:notRepairedDamageno 2.422874e-59
## brand_cat^5:notRepairedDamageno 1.179173e-26
## brand_cat^6:notRepairedDamageno 5.859188e-102
## brand_cat^7:notRepairedDamageno 3.680811e-294
## powerPS:fuelTypeother:notRepairedDamageyes 4.573723e-01
## powerPS:fuelTypepetrol:notRepairedDamageyes 1.836444e-48
## powerPS:fuelTypecng:notRepairedDamageyes 3.874873e-01
## powerPS:fuelTypediesel:notRepairedDamageyes 9.220426e-22
## powerPS:fuelTypeelectric:notRepairedDamageyes 8.526414e-02
## powerPS:fuelTypehybrid:notRepairedDamageyes 7.424538e-01
## powerPS:fuelTypelpg:notRepairedDamageyes 2.523583e-10
## powerPS:fuelTypeother:notRepairedDamageno 5.999347e-01
## powerPS:fuelTypepetrol:notRepairedDamageno 0.000000e+00
## powerPS:fuelTypecng:notRepairedDamageno 2.252517e-02
## powerPS:fuelTypediesel:notRepairedDamageno 3.827960e-90
## powerPS:fuelTypeelectric:notRepairedDamageno 1.463680e-08
## powerPS:fuelTypehybrid:notRepairedDamageno 2.820426e-01
## powerPS:fuelTypelpg:notRepairedDamageno 1.007769e-15
## fuelTypeother:I(log(powerPS)):notRepairedDamageyes 7.983801e-06
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageyes 3.530306e-23
## fuelTypecng:I(log(powerPS)):notRepairedDamageyes 9.419925e-07
## fuelTypediesel:I(log(powerPS)):notRepairedDamageyes 5.356408e-35
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageyes 9.731434e-02
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageyes 9.100139e-08
## fuelTypelpg:I(log(powerPS)):notRepairedDamageyes 5.178994e-22
## fuelTypeother:I(log(powerPS)):notRepairedDamageno 3.998841e-27
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageno 0.000000e+00
## fuelTypecng:I(log(powerPS)):notRepairedDamageno 4.746770e-75
## fuelTypediesel:I(log(powerPS)):notRepairedDamageno 0.000000e+00
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageno 1.250921e-64
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageno 5.464982e-260
## fuelTypelpg:I(log(powerPS)):notRepairedDamageno 0.000000e+00
## yearOfRegistration:collector_statusmodern:notRepairedDamageyes 0.000000e+00
## yearOfRegistration:collector_statusvintage:notRepairedDamageyes 0.000000e+00
## yearOfRegistration:collector_statusmodern:notRepairedDamageno 0.000000e+00
## yearOfRegistration:collector_statusvintage:notRepairedDamageno 0.000000e+00
## collector_statusmodern:kilometer:notRepairedDamageyes 1.426397e-318
## collector_statusvintage:kilometer:notRepairedDamageyes 0.000000e+00
## collector_statusmodern:kilometer:notRepairedDamageno 0.000000e+00
## collector_statusvintage:kilometer:notRepairedDamageno 0.000000e+00
print("")
## [1] ""
print(paste("R-squared:", summary(fit)$r.squared))
## [1] "R-squared: 0.764911484617119"
So this regression achieves \(R^2=0.765\), which seems decent. There are many unsignificant parameters though, most of them for low-count fuel types such as electric or natural gas. To view these results more explicitly, let’s predict price on a few observations selected at random:
test_examples <- cars_no_nas[sample(nrow(cars_no_nas), 1000),]
pred_price <- 10^predict(fit, newdata = test_examples)
## Warning in predict.lm(fit, newdata = test_examples): prediction from a
## rank-deficient fit may be misleading
test_examples <- cbind(pred_price, test_examples)
print(head(test_examples, 20))
## pred_price price vehicleType yearOfRegistration gearbox powerPS
## 1: 686.2174 300 small car 1996 manual 60
## 2: 1176.5914 990 sedan 1999 manual 75
## 3: 2558.1184 3800 sedan 2010 manual 77
## 4: 10469.3059 13500 estate 2014 manual 90
## 5: 3986.4063 2250 people carrier 2006 manual 106
## 6: 5350.2319 7999 people carrier 2009 manual 109
## 7: 1994.4817 1750 estate 1999 manual 125
## 8: 2132.5466 1650 people carrier 2002 manual 107
## 9: 1734.0371 2290 small car 2008 manual 54
## 10: 2453.4324 4990 sedan 2004 manual 75
## 11: 3479.7591 1650 sedan 2004 manual 145
## 12: 5625.7161 5100 estate 2003 automatic 230
## 13: 1855.1173 4200 small car 2001 manual 75
## 14: 3551.4687 4300 sedan 2003 manual 116
## 15: 4102.1051 4449 sedan 2006 manual 140
## 16: 21926.5980 21500 sedan 2012 automatic 170
## 17: 6084.3462 4999 sedan 2002 manual 143
## 18: 12872.6935 14200 sedan 2013 manual 140
## 19: 1609.4185 1700 small car 1998 manual 86
## 20: 4495.4456 3100 estate 2007 manual 150
## model kilometer monthOfRegistration fuelType brand
## 1: punto 150000 10 petrol fiat
## 2: astra 125000 3 petrol opel
## 3: punto 150000 4 petrol fiat
## 4: auris 40000 9 diesel toyota
## 5: scenic 150000 6 diesel renault
## 6: berlingo 150000 12 diesel citroen
## 7: impreza 150000 6 petrol subaru
## 8: scenic 150000 4 petrol renault
## 9: panda 150000 9 petrol fiat
## 10: golf 125000 7 petrol volkswagen
## 11: 156 150000 5 petrol alfa_romeo
## 12: 5er 150000 10 lpg bmw
## 13: a2 150000 3 petrol audi
## 14: 3er 125000 11 petrol bmw
## 15: leon 150000 3 diesel seat
## 16: golf 50000 4 diesel volkswagen
## 17: c_klasse 60000 9 petrol mercedes_benz
## 18: octavia 40000 3 petrol skoda
## 19: a_klasse 150000 7 petrol mercedes_benz
## 20: astra 150000 11 diesel opel
## notRepairedDamage ad_up_time km_cat brand_cat log_price
## 1: no 19.359028 150000 mid_minus 2.477121
## 2: no 19.566667 125000 mid_minus 2.995635
## 3: no 501.360417 150000 mid_minus 3.579784
## 4: no 1228.997917 40000 mid_range 4.130334
## 5: no 536.299306 150000 mid_range 3.352183
## 6: no 1334.985417 150000 mid_minus 3.903036
## 7: no 443.735417 150000 mid_plus 3.243038
## 8: no 670.938889 150000 mid_range 3.217484
## 9: no 521.142361 150000 mid_minus 3.359835
## 10: no 120.951389 125000 mid_plus 3.698101
## 11: no 220.303472 150000 premium_minus 3.217484
## 12: no 117.760417 150000 premium 3.707570
## 13: no 361.363194 150000 premium 3.623249
## 14: no 878.544444 125000 premium 3.633468
## 15: no 1771.467361 150000 mid_minus 3.648262
## 16: no 231.984028 50000 mid_plus 4.332438
## 17: no 661.657639 60000 premium 3.698883
## 18: no 444.634722 40000 budget_plus 4.152288
## 19: no 14.452083 150000 premium 3.230449
## 20: no 2.707639 150000 mid_minus 3.491362
## collector_status
## 1: modern
## 2: modern
## 3: modern
## 4: modern
## 5: modern
## 6: modern
## 7: modern
## 8: modern
## 9: modern
## 10: modern
## 11: modern
## 12: modern
## 13: modern
## 14: modern
## 15: modern
## 16: modern
## 17: modern
## 18: modern
## 19: modern
## 20: modern
We see that some predictions are really close whereas others are widely off the mark. Let’s make this more visual:
ggplot(data = test_examples,
aes(x = price, y = pred_price)) +
geom_point(colour = 'deepskyblue4') +
geom_abline(slope = 1., intercept = 0., colour = 'red', size = 1)
The red line represents identity
The higher the price, the more the predictions seem to get wrong, which is intuitively logical seeing that we predicted log_price and not price directly. In log10 coordinates, this phenomenon disappears but the opposite appears: Errors seem larger for low prices:
ggplot(data = test_examples,
aes(x = price, y = pred_price)) +
geom_point(colour = 'deepskyblue4') +
geom_abline(slope = 1., intercept = 0., colour = 'red', size = 1) +
scale_x_log10() + scale_y_log10()
The red line represents identity
We seem to generally underestimate prices a little. Let’s have a look at the residuals:
par(mfrow = c(2, 2))
plot(fit)
## Warning in sqrt(crit * p * (1 - hh)/hh): production de NaN
## Warning in sqrt(crit * p * (1 - hh)/hh): production de NaN
par(mfrow = c(1, 1))
The residuals vs. fitted plot shows that there is still some pattern left in the data, especially toward the high-end of price predictions where we tend to under-estimate the outcome variable. The normal Q-Q curve shows a lot of departure from the ideal which is characteristic of heavy-tailed data such as we have here. The Scale-Location plot shows that the data is not completely homoscedastic. Finally, there does not seem to be any overly influencial observations according to the Residuals vs. Leverage plot, but we did a lot of cleaning up beforehand to get rid of many outliers. Looking at the three observations for which we are provided indices in the plot, we see that they are all alternative fuel vehicles:
cars_no_nas[c(145504, 40697, 166256)]
## price vehicleType yearOfRegistration gearbox powerPS model
## 1: 49700 people carrier 2015 automatic 179 transporter
## 2: 24850 sedan 2012 automatic 306 other
## 3: 3500 sedan 2002 manual 75 golf
## kilometer monthOfRegistration fuelType brand notRepairedDamage
## 1: 20000 3 diesel volkswagen no
## 2: 60000 8 hybrid bmw yes
## 3: 100000 1 petrol volkswagen no
## ad_up_time km_cat brand_cat log_price collector_status
## 1: 1126.81389 20000 mid_plus 4.696356 modern
## 2: 94.83611 60000 premium 4.395326 modern
## 3: 677.73333 100000 mid_plus 3.544068 modern
Using a dataset made available on Kaggle and using data from the German eBay used car ads, I started by doing significant amounts of data cleaning, translations and conversions. I was then able to analyse each variable in turn and noticed that the price variable in particular had a very negatively skewed distribution and that a log transform was required. This initial phase of the analysis also allowed me to discover more issues with the data. Year of registration, price and mileage all had non-sensical entries, most probably due to human errors and maybe cheekiness. Most variables have large amounts of missing values. The brand and model variables tell us a lot about the German market, one of the most high-end market in Europe. Finally, I established that the ad up time was probably not usable as a predictor.
I was then able to look at associations between variables. I established that among the continuous variables, power had highest correlation to price, especially in log form, but with different profiles depending on fuel and transmission types. Looking at mileage was interesting because it exhibited two distinct correlations to price, one for “modern” cars and one for “vintage”. Mileage, another strong contributor, is unfortunately not really a continuous variable in this dataset, as it can only take a few distinct values.
Most categorical variables are also useful to the model. Fuel, gearbox and vehicle type all have associations with price. Unrepaired damages are associated to such a price drop that they litterally create a parallel market, which we must make sure that our linear model is able to capture as distinct from the bulk of the transactions.
Finally, the brand category variable that I manually created turned out to be quite useful as well as it exhibits a significant association to price.
Using the knowledge I had gained from this exploratory analysis, I built a linear model that explains 76.5% of the variance in the data. While this is not a very high level of accuracy, it is a very crude model that cannot account for all associations between variables. For instance, the price of vintage cars can vary differently depending on the brand of the vehicle. Since brand contains over 40 levels, I deliberately aggregated it into categories that try to reflect current market perception, not that of 25 or 30 years ago, so it would not necessarily be adequate as a predictor for vintage car prices. I also left model out of my set of predictors because it has over 200 levels, which is too complex for what I was trying to achieve here. Other potential predictors such as post code and free text description were also left out; while they would require a lot more work to provide sensible information, they could add significantly to the model.
Among other limitations, there is the high level of human error that this dataset seems to contain. To understand the source of some of those errors or to identify some new ones, it would be useful to be able to access the German eBay website as it was when data collection was performed. Similarly, I would like to better understand the criteria and process that were applied when scraping the website.
In terms of further development of this analysis, it would be interesting to compare this dataset to used car data from other sources to see if the model developped here would give us the same level of accuracy, or if eBay market prices are specific. Additionally, as mentioned, the name field probably contains useful information although it would be very tedious to tidy up.
As a closing observation, I would point out that we often hear that domain-knowledge is critical to a good data analyst. I don’t know if it is a general rule, but in the present project, I found that an in-depth knowledge of the automotive market definitely helped me make more meaningful analyses, develop relevant theories and engineer useful variables.