Introduction

The dataset that I am using in this project was found on Kaggle, the well-known Machine Learning Competition website. Click here for a full description of the dataset, or read the description file.

I worked in the automotive industry for 12 years and I remain a devoted pistonhead, so getting a better understanding of the used car market was very appealing.

This project focuses on the exploratory data analysis phase of the dataset. In particular, I will try to detect associations between variables, especially against price. The end-goal of such a project would be to build a price-prediction model for vehicles sold by eBay users.

0. Data Preparation

The dataset is well structured but there are some free text fields and many missing values. Moreover, some of the data is in German and needs to be translated. Most of the translations are straightforward, and Google Translate comes to the rescue where required! The “name” column is problematic: It is free text which causes all sorts of issues, and although a German NLP engineer could perhaps find interesting information in it, I chose to simply drop it.

library(data.table)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:data.table':
## 
##     hour, mday, month, quarter, wday, week, yday, year

# Load the dataset in a data.table, excluding "names":
cars <- data.table::fread('./autos.csv', na.strings = "", stringsAsFactors = TRUE, 
              drop = 2)

# Convert date columns to POSIXct:
date_cols <- c("dateCrawled", "dateCreated", "lastSeen")
for(c in date_cols) set(cars, 
                        j = c,
                        value = parse_date_time(cars[[c]], "%Y-%m-%d %H:%M:%S"))
head(cars)

##            dateCrawled seller offerType price abtest vehicleType
## 1: 2016-03-24 11:52:17 privat   Angebot   480   test          NA
## 2: 2016-03-24 10:58:45 privat   Angebot 18300   test       coupe
## 3: 2016-03-14 12:52:21 privat   Angebot  9800   test         suv
## 4: 2016-03-17 16:54:04 privat   Angebot  1500   test  kleinwagen
## 5: 2016-03-31 17:25:20 privat   Angebot  3600   test  kleinwagen
## 6: 2016-04-04 17:36:23 privat   Angebot   650   test   limousine
##    yearOfRegistration   gearbox powerPS model kilometer
## 1:               1993   manuell       0  golf    150000
## 2:               2011   manuell     190    NA    125000
## 3:               2004 automatik     163 grand    125000
## 4:               2001   manuell      75  golf    150000
## 5:               2008   manuell      69 fabia     90000
## 6:               1995   manuell     102   3er    150000
##    monthOfRegistration fuelType      brand notRepairedDamage dateCreated
## 1:                   0   benzin volkswagen                NA  2016-03-24
## 2:                   5   diesel       audi                ja  2016-03-24
## 3:                   8   diesel       jeep                NA  2016-03-14
## 4:                   6   benzin volkswagen              nein  2016-03-17
## 5:                   7   diesel      skoda              nein  2016-03-31
## 6:                  10   benzin        bmw                ja  2016-04-04
##    nrOfPictures postalCode            lastSeen
## 1:            0      70435 2016-04-07 03:16:57
## 2:            0      66954 2016-04-07 01:46:50
## 3:            0      90480 2016-04-05 12:47:46
## 4:            0      91074 2016-03-17 17:40:17
## 5:            0      60437 2016-04-06 10:17:21
## 6:            0      33775 2016-04-06 19:17:07

str(cars)

## Classes 'data.table' and 'data.frame':   371824 obs. of  19 variables:
##  $ dateCrawled        : POSIXct, format: "2016-03-24 11:52:17" "2016-03-24 10:58:45" ...
##  $ seller             : Factor w/ 2 levels "gewerblich","privat": 2 2 2 2 2 2 2 2 2 2 ...
##  $ offerType          : Factor w/ 2 levels "Angebot","Gesuch": 1 1 1 1 1 1 1 1 1 1 ...
##  $ price              : int  480 18300 9800 1500 3600 650 2200 0 14500 999 ...
##  $ abtest             : Factor w/ 2 levels "control","test": 2 2 2 2 2 2 2 2 1 2 ...
##  $ vehicleType        : Factor w/ 9 levels NA,"andere","bus",..: 1 5 9 6 6 8 4 8 3 6 ...
##  $ yearOfRegistration : int  1993 2011 2004 2001 2008 1995 2004 1980 2014 1998 ...
##  $ gearbox            : Factor w/ 3 levels NA,"automatik",..: 3 3 2 3 3 3 3 3 3 3 ...
##  $ powerPS            : int  0 190 163 75 69 102 109 50 125 101 ...
##  $ model              : Factor w/ 252 levels NA,"100","145",..: 120 1 121 120 105 13 10 42 63 120 ...
##  $ kilometer          : int  150000 125000 125000 150000 90000 150000 150000 40000 30000 150000 ...
##  $ monthOfRegistration: int  0 5 8 6 7 10 8 7 8 0 ...
##  $ fuelType           : Factor w/ 8 levels NA,"andere","benzin",..: 3 5 5 3 5 3 3 3 3 1 ...
##  $ brand              : Factor w/ 40 levels "alfa_romeo","audi",..: 39 2 15 39 32 3 26 39 11 39 ...
##  $ notRepairedDamage  : Factor w/ 3 levels NA,"ja","nein": 1 2 1 3 3 2 3 3 1 1 ...
##  $ dateCreated        : POSIXct, format: "2016-03-24" "2016-03-24" ...
##  $ nrOfPictures       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ postalCode         : int  70435 66954 90480 91074 60437 33775 67112 19348 94505 27472 ...
##  $ lastSeen           : POSIXct, format: "2016-04-07 03:16:57" "2016-04-07 01:46:50" ...
##  - attr(*, ".internal.selfref")=<externalptr>

The column abtest seems to be internal to E-Bay, probably the control or test groups for some internal A/B testing. I don’t believe we will need it. The column nrOfPictures only contains zeros, probably a data collection issue. We don’t need it either. I will also drop the postalCode column, because I don’t intend to cross-reference the data with a postal map of Germany although that could be another interesting project.

The values in the different factors are fairly straightforward. I translate them into English; at the same time I drop the 12 ads from people looking to purchase a car (offerType == Gesuch), as I don’t have confidence that they would have accurate information about car specifications, nor sensible asking prices. As a result, we no longer need this column.

The seller column contains only 3 professional traders, which is insignificant in comparison to the total number of observations. Therefore I drop the column.

Finally, I also noticed that some zeros should really be NAs: In price, monthOfRegistration, powerPS.

The dateCrawled and dateCreated columns might not be very useful in themselves, but they allow us to calculate how long an ad has been up for on the website, and thus gives us an approximate lower bound for selling time (see below for a discussion on this). By default this value is calculated in minutes, I convert it to days. With this new ad_up_time variable, we no longer need the other date variables.

We now have a clean and usable dataset:

str(cars)

## Classes 'data.table' and 'data.frame':   371812 obs. of  12 variables:
##  $ price              : int  480 18300 9800 1500 3600 650 2200 NA 14500 999 ...
##  $ vehicleType        : Factor w/ 8 levels "other","people carrier",..: NA 4 8 5 5 7 3 7 2 5 ...
##  $ yearOfRegistration : int  1993 2011 2004 2001 2008 1995 2004 1980 2014 1998 ...
##  $ gearbox            : Factor w/ 2 levels "automatic","manual": 2 2 1 2 2 2 2 2 2 2 ...
##  $ powerPS            : int  NA 190 163 75 69 102 109 50 125 101 ...
##  $ model              : Factor w/ 251 levels "100","145","147",..: 119 NA 120 119 104 12 9 41 62 119 ...
##  $ kilometer          : int  150000 125000 125000 150000 90000 150000 150000 40000 30000 150000 ...
##  $ monthOfRegistration: int  NA 5 8 6 7 10 8 7 8 NA ...
##  $ fuelType           : Factor w/ 7 levels "other","petrol",..: 2 4 4 2 4 2 2 2 2 NA ...
##  $ brand              : Factor w/ 40 levels "alfa_romeo","audi",..: 39 2 15 39 32 3 26 39 11 39 ...
##  $ notRepairedDamage  : Factor w/ 2 levels "yes","no": NA 1 NA 2 2 1 2 2 NA NA ...
##  $ ad_up_time         : num  818.53 817 1319.81 1.93 342.17 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "index")= int

1. Description of Individual Variables

We have 12 variables which I can plot individually.

1.1. Price

summary(cars$price)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## 1.000e+00 1.250e+03 3.000e+03 1.780e+04 7.490e+03 2.147e+09     10779

From the summary above, we see that prices go up to over \(€2.10^9\)! This is obviously wrong. While looking at all cars over €100,000 in more detail, I noticed that many of these prices seemed either entered at random or confused with kilometers: I found patterns such as ‘111111’ or ‘12345678’, or mainstream models over €150,000. To try and filter out most of these issues, I assumed that such high-end cars would most likely be coupes, convertibles or SUVs. I also dropped any observation above €200,000, assuming the majority of them would be input errors. I then dropped any row that did not match these criteria and looked at the brands of cars above €75,000:

# Filter out most likely price errors:
cars <- cars[price <= 200000 &
                 !(price > 100000 & 
                 vehicleType %in% c('sedan', 'small car', 
                                    'estate', 'people carrier', 'other')), ]
cars[price > 75000, unique(brand)]

##  [1] chevrolet     porsche       other         mercedes_benz bmw          
##  [6] volkswagen    audi          jaguar        seat          ford         
## [11] land_rover    opel          renault       smart         nissan       
## [16] mitsubishi   
## 40 Levels: alfa_romeo audi bmw chevrolet chrysler citroen dacia ... volvo

Some of these brands are not considered premium and it is surprising to find them here. Let’s see the models and prices for the non-premium brands (Volkswagen, Seat, Ford, Opel, Renault, Smart, Nissan, Mitsubishi):

# Examine non premium brands with prices over €75k:
cars[price > 75000 & 
         brand %in% c("volkswagen", "seat", "ford", "opel", "renault",
                      "smart", "nissan", "mitsubishi")]

##      price    vehicleType yearOfRegistration   gearbox powerPS       model
##  1:  99999             NA               1910        NA      NA          NA
##  2:  99999             NA               1970        NA      NA       other
##  3: 123456             NA               2000    manual      75        golf
##  4: 154651             NA               2005        NA      NA       altea
##  5: 130000          coupe               1968        NA      NA     mustang
##  6: 100000          coupe               1968    manual     131       other
##  7:  93000          coupe               1971    manual      NA       other
##  8: 100000      small car               2013 automatic      NA       other
##  9: 100000             NA               2000        NA      NA          NA
## 10:  78000            SUV               2015 automatic     340     touareg
## 11:  99999            SUV               2001    manual     114     x_trail
## 12:  99999             NA               1910        NA      NA          NA
## 13: 130000             NA               2000        NA      NA          NA
## 14: 120000          coupe               1967    manual     550     mustang
## 15:  99999             NA               2017    manual     204        golf
## 16:  85000 people carrier               1967    manual      44 transporter
## 17: 123456             NA               1985        NA      NA        golf
## 18:  79499            SUV               2015 automatic     340     touareg
## 19:  99999      small car               2009 automatic      71      fortwo
##     kilometer monthOfRegistration fuelType      brand notRepairedDamage
##  1:    150000                  NA       NA volkswagen                NA
##  2:    150000                  NA       NA volkswagen                NA
##  3:    150000                   7       NA volkswagen                NA
##  4:    150000                  NA       NA       seat                NA
##  5:     50000                   7   petrol       ford                NA
##  6:    100000                   5   petrol       opel                no
##  7:     30000                   2   petrol    renault                no
##  8:    150000                  11 electric      smart                no
##  9:    150000                  NA       NA volkswagen                NA
## 10:     50000                   4   diesel volkswagen                no
## 11:    150000                   9   diesel     nissan                no
## 12:    150000                  NA       NA volkswagen                NA
## 13:    150000                  NA       NA mitsubishi                NA
## 14:    100000                   9   petrol       ford                no
## 15:    150000                   2       NA volkswagen                NA
## 16:    150000                   1   petrol volkswagen                no
## 17:    125000                  NA       NA volkswagen                NA
## 18:     20000                   5   diesel volkswagen                no
## 19:     50000                  10   petrol      smart                no
##     ad_up_time
##  1: 1341.10625
##  2:   89.24861
##  3:  207.49583
##  4:   12.56250
##  5:  932.87153
##  6:    0.00000
##  7:  664.18125
##  8:    0.00000
##  9: 1791.43472
## 10: 1817.59167
## 11: 1441.67153
## 12:  125.89375
## 13: 1087.41875
## 14:  904.70556
## 15:  802.15972
## 16: 1592.41250
## 17:  854.05069
## 18: 1642.35347
## 19:  744.62014

So we find a variety of models there, some unnamed. The VW Touareg and Ford Mustang seem legitimate. There are also some really old cars, for which a high price might be justified to a collector, but in most cases the model name is not mentioned for these so it is difficult to say whether they are genuinely expensive cars or errors. Since we have more than enough observations overall, I decide to drop them and only retain the VW Touareg and Ford Mustang.

There are also cars below €100, which I assume are also errors or sellers not wanting to filter themselves out of the price selector on the website. I drop these too.

# Further tidying up of price: 
cars <- cars[price <= 75000 |
                 brand %in% c("chevrolet", "porsche", "other", "mercedes_benz", 
                              "bmw", "audi", "jaguar", "land-rover") |
                 model %in% c("mustang", "touareg"), 
             ]
cars <- cars[price >= 100,]

I then plot the variable again with log10 transformation on the x-axis:

require(ggplot2)

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.2.5

library(scales)

## Warning: package 'scales' was built under R version 3.2.5

library(ggthemes)

## Warning: package 'ggthemes' was built under R version 3.2.5

library(gridExtra)

## Warning: package 'gridExtra' was built under R version 3.2.5

library(GGally)

## Warning: package 'GGally' was built under R version 3.2.5

ggplot(data = subset(cars, !is.na(price)), aes(x = price)) + 
    geom_histogram(fill = 'deepskyblue4', bins = 200) +
    scale_x_continuous(trans = "log10", 
                       breaks = c(100, 300, 1000, 3.e3, 1.e4, 3.e4, 1.e5, 3.e5)) +
    geom_vline(xintercept = median(cars$price), colour = 'blue', size = 1) +
    geom_vline(xintercept = mean(cars$price), colour = 'brown', size = 1)

Red bar indicates mean, blue bar indicates median

With the scale transformation, we have a roughly normal distribution. We notice that some bins have a much higher count than their neighbours, presumably corresponding to round values or “psychological” price points (e.g. €9,900).

1.2. Vehicle Type

This categorical variable has 8 levels and indicates the body style of the car (sedan, coupe, SUV etc.).

ggplot(data = cars, aes(x = vehicleType)) +
    geom_bar(fill = 'deepskyblue4') + coord_flip()

This tends to reflect the general Western European market, with a prominence of “family” vehicles and smaller volumes of “niche” products (although a sample of new car registrations would probably show a higher proportion of SUVs considering the rise of this body style in recent years). Also note the large number of NAs – about 20,000. E-Bay could definitely do a better job at encouraging their customers to write their ads properly.

1.3. Year of Registration

This will basically tell us about the age of the vehicle. From the data summary, we saw that the minimum year is 1,000, which will come as a surprise to most historians. The maximum year is 9,999, which is obviously wrong as we will all be teleporting by then. So we clearly need to tidy up this variable. To keep things simple, I select only vehicles registered since 1960. As the data was collected in 2016, we set that year as our upper bound. Then we plot a histogram.

summary(cars$yearOfRegistration)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    1999    2004    2004    2008    9999

# Leave out cars first registered before 1960 and, supposedly, after 2016:
cars <- cars[yearOfRegistration >= 1960 & yearOfRegistration <= 2016]

ggplot(data = subset(cars, !is.na(yearOfRegistration)),
       aes(x = yearOfRegistration)) +
    geom_histogram(binwidth = 1, fill = 'deepskyblue4') +
    scale_x_continuous(breaks = seq(1960, 2016, 5)) +
    geom_vline(xintercept = mean(cars$yearOfRegistration),
               size = 1, colour = 'brown') +
    geom_vline(xintercept = median(cars$yearOfRegistration),
               size = 1, colour = 'blue')

Red bar indicates mean, blue bar indicates median

Note that the mean and the median are identical. The distribution is close to normal, with the following exceptions:

A significant left tail
Three peaks: one in 1999-2000, one in 2005-2006 and a large one in 2016.

I did some research and it turns out that 1999, 2000, 2005, 2006 were all among the strongest years for new car registrations in Germany in the last 20 years. As they are also in the heart of the used car market in terms of age, it makes sense that they would translate into these peaks.
As for 2016, the explanation is less obvious, especially as the data was collected in March and April, which is quite ealy in the year. The peak could be due to some listing errors (on purpose or not) where owners enter a date at random or to attract visitors to their ad. It could also be linked to the website’s features when creating an ad (eg. default value in drop-down menu).
But there could also be a number of genuine 2016 cars suddenly arriving on the market. Employees in the automotive industry have often access to cheap car leasing schemes, whereby they can change their vehicle every 6 months or so. More importantly, most manufacturers register large numbers of demonstrators, press units and self-registered cars (new vehicles registered by the manufacturer or its dealers, in order to artificially boost market share and / or create cheaper opportunities to capture some customers over the competition).

Maybe further investigation will tell us whether this theory holds.

1.4. Month of Registration

We should note from the summary in the beginning that monthOfRegistration contains nearly 38,000 NAs. Again, it is surprising that year seems mandatory (although its value is clearly not controlled) but month is not. This variable is really a categorical variable.

summary(cars$monthOfRegistration)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   4.000   6.000   6.387   9.000  12.000   28445

ggplot(data = subset(cars, !is.na(monthOfRegistration)),
       aes(x = monthOfRegistration)) +
    geom_histogram(binwidth = 0.5, fill = 'deepskyblue4') +
    scale_x_continuous(breaks = 1:12)

March and June are the strongest months for vehicle registrations. A quick research on the internet confirmed that this is consistent with the car registration seasonality that we observe in Germany (incidently, there is a similar effect in France and the UK).

1.5. Transmission Type

The gearbox variable can only take two values: manual or automatic.

ggplot(data = cars, aes(x = gearbox)) +
    geom_bar(fill = 'deepskyblue4')

The European market is primarily a manual transmission market, so no surprises here. Again, there are about 15,000 NAs.

1.6. Engine Power

The engine power is measured using the metric PS (1PS = 736 Watts). Again, we know from the summary that there are some nonsensical values in the data. I was prepared to remove anything below 40PS, but then I realised that there are Trabants in the dataset! This venerable left-over from the East-German Communist era is now widely used in Berlin as a rental car for people looking for a different experience of the city. Its 2-stroke engine managed 26PS!. Out of respect for such an antiquity, I decided to set the lower bound at 25PS.

As for the excessively high PS values, it looks like most of them are due to people confusing power output and engine capacity (in cm\(^3\)). I decided to set the limit at 600PS, a more than respectable value.

# Drop cars with output below 25 or over 600:
cars <- cars[powerPS >= 25 & powerPS <= 600, ]

ggplot(data = subset(cars, !is.na(powerPS)),
       aes(x = powerPS)) +
    geom_histogram(binwidth = 10, fill = 'deepskyblue4') +
    scale_x_continuous(breaks = seq(0, 600, 20)) +
    geom_vline(xintercept = mean(cars$powerPS), size = 1, colour = 'brown') +
    geom_vline(xintercept = median(cars$powerPS), size = 1, colour = 'blue') +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

Red bar indicates mean, blue bar indicates median

The distribution is postively skewed with a long tail (that may contain errors, as we just saw).

There are some prefered values – around 60, 100, 120, 140 for instance. These are values that have become some sort of “market standards”: Most manufacturers will offer engines around these values. It makes it easier for the consumer to compare products. Real, measurable power output of a car is never exactly equal to its rated horsepower due to variances in industrial processes. Differences of around 5% are not uncommon for a same model.

1.7. Model

This factor variable contails 251 levels, far too many to plot. But we can select the top 20:

# Create a data.table of unique model names sorted by their respective counts:
model_count <- cars[, .N, by = model]
model_count <- model_count[order(model_count$N, decreasing = TRUE), ]


ggplot(data = model_count[1:10, ], 
       aes(x = reorder(model, -N), y = N)) +
    geom_bar(stat = 'identity', fill = 'deepskyblue4') + coord_flip()

Without surprise, the Volkswagen Golf (the most popular car in Europe) is also number one in the dataset. Note the very large number of vehicles designated as “other” – while some of them are probably models that exist but cannot be selected on the eBay website, it is unlikely that their number would be that high so we have to assume that once again, they are mostly due to human error. Analysing the namevariable (the free-text variable that we dropped at the very begining of the study) might help identify these cars, but this would be a whole project in itself.

The next car on the list is the BMW 3-Series, which is by no means a cheap car. This fact alone shows that we are indeed working on the German market – the most high-end market in Europe.

1.8. Mileage

… or more accurately, “kilometreage”.

summary(cars$kilometer)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5000  100000  150000  125500  150000  150000

ggplot(data = cars, aes(x = kilometer)) +
    geom_histogram(binwidth = 5000, fill = 'deepskyblue4') +
    geom_vline(xintercept = mean(cars$kilometer), size = 1, colour = 'brown') +
    geom_vline(xintercept = median(cars$kilometer), size = 1, colour = 'blue')

Red bar indicates mean, blue bar indicates median

This histogram clearly shows that mileage is not a free input field. This is surprising, because mileage is one of the most important pieces of information when it comes to used cars, so maximum accuracy would have been desirable. Moreover, when I went to E-Bay myself to try out the used car ad generator, I was able to enter any value. So maybe there is some aggregating mechanism during data extract or the functionality changed recently?

The second thing that is striking with this chart is the predominance of 150,000km cars. Given the nature of the data, there is no point trying to apply scale transformations to improve the plot. This variable should actually be considered as a categorical variable more than a continuous one. I therefore add a variable in the dataset called km_cat. We will see later which one is more appropriate.

cars <- cars[, km_cat := factor(kilometer)]

1.9. Fuel Type

This factor variable also contains many NAs (over 33,000).

ggplot(data = cars, aes(x = fuelType)) +
    geom_bar(fill = 'deepskyblue4')

“Alternative”" sources of energy are almost negligible in this dataset, which is not surprising considering that over 50% of the vehicles were 12 to 13 years old when this data was extracted.

Petrol is roughly twice as prominent as Diesel.

1.10. Brand

This is another factor with many levels (40) so we will take the same approach as with model.

# Create a data.table of unique model names sorted by their respective counts:
brand_count <- cars[, .N, by = brand]
brand_count <- brand_count[order(brand_count$N, decreasing = TRUE), ]


ggplot(data = brand_count[1:10, ], 
       aes(x = reorder(brand, -N), y = N)) +
    geom_bar(stat = 'identity', fill = 'deepskyblue4') + coord_flip()

The top 5 brands are German. Number 6 is Ford, which in Europe is largely perceived as German as it has its European headquarters in Cologne and many of its European products are actually designed and built in Germany. The next two brands are French, then Fiat is Italian. Seat is Spanish but it is actually part of the VW Group and their cars share almost all their components with VW products.
In other words, German manufacturers are hugely dominant on their home turf.

One issue we will have when looking for associations between variables is that with brand containing 40 levels, plots will be really hard to read (in addition to causing long processing times). We can improve this by grouping brands into categories based on brand perception. We could simply use mean prices to make these distinctions, but we would then create a correlation to price where there isn’t necessarily one. Moreover, brand perception involves a lot more than just price – there is history, perceived quality, marketing etc.

So I chose another approach: Use my domain-knowledge to manually classify the 40 levels into clusters. I didn’t plan on an exact number of clusters beforehand, just something manageable. I then intuitively grouped brands together and came up with 8 clusters which I then named.
Althought this intuitive approach is subjective, I believe it actually adds information to the dataset, unlike the “group by price” method which removes some.

# Define brand groups based on market perception:
other <- c('lada', 'trabant', 'other')
budget <- c('chevrolet', 'daewoo', 'dacia')
budget_plus <- c('hyundai', 'kia', 'skoda', 'daihatsu')
mid_minus <- c('chrysler', 'fiat', 'ford', 'citroen', 'mitsubishi', 'opel', 
               'rover', 'seat', 'suzuki')
mid_range <- c('nissan', 'peugeot', 'renault', 'toyota')
mid_plus <- c('honda', 'mazda', 'smart', 'subaru', 'volkswagen')
premium_minus <- c('alfa_romeo', 'lancia', 'saab', 'jeep', 'volvo', 'mini')
premium <- c('audi', 'bmw', 'jaguar', 'land_rover', 'mercedes_benz', 'porsche')

cars[brand %in% other, brand_cat := 'other']
cars[brand %in% budget, brand_cat := 'budget']
cars[brand %in% budget_plus, brand_cat := 'budget_plus']
cars[brand %in% mid_minus, brand_cat := 'mid_minus']
cars[brand %in% mid_range, brand_cat := 'mid_range']
cars[brand %in% mid_plus, brand_cat := 'mid_plus']
cars[brand %in% premium_minus, brand_cat := 'premium_minus']
cars[brand %in% premium, brand_cat := 'premium']

cars$brand_cat <- ordered(cars$brand_cat, 
                          levels = c('budget', 'budget_plus',
                                    'mid_minus', 'mid_range',
                                    'mid_plus', 'premium_minus',
                                    'premium', 'other'),
                          labels = c('budget', 'budget_plus',
                                    'mid_minus', 'mid_range',
                                    'mid_plus', 'premium_minus',
                                    'premium', 'other'))

ggplot(data = cars, aes(x = brand_cat)) +
    geom_bar(fill = 'deepskyblue4') + coord_flip()

There are large disparities between brand categories in terms of count. The biggest surprise is the predominance of the premium brands, which are supposedly the most expensive. But the analysis of the top 10 brands above explains it: This category contains BMW, Audi and Mercedes which are all among the most common brands in Germany. Clearly, this plot would look very different in most other countries.

1.11. Unrepaired Damage

ggplot(data = cars, aes(x = notRepairedDamage)) +
    geom_bar(fill = 'deepskyblue4')

The variable notRepairedDamage can only take two values: “yes” or “no”. But it does have NAs – about 72,000, which is twice as many as the number of “yes”. It does not seem like this is a mandatory field (and I could not find it on EBay). As I understand it, it refers to potential unrepaired damage on the vehicle being sold.

1.12. Ad Up-time

This is a composite variable that we created by substracting the dateCreated from the dateCrawled date. It is measured in days. Of course the idea is to look for potential correlations to other variables, especially price. From that point of view, there are important limitations associated with ad_up_time:

We don’t know whether there is any incitation for sellers to remove ads once the vehicle is sold. This means some ads might be really old and unattended despite the car having been sold long ago, through eBay or some other channel.
We are not even sure that ads removed from the website cannot be present in the scaped data.

summary(cars$ad_up_time)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   121.0   358.5   528.8   819.9  1982.0

ggplot(data = cars, aes(x = ad_up_time)) +
    geom_histogram(binwidth = 7, fill = 'deepskyblue4') +
    scale_x_continuous(breaks = seq(0, 2000, 120)) +
    geom_vline(xintercept = mean(cars$ad_up_time), colour = 'brown', size = 1) +
    geom_vline(xintercept = median(cars$ad_up_time), colour = 'blue', size = 1)

Red bar indicates mean, blue car indicates median

Here we have observations up to 2,000 days (nearly 5.5 years). The long tail contains a significant number of observations and there seems to be a lot of variance in the data, so it is difficult to just drop observations after an arbitrary number of days posted.
The other thing I notice is that there seems to be some prevailing values at roughly three months intervals. I am not sure why this is – it might be related to the pricing structure used by eBay…?

2. Multivariate Analyses

In this section, I will examine pairs of variables to look for associations between variables, and then zoom in on some multi-variable combinations that seem particularly interesting.

First, let’s visualize potential variable associations with a plot matrix (to reduce computing time, I use a sample of 10,000 observations):

cars$log_price <- log10(cars$price)

ggpairs(select(cars[sample(nrow(cars), 10000)], -c(price, model, km_cat, monthOfRegistration, brand)), 
        cardinality_threshold = 40,
        axisLabels = 'internal',
        lower = list(combo = wrap('facethist', bins = 100),
                     continuous = wrap('points', alpha = .2)))

#ggsave('plot_matrix.png', width = 50, height = 50, units = 'cm', dpi = 400)

Even so the plot matrix is very clutered. For continuous variables we can try a correlation plot:

ggcorr(cars[, .SD, .SDcols = c("yearOfRegistration", "powerPS", 
                               "kilometer", "ad_up_time", "log_price")],
       geom = "tile", nbreaks = 5, label = TRUE, label_color = "deepskyblue4", 
       label_size = 5, hjust = .5, size = 3, angle = -45)

We notice fairly strong correlations between log(price) and power output, year of registration and mileage (negative in the latter case). Other correlations are weak. We will build individual plots for the most promising variable pairs. Eventually I would like to be able to predict prices based on the other variables, so let’s focus on price as one of our variables.

2.1. Continuous variables

2.1.1. Price vs. Power

Intuitively, it would seem obvious that there should be a pretty strong relationship between power and price. Is it really the case?

ggplot(data = cars, aes(x = powerPS, y = log_price)) + 
    geom_point(fill = "deepskyblue1",
               shape = 21, alpha = .15, size = 1, position = 'jitter') +
    geom_smooth()

## `geom_smooth()` using method = 'gam'

So there is a correlation indeed, but there is also a lot of noise, The relationship does not seem quite linear but looks more like a log function.

2.1.2. Price vs. Power vs. Fuel Type

Let’s try to break this down but fuel type. The reason for doing this is that from experience, I know that Diesel engines are typically more expensive (on the new car market at least) than petrol for a same value of power output. Therefore we have reason to believe that the relationship between power and price will be different, or at the very least, shifted.

I plot price vs PowerPS with a colour-coding for fuelType (and keeping only the two most popular: petrol and Diesel – upper-case D is intentional as Diesel is the name of the inventor, Rudolf Diesel):

ggplot(data = subset(cars, fuelType %in% c("petrol", "diesel")),
       aes(x = powerPS, y = log_price, fill = fuelType)) +
    geom_point(shape = 21, alpha = .15, size = 1, position = 'jitter') +
    scale_fill_brewer(type = 'qual',
                       guide = guide_legend(reverse = TRUE,
                                            override.aes = list(alpha = 1,
                                                                size = 2))) +
    geom_smooth(aes(colour = fuelType))

## `geom_smooth()` using method = 'gam'

This plot confirms that for a given power output, Diesel cars are more expensive (not a great surprise considering that modern Diesel powertrains are usually more technologically advanced). The shape of the smoothers is different too. The petrol smoother looks closer to a regular root or log function, whereas for Diesel cars, it looks more erratic.

ggplot(data = subset(cars, fuelType %in% c("petrol", "diesel")),
       aes(x = log(powerPS), y = log_price, fill = fuelType)) +
    geom_point(shape = 21, alpha = .15, size = 1, position = 'jitter') +
    scale_fill_brewer(type = 'qual',
                       guide = guide_legend(reverse = TRUE,
                                            override.aes = list(alpha = 1,
                                                                size = 2))) +
    geom_smooth(aes(colour = fuelType))

## `geom_smooth()` using method = 'gam'

Taking the log of powerPS definitely helps make the relationship look more linear for petrol cars, but it does not have very convincing results on Diesel cars – the curve only looks linear towards the centre of the distribution. This means that when building the linear model, we may need a different set of parameters for each fuel type. Nevertheless, over the power range where the Diesel smoother looks linear, I notice that both curves are almost parallel. This indicates that at least for the bulk of observations, price varies in a similar way on petrol and Diesel cars, but the Diesel prices are shifted up by a constant (in log x log representation).

2.1.3. Price vs. Power vs. Fuel Type vs. Gearbox

Let’s try to plot these variables together:

ggplot(data = subset(cars, 
                     fuelType %in% c("petrol", "diesel") & !is.na(gearbox)),
       aes(x = log(powerPS), y = log_price)) +
    geom_point(shape = 21, fill = 'deepskyblue1', 
               alpha = .05, position = 'jitter') +
    geom_smooth() +
    facet_grid(gearbox ~ fuelType)

## `geom_smooth()` using method = 'gam'

We see that petrol cars are more spread out in terms of price and power than Diesel cars. Using the log of power, the smoothers are close to linear in the first category but not in the second one. We also notice that automatic cars tend to be both more powerful and more expensive than manual, but this will need to be confirmed later.

cars[, .(log_price_cor = cor(log_price, log(powerPS)))]

##    log_price_cor
## 1:     0.5837276

cars[fuelType %in% c("petrol", "diesel") & !is.na(gearbox),
     .(log_price_cor = cor(log_price, log(powerPS))), by = .(fuelType, gearbox)]

##    fuelType   gearbox log_price_cor
## 1:   diesel    manual     0.4795269
## 2:   diesel automatic     0.4939476
## 3:   petrol    manual     0.4917666
## 4:   petrol automatic     0.5857267

So when grouped by transmission and fuel types, the correlations are all around the .5 mark, except the Petrol / Automatic combination which is around .6. We see that the associations are similar in shape although for manual Diesel cars, the data points form a much tighter group so it is harder to tell.

So far, we have found that there is a positive correlation of price with power, and associations to gearbox and fuel type.

2.1.4. Price vs. Year of Registration

Another continuous variable with a significant correlation to price (according to our correlation matrix) is yearOfRegistration.

ggplot(data = subset(cars, !is.na(yearOfRegistration)),
       aes(x = yearOfRegistration, y = price)) +
    geom_point(fill = 'deepskyblue4', shape = 21, alpha = .05, 
               position = 'jitter') +
    scale_y_continuous(
        breaks = c(300, 1000, 3000, 10000, 30000, 100000, 300000)) + 
    coord_trans(x = 'identity', y = 'log10', limx = c(1960, 2016))

The distribution looks very strange.

The first feature that I notice is a clear positive correlation that looks roughly linear for cars registered from roughly 1992 or 1995 onwards.
Secondly, we see a vertical line of observations in 2016, which we had already noticed when looking at yearOfRegistration alone. On the scatter plot, we can see that this line includes cars at just about any price from a couple hundred Euros to maybe €20,000. Strangely, the bulk of cars just older than this seems to be more expensive, which suggests that they are most probably errors (intentional or not) or people just picking the first year available on the drop-down menu when listing their car. If these cars were really 2016 cars, it would be illogical for them to be advertised at cheaper prices than cars 5 years older. This strongly advocates in favour of filtering 2016 cars out of the data.
Thirdly, it seems that cars registered before the early 1990’s become more expensive the older they get. This would be consistent with vintage vehicles that tend to gain value as they get older – typically after 20-25 years, provided they are in good condition and were not too common to start with.

Let’s first split the data in two with a cut-off in 1995 (21 years ago) and remove the 2016 cars from the dataset (there is only 4 months’ worth of data for that year and it seems highly inaccurate). I then plot the new data with a distinction beween vintage and modern cars.

cars <- cars[yearOfRegistration < 2016, ]
ggplot(data = subset(cars, !is.na(yearOfRegistration)),
       aes(x = yearOfRegistration, y = log_price)) +
    geom_point(aes(fill = collector_status), 
               shape = 21, alpha = .05, 
               position = 'jitter') +
        scale_fill_brewer(type = 'qual',
                       guide = guide_legend(reverse = TRUE,
                                            override.aes = list(alpha = 1,
                                                                size = 5))) +
    geom_smooth(data = subset(cars,
                              !is.na(yearOfRegistration) &
                              yearOfRegistration < 2016 &
                              collector_status == "modern"),
                aes(colour = collector_status)) +
    geom_smooth(data = subset(cars,
                              !is.na(yearOfRegistration) &
                              yearOfRegistration < 2016 &
                              collector_status == "vintage"),
                aes(colour = collector_status)) +
    scale_x_continuous(breaks = seq(1960, 2015, 5))

## `geom_smooth()` using method = 'gam'
## `geom_smooth()` using method = 'gam'

Now I would like to take a look at the correlation values:

print("Vintage cars:")

## [1] "Vintage cars:"

cars[collector_status == "vintage", cor.test(price, yearOfRegistration)]

## 
##  Pearson's product-moment correlation
## 
## data:  price and yearOfRegistration
## t = -69.163, df = 28266, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3903722 -0.3704315
## sample estimates:
##        cor 
## -0.3804461

print("Modern cars:")

## [1] "Modern cars:"

cars[collector_status == "modern", cor.test(price, yearOfRegistration)]

## 
##  Pearson's product-moment correlation
## 
## data:  price and yearOfRegistration
## t = 415.1, df = 276410, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6173822 0.6219750
## sample estimates:
##       cor 
## 0.6196839

The correlation is much weaker for vintage cars than for modern cars because of the greater variance. This variance comes partly from the lower count, but also probably because vintage car prices can vary enormously according to seemingly irrational factors. They no longer compete with each other so market pressure is different. Moreover their value depends on factors such as: rarity, quality of restoration, use of original parts only, authenticity (faithfulness to the exact specifications of the car when it came out of the factory), historical value, part usage, car’s history etc.

Two cars built exactly the same year, 40 years ago, and selling at the same price back then, can nowadays have orders of magnitude between their current values.

2.1.5. Price vs. mileage

In the previous section, I noticed that the variable kilometer behaves more like a categorical variable than a continuous one. However I want to try plotting both it to see which representation works best.

ggplot(data = subset(cars, 
                     !is.na(kilometer)),
       aes(x = kilometer, y = log_price)) +
    geom_point(colour = 'lightskyblue3', shape = '.', alpha = .25, 
               position = 'jitter') +
    scale_x_continuous(breaks = c(0, as.numeric(levels(cars$km_cat)))) +
    geom_smooth()

## `geom_smooth()` using method = 'gam'

There is an association to log_price that looks roughly linear, except between 5,000 and 20,000km. The clearest feature is that the variance increases with mileage. Let’s compare with a boxplot, this time treating mileage as a categorical variable:

ggplot(data = subset(cars, !is.na(km_cat)),
       aes(x = km_cat, y = log_price)) +
    geom_boxplot(colour = 'deepskyblue4', fill = 'deepskyblue1') +
    geom_point(aes(x = km_cat, y = log_price), stat = 'summary', fun.y = mean,
               shape = 18, 
               size = 4, colour = 'brown')

Brown diamonds represent means

Despite kilometer being grouped in bins, I believe the scatter plot is a marginally better representation because the bins are not regularly spaced, so the box plot gives a distorted view.

The 5000km data looks highly suspicious - we could be observing the same effect as with yearOfRegistration == 2016. Let’s check for that in the next set of analyses:

2.1.6. Year of Registration vs. Mileage

ggplot(data = subset(cars, !is.na(kilometer) & yearOfRegistration < 2016), 
       aes(x = yearOfRegistration, y = kilometer)) +
    geom_point(colour = 'lightskyblue3', position = 'jitter', shape = '.',
               alpha = 0.25) +
    scale_x_continuous(breaks = seq(1960, 2015, 5), minor_breaks = NULL) +
    scale_y_continuous(breaks = seq(0, 150000, 10000)) +
    geom_smooth()

## `geom_smooth()` using method = 'gam'

The smoother seems to indicate that, similarly to what we observed on price, kilometers tend to be positively correlated to the age of the car up to 15-20 years old, then negatively correlated after that. This could also be linked to the fact that vintage cars don’t tend to run as much, as they are rarely the household’s main car and are often of questionable reliability. However considering the dispersion in the early years of the dataset, I am not sure I should lend much credit to this observation. Modern cars, on the other hand, generally have a higher mileage as they get older, as common sense would have it.

We also notice a group of observations at 5,000 kilometers between 1995 and 2005 approximately, that do not seem to follow the general distribution. Since that period is precisely the one with the highest density of observations in the dataset, I suspect these cars actually have a much higher mileage and that the data is wrong. These are probably the same suspicious observations that we noticed just before, for which kilometers are most likely severely under-valued.

2.1.7. Ad Uptime vs. All Other Continuous Variables

From the plots below, it doesn’t look like ad_up_timeis going to be very informative. The matrix plot also reports that there is virtually no correlation to any other continuous variable.

p1 <- ggplot(data = cars, aes(x = ad_up_time, y = log_price)) +
    geom_point(alpha = .15, color = 'lightskyblue3', shape = ".",
               position = 'jitter') +
    geom_smooth()
p2 <- ggplot(data = cars, aes(x = ad_up_time, y = powerPS)) +
    geom_point(alpha = .15, color = 'lightskyblue3', shape = ".",
               position = 'jitter') +
    geom_smooth()
p3 <- ggplot(data = cars, aes(x = ad_up_time, y = yearOfRegistration)) +
    geom_point(alpha = .15, color = 'lightskyblue3', shape = '.',
               position = 'jitter') +
    geom_smooth()
p4 <- ggplot(data = cars, aes(x = ad_up_time, y = kilometer)) +
    geom_point(alpha = .15, color = 'lightskyblue3', shape = ".",
               position = 'jitter') +
    geom_smooth()
grid.arrange(p1, p2, p3, p4, ncol = 2)

## `geom_smooth()` using method = 'gam'
## `geom_smooth()` using method = 'gam'
## `geom_smooth()` using method = 'gam'
## `geom_smooth()` using method = 'gam'

If anything, the very tenuous trends we observe are rather counter-intuitive: cars with a low ad_up_time seem to be generally cheaper and to have higher mileage than the others.

2.2. Discrete variables

In this section, we are going to look at combinations of discrete variables, with a stronger focus on price as this is the variable that I would like to explain.

2.2.1. Price and Power vs. Fuel Type

p1 <- ggplot(data = cars, aes(x = fuelType, y = price)) +
    geom_boxplot(colour = 'deepskyblue4', fill = 'deepskyblue1') +
    scale_y_log10(breaks = c(100, 300, 1000, 3000, 10000, 30000, 
                                  100000, 300000)) +
    geom_point(aes(x = fuelType, y = price), stat = 'summary', fun.y = mean,
               shape = 18, 
               size = 4, colour = 'brown') +
    coord_flip()
p2 <- ggplot(data = cars, aes(x = fuelType, y = powerPS)) +
    geom_boxplot(colour = 'deepskyblue4', fill = 'deepskyblue1') +
    scale_y_continuous() +
    geom_point(aes(x = fuelType, y = powerPS), stat = 'summary', fun.y = mean,
               shape = 18, 
               size = 4, colour = 'brown')  +  
    coord_flip()

grid.arrange(p1, p2, nrow = 2)

Brown diamonds indicate mean

These boxplots show that there are numerous outliers in the values of both powerPS and price. We can clearly see that different values of fuel type have different prices, with hybrid and electric cars being significantly more expensive than other types of vehicles.

Similarly, power outputs vary depending on fuel type. Perhaps surprisingly, Diesel cars tend to be more powerful than petrol, although their outliers don’t extend quite as far. This would be explained by the fact that for cost reasons, all low-end models have petrol engines.

To check whether the difference in mean prices is significant between petrol and Diesel cars (the 2 most popular groups), I run a t-test. My null hypothesis is \(price_{petrol} = price_{Diesel}\) and the alternative hypothesis is \(price_{petrol} < price_{Diesel}\). It’s an independent 2-group test.

petrol_diesel <- cars[fuelType %in% c("petrol", "diesel"), 
                      .(price, powerPS, fuelType)]
petrol_diesel$fuelType <- factor(petrol_diesel$fuelType)
with(petrol_diesel, t.test(price ~ fuelType, alternative = "less"))

## 
##  Welch Two Sample t-test
## 
## data:  price by fuelType
## t = -115.19, df = 189150, p-value < 2.2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf -3712.24
## sample estimates:
## mean in group petrol mean in group diesel 
##             5214.650             8980.668

So \(p-value < .01\). We are very confident that there is a real difference in mean prices between the two groups.

2.2.2. Price vs. Gearbox

gearbox is another variable that relates to the cars’ technical specifications. I would like to check whether it has an association to price.

ggplot(data = cars, aes(x = gearbox, y = price)) +
    geom_boxplot(colour = 'deepskyblue4', fill = 'deepskyblue1') +
    scale_y_log10(breaks = c(100, 300, 1000, 3000, 10000, 30000, 
                                  100000, 300000)) +
    geom_point(aes(x = gearbox, y = price), stat = 'summary', fun.y = mean,
               shape = 18, 
               size = 4, colour = 'brown') +
    coord_flip()

Brown diamonds represent mean

I am not sure how to interpret the fact that where gearbox is NA, prices seem lower. Automatic cars seem much more expensive overall, especially bearing in mind that prices are plotted on a logarithmic scale.

Again, let’s run a t-test to make sure the diffrence is significant. My null hypothesis is \(price_{automatic} = price_{manual}\) and the alternative hypothesis is \(price_{automatic} > price_{manual}\). It’s an independent 2-group test.

with(subset(cars, !is.na(gearbox)), t.test(price ~ gearbox, alternative = "greater"))

## 
##  Welch Two Sample t-test
## 
## data:  price by gearbox
## t = 129.99, df = 81053, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  5945.265      Inf
## sample estimates:
## mean in group automatic    mean in group manual 
##               11005.039                4983.581

\(p-value < .01\) and we are 95% confident that there is at least a €5,945 difference in the means. The on-cost of an auto transmission is not enough to explain such a difference, but the population profile would be. Auto transmissions are typically found in larger, more powerful cars, whereas almost all small cars have a manual gearbox.

2.2.3. Price vs. Gearbox vs. Fuel Type

Combining these three variables, we obtain the following boxplot:

ggplot(data = subset(cars, 
                     fuelType %in% c("petrol", "diesel") & !is.na(gearbox)),
       aes(x = fuelType, y = price)) +
    geom_boxplot(colour = 'deepskyblue4', fill = 'deepskyblue1') +
    geom_point(aes(x = fuelType, y = price), stat = 'summary', fun.y = mean,
               shape = 18, 
               size = 4, colour = 'brown') +
    scale_y_continuous(limits = c(0, 50000)) +
    facet_wrap(~ gearbox)

## Warning: Removed 1124 rows containing non-finite values (stat_boxplot).

## Warning: Removed 1124 rows containing non-finite values (stat_summary).

Brown diamonds represent means

This boxplot confirms that prices are generally higher for auto transmissions than for manuals, and also higher for Diesel cars than petrol. However there is more dispersion for auto and Diesel than for manual and petrol, probably due to their lower count in the data. We also notice that there are many outliers. Here I used a linear scale for price but cut off at €50,000, and there are many more outliers above the cut-off point that cannot be seen on the plotting area.

2.2.4. Price vs. Vehicle Type

p1 <- ggplot(data = subset(cars, !is.na(vehicleType)),
             aes(x = price)) +
    geom_histogram(fill = 'deepskyblue4', bins = 100) +
    scale_x_log10(breaks = c(1.e2, 1.e3, 1.e4, 1.e5)) +
    facet_wrap(~ vehicleType, ncol = 1)
        
p2 <- ggplot(data = subset(cars, !is.na(vehicleType)),
             aes(y = price, x = vehicleType)) +
    geom_boxplot(colour = 'deepskyblue4', fill = 'deepskyblue1') +
    scale_y_log10(breaks = c(1.e2, 1.e3, 1.e4, 1.e5)) +
    geom_point(aes(x = vehicleType, y = price), stat = 'summary', fun.y = mean,
               shape = 18, 
               size = 4, colour = 'brown') +
    scale_x_discrete(limits = rev(levels(cars$vehicleType))) +
    coord_flip()

p3 <- ggplot(data = cars, 
             aes(x = price)) +
    geom_freqpoly(aes(colour = vehicleType)) +
    scale_x_log10(breaks = c(1.e2, 1.e3, 1.e4, 1.e5))

lay <- rbind(c(1, 2), c(3, 3))
grid.arrange(grobs = list(p1, p2, p3), layout_matrix = lay, heights = c(3, 2))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Brown diamonds represent means

As expected, different vehicle types have different price distributions, most of them approaching normal when viewed on a logarithmic scale. The highest mean and median prices are found with SUVs, followed by convertibles, coupés and people carriers (minivans in the US). The variances of these distributions are quite large, coupés in particular. However this looks like a good contributor in explaining price differences.

2.2.5. Price vs. Brand category

Since I took the time to manually classify brands by perceived “premiumness”, let’s have a look at potential associations with price:

p1 <- ggplot(data = cars,
             aes(x = price)) +
    geom_histogram(fill = 'deepskyblue4', bins = 100) +
    scale_x_log10(breaks = c(1.e2, 1.e3, 1.e4, 1.e5)) +
    facet_wrap(~ brand_cat, ncol = 1)
        
p2 <- ggplot(data = cars,
             aes(y = price, x = brand_cat)) +
    geom_boxplot(colour = 'deepskyblue4', fill = 'deepskyblue1') +
    scale_y_log10(breaks = c(1.e2, 1.e3, 1.e4, 1.e5)) +
    geom_point(aes(x = brand_cat, y = price), stat = 'summary', fun.y = mean,
               shape = 18, 
               size = 4, colour = 'brown') +
    scale_x_discrete(limits = rev(levels(cars$brand_cat))) +
    coord_flip()

p3 <- ggplot(data = cars, 
             aes(x = price)) +
    geom_freqpoly(aes(colour = brand_cat)) +
    scale_x_log10(breaks = c(1.e2, 1.e3, 1.e4, 1.e5))

lay <- rbind(c(1, 2), c(3, 3))
grid.arrange(grobs = list(p1, p2, p3), layout_matrix = lay, heights = c(3, 2))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Brown diamonds represent means

In general, we observe a fairly logical pattern with higher prices for more premium brands. The only surpise is that the “budget” and “budget_plus” categories seem almost as expensive overall as the “premium_minus” brands. I am not sure why, maybe this has to do with the fact that some of the brands that make up these two categories are fairly recent and therefore have a younger population?

ggplot(data = cars, aes(x = brand_cat, y = yearOfRegistration)) +
    geom_boxplot(colour = 'deepskyblue4', fill = 'deepskyblue1') +
    geom_point(aes(x = brand_cat, y = yearOfRegistration), stat = 'summary', fun.y = mean,
               shape = 18, 
               size = 4, colour = 'brown') +
    coord_trans(limy = c(1980, 2016))

Brown diamonds represent means

print("Median year of registration by brand category:")

## [1] "Median year of registration by brand category:"

cars[, str(yearOfRegistration)]

##  int [1:304677] 2011 2004 2001 2008 1995 2004 2014 1998 2004 2005 ...

## NULL

cars[, .(Median_year = as.numeric(median(yearOfRegistration))), by = brand_cat]

##        brand_cat Median_year
## 1:       premium        2004
## 2: premium_minus        2004
## 3:      mid_plus        2003
## 4:   budget_plus        2007
## 5:     mid_range        2003
## 6:     mid_minus        2003
## 7:        budget        2008
## 8:         other        1999

It seems that these two categories are indeed younger than the rest (with a 4-5 year difference in the medians) which would explain at least part of the observation we made previously.

2.2.6. Price vs. Unrepaired Damage

p1 <- ggplot(data = subset(cars, !is.na(notRepairedDamage)),
             aes(x = price)) +
    geom_histogram(fill = 'deepskyblue4', bins = 100) +
    scale_x_log10(breaks = c(1.e2, 1.e3, 1.e4, 1.e5)) +
    facet_wrap(~ notRepairedDamage, ncol = 1)
        
p2 <- ggplot(data = cars,
             aes(y = price, x = notRepairedDamage)) +
    geom_boxplot(colour = 'deepskyblue4', fill = 'deepskyblue1') +
    scale_y_log10(breaks = c(1.e2, 1.e3, 1.e4, 1.e5)) +
    geom_point(aes(x = notRepairedDamage, y = price), stat = 'summary', fun.y = mean,
               shape = 18, 
               size = 4, colour = 'brown') +
    scale_x_discrete(limits = rev(levels(cars$notRepairedDamage))) +
    coord_flip()

grid.arrange(p1, p2, ncol = 2)

## Warning: Removed 40366 rows containing non-finite values (stat_boxplot).

## Warning: Removed 40366 rows containing non-finite values (stat_summary).

So cars with unrepaired damage are much cheaper on average than cars in good condition. The log10 scale is slightly deceiving here, but in reality there is a 1-to-3 to 1-to-4 difference. We can confirm that the difference is significant by runing an independent 2-sample t-test. My \(H_0\) is \(price_{unrepaired \, damage} = price_{no \, unrepaired \, damage}\). My \(H_a\) is \(price_{unrepaired \, damage} < price_{no \, unrepaired \, damage}\).

with(subset(cars, !is.na(notRepairedDamage)), t.test(price ~ notRepairedDamage,
                                                     alternative = 'less') )

## 
##  Welch Two Sample t-test
## 
## data:  price by notRepairedDamage
## t = -166.11, df = 72067, p-value < 2.2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -4797.876
## sample estimates:
## mean in group yes  mean in group no 
##          2395.173          7241.033

So again, \(p < .01\). We are 95% confident that there is a €4,800 difference on average between cars with unrepaired damage, and cars without.

However we should note that the notion of unrepaired damage is somewhat vague. No one expect 10- or 15-year-old cars to be in immaculate condition: They will always have some scratches and bumps. However these are unlikely to drop the car’s price by a factor 3 so I conclude that to most sellers, unrepaired damage means serious damage, potentially preventing the vehicule from operating normally. Is this definition presented in the eBay guidelines? Or is it just an implicit understanding from sellers? I tried to navigate the eBay website for this information but was unable to find it.

3. Linear Regression

Based on the knowledge that we gained on the previous phases of the project, I would like to attempt a linear regression to predict prices based on the most useful variables. To clean up the data further, I decided to remove all observations that are in the 5,000km group and were registered between 1990 and 2010. I decided to select the following variables:

Outcome:
- log_price
Continuous Predictors:
- log of powerPS in interaction with fuelType, following our observations on the shape of the correlation
- powerPS in interaction with fuelType
- yearOfRegistration in interaction with collector_status
- kilometer in interaction with collector_status
Discrete Predictors:
- vehicleType
- brand_cat
- notRepairedDamage in interaction with everything else. The reason for this is that we saw that this particular variable is associated with a 3- or 4-fold price difference, therefore for all intents and purposes it defines two separate markets (much like collector_status)

require(dplyr)
cars_no_nas <- na.omit(cars[!(yearOfRegistration %in% seq(1990, 2010) 
                              & kilometer == 5000),
                            ])

fit <- lm(log_price ~  (powerPS : fuelType +
              I(log(powerPS)) : fuelType +
              gearbox +      
              yearOfRegistration:collector_status + 
              kilometer:collector_status + 
              vehicleType + 
              brand_cat) : notRepairedDamage,
   data = cars_no_nas)

cbind(Predictor = summary(fit)$dimnames, p_value = summary(fit)$coefficients[, 4])

##                                                                       p_value
## (Intercept)                                                      0.000000e+00
## gearboxautomatic:notRepairedDamageyes                           3.125360e-137
## gearboxmanual:notRepairedDamageyes                              3.958991e-137
## gearboxautomatic:notRepairedDamageno                             1.390427e-08
## vehicleTypepeople carrier:notRepairedDamageyes                   1.317575e-05
## vehicleTypeconvertible:notRepairedDamageyes                      8.630971e-28
## vehicleTypecoupe:notRepairedDamageyes                            4.465875e-05
## vehicleTypesmall car:notRepairedDamageyes                        1.693015e-03
## vehicleTypeestate:notRepairedDamageyes                           1.636337e-08
## vehicleTypesedan:notRepairedDamageyes                            3.471787e-01
## vehicleTypeSUV:notRepairedDamageyes                              8.027336e-23
## vehicleTypepeople carrier:notRepairedDamageno                    2.420756e-17
## vehicleTypeconvertible:notRepairedDamageno                       9.568651e-94
## vehicleTypecoupe:notRepairedDamageno                             2.372919e-13
## vehicleTypesmall car:notRepairedDamageno                         5.460557e-03
## vehicleTypeestate:notRepairedDamageno                            6.389341e-19
## vehicleTypesedan:notRepairedDamageno                             2.492941e-08
## vehicleTypeSUV:notRepairedDamageno                               3.401488e-39
## brand_cat.L:notRepairedDamageyes                                 1.461562e-04
## brand_cat.Q:notRepairedDamageyes                                 4.137184e-02
## brand_cat.C:notRepairedDamageyes                                 8.535466e-07
## brand_cat^4:notRepairedDamageyes                                 7.136172e-07
## brand_cat^5:notRepairedDamageyes                                 4.177829e-01
## brand_cat^6:notRepairedDamageyes                                 1.361594e-39
## brand_cat^7:notRepairedDamageyes                                 4.476149e-89
## brand_cat.L:notRepairedDamageno                                  6.020996e-31
## brand_cat.Q:notRepairedDamageno                                  2.603099e-08
## brand_cat.C:notRepairedDamageno                                  6.510218e-34
## brand_cat^4:notRepairedDamageno                                  2.422874e-59
## brand_cat^5:notRepairedDamageno                                  1.179173e-26
## brand_cat^6:notRepairedDamageno                                 5.859188e-102
## brand_cat^7:notRepairedDamageno                                 3.680811e-294
## powerPS:fuelTypeother:notRepairedDamageyes                       4.573723e-01
## powerPS:fuelTypepetrol:notRepairedDamageyes                      1.836444e-48
## powerPS:fuelTypecng:notRepairedDamageyes                         3.874873e-01
## powerPS:fuelTypediesel:notRepairedDamageyes                      9.220426e-22
## powerPS:fuelTypeelectric:notRepairedDamageyes                    8.526414e-02
## powerPS:fuelTypehybrid:notRepairedDamageyes                      7.424538e-01
## powerPS:fuelTypelpg:notRepairedDamageyes                         2.523583e-10
## powerPS:fuelTypeother:notRepairedDamageno                        5.999347e-01
## powerPS:fuelTypepetrol:notRepairedDamageno                       0.000000e+00
## powerPS:fuelTypecng:notRepairedDamageno                          2.252517e-02
## powerPS:fuelTypediesel:notRepairedDamageno                       3.827960e-90
## powerPS:fuelTypeelectric:notRepairedDamageno                     1.463680e-08
## powerPS:fuelTypehybrid:notRepairedDamageno                       2.820426e-01
## powerPS:fuelTypelpg:notRepairedDamageno                          1.007769e-15
## fuelTypeother:I(log(powerPS)):notRepairedDamageyes               7.983801e-06
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageyes              3.530306e-23
## fuelTypecng:I(log(powerPS)):notRepairedDamageyes                 9.419925e-07
## fuelTypediesel:I(log(powerPS)):notRepairedDamageyes              5.356408e-35
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageyes            9.731434e-02
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageyes              9.100139e-08
## fuelTypelpg:I(log(powerPS)):notRepairedDamageyes                 5.178994e-22
## fuelTypeother:I(log(powerPS)):notRepairedDamageno                3.998841e-27
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageno               0.000000e+00
## fuelTypecng:I(log(powerPS)):notRepairedDamageno                  4.746770e-75
## fuelTypediesel:I(log(powerPS)):notRepairedDamageno               0.000000e+00
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageno             1.250921e-64
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageno              5.464982e-260
## fuelTypelpg:I(log(powerPS)):notRepairedDamageno                  0.000000e+00
## yearOfRegistration:collector_statusmodern:notRepairedDamageyes   0.000000e+00
## yearOfRegistration:collector_statusvintage:notRepairedDamageyes  0.000000e+00
## yearOfRegistration:collector_statusmodern:notRepairedDamageno    0.000000e+00
## yearOfRegistration:collector_statusvintage:notRepairedDamageno   0.000000e+00
## collector_statusmodern:kilometer:notRepairedDamageyes           1.426397e-318
## collector_statusvintage:kilometer:notRepairedDamageyes           0.000000e+00
## collector_statusmodern:kilometer:notRepairedDamageno             0.000000e+00
## collector_statusvintage:kilometer:notRepairedDamageno            0.000000e+00

print("")

## [1] ""

print(paste("R-squared:", summary(fit)$r.squared))

## [1] "R-squared: 0.764911484617119"

So this regression achieves \(R^2=0.765\), which seems decent. There are many unsignificant parameters though, most of them for low-count fuel types such as electric or natural gas. To view these results more explicitly, let’s predict price on a few observations selected at random:

test_examples <- cars_no_nas[sample(nrow(cars_no_nas), 1000),]
pred_price <- 10^predict(fit, newdata = test_examples)

## Warning in predict.lm(fit, newdata = test_examples): prediction from a
## rank-deficient fit may be misleading

test_examples <- cbind(pred_price, test_examples)
print(head(test_examples, 20))

##     pred_price price    vehicleType yearOfRegistration   gearbox powerPS
##  1:   686.2174   300      small car               1996    manual      60
##  2:  1176.5914   990          sedan               1999    manual      75
##  3:  2558.1184  3800          sedan               2010    manual      77
##  4: 10469.3059 13500         estate               2014    manual      90
##  5:  3986.4063  2250 people carrier               2006    manual     106
##  6:  5350.2319  7999 people carrier               2009    manual     109
##  7:  1994.4817  1750         estate               1999    manual     125
##  8:  2132.5466  1650 people carrier               2002    manual     107
##  9:  1734.0371  2290      small car               2008    manual      54
## 10:  2453.4324  4990          sedan               2004    manual      75
## 11:  3479.7591  1650          sedan               2004    manual     145
## 12:  5625.7161  5100         estate               2003 automatic     230
## 13:  1855.1173  4200      small car               2001    manual      75
## 14:  3551.4687  4300          sedan               2003    manual     116
## 15:  4102.1051  4449          sedan               2006    manual     140
## 16: 21926.5980 21500          sedan               2012 automatic     170
## 17:  6084.3462  4999          sedan               2002    manual     143
## 18: 12872.6935 14200          sedan               2013    manual     140
## 19:  1609.4185  1700      small car               1998    manual      86
## 20:  4495.4456  3100         estate               2007    manual     150
##        model kilometer monthOfRegistration fuelType         brand
##  1:    punto    150000                  10   petrol          fiat
##  2:    astra    125000                   3   petrol          opel
##  3:    punto    150000                   4   petrol          fiat
##  4:    auris     40000                   9   diesel        toyota
##  5:   scenic    150000                   6   diesel       renault
##  6: berlingo    150000                  12   diesel       citroen
##  7:  impreza    150000                   6   petrol        subaru
##  8:   scenic    150000                   4   petrol       renault
##  9:    panda    150000                   9   petrol          fiat
## 10:     golf    125000                   7   petrol    volkswagen
## 11:      156    150000                   5   petrol    alfa_romeo
## 12:      5er    150000                  10      lpg           bmw
## 13:       a2    150000                   3   petrol          audi
## 14:      3er    125000                  11   petrol           bmw
## 15:     leon    150000                   3   diesel          seat
## 16:     golf     50000                   4   diesel    volkswagen
## 17: c_klasse     60000                   9   petrol mercedes_benz
## 18:  octavia     40000                   3   petrol         skoda
## 19: a_klasse    150000                   7   petrol mercedes_benz
## 20:    astra    150000                  11   diesel          opel
##     notRepairedDamage  ad_up_time km_cat     brand_cat log_price
##  1:                no   19.359028 150000     mid_minus  2.477121
##  2:                no   19.566667 125000     mid_minus  2.995635
##  3:                no  501.360417 150000     mid_minus  3.579784
##  4:                no 1228.997917  40000     mid_range  4.130334
##  5:                no  536.299306 150000     mid_range  3.352183
##  6:                no 1334.985417 150000     mid_minus  3.903036
##  7:                no  443.735417 150000      mid_plus  3.243038
##  8:                no  670.938889 150000     mid_range  3.217484
##  9:                no  521.142361 150000     mid_minus  3.359835
## 10:                no  120.951389 125000      mid_plus  3.698101
## 11:                no  220.303472 150000 premium_minus  3.217484
## 12:                no  117.760417 150000       premium  3.707570
## 13:                no  361.363194 150000       premium  3.623249
## 14:                no  878.544444 125000       premium  3.633468
## 15:                no 1771.467361 150000     mid_minus  3.648262
## 16:                no  231.984028  50000      mid_plus  4.332438
## 17:                no  661.657639  60000       premium  3.698883
## 18:                no  444.634722  40000   budget_plus  4.152288
## 19:                no   14.452083 150000       premium  3.230449
## 20:                no    2.707639 150000     mid_minus  3.491362
##     collector_status
##  1:           modern
##  2:           modern
##  3:           modern
##  4:           modern
##  5:           modern
##  6:           modern
##  7:           modern
##  8:           modern
##  9:           modern
## 10:           modern
## 11:           modern
## 12:           modern
## 13:           modern
## 14:           modern
## 15:           modern
## 16:           modern
## 17:           modern
## 18:           modern
## 19:           modern
## 20:           modern

We see that some predictions are really close whereas others are widely off the mark. Let’s make this more visual:

ggplot(data = test_examples,
       aes(x = price, y = pred_price)) +
    geom_point(colour = 'deepskyblue4') +
    geom_abline(slope = 1., intercept = 0., colour = 'red', size = 1)

The red line represents identity

The higher the price, the more the predictions seem to get wrong, which is intuitively logical seeing that we predicted log_price and not price directly. In log10 coordinates, this phenomenon disappears but the opposite appears: Errors seem larger for low prices:

ggplot(data = test_examples,
       aes(x = price, y = pred_price)) +
    geom_point(colour = 'deepskyblue4') +
    geom_abline(slope = 1., intercept = 0., colour = 'red', size = 1) +
    scale_x_log10() + scale_y_log10()

The red line represents identity

We seem to generally underestimate prices a little. Let’s have a look at the residuals:

par(mfrow = c(2, 2))
plot(fit)

## Warning in sqrt(crit * p * (1 - hh)/hh): production de NaN

## Warning in sqrt(crit * p * (1 - hh)/hh): production de NaN

par(mfrow = c(1, 1))

The residuals vs. fitted plot shows that there is still some pattern left in the data, especially toward the high-end of price predictions where we tend to under-estimate the outcome variable. The normal Q-Q curve shows a lot of departure from the ideal which is characteristic of heavy-tailed data such as we have here. The Scale-Location plot shows that the data is not completely homoscedastic. Finally, there does not seem to be any overly influencial observations according to the Residuals vs. Leverage plot, but we did a lot of cleaning up beforehand to get rid of many outliers. Looking at the three observations for which we are provided indices in the plot, we see that they are all alternative fuel vehicles:

cars_no_nas[c(145504, 40697, 166256)]

##    price    vehicleType yearOfRegistration   gearbox powerPS       model
## 1: 49700 people carrier               2015 automatic     179 transporter
## 2: 24850          sedan               2012 automatic     306       other
## 3:  3500          sedan               2002    manual      75        golf
##    kilometer monthOfRegistration fuelType      brand notRepairedDamage
## 1:     20000                   3   diesel volkswagen                no
## 2:     60000                   8   hybrid        bmw               yes
## 3:    100000                   1   petrol volkswagen                no
##    ad_up_time km_cat brand_cat log_price collector_status
## 1: 1126.81389  20000  mid_plus  4.696356           modern
## 2:   94.83611  60000   premium  4.395326           modern
## 3:  677.73333 100000  mid_plus  3.544068           modern

Reflection

Using a dataset made available on Kaggle and using data from the German eBay used car ads, I started by doing significant amounts of data cleaning, translations and conversions. I was then able to analyse each variable in turn and noticed that the price variable in particular had a very negatively skewed distribution and that a log transform was required. This initial phase of the analysis also allowed me to discover more issues with the data. Year of registration, price and mileage all had non-sensical entries, most probably due to human errors and maybe cheekiness. Most variables have large amounts of missing values. The brand and model variables tell us a lot about the German market, one of the most high-end market in Europe. Finally, I established that the ad up time was probably not usable as a predictor.

I was then able to look at associations between variables. I established that among the continuous variables, power had highest correlation to price, especially in log form, but with different profiles depending on fuel and transmission types. Looking at mileage was interesting because it exhibited two distinct correlations to price, one for “modern” cars and one for “vintage”. Mileage, another strong contributor, is unfortunately not really a continuous variable in this dataset, as it can only take a few distinct values.

Most categorical variables are also useful to the model. Fuel, gearbox and vehicle type all have associations with price. Unrepaired damages are associated to such a price drop that they litterally create a parallel market, which we must make sure that our linear model is able to capture as distinct from the bulk of the transactions.

Finally, the brand category variable that I manually created turned out to be quite useful as well as it exhibits a significant association to price.

Using the knowledge I had gained from this exploratory analysis, I built a linear model that explains 76.5% of the variance in the data. While this is not a very high level of accuracy, it is a very crude model that cannot account for all associations between variables. For instance, the price of vintage cars can vary differently depending on the brand of the vehicle. Since brand contains over 40 levels, I deliberately aggregated it into categories that try to reflect current market perception, not that of 25 or 30 years ago, so it would not necessarily be adequate as a predictor for vintage car prices. I also left model out of my set of predictors because it has over 200 levels, which is too complex for what I was trying to achieve here. Other potential predictors such as post code and free text description were also left out; while they would require a lot more work to provide sensible information, they could add significantly to the model.

Among other limitations, there is the high level of human error that this dataset seems to contain. To understand the source of some of those errors or to identify some new ones, it would be useful to be able to access the German eBay website as it was when data collection was performed. Similarly, I would like to better understand the criteria and process that were applied when scraping the website.

In terms of further development of this analysis, it would be interesting to compare this dataset to used car data from other sources to see if the model developped here would give us the same level of accuracy, or if eBay market prices are specific. Additionally, as mentioned, the name field probably contains useful information although it would be very tedious to tidy up.

As a closing observation, I would point out that we often hear that domain-knowledge is critical to a good data analyst. I don’t know if it is a general rule, but in the present project, I found that an in-depth knowledge of the automotive market definitely helped me make more meaningful analyses, develop relevant theories and engineer useful variables.

Used Cars Dataset - Exploratory Analysis

Luc Frachon

19 janvier 2017

Introduction

0. Data Preparation

1. Description of Individual Variables

1.1. Price

1.2. Vehicle Type

1.3. Year of Registration

1.4. Month of Registration

1.5. Transmission Type

1.6. Engine Power

1.7. Model

1.8. Mileage

1.9. Fuel Type

1.10. Brand

1.11. Unrepaired Damage

1.12. Ad Up-time

2. Multivariate Analyses

2.1. Continuous variables

2.1.1. Price vs. Power

2.1.2. Price vs. Power vs. Fuel Type

2.1.3. Price vs. Power vs. Fuel Type vs. Gearbox

2.1.4. Price vs. Year of Registration

2.1.5. Price vs. mileage

2.1.6. Year of Registration vs. Mileage

2.1.7. Ad Uptime vs. All Other Continuous Variables

2.2. Discrete variables

2.2.1. Price and Power vs. Fuel Type

2.2.2. Price vs. Gearbox

2.2.3. Price vs. Gearbox vs. Fuel Type

2.2.4. Price vs. Vehicle Type

2.2.5. Price vs. Brand category

2.2.6. Price vs. Unrepaired Damage

3. Linear Regression

Reflection