Kahlen Cheung

14/06/2022

Business Context: Used Car Industry

Business Intelligence and Data-driven Decision Making

The Review Report on UK Used Car Sales Performance (2020) (the report) aims to review and analysis the sales performance of the UK used car in 2020, which mainly disclose the used car price for the second-hand car market, the most common brand, the most common models and the average selling price of different cars in different conditions.

Domain Knowledge and the Business Context

This analysis would provide essential insights for the second hand car dealership industry. It will be useful for existing business and business who want to explore the second hand car market, as well as the individual traders.

Business Insights

  • When is the ideal car age to sell used cars?

  • Top 10 most common listed models.

  • Price of used car compared by brands.

  • Car specs compared by brands.

  • Parameters of Price.

Data Structure

Internal and external data sources

The report is conducted based on the open source dataset 100,000 UK Used Car Data set from kaggle, produced by ADITYA.

This not an official dataset, which data is generated from a used car trading website, without any first-hand internal data provided by clients. Nevertheless, it is still capable for reviewing some of the market trends and provide useful insights.

Types of data

The dataset is a mix of categorical and numerical data. Every data sources contains 9 variables, with types of categorical, discrete and ordinal numerical:

character type: model, transmission, fuel type ordinal numeric type: year discrete numeric type: mileage, tax, mpg, engine size

Data formats

There are a total of 13 data sources in this dataset, which are in csv format, with approximate 100,000 rows.

The dataset includes the listed price of used cars by the brands of: Audi, BMW, Ford, Hyundi, Mercedez, Skoda, Toyota, Vauxhall and Volkswagen, while both the models cclass and focus are separated as the individual data sources. Besides, the car specs are stated as well.

Data quality and Bias

This is not an official dataset that is only scrapped from a random used car trading website. The size of the dataset is up to 99,187 entries, and it might not fully represents the actual market trend.

The dataset does not provide a consecutive year of record, therefore future trend cannot be predicted. Besides, it is lack of car information such as body type, as well as the variety of brands is also limited (eg, Mini, Honda, Kia). Moreover, no customer buying record is provided, which is impossible to predict targeted customers for selling.

Ethics

Ethical issues in data sourcing and extraction

On behalf of the dataset provided by Kaggle, it is anonymous that none of the private data such as names, personal ID, or address is disclosed, which no evidence showing the ethical concerns.

Ethical implications of business requirements

An official access of car sales history helps to improve the analysis accuracy.

Since this is a second-hand data source, the result may not fully reflect the market environment. For a more accurate data analysis, it is suggested that the future dataset should provide the consecutive year of record, car details such as body type, as well as the variety of brands, in order to predict future trend. A basic customer buying record would also be preferred.

Analysis Preparation

Stages in the data analysis process

  1. Data Cleaning and Wrangling

Clean up the column names, remove all the NA values, and merge 9 different csv files as a mass dataset.

  1. Explore Data Analysis

Have a brief look into the dataset, to know the types of variables, and amend the types for the following analysis.

  1. Statistical Analysis

Use ggplot to sum of the basic insights of the dataset.

  1. Machine Learning

Use a manual linear regression model to predict the used car price.

Tools for data analysis

  • Data cleaning

library(tidyverse) library(janitor) library(ggplot2)

  • Machine learning

library(modelr) library(broom) library(mosaic) library(caret) library(relaimpo) library(cowplot)

Basic Stats

#For data cleaning 
library(tidyverse)
library(janitor)
library(ggplot2)

#For machine learning
library(modelr)
library(broom)
library(mosaic)
library(caret)
library(relaimpo)
library(cowplot)
car_mass <- read_csv(here::here("clean_data/car_mass.csv"))
glimpse(car_mass)
## Rows: 99,187
## Columns: 12
## $ brand               <chr> "audi", "audi", "audi", "audi", "audi", "audi", "a…
## $ manufacturer_origin <chr> "Germany", "Germany", "Germany", "Germany", "Germa…
## $ model               <chr> "A1", "A6", "A1", "A4", "A3", "A1", "A6", "A4", "A…
## $ car_age             <dbl> 3, 4, 4, 3, 1, 4, 4, 4, 5, 4, 3, 4, 4, 3, 5, 3, 4,…
## $ year                <dbl> 2017, 2016, 2016, 2017, 2019, 2016, 2016, 2016, 20…
## $ price               <dbl> 12500, 16500, 11000, 16800, 17300, 13900, 13250, 1…
## $ transmission        <chr> "Manual", "Automatic", "Manual", "Automatic", "Man…
## $ mileage             <dbl> 15735, 36203, 29946, 25952, 1998, 32260, 76788, 75…
## $ fuel_type           <chr> "Petrol", "Diesel", "Petrol", "Diesel", "Petrol", …
## $ tax                 <dbl> 150, 20, 30, 145, 145, 30, 30, 20, 20, 30, 145, 12…
## $ mpg                 <dbl> 55.4, 64.2, 55.4, 67.3, 49.6, 58.9, 61.4, 70.6, 60…
## $ engine_size         <dbl> 1.4, 2.0, 1.4, 2.0, 1.0, 1.4, 2.0, 2.0, 1.4, 1.4, …
  1. Mercedes has the highest total listed price(~30,000 million), followed by Volkswagen, Audi and BMW; Ford has the most listed car items(~17,500), followed by Volkswagen, Vauxhall and Mercedes. Besides, Ford is the most common brand, while Mercedes has the highest sales price range.
sum_listed_price <- car_mass %>% 
  group_by(brand) %>% 
  summarise(total_sales = sum(price)) %>% 
  ggplot(aes(x = brand, y = (total_sales/10000)))+
  geom_col()+
  ylab("Listed Price(M)") + xlab("Brand")+
  ggtitle("Total Listed Price by Brand")
  car_mass %>% 
  group_by(brand) %>% 
  count(brand) %>% 
  ggplot(aes(x = brand, y = n))+
  geom_col()+
  xlab("Brand") + ylab("Count of Cars")+
  ggtitle("Total Number of Car Listed by Brand")

  1. The top three most common listed models are: Fiesta (Ford), Focus (Ford), Golf (Volkswagen), where fits to the finding above, showing that Ford and Volkswagen are the most common brands.
car_mass %>%
  dplyr::group_by(model) %>%
  count(model) %>%
  arrange(desc(n)) %>%
  head(10) %>%
  ggplot(aes(x = model, y = n))+
  geom_col()+
  ylab("Number of Count")+ xlab("Model")+
  ggtitle("The Most Common Listed Model")

# Fiesta, Golf, Focus, C Class, Corsa, Polo, Astra, A Class, 3 Series, Kuga
  1. Most lised cars are equipped with petrol gear engine of diesel gear engine. Toyota is comparatively eco-friendly to other brands, which has more choices for hybrid gear engine model.
 car_mass %>% 
  group_by(brand, fuel_type) %>% 
  count(brand) %>% 
  ggplot(aes(x = brand, y = n))+
  geom_col(aes(x = brand, y = n, fill = fuel_type))+
  xlab("Brand") + ylab("Count of Cars")+
  ggtitle("Fuel Type Difference by Brand")

  1. The most common registered year are 2019, 2017 and 2016, meaning that people usually sell the used car at the car age of one, three and four. The finding may be resulted from the 5 years Lease Deal offered by car dealerships, which customers prefer to sell the used car when it comes to the end of the deal.
car_mass %>% 
  filter(year > 2000) %>% 
  ggplot(aes(x = year))+
  geom_bar()+
  ylab("Number of Listed Items") + xlab("Year Registered")+
  ggtitle("Common Registerd Year")

  1. The graph indicates that most of the used car have a mpg value under 100, price range is commonly set to 50,000 or below.
ggplot(car_mass, aes(price, mpg, colour = brand))+
  geom_point()+
  geom_smooth(method = "lm")+
  geom_vline(xintercept = mean(car_mass$price),  colour = "red")+
  xlab("Price")+ylab("MPG")+
  ggtitle("A Comparasion of Price and MPG")
## `geom_smooth()` using formula 'y ~ x'

Prescriptive Analysis

A manual Linear Regression model is performed to predict the used car price, by examining the residuals of dummy models, relative importance and measures of goodness of fit.

The prescriptive model receives an R-square value (accuracy rate) of 0.779, by comparing price to the following variables:

engine size

brand

car age

fuel type

mileage

engine size and mileage (interactive variable)

Data Types Amendment

The amendment for data types is used for the following machine learning process, where all the character type of variables will be changed to factor type.

car_df <- read_csv(here::here("clean_data/car_mass.csv")) %>% 
  mutate(across(where(is.character), ~ as.factor(.x))) %>% 
  dplyr::select(-model)

Conduct Model

Correlation Coefficent of variables

By using the ggcorr it is easy to have a brief look on the correlations coefficient of different variables.

library(GGally)

ggcorr(car_mass, label = TRUE)

Exploring Model

  • Testing on price ~ engine_size + brand + car_age + fuel_type + mileage or mpg

  • model_5a price ~ engine_size + brand + car_age + fuel_type + mileage has a higher multiple r-squared value of 0.7536, so a residual will be added to this model.

model_5a <- lm(price ~ engine_size + brand + car_age + fuel_type + mileage, data = car_df)

model_5b <- lm(price ~ engine_size + brand + car_age + fuel_type + mpg, data = car_df)
summary(model_5a)
## 
## Call:
## lm(formula = price ~ engine_size + brand + car_age + fuel_type + 
##     mileage, data = car_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -69863  -2535   -444   1783 107248 
## 
## Coefficients:
##                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)        8.349e+03  9.579e+01   87.157  < 2e-16 ***
## engine_size        1.072e+04  3.672e+01  291.845  < 2e-16 ***
## brandbmw          -2.438e+03  6.742e+01  -36.168  < 2e-16 ***
## brandford         -4.560e+03  6.303e+01  -72.345  < 2e-16 ***
## brandhyundi       -5.964e+03  8.666e+01  -68.819  < 2e-16 ***
## brandmercedes     -9.892e+00  6.432e+01   -0.154    0.878    
## brandskoda        -4.751e+03  7.972e+01  -59.600  < 2e-16 ***
## brandtoyota       -6.729e+03  8.439e+01  -79.741  < 2e-16 ***
## brandvauxhall     -7.477e+03  6.576e+01 -113.688  < 2e-16 ***
## brandvolkswagen   -3.257e+03  6.294e+01  -51.759  < 2e-16 ***
## car_age           -1.535e+03  1.103e+01 -139.153  < 2e-16 ***
## fuel_typeElectric  9.459e+03  2.000e+03    4.729 2.26e-06 ***
## fuel_typeHybrid    4.383e+03  1.028e+02   42.628  < 2e-16 ***
## fuel_typeOther     3.727e+03  3.146e+02   11.849  < 2e-16 ***
## fuel_typePetrol    1.825e+03  3.894e+01   46.857  < 2e-16 ***
## mileage           -1.036e-01  1.133e-03  -91.415  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4898 on 99171 degrees of freedom
## Multiple R-squared:  0.7536, Adjusted R-squared:  0.7536 
## F-statistic: 2.022e+04 on 15 and 99171 DF,  p-value: < 2.2e-16
summary(model_5b)
## 
## Call:
## lm(formula = price ~ engine_size + brand + car_age + fuel_type + 
##     mpg, data = car_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -103357   -2589    -401    1800  124375 
## 
## Coefficients:
##                    Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)       13562.334    151.709   89.397   <2e-16 ***
## engine_size        9601.332     43.760  219.407   <2e-16 ***
## brandbmw          -1937.922     69.739  -27.788   <2e-16 ***
## brandford         -4351.031     64.696  -67.253   <2e-16 ***
## brandhyundi       -6075.201     89.208  -68.101   <2e-16 ***
## brandmercedes       553.656     66.163    8.368   <2e-16 ***
## brandskoda        -4645.887     81.876  -56.743   <2e-16 ***
## brandtoyota       -6658.343     86.711  -76.788   <2e-16 ***
## brandvauxhall     -7689.512     67.666 -113.640   <2e-16 ***
## brandvolkswagen   -3247.198     64.648  -50.229   <2e-16 ***
## car_age           -2214.805      7.712 -287.188   <2e-16 ***
## fuel_typeElectric 26623.427   2075.066   12.830   <2e-16 ***
## fuel_typeHybrid    6775.858    113.154   59.882   <2e-16 ***
## fuel_typeOther     5646.865    324.328   17.411   <2e-16 ***
## fuel_typePetrol    1610.705     43.655   36.896   <2e-16 ***
## mpg                 -70.195      1.330  -52.780   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5030 on 99171 degrees of freedom
## Multiple R-squared:  0.7401, Adjusted R-squared:  0.7401 
## F-statistic: 1.883e+04 on 15 and 99171 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model_5a)

Add Interactive Variables to the 5 Predictors

To see how the engine size interacts with other predictors.

model_5a_resid <- car_df %>% 
  add_residuals(model_5a) %>% 
  dplyr::select(-price)

ggpairs(model_5a_resid)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The relationships of engine size comparing to brand, car_age, fuel_type and mileage.

interaction_engine_brand <- coplot(resid ~ engine_size | brand, 
                                   panel = function(x, y, ...){
                                     points(x, y)
                                     abline(lm(y ~ x), col = "blue")},
                                   data = model_5a_resid, rows = 1)

interaction_engine_carage <- coplot(resid ~ engine_size | car_age, 
                                   panel = function(x, y, ...){
                                     points(x, y)
                                     abline(lm(y ~ x), col = "blue")},
                                   data = model_5a_resid, rows = 1)

interaction_engine_fuel <- coplot(resid ~ engine_size | fuel_type, 
                                   panel = function(x, y, ...){
                                     points(x, y)
                                     abline(lm(y ~ x), col = "blue")},
                                   data = model_5a_resid, rows = 1)

interaction_engine_mile <- coplot(resid ~ engine_size | mileage, 
                                   panel = function(x, y, ...){
                                     points(x, y)
                                     abline(lm(y ~ x), col = "blue")},
                                   data = model_5a_resid, rows = 1)

  • It shows that engine size:mileage is relatively correlated compared to other interactive variables. In terms of large mileage, car with small engine size has less depreciation than car with large engine size. While for the small mileage, cars with large engine size has less depreciation than car with small engine size.

The relationships of brands and engine size

model_5a_resid %>% 
  ggplot(aes(x = engine_size, y = resid))+
  geom_point(size = 0.2)+
  geom_smooth(method = "lm", se = FALSE)+facet_wrap(~brand)

Model 6

The interactive variable helps to refine the model with a significant correlations.

  • Testing on price ~ engine_size + brand + car_age + fuel_type + mileage +

engine_size:car_age or engine_size:mileage or car_age : brand

  • The three interactive variables have quite similar r-squared value. However, in this case, model_6b price ~ engine_size + brand + car_age + fuel_type + mileage + engine_size:mileage seems to be more make sense to affect the used car price.
model_6a <- lm(price ~ engine_size + brand + car_age + fuel_type + mileage + engine_size:car_age, data = car_df)

model_6b <- lm(price ~ engine_size + brand + car_age + fuel_type + mileage + engine_size:mileage, data = car_df)

model_6c <- lm(price ~ engine_size + brand + car_age + fuel_type + mileage + car_age : brand, data = car_df)
summary(model_6a)
## 
## Call:
## lm(formula = price ~ engine_size + brand + car_age + fuel_type + 
##     mileage + engine_size:car_age, data = car_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54495  -2436   -439   1738 117187 
## 
## Coefficients:
##                       Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)          2.425e+03  1.068e+02   22.699  < 2e-16 ***
## engine_size          1.399e+04  4.666e+01  299.824  < 2e-16 ***
## brandbmw            -2.430e+03  6.394e+01  -38.015  < 2e-16 ***
## brandford           -4.556e+03  5.978e+01  -76.227  < 2e-16 ***
## brandhyundi         -5.948e+03  8.218e+01  -72.380  < 2e-16 ***
## brandmercedes        1.194e+02  6.101e+01    1.956   0.0504 .  
## brandskoda          -4.635e+03  7.561e+01  -61.302  < 2e-16 ***
## brandtoyota         -6.838e+03  8.004e+01  -85.428  < 2e-16 ***
## brandvauxhall       -7.481e+03  6.237e+01 -119.955  < 2e-16 ***
## brandvolkswagen     -3.260e+03  5.969e+01  -54.617  < 2e-16 ***
## car_age              3.776e+02  2.096e+01   18.018  < 2e-16 ***
## fuel_typeElectric    8.900e+03  1.897e+03    4.692 2.71e-06 ***
## fuel_typeHybrid      4.479e+03  9.751e+01   45.928  < 2e-16 ***
## fuel_typeOther       3.924e+03  2.983e+02   13.152  < 2e-16 ***
## fuel_typePetrol      2.013e+03  3.697e+01   54.452  < 2e-16 ***
## mileage             -9.542e-02  1.077e-03  -88.593  < 2e-16 ***
## engine_size:car_age -1.105e+03  1.049e+01 -105.350  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4645 on 99170 degrees of freedom
## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7784 
## F-statistic: 2.177e+04 on 16 and 99170 DF,  p-value: < 2.2e-16
summary(model_6b)
## 
## Call:
## lm(formula = price ~ engine_size + brand + car_age + fuel_type + 
##     mileage + engine_size:mileage, data = car_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -72289  -2442   -441   1759 107350 
## 
## Coefficients:
##                       Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)          3.229e+03  1.026e+02   31.468  < 2e-16 ***
## engine_size          1.362e+04  4.416e+01  308.537  < 2e-16 ***
## brandbmw            -2.411e+03  6.385e+01  -37.761  < 2e-16 ***
## brandford           -4.682e+03  5.970e+01  -78.416  < 2e-16 ***
## brandhyundi         -6.017e+03  8.206e+01  -73.315  < 2e-16 ***
## brandmercedes       -5.391e+01  6.092e+01   -0.885    0.376    
## brandskoda          -4.758e+03  7.550e+01  -63.017  < 2e-16 ***
## brandtoyota         -6.792e+03  7.992e+01  -84.982  < 2e-16 ***
## brandvauxhall       -7.650e+03  6.230e+01 -122.795  < 2e-16 ***
## brandvolkswagen     -3.349e+03  5.961e+01  -56.188  < 2e-16 ***
## car_age             -1.542e+03  1.045e+01 -147.569  < 2e-16 ***
## fuel_typeElectric    9.485e+03  1.894e+03    5.007 5.53e-07 ***
## fuel_typeHybrid      4.286e+03  9.737e+01   44.012  < 2e-16 ***
## fuel_typeOther       3.859e+03  2.979e+02   12.955  < 2e-16 ***
## fuel_typePetrol      1.855e+03  3.688e+01   50.297  < 2e-16 ***
## mileage              1.525e-01  2.626e-03   58.071  < 2e-16 ***
## engine_size:mileage -1.399e-01  1.309e-03 -106.818  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4639 on 99170 degrees of freedom
## Multiple R-squared:  0.779,  Adjusted R-squared:  0.779 
## F-statistic: 2.185e+04 on 16 and 99170 DF,  p-value: < 2.2e-16
summary(model_6c)
## 
## Call:
## lm(formula = price ~ engine_size + brand + car_age + fuel_type + 
##     mileage + car_age:brand, data = car_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51248  -2434   -399   1728 138008 
## 
## Coefficients:
##                           Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)              1.077e+04  1.087e+02   99.129  < 2e-16 ***
## engine_size              1.070e+04  3.521e+01  303.865  < 2e-16 ***
## brandbmw                -2.951e+03  1.049e+02  -28.130  < 2e-16 ***
## brandford               -8.504e+03  1.014e+02  -83.898  < 2e-16 ***
## brandhyundi             -9.850e+03  1.449e+02  -67.990  < 2e-16 ***
## brandmercedes           -9.843e+01  9.967e+01   -0.988    0.323    
## brandskoda              -7.754e+03  1.294e+02  -59.921  < 2e-16 ***
## brandtoyota             -1.219e+04  1.317e+02  -92.498  < 2e-16 ***
## brandvauxhall           -1.241e+04  1.053e+02 -117.872  < 2e-16 ***
## brandvolkswagen         -5.861e+03  9.947e+01  -58.920  < 2e-16 ***
## car_age                 -2.424e+03  2.285e+01 -106.062  < 2e-16 ***
## fuel_typeElectric        9.472e+03  1.914e+03    4.948 7.50e-07 ***
## fuel_typeHybrid          4.502e+03  9.866e+01   45.633  < 2e-16 ***
## fuel_typeOther           3.931e+03  3.011e+02   13.057  < 2e-16 ***
## fuel_typePetrol          1.821e+03  3.728e+01   48.838  < 2e-16 ***
## mileage                 -9.589e-02  1.091e-03  -87.863  < 2e-16 ***
## brandbmw:car_age         1.803e+02  2.842e+01    6.345 2.24e-10 ***
## brandford:car_age        1.325e+03  2.709e+01   48.927  < 2e-16 ***
## brandhyundi:car_age      1.345e+03  4.093e+01   32.850  < 2e-16 ***
## brandmercedes:car_age   -2.317e+01  2.794e+01   -0.829    0.407    
## brandskoda:car_age       1.070e+03  4.005e+01   26.725  < 2e-16 ***
## brandtoyota:car_age      1.765e+03  3.340e+01   52.833  < 2e-16 ***
## brandvauxhall:car_age    1.665e+03  2.843e+01   58.566  < 2e-16 ***
## brandvolkswagen:car_age  9.032e+02  2.798e+01   32.275  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4687 on 99163 degrees of freedom
## Multiple R-squared:  0.7744, Adjusted R-squared:  0.7743 
## F-statistic: 1.48e+04 on 23 and 99163 DF,  p-value: < 2.2e-16
  • By comparing the models with and without the interactive variables, it shows that model_6b has higher r-squared value which means more reliable.
anova(model_5a, model_6b)
## Analysis of Variance Table
## 
## Model 1: price ~ engine_size + brand + car_age + fuel_type + mileage
## Model 2: price ~ engine_size + brand + car_age + fuel_type + mileage + 
##     engine_size:mileage
##   Res.Df        RSS Df Sum of Sq     F    Pr(>F)    
## 1  99171 2.3793e+12                                 
## 2  99170 2.1338e+12  1 2.455e+11 11410 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Level of Importance of Variables

According to the model, it shows that engine size is the parameter that affects the price the most, followed by car age, and engine size with mileage.

library(caret)

variable_importance <- tibble(varImp(model_6b, scale = TRUE)) %>% 
  mutate(observations = c("engine_size", "brandbmw", "brandford", "brandhyundi",
                         "brandmercedes", "brandskoda", "brandtoyota", "brandvauxhall", "brandvolkswagen", "car_age", "fuel_typeElectric", "fuel_typeHybrid", "fuel_typeOther", "fuel_typePetrol", "mileage", "engine_size:mileage"), .before = 1)

model_6b_clean <- tidy(model_6b) %>% 
  mutate(scaled_estimate = round(
    as.numeric(format(estimate / 3228.6434845, scientific = FALSE)), 
    digits = 3),.after = 1)


variable_importance %>% 
  ggplot(aes(x = observations, y = Overall))+
  geom_col()+
  xlab("Variables")+ylab("Level of Variable Importance")+
  ggtitle("Difference on Variable Importance")+
  theme(axis.text.x = element_text(angle=90, vjust=.5))

As referenced by Audi, Mercedes, BMW and Volkswagen have a relatively high trade-in / trade-out value comparing the depreciation rate with other brands.

model_6b_clean %>%
  dplyr::select(term, scaled_estimate) %>% 
  mutate(term = case_when(term == "(Intercept)" ~ "brandaudi", TRUE ~ term)) %>% 
  slice(1, 3:10) %>% 
  mutate(term = as.factor(term)) %>% 
  #mutate(clean_scaled_estimate = scaled_estimate -1) %>% 
  ggplot(aes(x = term, y = scaled_estimate))+
  geom_point()+
  geom_line(group = 1)+
  xlab("Brand")+ ylab("Scaled Depreciation Rate")+
  ggtitle("Depreciation Rate by Brand")+
  theme(axis.text.x = element_text(angle=45, vjust=.5))

Relative Importance

library(relaimpo)

calc.relimp(model_6b, type = "lmg", rela = TRUE)
## Response variable: price 
## Total response variance: 97353218 
## Analysis based on 99187 observations 
## 
## 16 Regressors: 
## Some regressors combined in groups: 
##         Group  brand : brandbmw brandford brandhyundi brandmercedes brandskoda brandtoyota brandvauxhall brandvolkswagen 
##         Group  fuel_type : fuel_typeElectric fuel_typeHybrid fuel_typeOther fuel_typePetrol 
## 
##  Relative importance of 6 (groups of) regressors assessed: 
##  brand fuel_type engine_size car_age mileage engine_size:mileage 
##  
## Proportion of variance explained by model: 77.9%
## Metrics are normalized to sum to 100% (rela=TRUE). 
## 
## Relative importance metrics: 
## 
##                            lmg
## brand               0.17137420
## fuel_type           0.03322868
## engine_size         0.43197235
## car_age             0.15860854
## mileage             0.17283456
## engine_size:mileage 0.03198167
## 
## Average coefficients for different model sizes: 
## 
##                            1group       2groups       3groups       4groups
## engine_size          1.129053e+04  1.135790e+04 11760.0412704  1.242669e+04
## brandbmw            -1.632762e+02 -7.231286e+02 -1387.7080326 -2.056505e+03
## brandford           -1.061693e+04 -9.206989e+03 -7513.0545239 -5.808481e+03
## brandhyundi         -1.014655e+04 -9.317628e+03 -8181.7535782 -6.884082e+03
## brandmercedes        1.801912e+03  1.178041e+03   559.8503465  4.259568e+01
## brandskoda          -8.621236e+03 -7.912001e+03 -6875.8259919 -5.576263e+03
## brandtoyota         -1.037429e+04 -9.632426e+03 -8542.5388456 -7.293430e+03
## brandvauxhall       -1.249023e+04 -1.124728e+04 -9769.7097517 -8.342023e+03
## brandvolkswagen     -6.057733e+03 -5.526176e+03 -4782.1552714 -3.903214e+03
## car_age             -2.280156e+03 -2.203217e+03 -1934.2525898 -1.677093e+03
## fuel_typeElectric   -2.694155e+03  8.528233e+02  3930.9230105  6.864365e+03
## fuel_typeHybrid     -4.990242e+01  1.574244e+03  2772.6550031  3.228181e+03
## fuel_typeOther      -1.896144e+03 -1.873095e+02  1262.1406002  2.554177e+03
## fuel_typePetrol     -4.564443e+03 -2.592254e+03  -922.1075624  5.153644e+02
## mileage             -1.949953e-01 -1.796792e-01    -0.1381463 -5.303379e-02
## engine_size:mileage           NaN           NaN    -0.1338649 -1.359874e-01
##                           5groups       6groups
## engine_size          1.311014e+04 13624.4281957
## brandbmw            -2.431642e+03 -2410.8632860
## brandford           -4.815047e+03 -4681.5721169
## brandhyundi         -6.069064e+03 -6016.5438339
## brandmercedes       -1.641798e+02   -53.9084485
## brandskoda          -4.717040e+03 -4757.7305620
## brandtoyota         -6.626922e+03 -6791.8553674
## brandvauxhall       -7.599124e+03 -7650.1667663
## brandvolkswagen     -3.350773e+03 -3349.2143491
## car_age             -1.549329e+03 -1541.9981773
## fuel_typeElectric    8.731740e+03  9485.1089373
## fuel_typeHybrid      3.526858e+03  4285.6233060
## fuel_typeOther       3.442066e+03  3859.1136162
## fuel_typePetrol      1.388014e+03  1854.9282493
## mileage              5.317545e-02     0.1525227
## engine_size:mileage -1.379505e-01    -0.1398619

Parsimonious Measures of Goodness of Fit

model_overfit <- lm(price ~., data = car_df)

model_wellfit <- model_6b

model_underfit <-  lm(price ~ engine_size , data = car_df)
broom::glance(model_overfit)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df   logLik      AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>    <dbl>    <dbl>  <dbl>
## 1     0.766         0.766 4771.    15482.       0    21 -980865. 1961777. 1.96e6
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom::glance(model_wellfit)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df   logLik      AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>    <dbl>    <dbl>  <dbl>
## 1     0.779         0.779 4639.    21850.       0    16 -978085. 1956206. 1.96e6
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom::glance(model_underfit)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df    logLik     AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>     <dbl>   <dbl>  <dbl>
## 1     0.407         0.407 7597.    68128.       0     1 -1027024.  2.05e6 2.05e6
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Price Predictor

Setting the price predictor with a similar used car specs by different brands, it matches with the result of depreciation by brand. Audi and Mercedes share a relatively similar deprecation rate.

price ~ engine size + brand + car age + fuel type + mileage + engine_size:mileage

Below is the example of Audi, with car spec as: fuel_type = “Petrol”, engine_size = 1, car_age = 1, mileage = 5000

predict(model_6b, data.frame(
  brand = "audi", 
  fuel_type = "Petrol", 
  engine_size = 1, 
  car_age = 1, 
  mileage = 5000))
##        1 
## 17229.31
Brand Estimated Price
Audi 17229.31
BMW 14818.44
Ford 12547.73
Hyundi 11212.76
Mercedes 17175.4
Skoda 12471.57
Toyota 10437.45
Vauxhall 9579.139
Volkswagen 13880.09

Conclusion

Data

Based on the data, the most common brand is Ford, however, Mercedes has the highest listed price in total The most common car age for selling is 1 year, followed by 3 years and 4 years old car age. 5 Year Lease Car Deal might affect the car age in used car market. Traders are able to know which car brand, and the best car age have a higher value in used-car market.

Model

Based on the model, engine size is the most import factor for estimating the price, followed by brand and car age An engine doubled in size also approximately doubled in price. In terms of large mileage, car with small engine size has better trading value than car with large engine size. While for the small mileage, cars with large engine size has better trading than car with small engine size.