Kahlen Cheung
14/06/2022
Business Intelligence and Data-driven Decision Making
The Review Report on UK Used Car Sales Performance (2020) (the report) aims to review and analysis the sales performance of the UK used car in 2020, which mainly disclose the used car price for the second-hand car market, the most common brand, the most common models and the average selling price of different cars in different conditions.
Domain Knowledge and the Business Context
This analysis would provide essential insights for the second hand car dealership industry. It will be useful for existing business and business who want to explore the second hand car market, as well as the individual traders.
Business Insights
When is the ideal car age to sell used cars?
Top 10 most common listed models.
Price of used car compared by brands.
Car specs compared by brands.
Parameters of Price.
Internal and external data sources
The report is conducted based on the open source dataset 100,000 UK Used Car Data set from kaggle, produced by ADITYA.
This not an official dataset, which data is generated from a used car trading website, without any first-hand internal data provided by clients. Nevertheless, it is still capable for reviewing some of the market trends and provide useful insights.
Types of data
The dataset is a mix of categorical and numerical data. Every data sources contains 9 variables, with types of categorical, discrete and ordinal numerical:
character type: model, transmission, fuel type ordinal numeric type: year discrete numeric type: mileage, tax, mpg, engine size
Data formats
There are a total of 13 data sources in this dataset, which are in csv format, with approximate 100,000 rows.
The dataset includes the listed price of used cars by the brands of: Audi, BMW, Ford, Hyundi, Mercedez, Skoda, Toyota, Vauxhall and Volkswagen, while both the models cclass and focus are separated as the individual data sources. Besides, the car specs are stated as well.
Data quality and Bias
This is not an official dataset that is only scrapped from a random used car trading website. The size of the dataset is up to 99,187 entries, and it might not fully represents the actual market trend.
The dataset does not provide a consecutive year of record, therefore future trend cannot be predicted. Besides, it is lack of car information such as body type, as well as the variety of brands is also limited (eg, Mini, Honda, Kia). Moreover, no customer buying record is provided, which is impossible to predict targeted customers for selling.
Ethical issues in data sourcing and extraction
On behalf of the dataset provided by Kaggle, it is anonymous that none of the private data such as names, personal ID, or address is disclosed, which no evidence showing the ethical concerns.
Ethical implications of business requirements
An official access of car sales history helps to improve the analysis accuracy.
Since this is a second-hand data source, the result may not fully reflect the market environment. For a more accurate data analysis, it is suggested that the future dataset should provide the consecutive year of record, car details such as body type, as well as the variety of brands, in order to predict future trend. A basic customer buying record would also be preferred.
Stages in the data analysis process
Clean up the column names, remove all the NA values, and merge 9 different csv files as a mass dataset.
Have a brief look into the dataset, to know the types of variables, and amend the types for the following analysis.
Use ggplot to sum of the basic insights of the dataset.
Use a manual linear regression model to predict the used car price.
Tools for data analysis
library(tidyverse) library(janitor) library(ggplot2)
library(modelr) library(broom) library(mosaic) library(caret) library(relaimpo) library(cowplot)
#For data cleaning
library(tidyverse)
library(janitor)
library(ggplot2)
#For machine learning
library(modelr)
library(broom)
library(mosaic)
library(caret)
library(relaimpo)
library(cowplot)
car_mass <- read_csv(here::here("clean_data/car_mass.csv"))
glimpse(car_mass)
## Rows: 99,187
## Columns: 12
## $ brand <chr> "audi", "audi", "audi", "audi", "audi", "audi", "a…
## $ manufacturer_origin <chr> "Germany", "Germany", "Germany", "Germany", "Germa…
## $ model <chr> "A1", "A6", "A1", "A4", "A3", "A1", "A6", "A4", "A…
## $ car_age <dbl> 3, 4, 4, 3, 1, 4, 4, 4, 5, 4, 3, 4, 4, 3, 5, 3, 4,…
## $ year <dbl> 2017, 2016, 2016, 2017, 2019, 2016, 2016, 2016, 20…
## $ price <dbl> 12500, 16500, 11000, 16800, 17300, 13900, 13250, 1…
## $ transmission <chr> "Manual", "Automatic", "Manual", "Automatic", "Man…
## $ mileage <dbl> 15735, 36203, 29946, 25952, 1998, 32260, 76788, 75…
## $ fuel_type <chr> "Petrol", "Diesel", "Petrol", "Diesel", "Petrol", …
## $ tax <dbl> 150, 20, 30, 145, 145, 30, 30, 20, 20, 30, 145, 12…
## $ mpg <dbl> 55.4, 64.2, 55.4, 67.3, 49.6, 58.9, 61.4, 70.6, 60…
## $ engine_size <dbl> 1.4, 2.0, 1.4, 2.0, 1.0, 1.4, 2.0, 2.0, 1.4, 1.4, …
sum_listed_price <- car_mass %>%
group_by(brand) %>%
summarise(total_sales = sum(price)) %>%
ggplot(aes(x = brand, y = (total_sales/10000)))+
geom_col()+
ylab("Listed Price(M)") + xlab("Brand")+
ggtitle("Total Listed Price by Brand")
car_mass %>%
group_by(brand) %>%
count(brand) %>%
ggplot(aes(x = brand, y = n))+
geom_col()+
xlab("Brand") + ylab("Count of Cars")+
ggtitle("Total Number of Car Listed by Brand")
car_mass %>%
dplyr::group_by(model) %>%
count(model) %>%
arrange(desc(n)) %>%
head(10) %>%
ggplot(aes(x = model, y = n))+
geom_col()+
ylab("Number of Count")+ xlab("Model")+
ggtitle("The Most Common Listed Model")
# Fiesta, Golf, Focus, C Class, Corsa, Polo, Astra, A Class, 3 Series, Kuga
car_mass %>%
group_by(brand, fuel_type) %>%
count(brand) %>%
ggplot(aes(x = brand, y = n))+
geom_col(aes(x = brand, y = n, fill = fuel_type))+
xlab("Brand") + ylab("Count of Cars")+
ggtitle("Fuel Type Difference by Brand")
2019, 2017 and 2016, meaning that people usually sell the used car at the car age of one, three and four. The finding may be resulted from the 5 years Lease Deal offered by car dealerships, which customers prefer to sell the used car when it comes to the end of the deal.car_mass %>%
filter(year > 2000) %>%
ggplot(aes(x = year))+
geom_bar()+
ylab("Number of Listed Items") + xlab("Year Registered")+
ggtitle("Common Registerd Year")
ggplot(car_mass, aes(price, mpg, colour = brand))+
geom_point()+
geom_smooth(method = "lm")+
geom_vline(xintercept = mean(car_mass$price), colour = "red")+
xlab("Price")+ylab("MPG")+
ggtitle("A Comparasion of Price and MPG")
## `geom_smooth()` using formula 'y ~ x'
A manual Linear Regression model is performed to predict the used car price, by examining the residuals of dummy models, relative importance and measures of goodness of fit.
The prescriptive model receives an R-square value (accuracy rate) of 0.779, by comparing price to the following variables:
engine size
brand
car age
fuel type
mileage
engine size and mileage (interactive variable)
The amendment for data types is used for the following machine learning process, where all the character type of variables will be changed to factor type.
car_df <- read_csv(here::here("clean_data/car_mass.csv")) %>%
mutate(across(where(is.character), ~ as.factor(.x))) %>%
dplyr::select(-model)
By using the ggcorr it is easy to have a brief look on the correlations coefficient of different variables.
library(GGally)
ggcorr(car_mass, label = TRUE)
Testing on price ~ engine_size + brand + car_age + fuel_type + mileage or mpg
model_5a price ~ engine_size + brand + car_age + fuel_type + mileage has a higher multiple r-squared value of 0.7536, so a residual will be added to this model.
model_5a <- lm(price ~ engine_size + brand + car_age + fuel_type + mileage, data = car_df)
model_5b <- lm(price ~ engine_size + brand + car_age + fuel_type + mpg, data = car_df)
summary(model_5a)
##
## Call:
## lm(formula = price ~ engine_size + brand + car_age + fuel_type +
## mileage, data = car_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69863 -2535 -444 1783 107248
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.349e+03 9.579e+01 87.157 < 2e-16 ***
## engine_size 1.072e+04 3.672e+01 291.845 < 2e-16 ***
## brandbmw -2.438e+03 6.742e+01 -36.168 < 2e-16 ***
## brandford -4.560e+03 6.303e+01 -72.345 < 2e-16 ***
## brandhyundi -5.964e+03 8.666e+01 -68.819 < 2e-16 ***
## brandmercedes -9.892e+00 6.432e+01 -0.154 0.878
## brandskoda -4.751e+03 7.972e+01 -59.600 < 2e-16 ***
## brandtoyota -6.729e+03 8.439e+01 -79.741 < 2e-16 ***
## brandvauxhall -7.477e+03 6.576e+01 -113.688 < 2e-16 ***
## brandvolkswagen -3.257e+03 6.294e+01 -51.759 < 2e-16 ***
## car_age -1.535e+03 1.103e+01 -139.153 < 2e-16 ***
## fuel_typeElectric 9.459e+03 2.000e+03 4.729 2.26e-06 ***
## fuel_typeHybrid 4.383e+03 1.028e+02 42.628 < 2e-16 ***
## fuel_typeOther 3.727e+03 3.146e+02 11.849 < 2e-16 ***
## fuel_typePetrol 1.825e+03 3.894e+01 46.857 < 2e-16 ***
## mileage -1.036e-01 1.133e-03 -91.415 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4898 on 99171 degrees of freedom
## Multiple R-squared: 0.7536, Adjusted R-squared: 0.7536
## F-statistic: 2.022e+04 on 15 and 99171 DF, p-value: < 2.2e-16
summary(model_5b)
##
## Call:
## lm(formula = price ~ engine_size + brand + car_age + fuel_type +
## mpg, data = car_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103357 -2589 -401 1800 124375
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13562.334 151.709 89.397 <2e-16 ***
## engine_size 9601.332 43.760 219.407 <2e-16 ***
## brandbmw -1937.922 69.739 -27.788 <2e-16 ***
## brandford -4351.031 64.696 -67.253 <2e-16 ***
## brandhyundi -6075.201 89.208 -68.101 <2e-16 ***
## brandmercedes 553.656 66.163 8.368 <2e-16 ***
## brandskoda -4645.887 81.876 -56.743 <2e-16 ***
## brandtoyota -6658.343 86.711 -76.788 <2e-16 ***
## brandvauxhall -7689.512 67.666 -113.640 <2e-16 ***
## brandvolkswagen -3247.198 64.648 -50.229 <2e-16 ***
## car_age -2214.805 7.712 -287.188 <2e-16 ***
## fuel_typeElectric 26623.427 2075.066 12.830 <2e-16 ***
## fuel_typeHybrid 6775.858 113.154 59.882 <2e-16 ***
## fuel_typeOther 5646.865 324.328 17.411 <2e-16 ***
## fuel_typePetrol 1610.705 43.655 36.896 <2e-16 ***
## mpg -70.195 1.330 -52.780 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5030 on 99171 degrees of freedom
## Multiple R-squared: 0.7401, Adjusted R-squared: 0.7401
## F-statistic: 1.883e+04 on 15 and 99171 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model_5a)
To see how the engine size interacts with other predictors.
model_5a_resid <- car_df %>%
add_residuals(model_5a) %>%
dplyr::select(-price)
ggpairs(model_5a_resid)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The relationships of engine size comparing to brand, car_age, fuel_type and mileage.
interaction_engine_brand <- coplot(resid ~ engine_size | brand,
panel = function(x, y, ...){
points(x, y)
abline(lm(y ~ x), col = "blue")},
data = model_5a_resid, rows = 1)
interaction_engine_carage <- coplot(resid ~ engine_size | car_age,
panel = function(x, y, ...){
points(x, y)
abline(lm(y ~ x), col = "blue")},
data = model_5a_resid, rows = 1)
interaction_engine_fuel <- coplot(resid ~ engine_size | fuel_type,
panel = function(x, y, ...){
points(x, y)
abline(lm(y ~ x), col = "blue")},
data = model_5a_resid, rows = 1)
interaction_engine_mile <- coplot(resid ~ engine_size | mileage,
panel = function(x, y, ...){
points(x, y)
abline(lm(y ~ x), col = "blue")},
data = model_5a_resid, rows = 1)
The relationships of brands and engine size
model_5a_resid %>%
ggplot(aes(x = engine_size, y = resid))+
geom_point(size = 0.2)+
geom_smooth(method = "lm", se = FALSE)+facet_wrap(~brand)
The interactive variable helps to refine the model with a significant correlations.
engine_size:car_age or engine_size:mileage or car_age : brand
price ~ engine_size + brand + car_age + fuel_type + mileage + engine_size:mileage seems to be more make sense to affect the used car price.model_6a <- lm(price ~ engine_size + brand + car_age + fuel_type + mileage + engine_size:car_age, data = car_df)
model_6b <- lm(price ~ engine_size + brand + car_age + fuel_type + mileage + engine_size:mileage, data = car_df)
model_6c <- lm(price ~ engine_size + brand + car_age + fuel_type + mileage + car_age : brand, data = car_df)
summary(model_6a)
##
## Call:
## lm(formula = price ~ engine_size + brand + car_age + fuel_type +
## mileage + engine_size:car_age, data = car_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54495 -2436 -439 1738 117187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.425e+03 1.068e+02 22.699 < 2e-16 ***
## engine_size 1.399e+04 4.666e+01 299.824 < 2e-16 ***
## brandbmw -2.430e+03 6.394e+01 -38.015 < 2e-16 ***
## brandford -4.556e+03 5.978e+01 -76.227 < 2e-16 ***
## brandhyundi -5.948e+03 8.218e+01 -72.380 < 2e-16 ***
## brandmercedes 1.194e+02 6.101e+01 1.956 0.0504 .
## brandskoda -4.635e+03 7.561e+01 -61.302 < 2e-16 ***
## brandtoyota -6.838e+03 8.004e+01 -85.428 < 2e-16 ***
## brandvauxhall -7.481e+03 6.237e+01 -119.955 < 2e-16 ***
## brandvolkswagen -3.260e+03 5.969e+01 -54.617 < 2e-16 ***
## car_age 3.776e+02 2.096e+01 18.018 < 2e-16 ***
## fuel_typeElectric 8.900e+03 1.897e+03 4.692 2.71e-06 ***
## fuel_typeHybrid 4.479e+03 9.751e+01 45.928 < 2e-16 ***
## fuel_typeOther 3.924e+03 2.983e+02 13.152 < 2e-16 ***
## fuel_typePetrol 2.013e+03 3.697e+01 54.452 < 2e-16 ***
## mileage -9.542e-02 1.077e-03 -88.593 < 2e-16 ***
## engine_size:car_age -1.105e+03 1.049e+01 -105.350 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4645 on 99170 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7784
## F-statistic: 2.177e+04 on 16 and 99170 DF, p-value: < 2.2e-16
summary(model_6b)
##
## Call:
## lm(formula = price ~ engine_size + brand + car_age + fuel_type +
## mileage + engine_size:mileage, data = car_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72289 -2442 -441 1759 107350
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.229e+03 1.026e+02 31.468 < 2e-16 ***
## engine_size 1.362e+04 4.416e+01 308.537 < 2e-16 ***
## brandbmw -2.411e+03 6.385e+01 -37.761 < 2e-16 ***
## brandford -4.682e+03 5.970e+01 -78.416 < 2e-16 ***
## brandhyundi -6.017e+03 8.206e+01 -73.315 < 2e-16 ***
## brandmercedes -5.391e+01 6.092e+01 -0.885 0.376
## brandskoda -4.758e+03 7.550e+01 -63.017 < 2e-16 ***
## brandtoyota -6.792e+03 7.992e+01 -84.982 < 2e-16 ***
## brandvauxhall -7.650e+03 6.230e+01 -122.795 < 2e-16 ***
## brandvolkswagen -3.349e+03 5.961e+01 -56.188 < 2e-16 ***
## car_age -1.542e+03 1.045e+01 -147.569 < 2e-16 ***
## fuel_typeElectric 9.485e+03 1.894e+03 5.007 5.53e-07 ***
## fuel_typeHybrid 4.286e+03 9.737e+01 44.012 < 2e-16 ***
## fuel_typeOther 3.859e+03 2.979e+02 12.955 < 2e-16 ***
## fuel_typePetrol 1.855e+03 3.688e+01 50.297 < 2e-16 ***
## mileage 1.525e-01 2.626e-03 58.071 < 2e-16 ***
## engine_size:mileage -1.399e-01 1.309e-03 -106.818 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4639 on 99170 degrees of freedom
## Multiple R-squared: 0.779, Adjusted R-squared: 0.779
## F-statistic: 2.185e+04 on 16 and 99170 DF, p-value: < 2.2e-16
summary(model_6c)
##
## Call:
## lm(formula = price ~ engine_size + brand + car_age + fuel_type +
## mileage + car_age:brand, data = car_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51248 -2434 -399 1728 138008
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.077e+04 1.087e+02 99.129 < 2e-16 ***
## engine_size 1.070e+04 3.521e+01 303.865 < 2e-16 ***
## brandbmw -2.951e+03 1.049e+02 -28.130 < 2e-16 ***
## brandford -8.504e+03 1.014e+02 -83.898 < 2e-16 ***
## brandhyundi -9.850e+03 1.449e+02 -67.990 < 2e-16 ***
## brandmercedes -9.843e+01 9.967e+01 -0.988 0.323
## brandskoda -7.754e+03 1.294e+02 -59.921 < 2e-16 ***
## brandtoyota -1.219e+04 1.317e+02 -92.498 < 2e-16 ***
## brandvauxhall -1.241e+04 1.053e+02 -117.872 < 2e-16 ***
## brandvolkswagen -5.861e+03 9.947e+01 -58.920 < 2e-16 ***
## car_age -2.424e+03 2.285e+01 -106.062 < 2e-16 ***
## fuel_typeElectric 9.472e+03 1.914e+03 4.948 7.50e-07 ***
## fuel_typeHybrid 4.502e+03 9.866e+01 45.633 < 2e-16 ***
## fuel_typeOther 3.931e+03 3.011e+02 13.057 < 2e-16 ***
## fuel_typePetrol 1.821e+03 3.728e+01 48.838 < 2e-16 ***
## mileage -9.589e-02 1.091e-03 -87.863 < 2e-16 ***
## brandbmw:car_age 1.803e+02 2.842e+01 6.345 2.24e-10 ***
## brandford:car_age 1.325e+03 2.709e+01 48.927 < 2e-16 ***
## brandhyundi:car_age 1.345e+03 4.093e+01 32.850 < 2e-16 ***
## brandmercedes:car_age -2.317e+01 2.794e+01 -0.829 0.407
## brandskoda:car_age 1.070e+03 4.005e+01 26.725 < 2e-16 ***
## brandtoyota:car_age 1.765e+03 3.340e+01 52.833 < 2e-16 ***
## brandvauxhall:car_age 1.665e+03 2.843e+01 58.566 < 2e-16 ***
## brandvolkswagen:car_age 9.032e+02 2.798e+01 32.275 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4687 on 99163 degrees of freedom
## Multiple R-squared: 0.7744, Adjusted R-squared: 0.7743
## F-statistic: 1.48e+04 on 23 and 99163 DF, p-value: < 2.2e-16
anova(model_5a, model_6b)
## Analysis of Variance Table
##
## Model 1: price ~ engine_size + brand + car_age + fuel_type + mileage
## Model 2: price ~ engine_size + brand + car_age + fuel_type + mileage +
## engine_size:mileage
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 99171 2.3793e+12
## 2 99170 2.1338e+12 1 2.455e+11 11410 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to the model, it shows that engine size is the parameter that affects the price the most, followed by car age, and engine size with mileage.
library(caret)
variable_importance <- tibble(varImp(model_6b, scale = TRUE)) %>%
mutate(observations = c("engine_size", "brandbmw", "brandford", "brandhyundi",
"brandmercedes", "brandskoda", "brandtoyota", "brandvauxhall", "brandvolkswagen", "car_age", "fuel_typeElectric", "fuel_typeHybrid", "fuel_typeOther", "fuel_typePetrol", "mileage", "engine_size:mileage"), .before = 1)
model_6b_clean <- tidy(model_6b) %>%
mutate(scaled_estimate = round(
as.numeric(format(estimate / 3228.6434845, scientific = FALSE)),
digits = 3),.after = 1)
variable_importance %>%
ggplot(aes(x = observations, y = Overall))+
geom_col()+
xlab("Variables")+ylab("Level of Variable Importance")+
ggtitle("Difference on Variable Importance")+
theme(axis.text.x = element_text(angle=90, vjust=.5))
As referenced by Audi, Mercedes, BMW and Volkswagen have a relatively high trade-in / trade-out value comparing the depreciation rate with other brands.
model_6b_clean %>%
dplyr::select(term, scaled_estimate) %>%
mutate(term = case_when(term == "(Intercept)" ~ "brandaudi", TRUE ~ term)) %>%
slice(1, 3:10) %>%
mutate(term = as.factor(term)) %>%
#mutate(clean_scaled_estimate = scaled_estimate -1) %>%
ggplot(aes(x = term, y = scaled_estimate))+
geom_point()+
geom_line(group = 1)+
xlab("Brand")+ ylab("Scaled Depreciation Rate")+
ggtitle("Depreciation Rate by Brand")+
theme(axis.text.x = element_text(angle=45, vjust=.5))
library(relaimpo)
calc.relimp(model_6b, type = "lmg", rela = TRUE)
## Response variable: price
## Total response variance: 97353218
## Analysis based on 99187 observations
##
## 16 Regressors:
## Some regressors combined in groups:
## Group brand : brandbmw brandford brandhyundi brandmercedes brandskoda brandtoyota brandvauxhall brandvolkswagen
## Group fuel_type : fuel_typeElectric fuel_typeHybrid fuel_typeOther fuel_typePetrol
##
## Relative importance of 6 (groups of) regressors assessed:
## brand fuel_type engine_size car_age mileage engine_size:mileage
##
## Proportion of variance explained by model: 77.9%
## Metrics are normalized to sum to 100% (rela=TRUE).
##
## Relative importance metrics:
##
## lmg
## brand 0.17137420
## fuel_type 0.03322868
## engine_size 0.43197235
## car_age 0.15860854
## mileage 0.17283456
## engine_size:mileage 0.03198167
##
## Average coefficients for different model sizes:
##
## 1group 2groups 3groups 4groups
## engine_size 1.129053e+04 1.135790e+04 11760.0412704 1.242669e+04
## brandbmw -1.632762e+02 -7.231286e+02 -1387.7080326 -2.056505e+03
## brandford -1.061693e+04 -9.206989e+03 -7513.0545239 -5.808481e+03
## brandhyundi -1.014655e+04 -9.317628e+03 -8181.7535782 -6.884082e+03
## brandmercedes 1.801912e+03 1.178041e+03 559.8503465 4.259568e+01
## brandskoda -8.621236e+03 -7.912001e+03 -6875.8259919 -5.576263e+03
## brandtoyota -1.037429e+04 -9.632426e+03 -8542.5388456 -7.293430e+03
## brandvauxhall -1.249023e+04 -1.124728e+04 -9769.7097517 -8.342023e+03
## brandvolkswagen -6.057733e+03 -5.526176e+03 -4782.1552714 -3.903214e+03
## car_age -2.280156e+03 -2.203217e+03 -1934.2525898 -1.677093e+03
## fuel_typeElectric -2.694155e+03 8.528233e+02 3930.9230105 6.864365e+03
## fuel_typeHybrid -4.990242e+01 1.574244e+03 2772.6550031 3.228181e+03
## fuel_typeOther -1.896144e+03 -1.873095e+02 1262.1406002 2.554177e+03
## fuel_typePetrol -4.564443e+03 -2.592254e+03 -922.1075624 5.153644e+02
## mileage -1.949953e-01 -1.796792e-01 -0.1381463 -5.303379e-02
## engine_size:mileage NaN NaN -0.1338649 -1.359874e-01
## 5groups 6groups
## engine_size 1.311014e+04 13624.4281957
## brandbmw -2.431642e+03 -2410.8632860
## brandford -4.815047e+03 -4681.5721169
## brandhyundi -6.069064e+03 -6016.5438339
## brandmercedes -1.641798e+02 -53.9084485
## brandskoda -4.717040e+03 -4757.7305620
## brandtoyota -6.626922e+03 -6791.8553674
## brandvauxhall -7.599124e+03 -7650.1667663
## brandvolkswagen -3.350773e+03 -3349.2143491
## car_age -1.549329e+03 -1541.9981773
## fuel_typeElectric 8.731740e+03 9485.1089373
## fuel_typeHybrid 3.526858e+03 4285.6233060
## fuel_typeOther 3.442066e+03 3859.1136162
## fuel_typePetrol 1.388014e+03 1854.9282493
## mileage 5.317545e-02 0.1525227
## engine_size:mileage -1.379505e-01 -0.1398619
model_overfit <- lm(price ~., data = car_df)
model_wellfit <- model_6b
model_underfit <- lm(price ~ engine_size , data = car_df)
broom::glance(model_overfit)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.766 0.766 4771. 15482. 0 21 -980865. 1961777. 1.96e6
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom::glance(model_wellfit)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.779 0.779 4639. 21850. 0 16 -978085. 1956206. 1.96e6
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom::glance(model_underfit)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.407 0.407 7597. 68128. 0 1 -1027024. 2.05e6 2.05e6
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Setting the price predictor with a similar used car specs by different brands, it matches with the result of depreciation by brand. Audi and Mercedes share a relatively similar deprecation rate.
price ~ engine size + brand + car age + fuel type + mileage + engine_size:mileage
Below is the example of Audi, with car spec as: fuel_type = “Petrol”, engine_size = 1, car_age = 1, mileage = 5000
predict(model_6b, data.frame(
brand = "audi",
fuel_type = "Petrol",
engine_size = 1,
car_age = 1,
mileage = 5000))
## 1
## 17229.31
| Brand | Estimated Price |
|---|---|
| Audi | 17229.31 |
| BMW | 14818.44 |
| Ford | 12547.73 |
| Hyundi | 11212.76 |
| Mercedes | 17175.4 |
| Skoda | 12471.57 |
| Toyota | 10437.45 |
| Vauxhall | 9579.139 |
| Volkswagen | 13880.09 |
Based on the data, the most common brand is Ford, however, Mercedes has the highest listed price in total The most common car age for selling is 1 year, followed by 3 years and 4 years old car age. 5 Year Lease Car Deal might affect the car age in used car market. Traders are able to know which car brand, and the best car age have a higher value in used-car market.
Based on the model, engine size is the most import factor for estimating the price, followed by brand and car age An engine doubled in size also approximately doubled in price. In terms of large mileage, car with small engine size has better trading value than car with large engine size. While for the small mileage, cars with large engine size has better trading than car with small engine size.