Introduction

Valuing cars as loan collateral has become increasingly important for banks in Vietnam. The practice gives financial institutions a safety net and widens access to credit for individuals and businesses. In Vietnam, where the automotive market is expanding rapidly, accurate valuation of cars used as collateral is crucial for the stability and growth of the banking sector.

One of the primary reasons banks in Vietnam value cars as collateral is to enhance loan security. By assessing a car’s value accurately, banks can mitigate lending risk, which matters especially in a market where default rates can affect the financial health of lending institutions. An accurate valuation ensures that banks have a tangible asset to fall back on if a borrower defaults, thus protecting their financial interests.

Predicting car prices is a widely studied problem. Accurate prediction requires expert knowledge because many distinct features influence a car’s value. The most critical factors typically include the brand and model, the year of manufacture, and the kilometers driven. Fuel type and fuel consumption also affect the price significantly, owing to frequent fluctuations in fuel costs. Other features, such as exterior color, number of doors, type of transmission, dimensions, safety features, air conditioning, interior quality, and the presence of a navigation system, also play a role in determining the price. Traditionally, statistical models such as linear regression have been used for this task. However, these models often fail to capture the complex, nonlinear relationships between the variables, which leads to suboptimal predictive performance. Machine learning algorithms have shown significant potential in this domain because they can learn intricate patterns from data. Evaluating how well these algorithms predict used car prices is therefore worthwhile.

This article aims to assess the performance of 12 machine learning (ML) algorithms for predicting the prices of used cars. The models were developed on a training dataset of 20,000 used vehicles from 30 popular car brands, and their predictions were validated against a test dataset of 2,727 used cars. The dataset includes details such as kilometers driven, model, age, fuel type, and brand. The performance of each algorithm is measured with the R-squared (R²) metric, and the resulting R² scores are compared.

The results indicate that XGBoost outperforms the other algorithms with an R² score of 0.9843, followed by Random Forest with an R² score of 0.9649. By contrast, the traditional linear model reaches an R² score of only 0.5733. These findings underscore the importance of selecting suitable machine learning algorithms for used car price prediction and demonstrate the advantage of ensemble methods over traditional linear models. By employing advanced machine learning techniques, the study highlights the potential for significantly improving the accuracy of price predictions for pre-owned automobiles, thereby providing valuable insights for both sellers and buyers in the automotive market.

Web Scraping

First, we need to collect the listed car prices together with the accompanying information for each listing. The data source is https://bonbanh.com/oto, one of the largest car trading platforms in Vietnam.

Below is the R code used to retrieve data on 34,299 cars listed for sale (as of June 11, 2024):

#====================================================================
#  Stage 1: Collect 34,299 selling cars from https://bonbanh.com/oto
#====================================================================

rm(list = ls()) # Clear our R environment. 

library(rvest)
library(stringr)
library(dplyr)

n <- 1711 # Number of pages. 

mainURLs <- str_c("https://bonbanh.com/oto/page,", 1:n)

# Function that collects all listing URLs from one page: 

extractULRs <- function(mainURL) {
  
  Sys.sleep(0.5)
  
  mainURL %>% 
    read_html() %>% 
    html_nodes("a") %>% 
    html_attr("href") -> urls
  
  urls[str_detect(urls, "^xe-")] -> urls 
  
  str_c("https://bonbanh.com/", urls %>% na.omit()) -> fullURL
  
  return(fullURL)
  
}

# Test the function: 

extractULRs(mainURLs[1])

# Extract the URLs of all 34,299 listings: 

# sapply(mainURLs[1:n], extractULRs) -> cars_for_selling

list_links <- vector("list", n)

system.time(for(i in 1:n) {
  
  list_links[[i]] <- tryCatch(extractULRs(mainURLs[i]), error = function(e) {})
  print(i)
  print(mainURLs[i])
})

cars_for_selling <- list_links %>% unlist()

# Save data: 

saveRDS(cars_for_selling, "cars_for_selling.rds")

# Function extracts all info for cars: 

extractINFO <- function(link) {
  
  # link <- cars_for_selling[1]
  
  Sys.sleep(0.5)
  
  link %>% 
    read_html() -> pageContent
  
  pageContent %>% 
    html_node(".notes") %>% 
    html_text2() -> ngaydang_luotxem
  
  pageContent %>% 
    html_node("span:nth-child(7)") %>% 
    html_text2() -> ma_tin
  
  pageContent %>% 
    html_node(xpath = '//*[@id="car_detail"]/div[3]/h1') %>% 
    html_text2() -> gia
  
  pageContent %>% 
    html_node(xpath = '//*[@id="sgg"]/div[1]') %>% 
    html_text2() -> car_info  
  
  pageContent %>% 
    html_node(xpath = '//*[@id="car_detail"]/div[7]/div[2]/div') %>% 
    html_text2() -> seller
  
  pageContent %>% 
    html_node(xpath = '//*[@id="sgg"]/div[2]/div/div') %>% 
    html_text2() -> mieuta
  
  data.frame(date_views = ngaydang_luotxem, 
             code = ma_tin, 
             price = gia, 
             car_info = car_info, 
             car_des = mieuta, 
             seller_contact = seller, 
             url = link) -> df_car
  
  return(df_car)
}

# Collect info for 34299 used cars: 

# system.time(lapply(cars_for_selling, extractINFO) -> list_car_info)

k <- length(cars_for_selling)

list_info <- vector("list", k)

system.time(for(j in 1:k) {
  list_info[[j]] <- tryCatch(extractINFO(cars_for_selling[j]), error = function(e) {})
  print(j)
  print(cars_for_selling[j])
})

# Save data: 

do.call("bind_rows", list_info) -> used_carData

saveRDS(used_carData, "used_carData.rds")

Running the extraction script can take a long time, and the volume of requests from a single IP address may be treated as a network attack, resulting in a 403 error. To make the analysis easier to reproduce, the scraped data has been saved and can be downloaded [here](https://www.mediafire.com/file/74oj0euma9s3jxp/used_carData.rds/file).
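
One way to make a long scraping run more tolerant of temporary blocks is to retry failed requests with growing pauses. The following is a minimal base-R sketch of that idea, not part of the original script: the helper `with_retry()` and its parameters `max_tries` and `base_wait` are hypothetical names, and the demonstration uses a stand-in function instead of a real HTTP request.

```r
# Hypothetical helper (an assumption, not from the original script):
# call fun(...) and retry after an error, doubling the pause each time.
with_retry <- function(fun, ..., max_tries = 3, base_wait = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fun(...), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt < max_tries) {
      # Wait base_wait, 2 * base_wait, 4 * base_wait, ... seconds.
      Sys.sleep(base_wait * 2^(attempt - 1))
    }
  }
  NULL  # All attempts failed; the caller decides how to handle NULL.
}

# Stand-in for a page request that fails twice before succeeding.
calls <- 0
flaky <- function(x) {
  calls <<- calls + 1
  if (calls < 3) stop("simulated HTTP 403")
  toupper(x)
}

demo_result <- with_retry(flaky, "ok", max_tries = 5, base_wait = 0.01)
```

In the scraping loop, a call such as `extractINFO(cars_for_selling[j])` could be wrapped as `with_retry(extractINFO, cars_for_selling[j])`. Whether this is enough to avoid a 403 response depends on the site's rate limits, which are not documented here.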

Data Preprocessing

Data obtained from websites is often “messy” and thus needs to be processed appropriately for the intended use. Below is the R code for Data Preprocessing:

#============================
#  Stage 2: Data Processing
#============================

# Load data: 

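readRDS("used_carData.rds") -> used_carData # Data can be downloaded from https://www.mediafire.com/file/74oj0euma9s3jxp/used_carData.rds/file

library(dplyr)   # Pipes and data manipulation (needed if Stage 1 was skipped)
library(stringr) # String helpers such as str_split() and str_detect()
library(stringi)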

# Function extracts price: 

extractPRICE <- function(text) {
  
  text %>% 
    str_split("-", simplify = TRUE) %>% 
    as.data.frame() %>% 
    pull(V2) %>% 
    str_squish() %>% 
    stri_trans_general("Latin-ASCII") %>% 
    str_to_lower() %>% 
    str_split("ty", simplify = TRUE) %>% 
    as.data.frame() -> df_gia
  
  df_gia %>% 
    mutate(hangtrieu = case_when(str_detect(V1, "trieu") ~ str_replace_all(V1, "trieu", "") %>% str_squish() %>% as.numeric(),
                                 TRUE ~ str_squish(V1) %>% as.numeric() * 1000)) %>% 
    mutate(phantrieu = case_when(str_detect(V2, "trieu") ~ str_replace_all(V2, "trieu", "") %>% str_squish() %>% as.numeric(), 
                                 TRUE ~ 0)) %>% 
    mutate(gia = hangtrieu + phantrieu) %>% 
    pull(gia) %>% 
    return()
  
}


used_carData %>% mutate(carPrice = extractPRICE(price)) -> used_carData

# Only select used cars: 

used_carData %>% 
  mutate(car_infoLatin = stri_trans_general(car_info, "Latin-ASCII") %>% str_to_lower()) %>% 
  filter(!str_detect(car_infoLatin, "xe moi")) %>% 
  select(-car_infoLatin) -> car_xe_cu

car_xe_cu -> x

# Year of production: 

x %>% 
  pull(car_info) %>%  
  str_to_lower() %>% 
  stri_trans_general("Latin-ASCII") %>% 
  str_split(":\n", simplify = TRUE) %>% 
  as.data.frame() %>% 
  mutate(car_info = x %>% pull(car_info)) -> x

x %>% 
  pull(V2) %>% 
  str_split("\n", simplify = TRUE) %>% 
  as.data.frame() %>% 
  pull(V1) %>% 
  str_squish() %>% 
  as.numeric() -> nam_sanxuat

car_xe_cu %>% mutate(nam_sanxuat = nam_sanxuat) -> car_xe_cu

# Km: 

x %>% 
  pull(V4) %>% 
  str_split("\n", simplify = TRUE) %>% 
  as.data.frame() %>% 
  pull(V1) %>% 
  str_split(" km", simplify = TRUE) %>% 
  as.data.frame() %>% 
  pull(V1) %>% 
  str_replace_all(",", "") %>% 
  as.numeric() -> km_dadi

car_xe_cu %>% mutate(km_dadi = km_dadi) -> car_xe_cu

# Origin: 

x %>% 
  pull(V5) %>% 
  str_split("\n", simplify = TRUE) %>% 
  as.data.frame() %>% 
  pull(V1) -> xuat_xu

car_xe_cu %>% mutate(xuat_xu = xuat_xu) -> car_xe_cu

# Car form: 

x %>% 
  pull(V6) %>% 
  str_split("\n", simplify = TRUE) %>% 
  as.data.frame() %>% 
  pull(V1) -> kieu_dang 

car_xe_cu %>% mutate(kieu_dang = kieu_dang) -> car_xe_cu

# Transmission type: 

x %>% 
  pull(V7) %>% 
  str_split("\n", simplify = TRUE) %>% 
  as.data.frame() %>% 
  pull(V1) -> loai_so

car_xe_cu %>% mutate(loai_so = loai_so) -> car_xe_cu

# Fuel type and engine displacement: 

x %>% 
  pull(V8) %>% 
  str_split("\n", simplify = TRUE) %>% 
  as.data.frame() %>% 
  pull(V1) %>% 
  str_split(" ", simplify = TRUE) %>% 
  as.data.frame() %>% 
  select(-V3) %>% 
  mutate(loaiNL = V1, dungtich = as.numeric(V2)) %>% 
  select(loaiNL, dungtich) -> df_loaiNL_dungtich

bind_cols(car_xe_cu, df_loaiNL_dungtich) -> car_xe_cu

# Exterior color: 

x %>% 
  pull(V9) %>% 
  str_split("\n", simplify = TRUE) %>% 
  as.data.frame() %>% 
  pull(V1) -> car_color

car_xe_cu %>% mutate(car_color = car_color) -> car_xe_cu

# Interior color: 

x %>% 
  pull(V10) %>% 
  str_split("\n", simplify = TRUE) %>% 
  as.data.frame() %>% 
  pull(V1) -> noithat_color

car_xe_cu %>% mutate(noithat_color = noithat_color) -> car_xe_cu

# Number of seats: 

x %>% 
  pull(V11) %>% 
  str_split(" ", simplify = TRUE) %>% 
  as.data.frame() %>% 
  pull(V1) %>% 
  as.numeric() -> so_cho_ngoi

# Number of doors: 

x %>% 
  pull(V12) %>% 
  str_split(" ", simplify = TRUE) %>% 
  as.data.frame() %>% 
  pull(V1) %>% 
  as.numeric() -> so_cua_so 

# Drivetrain: 

x %>% 
  pull(V13) %>% 
  str_split(" ", simplify = TRUE) %>% 
  as.data.frame() %>% 
  pull(V1) -> kieu_dan_dong


car_xe_cu %>% 
  mutate(so_cho_ngoi = so_cho_ngoi, 
         so_cua_so = so_cua_so, 
         kieu_dan_dong = kieu_dan_dong) -> car_xe_cu 

# Extract more info: 

car_xe_cu %>% 
  pull(price) %>% 
  stri_trans_general("Latin-ASCII") %>% 
  str_replace_all("Xe ", "") -> extraINFO


extraINFO %>%
  str_split(" ", simplify = TRUE) %>% 
  as.data.frame() %>% 
  pull(V1) -> producer


car_xe_cu %>% 
  pull(price) %>% 
  str_split("-", simplify = TRUE) %>% 
  as.data.frame() %>% 
  pull(V1) %>% 
  str_replace_all("Xe ", "") %>% 
  str_squish() %>% 
  stri_trans_general("Latin-ASCII") -> car_model

car_xe_cu %>% 
  mutate(producer = producer, 
         car_model = car_model) -> car_xe_cu

library(lubridate)

car_xe_cu %>% 
  pull(date_views) %>% 
  str_split(" ", simplify = TRUE) %>% 
  as.data.frame() %>% 
  pull(V3) %>% 
  dmy() -> dateYMD

car_xe_cu %>% mutate(dateYMD = dateYMD) -> car_xe_cu

car_xe_cu %>% mutate(noithat_color = noithat_color) -> car_xe_cu

car_xe_cu %>% 
  mutate(noithat_color = case_when(noithat_color == "-" ~ "unknown", 
                                   TRUE ~ noithat_color)) %>% 
  filter(loaiNL != "-", 
         !is.na(carPrice), 
         carPrice <= 10000, 
         km_dadi != 0, 
         car_color != "-") %>% 
  select(7:22) %>% 
  mutate(kieu_dang = str_replace_all(kieu_dang, " / |/", "_")) %>% 
  mutate(car_age = 2024 - nam_sanxuat) %>% 
  mutate_if(is.character, function(x) {str_replace_all(x, " ", "_")}) %>% 
  mutate_if(is.character, function(x) {as.factor(x)}) -> data_modelling

# Shuffle the rows (reproducibly) and remove missing data points: 

set.seed(12)

data_modelling %>% 
  sample_n(nrow(data_modelling)) %>% 
  na.omit() -> data_modelling

# Show some observations: 

data_modelling %>% 
  select(c(1:2, 4, 15)) %>% 
  head() %>% 
  knitr::kable()

url	carPrice	km_dadi	producer
https://bonbanh.com/xe-peugeot-3008-al-2022-5662251	858	30000	Peugeot
https://bonbanh.com/xe-hyundai-elantra-2.0-at-2021-5573293	550	44000	Hyundai
https://bonbanh.com/xe-toyota-yaris-1.5-at-2013-5632737	315	130000	Toyota
https://bonbanh.com/xe-vinfast-lux_sa_2.0-premium-2.0-at-2020-5643193	790	60000	VinFast
https://bonbanh.com/xe-toyota-vios-1.5e-cvt-2020-5612988	395	66000	Toyota
https://bonbanh.com/xe-suzuki-xl7-1.5-at-2022-5638433	525	40000	Suzuki
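
The price-parsing step in the preprocessing code can be illustrated on its own. The following is a minimal base-R sketch, not the author's `extractPRICE` function: it assumes the price text has already been lower-cased and transliterated to ASCII (for example `1 ty 250 trieu`), and the helper name `parse_price_million` is hypothetical. It converts such a string to millions of VND, the unit of `carPrice`, using 1 ty = 1,000 trieu.

```r
# Hypothetical helper: convert one ASCII, lower-case price string into
# millions of VND. The input format is an assumption for illustration.
parse_price_million <- function(text) {
  # Return the number written directly before `unit`, or 0 if absent.
  get_number <- function(unit) {
    pattern <- paste0("[0-9]+(\\.[0-9]+)?(?=\\s*", unit, ")")
    hit <- regmatches(text, regexpr(pattern, text, perl = TRUE))
    if (length(hit) == 0) 0 else as.numeric(hit)
  }
  get_number("ty") * 1000 + get_number("trieu")
}

# Made-up example strings, not taken from the scraped data.
example_prices <- c("858 trieu", "1 ty 250 trieu", "2 ty")

parsed <- vapply(example_prices, parse_price_million, numeric(1),
                 USE.NAMES = FALSE)
print(parsed)  # 858 1250 2000
```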

Machine Learning Models for Price Prediction

The traditional approach is to use Linear Regression (or, more generally, a Generalized Linear Model). The models here are built with the h2o package, a library that can be used from both R and Python with very similar syntax. The R code is shown below:

# Activate h2o package for using: 

library(h2o)
h2o.init(nthreads = 2, max_mem_size = "16g")

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         43 minutes 23 seconds 
##     H2O cluster timezone:       Asia/Bangkok 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.42.0.2 
##     H2O cluster version age:    10 months and 20 days 
##     H2O cluster name:           H2O_started_from_R_Admin_dci541 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   14.14 GB 
##     H2O cluster total cores:    12 
##     H2O cluster allowed cores:  2 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 4.3.2 (2023-10-31 ucrt)

h2o.no_progress()

# Convert training data to h2o object: 

train_data <- data_modelling %>% slice(1:20000)

test_data <- data_modelling %>% slice(20001:nrow(data_modelling))

ames_train <- as.h2o(train_data)

ames_test <- as.h2o(test_data %>% select(-carPrice))

# Set the response column to carPrice: 

response <- "carPrice"

# Set the predictor names: 

predictors <- setdiff(colnames(ames_train), c(response, "url", "car_model"))

# Train default GLM: 

h2o.glm(x = predictors, 
        y = response, 
        training_frame = ames_train, 
        seed = 123) -> h2oGLM

# Predict and calculate R2: 

cor(predict(h2oGLM, ames_test) %>% as.vector(), test_data %>% pull(carPrice))^2

## [1] 0.573334

An R² value of 57.33% indicates that this traditional approach has limited predictive capability. Nevertheless, we will use this value as a baseline for evaluating the forecasting performance of other approaches, including machine learning models.
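
The R² reported in this article is the squared Pearson correlation between predicted and actual prices. A common alternative definition is one minus the ratio of the residual sum of squares to the total sum of squares, and the two can differ when predictions are systematically biased. The sketch below uses made-up numbers (not the article's data) in which every prediction is 50 too high: the squared correlation is still 1, while the residual-based R² is lower.

```r
# Hypothetical actual and predicted prices (millions VND), illustration only.
actual    <- c(300, 450, 520, 800, 1200)
predicted <- c(350, 500, 570, 850, 1250)  # every prediction is 50 too high

# Definition used in this article: squared Pearson correlation.
r2_correlation <- cor(predicted, actual)^2

# Alternative definition: 1 - residual sum of squares / total sum of squares.
r2_residual <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)

print(c(correlation = r2_correlation, residual = r2_residual))
```

When comparing models, it helps to state which definition is used, since the squared correlation does not penalize a constant offset or a wrong scale in the predictions.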

Random Forest is a commonly used machine learning algorithm with proven predictive capabilities across various applications. Below is the R code to implement this algorithm:

# Train default Random Forest (RF): 

h2o_rf1 <- h2o.randomForest(x = predictors,
                            y = response, 
                            training_frame = ames_train, 
                            seed = 123)

# Use RF to predict prices: 

predict(h2o_rf1, ames_test) %>% as.vector() -> pricePred

# R-squared: 

cor(pricePred, test_data %>% pull(carPrice))^2

## [1] 0.9649989

An R² value of roughly 96.5% is far higher than the GLM baseline. This result can be interpreted as follows: using input variables such as “kilometers driven,” “manufacturer,” “engine type,” and “year of manufacture,” the Random Forest model produces predictions whose squared correlation with the actual prices is about 0.965 on the test dataset of 2,727 listed cars. Note that R² measures how closely predictions track actual prices; it is not the share of predictions that are exactly correct.
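
Because R² is unitless, it can help to report errors in price units as well when the goal is collateral valuation. The sketch below computes the mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE). The actual prices are the six `carPrice` values from the sample table earlier in this article; the predicted values are invented for illustration and are not output of the Random Forest model.

```r
# Actual prices (millions VND) from the sample table; predictions are
# hypothetical values made up for this illustration.
actual    <- c(858, 550, 315, 790, 395, 525)
predicted <- c(840, 560, 300, 820, 410, 500)

errors <- predicted - actual

mae  <- mean(abs(errors))                  # average miss, in millions VND
rmse <- sqrt(mean(errors^2))               # penalizes large misses more
mape <- mean(abs(errors) / actual) * 100   # average miss, in percent

print(round(c(MAE = mae, RMSE = rmse, MAPE = mape), 2))
```

With the article's objects, the same formulas could be applied to `pricePred` and `test_data %>% pull(carPrice)`.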

We can visualize the predictive capability of the Random Forest on the test dataset using a scatter plot as follows:

# Visualize actuals and predicted prices: 

test_data %>% 
  mutate(pricePred = pricePred %>% round(0)) %>% 
  select(car_model, carPrice, pricePred, everything()) -> test_data

library(ggplot2)

test_data %>% 
  ggplot(aes(x = pricePred, y = carPrice)) + 
  geom_point() + 
  labs(title = "Figure 1: Actual and Predicted Prices by Random Forest", 
       subtitle = "Prices in millions VND", 
       caption = "Source: https://bonbanh.com")

The predictive capability of Random Forest can likely be improved further by tuning the algorithm’s hyperparameters.
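
A grid search is one common way to tune a Random Forest. The following is a sketch under stated assumptions rather than the author's procedure: the candidate values in `rf_hyper_params`, the grid id `rf_grid`, the 5-fold cross-validation, and the wrapper name `tune_rf` are all illustrative choices. The wrapper relies on `h2o.grid()`, `h2o.getGrid()`, and `h2o.getModel()` from the h2o package and only works with a running H2O cluster.

```r
# Candidate hyperparameter values (illustrative assumptions, not tuned results).
rf_hyper_params <- list(
  ntrees    = c(100, 300),
  max_depth = c(10, 20, 30),
  mtries    = c(3, 5, 7)
)

# Hypothetical wrapper around h2o.grid(); needs a running H2O cluster.
tune_rf <- function(train_frame, predictors, response) {
  h2o::h2o.grid(
    algorithm      = "randomForest",
    grid_id        = "rf_grid",
    x              = predictors,
    y              = response,
    training_frame = train_frame,
    hyper_params   = rf_hyper_params,
    nfolds         = 5,    # cross-validate on the training data only
    seed           = 123
  )
  # Rank the candidate models by cross-validated RMSE (lower is better).
  sorted_grid <- h2o::h2o.getGrid(grid_id = "rf_grid",
                                  sort_by = "rmse",
                                  decreasing = FALSE)
  h2o::h2o.getModel(sorted_grid@model_ids[[1]])
}

# Number of model configurations the full grid would train.
n_candidates <- prod(lengths(rf_hyper_params))
print(n_candidates)  # 18
```

With the objects defined earlier, the call would look like `best_rf <- tune_rf(ames_train, predictors, response)`. Selecting the model by cross-validation on the training data keeps the 2,727 test cars untouched for the final comparison.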

Conclusion

Experimental results on the collected data show that machine learning algorithms such as Random Forest have a very good ability to predict the prices of used cars. Accurately forecasting or estimating the market value of cars plays an important role in the lending decisions of financial institutions like banks when these assets are used as collateral for loans.

Machine Learning Models for Used Car Price Prediction

R Data Science Series

Author: Nguyen Chi Dung
