Prediksi Harga Mobil Bekas

Loading Packages

Studi Kasus

CarDekho adalah portal mobil yang membantu penggunanya dalam melakukan pembelian atau penjualan mobil dengan memberikan informasi mengenai harga, spesifikasi, asuransi dan aspek-aspek lain. Dataset yang digunakan berisi informasi tentang mobil bekas beserta spesifikasinya. Informasi yang dimiliki tersebut digunakan untuk membuat model untuk memprediksi harga penjualan mobil.

CarDekho dataset terdiri dari 8128 observasi dengan 13 peubah. Berikut adalah deskripsi dari peubah-peubah dalam dataset.

Peubah Deskripsi
name Nama mobil
year Tahun pembelian mobil
selling_price Harga jual mobil (Rp)
km_driven Jumlah kilometer yang ditempuh mobil (km)
fuel Jenis bahan bakar mobil (CNG/diesel/petrol/LPG)
seller_type Tipe penjual (individual/dealer/trustmark dealer)
transmission Transmisi mobil (automatic/manual)
owner Pemilik sebelumnya (first/second/third/fourth and above owner/test drive car)
mileage Jarak tempuh mobil (kmpl, km/kg)
engine Kapasitas mesin (CC)
max_power Kekuatan mesin (bhp)
torque Torsi mobil (kgm, nm)
seats Kapasitas tempat duduk
car <- read.csv("Car details v3.csv", stringsAsFactors = TRUE)
head(car)
##                            name year selling_price km_driven   fuel seller_type
## 1        Maruti Swift Dzire VDI 2014        450000    145500 Diesel  Individual
## 2  Skoda Rapid 1.5 TDI Ambition 2014        370000    120000 Diesel  Individual
## 3      Honda City 2017-2020 EXi 2006        158000    140000 Petrol  Individual
## 4     Hyundai i20 Sportz Diesel 2010        225000    127000 Diesel  Individual
## 5        Maruti Swift VXI BSIII 2007        130000    120000 Petrol  Individual
## 6 Hyundai Xcent 1.2 VTVT E Plus 2017        440000     45000 Petrol  Individual
##   transmission        owner    mileage  engine  max_power
## 1       Manual  First Owner  23.4 kmpl 1248 CC     74 bhp
## 2       Manual Second Owner 21.14 kmpl 1498 CC 103.52 bhp
## 3       Manual  Third Owner  17.7 kmpl 1497 CC     78 bhp
## 4       Manual  First Owner  23.0 kmpl 1396 CC     90 bhp
## 5       Manual  First Owner  16.1 kmpl 1298 CC   88.2 bhp
## 6       Manual  First Owner 20.14 kmpl 1197 CC  81.86 bhp
##                     torque seats
## 1           190Nm@ 2000rpm     5
## 2      250Nm@ 1500-2500rpm     5
## 3    12.7@ 2,700(kgm@ rpm)     5
## 4 22.4 kgm at 1750-2750rpm     5
## 5    11.5@ 4,500(kgm@ rpm)     5
## 6        113.75nm@ 4000rpm     5
skim(car)
Data summary
Name car
Number of rows 8128
Number of columns 13
_______________________
Column type frequency:
factor 9
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
name 0 1 FALSE 2058 Mar: 129, Mar: 82, Mar: 71, BMW: 62
fuel 0 1 FALSE 4 Die: 4402, Pet: 3631, CNG: 57, LPG: 38
seller_type 0 1 FALSE 3 Ind: 6766, Dea: 1126, Tru: 236
transmission 0 1 FALSE 2 Man: 7078, Aut: 1050
owner 0 1 FALSE 5 Fir: 5289, Sec: 2105, Thi: 555, Fou: 174
mileage 0 1 FALSE 394 18.: 225, emp: 221, 19.: 173, 18.: 164
engine 0 1 FALSE 122 124: 1017, 119: 832, 998: 453, 796: 444
max_power 0 1 FALSE 323 74 : 377, 81.: 220, emp: 215, 88.: 204
torque 0 1 FALSE 442 190: 530, 200: 445, 90N: 405, 113: 223

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2013.80 4.04 1983 2011 2015 2017 2020 ▁▁▁▃▇
selling_price 0 1.00 638271.81 806253.40 29999 254999 450000 675000 10000000 ▇▁▁▁▁
km_driven 0 1.00 69819.51 56550.55 1 35000 60000 98000 2360457 ▇▁▁▁▁
seats 221 0.97 5.42 0.96 2 5 5 5 14 ▁▇▂▁▁

Berdasarkan output di atas, terdapat kondisi data yang harus ditangani terlebih dahulu sebelum melakukan pemodelan, antara lain:

  • Data cleansing dan konversi peubah numerik sehingga tersimpan sebagai numerik. Misalnya pada peubah engine harus dihilangkan unit CC dan konversi menjadi numerik. Demikian juga dengan peubah max_power, torque dan mileage.
  • Penanganan data hilang (missing value)

Data Preprocessing

  • Menghilangkan satuan pada peubah engine, max_power, torque dan mileage sekaligus mengubah menjadi numerik
  • Membuat peubah age atau umur mobil sejak diproduksi (year) hingga dijual
  • Mengubah peubah owner menjadi ordinal/ordered factor
  • Mengeliminasi peubah name dan year
  • Menghilangkan mobil-mobil berbahan bakar CNG dan LPG.
car2 <- car %>% mutate(engine    = as.numeric(str_remove(engine, " CC")),
                       max_power = as.numeric(str_remove(max_power, " bhp")),
                       torque    = as.numeric(str_extract(torque, "[0-9.]+")),
                       mileage   = as.numeric(str_extract(mileage, "[0-9.]+")),
                       owner     = factor(owner, ordered = TRUE,
                                          levels = c("Test Drive Car", 
                                                     "First Owner", 
                                                     "Second Owner", 
                                                     "Third Owner", 
                                                     "Fourth & Above Owner")),
                       age       = 2024 - year) %>% 
  select(-name, -year) %>%
  filter(!fuel %in% c('CNG','LPG'))
head(car2)
##   selling_price km_driven   fuel seller_type transmission        owner mileage
## 1        450000    145500 Diesel  Individual       Manual  First Owner   23.40
## 2        370000    120000 Diesel  Individual       Manual Second Owner   21.14
## 3        158000    140000 Petrol  Individual       Manual  Third Owner   17.70
## 4        225000    127000 Diesel  Individual       Manual  First Owner   23.00
## 5        130000    120000 Petrol  Individual       Manual  First Owner   16.10
## 6        440000     45000 Petrol  Individual       Manual  First Owner   20.14
##   engine max_power torque seats age
## 1   1248     74.00 190.00     5  10
## 2   1498    103.52 250.00     5  10
## 3   1497     78.00  12.70     5  18
## 4   1396     90.00  22.40     5  14
## 5   1298     88.20  11.50     5  17
## 6   1197     81.86 113.75     5   7
skim(car2)
Data summary
Name car2
Number of rows 8033
Number of columns 12
_______________________
Column type frequency:
factor 4
numeric 8
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
fuel 0 1 FALSE 2 Die: 4402, Pet: 3631, CNG: 0, LPG: 0
seller_type 0 1 FALSE 3 Ind: 6673, Dea: 1124, Tru: 236
transmission 0 1 FALSE 2 Man: 6983, Aut: 1050
owner 0 1 TRUE 5 Fir: 5238, Sec: 2073, Thi: 547, Fou: 170

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
selling_price 0 1.00 642736.12 809863.53 29999.0 260000.00 450000.0 680000.00 10000000 ▇▁▁▁▁
km_driven 0 1.00 69738.82 56643.61 1000.0 35000.00 60000.0 98000.00 2360457 ▇▁▁▁▁
mileage 214 0.97 19.39 4.00 0.0 16.78 19.3 22.32 42 ▁▃▇▁▁
engine 214 0.97 1463.09 504.66 624.0 1197.00 1248.0 1582.00 3604 ▇▇▂▂▁
max_power 208 0.97 91.86 35.85 0.0 69.00 82.4 102.00 400 ▇▇▁▁▁
torque 214 0.97 169.32 97.32 4.8 104.00 160.0 204.00 789 ▇▆▂▁▁
seats 214 0.97 5.42 0.96 2.0 5.00 5.0 5.00 14 ▁▇▂▁▁
age 0 1.00 10.18 4.03 4.0 7.00 9.0 13.00 41 ▇▃▁▁▁

Spliting Data

Data dibagi menjadi dua kelompok, yaitu data train (latih) dan data test (uji).

set.seed(123)
train_idx <- createDataPartition(car2$selling_price, p = 0.7, list=FALSE)
trainData <- car2[train_idx,]
testData <- car2[-train_idx,]

Eksplorasi Data

Eksplorasi data dilakukan terhadap data train.

Sebaran Peubah Respon: selling_price

ggplot(trainData, aes_string(x = "selling_price")) + 
  geom_histogram(color = "black", bins = 30) +
  ggtitle("Sebaran selling_price")

Hubungan peubah respons dan peubah penjelas numerik

plot_numeric_features <- function(x){
  ggplot(trainData, aes_string(x, "selling_price")) +
    geom_point() +
    geom_smooth(method = "loess", se = F) +
    scale_x_continuous(labels = scales::comma) +
    ylim(0, NA)
}

plot_grid(
  plot_numeric_features("km_driven"),
  plot_numeric_features("mileage"),
  plot_numeric_features("engine"),
  plot_numeric_features("max_power"),
  plot_numeric_features("torque"),
  plot_numeric_features("seats"),
  plot_numeric_features("age"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Dari plot di atas, terlihat adanya outlier pada peubah km_driven, torque, mileage, max_power

Sebaran data berdasarkan peubah penjelas kategorik

count_categoric_features <- function(x){
  ggplot(trainData, aes_string(x = x)) +
    geom_bar() + 
    coord_flip()
}

plot_grid(
  count_categoric_features("fuel"),
  count_categoric_features("seller_type"),
  count_categoric_features("transmission"),
  count_categoric_features("owner"))

Hubungan peubah respons dan peubah penjelas kategorik

plot_categoric_features <- function(x){
  ggplot(trainData, aes_string(x, "selling_price")) +
    geom_boxplot() +
    coord_flip()
}
plot_grid(
  plot_categoric_features("fuel"),
  plot_categoric_features("seller_type"),
  plot_categoric_features("transmission"),
  plot_categoric_features("owner"))

Outlier dan Missing Value

Penanganan Outlier/Pencilan pada data train

Dari hasil eksplorasi di atas, terlihat adanya pencilan/outlier pada peubah km_driven, torque, mileage, max_power. Salah satu penangan pencilan adalah dengan metode capping.

trainData$km_driven[trainData$km_driven > 500000] <- 500000
trainData$torque[trainData$torque > 640] <- 640
trainData$mileage[trainData$mileage > 30] <- 30
trainData$mileage[trainData$mileage < 7] <- 7
trainData$max_power[trainData$max_power > 300] <- 300

Setelah caping

plot_grid(
  plot_numeric_features("km_driven"),
  plot_numeric_features("mileage"),
  plot_numeric_features("engine"),
  plot_numeric_features("max_power"),
  plot_numeric_features("torque"),
  plot_numeric_features("seats"),
  plot_numeric_features("age"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Penanganan Missing Value pada Data Train

colSums(is.na(trainData))
## selling_price     km_driven          fuel   seller_type  transmission 
##             0             0             0             0             0 
##         owner       mileage        engine     max_power        torque 
##             0           153           153           151           153 
##         seats           age 
##           153             0

Penanganan missing value dilakukan dengan imputasi menggunakan nilai median.

tD      <- trainData[,-c(1:6)]
tD2     <- trainData[,-c(7:12)]
imputer <- compute(tD)
tD      <- impute(tD, object=imputer)
trainData <- cbind(tD2,tD)
colSums(is.na(trainData))
## selling_price     km_driven          fuel   seller_type  transmission 
##             0             0             0             0             0 
##         owner       mileage        engine     max_power        torque 
##             0             0             0             0             0 
##         seats           age 
##             0             0

Penanganan Outlier dan Missing Value pada Data Test

Penanganan outlier dan missing value pada data testing menggunakan cara yang sama dengan yang dilakukan terhadap data training.

testData$km_driven[testData$km_driven > 500000] <- 500000
testData$torque[testData$torque > 640] <- 640
testData$mileage[testData$mileage > 30] <- 30
testData$mileage[testData$mileage < 7] <- 7
testData$max_power[testData$max_power > 300] <- 300


tsD      <- testData[,-c(1:6)]
tsD2     <- testData[,-c(7:12)]
imputer  <- compute(tsD)
tsD      <- impute(tsD, object=imputer)
testData <- cbind(tsD2,tsD)

Pemodelan menggunakan Random Forest

Model prediksi menggunakan metode Random Forest. Pemilihan tuning parameter dilakukan menggunakan cross validation atau validasi silang terhadap data testing.

K-fold cross validation adalah salah satu teknik validasi untuk mencari tuning parameter terbaik sekaligus mengevaluasi kinerja model. Pada studi kasus ini digunakan 5-fold cross validation. Data dipartisi secara acak ke dalam lima subset data. Secara bergantian masing-masing subset akan dijadikan sebagai data testing, sementara empat subset data lainnya sebagai data training.

fitControl <- trainControl(
  method = "cv",
  number = 5,
  returnResamp = "all")

Tuning Parameter

Opsi tuneLength pada fungsi caret::train akan memilih sejumlah tuning parameter atau kombinasi tuning parameter yang dianggap paling tepat sesuai dengan metode yang dipilih dan data training yang diberikan.

Cross Validation

rf <- train(selling_price ~ ., 
            data = trainData,
            method = 'ranger',
            tuneLength = 10, 
            importance = "impurity",
            trControl = fitControl,
            verbose = FALSE)
rf
## Random Forest 
## 
## 5625 samples
##   11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 4499, 4500, 4501, 4499, 4501 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   RMSE      Rsquared   MAE      
##    2    variance    241005.5  0.9264671  125799.63
##    2    extratrees  336772.4  0.8726877  178363.96
##    3    variance    188943.1  0.9443052   89735.83
##    3    extratrees  245618.5  0.9169024  128666.42
##    5    variance    169001.2  0.9534225   74627.06
##    5    extratrees  187505.3  0.9435589   86816.26
##    7    variance    162856.1  0.9567830   72548.79
##    7    extratrees  173905.5  0.9505257   78484.15
##    8    variance    161153.3  0.9577470   72167.34
##    8    extratrees  172279.9  0.9514676   77399.04
##   10    variance    160351.9  0.9581032   72087.95
##   10    extratrees  169560.4  0.9528464   76357.57
##   12    variance    157710.5  0.9597369   71949.59
##   12    extratrees  166477.0  0.9544609   75242.12
##   13    variance    157425.0  0.9599501   72109.36
##   13    extratrees  165625.6  0.9552443   75325.26
##   15    variance    156857.3  0.9603160   72298.27
##   15    extratrees  163671.4  0.9561640   74706.63
##   17    variance    157579.8  0.9601560   72842.09
##   17    extratrees  162415.2  0.9569515   74292.89
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 15, splitrule = variance
##  and min.node.size = 5.

Hasil validasi silang ditampilkan pada plot berikut:

plot(rf, main = "5-Fold Cross Validation Random Forest")

rf_best <- rf$bestTune
rf_best
##    mtry splitrule min.node.size
## 17   15  variance             5

Berdasarkan output di atas, tuning parameter terbaik adalah mtry = 17, splitrule = variance dan min.node.size = 5, yang memberikan RMSE = 156654.2, R-squared = 0.9606900 dan MAE =72770.01.

Re-Fit Model Menggunakan Tuning Parameter Terbaik

Re-fit model terhadap seluruh data testing dengan menggunakan tuning parameter terbaik yang diperoleh pada tahap sebelumnya:

rf <- train(selling_price ~ ., 
            data = trainData,
            method = 'ranger',
            tuneGrid  = rf_best, 
            importance = "impurity",
            verbose = FALSE)
rf
## Random Forest 
## 
## 5625 samples
##   11 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 5625, 5625, 5625, 5625, 5625, 5625, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   171937.4  0.9539336  76769.11
## 
## Tuning parameter 'mtry' was held constant at a value of 15
## Tuning
##  parameter 'splitrule' was held constant at a value of variance
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
rf_result <- rf$results
rf_result
##   mtry splitrule min.node.size     RMSE  Rsquared      MAE   RMSESD RsquaredSD
## 1   15  variance             5 171937.4 0.9539336 76769.11 26171.02 0.01268921
##      MAESD
## 1 2977.652

Diperoleh model dengan RMSE = 173709.3, R-Squared = 0.9529528, MAE = 77499.92

Evaluasi Terhadap Data Test

Untuk menguji kinerja model dalam memprediksi data baru, dilakukan evaluasi terhadap data testing:

eval_test_data <- function(model){
  pred <- predict(model, newdata = testData)
  mae <- MAE(testData$selling_price, pred)
  rmse <- RMSE(testData$selling_price, pred)
  R2 <- R2_Score(testData$selling_price, pred)
  return(c(RMSE = rmse, R_Squared = R2, MAE = mae))
}
rf_eval <- eval_test_data(rf)
rf_eval
##         RMSE    R_Squared          MAE 
## 1.298815e+05 9.751060e-01 6.713350e+04

Diperoleh model RMSE = 1.298583e+05, R-Squared = 9.751528e-01 , MAE = 6.724087e+04.

Variable Importance

plot(varImp(rf), 
     main = "Random Forest Variable Importance" )