Prediksi Harga Mobil Bekas
Loading Packages
Studi Kasus
CarDekho adalah portal mobil yang membantu penggunanya dalam melakukan pembelian atau penjualan mobil dengan memberikan informasi mengenai harga, spesifikasi, asuransi dan aspek-aspek lain. Dataset yang digunakan berisi informasi tentang mobil bekas beserta spesifikasinya. Informasi yang dimiliki tersebut digunakan untuk membuat model untuk memprediksi harga penjualan mobil.
CarDekho dataset terdiri dari 8128 observasi dengan 13 peubah. Berikut adalah deskripsi dari peubah-peubah dalam dataset.
| Peubah | Deskripsi |
|---|---|
| name | Nama mobil |
| year | Tahun pembelian mobil |
| selling_price | Harga jual mobil (Rp) |
| km_driven | Jumlah kilometer yang ditempuh mobil (km) |
| fuel | Jenis bahan bakar mobil (CNG/diesel/petrol/LPG) |
| seller_type | Tipe penjual (individual/dealer/trustmark dealer) |
| transmission | Transmisi mobil (automatic/manual) |
| owner | Pemilik sebelumnya (first/second/third/fourth and above owner/test drive car) |
| mileage | Jarak tempuh mobil (kmpl, km/kg) |
| engine | Kapasitas mesin (CC) |
| max_power | Kekuatan mesin (bhp) |
| torque | Torsi mobil (kgm, nm) |
| seats | Kapasitas tempat duduk |
## name year selling_price km_driven fuel seller_type
## 1 Maruti Swift Dzire VDI 2014 450000 145500 Diesel Individual
## 2 Skoda Rapid 1.5 TDI Ambition 2014 370000 120000 Diesel Individual
## 3 Honda City 2017-2020 EXi 2006 158000 140000 Petrol Individual
## 4 Hyundai i20 Sportz Diesel 2010 225000 127000 Diesel Individual
## 5 Maruti Swift VXI BSIII 2007 130000 120000 Petrol Individual
## 6 Hyundai Xcent 1.2 VTVT E Plus 2017 440000 45000 Petrol Individual
## transmission owner mileage engine max_power
## 1 Manual First Owner 23.4 kmpl 1248 CC 74 bhp
## 2 Manual Second Owner 21.14 kmpl 1498 CC 103.52 bhp
## 3 Manual Third Owner 17.7 kmpl 1497 CC 78 bhp
## 4 Manual First Owner 23.0 kmpl 1396 CC 90 bhp
## 5 Manual First Owner 16.1 kmpl 1298 CC 88.2 bhp
## 6 Manual First Owner 20.14 kmpl 1197 CC 81.86 bhp
## torque seats
## 1 190Nm@ 2000rpm 5
## 2 250Nm@ 1500-2500rpm 5
## 3 12.7@ 2,700(kgm@ rpm) 5
## 4 22.4 kgm at 1750-2750rpm 5
## 5 11.5@ 4,500(kgm@ rpm) 5
## 6 113.75nm@ 4000rpm 5
| Name | car |
| Number of rows | 8128 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| factor | 9 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| name | 0 | 1 | FALSE | 2058 | Mar: 129, Mar: 82, Mar: 71, BMW: 62 |
| fuel | 0 | 1 | FALSE | 4 | Die: 4402, Pet: 3631, CNG: 57, LPG: 38 |
| seller_type | 0 | 1 | FALSE | 3 | Ind: 6766, Dea: 1126, Tru: 236 |
| transmission | 0 | 1 | FALSE | 2 | Man: 7078, Aut: 1050 |
| owner | 0 | 1 | FALSE | 5 | Fir: 5289, Sec: 2105, Thi: 555, Fou: 174 |
| mileage | 0 | 1 | FALSE | 394 | 18.: 225, emp: 221, 19.: 173, 18.: 164 |
| engine | 0 | 1 | FALSE | 122 | 124: 1017, 119: 832, 998: 453, 796: 444 |
| max_power | 0 | 1 | FALSE | 323 | 74 : 377, 81.: 220, emp: 215, 88.: 204 |
| torque | 0 | 1 | FALSE | 442 | 190: 530, 200: 445, 90N: 405, 113: 223 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1.00 | 2013.80 | 4.04 | 1983 | 2011 | 2015 | 2017 | 2020 | ▁▁▁▃▇ |
| selling_price | 0 | 1.00 | 638271.81 | 806253.40 | 29999 | 254999 | 450000 | 675000 | 10000000 | ▇▁▁▁▁ |
| km_driven | 0 | 1.00 | 69819.51 | 56550.55 | 1 | 35000 | 60000 | 98000 | 2360457 | ▇▁▁▁▁ |
| seats | 221 | 0.97 | 5.42 | 0.96 | 2 | 5 | 5 | 5 | 14 | ▁▇▂▁▁ |
Berdasarkan output di atas, terdapat kondisi data yang harus ditangani terlebih dahulu sebelum melakukan pemodelan, antara lain:
- Data cleansing dan konversi peubah numerik sehingga tersimpan
sebagai numerik. Misalnya pada peubah engine harus dihilangkan unit
CCdan konversi menjadi numerik. Demikian juga dengan peubahmax_power,torquedanmileage. - Penanganan data hilang (missing value)
Data Preprocessing
- Menghilangkan satuan pada peubah
engine,max_power,torquedanmileagesekaligus mengubah menjadi numerik - Membuat peubah
ageatau umur mobil sejak diproduksi (year) hingga dijual - Mengubah peubah
ownermenjadi ordinal/ordered factor - Mengeliminasi peubah
namedanyear - Menghilangkan mobil-mobil berbahan bakar CNG dan LPG.
car2 <- car %>% mutate(engine = as.numeric(str_remove(engine, " CC")),
max_power = as.numeric(str_remove(max_power, " bhp")),
torque = as.numeric(str_extract(torque, "[0-9.]+")),
mileage = as.numeric(str_extract(mileage, "[0-9.]+")),
owner = factor(owner, ordered = TRUE,
levels = c("Test Drive Car",
"First Owner",
"Second Owner",
"Third Owner",
"Fourth & Above Owner")),
age = 2024 - year) %>%
select(-name, -year) %>%
filter(!fuel %in% c('CNG','LPG'))
head(car2)## selling_price km_driven fuel seller_type transmission owner mileage
## 1 450000 145500 Diesel Individual Manual First Owner 23.40
## 2 370000 120000 Diesel Individual Manual Second Owner 21.14
## 3 158000 140000 Petrol Individual Manual Third Owner 17.70
## 4 225000 127000 Diesel Individual Manual First Owner 23.00
## 5 130000 120000 Petrol Individual Manual First Owner 16.10
## 6 440000 45000 Petrol Individual Manual First Owner 20.14
## engine max_power torque seats age
## 1 1248 74.00 190.00 5 10
## 2 1498 103.52 250.00 5 10
## 3 1497 78.00 12.70 5 18
## 4 1396 90.00 22.40 5 14
## 5 1298 88.20 11.50 5 17
## 6 1197 81.86 113.75 5 7
| Name | car2 |
| Number of rows | 8033 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| factor | 4 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| fuel | 0 | 1 | FALSE | 2 | Die: 4402, Pet: 3631, CNG: 0, LPG: 0 |
| seller_type | 0 | 1 | FALSE | 3 | Ind: 6673, Dea: 1124, Tru: 236 |
| transmission | 0 | 1 | FALSE | 2 | Man: 6983, Aut: 1050 |
| owner | 0 | 1 | TRUE | 5 | Fir: 5238, Sec: 2073, Thi: 547, Fou: 170 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| selling_price | 0 | 1.00 | 642736.12 | 809863.53 | 29999.0 | 260000.00 | 450000.0 | 680000.00 | 10000000 | ▇▁▁▁▁ |
| km_driven | 0 | 1.00 | 69738.82 | 56643.61 | 1000.0 | 35000.00 | 60000.0 | 98000.00 | 2360457 | ▇▁▁▁▁ |
| mileage | 214 | 0.97 | 19.39 | 4.00 | 0.0 | 16.78 | 19.3 | 22.32 | 42 | ▁▃▇▁▁ |
| engine | 214 | 0.97 | 1463.09 | 504.66 | 624.0 | 1197.00 | 1248.0 | 1582.00 | 3604 | ▇▇▂▂▁ |
| max_power | 208 | 0.97 | 91.86 | 35.85 | 0.0 | 69.00 | 82.4 | 102.00 | 400 | ▇▇▁▁▁ |
| torque | 214 | 0.97 | 169.32 | 97.32 | 4.8 | 104.00 | 160.0 | 204.00 | 789 | ▇▆▂▁▁ |
| seats | 214 | 0.97 | 5.42 | 0.96 | 2.0 | 5.00 | 5.0 | 5.00 | 14 | ▁▇▂▁▁ |
| age | 0 | 1.00 | 10.18 | 4.03 | 4.0 | 7.00 | 9.0 | 13.00 | 41 | ▇▃▁▁▁ |
Spliting Data
Data dibagi menjadi dua kelompok, yaitu data train (latih) dan data test (uji).
Eksplorasi Data
Eksplorasi data dilakukan terhadap data train.
Sebaran Peubah Respon: selling_price
ggplot(trainData, aes_string(x = "selling_price")) +
geom_histogram(color = "black", bins = 30) +
ggtitle("Sebaran selling_price")Hubungan peubah respons dan peubah penjelas numerik
plot_numeric_features <- function(x){
ggplot(trainData, aes_string(x, "selling_price")) +
geom_point() +
geom_smooth(method = "loess", se = F) +
scale_x_continuous(labels = scales::comma) +
ylim(0, NA)
}
plot_grid(
plot_numeric_features("km_driven"),
plot_numeric_features("mileage"),
plot_numeric_features("engine"),
plot_numeric_features("max_power"),
plot_numeric_features("torque"),
plot_numeric_features("seats"),
plot_numeric_features("age"))## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Dari plot di atas, terlihat adanya outlier pada peubah
km_driven, torque, mileage,
max_power
Sebaran data berdasarkan peubah penjelas kategorik
count_categoric_features <- function(x){
ggplot(trainData, aes_string(x = x)) +
geom_bar() +
coord_flip()
}
plot_grid(
count_categoric_features("fuel"),
count_categoric_features("seller_type"),
count_categoric_features("transmission"),
count_categoric_features("owner"))Hubungan peubah respons dan peubah penjelas kategorik
plot_categoric_features <- function(x){
ggplot(trainData, aes_string(x, "selling_price")) +
geom_boxplot() +
coord_flip()
}
plot_grid(
plot_categoric_features("fuel"),
plot_categoric_features("seller_type"),
plot_categoric_features("transmission"),
plot_categoric_features("owner"))Outlier dan Missing Value
Penanganan Outlier/Pencilan pada data train
Dari hasil eksplorasi di atas, terlihat adanya pencilan/outlier pada
peubah km_driven, torque,
mileage, max_power. Salah satu penangan
pencilan adalah dengan metode capping.
trainData$km_driven[trainData$km_driven > 500000] <- 500000
trainData$torque[trainData$torque > 640] <- 640
trainData$mileage[trainData$mileage > 30] <- 30
trainData$mileage[trainData$mileage < 7] <- 7
trainData$max_power[trainData$max_power > 300] <- 300Setelah caping
plot_grid(
plot_numeric_features("km_driven"),
plot_numeric_features("mileage"),
plot_numeric_features("engine"),
plot_numeric_features("max_power"),
plot_numeric_features("torque"),
plot_numeric_features("seats"),
plot_numeric_features("age"))## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Penanganan Missing Value pada Data Train
## selling_price km_driven fuel seller_type transmission
## 0 0 0 0 0
## owner mileage engine max_power torque
## 0 153 153 151 153
## seats age
## 153 0
Penanganan missing value dilakukan dengan imputasi menggunakan nilai median.
tD <- trainData[,-c(1:6)]
tD2 <- trainData[,-c(7:12)]
imputer <- compute(tD)
tD <- impute(tD, object=imputer)
trainData <- cbind(tD2,tD)
colSums(is.na(trainData))## selling_price km_driven fuel seller_type transmission
## 0 0 0 0 0
## owner mileage engine max_power torque
## 0 0 0 0 0
## seats age
## 0 0
Penanganan Outlier dan Missing Value pada Data Test
Penanganan outlier dan missing value pada data testing menggunakan cara yang sama dengan yang dilakukan terhadap data training.
testData$km_driven[testData$km_driven > 500000] <- 500000
testData$torque[testData$torque > 640] <- 640
testData$mileage[testData$mileage > 30] <- 30
testData$mileage[testData$mileage < 7] <- 7
testData$max_power[testData$max_power > 300] <- 300
tsD <- testData[,-c(1:6)]
tsD2 <- testData[,-c(7:12)]
imputer <- compute(tsD)
tsD <- impute(tsD, object=imputer)
testData <- cbind(tsD2,tsD)Pemodelan menggunakan Random Forest
Model prediksi menggunakan metode Random Forest. Pemilihan tuning parameter dilakukan menggunakan cross validation atau validasi silang terhadap data testing.
K-fold cross validation adalah salah satu teknik validasi untuk mencari tuning parameter terbaik sekaligus mengevaluasi kinerja model. Pada studi kasus ini digunakan 5-fold cross validation. Data dipartisi secara acak ke dalam lima subset data. Secara bergantian masing-masing subset akan dijadikan sebagai data testing, sementara empat subset data lainnya sebagai data training.
Tuning Parameter
Opsi tuneLength pada fungsi caret::train
akan memilih sejumlah tuning parameter atau kombinasi tuning parameter
yang dianggap paling tepat sesuai dengan metode yang dipilih dan data
training yang diberikan.
Cross Validation
rf <- train(selling_price ~ .,
data = trainData,
method = 'ranger',
tuneLength = 10,
importance = "impurity",
trControl = fitControl,
verbose = FALSE)
rf## Random Forest
##
## 5625 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 4499, 4500, 4501, 4499, 4501
## Resampling results across tuning parameters:
##
## mtry splitrule RMSE Rsquared MAE
## 2 variance 241005.5 0.9264671 125799.63
## 2 extratrees 336772.4 0.8726877 178363.96
## 3 variance 188943.1 0.9443052 89735.83
## 3 extratrees 245618.5 0.9169024 128666.42
## 5 variance 169001.2 0.9534225 74627.06
## 5 extratrees 187505.3 0.9435589 86816.26
## 7 variance 162856.1 0.9567830 72548.79
## 7 extratrees 173905.5 0.9505257 78484.15
## 8 variance 161153.3 0.9577470 72167.34
## 8 extratrees 172279.9 0.9514676 77399.04
## 10 variance 160351.9 0.9581032 72087.95
## 10 extratrees 169560.4 0.9528464 76357.57
## 12 variance 157710.5 0.9597369 71949.59
## 12 extratrees 166477.0 0.9544609 75242.12
## 13 variance 157425.0 0.9599501 72109.36
## 13 extratrees 165625.6 0.9552443 75325.26
## 15 variance 156857.3 0.9603160 72298.27
## 15 extratrees 163671.4 0.9561640 74706.63
## 17 variance 157579.8 0.9601560 72842.09
## 17 extratrees 162415.2 0.9569515 74292.89
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 15, splitrule = variance
## and min.node.size = 5.
Hasil validasi silang ditampilkan pada plot berikut:
## mtry splitrule min.node.size
## 17 15 variance 5
Berdasarkan output di atas, tuning parameter terbaik adalah
mtry = 17, splitrule = variance dan
min.node.size = 5, yang memberikan
RMSE = 156654.2, R-squared = 0.9606900 dan
MAE =72770.01.
Re-Fit Model Menggunakan Tuning Parameter Terbaik
Re-fit model terhadap seluruh data testing dengan menggunakan tuning parameter terbaik yang diperoleh pada tahap sebelumnya:
rf <- train(selling_price ~ .,
data = trainData,
method = 'ranger',
tuneGrid = rf_best,
importance = "impurity",
verbose = FALSE)
rf## Random Forest
##
## 5625 samples
## 11 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 5625, 5625, 5625, 5625, 5625, 5625, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 171937.4 0.9539336 76769.11
##
## Tuning parameter 'mtry' was held constant at a value of 15
## Tuning
## parameter 'splitrule' was held constant at a value of variance
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
## mtry splitrule min.node.size RMSE Rsquared MAE RMSESD RsquaredSD
## 1 15 variance 5 171937.4 0.9539336 76769.11 26171.02 0.01268921
## MAESD
## 1 2977.652
Diperoleh model dengan RMSE = 173709.3, R-Squared = 0.9529528, MAE = 77499.92
Evaluasi Terhadap Data Test
Untuk menguji kinerja model dalam memprediksi data baru, dilakukan evaluasi terhadap data testing:
eval_test_data <- function(model){
pred <- predict(model, newdata = testData)
mae <- MAE(testData$selling_price, pred)
rmse <- RMSE(testData$selling_price, pred)
R2 <- R2_Score(testData$selling_price, pred)
return(c(RMSE = rmse, R_Squared = R2, MAE = mae))
}## RMSE R_Squared MAE
## 1.298815e+05 9.751060e-01 6.713350e+04
Diperoleh model RMSE = 1.298583e+05, R-Squared = 9.751528e-01 , MAE = 6.724087e+04.