The analysis loads the following R packages: janitor, dplyr, ggplot2, reshape2, the tidyverse (forcats, stringr, lubridate, tibble, purrr, tidyr, readr), randomForest, xgboost, caret (with lattice), Matrix, ranger, ROCR, and MLmetrics.
| Aspect | Description |
|---|---|
| Source | Historical genset deployment records from Malaysia’s national electricity utility’s fleet and dispatch system. |
| Time Period | 1st January to 31st December 2024 |
| Geographical Coverage | Peninsular Malaysia: All 13 states and over 140 districts (e.g., Johor Bahru, Subang Jaya) |
| Total Records | 20,022 unique deployment jobs |
| Total Fields | 27 columns in a single Excel sheet (.xlsx) |
| Reliability and Quality | High reliability: Real-time capture by field engineers, validated for audits. Core fields (e.g., district, job type, dates, capacity, mileage, hours) have <1% missing values; minor misses in non-critical fields; no duplicates |
The data cleaning methodology follows a structured computational workflow executed in the R environment to ensure complete reproducibility and auditability of every transformation.

### Structural Normalization and Cleaning

Metadata is standardized using the janitor package, converting all headers to lowercase snake_case. This resolves the irregular spacing and mixed casing that cause errors in attribute calls. For example, “Total Mileage (KM)” becomes “total_mileage_km”, creating a predictable schema for automated processing across platforms.

### Strategic Imputation of Missing Values

To handle semantic noise, diverse null placeholders (e.g., dashes, “N/A”) are mapped to formal nulls during ingestion. For key identifiers such as sales order numbers, missing entries are imputed as “MISSING” rather than deleted, preserving operational context and allowing models to distinguish truly missing data from non-applicable cases.

### Temporal Harmonization and Feature Engineering

A major cleaning step involves reconstructing chronology, since the raw data splits dates and times into character strings unsuitable for calculation. Using lubridate, these fragments are parsed into ISO 8601 date objects and durations, then combined into four new datetime features (e.g., datetime_tiba, datetime_pasang) to enable precise lead-time and service-duration analysis.

### Textual Scrubbing and Sanitization

To preserve the integrity of categorical analysis, all character columns are sanitized by enforcing UTF-8 encoding and trimming whitespace to prevent phantom mismatches. These refinements ensure accurate text mining and grouping in job description fields.

### Results After Cleaning

Applying these cleaning protocols produces a more robust dataset with 31 variables, including the new temporal features. No duplicate records were found, so record counts remain consistent, while character-based timestamps are converted into proper datetime objects, yielding a clean, analysis-ready output.
| Feature | Pre-Cleaning State | Post-Cleaning State |
|---|---|---|
| Headers | Mixed case/spaces (“Total Mileage (KM)”) | snake_case (“total_mileage_km”) |
| Missing Values | Mixed placeholders (“-”, “null”) | Standardized NA or “MISSING” |
| Date Format | Non-standard (e.g., “20/10/2024”) | ISO 8601 (2024-10-20) |
| Temporal Data | Character strings | Functional Datetime objects |
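For instance, the header normalization can be previewed on a single raw header using janitor’s make_clean_names(), the helper behind clean_names() (a minimal illustration, separate from the pipeline below):

library(janitor)
# A raw header with mixed case, spaces, and parentheses becomes snake_case
make_clean_names("Total Mileage (KM)")
# Expected result: "total_mileage_km"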
####################################################
# 1. Import Raw Dataset
# - Standardize common missing-value strings on import
####################################################
genset <- read.csv(
"Y:/data/MNC_UM/WQD7001/group project/GA2/rawdata/Raw Data Genset (2024).csv",
na.strings = c("", " ", "NA", "N/A", "-", "null")
)
#genset <- read_excel("Y:/data/MNC_UM/WQD7001/group project/GA2/rawdata/Raw Data Genset (2024).xlsx")
# First quick check
str(genset)
## 'data.frame': 20021 obs. of 27 variables:
## $ No : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Jenis.Kerja : chr "Breakdown" "Breakdown" "Shutdown" "Standby" ...
## $ Nombor.Dokumen : chr "ERQ0056940" "ERQ0052522" "PRQ0090432" "PRQ0090428" ...
## $ GenSet : chr "WA1465J(C6)" "VJU4414(B25)" "WB1077S(B20)" "SGT-D4" ...
## $ Negeri : chr "JOHOR" "KUALA LUMPUR" "SELANGOR" "TERENGGANU" ...
## $ Daerah : chr "PONTIAN" "KL SELATAN" "SUBANG JAYA" "KUALA TERENGGANU" ...
## $ Lokasi.Kerja : chr "PE TAMAN SUI" "CHAN SOW LIN NO. 2 " "PPU TROPICANA METROPARK SUBANG HI-TECH" "SJL PULAU REDANG" ...
## $ Request.Kapasiti : int 300 300 300 500 500 300 125 300 500 1000 ...
## $ Tarikh.Diperlukan : chr "20/10/2024" "1/1/2024" "1/1/2024" "1/1/2024" ...
## $ Masa.Diperlukan : chr "10:00" "8:00" "8:00" "0:00" ...
## $ Tarikh.Tiba : chr "20/10/2024" "1/1/2024" "1/1/2024" "1/1/2024" ...
## $ Masa.Tiba : chr "10:02" "8:00" "8:00" "0:00" ...
## $ Tarikh.Pasang : chr "20/10/2024" "1/1/2024" "1/1/2024" "1/1/2024" ...
## $ Masa.Pasang : chr "17:51" "8:00" "8:00" "0:00" ...
## $ Tarikh.Set.Tanggal : chr "21/10/2024" "1/8/2024" "1/6/2024" "2/1/2024" ...
## $ Masa..Set.Tanggal : chr "5:54" "22:00" "12:17" "0:00" ...
## $ Total.Mileage..KM. : int 150 18 30 0 0 0 820 150 150 0 ...
## $ Jumlah.Jam.Perkhidmatan : chr "20" "182" "125" "744" ...
## $ Nombor..Sales.Order. : chr NA "59469608" "59469609" "59469604" ...
## $ Nombor..Services.Order. : chr "10946145" "10858317" "10858318" "10858438" ...
## $ Job.Order...PM.Order...WBS.Number...Project.Number: chr "28129289" "J612166143" "D-PZB-S16-4814-3H1" "27111480" ...
## $ Type.of.Ref..No : chr "TCS NUMBER (LV)" "TCS NUMBER (LV)" "LV" "LV" ...
## $ Referance.No. : chr "0333/20241017/027" "0124/20231223/021" NA NA ...
## $ Job.Description : chr "CABLE FAULT" "KES KECURIAN" "PENUKARAN GENSET 500KVA KE 300KVA. LV SUPPLY PPU TROPICANA METROPARK SUBANG HI-TECH" "BD DI PULAU REDANG" ...
## $ Unit : chr "OPERATION" "OPERATION" "ASSETDEV" "OPERATION" ...
## $ Unit.Selepas.Semakan : chr "OPERATION" "OPERATION" "ASSET DEVELOPMENT" "OPERATION" ...
## $ Emergency.Shutdown : chr "DN/Asset Zone South 1" NA NA NA ...
head(genset, 5)
## No Jenis.Kerja Nombor.Dokumen GenSet Negeri Daerah
## 1 1 Breakdown ERQ0056940 WA1465J(C6) JOHOR PONTIAN
## 2 1 Breakdown ERQ0052522 VJU4414(B25) KUALA LUMPUR KL SELATAN
## 3 1 Shutdown PRQ0090432 WB1077S(B20) SELANGOR SUBANG JAYA
## 4 1 Standby PRQ0090428 SGT-D4 TERENGGANU KUALA TERENGGANU
## 5 1 Shutdown PRQ0090431 VAT5454(C46) KELANTAN GUA MUSANG
## Lokasi.Kerja Request.Kapasiti Tarikh.Diperlukan
## 1 PE TAMAN SUI 300 20/10/2024
## 2 CHAN SOW LIN NO. 2 300 1/1/2024
## 3 PPU TROPICANA METROPARK SUBANG HI-TECH 300 1/1/2024
## 4 SJL PULAU REDANG 500 1/1/2024
## 5 PE MIAK 500 1/1/2024
## Masa.Diperlukan Tarikh.Tiba Masa.Tiba Tarikh.Pasang Masa.Pasang
## 1 10:00 20/10/2024 10:02 20/10/2024 17:51
## 2 8:00 1/1/2024 8:00 1/1/2024 8:00
## 3 8:00 1/1/2024 8:00 1/1/2024 8:00
## 4 0:00 1/1/2024 0:00 1/1/2024 0:00
## 5 0:00 1/1/2024 0:11 1/1/2024 0:12
## Tarikh.Set.Tanggal Masa..Set.Tanggal Total.Mileage..KM.
## 1 21/10/2024 5:54 150
## 2 1/8/2024 22:00 18
## 3 1/6/2024 12:17 30
## 4 2/1/2024 0:00 0
## 5 22/1/2024 13:28 0
## Jumlah.Jam.Perkhidmatan Nombor..Sales.Order. Nombor..Services.Order.
## 1 20 <NA> 10946145
## 2 182 59469608 10858317
## 3 125 59469609 10858318
## 4 744 59469604 10858438
## 5 744 59469605 10858439
## Job.Order...PM.Order...WBS.Number...Project.Number Type.of.Ref..No
## 1 28129289 TCS NUMBER (LV)
## 2 J612166143 TCS NUMBER (LV)
## 3 D-PZB-S16-4814-3H1 LV
## 4 27111480 LV
## 5 27513517 LV
## Referance.No.
## 1 0333/20241017/027
## 2 0124/20231223/021
## 3 <NA>
## 4 <NA>
## 5 SWING VENDOR GENSET
## Job.Description
## 1 CABLE FAULT
## 2 KES KECURIAN
## 3 PENUKARAN GENSET 500KVA KE 300KVA. LV SUPPLY PPU TROPICANA METROPARK SUBANG HI-TECH
## 4 BD DI PULAU REDANG
## 5 SWING VENDOR GENSET
## Unit Unit.Selepas.Semakan Emergency.Shutdown
## 1 OPERATION OPERATION DN/Asset Zone South 1
## 2 OPERATION OPERATION <NA>
## 3 ASSETDEV ASSET DEVELOPMENT <NA>
## 4 OPERATION OPERATION <NA>
## 5 OPERATION OPERATION <NA>
####################################################
# 2. Clean Column Names & Remove Duplicates
####################################################
genset <- genset %>%
clean_names()
nrow_before <- nrow(genset)
genset <- genset %>% distinct()
nrow_after <- nrow(genset)
cat("Rows before distinct:", nrow_before, "\n")
## Rows before distinct: 20021
cat("Rows after distinct :", nrow_after, "\n")
## Rows after distinct : 20021
####################################################
# 3. Clean Character Columns (encoding + whitespace)
####################################################
genset <- genset %>%
# 3.1 Fix encoding to avoid garbled symbols
mutate(across(
.cols = where(is.character),
.fns = ~ iconv(., from = "", to = "UTF-8")
)) %>%
# 3.2 Trim leading/trailing whitespace
mutate(across(
.cols = where(is.character),
.fns = ~ trimws(.)
))
# Check missing values after cleaning
colSums(is.na(genset))
## no
## 0
## jenis_kerja
## 0
## nombor_dokumen
## 0
## gen_set
## 0
## negeri
## 0
## daerah
## 0
## lokasi_kerja
## 0
## request_kapasiti
## 0
## tarikh_diperlukan
## 0
## masa_diperlukan
## 0
## tarikh_tiba
## 25
## masa_tiba
## 25
## tarikh_pasang
## 0
## masa_pasang
## 25
## tarikh_set_tanggal
## 25
## masa_set_tanggal
## 25
## total_mileage_km
## 0
## jumlah_jam_perkhidmatan
## 0
## nombor_sales_order
## 3005
## nombor_services_order
## 1
## job_order_pm_order_wbs_number_project_number
## 0
## type_of_ref_no
## 0
## referance_no
## 1719
## job_description
## 0
## unit
## 124
## unit_selepas_semakan
## 1296
## emergency_shutdown
## 16455
####################################################
# 4. Convert Date Columns
####################################################
date_cols <- c(
"tarikh_diperlukan",
"tarikh_tiba",
"tarikh_pasang",
"tarikh_set_tanggal"
)
# Confirm which date columns exist
intersect(date_cols, names(genset))
## [1] "tarikh_diperlukan" "tarikh_tiba" "tarikh_pasang"
## [4] "tarikh_set_tanggal"
genset <- genset %>%
mutate(across(
.cols = all_of(date_cols),
.fns = ~ dmy(as.character(.), quiet = TRUE)
))
str(genset[date_cols])
## 'data.frame': 20021 obs. of 4 variables:
## $ tarikh_diperlukan : Date, format: "2024-10-20" "2024-01-01" ...
## $ tarikh_tiba : Date, format: "2024-10-20" "2024-01-01" ...
## $ tarikh_pasang : Date, format: "2024-10-20" "2024-01-01" ...
## $ tarikh_set_tanggal: Date, format: "2024-10-21" "2024-08-01" ...
summary(genset[date_cols])
## tarikh_diperlukan tarikh_tiba tarikh_pasang
## Min. :2024-01-01 Min. :2000-01-01 Min. :2000-01-01
## 1st Qu.:2024-04-01 1st Qu.:2024-03-31 1st Qu.:2024-03-30
## Median :2024-06-27 Median :2024-06-27 Median :2024-06-28
## Mean :2024-06-29 Mean :2024-06-27 Mean :2024-06-07
## 3rd Qu.:2024-09-24 3rd Qu.:2024-09-25 3rd Qu.:2024-09-25
## Max. :2024-12-31 Max. :2024-12-31 Max. :2025-02-01
## NA's :25 NA's :1
## tarikh_set_tanggal
## Min. :2000-01-01
## 1st Qu.:2024-03-31
## Median :2024-06-25
## Mean :2024-04-22
## 3rd Qu.:2024-09-26
## Max. :2025-06-01
## NA's :27
####################################################
# 5. Convert Time Columns + Create Datetime
####################################################
time_cols <- c(
"masa_diperlukan",
"masa_tiba",
"masa_pasang",
"masa_set_tanggal"
)
# Inspect sample raw time values
head(genset[time_cols], 5)
## masa_diperlukan masa_tiba masa_pasang masa_set_tanggal
## 1 10:00 10:02 17:51 5:54
## 2 8:00 8:00 8:00 22:00
## 3 8:00 8:00 8:00 12:17
## 4 0:00 0:00 0:00 0:00
## 5 0:00 0:11 0:12 13:28
# Convert to Period (hours:minutes)
genset <- genset %>%
mutate(across(
.cols = all_of(time_cols),
.fns = ~ hm(as.character(.))
))
# Combine date + time → datetime (POSIXct)
genset <- genset %>%
mutate(
datetime_diperlukan = as_datetime(tarikh_diperlukan) + masa_diperlukan,
datetime_tiba = as_datetime(tarikh_tiba) + masa_tiba,
datetime_pasang = as_datetime(tarikh_pasang) + masa_pasang,
datetime_set_tanggal = as_datetime(tarikh_set_tanggal)+ masa_set_tanggal
)
str(genset[, c(
"datetime_diperlukan",
"datetime_tiba",
"datetime_pasang",
"datetime_set_tanggal"
)])
## 'data.frame': 20021 obs. of 4 variables:
## $ datetime_diperlukan : POSIXct, format: "2024-10-20 10:00:00" "2024-01-01 08:00:00" ...
## $ datetime_tiba : POSIXct, format: "2024-10-20 10:02:00" "2024-01-01 08:00:00" ...
## $ datetime_pasang : POSIXct, format: "2024-10-20 17:51:00" "2024-01-01 08:00:00" ...
## $ datetime_set_tanggal: POSIXct, format: "2024-10-21 05:54:00" "2024-08-01 22:00:00" ...
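These combined datetime columns are what enable the lead-time and service-duration analysis described earlier. A minimal sketch of such a calculation (the derived column names here are illustrative and are not part of the saved pipeline):

# Lead time = arrival vs. required time; service duration = install to removal
genset %>%
  mutate(
    lead_time_hours   = as.numeric(difftime(datetime_tiba, datetime_diperlukan, units = "hours")),
    service_dur_hours = as.numeric(difftime(datetime_set_tanggal, datetime_pasang, units = "hours"))
  ) %>%
  select(lead_time_hours, service_dur_hours) %>%
  summary()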
####################################################
# 6. Detect & Convert Numeric Columns Stored as Character
####################################################
classes <- sapply(genset, class)
character_columns <- names(classes[classes == "character"])
# Find character columns that contain only digits / spaces / dots / minus signs
possible_numeric <- character_columns[
sapply(genset[character_columns], function(x)
all(grepl("^[0-9 .-]*$", x) | is.na(x)))
]
cat("Possible numeric columns:\n")
## Possible numeric columns:
print(possible_numeric)
## [1] "jumlah_jam_perkhidmatan"
genset[possible_numeric] <- lapply(genset[possible_numeric], function(x) {
x <- trimws(x)
x[x == ""] <- NA
as.numeric(x)
})
# Re-check data types
sapply(genset, class)
## $no
## [1] "integer"
##
## $jenis_kerja
## [1] "character"
##
## $nombor_dokumen
## [1] "character"
##
## $gen_set
## [1] "character"
##
## $negeri
## [1] "character"
##
## $daerah
## [1] "character"
##
## $lokasi_kerja
## [1] "character"
##
## $request_kapasiti
## [1] "integer"
##
## $tarikh_diperlukan
## [1] "Date"
##
## $masa_diperlukan
## [1] "Period"
## attr(,"package")
## [1] "lubridate"
##
## $tarikh_tiba
## [1] "Date"
##
## $masa_tiba
## [1] "Period"
## attr(,"package")
## [1] "lubridate"
##
## $tarikh_pasang
## [1] "Date"
##
## $masa_pasang
## [1] "Period"
## attr(,"package")
## [1] "lubridate"
##
## $tarikh_set_tanggal
## [1] "Date"
##
## $masa_set_tanggal
## [1] "Period"
## attr(,"package")
## [1] "lubridate"
##
## $total_mileage_km
## [1] "integer"
##
## $jumlah_jam_perkhidmatan
## [1] "numeric"
##
## $nombor_sales_order
## [1] "character"
##
## $nombor_services_order
## [1] "character"
##
## $job_order_pm_order_wbs_number_project_number
## [1] "character"
##
## $type_of_ref_no
## [1] "character"
##
## $referance_no
## [1] "character"
##
## $job_description
## [1] "character"
##
## $unit
## [1] "character"
##
## $unit_selepas_semakan
## [1] "character"
##
## $emergency_shutdown
## [1] "character"
##
## $datetime_diperlukan
## [1] "POSIXct" "POSIXt"
##
## $datetime_tiba
## [1] "POSIXct" "POSIXt"
##
## $datetime_pasang
## [1] "POSIXct" "POSIXt"
##
## $datetime_set_tanggal
## [1] "POSIXct" "POSIXt"
####################################################
# 7. Missing Values Treatment
####################################################
# 7.1 Percentage of missing values per column
missing_pct <- colSums(is.na(genset)) / nrow(genset) * 100
missing_pct
## no
## 0.000000000
## jenis_kerja
## 0.000000000
## nombor_dokumen
## 0.000000000
## gen_set
## 0.000000000
## negeri
## 0.000000000
## daerah
## 0.000000000
## lokasi_kerja
## 0.000000000
## request_kapasiti
## 0.000000000
## tarikh_diperlukan
## 0.000000000
## masa_diperlukan
## 0.000000000
## tarikh_tiba
## 0.124868888
## masa_tiba
## 0.124868888
## tarikh_pasang
## 0.004994756
## masa_pasang
## 0.124868888
## tarikh_set_tanggal
## 0.134858399
## masa_set_tanggal
## 0.124868888
## total_mileage_km
## 0.000000000
## jumlah_jam_perkhidmatan
## 0.859097947
## nombor_sales_order
## 15.009240298
## nombor_services_order
## 0.004994756
## job_order_pm_order_wbs_number_project_number
## 0.000000000
## type_of_ref_no
## 0.000000000
## referance_no
## 8.585984716
## job_description
## 0.000000000
## unit
## 0.619349683
## unit_selepas_semakan
## 6.473203137
## emergency_shutdown
## 82.188701863
## datetime_diperlukan
## 0.000000000
## datetime_tiba
## 0.124868888
## datetime_pasang
## 0.129863643
## datetime_set_tanggal
## 0.134858399
# 7.2 Example: impute numeric columns with the median
# (other columns can be added if needed)
if ("jumlah_jam_perkhidmatan" %in% names(genset)) {
genset$jumlah_jam_perkhidmatan[is.na(genset$jumlah_jam_perkhidmatan)] <-
median(genset$jumlah_jam_perkhidmatan, na.rm = TRUE)
}
# 7.3 Impute categorical: NA → "MISSING"
cat_cols <- names(genset)[sapply(genset, is.character)]
genset[cat_cols] <- lapply(genset[cat_cols], function(x) {
x[is.na(x)] <- "MISSING"
x
})
# Quick check
sum(is.na(genset))
## [1] 206
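The 206 remaining NAs are confined to the date, time, and derived datetime columns, where an absent timestamp is genuinely missing information and is deliberately left as NA rather than imputed. A quick check along the lines of the earlier colSums() call confirms this:

# Show only the columns that still contain NAs
na_counts <- colSums(is.na(genset))
na_counts[na_counts > 0]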
summary(genset$jumlah_jam_perkhidmatan)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 13.00 29.00 53.78 59.00 908.00
####################################################
# 8. (Optional) Outlier Summary for Numeric Columns
####################################################
num_cols <- c(
"request_kapasiti",
"total_mileage_km",
"jumlah_jam_perkhidmatan"
)
num_cols <- intersect(num_cols, names(genset))
iqr_outlier_info <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
n_outliers <- sum(x < lower_bound | x > upper_bound, na.rm = TRUE)
c(
Q1 = Q1,
Q3 = Q3,
IQR = IQR,
Lower_Bound = lower_bound,
Upper_Bound = upper_bound,
Number_of_Outliers = n_outliers
)
}
outlier_summary <- do.call(
rbind,
lapply(num_cols, function(col) {
cbind(Column = col, t(iqr_outlier_info(genset[[col]])))
})
)
outlier_summary <- as.data.frame(outlier_summary)
outlier_summary
## Column Q1.25% Q3.75% IQR.75% Lower_Bound.25% Upper_Bound.75%
## 1 request_kapasiti 500 1000 500 -250 1750
## 2 total_mileage_km 34 150 116 -140 324
## 3 jumlah_jam_perkhidmatan 13 59 46 -56 128
## Number_of_Outliers
## 1 0
## 2 1778
## 3 1806
# Save the outlier summary if needed
write.csv(outlier_summary,
"outlier_summary_numeric.csv",
row.names = FALSE)
####################################################
# 9. SAVE CLEAN DATA
####################################################
# Name the final clean dataset genset_clean
genset_clean <- genset
# Save as CSV for EDA / modelling
write.csv(genset_clean,
"Y:/data/MNC_UM/WQD7001/group project/GA2/rawdata/genset_clean.csv",
row.names = FALSE)
# (Optional) Save as RDS (preserves column types)
#saveRDS(genset_clean,
# file = "genset_clean.rds")
#cat("Clean dataset saved as 'genset_clean.csv' and 'genset_clean.rds'\n")
With the data cleaned, the next step is to explore distributions, central tendencies, associations, and patterns using statistical and graphical methods. Of the 27 original variables, 10 were retained for EDA and modelling: two numeric, two date variables, and six categorical. The key target variables are request_kapasiti and jumlah_jam_perkhidmatan. The 10 retained variables are gen_set, negeri, daerah, lokasi_kerja, jenis_kerja, request_kapasiti, tarikh_pasang, tarikh_set_tanggal, total_mileage_km, and jumlah_jam_perkhidmatan.
Basic statistics were generated across the dataset for an initial understanding.
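# df is assumed to hold the cleaned dataset (genset_clean) produced in the previous section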
gensetDF <- df
#filter the 10 needed columns
gensetDF<-gensetDF %>% select(gen_set, negeri, daerah, lokasi_kerja, jenis_kerja, request_kapasiti, tarikh_pasang, tarikh_set_tanggal, total_mileage_km, jumlah_jam_perkhidmatan)
# convert categorical columns to factor
gensetDF<-gensetDF %>%
mutate(
gen_set= as.factor(gen_set),
negeri= as.factor(negeri),
daerah= as.factor(daerah),
lokasi_kerja= as.factor(lokasi_kerja),
jenis_kerja= as.factor(jenis_kerja),
request_kapasiti= as.factor(request_kapasiti)
)
gensetDF$tarikh_pasang <- as.Date(gensetDF$tarikh_pasang, format = "%Y-%m-%d")
gensetDF$tarikh_set_tanggal <- as.Date(gensetDF$tarikh_set_tanggal, format = "%Y-%m-%d")
summary(gensetDF)
## gen_set negeri daerah
## V-HNTK-VER8616-6D: 92 SELANGOR :4599 RAWANG : 860
## B-HNTK-VEF6362-2D: 88 JOHOR :3391 KLANG : 813
## VAY4428(C44) : 82 PAHANG :2113 JOHOR BAHRU: 736
## D-SLSA-PPH9226-2C: 81 KELANTAN :2014 KL BARAT : 487
## V-HNTK-VFS9985-3D: 81 KUALA LUMPUR:1762 KL TIMUR : 466
## D-GMGE-VDX733-2D : 80 KEDAH :1452 BANTING : 459
## (Other) :19517 (Other) :4690 (Other) :16200
## lokasi_kerja jenis_kerja request_kapasiti
## PPU DENGKIL : 110 Breakdown :9020 125 : 667
## PPU SRI KIJANG 33/11KV : 56 Emergency Shutdown: 536 300 : 4337
## PMU VALDOR 132/11KV : 47 Shutdown :8928 500 : 4900
## PPU PEKAN JENJAROM BANTING: 47 Standby :1537 1000:10117
## PPU BANDAR TASEK MUTIARA : 33
## PPU TAMAN SELAYANG MUTIARA: 33
## (Other) :19695
## tarikh_pasang tarikh_set_tanggal total_mileage_km
## Min. :2000-01-01 Min. :2000-01-01 Min. : 0
## 1st Qu.:2024-04-05 1st Qu.:2024-04-05 1st Qu.: 34
## Median :2024-07-01 Median :2024-07-01 Median : 68
## Mean :2024-06-10 Mean :2024-04-26 Mean : 127
## 3rd Qu.:2024-09-27 3rd Qu.:2024-09-28 3rd Qu.: 150
## Max. :2025-01-02 Max. :2025-01-06 Max. :6326
## NA's :27
## jumlah_jam_perkhidmatan
## Min. : 1.00
## 1st Qu.: 13.00
## Median : 29.00
## Mean : 53.78
## 3rd Qu.: 59.00
## Max. :908.00
##
# --UNIVARIATE ANALYSIS-----------
#generate boxplots for numeric
numeric_cols <- which(sapply(gensetDF, is.numeric))
boxplot(gensetDF[, numeric_cols], main = "Boxplots of Numeric Variables", las = 2)
Both numerical variables are clearly right-skewed. For total_mileage_km, there is one obvious outlier; it was confirmed as a legitimate special-case entry during the data cleaning stage. It is nevertheless recommended to remove this single record to avoid issues in model training later.
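A minimal sketch of that removal, keeping a separate modelling copy so the EDA outputs in this section are unaffected (the filter targets the single maximum value visible in the summary above):

# Drop the single extreme total_mileage_km record flagged above
gensetDF_model <- gensetDF %>%
  filter(total_mileage_km < max(total_mileage_km, na.rm = TRUE))
nrow(gensetDF) - nrow(gensetDF_model)  # rows removed (1 if the maximum is unique)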
# generate bar charts for categorical variables
# Identify categorical variables
cat_cols <- sapply(gensetDF, function(x) is.factor(x) || is.character(x))
cat_names <- names(gensetDF)[cat_cols]
#4 plots per row
n_row <- ceiling(length(cat_names) / 4)
par(mfrow = c(n_row, 4), mar = c(6, 4, 2, 1)) # extra bottom margin for angled labels
for (colname in cat_names) {
counts <- table(gensetDF[[colname]])
# Draw barplot without x labels
bp <- barplot(counts,
main = paste("Distribution of", colname),
col = "lightblue",
ylab = "Count",
xaxt = "n") # suppress x-axis labels
# Add rotated labels manually (~25 degrees)
text(x = bp, y = -0.05 * max(counts), labels = names(counts),
srt = 45, adj = 1, xpd = TRUE, cex = 0.8)
}
The cardinalities of gen_set, daerah, and lokasi_kerja are very high, with no clear distribution pattern. By negeri, the more industrialized and densely populated states record the highest request counts. For jenis_kerja, most entries involve breakdowns and shutdowns. In terms of request_kapasiti, gensets rated at 1000 kVA show the highest demand.
par(mfrow=c(1,1)) # restore to single plot layout
#scatter plot
plot(gensetDF$total_mileage_km, gensetDF$jumlah_jam_perkhidmatan,
main = "Scatter Plot: total_mileage_km vs jumlah_jam_perkhidmatan",
xlab = "total_mileage_km",
ylab = "jumlah_jam_perkhidmatan",
pch = 19, col = "blue")
For the bivariate analysis, numerical and categorical independent
variables are examined against the target variables request_kapasiti and
jumlah_jam_perkhidmatan.
The scatter plot indicates that operations are generally well optimized, with genset travel distance and utilization hours closely aligned, apart from the single outlier noted earlier.
# Boxplot with rotated x-axis labels
boxplot(gensetDF$jumlah_jam_perkhidmatan ~ gensetDF$negeri,
data = gensetDF,
col = "lightblue",
main = "Boxplot of negeri vs. jumlah_jam_perkhidmatan",
xlab = "Category", ylab = "Numeric Value",
xaxt = "n") # suppress default x-axis labels
# Add rotated labels at 45 degrees
axis_positions <- 1:length(levels(gensetDF$negeri))
text(x = axis_positions,
y = par("usr")[3] - 0.05 * diff(par("usr")[3:4]), # place below axis
labels = levels(gensetDF$negeri),
srt = 45, adj = 1, xpd = TRUE, cex = 0.8)
# Boxplot with rotated x-axis labels
boxplot(gensetDF$jumlah_jam_perkhidmatan ~ gensetDF$jenis_kerja,
data = gensetDF,
col = "lightblue",
main = "Boxplot of jenis_kerja vs. jumlah_jam_perkhidmatan",
xlab = "Category", ylab = "Numeric Value",
xaxt = "n") # suppress default x-axis labels
# Add rotated labels at 45 degrees
axis_positions <- 1:length(levels(gensetDF$jenis_kerja))
text(x = axis_positions,
y = par("usr")[3] - 0.05 * diff(par("usr")[3:4]), # place below axis
labels = levels(gensetDF$jenis_kerja),
srt = 45, adj = 1, xpd = TRUE, cex = 0.8)
# Boxplot with rotated x-axis labels
boxplot(gensetDF$jumlah_jam_perkhidmatan ~ gensetDF$request_kapasiti,
data = gensetDF,
col = "lightblue",
main = "Boxplot of request_kapasiti vs. jumlah_jam_perkhidmatan",
xlab = "Category", ylab = "Numeric Value",
xaxt = "n") # suppress default x-axis labels
# Add rotated labels at 45 degrees
axis_positions <- 1:length(levels(gensetDF$request_kapasiti))
text(x = axis_positions,
y = par("usr")[3] - 0.05 * diff(par("usr")[3:4]), # place below axis
labels = levels(gensetDF$request_kapasiti),
srt = 45, adj = 1, xpd = TRUE, cex = 0.8)
# Boxplot with rotated x-axis labels
boxplot(gensetDF$total_mileage_km ~ gensetDF$request_kapasiti,
data = gensetDF,
col = "lightblue",
main = "Boxplot of request_kapasiti vs. total_mileage_km",
xlab = "Category", ylab = "Numeric Value",
xaxt = "n") # suppress default x-axis labels
# Add rotated labels at 45 degrees
axis_positions <- 1:length(levels(gensetDF$request_kapasiti))
text(x = axis_positions,
y = par("usr")[3] - 0.05 * diff(par("usr")[3:4]), # place below axis
labels = levels(gensetDF$request_kapasiti),
srt = 45, adj = 1, xpd = TRUE, cex = 0.8)
Mean genset service hours are higher in Pahang, Pulau Pinang, and Selangor, indicating longer usage, while Kedah, Kelantan, Pahang, and Pulau Pinang show greater spread, reflecting higher variability. For jenis_kerja, standby cases record the highest mean, showing that many gensets are dispatched solely for standby duty. Utilization hours are fairly evenly distributed across genset capacities.
# count table
counts <- table(gensetDF$negeri, gensetDF$request_kapasiti)
# Draw stacked bar chart
bp <- barplot(counts,
main = "Stacked Bar Chart: request_kapasiti vs negeri",
xlab = "request kapasiti", ylab = "Count",
col = rainbow(nrow(counts)),
las = 2)
# Add legend on the left side
legend("left",
legend = rownames(counts),
fill = rainbow(nrow(counts)),
inset = c(-0.15, 0), # nudge leftward
xpd = TRUE, # allow drawing outside plot
cex = 0.8)
# count table
counts <- table(gensetDF$jenis_kerja, gensetDF$request_kapasiti)
# Draw stacked bar chart
bp <- barplot(counts,
main = "Stacked Bar Chart: request_kapasiti vs jenis_kerja",
xlab = "request kapasiti", ylab = "Count",
col = rainbow(nrow(counts)),
las = 2)
# Add legend on the left side
legend("left",
legend = rownames(counts),
fill = rainbow(nrow(counts)),
inset = c(-0.15, 0), # nudge leftward
xpd = TRUE, # allow drawing outside plot
cex = 0.8)
The stacked bar charts reveal that requests for 1000 kVA gensets are highest in the industrialized, densely populated states. Across all genset capacities, shutdown and breakdown cases dominate the requests.
# Create monthly sequence from min start to max end
timeline <- tibble(
month = seq(floor_date(min(gensetDF$tarikh_pasang), "month"),
ceiling_date(max((gensetDF[!is.na(gensetDF$tarikh_set_tanggal), ])$tarikh_set_tanggal), "month"),
by = "month")
)
# Count active requests per month
timeline <- timeline %>%
rowwise() %>%
mutate(active_requests = sum((gensetDF[!is.na(gensetDF$tarikh_set_tanggal), ])$tarikh_pasang <= month & (gensetDF[!is.na(gensetDF$tarikh_set_tanggal), ])$tarikh_set_tanggal >= month))
# Plot monthly time series for 2024
ggplot(timeline[grepl("^2024", timeline$month), ], aes(x = month, y = active_requests)) +
geom_line(color = "steelblue", linewidth = 1) +
geom_point(color = "darkred") +
scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
labs(title = "In Operation Gensets by Month",
x = "Month", y = "Number of Gensets in Operation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
A temporal pattern appears in the data that can be explored during the
modelling stage. Genset requests decline during festive seasons and rise
again in the second quarter of the year.
The association between the two numerical variables, jumlah_jam_perkhidmatan and total_mileage_km, is weak. The same is observed for associations among categorical variables. The strongest link to the target variable is lokasi_kerja (location). Given these weak relationships, challenges in model training were anticipated.
# Numerical association: Pearson correlation
vars <- gensetDF[, names(which(sapply(gensetDF, is.numeric)))]
cor_matrix <- cor(vars, method = "pearson", use = "complete.obs")
print(cor_matrix)
## total_mileage_km jumlah_jam_perkhidmatan
## total_mileage_km 1.00000000 0.09276926
## jumlah_jam_perkhidmatan 0.09276926 1.00000000
library(corrplot)
corrplot(cor_matrix,
method = "color", # colored squares
addCoef.col = "black", # show correlation values
tl.col = "black", # text label color
tl.srt = 45, # rotate labels
col = colorRampPalette(c("red", "white", "blue"))(200),
title = "Heatmap of Pearson Correlation",
mar = c(0,0,2,0))
# Categorical association: Chi-square statistics between factor pairs
# --- Step 1: Define categorical variables ---
cat_names <- names(gensetDF)[sapply(gensetDF, is.factor)] # or manually list them
# --- Step 2: Build Chi-square statistic matrix ---
chi_matrix <- matrix(NA, nrow=length(cat_names), ncol=length(cat_names),
dimnames=list(cat_names, cat_names))
for (i in seq_along(cat_names)) {
for (j in seq_along(cat_names)) {
if (i != j) {
tbl <- table(gensetDF[[cat_names[i]]], gensetDF[[cat_names[j]]])
chi_matrix[i,j] <- suppressWarnings(chisq.test(tbl)$statistic)
} else {
chi_matrix[i,j] <- NA # self-association not meaningful
}
}
}
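Because the raw Chi-square statistic grows with sample size and table dimensions, a normalized measure such as Cramér’s V is easier to compare across variable pairs. A minimal sketch (cramers_v is a hypothetical helper, not part of the pipeline above):

# Cramér's V: sqrt(chi-square / (n * (min(rows, cols) - 1)))
cramers_v <- function(x, y) {
  tbl <- table(x, y)
  chi <- suppressWarnings(chisq.test(tbl)$statistic)
  n <- sum(tbl)
  k <- min(nrow(tbl), ncol(tbl)) - 1
  as.numeric(sqrt(chi / (n * k)))
}
# Example pair: state vs. job type
cramers_v(gensetDF$negeri, gensetDF$jenis_kerja)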
The first modelling task predicts requested capacity (request_kapasiti) as a multi-class classification problem. The data is prepared as follows:
# Step 1: Data preparation
data <- df
data$tarikh_diperlukan <- as.Date(data$tarikh_diperlukan, format = "%Y-%m-%d")
data <- data %>%
mutate(month = month(tarikh_diperlukan),
rainy_season = ifelse(month %in% c(10, 11, 12, 1, 2, 3), 1, 0), # Monsoon months
jenis_kerja = as.factor(jenis_kerja))
# Group rare negeri into "Other"
negeri_counts <- table(data$negeri)
rare_negeri <- names(negeri_counts[negeri_counts < 5])
data$negeri[data$negeri %in% rare_negeri] <- "Other"
data$negeri <- as.factor(data$negeri)
# Original levels for mapping back
original_levels <- c("125", "300", "500", "1000")
# For classification models: Relabel to valid R names
new_levels <- c("Cap125", "Cap300", "Cap500", "Cap1000")
data$request_kapasiti_class <- factor(data$request_kapasiti, levels = original_levels, labels = new_levels)
data <- data %>% select(request_kapasiti_class, month, rainy_season, negeri, jenis_kerja) %>% na.omit()
# Step 2: Train/Test Split (stratified on class)
set.seed(123)
train_index <- createDataPartition(data$request_kapasiti_class, p = 0.8, list = FALSE)
train <- data[train_index, ]
test <- data[-train_index, ]
# Filter test for seen negeri levels
test <- test %>% filter(negeri %in% levels(train$negeri))
# Compute class weights for classification models
class_freq <- table(train$request_kapasiti_class)
class_weights <- 1 / class_freq
class_weights <- class_weights / sum(class_weights) * length(class_weights)
names(class_weights) <- new_levels
All three algorithms (Random Forest, XGBoost, and Multinomial Logistic Regression) were trained and evaluated, with the following results:
# Step 3: Model 1 - Random Forest (Main Model, Classification)
control_rf <- trainControl(method = "cv", number = 3, classProbs = TRUE, summaryFunction = multiClassSummary)
grid_rf <- expand.grid(mtry = c(2, 4),
splitrule = c("gini", "extratrees"),
min.node.size = c(1, 10))
model_rf <- train(request_kapasiti_class ~ month + rainy_season + negeri + jenis_kerja,
data = train,
method = "ranger",
num.trees = 500,
importance = "impurity",
class.weights = class_weights,
tuneGrid = grid_rf,
trControl = control_rf,
verbose = FALSE)
# Predict and map back to original labels
pred_rf <- predict(model_rf, newdata = test)
pred_rf_orig <- factor(pred_rf, levels = new_levels, labels = original_levels)
test_class_orig <- factor(test$request_kapasiti_class, levels = new_levels, labels = original_levels)
cm_rf <- confusionMatrix(pred_rf_orig, test_class_orig)
accuracy_rf <- cm_rf$overall['Accuracy']
weighted_f1_rf <- weighted.mean(cm_rf$byClass[, "F1"], w = table(test_class_orig) / nrow(test), na.rm = TRUE)
macro_recall_rf <- mean(cm_rf$byClass[, "Recall"], na.rm = TRUE)
# Confusion Matrix Heatmap for Random Forest
cm_rf_table <- as.table(cm_rf$table)
cm_rf_df <- as.data.frame(cm_rf_table)
ggplot(cm_rf_df, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), vjust = 1) +
scale_fill_gradient(low = "white", high = "blue") +
theme_minimal() +
labs(title = "Confusion Matrix - Random Forest")
# Step 4: Model 2 - XGBoost (Classification for Comparison)
# One-hot encode for XGBoost
dummy <- dummyVars(~ month + rainy_season + negeri + jenis_kerja, data = train)
X_train <- predict(dummy, newdata = train)
y_train <- as.numeric(train$request_kapasiti_class) - 1
X_test <- predict(dummy, newdata = test)
y_test <- as.numeric(test$request_kapasiti_class) - 1
dtrain <- xgb.DMatrix(data = X_train, label = y_train)
dtest <- xgb.DMatrix(data = X_test, label = y_test)
params_xgb <- list(
objective = "multi:softprob",
num_class = 4,
eta = 0.05,
max_depth = 4,
subsample = 0.8,
colsample_bytree = 0.8,
eval_metric = "mlogloss"
)
watchlist <- list(train = dtrain, eval = dtest)
model_xgb <- xgb.train(params_xgb, dtrain, nrounds = 1000, watchlist = watchlist, early_stopping_rounds = 20, verbose = 0)
# Predict
preds_prob_xgb <- predict(model_xgb, dtest)
preds_matrix_xgb <- matrix(preds_prob_xgb, nrow = nrow(test), ncol = 4, byrow = TRUE)
pred_xgb <- max.col(preds_matrix_xgb) - 1
pred_xgb_orig <- factor(pred_xgb, levels = 0:3, labels = original_levels)
cm_xgb <- confusionMatrix(pred_xgb_orig, test_class_orig)
accuracy_xgb <- cm_xgb$overall['Accuracy']
weighted_f1_xgb <- weighted.mean(cm_xgb$byClass[, "F1"], w = table(test_class_orig) / nrow(test), na.rm = TRUE)
macro_recall_xgb <- mean(cm_xgb$byClass[, "Recall"], na.rm = TRUE)
# Confusion Matrix Heatmap for XGBoost
cm_xgb_table <- as.table(cm_xgb$table)
cm_xgb_df <- as.data.frame(cm_xgb_table)
ggplot(cm_xgb_df, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), vjust = 1) +
scale_fill_gradient(low = "white", high = "blue") +
theme_minimal() +
labs(title = "Confusion Matrix - XGBoost")
# Step 4.5: Model - Multinomial Logistic Regression
# We use 'multinom' from the nnet package via caret
control_lr <- trainControl(method = "cv", number = 3, classProbs = TRUE)
model_lr <- train(request_kapasiti_class ~ month + rainy_season + negeri + jenis_kerja,
data = train,
method = "multinom",
trControl = control_lr,
trace = FALSE) # trace=FALSE hides the iteration logs
# Predict and map back to original labels
pred_lr <- predict(model_lr, newdata = test)
pred_lr_orig <- factor(pred_lr, levels = new_levels, labels = original_levels)
cm_lr <- confusionMatrix(pred_lr_orig, test_class_orig)
accuracy_lr <- cm_lr$overall['Accuracy']
weighted_f1_lr <- weighted.mean(cm_lr$byClass[, "F1"], w = table(test_class_orig) / nrow(test), na.rm = TRUE)
macro_recall_lr <- mean(cm_lr$byClass[, "Recall"], na.rm = TRUE)
# Confusion Matrix Heatmap for Logistic Regression
cm_lr_table <- as.table(cm_lr$table)
cm_lr_df <- as.data.frame(cm_lr_table)
ggplot(cm_lr_df, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), vjust = 1) +
scale_fill_gradient(low = "white", high = "blue") +
theme_minimal() +
labs(title = "Confusion Matrix - Logistic Regression")
# Step 5: Compare Performance (Updated)
metrics <- data.frame(
Model = c("Random Forest", "XGBoost", "Logistic Regression"),
Accuracy = c(accuracy_rf, accuracy_xgb, accuracy_lr),
Weighted_F1 = c(weighted_f1_rf, weighted_f1_xgb, weighted_f1_lr),
Macro_Recall = c(macro_recall_rf, macro_recall_xgb, macro_recall_lr)
)
print(metrics)
## Model Accuracy Weighted_F1 Macro_Recall
## 1 Random Forest 0.5331002 0.4730562 0.3085022
## 2 XGBoost 0.2530602 0.2855560 0.2505855
## 3 Logistic Regression 0.5301024 0.4865376 0.3165443
# Step 6: Visualize Comparisons
# Bar plot for Accuracy, Weighted F1, Macro Recall
metrics_long <- metrics %>%
pivot_longer(cols = c(Accuracy, Weighted_F1, Macro_Recall), names_to = "Metric", values_to = "Value")
ggplot(metrics_long, aes(x = Metric, y = Value, fill = Model)) +
geom_bar(stat = "identity", position = position_dodge()) +
theme_minimal() +
labs(title = "Model Performance Comparison", x = "Metric", y = "Score") +
scale_fill_brewer(palette = "Set1")
# Save plot (optional)
#ggsave("model_comparison.png")
# Feature Importance for Random Forest (best model)
importance_rf <- varImp(model_rf)
print(importance_rf)
## ranger variable importance
##
## Overall
## negeriSELANGOR 100.000
## negeriKELANTAN 82.547
## month 76.816
## negeriPULAU PINANG 75.909
## jenis_kerjaShutdown 53.824
## negeriKUALA LUMPUR 46.389
## negeriTERENGGANU 33.912
## negeriPERAK 28.307
## jenis_kerjaStandby 26.993
## negeriMELAKA 23.383
## negeriKEDAH 19.897
## negeriPAHANG 12.869
## rainy_season 11.976
## jenis_kerjaEmergency Shutdown 5.229
## negeriNEGERI SEMBILAN 4.636
## negeriPERLIS 1.100
## negeriPUTRA JAYA 0.000
# Feature Importance for Logistic Regression
importance_lr <- varImp(model_lr)
print(importance_lr)
## multinom variable importance
##
## Overall
## `negeriPUTRA JAYA` 100.00
## negeriSELANGOR 62.10
## negeriKELANTAN 50.91
## jenis_kerjaStandby 43.92
## negeriMELAKA 39.40
## `negeriPULAU PINANG` 38.76
## `jenis_kerjaEmergency Shutdown` 36.11
## negeriPERLIS 34.60
## negeriKEDAH 34.33
## `negeriKUALA LUMPUR` 33.01
## negeriTERENGGANU 32.01
## `negeriNEGERI SEMBILAN` 31.22
## negeriPERAK 30.92
## negeriPAHANG 17.22
## jenis_kerjaShutdown 15.32
## rainy_season 3.20
## month 0.00
The choice of model was guided by three main criteria: accuracy, weighted F1 score, and macro recall, as compared above. The second modelling task predicts service duration (the hours between genset installation and removal) as a regression problem.
# Data Cleaning & Feature Engineering
# Convert to timestamps and calculate duration
df_clean <- df %>%
mutate(
# dt_pasang = dmy_hm(datetime_pasang),
# dt_tanggal = dmy_hm(datetime_set_tanggal)
dt_pasang = ymd(tarikh_pasang),
dt_tanggal = ymd(tarikh_set_tanggal)
) %>%
mutate(duration = as.numeric(difftime(dt_tanggal, dt_pasang, units="hours"))) %>%
# PREPROCESSING: Remove extreme outliers (500+ hours) and non-positive or missing durations
filter(!is.na(duration) & duration > 0 & duration < 500) %>%
# FEATURE ENGINEERING: Extract time of day/month
mutate(
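# NOTE: dt_pasang is a Date (parsed with ymd() above), so hour() always
# returns 0 here; this explains hour_pasang's zero importance further below.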
hour_pasang = hour(dt_pasang),
month_pasang = month(dt_pasang),
# Log Transformation to handle skewness
log_duration = log1p(duration)
)
# Select and format features
ml_data <- df_clean %>%
select(duration, log_duration, jenis_kerja, negeri,
request_kapasiti, total_mileage_km, hour_pasang, month_pasang) %>%
mutate(across(c(jenis_kerja, negeri), as.factor))
# Data Splitting (80% Train, 20% Test)
set.seed(42)
trainIndex <- createDataPartition(ml_data$duration, p = 0.8, list = FALSE)
train_set <- ml_data[trainIndex, ]
test_set <- ml_data[-trainIndex, ]
# Model Evaluation Function
get_metrics <- function(actual, predicted, model_name) {
mae <- mean(abs(actual - predicted))
rmse <- sqrt(mean((actual - predicted)^2))
r2 <- cor(actual, predicted)^2
return(data.frame(Model = model_name, MAE = mae, RMSE = rmse, R2 = r2))
}
All three algorithms were evaluated to identify the most reliable model for operational use.
# --- MODEL 1: LINEAR REGRESSION ---
lm_model <- lm(duration ~ jenis_kerja + negeri + request_kapasiti +
total_mileage_km + hour_pasang + month_pasang, data = train_set)
lm_pred <- predict(lm_model, test_set)
# --- MODEL 2: RANDOM FOREST ---
rf_model <- randomForest(duration ~ jenis_kerja + negeri + request_kapasiti +
total_mileage_km + hour_pasang + month_pasang,
data = train_set, ntree = 100)
rf_pred <- predict(rf_model, test_set)
# --- MODEL 3: XGBOOST (log-transformed target) ---
# XGBoost requires a matrix format and we use log-transformed target
train_matrix <- sparse.model.matrix(log_duration ~ . - duration - 1, data = train_set)
test_matrix <- sparse.model.matrix(log_duration ~ . - duration - 1, data = test_set)
xgb_model <- xgboost(data = train_matrix, label = train_set$log_duration,
nrounds = 100, objective = "reg:squarederror",
verbose = 0, eta = 0.1, max_depth = 6)
# Predict and convert back from log to original scale
xgb_pred_log <- predict(xgb_model, test_matrix)
xgb_pred <- expm1(xgb_pred_log)
# 6. Evaluation and Comparison
results <- rbind(
get_metrics(test_set$duration, lm_pred, "Linear Regression"),
get_metrics(test_set$duration, rf_pred, "Random Forest"),
get_metrics(test_set$duration, xgb_pred, "XGBoost")
)
print("--- Final Model Performance Comparison ---")
## [1] "--- Final Model Performance Comparison ---"
print(results)
## Model MAE RMSE R2
## 1 Linear Regression 35.99285 54.39508 0.09267082
## 2 Random Forest 32.26968 49.39906 0.25463319
## 3 XGBoost 30.08559 52.03442 0.21346377
# Visualization of model comparison
# --- Step 1: Reshape to long format ---
df_long <- melt(results, id.vars = "Model",
variable.name = "Metric",
value.name = "Value")
# --- Step 2: Plot grouped bar chart with numbers on top ---
ggplot(df_long, aes(x = Model, y = Value, fill = Metric)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.9)) +
geom_text(aes(label = round(Value, 2)),
position = position_dodge(width = 0.9),
vjust = -0.3, size = 3) +
labs(title = "Model Performance Comparison",
y = "Metric Value", x = "Model") +
scale_fill_brewer(palette = "Set2") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 20, hjust = 1))
# Predicted vs Actual for Random Forest
library(ggplot2)
rf_pred <- predict(rf_model, test_set)
df_plot <- data.frame(actual = test_set$duration, pred = rf_pred)
ggplot(df_plot, aes(x = actual, y = pred)) +
geom_point(color = "darkgreen", alpha = 0.4, size = 1.5) +
geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
labs(title = "Random Forest: Predicted vs Actual Duration (Test Set)",
x = "Actual Hours", y = "Predicted Hours") +
theme_minimal()
# Using the randomForest package's importance & varImpPlot
randomForest::importance(rf_model) # numeric importance per feature
## IncNodePurity
## jenis_kerja 2257190
## negeri 3763816
## request_kapasiti 1332312
## total_mileage_km 5991445
## hour_pasang 0
## month_pasang 3120262
varImpPlot(rf_model, main = "Random Forest Feature Importance")
The data interpretation from both models reveals consistent patterns in mobile generator deployment for power-related jobs in Malaysia.

For capacity demand prediction (multi-class classification), the Multinomial Logistic Regression model identifies “negeri” (state) as the primary driver, with high-demand urban areas like Selangor (4,599 requests) and Johor (3,391) favouring larger capacities (e.g., 1000 kVA, 10,117 instances). “Jenis_kerja” (job type) follows, dominated by Breakdown (9,020) and Shutdown (8,928), emphasizing operational urgency over seasonality. Class imbalance, with 125 kVA rare (667 instances), leads to under-prediction of smaller units, risking over-provisioning.

For service duration prediction (regression), the Random Forest model highlights “total_mileage_km” (mean 127 km) as the top predictor, indicating that logistical challenges extend durations. “Jenis_kerja” and “negeri” again rank high, with regional variations (e.g., rural vs. urban) and emergency job types contributing to variance. The model performs well for short durations (under 100 hours) but struggles with outliers.

Overall, location and job type are pivotal across both models, supported by raw data trends. These insights enable targeted resource optimization, reducing costs and improving response efficiency, though enhancements such as additional features (e.g., “daerah”) are recommended.
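The suggested “daerah” enhancement can be sketched as follows (a hypothetical extension, not part of the evaluated pipeline; the lumping threshold of 30 districts is an illustrative choice, needed because randomForest cannot handle factors with more than 53 levels):

# Lump rare districts into "Other", then rebuild the modelling frame with daerah
library(forcats)
ml_data2 <- df_clean %>%
  mutate(daerah = fct_lump_n(as.factor(daerah), n = 30)) %>%
  select(duration, log_duration, jenis_kerja, negeri, daerah,
         request_kapasiti, total_mileage_km, hour_pasang, month_pasang) %>%
  mutate(across(c(jenis_kerja, negeri), as.factor))
# Re-run the same train/test split and models above with ml_data2 in place of ml_data.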