The analysis loads the following R packages: janitor, dplyr, ggplot2, reshape2, the tidyverse (forcats, stringr, lubridate, tibble, purrr, tidyr, readr), randomForest, xgboost, caret (with lattice), Matrix, ranger, ROCR, and MLmetrics.
| Aspect | Description |
|---|---|
| Source | Historical genset deployment records from Malaysia’s national electricity utility’s fleet and dispatch system. |
| Time Period | 1st January to 31st December 2024 |
| Geographical Coverage | Peninsular Malaysia: All 13 states and over 140 districts (e.g., Johor Bahru, Subang Jaya) |
| Total Records | 20,022 unique deployment jobs |
| Total Fields | 27 columns in a single Excel sheet (.xlsx) |
| Reliability and Quality | High reliability: Real-time capture by field engineers, validated for audits. Core fields (e.g., district, job type, dates, capacity, mileage, hours) have <1% missing values; minor misses in non-critical fields; no duplicates |
The data cleaning methodology follows a structured computational workflow executed in the R environment to ensure complete reproducibility and auditability of every transformation.

### Structural Normalization and Cleaning

Metadata is standardized using the janitor package, converting all headers to lowercase snake_case. This resolves the irregular spacing and mixed casing that cause errors in attribute calls. For example, “Total Mileage (KM)” becomes “total_mileage_km”, creating a predictable schema for automated processing across platforms.

### Strategic Imputation of Missing Values

To handle semantic noise, diverse null placeholders (e.g., dashes, “N/A”) are mapped to formal nulls during ingestion. For key identifiers such as sales order numbers, missing entries are imputed as “MISSING” rather than deleted, preserving operational context and allowing models to distinguish truly missing data from non-applicable cases.

### Temporal Harmonization and Feature Engineering

A major cleaning step involves reconstructing chronology, since the raw data splits dates and times into character strings unsuitable for calculation. Using lubridate, these fragments are parsed into ISO 8601 date objects and durations, then combined into four new datetime features (e.g., datetime_tiba, datetime_pasang) to enable precise lead-time and service-duration analysis.

### Textual Scrubbing and Sanitization

To preserve the integrity of categorical analysis, all character columns are sanitized by enforcing UTF-8 encoding and trimming whitespace to prevent phantom mismatches. These refinements ensure accurate text mining and grouping in job description fields.

### Results After Cleaning

Applying these cleaning protocols produces a more robust dataset with 31 variables, including the new temporal features. No duplicate records were found, so record counts remain consistent, while character-based timestamps are converted into proper datetime objects, yielding a clean, analysis-ready output.
| Feature | Pre-Cleaning State | Post-Cleaning State |
|---|---|---|
| Headers | Mixed case/spaces (“Total Mileage (KM)”) | snake_case (“total_mileage_km”) |
| Missing Values | Mixed placeholders (“-”, “null”) | Standardized NA or “MISSING” |
| Date Format | Non-standard (e.g., “20/10/2024”) | ISO 8601 (2024-10-20) |
| Temporal Data | Character strings | Functional Datetime objects |
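For instance, the header normalization can be previewed on a single raw header using janitor’s make_clean_names(), the helper behind clean_names() (a minimal illustration, separate from the pipeline below):

library(janitor)
# A raw header with mixed case, spaces, and parentheses becomes snake_case
make_clean_names("Total Mileage (KM)")
# Expected result: "total_mileage_km"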
####################################################
# 1. Import Raw Dataset
# - Standardize common missing-value strings on import
####################################################
genset <- read.csv(
"Y:/data/MNC_UM/WQD7001/group project/GA2/rawdata/Raw Data Genset (2024).csv",
na.strings = c("", " ", "NA", "N/A", "-", "null")
)
#genset <- read_excel("Y:/data/MNC_UM/WQD7001/group project/GA2/rawdata/Raw Data Genset (2024).xlsx")
# First quick check
str(genset)
## 'data.frame': 20021 obs. of 27 variables:
## $ No : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Jenis.Kerja : chr "Breakdown" "Breakdown" "Shutdown" "Standby" ...
## $ Nombor.Dokumen : chr "ERQ0056940" "ERQ0052522" "PRQ0090432" "PRQ0090428" ...
## $ GenSet : chr "WA1465J(C6)" "VJU4414(B25)" "WB1077S(B20)" "SGT-D4" ...
## $ Negeri : chr "JOHOR" "KUALA LUMPUR" "SELANGOR" "TERENGGANU" ...
## $ Daerah : chr "PONTIAN" "KL SELATAN" "SUBANG JAYA" "KUALA TERENGGANU" ...
## $ Lokasi.Kerja : chr "PE TAMAN SUI" "CHAN SOW LIN NO. 2 " "PPU TROPICANA METROPARK SUBANG HI-TECH" "SJL PULAU REDANG" ...
## $ Request.Kapasiti : int 300 300 300 500 500 300 125 300 500 1000 ...
## $ Tarikh.Diperlukan : chr "20/10/2024" "1/1/2024" "1/1/2024" "1/1/2024" ...
## $ Masa.Diperlukan : chr "10:00" "8:00" "8:00" "0:00" ...
## $ Tarikh.Tiba : chr "20/10/2024" "1/1/2024" "1/1/2024" "1/1/2024" ...
## $ Masa.Tiba : chr "10:02" "8:00" "8:00" "0:00" ...
## $ Tarikh.Pasang : chr "20/10/2024" "1/1/2024" "1/1/2024" "1/1/2024" ...
## $ Masa.Pasang : chr "17:51" "8:00" "8:00" "0:00" ...
## $ Tarikh.Set.Tanggal : chr "21/10/2024" "1/8/2024" "1/6/2024" "2/1/2024" ...
## $ Masa..Set.Tanggal : chr "5:54" "22:00" "12:17" "0:00" ...
## $ Total.Mileage..KM. : int 150 18 30 0 0 0 820 150 150 0 ...
## $ Jumlah.Jam.Perkhidmatan : chr "20" "182" "125" "744" ...
## $ Nombor..Sales.Order. : chr NA "59469608" "59469609" "59469604" ...
## $ Nombor..Services.Order. : chr "10946145" "10858317" "10858318" "10858438" ...
## $ Job.Order...PM.Order...WBS.Number...Project.Number: chr "28129289" "J612166143" "D-PZB-S16-4814-3H1" "27111480" ...
## $ Type.of.Ref..No : chr "TCS NUMBER (LV)" "TCS NUMBER (LV)" "LV" "LV" ...
## $ Referance.No. : chr "0333/20241017/027" "0124/20231223/021" NA NA ...
## $ Job.Description : chr "CABLE FAULT" "KES KECURIAN" "PENUKARAN GENSET 500KVA KE 300KVA. LV SUPPLY PPU TROPICANA METROPARK SUBANG HI-TECH" "BD DI PULAU REDANG" ...
## $ Unit : chr "OPERATION" "OPERATION" "ASSETDEV" "OPERATION" ...
## $ Unit.Selepas.Semakan : chr "OPERATION" "OPERATION" "ASSET DEVELOPMENT" "OPERATION" ...
## $ Emergency.Shutdown : chr "DN/Asset Zone South 1" NA NA NA ...
head(genset, 5)
## No Jenis.Kerja Nombor.Dokumen GenSet Negeri Daerah
## 1 1 Breakdown ERQ0056940 WA1465J(C6) JOHOR PONTIAN
## 2 1 Breakdown ERQ0052522 VJU4414(B25) KUALA LUMPUR KL SELATAN
## 3 1 Shutdown PRQ0090432 WB1077S(B20) SELANGOR SUBANG JAYA
## 4 1 Standby PRQ0090428 SGT-D4 TERENGGANU KUALA TERENGGANU
## 5 1 Shutdown PRQ0090431 VAT5454(C46) KELANTAN GUA MUSANG
## Lokasi.Kerja Request.Kapasiti Tarikh.Diperlukan
## 1 PE TAMAN SUI 300 20/10/2024
## 2 CHAN SOW LIN NO. 2 300 1/1/2024
## 3 PPU TROPICANA METROPARK SUBANG HI-TECH 300 1/1/2024
## 4 SJL PULAU REDANG 500 1/1/2024
## 5 PE MIAK 500 1/1/2024
## Masa.Diperlukan Tarikh.Tiba Masa.Tiba Tarikh.Pasang Masa.Pasang
## 1 10:00 20/10/2024 10:02 20/10/2024 17:51
## 2 8:00 1/1/2024 8:00 1/1/2024 8:00
## 3 8:00 1/1/2024 8:00 1/1/2024 8:00
## 4 0:00 1/1/2024 0:00 1/1/2024 0:00
## 5 0:00 1/1/2024 0:11 1/1/2024 0:12
## Tarikh.Set.Tanggal Masa..Set.Tanggal Total.Mileage..KM.
## 1 21/10/2024 5:54 150
## 2 1/8/2024 22:00 18
## 3 1/6/2024 12:17 30
## 4 2/1/2024 0:00 0
## 5 22/1/2024 13:28 0
## Jumlah.Jam.Perkhidmatan Nombor..Sales.Order. Nombor..Services.Order.
## 1 20 <NA> 10946145
## 2 182 59469608 10858317
## 3 125 59469609 10858318
## 4 744 59469604 10858438
## 5 744 59469605 10858439
## Job.Order...PM.Order...WBS.Number...Project.Number Type.of.Ref..No
## 1 28129289 TCS NUMBER (LV)
## 2 J612166143 TCS NUMBER (LV)
## 3 D-PZB-S16-4814-3H1 LV
## 4 27111480 LV
## 5 27513517 LV
## Referance.No.
## 1 0333/20241017/027
## 2 0124/20231223/021
## 3 <NA>
## 4 <NA>
## 5 SWING VENDOR GENSET
## Job.Description
## 1 CABLE FAULT
## 2 KES KECURIAN
## 3 PENUKARAN GENSET 500KVA KE 300KVA. LV SUPPLY PPU TROPICANA METROPARK SUBANG HI-TECH
## 4 BD DI PULAU REDANG
## 5 SWING VENDOR GENSET
## Unit Unit.Selepas.Semakan Emergency.Shutdown
## 1 OPERATION OPERATION DN/Asset Zone South 1
## 2 OPERATION OPERATION <NA>
## 3 ASSETDEV ASSET DEVELOPMENT <NA>
## 4 OPERATION OPERATION <NA>
## 5 OPERATION OPERATION <NA>
####################################################
# 2. Clean Column Names & Remove Duplicates
####################################################
genset <- genset %>%
clean_names()
nrow_before <- nrow(genset)
genset <- genset %>% distinct()
nrow_after <- nrow(genset)
cat("Rows before distinct:", nrow_before, "\n")
## Rows before distinct: 20021
cat("Rows after distinct :", nrow_after, "\n")
## Rows after distinct : 20021
####################################################
# 3. Clean Character Columns (encoding + whitespace)
####################################################
genset <- genset %>%
# 3.1 Fix encoding to avoid garbled symbols
mutate(across(
.cols = where(is.character),
.fns = ~ iconv(., from = "", to = "UTF-8")
)) %>%
# 3.2 Trim leading/trailing whitespace
mutate(across(
.cols = where(is.character),
.fns = ~ trimws(.)
))
# Check missing values after cleaning
colSums(is.na(genset))
## no
## 0
## jenis_kerja
## 0
## nombor_dokumen
## 0
## gen_set
## 0
## negeri
## 0
## daerah
## 0
## lokasi_kerja
## 0
## request_kapasiti
## 0
## tarikh_diperlukan
## 0
## masa_diperlukan
## 0
## tarikh_tiba
## 25
## masa_tiba
## 25
## tarikh_pasang
## 0
## masa_pasang
## 25
## tarikh_set_tanggal
## 25
## masa_set_tanggal
## 25
## total_mileage_km
## 0
## jumlah_jam_perkhidmatan
## 0
## nombor_sales_order
## 3005
## nombor_services_order
## 1
## job_order_pm_order_wbs_number_project_number
## 0
## type_of_ref_no
## 0
## referance_no
## 1719
## job_description
## 0
## unit
## 124
## unit_selepas_semakan
## 1296
## emergency_shutdown
## 16455
####################################################
# 4. Convert Date Columns
####################################################
date_cols <- c(
"tarikh_diperlukan",
"tarikh_tiba",
"tarikh_pasang",
"tarikh_set_tanggal"
)
# Confirm which date columns exist
intersect(date_cols, names(genset))
## [1] "tarikh_diperlukan" "tarikh_tiba" "tarikh_pasang"
## [4] "tarikh_set_tanggal"
genset <- genset %>%
mutate(across(
.cols = all_of(date_cols),
.fns = ~ dmy(as.character(.), quiet = TRUE)
))
str(genset[date_cols])
## 'data.frame': 20021 obs. of 4 variables:
## $ tarikh_diperlukan : Date, format: "2024-10-20" "2024-01-01" ...
## $ tarikh_tiba : Date, format: "2024-10-20" "2024-01-01" ...
## $ tarikh_pasang : Date, format: "2024-10-20" "2024-01-01" ...
## $ tarikh_set_tanggal: Date, format: "2024-10-21" "2024-08-01" ...
summary(genset[date_cols])
## tarikh_diperlukan tarikh_tiba tarikh_pasang
## Min. :2024-01-01 Min. :2000-01-01 Min. :2000-01-01
## 1st Qu.:2024-04-01 1st Qu.:2024-03-31 1st Qu.:2024-03-30
## Median :2024-06-27 Median :2024-06-27 Median :2024-06-28
## Mean :2024-06-29 Mean :2024-06-27 Mean :2024-06-07
## 3rd Qu.:2024-09-24 3rd Qu.:2024-09-25 3rd Qu.:2024-09-25
## Max. :2024-12-31 Max. :2024-12-31 Max. :2025-02-01
## NA's :25 NA's :1
## tarikh_set_tanggal
## Min. :2000-01-01
## 1st Qu.:2024-03-31
## Median :2024-06-25
## Mean :2024-04-22
## 3rd Qu.:2024-09-26
## Max. :2025-06-01
## NA's :27
####################################################
# 5. Convert Time Columns + Create Datetime
####################################################
time_cols <- c(
"masa_diperlukan",
"masa_tiba",
"masa_pasang",
"masa_set_tanggal"
)
# Inspect sample raw time values
head(genset[time_cols], 5)
## masa_diperlukan masa_tiba masa_pasang masa_set_tanggal
## 1 10:00 10:02 17:51 5:54
## 2 8:00 8:00 8:00 22:00
## 3 8:00 8:00 8:00 12:17
## 4 0:00 0:00 0:00 0:00
## 5 0:00 0:11 0:12 13:28
# Convert to Period (hours:minutes)
genset <- genset %>%
mutate(across(
.cols = all_of(time_cols),
.fns = ~ hm(as.character(.))
))
# Combine date + time → datetime (POSIXct)
genset <- genset %>%
mutate(
datetime_diperlukan = as_datetime(tarikh_diperlukan) + masa_diperlukan,
datetime_tiba = as_datetime(tarikh_tiba) + masa_tiba,
datetime_pasang = as_datetime(tarikh_pasang) + masa_pasang,
datetime_set_tanggal = as_datetime(tarikh_set_tanggal)+ masa_set_tanggal
)
str(genset[, c(
"datetime_diperlukan",
"datetime_tiba",
"datetime_pasang",
"datetime_set_tanggal"
)])
## 'data.frame': 20021 obs. of 4 variables:
## $ datetime_diperlukan : POSIXct, format: "2024-10-20 10:00:00" "2024-01-01 08:00:00" ...
## $ datetime_tiba : POSIXct, format: "2024-10-20 10:02:00" "2024-01-01 08:00:00" ...
## $ datetime_pasang : POSIXct, format: "2024-10-20 17:51:00" "2024-01-01 08:00:00" ...
## $ datetime_set_tanggal: POSIXct, format: "2024-10-21 05:54:00" "2024-08-01 22:00:00" ...
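These combined datetime columns are what enable the lead-time and service-duration analysis described earlier. A minimal sketch of such a calculation (the derived column names here are illustrative and are not part of the saved pipeline):

# Lead time = arrival vs. required time; service duration = install to removal
genset %>%
  mutate(
    lead_time_hours   = as.numeric(difftime(datetime_tiba, datetime_diperlukan, units = "hours")),
    service_dur_hours = as.numeric(difftime(datetime_set_tanggal, datetime_pasang, units = "hours"))
  ) %>%
  select(lead_time_hours, service_dur_hours) %>%
  summary()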
####################################################
# 6. Detect & Convert Numeric Columns Stored as Character
####################################################
classes <- sapply(genset, class)
character_columns <- names(classes[classes == "character"])
# Find character columns that contain only digits / spaces / dots / minus signs
possible_numeric <- character_columns[
sapply(genset[character_columns], function(x)
all(grepl("^[0-9 .-]*$", x) | is.na(x)))
]
cat("Possible numeric columns:\n")
## Possible numeric columns:
print(possible_numeric)
## [1] "jumlah_jam_perkhidmatan"
genset[possible_numeric] <- lapply(genset[possible_numeric], function(x) {
x <- trimws(x)
x[x == ""] <- NA
as.numeric(x)
})
# Re-check data types
sapply(genset, class)
## $no
## [1] "integer"
##
## $jenis_kerja
## [1] "character"
##
## $nombor_dokumen
## [1] "character"
##
## $gen_set
## [1] "character"
##
## $negeri
## [1] "character"
##
## $daerah
## [1] "character"
##
## $lokasi_kerja
## [1] "character"
##
## $request_kapasiti
## [1] "integer"
##
## $tarikh_diperlukan
## [1] "Date"
##
## $masa_diperlukan
## [1] "Period"
## attr(,"package")
## [1] "lubridate"
##
## $tarikh_tiba
## [1] "Date"
##
## $masa_tiba
## [1] "Period"
## attr(,"package")
## [1] "lubridate"
##
## $tarikh_pasang
## [1] "Date"
##
## $masa_pasang
## [1] "Period"
## attr(,"package")
## [1] "lubridate"
##
## $tarikh_set_tanggal
## [1] "Date"
##
## $masa_set_tanggal
## [1] "Period"
## attr(,"package")
## [1] "lubridate"
##
## $total_mileage_km
## [1] "integer"
##
## $jumlah_jam_perkhidmatan
## [1] "numeric"
##
## $nombor_sales_order
## [1] "character"
##
## $nombor_services_order
## [1] "character"
##
## $job_order_pm_order_wbs_number_project_number
## [1] "character"
##
## $type_of_ref_no
## [1] "character"
##
## $referance_no
## [1] "character"
##
## $job_description
## [1] "character"
##
## $unit
## [1] "character"
##
## $unit_selepas_semakan
## [1] "character"
##
## $emergency_shutdown
## [1] "character"
##
## $datetime_diperlukan
## [1] "POSIXct" "POSIXt"
##
## $datetime_tiba
## [1] "POSIXct" "POSIXt"
##
## $datetime_pasang
## [1] "POSIXct" "POSIXt"
##
## $datetime_set_tanggal
## [1] "POSIXct" "POSIXt"
####################################################
# 7. Missing Values Treatment
####################################################
# 7.1 Percentage of missing values per column
missing_pct <- colSums(is.na(genset)) / nrow(genset) * 100
missing_pct
## no
## 0.000000000
## jenis_kerja
## 0.000000000
## nombor_dokumen
## 0.000000000
## gen_set
## 0.000000000
## negeri
## 0.000000000
## daerah
## 0.000000000
## lokasi_kerja
## 0.000000000
## request_kapasiti
## 0.000000000
## tarikh_diperlukan
## 0.000000000
## masa_diperlukan
## 0.000000000
## tarikh_tiba
## 0.124868888
## masa_tiba
## 0.124868888
## tarikh_pasang
## 0.004994756
## masa_pasang
## 0.124868888
## tarikh_set_tanggal
## 0.134858399
## masa_set_tanggal
## 0.124868888
## total_mileage_km
## 0.000000000
## jumlah_jam_perkhidmatan
## 0.859097947
## nombor_sales_order
## 15.009240298
## nombor_services_order
## 0.004994756
## job_order_pm_order_wbs_number_project_number
## 0.000000000
## type_of_ref_no
## 0.000000000
## referance_no
## 8.585984716
## job_description
## 0.000000000
## unit
## 0.619349683
## unit_selepas_semakan
## 6.473203137
## emergency_shutdown
## 82.188701863
## datetime_diperlukan
## 0.000000000
## datetime_tiba
## 0.124868888
## datetime_pasang
## 0.129863643
## datetime_set_tanggal
## 0.134858399
# 7.2 Example: impute numeric columns with the median
# (other columns can be added if needed)
if ("jumlah_jam_perkhidmatan" %in% names(genset)) {
genset$jumlah_jam_perkhidmatan[is.na(genset$jumlah_jam_perkhidmatan)] <-
median(genset$jumlah_jam_perkhidmatan, na.rm = TRUE)
}
# 7.3 Impute categorical: NA → "MISSING"
cat_cols <- names(genset)[sapply(genset, is.character)]
genset[cat_cols] <- lapply(genset[cat_cols], function(x) {
x[is.na(x)] <- "MISSING"
x
})
# Quick check
sum(is.na(genset))
## [1] 206
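The 206 remaining NAs are confined to the date, time, and derived datetime columns, where an absent timestamp is genuinely missing information and is deliberately left as NA rather than imputed. A quick check along the lines of the earlier colSums() call confirms this:

# Show only the columns that still contain NAs
na_counts <- colSums(is.na(genset))
na_counts[na_counts > 0]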
summary(genset$jumlah_jam_perkhidmatan)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 13.00 29.00 53.78 59.00 908.00
####################################################
# 8. (Optional) Outlier Summary for Numeric Columns
####################################################
num_cols <- c(
"request_kapasiti",
"total_mileage_km",
"jumlah_jam_perkhidmatan"
)
num_cols <- intersect(num_cols, names(genset))
iqr_outlier_info <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
n_outliers <- sum(x < lower_bound | x > upper_bound, na.rm = TRUE)
c(
Q1 = Q1,
Q3 = Q3,
IQR = IQR,
Lower_Bound = lower_bound,
Upper_Bound = upper_bound,
Number_of_Outliers = n_outliers
)
}
outlier_summary <- do.call(
rbind,
lapply(num_cols, function(col) {
cbind(Column = col, t(iqr_outlier_info(genset[[col]])))
})
)
outlier_summary <- as.data.frame(outlier_summary)
outlier_summary
## Column Q1.25% Q3.75% IQR.75% Lower_Bound.25% Upper_Bound.75%
## 1 request_kapasiti 500 1000 500 -250 1750
## 2 total_mileage_km 34 150 116 -140 324
## 3 jumlah_jam_perkhidmatan 13 59 46 -56 128
## Number_of_Outliers
## 1 0
## 2 1778
## 3 1806
# Save the outlier summary if needed
write.csv(outlier_summary,
"outlier_summary_numeric.csv",
row.names = FALSE)
####################################################
# 9. SAVE CLEAN DATA
####################################################
# Name the final clean dataset genset_clean
genset_clean <- genset
# Save as CSV for EDA / modelling
write.csv(genset_clean,
"Y:/data/MNC_UM/WQD7001/group project/GA2/rawdata/genset_clean.csv",
row.names = FALSE)
# (Optional) Save as RDS (preserves column types)
#saveRDS(genset_clean,
# file = "genset_clean.rds")
#cat("Clean dataset saved as 'genset_clean.csv' and 'genset_clean.rds'\n")
With the data cleaned, the next step is to explore distributions, central tendencies, associations, and patterns using statistical and graphical methods. Of the 27 original variables, 10 were retained for EDA and modelling: two numeric, two date variables, and six categorical. The key target variables are request_kapasiti and jumlah_jam_perkhidmatan. The 10 retained variables are gen_set, negeri, daerah, lokasi_kerja, jenis_kerja, request_kapasiti, tarikh_pasang, tarikh_set_tanggal, total_mileage_km, and jumlah_jam_perkhidmatan.
Basic statistics were generated across the dataset for an initial understanding.
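# df is assumed to hold the cleaned dataset (genset_clean) produced in the previous section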
gensetDF <- df
#filter the 10 needed columns
gensetDF<-gensetDF %>% select(gen_set, negeri, daerah, lokasi_kerja, jenis_kerja, request_kapasiti, tarikh_pasang, tarikh_set_tanggal, total_mileage_km, jumlah_jam_perkhidmatan)
# convert categorical columns to factor
gensetDF<-gensetDF %>%
mutate(
gen_set= as.factor(gen_set),
negeri= as.factor(negeri),
daerah= as.factor(daerah),
lokasi_kerja= as.factor(lokasi_kerja),
jenis_kerja= as.factor(jenis_kerja),
request_kapasiti= as.factor(request_kapasiti)
)
gensetDF$tarikh_pasang <- as.Date(gensetDF$tarikh_pasang, format = "%Y-%m-%d")
gensetDF$tarikh_set_tanggal <- as.Date(gensetDF$tarikh_set_tanggal, format = "%Y-%m-%d")
summary(gensetDF)
## gen_set negeri daerah
## V-HNTK-VER8616-6D: 92 SELANGOR :4599 RAWANG : 860
## B-HNTK-VEF6362-2D: 88 JOHOR :3391 KLANG : 813
## VAY4428(C44) : 82 PAHANG :2113 JOHOR BAHRU: 736
## D-SLSA-PPH9226-2C: 81 KELANTAN :2014 KL BARAT : 487
## V-HNTK-VFS9985-3D: 81 KUALA LUMPUR:1762 KL TIMUR : 466
## D-GMGE-VDX733-2D : 80 KEDAH :1452 BANTING : 459
## (Other) :19517 (Other) :4690 (Other) :16200
## lokasi_kerja jenis_kerja request_kapasiti
## PPU DENGKIL : 110 Breakdown :9020 125 : 667
## PPU SRI KIJANG 33/11KV : 56 Emergency Shutdown: 536 300 : 4337
## PMU VALDOR 132/11KV : 47 Shutdown :8928 500 : 4900
## PPU PEKAN JENJAROM BANTING: 47 Standby :1537 1000:10117
## PPU BANDAR TASEK MUTIARA : 33
## PPU TAMAN SELAYANG MUTIARA: 33
## (Other) :19695
## tarikh_pasang tarikh_set_tanggal total_mileage_km
## Min. :2000-01-01 Min. :2000-01-01 Min. : 0
## 1st Qu.:2024-04-05 1st Qu.:2024-04-05 1st Qu.: 34
## Median :2024-07-01 Median :2024-07-01 Median : 68
## Mean :2024-06-10 Mean :2024-04-26 Mean : 127
## 3rd Qu.:2024-09-27 3rd Qu.:2024-09-28 3rd Qu.: 150
## Max. :2025-01-02 Max. :2025-01-06 Max. :6326
## NA's :27
## jumlah_jam_perkhidmatan
## Min. : 1.00
## 1st Qu.: 13.00
## Median : 29.00
## Mean : 53.78
## 3rd Qu.: 59.00
## Max. :908.00
##
# --UNIVARIATE ANALYSIS-----------
#generate boxplots for numeric
numeric_cols <- which(sapply(gensetDF, is.numeric))
boxplot(gensetDF[, numeric_cols], main = "Boxplots of Numeric Variables", las = 2)
Both numerical variables are clearly right-skewed. For total_mileage_km, there is one obvious outlier; it was confirmed as a legitimate special-case entry during the data cleaning stage. It is nevertheless recommended to remove this single record to avoid issues in model training later.
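A minimal sketch of that removal, keeping a separate modelling copy so the EDA outputs in this section are unaffected (the filter targets the single maximum value visible in the summary above):

# Drop the single extreme total_mileage_km record flagged above
gensetDF_model <- gensetDF %>%
  filter(total_mileage_km < max(total_mileage_km, na.rm = TRUE))
nrow(gensetDF) - nrow(gensetDF_model)  # rows removed (1 if the maximum is unique)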
# generate bar charts for categorical variables
# Identify categorical variables
cat_cols <- sapply(gensetDF, function(x) is.factor(x) || is.character(x))
cat_names <- names(gensetDF)[cat_cols]
#4 plots per row
n_row <- ceiling(length(cat_names) / 4)
par(mfrow = c(n_row, 4), mar = c(6, 4, 2, 1)) # extra bottom margin for angled labels
for (colname in cat_names) {
counts <- table(gensetDF[[colname]])
# Draw barplot without x labels
bp <- barplot(counts,
main = paste("Distribution of", colname),
col = "lightblue",
ylab = "Count",
xaxt = "n") # suppress x-axis labels
# Add rotated labels manually (~25 degrees)
text(x = bp, y = -0.05 * max(counts), labels = names(counts),
srt = 45, adj = 1, xpd = TRUE, cex = 0.8)
}
The cardinalities of gen_set, daerah, and lokasi_kerja are very high, with no clear distribution pattern. By negeri, the more industrialized and densely populated states record the highest request counts. For jenis_kerja, most entries involve breakdowns and shutdowns. In terms of request_kapasiti, gensets rated at 1000 kVA show the highest demand.
par(mfrow=c(1,1)) # restore to single plot layout
#scatter plot
plot(gensetDF$total_mileage_km, gensetDF$jumlah_jam_perkhidmatan,
main = "Scatter Plot: total_mileage_km vs jumlah_jam_perkhidmatan",
xlab = "total_mileage_km",
ylab = "jumlah_jam_perkhidmatan",
pch = 19, col = "blue")
For the bivariate analysis, numerical and categorical independent
variables are examined against the target variables request_kapasiti and
jumlah_jam_perkhidmatan.
The scatter plot indicates that operations are generally well optimized, with genset travel distance and utilization hours closely aligned, apart from the single outlier noted earlier.
# Boxplot with rotated x-axis labels
boxplot(gensetDF$jumlah_jam_perkhidmatan ~ gensetDF$negeri,
data = gensetDF,
col = "lightblue",
main = "Boxplot of negeri vs. jumlah_jam_perkhidmatan",
xlab = "Category", ylab = "Numeric Value",
xaxt = "n") # suppress default x-axis labels
# Add rotated labels at 45 degrees
axis_positions <- 1:length(levels(gensetDF$negeri))
text(x = axis_positions,
y = par("usr")[3] - 0.05 * diff(par("usr")[3:4]), # place below axis
labels = levels(gensetDF$negeri),
srt = 45, adj = 1, xpd = TRUE, cex = 0.8)
# Boxplot with rotated x-axis labels
boxplot(gensetDF$jumlah_jam_perkhidmatan ~ gensetDF$jenis_kerja,
data = gensetDF,
col = "lightblue",
main = "Boxplot of jenis_kerja vs. jumlah_jam_perkhidmatan",
xlab = "Category", ylab = "Numeric Value",
xaxt = "n") # suppress default x-axis labels
# Add rotated labels at 45 degrees
axis_positions <- 1:length(levels(gensetDF$jenis_kerja))
text(x = axis_positions,
y = par("usr")[3] - 0.05 * diff(par("usr")[3:4]), # place below axis
labels = levels(gensetDF$jenis_kerja),
srt = 45, adj = 1, xpd = TRUE, cex = 0.8)
# Boxplot with rotated x-axis labels
boxplot(gensetDF$jumlah_jam_perkhidmatan ~ gensetDF$request_kapasiti,
data = gensetDF,
col = "lightblue",
main = "Boxplot of request_kapasiti vs. jumlah_jam_perkhidmatan",
xlab = "Category", ylab = "Numeric Value",
xaxt = "n") # suppress default x-axis labels
# Add rotated labels at 45 degrees
axis_positions <- 1:length(levels(gensetDF$request_kapasiti))
text(x = axis_positions,
y = par("usr")[3] - 0.05 * diff(par("usr")[3:4]), # place below axis
labels = levels(gensetDF$request_kapasiti),
srt = 45, adj = 1, xpd = TRUE, cex = 0.8)
# Boxplot with rotated x-axis labels
boxplot(gensetDF$total_mileage_km ~ gensetDF$request_kapasiti,
data = gensetDF,
col = "lightblue",
main = "Boxplot of request_kapasiti vs. total_mileage_km",
xlab = "Category", ylab = "Numeric Value",
xaxt = "n") # suppress default x-axis labels
# Add rotated labels at 45 degrees
axis_positions <- 1:length(levels(gensetDF$request_kapasiti))
text(x = axis_positions,
y = par("usr")[3] - 0.05 * diff(par("usr")[3:4]), # place below axis
labels = levels(gensetDF$request_kapasiti),
srt = 45, adj = 1, xpd = TRUE, cex = 0.8)
Mean genset service hours are higher in Pahang, Pulau Pinang, and Selangor, indicating longer usage, while Kedah, Kelantan, Pahang, and Pulau Pinang show greater spread, reflecting higher variability. For jenis_kerja, standby cases record the highest mean, showing that many gensets are dispatched solely for standby duty. Utilization hours are fairly evenly distributed across genset capacities.
# count table
counts <- table(gensetDF$negeri, gensetDF$request_kapasiti)
# Draw stacked bar chart
bp <- barplot(counts,
main = "Stacked Bar Chart: request_kapasiti vs negeri",
xlab = "request kapasiti", ylab = "Count",
col = rainbow(nrow(counts)),
las = 2)
# Add legend on the left side
legend("left",
legend = rownames(counts),
fill = rainbow(nrow(counts)),
inset = c(-0.15, 0), # nudge leftward
xpd = TRUE, # allow drawing outside plot
cex = 0.8)
# count table
counts <- table(gensetDF$jenis_kerja, gensetDF$request_kapasiti)
# Draw stacked bar chart
bp <- barplot(counts,
main = "Stacked Bar Chart: request_kapasiti vs jenis_kerja",
xlab = "request kapasiti", ylab = "Count",
col = rainbow(nrow(counts)),
las = 2)
# Add legend on the left side
legend("left",
legend = rownames(counts),
fill = rainbow(nrow(counts)),
inset = c(-0.15, 0), # nudge leftward
xpd = TRUE, # allow drawing outside plot
cex = 0.8)
The stacked bar charts reveal that requests for 1000 kVA gensets are highest in the industrialized, densely populated states. Across all genset capacities, shutdown and breakdown cases dominate the requests.
# Create monthly sequence from min start to max end
timeline <- tibble(
month = seq(floor_date(min(gensetDF$tarikh_pasang), "month"),
ceiling_date(max((gensetDF[!is.na(gensetDF$tarikh_set_tanggal), ])$tarikh_set_tanggal), "month"),
by = "month")
)
# Count active requests per month
timeline <- timeline %>%
rowwise() %>%
mutate(active_requests = sum((gensetDF[!is.na(gensetDF$tarikh_set_tanggal), ])$tarikh_pasang <= month & (gensetDF[!is.na(gensetDF$tarikh_set_tanggal), ])$tarikh_set_tanggal >= month))
# Plot monthly time series for 2024
ggplot(timeline[grepl("^2024", timeline$month), ], aes(x = month, y = active_requests)) +
geom_line(color = "steelblue", linewidth = 1) +
geom_point(color = "darkred") +
scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
labs(title = "In Operation Gensets by Month",
x = "Month", y = "Number of Gensets in Operation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
A temporal pattern appears in the data that can be explored during the
modelling stage. Genset requests decline during festive seasons and rise
again in the second quarter of the year.
The association between the two numerical variables, jumlah_jam_perkhidmatan and total_mileage_km, is weak. The same is observed for associations among categorical variables. The strongest link to the target variable is lokasi_kerja (location). Given these weak relationships, challenges in model training were anticipated.
# Numerical association: Pearson correlation
vars <- gensetDF[, names(which(sapply(gensetDF, is.numeric)))]
cor_matrix <- cor(vars, method = "pearson", use = "complete.obs")
print(cor_matrix)
## total_mileage_km jumlah_jam_perkhidmatan
## total_mileage_km 1.00000000 0.09276926
## jumlah_jam_perkhidmatan 0.09276926 1.00000000
library(corrplot)
corrplot(cor_matrix,
method = "color", # colored squares
addCoef.col = "black", # show correlation values
tl.col = "black", # text label color
tl.srt = 45, # rotate labels
col = colorRampPalette(c("red", "white", "blue"))(200),
title = "Heatmap of Pearson Correlation",
mar = c(0,0,2,0))
# Categorical association: Chi-square statistics between factor pairs
# --- Step 1: Define categorical variables ---
cat_names <- names(gensetDF)[sapply(gensetDF, is.factor)] # or manually list them
# --- Step 2: Build Chi-square statistic matrix ---
chi_matrix <- matrix(NA, nrow=length(cat_names), ncol=length(cat_names),
dimnames=list(cat_names, cat_names))
for (i in seq_along(cat_names)) {
for (j in seq_along(cat_names)) {
if (i != j) {
tbl <- table(gensetDF[[cat_names[i]]], gensetDF[[cat_names[j]]])
chi_matrix[i,j] <- suppressWarnings(chisq.test(tbl)$statistic)
} else {
chi_matrix[i,j] <- NA # self-association not meaningful
}
}
}
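Because the raw Chi-square statistic grows with sample size and table dimensions, a normalized measure such as Cramér’s V is easier to compare across variable pairs. A minimal sketch (cramers_v is a hypothetical helper, not part of the pipeline above):

# Cramér's V: sqrt(chi-square / (n * (min(rows, cols) - 1)))
cramers_v <- function(x, y) {
  tbl <- table(x, y)
  chi <- suppressWarnings(chisq.test(tbl)$statistic)
  n <- sum(tbl)
  k <- min(nrow(tbl), ncol(tbl)) - 1
  as.numeric(sqrt(chi / (n * k)))
}
# Example pair: state vs. job type
cramers_v(gensetDF$negeri, gensetDF$jenis_kerja)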
The first modelling task predicts requested capacity (request_kapasiti) as a multi-class classification problem. The data is prepared as follows:
# Step 1: Data preparation
data <- df
data$tarikh_diperlukan <- as.Date(data$tarikh_diperlukan, format = "%Y-%m-%d")
data <- data %>%
mutate(month = month(tarikh_diperlukan),
rainy_season = ifelse(month %in% c(10, 11, 12, 1, 2, 3), 1, 0), # Monsoon months
jenis_kerja = as.factor(jenis_kerja))
# Group rare negeri into "Other"
negeri_counts <- table(data$negeri)
rare_negeri <- names(negeri_counts[negeri_counts < 5])
data$negeri[data$negeri %in% rare_negeri] <- "Other"
data$negeri <- as.factor(data$negeri)
# Original levels for mapping back
original_levels <- c("125", "300", "500", "1000")
# For classification models: Relabel to valid R names
new_levels <- c("Cap125", "Cap300", "Cap500", "Cap1000")
data$request_kapasiti_class <- factor(data$request_kapasiti, levels = original_levels, labels = new_levels)
data <- data %>% select(request_kapasiti_class, month, rainy_season, negeri, jenis_kerja) %>% na.omit()
# Step 2: Train/Test Split (stratified on class)
set.seed(123)
train_index <- createDataPartition(data$request_kapasiti_class, p = 0.8, list = FALSE)
train <- data[train_index, ]
test <- data[-train_index, ]
# Filter test for seen negeri levels
test <- test %>% filter(negeri %in% levels(train$negeri))
# Compute class weights for classification models
class_freq <- table(train$request_kapasiti_class)
class_weights <- 1 / class_freq
class_weights <- class_weights / sum(class_weights) * length(class_weights)
names(class_weights) <- new_levels
All three algorithms (Random Forest, XGBoost, and Multinomial Logistic Regression) were trained and evaluated, with the following results:
# Step 3: Model 1 - Random Forest (Main Model, Classification)
control_rf <- trainControl(method = "cv", number = 3, classProbs = TRUE, summaryFunction = multiClassSummary)
grid_rf <- expand.grid(mtry = c(2, 4),
splitrule = c("gini", "extratrees"),
min.node.size = c(1, 10))
model_rf <- train(request_kapasiti_class ~ month + rainy_season + negeri + jenis_kerja,
data = train,
method = "ranger",
num.trees = 500,
importance = "impurity",
class.weights = class_weights,
tuneGrid = grid_rf,
trControl = control_rf,
verbose = FALSE)
# Predict and map back to original labels
pred_rf <- predict(model_rf, newdata = test)
pred_rf_orig <- factor(pred_rf, levels = new_levels, labels = original_levels)
test_class_orig <- factor(test$request_kapasiti_class, levels = new_levels, labels = original_levels)
cm_rf <- confusionMatrix(pred_rf_orig, test_class_orig)
accuracy_rf <- cm_rf$overall['Accuracy']
weighted_f1_rf <- weighted.mean(cm_rf$byClass[, "F1"], w = table(test_class_orig) / nrow(test), na.rm = TRUE)
macro_recall_rf <- mean(cm_rf$byClass[, "Recall"], na.rm = TRUE)
# Confusion Matrix Heatmap for Random Forest
cm_rf_table <- as.table(cm_rf$table)
cm_rf_df <- as.data.frame(cm_rf_table)
ggplot(cm_rf_df, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), vjust = 1) +
scale_fill_gradient(low = "white", high = "blue") +
theme_minimal() +
labs(title = "Confusion Matrix - Random Forest")
# Step 4: Model 2 - XGBoost (Classification for Comparison)
# One-hot encode for XGBoost
dummy <- dummyVars(~ month + rainy_season + negeri + jenis_kerja, data = train)
X_train <- predict(dummy, newdata = train)
y_train <- as.numeric(train$request_kapasiti_class) - 1
X_test <- predict(dummy, newdata = test)
y_test <- as.numeric(test$request_kapasiti_class) - 1
dtrain <- xgb.DMatrix(data = X_train, label = y_train)
dtest <- xgb.DMatrix(data = X_test, label = y_test)
params_xgb <- list(
objective = "multi:softprob",
num_class = 4,
eta = 0.05,
max_depth = 4,
subsample = 0.8,
colsample_bytree = 0.8,
eval_metric = "mlogloss"
)
watchlist <- list(train = dtrain, eval = dtest)
model_xgb <- xgb.train(params_xgb, dtrain, nrounds = 1000, watchlist = watchlist, early_stopping_rounds = 20, verbose = 0)
# Predict
preds_prob_xgb <- predict(model_xgb, dtest)
preds_matrix_xgb <- matrix(preds_prob_xgb, nrow = nrow(test), ncol = 4, byrow = TRUE)
pred_xgb <- max.col(preds_matrix_xgb) - 1
pred_xgb_orig <- factor(pred_xgb, levels = 0:3, labels = original_levels)
cm_xgb <- confusionMatrix(pred_xgb_orig, test_class_orig)
accuracy_xgb <- cm_xgb$overall['Accuracy']
weighted_f1_xgb <- weighted.mean(cm_xgb$byClass[, "F1"], w = table(test_class_orig) / nrow(test), na.rm = TRUE)
macro_recall_xgb <- mean(cm_xgb$byClass[, "Recall"], na.rm = TRUE)
# Confusion Matrix Heatmap for XGBoost
cm_xgb_table <- as.table(cm_xgb$table)
cm_xgb_df <- as.data.frame(cm_xgb_table)
ggplot(cm_xgb_df, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), vjust = 1) +
scale_fill_gradient(low = "white", high = "blue") +
theme_minimal() +
labs(title = "Confusion Matrix - XGBoost")
# Step 4.5: Model - Multinomial Logistic Regression
# We use 'multinom' from the nnet package via caret
control_lr <- trainControl(method = "cv", number = 3, classProbs = TRUE)
model_lr <- train(request_kapasiti_class ~ month + rainy_season + negeri + jenis_kerja,
data = train,
method = "multinom",
trControl = control_lr,
trace = FALSE) # trace=FALSE hides the iteration logs
# Predict and map back to original labels
pred_lr <- predict(model_lr, newdata = test)
pred_lr_orig <- factor(pred_lr, levels = new_levels, labels = original_levels)
cm_lr <- confusionMatrix(pred_lr_orig, test_class_orig)
accuracy_lr <- cm_lr$overall['Accuracy']
weighted_f1_lr <- weighted.mean(cm_lr$byClass[, "F1"], w = table(test_class_orig) / nrow(test), na.rm = TRUE)
macro_recall_lr <- mean(cm_lr$byClass[, "Recall"], na.rm = TRUE)
# Confusion Matrix Heatmap for Logistic Regression
cm_lr_table <- as.table(cm_lr$table)
cm_lr_df <- as.data.frame(cm_lr_table)
ggplot(cm_lr_df, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), vjust = 1) +
scale_fill_gradient(low = "white", high = "blue") +
theme_minimal() +
labs(title = "Confusion Matrix - Logistic Regression")
# Step 5: Compare Performance (Updated)
metrics <- data.frame(
Model = c("Random Forest", "XGBoost", "Logistic Regression"),
Accuracy = c(accuracy_rf, accuracy_xgb, accuracy_lr),
Weighted_F1 = c(weighted_f1_rf, weighted_f1_xgb, weighted_f1_lr),
Macro_Recall = c(macro_recall_rf, macro_recall_xgb, macro_recall_lr)
)
print(metrics)
## Model Accuracy Weighted_F1 Macro_Recall
## 1 Random Forest 0.5331002 0.4730562 0.3085022
## 2 XGBoost 0.2530602 0.2855560 0.2505855
## 3 Logistic Regression 0.5301024 0.4865376 0.3165443
# Step 6: Visualize Comparisons
# Bar plot for Accuracy, Weighted F1, Macro Recall
metrics_long <- metrics %>%
pivot_longer(cols = c(Accuracy, Weighted_F1, Macro_Recall), names_to = "Metric", values_to = "Value")
ggplot(metrics_long, aes(x = Metric, y = Value, fill = Model)) +
geom_bar(stat = "identity", position = position_dodge()) +
theme_minimal() +
labs(title = "Model Performance Comparison", x = "Metric", y = "Score") +
scale_fill_brewer(palette = "Set1")
# Save plot (optional)
#ggsave("model_comparison.png")
# Feature Importance for Random Forest (best model)
importance_rf <- varImp(model_rf)
print(importance_rf)
## ranger variable importance
##
## Overall
## negeriSELANGOR 100.000
## negeriKELANTAN 82.547
## month 76.816
## negeriPULAU PINANG 75.909
## jenis_kerjaShutdown 53.824
## negeriKUALA LUMPUR 46.389
## negeriTERENGGANU 33.912
## negeriPERAK 28.307
## jenis_kerjaStandby 26.993
## negeriMELAKA 23.383
## negeriKEDAH 19.897
## negeriPAHANG 12.869
## rainy_season 11.976
## jenis_kerjaEmergency Shutdown 5.229
## negeriNEGERI SEMBILAN 4.636
## negeriPERLIS 1.100
## negeriPUTRA JAYA 0.000
# Feature Importance for Logistic Regression
importance_lr <- varImp(model_lr)
print(importance_lr)
## multinom variable importance
##
## Overall
## `negeriPUTRA JAYA` 100.00
## negeriSELANGOR 62.10
## negeriKELANTAN 50.91
## jenis_kerjaStandby 43.92
## negeriMELAKA 39.40
## `negeriPULAU PINANG` 38.76
## `jenis_kerjaEmergency Shutdown` 36.11
## negeriPERLIS 34.60
## negeriKEDAH 34.33
## `negeriKUALA LUMPUR` 33.01
## negeriTERENGGANU 32.01
## `negeriNEGERI SEMBILAN` 31.22
## negeriPERAK 30.92
## negeriPAHANG 17.22
## jenis_kerjaShutdown 15.32
## rainy_season 3.20
## month 0.00
The choice of model was guided by three main criteria: accuracy, weighted F1 score, and macro recall, as compared above. The second modelling task predicts service duration (the hours between genset installation and removal) as a regression problem.
# Data Cleaning & Feature Engineering
# Convert to timestamps and calculate duration
df_clean <- df %>%
mutate(
# dt_pasang = dmy_hm(datetime_pasang),
# dt_tanggal = dmy_hm(datetime_set_tanggal)
dt_pasang = ymd(tarikh_pasang),
dt_tanggal = ymd(tarikh_set_tanggal)
) %>%
mutate(duration = as.numeric(difftime(dt_tanggal, dt_pasang, units="hours"))) %>%
# PREPROCESSING: Remove extreme outliers (500+ hours) and non-positive or missing durations
filter(!is.na(duration) & duration > 0 & duration < 500) %>%
# FEATURE ENGINEERING: Extract time of day/month
mutate(
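# NOTE: dt_pasang is a Date (parsed with ymd() above), so hour() always
# returns 0 here; this explains hour_pasang's zero importance further below.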
hour_pasang = hour(dt_pasang),
month_pasang = month(dt_pasang),
# Log Transformation to handle skewness
log_duration = log1p(duration)
)
# Select and format features
ml_data <- df_clean %>%
select(duration, log_duration, jenis_kerja, negeri,
request_kapasiti, total_mileage_km, hour_pasang, month_pasang) %>%
mutate(across(c(jenis_kerja, negeri), as.factor))
# Data Splitting (80% Train, 20% Test)
set.seed(42)
trainIndex <- createDataPartition(ml_data$duration, p = 0.8, list = FALSE)
train_set <- ml_data[trainIndex, ]
test_set <- ml_data[-trainIndex, ]
# Model Evaluation Function
get_metrics <- function(actual, predicted, model_name) {
mae <- mean(abs(actual - predicted))
rmse <- sqrt(mean((actual - predicted)^2))
r2 <- cor(actual, predicted)^2
return(data.frame(Model = model_name, MAE = mae, RMSE = rmse, R2 = r2))
}
All three algorithms were evaluated to identify the most reliable model for operational use.
# --- MODEL 1: LINEAR REGRESSION ---
lm_model <- lm(duration ~ jenis_kerja + negeri + request_kapasiti +
total_mileage_km + hour_pasang + month_pasang, data = train_set)
lm_pred <- predict(lm_model, test_set)
# --- MODEL 2: RANDOM FOREST ---
rf_model <- randomForest(duration ~ jenis_kerja + negeri + request_kapasiti +
total_mileage_km + hour_pasang + month_pasang,
data = train_set, ntree = 100)
rf_pred <- predict(rf_model, test_set)
# --- MODEL 3: XGBOOST (log-transformed target) ---
# XGBoost requires a matrix format and we use log-transformed target
train_matrix <- sparse.model.matrix(log_duration ~ . - duration - 1, data = train_set)
test_matrix <- sparse.model.matrix(log_duration ~ . - duration - 1, data = test_set)
xgb_model <- xgboost(data = train_matrix, label = train_set$log_duration,
nrounds = 100, objective = "reg:squarederror",
verbose = 0, eta = 0.1, max_depth = 6)
# Predict and convert back from log to original scale
xgb_pred_log <- predict(xgb_model, test_matrix)
xgb_pred <- expm1(xgb_pred_log)
# 6. Evaluation and Comparison
results <- rbind(
get_metrics(test_set$duration, lm_pred, "Linear Regression"),
get_metrics(test_set$duration, rf_pred, "Random Forest"),
get_metrics(test_set$duration, xgb_pred, "XGBoost")
)
print("--- Final Model Performance Comparison ---")
## [1] "--- Final Model Performance Comparison ---"
print(results)
## Model MAE RMSE R2
## 1 Linear Regression 35.99285 54.39508 0.09267082
## 2 Random Forest 32.26968 49.39906 0.25463319
## 3 XGBoost 30.08559 52.03442 0.21346377
# Visualization of model comparison
# --- Step 1: Reshape to long format ---
df_long <- melt(results, id.vars = "Model",
variable.name = "Metric",
value.name = "Value")
# --- Step 2: Plot grouped bar chart with numbers on top ---
ggplot(df_long, aes(x = Model, y = Value, fill = Metric)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.9)) +
geom_text(aes(label = round(Value, 2)),
position = position_dodge(width = 0.9),
vjust = -0.3, size = 3) +
labs(title = "Model Performance Comparison",
y = "Metric Value", x = "Model") +
scale_fill_brewer(palette = "Set2") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 20, hjust = 1))
# Predicted vs Actual for Random Forest
library(ggplot2)
rf_pred <- predict(rf_model, test_set)
df_plot <- data.frame(actual = test_set$duration, pred = rf_pred)
ggplot(df_plot, aes(x = actual, y = pred)) +
geom_point(color = "darkgreen", alpha = 0.4, size = 1.5) +
geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
labs(title = "Random Forest: Predicted vs Actual Duration (Test Set)",
x = "Actual Hours", y = "Predicted Hours") +
theme_minimal()
# Using the randomForest package's importance & varImpPlot
randomForest::importance(rf_model) # numeric importance per feature
## IncNodePurity
## jenis_kerja 2257190
## negeri 3763816
## request_kapasiti 1332312
## total_mileage_km 5991445
## hour_pasang 0
## month_pasang 3120262
varImpPlot(rf_model, main = "Random Forest Feature Importance")
The data interpretation from both models reveals consistent patterns in mobile generator deployment for power-related jobs in Malaysia.

For capacity demand prediction (multi-class classification), the Multinomial Logistic Regression model identifies “negeri” (state) as the primary driver, with high-demand urban areas like Selangor (4,599 requests) and Johor (3,391) favouring larger capacities (e.g., 1000 kVA, 10,117 instances). “Jenis_kerja” (job type) follows, dominated by Breakdown (9,020) and Shutdown (8,928), emphasizing operational urgency over seasonality. Class imbalance, with 125 kVA rare (667 instances), leads to under-prediction of smaller units, risking over-provisioning.

For service duration prediction (regression), the Random Forest model highlights “total_mileage_km” (mean 127 km) as the top predictor, indicating that logistical challenges extend durations. “Jenis_kerja” and “negeri” again rank high, with regional variations (e.g., rural vs. urban) and emergency job types contributing to variance. The model performs well for short durations (under 100 hours) but struggles with outliers.

Overall, location and job type are pivotal across both models, supported by raw data trends. These insights enable targeted resource optimization, reducing costs and improving response efficiency, though enhancements such as additional features (e.g., “daerah”) are recommended.
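The suggested “daerah” enhancement can be sketched as follows (a hypothetical extension, not part of the evaluated pipeline; the lumping threshold of 30 districts is an illustrative choice, needed because randomForest cannot handle factors with more than 53 levels):

# Lump rare districts into "Other", then rebuild the modelling frame with daerah
library(forcats)
ml_data2 <- df_clean %>%
  mutate(daerah = fct_lump_n(as.factor(daerah), n = 30)) %>%
  select(duration, log_duration, jenis_kerja, negeri, daerah,
         request_kapasiti, total_mileage_km, hour_pasang, month_pasang) %>%
  mutate(across(c(jenis_kerja, negeri), as.factor))
# Re-run the same train/test split and models above with ml_data2 in place of ml_data.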