Deskripsi Insight: Insight yang diambil dari dataset Coffee Chain ini adalah “Faktor Penentu Profitabilitas Produk”. Dalam manajemen operasional, perusahaan sering kali sulit menentukan apakah sebuah produk akan memberikan profit tinggi hanya dengan melihat pada satu aspek saja. Sehingga analisis ini bertujuan untuk membangun suatu model yang mampu mengklasifikasikan transaksi ke dalam kategori High Profit (Profit di atas median) atau Low Profit (Profit di bawah median) menggunakan metode Random Forest.

Data yang digunakan meliputi variabel:

Target: Profit_Status (High/Low).

Prediktor: Biaya Pemasaran (Marketing), Stok (Inventory), dan Total Pengeluaran (Total Expenses), Ukuran Pasar (Market Size), Wilayah (Market), dan Jenis Produk (Product Type).

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.2
## Warning: package 'tibble' was built under R version 4.4.2
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## Warning: package 'stringr' was built under R version 4.4.2
## Warning: package 'forcats' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Load Data
library(readxl)
## Warning: package 'readxl' was built under R version 4.4.3
df <- read_excel("C:/Users/Lenovo/Downloads/Coffee Chain.xlsx")
str(df)
## tibble [4,248 × 20] (S3: tbl_df/tbl/data.frame)
##  $ Area Code     : num [1:4248] 719 970 970 303 303 720 970 719 970 719 ...
##  $ Date          : POSIXct[1:4248], format: "2012-01-01" "2012-01-01" ...
##  $ Market        : chr [1:4248] "Central" "Central" "Central" "Central" ...
##  $ Market Size   : chr [1:4248] "Major Market" "Major Market" "Major Market" "Major Market" ...
##  $ Product       : chr [1:4248] "Amaretto" "Colombian" "Decaf Irish Cream" "Green Tea" ...
##  $ Product Line  : chr [1:4248] "Beans" "Beans" "Beans" "Leaves" ...
##  $ Product Type  : chr [1:4248] "Coffee" "Coffee" "Coffee" "Tea" ...
##  $ State         : chr [1:4248] "Colorado" "Colorado" "Colorado" "Colorado" ...
##  $ Type          : chr [1:4248] "Regular" "Regular" "Decaf" "Regular" ...
##  $ Budget COGS   : num [1:4248] 90 80 100 30 60 80 140 50 50 40 ...
##  $ Budget Margin : num [1:4248] 130 110 140 50 90 130 160 80 70 70 ...
##  $ Budget Profit : num [1:4248] 100 80 110 30 70 80 110 20 40 20 ...
##  $ Budget Sales  : num [1:4248] 220 190 240 80 150 210 300 130 120 110 ...
##  $ COGS          : num [1:4248] 89 83 95 44 54 72 170 63 60 58 ...
##  $ Inventory     : num [1:4248] 777 623 821 623 456 ...
##  $ Margin        : num [1:4248] 130 107 139 56 80 108 171 87 80 72 ...
##  $ Marketing     : num [1:4248] 24 27 26 14 15 23 47 57 19 22 ...
##  $ Profit        : num [1:4248] 94 68 101 30 54 53 99 0 33 17 ...
##  $ Sales         : num [1:4248] 219 190 234 100 134 180 341 150 140 130 ...
##  $ Total Expenses: num [1:4248] 36 39 38 26 26 55 72 87 47 55 ...
dim(df)
## [1] 4248   20
summary(df)
##    Area Code          Date                        Market         
##  Min.   :203.0   Min.   :2012-01-01 00:00:00   Length:4248       
##  1st Qu.:417.0   1st Qu.:2012-06-23 12:00:00   Class :character  
##  Median :573.0   Median :2012-12-16 12:00:00   Mode  :character  
##  Mean   :582.3   Mean   :2012-12-15 22:00:00                     
##  3rd Qu.:772.0   3rd Qu.:2013-06-08 12:00:00                     
##  Max.   :985.0   Max.   :2013-12-01 00:00:00                     
##  Market Size          Product          Product Line       Product Type      
##  Length:4248        Length:4248        Length:4248        Length:4248       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     State               Type            Budget COGS     Budget Margin   
##  Length:4248        Length:4248        Min.   :  0.00   Min.   :-210.0  
##  Class :character   Class :character   1st Qu.: 30.00   1st Qu.:  50.0  
##  Mode  :character   Mode  :character   Median : 50.00   Median :  70.0  
##                                        Mean   : 74.83   Mean   : 100.8  
##                                        3rd Qu.: 90.00   3rd Qu.: 130.0  
##                                        Max.   :450.00   Max.   : 690.0  
##  Budget Profit      Budget Sales         COGS          Inventory      
##  Min.   :-320.00   Min.   :   0.0   Min.   :  0.00   Min.   :-3534.0  
##  1st Qu.:  20.00   1st Qu.:  80.0   1st Qu.: 43.00   1st Qu.:  432.0  
##  Median :  40.00   Median : 130.0   Median : 60.00   Median :  619.0  
##  Mean   :  60.91   Mean   : 175.6   Mean   : 84.43   Mean   :  749.4  
##  3rd Qu.:  80.00   3rd Qu.: 210.0   3rd Qu.:100.00   3rd Qu.:  910.5  
##  Max.   : 560.00   Max.   :1140.0   Max.   :364.00   Max.   : 8252.0  
##      Margin          Marketing          Profit           Sales    
##  Min.   :-302.00   Min.   :  0.00   Min.   :-638.0   Min.   : 17  
##  1st Qu.:  52.75   1st Qu.: 13.00   1st Qu.:  17.0   1st Qu.:100  
##  Median :  76.00   Median : 22.00   Median :  40.0   Median :138  
##  Mean   : 104.29   Mean   : 31.19   Mean   :  61.1   Mean   :193  
##  3rd Qu.: 132.00   3rd Qu.: 39.00   3rd Qu.:  92.0   3rd Qu.:230  
##  Max.   : 613.00   Max.   :156.00   Max.   : 778.0   Max.   :912  
##  Total Expenses  
##  Min.   : 10.00  
##  1st Qu.: 33.00  
##  Median : 46.00  
##  Mean   : 54.06  
##  3rd Qu.: 65.00  
##  Max.   :190.00
# Penanganan Outlier
cap_outliers <- function(x) {
  if(!is.numeric(x)) return(x)
  qnt <- quantile(x, probs=c(.25, .75), na.rm = T)
  H <- 1.5 * IQR(x, na.rm = T)
  x[x < (qnt[1] - H)] <- (qnt[1] - H)
  x[x > (qnt[2] + H)] <- (qnt[2] + H)
  return(x)
}
# Data Preparation & Feature Engineering
median_p <- median(df$Profit, na.rm = TRUE)

df_model <- df %>%
  # Membuat Target
  mutate(Profit_Status = factor(ifelse(Profit > median_p, "High", "Low"), 
                                levels = c("Low", "High"))) %>%
  mutate(Marketing = cap_outliers(Marketing),
         Inventory = cap_outliers(Inventory),
         `Total Expenses` = cap_outliers(`Total Expenses`)) %>%
  
  # Memilih kolom yang akan digunakan
  select(Profit_Status, Marketing, Inventory, `Total Expenses`, 
         `Market Size`, `Product Type`, Market) %>%
  
  # Mengubah semua kolom teks (character) menjadi kategori (factor)
  mutate(across(where(is.character), as.factor))
# Split Data (70/30)
set.seed(123)
train_idx <- createDataPartition(df_model$Profit_Status, p = 0.7, list = FALSE)
train_data <- df_model[train_idx, ]
test_data  <- df_model[-train_idx, ]
print("Kolom yang tersedia di data training:")
## [1] "Kolom yang tersedia di data training:"
print(colnames(train_data))
## [1] "Profit_Status"  "Marketing"      "Inventory"      "Total Expenses"
## [5] "Market Size"    "Product Type"   "Market"
# Analisis Klasifikasi Random Forest 

Y_train <- train_data$Profit_Status
X_train <- train_data %>% select(-Profit_Status)

model_rf <- randomForest(x = X_train, 
                         y = Y_train, 
                         ntree = 500,
                         nodesize = 10,
                         importance = TRUE)
# Evaluasi Model Data Train
preds_train <- predict(model_rf, X_train)
conf_matrix_train <- confusionMatrix(preds_train, train_data$Profit_Status, positive = "High")

# Confusion Matrix Train
print(conf_matrix_train)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Low High
##       Low  1438  127
##       High   52 1357
##                                           
##                Accuracy : 0.9398          
##                  95% CI : (0.9307, 0.9481)
##     No Information Rate : 0.501           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8796          
##                                           
##  Mcnemar's Test P-Value : 3.184e-08       
##                                           
##             Sensitivity : 0.9144          
##             Specificity : 0.9651          
##          Pos Pred Value : 0.9631          
##          Neg Pred Value : 0.9188          
##              Prevalence : 0.4990          
##          Detection Rate : 0.4563          
##    Detection Prevalence : 0.4738          
##       Balanced Accuracy : 0.9398          
##                                           
##        'Positive' Class : High            
## 
# Evaluasi Model data test
X_test <- test_data %>% select(-Profit_Status)
preds_rf <- predict(model_rf, X_test)

# Confusion Matrix
conf_matrix <- confusionMatrix(preds_rf, test_data$Profit_Status, positive = "High")
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low High
##       Low  579   99
##       High  59  537
##                                           
##                Accuracy : 0.876           
##                  95% CI : (0.8566, 0.8936)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7519          
##                                           
##  Mcnemar's Test P-Value : 0.001918        
##                                           
##             Sensitivity : 0.8443          
##             Specificity : 0.9075          
##          Pos Pred Value : 0.9010          
##          Neg Pred Value : 0.8540          
##              Prevalence : 0.4992          
##          Detection Rate : 0.4215          
##    Detection Prevalence : 0.4678          
##       Balanced Accuracy : 0.8759          
##                                           
##        'Positive' Class : High            
## 
# 8. Visualisasi Kepentingan Variabel
varImpPlot(model_rf, main="Variabel Penentu Utama Profitabilitas")

Grafik Variable Importance (Mean Decrease Gini) menunjukkan kontribusi dari masing-masing variabel dalam meningkatkan kemurnian klasifikasi model. Variabel yang berada di posisi paling atas (Inventory) adalah faktor yang paling krusial bagi profitabilitas.

cat("--- PERBANDINGAN AKURASI ---", 
    "\nAkurasi Data Train : ", round(conf_matrix_train$overall['Accuracy'], 4),
    "\nAkurasi Data Test  : ", round(conf_matrix$overall['Accuracy'], 4))
## --- PERBANDINGAN AKURASI --- 
## Akurasi Data Train :  0.9398 
## Akurasi Data Test  :  0.876

Interpretasi hasil:

  1. Analisis Performa Model Berdasarkan hasil Confusion Matrix pada data testing:
  1. Akurasi (87,6%): Model memiliki ketepatan yang sangat tinggi dalam mengklasifikasikan produk. Artinya, sistem ini reliabel digunakan untuk memprediksi hasil bisnis di masa depan.
  2. Stabilitas (Train vs Test): Akurasi data Train (93,98%) dan data Test (87,6%) menunjukkan selisih yang cukup kecil. Hal ini membuktikan model tidak mengalami overfitting yang parah dan mampu melakukan generalisasi pada data baru.
  3. Sensitivity (84,43%): Model sangat baik dalam mengenali produk-produk yang berpotensi menghasilkan keuntungan tinggi (High Profit).
  1. Interpretasi Visualisasi Melalui grafik varImpPlot, terlihat bahwa variabel Inventory dan Marketing menduduki posisi teratas. Inventory: Tingkat stok yang optimal sangat berpengaruh, mengindikasikan bahwa manajemen rantai pasok yang efisien berbanding lurus dengan keuntungan. Marketing: Biaya pemasaran menjadi faktor kedua yang paling berpengaruh. Ini mengindikasikan bahwa strategi promosi perusahaan memiliki dampak yang signifikan dalam menggerakkan produk dari kategori Low ke High Profit.

  2. Rekomendasi Hasil analisis ini merekomendasikan manajemen Coffee Chain untuk memperkuat integrasi data antara bagian logistik (Inventory) dan bagian promosi (Marketing). Karena kedua faktor ini paling menentukan keuntungan, perusahaan harus memastikan bahwa setiap uang yang dikeluarkan untuk pemasaran didukung oleh ketersediaan stok yang optimal di wilayah yang tepat agar potensi keuntungan maksimal dapat tercapai.