if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman", repos = "https://cloud.r-project.org")
}
library(pacman)
pacman::p_load(
  here,        # A Simpler Way to Find Your Files
  tidyverse,   # Data manipulation, exploration and visualization
  janitor,     # Simple Tools for Examining and Cleaning Dirty Data
  arrow,       # Integration to Apache Arrow
  duckdb,      # DBI Package for the DuckDB Database Management System
  tictoc,      # Timing Functions for R
  ggeasy,      # Easy Access to 'ggplot2' Commands
  tidyquant,   # Tidy Quantitative Financial Analysis
  timetk,      # A Tool Kit for Working with Time Series
  factoextra,  # Extract and Visualize the Results of Multivariate Data Analyses
  tidymodels,  # Easily Install and Load the 'Tidymodels' Packages
  vip          # Variable Importance Plots
)
1 Overview
This is the seventh block of the workshop series “R for Excel users - Application to the energy losses study in Kenya for KPLC and the World Bank”. In this block we will explore advanced topics in R, focusing on big data analytics and machine learning models. We will learn how to handle large datasets that exceed available memory using efficient file formats like Parquet, Apache Arrow, and DuckDB. We will also introduce Apache Spark for distributed data processing and delve into machine learning models tailored to KPLC’s needs.
2 Learning objectives
By the end of this block, participants will be able to:
Import, store, and process large datasets efficiently using Parquet, Apache Arrow, and DuckDB.
Understand the fundamentals of distributed data processing with Apache Spark and its integration with R.
Apply appropriate data wrangling techniques for big data in the context of KPLC’s energy losses analysis.
Build, train, and evaluate basic machine learning models relevant to KPLC’s operational needs.
Interpret model outputs to support data-driven decision-making.
3 Instructor
Jorge Colomer is an industrial engineer with a specialization in electric power systems from the University of Zaragoza (Spain). He holds an MBA from the Technical University of Madrid (UPM), a master’s degree in Data Science and Business Intelligence from the University of Castilla-La Mancha, and has completed several postgraduate courses in finance and quantitative analysis at UPM. He currently leads the Data Analytics Unit at the Spanish branch of SEURECA, serves as an adjunct professor in the master’s program in Artificial Intelligence at Alfonso X el Sabio University in Madrid, and is a PhD candidate in the Statistical Modeling and Decision Sciences program at Rey Juan Carlos University in Madrid.
4 Prerequisites
Before the course begins, please make sure you meet the following prerequisites:
Create a free account on Posit Cloud: Go to https://posit.cloud/ and sign up for a free account. We will use this platform during the workshop to run R code without needing to install anything on your computer.
Basic familiarity with Excel: This course is designed for Excel users who want to start working with R. A basic understanding of Excel functions will be helpful.
A stable internet connection: Since we will be using an online environment (Posit Cloud), please ensure you have a reliable internet connection throughout the course.
Note: In your case, download the files from the Dropbox folder and save them in a folder called data inside your project directory. If you are using Posit Cloud, you can create this folder by clicking on the "New Folder" button in the Files pane and then uploading the files to that folder.
5 License
This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).
6 Agenda
Block 1: Introduction to R, RStudio and R Markdown
Objectives and resources
Why Excel, why R?
Overview of R and RStudio
Intro to R Markdown
R functions
Help pages
Commenting
Assigning objects with <-
R packages
Block 2: Importing and exporting data
Importing data from Excel
Exporting data to Excel
Importing and exporting CSV files
Other formats
Block 3: Data visualization with ggplot2
First steps
Creating a ggplot
Adding aesthetics and layers
Good data visualisation practices
Block 4: Data tidying
Tidy data
Lengthening data
Widening data
Pivot tables and dplyr
Block 5: Filters and joins
Filtering data
Joining data
Combining datasets
Block 6: Report automation with R and Quarto
Understanding Parameterisation in Quarto
Preparing the Data
Designing a Quarto Report Template
Incorporating Parameters into the Report
Rendering Reports Programmatically
Best Practices for Automated Reporting
Hands-on Exercise: KPLC Monthly Reports
Block 7: Advanced topics—Big data analytics and machine learning models
Parquet files, Apache Arrow and DuckDB
Apache Spark and the sparklyr package
Machine learning models
Block 8: Automated exploratory data analysis (EDA) and data quality
Introduction to EDA and its role in KPLC’s workflows
Why automate EDA? Benefits and limitations
Generating summary statistics with skimr
Extended descriptive analysis with other EDA packages
Visual overviews for detecting trends, patterns, and anomalies
Introduction to data quality concepts
Metadata and data dictionaries: definitions and best practices
Detecting and handling missing data, inconsistencies, and outliers
Integrating data quality checks into repeatable workflows
Hands-on exercise: Automated EDA and data quality assessment on KPLC datasets
7 Introduction
In this block, we tackle the challenges of analysing datasets that exceed available memory by leveraging efficient file formats and modern tools such as Parquet, Apache Arrow, and DuckDB. We will use Parquet files in combination with Apache Arrow to facilitate fast, out-of-memory data manipulation with familiar dplyr syntax, while integration with DuckDB enables SQL-style querying over large-scale data.
We will also introduce Apache Spark briefly, emphasising its role in scalable distributed processing for truly big data scenarios.
Finally, we’ll explore machine learning models tailored to KPLC’s needs. These models serve as tools to better understand and mitigate inefficiencies and help drive data-informed operational decisions.
8 Parquet files, Apache Arrow and DuckDB
8.1 Some definitions
Parquet is a columnar storage file format designed for efficient storage and retrieval of large datasets. Its column-oriented structure enables highly efficient compression and encoding, reducing file sizes and speeding up analytical queries—particularly when working with subsets of columns.
Apache Arrow is an open-source framework for in-memory data representation that enables fast, language-agnostic data interchange. In R, the arrow package allows you to read and write Parquet files directly, and to query them using dplyr without loading the entire dataset into memory. This makes it possible to work interactively with datasets that would otherwise exceed your computer’s RAM capacity.
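To make the round trip concrete, here is a minimal sketch using a small built-in data frame rather than KPLC data (the file name mtcars.parquet is illustrative and is written to the working directory):

```r
library(arrow)

# Write a small built-in data frame to Parquet, then read it back into memory
write_parquet(mtcars, "mtcars.parquet")
read_parquet("mtcars.parquet")

# For large files, open_dataset() only creates a pointer; nothing is read yet
ds <- open_dataset("mtcars.parquet")
ds
```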
DuckDB is an in-process SQL database management system optimised for analytical workloads. It can query Parquet files directly, without importing them into a separate database, and integrates seamlessly with the arrow package in R. This allows analysts to combine the expressiveness of SQL with the flexibility of R’s tidyverse for large-scale data analysis.
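The SQL side can be sketched as follows, assuming the Parquet file has been saved as data/postpaid_bills_clean.parquet and contains the tariff and kwh columns used later in this block:

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())

# DuckDB can query a Parquet file in place, without importing it first
dbGetQuery(con, "
  SELECT tariff, SUM(kwh) AS total_kwh
  FROM read_parquet('data/postpaid_bills_clean.parquet')
  GROUP BY tariff
  ORDER BY total_kwh DESC
")

dbDisconnect(con, shutdown = TRUE)
```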
8.2 Practical example: analysing KPLC's postpaid customer databases
In this example, we will work with KPLC’s postpaid customer databases, which contain millions of records and therefore cannot be loaded entirely into memory. Instead, we will use a workflow that combines Parquet files, Apache Arrow, and DuckDB to perform efficient, out-of-memory analysis directly from R.
The process involves:
Storing the data in Parquet format to take advantage of columnar storage and compression.
Accessing the Parquet files with the arrow package, enabling lazy evaluation and selective reading of only the necessary columns and rows.
Querying the data with DuckDB, either via SQL queries or dplyr syntax, for filtering, aggregation, and joining with other datasets.
Integrating results into the analytical workflow, such as generating summary tables, calculating performance indicators, or preparing visualisations for inclusion in Quarto reports.
8.3 Reading Parquet files with arrow
Let us take the file postpaid_bills_clean.parquet, stored locally on the hard drive and shared by KPLC at the start of the project. It is a 0.8 GB file containing millions of records. First, we define a pointer to the file using the arrow::open_dataset() function, which allows us to access the data without loading it into memory. This is particularly useful when working with large volumes of data, as it avoids memory issues and enables filtering and aggregation operations to be carried out efficiently.
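The pointer can be created as follows; this sketch assumes the file sits in the data folder described in the prerequisites, so adjust the path to wherever you stored it:

```r
library(arrow)
library(here)

# Define a pointer to the Parquet file; no data is loaded into memory at this point
pt_parquet <- open_dataset(here("data", "postpaid_bills_clean.parquet"))
```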
8.4 Exploring the Parquet file with dplyr
Let us explore the Parquet file using the glimpse() function from dplyr, which provides a quick view of the data structure, including the number of rows and columns, as well as the data types of each column. This allows us to quickly understand the composition of the dataset without the need to load it into memory.
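A quick look at the structure (the output itself is summarised in the next paragraph):

```r
# Preview the structure of the dataset without loading it into memory
pt_parquet %>%
  dplyr::glimpse()
```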
We can see that the dataset contains 25.6 million rows and 16 columns, with data types including numbers, characters, and dates. This gives us a clear idea of the data structure and allows us to plan the analytical operations to be carried out next.
9 Querying the Parquet file with dplyr
Let us use the familiar dplyr functions to perform some analytical operations on the dataset. For example, we can calculate the total billing amount and the total kWh consumed for each tariff. This is done by using the group_by() function to group the data by the tariff column, followed by summarise() to compute the corresponding sums. Finally, we use ungroup() to remove the grouping and collect() to bring the results into R.
We use the tictoc package to measure the time taken to execute this operation.
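The query looks like this; it is a sketch that assumes the billing amount and consumption columns are named bill_amount and kwh, as used throughout this block:

```r
tic()
pt_parquet %>%
  group_by(tariff) %>%
  summarise(
    total_bill_amount = sum(bill_amount, na.rm = TRUE),
    total_kwh         = sum(kwh, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  collect()   # only at this point are the aggregated results pulled into R
toc()
```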
# A tibble: 8 × 3
tariff total_bill_amount total_kwh
<chr> <dbl> <dbl>
1 Small Commercial 43321204530. 1456172126
2 Ordinary Lifeline 4855041738. 215357873
3 Ordinary 34241243666. 1165082937
4 Street Lighting 1865263817. 101140344
5 Large Power 160462386936. 6431911708
6 E-Mobility 15424880 552264
7 <NA> 1440 0
8 Bulk supply 330026 11123
1.33 sec elapsed
10 Using duckdb with arrow
There is one important advantage of combining Parquet and Arrow: it is very easy to turn an Arrow dataset into a DuckDB table by calling arrow::to_duckdb(). Thanks to DuckDB, we can take advantage of the full power of an SQL database management system and bring the results back into R without having to load the entire dataset into memory.
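For instance, the pointer can be handed to DuckDB and queried with the same dplyr verbs; this sketch reuses the billing_month, tariff and kwh columns assumed above:

```r
pt_parquet %>%
  arrow::to_duckdb() %>%                       # expose the Arrow dataset as a DuckDB table
  filter(year(billing_month) == 2024) %>%
  group_by(tariff, month = floor_date(billing_month, "month")) %>%
  summarise(total_kwh = sum(kwh, na.rm = TRUE), .groups = "drop") %>%
  arrange(month, tariff) %>%
  collect()                                    # bring only the aggregated results into R
```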
The neat thing about to_duckdb() is that the transfer does not involve any memory copying, which speaks to the goals of the Arrow ecosystem: enabling seamless transitions from one computing environment to another.
11 Apache Spark and the sparklyr package
Apache Spark is an open-source, distributed computing system designed for large-scale data processing. It enables fast computation by distributing tasks across multiple cores or machines, making it suitable for handling massive datasets that cannot be processed efficiently on a single computer. Spark supports a variety of workloads, including batch processing, interactive queries, streaming analytics, and machine learning.
In R, the sparklyr package provides a convenient interface to Apache Spark. It allows analysts to connect R directly to a Spark cluster, manipulate Spark DataFrames using familiar dplyr syntax, and access Spark’s built-in machine learning library (MLlib). This makes it possible to integrate distributed data processing into an R-based workflow without having to switch to another programming environment.
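As a taste of what this looks like in practice, the sketch below connects to a local Spark instance and runs a dplyr-style aggregation. It assumes Spark has been installed locally (for example with sparklyr::spark_install()) and that the Parquet file path shown is valid; it is not meant to be run in the Posit Cloud workshop environment.

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (a cluster URL would be used in production)
sc <- spark_connect(master = "local")

# Read the Parquet file directly into a Spark DataFrame (path is an assumption)
bills_tbl <- spark_read_parquet(sc, name = "bills",
                                path = "data/postpaid_bills_clean.parquet")

# Familiar dplyr verbs are translated to Spark SQL and executed by Spark
bills_tbl %>%
  group_by(tariff) %>%
  summarise(total_kwh = sum(kwh, na.rm = TRUE)) %>%
  collect()

spark_disconnect(sc)
```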
For those interested in exploring Spark with R in greater depth, the book Mastering Spark with R is an excellent resource, offering practical examples and detailed explanations of how to harness Spark’s capabilities from within R.
12 Machine learning models
12.1 Introduction
Machine Learning (ML) is a field of artificial intelligence that enables computers to learn patterns from data and make predictions or decisions without being explicitly programmed for each specific task. Instead of following rigid rules, ML algorithms improve their performance as they are exposed to more data.
ML techniques are generally divided into three main categories:
Supervised learning – Models are trained using labelled datasets, where the outcome or target variable is known. This category includes classification (predicting a discrete category, such as customer tariff type) and regression (predicting a continuous value, such as monthly energy consumption).
Unsupervised learning – Models are applied to datasets without pre-labelled outcomes. They aim to uncover hidden structures or patterns, such as grouping similar customers using clustering or reducing the dimensionality of the data with techniques like Principal Component Analysis (PCA).
Reinforcement learning – Algorithms learn optimal strategies through trial-and-error interactions with an environment, guided by rewards and penalties. While less common in traditional data analytics projects, it is increasingly used in real-time decision-making systems.
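Although this block concentrates on supervised classification, here is a brief unsupervised sketch of the kind mentioned above, using k-means on a built-in dataset together with the factoextra package loaded at the top of this document. The consumption features built later in this block could be substituted for mtcars.

```r
library(factoextra)

# Scale the numeric variables so that no single variable dominates the distance measure
scaled_data <- scale(mtcars)

# Group the observations into three clusters with k-means
set.seed(123)
km <- kmeans(scaled_data, centers = 3, nstart = 25)

# Visualise the clusters on the first two principal components
fviz_cluster(km, data = scaled_data)
```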
12.2 Application to KPLC’s databases: supervised classification models
In this session, we focus on supervised classification models to predict a customer’s tariff category based on their historical electricity consumption patterns and other predictors.
12.2.1 Feature engineering and Random Forest models
Among the different algorithms available, Random Forest has been selected due to its robustness, ability to handle large datasets, and good performance in modelling complex, non-linear relationships.
Schematic representation of a random forest model
We start from the file postpaid_bills_clean.parquet, provided by KPLC at the beginning of the project. A pointer to this Parquet file is defined using the arrow package, and the data is queried through a DuckDB table so that only the filtered results are brought into R’s memory. We begin by exploring the table, filtering by the region of interest, and examining the number of accounts (contracts) by both tariff type and customer type.
For demonstration purposes, we use data from the NAIROBI NORTH region. Let’s start by examining the number of accounts by tariff type for the Nairobi North region in 2024.
Show the code
pt_parquet <- arrow::open_dataset(
  here("..", "LOSS REDUCTION WORKSHOP", "data", "postpaid_bills_clean.parquet")
)

# Take the pointer, filter by year and region, and count the number of accounts by tariff
pt_parquet %>%
  arrow::to_duckdb() %>%
  filter(
    year(billing_month) == 2024,
    region == "NAIROBI NORTH"
  ) %>%
  group_by(tariff) %>%
  summarise(accounts = n_distinct(account)) %>%
  ungroup() %>%
  arrange(tariff, desc(accounts)) %>%
  collect() %>%
  drop_na(tariff) %>%
  janitor::adorn_totals("row")
tariff accounts
Bulk supply 1
E-Mobility 4
Large Power 1314
Ordinary 78203
Ordinary Lifeline 87973
Small Commercial 27895
Street Lighting 2835
Total 198225
To simplify the analysis, we will focus on three tariff categories: Ordinary, Ordinary Lifeline, and Street Lighting. These categories have been selected to illustrate the modelling process and do not represent a complete real-world case. A more comprehensive analysis would include a wider range of tariffs and customer types, as well as a larger and more diverse dataset.
Instead of working directly with the monthly consumption time series, we will calculate a set of features that summarise each account’s consumption pattern. Examples of such features include the maximum recorded value, the standard deviation, and other statistical measures that capture variability and distribution. These aggregated features allow us to represent each account with a compact set of descriptors, making the classification process more efficient and less sensitive to the length of the time series.
Show the code
# 1) Valid accounts: just one tariff class in 2024 in NAIROBI NORTH
valid_accounts <- pt_parquet %>%
  arrow::to_duckdb() %>%
  filter(
    region == "NAIROBI NORTH",
    year(billing_month) == 2024
  ) %>%
  group_by(account) %>%
  summarise(n_tariffs = n_distinct(tariff, na.rm = TRUE), .groups = "drop") %>%
  filter(n_tariffs == 1) %>%
  select(account) %>%
  collect()

# 2) Base table: filter by region, year and valid accounts
pt_base <- pt_parquet %>%
  arrow::to_duckdb() %>%
  filter(region == "NAIROBI NORTH", year(billing_month) == 2024) %>%
  collect() %>%
  filter(account %in% pull(valid_accounts)) %>%
  filter(tariff %in% c("Ordinary", "Ordinary Lifeline", "Street Lighting")) %>%
  mutate(month = floor_date(billing_month, "month"))

# 3) Aggregate the monthly consumption (kWh) per account, tariff, county and month
monthly <- pt_base %>%
  group_by(account, tariff, county, customer_type, month) %>%
  summarise(kwh_m = sum(kwh, na.rm = TRUE), .groups = "drop")

# Calculate new features by account from the monthly consumption
features_df_12m <- monthly %>%
  group_by(account) %>%
  summarise(
    tariff        = first(tariff),
    county        = first(county),
    customer_type = first(customer_type),
    months_n      = n(),                           # number of months with data
    total_kwh     = sum(kwh_m, na.rm = TRUE),      # total kWh consumed
    mean_kwh      = mean(kwh_m, na.rm = TRUE),     # mean kWh consumed
    sd_kwh        = sd(kwh_m, na.rm = TRUE),       # standard deviation of kWh consumed
    min_kwh       = min(kwh_m, na.rm = TRUE),      # minimum kWh consumed
    max_kwh       = max(kwh_m, na.rm = TRUE),      # maximum kWh consumed
    months_zero   = sum(kwh_m == 0, na.rm = TRUE), # number of months with zero consumption
    months_neg    = sum(kwh_m < 0, na.rm = TRUE),  # number of months with negative consumption
    .groups = "drop"
  ) %>%
  filter(months_n == 12) %>%   # only keep accounts with 12 months of data
  mutate(
    cv_kwh = if_else(is.finite(mean_kwh) & mean_kwh != 0, sd_kwh / mean_kwh, NA_real_)
    # coefficient of variation (CV) of monthly kWh
  )
Once the time series features for each account have been calculated, we use the tidymodels package to prepare the data, train a classification model, and measure its performance. The process includes:
Data preparation: Split the dataset into training and test sets, and preprocess the features.
Model specification: Define a Random Forest model with suitable parameters.
Training and prediction: Fit the model to the training data and make predictions on the test set.
Performance metrics: Calculate the model’s accuracy and generate a confusion matrix to assess its performance.
Show the code
set.seed(123)

# 0) Preparation
data_ml <- features_df_12m %>%
  mutate(
    cv_kwh        = ifelse(is.na(cv_kwh), 0, cv_kwh),  # replace NA with 0
    tariff        = as.factor(tariff),
    county        = as.factor(county),
    customer_type = as.factor(customer_type)
  )

# 1) Split the data into training and testing sets
spl   <- initial_split(data_ml, prop = 0.8, strata = tariff)  # stratified sampling by tariff
train <- training(spl)
test  <- testing(spl)

# 2) Recipe: select predictors and pre-process
num_vars <- c("months_n", "total_kwh", "mean_kwh", "sd_kwh", "min_kwh", "max_kwh",
              "months_zero", "months_neg", "cv_kwh")

rec <- recipe(tariff ~ ., data = train) %>%   # tariff is the outcome, all other columns are predictors
  update_role(account, new_role = "id") %>%   # mark 'account' as an ID variable (retained but not used as a predictor)
  step_rm(-all_of(c("tariff", "account", num_vars, "county", "customer_type"))) %>%
  # Keep only the listed variables: tariff (target), account (ID), the numeric variables (num_vars),
  # county, and customer_type. All other variables are removed from the dataset.
  step_impute_median(all_of(num_vars)) %>%    # replace missing values in the numeric variables with their median
  step_zv(all_predictors()) %>%               # remove predictors with zero variance (same value for all rows)
  step_dummy(all_nominal_predictors())        # convert categorical predictors into dummy/indicator variables

# 3) Model: Random Forest
mtry_val <- max(1L, floor(sqrt(length(num_vars))))  # set the mtry parameter for the random forest algorithm

# Random forest model specification in tidymodels
rf_spec <- rand_forest(mtry = mtry_val, trees = 800, min_n = 5) %>%
  set_mode("classification") %>%
  set_engine("ranger", importance = "impurity")

# Workflow combining the preprocessing recipe (rec) and the model specification (rf_spec)
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(rf_spec)

# 4) Train and predict
tic()
fit <- wf %>% fit(train)
toc()

tic()
pred <- augment(fit, test) %>%
  select(.pred_class, starts_with(".pred_"), tariff)
toc()
12.2.2 Performance metrics
The performance of the Random Forest classification model is evaluated using two key metrics: accuracy and a confusion matrix.
Accuracy measures the overall proportion of correct classifications made by the model, providing a straightforward indication of its performance.
The confusion matrix, on the other hand, offers a detailed breakdown of the model’s predictions, showing how many instances were correctly classified for each tariff category and where misclassifications occurred.
Show the code
# Accuracy: overall proportion of correct classifications
accuracy(pred, truth = tariff, estimate = .pred_class)

# Confusion matrix: detailed distribution of correct/incorrect predictions per class
conf_mat(pred, truth = tariff, estimate = .pred_class)
Truth
Prediction Ordinary Ordinary Lifeline Street Lighting
Ordinary 8047 26 2
Ordinary Lifeline 42 9630 7
Street Lighting 3 43 490
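As a quick check, the model’s accuracy can be read directly off this matrix: the correctly classified accounts lie on the diagonal, so accuracy = (8047 + 9630 + 490) / 18290 ≈ 0.993, meaning roughly 99.3% of the test-set accounts were assigned to the correct tariff.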
12.2.3 Variable importance analysis
The code below is used to visualise and interpret the results of the Random Forest classification model by identifying which predictors had the greatest influence on its decisions. While a Random Forest is often considered a “black box” algorithm due to its complexity, variable importance analysis helps to shed light on how the model uses the available data to make predictions. By focusing on the top fifteen predictors, the plot highlights the most influential features in distinguishing between the three tariff categories in the Nairobi North dataset.
The output from vip() is a ranked bar chart showing the relative contribution of each variable to the model, based on the “impurity” measure. In the context of Random Forests, impurity importance measures how much each predictor reduces classification uncertainty across all decision trees in the forest. Variables that consistently provide valuable splits to separate the classes will appear with higher importance scores. This allows analysts to see, at a glance, which features are driving the model’s classification performance.
Show the code
vip(fit$fit$fit, num_features = 15) +
  labs(
    title = "Variable Importance - Nairobi North",
    subtitle = "Random Forest Model with 3 tariff categories"
  ) +
  theme_tq() +
  easy_remove_gridlines() +
  easy_title_bold()
The variable importance plot shows which features had the greatest influence on the Random Forest’s classification of the three tariff categories in the Nairobi North dataset.
Here’s how to interpret it:
1. Ranking of predictors
The bars are ordered from most important (top) to least important (bottom).
total_kwh, mean_kwh, and max_kwh are by far the most influential features — these consumption-level indicators are driving most of the model’s classification power.
Mid-level importance is held by sd_kwh (variation in consumption), min_kwh, and months_zero (months with zero consumption), which suggests the model also uses consumption variability and inactivity to distinguish tariffs.
2. Role of customer type
Some customer_type dummy variables (e.g., County.Governments, Private) appear in the middle of the ranking, meaning they have moderate influence.
Other customer categories (e.g., National.Government, Embassies, Staff) have very low importance, indicating they are rarely used for splitting in the trees and thus contribute little to classification accuracy.
3. Low-importance predictors
months_neg (months with negative readings) is almost negligible in importance, suggesting that either negative readings are rare or not particularly informative for tariff classification.
The very small importance values for some customer type categories mean they could potentially be removed in a simplified model without much impact on performance.
4. Practical takeaway
The model is mostly relying on consumption magnitude and variation to classify tariffs.
Customer type matters, but only certain categories are informative.
This insight could inform future feature engineering:
You might create more nuanced consumption-related features.
You might simplify categorical predictors by combining or dropping low-importance levels.
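For example, here is a minimal sketch of both ideas; the peak-to-average ratio feature and the lumping of rare customer-type levels are illustrative choices, not part of the workshop code:

```r
library(forcats)

features_simplified <- features_df_12m %>%
  mutate(
    # An additional consumption-shape feature: peak-to-average ratio
    peak_ratio = if_else(mean_kwh > 0, max_kwh / mean_kwh, NA_real_),
    # Collapse infrequent customer-type levels into an "Other" category
    customer_type = fct_lump_n(as.factor(customer_type), n = 3)
  )
```

Any such simplification should be validated by re-fitting the model and checking that accuracy does not degrade.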