if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman", repos = "https://cloud.r-project.org")
}
library(pacman)
pacman::p_load(
  here,        # A Simpler Way to Find Your Files
  tidyverse,   # Data manipulation, exploration and visualization
  janitor,     # Simple Tools for Examining and Cleaning Dirty Data
  arrow,       # Integration to Apache Arrow
  duckdb,      # DBI Package for the DuckDB Database Management System
  tictoc,      # Timing Functions for R
  ggeasy,      # Easy Access to 'ggplot2' Commands
  tidyquant,   # Tidy Quantitative Financial Analysis
  timetk,      # A Tool Kit for Working with Time Series
  factoextra,  # Extract and Visualize the Results of Multivariate Data Analyses
  tidymodels,  # Easily Install and Load the 'Tidymodels' Packages
  vip          # Variable Importance Plots
)
1 Overview
This is the seventh block of the workshop series “R for Excel users - Application to the energy losses study in Kenya for KPLC and the World Bank”. In this block we will explore advanced topics in R, focusing on big data analytics and machine learning models. We will learn how to handle large datasets that exceed available memory using efficient file formats like Parquet, Apache Arrow, and DuckDB. We will also introduce Apache Spark for distributed data processing and delve into machine learning models tailored to KPLC’s needs.
2 Learning objectives
By the end of this block, participants will be able to:
Import, store, and process large datasets efficiently using Parquet, Apache Arrow, and DuckDB.
Understand the fundamentals of distributed data processing with Apache Spark and its integration with R.
Apply appropriate data wrangling techniques for big data in the context of KPLC’s energy losses analysis.
Build, train, and evaluate basic machine learning models relevant to KPLC’s operational needs.
Interpret model outputs to support data-driven decision-making.
3 Instructor
Jorge Colomer is an industrial engineer with a specialization in electric power systems from the University of Zaragoza (Spain). He holds an MBA from the Technical University of Madrid (UPM), a master’s degree in Data Science and Business Intelligence from the University of Castilla-La Mancha, and has completed several postgraduate courses in finance and quantitative analysis at UPM. He currently leads the Data Analytics Unit at the Spanish branch of SEURECA, serves as an adjunct professor in the master’s program in Artificial Intelligence at Alfonso X el Sabio University in Madrid, and is a PhD candidate in the Statistical Modeling and Decision Sciences program at Rey Juan Carlos University in Madrid.
4 Prerequisites
Before the course begins, please make sure you meet the following prerequisites:
Create a free account on Posit Cloud: Go to https://posit.cloud/ and sign up for a free account. We will use this platform during the workshop to run R code without needing to install anything on your computer.
Basic familiarity with Excel: This course is designed for Excel users who want to start working with R. A basic understanding of Excel functions will be helpful.
A stable internet connection: Since we will be using an online environment (Posit Cloud), please ensure you have a reliable internet connection throughout the course.
Note: In your case, download the files from the Dropbox folder and save them in a folder called data inside your project directory. If you are using Posit Cloud, you can create this folder by clicking on the "New Folder" button in the Files pane and then uploading the files to that folder.
5 License
This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).
6 Agenda
Block 1: Introduction to R, RStudio and R Markdown
Objectives and resources
Why Excel, why R?
Overview of R and RStudio
Intro to R Markdown
R functions
Help pages
Commenting
Assigning objects with <-
R packages
Block 2: Importing and exporting data
Importing data from Excel
Exporting data to Excel
Importing and exporting CSV files
Other formats
Block 3: Data visualization with ggplot2
First steps
Creating a ggplot
Adding aesthetics and layers
Good data visualisation practices
Block 4: Data tidying
Tidy data
Lengthening data
Widening data
Pivot tables and dplyr
Block 5: Filters and joins
Filtering data
Joining data
Combining datasets
Block 6: Report automation with R and Quarto
Understanding Parameterisation in Quarto
Preparing the Data
Designing a Quarto Report Template
Incorporating Parameters into the Report
Rendering Reports Programmatically
Best Practices for Automated Reporting
Hands-on Exercise: KPLC Monthly Reports
Block 7: Advanced topics—Big data analytics and machine learning models
Parquet files, Apache Arrow and DuckDB
Apache Spark and the sparklyr package
Machine learning models
Block 8: Automated exploratory data analysis (EDA) and data quality
Introduction to EDA and its role in KPLC’s workflows
Why automate EDA? Benefits and limitations
Generating summary statistics with skimr
Extended descriptive analysis with other EDA packages
Visual overviews for detecting trends, patterns, and anomalies
Introduction to data quality concepts
Metadata and data dictionaries: definitions and best practices
Detecting and handling missing data, inconsistencies, and outliers
Integrating data quality checks into repeatable workflows
Hands-on exercise: Automated EDA and data quality assessment on KPLC datasets
7 Introduction
In this block, we tackle the challenges of analysing datasets that exceed available memory by leveraging efficient file formats and modern tools such as Parquet, Apache Arrow, and DuckDB. We will use Parquet files in combination with Apache Arrow to facilitate fast, out-of-memory data manipulation with familiar dplyr syntax, while integration with DuckDB enables SQL-style querying over large-scale data.
We will also introduce Apache Spark briefly, emphasising its role in scalable distributed processing for truly big data scenarios.
Finally, we’ll explore machine learning models tailored to KPLC’s needs. These models serve as tools to better understand and mitigate inefficiencies and help drive data-informed operational decisions.
8 Parquet files, Apache Arrow and DuckDB
8.1 Some definitions
Parquet is a columnar storage file format designed for efficient storage and retrieval of large datasets. Its column-oriented structure enables highly efficient compression and encoding, reducing file sizes and speeding up analytical queries—particularly when working with subsets of columns.
Apache Arrow is an open-source framework for in-memory data representation that enables fast, language-agnostic data interchange. In R, the arrow package allows you to read and write Parquet files directly, and to query them using dplyr without loading the entire dataset into memory. This makes it possible to work interactively with datasets that would otherwise exceed your computer’s RAM capacity.
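To make the round trip concrete, here is a minimal sketch using a small built-in data frame rather than KPLC data (the file name mtcars.parquet is illustrative and is written to the working directory):

```r
library(arrow)

# Write a small built-in data frame to Parquet, then read it back into memory
write_parquet(mtcars, "mtcars.parquet")
read_parquet("mtcars.parquet")

# For large files, open_dataset() only creates a pointer; nothing is read yet
ds <- open_dataset("mtcars.parquet")
ds
```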
DuckDB is an in-process SQL database management system optimised for analytical workloads. It can query Parquet files directly, without importing them into a separate database, and integrates seamlessly with the arrow package in R. This allows analysts to combine the expressiveness of SQL with the flexibility of R’s tidyverse for large-scale data analysis.
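The SQL side can be sketched as follows, assuming the Parquet file has been saved as data/postpaid_bills_clean.parquet and contains the tariff and kwh columns used later in this block:

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())

# DuckDB can query a Parquet file in place, without importing it first
dbGetQuery(con, "
  SELECT tariff, SUM(kwh) AS total_kwh
  FROM read_parquet('data/postpaid_bills_clean.parquet')
  GROUP BY tariff
  ORDER BY total_kwh DESC
")

dbDisconnect(con, shutdown = TRUE)
```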
8.2 Practical example: analysing KPLC's postpaid customer databases
In this example, we will work with KPLC’s postpaid customer databases, which contain millions of records and therefore cannot be loaded entirely into memory. Instead, we will use a workflow that combines Parquet files, Apache Arrow, and DuckDB to perform efficient, out-of-memory analysis directly from R.
The process involves:
Storing the data in Parquet format to take advantage of columnar storage and compression.
Accessing the Parquet files with the arrow package, enabling lazy evaluation and selective reading of only the necessary columns and rows.
Querying the data with DuckDB, either via SQL queries or dplyr syntax, for filtering, aggregation, and joining with other datasets.
Integrating results into the analytical workflow, such as generating summary tables, calculating performance indicators, or preparing visualisations for inclusion in Quarto reports.
8.3 Reading Parquet files with arrow
Let us take the file postpaid_bills_clean.parquet, stored locally on the hard drive and shared by KPLC at the start of the project. It is a 0.8 GB file containing millions of records. First, we define a pointer to the file using the arrow::open_dataset() function, which allows us to access the data without loading it into memory. This is particularly useful when working with large volumes of data, as it avoids memory issues and enables filtering and aggregation operations to be carried out efficiently.
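The pointer can be created as follows; this sketch assumes the file sits in the data folder described in the prerequisites, so adjust the path to wherever you stored it:

```r
library(arrow)
library(here)

# Define a pointer to the Parquet file; no data is loaded into memory at this point
pt_parquet <- open_dataset(here("data", "postpaid_bills_clean.parquet"))
```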
8.4 Exploring the Parquet file with dplyr
Let us explore the Parquet file using the glimpse() function from dplyr, which provides a quick view of the data structure, including the number of rows and columns, as well as the data types of each column. This allows us to quickly understand the composition of the dataset without the need to load it into memory.
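A quick look at the structure (the output itself is summarised in the next paragraph):

```r
# Preview the structure of the dataset without loading it into memory
pt_parquet %>%
  dplyr::glimpse()
```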
We can see that the dataset contains 25.6 million rows and 16 columns, with data types including numbers, characters, and dates. This gives us a clear idea of the data structure and allows us to plan the analytical operations to be carried out next.
9 Querying the Parquet file with dplyr
Let us use the familiar dplyr functions to perform some analytical operations on the dataset. For example, we can calculate the total billing amount and the total kWh consumed for each tariff. This is done by using the group_by() function to group the data by the tariff column, followed by summarise() to compute the corresponding sums. Finally, we use ungroup() to remove the grouping and collect() to bring the results into R.
We use the tictoc package to measure the time taken to execute this operation.
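The query looks like this; it is a sketch that assumes the billing amount and consumption columns are named bill_amount and kwh, as used throughout this block:

```r
tic()
pt_parquet %>%
  group_by(tariff) %>%
  summarise(
    total_bill_amount = sum(bill_amount, na.rm = TRUE),
    total_kwh         = sum(kwh, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  collect()   # only at this point are the aggregated results pulled into R
toc()
```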
# A tibble: 8 × 3
tariff total_bill_amount total_kwh
<chr> <dbl> <dbl>
1 Small Commercial 43321204530. 1456172126
2 Ordinary Lifeline 4855041738. 215357873
3 Ordinary 34241243666. 1165082937
4 Street Lighting 1865263817. 101140344
5 Large Power 160462386936. 6431911708
6 E-Mobility 15424880 552264
7 <NA> 1440 0
8 Bulk supply 330026 11123
1.33 sec elapsed
10 Using duckdb with arrow
There is one important advantage of combining Parquet and Arrow: it is very easy to turn an Arrow dataset into a DuckDB table by calling arrow::to_duckdb(). Thanks to DuckDB, we can take advantage of the full power of an SQL database management system and bring the results back into R without having to load the entire dataset into memory.
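For instance, the pointer can be handed to DuckDB and queried with the same dplyr verbs; this sketch reuses the billing_month, tariff and kwh columns assumed above:

```r
pt_parquet %>%
  arrow::to_duckdb() %>%                       # expose the Arrow dataset as a DuckDB table
  filter(year(billing_month) == 2024) %>%
  group_by(tariff, month = floor_date(billing_month, "month")) %>%
  summarise(total_kwh = sum(kwh, na.rm = TRUE), .groups = "drop") %>%
  arrange(month, tariff) %>%
  collect()                                    # bring only the aggregated results into R
```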
The neat thing about to_duckdb() is that the transfer does not involve any memory copying, which speaks to the goals of the Arrow ecosystem: enabling seamless transitions from one computing environment to another.
11 Apache Spark and the sparklyr package
Apache Spark is an open-source, distributed computing system designed for large-scale data processing. It enables fast computation by distributing tasks across multiple cores or machines, making it suitable for handling massive datasets that cannot be processed efficiently on a single computer. Spark supports a variety of workloads, including batch processing, interactive queries, streaming analytics, and machine learning.
In R, the sparklyr package provides a convenient interface to Apache Spark. It allows analysts to connect R directly to a Spark cluster, manipulate Spark DataFrames using familiar dplyr syntax, and access Spark’s built-in machine learning library (MLlib). This makes it possible to integrate distributed data processing into an R-based workflow without having to switch to another programming environment.
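As a taste of what this looks like in practice, the sketch below connects to a local Spark instance and runs a dplyr-style aggregation. It assumes Spark has been installed locally (for example with sparklyr::spark_install()) and that the Parquet file path shown is valid; it is not meant to be run in the Posit Cloud workshop environment.

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (a cluster URL would be used in production)
sc <- spark_connect(master = "local")

# Read the Parquet file directly into a Spark DataFrame (path is an assumption)
bills_tbl <- spark_read_parquet(sc, name = "bills",
                                path = "data/postpaid_bills_clean.parquet")

# Familiar dplyr verbs are translated to Spark SQL and executed by Spark
bills_tbl %>%
  group_by(tariff) %>%
  summarise(total_kwh = sum(kwh, na.rm = TRUE)) %>%
  collect()

spark_disconnect(sc)
```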
For those interested in exploring Spark with R in greater depth, the book Mastering Spark with R is an excellent resource, offering practical examples and detailed explanations of how to harness Spark’s capabilities from within R.
12 Machine learning models
12.1 Introduction
Machine Learning (ML) is a field of artificial intelligence that enables computers to learn patterns from data and make predictions or decisions without being explicitly programmed for each specific task. Instead of following rigid rules, ML algorithms improve their performance as they are exposed to more data.
ML techniques are generally divided into three main categories:
Supervised learning – Models are trained using labelled datasets, where the outcome or target variable is known. This category includes classification (predicting a discrete category, such as customer tariff type) and regression (predicting a continuous value, such as monthly energy consumption).
Unsupervised learning – Models are applied to datasets without pre-labelled outcomes. They aim to uncover hidden structures or patterns, such as grouping similar customers using clustering or reducing the dimensionality of the data with techniques like Principal Component Analysis (PCA).
Reinforcement learning – Algorithms learn optimal strategies through trial-and-error interactions with an environment, guided by rewards and penalties. While less common in traditional data analytics projects, it is increasingly used in real-time decision-making systems.
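Although this block concentrates on supervised classification, here is a brief unsupervised sketch of the kind mentioned above, using k-means on a built-in dataset together with the factoextra package loaded at the top of this document. The consumption features built later in this block could be substituted for mtcars.

```r
library(factoextra)

# Scale the numeric variables so that no single variable dominates the distance measure
scaled_data <- scale(mtcars)

# Group the observations into three clusters with k-means
set.seed(123)
km <- kmeans(scaled_data, centers = 3, nstart = 25)

# Visualise the clusters on the first two principal components
fviz_cluster(km, data = scaled_data)
```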
12.2 Application to KPLC’s databases: supervised classification models
In this session, we focus on supervised classification models to predict a customer’s tariff category based on their historical electricity consumption patterns and other predictors.
12.2.1 Feature engineering and Random Forest models
Among the different algorithms available, Random Forest has been selected due to its robustness, ability to handle large datasets, and good performance in modelling complex, non-linear relationships.
Schematic representation of a random forest model
We start from the file postpaid_bills_clean.parquet, provided by KPLC at the beginning of the project. A pointer to this Parquet file is defined using the arrow package, and the data is queried through a DuckDB table so that only the filtered results are brought into R’s memory. We begin by exploring the table, filtering by the region of interest, and examining the number of accounts (contracts) by both tariff type and customer type.
For demonstration purposes, we use data from the NAIROBI NORTH region. Let’s start by examining the number of accounts by tariff type for the Nairobi North region in 2024.
Show the code
pt_parquet <- arrow::open_dataset(
  here("..", "LOSS REDUCTION WORKSHOP", "data", "postpaid_bills_clean.parquet")
)

# Take the pointer, filter by year and region, and count the number of accounts by tariff
pt_parquet %>%
  arrow::to_duckdb() %>%
  filter(
    year(billing_month) == 2024,
    region == "NAIROBI NORTH"
  ) %>%
  group_by(tariff) %>%
  summarise(accounts = n_distinct(account)) %>%
  ungroup() %>%
  arrange(tariff, desc(accounts)) %>%
  collect() %>%
  drop_na(tariff) %>%
  janitor::adorn_totals("row")
tariff accounts
Bulk supply 1
E-Mobility 4
Large Power 1314
Ordinary 78203
Ordinary Lifeline 87973
Small Commercial 27895
Street Lighting 2835
Total 198225
To simplify the analysis, we will focus on three tariff categories: Ordinary, Ordinary Lifeline, and Street Lighting. These categories have been selected to illustrate the modelling process and do not represent a complete real-world case. A more comprehensive analysis would include a wider range of tariffs and customer types, as well as a larger and more diverse dataset.
Instead of working directly with the monthly consumption time series, we will calculate a set of features that summarise each account’s consumption pattern. Examples of such features include the maximum recorded value, the standard deviation, and other statistical measures that capture variability and distribution. These aggregated features allow us to represent each account with a compact set of descriptors, making the classification process more efficient and less sensitive to the length of the time series.
Show the code
# 1) Valid accounts: just one tariff class in 2024 in NAIROBI NORTH
valid_accounts <- pt_parquet %>%
  arrow::to_duckdb() %>%
  filter(
    region == "NAIROBI NORTH",
    year(billing_month) == 2024
  ) %>%
  group_by(account) %>%
  summarise(n_tariffs = n_distinct(tariff, na.rm = TRUE), .groups = "drop") %>%
  filter(n_tariffs == 1) %>%
  select(account) %>%
  collect()

# 2) Base table: filter by region, year and valid accounts
pt_base <- pt_parquet %>%
  arrow::to_duckdb() %>%
  filter(region == "NAIROBI NORTH", year(billing_month) == 2024) %>%
  collect() %>%
  filter(account %in% pull(valid_accounts)) %>%
  filter(tariff %in% c("Ordinary", "Ordinary Lifeline", "Street Lighting")) %>%
  mutate(month = floor_date(billing_month, "month"))

# 3) Aggregate the monthly consumption (kWh) per account, tariff, county and month
monthly <- pt_base %>%
  group_by(account, tariff, county, customer_type, month) %>%
  summarise(kwh_m = sum(kwh, na.rm = TRUE), .groups = "drop")

# Calculate new features by account from the monthly consumption
features_df_12m <- monthly %>%
  group_by(account) %>%
  summarise(
    tariff        = first(tariff),
    county        = first(county),
    customer_type = first(customer_type),
    months_n      = n(),                           # number of months with data
    total_kwh     = sum(kwh_m, na.rm = TRUE),      # total kWh consumed
    mean_kwh      = mean(kwh_m, na.rm = TRUE),     # mean kWh consumed
    sd_kwh        = sd(kwh_m, na.rm = TRUE),       # standard deviation of kWh consumed
    min_kwh       = min(kwh_m, na.rm = TRUE),      # minimum kWh consumed
    max_kwh       = max(kwh_m, na.rm = TRUE),      # maximum kWh consumed
    months_zero   = sum(kwh_m == 0, na.rm = TRUE), # number of months with zero consumption
    months_neg    = sum(kwh_m < 0, na.rm = TRUE),  # number of months with negative consumption
    .groups = "drop"
  ) %>%
  filter(months_n == 12) %>%   # only keep accounts with 12 months of data
  mutate(
    cv_kwh = if_else(is.finite(mean_kwh) & mean_kwh != 0, sd_kwh / mean_kwh, NA_real_)
    # coefficient of variation (CV) of monthly kWh
  )
Once the time series features for each account have been calculated, we use the tidymodels package to prepare the data, train a classification model, and measure its performance. The process includes:
Data preparation: Split the dataset into training and test sets, and preprocess the features.
Model specification: Define a Random Forest model with suitable parameters.
Training and prediction: Fit the model to the training data and make predictions on the test set.
Performance metrics: Calculate the model’s accuracy and generate a confusion matrix to assess its performance.
Show the code
set.seed(123)

# 0) Preparation
data_ml <- features_df_12m %>%
  mutate(
    cv_kwh        = ifelse(is.na(cv_kwh), 0, cv_kwh),  # replace NA with 0
    tariff        = as.factor(tariff),
    county        = as.factor(county),
    customer_type = as.factor(customer_type)
  )

# 1) Split the data into training and testing sets
spl   <- initial_split(data_ml, prop = 0.8, strata = tariff)  # stratified sampling by tariff
train <- training(spl)
test  <- testing(spl)

# 2) Recipe: select predictors and pre-process
num_vars <- c("months_n", "total_kwh", "mean_kwh", "sd_kwh", "min_kwh", "max_kwh",
              "months_zero", "months_neg", "cv_kwh")

rec <- recipe(tariff ~ ., data = train) %>%   # tariff is the outcome, all other columns are predictors
  update_role(account, new_role = "id") %>%   # mark 'account' as an ID variable (retained but not used as a predictor)
  step_rm(-all_of(c("tariff", "account", num_vars, "county", "customer_type"))) %>%
  # Keep only the listed variables: tariff (target), account (ID), the numeric variables (num_vars),
  # county, and customer_type. All other variables are removed from the dataset.
  step_impute_median(all_of(num_vars)) %>%    # replace missing values in the numeric variables with their median
  step_zv(all_predictors()) %>%               # remove predictors with zero variance (same value for all rows)
  step_dummy(all_nominal_predictors())        # convert categorical predictors into dummy/indicator variables

# 3) Model: Random Forest
mtry_val <- max(1L, floor(sqrt(length(num_vars))))  # set the mtry parameter for the random forest algorithm

# Random forest model specification in tidymodels
rf_spec <- rand_forest(mtry = mtry_val, trees = 800, min_n = 5) %>%
  set_mode("classification") %>%
  set_engine("ranger", importance = "impurity")

# Workflow combining the preprocessing recipe (rec) and the model specification (rf_spec)
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(rf_spec)

# 4) Train and predict
tic()
fit <- wf %>% fit(train)
toc()

tic()
pred <- augment(fit, test) %>%
  select(.pred_class, starts_with(".pred_"), tariff)
toc()
12.2.2 Performance metrics
The performance of the Random Forest classification model is evaluated using two key metrics: accuracy and a confusion matrix.
Accuracy measures the overall proportion of correct classifications made by the model, providing a straightforward indication of its performance.
The confusion matrix, on the other hand, offers a detailed breakdown of the model’s predictions, showing how many instances were correctly classified for each tariff category and where misclassifications occurred.
Show the code
# Accuracy: overall proportion of correct classifications
accuracy(pred, truth = tariff, estimate = .pred_class)

# Confusion matrix: detailed distribution of correct/incorrect predictions per class
conf_mat(pred, truth = tariff, estimate = .pred_class)
Truth
Prediction Ordinary Ordinary Lifeline Street Lighting
Ordinary 8047 26 2
Ordinary Lifeline 42 9630 7
Street Lighting 3 43 490
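As a quick check, the model’s accuracy can be read directly off this matrix: the correctly classified accounts lie on the diagonal, so accuracy = (8047 + 9630 + 490) / 18290 ≈ 0.993, meaning roughly 99.3% of the test-set accounts were assigned to the correct tariff.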
12.2.3 Variable importance analysis
The code below is used to visualise and interpret the results of the Random Forest classification model by identifying which predictors had the greatest influence on its decisions. While a Random Forest is often considered a “black box” algorithm due to its complexity, variable importance analysis helps to shed light on how the model uses the available data to make predictions. By focusing on the top fifteen predictors, the plot highlights the most influential features in distinguishing between the three tariff categories in the Nairobi North dataset.
The output from vip() is a ranked bar chart showing the relative contribution of each variable to the model, based on the “impurity” measure. In the context of Random Forests, impurity importance measures how much each predictor reduces classification uncertainty across all decision trees in the forest. Variables that consistently provide valuable splits to separate the classes will appear with higher importance scores. This allows analysts to see, at a glance, which features are driving the model’s classification performance.
Show the code
vip(fit$fit$fit, num_features = 15) +
  labs(
    title = "Variable Importance - Nairobi North",
    subtitle = "Random Forest Model with 3 tariff categories"
  ) +
  theme_tq() +
  easy_remove_gridlines() +
  easy_title_bold()
The variable importance plot shows which features had the greatest influence on the Random Forest’s classification of the three tariff categories in the Nairobi North dataset.
Here’s how to interpret it:
1. Ranking of predictors
The bars are ordered from most important (top) to least important (bottom).
total_kwh, mean_kwh, and max_kwh are by far the most influential features — these consumption-level indicators are driving most of the model’s classification power.
Mid-level importance is held by sd_kwh (variation in consumption), min_kwh, and months_zero (months with zero consumption), which suggests the model also uses consumption variability and inactivity to distinguish tariffs.
2. Role of customer type
Some customer_type dummy variables (e.g., County.Governments, Private) appear in the middle of the ranking, meaning they have moderate influence.
Other customer categories (e.g., National.Government, Embassies, Staff) have very low importance, indicating they are rarely used for splitting in the trees and thus contribute little to classification accuracy.
3. Low-importance predictors
months_neg (months with negative readings) is almost negligible in importance, suggesting that either negative readings are rare or not particularly informative for tariff classification.
The very small importance values for some customer type categories mean they could potentially be removed in a simplified model without much impact on performance.
4. Practical takeaway
The model is mostly relying on consumption magnitude and variation to classify tariffs.
Customer type matters, but only certain categories are informative.
This insight could inform future feature engineering:
You might create more nuanced consumption-related features.
You might simplify categorical predictors by combining or dropping low-importance levels.
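For example, here is a minimal sketch of both ideas; the peak-to-average ratio feature and the lumping of rare customer-type levels are illustrative choices, not part of the workshop code:

```r
library(forcats)

features_simplified <- features_df_12m %>%
  mutate(
    # An additional consumption-shape feature: peak-to-average ratio
    peak_ratio = if_else(mean_kwh > 0, max_kwh / mean_kwh, NA_real_),
    # Collapse infrequent customer-type levels into an "Other" category
    customer_type = fct_lump_n(as.factor(customer_type), n = 3)
  )
```

Any such simplification should be validated by re-fitting the model and checking that accuracy does not degrade.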