This course introduces students to the practice of consumer analytics using R. Working with a real retail transaction dataset, students progress from raw data to actionable customer insights — and reflect on the legal and ethical boundaries that govern this work in Europe.
The course is designed for students with basic familiarity with data and statistics. No prior R experience is required, though it is helpful.
Duration: 3 hours (with one short break)
Format: Live coding — students follow along in their browsers using webR
By the end of the course, students will be able to:
Load, inspect, and clean a real-world transactional dataset in R
Understand and apply the RFM (Recency, Frequency, Monetary) framework for customer segmentation
Design personalised marketing actions based on customer segments
Critically evaluate the legal and ethical implications of consumer profiling under GDPR
Learning Outcomes
| # | Outcome | Assessed through |
|---|---|---|
| 1 | Load data from a URL and inspect its structure | Live coding |
| 2 | Apply data cleaning steps to remove noise and invalid records | Live coding |
| 3 | Produce and interpret summary statistics and visualisations | Discussion |
| 4 | Calculate RFM scores and assign customer segments | Live coding |
| 5 | Design segment-based marketing actions and interpret findings | Discussion |
| 6 | Explain what GDPR requires of a consumer analytics project | Discussion |
Course Outline
| Time | Session | Topics |
|---|---|---|
| 9:45 – 10:30 | Session 1: Data | Load, inspect, clean, explore |
| 10:30 – 11:15 | Session 2: RFM | Scoring, segmentation, interpretation |
| 11:15 – 11:30 | Break | |
| 11:30 – 12:15 | Session 3: Marketing | Personalised actions, product gaps |
| 12:15 – 13:00 | Session 4: Regulation | GDPR, profiling, ethics |
The Dataset
UK Online Retail Dataset
Creator: Daqing Chen, London South Bank University
Source: UCI Machine Learning Repository
License: CC BY 4.0
The data come from a UK-based online retailer selling unique all-occasion gifts, primarily to wholesalers. Transactions run from December 2010 to December 2011.
We work with a random sample of 20,000 rows hosted on GitHub.
Variables
| Column | Type | Description |
|---|---|---|
| InvoiceNo | Text | 6-digit transaction ID. Starts with C if cancelled |
| StockCode | Text | 5-digit product code |
| Description | Text | Product name |
| Quantity | Integer | Units purchased per row |
| InvoiceDate | DateTime | Date and time of transaction |
| UnitPrice | Numeric | Price per unit in British pounds (£) |
| CustomerID | Integer | 5-digit customer identifier |
| Country | Text | Country of the customer |
After cleaning, our working dataset contains:
| Metric | Value |
|---|---|
| Transactions | 14,579 |
| Unique customers | 3,017 |
| Unique products | 2,544 |
| Countries | 34 |
| Period | Dec 2010 – Dec 2011 |
| Total revenue | £316,047 |
Why R?
R is a free, open-source programming language designed for statistical computing and data analysis. It is now one of the two dominant languages in data science alongside Python.
Open and free: no licences, no fees, no institutional access restrictions
Reproducible: R code is a complete, readable record of every step. Anyone can re-run your code and get the same result
Community-driven: over 20,000 packages on CRAN (the Comprehensive R Archive Network), contributed by researchers worldwide
Industry and academia: widely used in economics, marketing, public health, and finance
The R ecosystem:
| Tool | What it is |
|---|---|
| R | The core language and engine |
| RStudio | The most popular editor for writing R code |
| Quarto | Combines R code, output, and text into polished HTML, PDF, or Word documents |
| tidyverse | A collection of packages that make data work readable and consistent |
In this course we use webR — R running entirely in your browser, with no installation required.
Session 1 — Loading, Cleaning, and Exploring the Data
Time: 9:45 – 10:30
1. Load Packages and Data
We start by loading tidyverse — a collection of R packages with a shared design philosophy. library(tidyverse) loads them all at once.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
| Package | What it does |
|---|---|
| dplyr | Filter, group, summarise: the core data manipulation tool |
| ggplot2 | Create charts and visualisations |
| lubridate | Work with dates and times |
Then we load the dataset. read.csv() is base R — no extra package needed.
2. Inspect
Before touching the data, always look at it first. glimpse() shows column names, types, and a preview of values. summary() gives the range of each numeric column and its missing value count.
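The loading and inspection calls themselves might look like the sketch below. The URL in the comment is a placeholder, not the course's actual GitHub address, and the three inline rows are invented stand-ins so that the snippet runs without a network connection.

```r
library(dplyr)  # provides glimpse()

# In class the data would be read from a URL, for example:
# retail <- read.csv("https://example.com/online_retail_sample.csv")  # hypothetical URL
# Here a three-row inline sample (invented values) stands in for the file.
csv_text <- "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01,2.55,17850,United Kingdom
C536379,D,Discount,-1,2010-12-01,27.50,14527,United Kingdom
536414,22139,,56,2010-12-01,0.00,,United Kingdom"

retail <- read.csv(text = csv_text)

glimpse(retail)   # column names, types, and the first values
summary(retail)   # ranges for numeric columns, plus NA counts
```

Because read.csv() does not guess dates, InvoiceDate arrives as text, and the blank CustomerID in the last row is read as NA.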
InvoiceNo StockCode Description Quantity
Length:20000 Length:20000 Length:20000 Min. :-1200.00
Class :character Class :character Class :character 1st Qu.: 1.00
Mode :character Mode :character Mode :character Median : 3.00
Mean : 9.65
3rd Qu.: 10.00
Max. : 4800.00
InvoiceDate UnitPrice CustomerID Country
Length:20000 Min. : 0.000 Min. :12347 Length:20000
Class :character 1st Qu.: 1.250 1st Qu.:13905 Class :character
Mode :character Median : 2.080 Median :15159 Mode :character
Mean : 4.073 Mean :15288
3rd Qu.: 4.130 3rd Qu.:16791
Max. :2653.950 Max. :18287
NA's :5078
What we see:
20,000 rows, 8 columns
InvoiceDate is read as text — needs to be converted to a date
5,078 CustomerID values are missing (about 25% of rows)
Quantity can be negative (minimum -1,200): these rows are returns and cancellations. UnitPrice is never negative in this sample, but it can be zero
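Each of these observations can be confirmed with a one-line check. The sketch below uses a small invented data frame with the same column names as the course data, so the counts it prints are not the course's real figures.

```r
# Toy stand-in for `retail` (invented values, same column names as the course data)
retail <- data.frame(
  InvoiceNo   = c("536365", "C536379", "536414", "536520"),
  Quantity    = c(6L, -1L, 56L, 2L),
  InvoiceDate = c("2010-12-01", "2010-12-01", "2010-12-01", "2010-12-02"),
  UnitPrice   = c(2.55, 27.50, 0, 1.25),
  CustomerID  = c(17850L, 14527L, NA, NA)
)

dim(retail)                          # number of rows and columns
class(retail$InvoiceDate)            # "character" means dates are still text
sum(is.na(retail$CustomerID))        # rows with a missing customer ID
sum(retail$Quantity < 0)             # rows with a negative quantity
sum(grepl("^C", retail$InvoiceNo))   # cancelled invoices
sum(retail$UnitPrice <= 0)           # rows with a zero or negative price
```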
The Pipe Operator |>
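Throughout this course, steps are chained with the pipe operator |>, which reads as “and then”. The pipe is part of base R (version 4.1 and later) and works with tidyverse functions.
# Without pipe: nested, hard to read
# (retail has no Revenue column yet, so revenue is computed inline)
arrange(
  summarise(group_by(retail, Country), Revenue = sum(Quantity * UnitPrice)),
  desc(Revenue)
)

# With pipe: reads top to bottom, step by step
retail |>
  group_by(Country) |>
  summarise(Revenue = sum(Quantity * UnitPrice)) |>
  arrange(desc(Revenue))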
Both produce identical output. The pipe makes the logic transparent — you read each step in the order it happens.
3. Clean
Real data is never clean. We remove cancelled orders, missing customers, and invalid values. We also create a Revenue column (Quantity × UnitPrice).
retail_clean <- retail |>
  filter(!grepl("^C", InvoiceNo)) |>       # Remove cancellations (InvoiceNo starts with C)
  filter(!is.na(CustomerID)) |>            # Remove missing customers
  filter(Quantity > 0, UnitPrice > 0) |>   # Remove invalid rows
  mutate(
    InvoiceDate = as.Date(InvoiceDate, format = "%Y-%m-%d"),  # Parse date
    Revenue     = Quantity * UnitPrice                        # Row-level revenue
  )

# How many rows were removed?
nrow(retail) - nrow(retail_clean)
[1] 5421
Result: 5,421 rows removed — 27% of the sample. This is typical for retail transaction data.
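To see which rule accounts for the removed rows, each condition can be counted separately before filtering. This is a base R sketch on invented rows; a row can break more than one rule, so the per-rule counts may add up to more than the total removed.

```r
# Toy stand-in for `retail` (invented values)
retail <- data.frame(
  InvoiceNo  = c("536365", "C536379", "536414", "536520", "536521"),
  Quantity   = c(6L, -1L, 56L, 2L, 3L),
  UnitPrice  = c(2.55, 27.50, 0, 1.25, 4.95),
  CustomerID = c(17850L, 14527L, NA, NA, 13047L)
)

# One logical flag per cleaning rule
cancelled   <- grepl("^C", retail$InvoiceNo)
no_customer <- is.na(retail$CustomerID)
bad_value   <- retail$Quantity <= 0 | retail$UnitPrice <= 0

# Rows hit by each rule (overlaps are counted under every rule they break)
c(cancelled = sum(cancelled), no_customer = sum(no_customer), bad_value = sum(bad_value))

# Rows removed overall, and their share of the sample
removed       <- sum(cancelled | no_customer | bad_value)
share_removed <- removed / nrow(retail)
```

Applied to the course figures, the same share calculation is 5,421 / 20,000 = 0.271, which rounds to the 27% quoted above.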
4. Explore
range(retail_clean$InvoiceDate, na.rm = TRUE)  # Time period
The UK accounts for 82% of total revenue. Despite selling to 34 countries, this is essentially a domestic business with modest international reach.
retail_clean |>
  group_by(Country) |>
  summarise(Revenue = sum(Revenue)) |>
  arrange(desc(Revenue)) |>
  slice(1:10) |>
  ggplot(aes(x = reorder(Country, Revenue), y = Revenue)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Top 10 Countries by Revenue", x = NULL, y = "Revenue (£)") +
  theme_minimal()
Volume vs. Revenue
There is an important distinction between products that sell in high volume and products that generate high revenue. They are not always the same.
knitr::kable() formats a data frame as a clean, readable table in the output document. We use it throughout this course whenever we want to display a summary table neatly.
# By revenue
retail_clean |>
  group_by(Description) |>
  summarise(Total_Revenue = sum(Revenue)) |>
  arrange(desc(Total_Revenue)) |>
  slice(1:10) |>
  knitr::kable()
| Description | Total_Revenue |
|---|---|
| REGENCY CAKESTAND 3 TIER | 5788.56 |
| METAL SIGN TAKE IT OR LEAVE IT | 3996.70 |
| VINTAGE UNION JACK MEMOBOARD | 3543.87 |
| POSTAGE | 3424.00 |
| LANDMARK FRAME OXFORD STREET | 3317.85 |
| JUMBO BAG RED RETROSPOT | 3210.64 |
| PARTY BUNTING | 2922.42 |
| RABBIT NIGHT LIGHT | 2628.32 |
| Manual | 2551.14 |
| BLACK RECORD COVER FRAME | 2400.69 |
Key observation: The top-selling product by volume (World War 2 Gliders, 6,058 units) does not appear in the top revenue list. The top revenue product (Regency Cakestand 3 Tier, £5,789) sells in far lower quantities. This is a classic volume versus unit-price trade-off. The data contain prices but not costs, so they say nothing about margin.
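The two rankings being compared can be built from a single per-product summary. The sketch below uses invented products and prices; Total_Units and product_summary are illustrative names, not names from the course code.

```r
library(dplyr)

# Toy stand-in for `retail_clean` (invented products and prices)
retail_clean <- data.frame(
  Description = c("GLIDER", "GLIDER", "CAKESTAND", "BUNTING", "CAKESTAND"),
  Quantity    = c(48L, 96L, 2L, 10L, 4L),
  UnitPrice   = c(0.25, 0.25, 12.75, 4.95, 12.75)
) |>
  mutate(Revenue = Quantity * UnitPrice)

# Units and revenue per product, computed once
product_summary <- retail_clean |>
  group_by(Description) |>
  summarise(Total_Units = sum(Quantity), Total_Revenue = sum(Revenue))

by_volume  <- product_summary |> arrange(desc(Total_Units))
by_revenue <- product_summary |> arrange(desc(Total_Revenue))

by_volume
by_revenue
```

In this toy data the cheap, high-volume item leads the volume ranking while the expensive, low-volume item leads the revenue ranking. This is the same pattern as in the observation above.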
What the chart shows: Revenue is flat through early 2011, rises sharply from September, and peaks in November (£40,000+). The December drop is a data artefact — the dataset ends on 9 December. The seasonal pattern is consistent with a gift retailer supplying wholesalers ahead of Christmas.
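A monthly revenue series of the kind described here could be aggregated as follows. This is a sketch on invented rows; Month and monthly_revenue are assumed names, and the course's own chart code may differ.

```r
library(dplyr)

# Toy stand-in for `retail_clean` (invented dates and amounts)
retail_clean <- data.frame(
  InvoiceDate = as.Date(c("2011-09-15", "2011-11-03", "2011-11-20", "2011-12-05")),
  Revenue     = c(20, 35, 45, 10)
)

monthly_revenue <- retail_clean |>
  mutate(Month = format(InvoiceDate, "%Y-%m")) |>   # e.g. "2011-11"
  group_by(Month) |>
  summarise(Revenue = sum(Revenue))

monthly_revenue
```

The result has one row per calendar month and can be passed to ggplot() with geom_col() or geom_line(). A month that is only partly covered, like December here, will look low for that reason alone.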
Discussion: Why does revenue stay low in January–February despite Valentine’s Day? What does this tell us about this retailer’s customer base?
Exercise
Exercise 1 — Country deep-dive
Pick a country other than the UK — for example Germany or France. Filter the data to that country only and find the top 5 products by revenue.
Exercise 2
Which single transaction row generated the highest revenue?
retail_clean |>
  arrange(desc(Revenue)) |>
  slice(1)
InvoiceNo StockCode Description Quantity InvoiceDate
1 581115 22413 METAL SIGN TAKE IT OR LEAVE IT 1404 2011-12-07
UnitPrice CustomerID Country Revenue
1 2.75 15195 United Kingdom 3861
Session 2 — RFM Segmentation
Time: 10:30 – 11:15
What is RFM?
RFM scores each customer on three dimensions derived entirely from their transaction history:
| Dimension | Question | Direction |
|---|---|---|
| Recency (R) | How recently did this customer buy? | Fewer days = better |
| Frequency (F) | How many distinct purchases? | Higher = better |
| Monetary (M) | How much have they spent in total? | Higher = better |
RFM requires no demographic data — only a customer ID, a date, and a value. It was developed in direct mail marketing in the 1980s and remains one of the most widely used segmentation frameworks in retail and e-commerce.
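The three measures can be computed with one grouped summary per customer. The sketch below uses invented transactions and makes one assumption that the text does not state: recency is counted from the day after the last date in the data (snapshot_date), so the most recent buyer has a recency of 1.

```r
library(dplyr)

# Toy stand-in for `retail_clean` (invented transactions)
retail_clean <- data.frame(
  CustomerID  = c(1L, 1L, 1L, 2L, 3L, 3L),
  InvoiceNo   = c("A1", "A1", "A2", "B1", "C1", "C2"),
  InvoiceDate = as.Date(c("2011-11-01", "2011-11-01", "2011-12-01",
                          "2011-03-10", "2011-08-05", "2011-10-01")),
  Revenue     = c(10, 5, 20, 8, 12, 30)
)

# Assumption: recency is measured from the day after the last transaction date
snapshot_date <- max(retail_clean$InvoiceDate) + 1

rfm <- retail_clean |>
  group_by(CustomerID) |>
  summarise(
    Recency   = as.numeric(snapshot_date - max(InvoiceDate)),  # days since last purchase
    Frequency = n_distinct(InvoiceNo),                         # distinct invoices
    Monetary  = sum(Revenue)                                   # total spend
  )

rfm
```

Counting distinct invoices rather than rows matters: customer 1 has three rows but only two invoices, so Frequency is 2.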
CustomerID Recency Frequency Monetary
Min. :12347 Min. : 1 Min. : 1.000 Min. : 0.39
1st Qu.:13792 1st Qu.: 24 1st Qu.: 1.000 1st Qu.: 17.00
Median :15289 Median : 64 Median : 2.000 Median : 39.00
Mean :15290 Mean :104 Mean : 2.699 Mean : 104.76
3rd Qu.:16779 3rd Qu.:163 3rd Qu.: 3.000 3rd Qu.: 90.55
Max. :18287 Max. :374 Max. :97.000 Max. :14678.87
What the distribution tells us:
| Metric | Median | Mean | Max |
|---|---|---|---|
| Recency (days) | 64 | 104 | 374 |
| Frequency (orders) | 2 | 2.7 | 97 |
| Monetary (£) | 39 | 105 | 14,679 |
All three distributions are right-skewed: the mean is pulled well above the median by a small number of high-value, high-frequency customers, most likely wholesalers. Half of all customers made two or fewer purchases and spent £39 or less over the whole period.
Step 2 — Score 1–5 and Assign Segments
We use quintile scoring — dividing customers into five equal groups for each dimension, scored 1 (worst) to 5 (best).
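Quintile scores can be produced with dplyr's ntile(). In the sketch below the customer values are invented, and the case_when() cut-offs that map scores to segment names are hypothetical: the segment names come from this course, but the exact rules used to assign them are not given in the text.

```r
library(dplyr)

# Toy RFM table (invented values)
rfm <- data.frame(
  CustomerID = 1:10,
  Recency    = c(5, 12, 20, 35, 50, 80, 120, 200, 260, 340),
  Frequency  = c(12, 8, 6, 4, 3, 3, 2, 1, 1, 1),
  Monetary   = c(900, 400, 250, 150, 90, 70, 40, 30, 20, 10)
)

rfm <- rfm |>
  mutate(
    R_Score = ntile(desc(Recency), 5),  # fewest days since last purchase gets 5
    F_Score = ntile(Frequency, 5),      # most orders gets 5
    M_Score = ntile(Monetary, 5),       # highest spend gets 5
    # Hypothetical cut-offs, for illustration only
    Segment = case_when(
      R_Score >= 4 & F_Score >= 4 ~ "Champions",
      R_Score >= 3 & F_Score >= 3 ~ "Loyal Customers",
      R_Score >= 4                ~ "Recent Customers",
      R_Score == 3                ~ "Potential Loyalists",
      F_Score >= 3                ~ "At Risk",
      TRUE                        ~ "Lost"
    )
  )

rfm
```

Note that ntile() always makes groups of near-equal size, so tied values can land in different groups. Here three customers have a Frequency of 1, yet one of them receives an F_Score of 2. With many one-order customers, as in this dataset, that behaviour affects the Frequency score in particular.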
rfm |>
  count(Segment) |>
  ggplot(aes(x = reorder(Segment, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Customer Segments", x = NULL, y = "Number of Customers") +
  theme_minimal()
Interpreting the Segments
| Segment | Customers | Total Revenue | Share |
|---|---|---|---|
| Champions | 712 | £177,153 | 56% |
| Loyal Customers | 581 | £48,935 | 15% |
| At Risk | 516 | £47,008 | 15% |
| Lost | 692 | £24,704 | 8% |
| Recent Customers | 268 | £10,403 | 3% |
| Potential Loyalists | 248 | £7,844 | 2% |
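A segment summary with these columns could be derived from the rfm table as sketched below. The values are invented, and segment_summary is an illustrative name.

```r
library(dplyr)

# Toy stand-in for `rfm` (invented values)
rfm <- data.frame(
  Segment  = c("Champions", "Champions", "At Risk", "Lost", "Lost"),
  Monetary = c(300, 200, 100, 30, 20)
)

segment_summary <- rfm |>
  group_by(Segment) |>
  summarise(Customers = n(), Total_Revenue = sum(Monetary)) |>
  mutate(Share = round(100 * Total_Revenue / sum(Total_Revenue))) |>  # % of all revenue
  arrange(desc(Total_Revenue))

segment_summary
```

Dividing Total_Revenue by Customers gives the average spend per customer in each segment. That is how a figure such as £177,153 / 712 ≈ £249 for Champions is obtained.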
Champions (712 customers, £177,153): Bought 18 days ago on average, placed nearly 6 orders, and spent £249 on average. These are most likely the wholesalers, the backbone of the business. On these averages, one Champion brings in about as much revenue as six Recent Customers (£10,403 / 268 ≈ £39 each), so each Champion lost is expensive to replace.
Loyal Customers (581 customers, £48,935): Bought 48 days ago on average, placed 2.6 orders, and spent £84 on average. Solid, recurring customers. The opportunity here is to move them up: a loyalty programme or personalised upsell could convert some into Champions.
At Risk (516 customers, £47,008): This is the most strategically important segment. Their average monetary value (£91) is actually higher than that of Loyal Customers (£84), yet they have not purchased in an average of 182 days. They were valuable customers who have gone quiet. Without intervention they will slide into Lost. A targeted win-back campaign (a personalised offer, a “we miss you” email) is the highest-priority action.
Lost (692 customers, £24,704): The second-largest segment by customer count, after Champions (712). They last bought 227 days ago on average, placed only 1 order, and spent just £36. These customers were probably never deeply engaged with the retailer. At this recency distance, the cost of re-acquisition typically exceeds the expected return. Low priority.
Recent Customers (268 customers, £10,403): Bought recently (23 days ago on average) but only once. These are new customers in their first purchase window, the most critical period for habit formation. A follow-up offer or welcome sequence within the next 30 days can convert them into Loyal Customers. If ignored, most will become Lost.
Potential Loyalists (248 customers, £7,844): One purchase, 64 days ago on average, £32 average spend. Light engagement is appropriate; heavy investment is not yet justified.
Strategic summary: Protect Champions at all costs. Rescue At Risk customers urgently. Nurture Recent Customers within 30 days. Do not spend on Lost.
Exercise
Exercise 3 — How many customers are in each segment?
Exercise 4 — What does the average Champion look like?
Filter the rfm table to Champions and report their average Recency, Frequency, and Monetary.
Session 3 — Personalised Marketing
Time: 11:30 – 12:15
From Segments to Actions
RFM tells us who the customers are. The next question is what to offer them. A natural starting point: recommend to Loyal Customers the products that Champions already buy — to nudge them up a tier.
# Attach segment labels to transaction data
retail_segmented <- retail_clean |>
  left_join(rfm |> select(CustomerID, Segment), by = "CustomerID")

# Top 15 products bought by Champions
champion_products <- retail_segmented |>
  filter(Segment == "Champions") |>
  group_by(Description) |>
  summarise(Times_Bought = n()) |>
  arrange(desc(Times_Bought)) |>
  slice(1:15)

champion_products |> knitr::kable()
| Description | Times_Bought |
|---|---|
| PARTY BUNTING | 37 |
| JUMBO BAG RED RETROSPOT | 36 |
| HEART OF WICKER SMALL | 32 |
| LUNCH BAG RED RETROSPOT | 28 |
| WHITE HANGING HEART T-LIGHT HOLDER | 28 |
| ASSORTED COLOUR BIRD ORNAMENT | 27 |
| SPACEBOY LUNCH BOX | 27 |
| LUNCH BAG BLACK SKULL. | 26 |
| LUNCH BAG CARS BLUE | 26 |
| LUNCH BAG SPACEBOY DESIGN | 25 |
| PAPER CHAIN KIT 50’S CHRISTMAS | 25 |
| REGENCY CAKESTAND 3 TIER | 25 |
| PACK OF 72 RETROSPOT CAKE CASES | 24 |
| JUMBO STORAGE BAG SUKI | 23 |
| LUNCH BAG PINK POLKADOT | 23 |
# Products bought by Loyal Customers
loyal_products <- retail_segmented |>
  filter(Segment == "Loyal Customers") |>
  distinct(Description)

# Gap: what Champions love that Loyal Customers have not bought
champion_products |>
  filter(!Description %in% loyal_products$Description) |>
  knitr::kable()
| Description | Times_Bought |
|---|---|
| JUMBO STORAGE BAG SUKI | 23 |
The Finding
Champions and Loyal Customers buy almost identical products. Of the top 15 Champion products, only one (Jumbo Storage Bag Suki) has never been bought by a Loyal Customer.
What this means: The difference between these two segments is not what they buy but how often and how much. Product recommendations are the wrong intervention.
The right strategy: Increase purchase frequency. Bring Loyal Customers back sooner with:
A time-limited offer tied to their purchase gap (“It has been 60 days since your last order”)
A loyalty reward triggered after the next purchase
A seasonal prompt ahead of the Christmas peak
This is more targeted, more actionable — and, as we will see, more legally sensitive.
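The purchase-gap trigger in the first bullet could be sketched as a filter on the rfm table. The 60-day threshold comes from the example message; gap_days, reminder_list, and the customer values are hypothetical.

```r
library(dplyr)

# Toy stand-in for `rfm` (invented values)
rfm <- data.frame(
  CustomerID = c(101L, 102L, 103L, 104L),
  Segment    = c("Loyal Customers", "Loyal Customers", "Champions", "Loyal Customers"),
  Recency    = c(75, 30, 90, 60)
)

gap_days <- 60  # hypothetical threshold, taken from the example message

# Loyal Customers whose last purchase was at least `gap_days` days ago
reminder_list <- rfm |>
  filter(Segment == "Loyal Customers", Recency >= gap_days) |>
  select(CustomerID, Recency)

reminder_list
```

Only customers 101 and 104 qualify: customer 102 bought too recently, and customer 103 is in a different segment.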
Exercise
Exercise 5 — How much revenue is at stake if At Risk customers are lost?
Exercise 6 — Which segment generates the most revenue?
Sort all segments by total Monetary value.
Session 4 — Data Regulation and Ethics
Time: 12:15 – 13:00
What is GDPR?
The General Data Protection Regulation (GDPR) has applied since 25 May 2018. It is EU law governing how organisations collect, store, and use personal data. It applies to organisations anywhere in the world that offer goods or services to people in the EU or monitor their behaviour there.
The core idea: in Europe, privacy is a fundamental right, not a consumer preference. This is the key difference from the US, where there is no equivalent federal law.
Fines can reach €20 million or 4% of global annual turnover, whichever is higher. Meta was fined €1.2 billion in 2023.
How GDPR is enforced
National Supervisory Authorities (SAs): Every EU member state has an independent data protection authority, such as the CNIL in France, the BfDI in Germany, and the DPC in Ireland. The UK's ICO plays the same role under the UK GDPR, which has applied there since Brexit. These authorities can investigate companies on their own initiative, without any complaint, and impose fines directly. They do not need to go to court.
European Data Protection Board (EDPB): When a case crosses borders, the EDPB coordinates national authorities and can issue binding decisions. This is how the €1.2 billion Meta fine came about: Ireland’s DPC had not proposed a fine, and the EDPB's binding decision required it to impose one.
What did we just build?
Look at what we coded today through a GDPR lens:
| What we did | Marketing term | GDPR term |
|---|---|---|
| Computed RFM scores per CustomerID | Customer scoring | Profiling (Article 4.4) |
| Assigned segment labels automatically | Segmentation | Systematic profiling at scale |
| Identified days since each customer’s last purchase | Churn detection | Individual behavioural tracking |
| Designed a trigger email based on purchase timing | Lifecycle marketing | Automated direct marketing based on profiling |
| Recommended products to specific segments | Personalised marketing | Targeted direct marketing using profiled data |
Every row in this table is profiling under GDPR. Every row requires a lawful basis — and the company must be able to prove it.
Two articles that directly apply
Article 4.4 (Profiling): Any automated use of personal data to evaluate or predict a person’s behaviour is profiling. An RFM score attached to a CustomerID is profiling, even if no name or email address is visible. The score is derived from personal data and linked to an identifiable individual.
Article 21 (Right to object to direct marketing): Every individual has an absolute right to object to their data being used for direct marketing. No exceptions. No override. If a customer clicks unsubscribe, all profiling for marketing purposes must stop, immediately and permanently.
Our trigger email “It has been 60 days since your last order” sends because we tracked that individual’s purchase timing and used it to target them. That is direct marketing based on profiling. Article 21 applies.
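In code, honouring an objection could start with a suppression step before any message is sent. The sketch below is illustrative only: marketing_objections is a hypothetical table of customers who have objected, and reminder_list is a hypothetical list of customers due a trigger email.

```r
library(dplyr)

# Hypothetical list of customers due a trigger email
reminder_list <- data.frame(CustomerID = c(101L, 104L, 107L))

# Hypothetical table of customers who have objected to direct marketing
marketing_objections <- data.frame(CustomerID = c(104L))

# Keep only customers with no recorded objection
contactable <- reminder_list |>
  anti_join(marketing_objections, by = "CustomerID")

contactable
```

Filtering the send list is the minimum. On the reading of Article 21 given above, the same exclusion would also be applied earlier, before RFM scores are computed for marketing purposes.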
The Target case
In 2012, Target predicted a teenage customer was pregnant from her purchase patterns — before she told her family. She received baby coupons at home. Target faced no legal consequence in the US.
Under GDPR, pregnancy is health data — special category data requiring explicit consent. Inferring it from purchase behaviour without consent is a serious violation regardless of whether the inference is correct.
Same analytics. Entirely different legal outcome depending on which side of the Atlantic you operate.
Real fines
| Company | Fine | Reason | Year |
|---|---|---|---|
| Meta | €1.2 billion | Transferring EU user data to US servers | 2023 |
| Amazon | €746 million | Targeted advertising without consent | 2021 |
| LinkedIn | €310 million | Unlawful behavioural advertising | 2024 |
| Google | €50 million | Consent buried in multi-step menus | 2019 |
Discussion (15 min)
“You are hired as a data analyst at a European e-commerce company. Your manager asks you to build exactly what we built today and deploy it next week. What do you do?”
“Analytics tells you what is possible. GDPR tells you what is permitted. Your job as a data scientist is to know both.”
References
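Chen, D., Sain, S. L., & Guo, K. (2012). Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining. Journal of Database Marketing & Customer Strategy Management, 19(3), 197–208. Dataset DOI (UCI Machine Learning Repository): 10.24432/C5BW33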
Binns, R., & Veale, M. (2021). Is that your final decision? Multi-stage profiling, selective effects, and Article 22 of the GDPR. International Data Privacy Law, 11(4), 319–332.
Future of Privacy Forum (2022). Automated Decision-Making: Practical Cases from Courts. FPF Report
European Data Protection Board (2023). €1.2 billion fine for Facebook. EDPB