Management requires a dataset that is computationally efficient while remaining representative of the global customer base. The objective is to reduce memory usage and processing time when collecting data from Spark into R, while preserving the true distribution of customers across continents.
Method definition: Stratified sampling divides the population into homogeneous subgroups (strata) and then samples from each subgroup. In this case, the strata are defined by continent.
Instead of treating the global customer base as one uniform pool, we explicitly ensure that each continent is proportionally represented in the final dataset.
Stratified sampling by continent is appropriate for the following reasons:
## 1. Clean start
rm(list = ls())
invisible(gc(verbose = FALSE))
options(warn.conflicts = FALSE)
## 2. Load packages
required_packages <- c(
"tidyr",
"ggplot2",
"knitr",
"tidyverse",
"lubridate",
"dbplyr",
"sparklyr",
"rprojroot",
"DBI",
"moments"
)
is_req_pk_name <- !required_packages %in% rownames(installed.packages())
missing <- required_packages[is_req_pk_name]
if (length(missing)) {
stop("Missing packages: ", paste(missing, collapse = ", "))
}
options(tidyverse.quiet = TRUE)
invisible(lapply(required_packages, function(pkg) {
library(pkg, character.only = TRUE, warn.conflicts = FALSE)
}))
#invisible(lapply(required_packages, library, character.only = TRUE))
## 3. Project paths
root_path <- rprojroot::find_rstudio_root_file()
project_path <- root_path
## 4. Spark connection
spark_connection <- source(file.path(project_path, "config/spark.R"))$value
## Spark Connection Status (Open): TRUE
stopifnot(sparklyr::spark_connection_is_open(spark_connection))
## 5. Initialize or get sales table
source(file.path(project_path, "R/init_sales_table.R"))
sales_table_in_spark <- init_sales_table(spark_connection)
## >>> Table 'sales_spark_table' missing. Generating data...
## >>> Copying data to Spark...
## 6. Sample data for stratified analysis
stratified_sample <- sales_table_in_spark |>
# Spark's native sampling function
# fraction = 0.1 gives roughly 10%
sparklyr::sdf_sample(fraction = 0.1, replacement = FALSE, seed = 42) |>
collect()
## 7. Verify the results
tbl <- stratified_sample |>
count(continent) |>
mutate(percentage = n / sum(n) * 100) |>
kable(caption = "Summary of Stratified Sample by Continent")
print(tbl)
##
##
## Table: Summary of Stratified Sample by Continent
##
## |continent | n| percentage|
## |:-------------|-----:|----------:|
## |Africa | 10023| 20.084965|
## |Asia | 12307| 24.661844|
## |Europe | 8981| 17.996914|
## |North America | 11055| 22.152977|
## |Oceania | 2527| 5.063824|
## |South America | 5010| 10.039477|
How do system characteristics differ across global regions, and are there notable patterns in how customers access the platform?
Analyzing device_type by region identifies critical UX disparities that shape global product strategy. While mobile-heavy regions like Oceania demand ‘Mobile-First’ architecture and fast payment flows, desktop-heavy markets require high-fidelity media and detailed specifications. This pairing ensures that our technical roadmap aligns with the actual hardware habits of our diverse user base.
We first examine the overall distribution of device types across the sampled dataset.
device_freq <- stratified_sample |>
group_by(device_type) |>
tally(name = "count") |>
mutate(percentage = count / sum(count) * 100) |>
arrange(desc(count)) |>
collect()
knitr::kable(device_freq, caption = "Frequency Table: Device Type (Counts and Percentages)")
| device_type | count | percentage |
|---|---|---|
| Mobile | 29970 | 60.056510 |
| Desktop | 14981 | 30.020239 |
| Tablet | 4952 | 9.923251 |
We construct a cross-tabulation of continent and device type.
Conditioning:
Percentages are calculated within continent (row
percentages).
That is, for each continent, percentages show the distribution of device
types used within that specific region.
Reporting percentages without conditioning leads to misleading conclusions.
If we used overall percentages, larger continents (with more customers) would dominate the distribution. For example:
By conditioning within continent, we answer the correct business question:
Given a customer is in Asia, what is the probability they use Mobile vs Desktop?
This removes size effects and isolates behavioral differences.
Compare behavior, not scale
We evaluate device preference patterns within each continent,
independent of total customer volume.
Make fair regional comparisons
We can compare:
Without distortion from population differences.
Support operational decisions
Product optimization decisions (mobile UX, desktop feature rollouts,
infrastructure planning) depend on relative usage
patterns, not absolute counts.
Avoid Simpson’s Paradox-type distortions
Aggregated percentages can hide regional variation. Conditioning
preserves subgroup structure.
continent_device_share <- sales_table_in_spark |>
group_by(continent, device_type) |>
tally(name = "count") |>
group_by(continent) |>
mutate(device_share = count / sum(count) * 100) |>
ungroup() |>
select(continent, device_type, device_share) |>
collect()
## Warning: Missing values are always removed in SQL aggregation functions.
## Use `na.rm = TRUE` to silence this warning
## This warning is displayed once every 8 hours.
# Pivot to wide format
continent_device_table <- continent_device_share |>
mutate(device_share = round(device_share, 1)) |>
pivot_wider(
names_from = device_type,
values_from = device_share,
names_glue = "{device_type} Share (%)"
) |>
arrange(continent)
kable(
continent_device_table,
caption = "Device Type Share (%) Within Each Continent"
)
| continent | Tablet Share (%) | Desktop Share (%) | Mobile Share (%) |
|---|---|---|---|
| Africa | 10.0 | 30.0 | 60.0 |
| Asia | 10.0 | 30.0 | 59.9 |
| Europe | 10.0 | 30.0 | 60.0 |
| North America | 10.0 | 30.2 | 59.8 |
| Oceania | 9.6 | 30.4 | 60.0 |
| South America | 10.1 | 30.1 | 59.8 |
This visualization displays the joint distribution of Continent and Device Type using raw counts.
Chart Structure
Each panel (or grouping) allows us to compare how continents differ for a given device type, and how device usage varies within each continent.
Although this chart uses raw counts, interpretation must be conditioned on continent-level totals.
That means:
In other words:
Counts reflect scale, not composition.
Avoid Misinterpreting Volume as Preference
Larger continents will naturally have higher device counts.
Without conditioning, we might incorrectly conclude that one continent
“prefers” a device simply because it has more customers
overall.
Separate Market Size from Behavior
These are different business questions.
Support Correct Business Decisions
Prevent Scale Bias
If Asia has twice as many customers as Europe, its bars will appear
larger across all device types — even if the internal device mix is
identical.
Conditioning prevents drawing false conclusions from scale
effects.
library(ggplot2)
# Recreate aggregation directly from Spark
device_continent_counts <- stratified_sample |>
group_by(continent, device_type) |>
tally(name = "count") |>
collect()
# Grouped bar chart
ggplot(device_continent_counts,
aes(x = device_type,
y = count,
fill = device_type)) +
geom_bar(stat = "identity",
position = position_dodge()) +
facet_wrap(~ continent) +
labs(
title = "Device Type Counts by Continent",
x = "Device Type",
y = "Count",
fill = "Device Type"
) +
theme_minimal()
This visualization answers a slightly different structural question:
For each device type, which continent contributes the highest usage?
Chart Structure
Each panel isolates a single device type and compares continents against one another.
Although the chart displays raw counts, interpretation must be conditioned on continent size.
That means:
This chart is conditioned implicitly on:
Given a specific device type, how is its usage distributed across continents in absolute terms?
Counts Reflect Market Size, Not Preference
If Asia has the largest total customer base, it may dominate in Mobile, Desktop, and Tablet counts purely due to scale.
That does not imply higher relative preference.
Correct Interpretation of “Leads”
The chart title asks: Which continent leads by device
type?
“Leads” here refers to absolute contribution, not
proportional dominance.
Avoid Confusing Volume with Behavior
Operational vs Behavioral Insight
Prevent Scale Distortion
Without explicitly acknowledging conditioning, one might incorrectly interpret taller bars as stronger behavioral preference rather than larger customer base.
library(ggplot2)
library(dplyr)
ggplot(device_continent_counts,
aes(x = continent,
y = count,
fill = device_type)) +
geom_col(position = position_dodge(width = 0.8), width = 0.7) +
geom_text(aes(label = count),
position = position_dodge(width = 0.8),
vjust = -0.3,
size = 3) +
facet_wrap(~ device_type, scales = "free_y") +
labs(
title = "Which Continent Leads by Device Type?",
x = "Continent",
y = "Count",
fill = "Device Type"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1)
)
The stratified sample successfully preserves the continental distribution of the global customer base, ensuring fair representation across regions. This confirms that the dataset remains structurally reliable for regional comparison and managerial decision-making.
Across all continents, device composition is highly consistent:
This indicates that device preference is not strongly region-dependent. Customer access behavior appears globally standardized, with a clear mobile-first pattern across markets.
From the count-based visualizations:
However, these differences reflect market size (scale) rather than stronger device preference within those continents.
Behavioral Insight - Customers globally are mobile-first. - Device mix is structurally similar across continents.
Operational Insight - Infrastructure load will be highest in Asia due to larger customer volume. - Scaling decisions should prioritize high-volume regions.
Strategic Insight - UX optimization should prioritize Mobile globally. - Infrastructure and capacity planning should be region-volume driven.
Device behavior is globally uniform, but traffic scale differs by continent.
Product strategy should be mobile-first globally, while operational strategy should be Asia-weighted for scaling and infrastructure planning.
How is customer engagement distributed, and what does this reveal
about usage intensity and the presence of high-activity users? ##
Selected Numerical Variable: session_duration
The numerical variable selected for this analysis is session_duration (measured in minutes).
This variable captures how long customers actively engage with the
platform during a session.
It provides insight into engagement depth rather than spending value,
which was the primary focus in the lecture slides.
Understanding session duration helps answer:
The following metrics are calculated using Spark aggregations:
percentile_approx()These measures provide both central tendency and upper-tail behavior.
# Summary stats for one numeric variable from stratified_sample
num_var <- "session_duration"
stopifnot(num_var %in% names(stratified_sample))
stopifnot(is.numeric(stratified_sample[[num_var]]))
stats_tbl <- stratified_sample |>
dplyr::summarise(
count = sum(!is.na(.data[[num_var]])),
min = min(.data[[num_var]], na.rm = TRUE),
mean = mean(.data[[num_var]], na.rm = TRUE),
median = stats::quantile(.data[[num_var]], 0.50, na.rm = TRUE, names = FALSE),
sd = stats::sd(.data[[num_var]], na.rm = TRUE),
p90 = stats::quantile(.data[[num_var]], 0.90, na.rm = TRUE, names = FALSE),
p95 = stats::quantile(.data[[num_var]], 0.95, na.rm = TRUE, names = FALSE),
max = max(.data[[num_var]], na.rm = TRUE)
)
knitr::kable(
stats_tbl,
caption = paste0("Summary Statistics for ", num_var, " (computed from Stratified Sample)")
)
| count | min | mean | median | sd | p90 | p95 | max |
|---|---|---|---|---|---|---|---|
| 49903 | 3.2 | 24.95533 | 22.5 | 11.70783 | 40.1 | 47.1 | 151.5 |
To understand the distribution beyond averages, we assess skewness and kurtosis.
The code extracts the numeric variable, removes missing values, and computes shape metrics.
# Change to your numeric variable
num_var <- "session_duration"
x <- stratified_sample[[num_var]]
x <- x[!is.na(x)]
# 1. Skewness & Kurtosis
shape_stats <- tibble(
"Skewness" = skewness(x),
"Kurtosis" = kurtosis(x)
)
kable(shape_stats, digits = 3,
caption = paste0("Shape Metrics for ", num_var))
| Skewness | Kurtosis |
|---|---|
| 1.51 | 7.025 |
To formally identify extreme values, the Interquartile Range (IQR) method is applied.
The IQR measures the spread of the middle 50% of the data:
\[ IQR = Q3 - Q1 \]
Observations falling outside the interval:
\[ [Q1 - 1.5 \times IQR,\; Q3 + 1.5 \times IQR] \]
are classified as extreme values.
This method is robust because it does not assume normality and is not distorted by large outliers.
If the percentage of extreme values is meaningful, this indicates the presence of heavy users who may disproportionately influence averages and system load.
# 2. IQR-based outlier detection
Q1 <- quantile(x, 0.25)
Q3 <- quantile(x, 0.75)
IQR_val <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val
extreme_values <- x[x < lower_bound | x > upper_bound]
outlier_summary <- tibble(
"Lower Bound" = lower_bound,
"Upper Bound" = upper_bound,
"Number of Extreme Values" = length(extreme_values),
"Percent of Extreme Values" = length(extreme_values) / length(x) * 100
)
kable(outlier_summary, digits = 2,
caption = paste0("Extreme Value Assessment for ", num_var))
| Lower Bound | Upper Bound | Number of Extreme Values | Percent of Extreme Values |
|---|---|---|---|
| -3.85 | 50.95 | 1714 | 3.43 |
A histogram complements the statistical diagnostics by visually displaying the distribution shape.
Key visual signals:
Visual inspection ensures that numerical diagnostics align with actual distribution structure.
library(ggplot2)
ggplot(stratified_sample, aes(x = .data[[num_var]])) +
geom_histogram(bins = 40) +
labs(title = paste("Distribution of", num_var)) +
theme_minimal()
The above examines session duration, defined as the time between a customer entering the platform and completing an online purchase. Since these are completed online orders, session duration represents the time-to-purchase.
The summary statistics indicate that:
This suggests that most customers complete their purchases within a relatively consistent time window, but a smaller group takes significantly longer before converting.
The right-skewed pattern has a clear behavioral meaning:
Because the median is lower than the mean, the typical customer completes their purchase faster than what the average alone might suggest. The mean is pulled upward by longer sessions.
1. Checkout Efficiency A concentrated lower-to-mid session duration range indicates that the purchase funnel is generally effective.
2. Friction Detection Long-duration outliers signal potential friction points in: - Payment processing - Product information clarity - Navigation complexity
3. Optimization Strategy Reducing the tail of long sessions can: - Increase conversion speed - Improve user satisfaction - Reduce abandonment risk in near-purchase scenarios
Session duration data shows a healthy core of efficient purchases with a small but meaningful tail of delayed conversions.
The operational focus should not be on shortening already efficient sessions, but on identifying and eliminating the causes of extended purchase times to tighten the distribution and improve overall transaction flow.
Do key numerical outcomes differ meaningfully across customer groups? ## Selected Variables
continentsession_durationThe objective is to determine whether customer engagement intensity differs across regions.
For each continent, we report:
These metrics allow comparison of:
num_var <- "session_duration"
group_summary <- stratified_sample |>
group_by(continent) |>
summarise(
mean = mean(.data[[num_var]], na.rm = TRUE),
sd = sd(.data[[num_var]], na.rm = TRUE),
p95 = quantile(.data[[num_var]], 0.95, na.rm = TRUE),
.groups = "drop"
)
kable(group_summary, digits = 2,
caption = "Grouped Summary: Session Duration by Continent")
| continent | mean | sd | p95 |
|---|---|---|---|
| Africa | 25.02 | 11.69 | 46.9 |
| Asia | 24.90 | 11.61 | 47.0 |
| Europe | 24.67 | 11.54 | 46.5 |
| North America | 25.12 | 11.93 | 47.7 |
| Oceania | 24.88 | 11.77 | 47.2 |
| South America | 25.13 | 11.75 | 47.5 |
A violin plot is used because:
It shows distribution shape.
It highlights density differences.
It captures variability and tail thickness.
It allows visual comparison of skewness between groups.
ggplot(stratified_sample,
aes(x = continent,
y = .data[[num_var]],
fill = continent)) +
geom_violin(trim = FALSE, alpha = 0.6) +
geom_boxplot(width = 0.1, outlier.shape = NA) +
labs(
title = "Distribution of Session Duration by Continent",
x = "Continent",
y = "Session Duration"
) +
theme_minimal() +
theme(legend.position = "none")
Based on the grouped summary table and violin plot above, session duration is remarkably consistent across continents.
The violin plots confirm that: - Distribution shapes are nearly identical. - All regions exhibit right-skewness. - Upper-tail behavior (long sessions) exists consistently across continents.
There is no meaningful regional difference in engagement intensity. Customers across Africa, Asia, Europe, North America, Oceania, and South America demonstrate:
This indicates that engagement depth is behaviorally uniform at a global level.
Regional strategy does not need to differentiate based on engagement intensity.
(Based on Part D – Violin Plot of Session Duration by Continent) ### Selected Visualization: Violin Plot (with Embedded Boxplot) #### 1. Why is this visualization appropriate for the data type?
Session duration is a continuous numerical variable, and the comparison is across categorical groups (continents).
A violin plot is appropriate because it:
Since Part D focuses on engagement intensity and upper-tail behavior, a distribution-based chart is the correct analytical choice.
If a simple bar chart of means were used:
Because session duration contains heavy tails and extreme values, reducing the data to means alone would oversimplify engagement behavior and potentially distort interpretation.
The violin plot clearly shows that:
This supports a critical managerial conclusion:
In short, the visualization prevents over-segmentation and supports unified, globally scalable optimization decisions.
Analysis Prepared By: Rubin Anyeema Apore
For: BDAT 603 – Descriptive Analytics
Dataset: Simulated Global Online Sales (Spark-Based
Stratified Sample)
Scope: Parts A–E (Sampling, Categorical Analysis,
Numerical Analysis, Grouped Comparison, Visualization Critique)
Date: “2026-02-12” —