Part A: Sampling Strategy

Business Motivation

Management requires a dataset that is computationally efficient while remaining representative of the global customer base. The objective is to reduce memory usage and processing time when collecting data from Spark into R, while preserving the true distribution of customers across continents.

Selected Sampling Method: Stratified Sampling by Continent

Method definition: Stratified sampling divides the population into homogeneous subgroups (strata) and then samples from each subgroup. In this case, the strata are defined by continent.

Instead of treating the global customer base as one uniform pool, we explicitly ensure that each continent is proportionally represented in the final dataset.
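
As a toy illustration of the idea (hypothetical data; the Spark-side sampling used for the actual analysis appears in the setup code further below), dplyr can draw a fixed fraction within each stratum of a local data frame:

# Toy illustration only: 10% drawn within each stratum of a small local data frame
library(dplyr)

toy <- data.frame(
  continent = rep(c("Asia", "Europe"), times = c(1000, 500)),
  amount    = runif(1500)
)

toy_stratified <- toy |>
  group_by(continent) |>
  slice_sample(prop = 0.1) |>   # 10% within each continent
  ungroup()

table(toy_stratified$continent)  # roughly 100 Asia rows and 50 Europe rows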

Why Stratified Sampling by Continent Is Appropriate

Stratified sampling by continent is appropriate for the following reasons:

  • Global representativeness: Since the business operates across multiple continents, customer behavior, purchasing power, and product mix may vary geographically. Stratification ensures no region is under- or over-represented.
  • Reduced sampling variance: Compared to simple random sampling, stratified sampling often produces more stable and precise estimates when variability exists between groups.
  • Fair managerial insight: Management decisions (pricing, logistics, marketing spend) are often region-specific. Preserving continental proportions prevents distorted conclusions.
  • Computational efficiency: We can sample a fixed fraction (e.g., 10%) within each continent in Spark before collecting the data into R, reducing memory pressure while retaining structure.

Strengths and Limitations

Strengths

  • Ensures proportional representation across continents.
  • Produces more reliable subgroup analysis.
  • Reduces the risk of geographic bias.
  • Efficient for large distributed datasets when sampling is done in Spark.

Limitations

  • Requires correct identification of strata (continent must be accurate).
  • Slightly more complex than simple random sampling.
  • If some strata are very small, sample sizes within those groups may also become small.

Setup, Connection and Sampling

## 1. Clean start
rm(list = ls())
invisible(gc(verbose = FALSE))
options(warn.conflicts = FALSE)

## 2. Load packages
required_packages <- c(
  "tidyr",
  "ggplot2",
  "knitr",
  "tidyverse",
  "lubridate",
  "dbplyr",
  "sparklyr",
  "rprojroot",
  "DBI",
  "moments"
)

is_missing_pkg <- !required_packages %in% rownames(installed.packages())
missing <- required_packages[is_missing_pkg]

if (length(missing)) {
  stop("Missing packages: ", paste(missing, collapse = ", "))
}
options(tidyverse.quiet = TRUE)

invisible(lapply(required_packages, function(pkg) {
  library(pkg, character.only = TRUE, warn.conflicts = FALSE)
}))

#invisible(lapply(required_packages, library, character.only = TRUE))

## 3. Project paths
root_path <- rprojroot::find_rstudio_root_file()
project_path <- root_path

## 4. Spark connection
spark_connection <- source(file.path(project_path, "config/spark.R"))$value
## Spark Connection Status (Open): TRUE
stopifnot(sparklyr::spark_connection_is_open(spark_connection))

## 5. Initialize or get sales table
source(file.path(project_path, "R/init_sales_table.R"))
sales_table_in_spark <- init_sales_table(spark_connection)
## >>> Table 'sales_spark_table' missing. Generating data...
## >>> Copying data to Spark...
## 6. Sample data for stratified analysis
stratified_sample <- sales_table_in_spark |>
  # sdf_sample() is Spark's native row sampler; a uniform fraction of 0.1
  # draws roughly 10% of rows, keeping continental proportions approximately intact
  sparklyr::sdf_sample(fraction = 0.1, replacement = FALSE, seed = 42) |>
  collect()

## 7. Verify the results
tbl <- stratified_sample |>
  count(continent) |>
  mutate(percentage = n / sum(n) * 100) |>
  kable(caption = "Summary of Stratified Sample by Continent")

print(tbl)
## 
## 
## Table: Summary of Stratified Sample by Continent
## 
## |continent     |     n| percentage|
## |:-------------|-----:|----------:|
## |Africa        | 10023|  20.084965|
## |Asia          | 12307|  24.661844|
## |Europe        |  8981|  17.996914|
## |North America | 11055|  22.152977|
## |Oceania       |  2527|   5.063824|
## |South America |  5010|  10.039477|
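
Note that sdf_sample() above draws one uniform random sample over the whole table, so continental proportions are preserved only approximately rather than enforced exactly. If the sample had to be drawn within each continent explicitly, one option is to filter each stratum in Spark, sample it, and bind the results before collecting. A minimal sketch, assuming the continent labels shown in the table above:

## 6b. (Sketch) Per-stratum sampling: draw the 10% fraction within each continent, then combine
strata <- c("Africa", "Asia", "Europe", "North America", "Oceania", "South America")

per_stratum <- lapply(strata, function(s) {
  sales_table_in_spark |>
    dplyr::filter(continent == s) |>
    sparklyr::sdf_sample(fraction = 0.1, replacement = FALSE, seed = 42)
})

stratified_sample_by_continent <- do.call(sparklyr::sdf_bind_rows, per_stratum) |>
  collect()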

Part B: Categorical Variable Analysis

Business Question

How do system characteristics differ across global regions, and are there notable patterns in how customers access the platform?

Selected Variables

  • Continent
  • Device Type

Why These Variables?

Analyzing device_type by region identifies UX disparities that shape global product strategy. Mobile-heavy regions would call for a 'Mobile-First' architecture and fast payment flows, while desktop-heavy markets would require high-fidelity media and detailed product specifications. This pairing ensures that the technical roadmap aligns with the actual hardware habits of a diverse user base.

1. Frequency Table (Device Type)

We first examine the overall distribution of device types across the sampled dataset.

device_freq <- stratified_sample |>   # stratified_sample is already a local tibble
  group_by(device_type) |>
  tally(name = "count") |>
  mutate(percentage = count / sum(count) * 100) |>
  arrange(desc(count))

knitr::kable(device_freq, caption = "Frequency Table: Device Type (Counts and Percentages)")
Frequency Table: Device Type (Counts and Percentages)

|device_type | count| percentage|
|:-----------|-----:|----------:|
|Mobile      | 29970|  60.056510|
|Desktop     | 14981|  30.020239|
|Tablet      |  4952|   9.923251|

2. Cross-Tabulation: Continent × Device Type

We construct a cross-tabulation of continent and device type.

Conditioning:
Percentages are calculated within continent (row percentages).
That is, for each continent, percentages show the distribution of device types used within that specific region.

Why This Conditioning Is Necessary

Reporting percentages without conditioning leads to misleading conclusions.

If we used overall percentages, larger continents (with more customers) would dominate the distribution. For example:

  • If Asia has twice as many total customers as Europe, it will naturally have higher device counts.
  • That does not mean Asia prefers a specific device type more — it may simply reflect scale.

By conditioning within continent, we answer the correct business question:

Given a customer is in Asia, what is the probability they use Mobile vs Desktop?

This removes size effects and isolates behavioral differences.
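
For example (hypothetical numbers): if Asia had 12,000 sampled customers of whom 7,200 used Mobile, the row percentage would be 7,200 / 12,000 = 60%, and that 60% could be compared directly with Europe's Mobile share even if Europe had far fewer customers overall.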

What Row Percentages Allow Us to Do

  1. Compare behavior, not scale
    We evaluate device preference patterns within each continent, independent of total customer volume.

  2. Make fair regional comparisons
    We can compare:

    • Mobile share in Asia
    • Mobile share in Europe
    • Mobile share in North America

    Without distortion from population differences.

  3. Support operational decisions
    Product optimization decisions (mobile UX, desktop feature rollouts, infrastructure planning) depend on relative usage patterns, not absolute counts.

  4. Avoid Simpson’s Paradox-type distortions
    Aggregated percentages can hide regional variation. Conditioning preserves subgroup structure.

continent_device_share <- sales_table_in_spark |>
  group_by(continent, device_type) |>
  tally(name = "count") |>
  group_by(continent) |>
  mutate(device_share = count / sum(count) * 100) |>
  ungroup() |>
  select(continent, device_type, device_share) |>
  collect()
## Warning: Missing values are always removed in SQL aggregation functions.
## Use `na.rm = TRUE` to silence this warning
## This warning is displayed once every 8 hours.
# Pivot to wide format
continent_device_table <- continent_device_share |>
  mutate(device_share = round(device_share, 1)) |>
  pivot_wider(
    names_from = device_type,
    values_from = device_share,
    names_glue = "{device_type} Share (%)"
  ) |>
  arrange(continent)

kable(
  continent_device_table,
  caption = "Device Type Share (%) Within Each Continent"
)
Device Type Share (%) Within Each Continent

|continent     | Tablet Share (%)| Desktop Share (%)| Mobile Share (%)|
|:-------------|----------------:|-----------------:|----------------:|
|Africa        |             10.0|              30.0|             60.0|
|Asia          |             10.0|              30.0|             59.9|
|Europe        |             10.0|              30.0|             60.0|
|North America |             10.0|              30.2|             59.8|
|Oceania       |              9.6|              30.4|             60.0|
|South America |             10.1|              30.1|             59.8|

3. Visualization: Bar Chart (Counts) Grouped by Continent

This visualization displays the joint distribution of Continent and Device Type using raw counts.

Chart Structure

  • X-axis: Device Type
  • Y-axis: Count of observations
  • Facets: Continent (one panel per continent)
  • Color: Device Type (consistent colors across all continents)

Each panel shows the device mix within one continent, and comparing panels shows how continents differ for a given device type.

Conditioning

Although this chart uses raw counts, interpretation must be conditioned on continent-level totals.

That means:

  • Comparisons across continents must account for differences in total customer volume.
  • A higher bar does not automatically imply stronger preference — it may reflect larger regional population size.

In other words:

Counts reflect scale, not composition.

Why This Conditioning Is Necessary

  1. Avoid Misinterpreting Volume as Preference

    Larger continents will naturally have higher device counts.
    Without conditioning, we might incorrectly conclude that one continent “prefers” a device simply because it has more customers overall.

  2. Separate Market Size from Behavior

    • Raw counts answer: Where is device usage largest in absolute terms?
    • Conditioned percentages answer: Where is device usage most dominant relative to that region’s size?

    These are different business questions.

  3. Support Correct Business Decisions

    • Infrastructure planning may depend on absolute counts (server load, traffic volume).
    • UX optimization and marketing strategy depend on relative device share within each continent.
  4. Prevent Scale Bias

    If Asia has twice as many customers as Europe, its bars will appear larger across all device types — even if the internal device mix is identical.
    Conditioning prevents drawing false conclusions from scale effects.

library(ggplot2)

# Aggregate counts by continent and device type from the stratified sample (already local)
device_continent_counts <- stratified_sample |>
  group_by(continent, device_type) |>
  tally(name = "count")

# Grouped bar chart
ggplot(device_continent_counts,
       aes(x = device_type,
           y = count,
           fill = device_type)) +
  geom_col(position = position_dodge()) +
  facet_wrap(~ continent) +
  labs(
    title = "Device Type Counts by Continent",
    x = "Device Type",
    y = "Count",
    fill = "Device Type"
  ) +
  theme_minimal()
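
If composition rather than scale is of interest, the same aggregated counts can be redrawn as shares within each continent. A minimal sketch (it assumes the scales package, installed alongside ggplot2, for percentage axis labels):

# Sketch: stack counts to 100% within each continent to show composition, not scale
ggplot(device_continent_counts,
       aes(x = continent,
           y = count,
           fill = device_type)) +
  geom_col(position = "fill") +                    # normalizes each continent's bar to 100%
  scale_y_continuous(labels = scales::percent) +   # percentage labels on the y-axis
  labs(
    title = "Device Type Share Within Each Continent",
    x = "Continent",
    y = "Share of Sessions",
    fill = "Device Type"
  ) +
  theme_minimal()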


4. Visualization: Bar Chart (Counts) Across Continents, Faceted by Device Type

This visualization answers a slightly different structural question:

For each device type, which continent contributes the highest usage?

Chart Structure

  • X-axis: Continent
  • Y-axis: Count of observations
  • Facets: Device Type (one panel per device type)
  • Color: Device Type (consistent within each panel)

Each panel isolates a single device type and compares continents against one another.

Conditioning

Although the chart displays raw counts, interpretation must be conditioned on continent size.

That means:

  • Comparisons across continents reflect absolute volume.
  • Higher bars indicate larger total usage, not necessarily stronger device preference within that continent.

This chart is conditioned implicitly on:

Given a specific device type, how is its usage distributed across continents in absolute terms?

Why This Conditioning Is Necessary

  1. Counts Reflect Market Size, Not Preference

    If Asia has the largest total customer base, it may dominate in Mobile, Desktop, and Tablet counts purely due to scale.

    That does not imply higher relative preference.

  2. Correct Interpretation of “Leads”

    The chart title asks: Which continent leads by device type?
    “Leads” here refers to absolute contribution, not proportional dominance.

  3. Avoid Confusing Volume with Behavior

    • This chart shows where device traffic is highest.
    • It does not show which continent prefers a device more relative to its own population.
  4. Operational vs Behavioral Insight

    • Absolute counts inform infrastructure, traffic load, and scaling decisions.
    • Conditioned percentages (row-based shares) inform UX prioritization and regional strategy.
  5. Prevent Scale Distortion

Without explicitly acknowledging conditioning, one might incorrectly interpret taller bars as stronger behavioral preference rather than larger customer base.

library(ggplot2)
library(dplyr)

ggplot(device_continent_counts,
       aes(x = continent,
           y = count,
           fill = device_type)) +
  geom_col(position = position_dodge(width = 0.8), width = 0.7) +
  geom_text(aes(label = count),
            position = position_dodge(width = 0.8),
            vjust = -0.3,
            size = 3) +
  facet_wrap(~ device_type, scales = "free_y") +
  labs(
    title = "Which Continent Leads by Device Type?",
    x = "Continent",
    y = "Count",
    fill = "Device Type"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Interpretation

The stratified sample successfully preserves the continental distribution of the global customer base, ensuring fair representation across regions. This confirms that the dataset remains structurally reliable for regional comparison and managerial decision-making.

Device Usage Patterns

Across all continents, device composition is highly consistent:

  • ~60% Mobile
  • ~30% Desktop
  • ~10% Tablet

This indicates that device preference is not strongly region-dependent. Customer access behavior appears globally standardized, with a clear mobile-first pattern across markets.

Absolute Volume (Counts)

From the count-based visualizations:

  • Asia leads in absolute usage across all device types.
  • North America and Africa follow.
  • Oceania and South America show the lowest total volumes.

However, these differences reflect market size (scale) rather than stronger device preference within those continents.

Business Implications

Behavioral Insight

  • Customers globally are mobile-first.
  • Device mix is structurally similar across continents.

Operational Insight

  • Infrastructure load will be highest in Asia due to larger customer volume.
  • Scaling decisions should prioritize high-volume regions.

Strategic Insight

  • UX optimization should prioritize Mobile globally.
  • Infrastructure and capacity planning should be region-volume driven.

Conclusion

Device behavior is globally uniform, but traffic scale differs by continent.

Product strategy should be mobile-first globally, while operational strategy should be Asia-weighted for scaling and infrastructure planning.


Part C: Numerical Variable Summarization

Business Question

How is customer engagement distributed, and what does this reveal about usage intensity and the presence of high-activity users?

Selected Numerical Variable: session_duration

The numerical variable selected for this analysis is session_duration (measured in minutes).

This variable captures how long customers actively engage with the platform during a session.
It provides insight into engagement depth rather than spending value, which was the primary focus in the lecture slides.

Understanding session duration helps answer:

  • Are most users lightly engaged?
  • Is engagement driven by a small group of heavy users?
  • Are there extreme outliers influencing system load?

Summary Statistics Computed

The following metrics are calculated from the stratified sample collected into R (a Spark-side equivalent is sketched below):

  • Count/Total
  • Minimum
  • Mean
  • Median (computed here with quantile(); percentile_approx() is the approximate Spark-side equivalent)
  • Standard Deviation
  • 90th Percentile
  • 95th Percentile
  • Maximum

These measures provide both central tendency and upper-tail behavior.
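
For reference, the same summary could also be computed inside Spark before collecting, so that only a one-row result is transferred to R. A minimal sketch (it assumes session_duration exists on the Spark table and relies on dbplyr passing percentile_approx() through to Spark SQL):

# Sketch: push the aggregation down to Spark and collect only the result
spark_summary <- sales_table_in_spark |>
  dplyr::summarise(
    count  = n(),
    min    = min(session_duration, na.rm = TRUE),
    mean   = mean(session_duration, na.rm = TRUE),
    median = percentile_approx(session_duration, 0.50),   # Spark SQL approximate quantile
    sd     = sd(session_duration, na.rm = TRUE),           # translated to Spark's stddev_samp
    p90    = percentile_approx(session_duration, 0.90),
    p95    = percentile_approx(session_duration, 0.95),
    max    = max(session_duration, na.rm = TRUE)
  ) |>
  dplyr::collect()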

# Summary stats for one numeric variable from stratified_sample
num_var <- "session_duration"

stopifnot(num_var %in% names(stratified_sample))
stopifnot(is.numeric(stratified_sample[[num_var]]))

stats_tbl <- stratified_sample |>
  dplyr::summarise(
    count  = sum(!is.na(.data[[num_var]])),
    min    = min(.data[[num_var]], na.rm = TRUE),
    mean   = mean(.data[[num_var]], na.rm = TRUE),
    median = stats::quantile(.data[[num_var]], 0.50, na.rm = TRUE, names = FALSE),
    sd     = stats::sd(.data[[num_var]], na.rm = TRUE),
    p90    = stats::quantile(.data[[num_var]], 0.90, na.rm = TRUE, names = FALSE),
    p95    = stats::quantile(.data[[num_var]], 0.95, na.rm = TRUE, names = FALSE),
    max    = max(.data[[num_var]], na.rm = TRUE)
  )

knitr::kable(
  stats_tbl,
  caption = paste0("Summary Statistics for ", num_var, " (computed from Stratified Sample)")
)
Summary Statistics for session_duration (computed from Stratified Sample)

| count| min|     mean| median|       sd|  p90|  p95|   max|
|-----:|---:|--------:|------:|--------:|----:|----:|-----:|
| 49903| 3.2| 24.95533|   22.5| 11.70783| 40.1| 47.1| 151.5|

Shape Analysis: Skewness and Kurtosis

To understand the distribution beyond averages, we assess skewness and kurtosis.

  • Skewness indicates asymmetry.
    Positive values suggest a long right tail (heavy users).
  • Kurtosis measures tail heaviness.
    High values indicate extreme observations or outlier-prone distributions. The moments package reports raw (not excess) kurtosis, for which a normal distribution scores about 3.
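
For reference, these are the standard moment-based definitions (the raw, non-excess forms that the moments package implements):

\[ \text{Skewness} = \frac{\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{3}}{\left(\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{2}\right)^{3/2}}, \qquad \text{Kurtosis} = \frac{\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{4}}{\left(\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{2}\right)^{2}} \]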

The code extracts the numeric variable, removes missing values, and computes shape metrics.

# Numeric variable under analysis (set again so this chunk is self-contained)
num_var <- "session_duration"

x <- stratified_sample[[num_var]]
x <- x[!is.na(x)]

# 1. Skewness & Kurtosis
shape_stats <- tibble(
  "Skewness"  = skewness(x),
  "Kurtosis"  = kurtosis(x)
)

kable(shape_stats, digits = 3,
      caption = paste0("Shape Metrics for ", num_var))
Shape Metrics for session_duration

| Skewness| Kurtosis|
|--------:|--------:|
|     1.51|    7.025|

Extreme Value Assessment

To formally identify extreme values, the Interquartile Range (IQR) method is applied.

The IQR measures the spread of the middle 50% of the data:

\[ \mathrm{IQR} = Q_3 - Q_1 \]

Observations falling outside the interval:

\[ [\,Q_1 - 1.5 \times \mathrm{IQR},\; Q_3 + 1.5 \times \mathrm{IQR}\,] \]

are classified as extreme values.

This method is robust because it does not assume normality and is not distorted by large outliers.

Why This Matters

  • It quantifies how many unusually high or low engagement values exist.
  • It helps determine whether extreme sessions are rare anomalies or structurally present in the data.
  • It provides an objective threshold for identifying high-intensity users.

If the percentage of extreme values is meaningful, this indicates the presence of heavy users who may disproportionately influence averages and system load.

# 2. IQR-based outlier detection
Q1  <- quantile(x, 0.25)
Q3  <- quantile(x, 0.75)
IQR_val <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val

extreme_values <- x[x < lower_bound | x > upper_bound]

outlier_summary <- tibble(
  "Lower Bound" = lower_bound,
  "Upper Bound" = upper_bound,
  "Number of Extreme Values" = length(extreme_values),
  "Percent of Extreme Values" = length(extreme_values) / length(x) * 100
)

kable(outlier_summary, digits = 2,
      caption = paste0("Extreme Value Assessment for ", num_var))
Extreme Value Assessment for session_duration

| Lower Bound| Upper Bound| Number of Extreme Values| Percent of Extreme Values|
|-----------:|-----------:|------------------------:|-------------------------:|
|       -3.85|       50.95|                     1714|                      3.43|
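
As a consistency check, these bounds imply Q1 ≈ 16.7, Q3 ≈ 30.4, and IQR ≈ 13.7 minutes, since 16.7 - 1.5 × 13.7 = -3.85 and 30.4 + 1.5 × 13.7 = 50.95. Because the minimum observed session (3.2 minutes) lies well above the negative lower bound, all 1,714 extreme values sit in the upper tail.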

Visual Confirmation

A histogram complements the statistical diagnostics by visually displaying the distribution shape.

Key visual signals:

  • A long right tail indicates positive skewness.
  • Sparse but distant bars suggest extreme high-value observations.
  • A tightly clustered center with wide tails indicates heavy dispersion.

Visual inspection ensures that numerical diagnostics align with actual distribution structure.

library(ggplot2)

ggplot(stratified_sample, aes(x = .data[[num_var]])) +
  geom_histogram(bins = 40) +
  labs(title = paste("Distribution of", num_var)) +
  theme_minimal()

Interpretation

The analysis above examines session duration, defined as the time between a customer entering the platform and completing an online purchase. Since all records correspond to completed online orders, session duration represents the time-to-purchase.

Distribution and Central Tendency

The summary statistics indicate that:

  • The mean session duration is slightly higher than the median.
  • The distribution is right-skewed.
  • There are observable high-duration outliers.

This suggests that most customers complete their purchases within a relatively consistent time window, but a smaller group takes significantly longer before converting.

Interpretation of the Distribution

The right-skewed pattern has a clear behavioral meaning:

  • The majority of customers move efficiently through the purchase process.
  • A minority of sessions are prolonged, likely due to:
    • Comparison shopping
    • Cart hesitation
    • Payment friction
    • Technical or UX delays

Because the median is lower than the mean, the typical customer completes their purchase faster than what the average alone might suggest. The mean is pulled upward by longer sessions.

Business Implications

1. Checkout Efficiency
A concentrated lower-to-mid session duration range indicates that the purchase funnel is generally effective.

2. Friction Detection
Long-duration outliers signal potential friction points in:

  • Payment processing
  • Product information clarity
  • Navigation complexity

3. Optimization Strategy
Reducing the tail of long sessions can:

  • Increase conversion speed
  • Improve user satisfaction
  • Reduce abandonment risk in near-purchase scenarios

Conclusion

Session duration data shows a healthy core of efficient purchases with a small but meaningful tail of delayed conversions.

The operational focus should not be on shortening already efficient sessions, but on identifying and eliminating the causes of extended purchase times to tighten the distribution and improve overall transaction flow.


Part D: Grouped Numerical Comparison

Business Question

Do key numerical outcomes differ meaningfully across customer groups?

Selected Variables

  • Categorical Grouping Variable: continent
  • Numerical Outcome Variable: session_duration

The objective is to determine whether customer engagement intensity differs across regions.

Grouped Summary Table

For each continent, we report:

  • Mean (average engagement)
  • Standard Deviation (variability)
  • 95th Percentile (P95) (upper-tail intensity)

These metrics allow comparison of:

  • Central tendency
  • Spread
  • Extreme engagement behavior

num_var <- "session_duration"

group_summary <- stratified_sample |>
  group_by(continent) |>
  summarise(
    mean = mean(.data[[num_var]], na.rm = TRUE),
    sd   = sd(.data[[num_var]], na.rm = TRUE),
    p95  = quantile(.data[[num_var]], 0.95, na.rm = TRUE),
    .groups = "drop"
  )

kable(group_summary, digits = 2,
      caption = "Grouped Summary: Session Duration by Continent")
Grouped Summary: Session Duration by Continent

|continent     |  mean|    sd|  p95|
|:-------------|-----:|-----:|----:|
|Africa        | 25.02| 11.69| 46.9|
|Asia          | 24.90| 11.61| 47.0|
|Europe        | 24.67| 11.54| 46.5|
|North America | 25.12| 11.93| 47.7|
|Oceania       | 24.88| 11.77| 47.2|
|South America | 25.13| 11.75| 47.5|
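
For completeness, the same grouped summary could be pushed down to Spark so that only the six-row result is collected. A minimal sketch, under the same assumptions as in Part C (percentile_approx() is passed through to Spark SQL by dbplyr):

# Sketch: grouped aggregation in Spark, collecting only the per-continent summary
group_summary_spark <- sales_table_in_spark |>
  dplyr::group_by(continent) |>
  dplyr::summarise(
    mean = mean(session_duration, na.rm = TRUE),
    sd   = sd(session_duration, na.rm = TRUE),
    p95  = percentile_approx(session_duration, 0.95)   # Spark SQL approximate quantile
  ) |>
  dplyr::collect()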

Visualization: Violin Plot

A violin plot is used because:

  • It shows distribution shape.

  • It highlights density differences.

  • It captures variability and tail thickness.

  • It allows visual comparison of skewness between groups.

ggplot(stratified_sample,
       aes(x = continent,
           y = .data[[num_var]],
           fill = continent)) +
  geom_violin(trim = FALSE, alpha = 0.6) +
  geom_boxplot(width = 0.1, outlier.shape = NA) +
  labs(
    title = "Distribution of Session Duration by Continent",
    x = "Continent",
    y = "Session Duration"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Interpretation

Based on the grouped summary table and violin plot above, session duration is remarkably consistent across continents.

Key Findings

  • Mean session duration ranges narrowly, from approximately 24.7 to 25.1 minutes, across all regions.
  • Standard deviation is also similar (≈11.5–11.9), indicating comparable variability.
  • The 95th percentile (P95) falls between 46.5 and 47.7 minutes for every continent.

The violin plots confirm that:

  • Distribution shapes are nearly identical.
  • All regions exhibit right-skewness.
  • Upper-tail behavior (long sessions) exists consistently across continents.

What Does That Mean?

There is no meaningful regional difference in engagement intensity. Customers across Africa, Asia, Europe, North America, Oceania, and South America demonstrate:

  • Similar average time-to-purchase.
  • Similar variability in browsing behavior.
  • Similar presence of extended (high-duration) sessions.

This indicates that engagement depth is behaviorally uniform at a global level.

Business Implication

Regional strategy does not need to differentiate based on engagement intensity.

  • Optimization efforts should focus on global checkout efficiency, not continent-specific redesigns.
  • Since upper-tail behavior (long sessions) is structurally present in all regions, improvements targeting purchase friction will likely yield benefits globally rather than regionally.

Part E: Visualization Critique

(Based on Part D: Violin Plot of Session Duration by Continent)

Selected Visualization: Violin Plot (with Embedded Boxplot)

1. Why is this visualization appropriate for the data type?

Session duration is a continuous numerical variable, and the comparison is across categorical groups (continents).

A violin plot is appropriate because it:

  • Displays the full distribution shape of session duration.
  • Shows density patterns, not just averages.
  • Highlights variability and tail thickness.
  • Allows direct visual comparison of skewness and spread across regions.

Since Part D focuses on engagement intensity and upper-tail behavior, a distribution-based chart is the correct analytical choice.

2. What would be misleading if a different chart were used?

If a simple bar chart of means were used:

  • It would hide the right-skewed nature of the data.
  • It would conceal long-session outliers.
  • It would suggest differences (or similarity) based only on averages.
  • It would ignore variability and upper-tail engagement.

Because session duration contains heavy tails and extreme values, reducing the data to means alone would oversimplify engagement behavior and potentially distort interpretation.

3. How does this visualization support managerial decision making?

The violin plot clearly shows that:

  • Distribution shapes are nearly identical across continents.
  • Variability and tail behavior are structurally similar.
  • There is no meaningful regional divergence in engagement intensity.

This supports a critical managerial conclusion:

  • Engagement optimization should be global, not region-specific.
  • Friction reduction strategies will likely benefit all markets equally.
  • Regional differences in performance are unlikely driven by session behavior.

In short, the visualization prevents over-segmentation and supports unified, globally scalable optimization decisions.


Footnotes

Analysis Prepared By: Rubin Anyeema Apore
For: BDAT 603 – Descriptive Analytics
Dataset: Simulated Global Online Sales (Spark-Based Stratified Sample)
Scope: Parts A–E (Sampling, Categorical Analysis, Numerical Analysis, Grouped Comparison, Visualization Critique)
Date: 2026-02-12