2025-04-27

Story 6: Instacart Customer Segmentation

The dataset consists of several files describing customer purchases made at Instacart, an online grocery delivery service, during a 365-day period prior to 2020. The goal of this assignment is to perform a customer segmentation analysis to understand the different types of behavior exhibited by Instacart customers. The analysis must use dimensionality reduction, cluster analysis, and any other suitable tools to find and visualize customer segments at Instacart.

The data is a partially processed dataset that Instacart posted to Kaggle for a prediction competition. Here it is used for a different purpose.

Note: because of PC memory limits, this analysis uses a subsample of 5,000 users.

Datasets description

  1. user_features.csv (pre-processed): Contains user-level features derived from the original Instacart data.
     - user_id: Unique identifier for each customer.
     - Food category counts (columns 2-135): Number of items ordered by each user across various food categories (Instacart “aisles”) throughout the year. Note: This is a total count and doesn’t reflect quantities per order.
     - Day-of-week order counts (columns 136-142): Number of orders placed by each user on each specific day of the week.
  2. Official Instacart data (original):
     - aisles.csv: Maps aisle_id to the name of the food category (aisle).
     - departments.csv: Maps department_id to a broader product category (department). Departments contain aisles.
     - products.csv: Contains details about each product, including its product_name, aisle_id, and department_id.
     - orders.csv: Provides high-level information about each order, such as user_id, order_id, order number for the user, day and hour of the order, and days since the previous order.
     - all_order_products.csv: Contains item-level information for each order, listing all product_ids included in each order_id and the order in which they were added.

Preparing data

The analysis loads the R packages data.table, tidyverse (2.0.0), factoextra, cluster, patchwork, ggthemes, pheatmap, and tidytext.
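
The report's own preparation code is not shown, so the following is only a minimal sketch of how the 5,000-user subsample and the feature scaling might be done. The function name `prepare_features`, the seed, the removal of constant columns, the z-score scaling, and the toy column names are all assumptions for illustration. Only the `user_id` column name comes from the dataset description.

```r
# Sketch: subsample users and standardise the count features.
# Assumptions (not from the report): the seed, dropping constant columns,
# z-score scaling via scale(), and a plain data.frame as input.
prepare_features <- function(user_features, n_users = 5000, seed = 42) {
  set.seed(seed)
  n_keep <- min(n_users, nrow(user_features))
  sampled <- user_features[sample(nrow(user_features), n_keep), , drop = FALSE]

  feature_cols <- setdiff(names(sampled), "user_id")
  counts <- as.matrix(sampled[, feature_cols, drop = FALSE])

  # Constant columns have zero variance and would become NaN when scaled.
  keep <- apply(counts, 2, function(col) stats::sd(col) > 0)
  scaled <- scale(counts[, keep, drop = FALSE])
  rownames(scaled) <- sampled$user_id

  list(user_id = sampled$user_id, scaled = scaled)
}

# Toy stand-in for user_features.csv (hypothetical column names).
set.seed(1)
toy <- data.frame(
  user_id = 1:200,
  fresh_fruits = rpois(200, 5),
  yogurt = rpois(200, 2),
  empty_aisle = 0L
)

prep <- prepare_features(toy, n_users = 50)
dim(prep$scaled)
```

With the real file, `user_features` would be read from user_features.csv first. Scaling matters here because k-means and PCA are both sensitive to the magnitude of each column.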

Determining Number of Clusters - Elbow Method
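
As a rough illustration of the elbow method, the sketch below computes the total within-cluster sum of squares (WSS) for a range of k using base R's `kmeans`. The report loads factoextra, which offers a helper for this, but the code actually used is not shown. The function name `elbow_wss`, the seed, `nstart`, `iter.max`, and the toy data are assumptions.

```r
# Sketch: total within-cluster sum of squares for k = 1..k_max.
# Assumptions: seed, nstart and iter.max are illustrative choices.
elbow_wss <- function(x, k_max = 10, seed = 42) {
  vapply(seq_len(k_max), function(k) {
    set.seed(seed)
    kmeans(x, centers = k, nstart = 10, iter.max = 50)$tot.withinss
  }, numeric(1))
}

# Toy data with three well-separated groups (not the Instacart data).
set.seed(1)
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

wss <- elbow_wss(toy, k_max = 6)
```

Plotting `wss` against k, for example with `plot(seq_along(wss), wss, type = "b")`, gives the elbow curve. The k at which the curve flattens is a candidate for the number of clusters.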

KMeans Clustering
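
The conclusions report four segments, so a k-means fit with k = 4 might look like the sketch below. It runs on toy data, and the seed, `nstart`, and `iter.max` values are assumptions rather than the settings used in the report.

```r
# Sketch: k-means with k = 4 on standardised features.
# Assumptions: toy data, seeds, nstart and iter.max are illustrative.
set.seed(1)
toy <- scale(rbind(
  matrix(rnorm(200, mean = 0), ncol = 4),
  matrix(rnorm(200, mean = 4), ncol = 4),
  matrix(rnorm(200, mean = 8), ncol = 4),
  matrix(rnorm(200, mean = 12), ncol = 4)
))

set.seed(42)
km <- kmeans(toy, centers = 4, nstart = 25, iter.max = 50)

# Cluster sizes and the share of total variance that lies between clusters.
cluster_sizes <- table(km$cluster)
between_share <- km$betweenss / km$totss
```

Using several random starts (`nstart`) reduces the chance that the result reflects a poor local optimum.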

PCA Visualization with Clusters
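
The following sketch shows one way to project users onto the first two principal components and color them by cluster, using base R's `prcomp`. Whether the report clustered the original features or the PCA scores is not stated, so clustering the scaled features here is an assumption, as are the toy data, the column names, and the helper `plot_pca_clusters`.

```r
# Sketch: PCA on standardised features, with variance explained per component.
# Assumptions: toy Poisson counts stand in for the aisle counts.
set.seed(1)
toy <- matrix(rpois(300 * 6, lambda = 3), ncol = 6)
colnames(toy) <- paste0("aisle_", 1:6)  # hypothetical feature names

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_two <- sum(var_explained[1:2])

set.seed(42)
km <- kmeans(scale(toy), centers = 4, nstart = 25, iter.max = 50)

# Scatter plot of PC1 vs PC2, coloured by cluster label.
plot_pca_clusters <- function(pca, cluster) {
  ve <- 100 * pca$sdev^2 / sum(pca$sdev^2)
  plot(pca$x[, 1], pca$x[, 2],
       col = cluster, pch = 19,
       xlab = sprintf("PC1 (%.1f%%)", ve[1]),
       ylab = sprintf("PC2 (%.1f%%)", ve[2]))
}
```

Calling `plot_pca_clusters(pca, km$cluster)` draws the plot. `cum_two` is the quantity that corresponds to the "first two PCs" figure quoted in the conclusions.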

Top products Analysis

Clustered Heatmap of Top Products

Department Preferences by Cluster

Cluster Behavior Profiles Z-scores
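
A z-score profile can be read as "how far each cluster's average sits from the overall average, in standard deviations". The sketch below shows one way to compute it. The function name `cluster_profiles` and the behavioral feature names are hypothetical, and the report's actual feature set is not shown.

```r
# Sketch: mean z-score of each feature within each cluster.
# Assumptions: feature names and cluster labels below are toy values.
cluster_profiles <- function(features, cluster) {
  z <- scale(as.matrix(features))      # z-score every feature across all users
  sums <- rowsum(z, group = cluster)   # per-cluster sums of z-scores
  sizes <- table(cluster)
  sums / as.vector(sizes[rownames(sums)])  # per-cluster means
}

set.seed(1)
toy <- data.frame(
  total_orders = rpois(120, 10),
  weekend_share = runif(120),
  basket_size = rpois(120, 8)
)
cluster <- rep(1:4, each = 30)

profiles <- cluster_profiles(toy, cluster)
round(profiles, 2)
```

Positive values mark features where a cluster is above the overall average, and negative values mark features where it is below.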

Enhanced Top Products Heatmap per Cluster

Product Sales Patterns by Day of Week

Approximate Revenue Analysis by Cluster

Conclusions

  1. Four Distinct Segments: PCA and k-means identified frequent shoppers, organic lovers, weekend buyers, and occasional users.
  2. Top Products: Organic staples (bananas, strawberries, spinach) dominate across clusters.
  3. Department Patterns: Produce and dairy are preferred by most clusters; beverage purchases vary by group.
  4. Time Insights: Clusters differ in shopping hours and days; for example, Cluster 3 shops in the late morning.
  5. PCA Effectiveness: The first two PCs explain ~17.7% of the variance. They capture key behavioral dimensions, but the 2-D view leaves most of the variance unexplained.
  6. Actionable Insight: Target promotions by cluster.
  7. Limitation: Sample size (5k users) may miss niche behaviors.
  8. Revenue: The revenue plot indicates that Clusters 2 and 4 generate the most revenue, while Clusters 1 and 3 contribute less.