This analysis uses the pre-processed dataset user_features.csv, which contains aggregated user-level data.
Goal:
To gain insight into the types of customers shopping on Instacart, based on the percentage of item types (aisles) purchased over time.
Dimensions of the Data:
Each row represents a unique customer (user_id)
206,209 unique customers
Columns represent the proportion of items purchased from different product categories (aisles/variables)
135 different aisles
Principal Component Analysis
Because of the high number of variables in the dataset, I applied Principal Component Analysis (PCA) to reduce dimensionality. This allowed me to identify the key principal components and better understand the underlying structure of the data.
Based on the scree plot, four principal components appear to be a reasonable choice.
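As a rough illustration of how the scree-plot values can be obtained, here is a minimal sketch using scikit-learn. It runs on synthetic data rather than user_features.csv, so the array `X`, its shape, and the random seed are assumptions made only for the example. Whether the original analysis scaled the features before PCA is not stated, so no scaling is applied here.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for user_features.csv (an assumption for illustration):
# each row is a user, each column an aisle, and each row sums to 1.
rng = np.random.default_rng(0)
X = rng.dirichlet(alpha=np.ones(20), size=500)

# Fit PCA with all components to see how the variance is distributed.
pca_full = PCA().fit(X)
explained = pca_full.explained_variance_ratio_
cumulative = np.cumsum(explained)

# These are the values a scree plot draws: look for the "elbow" where
# extra components stop adding much explained variance.
for i, (ratio, total) in enumerate(zip(explained[:6], cumulative[:6]), start=1):
    print(f"PC{i}: {ratio:.3f} (cumulative {total:.3f})")

# Keep the chosen number of components (four in this write-up).
pca = PCA(n_components=4)
scores = pca.fit_transform(X)
print(scores.shape)
```

On the real dataset, `X` would be the aisle-proportion columns of user_features.csv, and the elbow would be read from a plot of `explained` against the component number.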
Plotting data in 2 dimensions using PCA
This gives a first look at the data, though it does not offer much insight yet.
Principal Component Breakdowns
Top Ten Features (Loadings) for each principal component
The following plots display the eigenvectors (loadings) of the original variables projected into 2D principal component space. Only features with an absolute loading of at least 0.15 on either axis are shown, highlighting the most influential variables in each component pairing.
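The filtering described above can be sketched as follows. This is an illustrative version, not the original notebook code: the helper names `influential` and `top_features`, the aisle names, and the synthetic data are all hypothetical. It treats the rows of scikit-learn's `components_` (unit-length eigenvectors) as the loadings, matching the wording used here.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the aisle-proportion table (assumption).
rng = np.random.default_rng(0)
aisles = [f"aisle_{i}" for i in range(20)]  # hypothetical aisle names
X = rng.dirichlet(alpha=np.ones(len(aisles)), size=500)

pca = PCA(n_components=4).fit(X)
loadings = pca.components_  # shape: (n_components, n_features)

def top_features(loadings, names, pc, n=10):
    """Return the n features with the largest absolute loading on one PC."""
    order = np.argsort(-np.abs(loadings[pc]))[:n]
    return [(names[i], loadings[pc, i]) for i in order]

def influential(loadings, names, pc_x, pc_y, threshold=0.15):
    """Return features whose absolute loading meets the threshold on either axis."""
    keep = (np.abs(loadings[pc_x]) >= threshold) | (np.abs(loadings[pc_y]) >= threshold)
    return [(names[i], loadings[pc_x, i], loadings[pc_y, i]) for i in np.flatnonzero(keep)]

# Top ten features for the first component (index 0).
for name, value in top_features(loadings, aisles, pc=0):
    print(f"{name}: {value:+.3f}")

# Features that would be drawn as arrows in the PC1 vs PC2 plot.
pairs = influential(loadings, aisles, pc_x=0, pc_y=1)
print(len(pairs), "features pass the 0.15 threshold")
```

Each tuple returned by `influential` holds the x and y coordinates of one arrow in the corresponding 2D loading plot.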
I condensed the variables into four principal components (PCs), using a scree plot to determine the number of components that captured the most variance in the data.
Within the PCs, I extracted meaningful loadings to understand the contribution of the original features.
I plotted the 3D PCA-transformed data to better differentiate between clusters.
To segment the users, I applied K-Means clustering and used a WSS (within-cluster sum of squares) elbow plot to identify the optimal number of clusters (K).
I calculated the average aisle proportions within each cluster to gain insight into customer behavior.
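The clustering steps above can be sketched roughly as below. Several details are assumptions: the data is synthetic, K-Means is run on the four PCA scores (the write-up does not say whether clustering used the scores or the raw proportions), and the value of `k` is a placeholder rather than the K actually chosen from the elbow plot.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for the aisle-proportion table (assumption).
rng = np.random.default_rng(0)
X = rng.dirichlet(alpha=np.ones(20), size=500)
scores = PCA(n_components=4).fit_transform(X)

# Within-cluster sum of squares (WSS) for a range of K.
# scikit-learn exposes this as the fitted model's inertia_.
ks = list(range(2, 9))
wss = []
for k_try in ks:
    model = KMeans(n_clusters=k_try, n_init=10, random_state=0).fit(scores)
    wss.append(model.inertia_)

for k_try, value in zip(ks, wss):
    print(f"K={k_try}: WSS={value:.4f}")

# Placeholder K; in practice this is read off the elbow in the WSS plot.
k = 4
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scores)

# Average aisle proportions per cluster, computed on the original features
# so that each row can be read as a typical basket for that segment.
profile = np.vstack([X[labels == c].mean(axis=0) for c in range(k)])
print(profile.shape)
```

Sorting each row of `profile` in descending order shows which aisles dominate a given cluster, which is the basis for naming the segments.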
Improvements / Further work
This analysis could be improved by incorporating purchase frequency or purchasing power to add an economic layer to the segmentation.
Another potential improvement is to group aisles by department, which would simplify and clarify the resulting customer segments.
I also attempted to implement DBSCAN, though I’m not confident I executed it correctly — this is an area I’d like to explore further.
Finally, assigning more meaningful labels to the clusters would enhance interpretability. For example, Cluster 8 might represent “Adult Party Shoppers,” while Cluster 3 could reflect “Home Cleaners and Organizers.”
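For the DBSCAN idea mentioned above, one common starting point is to pick `eps` from a sorted k-distance curve. The sketch below is only a starting point under stated assumptions: synthetic data, a hypothetical `min_samples` of 8, and a 90th-percentile cut used as a crude substitute for reading the knee off the k-distance plot. On the full dataset, working on a subsample first is probably wise, since neighbor searches over all users can be slow.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the aisle-proportion table (assumption).
rng = np.random.default_rng(0)
X = rng.dirichlet(alpha=np.ones(20), size=500)
scores = PCA(n_components=4).fit_transform(X)

min_samples = 8  # hypothetical value, chosen only for illustration

# Distance from each point to its min_samples-th nearest neighbor,
# counting the point itself, which matches how DBSCAN counts neighbors.
nn = NearestNeighbors(n_neighbors=min_samples).fit(scores)
distances, _ = nn.kneighbors(scores)
k_dist = np.sort(distances[:, -1])

# Crude stand-in for picking the knee of the sorted k-distance curve.
eps = float(np.quantile(k_dist, 0.90))

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(scores)

# DBSCAN marks noise points with the label -1.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"eps={eps:.4f}, clusters={n_clusters}, noise points={n_noise}")
```

If nearly every point falls into a single cluster or is marked as noise, that usually means `eps` needs adjusting, or that the data lacks the density structure DBSCAN looks for.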