Introduction

INSTACART ONLINE GROCERY ANALYSIS

Instacart is an American company that operates as a sometimes-same-day grocery delivery service. Customers select groceries through a web application from various retailers and the order is delivered by a personal shopper.

Instacart’s data science team plays a big part in providing this delightful shopping experience. Currently the data scientists use transactional data to develop models that predict which products a user will buy again, try for the first time, or add to their cart next during a session.

This information comes from an open dataset released by Instacart, “The Instacart Online Grocery Shopping”. The dataset contains a sample of over 3 million grocery orders from more than 200,000 Instacart users.

Project Focus

The focus of this study is to build a model that predicts the probability that an Instacart customer will reorder. To do that, we have to understand customers’ behavior patterns. We will use data from previous purchases to learn when most purchases are made on Instacart, and order-product data to learn which products sell best. Most of this analysis is done through data visualization and market basket analysis.

By Franklin Ajisogun

Data Processing

Number of Customers

206,209

Number of Products

1,384,617

Reorder Products

Number of Departments

21

Number of Aisles

134

Columns

Order Data

Product data

product_id  product_name                                                       aisle_id  department_id
1           Chocolate Sandwich Cookies                                         61        19
2           All-Seasons Salt                                                   104       13
3           Robust Golden Unsweetened Oolong Tea                               94        7
4           Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce  38        1
5           Green Chile Anytime Sauce                                          5         13
6           Dry Nose Oil                                                       11        11
7           Pure Coconut Water With Orange                                     98        7
8           Cut Russet Potatoes Steam N’ Mash                                  116       1
9           Light Strawberry Blueberry Yogurt                                  120       16
10          Sparkling Orange Juice & Prickly Pear Beverage                     115       7
11          Peach Mango Juice                                                  31        7
12          Chocolate Fudge Layer Cake                                         119       1
13          Saline Nasal Mist                                                  11        11
14          Fresh Scent Dishwasher Cleaner                                     74        17
15          Overnight Diapers Size 6                                           56        18
16          Mint Chocolate Flavored Syrup                                      103       19
17          Rendered Duck Fat                                                  35        12
18          Pizza for One Suprema Frozen Pizza                                 79        1
19          Gluten Free Quinoa Three Cheese & Mushroom Blend                   63        9
20          Pomegranate Cranberry & Aloe Vera Enrich Drink                     98        7

Order Product Train/test data

order_id product_id add_to_cart_order reordered
1 49302 1 1
1 11109 2 1
1 10246 3 0
1 49683 4 0
1 43633 5 1
1 13176 6 0
1 47209 7 0
1 22035 8 1
36 39612 1 0
36 19660 2 1
36 49235 3 0
36 43086 4 1
36 46620 5 1
36 34497 6 1
36 48679 7 1
36 46979 8 1
38 11913 1 0
38 18159 2 0
38 4461 3 0
38 21616 4 1

Department data

department_id department
1 frozen
2 other
3 bakery
4 produce
5 alcohol
6 international
7 beverages
8 pets
9 dry goods pasta
10 bulk
11 personal care
12 meat seafood
13 pantry
14 breakfast
15 canned goods
16 dairy eggs
17 household
18 babies
19 snacks
20 deli

Aisle data

aisle_id aisle
1 prepared soups salads
2 specialty cheeses
3 energy granola bars
4 instant foods
5 marinades meat preparation
6 other
7 packaged meat
8 bakery desserts
9 pasta sauce
10 kitchen supplies
11 cold flu allergy
12 fresh pasta
13 prepared meals
14 tofu meat alternatives
15 packaged seafood
16 fresh herbs
17 baking ingredients
18 bulk dried fruits vegetables
19 oils vinegars
20 oral hygiene
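
The tables above link to each other through product_id, aisle_id and department_id. As a minimal sketch of that relationship, the snippet below attaches a department name to a few product rows copied from the samples above. The function name with_department is hypothetical and is not taken from the original analysis, which does not show its own data-preparation code.

```python
# Sample rows copied from the product and department tables above.
# Tuple layout: (product_id, product_name, aisle_id, department_id)
products = [
    (1, "Chocolate Sandwich Cookies", 61, 19),
    (2, "All-Seasons Salt", 104, 13),
    (3, "Robust Golden Unsweetened Oolong Tea", 94, 7),
    (9, "Light Strawberry Blueberry Yogurt", 120, 16),
]
departments = {7: "beverages", 13: "pantry", 16: "dairy eggs", 19: "snacks"}


def with_department(product_rows, department_names):
    """Attach the department name to each product via department_id."""
    joined = []
    for product_id, name, aisle_id, department_id in product_rows:
        joined.append({
            "product_id": product_id,
            "product_name": name,
            "aisle_id": aisle_id,
            # Fall back to "unknown" if a department_id has no entry.
            "department": department_names.get(department_id, "unknown"),
        })
    return joined


catalog = with_department(products, departments)
for row in catalog:
    print(row["product_id"], row["product_name"], "->", row["department"])
```

The same lookup pattern applies to aisle_id and the aisle table.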

Exploratory Data Analysis

Case Study Outline

Introduction

  • Main Goal: The objective of this case study is to predict the probability that an Instacart customer will reorder

  • Sub-Goal: Before we can suggest which users would reorder the item again, we have to explore how the users interact with the items

Goal

  • We will look at when customers usually order, i.e., hours and days
  • We will look at how many days it takes users to reorder a product
  • We will ask: What are the best-selling products or the most reordered products?
  • We will ask: When do people order?

Some key observations

  • Most orders are placed between 8:00 and 18:00, and the most common order days are Sunday and Monday
  • Many customers reorder about once a month
  • The best-selling products are either fruit or organic products

Problem 1: Case Study

The first case study

  • We will look at when customers usually order, i.e., hours and days

  • We will look at how many days it would take the users to reorder the product

  • We will look at how many items customers bought

What hours of the day do customers order?


Observation

  • Based on the Daily Order plot, we can see that the most orders are made from 8:00 - 18:00
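
A plot like this comes down to counting orders per hour. The sketch below shows one way to do that count and to measure the share of orders in the 8:00 - 18:00 window. It assumes each order carries an hour-of-day value from 0 to 23; the sample values are made up for illustration and are not Instacart data.

```python
from collections import Counter

# Hypothetical hour-of-day values (0-23), one per order. Not real Instacart data.
order_hours = [9, 10, 10, 14, 15, 15, 15, 18, 21, 7]

# Number of orders placed in each hour.
orders_per_hour = Counter(order_hours)

# Share of orders placed in the 8:00 - 18:00 window discussed above.
daytime_orders = sum(n for hour, n in orders_per_hour.items() if 8 <= hour <= 18)
daytime_share = daytime_orders / len(order_hours)

for hour in sorted(orders_per_hour):
    print(f"{hour:02d}:00  {orders_per_hour[hour]}")
print(f"share between 8:00 and 18:00: {daytime_share:.2f}")
```

The same counting approach works for the day-of-week chart by swapping the hour value for a weekday value.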

Next case study

  • We will look at how orders are distributed across the days of the week

What days do customers order the most?


Observation

  • Based on the Weekly Order Count, we can see that Sunday has the most orders of the week. Monday is the second-highest order day.

Next case study

  • We will look at the days between customers’ orders

When do customers order again?


Observation

Based on the Day Since Last Order count, we can see that:

  • The largest group of customers reorders once a month, probably using “order renew” (automatic)
  • Once a week is the second most common reorder interval
  • Many customers who order once a month or once a week may be businesses that want to restock their products
  • Apart from the monthly peak, most customers reorder within seven days

Next case study

  • We will look at how many items customers buy.

How many items do customers buy?


Observation

Based on the Total Number of Items chart, we can see that:

  • Most customers are buying seven items
  • Very few are ordering more than 40 items
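
The number of items in an order can be derived from the order-product table by counting the rows that share an order_id. A minimal sketch, using the two complete orders from the sample shown earlier (order 38 is left out because the sample cuts it off):

```python
from collections import Counter

# (order_id, product_id, add_to_cart_order, reordered) rows from the sample above.
order_products = [
    (1, 49302, 1, 1), (1, 11109, 2, 1), (1, 10246, 3, 0), (1, 49683, 4, 0),
    (1, 43633, 5, 1), (1, 13176, 6, 0), (1, 47209, 7, 0), (1, 22035, 8, 1),
    (36, 39612, 1, 0), (36, 19660, 2, 1), (36, 49235, 3, 0), (36, 43086, 4, 1),
    (36, 46620, 5, 1), (36, 34497, 6, 1), (36, 48679, 7, 1), (36, 46979, 8, 1),
]

# Items per order: one row per product, so count rows per order_id.
items_per_order = Counter(row[0] for row in order_products)

# Orders per basket size: this is the quantity a "number of items" chart plots.
orders_per_size = Counter(items_per_order.values())

print(dict(items_per_order))
print(dict(orders_per_size))
```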

Problem 2: Case Study

The second case study

  • We will look at what the best-selling products are

  • We will look at how many products are being reordered

  • We will look at what products are being reordered

  • We will look at the percentage of reordered vs. non-reordered items for the best-selling products

What are the best-selling products?


Observation

Based on the best-selling products chart, we can see that:

  • Bananas are the most-sold items
  • The top 20 products are either organic or fruit
  • Another interesting thing to note is the large number of organic items in the top 20 popular products. We can see that a substantial segment of the customers are health conscious through their choice of organic products.

New Discovery

  • We observed that the majority of the top 10 products are organic. This will help while building the probability model

  • Let’s look at the reordered products and see if those products are the same as the best-selling products

Next case study

  • We will look at the percentage of reordered vs. non-reordered items for the best-selling products

Table Chart

Count Product
18726 Banana
15480 Bag of Organic Bananas
9784 Organic Baby Spinach
8135 Large Lemon
7409 Organic Avocado
7293 Organic Hass Avocado
6494 Strawberries
6033 Limes
5546 Organic Raspberries

What is the percentage of reordered vs. non-reordered items for the best-selling products?


Observation

  • Based on the Items Ordered Percentage chart, we can see that the top 20 selling products have a high probability of being reordered.

  • Bananas have a very high probability of being reordered, followed by organic bananas at 86%.

  • Some of the ordered products are fruits and some are cooking ingredients such as garlic and onion.

Next case study

  • We will look at how many products are being reordered

How many products are reordered?


Observation

  • Based on the Reordered Items chart, we can see that the majority of the products are being reordered
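
The share of reordered items is the mean of the reordered flag in the order-product table. As a small worked example, the snippet below computes it for the 20 sample rows shown earlier, both overall and per order. The helper name reorder_share is hypothetical.

```python
# (order_id, reordered) pairs taken from the order-product sample shown earlier.
rows = [
    (1, 1), (1, 1), (1, 0), (1, 0), (1, 1), (1, 0), (1, 0), (1, 1),
    (36, 0), (36, 1), (36, 0), (36, 1), (36, 1), (36, 1), (36, 1), (36, 1),
    (38, 0), (38, 0), (38, 0), (38, 1),
]


def reorder_share(pairs):
    """Fraction of rows whose reordered flag is 1."""
    flags = [reordered for _, reordered in pairs]
    return sum(flags) / len(flags)


overall = reorder_share(rows)

# Same statistic, computed separately for each order_id.
order_ids = sorted({order_id for order_id, _ in rows})
by_order = {
    order_id: reorder_share([p for p in rows if p[0] == order_id])
    for order_id in order_ids
}

print(f"overall: {overall:.2f}")
for order_id, share in by_order.items():
    print(f"order {order_id}: {share:.2f}")
```

On these 20 rows the overall share is 0.55. This is only the sample, not the figure for the full dataset.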

Hypothesis

  • It will be interesting if the majority of the reordered products are organic and fruit

Next case study

  • We will look at the most reordered products

Which products do customers reorder the most?


Discovery

  • To our surprise, a lot of the top products are organic

  • Bananas and organic bananas have a high probability of being reordered. However, they are not in the top 3 reordered products

  • Interestingly, many of the top reordered products are milk, and only one is a fruit (Banana)

  • It makes sense that the most reordered product is milk. Milk is a product for daily consumption.

  • I can now say that organic vs. non-organic will be a good feature, and likely one of the more important ones, for predicting the probability that a customer will reorder.

Hypothesis

  • On closer inspection, we noticed that many of the items are liquids. Perhaps many customers don’t want to carry heavy items. To test this hypothesis, we have to look at how many milk items are delivered to each customer.

  • Most of the reordered products are either organic products or fresh products.

Table Chart

Proportion Reordered  Product Name
0.9347826             2% Lactose Free Milk
0.9130435             Organic Low Fat Milk
0.8983051             100% Florida Orange Juice
0.8888889             Organic Spelt Tortillas
0.8888889             Original Sparkling Seltzer Water Cans
0.8841717             Banana
0.8833333             Petit Suisse Fruit
0.8819876             Organic Lowfat 1% Milk
0.8810409             Organic Lactose Free 1% Lowfat Milk
0.8785249             1% Lowfat Milk

Problem 3: Case Study

The third case study

Based on our discovery:

  • We will look at how many items are organic or non-organic

  • We will look at how many reordered products are organic or non-organic

How many items are organic and non-organic?


Observation

Based on the organic and non-organic chart, we can see that:

  • The majority of the reordered products are non-organic products
  • Less than half of the reordered products are organic products
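
The source does not say how products were labelled organic or non-organic. One simple possibility, shown here purely as an assumption, is to flag a product as organic when its name contains the word "organic". The sketch applies that rule to the best-selling products listed earlier.

```python
def is_organic(product_name: str) -> bool:
    """Assumed rule: a product is organic if its name contains 'organic'."""
    return "organic" in product_name.lower()


# Best-selling products from the table shown earlier.
best_sellers = [
    "Banana",
    "Bag of Organic Bananas",
    "Organic Baby Spinach",
    "Large Lemon",
    "Organic Avocado",
    "Organic Hass Avocado",
    "Strawberries",
    "Limes",
    "Organic Raspberries",
]

organic = [name for name in best_sellers if is_organic(name)]
organic_share = len(organic) / len(best_sellers)

print(f"{len(organic)} of {len(best_sellers)} best sellers are flagged organic")
```

A name-based rule is crude: it would miss organic products whose names do not say so. It is still a cheap feature to derive.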

Next case study

  • We will look at how many reordered items are organic vs. non-organic

How many reordered products are organic or non-organic?


Observation

Based on the Reordered items: organic vs. non-organic chart, we can see that:

  • Customers are more likely to reorder organic than non-organic products

Hypothesis

  • This supports our previous hypothesis about organic vs. non-organic: the organic vs. non-organic feature will be important for the probability model.

Problem 4: Case Study

The fourth case study

Based on our discovery:

  • We will look at the products in the departments and aisles

  • We will look at the most ordered products in each department and aisle

What are the products in the departments and aisles?


Note: The size of each box is proportional to the number of items in the aisle

Observation

Based on the department and aisle chart, we can see that:

  • Personal care and snacks have the most products in their departments.

Next case study

  • We will look at the most sold products in the departments and aisles

What are the most sold products in the departments and aisles?


Observation

Based on the most sold products in the departments and aisles chart:

  • We noticed that the produce department has the most sold items and the second highest is the dairy/eggs department

  • The best-selling department and aisle will help us predict whether a customer reorders

Problem 5: Case Study

The fifth case study

Based on our discovery:

  • We will look at the customer who has the highest probability of reordering.

  • We will look at what product(s) he/she usually orders.

  • We will look at the days since reordering of the product(s) and how many products he/she ordered

Reordering Customers (Top 20)

Customers who have the highest probability of reordering (Top 20)

User ID Reorder Count Reorder Percentage
99753 97 1.0000000
26489 95 0.9793814
17997 94 0.9690722
100935 94 0.9690722
69919 91 0.9381443
145481 91 0.9381443
140753 90 0.9278351
175680 90 0.9278351
34340 89 0.9175258
39993 89 0.9175258
84095 89 0.9175258
104576 89 0.9175258
125378 89 0.9175258
127577 89 0.9270833
82414 88 0.9777778
92366 88 0.9072165
111982 88 0.9072165
80422 86 0.9450549
185406 86 0.8865979
34396 85 0.9340659

Observation

Based on the Reordered Customers:

  • We noticed that the customer with ID #99753 has the highest probability of reordering

  • He/she has reordered 97 times

Next case study

  • We will look at which product(s) customer #99753 usually orders and how quickly he/she reorders them

What product(s) does User #99753 usually order?

Customer #99753: products and days since the previous order

Product Name Order Number Order day Day(s) Since Order
Organic Whole Milk 45 Tues 2
Organic Reduced Fat Milk 45 Tues 2
Organic Whole Milk 65 Wed 2
Organic Reduced Fat Milk 65 Wed 2
Organic Whole Milk 36 Sun 5
Organic Reduced Fat Milk 36 Sun 5
Organic Whole Milk 84 Tues 2
Organic Reduced Fat Milk 84 Tues 2
Organic Whole Milk 5 Sun 4
Organic Reduced Fat Milk 5 Sun 4
Organic Whole Milk 56 Mon 4
Organic Reduced Fat Milk 56 Mon 4
Organic Reduced Fat Milk 61 Mon 4
Organic Whole Milk 61 Mon 4
Organic Reduced Fat Milk 33 Wed 3
Organic Whole Milk 33 Wed 3
Organic Reduced Fat Milk 48 Thurs 3
Organic Whole Milk 48 Thurs 3
Organic Reduced Fat Milk 28 Sun 3
Organic Reduced Fat Milk 77 Sun 6

Observation

Based on Customer #99753, we can see that:

  • The customer with id #99753 tends to reorder a lot of organic milk

  • Based on the Order Number, this customer might be a coffee shop business. 84 gallons of organic whole milk is a lot of milk to finish in two days

Market Basket Analysis

Introduction to Market Basket Analysis

Market Basket Analysis (MBA) & the APRIORI Algorithm help companies to understand how customers interact with their products. MBA helps to boost company profits and helps with the layout of the company’s store by predicting what items customers buy together.

Market Basket Analysis is used to find associations between items in a set, or frequent patterns in a transaction database.

We will use MBA to understand how Instacart’s customers shop. This may give us insight into the probability of an item being reordered.

Learn: customers who purchase fruit or organic products also tend to purchase bananas or organic bananas. These rules show high confidence and lift. This helps us understand customers’ transactions when they order products, and it also helps with the probability model.

Prediction Model:

Prediction model: Logistic Regression

A logistic regression model was used to predict the probability that a customer will reorder. The model was built as part of an R Shiny app.

The features we will use are:

  • Add to cart order: This tells us the position at which the product was added to the cart

  • Order Number: This tells us the number of items of the same product

  • Organic vs. Non-Organic: This tells us if the product is organic or non-organic

The prediction dashboard is user-friendly and straightforward. The user can get a prediction by:

  • Entering the position of the product in the cart
  • Entering how many items you want to order
  • Selecting if the product is organic or non-organic
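
To illustrate how a logistic regression turns these three inputs into a probability, here is a minimal sketch. The intercept and coefficients are hypothetical placeholders chosen only for illustration: the fitted values from the R Shiny app are not given in this report, and the function name reorder_probability is likewise an assumption.

```python
import math

# Hypothetical values for illustration only. The fitted coefficients of the
# original model are not reported in the source.
INTERCEPT = 0.2
COEFFICIENTS = {
    "add_to_cart_order": -0.05,
    "order_number": 0.02,
    "is_organic": 0.15,
}


def reorder_probability(add_to_cart_order: int, order_number: int, is_organic: bool) -> float:
    """Logistic model: probability = 1 / (1 + exp(-z)), z a linear score."""
    z = (
        INTERCEPT
        + COEFFICIENTS["add_to_cart_order"] * add_to_cart_order
        + COEFFICIENTS["order_number"] * order_number
        + COEFFICIENTS["is_organic"] * (1 if is_organic else 0)
    )
    return 1.0 / (1.0 + math.exp(-z))


p_organic = reorder_probability(add_to_cart_order=1, order_number=5, is_organic=True)
p_regular = reorder_probability(add_to_cart_order=1, order_number=5, is_organic=False)
print(f"organic: {p_organic:.3f}, non-organic: {p_regular:.3f}")
```

With these placeholder signs, an organic product scores higher than a non-organic one and items added later to the cart score lower. Whether the real model behaves that way depends on its fitted coefficients.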

Instacart Recommendation System:

Instacart Recommendation App

In this web app, I built a recommendation system that shows customers similar products they may like.

  • This will help customers find products similar to those they like

The Instacart Recommendation app works by:

  • Entering the department of the product

  • Entering a product from that department
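
The report does not describe how the app decides which products are similar. As one simple possibility, and only as an assumption, the sketch below returns the other products in the chosen department and lists those in the same aisle first. The function name recommend is hypothetical; the rows come from the product table shown earlier.

```python
# (product_name, aisle_id, department_id) rows from the product table above.
PRODUCTS = [
    ("Robust Golden Unsweetened Oolong Tea", 94, 7),
    ("Pure Coconut Water With Orange", 98, 7),
    ("Peach Mango Juice", 31, 7),
    ("Pomegranate Cranberry & Aloe Vera Enrich Drink", 98, 7),
    ("Dry Nose Oil", 11, 11),
    ("Saline Nasal Mist", 11, 11),
]


def recommend(department_id, product_name):
    """Other products in the department, same-aisle products first."""
    # Map each product in the chosen department to its aisle.
    aisle_of = {name: aisle for name, aisle, dept in PRODUCTS if dept == department_id}
    if product_name not in aisle_of:
        return []  # product is not in the chosen department
    target_aisle = aisle_of[product_name]
    others = [name for name in aisle_of if name != product_name]
    # sorted() is stable and False sorts before True, so same-aisle items lead.
    return sorted(others, key=lambda name: aisle_of[name] != target_aisle)


suggestions = recommend(7, "Pure Coconut Water With Orange")
for name in suggestions:
    print(name)
```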

Analysis Part 1

Customers’ Product in the Cart

    items                                                transactionID
[1] {Bag of Organic Bananas,                                          
     Bulgarian Yogurt,                                                
     Cucumber Kirby,                                                  
     Lightly Smoked Sardines in Olive Oil,                            
     Organic 4% Milk Fat Whole Milk Cottage Cheese,                   
     Organic Celery Hearts,                                           
     Organic Hass Avocado,                                            
     Organic Whole String Cheese}                              1      
[2] {Corn Tortillas,                                                  
     Extra Virgin Olive Oil,                                          
     Gala Apples,                                                     
     Garnet Sweet Potato (Yam),                                       
     Ground Cumin,                                                    
     I Heart Baby Kale,                                               
     No Salt Added Black Beans,                                       
     Organic Baby Carrots,                                            
     Organic Baby Spinach,                                            
     Organic Yellow Onion,                                            
     Original Hummus,                                                 
     Snack Sticks Chicken & Rice Recipe Dog Treats,                   
     Total 2% All Natural Plain Greek Yogurt,                         
     Unscented Long Lasting Stick Deodorant,                          
     Wheat Sandwich Thins}                                     100000 
[3] {Daily Moisture Shampoo,                                          
     DeTox Caffeine Free Organic Herbal Tea Bags,                     
     Ensure Plus Milk Chocolate Nutrition Shake,                      
     G Series Perform Frost Glacier Cherry Sports Drink,              
     Original No Pulp 100% Florida Orange Juice,                      
     Triple Chocolate Ripple,                                         
     ZzzQuil Liquid Warming Berry Flavor Sleep-Aid}            1000008
[4] {Almond Chia Granola Clusters,                                    
     Boneless Skinless Chicken Breasts,                               
     Broccoli Crown,                                                  
     Fresh Cauliflower,                                               
     Orange Bell Pepper,                                              
     Organic Gala Apples,                                             
     Organic Red Onion,                                               
     Veggie Chips}                                             1000029

Analysis Part 2

Customer buying behavior

     lhs                                       rhs                          support confidence     lift count
[1]  {Organic Strawberries}                 => {Bag of Organic Bananas} 0.023428092  0.2821737 2.391732  3074
[2]  {Organic Hass Avocado}                 => {Bag of Organic Bananas} 0.018443716  0.3318250 2.812582  2420
[3]  {Organic Avocado}                      => {Banana}                 0.016888957  0.2990957 2.095714  2216
[4]  {Large Lemon}                          => {Banana}                 0.016446917  0.2652735 1.858728  2158
[5]  {Strawberries}                         => {Banana}                 0.014846429  0.2999692 2.101835  1948
[6]  {Organic Raspberries}                  => {Bag of Organic Bananas} 0.013566039  0.3209520 2.720421  1780
[7]  {Organic Raspberries}                  => {Organic Strawberries}   0.012727688  0.3011179 3.626738  1670
[8]  {Limes}                                => {Large Lemon}            0.012156086  0.2643792 4.264192  1595
[9]  {Organic Blueberries}                  => {Organic Strawberries}   0.009671519  0.2555377 3.077758  1269
[10] {Organic Cucumber}                     => {Bag of Organic Bananas} 0.009663898  0.2748754 2.329870  1268
[11] {Honeycrisp Apple}                     => {Banana}                 0.009381907  0.3466629 2.429010  1231
[12] {Organic Fuji Apple}                   => {Banana}                 0.009221858  0.3715075 2.603092  1210
[13] {Seedless Red Grapes}                  => {Banana}                 0.008856032  0.2862774 2.005899  1162
[14] {Yellow Onions}                        => {Banana}                 0.008162488  0.2846890 1.994769  1071
[15] {Organic Lemon}                        => {Bag of Organic Bananas} 0.008132002  0.3044223 2.580313  1067
[16] {Organic Cilantro}                     => {Limes}                  0.007674720  0.2855927 6.211275  1007
[17] {Organic Large Extra Fancy Fuji Apple} => {Bag of Organic Bananas} 0.007415593  0.3365617 2.852730   973
[18] {Broccoli Crown}                       => {Banana}                 0.007049768  0.3154843 2.210547   925
[19] {Small Hass Avocado}                   => {Banana}                 0.006592485  0.2787625 1.953243   865
[20] {Red Peppers}                          => {Banana}                 0.006386708  0.2884682 2.021249   838
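
The support, confidence, lift and count columns above follow the standard association-rule definitions: support is the share of baskets containing both sides of the rule, confidence is that share divided by the share containing the left-hand side, and lift is confidence divided by the share containing the right-hand side. The sketch below computes them for a toy set of baskets. The baskets are made up for illustration and the helper name rule_metrics is hypothetical; it is not the Apriori algorithm itself, only the scoring of a single rule.

```python
# Toy baskets for illustration only. Not real Instacart transactions.
baskets = [
    {"Limes", "Large Lemon", "Banana"},
    {"Limes", "Large Lemon"},
    {"Banana", "Strawberries"},
    {"Large Lemon"},
    {"Banana"},
]


def rule_metrics(lhs, rhs, transactions):
    """Score the rule lhs => rhs over a list of item sets."""
    n = len(transactions)
    count_lhs = sum(1 for t in transactions if lhs <= t)
    count_rhs = sum(1 for t in transactions if rhs <= t)
    count_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    if count_lhs == 0 or count_rhs == 0:
        raise ValueError("both sides of the rule must occur at least once")
    support = count_both / n              # share of baskets with lhs and rhs
    confidence = count_both / count_lhs   # share of lhs baskets that also hold rhs
    lift = confidence / (count_rhs / n)   # confidence relative to rhs base rate
    return {"support": support, "confidence": confidence, "lift": lift, "count": count_both}


metrics = rule_metrics({"Limes"}, {"Large Lemon"}, baskets)
print(metrics)
```

A lift above 1 means the two sides appear together more often than they would if they were independent, which is the case for every rule in the table above.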

Analysis Part 3

Customer buying behavior Network Graph

Conclusion

What have we learned?

Further Exploration

Based on the analysis done on Instacart, I found out that:

I wondered why customers tend to reorder organic products more than non-organic ones, even though organic products are more costly. One of my hypotheses is that customers who order organic products are consuming unhealthy options outside their homes and want to eat healthily at home. I would like to add further datasets (e.g., income, education and job), or conduct a survey asking, for example: Which do you prefer when eating at home, organic or non-organic products, and why?

Recommendation

Based on the analysis done on Instacart, I found out that: