import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules, fpgrowth
The GroceryDataSet is a list of transactions made at a grocery store. The dataset comes from the University of California, Irvine (UCI) machine learning repository.
pd.set_option('display.max_columns', 40)
df = pd.read_csv('GroceryDataSet.csv', header=None).fillna(0)
df.shape
## (9835, 32)
df.head()
## 0 1 2 \
## 0 citrus fruit semi-finished bread margarine
## 1 tropical fruit yogurt coffee
## 2 whole milk 0 0
## 3 pip fruit yogurt cream cheese
## 4 other vegetables whole milk condensed milk
##
## 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 \
## 0 ready soups 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 meat spreads 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 long life bakery product 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## 20 21 22 23 24 25 26 27 28 29 30 31
## 0 0 0 0 0 0 0 0 0 0 0 0 0
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0
df.ne(0).sum().sum() / (df.shape[0] * df.shape[1])
## 0.13779550076258262
When loading the data there were a lot of NaN values, which were filled with 0 to indicate that a transaction contains no more items past its last one. The dataset contains 9,835 market transactions, and the table is mostly empty, with only ~13.8% of its cells filled.
To get a better understanding of the dataset, the data will be transformed so that each row is a transaction and each column is an item, with the values True or False depending on whether the item was purchased.
all_items = df.apply(lambda x: [i for i in x if i != 0], axis=1).values.tolist()
The mlxtend TransactionEncoder class is used to transform the data.
trans_encoder = TransactionEncoder()
trans_encoder_matrix = trans_encoder.fit(all_items).transform(all_items)
tdf = pd.DataFrame(trans_encoder_matrix, columns=trans_encoder.columns_).astype(int)
tdf.head()
## Instant food products UHT-milk abrasive cleaner artif. sweetener \
## 0 0 0 0 0
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
##
## baby cosmetics baby food bags baking powder bathroom cleaner beef \
## 0 0 0 0 0 0 0
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
##
## berries beverages bottled beer bottled water brandy brown bread \
## 0 0 0 0 0 0 0
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
##
## butter butter milk cake bar candles ... specialty vegetables spices \
## 0 0 0 0 0 ... 0 0
## 1 0 0 0 0 ... 0 0
## 2 0 0 0 0 ... 0 0
## 3 0 0 0 0 ... 0 0
## 4 0 0 0 0 ... 0 0
##
## spread cheese sugar sweet spreads syrup tea tidbits toilet cleaner \
## 0 0 0 0 0 0 0 0
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
##
## tropical fruit turkey vinegar waffles whipped/sour cream whisky \
## 0 0 0 0 0 0 0
## 1 1 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
##
## white bread white wine whole milk yogurt zwieback
## 0 0 0 0 0 0
## 1 0 0 0 1 0
## 2 0 0 1 0 0
## 3 0 0 0 1 0
## 4 0 0 1 0 0
##
## [5 rows x 169 columns]
The transformation looks correct, but we will verify the first row against the original dataset.
tdf[['citrus fruit', 'semi-finished bread' ,'margarine', 'ready soups']].head()
## citrus fruit semi-finished bread margarine ready soups
## 0 1 1 1 1
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
Those values match the first row of the original dataset, so the encoding is correct.
With the data fully transformed, we can explore the top 10 most purchased items.
plt.figure(figsize=(15,5));
sns.barplot(x=tdf.sum().nlargest(10).index , y=tdf.sum().nlargest(10).values, palette = 'hot');
plt.xlabel('Items', size = 12);
plt.xticks(rotation=30);
plt.ylabel('Counts', size = 12);
plt.title('Top 10 most purchased items', size = 20);
plt.subplots_adjust(bottom=0.25, top=0.9);
plt.show();
This looks a lot like my own grocery transactions; milk and vegetables are definitely on top.
plt.figure(figsize=(15,5));
sns.barplot(x=tdf.sum().nsmallest(10).index , y=tdf.sum().nsmallest(10).values, palette = 'cool');
plt.xlabel('Items', size = 12);
plt.xticks(rotation=30);
plt.ylabel('Counts', size = 12);
plt.title('Top 10 least purchased items', size = 20);
plt.subplots_adjust(bottom=0.25, top=0.9);
plt.show();
This grocery store sells the most whole milk and vegetables, and the least baby food and sound storage medium.
tdf.sum(1).describe().iloc[1:]
## mean 4.409456
## std 3.589385
## min 1.000000
## 25% 2.000000
## 50% 3.000000
## 75% 6.000000
## max 32.000000
## dtype: float64
We can see that, on average, a customer purchases 4.4 items, and the largest basket is 32 items long; that was a very active shopper. As expected, the minimum purchase is 1, since there are no rows with all 0s.
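To see the full spread rather than just the summary statistics, here is a minimal sketch of the basket-size distribution, assuming the tdf matrix built above:
# Number of items per transaction (row sums of the one-hot matrix).
basket_sizes = tdf.sum(axis=1)
plt.figure(figsize=(10,4));
plt.hist(basket_sizes, bins=range(1, basket_sizes.max() + 2));
plt.xlabel('Items per transaction', size = 12);
plt.ylabel('Number of transactions', size = 12);
plt.title('Basket size distribution', size = 20);
plt.show();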
Apriori is a popular algorithm for extracting frequent itemsets with applications in association rule learning. The apriori algorithm has been designed to operate on databases containing transactions, such as the one we are using.
frequent_itemsets = apriori(tdf, min_support=0.001, use_colnames=True, max_len=2).assign(
    length=lambda d: d['itemsets'].map(len)
)
frequent_itemsets.head()
## support itemsets length
## 0 0.008033 (Instant food products) 1
## 1 0.033452 (UHT-milk) 1
## 2 0.003559 (abrasive cleaner) 1
## 3 0.003254 (artif. sweetener) 1
## 4 0.017692 (baking powder) 1
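As a quick sense check on the min_support threshold passed to apriori above, a minimal sketch converting it into the minimum number of baskets an itemset must appear in:
import math

# min_support=0.001 over 9,835 transactions means an itemset must show up
# in at least ceil(0.001 * 9835) = 10 baskets to be kept as frequent.
print(math.ceil(0.001 * len(tdf)))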
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1).sort_values('lift', ascending=False)
rules.head()
## antecedents consequents antecedent support \
## 4145 (mustard) (mayonnaise) 0.011998
## 4144 (mayonnaise) (mustard) 0.009151
## 13 (hamburger meat) (Instant food products) 0.033249
## 12 (Instant food products) (hamburger meat) 0.008033
## 2657 (detergent) (softener) 0.019217
##
## consequent support support confidence lift leverage \
## 4145 0.009151 0.001423 0.118644 12.965160 0.001314
## 4144 0.011998 0.001423 0.155556 12.965160 0.001314
## 13 0.008033 0.003050 0.091743 11.421438 0.002783
## 12 0.033249 0.003050 0.379747 11.421438 0.002783
## 2657 0.005491 0.001118 0.058201 10.600137 0.001013
##
## conviction
## 4145 1.124233
## 4144 1.170002
## 13 1.092166
## 12 1.558640
## 2657 1.055968
The association_rules function returns nine columns, but we are only interested in the first two together with support, confidence, and lift. Antecedents and consequents are pairs of items that are associated with each other, and the strength of the association is given by the lift and confidence.
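For readability, the columns of interest can be pulled out on their own; a minimal sketch using the column names shown in the output above:
# Keep only the rule itself plus the three metrics we care about.
rules_slim = rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]
rules_slim.head()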
Now that we have the rules, we can take a look at different associations.
rules[rules['lift'].gt(1) & rules['confidence'].gt(0.6)]
## antecedents consequents antecedent support consequent support \
## 3780 (honey) (whole milk) 0.001525 0.255516
## 1597 (cereals) (whole milk) 0.005694 0.255516
## 4963 (rice) (whole milk) 0.007626 0.255516
##
## support confidence lift leverage conviction
## 3780 0.001118 0.733333 2.870009 0.000729 2.791815
## 1597 0.003660 0.642857 2.515917 0.002206 2.084555
## 4963 0.004677 0.613333 2.400371 0.002729 1.925390
Selecting rows with high lift and confidence scores, we can see strong associations between honey, cereals, and rice (as antecedents) and whole milk.
rules['support'].describe()
## count 5428.000000
## mean 0.004008
## std 0.005128
## min 0.001017
## 25% 0.001423
## 50% 0.002339
## 75% 0.004372
## max 0.074835
## Name: support, dtype: float64
We see that the support for all rules in our dataset is very low (support being the proportion of transactions that contain both the antecedent and the consequent), which may be problematic: any results obtained from the analysis may not be statistically significant.
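As a sanity check that support really is the fraction of transactions containing both items, here is a minimal sketch recomputing it straight from the one-hot matrix for the (honey, whole milk) rule seen earlier; it should come out around the 0.0011 reported by association_rules:
# Fraction of all transactions that contain both honey and whole milk.
both = tdf['honey'].astype(bool) & tdf['whole milk'].astype(bool)
print(both.mean())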
rules.loc[rules['lift'].idxmax()]
## antecedents (mustard)
## consequents (mayonnaise)
## antecedent support 0.011998
## consequent support 0.00915099
## support 0.00142349
## confidence 0.118644
## lift 12.9652
## leverage 0.00131369
## conviction 1.12423
## Name: 4145, dtype: object
We see that the itemsets mustard and mayonnaise have the highest lift: if we know that a customer has bought mustard, they are roughly 13 times more likely to also buy mayonnaise than a random customer. But, as we've pointed out, the support is very low, so we cannot really determine whether this is a fluke or a real association.
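Lift is just the rule's confidence divided by the consequent's baseline support, so the ~13x figure can be recomputed from the fields above; a minimal sketch:
# lift = confidence / consequent support
# For mustard -> mayonnaise: 0.118644 / 0.009151 ≈ 12.97, matching the output.
top_rule = rules.loc[rules['lift'].idxmax()]
print(top_rule['confidence'] / top_rule['consequent support'])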
rules.loc[rules['lift'].idxmin()]
## antecedents (specialty chocolate)
## consequents (rolls/buns)
## antecedent support 0.0304016
## consequent support 0.183935
## support 0.00559227
## confidence 0.183946
## lift 1.00006
## leverage 3.51504e-07
## conviction 1.00001
## Name: 5010, dtype: object
We see that the itemsets rolls/buns and specialty chocolate have the lowest lift. A lift of roughly 1 means the two items are bought essentially independently of each other, which makes sense because I can't think of a reason to buy those two items together.
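If the two items really are bought independently, the observed joint support should be close to the product of the two individual supports; a minimal sketch checking that for the lowest-lift rule:
# support(A and B) ≈ support(A) * support(B) when the items are independent.
low_rule = rules.loc[rules['lift'].idxmin()]
print(low_rule['support'], low_rule['antecedent support'] * low_rule['consequent support'])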
rules[rules['consequents'].map(lambda x: list(x)[0]).eq('whole milk')].query('lift > 2')
## antecedents consequents antecedent support \
## 3780 (honey) (whole milk) 0.001525
## 1597 (cereals) (whole milk) 0.005694
## 4963 (rice) (whole milk) 0.007626
## 2111 (cocoa drinks) (whole milk) 0.002237
## 4931 (pudding powder) (whole milk) 0.002339
## 3940 (jam) (whole milk) 0.005389
## 223 (baking powder) (whole milk) 0.017692
## 2264 (cooking chocolate) (whole milk) 0.002542
##
## consequent support support confidence lift leverage conviction
## 3780 0.255516 0.001118 0.733333 2.870009 0.000729 2.791815
## 1597 0.255516 0.003660 0.642857 2.515917 0.002206 2.084555
## 4963 0.255516 0.004677 0.613333 2.400371 0.002729 1.925390
## 2111 0.255516 0.001322 0.590909 2.312611 0.000750 1.819850
## 4931 0.255516 0.001322 0.565217 2.212062 0.000724 1.712313
## 3940 0.255516 0.002949 0.547170 2.141431 0.001572 1.644069
## 223 0.255516 0.009253 0.522989 2.046793 0.004732 1.560725
## 2264 0.255516 0.001322 0.520000 2.035097 0.000672 1.551008
Earlier we saw that whole milk was the most frequently bought item; now we look at its associations. Items like cereals and honey are at the top, probably because we eat cereal with milk and put honey in our tea. A lot of the items involve baking as well, like baking powder, pudding powder, cooking chocolate, and cocoa drinks.
sup = rules['support'].values;
conf = rules['confidence'].values;
lift = rules['lift'].values;
plt.figure(figsize=(10,6));
sc = plt.scatter(sup,conf,c=lift);
plt.colorbar(sc,label='Lift');
plt.xlabel('support');
plt.ylabel('confidence');
plt.show();
rules[rules['lift'].gt(10)]
## antecedents consequents antecedent support \
## 4145 (mustard) (mayonnaise) 0.011998
## 4144 (mayonnaise) (mustard) 0.009151
## 13 (hamburger meat) (Instant food products) 0.033249
## 12 (Instant food products) (hamburger meat) 0.008033
## 2657 (detergent) (softener) 0.019217
## 2656 (softener) (detergent) 0.005491
## 3976 (red/blush wine) (liquor) 0.019217
## 3977 (liquor) (red/blush wine) 0.011083
##
## consequent support support confidence lift leverage \
## 4145 0.009151 0.001423 0.118644 12.965160 0.001314
## 4144 0.011998 0.001423 0.155556 12.965160 0.001314
## 13 0.008033 0.003050 0.091743 11.421438 0.002783
## 12 0.033249 0.003050 0.379747 11.421438 0.002783
## 2657 0.005491 0.001118 0.058201 10.600137 0.001013
## 2656 0.019217 0.001118 0.203704 10.600137 0.001013
## 3976 0.011083 0.002135 0.111111 10.025484 0.001922
## 3977 0.019217 0.002135 0.192661 10.025484 0.001922
##
## conviction
## 4145 1.124233
## 4144 1.170002
## 13 1.092166
## 12 1.558640
## 2657 1.055968
## 2656 1.231681
## 3976 1.112532
## 3977 1.214833
Finally, I wanted to look at the relationship between support and confidence and how they influence lift. We can see a slight trend where confidence tends to increase with support in a roughly linear fashion. Some of the larger lift values share similar support/confidence with smaller ones, so they are hard to pick out in the plot; the table above shows the rules with the largest lift values.
Exploring the grocery dataset has exposed a lot of secrets in the grocery business. By applying the apriori algorithm we found that certain items have strong associations with other items, making those pairs good targets for coupons/ads. When it comes to item placement, it would be a good idea to place items with high lift scores apart from each other so customers are exposed to other products along the way. I am certain many grocery stores already implement these approaches to maximize profit.
However, the analysis done here shouldn't be stretched too far, since we saw that the support values are very low. More data is needed to make strong connections between items. Furthermore, the dataset was small, making it more likely that what we found isn't statistically significant.