Introduction to Market Basket Analysis

This module introduces market basket analysis.

The name of this technique, market basket analysis, probably derives from its frequent application to analyzing the buying behavior of consumers in supermarkets. Upon check-out, the items in the cart or basket are recorded. Especially when linked to customer or membership cards, the data provides valuable insight into buying patterns.

The same holds true for e-commerce organizations, like Amazon. Compared to traditional bookstores, Amazon has much better insight into buying patterns. From purchasing patterns, Amazon is able to detect which titles are frequently ordered by the same buyers. This knowledge enables Amazon to cross-sell items effectively, by recommending titles to buyers under headings like "customers who bought this item also bought ..."

However, as we will show, the technique can be applied in all cases where we are looking for patterns, or relationships between characteristics of objects or persons. Some examples:

  1. Purchases in supermarkets
  2. Behavioral traits of criminals
  3. DNA patterns in diseases
  4. Or the Titanic data ...

The popularity of the technique has everything to do with the vast amount of data that large organizations (supermarkets; online sellers, such as Amazon) have about people (customers; members, etc.) and their transactions.

Traditionally, marketing specialists rely on intuitive knowledge of buying behavior, sometimes supported by information from consumer surveys. Driving questions are, for example, what are the expected sales numbers if the price for a product is reduced by 5%? Or, for supermarkets, what is the optimal store layout (order of products and shelves; place on the shelf)?

Of course, it is possible to use experiments, but frequent changes in store layout, for example, will easily lead to customer confusion and annoyance. With the emergence of barcodes and scanning systems that are linked to customer cards, companies can better map who buys what, when and in what quantities.

Online stores such as Amazon have an enormous competitive advantage over traditional stores. To traditional stores, most customers are just anonymous buyers. But online stores know the profile of the customers and their buying habits. Amazon knows the genres of books their customers enjoy, and can recommend books by other authors.

The algorithm

Like cluster analysis via K-means clustering, market basket analysis is an example of unsupervised learning. That is, it is not necessary to first train the algorithm, and then see if the algorithm provides good predictions in a test set of data not used in the training.

What we are looking for is a set of rules that relate characteristics to one another. We can then use these rules for all types of decisions. In the setting of the supermarkets, the characteristics are the groceries (items) in the shopping basket of the customer.

Items, item sets and association rules

The relationship between items is the joint occurrence of items in the shopping basket.

In the jargon of market basket analysis, we speak of item sets, and of association rules.

An item set is simply a collection of one or more items. We display these collections with curly brackets {}.

An example of an item set in a customer's shopping basket in the supermarket is the combination of bread and peanut butter, shown as:

{bread, peanut butter}

But also

{bread}

and

{bread, peanut butter, toilet cleaner}

are examples of item sets.

The number of possible item sets quickly takes on enormous proportions.

Suppose a very small store sells 10 types of products (or items). A customer may or may not buy any item, which results in:

\(2^{10} = 1024\)

possible item sets! For a slightly larger supermarket with 100 different items, the number of options is

\(2^{100} \approx 1.27 \times 10^{30}\)

(roughly 127 followed by 28 zeros)

In those situations it is not feasible to evaluate all possible combinations. We have to look for a smart and efficient algorithm to generate rules that connect item sets.
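The two counts above can be checked with a few lines of Python. This is only a small sketch; the function name `count_item_sets` is chosen here for illustration and does not come from the module.

```python
def count_item_sets(n_items: int) -> int:
    """Each item is either in the basket or not, so there are 2**n
    combinations (this count includes the empty basket)."""
    return 2 ** n_items


small_store = count_item_sets(10)
supermarket = count_item_sets(100)

print(small_store)            # 1024
print(f"{supermarket:.2e}")   # 1.27e+30
```

Going from 10 to 100 items multiplies the number of possible item sets by roughly 10^27, which is why a brute-force evaluation is out of the question.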

An example of such a rule is:

{bread} → {peanut butter, sprinkles}

In this rule, a link is made between an item or item set to the left of the arrow (commonly referred to as LHS, or left hand side) and an item or item set to the right of the arrow (RHS, or right hand side).

The left side sets the condition, and the right side is the result.

In words, this rule says that the purchase of bread results in the purchase of peanut butter and sprinkles.
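One possible way to represent such a rule in code is as a pair of sets. The sketch below is an illustration only: the names `Rule`, `rule_applies` and `rule_holds`, and the extra item "milk", are assumptions made for this example.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    lhs: frozenset  # condition (left hand side)
    rhs: frozenset  # result (right hand side)


rule = Rule(lhs=frozenset({"bread"}),
            rhs=frozenset({"peanut butter", "sprinkles"}))


def rule_applies(rule: Rule, basket: set) -> bool:
    """True if the basket meets the condition, i.e. contains every LHS item."""
    return rule.lhs <= basket


def rule_holds(rule: Rule, basket: set) -> bool:
    """True if the basket contains every item on both sides of the rule."""
    return (rule.lhs | rule.rhs) <= basket


# Two hypothetical baskets
basket_a = {"bread", "peanut butter", "sprinkles", "milk"}
basket_b = {"bread", "milk"}

print(rule_applies(rule, basket_a), rule_holds(rule, basket_a))  # True True
print(rule_applies(rule, basket_b), rule_holds(rule, basket_b))  # True False
```

The second basket meets the condition but not the result, so a rule of this kind will rarely hold for every basket. This is why rules are judged by how often they hold rather than whether they always do.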

The market basket analysis algorithm is a smart method for detecting “interesting” connection rules between items.

What is interesting depends on the type of application.

In a large supermarket with a thousand products and many thousands of customers and millions of transactions per year, combinations that make up a small part of the whole can carry important information. Different criteria will apply in medical applications or smaller data sets.

Data

The name of the technique, market basket analysis, seems to limit its application. But the principles of analyzing connection rules, with the core concepts of support, confidence and lift, are widely applicable, as evident from the Titanic data video above and in the example below.

As an example of market basket analysis we will use a small data set with characteristics of 10 people who have shown criminal behavior.

Psychologists have interviewed these individuals and coded the information from the conversations using keywords (such as "drugs" if there was a drug addiction; "divorced", and so on).

The question now is whether these characteristics (items) are related to each other. For example, are drug users relatively often divorced; or vice versa, do criminals who are divorced more often use drugs?

Note that this technique is geared toward finding associations. It is unsupervised, as we do not have a target variable to explain or predict. Nor do we have a theory that uses logical reasoning to link the LHS to the RHS.

That said, there often is a logic to the associations found. Consumers who buy bread may also buy peanut butter and chocolate sprinkles to use on the bread. But we do not start analyzing the data to test a priori theories. We will keep associations even if we do not understand their logic, as in the beer and diapers example in the video!

We can record the information from the interviews in a database. Such a data file (in Excel) can look like this:

In the first interview, for example, it turns out that the interviewee has divorced parents and has been guilty of shoplifting.

Shoplifting does not occur in any of the other interviews.

One conclusion is that shoplifting is always associated with divorced parents (although the evidence for this conclusion, with only one case of shoplifting, is pretty thin).

Conversely, there are several interviewees with divorced parents who have not been guilty of shoplifting.

Note that due to the lack of structure in the data set, it is not so easy to identify these types of relationships even in this small data set!

A number of things stand out in the database.

This way of storing data has great advantages.

Think of the thousands of products offered by supermarkets, out of which only a very small proportion is part of each transaction. If every column were to represent a product (item), then every record in the data file would mainly consist of zeros or empty cells! An online store would have to create a line with as many columns as items in the assortment, even though each transaction includes only one or a few items.

In its quest to be all things to all people, Amazon has built an unbelievable catalog of more than 12 million products, books, media, wine, and services. If you expand this to Amazon Marketplace sellers, as well, the number is closer to more than 350 million products. Source: www.bigcommerce.com

In our small example, too, this method of data storage has advantages: the interviewer can note down the comments during the interview, without having to worry about the sequence and the number of possible comments.

In the same vein, the cashier in the supermarket also scans the products in any "random" order in which they are placed on the belt!
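The difference between the two ways of storing the data can be made concrete with a small sketch. The three baskets below are made up for illustration and are not data from this module.

```python
# Hypothetical baskets, stored as described above: one record per
# transaction, listing only the items that were actually bought.
transactions = [
    ["bread", "peanut butter"],
    ["toilet cleaner"],
    ["bread", "peanut butter", "sprinkles"],
]

# The alternative layout: one column per product in the assortment.
assortment = sorted({item for basket in transactions for item in basket})
wide = [[1 if item in basket else 0 for item in assortment]
        for basket in transactions]

cells = len(wide) * len(assortment)
zeros = sum(row.count(0) for row in wide)

print(assortment)
for row in wide:
    print(row)
print(f"{zeros} of {cells} cells are zero")  # 6 of 12 cells are zero
```

Even with only four products, half of the cells in the column-per-product layout are zeros. With thousands of products and a handful of items per basket, almost every cell would be.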

Self test: try to see for yourself if you can detect some pattern in the data. Although there are only 10 people in the file and the number of comments has a maximum of 5, it is not an easy task! Algorithms help!

The Algorithm

When we look for “interesting” connection rules in a huge set of possible rules, the challenge is to develop an efficient algorithm for reducing the number of possibilities.

Such an algorithm has two steps.

  1. In the first step, we only consider those items and item sets that occur a minimal number of times. In our example, we could limit our quest to those comments (items) and combinations of comments that are mentioned in at least 2 out of 10 cases. The comment "ADHD" (which stands for Attention Deficit Hyperactivity Disorder) would then be dropped as it occurs only once. But here's the trick: if "adhd" only occurs once, then all potential combinations of other comments plus "adhd" cannot occur twice or more! In other words: with the exclusion of one item a lot of item sets will go with it. Examples:

{adhd; drugs}; {adhd; divorced}; {adhd; drugs; divorced}.

We refer to the relative frequency of items and item sets (the number of occurrences divided by the total number of cases) as support.

Support = Frequency Of Item Set / Total

The support for the comment "adhd" is 1 in 10 (0.10, or 10%).

Keep in mind that an item set can consist of 1 or more items. A single comment itself is therefore also an item set, even if the set only contains one item!

  2. Once we have determined item sets with sufficient support, we can proceed to step 2. Here, we assess association rules on the basis of confidence. This is a tough concept that often leads to confusion. We will explain it using our example.

The formula for confidence is:

Confidence (X→Y) = (Support (X,Y)) / (Support (X))

In words, this formula expresses the probability that item set X (as a condition) leads to item set Y (the consequence).

In the example of the supermarket, the rule that buying bread leads to buying peanut butter has a probability equal to the number of times peanut butter and bread are bought together, divided by the number of times bread is bought.
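The two steps described above, filtering on support and then computing confidence, can be sketched in Python. This is a simplified illustration that stops at pairs of items, not a full implementation. The five interview records are hypothetical and are NOT the data set from the text, and the threshold `MIN_SUPPORT` is an assumed value.

```python
from itertools import combinations

# Hypothetical interview keywords, NOT the data set from the text.
interviews = [
    {"parents divorced", "shoplifting"},
    {"drugs"},
    {"parents divorced", "drugs", "divorced"},
    {"adhd", "drugs"},
    {"parents divorced", "divorced"},
]


def support(item_set, records):
    """Share of records that contain every item in item_set."""
    hits = sum(1 for record in records if item_set <= record)
    return hits / len(records)


def confidence(lhs, rhs, records):
    """Support(X, Y) / Support(X)."""
    return support(lhs | rhs, records) / support(lhs, records)


MIN_SUPPORT = 0.4  # assumed threshold: at least 2 of the 5 records

# Step 1: keep single items with enough support, then build pairs
# only from those items. Rare items such as "adhd" drop out here,
# and every pair containing them drops out with them.
items = sorted(set().union(*interviews))
frequent_items = [i for i in items if support({i}, interviews) >= MIN_SUPPORT]
frequent_pairs = [
    frozenset(pair)
    for pair in combinations(frequent_items, 2)
    if support(frozenset(pair), interviews) >= MIN_SUPPORT
]

# Step 2: confidence for both directions of each frequent pair.
rules = {}
for pair in frequent_pairs:
    for x in pair:
        lhs, rhs = frozenset({x}), pair - {x}
        rules[(x, next(iter(rhs)))] = confidence(lhs, rhs, interviews)

for (x, y), conf in sorted(rules.items()):
    print(f"{{{x}}} -> {{{y}}}: confidence {conf:.2f}")
```

With these made-up records, only three single items and one pair survive step 1, so step 2 has to evaluate just two rules instead of every possible combination.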

In a numerical example: imagine a supermarket analyzing 80 shopping baskets. In 60 cases, the shopping basket contains bread (the gray shaded part, in the left column of figure 3). In 50 cases, the shopping basket contains peanut butter (the yellow shaded area in the right column).

The support for bread is 60/80 = 0.75 (or 75%).

The support for peanut butter is 50/80 = 0.625 (or 62.5%).

We can also calculate the support for the item set {bread, peanut butter}. Bread and peanut butter appear together in 40 of the 80 baskets, so the support is 40/80 = 0.50.

We can now calculate the confidence for the association rules.

{bread} → {peanut butter}

and

{peanut butter} → {bread}

For the first association {bread} → {peanut butter} the confidence can be calculated as:

Confidence (bread → peanut butter) = 0.50 / 0.75 = 0.67

In 2 out of 3 cases (67%), buying bread leads to buying peanut butter.

This seems to be informative. Or is it? What do you think? What exactly is the "value" of this information?

The 67% confidence for this rule is only informative if it deviates from support for peanut butter! Without prior information on buying bread, we would estimate the probability of buying peanut butter at 62.5%.

But if we know that bread is part of the shopping basket, then the probability of peanut butter in the shopping basket is only slightly higher (67%).

For assessing the value of information we need, in addition to support and confidence, a third measure, which we call lift. We will come to that later on.

Note that confidence for {bread} → {peanut butter} is not the same as confidence for {peanut butter} → {bread}!

Confidence (peanut butter → bread) = 0.50 / 0.625 = 0.80

We can interpret the difference.

If peanut butter is bought, chances are very high (80%) that bread will also be bought.

It is more common to buy "bread" but "no peanut butter" than "peanut butter" but "no bread" (not everyone likes peanut butter on their bread; conversely, peanut butter has few uses other than on bread).
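The figures from the bread and peanut butter example can be reproduced in a few lines. The sketch below uses only the counts given above (80 baskets, 60 with bread, 50 with peanut butter, 40 with both); the variable names are chosen here for illustration.

```python
n_baskets = 80
n_bread = 60
n_peanut_butter = 50
n_both = 40

support_bread = n_bread / n_baskets                   # 0.75
support_peanut_butter = n_peanut_butter / n_baskets   # 0.625
support_both = n_both / n_baskets                     # 0.5

# Confidence depends on the direction of the rule
conf_bread_to_pb = support_both / support_bread           # about 0.67
conf_pb_to_bread = support_both / support_peanut_butter   # 0.8

# Baskets containing one item but not the other
bread_without_pb = n_bread - n_both            # 20
pb_without_bread = n_peanut_butter - n_both    # 10

print(f"bread -> peanut butter: {conf_bread_to_pb:.2f}")
print(f"peanut butter -> bread: {conf_pb_to_bread:.2f}")
print(bread_without_pb, pb_without_bread)
```

The last line shows where the asymmetry comes from: 20 baskets contain bread without peanut butter, while only 10 contain peanut butter without bread.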

Let's apply this to our example.

As an example we can look at the relationship between {parents divorced} and {divorced}.

It may of course happen in incidental cases that your parents divorce after you have been divorced yourself, but much more important in the context of explanations for criminal behavior are the consequences of divorced parents on the lives of their children. A divorce can be related to the loss of a job; job loss can lead to drug use and lack of money; and so on.

So the interesting relationship for us is:

{parents divorced} → {divorced}

In the formula for confidence we have to calculate the support for the item {parents divorced} and for the item set {parents divorced; divorced}. The support for {divorced} is not important in this relationship. Because the data file is small, we can easily calculate it, for example in Excel.
