Introduction

This writeup covers the market basket analysis I did on sales data. The code and data for this project can be found on my GitHub. I’d be excited to hear what you have to say, so please feel free to reach out to me via email!

The dataset consists of 100,000 transaction records. Each transaction has a binary attribute for each of 50 items; essentially, each transaction either contains or does not contain a particular item. The essential task is to analyze how particular products group together; this is also known as a market basket analysis. I will show through 2 different methods (a heatmap and dendrogram via hierarchical clustering; PCA and k-means clustering) that 3 distinct product groupings are present.

Exploring The Data

The data was easy to read in using the standard read.csv function:

sales.file.name <- 'sales-data.csv'
total.sales <- read.csv(sales.file.name)
head(total.sales[, 1:10], n = 20)
##    id item_0 item_1 item_2 item_3 item_4 item_5 item_6 item_7 item_8
## 1   0      0      1      0      0      0      0      0      0      0
## 2   1      0      0      0      1      0      1      0      0      0
## 3   2      0      0      0      0      0      1      0      0      0
## 4   3      0      1      0      0      0      0      0      0      0
## 5   4      0      0      0      1      0      1      0      0      0
## 6   5      0      0      1      0      0      0      0      1      0
## 7   6      0      0      0      0      0      0      1      0      0
## 8   7      0      0      1      0      0      0      0      1      0
## 9   8      0      0      0      0      0      0      0      0      0
## 10  9      0      0      1      0      0      0      0      1      0
## 11 10      0      1      0      0      0      0      0      0      0
## 12 11      0      0      0      1      0      1      0      0      0
## 13 12      0      0      1      0      0      0      0      1      0
## 14 13      0      0      0      1      0      1      0      0      0
## 15 14      0      0      1      0      0      0      0      1      0
## 16 15      0      0      0      0      0      0      0      0      0
## 17 16      0      0      1      0      0      0      0      1      0
## 18 17      0      0      0      1      0      1      0      0      0
## 19 18      0      0      0      0      1      0      0      0      0
## 20 19      0      0      1      0      0      0      0      1      0

total.sales is now a data frame that contains one transaction per row and 51 columns: one for each of the 50 items (indicating whether the item is part of the transaction) plus one additional column for the transaction ID. Even from this small subset, one gets the feeling that the data is very sparse.
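As a quick sanity check on that impression, one could compute the overall fill rate of the item columns. This is just a sketch; it assumes (as described below) that the id column is the only non-item column:

# Overall fraction of item cells equal to 1, a rough measure of sparsity;
# assumes the first (id) column is the only non-item column
item.columns <- total.sales[, -1]
mean(as.matrix(item.columns))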

I started exploring the data by simply plotting a histogram of the probability that a given item appears in a transaction. (The helper function get.sales.items is needed since not all columns correspond to items; in fact, every column but the first corresponds to an item.)

# Item columns are all columns whose name contains 'item' (everything except the id column)
get.sales.items <- function(sales) {
    sales.items <- grep('item', colnames(sales), value = TRUE)
    return(sales.items)
}

# For each item, the fraction of transactions that contain it
get.probabilities <- function(sales) {
    probabilities <- sapply(get.sales.items(sales), function(x) {
        sum(sales[[x]]) / nrow(sales)
    })
    return(probabilities)
}

probabilities <- get.probabilities(total.sales)
hist(probabilities, main = "Probabilities for Items", xlab = "Fraction of Transactions Containing the Item", ylab = "Count")

There appear to be 2 very distinct groups of items: those which are purchased in about a quarter or so of transactions and those which are purchased far less often. This will come into play when analyzing various groupings of items: consider the case of a frequently purchased item (say a bag of chips) and a much less frequently purchased item (say toothpaste). One cannot claim that toothpaste is a good pairing with chips simply because every time someone buys toothpaste, they also purchase chips: they will purchase chips regardless!
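To make that split concrete, one could separate the items at a cutoff suggested by the histogram. This is only a sketch; the 10% threshold is my own eyeballed choice rather than anything derived from the data:

# Split items at an eyeballed 10% purchase-rate cutoff (the threshold is a judgment call)
frequent.items <- names(probabilities)[probabilities > 0.10]
rare.items <- names(probabilities)[probabilities <= 0.10]
length(frequent.items)
length(rare.items)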

Conditional Probabilities

Considering the last point (that certain items are purchased at a high rate) and the fact that the original dataset is very sparse, I decided to condense the dataset into a matrix of conditional probabilities. One benefit of defining get.probabilities as above is that we can pass it any subset of the sales; in particular, the subset of transactions that actually contain a given item. To produce that subset, I wrote the filter.sales function, which retains only the records containing a particular item. Here is the code to generate a full matrix of conditional probabilities (which I decided to call a coincidence matrix):

# Keep only the transactions that contain the given item
filter.sales <- function(sales, item) {
    sales.subset <- sales[sales[[item]] == 1, ]
    return(sales.subset)
}

# Build the matrix of conditional probabilities: entry [i, j] is the probability
# that item j appears in a transaction given that item i appears in it
make.coincidence.matrix <- function(sales, preserve.diagonal = TRUE) {
    sales.items <- get.sales.items(sales)
    coincidence.matrix <- data.frame()
    for(i in 1:length(sales.items)) {
        sale.item <- sales.items[i]
        # Condition on item i by restricting to the transactions that contain it
        filtered.sales <- filter.sales(sales, sale.item)
        conditional.probabilities.for.item <- get.probabilities(filtered.sales)
        for(j in 1:length(conditional.probabilities.for.item)) {
            coincidence.matrix[i, j] <- conditional.probabilities.for.item[j]
            # Optionally zero out the trivial diagonal entries (probability 1)
            if(i == j && !preserve.diagonal) {
                coincidence.matrix[i, j] <- 0
            }
        }
    }
    colnames(coincidence.matrix) <- sales.items
    rownames(coincidence.matrix) <- sales.items
    return(data.matrix(coincidence.matrix))
}

coincidence.matrix <- make.coincidence.matrix(total.sales)
head(coincidence.matrix[, 1:9], n = 20)
##             item_0     item_1     item_2     item_3     item_4     item_5
## item_0  1.00000000 0.19091467 0.18907305 0.18723143 0.02455494 0.18968692
## item_1  0.01306613 1.00000000 0.02201496 0.02058651 0.01037728 0.02390555
## item_2  0.01087340 0.01849891 1.00000000 0.02086422 0.01140295 0.02051119
## item_3  0.01084946 0.01743028 0.02102305 1.00000000 0.01205891 0.85988190
## item_4  0.02457002 0.15171990 0.19840295 0.20823096 1.00000000 0.21068796
## item_5  0.01094038 0.02014587 0.02057074 0.85586319 0.01214417 1.00000000
## item_6  0.02958199 0.16334405 0.19549839 0.19163987 0.03472669 0.19935691
## item_7  0.01059921 0.01815998 0.85952516 0.02112776 0.01236574 0.02063313
## item_8  0.03696742 0.16165414 0.19235589 0.18859649 0.02443609 0.20112782
## item_9  0.01223152 0.85006935 0.02421084 0.02055399 0.01034004 0.02320205
## item_10 0.02411168 0.18337563 0.19860406 0.18781726 0.02538071 0.18972081
## item_11 0.02559301 0.17478152 0.19225968 0.19475655 0.03121099 0.20037453
## item_12 0.03069054 0.16751918 0.18989770 0.19053708 0.02941176 0.19629156
## item_13 0.02886350 0.17197835 0.18400481 0.19542995 0.03547805 0.19963921
## item_14 0.02471483 0.16032953 0.19138150 0.20659062 0.02344740 0.20278834
## item_15 0.02358783 0.16263191 0.17752948 0.19428926 0.03289882 0.19801366
## item_16 0.02373517 0.15427858 0.18863210 0.22798251 0.03810119 0.22173641
## item_17 0.02818627 0.17156863 0.19056373 0.20649510 0.03370098 0.20526961
## item_18 0.03182375 0.17258262 0.18298654 0.19277846 0.02815177 0.20318237
## item_19 0.02899445 0.16964837 0.20111043 0.20974707 0.02282542 0.20666255
##             item_6     item_7     item_8
## item_0  0.02823818 0.18416206 0.03621854
## item_1  0.01067137 0.02159482 0.01083943
## item_2  0.01073219 0.85885759 0.01083810
## item_3  0.01060046 0.02127205 0.01070717
## item_4  0.03316953 0.21498771 0.02395577
## item_5  0.01097578 0.02067696 0.01136525
## item_6  1.00000000 0.20128617 0.02508039
## item_7  0.01105851 1.00000000 0.01035189
## item_8  0.02443609 0.18358396 1.00000000
## item_9  0.01084444 0.02299189 0.01059224
## item_10 0.02601523 0.19670051 0.03172589
## item_11 0.03682896 0.18664170 0.02871411
## item_12 0.02621483 0.19245524 0.03196931
## item_13 0.03307276 0.18580878 0.02705953
## item_14 0.03105196 0.19645120 0.03105196
## item_15 0.02793296 0.17815022 0.03103662
## item_16 0.02311056 0.18176140 0.02685821
## item_17 0.02573529 0.19362745 0.02941176
## item_18 0.03365973 0.19645043 0.03304774
## item_19 0.03516348 0.19740901 0.03084516

Before explaining how to make use of this matrix beyond staring at a wall of numbers, let me explain the preserve.diagonal argument: I have included it because excluding the diagonal elements (all 1’s, since the conditional probability that item X appears given that item X appears is 1) effectively reduces the dimensionality of the matrix. This will be useful for PCA, and I will touch on it again there.
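As a brief aside before moving on: because the item columns only contain 0’s and 1’s, the same matrix can also be built with matrix algebra instead of the double loop. The following is only a sketch of that equivalent computation (it reproduces the preserve.diagonal = TRUE case), not the code used for the results above:

# Co-occurrence counts: entry [i, j] is the number of transactions containing both items i and j
items <- as.matrix(total.sales[, get.sales.items(total.sales)])
co.counts <- crossprod(items)
# Divide row i by the count of item i to get the conditional probability of item j given item i
coincidence.fast <- sweep(co.counts, 1, diag(co.counts), "/")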

Heatmaps and Dendrograms

Now, one of my favorite visualization techniques is the heatmap. This allows one to see the hotspots of a matrix. It is a technique typically used in gene expression data because it allows researchers to easily spot which genes are being over-expressed or under-expressed. Instead of spotting genes of interest, here we will be able to find items of interest! Here is the code that generates the heatmap according to my preferences:

colors <- colorRampPalette(c("white","royalblue"))(256)
map <- heatmap(t(coincidence.matrix), col = colors, symm = TRUE, main = "Heatmap of Conditional Probabilities for the Items", xlab = "Item Number", ylab = "Item Number")

The interpretation of this heatmap is as follows: for any given item X, the pixel in the column for item Y is colored with an intensity corresponding to the conditional probability that item Y will be in a transaction involving X. (Those following along with the code would notice the extra call to t before passing in the matrix; this is required because by default, heatmap applies a transpose to the incoming matrix. If I didn’t include this call to t, the interpretation I just described wouldn’t apply.)

There are a couple of interesting features to notice in this heatmap. The first is the diagonal line from the top left to the bottom right of the heatmap; this diagonal represents the case discussed earlier: the conditional probability that a transaction contains item X given that it already contains item X is 1 (we can be sure that a column and row with the same index number refer to the same item because we specified that the heatmap should be symmetric). Thus, these dark boxes are a reference point for the most intense correspondences.

Another prominent feature is the set of 3 boxes in the top left corner. While the diagonal still stands out, these 3 boxes are also quite intensely colored, meaning that the items within them are often purchased together. What’s important is that each whole box is filled in: every item in a box is frequently purchased alongside every other item in that box, which is precisely the relation we are looking for. Tentatively, we can mark the items in these boxes as forming distinct groups.

One of the other major features of the heatmap is the moderately intense region in the lower left portion of the heatmap (beneath the 3 boxes). Here we see a relation that is not at the level of the boxes but not quite at the noise level seen in other parts of the heatmap. Referring to the interpretation given earlier, these pixels correspond to the conditional probability that an item which is tentatively part of a group is purchased, given that an item which is tentatively not part of a group has been purchased. My personal theory (touched on earlier) is that the items in the groups are purchased quite frequently whereas the items not in the groups are purchased far less often, so these conditional probabilities mostly reflect the base purchase rates of the grouped items. We can check this by analyzing the items with the highest probabilities:

head(sort(probabilities, decreasing = TRUE), n = 15)
##  item_2  item_7 item_29  item_5 item_22  item_3 item_39  item_1 item_42 
## 0.28326 0.28304 0.28302 0.28244 0.28167 0.28112 0.23830 0.23802 0.23796 
##  item_9 item_35 item_13 item_23 item_31 item_24 
## 0.23791 0.23772 0.01663 0.01653 0.01650 0.01647

Clearly, there is a steep drop-off, just as we noticed in the exploratory analysis. Specifically, item 35 is purchased in 23.8% of transactions, which is good enough to make it the 11th most popular item, while the 12th most frequent item, item 13, is purchased at a comparatively dismal 1.7% rate. More importantly, all of the items that we have tentatively put into groups (the first 5 + 3 + 3 = 11 rows and columns of the heatmap) fall into the frequently purchased category. My theory is supported by the data: the items tentatively in groups are purchased more frequently than any other items.

In particular, we can tell that this is a case of chips and toothpaste because the transpose cells are nearly white. (The transpose cell of the cell representing the conditional probability that a transaction contains item X given that it contains item Y is the cell representing the conditional probability that a transaction contains item Y given that it contains item X.) Consider items 9 (the chips) and 31 (the toothpaste): the probability that one purchases item 9 given that item 31 has been purchased (a bottom-left pixel) is somewhat high, whereas the probability that item 31 is purchased given that item 9 has been purchased (the corresponding top-right pixel) is very low. People buying item 31 often pick up item 9 as well, but the relationship is not reciprocal: people rarely buy item 31 when purchasing item 9. This discounts the idea that items 9 and 31 belong in a grouping together, and it brings back the importance of the box idea: each item must be frequently purchased with respect to each of the other items in order to consider the set an appropriate grouping.
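That asymmetry can be read directly off the coincidence matrix: by construction, the row indexes the item we condition on, so the two directions of a pair live in transposed cells. I won’t reproduce the values here, but the check is a one-liner in each direction:

# Row = conditioning item, column = item whose conditional probability we read off
coincidence.matrix["item_31", "item_9"]   # probability of buying item 9 given item 31 was bought (the moderately high direction)
coincidence.matrix["item_9", "item_31"]   # probability of buying item 31 given item 9 was bought (the near-zero direction)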

At this point, it is appropriate to address the elephant in the room: the tree-like structures at the edges of the plot. These are dendrograms based on hierarchical clustering. Hierarchical clustering successively pairs together items based on their relative similarity to each other, and the result is visualized as a dendrogram. A part of a dendrogram with a long vertical section (for example, the left part of the horizontal dendrogram) means that particular set is similar within itself but not very similar to the other items in the tree; a part with a long horizontal section (for example, all the items that we have tentatively pegged as not being part of a grouping) indicates that those items are all quite similar to each other compared to the rest of the tree. By analyzing the dendrogram, we see that the same 3 groups come out (no surprise, as the dendrogram was used to arrange the heatmap). However, we gain the insight that each grouping is very similar internally but not to any of the other groupings or to the set of remaining items. Based on the results of the heatmap and dendrogram, I will say the following groupings exist:

  • First Grouping: Items 1, 9, 35, 39, 42
  • Second Grouping: Items 3, 5, 22
  • Third Grouping: Items 2, 7, 29
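For readers who want the tree without the heatmap, the hierarchical clustering can also be run on its own. This is only a sketch: heatmap uses dist and hclust with their default settings internally (plus an extra reordering of the leaves), so a standalone tree may not match the plot’s ordering exactly.

# Hierarchical clustering of items based on their rows of conditional probabilities
item.distances <- dist(coincidence.matrix)   # Euclidean distances between item rows
item.tree <- hclust(item.distances)          # complete linkage by default
plot(item.tree, main = "Dendrogram of Items")
cutree(item.tree, k = 4)                     # the 3 groupings plus a catch-all cluster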

Confirming Results Through PCA and k-means Clustering

In the last section, we analyzed the various items using a dendrogram. This utilizes local similarities (between different pairs) to distinguish the various clusters. To take a slightly different approach, I will use PCA and k-means clustering. In particular, PCA looks to isolate the most important dimensions of a dataset; in this sense, it takes a global perspective (taking into account the general shape of the data) and summarizes it in a lower-dimensional space. I will follow this up with k-means clustering to isolate the various groups, which will indicate the different product groupings.

PCA is not automatically applicable to every problem. Because it is a dimensionality-reduction technique, one needs to ensure that enough of the original dataset’s variance is retained so that insights from the reduced-dimension dataset carry over to the original. This is where the preserve.diagonal parameter of make.coincidence.matrix comes back: excluding the diagonal decreases the effective dimensionality of the matrix (there are more zero entries and hence, loosely speaking, less going on in the dataset), so I will recalculate coincidence.matrix without the diagonal. Additionally, PCA is usually preceded by normalizing the data (subtracting the mean and dividing by the standard deviation). Here, however, that would introduce many non-zero entries into the matrix, since the currently light regions would be converted into heavily negative ones, so I will neither scale nor center the data. Here is the code to run PCA and determine whether it is appropriate for analyzing the conditional probabilities of the items:

coincidence.matrix <- make.coincidence.matrix(sales = total.sales, preserve.diagonal = FALSE)
coincidence.matrix.pca <- prcomp(coincidence.matrix, center = FALSE, scale. = FALSE)
# Cumulative fraction of total variance retained by the first k principal components
cumulative.variance <- cumsum(coincidence.matrix.pca$sdev^2 / sum(coincidence.matrix.pca$sdev^2))
plot(cumulative.variance, ylim = c(0, 1), type = "l", main = "Determining Acceptability of PCA", ylab = "Variance Retained", xlab = "Number of Principal Components")

head(cumulative.variance, n = 5)
## [1] 0.6004775 0.7750019 0.8454542 0.8644566 0.8833922

One can see that even with the first component, we can retain over 60% of the variance in the original dataset! I will choose to use 3 principal components because we wish to visualize the data. This is acceptable because we retain around 84.5% of the original variance. We can also see this visually with a heatmap:

number.components <- 3
# Project onto the first 3 principal components and map back into the original item space
coincidence.matrix.reduced <- coincidence.matrix.pca$x[, 1:number.components] %*%
                              t(coincidence.matrix.pca$rotation[, 1:number.components])

colors <- colorRampPalette(c("white","royalblue"))(256)
heatmap(t(coincidence.matrix.reduced), col = colors, symm = TRUE, Rowv = map$rowInd, keep.dendro = FALSE, main = "Heatmap of Conditional Probabilities for the Items", xlab = "Item Number", ylab = "Item Number")

This heatmap looks very similar to the heatmap from before (even though the dimensionality has been reduced!). Most of the major features remain, such as the 3 boxes in the top left corner and the region of correspondence below those boxes. The major missing feature is the solid colored diagonal from before; remember that we excluded the diagonal this time, so if we had retained the original variance of the data perfectly, the diagonal boxes would be completely white. Instead, the diagonal is filled in within the box region and not filled in for the rest of the heatmap. This indicates that, for the grouped items, self-similarity is part of such a strong general pattern that the low-rank reconstruction fills the diagonal back in even though we explicitly excluded it.
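One way to check that reading is to inspect the reconstructed diagonal directly: grouped items should come back with noticeably larger self-similarity values than non-grouped items. The values are omitted here, and the particular items are just examples drawn from the tentative groupings above:

# Reconstructed self-similarity for grouped items (2, 7) vs. non-grouped items (0, 4)
check.items <- c("item_2", "item_7", "item_0", "item_4")
sapply(check.items, function(item) coincidence.matrix.reduced[item, item])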

We can also use the principal components analysis to compute a new representation for each item. In essence, each item’s row of conditional probabilities will now be summarized by 3 numbers that position the item along the directions in which the original dataset varies the most.

items.reducedform <- coincidence.matrix.pca$rotation[, 1:number.components]
head(items.reducedform, n = 10)
##                PC1         PC2           PC3
## item_0 -0.03293981  0.02406555 -7.675922e-05
## item_1 -0.35128273 -0.27617522  1.313904e-03
## item_2 -0.23909067  0.31336418 -4.105698e-01
## item_3 -0.23800169  0.31560087  4.049991e-01
## item_4 -0.03355145  0.02635848  1.327477e-04
## item_5 -0.24262142  0.31923635  4.076938e-01
## item_6 -0.03194735  0.02459429  1.083754e-04
## item_7 -0.23843532  0.31354657 -4.103316e-01
## item_8 -0.03276775  0.02517503  4.308741e-04
## item_9 -0.35139249 -0.27384931  3.819674e-04

Based on these new representations, I will cluster the data using the k-means algorithm. Since I hope to minimize the squared error as much as possible while using as few clusters as possible, I will check how many clusters best balances these goals using the following code:

# Total within-cluster sum of squares for k = 1, ..., 20 clusters
fits <- sapply(1:20, function(x) {
        clustering <- kmeans(items.reducedform, x, nstart = 50)
        return(clustering$tot.withinss)
    })
plot(1:length(fits), fits, xlab = "Number of Clusters", ylab = "Sum of Intracluster Squared Error",
     main = "Determining Optimal Number of Clusters")

From this plot, it appears that 4 different clusters would be optimal since not much error is left to reduce after 4 clusters are introduced; introducing more clusters would simply overfit the data. Now, using the particular cluster assignments, we can visualize these data:

desired.fit <- kmeans(items.reducedform, 4, nstart = 50)

library(scatterplot3d) #Plotting Library
scatterplot3d(x = items.reducedform[, 1], y = items.reducedform[, 2], z = items.reducedform[, 3], color = desired.fit$cluster, main = "Visualization of k-means clustering", xlab = "First Principal Component", ylab = "Second Principal Component", zlab = "Third Principal Component")

This plot confirms that 4 distinct clusters are clearly present. In addition, the clusters are compact and well separated from one another, which is ideal for our analysis: it suggests 3 clear product groupings plus one catch-all cluster for the remaining items.

As a final check on the initial results, we can analyze which items should be grouped together according to the 4 cluster fit:

cluster.assignments <- desired.fit$cluster
# Collect the item names assigned to each cluster
cluster.groups <- lapply(unique(cluster.assignments), function(x) {
        indices <- cluster.assignments == x
        return(dimnames(items.reducedform)[[1]][indices])
    })
print(cluster.groups)
## [[1]]
##  [1] "item_0"  "item_4"  "item_6"  "item_8"  "item_10" "item_11" "item_12"
##  [8] "item_13" "item_14" "item_15" "item_16" "item_17" "item_18" "item_19"
## [15] "item_20" "item_21" "item_23" "item_24" "item_25" "item_26" "item_27"
## [22] "item_28" "item_30" "item_31" "item_32" "item_33" "item_34" "item_36"
## [29] "item_37" "item_38" "item_40" "item_41" "item_43" "item_44" "item_45"
## [36] "item_46" "item_47" "item_48" "item_49"
## 
## [[2]]
## [1] "item_1"  "item_9"  "item_35" "item_39" "item_42"
## 
## [[3]]
## [1] "item_2"  "item_7"  "item_29"
## 
## [[4]]
## [1] "item_3"  "item_5"  "item_22"

The first group is the “no grouping” group since it lumps together many different items with little similarity to one another (as shown by the heatmap). The rest of the groups match the groups I presented earlier, so we have in fact shown 2 different ways to arrive at the same grouping results.
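As one last cross-check (output omitted), the frequency argument from earlier and the clustering should agree: the catch-all cluster should have a much lower average purchase probability than the three real groupings.

# Average purchase probability within each cluster; the "no grouping" cluster should stand out as low
sapply(cluster.groups, function(group) mean(probabilities[group]))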

Conclusions

To summarize, we analyzed sales data that recorded, for each of 50 items, whether the item was part of a given transaction. We used several techniques (heatmaps, dendrograms, PCA, and k-means clustering) to analyze the data and extract the relevant groupings, which are listed below. Thanks for reading!

  • First Grouping
    • Item 1
    • Item 9
    • Item 35
    • Item 39
    • Item 42
  • Second Grouping
    • Item 3
    • Item 5
    • Item 22
  • Third Grouping
    • Item 2
    • Item 7
    • Item 29