Preliminary Affinity Analysis

Introduction

The following is a preliminary exploration of frequent itemsets within the Marine Fasteners customer base. In order to aid the sales team in upselling and item recommendations, Marine Fasteners CSO Jose Espejo tasked me with finding items that are frequently purchased together. After thorough research, data cleaning and manipulation, and some programming, I have produced an introductory high-level view of frequent itemsets that show clear correlations between certain item categories. I believe this analysis will be useful moving forward for more specific insights into various industries and future customers.

Framework

Performing research for this project, I came across what is known as an Apriori algorithm. Frequently used in machine learning, the Apriori algorithm (called such because of its use of “prior knowledge” to find frequent itemsets) is meant for association rule mining. The algorithm takes transaction data fed into it in a particular format and analyzes the ways in which items relate to each other in terms of frequency and pairing. In this project, I used R programming to apply the Apriori algorithm to historical transaction data in the Marine Fasteners database.

Tools

This project uses a series of technological tools, including SQL (SMSS), R, and MS Excel. I began mining data with SQL, querying the database with broad parameters. I then moved this data to Excel, where I cleaned and transposed the data into the format that the algorithm requires (more on that later). I then moved to RStudio to execute the analysis and put together this report.

Now, on to the data!

Mining in SQL

First, I wanted to get a high-level view of what each of our customers have in their invoice history, organized by customer, with no time parameters. To do this, I produced this simple query:

SELECT rARInvHistory.chst_customer AS cust, rARInvHistory.cihs_item AS item, COUNT(*) AS count
FROM MARINE.dbo.rARInvHistory
GROUP BY rARInvHistory.chst_customer, rARInvHistory.cihs_item
ORDER BY rARInvHistory.chst_customer

This query produced roughly 800,000 rows, with every customer represented with every item that they had ever purchased, along with a count of how many times that item had been invoiced:

Seeing as this output is very dense and likely full of data from decades ago and customers/items that are no longer active, but still wanting to keep a bird’s-eye view, I simply exported it as a CSV to clean further in Excel.

Cleaning and Formatting in Excel

The Apriori algorithm is often associated with a Market Basket Analysis, as it reads each transaction as part of a larger dataset, or “basket”. In other words, it analyzes data similarly to how a grocery store would analyze all of the shopping carts in the store and what items are in them, finding associations and frequent pairs in order to market their items better. In order to do this, the algorithm requires that the data be fed to it in a particular format. Each “transaction”, that is, each group of items purchased together, needs to be on its own row, spread out horizontally, each cell representing one item.

Further, I realized that since many customers purchase different variations of the same essential part, comparing all the parts with each other would end up with associations that bare no useful data, as we don’t need to know that customers who buy, for instance, 5/16”X2” Hex Head Stainless Steel Cap Screws also buy 5/16”X1.5” Hex Head Stainless Steel Cap Screws. So I went about disregarding the specific dimensions of each part and simply focused in on the family that each part belongs to. Using a simple =LEFT function, I extracted just the first five characters from each part number to focus in on the part family, removed duplicates, and disregarded part numbers such as B-… to make the analysis readable and useful. (Note: future applications of this analysis will certainly take into consideration these disregarded parts, but for the sake of testing the algorithm, I wanted to make this particular project clean and useful)

After cleaning the data, I extracted from the CSV a list of each customer number and applied a =FILTER function combine with =TRANSPOSE to display the items horizontally, producing a spreadsheet in this format:

Seeing as we want to find itemsets, I cleared out rows that had only one item, and was left with 2,466 rows, each representing a customer and the unique item families that they have historically purchased.

Applying the Algorithm in RStudio

Initial Analysis and Findings

Moving into RStudio, I set up the environment, loaded the dataset, and installed and loaded the necessary packages.

install.packages("arules")
library(arules)
install.packages("arulesViz")
library(arulesViz)
install.packages("dplyr")
library(dplyr)
install.packages("tidyverse")
library(tidyverse)

dataset = read.csv("W:/apriori/cust_prod_usage_nofirst.csv", header = FALSE)

Then I converted the CSV to a “transaction” format, necessary for the algorithm to function properly.

dataset = read.transactions("W:/apriori/cust_prod_usage_nofirst.csv", sep = ',', rm.duplicates = TRUE)

Taking a quick glance at the data, we can already find some useful information:

summary(dataset)

## transactions as itemMatrix in sparse format with
##  2466 rows (elements/itemsets/transactions) and
##  2551 columns (items) and a density of 0.007060189 
## 
## most frequent items:
##   HHCSS   WFLSS   ESN79   NFISS   PPMSS (Other) 
##    1522    1382    1162    1124    1031   38193 
## 
## element (itemset/transaction) length distribution:
## sizes
##   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21 
## 262 237 163 122 120 108  98  90  74  63  72  57  54  33  40  45  39  43  34  39 
##  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41 
##  36  34  40  25  37  27  28  15  14  13  25  25  25  17  18  10  10  19  12   7 
##  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61 
##  11  14   7   7   7   8   9  10   5  10   9   4   3   6   4   3   3   5   2   4 
##  62  63  64  65  66  67  68  70  71  72  73  74  75  76  78  80  82  83  84  85 
##   2   5   4   1   4   5   1   1   2   2   2   2   3   1   2   4   3   4   3   1 
##  87  89  90  91  93  94  95  97  98 102 106 107 109 111 113 116 117 120 130 134 
##   1   2   1   2   2   3   1   2   2   3   1   1   2   2   3   1   2   1   1   1 
## 135 138 140 142 144 146 151 159 164 169 171 186 192 206 227 305 455 
##   2   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1   1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    4.00   10.00   18.01   23.00  455.00 
## 
## includes extended item information - examples:
##   labels
## 1  12479
## 2  14305
## 3  14671

As we can see, the top five item families are:
— HHCSS: Hex Head Cap Screws
— WFLSS: Stainless Steel Flat Washers
— ESN79: Nylon Insert Lock Nuts
— NFISS: Stainless Steel Hex Finish Nuts
— PPMSS: Stainless Steel Phillips Pan Machine Screws

Visualizing this information, we can see the popularity of the top 25 item families compared to others:

itemFrequencyPlot(dataset, topN = 25)

Frequent Itemsets

To apply the Apriori algorithm, I used the eclat() function. The code includes mathematical “support”, which is the idea of how frequently an item is represented in the data. It is used to tell the algorithm to ignore certain items that don’t show frequently enough. The support can be adjusted depending on each analysis, but for our purposes we can set it to the default which is 0.1. This will mean that no itemset that appears less than 10% of the time will show. The aspects “minlen” and “maxlen” refer to how short and how long we want to find associations. In this case, I wanted to find at least pairs (2) and at most groups of 4 items.

rules = eclat(data = dataset, parameter = list(support = 0.1, minlen = 2, maxlen = 4))

## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE     0.1      2      4 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 246 
## 
## create itemset ... 
## set transactions ...[2551 item(s), 2466 transaction(s)] done [0.01s].
## sorting and recoding items ... [40 item(s)] done [0.00s].
## creating bit matrix ... [40 row(s), 2466 column(s)] done [0.00s].
## writing  ... [1797 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].

Taking a glance at the data, let’s find out some interesting sets, up to 25:

inspect(sort(rules, by = 'support')[1:25])

##      items                        support   count
## [1]  {HHCSS, WFLSS}               0.4789132 1181 
## [2]  {NFISS, WFLSS}               0.4075426 1005 
## [3]  {HHCSS, NFISS}               0.4063260 1002 
## [4]  {ESN79, HHCSS}               0.3937551  971 
## [5]  {ESN79, WFLSS}               0.3929440  969 
## [6]  {HHCSS, NFISS, WFLSS}        0.3751014  925 
## [7]  {ESN79, HHCSS, WFLSS}        0.3527981  870 
## [8]  {WFLSS, WMLSS}               0.3353609  827 
## [9]  {HHCSS, WMLSS}               0.3345499  825 
## [10] {HHCSS, PPMSS}               0.3292782  812 
## [11] {PPMSS, WFLSS}               0.3183293  785 
## [12] {NFISS, WMLSS}               0.3134631  773 
## [13] {HHCSS, WFLSS, WMLSS}        0.3130576  772 
## [14] {ESN79, PPMSS}               0.3126521  771 
## [15] {ESN79, NFISS}               0.3110300  767 
## [16] {NFISS, WFLSS, WMLSS}        0.2980535  735 
## [17] {HHCSS, NFISS, WMLSS}        0.2952149  728 
## [18] {ESN79, HHCSS, NFISS}        0.2907543  717 
## [19] {ESN79, NFISS, WFLSS}        0.2903487  716 
## [20] {ESN79, PPASS}               0.2891322  713 
## [21] {PPASS, PPMSS}               0.2871046  708 
## [22] {PPASS, WFLSS}               0.2862936  706 
## [23] {HHCSS, PPASS}               0.2854826  704 
## [24] {HHCSS, PPMSS, WFLSS}        0.2850770  703 
## [25] {HHCSS, NFISS, WFLSS, WMLSS} 0.2834550  699

These are just a few of the frequent itemsets in the data, ranked from most to least frequent. What this data is showing is that Stainless Steel Hex Head Cap Screws and Stainless Steel Flat Washers are purchased together almost half the time – a very strong correlation. Out of 2,466 customers, 1,181 have purchased both of these items. Going down the list, it’s no surprise that the major players in these correlations are in the top five most frequently purchased items, as shown in the bar graph above.

Findings and Moving Forward

As this is a preliminary, high-level exploration of item affinity, there is perhaps no major surprises here. However, it does confirm that the algorithm works and can be applied to more specific analyses moving forward. I’d like to continue to explore this avenue of inquiry into our database, analyzing frequent itemsets among particular industries and timeframes. I believe this will continue to help our sales team think of our products not as individual items but as part of larger groups, in order to upsell and give informed recommendations. This data can also assist our purchasing department, as we can preemptively purchase a item family’s “item partner” if the other begins to trend up in sales, thereby decreasing lead times and more efficiently serving our customers.

As I continue to sink my teeth into this kind of analysis, I would improve upon the technique outlined above by incorporating better grouping strategies and item prefixes, along with tighter parameters when extracting and cleaning the data.

Compiled exclusively for Marine Fasteners, Inc., November 2022.

Preliminary Affinity Analysis - Marine Fasteners

Dylan March, Data Analyst

November 11, 2022