Homework 10

suppressWarnings(suppressMessages(library(data.table)))
suppressWarnings(suppressMessages(library(knitr)))
suppressWarnings(suppressMessages(library(arules)))
suppressWarnings(suppressMessages(library(arulesViz)))

Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of the items that went into a customer’s basket, hence the name ‘Market Basket Analysis’. That is exactly what the Groceries Data Set contains: a collection of receipts, with each line representing one receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached. Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift, and your top 10 rules by lift.

groc <- read.transactions("https://raw.githubusercontent.com/gpsingh12/Data624/master/Recommender-System/GroceryDataSet.csv", sep=",")
summary(groc)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55 
##   16   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   46   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

head(sort(itemFrequency(groc), decreasing =TRUE), n=10)

##       whole milk other vegetables       rolls/buns             soda 
##       0.25551601       0.19349263       0.18393493       0.17437722 
##           yogurt    bottled water  root vegetables   tropical fruit 
##       0.13950178       0.11052364       0.10899847       0.10493137 
##    shopping bags          sausage 
##       0.09852567       0.09395018

itemFrequencyPlot(groc, topN=10)

We will use the transaction data to find associations between the items in the dataset. The Apriori algorithm will be used to build association rules between the items. To use this function, we first need to be familiar with the parameters support and confidence.

Support: the frequency of an item (or itemset) in the dataset, expressed as the proportion of transactions that contain it. For example, if 5 out of 10 baskets contain apples, then support(apples) = 5/10 = 0.5. In our case we use a minimum support of 0.001. The total number of transactions (nrow(groc)) is 9835, and apriori reports the corresponding absolute minimum support count as 9.

Confidence: how likely item Y is to be purchased when item X is purchased, written {X -> Y}. For example, a person who buys milk may be likely to buy bread as well. It is the proportion of transactions containing X in which Y also appears, calculated as the support of X and Y together divided by the support of X. Confidence is directional: {X -> Y} and {Y -> X} generally have different values.
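As a minimal sketch of these definitions, the following base R snippet computes support, confidence and lift by hand on a small made-up set of baskets. The basket contents and the helper name `support_of` are hypothetical and are not part of the Groceries data. Lift, which the assignment also asks for, is confidence(X -> Y) divided by support(Y); a value above 1 means X and Y occur together more often than they would if they were independent.

```r
# Toy data (hypothetical): five baskets, each a character vector of items
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "eggs"),
  c("milk"),
  c("bread", "eggs"),
  c("milk", "bread", "butter")
)

# Fraction of baskets that contain every item in `items`
support_of <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

supp_x  <- support_of("milk", baskets)              # support(X)
supp_y  <- support_of("bread", baskets)             # support(Y)
supp_xy <- support_of(c("milk", "bread"), baskets)  # support(X and Y together)

conf_xy <- supp_xy / supp_x  # confidence(X -> Y)
lift_xy <- conf_xy / supp_y  # lift(X -> Y)

print(c(support = supp_xy, confidence = conf_xy, lift = lift_xy))
```

In this toy data the lift comes out below 1, so milk and bread co-occur slightly less often than independence would predict.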

minlen is set to 2 so that every rule contains at least two items, i.e. a non-empty left-hand side plus a right-hand side. This excludes rules whose left-hand side is empty.

https://www.quora.com/What-is-support-and-confidence-in-data-mining

We now run the apriori algorithm with these parameters.

rule_groc <- apriori(groc, parameter = list(support = 0.001, confidence = 0.6, minlen = 2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [2918 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].

Inspect rules - Top 10 rules by lift

inspect(sort(rule_groc, by = "lift")[1:10])

##      lhs                        rhs                      support confidence      lift count
## [1]  {Instant food products,                                                               
##       soda}                  => {hamburger meat}     0.001220132  0.6315789 18.995654    12
## [2]  {popcorn,                                                                             
##       soda}                  => {salty snack}        0.001220132  0.6315789 16.697793    12
## [3]  {ham,                                                                                 
##       processed cheese}      => {white bread}        0.001931876  0.6333333 15.045491    19
## [4]  {other vegetables,                                                                    
##       tropical fruit,                                                                      
##       white bread,                                                                         
##       yogurt}                => {butter}             0.001016777  0.6666667 12.030581    10
## [5]  {hamburger meat,                                                                      
##       whipped/sour cream,                                                                  
##       yogurt}                => {butter}             0.001016777  0.6250000 11.278670    10
## [6]  {domestic eggs,                                                                       
##       other vegetables,                                                                    
##       tropical fruit,                                                                      
##       whole milk,                                                                          
##       yogurt}                => {butter}             0.001016777  0.6250000 11.278670    10
## [7]  {liquor,                                                                              
##       red/blush wine}        => {bottled beer}       0.001931876  0.9047619 11.235269    19
## [8]  {butter,                                                                              
##       other vegetables,                                                                    
##       sugar}                 => {whipped/sour cream} 0.001016777  0.7142857  9.964539    10
## [9]  {butter,                                                                              
##       hard cheese,                                                                         
##       whole milk}            => {whipped/sour cream} 0.001423488  0.6666667  9.300236    14
## [10] {butter,                                                                              
##       fruit/vegetable juice,                                                               
##       other vegetables,                                                                    
##       tropical fruit}        => {whipped/sour cream} 0.001016777  0.6666667  9.300236    10

We selected our top 10 rules by lift. The list shows that instant food products and soda are strongly associated with hamburger meat (lift of about 19). Similarly, popcorn and soda are strongly associated with salty snack (lift of about 16.7). Rule 7 shows that the confidence of {liquor, red/blush wine} => {bottled beer} is about 90%: a customer who buys both liquor and red/blush wine is very likely to buy bottled beer as well. All ten rules have low support (10 to 19 transactions each), so they describe fairly rare baskets. The rules can also be sorted by confidence to find the most reliable associations.
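The columns of the rule table are tied together by simple arithmetic, which the following base R sketch checks for rules 1, 3 and 7 using values copied from the output above. It assumes the 9835 transactions reported by summary(groc); the data frame `top` and its column names are illustrative only. The count is support times the number of transactions, count divided by confidence recovers how many transactions contain the whole left-hand side, and confidence divided by lift gives the support of the right-hand side implied by the table.

```r
n_trans <- 9835  # number of transactions reported by summary(groc)

# Values copied from rules 1, 3 and 7 of the inspect() output
top <- data.frame(
  rule       = c(1, 3, 7),
  support    = c(0.001220132, 0.001931876, 0.001931876),
  confidence = c(0.6315789, 0.6333333, 0.9047619),
  lift       = c(18.995654, 15.045491, 11.235269),
  count      = c(12, 19, 19)
)

# count = support * number of transactions
top$count_check <- round(top$support * n_trans)

# transactions containing the whole left-hand side = count / confidence
top$lhs_count <- round(top$count / top$confidence)

# right-hand-side support implied by lift = confidence / support(rhs)
top$rhs_support <- top$confidence / top$lift

print(top)
```

For rule 7 this gives 21 transactions that contain both liquor and red/blush wine, 19 of which also contain bottled beer.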

# Plot the top 10 rules by lift as a graph
plot(sort(rule_groc, by = "lift")[1:10], method = "graph")

The graph shows which items are associated with one another across the plotted rules. Further analysis can be performed by adjusting the confidence threshold.

Further analysis: depending on specific requirements, the parameters of the apriori algorithm can be tuned to get the required results. For example, if milk sales are high, we can use the algorithm to find sets of items that are associated with milk and therefore have a higher chance of being sold along with it.
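One way to sketch the milk example is to restrict the right-hand side of the rules through the appearance argument of apriori. The snippet below uses a small made-up list of baskets so that it runs on its own; the basket contents and the thresholds are assumptions chosen for illustration, and in this report `groc` would take the place of `toy_trans`.

```r
suppressPackageStartupMessages(library(arules))

# Toy baskets (hypothetical), standing in for the Groceries transactions
toy_baskets <- list(
  c("whole milk", "bread"),
  c("whole milk", "bread", "butter"),
  c("whole milk", "butter"),
  c("bread", "butter"),
  c("whole milk", "yogurt"),
  c("yogurt", "bread")
)
toy_trans <- as(toy_baskets, "transactions")

# Only allow "whole milk" on the right-hand side; any item may appear on the left
milk_rules <- apriori(
  toy_trans,
  parameter  = list(support = 0.3, confidence = 0.6, minlen = 2),
  appearance = list(default = "lhs", rhs = "whole milk"),
  control    = list(verbose = FALSE)
)

# Show the strongest associations with milk first
inspect(sort(milk_rules, by = "lift"))
```

The thresholds here are far higher than the ones used on the Groceries data because the toy set has only six transactions; on real data they would need to be tuned again.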

Gurpreet Singh

May 4, 2019