1 Goal


The goal of this tutorial is to remove empty tickets (transactions with no items) from a transaction file. Empty lines in the file, for example at the end, can be loaded as empty transactions, and we may need to get rid of them before analysing the data.


2 Data loading


# We need to load two libraries to perform this task
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)
## Loading required package: grid
# In this tutorial we use the basket file format:
# one transaction per line, with items separated by commas.
# The file looks like:
# apple, orange, pear
# apple, pear
# orange
# etc

# First we load the data
products <- read.transactions("Transactions_all.csv", sep = ",", format = "basket", rm.duplicates = TRUE)
## Warning in readLines(file, encoding = encoding): incomplete final line
## found on 'Transactions_all.csv'
## distribution of transactions with duplicates:
## items
##   1   2 
## 191  10
summary(products)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  125 columns (items) and a density of 0.03506172 
## 
## most frequent items:
##                     iMac                HP Laptop CYBERPOWER Gamer Desktop 
##                     2519                     1909                     1809 
##            Apple Earpods        Apple MacBook Air                  (Other) 
##                     1715                     1530                    33622 
## 
## element (itemset/transaction) length distribution:
## sizes
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
##    2 2163 1647 1294 1021  856  646  540  439  353  247  171  119   77   72 
##   15   16   17   18   19   20   21   22   23   25   26   27   29   30 
##   56   41   26   20   10   10   10    5    3    1    1    3    1    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   4.383   6.000  30.000 
## 
## includes extended item information - examples:
##                             labels
## 1 1TB Portable External Hard Drive
## 2 2TB Portable External Hard Drive
## 3                   3-Button Mouse
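
The summary shows the empty tickets in the `0` column of the size distribution. As a minimal sketch, they can also be counted and located directly. The toy data below is hypothetical (it is not `Transactions_all.csv`) and is built in memory so the example runs on its own.

```r
library(arules)

# Hypothetical toy data: four tickets, the third one is empty
toy_list <- list(
  c("apple", "orange", "pear"),
  c("apple", "pear"),
  character(0),
  "orange"
)
toy <- as(toy_list, "transactions")

# Number of empty tickets
n_empty <- sum(size(toy) == 0)

# Positions of the empty tickets within the transactions object
empty_pos <- which(size(toy) == 0)

print(n_empty)
print(empty_pos)
```

The same two expressions should work on `products`, since they only rely on `size()`.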

3 Number of items per transaction


# The size function returns the number of items bought in each ticket
head(size(products))
## [1] 4 3 1 4 4 5
# We can plot the distribution of items per transaction
hist(size(products))
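
A histogram groups sizes into bins, so a frequency table may be easier to read when we only care about exact counts such as the number of tickets of size 0. The sketch below uses hypothetical toy data rather than the tutorial file.

```r
library(arules)

# Hypothetical toy data: five tickets with sizes 3, 2, 0, 1 and 2
toy <- as(
  list(
    c("apple", "orange", "pear"),
    c("apple", "pear"),
    character(0),
    "orange",
    c("orange", "pear")
  ),
  "transactions"
)

# Exact number of tickets for each ticket size
size_counts <- table(size(toy))
print(size_counts)
```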


4 Filter empty tickets


# Using the size function and the which function we can filter empty tickets
# We keep only tickets with more than 0 items
products_full <- products[which(size(products) != 0)]

# We can check that we removed the empty tickets with a summary:
# the row count drops from 9835 to 9833 and size 0 no longer appears
summary(products_full)
## transactions as itemMatrix in sparse format with
##  9833 rows (elements/itemsets/transactions) and
##  125 columns (items) and a density of 0.03506885 
## 
## most frequent items:
##                     iMac                HP Laptop CYBERPOWER Gamer Desktop 
##                     2519                     1909                     1809 
##            Apple Earpods        Apple MacBook Air                  (Other) 
##                     1715                     1530                    33622 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2163 1647 1294 1021  856  646  540  439  353  247  171  119   77   72   56 
##   16   17   18   19   20   21   22   23   25   26   27   29   30 
##   41   26   20   10   10   10    5    3    1    1    3    1    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.384   6.000  30.000 
## 
## includes extended item information - examples:
##                             labels
## 1 1TB Portable External Hard Drive
## 2 2TB Portable External Hard Drive
## 3                   3-Button Mouse
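
As an alternative sketch, the same filter can be written with a logical vector instead of `which`, followed by an explicit check that no empty ticket remains. The toy data and the variable names `keep` and `toy_full` are assumptions made for this illustration.

```r
library(arules)

# Hypothetical toy data: four tickets, the third one is empty
toy <- as(
  list(
    c("apple", "orange", "pear"),
    c("apple", "pear"),
    character(0),
    "orange"
  ),
  "transactions"
)

# TRUE for every ticket that contains at least one item
keep <- size(toy) > 0

# Logical subsetting keeps only the tickets flagged TRUE
toy_full <- toy[keep]

# Same result as the which() version used in this tutorial
toy_which <- toy[which(size(toy) != 0)]

# Explicit checks instead of reading the whole summary
stopifnot(all(size(toy_full) > 0))
stopifnot(identical(size(toy_full), size(toy_which)))
```

Both forms select the same tickets here; which one to use is mostly a matter of taste.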

5 Conclusion


In this tutorial we have learnt how to remove empty tickets. The same logic, combining the size and which functions, can be used to filter transactions by any number of items.
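
As an illustration of that generalisation, here is a minimal sketch of a helper that keeps tickets whose size falls within a range. `filter_by_size` and its parameters `min_items` and `max_items` are hypothetical names introduced here, not part of arules, and the toy data is made up.

```r
library(arules)

# Hypothetical helper (not part of arules): keep tickets whose number
# of items lies between min_items and max_items, both inclusive
filter_by_size <- function(trans, min_items = 1, max_items = Inf) {
  n <- size(trans)
  trans[which(n >= min_items & n <= max_items)]
}

# Hypothetical toy data with ticket sizes 3, 2, 0 and 1
toy <- as(
  list(
    c("apple", "orange", "pear"),
    c("apple", "pear"),
    character(0),
    "orange"
  ),
  "transactions"
)

# Tickets with at least two items
multi <- filter_by_size(toy, min_items = 2)

# Non-empty tickets with at most two items
small <- filter_by_size(toy, min_items = 1, max_items = 2)

print(size(multi))
print(size(small))
```

With the default arguments the helper reproduces the empty-ticket filter from section 4.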