The goal of this tutorial is to remove empty tickets (transactions with no items) from a transaction file. Blank lines, for example at the end of a file, can be loaded as empty transactions, and we may need to get rid of them.
# We need to load two libraries to perform this task
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
## Loading required package: grid
# In this tutorial we use the basket transaction file format: one ticket per line, with items separated by commas.
# The file looks like:
# apple, orange, pear
# apple, pear
# orange
# etc
# First we load the data
products <- read.transactions("Transactions_all.csv", sep = ",", format = "basket", rm.duplicates = TRUE)
## Warning in readLines(file, encoding = encoding): incomplete final line
## found on 'Transactions_all.csv'
## distribution of transactions with duplicates:
## items
## 1 2
## 191 10
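To see where empty tickets can come from, here is a minimal sketch that writes a small basket file containing blank lines to a temporary file and reads it back. The file contents and the names `tmp` and `toy` are made up for illustration. The sketch assumes that blank lines are read as transactions with zero items, which is what the summary of the real file suggests.

```r
library(arules)

# Hypothetical basket file: the two blank lines mimic empty tickets
tmp <- tempfile(fileext = ".csv")
writeLines(c("apple,orange,pear", "apple,pear", "", "orange", ""), tmp)

# Read it the same way as the real file
toy <- read.transactions(tmp, sep = ",", format = "basket")

# Number of items per ticket; blank lines are expected to show up as 0
toy_sizes <- size(toy)
print(toy_sizes)

# Clean up the temporary file
unlink(tmp)
```

Counting the zeros with `sum(toy_sizes == 0)` is a quick way to check whether a file needs this kind of cleaning.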
summary(products)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 125 columns (items) and a density of 0.03506172
##
## most frequent items:
##                     iMac                HP Laptop CYBERPOWER Gamer Desktop
##                     2519                     1909                     1809
##            Apple Earpods        Apple MacBook Air                  (Other)
##                     1715                     1530                    33622
##
## element (itemset/transaction) length distribution:
## sizes
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14
##    2 2163 1647 1294 1021  856  646  540  439  353  247  171  119   77   72
##   15   16   17   18   19   20   21   22   23   25   26   27   29   30
##   56   41   26   20   10   10   10    5    3    1    1    3    1    1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   2.000   3.000   4.383   6.000  30.000
##
## includes extended item information - examples:
##                             labels
## 1 1TB Portable External Hard Drive
## 2 2TB Portable External Hard Drive
## 3                   3-Button Mouse
# We can see the number of items that are bought in every ticket with the size function
head(size(products))
## [1] 4 3 1 4 4 5
# We can plot the distribution of items per transaction
hist(size(products))
# Using the size and which functions we can filter out empty tickets
# We keep only tickets with more than 0 items
products_full <- products[which(size(products) != 0)]
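As a side note, `which` is not strictly required here, because transactions can also be indexed with a logical vector. The sketch below illustrates this on a small made-up data set; `toy`, `keep`, `toy_full` and `same_result` are hypothetical names used only for this example.

```r
library(arules)

# Toy data (hypothetical): the second ticket is empty
toy <- as(list(c("apple", "pear"), character(0), "orange"), "transactions")

# A logical vector can index transactions directly
keep <- size(toy) > 0
toy_full <- toy[keep]

# Both approaches should keep the same tickets
same_result <- identical(size(toy_full), size(toy[which(size(toy) != 0)]))
print(same_result)
```

Either form works for this task; `which` converts the logical vector to positions, which can be handy if the positions of the removed tickets are needed later.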
# We can check with a summary that the empty tickets are gone: 9833 rows instead of 9835, and no tickets of size 0
summary(products_full)
## transactions as itemMatrix in sparse format with
## 9833 rows (elements/itemsets/transactions) and
## 125 columns (items) and a density of 0.03506885
##
## most frequent items:
##                     iMac                HP Laptop CYBERPOWER Gamer Desktop
##                     2519                     1909                     1809
##            Apple Earpods        Apple MacBook Air                  (Other)
##                     1715                     1530                    33622
##
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
## 2163 1647 1294 1021  856  646  540  439  353  247  171  119   77   72   56
##   16   17   18   19   20   21   22   23   25   26   27   29   30
##   41   26   20   10   10   10    5    3    1    1    3    1    1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   2.000   3.000   4.384   6.000  30.000
##
## includes extended item information - examples:
##                             labels
## 1 1TB Portable External Hard Drive
## 2 2TB Portable External Hard Drive
## 3                   3-Button Mouse
In this tutorial we have learnt how to remove empty tickets. The same logic, combining the size and which functions, can be used to filter transactions by any number of items.
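As an illustration of filtering by other ticket sizes, the following sketch keeps tickets with at least two items and, separately, tickets with exactly one item. It uses a small made-up data set rather than `Transactions_all.csv`; the names `toy`, `n_items`, `multi_item` and `single_item` and the thresholds are assumptions chosen for the example.

```r
library(arules)

# Toy data (hypothetical): tickets with 3, 2, 1 and 0 items
toy <- as(
  list(c("apple", "orange", "pear"), c("apple", "pear"), "orange", character(0)),
  "transactions"
)

# Number of items in each ticket
n_items <- size(toy)

# Keep tickets with at least two items
multi_item <- toy[which(n_items >= 2)]

# Keep tickets with exactly one item
single_item <- toy[which(n_items == 1)]

print(size(multi_item))
print(length(single_item))
```

Only the condition inside `which` changes between the two filters, so any other rule on ticket size can be written the same way.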