Algoritma FP-Growth
Definisi
FP-Growth adalah salah satu alternatif algoritma yang dapat digunakan untuk menentukan himpunan data yang paling sering muncul (frequent itemset) dalam sebuah kumpulan data (Indah Mulia Sari : 2013).
FP-Growth adalah algoritma pencarian frequent item sets yang didapat dari FP-tree dengan menjelajahi tree dari bawah menuju ke atas (Tan, Steinbach, & Kumar, 2004).
FP-growth adalah salah satu alternatif algoritma yang dapat digunakan untuk menentukan himpunan data yang paling sering muncul (frequent itemset)Â dalam sebuah kumpulan data.
FP-growth menggunakan pendekatan yang berbeda dari paradigma yang digunakan pada algoritma Apriori (Sensuse, 2012).
FP-Growth adalah algoritma pencarian frequent itemsets yang didapat dari FP-tree (Rama Novta Miraldi, 2014).
Jadi dapat disimpulkan bahwa FP-Growth adalah salah satu algoritma yang digunakan untuk mencari himpunan data yang sering muncul dari sekumpulan data, dengan menggunakan cara pembangktan stuktur data Tree.
Tujuan
Tujuan dari algoritma FP-Growth adalah sama dengan Apriori yaitu mencari Association Rules (hubungan antar item di dalam sebuah dataset yang telah ditentukan)
Kelebihan Algoritma FP-Growth
FP-Growth merupakan algoritma yang lebih baik dibandingkan dengan algoritma apriori dikarenakan algoritma apriori yang memakan waktu saat menentukan kandidat itemset dan membaca database yang berulang-ulang, sedangkan FP-Growth hanya membaca database satu kali saat pembuatan FP-Tree dan menggunakan kandidat itemset dalam proses pembentukan frequent pattern.
Tahapan Algoritma FP-Growth
- Siapkan Dataset
- Pengurutan berdasarkan nilai frekuensi kemunculan item yang terbesar
- Membentuk FP-tree
- Membentuk Conditional Pattern Base
- Membentuk Conditional FP-tree
- Membentuk Frequent Pattern Generated
- Mencari frequency 2 Itemset
- Mencari Support 2 Itemset 9 Mencari Confidance 2 Itemset
Eksperimen Algoritma
library
if(sessionInfo()['basePkgs']=="dplyr" | sessionInfo()['otherPkgs']=="dplyr"){
detach(package:dplyr, unload=TRUE)
}
if(sessionInfo()['basePkgs']=="tm" | sessionInfo()['otherPkgs']=="tm"){
detach(package:sentiment, unload=TRUE)
detach(package:tm, unload=TRUE)
}
library(plyr)
library(arules)## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
library(Matrix)
library(base)Masukkan Dataset
groceries <- readxl:: read_excel("~/Semester 3/Data Mining/Groceries_dataset.xlsx")
class(groceries)## [1] "tbl_df" "tbl" "data.frame"
Pembersihan Data dan Eksplor
str(groceries)## tibble [38,765 × 3] (S3: tbl_df/tbl/data.frame)
## $ Member_number : num [1:38765] 1808 2552 2300 1187 3037 ...
## $ Date : chr [1:38765] "21-07-2015" "42125" "19-09-2015" "42350" ...
## $ itemDescription: chr [1:38765] "tropical fruit" "whole milk" "pip fruit" "other vegetables" ...
head(groceries)## # A tibble: 6 × 3
## Member_number Date itemDescription
## <dbl> <chr> <chr>
## 1 1808 21-07-2015 tropical fruit
## 2 2552 42125 whole milk
## 3 2300 19-09-2015 pip fruit
## 4 1187 42350 other vegetables
## 5 3037 42006 whole milk
## 6 4941 14-02-2015 rolls/buns
Mengecek Nilai NA
sum(is.na(groceries))## [1] 0
Konversi Nomor Anggota menjadi numrtik
sorted <- groceries[order(groceries$Member_number),]Mengkonversi deskripsi item ke format kategoris
sorted$Member_number <- as.numeric(sorted$Member_number)
str(sorted)## tibble [38,765 × 3] (S3: tbl_df/tbl/data.frame)
## $ Member_number : num [1:38765] 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
## $ Date : chr [1:38765] "27-05-2015" "24-07-2015" "15-03-2015" "25-11-2015" ...
## $ itemDescription: chr [1:38765] "soda" "canned beer" "sausage" "sausage" ...
Kelompokkan semua item yang dibeli bersama oleh pelanggan yang sama pada tanggal yang sama
itemList <- ddply(sorted, c("Member_number","Date"), function(df1)paste(df1$itemDescription,collapse = ","))
head(itemList,15)## Member_number Date V1
## 1 1000 15-03-2015 sausage,whole milk,semi-finished bread,yogurt
## 2 1000 24-06-2014 whole milk,pastry,salty snack
## 3 1000 24-07-2015 canned beer,misc. beverages
## 4 1000 25-11-2015 sausage,hygiene articles
## 5 1000 27-05-2015 soda,pickled vegetables
## 6 1001 14-04-2015 beef,white bread
## 7 1001 20-01-2015 frankfurter,soda,whipped/sour cream
## 8 1001 41822 sausage,whole milk,rolls/buns
## 9 1001 41985 whole milk,soda
## 10 1001 42040 frankfurter,curd
## 11 1002 26-04-2014 butter,whole milk
## 12 1002 26-04-2015 tropical fruit,sugar
## 13 1002 30-08-2015 butter milk,specialty chocolate
## 14 1002 41884 frozen vegetables,other vegetables
## 15 1003 15-10-2014 root vegetables,detergent
Menghapus nomor dan tanggal anggota
itemList$Member_number <- NULL
itemList$Date <- NULL
colnames(itemList) <- c("itemList")write.csv(itemList,"ItemList.csv", quote = FALSE, row.names = TRUE)
head(itemList)## itemList
## 1 sausage,whole milk,semi-finished bread,yogurt
## 2 whole milk,pastry,salty snack
## 3 canned beer,misc. beverages
## 4 sausage,hygiene articles
## 5 soda,pickled vegetables
## 6 beef,white bread
Ubah file CSV ke Format Keranjang
txn = read.transactions(file="ItemList.csv", rm.duplicates= TRUE, format="basket",sep=",",cols=1);## distribution of transactions with duplicates:
## items
## 1 2 3 4
## 662 39 5 1
print(txn)## transactions in sparse format with
## 14964 transactions (rows) and
## 168 items (columns)
Hapus kutipan dari Transaksi
txn@itemInfo$labels <- gsub("\"","",txn@itemInfo$labels)Apriori Algorithm
Algoritma Apriori adalah algoritma yang digunakan untuk menghitung aturan asosiasi antar objek. Aturan asosiasi menjelaskan bagaimana dua atau lebih objek terkait satu sama lain. Apriori menghasilkan seperangkat aturan yang paling relevan dari data transaksi tertentu. Menunjukkan dukungan, kepercayaan diri, dan pencabutan aturan yang dapat digunakan untuk memutuskan kekuatan relatif aturan.
Tahapan untuk menganalisis aturan asosiasi yaitu:
Analisa pola frekuensi tinggi (Support)
\(\begin{array}{ll}\text { Support }(\mathrm{A}) & =\frac{\Sigma \text { Transaksi mengandung } A}{\Sigma \text { Transaksi }} \times 100 \% \\ \text { Support }(\mathrm{A} \mid \mathrm{B}) & =\frac{\Sigma \text { Transaksi mengandung } A \text { dan } B}{\Sigma \text { Transaksi } A} \times 100 \%\end{array}\)
Pembentukan aturan asosiasi (Confidence)
\(\begin{aligned} \text { Confidence }(\mathrm{A} \mid \mathrm{B}) & \rightarrow(\mathrm{A \cap B}) \\ & =\frac{\Sigma \text { Transaksi mengandung } A \text { dan } B}{\Sigma \text { Transaksi } A} \times 100 \%\end{aligned}\)
Mengetahui kekuatan aturan asosiasi (association rule) yang telah terbentuk dari nilai support dan confidence
Lift \(=\frac{\text { Support }}{\operatorname{Supp}(X) \times \operatorname{Supp}(Y)}\)
basket_rules <- apriori(txn, parameter = list(minlen=2, sup = 0.001, conf = 0.05, target="rules"))## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.05 0.1 1 none FALSE TRUE 5 0.001 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 14
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[168 item(s), 14964 transaction(s)] done [0.01s].
## sorting and recoding items ... [149 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [450 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Total aturan yang dihasilkan
print(length(basket_rules))## [1] 450
summary(basket_rules)## set of 450 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 423 27
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 2.00 2.00 2.06 2.00 3.00
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001002 Min. :0.05000 Min. :0.005346 Min. :0.5195
## 1st Qu.:0.001270 1st Qu.:0.06397 1st Qu.:0.015972 1st Qu.:0.7673
## Median :0.001938 Median :0.08108 Median :0.023590 Median :0.8350
## Mean :0.002760 Mean :0.08759 Mean :0.033723 Mean :0.8859
## 3rd Qu.:0.003341 3rd Qu.:0.10482 3rd Qu.:0.043705 3rd Qu.:0.9601
## Max. :0.014836 Max. :0.25581 Max. :0.157912 Max. :2.1831
## count
## Min. : 15.0
## 1st Qu.: 19.0
## Median : 29.0
## Mean : 41.3
## 3rd Qu.: 50.0
## Max. :222.0
##
## mining info:
## data ntransactions support confidence
## txn 14964 0.001 0.05
## call
## apriori(data = txn, parameter = list(minlen = 2, sup = 0.001, conf = 0.05, target = "rules"))
Memeriksa aturan keranjang
inspect(basket_rules[1:20])## lhs rhs support confidence
## [1] {frozen fish} => {whole milk} 0.001069233 0.1568627
## [2] {seasonal products} => {rolls/buns} 0.001002406 0.1415094
## [3] {pot plants} => {other vegetables} 0.001002406 0.1282051
## [4] {pot plants} => {whole milk} 0.001002406 0.1282051
## [5] {pasta} => {whole milk} 0.001069233 0.1322314
## [6] {pickled vegetables} => {whole milk} 0.001002406 0.1119403
## [7] {packaged fruit/vegetables} => {rolls/buns} 0.001202887 0.1417323
## [8] {detergent} => {yogurt} 0.001069233 0.1240310
## [9] {detergent} => {rolls/buns} 0.001002406 0.1162791
## [10] {detergent} => {whole milk} 0.001403368 0.1627907
## [11] {semi-finished bread} => {other vegetables} 0.001002406 0.1056338
## [12] {semi-finished bread} => {whole milk} 0.001670676 0.1760563
## [13] {red/blush wine} => {rolls/buns} 0.001336541 0.1273885
## [14] {red/blush wine} => {other vegetables} 0.001136060 0.1082803
## [15] {flour} => {tropical fruit} 0.001069233 0.1095890
## [16] {flour} => {whole milk} 0.001336541 0.1369863
## [17] {herbs} => {yogurt} 0.001136060 0.1075949
## [18] {herbs} => {whole milk} 0.001136060 0.1075949
## [19] {processed cheese} => {root vegetables} 0.001069233 0.1052632
## [20] {processed cheese} => {rolls/buns} 0.001470195 0.1447368
## coverage lift count
## [1] 0.006816359 0.9933534 16
## [2] 0.007083667 1.2864807 15
## [3] 0.007818765 1.0500611 15
## [4] 0.007818765 0.8118754 15
## [5] 0.008086073 0.8373723 16
## [6] 0.008954825 0.7088763 15
## [7] 0.008487036 1.2885066 18
## [8] 0.008620690 1.4443580 16
## [9] 0.008620690 1.0571081 15
## [10] 0.008620690 1.0308929 21
## [11] 0.009489441 0.8651911 15
## [12] 0.009489441 1.1148993 25
## [13] 0.010491847 1.1581057 20
## [14] 0.010491847 0.8868668 17
## [15] 0.009756750 1.6172489 16
## [16] 0.009756750 0.8674833 20
## [17] 0.010558674 1.2529577 17
## [18] 0.010558674 0.6813587 17
## [19] 0.010157712 1.5131200 16
## [20] 0.010157712 1.3158214 22
inspect(basket_rules[1:5])## lhs rhs support confidence
## [1] {frozen fish} => {whole milk} 0.001069233 0.1568627
## [2] {seasonal products} => {rolls/buns} 0.001002406 0.1415094
## [3] {pot plants} => {other vegetables} 0.001002406 0.1282051
## [4] {pot plants} => {whole milk} 0.001002406 0.1282051
## [5] {pasta} => {whole milk} 0.001069233 0.1322314
## coverage lift count
## [1] 0.006816359 0.9933534 16
## [2] 0.007083667 1.2864807 15
## [3] 0.007818765 1.0500611 15
## [4] 0.007818765 0.8118754 15
## [5] 0.008086073 0.8373723 16
Memvisualisasikan Aturan Asosiasi
plot(sort(basket_rules,by="lift"),method="graph",control=list(type="items"))## Warning: Unknown control parameters: type
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
## Warning: Too many rules supplied. Only plotting the best 100 using
## 'lift' (change control parameter max if needed).
## Warning: ggrepel: 15 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
plot(basket_rules, jitter = 0)plot(basket_rules, method = "grouped", control = list(k = 5))plot(basket_rules, method = "grouped", control = list(k = 10))Grafik 10 aturan pertama
plot(basket_rules[1:10], method="graph")Grafik 20 aturan pertama
plot(basket_rules[1:20], method="graph")Grafik 50 aturan pertama
plot(basket_rules[1:50], method="graph")Plot koordinat paralel
plot(basket_rules[1:10], method="paracoord")plot(basket_rules[1:20], method="paracoord")Produk Yang Paling Sering Digunakan
itemFrequencyPlot(txn, topN = 10)Mengubah hiperparameter
basket_rules2 <- apriori(txn, parameter = list(minlen=3, sup = 0.001, conf = 0.1, target="rules"))## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.1 0.1 1 none FALSE TRUE 5 0.001 3
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 14
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[168 item(s), 14964 transaction(s)] done [0.01s].
## sorting and recoding items ... [149 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [17 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
print(length(basket_rules2))## [1] 17
summary(basket_rules2)## set of 17 rules
##
## rule length distribution (lhs + rhs):sizes
## 3
## 17
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3 3 3 3 3 3
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001002 Min. :0.1018 Min. :0.005346 Min. :0.7214
## 1st Qu.:0.001136 1st Qu.:0.1172 1st Qu.:0.008086 1st Qu.:0.8897
## Median :0.001136 Median :0.1269 Median :0.008955 Median :1.1081
## Mean :0.001207 Mean :0.1437 Mean :0.008821 Mean :1.1794
## 3rd Qu.:0.001337 3rd Qu.:0.1642 3rd Qu.:0.010559 3rd Qu.:1.2297
## Max. :0.001470 Max. :0.2558 Max. :0.011160 Max. :2.1831
## count
## Min. :15.00
## 1st Qu.:17.00
## Median :17.00
## Mean :18.06
## 3rd Qu.:20.00
## Max. :22.00
##
## mining info:
## data ntransactions support confidence
## txn 14964 0.001 0.1
## call
## apriori(data = txn, parameter = list(minlen = 3, sup = 0.001, conf = 0.1, target = "rules"))
inspect(basket_rules2)## lhs rhs support
## [1] {sausage, yogurt} => {whole milk} 0.001470195
## [2] {sausage, whole milk} => {yogurt} 0.001470195
## [3] {whole milk, yogurt} => {sausage} 0.001470195
## [4] {sausage, soda} => {whole milk} 0.001069233
## [5] {sausage, whole milk} => {soda} 0.001069233
## [6] {rolls/buns, sausage} => {whole milk} 0.001136060
## [7] {sausage, whole milk} => {rolls/buns} 0.001136060
## [8] {rolls/buns, yogurt} => {whole milk} 0.001336541
## [9] {whole milk, yogurt} => {rolls/buns} 0.001336541
## [10] {other vegetables, yogurt} => {whole milk} 0.001136060
## [11] {whole milk, yogurt} => {other vegetables} 0.001136060
## [12] {rolls/buns, soda} => {other vegetables} 0.001136060
## [13] {other vegetables, soda} => {rolls/buns} 0.001136060
## [14] {other vegetables, rolls/buns} => {soda} 0.001136060
## [15] {rolls/buns, soda} => {whole milk} 0.001002406
## [16] {other vegetables, soda} => {whole milk} 0.001136060
## [17] {other vegetables, rolls/buns} => {whole milk} 0.001202887
## confidence coverage lift count
## [1] 0.2558140 0.005747126 1.6199746 22
## [2] 0.1641791 0.008954825 1.9118880 22
## [3] 0.1317365 0.011160118 2.1830624 22
## [4] 0.1797753 0.005947608 1.1384500 16
## [5] 0.1194030 0.008954825 1.2296946 16
## [6] 0.2125000 0.005346164 1.3456835 17
## [7] 0.1268657 0.008954825 1.1533523 17
## [8] 0.1709402 0.007818765 1.0825005 20
## [9] 0.1197605 0.011160118 1.0887581 20
## [10] 0.1404959 0.008086073 0.8897081 17
## [11] 0.1017964 0.011160118 0.8337610 17
## [12] 0.1404959 0.008086073 1.1507281 17
## [13] 0.1172414 0.009689922 1.0658566 17
## [14] 0.1075949 0.010558674 1.1080872 17
## [15] 0.1239669 0.008086073 0.7850365 15
## [16] 0.1172414 0.009689922 0.7424460 17
## [17] 0.1139241 0.010558674 0.7214386 18
plot(basket_rules2, method="graph")plot(basket_rules2, method="paracoord")Dengan demikian telah menyelesaikan Market Basket Analysis menggunakan algoritma Apriori (FP-Growth) pada R
Referensi
- https://www.kaggle.com/code/heeraldedhia/market-basket-analysis-using-apriori-algorithm/notebook
- https://rpubs.com/Tantri_12345
- https://github.com/Tantri12345/FP-Growth
- https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset?resource=download
- https://gifadn.medium.com/market-basket-analisys-mba-dengan-menggunakan-datasets-groceries-di-r-62f63f0278c4