資料來源:Online Dataset
InvoiceNo:
Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation.
StockCode:
Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
Description:
Product (item) name. Nominal.
Quantity:
The quantities of each product (item) per transaction. Numeric.
InvoiceDate:
Invice Date and time. Numeric, the day and time when each transaction was generated.
UnitPrice:
Unit price. Numeric, Product price per unit in sterling.
CustomerID:
Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
Country:
Country name. Nominal, the name of the country where each customer resides.
## InvoiceNo StockCode Description Quantity
## 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
## 2 536365 71053 WHITE METAL LANTERN 6
## 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
## 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
## 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
## 6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2
## InvoiceDate UnitPrice CustomerID Country
## 1 2010/12/1 08:26 2.55 17850 United Kingdom
## 2 2010/12/1 08:26 3.39 17850 United Kingdom
## 3 2010/12/1 08:26 2.75 17850 United Kingdom
## 4 2010/12/1 08:26 3.39 17850 United Kingdom
## 5 2010/12/1 08:26 3.39 17850 United Kingdom
## 6 2010/12/1 08:26 7.65 17850 United Kingdom
資料日期從2010/12/01到2011/12/09
## InvoiceNo StockCode Description
## Length:541909 Length:541909 Length:541909
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Quantity InvoiceDate UnitPrice
## Min. :-80995.00 Min. :2010-12-01 08:26:00 Min. :-11062.06
## 1st Qu.: 1.00 1st Qu.:2011-03-28 11:34:00 1st Qu.: 1.25
## Median : 3.00 Median :2011-07-19 17:17:00 Median : 2.08
## Mean : 9.55 Mean :2011-07-04 13:34:57 Mean : 4.61
## 3rd Qu.: 10.00 3rd Qu.:2011-10-19 11:27:00 3rd Qu.: 4.13
## Max. : 80995.00 Max. :2011-12-09 12:50:00 Max. : 38970.00
## CustomerID Country
## Length:541909 Length:541909
## Class :character Class :character
## Mode :character Mode :character
##
##
##
先把退貨訂單移除後接著刪除價格、數量小於0的資料。並將99.9%以外的資料去除,接著檢查資料遺漏值狀況。
##
## Missings per variable:
## Variable Count
## InvoiceNo 0
## StockCode 0
## Description 0
## Quantity 0
## InvoiceDate 0
## UnitPrice 0
## CustomerID 131773
## Country 0
##
## Missings in combinations of variables:
## Combinations Count Percent
## 0:0:0:0:0:0:0:0 397275 75.09243
## 0:0:0:0:0:0:1:0 131773 24.90757
## InvoiceNo StockCode Description
## Length:529048 Length:529048 Length:529048
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Quantity InvoiceDate UnitPrice
## Min. : 1.000 Min. :2010-12-01 08:26:00 Min. : 0.001
## 1st Qu.: 1.000 1st Qu.:2011-03-28 12:23:00 1st Qu.: 1.250
## Median : 3.000 Median :2011-07-20 13:06:00 Median : 2.080
## Mean : 9.439 Mean :2011-07-04 20:58:49 Mean : 3.366
## 3rd Qu.: 10.000 3rd Qu.:2011-10-19 12:54:00 3rd Qu.: 4.130
## Max. :448.000 Max. :2011-12-09 12:50:00 Max. :165.000
## CustomerID Country
## Length:529048 Length:529048
## Class :character Class :character
## Mode :character Mode :character
##
##
##
只有customerID有遺漏,總共131773筆,可能是顧客不願意留下紀錄,但仍有其他完整購買行為,因此不進行刪除,因此總資料筆數為529048。
訂單平均次數為9次,中位數為3,資料呈現右偏分布,約莫50%的人過去一年只來網站購物過三次以下。
訂單月份趨勢
訂單周間趨勢
消費時段趨勢
顧客購買集中於早上10點到下午4點。
月份收入趨勢
最熱銷的前十個產品以及月份銷量趨勢
## Description Quantity
## 1815 JUMBO BAG RED RETROSPOT 45274
## 227 ASSORTED COLOUR BIRD ORNAMENT 31413
## 3930 WORLD WAR 2 GLIDERS ASSTD DESIGNS 31143
## 3840 WHITE HANGING HEART T-LIGHT HOLDER 29929
## 2336 PACK OF 72 RETROSPOT CAKE CASES 28619
## 3672 VICTORIAN GLASS HANGING T-LIGHT 22384
## 2678 POPCORN HOLDER 21219
## 2738 RABBIT NIGHT LIGHT 20968
## 1813 JUMBO BAG PINK POLKADOT 19965
## 2002 LUNCH BAG RED RETROSPOT 19353
## $x
## [1] "Month"
##
## $y
## [1] "Sales Volume"
##
## $title
## [1] "Sales by month"
##
## attr(,"class")
## [1] "labels"
Jumbo bag red retpospot 銷量在八月達到高峰,且持續維持高檔,可能是九月之後國外節慶較多,像是十月的萬聖節、十一月的感恩節、十二月的聖誕節。都需要有送禮的需求,因此紙袋銷量暴增。
Assprted cp;our bird ornament 銷量於7月後直線上升,且在11月達到高峰,可能是因為感恩節與聖誕節要到了因此買裝飾品裝飾家裡。
World war 2 gliders asstd designs從年初銷量持續穩定的增加,這種小玩具在Amazon屬於前100的熱銷產品,上網去查了一下他已經賣出約1,084,303 件,價格較便宜且拿來送禮或買給小朋友玩都很適合。
White hanging heart T-light holder在十一月銷量急遽上升,可能是需要送給太太或女朋友的聖誕禮物,且這種蠟燭座是使用電力來營造如蠟燭般的效果,更具有情調。
Pack of 72 retrospot cake cases在三月與十二月銷量顯著大於其他月份,四月有復活節,十二月有聖誕節,兩個重要節慶外國人常常開派對慶祝,蛋糕紙因為烤蛋糕的需求上升而顯著增加。
Victorian glass hanging T-light的每月銷量變化十分極端。
Popcorn holder 在十一月銷量暴增,可能也是因應節慶需求。
Rabbit night light 在11月銷量大增,可能也是受到節慶影響。
Jumbo bag pink polkadot在六月到九月這段時間熱銷,可能是夏天大家戶外活動,進入冬天之後銷量顯著下滑。
Lunch bag red retrospot在五月到八月這段時間熱銷,可能是夏天大家喜歡出去從事野餐等戶外活動,進入冬天之後銷量顯著下滑,表示野餐袋這產品可能會受到季節性因素影響。
從上述十件年度熱銷產品的銷售月份變化發現,購物行為很容易受到節慶的影響而有顯著差異。
計算每個用戶平均貢獻金額
平均每人貢獻金額落在約400 – 550之間,且下半年的普遍貢獻金額都較高。
希望能更細部的描述消費者的購買行為,因此使用RFM model對消費者做分群以辨識出:
1. 需要維持關係的貴客
2. 維持之前很常來,但最近不來的“前客”
3. 一次性消費顧客
4. 新顧客
因此新增以下三個變數
1. recency: 上次下訂單日期與分析日的間隔 (分析日在此假設為 2011-12-10)
2. frequency: 過去一年的下訂單頻率
3. monetary: 平均每次訂單金額
且將沒有CustomerID的資料進行刪除以進行客戶分群
## CustomerID recency frequency monitery_ave
## Length:4320 Min. : 1.00 Min. : 1.000 Min. : 2.9
## Class :character 1st Qu.: 18.00 1st Qu.: 1.000 1st Qu.: 177.8
## Mode :character Median : 51.00 Median : 2.000 Median : 290.2
## Mean : 93.15 Mean : 4.253 Mean : 366.4
## 3rd Qu.:143.00 3rd Qu.: 5.000 3rd Qu.: 425.3
## Max. :374.00 Max. :207.000 Max. :8951.3
根據80 20法則,20%的人貢獻80%的營收,因此將總貢獻金額根據比例切成5等分:1=低,2=中低,3=中,4=中高,5=高
最近一次造訪間隔切成1=一周、2=一到兩周、3=兩到三周、4=三到四周、5=四周以上
造訪頻率切成1次,2次 ,3次 ,4次 ,5次以上
購買頻率比例
## Frequency Percentage
## 1 1 34.68%
## 2 2 19.28%
## 3 3 11.37%
## 4 4 9.03%
## 5 5 25.65%
距離上次下單日期間隔
## Recency Percentage
## 1 1 10.67%
## 2 2 10.25%
## 3 3 9.81%
## 4 4 6.81%
## 5 5 62.45%
Recency 和 Frequency交叉分析
##
## 0-7天 8-15天 16-23天 24-30天 30天以上
## 1次 0.007 0.012 0.009 0.011 0.068
## 2次 0.010 0.017 0.009 0.013 0.053
## 3次 0.016 0.018 0.012 0.013 0.038
## 4次 0.015 0.011 0.013 0.006 0.022
## 5次以上 0.299 0.134 0.071 0.047 0.075
45.9%的人屬於來兩次以下且上次來已經超過兩周。
單次平均購買金額前20%的人貢獻總營收比例
## [1] 0.5531581
前20%的人貢獻約55%的總營收
高貢獻金額的族群的 Recency 和 Frequency 交叉分析
##
## 0-7天 8-15天 16-23天 24-30天 30天以上
## 1次 0.007 0.010 0.008 0.009 0.081
## 2次 0.006 0.015 0.007 0.013 0.065
## 3次 0.010 0.016 0.010 0.013 0.049
## 4次 0.012 0.007 0.010 0.009 0.024
## 5次以上 0.289 0.128 0.066 0.049 0.086
高貢獻族群有將近43.6%屬於低造訪頻率與超過三不回訪的客人,更顯示了顧客流失問題嚴重,且大約29%的人只來了一次。表示雖然訂單逐月創高,但大部分都是新客,如何加強消費者回購率為日後重要課題。
接下來使用購物籃分析瞭解顧客購買產品之間的關聯性便於做產品組合與推薦。
購物籃分析關鍵的三個指標:
1. support: P(A交集B)
在所有事件發生之下,同時發生A和B的機率 可以想成是 length(transactions) / 同時買A和B的訂單數量
2. confidence: P(A|B)
購買B的前提之下,買A的機率 1就代表買B的人100%購買A
3. lift: P(A|B)/P(B)
若>1 具有正向關係 =1無關係 <1負向關係
## transactions as itemMatrix in sparse format with
## 18402 rows (elements/itemsets/transactions) and
## 26197 columns (items) and a density of 0.0007197084
##
## most frequent items:
## WHITE HANGING HEART T-LIGHT HOLDER REGENCY CAKESTAND 3 TIER
## 1754 1532
## JUMBO BAG RED RETROSPOT PARTY BUNTING
## 1416 1268
## ASSORTED COLOUR BIRD ORNAMENT (Other)
## 1237 339748
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1 1469 858 752 752 755 698 650 634 638 576 608 523 510 522
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 549 515 465 434 481 420 386 312 310 271 235 254 227 214 223
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
## 210 168 161 136 143 133 110 112 91 109 93 91 85 88 66
## 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 63 66 63 60 59 51 62 41 40 49 43 37 30 38 30
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## 28 27 17 25 25 20 27 25 23 15 21 18 14 15 15
## 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 11 15 12 7 8 14 15 13 9 7 12 10 14 8 6
## 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
## 6 6 10 6 4 4 4 5 5 2 4 2 4 4 3
## 106 107 108 109 110 111 112 113 114 115 117 118 119 121 122
## 2 2 6 3 4 3 2 1 3 1 3 3 3 1 2
## 123 124 126 127 128 132 133 134 135 141 142 143 144 146 147
## 2 1 4 1 2 1 1 2 1 1 2 2 1 1 2
## 148 151 155 157 158 169 170 172 178 179 181 183 202 205 228
## 1 1 3 1 1 2 1 1 1 1 1 1 1 1 1
## 250 251 285 321 400 419
## 1 1 1 1 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 18.85 24.00 419.00
##
## includes extended item information - examples:
## labels
## 1 1
## 2 1 HANGER
## 3 10
總共有18402筆交易資料,26197個產品
各產品的購買頻率
創造規則
使用apriori演算法建立關聯性規則找出哪組商品顧客可能會在一次購物時同時購買
association.rules <- apriori(tr, parameter = list(supp=0.001, conf=0.8,maxlen=10))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 18
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[26197 item(s), 18402 transaction(s)] done [0.21s].
## sorting and recoding items ... [2438 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.51s].
## writing ... [99020 rule(s)] done [0.06s].
## creating S4 object ... done [0.06s].
篩選規則
subRules<-association.rules[quality(association.rules)$confidence>0.4&quality(association.rules)$support>=0.005]
視覺化關聯性規則
plotly_arules(subRules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
根據confidence(A產品與B產品的交集)找出前10個重要的規則若想看有興趣的產品關聯性,可以透過以下的code進行查詢
lhs => rhs 買了lhs之後也會買rhs
While customers bought METAL also bought…
metal.association.rules <- apriori(tr, parameter = list(supp=0.001, conf=0.8),appearance = list(lhs="METAL",default="rhs"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 18
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[26197 item(s), 18402 transaction(s)] done [0.21s].
## sorting and recoding items ... [2438 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 done [0.01s].
## writing ... [1 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(head(metal.association.rules))
## lhs rhs support confidence lift count
## [1] {METAL} => {DECORATION} 0.002717096 1 368.04 50
customers buy before buying ‘METAL’
metal.association.rules <- apriori(tr, parameter = list(supp=0.001, conf=0.8),appearance = list(rhs=c("METAL"),default="lhs"))
inspect(head(metal.association.rules))
customers buy before buying “METAL” and “DECORATION”
metal.association.rules <- apriori(tr, parameter = list(supp=0.001, conf=0.8),appearance = list(rhs=c("METAL","DECORATION"),default="lhs"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 18
##
## set item appearances ...[2 item(s)] done [0.00s].
## set transactions ...[26197 item(s), 18402 transaction(s)] done [0.22s].
## sorting and recoding items ... [2438 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(tr, parameter = list(supp = 0.001, conf = 0.8),
## appearance = list(rhs = c("METAL", : Mining stopped (maxlen reached). Only
## patterns up to a length of 10 returned!
## done [0.48s].
## writing ... [4 rule(s)] done [0.06s].
## creating S4 object ... done [0.02s].
inspect(head(metal.association.rules))
## lhs rhs support confidence lift count
## [1] {WOBBLY CHICKEN} => {METAL} 0.001521574 1 368.04 28
## [2] {WOBBLY CHICKEN} => {DECORATION} 0.001521574 1 368.04 28
## [3] {WOBBLY RABBIT} => {METAL} 0.001847625 1 368.04 34
## [4] {WOBBLY RABBIT} => {DECORATION} 0.001847625 1 368.04 34