Data Aggregation Workflow
1. Transaction Item Records: Z
rm(list=ls(all=T))
Sys.setlocale("LC_ALL","C")
library(dplyr)
library(ggplot2)
library(caTools)
1.1 The do.call-rbind-lapply Combo
Z = do.call(rbind, lapply(
dir('data/TaFengDataSet','.*csv$',full.names=T),
read.csv, header=F)
) %>%
setNames(c("date","cust","age","area","cat","prod","qty","cost","price"))
nrow(Z)
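As a minimal sketch of the same combo on toy data: the two small files, the folder name `combo_demo`, and the column names `id` and `label` below are hypothetical and are not part of the Ta Feng dataset. `lapply` reads each file into its own data frame, and `do.call(rbind, ...)` stacks the resulting list row-wise.

```r
# Toy illustration with hypothetical files; not the Ta Feng data.
tmp <- file.path(tempdir(), "combo_demo")
dir.create(tmp, showWarnings = FALSE)

# Write two small header-less CSV files.
write.table(data.frame(1:2, c("a", "b")), file.path(tmp, "part1.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(data.frame(3:5, c("c", "d", "e")), file.path(tmp, "part2.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)

# List the files, read each one into a data frame, then stack them row-wise.
files <- dir(tmp, "\\.csv$", full.names = TRUE)
parts <- lapply(files, read.csv, header = FALSE)
demo  <- setNames(do.call(rbind, parts), c("id", "label"))

nrow(demo)
```

The combo only works when every file yields the same columns, which is why `header=F` followed by a single `setNames` call is convenient here.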
Data Conversion
Z$date = as.Date(as.character(Z$date))
summary(Z)
- Convert `date` to the Date type (going through character first)
- Use `summary` to inspect the descriptive statistics of the raw data
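A small sketch of the conversion step, assuming the raw values look like the hypothetical strings below (the actual format in the files may differ). Once the column is a `Date`, date arithmetic such as differences in days becomes available.

```r
# Hypothetical raw values; the real file's date format may differ.
raw <- factor(c("2000-11-01 00:00:00", "2001-02-28 00:00:00"))

# as.character() makes the conversion explicit in case the column
# was read in as a factor; as.Date() then parses the leading yyyy-mm-dd part.
d <- as.Date(as.character(raw))

class(d)   # the column is now a Date, not text
diff(d)    # difference between the two dates, in days
```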
Quantile of Variables
sapply(Z[,7:9], quantile, probs=c(.99, .999, .9995))
Get rid of Outliers
Z = subset(Z, qty<=24 & cost<=3800 & price<=4000)
nrow(Z)
- Even in a large dataset, a single outlier can bias the estimates
- Identify the outliers and filter them out
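To illustrate the point about a single extreme record, here is a sketch on simulated quantities. The data and the choice of the 99.9th percentile as the cutoff are assumptions made for illustration; the notebook itself uses fixed thresholds.

```r
# Hypothetical quantities: 999 ordinary values plus one extreme record.
set.seed(1)
qty <- c(sample(1:5, 999, replace = TRUE), 10000)

mean(qty)                              # pulled far upward by one extreme value

cutoff <- quantile(qty, probs = 0.999) # upper-tail quantile used as a threshold
clean  <- qty[qty <= cutoff]           # keep values at or below the threshold

mean(clean)                            # back within the ordinary 1-5 range
```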
Assign Transaction ID
Z$tid = group_indices(Z, date, cust)
No. Customers, Categories, Product Items & Transactions
sapply(Z[,c("cust","cat","prod","tid")], n_distinct)
- In total there are 32,256 distinct customers, 2,007 distinct product types, and so on
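The idea behind the transaction ID is that all items bought by the same customer on the same day belong to one transaction. A base-R sketch of that grouping on a hypothetical four-row table follows; the numbering it produces is not guaranteed to match what `group_indices` assigns, only the grouping is the same.

```r
# Hypothetical item records: customer A buys two items on the same day.
toy <- data.frame(
  date = as.Date(c("2001-01-02", "2001-01-02", "2001-01-02", "2001-01-05")),
  cust = c("A", "A", "B", "A")
)

# Build one key per (date, cust) pair, then number the keys
# in order of first appearance.
key     <- paste(toy$date, toy$cust)
toy$tid <- match(key, unique(key))

toy
length(unique(toy$tid))   # number of distinct transactions
```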
Summary of Item Records
summary(Z)
2. Transaction Records: X
Aggregating Transaction Data
X = group_by(Z, tid) %>% summarise(
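date = first(date), # transaction date
cust = first(cust), # customer ID
age = first(age), # customer age group
area = first(area), # customer residential area
items = n(), # number of item records in the transaction
pieces = sum(qty), # total number of pieces
total = sum(price), # total transaction amount
gross = sum(price - cost) # gross profit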
) %>% data.frame # 119422
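The roll-up from item records to transactions can be checked by hand on a tiny table. The sketch below uses base R instead of `dplyr` (a substitution made only to keep it dependency-free), and the three item rows are hypothetical.

```r
# Hypothetical item records: transaction 1 has two items, transaction 2 has one.
items <- data.frame(
  tid   = c(1, 1, 2),
  qty   = c(2, 1, 3),
  cost  = c(80, 40, 150),
  price = c(100, 50, 180)
)

# Split by transaction, summarise each piece, and stack the results.
X_toy <- do.call(rbind, lapply(split(items, items$tid), function(d) data.frame(
  tid    = d$tid[1],
  items  = nrow(d),                 # number of item records
  pieces = sum(d$qty),              # total pieces
  total  = sum(d$price),            # total amount
  gross  = sum(d$price - d$cost)    # gross profit
)))

X_toy
```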
Check Quantile & Remove Outliers
sapply(X[,6:9], quantile, probs=c(.999, .9995, .9999))
X = subset(X, items<=62 & pieces<95 & total<16000) # 119328
Weekly Transactions
par(cex=0.8)
hist(X$date, "weeks", freq=T, border='lightgray', col='darkcyan',
las=2, main="No. of Transactions per Week")
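- The histogram shows how the number of transactions varies from week to week
- Transaction volume is noticeably low in the week of Christmas; consider what business implication lies behind this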
3. Customer Data: A
Aggregating Customer Data
d0 = max(X$date)
A = group_by(X, cust) %>% summarise(
r = 1 + as.integer(difftime(d0, max(date), units="days")), # recency
s = 1 + as.integer(difftime(d0, min(date), units="days")), # seniority
f = n(), # frequency
m = mean(total), # monetary
rev = sum(total), # total revenue contribution
raw = sum(gross), # total gross profit contribution
age = first(age), # age group
area = first(area), # area code
) %>% data.frame # 33241
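- New variables are derived for each customer following RFM analysis; for an introduction to RFM analysis, see:
- RFM analysis: from transaction records to the customer-product matrix
- r: recency, days since the most recent purchase
- s: seniority, days since the customer's first purchase
- f: frequency, the number of purchases
- m: monetary, the average transaction amount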
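The definitions of r, s, f and m above can be verified on a tiny, hypothetical transaction table. This sketch uses base R in place of `dplyr`, and the three transactions are made up for illustration.

```r
# Hypothetical transactions for two customers.
tx <- data.frame(
  cust  = c("A", "A", "B"),
  date  = as.Date(c("2001-01-10", "2001-02-20", "2001-02-27")),
  total = c(100, 300, 50)
)

d0 <- max(tx$date)   # reference date: the last day observed in the data

rfm <- do.call(rbind, lapply(split(tx, tx$cust), function(d) data.frame(
  cust = d$cust[1],
  r = 1 + as.integer(difftime(d0, max(d$date), units = "days")),  # recency
  s = 1 + as.integer(difftime(d0, min(d$date), units = "days")),  # seniority
  f = nrow(d),                                                    # frequency
  m = mean(d$total)                                               # monetary
)))

rfm
```

Because of the `1 +` offset, a customer whose only purchase falls on the reference date gets r = 1 and s = 1 rather than 0.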
Customer Summary
summary(A)
par(mfrow=c(3,2), mar=c(3,3,4,2))
for(x in c('r','s','f','m'))
hist(A[,x],freq=T,main=x,xlab="",ylab="",cex.main=2)
hist(pmin(A$f,10),0:10,freq=T,xlab="",ylab="",cex.main=2)
hist(log(A$m,10),freq=T,xlab="",ylab="",cex.main=2)
Duplicate & Save
A0 = A; X0 = X; Z0 = Z
save(Z0, X0, A0, file="data/tf0.rdata")
4. Objective of the Contest
range(X$date)
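Using the data up to the end of January (through 2001-01-31), build models to predict, for each customer:
- Will she make a purchase in February (2001-02-01 ~ 2001-02-28)?
- If she does, how much will she spend?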
The Basic Questions of Analysis
【Q】 What is the Unit of Analysis?
【Q】 What is the Target of Analysis? Should we model every customer in the dataset? Why not?
【Q】 How should we make the Training/Testing Data Split?
【Q】 What are the Predictor and Target Variables?
The Target of Analysis
Screen out the new customers (whose first purchase is on or after 2001-02-01)
A = filter(A0, s > 28) # 28584
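The cutoff `s > 28` can be checked with two boundary dates. This sketch assumes the last date in the data is 2001-02-28, in line with the February window stated above; the two first-purchase dates are hypothetical.

```r
d0 <- as.Date("2001-02-28")   # assumed last date in the data

# Hypothetical first-purchase dates on either side of the boundary.
first_buy <- as.Date(c("2001-01-31", "2001-02-01"))

# Same seniority formula as in the customer summary.
s <- 1 + as.integer(difftime(d0, first_buy, units = "days"))

data.frame(first_buy, s, kept = s > 28)
```

Under that assumption, a customer first seen on 2001-01-31 is kept, while one first seen on 2001-02-01 is screened out.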
The Baseline Probability
mean(A$r <= 28)
Splitting Factor and Splitting Ratio
library(caTools)
set.seed(1234); spl = sample.split(A$r <= 28, SplitRatio=0.75)
cid1 = subset(A, spl)$cust # 21438
cid2 = subset(A, !spl)$cust # 7146
cid1/cid2 are the customer IDs in the training/testing data. But, …
【Q】 What are the Predictor (X) and Target (Y) Variables?
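The point of passing a splitting factor to the split above is that both subsets keep the same proportion of each outcome. The base-R sketch below illustrates that idea of a stratified split on a hypothetical outcome vector; it is not the `caTools` implementation, and the names `y` and `in_train` are made up for illustration.

```r
set.seed(1234)

# Hypothetical outcome flag: 40 TRUE and 60 FALSE.
y <- rep(c(TRUE, FALSE), times = c(40, 60))

# Draw 75% of each class separately so class proportions are preserved.
in_train <- logical(length(y))
for (cls in c(TRUE, FALSE)) {
  idx     <- which(y == cls)
  n_train <- round(0.75 * length(idx))
  in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
}

sum(in_train)        # size of the training part
mean(y[in_train])    # share of TRUE in the training part
mean(y[!in_train])   # share of TRUE in the testing part
```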
