This is a transnational data set which contains all the transactions occurring between 12/01/2010 and 12/09/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
The raw data includes 541909 observations of 8 variables. After omitting 136534 records with missing values in the observations, data set has 406829 records.
Create an aggregated variable named Amount, by multiplying Quantity with Price, which gives the total amount of money spent per product in each transaction. Create SKU variable that encodes the first 3 digits of Stockcode to indicate different stock keeping unit (subgroup of the product), which is related to later calculation of "Breadth" customer behavior indicator. *Separate the variable InvoiceDate into two variables Date and Time . This allows different transactions created by the same consumer on the same day but at different times to be treated separately.
#create Amount variable
#Amount is total amount of money spent per product in each transaction.
eRetail$Amount <- eRetail$Quantity * eRetail$UnitPrice
#create SKU variable
#SKU decribes indicate different stock keeping unit (subgroup of the product)
eRetail$SKU <- substr(eRetail$StockCode,1,3)
#separate date & time#
eRetail$InvoiceDate <- strptime(eRetail$InvoiceDate,"%m/%d/%Y %H:%M")
#InvoiceTime indicate the time of purchase
eRetail$InvoiceTime <- format(eRetail$InvoiceDate,"%H")
#InvoiceDate now indicate the date of purchase
eRetail$InvoiceDate <- as.Date(eRetail$InvoiceDate,"%m/%d/%Y")
#Change CustomerID into character value
eRetail$CustomerID <- as.character(eRetail$CustomerID)
#look at internal structure#
str(eRetail)## 'data.frame': 406829 obs. of 11 variables:
## $ InvoiceNo : Factor w/ 25900 levels "536365","536366",..: 1 1 1 1 1 1 1 2 2 3 ...
## $ StockCode : Factor w/ 4070 levels "10002","10080",..: 3538 2795 3045 2986 2985 1663 801 1548 1547 3306 ...
## $ Description: Factor w/ 4223 levels " 4 PURPLE FLOCK DINNER CANDLES",..: 4026 4034 931 1958 2979 3234 1572 1697 1694 258 ...
## $ Quantity : int 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDate: Date, format: "2010-12-01" "2010-12-01" ...
## $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : chr "17850" "17850" "17850" "17850" ...
## $ Country : Factor w/ 38 levels "Australia","Austria",..: 36 36 36 36 36 36 36 36 36 36 ...
## $ Amount : num 15.3 20.3 22 20.3 20.3 ...
## $ SKU : chr "851" "710" "844" "840" ...
## $ InvoiceTime: chr "08" "08" "08" "08" ...
## - attr(*, "na.action")=Class 'omit' Named int [1:135080] 623 1444 1445 1446 1447 1448 1449 1450 1451 1452 ...
## .. ..- attr(*, "names")= chr [1:135080] "623" "1444" "1445" "1446" ...
#View a summary#
summary(eRetail)## InvoiceNo StockCode
## 576339 : 542 85123A : 2077
## 579196 : 533 22423 : 1905
## 580727 : 529 85099B : 1662
## 578270 : 442 84879 : 1418
## 573576 : 435 47566 : 1416
## 567656 : 421 20725 : 1359
## (Other):403927 (Other):396992
## Description Quantity
## WHITE HANGING HEART T-LIGHT HOLDER: 2070 Min. :-80995.00
## REGENCY CAKESTAND 3 TIER : 1905 1st Qu.: 2.00
## JUMBO BAG RED RETROSPOT : 1662 Median : 5.00
## ASSORTED COLOUR BIRD ORNAMENT : 1418 Mean : 12.06
## PARTY BUNTING : 1416 3rd Qu.: 12.00
## LUNCH BAG RED RETROSPOT : 1358 Max. : 80995.00
## (Other) :397000
## InvoiceDate UnitPrice CustomerID
## Min. :2010-12-01 Min. : 0.00 Length:406829
## 1st Qu.:2011-04-06 1st Qu.: 1.25 Class :character
## Median :2011-07-31 Median : 1.95 Mode :character
## Mean :2011-07-10 Mean : 3.46
## 3rd Qu.:2011-10-20 3rd Qu.: 3.75
## Max. :2011-12-09 Max. :38970.00
##
## Country Amount SKU
## United Kingdom:361878 Min. :-168469.6 Length:406829
## Germany : 9495 1st Qu.: 4.2 Class :character
## France : 8491 Median : 11.1 Mode :character
## EIRE : 7485 Mean : 20.4
## Spain : 2533 3rd Qu.: 19.5
## Netherlands : 2371 Max. : 168469.6
## (Other) : 14576
## InvoiceTime
## Length:406829
## Class :character
## Mode :character
##
##
##
##
#View the top#
head(eRetail)## InvoiceNo StockCode Description Quantity
## 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
## 2 536365 71053 WHITE METAL LANTERN 6
## 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
## 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
## 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
## 6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2
## InvoiceDate UnitPrice CustomerID Country Amount SKU InvoiceTime
## 1 2010-12-01 2.55 17850 United Kingdom 15.30 851 08
## 2 2010-12-01 3.39 17850 United Kingdom 20.34 710 08
## 3 2010-12-01 2.75 17850 United Kingdom 22.00 844 08
## 4 2010-12-01 3.39 17850 United Kingdom 20.34 840 08
## 5 2010-12-01 3.39 17850 United Kingdom 20.34 840 08
## 6 2010-12-01 7.65 17850 United Kingdom 15.30 227 08
In total, there are 406829 unique customer ID in the one-year transaction data from 2010-12-01 to 2011-12-09. StockCode 85123A associated with product description "white hanging heart T-light holder" appeared 2077 times. At maximum, 542 unique invoices are associated with one customer with ID: 576339. The unit price has a range of $0.00 to $38970. The Quantity has a range of -80995 to 80995, in which extreme value should be excluded for RFM analysis.
#eRetail1 is eRetail re-sorted by product
eRetail1 <- ddply(eRetail, .(StockCode, Description), summarize,
sumAmount= sum(Amount),
sumQuantity= sum(Quantity),
nCustomer= length(unique(CustomerID)),
nPurchase= length(unique(InvoiceNo)) )head(eRetail1[order(-eRetail1$sumQuantity),] )## StockCode Description sumAmount sumQuantity
## 3029 84077 WORLD WAR 2 GLIDERS ASSTD DESIGNS 13332.33 53215
## 3461 85099B JUMBO BAG RED RETROSPOT 83236.76 45066
## 3289 84879 ASSORTED COLOUR BIRD ORNAMENT 56499.22 35314
## 3476 85123A WHITE HANGING HEART T-LIGHT HOLDER 93823.85 34147
## 435 21212 PACK OF 72 RETROSPOT CAKE CASES 16247.95 33409
## 1113 22197 POPCORN HOLDER 23098.30 30504
## nCustomer nPurchase
## 3029 307 477
## 3461 636 1643
## 3289 679 1385
## 3476 858 2013
## 435 636 1041
## 1113 296 668
head(eRetail1[order(-eRetail1$sumAmount),] )## StockCode Description sumAmount sumQuantity
## 1323 22423 REGENCY CAKESTAND 3 TIER 132870.40 11555
## 3476 85123A WHITE HANGING HEART T-LIGHT HOLDER 93823.85 34147
## 3461 85099B JUMBO BAG RED RETROSPOT 83236.76 45066
## 2806 47566 PARTY BUNTING 67687.53 15027
## 3916 POST POSTAGE 66710.24 3002
## 3289 84879 ASSORTED COLOUR BIRD ORNAMENT 56499.22 35314
## nCustomer nPurchase
## 1323 887 1884
## 3476 858 2013
## 3461 636 1643
## 2806 708 1399
## 3916 379 1194
## 3289 679 1385
head(eRetail1[order(-eRetail1$nCustomer),] )## StockCode Description sumAmount sumQuantity
## 1323 22423 REGENCY CAKESTAND 3 TIER 132870.40 11555
## 3476 85123A WHITE HANGING HEART T-LIGHT HOLDER 93823.85 34147
## 2806 47566 PARTY BUNTING 67687.53 15027
## 3289 84879 ASSORTED COLOUR BIRD ORNAMENT 56499.22 35314
## 1612 22720 SET OF 3 CAKE TINS PANTRY DESIGN 32607.80 6864
## 435 21212 PACK OF 72 RETROSPOT CAKE CASES 16247.95 33409
## nCustomer nPurchase
## 1323 887 1884
## 3476 858 2013
## 2806 708 1399
## 3289 679 1385
## 1612 640 1218
## 435 636 1041
head(eRetail1[order(-eRetail1$nPurchase),] )## StockCode Description sumAmount sumQuantity
## 3476 85123A WHITE HANGING HEART T-LIGHT HOLDER 93823.85 34147
## 1323 22423 REGENCY CAKESTAND 3 TIER 132870.40 11555
## 3461 85099B JUMBO BAG RED RETROSPOT 83236.76 45066
## 2806 47566 PARTY BUNTING 67687.53 15027
## 3289 84879 ASSORTED COLOUR BIRD ORNAMENT 56499.22 35314
## 175 20725 LUNCH BAG RED RETROSPOT 27230.05 17145
## nCustomer nPurchase
## 3476 858 2013
## 1323 887 1884
## 3461 636 1643
## 2806 708 1399
## 3289 679 1385
## 175 532 1329
#eRetail2 is a subset of eRetail1 only containing records related to the top selling products
eRetail2 <- subset(eRetail, Description%in%c(
"MEDIUM CERAMIC TOP STORAGE JAR",
"JUMBO BAG RED RETROSPOT",
"REGENCY CAKESTAND 3 TIER",
"WHITE HANGING HEART T-LIGHT HOLDER",
"PARTY BUNTING",
"WORLD WAR 2 GLIDERS ASSTD DESIGNS"),
select = c(Description,InvoiceDate,InvoiceTime,Quantity,CustomerID,Amount,InvoiceNo))
#re-name & covert character of variables
eRetail2$Invoice_month<-month(eRetail2$InvoiceDate)
eRetail2$Decription<-as.character(eRetail2$Description)ggplot(eRetail2, aes(x=Invoice_month, y= Quantity))+ facet_wrap(~Description, ncol=2) +
geom_bar(stat="identity") +
labs(title = "Sales volume by month", x = "Month", y = "Sales Volume")ggplot(eRetail2, aes(x=Invoice_month, y= Amount )) + facet_wrap(~Description, ncol=2) +
geom_bar(stat="identity") +
labs(title = "Sales by month", x = "Month", y = "Sales Revenue") ggplot(eRetail2, aes(x=Invoice_month, y= length(unique(CustomerID)) )) + facet_wrap(~Description, ncol=2) +
geom_bar(stat="identity") +
labs(title = "Sales by month", x = "Month", y = "Number of Customer") ggplot(eRetail2, aes(x=Invoice_month, y= length(unique(InvoiceNo)) )) + facet_wrap(~Description, ncol=2) +
geom_bar(stat="identity") +
labs(title = "Sales by month", x = "Month", y = "Number of Purchases") *The sales of products changes have different seasonality.
#eRetail3 is eRetail re-sorted by Invoicetime
eRetail3<-ddply(eRetail, .(InvoiceTime), summarize, sumAmount=sum(Amount), sumQuantity=sum(Quantity), nCustomer=length(unique(CustomerID)))ggplot(eRetail3, aes(x=InvoiceTime, y= sumQuantity)) +
geom_bar(stat="identity") +
labs(title = "Sales volume by the hour", x = "Hours", y = "Sales Volume")ggplot(eRetail3, aes(x=InvoiceTime, y= nCustomer)) +
geom_bar(stat="identity") +
labs(title = "Number of Customers by the hour", x = "Hours", y = "Number of customer") *The busiest hour of the day is around 12 pm for sales volume, and 12 pm for number of customers.
#getRFMdf is a self-defined function to put raw transactional data into a RFM data frame suitable for RFM analysis, the data is re-ordered by customer
getRFMdf<-function (RFM_raw){
RFM_raw <- RFM_raw[!duplicated(RFM_raw$CustomerID),]
RFM_raw <- cbind(RFM_raw, First_date = with(df,
as.Date(as.integer(by(InvoiceDate, CustomerID, min)), "1970/01/01")))
RFM_raw <- cbind(RFM_raw, Last_date = with(df,
as.Date(as.integer(by(InvoiceDate, CustomerID, max)), "1970/01/01")))
#Recency is the number of days since last date of purchase
AsOfDate <- max(RFM_raw$Last_date) #Asofdate is the last date in the whole data set
RFM_raw <- cbind(RFM_raw, Recency = with(df,
as.numeric(difftime(AsOfDate,RFM_raw$Last_date,units="days")))/30)
#First_purchase is the number of days since first date of purchase
RFM_raw <- cbind(RFM_raw, First_purchase = with(df,
as.numeric(difftime(AsOfDate,RFM_raw$First_date,units="days")))/30)
#Frequency is the number of purchases made by an individual customer
RFM_raw <- cbind(RFM_raw, Frequency = with(df,
as.numeric(by(InvoiceNo, CustomerID, function(x) length(unique(x))))))
#Monetary & related
#Monetary is the sum of sales revenue (sum of Amount) made by an individual customer
RFM_raw <- cbind(RFM_raw, Monetary = with(df,
as.numeric(by(Amount, CustomerID, sum))))
#avgM is the mean of sales revenue (mean of Amount) made by an individual customer
RFM_raw <- cbind(RFM_raw, avgM = with(df,
as.numeric(by(Amount, CustomerID, mean))))
#maxM is the Max value of Amountof purchases made by an individual customer
RFM_raw <- cbind(RFM_raw, maxM = with(df,
as.numeric(by(Amount, CustomerID, max))))
#Breadth is the total number of SKU (sub-categories of products) made by an individual customer
RFM_raw <- cbind(RFM_raw, Breadth = with(df,
as.numeric(by(SKU, CustomerID, function(x) length(unique(x))))))
#Tenure is the number of days between the first date and the last date of purchase.
RFM_raw <- cbind(RFM_raw, Tenure = with(df, as.numeric(difftime(RFM_raw$Last_date,RFM_raw$First_date,units="days")))/30)
#sumQuant is the total number of products purchased by an individual customer
RFM_raw <- cbind(RFM_raw, sumQuant = with(df,
as.numeric(by(Quantity, CustomerID, mean))))
}#getRFMnor is a self-defined function to normalize raw observation of Recency, First_date, Monetary & related, Frequency, Breadth, Tenure, and Quantity in RFM data frame by z-score normalization.
getRFMnor<-function (RFMn){
RFMn<- as.data.frame(scale(df2[14:22], center= TRUE))
RFMn<- cbind(df2[,c(1:13)],RFMn)
RFMn<- rename(RFMn, c("Recency" = "R", "Frequency" = "Fq", "Monetary" = "M", "Breadth" = "B" , "Tenure" = "Ten", "sumQuant" = "Q" ) )
}#score15 is a self-defined function to score the z-scored nomralized Recency, First_date, Monetary & related, Frequency, Breadth, Tenure, and Quantity into a rating of 1 to 5
score15<-function(x){
ceiling((rank(x))/(length(x))*5)
}
#getRFMscore is a self-defined function to translate the RFM data frame with normalized data into a new data frame in which Recency, First_date, Monetary & related, Frequency, Breadth, Tenure, and Quantity are scored 1 to 5
getRFMscore<-function (RFMs){
RFMs <- as.data.frame(lapply(df3[,c(15:22)], score15))
RFMs <- cbind(df3[,c(1:13)], R= ceiling((rank(-df3$R))/(length(df3$R))*5), RFMs)
RFMs <- cbind(RFMs,RFMScore = 100*RFMs$R + 10*RFMs$Fq+RFMs$M)
}#input Retail dataset into function getRFMdf to get rawRFM
df <- eRetail
rawRFM<-as.data.frame(getRFMdf(df))#take a look at disturbution of rawRFM
par(mfrow = c(1,3))
boxplot(rawRFM$Recency)$stats[c(1, 5), ]## [1] 0.0 11.1
boxplot(rawRFM$Frequency)$stats[c(1, 5), ]## [1] 1 11
boxplot(rawRFM$Monetary)$stats[c(1, 5), ]## [1] -1592.49 3580.13
Many upside outliers for Recency, Frequency, and, Monetary.
#RFM is the rawRFM with exteme values of Recency, Frequency, and Monetary excluded
RFM<-subset(rawRFM,rawRFM$Recency<= 12 &
rawRFM$Frequency<= 25 &
rawRFM$Monetary>= 0 & rawRFM$Monetary<= 4000)
#closer look of Monetary after exclusion of outliers
summary(RFM$Monetary)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 288.4 603.0 929.2 1295.0 3992.0
boxplot(RFM$Monetary)$stats[c(1, 5), ]## [1] 0.00 2803.69
#histograms of Recency, Frequency and Monetary of RFM
par(mfrow = c(1,3))
hist(RFM$Recency)
hist(RFM$Frequency)
hist(RFM$Monetary)Strong left-skewness is observed for Recency, Frequency, and, Monetary.
df2<- RFM
nRFM<-as.data.frame(getRFMnor(df2))Strong left-skewness is still observed for Recency, Frequency, and, Monetary.
df3 <- nRFM
RFMs<-as.data.frame(getRFMscore(df3))
#histograms of Recency, Frequency and Monetary of RFMs
par(mfrow = c(1,3))
hist(RFMs$R)
hist(RFMs$Fq)
hist(RFMs$M)Now, equivalence for Recency and Monetary is observed as expected, but Frequency is not uni-variant nor normally disturbed.
#obtain new data frame RFM-cluster from nRFM with normalized data
RFM_cluster <- data.frame(nRFM$R,nRFM$Fq,nRFM$M)
#the number of clusters is defined to be 5
km <- kmeans(RFM_cluster,centers=5)
#combine cluster # to RFMs with scored RFM
RFM_cluster$cluster <- as.factor(km$cluster)
RFM_cluster <- cbind(RFM_cluster,RFMs)#visualize the cluster with scatter plot
ggplot(RFM_cluster,aes(x=nRFM.R, y=nRFM.M, color= cluster,size= nRFM.Fq))+
geom_point()+
scale_size_area(max_size=10)+
labs(title= "Scatter Plot of the clusters", x="Z-score Normalized Recency", y="Z-score Normalized Monetary")## Warning: Removed 2493 rows containing missing values (geom_point).
RFM_cluster1<-RFM_cluster[which(RFM_cluster$cluster==1),]
ggplot(RFM_cluster1,aes(x=nRFM.R, y=nRFM.M, color= nRFM.Fq))+
geom_point()+
scale_size_area(max_size=10)+
labs(title= "Scatter Plot of the cluster 1", x="Z-score Normalized Recency", y="Z-score Normalized Monetary")apply(RFM_cluster1[,c(18,20:27)],2,mean)## R Fq M avgM maxM B
## 1.169565 1.938043 2.034783 3.057609 2.666304 2.163043
## Ten Q RFMScore
## 1.743478 2.825000 138.371739
ggplot(RFM_cluster1,aes(x=RFMScore))+
geom_histogram(bins=50)+
labs(title= "Disturbution of RFM scores in cluster 1", x="RFMScore", y="Count")Cluster 1 mostly contains customers with lowest recency (recency score of 1), below average Frequency (1.67) and below average Monetary (1.88) scores. The average overall RFMScore is 119, with a disturbution of strong left skewness to low RFM score. In addition, the average Tenure of the customers is below average (1.44). Therefore, cluster 1 contains mostly new customers of low value to the business.
RFM_cluster2<-RFM_cluster[which(RFM_cluster$cluster==2),]
ggplot(RFM_cluster2,aes(x=nRFM.R, y=nRFM.M, color= nRFM.Fq))+
geom_point()+
scale_size_area(max_size=10)+
labs(title= "Scatter Plot of the cluster 2", x="Z-score Normalized Recency", y="Z-score Normalized Monetary")apply(RFM_cluster2[,c(18,20:27)],2,mean)## R Fq M avgM maxM B
## 3.727067 4.099660 4.023783 2.990940 3.459796 3.722537
## Ten Q RFMScore
## 4.015855 3.077010 417.727067
ggplot(RFM_cluster2,aes(x=RFMScore))+
geom_histogram(bins=50)+
labs(title= "Disturbution of RFM scores in cluster 2", x="RFMScore", y="Count")Cluster 2 mostly contains customers with lower recency (recency score of 1 or 2), average Frequency (2.35) and average Monetary (2.34) scores. The average overall RFMScore is 190, with a bi-modality disturbution of low RFM scores and slightly below average RFM scores. In addition, the average Tenure of the customers is average (2.37). Therefore, cluster 2 contains customers that have lower value to the business mainly because they did not purchase often in recent month.
RFM_cluster3<-RFM_cluster[which(RFM_cluster$cluster==3),]
ggplot(RFM_cluster3,aes(x=nRFM.R, y=nRFM.M, color= nRFM.Fq))+
geom_point()+
scale_size_area(max_size=10)+
labs(title= "Scatter Plot of the cluster 3", x="Z-score Normalized Recency", y="Z-score Normalized Monetary")apply(RFM_cluster3[,c(18,20:27)],2,mean)## R Fq M avgM maxM B
## 3.294470 2.163225 2.212525 2.866089 2.546302 2.551632
## Ten Q RFMScore
## 2.289807 2.966023 353.291805
ggplot(RFM_cluster3,aes(x=RFMScore))+
geom_histogram(bins=50)+
labs(title= "Disturbution of RFM scores in cluster 3", x="RFMScore", y="Count")Cluster 3 mostly contains customers with above average recency score (3.75), high Frequency (4.26) and high Monetary (4.43) scores. The average overall RFMScore is 422, with a disturbution of strong right skewness to high RFM score. In addition, the average Tenure of the customers is above average (4.09). Therefore, cluster 3 contains loyal customers that have high value to the business because they purchase often and have been purchasing from the store for a long time.
RFM_cluster4<-RFM_cluster[which(RFM_cluster$cluster==4),]
ggplot(RFM_cluster4,aes(x=nRFM.R, y=nRFM.M, color= nRFM.Fq))+
geom_point()+
scale_size_area(max_size=10)+
labs(title= "Scatter Plot of the cluster 4", x="Z-score Normalized Recency", y="Z-score Normalized Monetary")apply(RFM_cluster4[,c(18,20:27)],2,mean)## R Fq M avgM maxM B
## 3.757925 4.498559 5.000000 3.585014 4.121037 4.285303
## Ten Q RFMScore
## 4.190202 3.521614 425.778098
ggplot(RFM_cluster4,aes(x=RFMScore))+
geom_histogram(bins=50)+
labs(title= "Disturbution of RFM scores in cluster 4", x="RFMScore", y="Count")Cluster 4 mostly contains customers with above average recency score (3.56), average Frequency (2.39) and Monetary (2.40) scores. The average overall RFMScore is 382, with a relativly normal disturbtion of scores ranging from relatively low to the high scores. In addition, the average Tenure of the customers is high (4.09). Therefore, cluster 4 contains average value customers.
RFM_cluster5<-RFM_cluster[which(RFM_cluster$cluster==5),]
ggplot(RFM_cluster5,aes(x=nRFM.R, y=nRFM.M, color= nRFM.Fq))+
geom_point()+
scale_size_area(max_size=10)+
labs(title= "Scatter Plot of the cluster 5", x="Z-score Normalized Recency", y="Z-score Normalized Monetary")apply(RFM_cluster5[,c(18,20:27)],2,mean)## R Fq M avgM maxM B
## 4.321138 5.000000 4.930894 2.817073 3.780488 4.459350
## Ten Q RFMScore
## 4.699187 2.853659 487.044715
ggplot(RFM_cluster5,aes(x=RFMScore))+
geom_histogram(bins=50)+
labs(title= "Disturbution of RFM scores in cluster 5", x="RFMScore", y="Count")Cluster 5 mostly contains customers with highest recency score (score of 4 or 5), highest Frequency (4.96) and highest Monetary (4.96) scores. The average overall RFMScore is 471, with strong right skewness to the high scores. In addition, the average Tenure of the customers is way above average (4.59). Therefore, cluster 4 contains high value customers, who buys often and a lot.
cluster1 <- ddply(RFM_cluster1, .(StockCode,Description), summarize, sumAmount= sum(Amount), sumQuantity= sum(Quantity), nCustomer= length(unique(CustomerID)), nPurchase= length(unique(InvoiceNo)) )
cluster2 <- ddply(RFM_cluster2, .(StockCode,Description), summarize, sumAmount= sum(Amount), sumQuantity= sum(Quantity), nCustomer= length(unique(CustomerID)), nPurchase= length(unique(InvoiceNo)) )
cluster3 <- ddply(RFM_cluster3, .(StockCode,Description), summarize, sumAmount= sum(Amount), sumQuantity= sum(Quantity), nCustomer= length(unique(CustomerID)), nPurchase= length(unique(InvoiceNo)) )
cluster4 <- ddply(RFM_cluster4, .(StockCode,Description), summarize, sumAmount= sum(Amount), sumQuantity= sum(Quantity), nCustomer= length(unique(CustomerID)), nPurchase= length(unique(InvoiceNo)) )
cluster5 <- ddply(RFM_cluster5, .(StockCode,Description), summarize, sumAmount= sum(Amount), sumQuantity= sum(Quantity), nCustomer= length(unique(CustomerID)), nPurchase= length(unique(InvoiceNo)) )head(cluster1[order(-cluster1$sumQuantity),] )## StockCode Description sumAmount sumQuantity
## 566 84077 WORLD WAR 2 GLIDERS ASSTD DESIGNS 604.80 2880
## 17 17096 ASSORTED LAQUERED INCENSE HOLDERS 293.76 1728
## 456 23167 SMALL CERAMIC TOP STORAGE JAR 931.50 1350
## 15 17003 BROCADE RING PURSE 180.00 720
## 555 79321 CHILLI LIGHTS 2068.80 512
## 546 71459 HANGING JAM JAR T-LIGHT HOLDER 318.13 433
## nCustomer nPurchase
## 566 2 2
## 17 1 1
## 456 1 1
## 15 1 1
## 555 3 3
## 546 5 5
head(cluster2[order(-cluster2$sumQuantity),] )## StockCode Description sumAmount sumQuantity
## 581 85123A WHITE HANGING HEART T-LIGHT HOLDER 5739.50 2242
## 7 16014 SMALL CHINESE STYLE SCISSOR 320.00 1000
## 152 21975 PACK OF 60 DINOSAUR CAKE CASES 252.00 600
## 307 22710 WRAP I LOVE LONDON 204.00 600
## 579 85099B JUMBO BAG RED RETROSPOT 709.58 411
## 82 21422 PORCELAIN ROSE SMALL 207.36 288
## nCustomer nPurchase
## 581 9 9
## 7 1 1
## 152 1 1
## 307 2 2
## 579 5 5
## 82 1 1
head(cluster3[order(-cluster3$sumQuantity),] )## StockCode Description sumAmount sumQuantity
## 623 23166 MEDIUM CERAMIC TOP STORAGE JAR 77183.6 74215
## 808 84826 ASSTD DESIGN 3D PAPER STICKERS 0.0 12540
## 27 18007 ESSENTIAL BALM 3.5g TIN IN ENVELOPE 144.0 2400
## 823 84950 ASSORTED COLOUR T-LIGHT HOLDER 1003.2 1824
## 794 84568 GIRLS ALPHABET IRON ON PATCHES 244.8 1440
## 359 22413 METAL SIGN TAKE IT OR LEAVE IT 3861.0 1404
## nCustomer nPurchase
## 623 1 1
## 808 1 1
## 27 1 1
## 823 1 1
## 794 1 1
## 359 1 1
head(cluster4[order(-cluster4$sumQuantity),] )## StockCode Description sumAmount sumQuantity
## 84 22086 PAPER CHAIN KIT 50'S CHRISTMAS 821.90 322
## 288 85123A WHITE HANGING HEART T-LIGHT HOLDER 568.65 223
## 257 51008 AFGHAN SLIPPER SOCK PAIR 590.00 200
## 287 85099B JUMBO BAG RED RETROSPOT 330.00 200
## 221 23188 VINTAGE 2 METER FOLDING RULER 228.60 156
## 279 84978 HANGING HEART JAR T-LIGHT HOLDER 159.00 150
## nCustomer nPurchase
## 84 2 2
## 288 4 4
## 257 1 1
## 287 1 1
## 221 2 2
## 279 1 1
head(cluster5[order(-cluster5$sumQuantity),] )## StockCode Description sumAmount sumQuantity
## 193 84077 WORLD WAR 2 GLIDERS ASSTD DESIGNS 518.4 2880
## 84 22385 JUMBO BAG SPACEBOY DESIGN 179.0 100
## 208 85099C JUMBO BAG BAROQUE BLACK WHITE 179.0 100
## 65 22142 CHRISTMAS CRAFT WHITE FAIRY 100.0 80
## 126 22777 GLASS CLOCHE LARGE 550.8 72
## 210 85123A WHITE HANGING HEART T-LIGHT HOLDER 163.2 64
## nCustomer nPurchase
## 193 1 1
## 84 1 1
## 208 1 1
## 65 1 1
## 126 1 1
## 210 1 1
In terms of toal number of items purchased, the top selling products among different clusters are mostly different. However, cluster 1 (the lowest values customers and new customers) and cluster 5 ( the highest value customers) both buy "wold war 2 gliders asstd designs" with the highest quantity. Besides, "White hanging heart T-light holder" is among the top 6 of highest total number sold among cluster 3, 4 and 5.