Online Retail

DATA SET DESCRIPTION

This is a transnational data set which contains all the transactions occurring between 12/01/2010 and 12/09/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

Load the data

The raw data includes 541909 observations of 8 variables. After omitting 136534 records with missing values in the observations, data set has 406829 records.

Data pre-processing

Create an aggregated variable named Amount, by multiplying Quantity with Price, which gives the total amount of money spent per product in each transaction. Create SKU variable that encodes the first 3 digits of Stockcode to indicate different stock keeping unit (subgroup of the product), which is related to later calculation of "Breadth" customer behavior indicator. *Separate the variable InvoiceDate into two variables Date and Time . This allows different transactions created by the same consumer on the same day but at different times to be treated separately.

#create Amount variable
#Amount is total amount of money spent per product in each transaction.
eRetail$Amount <- eRetail$Quantity * eRetail$UnitPrice

#create SKU variable
#SKU decribes indicate different stock keeping unit (subgroup of the product)
eRetail$SKU <- substr(eRetail$StockCode,1,3)

#separate date & time#
eRetail$InvoiceDate <- strptime(eRetail$InvoiceDate,"%m/%d/%Y %H:%M")
#InvoiceTime indicate the time of purchase
eRetail$InvoiceTime <- format(eRetail$InvoiceDate,"%H")
#InvoiceDate now indicate the date of purchase
eRetail$InvoiceDate <- as.Date(eRetail$InvoiceDate,"%m/%d/%Y")

#Change CustomerID into character value
eRetail$CustomerID <- as.character(eRetail$CustomerID)

#look at internal structure#
str(eRetail)
## 'data.frame':    406829 obs. of  11 variables:
##  $ InvoiceNo  : Factor w/ 25900 levels "536365","536366",..: 1 1 1 1 1 1 1 2 2 3 ...
##  $ StockCode  : Factor w/ 4070 levels "10002","10080",..: 3538 2795 3045 2986 2985 1663 801 1548 1547 3306 ...
##  $ Description: Factor w/ 4223 levels " 4 PURPLE FLOCK DINNER CANDLES",..: 4026 4034 931 1958 2979 3234 1572 1697 1694 258 ...
##  $ Quantity   : int  6 6 8 6 6 2 6 6 6 32 ...
##  $ InvoiceDate: Date, format: "2010-12-01" "2010-12-01" ...
##  $ UnitPrice  : num  2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
##  $ CustomerID : chr  "17850" "17850" "17850" "17850" ...
##  $ Country    : Factor w/ 38 levels "Australia","Austria",..: 36 36 36 36 36 36 36 36 36 36 ...
##  $ Amount     : num  15.3 20.3 22 20.3 20.3 ...
##  $ SKU        : chr  "851" "710" "844" "840" ...
##  $ InvoiceTime: chr  "08" "08" "08" "08" ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:135080] 623 1444 1445 1446 1447 1448 1449 1450 1451 1452 ...
##   .. ..- attr(*, "names")= chr [1:135080] "623" "1444" "1445" "1446" ...
#View a summary#
summary(eRetail)
##    InvoiceNo        StockCode     
##  576339 :   542   85123A :  2077  
##  579196 :   533   22423  :  1905  
##  580727 :   529   85099B :  1662  
##  578270 :   442   84879  :  1418  
##  573576 :   435   47566  :  1416  
##  567656 :   421   20725  :  1359  
##  (Other):403927   (Other):396992  
##                              Description        Quantity        
##  WHITE HANGING HEART T-LIGHT HOLDER:  2070   Min.   :-80995.00  
##  REGENCY CAKESTAND 3 TIER          :  1905   1st Qu.:     2.00  
##  JUMBO BAG RED RETROSPOT           :  1662   Median :     5.00  
##  ASSORTED COLOUR BIRD ORNAMENT     :  1418   Mean   :    12.06  
##  PARTY BUNTING                     :  1416   3rd Qu.:    12.00  
##  LUNCH BAG RED RETROSPOT           :  1358   Max.   : 80995.00  
##  (Other)                           :397000                      
##   InvoiceDate           UnitPrice         CustomerID       
##  Min.   :2010-12-01   Min.   :    0.00   Length:406829     
##  1st Qu.:2011-04-06   1st Qu.:    1.25   Class :character  
##  Median :2011-07-31   Median :    1.95   Mode  :character  
##  Mean   :2011-07-10   Mean   :    3.46                     
##  3rd Qu.:2011-10-20   3rd Qu.:    3.75                     
##  Max.   :2011-12-09   Max.   :38970.00                     
##                                                            
##            Country           Amount              SKU           
##  United Kingdom:361878   Min.   :-168469.6   Length:406829     
##  Germany       :  9495   1st Qu.:      4.2   Class :character  
##  France        :  8491   Median :     11.1   Mode  :character  
##  EIRE          :  7485   Mean   :     20.4                     
##  Spain         :  2533   3rd Qu.:     19.5                     
##  Netherlands   :  2371   Max.   : 168469.6                     
##  (Other)       : 14576                                         
##  InvoiceTime       
##  Length:406829     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
#View the top#
head(eRetail)
##   InvoiceNo StockCode                         Description Quantity
## 1    536365    85123A  WHITE HANGING HEART T-LIGHT HOLDER        6
## 2    536365     71053                 WHITE METAL LANTERN        6
## 3    536365    84406B      CREAM CUPID HEARTS COAT HANGER        8
## 4    536365    84029G KNITTED UNION FLAG HOT WATER BOTTLE        6
## 5    536365    84029E      RED WOOLLY HOTTIE WHITE HEART.        6
## 6    536365     22752        SET 7 BABUSHKA NESTING BOXES        2
##   InvoiceDate UnitPrice CustomerID        Country Amount SKU InvoiceTime
## 1  2010-12-01      2.55      17850 United Kingdom  15.30 851          08
## 2  2010-12-01      3.39      17850 United Kingdom  20.34 710          08
## 3  2010-12-01      2.75      17850 United Kingdom  22.00 844          08
## 4  2010-12-01      3.39      17850 United Kingdom  20.34 840          08
## 5  2010-12-01      3.39      17850 United Kingdom  20.34 840          08
## 6  2010-12-01      7.65      17850 United Kingdom  15.30 227          08

In total, there are 406829 unique customer ID in the one-year transaction data from 2010-12-01 to 2011-12-09. StockCode 85123A associated with product description "white hanging heart T-light holder" appeared 2077 times. At maximum, 542 unique invoices are associated with one customer with ID: 576339. The unit price has a range of $0.00 to $38970. The Quantity has a range of -80995 to 80995, in which extreme value should be excluded for RFM analysis.

SIMPLE EXPLORATION OF DATA

What are top 6 selling products accross all times?

#eRetail1 is eRetail re-sorted by product
eRetail1 <- ddply(eRetail, .(StockCode, Description), summarize, 
                  sumAmount= sum(Amount), 
                  sumQuantity= sum(Quantity), 
                  nCustomer= length(unique(CustomerID)), 
                  nPurchase= length(unique(InvoiceNo)) )

6 top selling products by sales volume:

head(eRetail1[order(-eRetail1$sumQuantity),] )
##      StockCode                        Description sumAmount sumQuantity
## 3029     84077  WORLD WAR 2 GLIDERS ASSTD DESIGNS  13332.33       53215
## 3461    85099B            JUMBO BAG RED RETROSPOT  83236.76       45066
## 3289     84879      ASSORTED COLOUR BIRD ORNAMENT  56499.22       35314
## 3476    85123A WHITE HANGING HEART T-LIGHT HOLDER  93823.85       34147
## 435      21212    PACK OF 72 RETROSPOT CAKE CASES  16247.95       33409
## 1113     22197                     POPCORN HOLDER  23098.30       30504
##      nCustomer nPurchase
## 3029       307       477
## 3461       636      1643
## 3289       679      1385
## 3476       858      2013
## 435        636      1041
## 1113       296       668

6 top selling products by sales revenue (sum of Amount):

head(eRetail1[order(-eRetail1$sumAmount),] )
##      StockCode                        Description sumAmount sumQuantity
## 1323     22423           REGENCY CAKESTAND 3 TIER 132870.40       11555
## 3476    85123A WHITE HANGING HEART T-LIGHT HOLDER  93823.85       34147
## 3461    85099B            JUMBO BAG RED RETROSPOT  83236.76       45066
## 2806     47566                      PARTY BUNTING  67687.53       15027
## 3916      POST                            POSTAGE  66710.24        3002
## 3289     84879      ASSORTED COLOUR BIRD ORNAMENT  56499.22       35314
##      nCustomer nPurchase
## 1323       887      1884
## 3476       858      2013
## 3461       636      1643
## 2806       708      1399
## 3916       379      1194
## 3289       679      1385

6 top selling products by number of customers:

head(eRetail1[order(-eRetail1$nCustomer),] )
##      StockCode                        Description sumAmount sumQuantity
## 1323     22423           REGENCY CAKESTAND 3 TIER 132870.40       11555
## 3476    85123A WHITE HANGING HEART T-LIGHT HOLDER  93823.85       34147
## 2806     47566                      PARTY BUNTING  67687.53       15027
## 3289     84879      ASSORTED COLOUR BIRD ORNAMENT  56499.22       35314
## 1612     22720  SET OF 3 CAKE TINS PANTRY DESIGN   32607.80        6864
## 435      21212    PACK OF 72 RETROSPOT CAKE CASES  16247.95       33409
##      nCustomer nPurchase
## 1323       887      1884
## 3476       858      2013
## 2806       708      1399
## 3289       679      1385
## 1612       640      1218
## 435        636      1041

6 top selling products by number of purchases:

head(eRetail1[order(-eRetail1$nPurchase),] )
##      StockCode                        Description sumAmount sumQuantity
## 3476    85123A WHITE HANGING HEART T-LIGHT HOLDER  93823.85       34147
## 1323     22423           REGENCY CAKESTAND 3 TIER 132870.40       11555
## 3461    85099B            JUMBO BAG RED RETROSPOT  83236.76       45066
## 2806     47566                      PARTY BUNTING  67687.53       15027
## 3289     84879      ASSORTED COLOUR BIRD ORNAMENT  56499.22       35314
## 175      20725            LUNCH BAG RED RETROSPOT  27230.05       17145
##      nCustomer nPurchase
## 3476       858      2013
## 1323       887      1884
## 3461       636      1643
## 2806       708      1399
## 3289       679      1385
## 175        532      1329

Do these sales of top selling products change with time (months)?Any seasonality?

#eRetail2 is a subset of eRetail1 only containing records related to the top selling products
eRetail2 <- subset(eRetail, Description%in%c(
  "MEDIUM CERAMIC TOP STORAGE JAR",
  "JUMBO BAG RED RETROSPOT",
  "REGENCY CAKESTAND 3 TIER",
  "WHITE HANGING HEART T-LIGHT HOLDER",
  "PARTY BUNTING",
  "WORLD WAR 2 GLIDERS ASSTD DESIGNS"), 
  select = c(Description,InvoiceDate,InvoiceTime,Quantity,CustomerID,Amount,InvoiceNo))
#re-name & covert character of variables
eRetail2$Invoice_month<-month(eRetail2$InvoiceDate)
eRetail2$Decription<-as.character(eRetail2$Description)

Sales volume of 6 top selling products by month:

ggplot(eRetail2, aes(x=Invoice_month, y= Quantity))+ facet_wrap(~Description, ncol=2) + 
  geom_bar(stat="identity") + 
  labs(title = "Sales volume by month", x = "Month", y = "Sales Volume")

Sales Revenue (sum of Amount) of 6 top selling products by month:

ggplot(eRetail2, aes(x=Invoice_month, y= Amount )) + facet_wrap(~Description, ncol=2) + 
  geom_bar(stat="identity") + 
  labs(title = "Sales by month", x = "Month", y = "Sales Revenue") 

Number of cusotmers of 6 top selling products by month:

ggplot(eRetail2, aes(x=Invoice_month, y= length(unique(CustomerID)) )) + facet_wrap(~Description, ncol=2) + 
  geom_bar(stat="identity") + 
  labs(title = "Sales by month", x = "Month", y = "Number of Customer") 

Number of purchases of 6 top selling products by month:

ggplot(eRetail2, aes(x=Invoice_month, y= length(unique(InvoiceNo)) )) + facet_wrap(~Description, ncol=2) + 
  geom_bar(stat="identity") + 
  labs(title = "Sales by month", x = "Month", y = "Number of Purchases") 

*The sales of products changes have different seasonality.

What are the busiest hours of a day?

#eRetail3 is eRetail re-sorted by Invoicetime
eRetail3<-ddply(eRetail, .(InvoiceTime), summarize, sumAmount=sum(Amount), sumQuantity=sum(Quantity), nCustomer=length(unique(CustomerID)))

Sales volume of 6 top selling products by the hour:

ggplot(eRetail3, aes(x=InvoiceTime, y= sumQuantity)) + 
  geom_bar(stat="identity") + 
  labs(title = "Sales volume by the hour", x = "Hours", y = "Sales Volume")

Number of cusotmers of 6 top selling products by the hour:

ggplot(eRetail3, aes(x=InvoiceTime, y= nCustomer)) + 
  geom_bar(stat="identity") + 
  labs(title = "Number of Customers by the hour", x = "Hours", y = "Number of customer")

*The busiest hour of the day is around 12 pm for sales volume, and 12 pm for number of customers.

START RFM ANALYSIS

Building dataset for RFM analysis

#getRFMdf is a self-defined function to put raw transactional data into a RFM data frame suitable for RFM analysis, the data is re-ordered by customer
getRFMdf<-function (RFM_raw){
  RFM_raw <- RFM_raw[!duplicated(RFM_raw$CustomerID),]
  RFM_raw <- cbind(RFM_raw, First_date = with(df,
  as.Date(as.integer(by(InvoiceDate, CustomerID, min)), "1970/01/01")))
  RFM_raw <- cbind(RFM_raw, Last_date = with(df,
as.Date(as.integer(by(InvoiceDate, CustomerID, max)), "1970/01/01")))
  #Recency is the number of days since last date of purchase
  AsOfDate <- max(RFM_raw$Last_date) #Asofdate is the last date in the whole data set
  RFM_raw <- cbind(RFM_raw, Recency = with(df,
as.numeric(difftime(AsOfDate,RFM_raw$Last_date,units="days")))/30)
  #First_purchase is the number of days since first date of purchase
  RFM_raw <- cbind(RFM_raw, First_purchase = with(df,
as.numeric(difftime(AsOfDate,RFM_raw$First_date,units="days")))/30)
  #Frequency is the number of purchases made by an individual customer
  RFM_raw <- cbind(RFM_raw, Frequency = with(df,
as.numeric(by(InvoiceNo, CustomerID, function(x) length(unique(x))))))
  #Monetary & related
  #Monetary is the sum of sales revenue (sum of Amount) made by an individual customer
  RFM_raw <- cbind(RFM_raw, Monetary = with(df,
as.numeric(by(Amount, CustomerID, sum))))
  #avgM is the mean of sales revenue (mean of Amount) made by an individual customer
  RFM_raw <- cbind(RFM_raw, avgM = with(df,
as.numeric(by(Amount, CustomerID, mean))))
  #maxM is the Max value of Amountof purchases made by an individual customer
  RFM_raw <- cbind(RFM_raw, maxM = with(df,
as.numeric(by(Amount, CustomerID, max))))
  #Breadth is the total number of SKU (sub-categories of products) made by an individual customer
  RFM_raw <- cbind(RFM_raw, Breadth = with(df,
as.numeric(by(SKU, CustomerID, function(x) length(unique(x))))))
  #Tenure is the number of days between the first date and the last date of purchase.
  RFM_raw <- cbind(RFM_raw, Tenure = with(df, as.numeric(difftime(RFM_raw$Last_date,RFM_raw$First_date,units="days")))/30)
  #sumQuant is the total number of products purchased by an individual customer
RFM_raw <- cbind(RFM_raw, sumQuant = with(df,
as.numeric(by(Quantity, CustomerID, mean))))
}

define getRFMnor

#getRFMnor is a self-defined function to normalize raw observation of Recency, First_date, Monetary & related, Frequency, Breadth, Tenure, and Quantity in RFM data frame by z-score normalization.
getRFMnor<-function (RFMn){
RFMn<- as.data.frame(scale(df2[14:22], center= TRUE))
RFMn<- cbind(df2[,c(1:13)],RFMn)
RFMn<- rename(RFMn, c("Recency" = "R", "Frequency" = "Fq", "Monetary" = "M", "Breadth" = "B" , "Tenure" = "Ten", "sumQuant" = "Q" ) )
}

define getRFMscore function

#score15 is a self-defined function to score the z-scored nomralized Recency, First_date, Monetary & related, Frequency, Breadth, Tenure, and Quantity into a rating of 1 to 5
score15<-function(x){
  ceiling((rank(x))/(length(x))*5)
}
#getRFMscore is a self-defined function to translate the RFM data frame with normalized data into a new data frame in which Recency, First_date, Monetary & related, Frequency, Breadth, Tenure, and Quantity are scored 1 to 5
getRFMscore<-function (RFMs){
  RFMs <- as.data.frame(lapply(df3[,c(15:22)], score15))
  RFMs <- cbind(df3[,c(1:13)], R= ceiling((rank(-df3$R))/(length(df3$R))*5), RFMs)
  RFMs <- cbind(RFMs,RFMScore = 100*RFMs$R + 10*RFMs$Fq+RFMs$M)
}

Get raw RFM dataframe "rawRFM" from Retail dataset(12/2010 to 12/2011)

#input Retail dataset into function getRFMdf to get rawRFM
df <- eRetail
rawRFM<-as.data.frame(getRFMdf(df))

some rawRFM EDA

#take a look at disturbution of rawRFM
 par(mfrow = c(1,3))
boxplot(rawRFM$Recency)$stats[c(1, 5), ]
## [1]  0.0 11.1
boxplot(rawRFM$Frequency)$stats[c(1, 5), ]
## [1]  1 11
boxplot(rawRFM$Monetary)$stats[c(1, 5), ]

## [1] -1592.49  3580.13

Many upside outliers for Recency, Frequency, and, Monetary.

Exclude outliners of rawRFM to get new dataframe RFM

#RFM is the rawRFM with exteme values of Recency, Frequency, and Monetary excluded
RFM<-subset(rawRFM,rawRFM$Recency<= 12 & 
              rawRFM$Frequency<= 25 & 
              rawRFM$Monetary>= 0 & rawRFM$Monetary<= 4000)
#closer look of Monetary after exclusion of outliers
summary(RFM$Monetary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   288.4   603.0   929.2  1295.0  3992.0
boxplot(RFM$Monetary)$stats[c(1, 5), ]

## [1]    0.00 2803.69
#histograms of Recency, Frequency and Monetary of RFM
par(mfrow = c(1,3))
hist(RFM$Recency)
hist(RFM$Frequency)
hist(RFM$Monetary)

Strong left-skewness is observed for Recency, Frequency, and, Monetary.

Nomralize RFM to get new dataframe nRFM

df2<- RFM
nRFM<-as.data.frame(getRFMnor(df2))

Strong left-skewness is still observed for Recency, Frequency, and, Monetary.

Score nRFM 1 to 5, and get new data frame RFMs with the scores

df3 <- nRFM
RFMs<-as.data.frame(getRFMscore(df3))
#histograms of Recency, Frequency and Monetary of RFMs
par(mfrow = c(1,3))
hist(RFMs$R)
hist(RFMs$Fq)
hist(RFMs$M)

Now, equivalence for Recency and Monetary is observed as expected, but Frequency is not uni-variant nor normally disturbed.

K-means cluster analysis

#obtain new data frame RFM-cluster from nRFM with normalized data
RFM_cluster <- data.frame(nRFM$R,nRFM$Fq,nRFM$M)
#the number of clusters is defined to be 5
km <- kmeans(RFM_cluster,centers=5)
#combine cluster # to RFMs with scored RFM
RFM_cluster$cluster <- as.factor(km$cluster)
RFM_cluster <- cbind(RFM_cluster,RFMs)

Visualize the clusters

#visualize the cluster with scatter plot
ggplot(RFM_cluster,aes(x=nRFM.R, y=nRFM.M, color= cluster,size= nRFM.Fq))+
  geom_point()+ 
  scale_size_area(max_size=10)+
  labs(title= "Scatter Plot of the clusters", x="Z-score Normalized Recency", y="Z-score Normalized Monetary")
## Warning: Removed 2493 rows containing missing values (geom_point).

Separate idividual cluster

Cluster No.1

RFM_cluster1<-RFM_cluster[which(RFM_cluster$cluster==1),]
ggplot(RFM_cluster1,aes(x=nRFM.R, y=nRFM.M, color= nRFM.Fq))+
  geom_point()+ 
  scale_size_area(max_size=10)+
  labs(title= "Scatter Plot of the cluster 1", x="Z-score Normalized Recency", y="Z-score Normalized Monetary")

apply(RFM_cluster1[,c(18,20:27)],2,mean)
##          R         Fq          M       avgM       maxM          B 
##   1.169565   1.938043   2.034783   3.057609   2.666304   2.163043 
##        Ten          Q   RFMScore 
##   1.743478   2.825000 138.371739
ggplot(RFM_cluster1,aes(x=RFMScore))+
  geom_histogram(bins=50)+ 
  labs(title= "Disturbution of RFM scores in cluster 1", x="RFMScore", y="Count")

Cluster 1 mostly contains customers with lowest recency (recency score of 1), below average Frequency (1.67) and below average Monetary (1.88) scores. The average overall RFMScore is 119, with a disturbution of strong left skewness to low RFM score. In addition, the average Tenure of the customers is below average (1.44). Therefore, cluster 1 contains mostly new customers of low value to the business.

Cluster No.2

RFM_cluster2<-RFM_cluster[which(RFM_cluster$cluster==2),]
ggplot(RFM_cluster2,aes(x=nRFM.R, y=nRFM.M, color= nRFM.Fq))+
  geom_point()+ 
  scale_size_area(max_size=10)+
  labs(title= "Scatter Plot of the cluster 2", x="Z-score Normalized Recency", y="Z-score Normalized Monetary")

apply(RFM_cluster2[,c(18,20:27)],2,mean)
##          R         Fq          M       avgM       maxM          B 
##   3.727067   4.099660   4.023783   2.990940   3.459796   3.722537 
##        Ten          Q   RFMScore 
##   4.015855   3.077010 417.727067
ggplot(RFM_cluster2,aes(x=RFMScore))+
  geom_histogram(bins=50)+ 
  labs(title= "Disturbution of RFM scores in cluster 2", x="RFMScore", y="Count")

Cluster 2 mostly contains customers with lower recency (recency score of 1 or 2), average Frequency (2.35) and average Monetary (2.34) scores. The average overall RFMScore is 190, with a bi-modality disturbution of low RFM scores and slightly below average RFM scores. In addition, the average Tenure of the customers is average (2.37). Therefore, cluster 2 contains customers that have lower value to the business mainly because they did not purchase often in recent month.

Cluster No.3

RFM_cluster3<-RFM_cluster[which(RFM_cluster$cluster==3),]
ggplot(RFM_cluster3,aes(x=nRFM.R, y=nRFM.M, color= nRFM.Fq))+
  geom_point()+ 
  scale_size_area(max_size=10)+
  labs(title= "Scatter Plot of the cluster 3", x="Z-score Normalized Recency", y="Z-score Normalized Monetary")

apply(RFM_cluster3[,c(18,20:27)],2,mean)
##          R         Fq          M       avgM       maxM          B 
##   3.294470   2.163225   2.212525   2.866089   2.546302   2.551632 
##        Ten          Q   RFMScore 
##   2.289807   2.966023 353.291805
ggplot(RFM_cluster3,aes(x=RFMScore))+
  geom_histogram(bins=50)+ 
  labs(title= "Disturbution of RFM scores in cluster 3", x="RFMScore", y="Count")

Cluster 3 mostly contains customers with above average recency score (3.75), high Frequency (4.26) and high Monetary (4.43) scores. The average overall RFMScore is 422, with a disturbution of strong right skewness to high RFM score. In addition, the average Tenure of the customers is above average (4.09). Therefore, cluster 3 contains loyal customers that have high value to the business because they purchase often and have been purchasing from the store for a long time.

Cluster No.4

RFM_cluster4<-RFM_cluster[which(RFM_cluster$cluster==4),]
ggplot(RFM_cluster4,aes(x=nRFM.R, y=nRFM.M, color= nRFM.Fq))+
  geom_point()+ 
  scale_size_area(max_size=10)+
  labs(title= "Scatter Plot of the cluster 4", x="Z-score Normalized Recency", y="Z-score Normalized Monetary")

apply(RFM_cluster4[,c(18,20:27)],2,mean)
##          R         Fq          M       avgM       maxM          B 
##   3.757925   4.498559   5.000000   3.585014   4.121037   4.285303 
##        Ten          Q   RFMScore 
##   4.190202   3.521614 425.778098
ggplot(RFM_cluster4,aes(x=RFMScore))+
  geom_histogram(bins=50)+ 
  labs(title= "Disturbution of RFM scores in cluster 4", x="RFMScore", y="Count")

Cluster 4 mostly contains customers with above average recency score (3.56), average Frequency (2.39) and Monetary (2.40) scores. The average overall RFMScore is 382, with a relativly normal disturbtion of scores ranging from relatively low to the high scores. In addition, the average Tenure of the customers is high (4.09). Therefore, cluster 4 contains average value customers.

Cluster No.5

RFM_cluster5<-RFM_cluster[which(RFM_cluster$cluster==5),]
ggplot(RFM_cluster5,aes(x=nRFM.R, y=nRFM.M, color= nRFM.Fq))+
  geom_point()+ 
  scale_size_area(max_size=10)+
  labs(title= "Scatter Plot of the cluster 5", x="Z-score Normalized Recency", y="Z-score Normalized Monetary")

apply(RFM_cluster5[,c(18,20:27)],2,mean)
##          R         Fq          M       avgM       maxM          B 
##   4.321138   5.000000   4.930894   2.817073   3.780488   4.459350 
##        Ten          Q   RFMScore 
##   4.699187   2.853659 487.044715
ggplot(RFM_cluster5,aes(x=RFMScore))+
  geom_histogram(bins=50)+ 
  labs(title= "Disturbution of RFM scores in cluster 5", x="RFMScore", y="Count")

Cluster 5 mostly contains customers with highest recency score (score of 4 or 5), highest Frequency (4.96) and highest Monetary (4.96) scores. The average overall RFMScore is 471, with strong right skewness to the high scores. In addition, the average Tenure of the customers is way above average (4.59). Therefore, cluster 4 contains high value customers, who buys often and a lot.

For each cluster, What are top 5 selling products across all times?

cluster1 <- ddply(RFM_cluster1, .(StockCode,Description), summarize, sumAmount= sum(Amount), sumQuantity= sum(Quantity), nCustomer= length(unique(CustomerID)), nPurchase= length(unique(InvoiceNo)) )
cluster2 <- ddply(RFM_cluster2, .(StockCode,Description), summarize, sumAmount= sum(Amount), sumQuantity= sum(Quantity), nCustomer= length(unique(CustomerID)), nPurchase= length(unique(InvoiceNo)) )
cluster3 <- ddply(RFM_cluster3, .(StockCode,Description), summarize, sumAmount= sum(Amount), sumQuantity= sum(Quantity), nCustomer= length(unique(CustomerID)), nPurchase= length(unique(InvoiceNo)) )
cluster4 <- ddply(RFM_cluster4, .(StockCode,Description), summarize, sumAmount= sum(Amount), sumQuantity= sum(Quantity), nCustomer= length(unique(CustomerID)), nPurchase= length(unique(InvoiceNo)) )
cluster5 <- ddply(RFM_cluster5, .(StockCode,Description), summarize, sumAmount= sum(Amount), sumQuantity= sum(Quantity), nCustomer= length(unique(CustomerID)), nPurchase= length(unique(InvoiceNo)) )
head(cluster1[order(-cluster1$sumQuantity),] )
##     StockCode                       Description sumAmount sumQuantity
## 566     84077 WORLD WAR 2 GLIDERS ASSTD DESIGNS    604.80        2880
## 17      17096 ASSORTED LAQUERED INCENSE HOLDERS    293.76        1728
## 456     23167    SMALL CERAMIC TOP STORAGE JAR     931.50        1350
## 15      17003               BROCADE RING PURSE     180.00         720
## 555     79321                     CHILLI LIGHTS   2068.80         512
## 546     71459    HANGING JAM JAR T-LIGHT HOLDER    318.13         433
##     nCustomer nPurchase
## 566         2         2
## 17          1         1
## 456         1         1
## 15          1         1
## 555         3         3
## 546         5         5
head(cluster2[order(-cluster2$sumQuantity),] )
##     StockCode                        Description sumAmount sumQuantity
## 581    85123A WHITE HANGING HEART T-LIGHT HOLDER   5739.50        2242
## 7       16014        SMALL CHINESE STYLE SCISSOR    320.00        1000
## 152     21975     PACK OF 60 DINOSAUR CAKE CASES    252.00         600
## 307     22710                WRAP I LOVE LONDON     204.00         600
## 579    85099B            JUMBO BAG RED RETROSPOT    709.58         411
## 82      21422               PORCELAIN ROSE SMALL    207.36         288
##     nCustomer nPurchase
## 581         9         9
## 7           1         1
## 152         1         1
## 307         2         2
## 579         5         5
## 82          1         1
head(cluster3[order(-cluster3$sumQuantity),] )
##     StockCode                         Description sumAmount sumQuantity
## 623     23166      MEDIUM CERAMIC TOP STORAGE JAR   77183.6       74215
## 808     84826      ASSTD DESIGN 3D PAPER STICKERS       0.0       12540
## 27      18007 ESSENTIAL BALM 3.5g TIN IN ENVELOPE     144.0        2400
## 823     84950      ASSORTED COLOUR T-LIGHT HOLDER    1003.2        1824
## 794     84568     GIRLS ALPHABET IRON ON PATCHES      244.8        1440
## 359     22413     METAL SIGN TAKE IT OR LEAVE IT     3861.0        1404
##     nCustomer nPurchase
## 623         1         1
## 808         1         1
## 27          1         1
## 823         1         1
## 794         1         1
## 359         1         1
head(cluster4[order(-cluster4$sumQuantity),] )
##     StockCode                        Description sumAmount sumQuantity
## 84      22086    PAPER CHAIN KIT 50'S CHRISTMAS     821.90         322
## 288    85123A WHITE HANGING HEART T-LIGHT HOLDER    568.65         223
## 257     51008           AFGHAN SLIPPER SOCK PAIR    590.00         200
## 287    85099B            JUMBO BAG RED RETROSPOT    330.00         200
## 221     23188     VINTAGE  2 METER FOLDING RULER    228.60         156
## 279     84978   HANGING HEART JAR T-LIGHT HOLDER    159.00         150
##     nCustomer nPurchase
## 84          2         2
## 288         4         4
## 257         1         1
## 287         1         1
## 221         2         2
## 279         1         1
head(cluster5[order(-cluster5$sumQuantity),] )
##     StockCode                        Description sumAmount sumQuantity
## 193     84077  WORLD WAR 2 GLIDERS ASSTD DESIGNS     518.4        2880
## 84      22385          JUMBO BAG SPACEBOY DESIGN     179.0         100
## 208    85099C     JUMBO  BAG BAROQUE BLACK WHITE     179.0         100
## 65      22142       CHRISTMAS CRAFT WHITE FAIRY      100.0          80
## 126     22777                 GLASS CLOCHE LARGE     550.8          72
## 210    85123A WHITE HANGING HEART T-LIGHT HOLDER     163.2          64
##     nCustomer nPurchase
## 193         1         1
## 84          1         1
## 208         1         1
## 65          1         1
## 126         1         1
## 210         1         1

In terms of toal number of items purchased, the top selling products among different clusters are mostly different. However, cluster 1 (the lowest values customers and new customers) and cluster 5 ( the highest value customers) both buy "wold war 2 gliders asstd designs" with the highest quantity. Besides, "White hanging heart T-light holder" is among the top 6 of highest total number sold among cluster 3, 4 and 5.