This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 12/09/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
The raw data includes 541909 observations of 8 variables. I omitted 136534 missing observations and the new data set has 406829 observations. Some of the Quantity data are negative numbers, which are inventory adjustments and other fees.
Because the negative values of quantity not associated with cusotmers, for the purpose of RFM analysis, they can be excluded.Therefore, subset quantity > 0 & Unitrice>0. Create an aggregated variable named Amount, by multiplying Quantity with Price, which gives the total amount of money spent per product / item in each transaction. *Separate the variable InvoiceDate into two variables Date and Time . This allows different transactions created by the same consumer on the same day but at different times to be treated separately.
#exclude negative quantites#
eRetail <- subset(eRetail, Quantity > 0 & UnitPrice > 0, select = c(1:8))
#create Amount variable#
eRetail$Amount <- eRetail$Quantity * eRetail$UnitPrice
#separate date & time#
eRetail$Invoice_date <- format(as.POSIXct(strptime(eRetail$InvoiceDate,"%m/%d/%Y %H:%M",tz="")) ,format = "%m/%d/%Y")
eRetail$Invoice_time <- format(as.POSIXct(strptime(eRetail$InvoiceDate,"%m/%d/%Y %H:%M",tz="")) ,format = "%H:%M")
length(unique(eRetail$CustomerID))## [1] 4338
Now 397884 observations, 4338 unique customer ID.
*define first-year, second-year time frame as RFM1 & RFM2
#subset only necessary columns for later RFM Analysis
#by one year time frame
eRetail$Invoice_date <- as.Date(eRetail$Invoice_date, "%m/%d/%Y")
RFM1 <- subset(eRetail, Invoice_date < "2011-01-01")
length(unique(RFM1$CustomerID))## [1] 885
RFM2 <- subset(eRetail, Invoice_date >= "2011-01-01")
length(unique(RFM2$CustomerID))## [1] 4219
The first-year data frame "RFM1" contains 26157 observations, and 885 unique CustomerID. The second-year data frame "RFM2" contains 371727 observations, and 4219 unique CustomerID. The year 2011, compared to year 2010 has 3334 more customer IDs.
For each time frame, including the two-year "RFM"," Create an aggregated variable named First_date, which is the earliest Invoice_date for each CustomerID. Create an aggregated variable named Last_date, which is the lastest Invoice_date for each CustomerID. sum the amount per CustomerID, and find mean Amount name it Monetary count of purchase per CustomerID, name it Frequency, as number of purchases in the time frame. *Create an aggregated variable named Recency, as the Recency(days) from last date of purchase to the endDate, the smaller days value means more recent.
RFM<-ddply(eRetail, .(CustomerID), summarize, First_date=min(as.Date(Invoice_date)), Last_date=max(as.Date(Invoice_date)), sumAmount=sum(Amount), maxAmount=max(Amount), meanAmount=mean(Amount), Monetary=mean(Amount), Frequency=length(unique(InvoiceDate)),Breadth=length(unique(StockCode)) )
RFM$Recency<-as.numeric(difftime("2011-12-31",RFM$Last_date,units="weeks"))
length(unique(RFM$CustomerID))## [1] 4338
RFM$Tenure<-as.numeric(difftime(RFM$Last_date,RFM$First_date,units="weeks"))RFM1<-ddply(RFM1, .(CustomerID), summarize, First_date=min(as.Date(Invoice_date)), Last_date=max(as.Date(Invoice_date)), sumAmount=sum(Amount), maxAmount=max(Amount), meanAmount=mean(Amount), Monetary=mean(Amount), Frequency=length(unique(InvoiceDate)),Breadth=length(unique(StockCode)) )
RFM1$Recency<-as.numeric(difftime("2010-12-31",RFM1$Last_date,units="days"))
RFM1$Tenure<-as.numeric(difftime(RFM1$Last_date,RFM1$First_date,units="days"))RFM2<-ddply(RFM2, .(CustomerID), summarize, First_date=min(as.Date(Invoice_date)), Last_date=max(as.Date(Invoice_date)), sumAmount=sum(Amount), maxAmount=max(Amount), meanAmount=mean(Amount), Monetary=mean(Amount), Frequency=length(unique(InvoiceDate)),Breadth=length(unique(StockCode)) )
RFM2$Recency<-as.numeric(difftime("2011-12-31",RFM2$Last_date,units="days"))
RFM2$Tenure<-as.numeric(difftime(RFM2$Last_date,RFM2$First_date,units="days"))From the summary table, we know that the Customer are predominately associated with United Kingdom. From plotting the data and summary table, we know that Monetary, Frequency and Recency are not normally distributed, all of them are left skewed, and Recency is also clustered.
normalize Monetary, Frequency and Recency on a 1 to 5 scale by quartile normalization. To recap the definition, Recency = time since the last purchase Frequency = number of purchases in the year Monetary = value of the highest order from a given customer
M = RFM$Monetary
M<-rank(M)/length(M)
RFM$M = ceiling(M*5)
Fq = RFM$Frequency
Fq<-rank(Fq)/length(Fq)
RFM$Fq = ceiling(Fq*5)
R = RFM$Recency
R<-rank(-R)/length(R)
RFM$R = ceiling(R*5)
sumM = RFM$sumAmount
sumM<-rank(sumM)/length(sumM)
RFM$sumM = ceiling(sumM*5)Note that Rcency is smaller the better, so R is ranked in reverse. Now R the bigger the better, same as M and F.
*Now,is Monetary, Frequency and Recency normally distributed?
Monetary is normally distributed. Note frequency is still not normally distributed with this method. 8Note recency still shows slight bi-modality. *For Monetary, there is more customer with Monetary rating of 5, which means
Similary, for RFM1&RFM2
Monetary is strongly correleated with Recency and Frequency. In all three time frames.
# Recency by Frequency - Annual Value (total annual sales to segment)
ggplot(RFM, aes(x =R, y = Fq))+
geom_point(aes(size = M, colour = sumM))+
scale_size_area(max_size=15)+
#Add labels to axes
labs(x="Recency", y="Frequency")# Recency by Frequency - Annual Value (total annual sales to segment)
ggplot(RFM1, aes(x =R, y = Fq))+
geom_point(aes(size= M, colour = sumM))+
scale_size_area(max_size=15)+
#Add labels to axes
labs(x="Recency", y="Frequency")# Recency by Frequency - Annual Value (total annual sales to segment)
ggplot(RFM2, aes(x =R, y = Fq))+
geom_point(aes(size= M, colour = sumM))+
scale_size_area(max_size=15)+
#Add labels to axes
labs(x="Recency", y="Frequency")As appreaed in all three years, it seems than higher frequency strongly correlates higher value of the customer segment. And some groups have customer with different Monetary value, some have more uniformed Monetary value. Only in the sceondyear, there is also a treand that better Recency corrleates to higher value of the customer segment.