Online Retail

DATA SET DESCRIPTION

This is a transnational data set which contains all the transactions occurring between 12/01/2010 and 12/09/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

Load the data

The raw data includes 541909 observations of 8 variables. I omitted 136534 missing observations and the new data set has 406829 observations. Some of the Quantity data are negative numbers, which are inventory adjustments and other fees.

Data pre-processing

Create an aggregated variable named Amount, by multiplying Quantity with Price, which gives the total amount of money spent per product / item in each transaction. The amount variable is assume to represent sales revenue. Separate the variable InvoiceDate into two variables Date and Time . This allows different transactions created by the same consumer on the same day but at different times to be treated separately. Create SKU variable that encodes the first 3 digits of Stockcode to indicate different stock keeping unit, which is related to later calculation of "Breadth" customer behavior indicator.

#omit NA
eRetail<-na.omit(eRetail)
#create Amount variable#
eRetail$Amount <- eRetail$Quantity * eRetail$UnitPrice
#create SKU variable
eRetail$SKU <- substr(eRetail$StockCode,1,3)
#separate date & time#
eRetail$InvoiceDate<-strptime(eRetail$InvoiceDate,"%m/%d/%Y %H:%M")
eRetail$InvoiceTime = format(eRetail$InvoiceDate,"%H")
eRetail$InvoiceDate<-as.Date(eRetail$InvoiceDate,"%m/%d/%Y")
#look at internal structure#
str(eRetail)

## 'data.frame':    406829 obs. of  11 variables:
##  $ InvoiceNo  : Factor w/ 25900 levels "536365","536366",..: 1 1 1 1 1 1 1 2 2 3 ...
##  $ StockCode  : Factor w/ 4070 levels "10002","10080",..: 3538 2795 3045 2986 2985 1663 801 1548 1547 3306 ...
##  $ Description: Factor w/ 4223 levels " 4 PURPLE FLOCK DINNER CANDLES",..: 4026 4034 931 1958 2979 3234 1572 1697 1694 258 ...
##  $ Quantity   : int  6 6 8 6 6 2 6 6 6 32 ...
##  $ InvoiceDate: Date, format: "2010-12-01" "2010-12-01" ...
##  $ UnitPrice  : num  2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
##  $ CustomerID : int  17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
##  $ Country    : Factor w/ 38 levels "Australia","Austria",..: 36 36 36 36 36 36 36 36 36 36 ...
##  $ Amount     : num  15.3 20.3 22 20.3 20.3 ...
##  $ SKU        : chr  "851" "710" "844" "840" ...
##  $ InvoiceTime: chr  "08" "08" "08" "08" ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:135080] 623 1444 1445 1446 1447 1448 1449 1450 1451 1452 ...
##   .. ..- attr(*, "names")= chr [1:135080] "623" "1444" "1445" "1446" ...

#View a summary#
summary(eRetail)

##    InvoiceNo        StockCode     
##  576339 :   542   85123A :  2077  
##  579196 :   533   22423  :  1905  
##  580727 :   529   85099B :  1662  
##  578270 :   442   84879  :  1418  
##  573576 :   435   47566  :  1416  
##  567656 :   421   20725  :  1359  
##  (Other):403927   (Other):396992  
##                              Description        Quantity        
##  WHITE HANGING HEART T-LIGHT HOLDER:  2070   Min.   :-80995.00  
##  REGENCY CAKESTAND 3 TIER          :  1905   1st Qu.:     2.00  
##  JUMBO BAG RED RETROSPOT           :  1662   Median :     5.00  
##  ASSORTED COLOUR BIRD ORNAMENT     :  1418   Mean   :    12.06  
##  PARTY BUNTING                     :  1416   3rd Qu.:    12.00  
##  LUNCH BAG RED RETROSPOT           :  1358   Max.   : 80995.00  
##  (Other)                           :397000                      
##   InvoiceDate           UnitPrice          CustomerID   
##  Min.   :2010-12-01   Min.   :    0.00   Min.   :12346  
##  1st Qu.:2011-04-06   1st Qu.:    1.25   1st Qu.:13953  
##  Median :2011-07-31   Median :    1.95   Median :15152  
##  Mean   :2011-07-10   Mean   :    3.46   Mean   :15288  
##  3rd Qu.:2011-10-20   3rd Qu.:    3.75   3rd Qu.:16791  
##  Max.   :2011-12-09   Max.   :38970.00   Max.   :18287  
##                                                         
##            Country           Amount              SKU           
##  United Kingdom:361878   Min.   :-168469.6   Length:406829     
##  Germany       :  9495   1st Qu.:      4.2   Class :character  
##  France        :  8491   Median :     11.1   Mode  :character  
##  EIRE          :  7485   Mean   :     20.4                     
##  Spain         :  2533   3rd Qu.:     19.5                     
##  Netherlands   :  2371   Max.   : 168469.6                     
##  (Other)       : 14576                                         
##  InvoiceTime       
##  Length:406829     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

#View the top#
head(eRetail)

##   InvoiceNo StockCode                         Description Quantity
## 1    536365    85123A  WHITE HANGING HEART T-LIGHT HOLDER        6
## 2    536365     71053                 WHITE METAL LANTERN        6
## 3    536365    84406B      CREAM CUPID HEARTS COAT HANGER        8
## 4    536365    84029G KNITTED UNION FLAG HOT WATER BOTTLE        6
## 5    536365    84029E      RED WOOLLY HOTTIE WHITE HEART.        6
## 6    536365     22752        SET 7 BABUSHKA NESTING BOXES        2
##   InvoiceDate UnitPrice CustomerID        Country Amount SKU InvoiceTime
## 1  2010-12-01      2.55      17850 United Kingdom  15.30 851          08
## 2  2010-12-01      3.39      17850 United Kingdom  20.34 710          08
## 3  2010-12-01      2.75      17850 United Kingdom  22.00 844          08
## 4  2010-12-01      3.39      17850 United Kingdom  20.34 840          08
## 5  2010-12-01      3.39      17850 United Kingdom  20.34 840          08
## 6  2010-12-01      7.65      17850 United Kingdom  15.30 227          08

Now Retail dataset with adjust purchase value of discounts has 406829 observations of 11 variables. 4339 unique customer ID. Total sales revenue $8,474,959 *Total number of items sold 4,983,937

This is end of data re-origanization I did based on the orignal dataset (eRetail) to Retail. In addtion, discount dataset is a subset.

SIMPLE EXPLORATION OF DATA

Outliners

## boxplot of Amount
boxplot(eRetail$Amount)$stats[c(1, 5), ]

## [1] -18.75  42.45

#cutoff outliner
Retail<-subset(eRetail,eRetail$Amount>= -18.75 & eRetail$Amount<= 42.45 )

What are top 5 selling products accross all times?

Retail1 <- ddply(Retail, .(StockCode), summarize, sumAmount= sum(Amount), sumQuantity= sum(Quantity), nCustomer= length(unique(CustomerID)), nPurchase= length(unique(InvoiceNo)) )

head(Retail1[order(-Retail1$sumQuantity),] )

##      StockCode sumAmount sumQuantity nCustomer nPurchase
## 2793     84077   6728.97       23023       291       430
## 426      21212   9062.59       16441       610       967
## 3049     84879  22962.03       13587       628      1134
## 3014     84826    236.38       13522        26        28
## 1088     22197  10393.55       12165       390       920
## 3213    85099B  24659.24       12035       582      1339

head(Retail1[order(-Retail1$nCustomer),] )

##      StockCode sumAmount sumQuantity nCustomer nPurchase
## 3227    85123A  26939.51        9137       756      1583
## 1287     22423  21520.21        1697       719      1132
## 3049     84879  22962.03       13587       628      1134
## 2583     47566  22632.99        4585       615      1089
## 426      21212   9062.59       16441       610       967
## 1567     22720  14978.48        3018       610      1114

head(Retail1[order(-Retail1$sumAmount),] )

##      StockCode sumAmount sumQuantity nCustomer nPurchase
## 3227    85123A  26939.51        9137       756      1583
## 3213    85099B  24659.24       12035       582      1339
## 3049     84879  22962.03       13587       628      1134
## 2583     47566  22632.99        4585       615      1089
## 1287     22423  21520.21        1697       719      1132
## 2023     23203  18458.88        8906       483       989

head(Retail1[order(-Retail1$nPurchase),] )

##      StockCode sumAmount sumQuantity nCustomer nPurchase
## 3227    85123A  26939.51        9137       756      1583
## 3213    85099B  24659.24       12035       582      1339
## 171      20725  16063.55        9695       518      1216
## 3049     84879  22962.03       13587       628      1134
## 1287     22423  21520.21        1697       719      1132
## 1567     22720  14978.48        3018       610      1114

5 top selling products by sales volume: Product StockCode 3825 WORLD WAR 2 GLIDERS ASSTD DESIGNS 84077 1779 JUMBO BAG RED RETROSPOT 85099B 218 ASSORTED COLOUR BIRD ORNAMENT 84879 3736 WHITE HANGING HEART T-LIGHT HOLDER 85123A 2289 PACK OF 72 RETROSPOT CAKE CASES 21212 2625 POPCORN HOLDER 22197 sumAmount sumQuantity nCustomer nPurchase 3825 13332.33 53215 307 477 1779 83236.76 45066 636 1643 218 56499.22 35314 679 1385 3736 93823.85 34147 858 2013 2289 16247.95 33409 636 1041 2625 23098.30 30504 296 668

5 top selling products by number of customers: Product StockCode 2793 REGENCY CAKESTAND 3 TIER 22423 3736 WHITE HANGING HEART T-LIGHT HOLDER 85123A 2365 PARTY BUNTING 47566 218 ASSORTED COLOUR BIRD ORNAMENT 84879 3031 SET OF 3 CAKE TINS PANTRY DESIGN 22720 1779 JUMBO BAG RED RETROSPOT 85099B sumAmount sumQuantity nCustomer nPurchase 2793 132870.40 11555 887 1884 3736 93823.85 34147 858 2013 2365 67687.53 15027 708 1399 218 56499.22 35314 679 1385 3031 32607.80 6864 640 1218 1779 83236.76 45066 636 1643

5 top selling products by value of sales revenue: Product StockCode 2793 REGENCY CAKESTAND 3 TIER 22423 3736 WHITE HANGING HEART T-LIGHT HOLDER 85123A 1779 JUMBO BAG RED RETROSPOT 85099B 2365 PARTY BUNTING 47566 2637 POSTAGE POST 218 ASSORTED COLOUR BIRD ORNAMENT 84879 sumAmount sumQuantity nCustomer nPurchase 2793 132870.40 11555 887 1884 3736 93823.85 34147 858 2013 1779 83236.76 45066 636 1643 2365 67687.53 15027 708 1399 2637 66710.24 3002 378 1192 218 56499.22 35314 679 1385

5 top selling products by number of purchases: Product StockCode 3736 WHITE HANGING HEART T-LIGHT HOLDER 85123A 2793 REGENCY CAKESTAND 3 TIER 22423 1779 JUMBO BAG RED RETROSPOT 85099B 2365 PARTY BUNTING 47566 218 ASSORTED COLOUR BIRD ORNAMENT 84879 1961 LUNCH BAG RED RETROSPOT 20725 sumAmount sumQuantity nCustomer nPurchase 3736 93823.85 34147 858 2013 2793 132870.40 11555 887 1884 1779 83236.76 45066 636 1643 2365 67687.53 15027 708 1399 218 56499.22 35314 679 1385 1961 27230.05 17145 532 1329

Do these sales of top selling products change with time (months)?Any seasonality?

Retail2 <- subset(Retail, Description%in%c("MEDIUM CERAMIC TOP STORAGE JAR","JUMBO BAG RED RETROSPOT","REGENCY CAKESTAND 3 TIER","WHITE HANGING HEART T-LIGHT HOLDER","PARTY BUNTING","WORLD WAR 2 GLIDERS ASSTD DESIGNS"), select = c(Description,InvoiceDate,InvoiceTime,Quantity,CustomerID,Amount,InvoiceNo))
Retail2$Invoice_month<-month(Retail2$InvoiceDate)
Retail2$Decription<-as.character(Retail2$Description)

ggplot(Retail2, aes(x=Invoice_month, y= Quantity))+ facet_wrap(~Description, ncol=2) + 
  geom_bar(stat="identity") + 
  labs(title = "Sales by month", x = "Month", y = "Sales Volume")

ggplot(Retail2, aes(x=Invoice_month, y= length(unique(CustomerID)) )) + facet_wrap(~Description, ncol=2) + 
  geom_bar(stat="identity") + 
  labs(title = "Sales by month", x = "Month", y = "Number of Customer")

ggplot(Retail2, aes(x=Invoice_month, y= Amount )) + facet_wrap(~Description, ncol=2) + 
  geom_bar(stat="identity") + 
  labs(title = "Sales by month", x = "Month", y = "Sales Revenue")

ggplot(Retail2, aes(x=Invoice_month, y= length(unique(InvoiceNo)) )) + facet_wrap(~Description, ncol=2) + 
  geom_bar(stat="identity") + 
  labs(title = "Sales by month", x = "Month", y = "Number of Purchases")

*The sales of products changes with time.

What are the busiest hours of a day?

Retail3<-ddply(Retail, .(InvoiceTime), summarize, sumAmount=sum(Amount), sumQuantity=sum(Quantity), nCustomer=length(unique(CustomerID)))
names(Retail3) [1] <-"InvoiceHour"

ggplot(Retail3, aes(x=InvoiceHour, y= sumQuantity)) + 
  geom_bar(stat="identity") + 
  labs(title = "Sales by hours", x = "Hours", y = "Sales Volume")

ggplot(Retail3, aes(x=InvoiceHour, y= nCustomer)) + 
  geom_bar(stat="identity") + 
  labs(title = "Sales by hours", x = "Hours", y = "Number of customer")

*The busiest hour of the day is around 12 pm for sales volume, and 12 pm for number of customers.

START RFM ANALYSIS

Building dataset for RFM analysis

getRFMdf<-function (RFM_raw){
RFM_raw <- RFM_raw[!duplicated(RFM_raw$CustomerID),]
RFM_raw <- cbind(RFM_raw, First_date = with(df,
as.Date(as.integer(by(InvoiceDate, CustomerID, min)), "1970/01/01")))
RFM_raw <- cbind(RFM_raw, Last_date = with(df,
as.Date(as.integer(by(InvoiceDate, CustomerID, max)), "1970/01/01")))
#Recency
AsOfDate <- max(RFM_raw$Last_date)
RFM_raw <- cbind(RFM_raw, Recency = with(df,
as.numeric(difftime(AsOfDate,RFM_raw$Last_date,units="days")))/30)
#First_purchase
RFM_raw <- cbind(RFM_raw, First_purchase = with(df,
as.numeric(difftime(AsOfDate,RFM_raw$First_date,units="days")))/30)
#Frequency
RFM_raw <- cbind(RFM_raw, Frequency = with(df,
as.numeric(by(InvoiceNo, CustomerID, function(x) length(unique(x))))))
#Monetary & related
RFM_raw <- cbind(RFM_raw, Monetary = with(df,
as.numeric(by(Amount, CustomerID, mean))))
RFM_raw <- cbind(RFM_raw, sumM = with(df,
as.numeric(by(Amount, CustomerID, sum))))
RFM_raw <- cbind(RFM_raw, maxM = with(df,
as.numeric(by(Amount, CustomerID, max))))
#Breadth
RFM_raw <- cbind(RFM_raw, Breadth = with(df,
as.numeric(by(SKU, CustomerID, function(x) length(unique(x))))))
#Tenure
RFM_raw <- cbind(RFM_raw, Tenure = with(df, as.numeric(difftime(RFM_raw$Last_date,RFM_raw$First_date,units="days")))/30)
#sum Quantity
RFM_raw <- cbind(RFM_raw, sumQuant = with(df,
as.numeric(by(Quantity, CustomerID, mean))))
}

define getRFMnor

#min-max normarlization
minmax<-function(x){
  (x-min(x))/(max(x)-min(x))
}
getRFMnor<-function (RFMn){
RFMn<- as.data.frame(minmax(df2[14:22]))
RFMn<- cbind(df2[,c(1:13)],RFMn)
RFMn<- rename(RFMn, c("Recency" = "R", "Frequency" = "Fq", "Monetary" = "M", "Breadth" = "B" , "Tenure" = "Ten", "sumQuant" = "Q" ) )
}

define getRFMscore function

#score 1 to 9
score19<-function(x){
  ceiling((rank(x))/(length(x))*9)
}
getRFMscore<-function (RFMs){
RFMs <- as.data.frame(lapply(df3[,c(15:22)], score19))
RFMs <- cbind(df3[,c(1:13)], R= ceiling((rank(-df3$R))/(length(df3$R))*9), RFMs)
RFMs <- cbind(RFMs,RFMScore = 100*RFMs$R + 50*RFMs$Fq+RFMs$M)
}

RFM score on the whole data (12/2010 to 12/2011)

df<- eRetail
rawRFM<-as.data.frame(getRFMdf(df))

some rawRFM score EDA

#take a look at disturbution
 par(mfrow = c(1,3))
boxplot(rawRFM$Recency)$stats[c(1, 5), ]

## [1]  0.0 11.1

boxplot(rawRFM$Frequency)$stats[c(1, 5), ]

## [1]  1 11

boxplot(rawRFM$Monetary)$stats[c(1, 5), ]

## [1] -7.45 42.30

Strong left-skewness for Recency, Frequency, TotalAmount,Monetary,Breadth, and Tenure.

Exclude outliners

RFM<-subset(rawRFM,rawRFM$Recency<= 13 & rawRFM$Frequency<= 100 & rawRFM$Monetary>= 0 & rawRFM$Monetary<= 400)
summary(rawRFM$Monetary)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -4288.00    10.99    16.92    28.84    23.54  3861.00

 par(mfrow = c(1,3))
 hist(RFM$Recency)
 hist(RFM$Frequency)
 hist(RFM$Monetary)

Now, the Left-skewness is better.

#data normalization
df2<- RFM
nRFM<-as.data.frame(getRFMnor(df2))
#score
df3 <- nRFM
RFMs<-as.data.frame(getRFMscore(df3))
 par(mfrow = c(1,3))
 hist(RFMs$R)
 hist(RFMs$Fq)
 hist(RFMs$M)

RFM visualization

In hope to find a trend / association among the scores identify customer segment with high value high value customer is defined as high Monetary score, in adjunct to high sumM score M is Monetary score, which is score upon the mean purchase value of the different purchases of an individual customer *sumM is score of sum of Amount, in other words, a score based upon total purchase value of all the purchases of an individual customer

#draw histogram
drawHistograms <- function(df,r=5,f=5,m=5){

#set the layout plot window
par(mfrow = c(m,f))

names <-rep("",times=r)
for(i in 1:r) names[i]<-paste("R",i)


for (i in 1:m){
    for (j in 1:f){
        c <- rep(0,times=r)
        for(k in 1:f){
            tmpdf <-df[df$Fq==j & df$M==i & df$R==k,]
            c[k]<- dim(tmpdf)[1]

        }
        if (i==1 & j==1) 
            barplot(c,col="lightblue",names.arg=names)
        else
            barplot(c,col="lightblue")
        if (j==1) title(ylab=paste("M",i))  
        if (i==1) title(main=paste("Fq",j)) 

    }

}

par(mfrow = c(5,5))

} # end of drawHistograms function

#drawHistograms(RFMscore)

BINARY INDEPENDENCE TESTING

2 sample t-test to see of scoring differs among one-time purchasers and repeat customers

RFMFq1<-RFM[which(RFM$Frequency==1),]
RFMFq2<-RFM[which(RFM$Frequency>1),]

There are 1227 one-time customers, 2597 repeat customers

t.test(RFMFq1$R,RFMFq2$R)#p-value < 2.2e-16

## 
##  Welch Two Sample t-test
## 
## data:  RFMFq1$R and RFMFq2$R
## t = 25.599, df = 1750.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.784685 3.246808
## sample estimates:
## mean of x mean of y 
##  5.118688  2.102941

t.test(RFMFq1$M,RFMFq2$M)#p-value = 0.002532

## 
##  Welch Two Sample t-test
## 
## data:  RFMFq1$M and RFMFq2$M
## t = 1.3192, df = 2139.7, p-value = 0.1872
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.8155223  4.1679774
## sample estimates:
## mean of x mean of y 
##  26.39549  24.71926

t.test(RFMFq1$sumM,RFMFq2$sumM)#p-value < 2.2e-16

## 
##  Welch Two Sample t-test
## 
## data:  RFMFq1$sumM and RFMFq2$sumM
## t = -14.625, df = 3068.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2213.823 -1690.404
## sample estimates:
## mean of x mean of y 
##  336.8013 2288.9153

t.test(RFMFq1$maxM,RFMFq2$maxM)#p-value < 2.2e-16

## 
##  Welch Two Sample t-test
## 
## data:  RFMFq1$maxM and RFMFq2$maxM
## t = -2.6068, df = 3031.6, p-value = 0.009183
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -286.73836  -40.55909
## sample estimates:
## mean of x mean of y 
##  60.16894 223.81766

t.test(RFMFq1$B,RFMFq2$B)#p-value < 2.2e-16

## 
##  Welch Two Sample t-test
## 
## data:  RFMFq1$B and RFMFq2$B
## t = -40.38, df = 3869.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -12.93662 -11.73855
## sample estimates:
## mean of x mean of y 
##  10.97008  23.30767

t.test(RFMFq1$Ten,RFMFq2$Ten)#p-value < 2.2e-16

## 
##  Welch Two Sample t-test
## 
## data:  RFMFq1$Ten and RFMFq2$Ten
## t = -87.982, df = 3025, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.506469 -6.222789
## sample estimates:
## mean of x mean of y 
##  0.000000  6.364629

t.test(RFMFq1$Q,RFMFq2$Q)#p-value = 0.3064

## 
##  Welch Two Sample t-test
## 
## data:  RFMFq1$Q and RFMFq2$Q
## t = -0.77575, df = 4149.2, p-value = 0.4379
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -75.73449  32.79241
## sample estimates:
## mean of x mean of y 
##  24.11654  45.58757

From 2-sample t-tests, the very small p-values suggested One-time purchasers behave very differently from the other purchasers. Except for adjusted mean purchase value, the respective RFM score for one-time purchasers are much lower than other customers.

One-way ANOVA analysis

Differiate countires

RFMC<-RFM[which(RFM$Country == "United Kingdom" | RFM$Country == "Germany" | RFM$Country == "France" ),]

ggplot(RFMC, aes(x = Country, y= Recency)) + geom_boxplot() +
  labs(title = "Recency by Country", x = "Country", y = "Recency")+coord_flip()

ggplot(RFMC, aes(x = Country, y = Frequency)) + geom_boxplot() +
  labs(title = "Frequency by Country", x = "Country", y = "Frequency")+coord_flip()

ggplot(RFMC, aes(x = Country, y = Monetary)) + geom_boxplot() +
  labs(title = "sumM by Country", x = "Country", y = "Monetary")+coord_flip()

#exclude outlier
RFMC<-RFM[which(RFM$Recency >= 10 & RFM$Frequency >= 10 | RFM$Monetary >= 30 ),]

Anova testing

# One Way Anova (Completely Randomized Design)
f1 <- lm(Recency ~ Country, data = RFMC)
summary(f1)

## 
## Call:
## lm(formula = Recency ~ Country, data = RFMC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.192 -2.682 -1.115  1.552  9.505 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)  
## (Intercept)             1.23333    3.40543   0.362    0.717  
## CountryBelgium          3.52500    3.80739   0.926    0.355  
## CountryChannel Islands  5.00000    4.17079   1.199    0.231  
## CountryDenmark         -0.60000    4.81601  -0.125    0.901  
## CountryEIRE            -0.93333    4.81601  -0.194    0.846  
## CountryFinland         -1.01667    4.17079  -0.244    0.807  
## CountryFrance           0.75238    3.64056   0.207    0.836  
## CountryGermany          1.69474    3.49390   0.485    0.628  
## CountryIsrael           0.80000    4.17079   0.192    0.848  
## CountryItaly           10.38333    4.17079   2.490    0.013 *
## CountryJapan            2.85000    4.17079   0.683    0.495  
## CountryNetherlands     -1.13333    4.81601  -0.235    0.814  
## CountryPoland          -0.96667    4.81601  -0.201    0.841  
## CountryPortugal         3.50000    4.17079   0.839    0.402  
## CountrySaudi Arabia     6.83333    4.81601   1.419    0.156  
## CountrySingapore        2.30000    4.81601   0.478    0.633  
## CountrySpain            1.37778    3.93225   0.350    0.726  
## CountrySwitzerland     -0.03333    4.17079  -0.008    0.994  
## CountryUnited Kingdom   1.98148    3.40803   0.581    0.561  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.405 on 690 degrees of freedom
## Multiple R-squared:  0.03361,    Adjusted R-squared:  0.008398 
## F-statistic: 1.333 on 18 and 690 DF,  p-value: 0.1596

anova(f1)

## Analysis of Variance Table
## 
## Response: Recency
##            Df Sum Sq Mean Sq F value Pr(>F)
## Country    18  278.3  15.460  1.3331 0.1596
## Residuals 690 8001.9  11.597

confint(f1)

##                             2.5 %    97.5 %
## (Intercept)             -5.452920  7.919587
## CountryBelgium          -3.950459 11.000459
## CountryChannel Islands  -3.188955 13.188955
## CountryDenmark         -10.055790  8.855790
## CountryEIRE            -10.389124  8.522457
## CountryFinland          -9.205621  7.172288
## CountryFrance           -6.395525  7.900287
## CountryGermany          -5.165214  8.554688
## CountryIsrael           -7.388955  8.988955
## CountryItaly             2.194379 18.572288
## CountryJapan            -5.338955 11.038955
## CountryNetherlands     -10.589124  8.322457
## CountryPoland          -10.422457  8.489124
## CountryPortugal         -4.688955 11.688955
## CountrySaudi Arabia     -2.622457 16.289124
## CountrySingapore        -7.155790 11.755790
## CountrySpain            -6.342843  9.098398
## CountrySwitzerland      -8.222288  8.155621
## CountryUnited Kingdom   -4.709880  8.672831

f2<-lm(Frequency ~ Country, data=RFMC) 
summary(f2)

## 
## Call:
## lm(formula = Frequency ~ Country, data = RFMC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.421 -5.744 -3.744  0.256 82.256 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)             5.000e+00  1.124e+01   0.445    0.657
## CountryBelgium         -7.500e-01  1.257e+01  -0.060    0.952
## CountryChannel Islands -4.000e+00  1.377e+01  -0.291    0.771
## CountryDenmark         -9.492e-13  1.590e+01   0.000    1.000
## CountryEIRE            -1.000e+00  1.590e+01  -0.063    0.950
## CountryFinland         -9.227e-13  1.377e+01   0.000    1.000
## CountryFrance          -2.286e+00  1.202e+01  -0.190    0.849
## CountryGermany          5.421e+00  1.153e+01   0.470    0.638
## CountryIsrael          -1.500e+00  1.377e+01  -0.109    0.913
## CountryItaly           -4.000e+00  1.377e+01  -0.291    0.771
## CountryJapan           -2.500e+00  1.377e+01  -0.182    0.856
## CountryNetherlands     -1.000e+00  1.590e+01  -0.063    0.950
## CountryPoland           1.300e+01  1.590e+01   0.818    0.414
## CountryPortugal        -1.500e+00  1.377e+01  -0.109    0.913
## CountrySaudi Arabia    -2.000e+00  1.590e+01  -0.126    0.900
## CountrySingapore       -4.000e+00  1.590e+01  -0.252    0.801
## CountrySpain            4.333e+00  1.298e+01   0.334    0.739
## CountrySwitzerland      2.000e+00  1.377e+01   0.145    0.885
## CountryUnited Kingdom   1.744e+00  1.125e+01   0.155    0.877
## 
## Residual standard error: 11.24 on 690 degrees of freedom
## Multiple R-squared:  0.009389,   Adjusted R-squared:  -0.01645 
## F-statistic: 0.3633 on 18 and 690 DF,  p-value: 0.9932

anova(f2)

## Analysis of Variance Table
## 
## Response: Frequency
##            Df Sum Sq Mean Sq F value Pr(>F)
## Country    18    826   45.91  0.3633 0.9932
## Residuals 690  87192  126.36

confint(f2)

##                            2.5 %   97.5 %
## (Intercept)            -17.07110 27.07110
## CountryBelgium         -25.42624 23.92624
## CountryChannel Islands -31.03147 23.03147
## CountryDenmark         -31.21325 31.21325
## CountryEIRE            -32.21325 30.21325
## CountryFinland         -27.03147 27.03147
## CountryFrance          -25.88072 21.30929
## CountryGermany         -17.22342 28.06553
## CountryIsrael          -28.53147 25.53147
## CountryItaly           -31.03147 23.03147
## CountryJapan           -29.53147 24.53147
## CountryNetherlands     -32.21325 30.21325
## CountryPoland          -18.21325 44.21325
## CountryPortugal        -28.53147 25.53147
## CountrySaudi Arabia    -33.21325 29.21325
## CountrySingapore       -35.21325 27.21325
## CountrySpain           -21.15218 29.81885
## CountrySwitzerland     -25.03147 29.03147
## CountryUnited Kingdom  -20.34443 23.83146

f3<-lm(Monetary ~ Country, data=RFMC) 
summary(f3)

## 
## Call:
## lm(formula = Monetary ~ Country, data = RFMC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -61.19 -40.69 -24.62   6.59 321.46 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)              66.085     67.133   0.984    0.325
## CountryBelgium          -10.838     75.057  -0.144    0.885
## CountryChannel Islands  -29.682     82.220  -0.361    0.718
## CountryDenmark           52.807     94.940   0.556    0.578
## CountryEIRE             -27.421     94.940  -0.289    0.773
## CountryFinland          -32.023     82.220  -0.389    0.697
## CountryFrance            19.811     71.768   0.276    0.783
## CountryGermany           14.817     68.877   0.215    0.830
## CountryIsrael            -2.819     82.220  -0.034    0.973
## CountryItaly             -9.094     82.220  -0.111    0.912
## CountryJapan            -33.790     82.220  -0.411    0.681
## CountryNetherlands      -34.911     94.940  -0.368    0.713
## CountryPoland           -32.642     94.940  -0.344    0.731
## CountryPortugal         -23.186     82.220  -0.282    0.778
## CountrySaudi Arabia      69.275     94.940   0.730    0.466
## CountrySingapore          6.479     94.940   0.068    0.946
## CountrySpain             -2.144     77.518  -0.028    0.978
## CountrySwitzerland      -14.346     82.220  -0.174    0.862
## CountryUnited Kingdom    12.054     67.184   0.179    0.858
## 
## Residual standard error: 67.13 on 690 degrees of freedom
## Multiple R-squared:  0.009719,   Adjusted R-squared:  -0.01611 
## F-statistic: 0.3762 on 18 and 690 DF,  p-value: 0.9916

anova(f3)

## Analysis of Variance Table
## 
## Response: Monetary
##            Df  Sum Sq Mean Sq F value Pr(>F)
## Country    18   30521  1695.6  0.3762 0.9916
## Residuals 690 3109696  4506.8

confint(f3)

##                             2.5 %   97.5 %
## (Intercept)             -65.72386 197.8941
## CountryBelgium         -158.20461 136.5292
## CountryChannel Islands -191.11475 131.7500
## CountryDenmark         -133.59934 239.2127
## CountryEIRE            -213.82724 158.9848
## CountryFinland         -193.45569 129.4090
## CountryFrance          -121.09894 160.7205
## CountryGermany         -120.41651 150.0498
## CountryIsrael          -164.25128 158.6135
## CountryItaly           -170.52671 152.3380
## CountryJapan           -195.22248 127.6423
## CountryNetherlands     -221.31658 151.4955
## CountryPoland          -219.04784 153.7642
## CountryPortugal        -184.61794 138.2468
## CountrySaudi Arabia    -117.13116 255.6809
## CountrySingapore       -179.92716 192.8849
## CountrySpain           -154.34387 150.0559
## CountrySwitzerland     -175.77856 147.0862
## CountryUnited Kingdom  -119.85599 143.9631

For Recency, There were no statistically significant differences between group means as determined by one-way ANOVA (F(4,45) = 1.616, p = 0.1868) For Frequency, There were no statistically significant differences between group means as determined by one-way ANOVA (F(4,45) = 0.2695, p = 0.8961) *For Monetary, if alpha = 0.05 There were statistically significant differences between group means as determined by one-way ANOVA (F(4,45) = 3.73, p = 0.0105)

K-means cluster analysis

Visulize the clusters

#here I use the normalized data
RFM_cluster <- data.frame(RFMs$R,RFMs$Fq,RFMs$M,RFMs$RFMScore)
d  <- dist(RFM_cluster) 
km <- kmeans(d,centers=3)
RFM_cluster$cluster <- km$cluster 
RFM_cluster$CustomerID <- RFMs$CustomerID

separate clusters

RFM_cluster1<-RFM_cluster[which(RFM_cluster$cluster==1),]
#mostly Recency = 3-9, RFM score ~650 -1000 
RFM_cluster2<-RFM_cluster[which(RFM_cluster$cluster==2),]
#mostly Recency = 1-5 , Frequency= 2-7, RFM score ~200 - 600
RFM_cluster3<-RFM_cluster[which(RFM_cluster$cluster==3),]
#mostly Recency = 6-9, Frequency >= 4, RFM score ~1000-1350

**the sequence of the cluster may change Cluster 1 Lower value customers Cluster 2 Newer Customers Cluster 3 High value Customers