This is a transnational data set which contains all the transactions occurring between 12/01/2010 and 12/09/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
The raw data includes 541909 observations of 8 variables. I omitted 136534 missing observations and the new data set has 406829 observations. Some of the Quantity data are negative numbers, which are inventory adjustments and other fees.
Create an aggregated variable named Amount, by multiplying Quantity with Price, which gives the total amount of money spent per product / item in each transaction. The amount variable is assume to represent sales revenue. Separate the variable InvoiceDate into two variables Date and Time . This allows different transactions created by the same consumer on the same day but at different times to be treated separately. Create SKU variable that encodes the first 3 digits of Stockcode to indicate different stock keeping unit, which is related to later calculation of "Breadth" customer behavior indicator.
#omit NA
eRetail<-na.omit(eRetail)
#create Amount variable#
eRetail$Amount <- eRetail$Quantity * eRetail$UnitPrice
#create SKU variable
eRetail$SKU <- substr(eRetail$StockCode,1,3)
#separate date & time#
eRetail$InvoiceDate<-strptime(eRetail$InvoiceDate,"%m/%d/%Y %H:%M")
eRetail$InvoiceTime = format(eRetail$InvoiceDate,"%H")
eRetail$InvoiceDate<-as.Date(eRetail$InvoiceDate,"%m/%d/%Y")
#look at internal structure#
str(eRetail)## 'data.frame': 406829 obs. of 11 variables:
## $ InvoiceNo : Factor w/ 25900 levels "536365","536366",..: 1 1 1 1 1 1 1 2 2 3 ...
## $ StockCode : Factor w/ 4070 levels "10002","10080",..: 3538 2795 3045 2986 2985 1663 801 1548 1547 3306 ...
## $ Description: Factor w/ 4223 levels " 4 PURPLE FLOCK DINNER CANDLES",..: 4026 4034 931 1958 2979 3234 1572 1697 1694 258 ...
## $ Quantity : int 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDate: Date, format: "2010-12-01" "2010-12-01" ...
## $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : int 17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
## $ Country : Factor w/ 38 levels "Australia","Austria",..: 36 36 36 36 36 36 36 36 36 36 ...
## $ Amount : num 15.3 20.3 22 20.3 20.3 ...
## $ SKU : chr "851" "710" "844" "840" ...
## $ InvoiceTime: chr "08" "08" "08" "08" ...
## - attr(*, "na.action")=Class 'omit' Named int [1:135080] 623 1444 1445 1446 1447 1448 1449 1450 1451 1452 ...
## .. ..- attr(*, "names")= chr [1:135080] "623" "1444" "1445" "1446" ...
#View a summary#
summary(eRetail)## InvoiceNo StockCode
## 576339 : 542 85123A : 2077
## 579196 : 533 22423 : 1905
## 580727 : 529 85099B : 1662
## 578270 : 442 84879 : 1418
## 573576 : 435 47566 : 1416
## 567656 : 421 20725 : 1359
## (Other):403927 (Other):396992
## Description Quantity
## WHITE HANGING HEART T-LIGHT HOLDER: 2070 Min. :-80995.00
## REGENCY CAKESTAND 3 TIER : 1905 1st Qu.: 2.00
## JUMBO BAG RED RETROSPOT : 1662 Median : 5.00
## ASSORTED COLOUR BIRD ORNAMENT : 1418 Mean : 12.06
## PARTY BUNTING : 1416 3rd Qu.: 12.00
## LUNCH BAG RED RETROSPOT : 1358 Max. : 80995.00
## (Other) :397000
## InvoiceDate UnitPrice CustomerID
## Min. :2010-12-01 Min. : 0.00 Min. :12346
## 1st Qu.:2011-04-06 1st Qu.: 1.25 1st Qu.:13953
## Median :2011-07-31 Median : 1.95 Median :15152
## Mean :2011-07-10 Mean : 3.46 Mean :15288
## 3rd Qu.:2011-10-20 3rd Qu.: 3.75 3rd Qu.:16791
## Max. :2011-12-09 Max. :38970.00 Max. :18287
##
## Country Amount SKU
## United Kingdom:361878 Min. :-168469.6 Length:406829
## Germany : 9495 1st Qu.: 4.2 Class :character
## France : 8491 Median : 11.1 Mode :character
## EIRE : 7485 Mean : 20.4
## Spain : 2533 3rd Qu.: 19.5
## Netherlands : 2371 Max. : 168469.6
## (Other) : 14576
## InvoiceTime
## Length:406829
## Class :character
## Mode :character
##
##
##
##
#View the top#
head(eRetail)## InvoiceNo StockCode Description Quantity
## 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
## 2 536365 71053 WHITE METAL LANTERN 6
## 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
## 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
## 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
## 6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2
## InvoiceDate UnitPrice CustomerID Country Amount SKU InvoiceTime
## 1 2010-12-01 2.55 17850 United Kingdom 15.30 851 08
## 2 2010-12-01 3.39 17850 United Kingdom 20.34 710 08
## 3 2010-12-01 2.75 17850 United Kingdom 22.00 844 08
## 4 2010-12-01 3.39 17850 United Kingdom 20.34 840 08
## 5 2010-12-01 3.39 17850 United Kingdom 20.34 840 08
## 6 2010-12-01 7.65 17850 United Kingdom 15.30 227 08
Now Retail dataset with adjust purchase value of discounts has 406829 observations of 11 variables. 4339 unique customer ID. Total sales revenue $8,474,959 *Total number of items sold 4,983,937
## boxplot of Amount
boxplot(eRetail$Amount)$stats[c(1, 5), ]## [1] -18.75 42.45
#cutoff outliner
Retail<-subset(eRetail,eRetail$Amount>= -18.75 & eRetail$Amount<= 42.45 )Retail1 <- ddply(Retail, .(StockCode), summarize, sumAmount= sum(Amount), sumQuantity= sum(Quantity), nCustomer= length(unique(CustomerID)), nPurchase= length(unique(InvoiceNo)) )
head(Retail1[order(-Retail1$sumQuantity),] )## StockCode sumAmount sumQuantity nCustomer nPurchase
## 2793 84077 6728.97 23023 291 430
## 426 21212 9062.59 16441 610 967
## 3049 84879 22962.03 13587 628 1134
## 3014 84826 236.38 13522 26 28
## 1088 22197 10393.55 12165 390 920
## 3213 85099B 24659.24 12035 582 1339
head(Retail1[order(-Retail1$nCustomer),] )## StockCode sumAmount sumQuantity nCustomer nPurchase
## 3227 85123A 26939.51 9137 756 1583
## 1287 22423 21520.21 1697 719 1132
## 3049 84879 22962.03 13587 628 1134
## 2583 47566 22632.99 4585 615 1089
## 426 21212 9062.59 16441 610 967
## 1567 22720 14978.48 3018 610 1114
head(Retail1[order(-Retail1$sumAmount),] )## StockCode sumAmount sumQuantity nCustomer nPurchase
## 3227 85123A 26939.51 9137 756 1583
## 3213 85099B 24659.24 12035 582 1339
## 3049 84879 22962.03 13587 628 1134
## 2583 47566 22632.99 4585 615 1089
## 1287 22423 21520.21 1697 719 1132
## 2023 23203 18458.88 8906 483 989
head(Retail1[order(-Retail1$nPurchase),] )## StockCode sumAmount sumQuantity nCustomer nPurchase
## 3227 85123A 26939.51 9137 756 1583
## 3213 85099B 24659.24 12035 582 1339
## 171 20725 16063.55 9695 518 1216
## 3049 84879 22962.03 13587 628 1134
## 1287 22423 21520.21 1697 719 1132
## 1567 22720 14978.48 3018 610 1114
5 top selling products by sales volume: Product StockCode 3825 WORLD WAR 2 GLIDERS ASSTD DESIGNS 84077 1779 JUMBO BAG RED RETROSPOT 85099B 218 ASSORTED COLOUR BIRD ORNAMENT 84879 3736 WHITE HANGING HEART T-LIGHT HOLDER 85123A 2289 PACK OF 72 RETROSPOT CAKE CASES 21212 2625 POPCORN HOLDER 22197 sumAmount sumQuantity nCustomer nPurchase 3825 13332.33 53215 307 477 1779 83236.76 45066 636 1643 218 56499.22 35314 679 1385 3736 93823.85 34147 858 2013 2289 16247.95 33409 636 1041 2625 23098.30 30504 296 668
5 top selling products by number of customers: Product StockCode 2793 REGENCY CAKESTAND 3 TIER 22423 3736 WHITE HANGING HEART T-LIGHT HOLDER 85123A 2365 PARTY BUNTING 47566 218 ASSORTED COLOUR BIRD ORNAMENT 84879 3031 SET OF 3 CAKE TINS PANTRY DESIGN 22720 1779 JUMBO BAG RED RETROSPOT 85099B sumAmount sumQuantity nCustomer nPurchase 2793 132870.40 11555 887 1884 3736 93823.85 34147 858 2013 2365 67687.53 15027 708 1399 218 56499.22 35314 679 1385 3031 32607.80 6864 640 1218 1779 83236.76 45066 636 1643
5 top selling products by value of sales revenue: Product StockCode 2793 REGENCY CAKESTAND 3 TIER 22423 3736 WHITE HANGING HEART T-LIGHT HOLDER 85123A 1779 JUMBO BAG RED RETROSPOT 85099B 2365 PARTY BUNTING 47566 2637 POSTAGE POST 218 ASSORTED COLOUR BIRD ORNAMENT 84879 sumAmount sumQuantity nCustomer nPurchase 2793 132870.40 11555 887 1884 3736 93823.85 34147 858 2013 1779 83236.76 45066 636 1643 2365 67687.53 15027 708 1399 2637 66710.24 3002 378 1192 218 56499.22 35314 679 1385
5 top selling products by number of purchases: Product StockCode 3736 WHITE HANGING HEART T-LIGHT HOLDER 85123A 2793 REGENCY CAKESTAND 3 TIER 22423 1779 JUMBO BAG RED RETROSPOT 85099B 2365 PARTY BUNTING 47566 218 ASSORTED COLOUR BIRD ORNAMENT 84879 1961 LUNCH BAG RED RETROSPOT 20725 sumAmount sumQuantity nCustomer nPurchase 3736 93823.85 34147 858 2013 2793 132870.40 11555 887 1884 1779 83236.76 45066 636 1643 2365 67687.53 15027 708 1399 218 56499.22 35314 679 1385 1961 27230.05 17145 532 1329
Retail2 <- subset(Retail, Description%in%c("MEDIUM CERAMIC TOP STORAGE JAR","JUMBO BAG RED RETROSPOT","REGENCY CAKESTAND 3 TIER","WHITE HANGING HEART T-LIGHT HOLDER","PARTY BUNTING","WORLD WAR 2 GLIDERS ASSTD DESIGNS"), select = c(Description,InvoiceDate,InvoiceTime,Quantity,CustomerID,Amount,InvoiceNo))
Retail2$Invoice_month<-month(Retail2$InvoiceDate)
Retail2$Decription<-as.character(Retail2$Description)ggplot(Retail2, aes(x=Invoice_month, y= Quantity))+ facet_wrap(~Description, ncol=2) +
geom_bar(stat="identity") +
labs(title = "Sales by month", x = "Month", y = "Sales Volume")ggplot(Retail2, aes(x=Invoice_month, y= length(unique(CustomerID)) )) + facet_wrap(~Description, ncol=2) +
geom_bar(stat="identity") +
labs(title = "Sales by month", x = "Month", y = "Number of Customer") ggplot(Retail2, aes(x=Invoice_month, y= Amount )) + facet_wrap(~Description, ncol=2) +
geom_bar(stat="identity") +
labs(title = "Sales by month", x = "Month", y = "Sales Revenue") ggplot(Retail2, aes(x=Invoice_month, y= length(unique(InvoiceNo)) )) + facet_wrap(~Description, ncol=2) +
geom_bar(stat="identity") +
labs(title = "Sales by month", x = "Month", y = "Number of Purchases") *The sales of products changes with time.
Retail3<-ddply(Retail, .(InvoiceTime), summarize, sumAmount=sum(Amount), sumQuantity=sum(Quantity), nCustomer=length(unique(CustomerID)))
names(Retail3) [1] <-"InvoiceHour"ggplot(Retail3, aes(x=InvoiceHour, y= sumQuantity)) +
geom_bar(stat="identity") +
labs(title = "Sales by hours", x = "Hours", y = "Sales Volume")ggplot(Retail3, aes(x=InvoiceHour, y= nCustomer)) +
geom_bar(stat="identity") +
labs(title = "Sales by hours", x = "Hours", y = "Number of customer") *The busiest hour of the day is around 12 pm for sales volume, and 12 pm for number of customers.
getRFMdf<-function (RFM_raw){
RFM_raw <- RFM_raw[!duplicated(RFM_raw$CustomerID),]
RFM_raw <- cbind(RFM_raw, First_date = with(df,
as.Date(as.integer(by(InvoiceDate, CustomerID, min)), "1970/01/01")))
RFM_raw <- cbind(RFM_raw, Last_date = with(df,
as.Date(as.integer(by(InvoiceDate, CustomerID, max)), "1970/01/01")))
#Recency
AsOfDate <- max(RFM_raw$Last_date)
RFM_raw <- cbind(RFM_raw, Recency = with(df,
as.numeric(difftime(AsOfDate,RFM_raw$Last_date,units="days")))/30)
#First_purchase
RFM_raw <- cbind(RFM_raw, First_purchase = with(df,
as.numeric(difftime(AsOfDate,RFM_raw$First_date,units="days")))/30)
#Frequency
RFM_raw <- cbind(RFM_raw, Frequency = with(df,
as.numeric(by(InvoiceNo, CustomerID, function(x) length(unique(x))))))
#Monetary & related
RFM_raw <- cbind(RFM_raw, Monetary = with(df,
as.numeric(by(Amount, CustomerID, mean))))
RFM_raw <- cbind(RFM_raw, sumM = with(df,
as.numeric(by(Amount, CustomerID, sum))))
RFM_raw <- cbind(RFM_raw, maxM = with(df,
as.numeric(by(Amount, CustomerID, max))))
#Breadth
RFM_raw <- cbind(RFM_raw, Breadth = with(df,
as.numeric(by(SKU, CustomerID, function(x) length(unique(x))))))
#Tenure
RFM_raw <- cbind(RFM_raw, Tenure = with(df, as.numeric(difftime(RFM_raw$Last_date,RFM_raw$First_date,units="days")))/30)
#sum Quantity
RFM_raw <- cbind(RFM_raw, sumQuant = with(df,
as.numeric(by(Quantity, CustomerID, mean))))
}#min-max normarlization
minmax<-function(x){
(x-min(x))/(max(x)-min(x))
}
getRFMnor<-function (RFMn){
RFMn<- as.data.frame(minmax(df2[14:22]))
RFMn<- cbind(df2[,c(1:13)],RFMn)
RFMn<- rename(RFMn, c("Recency" = "R", "Frequency" = "Fq", "Monetary" = "M", "Breadth" = "B" , "Tenure" = "Ten", "sumQuant" = "Q" ) )
}#score 1 to 9
score19<-function(x){
ceiling((rank(x))/(length(x))*9)
}
getRFMscore<-function (RFMs){
RFMs <- as.data.frame(lapply(df3[,c(15:22)], score19))
RFMs <- cbind(df3[,c(1:13)], R= ceiling((rank(-df3$R))/(length(df3$R))*9), RFMs)
RFMs <- cbind(RFMs,RFMScore = 100*RFMs$R + 50*RFMs$Fq+RFMs$M)
}df<- eRetail
rawRFM<-as.data.frame(getRFMdf(df))#take a look at disturbution
par(mfrow = c(1,3))
boxplot(rawRFM$Recency)$stats[c(1, 5), ]## [1] 0.0 11.1
boxplot(rawRFM$Frequency)$stats[c(1, 5), ]## [1] 1 11
boxplot(rawRFM$Monetary)$stats[c(1, 5), ]## [1] -7.45 42.30
Strong left-skewness for Recency, Frequency, TotalAmount,Monetary,Breadth, and Tenure.
RFM<-subset(rawRFM,rawRFM$Recency<= 13 & rawRFM$Frequency<= 100 & rawRFM$Monetary>= 0 & rawRFM$Monetary<= 400)
summary(rawRFM$Monetary)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4288.00 10.99 16.92 28.84 23.54 3861.00
par(mfrow = c(1,3))
hist(RFM$Recency)
hist(RFM$Frequency)
hist(RFM$Monetary)Now, the Left-skewness is better.
#data normalization
df2<- RFM
nRFM<-as.data.frame(getRFMnor(df2))
#score
df3 <- nRFM
RFMs<-as.data.frame(getRFMscore(df3))
par(mfrow = c(1,3))
hist(RFMs$R)
hist(RFMs$Fq)
hist(RFMs$M)In hope to find a trend / association among the scores identify customer segment with high value high value customer is defined as high Monetary score, in adjunct to high sumM score M is Monetary score, which is score upon the mean purchase value of the different purchases of an individual customer *sumM is score of sum of Amount, in other words, a score based upon total purchase value of all the purchases of an individual customer
#draw histogram
drawHistograms <- function(df,r=5,f=5,m=5){
#set the layout plot window
par(mfrow = c(m,f))
names <-rep("",times=r)
for(i in 1:r) names[i]<-paste("R",i)
for (i in 1:m){
for (j in 1:f){
c <- rep(0,times=r)
for(k in 1:f){
tmpdf <-df[df$Fq==j & df$M==i & df$R==k,]
c[k]<- dim(tmpdf)[1]
}
if (i==1 & j==1)
barplot(c,col="lightblue",names.arg=names)
else
barplot(c,col="lightblue")
if (j==1) title(ylab=paste("M",i))
if (i==1) title(main=paste("Fq",j))
}
}
par(mfrow = c(5,5))
} # end of drawHistograms function#drawHistograms(RFMscore)RFMFq1<-RFM[which(RFM$Frequency==1),]
RFMFq2<-RFM[which(RFM$Frequency>1),]There are 1227 one-time customers, 2597 repeat customers
t.test(RFMFq1$R,RFMFq2$R)#p-value < 2.2e-16##
## Welch Two Sample t-test
##
## data: RFMFq1$R and RFMFq2$R
## t = 25.599, df = 1750.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.784685 3.246808
## sample estimates:
## mean of x mean of y
## 5.118688 2.102941
t.test(RFMFq1$M,RFMFq2$M)#p-value = 0.002532##
## Welch Two Sample t-test
##
## data: RFMFq1$M and RFMFq2$M
## t = 1.3192, df = 2139.7, p-value = 0.1872
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.8155223 4.1679774
## sample estimates:
## mean of x mean of y
## 26.39549 24.71926
t.test(RFMFq1$sumM,RFMFq2$sumM)#p-value < 2.2e-16##
## Welch Two Sample t-test
##
## data: RFMFq1$sumM and RFMFq2$sumM
## t = -14.625, df = 3068.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2213.823 -1690.404
## sample estimates:
## mean of x mean of y
## 336.8013 2288.9153
t.test(RFMFq1$maxM,RFMFq2$maxM)#p-value < 2.2e-16##
## Welch Two Sample t-test
##
## data: RFMFq1$maxM and RFMFq2$maxM
## t = -2.6068, df = 3031.6, p-value = 0.009183
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -286.73836 -40.55909
## sample estimates:
## mean of x mean of y
## 60.16894 223.81766
t.test(RFMFq1$B,RFMFq2$B)#p-value < 2.2e-16##
## Welch Two Sample t-test
##
## data: RFMFq1$B and RFMFq2$B
## t = -40.38, df = 3869.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -12.93662 -11.73855
## sample estimates:
## mean of x mean of y
## 10.97008 23.30767
t.test(RFMFq1$Ten,RFMFq2$Ten)#p-value < 2.2e-16##
## Welch Two Sample t-test
##
## data: RFMFq1$Ten and RFMFq2$Ten
## t = -87.982, df = 3025, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.506469 -6.222789
## sample estimates:
## mean of x mean of y
## 0.000000 6.364629
t.test(RFMFq1$Q,RFMFq2$Q)#p-value = 0.3064##
## Welch Two Sample t-test
##
## data: RFMFq1$Q and RFMFq2$Q
## t = -0.77575, df = 4149.2, p-value = 0.4379
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -75.73449 32.79241
## sample estimates:
## mean of x mean of y
## 24.11654 45.58757
From 2-sample t-tests, the very small p-values suggested One-time purchasers behave very differently from the other purchasers. Except for adjusted mean purchase value, the respective RFM score for one-time purchasers are much lower than other customers.
RFMC<-RFM[which(RFM$Country == "United Kingdom" | RFM$Country == "Germany" | RFM$Country == "France" ),]
ggplot(RFMC, aes(x = Country, y= Recency)) + geom_boxplot() +
labs(title = "Recency by Country", x = "Country", y = "Recency")+coord_flip()ggplot(RFMC, aes(x = Country, y = Frequency)) + geom_boxplot() +
labs(title = "Frequency by Country", x = "Country", y = "Frequency")+coord_flip()ggplot(RFMC, aes(x = Country, y = Monetary)) + geom_boxplot() +
labs(title = "sumM by Country", x = "Country", y = "Monetary")+coord_flip()#exclude outlier
RFMC<-RFM[which(RFM$Recency >= 10 & RFM$Frequency >= 10 | RFM$Monetary >= 30 ),]# One Way Anova (Completely Randomized Design)
f1 <- lm(Recency ~ Country, data = RFMC)
summary(f1)##
## Call:
## lm(formula = Recency ~ Country, data = RFMC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.192 -2.682 -1.115 1.552 9.505
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.23333 3.40543 0.362 0.717
## CountryBelgium 3.52500 3.80739 0.926 0.355
## CountryChannel Islands 5.00000 4.17079 1.199 0.231
## CountryDenmark -0.60000 4.81601 -0.125 0.901
## CountryEIRE -0.93333 4.81601 -0.194 0.846
## CountryFinland -1.01667 4.17079 -0.244 0.807
## CountryFrance 0.75238 3.64056 0.207 0.836
## CountryGermany 1.69474 3.49390 0.485 0.628
## CountryIsrael 0.80000 4.17079 0.192 0.848
## CountryItaly 10.38333 4.17079 2.490 0.013 *
## CountryJapan 2.85000 4.17079 0.683 0.495
## CountryNetherlands -1.13333 4.81601 -0.235 0.814
## CountryPoland -0.96667 4.81601 -0.201 0.841
## CountryPortugal 3.50000 4.17079 0.839 0.402
## CountrySaudi Arabia 6.83333 4.81601 1.419 0.156
## CountrySingapore 2.30000 4.81601 0.478 0.633
## CountrySpain 1.37778 3.93225 0.350 0.726
## CountrySwitzerland -0.03333 4.17079 -0.008 0.994
## CountryUnited Kingdom 1.98148 3.40803 0.581 0.561
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.405 on 690 degrees of freedom
## Multiple R-squared: 0.03361, Adjusted R-squared: 0.008398
## F-statistic: 1.333 on 18 and 690 DF, p-value: 0.1596
anova(f1)## Analysis of Variance Table
##
## Response: Recency
## Df Sum Sq Mean Sq F value Pr(>F)
## Country 18 278.3 15.460 1.3331 0.1596
## Residuals 690 8001.9 11.597
confint(f1)## 2.5 % 97.5 %
## (Intercept) -5.452920 7.919587
## CountryBelgium -3.950459 11.000459
## CountryChannel Islands -3.188955 13.188955
## CountryDenmark -10.055790 8.855790
## CountryEIRE -10.389124 8.522457
## CountryFinland -9.205621 7.172288
## CountryFrance -6.395525 7.900287
## CountryGermany -5.165214 8.554688
## CountryIsrael -7.388955 8.988955
## CountryItaly 2.194379 18.572288
## CountryJapan -5.338955 11.038955
## CountryNetherlands -10.589124 8.322457
## CountryPoland -10.422457 8.489124
## CountryPortugal -4.688955 11.688955
## CountrySaudi Arabia -2.622457 16.289124
## CountrySingapore -7.155790 11.755790
## CountrySpain -6.342843 9.098398
## CountrySwitzerland -8.222288 8.155621
## CountryUnited Kingdom -4.709880 8.672831
f2<-lm(Frequency ~ Country, data=RFMC)
summary(f2)##
## Call:
## lm(formula = Frequency ~ Country, data = RFMC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.421 -5.744 -3.744 0.256 82.256
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.000e+00 1.124e+01 0.445 0.657
## CountryBelgium -7.500e-01 1.257e+01 -0.060 0.952
## CountryChannel Islands -4.000e+00 1.377e+01 -0.291 0.771
## CountryDenmark -9.492e-13 1.590e+01 0.000 1.000
## CountryEIRE -1.000e+00 1.590e+01 -0.063 0.950
## CountryFinland -9.227e-13 1.377e+01 0.000 1.000
## CountryFrance -2.286e+00 1.202e+01 -0.190 0.849
## CountryGermany 5.421e+00 1.153e+01 0.470 0.638
## CountryIsrael -1.500e+00 1.377e+01 -0.109 0.913
## CountryItaly -4.000e+00 1.377e+01 -0.291 0.771
## CountryJapan -2.500e+00 1.377e+01 -0.182 0.856
## CountryNetherlands -1.000e+00 1.590e+01 -0.063 0.950
## CountryPoland 1.300e+01 1.590e+01 0.818 0.414
## CountryPortugal -1.500e+00 1.377e+01 -0.109 0.913
## CountrySaudi Arabia -2.000e+00 1.590e+01 -0.126 0.900
## CountrySingapore -4.000e+00 1.590e+01 -0.252 0.801
## CountrySpain 4.333e+00 1.298e+01 0.334 0.739
## CountrySwitzerland 2.000e+00 1.377e+01 0.145 0.885
## CountryUnited Kingdom 1.744e+00 1.125e+01 0.155 0.877
##
## Residual standard error: 11.24 on 690 degrees of freedom
## Multiple R-squared: 0.009389, Adjusted R-squared: -0.01645
## F-statistic: 0.3633 on 18 and 690 DF, p-value: 0.9932
anova(f2)## Analysis of Variance Table
##
## Response: Frequency
## Df Sum Sq Mean Sq F value Pr(>F)
## Country 18 826 45.91 0.3633 0.9932
## Residuals 690 87192 126.36
confint(f2)## 2.5 % 97.5 %
## (Intercept) -17.07110 27.07110
## CountryBelgium -25.42624 23.92624
## CountryChannel Islands -31.03147 23.03147
## CountryDenmark -31.21325 31.21325
## CountryEIRE -32.21325 30.21325
## CountryFinland -27.03147 27.03147
## CountryFrance -25.88072 21.30929
## CountryGermany -17.22342 28.06553
## CountryIsrael -28.53147 25.53147
## CountryItaly -31.03147 23.03147
## CountryJapan -29.53147 24.53147
## CountryNetherlands -32.21325 30.21325
## CountryPoland -18.21325 44.21325
## CountryPortugal -28.53147 25.53147
## CountrySaudi Arabia -33.21325 29.21325
## CountrySingapore -35.21325 27.21325
## CountrySpain -21.15218 29.81885
## CountrySwitzerland -25.03147 29.03147
## CountryUnited Kingdom -20.34443 23.83146
f3<-lm(Monetary ~ Country, data=RFMC)
summary(f3)##
## Call:
## lm(formula = Monetary ~ Country, data = RFMC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.19 -40.69 -24.62 6.59 321.46
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.085 67.133 0.984 0.325
## CountryBelgium -10.838 75.057 -0.144 0.885
## CountryChannel Islands -29.682 82.220 -0.361 0.718
## CountryDenmark 52.807 94.940 0.556 0.578
## CountryEIRE -27.421 94.940 -0.289 0.773
## CountryFinland -32.023 82.220 -0.389 0.697
## CountryFrance 19.811 71.768 0.276 0.783
## CountryGermany 14.817 68.877 0.215 0.830
## CountryIsrael -2.819 82.220 -0.034 0.973
## CountryItaly -9.094 82.220 -0.111 0.912
## CountryJapan -33.790 82.220 -0.411 0.681
## CountryNetherlands -34.911 94.940 -0.368 0.713
## CountryPoland -32.642 94.940 -0.344 0.731
## CountryPortugal -23.186 82.220 -0.282 0.778
## CountrySaudi Arabia 69.275 94.940 0.730 0.466
## CountrySingapore 6.479 94.940 0.068 0.946
## CountrySpain -2.144 77.518 -0.028 0.978
## CountrySwitzerland -14.346 82.220 -0.174 0.862
## CountryUnited Kingdom 12.054 67.184 0.179 0.858
##
## Residual standard error: 67.13 on 690 degrees of freedom
## Multiple R-squared: 0.009719, Adjusted R-squared: -0.01611
## F-statistic: 0.3762 on 18 and 690 DF, p-value: 0.9916
anova(f3)## Analysis of Variance Table
##
## Response: Monetary
## Df Sum Sq Mean Sq F value Pr(>F)
## Country 18 30521 1695.6 0.3762 0.9916
## Residuals 690 3109696 4506.8
confint(f3)## 2.5 % 97.5 %
## (Intercept) -65.72386 197.8941
## CountryBelgium -158.20461 136.5292
## CountryChannel Islands -191.11475 131.7500
## CountryDenmark -133.59934 239.2127
## CountryEIRE -213.82724 158.9848
## CountryFinland -193.45569 129.4090
## CountryFrance -121.09894 160.7205
## CountryGermany -120.41651 150.0498
## CountryIsrael -164.25128 158.6135
## CountryItaly -170.52671 152.3380
## CountryJapan -195.22248 127.6423
## CountryNetherlands -221.31658 151.4955
## CountryPoland -219.04784 153.7642
## CountryPortugal -184.61794 138.2468
## CountrySaudi Arabia -117.13116 255.6809
## CountrySingapore -179.92716 192.8849
## CountrySpain -154.34387 150.0559
## CountrySwitzerland -175.77856 147.0862
## CountryUnited Kingdom -119.85599 143.9631
For Recency, There were no statistically significant differences between group means as determined by one-way ANOVA (F(4,45) = 1.616, p = 0.1868) For Frequency, There were no statistically significant differences between group means as determined by one-way ANOVA (F(4,45) = 0.2695, p = 0.8961) *For Monetary, if alpha = 0.05 There were statistically significant differences between group means as determined by one-way ANOVA (F(4,45) = 3.73, p = 0.0105)
#here I use the normalized data
RFM_cluster <- data.frame(RFMs$R,RFMs$Fq,RFMs$M,RFMs$RFMScore)
d <- dist(RFM_cluster)
km <- kmeans(d,centers=3)
RFM_cluster$cluster <- km$cluster
RFM_cluster$CustomerID <- RFMs$CustomerIDRFM_cluster1<-RFM_cluster[which(RFM_cluster$cluster==1),]
#mostly Recency = 3-9, RFM score ~650 -1000
RFM_cluster2<-RFM_cluster[which(RFM_cluster$cluster==2),]
#mostly Recency = 1-5 , Frequency= 2-7, RFM score ~200 - 600
RFM_cluster3<-RFM_cluster[which(RFM_cluster$cluster==3),]
#mostly Recency = 6-9, Frequency >= 4, RFM score ~1000-1350**the sequence of the cluster may change Cluster 1 Lower value customers Cluster 2 Newer Customers Cluster 3 High value Customers