This project analyzes data from a UK-based, non-store online retail business. The data set, obtained from the UCI Machine Learning Repository, contains online transactions and is described below.
Objective: to analyze the data and help the organization improve its sales (total revenue).
We will work through the sections below to achieve that.
Source:
Dr Daqing Chen, Director: Public Analytics group. chend ‘@’ lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.
Data Set Information:
This is a transactional data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
Attribute Information:
I will load this data set into the statistical software R for our further analysis; the attributes and their types are shown in the structure output below.
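A minimal loading sketch (the file name "Online Retail.csv" is an assumption; stringsAsFactors = TRUE reproduces the factor columns shown below):
retail = read.csv("Online Retail.csv", stringsAsFactors = TRUE)
dim(retail)    # 541909 rows, 8 variables
str(retail)    # variable types, as printed below
head(retail)
names(retail)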
## [1] 541909 8
## 'data.frame': 541909 obs. of 8 variables:
## $ InvoiceNo : Factor w/ 25900 levels "536365","536366",..: 1 1 1 1 1 1 1 2 2 3 ...
## $ StockCode : Factor w/ 4070 levels "10002","10080",..: 3538 2795 3045 2986 2985 1663 801 1548 1547 3306 ...
## $ Description: Factor w/ 4224 levels ""," 4 PURPLE FLOCK DINNER CANDLES",..: 4027 4035 932 1959 2980 3235 1573 1698 1695 259 ...
## $ Quantity : int 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDate: Factor w/ 23260 levels "1/10/2011 10:04",..: 6839 6839 6839 6839 6839 6839 6839 6840 6840 6841 ...
## $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : int 17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
## $ Country : Factor w/ 38 levels "Australia","Austria",..: 36 36 36 36 36 36 36 36 36 36 ...
## InvoiceNo StockCode Description Quantity
## 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
## 2 536365 71053 WHITE METAL LANTERN 6
## 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
## 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
## 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
## 6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2
## InvoiceDate UnitPrice CustomerID Country
## 1 12/1/2010 8:26 2.55 17850 United Kingdom
## 2 12/1/2010 8:26 3.39 17850 United Kingdom
## 3 12/1/2010 8:26 2.75 17850 United Kingdom
## 4 12/1/2010 8:26 3.39 17850 United Kingdom
## 5 12/1/2010 8:26 3.39 17850 United Kingdom
## 6 12/1/2010 8:26 7.65 17850 United Kingdom
## [1] "InvoiceNo" "StockCode" "Description" "Quantity" "InvoiceDate"
## [6] "UnitPrice" "CustomerID" "Country"
We will explore the data set and, as the need arises, pre-process it to make it suitable for our further analysis: clustering of customers and prediction of sales.
Any duplicate CustomerIDs ?
First, I would like to see whether there are any duplicate customer IDs in the data set.
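One way this check could be written (a sketch; dplyr assumed, data frame named retail as in the later code):
library(dplyr)
retail %>%
  distinct(CustomerID, Country) %>%   # unique (customer, country) pairs
  group_by(CustomerID) %>%
  summarise(count = n()) %>%          # number of countries per customer
  filter(count > 1)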
## Source: local data frame [8 x 2]
##
## CustomerID count
## (int) (int)
## 1 12370 2
## 2 12394 2
## 3 12417 2
## 4 12422 2
## 5 12429 2
## 6 12431 2
## 7 12455 2
## 8 12457 2
As shown above, there are 8 duplicate customer IDs, each appearing in the database under two different countries.
I have corrected this in Excel by changing the customer ID for one record in each pair.
Let’s check again for duplicate IDs after correction.
## [1] 541909 8
## 'data.frame': 541909 obs. of 8 variables:
## $ InvoiceNo : Factor w/ 25900 levels "536365","536366",..: 1 1 1 1 1 1 1 2 2 3 ...
## $ StockCode : Factor w/ 4070 levels "10002","10080",..: 3538 2795 3045 2986 2985 1663 801 1548 1547 3306 ...
## $ Description: Factor w/ 4224 levels ""," 4 PURPLE FLOCK DINNER CANDLES",..: 4027 4035 932 1959 2980 3235 1573 1698 1695 259 ...
## $ Quantity : int 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDate: Factor w/ 23260 levels "1/10/2011 10:04",..: 6839 6839 6839 6839 6839 6839 6839 6840 6840 6841 ...
## $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : Factor w/ 4380 levels "12346","12347",..: 4057 4057 4057 4057 4057 4057 4057 4057 4057 549 ...
## $ Country : Factor w/ 38 levels "Australia","Austria",..: 36 36 36 36 36 36 36 36 36 36 ...
## InvoiceNo StockCode Description Quantity
## 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
## 2 536365 71053 WHITE METAL LANTERN 6
## 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
## 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
## 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
## 6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2
## InvoiceDate UnitPrice CustomerID Country
## 1 12/1/2010 8:26 2.55 17850 United Kingdom
## 2 12/1/2010 8:26 3.39 17850 United Kingdom
## 3 12/1/2010 8:26 2.75 17850 United Kingdom
## 4 12/1/2010 8:26 3.39 17850 United Kingdom
## 5 12/1/2010 8:26 3.39 17850 United Kingdom
## 6 12/1/2010 8:26 7.65 17850 United Kingdom
## [1] "InvoiceNo" "StockCode" "Description" "Quantity" "InvoiceDate"
## [6] "UnitPrice" "CustomerID" "Country"
## Source: local data frame [0 x 2]
##
## Variables not shown: CustomerID (fctr), count (int)
No duplicate customer IDs are found this time, so we are good to proceed with further analysis.
Pre-processing_1
Here we will:
- check for any NAs in the data set,
- check for any cancelled transactions,
- add a new variable for "Revenue" (unit price * quantity).
# any NAs in data set?
colSums(is.na(retail))
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## 0 0 0 0 0 0
## CustomerID Country
## 135080 0
# customerID variable has 135080 NAs.
# any cancelled invoices?
table(grepl("^C", retail$InvoiceNo))
##
## FALSE TRUE
## 532621 9288
# 9288 cancelled transactions.
# adding a new variable for revenue (unit price * quantity) per invoice line:
retail$Revenue = with(retail, Quantity*UnitPrice)
Pre-processing_2
Here we will check which 10 items are the best-selling items, based on the net revenue each item generated over the entire period of this data set.
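A possible computation behind the item count and the top-10 table below (assuming retail carries the Revenue column added above):
length(unique(retail$StockCode))      # number of distinct items sold
retail %>%
  group_by(StockCode) %>%
  summarise(NetRevenue = sum(Revenue)) %>%
  arrange(desc(NetRevenue)) %>%
  head(10)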
## [1] 4070
## Source: local data frame [10 x 2]
##
## StockCode NetRevenue
## (fctr) (dbl)
## 1 DOT 206245.48
## 2 22423 164762.19
## 3 47566 98302.98
## 4 85123A 97894.50
## 5 85099B 92356.03
## 6 23084 66756.59
## 7 POST 66230.64
## 8 22086 63791.94
## 9 84879 58959.73
## 10 79321 53768.06
There are 4,070 unique items that our store sells.
The table above shows the top-10 selling items in order of their revenue contribution.
Pre-processing_3
In the next section, we will filter the original data set down to only these top-10 best-selling items. We will also remove NAs/blanks from the data set.
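A sketch of this filtering step (the object names retail3 and retail3.nona follow the later code; the stock codes come from the top-10 table above):
top10 = c("DOT", "22423", "47566", "85123A", "85099B",
          "23084", "POST", "22086", "84879", "79321")
retail3 = droplevels(retail[retail$StockCode %in% top10, ])
colSums(is.na(retail3))        # CustomerID still contains NAs
retail3.nona = na.omit(retail3)
colSums(is.na(retail3.nona))   # all zero after na.omit()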
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## 0 0 0 0 0 0
## CustomerID Country Revenue
## 2724 0 0
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## 0 0 0 0 0 0
## CustomerID Country Revenue
## 0 0 0
Pre-processing_4
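A plausible version of this step: convert InvoiceDate from factor to Date and take its range (the format string is an assumption based on raw values such as "12/1/2010 8:26"):
retail3.nona$InvoiceDate = as.Date(retail3.nona$InvoiceDate, format = "%m/%d/%Y %H:%M")
min(retail3.nona$InvoiceDate)
max(retail3.nona$InvoiceDate)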
## [1] "2010-12-01"
## [1] "2011-12-09"
Our data set has data from 2010-12-01 through 2011-12-09.
Final Tidy Data set: Summary
# dimension of the data set:
dim(retail3.nona)
## [1] 12106 12
# 12106 observations.
# 12 variables.
# structure of the data set:
str(retail3.nona)
## 'data.frame': 12106 obs. of 12 variables:
## $ InvoiceNo : Factor w/ 8724 levels "536365","536367",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ StockCode : Factor w/ 10 levels "22086","22423",..: 8 6 10 1 8 8 1 7 5 8 ...
## $ Description: Factor w/ 11 levels "ASSORTED COLOUR BIRD ORNAMENT",..: 11 1 8 6 11 11 6 5 2 11 ...
## $ Quantity : int 6 32 3 80 6 6 4 100 192 64 ...
## $ InvoiceDate: Date, format: "2010-12-01" "2010-12-01" ...
## $ UnitPrice : num 2.55 1.69 18 2.55 2.55 2.55 2.95 1.65 3.82 2.55 ...
## $ CustomerID : Factor w/ 4380 levels "12346","12347",..: 4057 549 200 1056 4057 4057 2201 2731 2731 3810 ...
## $ Country : Factor w/ 35 levels "Australia","Austria",..: 33 33 14 33 33 33 33 33 33 33 ...
## $ Revenue : num 15.3 54.1 54 204 15.3 ...
## $ Month : Factor w/ 12 levels "1","2","3","4",..: 12 12 12 12 12 12 12 12 12 12 ...
## $ Year : Factor w/ 2 levels "2010","2011": 1 1 1 1 1 1 1 1 1 1 ...
## $ YearMonth : Factor w/ 13 levels "2010-12","2011-1",..: 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:2724] 47 48 49 50 51 52 63 64 65 66 ...
## .. ..- attr(*, "names")= chr [1:2724] "1503" "1577" "1746" "1777" ...
# "retail3" is dataset with top-10 items with NAs.
# "retail3.nona" is dataset with top-10 items & no NAs.
summary(retail3.nona)
## InvoiceNo StockCode Description
## 579196 : 8 85123A :2077 WHITE HANGING HEART T-LIGHT HOLDER:2070
## 574481 : 7 22423 :1905 REGENCY CAKESTAND 3 TIER :1905
## 575607 : 7 85099B :1662 JUMBO BAG RED RETROSPOT :1662
## 578270 : 7 84879 :1418 ASSORTED COLOUR BIRD ORNAMENT :1418
## 563555 : 6 47566 :1416 PARTY BUNTING :1416
## 569246 : 6 POST :1196 POSTAGE :1196
## (Other):12065 (Other):2432 (Other) :2439
## Quantity InvoiceDate UnitPrice
## Min. :-1930.00 Min. :2010-12-01 Min. : 0.000
## 1st Qu.: 2.00 1st Qu.:2011-04-05 1st Qu.: 2.080
## Median : 6.00 Median :2011-07-18 Median : 2.950
## Mean : 16.19 Mean :2011-07-05 Mean : 8.851
## 3rd Qu.: 12.00 3rd Qu.:2011-10-19 3rd Qu.: 10.950
## Max. : 2880.00 Max. :2011-12-09 Max. :8142.750
##
## CustomerID Country Revenue Month
## 14911 : 150 United Kingdom:10046 Min. :-8142.75 11 :2042
## 17841 : 146 Germany : 544 1st Qu.: 12.75 12 :1365
## 12748 : 79 France : 526 Median : 20.80 10 :1174
## 14096 : 74 EIRE : 186 Mean : 53.81 5 :1087
## 13089 : 71 Belgium : 136 3rd Qu.: 45.47 9 :1082
## 12682 : 59 Spain : 131 Max. : 8142.75 6 : 876
## (Other):11527 (Other) : 537 (Other):4480
## Year YearMonth
## 2010: 848 2011-11:2042
## 2011:11258 2011-10:1174
## 2011-5 :1087
## 2011-9 :1082
## 2011-6 : 876
## 2011-3 : 858
## (Other):4987
Summary observations:
- One invoice can have multiple items bought under it.
- The minimum and maximum InvoiceDate show that our transaction database runs from 2010-12-01 to 2011-12-09.
- There are many repeat customers.
- Most of the transactions are from the United Kingdom; detailed statistics are calculated in later sections.
Boxplots of revenue for the top-10 items show that revenue for most items is positively skewed: a number of high-value outliers pull each item's revenue distribution to the right.
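A minimal sketch of these boxplots (ggplot2 assumed):
library(ggplot2)
ggplot(retail3.nona, aes(x = StockCode, y = Revenue)) +
  geom_boxplot() +
  labs(x = "StockCode", y = "Revenue per invoice line",
       title = "Revenue distribution of the top-10 items")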
The top-10 best-seller chart shows that "DOT" is the number-one selling item by revenue.
## [1] 2766
Because the frequency in the £0-£1,000 range is so high (a lot of customers fall in that revenue range), it is hard to see the other bins and their distribution.
The second chart is the same as the first with the y-axis zoomed in, so the other bins' distribution is visible more clearly.
There are 2,766 unique customers in our data set, accounting for 12,106 transactions on the 10 best-selling items in the store.
Most customers each account for revenue in the range of £0-£2,500 over the period of data collection.
The distribution of revenue per customer is clearly positively skewed, with some outliers in the high ranges contributing more than £7,500 each. This group will be interesting to dig into to examine its buying behaviour.
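A possible per-customer revenue summary behind these observations (the name rev.per.cust is an assumption):
rev.per.cust = retail3.nona %>%
  group_by(CustomerID, Country) %>%
  summarise(RevenueFromCustomer = sum(Revenue))
nrow(rev.per.cust)   # 2766 unique customers, as reported above
hist(rev.per.cust$RevenueFromCustomer, breaks = 50,
     xlab = "Revenue per customer", main = "Revenue per customer")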
Customers with abnormally high Revenue
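The high-revenue group can then be listed by filtering that summary (the £7,500 cutoff comes from the discussion above; the negative-revenue customers in the next section follow analogously with RevenueFromCustomer < 0):
rev.per.cust %>% filter(RevenueFromCustomer > 7500)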
## Source: local data frame [9 x 3]
## Groups: CustomerID [9]
##
## CustomerID Country RevenueFromCustomer
## (fctr) (fctr) (dbl)
## 1 12931 United Kingdom 8715.80
## 2 14096 United Kingdom 14739.50
## 3 14646 Netherlands 19251.38
## 4 14911 EIRE 7927.25
## 5 15061 United Kingdom 14545.70
## 6 15769 United Kingdom 8077.00
## 7 16013 United Kingdom 14416.70
## 8 17450 United Kingdom 12315.44
## 9 17511 United Kingdom 8961.53
## [1] 9 3
Customers with Negative Revenue
## Source: local data frame [23 x 3]
## Groups: CustomerID [23]
##
## CustomerID Country RevenueFromCustomer
## (fctr) (fctr) (dbl)
## 1 13026 United Kingdom -7.62
## 2 13426 United Kingdom -7.62
## 3 13826 United Kingdom -4.90
## 4 14295 United Kingdom -3.00
## 5 14339 United Kingdom -3.82
## 6 14441 United Kingdom -48.00
## 7 14627 United Kingdom -2.95
## 8 15128 United Kingdom -3.71
## 9 15157 United Kingdom -10.34
## 10 15215 United Kingdom -32.85
## .. ... ... ...
## [1] 23 3
These are the 23 customers whose contributed revenue is negative.
Net revenue per customer ends up negative because of cancelled transactions.
Interestingly, all of these customers are from the UK.
It would be worth analyzing why these customers have more cancellations than a typical customer of this store, ending up with no positive revenue contribution. It may be a good idea to collect more data to find out the reasons behind all these cancellations.
Customer Distribution in different Revenue bins
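A sketch of the binning (break points inferred from the table below):
cut.revenue = cut(rev.per.cust$RevenueFromCustomer,
                  breaks = c(-2500, 0, 2500, 5000, 7500, 10000, 30000),
                  right = FALSE, include.lowest = TRUE, dig.lab = 5)
table(cut.revenue)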
## cut.revenue
## [-2500, 0) [ 0, 2500) [ 2500, 5000) [ 5000, 7500) [ 7500,10000)
## 23 2703 21 10 4
## [10000,30000]
## 5
## Source: local data frame [6 x 3]
##
## InvoiceNo ItemsSoldPerInvoice No.of.Customers
## (fctr) (int) (int)
## 1 536365 1 1
## 2 536367 1 1
## 3 536370 1 1
## 4 536371 1 1
## 5 536373 1 1
## 6 536375 1 1
## Source: local data frame [6 x 3]
##
## InvoiceNo ItemsSoldPerInvoice No.of.Customers
## (fctr) (int) (int)
## 1 C581117 1 1
## 2 C581128 1 1
## 3 C581228 1 1
## 4 C581229 1 1
## 5 C581235 1 1
## 6 C581470 1 1
This table shows that multiple items can be sold per invoice, but each invoice always carries exactly one CustomerID.
Conversely, one CustomerID can map to multiple invoice numbers if the customer has shopped more than once.
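A plausible per-country revenue summary (the name rev.per.country is an assumption):
rev.per.country = retail3.nona %>%
  group_by(Country) %>%
  summarise(RevenuePerCountry = sum(Revenue)) %>%
  arrange(desc(RevenuePerCountry))
dim(rev.per.country)        # 35 countries
head(rev.per.country, 10)   # top-10 countries by revenue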
## [1] 35 2
## Source: local data frame [10 x 2]
##
## Country RevenuePerCountry
## (fctr) (dbl)
## 1 United Kingdom 499551.33
## 2 Germany 31687.91
## 3 France 28468.99
## 4 Netherlands 19841.33
## 5 EIRE 13891.02
## 6 Spain 8777.82
## 7 Australia 7098.86
## 8 Japan 6763.29
## 9 Switzerland 5700.48
## 10 Belgium 5469.09
Customers belong to 35 countries. The original data set had 38 countries; the top-10-items data set had 36; after removing NAs, 35 remain.
The chart of top-10 countries by revenue shows that the United Kingdom is by far the highest revenue-contributing country for this store.
Most of the top-10 countries are in Europe, which indicates local popularity rather than international market capture.
Even though this is an online business with no physical stores in the UK or elsewhere, its sales revenue comes mostly from the UK and nearby European countries. This suggests that marketing strategies would need to change to reach international customers, if that is the goal.
## Source: local data frame [35 x 4]
##
## Country TotalCustomers NetRevenue RevenuePerCustomer
## (fctr) (int) (dbl) (dbl)
## 1 EIRE 3 13891.02 4630.3400
## 2 Japan 2 6763.29 3381.6450
## 3 Netherlands 9 19841.33 2204.5922
## 4 Australia 5 7098.86 1419.7720
## 5 Singapore 1 463.70 463.7000
## 6 Norway 9 3611.86 401.3178
## 7 France 76 28468.99 374.5920
## 8 Germany 90 31687.91 352.0879
## 9 Iceland 1 345.09 345.0900
## 10 Finland 12 4067.47 338.9558
## .. ... ... ... ...
These are the top-10 countries by per-customer revenue contribution.
The UK may be the number-one total revenue contributor, but when it comes to per-customer revenue these countries top the list and the UK is nowhere near them.
## [1] 18
The UK ranks 18th from the top in per-customer revenue to the store.
## [1] 86.58713
## [1] 76.68125
## [1] 17.01636
Non-UK countries account for a very small percentage of total transactions, so they do not give us any clustering-related insight per country.
Also, our database has no location-specific information such as address or latitude/longitude; all we have are country names, so grouping customers by precise location is not possible anyway.
It is therefore better to label all non-UK countries as "non-UK" rather than keep 35 different labels for the Country variable.
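A minimal sketch of the relabeling, applied here at the transaction level:
retail3.nona$Country = factor(ifelse(retail3.nona$Country == "United Kingdom",
                                     "UK", "non-UK"))
sum(retail3.nona$Country == "UK")   # 10046 UK observations, as below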
## [1] 10046
Out of 12,106 observations, 10,046 are from UK customers and the rest are from non-UK customers.
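The daily aggregation printed below could be produced along these lines (the name daily is an assumption):
daily = retail3.nona %>%
  group_by(InvoiceDate, YearMonth) %>%
  summarise(TotalCustomersPerDay = n_distinct(CustomerID),
            RevenuePerDay = sum(Revenue))
head(daily)
tail(daily)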
## Source: local data frame [6 x 4]
## Groups: InvoiceDate [6]
##
## InvoiceDate YearMonth TotalCustomersPerDay RevenuePerDay
## (date) (fctr) (int) (dbl)
## 1 2010-12-01 2010-12 58 4502.98
## 2 2010-12-02 2010-12 59 4354.57
## 3 2010-12-03 2010-12 41 3022.23
## 4 2010-12-05 2010-12 74 3010.21
## 5 2010-12-06 2010-12 66 2025.20
## 6 2010-12-07 2010-12 47 4102.19
## Source: local data frame [6 x 4]
## Groups: InvoiceDate [6]
##
## InvoiceDate YearMonth TotalCustomersPerDay RevenuePerDay
## (date) (fctr) (int) (dbl)
## 1 2011-12-04 2011-12 46 981.35
## 2 2011-12-05 2011-12 81 4831.22
## 3 2011-12-06 2011-12 75 4757.26
## 4 2011-12-07 2011-12 76 7216.56
## 5 2011-12-08 2011-12 65 2545.61
## 6 2011-12-09 2011-12 17 572.07
As seen earlier in the revenue-by-date trend, monthly revenue stayed mostly flat through 2011 but has picked up over the last 2-3 months. The final month's data are partial, not a full month, which explains the apparent dip in revenue.
The trends for revenue per month and number of customers served per month are very similar; the number of customers served is therefore a good indicator of the revenue trend.
library(dplyr)   # for %>%, group_by() and summarise(), if not already loaded
retail.data = retail3.nona %>% group_by(CustomerID, Country) %>% summarise(NumTransactions = n(), NetRevenue = sum(Revenue))
# preparing a variable for customer regularity:
# count the months (out of the 13 months of data) in which a customer appears
# (for example, a customer seen in 7 of the 13 months scores 7);
# the score is then normalized to [0, 1] by dividing by 13.
cus.Regularity = numeric()
for(i in 1:length(retail.data$CustomerID)){
cus.Regularity[i] = length(unique(retail3.nona$YearMonth[retail3.nona$CustomerID == retail.data$CustomerID[i]]))
}
cus.Regularity = cus.Regularity/13
cus.item = retail3.nona %>% group_by(CustomerID, StockCode) %>% summarise(count = sum(Quantity))
library(reshape2)   # provides dcast()
cus.item.count = dcast(cus.item, CustomerID ~ StockCode)
## Using 'count' as value column. Use 'value.var' to override
cus.item.count[is.na(cus.item.count)] <- 0
retail.data2 = retail.data
retail.data2$Customer.Regularity = cus.Regularity
retail.data2$item.22086 = cus.item.count$`22086`
retail.data2$item.22423 = cus.item.count$`22423`
retail.data2$item.23084 = cus.item.count$`23084`
retail.data2$item.47566 = cus.item.count$`47566`
retail.data2$item.79321 = cus.item.count$`79321`
retail.data2$item.84879 = cus.item.count$`84879`
retail.data2$item.85099B = cus.item.count$`85099B`
retail.data2$item.85123A = cus.item.count$`85123A`
retail.data2$item.DOT = cus.item.count$DOT
retail.data2$item.POST = cus.item.count$POST
customer.id = retail.data2$CustomerID
rownames(retail.data2) = retail.data2$CustomerID
retail.data2 = retail.data2[,-1]
# PCA using prcomp():
pr.out = prcomp(retail.data2, scale = TRUE)
pr.out$sdev
## [1] 1.88764372 1.27045571 1.07766277 1.04104830 1.01544402 0.97503404
## [7] 0.94359622 0.92650296 0.91236506 0.85610288 0.81969123 0.63762911
## [13] 0.44873645 0.04306016
pr.var = pr.out$sdev^2
pr.var
## [1] 3.563198826 1.614057712 1.161357051 1.083781566 1.031126559
## [6] 0.950691376 0.890373836 0.858407734 0.832410006 0.732912148
## [11] 0.671893717 0.406570886 0.201364405 0.001854177
pr.out$sdev is the standard deviation of each principal component; squaring it gives the variance explained by each PC for this data set.
To compute the proportion of variance explained (PVE) by each PC, divide each PC's variance by the total variance across all PCs.
PVE = pr.var / sum(pr.var) * 100 # in percentage
PVE
## [1] 25.45142019 11.52898366 8.29540751 7.74129690 7.36518971
## [6] 6.79065268 6.35981311 6.13148382 5.94578576 5.23508677
## [11] 4.79924084 2.90407776 1.43831718 0.01324412
We see that the first PC explains 25.5% of the variance in the data set, and the next PC 11.5%.
We can now plot the PVE of each component as well as the cumulative PVE.
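A minimal sketch of the two plots:
par(mfrow = c(1, 2))
plot(PVE, type = "b", xlab = "Principal Component",
     ylab = "PVE (%)", main = "Proportion of Variance Explained")
plot(cumsum(PVE), type = "b", xlab = "Principal Component",
     ylab = "Cumulative PVE (%)", main = "Cumulative PVE")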
As our variables are numerical and our goal is unsupervised discovery of some structure/grouping among the customers, I will use k-means clustering.
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are:
The centroids of the K clusters, which can be used to label new data
Labels for the training data (each data point is assigned to a single cluster)
Rather than defining groups before looking at the data, clustering allows you to find and analyze the groups that have formed organically. The “Choosing K” step in the example below describes how the number of groups can be determined.
Each centroid of a cluster is a collection of feature values which define the resulting groups. Examining the centroid feature weights can be used to qualitatively interpret what kind of group each cluster represents.
set.seed(1)   # for reproducible k-means starts
kmeans.model = kmeans(retail.data2, centers = 3, nstart = 50)
wss = kmeans.model$tot.withinss   # total within-cluster sum of squares
bss = kmeans.model$betweenss      # between-cluster sum of squares
table(kmeans.model$cluster)       # cluster sizes
##
## 1 2 3
## 2710 49 7
# PCA using prcomp() (recomputed for the 3D plot below):
pr.out = prcomp(retail.data2, scale = TRUE)
library(rgl)       # 3D scatter plot
library(cluster)   # silhouette()
distance = dist(retail.data2)   # Euclidean distances for the silhouette
plot(silhouette(kmeans.model$cluster, distance))
After k-means clustering:
- cluster-1: 2710 customers
- cluster-2: 49 customers
- cluster-3: 7 customers
Silhouette coefficients for the three clusters are 0.95, 0.27, and 0.32, with the average silhouette coefficient at 0.93. This indicates very good clustering.
The above plot is a 3D plot of customers on 3 of the original variables as axes, colored by the 3 clusters.
Now we will add the resulting cluster assignments back to the data set so we can use them for exploratory analysis and see how these 3 clusters of customers differ and where they are similar.
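A sketch of attaching the cluster labels back to the per-customer data (customer.id was saved earlier; label strings match the tables below):
retail.data2$cluster = factor(kmeans.model$cluster,
                              labels = c("cluster-1", "cluster-2", "cluster-3"))
retail.data2$CustomerID = customer.id
table(retail.data2$cluster)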
##
## cluster-1 cluster-2 cluster-3
## 2710 49 7
## Source: local data frame [6 x 16]
##
## Country NumTransactions NetRevenue Customer.Regularity item.22086
## (fctr) (int) (dbl) (dbl) (dbl)
## 1 non-UK 7 345.09 0.46153846 0
## 2 non-UK 4 360.00 0.30769231 0
## 3 non-UK 2 312.75 0.07692308 0
## 4 non-UK 1 40.00 0.07692308 0
## 5 non-UK 7 343.75 0.30769231 0
## 6 non-UK 1 25.50 0.07692308 0
## Variables not shown: item.22423 (dbl), item.23084 (dbl), item.47566 (dbl),
## item.79321 (dbl), item.84879 (dbl), item.85099B (dbl), item.85123A
## (dbl), item.DOT (dbl), item.POST (dbl), cluster (fctr), CustomerID
## (fctr)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2766 obs. of 16 variables:
## $ Country : Factor w/ 2 levels "non-UK","UK": 1 1 1 1 1 1 1 1 1 1 ...
## $ NumTransactions : int 7 4 2 1 7 1 4 1 2 7 ...
## $ NetRevenue : num 345 360 313 40 344 ...
## $ Customer.Regularity: num 0.4615 0.3077 0.0769 0.0769 0.3077 ...
## $ item.22086 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.22423 : num 15 0 1 0 5 2 22 8 0 27 ...
## $ item.23084 : num 84 0 0 0 0 0 0 0 0 0 ...
## $ item.47566 : num 0 0 0 0 0 0 0 0 0 5 ...
## $ item.79321 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.84879 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.85099B : num 0 0 0 0 0 0 0 0 0 10 ...
## $ item.85123A : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.DOT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.POST : num 0 9 1 1 7 0 18 0 6 0 ...
## $ cluster : Factor w/ 3 levels "cluster-1","cluster-2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ CustomerID : Factor w/ 2766 levels "1","10","100",..: 1 1112 1990 2101 2212 2323 2434 2545 2656 2 ...
##
## cluster-1 cluster-2 cluster-3
## 2710 49 7
## cluster NumTransactions
## 1 cluster-1 3.863469
## 2 cluster-2 28.061224
## 3 cluster-3 37.285714
Clearly, all 3 clusters have distinct averages for the number of transactions per customer, so this is a good variable for distinguishing customer segments.
Cluster-1 customers have the lowest number of transactions, with a mean of 3.9.
Cluster-3 customers have the highest number of transactions, with a mean of 37.3.
## cluster NetRevenue
## 1 cluster-1 369488.72
## 2 cluster-2 189030.03
## 3 cluster-3 92946.05
## cluster NetRevenue
## 1 cluster-1 136.3427
## 2 cluster-2 3857.7557
## 3 cluster-3 13278.0071
All 3 clusters also have distinct average revenue per customer.
Cluster-3 customers generate average revenue of 13,278, while cluster-1 customers generate very low revenue in comparison, at 136.
Promotions and coupons could perhaps be used to attract cluster-1 customers, but further clustering or classification of cluster-1 is necessary, as it accounts for almost 98% of customers.
## cluster Customer.Regularity
## 1 cluster-1 0.1845018
## 2 cluster-2 0.6860283
## 3 cluster-3 0.7142857
Clearly, all 3 clusters have distinct average customer-visit regularity over the year; this is another good variable for distinguishing customer segments.
Cluster-3 has the highest regularity at 71.4%, meaning its customers visit the store's website regularly throughout the year.
Cluster-1 has the lowest visit regularity, at 18.5%. Combined with this group's low number of transactions and low revenue per customer, promotions and coupons could perhaps attract these customers; again, further clustering or classification of cluster-1 is needed, as it holds almost 98% of customers.
Cluster-2 customers are also regular visitors, at 68% regularity, but still do not generate as much net revenue as cluster-3 customers; promotions and coupons might help here too.
There is also an interesting aspect of cluster-3: it is highest in transactions per customer, net revenue per customer, and visit regularity, yet lowest in number of customers and in total net revenue.
So cluster-3 is a small group of customers who regularly visit the store's website, make many transactions over the year, and bring in a good amount of revenue per head; but because the group is small, its total contribution stays modest, and cluster-1 wins on total net revenue through sheer customer volume.
##
## cluster-1 cluster-2 cluster-3
## non-UK 0.968 0.030 0.003
## UK 0.982 0.016 0.003
##
## cluster-1 cluster-2 cluster-3
## non-UK 0.13 0.22 0.14
## UK 0.87 0.78 0.86
As shown above, cluster-3 customers buy a high average number of items in almost all 10 item categories, with cluster-1 lowest in average quantity bought per item.
Clusters 1 and 2 do not buy "DOT"; since they account for 99.7% of all customers, this means only the few customers in cluster-3 buy DOT.
As mentioned earlier, further clustering of cluster-1 is necessary, as it accounts for almost 98% of customers.
Here I will apply an approach based on my own view: filter the data down to cluster-1 only and re-apply the entire k-means clustering process used earlier, to find sub-clusters within cluster-1.
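One plausible construction of the cluster-1 subset (column handling inferred from the str() output below: the cluster, CustomerID, and item.DOT columns are dropped, and Country is 0/1-encoded):
c1.data = retail.data2 %>%
  filter(cluster == "cluster-1") %>%
  select(-cluster, -CustomerID, -item.DOT)   # DOT is not bought by cluster-1
c1.data$Country = as.numeric(c1.data$Country == "UK")
str(c1.data)   # 2710 obs. of 13 variables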
## Classes 'tbl_df' and 'data.frame': 2710 obs. of 13 variables:
## $ Country : num 0 0 0 0 0 0 0 0 0 0 ...
## $ NumTransactions : int 7 4 2 1 7 1 4 1 2 7 ...
## $ NetRevenue : num 345 360 313 40 344 ...
## $ Customer.Regularity: num 0.4615 0.3077 0.0769 0.0769 0.3077 ...
## $ item.22086 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.22423 : num 15 0 1 0 5 2 22 8 0 27 ...
## $ item.23084 : num 84 0 0 0 0 0 0 0 0 0 ...
## $ item.47566 : num 0 0 0 0 0 0 0 0 0 5 ...
## $ item.79321 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.84879 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.85099B : num 0 0 0 0 0 0 0 0 0 10 ...
## $ item.85123A : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.POST : num 0 9 1 1 7 0 18 0 6 0 ...
set.seed(1)   # for reproducible k-means starts
kmeans.model = kmeans(c1.data, centers = 3, nstart = 50)
wss = kmeans.model$tot.withinss   # total within-cluster sum of squares
bss = kmeans.model$betweenss      # between-cluster sum of squares
table(kmeans.model$cluster)       # sub-cluster sizes
##
## 1 2 3
## 2237 378 95
# PCA using prcomp():
pr.out = prcomp(c1.data, scale = TRUE)
library(rgl)
library(cluster)
distance = dist(c1.data)
plot(silhouette(kmeans.model$cluster, distance))
After k-means clustering within cluster-1:
- cluster-1 (1A): 2237 customers
- cluster-2 (1B): 378 customers
- cluster-3 (1C): 95 customers
We will name these sub-clusters cluster-1A, cluster-1B, and cluster-1C.
Silhouette coefficients for the three sub-clusters are 0.78, 0.34, and 0.21, with the average silhouette coefficient at 0.7. This indicates reasonably good clustering within cluster-1.
The above plot is a 3D plot of the cluster-1 customers on 3 of the original variables as axes, colored by the 3 sub-clusters.
Now we will add the sub-cluster assignments back to the data set so we can use them for exploratory analysis and see how these 5 clusters of customers differ and where they are similar.
##
## cluster-1A cluster-1B cluster-1C
## 2237 378 95
##
## cluster-1A cluster-1B cluster-1C cluster-2 cluster-3
## 2237 378 95 49 7
## Source: local data frame [6 x 16]
##
## Country NumTransactions NetRevenue Customer.Regularity item.22086
## (fctr) (int) (dbl) (dbl) (dbl)
## 1 non-UK 7 345.09 0.46153846 0
## 2 non-UK 4 360.00 0.30769231 0
## 3 non-UK 2 312.75 0.07692308 0
## 4 non-UK 1 40.00 0.07692308 0
## 5 non-UK 7 343.75 0.30769231 0
## 6 non-UK 1 25.50 0.07692308 0
## Variables not shown: item.22423 (dbl), item.23084 (dbl), item.47566 (dbl),
## item.79321 (dbl), item.84879 (dbl), item.85099B (dbl), item.85123A
## (dbl), item.DOT (dbl), item.POST (dbl), cluster (fctr), CustomerID
## (fctr)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2766 obs. of 16 variables:
## $ Country : Factor w/ 2 levels "non-UK","UK": 1 1 1 1 1 1 1 1 1 1 ...
## $ NumTransactions : int 7 4 2 1 7 1 4 1 2 7 ...
## $ NetRevenue : num 345 360 313 40 344 ...
## $ Customer.Regularity: num 0.4615 0.3077 0.0769 0.0769 0.3077 ...
## $ item.22086 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.22423 : num 15 0 1 0 5 2 22 8 0 27 ...
## $ item.23084 : num 84 0 0 0 0 0 0 0 0 0 ...
## $ item.47566 : num 0 0 0 0 0 0 0 0 0 5 ...
## $ item.79321 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.84879 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.85099B : num 0 0 0 0 0 0 0 0 0 10 ...
## $ item.85123A : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.DOT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.POST : num 0 9 1 1 7 0 18 0 6 0 ...
## $ cluster : Factor w/ 5 levels "cluster-1A","cluster-1B",..: 2 2 2 1 2 1 2 1 2 2 ...
## $ CustomerID : Factor w/ 2766 levels "1","10","100",..: 1 1112 1990 2101 2212 2323 2434 2545 2656 2 ...
After k-means clustering within cluster-1:
- cluster-1A: 2237 customers
- cluster-1B: 378 customers
- cluster-1C: 95 customers
- cluster-2: 49 customers
- cluster-3: 7 customers
##
## non-UK UK
## 371 2395
##
## cluster-1A cluster-1B cluster-1C cluster-2 cluster-3
## 2237 378 95 49 7
## cluster NumTransactions
## 1 cluster-1A 2.708091
## 2 cluster-1B 8.489418
## 3 cluster-1C 12.663158
## 4 cluster-2 28.061224
## 5 cluster-3 37.285714
Clearly, all 5 clusters have distinct averages for the number of transactions per customer; this remains a good variable for distinguishing customer segments.
Cluster-1 customers in general have the lowest number of transactions, with sub-cluster-1A lowest at a mean of 2.7.
Cluster-3 customers have the highest number of transactions, with a mean of 37.3.
## cluster NetRevenue
## 1 cluster-1A 134221.40
## 2 cluster-1B 138046.66
## 3 cluster-1C 97220.66
## 4 cluster-2 189030.03
## 5 cluster-3 92946.05
## cluster NetRevenue
## 1 cluster-1A 60.00063
## 2 cluster-1B 365.20280
## 3 cluster-1C 1023.37537
## 4 cluster-2 3857.75571
## 5 cluster-3 13278.00714
All 5 clusters generate distinct total net revenue.
Cluster-1A and cluster-1B customers generate the highest totals (cluster-1 as a whole was highest, as we saw before sub-clustering), while cluster-1C brings in low total net revenue, almost in the same range as cluster-3, which is lowest in the group.
All 5 clusters also have distinct average revenue per customer.
Cluster-3 customers generate average revenue of 13,278, whereas cluster-1 customers generate very low revenue by comparison, with cluster-1A the lowest.
Promotions and coupons could perhaps be used to attract cluster-1 customers.
## cluster Customer.Regularity
## 1 cluster-1A 0.1438396
## 2 cluster-1B 0.3538869
## 3 cluster-1C 0.4680162
## 4 cluster-2 0.6860283
## 5 cluster-3 0.7142857
Cluster-3 has the highest customer regularity at 71.4%, meaning its customers visit the store's website regularly over the year.
Cluster-1 has the lowest visit regularity; combined with this group's low number of transactions and low revenue per customer, promotions and coupons could perhaps attract these customers.
Cluster-2 customers are also regular visitors, at 68% regularity, but still do not generate as much net revenue as cluster-3 customers; promotions and coupons might help here too.
##
## cluster-1A cluster-1B cluster-1C cluster-2 cluster-3
## non-UK 0.644 0.251 0.073 0.030 0.003
## UK 0.834 0.119 0.028 0.016 0.003
##
## cluster-1A cluster-1B cluster-1C cluster-2 cluster-3
## non-UK 0.11 0.25 0.28 0.22 0.14
## UK 0.89 0.75 0.72 0.78 0.86
As shown above, cluster-3 customers buy a high average number of items in almost all 10 item categories, with cluster-1 lowest in average quantity bought per item.
Clusters 1 and 2 do not buy "DOT"; since they account for 99.7% of customers, only the few customers in cluster-3 buy DOT.
Overall, cluster-1 should be the target cluster for marketing, as it accounts for the lowest sales per customer for the store.
Because cluster-1 is the largest cluster, with almost 98% of customers, we further segmented it into 3 sub-clusters.
The company can start by targeting cluster-1B and cluster-1C (17% of total customers), subsets of cluster-1, with a marketing campaign, perhaps coupons and promotions, to boost customers' visits to the store's website and eventually increase revenue.
This gives a good-sized pool of customers with characteristics similar to cluster-1 as a whole, which cannot be targeted directly since it alone holds almost 98% of the store's customers.
Also, using the Apriori algorithm in WEKA, we found that two items, "DOT" and "85099B", have a strong association; they should therefore be combined in deals and discounts when sales of one need a boost.
Based on the clustering achieved with the k-means algorithm and the exploratory analysis of these 5 clusters, below is the customer profile I have come up with for each cluster of this customer base, along with a recommended business strategy.
Cluster-1A
Customer Profile: low-volume buyers; irregular, less-frequent visitors.
Business Goal: along with increasing sales from this group, the main objective should be to attract them to the store's website.
Business Strategy Recommendation:
[1] Aggressive media marketing, especially to attract non-UK customers.
[2] Aggressive promotions / coupons / discounts, etc.
Cluster-1B
Customer Profile: low-volume buyers; regular visitors.
Business Goal: increase sales.
Business Strategy Recommendation: promotions / coupons / discounts, etc.
Cluster-1C
Customer Profile: low-to-medium-volume buyers; regular visitors.
Business Goal: increase sales.
Business Strategy Recommendation: promotions / coupons / discounts, etc.
Cluster-2
Customer Profile: medium-volume buyers; frequent visitors.
Business Goal: customer retention.
Business Strategy Recommendation: Premium Customers' Reward program.
Cluster-3
Customer Profile: high-volume buyers; frequent visitors.
Business Goal: customer retention.
Business Strategy Recommendation: Premium Customers' Reward program.
## InvoiceNo StockCode
## 1 536365 85123A
## 2 536367 84879
## 3 536370 POST
## 4 536371 22086
## 5 536373 85123A
## 6 536375 85123A
## InvoiceNo StockCode
## 567185 : 13 85123A :2392
## 553220 : 12 22423 :2130
## 563444 : 11 85099B :1780
## 553158 : 10 47566 :1758
## 566976 : 10 84879 :1453
## 568193 : 10 POST :1205
## (Other):12040 (Other):1388
## transactions in sparse format with
## 6365 transactions (rows) and
## 10 items (columns)
## ,"22086" ,"22423" ,"23084" ,"47566" ,"79321" ,"84879" ,"85099B"
## ,"22086" 408 51 4 35 23 36 44
## ,"22423" 51 1537 31 208 37 191 127
## ,"23084" 4 31 203 21 10 10 25
## ,"47566" 35 208 21 1254 58 174 163
## ,"79321" 23 37 10 58 377 28 58
## ,"84879" 36 191 10 174 28 1043 87
## ,"85099B" 44 127 25 163 58 87 1270
## ,"85123A" 50 197 13 260 60 226 200
## ,"DOT" 5 3 2 1 1 0 1
## ,"POST" 4 115 42 30 2 30 67
## ,"85123A" ,"DOT" ,"POST"
## ,"22086" 50 5 4
## ,"22423" 197 3 115
## ,"23084" 13 2 42
## ,"47566" 260 1 30
## ,"79321" 60 1 2
## ,"84879" 226 0 30
## ,"85099B" 200 1 67
## ,"85123A" 1623 1 12
## ,"DOT" 1 7 0
## ,"POST" 12 0 870
Before mining any rules, let's look at the data set. We plot the frequency of the items to see how often each is purchased.
The last table above shows which items are purchased together.
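A sketch of the mining call itself (arules package; the thresholds match the parameter specification below, and the transactions object name trans is an assumption):
library(arules)
rules = apriori(trans, parameter = list(support = 0.1, confidence = 0.5))
inspect(rules)   # 0 rules at these thresholds; lowering them reveals associations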
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 636
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[10 item(s), 6365 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Since we are trying to predict revenue (sales) from the customer transaction data, I decided to use the per-customer data set built for clustering in the regression analysis as well.
The idea is that if we know our customers, we can predict our sales: past transaction data gives us the customer segments, and the same data can be fed to a regression analysis to predict sales numbers.
We will fit multiple regression models, compare them, and select the best model that works for us.
During each modelling step we will also use 10-fold cross-validation to estimate the test error.
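A sketch of how the 10-fold cross-validation could be run (cv.glm() from the boot package; object names are assumptions):
library(boot)
set.seed(1)
glm.fit = glm(NetRevenue ~ ., data = lm.data)   # glm() so that cv.glm() can be used
cv.err = cv.glm(lm.data, glm.fit, K = 10)
cv.err$delta[1]   # cross-validated estimate of the test MSE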
Response variable: NetRevenue (net revenue per customer).
Explanatory variables: Country, NumTransactions, Customer.Regularity, and the per-item quantity variables (item.22086 through item.POST), as shown below.
## # A tibble: 6 × 13
## Country NumTransactions NetRevenue Customer.Regularity item.22086
## <chr> <int> <dbl> <dbl> <dbl>
## 1 non-UK 7 345.09 0.46153846 0
## 2 non-UK 4 360.00 0.30769231 0
## 3 non-UK 2 312.75 0.07692308 0
## 4 non-UK 1 40.00 0.07692308 0
## 5 non-UK 7 343.75 0.30769231 0
## 6 non-UK 1 25.50 0.07692308 0
## # ... with 8 more variables: item.22423 <dbl>, item.23084 <dbl>,
## # item.47566 <dbl>, item.79321 <dbl>, item.84879 <dbl>,
## # item.85099B <dbl>, item.85123A <dbl>, item.POST <dbl>
## Classes 'tbl_df', 'tbl' and 'data.frame': 2766 obs. of 13 variables:
## $ Country : chr "non-UK" "non-UK" "non-UK" "non-UK" ...
## $ NumTransactions : int 7 4 2 1 7 1 4 1 2 7 ...
## $ NetRevenue : num 345 360 313 40 344 ...
## $ Customer.Regularity: num 0.4615 0.3077 0.0769 0.0769 0.3077 ...
## $ item.22086 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.22423 : num 15 0 1 0 5 2 22 8 0 27 ...
## $ item.23084 : num 84 0 0 0 0 0 0 0 0 0 ...
## $ item.47566 : num 0 0 0 0 0 0 0 0 0 5 ...
## $ item.79321 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.84879 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.85099B : num 0 0 0 0 0 0 0 0 0 10 ...
## $ item.85123A : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.POST : num 0 9 1 1 7 0 18 0 6 0 ...
##
## Call:
## lm(formula = NetRevenue ~ ., data = lm.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1974.1 -19.6 -5.2 9.9 12241.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.27830 16.03694 2.761 0.0058 **
## CountryUK -23.79556 16.29474 -1.460 0.1443
## NumTransactions 18.49682 1.12484 16.444 <2e-16 ***
## Customer.Regularity -406.98956 44.70291 -9.104 <2e-16 ***
## item.22086 2.66879 0.12243 21.799 <2e-16 ***
## item.22423 10.69370 0.15137 70.644 <2e-16 ***
## item.23084 1.93556 0.04186 46.241 <2e-16 ***
## item.47566 3.79595 0.16587 22.886 <2e-16 ***
## item.79321 4.23920 0.10958 38.684 <2e-16 ***
## item.84879 1.38080 0.04449 31.034 <2e-16 ***
## item.85099B 1.67171 0.03714 45.006 <2e-16 ***
## item.85123A 2.72643 0.04780 57.037 <2e-16 ***
## item.POST 16.59836 1.12085 14.809 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 248.7 on 2753 degrees of freedom
## Multiple R-squared: 0.9232, Adjusted R-squared: 0.9229
## F-statistic: 2759 on 12 and 2753 DF, p-value: < 2.2e-16
## [1] 81691.43 79906.38 81598.93 80667.46 80283.61
## [1] 223574568
## [1] 284.9759
## [1] 0.9228884
## [1] 38379.68
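The multicollinearity check printed below looks like variance inflation factors; a sketch (car package; the model name lm.fit is an assumption):
library(car)
vif(lm.fit)   # values close to 1, as below, indicate little multicollinearity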
## Country NumTransactions Customer.Regularity
## 1.379386 2.778930 2.634485
## item.22086 item.22423 item.23084
## 1.065652 1.122917 1.165752
## item.47566 item.79321 item.84879
## 1.189944 1.123843 1.063733
## item.85099B item.85123A item.POST
## 1.137178 1.100288 1.641594
## [1] 17
## [1] 2474 13
## [1] 275 13
##
## Call:
## lm(formula = NetRevenue ~ ., data = lm.data.out)
##
## Residuals:
## Min 1Q Median 3Q Max
## -335.43 -5.01 -1.31 2.45 670.18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.67197 2.75238 8.237 2.83e-16 ***
## CountryUK -22.41653 2.78715 -8.043 1.35e-15 ***
## NumTransactions 2.51422 0.36070 6.970 4.05e-12 ***
## Customer.Regularity 7.88592 10.25812 0.769 0.442
## item.22086 2.47595 0.03397 72.883 < 2e-16 ***
## item.22423 10.91616 0.04312 253.152 < 2e-16 ***
## item.23084 1.91245 0.01650 115.874 < 2e-16 ***
## item.47566 4.20074 0.03282 128.001 < 2e-16 ***
## item.79321 4.36537 0.04339 100.616 < 2e-16 ***
## item.84879 1.56312 0.01726 90.566 < 2e-16 ***
## item.85099B 1.71986 0.01008 170.635 < 2e-16 ***
## item.85123A 2.45307 0.01410 173.988 < 2e-16 ***
## item.POST 18.12537 0.21431 84.574 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.14 on 2461 degrees of freedom
## Multiple R-squared: 0.9944, Adjusted R-squared: 0.9943
## F-statistic: 3.616e+04 on 12 and 2461 DF, p-value: < 2.2e-16
## [1] 1855.65 1847.76 1900.39 1926.65 1898.44
## [1] 4665415
## [1] 43.54009
## [1] 0.9943333
## [1] 25180.78
All Subsets Regression:
All subsets regression is implemented with the regsubsets() function from the leaps package. It searches candidate subsets of the predictors and summarizes the best model of each size graphically (by criteria such as adjusted R², Cp, or BIC), which an analyst can use for variable selection. The following commands produce the candidate subsets of variables.
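A sketch of the all-subsets search (object names are assumptions):
library(leaps)
subsets.fit = regsubsets(NetRevenue ~ ., data = lm.data.out, nvmax = 12)
plot(subsets.fit, scale = "adjr2")   # graphical summary of the best subset of each size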
##
## Call:
## lm(formula = NetRevenue ~ . - Country - NumTransactions - Customer.Regularity -
## item.22086, data = lm.data.out)
##
## Residuals:
## Min 1Q Median 3Q Max
## -433.75 -17.48 -13.88 -1.31 1381.41
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.82542 1.55282 11.48 <2e-16 ***
## item.22423 11.22176 0.07786 144.12 <2e-16 ***
## item.23084 1.95996 0.03020 64.89 <2e-16 ***
## item.47566 4.49569 0.05857 76.76 <2e-16 ***
## item.79321 4.45293 0.07964 55.92 <2e-16 ***
## item.84879 1.85061 0.02984 62.01 <2e-16 ***
## item.85099B 1.78191 0.01834 97.18 <2e-16 ***
## item.85123A 2.56718 0.02529 101.52 <2e-16 ***
## item.POST 19.35949 0.30534 63.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 72.21 on 2465 degrees of freedom
## Multiple R-squared: 0.9808, Adjusted R-squared: 0.9807
## F-statistic: 1.572e+04 on 8 and 2465 DF, p-value: < 2.2e-16
## [1] 6050.29 5919.84 5871.04 6052.73 6151.69
## [1] 14866558
## [1] 77.65989
## [1] 0.980712
## [1] 28207.16
## [1] "Country" "NumTransactions" "NetRevenue"
## [4] "Customer.Regularity" "item.22086" "item.22423"
## [7] "item.23084" "item.47566" "item.79321"
## [10] "item.84879" "item.85099B" "item.85123A"
## [13] "item.POST"
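A hedged sketch of how the principal-component data set could be built (the Comp.1 through Comp.12 naming below matches princomp(); object names are assumptions):
x = model.matrix(NetRevenue ~ ., data = lm.data.out)[, -1]   # numeric predictor matrix
pcs = princomp(x)
pca_data = data.frame(NetRevenue = lm.data.out$NetRevenue, pcs$scores)
pcr.fit = lm(NetRevenue ~ ., data = pca_data)
summary(pcr.fit)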
##
## Call:
## lm(formula = NetRevenue ~ ., data = pca_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -335.43 -5.01 -1.31 2.45 670.18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 188.306289 0.786883 239.307 < 2e-16 ***
## Comp.1 -3.646702 0.008011 -455.219 < 2e-16 ***
## Comp.2 -2.792531 0.013520 -206.555 < 2e-16 ***
## Comp.3 -3.690102 0.015378 -239.959 < 2e-16 ***
## Comp.4 0.261864 0.016326 16.040 < 2e-16 ***
## Comp.5 7.130899 0.029704 240.068 < 2e-16 ***
## Comp.6 -0.107689 0.034134 -3.155 0.00163 **
## Comp.7 9.840579 0.042547 231.289 < 2e-16 ***
## Comp.8 -1.458365 0.044507 -32.767 < 2e-16 ***
## Comp.9 16.244846 0.147814 109.901 < 2e-16 ***
## Comp.10 9.889286 0.228376 43.303 < 2e-16 ***
## Comp.11 -21.722135 2.788677 -7.789 9.85e-15 ***
## Comp.12 -7.645879 10.262685 -0.745 0.45633
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.14 on 2461 degrees of freedom
## Multiple R-squared: 0.9944, Adjusted R-squared: 0.9943
## F-statistic: 3.616e+04 on 12 and 2461 DF, p-value: < 2.2e-16
## [1] 1881.99 1855.65 1847.76 1900.39 1926.65
## [1] 4657275
## [1] 43.50209
## [1] 0.9943333
## [1] 25180.78
## Model.No. Model.Name RSE Adj.R2
## 1 Model.1 Least.Squares/full.model 284.97587 0.9228884
## 2 Model.2 Least.Squares/full.model/outliers removed 43.54009 0.9943333
## 3 Model.3 All Subsets Regression/outliers removed 77.65989 0.9807120
## 4 Model.4 PCR/outliers removed 43.50209 0.9943333
## AIC
## 1 38379.68
## 2 25180.78
## 3 28207.16
## 4 25180.78
Before we proceed to model selection, a brief note on the evaluation criteria: RSE and AIC are better when lower, while adjusted R² is better when higher.
Based on these criteria, it looks like model-2 and model-4 are the best models, with good predictive power.
Because least squares is better for interpretability of the predictors in a regression, I will choose model-2 over model-4 (model-4 is a principal component regression).
We will run the final model on the test-set data that we kept aside and never introduced to the model.
Then we will compare the predicted total sales with the actual total sales to see how our model has performed.
dim(lm.data.test) # data of 275 different customers.
## [1] 275 13
Actual.Sales = sum(lm.data.test$NetRevenue)
Actual.Sales
## [1] 44197.4
There are 275 customers in our test set.
Total actual sales from these 275 customers = £44,197.40.
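A sketch of the prediction step (the model name lm.fit.out is an assumption); the interval below appears to be the sum of the per-customer prediction limits, which is a conservative interval for the total:
pred = predict(lm.fit.out, newdata = lm.data.test, interval = "prediction")
data.frame(Lower.Range = sum(pred[, "lwr"]),
           Pred.Sales  = sum(pred[, "fit"]),
           Upper.Range = sum(pred[, "upr"]))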
## Lower.Range Pred.Sales Upper.Range
## 1 22411.38 43555.13 64698.88
Predicted sales from these 275 customers = £43,555.
95% prediction interval for the sales prediction = £22,411 to £64,699.
Total actual sales from these 275 customers = £44,197.