This project analyzes data from a UK-based, non-store online retail business. The data set, obtained from the UCI Machine Learning Repository, contains online transactions and is described below.
Objective: to analyze the data and help the organization improve its sales (total revenue).
We will work through the sections below to achieve that.
Source:
Dr Daqing Chen, Director: Public Analytics group. chend ‘@’ lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.
Data Set Information:
This is a transactional data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
Attribute Information:
I will load this data set into the statistical software R for our further analysis; the attributes and their types are shown in the structure output below.
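A minimal loading sketch (the file name "Online Retail.csv" is an assumption; stringsAsFactors = TRUE reproduces the factor columns shown below):
retail = read.csv("Online Retail.csv", stringsAsFactors = TRUE)
dim(retail)    # 541909 rows, 8 variables
str(retail)    # variable types, as printed below
head(retail)
names(retail)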
## [1] 541909 8
## 'data.frame': 541909 obs. of 8 variables:
## $ InvoiceNo : Factor w/ 25900 levels "536365","536366",..: 1 1 1 1 1 1 1 2 2 3 ...
## $ StockCode : Factor w/ 4070 levels "10002","10080",..: 3538 2795 3045 2986 2985 1663 801 1548 1547 3306 ...
## $ Description: Factor w/ 4224 levels ""," 4 PURPLE FLOCK DINNER CANDLES",..: 4027 4035 932 1959 2980 3235 1573 1698 1695 259 ...
## $ Quantity : int 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDate: Factor w/ 23260 levels "1/10/2011 10:04",..: 6839 6839 6839 6839 6839 6839 6839 6840 6840 6841 ...
## $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : int 17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
## $ Country : Factor w/ 38 levels "Australia","Austria",..: 36 36 36 36 36 36 36 36 36 36 ...
## InvoiceNo StockCode Description Quantity
## 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
## 2 536365 71053 WHITE METAL LANTERN 6
## 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
## 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
## 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
## 6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2
## InvoiceDate UnitPrice CustomerID Country
## 1 12/1/2010 8:26 2.55 17850 United Kingdom
## 2 12/1/2010 8:26 3.39 17850 United Kingdom
## 3 12/1/2010 8:26 2.75 17850 United Kingdom
## 4 12/1/2010 8:26 3.39 17850 United Kingdom
## 5 12/1/2010 8:26 3.39 17850 United Kingdom
## 6 12/1/2010 8:26 7.65 17850 United Kingdom
## [1] "InvoiceNo" "StockCode" "Description" "Quantity" "InvoiceDate"
## [6] "UnitPrice" "CustomerID" "Country"
We will explore the data set and, as the need arises, pre-process it to make it suitable for our further analysis: clustering of customers and prediction of sales.
Any duplicate CustomerIDs ?
First, I would like to see whether there are any duplicate customer IDs in the data set.
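One way this check could be written (a sketch; dplyr assumed, data frame named retail as in the later code):
library(dplyr)
retail %>%
  distinct(CustomerID, Country) %>%   # unique (customer, country) pairs
  group_by(CustomerID) %>%
  summarise(count = n()) %>%          # number of countries per customer
  filter(count > 1)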
## Source: local data frame [8 x 2]
##
## CustomerID count
## (int) (int)
## 1 12370 2
## 2 12394 2
## 3 12417 2
## 4 12422 2
## 5 12429 2
## 6 12431 2
## 7 12455 2
## 8 12457 2
As shown above, there are 8 duplicate customer IDs, each appearing in the database under two different countries.
I have corrected this in Excel by changing the customer ID for one record in each pair.
Let’s check again for duplicate IDs after correction.
## [1] 541909 8
## 'data.frame': 541909 obs. of 8 variables:
## $ InvoiceNo : Factor w/ 25900 levels "536365","536366",..: 1 1 1 1 1 1 1 2 2 3 ...
## $ StockCode : Factor w/ 4070 levels "10002","10080",..: 3538 2795 3045 2986 2985 1663 801 1548 1547 3306 ...
## $ Description: Factor w/ 4224 levels ""," 4 PURPLE FLOCK DINNER CANDLES",..: 4027 4035 932 1959 2980 3235 1573 1698 1695 259 ...
## $ Quantity : int 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDate: Factor w/ 23260 levels "1/10/2011 10:04",..: 6839 6839 6839 6839 6839 6839 6839 6840 6840 6841 ...
## $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : Factor w/ 4380 levels "12346","12347",..: 4057 4057 4057 4057 4057 4057 4057 4057 4057 549 ...
## $ Country : Factor w/ 38 levels "Australia","Austria",..: 36 36 36 36 36 36 36 36 36 36 ...
## InvoiceNo StockCode Description Quantity
## 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
## 2 536365 71053 WHITE METAL LANTERN 6
## 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
## 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
## 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
## 6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2
## InvoiceDate UnitPrice CustomerID Country
## 1 12/1/2010 8:26 2.55 17850 United Kingdom
## 2 12/1/2010 8:26 3.39 17850 United Kingdom
## 3 12/1/2010 8:26 2.75 17850 United Kingdom
## 4 12/1/2010 8:26 3.39 17850 United Kingdom
## 5 12/1/2010 8:26 3.39 17850 United Kingdom
## 6 12/1/2010 8:26 7.65 17850 United Kingdom
## [1] "InvoiceNo" "StockCode" "Description" "Quantity" "InvoiceDate"
## [6] "UnitPrice" "CustomerID" "Country"
## Source: local data frame [0 x 2]
##
## Variables not shown: CustomerID (fctr), count (int)
No duplicate customer IDs are found this time, so we are good to proceed with further analysis.
Pre-processing_1
Here we will:
- check for any NAs in the data set,
- check for any cancelled transactions,
- add a new variable for "Revenue" (unit price * quantity).
# any NAs in data set?
colSums(is.na(retail))
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## 0 0 0 0 0 0
## CustomerID Country
## 135080 0
# customerID variable has 135080 NAs.
# any cancelled invoices?
table(grepl("^C", retail$InvoiceNo))
##
## FALSE TRUE
## 532621 9288
# 9288 cancelled transactions.
# adding a new variable for revenue (unit price * quantity) per invoice line:
retail$Revenue = with(retail, Quantity*UnitPrice)
Pre-processing_2
Here we will check which 10 items are the best-selling items, based on the net revenue each item generated over the entire period of this data set.
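A possible computation behind the item count and the top-10 table below (assuming retail carries the Revenue column added above):
length(unique(retail$StockCode))      # number of distinct items sold
retail %>%
  group_by(StockCode) %>%
  summarise(NetRevenue = sum(Revenue)) %>%
  arrange(desc(NetRevenue)) %>%
  head(10)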
## [1] 4070
## Source: local data frame [10 x 2]
##
## StockCode NetRevenue
## (fctr) (dbl)
## 1 DOT 206245.48
## 2 22423 164762.19
## 3 47566 98302.98
## 4 85123A 97894.50
## 5 85099B 92356.03
## 6 23084 66756.59
## 7 POST 66230.64
## 8 22086 63791.94
## 9 84879 58959.73
## 10 79321 53768.06
There are 4,070 unique items that our store sells.
The table above shows the top-10 selling items in order of their revenue contribution.
Pre-processing_3
In the next section, we will filter the original data set down to only these top-10 best-selling items. We will also remove NAs/blanks from the data set.
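A sketch of this filtering step (the object names retail3 and retail3.nona follow the later code; the stock codes come from the top-10 table above):
top10 = c("DOT", "22423", "47566", "85123A", "85099B",
          "23084", "POST", "22086", "84879", "79321")
retail3 = droplevels(retail[retail$StockCode %in% top10, ])
colSums(is.na(retail3))        # CustomerID still contains NAs
retail3.nona = na.omit(retail3)
colSums(is.na(retail3.nona))   # all zero after na.omit()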
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## 0 0 0 0 0 0
## CustomerID Country Revenue
## 2724 0 0
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## 0 0 0 0 0 0
## CustomerID Country Revenue
## 0 0 0
Pre-processing_4
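A plausible version of this step: convert InvoiceDate from factor to Date and take its range (the format string is an assumption based on raw values such as "12/1/2010 8:26"):
retail3.nona$InvoiceDate = as.Date(retail3.nona$InvoiceDate, format = "%m/%d/%Y %H:%M")
min(retail3.nona$InvoiceDate)
max(retail3.nona$InvoiceDate)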
## [1] "2010-12-01"
## [1] "2011-12-09"
Our data set has data from 2010-12-01 through 2011-12-09.
Final Tidy Data set: Summary
# dimension of the data set:
dim(retail3.nona)
## [1] 12106 12
# 12106 observations.
# 12 variables.
# structure of the data set:
str(retail3.nona)
## 'data.frame': 12106 obs. of 12 variables:
## $ InvoiceNo : Factor w/ 8724 levels "536365","536367",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ StockCode : Factor w/ 10 levels "22086","22423",..: 8 6 10 1 8 8 1 7 5 8 ...
## $ Description: Factor w/ 11 levels "ASSORTED COLOUR BIRD ORNAMENT",..: 11 1 8 6 11 11 6 5 2 11 ...
## $ Quantity : int 6 32 3 80 6 6 4 100 192 64 ...
## $ InvoiceDate: Date, format: "2010-12-01" "2010-12-01" ...
## $ UnitPrice : num 2.55 1.69 18 2.55 2.55 2.55 2.95 1.65 3.82 2.55 ...
## $ CustomerID : Factor w/ 4380 levels "12346","12347",..: 4057 549 200 1056 4057 4057 2201 2731 2731 3810 ...
## $ Country : Factor w/ 35 levels "Australia","Austria",..: 33 33 14 33 33 33 33 33 33 33 ...
## $ Revenue : num 15.3 54.1 54 204 15.3 ...
## $ Month : Factor w/ 12 levels "1","2","3","4",..: 12 12 12 12 12 12 12 12 12 12 ...
## $ Year : Factor w/ 2 levels "2010","2011": 1 1 1 1 1 1 1 1 1 1 ...
## $ YearMonth : Factor w/ 13 levels "2010-12","2011-1",..: 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:2724] 47 48 49 50 51 52 63 64 65 66 ...
## .. ..- attr(*, "names")= chr [1:2724] "1503" "1577" "1746" "1777" ...
# "retail3" is dataset with top-10 items with NAs.
# "retail3.nona" is dataset with top-10 items & no NAs.
summary(retail3.nona)
## InvoiceNo StockCode Description
## 579196 : 8 85123A :2077 WHITE HANGING HEART T-LIGHT HOLDER:2070
## 574481 : 7 22423 :1905 REGENCY CAKESTAND 3 TIER :1905
## 575607 : 7 85099B :1662 JUMBO BAG RED RETROSPOT :1662
## 578270 : 7 84879 :1418 ASSORTED COLOUR BIRD ORNAMENT :1418
## 563555 : 6 47566 :1416 PARTY BUNTING :1416
## 569246 : 6 POST :1196 POSTAGE :1196
## (Other):12065 (Other):2432 (Other) :2439
## Quantity InvoiceDate UnitPrice
## Min. :-1930.00 Min. :2010-12-01 Min. : 0.000
## 1st Qu.: 2.00 1st Qu.:2011-04-05 1st Qu.: 2.080
## Median : 6.00 Median :2011-07-18 Median : 2.950
## Mean : 16.19 Mean :2011-07-05 Mean : 8.851
## 3rd Qu.: 12.00 3rd Qu.:2011-10-19 3rd Qu.: 10.950
## Max. : 2880.00 Max. :2011-12-09 Max. :8142.750
##
## CustomerID Country Revenue Month
## 14911 : 150 United Kingdom:10046 Min. :-8142.75 11 :2042
## 17841 : 146 Germany : 544 1st Qu.: 12.75 12 :1365
## 12748 : 79 France : 526 Median : 20.80 10 :1174
## 14096 : 74 EIRE : 186 Mean : 53.81 5 :1087
## 13089 : 71 Belgium : 136 3rd Qu.: 45.47 9 :1082
## 12682 : 59 Spain : 131 Max. : 8142.75 6 : 876
## (Other):11527 (Other) : 537 (Other):4480
## Year YearMonth
## 2010: 848 2011-11:2042
## 2011:11258 2011-10:1174
## 2011-5 :1087
## 2011-9 :1082
## 2011-6 : 876
## 2011-3 : 858
## (Other):4987
Summary observations:
- One invoice can have multiple items bought under it.
- The minimum and maximum InvoiceDate show that our transaction database runs from 2010-12-01 to 2011-12-09.
- There are many repeat customers.
- Most of the transactions are from the United Kingdom; detailed statistics are calculated in later sections.
Boxplots of revenue for the top-10 items show that revenue for most items is positively skewed: a number of high-value outliers pull each item's revenue distribution to the right.
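A minimal sketch of these boxplots (ggplot2 assumed):
library(ggplot2)
ggplot(retail3.nona, aes(x = StockCode, y = Revenue)) +
  geom_boxplot() +
  labs(x = "StockCode", y = "Revenue per invoice line",
       title = "Revenue distribution of the top-10 items")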
The top-10 best-seller chart shows that "DOT" is the number-one selling item by revenue.
## [1] 2766
Because the frequency in the £0-£1,000 range is so high (a lot of customers fall in that revenue range), it is hard to see the other bins and their distribution.
The second chart is the same as the first with the y-axis zoomed in, so the other bins' distribution is visible more clearly.
There are 2,766 unique customers in our data set, accounting for 12,106 transactions on the 10 best-selling items in the store.
Most customers each account for revenue in the range of £0-£2,500 over the period of data collection.
The distribution of revenue per customer is clearly positively skewed, with some outliers in the high ranges contributing more than £7,500 each. This group will be interesting to dig into to examine its buying behaviour.
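A possible per-customer revenue summary behind these observations (the name rev.per.cust is an assumption):
rev.per.cust = retail3.nona %>%
  group_by(CustomerID, Country) %>%
  summarise(RevenueFromCustomer = sum(Revenue))
nrow(rev.per.cust)   # 2766 unique customers, as reported above
hist(rev.per.cust$RevenueFromCustomer, breaks = 50,
     xlab = "Revenue per customer", main = "Revenue per customer")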
Customers with abnormally high Revenue
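The high-revenue group can then be listed by filtering that summary (the £7,500 cutoff comes from the discussion above; the negative-revenue customers in the next section follow analogously with RevenueFromCustomer < 0):
rev.per.cust %>% filter(RevenueFromCustomer > 7500)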
## Source: local data frame [9 x 3]
## Groups: CustomerID [9]
##
## CustomerID Country RevenueFromCustomer
## (fctr) (fctr) (dbl)
## 1 12931 United Kingdom 8715.80
## 2 14096 United Kingdom 14739.50
## 3 14646 Netherlands 19251.38
## 4 14911 EIRE 7927.25
## 5 15061 United Kingdom 14545.70
## 6 15769 United Kingdom 8077.00
## 7 16013 United Kingdom 14416.70
## 8 17450 United Kingdom 12315.44
## 9 17511 United Kingdom 8961.53
## [1] 9 3
Customers with Negative Revenue
## Source: local data frame [23 x 3]
## Groups: CustomerID [23]
##
## CustomerID Country RevenueFromCustomer
## (fctr) (fctr) (dbl)
## 1 13026 United Kingdom -7.62
## 2 13426 United Kingdom -7.62
## 3 13826 United Kingdom -4.90
## 4 14295 United Kingdom -3.00
## 5 14339 United Kingdom -3.82
## 6 14441 United Kingdom -48.00
## 7 14627 United Kingdom -2.95
## 8 15128 United Kingdom -3.71
## 9 15157 United Kingdom -10.34
## 10 15215 United Kingdom -32.85
## .. ... ... ...
## [1] 23 3
These are the 23 customers whose contributed revenue is negative.
Net revenue per customer ends up negative because of cancelled transactions.
Interestingly, all of these customers are from the UK.
It would be worth analyzing why these customers have more cancellations than a typical customer of this store, ending up with no positive revenue contribution. It may be a good idea to collect more data to find out the reasons behind all these cancellations.
Customer Distribution in different Revenue bins
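A sketch of the binning (break points inferred from the table below):
cut.revenue = cut(rev.per.cust$RevenueFromCustomer,
                  breaks = c(-2500, 0, 2500, 5000, 7500, 10000, 30000),
                  right = FALSE, include.lowest = TRUE, dig.lab = 5)
table(cut.revenue)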
## cut.revenue
## [-2500, 0) [ 0, 2500) [ 2500, 5000) [ 5000, 7500) [ 7500,10000)
## 23 2703 21 10 4
## [10000,30000]
## 5
## Source: local data frame [6 x 3]
##
## InvoiceNo ItemsSoldPerInvoice No.of.Customers
## (fctr) (int) (int)
## 1 536365 1 1
## 2 536367 1 1
## 3 536370 1 1
## 4 536371 1 1
## 5 536373 1 1
## 6 536375 1 1
## Source: local data frame [6 x 3]
##
## InvoiceNo ItemsSoldPerInvoice No.of.Customers
## (fctr) (int) (int)
## 1 C581117 1 1
## 2 C581128 1 1
## 3 C581228 1 1
## 4 C581229 1 1
## 5 C581235 1 1
## 6 C581470 1 1
This table shows that multiple items can be sold per invoice, but each invoice always carries exactly one CustomerID.
Conversely, one CustomerID can map to multiple invoice numbers if the customer has shopped more than once.
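A plausible per-country revenue summary (the name rev.per.country is an assumption):
rev.per.country = retail3.nona %>%
  group_by(Country) %>%
  summarise(RevenuePerCountry = sum(Revenue)) %>%
  arrange(desc(RevenuePerCountry))
dim(rev.per.country)        # 35 countries
head(rev.per.country, 10)   # top-10 countries by revenue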
## [1] 35 2
## Source: local data frame [10 x 2]
##
## Country RevenuePerCountry
## (fctr) (dbl)
## 1 United Kingdom 499551.33
## 2 Germany 31687.91
## 3 France 28468.99
## 4 Netherlands 19841.33
## 5 EIRE 13891.02
## 6 Spain 8777.82
## 7 Australia 7098.86
## 8 Japan 6763.29
## 9 Switzerland 5700.48
## 10 Belgium 5469.09
Customers belong to 35 countries. The original data set had 38 countries; the top-10-items data set had 36; after removing NAs, 35 remain.
The chart of top-10 countries by revenue shows that the United Kingdom is by far the highest revenue-contributing country for this store.
Most of the top-10 countries are in Europe, which indicates local popularity rather than international market capture.
Even though this is an online business with no physical stores in the UK or elsewhere, its sales revenue comes mostly from the UK and nearby European countries. This suggests that marketing strategies would need to change to reach international customers, if that is the goal.
## Source: local data frame [35 x 4]
##
## Country TotalCustomers NetRevenue RevenuePerCustomer
## (fctr) (int) (dbl) (dbl)
## 1 EIRE 3 13891.02 4630.3400
## 2 Japan 2 6763.29 3381.6450
## 3 Netherlands 9 19841.33 2204.5922
## 4 Australia 5 7098.86 1419.7720
## 5 Singapore 1 463.70 463.7000
## 6 Norway 9 3611.86 401.3178
## 7 France 76 28468.99 374.5920
## 8 Germany 90 31687.91 352.0879
## 9 Iceland 1 345.09 345.0900
## 10 Finland 12 4067.47 338.9558
## .. ... ... ... ...
These are the top-10 countries by per-customer revenue contribution.
The UK may be the number-one total revenue contributor, but when it comes to per-customer revenue these countries top the list and the UK is nowhere near them.
## [1] 18
The UK ranks 18th from the top in per-customer revenue to the store.
## [1] 86.58713
## [1] 76.68125
## [1] 17.01636
Non-UK countries account for a very small percentage of total transactions, so they do not give us any clustering-related insight per country.
Also, our database has no location-specific information such as address or latitude/longitude; all we have are country names, so grouping customers by precise location is not possible anyway.
It is therefore better to label all non-UK countries as "non-UK" rather than keep 35 different labels for the Country variable.
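A minimal sketch of the relabeling, applied here at the transaction level:
retail3.nona$Country = factor(ifelse(retail3.nona$Country == "United Kingdom",
                                     "UK", "non-UK"))
sum(retail3.nona$Country == "UK")   # 10046 UK observations, as below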
## [1] 10046
Out of 12,106 observations, 10,046 are from UK customers and the rest are from non-UK customers.
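The daily aggregation printed below could be produced along these lines (the name daily is an assumption):
daily = retail3.nona %>%
  group_by(InvoiceDate, YearMonth) %>%
  summarise(TotalCustomersPerDay = n_distinct(CustomerID),
            RevenuePerDay = sum(Revenue))
head(daily)
tail(daily)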
## Source: local data frame [6 x 4]
## Groups: InvoiceDate [6]
##
## InvoiceDate YearMonth TotalCustomersPerDay RevenuePerDay
## (date) (fctr) (int) (dbl)
## 1 2010-12-01 2010-12 58 4502.98
## 2 2010-12-02 2010-12 59 4354.57
## 3 2010-12-03 2010-12 41 3022.23
## 4 2010-12-05 2010-12 74 3010.21
## 5 2010-12-06 2010-12 66 2025.20
## 6 2010-12-07 2010-12 47 4102.19
## Source: local data frame [6 x 4]
## Groups: InvoiceDate [6]
##
## InvoiceDate YearMonth TotalCustomersPerDay RevenuePerDay
## (date) (fctr) (int) (dbl)
## 1 2011-12-04 2011-12 46 981.35
## 2 2011-12-05 2011-12 81 4831.22
## 3 2011-12-06 2011-12 75 4757.26
## 4 2011-12-07 2011-12 76 7216.56
## 5 2011-12-08 2011-12 65 2545.61
## 6 2011-12-09 2011-12 17 572.07
As seen earlier in the revenue-by-date trend, monthly revenue stayed mostly flat through 2011 but has picked up over the last 2-3 months. The final month's data are partial, not a full month, which explains the apparent dip in revenue.
The trends for revenue per month and number of customers served per month are very similar; the number of customers served is therefore a good indicator of the revenue trend.
library(dplyr)   # for %>%, group_by() and summarise(), if not already loaded
retail.data = retail3.nona %>% group_by(CustomerID, Country) %>% summarise(NumTransactions = n(), NetRevenue = sum(Revenue))
# preparing a variable for customer regularity:
# count the months (out of the 13 months of data) in which a customer appears
# (for example, a customer seen in 7 of the 13 months scores 7);
# the score is then normalized to [0, 1] by dividing by 13.
cus.Regularity = numeric()
for(i in 1:length(retail.data$CustomerID)){
cus.Regularity[i] = length(unique(retail3.nona$YearMonth[retail3.nona$CustomerID == retail.data$CustomerID[i]]))
}
cus.Regularity = cus.Regularity/13
cus.item = retail3.nona %>% group_by(CustomerID, StockCode) %>% summarise(count = sum(Quantity))
library(reshape2)   # provides dcast()
cus.item.count = dcast(cus.item, CustomerID ~ StockCode)
## Using 'count' as value column. Use 'value.var' to override
cus.item.count[is.na(cus.item.count)] <- 0
retail.data2 = retail.data
retail.data2$Customer.Regularity = cus.Regularity
retail.data2$item.22086 = cus.item.count$`22086`
retail.data2$item.22423 = cus.item.count$`22423`
retail.data2$item.23084 = cus.item.count$`23084`
retail.data2$item.47566 = cus.item.count$`47566`
retail.data2$item.79321 = cus.item.count$`79321`
retail.data2$item.84879 = cus.item.count$`84879`
retail.data2$item.85099B = cus.item.count$`85099B`
retail.data2$item.85123A = cus.item.count$`85123A`
retail.data2$item.DOT = cus.item.count$DOT
retail.data2$item.POST = cus.item.count$POST
customer.id = retail.data2$CustomerID
rownames(retail.data2) = retail.data2$CustomerID
retail.data2 = retail.data2[,-1]
# PCA using prcomp():
pr.out = prcomp(retail.data2, scale = TRUE)
pr.out$sdev
## [1] 1.88764372 1.27045571 1.07766277 1.04104830 1.01544402 0.97503404
## [7] 0.94359622 0.92650296 0.91236506 0.85610288 0.81969123 0.63762911
## [13] 0.44873645 0.04306016
pr.var = pr.out$sdev^2
pr.var
## [1] 3.563198826 1.614057712 1.161357051 1.083781566 1.031126559
## [6] 0.950691376 0.890373836 0.858407734 0.832410006 0.732912148
## [11] 0.671893717 0.406570886 0.201364405 0.001854177
pr.out$sdev is the standard deviation of each principal component; squaring it gives the variance explained by each PC for this data set.
To compute the proportion of variance explained (PVE) by each PC, divide each PC's variance by the total variance across all PCs.
PVE = pr.var / sum(pr.var) * 100 # in percentage
PVE
## [1] 25.45142019 11.52898366 8.29540751 7.74129690 7.36518971
## [6] 6.79065268 6.35981311 6.13148382 5.94578576 5.23508677
## [11] 4.79924084 2.90407776 1.43831718 0.01324412
We see that the first PC explains 25.5% of the variance in the data set, and the next PC 11.5%.
We can now plot the PVE of each component as well as the cumulative PVE.
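A minimal sketch of the two plots:
par(mfrow = c(1, 2))
plot(PVE, type = "b", xlab = "Principal Component",
     ylab = "PVE (%)", main = "Proportion of Variance Explained")
plot(cumsum(PVE), type = "b", xlab = "Principal Component",
     ylab = "Cumulative PVE (%)", main = "Cumulative PVE")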
As our variables are numerical and our goal is unsupervised discovery of some structure/grouping among the customers, I will use k-means clustering.
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are:
The centroids of the K clusters, which can be used to label new data
Labels for the training data (each data point is assigned to a single cluster)
Rather than defining groups before looking at the data, clustering allows you to find and analyze the groups that have formed organically. The “Choosing K” step in the example below describes how the number of groups can be determined.
Each centroid of a cluster is a collection of feature values which define the resulting groups. Examining the centroid feature weights can be used to qualitatively interpret what kind of group each cluster represents.
set.seed(1)   # for reproducible k-means starts
kmeans.model = kmeans(retail.data2, centers = 3, nstart = 50)
wss = kmeans.model$tot.withinss   # total within-cluster sum of squares
bss = kmeans.model$betweenss      # between-cluster sum of squares
table(kmeans.model$cluster)       # cluster sizes
##
## 1 2 3
## 2710 49 7
# PCA using prcomp() (recomputed for the 3D plot below):
pr.out = prcomp(retail.data2, scale = TRUE)
library(rgl)       # 3D scatter plot
library(cluster)   # silhouette()
distance = dist(retail.data2)   # Euclidean distances for the silhouette
plot(silhouette(kmeans.model$cluster, distance))
After k-means clustering:
- cluster-1: 2710 customers
- cluster-2: 49 customers
- cluster-3: 7 customers
Silhouette coefficients for the three clusters are 0.95, 0.27, and 0.32, with the average silhouette coefficient at 0.93. This indicates very good clustering.
The above plot is a 3D plot of customers on 3 of the original variables as axes, colored by the 3 clusters.
Now we will add the resulting cluster assignments back to the data set so we can use them for exploratory analysis and see how these 3 clusters of customers differ and where they are similar.
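A sketch of attaching the cluster labels back to the per-customer data (customer.id was saved earlier; label strings match the tables below):
retail.data2$cluster = factor(kmeans.model$cluster,
                              labels = c("cluster-1", "cluster-2", "cluster-3"))
retail.data2$CustomerID = customer.id
table(retail.data2$cluster)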
##
## cluster-1 cluster-2 cluster-3
## 2710 49 7
## Source: local data frame [6 x 16]
##
## Country NumTransactions NetRevenue Customer.Regularity item.22086
## (fctr) (int) (dbl) (dbl) (dbl)
## 1 non-UK 7 345.09 0.46153846 0
## 2 non-UK 4 360.00 0.30769231 0
## 3 non-UK 2 312.75 0.07692308 0
## 4 non-UK 1 40.00 0.07692308 0
## 5 non-UK 7 343.75 0.30769231 0
## 6 non-UK 1 25.50 0.07692308 0
## Variables not shown: item.22423 (dbl), item.23084 (dbl), item.47566 (dbl),
## item.79321 (dbl), item.84879 (dbl), item.85099B (dbl), item.85123A
## (dbl), item.DOT (dbl), item.POST (dbl), cluster (fctr), CustomerID
## (fctr)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2766 obs. of 16 variables:
## $ Country : Factor w/ 2 levels "non-UK","UK": 1 1 1 1 1 1 1 1 1 1 ...
## $ NumTransactions : int 7 4 2 1 7 1 4 1 2 7 ...
## $ NetRevenue : num 345 360 313 40 344 ...
## $ Customer.Regularity: num 0.4615 0.3077 0.0769 0.0769 0.3077 ...
## $ item.22086 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.22423 : num 15 0 1 0 5 2 22 8 0 27 ...
## $ item.23084 : num 84 0 0 0 0 0 0 0 0 0 ...
## $ item.47566 : num 0 0 0 0 0 0 0 0 0 5 ...
## $ item.79321 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.84879 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.85099B : num 0 0 0 0 0 0 0 0 0 10 ...
## $ item.85123A : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.DOT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.POST : num 0 9 1 1 7 0 18 0 6 0 ...
## $ cluster : Factor w/ 3 levels "cluster-1","cluster-2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ CustomerID : Factor w/ 2766 levels "1","10","100",..: 1 1112 1990 2101 2212 2323 2434 2545 2656 2 ...
##
## cluster-1 cluster-2 cluster-3
## 2710 49 7
## cluster NumTransactions
## 1 cluster-1 3.863469
## 2 cluster-2 28.061224
## 3 cluster-3 37.285714
Clearly, all 3 clusters have distinct averages for the number of transactions per customer, so this is a good variable for distinguishing customer segments.
Cluster-1 customers have the lowest number of transactions, with a mean of 3.9.
Cluster-3 customers have the highest number of transactions, with a mean of 37.3.
## cluster NetRevenue
## 1 cluster-1 369488.72
## 2 cluster-2 189030.03
## 3 cluster-3 92946.05
## cluster NetRevenue
## 1 cluster-1 136.3427
## 2 cluster-2 3857.7557
## 3 cluster-3 13278.0071
All 3 clusters also have distinct average revenue per customer.
Cluster-3 customers generate average revenue of 13,278, while cluster-1 customers generate very low revenue in comparison, at 136.
Promotions and coupons could perhaps be used to attract cluster-1 customers, but further clustering or classification of cluster-1 is necessary, as it accounts for almost 98% of customers.
## cluster Customer.Regularity
## 1 cluster-1 0.1845018
## 2 cluster-2 0.6860283
## 3 cluster-3 0.7142857
Clearly, all 3 clusters have distinct average customer-visit regularity over the year; this is another good variable for distinguishing customer segments.
Cluster-3 has the highest regularity at 71.4%, meaning its customers visit the store's website regularly throughout the year.
Cluster-1 has the lowest visit regularity, at 18.5%. Combined with this group's low number of transactions and low revenue per customer, promotions and coupons could perhaps attract these customers; again, further clustering or classification of cluster-1 is needed, as it holds almost 98% of customers.
Cluster-2 customers are also regular visitors, at 68% regularity, but still do not generate as much net revenue as cluster-3 customers; promotions and coupons might help here too.
There is also an interesting aspect of cluster-3: it is highest in transactions per customer, net revenue per customer, and visit regularity, yet lowest in number of customers and in total net revenue.
So cluster-3 is a small group of customers who regularly visit the store's website, make many transactions over the year, and bring in a good amount of revenue per head; but because the group is small, its total contribution stays modest, and cluster-1 wins on total net revenue through sheer customer volume.
##
## cluster-1 cluster-2 cluster-3
## non-UK 0.968 0.030 0.003
## UK 0.982 0.016 0.003
##
## cluster-1 cluster-2 cluster-3
## non-UK 0.13 0.22 0.14
## UK 0.87 0.78 0.86
As shown above, cluster-3 customers buy a high average number of items in almost all 10 item categories, with cluster-1 lowest in average quantity bought per item.
Clusters 1 and 2 do not buy "DOT"; since they account for 99.7% of all customers, this means only the few customers in cluster-3 buy DOT.
As mentioned earlier, further clustering of cluster-1 is necessary, as it accounts for almost 98% of customers.
Here I will apply an approach based on my own view: filter the data down to cluster-1 only and re-apply the entire k-means clustering process used earlier, to find sub-clusters within cluster-1.
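One plausible construction of the cluster-1 subset (column handling inferred from the str() output below: the cluster, CustomerID, and item.DOT columns are dropped, and Country is 0/1-encoded):
c1.data = retail.data2 %>%
  filter(cluster == "cluster-1") %>%
  select(-cluster, -CustomerID, -item.DOT)   # DOT is not bought by cluster-1
c1.data$Country = as.numeric(c1.data$Country == "UK")
str(c1.data)   # 2710 obs. of 13 variables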
## Classes 'tbl_df' and 'data.frame': 2710 obs. of 13 variables:
## $ Country : num 0 0 0 0 0 0 0 0 0 0 ...
## $ NumTransactions : int 7 4 2 1 7 1 4 1 2 7 ...
## $ NetRevenue : num 345 360 313 40 344 ...
## $ Customer.Regularity: num 0.4615 0.3077 0.0769 0.0769 0.3077 ...
## $ item.22086 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.22423 : num 15 0 1 0 5 2 22 8 0 27 ...
## $ item.23084 : num 84 0 0 0 0 0 0 0 0 0 ...
## $ item.47566 : num 0 0 0 0 0 0 0 0 0 5 ...
## $ item.79321 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.84879 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.85099B : num 0 0 0 0 0 0 0 0 0 10 ...
## $ item.85123A : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.POST : num 0 9 1 1 7 0 18 0 6 0 ...
set.seed(1)   # for reproducible k-means starts
kmeans.model = kmeans(c1.data, centers = 3, nstart = 50)
wss = kmeans.model$tot.withinss   # total within-cluster sum of squares
bss = kmeans.model$betweenss      # between-cluster sum of squares
table(kmeans.model$cluster)       # sub-cluster sizes
##
## 1 2 3
## 2237 378 95
# PCA using prcomp():
pr.out = prcomp(c1.data, scale = TRUE)
library(rgl)
library(cluster)
distance = dist(c1.data)
plot(silhouette(kmeans.model$cluster, distance))
After k-means clustering within cluster-1:
- cluster-1 (1A): 2237 customers
- cluster-2 (1B): 378 customers
- cluster-3 (1C): 95 customers
We will name these sub-clusters cluster-1A, cluster-1B, and cluster-1C.
Silhouette coefficients for the three sub-clusters are 0.78, 0.34, and 0.21, with the average silhouette coefficient at 0.7. This indicates reasonably good clustering within cluster-1.
The above plot is a 3D plot of the cluster-1 customers on 3 of the original variables as axes, colored by the 3 sub-clusters.
Now we will add the sub-cluster assignments back to the data set so we can use them for exploratory analysis and see how these 5 clusters of customers differ and where they are similar.
##
## cluster-1A cluster-1B cluster-1C
## 2237 378 95
##
## cluster-1A cluster-1B cluster-1C cluster-2 cluster-3
## 2237 378 95 49 7
## Source: local data frame [6 x 16]
##
## Country NumTransactions NetRevenue Customer.Regularity item.22086
## (fctr) (int) (dbl) (dbl) (dbl)
## 1 non-UK 7 345.09 0.46153846 0
## 2 non-UK 4 360.00 0.30769231 0
## 3 non-UK 2 312.75 0.07692308 0
## 4 non-UK 1 40.00 0.07692308 0
## 5 non-UK 7 343.75 0.30769231 0
## 6 non-UK 1 25.50 0.07692308 0
## Variables not shown: item.22423 (dbl), item.23084 (dbl), item.47566 (dbl),
## item.79321 (dbl), item.84879 (dbl), item.85099B (dbl), item.85123A
## (dbl), item.DOT (dbl), item.POST (dbl), cluster (fctr), CustomerID
## (fctr)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2766 obs. of 16 variables:
## $ Country : Factor w/ 2 levels "non-UK","UK": 1 1 1 1 1 1 1 1 1 1 ...
## $ NumTransactions : int 7 4 2 1 7 1 4 1 2 7 ...
## $ NetRevenue : num 345 360 313 40 344 ...
## $ Customer.Regularity: num 0.4615 0.3077 0.0769 0.0769 0.3077 ...
## $ item.22086 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.22423 : num 15 0 1 0 5 2 22 8 0 27 ...
## $ item.23084 : num 84 0 0 0 0 0 0 0 0 0 ...
## $ item.47566 : num 0 0 0 0 0 0 0 0 0 5 ...
## $ item.79321 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.84879 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.85099B : num 0 0 0 0 0 0 0 0 0 10 ...
## $ item.85123A : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.DOT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.POST : num 0 9 1 1 7 0 18 0 6 0 ...
## $ cluster : Factor w/ 5 levels "cluster-1A","cluster-1B",..: 2 2 2 1 2 1 2 1 2 2 ...
## $ CustomerID : Factor w/ 2766 levels "1","10","100",..: 1 1112 1990 2101 2212 2323 2434 2545 2656 2 ...
After k-means clustering within cluster-1:
- cluster-1A: 2237 customers
- cluster-1B: 378 customers
- cluster-1C: 95 customers
- cluster-2: 49 customers
- cluster-3: 7 customers
##
## non-UK UK
## 371 2395
##
## cluster-1A cluster-1B cluster-1C cluster-2 cluster-3
## 2237 378 95 49 7
## cluster NumTransactions
## 1 cluster-1A 2.708091
## 2 cluster-1B 8.489418
## 3 cluster-1C 12.663158
## 4 cluster-2 28.061224
## 5 cluster-3 37.285714
Clearly, all 5 clusters have distinct averages for the number of transactions per customer; this remains a good variable for distinguishing customer segments.
Cluster-1 customers in general have the lowest number of transactions, with sub-cluster-1A lowest at a mean of 2.7.
Cluster-3 customers have the highest number of transactions, with a mean of 37.3.
## cluster NetRevenue
## 1 cluster-1A 134221.40
## 2 cluster-1B 138046.66
## 3 cluster-1C 97220.66
## 4 cluster-2 189030.03
## 5 cluster-3 92946.05
## cluster NetRevenue
## 1 cluster-1A 60.00063
## 2 cluster-1B 365.20280
## 3 cluster-1C 1023.37537
## 4 cluster-2 3857.75571
## 5 cluster-3 13278.00714
All 5 clusters generate distinct total net revenue.
Cluster-1A and cluster-1B customers generate the highest totals (cluster-1 as a whole was highest, as we saw before sub-clustering), while cluster-1C brings in low total net revenue, almost in the same range as cluster-3, which is lowest in the group.
All 5 clusters also have distinct average revenue per customer.
Cluster-3 customers generate average revenue of 13,278, whereas cluster-1 customers generate very low revenue by comparison, with cluster-1A the lowest.
Promotions and coupons could perhaps be used to attract cluster-1 customers.
## cluster Customer.Regularity
## 1 cluster-1A 0.1438396
## 2 cluster-1B 0.3538869
## 3 cluster-1C 0.4680162
## 4 cluster-2 0.6860283
## 5 cluster-3 0.7142857
Cluster-3 has the highest customer regularity at 71.4%, meaning its customers visit the store's website regularly over the year.
Cluster-1 has the lowest visit regularity; combined with this group's low number of transactions and low revenue per customer, promotions and coupons could perhaps attract these customers.
Cluster-2 customers are also regular visitors, at 68% regularity, but still do not generate as much net revenue as cluster-3 customers; promotions and coupons might help here too.
##
## cluster-1A cluster-1B cluster-1C cluster-2 cluster-3
## non-UK 0.644 0.251 0.073 0.030 0.003
## UK 0.834 0.119 0.028 0.016 0.003
##
## cluster-1A cluster-1B cluster-1C cluster-2 cluster-3
## non-UK 0.11 0.25 0.28 0.22 0.14
## UK 0.89 0.75 0.72 0.78 0.86
As shown above, cluster-3 customers buy a high average number of items in almost all 10 item categories, with cluster-1 lowest in average quantity bought per item.
Clusters 1 and 2 do not buy "DOT"; since they account for 99.7% of customers, only the few customers in cluster-3 buy DOT.
Overall, cluster-1 should be the target cluster for marketing, as it accounts for the lowest sales per customer for the store.
Because cluster-1 is the largest cluster, with almost 98% of customers, we further segmented it into 3 sub-clusters.
The company can start by targeting cluster-1B and cluster-1C (17% of total customers), subsets of cluster-1, with a marketing campaign, perhaps coupons and promotions, to boost customers' visits to the store's website and eventually increase revenue.
This gives a good-sized pool of customers with characteristics similar to cluster-1 as a whole, which cannot be targeted directly since it alone holds almost 98% of the store's customers.
Also, using the Apriori algorithm in WEKA, we found that two items, "DOT" and "85099B", have a strong association; they should therefore be combined in deals and discounts when sales of one need a boost.
Based on the clustering achieved with the k-means algorithm and the exploratory analysis of these 5 clusters, below is the customer profile I have come up with for each cluster of this customer base, along with a recommended business strategy.
Cluster-1A
Customer Profile: low-volume buyers; irregular, less-frequent visitors.
Business Goal: along with increasing sales from this group, the main objective should be to attract them to the store's website.
Business Strategy Recommendation:
[1] Aggressive media marketing, especially to attract non-UK customers.
[2] Aggressive promotions / coupons / discounts, etc.
Cluster-1B
Customer Profile: low-volume buyers; regular visitors.
Business Goal: increase sales.
Business Strategy Recommendation: promotions / coupons / discounts, etc.
Cluster-1C
Customer Profile: low-to-medium-volume buyers; regular visitors.
Business Goal: increase sales.
Business Strategy Recommendation: promotions / coupons / discounts, etc.
Cluster-2
Customer Profile: medium-volume buyers; frequent visitors.
Business Goal: customer retention.
Business Strategy Recommendation: Premium Customers' Reward program.
Cluster-3
Customer Profile: high-volume buyers; frequent visitors.
Business Goal: customer retention.
Business Strategy Recommendation: Premium Customers' Reward program.
## InvoiceNo StockCode
## 1 536365 85123A
## 2 536367 84879
## 3 536370 POST
## 4 536371 22086
## 5 536373 85123A
## 6 536375 85123A
## InvoiceNo StockCode
## 567185 : 13 85123A :2392
## 553220 : 12 22423 :2130
## 563444 : 11 85099B :1780
## 553158 : 10 47566 :1758
## 566976 : 10 84879 :1453
## 568193 : 10 POST :1205
## (Other):12040 (Other):1388
## transactions in sparse format with
## 6365 transactions (rows) and
## 10 items (columns)
## ,"22086" ,"22423" ,"23084" ,"47566" ,"79321" ,"84879" ,"85099B"
## ,"22086" 408 51 4 35 23 36 44
## ,"22423" 51 1537 31 208 37 191 127
## ,"23084" 4 31 203 21 10 10 25
## ,"47566" 35 208 21 1254 58 174 163
## ,"79321" 23 37 10 58 377 28 58
## ,"84879" 36 191 10 174 28 1043 87
## ,"85099B" 44 127 25 163 58 87 1270
## ,"85123A" 50 197 13 260 60 226 200
## ,"DOT" 5 3 2 1 1 0 1
## ,"POST" 4 115 42 30 2 30 67
## ,"85123A" ,"DOT" ,"POST"
## ,"22086" 50 5 4
## ,"22423" 197 3 115
## ,"23084" 13 2 42
## ,"47566" 260 1 30
## ,"79321" 60 1 2
## ,"84879" 226 0 30
## ,"85099B" 200 1 67
## ,"85123A" 1623 1 12
## ,"DOT" 1 7 0
## ,"POST" 12 0 870
Before mining any rules, let's look at the data set. We plot the frequency of the items to see how often each is purchased.
The last table above shows which items are purchased together.
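A sketch of the mining call itself (arules package; the thresholds match the parameter specification below, and the transactions object name trans is an assumption):
library(arules)
rules = apriori(trans, parameter = list(support = 0.1, confidence = 0.5))
inspect(rules)   # 0 rules at these thresholds; lowering them reveals associations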
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 636
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[10 item(s), 6365 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Since we are trying to predict revenue (sales) from the customer transaction data, I decided to use the per-customer data set built for clustering in the regression analysis as well.
The idea is that if we know our customers, we can predict our sales: past transaction data gives us the customer segments, and the same data can be fed to a regression analysis to predict sales numbers.
We will fit multiple regression models, compare them, and select the best model that works for us.
During each modelling step we will also use 10-fold cross-validation to estimate the test error.
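A sketch of how the 10-fold cross-validation could be run (cv.glm() from the boot package; object names are assumptions):
library(boot)
set.seed(1)
glm.fit = glm(NetRevenue ~ ., data = lm.data)   # glm() so that cv.glm() can be used
cv.err = cv.glm(lm.data, glm.fit, K = 10)
cv.err$delta[1]   # cross-validated estimate of the test MSE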
Response variable: NetRevenue (net revenue per customer).
Explanatory variables: Country, NumTransactions, Customer.Regularity, and the per-item quantity variables (item.22086 through item.POST), as shown below.
## # A tibble: 6 × 13
## Country NumTransactions NetRevenue Customer.Regularity item.22086
## <chr> <int> <dbl> <dbl> <dbl>
## 1 non-UK 7 345.09 0.46153846 0
## 2 non-UK 4 360.00 0.30769231 0
## 3 non-UK 2 312.75 0.07692308 0
## 4 non-UK 1 40.00 0.07692308 0
## 5 non-UK 7 343.75 0.30769231 0
## 6 non-UK 1 25.50 0.07692308 0
## # ... with 8 more variables: item.22423 <dbl>, item.23084 <dbl>,
## # item.47566 <dbl>, item.79321 <dbl>, item.84879 <dbl>,
## # item.85099B <dbl>, item.85123A <dbl>, item.POST <dbl>
## Classes 'tbl_df', 'tbl' and 'data.frame': 2766 obs. of 13 variables:
## $ Country : chr "non-UK" "non-UK" "non-UK" "non-UK" ...
## $ NumTransactions : int 7 4 2 1 7 1 4 1 2 7 ...
## $ NetRevenue : num 345 360 313 40 344 ...
## $ Customer.Regularity: num 0.4615 0.3077 0.0769 0.0769 0.3077 ...
## $ item.22086 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.22423 : num 15 0 1 0 5 2 22 8 0 27 ...
## $ item.23084 : num 84 0 0 0 0 0 0 0 0 0 ...
## $ item.47566 : num 0 0 0 0 0 0 0 0 0 5 ...
## $ item.79321 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.84879 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.85099B : num 0 0 0 0 0 0 0 0 0 10 ...
## $ item.85123A : num 0 0 0 0 0 0 0 0 0 0 ...
## $ item.POST : num 0 9 1 1 7 0 18 0 6 0 ...
##
## Call:
## lm(formula = NetRevenue ~ ., data = lm.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1974.1 -19.6 -5.2 9.9 12241.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.27830 16.03694 2.761 0.0058 **
## CountryUK -23.79556 16.29474 -1.460 0.1443
## NumTransactions 18.49682 1.12484 16.444 <2e-16 ***
## Customer.Regularity -406.98956 44.70291 -9.104 <2e-16 ***
## item.22086 2.66879 0.12243 21.799 <2e-16 ***
## item.22423 10.69370 0.15137 70.644 <2e-16 ***
## item.23084 1.93556 0.04186 46.241 <2e-16 ***
## item.47566 3.79595 0.16587 22.886 <2e-16 ***
## item.79321 4.23920 0.10958 38.684 <2e-16 ***
## item.84879 1.38080 0.04449 31.034 <2e-16 ***
## item.85099B 1.67171 0.03714 45.006 <2e-16 ***
## item.85123A 2.72643 0.04780 57.037 <2e-16 ***
## item.POST 16.59836 1.12085 14.809 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 248.7 on 2753 degrees of freedom
## Multiple R-squared: 0.9232, Adjusted R-squared: 0.9229
## F-statistic: 2759 on 12 and 2753 DF, p-value: < 2.2e-16
## [1] 81691.43 79906.38 81598.93 80667.46 80283.61
## [1] 223574568
## [1] 284.9759
## [1] 0.9228884
## [1] 38379.68
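The multicollinearity check printed below looks like variance inflation factors; a sketch (car package; the model name lm.fit is an assumption):
library(car)
vif(lm.fit)   # values close to 1, as below, indicate little multicollinearity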
## Country NumTransactions Customer.Regularity
## 1.379386 2.778930 2.634485
## item.22086 item.22423 item.23084
## 1.065652 1.122917 1.165752
## item.47566 item.79321 item.84879
## 1.189944 1.123843 1.063733
## item.85099B item.85123A item.POST
## 1.137178 1.100288 1.641594
## [1] 17
## [1] 2474 13
## [1] 275 13
##
## Call:
## lm(formula = NetRevenue ~ ., data = lm.data.out)
##
## Residuals:
## Min 1Q Median 3Q Max
## -335.43 -5.01 -1.31 2.45 670.18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.67197 2.75238 8.237 2.83e-16 ***
## CountryUK -22.41653 2.78715 -8.043 1.35e-15 ***
## NumTransactions 2.51422 0.36070 6.970 4.05e-12 ***
## Customer.Regularity 7.88592 10.25812 0.769 0.442
## item.22086 2.47595 0.03397 72.883 < 2e-16 ***
## item.22423 10.91616 0.04312 253.152 < 2e-16 ***
## item.23084 1.91245 0.01650 115.874 < 2e-16 ***
## item.47566 4.20074 0.03282 128.001 < 2e-16 ***
## item.79321 4.36537 0.04339 100.616 < 2e-16 ***
## item.84879 1.56312 0.01726 90.566 < 2e-16 ***
## item.85099B 1.71986 0.01008 170.635 < 2e-16 ***
## item.85123A 2.45307 0.01410 173.988 < 2e-16 ***
## item.POST 18.12537 0.21431 84.574 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.14 on 2461 degrees of freedom
## Multiple R-squared: 0.9944, Adjusted R-squared: 0.9943
## F-statistic: 3.616e+04 on 12 and 2461 DF, p-value: < 2.2e-16
## [1] 1855.65 1847.76 1900.39 1926.65 1898.44
## [1] 4665415
## [1] 43.54009
## [1] 0.9943333
## [1] 25180.78
All Subsets Regression:
All subsets regression is implemented with the regsubsets() function from the leaps package. It searches candidate subsets of the predictors and summarizes the best model of each size graphically (by criteria such as adjusted R², Cp, or BIC), which an analyst can use for variable selection. The following commands produce the candidate subsets of variables.
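A sketch of the all-subsets search (object names are assumptions):
library(leaps)
subsets.fit = regsubsets(NetRevenue ~ ., data = lm.data.out, nvmax = 12)
plot(subsets.fit, scale = "adjr2")   # graphical summary of the best subset of each size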
##
## Call:
## lm(formula = NetRevenue ~ . - Country - NumTransactions - Customer.Regularity -
## item.22086, data = lm.data.out)
##
## Residuals:
## Min 1Q Median 3Q Max
## -433.75 -17.48 -13.88 -1.31 1381.41
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.82542 1.55282 11.48 <2e-16 ***
## item.22423 11.22176 0.07786 144.12 <2e-16 ***
## item.23084 1.95996 0.03020 64.89 <2e-16 ***
## item.47566 4.49569 0.05857 76.76 <2e-16 ***
## item.79321 4.45293 0.07964 55.92 <2e-16 ***
## item.84879 1.85061 0.02984 62.01 <2e-16 ***
## item.85099B 1.78191 0.01834 97.18 <2e-16 ***
## item.85123A 2.56718 0.02529 101.52 <2e-16 ***
## item.POST 19.35949 0.30534 63.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 72.21 on 2465 degrees of freedom
## Multiple R-squared: 0.9808, Adjusted R-squared: 0.9807
## F-statistic: 1.572e+04 on 8 and 2465 DF, p-value: < 2.2e-16
## [1] 6050.29 5919.84 5871.04 6052.73 6151.69
## [1] 14866558
## [1] 77.65989
## [1] 0.980712
## [1] 28207.16
## [1] "Country" "NumTransactions" "NetRevenue"
## [4] "Customer.Regularity" "item.22086" "item.22423"
## [7] "item.23084" "item.47566" "item.79321"
## [10] "item.84879" "item.85099B" "item.85123A"
## [13] "item.POST"
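A hedged sketch of how the principal-component data set could be built (the Comp.1 through Comp.12 naming below matches princomp(); object names are assumptions):
x = model.matrix(NetRevenue ~ ., data = lm.data.out)[, -1]   # numeric predictor matrix
pcs = princomp(x)
pca_data = data.frame(NetRevenue = lm.data.out$NetRevenue, pcs$scores)
pcr.fit = lm(NetRevenue ~ ., data = pca_data)
summary(pcr.fit)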
##
## Call:
## lm(formula = NetRevenue ~ ., data = pca_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -335.43 -5.01 -1.31 2.45 670.18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 188.306289 0.786883 239.307 < 2e-16 ***
## Comp.1 -3.646702 0.008011 -455.219 < 2e-16 ***
## Comp.2 -2.792531 0.013520 -206.555 < 2e-16 ***
## Comp.3 -3.690102 0.015378 -239.959 < 2e-16 ***
## Comp.4 0.261864 0.016326 16.040 < 2e-16 ***
## Comp.5 7.130899 0.029704 240.068 < 2e-16 ***
## Comp.6 -0.107689 0.034134 -3.155 0.00163 **
## Comp.7 9.840579 0.042547 231.289 < 2e-16 ***
## Comp.8 -1.458365 0.044507 -32.767 < 2e-16 ***
## Comp.9 16.244846 0.147814 109.901 < 2e-16 ***
## Comp.10 9.889286 0.228376 43.303 < 2e-16 ***
## Comp.11 -21.722135 2.788677 -7.789 9.85e-15 ***
## Comp.12 -7.645879 10.262685 -0.745 0.45633
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.14 on 2461 degrees of freedom
## Multiple R-squared: 0.9944, Adjusted R-squared: 0.9943
## F-statistic: 3.616e+04 on 12 and 2461 DF, p-value: < 2.2e-16
## [1] 1881.99 1855.65 1847.76 1900.39 1926.65
## [1] 4657275
## [1] 43.50209
## [1] 0.9943333
## [1] 25180.78
## Model.No. Model.Name RSE Adj.R2
## 1 Model.1 Least.Squares/full.model 284.97587 0.9228884
## 2 Model.2 Least.Squares/full.model/outliers removed 43.54009 0.9943333
## 3 Model.3 All Subsets Regression/outliers removed 77.65989 0.9807120
## 4 Model.4 PCR/outliers removed 43.50209 0.9943333
## AIC
## 1 38379.68
## 2 25180.78
## 3 28207.16
## 4 25180.78
Before we proceed to model selection, a brief note on the evaluation criteria: RSE and AIC are better when lower, while adjusted R² is better when higher.
Based on these criteria, it looks like model-2 and model-4 are the best models, with good predictive power.
Because least squares is better for interpretability of the predictors in a regression, I will choose model-2 over model-4 (model-4 is a principal component regression).
We will run the final model on the test-set data that we kept aside and never introduced to the model.
Then we will compare the predicted total sales with the actual total sales to see how our model has performed.
dim(lm.data.test) # data of 275 different customers.
## [1] 275 13
Actual.Sales = sum(lm.data.test$NetRevenue)
Actual.Sales
## [1] 44197.4
There are 275 customers in our test set.
Total actual sales from these 275 customers = £44,197.40.
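A sketch of the prediction step (the model name lm.fit.out is an assumption); the interval below appears to be the sum of the per-customer prediction limits, which is a conservative interval for the total:
pred = predict(lm.fit.out, newdata = lm.data.test, interval = "prediction")
data.frame(Lower.Range = sum(pred[, "lwr"]),
           Pred.Sales  = sum(pred[, "fit"]),
           Upper.Range = sum(pred[, "upr"]))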
## Lower.Range Pred.Sales Upper.Range
## 1 22411.38 43555.13 64698.88
Predicted sales from these 275 customers = £43,555.
95% prediction interval for the sales prediction = £22,411 to £64,699.
Total actual sales from these 275 customers = £44,197.