
Reading data

df1 <- read.csv("http://bit.ly/CarreFourDataset")
head(df1)
##    Invoice.ID Branch Customer.type Gender           Product.line Unit.price
## 1 750-67-8428      A        Member Female      Health and beauty      74.69
## 2 226-31-3081      C        Normal Female Electronic accessories      15.28
## 3 631-41-3108      A        Normal   Male     Home and lifestyle      46.33
## 4 123-19-1176      A        Member   Male      Health and beauty      58.22
## 5 373-73-7910      A        Normal   Male      Sports and travel      86.31
## 6 699-14-3026      C        Normal   Male Electronic accessories      85.39
##   Quantity     Tax      Date  Time     Payment   cogs gross.margin.percentage
## 1        7 26.1415  1/5/2019 13:08     Ewallet 522.83                4.761905
## 2        5  3.8200  3/8/2019 10:29        Cash  76.40                4.761905
## 3        7 16.2155  3/3/2019 13:23 Credit card 324.31                4.761905
## 4        8 23.2880 1/27/2019 20:33     Ewallet 465.76                4.761905
## 5        7 30.2085  2/8/2019 10:37     Ewallet 604.17                4.761905
## 6        7 29.8865 3/25/2019 18:30     Ewallet 597.73                4.761905
##   gross.income Rating    Total
## 1      26.1415    9.1 548.9715
## 2       3.8200    9.6  80.2200
## 3      16.2155    7.4 340.5255
## 4      23.2880    8.4 489.0480
## 5      30.2085    5.3 634.3785
## 6      29.8865    4.1 627.6165

Data cleaning

Checking for missing values and duplicates

sum(is.na(df1))
## [1] 0
# There are no missing (NA) values in our dataset
sum(duplicated(df1))
## [1] 0
#there are no duplicates in our data
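
A single grand total hides which column is affected when values are missing. As a minimal sketch, the same checks can be reported per column; the `toy` data frame below is made up for illustration and is not part of the Carrefour data.

```r
# Toy data frame standing in for df1 (values are invented for illustration)
toy <- data.frame(
  Unit.price = c(74.69, NA, 46.33, 74.69),
  Quantity   = c(7, 5, 7, 7),
  Payment    = c("Ewallet", "Cash", NA, "Ewallet")
)

# Missing values counted per column rather than as one total
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows that repeat an earlier row exactly
n_duplicates <- sum(duplicated(toy))
print(n_duplicates)
```

Running `colSums(is.na(df1))` on the real data would give the same kind of per-column breakdown.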

Feature Selection

Getting numerical columns

numcols1 <- df1[c(6:8, 12, 14:16)]
head(numcols1)
##   Unit.price Quantity     Tax   cogs gross.income Rating    Total
## 1      74.69        7 26.1415 522.83      26.1415    9.1 548.9715
## 2      15.28        5  3.8200  76.40       3.8200    9.6  80.2200
## 3      46.33        7 16.2155 324.31      16.2155    7.4 340.5255
## 4      58.22        8 23.2880 465.76      23.2880    8.4 489.0480
## 5      86.31        7 30.2085 604.17      30.2085    5.3 634.3785
## 6      85.39        7 29.8865 597.73      29.8865    4.1 627.6165
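
The index vector above skips column 13, gross.margin.percentage, which is numeric but shows the same value in every row printed by head(). A constant column has zero variance, so cor() would return NA for it. As a hedged sketch, the same selection can be made by type instead of by position; the `toy` data frame and the `numcols_toy` name are illustrative assumptions, not part of the original analysis.

```r
# Toy data frame with one character column and one constant numeric column
toy <- data.frame(
  Branch                  = c("A", "C", "A", "A"),
  Unit.price              = c(74.69, 15.28, 46.33, 58.22),
  Quantity                = c(7, 5, 7, 8),
  gross.margin.percentage = rep(4.761905, 4),
  Rating                  = c(9.1, 9.6, 7.4, 8.4),
  stringsAsFactors        = FALSE
)

# Keep only the numeric columns
is_num   <- vapply(toy, is.numeric, logical(1))
num_only <- toy[is_num]

# Drop columns that hold a single distinct value (zero variance)
is_constant <- vapply(num_only, function(x) length(unique(x)) == 1, logical(1))
numcols_toy <- num_only[!is_constant]

print(names(numcols_toy))
```

Selecting by type keeps the code valid if the column order of the CSV ever changes.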

Filter method

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(corrplot)
## corrplot 0.92 loaded
# Calculating the correlation matrix
# ---
#
correlationMatrix <- cor(numcols1)
# Find attributes that are highly correlated
# ---
#
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.75)

# Highly correlated attributes
# ---
# 
highlyCorrelated
## [1] 4 7 3
names(numcols1[,highlyCorrelated])
## [1] "cogs"  "Total" "Tax"
# cogs, Total, and Tax are flagged as highly correlated (absolute correlation above 0.75)
# We can remove these variables
# and compare the correlation plots before and after, as shown below
# ---
# 
# Removing Redundant Features 
# ---
# 
df_1<-numcols1[-highlyCorrelated]

# Performing our graphical comparison
# ---
# 
par(mfrow = c(1, 2))
corrplot(correlationMatrix, order = "hclust")
corrplot(cor(df_1), order = "hclust")
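
To see which pairs of variables sit above the cutoff, the correlation matrix can be inspected directly in base R. The sketch below uses only the six rows printed by head(numcols1), so its correlations will differ from those of the full dataset. It lists pairs only and is not a reimplementation of findCorrelation, which additionally decides which variable of each pair to drop. The names `rows`, `cm`, and `high_pairs` are illustrative.

```r
# The six rows shown by head(numcols1) above
rows <- data.frame(
  Unit.price   = c(74.69, 15.28, 46.33, 58.22, 86.31, 85.39),
  Quantity     = c(7, 5, 7, 8, 7, 7),
  Tax          = c(26.1415, 3.8200, 16.2155, 23.2880, 30.2085, 29.8865),
  cogs         = c(522.83, 76.40, 324.31, 465.76, 604.17, 597.73),
  gross.income = c(26.1415, 3.8200, 16.2155, 23.2880, 30.2085, 29.8865),
  Rating       = c(9.1, 9.6, 7.4, 8.4, 5.3, 4.1),
  Total        = c(548.9715, 80.2200, 340.5255, 489.0480, 634.3785, 627.6165)
)

cm <- cor(rows)

# Look only above the diagonal so each pair is reported once
idx <- which(abs(cm) > 0.75 & upper.tri(cm), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
print(high_pairs)
```

In these rows Tax is 5% of cogs, gross.income equals Tax, and Total is cogs plus Tax, so those four columns are perfectly correlated with one another.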

Wrapper method

library(clustvarsel)
## Loading required package: mclust
## Package 'mclust' version 5.4.9
## Type 'citation("mclust")' for citing this R package in publications.
## Package 'clustvarsel' version 2.3.4
## Type 'citation("clustvarsel")' for citing this R package in publications.
library(mclust)
# Sequential forward greedy search (default)
#out = clustvarsel(numcols1,G = 1:2)
#out
# The selection algorithm returns the subset of variables
# to use for the clustering model (for example, X1 and X2)
# and rejects the other variables.
# Having identified the variables to use, we proceed to build the clustering model
#Subset1 = numcols1[,out$subset]
#mod = Mclust(Subset1, G = 1:5)
#summary(mod)
#plot(mod,c("classification"))

Embedded method

library(wskm)
## Loading required package: latticeExtra
## 
## Attaching package: 'latticeExtra'
## The following object is masked from 'package:ggplot2':
## 
##     layer
## Loading required package: fpc
library(cluster)
set.seed(5)
model <- ewkm(df1[c(6:8,12,14:16)], 3, lambda=2, maxiter=50)
clusplot(df1[c(6:8,12,14:16)], model$cluster, color=TRUE, shade=FALSE,
         labels=2, lines=2, main='Cluster Analysis for Carrefour')

# Weights are calculated for each variable and cluster.
# They are a measure of the relative importance of each variable 
# with regard to the membership of the observations in that cluster. 
# The weights are incorporated into the distance function, 
# typically reducing the distance for more important variables.
# Weights remain stored in the model and we can check them as follows:
# 
round(model$weights*100,2)
##   Unit.price Quantity Tax cogs gross.income Rating Total
## 1          0        0  50    0           50   0.00     0
## 2          0        0  50    0           50   0.00     0
## 3          0        0   0    0            0  99.99     0
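
The per-cluster weights can be illustrated with a small base R sketch. It assumes the weight update takes the softmax form described for entropy weighted k-means, where each variable's weight in a cluster is proportional to exp(-dispersion / lambda) and the weights in a cluster sum to 1. The internals of wskm are not shown here and may differ in detail. The data, the fixed cluster assignment, and the `entropy_weights` helper are all hypothetical.

```r
set.seed(5)

# Two synthetic variables: one tightly grouped, one widely spread
x <- cbind(
  tight = rnorm(20, mean = 0, sd = 0.1),
  loose = rnorm(20, mean = 0, sd = 2)
)
cluster <- rep(1:2, each = 10)  # hypothetical fixed cluster assignment
lambda  <- 2

# Assumed update: weight proportional to exp(-within-cluster dispersion / lambda)
entropy_weights <- function(x, cluster, lambda) {
  t(sapply(sort(unique(cluster)), function(k) {
    xk         <- x[cluster == k, , drop = FALSE]
    centre     <- colMeans(xk)
    dispersion <- colSums(sweep(xk, 2, centre)^2)
    # Subtracting the minimum avoids underflow and leaves the ratios unchanged
    w <- exp(-(dispersion - min(dispersion)) / lambda)
    w / sum(w)
  }))
}

w <- entropy_weights(x, cluster, lambda)
print(round(w * 100, 2))
```

Under this assumed form, a variable with small within-cluster spread receives most of the weight, and each row of the weight table sums to 100 after scaling, which matches the layout of the table printed above.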

Conclusion

  1. According to the filter method, cogs, Total, and Tax are the most highly correlated columns (cutoff 0.75).

  2. According to the wrapper method, the important columns to use when clustering are Unit.price and Quantity.

  3. According to the embedded method, Tax and gross.income carry the weight in clusters 1 and 2, while Rating dominates cluster 3.