# Load the Carrefour sales dataset
df1 <- read.csv("http://bit.ly/CarreFourDataset")
head(df1)
## Invoice.ID Branch Customer.type Gender Product.line Unit.price
## 1 750-67-8428 A Member Female Health and beauty 74.69
## 2 226-31-3081 C Normal Female Electronic accessories 15.28
## 3 631-41-3108 A Normal Male Home and lifestyle 46.33
## 4 123-19-1176 A Member Male Health and beauty 58.22
## 5 373-73-7910 A Normal Male Sports and travel 86.31
## 6 699-14-3026 C Normal Male Electronic accessories 85.39
## Quantity Tax Date Time Payment cogs gross.margin.percentage
## 1 7 26.1415 1/5/2019 13:08 Ewallet 522.83 4.761905
## 2 5 3.8200 3/8/2019 10:29 Cash 76.40 4.761905
## 3 7 16.2155 3/3/2019 13:23 Credit card 324.31 4.761905
## 4 8 23.2880 1/27/2019 20:33 Ewallet 465.76 4.761905
## 5 7 30.2085 2/8/2019 10:37 Ewallet 604.17 4.761905
## 6 7 29.8865 3/25/2019 18:30 Ewallet 597.73 4.761905
## gross.income Rating Total
## 1 26.1415 9.1 548.9715
## 2 3.8200 9.6 80.2200
## 3 16.2155 7.4 340.5255
## 4 23.2880 8.4 489.0480
## 5 30.2085 5.3 634.3785
## 6 29.8865 4.1 627.6165
sum(is.na(df1))
## [1] 0
# There are no missing values in our dataset
sum(duplicated(df1))
## [1] 0
# There are no duplicated rows in our data
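The two checks above return a single total each. When a total is not zero, a per-column breakdown is usually more helpful. The following is a minimal sketch on a hypothetical toy data frame (`toy` and its columns are made up for illustration and are not part of the Carrefour data):

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(
  id    = c("a", "b", "b", "c"),
  price = c(10.5, 3.2, 3.2, NA),
  qty   = c(1L, 2L, 2L, NA)
)

# Missing values counted per column instead of one grand total
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows that exactly repeat an earlier row
dup_rows <- toy[duplicated(toy), , drop = FALSE]
print(dup_rows)
```

Applied to `df1`, both results would be expected to be empty or all zeros, given the totals reported above.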
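# Keep the numeric columns used for feature selection:
# Unit.price, Quantity, Tax, cogs, gross.income, Rating and Total
numcols1 <- df1[c(6:8, 12, 14:16)]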
head(numcols1)
## Unit.price Quantity Tax cogs gross.income Rating Total
## 1 74.69 7 26.1415 522.83 26.1415 9.1 548.9715
## 2 15.28 5 3.8200 76.40 3.8200 9.6 80.2200
## 3 46.33 7 16.2155 324.31 16.2155 7.4 340.5255
## 4 58.22 8 23.2880 465.76 23.2880 8.4 489.0480
## 5 86.31 7 30.2085 604.17 30.2085 5.3 634.3785
## 6 85.39 7 29.8865 597.73 29.8865 4.1 627.6165
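The index vector `c(6:8,12,14:16)` skips column 13, `gross.margin.percentage`, which shows the same value (4.761905) in every row printed by `head(df1)`. A constant column has zero variance, so `cor()` would return `NA` for it. As a hedged alternative to hard-coded positions, numeric columns can be selected by type and constant ones dropped. The sketch below uses a hypothetical toy data frame; the name `margin.pct` is a stand-in, not a column of `df1`:

```r
# Hypothetical toy data frame mimicking the mix of column types in df1
toy <- data.frame(
  Branch     = c("A", "C", "A", "A"),
  Unit.price = c(74.69, 15.28, 46.33, 58.22),
  Quantity   = c(7L, 5L, 7L, 8L),
  margin.pct = c(4.761905, 4.761905, 4.761905, 4.761905)  # constant column
)

# Keep only the numeric columns
is_num <- vapply(toy, is.numeric, logical(1))
num_only <- toy[, is_num, drop = FALSE]

# Drop columns that hold a single distinct value
varies <- vapply(num_only, function(x) length(unique(x)) > 1, logical(1))
num_cols <- num_only[, varies, drop = FALSE]
print(names(num_cols))
```

Selecting by type keeps working if the column order of the CSV changes, which positional indexing does not.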
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(corrplot)
## corrplot 0.92 loaded
# Calculating the correlation matrix
# ---
#
correlationMatrix <- cor(numcols1)
# Find attributes that are highly correlated
# ---
#
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.75)
# Highly correlated attributes
# ---
#
highlyCorrelated
## [1] 4 7 3
names(numcols1[,highlyCorrelated])
## [1] "cogs" "Total" "Tax"
# findCorrelation() flags cogs, Total and Tax: each is correlated with
# another column above the 0.75 cutoff.
# We can remove the flagged variables
# and compare the results graphically, as shown below
# ---
#
# Removing Redundant Features
# ---
#
df_1<-numcols1[-highlyCorrelated]
# Performing our graphical comparison
# ---
#
par(mfrow = c(1, 2))
corrplot(correlationMatrix, order = "hclust")
corrplot(cor(df_1), order = "hclust")
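The rows printed by `head(numcols1)` suggest why these columns are flagged: in those rows `Tax` is 5% of `cogs`, `gross.income` equals `Tax`, and `Total` is `cogs + Tax`. The sketch below retypes those six rows by hand (the name `six` is made up for illustration) and checks the relationships using base R only. Six rows are far too few to estimate the other correlations, so the pair listing at the end only illustrates the technique:

```r
# The six rows printed by head(numcols1), retyped by hand
six <- data.frame(
  Unit.price   = c(74.69, 15.28, 46.33, 58.22, 86.31, 85.39),
  Quantity     = c(7, 5, 7, 8, 7, 7),
  Tax          = c(26.1415, 3.8200, 16.2155, 23.2880, 30.2085, 29.8865),
  cogs         = c(522.83, 76.40, 324.31, 465.76, 604.17, 597.73),
  gross.income = c(26.1415, 3.8200, 16.2155, 23.2880, 30.2085, 29.8865),
  Rating       = c(9.1, 9.6, 7.4, 8.4, 5.3, 4.1),
  Total        = c(548.9715, 80.2200, 340.5255, 489.0480, 634.3785, 627.6165)
)

# Exact linear relationships visible in these rows
tax_is_5pct  <- isTRUE(all.equal(six$Tax, 0.05 * six$cogs))
total_is_sum <- isTRUE(all.equal(six$Total, six$cogs + six$Tax))
print(c(tax_is_5pct = tax_is_5pct, total_is_sum = total_is_sum))

# List every pair of columns whose absolute correlation exceeds the cutoff
cm  <- cor(six)
idx <- which(abs(cm) > 0.75 & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = round(cm[idx], 3)
)
print(high_pairs)
```

Because the four money columns are exact linear functions of one another, keeping any one of them preserves the same information, which is consistent with `findCorrelation()` flagging three of the four.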
library(clustvarsel)
## Loading required package: mclust
## Package 'mclust' version 5.4.9
## Type 'citation("mclust")' for citing this R package in publications.
## Package 'clustvarsel' version 2.3.4
## Type 'citation("clustvarsel")' for citing this R package in publications.
library(mclust)
# Sequential forward greedy search (default)
#out = clustvarsel(numcols1,G = 1:2)
#out
# The selection algorithm would indicate that the subset
# we use for the clustering model is composed of variables X1 and X2
# and that other variables should be rejected.
# Having identified the variables that we use, we proceed to build the clustering model
#Subset1 = numcols1[,out$subset]
#mod = Mclust(Subset1, G = 1:5)
#summary(mod)
#plot(mod,c("classification"))
library(wskm)
## Loading required package: latticeExtra
##
## Attaching package: 'latticeExtra'
## The following object is masked from 'package:ggplot2':
##
## layer
## Loading required package: fpc
library(cluster)
set.seed(5)
model <- ewkm(df1[c(6:8,12,14:16)], 3, lambda=2, maxiter=50)
clusplot(df1[c(6:8,12,14:16)], model$cluster, color=TRUE, shade=FALSE,
         labels=2, lines=2, main='Cluster Analysis for Carrefour')
# Weights are calculated for each variable and cluster.
# They are a measure of the relative importance of each variable
# with regard to the membership of the observations in that cluster.
# The weights are incorporated into the distance function,
# typically reducing the distance for more important variables.
# Weights remain stored in the model and we can check them as follows:
#
round(model$weights*100,2)
## Unit.price Quantity Tax cogs gross.income Rating Total
## 1 0 0 50 0 50 0.00 0
## 2 0 0 50 0 50 0.00 0
## 3 0 0 0 0 0 99.99 0
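To read the weight table more easily, the dominant variables can be listed per cluster. The sketch below retypes the percentages printed above into a matrix (the names `w`, `vars` and `dominant` and the 1% threshold are illustrative choices, not part of the `wskm` package):

```r
# Weights in percent, retyped from the round(model$weights*100, 2) output
vars <- c("Unit.price", "Quantity", "Tax", "cogs",
          "gross.income", "Rating", "Total")
w <- rbind(
  c(0, 0, 50, 0, 50,  0.00, 0),
  c(0, 0, 50, 0, 50,  0.00, 0),
  c(0, 0,  0, 0,  0, 99.99, 0)
)
dimnames(w) <- list(as.character(1:3), vars)

# Variables holding more than 1% of the weight in each cluster
# (the 1% threshold is an arbitrary choice for illustration)
dominant <- lapply(seq_len(nrow(w)), function(k) vars[w[k, ] > 1])
names(dominant) <- paste0("cluster_", seq_len(nrow(w)))
print(dominant)

# Each cluster's weights add up to roughly 100%
print(rowSums(w))
```

On the fitted model itself, the same idea would apply to `model$weights` directly instead of a retyped matrix.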
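# Conclusion
The correlation filter flags cogs, Total and Tax as highly correlated with other columns (cutoff 0.75), so they are candidates for removal.
According to the wrapper method, the important columns to use when clustering are Unit.price and Quantity.
According to the embedded method, Tax and gross.income carry all of the weight in clusters 1 and 2 (50% each), while Rating carries almost all of the weight (99.99%) in cluster 3.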