Kira Plastinina is a Russian brand that was sold through a now-defunct chain of retail stores in Russia, Ukraine, Kazakhstan, Belarus, China, the Philippines, and Armenia. The brand's Sales and Marketing team would like to understand their customers' behavior from data they have collected over the past year. More specifically, they would like to learn the characteristics of customer groups.
This project will be considered a success once we have thoroughly cleaned our data, performed both univariate and bivariate analysis, and provided summaries of our dataset. We will also perform clustering on the dataset to learn the characteristics of the customer groups.
The dataset that we will be using is a Russian e-commerce dataset.
The following steps will be followed in conducting this study:
Define the question, the metric for success, the context, and the experimental design.
Read and explore the given dataset. Define the appropriateness of the available data to answer the given question.
Find and deal with outliers, anomalies, and missing data within the dataset.
Perform univariate and bivariate analysis and record our observations.
From our insights, provide a conclusion and recommendations.
Perform clustering on our data.
http://bit.ly/EcommerceCustomersDataset
ecom <- read.csv("http://bit.ly/EcommerceCustomersDataset")
# Previewing our top entries
head(ecom)
## Administrative Administrative_Duration Informational Informational_Duration
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 -1 0 -1
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues
## 1 1 0.000000 0.20000000 0.2000000 0
## 2 2 64.000000 0.00000000 0.1000000 0
## 3 1 -1.000000 0.20000000 0.2000000 0
## 4 2 2.666667 0.05000000 0.1400000 0
## 5 10 627.500000 0.02000000 0.0500000 0
## 6 19 154.216667 0.01578947 0.0245614 0
## SpecialDay Month OperatingSystems Browser Region TrafficType
## 1 0 Feb 1 1 1 1
## 2 0 Feb 2 2 1 2
## 3 0 Feb 4 1 9 3
## 4 0 Feb 3 2 2 4
## 5 0 Feb 3 3 1 4
## 6 0 Feb 2 2 1 3
## VisitorType Weekend Revenue
## 1 Returning_Visitor FALSE FALSE
## 2 Returning_Visitor FALSE FALSE
## 3 Returning_Visitor FALSE FALSE
## 4 Returning_Visitor FALSE FALSE
## 5 Returning_Visitor TRUE FALSE
## 6 Returning_Visitor FALSE FALSE
# Checking the structure and data types of our columns
str(ecom)
## 'data.frame': 12330 obs. of 18 variables:
## $ Administrative : int 0 0 0 0 0 0 0 1 0 0 ...
## $ Administrative_Duration: num 0 0 -1 0 0 0 -1 -1 0 0 ...
## $ Informational : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Informational_Duration : num 0 0 -1 0 0 0 -1 -1 0 0 ...
## $ ProductRelated : int 1 2 1 2 10 19 1 1 2 3 ...
## $ ProductRelated_Duration: num 0 64 -1 2.67 627.5 ...
## $ BounceRates : num 0.2 0 0.2 0.05 0.02 ...
## $ ExitRates : num 0.2 0.1 0.2 0.14 0.05 ...
## $ PageValues : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SpecialDay : num 0 0 0 0 0 0 0.4 0 0.8 0.4 ...
## $ Month : chr "Feb" "Feb" "Feb" "Feb" ...
## $ OperatingSystems : int 1 2 4 3 3 2 2 1 2 2 ...
## $ Browser : int 1 2 1 2 3 2 4 2 2 4 ...
## $ Region : int 1 1 9 2 1 1 3 1 2 1 ...
## $ TrafficType : int 1 2 3 4 4 3 3 5 3 2 ...
## $ VisitorType : chr "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" ...
## $ Weekend : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
## $ Revenue : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
# Checking the dimensions of our dataset
dim(ecom)
## [1] 12330 18
# Confirming that our dataset is a data frame
class(ecom)
## [1] "data.frame"
sum(is.na(ecom))
## [1] 112
# There are 112 missing values in the dataset
# Checking where the missing values fall, column by column
colSums(is.na(ecom))
## Administrative Administrative_Duration Informational
## 14 14 14
## Informational_Duration ProductRelated ProductRelated_Duration
## 14 14 14
## BounceRates ExitRates PageValues
## 14 14 0
## SpecialDay Month OperatingSystems
## 0 0 0
## Browser Region TrafficType
## 0 0 0
## VisitorType Weekend Revenue
## 0 0 0
# Dropping rows with missing values
ecom <- na.omit(ecom)
# Removing duplicated rows
ecom <- ecom[!duplicated(ecom), ]
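# A minimal sanity check to confirm that no missing values or duplicated rows remain
sum(is.na(ecom))       # expected to be 0 after na.omit()
sum(duplicated(ecom))  # expected to be 0 after dropping duplicates
dim(ecom)              # rows and columns remaining after cleaning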
boxplot(ecom$Administrative, main= 'Boxplot of administrative web pages',col="blue")
boxplot(ecom$Administrative_Duration, main= 'Boxplot of administrative web pages duration', col="red")
boxplot(ecom$Informational, main= 'Boxplot of informational web pages', col="green")
boxplot(ecom$Informational_Duration, main= 'Boxplot of informational web pages duration',col="purple")
boxplot(ecom$ProductRelated, main= 'Boxplot of product related web pages', col="grey")
boxplot(ecom$ProductRelated_Duration, main= 'Boxplot of product related web pages duration', col="yellow")
boxplot(ecom$BounceRates, main= 'Boxplot of bounce rates', col="black")
boxplot(ecom$ExitRates, main= 'Boxplot of exit rates', col="orange")
boxplot(ecom$PageValues, main= 'Boxplot of page values', col="green")
boxplot(ecom$SpecialDay, main= 'Boxplot of special day', col="blue")
# There are outliers in most of these columns, but we will keep them: dropping them would
# discard a large share of the sessions, and they may be useful signal for our modelling.
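# As a rough, optional sketch of how many points the usual 1.5 * IQR rule would flag as
# outliers (the 1.5 cut-off is an assumption, not something tuned for this dataset):
iqr_outlier_share <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * IQR(x, na.rm = TRUE)
  mean(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}
# e.g. the share of flagged sessions in the product-related duration column
iqr_outlier_share(ecom$ProductRelated_Duration)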
## EDA
# We first select the numerical columns
numcols <- ecom[ , 1:10]
summary(ecom)
## Administrative Administrative_Duration Informational
## Min. : 0.00 Min. : -1.00 Min. : 0.0000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.0000
## Median : 1.00 Median : 9.00 Median : 0.0000
## Mean : 2.34 Mean : 81.68 Mean : 0.5088
## 3rd Qu.: 4.00 3rd Qu.: 94.75 3rd Qu.: 0.0000
## Max. :27.00 Max. :3398.75 Max. :24.0000
## Informational_Duration ProductRelated ProductRelated_Duration
## Min. : -1.00 Min. : 0.00 Min. : -1.0
## 1st Qu.: 0.00 1st Qu.: 8.00 1st Qu.: 193.6
## Median : 0.00 Median : 18.00 Median : 609.5
## Mean : 34.84 Mean : 32.06 Mean : 1207.5
## 3rd Qu.: 0.00 3rd Qu.: 38.00 3rd Qu.: 1477.6
## Max. :2549.38 Max. :705.00 Max. :63973.5
## BounceRates ExitRates PageValues SpecialDay
## Min. :0.00000 Min. :0.00000 Min. : 0.000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.01422 1st Qu.: 0.000 1st Qu.:0.00000
## Median :0.00293 Median :0.02500 Median : 0.000 Median :0.00000
## Mean :0.02045 Mean :0.04150 Mean : 5.952 Mean :0.06197
## 3rd Qu.:0.01667 3rd Qu.:0.04848 3rd Qu.: 0.000 3rd Qu.:0.00000
## Max. :0.20000 Max. :0.20000 Max. :361.764 Max. :1.00000
## Month OperatingSystems Browser Region
## Length:12199 Min. :1.000 Min. : 1.000 Min. :1.000
## Class :character 1st Qu.:2.000 1st Qu.: 2.000 1st Qu.:1.000
## Mode :character Median :2.000 Median : 2.000 Median :3.000
## Mean :2.124 Mean : 2.358 Mean :3.153
## 3rd Qu.:3.000 3rd Qu.: 2.000 3rd Qu.:4.000
## Max. :8.000 Max. :13.000 Max. :9.000
## TrafficType VisitorType Weekend Revenue
## Min. : 1.000 Length:12199 Mode :logical Mode :logical
## 1st Qu.: 2.000 Class :character FALSE:9343 FALSE:10291
## Median : 2.000 Mode :character TRUE :2856 TRUE :1908
## Mean : 4.075
## 3rd Qu.: 4.000
## Max. :20.000
# The summary above gives the mean, quartiles, median, maximum and minimum of each column
# Plotting a histogram for each numerical column
invisible(lapply(names(numcols), function(n) hist(numcols[[n]])))
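# An optional variant of the same histograms (a sketch using the numcols defined above)
# with clearer titles and axis labels:
invisible(lapply(names(numcols), function(n)
  hist(numcols[[n]], main = paste("Histogram of", n), xlab = n, col = "steelblue")))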
# Selecting the categorical columns
catcols <- ecom[ , 11:18]
library(epiDisplay)
## Loading required package: foreign
## Loading required package: survival
## Loading required package: MASS
## Loading required package: nnet
dev.new(width = 10, height = 5, unit = "in")
# Frequency tables and bar charts for each categorical column
# (using colnames(catcols)[x] rather than colnames(catcols[, x]), which would return NULL for a single column)
lapply(1:ncol(catcols), function(x) tab1(catcols[, x], main = colnames(catcols)[x]))
## [[1]]
## catcols[, x] :
## Frequency Percent Cum. percent
## Aug 433 3.5 3.5
## Dec 1706 14.0 17.5
## Feb 182 1.5 19.0
## Jul 432 3.5 22.6
## June 285 2.3 24.9
## Mar 1853 15.2 40.1
## May 3328 27.3 67.4
## Nov 2983 24.5 91.8
## Oct 549 4.5 96.3
## Sep 448 3.7 100.0
## Total 12199 100.0 100.0
##
## [[2]]
## catcols[, x] :
## Frequency Percent Cum. percent
## 1 2548 20.9 20.9
## 2 6536 53.6 74.5
## 3 2530 20.7 95.2
## 4 478 3.9 99.1
## 5 6 0.0 99.2
## 6 19 0.2 99.3
## 7 7 0.1 99.4
## 8 75 0.6 100.0
## Total 12199 100.0 100.0
##
## [[3]]
## catcols[, x] :
## Frequency Percent Cum. percent
## 1 2426 19.9 19.9
## 2 7878 64.6 84.5
## 3 105 0.9 85.3
## 4 730 6.0 91.3
## 5 466 3.8 95.1
## 6 174 1.4 96.6
## 7 49 0.4 97.0
## 8 135 1.1 98.1
## 9 1 0.0 98.1
## 10 163 1.3 99.4
## 11 6 0.0 99.5
## 12 10 0.1 99.5
## 13 56 0.5 100.0
## Total 12199 100.0 100.0
##
## [[4]]
## catcols[, x] :
## Frequency Percent Cum. percent
## 1 4711 38.6 38.6
## 2 1127 9.2 47.9
## 3 2382 19.5 67.4
## 4 1168 9.6 77.0
## 5 317 2.6 79.6
## 6 800 6.6 86.1
## 7 758 6.2 92.3
## 8 431 3.5 95.9
## 9 505 4.1 100.0
## Total 12199 100.0 100.0
##
## [[5]]
## catcols[, x] :
## Frequency Percent Cum. percent
## 1 2383 19.5 19.5
## 2 3907 32.0 51.6
## 3 2017 16.5 68.1
## 4 1066 8.7 76.8
## 5 260 2.1 79.0
## 6 443 3.6 82.6
## 7 40 0.3 82.9
## 8 343 2.8 85.7
## 9 41 0.3 86.1
## 10 450 3.7 89.8
## 11 247 2.0 91.8
## 12 1 0.0 91.8
## 13 728 6.0 97.8
## 14 13 0.1 97.9
## 15 36 0.3 98.2
## 16 3 0.0 98.2
## 17 1 0.0 98.2
## 18 10 0.1 98.3
## 19 17 0.1 98.4
## 20 193 1.6 100.0
## Total 12199 100.0 100.0
##
## [[6]]
## catcols[, x] :
## Frequency Percent Cum. percent
## New_Visitor 1693 13.9 13.9
## Other 81 0.7 14.5
## Returning_Visitor 10425 85.5 100.0
## Total 12199 100.0 100.0
##
## [[7]]
## catcols[, x] :
## Frequency Percent Cum. percent
## FALSE 9343 76.6 76.6
## TRUE 2856 23.4 100.0
## Total 12199 100.0 100.0
##
## [[8]]
## catcols[, x] :
## Frequency Percent Cum. percent
## FALSE 10291 84.4 84.4
## TRUE 1908 15.6 100.0
## Total 12199 100.0 100.0
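# A package-free alternative sketch for the same information: percentage tables for each
# categorical column using only base R
lapply(catcols, function(col) round(100 * prop.table(table(col)), 1))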
# Correlation matrix of the numerical columns
cor_matrix <- cor(numcols, use = "complete.obs")
cor_matrix
## Administrative Administrative_Duration Informational
## Administrative 1.00000000 0.60040965 0.37528761
## Administrative_Duration 0.60040965 1.00000000 0.30143630
## Informational 0.37528761 0.30143630 1.00000000
## Informational_Duration 0.25478602 0.23718986 0.61867795
## ProductRelated 0.42819151 0.28678391 0.37260472
## ProductRelated_Duration 0.37102722 0.35351379 0.38608372
## BounceRates -0.21366664 -0.13733340 -0.10950530
## ExitRates -0.31127413 -0.20202445 -0.15956681
## PageValues 0.09692097 0.06616837 0.04739015
## SpecialDay -0.09707210 -0.07473689 -0.04937677
## Informational_Duration ProductRelated
## Administrative 0.25478602 0.42819151
## Administrative_Duration 0.23718986 0.28678391
## Informational 0.61867795 0.37260472
## Informational_Duration 1.00000000 0.27906195
## ProductRelated 0.27906195 1.00000000
## ProductRelated_Duration 0.34658069 0.86030819
## BounceRates -0.07015947 -0.19351577
## ExitRates -0.10293268 -0.28616321
## PageValues 0.03006416 0.05411549
## SpecialDay -0.03129304 -0.02593062
## ProductRelated_Duration BounceRates ExitRates
## Administrative 0.37102722 -0.21366664 -0.3112741
## Administrative_Duration 0.35351379 -0.13733340 -0.2020245
## Informational 0.38608372 -0.10950530 -0.1595668
## Informational_Duration 0.34658069 -0.07015947 -0.1029327
## ProductRelated 0.86030819 -0.19351577 -0.2861632
## ProductRelated_Duration 1.00000000 -0.17437550 -0.2453340
## BounceRates -0.17437550 1.00000000 0.9033582
## ExitRates -0.24533401 0.90335819 1.0000000
## PageValues 0.05084062 -0.11599198 -0.1735715
## SpecialDay -0.03821065 0.08783999 0.1167838
## PageValues SpecialDay
## Administrative 0.09692097 -0.09707210
## Administrative_Duration 0.06616837 -0.07473689
## Informational 0.04739015 -0.04937677
## Informational_Duration 0.03006416 -0.03129304
## ProductRelated 0.05411549 -0.02593062
## ProductRelated_Duration 0.05084062 -0.03821065
## BounceRates -0.11599198 0.08783999
## ExitRates -0.17357154 0.11678376
## PageValues 1.00000000 -0.06453271
## SpecialDay -0.06453271 1.00000000
# Obtaining a heatmap of the correlation matrix
corr_heatmap <- heatmap(cor_matrix, Rowv = NA, Colv = NA, col = cm.colors(256), scale = "column", margins = c(5, 10))
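# An optional sketch of a more readable correlation plot, assuming the corrplot package is
# installed (it is not loaded above, so this is left commented out):
# library(corrplot)
# corrplot(cor_matrix, method = "color", type = "upper", tl.cex = 0.7)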
# Taking a random sample of 1,000 rows to keep the clustering computations manageable
smple <- ecom[sample(nrow(ecom), 1000), ]
smple_matrix <- data.matrix(smple)
# Elbow method: total within-cluster sum of squares (WSS) for k = 1 to 15
wss <- (nrow(smple_matrix) - 1) * sum(apply(smple_matrix, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(smple_matrix, centers = i)$withinss)
plot(1:15, wss, type = "b", xlab = "Number of Clusters", ylab = "Within Sum of Squares")
# We can see a sharp drop in the WSS as the number of clusters increases to around 3, after which the curve flattens
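# A hedged aside: the duration columns are on much larger scales than the rate columns, so
# unscaled k-means distances are dominated by them. A sketch of the same elbow search on
# standardised data (dropping any constant columns before scaling):
num_smple <- data.matrix(smple)
num_smple <- num_smple[, apply(num_smple, 2, sd) > 0]
scaled_matrix <- scale(num_smple)
wss_scaled <- sapply(1:15, function(k) kmeans(scaled_matrix, centers = k, nstart = 10)$tot.withinss)
plot(1:15, wss_scaled, type = "b", xlab = "Number of Clusters", ylab = "Within Sum of Squares (scaled)")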
# Fitting k-means with 3 clusters and plotting the result
cl <- kmeans(smple_matrix, 3, nstart = 25)
plot(smple_matrix, col = (cl$cluster + 1), main = "k-means result with 3 clusters", pch = 1, cex = 1, las = 1)
points(cl$centers, col = "black", pch = 17, cex = 2)
library(stats)
library(cluster)
# Dissimilarity matrix
d <- dist(smple_matrix, method = "euclidean")
# Hierarchical clustering using Complete Linkage
hc1 <- hclust(d, method = "complete" )
# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1)
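# An optional follow-up sketch: cutting the dendrogram into 3 groups (matching the k chosen
# for k-means above) and highlighting them on the tree just plotted
grp <- cutree(hc1, k = 3)
table(grp)
rect.hclust(hc1, k = 3, border = 2:4)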
# Compute with agnes
hc2 <- agnes(smple_matrix, method = "complete")
# Agglomerative coefficient
hc2$ac
## [1] 0.9955162
# Comparing different hierarchical (agnes) linkage methods to find which gives the strongest
# clustering structure, i.e. the highest agglomerative coefficient.
# Methods to assess:
#m <- c("average", "single", "complete", "ward")
#names(m) <- m
# Function to compute the agglomerative coefficient for a given method
#ac <- function(x) {
#  agnes(smple_matrix, method = x)$ac
#}
#purrr::map_dbl(m, ac)
# The Ward method gives the highest coefficient, so it is the strongest of them all.
# Visualising the clustering with the Ward method:
#hc3 <- agnes(smple_matrix, method = "ward")
#pltree(hc3, cex = 0.6, hang = -1, main = "Dendrogram of agnes")
# We can see that this model performs reasonably well.
We are able to see that hierarchical clustering with the Ward method offers better clustering compared to k-means clustering.
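As a rough, optional check on this comparison (a sketch only, using hclust with ward.D2 linkage as a stand-in for the agnes Ward method, and average silhouette width as the yardstick), the k-means labels and the Ward-linkage labels obtained on the same sample can be scored against the same distance matrix with the cluster package loaded above:
# silhouette of the k-means solution fitted earlier on smple_matrix
sil_km <- silhouette(cl$cluster, d)
# Ward-linkage hierarchical clustering cut into the same number of clusters
hc_ward <- hclust(d, method = "ward.D2")
sil_hc <- silhouette(cutree(hc_ward, k = 3), d)
# a higher mean silhouette width indicates better-separated clusters
mean(sil_km[, "sil_width"])
mean(sil_hc[, "sil_width"])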