1 Defining the Data Question


1.0.1 a) Specifying the Question

Kira Plastinina is a Russian brand that was sold through a now-defunct chain of retail stores in Russia, Ukraine, Kazakhstan, Belarus, China, the Philippines, and Armenia. The brand’s Sales and Marketing team would like to understand their customers’ behavior from data that they have collected over the past year. More specifically, they would like to learn the characteristics of the customer groups.

1.0.2 b) Defining the Metric for Success

This project will be considered a success once we have thoroughly cleaned our data, performed both univariate and bivariate analysis, and provided summaries of our dataset. We will also perform clustering on our dataset to learn the characteristics of the customer groups.

1.0.3 c) Understanding the context

The dataset we will be using is a Russian e-commerce dataset.

1.0.4 d) Recording the experimental design

The following steps will be followed in conducting this study:

  • Define the question, the metric for success, the context, and the experimental design.

  • Read and explore the given dataset, and assess the appropriateness of the available data for answering the question.

  • Find and deal with outliers, anomalies, and missing data within the dataset.

  • Perform univariate and bivariate analysis and record our observations.

  • Perform clustering on our data.

  • From our insights, provide a conclusion and recommendations.

1.1 Reading our data

# loading the dataset from the provided URL
ecom <- read.csv("http://bit.ly/EcommerceCustomersDataset")

1.1.1 Checking the data

# previewing the first entries of our dataset
head(ecom)
##   Administrative Administrative_Duration Informational Informational_Duration
## 1              0                       0             0                      0
## 2              0                       0             0                      0
## 3              0                      -1             0                     -1
## 4              0                       0             0                      0
## 5              0                       0             0                      0
## 6              0                       0             0                      0
##   ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues
## 1              1                0.000000  0.20000000 0.2000000          0
## 2              2               64.000000  0.00000000 0.1000000          0
## 3              1               -1.000000  0.20000000 0.2000000          0
## 4              2                2.666667  0.05000000 0.1400000          0
## 5             10              627.500000  0.02000000 0.0500000          0
## 6             19              154.216667  0.01578947 0.0245614          0
##   SpecialDay Month OperatingSystems Browser Region TrafficType
## 1          0   Feb                1       1      1           1
## 2          0   Feb                2       2      1           2
## 3          0   Feb                4       1      9           3
## 4          0   Feb                3       2      2           4
## 5          0   Feb                3       3      1           4
## 6          0   Feb                2       2      1           3
##         VisitorType Weekend Revenue
## 1 Returning_Visitor   FALSE   FALSE
## 2 Returning_Visitor   FALSE   FALSE
## 3 Returning_Visitor   FALSE   FALSE
## 4 Returning_Visitor   FALSE   FALSE
## 5 Returning_Visitor    TRUE   FALSE
## 6 Returning_Visitor   FALSE   FALSE
# checking data composition
str(ecom)
## 'data.frame':    12330 obs. of  18 variables:
##  $ Administrative         : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Administrative_Duration: num  0 0 -1 0 0 0 -1 -1 0 0 ...
##  $ Informational          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational_Duration : num  0 0 -1 0 0 0 -1 -1 0 0 ...
##  $ ProductRelated         : int  1 2 1 2 10 19 1 1 2 3 ...
##  $ ProductRelated_Duration: num  0 64 -1 2.67 627.5 ...
##  $ BounceRates            : num  0.2 0 0.2 0.05 0.02 ...
##  $ ExitRates              : num  0.2 0.1 0.2 0.14 0.05 ...
##  $ PageValues             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SpecialDay             : num  0 0 0 0 0 0 0.4 0 0.8 0.4 ...
##  $ Month                  : chr  "Feb" "Feb" "Feb" "Feb" ...
##  $ OperatingSystems       : int  1 2 4 3 3 2 2 1 2 2 ...
##  $ Browser                : int  1 2 1 2 3 2 4 2 2 4 ...
##  $ Region                 : int  1 1 9 2 1 1 3 1 2 1 ...
##  $ TrafficType            : int  1 2 3 4 4 3 3 5 3 2 ...
##  $ VisitorType            : chr  "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" ...
##  $ Weekend                : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
##  $ Revenue                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
# checking the dimensions of our dataset
dim(ecom)
## [1] 12330    18
# confirming that our dataset is a data frame
class(ecom)
## [1] "data.frame"

1.2 Cleaning our data

1.2.1 Checking for missing values

sum(is.na(ecom))
## [1] 112
# there are 112 missing values in total
# checking where the null values are located, column by column
colSums(is.na(ecom))
##          Administrative Administrative_Duration           Informational 
##                      14                      14                      14 
##  Informational_Duration          ProductRelated ProductRelated_Duration 
##                      14                      14                      14 
##             BounceRates               ExitRates              PageValues 
##                      14                      14                       0 
##              SpecialDay                   Month        OperatingSystems 
##                       0                       0                       0 
##                 Browser                  Region             TrafficType 
##                       0                       0                       0 
##             VisitorType                 Weekend                 Revenue 
##                       0                       0                       0

1.2.1.1 Dropping null values

# dropping all rows that contain null values
ecom <- na.omit(ecom)
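
We can confirm that the drop worked:

# confirming that no missing values remain; this should now return 0
sum(is.na(ecom))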

1.2.1.2 Checking for duplicates
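
Before dropping, we can count the fully duplicated rows (duplicated() flags repeats of earlier rows):

sum(duplicated(ecom))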

1.2.1.3 Dropping duplicates

# keeping only the first occurrence of each duplicated row
ecom <- ecom[!duplicated(ecom), ]
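
Checking how many rows remain after dropping the null values and duplicates:

# dimensions of the cleaned dataset
dim(ecom)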

1.2.2 Checking for outliers

boxplot(ecom$Administrative, main= 'Boxplot of administrative web pages',col="blue")

boxplot(ecom$Administrative_Duration, main= 'Boxplot of administrative web pages duration', col="red")

boxplot(ecom$Informational, main= 'Boxplot of informational web pages', col="green")

boxplot(ecom$Informational_Duration, main= 'Boxplot of informational web pages duration',col="purple")

boxplot(ecom$ProductRelated, main= 'Boxplot of product related web pages', col="grey")

boxplot(ecom$ProductRelated_Duration, main= 'Boxplot of product related web pages duration', col="yellow")

boxplot(ecom$BounceRates, main= 'Boxplot of bounce rates', col="black")

boxplot(ecom$ExitRates, main= 'Boxplot of exit rates', col="orange")

boxplot(ecom$PageValues, main= 'Boxplot of page values', col="green")

boxplot(ecom$SpecialDay, main= 'Boxplot of special day', col="blue")

# there are outliers in most of these columns, but we will keep them: they appear
# to reflect genuine browsing behaviour rather than errors, and they may carry
# useful signal for our clustering
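
To quantify what the boxplots show, a small illustrative check (not part of the original analysis) counting the values that fall outside the usual 1.5 * IQR whiskers in each numerical column:

# counting, per column, the values beyond the 1.5 * IQR whiskers
sapply(ecom[ , 1:10], function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
})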


1.2.3 Univariate analysis

# we first select the numerical columns (columns 1 to 10)
numcols <- ecom[ , 1:10]
summary(ecom)
##  Administrative  Administrative_Duration Informational    
##  Min.   : 0.00   Min.   :  -1.00         Min.   : 0.0000  
##  1st Qu.: 0.00   1st Qu.:   0.00         1st Qu.: 0.0000  
##  Median : 1.00   Median :   9.00         Median : 0.0000  
##  Mean   : 2.34   Mean   :  81.68         Mean   : 0.5088  
##  3rd Qu.: 4.00   3rd Qu.:  94.75         3rd Qu.: 0.0000  
##  Max.   :27.00   Max.   :3398.75         Max.   :24.0000  
##  Informational_Duration ProductRelated   ProductRelated_Duration
##  Min.   :  -1.00        Min.   :  0.00   Min.   :   -1.0        
##  1st Qu.:   0.00        1st Qu.:  8.00   1st Qu.:  193.6        
##  Median :   0.00        Median : 18.00   Median :  609.5        
##  Mean   :  34.84        Mean   : 32.06   Mean   : 1207.5        
##  3rd Qu.:   0.00        3rd Qu.: 38.00   3rd Qu.: 1477.6        
##  Max.   :2549.38        Max.   :705.00   Max.   :63973.5        
##   BounceRates        ExitRates         PageValues        SpecialDay     
##  Min.   :0.00000   Min.   :0.00000   Min.   :  0.000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.01422   1st Qu.:  0.000   1st Qu.:0.00000  
##  Median :0.00293   Median :0.02500   Median :  0.000   Median :0.00000  
##  Mean   :0.02045   Mean   :0.04150   Mean   :  5.952   Mean   :0.06197  
##  3rd Qu.:0.01667   3rd Qu.:0.04848   3rd Qu.:  0.000   3rd Qu.:0.00000  
##  Max.   :0.20000   Max.   :0.20000   Max.   :361.764   Max.   :1.00000  
##     Month           OperatingSystems    Browser           Region     
##  Length:12199       Min.   :1.000    Min.   : 1.000   Min.   :1.000  
##  Class :character   1st Qu.:2.000    1st Qu.: 2.000   1st Qu.:1.000  
##  Mode  :character   Median :2.000    Median : 2.000   Median :3.000  
##                     Mean   :2.124    Mean   : 2.358   Mean   :3.153  
##                     3rd Qu.:3.000    3rd Qu.: 2.000   3rd Qu.:4.000  
##                     Max.   :8.000    Max.   :13.000   Max.   :9.000  
##   TrafficType     VisitorType         Weekend         Revenue       
##  Min.   : 1.000   Length:12199       Mode :logical   Mode :logical  
##  1st Qu.: 2.000   Class :character   FALSE:9343      FALSE:10291    
##  Median : 2.000   Mode  :character   TRUE :2856      TRUE :1908     
##  Mean   : 4.075                                                     
##  3rd Qu.: 4.000                                                     
##  Max.   :20.000
# the summary above gives the minimum, quartiles, median, mean, and maximum of each column

1.2.3.1 Histograms of numerical columns

# plotting a histogram for each numerical column
invisible(lapply(names(numcols), function(n) hist(numcols[[n]], main = paste("Histogram of", n), xlab = n)))

1.2.3.2 Bar charts of categorical columns

# selecting the categorical columns (11 to 18)
catcols <- ecom[ , 11:18]
library(epiDisplay)
## Loading required package: foreign
## Loading required package: survival
## Loading required package: MASS
## Loading required package: nnet
dev.new(width=10, height=5, unit="in")
# tabulating and plotting each categorical column; note colnames(catcols)[x],
# since colnames(catcols[, x]) would be NULL for a single extracted column
lapply(1:ncol(catcols), function(x) tab1(catcols[, x], main = colnames(catcols)[x]))
## [[1]]
## catcols[, x] : 
##         Frequency Percent Cum. percent
## Aug           433     3.5          3.5
## Dec          1706    14.0         17.5
## Feb           182     1.5         19.0
## Jul           432     3.5         22.6
## June          285     2.3         24.9
## Mar          1853    15.2         40.1
## May          3328    27.3         67.4
## Nov          2983    24.5         91.8
## Oct           549     4.5         96.3
## Sep           448     3.7        100.0
##   Total     12199   100.0        100.0
## 
## [[2]]
## catcols[, x] : 
##         Frequency Percent Cum. percent
## 1            2548    20.9         20.9
## 2            6536    53.6         74.5
## 3            2530    20.7         95.2
## 4             478     3.9         99.1
## 5               6     0.0         99.2
## 6              19     0.2         99.3
## 7               7     0.1         99.4
## 8              75     0.6        100.0
##   Total     12199   100.0        100.0
## 
## [[3]]
## catcols[, x] : 
##         Frequency Percent Cum. percent
## 1            2426    19.9         19.9
## 2            7878    64.6         84.5
## 3             105     0.9         85.3
## 4             730     6.0         91.3
## 5             466     3.8         95.1
## 6             174     1.4         96.6
## 7              49     0.4         97.0
## 8             135     1.1         98.1
## 9               1     0.0         98.1
## 10            163     1.3         99.4
## 11              6     0.0         99.5
## 12             10     0.1         99.5
## 13             56     0.5        100.0
##   Total     12199   100.0        100.0
## 
## [[4]]
## catcols[, x] : 
##         Frequency Percent Cum. percent
## 1            4711    38.6         38.6
## 2            1127     9.2         47.9
## 3            2382    19.5         67.4
## 4            1168     9.6         77.0
## 5             317     2.6         79.6
## 6             800     6.6         86.1
## 7             758     6.2         92.3
## 8             431     3.5         95.9
## 9             505     4.1        100.0
##   Total     12199   100.0        100.0
## 
## [[5]]
## catcols[, x] : 
##         Frequency Percent Cum. percent
## 1            2383    19.5         19.5
## 2            3907    32.0         51.6
## 3            2017    16.5         68.1
## 4            1066     8.7         76.8
## 5             260     2.1         79.0
## 6             443     3.6         82.6
## 7              40     0.3         82.9
## 8             343     2.8         85.7
## 9              41     0.3         86.1
## 10            450     3.7         89.8
## 11            247     2.0         91.8
## 12              1     0.0         91.8
## 13            728     6.0         97.8
## 14             13     0.1         97.9
## 15             36     0.3         98.2
## 16              3     0.0         98.2
## 17              1     0.0         98.2
## 18             10     0.1         98.3
## 19             17     0.1         98.4
## 20            193     1.6        100.0
##   Total     12199   100.0        100.0
## 
## [[6]]
## catcols[, x] : 
##                   Frequency Percent Cum. percent
## New_Visitor            1693    13.9         13.9
## Other                    81     0.7         14.5
## Returning_Visitor     10425    85.5        100.0
##   Total               12199   100.0        100.0
## 
## [[7]]
## catcols[, x] : 
##         Frequency Percent Cum. percent
## FALSE        9343    76.6         76.6
## TRUE         2856    23.4        100.0
##   Total     12199   100.0        100.0
## 
## [[8]]
## catcols[, x] : 
##         Frequency Percent Cum. percent
## FALSE       10291    84.4         84.4
## TRUE         1908    15.6        100.0
##   Total     12199   100.0        100.0

1.2.4 Bivariate analysis

1.2.4.1 Correlation in numerical columns

# computing pairwise correlations between the numerical columns
# (named cor_matrix to avoid masking the base function matrix())
cor_matrix <- cor(numcols, use = "complete.obs")
cor_matrix
##                         Administrative Administrative_Duration Informational
## Administrative              1.00000000              0.60040965    0.37528761
## Administrative_Duration     0.60040965              1.00000000    0.30143630
## Informational               0.37528761              0.30143630    1.00000000
## Informational_Duration      0.25478602              0.23718986    0.61867795
## ProductRelated              0.42819151              0.28678391    0.37260472
## ProductRelated_Duration     0.37102722              0.35351379    0.38608372
## BounceRates                -0.21366664             -0.13733340   -0.10950530
## ExitRates                  -0.31127413             -0.20202445   -0.15956681
## PageValues                  0.09692097              0.06616837    0.04739015
## SpecialDay                 -0.09707210             -0.07473689   -0.04937677
##                         Informational_Duration ProductRelated
## Administrative                      0.25478602     0.42819151
## Administrative_Duration             0.23718986     0.28678391
## Informational                       0.61867795     0.37260472
## Informational_Duration              1.00000000     0.27906195
## ProductRelated                      0.27906195     1.00000000
## ProductRelated_Duration             0.34658069     0.86030819
## BounceRates                        -0.07015947    -0.19351577
## ExitRates                          -0.10293268    -0.28616321
## PageValues                          0.03006416     0.05411549
## SpecialDay                         -0.03129304    -0.02593062
##                         ProductRelated_Duration BounceRates  ExitRates
## Administrative                       0.37102722 -0.21366664 -0.3112741
## Administrative_Duration              0.35351379 -0.13733340 -0.2020245
## Informational                        0.38608372 -0.10950530 -0.1595668
## Informational_Duration               0.34658069 -0.07015947 -0.1029327
## ProductRelated                       0.86030819 -0.19351577 -0.2861632
## ProductRelated_Duration              1.00000000 -0.17437550 -0.2453340
## BounceRates                         -0.17437550  1.00000000  0.9033582
## ExitRates                           -0.24533401  0.90335819  1.0000000
## PageValues                           0.05084062 -0.11599198 -0.1735715
## SpecialDay                          -0.03821065  0.08783999  0.1167838
##                          PageValues  SpecialDay
## Administrative           0.09692097 -0.09707210
## Administrative_Duration  0.06616837 -0.07473689
## Informational            0.04739015 -0.04937677
## Informational_Duration   0.03006416 -0.03129304
## ProductRelated           0.05411549 -0.02593062
## ProductRelated_Duration  0.05084062 -0.03821065
## BounceRates             -0.11599198  0.08783999
## ExitRates               -0.17357154  0.11678376
## PageValues               1.00000000 -0.06453271
## SpecialDay              -0.06453271  1.00000000
# BounceRates and ExitRates are strongly correlated (0.90), as are ProductRelated
# and ProductRelated_Duration (0.86)
# obtaining a heatmap of the correlation matrix
cor_heatmap <- heatmap(cor_matrix, Rowv = NA, Colv = NA, col = cm.colors(256), scale = "column", margins = c(5, 10))


1.3 Modelling


1.3.1 K-means clustering

1.3.1.1 Sampling

# fixing the random seed so the sampling and k-means runs are reproducible
# (the seed value itself is arbitrary)
set.seed(123)
# drawing a random sample of 1000 rows to keep the clustering computations fast
smple <- ecom[sample(nrow(ecom), 1000), ]
# converting every column (including the character and logical ones) to numeric codes
smple_matrix <- data.matrix(smple)
# total within-cluster sum of squares for k = 1
wss <- (nrow(smple_matrix) - 1) * sum(apply(smple_matrix, 2, var))

# computing the within-cluster sum of squares for k = 2 to 15 (the elbow method)
for (i in 2:15) wss[i] <- sum(kmeans(smple_matrix, centers = i)$withinss)

plot(1:15, wss, type = "b", xlab = "Number of Clusters", ylab = "Within Sum of Squares")

# the within-cluster sum of squares drops sharply up to around k = 3 and then
# flattens, suggesting 3 clusters is a reasonable choice

1.3.1.2 Forming clusters with k = 3

# running k-means with k = 3 and 25 random starts
cl <- kmeans(smple_matrix, 3, nstart = 25)
# plotting the first two columns of the sample, coloured by cluster assignment
plot(smple_matrix, col = (cl$cluster + 1), main = "k-means result with 3 clusters", pch = 1, cex = 1, las = 1)

# overlaying the cluster centres (again, only the first two columns are drawn)
points(cl$centers, col = "black", pch = 17, cex = 2)
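
To see how the 1000 sampled observations are spread across the three clusters, and how much of the total variance the clustering captures, we can inspect the fields that kmeans() returns:

# cluster sizes
table(cl$cluster)
# share of total variance explained by between-cluster separation
cl$betweenss / cl$totss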


1.3.2 Hierarchical clustering

library(stats)    # attached by default; kept for clarity
library(cluster)  # provides agnes() and pltree()
# Dissimilarity matrix
d <- dist(smple_matrix, method = "euclidean")

# Hierarchical clustering using Complete Linkage
hc1 <- hclust(d, method = "complete" )

# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1)
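
For comparison with the k-means result, the dendrogram can be cut into the same number of groups (an illustrative step, reusing k = 3 from above):

# cutting the dendrogram into 3 clusters and tabulating the group sizes
grp <- cutree(hc1, k = 3)
table(grp)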

# Compute with agnes
hc2 <- agnes(smple_matrix, method = "complete")

# Agglomerative coefficient
hc2$ac
## [1] 0.9955162
# comparing the agglomerative coefficients of different hierarchical clustering
# methods to determine which gives the strongest clustering structure;
# running this comparison shows that the ward method is the strongest

# methods to assess (note: map_dbl() requires the purrr package)
#m <- c("average", "single", "complete", "ward")
#names(m) <- c("average", "single", "complete", "ward")

# function to compute the agglomerative coefficient for a given method
#ac <- function(x) {
#  agnes(smple_matrix, method = x)$ac
#}

#map_dbl(m, ac)

# visualising the clustering with the ward method
#hc3 <- agnes(smple_matrix, method = "ward")
#pltree(hc3, cex = 0.6, hang = -1, main = "Dendrogram of agnes")

# this ward-based model does not perform badly
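
A minimal runnable version of the commented comparison above, using base sapply() instead of purrr::map_dbl():

# agglomerative coefficient for each linkage method; values closer to 1
# indicate a stronger clustering structure
m <- c("average", "single", "complete", "ward")
sapply(m, function(x) agnes(smple_matrix, method = x)$ac)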

1.4 Conclusion

From our analysis, we can see that hierarchical clustering with the Ward linkage method offers better-defined clusters than k-means clustering on this dataset.