Matej Suhalj

Research Question:
Can I divide a random sample of 400 people who answered the survey into homogeneous groups based on the 5 cluster variables?

mydata <- read.table("./datahw4.csv", header=TRUE, sep=",")
head(mydata)
##     ID Kidhome Teenhome Income Recency NumWebPurchases NumStorePurchases
## 1 5524       0        0  58138      58               8                 4
## 2 2174       1        1  46344      38               1                 2
## 3 4141       0        0  71613      26               8                10
## 4 6182       1        0  26646      26               2                 4
## 5 5324       1        0  58293      94               5                 6
## 6 7446       0        1  62513      16               6                10
##   NumWebVisitsMonth
## 1                 7
## 2                 5
## 3                 4
## 4                 6
## 5                 5
## 6                 6

Unit of observation: one person.

Description:

The data set was taken from Kaggle.com (Customer Segmentation: Clustering) and contains 2240 people. I will work with a random sample of 400 of them.

#Random sample of 400 units. 
set.seed(1) 
mydata <- mydata[sample(nrow(mydata), 400), ]
#Descriptive statistics
summary(mydata[c("Income", "Recency", "NumWebPurchases", "NumStorePurchases", "NumWebVisitsMonth")])
##      Income          Recency      NumWebPurchases  NumStorePurchases
##  Min.   :  2447   Min.   : 0.00   Min.   : 0.000   Min.   : 0.000   
##  1st Qu.: 35772   1st Qu.:24.00   1st Qu.: 2.000   1st Qu.: 3.000   
##  Median : 53203   Median :50.50   Median : 4.000   Median : 5.000   
##  Mean   : 53083   Mean   :50.67   Mean   : 3.998   Mean   : 5.888   
##  3rd Qu.: 69141   3rd Qu.:78.00   3rd Qu.: 6.000   3rd Qu.: 8.000   
##  Max.   :157733   Max.   :99.00   Max.   :11.000   Max.   :13.000   
##  NA's   :2                                                          
##  NumWebVisitsMonth
##  Min.   : 0.000   
##  1st Qu.: 3.000   
##  Median : 6.000   
##  Mean   : 5.202   
##  3rd Qu.: 7.000   
##  Max.   :13.000   
## 

The minimum annual income per individual is 2,447€, while the maximum is 157,733€ (note that Income also has 2 missing values). The average number of days since the last purchase is a little above 50. 75% of people made 6 or fewer purchases through the company’s website.

#Standardization of variables
mydata$Income_z <- scale(mydata$Income)
mydata$Recency_z <- scale(mydata$Recency)
mydata$NumWebPurchases_z   <- scale(mydata$NumWebPurchases)
mydata$NumStorePurchases_z <- scale(mydata$NumStorePurchases)
mydata$NumWebVisitsMonth_z <- scale(mydata$NumWebVisitsMonth)


head(mydata)
##         ID Kidhome Teenhome Income Recency NumWebPurchases NumStorePurchases
## 1017  7010       0        0  70924      41               6                 7
## 679   8779       1        0  36145      13               4                 3
## 2177  1544       0        0  81380      51               4                10
## 930  10673       0        1  68397       6               6                 5
## 1533  3463       0        1  69283      41               7                13
## 471   2021       0        1  61456      47               6                13
##      NumWebVisitsMonth   Income_z   Recency_z NumWebPurchases_z
## 1017                 3  0.8218743 -0.31779981      0.7579604898
## 679                  9 -0.7802863 -1.23800610      0.0009462678
## 2177                 2  1.3035497  0.01084529      0.0009462678
## 930                  3  0.7054632 -1.46805767      0.7579604898
## 1533                 5  0.7462785 -0.31779981      1.1364676008
## 471                  4  0.3857129 -0.12061275      0.7579604898
##      NumStorePurchases_z NumWebVisitsMonth_z
## 1017           0.3305371         -0.91485515
## 679           -0.8579110          1.57737228
## 2177           1.2218732         -1.33022639
## 930           -0.2636869         -0.91485515
## 1533           2.1132093         -0.08411268
## 471            2.1132093         -0.49948391
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(mydata[, c("Income_z", "Recency_z", "NumWebPurchases_z", "NumStorePurchases_z", "NumWebVisitsMonth_z")]), 
      type = "pearson")
##                     Income_z Recency_z NumWebPurchases_z NumStorePurchases_z
## Income_z                1.00      0.03              0.45                0.62
## Recency_z               0.03      1.00              0.07                0.04
## NumWebPurchases_z       0.45      0.07              1.00                0.58
## NumStorePurchases_z     0.62      0.04              0.58                1.00
## NumWebVisitsMonth_z    -0.69      0.00             -0.01               -0.42
##                     NumWebVisitsMonth_z
## Income_z                          -0.69
## Recency_z                          0.00
## NumWebPurchases_z                 -0.01
## NumStorePurchases_z               -0.42
## NumWebVisitsMonth_z                1.00
## 
## n
##                     Income_z Recency_z NumWebPurchases_z NumStorePurchases_z
## Income_z                 398       398               398                 398
## Recency_z                398       400               400                 400
## NumWebPurchases_z        398       400               400                 400
## NumStorePurchases_z      398       400               400                 400
## NumWebVisitsMonth_z      398       400               400                 400
##                     NumWebVisitsMonth_z
## Income_z                            398
## Recency_z                           400
## NumWebPurchases_z                   400
## NumStorePurchases_z                 400
## NumWebVisitsMonth_z                 400
## 
## P
##                     Income_z Recency_z NumWebPurchases_z NumStorePurchases_z
## Income_z                     0.5909    0.0000            0.0000             
## Recency_z           0.5909             0.1853            0.4271             
## NumWebPurchases_z   0.0000   0.1853                      0.0000             
## NumStorePurchases_z 0.0000   0.4271    0.0000                               
## NumWebVisitsMonth_z 0.0000   0.9241    0.8394            0.0000             
##                     NumWebVisitsMonth_z
## Income_z            0.0000             
## Recency_z           0.9241             
## NumWebPurchases_z   0.8394             
## NumStorePurchases_z 0.0000             
## NumWebVisitsMonth_z

Some of the chosen cluster variables are fairly strongly correlated (e.g., Income_z with NumStorePurchases_z at 0.62 and with NumWebVisitsMonth_z at −0.69), but I will ignore this and continue with my analysis.
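Had I wanted to address this, one common option would be to cluster on principal component scores instead, since these are uncorrelated by construction. A minimal sketch (the choice of three components is purely illustrative and not part of this analysis):

#A possible remedy (not applied here): cluster on uncorrelated principal
#component scores instead of the correlated standardized variables
vars <- c("Income_z", "Recency_z", "NumWebPurchases_z", "NumStorePurchases_z", "NumWebVisitsMonth_z")
pca <- prcomp(na.omit(mydata[vars])) #variables are already standardized
summary(pca) #decide how many components to retain
scores <- pca$x[, 1:3] #e.g. keep the first three components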

#Finding outliers
mydata$Dissimilarity <- sqrt(mydata$Income_z^2 + mydata$Recency_z^2 + mydata$NumWebPurchases_z^2 + mydata$NumStorePurchases_z^2 +  mydata$NumWebVisitsMonth_z^2) 
head(mydata[order(-mydata$Dissimilarity), ], 10) #10 units with the highest value of dissimilarity
##         ID Kidhome Teenhome Income Recency NumWebPurchases NumStorePurchases
## 2133 11181       0        0 156924      85               0                 0
## 1301  5336       1        0 157733      37               1                 1
## 2215  9303       0        1   5305      12               1                 0
## 22    5376       1        0   2447      42               0                 0
## 1253  5153       0        1  77766      97              11                11
## 768   1911       0        0  67430       6              11                12
## 1784 10678       0        1  71232      91              11                10
## 1837  7849       0        0  80336      12               2                13
## 879   1446       0        0  86424      12               6                12
## 19    6565       0        1  76995      91              11                 9
##      NumWebVisitsMonth   Income_z  Recency_z NumWebPurchases_z
## 2133                 0  4.7836272  1.1282386        -1.5130822
## 1301                 1  4.8208953 -0.4492579        -1.1345751
## 2215                13 -2.2009893 -1.2708706        -1.1345751
## 22                   1 -2.3326485 -0.2849353        -1.5130822
## 1253                 6  1.1370640  1.5226128         2.6504960
## 768                  6  0.6609165 -1.4680577         2.6504960
## 1784                 7  0.8360629  1.3254257         2.6504960
## 1837                 1  1.2554559 -1.2708706        -0.7560680
## 879                  1  1.5359111 -1.2708706         0.7579605
## 19                   5  1.1015464  1.3254257         2.6504960
##      NumStorePurchases_z NumWebVisitsMonth_z Dissimilarity
## 2133          -1.7492471         -2.16096887      5.845946
## 1301          -1.4521350         -1.74559763      5.466803
## 2215          -1.7492471          3.23885724      4.614844
## 22            -1.7492471         -1.74559763      3.730795
## 1253           1.5189852          0.33125856      3.612955
## 768            1.8160973          0.33125856      3.609027
## 1784           1.2218732          0.74662980      3.395779
## 1837           2.1132093         -1.74559763      3.357929
## 879            1.8160973         -1.74559763      3.300600
## 19             0.9247612         -0.08411268      3.295079

Units with IDs 11181, 5336 and 9303 stand out as outliers.
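Equivalently, this dissimilarity (the Euclidean distance of each unit from the overall centroid, which is the origin in standardized space) can be computed more compactly with rowSums; I also store the variable names for reuse later:

#Equivalent, more compact computation of the same dissimilarity
z_vars <- c("Income_z", "Recency_z", "NumWebPurchases_z", "NumStorePurchases_z", "NumWebVisitsMonth_z")
mydata$Dissimilarity <- sqrt(rowSums(as.matrix(mydata[z_vars])^2))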

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata <- mydata %>%
  filter(!ID %in% c(11181, 5336, 9303)) #Removing outliers

#Standardizing the data again
mydata$Income_z <- scale(mydata$Income)
mydata$Recency_z <- scale(mydata$Recency)
mydata$NumWebPurchases_z   <- scale(mydata$NumWebPurchases)
mydata$NumStorePurchases_z <- scale(mydata$NumStorePurchases)
mydata$NumWebVisitsMonth_z <- scale(mydata$NumWebVisitsMonth)
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.3.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
#Calculating Euclidean distances
distance <- get_dist(mydata[c("Income_z", "Recency_z", "NumWebPurchases_z", "NumStorePurchases_z", "NumWebVisitsMonth_z")], 
                     method = "euclidian")

distance2 <- distance^2

fviz_dist(distance2) #Showing dissimilarity matrix

We can see some pink squares forming in the dissimilarity matrix above, which suggests that the data is probably clusterable and that I will be able to form some cluster groups. The matrix is 397 × 397.

get_clust_tendency(mydata[c("Income_z", "Recency_z", "NumWebPurchases_z", "NumStorePurchases_z", "NumWebVisitsMonth_z")], #Hopkins statistic
                   n = nrow(mydata) - 1, 
                   graph = FALSE)
## $hopkins_stat
## [1] 0.6916395
## 
## $plot
## NULL

The data is clusterable, because H is above 0.5 (H = 0.69).
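As a sanity check, the same statistic computed on uniformly random data of the same shape should land near 0.5 (no cluster structure). A minimal sketch, reusing z_vars from above:

#Sanity check (a sketch): Hopkins statistic on uniform random data
set.seed(1)
random_df <- as.data.frame(lapply(mydata[z_vars], function(x) runif(length(x), min(x, na.rm = TRUE), max(x, na.rm = TRUE))))
get_clust_tendency(random_df, n = nrow(random_df) - 1, graph = FALSE)$hopkins_stat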

Hierarchical Clustering

library(dplyr) #Allowing use of %>%

WARD <- mydata[c("Income_z", "Recency_z", "NumWebPurchases_z", "NumStorePurchases_z", "NumWebVisitsMonth_z")] %>%
  get_dist(method = "euclidean") %>% #Selecting the distance measure
  hclust(method = "ward.D2") #Selecting the clustering algorithm

WARD
## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 397

397 units will be organized into groups. Merging them all into a single group would take 396 agglomeration steps.
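This can be read directly off the fitted object: an hclust tree over n units stores n − 1 merge heights.

length(WARD$height) #number of agglomeration steps: n - 1 = 396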

library(factoextra)
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Based on the dendrogram, it would be most appropriate to have 3 cluster groups.

fviz_dend(WARD, 
          k = 3,
          cex = 0.5, 
          palette = "jama",
          color_labels_by_k = TRUE, 
          rect = TRUE)

3 cluster groups shown in colour.
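The choice of k = 3 could also be cross-checked against, for example, the elbow and average silhouette criteria; a sketch, run on complete cases since Income still contains two missing values at this point:

#Cross-checking the number of clusters (a sketch)
complete <- na.omit(mydata[z_vars])
fviz_nbclust(complete, kmeans, method = "wss") #elbow method
fviz_nbclust(complete, kmeans, method = "silhouette") #average silhouette width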

mydata$ClusterWard <- cutree(WARD, 
                             k = 3) #Cutting the dendrogram
head(mydata)
##      ID Kidhome Teenhome Income Recency NumWebPurchases NumStorePurchases
## 1  7010       0        0  70924      41               6                 7
## 2  8779       1        0  36145      13               4                 3
## 3  1544       0        0  81380      51               4                10
## 4 10673       0        1  68397       6               6                 5
## 5  3463       0        1  69283      41               7                13
## 6  2021       0        1  61456      47               6                13
##   NumWebVisitsMonth   Income_z    Recency_z NumWebPurchases_z
## 1                 3  0.8971721 -0.319327311       0.750235401
## 2                 9 -0.8127779 -1.239639051      -0.008601425
## 3                 2  1.4112535  0.009355454      -0.008601425
## 4                 3  0.7729292 -1.469716987       0.750235401
## 5                 5  0.8164904 -0.319327311       1.129653814
## 6                 4  0.4316668 -0.122117652       0.750235401
##   NumStorePurchases_z NumWebVisitsMonth_z Dissimilarity ClusterWard
## 1           0.3202131         -0.93466037      1.515649           1
## 2          -0.8762537          1.60684763      2.316381           2
## 3           1.2175632         -1.35824504      2.227518           3
## 4          -0.2780203         -0.93466037      2.033191           1
## 5           2.1149133         -0.08749104      2.534209           3
## 6           2.1149133         -0.51107571      2.335158           3

The table above shows the group into which each unit has been classified by hierarchical clustering (column ClusterWard).

#Calculating positions of initial leaders

Initial_leaders <- aggregate(mydata[, c("Income_z", "Recency_z", "NumWebPurchases_z", "NumStorePurchases_z", "NumWebVisitsMonth_z")], 
                             by = list(mydata$ClusterWard), 
                             FUN = mean)

Initial_leaders
##   Group.1 Income_z   Recency_z NumWebPurchases_z NumStorePurchases_z
## 1       1       NA  0.12347411         0.9778864           0.3441424
## 2       2       NA -0.12637835        -0.8751743          -0.8910249
## 3       3 1.072744  0.04580936         0.1776585           0.9211657
##   NumWebVisitsMonth_z
## 1           0.4275879
## 2           0.5086652
## 3          -1.2350204

The table above shows the initial leaders; negative values mean a group is below average on that variable. The NA values for Income_z in groups 1 and 2 come from the two units with missing Income, which are removed next.

library(tidyr)
mydata <- drop_na(mydata) #Removing units with NA
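The hkmeans() call in the next section automates this seeding; a manual sketch of the same idea would be to recompute the leaders on the now NA-free data and hand them to kmeans() as starting centroids:

#Manual seeding of K-Means with the Ward leaders (a sketch)
leaders <- aggregate(mydata[z_vars], by = list(mydata$ClusterWard), FUN = mean)
manual_kmeans <- kmeans(mydata[z_vars], centers = leaders[, -1]) #drop the group label column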

K-Means Clustering

library(factoextra)

#Performing K-Means clustering - the initial leaders found with hierarchical clustering are used as the starting centroids of the groups
K_MEANS <- hkmeans(mydata[c("Income_z", "Recency_z", "NumWebPurchases_z", "NumStorePurchases_z", "NumWebVisitsMonth_z")], 
                   k = 3,
                   hc.metric = "euclidean",
                   hc.method = "ward.D2")

K_MEANS
## Hierarchical K-means clustering with 3 clusters of sizes 100, 178, 117
## 
## Cluster means:
##     Income_z   Recency_z NumWebPurchases_z NumStorePurchases_z
## 1  0.3046569  0.18059917         1.2852154           0.5535241
## 2 -0.8794536 -0.08334047        -0.7908855          -0.8527277
## 3  1.0775816 -0.03250244         0.1048998           0.8264106
##   NumWebVisitsMonth_z
## 1           0.4504615
## 2           0.5431266
## 3          -1.2170502
## 
## Clustering vector:
##   [1] 3 2 3 3 1 3 2 1 2 1 2 1 2 1 3 2 1 2 3 2 3 2 3 1 3 2 1 1 2 1 2 3 3 2 3 2 2
##  [38] 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 1 3 2 3 2 2 1 2 3 3 1 1 1 2 2 1 3 3 2 2 2 2
##  [75] 2 1 1 3 2 2 2 3 2 3 3 2 3 3 2 1 2 1 2 1 3 1 3 2 1 3 2 2 3 3 1 2 3 1 2 3 1
## [112] 1 3 2 2 1 2 3 2 3 1 3 2 3 1 2 3 1 2 2 1 3 2 2 3 2 3 2 2 3 2 1 1 3 2 2 2 1
## [149] 2 1 2 3 1 2 2 3 2 2 2 3 2 2 2 1 2 2 3 3 2 3 2 3 2 3 2 2 2 1 1 3 2 2 3 2 2
## [186] 2 3 2 2 1 1 2 3 2 3 3 2 2 1 1 3 2 2 2 1 3 2 1 3 2 2 1 2 1 2 2 3 3 2 3 1 2
## [223] 1 1 2 1 2 3 2 1 3 2 2 2 1 1 3 2 3 3 2 3 2 1 3 2 3 2 2 2 2 2 1 2 3 1 2 3 2
## [260] 2 2 3 3 3 1 2 2 3 2 2 2 2 1 1 3 1 1 2 2 1 2 2 2 3 2 3 1 2 3 1 2 2 1 3 3 3
## [297] 1 2 2 1 2 2 2 2 2 2 1 3 2 1 1 3 1 1 3 1 2 3 3 1 3 2 2 3 1 1 2 3 3 1 2 1 3
## [334] 3 3 2 1 3 2 3 2 1 2 1 2 1 3 3 2 3 2 1 1 3 3 1 2 3 3 3 1 2 3 3 3 3 1 3 1 2
## [371] 3 2 1 3 2 3 2 2 1 2 3 2 1 2 2 1 1 3 3 2 1 2 2 1 3
## 
## Within cluster sum of squares by cluster:
## [1] 269.6025 361.7333 290.4907
##  (between_SS / total_SS =  53.3 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "data"        
## [11] "hclust"

In the Cluster means section of the output above we can see the final leaders; comparing them with the initial leaders shows whether any reclassifications have taken place.

The ratio of the between-group sum of squares to the total sum of squares is 53.3%, which is not bad, but also not very good. This ratio tells us how much of the total variability is explained by the grouping. We want it to be as high as possible, because a high value means that units within clusters are homogeneous (clusters contain similar units).
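The same ratio can be pulled directly from the components of the returned object:

K_MEANS$betweenss / K_MEANS$totss #between-group share of total SS, ~0.533 as in the printout above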

fviz_cluster(K_MEANS, 
             palette = "jama", 
             repel = FALSE,
             ggtheme = theme_classic())

mydata$ClusterK_Means <- K_MEANS$cluster
head(mydata[c("ID", "ClusterWard", "ClusterK_Means")])
##      ID ClusterWard ClusterK_Means
## 1  7010           1              3
## 2  8779           2              2
## 3  1544           3              3
## 4 10673           1              3
## 5  3463           3              1
## 6  2021           3              3

Above we can see some reclassifications of units after K-Means clustering was performed.

#Checking for reclassifications
table(mydata$ClusterWard)
## 
##   1   2   3 
## 124 161 110
table(mydata$ClusterK_Means)
## 
##   1   2   3 
## 100 178 117
table(mydata$ClusterWard, mydata$ClusterK_Means)
##    
##       1   2   3
##   1  91  18  15
##   2   0 160   1
##   3   9   0 101

In the third group formed by K-Means clustering there are 117 units: 101 of them were already in group 3 after hierarchical clustering, 15 were reclassified from group 1, and 1 was reclassified from group 2.

In group 2 after hierarchical clustering there were 161 units; after K-Means clustering, 1 unit was reclassified to group 3 and 160 units remained in the second group.
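The agreement between the two partitions can also be summarized in a single number, for example the adjusted Rand index; a sketch, assuming the mclust package is installed:

#Agreement between the Ward and K-Means partitions (1 = identical)
library(mclust)
adjustedRandIndex(mydata$ClusterWard, mydata$ClusterK_Means)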

Centroids <- K_MEANS$centers
Centroids
##     Income_z   Recency_z NumWebPurchases_z NumStorePurchases_z
## 1  0.3046569  0.18059917         1.2852154           0.5535241
## 2 -0.8794536 -0.08334047        -0.7908855          -0.8527277
## 3  1.0775816 -0.03250244         0.1048998           0.8264106
##   NumWebVisitsMonth_z
## 1           0.4504615
## 2           0.5431266
## 3          -1.2170502
library(ggplot2)
library(tidyr)

Figure <- as.data.frame(Centroids)
Figure$id <- 1:nrow(Figure)
Figure <- pivot_longer(Figure, cols = c(Income_z, Recency_z, NumWebPurchases_z, NumStorePurchases_z, NumWebVisitsMonth_z))

Figure$Groups <- factor(Figure$id, 
                        levels = c(1, 2, 3), 
                        labels = c("1", "2", "3"))

Figure$nameFactor <- factor(Figure$name, 
                            levels = c("Income_z", "Recency_z", "NumWebPurchases_z", "NumStorePurchases_z", "NumWebVisitsMonth_z"), 
                            labels = c("Income_z", "Recency_z", "NumWebPurchases_z", "NumStorePurchases_z", "NumWebVisitsMonth_z"))

ggplot(Figure, aes(x = nameFactor, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Groups, col = Groups), size = 3) +
  geom_line(aes(group = id), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables") +
  ylim(-1.5, 1.5)

The ggplot above shows that people in group 1 are above average on all 5 cluster variables, while people in group 2 are below average on essentially all of them except NumWebVisitsMonth_z (where they are above average). Lastly, people in cluster group 3 are above average in Income_z and NumStorePurchases_z and slightly above average in NumWebPurchases_z, while being below average in NumWebVisitsMonth_z and around average in Recency_z.

H0: Mean of Income_z for group 1 is equal to the mean of Income_z for group 2 is equal to the mean of Income_z for group 3.
H1: At least one arithmetic mean is different.

H0: Mean of Recency_z for group 1 is equal to the mean of Recency_z for group 2 is equal to the mean of Recency_z for group 3.
H1: At least one arithmetic mean is different.

H0: Mean of NumWebPurchases_z for group 1 is equal to the mean of NumWebPurchases_z for group 2 is equal to the mean of NumWebPurchases_z for group 3.
H1: At least one arithmetic mean is different.

H0: Mean of NumStorePurchases_z for group 1 is equal to the mean of NumStorePurchases_z for group 2 is equal to the mean of NumStorePurchases_z for group 3.
H1: At least one arithmetic mean is different.

H0: Mean of NumWebVisitsMonth_z for group 1 is equal to the mean of NumWebVisitsMonth_z for group 2 is equal to the mean of NumWebVisitsMonth_z for group 3.
H1: At least one arithmetic mean is different.

#Checking if all the cluster variables are successful at classifying units into groups by performing ANOVAs.

fit <- aov(cbind(Income_z, Recency_z, NumWebPurchases_z, NumStorePurchases_z, NumWebVisitsMonth_z) ~ as.factor(ClusterK_Means), 
                 data = mydata)

summary(fit)
##  Response 1 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   2 282.81 141.406  498.54 < 2.2e-16 ***
## Residuals                 392 111.19   0.284                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 2 :
##                            Df Sum Sq Mean Sq F value  Pr(>F)  
## as.factor(ClusterK_Means)   2   4.62 2.31034  2.3322 0.09843 .
## Residuals                 392 388.33 0.99065                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 3 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   2 277.80 138.902  465.21 < 2.2e-16 ***
## Residuals                 392 117.04   0.299                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 4 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   2 239.98 119.988  302.31 < 2.2e-16 ***
## Residuals                 392 155.59   0.397                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 5 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   2 246.10 123.050  322.27 < 2.2e-16 ***
## Residuals                 392 149.67   0.382                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I can reject H0 for all cluster variables (p < 0.001) except Recency_z (p = 0.098). Strictly speaking, I should redo the clustering without Recency_z as a cluster variable, but due to coursework time pressure I will simply continue with the analysis.
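For reference, the re-run would only require dropping Recency_z from the variable list; a sketch, not executed here:

#Sketch of the re-run without Recency_z (not executed here)
vars4 <- c("Income_z", "NumWebPurchases_z", "NumStorePurchases_z", "NumWebVisitsMonth_z")
K_MEANS4 <- hkmeans(mydata[vars4], k = 3, hc.metric = "euclidean", hc.method = "ward.D2")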

H0: Mean of Kidhome for group 1 is equal to the mean of Kidhome for group 2 is equal to the mean of Kidhome for group 3.
H1: At least one arithmetic mean is different.

#Checking validity of Kidhome
aggregate(mydata$Kidhome, 
          by = list(mydata$ClusterK_Means), 
          FUN = "mean")
##   Group.1          x
## 1       1 0.32000000
## 2       2 0.84269663
## 3       3 0.05128205
fit <- aov(Kidhome ~ as.factor(ClusterK_Means), 
           data = mydata)

summary(fit)
##                            Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(ClusterK_Means)   2  47.47  23.737     112 <2e-16 ***
## Residuals                 392  83.05   0.212                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the p-value (p < 0.001) I reject H0. This validates my results with Kidhome.

H0: Mean of Teenhome for group 1 is equal to the mean of Teenhome for group 2 is equal to the mean of Teenhome for group 3.
H1: At least one arithmetic mean is different.

#Checking validity of Teenhome
aggregate(mydata$Teenhome, 
          by = list(mydata$ClusterK_Means), 
          FUN = "mean")
##   Group.1         x
## 1       1 0.7600000
## 2       2 0.4887640
## 3       3 0.3504274
fit <- aov(Teenhome ~ as.factor(ClusterK_Means), 
           data = mydata)

summary(fit)
##                            Df Sum Sq Mean Sq F value   Pr(>F)    
## as.factor(ClusterK_Means)   2   9.29   4.647   16.97 8.57e-08 ***
## Residuals                 392 107.35   0.274                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the p-value (p < 0.001) I reject H0. This validates my results with Teenhome.
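To see which pairs of groups actually differ, each significant ANOVA could be followed by a post-hoc test such as Tukey's HSD; a sketch for Teenhome:

#Pairwise post-hoc comparisons of the group means (a sketch)
TukeyHSD(aov(Teenhome ~ as.factor(ClusterK_Means), data = mydata))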

The classification of 395 people from this data set (400 randomly sampled, minus 3 outliers and 2 units with missing Income) was based on 5 standardized variables.

For hierarchical clustering, Ward’s algorithm was used, and based on the dendrogram it was decided to classify the people into three groups. The classification was then further optimized with K-Means clustering.

Group 2 contains the most people (45%, 178/395 after reclassification) and is characterized by below-average values on all cluster variables except NumWebVisitsMonth_z.

Ultimately I was not able to divide the people into homogeneous groups based on all 5 cluster variables, because the arithmetic mean of Recency_z did not differ significantly across the 3 cluster groups.