Research Question:
Can I divide the random sample of 400 people that answered the survey
into homogeneous groups based on the 5 cluster variables.
mydata <- read.table("./datahw4.csv", header=TRUE, sep=",")
head(mydata)
## ID Kidhome Teenhome Income Recency NumWebPurchases NumStorePurchases
## 1 5524 0 0 58138 58 8 4
## 2 2174 1 1 46344 38 1 2
## 3 4141 0 0 71613 26 8 10
## 4 6182 1 0 26646 26 2 4
## 5 5324 1 0 58293 94 5 6
## 6 7446 0 1 62513 16 6 10
## NumWebVisitsMonth
## 1 7
## 2 5
## 3 4
## 4 6
## 5 5
## 6 6
Unit od observation: one person.
Description:
The data set was taken from Kaggle.com (Customer Segmentation: Clustering) and the sample size is 2240 people. I will choose a random sample of 400 people.
#Random sample of 400 units.
set.seed(1)
mydata <- mydata[sample(nrow(mydata), 400), ]
#Descriptive statistics
summary(mydata[c("Income", "Recency", "NumWebPurchases", "NumStorePurchases", "NumWebVisitsMonth")])
## Income Recency NumWebPurchases NumStorePurchases
## Min. : 2447 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 35772 1st Qu.:24.00 1st Qu.: 2.000 1st Qu.: 3.000
## Median : 53203 Median :50.50 Median : 4.000 Median : 5.000
## Mean : 53083 Mean :50.67 Mean : 3.998 Mean : 5.888
## 3rd Qu.: 69141 3rd Qu.:78.00 3rd Qu.: 6.000 3rd Qu.: 8.000
## Max. :157733 Max. :99.00 Max. :11.000 Max. :13.000
## NA's :2
## NumWebVisitsMonth
## Min. : 0.000
## 1st Qu.: 3.000
## Median : 6.000
## Mean : 5.202
## 3rd Qu.: 7.000
## Max. :13.000
##
The minimum annual Income per individual is 2447€, where the maximum is 157.773€. The avarage number of days since the last purchase is a little above 50 days. The 75% of people have made 6 purchases or less through the company’s website.
#Standardization of variables
mydata$Income_z <- scale(mydata$Income)
mydata$Recency_z <- scale(mydata$Recency)
mydata$NumWebPurchases_z <- scale(mydata$NumWebPurchases)
mydata$NumStorePurchases_z <- scale(mydata$NumStorePurchases)
mydata$NumWebVisitsMonth_z <- scale(mydata$NumWebVisitsMonth)
head(mydata)
## ID Kidhome Teenhome Income Recency NumWebPurchases NumStorePurchases
## 1017 7010 0 0 70924 41 6 7
## 679 8779 1 0 36145 13 4 3
## 2177 1544 0 0 81380 51 4 10
## 930 10673 0 1 68397 6 6 5
## 1533 3463 0 1 69283 41 7 13
## 471 2021 0 1 61456 47 6 13
## NumWebVisitsMonth Income_z Recency_z NumWebPurchases_z
## 1017 3 0.8218743 -0.31779981 0.7579604898
## 679 9 -0.7802863 -1.23800610 0.0009462678
## 2177 2 1.3035497 0.01084529 0.0009462678
## 930 3 0.7054632 -1.46805767 0.7579604898
## 1533 5 0.7462785 -0.31779981 1.1364676008
## 471 4 0.3857129 -0.12061275 0.7579604898
## NumStorePurchases_z NumWebVisitsMonth_z
## 1017 0.3305371 -0.91485515
## 679 -0.8579110 1.57737228
## 2177 1.2218732 -1.33022639
## 930 -0.2636869 -0.91485515
## 1533 2.1132093 -0.08411268
## 471 2.1132093 -0.49948391
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[, c("Income_z", "Recency_z", "NumWebPurchases_z", "NumStorePurchases_z", "NumWebVisitsMonth_z")]),
type = "pearson")
## Income_z Recency_z NumWebPurchases_z NumStorePurchases_z
## Income_z 1.00 0.03 0.45 0.62
## Recency_z 0.03 1.00 0.07 0.04
## NumWebPurchases_z 0.45 0.07 1.00 0.58
## NumStorePurchases_z 0.62 0.04 0.58 1.00
## NumWebVisitsMonth_z -0.69 0.00 -0.01 -0.42
## NumWebVisitsMonth_z
## Income_z -0.69
## Recency_z 0.00
## NumWebPurchases_z -0.01
## NumStorePurchases_z -0.42
## NumWebVisitsMonth_z 1.00
##
## n
## Income_z Recency_z NumWebPurchases_z NumStorePurchases_z
## Income_z 398 398 398 398
## Recency_z 398 400 400 400
## NumWebPurchases_z 398 400 400 400
## NumStorePurchases_z 398 400 400 400
## NumWebVisitsMonth_z 398 400 400 400
## NumWebVisitsMonth_z
## Income_z 398
## Recency_z 400
## NumWebPurchases_z 400
## NumStorePurchases_z 400
## NumWebVisitsMonth_z 400
##
## P
## Income_z Recency_z NumWebPurchases_z NumStorePurchases_z
## Income_z 0.5909 0.0000 0.0000
## Recency_z 0.5909 0.1853 0.4271
## NumWebPurchases_z 0.0000 0.1853 0.0000
## NumStorePurchases_z 0.0000 0.4271 0.0000
## NumWebVisitsMonth_z 0.0000 0.9241 0.8394 0.0000
## NumWebVisitsMonth_z
## Income_z 0.0000
## Recency_z 0.9241
## NumWebPurchases_z 0.8394
## NumStorePurchases_z 0.0000
## NumWebVisitsMonth_z
Some chosen cluster variables are correlated a bit too much, but I will ignore this and continue with my analysis.
#Finding outliers
mydata$Dissimilarity <- sqrt(mydata$Income_z^2 + mydata$Recency_z^2 + mydata$NumWebPurchases_z^2 + mydata$NumStorePurchases_z^2 + mydata$NumWebVisitsMonth_z^2)
head(mydata[order(-mydata$Dissimilarity), ], 10) #10 units with the highest value of dissimilarity
## ID Kidhome Teenhome Income Recency NumWebPurchases NumStorePurchases
## 2133 11181 0 0 156924 85 0 0
## 1301 5336 1 0 157733 37 1 1
## 2215 9303 0 1 5305 12 1 0
## 22 5376 1 0 2447 42 0 0
## 1253 5153 0 1 77766 97 11 11
## 768 1911 0 0 67430 6 11 12
## 1784 10678 0 1 71232 91 11 10
## 1837 7849 0 0 80336 12 2 13
## 879 1446 0 0 86424 12 6 12
## 19 6565 0 1 76995 91 11 9
## NumWebVisitsMonth Income_z Recency_z NumWebPurchases_z
## 2133 0 4.7836272 1.1282386 -1.5130822
## 1301 1 4.8208953 -0.4492579 -1.1345751
## 2215 13 -2.2009893 -1.2708706 -1.1345751
## 22 1 -2.3326485 -0.2849353 -1.5130822
## 1253 6 1.1370640 1.5226128 2.6504960
## 768 6 0.6609165 -1.4680577 2.6504960
## 1784 7 0.8360629 1.3254257 2.6504960
## 1837 1 1.2554559 -1.2708706 -0.7560680
## 879 1 1.5359111 -1.2708706 0.7579605
## 19 5 1.1015464 1.3254257 2.6504960
## NumStorePurchases_z NumWebVisitsMonth_z Dissimilarity
## 2133 -1.7492471 -2.16096887 5.845946
## 1301 -1.4521350 -1.74559763 5.466803
## 2215 -1.7492471 3.23885724 4.614844
## 22 -1.7492471 -1.74559763 3.730795
## 1253 1.5189852 0.33125856 3.612955
## 768 1.8160973 0.33125856 3.609027
## 1784 1.2218732 0.74662980 3.395779
## 1837 2.1132093 -1.74559763 3.357929
## 879 1.8160973 -1.74559763 3.300600
## 19 0.9247612 -0.08411268 3.295079
Units ID: 11181, 5336 and 9303 stand out.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
filter(!ID %in% c(11181, 5336, 9303)) #Removing outliers
#Stdandardizing the data again
mydata$Income_z <- scale(mydata$Income)
mydata$Recency_z <- scale(mydata$Recency)
mydata$NumWebPurchases_z <- scale(mydata$NumWebPurchases)
mydata$NumStorePurchases_z <- scale(mydata$NumStorePurchases)
mydata$NumWebVisitsMonth_z <- scale(mydata$NumWebVisitsMonth)
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.3.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
#Calculating Euclidean distances
distance <- get_dist(mydata[c("Income_z", "Recency_z", "NumWebPurchases_z", "NumStorePurchases_z", "NumWebVisitsMonth_z")],
method = "euclidian")
distance2 <- distance^2
fviz_dist(distance2) #Showing dissimilarity matrix
We can see that there is forming of some see pink squares above in the dissimilarity matrix, which means that this data is probably clusterable and that I will be able to form some cluster groups. This is 397x397 matrix.
get_clust_tendency(mydata[c("Income_z", "Recency_z", "NumWebPurchases_z", "NumStorePurchases_z", "NumWebVisitsMonth_z")], #Hopkins statistics
n = nrow(mydata) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.6916395
##
## $plot
## NULL
Data is clustrable, because H is above 0,5 (H=0,76).
library(dplyr) #Allowing use of %>%
WARD <- mydata[c("Income_z", "Recency_z", "NumWebPurchases_z", "NumStorePurchases_z", "NumWebVisitsMonth_z")] %>%
#scale(mydata) %>%
get_dist(method = "euclidean") %>% #Selecting distance
hclust(method = "ward.D2") #Selecting algorithm
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 397
397 units will be organized into groups. If I would like to make 1 group, this would take 396 steps.
library(factoextra)
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Based on the dendrogram, it would be most appropriate to have 3 cluster groups.
fviz_dend(WARD,
k = 3,
cex = 0.5,
palette = "jama",
color_labels_by_k = TRUE,
rect = TRUE)
3 cluster groups shown in colour.
mydata$ClusterWard <- cutree(WARD,
k = 3) #Cutting the dendrogram
head(mydata)
## ID Kidhome Teenhome Income Recency NumWebPurchases NumStorePurchases
## 1 7010 0 0 70924 41 6 7
## 2 8779 1 0 36145 13 4 3
## 3 1544 0 0 81380 51 4 10
## 4 10673 0 1 68397 6 6 5
## 5 3463 0 1 69283 41 7 13
## 6 2021 0 1 61456 47 6 13
## NumWebVisitsMonth Income_z Recency_z NumWebPurchases_z
## 1 3 0.8971721 -0.319327311 0.750235401
## 2 9 -0.8127779 -1.239639051 -0.008601425
## 3 2 1.4112535 0.009355454 -0.008601425
## 4 3 0.7729292 -1.469716987 0.750235401
## 5 5 0.8164904 -0.319327311 1.129653814
## 6 4 0.4316668 -0.122117652 0.750235401
## NumStorePurchases_z NumWebVisitsMonth_z Dissimilarity ClusterWard
## 1 0.3202131 -0.93466037 1.515649 1
## 2 -0.8762537 1.60684763 2.316381 2
## 3 1.2175632 -1.35824504 2.227518 3
## 4 -0.2780203 -0.93466037 2.033191 1
## 5 2.1149133 -0.08749104 2.534209 3
## 6 2.1149133 -0.51107571 2.335158 3
In the table above we can see to which group each unit has been clustered after the hierarhical clustering (this can be shown under: ClusterWard).
#Calculating positions of initial leaders
Initial_leaders <- aggregate(mydata[, c("Income_z", "Recency_z", "NumWebPurchases_z", "NumStorePurchases_z", "NumWebVisitsMonth_z")],
by = list(mydata$ClusterWard),
FUN = mean)
Initial_leaders
## Group.1 Income_z Recency_z NumWebPurchases_z NumStorePurchases_z
## 1 1 NA 0.12347411 0.9778864 0.3441424
## 2 2 NA -0.12637835 -0.8751743 -0.8910249
## 3 3 1.072744 0.04580936 0.1776585 0.9211657
## NumWebVisitsMonth_z
## 1 0.4275879
## 2 0.5086652
## 3 -1.2350204
In the table above, the initial leaders are shown, where all the numbers that contain minuses mean that those are below average.
library(tidyr)
mydata <- drop_na(mydata) #Removing units with NA
library(factoextra)
#Performing K-Means clustering - initial leaders are chosen as centroids of groups, found with hierarhical clustering
K_MEANS <- hkmeans(mydata[c("Income_z", "Recency_z", "NumWebPurchases_z", "NumStorePurchases_z", "NumWebVisitsMonth_z")],
k = 3,
hc.metric = "euclidean",
hc.method = "ward.D2")
K_MEANS
## Hierarchical K-means clustering with 3 clusters of sizes 100, 178, 117
##
## Cluster means:
## Income_z Recency_z NumWebPurchases_z NumStorePurchases_z
## 1 0.3046569 0.18059917 1.2852154 0.5535241
## 2 -0.8794536 -0.08334047 -0.7908855 -0.8527277
## 3 1.0775816 -0.03250244 0.1048998 0.8264106
## NumWebVisitsMonth_z
## 1 0.4504615
## 2 0.5431266
## 3 -1.2170502
##
## Clustering vector:
## [1] 3 2 3 3 1 3 2 1 2 1 2 1 2 1 3 2 1 2 3 2 3 2 3 1 3 2 1 1 2 1 2 3 3 2 3 2 2
## [38] 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 1 3 2 3 2 2 1 2 3 3 1 1 1 2 2 1 3 3 2 2 2 2
## [75] 2 1 1 3 2 2 2 3 2 3 3 2 3 3 2 1 2 1 2 1 3 1 3 2 1 3 2 2 3 3 1 2 3 1 2 3 1
## [112] 1 3 2 2 1 2 3 2 3 1 3 2 3 1 2 3 1 2 2 1 3 2 2 3 2 3 2 2 3 2 1 1 3 2 2 2 1
## [149] 2 1 2 3 1 2 2 3 2 2 2 3 2 2 2 1 2 2 3 3 2 3 2 3 2 3 2 2 2 1 1 3 2 2 3 2 2
## [186] 2 3 2 2 1 1 2 3 2 3 3 2 2 1 1 3 2 2 2 1 3 2 1 3 2 2 1 2 1 2 2 3 3 2 3 1 2
## [223] 1 1 2 1 2 3 2 1 3 2 2 2 1 1 3 2 3 3 2 3 2 1 3 2 3 2 2 2 2 2 1 2 3 1 2 3 2
## [260] 2 2 3 3 3 1 2 2 3 2 2 2 2 1 1 3 1 1 2 2 1 2 2 2 3 2 3 1 2 3 1 2 2 1 3 3 3
## [297] 1 2 2 1 2 2 2 2 2 2 1 3 2 1 1 3 1 1 3 1 2 3 3 1 3 2 2 3 1 1 2 3 3 1 2 1 3
## [334] 3 3 2 1 3 2 3 2 1 2 1 2 1 3 3 2 3 2 1 1 3 3 1 2 3 3 3 1 2 3 3 3 3 1 3 1 2
## [371] 3 2 1 3 2 3 2 2 1 2 3 2 1 2 2 1 1 3 3 2 1 2 2 1 3
##
## Within cluster sum of squares by cluster:
## [1] 269.6025 361.7333 290.4907
## (between_SS / total_SS = 53.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
Above in the section where it says: Cluster means, we can see the final leaders, and we can compare them to the initial leaders and see if there have been any reclassifications.
The ratio between the between sum of squares and total sum of squares is 53,3% which is not bad, but also not very good. This number tells us how much units vary in each cluster groups. We want this number to be as high as possible because that would mean that units in clusters are very homogeneous (clusters contain similar units).
fviz_cluster(K_MEANS,
palette = "jama",
repel = FALSE,
ggtheme = theme_classic())
mydata$ClusterK_Means <- K_MEANS$cluster
head(mydata[c("ID", "ClusterWard", "ClusterK_Means")])
## ID ClusterWard ClusterK_Means
## 1 7010 1 3
## 2 8779 2 2
## 3 1544 3 3
## 4 10673 1 3
## 5 3463 3 1
## 6 2021 3 3
Above we can see some reclassificatios of the units after K-Means clustering was performed.
#Checking for reclassifications
table(mydata$ClusterWard)
##
## 1 2 3
## 124 161 110
table(mydata$ClusterK_Means)
##
## 1 2 3
## 100 178 117
table(mydata$ClusterWard, mydata$ClusterK_Means)
##
## 1 2 3
## 1 91 18 15
## 2 0 160 1
## 3 9 0 101
In third group performed by K-Means clustering, there are 117 units. 101 units were in the 3. group after hierarhical clustering, 15 units were reclassified here from the 1. group and 1 unit was reclassified from the 2. group.
In the 2. group performed by hierarhical clustering there were 161 units, but after K-Means clustering was performed 1 unit was reclassified to group 3, and 160 units remained in the second group.
Centroids <- K_MEANS$centers
Centroids
## Income_z Recency_z NumWebPurchases_z NumStorePurchases_z
## 1 0.3046569 0.18059917 1.2852154 0.5535241
## 2 -0.8794536 -0.08334047 -0.7908855 -0.8527277
## 3 1.0775816 -0.03250244 0.1048998 0.8264106
## NumWebVisitsMonth_z
## 1 0.4504615
## 2 0.5431266
## 3 -1.2170502
library(ggplot2)
library(tidyr)
Figure <- as.data.frame(Centroids)
Figure$id <- 1:nrow(Figure)
Figure <- pivot_longer(Figure, cols = c(Income_z, Recency_z, NumWebPurchases_z, NumStorePurchases_z, NumWebVisitsMonth_z))
Figure$Groups <- factor(Figure$id,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
Figure$nameFactor <- factor(Figure$name,
levels = c("Income_z", "Recency_z", "NumWebPurchases_z", "NumStorePurchases_z", "NumWebVisitsMonth_z"),
labels = c("Income_z", "Recency_z", "NumWebPurchases_z", "NumStorePurchases_z", "NumWebVisitsMonth_z"))
ggplot(Figure, aes(x = nameFactor, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Groups, col = Groups), size = 3) +
geom_line(aes(group = id), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables") +
ylim(-1.5, 1.5)
As shown in the ggplot above we can see that people in group 1 are above average in all 5 cluster variables, where people in group 2 are below average in basically all of the cluster variables, except in NumWebVisitsMonth_z (where they are above). And lastly people in 3 cluster group are above average in Income_z, NumStorePurchases_z and they are also slightly above average in NumWebPurchases_z. On the other hand, in the same cluster group (3), people are below average in NumWebVisitsMonth_z and around average in Recency_z.
H0: Mean of Income_z for group 1 is equal to the mean of Income_z for
group 2 is equal to the mean of Income_z for group 3.
H1: At least one arithmetic mean is different.
H0: Mean of Recency-z for group 1 is equal to the mean of Recency_z
for group 2 is equal to the mean of Recency_z for group 3.
H1: At least one arithmetic mean is different.
H0: Mean of NumWebPurchases_z for group 1 is equal to the mean of
NumWebPurchases_z for group 2 is equal to the mean of NumWebPurchases_z
for group 3.
H1: At least one arithmetic mean is different.
H0: Mean of NumStorePurchases_z for group 1 is equal to the mean of
NumStorePurchases_z for group 2 is equal to the mean of
NumStorePurchases_z for group 3.
H1: At least one arithmetic mean is different.
H0: Mean of NumWebVisitsMonth_z for group 1 is equal to the mean of
NumWebVisitsMonth_z for group 2 is equal to the mean of
NumWebVisitsMonth_z for group 3.
H1: At least one arithmetic mean is different.
#Checking if all the cluster variables are successful at classifing units into groups by performing ANOVAs.
fit <- aov(cbind(Income_z, Recency_z, NumWebPurchases_z, NumStorePurchases_z, NumWebVisitsMonth_z) ~ as.factor(ClusterK_Means),
data = mydata)
summary(fit)
## Response 1 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 282.81 141.406 498.54 < 2.2e-16 ***
## Residuals 392 111.19 0.284
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 2 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 4.62 2.31034 2.3322 0.09843 .
## Residuals 392 388.33 0.99065
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 3 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 277.80 138.902 465.21 < 2.2e-16 ***
## Residuals 392 117.04 0.299
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 4 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 239.98 119.988 302.31 < 2.2e-16 ***
## Residuals 392 155.59 0.397
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 5 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 246.10 123.050 322.27 < 2.2e-16 ***
## Residuals 392 149.67 0.382
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I can reject the H0 for all cluster variables (p<0,001), except for Recency_z. Therefore I would need to do the clustering process again without the Recency_z as a cluster variable, but due to educational pressure at the moment I will just continue with the analysis.
H0: Mean of Kidhome for group 1 is equal to the mean of Kidhome for
group 2 is equal to the mean of Kidhome for group 3.
H1: At least one arithmetic mean is different.
#Checking validity of Kidhome
aggregate(mydata$Kidhome,
by = list(mydata$ClusterK_Means),
FUN = "mean")
## Group.1 x
## 1 1 0.32000000
## 2 2 0.84269663
## 3 3 0.05128205
fit <- aov(Kidhome ~ as.factor(ClusterK_Means),
data = mydata)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 47.47 23.737 112 <2e-16 ***
## Residuals 392 83.05 0.212
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the p-value I reject H0 at (p<0,001). I validated my results with Kidhome.
H0: Mean of Teenhome for group 1 is equal to the mean of Teenhome for
group 2 is equal to the mean of Teenhome for group 3.
H1: At least one arithmetic mean is different.
#Checking validity of Teenhome
aggregate(mydata$Teenhome,
by = list(mydata$ClusterK_Means),
FUN = "mean")
## Group.1 x
## 1 1 0.7600000
## 2 2 0.4887640
## 3 3 0.3504274
fit <- aov(Teenhome ~ as.factor(ClusterK_Means),
data = mydata)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 9.29 4.647 16.97 8.57e-08 ***
## Residuals 392 107.35 0.274
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the p-value I reject H0 at (p<0,001). I validated my results with Teenhome.
Classification of 397 random people out of this data set was based on 5 standardized variables.
For hierarchical clustering, Ward’s algorithm was used, and based on the analysis of the dendrogram it was decided to classify the students into three groups. The classification was further optimized using the K-Means cluster.
Group 2 contains the most students (45%, 178/397), characterized by a lower than average value of all cluster variables except NumWebVisitMonth. At the end (after reclassification) there were 178 people in the second cluster (group).
Theoretically I wasn’t able to divide the people into homogeneous groups based on the 5 cluster variables, because the arithmetic mean of the cluster variable Recency_z wasn’t statistically different among all 3 cluster groups.