In this problem, we’ll use the dataset Households.csv, which contains data collected over two years for a group of 2,500 households. Each row (observation) in our dataset represents a unique household. The dataset contains the following variables:
NumVisits = the number of times the household visited the retailer
AvgProdCount = the average number of products purchased per transaction
AvgDiscount = the average discount per transaction from coupon usage (in %) - NOTE: Do not divide this value by 100!
AvgSalesValue = the average sales value per transaction
MorningPct = the percentage of visits in the morning (8am - 1:59pm)
AfternoonPct = the percentage of visits in the afternoon (2pm - 7:59pm)
Note that some visits can occur outside of morning and afternoon hours. That is, visits from 8pm - 7:59am are possible (a quick check of this remaining share appears after the head() output below).
This dataset was derived from source files provided by dunnhumby, a customer science company based in the United Kingdom.
households <- read.csv("Households.csv")
str(households)
## 'data.frame': 2500 obs. of 6 variables:
## $ NumVisits : int 86 45 47 30 40 250 59 113 20 9 ...
## $ AvgProdCount : num 20.08 15.87 19.62 10.03 5.55 ...
## $ AvgDiscount : num 8.11 7.44 14.37 3.85 2.96 ...
## $ AvgSalesValue: num 50.4 43.4 56.5 40 19.5 ...
## $ MorningPct : num 46.51 8.89 14.89 13.33 2.5 ...
## $ AfternoonPct : num 51.2 60 76.6 56.7 67.5 ...
head(households)
## NumVisits AvgProdCount AvgDiscount AvgSalesValue MorningPct AfternoonPct
## 1 86 20.08140 8.105116 50.35070 46.511628 51.16279
## 2 45 15.86667 7.444222 43.42978 8.888889 60.00000
## 3 47 19.61702 14.365106 56.45128 14.893617 76.59574
## 4 30 10.03333 3.855000 40.00367 13.333333 56.66667
## 5 40 5.55000 2.958250 19.47650 2.500000 67.50000
## 6 250 7.16400 3.313360 23.98464 25.600000 61.20000
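Because some visits fall outside the morning and afternoon windows, the remaining share can be derived from the two percentage columns. A quick sanity check (the name OtherPct is just a convenience introduced here, not a variable in the dataset):
OtherPct <- 100 - households$MorningPct - households$AfternoonPct  # visits from 8pm - 7:59am
summary(OtherPct)  # values should lie between 0 and 100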
table(households$MorningPct == 100)
##
## FALSE TRUE
## 2496 4
Only 4 households made all of their visits in the morning (MorningPct equal to 100).
table(households$AfternoonPct == 100)
##
## FALSE TRUE
## 2487 13
13 households made all of their visits in the afternoon (AfternoonPct equal to 100).
min(subset(households, AvgSalesValue > 150, AvgDiscount))
## [1] 15.64607
min(subset(households, AvgDiscount > 25, AvgSalesValue))
## [1] 50.1175
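For reference, the same values can be obtained with plain logical indexing instead of subset(); a small sketch:
min(households$AvgDiscount[households$AvgSalesValue > 150])   # 15.64607, as above
min(households$AvgSalesValue[households$AvgDiscount > 25])    # 50.1175, as above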
table(households$NumVisits >= 300)
##
## FALSE TRUE
## 2352 148
148/(2352+148)
## [1] 0.0592
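Equivalently, since the mean of a logical vector is the proportion of TRUE values, the same fraction can be computed in one step:
mean(households$NumVisits >= 300)   # 148/2500 = 0.0592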
When clustering data, it is often important to normalize the variables so that they are all on the same scale. If you clustered this dataset without normalizing, which variable would you expect to dominate in the distance calculations?
# We would expect NumVisits to dominate, because it is on the largest scale.
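One way to check this expectation is to compare the spread of each variable on its original scale; a quick sketch:
sapply(households, sd)      # NumVisits is expected to show by far the largest standard deviation
sapply(households, range)   # and the widest range of values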
Normalize all of the variables in the Households dataset by entering the following commands in your R console: (Note that these commands assume that your dataset is called “households”, and create the normalized dataset “HouseholdsNorm”. You can change the names to anything you want by editing the commands.)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
preproc = preProcess(households)
HouseholdsNorm = predict(preproc, households)
(Remember that for each variable, the normalization process subtracts the mean and divides by the standard deviation. We learned how to do this in Unit 6.) In your normalized dataset, all of the variables should have mean 0 and standard deviation 1.
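A quick way to confirm that the preprocessing behaved as described (each mean should be numerically 0 and each standard deviation 1):
round(colMeans(HouseholdsNorm), 10)   # all essentially zero
sapply(HouseholdsNorm, sd)            # all equal to 1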
max(HouseholdsNorm$NumVisits)
## [1] 10.28281
min(HouseholdsNorm$AfternoonPct)
## [1] -3.228427
Run the following code to create a dendrogram of your data:
set.seed(200)
distances <- dist(HouseholdsNorm, method = "euclidean")
ClusterShoppers <- hclust(distances, method = "ward.D")
plot(ClusterShoppers, labels = FALSE)
#Four clusters and six clusters have very little "wiggle room", which means that the additional clusters are not very distinct from existing clusters. That is, when moving from 3 clusters to 4 clusters, the additional cluster is very similar to an existing one (as well as when moving from 5 clusters to 6 clusters).
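To make candidate cuts easier to see, rectangles can be drawn around the clusters directly on the dendrogram; the choice of 3 and 5 clusters here simply mirrors the discussion above:
plot(ClusterShoppers, labels = FALSE)
rect.hclust(ClusterShoppers, k = 3, border = "red")   # a 3-cluster cut
rect.hclust(ClusterShoppers, k = 5, border = "blue")  # a 5-cluster cut, for comparison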
Run the k-means clustering algorithm on your normalized dataset, selecting 10 clusters. Right before using the kmeans function, type “set.seed(200)” in your R console.
set.seed(200)
kmeansClust <- kmeans(HouseholdsNorm, centers = 10, iter.max = 2000)
table(kmeansClust$cluster)
##
## 1 2 3 4 5 6 7 8 9 10
## 246 51 490 118 504 226 141 284 52 388
The smallest cluster is Cluster 2, with 51 households; the largest is Cluster 5, with 504 households.
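The same answer can be read off programmatically from the size component of the kmeans object:
which.min(kmeansClust$size)   # cluster 2, with 51 households
which.max(kmeansClust$size)   # cluster 5, with 504 households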
Now, use the cluster assignments from k-means clustering together with the cluster centroids to answer the next few questions.
kmeansClust$centers
## NumVisits AvgProdCount AvgDiscount AvgSalesValue MorningPct
## 1 -0.24811488 1.47685425 1.2995075 1.4630282 -0.34840552
## 2 -0.48316783 3.73740748 3.4739658 3.5747198 0.19984565
## 3 -0.23416257 0.29953633 0.2910342 0.3028503 -0.18244421
## 4 -0.17987013 -0.54192010 -0.4572379 -0.5481618 2.49106094
## 5 -0.24562303 -0.73554976 -0.6988240 -0.7400683 -0.54700392
## 6 1.48010544 -0.36385774 -0.3526725 -0.3240381 0.06527668
## 7 -0.09256621 0.86666142 0.9044825 0.9880042 1.44404053
## 8 -0.26199562 -0.04997603 -0.1057321 -0.1203396 -0.89830632
## 9 4.46367311 -0.85145063 -0.7674102 -0.8051772 -0.23899301
## 10 -0.34463938 -0.63794295 -0.5443075 -0.6261987 0.50474863
## AfternoonPct
## 1 0.658274557
## 2 -0.127626782
## 3 -0.016514760
## 4 -1.811251943
## 5 0.225297175
## 6 0.008440554
## 7 -0.979661193
## 8 1.419272373
## 9 -0.170491325
## 10 -0.786442091
#Cluster 4: it has by far the largest MorningPct centroid (2.49), i.e., households that shop mostly in the morning.
#Cluster 2: it has the largest AvgProdCount, AvgDiscount, and AvgSalesValue centroids, i.e., large, high-value, heavily discounted baskets.
#Cluster 9: it has the largest NumVisits centroid (4.46), i.e., the most frequent visitors.
#The cluster centroid captures the average behavior in the cluster, and can be used to summarize the general pattern in the cluster.
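Rather than scanning the centroid table by eye, the cluster with the largest centroid for each variable can be pulled out directly; a small sketch:
apply(kmeansClust$centers, 2, which.max)
# NumVisits -> cluster 9; AvgProdCount, AvgDiscount, AvgSalesValue -> cluster 2; MorningPct -> cluster 4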