In this problem, we’ll use the dataset Households.csv, which contains data collected over two years for a group of 2,500 households. Each row (observation) in our dataset represents a unique household. The dataset contains the following variables:

- NumVisits = the number of times the household visited the retailer
- AvgProdCount = the average number of products purchased per transaction
- AvgDiscount = the average discount per transaction from coupon usage (in %) - NOTE: Do not divide this value by 100!
- AvgSalesValue = the average sales value per transaction
- MorningPct = the percentage of visits in the morning (8am - 1:59pm)
- AfternoonPct = the percentage of visits in the afternoon (2pm - 7:59pm)

Note that some visits can occur outside of morning and afternoon hours. That is, visits from 8pm - 7:59am are possible.

This dataset was derived from source files provided by dunnhumby, a customer science company based in the United Kingdom.

Preparing the dataset

households <- read.csv("Households.csv")
str(households)
## 'data.frame':    2500 obs. of  6 variables:
##  $ NumVisits    : int  86 45 47 30 40 250 59 113 20 9 ...
##  $ AvgProdCount : num  20.08 15.87 19.62 10.03 5.55 ...
##  $ AvgDiscount  : num  8.11 7.44 14.37 3.85 2.96 ...
##  $ AvgSalesValue: num  50.4 43.4 56.5 40 19.5 ...
##  $ MorningPct   : num  46.51 8.89 14.89 13.33 2.5 ...
##  $ AfternoonPct : num  51.2 60 76.6 56.7 67.5 ...
head(households)
##   NumVisits AvgProdCount AvgDiscount AvgSalesValue MorningPct AfternoonPct
## 1        86     20.08140    8.105116      50.35070  46.511628     51.16279
## 2        45     15.86667    7.444222      43.42978   8.888889     60.00000
## 3        47     19.61702   14.365106      56.45128  14.893617     76.59574
## 4        30     10.03333    3.855000      40.00367  13.333333     56.66667
## 5        40      5.55000    2.958250      19.47650   2.500000     67.50000
## 6       250      7.16400    3.313360      23.98464  25.600000     61.20000

How many households have logged transactions at the retailer only in the morning?

table(households$MorningPct == 100)
## 
## FALSE  TRUE 
##  2496     4

Only 4 households logged transactions exclusively in the morning (MorningPct == 100).

How many households have logged transactions at the retailer only in the afternoon?

table(households$AfternoonPct == 100)
## 
## FALSE  TRUE 
##  2487    13

13 households logged transactions exclusively in the afternoon (AfternoonPct == 100).

Descriptive statistics

Of the households that spend more than $150 per transaction on average, what is the minimum average discount per transaction?

min(subset(households, AvgSalesValue > 150, AvgDiscount))
## [1] 15.64607
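
An equivalent way to get the same number uses logical indexing instead of subset(); this is just a sketch of an alternative, and it should return the same minimum as above:

# Keep only the AvgDiscount values for households whose average sales value exceeds $150
min(households$AvgDiscount[households$AvgSalesValue > 150])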

Of the households who have an average discount per transaction greater than 25%, what is the minimum average sales value per transaction?

min(subset(households, AvgDiscount > 25, AvgSalesValue))
## [1] 50.1175

In the dataset, what proportion of households visited the retailer at least 300 times?

table(households$NumVisits >= 300)
## 
## FALSE  TRUE 
##  2352   148
148/(2352+148)
## [1] 0.0592
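
The proportion can also be computed in one step by taking the mean of the logical vector (TRUE counts as 1, FALSE as 0); this should reproduce 148/2500 = 0.0592:

# Proportion of households with at least 300 visits
mean(households$NumVisits >= 300)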

Importance of Normalizing

When clustering data, it is often important to normalize the variables so that they are all on the same scale. If you clustered this dataset without normalizing, which variable would you expect to dominate in the distance calculations?

# We would expect NumVisits to dominate, because it is on the largest scale.
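
This expectation can be checked by comparing the spread of each variable on its original scale; a quick base-R sketch:

# Standard deviation of each variable on its original scale;
# the variable with the largest spread dominates Euclidean distance calculations
sapply(households, sd)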

Normalizing the Data

Normalize all of the variables in the Households dataset by entering the following commands in your R console: (Note that these commands assume that your dataset is called “households”, and create the normalized dataset “HouseholdsNorm”. You can change the names to anything you want by editing the commands.)

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
preproc = preProcess(households)

HouseholdsNorm = predict(preproc, households)

(Remember that for each variable, the normalization process subtracts the mean and divides by the standard deviation. We learned how to do this in Unit 6.) In your normalized dataset, all of the variables should have mean 0 and standard deviation 1.
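
The same z-score transformation can be reproduced manually with base R's scale(); the object name HouseholdsNormManual below is just an illustrative choice, and the result should agree with the caret output up to floating-point error:

# Manual normalization: subtract each column's mean and divide by its standard deviation
HouseholdsNormManual <- as.data.frame(scale(households))

# Sanity check: every variable in the normalized data should have mean ~0 and sd 1
round(colMeans(HouseholdsNorm), 10)
sapply(HouseholdsNorm, sd)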

What is the maximum value of NumVisits in the normalized dataset?

max(HouseholdsNorm$NumVisits)
## [1] 10.28281

What is the minimum value of AfternoonPct in the normalized dataset?

min(HouseholdsNorm$AfternoonPct)
## [1] -3.228427

Run the following code to create a dendrogram of your data:

set.seed(200)
distances <- dist(HouseholdsNorm, method = "euclidean")
ClusterShoppers <- hclust(distances, method = "ward.D")
plot(ClusterShoppers, labels = FALSE)

Interpreting the Dendrogram

Based on the dendrogram, how many clusters do you think would be appropriate for this problem? Select all that apply.

# Four clusters and six clusters leave very little "wiggle room" in the dendrogram, meaning the additional cluster is not very distinct from the existing ones: moving from 3 to 4 clusters (or from 5 to 6) adds a cluster that is very similar to one that already exists.
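
One way to sanity-check a candidate number of clusters is to cut the dendrogram at that value of k and look at the resulting cluster sizes; a sketch using the ClusterShoppers object built above:

# Cluster sizes when the tree is cut into 3 clusters and into 5 clusters
table(cutree(ClusterShoppers, k = 3))
table(cutree(ClusterShoppers, k = 5))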

K-means Clustering

Run the k-means clustering algorithm on your normalized dataset, selecting 10 clusters. Right before using the kmeans function, type “set.seed(200)” in your R console.

set.seed(200)

How many observations are in the smallest cluster?

kmeansClust <- kmeans(HouseholdsNorm, centers = 10, iter.max = 2000)

table(kmeansClust$cluster)
## 
##   1   2   3   4   5   6   7   8   9  10 
## 246  51 490 118 504 226 141 284  52 388

The smallest cluster (Cluster 2) has 51 observations.

How many observations are in the largest cluster?

The largest cluster (Cluster 5) has 504 observations.
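
Both answers can also be read off programmatically from the assignment table (clusterSizes is just a throwaway name used for this sketch):

clusterSizes <- table(kmeansClust$cluster)
# Size and index of the smallest cluster
min(clusterSizes); which.min(clusterSizes)
# Size and index of the largest cluster
max(clusterSizes); which.max(clusterSizes)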

Understanding the Clusters

Now, use the cluster assignments from k-means clustering together with the cluster centroids to answer the next few questions.

kmeansClust$centers
##      NumVisits AvgProdCount AvgDiscount AvgSalesValue  MorningPct
## 1  -0.24811488   1.47685425   1.2995075     1.4630282 -0.34840552
## 2  -0.48316783   3.73740748   3.4739658     3.5747198  0.19984565
## 3  -0.23416257   0.29953633   0.2910342     0.3028503 -0.18244421
## 4  -0.17987013  -0.54192010  -0.4572379    -0.5481618  2.49106094
## 5  -0.24562303  -0.73554976  -0.6988240    -0.7400683 -0.54700392
## 6   1.48010544  -0.36385774  -0.3526725    -0.3240381  0.06527668
## 7  -0.09256621   0.86666142   0.9044825     0.9880042  1.44404053
## 8  -0.26199562  -0.04997603  -0.1057321    -0.1203396 -0.89830632
## 9   4.46367311  -0.85145063  -0.7674102    -0.8051772 -0.23899301
## 10 -0.34463938  -0.63794295  -0.5443075    -0.6261987  0.50474863
##    AfternoonPct
## 1   0.658274557
## 2  -0.127626782
## 3  -0.016514760
## 4  -1.811251943
## 5   0.225297175
## 6   0.008440554
## 7  -0.979661193
## 8   1.419272373
## 9  -0.170491325
## 10 -0.786442091

Which cluster best fits the description “morning shoppers stopping in to make a quick purchase”?

# Cluster 4: its centroid has by far the highest MorningPct (2.49) combined with below-average AvgProdCount and AvgSalesValue, i.e. quick morning trips.

Which cluster best fits the description “shoppers with high average product count and high average value per visit”?

# Cluster 2: its centroid has the highest AvgProdCount (3.74) and the highest AvgSalesValue (3.57).

Which cluster best fits the description “frequent shoppers with low value per visit”?

# Cluster 9: its centroid has the highest NumVisits (4.46) and the lowest AvgSalesValue (-0.81).
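
Instead of scanning the centroid matrix by eye, the extreme cluster on each variable can be located programmatically; a small sketch over the centers printed above:

# For every variable, which cluster has the largest centroid value?
apply(kmeansClust$centers, 2, which.max)
# For example, the cluster with the highest visit frequency
which.max(kmeansClust$centers[, "NumVisits"])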

Understanding Centroids

Why do we typically use cluster centroids to describe the clusters?

#The cluster centroid captures the average behavior in the cluster, and can be used to summarize the general pattern in the cluster.
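
As a check, the centroids are simply the within-cluster means of the normalized variables, which can be reproduced directly; a sketch using the objects defined above, whose output should match kmeansClust$centers:

# Average of each normalized variable within each k-means cluster
aggregate(HouseholdsNorm, by = list(cluster = kmeansClust$cluster), FUN = mean)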