Learning objectives

By the end of this lab session, you should be able to:

1. Understand how cloud computing works in RStudio Cloud (in beta release at the time of writing this tutorial).

2. Understand how to import your own data to the cloud environment.

3. Create descriptive stats to help understand the frequency distributions of your data.

4. Understand how hierarchical cluster analysis works.

5. Perform a very basic cluster analysis using RStudio Cloud.

6. Understand how to interpret your cluster analysis results.

7. Understand how to export your final results from the cloud environment to your own computer.

8. Understand how to use some basic packages and custom functions to process your data (optional).

Reading

For hierarchical clustering and exploratory data analysis, read Chapter 12, “Cluster Analysis,” from An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (pp. 385-399).

Remember this is just a starting point; explore the reading list, practical, and lecture for more ideas.

Reference: Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani 2013. An Introduction to Statistical Learning with Applications in R. https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Notice:

This is still an early draft. Let me know if there are any errors or typos.

Keep in mind that no programmer can avoid errors. I strongly agree with this quote from Codecademy: “Errors in your code mean you’re trying to do something cool.”

https://news.codecademy.com/errors-in-code-think-differently/

Segmentation

Objective: divide the target market or customers on the basis of significant features, which could help a company sell more products with lower marketing expenses.

A potentially interesting question is whether some products (or customers) are more alike than others.

Market segmentation

Market segmentation is a strategy that divides a broad target market of customers into smaller, more similar groups, and then designs a marketing strategy specifically for each group. Clustering is a common technique for market segmentation since it automatically finds similar groups given a data set.

Create a product that evokes the needs and wants of the target market

Imagine that you are the Director of Customer Relationships at Apple and want to understand consumers’ attitudes towards the iPhone 12 and Google’s Pixel 5. Once the product is created, the ball shifts to the marketing team’s court. As mentioned above, marketers use a market segmentation strategy to understand which groups of customers will be interested in which kinds of features. Cluster analysis is designed to address this problem. Doing this helps ensure the product is positioned to the segment of customers with a high propensity to buy.

Examples of Objectives

1. Identify the type of customers who would respond to a particular offer

2. Identify high spenders among customers who will use the e-commerce channel for festive shopping

3. Identify customers who will default on their credit obligation for a loan or credit card

Dataset

The file customer_segmetation.csv contains data collected by one of the student groups who took the marketing research course in spring 2020.

Importing data into RStudio Cloud - No need to download R or RStudio

Search for RStudio Cloud, register (or set up a free user account), and log in to the cloud environment with your Gmail credentials.

First, upload your dataset (.csv) from your own computer to RStudio Cloud. Make sure the first column is an ID column rather than a measured variable, because the code below drops the first column before clustering.

Once the dataset is uploaded, you will see it in the right pane of your cloud environment.

Now we will use the readr package and its read_csv function to read the dataset. The file name passed to read_csv must match the name of the file you uploaded.

library(readr)
mydata <- read_csv('Segmentation.csv')
## Rows: 221 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): ID, Fashn, Price, Convnience, ShpTime, Fitness, Perceptn, ChNoise,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
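
As a quick check after importing, it can help to confirm that the first column is the ID and to look at a few frequency distributions. The sketch below is only an illustration: it uses base R's read.csv on a small made-up file (the file contents and the names toy_path, toy, and fashn_freq are hypothetical, not the course dataset). With the real data, the same calls could be applied to mydata.

```r
# Hypothetical toy file standing in for the uploaded .csv
toy_path <- tempfile(fileext = ".csv")
writeLines(c("ID,Fashn,Price",
             "1,3,5",
             "2,4,5",
             "3,3,2"), toy_path)

toy <- read.csv(toy_path)        # base R alternative to readr::read_csv

names(toy)[1]                    # the first column should be the ID
summary(toy[, -1])               # descriptive statistics for each rating variable
fashn_freq <- table(toy$Fashn)   # frequency distribution of one variable
fashn_freq
```

summary() reports the minimum, quartiles, mean, and maximum of each column, while table() counts how often each rating value occurs.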

Standardizing data

In the following step, you will standardize your data (i.e., rescale each variable to a mean of 0 and a standard deviation of 1). You can use the scale function from base R, a generic function whose default method centers and/or scales the columns of a numeric matrix.
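
To see what standardizing does, here is a minimal sketch on a made-up matrix (the variable names income and rating and all values are hypothetical). It checks that every column ends up with mean 0 and standard deviation 1, and that scale gives the same result as subtracting the column mean and dividing by the column standard deviation.

```r
# Toy matrix: two variables on very different scales (made-up values)
x <- cbind(income = c(20000, 35000, 50000, 65000),
           rating = c(1, 3, 5, 7))

z <- scale(x, center = TRUE, scale = TRUE)

col_means <- colMeans(z)       # approximately 0 for every column
col_sds   <- apply(z, 2, sd)   # 1 for every column
round(col_means, 10)
col_sds

# Same result by hand for one column: subtract the mean, divide by the sd
manual <- (x[, "rating"] - mean(x[, "rating"])) / sd(x[, "rating"])
all.equal(as.numeric(z[, "rating"]), manual)
```

Standardizing matters for clustering because distances would otherwise be dominated by whichever variable has the largest numeric range.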

Building the distance function and plotting the trees (dendrograms)

Hierarchical clustering (using the function hclust) is an informative way to visualize the data.

We will see whether we can discover subgroups among the variables or among the observations.

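use <- scale(mydata[, -1], center = TRUE, scale = TRUE)  # drop the ID column, then standardize
dist_obs <- dist(use)                # Euclidean distances between observations
# Note: the next line takes distances between the rows of the distance matrix itself,
# not between the observations directly. The conventional call is hclust(dist(use)),
# which may give different clusters from the output shown below.
d <- dist(as.matrix(dist_obs))       # find distance matrix
seg.hclust <- hclust(d)              # apply hierarchical clustering (complete linkage by default)
plot(seg.hclust)                     # base R dendrogram; no extra package is needed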
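
For comparison, here is a minimal sketch of the conventional pipeline (standardize, compute distances once, cluster, cut the tree) on made-up data with two obvious groups. The data frame toy and every name derived from it are hypothetical and are not part of the course dataset.

```r
# Made-up data: rows 1-3 are close together, rows 4-6 are close together
toy <- data.frame(ID = 1:6,
                  x1 = c(1.0, 1.2, 0.9, 8.0, 8.3, 7.9),
                  x2 = c(2.0, 2.1, 1.8, 9.0, 9.2, 8.8))

toy_scaled <- scale(toy[, -1])    # drop ID, then standardize
toy_dist   <- dist(toy_scaled)    # Euclidean distances between rows
toy_hc     <- hclust(toy_dist, method = "complete")  # "complete" is the default linkage

toy_groups <- cutree(toy_hc, k = 2)   # cut the tree into 2 clusters
toy_groups
```

Calling plot(toy_hc, labels = toy$ID) would draw the dendrogram for this toy tree. Rows 1 to 3 and rows 4 to 6 should land in separate clusters, which is an easy way to confirm the pipeline behaves as expected before applying it to real data.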

Identifying cluster memberships for each cluster

Imagine your goal is to find profitable customers to target. Cutting the tree into a fixed number of clusters shows how many customers fall into each cluster.

groups.3 <- cutree(seg.hclust, 3)  # cut the tree into 3 clusters
# A good first step is to use the table function to see how many observations are in each cluster
table(groups.3)
## groups.3
##   1   2   3 
##   9 170  42
# In the following step, we will find the members in each cluster or group.
mydata$ID[groups.3 == 1]
## [1]   1   3  16  59  98 101 143 167 170
mydata$ID[groups.3 == 2]
##   [1]   2   5   6   7   8  10  14  15  18  20  21  22  23  24  25  26  27  28
##  [19]  29  30  32  33  35  36  37  38  39  40  41  42  43  44  45  46  47  49
##  [37]  51  52  54  55  56  60  61  62  63  64  67  68  69  70  71  72  73  74
##  [55]  76  78  79  80  82  83  84  85  88  89  90  91  92  93  94  96  97  99
##  [73] 103 106 107 108 110 113 114 115 117 119 120 121 122 123 124 125 126 127
##  [91] 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 144 145 146
## [109] 147 148 150 151 153 154 156 157 158 159 160 161 162 163 164 165 166 168
## [127] 171 173 174 175 176 178 179 180 181 182 183 184 185 186 187 188 189 190
## [145] 192 193 195 196 197 198 199 201 202 203 204 205 207 208 209 210 211 212
## [163] 213 214 216 217 218 219 220 221
mydata$ID[groups.3 == 3]
##  [1]   4   9  11  12  13  17  19  31  34  48  50  53  57  58  65  66  75  77  81
## [20]  86  87  95 100 102 104 105 109 111 112 116 118 149 152 155 169 172 177 191
## [39] 194 200 206 215
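
The three subsetting calls above can also be written in one step with split. This is a minimal sketch on made-up values: ids and groups are hypothetical stand-ins for mydata$ID and groups.3.

```r
ids    <- c(101, 102, 103, 104, 105, 106)   # stand-in for mydata$ID
groups <- c(1, 2, 2, 1, 3, 2)               # stand-in for groups.3

members <- split(ids, groups)   # list with one vector of IDs per cluster
members

sizes <- lengths(members)       # same counts as table(groups)
sizes
```

With the real data, split(mydata$ID, groups.3) would return the same three ID vectors in a single list, which is convenient when the number of clusters changes.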

Identifying common features of each cluster using the aggregate function

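# ?aggregate   # uncomment to open the help page
# Summarize every column by cluster; the ID column is included here, so its median is not meaningful
aggregate(mydata, list(groups.3), median)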
##   Group.1    ID Fashn Price Convnience ShpTime Fitness Perceptn ChNoise
## 1       1  98.0     1     5          5       5       3        3       6
## 2       2 122.5     3     5          3       4       6        5       4
## 3       3  91.0     2     5          4       5       6        4       5
##   RetailEx KnowdgStaf Brand4Slf Brand4Els Populr StrDisp SaleStaf Fabric Cut
## 1      6.0          6         1         1    1.0       1        2    1.0   4
## 2      6.0          5         4         4    4.0       4        5    6.0   6
## 3      5.5          5         2         2    1.5       2        4    5.5   6
##   Seam ShpOHngr ShpOBody Colrs Match
## 1  1.0        1      4.0     4     1
## 2  5.5        4      7.0     6     4
## 3  5.0        4      6.5     5     2
aggregate(mydata,list(groups.3),mean)
##   Group.1        ID    Fashn    Price Convnience  ShpTime  Fitness Perceptn
## 1       1  84.22222 2.111111 5.555556   5.000000 5.000000 4.000000 3.222222
## 2       2 116.28235 3.352941 4.623529   3.582353 3.905882 5.917647 4.411765
## 3       3  95.35714 2.833333 4.666667   4.238095 4.595238 5.309524 3.928571
##    ChNoise RetailEx KnowdgStaf Brand4Slf Brand4Els   Populr  StrDisp SaleStaf
## 1 4.666667 5.444444   5.111111  1.666667  1.444444 1.777778 2.000000 3.444444
## 2 4.070588 5.300000   4.676471  4.111765  4.423529 4.064706 3.676471 4.535294
## 3 4.452381 5.119048   4.523810  2.761905  2.619048 2.404762 3.023810 4.119048
##     Fabric      Cut     Seam ShpOHngr ShpOBody    Colrs    Match
## 1 2.888889 3.333333 1.444444 2.333333 3.777778 3.222222 1.333333
## 2 5.658824 6.017647 5.341176 4.364706 6.458824 5.482353 4.323529
## 3 4.857143 5.785714 4.761905 3.619048 6.190476 4.595238 3.190476
# Drop the ID column (column 1) before summarizing
aggregate(mydata[, -1], list(groups.3), median)
##   Group.1 Fashn Price Convnience ShpTime Fitness Perceptn ChNoise RetailEx
## 1       1     1     5          5       5       3        3       6      6.0
## 2       2     3     5          3       4       6        5       4      6.0
## 3       3     2     5          4       5       6        4       5      5.5
##   KnowdgStaf Brand4Slf Brand4Els Populr StrDisp SaleStaf Fabric Cut Seam
## 1          6         1         1    1.0       1        2    1.0   4  1.0
## 2          5         4         4    4.0       4        5    6.0   6  5.5
## 3          5         2         2    1.5       2        4    5.5   6  5.0
##   ShpOHngr ShpOBody Colrs Match
## 1        1      4.0     4     1
## 2        4      7.0     6     4
## 3        4      6.5     5     2
aggregate(mydata[,-1],list(groups.3),mean)
##   Group.1    Fashn    Price Convnience  ShpTime  Fitness Perceptn  ChNoise
## 1       1 2.111111 5.555556   5.000000 5.000000 4.000000 3.222222 4.666667
## 2       2 3.352941 4.623529   3.582353 3.905882 5.917647 4.411765 4.070588
## 3       3 2.833333 4.666667   4.238095 4.595238 5.309524 3.928571 4.452381
##   RetailEx KnowdgStaf Brand4Slf Brand4Els   Populr  StrDisp SaleStaf   Fabric
## 1 5.444444   5.111111  1.666667  1.444444 1.777778 2.000000 3.444444 2.888889
## 2 5.300000   4.676471  4.111765  4.423529 4.064706 3.676471 4.535294 5.658824
## 3 5.119048   4.523810  2.761905  2.619048 2.404762 3.023810 4.119048 4.857143
##        Cut     Seam ShpOHngr ShpOBody    Colrs    Match
## 1 3.333333 1.444444 2.333333 3.777778 3.222222 1.333333
## 2 6.017647 5.341176 4.364706 6.458824 5.482353 4.323529
## 3 5.785714 4.761905 3.619048 6.190476 4.595238 3.190476
cluster_means <- aggregate(mydata[,-1],list(groups.3),mean)
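
As a small illustration of how aggregate builds a cluster profile, the sketch below uses a made-up data frame (toy, groups, and all values are hypothetical). Naming the element of the grouping list is optional; it only replaces the default column name Group.1 with something more readable.

```r
toy <- data.frame(ID    = 1:6,
                  Price = c(5, 4, 5, 1, 2, 2),
                  Brand = c(1, 2, 1, 6, 7, 5))
groups <- c(1, 1, 1, 2, 2, 2)   # stand-in for groups.3

# Naming the list element gives a readable grouping column instead of "Group.1"
toy_means   <- aggregate(toy[, -1], list(cluster = groups), mean)
toy_medians <- aggregate(toy[, -1], list(cluster = groups), median)
toy_means
toy_medians
```

Each row of the result is one cluster and each column is the summary of one variable, which is the same layout as the cluster_means table above.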

Exporting cluster analysis results into Excel from RStudio Cloud

write.csv(groups.3, "clusterID.csv")
write.csv(cluster_means, "cluster_means.csv")
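
One optional variation is to store the customer ID next to its cluster number, so the exported file can be matched back to the original data. The sketch below uses made-up values and a temporary file; ids, groups, and the file name clusterID_with_ids.csv are hypothetical.

```r
ids    <- c(101, 102, 103)     # stand-in for mydata$ID
groups <- c(1, 2, 1)           # stand-in for groups.3

membership <- data.frame(ID = ids, cluster = groups)

out_path <- file.path(tempdir(), "clusterID_with_ids.csv")  # hypothetical file name
write.csv(membership, out_path, row.names = FALSE)  # row.names = FALSE drops the extra index column

check <- read.csv(out_path)    # read the file back to confirm its layout
check
```

With the real data, the equivalent would be a data frame built from mydata$ID and groups.3, written to a file in the project folder rather than to tempdir().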

Downloading your solutions manually

First, select the files (“clusterID.csv” & “cluster_means.csv”) and put a checkmark before each file.

Second, click the gear icon on the right side of your pane and export the data.

Finding means or medians of each variable (factor) for each cluster

Imagine your goal is to find profitable customers to target. Using the mean function or the median function, you can see the characteristics of each subgroup. Now it is time to use your domain expertise.

Discussion Questions for you

  1. How many observations do we have in each cluster?

Answer: Your answer here:

  2. We can look at the medians (or means) for the variables in each cluster. Why is this important?

Answer: Your answer here:

  3. Do you think the mean or the median should be used when analyzing the differences among clusters? Why?

Answer: Your answer here:

  4. Now we need to understand the common characteristics of each cluster. Our goal is to build a targeting strategy using the profiles of each cluster. What summary measures of each cluster are appropriate in a descriptive sense?

Answer: Your answer here:

  5. Are there any major differences between K-means clustering (https://rpubs.com/utjimmyx/kmeans) and hierarchical clustering? Which one do you like better? Why? You may refer to the assigned readings.

  6. Do a keyword search using “cluster analysis.” How many relevant job titles are there?

Answer: Your answer here: