Learning objectives

By the end of this lab session, you should be able to:

1. Understand how cloud computing with R Studio Cloud works (the service was in beta release when this tutorial was written).

2. Understand how to import your own data to the cloud environment.

3. Create descriptive statistics to help understand the frequency distributions of your data.

4. Understand how hierarchical cluster analysis works.

5. Perform a very basic cluster analysis using R Studio Cloud.

6. Understand how to interpret your cluster analysis results.

7. Understand how to export your final results from the cloud environment to your own computer.

8. Understand how to use some basic packages and custom functions to process your data (optional).

Reading

For hierarchical clustering and exploratory data analysis, read Chapter 12, “Cluster Analysis”, of An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (pp. 385-399).

Remember that this is just a starting point; explore the reading list, practical and lecture for more ideas.

Reference: Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (2013). An Introduction to Statistical Learning with Applications in R. https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Segmentation

Objective: divide the target market or customers on the basis of significant features, so that a company can sell more products with lower marketing expenses.

Market segmentation

Market segmentation is a strategy that divides a broad target market of customers into smaller, more similar groups, and then designs a marketing strategy specifically for each group. Clustering is a common technique for market segmentation since it automatically finds similar groups given a data set.

Importing data

# read.csv() is part of base R, so no extra package needs to be loaded
mydata <- read.csv('wonderfulcustomer_segmentation.csv')
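Once a file is imported, it helps to check its structure and basic frequency distributions before clustering. The sketch below is a minimal, self-contained illustration: the `toy` data frame, its column names and the temporary file are hypothetical stand-ins, not the lab's survey file.

```r
# Hypothetical toy survey, written to a temporary CSV so the example runs anywhere
toy <- data.frame(
  gender = c(1, 2, 1, 2),
  age    = c(1, 2, 1, 3),
  price  = c(5, 2, 3, 4)
)
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)

survey <- read.csv(tmp)   # base R reader; returns a data frame
str(survey)               # column names and types
summary(survey)           # min, quartiles, mean and max for each column
table(survey$gender)      # frequency distribution of a single variable
sapply(survey, sd)        # standard deviation of each column
```

With your own data, the same calls would be applied to `mydata` after the import step.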

In the following step, you will standardize your data (i.e., rescale each variable to a mean of 0 and a standard deviation of 1) so that variables measured on different scales contribute equally to the distance calculation. You can use the scale function from base R, a generic function whose default method centers and/or scales the columns of a numeric matrix.
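To see what standardization does, here is a small sketch on made-up numbers (the matrix `x` and its column names are assumptions for illustration only). It checks that each scaled column has mean 0 and standard deviation 1, and that `scale` matches the manual formula (value minus mean, divided by standard deviation).

```r
# Hypothetical values on two very different scales
x <- cbind(age = c(1, 2, 3, 4), price = c(10, 20, 30, 40))

z <- scale(x)        # center = TRUE and scale = TRUE are the defaults
colMeans(z)          # each column mean is (numerically) 0
apply(z, 2, sd)      # each column standard deviation is 1

# The same result computed by hand for one column
manual <- (x[, "price"] - mean(x[, "price"])) / sd(x[, "price"])
all.equal(unname(z[, "price"]), manual)
```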

Building the distance matrix and plotting the trees (dendrograms)

Hierarchical clustering (using the function hclust) is an informative way to visualize the data.

We will see whether we can discover subgroups among the variables or among the observations.

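# Standardize every column except the first one
use <- scale(mydata[, -1], center = TRUE, scale = TRUE)
d1 <- dist(use)                  # Euclidean distances between observations
# Note: the next call takes distances between the rows of d1, i.e. a distance of distances.
# Passing d1 straight to hclust() is the more common choice and may give different clusters.
d <- dist(as.matrix(d1))
seg.hclust <- hclust(d)          # apply hierarchical clustering (complete linkage by default)
plot(seg.hclust)                 # base R dendrogram; ggplot2 is not needed here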
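The following sketch shows the usual hierarchical clustering pipeline on a tiny made-up data set, where the grouping is obvious by eye. The matrix `pts` and its row labels are hypothetical. It passes the output of `dist` directly to `hclust` and compares two linkage methods.

```r
# Hypothetical data: six observations forming two tight groups
pts <- rbind(c(1, 1), c(1.2, 0.9), c(0.8, 1.1),
             c(5, 5), c(5.1, 4.8), c(4.9, 5.2))
rownames(pts) <- paste0("obs", 1:6)

d_pts <- dist(scale(pts))                           # Euclidean distances on standardized data
hc_complete <- hclust(d_pts, method = "complete")   # the default linkage in hclust
hc_average  <- hclust(d_pts, method = "average")    # an alternative linkage

plot(hc_complete, main = "Complete linkage")        # dendrogram; leaves are the row labels
g_complete <- cutree(hc_complete, k = 2)            # cluster label for each observation
g_average  <- cutree(hc_average, k = 2)
table(g_complete, g_average)                        # do the two linkages agree?
```

On well-separated data like this, both linkages should recover the same two groups. On real survey data they can differ, so it is worth trying more than one.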

Identifying clustering memberships for each cluster

Suppose your goal is to find some profitable customers to target. The cutree function cuts the tree into a chosen number of clusters, and you can then see how many customers fall into each cluster.

groups.3 <- cutree(seg.hclust, 3)   # cut the tree into 3 clusters
table(groups.3)   # a good first step: count the observations in each cluster
## groups.3
## 1 2 3 
## 5 5 3
# In the following step, we will find the members of each cluster or group.
# mydata has no ID column (mydata$ID returns NULL), so list the row numbers instead.
which(groups.3 == 1)
which(groups.3 == 2)
which(groups.3 == 3)
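When the data does contain an identifier column, cluster membership can be looked up by ID. The sketch below assumes a hypothetical data frame `toy` with an `ID` column. It is an illustration of the lookup pattern, not the lab's data.

```r
# Hypothetical customers with an explicit ID column
toy <- data.frame(ID = 101:106,
                  x  = c(1, 1.2, 0.8, 5, 5.1, 4.9),
                  y  = c(1, 0.9, 1.1, 5, 4.8, 5.2))

hc <- hclust(dist(scale(toy[, c("x", "y")])))   # cluster on the numeric columns only
groups <- cutree(hc, k = 2)                     # one cluster label per row

table(groups)                      # cluster sizes
toy$ID[groups == 1]                # IDs of the customers in cluster 1
members <- split(toy$ID, groups)   # all clusters at once, as a named list
members
```

Note that cutree numbers the clusters by order of first appearance, so the first row always belongs to cluster 1.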

Identifying common features of each cluster using the aggregate function

# ?aggregate   # uncomment to open the help page for aggregate()
# Median of every variable within each cluster
aggregate(mydata,list(groups.3),median)
##   Group.1 gender age companies Halo_oranges experience userfriendly price
## 1       1      1   1         4            1          5            5     5
## 2       2      2   2         3            1          4            3     2
## 3       3      1   1         3            0          4            4     3
##   variety ad_effectiveness halo_organic pistachio_flavor
## 1       5                5            1                2
## 2       3                3            2                2
## 3       4                4            2                2
aggregate(mydata,list(groups.3),mean)
##   Group.1   gender age companies Halo_oranges experience userfriendly    price
## 1       1 1.400000 1.2  3.600000    0.8000000   4.800000            5 4.600000
## 2       2 1.600000 1.6  2.600000    1.0000000   3.800000            3 2.200000
## 3       3 1.333333 1.0  2.666667    0.3333333   4.333333            4 3.333333
##    variety ad_effectiveness halo_organic pistachio_flavor
## 1 4.600000              4.8          1.2              2.0
## 2 3.400000              2.8          1.6              1.6
## 3 3.666667              4.0          2.0              2.0
aggregate(mydata[,-1],list(groups.3),median)
##   Group.1 age companies Halo_oranges experience userfriendly price variety
## 1       1   1         4            1          5            5     5       5
## 2       2   2         3            1          4            3     2       3
## 3       3   1         3            0          4            4     3       4
##   ad_effectiveness halo_organic pistachio_flavor
## 1                5            1                2
## 2                3            2                2
## 3                4            2                2
aggregate(mydata[,-1],list(groups.3),mean)
##   Group.1 age companies Halo_oranges experience userfriendly    price  variety
## 1       1 1.2  3.600000    0.8000000   4.800000            5 4.600000 4.600000
## 2       2 1.6  2.600000    1.0000000   3.800000            3 2.200000 3.400000
## 3       3 1.0  2.666667    0.3333333   4.333333            4 3.333333 3.666667
##   ad_effectiveness halo_organic pistachio_flavor
## 1              4.8          1.2              2.0
## 2              2.8          1.6              1.6
## 3              4.0          2.0              2.0
cluster_means <- aggregate(mydata[,-1],list(groups.3),mean)
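As a small worked illustration of what aggregate computes, the sketch below uses four made-up respondents and a hand-written grouping vector (`toy` and `grp` are hypothetical). Naming the list element gives the grouping column a readable name instead of `Group.1`.

```r
# Hypothetical ratings for four respondents
toy <- data.frame(price = c(5, 4, 2, 1), variety = c(5, 5, 3, 2))
grp <- c(1, 1, 2, 2)   # assumed cluster labels, e.g. the output of cutree()

# One row per cluster, one column per variable, each cell a within-cluster mean
profile <- aggregate(toy, by = list(cluster = grp), FUN = mean)
profile
# cluster 1: price 4.5, variety 5.0
# cluster 2: price 1.5, variety 2.5
```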

Exporting cluster analysis results to Excel from R Studio Cloud

write.csv(groups.3, "clusterID.csv")
write.csv(cluster_means, "cluster_means.csv")
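A slightly tidier export keeps the row number next to its cluster label in named columns, which is easier to read in Excel. This is a sketch with hypothetical labels and a temporary file path. In the lab you would use `groups.3` and a file name of your choice.

```r
# Hypothetical cluster labels for three observations
groups <- c(1, 2, 1)

# One row per observation: its row number and the cluster it belongs to
out <- data.frame(row = seq_along(groups), cluster = groups)

path <- file.path(tempdir(), "clusterID_example.csv")   # assumed output location
write.csv(out, path, row.names = FALSE)                 # no extra unnamed index column

back <- read.csv(path)   # read the file back to check what was written
back
```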

Downloading your solutions manually

First, in the Files pane, tick the checkbox next to each file (“clusterID.csv” and “cluster_means.csv”).

Second, click the gear icon (More) in the Files pane and choose Export to download the selected files to your computer.

Finding means or medians of each variable (factor) for each cluster

Suppose again that your goal is to find some profitable customers to target. Using the mean or median function, you can see the characteristics of each subgroup. Now it is time to use your domain expertise to interpret them.

Analysis

  1. Basic descriptive statistics (mean, SD, outliers, etc.) Most of the people who took the survey were male and between the ages of 18 and 24, as shown by both the mean and the median. Many respondents had heard of at least 3 of the companies suggested. A majority have tried Halo oranges, but a fair number have not. Consumers’ experience with Wonderful leans toward the higher end, with a median rating of 4 on a scale of 1-5. Both the mean and the median put the user friendliness of the website at 4 on a scale of 1-5. Consumers in this survey also tend to be satisfied with the variety of products Wonderful offers: on a scale of 1-5, the mean was 3.9 and the median was 4. The ad effectiveness for Wonderful was also rated fairly high.
  2. How many clusters do we have? Three
  3. How many observations do you have in each cluster, respectively? Group 1: 5 observations, Group 2: 5 observations, Group 3: 3 observations
  4. List the cluster membership (the customer IDs) for each cluster. Group 1: 9, 10, 2, 3, 6. Group 2: 4, 8, 13, 1, 12. Group 3: 5, 7, 11.
  5. What are common characteristics of the customers in each cluster? In Group 1, ratings of ad effectiveness, variety of products, user friendliness of the website, and experience with Halo oranges were very high. This cluster appears to have had the best experience with Wonderful and its products. Group 2 had a more average experience with Wonderful, with notably negative views of the price of the products and a lukewarm reaction to the variety of Wonderful products. Overall, this group did not appear to be greatly impressed. Group 3, most of whom had not tried Halo oranges, seemed satisfied with Wonderful, although not as satisfied as Group 1. They did, however, rate price at around 3 on a scale of 1-5, so perhaps this group thinks that Wonderful is a bit too expensive.
  6. List at least two R functions (mean(), library(help = ggplot2), etc.) you have learned so far and explain them by referring to an online article. The ? operator is useful because typing ? followed by the name of a function opens the help page for that function. For example, typing ?mean opens a help page that explains what mean() does when it is run. The read.csv() function reads the data from a CSV spreadsheet file uploaded to the cloud into R. More information regarding R functions can be found here: https://www.dummies.com/programming/r/r-for-dummies-cheat-sheet/
  7. A 50-word reflection of your learning experience (e.g., three things you have learned this week). How could it benefit you as a future manager? Performing this analysis and getting the results I needed was a challenge, especially getting the syntax correct. Cluster analysis could benefit me as a future manager because it would let me analyze data from customers, and even employees, that gives me performance feedback. From there, I could make the necessary adjustments for improvement.