By the end of this lab session, you should be able to:
1. Understand how R Studio Cloud works (it is in beta release at the time of writing this tutorial).
2. Understand how to import your own data to the cloud environment.
3. Create descriptive statistics to help understand the frequency distributions of your data.
4. Understand how hierarchical cluster analysis works.
5. Perform a very basic cluster analysis using R Studio Cloud.
6. Understand how to interpret your cluster analysis results.
7. Understand how to export your final results from the cloud environment to your own computer.
8. Understand how to use some basic packages and custom functions to process your data (optional).
For hierarchical clustering and exploratory data analysis, read Chapter 12 “Cluster Analysis” from An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (pp. 385-399).
Remember that this is just a starting point; explore the reading list, practical and lecture for more ideas.
Reference: Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (2013). An Introduction to Statistical Learning with Applications in R. https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf
Objective: divide the target market or customers on the basis of significant features, which could help a company sell more products with lower marketing expenses.
Market segmentation is a strategy that divides a broad target market of customers into smaller, more similar groups, and then designs a marketing strategy specifically for each group. Clustering is a common technique for market segmentation since it automatically finds similar groups given a data set.
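library(readr) # provides read_csv(); the base R function read.csv() used below works without it
mydata <- read.csv('wonderfulcustomer_segmentation.csv') # import the customer data as a data frame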
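Before clustering, it helps to check that the import worked and to look at how each variable is distributed. The sketch below uses a small made-up data frame (the name `toy` and its values are hypothetical, not the course data); the same calls should work on `mydata` once it has been loaded.

```r
# Hypothetical responses standing in for the imported file
toy <- data.frame(
  gender = c(1, 2, 1, 2, 2, 1),
  price  = c(5, 2, 4, 3, 2, 5)
)

str(toy)       # column names, types, and the first few values
summary(toy)   # minimum, quartiles, mean, and maximum of each column

# Frequency distribution of one variable: how often each answer occurs
price_counts <- table(toy$price)
price_counts

# The same distribution expressed as proportions
prop.table(price_counts)
```

Running `table()` on each variable in turn is a quick way to spot unexpected codes or missing categories before any distances are computed.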
In the following step, you will standardize your data (i.e., rescale each variable so that it has a mean of 0 and a standard deviation of 1). You can use the scale function from base R, a generic function whose default method centers and/or scales the columns of a numeric matrix.
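As a minimal sketch of what standardization does, the example below applies `scale` to a small made-up matrix (the object `x` and its values are hypothetical) and checks the result against the definition. Standardizing matters for clustering because variables measured in large units would otherwise dominate the Euclidean distances.

```r
# Hypothetical toy matrix: two variables on very different scales
x <- cbind(age = c(20, 30, 40, 50), income = c(1000, 3000, 2000, 6000))

z <- scale(x, center = TRUE, scale = TRUE)

round(colMeans(z), 10)   # each column mean is 0 (up to rounding error)
apply(z, 2, sd)          # each column standard deviation is 1

# scale() matches subtracting the column mean and dividing by the column sd
manual_age <- (x[, "age"] - mean(x[, "age"])) / sd(x[, "age"])
all.equal(as.vector(z[, "age"]), manual_age)
```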
Hierarchical clustering (using the function hclust) is an informative way to visualize the data.
We will see whether we can discover subgroups among the variables or among the observations.
use <- scale(mydata[, -1], center = TRUE, scale = TRUE) # standardize every column except the first
d <- dist(use) # Euclidean distance matrix between the standardized observations
seg.hclust <- hclust(d) # apply hierarchical clustering (complete linkage by default)
library(ggplot2) # not required for the dendrogram: plot() on an hclust object uses base R graphics
plot(seg.hclust) # draw the dendrogram
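To see what `hclust` does step by step, the following sketch runs it on five made-up points (the matrix `toy` and the object names are hypothetical). It assumes complete linkage, which is the default method of `hclust`, and inspects the order and height of the merges that the dendrogram draws.

```r
# Hypothetical toy data: five customers scored on two standardized variables
toy <- rbind(
  a = c(0.0, 0.0),
  b = c(0.1, 0.0),
  c = c(1.0, 1.0),
  d = c(1.1, 1.0),
  e = c(5.0, 5.0)
)

d_toy  <- dist(toy)                           # pairwise Euclidean distances
hc_toy <- hclust(d_toy, method = "complete")  # "complete" is also the default

hc_toy$merge    # which observations or clusters were joined at each step
hc_toy$height   # the distance at which each join happened

plot(hc_toy)    # dendrogram: the higher the join, the less similar the groups
```

In `hc_toy$merge`, negative numbers refer to single observations and positive numbers refer to clusters formed in an earlier step. With complete linkage, the height of a join is the largest distance between any two members of the joined clusters.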
Suppose your goal is to find some profitable customers to target. After cutting the tree into groups, you will be able to see the number of customers in each group.
groups.3 <- cutree(seg.hclust, 3) # cut the tree into three clusters
# A good first step is to use the table function to see how many observations are in each cluster
table(groups.3)
## groups.3
## 1 2 3
## 5 5 3
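`cutree` can cut the tree either into a fixed number of clusters (`k`) or at a given height (`h`). The sketch below illustrates both on made-up data (the matrix `toy` and the cut height of 2 are hypothetical choices for illustration only).

```r
# Hypothetical toy data with three visibly separate groups
toy <- rbind(c(0, 0), c(0.2, 0.1), c(3, 3), c(3.1, 2.9), c(8, 8), c(8.2, 7.9))
hc_toy <- hclust(dist(toy))

by_k <- cutree(hc_toy, k = 3)   # ask for exactly three clusters
by_h <- cutree(hc_toy, h = 2)   # or cut the tree at height 2

table(by_k)          # cluster sizes
table(by_k, by_h)    # cross-tabulation: do the two cuts agree?
```

Choosing `k` is convenient when you already know how many segments you want; choosing `h` lets the dendrogram suggest the number of clusters.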
# In the following step, we will find the members of each cluster or group.
# mydata has no ID column (mydata$ID returns NULL), so we list the row numbers of the members instead.
which(groups.3 == 1)
which(groups.3 == 2)
which(groups.3 == 3)
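Looking up members with `$ID` only works if the data frame actually has a column named `ID`; `$` returns `NULL` for a column that does not exist. The sketch below, on made-up data (the names `toy`, `toy_groups`, and the `cust_` labels are hypothetical), adds an explicit identifier and then lists the members of every cluster.

```r
# Hypothetical data and cluster labels such as those returned by cutree()
toy <- data.frame(age = c(1, 2, 1, 2, 1), price = c(5, 2, 4, 3, 5))
toy_groups <- c(1, 2, 1, 3, 2)

# Add an explicit identifier so that members can be looked up by name
toy$ID <- paste0("cust_", seq_len(nrow(toy)))

toy$ID[toy_groups == 1]                # members of cluster 1
members <- split(toy$ID, toy_groups)   # all clusters at once, as a named list
members
```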
#?aggregate
aggregate(mydata,list(groups.3),median)
##   Group.1 gender age companies Halo_oranges experience userfriendly price
## 1       1      1   1         4            1          5            5     5
## 2       2      2   2         3            1          4            3     2
## 3       3      1   1         3            0          4            4     3
##   variety ad_effectiveness halo_organic pistachio_flavor
## 1       5                5            1                2
## 2       3                3            2                2
## 3       4                4            2                2
aggregate(mydata,list(groups.3),mean)
##   Group.1   gender age companies Halo_oranges experience userfriendly    price
## 1       1 1.400000 1.2  3.600000    0.8000000   4.800000            5 4.600000
## 2       2 1.600000 1.6  2.600000    1.0000000   3.800000            3 2.200000
## 3       3 1.333333 1.0  2.666667    0.3333333   4.333333            4 3.333333
##    variety ad_effectiveness halo_organic pistachio_flavor
## 1 4.600000              4.8          1.2              2.0
## 2 3.400000              2.8          1.6              1.6
## 3 3.666667              4.0          2.0              2.0
aggregate(mydata[,-1],list(groups.3),median)
##   Group.1 age companies Halo_oranges experience userfriendly price variety
## 1       1   1         4            1          5            5     5       5
## 2       2   2         3            1          4            3     2       3
## 3       3   1         3            0          4            4     3       4
##   ad_effectiveness halo_organic pistachio_flavor
## 1                5            1                2
## 2                3            2                2
## 3                4            2                2
aggregate(mydata[,-1],list(groups.3),mean)
##   Group.1 age companies Halo_oranges experience userfriendly    price  variety
## 1       1 1.2  3.600000    0.8000000   4.800000            5 4.600000 4.600000
## 2       2 1.6  2.600000    1.0000000   3.800000            3 2.200000 3.400000
## 3       3 1.0  2.666667    0.3333333   4.333333            4 3.333333 3.666667
##   ad_effectiveness halo_organic pistachio_flavor
## 1              4.8          1.2              2.0
## 2              2.8          1.6              1.6
## 3              4.0          2.0              2.0
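The `aggregate` calls above label the grouping column `Group.1` because the list of groups is unnamed. As a small sketch on made-up data (the names `toy`, `toy_groups`, and `cluster` are hypothetical), naming the list element gives a more readable column, and the cluster sizes can be placed next to the means.

```r
# Hypothetical toy data and cluster labels
toy <- data.frame(price = c(5, 4, 2, 1, 3), variety = c(5, 5, 3, 2, 4))
toy_groups <- c(1, 1, 2, 2, 3)

# Naming the list element gives the grouping column a readable name
toy_means <- aggregate(toy, by = list(cluster = toy_groups), FUN = mean)
toy_means

# Put cluster sizes next to the means; both are ordered by cluster label
toy_means$n <- as.vector(table(toy_groups))
toy_means
```

Seeing the size of each cluster beside its profile helps when judging whether a segment is large enough to be worth targeting.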
cluster_means <- aggregate(mydata[,-1],list(groups.3),mean)
write.csv(groups.3, "clusterID.csv")
write.csv(cluster_means, "cluster_means.csv")
First, select the files (“clusterID.csv” and “cluster_means.csv”) by ticking the checkbox next to each file.
Second, click the gear icon on the right side of the pane and choose the export option to download the files to your own computer.
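Before exporting, it can be worth confirming that a written file reads back as expected. The sketch below is one way to do that under stated assumptions: the labels `toy_groups`, the data frame `out`, and the file name `clusterID_check.csv` are hypothetical, and the file is written to a temporary directory so that nothing in your project is overwritten.

```r
# Hypothetical cluster labels, stored with an explicit row number column
toy_groups <- c(1, 2, 1, 3)
out <- data.frame(row = seq_along(toy_groups), cluster = toy_groups)

path <- file.path(tempdir(), "clusterID_check.csv")
write.csv(out, path, row.names = FALSE)   # row.names = FALSE avoids an unnamed first column

check <- read.csv(path)                   # read the file back in
str(check)
all.equal(check$cluster, toy_groups)      # the labels survive the round trip
```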
Suppose your goal is to find some profitable customers to target. Using the mean or the median of each variable within each cluster, you can now see the characteristics of each subgroup. Now it is time to use your domain expertise to decide which subgroups are worth targeting.
##Analysis