By the end of this lab session, you should be able to:
1. Understand how cloud computing works (currently in beta release at the time of writing this tutorial).
2. Understand how to import your own data into the cloud environment.
3. Create descriptive statistics to help understand the frequency distributions of your data.
4. Understand how hierarchical cluster analysis works.
5. Perform a very basic cluster analysis using RStudio Cloud.
6. Understand how to interpret your cluster analysis results.
7. Understand how to export your final results from the cloud environment to your own computer.
8. Understand how to use some basic packages and custom functions to process your data (optional).
For hierarchical clustering and exploratory data analysis, read Chapter 12, “Cluster Analysis”, from An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (pp. 385-399).
Remember that this is just a starting point; explore the reading list, practical and lecture for more ideas.
Reference: Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (2013). An Introduction to Statistical Learning with Applications in R. https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
This is still an early draft. Let me know if there are any errors or typos.
Keep in mind that no programmer can avoid errors. I strongly agree with this quote from Codecademy: “Errors in your code mean you’re trying to do something cool.”
https://news.codecademy.com/errors-in-code-think-differently/
Objective: divide the target market or customers on the basis of significant features, which could help a company sell more products with lower marketing expenses.
A potentially interesting question is whether some products (or customers) are more alike than others.
Market segmentation is a strategy that divides a broad target market of customers into smaller, more similar groups, and then designs a marketing strategy specifically for each group. Clustering is a common technique for market segmentation since it automatically finds similar groups given a data set.
Imagine that you are the Director of Customer Relationships at Apple and are interested in understanding consumers’ attitudes towards the iPhone 12 and Google’s Pixel 5. Once the product is created, the ball is in the marketing team’s court. As mentioned above, to understand which groups of customers will be interested in which kinds of features, marketers use a market segmentation strategy. Cluster analysis is designed to address this problem. Doing this helps ensure the product is positioned to the right segment of customers with a high propensity to buy.
1. Identify the type of customers who would respond to a particular offer
2. Identify high spenders among customers who will use the e-commerce channel for festive shopping
3. Identify customers who will default on their credit obligation for a loan or credit card
The file customer_segmentation.csv contains data collected by one of the student groups who took the marketing research course in spring 2020.
Search for RStudio Cloud, register (or set up a free user account), and log into the cloud environment with your Gmail credentials.
First, upload your dataset (.csv) from your own computer to RStudio Cloud. Make sure the first column is an ID column rather than a variable.
Once the dataset is uploaded, it will appear in the right pane of your cloud environment.
Now we will use the readr package and its read_csv function to read the dataset.
library(readr)
mydata <- read_csv('customer_segmentation.csv')
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## ID = col_double(),
## CS_helpful = col_double(),
## Recommend = col_double(),
## Come_again = col_double(),
## All_Products = col_double(),
## Profesionalism = col_double(),
## Limitation = col_double(),
## Online_grocery = col_double(),
## delivery = col_double(),
## Pick_up = col_double(),
## Find_items = col_double(),
## other_shops = col_double(),
## Gender = col_double(),
## Age = col_double(),
## Education = col_double()
## )
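Before clustering, it can help to look at the frequency distribution of a few variables. The following is a minimal sketch using a small made-up data frame; the name `toy` and all of its values are hypothetical and are not taken from customer_segmentation.csv.

```r
# Hypothetical toy data standing in for survey responses (not the real file)
toy <- data.frame(
  ID        = 1:6,
  Recommend = c(1, 1, 2, 1, 3, 2),
  Gender    = c(1, 2, 1, 1, 2, 1)
)

# Frequency distribution of one variable: counts per response value
freq_recommend <- table(toy$Recommend)
print(freq_recommend)

# Proportions instead of raw counts
print(prop.table(freq_recommend))

# Minimum, quartiles, median and mean for every column except the ID
print(summary(toy[, -1]))
```

The same calls should work on the imported data by replacing `toy` with `mydata`, assuming the column names match the column specification shown above.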
In the following step, you will standardize your data (i.e., rescale each variable to have a mean of 0 and a standard deviation of 1). You can use the scale function from base R, a generic function whose default method centers and/or scales the columns of a numeric matrix.
Hierarchical clustering (using the function hclust) is an informative way to visualize the data.
We will see whether we can discover subgroups among the variables or among the observations.
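To see what standardization does, here is a small sketch on a made-up matrix (the `toy` values and column names are hypothetical). It checks that each column ends up with a mean of 0 and a standard deviation of 1 after scaling.

```r
# Hypothetical toy matrix: two variables measured on very different scales
toy <- cbind(income = c(20000, 35000, 50000, 80000),
             rating = c(1, 3, 2, 5))

# Center each column to mean 0 and scale it to standard deviation 1
toy_scaled <- scale(toy, center = TRUE, scale = TRUE)

print(round(colMeans(toy_scaled), 10))  # column means, numerically zero
print(apply(toy_scaled, 2, sd))         # column standard deviations
```

Without this step, a variable with large raw values (such as the hypothetical `income`) would dominate the distance calculations used by the clustering.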
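use <- scale(mydata[, -1], center = TRUE, scale = TRUE) # standardize every column except the ID
dist_use <- dist(use) # Euclidean distances between the standardized observations
# Note: the next line takes distances between the rows of the distance matrix itself.
# hclust(dist_use) is the more usual input; the results shown below come from this two-step distance.
d <- dist(as.matrix(dist_use)) # find distance matrix
seg.hclust <- hclust(d) # apply hierarchical clustering (complete linkage by default)
library(ggplot2) # loaded here, although plot() below uses base graphics
plot(seg.hclust) # draw the dendrogram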
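To make the mechanics of hierarchical clustering more concrete, the sketch below runs hclust on five made-up points (the `toy` data are hypothetical). It uses a single call to dist and the default complete linkage, and prints the order and height of each merge that the dendrogram draws.

```r
# Hypothetical toy data: five points on a line, two tight pairs and one far point
toy <- matrix(c(1, 2, 10, 11.5, 30), ncol = 1,
              dimnames = list(paste0("p", 1:5), "x"))

d_toy  <- dist(toy)                           # pairwise Euclidean distances
hc_toy <- hclust(d_toy, method = "complete")  # "complete" is the hclust default

# Each row of $merge is one agglomeration step: negative numbers are single
# observations, positive numbers refer to clusters formed in earlier rows
print(hc_toy$merge)
print(hc_toy$height)  # distance at which each merge happened

plot(hc_toy, main = "Toy dendrogram")
```

With complete linkage, the height of a merge is the largest distance between any two members of the clusters being joined, which is why the far point joins last and at a large height.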
Suppose your goal is to find profitable customers to target. You can now see how many customers fall into each cluster.
groups.3 <- cutree(seg.hclust, 3) # cut the tree into three clusters
table(groups.3) # a good first step is to use the table function to see how many observations are in each cluster
## groups.3
## 1 2 3
## 17 2 3
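The cutree function can cut a dendrogram either into a requested number of clusters (`k`) or at a given height (`h`). The sketch below shows both on made-up data; the `toy` matrix and the cut height of 5 are hypothetical choices for illustration only.

```r
# Hypothetical toy data: two tight pairs and one far point
toy <- matrix(c(1, 2, 10, 11.5, 30), ncol = 1,
              dimnames = list(paste0("p", 1:5), "x"))
hc_toy <- hclust(dist(toy))

by_k <- cutree(hc_toy, k = 3)  # ask for exactly three clusters
by_h <- cutree(hc_toy, h = 5)  # or cut the tree at height 5

print(by_k)
print(table(by_k))             # cluster sizes
print(all(by_k == by_h))       # do both cuts give the same grouping here?
```

Choosing `k` is convenient when you already know how many segments you want; choosing `h` is an option when you would rather read a natural gap off the dendrogram.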
# In the following step, we will find the members of each cluster or group.
mydata$ID[groups.3 == 1]
## [1] 1 2 3 6 7 8 9 10 11 12 13 14 15 16 17 18 21
mydata$ID[groups.3 == 2]
## [1] 4 22
mydata$ID[groups.3 == 3]
## [1] 5 19 20
#?aggregate
aggregate(mydata,list(groups.3),median)
## Group.1 ID CS_helpful Recommend Come_again All_Products Profesionalism
## 1 1 11 1 1.0 1.0 2 1.0
## 2 2 13 3 2.5 1.5 3 1.5
## 3 3 19 2 1.0 3.0 3 2.0
## Limitation Online_grocery delivery Pick_up Find_items other_shops Gender Age
## 1 1 2 2 3.0 1 2.0 1 2.0
## 2 2 3 3 2.5 2 1.5 1 2.5
## 3 1 2 3 1.0 2 3.0 2 2.0
## Education
## 1 2
## 2 5
## 3 2
aggregate(mydata,list(groups.3),mean)
## Group.1 ID CS_helpful Recommend Come_again All_Products Profesionalism
## 1 1 10.76471 1.294118 1.117647 1.235294 1.823529 1.235294
## 2 2 13.00000 3.000000 2.500000 1.500000 3.000000 1.500000
## 3 3 14.66667 2.333333 1.666667 2.666667 3.000000 2.333333
## Limitation Online_grocery delivery Pick_up Find_items other_shops Gender
## 1 1.352941 2.235294 2.235294 2.705882 1.294118 2.647059 1.176471
## 2 2.000000 3.000000 3.000000 2.500000 2.000000 1.500000 1.000000
## 3 2.000000 2.000000 3.000000 1.000000 2.000000 3.000000 2.000000
## Age Education
## 1 2.411765 3.117647
## 2 2.500000 5.000000
## 3 2.666667 2.333333
aggregate(mydata[,-1],list(groups.3),median)
## Group.1 CS_helpful Recommend Come_again All_Products Profesionalism
## 1 1 1 1.0 1.0 2 1.0
## 2 2 3 2.5 1.5 3 1.5
## 3 3 2 1.0 3.0 3 2.0
## Limitation Online_grocery delivery Pick_up Find_items other_shops Gender Age
## 1 1 2 2 3.0 1 2.0 1 2.0
## 2 2 3 3 2.5 2 1.5 1 2.5
## 3 1 2 3 1.0 2 3.0 2 2.0
## Education
## 1 2
## 2 5
## 3 2
aggregate(mydata[,-1],list(groups.3),mean)
## Group.1 CS_helpful Recommend Come_again All_Products Profesionalism
## 1 1 1.294118 1.117647 1.235294 1.823529 1.235294
## 2 2 3.000000 2.500000 1.500000 3.000000 1.500000
## 3 3 2.333333 1.666667 2.666667 3.000000 2.333333
## Limitation Online_grocery delivery Pick_up Find_items other_shops Gender
## 1 1.352941 2.235294 2.235294 2.705882 1.294118 2.647059 1.176471
## 2 2.000000 3.000000 3.000000 2.500000 2.000000 1.500000 1.000000
## 3 2.000000 2.000000 3.000000 1.000000 2.000000 3.000000 2.000000
## Age Education
## 1 2.411765 3.117647
## 2 2.500000 5.000000
## 3 2.666667 2.333333
cluster_means <- aggregate(mydata[,-1],list(groups.3),mean)
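The sketch below shows how aggregate builds a cluster profile on made-up data; the `toy` data frame and the `groups` labels are hypothetical. It drops the first column before summarizing, so the ID is not averaged along with the survey variables.

```r
# Hypothetical toy data and cluster labels (not the survey data)
toy <- data.frame(
  ID     = 1:6,
  Rating = c(1, 2, 1, 5, 4, 5),
  Age    = c(2, 2, 3, 1, 1, 2)
)
groups <- c(1, 1, 1, 2, 2, 2)

# One row per cluster, one column per variable, each cell a cluster mean
profile <- aggregate(toy[, -1], by = list(cluster = groups), FUN = mean)
print(profile)
```

Naming the grouping list (`cluster = groups`) replaces the default `Group.1` column header with a more readable one.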
write.csv(groups.3, "clusterID.csv")
write.csv(cluster_means, "cluster_means.csv")
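When exporting cluster membership, it may be handy to store the customer ID next to the cluster label so the file can be read on its own. This is a minimal sketch with made-up values; the IDs, labels and the file name `clusterID_example.csv` are hypothetical, and the file is written to a temporary folder rather than the project directory.

```r
ids    <- c(101, 102, 103, 104)  # hypothetical customer IDs
groups <- c(1, 1, 2, 3)          # hypothetical cluster labels

membership <- data.frame(ID = ids, cluster = groups)

out_file <- file.path(tempdir(), "clusterID_example.csv")  # hypothetical file name
write.csv(membership, out_file, row.names = FALSE)         # omit the row-number column

# Read the file back to confirm what was written
check <- read.csv(out_file)
print(check)
```

With the tutorial's objects, the equivalent data frame would presumably be built from `mydata$ID` and `groups.3`.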
First, select the files (“clusterID.csv” & “cluster_means.csv”) and put a checkmark before each file.
Second, click the gear icon on the right side of your pane and export the data.
Suppose your goal is to find profitable customers to target. Using the mean or the median function, you can now see the characteristics of each subgroup. Now it is time to use your domain expertise.
We have about 3 observations.
Answer: Your answer here: The reason for finding the mean of each variable in each cluster is to identify the preferences of the average person. People, such as the average American, are more likely to gauge their liking based on what the general public likes, which is why surveys are taken. As seen, the average (mean) displayed for gender is between 18-24, and pricing and variability are almost the same as well. I feel that the easiest to read would be the median value, but the most accurate or most suitable for the consumer would be the average, as long as the data do not contain too many outliers that may skew the results. In this case, though, both the average and the median seem to be in close proximity to each other.
Answer: Your answer here: I feel a median would be best for this analysis as it would give us a definite answer. An example of this would be the age and gender analysis. Averaging would not give us a definite answer, as seen with the responses in groups 1-3 being 1.2, 1.6, and 1.0 for age, and the averages for gender being 1.4, 1.6, and 1.33. If you look at this data set with the median alone, the answers simplify to 1 (18-24) for group 1, 25-31 for group 2, and 18-24 for group 3. The gender median for group 1 was male (1 being the male option), for group 2 female (2 being the female option), and for group 3 male as well. As seen, this simplifies the data being read, making the median the better measure for this analysis.
Answer: Your answer here: The target market for group 1 would be males between the ages of 18 and 24. The target for group 2 would be females with an average age of 25-31. Finally, group 3 would target males who are also 18-24 years old. As for the preferred pistachios and oranges, it can vary, but the primary targeting criteria would be age and gender.
The difference between K-means and hierarchical clustering seems to be that K-means is more simplified while hierarchical clustering is more descriptive. As seen, the hierarchical clustering shows outliers as well as the average of the results and the areas that are more clustered together than others. K-means simply collects the data and stacks them in a bar chart, making it easier to read but less informative.
Answer: Your answer here:
O. The aggregate function is well suited for this task. Should we use mydata or mydata[,-1] with the aggregate function? Why? Hint: see the results in this tutorial.
Principal Component Analysis (PCA) summarizes the variation across the features in a dataset with a smaller number of components, and can be used in conjunction with cluster analysis.
PCA is also a popular machine learning technique used for feature selection. Imagine you have more than 100 features or factors; it is useful to select the most important ones for further analysis.
The basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest in absolute values) of their coefficients (loadings).
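The idea of ranking variables by their loadings can be sketched with prcomp from base R. The example uses the built-in USArrests data as a stand-in, since it is an assumption that the survey data would be analysed the same way; looking only at the first component is also a simplification made for illustration.

```r
# USArrests ships with base R; it stands in for the survey data here
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

print(summary(pca))  # proportion of variance explained by each component

# Loadings (coefficients) of each variable on the first principal component
loadings_pc1 <- pca$rotation[, "PC1"]

# Rank variables from largest to smallest absolute loading
ranked <- sort(abs(loadings_pc1), decreasing = TRUE)
print(ranked)
```

Variables near the top of `ranked` contribute most to the first component; in practice one would usually inspect several components, not just the first.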