This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
This is still an early draft. Let me know if there are any errors or typos.
Keep in mind that no programmer can avoid errors. I strongly agree with this quote from Codecademy: "Errors in your code mean you're trying to do something cool."
https://news.codecademy.com/errors-in-code-think-differently/
Objective - Divide the target market or customers on the basis of significant features, helping a company sell more products with lower marketing expenses.
A potentially interesting question is whether some products (or customers) are more alike than others.
Market segmentation is a strategy that divides a broad target market of customers into smaller, more similar groups, and then designs a marketing strategy specifically for each group. Clustering is a common technique for market segmentation since it automatically finds similar groups given a data set.
Imagine that you are the Director of Customer Relationships at Apple and you are interested in understanding consumers' attitudes toward the iPhone 12 and Google's Pixel 5. Once the product is created, the ball shifts to the marketing team's court. As mentioned above, to understand which groups of customers will be interested in which kinds of features, marketers make use of a market segmentation strategy. The cluster analysis algorithm is designed to address this problem. Doing this ensures the product is positioned to the right segment of customers with a high propensity to buy. Typical applications include the following:
1. Identify the type of customers who would respond to a particular offer
2. Identify high spenders among customers who will use the e-commerce channel for festive shopping
3. Identify customers who will default on their credit obligation for a loan or credit card
The dataset used in this tutorial, cleaned_survey_data_final.csv, contains survey data collected by my students.
Search for RStudio Cloud, register for a free user account, and log into the cloud environment with your Gmail credentials.
You will first upload your dataset (.csv) from your own computer to RStudio Cloud. Make sure the first column is an ID rather than one of the variables.
Once the dataset is uploaded, you will see the dataset available on the right pane of your cloud environment.
Now we will use the readr package and its read_csv function to read the dataset.
library(readr)
mydata <- read_csv('cleaned_survey_data_final.csv')
## Rows: 34 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): ID, Uses_Meta, Usage_Frequency, Ad_Click_Frequency, Ad_Trust, Purch...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
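Before standardizing, it can help to take a quick look at the data. This is an optional step using base-R helpers:
head(mydata)      # first few rows of the survey data
summary(mydata)   # min/median/mean/max of each column; handy for spotting odd values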
In the following step, you will standardize your data (i.e., transform it so each variable has a mean of 0 and a standard deviation of 1). You can use the scale function from base R, a generic function whose default method centers and/or scales the columns of a numeric matrix.
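As a quick sanity check, you can confirm that the scaled columns really do have mean 0 and standard deviation 1. The object name scaled below is just an illustration; the tutorial itself uses use in the next chunk:
scaled <- scale(mydata[, -1], center = TRUE, scale = TRUE)
round(colMeans(scaled), 10)   # every column mean should be (numerically) 0
apply(scaled, 2, sd)          # every column standard deviation should be 1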
Hierarchical clustering (using the function hclust) is an informative way to visualize the data.
We will see whether we can discover subgroups among the variables or among the observations.
use <- scale(mydata[, -1], center = TRUE, scale = TRUE)  # standardize every column except the ID
d <- dist(use)                       # Euclidean distance matrix of the scaled observations
seg.hclust <- hclust(d)              # apply hierarchical clustering (complete linkage by default)
library(ggplot2)                     # loaded for graphics; the dendrogram below uses base plot()
plot(seg.hclust)                     # dendrogram of the observations
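hclust uses complete linkage by default. If you want to experiment, other linkage rules can be passed through the method argument; this is only a sketch and not part of the analysis that follows:
seg.hclust.ward <- hclust(d, method = "ward.D2")   # Ward's minimum-variance linkage
seg.hclust.avg  <- hclust(d, method = "average")   # average linkage
plot(seg.hclust.ward)                              # compare this dendrogram with the default one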
Imagine that your goal is to find some profitable customers to target. Using this algorithm, you can now see how many customers fall into each group.
groups.3 <- cutree(seg.hclust, 3)   # cut the dendrogram into three clusters
table(groups.3)                     # a good first step: see how many observations are in each cluster
## groups.3
## 1 2 3
## 28 5 1
# In the following step, we will list the members of each cluster (group).
mydata$ID[groups.3 == 1]
## [1] 1 2 3 4 5 6 7 8 9 10 12 13 14 15 16 18 19 20 21 22 24 27 29 30 31
## [26] 32 33 34
mydata$ID[groups.3 == 2]
## [1] 11 23 25 26 28
mydata$ID[groups.3 == 3]
## [1] 17
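An equivalent one-liner, if you prefer, uses split() to return all three membership lists at once (the same information as the three commands above):
split(mydata$ID, groups.3)   # a named list: IDs belonging to clusters 1, 2, and 3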
# ?aggregate shows the help page for the aggregate function
aggregate(mydata, list(groups.3), median)
## Group.1 ID Uses_Meta Usage_Frequency Ad_Click_Frequency Ad_Trust
## 1 1 15.5 18.5 2 2.666667 3
## 2 2 25.0 18.0 2 3.000000 5
## 3 3 17.0 9.0 1 1.000000 5
## Purchased_After_Ad Ad_Type Group
## 1 1.416667 7.5 3
## 2 2.000000 4.0 1
## 3 2.000000 7.5 2
aggregate(mydata, list(groups.3), mean)
## Group.1 ID Uses_Meta Usage_Frequency Ad_Click_Frequency Ad_Trust
## 1 1 16.60714 18.17857 1.985119 2.738095 2.946429
## 2 2 22.60000 15.40000 2.000000 2.600000 4.600000
## 3 3 17.00000 9.00000 1.000000 1.000000 5.000000
## Purchased_After_Ad Ad_Type Group
## 1 1.363095 8.0 2.214286
## 2 1.600000 4.7 1.000000
## 3 2.000000 7.5 2.000000
aggregate(mydata[, -1], list(groups.3), median)
## Group.1 Uses_Meta Usage_Frequency Ad_Click_Frequency Ad_Trust
## 1 1 18.5 2 2.666667 3
## 2 2 18.0 2 3.000000 5
## 3 3 9.0 1 1.000000 5
## Purchased_After_Ad Ad_Type Group
## 1 1.416667 7.5 3
## 2 2.000000 4.0 1
## 3 2.000000 7.5 2
aggregate(mydata[, -1], list(groups.3), mean)
## Group.1 Uses_Meta Usage_Frequency Ad_Click_Frequency Ad_Trust
## 1 1 18.17857 1.985119 2.738095 2.946429
## 2 2 15.40000 2.000000 2.600000 4.600000
## 3 3 9.00000 1.000000 1.000000 5.000000
## Purchased_After_Ad Ad_Type Group
## 1 1.363095 8.0 2.214286
## 2 1.600000 4.7 1.000000
## 3 2.000000 7.5 2.000000
cluster_means <- aggregate(mydata[, -1], list(groups.3), mean)
write.csv(groups.3, "clusterID.csv")
write.csv(cluster_means, "cluster_means.csv")
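write.csv(groups.3, ...) saves only the cluster labels with row numbers. If you would like the exported file to carry the survey ID next to each label, a small variant (with a hypothetical file name) is:
write.csv(data.frame(ID = mydata$ID, cluster = groups.3),
          "clusterID_with_ID.csv", row.names = FALSE)   # one row per respondent: ID + cluster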
First, select the files (“clusterID.csv” & “cluster_means.csv”) and put a checkmark before each file.
Second, click the gear icon on the right side of your pane and export the data.
Imagine again that your goal is to find some profitable customers to target. Using the mean or the median, you can now see the characteristics of each sub-group. Now it is time to apply your domain expertise.
How many observations do we have in each cluster?
Answer: Cluster 1 has 28 observations, Cluster 2 has 5, and Cluster 3 has 1. Most of the data is concentrated in Cluster 1, while Clusters 2 and 3 represent smaller, more specific segments; Cluster 3 may be an outlier or a very unusual profile.
We can look at the medians (or means) for the variables in each cluster. Why is this important?
Answer: Looking at the means or medians allows us to understand the central tendency of each variable within a cluster. This helps identify patterns in customer behavior, such as how frequently customers use Meta, how much they trust ads, and whether they are likely to purchase after seeing an ad. It also makes it easier to compare clusters and understand what differentiates them.
Should we rely on the mean or the median here? Why?
Answer: The median is more reliable in this case because it is less sensitive to extreme values or outliers. Since Cluster 3 has only one observation and Cluster 2 is small, the mean could be misleading. The median gives a more accurate picture of the "typical" behavior in each cluster.
Which summary measures are most appropriate for describing each cluster?
Answer: The most appropriate summary measures are the mean and median for numerical variables (usage, trust, click frequency, etc.) and counts or frequencies for categorical variables (see the small sketch below). These measures describe each cluster's behavior and allow us to build customer profiles, which are essential for developing targeted marketing strategies.
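For instance, a simple cross-tabulation gives the counts of a coded categorical variable within each cluster; Ad_Type is used here purely as an illustration, assuming it is stored as a categorical code in your file:
table(groups.3, mydata$Ad_Type)   # rows = cluster, columns = Ad_Type code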
How does k-means clustering compare with hierarchical clustering, and which do you prefer?
Answer: K-means clustering works by assigning observations to clusters so as to minimize their distance to each cluster's centroid. It requires choosing the number of clusters beforehand and is efficient for large datasets, but the results can vary depending on the starting points.
Hierarchical clustering builds clusters step by step and represents them in a dendrogram. It does not require selecting the number of clusters initially and allows us to visually understand how observations are grouped.
I prefer hierarchical clustering because it provides better visualization and helps understand the structure of the data, especially in smaller datasets like this one.
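For comparison, a minimal k-means sketch on the same standardized data might look like the following (this assumes the scaled matrix use from the hclust step and picks 3 clusters to match the dendrogram cut; it is not part of the analysis above):
set.seed(123)                       # k-means depends on random starting centroids
km <- kmeans(use, centers = 3, nstart = 25)
table(km$cluster)                   # sizes of the k-means clusters
table(km$cluster, groups.3)         # cross-tabulate against the hierarchical solution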
Our tutor played a key role in helping us manually clean the dataset, ensuring we understood the structure and quality of the data before beginning the analysis. At the same time, he introduced us to more efficient approaches using RStudio, showing how data cleaning can be automated and scaled.
As beginners in cluster analysis, we were also encouraged to use AI tools such as ChatGPT and Claude to support our learning process. These tools helped us better understand concepts, debug errors, and interpret results more effectively. However, it was emphasized that AI should be used as a support tool, not a replacement for the course material.
The professor’s tutorial remained the foundation of our analysis, providing the correct methodology and structure. AI complemented this by guiding us through challenges and helping us move faster, but all outputs still required verification.
Overall, this combination of tutor guidance, course material, and AI support allowed us to both understand the theory and apply it more efficiently in practice.
The aggregate function is well suited for this task. Should we use mydata or mydata[,-1] with aggregate? Why? Hint: see the results in my tutorial.
Answer: We should use mydata[,-1] because the first column is an ID variable, which does not contain meaningful information for analysis. Including it would distort the summary statistics since IDs are just identifiers and not behavioral or numerical variables relevant to clustering.
Cluster analysis - reading (pp. 385-399): https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf
Comparison of similarity coefficients used for cluster analysis with dominant markers in maize (Zea mays L.): https://www.scielo.br/scielo.php?script=sci_arttext&pid=S1415-47572004000100014&lng=en&nrm=iso
Principal Component Methods in R (Practical Guide): http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/