Final Primary Data Analysis using Clustering and PCA

Reading

For hierarchical clustering and exploratory data analysis, read Chapter 12 “Cluster Analysis” from An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (pp. 385-399).

Remember that this is just a starting point; explore the reading list, practical and lecture for more ideas.

Reference: Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (2013). An Introduction to Statistical Learning with Applications in R. https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

library(readr)
mydata <- read_csv('customer_segmentation.csv')

## Rows: 71 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): tiktok_use, recommendations, purchase_tiktok
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Importing data

In the following step, you will standardize your data (i.e., rescale each variable to have a mean of 0 and a standard deviation of 1). You can use R's scale function, a generic function whose default method centers and/or scales the columns of a numeric matrix.
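As a quick check of what standardization does, the sketch below applies scale() to a small made-up data frame and confirms that each column ends up with a mean of 0 and a standard deviation of 1. The data frame toy and its values are illustrative assumptions, not the survey data.

```r
# Made-up data for illustration only; not the survey data
toy <- data.frame(
  recommendations = c(1, 3, 5, 7, 9),
  purchase_tiktok = c(2, 2, 4, 6, 6)
)

# scale() centers each column on its mean and divides by its standard deviation
toy_scaled <- scale(toy, center = TRUE, scale = TRUE)

# After scaling, column means are (numerically) 0 and standard deviations are 1
col_means <- colMeans(toy_scaled)
col_sds   <- apply(toy_scaled, 2, sd)
print(round(col_means, 10))
print(col_sds)
```

Standardizing first keeps a variable measured on a wider scale from dominating the distances used for clustering.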

Building distance function and plotting the trees (dendrograms)

Hierarchical clustering (using the function hclust) is an informative way to visualize the data.

We will see whether we can discover subgroups among the variables or among the observations.

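# Standardize the data; note that [,-c(1)] drops the first column (tiktok_use)
use = scale(mydata[,-c(1)], center = TRUE, scale = TRUE)
dist = dist(use)             # distances between the standardized observations
# Note: dist() is applied a second time below, so d holds distances between rows
# of the distance matrix; hclust(dist) would cluster on the original distances.
# The output shown in this report comes from the code as written.
d <- dist(as.matrix(dist))   # find distance matrix
seg.hclust <- hclust(d)      # apply hierarchical clustering
library(ggplot2)             # loaded but not used below; plot() here is base R
plot(seg.hclust)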
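For comparison, here is a minimal sketch of the usual scale, dist, hclust pipeline on made-up data, with dist() called once. The object names (toy, toy_scaled, toy_dist, toy_hclust) and the values are assumptions for illustration only.

```r
# Made-up data: two well-separated groups of three points each
toy <- data.frame(
  x = c(1.0, 1.2, 0.8, 8.0, 8.3, 7.9),
  y = c(1.1, 0.9, 1.0, 8.2, 7.8, 8.1)
)

toy_scaled <- scale(toy)         # standardize the columns
toy_dist   <- dist(toy_scaled)   # Euclidean distances between rows, computed once
toy_hclust <- hclust(toy_dist)   # complete linkage is the default method

plot(toy_hclust)                 # draw the dendrogram
```

Other linkage rules can be tried through the method argument of hclust, for example method = "average".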

Identifying clustering memberships for each cluster

Suppose your goal is to find profitable customers to target. Cutting the dendrogram into a fixed number of clusters shows how many customers fall into each segment.

groups.3 = cutree(seg.hclust,3)
# A good first step is to use the table function to see how many observations are in each cluster
table(groups.3)

## groups.3
##  1  2  3 
## 28 38  5
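The sketch below illustrates cutree() and table() on made-up data with two obvious groups, and uses split() to list which rows belong to each cluster. The names toy and toy_groups are hypothetical; the real analysis would use seg.hclust instead.

```r
# Made-up data: rows 1-3 sit close together, rows 4-6 sit close together
toy <- data.frame(
  x = c(1.0, 1.2, 0.8, 8.0, 8.3, 7.9),
  y = c(1.1, 0.9, 1.0, 8.2, 7.8, 8.1)
)
toy_hclust <- hclust(dist(scale(toy)))

# Cut the tree into k = 2 clusters; the result is one label per row
toy_groups <- cutree(toy_hclust, k = 2)

# Cluster sizes
print(table(toy_groups))

# Row numbers of the members of each cluster
members <- split(seq_len(nrow(toy)), toy_groups)
print(members)
```

Row numbers are a workable stand-in for customer identifiers when the data has no ID column.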

# In the following step, we will find the members in each cluster or group.
# mydata has no ID column (mydata$ID returns NULL with a warning),
# so we list the row numbers of the members instead.
which(groups.3 == 1)

which(groups.3 == 2)

which(groups.3 == 3)

Identifying common features of each cluster using the aggregate function

#?aggregate
aggregate(mydata,list(groups.3),median)

##   Group.1 tiktok_use recommendations purchase_tiktok
## 1       1          2               7               1
## 2       2          2               1              NA
## 3       3         NA               2               6

aggregate(mydata,list(groups.3),mean)

##   Group.1 tiktok_use recommendations purchase_tiktok
## 1       1   2.500000        6.071429        1.607143
## 2       2   2.605263        1.289474              NA
## 3       3         NA        2.600000        6.000000

aggregate(mydata[,-1],list(groups.3),median)

##   Group.1 recommendations purchase_tiktok
## 1       1               7               1
## 2       2               1              NA
## 3       3               2               6

aggregate(mydata[,-1],list(groups.3),mean)

##   Group.1 recommendations purchase_tiktok
## 1       1        6.071429        1.607143
## 2       2        1.289474              NA
## 3       3        2.600000        6.000000

cluster_means <- aggregate(mydata[,-1],list(groups.3),mean)
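The NA entries in the summaries above arise because mean and median return NA when a group contains a missing value. As a hedged sketch on made-up data (toy and toy_groups are hypothetical), extra arguments given to aggregate() are passed on to the summary function, so na.rm = TRUE can be supplied there.

```r
# Made-up data with one missing value in each variable
toy <- data.frame(
  recommendations = c(7, 6, NA, 1, 2, 1),
  purchase_tiktok = c(1, 2, 1, NA, 6, 5)
)
toy_groups <- c(1, 1, 1, 2, 2, 2)   # hypothetical cluster labels

# Without na.rm, a cluster's mean is NA for any variable that has a
# missing value within that cluster
print(aggregate(toy, by = list(cluster = toy_groups), FUN = mean))

# na.rm = TRUE is passed through to mean(), which then ignores missing values
toy_means <- aggregate(toy, by = list(cluster = toy_groups), FUN = mean, na.rm = TRUE)
print(toy_means)
```

Whether dropping missing responses is appropriate depends on why they are missing, so this is a choice to justify rather than a default.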

Exporting cluster analysis results into Excel from RStudio Cloud

write.csv(groups.3, "clusterID.csv")
write.csv(cluster_means, "cluster_means.csv")
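When the data has no ID column, it can help to export the row number alongside each cluster label so the file is self-explanatory in Excel. The sketch below assumes made-up labels and writes to a temporary file; the names toy_groups, membership and toy_clusterID.csv are hypothetical.

```r
toy_groups <- c(1, 1, 2, 2, 3)   # hypothetical cluster labels

# One row per observation: its row number and its cluster
membership <- data.frame(
  row     = seq_along(toy_groups),
  cluster = toy_groups
)

# Write to a temporary location; row.names = FALSE avoids an extra unnamed column
out_file <- file.path(tempdir(), "toy_clusterID.csv")
write.csv(membership, out_file, row.names = FALSE)

# Read the file back to confirm what was written
read_back <- read.csv(out_file)
print(read_back)
```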

Downloading your solutions manually

First, select the files (“clusterID.csv” & “cluster_means.csv”) by ticking the checkbox next to each one.

Second, click the gear icon on the right side of the Files pane and export the data.

Finding means or medians of each variable (factor) for each cluster

Suppose your goal is to find profitable customers to target. Using the mean function or the median function, you can see the characteristics of each subgroup. Now it is time to use your domain expertise.

Discussion Questions for you

How many observations do we have in each cluster?

Answer: Your answer here:

We can look at the medians (or means) for the variables in each cluster. Why is this important?

Answer: Your answer here:

Do you think the mean or the median should be used to analyze the differences among clusters? Why?

Answer: Your answer here:

Now we need to understand the common characteristics of each cluster. Our goal is to build a targeting strategy using the profiles of each cluster. What summary measures of each cluster are appropriate in a descriptive sense?

Answer: Your answer here:

Are there any major differences between K-means clustering (https://rpubs.com/utjimmyx/kmeans) and hierarchical clustering? Which one do you like better? Why? You may refer to the assigned readings.

Do a keyword search using “cluster analysis.” How many relevant job titles are there?

Answer: Your answer here:

Advanced Questions (optional but highly recommended)

The aggregate function is well suited for this task. Should we use mydata or mydata[,-1] along with the aggregate function? Why? Hint: see the results of my tutorial.

Group 1: Taylor Mazza, Abbi Kindrid, Alina Vasquez-Alvarez, Veronica Guzman - MKTG4000 - for Dr. Xu

05/04/2022
