This analysis examines survey data on whether participants would prefer alcohol to be sold at fast-food restaurants such as Taco Bell.
library(readr)
mydata <- read_csv('customer_segmentation.csv')
## Rows: 33 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (9): Gender, Age, Alcohol Consumption, Restaurant, Taco Bell Order, Visi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
In the following step, you will standardize the data (i.e., transform each variable to have a mean of 0 and a standard deviation of 1). You can use the scale function from base R, a generic function whose default method centers and/or scales the columns of a numeric matrix.
Hierarchical clustering (using the function hclust) is an informative way to visualize the data.
We will see whether we can discover subgroups among the variables or among the observations.
use <- scale(mydata[,-c(1)], center = TRUE, scale = TRUE) # standardize all columns except the first
d <- dist(use) # Euclidean distance matrix between observations
seg.hclust <- hclust(d) # apply hierarchical clustering (complete linkage by default)
library(ggplot2) # needs no introduction
plot(seg.hclust)
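If you want to see candidate subgroups directly on the dendrogram, the rect.hclust function from base R can outline them. A minimal sketch, assuming the seg.hclust object from above (the choice of three clusters is illustrative):
plot(seg.hclust) # redraw the dendrogram
rect.hclust(seg.hclust, k = 3, border = "#336699") # outline k = 3 clusters on the plot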
Suppose your goal is to identify profitable customers to target. By cutting the dendrogram into a fixed number of groups, you can see how many customers fall into each cluster.
groups.3 <- cutree(seg.hclust, 3)
table(groups.3) # a good first step: use the table function to see how many observations are in each cluster
## groups.3
## 1 2 3
## 13 9 11
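The choice of three clusters is a judgment call. As a quick sketch, you can cut the tree at other values of k and compare the resulting group sizes:
table(cutree(seg.hclust, 2)) # two clusters
table(cutree(seg.hclust, 4)) # four clusters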
# In the following step, we list the members of each cluster. Note that
# customer_segmentation.csv has no ID column, so mydata$ID would return NULL
# with a warning; we identify cluster members by row number instead.
which(groups.3 == 1)
which(groups.3 == 2)
which(groups.3 == 3)
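It is often convenient to carry the cluster label alongside the raw data for later profiling. A minimal sketch that leaves mydata unchanged for the steps below:
labeled <- cbind(mydata, cluster = groups.3) # attach the cluster assignment as a new column
head(labeled)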
#?aggregate
aggregate(mydata,list(groups.3),median)
## Group.1 Gender Age Alcohol Consumption Restaurant Taco Bell Order
## 1 1 1 1 4 1 1
## 2 2 2 2 1 0 0
## 3 3 2 4 5 2 2
## Visit Taco Bell Acohol Taco Bell Offer Alcohol Alcohol Options
## 1 4 2 1 2
## 2 0 0 0 0
## 3 4 2 2 0
aggregate(mydata,list(groups.3),mean)
## Group.1 Gender Age Alcohol Consumption Restaurant Taco Bell Order
## 1 1 1.384615 1.307692 3.6923077 1.000000 1.230769
## 2 2 1.222222 1.666667 0.6666667 0.000000 0.000000
## 3 3 1.818182 3.272727 5.0000000 1.545455 1.545455
## Visit Taco Bell Acohol Taco Bell Offer Alcohol Alcohol Options
## 1 3.615385 1.769231 1.230769 2.3076923
## 2 0.000000 0.000000 0.000000 0.0000000
## 3 4.090909 2.000000 1.909091 0.3636364
aggregate(mydata[,-1],list(groups.3),median)
## Group.1 Age Alcohol Consumption Restaurant Taco Bell Order Visit Taco Bell
## 1 1 1 4 1 1 4
## 2 2 2 1 0 0 0
## 3 3 4 5 2 2 4
## Acohol Taco Bell Offer Alcohol Alcohol Options
## 1 2 1 2
## 2 0 0 0
## 3 2 2 0
aggregate(mydata[,-1],list(groups.3),mean)
## Group.1 Age Alcohol Consumption Restaurant Taco Bell Order
## 1 1 1.307692 3.6923077 1.000000 1.230769
## 2 2 1.666667 0.6666667 0.000000 0.000000
## 3 3 3.272727 5.0000000 1.545455 1.545455
## Visit Taco Bell Acohol Taco Bell Offer Alcohol Alcohol Options
## 1 3.615385 1.769231 1.230769 2.3076923
## 2 0.000000 0.000000 0.000000 0.0000000
## 3 4.090909 2.000000 1.909091 0.3636364
cluster_means <- aggregate(mydata[,-1],list(groups.3),mean)
write.csv(groups.3, "clusterID.csv")
write.csv(cluster_means, "cluster_means.csv")
To download the results from RStudio Cloud: first, select the files ("clusterID.csv" & "cluster_means.csv") by putting a checkmark before each file in the Files pane.
Second, click the gear icon on the right side of the pane and export the data.
Suppose again that your goal is to find profitable customers to target. Using the mean or the median of each variable within a cluster, you can see the characteristics of each subgroup. Interpreting those profiles is where your domain expertise comes in.
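If you prefer the tidyverse style, the same per-cluster profiles can be computed with dplyr (loaded again in the next section). A minimal sketch, assuming groups.3 from the hierarchical solution:
library(dplyr)
mydata %>%
  mutate(cluster = groups.3) %>% # attach cluster labels
  group_by(cluster) %>%
  summarise(across(everything(), mean)) # per-cluster means, the same idea as aggregate()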
Principal Component Analysis (PCA) is a way of understanding the different features in a dataset and can be used in conjunction with cluster analysis.
PCA is also a popular machine learning technique for feature selection. Imagine you have more than 100 features or factors; it is useful to select the most important ones for further analysis.
The basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest in absolute value) of their coefficients (loadings).
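As a sketch of that idea, after fitting a PCA (the same prcomp call used below) you can rank the variables by the absolute size of their loadings on the first component; the name pca_fit is illustrative:
pca_fit <- prcomp(mydata[,-1], scale = TRUE) # same call as in the PCA section below
sort(abs(pca_fit$rotation[, 1]), decreasing = TRUE) # variables ranked by |PC1 loading|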
#install.packages('dplyr')
library(dplyr) # sane data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr) # sane data munging
library(ggplot2) # needs no introduction
library(ggfortify) # super-helpful for plotting non-"standard" stats objects
#identifying your working directory
getwd() #confirm your working directory is accurate
## [1] "/cloud/project"
library(readr)
mydata <-read_csv('customer_segmentation.csv')
## Rows: 33 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (9): Gender, Age, Alcohol Consumption, Restaurant, Taco Bell Order, Visi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Read the csv file. Note that some students will see an Excel option under
# "Import Dataset"; those who do not should save the original data as a csv
# and import it as a text file.
# rm(list = ls()) # uncomment to clear your working environment
fit <- kmeans(mydata[,-1], centers = 3, iter.max = 1000)
# Exclude the first column, as in the hierarchical analysis above.
# centers = 3 means you want to have 3 clusters.
table(fit$cluster)
##
## 1 2 3
## 8 9 16
barplot(table(fit$cluster), col="#336699") # bar chart of cluster sizes
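Because k-means requires choosing the number of clusters up front, a common check is an elbow plot of the total within-cluster sum of squares. A minimal sketch, assuming the same data and an illustrative range of k; the seed value is arbitrary:
set.seed(123) # k-means uses random starting centers, so fix the seed for reproducibility
wss <- sapply(1:8, function(k) kmeans(mydata[,-1], centers = k, iter.max = 1000)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")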
pca <- prcomp(mydata[,-1], scale=TRUE) # principal component analysis
pca_data <- mutate(fortify(pca), col = fit$cluster)
# We want to examine the cluster membership for each observation - see the last column.
ggplot(pca_data) +
  geom_point(aes(x = PC1, y = PC2, fill = factor(col)),
             size = 3, col = "#7f7f7f", shape = 21) +
  theme_bw(base_family = "Helvetica")
autoplot(fit, data=mydata[,-1], frame=TRUE, frame.type='norm')
names(pca)
## [1] "sdev" "rotation" "center" "scale" "x"
pca$center
## Age Alcohol Consumption Restaurant Taco Bell Order
## 2.0606061 3.3030303 0.9090909 1.0000000
## Visit Taco Bell Acohol Taco Bell Offer Alcohol Alcohol Options
## 2.7878788 1.3636364 1.1212121 1.0303030
pca$scale
## Age Alcohol Consumption Restaurant Taco Bell Order
## 1.4987368 1.8111607 0.6784005 0.7500000
## Visit Taco Bell Acohol Taco Bell Offer Alcohol Alcohol Options
## 1.8834406 0.8950622 0.8199686 1.7044949
pca$rotation
## PC1 PC2 PC3 PC4
## Age 0.14410654 0.628591630 0.702266794 0.15446481
## Alcohol Consumption 0.41710924 0.024319941 0.070259551 0.04842132
## Restaurant 0.39708739 -0.007971829 0.127676694 0.10330762
## Taco Bell Order 0.38939574 -0.093961977 -0.273894569 0.65316140
## Visit Taco Bell 0.41137403 -0.115669394 -0.097159333 0.16689131
## Acohol Taco Bell 0.40682544 -0.082234028 -0.002106686 -0.47994663
## Offer Alcohol 0.39608721 0.173257642 -0.185687543 -0.52543481
## Alcohol Options 0.06576578 -0.738395995 0.605508988 -0.04712223
## PC5 PC6 PC7 PC8
## Age -0.17527549 0.1022157 -0.029585454 0.1581866
## Alcohol Consumption -0.12314075 -0.6793185 -0.157496175 -0.5627656
## Restaurant 0.85902657 0.1855666 0.155652111 -0.1367299
## Taco Bell Order -0.08946469 0.0409041 -0.416553164 0.3933325
## Visit Taco Bell -0.40404855 0.2545455 0.732942770 -0.1218425
## Acohol Taco Bell -0.19829956 0.5400896 -0.466050914 -0.2217082
## Offer Alcohol 0.05240193 -0.3465909 0.148718160 0.5979522
## Alcohol Options -0.05348236 -0.1239148 -0.007063127 0.2516336
dim(pca$x)
## [1] 33 8
biplot(pca, scale=0)
# The signs of principal components are arbitrary: flipping both the loadings
# and the scores leaves the solution unchanged but can make the biplot easier to read.
pca$rotation <- -pca$rotation
pca$x <- -pca$x
biplot(pca, scale=0)
pca$sdev
## [1] 2.3117132 1.1294193 0.8676089 0.4956930 0.4305975 0.3203037 0.2244333
## [8] 0.2087057
pca.var=pca$sdev^2
pca.var
## [1] 5.34401813 1.27558805 0.75274526 0.24571153 0.18541424 0.10259445 0.05037029
## [8] 0.04355806
pve=pca.var/sum(pca.var)
pve
## [1] 0.668002266 0.159448506 0.094093158 0.030713941 0.023176780 0.012824306
## [7] 0.006296286 0.005444757
plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1),type='b')
plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1),type='b')
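The same proportions can be read directly from summary(pca), which reports the standard deviation, proportion of variance, and cumulative proportion for each component:
summary(pca) # importance of components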
write.csv(pca_data, "pca_data.csv")
#save your cluster solutions in the working directory
#We want to examine the cluster memberships for each observation - see last column of pca_data
Cluster analysis - reading (pp. 385-399): https://www.statlearning.com/ (hint: you can download the free version of this book from this website)
Comparison of similarity coefficients used for cluster analysis with dominant markers in maize (Zea mays L): https://www.scielo.br/scielo.php?script=sci_arttext&pid=S1415-47572004000100014&lng=en&nrm=iso
Principal Component Methods in R: Practical Guide: http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/
Principal component analysis - reading (pp. 404-405): https://www.statlearning.com/ (hint: you can download the free version of this book from this website)