Reading

For hierarchical clustering and exploratory data analysis, read Chapter 12 “Cluster Analysis” from An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (pp. 385-399).

Remember that this is just a starting point; explore the reading list, practical and lecture for more ideas.

Reference: Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (2013). An Introduction to Statistical Learning with Applications in R. https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

getwd()
## [1] "/cloud/project"
library(readr)
mydata <- read_csv('customer_segmentation.csv')
## New names:
## * Cost -> Cost...8
## * Cost -> Cost...11
## Rows: 28 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): ID, Residence_Type, Age, Income, Zip_Code, Service_Company, AC_Ser...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Importing data

In the following step, you will standardize your data (i.e., rescale each variable to have a mean of 0 and a standard deviation of 1). You can use R's scale function, a generic function whose default method centers and/or scales the columns of a numeric matrix.
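As a minimal sketch of what standardizing does, the example below applies scale to a small made-up data frame. The column names and values are hypothetical and are not taken from customer_segmentation.csv.

```r
# Toy data (hypothetical values, not from customer_segmentation.csv)
toy <- data.frame(
  age    = c(25, 32, 47, 51, 38),
  income = c(2, 3, 5, 4, 3)
)

# scale() centers each column to mean 0 and rescales it to standard deviation 1
toy_scaled <- scale(toy, center = TRUE, scale = TRUE)

round(colMeans(toy_scaled), 10)   # column means are 0 (up to rounding error)
apply(toy_scaled, 2, sd)          # column standard deviations are 1

# The original column means and standard deviations are kept as attributes
attr(toy_scaled, "scaled:center")
attr(toy_scaled, "scaled:scale")
```

Standardizing matters for distance-based methods because a variable measured on a large scale (such as age in years) would otherwise dominate a variable measured on a small scale (such as an income band).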

Building the distance matrix and plotting the trees (dendrograms)

Hierarchical clustering (using the function hclust) is an informative way to visualize the data.

We will see whether we can discover subgroups among the variables or among the observations.

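use <- scale(mydata[,-c(1)], center = TRUE, scale = TRUE)  # standardize every column except ID
use.dist <- dist(use)                  # Euclidean distances between standardized observations
# Note: the next line takes distances between the rows of the distance matrix itself;
# dist(use) alone gives the distances between the standardized observations.
d <- dist(as.matrix(use.dist))         # distance matrix passed to hclust
seg.hclust <- hclust(d)                # apply hierarchical clustering (complete linkage by default)
library(ggplot2) # needs no introduction
plot(seg.hclust)                       # draw the dendrogram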
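The following is a small sketch of the same pipeline on made-up data, assuming the common approach of computing distances once, directly on the scaled rows. The points and the row names (cust1 to cust6) are hypothetical; they are chosen so that two groups are obvious.

```r
# Toy data: two well-separated groups of three points each (hypothetical values)
toy <- rbind(
  c(1.0, 1.1), c(1.2, 0.9), c(0.9, 1.0),
  c(5.0, 5.2), c(5.1, 4.9), c(4.8, 5.0)
)
rownames(toy) <- paste0("cust", 1:6)

toy_scaled <- scale(toy)
toy_dist <- dist(toy_scaled, method = "euclidean")  # pairwise distances between rows

# hclust() uses complete linkage by default; "average" and "single" are other options
hc_complete <- hclust(toy_dist, method = "complete")
hc_average  <- hclust(toy_dist, method = "average")

plot(hc_complete, main = "Complete linkage")  # dendrogram

cutree(hc_complete, k = 2)  # cluster label for each row when the tree is cut into 2 groups
```

The linkage method can change the shape of the tree, so it is worth comparing a couple of choices before interpreting the clusters.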

Identifying clustering memberships for each cluster

Suppose your goal is to find profitable customers to target. Cutting the tree into a fixed number of clusters shows how many customers fall into each group.

groups.3 = cutree(seg.hclust,3)
table(groups.3)  # a good first step: use table() to see how many observations are in each cluster
## groups.3
##  1  2  3 
## 22  4  2
# In the following step, we find the members of each cluster or group.
mydata$ID[groups.3 == 1]
##  [1]  1  2  3  7  8  9 10 11 12 15 16 17 18 19 20 21 22 23 25 26 27 28
mydata$ID[groups.3 == 2]
## [1]  4  6 13 24
mydata$ID[groups.3 == 3]
## [1]  5 14
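The membership steps above can be sketched on a dataset that ships with R. USArrests is used here only as a stand-in for the customer data, and the cut height of 4 is an arbitrary value chosen for illustration.

```r
# USArrests ships with R; it stands in for the customer data here
hc <- hclust(dist(scale(USArrests)))

groups_k <- cutree(hc, k = 3)   # ask for exactly 3 clusters
table(groups_k)                 # cluster sizes

# Members of cluster 1, identified by row name
names(groups_k)[groups_k == 1]

# Alternatively, cut at a height on the dendrogram;
# the number of clusters then depends on the tree
groups_h <- cutree(hc, h = 4)
length(unique(groups_h))
```

Cutting by k fixes the number of clusters in advance, while cutting by h lets the dendrogram decide how many groups lie below a given dissimilarity.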

Identifying common features of each cluster using the aggregate function

#?aggregate
aggregate(mydata,list(groups.3),median)
##   Group.1   ID Residence_Type  Age Income Zip_Code Service_Company AC_Serviced
## 1       1 16.5            1.0 32.0      3      8.0            7.50      2.0925
## 2       2  9.5            1.0 39.5      5     10.0            7.75      1.0000
## 3       3  9.5            1.5 32.0      3      9.5            7.50      1.0000
##   Cost...8 Late Rude Cost...11 On_Time 24Hr 1Hr Overall_Time Service_Plan Trust
## 1     1.31  1.2 1.57       2.0       3  5.0 6.0          3.0            7 3.115
## 2     1.31  1.2 1.57       1.5       2  5.5 5.0          4.0            7 3.000
## 3     2.00  1.2 1.00       1.0       4  4.5 6.5          5.5            2 4.500
##   Other Offer_24hr Offer_1Hr Offer_Service_Plan Change
## 1     8       1.64     1.710              1.270    3.0
## 2     8       2.00     2.000              1.000    1.5
## 3     8       1.32     1.355              1.135    3.0
aggregate(mydata,list(groups.3),mean)
##   Group.1       ID Residence_Type      Age   Income Zip_Code Service_Company
## 1       1 15.45455       1.818182 34.31818 3.363636 8.090909        7.704545
## 2       2 11.75000       1.000000 38.50000 4.750000 9.250000        6.375000
## 3       3  9.50000       1.500000 32.00000 3.000000 9.500000        7.500000
##   AC_Serviced Cost...8     Late     Rude Cost...11  On_Time     24Hr  1Hr
## 1      2.4175 1.228636 1.163636 1.622273  2.045455 3.041818 5.189091 5.35
## 2      1.5000 1.405000 1.400000 1.570000  2.250000 2.000000 4.750000 4.75
## 3      1.0000 2.000000 1.200000 1.000000  1.000000 4.000000 4.500000 6.50
##   Overall_Time Service_Plan    Trust Other Offer_24hr Offer_1Hr
## 1     2.793636     6.524545 3.066364  7.99   1.621818  1.692273
## 2     4.500000     7.000000 3.500000  7.25   1.910000  2.000000
## 3     5.500000     2.000000 4.500000  8.00   1.320000  1.355000
##   Offer_Service_Plan   Change
## 1           1.316818 2.681818
## 2           1.067500 1.500000
## 3           1.135000 3.000000
aggregate(mydata[,-1],list(groups.3),median)
##   Group.1 Residence_Type  Age Income Zip_Code Service_Company AC_Serviced
## 1       1            1.0 32.0      3      8.0            7.50      2.0925
## 2       2            1.0 39.5      5     10.0            7.75      1.0000
## 3       3            1.5 32.0      3      9.5            7.50      1.0000
##   Cost...8 Late Rude Cost...11 On_Time 24Hr 1Hr Overall_Time Service_Plan Trust
## 1     1.31  1.2 1.57       2.0       3  5.0 6.0          3.0            7 3.115
## 2     1.31  1.2 1.57       1.5       2  5.5 5.0          4.0            7 3.000
## 3     2.00  1.2 1.00       1.0       4  4.5 6.5          5.5            2 4.500
##   Other Offer_24hr Offer_1Hr Offer_Service_Plan Change
## 1     8       1.64     1.710              1.270    3.0
## 2     8       2.00     2.000              1.000    1.5
## 3     8       1.32     1.355              1.135    3.0
aggregate(mydata[,-1],list(groups.3),mean)
##   Group.1 Residence_Type      Age   Income Zip_Code Service_Company AC_Serviced
## 1       1       1.818182 34.31818 3.363636 8.090909        7.704545      2.4175
## 2       2       1.000000 38.50000 4.750000 9.250000        6.375000      1.5000
## 3       3       1.500000 32.00000 3.000000 9.500000        7.500000      1.0000
##   Cost...8     Late     Rude Cost...11  On_Time     24Hr  1Hr Overall_Time
## 1 1.228636 1.163636 1.622273  2.045455 3.041818 5.189091 5.35     2.793636
## 2 1.405000 1.400000 1.570000  2.250000 2.000000 4.750000 4.75     4.500000
## 3 2.000000 1.200000 1.000000  1.000000 4.000000 4.500000 6.50     5.500000
##   Service_Plan    Trust Other Offer_24hr Offer_1Hr Offer_Service_Plan   Change
## 1     6.524545 3.066364  7.99   1.621818  1.692273           1.316818 2.681818
## 2     7.000000 3.500000  7.25   1.910000  2.000000           1.067500 1.500000
## 3     2.000000 4.500000  8.00   1.320000  1.355000           1.135000 3.000000
cluster_means <- aggregate(mydata[,-1],list(groups.3),mean)

Exporting cluster analysis results to Excel from RStudio Cloud

write.csv(groups.3, "clusterID.csv")
write.csv(cluster_means, "cluster_means.csv")
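One possible variation, sketched below, is to export the customer ID next to its cluster label so the file is easier to read in Excel. The vectors ids and groups are hypothetical stand-ins for mydata$ID and groups.3, and the file is written to a temporary directory only to keep the sketch self-contained.

```r
# Hypothetical IDs and cluster labels standing in for mydata$ID and groups.3
ids    <- 1:6
groups <- c(1, 1, 2, 1, 3, 2)

membership <- data.frame(ID = ids, cluster = groups)

# Temporary path used for this sketch only
out_file <- file.path(tempdir(), "clusterID_example.csv")

# row.names = FALSE drops the extra index column that write.csv adds by default
write.csv(membership, out_file, row.names = FALSE)

read.csv(out_file)  # read the file back to check its layout
```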

Downloading your solutions manually

First, select the files (“clusterID.csv” & “cluster_means.csv”) and put a checkmark before each file.

Second, click the gear icon on the right side of your pane and export the data.

Finding means or medians of each variable (factor) for each cluster

Suppose your goal is to find profitable customers to target. Using the mean or the median function, you can see the characteristics of each subgroup. Now it is time to use your domain expertise.
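To illustrate how the two summaries can differ, here is a small sketch with made-up values in which one cluster contains a single extreme observation. The data frame and its column names are hypothetical.

```r
# Hypothetical data: cluster 2 contains one extreme income value
toy <- data.frame(
  income  = c(3, 4, 3, 5, 4, 40),
  cluster = c(1, 1, 1, 2, 2, 2)
)

cluster_mean   <- aggregate(income ~ cluster, data = toy, FUN = mean)
cluster_median <- aggregate(income ~ cluster, data = toy, FUN = median)

cluster_mean     # the cluster 2 mean is pulled upward by the value 40
cluster_median   # the cluster 2 median stays at 5
```

With small clusters, such as the ones in this exercise, a single unusual customer can shift the mean noticeably, which is one reason to look at both summaries.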

Discussion Questions for you

  1. How many observations do we have in each cluster?

Answer: Your answer here:

  2. We can look at the medians (or means) for the variables in each cluster. Why is this important?

Answer: Your answer here:

  3. Should the mean or the median be used when analyzing the differences among clusters? Why?

Answer: Your answer here:

  4. Now we need to understand the common characteristics of each cluster. Our goal is to build a targeting strategy using the profiles of each cluster. What summary measures of each cluster are appropriate in a descriptive sense?

Answer: Your answer here:

  5. Are there any major differences between K-means clustering (https://rpubs.com/utjimmyx/kmeans) and hierarchical clustering? Which one do you like better? Why? You may refer to the assigned readings.

Answer: Your answer here:

  6. Do a keyword search using “cluster analysis.” How many relevant job titles are there?

Answer: Your answer here:

Principal Component Analysis (PCA)

Intro

Principal Component Analysis (PCA) summarizes the features in a dataset with a smaller number of uncorrelated components and can be used in conjunction with cluster analysis.

PCA is also a popular machine learning technique used for feature selection. Imagine you have more than 100 features or factors; it is useful to select the most important ones for further analysis.

The basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest in absolute values) of their coefficients (loadings).
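The ranking idea described above can be sketched as follows. The built-in USArrests data stands in for the customer data, and keeping the top two variables is an arbitrary cutoff chosen for illustration.

```r
# USArrests ships with R; it stands in for the customer data here
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Loadings (coefficients) of the first principal component
pc1 <- pca_demo$rotation[, "PC1"]

# Rank the variables by absolute loading, largest first
ranked <- sort(abs(pc1), decreasing = TRUE)
ranked

# Names of the top two variables (the cutoff of 2 is arbitrary)
names(ranked)[1:2]
```

The same ranking can be repeated for PC2, PC3 and so on; how many components and variables to keep is a judgment call.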

#install.packages('dplyr')
library(dplyr) # sane data manipulation
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr) # sane data munging
library(ggplot2) # needs no introduction
library(ggfortify) # super-helpful for plotting non-"standard" stats objects

#identifying your working directory
getwd() #confirm your working directory is accurate
## [1] "/cloud/project"
library(readr)

# mydata <- read_csv('Segmentation.csv')

mydata <- read_csv('customer_segmentation.csv')
## New names:
## * Cost -> Cost...8
## * Cost -> Cost...11
## Rows: 28 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): ID, Residence_Type, Age, Income, Zip_Code, Service_Company, AC_Ser...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Read the csv file; here it is loaded from the working directory.

# Open the data. Note that some students will see an Excel option in "Import Dataset";
# those who do not will need to save the original data as a csv and import that as a text file.
# rm(list = ls()) # used to clean your working environment
fit <- kmeans(na.omit(mydata[,-1]), 3, iter.max=1000)
# Exclude the first column since it is "ID" rather than a factor or variable.
# 3 means you want to have 3 clusters.
table(fit$cluster)
## 
##  1  2  3 
##  9  9 10
barplot(table(fit$cluster), col="#336699") #plot
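Because kmeans starts from randomly chosen centers, cluster labels and sizes can change from run to run. A hedged sketch of one way to make a run reproducible is shown below; the seed value, the use of scaled data, nstart = 25 and the USArrests stand-in dataset are all assumptions made for illustration, not part of the analysis above.

```r
set.seed(123)  # arbitrary seed so the random starts are reproducible

x <- scale(USArrests)  # built-in data, standardized so no variable dominates

# nstart = 25 tries 25 random starts and keeps the best solution
km <- kmeans(x, centers = 3, nstart = 25, iter.max = 1000)

table(km$cluster)   # cluster sizes
km$tot.withinss     # total within-cluster sum of squares; lower means tighter clusters
```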

pca <- prcomp(mydata[,-1], scale=TRUE) # principal component analysis on standardized variables
pca_data <- mutate(fortify(pca), col=fit$cluster)
# We want to examine the cluster membership of each observation - see the last column

ggplot(pca_data) + geom_point(aes(x=PC1, y=PC2, fill=factor(col)),
size=3, col="#7f7f7f", shape=21) + theme_bw(base_family="Helvetica")

autoplot(fit, data=mydata[,-1], frame=TRUE, frame.type='norm')

names(pca)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
pca$center
##     Residence_Type                Age             Income           Zip_Code 
##           1.678571          34.750000           3.535714           8.357143 
##    Service_Company        AC_Serviced           Cost...8               Late 
##           7.500000           2.185179           1.308929           1.200000 
##               Rude          Cost...11            On_Time               24Hr 
##           1.570357           2.000000           2.961429           5.077143 
##                1Hr       Overall_Time       Service_Plan              Trust 
##           5.346429           3.230714           6.269286           3.230714 
##              Other         Offer_24hr          Offer_1Hr Offer_Service_Plan 
##           7.885000           1.641429           1.712143           1.268214 
##             Change 
##           2.535714
pca$scale
##     Residence_Type                Age             Income           Zip_Code 
##          1.0559732         10.2906826          1.2904820          2.9466150 
##    Service_Company        AC_Serviced           Cost...8               Late 
##          1.1055416          1.0196963          0.3202585          0.1721326 
##               Rude          Cost...11            On_Time               24Hr 
##          0.2519771          1.0540926          1.1700197          1.3857231 
##                1Hr       Overall_Time       Service_Plan              Trust 
##          1.3036223          1.5941489          1.4025821          1.7492367 
##              Other         Offer_24hr          Offer_1Hr Offer_Service_Plan 
##          0.5661403          0.3450358          0.3253073          0.3296132 
##             Change 
##          1.1379690
pca$rotation
##                             PC1          PC2          PC3         PC4
## Residence_Type      0.373629361  0.276414724 -0.027170505 -0.25393000
## Age                -0.230324195 -0.370438131 -0.003613364 -0.05806519
## Income             -0.274214571 -0.379502994 -0.050661042  0.23509101
## Zip_Code           -0.172766937 -0.168781353 -0.012209557 -0.08420956
## Service_Company     0.089178700 -0.175532910  0.222446273 -0.19083304
## AC_Serviced         0.047436762 -0.084466584  0.475126138  0.33431638
## Cost...8           -0.316986844  0.259845102 -0.209373252 -0.08457444
## Late               -0.079040304 -0.142599719 -0.412584857 -0.05389986
## Rude                0.368015914 -0.290418890 -0.021098585  0.01378698
## Cost...11           0.238287528 -0.242738208 -0.142177165 -0.14603391
## On_Time            -0.029355172 -0.154034607  0.083391769 -0.43272166
## 24Hr                0.001238570  0.190216821  0.128434635  0.22163572
## 1Hr                -0.097211253 -0.005881459 -0.010161842 -0.22687296
## Overall_Time       -0.379913275 -0.012402662 -0.141705658  0.09876885
## Service_Plan        0.360239784 -0.314445429  0.009396611  0.25884481
## Trust              -0.000562543  0.335662716  0.055942169  0.18796944
## Other               0.016345453  0.094341866  0.005373530 -0.35433828
## Offer_24hr          0.245293778  0.201079167 -0.338114468  0.14440719
## Offer_1Hr           0.150776505  0.021623108 -0.342772536  0.29267916
## Offer_Service_Plan  0.084761401  0.036332469  0.265547888 -0.21378606
## Change             -0.135600081  0.148430326  0.370638367  0.11101994
##                            PC5         PC6         PC7         PC8         PC9
## Residence_Type     -0.01760445  0.03572443  0.04646258 -0.15267361 -0.08976698
## Age                -0.10655625  0.09041125 -0.19451291  0.03632537  0.16636759
## Income              0.04860570  0.01928342  0.08043885 -0.08340413  0.20305153
## Zip_Code            0.42013257 -0.34992335 -0.13949293 -0.20314758  0.19337703
## Service_Company    -0.40985320 -0.04918690 -0.23780900 -0.04328157 -0.39766034
## AC_Serviced        -0.05337715  0.06766333 -0.06494637  0.22196218 -0.07761310
## Cost...8            0.27430292  0.13767459 -0.04324081  0.12293524 -0.20387912
## Late               -0.27740403  0.20648723  0.06651294 -0.13481704  0.07994115
## Rude                0.10838133  0.03136980  0.21763508  0.04372112  0.15451911
## Cost...11           0.16553650 -0.44529040 -0.23584877 -0.13029039 -0.09476957
## On_Time             0.16680671  0.16041990  0.17613266  0.40742349 -0.26596464
## 24Hr                0.42173843  0.19241251  0.24894553 -0.20204873 -0.08526099
## 1Hr                 0.09497311  0.33494823 -0.53335484 -0.20668902  0.04392626
## Overall_Time       -0.31224598  0.04228363  0.22349850  0.04145076 -0.09301040
## Service_Plan        0.02484368  0.19698985  0.05223552  0.07706370  0.04342399
## Trust              -0.32649364 -0.38611634 -0.05644219 -0.13812477  0.14987375
## Other              -0.07777815 -0.15884279  0.10952053  0.48967630  0.52554900
## Offer_24hr         -0.02455806  0.20252677 -0.15902144  0.11941787  0.27184194
## Offer_1Hr           0.04682341  0.08038004 -0.41050137  0.33226250 -0.15026588
## Offer_Service_Plan -0.10263137  0.40238892 -0.02239345 -0.34741126  0.33573984
## Change              0.09214424  0.02495464 -0.35270025  0.24983972  0.20574531
##                           PC10        PC11        PC12        PC13         PC14
## Residence_Type      0.22207050 -0.09315153 -0.06700190 -0.10968125 -0.430516785
## Age                -0.27416819  0.55075599 -0.24060872 -0.04273385 -0.213514409
## Income              0.18090983 -0.27739836 -0.14922792  0.28828884  0.046543748
## Zip_Code           -0.09350831 -0.24516110  0.30003372 -0.04989144 -0.222402163
## Service_Company    -0.09120610  0.15262041  0.34520894  0.24287795  0.179890202
## AC_Serviced         0.09565969 -0.29180888 -0.08513375  0.18464044 -0.121452311
## Cost...8           -0.17483702  0.15150764 -0.27073688  0.12442720 -0.207004278
## Late                0.42533936  0.11294024  0.31605167  0.24100267 -0.357749210
## Rude                0.17436763  0.25234653 -0.13379445 -0.41169167  0.254956054
## Cost...11          -0.15687124  0.01097831  0.05915662  0.08319541  0.002581943
## On_Time            -0.06945203 -0.18670617 -0.03884217  0.02595759 -0.166563481
## 24Hr                0.04087605  0.38247544  0.30079944  0.30735991  0.197222577
## 1Hr                 0.37935392 -0.13949288 -0.18840137 -0.13164123  0.302548994
## Overall_Time       -0.16722659 -0.19519524  0.24050652 -0.45585244  0.084038691
## Service_Plan       -0.09185030 -0.01073019 -0.02194911  0.00892740 -0.349228014
## Trust               0.01429480  0.09089043 -0.33276376  0.09448350 -0.132656838
## Other               0.11639016  0.04833519  0.07225281  0.31160850  0.196547947
## Offer_24hr         -0.36337328 -0.15050717  0.26265670 -0.02224944  0.074978929
## Offer_1Hr          -0.03394036 -0.04524192 -0.04336421  0.13053128  0.087335296
## Offer_Service_Plan -0.42801959 -0.15365793 -0.03186151  0.13249467 -0.034978418
## Change              0.18202162  0.21272858  0.36149530 -0.30986571 -0.276734034
##                           PC15        PC16        PC17         PC18        PC19
## Residence_Type      0.01249869  0.34489817 -0.16517772  0.352528189  0.13871912
## Age                 0.04221382  0.31607044 -0.20637468  0.005817232  0.30187786
## Income              0.27190992 -0.09741541  0.06557375  0.553208084  0.03360782
## Zip_Code            0.17804963  0.38151683 -0.06803315 -0.230021814 -0.30276204
## Service_Company     0.19723181  0.20743622  0.20289291  0.196585238 -0.28119848
## AC_Serviced        -0.10083426  0.28176661  0.23822279 -0.387016191  0.29848911
## Cost...8           -0.09870146  0.03866889  0.54935880  0.028373723 -0.26932810
## Late                0.03601857 -0.14262474  0.11914077 -0.363785990  0.04694659
## Rude                0.29306609  0.03231173  0.27732730 -0.188964532 -0.19258841
## Cost...11          -0.26567879 -0.34667613  0.18620057  0.001506252  0.36569139
## On_Time             0.45053487 -0.18644113 -0.12234750 -0.090127695  0.12307268
## 24Hr                0.05433606  0.11104277 -0.15009048  0.013854410  0.12861653
## 1Hr                -0.07756399  0.14368149  0.07040877 -0.028139515  0.03340802
## Overall_Time       -0.18788579  0.11762113 -0.04856551  0.024471539  0.01323205
## Service_Plan       -0.31491515  0.03878453  0.06837898  0.190526685 -0.36090924
## Trust               0.38257934 -0.05759850  0.02745469 -0.123678063 -0.11737125
## Other              -0.26529228  0.17709560 -0.01081481  0.054756035 -0.10511406
## Offer_24hr          0.31812211  0.17860991  0.32020821  0.108141678  0.32532492
## Offer_1Hr           0.07072903 -0.05834078 -0.47811995 -0.126357068 -0.24264880
## Offer_Service_Plan -0.01319258 -0.28468033 -0.09562606 -0.158786065 -0.17319786
## Change              0.07935220 -0.35568184  0.07124782  0.205630900 -0.01813869
##                           PC20          PC21
## Residence_Type     -0.36591635 -2.736078e-04
## Age                -0.02711828 -5.497727e-04
## Income             -0.23117364  6.427987e-05
## Zip_Code            0.00802639  1.882913e-04
## Service_Company    -0.07369877  3.316176e-04
## AC_Serviced        -0.22491179 -2.387594e-05
## Cost...8           -0.23819090  2.943445e-04
## Late               -0.04127826 -1.188675e-04
## Rude               -0.33254488  9.684114e-05
## Cost...11          -0.24875215  2.815715e-01
## On_Time             0.12818145  3.128524e-01
## 24Hr               -0.09877026  3.703242e-01
## 1Hr                 0.16038263  3.487647e-01
## Overall_Time       -0.30568501  4.261282e-01
## Service_Plan        0.33243944  3.752207e-01
## Trust               0.06670890  4.677341e-01
## Other              -0.10166072  1.513665e-01
## Offer_24hr          0.16183625  5.329676e-04
## Offer_1Hr          -0.34915206 -2.852282e-04
## Offer_Service_Plan -0.31101619 -3.130004e-04
## Change             -0.09727630 -2.191092e-04
dim(pca$x)
## [1] 28 21
biplot(pca, scale=0)

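# Principal components are only defined up to sign, so flipping both the
# loadings and the scores gives an equivalent solution with a mirrored biplot.
pca$rotation=-pca$rotation
pca$x=-pca$x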
biplot(pca, scale=0)
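Flipping the signs of both the loadings and the scores does not change what the components represent. The sketch below checks this on the built-in USArrests data, used here only as a stand-in.

```r
p <- prcomp(USArrests, scale. = TRUE)

# Reconstruct the scaled data from scores and loadings
recon_original <- p$x %*% t(p$rotation)

# Flip the sign of both the scores and the loadings
recon_flipped <- (-p$x) %*% t(-p$rotation)

all.equal(recon_original, recon_flipped)  # the two sign changes cancel out
p$sdev                                    # standard deviations do not depend on the sign
```

Since the variance explained depends only on sdev, the scree plots are the same whichever sign convention is used.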

pca$sdev
##  [1] 1.8100142614 1.6322855562 1.5000285284 1.4726879312 1.3742015470
##  [6] 1.2785726050 1.2135372946 1.1243211218 0.9200126452 0.8493238349
## [11] 0.8209681720 0.7687454359 0.6644650739 0.5646446355 0.5174821084
## [16] 0.4372680652 0.3468031636 0.3411589338 0.2264673469 0.2015884733
## [21] 0.0003390012
pca.var=pca$sdev^2
pca.var
##  [1] 3.276152e+00 2.664356e+00 2.250086e+00 2.168810e+00 1.888430e+00
##  [6] 1.634748e+00 1.472673e+00 1.264098e+00 8.464233e-01 7.213510e-01
## [11] 6.739887e-01 5.909695e-01 4.415138e-01 3.188236e-01 2.677877e-01
## [16] 1.912034e-01 1.202724e-01 1.163894e-01 5.128746e-02 4.063791e-02
## [21] 1.149218e-07
pve=pca.var/sum(pca.var)
pve
##  [1] 1.560072e-01 1.268741e-01 1.071469e-01 1.032767e-01 8.992523e-02
##  [6] 7.784514e-02 7.012727e-02 6.019514e-02 4.030587e-02 3.435005e-02
## [11] 3.209470e-02 2.814141e-02 2.102447e-02 1.518207e-02 1.275180e-02
## [16] 9.104922e-03 5.727259e-03 5.542353e-03 2.442260e-03 1.935139e-03
## [21] 5.472467e-09
plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1),type='b')

plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1),type='b')
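As a cross-check on the manual calculation of the proportion of variance explained, the same quantities can be read from summary(). The sketch below uses the built-in USArrests data as a stand-in, and the 80% threshold is an arbitrary choice for illustration.

```r
p <- prcomp(USArrests, scale. = TRUE)

# Manual calculation, as in the steps above
pve_manual <- p$sdev^2 / sum(p$sdev^2)

# summary() reports the same quantities (rounded) in its importance table
imp <- summary(p)$importance
imp["Proportion of Variance", ]
imp["Cumulative Proportion", ]

# Smallest number of components whose cumulative share reaches 80% (arbitrary threshold)
which(cumsum(pve_manual) >= 0.80)[1]
```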

write.csv(pca_data, "pca_data.csv")
#save your cluster solutions in the working directory
#We want to examine the cluster memberships for each observation - see last column of pca_data

References

Cluster analysis - reading (p.385-p.399) https://www.statlearning.com/

Hint: you can download the free version of this book from this website.

Comparison of similarity coefficients used for cluster analysis with dominant markers in maize (Zea mays L) https://www.scielo.br/scielo.php?script=sci_arttext&pid=S1415-47572004000100014&lng=en&nrm=iso

Principal Component Methods in R: Practical Guide http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/

Principal component analysis - reading (p.404-p.405) https://www.statlearning.com/

Hint: you can download the free version from this website.

https://online.stat.psu.edu/stat505/lesson/11/11.4