Intro

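Principal Component Analysis (PCA) summarizes the features in a dataset as a smaller set of uncorrelated components, each a weighted combination of the original variables. It helps show which features vary together and can be used in conjunction with cluster analysis.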

PCA is also a popular machine learning technique that is often used to guide feature selection. If a dataset has more than 100 features or factors, it is useful to select the most important ones for further analysis.

The basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest in absolute values) of their coefficients (loadings).
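The ranking idea described above can be sketched as follows. This is a minimal illustration on R's built-in USArrests data rather than the tutorial's SEO data, and keeping the top two variables is an arbitrary choice made only for the example.

```r
# Illustration on the built-in USArrests data (an assumption; not the SEO data)
pca <- prcomp(USArrests, scale. = TRUE)

# Loadings (coefficients) of each variable on the first principal component
pc1_loadings <- pca$rotation[, "PC1"]

# Order the variables by absolute loading, largest first
ranked <- sort(abs(pc1_loadings), decreasing = TRUE)
print(ranked)

# Keep the names of the top two variables as a candidate subset
# (the cutoff of two is arbitrary and chosen only for illustration)
top_vars <- names(ranked)[1:2]
print(top_vars)
```

The same ranking can be repeated for PC2, PC3, and so on, since a variable with a small loading on one component may load heavily on another.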

This tutorial adapts the sample syntax from the book An Introduction to Statistical Learning (ISLR). Principal component analysis - reading (p.404-p.405) https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf

# Set the working directory to the folder that contains SEOdental.csv
setwd("C:/Users/zxu3/Documents/R/segmentation")
library(rlang)
library(readr)

# Read the SEO data set into a data frame
mydata <- read_csv('SEOdental.csv')
## 
## -- Column specification --------------------------------------------------------
## cols(
##   ID = col_double(),
##   Page_rank = col_double(),
##   `Keyword Appearance (URL)` = col_double(),
##   K.A.1 = col_double(),
##   Domain_Reg = col_double(),
##   `Keyword Appearance (title tag)` = col_double(),
##   K.A. = col_double(),
##   `Bold appearances in Meta Description` = col_double(),
##   H1 = col_double(),
##   Mobile = col_double(),
##   Desktop = col_double(),
##   Page_Score = col_double(),
##   `Directory % missing` = col_double(),
##   Yelp = col_double(),
##   claimed = col_double(),
##   reviews = col_double(),
##   `star rating` = col_double()
## )
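# Run PCA on every column of mydata (the ID column is included here);
# scale=TRUE standardizes each variable to unit variance first
pr.out=prcomp(mydata, scale=TRUE)
# List the components stored in the prcomp result
names(pr.out)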
## [1] "sdev"     "rotation" "center"   "scale"    "x"
pr.out$center
##                                   ID                            Page_rank 
##                           10.0000000                           29.3157895 
##             Keyword Appearance (URL)                                K.A.1 
##                            1.0000000                            0.1052632 
##                           Domain_Reg       Keyword Appearance (title tag) 
##                            7.6842105                            3.5263158 
##                                 K.A. Bold appearances in Meta Description 
##                            0.2631579                            2.7894737 
##                                   H1                               Mobile 
##                            1.1578947                           43.9473684 
##                              Desktop                           Page_Score 
##                           71.5789474                           75.7368421 
##                  Directory % missing                                 Yelp 
##                           37.8947368                            0.8947368 
##                              claimed                              reviews 
##                            0.8421053                           72.1578947 
##                          star rating 
##                            4.5052632
pr.out$scale
##                                   ID                            Page_rank 
##                            5.6273143                           27.0986934 
##             Keyword Appearance (URL)                                K.A.1 
##                            1.2018504                            0.3153018 
##                           Domain_Reg       Keyword Appearance (title tag) 
##                            3.8880534                            1.0202626 
##                                 K.A. Bold appearances in Meta Description 
##                            0.4524139                            1.2283208 
##                                   H1                               Mobile 
##                            0.9581903                           29.1289731 
##                              Desktop                           Page_Score 
##                           21.2976676                            6.2702304 
##                  Directory % missing                                 Yelp 
##                           22.0803954                            0.3153018 
##                              claimed                              reviews 
##                            0.3746343                           69.1482009 
##                          star rating 
##                            0.6275824
pr.out$rotation
##                                              PC1          PC2         PC3
## ID                                    0.41491062  0.098409362  0.17804205
## Page_rank                             0.37139307  0.198553005  0.19250261
## Keyword Appearance (URL)             -0.29840474  0.213297249  0.08103749
## K.A.1                                -0.24099188  0.177621939 -0.01865257
## Domain_Reg                           -0.07327380  0.213057117  0.10839475
## Keyword Appearance (title tag)       -0.20036272  0.186006782  0.28118506
## K.A.                                 -0.03081541  0.402794644 -0.34009285
## Bold appearances in Meta Description -0.06731706  0.184227508 -0.33002865
## H1                                   -0.31188647  0.158456723  0.06640510
## Mobile                                0.30874970  0.163652049  0.39175058
## Desktop                               0.28396289  0.333975644  0.14878942
## Page_Score                           -0.20813859  0.279222083  0.29754729
## Directory % missing                   0.25833175 -0.007796739 -0.45692113
## Yelp                                  0.04428533 -0.430305440  0.08215275
## claimed                              -0.24921494 -0.049655350  0.20189151
## reviews                              -0.19125230 -0.326741326  0.16312879
## star rating                           0.08865831 -0.248651437  0.23975874
##                                              PC4         PC5         PC6
## ID                                   -0.02588717  0.15686178 -0.12769942
## Page_rank                            -0.00231635  0.11942147 -0.27819343
## Keyword Appearance (URL)              0.32311162 -0.02389552 -0.07621342
## K.A.1                                 0.53552813 -0.03700137  0.16193831
## Domain_Reg                           -0.14505100  0.60486919 -0.15480328
## Keyword Appearance (title tag)       -0.14578302 -0.48508378 -0.27367480
## K.A.                                 -0.16638018 -0.05944836  0.24570200
## Bold appearances in Meta Description -0.42038400 -0.36237103 -0.05555327
## H1                                   -0.04173450  0.10674982 -0.44043795
## Mobile                                0.03219065 -0.19234673  0.11930466
## Desktop                               0.11880393 -0.24298125  0.15973984
## Page_Score                           -0.14790988  0.18792736  0.18234074
## Directory % missing                  -0.01689185  0.16825889 -0.05346103
## Yelp                                  0.30889995 -0.11520735  0.02079515
## claimed                              -0.20910136  0.18343581  0.54445576
## reviews                              -0.24172380 -0.04479067 -0.28796957
## star rating                          -0.35328096 -0.05643448  0.24963567
##                                              PC7          PC8         PC9
## ID                                    0.04506485 -0.177353378  0.06974235
## Page_rank                             0.11942890 -0.044346951 -0.13751115
## Keyword Appearance (URL)              0.25061870 -0.429959459  0.40858541
## K.A.1                                 0.29834417 -0.072480319 -0.23176142
## Domain_Reg                            0.40292322  0.148151083 -0.35821913
## Keyword Appearance (title tag)        0.05154998  0.211183264 -0.16896342
## K.A.                                  0.25070300 -0.036954447  0.08451624
## Bold appearances in Meta Description  0.23771904 -0.002644793 -0.14588716
## H1                                   -0.12951332  0.029509382  0.02369439
## Mobile                               -0.07503328 -0.044942306  0.13178531
## Desktop                               0.02015830 -0.264485714 -0.27874188
## Page_Score                           -0.03865335  0.237025843  0.47837103
## Directory % missing                   0.06847800 -0.219192781  0.17452965
## Yelp                                  0.41835535  0.264978889 -0.09040127
## claimed                              -0.21116027 -0.273994576 -0.37602808
## reviews                               0.10554885 -0.621220872 -0.04797515
## star rating                           0.53644784  0.016389323  0.25195091
##                                              PC10         PC11         PC12
## ID                                   -0.189444904  0.150696909 -0.121693693
## Page_rank                            -0.137616341 -0.161484546 -0.306798285
## Keyword Appearance (URL)              0.064827540 -0.139310436 -0.204441344
## K.A.1                                -0.159986031  0.079620379 -0.009139140
## Domain_Reg                            0.304667188  0.002217546  0.078009376
## Keyword Appearance (title tag)        0.242041037  0.109338573  0.310834154
## K.A.                                  0.111323516 -0.291224975  0.301086972
## Bold appearances in Meta Description -0.172337901  0.238546459 -0.512037516
## H1                                   -0.680500249  0.002335028  0.327367654
## Mobile                                0.128491407 -0.082567583  0.176771791
## Desktop                              -0.097022886  0.266063747  0.162225548
## Page_Score                            0.076553705  0.532759727 -0.150271395
## Directory % missing                  -0.002010201  0.467712041  0.413061783
## Yelp                                 -0.066448187  0.333938149  0.024692875
## claimed                              -0.196084975  0.094617999 -0.004931365
## reviews                               0.265612657  0.135339509  0.055786475
## star rating                          -0.337033604 -0.229795231  0.178784403
##                                             PC13          PC14        PC15
## ID                                    0.15368552 -2.311681e-01  0.30482423
## Page_rank                             0.14711888 -1.056681e-01  0.27576230
## Keyword Appearance (URL)              0.08232822  8.816408e-02 -0.06614084
## K.A.1                                -0.29161884  8.098577e-02  0.35247495
## Domain_Reg                           -0.08731380  1.945602e-01 -0.20027800
## Keyword Appearance (title tag)       -0.06320955 -1.324946e-01  0.40504232
## K.A.                                  0.50796207 -2.421950e-01  0.03875466
## Bold appearances in Meta Description  0.01099582  3.178286e-01 -0.06353075
## H1                                    0.16631251  1.213419e-01 -0.17090883
## Mobile                                0.17048739  7.221460e-01 -0.03220194
## Desktop                              -0.17240600 -2.846855e-01 -0.55452266
## Page_Score                            0.01276705 -1.568506e-01 -0.01300349
## Directory % missing                  -0.12439030  2.164994e-01  0.26748767
## Yelp                                  0.54766408 -6.494029e-05 -0.12672552
## claimed                               0.24237242  8.701303e-02  0.25396201
## reviews                               0.08519085 -8.379138e-02 -0.04880102
## star rating                          -0.35373494 -5.148979e-02  0.01780417
##                                             PC16         PC17
## ID                                    0.07292611  0.676632542
## Page_rank                            -0.05266873 -0.638764209
## Keyword Appearance (URL)             -0.49429268  0.069550969
## K.A.1                                 0.44696292 -0.005728636
## Domain_Reg                           -0.09254248  0.144691302
## Keyword Appearance (title tag)       -0.28664829  0.063217957
## K.A.                                  0.22736481 -0.021409074
## Bold appearances in Meta Description  0.02204962  0.096136621
## H1                                    0.07751711 -0.004668534
## Mobile                                0.20153314  0.018702624
## Desktop                              -0.09790181 -0.067746430
## Page_Score                            0.20225614 -0.187365947
## Directory % missing                  -0.22327148 -0.191451955
## Yelp                                 -0.09801259 -0.038126818
## claimed                              -0.28185543 -0.046065926
## reviews                               0.40763424 -0.111921216
## star rating                          -0.05208442 -0.042693776
dim(pr.out$x)
## [1] 19 17
biplot(pr.out, scale=0)

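# The sign of each principal component is arbitrary, so flipping both the
# loadings and the scores gives a mirrored but equivalent biplot
pr.out$rotation=-pr.out$rotation
pr.out$x=-pr.out$x
biplot(pr.out, scale=0)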
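Negating both the loadings and the scores leaves the fitted data unchanged, which is why the two biplots are equivalent. The sketch below checks this on the built-in USArrests data (an assumption made so the example runs without SEOdental.csv).

```r
# Illustration on the built-in USArrests data (an assumption; not the SEO data)
pca <- prcomp(USArrests, scale. = TRUE)
X <- scale(USArrests)  # centered and scaled data, matching what prcomp analyzed

# With all components kept, the scaled data equals scores %*% t(loadings)
recon_original <- pca$x %*% t(pca$rotation)

# Flip the sign of both the scores and the loadings
recon_flipped <- (-pca$x) %*% t(-pca$rotation)

# Both reconstructions match each other and the scaled data
max_diff_flip <- max(abs(recon_original - recon_flipped))
max_diff_data <- max(abs(recon_original - X))
print(c(max_diff_flip, max_diff_data))
```

Both differences should be zero up to floating-point error, so the choice of sign affects only how the plot is oriented.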

pr.out$sdev
##  [1] 2.13681152 1.70357697 1.54345370 1.28940496 1.15623874 1.02822038
##  [7] 0.93882115 0.75388287 0.70657159 0.67078412 0.56194017 0.44777564
## [13] 0.28595948 0.20291991 0.17362534 0.12871520 0.08978104
pr.var=pr.out$sdev^2
pr.var
##  [1] 4.565963474 2.902174506 2.382249328 1.662565142 1.336888031 1.057237148
##  [7] 0.881385150 0.568339376 0.499243414 0.449951339 0.315776750 0.200503027
## [13] 0.081772826 0.041176490 0.030145760 0.016567604 0.008060635
pve=pr.var/sum(pr.var)
pve
##  [1] 0.2685860867 0.1707161474 0.1401323134 0.0977979495 0.0786404724
##  [6] 0.0621904205 0.0518461853 0.0334317280 0.0293672596 0.0264677258
## [11] 0.0185751029 0.0117942957 0.0048101662 0.0024221465 0.0017732800
## [16] 0.0009745649 0.0004741550
plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1),type='b')

plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1),type='b')
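The cumulative plot is often read by picking the smallest number of components that reaches a chosen share of the variance. The sketch below does this with the pve values printed above, rounded to four decimals; the 80% threshold is an assumption for illustration, not a rule from the tutorial.

```r
# Proportion of variance explained, copied from the output above (rounded)
pve <- c(0.2686, 0.1707, 0.1401, 0.0978, 0.0786, 0.0622, 0.0518, 0.0334,
         0.0294, 0.0265, 0.0186, 0.0118, 0.0048, 0.0024, 0.0018, 0.0010,
         0.0005)

# Running total of variance explained by the first k components
cum_pve <- cumsum(pve)

# Assumed cutoff: keep enough components to explain at least 80%
threshold <- 0.80
n_components <- which(cum_pve >= threshold)[1]
print(n_components)
```

With these values the first five components explain about 76% of the variance and the first six about 82%, so six components are needed to pass the 80% cutoff.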

# Run k-means with 3 clusters on every variable except the first column,
# which is the ID rather than a factor or variable
# iter.max=1000 caps the number of iterations
fit <- kmeans(mydata[,-1], 3, iter.max=1000)
# Count the observations assigned to each cluster
table(fit$cluster)
## 
## 1 2 3 
## 9 5 5
barplot(table(fit$cluster), col="#336699")  #plot
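k-means starts from randomly chosen centers, so cluster labels and sizes can change between runs, and because it relies on Euclidean distance, variables measured on large scales (such as reviews) can dominate those on small scales (such as claimed). A hedged sketch of one common way to address both points is shown below on the built-in USArrests data; the seed value and nstart = 25 are arbitrary choices for illustration.

```r
# Illustration on the built-in USArrests data (an assumption; not the SEO data)
scaled <- scale(USArrests)  # put every variable on the same scale

set.seed(123)               # arbitrary seed so the run is reproducible
fit_scaled <- kmeans(scaled, centers = 3, nstart = 25, iter.max = 1000)

# Cluster sizes
print(table(fit_scaled$cluster))

# Re-running with the same seed gives the same assignment
set.seed(123)
fit_again <- kmeans(scaled, centers = 3, nstart = 25, iter.max = 1000)
same_result <- identical(fit_scaled$cluster, fit_again$cluster)
print(same_result)
```

Here nstart = 25 asks kmeans to try 25 random starts and keep the solution with the lowest total within-cluster sum of squares.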

# Standardize every variable except the first column (ID) to mean 0 and standard deviation 1
use = scale(mydata[,-c(1)], center = TRUE, scale = TRUE)
# Euclidean distances between the standardized observations
dist = dist(use)
# Distances between the rows of that distance matrix (hclust is more commonly given dist(use) directly)
d <- dist(as.matrix(dist))
seg.hclust <- hclust(d)                # apply hierarchical clustering (complete linkage by default)
library(ggplot2) # loaded here, but the base plot() call below does not require it
plot(seg.hclust)                       # draw the dendrogram
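For comparison, a more conventional hierarchical clustering workflow passes the distances between observations straight to hclust. The sketch below uses the built-in USArrests data (an assumption) and is not the code that produced the output in this tutorial, so its cluster sizes will differ.

```r
# Illustration on the built-in USArrests data (an assumption; not the SEO data)
scaled <- scale(USArrests)   # standardize each variable

d_obs <- dist(scaled)        # Euclidean distances between observations
hc <- hclust(d_obs)          # complete linkage is the default method

# Cut the dendrogram into 3 groups and count the members of each
groups <- cutree(hc, k = 3)
print(table(groups))
```

Other linkage methods can be tried through the method argument of hclust, for example method = "average".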

groups.3 = cutree(seg.hclust,3)
# A good first step is to use the table function to see how many observations are in each cluster
table(groups.3)
## groups.3
## 1 2 3 
## 5 9 5
#In the following step, we will find the members in each cluster or group.
mydata$ID[groups.3 == 1]
## [1]  1  4  5  7 17
mydata$ID[groups.3 == 2]
## [1]  2  3  6  8  9 11 14 15 18
mydata$ID[groups.3 == 3]
## [1] 10 12 13 16 19

Identifying common features of each cluster using the aggregate function

#?aggregate
aggregate(mydata,list(groups.3),median)
##   Group.1 ID Page_rank Keyword Appearance (URL) K.A.1 Domain_Reg
## 1       1  5         8                        3     0         12
## 2       2  9        12                        0     0          5
## 3       3 13        31                        0     0          7
##   Keyword Appearance (title tag) K.A. Bold appearances in Meta Description H1
## 1                              4    1                                    3  2
## 2                              4    0                                    3  1
## 3                              3    0                                    2  0
##   Mobile Desktop Page_Score Directory % missing Yelp claimed reviews
## 1     23      72         81                  24    1       1      30
## 2     61      71         71                  28    1       1      92
## 3     38      73         71                  64    1       0      23
##   star rating
## 1         4.3
## 2         4.9
## 3         4.6
aggregate(mydata,list(groups.3),mean)
##   Group.1        ID Page_rank Keyword Appearance (URL) K.A.1 Domain_Reg
## 1       1  6.800000  19.00000                2.6000000   0.4  10.200000
## 2       2  9.555556  26.11111                0.4444444   0.0   6.444444
## 3       3 14.000000  45.40000                0.4000000   0.0   7.400000
##   Keyword Appearance (title tag) K.A. Bold appearances in Meta Description
## 1                       4.000000  0.6                             3.000000
## 2                       3.888889  0.0                             2.666667
## 3                       2.400000  0.4                             2.800000
##         H1   Mobile  Desktop Page_Score Directory % missing Yelp claimed
## 1 2.000000 33.20000 70.00000   82.40000            29.00000  0.6     1.0
## 2 1.111111 49.44444 70.77778   75.11111            27.22222  1.0     1.0
## 3 0.400000 44.80000 74.60000   70.20000            66.00000  1.0     0.4
##    reviews star rating
## 1 68.80000    4.040000
## 2 99.33333    4.788889
## 3 26.60000    4.460000
aggregate(mydata[,-1],list(groups.3),median)
##   Group.1 Page_rank Keyword Appearance (URL) K.A.1 Domain_Reg
## 1       1         8                        3     0         12
## 2       2        12                        0     0          5
## 3       3        31                        0     0          7
##   Keyword Appearance (title tag) K.A. Bold appearances in Meta Description H1
## 1                              4    1                                    3  2
## 2                              4    0                                    3  1
## 3                              3    0                                    2  0
##   Mobile Desktop Page_Score Directory % missing Yelp claimed reviews
## 1     23      72         81                  24    1       1      30
## 2     61      71         71                  28    1       1      92
## 3     38      73         71                  64    1       0      23
##   star rating
## 1         4.3
## 2         4.9
## 3         4.6
aggregate(mydata[,-1],list(groups.3),mean)
##   Group.1 Page_rank Keyword Appearance (URL) K.A.1 Domain_Reg
## 1       1  19.00000                2.6000000   0.4  10.200000
## 2       2  26.11111                0.4444444   0.0   6.444444
## 3       3  45.40000                0.4000000   0.0   7.400000
##   Keyword Appearance (title tag) K.A. Bold appearances in Meta Description
## 1                       4.000000  0.6                             3.000000
## 2                       3.888889  0.0                             2.666667
## 3                       2.400000  0.4                             2.800000
##         H1   Mobile  Desktop Page_Score Directory % missing Yelp claimed
## 1 2.000000 33.20000 70.00000   82.40000            29.00000  0.6     1.0
## 2 1.111111 49.44444 70.77778   75.11111            27.22222  1.0     1.0
## 3 0.400000 44.80000 74.60000   70.20000            66.00000  1.0     0.4
##    reviews star rating
## 1 68.80000    4.040000
## 2 99.33333    4.788889
## 3 26.60000    4.460000
cluster_means <- aggregate(mydata[,-1],list(groups.3),mean)

Exporting cluster analysis results to Excel from RStudio Cloud

write.csv(groups.3, "clusterID.csv")
write.csv(cluster_means, "cluster_means.csv")
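Writing the cluster vector on its own produces a file with row numbers but no ID column. A hedged alternative is to pair each ID with its cluster before exporting. The IDs, cluster labels, and file name below are hypothetical values used only to keep the sketch self-contained.

```r
# Hypothetical IDs and cluster labels standing in for mydata$ID and groups.3
ids <- c(1, 2, 3, 4)
clusters <- c(1, 2, 2, 3)

# One row per observation: its ID and the cluster it belongs to
membership <- data.frame(ID = ids, cluster = clusters)

# Write to a temporary folder here; in practice a path in the working
# directory would be used instead (the file name is an assumption)
out_file <- file.path(tempdir(), "clusterID_example.csv")
write.csv(membership, out_file, row.names = FALSE)

# Read the file back to confirm its layout
read_back <- read.csv(out_file)
print(read_back)
```

Setting row.names = FALSE keeps the extra unnamed index column out of the spreadsheet.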

Downloading your solutions manually

First, select the files (“clusterID.csv” & “cluster_means.csv”) and put a checkmark before each file.

Second, click the gear icon on the right side of your pane and export the data.

Finding means or medians of each variable (factor) for each cluster

Suppose your goal is to find profitable customers to target. The mean or median of each variable within a cluster shows the characteristics of that sub-group. From there, use your domain expertise to interpret the profiles.
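The mean and the median can tell different stories when a cluster contains an extreme value. The sketch below uses a small made-up data frame (the numbers are hypothetical, not taken from SEOdental.csv) to show how the two summaries diverge.

```r
# Hypothetical data: cluster 1 contains one unusually large review count
toy <- data.frame(
  cluster = c(1, 1, 1, 2, 2, 2),
  reviews = c(10, 12, 200, 50, 55, 60)
)

# Mean and median of reviews within each cluster
means <- aggregate(reviews ~ cluster, data = toy, FUN = mean)
medians <- aggregate(reviews ~ cluster, data = toy, FUN = median)

print(means)
print(medians)
```

In cluster 1 the mean is 74 while the median is 12, because the single value of 200 pulls the mean upward. In cluster 2, which has no extreme value, both are 55.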

Discussion Questions for you

  1. How many observations do we have in each cluster?

Answer: Your answer here:

  2. We can look at the medians (or means) for the variables in each cluster. Why is this important?

Answer: Your answer here:

  3. Should the mean or the median be used to analyze the differences among clusters? Why?

Answer: Your answer here:

  4. Now we need to understand the common characteristics of each cluster. Our goal is to build a targeting strategy using the profiles of each cluster. Which summary measures of each cluster are appropriate in a descriptive sense?

Answer: Your answer here:

  5. Are there any major differences between K-means clustering (https://rpubs.com/utjimmyx/kmeans) and hierarchical clustering? Which one do you prefer? Why? You may refer to the assigned readings.

Answer: Your answer here:

References

Cluster analysis - reading (p.385-p.399) https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf

Comparison of similarity coefficients used for cluster analysis with dominant markers in maize (Zea mays L) https://www.scielo.br/scielo.php?script=sci_arttext&pid=S1415-47572004000100014&lng=en&nrm=iso

Principal Component Methods in R: Practical Guide http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/

Principal component analysis - reading (p.404-p.405) https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf

Interpretation of the Principal Components https://online.stat.psu.edu/stat505/lesson/11/11.4