Principal Component Analysis (PCA) summarizes the correlated features in a dataset as a smaller set of uncorrelated components, and it can be used in conjunction with cluster analysis.
PCA is also commonly used to guide feature selection. Suppose you have more than 100 features or factors; it is useful to select the most important ones for further analysis.
The basic idea when using PCA as a tool for feature selection is to rank variables by the absolute value of their coefficients (loadings), from largest to smallest, and keep those with the largest loadings.
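The ranking idea can be sketched in a few lines of R. This is only an illustration under stated assumptions: the built-in `USArrests` data stands in for the tutorial's file, the names `pca_sketch`, `ranked`, and `top_vars` are hypothetical, and the cutoff of two variables is arbitrary.

```r
# USArrests ships with base R; it stands in for the tutorial's data (an assumption).
pca_sketch <- prcomp(USArrests, scale. = TRUE)

# Loadings of each variable on the first principal component.
pc1_loadings <- pca_sketch$rotation[, "PC1"]

# Rank variables by the absolute value of their loading, largest first.
ranked <- sort(abs(pc1_loadings), decreasing = TRUE)
print(ranked)

# Keep the two highest-ranked variables (the cutoff of 2 is arbitrary).
top_vars <- names(ranked)[1:2]
print(top_vars)
```

The same ranking can be repeated for PC2, PC3, and so on; how many components to inspect is a judgment call.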
This tutorial uses sample syntax from the following book: Principal component analysis - reading (p.404-p.405) https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf
setwd("C:/Users/zxu3/Documents/R/segmentation")
library(rlang)
library(readr)
## mydata <-read_csv('SEOdental.csv')
mydata <-read_csv('SEOdental.csv')
##
## -- Column specification --------------------------------------------------------
## cols(
## ID = col_double(),
## Page_rank = col_double(),
## `Keyword Appearance (URL)` = col_double(),
## K.A.1 = col_double(),
## Domain_Reg = col_double(),
## `Keyword Appearance (title tag)` = col_double(),
## K.A. = col_double(),
## `Bold appearances in Meta Description` = col_double(),
## H1 = col_double(),
## Mobile = col_double(),
## Desktop = col_double(),
## Page_Score = col_double(),
## `Directory % missing` = col_double(),
## Yelp = col_double(),
## claimed = col_double(),
## reviews = col_double(),
## `star rating` = col_double()
## )
pr.out=prcomp(mydata, scale=TRUE)
names(pr.out)
## [1] "sdev" "rotation" "center" "scale" "x"
pr.out$center
## ID Page_rank
## 10.0000000 29.3157895
## Keyword Appearance (URL) K.A.1
## 1.0000000 0.1052632
## Domain_Reg Keyword Appearance (title tag)
## 7.6842105 3.5263158
## K.A. Bold appearances in Meta Description
## 0.2631579 2.7894737
## H1 Mobile
## 1.1578947 43.9473684
## Desktop Page_Score
## 71.5789474 75.7368421
## Directory % missing Yelp
## 37.8947368 0.8947368
## claimed reviews
## 0.8421053 72.1578947
## star rating
## 4.5052632
pr.out$scale
## ID Page_rank
## 5.6273143 27.0986934
## Keyword Appearance (URL) K.A.1
## 1.2018504 0.3153018
## Domain_Reg Keyword Appearance (title tag)
## 3.8880534 1.0202626
## K.A. Bold appearances in Meta Description
## 0.4524139 1.2283208
## H1 Mobile
## 0.9581903 29.1289731
## Desktop Page_Score
## 21.2976676 6.2702304
## Directory % missing Yelp
## 22.0803954 0.3153018
## claimed reviews
## 0.3746343 69.1482009
## star rating
## 0.6275824
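The `center` and `scale` components are the column means and standard deviations that prcomp used to standardize the variables. A minimal check, assuming the built-in `USArrests` data as a stand-in (all object names here are hypothetical):

```r
# Built-in data used as a stand-in for the tutorial's file (an assumption).
pca_sketch <- prcomp(USArrests, scale. = TRUE)

# Column means and standard deviations computed directly.
col_means <- colMeans(USArrests)
col_sds <- apply(USArrests, 2, sd)

# Both comparisons should report TRUE.
center_matches <- isTRUE(all.equal(pca_sketch$center, col_means))
scale_matches <- isTRUE(all.equal(pca_sketch$scale, col_sds))
print(c(center = center_matches, scale = scale_matches))
```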
pr.out$rotation
## PC1 PC2 PC3
## ID 0.41491062 0.098409362 0.17804205
## Page_rank 0.37139307 0.198553005 0.19250261
## Keyword Appearance (URL) -0.29840474 0.213297249 0.08103749
## K.A.1 -0.24099188 0.177621939 -0.01865257
## Domain_Reg -0.07327380 0.213057117 0.10839475
## Keyword Appearance (title tag) -0.20036272 0.186006782 0.28118506
## K.A. -0.03081541 0.402794644 -0.34009285
## Bold appearances in Meta Description -0.06731706 0.184227508 -0.33002865
## H1 -0.31188647 0.158456723 0.06640510
## Mobile 0.30874970 0.163652049 0.39175058
## Desktop 0.28396289 0.333975644 0.14878942
## Page_Score -0.20813859 0.279222083 0.29754729
## Directory % missing 0.25833175 -0.007796739 -0.45692113
## Yelp 0.04428533 -0.430305440 0.08215275
## claimed -0.24921494 -0.049655350 0.20189151
## reviews -0.19125230 -0.326741326 0.16312879
## star rating 0.08865831 -0.248651437 0.23975874
## PC4 PC5 PC6
## ID -0.02588717 0.15686178 -0.12769942
## Page_rank -0.00231635 0.11942147 -0.27819343
## Keyword Appearance (URL) 0.32311162 -0.02389552 -0.07621342
## K.A.1 0.53552813 -0.03700137 0.16193831
## Domain_Reg -0.14505100 0.60486919 -0.15480328
## Keyword Appearance (title tag) -0.14578302 -0.48508378 -0.27367480
## K.A. -0.16638018 -0.05944836 0.24570200
## Bold appearances in Meta Description -0.42038400 -0.36237103 -0.05555327
## H1 -0.04173450 0.10674982 -0.44043795
## Mobile 0.03219065 -0.19234673 0.11930466
## Desktop 0.11880393 -0.24298125 0.15973984
## Page_Score -0.14790988 0.18792736 0.18234074
## Directory % missing -0.01689185 0.16825889 -0.05346103
## Yelp 0.30889995 -0.11520735 0.02079515
## claimed -0.20910136 0.18343581 0.54445576
## reviews -0.24172380 -0.04479067 -0.28796957
## star rating -0.35328096 -0.05643448 0.24963567
## PC7 PC8 PC9
## ID 0.04506485 -0.177353378 0.06974235
## Page_rank 0.11942890 -0.044346951 -0.13751115
## Keyword Appearance (URL) 0.25061870 -0.429959459 0.40858541
## K.A.1 0.29834417 -0.072480319 -0.23176142
## Domain_Reg 0.40292322 0.148151083 -0.35821913
## Keyword Appearance (title tag) 0.05154998 0.211183264 -0.16896342
## K.A. 0.25070300 -0.036954447 0.08451624
## Bold appearances in Meta Description 0.23771904 -0.002644793 -0.14588716
## H1 -0.12951332 0.029509382 0.02369439
## Mobile -0.07503328 -0.044942306 0.13178531
## Desktop 0.02015830 -0.264485714 -0.27874188
## Page_Score -0.03865335 0.237025843 0.47837103
## Directory % missing 0.06847800 -0.219192781 0.17452965
## Yelp 0.41835535 0.264978889 -0.09040127
## claimed -0.21116027 -0.273994576 -0.37602808
## reviews 0.10554885 -0.621220872 -0.04797515
## star rating 0.53644784 0.016389323 0.25195091
## PC10 PC11 PC12
## ID -0.189444904 0.150696909 -0.121693693
## Page_rank -0.137616341 -0.161484546 -0.306798285
## Keyword Appearance (URL) 0.064827540 -0.139310436 -0.204441344
## K.A.1 -0.159986031 0.079620379 -0.009139140
## Domain_Reg 0.304667188 0.002217546 0.078009376
## Keyword Appearance (title tag) 0.242041037 0.109338573 0.310834154
## K.A. 0.111323516 -0.291224975 0.301086972
## Bold appearances in Meta Description -0.172337901 0.238546459 -0.512037516
## H1 -0.680500249 0.002335028 0.327367654
## Mobile 0.128491407 -0.082567583 0.176771791
## Desktop -0.097022886 0.266063747 0.162225548
## Page_Score 0.076553705 0.532759727 -0.150271395
## Directory % missing -0.002010201 0.467712041 0.413061783
## Yelp -0.066448187 0.333938149 0.024692875
## claimed -0.196084975 0.094617999 -0.004931365
## reviews 0.265612657 0.135339509 0.055786475
## star rating -0.337033604 -0.229795231 0.178784403
## PC13 PC14 PC15
## ID 0.15368552 -2.311681e-01 0.30482423
## Page_rank 0.14711888 -1.056681e-01 0.27576230
## Keyword Appearance (URL) 0.08232822 8.816408e-02 -0.06614084
## K.A.1 -0.29161884 8.098577e-02 0.35247495
## Domain_Reg -0.08731380 1.945602e-01 -0.20027800
## Keyword Appearance (title tag) -0.06320955 -1.324946e-01 0.40504232
## K.A. 0.50796207 -2.421950e-01 0.03875466
## Bold appearances in Meta Description 0.01099582 3.178286e-01 -0.06353075
## H1 0.16631251 1.213419e-01 -0.17090883
## Mobile 0.17048739 7.221460e-01 -0.03220194
## Desktop -0.17240600 -2.846855e-01 -0.55452266
## Page_Score 0.01276705 -1.568506e-01 -0.01300349
## Directory % missing -0.12439030 2.164994e-01 0.26748767
## Yelp 0.54766408 -6.494029e-05 -0.12672552
## claimed 0.24237242 8.701303e-02 0.25396201
## reviews 0.08519085 -8.379138e-02 -0.04880102
## star rating -0.35373494 -5.148979e-02 0.01780417
## PC16 PC17
## ID 0.07292611 0.676632542
## Page_rank -0.05266873 -0.638764209
## Keyword Appearance (URL) -0.49429268 0.069550969
## K.A.1 0.44696292 -0.005728636
## Domain_Reg -0.09254248 0.144691302
## Keyword Appearance (title tag) -0.28664829 0.063217957
## K.A. 0.22736481 -0.021409074
## Bold appearances in Meta Description 0.02204962 0.096136621
## H1 0.07751711 -0.004668534
## Mobile 0.20153314 0.018702624
## Desktop -0.09790181 -0.067746430
## Page_Score 0.20225614 -0.187365947
## Directory % missing -0.22327148 -0.191451955
## Yelp -0.09801259 -0.038126818
## claimed -0.28185543 -0.046065926
## reviews 0.40763424 -0.111921216
## star rating -0.05208442 -0.042693776
dim(pr.out$x)
## [1] 19 17
biplot(pr.out, scale=0)
pr.out$rotation=-pr.out$rotation
pr.out$x=-pr.out$x
biplot(pr.out, scale=0)
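Flipping the signs of both the loadings and the scores is harmless because principal components are unique only up to sign. The sketch below illustrates this on the built-in `USArrests` data (an assumption; the tutorial's file is not used, and the object names are hypothetical): the standardized data rebuilt from scores and loadings is identical before and after the flip.

```r
# Built-in data used as a stand-in for the tutorial's file (an assumption).
pca_sketch <- prcomp(USArrests, scale. = TRUE)
scaled_data <- scale(USArrests)

# Rebuild the standardized data from scores and loadings.
recon_before <- pca_sketch$x %*% t(pca_sketch$rotation)

# Flip the sign of both pieces, as the tutorial does.
flipped_rotation <- -pca_sketch$rotation
flipped_x <- -pca_sketch$x
recon_after <- flipped_x %*% t(flipped_rotation)

# The two reconstructions agree, and both recover the standardized data.
same_after_flip <- isTRUE(all.equal(recon_before, recon_after))
recovers_data <- isTRUE(all.equal(recon_before, scaled_data, check.attributes = FALSE))
print(c(same_after_flip, recovers_data))
```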
pr.out$sdev
## [1] 2.13681152 1.70357697 1.54345370 1.28940496 1.15623874 1.02822038
## [7] 0.93882115 0.75388287 0.70657159 0.67078412 0.56194017 0.44777564
## [13] 0.28595948 0.20291991 0.17362534 0.12871520 0.08978104
pr.var=pr.out$sdev^2
pr.var
## [1] 4.565963474 2.902174506 2.382249328 1.662565142 1.336888031 1.057237148
## [7] 0.881385150 0.568339376 0.499243414 0.449951339 0.315776750 0.200503027
## [13] 0.081772826 0.041176490 0.030145760 0.016567604 0.008060635
pve=pr.var/sum(pr.var)
pve
## [1] 0.2685860867 0.1707161474 0.1401323134 0.0977979495 0.0786404724
## [6] 0.0621904205 0.0518461853 0.0334317280 0.0293672596 0.0264677258
## [11] 0.0185751029 0.0117942957 0.0048101662 0.0024221465 0.0017732800
## [16] 0.0009745649 0.0004741550
plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1),type='b')
plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1),type='b')
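One common way to read the cumulative plot is to keep the smallest number of components that reaches a chosen share of the variance. The sketch below reuses the standard deviations printed above; the 80% threshold and the names `sdev_printed`, `threshold`, and `n_keep` are assumptions made for illustration, not part of the original analysis.

```r
# Standard deviations copied from the pr.out$sdev output above.
sdev_printed <- c(2.13681152, 1.70357697, 1.54345370, 1.28940496, 1.15623874,
                  1.02822038, 0.93882115, 0.75388287, 0.70657159, 0.67078412,
                  0.56194017, 0.44777564, 0.28595948, 0.20291991, 0.17362534,
                  0.12871520, 0.08978104)

# Proportion of variance explained and its running total.
pve_sketch <- sdev_printed^2 / sum(sdev_printed^2)
cum_pve <- cumsum(pve_sketch)

# Smallest number of components whose cumulative share reaches the threshold.
threshold <- 0.80  # hypothetical cutoff; choose one that suits your analysis
n_keep <- which(cum_pve >= threshold)[1]
print(n_keep)
```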
# Run k-means on every column except the first, which is an ID rather than a feature.
# The second argument (3) is the number of clusters; fit$cluster holds the cluster membership of each observation.
fit <- kmeans(mydata[,-1], 3, iter.max=1000)
table(fit$cluster)
##
## 1 2 3
## 9 5 5
barplot(table(fit$cluster), col="#336699") #plot
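k-means starts from random centers, so cluster labels and sizes can change between runs, and variables on large scales can dominate the distance calculation. The sketch below shows one possible variation, assuming the built-in `USArrests` data as a stand-in: it fixes the random seed, standardizes the features, and tries several random starts through the `nstart` argument. The seed value, `nstart = 25`, and the object names are arbitrary choices for illustration.

```r
# Built-in data used as a stand-in for the tutorial's file (an assumption).
set.seed(42)  # arbitrary seed so the run is repeatable

# Standardize so every variable contributes on the same scale.
features <- scale(USArrests)

# Three clusters, best of 25 random starts.
fit_sketch <- kmeans(features, centers = 3, nstart = 25, iter.max = 1000)

# Number of observations per cluster.
cluster_sizes <- table(fit_sketch$cluster)
print(cluster_sizes)
```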
# Standardize the features (all columns except ID) so each has mean 0 and standard deviation 1.
use = scale(mydata[,-c(1)], center = TRUE, scale = TRUE)
use.dist = dist(use) # Euclidean distances between observations; named use.dist so it does not mask dist()
d <- dist(as.matrix(use.dist)) # distances between the rows of that distance matrix, passed to hclust below
seg.hclust <- hclust(d) # apply hierarchical clustering
library(ggplot2) # needs no introduction
plot(seg.hclust)
groups.3 = cutree(seg.hclust,3)
table(groups.3) # a good first step: use the table function to see how many observations are in each cluster
## groups.3
## 1 2 3
## 5 9 5
#In the following step, we will find the members in each cluster or group.
mydata$ID[groups.3 == 1]
## [1] 1 4 5 7 17
mydata$ID[groups.3 == 2]
## [1] 2 3 6 8 9 11 14 15 18
mydata$ID[groups.3 == 3]
## [1] 10 12 13 16 19
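The hierarchical workflow (standardize, compute distances, build the tree, cut it, list the members) can be condensed into a short sketch. It assumes the built-in `USArrests` data as a stand-in and passes the Euclidean distances between standardized observations directly to `hclust`, so it is a simplified variation rather than a copy of the code above; all object names are hypothetical.

```r
# Built-in data used as a stand-in for the tutorial's file (an assumption).
features <- scale(USArrests)

# Build the tree from Euclidean distances (hclust defaults to complete linkage).
tree <- hclust(dist(features))

# Cut the tree into three groups and count the members of each.
groups <- cutree(tree, k = 3)
print(table(groups))

# List the members of every group in one step instead of one lookup per group.
members <- split(rownames(USArrests), groups)
print(members)
```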
#?aggregate
aggregate(mydata,list(groups.3),median)
## Group.1 ID Page_rank Keyword Appearance (URL) K.A.1 Domain_Reg
## 1 1 5 8 3 0 12
## 2 2 9 12 0 0 5
## 3 3 13 31 0 0 7
## Keyword Appearance (title tag) K.A. Bold appearances in Meta Description H1
## 1 4 1 3 2
## 2 4 0 3 1
## 3 3 0 2 0
## Mobile Desktop Page_Score Directory % missing Yelp claimed reviews
## 1 23 72 81 24 1 1 30
## 2 61 71 71 28 1 1 92
## 3 38 73 71 64 1 0 23
## star rating
## 1 4.3
## 2 4.9
## 3 4.6
aggregate(mydata,list(groups.3),mean)
## Group.1 ID Page_rank Keyword Appearance (URL) K.A.1 Domain_Reg
## 1 1 6.800000 19.00000 2.6000000 0.4 10.200000
## 2 2 9.555556 26.11111 0.4444444 0.0 6.444444
## 3 3 14.000000 45.40000 0.4000000 0.0 7.400000
## Keyword Appearance (title tag) K.A. Bold appearances in Meta Description
## 1 4.000000 0.6 3.000000
## 2 3.888889 0.0 2.666667
## 3 2.400000 0.4 2.800000
## H1 Mobile Desktop Page_Score Directory % missing Yelp claimed
## 1 2.000000 33.20000 70.00000 82.40000 29.00000 0.6 1.0
## 2 1.111111 49.44444 70.77778 75.11111 27.22222 1.0 1.0
## 3 0.400000 44.80000 74.60000 70.20000 66.00000 1.0 0.4
## reviews star rating
## 1 68.80000 4.040000
## 2 99.33333 4.788889
## 3 26.60000 4.460000
aggregate(mydata[,-1],list(groups.3),median)
## Group.1 Page_rank Keyword Appearance (URL) K.A.1 Domain_Reg
## 1 1 8 3 0 12
## 2 2 12 0 0 5
## 3 3 31 0 0 7
## Keyword Appearance (title tag) K.A. Bold appearances in Meta Description H1
## 1 4 1 3 2
## 2 4 0 3 1
## 3 3 0 2 0
## Mobile Desktop Page_Score Directory % missing Yelp claimed reviews
## 1 23 72 81 24 1 1 30
## 2 61 71 71 28 1 1 92
## 3 38 73 71 64 1 0 23
## star rating
## 1 4.3
## 2 4.9
## 3 4.6
aggregate(mydata[,-1],list(groups.3),mean)
## Group.1 Page_rank Keyword Appearance (URL) K.A.1 Domain_Reg
## 1 1 19.00000 2.6000000 0.4 10.200000
## 2 2 26.11111 0.4444444 0.0 6.444444
## 3 3 45.40000 0.4000000 0.0 7.400000
## Keyword Appearance (title tag) K.A. Bold appearances in Meta Description
## 1 4.000000 0.6 3.000000
## 2 3.888889 0.0 2.666667
## 3 2.400000 0.4 2.800000
## H1 Mobile Desktop Page_Score Directory % missing Yelp claimed
## 1 2.000000 33.20000 70.00000 82.40000 29.00000 0.6 1.0
## 2 1.111111 49.44444 70.77778 75.11111 27.22222 1.0 1.0
## 3 0.400000 44.80000 74.60000 70.20000 66.00000 1.0 0.4
## reviews star rating
## 1 68.80000 4.040000
## 2 99.33333 4.788889
## 3 26.60000 4.460000
cluster_means <- aggregate(mydata[,-1],list(groups.3),mean)
write.csv(groups.3, "clusterID.csv")
write.csv(cluster_means, "cluster_means.csv")
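To confirm that an exported file holds what you expect, you can read it back. A minimal sketch, assuming the built-in `USArrests` data and a temporary file rather than the tutorial's working directory; the names `cluster_means_sketch`, `out_file`, and `read_back` are hypothetical.

```r
# Built-in data and a temporary file stand in for the tutorial's setup (assumptions).
groups <- cutree(hclust(dist(scale(USArrests))), k = 3)

# Mean of every variable within each cluster.
cluster_means_sketch <- aggregate(USArrests, by = list(cluster = groups), FUN = mean)

# Write without row names so the file has one clean column per variable.
out_file <- tempfile(fileext = ".csv")
write.csv(cluster_means_sketch, out_file, row.names = FALSE)

# Read the file back and inspect it.
read_back <- read.csv(out_file)
print(read_back)
```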
First, select the files (“clusterID.csv” & “cluster_means.csv”) and put a checkmark before each file.
Second, click the gear icon on the right side of your pane and export the data.
Suppose your goal is to find profitable customers to target. The means or medians computed above show the characteristics of each sub-group, and interpreting them is where your domain expertise comes in.
How many observations do we have in each cluster? Answer: Your answer here:
We can look at the medians (or means) for the variables in each cluster. Why is this important?
Answer: Your answer here:
The aggregate function is well suited for this task. Should we use mydata or mydata[,-1] with the aggregate function? Why? Hint: compare the results shown earlier in this tutorial.
head(mydata)
## # A tibble: 6 x 17
## ID Page_rank `Keyword Appear~ K.A.1 Domain_Reg `Keyword Appear~ K.A.
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 4 1 6 4 1
## 2 2 4 0 0 7 4 0
## 3 3 5 1 0 10 4 0
## 4 4 8 1 0 9 4 1
## 5 5 9 3 1 12 4 0
## 6 6 6 1 0 3 4 0
## # ... with 10 more variables: `Bold appearances in Meta Description` <dbl>,
## # H1 <dbl>, Mobile <dbl>, Desktop <dbl>, Page_Score <dbl>, `Directory %
## # missing` <dbl>, Yelp <dbl>, claimed <dbl>, reviews <dbl>, `star
## # rating` <dbl>
#summary(lm(Page_rank~ . -ID, data = mydata))
summary(lm(Page_rank~ . -ID -Page_Score, data = mydata))
##
## Call:
## lm(formula = Page_rank ~ . - ID - Page_Score, data = mydata)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## 2.1668 3.0055 -12.4046 -3.3900 -2.1668 3.8470 2.9248 -14.9957
## 9 10 11 12 13 14 15 16
## 6.4353 -2.7866 15.0576 0.6198 -4.7710 -12.8892 10.3552 0.2217
## 17 18 19
## 3.3900 2.8151 2.5650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.45764 64.80441 0.532 0.6231
## `Keyword Appearance (URL)` -3.21723 8.93668 -0.360 0.7370
## K.A.1 -12.09024 41.20618 -0.293 0.7838
## Domain_Reg 2.76122 1.13401 2.435 0.0716 .
## `Keyword Appearance (title tag)` -15.78711 8.63056 -1.829 0.1414
## K.A. -7.30391 20.45122 -0.357 0.7390
## `Bold appearances in Meta Description` 4.63390 6.36146 0.728 0.5067
## H1 7.12610 6.79145 1.049 0.3533
## Mobile 0.36545 0.48059 0.760 0.4894
## Desktop 0.71281 0.52280 1.363 0.2444
## `Directory % missing` -0.49990 0.42592 -1.174 0.3056
## Yelp -1.51865 28.17401 -0.054 0.9596
## claimed -39.07955 18.11072 -2.158 0.0971 .
## reviews 0.05486 0.12982 0.423 0.6943
## `star rating` -0.72902 10.20815 -0.071 0.9465
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.05 on 4 degrees of freedom
## Multiple R-squared: 0.922, Adjusted R-squared: 0.6492
## F-statistic: 3.379 on 14 and 4 DF, p-value: 0.1243
## summary(glm(Page_rank~ . -ID, data = mydata))
#df.scaled <- as.data.frame(scale(mydata))
#summary(lm(Page_rank~ . -ID, data = df.scaled))
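In the model formula, `.` stands for every column other than the response, and `- name` removes a column from that set. The sketch below illustrates the notation on the built-in `mtcars` data (an assumption; the variables dropped there are arbitrary and the object names are hypothetical) and sorts the coefficient table by p-value.

```r
# Built-in data used as a stand-in for the tutorial's file (an assumption).
# Regress mpg on every other column except qsec and carb.
fit_lm <- lm(mpg ~ . - qsec - carb, data = mtcars)

# The dropped columns do not appear among the fitted coefficients.
coef_names <- names(coef(fit_lm))
print(coef_names)

# Coefficient table ordered from smallest to largest p-value.
coef_table <- coef(summary(fit_lm))
sorted_table <- coef_table[order(coef_table[, "Pr(>|t|)"]), ]
print(sorted_table)
```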
library(ggplot2)
ggplot(data = mydata, aes(x = Domain_Reg, y = Page_rank)) +
  geom_point()
Cluster analysis - reading (p.385-p.399) https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf
Comparison of similarity coefficients used for cluster analysis with dominant markers in maize (Zea mays L) https://www.scielo.br/scielo.php?script=sci_arttext&pid=S1415-47572004000100014&lng=en&nrm=iso
Principal Component Methods in R: Practical Guide http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/
Principal component analysis - reading (p.404-p.405) https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf
Interpretation of the Principal Components https://online.stat.psu.edu/stat505/lesson/11/11.4