This is still an early draft. Let me know if there are any errors or typos.
Principal Component Analysis (PCA) summarizes the correlated features in a dataset as a smaller set of uncorrelated components, and it can be used in conjunction with cluster analysis.
PCA is also a popular machine learning technique that can guide feature selection. Imagine that you have more than 100 features or factors: it is useful to select the most important ones for further analysis.
The basic idea when using PCA as a tool for feature selection is to rank variables by the absolute value of their coefficients (loadings) on a component, from largest to smallest, and to keep the variables with the largest values.
In this tutorial, we will use the sample syntax from the following book: Principal component analysis - reading (p.404-p.405) https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf
The file customer_segmentation.csv contains data collected by my students in spring 2020.
Search for RStudio Cloud, register (or set up a free user account), and log in to the cloud environment with your Gmail credentials.
First, upload your dataset (.csv) from your own computer to RStudio Cloud. Make sure the first column is an ID column rather than a survey variable.
Once the dataset is uploaded, you will see it listed in the right pane of your cloud environment.
Now we will use the readr package and its read_csv function to read the dataset.
library(readr)
mydata <- read_csv('customer_segmentation.csv')
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## ID = col_double(),
## `CS is helpful` = col_double(),
## Recommend = col_double(),
## `Come again` = col_double(),
## `All Product I need` = col_double(),
## Profesionalism = col_double(),
## Limitation = col_double(),
## `Online grocery` = col_double(),
## delivery = col_double(),
## `Pick up sevice` = col_double(),
## `Find items` = col_double(),
## `other shops` = col_double(),
## Gender = col_double(),
## Age = col_double(),
## Education = col_double()
## )
# Run PCA on every column of mydata; scale=TRUE standardizes each variable to unit variance
pr.out=prcomp(mydata, scale=TRUE)
names(pr.out)
## [1] "sdev" "rotation" "center" "scale" "x"
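Note that the call above passes every column of mydata to prcomp, including ID, which is only a row identifier. If the identifier should be left out of the analysis, one option is to drop it before calling prcomp. The sketch below uses a small made-up data frame (toy, with invented values) in place of customer_segmentation.csv; the names toy, survey_only, and pr.noid are illustrative only.

```r
# Hypothetical toy data standing in for customer_segmentation.csv;
# the values are invented purely for illustration.
toy <- data.frame(
  ID        = 1:6,
  Recommend = c(1, 2, 1, 3, 2, 1),
  delivery  = c(2, 3, 2, 4, 1, 3),
  Age       = c(2, 1, 3, 2, 4, 2)
)

# Drop the identifier column by name before running PCA
survey_only <- toy[, setdiff(names(toy), "ID")]

# scale. is the full argument name; scale=TRUE partially matches it
pr.noid <- prcomp(survey_only, scale. = TRUE)

# ID no longer appears among the loadings
rownames(pr.noid$rotation)
```

With the tutorial data the equivalent call would be prcomp(mydata[, -1], scale=TRUE), assuming ID is the first column. The output shown in the rest of this document comes from the original call, so it still includes ID.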
pr.out$center
##                 ID      CS is helpful          Recommend         Come again
##          11.500000           1.590909           1.318182           1.454545
## All Product I need     Profesionalism         Limitation     Online grocery
##           2.090909           1.409091           1.500000           2.272727
##           delivery     Pick up sevice         Find items        other shops
##           2.409091           2.454545           1.454545           2.590909
##             Gender                Age          Education
##           1.272727           2.454545           3.181818
pr.out$scale
##                 ID      CS is helpful          Recommend         Come again
##          6.4935866          0.7341397          0.6463350          0.7385489
## All Product I need     Profesionalism         Limitation     Online grocery
##          1.0649879          0.5903261          0.8017837          0.7672969
##           delivery     Pick up sevice         Find items        other shops
##          0.7341397          1.0568269          0.6709817          1.4026876
##             Gender                Age          Education
##          0.4558423          0.7385489          1.6223547
pr.out$rotation
## PC1 PC2 PC3 PC4
## ID 0.103796482 -0.19549219 0.16240939 -1.717915e-02
## CS is helpful 0.492101674 -0.15549856 0.06984284 4.852198e-02
## Recommend 0.324487249 -0.09307301 -0.23685161 3.606765e-01
## Come again 0.307412032 0.36689207 -0.17518619 1.144360e-01
## All Product I need 0.223729959 0.33561643 0.30397690 2.602293e-02
## Profesionalism 0.372740294 -0.01743956 -0.39059907 -1.558591e-01
## Limitation 0.286227998 -0.18865912 0.34162019 -3.294077e-01
## Online grocery 0.059643954 -0.33154112 -0.15083033 4.200091e-01
## delivery 0.357893888 -0.25892244 0.07754144 1.551933e-01
## Pick up sevice -0.188441370 -0.45536558 0.07437270 -9.798278e-03
## Find items 0.246980360 0.06185572 0.52894716 -1.506063e-01
## other shops -0.088319700 0.21055791 0.12898373 1.927525e-05
## Gender 0.188890506 0.28256757 -0.30431289 -2.646999e-01
## Age -0.068655126 0.34906819 0.11228449 3.483981e-01
## Education 0.001675921 0.10781411 0.29043507 5.531460e-01
## PC5 PC6 PC7 PC8 PC9
## ID 0.26364142 -0.74737395 0.09514658 0.05178790 -0.18538584
## CS is helpful 0.04977480 0.13602420 0.09908314 0.09454385 0.14131438
## Recommend -0.25420446 0.10633489 0.06686650 0.12431130 0.02508822
## Come again -0.32459482 -0.09955057 0.06095641 -0.12240536 -0.21023778
## All Product I need -0.08949351 0.04717091 -0.47590034 0.33667729 0.23109372
## Profesionalism 0.07627971 -0.21688163 -0.06569402 0.41019615 -0.25190055
## Limitation -0.05315286 0.09316149 0.29350827 -0.35304171 0.07314872
## Online grocery 0.08827952 -0.21089660 -0.31736717 -0.23599279 0.57206881
## delivery -0.05293745 0.16901910 -0.18085314 -0.42177030 -0.39914725
## Pick up sevice -0.13260691 -0.05336678 -0.41337353 0.12966920 -0.33479219
## Find items 0.09672176 -0.03519071 -0.20610416 0.13177487 0.06635705
## other shops -0.62500220 -0.45930333 -0.07866288 -0.26488198 0.03843060
## Gender 0.38432070 -0.17521073 -0.11680037 -0.35044014 0.18193527
## Age 0.37904971 0.08989847 -0.27501532 -0.25738354 -0.37511720
## Education 0.14777306 -0.12999637 0.46232988 0.15720572 -0.01623201
## PC10 PC11 PC12 PC13
## ID 0.008676069 -0.02387068 -0.15690858 -0.05806659
## CS is helpful 0.432389271 -0.06599241 -0.19408757 -0.05788369
## Recommend -0.451524953 -0.56649385 -0.17002152 0.01098916
## Come again 0.110054977 0.16756964 0.59656171 -0.27588399
## All Product I need 0.239006374 0.01519205 -0.04772105 0.36135415
## Profesionalism 0.121877891 0.06911371 -0.06003705 -0.06779497
## Limitation 0.251231470 -0.35407819 0.16994098 -0.03707063
## Online grocery 0.097485139 0.12952918 0.16619183 -0.22762471
## delivery -0.167111495 0.45284278 -0.18927088 0.31362688
## Pick up sevice 0.108556822 -0.33912510 0.45559735 0.21938726
## Find items -0.567871014 0.05685186 0.15299857 -0.34952169
## other shops 0.049014120 -0.10467485 -0.30186323 0.07853529
## Gender -0.199892044 -0.21692821 0.18412202 0.45023854
## Age 0.218606925 -0.34031955 -0.17486541 -0.30608249
## Education -0.006478144 0.04227308 0.26102473 0.39680948
## PC14 PC15
## ID 0.468948575 0.040123525
## CS is helpful -0.009114756 -0.659013501
## Recommend 0.194319229 0.108629052
## Come again 0.269178701 -0.065458124
## All Product I need 0.263011945 0.285257590
## Profesionalism -0.532288262 0.289571770
## Limitation -0.094519621 0.449388612
## Online grocery -0.134903554 0.165845914
## delivery 0.011207075 0.078634413
## Pick up sevice -0.069252430 -0.209782045
## Find items -0.238488774 -0.172355593
## other shops -0.341718008 -0.156185979
## Gender -0.060831924 -0.215685965
## Age -0.121359938 0.049967559
## Education -0.305204304 0.004242402
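The feature-selection idea described at the start of this tutorial can be applied to the loadings printed above. The sketch below copies the PC1 column from that output and ranks the variables by absolute loading. The names pc1, ranked, and top5 are illustrative, and the cutoff of five variables is an arbitrary choice, not a rule from the source.

```r
# PC1 loadings copied from the pr.out$rotation output above
pc1 <- c(
  "ID" = 0.103796482, "CS is helpful" = 0.492101674,
  "Recommend" = 0.324487249, "Come again" = 0.307412032,
  "All Product I need" = 0.223729959, "Profesionalism" = 0.372740294,
  "Limitation" = 0.286227998, "Online grocery" = 0.059643954,
  "delivery" = 0.357893888, "Pick up sevice" = -0.188441370,
  "Find items" = 0.246980360, "other shops" = -0.088319700,
  "Gender" = 0.188890506, "Age" = -0.068655126,
  "Education" = 0.001675921
)

# Order variables by the absolute value of their loading, largest first
ranked <- pc1[order(abs(pc1), decreasing = TRUE)]

# Keep the five variables with the largest absolute loadings
# (five is an arbitrary cutoff chosen for illustration)
top5 <- head(ranked, 5)
print(round(top5, 3))
```

In a live session the same vector is available as pr.out$rotation[, "PC1"]. Because the ranking uses absolute values, it is unaffected by the sign of the loadings.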
dim(pr.out$x)
## [1] 22 15
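The matrix pr.out$x holds the principal component scores, with one row per observation (22) and one column per component (15). As a sketch of where these scores come from, the example below uses R's built-in USArrests data (an assumption made so the block runs on its own; it is not the tutorial's dataset) and checks that the scores equal the standardized data multiplied by the loadings. The names pr.demo, scores.manual, and scores.match are illustrative.

```r
# USArrests ships with R; it stands in for mydata here
pr.demo <- prcomp(USArrests, scale. = TRUE)

# Standardize the data, then project it onto the loadings
scores.manual <- scale(USArrests) %*% pr.demo$rotation

# One row per observation, one column per principal component
dim(pr.demo$x)

# TRUE if the manual projection matches the scores returned by prcomp
scores.match <- isTRUE(all.equal(unname(scores.manual), unname(pr.demo$x)))
scores.match
```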
biplot(pr.out, scale=0)
# Flip the sign of the loadings and the scores; this mirrors the biplot without changing the fit
pr.out$rotation=-pr.out$rotation
pr.out$x=-pr.out$x
biplot(pr.out, scale=0)
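Flipping the signs of both the loadings and the scores, as in the two assignments above, does not change the fit, because each principal component is only determined up to sign. A minimal check on the built-in USArrests data (a stand-in chosen so the block is self-contained, not the tutorial's dataset) could look like this. All object names are illustrative.

```r
pr.demo <- prcomp(USArrests, scale. = TRUE)

# Reconstruct the standardized data from scores and loadings
recon.original <- pr.demo$x %*% t(pr.demo$rotation)

# Flip the signs of both pieces, as the tutorial does
flipped.rotation <- -pr.demo$rotation
flipped.x <- -pr.demo$x
recon.flipped <- flipped.x %*% t(flipped.rotation)

# The two reconstructions agree, so the flip only mirrors the biplot axes
same.fit <- isTRUE(all.equal(recon.original, recon.flipped))
same.fit
```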
pr.out$sdev
## [1] 1.7823571 1.5563851 1.3498398 1.2575163 1.1237776 1.1047966 1.0063521
## [8] 0.7730121 0.7274903 0.6730120 0.6232295 0.4680043 0.4414254 0.3001192
## [15] 0.1707378
# The variance of each component is its squared standard deviation
pr.var=pr.out$sdev^2
pr.var
## [1] 3.17679697 2.42233470 1.82206749 1.58134717 1.26287613 1.22057551
## [7] 1.01274446 0.59754775 0.52924217 0.45294519 0.38841501 0.21902805
## [13] 0.19485642 0.09007154 0.02915141
# Proportion of variance explained (PVE) by each component
pve=pr.var/sum(pr.var)
pve
## [1] 0.211786465 0.161488980 0.121471166 0.105423145 0.084191742 0.081371701
## [7] 0.067516298 0.039836517 0.035282811 0.030196346 0.025894334 0.014601870
## [13] 0.012990428 0.006004769 0.001943427
plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1),type='b')
plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1),type='b')
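One way to read the cumulative plot is to pick the smallest number of components whose cumulative PVE passes a chosen threshold. The sketch below reuses the variances printed above. The 80% threshold is an assumption for illustration, not a rule from the source, and the names cum.pve, threshold, and n.keep are illustrative.

```r
# Component variances copied from the pr.var output above
pr.var <- c(3.17679697, 2.42233470, 1.82206749, 1.58134717, 1.26287613,
            1.22057551, 1.01274446, 0.59754775, 0.52924217, 0.45294519,
            0.38841501, 0.21902805, 0.19485642, 0.09007154, 0.02915141)

pve <- pr.var / sum(pr.var)
cum.pve <- cumsum(pve)

# Assumed threshold: keep enough components to explain 80% of the variance
threshold <- 0.80
n.keep <- which(cum.pve >= threshold)[1]
n.keep
round(cum.pve[n.keep], 3)
```

With these values the first seven components are needed, and together they explain about 83% of the variance.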
#save your cluster solutions in the working directory
#We want to examine the cluster memberships for each observation - see last column of pca_data
Principal component analysis - reading (p.404-p.405) https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf
Principal Component Methods in R: Practical Guide http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/
Interpretation of the Principal Components https://online.stat.psu.edu/stat505/lesson/11/11.4