
Notice:

This is still an early draft. Let me know if there are any errors or typos.

Intro

Principal Component Analysis (PCA) summarizes the correlated features in a dataset with a smaller set of uncorrelated linear combinations, the principal components. It is often used in conjunction with cluster analysis.

PCA is also a popular machine learning technique that can guide feature selection. Imagine you have more than 100 features or factors: it is useful to narrow them down to the most important ones for further analysis.

The basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest in absolute values) of their coefficients (loadings).
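As a minimal sketch of this idea, the snippet below ranks variables by the absolute value of their loadings on the first principal component. It uses R's built-in USArrests data as a stand-in (an assumption, since the customer file is not bundled with this tutorial), and the names pca, loadings_pc1, ranked, and top_vars are illustrative only.

```r
# Stand-in data: the built-in USArrests data set (4 numeric variables).
pca <- prcomp(USArrests, scale. = TRUE)

# Loadings of every variable on the first principal component.
loadings_pc1 <- pca$rotation[, "PC1"]

# Sort variables from largest to smallest absolute loading.
ranked <- sort(abs(loadings_pc1), decreasing = TRUE)
print(ranked)

# Keep, for example, the two variables with the largest absolute loadings.
# The cutoff of two is arbitrary and chosen only for illustration.
top_vars <- names(ranked)[1:2]
print(top_vars)
```

The same ranking can be repeated for PC2, PC3, and so on, which is useful when one component alone explains only a modest share of the variance.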

In this tutorial, we will use the sample syntax from the book An Introduction to Statistical Learning with Applications in R (ISLR). Principal component analysis reading (pp. 404-405): https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf

Example

The file customer_segmentation.csv contains data collected by my students in spring 2020.

Importing data - No need to download R or RStudio or set up your working directory

Search for RStudio Cloud, register (or set up a free user account), and log into the cloud environment with your Gmail credentials.

First, upload your dataset (.csv) from your own computer to RStudio Cloud. Make sure the first column is an ID column rather than a variable.

Once the dataset is uploaded, it will appear in the right pane of your cloud environment.

Now we will use the readr package and its read_csv function to read the dataset.

library(readr)
mydata <- read_csv('customer_segmentation.csv')
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   ID = col_double(),
##   `CS is helpful` = col_double(),
##   Recommend = col_double(),
##   `Come again` = col_double(),
##   `All Product I need` = col_double(),
##   Profesionalism = col_double(),
##   Limitation = col_double(),
##   `Online grocery` = col_double(),
##   delivery = col_double(),
##   `Pick up sevice` = col_double(),
##   `Find items` = col_double(),
##   `other shops` = col_double(),
##   Gender = col_double(),
##   Age = col_double(),
##   Education = col_double()
## )
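The prcomp call in this tutorial passes mydata whole, so the ID column is treated as a variable. One alternative, sketched here under the assumption that the identifier should not influence the components, is to set it aside first. The data frame toy and its values are made up for illustration and are not taken from the customer file.

```r
# Hypothetical toy data standing in for mydata: an ID column plus three ratings.
toy <- data.frame(
  ID        = 1:6,
  Recommend = c(1, 2, 1, 3, 2, 1),
  delivery  = c(2, 3, 2, 3, 1, 2),
  Age       = c(2, 3, 1, 4, 2, 3)
)

# Drop the identifier so that only the measured variables enter the PCA.
features <- toy[, setdiff(names(toy), "ID")]

pca_no_id <- prcomp(features, scale. = TRUE)

# One row per variable, one column per component; ID no longer appears.
print(rownames(pca_no_id$rotation))
print(dim(pca_no_id$rotation))
```

Note that dropping a column changes every loading, so the results would no longer match the output printed below.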
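# Run PCA on the standardized columns (scale. = TRUE gives each variable unit variance).
# mydata is passed whole, so the ID column is included as a variable.
pr.out <- prcomp(mydata, scale. = TRUE)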
names(pr.out)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
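The five elements listed above are easier to read once you know their shapes. The sketch below inspects them on the built-in USArrests data (an assumed stand-in with 50 observations and 4 variables, not the customer data).

```r
# Stand-in data: USArrests has 50 rows and 4 numeric columns.
pca <- prcomp(USArrests, scale. = TRUE)

length(pca$sdev)    # standard deviation of each principal component
dim(pca$rotation)   # loadings: variables in rows, components in columns
pca$center          # column means subtracted before the analysis
pca$scale           # column standard deviations used for scaling
dim(pca$x)          # scores: observations in rows, components in columns
```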
pr.out$center
##                 ID      CS is helpful          Recommend         Come again 
##          11.500000           1.590909           1.318182           1.454545 
## All Product I need     Profesionalism         Limitation     Online grocery 
##           2.090909           1.409091           1.500000           2.272727 
##           delivery     Pick up sevice         Find items        other shops 
##           2.409091           2.454545           1.454545           2.590909 
##             Gender                Age          Education 
##           1.272727           2.454545           3.181818
pr.out$scale
##                 ID      CS is helpful          Recommend         Come again 
##          6.4935866          0.7341397          0.6463350          0.7385489 
## All Product I need     Profesionalism         Limitation     Online grocery 
##          1.0649879          0.5903261          0.8017837          0.7672969 
##           delivery     Pick up sevice         Find items        other shops 
##          0.7341397          1.0568269          0.6709817          1.4026876 
##             Gender                Age          Education 
##          0.4558423          0.7385489          1.6223547
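The center and scale values are simply the column means and standard deviations of the input. A quick check on the built-in USArrests data (an assumed stand-in; the variable names below are illustrative) looks like this:

```r
# Stand-in data: the built-in USArrests data set.
pca <- prcomp(USArrests, scale. = TRUE)

# Column means and sample standard deviations computed by hand.
col_means <- colMeans(USArrests)
col_sds   <- apply(USArrests, 2, sd)

# Compare with what prcomp stored, allowing for floating-point rounding.
means_match <- isTRUE(all.equal(pca$center, col_means))
sds_match   <- isTRUE(all.equal(pca$scale, col_sds))
print(c(means_match = means_match, sds_match = sds_match))
```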
pr.out$rotation
##                             PC1         PC2         PC3           PC4
## ID                  0.103796482 -0.19549219  0.16240939 -1.717915e-02
## CS is helpful       0.492101674 -0.15549856  0.06984284  4.852198e-02
## Recommend           0.324487249 -0.09307301 -0.23685161  3.606765e-01
## Come again          0.307412032  0.36689207 -0.17518619  1.144360e-01
## All Product I need  0.223729959  0.33561643  0.30397690  2.602293e-02
## Profesionalism      0.372740294 -0.01743956 -0.39059907 -1.558591e-01
## Limitation          0.286227998 -0.18865912  0.34162019 -3.294077e-01
## Online grocery      0.059643954 -0.33154112 -0.15083033  4.200091e-01
## delivery            0.357893888 -0.25892244  0.07754144  1.551933e-01
## Pick up sevice     -0.188441370 -0.45536558  0.07437270 -9.798278e-03
## Find items          0.246980360  0.06185572  0.52894716 -1.506063e-01
## other shops        -0.088319700  0.21055791  0.12898373  1.927525e-05
## Gender              0.188890506  0.28256757 -0.30431289 -2.646999e-01
## Age                -0.068655126  0.34906819  0.11228449  3.483981e-01
## Education           0.001675921  0.10781411  0.29043507  5.531460e-01
##                            PC5         PC6         PC7         PC8         PC9
## ID                  0.26364142 -0.74737395  0.09514658  0.05178790 -0.18538584
## CS is helpful       0.04977480  0.13602420  0.09908314  0.09454385  0.14131438
## Recommend          -0.25420446  0.10633489  0.06686650  0.12431130  0.02508822
## Come again         -0.32459482 -0.09955057  0.06095641 -0.12240536 -0.21023778
## All Product I need -0.08949351  0.04717091 -0.47590034  0.33667729  0.23109372
## Profesionalism      0.07627971 -0.21688163 -0.06569402  0.41019615 -0.25190055
## Limitation         -0.05315286  0.09316149  0.29350827 -0.35304171  0.07314872
## Online grocery      0.08827952 -0.21089660 -0.31736717 -0.23599279  0.57206881
## delivery           -0.05293745  0.16901910 -0.18085314 -0.42177030 -0.39914725
## Pick up sevice     -0.13260691 -0.05336678 -0.41337353  0.12966920 -0.33479219
## Find items          0.09672176 -0.03519071 -0.20610416  0.13177487  0.06635705
## other shops        -0.62500220 -0.45930333 -0.07866288 -0.26488198  0.03843060
## Gender              0.38432070 -0.17521073 -0.11680037 -0.35044014  0.18193527
## Age                 0.37904971  0.08989847 -0.27501532 -0.25738354 -0.37511720
## Education           0.14777306 -0.12999637  0.46232988  0.15720572 -0.01623201
##                            PC10        PC11        PC12        PC13
## ID                  0.008676069 -0.02387068 -0.15690858 -0.05806659
## CS is helpful       0.432389271 -0.06599241 -0.19408757 -0.05788369
## Recommend          -0.451524953 -0.56649385 -0.17002152  0.01098916
## Come again          0.110054977  0.16756964  0.59656171 -0.27588399
## All Product I need  0.239006374  0.01519205 -0.04772105  0.36135415
## Profesionalism      0.121877891  0.06911371 -0.06003705 -0.06779497
## Limitation          0.251231470 -0.35407819  0.16994098 -0.03707063
## Online grocery      0.097485139  0.12952918  0.16619183 -0.22762471
## delivery           -0.167111495  0.45284278 -0.18927088  0.31362688
## Pick up sevice      0.108556822 -0.33912510  0.45559735  0.21938726
## Find items         -0.567871014  0.05685186  0.15299857 -0.34952169
## other shops         0.049014120 -0.10467485 -0.30186323  0.07853529
## Gender             -0.199892044 -0.21692821  0.18412202  0.45023854
## Age                 0.218606925 -0.34031955 -0.17486541 -0.30608249
## Education          -0.006478144  0.04227308  0.26102473  0.39680948
##                            PC14         PC15
## ID                  0.468948575  0.040123525
## CS is helpful      -0.009114756 -0.659013501
## Recommend           0.194319229  0.108629052
## Come again          0.269178701 -0.065458124
## All Product I need  0.263011945  0.285257590
## Profesionalism     -0.532288262  0.289571770
## Limitation         -0.094519621  0.449388612
## Online grocery     -0.134903554  0.165845914
## delivery            0.011207075  0.078634413
## Pick up sevice     -0.069252430 -0.209782045
## Find items         -0.238488774 -0.172355593
## other shops        -0.341718008 -0.156185979
## Gender             -0.060831924 -0.215685965
## Age                -0.121359938  0.049967559
## Education          -0.305204304  0.004242402
dim(pr.out$x)
## [1] 22 15
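The score matrix x has one row per observation (22 here) and one column per component (15). Each score is the standardized data multiplied by the loadings. The sketch below confirms this on the built-in USArrests data, used as an assumed stand-in for the customer file.

```r
# Stand-in data: the built-in USArrests data set.
pca <- prcomp(USArrests, scale. = TRUE)

# Center and scale the data the same way prcomp does.
X_std <- scale(USArrests)

# Scores are the standardized data projected onto the loading vectors.
scores_manual <- X_std %*% pca$rotation

# Compare with the stored scores, ignoring dimnames and rounding error.
scores_match <- isTRUE(all.equal(unname(scores_manual), unname(pca$x)))
print(scores_match)
```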
biplot(pr.out, scale=0)

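# Principal components are unique only up to a sign change, so flipping the
# loadings and the scores together gives an equivalent, mirror-image biplot.
pr.out$rotation <- -pr.out$rotation
pr.out$x <- -pr.out$x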
biplot(pr.out, scale=0)
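Flipping the signs changes the orientation of the plot but not the analysis. As a small check, assuming the built-in USArrests data as a stand-in, the component variances and the reconstructed data are identical before and after the flip:

```r
# Stand-in data: the built-in USArrests data set.
pca <- prcomp(USArrests, scale. = TRUE)

# Flip the sign of both the loadings and the scores.
flipped_rotation <- -pca$rotation
flipped_scores   <- -pca$x

# The variance of each component is unchanged by the flip.
var_same <- isTRUE(all.equal(apply(flipped_scores, 2, var),
                             apply(pca$x, 2, var)))

# Scores times transposed loadings rebuilds the standardized data either way.
recon_same <- isTRUE(all.equal(flipped_scores %*% t(flipped_rotation),
                               pca$x %*% t(pca$rotation)))
print(c(var_same = var_same, recon_same = recon_same))
```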

pr.out$sdev
##  [1] 1.7823571 1.5563851 1.3498398 1.2575163 1.1237776 1.1047966 1.0063521
##  [8] 0.7730121 0.7274903 0.6730120 0.6232295 0.4680043 0.4414254 0.3001192
## [15] 0.1707378
# The variance explained by each component is its squared standard deviation.
pr.var <- pr.out$sdev^2
pr.var
##  [1] 3.17679697 2.42233470 1.82206749 1.58134717 1.26287613 1.22057551
##  [7] 1.01274446 0.59754775 0.52924217 0.45294519 0.38841501 0.21902805
## [13] 0.19485642 0.09007154 0.02915141
# Proportion of variance explained (PVE): each variance divided by the total.
pve <- pr.var / sum(pr.var)
pve
##  [1] 0.211786465 0.161488980 0.121471166 0.105423145 0.084191742 0.081371701
##  [7] 0.067516298 0.039836517 0.035282811 0.030196346 0.025894334 0.014601870
## [13] 0.012990428 0.006004769 0.001943427
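With standardized variables, the component variances add up to the number of variables (15 in the output above), and the proportions add up to 1. The same proportions are also reported by summary(). The sketch below shows both on the built-in USArrests data, an assumed stand-in with 4 variables.

```r
# Stand-in data: the built-in USArrests data set (4 variables).
pca <- prcomp(USArrests, scale. = TRUE)

# Proportion of variance explained, computed by hand.
pca_var <- pca$sdev^2
pve_manual <- pca_var / sum(pca_var)

# The total variance equals the number of standardized variables.
print(sum(pca_var))

# summary() stores the same proportions, rounded, in its importance table.
importance <- summary(pca)$importance
print(importance["Proportion of Variance", ])
print(pve_manual)
```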
plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1),type='b')

plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1),type='b')
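The two plots are usually read to decide how many components to keep. Two common rules of thumb are sketched below using the standard deviations printed earlier in this tutorial. The 80% threshold is an assumption chosen for illustration, and the names n_80 and n_kaiser are hypothetical.

```r
# Standard deviations copied from the pr.out$sdev output above.
sdev <- c(1.7823571, 1.5563851, 1.3498398, 1.2575163, 1.1237776, 1.1047966,
          1.0063521, 0.7730121, 0.7274903, 0.6730120, 0.6232295, 0.4680043,
          0.4414254, 0.3001192, 0.1707378)

pr.var <- sdev^2
pve <- pr.var / sum(pr.var)

# Rule 1: smallest number of components whose cumulative PVE reaches 80%.
n_80 <- which(cumsum(pve) >= 0.80)[1]

# Rule 2 (Kaiser rule): keep components whose variance exceeds 1, that is,
# components that explain more than a single standardized variable.
n_kaiser <- sum(pr.var > 1)

print(c(n_80 = n_80, n_kaiser = n_kaiser))
```

For this dataset both rules point to the first seven components, which together explain about 83% of the variance.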

# Save your cluster solutions in the working directory.
# We want to examine the cluster membership of each observation (see the last column of pca_data).
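The clustering step itself is not shown in this draft. As one possible sketch, the code below runs k-means on the first two principal component scores and appends the cluster membership as the last column of pca_data. Everything here is an assumption made for illustration: the built-in USArrests data stands in for the customer file, and the choice of k-means, two components, three clusters, and the file name pca_clusters.csv are not taken from the original analysis.

```r
set.seed(123)  # k-means starts from random centers, so fix the seed

# Stand-in data: the built-in USArrests data set.
pca <- prcomp(USArrests, scale. = TRUE)

# Cluster on the first two principal component scores (arbitrary choice).
scores <- pca$x[, 1:2]
km <- kmeans(scores, centers = 3, nstart = 25)

# Original variables, the scores, and cluster membership as the last column.
pca_data <- data.frame(USArrests, scores, cluster = km$cluster)
head(pca_data)

# Save the result; a temporary folder is used here, so replace out_file
# with a path in your working directory to keep the file.
out_file <- file.path(tempdir(), "pca_clusters.csv")
write.csv(pca_data, out_file, row.names = TRUE)
```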

References

Principal component analysis - reading (pp. 404-405) https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf

Principal Component Methods in R: Practical Guide http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/

Interpretation of the Principal Components https://online.stat.psu.edu/stat505/lesson/11/11.4