This second project concerning the PCA analysis focuses on a dataset that I created based on a game I was playing when I was younger of the brand Top Trumps. The game consists in duel between cards of opponents on a specific criterion of each character. This analysis will help define which character of the games have similar caracteristics and also could help deine which card is the best asset to own.

Data Preparation

characters<-read.csv("Simpsons.csv", sep=";", dec=",", header=TRUE) 
rownames(characters) <- characters[,1]
characters <- characters[,2:7]

Checking the correlation

nominal values

characterCorrelation <-cor(characters, method="pearson") 
print(characterCorrelation, digits=2)
##                 good.listener personal.hygene     IQ shamelessness
## good.listener           1.000           0.405  0.153        -0.627
## personal.hygene         0.405           1.000  0.307        -0.450
## IQ                      0.153           0.307  1.000        -0.317
## shamelessness          -0.627          -0.450 -0.317         1.000
## huggability             0.012           0.056  0.012        -0.043
## prone.to.mathem        -0.521          -0.392 -0.135         0.270
##                 huggability prone.to.mathem
## good.listener         0.012           -0.52
## personal.hygene       0.056           -0.39
## IQ                    0.012           -0.13
## shamelessness        -0.043            0.27
## huggability           1.000           -0.42
## prone.to.mathem      -0.419            1.00

standardized values

NormalizedCharacters <-data.Normalization(characters, type="n1",normalization="column")
NormalizedCharactersCorrelation <-cor(NormalizedCharacters, method="pearson") 
print (NormalizedCharactersCorrelation, digits =  2)
##                 good.listener personal.hygene     IQ shamelessness
## good.listener           1.000           0.405  0.153        -0.627
## personal.hygene         0.405           1.000  0.307        -0.450
## IQ                      0.153           0.307  1.000        -0.317
## shamelessness          -0.627          -0.450 -0.317         1.000
## huggability             0.012           0.056  0.012        -0.043
## prone.to.mathem        -0.521          -0.392 -0.135         0.270
##                 huggability prone.to.mathem
## good.listener         0.012           -0.52
## personal.hygene       0.056           -0.39
## IQ                    0.012           -0.13
## shamelessness        -0.043            0.27
## huggability           1.000           -0.42
## prone.to.mathem      -0.419            1.00

Some variables are stronly correlated with each other for instance the shamelessness or the “prone to manthem” and the “good listeneer” quality, the shamelessness and the personal hygene or the huggability and the “prone to manthem” variable.

MDS

Analysis for characters

distance <-dist(characters) 
MDSResult <-cmdscale(distance , k=2) 
plot(MDSResult) 
pointLabel(MDSResult, rownames(MDSResult), cex=0.8) 

This plot shows the relation between the characters in 2D. We can spot a clear outlier in the list of characters, ‘Blinky’ which is far away for the other group of characters on this plot. Overall the rest of the variables form a quite homogeneous large group.

Analysis for variables

distanceTranspo <-dist(t(characters)) 
MDSTranspo <-cmdscale(distanceTranspo, k=2) 
plot(MDSTranspo) 
pointLabel(MDSTranspo, rownames(MDSTranspo), cex=0.8) 

On the contrary the relation between the variable is strongly split into three poles. The “prone to manthem” and “IQ” variable are really far don’t share a much in common with the rest of the variable. Shamelessness is in a smaller proportion also far away from the center of activity of this plot.

PCA

# Data Analysis
CharactersStandardised <-data.Normalization(characters, type="n1", normalization="column")
CharactersCov<-cov(CharactersStandardised)
CharactersEigenV<-eigen(CharactersCov)
CharactersEigenV$vectors
##            [,1]        [,2]       [,3]       [,4]       [,5]       [,6]
## [1,] -0.5034617  0.09325768  0.4571593  0.2299526  0.2691041 0.63524755
## [2,] -0.4552084  0.13558455 -0.1178664 -0.7716890 -0.3797004 0.14433788
## [3,] -0.2843950  0.31287532 -0.8124186  0.2017686  0.3325844 0.09940635
## [4,]  0.4849879 -0.27041609 -0.1324578 -0.4593690  0.4800726 0.48231452
## [5,] -0.1593167 -0.76521139 -0.3033209  0.2520396 -0.3938926 0.27998363
## [6,]  0.4447876  0.46514035 -0.0868348  0.1906923 -0.5358892 0.50470524

Princomp Statistic - Complete PCA

SimplePCAMoreInfo <-princomp(characters) 
loadings(SimplePCAMoreInfo)
## 
## Loadings:
##                 Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## good.listener                                0.120  0.990
## personal.hygene                0.173  0.977              
## IQ               0.221  0.961 -0.162                     
## shamelessness   -0.115 -0.130 -0.964  0.187              
## huggability                                 -0.988  0.121
## prone.to.mathem -0.965  0.239                            
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.167  0.167  0.167  0.167  0.167  0.167
## Cumulative Var  0.167  0.333  0.500  0.667  0.833  1.000
plot(SimplePCAMoreInfo)

The first 2 components of the PCA are the ones representing the biggest part of the variance within the dataset and on their own they explain the majority of the variation in the data.

Basic PCA with 2 factors for an easier visualization

nbrDimMax <- 2
SimplePCA <-prcomp(characters, center=TRUE, scale.=TRUE, rank. = nbrDimMax) # stats::
summary (SimplePCA)
## Importance of first k=2 (out of 6) components:
##                           PC1    PC2
## Standard deviation     1.5878 1.1060
## Proportion of Variance 0.4202 0.2039
## Cumulative Proportion  0.4202 0.6240

The first 2 principal components already cumulate 62% of the total variance.

Rotated PCA

RotatedPCA <-principal(characters, nfactors= nbrDimMax, rotate="varimax")
print(loadings(RotatedPCA), digits=3, cutoff=0.4, sort=TRUE)
## 
## Loadings:
##                 RC1    RC2   
## good.listener    0.768       
## personal.hygene  0.718       
## IQ               0.556       
## shamelessness   -0.824       
## huggability             0.874
## prone.to.mathem -0.421 -0.766
## 
##                  RC1   RC2
## SS loadings    2.287 1.457
## Proportion Var 0.381 0.243
## Cumulative Var 0.381 0.624

The factor loading analysis allows a better understanding of the influence of each variable on each factor extracted. The prone to manthem variable is loaded on both factor and the second dimension is only dependant on 2 variables.

Interpretation of Dimension

listPlots <- list()
for(dimens in c(1:nbrDimMax) ){
  listPlots[[dimens]] <- fviz_contrib(SimplePCA, "var", axes = dimens )
}
grid.arrange(grobs = listPlots, ncol = 2) 

The variable contributing mainly to the first dimension extracted by the PCA are the good listener - shamelessness - personal hygene and prone to manthem. On the other hand, the variables contributing the to second dimension are once again the prone to manthem but in a lower proportion and mainly the huggability.

Quality measures of PCA

Scree Plot variance explained

plot(SimplePCA)

plot(SimplePCA, type = "l")

fviz_eig(SimplePCA) 

Complexity

RotatedPCA$complexity
##   good.listener personal.hygene              IQ   shamelessness 
##        1.203510        1.113331        1.095373        1.009325 
##     huggability prone.to.mathem 
##        1.044549        1.553764

The dataset is not too complex, all the values of the variabes are close to 1 and they are quite similar. This is not a bad quality that a data set is not too complex

Uniqueness

RotatedPCA$uniqueness
##   good.listener personal.hygene              IQ   shamelessness 
##       0.3503557       0.4551264       0.6763607       0.3175822 
##     huggability prone.to.mathem 
##       0.2197723       0.2366110

The values are not small which means that there is a consequent part of each variable that is shared with the others.

Individual results

IndividualResults <-get_pca_ind(SimplePCA)  
head(IndividualResults$coord)  
##                          Dim.1      Dim.2
## Squeaky Voiced Teen -0.6853462 -0.5035636
## Fallout Boy         -0.1819672 -0.4937146
## Database            -2.9420074  0.3482658
## Bart Simpson         1.7780077 -0.5959513
## Kearney              1.2938507  0.7897272
## Neslon Muntz         2.5294428  0.4315024
head(IndividualResults$contrib) 
##                          Dim.1     Dim.2
## Squeaky Voiced Teen 0.31052499 0.3455107
## Fallout Boy         0.02189086 0.3321274
## Database            5.72221377 0.1652626
## Bart Simpson        2.08998817 0.4839209
## Kearney             1.10673810 0.8497810
## Neslon Muntz        4.22986281 0.2536994

Visualisation of the PCA

labeled observations in two dimensions

fviz_pca_ind(SimplePCA, col.ind = "darkblue" , repel = TRUE)

Correlation plot

fviz_pca_var(SimplePCA, col.var = "purple")

All of the variables are far away from the center which means that they are well represented on the factor map. We can distinguish 1 groups of variables (IQ, personal Hygene, good listener) that is positively correlated and negatively correlated with the Shamelessness. Prone to manthem and huggability are also negatively correlated.

Standard function

biplot(SimplePCA)

k-means for PCA result

charactersCentered<-center_scale(characters) 
charactersPCA <-princomp(charactersCentered)$scores[, 1:2] 
km<-KMeans_rcpp(charactersPCA, clusters=4, num_init=5, max_iters = 100) 
plot_2d(charactersPCA, km$clusters, km$centroids)

K-means for MDS result

distancePCA <-dist(charactersCentered) # as a main input we need distance between units
MDS1Bis <-cmdscale(distancePCA, k = 2)
km<-KMeans_rcpp(MDS1Bis, clusters= 4, num_init=5, max_iters = 100)
plot_2d (MDS1Bis, km$clusters, km$centroids)

The two clustering are really similar and group the characters into 4 similar groups. We can conclude by saying that in this game there are 4 types of cards with characters having a quite similar way of behvaing and who represent the same asset in the hand of the player