Introduction

The aim of this project is to reduce the dimensionality of the music and movie preferences of survey respondents using Principal Component Analysis (PCA).

Dataset and first view

The dataset used in this project consists of 1010 observations and 150 columns; each row represents a person who took the survey.

The variables (columns) can be split into the following groups:

Music preferences (19 items)
Movie preferences (12 items)
Hobbies & interests (32 items)
Phobias (10 items)
Health habits (3 items)
Personality traits, views on life, & opinions (57 items)
Spending habits (7 items)
Demographics (10 items)

I decided to focus my analysis only on music and movie preferences.

## 'data.frame':    1010 obs. of  150 variables:
##  $ Music                         : int  5 4 5 5 5 5 5 5 5 5 ...
##  $ Slow.songs.or.fast.songs      : int  3 4 5 3 3 3 5 3 3 3 ...
##  $ Dance                         : int  2 2 2 2 4 2 5 3 3 2 ...
##  $ Folk                          : int  1 1 2 1 3 3 3 2 1 5 ...
##  $ Country                       : int  2 1 3 1 2 2 1 1 1 2 ...
##  $ Classical.music               : int  2 1 4 1 4 3 2 2 2 2 ...
##  $ Musical                       : int  1 2 5 1 3 3 2 2 4 5 ...
##  $ Pop                           : int  5 3 3 2 5 2 5 4 3 3 ...
##  $ Rock                          : int  5 5 5 2 3 5 3 5 5 5 ...
##  $ Metal.or.Hardrock             : int  1 4 3 1 1 5 1 1 5 2 ...
##  $ Punk                          : int  1 4 4 4 2 3 1 2 1 3 ...
##  $ Hiphop..Rap                   : int  1 1 1 2 5 4 3 3 1 2 ...
##  $ Reggae..Ska                   : int  1 3 4 2 3 3 1 2 2 4 ...
##  $ Swing..Jazz                   : int  1 1 3 1 2 4 1 2 2 4 ...
##  $ Rock.n.roll                   : int  3 4 5 2 1 4 2 3 2 4 ...
##  $ Alternative                   : int  1 4 5 5 2 5 3 1 NA 4 ...
##  $ Latino                        : int  1 2 5 1 4 3 3 2 1 5 ...
##  $ Techno..Trance                : int  1 1 1 2 2 1 5 3 1 1 ...
##  $ Opera                         : int  1 1 3 1 2 3 2 2 1 2 ...
##  $ Movies                        : int  5 5 5 5 5 5 4 5 5 5 ...
##  $ Horror                        : int  4 2 3 4 4 5 2 4 1 2 ...
##  $ Thriller                      : int  2 2 4 4 4 5 1 4 5 1 ...
##  $ Comedy                        : int  5 4 4 3 5 5 5 5 5 5 ...
##  $ Romantic                      : int  4 3 2 3 2 2 3 2 4 5 ...
##  $ Sci.fi                        : int  4 4 4 4 3 3 1 3 4 1 ...
##  $ War                           : int  1 1 2 3 3 3 3 3 5 3 ...
##  $ Fantasy.Fairy.tales           : int  5 3 5 1 4 4 5 4 4 4 ...
##  $ Animated                      : int  5 5 5 2 4 3 5 4 4 4 ...
##  $ Documentary                   : int  3 4 2 5 3 3 3 3 5 4 ...
##  $ Western                       : int  1 1 2 1 1 2 1 1 1 1 ...
##  $ Action                        : int  2 4 1 2 4 4 2 3 1 2 ...
##  $ History                       : int  1 1 1 4 3 5 3 5 3 3 ...
##  $ Psychology                    : int  5 3 2 4 2 3 3 2 2 2 ...
##  $ Politics                      : int  1 4 1 5 3 4 1 3 1 3 ...
##  $ Mathematics                   : int  3 5 5 4 2 2 1 1 1 3 ...
##  $ Physics                       : int  3 2 2 1 2 3 1 1 1 1 ...
##  $ Internet                      : int  5 4 4 3 2 4 2 5 1 5 ...
##  $ PC                            : int  3 4 2 1 2 4 1 4 1 1 ...
##  $ Economy.Management            : int  5 5 4 2 2 1 3 1 1 4 ...
##  $ Biology                       : int  3 1 1 3 3 4 5 2 3 2 ...
##  $ Chemistry                     : int  3 1 1 3 3 4 5 2 1 1 ...
##  $ Reading                       : int  3 4 5 5 5 3 3 2 5 4 ...
##  $ Geography                     : int  3 4 2 4 2 3 3 3 1 4 ...
##  $ Foreign.languages             : int  5 5 5 4 3 4 4 4 1 5 ...
##  $ Medicine                      : int  3 1 2 2 3 4 5 1 1 1 ...
##  $ Law                           : int  1 2 3 5 2 3 3 2 1 1 ...
##  $ Cars                          : int  1 2 1 1 3 5 4 1 1 1 ...
##  $ Art.exhibitions               : int  1 2 5 5 1 2 1 1 1 4 ...
##  $ Religion                      : int  1 1 5 4 4 2 1 2 2 4 ...
##  $ Countryside..outdoors         : int  5 1 5 1 4 5 4 2 4 4 ...
##  $ Dancing                       : int  3 1 5 1 1 1 3 1 1 5 ...
##  $ Musical.instruments           : int  3 1 5 1 3 5 2 1 2 3 ...
##  $ Writing                       : int  2 1 5 3 1 1 1 1 1 1 ...
##  $ Passive.sport                 : int  1 1 5 1 3 5 5 4 4 4 ...
##  $ Active.sport                  : int  5 1 2 1 1 4 3 5 1 4 ...
##  $ Gardening                     : int  5 1 1 1 4 2 3 1 1 1 ...
##  $ Celebrities                   : int  1 2 1 2 3 1 1 3 5 2 ...
##  $ Shopping                      : int  4 3 4 4 3 2 3 3 2 4 ...
##  $ Science.and.technology        : int  4 3 2 3 3 3 4 2 1 3 ...
##  $ Theatre                       : int  2 2 5 1 2 1 3 2 5 5 ...
##  $ Fun.with.friends              : int  5 4 5 2 4 3 5 4 4 5 ...
##  $ Adrenaline.sports             : int  4 2 5 1 2 3 1 2 1 2 ...
##  $ Pets                          : int  4 5 5 1 1 2 5 5 1 2 ...
##  $ Flying                        : int  1 1 1 2 1 3 1 3 2 4 ...
##  $ Storm                         : int  1 1 1 1 2 2 3 2 3 5 ...
##  $ Darkness                      : int  1 1 1 1 1 2 2 4 1 4 ...
##  $ Heights                       : int  1 2 1 3 1 2 1 3 5 5 ...
##  $ Spiders                       : int  1 1 1 5 1 1 1 1 5 3 ...
##  $ Snakes                        : int  5 1 1 5 1 2 5 5 5 4 ...
##  $ Rats                          : int  3 1 1 5 2 2 1 3 2 4 ...
##  $ Ageing                        : int  1 3 1 4 2 1 4 1 2 3 ...
##  $ Dangerous.dogs                : int  3 1 1 5 4 1 1 2 3 5 ...
##  $ Fear.of.public.speaking       : int  2 4 2 5 3 3 1 4 4 3 ...
##  $ Smoking                       : chr  "never smoked" "never smoked" "tried smoking" "former smoker" ...
##  $ Alcohol                       : chr  "drink a lot" "drink a lot" "drink a lot" "drink a lot" ...
##  $ Healthy.eating                : int  4 3 3 3 4 2 4 2 1 3 ...
##  $ Daily.events                  : int  2 3 1 4 3 2 3 3 1 4 ...
##  $ Prioritising.workload         : int  2 2 2 4 1 2 5 1 2 2 ...
##  $ Writing.notes                 : int  5 4 5 4 2 3 5 3 1 2 ...
##  $ Workaholism                   : int  4 5 3 5 3 3 5 2 4 3 ...
##  $ Thinking.ahead                : int  2 4 5 3 5 3 3 4 2 3 ...
##  $ Final.judgement               : int  5 1 3 1 5 1 3 3 5 5 ...
##  $ Reliability                   : int  4 4 4 3 5 3 4 3 5 4 ...
##  $ Keeping.promises              : int  4 4 5 4 4 4 5 3 4 5 ...
##  $ Loss.of.interest              : int  1 3 1 5 2 3 3 1 1 3 ...
##  $ Friends.versus.money          : int  3 4 5 2 3 2 4 4 4 4 ...
##  $ Funniness                     : int  5 3 2 1 3 3 4 4 2 3 ...
##  $ Fake                          : int  1 2 4 1 2 1 1 2 2 1 ...
##  $ Criminal.damage               : int  1 1 1 5 1 4 2 1 1 2 ...
##  $ Decision.making               : int  3 2 3 5 3 2 2 3 4 5 ...
##  $ Elections                     : int  4 5 5 5 5 5 5 5 1 5 ...
##  $ Self.criticism                : int  1 4 4 5 5 4 3 3 3 4 ...
##  $ Judgment.calls                : int  3 4 4 4 5 4 5 5 2 5 ...
##  $ Hypochondria                  : int  1 1 1 3 1 1 1 2 2 1 ...
##  $ Empathy                       : int  3 2 5 3 3 4 4 1 5 4 ...
##  $ Eating.to.survive             : int  1 1 5 1 1 2 1 2 1 1 ...
##  $ Giving                        : int  4 2 5 1 3 3 5 3 1 4 ...
##  $ Compassion.to.animals         : int  5 4 4 2 3 5 5 5 4 5 ...
##  $ Borrowed.stuff                : int  4 3 2 5 4 5 5 2 5 4 ...
##   [list output truncated]

# Music preferences are columns 1-19, movie preferences are columns 20-31
music <- data[, 1:19]
movies <- data[, 20:31]

Remove missing values

I use summary to check whether missing values are present in the categories of interest.

To remove them, I use the drop_na command.

##                    Music Slow.songs.or.fast.songs                    Dance 
##            "NA's   :3  "            "NA's   :2  "            "NA's   :4  " 
##                     Folk                  Country          Classical.music 
##            "NA's   :5  "            "NA's   :5  "            "NA's   :7  " 
##                  Musical                      Pop                     Rock 
##            "NA's   :2  "            "NA's   :3  "            "NA's   :6  " 
##        Metal.or.Hardrock                     Punk              Hiphop..Rap 
##            "NA's   :3  "            "NA's   :8  "            "NA's   :4  " 
##              Reggae..Ska              Swing..Jazz              Rock.n.roll 
##            "NA's   :7  "            "NA's   :6  "            "NA's   :7  " 
##              Alternative                   Latino           Techno..Trance 
##            "NA's   :7  "            "NA's   :8  "            "NA's   :7  " 
##                    Opera 
##            "NA's   :1  "

##              Movies              Horror            Thriller              Comedy 
##       "NA's   :6  "       "NA's   :2  "       "NA's   :1  "       "NA's   :3  " 
##            Romantic              Sci.fi                 War Fantasy.Fairy.tales 
##       "NA's   :3  "       "NA's   :2  "       "NA's   :2  "       "NA's   :3  " 
##            Animated         Documentary             Western              Action 
##       "NA's   :3  "       "NA's   :8  "       "NA's   :4  "       "NA's   :2  "
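The check and removal described in this section can be sketched as follows. This is a minimal illustration on made-up data, not the code used for the report: the data frame `toy` and its values are assumptions, and base R's `complete.cases` stands in for `drop_na` so that the example runs without extra packages.

```r
# Toy data standing in for a few survey columns (values are made up)
toy <- data.frame(
  Rock = c(5, 4, NA, 3, 5),
  Pop  = c(2, NA, 3, 4, 1),
  Folk = c(1, 2, 2, 5, 3)
)

# Count missing values per column, the same information summary() reports
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the rows with no missing value in any column;
# tidyr::drop_na(toy) selects the same rows
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```

Dropping whole rows is simple, but every respondent with a single missing answer is lost, so it is worth comparing the row count before and after.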

Relation between Music types

As the plot below shows, there are some variables positively correlated with each other while there are also some which are negatively correlated.

For example, it is easy to see that Opera is highly correlated with Classical.music, and Punk with Metal.or.Hardrock.

On the other hand, Metal.or.Hardrock is negatively correlated with Pop.

Relation between Movie types

As before, the plot illustrates the correlations between the variables in the movie data.

For instance, the two most correlated movie types are Animated and Fantasy.Fairy.tales.
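The kind of reading done on the two correlation plots can also be done numerically. The sketch below uses simulated data (the columns `A` to `D` are hypothetical, not survey variables) to show how the strongest positive and negative pairs could be extracted from a correlation matrix.

```r
set.seed(1)
n <- 200
shared  <- rnorm(n)  # common taste driving A and B
opposed <- rnorm(n)  # taste that C follows and D goes against

toy <- data.frame(
  A = shared + rnorm(n, sd = 0.3),
  B = shared + rnorm(n, sd = 0.3),
  C = opposed + rnorm(n),
  D = -opposed + rnorm(n)
)

cm <- cor(toy)
print(round(cm, 2))

# Keep each pair once by blanking the diagonal and the upper triangle
pairs_only <- cm
pairs_only[upper.tri(pairs_only, diag = TRUE)] <- NA

pos <- which(pairs_only == max(pairs_only, na.rm = TRUE), arr.ind = TRUE)
neg <- which(pairs_only == min(pairs_only, na.rm = TRUE), arr.ind = TRUE)

strongest_pos <- c(colnames(cm)[pos[1, "col"]], rownames(cm)[pos[1, "row"]])
strongest_neg <- c(colnames(cm)[neg[1, "col"]], rownames(cm)[neg[1, "row"]])
print(strongest_pos)
print(strongest_neg)
```

On the survey data, the same steps would be applied to `cor(music)` or `cor(movies)` after the missing values have been removed.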

Principal component analysis

Theory

Principal component analysis (PCA) simplifies the complexity in high-dimensional data while retaining trends and patterns.

It does this by transforming the data into fewer dimensions, which act as summaries of features.

PCA may be influenced by two properties of the data, which should be addressed:

skewness – one should apply a log or Box-Cox transformation
magnitude – one should centre and scale the data (normalization)
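The effect of magnitude can be illustrated with a small sketch. It uses R's built-in `USArrests` data rather than the survey (an assumption made only to keep the example self-contained) and covers scaling only, not the skewness correction.

```r
# Built-in dataset whose columns have very different magnitudes
x <- USArrests

pca_raw    <- prcomp(x, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(x, center = TRUE, scale. = TRUE)

# Share of total variance captured by PC1 in each case
share_raw    <- pca_raw$sdev[1]^2 / sum(pca_raw$sdev^2)
share_scaled <- pca_scaled$sdev[1]^2 / sum(pca_scaled$sdev^2)
print(c(raw = share_raw, scaled = share_scaled))

# With scaling, the component variances are the eigenvalues
# of the correlation matrix
eig_cor <- eigen(cor(x), symmetric = TRUE)$values
print(all.equal(pca_scaled$sdev^2, eig_cor))
```

Without scaling, the variable with the largest variance dominates the first component. The survey items share a 1-5 scale, so the effect is milder there, but scaling still puts the eigenvalues on the correlation-matrix footing that the Kaiser-Guttman rule assumes.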

PCA code

Music

Choice of the number of components

The Kaiser-Guttman stopping rule is one way of determining which components should be retained.

Under this rule, components with an eigenvalue greater than 1 are retained.

It is related to the scree test, in which the eigenvalues are plotted on the vertical axis against the components on the horizontal axis.

The components are ordered from the largest eigenvalue to the smallest, and we pick the number of components based on the elbow rule.

We select the number of components at the point where the line of eigenvalues levels off.

The other technique is to look at the percentage of variance explained: retaining enough components to explain 80-90 percent of the variance is considered adequate.
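Two of these rules reduce to simple arithmetic on the eigenvalues. The sketch below applies them to the music eigenvalues that `get_eigenvalue(pca1)` reports later in this report, copied here rounded to four decimals; the names `eig`, `n_kaiser` and `n_80` are illustrative.

```r
# Eigenvalues of the music PCA, rounded from the get_eigenvalue(pca1) output
eig <- c(3.8188, 2.6526, 2.0173, 1.1254, 1.0934, 1.0341, 0.8898,
         0.8479, 0.8088, 0.6511, 0.6376, 0.5770, 0.5091, 0.4415,
         0.4340, 0.4007, 0.3795, 0.3509, 0.3306)

# Kaiser-Guttman rule: count the components with eigenvalue > 1
n_kaiser <- sum(eig > 1)

# Variance-explained rule: smallest number of components reaching 80%
cum_share <- cumsum(eig) / sum(eig)
n_80 <- which(cum_share >= 0.80)[1]

print(c(kaiser = n_kaiser, variance_80 = n_80))
```

The elbow rule has no equally simple formula, because it depends on a visual judgement of where the scree plot flattens.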

pca1<-prcomp(music, center=TRUE, scale.=TRUE)
summary(pca1)

## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.954 1.6287 1.4203 1.06085 1.04564 1.01691 0.94327
## Proportion of Variance 0.201 0.1396 0.1062 0.05923 0.05755 0.05443 0.04683
## Cumulative Proportion  0.201 0.3406 0.4468 0.50601 0.56355 0.61798 0.66481
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.92083 0.89933 0.80689 0.79847 0.75958 0.71350 0.66449
## Proportion of Variance 0.04463 0.04257 0.03427 0.03356 0.03037 0.02679 0.02324
## Cumulative Proportion  0.70944 0.75201 0.78627 0.81983 0.85020 0.87699 0.90023
##                           PC15    PC16    PC17    PC18   PC19
## Standard deviation     0.65879 0.63298 0.61603 0.59238 0.5749
## Proportion of Variance 0.02284 0.02109 0.01997 0.01847 0.0174
## Cumulative Proportion  0.92307 0.94416 0.96413 0.98260 1.0000

Although the values greater than 1 above suggest taking 6 components, the following graph shows that the first 6 components together are not enough to reach 80% of the variance explained (they account for about 62%).

fviz_eig(pca1, addlabels = T)

In fact, under the percentage-of-variance-explained criterion, I need at least 11 principal components to cover at least 80% of the variation in the data.

summary(pca1)

## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.954 1.6287 1.4203 1.06085 1.04564 1.01691 0.94327
## Proportion of Variance 0.201 0.1396 0.1062 0.05923 0.05755 0.05443 0.04683
## Cumulative Proportion  0.201 0.3406 0.4468 0.50601 0.56355 0.61798 0.66481
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.92083 0.89933 0.80689 0.79847 0.75958 0.71350 0.66449
## Proportion of Variance 0.04463 0.04257 0.03427 0.03356 0.03037 0.02679 0.02324
## Cumulative Proportion  0.70944 0.75201 0.78627 0.81983 0.85020 0.87699 0.90023
##                           PC15    PC16    PC17    PC18   PC19
## Standard deviation     0.65879 0.63298 0.61603 0.59238 0.5749
## Proportion of Variance 0.02284 0.02109 0.01997 0.01847 0.0174
## Cumulative Proportion  0.92307 0.94416 0.96413 0.98260 1.0000

Given these conflicting results, I decided to investigate the number of components with another method: parallel analysis.

I compare my eigenvalues with the 95th percentile of the eigenvalues obtained from random data.

If an eigenvalue is higher than the corresponding value returned by the function, there is support for keeping that component.

In my case, the first 3 eigenvalues clearly exceed their 95th percentile values, while the fourth (1.125 against 1.164) falls just short, so I treat it as borderline: 4 components (3 + 1 borderline).

(eig.val <- get_eigenvalue(pca1))

##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   3.8188378        20.099146                    20.09915
## Dim.2   2.6525925        13.961013                    34.06016
## Dim.3   2.0173416        10.617587                    44.67775
## Dim.4   1.1253978         5.923146                    50.60089
## Dim.5   1.0933675         5.754566                    56.35546
## Dim.6   1.0341125         5.442697                    61.79816
## Dim.7   0.8897612         4.682954                    66.48111
## Dim.8   0.8479299         4.462789                    70.94390
## Dim.9   0.8087956         4.256819                    75.20072
## Dim.10  0.6510723         3.426696                    78.62741
## Dim.11  0.6375536         3.355545                    81.98296
## Dim.12  0.5769631         3.036648                    85.01961
## Dim.13  0.5090878         2.679409                    87.69902
## Dim.14  0.4415476         2.323935                    90.02295
## Dim.15  0.4340003         2.284212                    92.30716
## Dim.16  0.4006642         2.108759                    94.41592
## Dim.17  0.3794989         1.997363                    96.41328
## Dim.18  0.3509148         1.846920                    98.26020
## Dim.19  0.3305612         1.739796                   100.00000

hornpa(k=19,size=1000,reps=500,seed=123)

## 
##  Parallel Analysis Results  
##  
## Method: pca 
## Number of variables: 19 
## Sample size: 1000 
## Number of correlation matrices: 500 
## Seed: 123 
## Percentile: 0.95 
##  
## Compare your observed eigenvalues from your original dataset to the 95 percentile in the table below generated using random data. If your eigenvalue is greater than the percentile indicated (not the mean), you have support to retain that factor/component. 
##  
##  Component  Mean  0.95
##          1 1.247 1.292
##          2 1.205 1.239
##          3 1.169 1.194
##          4 1.139 1.164
##          5 1.112 1.136
##          6 1.085 1.108
##          7 1.062 1.083
##          8 1.039 1.057
##          9 1.016 1.034
##         10 0.995 1.014
##         11 0.973 0.991
##         12 0.951 0.971
##         13 0.929 0.948
##         14 0.908 0.926
##         15 0.885 0.905
##         16 0.862 0.881
##         17 0.837 0.857
##         18 0.809 0.834
##         19 0.775 0.804
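The idea behind the table can be reproduced by hand. The sketch below is a simplified re-implementation of parallel analysis, not the internal code of `hornpa`: it simulates uncorrelated normal data of the same shape, takes the 95th percentile of each ordered eigenvalue, and compares the result with the observed music eigenvalues (rounded from the output above). It uses 100 replications instead of 500 to keep the run short, so the thresholds differ slightly from the table.

```r
set.seed(123)
k    <- 19    # number of variables
n    <- 1000  # sample size
reps <- 100   # number of random correlation matrices (assumption: fewer than 500)

# Each column holds the ordered eigenvalues of one random correlation matrix
sim_eig <- replicate(reps, {
  x <- matrix(rnorm(n * k), nrow = n, ncol = k)
  eigen(cor(x), symmetric = TRUE, only.values = TRUE)$values
})

# 95th percentile of each ordered eigenvalue across the replications
threshold <- apply(sim_eig, 1, quantile, probs = 0.95)

# Observed eigenvalues of the music PCA
observed <- c(3.8188, 2.6526, 2.0173, 1.1254, 1.0934, 1.0341, 0.8898,
              0.8479, 0.8088, 0.6511, 0.6376, 0.5770, 0.5091, 0.4415,
              0.4340, 0.4007, 0.3795, 0.3509, 0.3306)

# Number of leading components whose eigenvalue exceeds the threshold
n_retain <- sum(cumprod(observed > threshold))
print(round(head(cbind(observed, threshold)), 3))
print(n_retain)
```

Applied strictly, the comparison supports the first three components; whether to add the fourth is the borderline judgement discussed above.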

Overall, the music results lead me to consider 5 components.

CLUSTER

Here, I use k-means to cluster the music data into 5 groups, matching the number of components chosen above, and plot the result.

km1<-eclust(music, k=5)

autoplot(pca1, loadings=TRUE, loadings.colour='blue', loadings.label=TRUE, loadings.label.size=5)
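For readers without factoextra, the same two steps (k-means, then a view of the clusters on the first two principal components) can be sketched in base R. The example uses the built-in `USArrests` data and 4 clusters purely for illustration; neither choice comes from the survey analysis.

```r
set.seed(42)

# Standardize the variables, as is done before the PCA
x <- scale(USArrests)

pca <- prcomp(x)
km  <- kmeans(x, centers = 4, nstart = 25)

# Scores on the first two components, labelled with the k-means cluster
scores <- data.frame(pca$x[, 1:2], cluster = factor(km$cluster))
print(table(scores$cluster))
print(head(scores))
```

Plotting `PC1` against `PC2` coloured by `cluster` gives a picture comparable to the cluster plot. Note that the clusters are found on all standardized variables, and the components are used only for display.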

Movie

Choice of the number of components

As with the music preferences, I chose the number of components using three different methods:

Elbow rule
Percentage of variance explained
Parallel analysis

The only change is in how the pre-processing is specified, which does not affect the results of the Principal Component Analysis.

The first 4 eigenvalues are greater than 1, the fourth only marginally (1.0007).

pca2 <- prcomp(movies, center=TRUE, scale=TRUE)
fviz_eig(pca2, choice='eigenvalue', addlabels = T)

Comparing the scree plot with the Cumulative Proportion row of the summary, at least 8 components are needed to exceed 80% of the variance explained (8 components reach about 84%, and 9 are needed to pass 85%).

summary(pca2)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.5938 1.4522 1.2083 1.00035 0.89769 0.88677 0.86561
## Proportion of Variance 0.2117 0.1757 0.1217 0.08339 0.06715 0.06553 0.06244
## Cumulative Proportion  0.2117 0.3874 0.5091 0.59250 0.65965 0.72518 0.78762
##                            PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.82154 0.77391 0.74671 0.64831 0.54479
## Proportion of Variance 0.05624 0.04991 0.04646 0.03503 0.02473
## Cumulative Proportion  0.84387 0.89378 0.94024 0.97527 1.00000

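Regarding parallel analysis, comparing my eigenvalues with the 95th percentile column shows that the first 3 eigenvalues clearly exceed their thresholds, while the fourth (1.001 against 1.093) falls short; treating it as borderline, 4 components (3 + 1 borderline) should be kept.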

(eig.val2 <- get_eigenvalue(pca2))

##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   2.5403240        21.169367                    21.16937
## Dim.2   2.1090111        17.575093                    38.74446
## Dim.3   1.4599617        12.166348                    50.91081
## Dim.4   1.0006963         8.339136                    59.24994
## Dim.5   0.8058430         6.715359                    65.96530
## Dim.6   0.7863557         6.552964                    72.51827
## Dim.7   0.7492737         6.243948                    78.76221
## Dim.8   0.6749285         5.624404                    84.38662
## Dim.9   0.5989331         4.991109                    89.37773
## Dim.10  0.5575788         4.646490                    94.02422
## Dim.11  0.4203027         3.502523                    97.52674
## Dim.12  0.2967911         2.473259                   100.00000

hornpa(k=12,size=1000,reps=500,seed=123)

## 
##  Parallel Analysis Results  
##  
## Method: pca 
## Number of variables: 12 
## Sample size: 1000 
## Number of correlation matrices: 500 
## Seed: 123 
## Percentile: 0.95 
##  
## Compare your observed eigenvalues from your original dataset to the 95 percentile in the table below generated using random data. If your eigenvalue is greater than the percentile indicated (not the mean), you have support to retain that factor/component. 
##  
##  Component  Mean  0.95
##          1 1.180 1.224
##          2 1.135 1.167
##          3 1.099 1.127
##          4 1.067 1.093
##          5 1.038 1.060
##          6 1.010 1.030
##          7 0.984 1.004
##          8 0.957 0.978
##          9 0.929 0.951
##         10 0.900 0.924
##         11 0.870 0.896
##         12 0.831 0.863

Overall, the movie results lead me to consider 4 components.

CLUSTER

Finally, I divided the movie preferences into 4 clusters and plotted them.

km2<-eclust(movies, k=4)

autoplot(pca2, loadings=TRUE, loadings.colour='darkred', loadings.label=TRUE, loadings.label.size=5)

Variable’s loadings

The plots below show the correlations between the original variables and the strength of each variable's contribution to specific principal components.

Variables that are grouped together are positively correlated, while variables placed on opposite sides of the plot are negatively correlated.

The length of a vector indicates how strongly that variable contributes to the principal components shown.
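The arrow coordinates in such plots can be derived directly from a `prcomp` object. The sketch below, on the built-in `USArrests` data (an assumption made for self-containment), checks that for a scaled PCA the loading multiplied by the component's standard deviation equals the correlation between the variable and the component scores.

```r
x   <- USArrests
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Arrow coordinates: loading times the standard deviation of the component
var_coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# The same numbers are the correlations between variables and scores
var_cor <- cor(x, pca$x)
print(all.equal(var_coord, var_cor, check.attributes = FALSE))

# Squared arrow length in the PC1-PC2 plane: the share of each variable's
# variance captured by these two components
arrow_len2 <- rowSums(var_coord[, 1:2]^2)
print(round(arrow_len2, 3))
```

An arrow that nearly reaches the unit circle belongs to a variable that is well represented by the two plotted components; a short arrow means the variable is mostly explained by other components.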

Music

PC1, PC2

The plot below considers PC1 and PC2 together.

fviz_pca_var(pca1, col.var = "navy", repel = TRUE, axes = c(1, 2)) +
  labs(title="Principal component analysis - Music", x="PC1", y="PC2")

PC2, PC3

The plot below considers PC2 and PC3 together.

fviz_pca_var(pca1, col.var = "navy", repel = TRUE, axes = c(2, 3)) +
  labs(title="Principal component analysis - Music", x="PC2", y="PC3")

PC3, PC4

The plot below considers PC3 and PC4 together.

fviz_pca_var(pca1, col.var = "navy", repel = TRUE, axes = c(3, 4)) +
  labs(title="Principal component analysis - Music", x="PC3", y="PC4")

PC4, PC5

The plot below considers PC4 and PC5 together.

fviz_pca_var(pca1, col.var = "navy", repel = TRUE, axes = c(4, 5)) +
  labs(title="Principal component analysis - Music", x="PC4", y="PC5")

Movie

PC1, PC2

The plot below considers PC1 and PC2 together.

fviz_pca_var(pca2, col.var = "darkred", repel = TRUE, axes = c(1, 2)) +
  labs(title="Principal component analysis - Movie", x="PC1", y="PC2")

PC2, PC3

The plot below considers PC2 and PC3 together.

fviz_pca_var(pca2, col.var = "darkred", repel = TRUE, axes = c(2, 3)) +
  labs(title="Principal component analysis - Movie", x="PC2", y="PC3")

PC3, PC4

The plot below considers PC3 and PC4 together.

fviz_pca_var(pca2, col.var = "darkred", repel = TRUE, axes = c(3, 4)) +
  labs(title="Principal component analysis - Movie", x="PC3", y="PC4")

Analysis of components

Music

Continuing the analysis, the plot below is not very helpful for visualizing the components.

On the other hand, it can be useful to interpret the first five dimensions through the following music types:

PC1: Rock and Roll, Classical and Jazz
PC2: Dance, Latino and Pop
PC3: Slow or fast songs, Reggae, Punk vs Opera
PC4: Reggae vs Techno
PC5: Pop, Rock and Musical vs Techno

fviz_pca_var(pca1, col.var="contrib", col.circle = "blue",repel = T,ggtheme = theme_minimal())

pca1$rotation[,1:5]

##                                  PC1         PC2         PC3          PC4
## Music                    -0.07910658 -0.06773298  0.20146291 -0.254868321
## Slow.songs.or.fast.songs  0.07401692 -0.02118718  0.32501523 -0.412353666
## Dance                     0.08912409 -0.41141203  0.24874411 -0.185907411
## Folk                     -0.24907735 -0.19557249 -0.18807192 -0.116988542
## Country                  -0.22796452 -0.15293615 -0.10319001 -0.007203249
## Classical.music          -0.33323030 -0.10626120 -0.23209176 -0.261992839
## Musical                  -0.21629296 -0.27823414 -0.18778344  0.018861734
## Pop                       0.08175356 -0.37694117  0.11266663  0.014169543
## Rock                     -0.31414063  0.16180183  0.25068801 -0.034958632
## Metal.or.Hardrock        -0.27074221  0.27896234  0.20058940 -0.161250816
## Punk                     -0.25452724  0.23314253  0.32438800  0.031380326
## Hiphop..Rap               0.13882985 -0.28639299  0.31573766  0.191871206
## Reggae..Ska              -0.16549974 -0.14036770  0.32986698  0.461575056
## Swing..Jazz              -0.32830852 -0.16832127  0.02723633  0.194691467
## Rock.n.roll              -0.35014029 -0.01727067  0.16444190  0.147014209
## Alternative              -0.29025469  0.13063016  0.16933259 -0.050340112
## Latino                   -0.11330156 -0.40114229 -0.02427954  0.183821427
## Techno..Trance            0.08787564 -0.22289556  0.28282930 -0.451521882
## Opera                    -0.28840758 -0.13001414 -0.30035467 -0.264047130
##                                  PC5
## Music                     0.36421487
## Slow.songs.or.fast.songs  0.03391737
## Dance                    -0.02388895
## Folk                     -0.22087379
## Country                  -0.18740981
## Classical.music          -0.10652367
## Musical                   0.32763451
## Pop                       0.45330851
## Rock                      0.33262216
## Metal.or.Hardrock        -0.03814476
## Punk                      0.01845789
## Hiphop..Rap              -0.21104231
## Reggae..Ska              -0.28975923
## Swing..Jazz              -0.14527963
## Rock.n.roll               0.15707578
## Alternative              -0.11861334
## Latino                    0.13328062
## Techno..Trance           -0.35924519
## Opera                    -0.09968505

Let's visualize the variable contributions to the first 4 principal components (4 rather than 5, for readability).

var<-get_pca_var(pca1)
a<-fviz_contrib(pca1, "var", axes=1, xtickslab.rt=90)
b<-fviz_contrib(pca1, "var", axes=2, xtickslab.rt=90)
c<-fviz_contrib(pca1, "var", axes=3, xtickslab.rt=90)
d<-fviz_contrib(pca1, "var", axes=4, xtickslab.rt=90)
e<-fviz_contrib(pca1, "var", axes=5, xtickslab.rt=90)

grid.arrange(a,b,c,d,top='Contribution to the first 4 Principal Components')
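The percentages in contribution plots of this kind can be computed by hand. For a `prcomp` fit, the contribution of a variable to a component is its squared loading times 100, because each loading vector has unit length. The sketch below shows this on the built-in `USArrests` data; the comparison with the expected average contribution is an illustration of how such bars are usually read, not a description of the exact plot above.

```r
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Contribution (in %) of each variable to each component
contrib <- 100 * pca$rotation^2
print(round(contrib[, 1:2], 1))

# Each column sums to 100
print(colSums(contrib))

# Under equal contributions, every variable would account for 100 / p percent
expected <- 100 / nrow(pca$rotation)
above_avg <- rownames(contrib)[contrib[, "PC1"] > expected]
print(above_avg)
```

Variables above the expected average are the ones usually named when interpreting a component, which is how the lists of music and movie types in this section can be read.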

Movie

In the plot below, the redder the arrow, the larger the contribution of that movie type.

The loadings table below allows the first 4 dimensions to be interpreted through the following movie types:

PC1: War and Action
PC2: Fantasy.Fairy.tales and Animated
PC3: Horror vs Documentary
PC4: Comedy vs Horror

fviz_pca_var(pca2, col.var="contrib", 
             col.circle = "darkred",
             gradient.cols=c("yellow2","red2"), 
             repel = T,ggtheme = theme_minimal())

pca2$rotation[,1:4]

##                             PC1        PC2         PC3          PC4
## Movies              -0.17247372 0.30093113 -0.29457835  0.173533210
## Horror              -0.29873279 0.04089540 -0.48318125 -0.366010065
## Thriller            -0.39067146 0.04516843 -0.38457768 -0.330003569
## Comedy               0.02472672 0.35412470 -0.24791208  0.496728998
## Romantic             0.26740368 0.35425303 -0.10363250  0.282676671
## Sci.fi              -0.37406413 0.13120295  0.03958551  0.211536845
## War                 -0.40131774 0.01563640  0.24713217 -0.001577096
## Fantasy.Fairy.tales  0.14371341 0.55280375  0.15376344 -0.270634022
## Animated             0.06584556 0.54609887  0.12180583 -0.345482098
## Documentary         -0.18210840 0.14934485  0.48563533 -0.176469524
## Western             -0.36425924 0.04084033  0.34900396  0.146949337
## Action              -0.40552911 0.09788688  0.02987917  0.332096560

var2<-get_pca_var(pca2)
a2<-fviz_contrib(pca2, "var", axes=1, xtickslab.rt=90, color = "darkred", fill = "yellow")
b2<-fviz_contrib(pca2, "var", axes=2, xtickslab.rt=90,color = "darkred", fill = "yellow")
c2<-fviz_contrib(pca2, "var", axes=3, xtickslab.rt=90,color = "darkred", fill = "yellow")
d2<-fviz_contrib(pca2, "var", axes=4, xtickslab.rt=90,color = "darkred", fill = "yellow")
grid.arrange(a2,b2,c2,d2,top='Contribution to the first 4 Principal Components')

PCA on Music and Movie preferences

Matteo Pancaldi
