Importing and describing data

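# Read the comma-separated survey data; the first row holds the column names
mydata <- read.table("./2018-personality-data.csv", header=TRUE, sep=",", dec=".")
# Overwrite userid with a running number from 1 to the number of rows
mydata$userid <- seq(1, nrow(mydata))
# Keep the ID, the five personality traits, and the two evaluation items
mydata <- mydata[c(1:6, 33, 34)]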
head(mydata)
##   userid openness agreeableness emotional_stability conscientiousness
## 1      1      5.0           2.0                 3.0               2.5
## 2      2      7.0           4.0                 6.0               5.5
## 3      3      4.0           3.0                 4.5               2.0
## 4      4      5.5           5.5                 4.0               4.5
## 5      5      5.5           5.5                 3.5               4.5
## 6      6      6.0           3.0                 4.0               3.5
##   extraversion is_personalized enjoy_watching
## 1          6.5               4              4
## 2          4.0               2              3
## 3          2.5               2              2
## 4          4.0               3              3
## 5          2.5               2              3
## 6          1.5               2              4

The unit of observation is an individual person. The sample size is 1834.

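Explanation of variables:

userid: running identifier of the respondent (1 to 1834).
openness, agreeableness, emotional_stability, conscientiousness, extraversion: personality trait scores, observed range 1 to 7, where higher values indicate a stronger expression of the trait.
is_personalized: agreement with the statement that the list of movies is personalized, observed range 1 to 5.
enjoy_watching: evaluation of enjoyment of watching the listed movies, observed range 1 to 5.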

Source: https://www.kaggle.com/datasets/arslanali4343/top-personality-dataset

library(psych)
describe(mydata)
##                     vars    n   mean     sd median trimmed    mad min  max
## userid                 1 1834 917.50 529.57  917.5  917.50 679.77   1 1834
## openness               2 1834   5.38   1.04    5.5    5.42   0.74   1    7
## agreeableness          3 1834   4.22   1.14    4.0    4.19   0.74   1    7
## emotional_stability    4 1834   4.56   1.39    4.5    4.60   1.48   1    7
## conscientiousness      5 1834   4.66   1.31    4.5    4.68   1.48   1    7
## extraversion           6 1834   3.49   1.47    3.5    3.43   1.48   1    7
## is_personalized        7 1834   3.06   1.08    3.0    3.11   1.48   1    5
## enjoy_watching         8 1834   3.52   1.06    4.0    3.57   1.48   1    5
##                     range  skew kurtosis    se
## userid               1833  0.00    -1.20 12.37
## openness                6 -0.50     0.04  0.02
## agreeableness           6  0.17    -0.15  0.03
## emotional_stability     6 -0.19    -0.65  0.03
## conscientiousness       6 -0.19    -0.57  0.03
## extraversion            6  0.28    -0.65  0.03
## is_personalized         4 -0.26    -0.79  0.03
## enjoy_watching          4 -0.49    -0.51  0.02

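The median agreeableness is 4, the midpoint of the scale from 1 to 7. This means that half of the people score 4 or lower, so they do not have a tendency to be compassionate and cooperative, and half score 4 or higher, so they do not have a tendency to be uncompassionate and uncooperative.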

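The minimum of is_personalized is 1, so at least one person chose the lowest category for the statement that the list of movies is personalized. The median is 3, the midpoint of the scale from 1 to 5.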

The mean openness is 5.38 on a scale from 1 to 7, meaning that on average people have some tendency to prefer new experiences.

PCA

Research question: How many dimensions can be identified in the personality traits using PCA, and how do they simplify the understanding of personality profiles?

mydata_PCA <- mydata[c(2:6)]
R <- cor(mydata_PCA)
round(R, 3)
##                     openness agreeableness emotional_stability
## openness               1.000         0.055               0.075
## agreeableness          0.055         1.000               0.178
## emotional_stability    0.075         0.178               1.000
## conscientiousness      0.011         0.076               0.280
## extraversion           0.261         0.107               0.036
##                     conscientiousness extraversion
## openness                        0.011        0.261
## agreeableness                   0.076        0.107
## emotional_stability             0.280        0.036
## conscientiousness               1.000        0.004
## extraversion                    0.004        1.000
library(psych)
corPlot(R)

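As a rule of thumb, each variable should have a correlation of at least 0.3 in absolute value with at least one other variable. In our data the largest correlation is 0.280 (between emotional_stability and conscientiousness), so the variables are not sufficiently correlated. Let’s first do the other checks to find out whether the data is appropriate.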
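
The rule of thumb can be checked directly from the rounded correlation matrix printed above. The following is a minimal sketch: the names `R_print`, `off_diag`, and `max_abs_r` are introduced here for illustration only, and the matrix is retyped from the rounded output rather than computed from the raw data.

```r
# Correlation matrix retyped from the rounded output above (illustrative copy)
vars <- c("openness", "agreeableness", "emotional_stability",
          "conscientiousness", "extraversion")
R_print <- matrix(c(1.000, 0.055, 0.075, 0.011, 0.261,
                    0.055, 1.000, 0.178, 0.076, 0.107,
                    0.075, 0.178, 1.000, 0.280, 0.036,
                    0.011, 0.076, 0.280, 1.000, 0.004,
                    0.261, 0.107, 0.036, 0.004, 1.000),
                  nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Ignore the diagonal and find each variable's strongest absolute correlation
off_diag <- abs(R_print)
diag(off_diag) <- NA
max_abs_r <- apply(off_diag, 1, max, na.rm = TRUE)
print(max_abs_r)

# TRUE only if at least one variable reaches the 0.3 rule of thumb
print(any(max_abs_r >= 0.3))
```

With the matrix above, no variable reaches 0.3; the strongest pairs are emotional_stability with conscientiousness (0.280) and openness with extraversion (0.261).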

library(psych)
cortest.bartlett(R, n = nrow(mydata_PCA))
## $chisq
## [1] 369.238
## 
## $p.value
## [1] 3.275957e-73
## 
## $df
## [1] 10

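Bartlett’s test of sphericity tests the null hypothesis (H0) that the population correlation matrix is equal to the identity matrix, that is, that all variables are uncorrelated in the population.

We reject H0 (chi-square = 369.24, df = 10, p < 0.001). We found that at least some correlations are different from 0.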
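
To make the test less of a black box, the statistic can be reproduced by hand. Bartlett’s chi-square is -(n - 1 - (2p + 5)/6) times the natural log of the determinant of the correlation matrix, with p(p - 1)/2 degrees of freedom. The sketch below assumes the rounded matrix retyped from the output above (named `R_print` here for illustration), so the result only approximates the value reported by `cortest.bartlett`.

```r
# Correlation matrix retyped from the rounded output above (illustrative copy)
R_print <- matrix(c(1.000, 0.055, 0.075, 0.011, 0.261,
                    0.055, 1.000, 0.178, 0.076, 0.107,
                    0.075, 0.178, 1.000, 0.280, 0.036,
                    0.011, 0.076, 0.280, 1.000, 0.004,
                    0.261, 0.107, 0.036, 0.004, 1.000),
                  nrow = 5, byrow = TRUE)

n <- 1834             # sample size reported above
p <- ncol(R_print)    # number of variables

# Bartlett's statistic: large when det(R) is far below 1 (the identity has det 1)
chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R_print))
df <- p * (p - 1) / 2
p_value <- pchisq(chisq, df = df, lower.tail = FALSE)

print(c(chisq = chisq, df = df, p.value = p_value))
```

Small differences from the reported chi-square are expected because the input correlations are rounded to three decimals.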

library(psych)
KMO(R)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = R)
## Overall MSA =  0.53
## MSA for each item = 
##            openness       agreeableness emotional_stability   conscientiousness 
##                0.52                0.60                0.53                0.53 
##        extraversion 
##                0.52

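The Kaiser-Meyer-Olkin (KMO) statistic measures the sampling adequacy of the whole set of variables, and the MSA statistics measure it for each individual variable.

The overall KMO (0.53) and all individual MSA values (0.52 to 0.60) are bigger than 0.5, so we can perform the principal component analysis.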
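
The KMO statistic compares the squared correlations with the squared partial correlations: values near 1 mean the partial correlations are small relative to the ordinary ones. The following base R sketch illustrates the calculation under the assumption that the rounded matrix retyped from the output above (`R_print`, an illustrative name) is close enough to the full-precision one; it is not meant to replace `KMO()`.

```r
# Correlation matrix retyped from the rounded output above (illustrative copy)
vars <- c("openness", "agreeableness", "emotional_stability",
          "conscientiousness", "extraversion")
R_print <- matrix(c(1.000, 0.055, 0.075, 0.011, 0.261,
                    0.055, 1.000, 0.178, 0.076, 0.107,
                    0.075, 0.178, 1.000, 0.280, 0.036,
                    0.011, 0.076, 0.280, 1.000, 0.004,
                    0.261, 0.107, 0.036, 0.004, 1.000),
                  nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Rescaling the inverse correlation matrix gives the partial correlations
# up to sign; the sign does not matter because the values are squared below
Q <- cov2cor(solve(R_print))
diag(Q) <- 0

R0 <- R_print
diag(R0) <- 0

# Overall KMO and the per-variable MSA
kmo <- sum(R0^2) / (sum(R0^2) + sum(Q^2))
msa <- colSums(R0^2) / (colSums(R0^2) + colSums(Q^2))

print(round(kmo, 2))
print(round(msa, 2))
```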

library(FactoMineR)
components <- PCA(mydata_PCA, 
                  scale.unit = TRUE, 
                  graph = FALSE)
library(factoextra)
get_eigenvalue(components)
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  1.4415416         28.83083                    28.83083
## Dim.2  1.2107801         24.21560                    53.04643
## Dim.3  0.9173932         18.34786                    71.39430
## Dim.4  0.7465157         14.93031                    86.32461
## Dim.5  0.6837694         13.67539                   100.00000

How many components to retain?

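The eigenvalues of the first 2 principal components are bigger than 1 (1.44 and 1.21). Because the variables are standardized, each original variable has a variance of 1, so these two components each carry more information than a single original variable.

The first 2 principal components together explain 53.0% of the total variance. Personality traits are subjective self-evaluations, that is, soft data, for which the retained components should explain at least around 40% of the total variance, and this condition is met.

The last chosen principal component (PC2) captures 24.2% of the total variance of the original variables, which is more than the required 5%.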
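
The three rules above can be applied mechanically to the eigenvalue table. The sketch below retypes the eigenvalues from the `get_eigenvalue` output; the names `eig`, `n_kaiser`, `n_soft`, and `n_5pct` are illustrative, and the 40% and 5% thresholds are the ones stated in the text.

```r
# Eigenvalues retyped from the get_eigenvalue() output above
eig <- c(1.4415416, 1.2107801, 0.9173932, 0.7465157, 0.6837694)
names(eig) <- paste0("Dim.", seq_along(eig))

# Share of total variance per component and cumulative share
var_pct <- 100 * eig / sum(eig)
cum_pct <- cumsum(var_pct)

n_kaiser <- sum(eig > 1)                   # eigenvalue-greater-than-1 rule
n_soft   <- unname(which(cum_pct >= 40)[1]) # fewest components reaching 40%
n_5pct   <- sum(var_pct > 5)               # components explaining more than 5% each

print(round(rbind(var_pct, cum_pct), 2))
print(c(kaiser = n_kaiser, soft_40 = n_soft, above_5pct = n_5pct))
```

Note that in this data every component explains more than 5% of the variance, so the 5% rule only confirms that the second component is worth keeping; it does not by itself limit the number of components.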

fviz_eig(components, 
         choice = "eigenvalue", 
         main = "Screeplot",
         ylab = "Eigenvalue",
         xlab = "Principal component",
         addlabels = TRUE)

When looking at the scree plot, the cut-off point is not obvious, but the biggest drop between consecutive eigenvalues is between components 2 and 3 (from 1.21 to 0.92), so we choose 2 principal components.
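
The size of each drop in the scree plot can be read off numerically. A minimal sketch, assuming the eigenvalues retyped from the table above (`eig` and `drops` are illustrative names):

```r
# Eigenvalues retyped from the get_eigenvalue() output above
eig <- c(1.4415416, 1.2107801, 0.9173932, 0.7465157, 0.6837694)

# Drop between consecutive eigenvalues: 1 to 2, 2 to 3, 3 to 4, 4 to 5
drops <- -diff(eig)
names(drops) <- paste(1:4, 2:5, sep = "-")
print(round(drops, 3))

# Position of the largest drop; components before it are retained
largest_drop <- unname(which.max(drops))
print(largest_drop)
```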

library(psych)
fa.parallel(mydata_PCA,
            sim = FALSE,
            fa = "pc")

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  2

In the parallel analysis we retain the principal components whose empirical eigenvalue is bigger than the corresponding theoretical eigenvalue, which is obtained from random data with the same number of observations and variables.

Parallel analysis suggests that we should choose 2 principal components.
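
The idea behind parallel analysis can be sketched in base R. This is an illustration under stated assumptions, not a reimplementation of `fa.parallel`: the data are simulated (not the real survey) so that their population correlations follow the rounded matrix printed earlier, the reference eigenvalues come from shuffling each column independently, and the mean over 100 shuffles is used as the reference, which is a simple choice that may differ from the function's own settings. All object names are illustrative.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Rounded correlation matrix retyped from the output above
R_print <- matrix(c(1.000, 0.055, 0.075, 0.011, 0.261,
                    0.055, 1.000, 0.178, 0.076, 0.107,
                    0.075, 0.178, 1.000, 0.280, 0.036,
                    0.011, 0.076, 0.280, 1.000, 0.004,
                    0.261, 0.107, 0.036, 0.004, 1.000),
                  nrow = 5, byrow = TRUE)

# Simulated stand-in for the survey: 1834 rows with these population correlations
n <- 1834
p <- ncol(R_print)
X <- matrix(rnorm(n * p), nrow = n) %*% chol(R_print)

# Empirical eigenvalues of the sample correlation matrix
eig_emp <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values

# Reference eigenvalues: shuffling each column separately removes the
# correlations but keeps every variable's own distribution
n_iter <- 100
eig_ref <- replicate(n_iter, {
  X_perm <- apply(X, 2, sample)
  eigen(cor(X_perm), symmetric = TRUE, only.values = TRUE)$values
})
ref_mean <- rowMeans(eig_ref)  # one reference value per component

# Count leading components whose empirical eigenvalue exceeds the reference
n_keep <- sum(cumprod(eig_emp > ref_mean))

print(round(rbind(empirical = eig_emp, reference = ref_mean), 3))
print(n_keep)
```

Because the reference eigenvalues for uncorrelated data are all close to 1 at this sample size, the comparison ends up close to the eigenvalue-greater-than-1 rule, but with a threshold that accounts for sampling noise.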

library(FactoMineR)
components <- PCA(mydata_PCA, 
                  ncp = 2, 
                  scale.unit = TRUE, 
                  graph = FALSE)

components
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 1834 individuals, described by 5 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"
components$var$cor
##                         Dim.1       Dim.2
## openness            0.4665851  0.60286002
## agreeableness       0.5304237 -0.05227395
## emotional_stability 0.6675727 -0.40961178
## conscientiousness   0.5310001 -0.51946186
## extraversion        0.4635474  0.63795365
components$var$contrib
##                        Dim.1      Dim.2
## openness            15.10200 30.0170291
## agreeableness       19.51725  0.2256864
## emotional_stability 30.91505 13.8573316
## conscientiousness   19.55969 22.2865101
## extraversion        14.90600 33.6134429
fviz_pca_var(components, repel = TRUE)

mydata$PC1 <- components$ind$coord[ , 1]
mydata$PC2 <- components$ind$coord[ , 2]

Principal component analysis was performed on 5 standardized variables (n = 1834). The KMO measure indicates that the variables are only just appropriate for the analysis, KMO = 0.53, which falls into the category “Miserable”. The MSA statistics are above 0.50 for all individual variables. Based on the parallel analysis and the other rules explained in class, it makes most sense to retain the first two principal components, which together summarize 53% of the information in the original 5 variables. Based on the component loadings, we conclude that PC1 (𝜆1 = 1.44), on which all five traits load positively, represents general tendencies of people, while PC2 (𝜆2 = 1.21) represents the contrast between social people (high openness and extraversion) and reasonable people (high emotional stability and conscientiousness).
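
The loadings, eigenvalues, and contributions reported above are linked by simple arithmetic: the squared loadings of a component sum to its eigenvalue, and a variable's contribution is its squared loading as a percentage of that eigenvalue. The sketch below checks this with the loadings retyped from `components$var$cor`; the object names are illustrative.

```r
# Loadings retyped from the components$var$cor output above
vars <- c("openness", "agreeableness", "emotional_stability",
          "conscientiousness", "extraversion")
loadings <- matrix(c(0.4665851,  0.60286002,
                     0.5304237, -0.05227395,
                     0.6675727, -0.40961178,
                     0.5310001, -0.51946186,
                     0.4635474,  0.63795365),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(vars, c("Dim.1", "Dim.2")))

# Eigenvalue of a component = sum of its squared loadings
eig <- colSums(loadings^2)

# Contribution (%) = squared loading divided by the component's eigenvalue
contrib <- 100 * sweep(loadings^2, 2, eig, "/")

# Share of each variable's variance captured by the two retained components
communality <- rowSums(loadings^2)

print(round(eig, 2))
print(round(contrib, 2))
print(round(communality, 2))
```

The communalities show how well each trait is represented by the two retained components; agreeableness has the lowest value, since it loads only moderately on PC1 and almost not at all on PC2.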