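# Load the survey data (comma-separated, "." as decimal mark)
mydata <- read.table("./2018-personality-data.csv", header=TRUE, sep=",", dec=".")
# Overwrite userid with a running index 1..n
mydata$userid <- seq_len(nrow(mydata))
# Keep userid, the five personality traits, and the two list-rating items
mydata <- mydata[c(1:6, 33, 34)]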
head(mydata)
## userid openness agreeableness emotional_stability conscientiousness
## 1 1 5.0 2.0 3.0 2.5
## 2 2 7.0 4.0 6.0 5.5
## 3 3 4.0 3.0 4.5 2.0
## 4 4 5.5 5.5 4.0 4.5
## 5 5 5.5 5.5 3.5 4.5
## 6 6 6.0 3.0 4.0 3.5
## extraversion is_personalized enjoy_watching
## 1 6.5 4 4
## 2 4.0 2 3
## 3 2.5 2 2
## 4 4.0 3 3
## 5 2.5 2 3
## 6 1.5 2 4
The unit of observation is an individual person. The sample size is 1834.
Explanation of variables:
userid: ID of a person.
Openness: a score (from 1 to 7) assessing the user's tendency to prefer new experiences. 1 means the user tends NOT to prefer new experiences; 7 means the user tends to prefer new experiences.
Agreeableness: a score (from 1 to 7) assessing the user's tendency to be compassionate and cooperative rather than suspicious and antagonistic towards others. 1 means the user tends NOT to be compassionate and cooperative; 7 means the user tends to be compassionate and cooperative.
Emotional Stability: a score (from 1 to 7) assessing the user's tendency to experience psychological stress. 1 means the user tends to experience psychological stress; 7 means the user tends NOT to experience psychological stress.
Conscientiousness: a score (from 1 to 7) assessing the user's tendency to be organized and dependable, and to show self-discipline. 1 means the user does not have such a tendency; 7 means the user has such a tendency.
Extraversion: a score (from 1 to 7) assessing the user's tendency to be outgoing. 1 means the user does not have such a tendency; 7 means the user has such a tendency.
Is_Personalized: the user's response to the statement "This list is personalized for me." Users answered on a 5-point Likert scale (1: Strongly Disagree, 5: Strongly Agree).
Enjoy_watching: the user's response to the statement "This list contains movies I think I enjoyed watching." Users answered on a 5-point Likert scale (1: Strongly Disagree, 5: Strongly Agree).
Source: https://www.kaggle.com/datasets/arslanali4343/top-personality-dataset
library(psych)
describe(mydata)
## vars n mean sd median trimmed mad min max
## userid 1 1834 917.50 529.57 917.5 917.50 679.77 1 1834
## openness 2 1834 5.38 1.04 5.5 5.42 0.74 1 7
## agreeableness 3 1834 4.22 1.14 4.0 4.19 0.74 1 7
## emotional_stability 4 1834 4.56 1.39 4.5 4.60 1.48 1 7
## conscientiousness 5 1834 4.66 1.31 4.5 4.68 1.48 1 7
## extraversion 6 1834 3.49 1.47 3.5 3.43 1.48 1 7
## is_personalized 7 1834 3.06 1.08 3.0 3.11 1.48 1 5
## enjoy_watching 8 1834 3.52 1.06 4.0 3.57 1.48 1 5
## range skew kurtosis se
## userid 1833 0.00 -1.20 12.37
## openness 6 -0.50 0.04 0.02
## agreeableness 6 0.17 -0.15 0.03
## emotional_stability 6 -0.19 -0.65 0.03
## conscientiousness 6 -0.19 -0.57 0.03
## extraversion 6 0.28 -0.65 0.03
## is_personalized 4 -0.26 -0.79 0.03
## enjoy_watching 4 -0.49 -0.51 0.02
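The median agreeableness is 4, meaning that half of the people scored 4 or lower and half scored 4 or higher. Since 4 is the midpoint of the 1 to 7 scale, roughly half of the people tend not to be compassionate and cooperative and half tend to be.
The minimum of is_personalized is 1 and the maximum is 5, so at least one person strongly disagrees and at least one person strongly agrees with the statement that the list of movies is personalized.
The average for openness is 5.38, which is above the scale midpoint of 4, meaning that on average people have some tendency to prefer new experiences.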
Research question: How many dimensions can be identified in the personality traits using PCA, and how do they simplify the understanding of personality profiles?
mydata_PCA <- mydata[c(2:6)]
R <- cor(mydata_PCA)
round(R, 3)
## openness agreeableness emotional_stability
## openness 1.000 0.055 0.075
## agreeableness 0.055 1.000 0.178
## emotional_stability 0.075 0.178 1.000
## conscientiousness 0.011 0.076 0.280
## extraversion 0.261 0.107 0.036
## conscientiousness extraversion
## openness 0.011 0.261
## agreeableness 0.076 0.107
## emotional_stability 0.280 0.036
## conscientiousness 1.000 0.004
## extraversion 0.004 1.000
library(psych)
corPlot(R)
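As a rule of thumb, the correlations between variables should be at least 0.3 in absolute value. In our data none of them reaches this threshold: the strongest correlations are 0.280 (emotional_stability and conscientiousness) and 0.261 (openness and extraversion), so the variables are not sufficiently correlated. Let’s still run the other checks to find out whether the data is appropriate.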
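The 0.3 rule of thumb can be checked programmatically rather than by eye. The following is a minimal sketch in base R; it assumes the correlation matrix is re-entered by hand from the rounded output above (so it is only accurate to three decimals), and the object names `Rabs`, `max_r`, `pair` and `n_above` are illustrative.

```r
# Correlation matrix re-typed from the rounded output above (assumption:
# three-decimal precision is enough for this check).
vars <- c("openness", "agreeableness", "emotional_stability",
          "conscientiousness", "extraversion")
R <- matrix(c(
  1.000, 0.055, 0.075, 0.011, 0.261,
  0.055, 1.000, 0.178, 0.076, 0.107,
  0.075, 0.178, 1.000, 0.280, 0.036,
  0.011, 0.076, 0.280, 1.000, 0.004,
  0.261, 0.107, 0.036, 0.004, 1.000),
  nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Ignore the diagonal and look at absolute off-diagonal correlations
Rabs <- abs(R)
diag(Rabs) <- 0

max_r   <- max(Rabs)                                     # strongest correlation
pair    <- rownames(which(Rabs == max_r, arr.ind = TRUE)) # variables involved
n_above <- sum(Rabs[upper.tri(Rabs)] >= 0.3)              # pairs meeting the rule

max_r
pair
n_above
```

Counting the pairs that meet the threshold makes the conclusion reproducible if the set of variables changes.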
library(psych)
cortest.bartlett(R, n = nrow(mydata_PCA))
## $chisq
## [1] 369.238
##
## $p.value
## [1] 3.275957e-73
##
## $df
## [1] 10
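Bartlett’s test of sphericity tests the null hypothesis (H0) that the population correlation matrix is equal to the identity matrix, i.e. that all correlations between the variables are 0.
We reject H0 (chi-square = 369.24, df = 10, p < 0.001). We found that at least some correlations are different from 0.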
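To make the test less of a black box, the statistic can be reproduced by hand from the determinant of the correlation matrix. This is a sketch of the standard Bartlett formula, assuming the rounded correlation matrix re-typed from the output above and n = 1834; because of the rounding, the result should be close to, but not necessarily identical with, the `cortest.bartlett()` output.

```r
# Correlation matrix re-typed from the rounded output above (assumption)
vars <- c("openness", "agreeableness", "emotional_stability",
          "conscientiousness", "extraversion")
R <- matrix(c(
  1.000, 0.055, 0.075, 0.011, 0.261,
  0.055, 1.000, 0.178, 0.076, 0.107,
  0.075, 0.178, 1.000, 0.280, 0.036,
  0.011, 0.076, 0.280, 1.000, 0.004,
  0.261, 0.107, 0.036, 0.004, 1.000),
  nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

n <- 1834          # sample size
p <- ncol(R)       # number of variables

# Bartlett's statistic: large when det(R) is far below 1,
# i.e. when R is far from the identity matrix
chisq   <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
df      <- p * (p - 1) / 2
p_value <- pchisq(chisq, df, lower.tail = FALSE)

c(chisq = chisq, df = df, p_value = p_value)
```

With a sample this large the test rejects even for weak correlations, which is why it is read together with the 0.3 rule and the KMO measure rather than on its own.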
library(psych)
KMO(R)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = R)
## Overall MSA = 0.53
## MSA for each item =
## openness agreeableness emotional_stability conscientiousness
## 0.52 0.60 0.53 0.53
## extraversion
## 0.52
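KMO (for the whole set of variables) and MSA (for each variable) measure sampling adequacy by comparing the observed correlations with the partial correlations.
The overall KMO (0.53) and all individual MSA values (0.52 to 0.60) are bigger than 0.5, so we can perform the principal component analysis, although the values only barely exceed the threshold.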
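The KMO and MSA values compare squared correlations with squared partial correlations, and this can be sketched in a few lines of base R. The code assumes the rounded correlation matrix re-typed from the output above; the names `Q`, `R0`, `kmo_overall` and `msa_items` are illustrative, and the results should match `KMO(R)` only up to rounding.

```r
# Correlation matrix re-typed from the rounded output above (assumption)
vars <- c("openness", "agreeableness", "emotional_stability",
          "conscientiousness", "extraversion")
R <- matrix(c(
  1.000, 0.055, 0.075, 0.011, 0.261,
  0.055, 1.000, 0.178, 0.076, 0.107,
  0.075, 0.178, 1.000, 0.280, 0.036,
  0.011, 0.076, 0.280, 1.000, 0.004,
  0.261, 0.107, 0.036, 0.004, 1.000),
  nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Scaled inverse of R: its off-diagonal elements are the partial
# correlations with reversed sign (the sign vanishes after squaring)
Q <- cov2cor(solve(R))
diag(Q) <- 0

R0 <- R
diag(R0) <- 0

# Overall KMO and per-variable MSA
kmo_overall <- sum(R0^2) / (sum(R0^2) + sum(Q^2))
msa_items   <- colSums(R0^2) / (colSums(R0^2) + colSums(Q^2))

round(kmo_overall, 2)
round(msa_items, 2)
```

Values near 0.5 mean the partial correlations are about as large as the raw correlations, so little of the shared variance is common to more than two variables.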
library(FactoMineR)
components <- PCA(mydata_PCA,
scale.unit = TRUE,
graph = FALSE)
library(factoextra)
get_eigenvalue(components)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 1.4415416 28.83083 28.83083
## Dim.2 1.2107801 24.21560 53.04643
## Dim.3 0.9173932 18.34786 71.39430
## Dim.4 0.7465157 14.93031 86.32461
## Dim.5 0.6837694 13.67539 100.00000
The eigenvalues of the first 2 principal components (1.44 and 1.21) are bigger than 1, so each of them captures more variance than a single standardized variable.
The first 2 principal components together explain 53.0% of the total variance. Because we measure subjective self-assessments (soft data), the rule of thumb is that the chosen components should explain around 40% of the total variance, which is satisfied.
The last chosen principal component (PC2) captures 24.2% of the total variance of the original variables, well above the 5% threshold.
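The three rules above can be checked in one place, because a PCA on standardized variables amounts to an eigen-decomposition of the correlation matrix. The sketch below assumes the rounded correlation matrix re-typed from the earlier output, so the eigenvalues agree with `get_eigenvalue(components)` only to about three decimals; the object names are illustrative.

```r
# Correlation matrix re-typed from the rounded output above (assumption)
vars <- c("openness", "agreeableness", "emotional_stability",
          "conscientiousness", "extraversion")
R <- matrix(c(
  1.000, 0.055, 0.075, 0.011, 0.261,
  0.055, 1.000, 0.178, 0.076, 0.107,
  0.075, 0.178, 1.000, 0.280, 0.036,
  0.011, 0.076, 0.280, 1.000, 0.004,
  0.261, 0.107, 0.036, 0.004, 1.000),
  nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Eigenvalues of R = variances of the principal components
ev  <- eigen(R, symmetric = TRUE, only.values = TRUE)$values
pct <- 100 * ev / sum(ev)

rules <- data.frame(
  eigenvalue = round(ev, 3),
  pct        = round(pct, 1),
  cum_pct    = round(cumsum(pct), 1),
  above_1    = ev > 1,      # eigenvalue-greater-than-1 rule
  above_5pct = pct > 5      # each retained component should exceed 5%
)
n_kaiser <- sum(ev > 1)     # number of components with eigenvalue > 1

rules
n_kaiser
```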
fviz_eig(components,
choice = "eigenvalue",
main = "Screeplot",
ylab = "Eigenvalue",
xlab = "Principal component",
addlabels = TRUE)
In the scree plot the cut point is not obvious, but the biggest drop between consecutive eigenvalues is between components 2 and 3 (from 1.21 to 0.92), so we choose 2 principal components.
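When the elbow is hard to see, the drops between consecutive eigenvalues can be computed directly. This small sketch uses the eigenvalues copied from the `get_eigenvalue(components)` output; the names `ev` and `drops` are illustrative.

```r
# Eigenvalues copied from the get_eigenvalue(components) output
ev <- c(Dim.1 = 1.4415416, Dim.2 = 1.2107801, Dim.3 = 0.9173932,
        Dim.4 = 0.7465157, Dim.5 = 0.6837694)

# Drop from each component to the next one
drops <- -diff(ev)
names(drops) <- paste(names(ev)[-length(ev)], "->", names(ev)[-1])

round(drops, 3)

# Position of the largest drop = number of components before the elbow
which.max(drops)
```

This is only a numeric aid to the visual inspection, since the largest-drop heuristic can mislead when the first eigenvalue dominates.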
library(psych)
fa.parallel(mydata_PCA,
sim = FALSE,
fa = "pc")
## Parallel analysis suggests that the number of factors = NA and the number of components = 2
In the parallel analysis we retain the principal components whose empirical eigenvalue is bigger than the corresponding eigenvalue obtained from random data (with sim = FALSE, from resampled versions of our data).
Parallel analysis suggests that we should choose 2 principal components.
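The logic of parallel analysis can be illustrated with a short simulation. Note the assumptions: this sketch generates uncorrelated normal data of the same size (n = 1834, 5 variables) rather than resampling the real data as `fa.parallel(sim = FALSE)` does, it uses the mean simulated eigenvalue as the reference, and it takes the empirical eigenvalues from the output above. It is a simplified variant, not a re-implementation of the `psych` function.

```r
set.seed(1)   # arbitrary seed, for reproducibility only

n    <- 1834  # sample size of the original data
p    <- 5     # number of variables
reps <- 100   # number of random data sets (assumption)

# Empirical eigenvalues copied from the get_eigenvalue(components) output
obs <- c(1.4415416, 1.2107801, 0.9173932, 0.7465157, 0.6837694)

# Eigenvalues of correlation matrices of purely random, uncorrelated data
sim_ev <- replicate(reps, {
  X <- matrix(rnorm(n * p), nrow = n)
  eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values
})
ref <- rowMeans(sim_ev)   # reference eigenvalue for each component rank

# Retain components until the first one that fails to beat the reference
keep <- obs > ref
n_components <- if (all(keep)) p else which(!keep)[1] - 1

round(rbind(empirical = obs, random = ref), 3)
n_components
```

With a sample this large the random eigenvalues stay close to 1, so the comparison gives a result similar to the eigenvalue-greater-than-1 rule.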
library(FactoMineR)
components <- PCA(mydata_PCA,
ncp = 2,
scale.unit = TRUE,
graph = FALSE)
components
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 1834 individuals, described by 5 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
components$var$cor
## Dim.1 Dim.2
## openness 0.4665851 0.60286002
## agreeableness 0.5304237 -0.05227395
## emotional_stability 0.6675727 -0.40961178
## conscientiousness 0.5310001 -0.51946186
## extraversion 0.4635474 0.63795365
components$var$contrib
## Dim.1 Dim.2
## openness 15.10200 30.0170291
## agreeableness 19.51725 0.2256864
## emotional_stability 30.91505 13.8573316
## conscientiousness 19.55969 22.2865101
## extraversion 14.90600 33.6134429
fviz_pca_var(components, repel = TRUE)
mydata$PC1 <- components$ind$coord[ , 1]
mydata$PC2 <- components$ind$coord[ , 2]
Principal component analysis was performed on 5 standardized variables (n = 1834). The KMO measure indicates that the variables are only marginally appropriate, KMO = 0.53, which falls into the category “Miserable”. The MSA statistics are above 0.50 for all individual variables. Based on the parallel analysis and all the other rules explained in class, it makes most sense to retain the first two principal components, which together summarize 53% of the information in the original 5 variables. Based on the component loadings, we conclude that PC1 (𝜆1 = 1.44), on which all five traits load positively (0.46 to 0.67), represents general tendencies of people, while PC2 (𝜆2 = 1.21) represents the contrast between social people (high openness and extraversion, loadings 0.60 and 0.64) and reasonable people (high conscientiousness and emotional stability, loadings -0.52 and -0.41).
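The loadings and contributions reported above are linked by simple arithmetic: the squared loadings in a column sum to that component's eigenvalue, and a variable's contribution is its squared loading as a percentage of that eigenvalue. The sketch below uses the loadings copied from `components$var$cor`; the names `loadings`, `eig`, `contrib` and `communality` are illustrative.

```r
vars <- c("openness", "agreeableness", "emotional_stability",
          "conscientiousness", "extraversion")

# Loadings copied from the components$var$cor output
loadings <- matrix(c(
  0.4665851,  0.60286002,
  0.5304237, -0.05227395,
  0.6675727, -0.40961178,
  0.5310001, -0.51946186,
  0.4635474,  0.63795365),
  ncol = 2, byrow = TRUE, dimnames = list(vars, c("Dim.1", "Dim.2")))

# Column sums of squared loadings reproduce the eigenvalues
eig <- colSums(loadings^2)

# Contribution (%) of each variable to each component
contrib <- 100 * sweep(loadings^2, 2, eig, "/")

# Share of each variable's variance captured by the two components
communality <- rowSums(loadings^2)

round(eig, 3)
round(contrib, 2)
round(communality, 3)
```

The communalities show how well each trait is represented in the two-component solution; a low value signals a variable whose variance is mostly left to the discarded components.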