Throughout our lives, all of us have met people with personalities completely different from our own. At some point, though, we may also have wondered about someone, "Why do they remind me so much of myself?". It would appear that everyone has kindred spirits living somewhere in the world :). One way to describe this phenomenon is the Big Five model.
From openpsychometrics.org:
The big five personality traits are the best accepted and most commonly used model of personality in academic psychology. If you take a college course in personality psychology, this is what you will learn about. The big five come from the statistical study of responses to personality items. Using a technique called factor analysis, researchers can look at the responses of people to hundreds of personality items and ask the question "what is the best way to summarize an individual?". This has been done with many samples from all over the world and the general result is that, while there seem to be unlimited personality variables, five stand out from the pack in terms of explaining a lot of a person's answers to questions about their personality:
Extroversion
Neuroticism
Agreeableness
Conscientiousness
Openness to experience
In this project I will attempt to find clusters of the big five personality traits using unsupervised learning methods. These clusters can then be interpreted as particular personality types exhibited in the sample.
The dataset comes from https://www.kaggle.com/datasets/ramontanoeiro/big-five-personality-test-removed-nan-and-0. It consists of answers to a questionnaire collected by Open Psychometrics. There are 50 questions in total, 10 for each of the big five personality traits. Participants were asked to rate how much they agree with each statement on a scale from 1 (strongly disagree) to 5 (strongly agree).
As the data set is quite large, with 874,366 observations and 51 variables, I subsampled it to 1,000 observations. I loaded the data with the data.table package, since it is well suited to reading large files.
library("data.table")
library("factoextra")
library("cluster")
library("hopkins")
library("GGally")
library("gridExtra")
library("ClusterR")
library("fpc")
library("rmarkdown")
bfive <- fread("bigfive/data-final-clean.csv", header=TRUE, sep=",")
set.seed(1)
# draw 1,000 random respondents and keep only the 50 question columns
bfive <- bfive[sample(nrow(bfive), 1000), 1:(ncol(bfive)-1), with=FALSE]
The number of features may be too high to use directly for clustering. An aggregate score for each trait will be easier to work with, so I decided to sum the item scores per trait. First, though, some columns need to be transformed. Quite a few of the questions measuring a given trait are scored higher when a subject exhibits the opposite tendency. This statement from the questionnaire illustrates the point:
EXT2: I don’t talk a lot.
Obviously, it's unlikely for an extroverted person to describe themselves as quiet. Therefore, the score for all such reverse-keyed statements needs to be inverted: subtracting it from 6 turns a 5 into a 1, a 4 into a 2, and so on.
# Reverse-keyed items: invert the 1-5 scale so that a higher value always
# indicates more of the corresponding trait
reversed <- c("EXT2", "EXT4", "EXT6", "EXT8", "EXT10",
              "EST2", "EST4",
              "AGR2", "AGR4", "AGR6", "AGR8", "AGR9", "AGR10",
              "CSN2", "CSN4", "CSN6", "CSN8",
              "OPN2", "OPN4", "OPN6")
bfive[, (reversed) := lapply(.SD, function(x) 6 - x), .SDcols = reversed]
Now, the items can be summed into five new aggregate variables representing each individual's score on each of the personality traits.
bfive$Extroversion <- rowSums(bfive[,c(1:10)])
bfive$Neuroticism <- rowSums(bfive[,c(11:20)])
bfive$Agreeableness <- rowSums(bfive[,c(21:30)])
bfive$Conscientiousness <- rowSums(bfive[,c(31:40)])
bfive$Openness <- rowSums(bfive[,c(41:50)])
bfive <- bfive[,c(51:55)]
It is not necessary to scale the columns, since they should all lie on the same scale of 10 to 50. The summary() function lets me verify this.
summary(bfive)
## Extroversion Neuroticism Agreeableness Conscientiousness
## Min. :10.00 Min. :10.00 Min. :10.00 Min. :11.00
## 1st Qu.:23.00 1st Qu.:24.00 1st Qu.:17.00 1st Qu.:29.00
## Median :29.00 Median :31.00 Median :21.00 Median :34.00
## Mean :29.47 Mean :30.66 Mean :21.99 Mean :33.73
## 3rd Qu.:36.00 3rd Qu.:37.00 3rd Qu.:26.00 3rd Qu.:39.00
## Max. :50.00 Max. :50.00 Max. :49.00 Max. :50.00
## Openness
## Min. :19.00
## 1st Qu.:35.00
## Median :40.00
## Mean :39.06
## 3rd Qu.:44.00
## Max. :50.00
There are no individuals that score lower than 10 or higher than 50, so the scores all have the same scale.
Before proceeding with clustering, it is recommended to check whether the data has a clustering tendency at all. The Hopkins statistic is one way of verifying that.
set.seed(1)
hopkins(bfive, m=999)
## [1] 0.9442261
The statistic is well above 0.5, which indicates that the scores have a strong clustering tendency.
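For context, the Hopkins statistic compares nearest-neighbour distances in the real data with those of points generated uniformly over the same space; as far as I can tell, the hopkins package computes a form along these lines:
\(H = \frac{\sum_{i=1}^{m} u_i^{d}}{\sum_{i=1}^{m} u_i^{d} + \sum_{i=1}^{m} w_i^{d}}\)
Where:
\(m\) - number of sampled observations
\(d\) - number of dimensions of the data
\(u_i\) - distance from the \(i\)-th uniformly generated point to its nearest real observation
\(w_i\) - distance from the \(i\)-th sampled real observation to its nearest neighbour in the data
Values around 0.5 correspond to uniformly random data, while values approaching 1 indicate a strong clustering tendency.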
Next, it may prove useful to examine the plot of the Ordered Dissimilarity Matrix.
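A minimal sketch of how such a plot can be produced with factoextra's fviz_dist, assuming Euclidean distances on the five aggregates:
# Ordered dissimilarity matrix (VAT-style plot) of the Euclidean distances
fviz_dist(dist(bfive), show_labels = FALSE)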
There are some colour blocks visible in the plot above, so there are grounds for clustering the big five aggregates.
I will use three clustering algorithms: K-means, Partitioning Around Medoids (PAM) and Gaussian Mixture Model (GMM). The number of clusters is unknown and can be determined by using a variety of methods. For K-means and PAM I will compare the silhouette scores and for GMM I will use the Bayesian Information Criterion (BIC).
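The silhouette-based choice of k for K-means and PAM can be visualised with fviz_nbclust from factoextra; a sketch, where the upper bound k.max = 10 is my own assumption:
# Average silhouette width for k = 2..10 (kmeans from base R, pam from the cluster package)
fviz_nbclust(bfive, kmeans, method = "silhouette", k.max = 10)
fviz_nbclust(bfive, cluster::pam, method = "silhouette", k.max = 10)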
kmeans_clust <- eclust(bfive, k=2, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
## cluster size ave.sil.width
## 1 1 489 0.24
## 2 2 511 0.19
pam_clust <- eclust(bfive, k=2, FUNcluster='pam', hc_metric="euclidean", graph=FALSE)
## cluster size ave.sil.width
## 1 1 570 0.22
## 2 2 430 0.18
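The silhouette summaries printed above are the kind of output produced when the silhouettes of the two partitions are plotted; a sketch of how those plots can be drawn with factoextra:
# Silhouette plots for the K-means and PAM solutions
fviz_silhouette(kmeans_clust)
fviz_silhouette(pam_clust)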
A Gaussian Mixture Model is a probabilistic model that assigns a separate Gaussian distribution to each cluster. The parameters of these distributions are found using the Expectation-Maximization (EM) algorithm. The animation below illustrates how GMM finds clusters:
The implementation of GMM in the “ClusterR” package enables finding the optimal number of clusters using the Bayesian Information Criterion (BIC). BIC is computed with the formula below:
\(BIC = k\log(n) - 2\log(L(\theta))\)
Where:
\(n\) - sample size
\(k\) - number of parameters estimated by the model
\(L(\theta)\) - likelihood of the tested model
\(\theta\) - set of all parameters
For this method, the lowest score indicates the optimal number of clusters.
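ClusterR's Optimal_Clusters_GMM can evaluate the BIC over a range of component counts; a sketch, where testing up to 10 components is my own assumption and the distance mode matches the model fitted below:
# Compare BIC for 1..10 Gaussian components
opt_gmm <- Optimal_Clusters_GMM(bfive, max_clusters = 10, criterion = "BIC",
                                dist_mode = "eucl_dist", seed = 1)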
gmm = GMM(bfive, gaussian_comps=4, dist_mode="eucl_dist", seed=1)
GMM_clusters = predict(gmm, newdata=bfive)
Based on the graphs above, the appropriate number of clusters for both K-means and PAM is two. The average silhouette width for each algorithm is around 0.2. K-means seems to perform slightly better, because its silhouette values are mostly positive, so of the two I will only analyze the K-means results. The Gaussian Mixture Model differs somewhat from the aforementioned techniques, as the number of clusters selected with BIC is four. It will certainly be interesting to see what personality types were found with K-means and GMM.
The clusters can be interpreted with the pairs plot from the GGally package.
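A sketch of such a pairs plot, colouring points by K-means cluster membership (the GMM_clusters vector can be substituted to inspect the GMM solution in the same way):
# Pairs plot of the five aggregates, coloured by K-means cluster
bfive_df <- as.data.frame(bfive)
bfive_df$cluster <- factor(kmeans_clust$cluster)
ggpairs(bfive_df, columns = 1:5, mapping = aes(colour = cluster))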
Starting with the two K-means clusters: they differ most in the first two columns, Extroversion and Neuroticism. The first cluster contains extroverted, relaxed people, while the second gathers introverts who are more prone to feelings of stress and anxiety. The scores for the remaining big five traits are roughly equal in both clusters.
Turning to the four GMM clusters: the first seems to include individuals who scored high on extroversion and openness but lower on agreeableness. The second somewhat resembles the first, although its members' conscientiousness scores are higher. The third and fourth clusters also share some similarities, both describing introverts with moderately high openness and conscientiousness; however, neuroticism is usually higher in the third cluster, whereas people in the fourth are much more agreeable.
The purpose of this project was to summarize the scores obtained from the big five personality questionnaire into a small number of personality types. Using the K-means algorithm and a Gaussian Mixture Model, different clusters of personality traits were discovered. It is hard to determine which unsupervised learning method performed better in this case, but I would argue that the two personality types returned by K-means oversimplify reality. That is why the four clusters found with GMM have a slight edge.
Sources:
https://openpsychometrics.org/tests/IPIP-BFFM/
https://www.statisticshowto.com/bayesian-information-criterion/
https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68