Introduction

Throughout our lives, all of us have met people with personalities completely different from our own. Yet at some point we may also have wondered about someone, “Why do they remind me so much of myself?”. It would appear that everyone has kindred spirits living somewhere in the world :). One way to describe this phenomenon is the Big Five model.

From: openpsychometrics.org

The big five personality traits are the best accepted and most commonly used model of personality in academic psychology. If you take a college course in personality psychology, this is what you will learn about. The big five come from the statistical study of responses to personality items. Using a technique called factor analysis, researchers can look at the responses of people to hundreds of personality items and ask the question “what is the best way to summarize an individual?”. This has been done with many samples from all over the world and the general result is that, while there seem to be unlimited personality variables, five stand out from the pack in terms of explaining a lot of a person’s answers to questions about their personality:

  • Extroversion

  • Neuroticism

  • Agreeableness

  • Conscientiousness

  • Openness to experience

In this project I will attempt to find clusters of the big five personality traits using unsupervised learning methods. These clusters can then be interpreted as particular personality types exhibited in the sample.

Data preprocessing

Dataset

The dataset comes from https://www.kaggle.com/datasets/ramontanoeiro/big-five-personality-test-removed-nan-and-0. It consists of answers to a questionnaire collected by Open Psychometrics. There are 50 questions in total, 10 for each of the big five personality traits. Participants were asked to rate how much they agree with each statement on a scale from 1 (strongly disagree) to 5 (strongly agree).

Preprocessing

As the data set is quite large, with 874,366 observations and 51 variables, I drew a random sample of 1,000 observations. I loaded the data with the data.table package, since it is well suited for reading large files.

library("data.table") 
library("factoextra") 
library("cluster")
library("hopkins")
library("GGally")
library("gridExtra")
library("ClusterR")
library("fpc")
library("rmarkdown")


bfive <- fread("bigfive/data-final-clean.csv", header=TRUE, sep=",")
set.seed(1)
# keep a random sample of 1000 respondents and only the first 50 (question) columns
bfive <- bfive[sample(nrow(bfive), 1000), 1:(ncol(bfive) - 1), with=FALSE]

Fifty features may be too many to use directly for clustering, so I decided to sum the item scores into one aggregate per trait, which is also easier to work with computationally. First, though, some columns need to be transformed. Quite a few of the questions belonging to a trait are negatively keyed, meaning a subject ranks them higher when they exhibit the opposite trait. This statement from the questionnaire illustrates the point:

EXT2: I don’t talk a lot.

Obviously, it is unlikely for an extroverted person to describe themselves as quiet. Therefore, the score of every such statement needs to be reversed by subtracting it from six - a 5 becomes a 1, a 4 becomes a 2, and so on.

# Reverse-code the negatively keyed items (6 - score)
reversed <- c("EXT2", "EXT4", "EXT6", "EXT8", "EXT10",
              "EST2", "EST4",
              "AGR2", "AGR4", "AGR6", "AGR8", "AGR9", "AGR10",
              "CSN2", "CSN4", "CSN6", "CSN8",
              "OPN2", "OPN4", "OPN6")
bfive[, (reversed) := lapply(.SD, function(x) 6 - x), .SDcols = reversed]

Now the items can be summed into five new aggregates representing each individual’s score on the five personality traits.

bfive$Extroversion <- rowSums(bfive[,c(1:10)]) 
bfive$Neuroticism <- rowSums(bfive[,c(11:20)])
bfive$Agreeableness <- rowSums(bfive[,c(21:30)]) 
bfive$Conscientiouness <- rowSums(bfive[,c(31:40)]) 
bfive$Openness <- rowSums(bfive[,c(41:50)])

bfive <- bfive[,c(51:55)]

It is not necessary to scale the columns: each aggregate is the sum of ten items scored 1-5, so all of them should lie on the same 10-50 scale. The summary() function will allow me to verify this.

summary(bfive)
##   Extroversion    Neuroticism    Agreeableness   Conscientiouness
##  Min.   :10.00   Min.   :10.00   Min.   :10.00   Min.   :11.00   
##  1st Qu.:23.00   1st Qu.:24.00   1st Qu.:17.00   1st Qu.:29.00   
##  Median :29.00   Median :31.00   Median :21.00   Median :34.00   
##  Mean   :29.47   Mean   :30.66   Mean   :21.99   Mean   :33.73   
##  3rd Qu.:36.00   3rd Qu.:37.00   3rd Qu.:26.00   3rd Qu.:39.00   
##  Max.   :50.00   Max.   :50.00   Max.   :49.00   Max.   :50.00   
##     Openness    
##  Min.   :19.00  
##  1st Qu.:35.00  
##  Median :40.00  
##  Mean   :39.06  
##  3rd Qu.:44.00  
##  Max.   :50.00

No individual scores lower than 10 or higher than 50, so all five aggregates are indeed on the same scale.

Clustering

Measuring the cluster tendency

Before proceeding with clustering, it is recommended to check whether the data shows any clustering tendency at all. The Hopkins statistic is one way of verifying this: values close to 1 indicate that the data is far from uniformly distributed, while values around 0.5 suggest no meaningful cluster structure.

set.seed(1)
hopkins(bfive, m=999)
## [1] 0.9442261

The statistic is well above 0.5, which suggests that the scores have a clear clustering tendency.

Next, it may prove useful to examine the plot of the Ordered Dissimilarity Matrix.
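A plot of this kind can be produced, for example, with the fviz_dist() function from factoextra; the call below is only a sketch of how such a figure could be generated.

# Ordered dissimilarity matrix of the Euclidean distances between respondents
fviz_dist(dist(bfive), show_labels = FALSE)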

There are some distinct color blocks visible in the graph above, so there are grounds to cluster the big five aggregates.

Clustering with K-means, PAM and Gaussian Mixture Model

I will use three clustering algorithms: K-means, Partitioning Around Medoids (PAM) and a Gaussian Mixture Model (GMM). The number of clusters is not known in advance and can be determined with a variety of methods: for K-means and PAM I will compare average silhouette widths, and for GMM I will use the Bayesian Information Criterion (BIC).
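The silhouette comparison across candidate numbers of clusters can be obtained, for example, with fviz_nbclust() from factoextra; the calls below are a sketch, using the function's default range of k.

# Average silhouette width over a range of cluster counts, for K-means and PAM
fviz_nbclust(bfive, kmeans, method = "silhouette")
fviz_nbclust(bfive, pam, method = "silhouette")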

K-means

kmeans_clust <- eclust(bfive, k=2, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
##   cluster size ave.sil.width
## 1       1  489          0.24
## 2       2  511          0.19

PAM

pam_clust <- eclust(bfive, k=2, FUNcluster='pam', hc_metric="euclidean", graph=FALSE)
##   cluster size ave.sil.width
## 1       1  570          0.22
## 2       2  430          0.18

Gaussian Mixture Model

A Gaussian Mixture Model is a probabilistic model that assigns a separate Gaussian distribution to each cluster. The parameters of these distributions are found using the Expectation-Maximization (EM) algorithm. The animation below illustrates how GMM finds clusters:
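Formally, a GMM assumes that every observation is drawn from the mixture density \(p(x)=\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x\mid\mu_k,\Sigma_k)\), where the mixing weights \(\pi_k\) sum to one. EM alternates between computing each point’s membership probabilities for the \(K\) components (E-step) and re-estimating \(\pi_k\), \(\mu_k\) and \(\Sigma_k\) from those probabilities (M-step) until the likelihood stops improving.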

The implementation of GMM in the “ClusterR” package enables finding the optimal number of clusters using the Bayesian Information Criterion (BIC). BIC is computed with the formula below:

\(BIC = k\log(n) - 2\log(L(\theta))\)

Where:

  • \(n\) - sample size

  • \(k\) - number of parameters estimated by the model

  • \(L(\theta)\) - likelihood of the tested model

  • \(\theta\) - set of all parameters

For this criterion, the lowest value indicates the optimal number of clusters.
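ClusterR exposes this comparison through Optimal_Clusters_GMM(); the call below is a sketch in which the maximum number of components tested (10) is an assumption. The component count with the lowest BIC is then used to fit the final model.

# Compare BIC for 1 to 10 Gaussian components (plots the criterion by default)
opt_gmm <- Optimal_Clusters_GMM(bfive, max_clusters = 10, criterion = "BIC",
                                dist_mode = "eucl_dist", seed = 1)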

gmm = GMM(bfive, gaussian_comps=4, dist_mode="eucl_dist", seed=1)
GMM_clusters = predict(gmm, newdata=bfive)

Differences between methods

Based on the graphs above, the appropriate number of clusters for both K-means and PAM is two. The average silhouette width for each algorithm is around 0.2, with K-means performing slightly better (0.24 and 0.19 per cluster versus 0.22 and 0.18 for PAM). Consequently, of the two partitioning methods I will only analyze the results obtained from K-means. The Gaussian Mixture Model differs from the aforementioned techniques, as the number of clusters determined with BIC is four. It will certainly be interesting to see what personality types were found with K-means and GMM.

Analyzing the clusters

The clusters can be interpreted with pairs plots from the GGally package.
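One way such plots might be generated is sketched below with ggpairs(), coloring the scatterplot matrix by cluster assignment; the alpha level and the helper data frames are illustrative choices.

# Pairs plots of the five aggregates, colored by cluster assignment
bfive_km  <- cbind(bfive, Cluster = factor(kmeans_clust$cluster))
bfive_gmm <- cbind(bfive, Cluster = factor(GMM_clusters))
ggpairs(bfive_km,  aes(color = Cluster, alpha = 0.5), columns = 1:5)
ggpairs(bfive_gmm, aes(color = Cluster, alpha = 0.5), columns = 1:5)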

K-means

It looks like the clusters differ the most in the first two columns - Extroversion and Neuroticism. The first cluster includes extroverted, relaxed people, while the second gathers introverts who are more prone to stress and anxiety. The scores for the remaining big five traits are about equal in both clusters.
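This reading can also be checked numerically by comparing per-cluster trait means, for example with the sketch below (the same comparison works for the GMM labels in GMM_clusters).

# Mean aggregate score per trait within each K-means cluster
aggregate(bfive, by = list(Cluster = kmeans_clust$cluster), FUN = mean)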

Gaussian Mixture Model

The first cluster seems to include individuals who scored high on extroversion and openness, but lower on agreeableness. The second cluster somewhat resembles the first, although the conscientiousness score is higher for most of its members. The third and fourth clusters also share some similarities, both describing introverts with moderately high openness and conscientiousness; however, neuroticism is usually higher in the third cluster, whereas people in the fourth are much more agreeable.

Conclusion

The purpose of this project was to generalize the scores obtained from the big five personality questionnaire. Using the K-means algorithm and a Gaussian Mixture Model, different clusters of personality traits were discovered. It is hard to determine which unsupervised learning method performed better in this case, but the two personality types returned by K-means probably oversimplify reality, which is why the four clusters found with GMM have a slight edge.

Sources:

https://openpsychometrics.org/tests/IPIP-BFFM/

https://www.statisticshowto.com/bayesian-information-criterion/

https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68