Figure 1: Big Five Personality Traits

Background Problem

Personality fundamentally shapes how individuals think, feel, and behave, significantly influencing their interactions with technology. With the increasing integration of intelligent systems—such as virtual assistants, personalized platforms, and adaptive interfaces—understanding individual personality differences is crucial for creating meaningful, human-centered experiences (American Psychological Association, 2023). However, many existing systems treat users uniformly, failing to capture the nuances in psychological profiles that could lead to more intuitive and engaging interactions.

Recent research emphasizes the growing importance of incorporating personality traits into the design of information systems to better align with user expectations and behavior. Lopatovska & Arapakis (2020) highlighted how personality-based models can improve personalization in search engines and conversational agents, arguing that user satisfaction often hinges on how well systems adapt to individual psychological profiles. This insight aligns with the human-centered design philosophy, emphasizing that emotional and cognitive aspects of users should play a central role in shaping interactive technologies.

Rather than relying on predefined personality labels, this study explores latent patterns within responses to a Big Five personality questionnaire. By applying unsupervised learning methods, we aim to uncover natural groupings in personality traits—allowing us to identify underlying psychological structures without prior assumptions. These clusters may reflect distinct behavioral styles, cognitive preferences, and communication tendencies that have direct implications for how individuals interact with technology, offering valuable insights for personalization in the design of user interfaces, the behavior of virtual agents, and the tailoring of digital content.

Objectives

This study aims to apply unsupervised learning to Big Five personality trait data to discover meaningful patterns and user groupings that can ultimately inform the development of more human-centered intelligent systems:

Apply k-Means clustering to uncover groups of individuals with similar personality profiles based on questionnaire responses, with the optimal number of clusters determined using the Elbow method. The identified clusters will represent distinct personality segments within the user population, potentially highlighting different preferences and interaction styles with technology.
Analyze the characteristics of each cluster in terms of potential behavioral patterns, communication styles, and implications for system personalization in human-computer interaction. This includes understanding if certain personality profiles are more likely to:
- Prefer specific interface designs (e.g., minimalist vs. feature-rich).
- Engage with virtual assistants in particular ways (e.g., direct vs. conversational).
- Respond to different types of feedback or error messages.
- Benefit from tailored levels of proactivity or guidance from a system.

Dataset Description

This study leverages a large-scale dataset collected between 2016 and 2018 through an interactive online personality test, hosted on Open Psychometrics. The questionnaire is based on the Big Five Factor Markers from the International Personality Item Pool (IPIP), covering five key personality dimensions:

Extraversion (EXT)
Neuroticism (Emotional Stability - EST)
Agreeableness (AGR)
Conscientiousness (CSN)
Openness to Experience (OPN)

Each personality trait is measured by 10 items, resulting in a total of 50 statements, each rated on a 5-point Likert scale: 1 = Disagree, 3 = Neutral, 5 = Agree.

The questions were shown in a fixed interleaved order (e.g., EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc.) to ensure balanced response behavior.


Item Code	Item Description
EXT1	I am the life of the party.
EXT2	I don’t talk a lot.
EXT3	I feel comfortable around people.
EXT4	I keep in the background.
EXT5	I start conversations.
EXT6	I have little to say.
EXT7	I talk to a lot of different people at parties.
EXT8	I don’t like to draw attention to myself.
EXT9	I don’t mind being the center of attention.
EXT10	I am quiet around strangers.
EST1	I get stressed out easily.
EST2	I am relaxed most of the time.
EST3	I worry about things.
EST4	I seldom feel blue.
EST5	I am easily disturbed.
EST6	I get upset easily.
EST7	I change my mood a lot.
EST8	I have frequent mood swings.
EST9	I get irritated easily.
EST10	I often feel blue.
AGR1	I feel little concern for others.
AGR2	I am interested in people.
AGR3	I insult people.
AGR4	I sympathize with others’ feelings.
AGR5	I am not interested in other people’s problems.
AGR6	I have a soft heart.
AGR7	I am not really interested in others.
AGR8	I take time out for others.
AGR9	I feel others’ emotions.
AGR10	I make people feel at ease.
CSN1	I am always prepared.
CSN2	I leave my belongings around.
CSN3	I pay attention to details.
CSN4	I make a mess of things.
CSN5	I get chores done right away.
CSN6	I often forget to put things back in their proper place.
CSN7	I like order.
CSN8	I shirk my duties.
CSN9	I follow a schedule.
CSN10	I am exacting in my work.
OPN1	I have a rich vocabulary.
OPN2	I have difficulty understanding abstract ideas.
OPN3	I have a vivid imagination.
OPN4	I am not interested in abstract ideas.
OPN5	I have excellent ideas.
OPN6	I do not have a good imagination.
OPN7	I am quick to understand things.
OPN8	I use difficult words.
OPN9	I spend time reflecting on things.
OPN10	I am full of ideas.

Key Features

Response Variables
Each item is labeled with a trait code and item number (e.g., EXT1, AGR5, OPN10) corresponding to a specific statement.
Timing Variables
For each question, an additional variable ending in _E captures response time in milliseconds, measuring how long the participant took to answer that item.
Session Metadata
- dateload: Timestamp when the test began
- introelapse, testelapse, endelapse: Time spent (in seconds) on the introduction page, survey page, and finalization page, respectively
- IPC: Number of responses from the same IP address (used to filter out multiple submissions)
- country: Automatically inferred from network metadata
- screenw, screenh: Screen dimensions (can relate to device context)
- lat_appx_lots_of_err, long_appx_lots_of_err: Approximate location (not highly accurate)

For data quality, only records with IPC = 1 were used to ensure uniqueness of submissions.

Methodology

Import Libraries

suppressPackageStartupMessages({
  library(readr)
  library(knitr)
  library(tidyverse)
  library(dplyr)
  library(FactoMineR)
  library(factoextra)
  library(ggplot2)
  library(stringr)
  library(scales)
  library(cluster)
  library(ClusterR)
  library(GGally)
})

Load dataset

# Read the data (the file has a .csv extension but is tab-separated)
df <- read.delim("C:/Users/LENOVO/big-five-personality-test/IPIP-FFM-data-8Nov2018/data-final.csv", sep = "\t", stringsAsFactors = FALSE)

kable(head(df))

EXT1	EXT2	EXT3	EXT4	EXT5	EXT6	EXT7	EXT8	EXT9	EXT10	EST1	EST2	EST3	EST4	EST5	EST6	EST7	EST8	EST9	EST10	AGR1	AGR2	AGR3	AGR4	AGR5	AGR6	AGR7	AGR8	AGR9	AGR10	CSN1	CSN2	CSN3	CSN4	CSN5	CSN6	CSN7	CSN8	CSN9	CSN10	OPN1	OPN2	OPN3	OPN4	OPN5	OPN6	OPN7	OPN8	OPN9	OPN10	EXT1_E	EXT2_E	EXT3_E	EXT4_E	EXT5_E	EXT6_E	EXT7_E	EXT8_E	EXT9_E	EXT10_E	EST1_E	EST2_E	EST3_E	EST4_E	EST5_E	EST6_E	EST7_E	EST8_E	EST9_E	EST10_E	AGR1_E	AGR2_E	AGR3_E	AGR4_E	AGR5_E	AGR6_E	AGR7_E	AGR8_E	AGR9_E	AGR10_E	CSN1_E	CSN2_E	CSN3_E	CSN4_E	CSN5_E	CSN6_E	CSN7_E	CSN8_E	CSN9_E	CSN10_E	OPN1_E	OPN2_E	OPN3_E	OPN4_E	OPN5_E	OPN6_E	OPN7_E	OPN8_E	OPN9_E	OPN10_E	dateload	screenw	screenh	introelapse	testelapse	endelapse	IPC	country	lat_appx_lots_of_err	long_appx_lots_of_err
4	1	5	2	5	1	5	2	4	1	1	4	4	2	2	2	2	2	3	2	2	5	2	4	2	3	2	4	3	4	3	4	3	2	2	4	4	2	4	4	5	1	4	1	4	1	5	3	4	5	9419	5491	3959	4821	5611	2756	2388	2113	5900	4110	6135	4150	5739	6364	3663	5070	5709	4285	2587	3997	4750	5475	11641	3115	3207	3260	10235	5897	1758	3081	6602	5457	1569	2129	3762	4420	9382	5286	4983	6339	3146	4067	2959	3411	2170	4920	4436	3116	2992	4354	2016-03-03 02:01:01	768	1024	9	234	6	1	GB	51.5448	0.1991
3	5	3	4	3	3	2	5	1	5	2	3	4	1	3	1	2	1	3	1	1	4	1	5	1	5	3	4	5	3	3	2	5	3	3	1	3	3	5	3	1	2	4	2	3	1	4	2	5	3	7235	3598	3315	2564	2976	3050	4787	3228	3465	3309	9036	2406	3484	3359	3061	2539	4226	2962	1799	1607	2158	2090	2143	2807	3422	5324	4494	3627	1850	1747	5163	5240	7208	2783	4103	3431	3347	2399	3360	5595	2624	4985	1684	3026	4742	3336	2718	3374	3096	3019	2016-03-03 02:01:20	1360	768	12	179	11	1	MY	3.1698	101.706
2	3	4	4	3	2	1	3	2	5	4	4	4	2	2	2	2	2	1	3	1	4	1	4	2	4	1	4	4	3	4	2	2	2	3	3	4	2	4	2	5	1	2	1	4	2	5	3	4	4	4657	3549	2543	3335	5847	2540	4922	3142	14621	2191	5128	3675	3442	4546	8275	2185	2164	1175	3813	1593	1089	2203	3386	1464	2562	1493	3067	13719	3892	4100	4286	4775	2713	2813	4237	6308	2690	1516	2379	2983	1930	1470	1644	1683	2229	8114	2043	6295	1585	2529	2016-03-03 02:01:56	1366	768	3	186	7	1	GB	54.9119	-1.3833
2	2	2	3	4	2	2	4	1	4	3	3	3	2	3	2	2	2	4	3	2	4	3	4	2	4	2	4	3	4	2	4	4	4	1	2	2	3	1	4	4	2	5	2	3	1	4	4	3	3	3996	2896	5096	4240	5168	5456	4360	4496	5240	4000	3736	4616	3015	2711	3960	4064	4208	2936	7336	3896	6062	11952	1040	2264	3664	3049	4912	7545	4632	6896	2824	520	2368	3225	2848	6264	3760	10472	3192	7704	3456	6665	1977	3728	4128	3776	2984	4192	3480	3257	2016-03-03 02:02:02	1920	1200	186	219	7	1	GB	51.75	-1.25
3	3	3	3	5	3	3	5	3	4	1	5	5	3	1	1	1	1	3	2	1	5	1	5	1	3	1	5	5	3	5	1	5	1	3	1	5	1	5	5	5	1	5	1	5	1	5	3	5	5	6004	3965	2721	3706	2968	2426	7339	3302	16819	3731	4740	2856	7461	2179	3324	2255	4308	4506	3127	3115	6771	2819	3682	2511	16204	1736	28983	1612	2437	4532	3843	7019	3102	3153	2869	6550	1811	3682	21500	20587	8458	3510	17042	7029	2327	5835	6846	5320	11401	8642	2016-03-03 02:02:57	1366	768	8	315	17	2	KE	1.0	38.0
3	3	4	2	4	2	2	3	3	4	3	4	3	2	2	1	2	1	2	2	2	3	1	4	2	3	2	3	4	4	3	2	4	1	3	2	4	3	4	3	5	1	5	1	3	1	5	4	5	2	4834	5064	1160	2664	6711	3344	2512	6264	6992	4592	2808	1776	3280	4520	2640	5408	3647	3183	1575	672	6375	4727	3775	1647	1233	8694	2904	2152	2856	2848	4288	4360	7328	3976	7895	2640	1760	5720	9032	3928	2104	5488	3656	4352	2681	3272	2640	1568	1640	3192	2016-03-03 02:03:12	1600	1000	4	196	3	1	SE	59.3333	18.05

Following the recommendations that accompany the original dataset, we preprocess the data before using it for modelling.

Data Preprocessing

First, we keep only records whose IP address submitted exactly one response (IPC = 1), for maximum cleanliness.

# Keep only unique entries
df <- df %>% filter(IPC == 1)

Because the item responses are still stored as character (chr) values, we convert the columns holding the Big Five responses to numeric.

# Convert 50 personality items to numeric
trait_cols <- c(paste0("EXT", 1:10), paste0("EST", 1:10),
                paste0("AGR", 1:10), paste0("CSN", 1:10), paste0("OPN", 1:10))

# Keep only rows where every item holds a valid response code (1-5), then convert to numeric
df_filtered <- df %>%
  filter(if_all(all_of(trait_cols), ~ . %in% c("1", "2", "3", "4", "5")))

df_clean <- df_filtered %>%
  mutate(across(all_of(trait_cols), ~ as.numeric(.)))

Next, drop the variables that are unrelated to the Big Five items and their responses.

# Drop session metadata that is unrelated to the Big Five items and responses
df_clean <- df_clean %>%
  dplyr::select(-IPC, -dateload, -country, -lat_appx_lots_of_err, -long_appx_lots_of_err, -screenw, -screenh, -introelapse, -testelapse, -endelapse)

As in the previous step, convert any remaining columns that are stored as chr but actually contain numeric values.

# Convert only character columns that look numeric
df_clean <- df_clean %>%
  mutate(across(
    where(~ is.character(.) && all(grepl("^\\d+(\\.\\d+)?$", .[!is.na(.)]))),
    ~ as.numeric(.)
  ))

# Identify character columns that are mostly numeric
char_cols <- names(df_clean)[sapply(df_clean, is.character)]

# Try converting them, keeping only if >90% successfully convert
for (col in char_cols) {
  temp <- suppressWarnings(as.numeric(df_clean[[col]]))
  success_rate <- mean(!is.na(temp))

  if (success_rate > 0.9) {
    df_clean[[col]] <- temp  # overwrite with numeric version
  }
}

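With the data cleaned, inspect the summary below. Note that the timing columns (ending in _E) contain implausible values, including negative minima and very large maxima; these columns are not used in the clustering.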

kable(summary(df_clean))

EXT1	EXT2	EXT3	EXT4	EXT5	EXT6	EXT7	EXT8	EXT9	EXT10	EST1	EST2	EST3	EST4	EST5	EST6	EST7	EST8	EST9	EST10	AGR1	AGR2	AGR3	AGR4	AGR5	AGR6	AGR7	AGR8	AGR9	AGR10	CSN1	CSN2	CSN3	CSN4	CSN5	CSN6	CSN7	CSN8	CSN9	CSN10	OPN1	OPN2	OPN3	OPN4	OPN5	OPN6	OPN7	OPN8	OPN9	OPN10	EXT1_E	EXT2_E	EXT3_E	EXT4_E	EXT5_E	EXT6_E	EXT7_E	EXT8_E	EXT9_E	EXT10_E	EST1_E	EST2_E	EST3_E	EST4_E	EST5_E	EST6_E	EST7_E	EST8_E	EST9_E	EST10_E	AGR1_E	AGR2_E	AGR3_E	AGR4_E	AGR5_E	AGR6_E	AGR7_E	AGR8_E	AGR9_E	AGR10_E	CSN1_E	CSN2_E	CSN3_E	CSN4_E	CSN5_E	CSN6_E	CSN7_E	CSN8_E	CSN9_E	CSN10_E	OPN1_E	OPN2_E	OPN3_E	OPN4_E	OPN5_E	OPN6_E	OPN7_E	OPN8_E	OPN9_E	OPN10_E
Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.00	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.00	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.00	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. :1.000	Min. : -42958762	Min. : -75632	Min. : -3593866	Min. :-2494907	Min. : -58566	Min. : -79860	Min. : -89610	Min. : -461138	Min. : -11382	Min. : -142238	Min. : -112165	Min. : -71572	Min. : -24118	Min. : -3598047	Min. : -88286	Min. : -81895	Min. :-2187273	Min. : -92455	Min. :-79175662	Min. : -43558	Min. : -48675	Min. : -3592606	Min. : -1795552	Min. : -67786	Min. : -20294	Min. : -21938	Min. : -65423	Min. : -27029	Min. : -527846	Min. : -85674	Min. : -3590638	Min. :-35996486	Min. : -15375	Min. : -11993	Min. : -3512740	Min. : -74245	Min. : -30016	Min. : -177880	Min. : -29167	Min. : -14988	Min. :-53927742	Min. : -215205	Min. : -417031	Min. : -74467	Min. : -75300	Min. : -509916	Min. : -51694	Min. : -17007	Min. : -95986	Min. :-3594871
1st Qu.:1.000	1st Qu.:2.000	1st Qu.:2.000	1st Qu.:2.000	1st Qu.:2.000	1st Qu.:1.000	1st Qu.:1.000	1st Qu.:3.000	1st Qu.:2.000	1st Qu.:3.000	1st Qu.:2.000	1st Qu.:2.000	1st Qu.:3.000	1st Qu.:2.000	1st Qu.:2.000	1st Qu.:2.00	1st Qu.:2.000	1st Qu.:2.000	1st Qu.:2.000	1st Qu.:2.000	1st Qu.:1.000	1st Qu.:3.000	1st Qu.:1.000	1st Qu.:3.000	1st Qu.:1.000	1st Qu.:3.000	1st Qu.:1.000	1st Qu.:3.000	1st Qu.:3.000	1st Qu.:3.000	1st Qu.:3.00	1st Qu.:2.000	1st Qu.:4.000	1st Qu.:2.000	1st Qu.:2.000	1st Qu.:2.00	1st Qu.:3.000	1st Qu.:2.000	1st Qu.:2.000	1st Qu.:3.000	1st Qu.:3.000	1st Qu.:1.000	1st Qu.:3.000	1st Qu.:1.000	1st Qu.:3.000	1st Qu.:1.000	1st Qu.:4.000	1st Qu.:2.000	1st Qu.:4.000	1st Qu.:3.000	1st Qu.: 4834	1st Qu.: 2407	1st Qu.: 2503	1st Qu.: 2422	1st Qu.: 2166	1st Qu.: 2208	1st Qu.: 3077	1st Qu.: 2541	1st Qu.: 2610	1st Qu.: 2268	1st Qu.: 2242	1st Qu.: 2515	1st Qu.: 1945	1st Qu.: 2480	1st Qu.: 2487	1st Qu.: 2233	1st Qu.: 2224	1st Qu.: 2086	1st Qu.: 1954	1st Qu.: 1775	1st Qu.: 2926	1st Qu.: 2272	1st Qu.: 2248	1st Qu.: 2216	1st Qu.: 2937	1st Qu.: 2040	1st Qu.: 2667	1st Qu.: 2711	1st Qu.: 2235	1st Qu.: 2409	1st Qu.: 2439	1st Qu.: 3000	1st Qu.: 2283	1st Qu.: 2360	1st Qu.: 2549	1st Qu.: 3117	1st Qu.: 2067	1st Qu.: 2480	1st Qu.: 2055	1st Qu.: 2690	1st Qu.: 2088	1st Qu.: 3057	1st Qu.: 1876	1st Qu.: 2685	1st Qu.: 2001	1st Qu.: 2375	1st Qu.: 2294	1st Qu.: 2166	1st Qu.: 2344	1st Qu.: 1491
Median :3.000	Median :3.000	Median :3.000	Median :3.000	Median :3.000	Median :2.000	Median :3.000	Median :4.000	Median :3.000	Median :4.000	Median :4.000	Median :3.000	Median :4.000	Median :3.000	Median :3.000	Median :3.00	Median :3.000	Median :3.000	Median :3.000	Median :3.000	Median :2.000	Median :4.000	Median :2.000	Median :4.000	Median :2.000	Median :4.000	Median :2.000	Median :4.000	Median :4.000	Median :4.000	Median :3.00	Median :3.000	Median :4.000	Median :3.000	Median :2.000	Median :3.00	Median :4.000	Median :2.000	Median :3.000	Median :4.000	Median :4.000	Median :2.000	Median :4.000	Median :2.000	Median :4.000	Median :2.000	Median :4.000	Median :3.000	Median :4.000	Median :4.000	Median : 7315	Median : 3427	Median : 3517	Median : 3471	Median : 3040	Median : 3128	Median : 4340	Median : 3608	Median : 3651	Median : 3216	Median : 3311	Median : 3614	Median : 2787	Median : 3585	Median : 3509	Median : 3191	Median : 3182	Median : 2932	Median : 2798	Median : 2566	Median : 4369	Median : 3268	Median : 3169	Median : 3181	Median : 4062	Median : 2892	Median : 3686	Median : 3854	Median : 3138	Median : 3339	Median : 3584	Median : 4269	Median : 3200	Median : 3339	Median : 3586	Median : 4352	Median : 2916	Median : 3745	Median : 2923	Median : 3934	Median : 3020	Median : 4216	Median : 2739	Median : 3703	Median : 2841	Median : 3319	Median : 3199	Median : 3057	Median : 3252	Median : 2190
Mean :2.577	Mean :2.848	Mean :3.231	Mean :3.219	Mean :3.249	Mean :2.424	Mean :2.711	Mean :3.469	Mean :2.954	Mean :3.622	Mean :3.311	Mean :3.171	Mean :3.879	Mean :2.659	Mean :2.857	Mean :2.87	Mean :3.065	Mean :2.702	Mean :3.102	Mean :2.858	Mean :2.236	Mean :3.854	Mean :2.268	Mean :3.947	Mean :2.303	Mean :3.758	Mean :2.235	Mean :3.684	Mean :3.792	Mean :3.597	Mean :3.32	Mean :2.997	Mean :4.006	Mean :2.654	Mean :2.584	Mean :2.86	Mean :3.731	Mean :2.492	Mean :3.159	Mean :3.629	Mean :3.777	Mean :2.021	Mean :4.066	Mean :1.952	Mean :3.834	Mean :1.878	Mean :4.059	Mean :3.284	Mean :4.221	Mean :3.998	Mean : 101700	Mean : 8421	Mean : 9273	Mean : 7762	Mean : 7746	Mean : 5871	Mean : 7702	Mean : 6929	Mean : 5690	Mean : 5258	Mean : 8348	Mean : 8328	Mean : 6616	Mean : 10769	Mean : 7047	Mean : 8318	Mean : 6224	Mean : 5258	Mean : 4995	Mean : 4590	Mean : 18296	Mean : 9139	Mean : 6572	Mean : 8087	Mean : 7966	Mean : 5417	Mean : 7601	Mean : 9732	Mean : 5107	Mean : 5660	Mean : 12627	Mean : 9918	Mean : 8581	Mean : 7655	Mean : 9714	Mean : 9841	Mean : 5116	Mean : 10370	Mean : 5109	Mean : 8926	Mean : 8892	Mean : 12898	Mean : 6580	Mean : 8246	Mean : 6112	Mean : 7146	Mean : 7564	Mean : 4849	Mean : 5661	Mean : 4242
3rd Qu.:3.000	3rd Qu.:4.000	3rd Qu.:4.000	3rd Qu.:4.000	3rd Qu.:4.000	3rd Qu.:3.000	3rd Qu.:4.000	3rd Qu.:5.000	3rd Qu.:4.000	3rd Qu.:5.000	3rd Qu.:4.000	3rd Qu.:4.000	3rd Qu.:5.000	3rd Qu.:4.000	3rd Qu.:4.000	3rd Qu.:4.00	3rd Qu.:4.000	3rd Qu.:4.000	3rd Qu.:4.000	3rd Qu.:4.000	3rd Qu.:3.000	3rd Qu.:5.000	3rd Qu.:3.000	3rd Qu.:5.000	3rd Qu.:3.000	3rd Qu.:5.000	3rd Qu.:3.000	3rd Qu.:4.000	3rd Qu.:5.000	3rd Qu.:4.000	3rd Qu.:4.00	3rd Qu.:4.000	3rd Qu.:5.000	3rd Qu.:4.000	3rd Qu.:4.000	3rd Qu.:4.00	3rd Qu.:5.000	3rd Qu.:3.000	3rd Qu.:4.000	3rd Qu.:4.000	3rd Qu.:5.000	3rd Qu.:3.000	3rd Qu.:5.000	3rd Qu.:3.000	3rd Qu.:5.000	3rd Qu.:2.000	3rd Qu.:5.000	3rd Qu.:4.000	3rd Qu.:5.000	3rd Qu.:5.000	3rd Qu.: 11886	3rd Qu.: 5057	3rd Qu.: 5156	3rd Qu.: 5174	3rd Qu.: 4509	3rd Qu.: 4605	3rd Qu.: 6229	3rd Qu.: 5335	3rd Qu.: 5332	3rd Qu.: 4681	3rd Qu.: 5106	3rd Qu.: 5387	3rd Qu.: 4195	3rd Qu.: 5604	3rd Qu.: 5173	3rd Qu.: 4817	3rd Qu.: 4727	3rd Qu.: 4396	3rd Qu.: 4206	3rd Qu.: 3932	3rd Qu.: 6822	3rd Qu.: 4919	3rd Qu.: 4716	3rd Qu.: 4828	3rd Qu.: 5864	3rd Qu.: 4347	3rd Qu.: 5324	3rd Qu.: 5686	3rd Qu.: 4632	3rd Qu.: 4868	3rd Qu.: 5518	3rd Qu.: 6250	3rd Qu.: 4715	3rd Qu.: 4921	3rd Qu.: 5388	3rd Qu.: 6193	3rd Qu.: 4317	3rd Qu.: 6919	3rd Qu.: 4314	3rd Qu.: 6059	3rd Qu.: 4520	3rd Qu.: 6097	3rd Qu.: 4212	3rd Qu.: 5402	3rd Qu.: 4240	3rd Qu.: 4872	3rd Qu.: 4668	3rd Qu.: 4446	3rd Qu.: 4710	3rd Qu.: 3317
Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.00	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.00	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.00	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :2147483647	Max. :261773449	Max. :605905746	Max. :99674242	Max. :351067965	Max. :166382065	Max. :85145915	Max. :247706204	Max. :88154250	Max. :150252063	Max. :240303883	Max. :184071718	Max. :139259333	Max. :880042878	Max. :96737602	Max. :346412854	Max. :87632813	Max. :31299296	Max. :183826938	Max. :81790376	Max. :1170859453	Max. :473898335	Max. :130124433	Max. :329878192	Max. :134804667	Max. :97539002	Max. :251861470	Max. :1367497215	Max. :62757478	Max. :81582425	Max. :772659169	Max. :169887206	Max. :1100334734	Max. :140148953	Max. :958623288	Max. :443209652	Max. :84828106	Max. :177082189	Max. :87497885	Max. :338015826	Max. :675047026	Max. :1026125615	Max. :124483678	Max. :201571919	Max. :162680825	Max. :243586621	Max. :389143415	Max. :46761765	Max. :113808739	Max. :90484840

Also check how many missing (NA) values remain in the dataset.

kable(colSums(is.na(df_clean)))

	x
EXT1	0
EXT2	0
EXT3	0
EXT4	0
EXT5	0
EXT6	0
EXT7	0
EXT8	0
EXT9	0
EXT10	0
EST1	0
EST2	0
EST3	0
EST4	0
EST5	0
EST6	0
EST7	0
EST8	0
EST9	0
EST10	0
AGR1	0
AGR2	0
AGR3	0
AGR4	0
AGR5	0
AGR6	0
AGR7	0
AGR8	0
AGR9	0
AGR10	0
CSN1	0
CSN2	0
CSN3	0
CSN4	0
CSN5	0
CSN6	0
CSN7	0
CSN8	0
CSN9	0
CSN10	0
OPN1	0
OPN2	0
OPN3	0
OPN4	0
OPN5	0
OPN6	0
OPN7	0
OPN8	0
OPN9	0
OPN10	0
EXT1_E	0
EXT2_E	0
EXT3_E	0
EXT4_E	0
EXT5_E	0
EXT6_E	0
EXT7_E	0
EXT8_E	0
EXT9_E	0
EXT10_E	0
EST1_E	0
EST2_E	0
EST3_E	0
EST4_E	0
EST5_E	0
EST6_E	0
EST7_E	0
EST8_E	0
EST9_E	0
EST10_E	0
AGR1_E	0
AGR2_E	0
AGR3_E	0
AGR4_E	0
AGR5_E	0
AGR6_E	0
AGR7_E	0
AGR8_E	0
AGR9_E	0
AGR10_E	0
CSN1_E	0
CSN2_E	0
CSN3_E	0
CSN4_E	0
CSN5_E	0
CSN6_E	0
CSN7_E	0
CSN8_E	0
CSN9_E	0
CSN10_E	0
OPN1_E	0
OPN2_E	0
OPN3_E	0
OPN4_E	0
OPN5_E	0
OPN6_E	0
OPN7_E	0
OPN8_E	0
OPN9_E	0
OPN10_E	0

Next, select only the columns that contain Big Five responses for the clustering model.

# Take the OCEAN item columns, excluding the timing columns that end with _E
ocean_cols <- grep("^(EXT|EST|AGR|CSN|OPN)", colnames(df_clean), value = TRUE)
ocean_cols <- ocean_cols[!grepl("_E$", ocean_cols)]

Before moving on to modelling, we need to make sure that the data contain no duplicate response patterns.

# Identify duplicate rows based on the columns in ocean_cols
duplicate_rows <- duplicated(df_clean[, ocean_cols])

# Count the number of duplicate rows
num_duplicate_rows <- sum(duplicate_rows)
cat("Number of duplicate rows (based on OCEAN columns):", num_duplicate_rows, "\n")

## Number of duplicate rows (based on OCEAN columns): 854

if (num_duplicate_rows > 0) {
  # Remove duplicate rows
  df_clean <- df_clean[!duplicate_rows, ]
  cat("Duplicate rows removed. Updated number of rows in df_clean:", nrow(df_clean), "\n")
} else {
  cat("No duplicate rows found (based on OCEAN columns).\n")
}

## Duplicate rows removed. Updated number of rows in df_clean: 602468

To see how the responses vary for each item (OCEAN), we plot them as histograms and box plots.

# Reshape to long format
df_long <- reshape2::melt(df_clean[, ocean_cols])

# Make sure no NA values remain and that all values are numeric
df_long <- na.omit(df_long)
df_long$value <- as.numeric(df_long$value)

ggplot(df_long, aes(x = value)) +
  geom_histogram(binwidth = 0.2, fill = "skyblue", color = "black") +
  facet_wrap(~ variable, scales = "free_y", nrow = 10) +
  labs(title = "Score Distribution for Each Dimension", 
       x = "Score", 
       y = "Frequency") +
  theme_bw()

Figure 2: Score Distribution for Each Dimension of Personality Trait

ggplot(df_long, aes(x = variable, y = value)) +
  geom_boxplot(fill = "lightcoral", color = "black") +
  labs(title = "Score Distribution for Each Dimension (Box Plot)",
       x = "Dimension",
       y = "Score") +
  theme_bw() +
  coord_flip()

Figure 3: Score Distribution for Each Dimension of Personality Trait in Box Plot

The box plots flag some responses as outliers in items related to Agreeableness (AGR), Conscientiousness (CSN), and Openness to Experience (OPN). Because k-Means clustering is sensitive to outliers, we need a strategy for handling them; one option is to normalize the variables.
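As a minimal sketch of the normalization mentioned above, the snippet below standardizes two toy Likert-style items with base R's scale(). The data frame `toy` and its column names are hypothetical and are not part of the study's dataset. Standardizing puts every item on a common scale (mean 0, standard deviation 1); it does not remove unusual responses, it only expresses them in standard-deviation units.

```r
# Toy Likert-style responses (hypothetical data, not the study's dataset)
set.seed(1)
toy <- data.frame(
  item_a = sample(1:5, 200, replace = TRUE),
  item_b = sample(1:5, 200, replace = TRUE,
                  prob = c(0.05, 0.05, 0.10, 0.30, 0.50))
)

# scale() centers each column to mean 0 and rescales it to standard deviation 1
toy_scaled <- as.data.frame(scale(toy))

# After scaling, each item has (numerically) zero mean and unit standard deviation
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, sd)

# Count responses lying more than 2 standard deviations from the item mean
colSums(abs(toy_scaled) > 2)
```

In this report the same kind of standardization is applied inside the PCA step through the center and scale. arguments of prcomp().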

summary_stats <- df_long %>%
  group_by(variable) %>%
  summarize(
    mean_score = mean(value, na.rm = TRUE),
    median_score = median(value, na.rm = TRUE),
    sd_score = sd(value, na.rm = TRUE),
    min_score = min(value, na.rm = TRUE),
    max_score = max(value, na.rm = TRUE),
    n_observations = n()
  )

kable(summary_stats)

variable	mean_score	median_score	sd_score	min_score	max_score	n_observations
EXT1	2.576099	3	1.2374681	1	5	602468
EXT2	2.847959	3	1.3057964	1	5	602468
EXT3	3.230769	3	1.1912989	1	5	602468
EXT4	3.219635	3	1.2047707	1	5	602468
EXT5	3.248511	3	1.2465256	1	5	602468
EXT6	2.423392	2	1.2139971	1	5	602468
EXT7	2.710200	3	1.3698189	1	5	602468
EXT8	3.470126	4	1.2390399	1	5	602468
EXT9	2.953940	3	1.3246703	1	5	602468
EXT10	3.622802	4	1.2633410	1	5	602468
EST1	3.311188	4	1.3117901	1	5	602468
EST2	3.171010	3	1.1912015	1	5	602468
EST3	3.880111	4	1.1217407	1	5	602468
EST4	2.658513	3	1.2248307	1	5	602468
EST5	2.857219	3	1.2537206	1	5	602468
EST6	2.869767	3	1.2935461	1	5	602468
EST7	3.064667	3	1.2649211	1	5	602468
EST8	2.701616	3	1.3213905	1	5	602468
EST9	3.102203	3	1.2677605	1	5	602468
EST10	2.857597	3	1.3053596	1	5	602468
AGR1	2.234690	2	1.3043405	1	5	602468
AGR2	3.854976	4	1.0869058	1	5	602468
AGR3	2.267007	2	1.2641696	1	5	602468
AGR4	3.947790	4	1.0799717	1	5	602468
AGR5	2.302340	2	1.1560334	1	5	602468
AGR6	3.758311	4	1.1663487	1	5	602468
AGR7	2.234258	2	1.1129761	1	5	602468
AGR8	3.684737	4	1.0466293	1	5	602468
AGR9	3.792691	4	1.1401198	1	5	602468
AGR10	3.597522	4	1.0354886	1	5	602468
CSN1	3.320069	3	1.1241311	1	5	602468
CSN2	2.996669	3	1.3710137	1	5	602468
CSN3	4.006905	4	0.9966201	1	5	602468
CSN4	2.653452	3	1.2350004	1	5	602468
CSN5	2.582891	2	1.2417003	1	5	602468
CSN6	2.859699	3	1.4031575	1	5	602468
CSN7	3.731396	4	1.0734690	1	5	602468
CSN8	2.491583	2	1.1258857	1	5	602468
CSN9	3.159303	3	1.2501190	1	5	602468
CSN10	3.629258	4	0.9979830	1	5	602468
OPN1	3.777719	4	1.0741179	1	5	602468
OPN2	2.020036	2	1.0826338	1	5	602468
OPN3	4.067638	4	1.0259751	1	5	602468
OPN4	1.950030	2	1.0575048	1	5	602468
OPN5	3.835307	4	0.9289385	1	5	602468
OPN6	1.876481	2	1.0715174	1	5	602468
OPN7	4.060262	4	0.9180440	1	5	602468
OPN8	3.284412	3	1.2123321	1	5	602468
OPN9	4.222360	4	0.9378580	1	5	602468
OPN10	3.998694	4	0.9813749	1	5	602468

Principal Component Analysis (PCA)

In this step, we use principal component analysis (PCA) to reduce the dimensionality of the input data while retaining as much of its variation as possible.

# PCA (use scaling)
pca_result <- prcomp(df_clean[, ocean_cols], center = TRUE, scale. = TRUE)

# Scree plot
screeplot(pca_result, type = "lines", main = "Scree Plot PCA")

Figure 4: Scree Plot of Principal Component Analysis

# Variance explained
kable(summary(pca_result)$importance)

	PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8	PC9	PC10	PC11	PC12	PC13	PC14	PC15	PC16	PC17	PC18	PC19	PC20	PC21	PC22	PC23	PC24	PC25	PC26	PC27	PC28	PC29	PC30	PC31	PC32	PC33	PC34	PC35	PC36	PC37	PC38	PC39	PC40	PC41	PC42	PC43	PC44	PC45	PC46	PC47	PC48	PC49	PC50
Standard deviation	2.777921	2.247946	2.001303	1.89383	1.702017	1.198502	1.136224	1.01711	0.9866744	0.9615661	0.9448665	0.9220993	0.9110339	0.9008106	0.8827829	0.8606061	0.8554104	0.839683	0.8267318	0.8081084	0.8021401	0.7972646	0.779069	0.7732502	0.7641308	0.7483038	0.7313321	0.7246511	0.7206425	0.7054458	0.7004669	0.695075	0.6913724	0.6659864	0.6554384	0.6472175	0.6461166	0.6384365	0.6298325	0.6265253	0.6181612	0.6036435	0.6025689	0.5981394	0.5970718	0.583583	0.5730759	0.5691529	0.5529994	0.4669083
Proportion of Variance	0.154340	0.101070	0.080100	0.07173	0.057940	0.028730	0.025820	0.02069	0.0194700	0.0184900	0.0178600	0.0170100	0.0166000	0.0162300	0.0155900	0.0148100	0.0146300	0.014100	0.0136700	0.0130600	0.0128700	0.0127100	0.012140	0.0119600	0.0116800	0.0112000	0.0107000	0.0105000	0.0103900	0.0099500	0.0098100	0.009660	0.0095600	0.0088700	0.0085900	0.0083800	0.0083500	0.0081500	0.0079300	0.0078500	0.0076400	0.0072900	0.0072600	0.0071600	0.0071300	0.006810	0.0065700	0.0064800	0.0061200	0.0043600
Cumulative Proportion	0.154340	0.255400	0.335510	0.40724	0.465180	0.493900	0.519720	0.54041	0.5598800	0.5783800	0.5962300	0.6132400	0.6298400	0.6460700	0.6616500	0.6764700	0.6911000	0.705200	0.7188700	0.7319300	0.7448000	0.7575100	0.769650	0.7816100	0.7932900	0.8044900	0.8151800	0.8256900	0.8360700	0.8460300	0.8558400	0.865500	0.8750600	0.8839300	0.8925200	0.9009000	0.9092500	0.9174000	0.9253400	0.9331900	0.9408300	0.9481200	0.9553800	0.9625400	0.9696700	0.976480	0.9830500	0.9895200	0.9956400	1.0000000

From the PCA variance table, the first 6 to 8 principal components explain about 49% to 54% of the variance in the data. Reaching roughly 80% of the variance would require 26 principal components.

The final choice of how many PCs to keep will be based on model evaluation. For now, we use the first 6 principal components as input to k-Means clustering.
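The cumulative proportions in the table can also be computed directly from the component standard deviations. The sketch below illustrates this on R's built-in USArrests data rather than the personality data; the helper `n_components_for()` is a hypothetical name introduced only for this example.

```r
# Built-in dataset used purely for illustration (not the personality data)
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Each component's variance is its squared standard deviation
var_explained <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)
cum_var <- cumsum(var_explained)

# Hypothetical helper: smallest number of components whose cumulative
# share of variance reaches the given threshold
n_components_for <- function(cum_var, threshold) {
  which(cum_var >= threshold)[1]
}

n_components_for(cum_var, 0.80)
```

Applied to the report's PCA object, the same calculation on pca_result$sdev should reproduce the "Cumulative Proportion" row of the table.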

pca_data <- as.data.frame(pca_result$x[, 1:6])

Elbow Method

To choose the number of clusters for k-Means, we use the elbow method: fit the model on the PCA scores over a range of k, plot the total within-cluster sum of squares against k, and pick the k beyond which adding more clusters yields only small improvements.
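The code behind the elbow plot is not shown in this report. The following is a minimal sketch of how such a plot can be produced with base R, assuming synthetic data in place of pca_data; the object names `demo`, `k_values`, and `wss` are introduced only for this illustration.

```r
# Synthetic data with two well-separated groups (illustration only)
set.seed(42)
demo <- rbind(
  matrix(rnorm(300 * 6, mean = -2), ncol = 6),
  matrix(rnorm(300 * 6, mean =  2), ncol = 6)
)

# Total within-cluster sum of squares (WSS) for k = 1..10
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(demo, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})

# The "elbow" is the k after which WSS stops dropping substantially
plot(k_values, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow Method for Optimal k")
```

On a dataset of several hundred thousand rows, running this on a random sample of rows may be more practical. The factoextra package, which the report loads, also offers fviz_nbclust() for this kind of plot; whether the original figure was produced that way is not stated.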

Figure 5: Elbow Method for Optimal k

In the plot above, the elbow appears at k = 2, so we use k = 2 for the model.

Modeling with Clustering

We model the data with k-means clustering, where k=2.

set.seed(42)
kmeans_result <- kmeans(pca_data, centers = 2, nstart = 25)

df_clean$Cluster <- as.factor(kmeans_result$cluster)
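One way to evaluate a fitted clustering, offered here only as a sketch and not as the report's own evaluation, is the average silhouette width from the cluster package. Because silhouette() needs a pairwise distance matrix, the example computes it on a random subsample; the synthetic data and the subsample size of 300 are assumptions made for illustration.

```r
library(cluster)  # provides silhouette()

# Synthetic data with two groups (illustration only)
set.seed(42)
demo <- rbind(
  matrix(rnorm(500 * 6, mean = -1), ncol = 6),
  matrix(rnorm(500 * 6, mean =  1), ncol = 6)
)
km <- kmeans(demo, centers = 2, nstart = 25)

# Subsample rows so the pairwise distance matrix stays small
idx <- sample(nrow(demo), 300)
sil <- silhouette(km$cluster[idx], dist(demo[idx, ]))

# Average silhouette width: values closer to 1 indicate better-separated clusters
mean(sil[, "sil_width"])
```

The same pattern could be applied to pca_data and kmeans_result$cluster, and repeated for different numbers of retained PCs to compare configurations.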

We can inspect the model with pairwise scatter plots of the PCs, colored by cluster.

# Use the cluster assignments from your K-means result
pca_data_with_cluster <- pca_data %>%
  mutate(Cluster = as.factor(df_clean$Cluster))

# Create pairwise scatter plots for the first n PCs
ggpairs(pca_data_with_cluster, columns = 1:6, aes(color = Cluster, alpha = 0.7)) +
  theme_minimal() +
  labs(title = "Pairwise Scatter Plots of First 6 PCs by Cluster")

Figure 6: Pairwise Scatter Plots of First 6 PCs by Clusters

The summary below shows the mean and standard deviation of each item within clusters 1 and 2.

kable(
  df_clean %>%
    group_by(Cluster) %>%
    summarise(across(all_of(ocean_cols), list(mean = mean, sd = sd)))
)

Cluster	EXT1_mean	EXT1_sd	EXT2_mean	EXT2_sd	EXT3_mean	EXT3_sd	EXT4_mean	EXT4_sd	EXT5_mean	EXT5_sd	EXT6_mean	EXT6_sd	EXT7_mean	EXT7_sd	EXT8_mean	EXT8_sd	EXT9_mean	EXT9_sd	EXT10_mean	EXT10_sd	EST1_mean	EST1_sd	EST2_mean	EST2_sd	EST3_mean	EST3_sd	EST4_mean	EST4_sd	EST5_mean	EST5_sd	EST6_mean	EST6_sd	EST7_mean	EST7_sd	EST8_mean	EST8_sd	EST9_mean	EST9_sd	EST10_mean	EST10_sd	AGR1_mean	AGR1_sd	AGR2_mean	AGR2_sd	AGR3_mean	AGR3_sd	AGR4_mean	AGR4_sd	AGR5_mean	AGR5_sd	AGR6_mean	AGR6_sd	AGR7_mean	AGR7_sd	AGR8_mean	AGR8_sd	AGR9_mean	AGR9_sd	AGR10_mean	AGR10_sd	CSN1_mean	CSN1_sd	CSN2_mean	CSN2_sd	CSN3_mean	CSN3_sd	CSN4_mean	CSN4_sd	CSN5_mean	CSN5_sd	CSN6_mean	CSN6_sd	CSN7_mean	CSN7_sd	CSN8_mean	CSN8_sd	CSN9_mean	CSN9_sd	CSN10_mean	CSN10_sd	OPN1_mean	OPN1_sd	OPN2_mean	OPN2_sd	OPN3_mean	OPN3_sd	OPN4_mean	OPN4_sd	OPN5_mean	OPN5_sd	OPN6_mean	OPN6_sd	OPN7_mean	OPN7_sd	OPN8_mean	OPN8_sd	OPN9_mean	OPN9_sd	OPN10_mean	OPN10_sd
1	3.197653	1.114255	2.189290	1.099072	3.965228	0.9058906	2.561354	1.0695974	3.990641	0.9238935	1.840075	0.8864732	3.486022	1.210761	2.994135	1.192727	3.536425	1.176324	2.944203	1.2135210	2.883025	1.283191	3.495029	1.110808	3.590431	1.1860263	2.949282	1.220582	2.530577	1.186436	2.453254	1.197802	2.688555	1.219714	2.269352	1.207805	2.664984	1.214276	2.302960	1.165657	1.963775	1.244857	4.320473	0.8209272	2.066392	1.196233	4.202817	0.9409185	1.957524	1.011916	3.876663	1.132339	1.758672	0.8603816	3.983791	0.9192772	4.064737	1.004347	4.034356	0.8602155	3.533553	1.064935	2.902685	1.378435	4.11392	0.9625651	2.306633	1.158251	2.818343	1.254877	2.640451	1.385987	3.787204	1.055819	2.203965	1.061983	3.358943	1.210066	3.793986	0.9579982	3.939005	1.012748	1.812282	0.9789907	4.158117	0.9722771	1.808834	0.9960333	4.100935	0.7945832	1.686435	0.973477	4.264987	0.8039078	3.355442	1.216698	4.221736	0.9255650	4.243709	0.8306251
2	1.978479	1.039871	3.481263	1.168287	2.524594	0.9886152	3.852566	0.9642721	2.534960	1.0901407	2.984245	1.2221053	1.964255	1.065790	3.927787	1.102821	2.393885	1.213094	4.275268	0.9198928	3.722863	1.202669	2.859468	1.182751	4.158637	0.9784697	2.378941	1.162219	3.171282	1.236309	3.270241	1.255162	3.426296	1.200516	3.117233	1.292246	3.522585	1.172322	3.390877	1.206532	2.495172	1.307167	3.407406	1.1227675	2.459896	1.297283	3.702583	1.1462423	2.633877	1.188523	3.644517	1.187065	2.691530	1.1364036	3.397199	1.0805045	3.531122	1.200103	3.177510	1.0157876	3.114805	1.141190	3.087033	1.357722	3.90401	1.0177459	2.986915	1.214491	2.356506	1.185598	3.070504	1.387172	3.677738	1.087474	2.768125	1.116259	2.967351	1.257926	3.470874	1.0099481	3.622644	1.108067	2.219790	1.1384828	3.980644	1.0678761	2.085789	1.0964003	3.579909	0.9759647	2.059209	1.128188	3.863421	0.9760795	3.216117	1.204177	4.222961	0.9495287	3.763114	1.0547610

# Compute the mean and SD of every non-grouping column in df_clean, per cluster
summary_by_cluster <- df_clean %>%
  group_by(Cluster) %>%
  summarise(across(everything(), list(mean = mean, sd = sd), .names = "{.col}_{.fn}"))

# Pivot to long format so it's easier to plot
summary_long <- summary_by_cluster %>%
  pivot_longer(
    cols = -Cluster,
    names_to = c("Trait", "Stat"),
    names_sep = "_"
  ) %>%
  filter(Stat == "mean")  # Only plot means for now

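# Keep the cluster label and the item columns (EXT1..EXT10, EST1..EST10, etc.)
trait_cols <- df_clean %>%
  select(Cluster, matches("^(EXT|EST|AGR|CSN|OPN)[0-9]+$"))

# Average the raw item responses within each trait, per cluster
# (no reverse-keying is applied in this step)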
trait_summary <- trait_cols %>%
  pivot_longer(
    cols = -Cluster,
    names_to = "Trait_Item",
    values_to = "Value"
  ) %>%
  mutate(Trait = str_extract(Trait_Item, "^[A-Z]+")) %>%  # Extract EXT, EST, AGR, etc.
  group_by(Cluster, Trait) %>%
  summarise(Mean = mean(Value, na.rm = TRUE), .groups = 'drop')

ggplot(trait_summary, aes(x = Trait, y = Mean, fill = as.factor(Cluster))) +
  geom_col(position = "dodge") +
  labs(title = "Big 5 Traits Mean per Cluster", x = "Trait", fill = "Cluster") +
  theme_minimal()

Figure 7: Big 5 Traits Mean per Cluster

With a cluster assigned to each observation, we can summarize which cluster dominates on each of the Big Five traits.
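One way to make this comparison explicit is to reshape the per-cluster trait means into one row per trait and flag the cluster with the higher mean. The sketch below is only an illustration, not the analysis code: `toy_summary` is a hypothetical stand-in with the same columns as `trait_summary` (`Cluster`, `Trait`, `Mean`), and its values are made up.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for trait_summary; the means are illustrative only
toy_summary <- tibble(
  Cluster = rep(c(1, 2), each = 5),
  Trait   = rep(c("AGR", "CSN", "EST", "EXT", "OPN"), times = 2),
  Mean    = c(3.2, 3.1, 2.7, 3.1, 3.3,   # Cluster 1
              3.0, 3.0, 3.3, 3.0, 3.1)   # Cluster 2
)

dominance <- toy_summary %>%
  # One row per trait, one column per cluster
  pivot_wider(names_from = Cluster, values_from = Mean, names_prefix = "Cluster_") %>%
  mutate(
    Diff     = Cluster_1 - Cluster_2,                       # positive: Cluster 1 is higher
    Dominant = if_else(Diff > 0, "Cluster 1", "Cluster 2")  # cluster with the larger mean
  )

print(dominance)
```

Replacing `toy_summary` with the real `trait_summary` should give the same table for the actual data, provided `Cluster` takes the values 1 and 2.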

Conclusion

This study successfully applied unsupervised learning to Big Five personality trait data, yielding valuable insights for the development of more human-centered intelligent systems.

We found that six principal components capture roughly 49-54% of the variance in the dataset. These components served as input to the k-means clustering model, for which k = 2 was identified as the optimal number of clusters. This suggests that the personality data can be segmented into two groups while working in a reduced number of dimensions, although about half of the variance is not represented in those components.
Analysis of the clusters revealed distinct personality profiles. Cluster 1 exhibits higher mean scores across most of the Big Five dimensions, indicating a generally more pronounced expression of these traits on average within this group. In contrast, Cluster 2 shows a relative dominance only in Emotional Stability (EST), suggesting that individuals in this cluster tend to exhibit higher average emotional stability compared to those in Cluster 1. However, the box plots reveal that within these cluster-level averages, there is considerable variability and distributional nuance. For instance, while Cluster 1 has a higher mean for Extraversion (EXT), the box plot shows a wide spread of scores in both clusters, indicating that not all individuals in Cluster 1 are highly extraverted and some individuals in Cluster 2 also exhibit high extraversion.
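
The dimensionality-reduction and clustering steps summarized above can be sketched in base R as follows. This is a minimal sketch under stated assumptions, not the study's exact pipeline: the `items` matrix is simulated, and standardizing the items before PCA (`scale. = TRUE`) and using 25 random starts for k-means are assumptions made for the illustration.

```r
set.seed(42)

# Hypothetical stand-in for the questionnaire: 300 respondents x 10 Likert items (1-5)
n <- 300
group <- rep(c(1, 2), each = n / 2)
items <- sapply(1:10, function(j) {
  centre <- ifelse(group == 1, 3.8, 2.4)
  pmin(5, pmax(1, round(rnorm(n, mean = centre, sd = 1))))
})
colnames(items) <- paste0("ITEM", 1:10)

# PCA on standardized items (scaling is an assumption of this sketch)
pca <- prcomp(items, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the first components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Keep the first six components, as in the study, and cluster on their scores
n_pc <- 6
scores <- pca$x[, 1:n_pc]
km <- kmeans(scores, centers = 2, nstart = 25)

round(cum_var[n_pc], 2)  # share of variance retained by the six components
table(km$cluster)        # cluster sizes
```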

These findings offer several key insights for the design of human-centered intelligent systems:

Personalization Potential: The identified clusters highlight the potential for tailoring system interactions to different personality types. For example, users in Cluster 1, characterized by higher average expression across most Big Five traits, might generally respond well to systems that are more engaging, interactive, and provide a richer set of features. However, it is crucial to acknowledge the variability observed in the box plots. Some users within Cluster 1 may still prefer minimalist interfaces, and some users within Cluster 2 may appreciate highly interactive features. Therefore, personalization should not be solely based on cluster assignment but should also consider individual variations within each cluster.
Emotional Stability Considerations: Cluster 2’s emphasis on emotional stability at the cluster level suggests that on average, users in this group may prefer systems that are reliable, predictable, and provide a supportive and calming user experience. The box plots further elaborate that while this trend holds, there’s a range of emotional stability scores within both clusters, implying that some users in Cluster 1 might also benefit from calming interfaces, and some users in Cluster 2 might tolerate or even prefer more dynamic interactions. These insights can inform the design of virtual agent interactions, user interface elements, and feedback mechanisms.
Strategic Implications: Understanding these personality-based clusters and the score distributions within them can enable developers to create more targeted and effective user experiences, potentially increasing user satisfaction and engagement. For instance, in the design of a virtual assistant, developers might consider offering a range of personality options, acknowledging the variability within clusters, rather than simply offering “Cluster 1” or “Cluster 2” personalities.
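
The caveat about within-cluster variability can be quantified with a standardized mean difference. The sketch below uses the EXT1 means and standard deviations from the per-cluster summary table above. Cluster sizes are not reported here, so the pooled standard deviation weights both clusters equally, and the overlap estimate additionally assumes roughly normal scores, which is only an approximation for Likert items.

```r
# EXT1 summary statistics taken from the per-cluster table
m1 <- 3.197653; s1 <- 1.114255   # Cluster 1
m2 <- 1.978479; s2 <- 1.039871   # Cluster 2

# Cohen's d with an equally weighted pooled SD (assumption: similar cluster sizes)
pooled_sd <- sqrt((s1^2 + s2^2) / 2)
d <- (m1 - m2) / pooled_sd

# Under a normal approximation, the probability that a randomly chosen
# Cluster 2 respondent scores higher on EXT1 than a randomly chosen Cluster 1 respondent
p_c2_higher <- pnorm(-(m1 - m2) / sqrt(s1^2 + s2^2))

round(c(cohens_d = d, p_c2_higher = p_c2_higher), 3)
```

Even when the standardized difference is large, a non-trivial share of respondents in the lower-scoring cluster can still exceed respondents in the higher-scoring one, which is why cluster membership alone is a coarse basis for personalization.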

Further Improvement

To further enhance this research and its practical applications, the following improvements are recommended:

Clustering Validation: Implement silhouette score analysis to quantitatively evaluate the quality of the clustering results. This will provide a measure of how well each data point fits within its assigned cluster and how distinct the clusters are from each other, adding robustness to the findings.
Alternative Clustering Methods: Explore alternative clustering algorithms, such as DBSCAN, which is less sensitive to outliers than k-means. This could provide additional insights into the data’s structure and potentially reveal different or more refined user groupings.
Multi-Modal Data Integration: Incorporate additional datasets, such as text-based data from user interactions with intelligent personal assistants, to provide a more comprehensive understanding of user personality. This multi-modal approach could lead to the development of more sophisticated and nuanced models of user behavior and preferences.
Actionable Recommendations for Industry:
- Personalized User Interfaces: Design user interfaces that adapt to different personality types. For example, users in Cluster 1 might prefer feature-rich interfaces with opportunities for interaction and customization, while users in Cluster 2 might prefer minimalist designs that prioritize clarity and ease of use.
- Virtual Agent Personalities: Develop virtual agent personalities that align with the identified clusters. This could involve offering users a choice of virtual agent personality (e.g., “engaging” vs. “calm”) to enhance user experience and satisfaction.
- Targeted Marketing and Communication: Tailor marketing messages and communication strategies to resonate with different personality types. Understanding the preferences and communication styles of each cluster can lead to more effective user engagement and product adoption.
- Adaptive System Behavior: Design systems that can adapt their behavior based on user personality. For example, a system might provide more proactive assistance to users in Cluster 1, while offering more guidance and support to users in Cluster 2.
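
The silhouette analysis recommended above could be sketched as follows, using `silhouette()` from the `cluster` package. This is an illustration under stated assumptions: `pc_scores` is a simulated stand-in for the principal-component scores, not the study's data. Because `dist()` builds a full pairwise distance matrix, a large dataset would usually be evaluated on a random subsample.

```r
library(cluster)  # recommended package shipped with standard R distributions

set.seed(1)

# Hypothetical stand-in for the principal-component scores: two loose groups in 2-D
pc_scores <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 3), ncol = 2)
)

km <- kmeans(pc_scores, centers = 2, nstart = 25)

# silhouette() takes the integer cluster labels and a dissimilarity object
sil <- silhouette(km$cluster, dist(pc_scores))

# Overall and per-cluster average silhouette width (values near 1 indicate
# well-separated clusters; values near 0 indicate overlapping clusters)
avg_width   <- mean(sil[, "sil_width"])
cluster_avg <- tapply(sil[, "sil_width"], sil[, "cluster"], mean)

avg_width
cluster_avg
```

Repeating the calculation for several values of k and comparing the average widths is a common way to cross-check the choice of k = 2.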

By pursuing these improvements, future research can provide even more valuable insights for the development of human-centered intelligent systems, ultimately leading to more personalized, effective, and satisfying user experiences.

References

American Psychological Association. (2023). Personality. https://www.apa.org/topics/personality

Lopatovska, I., & Arapakis, I. (2020). Personality in information behavior. Journal of the Association for Information Science and Technology, 71(8), 921–926. https://doi.org/10.1002/asi.24314

Clustering-Based Personality Analysis on Big Five Traits for Human-Centered Insights

Sasha Nabila Fortuna

06 May, 2025