Figure 1: Big Five Personality Traits


Background Problem

Personality fundamentally shapes how individuals think, feel, and behave, significantly influencing their interactions with technology. With the increasing integration of intelligent systems—such as virtual assistants, personalized platforms, and adaptive interfaces—understanding individual personality differences is crucial for creating meaningful, human-centered experiences (American Psychological Association, 2023). However, many existing systems treat users uniformly, failing to capture the nuances in psychological profiles that could lead to more intuitive and engaging interactions.

Recent research emphasizes the growing importance of incorporating personality traits into the design of information systems to better align with user expectations and behavior. Lopatovska & Arapakis (2020) highlighted how personality-based models can improve personalization in search engines and conversational agents, arguing that user satisfaction often hinges on how well systems adapt to individual psychological profiles. This insight aligns with the human-centered design philosophy, emphasizing that emotional and cognitive aspects of users should play a central role in shaping interactive technologies.

Rather than relying on predefined personality labels, this study explores latent patterns within responses to a Big Five personality questionnaire. By applying unsupervised learning methods, we aim to uncover natural groupings in personality traits—allowing us to identify underlying psychological structures without prior assumptions. These clusters may reflect distinct behavioral styles, cognitive preferences, and communication tendencies that have direct implications for how individuals interact with technology, offering valuable insights for personalization in the design of user interfaces, the behavior of virtual agents, and the tailoring of digital content.

Objectives

This study aims to apply unsupervised learning to Big Five personality trait data to discover meaningful patterns and user groupings that can ultimately inform the development of more human-centered intelligent systems:

  • Apply k-Means clustering to uncover groups of individuals with similar personality profiles based on questionnaire responses, with the optimal number of clusters determined using the Elbow method. The identified clusters will represent distinct personality segments within the user population, potentially highlighting different preferences and interaction styles with technology.

  • Analyze the characteristics of each cluster in terms of potential behavioral patterns, communication styles, and implications for system personalization in human-computer interaction. This includes understanding if certain personality profiles are more likely to:

    • Prefer specific interface designs (e.g., minimalist vs. feature-rich).
    • Engage with virtual assistants in particular ways (e.g., direct vs. conversational).
    • Respond to different types of feedback or error messages.
    • Benefit from tailored levels of proactivity or guidance from a system.

Dataset Description

This study leverages a large-scale dataset collected between 2016 and 2018 through an interactive online personality test, hosted on Open Psychometrics. The questionnaire is based on the Big Five Factor Markers from the International Personality Item Pool (IPIP), covering five key personality dimensions:

  • Extraversion (EXT)
  • Neuroticism (coded in the dataset as Emotional Stability, EST)
  • Agreeableness (AGR)
  • Conscientiousness (CSN)
  • Openness to Experience (OPN)

Each personality trait is measured by 10 items, for a total of 50 statements, rated on a 5-point Likert scale (1 = Disagree, 3 = Neutral, 5 = Agree).

The questions were shown in a fixed interleaved order (e.g., EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc.) to ensure balanced response behavior.

The full set of 50 items is listed below:


Item Code Item Description
EXT1 I am the life of the party.
EXT2 I don’t talk a lot.
EXT3 I feel comfortable around people.
EXT4 I keep in the background.
EXT5 I start conversations.
EXT6 I have little to say.
EXT7 I talk to a lot of different people at parties.
EXT8 I don’t like to draw attention to myself.
EXT9 I don’t mind being the center of attention.
EXT10 I am quiet around strangers.
EST1 I get stressed out easily.
EST2 I am relaxed most of the time.
EST3 I worry about things.
EST4 I seldom feel blue.
EST5 I am easily disturbed.
EST6 I get upset easily.
EST7 I change my mood a lot.
EST8 I have frequent mood swings.
EST9 I get irritated easily.
EST10 I often feel blue.
AGR1 I feel little concern for others.
AGR2 I am interested in people.
AGR3 I insult people.
AGR4 I sympathize with others’ feelings.
AGR5 I am not interested in other people’s problems.
AGR6 I have a soft heart.
AGR7 I am not really interested in others.
AGR8 I take time out for others.
AGR9 I feel others’ emotions.
AGR10 I make people feel at ease.
CSN1 I am always prepared.
CSN2 I leave my belongings around.
CSN3 I pay attention to details.
CSN4 I make a mess of things.
CSN5 I get chores done right away.
CSN6 I often forget to put things back in their proper place.
CSN7 I like order.
CSN8 I shirk my duties.
CSN9 I follow a schedule.
CSN10 I am exacting in my work.
OPN1 I have a rich vocabulary.
OPN2 I have difficulty understanding abstract ideas.
OPN3 I have a vivid imagination.
OPN4 I am not interested in abstract ideas.
OPN5 I have excellent ideas.
OPN6 I do not have a good imagination.
OPN7 I am quick to understand things.
OPN8 I use difficult words.
OPN9 I spend time reflecting on things.
OPN10 I am full of ideas.

Key Features

  • Response Variables
    Each item is labeled with a trait code and item number (e.g., EXT1, AGR5, OPN10) corresponding to a specific statement.

  • Timing Variables
    For each question, an additional variable ending in _E captures response time in milliseconds, measuring how long the participant took to answer that item.

  • Session Metadata

    • dateload: Timestamp when the test began
    • introelapse, testelapse, endelapse: Time spent (in seconds) on the introduction page, survey page, and finalization page, respectively
    • IPC: Number of responses from the same IP address (used to filter out multiple submissions)
    • country: Automatically inferred from network metadata
    • screenw, screenh: Screen dimensions (can relate to device context)
    • lat_appx_lots_of_err, long_appx_lots_of_err: Approximate location (not highly accurate)

For data quality, only records with IPC = 1 were used to ensure uniqueness of submissions.

Methodology

Import Libraries

suppressPackageStartupMessages({
  library(readr)
  library(knitr)
  library(tidyverse)
  library(dplyr)
  library(FactoMineR)
  library(factoextra)
  library(ggplot2)
  library(stringr)
  library(scales)
  library(cluster)
  library(ClusterR)
  library(GGally)
})

Load dataset

# Read the data from the CSV
df <- read.delim("C:/Users/LENOVO/big-five-personality-test/IPIP-FFM-data-8Nov2018/data-final.csv", sep = "\t", stringsAsFactors = FALSE)

kable(head(df))
(Output: the first six rows of the raw data — 110 columns spanning the 50 item responses, the 50 matching _E response-time variables, and the session metadata fields described above.)

Following the suggestions that accompany the original dataset, we preprocess the data before using it for modelling.

Data Preprocessing

First, we keep only records whose number of submissions from the same IP address (IPC) equals 1, for maximum cleanliness.

# Keep only unique entries
df <- df %>% filter(IPC == 1)

Since the Big Five response columns are still stored as character (chr), we convert them to numeric, keeping only rows whose responses fall in the valid 1–5 range (this also drops rows containing 0, i.e., unanswered items).

# Convert 50 personality items to numeric
trait_cols <- c(paste0("EXT", 1:10), paste0("EST", 1:10),
                paste0("AGR", 1:10), paste0("CSN", 1:10), paste0("OPN", 1:10))
# Convert to numeric safely
df_filtered <- df %>%
  filter(if_all(all_of(trait_cols), ~ . %in% c("1", "2", "3", "4", "5")))

df_clean <- df_filtered %>%
  mutate(across(all_of(trait_cols), ~ as.numeric(.)))

Next, drop variables unrelated to the Big Five questionnaire responses.

# Drop non-related variables towards Big Five questionnaires-responses
df_clean <- df_clean %>%
  dplyr::select(-IPC, -dateload, -country, -lat_appx_lots_of_err, -long_appx_lots_of_err, -screenw, -screenh, -introelapse, -testelapse, -endelapse)

As in the previous step, convert any remaining columns that are recorded as chr but actually contain numeric values.

# Convert only character columns that look numeric
df_clean <- df_clean %>%
  mutate(across(
    where(~ is.character(.) && all(grepl("^\\d+(\\.\\d+)?$", .[!is.na(.)]))),
    ~ as.numeric(.)
  ))
# Identify character columns that are mostly numeric
char_cols <- names(df_clean)[sapply(df_clean, is.character)]

# Try converting them, keeping only if >90% successfully convert
for (col in char_cols) {
  temp <- suppressWarnings(as.numeric(df_clean[[col]]))
  success_rate <- mean(!is.na(temp))

  if (success_rate > 0.9) {
    df_clean[[col]] <- temp  # overwrite with numeric version
  }
}

After cleaning, inspect the dataset with the summary below.

kable(summary(df_clean))
(Output: summary statistics for all 100 columns. The 50 item responses each range from 1 to 5 after filtering. The _E timing variables, by contrast, contain clearly invalid extremes — negative minima and maxima up to roughly 2.1 billion milliseconds.)
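Since these timing columns are excluded from the clustering features later, the extreme values are harmless here. Still, if response times were to be analyzed, a minimal cleanup sketch could look like the following (the 5-minute cap is an arbitrary assumption, not part of the original pipeline):

# Hypothetical cleanup of the timing columns (not used in the clustering below)
timing_cols <- grep("_E$", names(df_clean), value = TRUE)
df_timing <- df_clean %>%
  mutate(across(all_of(timing_cols),
                ~ ifelse(. < 0 | . > 300000, NA_real_, .)))  # cap at 5 min per item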

Also check how many NA values remain in the dataset.

kable(colSums(is.na(df_clean)))
(Output: every one of the 100 item and timing columns contains 0 missing values.)

Next, keep only the columns containing the Big Five item responses as features for the clustering model.

# Take columns contain OCEAN but not endswith _E
ocean_cols <- grep("^(EXT|EST|AGR|CSN|OPN)", colnames(df_clean), value = TRUE)
ocean_cols <- ocean_cols[!grepl("_E$", ocean_cols)]

Before moving on to modelling, we need to make sure the data contain no duplicated responses.

# Identify duplicate rows based on the columns in ocean_cols
duplicate_rows <- duplicated(df_clean[, ocean_cols])

# Count the number of duplicate rows
num_duplicate_rows <- sum(duplicate_rows)
cat("Number of duplicate rows (based on OCEAN columns):", num_duplicate_rows, "\n")
## Number of duplicate rows (based on OCEAN columns): 854
if (num_duplicate_rows > 0) {
  # Remove duplicate rows
  df_clean <- df_clean[!duplicate_rows, ]
  cat("Duplicate rows removed. Updated number of rows in df_clean:", nrow(df_clean), "\n")
} else {
  cat("No duplicate rows found (based on OCEAN columns).\n")
}
## Duplicate rows removed. Updated number of rows in df_clean: 602468

To see how the data vary across the OCEAN items, we can plot them with histograms and a box plot.

# Reshape to long format
df_long <- reshape2::melt(df_clean[, ocean_cols])
# Make sure there's no more NaN values and all of them into numeric
df_long <- na.omit(df_long)
df_long$value <- as.numeric(df_long$value)
ggplot(df_long, aes(x = value)) +
  geom_histogram(binwidth = 0.2, fill = "skyblue", color = "black") +
  facet_wrap(~ variable, scales = "free_y", nrow = 10) +
  labs(title = "Score Distribution for Each Dimension", 
       x = "Score", 
       y = "Frequency") +
  theme_bw()
Figure 2: Score Distribution for Each Dimension of Personality Trait

ggplot(df_long, aes(x = variable, y = value)) +
  geom_boxplot(fill = "lightcoral", color = "black") +
  labs(title = "Score Distribution for Each Dimension (Box Plot)",
       x = "Dimension",
       y = "Score") +
  theme_bw() +
  coord_flip()
Figure 3: Score Distribution for Each Dimension of Personality Trait (Box Plot)

We can see outliers in certain variables related to Agreeableness (AGR), Conscientiousness (CSN), and Openness to Experience (OPN). Because k-means clustering is sensitive to outliers, we need a strategy for handling them; one option is normalization, which we apply by scaling the variables in the PCA step below.

summary_stats <- df_long %>%
  group_by(variable) %>%
  summarize(
    mean_score = mean(value, na.rm = TRUE),
    median_score = median(value, na.rm = TRUE),
    sd_score = sd(value, na.rm = TRUE),
    min_score = min(value, na.rm = TRUE),
    max_score = max(value, na.rm = TRUE),
    n_observations = n()
  )

kable(summary_stats)
variable mean_score median_score sd_score min_score max_score n_observations
EXT1 2.576099 3 1.2374681 1 5 602468
EXT2 2.847959 3 1.3057964 1 5 602468
EXT3 3.230769 3 1.1912989 1 5 602468
EXT4 3.219635 3 1.2047707 1 5 602468
EXT5 3.248511 3 1.2465256 1 5 602468
EXT6 2.423392 2 1.2139971 1 5 602468
EXT7 2.710200 3 1.3698189 1 5 602468
EXT8 3.470126 4 1.2390399 1 5 602468
EXT9 2.953940 3 1.3246703 1 5 602468
EXT10 3.622802 4 1.2633410 1 5 602468
EST1 3.311188 4 1.3117901 1 5 602468
EST2 3.171010 3 1.1912015 1 5 602468
EST3 3.880111 4 1.1217407 1 5 602468
EST4 2.658513 3 1.2248307 1 5 602468
EST5 2.857219 3 1.2537206 1 5 602468
EST6 2.869767 3 1.2935461 1 5 602468
EST7 3.064667 3 1.2649211 1 5 602468
EST8 2.701616 3 1.3213905 1 5 602468
EST9 3.102203 3 1.2677605 1 5 602468
EST10 2.857597 3 1.3053596 1 5 602468
AGR1 2.234690 2 1.3043405 1 5 602468
AGR2 3.854976 4 1.0869058 1 5 602468
AGR3 2.267007 2 1.2641696 1 5 602468
AGR4 3.947790 4 1.0799717 1 5 602468
AGR5 2.302340 2 1.1560334 1 5 602468
AGR6 3.758311 4 1.1663487 1 5 602468
AGR7 2.234258 2 1.1129761 1 5 602468
AGR8 3.684737 4 1.0466293 1 5 602468
AGR9 3.792691 4 1.1401198 1 5 602468
AGR10 3.597522 4 1.0354886 1 5 602468
CSN1 3.320069 3 1.1241311 1 5 602468
CSN2 2.996669 3 1.3710137 1 5 602468
CSN3 4.006905 4 0.9966201 1 5 602468
CSN4 2.653452 3 1.2350004 1 5 602468
CSN5 2.582891 2 1.2417003 1 5 602468
CSN6 2.859699 3 1.4031575 1 5 602468
CSN7 3.731396 4 1.0734690 1 5 602468
CSN8 2.491583 2 1.1258857 1 5 602468
CSN9 3.159303 3 1.2501190 1 5 602468
CSN10 3.629258 4 0.9979830 1 5 602468
OPN1 3.777719 4 1.0741179 1 5 602468
OPN2 2.020036 2 1.0826338 1 5 602468
OPN3 4.067638 4 1.0259751 1 5 602468
OPN4 1.950030 2 1.0575048 1 5 602468
OPN5 3.835307 4 0.9289385 1 5 602468
OPN6 1.876481 2 1.0715174 1 5 602468
OPN7 4.060262 4 0.9180440 1 5 602468
OPN8 3.284412 3 1.2123321 1 5 602468
OPN9 4.222360 4 0.9378580 1 5 602468
OPN10 3.998694 4 0.9813749 1 5 602468

Principal Component Analysis (PCA)

At this step, we apply principal component analysis (PCA) to reduce the dimensionality of the input data while retaining the most significant variation. Centering and scaling the variables within prcomp also standardizes the items, which addresses the outlier concern noted above.

# PCA (use scaling)
pca_result <- prcomp(df_clean[, ocean_cols], center = TRUE, scale. = TRUE)
# Scree plot
screeplot(pca_result, type = "lines", main = "Scree Plot PCA")
Figure 4: Scree Plot of Principal Component Analysis

# Variance explained
kable(summary(pca_result)$importance) 
PC     Std. Deviation  Prop. of Variance  Cumulative Prop.
PC1    2.778           0.1543             0.1543
PC2    2.248           0.1011             0.2554
PC3    2.001           0.0801             0.3355
PC4    1.894           0.0717             0.4072
PC5    1.702           0.0579             0.4652
PC6    1.199           0.0287             0.4939
PC7    1.136           0.0258             0.5197
PC8    1.017           0.0207             0.5404
…      …               …                  …
PC26   0.748           0.0112             0.8045
…      …               …                  …
PC50   0.467           0.0044             1.0000
(intermediate components omitted for readability)

From the explained variance of the PCA, we can see that 6–8 principal components explain about 49–54% of the variance in the data. To reach roughly 80% of the variance explained, we would need 26 principal components.
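As a quick sanity check, the number of components needed for a given variance threshold can be computed directly from the pca_result object above; a small sketch:

# Proportion of variance explained by each principal component
var_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
# Smallest number of PCs whose cumulative proportion reaches 80%
which(cumsum(var_explained) >= 0.80)[1]  # returns 26, matching the table above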

The final decision on how many PCs to use should rest on model evaluation; for now, we use 6 principal components as the input to k-means clustering.

pca_data <- as.data.frame(pca_result$x[, 1:6])

Elbow Method

To find the optimal number of clusters for k-means, we use the elbow method: plot the total within-cluster sum of squares (WSS) against k, and pick the k at the "elbow", where adding further clusters yields diminishing improvement in fit on the PCA inputs.
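The code that produced Figure 5 is not shown above; a minimal sketch, assuming the pca_data frame from the previous step (the k range of 1–10 and nstart = 10 are arbitrary choices), might look like this:

# Total within-cluster sum of squares for k = 1..10
# (compute-heavy on ~600k rows; a random subsample gives the same curve shape)
set.seed(42)
wss <- sapply(1:10, function(k) {
  kmeans(pca_data, centers = k, nstart = 10)$tot.withinss
})
plot(1:10, wss, type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow Method for Optimal k")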
Figure 5: Elbow Method for Optimal k

From the plot above, the elbow occurs at k = 2, so we use two clusters for the model.

Modeling with Clustering

We model the data with k-means clustering, where k=2.

set.seed(42)
kmeans_result <- kmeans(pca_data, centers = 2, nstart = 25)

df_clean$Cluster <- as.factor(kmeans_result$cluster)
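For a quick two-dimensional overview, factoextra (loaded earlier) can plot the clusters on the first two principal components. A minimal sketch, using a random subsample to keep rendering fast on ~600k rows (the subsample size of 20,000 is an arbitrary assumption):

# Project a 20k-row subsample of the clusters onto PC1 and PC2
set.seed(42)
idx <- sample(nrow(pca_data), 20000)
fviz_cluster(list(data = pca_data[idx, 1:2],
                  cluster = kmeans_result$cluster[idx]),
             geom = "point", ellipse.type = "convex",
             main = "k-Means Clusters on PC1 vs PC2")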

We can also interpret the model with pairwise scatter plots of the principal components.

# Use the cluster assignments from your K-means result
pca_data_with_cluster <- pca_data %>%
  mutate(Cluster = as.factor(df_clean$Cluster))

# Create pairwise scatter plots for the first n PCs
ggpairs(pca_data_with_cluster, columns = 1:6, aes(color = Cluster, alpha = 0.7)) +
  theme_minimal() +
  labs(title = "Pairwise Scatter Plots of First 6 PCs by Cluster")
Figure 6: Pairwise Scatter Plots of First 6 PCs by Cluster

The summary below shows how clusters 1 and 2 differ statistically, reporting each item's mean and standard deviation per cluster.

kable(
  df_clean %>%
    group_by(Cluster) %>%
    summarise(across(all_of(ocean_cols), list(mean = mean, sd = sd)))
)
(Output: per-item means and standard deviations for each cluster — for example, EXT1 has mean 3.20 (SD 1.11) in cluster 1 versus 1.98 (SD 1.04) in cluster 2.)
# You can use this for all columns starting with EXT, AGR, etc.
summary_by_cluster <- df_clean %>%
  group_by(Cluster) %>%
  summarise(across(everything(), list(mean = mean, sd = sd), .names = "{.col}_{.fn}"))
# Pivot to long format so it's easier to plot
summary_long <- summary_by_cluster %>%
  pivot_longer(
    cols = -Cluster,
    names_to = c("Trait", "Stat"),
    names_sep = "_"
  ) %>%
  filter(Stat == "mean")  # Only plot means for now
trait_cols <- df_clean %>%
  select(Cluster, matches("^(EXT|EST|AGR|CSN|OPN)[0-9]+$"))
# Then calculate the mean per trait group
trait_summary <- trait_cols %>%
  pivot_longer(
    cols = -Cluster,
    names_to = "Trait_Item",
    values_to = "Value"
  ) %>%
  mutate(Trait = str_extract(Trait_Item, "^[A-Z]+")) %>%  # Extract EXT, EST, AGR, etc.
  group_by(Cluster, Trait) %>%
  summarise(Mean = mean(Value, na.rm = TRUE), .groups = 'drop')
ggplot(trait_summary, aes(x = Trait, y = Mean, fill = as.factor(Cluster))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Big 5 Traits Mean per Cluster", x = "Trait", fill = "Cluster") +
  theme_minimal()
Figure 7: Big 5 Traits Mean per Cluster

With a cluster assigned to each data point, we can summarize which Big Five components dominate in each cluster.

Conclusion

This study successfully applied unsupervised learning to Big Five personality trait data, yielding valuable insights for the development of more human-centered intelligent systems.

  1. We determined that 6 principal components capture roughly 49% of the variance in the dataset (8 components reach about 54%). These components served as input for the k-means clustering model, which identified an optimal number of 2 clusters (k=2). This suggests that the personality data can be meaningfully segmented into two distinct groups, capturing a substantial portion of the variability in personality traits with a much smaller number of dimensions.

  2. Analysis of the clusters revealed distinct personality profiles. Cluster 1 exhibits higher mean scores across most of the Big Five dimensions, indicating a generally more pronounced expression of these traits on average within this group. In contrast, Cluster 2 shows a relative dominance only in Emotional Stability (EST), suggesting that individuals in this cluster tend to exhibit higher average emotional stability compared to those in Cluster 1. However, the box plots reveal that within these cluster-level averages, there is considerable variability and distributional nuance. For instance, while Cluster 1 has a higher mean for Extraversion (EXT), the box plot shows a wide spread of scores in both clusters, indicating that not all individuals in Cluster 1 are highly extraverted and some individuals in Cluster 2 also exhibit high extraversion.

These findings offer several key insights for the design of human-centered intelligent systems:

  • Personalization Potential: The identified clusters highlight the potential for tailoring system interactions to different personality types. For example, users in Cluster 1, characterized by higher average expression across most Big Five traits, might generally respond well to systems that are more engaging, interactive, and provide a richer set of features. However, it is crucial to acknowledge the variability observed in the box plots. Some users within Cluster 1 may still prefer minimalist interfaces, and some users within Cluster 2 may appreciate highly interactive features. Therefore, personalization should not be solely based on cluster assignment but should also consider individual variations within each cluster.

  • Emotional Stability Considerations: Cluster 2’s emphasis on emotional stability at the cluster level suggests that on average, users in this group may prefer systems that are reliable, predictable, and provide a supportive and calming user experience. The box plots further elaborate that while this trend holds, there’s a range of emotional stability scores within both clusters, implying that some users in Cluster 1 might also benefit from calming interfaces, and some users in Cluster 2 might tolerate or even prefer more dynamic interactions. These insights can inform the design of virtual agent interactions, user interface elements, and feedback mechanisms.

  • Strategic Implications: Understanding these personality-based clusters and the score distributions within them can enable developers to create more targeted and effective user experiences, potentially increasing user satisfaction and engagement. For instance, in the design of a virtual assistant, developers might consider offering a range of personality options, acknowledging the variability within clusters, rather than simply offering “Cluster 1” or “Cluster 2” personalities.

Further Improvement

To further enhance this research and its practical applications, the following improvements are recommended:

  1. Clustering Validation: Implement silhouette score analysis to quantitatively evaluate the quality of the clustering results. This will provide a measure of how well each data point fits within its assigned cluster and how distinct the clusters are from each other, adding robustness to the findings (a small sketch follows this list).

  2. Alternative Clustering Methods: Explore alternative clustering algorithms, such as DBSCAN, which is less sensitive to outliers than k-means. This could provide additional insights into the data’s structure and potentially reveal different or more refined user groupings.

  3. Multi-Modal Data Integration: Incorporate additional datasets, such as text-based data from user interactions with intelligent personal assistants, to provide a more comprehensive understanding of user personality. This multi-modal approach could lead to the development of more sophisticated and nuanced models of user behavior and preferences.

  4. Actionable Recommendations for Industry:

    • Personalized User Interfaces: Design user interfaces that adapt to different personality types. For example, users in Cluster 1 might prefer feature-rich interfaces with opportunities for interaction and customization, while users in Cluster 2 might prefer minimalist designs that prioritize clarity and ease of use.

    • Virtual Agent Personalities: Develop virtual agent personalities that align with the identified clusters. This could involve offering users a choice of virtual agent personality (e.g., “engaging” vs. “calm”) to enhance user experience and satisfaction.

    • Targeted Marketing and Communication: Tailor marketing messages and communication strategies to resonate with different personality types. Understanding the preferences and communication styles of each cluster can lead to more effective user engagement and product adoption.

    • Adaptive System Behavior: Design systems that can adapt their behavior based on user personality. For example, a system might provide more proactive assistance to users in Cluster 1, while offering more guidance and support to users in Cluster 2.
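As a sketch of the silhouette analysis suggested in point 1, assuming the pca_data and kmeans_result objects from the modelling section are still available — the silhouette is computed on a random subsample because the full pairwise distance matrix for ~600k rows would not fit in memory (the subsample size of 5,000 is an arbitrary assumption):

# Average silhouette width on a random 5,000-row subsample
library(cluster)   # already loaded above
set.seed(42)
idx <- sample(nrow(pca_data), 5000)
sil <- silhouette(kmeans_result$cluster[idx], dist(pca_data[idx, ]))
mean(sil[, "sil_width"])   # closer to 1 = better-separated clusters
fviz_silhouette(sil)       # silhouette plot via factoextra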

By pursuing these improvements, future research can provide even more valuable insights for the development of human-centered intelligent systems, ultimately leading to more personalized, effective, and satisfying user experiences.

References

American Psychological Association. (2023). Personality. https://www.apa.org/topics/personality
Lopatovska, I., & Arapakis, I. (2020). Personality in information behavior. Journal of the Association for Information Science and Technology, 71(8), 921–926. https://doi.org/10.1002/asi.24314