Figure 1: Big Five Personality Trait
Personality fundamentally shapes how individuals think, feel, and behave, significantly influencing their interactions with technology. With the increasing integration of intelligent systems—such as virtual assistants, personalized platforms, and adaptive interfaces—understanding individual personality differences is crucial for creating meaningful, human-centered experiences (American Psychological Association, 2023). However, many existing systems treat users uniformly, failing to capture the nuances in psychological profiles that could lead to more intuitive and engaging interactions.
Recent research emphasizes the growing importance of incorporating personality traits into the design of information systems to better align with user expectations and behavior. Lopatovska & Arapakis (2020) highlighted how personality-based models can improve personalization in search engines and conversational agents, arguing that user satisfaction often hinges on how well systems adapt to individual psychological profiles. This insight aligns with the human-centered design philosophy, emphasizing that emotional and cognitive aspects of users should play a central role in shaping interactive technologies.
Rather than relying on predefined personality labels, this study explores latent patterns within responses to a Big Five personality questionnaire. By applying unsupervised learning methods, we aim to uncover natural groupings in personality traits—allowing us to identify underlying psychological structures without prior assumptions. These clusters may reflect distinct behavioral styles, cognitive preferences, and communication tendencies that have direct implications for how individuals interact with technology, offering valuable insights for personalization in the design of user interfaces, the behavior of virtual agents, and the tailoring of digital content.
This study aims to apply unsupervised learning to Big Five personality trait data to discover meaningful patterns and user groupings that can ultimately inform the development of more human-centered intelligent systems:
Apply k-Means clustering to uncover groups of individuals with similar personality profiles based on questionnaire responses, with the optimal number of clusters determined using the Elbow method. The identified clusters will represent distinct personality segments within the user population, potentially highlighting different preferences and interaction styles with technology.
Analyze the characteristics of each cluster in terms of potential behavioral patterns, communication styles, and implications for system personalization in human-computer interaction. This includes understanding if certain personality profiles are more likely to:
This study leverages a large-scale dataset collected between 2016 and 2018 through an interactive online personality test, hosted on Open Psychometrics. The questionnaire is based on the Big Five Factor Markers from the International Personality Item Pool (IPIP), covering five key personality dimensions:
Each personality trait is measured by 10 items, resulting in a total of 50 statements, each rated on a 5-point Likert scale (1 = Disagree, 3 = Neutral, 5 = Agree).
The questions were shown in a fixed interleaved order (e.g., EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc.) to ensure balanced response behavior.
| Item Code | Item Description |
|---|---|
| EXT1 | I am the life of the party. |
| EXT2 | I don’t talk a lot. |
| EXT3 | I feel comfortable around people. |
| EXT4 | I keep in the background. |
| EXT5 | I start conversations. |
| EXT6 | I have little to say. |
| EXT7 | I talk to a lot of different people at parties. |
| EXT8 | I don’t like to draw attention to myself. |
| EXT9 | I don’t mind being the center of attention. |
| EXT10 | I am quiet around strangers. |
| EST1 | I get stressed out easily. |
| EST2 | I am relaxed most of the time. |
| EST3 | I worry about things. |
| EST4 | I seldom feel blue. |
| EST5 | I am easily disturbed. |
| EST6 | I get upset easily. |
| EST7 | I change my mood a lot. |
| EST8 | I have frequent mood swings. |
| EST9 | I get irritated easily. |
| EST10 | I often feel blue. |
| AGR1 | I feel little concern for others. |
| AGR2 | I am interested in people. |
| AGR3 | I insult people. |
| AGR4 | I sympathize with others’ feelings. |
| AGR5 | I am not interested in other people’s problems. |
| AGR6 | I have a soft heart. |
| AGR7 | I am not really interested in others. |
| AGR8 | I take time out for others. |
| AGR9 | I feel others’ emotions. |
| AGR10 | I make people feel at ease. |
| CSN1 | I am always prepared. |
| CSN2 | I leave my belongings around. |
| CSN3 | I pay attention to details. |
| CSN4 | I make a mess of things. |
| CSN5 | I get chores done right away. |
| CSN6 | I often forget to put things back in their proper place. |
| CSN7 | I like order. |
| CSN8 | I shirk my duties. |
| CSN9 | I follow a schedule. |
| CSN10 | I am exacting in my work. |
| OPN1 | I have a rich vocabulary. |
| OPN2 | I have difficulty understanding abstract ideas. |
| OPN3 | I have a vivid imagination. |
| OPN4 | I am not interested in abstract ideas. |
| OPN5 | I have excellent ideas. |
| OPN6 | I do not have a good imagination. |
| OPN7 | I am quick to understand things. |
| OPN8 | I use difficult words. |
| OPN9 | I spend time reflecting on things. |
| OPN10 | I am full of ideas. |
Response Variables
Each item is labeled with a trait code and item number (e.g., EXT1, AGR5, OPN10) corresponding to a specific statement.
Timing Variables
For each question, an additional variable ending in _E captures response time in milliseconds, measuring how long the participant took to answer that item.
Session Metadata
- dateload: Timestamp when the test began
- introelapse, testelapse, endelapse: Time spent (in seconds) on the introduction page, survey page, and finalization page, respectively
- IPC: Number of responses from the same IP address (used to filter out multiple submissions)
- country: Automatically inferred from network metadata
- screenw, screenh: Screen dimensions (can relate to device context)
- lat_appx_lots_of_err, long_appx_lots_of_err: Approximate location (not highly accurate)

For data quality, only records with IPC = 1 were used to ensure uniqueness of submissions.
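library(dplyr)  # %>%, filter(), mutate(), across()
library(knitr)  # kable()
# Read the data; the file is tab-separated despite its .csv extension
df <- read.delim("C:/Users/LENOVO/big-five-personality-test/IPIP-FFM-data-8Nov2018/data-final.csv", sep = "\t", stringsAsFactors = FALSE)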
kable(head(df))

| EXT1 | EXT2 | EXT3 | EXT4 | EXT5 | EXT6 | EXT7 | EXT8 | EXT9 | EXT10 | EST1 | EST2 | EST3 | EST4 | EST5 | EST6 | EST7 | EST8 | EST9 | EST10 | AGR1 | AGR2 | AGR3 | AGR4 | AGR5 | AGR6 | AGR7 | AGR8 | AGR9 | AGR10 | CSN1 | CSN2 | CSN3 | CSN4 | CSN5 | CSN6 | CSN7 | CSN8 | CSN9 | CSN10 | OPN1 | OPN2 | OPN3 | OPN4 | OPN5 | OPN6 | OPN7 | OPN8 | OPN9 | OPN10 | EXT1_E | EXT2_E | EXT3_E | EXT4_E | EXT5_E | EXT6_E | EXT7_E | EXT8_E | EXT9_E | EXT10_E | EST1_E | EST2_E | EST3_E | EST4_E | EST5_E | EST6_E | EST7_E | EST8_E | EST9_E | EST10_E | AGR1_E | AGR2_E | AGR3_E | AGR4_E | AGR5_E | AGR6_E | AGR7_E | AGR8_E | AGR9_E | AGR10_E | CSN1_E | CSN2_E | CSN3_E | CSN4_E | CSN5_E | CSN6_E | CSN7_E | CSN8_E | CSN9_E | CSN10_E | OPN1_E | OPN2_E | OPN3_E | OPN4_E | OPN5_E | OPN6_E | OPN7_E | OPN8_E | OPN9_E | OPN10_E | dateload | screenw | screenh | introelapse | testelapse | endelapse | IPC | country | lat_appx_lots_of_err | long_appx_lots_of_err |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 1 | 5 | 2 | 5 | 1 | 5 | 2 | 4 | 1 | 1 | 4 | 4 | 2 | 2 | 2 | 2 | 2 | 3 | 2 | 2 | 5 | 2 | 4 | 2 | 3 | 2 | 4 | 3 | 4 | 3 | 4 | 3 | 2 | 2 | 4 | 4 | 2 | 4 | 4 | 5 | 1 | 4 | 1 | 4 | 1 | 5 | 3 | 4 | 5 | 9419 | 5491 | 3959 | 4821 | 5611 | 2756 | 2388 | 2113 | 5900 | 4110 | 6135 | 4150 | 5739 | 6364 | 3663 | 5070 | 5709 | 4285 | 2587 | 3997 | 4750 | 5475 | 11641 | 3115 | 3207 | 3260 | 10235 | 5897 | 1758 | 3081 | 6602 | 5457 | 1569 | 2129 | 3762 | 4420 | 9382 | 5286 | 4983 | 6339 | 3146 | 4067 | 2959 | 3411 | 2170 | 4920 | 4436 | 3116 | 2992 | 4354 | 2016-03-03 02:01:01 | 768 | 1024 | 9 | 234 | 6 | 1 | GB | 51.5448 | 0.1991 |
| 3 | 5 | 3 | 4 | 3 | 3 | 2 | 5 | 1 | 5 | 2 | 3 | 4 | 1 | 3 | 1 | 2 | 1 | 3 | 1 | 1 | 4 | 1 | 5 | 1 | 5 | 3 | 4 | 5 | 3 | 3 | 2 | 5 | 3 | 3 | 1 | 3 | 3 | 5 | 3 | 1 | 2 | 4 | 2 | 3 | 1 | 4 | 2 | 5 | 3 | 7235 | 3598 | 3315 | 2564 | 2976 | 3050 | 4787 | 3228 | 3465 | 3309 | 9036 | 2406 | 3484 | 3359 | 3061 | 2539 | 4226 | 2962 | 1799 | 1607 | 2158 | 2090 | 2143 | 2807 | 3422 | 5324 | 4494 | 3627 | 1850 | 1747 | 5163 | 5240 | 7208 | 2783 | 4103 | 3431 | 3347 | 2399 | 3360 | 5595 | 2624 | 4985 | 1684 | 3026 | 4742 | 3336 | 2718 | 3374 | 3096 | 3019 | 2016-03-03 02:01:20 | 1360 | 768 | 12 | 179 | 11 | 1 | MY | 3.1698 | 101.706 |
| 2 | 3 | 4 | 4 | 3 | 2 | 1 | 3 | 2 | 5 | 4 | 4 | 4 | 2 | 2 | 2 | 2 | 2 | 1 | 3 | 1 | 4 | 1 | 4 | 2 | 4 | 1 | 4 | 4 | 3 | 4 | 2 | 2 | 2 | 3 | 3 | 4 | 2 | 4 | 2 | 5 | 1 | 2 | 1 | 4 | 2 | 5 | 3 | 4 | 4 | 4657 | 3549 | 2543 | 3335 | 5847 | 2540 | 4922 | 3142 | 14621 | 2191 | 5128 | 3675 | 3442 | 4546 | 8275 | 2185 | 2164 | 1175 | 3813 | 1593 | 1089 | 2203 | 3386 | 1464 | 2562 | 1493 | 3067 | 13719 | 3892 | 4100 | 4286 | 4775 | 2713 | 2813 | 4237 | 6308 | 2690 | 1516 | 2379 | 2983 | 1930 | 1470 | 1644 | 1683 | 2229 | 8114 | 2043 | 6295 | 1585 | 2529 | 2016-03-03 02:01:56 | 1366 | 768 | 3 | 186 | 7 | 1 | GB | 54.9119 | -1.3833 |
| 2 | 2 | 2 | 3 | 4 | 2 | 2 | 4 | 1 | 4 | 3 | 3 | 3 | 2 | 3 | 2 | 2 | 2 | 4 | 3 | 2 | 4 | 3 | 4 | 2 | 4 | 2 | 4 | 3 | 4 | 2 | 4 | 4 | 4 | 1 | 2 | 2 | 3 | 1 | 4 | 4 | 2 | 5 | 2 | 3 | 1 | 4 | 4 | 3 | 3 | 3996 | 2896 | 5096 | 4240 | 5168 | 5456 | 4360 | 4496 | 5240 | 4000 | 3736 | 4616 | 3015 | 2711 | 3960 | 4064 | 4208 | 2936 | 7336 | 3896 | 6062 | 11952 | 1040 | 2264 | 3664 | 3049 | 4912 | 7545 | 4632 | 6896 | 2824 | 520 | 2368 | 3225 | 2848 | 6264 | 3760 | 10472 | 3192 | 7704 | 3456 | 6665 | 1977 | 3728 | 4128 | 3776 | 2984 | 4192 | 3480 | 3257 | 2016-03-03 02:02:02 | 1920 | 1200 | 186 | 219 | 7 | 1 | GB | 51.75 | -1.25 |
| 3 | 3 | 3 | 3 | 5 | 3 | 3 | 5 | 3 | 4 | 1 | 5 | 5 | 3 | 1 | 1 | 1 | 1 | 3 | 2 | 1 | 5 | 1 | 5 | 1 | 3 | 1 | 5 | 5 | 3 | 5 | 1 | 5 | 1 | 3 | 1 | 5 | 1 | 5 | 5 | 5 | 1 | 5 | 1 | 5 | 1 | 5 | 3 | 5 | 5 | 6004 | 3965 | 2721 | 3706 | 2968 | 2426 | 7339 | 3302 | 16819 | 3731 | 4740 | 2856 | 7461 | 2179 | 3324 | 2255 | 4308 | 4506 | 3127 | 3115 | 6771 | 2819 | 3682 | 2511 | 16204 | 1736 | 28983 | 1612 | 2437 | 4532 | 3843 | 7019 | 3102 | 3153 | 2869 | 6550 | 1811 | 3682 | 21500 | 20587 | 8458 | 3510 | 17042 | 7029 | 2327 | 5835 | 6846 | 5320 | 11401 | 8642 | 2016-03-03 02:02:57 | 1366 | 768 | 8 | 315 | 17 | 2 | KE | 1.0 | 38.0 |
| 3 | 3 | 4 | 2 | 4 | 2 | 2 | 3 | 3 | 4 | 3 | 4 | 3 | 2 | 2 | 1 | 2 | 1 | 2 | 2 | 2 | 3 | 1 | 4 | 2 | 3 | 2 | 3 | 4 | 4 | 3 | 2 | 4 | 1 | 3 | 2 | 4 | 3 | 4 | 3 | 5 | 1 | 5 | 1 | 3 | 1 | 5 | 4 | 5 | 2 | 4834 | 5064 | 1160 | 2664 | 6711 | 3344 | 2512 | 6264 | 6992 | 4592 | 2808 | 1776 | 3280 | 4520 | 2640 | 5408 | 3647 | 3183 | 1575 | 672 | 6375 | 4727 | 3775 | 1647 | 1233 | 8694 | 2904 | 2152 | 2856 | 2848 | 4288 | 4360 | 7328 | 3976 | 7895 | 2640 | 1760 | 5720 | 9032 | 3928 | 2104 | 5488 | 3656 | 4352 | 2681 | 3272 | 2640 | 1568 | 1640 | 3192 | 2016-03-03 02:03:12 | 1600 | 1000 | 4 | 196 | 3 | 1 | SE | 59.3333 | 18.05 |
Following the recommendations that accompany the original dataset, we preprocess the data before using it for modelling.
First, we check that each retained record comes from an IP address with exactly one submission (IPC = 1), which the dataset documentation suggests for maximum cleanliness.
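The IPC filter itself does not appear in the code shown in this report. A minimal sketch of how such a filter might be applied follows, using a small made-up data frame. The names `df_toy` and `df_single_ip` are illustrative assumptions, not objects from the original analysis.

```r
# Toy stand-in for the raw data; the real data frame is read from data-final.csv.
df_toy <- data.frame(
  EXT1 = c("4", "3", "2", "NULL"),
  IPC  = c(1, 1, 2, 1),
  stringsAsFactors = FALSE
)

# Keep only rows whose IP address produced exactly one submission.
# The is.na() guard drops rows where IPC is missing instead of keeping NA rows.
df_single_ip <- df_toy[!is.na(df_toy$IPC) & df_toy$IPC == 1, , drop = FALSE]

# Report how many rows the filter removed.
cat("Rows before:", nrow(df_toy), "- rows after:", nrow(df_single_ip), "\n")
```

Filtering before any type conversion keeps the later steps from spending time on rows that would be discarded anyway.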
Because the Big Five response columns are still stored as chr even though they hold numbers, we convert them to numeric.
# The 50 personality item columns
trait_cols <- c(paste0("EXT", 1:10), paste0("EST", 1:10),
                paste0("AGR", 1:10), paste0("CSN", 1:10), paste0("OPN", 1:10))

# Keep only rows where every item holds a valid response ("1" to "5")
df_filtered <- df %>%
  filter(if_all(all_of(trait_cols), ~ . %in% c("1", "2", "3", "4", "5")))

# Convert the validated item columns to numeric
df_clean <- df_filtered %>%
  mutate(across(all_of(trait_cols), ~ as.numeric(.)))

Next, drop the variables that are unrelated to the Big Five questionnaire items and their responses.
# Drop variables unrelated to the Big Five items and their responses
df_clean <- df_clean %>%
  dplyr::select(-IPC, -dateload, -country, -lat_appx_lots_of_err, -long_appx_lots_of_err,
                -screenw, -screenh, -introelapse, -testelapse, -endelapse)

As in the previous step, convert the columns that are still stored as chr but actually contain numeric values.
# Convert only character columns whose non-missing values all look numeric
df_clean <- df_clean %>%
  mutate(across(
    where(~ is.character(.) && all(grepl("^\\d+(\\.\\d+)?$", .[!is.na(.)]))),
    ~ as.numeric(.)
  ))

# Identify the remaining character columns that are mostly numeric
char_cols <- names(df_clean)[sapply(df_clean, is.character)]
# Try converting them, keeping the result only if >90% of values convert
for (col in char_cols) {
  temp <- suppressWarnings(as.numeric(df_clean[[col]]))
  success_rate <- mean(!is.na(temp))
  if (success_rate > 0.9) {
    df_clean[[col]] <- temp  # overwrite with numeric version
  }
}

After checking the cleanliness of the dataset, review the summary below.
| EXT1 | EXT2 | EXT3 | EXT4 | EXT5 | EXT6 | EXT7 | EXT8 | EXT9 | EXT10 | EST1 | EST2 | EST3 | EST4 | EST5 | EST6 | EST7 | EST8 | EST9 | EST10 | AGR1 | AGR2 | AGR3 | AGR4 | AGR5 | AGR6 | AGR7 | AGR8 | AGR9 | AGR10 | CSN1 | CSN2 | CSN3 | CSN4 | CSN5 | CSN6 | CSN7 | CSN8 | CSN9 | CSN10 | OPN1 | OPN2 | OPN3 | OPN4 | OPN5 | OPN6 | OPN7 | OPN8 | OPN9 | OPN10 | EXT1_E | EXT2_E | EXT3_E | EXT4_E | EXT5_E | EXT6_E | EXT7_E | EXT8_E | EXT9_E | EXT10_E | EST1_E | EST2_E | EST3_E | EST4_E | EST5_E | EST6_E | EST7_E | EST8_E | EST9_E | EST10_E | AGR1_E | AGR2_E | AGR3_E | AGR4_E | AGR5_E | AGR6_E | AGR7_E | AGR8_E | AGR9_E | AGR10_E | CSN1_E | CSN2_E | CSN3_E | CSN4_E | CSN5_E | CSN6_E | CSN7_E | CSN8_E | CSN9_E | CSN10_E | OPN1_E | OPN2_E | OPN3_E | OPN4_E | OPN5_E | OPN6_E | OPN7_E | OPN8_E | OPN9_E | OPN10_E | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.00 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.00 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.00 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. : -42958762 | Min. : -75632 | Min. : -3593866 | Min. :-2494907 | Min. : -58566 | Min. : -79860 | Min. : -89610 | Min. : -461138 | Min. : -11382 | Min. : -142238 | Min. : -112165 | Min. : -71572 | Min. : -24118 | Min. : -3598047 | Min. : -88286 | Min. : -81895 | Min. :-2187273 | Min. : -92455 | Min. :-79175662 | Min. : -43558 | Min. : -48675 | Min. : -3592606 | Min. : -1795552 | Min. : -67786 | Min. : -20294 | Min. : -21938 | Min. : -65423 | Min. : -27029 | Min. : -527846 | Min. : -85674 | Min. : -3590638 | Min. :-35996486 | Min. : -15375 | Min. : -11993 | Min. : -3512740 | Min. : -74245 | Min. : -30016 | Min. : -177880 | Min. : -29167 | Min. : -14988 | Min. :-53927742 | Min. : -215205 | Min. : -417031 | Min. : -74467 | Min. : -75300 | Min. : -509916 | Min. : -51694 | Min. : -17007 | Min. : -95986 | Min. :-3594871 | |
| 1st Qu.:1.000 | 1st Qu.:2.000 | 1st Qu.:2.000 | 1st Qu.:2.000 | 1st Qu.:2.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:3.000 | 1st Qu.:2.000 | 1st Qu.:3.000 | 1st Qu.:2.000 | 1st Qu.:2.000 | 1st Qu.:3.000 | 1st Qu.:2.000 | 1st Qu.:2.000 | 1st Qu.:2.00 | 1st Qu.:2.000 | 1st Qu.:2.000 | 1st Qu.:2.000 | 1st Qu.:2.000 | 1st Qu.:1.000 | 1st Qu.:3.000 | 1st Qu.:1.000 | 1st Qu.:3.000 | 1st Qu.:1.000 | 1st Qu.:3.000 | 1st Qu.:1.000 | 1st Qu.:3.000 | 1st Qu.:3.000 | 1st Qu.:3.000 | 1st Qu.:3.00 | 1st Qu.:2.000 | 1st Qu.:4.000 | 1st Qu.:2.000 | 1st Qu.:2.000 | 1st Qu.:2.00 | 1st Qu.:3.000 | 1st Qu.:2.000 | 1st Qu.:2.000 | 1st Qu.:3.000 | 1st Qu.:3.000 | 1st Qu.:1.000 | 1st Qu.:3.000 | 1st Qu.:1.000 | 1st Qu.:3.000 | 1st Qu.:1.000 | 1st Qu.:4.000 | 1st Qu.:2.000 | 1st Qu.:4.000 | 1st Qu.:3.000 | 1st Qu.: 4834 | 1st Qu.: 2407 | 1st Qu.: 2503 | 1st Qu.: 2422 | 1st Qu.: 2166 | 1st Qu.: 2208 | 1st Qu.: 3077 | 1st Qu.: 2541 | 1st Qu.: 2610 | 1st Qu.: 2268 | 1st Qu.: 2242 | 1st Qu.: 2515 | 1st Qu.: 1945 | 1st Qu.: 2480 | 1st Qu.: 2487 | 1st Qu.: 2233 | 1st Qu.: 2224 | 1st Qu.: 2086 | 1st Qu.: 1954 | 1st Qu.: 1775 | 1st Qu.: 2926 | 1st Qu.: 2272 | 1st Qu.: 2248 | 1st Qu.: 2216 | 1st Qu.: 2937 | 1st Qu.: 2040 | 1st Qu.: 2667 | 1st Qu.: 2711 | 1st Qu.: 2235 | 1st Qu.: 2409 | 1st Qu.: 2439 | 1st Qu.: 3000 | 1st Qu.: 2283 | 1st Qu.: 2360 | 1st Qu.: 2549 | 1st Qu.: 3117 | 1st Qu.: 2067 | 1st Qu.: 2480 | 1st Qu.: 2055 | 1st Qu.: 2690 | 1st Qu.: 2088 | 1st Qu.: 3057 | 1st Qu.: 1876 | 1st Qu.: 2685 | 1st Qu.: 2001 | 1st Qu.: 2375 | 1st Qu.: 2294 | 1st Qu.: 2166 | 1st Qu.: 2344 | 1st Qu.: 1491 | |
| Median :3.000 | Median :3.000 | Median :3.000 | Median :3.000 | Median :3.000 | Median :2.000 | Median :3.000 | Median :4.000 | Median :3.000 | Median :4.000 | Median :4.000 | Median :3.000 | Median :4.000 | Median :3.000 | Median :3.000 | Median :3.00 | Median :3.000 | Median :3.000 | Median :3.000 | Median :3.000 | Median :2.000 | Median :4.000 | Median :2.000 | Median :4.000 | Median :2.000 | Median :4.000 | Median :2.000 | Median :4.000 | Median :4.000 | Median :4.000 | Median :3.00 | Median :3.000 | Median :4.000 | Median :3.000 | Median :2.000 | Median :3.00 | Median :4.000 | Median :2.000 | Median :3.000 | Median :4.000 | Median :4.000 | Median :2.000 | Median :4.000 | Median :2.000 | Median :4.000 | Median :2.000 | Median :4.000 | Median :3.000 | Median :4.000 | Median :4.000 | Median : 7315 | Median : 3427 | Median : 3517 | Median : 3471 | Median : 3040 | Median : 3128 | Median : 4340 | Median : 3608 | Median : 3651 | Median : 3216 | Median : 3311 | Median : 3614 | Median : 2787 | Median : 3585 | Median : 3509 | Median : 3191 | Median : 3182 | Median : 2932 | Median : 2798 | Median : 2566 | Median : 4369 | Median : 3268 | Median : 3169 | Median : 3181 | Median : 4062 | Median : 2892 | Median : 3686 | Median : 3854 | Median : 3138 | Median : 3339 | Median : 3584 | Median : 4269 | Median : 3200 | Median : 3339 | Median : 3586 | Median : 4352 | Median : 2916 | Median : 3745 | Median : 2923 | Median : 3934 | Median : 3020 | Median : 4216 | Median : 2739 | Median : 3703 | Median : 2841 | Median : 3319 | Median : 3199 | Median : 3057 | Median : 3252 | Median : 2190 | |
| Mean :2.577 | Mean :2.848 | Mean :3.231 | Mean :3.219 | Mean :3.249 | Mean :2.424 | Mean :2.711 | Mean :3.469 | Mean :2.954 | Mean :3.622 | Mean :3.311 | Mean :3.171 | Mean :3.879 | Mean :2.659 | Mean :2.857 | Mean :2.87 | Mean :3.065 | Mean :2.702 | Mean :3.102 | Mean :2.858 | Mean :2.236 | Mean :3.854 | Mean :2.268 | Mean :3.947 | Mean :2.303 | Mean :3.758 | Mean :2.235 | Mean :3.684 | Mean :3.792 | Mean :3.597 | Mean :3.32 | Mean :2.997 | Mean :4.006 | Mean :2.654 | Mean :2.584 | Mean :2.86 | Mean :3.731 | Mean :2.492 | Mean :3.159 | Mean :3.629 | Mean :3.777 | Mean :2.021 | Mean :4.066 | Mean :1.952 | Mean :3.834 | Mean :1.878 | Mean :4.059 | Mean :3.284 | Mean :4.221 | Mean :3.998 | Mean : 101700 | Mean : 8421 | Mean : 9273 | Mean : 7762 | Mean : 7746 | Mean : 5871 | Mean : 7702 | Mean : 6929 | Mean : 5690 | Mean : 5258 | Mean : 8348 | Mean : 8328 | Mean : 6616 | Mean : 10769 | Mean : 7047 | Mean : 8318 | Mean : 6224 | Mean : 5258 | Mean : 4995 | Mean : 4590 | Mean : 18296 | Mean : 9139 | Mean : 6572 | Mean : 8087 | Mean : 7966 | Mean : 5417 | Mean : 7601 | Mean : 9732 | Mean : 5107 | Mean : 5660 | Mean : 12627 | Mean : 9918 | Mean : 8581 | Mean : 7655 | Mean : 9714 | Mean : 9841 | Mean : 5116 | Mean : 10370 | Mean : 5109 | Mean : 8926 | Mean : 8892 | Mean : 12898 | Mean : 6580 | Mean : 8246 | Mean : 6112 | Mean : 7146 | Mean : 7564 | Mean : 4849 | Mean : 5661 | Mean : 4242 | |
| 3rd Qu.:3.000 | 3rd Qu.:4.000 | 3rd Qu.:4.000 | 3rd Qu.:4.000 | 3rd Qu.:4.000 | 3rd Qu.:3.000 | 3rd Qu.:4.000 | 3rd Qu.:5.000 | 3rd Qu.:4.000 | 3rd Qu.:5.000 | 3rd Qu.:4.000 | 3rd Qu.:4.000 | 3rd Qu.:5.000 | 3rd Qu.:4.000 | 3rd Qu.:4.000 | 3rd Qu.:4.00 | 3rd Qu.:4.000 | 3rd Qu.:4.000 | 3rd Qu.:4.000 | 3rd Qu.:4.000 | 3rd Qu.:3.000 | 3rd Qu.:5.000 | 3rd Qu.:3.000 | 3rd Qu.:5.000 | 3rd Qu.:3.000 | 3rd Qu.:5.000 | 3rd Qu.:3.000 | 3rd Qu.:4.000 | 3rd Qu.:5.000 | 3rd Qu.:4.000 | 3rd Qu.:4.00 | 3rd Qu.:4.000 | 3rd Qu.:5.000 | 3rd Qu.:4.000 | 3rd Qu.:4.000 | 3rd Qu.:4.00 | 3rd Qu.:5.000 | 3rd Qu.:3.000 | 3rd Qu.:4.000 | 3rd Qu.:4.000 | 3rd Qu.:5.000 | 3rd Qu.:3.000 | 3rd Qu.:5.000 | 3rd Qu.:3.000 | 3rd Qu.:5.000 | 3rd Qu.:2.000 | 3rd Qu.:5.000 | 3rd Qu.:4.000 | 3rd Qu.:5.000 | 3rd Qu.:5.000 | 3rd Qu.: 11886 | 3rd Qu.: 5057 | 3rd Qu.: 5156 | 3rd Qu.: 5174 | 3rd Qu.: 4509 | 3rd Qu.: 4605 | 3rd Qu.: 6229 | 3rd Qu.: 5335 | 3rd Qu.: 5332 | 3rd Qu.: 4681 | 3rd Qu.: 5106 | 3rd Qu.: 5387 | 3rd Qu.: 4195 | 3rd Qu.: 5604 | 3rd Qu.: 5173 | 3rd Qu.: 4817 | 3rd Qu.: 4727 | 3rd Qu.: 4396 | 3rd Qu.: 4206 | 3rd Qu.: 3932 | 3rd Qu.: 6822 | 3rd Qu.: 4919 | 3rd Qu.: 4716 | 3rd Qu.: 4828 | 3rd Qu.: 5864 | 3rd Qu.: 4347 | 3rd Qu.: 5324 | 3rd Qu.: 5686 | 3rd Qu.: 4632 | 3rd Qu.: 4868 | 3rd Qu.: 5518 | 3rd Qu.: 6250 | 3rd Qu.: 4715 | 3rd Qu.: 4921 | 3rd Qu.: 5388 | 3rd Qu.: 6193 | 3rd Qu.: 4317 | 3rd Qu.: 6919 | 3rd Qu.: 4314 | 3rd Qu.: 6059 | 3rd Qu.: 4520 | 3rd Qu.: 6097 | 3rd Qu.: 4212 | 3rd Qu.: 5402 | 3rd Qu.: 4240 | 3rd Qu.: 4872 | 3rd Qu.: 4668 | 3rd Qu.: 4446 | 3rd Qu.: 4710 | 3rd Qu.: 3317 | |
| Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.00 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.00 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.00 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :2147483647 | Max. :261773449 | Max. :605905746 | Max. :99674242 | Max. :351067965 | Max. :166382065 | Max. :85145915 | Max. :247706204 | Max. :88154250 | Max. :150252063 | Max. :240303883 | Max. :184071718 | Max. :139259333 | Max. :880042878 | Max. :96737602 | Max. :346412854 | Max. :87632813 | Max. :31299296 | Max. :183826938 | Max. :81790376 | Max. :1170859453 | Max. :473898335 | Max. :130124433 | Max. :329878192 | Max. :134804667 | Max. :97539002 | Max. :251861470 | Max. :1367497215 | Max. :62757478 | Max. :81582425 | Max. :772659169 | Max. :169887206 | Max. :1100334734 | Max. :140148953 | Max. :958623288 | Max. :443209652 | Max. :84828106 | Max. :177082189 | Max. :87497885 | Max. :338015826 | Max. :675047026 | Max. :1026125615 | Max. :124483678 | Max. :201571919 | Max. :162680825 | Max. :243586621 | Max. :389143415 | Max. :46761765 | Max. :113808739 | Max. :90484840 |
Also check how many missing (NA) values remain in the dataset.
| Variable | Missing values |
|---|---|
| EXT1 | 0 |
| EXT2 | 0 |
| EXT3 | 0 |
| EXT4 | 0 |
| EXT5 | 0 |
| EXT6 | 0 |
| EXT7 | 0 |
| EXT8 | 0 |
| EXT9 | 0 |
| EXT10 | 0 |
| EST1 | 0 |
| EST2 | 0 |
| EST3 | 0 |
| EST4 | 0 |
| EST5 | 0 |
| EST6 | 0 |
| EST7 | 0 |
| EST8 | 0 |
| EST9 | 0 |
| EST10 | 0 |
| AGR1 | 0 |
| AGR2 | 0 |
| AGR3 | 0 |
| AGR4 | 0 |
| AGR5 | 0 |
| AGR6 | 0 |
| AGR7 | 0 |
| AGR8 | 0 |
| AGR9 | 0 |
| AGR10 | 0 |
| CSN1 | 0 |
| CSN2 | 0 |
| CSN3 | 0 |
| CSN4 | 0 |
| CSN5 | 0 |
| CSN6 | 0 |
| CSN7 | 0 |
| CSN8 | 0 |
| CSN9 | 0 |
| CSN10 | 0 |
| OPN1 | 0 |
| OPN2 | 0 |
| OPN3 | 0 |
| OPN4 | 0 |
| OPN5 | 0 |
| OPN6 | 0 |
| OPN7 | 0 |
| OPN8 | 0 |
| OPN9 | 0 |
| OPN10 | 0 |
| EXT1_E | 0 |
| EXT2_E | 0 |
| EXT3_E | 0 |
| EXT4_E | 0 |
| EXT5_E | 0 |
| EXT6_E | 0 |
| EXT7_E | 0 |
| EXT8_E | 0 |
| EXT9_E | 0 |
| EXT10_E | 0 |
| EST1_E | 0 |
| EST2_E | 0 |
| EST3_E | 0 |
| EST4_E | 0 |
| EST5_E | 0 |
| EST6_E | 0 |
| EST7_E | 0 |
| EST8_E | 0 |
| EST9_E | 0 |
| EST10_E | 0 |
| AGR1_E | 0 |
| AGR2_E | 0 |
| AGR3_E | 0 |
| AGR4_E | 0 |
| AGR5_E | 0 |
| AGR6_E | 0 |
| AGR7_E | 0 |
| AGR8_E | 0 |
| AGR9_E | 0 |
| AGR10_E | 0 |
| CSN1_E | 0 |
| CSN2_E | 0 |
| CSN3_E | 0 |
| CSN4_E | 0 |
| CSN5_E | 0 |
| CSN6_E | 0 |
| CSN7_E | 0 |
| CSN8_E | 0 |
| CSN9_E | 0 |
| CSN10_E | 0 |
| OPN1_E | 0 |
| OPN2_E | 0 |
| OPN3_E | 0 |
| OPN4_E | 0 |
| OPN5_E | 0 |
| OPN6_E | 0 |
| OPN7_E | 0 |
| OPN8_E | 0 |
| OPN9_E | 0 |
| OPN10_E | 0 |
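The code that produced the missing-value table is not shown. A table of this shape is typically obtained by counting `NA` entries per column. The sketch below does so on a made-up data frame; `df_toy` and `na_counts` are illustrative names, not taken from the original analysis.

```r
# Toy data with one missing response; the real input would be df_clean.
df_toy <- data.frame(
  EXT1   = c(1, 2, NA),
  EXT1_E = c(9419, 7235, 4657)
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(df_toy))
print(na_counts)
```

A result of all zeros, as in the table above, means no further imputation or row removal is needed for missing values.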
Next, select only the Big Five response columns for the clustering model.
# Select the OCEAN item columns, excluding the timing columns that end in _E
ocean_cols <- grep("^(EXT|EST|AGR|CSN|OPN)", colnames(df_clean), value = TRUE)
ocean_cols <- ocean_cols[!grepl("_E$", ocean_cols)]

Before moving on to modelling, we need to make sure the data contain no duplicate rows.
# Identify duplicate rows based on the columns in ocean_cols
duplicate_rows <- duplicated(df_clean[, ocean_cols])
# Count the number of duplicate rows
num_duplicate_rows <- sum(duplicate_rows)
cat("Number of duplicate rows (based on OCEAN columns):", num_duplicate_rows, "\n")

## Number of duplicate rows (based on OCEAN columns): 854

if (num_duplicate_rows > 0) {
  # Remove duplicate rows
  df_clean <- df_clean[!duplicate_rows, ]
  cat("Duplicate rows removed. Updated number of rows in df_clean:", nrow(df_clean), "\n")
} else {
  cat("No duplicate rows found (based on OCEAN columns).\n")
}

## Duplicate rows removed. Updated number of rows in df_clean: 602468
To see how much the responses vary for each OCEAN item, we plot them as histograms and as box plots.
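The plotting code that follows uses a long-format data frame, `df_long`, with the columns `variable` and `value`. The chunk that builds it is not shown. The sketch below is one way such a frame could be built in base R from a toy data set. The original may well have used a helper such as `reshape2::melt()` or `tidyr::pivot_longer()` instead, and `toy`, `item_cols`, and `df_long_toy` are illustrative names.

```r
set.seed(1)

# Toy wide-format data: two item columns plus one timing column to be ignored.
toy <- data.frame(
  EXT1   = sample(1:5, 6, replace = TRUE),
  AGR1   = sample(1:5, 6, replace = TRUE),
  EXT1_E = sample(1000:9000, 6)
)
item_cols <- c("EXT1", "AGR1")

# Stack the item columns: one row per (item, response) pair.
# unlist() concatenates the selected columns in order, matching rep(..., each = ).
df_long_toy <- data.frame(
  variable = factor(rep(item_cols, each = nrow(toy)), levels = item_cols),
  value    = unlist(toy[item_cols], use.names = FALSE)
)

print(head(df_long_toy))
```

Keeping `variable` as a factor with explicit levels preserves the item order when the data are faceted or placed on a plot axis.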
# Drop any remaining missing values and make sure the scores are numeric
df_long <- na.omit(df_long)
df_long$value <- as.numeric(df_long$value)

ggplot(df_long, aes(x = value)) +
  geom_histogram(binwidth = 0.2, fill = "skyblue", color = "black") +
  facet_wrap(~ variable, scales = "free_y", nrow = 10) +
  labs(title = "Score Distribution for Each Dimension",
       x = "Score",
       y = "Frequency") +
  theme_bw()

Figure 2: Score Distribution for Each Dimension of Personality Trait
ggplot(df_long, aes(x = variable, y = value)) +
  geom_boxplot(fill = "lightcoral", color = "black") +
  labs(title = "Score Distribution for Each Dimension (Box Plot)",
       x = "Dimension",
       y = "Score") +
  theme_bw() +
  coord_flip()

Figure 3: Score Distribution for Each Dimension of Personality Trait in Box Plot
The box plots show outliers in several items related to Agreeableness (AGR), Conscientiousness (CSN), and Openness to Experience (OPN). Because k-Means clustering is sensitive to outliers, we need a strategy for dealing with them; one option is to normalize the data.
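The normalization step is not shown in the report. One common choice is z-score standardization with base R's `scale()`, sketched below on made-up Likert responses; `toy` and `toy_scaled` are illustrative names. Standardization puts every item on a comparable scale, but it rescales extreme responses and does not remove them.

```r
set.seed(1)

# Toy 1-5 Likert responses for three items.
toy <- data.frame(
  AGR1 = sample(1:5, 100, replace = TRUE),
  CSN1 = sample(1:5, 100, replace = TRUE),
  OPN1 = sample(1:5, 100, replace = TRUE)
)

# Center each column to mean 0 and rescale it to standard deviation 1.
toy_scaled <- scale(toy, center = TRUE, scale = TRUE)

# After scaling, column means are (numerically) zero and the SDs are one.
print(round(colMeans(toy_scaled), 10))
print(apply(toy_scaled, 2, sd))
```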
summary_stats <- df_long %>%
  group_by(variable) %>%
  summarize(
    mean_score = mean(value, na.rm = TRUE),
    median_score = median(value, na.rm = TRUE),
    sd_score = sd(value, na.rm = TRUE),
    min_score = min(value, na.rm = TRUE),
    max_score = max(value, na.rm = TRUE),
    n_observations = n()
  )
kable(summary_stats)

| variable | mean_score | median_score | sd_score | min_score | max_score | n_observations |
|---|---|---|---|---|---|---|
| EXT1 | 2.576099 | 3 | 1.2374681 | 1 | 5 | 602468 |
| EXT2 | 2.847959 | 3 | 1.3057964 | 1 | 5 | 602468 |
| EXT3 | 3.230769 | 3 | 1.1912989 | 1 | 5 | 602468 |
| EXT4 | 3.219635 | 3 | 1.2047707 | 1 | 5 | 602468 |
| EXT5 | 3.248511 | 3 | 1.2465256 | 1 | 5 | 602468 |
| EXT6 | 2.423392 | 2 | 1.2139971 | 1 | 5 | 602468 |
| EXT7 | 2.710200 | 3 | 1.3698189 | 1 | 5 | 602468 |
| EXT8 | 3.470126 | 4 | 1.2390399 | 1 | 5 | 602468 |
| EXT9 | 2.953940 | 3 | 1.3246703 | 1 | 5 | 602468 |
| EXT10 | 3.622802 | 4 | 1.2633410 | 1 | 5 | 602468 |
| EST1 | 3.311188 | 4 | 1.3117901 | 1 | 5 | 602468 |
| EST2 | 3.171010 | 3 | 1.1912015 | 1 | 5 | 602468 |
| EST3 | 3.880111 | 4 | 1.1217407 | 1 | 5 | 602468 |
| EST4 | 2.658513 | 3 | 1.2248307 | 1 | 5 | 602468 |
| EST5 | 2.857219 | 3 | 1.2537206 | 1 | 5 | 602468 |
| EST6 | 2.869767 | 3 | 1.2935461 | 1 | 5 | 602468 |
| EST7 | 3.064667 | 3 | 1.2649211 | 1 | 5 | 602468 |
| EST8 | 2.701616 | 3 | 1.3213905 | 1 | 5 | 602468 |
| EST9 | 3.102203 | 3 | 1.2677605 | 1 | 5 | 602468 |
| EST10 | 2.857597 | 3 | 1.3053596 | 1 | 5 | 602468 |
| AGR1 | 2.234690 | 2 | 1.3043405 | 1 | 5 | 602468 |
| AGR2 | 3.854976 | 4 | 1.0869058 | 1 | 5 | 602468 |
| AGR3 | 2.267007 | 2 | 1.2641696 | 1 | 5 | 602468 |
| AGR4 | 3.947790 | 4 | 1.0799717 | 1 | 5 | 602468 |
| AGR5 | 2.302340 | 2 | 1.1560334 | 1 | 5 | 602468 |
| AGR6 | 3.758311 | 4 | 1.1663487 | 1 | 5 | 602468 |
| AGR7 | 2.234258 | 2 | 1.1129761 | 1 | 5 | 602468 |
| AGR8 | 3.684737 | 4 | 1.0466293 | 1 | 5 | 602468 |
| AGR9 | 3.792691 | 4 | 1.1401198 | 1 | 5 | 602468 |
| AGR10 | 3.597522 | 4 | 1.0354886 | 1 | 5 | 602468 |
| CSN1 | 3.320069 | 3 | 1.1241311 | 1 | 5 | 602468 |
| CSN2 | 2.996669 | 3 | 1.3710137 | 1 | 5 | 602468 |
| CSN3 | 4.006905 | 4 | 0.9966201 | 1 | 5 | 602468 |
| CSN4 | 2.653452 | 3 | 1.2350004 | 1 | 5 | 602468 |
| CSN5 | 2.582891 | 2 | 1.2417003 | 1 | 5 | 602468 |
| CSN6 | 2.859699 | 3 | 1.4031575 | 1 | 5 | 602468 |
| CSN7 | 3.731396 | 4 | 1.0734690 | 1 | 5 | 602468 |
| CSN8 | 2.491583 | 2 | 1.1258857 | 1 | 5 | 602468 |
| CSN9 | 3.159303 | 3 | 1.2501190 | 1 | 5 | 602468 |
| CSN10 | 3.629258 | 4 | 0.9979830 | 1 | 5 | 602468 |
| OPN1 | 3.777719 | 4 | 1.0741179 | 1 | 5 | 602468 |
| OPN2 | 2.020036 | 2 | 1.0826338 | 1 | 5 | 602468 |
| OPN3 | 4.067638 | 4 | 1.0259751 | 1 | 5 | 602468 |
| OPN4 | 1.950030 | 2 | 1.0575048 | 1 | 5 | 602468 |
| OPN5 | 3.835307 | 4 | 0.9289385 | 1 | 5 | 602468 |
| OPN6 | 1.876481 | 2 | 1.0715174 | 1 | 5 | 602468 |
| OPN7 | 4.060262 | 4 | 0.9180440 | 1 | 5 | 602468 |
| OPN8 | 3.284412 | 3 | 1.2123321 | 1 | 5 | 602468 |
| OPN9 | 4.222360 | 4 | 0.9378580 | 1 | 5 | 602468 |
| OPN10 | 3.998694 | 4 | 0.9813749 | 1 | 5 | 602468 |
At this step, we use principal component analysis (PCA) to reduce the dimensionality of the input data while retaining as much of its variance as possible.
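The PCA call is not shown in the report. The sketch below assumes base R's `prcomp()` with centering and scaling, applied to made-up data. Scaled inputs are consistent with the variance table that follows, where the first proportion equals 2.777921² / 50 ≈ 0.154. The names `toy`, `pca_toy`, `pve`, and `cum_pve` are illustrative.

```r
set.seed(42)

# Toy stand-in for the 50 item columns: 200 respondents, 10 items.
toy <- as.data.frame(matrix(sample(1:5, 200 * 10, replace = TRUE), ncol = 10))
names(toy) <- c(paste0("EXT", 1:5), paste0("OPN", 1:5))

# PCA on centered, unit-variance columns.
pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component, and its running total.
pve     <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
cum_pve <- cumsum(pve)

# Same three rows as the table in the report:
# standard deviation, proportion of variance, cumulative proportion.
print(round(summary(pca_toy)$importance, 4))
```

Plotting `pve` against the component index gives a scree plot like the one in Figure 4.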
Figure 4: Scree Plot of Principal Component Analysis
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 | PC13 | PC14 | PC15 | PC16 | PC17 | PC18 | PC19 | PC20 | PC21 | PC22 | PC23 | PC24 | PC25 | PC26 | PC27 | PC28 | PC29 | PC30 | PC31 | PC32 | PC33 | PC34 | PC35 | PC36 | PC37 | PC38 | PC39 | PC40 | PC41 | PC42 | PC43 | PC44 | PC45 | PC46 | PC47 | PC48 | PC49 | PC50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Standard deviation | 2.777921 | 2.247946 | 2.001303 | 1.89383 | 1.702017 | 1.198502 | 1.136224 | 1.01711 | 0.9866744 | 0.9615661 | 0.9448665 | 0.9220993 | 0.9110339 | 0.9008106 | 0.8827829 | 0.8606061 | 0.8554104 | 0.839683 | 0.8267318 | 0.8081084 | 0.8021401 | 0.7972646 | 0.779069 | 0.7732502 | 0.7641308 | 0.7483038 | 0.7313321 | 0.7246511 | 0.7206425 | 0.7054458 | 0.7004669 | 0.695075 | 0.6913724 | 0.6659864 | 0.6554384 | 0.6472175 | 0.6461166 | 0.6384365 | 0.6298325 | 0.6265253 | 0.6181612 | 0.6036435 | 0.6025689 | 0.5981394 | 0.5970718 | 0.583583 | 0.5730759 | 0.5691529 | 0.5529994 | 0.4669083 |
| Proportion of Variance | 0.154340 | 0.101070 | 0.080100 | 0.07173 | 0.057940 | 0.028730 | 0.025820 | 0.02069 | 0.0194700 | 0.0184900 | 0.0178600 | 0.0170100 | 0.0166000 | 0.0162300 | 0.0155900 | 0.0148100 | 0.0146300 | 0.014100 | 0.0136700 | 0.0130600 | 0.0128700 | 0.0127100 | 0.012140 | 0.0119600 | 0.0116800 | 0.0112000 | 0.0107000 | 0.0105000 | 0.0103900 | 0.0099500 | 0.0098100 | 0.009660 | 0.0095600 | 0.0088700 | 0.0085900 | 0.0083800 | 0.0083500 | 0.0081500 | 0.0079300 | 0.0078500 | 0.0076400 | 0.0072900 | 0.0072600 | 0.0071600 | 0.0071300 | 0.006810 | 0.0065700 | 0.0064800 | 0.0061200 | 0.0043600 |
| Cumulative Proportion | 0.154340 | 0.255400 | 0.335510 | 0.40724 | 0.465180 | 0.493900 | 0.519720 | 0.54041 | 0.5598800 | 0.5783800 | 0.5962300 | 0.6132400 | 0.6298400 | 0.6460700 | 0.6616500 | 0.6764700 | 0.6911000 | 0.705200 | 0.7188700 | 0.7319300 | 0.7448000 | 0.7575100 | 0.769650 | 0.7816100 | 0.7932900 | 0.8044900 | 0.8151800 | 0.8256900 | 0.8360700 | 0.8460300 | 0.8558400 | 0.865500 | 0.8750600 | 0.8839300 | 0.8925200 | 0.9009000 | 0.9092500 | 0.9174000 | 0.9253400 | 0.9331900 | 0.9408300 | 0.9481200 | 0.9553800 | 0.9625400 | 0.9696700 | 0.976480 | 0.9830500 | 0.9895200 | 0.9956400 | 1.0000000 |
The PCA variance table shows that the first 6 to 8 principal components explain about 49% to 54% of the variance in the data. Reaching roughly 80% of the variance requires 26 principal components.
The final choice of how many components to keep will depend on model evaluation. For now, we use the first 6 principal components as input to k-means clustering.
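The proportions in the table follow from the standard deviations: each component's variance is its squared standard deviation, divided by the total variance. The following is a minimal sketch in R, assuming the 50 items were standardized before PCA so that the total variance equals 50. That is an assumption, although it reproduces the tabulated values. The names `sdev`, `prop_var`, and `cum_var` are illustrative.

```r
# Standard deviations of the first eight components, copied from the table above
sdev <- c(2.777921, 2.247946, 2.001303, 1.89383,
          1.702017, 1.198502, 1.136224, 1.01711)

# Assumption: 50 standardized items, so the total variance is 50
total_var <- 50

prop_var <- sdev^2 / total_var   # proportion of variance per component
cum_var  <- cumsum(prop_var)     # cumulative proportion of variance

# Side-by-side view of the per-component and cumulative proportions
round(data.frame(PC = seq_along(sdev), prop_var, cum_var), 4)
```

Under this assumption, the cumulative value at the sixth component is the share of variance passed on to k-means when 6 components are kept.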
Next, we look for the value of k that gives a good fit for the clustering model, given the PCA input to the model.
Figure 5: Elbow Method for Optimal k
In the plot above, the elbow occurs at k=2, so we use k=2 for the model.
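The elbow plot can be produced by fitting k-means for a range of k and recording the total within-cluster sum of squares. The sketch below illustrates the idea on synthetic data. `toy_scores` is a hypothetical stand-in for the PCA scores, and the range 1 to 10 is an illustrative choice rather than the setting used in this study.

```r
set.seed(42)

# Hypothetical stand-in for the PCA scores: two loose groups in six dimensions
toy_scores <- rbind(
  matrix(rnorm(300 * 6, mean = -1.5), ncol = 6),
  matrix(rnorm(300 * 6, mean =  1.5), ncol = 6)
)

# Total within-cluster sum of squares (WSS) for k = 1..10
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(toy_scores, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which extra clusters reduce WSS only slightly
plot(k_values, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

Because WSS always decreases as k grows, the elbow is read visually and is best treated as a heuristic rather than a formal test.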
We fit a k-means clustering model to the data with k=2.
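# Fix the random seed so the cluster assignment is reproducible
set.seed(42)
# Run k-means with 25 random starts and keep the best solution
kmeans_result <- kmeans(pca_data, centers = 2, nstart = 25)
df_clean$Cluster <- as.factor(kmeans_result$cluster)
We can interpret the model with pairwise scatter plots of the principal components.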
# Use the cluster assignments from the k-means result
pca_data_with_cluster <- pca_data %>%
  mutate(Cluster = as.factor(df_clean$Cluster))
# Create pairwise scatter plots for the first 6 PCs
ggpairs(pca_data_with_cluster, columns = 1:6, aes(color = Cluster, alpha = 0.7)) +
  theme_minimal() +
  labs(title = "Pairwise Scatter Plots of First 6 PCs by Cluster")
Figure 6: Pairwise Scatter Plots of First 6 PCs by Clusters
The summary below reports the mean and standard deviation of each questionnaire item for clusters 1 and 2.
kable(
  df_clean %>%
    group_by(Cluster) %>%
    summarise(across(all_of(ocean_cols), list(mean = mean, sd = sd)))
)
| Cluster | EXT1_mean | EXT1_sd | EXT2_mean | EXT2_sd | EXT3_mean | EXT3_sd | EXT4_mean | EXT4_sd | EXT5_mean | EXT5_sd | EXT6_mean | EXT6_sd | EXT7_mean | EXT7_sd | EXT8_mean | EXT8_sd | EXT9_mean | EXT9_sd | EXT10_mean | EXT10_sd | EST1_mean | EST1_sd | EST2_mean | EST2_sd | EST3_mean | EST3_sd | EST4_mean | EST4_sd | EST5_mean | EST5_sd | EST6_mean | EST6_sd | EST7_mean | EST7_sd | EST8_mean | EST8_sd | EST9_mean | EST9_sd | EST10_mean | EST10_sd | AGR1_mean | AGR1_sd | AGR2_mean | AGR2_sd | AGR3_mean | AGR3_sd | AGR4_mean | AGR4_sd | AGR5_mean | AGR5_sd | AGR6_mean | AGR6_sd | AGR7_mean | AGR7_sd | AGR8_mean | AGR8_sd | AGR9_mean | AGR9_sd | AGR10_mean | AGR10_sd | CSN1_mean | CSN1_sd | CSN2_mean | CSN2_sd | CSN3_mean | CSN3_sd | CSN4_mean | CSN4_sd | CSN5_mean | CSN5_sd | CSN6_mean | CSN6_sd | CSN7_mean | CSN7_sd | CSN8_mean | CSN8_sd | CSN9_mean | CSN9_sd | CSN10_mean | CSN10_sd | OPN1_mean | OPN1_sd | OPN2_mean | OPN2_sd | OPN3_mean | OPN3_sd | OPN4_mean | OPN4_sd | OPN5_mean | OPN5_sd | OPN6_mean | OPN6_sd | OPN7_mean | OPN7_sd | OPN8_mean | OPN8_sd | OPN9_mean | OPN9_sd | OPN10_mean | OPN10_sd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3.197653 | 1.114255 | 2.189290 | 1.099072 | 3.965228 | 0.9058906 | 2.561354 | 1.0695974 | 3.990641 | 0.9238935 | 1.840075 | 0.8864732 | 3.486022 | 1.210761 | 2.994135 | 1.192727 | 3.536425 | 1.176324 | 2.944203 | 1.2135210 | 2.883025 | 1.283191 | 3.495029 | 1.110808 | 3.590431 | 1.1860263 | 2.949282 | 1.220582 | 2.530577 | 1.186436 | 2.453254 | 1.197802 | 2.688555 | 1.219714 | 2.269352 | 1.207805 | 2.664984 | 1.214276 | 2.302960 | 1.165657 | 1.963775 | 1.244857 | 4.320473 | 0.8209272 | 2.066392 | 1.196233 | 4.202817 | 0.9409185 | 1.957524 | 1.011916 | 3.876663 | 1.132339 | 1.758672 | 0.8603816 | 3.983791 | 0.9192772 | 4.064737 | 1.004347 | 4.034356 | 0.8602155 | 3.533553 | 1.064935 | 2.902685 | 1.378435 | 4.11392 | 0.9625651 | 2.306633 | 1.158251 | 2.818343 | 1.254877 | 2.640451 | 1.385987 | 3.787204 | 1.055819 | 2.203965 | 1.061983 | 3.358943 | 1.210066 | 3.793986 | 0.9579982 | 3.939005 | 1.012748 | 1.812282 | 0.9789907 | 4.158117 | 0.9722771 | 1.808834 | 0.9960333 | 4.100935 | 0.7945832 | 1.686435 | 0.973477 | 4.264987 | 0.8039078 | 3.355442 | 1.216698 | 4.221736 | 0.9255650 | 4.243709 | 0.8306251 |
| 2 | 1.978479 | 1.039871 | 3.481263 | 1.168287 | 2.524594 | 0.9886152 | 3.852566 | 0.9642721 | 2.534960 | 1.0901407 | 2.984245 | 1.2221053 | 1.964255 | 1.065790 | 3.927787 | 1.102821 | 2.393885 | 1.213094 | 4.275268 | 0.9198928 | 3.722863 | 1.202669 | 2.859468 | 1.182751 | 4.158637 | 0.9784697 | 2.378941 | 1.162219 | 3.171282 | 1.236309 | 3.270241 | 1.255162 | 3.426296 | 1.200516 | 3.117233 | 1.292246 | 3.522585 | 1.172322 | 3.390877 | 1.206532 | 2.495172 | 1.307167 | 3.407406 | 1.1227675 | 2.459896 | 1.297283 | 3.702583 | 1.1462423 | 2.633877 | 1.188523 | 3.644517 | 1.187065 | 2.691530 | 1.1364036 | 3.397199 | 1.0805045 | 3.531122 | 1.200103 | 3.177510 | 1.0157876 | 3.114805 | 1.141190 | 3.087033 | 1.357722 | 3.90401 | 1.0177459 | 2.986915 | 1.214491 | 2.356506 | 1.185598 | 3.070504 | 1.387172 | 3.677738 | 1.087474 | 2.768125 | 1.116259 | 2.967351 | 1.257926 | 3.470874 | 1.0099481 | 3.622644 | 1.108067 | 2.219790 | 1.1384828 | 3.980644 | 1.0678761 | 2.085789 | 1.0964003 | 3.579909 | 0.9759647 | 2.059209 | 1.128188 | 3.863421 | 0.9760795 | 3.216117 | 1.204177 | 4.222961 | 0.9495287 | 3.763114 | 1.0547610 |
# Summarise every item column (those starting with EXT, AGR, etc.) by cluster
summary_by_cluster <- df_clean %>%
  group_by(Cluster) %>%
  summarise(across(everything(), list(mean = mean, sd = sd), .names = "{.col}_{.fn}"))
# Pivot to long format so it's easier to plot
summary_long <- summary_by_cluster %>%
  pivot_longer(
    cols = -Cluster,
    names_to = c("Trait", "Stat"),
    names_sep = "_"
  ) %>%
  filter(Stat == "mean") # Only plot means for now
# Then calculate the mean per trait group
# (trait_cols is assumed to hold the Cluster column plus the item columns)
trait_summary <- trait_cols %>%
  pivot_longer(
    cols = -Cluster,
    names_to = "Trait_Item",
    values_to = "Value"
  ) %>%
  mutate(Trait = str_extract(Trait_Item, "^[A-Z]+")) %>% # Extract EXT, EST, AGR, etc.
  group_by(Cluster, Trait) %>%
  summarise(Mean = mean(Value, na.rm = TRUE), .groups = 'drop')
ggplot(trait_summary, aes(x = Trait, y = Mean, fill = as.factor(Cluster))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Big 5 Traits Mean per Cluster", x = "Trait", fill = "Cluster") +
  theme_minimal()
Figure 7: Big 5 Traits Mean per Cluster
With a cluster assigned to each data point, we can summarize which Big Five traits dominate in each cluster.
This study successfully applied unsupervised learning to Big Five personality trait data, yielding valuable insights for the development of more human-centered intelligent systems.
We found that the first 6 principal components capture about 49% of the variance in the dataset (about 54% with 8 components). These 6 components served as input for the k-means clustering model, for which the elbow method indicated 2 clusters (k=2). This suggests that the personality data can be segmented into two groups using a reduced number of dimensions, while retaining roughly half of the variability in the item responses.
Analysis of the clusters revealed distinct personality profiles. Cluster 1 exhibits higher mean scores across most of the Big Five dimensions, indicating a generally more pronounced expression of these traits on average within this group. In contrast, Cluster 2 shows a relative dominance only in Emotional Stability (EST), suggesting that individuals in this cluster tend to exhibit higher average emotional stability compared to those in Cluster 1. However, the box plots reveal that within these cluster-level averages, there is considerable variability and distributional nuance. For instance, while Cluster 1 has a higher mean for Extraversion (EXT), the box plot shows a wide spread of scores in both clusters, indicating that not all individuals in Cluster 1 are highly extraverted and some individuals in Cluster 2 also exhibit high extraversion.
These findings offer several key insights for the design of human-centered intelligent systems:
Personalization Potential: The identified clusters highlight the potential for tailoring system interactions to different personality types. For example, users in Cluster 1, characterized by higher average expression across most Big Five traits, might generally respond well to systems that are more engaging, interactive, and provide a richer set of features. However, it is crucial to acknowledge the variability observed in the box plots. Some users within Cluster 1 may still prefer minimalist interfaces, and some users within Cluster 2 may appreciate highly interactive features. Therefore, personalization should not be solely based on cluster assignment but should also consider individual variations within each cluster.
Emotional Stability Considerations: Cluster 2’s emphasis on emotional stability at the cluster level suggests that on average, users in this group may prefer systems that are reliable, predictable, and provide a supportive and calming user experience. The box plots further elaborate that while this trend holds, there’s a range of emotional stability scores within both clusters, implying that some users in Cluster 1 might also benefit from calming interfaces, and some users in Cluster 2 might tolerate or even prefer more dynamic interactions. These insights can inform the design of virtual agent interactions, user interface elements, and feedback mechanisms.
Strategic Implications: Understanding these personality-based clusters and the score distributions within them can enable developers to create more targeted and effective user experiences, potentially increasing user satisfaction and engagement. For instance, in the design of a virtual assistant, developers might consider offering a range of personality options, acknowledging the variability within clusters, rather than simply offering “Cluster 1” or “Cluster 2” personalities.
To further enhance this research and its practical applications, the following improvements are recommended:
Clustering Validation: Implement silhouette score analysis to quantitatively evaluate the quality of the clustering results. This will provide a measure of how well each data point fits within its assigned cluster and how distinct the clusters are from each other, adding robustness to the findings.
Alternative Clustering Methods: Explore alternative clustering algorithms, such as DBSCAN, which is less sensitive to outliers than k-means. This could provide additional insights into the data’s structure and potentially reveal different or more refined user groupings.
Multi-Modal Data Integration: Incorporate additional datasets, such as text-based data from user interactions with intelligent personal assistants, to provide a more comprehensive understanding of user personality. This multi-modal approach could lead to the development of more sophisticated and nuanced models of user behavior and preferences.
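The first two recommendations can be sketched as follows. This is a hedged illustration on synthetic data, not the study's pipeline. `toy_scores` is a hypothetical stand-in for a random subsample of the PCA scores, since silhouette analysis needs a pairwise distance matrix and is impractical on the full dataset. The DBSCAN step assumes the `dbscan` package is installed, and its `eps` and `minPts` values are placeholders that would need tuning.

```r
library(cluster)  # ships with standard R distributions; provides silhouette()

set.seed(42)

# Hypothetical stand-in for a random subsample of the PCA scores
toy_scores <- rbind(
  matrix(rnorm(200 * 6, mean = -1.5), ncol = 6),
  matrix(rnorm(200 * 6, mean =  1.5), ncol = 6)
)

km <- kmeans(toy_scores, centers = 2, nstart = 25)

# silhouette() takes the integer cluster labels and a distance object
sil <- silhouette(km$cluster, dist(toy_scores))
avg_sil <- mean(sil[, "sil_width"])
avg_sil  # values closer to 1 indicate better-separated clusters

# Optional: DBSCAN via the 'dbscan' package, if it is installed.
# eps and minPts are illustrative and would need tuning on real data.
if (requireNamespace("dbscan", quietly = TRUE)) {
  db <- dbscan::dbscan(toy_scores, eps = 2.5, minPts = 10)
  print(table(db$cluster))  # label 0 marks points treated as noise
}
```

Comparing the average silhouette width across several values of k would give a quantitative complement to the elbow plot.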
Actionable Recommendations for Industry:
Personalized User Interfaces: Design user interfaces that adapt to different personality types. For example, users in Cluster 1 might prefer feature-rich interfaces with opportunities for interaction and customization, while users in Cluster 2 might prefer minimalist designs that prioritize clarity and ease of use.
Virtual Agent Personalities: Develop virtual agent personalities that align with the identified clusters. This could involve offering users a choice of virtual agent personality (e.g., “engaging” vs. “calm”) to enhance user experience and satisfaction.
Targeted Marketing and Communication: Tailor marketing messages and communication strategies to resonate with different personality types. Understanding the preferences and communication styles of each cluster can lead to more effective user engagement and product adoption.
Adaptive System Behavior: Design systems that can adapt their behavior based on user personality. For example, a system might provide more proactive assistance to users in Cluster 1, while offering more guidance and support to users in Cluster 2.
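For a system to adapt to a user, it first needs to place that user in a cluster. One possible approach, sketched below under stated assumptions, projects a new respondent's answers into the fitted PCA space and picks the nearest k-means centroid. The data are synthetic, and the helper `assign_cluster` and its arguments are hypothetical names introduced for illustration only.

```r
set.seed(42)

# Hypothetical training data: 500 respondents, 10 items scored 1 to 5
train <- as.data.frame(matrix(sample(1:5, 500 * 10, replace = TRUE), ncol = 10))

# Fit PCA on standardized items, then cluster on the first 3 components
pca <- prcomp(train, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:3]
km <- kmeans(scores, centers = 2, nstart = 25)

# Assign new respondents to the nearest centroid in PCA space
assign_cluster <- function(new_responses, pca, km, n_pc) {
  # Project new answers with the same centering, scaling, and rotation
  new_scores <- predict(pca, newdata = new_responses)[, seq_len(n_pc), drop = FALSE]
  # Squared Euclidean distance from each new row to each centroid
  d <- sapply(seq_len(nrow(km$centers)), function(j) {
    rowSums(sweep(new_scores, 2, km$centers[j, ])^2)
  })
  d <- matrix(d, nrow = nrow(new_scores))
  # Index of the closest centroid for each row
  max.col(-d, ties.method = "first")
}

new_user <- train[1, ]
assign_cluster(new_user, pca, km, n_pc = 3)
```

Given the within-cluster variability noted earlier, such an assignment is better treated as a starting default that the system refines from observed behavior than as a fixed label.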
By pursuing these improvements, future research can provide even more valuable insights for the development of human-centered intelligent systems, ultimately leading to more personalized, effective, and satisfying user experiences.