Investigating the Relationships Between Social Behavior Variables and Personality Types

Vu Nguyen Khue Ngan - s4140743

Last updated: 01 June, 2025

Introduction

Personality types influence how an individual interacts in social settings.

Two Key Personality Types:

Understanding personality types through social behavior indicators can improve communication strategies, group activities, and mental health assessments.

(Shaw, 2017)

Problem Statement

Aim

Investigate to what extent social behavior data can accurately predict an individual’s personality type.

Statistical Approaches

Data Collection

Source

The data set was obtained from Kaggle (Kapilavayi, 2025).

Collection Method

Data

Features Overview:

Notes:

Descriptive Statistics and Visualisation

Data set preview

library(readr)       # read_csv()
library(knitr)       # kable()
library(kableExtra)  # kable_styling() and the %>% pipe
personality <- read_csv("personality_dataset.csv")
kable(head(personality), format = "html") %>% kable_styling(
  bootstrap_options = c("striped", "bordered", "hover"),
  font_size = 12, full_width = FALSE, position = "center")
Time_spent_Alone Stage_fear Social_event_attendance Going_outside Drained_after_socializing Friends_circle_size Post_frequency Personality
4 No 4 6 No 13 5 Extrovert
9 Yes 0 0 Yes 0 3 Introvert
9 Yes 1 2 Yes 5 2 Introvert
0 No 6 7 No 14 8 Extrovert
3 No 9 4 No 8 5 Extrovert
1 No 7 5 No 6 6 Extrovert
dim(personality) # The data set dimension:

[1] 2900 8

Descriptive Statistics and Visualisation Cont.

Data structure after data type conversion

# Convert the categorical (character) columns to factors with explicit level order
personality$Stage_fear <- factor(personality$Stage_fear, levels = c("Yes", "No"))
personality$Drained_after_socializing <- factor(personality$Drained_after_socializing, levels = c("Yes", "No"))
personality$Personality <- factor(personality$Personality, levels = c("Introvert", "Extrovert"))
# Two groups of predictors ('Personality' is excluded because it is the target variable):
numeric_variables <- c("Time_spent_Alone", "Social_event_attendance", "Going_outside", "Friends_circle_size", "Post_frequency")
factor_variables <- c("Stage_fear", "Drained_after_socializing")
str(personality)
## spc_tbl_ [2,900 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Time_spent_Alone         : num [1:2900] 4 9 9 0 3 1 4 2 10 0 ...
##  $ Stage_fear               : Factor w/ 2 levels "Yes","No": 2 1 1 2 2 2 2 2 1 2 ...
##  $ Social_event_attendance  : num [1:2900] 4 0 1 6 9 7 9 8 1 8 ...
##  $ Going_outside            : num [1:2900] 6 0 2 7 4 5 NA 4 3 6 ...
##  $ Drained_after_socializing: Factor w/ 2 levels "Yes","No": 2 1 1 2 2 2 2 2 1 2 ...
##  $ Friends_circle_size      : num [1:2900] 13 0 5 14 8 6 7 7 0 13 ...
##  $ Post_frequency           : num [1:2900] 5 3 2 8 5 6 7 8 3 8 ...
##  $ Personality              : Factor w/ 2 levels "Introvert","Extrovert": 2 1 1 2 2 2 2 2 1 2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Time_spent_Alone = col_double(),
##   ..   Stage_fear = col_character(),
##   ..   Social_event_attendance = col_double(),
##   ..   Going_outside = col_double(),
##   ..   Drained_after_socializing = col_character(),
##   ..   Friends_circle_size = col_double(),
##   ..   Post_frequency = col_double(),
##   ..   Personality = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Descriptive Statistics and Visualisation Cont.

Summary of Variables for Extrovert personality

table1 <- summary(personality[personality$Personality == "Extrovert", numeric_variables])
table2 <- summary(personality[personality$Personality == "Extrovert", factor_variables])
kable(table1) # Numeric Variables:
Time_spent_Alone Social_event_attendance Going_outside Friends_circle_size Post_frequency
Min. : 0.000 Min. : 0.000 Min. :0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 1.000 1st Qu.: 4.000 1st Qu.:4.000 1st Qu.: 7.000 1st Qu.: 4.000
Median : 2.000 Median : 6.000 Median :5.000 Median : 9.000 Median : 6.000
Mean : 2.067 Mean : 6.016 Mean :4.635 Mean : 9.174 Mean : 5.639
3rd Qu.: 3.000 3rd Qu.: 8.000 3rd Qu.:6.000 3rd Qu.:12.000 3rd Qu.: 7.000
Max. :11.000 Max. :10.000 Max. :7.000 Max. :15.000 Max. :10.000
NA’s :34 NA’s :28 NA’s :35 NA’s :40 NA’s :33
kable(table2) # Factor Variables:
Stage_fear Drained_after_socializing
Yes : 111 Yes : 111
No :1338 No :1362
NA’s: 42 NA’s: 18

Descriptive Statistics and Visualisation Cont.

Summary of Variables for Introvert personality

table3 <- summary(personality[personality$Personality == "Introvert", numeric_variables])
table4 <- summary(personality[personality$Personality == "Introvert", factor_variables])
kable(table3) # Numeric Variables:
Time_spent_Alone Social_event_attendance Going_outside Friends_circle_size Post_frequency
Min. : 0.00 Min. :0.000 Min. :0.000 Min. : 0.000 Min. :0.000
1st Qu.: 5.00 1st Qu.:0.000 1st Qu.:0.000 1st Qu.: 1.000 1st Qu.:0.000
Median : 7.00 Median :2.000 Median :1.000 Median : 3.000 Median :1.000
Mean : 7.08 Mean :1.779 Mean :1.273 Mean : 3.197 Mean :1.369
3rd Qu.: 9.00 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.: 4.000 3rd Qu.:2.000
Max. :11.00 Max. :9.000 Max. :7.000 Max. :14.000 Max. :9.000
NA’s :29 NA’s :34 NA’s :31 NA’s :37 NA’s :32
kable(table4) # Factor Variables:
Stage_fear Drained_after_socializing
Yes :1299 Yes :1296
No : 79 No : 79
NA’s: 31 NA’s: 34
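The per-group summaries above can also be condensed into a single table of group medians. The following is a minimal sketch, assuming a small made-up data frame (`toy`, a hypothetical name with invented values) in place of the real data set:

```r
# Toy stand-in for the personality data (values are invented for illustration)
toy <- data.frame(
  Time_spent_Alone = c(4, 9, 9, 0, 3, NA),
  Post_frequency   = c(5, 3, 2, 8, NA, 6),
  Personality      = factor(
    c("Extrovert", "Introvert", "Introvert", "Extrovert", "Extrovert", "Introvert"),
    levels = c("Introvert", "Extrovert")
  )
)

# Median of each numeric column within each personality type, ignoring NAs
group_medians <- aggregate(
  toy[c("Time_spent_Alone", "Post_frequency")],
  by  = list(Personality = toy$Personality),
  FUN = median, na.rm = TRUE
)
print(group_medians)
```

Applied to the real data, the same call with `numeric_variables` as the column selection would give one row per personality type.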

Descriptive Statistics and Visualisation Cont.

par(mfrow = c(1, 5), mar = c(4, 3, 2, 1))
for (colname in numeric_variables) {boxplot(personality[[colname]] ~ personality$Personality,
          main = colname, xlab = "Personality", col = c("lightblue", "lightpink"), border = "black")}

Boxplots of Numeric Variables and Personality Types

par(mfrow = c(1, 5), mar = c(4, 4, 2, 1))
for (colname in numeric_variables) {hist(personality[[colname]], main = colname, xlab = colname,
       col = "lightblue", border = "black")}

Histograms of Numeric Variables

Descriptive Statistics and Visualisation Cont.

par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
for (colname in factor_variables) {tab <- table(personality[[colname]], personality$Personality)
  barplot(tab, beside = TRUE, col = c("lightblue", "lightpink"), legend = rownames(tab), main = colname, ylab = "Count", xlab = "Personality")}

Bar plots of Factor Variables and Personality Types

Key Findings:

Handling Missing Values

# Percentage of missing values in the data set:
round(sum(is.na(personality))/(nrow(personality)*ncol(personality))*100, 2)
## [1] 1.97
# Number of missing values in each variable:
colSums(is.na(personality))
##          Time_spent_Alone                Stage_fear   Social_event_attendance 
##                        63                        73                        62 
##             Going_outside Drained_after_socializing       Friends_circle_size 
##                        66                        52                        77 
##            Post_frequency               Personality 
##                        65                         0
# Numeric variables: median imputation using the Hmisc package
for (colname in numeric_variables) {personality[[colname]] <- impute(personality[[colname]], fun = median)}
# Factor variables: mode imputation using the Hmisc package
# (impute() is understood to fill factors with the most frequent level; base R's mode() itself returns the storage mode, not the statistical mode)
for (colname in factor_variables) {personality[[colname]] <- impute(personality[[colname]], fun = mode)}
# Check the number of missing values in each variable after the imputation:
colSums(is.na(personality))
##          Time_spent_Alone                Stage_fear   Social_event_attendance 
##                         0                         0                         0 
##             Going_outside Drained_after_socializing       Friends_circle_size 
##                         0                         0                         0 
##            Post_frequency               Personality 
##                         0                         0
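To make the imputation step concrete, here is a minimal base-R sketch of median and mode imputation. The helper names `impute_median` and `impute_mode` and the toy vectors are assumptions introduced for illustration; they are not part of Hmisc or of the original analysis. A custom mode helper is needed because base R's `mode()` reports an object's storage mode rather than its most frequent value.

```r
# Toy vectors with missing values (invented for illustration)
going_outside <- c(6, 0, 2, 7, NA, 5)
stage_fear    <- factor(c("No", "Yes", "Yes", NA, "No", "No"),
                        levels = c("Yes", "No"))

# Median imputation: replace each NA with the median of the observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Mode imputation: replace each NA with the most frequent level
# (ties are broken by taking the first level in level order)
impute_mode <- function(x) {
  counts <- table(x)  # table() ignores NAs by default
  x[is.na(x)] <- names(counts)[which.max(counts)]
  x
}

going_outside_filled <- impute_median(going_outside)
stage_fear_filled    <- impute_mode(stage_fear)
```

Both helpers use a single fill value for the whole column, so the imputed values ignore personality type. This is the same trade-off as the single-value imputation used above.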

Outlier Detection and Handling

# Multivariate outlier detection for numeric variables using Mahalanobis distance with QQ plots
results <- mvn(data = personality[, numeric_variables], multivariateOutlierMethod = "quan", showOutliers = TRUE)

nrow(results$multivariateOutliers)/nrow(personality)*100 # Percentage of outliers:
## [1] 12.55172
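The idea behind Mahalanobis-distance outlier flagging can be sketched in base R. This is a simplified illustration on simulated data, assuming classical mean and covariance estimates and a 97.5% chi-square cut-off (a common convention, chosen here as an assumption). The MVN routine used above may rely on robust estimates, so its flagged rows need not match this approach.

```r
set.seed(1)
# Simulated two-column numeric data (invented for illustration)
x <- cbind(a = rnorm(200), b = rnorm(200))
x <- rbind(x, c(a = 8, b = -8))  # one planted extreme point (row 201)

# Squared Mahalanobis distance of every row from the sample centre
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Flag rows beyond the 97.5% chi-square quantile (df = number of columns)
cutoff  <- qchisq(0.975, df = ncol(x))
flagged <- which(d2 > cutoff)

# Percentage of rows flagged as multivariate outliers
100 * length(flagged) / nrow(x)
```

Because the cut-off is a fixed quantile, a few percent of rows are flagged even in clean, normally distributed data. A flagged row is therefore a candidate for inspection rather than proof of an error.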

The outliers are not removed from the data set because:

Hypothesis Testing

The Mann-Whitney U test was used for the numeric variables because the variables are not normally distributed.

Assumptions:

wilcox.test(Time_spent_Alone ~ Personality, data = personality, alternative = "two.sided", conf.int = TRUE) # "Time_spent_Alone" variable
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Time_spent_Alone by Personality
## W = 1930380, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  5.000011 5.000014
## sample estimates:
## difference in location 
##               5.000046
wilcox.test(Social_event_attendance ~ Personality, data = personality, alternative = "two.sided", conf.int = TRUE) # "Social_event_attendance" variable
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Social_event_attendance by Personality
## W = 164651, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -4.999998 -4.000052
## sample estimates:
## difference in location 
##              -4.000049

Hypothesis Testing Cont.

wilcox.test(Going_outside ~ Personality, data = personality, alternative = "two.sided", conf.int = TRUE) # "Going_outside" variable
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Going_outside by Personality
## W = 162980, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -3.999948 -3.000008
## sample estimates:
## difference in location 
##               -3.99994
wilcox.test(Friends_circle_size ~ Personality, data = personality, alternative = "two.sided", conf.int = TRUE) # "Friends_circle_size" variable
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Friends_circle_size by Personality
## W = 191530, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -6.000044 -5.999961
## sample estimates:
## difference in location 
##              -6.000027
wilcox.test(Post_frequency ~ Personality, data = personality, alternative = "two.sided", conf.int = TRUE) # "Post_frequency" variable
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Post_frequency by Personality
## W = 155426, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -4.999976 -4.000055
## sample estimates:
## difference in location 
##              -4.000016

Hypothesis Testing Cont.

Summary Report of Results:

Two-tailed Mann-Whitney U tests were used to compare each of the five numeric variables between introverts and extroverts. All five tests gave p-values < 2.2e-16, indicating a significant difference in location between the two groups for every variable. The estimated shifts (introvert minus extrovert) are about 5 units higher for Time_spent_Alone and roughly 4 to 6 units lower for the other four variables.

All five numeric variables therefore differ significantly between the two personality types, which makes them candidate indicators for predicting an individual’s personality type.
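For readers unfamiliar with the W statistic reported by `wilcox.test()`, the sketch below shows how it arises from ranks. The two toy samples are invented and contain no ties; the variable names are illustrative only.

```r
# Two small toy samples without ties (invented for illustration)
introvert <- c(7.1, 9.3, 8.2, 6.4)
extrovert <- c(2.5, 3.8, 1.2, 6.9, 0.7)

# Rank all observations together, then sum the ranks of the first group
ranks <- rank(c(introvert, extrovert))
n1    <- length(introvert)
W     <- sum(ranks[seq_len(n1)]) - n1 * (n1 + 1) / 2

# Equivalent view: W counts the (introvert, extrovert) pairs
# in which the introvert value is the larger one
pairs_greater <- sum(outer(introvert, extrovert, ">"))

W  # 19 for these toy values, out of 4 * 5 = 20 possible pairs
```

A W close to the number of possible pairs (or close to 0) means that one group's values almost always exceed the other's. This is why the tests above yield such small p-values.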

Categorical Association

The Chi-square test of association was used to test for a statistically significant association between each factor variable and personality type.

Assumption:

(chi_stage <- chisq.test(table(personality$Stage_fear,personality$Personality))) # "Stage_fear" variable
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(personality$Stage_fear, personality$Personality)
## X-squared = 2079.4, df = 1, p-value < 2.2e-16
(chi_drained <- chisq.test(table(personality$Drained_after_socializing,personality$Personality))) # "Drained_after_socializing" variable
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(personality$Drained_after_socializing, personality$Personality)
## X-squared = 2069.2, df = 1, p-value < 2.2e-16
chi_stage$expected # expected counts all above 5
##      
##       Introvert Extrovert
##   Yes  685.0655  724.9345
##   No   723.9345  766.0655
chi_drained$expected # expected counts all above 5
##      
##       Introvert Extrovert
##   Yes  683.6079  723.3921
##   No   725.3921  767.6079

Categorical Association Cont.

chi_stage$observed
##      
##       Introvert Extrovert
##   Yes      1299       111
##   No        110      1380
chi_drained$observed
##      
##       Introvert Extrovert
##   Yes      1296       111
##   No        113      1380

Summary Report of Results:

Pearson’s Chi-squared tests with Yates’ continuity correction were applied to the two factor variables. Both gave p-values < 2.2e-16, indicating a significant association between each variable and personality type.

Both factor variables are therefore candidate indicators for predicting an individual’s personality type.
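The expected counts and the test statistic can be reproduced by hand from the observed Stage_fear table reported above. This is a minimal sketch of the standard calculation; the object names are illustrative.

```r
# Observed counts for Stage_fear by Personality, as reported above
observed <- matrix(
  c(1299, 110, 111, 1380), nrow = 2,
  dimnames = list(Stage_fear  = c("Yes", "No"),
                  Personality = c("Introvert", "Extrovert"))
)

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates-corrected statistic for a 2x2 table: shrink each |O - E| by 0.5
x_squared <- sum((abs(observed - expected) - 0.5)^2 / expected)

round(expected, 4)
round(x_squared, 1)
```

The large gap between observed and expected counts in every cell (for example 1299 observed against roughly 685 expected) is what drives the very large statistic.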

Discussion

Major Findings:

Strengths:

Discussion Cont.

Limitations:

Future Directions:

References

Gonzalez, A. (2022, November 22). What Is an Extrovert? WebMD. https://www.webmd.com/balance/what-is-an-extrovert

Ellis, R. (2022, September 3). Introvert Personality. WebMD. https://www.webmd.com/balance/introvert-personality-overview

Kapilavayi, R. (2025). Extrovert vs. Introvert Behavior Data. Kaggle.com. https://www.kaggle.com/datasets/rakeshkapilavai/extrovert-vs-introvert-behavior-data?

Shaw, A. (2017, February 23). Are you really an introvert? Medium. https://medium.com/@anthonypjshaw/are-you-really-an-introvert-161e09819466