Vu Nguyen Khue Ngan - s4140743
Last updated: 01 June, 2025
Personality types influence how an individual interacts in social settings.
Understanding personality types through social behavior indicators can improve communication strategies, group activities, and mental health assessments (Shaw, 2017).
Investigate to what extent social behavior data can accurately predict an individual’s personality type.
The data set was collected from Kaggle (Kapilavayi, 2025) and imported using the read_csv() function from the readr package.
kable(head(personality), format = "html") %>% kable_styling(
  bootstrap_options = c("striped", "bordered", "hover"),
  font_size = 12, full_width = FALSE, position = "center")

| Time_spent_Alone | Stage_fear | Social_event_attendance | Going_outside | Drained_after_socializing | Friends_circle_size | Post_frequency | Personality |
|---|---|---|---|---|---|---|---|
| 4 | No | 4 | 6 | No | 13 | 5 | Extrovert |
| 9 | Yes | 0 | 0 | Yes | 0 | 3 | Introvert |
| 9 | Yes | 1 | 2 | Yes | 5 | 2 | Introvert |
| 0 | No | 6 | 7 | No | 14 | 8 | Extrovert |
| 3 | No | 9 | 4 | No | 8 | 5 | Extrovert |
| 1 | No | 7 | 5 | No | 6 | 6 | Extrovert |
# Dimensions of the data set: 2,900 observations and 8 variables
[1] 2900 8
# Data type conversion from categorical variables to factor variables
personality$Stage_fear <- factor(personality$Stage_fear, levels = c("Yes", "No"))
personality$Drained_after_socializing <- factor(personality$Drained_after_socializing, levels = c("Yes", "No"))
personality$Personality <- factor(personality$Personality, levels = c("Introvert", "Extrovert"))
# 2 variable groups (exclude the 'Personality' variable because it is the target variable):
numeric_variables <- c("Time_spent_Alone", "Social_event_attendance", "Going_outside", "Friends_circle_size", "Post_frequency")
factor_variables <- c("Stage_fear", "Drained_after_socializing")
str(personality)
## spc_tbl_ [2,900 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Time_spent_Alone : num [1:2900] 4 9 9 0 3 1 4 2 10 0 ...
## $ Stage_fear : Factor w/ 2 levels "Yes","No": 2 1 1 2 2 2 2 2 1 2 ...
## $ Social_event_attendance : num [1:2900] 4 0 1 6 9 7 9 8 1 8 ...
## $ Going_outside : num [1:2900] 6 0 2 7 4 5 NA 4 3 6 ...
## $ Drained_after_socializing: Factor w/ 2 levels "Yes","No": 2 1 1 2 2 2 2 2 1 2 ...
## $ Friends_circle_size : num [1:2900] 13 0 5 14 8 6 7 7 0 13 ...
## $ Post_frequency : num [1:2900] 5 3 2 8 5 6 7 8 3 8 ...
## $ Personality : Factor w/ 2 levels "Introvert","Extrovert": 2 1 1 2 2 2 2 2 1 2 ...
## - attr(*, "spec")=
## .. cols(
## .. Time_spent_Alone = col_double(),
## .. Stage_fear = col_character(),
## .. Social_event_attendance = col_double(),
## .. Going_outside = col_double(),
## .. Drained_after_socializing = col_character(),
## .. Friends_circle_size = col_double(),
## .. Post_frequency = col_double(),
## .. Personality = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
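The conversion above sets the level order explicitly, which is why str() reports "Yes" as code 1 and "No" as code 2. A minimal sketch of that behaviour on a made-up vector (the values and the name `stage_fear_raw` are illustrative assumptions, not taken from the data set):

```r
# Toy character vector (hypothetical values, not from the data set)
stage_fear_raw <- c("No", "Yes", "Yes", "No", NA)

# Explicit levels: "Yes" becomes code 1 and "No" becomes code 2
stage_fear <- factor(stage_fear_raw, levels = c("Yes", "No"))

levels(stage_fear)                  # level order as specified
as.integer(stage_fear)              # underlying integer codes; NA stays NA
table(stage_fear, useNA = "ifany")  # counts per level, including missing values
```

Without the `levels` argument, factor() would order the levels alphabetically ("No" before "Yes"), so stating them explicitly keeps the coding predictable.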
summary(personality[personality$Personality == "Extrovert", numeric_variables]) -> table1
summary(personality[personality$Personality == "Extrovert", factor_variables]) -> table2
kable(table1) # Numeric Variables:

| Time_spent_Alone | Social_event_attendance | Going_outside | Friends_circle_size | Post_frequency |
|---|---|---|---|---|
| Min. : 0.000 | Min. : 0.000 | Min. :0.000 | Min. : 0.000 | Min. : 0.000 |
| 1st Qu.: 1.000 | 1st Qu.: 4.000 | 1st Qu.:4.000 | 1st Qu.: 7.000 | 1st Qu.: 4.000 |
| Median : 2.000 | Median : 6.000 | Median :5.000 | Median : 9.000 | Median : 6.000 |
| Mean : 2.067 | Mean : 6.016 | Mean :4.635 | Mean : 9.174 | Mean : 5.639 |
| 3rd Qu.: 3.000 | 3rd Qu.: 8.000 | 3rd Qu.:6.000 | 3rd Qu.:12.000 | 3rd Qu.: 7.000 |
| Max. :11.000 | Max. :10.000 | Max. :7.000 | Max. :15.000 | Max. :10.000 |
| NA’s :34 | NA’s :28 | NA’s :35 | NA’s :40 | NA’s :33 |

| Stage_fear | Drained_after_socializing |
|---|---|
| Yes : 111 | Yes : 111 |
| No :1338 | No :1362 |
| NA’s: 42 | NA’s: 18 |
summary(personality[personality$Personality == "Introvert", numeric_variables]) -> table3
summary(personality[personality$Personality == "Introvert", factor_variables]) -> table4
kable(table3) # Numeric Variables:

| Time_spent_Alone | Social_event_attendance | Going_outside | Friends_circle_size | Post_frequency |
|---|---|---|---|---|
| Min. : 0.00 | Min. :0.000 | Min. :0.000 | Min. : 0.000 | Min. :0.000 |
| 1st Qu.: 5.00 | 1st Qu.:0.000 | 1st Qu.:0.000 | 1st Qu.: 1.000 | 1st Qu.:0.000 |
| Median : 7.00 | Median :2.000 | Median :1.000 | Median : 3.000 | Median :1.000 |
| Mean : 7.08 | Mean :1.779 | Mean :1.273 | Mean : 3.197 | Mean :1.369 |
| 3rd Qu.: 9.00 | 3rd Qu.:3.000 | 3rd Qu.:2.000 | 3rd Qu.: 4.000 | 3rd Qu.:2.000 |
| Max. :11.00 | Max. :9.000 | Max. :7.000 | Max. :14.000 | Max. :9.000 |
| NA’s :29 | NA’s :34 | NA’s :31 | NA’s :37 | NA’s :32 |

| Stage_fear | Drained_after_socializing |
|---|---|
| Yes :1299 | Yes :1296 |
| No : 79 | No : 79 |
| NA’s: 31 | NA’s: 34 |
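The two sets of summaries are easier to compare side by side. One possible way to do that in base R is aggregate(), sketched here on a small made-up data frame (the name `toy` and its values are illustrative assumptions, not the real data):

```r
# Toy data (hypothetical values, not from the data set)
toy <- data.frame(
  Personality = factor(
    c("Introvert", "Introvert", "Extrovert", "Extrovert", "Extrovert"),
    levels = c("Introvert", "Extrovert")
  ),
  Time_spent_Alone    = c(9, 7, 2, NA, 3),
  Friends_circle_size = c(0, 5, 13, 14, 8)
)

numeric_cols <- c("Time_spent_Alone", "Friends_circle_size")

# Median of each numeric variable within each personality type, ignoring NAs
group_medians <- aggregate(
  toy[numeric_cols],
  by = list(Personality = toy$Personality),
  FUN = median, na.rm = TRUE
)

group_medians
```

The result has one row per personality type and one column per variable, which makes group differences visible at a glance.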
par(mfrow = c(1, 5), mar = c(4, 3, 2, 1))
for (colname in numeric_variables) {
  boxplot(personality[[colname]] ~ personality$Personality, main = colname,
          xlab = "Personality", col = c("lightblue", "lightpink"), border = "black")
}
Boxplots of Numeric Variables and Personality Types
par(mfrow = c(1, 5), mar = c(4, 4, 2, 1))
for (colname in numeric_variables) {
  hist(personality[[colname]], main = colname, xlab = colname,
       col = "lightblue", border = "black")
}
Histograms of Numeric Variables
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
for (colname in factor_variables) {
  tab <- table(personality[[colname]], personality$Personality)
  barplot(tab, beside = TRUE, col = c("lightblue", "lightpink"), legend = rownames(tab),
          main = colname, ylab = "Count", xlab = "Personality")
}
Bar plots of Factor Variables and Personality Types
# Percentage of missing values in the data set:
round(sum(is.na(personality))/(nrow(personality)*ncol(personality))*100, 2)
## [1] 1.97
## Time_spent_Alone Stage_fear Social_event_attendance
## 63 73 62
## Going_outside Drained_after_socializing Friends_circle_size
## 66 52 77
## Post_frequency Personality
## 65 0
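As a quick consistency check, the per-variable counts reported above should reproduce the overall percentage. The sketch below copies those counts and the data set dimensions (2,900 rows, 8 columns) from the output; the variable names `na_counts` and `pct_missing` are illustrative:

```r
# Missing-value counts per variable, copied from the output above
na_counts <- c(
  Time_spent_Alone = 63, Stage_fear = 73, Social_event_attendance = 62,
  Going_outside = 66, Drained_after_socializing = 52, Friends_circle_size = 77,
  Post_frequency = 65, Personality = 0
)

n_rows <- 2900  # observations
n_cols <- 8     # variables

# Share of missing cells = total missing / (rows * columns)
pct_missing <- round(sum(na_counts) / (n_rows * n_cols) * 100, 2)

sum(na_counts)  # total number of missing cells
pct_missing     # overall percentage of missing cells
```

The 458 missing cells out of 23,200 give the same 1.97% as the direct calculation.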
# Numeric variables: Median imputation using Hmisc package
for (colname in numeric_variables) {personality[[colname]] <- impute(personality[[colname]], fun = median)}
# Factor variables: Mode imputation using Hmisc package
for (colname in factor_variables) {personality[[colname]] <- impute(personality[[colname]], fun = mode)}
# Check the number of missing values in each variable after the imputation:
colSums(is.na(personality))
## Time_spent_Alone Stage_fear Social_event_attendance
## 0 0 0
## Going_outside Drained_after_socializing Friends_circle_size
## 0 0 0
## Post_frequency Personality
## 0 0
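One detail of the imputation step is worth spelling out: base R's mode() returns an object's storage mode, not its most frequent value. The Hmisc documentation for impute() states that fun is ignored for factors and the most frequent category is used instead, which matches the counts reported later in this report. A base R sketch of the same idea, assuming a hypothetical helper `stat_mode()` and made-up data:

```r
# Hypothetical helper: most frequent non-missing value (statistical mode)
stat_mode <- function(x) {
  counts <- table(x)                # table() drops NA by default
  names(counts)[which.max(counts)]  # the first level wins ties
}

# Toy data (hypothetical values, not from the data set)
toy <- data.frame(
  Going_outside = c(6, 0, NA, 7, 4),
  Stage_fear    = factor(c("No", "Yes", NA, "No", "No"), levels = c("Yes", "No"))
)

# Median imputation for the numeric variable
toy$Going_outside[is.na(toy$Going_outside)] <- median(toy$Going_outside, na.rm = TRUE)

# Mode imputation for the factor variable
toy$Stage_fear[is.na(toy$Stage_fear)] <- stat_mode(toy$Stage_fear)

toy
mode(toy$Stage_fear)  # base R's mode(): the storage mode, not the most frequent value
```

Unlike impute(), this version returns plain numeric and factor columns without the extra "impute" class.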
# Multivariate outlier detection for numeric variables using Mahalanobis distance with QQ plots
results <- mvn(data = personality[, numeric_variables], multivariateOutlierMethod = "quan", showOutliers = TRUE)
## [1] 12.55172
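To illustrate the idea behind Mahalanobis-based outlier detection, the sketch below uses base R's mahalanobis() with a fixed chi-square cutoff. This is a simplified stand-in, not a reimplementation of the mvn() call above, so the rows it flags will generally differ from MVN's result. The data are simulated and the 97.5% cutoff is an assumption chosen for illustration:

```r
set.seed(1)  # reproducible toy data

# Hypothetical numeric data: 200 rows, 3 variables, plus one planted extreme row
x <- matrix(rnorm(600), ncol = 3)
x <- rbind(x, c(8, 8, 8))

# Squared Mahalanobis distance of each row from the column means
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Flag rows above the 97.5% quantile of a chi-square with p degrees of freedom
cutoff <- qchisq(0.975, df = ncol(x))
is_outlier <- d2 > cutoff

# Percentage of rows flagged as multivariate outliers
round(mean(is_outlier) * 100, 2)
```

Because the mean and covariance are estimated from data that include the outliers themselves, this classical version can mask extreme points; robust estimates reduce that problem.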
The outliers are not removed from the data set because:
The Mann-Whitney U test was used for the numeric variables because the variables are not normally distributed.
wilcox.test(Time_spent_Alone ~ Personality, data = personality, alternative = "two.sided", conf.int = TRUE) # "Time_spent_Alone" variable
##
## Wilcoxon rank sum test with continuity correction
##
## data: Time_spent_Alone by Personality
## W = 1930380, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## 5.000011 5.000014
## sample estimates:
## difference in location
## 5.000046
wilcox.test(Social_event_attendance ~ Personality, data = personality, alternative = "two.sided", conf.int = TRUE) # "Social_event_attendance" variable
##
## Wilcoxon rank sum test with continuity correction
##
## data: Social_event_attendance by Personality
## W = 164651, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -4.999998 -4.000052
## sample estimates:
## difference in location
## -4.000049
wilcox.test(Going_outside ~ Personality, data = personality, alternative = "two.sided", conf.int = TRUE) # "Going_outside" variable
##
## Wilcoxon rank sum test with continuity correction
##
## data: Going_outside by Personality
## W = 162980, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -3.999948 -3.000008
## sample estimates:
## difference in location
## -3.99994
wilcox.test(Friends_circle_size ~ Personality, data = personality, alternative = "two.sided", conf.int = TRUE) # "Friends_circle_size" variable
##
## Wilcoxon rank sum test with continuity correction
##
## data: Friends_circle_size by Personality
## W = 191530, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -6.000044 -5.999961
## sample estimates:
## difference in location
## -6.000027
wilcox.test(Post_frequency ~ Personality, data = personality, alternative = "two.sided", conf.int = TRUE) # "Post_frequency" variable
##
## Wilcoxon rank sum test with continuity correction
##
## data: Post_frequency by Personality
## W = 155426, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -4.999976 -4.000055
## sample estimates:
## difference in location
## -4.000016
A two-tailed Mann-Whitney U test was used to test whether each of the five numeric variables differs significantly between extroverts and introverts. All five tests return p-values < 2.2e-16, indicating a significant difference between the two groups for every variable. The estimated location shifts (introverts minus extroverts) are about +5 for Time_spent_Alone and between -4 and -6 for the other four variables.
All five numeric variables are therefore statistically significant indicators of an individual's personality type.
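The W statistic reported by wilcox.test() can be reproduced by hand: rank the pooled sample, sum the ranks of the first group, and subtract the smallest value that sum could take. A minimal sketch on made-up samples (the values are illustrative assumptions; in the tests above the first group is "Introvert" because it is the first factor level):

```r
# Toy samples (hypothetical values, not from the data set)
introvert <- c(7, 9, 5, 8, 10, 6)
extrovert <- c(2, 3, 1, 4, 2, 5)

# Rank the pooled sample; tied values receive the average rank
r  <- rank(c(introvert, extrovert))
n1 <- length(introvert)

# W = rank sum of the first group minus its minimum possible value
w_manual <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2

# Same statistic from wilcox.test(); exact = FALSE because the data contain ties
wt <- wilcox.test(introvert, extrovert, alternative = "two.sided", exact = FALSE)

w_manual
unname(wt$statistic)
```

A W close to its maximum (n1 times n2) or to zero means the two groups barely overlap, which is the pattern the very small p-values above point to.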
The Chi-square test for association was used to test for statistically significant association between the factor variables and personality types.
(chi_stage <- chisq.test(table(personality$Stage_fear,personality$Personality))) # "Stage_fear" variable
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(personality$Stage_fear, personality$Personality)
## X-squared = 2079.4, df = 1, p-value < 2.2e-16
(chi_drained <- chisq.test(table(personality$Drained_after_socializing,personality$Personality))) # "Drained_after_socializing" variable
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(personality$Drained_after_socializing, personality$Personality)
## X-squared = 2069.2, df = 1, p-value < 2.2e-16
##
## Introvert Extrovert
## Yes 685.0655 724.9345
## No 723.9345 766.0655
##
## Introvert Extrovert
## Yes 683.6079 723.3921
## No 725.3921 767.6079
##
## Introvert Extrovert
## Yes 1299 111
## No 110 1380
##
## Introvert Extrovert
## Yes 1296 111
## No 113 1380
A Pearson’s Chi-squared test with Yates’ continuity correction was applied to each of the two factor variables. Both tests return p-values < 2.2e-16, indicating a significant association between each variable and personality type. All expected counts are far above 5 (the smallest is about 684), so the chi-square approximation is reasonable.
Both factor variables are therefore statistically significant indicators of an individual's personality type.
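The Stage_fear result can be re-derived from the observed counts printed above. The sketch below rebuilds the 2 x 2 table from those counts, computes the expected counts by hand, and compares them with what chisq.test() returns (the object names `observed`, `expected_manual`, and `chi` are illustrative):

```r
# Observed counts for Stage_fear by Personality, copied from the output above
observed <- matrix(
  c(1299, 110,    # Introvert column: Yes, No
    111, 1380),   # Extrovert column: Yes, No
  nrow = 2,
  dimnames = list(Stage_fear  = c("Yes", "No"),
                  Personality = c("Introvert", "Extrovert"))
)

# Expected count = row total * column total / grand total
expected_manual <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# chisq.test() applies Yates' continuity correction to 2 x 2 tables by default
chi <- chisq.test(observed)

expected_manual  # expected counts under independence
chi$expected     # the same values as computed by chisq.test()
chi$statistic    # test statistic
```

The large gap between the observed and expected counts in every cell is what drives the very large test statistic.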
Gonzalez, A. (2022, November 22). What Is an Extrovert? WebMD. https://www.webmd.com/balance/what-is-an-extrovert
Ellis, R. (2022, September 3). Introvert Personality. WebMD. https://www.webmd.com/balance/introvert-personality-overview
Kapilavayi, R. (2025). Extrovert vs. Introvert Behavior Data. Kaggle.com. https://www.kaggle.com/datasets/rakeshkapilavai/extrovert-vs-introvert-behavior-data
Shaw, A. (2017, February 23). Are you really an introvert? Medium. https://medium.com/@anthonypjshaw/are-you-really-an-introvert-161e09819466