Introduction

Personality types influence how an individual interacts in social settings.

Two Key Personality Types:

Introvert: An individual with an introverted personality often prefers time alone and avoids crowded social settings (Ellis, 2022).
Extrovert: An individual with an extroverted personality thrives during social interactions and group activities (Gonzalez, 2022).

Understanding personality types through social behavior indicators can improve communication strategies, group activities, and mental health assessments (Shaw, 2017).

Problem Statement

Aim

Investigate to what extent social behavior data can accurately predict an individual’s personality type.

Statistical Approaches

Use a statistical data set containing social behavior variables.
Preprocess the data (data type conversion, handling missing values, outlier detection).
Understand and visualize the distribution and relationships between social behavior variables.
Use hypothesis testing and confidence intervals to test for significant differences between extroverts and introverts.
Use a categorical association test to check for significant associations between the categorical variables and personality types.

Data Collection

Source

The data set was collected from Kaggle.

Title: Extrovert vs. Introvert Behavior Data
Size: 2,900 rows × 8 columns
Description: Contains observations on individuals’ social behavior and corresponding personality types (introvert or extrovert)

Collection Method

The data set was downloaded as a CSV file directly from Kaggle.
Data was imported into R using the read_csv() function from the readr package.
The sampling method is not specified in the original Kaggle data set by the authors.

Data

Features Overview:

Time_spent_Alone: Number of hours an individual typically spends alone daily (0–11 hours)
Stage_fear: Indicates if the person experiences stage fear (Factor with levels: Yes, No)
Social_event_attendance: Frequency of attending social events per week (0–10)
Going_outside: Frequency of going outside per week (0–7, the range observed in the group summaries)
Drained_after_socializing: Indicates if the person feels drained after socializing (Factor with levels: Yes, No)
Friends_circle_size: Number of close friends (0–15)
Post_frequency: Frequency of social media posting per week (0–10)
Personality: Target variable (Factor with levels: Introvert, Extrovert)

Notes:

Numerical variables are measured on a fixed scale that represents the frequency or quantity of social behaviors made by an individual.
Categorical variables are converted into factor variables in R with levels for proper interpretation by statistical models during classification tasks.
Pre-processing steps: data type conversion, handling missing values, and outlier detection.

Descriptive Statistics and Visualisation

Data set preview

library(readr)      # read_csv()
library(knitr)      # kable()
library(kableExtra) # kable_styling()
personality <- read_csv("personality_dataset.csv")

kable(head(personality), format = "html") %>% kable_styling(
  bootstrap_options = c("striped", "bordered", "hover"),
    font_size = 12, full_width = FALSE, position = "center")

Time_spent_Alone	Stage_fear	Social_event_attendance	Going_outside	Drained_after_socializing	Friends_circle_size	Post_frequency	Personality
4	No	4	6	No	13	5	Extrovert
9	Yes	0	0	Yes	0	3	Introvert
9	Yes	1	2	Yes	5	2	Introvert
0	No	6	7	No	14	8	Extrovert
3	No	9	4	No	8	5	Extrovert
1	No	7	5	No	6	6	Extrovert

dim(personality) # The data set dimension:

[1] 2900 8

Descriptive Statistics and Visualisation Cont.

Data structure after data type conversion

# Data type conversion from categorical variables to factor variables
personality$Stage_fear <- factor(personality$Stage_fear, levels = c("Yes", "No"))
personality$Drained_after_socializing <- factor(personality$Drained_after_socializing, levels = c("Yes", "No")) 
personality$Personality <- factor(personality$Personality, levels = c("Introvert", "Extrovert"))
# 2 variable groups (exclude the 'Personality' variable because it is the target variable):
numeric_variables <- c("Time_spent_Alone", "Social_event_attendance", "Going_outside", "Friends_circle_size", "Post_frequency")
factor_variables <- c("Stage_fear", "Drained_after_socializing")
str(personality)

## spc_tbl_ [2,900 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Time_spent_Alone         : num [1:2900] 4 9 9 0 3 1 4 2 10 0 ...
##  $ Stage_fear               : Factor w/ 2 levels "Yes","No": 2 1 1 2 2 2 2 2 1 2 ...
##  $ Social_event_attendance  : num [1:2900] 4 0 1 6 9 7 9 8 1 8 ...
##  $ Going_outside            : num [1:2900] 6 0 2 7 4 5 NA 4 3 6 ...
##  $ Drained_after_socializing: Factor w/ 2 levels "Yes","No": 2 1 1 2 2 2 2 2 1 2 ...
##  $ Friends_circle_size      : num [1:2900] 13 0 5 14 8 6 7 7 0 13 ...
##  $ Post_frequency           : num [1:2900] 5 3 2 8 5 6 7 8 3 8 ...
##  $ Personality              : Factor w/ 2 levels "Introvert","Extrovert": 2 1 1 2 2 2 2 2 1 2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Time_spent_Alone = col_double(),
##   ..   Stage_fear = col_character(),
##   ..   Social_event_attendance = col_double(),
##   ..   Going_outside = col_double(),
##   ..   Drained_after_socializing = col_character(),
##   ..   Friends_circle_size = col_double(),
##   ..   Post_frequency = col_double(),
##   ..   Personality = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Descriptive Statistics and Visualisation Cont.

Summary of Variables for Extrovert personality

table1 <- summary(personality[personality$Personality == "Extrovert", numeric_variables])
table2 <- summary(personality[personality$Personality == "Extrovert", factor_variables])
kable(table1) # Numeric Variables:

Time_spent_Alone	Social_event_attendance	Going_outside	Friends_circle_size	Post_frequency
Min. : 0.000	Min. : 0.000	Min. :0.000	Min. : 0.000	Min. : 0.000
1st Qu.: 1.000	1st Qu.: 4.000	1st Qu.:4.000	1st Qu.: 7.000	1st Qu.: 4.000
Median : 2.000	Median : 6.000	Median :5.000	Median : 9.000	Median : 6.000
Mean : 2.067	Mean : 6.016	Mean :4.635	Mean : 9.174	Mean : 5.639
3rd Qu.: 3.000	3rd Qu.: 8.000	3rd Qu.:6.000	3rd Qu.:12.000	3rd Qu.: 7.000
Max. :11.000	Max. :10.000	Max. :7.000	Max. :15.000	Max. :10.000
NA’s :34	NA’s :28	NA’s :35	NA’s :40	NA’s :33

kable(table2) # Factor Variables:

	Stage_fear	Drained_after_socializing
	Yes : 111	Yes : 111
	No :1338	No :1362
	NA’s: 42	NA’s: 18

Descriptive Statistics and Visualisation Cont.

Summary of Variables for Introvert personality

table3 <- summary(personality[personality$Personality == "Introvert", numeric_variables])
table4 <- summary(personality[personality$Personality == "Introvert", factor_variables])
kable(table3) # Numeric Variables:

Time_spent_Alone	Social_event_attendance	Going_outside	Friends_circle_size	Post_frequency
Min. : 0.00	Min. :0.000	Min. :0.000	Min. : 0.000	Min. :0.000
1st Qu.: 5.00	1st Qu.:0.000	1st Qu.:0.000	1st Qu.: 1.000	1st Qu.:0.000
Median : 7.00	Median :2.000	Median :1.000	Median : 3.000	Median :1.000
Mean : 7.08	Mean :1.779	Mean :1.273	Mean : 3.197	Mean :1.369
3rd Qu.: 9.00	3rd Qu.:3.000	3rd Qu.:2.000	3rd Qu.: 4.000	3rd Qu.:2.000
Max. :11.00	Max. :9.000	Max. :7.000	Max. :14.000	Max. :9.000
NA’s :29	NA’s :34	NA’s :31	NA’s :37	NA’s :32

kable(table4) # Factor Variables:

	Stage_fear	Drained_after_socializing
	Yes :1299	Yes :1296
	No : 79	No : 79
	NA’s: 31	NA’s: 34

Descriptive Statistics and Visualisation Cont.

par(mfrow = c(1, 5), mar = c(4, 3, 2, 1))
for (colname in numeric_variables) {
  boxplot(personality[[colname]] ~ personality$Personality, main = colname,
          xlab = "Personality", col = c("lightblue", "lightpink"), border = "black")
}

Boxplots of Numeric Variables and Personality Types

par(mfrow = c(1, 5), mar = c(4, 4, 2, 1))
for (colname in numeric_variables) {
  hist(personality[[colname]], main = colname, xlab = colname,
       col = "lightblue", border = "black")
}

Histograms of Numeric Variables

Descriptive Statistics and Visualisation Cont.

par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
for (colname in factor_variables) {
  tab <- table(personality[[colname]], personality$Personality)
  barplot(tab, beside = TRUE, col = c("lightblue", "lightpink"),
          legend = rownames(tab), main = colname, ylab = "Count", xlab = "Personality")
}

Bar plots of Factor Variables and Personality Types

Key Findings:

There are consistent patterns between the social behavior variables and personality traits.
Extroverts are more comfortable in social settings, while introverts more often report stage fear and feeling drained after socializing.
Extroverts are more socially active, while introverts prefer spending more time alone.
Outliers appear in both groups, indicating variability in social behaviors and personality types.
The plots support the definition of extroverted and introverted personalities.
The social behavior variables could be used to predict an individual’s personality trait.

Handling Missing Values

# Percentage of missing values in the data set:
round(sum(is.na(personality))/(nrow(personality)*ncol(personality))*100, 2)

## [1] 1.97

# Number of missing values in each variable:
colSums(is.na(personality))

##          Time_spent_Alone                Stage_fear   Social_event_attendance 
##                        63                        73                        62 
##             Going_outside Drained_after_socializing       Friends_circle_size 
##                        66                        52                        77 
##            Post_frequency               Personality 
##                        65                         0

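library(Hmisc) # impute()
# Numeric variables: median imputation using the Hmisc package
for (colname in numeric_variables) {personality[[colname]] <- impute(personality[[colname]], fun = median)}
# Factor variables: imputation with the most frequent level using the Hmisc package.
# Note: base R mode() is not a statistical mode; for factors, impute() appears to fill
# with the most frequent level regardless of fun, which matches the observed counts later on.
for (colname in factor_variables) {personality[[colname]] <- impute(personality[[colname]], fun = mode)}
# Check the number of missing values in each variable after the imputation: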
colSums(is.na(personality))

##          Time_spent_Alone                Stage_fear   Social_event_attendance 
##                         0                         0                         0 
##             Going_outside Drained_after_socializing       Friends_circle_size 
##                         0                         0                         0 
##            Post_frequency               Personality 
##                         0                         0
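To make the imputation step concrete, the following is a minimal base R sketch of median imputation and most-frequent-level imputation on a small made-up data frame. The helper names impute_median and impute_most_frequent and the toy values are assumptions for illustration only; they are not part of this report or of the Hmisc package.

```r
# Toy data with missing values (made-up values, for illustration only)
toy <- data.frame(
  hours_alone = c(4, 9, NA, 0, 3),
  stage_fear  = factor(c("No", "Yes", "Yes", NA, "Yes"), levels = c("Yes", "No"))
)

# Hypothetical helper: replace missing numeric values with the observed median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Hypothetical helper: replace missing factor values with the most frequent level
# (base R's mode() reports the storage mode, so table() is used instead)
impute_most_frequent <- function(x) {
  counts <- table(x)  # NA values are excluded by default
  x[is.na(x)] <- names(counts)[which.max(counts)]
  x
}

toy$hours_alone <- impute_median(toy$hours_alone)
toy$stage_fear  <- impute_most_frequent(toy$stage_fear)

# Number of missing values left in each column
colSums(is.na(toy))
```

Imputing with a single overall value keeps every row but shrinks the spread of each variable slightly, which is worth keeping in mind when reading the tests that follow.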

Outliers Detection and Handling

library(MVN) # mvn()
# Multivariate outlier detection for numeric variables using Mahalanobis distance with QQ plots
results <- mvn(data = personality[, numeric_variables], multivariateOutlierMethod = "quan", showOutliers = TRUE)

nrow(results$multivariateOutliers)/nrow(personality)*100 # Percentage of outliers:

## [1] 12.55172

The outliers are not removed from the data set because:

Observations flagged as outliers reflect variability in social behavior, so keeping them lets a predictive model resemble real-world scenarios
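As a rough illustration of distance-based outlier flagging, the sketch below computes classical squared Mahalanobis distances in base R and flags rows above a chi-square quantile. It is a simplified stand-in under stated assumptions: the data are synthetic, the 97.5% cutoff is an assumed choice, and the classical mean and covariance are used, whereas the MVN package may rely on robust estimates, so the flagged percentage need not match the figure reported above.

```r
set.seed(1)  # reproducible synthetic data (not the report's data)

# 200 made-up observations on 3 numeric variables, plus one extreme row
x <- matrix(rnorm(200 * 3), ncol = 3)
x <- rbind(x, c(10, 10, 10))

# Squared Mahalanobis distance of each row from the column means
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Flag rows whose squared distance exceeds the 97.5% chi-square quantile,
# with degrees of freedom equal to the number of variables
cutoff <- qchisq(0.975, df = ncol(x))
is_outlier <- d2 > cutoff

# Percentage of flagged rows
mean(is_outlier) * 100
```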

Hypothesis Testing

The Mann-Whitney U test was used for the numeric variables because the variables are not normally distributed.

Assumptions:

The Extrovert and Introvert groups, and the observations within them, are independent of each other
The variables are continuous and the distribution shapes are similar across the groups
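The choice of a rank-based test rests on the statement that the numeric variables are not normally distributed. One way such a statement could be checked is a Shapiro-Wilk test within each group; the sketch below shows this on synthetic bounded counts. The toy data frame and its column names are assumptions for illustration, and this is not necessarily the check performed for this report.

```r
set.seed(42)  # reproducible synthetic data (not the report's data)

# Made-up bounded counts that mimic a 0-10 frequency scale
toy <- data.frame(
  score = c(sample(0:3, 300, replace = TRUE), sample(4:10, 300, replace = TRUE)),
  group = factor(rep(c("Introvert", "Extrovert"), each = 300),
                 levels = c("Introvert", "Extrovert"))
)

# Shapiro-Wilk test within each group; a small p-value is evidence
# against normality for that group
p_values <- tapply(toy$score, toy$group, function(v) shapiro.test(v)$p.value)
p_values
```

With heavily tied, bounded counts like these, the test rejects normality, which is the situation in which the Mann-Whitney U test is a common alternative to a t-test.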

wilcox.test(Time_spent_Alone ~ Personality, data = personality, alternative = "two.sided", conf.int = TRUE) # "Time_spent_Alone" variable

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Time_spent_Alone by Personality
## W = 1930380, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  5.000011 5.000014
## sample estimates:
## difference in location 
##               5.000046

wilcox.test(Social_event_attendance ~ Personality, data = personality, alternative = "two.sided", conf.int = TRUE) # "Social_event_attendance" variable

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Social_event_attendance by Personality
## W = 164651, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -4.999998 -4.000052
## sample estimates:
## difference in location 
##              -4.000049

Hypothesis Testing Cont.

wilcox.test(Going_outside ~ Personality, data = personality, alternative = "two.sided", conf.int = TRUE) # "Going_outside" variable

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Going_outside by Personality
## W = 162980, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -3.999948 -3.000008
## sample estimates:
## difference in location 
##               -3.99994

wilcox.test(Friends_circle_size ~ Personality, data = personality, alternative = "two.sided", conf.int = TRUE) # "Friends_circle_size" variable

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Friends_circle_size by Personality
## W = 191530, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -6.000044 -5.999961
## sample estimates:
## difference in location 
##              -6.000027

wilcox.test(Post_frequency ~ Personality, data = personality, alternative = "two.sided", conf.int = TRUE) # "Post_frequency" variable

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Post_frequency by Personality
## W = 155426, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -4.999976 -4.000055
## sample estimates:
## difference in location 
##              -4.000016

Hypothesis Testing Cont.

Summary Report of Results:

A two-tailed Mann-Whitney U test was used to test whether each of the five numeric variables differs significantly between extroverts and introverts. All five tests have p-values < 2.2e-16, indicating significant differences between extroverts and introverts for every variable.

Time_spent_Alone: 95% CI [5.000011, 5.000014], W = 1930380; introverts spend an estimated 5 more hours alone per day than extroverts
Social_event_attendance: 95% CI [-4.999998, -4.000052], W = 164651; extroverts attend an estimated 4 more social events than introverts
Going_outside: 95% CI [-3.999948, -3.000008], W = 162980; extroverts go outside an estimated 4 more times than introverts
Friends_circle_size: 95% CI [-6.000044, -5.999961], W = 191530; extroverts have an estimated 6 more close friends than introverts
Post_frequency: 95% CI [-4.999976, -4.000055], W = 155426; extroverts post an estimated 4 more times than introverts

All five numeric variables are significant indicators for predicting an individual's personality type.
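Because p-values this small say little about the size of a difference, the following sketch shows two ways to read the reported numbers. It assumes group sizes of 1409 introverts and 1491 extroverts, derived from the group summary tables in this report, and uses the reported W for Time_spent_Alone; the toy vectors at the end are made-up values used only to show how a median of pairwise differences is computed.

```r
# Group sizes (derived from the group summaries) and the reported W
n_introvert  <- 1409
n_extrovert  <- 1491
w_time_alone <- 1930380  # W for Time_spent_Alone

# W counts the (introvert, extrovert) pairs in which the introvert value is
# larger, with ties counted as one half. Dividing by the number of pairs gives
# the estimated probability that a random introvert spends more time alone
# than a random extrovert.
prob_superiority <- w_time_alone / (n_introvert * n_extrovert)
round(prob_superiority, 3)

# The "difference in location" is a Hodges-Lehmann type estimate: the median
# of all pairwise differences, not the difference of the two group medians.
introvert_toy <- c(7, 9, 8)  # made-up values
extrovert_toy <- c(2, 3, 1)  # made-up values
pairwise_diff <- outer(introvert_toy, extrovert_toy, "-")
median(pairwise_diff)
```

A value of W divided by the number of pairs close to 1 or 0 indicates that the two groups barely overlap on that variable, which is consistent with the boxplots.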

Categorical Association

The Chi-square test for association was used to test for statistically significant association between the factor variables and personality types.

Assumption:

No more than 25% of the cells have expected counts below 5. This is satisfied, as the expected-count tables below show.

(chi_stage <- chisq.test(table(personality$Stage_fear,personality$Personality))) # "Stage_fear" variable

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(personality$Stage_fear, personality$Personality)
## X-squared = 2079.4, df = 1, p-value < 2.2e-16

(chi_drained <- chisq.test(table(personality$Drained_after_socializing,personality$Personality))) # "Drained_after_socializing" variable

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(personality$Drained_after_socializing, personality$Personality)
## X-squared = 2069.2, df = 1, p-value < 2.2e-16

chi_stage$expected # expected counts all above 5

##      
##       Introvert Extrovert
##   Yes  685.0655  724.9345
##   No   723.9345  766.0655

chi_drained$expected # expected counts all above 5

##      
##       Introvert Extrovert
##   Yes  683.6079  723.3921
##   No   725.3921  767.6079

Categorical Association Cont.

chi_stage$observed

##      
##       Introvert Extrovert
##   Yes      1299       111
##   No        110      1380

chi_drained$observed

##      
##       Introvert Extrovert
##   Yes      1296       111
##   No        113      1380

Summary Report of Results:

A Pearson’s Chi-squared test with Yates’ continuity correction was applied to each of the two factor variables. Both tests give p-values < 2.2e-16, indicating a significant association between each variable and personality type.

Stage_fear: χ² = 2079.4; introverts are significantly more likely to experience stage fear than extroverts
Drained_after_socializing: χ² = 2069.2; introverts are significantly more likely to feel drained after socializing than extroverts

The two factor variables are significant indicators for predicting an individual's personality type.
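The Stage_fear result can be re-derived from the observed counts printed in this report, which also shows where the expected counts come from. The sketch below rebuilds the 2 x 2 table by hand and adds a phi coefficient as an effect size; the phi coefficient is an addition for illustration and was not part of the original analysis.

```r
# Observed counts for Stage_fear by Personality, as printed in this report
observed <- matrix(c(1299, 110, 111, 1380), nrow = 2,
                   dimnames = list(Stage_fear = c("Yes", "No"),
                                   Personality = c("Introvert", "Extrovert")))

# Expected count under independence = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-squared test (Yates' correction is applied by default for 2 x 2 tables)
test <- chisq.test(observed)

# Phi coefficient (equal to Cramer's V for a 2 x 2 table) as an effect size,
# here based on the Yates-corrected statistic
phi <- sqrt(unname(test$statistic) / sum(observed))

round(expected, 4)
round(unname(test$statistic), 1)
round(phi, 3)
```

A phi value near 1 would indicate that stage fear almost determines the personality label in this data set, so the association is strong as well as statistically significant.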

Discussion

Major Findings:

All social behavior variables in the data set are significant indicators of an individual's personality type: the Mann-Whitney U tests show significant differences between extroverts and introverts across the five numeric variables, and the Chi-squared tests show strong associations between both categorical variables and personality type.
Boxplots and bar plots highlight a clear distinction between extroverts and introverts across the variables.
The distinguishable patterns and statistical significance suggest that a regression analysis could yield a meaningful predictive model for personality types.
The findings support the definitions of the two personality types given in the Introduction.

Strengths:

Missing values were handled with median and mode imputation, and outlier detection was implemented
Multiple plots were used to visualize the statistics and illustrate the relationships between the variables
Two different tests were used, matched to the variable types, to test for statistical significance

Discussion Cont.

Limitations:

The sampling method is not stated for the data set, so potential sampling bias cannot be assessed or ruled out
There are only two factor variables, each with only two levels; more factor variables would introduce more variability to the data set
This report only considers the relationship between each variable individually and the target variable

Future Directions:

Incorporate more data sources and other features, such as emotional, psychological, and physical factors, to introduce more variability to the data set
Investigate the relationships among the social behavior variables in relation to the target variable of personality type

References

Gonzalez, A. (2022, November 22). What Is an Extrovert? WebMD. https://www.webmd.com/balance/what-is-an-extrovert

Ellis, R. (2022, September 3). Introvert Personality. WebMD. https://www.webmd.com/balance/introvert-personality-overview

Kapilavayi, R. (2025). Extrovert vs. Introvert Behavior Data. Kaggle.com. https://www.kaggle.com/datasets/rakeshkapilavai/extrovert-vs-introvert-behavior-data?

Shaw, A. (2017, February 23). Are you really an introvert? Medium. https://medium.com/@anthonypjshaw/are-you-really-an-introvert-161e09819466

Investigating the relationships between Social Behavior variables and Personality Types
