## Rows: 2,900
## Columns: 8
## $ Time_spent_Alone <dbl> 4, 9, 9, 0, 3, 1, 4, 2, 10, 0, 3, 10, 3, 3, …
## $ Stage_fear <chr> "No", "Yes", "Yes", "No", "No", "No", "No", …
## $ Social_event_attendance <dbl> 4, 0, 1, 6, 9, 7, 9, 8, 1, 8, 9, 3, 6, 6, 3,…
## $ Going_outside <dbl> 6, 0, 2, 7, 4, 5, 3, 4, 3, 6, 6, 1, 7, 4, 0,…
## $ Drained_after_socializing <chr> "No", "Yes", "Yes", "No", "No", "No", "No", …
## $ Friends_circle_size <dbl> 13, 0, 5, 14, 8, 6, 7, 7, 0, 13, 15, 4, 14, …
## $ Post_frequency <dbl> 5, 3, 2, 8, 5, 6, 7, 8, 3, 8, 5, 0, 10, 7, 3…
## $ Personality <chr> "Extrovert", "Introvert", "Introvert", "Extr…
I chose this dataset of 2,900 people because I wanted to move beyond the usual personality clichés. We often talk about introversion and extroversion as simple labels, but I was interested in seeing how those labels actually translate into real-world behavior. This specific data is great because it doesn’t just look at how many parties someone goes to; it tracks the internal “cost” of those events, like feeling drained or experiencing stage fear.
Before diving into the modeling, it was essential to transform the raw survey data into a format that is convenient to model in R. We renamed variables to shorter, more manageable handles like alone and social to keep the code readable. We also converted the binary strings (“Yes”/“No”) into numeric indicators (\(1\) and \(0\)). This is more than cosmetic: a 0/1 coding makes explicit which level counts as the “event” when the variable enters a GLM, and it lets us treat the indicators like any other numeric column. Setting personality as a factor ensures that our visualizations and models treat it as a categorical grouping rather than as free text.
# Shorten the column names and add 0/1 indicators for the binary survey items
clean <- personality %>%
  rename(
    alone = Time_spent_Alone,
    social = Social_event_attendance,
    outside = Going_outside,
    friends = Friends_circle_size,
    posts = Post_frequency
  ) %>%
  mutate(
    stage_fear = if_else(Stage_fear == "Yes", 1, 0),
    drained = if_else(Drained_after_socializing == "Yes", 1, 0),
    introvert = if_else(Personality == "Introvert", 1, 0),
    personality = factor(Personality)  # categorical copy used for plotting
  )
glimpse(clean)
## Rows: 2,900
## Columns: 12
## $ alone <dbl> 4, 9, 9, 0, 3, 1, 4, 2, 10, 0, 3, 10, 3, 3, …
## $ Stage_fear <chr> "No", "Yes", "Yes", "No", "No", "No", "No", …
## $ social <dbl> 4, 0, 1, 6, 9, 7, 9, 8, 1, 8, 9, 3, 6, 6, 3,…
## $ outside <dbl> 6, 0, 2, 7, 4, 5, 3, 4, 3, 6, 6, 1, 7, 4, 0,…
## $ Drained_after_socializing <chr> "No", "Yes", "Yes", "No", "No", "No", "No", …
## $ friends <dbl> 13, 0, 5, 14, 8, 6, 7, 7, 0, 13, 15, 4, 14, …
## $ posts <dbl> 5, 3, 2, 8, 5, 6, 7, 8, 3, 8, 5, 0, 10, 7, 3…
## $ Personality <chr> "Extrovert", "Introvert", "Introvert", "Extr…
## $ stage_fear <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1,…
## $ drained <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1,…
## $ introvert <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1,…
## $ personality <fct> Extrovert, Introvert, Introvert, Extrovert, …
RQ1 — Does time spent alone nonlinearly predict social event attendance? H1: The relationship between alone-time and social attendance is not linear (EDF > 1 in a GAM).
RQ2 — Does the shape of that relationship differ across the distribution of social activity? H2: The smooth function estimated by a QGAM will differ between low and high quantiles.
RQ3 — Which behavioural variables best distinguish Introverts from Extroverts? H3: Drained-after-socialising and alone-time will be the strongest Introvert predictors in a logistic GLM.
scatter <- clean %>%
  ggplot(
    aes(
      x = alone,
      y = social,
      color = personality
    )
  ) +
  geom_point(alpha = 0.35) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(
    title = "Time Alone vs Social Attendance",
    x = "Time Spent Alone",
    y = "Social Event Attendance",
    color = "Personality"
  ) +
  theme_minimal()
scatter
The loess curves show that the relationship is not linear, so a GAM is used below to account for the non-linearity.
clean %>%
  ggplot(
    aes(
      x = personality,
      y = alone,
      fill = personality
    )
  ) +
  geom_violin(alpha = 0.6) +
  geom_boxplot(width = 0.15) +
  labs(
    title = "Distribution of Time Alone by Personality Type",
    x = NULL,
    y = "Time Spent Alone"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
The violin plots reveal a clear “personality gap.” Introverts show a
much wider distribution at higher levels of solitary time, while
Extroverts are tightly clustered at the lower end. However, the overlap
in the middle of these plots is exactly why a simple average won’t tell
the whole story. If the groups were perfectly separated, we wouldn’t
need advanced statistics; because they “bleed” into each other, we need
to find the non-linear thresholds where one identity typically
transitions into the other.
Roughly 51.4% Extrovert and 48.5% Introvert, so the two classes are about as balanced as you’d want.
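The class split quoted above can be checked with a quick tabulation. The sketch below uses a tiny made-up data frame (`toy`, which is not the survey data) purely to show the pattern; on the real data the same pipeline would presumably be applied to `clean`.

```r
library(dplyr)

# Hypothetical stand-in for the `clean` data frame (labels only, invented)
toy <- data.frame(
  personality = factor(c("Extrovert", "Introvert", "Extrovert",
                         "Introvert", "Extrovert"))
)

# Count rows per class and convert the counts to shares of the total
balance <- toy %>%
  count(personality) %>%
  mutate(share = n / sum(n))

balance
```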
A Generalised Additive Model swaps out the rigid straight-line term you’d get from OLS for a flexible smooth function s(alone). The key thing is that we don’t have to decide the shape upfront; the data figures it out. A wiggliness penalty, with its strength chosen by REML, keeps the smooth from getting too wiggly.
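A minimal sketch of how such a model might be fitted with mgcv. The data are simulated (`toy` is a hypothetical stand-in and the curve shape is invented), and `fit.toy` is an illustrative name. The only settings taken from this report are the formula `social ~ s(alone)` and REML smoothness selection.

```r
library(mgcv)

# Simulated stand-in for the survey data (NOT the real dataset)
set.seed(1)
n <- 500
toy <- data.frame(alone = sample(0:11, n, replace = TRUE))
toy$social <- 8 / (1 + exp(toy$alone - 4)) + rnorm(n, sd = 1)

# One penalized smooth of alone-time; smoothing parameter chosen by REML
fit.toy <- gam(social ~ s(alone), data = toy, method = "REML")

# Effective degrees of freedom of s(alone): 1 would mean a straight line
summary(fit.toy)$edf
```

On the real data the analogous call would presumably be `gam(social ~ s(alone), data = clean, method = "REML")`, which matches the formula and the `-REML` score shown in the summary.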
##
## Family: gaussian
## Link function: identity
##
## Formula:
## social ~ s(alone)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.96335 0.03108 127.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(alone) 8.725 8.978 628.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.66 Deviance explained = 66.1%
## -REML = 5631.9 Scale est. = 2.8022 n = 2900
Effective Degrees of Freedom (EDF ≈ 8.7): An EDF of 1 corresponds to a straight line, so a value of 8.7 indicates a strongly nonlinear relationship. The EDF measures how flexible the fitted smooth is; it does not count “tipping points.” If the default basis dimension (k = 10) was used, as the formula s(alone) suggests, the ceiling is 9, so the smooth is using nearly all of the flexibility it was given.
Statistical Significance (p < 2e-16): The p-value is below the smallest value R prints. This is very strong evidence against the null hypothesis that the smooth is flat, that is, that alone-time is unrelated to social attendance in this sample of 2,900 people. It is not the probability that alone-time has no effect, and p-values for smooth terms are approximate.
Adjusted \(R^2\) (0.66): The model explains about 66% of the variance in social attendance using only alone-time as a predictor. For a single-variable behavioral model this is a high level of explanatory power, although roughly a third of the variance is still left to other factors.
Intercept (3.96): Because the smooth s(alone) is centered to sum to zero over the data, the intercept is the overall mean of the response: the average participant has a social attendance score of nearly 4.0.
When we overlay the GAM smooth on the scatter plot, we see why linear
regression would have failed us. A straight line would either
overestimate social attendance for loners or underestimate it for
socialites. The GAM “snakes” through the data, respecting the fact that
social energy holds steady for a few hours before hitting a cliff. This
flexibility allows the model to respect the “human” nature of the
data—where habits are rarely constant.
plot(fit.gam, residuals = TRUE, shade = TRUE,
     xlab = "Time Spent Alone",
     ylab = "Effect on Social Attendance",
     main = "GAM Smooth: s(alone)")
To check the model assumptions, I used the DHARMa package, which builds standardized (quantile) residuals by simulating new responses from the fitted model. Comparing the observed data against these simulations gives a more principled diagnostic for problems such as overdispersion or systematic bias than raw residual plots do.
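The simulation-based check described above might look like the following sketch. It refits a GAM on simulated data (`toy` and `fit.toy` are hypothetical names, not objects from this report) and then calls DHARMa on it; the number of simulations is simply DHARMa's default, not a setting confirmed by the report.

```r
library(mgcv)
library(DHARMa)

# Simulated stand-in for the survey data (NOT the real dataset)
set.seed(1)
n <- 500
toy <- data.frame(alone = sample(0:11, n, replace = TRUE))
toy$social <- 8 / (1 + exp(toy$alone - 4)) + rnorm(n, sd = 1)
fit.toy <- gam(social ~ s(alone), data = toy, method = "REML")

# Simulate new responses from the fitted model and scale the residuals to [0, 1]
res <- simulateResiduals(fittedModel = fit.toy, n = 250)

plot(res)                                   # QQ plot and residual-vs-predicted panel
uniformity <- testUniformity(res, plot = FALSE)
uniformity$p.value
```

A well-specified model should give scaled residuals that look uniform, so a pronounced S-shape in the QQ panel is the kind of pattern to watch for.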
The initial model summary shows that time spent alone is a strong predictor of social behavior, explaining roughly 66% of the variance in attendance. However, the DHARMa “stress test” reveals that our standard model has significant patterned errors, as shown by the S-curve wobble in the QQ plot. This means the model isn’t just making random mistakes; it consistently misses the mark in specific zones. A Gaussian GAM describes only the conditional mean and assumes the same error spread everywhere, but the spread of people’s behavior is not constant across the range of alone-time, so a one-size-fits-all average does not describe this dataset well. To address this, we move to QGAMs, which let us model the full spectrum of social attendance, from the 10th-percentile recluses to the 90th-percentile socialites, rather than just the “middle of the road” average. Because each quantile gets its own curve, this approach can adapt to the uneven variability that produced the wiggles in our diagnostics.
## 6 · Method 2 — QGAM
A Quantile GAM goes one step further than the standard GAM: instead of fitting one smooth to the average response, it fits a separate smooth at each quantile τ you ask for. That lets us ask the actually interesting question — does alone-time affect people at the bottom of the social-attendance distribution the same way it affects people at the top?
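A minimal sketch of fitting several quantiles at once with the qgam package. The data are simulated and the object names (`toy`, `fit.toy.q`, `grid`) are hypothetical; the formula and the three quantiles (0.1, 0.5, 0.9) are the ones that appear in the output below.

```r
library(qgam)

# Simulated stand-in for the survey data (NOT the real dataset)
set.seed(1)
n <- 400
toy <- data.frame(alone = sample(0:11, n, replace = TRUE))
toy$social <- 8 / (1 + exp(toy$alone - 4)) + rnorm(n, sd = 1)

# Fit one smooth per requested quantile
qus <- c(0.1, 0.5, 0.9)
fit.toy.q <- mqgam(social ~ s(alone), data = toy, qu = qus)

# qdo() applies a function to the fit for each requested quantile
qdo(fit.toy.q, qus, summary)

# Predicted 10th, 50th and 90th percentile of attendance at each alone-time value
grid <- data.frame(alone = 0:11)
pred <- sapply(qus, function(q) qdo(fit.toy.q, q, predict, newdata = grid))
colnames(pred) <- paste0("q", qus)
```

Passing `plot` instead of `summary` to `qdo()` draws the fitted smooth for each quantile, which is one way to compare the curves side by side.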
## Estimating learning rate. Each dot corresponds to a loss evaluation.
## qu = 0.5.........done
## qu = 0.1...............done
## qu = 0.9................done
## [[1]]
##
## Family: elf
## Link function: identity
##
## Formula:
## social ~ s(alone)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.88683 0.02841 66.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(alone) 8.818 8.988 7946 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.64 Deviance explained = 83.5%
## -REML = 5933.7 Scale est. = 1 n = 2900
##
## [[2]]
##
## Family: elf
## Link function: identity
##
## Formula:
## social ~ s(alone)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.8779 0.0387 100.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(alone) 8.733 8.98 3739 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.657 Deviance explained = 53.7%
## -REML = 5885.7 Scale est. = 1 n = 2900
##
## [[3]]
##
## Family: elf
## Link function: identity
##
## Formula:
## social ~ s(alone)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.26290 0.03493 179.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(alone) 8.653 8.965 11043 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.609 Deviance explained = 81%
## -REML = 6388.3 Scale est. = 1 n = 2900
Different Levels (Intercepts: 1.89 vs 3.88 vs 6.26) The most obvious contrast is the overall height of each curve. Because the smooth is centered, each intercept is the average height of that quantile curve over the sample rather than its value at zero alone-time. The 10th-percentile curve sits at a very low social score (1.89), while the 90th-percentile curve sits more than three times higher (6.26). This is consistent with the idea that some people have a naturally larger “social battery” than others.
The “Predictability” Gap (Deviance Explained: 83.5% vs 53.7% vs 81%) The fit is much tighter at the extremes: alone-time accounts for 83.5% of the deviance at the 10th percentile and 81% at the 90th, but only 53.7% at the median. These figures are computed from the quantile loss at each τ, so they are not classification accuracies and should be compared across quantiles with some caution. Read at face value, they suggest that the lower and upper edges of social attendance track alone-time closely, while attendance in the middle of the distribution depends more on other factors (like mood or the specific event).
The Complexity of the “Crash” (edf: 8.82 vs 8.73 vs 8.65) Despite their different levels, all three curves have nearly the same “wiggle” score (around 8.7). This tells us that the “social crash” has a similarly complex shape at every quantile. It’s not a simple linear drop, and the bend affects the “Socialite” end of the distribution just as much as the “Reserved” end.
Our data shows that alone time isn’t a “one-size-fits-all” factor. By looking at the different layers of the attendance distribution, we see that the most social end operates on a completely different level (6.26) than the most reserved end (1.89). Most importantly, while alone-time explains over 80% of the deviance at the extreme quantiles, the “average” person remains much more of a mystery, with the median model capturing only about 54%. This suggests that to understand social energy, we have to look at the extremes, not just the average.
The Shared Pattern: All three quantile curves follow a “cliff-edge” trend. Social activity stays high and steady for a while, but once alone-time passes a certain threshold (around the “4” mark on the x-axis), social attendance drops sharply at every quantile.
The Difference (The Intercepts): Look at the height of the lines. The plot on the right (the 90th percentile) sits much higher than the plot on the left. This visually confirms that “Social Butterflies” have a much higher baseline, but they still show the same sharp drop-off as the “Reserved” group.
To conclude the analysis, the research question was inverted: instead of predicting social frequency, the goal was to determine whether behavior can accurately classify a person as an Introvert or Extrovert. A Logistic Generalized Linear Model (GLM) was employed to map seven behavioral predictors onto the log-odds of being an Introvert. To make these results interpretable for a general audience, the coefficients were exponentiated into Odds Ratios, providing a clear multiplier for how much each behavior changes the odds of being classified as an Introvert.
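Written out, the model described above takes the following form, where \(p_i\) is the probability that respondent \(i\) is an Introvert. The symbols \(\beta_0, \dots, \beta_7\) are notation introduced here for illustration; the predictors are the seven used in the fitted formula.

\[
\log\frac{p_i}{1 - p_i} = \beta_0 + \beta_1\,\text{alone}_i + \beta_2\,\text{social}_i + \beta_3\,\text{outside}_i + \beta_4\,\text{friends}_i + \beta_5\,\text{posts}_i + \beta_6\,\text{stage\_fear}_i + \beta_7\,\text{drained}_i,
\qquad \text{OR}_j = e^{\beta_j}
\]

A one-unit increase in predictor \(j\), with the others held fixed, multiplies the odds \(p_i / (1 - p_i)\) by \(\text{OR}_j\).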
fit.glm <- glm(
  introvert ~ alone + social + outside + friends + posts + stage_fear + drained,
  family = binomial(link = "logit"),
  data = clean
)
summary(fit.glm)
##
## Call:
## glm(formula = introvert ~ alone + social + outside + friends +
## posts + stage_fear + drained, family = binomial(link = "logit"),
## data = clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.516988 0.487835 -5.160 2.48e-07 ***
## alone -0.058811 0.037581 -1.565 0.1176
## social -0.033049 0.045848 -0.721 0.4710
## outside -0.003837 0.063404 -0.061 0.9517
## friends 0.064106 0.029907 2.143 0.0321 *
## posts -0.076983 0.046240 -1.665 0.0959 .
## stage_fear 2.888991 0.336435 8.587 < 2e-16 ***
## drained 2.652789 0.318944 8.317 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4017.9 on 2899 degrees of freedom
## Residual deviance: 1476.9 on 2892 degrees of freedom
## AIC: 1492.9
##
## Number of Fisher Scoring iterations: 5
When we look at what statistically separates Introverts from Extroverts, the emotional reactions are the ‘dead giveaways.’ Once Stage Fear and feeling Drained after socializing are in the model, time spent alone and going outside add almost nothing (neither is significant), even though alone-time differs clearly between the groups on its own. These two traits are the most reliable ‘signatures’ in our entire dataset, overshadowing every other behavior.
## (Intercept) alone social outside friends posts
## 0.08070228 0.94288466 0.96749088 0.99617061 1.06620494 0.92590541
## stage_fear drained
## 17.97515819 14.19356923
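A minimal sketch of how odds ratios like those printed above are obtained from a fitted logistic model. The data are simulated and the names `toy` and `fit.toy.glm` are hypothetical; the Wald confidence intervals are an addition for illustration and are not part of the original output.

```r
# Simulated stand-in with two predictors (NOT the real dataset)
set.seed(1)
n <- 600
toy <- data.frame(
  stage_fear = rbinom(n, 1, 0.5),
  friends = sample(0:15, n, replace = TRUE)
)
eta <- -1.5 + 2 * toy$stage_fear - 0.05 * toy$friends
toy$introvert <- rbinom(n, 1, plogis(eta))

fit.toy.glm <- glm(introvert ~ stage_fear + friends,
                   family = binomial(link = "logit"), data = toy)

# Exponentiate log-odds coefficients to get odds ratios
or <- exp(coef(fit.toy.glm))

# Wald intervals, also moved to the odds-ratio scale
ci <- exp(confint.default(fit.toy.glm))

round(cbind(OR = or, ci), 3)
```

Applied to the summary above, the stage_fear coefficient of 2.888991 exponentiates to about 17.98, which matches the printed value.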
By exponentiating the coefficients, we move from the abstract “log-odds” to real-world Odds Ratios. An Odds Ratio (OR) of 1.0 would mean a behavior has no association with personality. Instead, we see massive spikes: holding the other predictors fixed, having Stage Fear multiplies the odds of being an Introvert by nearly 18. This suggests that personality isn’t just about what you do (your schedule), but about how your body and mind react to the world (your fear and fatigue).
## 8 · Conclusions
This research reveals a significant disconnect between how we label our personalities and how we actually behave. While we often define Introversion by a preference for solitude, the GAM and QGAM results show that solitude acts as a shared “social cliff”: once alone-time reaches approximately 4 hours, social attendance drops sharply at the low, middle, and high ends of the attendance distribution alike. In the logistic model, the strongest distinction between groups isn’t the amount of time spent alone, but the internal cost of leaving that solitude. “Outward” habits like post frequency or friends-circle size turned out to be weak predictors (friends-circle size is significant at the 5% level, but its odds ratio of 1.07 is small). Instead, the “dead giveaways” for Introversion are internal: Stage Fear and Post-Social Exhaustion. Ultimately, being an introvert isn’t about how much you stay in; it’s about the “drain” you feel when you finally go out.
Stage Fear (OR: 17.98) and feeling socially drained (OR: 14.19) are the primary signatures of an introvert.
Statistical Significance of Habits: Standard behavioral metrics like Time Spent Alone, Going Outside, and Social Event Attendance were found to be non-significant predictors in the final logistic model (p > 0.05).
Complexity of Social Decay: The relationship between solitude and social engagement is non-linear (EDF: 8.7), suggesting social energy does not decline at a steady rate but hits specific “crash points”.
Quantile Insights: Behavior is most predictable at the extremes of the social spectrum. The social habits of median individuals are significantly more varied and less dictated by time spent alone.