library(tidyverse)
library(mgcv)
library(qgam)
library(DHARMa)
library(ggplot2)

1 · Data

personality <- read.csv("personality_datasert.csv")
glimpse(personality)
## Rows: 2,900
## Columns: 8
## $ Time_spent_Alone          <dbl> 4, 9, 9, 0, 3, 1, 4, 2, 10, 0, 3, 10, 3, 3, …
## $ Stage_fear                <chr> "No", "Yes", "Yes", "No", "No", "No", "No", …
## $ Social_event_attendance   <dbl> 4, 0, 1, 6, 9, 7, 9, 8, 1, 8, 9, 3, 6, 6, 3,…
## $ Going_outside             <dbl> 6, 0, 2, 7, 4, 5, 3, 4, 3, 6, 6, 1, 7, 4, 0,…
## $ Drained_after_socializing <chr> "No", "Yes", "Yes", "No", "No", "No", "No", …
## $ Friends_circle_size       <dbl> 13, 0, 5, 14, 8, 6, 7, 7, 0, 13, 15, 4, 14, …
## $ Post_frequency            <dbl> 5, 3, 2, 8, 5, 6, 7, 8, 3, 8, 5, 0, 10, 7, 3…
## $ Personality               <chr> "Extrovert", "Introvert", "Introvert", "Extr…

I chose this dataset of 2,900 people because I wanted to move beyond the usual personality clichés. We often talk about introversion and extroversion as simple labels, but I was interested in seeing how those labels actually translate into real-world behavior. This specific data is great because it doesn’t just look at how many parties someone goes to; it tracks the internal “cost” of those events, like feeling drained or experiencing stage fear.

1.1 · Cleaning

Before modeling, we reshape the raw survey data into a form that is easier to work with. We rename variables to shorter handles such as alone and social to keep the code readable, and we convert the binary strings (“Yes”/“No”) into numeric indicators (\(1\) and \(0\)). R's glm() would also accept a factor response, but explicit 0/1 coding makes it unambiguous which level the model treats as the “success” (here, Introvert = 1) and lets the indicators enter as ordinary numeric predictors. Storing personality as a factor ensures that plots and models treat it as a categorical grouping variable and not as free text.

clean <- personality %>%
  rename(
    alone   = Time_spent_Alone,
    social  = Social_event_attendance,
    outside = Going_outside,
    friends = Friends_circle_size,
    posts   = Post_frequency
  ) %>%
  mutate(
    stage_fear  = if_else(Stage_fear               == "Yes", 1, 0),
    drained     = if_else(Drained_after_socializing == "Yes", 1, 0),
    introvert   = if_else(Personality              == "Introvert", 1, 0),
    personality = factor(Personality)
  )

glimpse(clean)
## Rows: 2,900
## Columns: 12
## $ alone                     <dbl> 4, 9, 9, 0, 3, 1, 4, 2, 10, 0, 3, 10, 3, 3, …
## $ Stage_fear                <chr> "No", "Yes", "Yes", "No", "No", "No", "No", …
## $ social                    <dbl> 4, 0, 1, 6, 9, 7, 9, 8, 1, 8, 9, 3, 6, 6, 3,…
## $ outside                   <dbl> 6, 0, 2, 7, 4, 5, 3, 4, 3, 6, 6, 1, 7, 4, 0,…
## $ Drained_after_socializing <chr> "No", "Yes", "Yes", "No", "No", "No", "No", …
## $ friends                   <dbl> 13, 0, 5, 14, 8, 6, 7, 7, 0, 13, 15, 4, 14, …
## $ posts                     <dbl> 5, 3, 2, 8, 5, 6, 7, 8, 3, 8, 5, 0, 10, 7, 3…
## $ Personality               <chr> "Extrovert", "Introvert", "Introvert", "Extr…
## $ stage_fear                <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1,…
## $ drained                   <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1,…
## $ introvert                 <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1,…
## $ personality               <fct> Extrovert, Introvert, Introvert, Extrovert, …
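
One thing worth checking at this step: a comparison such as `if_else(x == "Yes", 1, 0)` sends every value that is not exactly "Yes" to 0, including blanks or misspellings, and only true `NA`s stay missing. The sketch below uses a small made-up vector (not the survey data) and a hypothetical helper, `recode_yes_no()`, to show a stricter base R recode that marks anything unexpected as `NA`.

```r
# Toy vector standing in for a Yes/No survey column (made-up values).
answers <- c("No", "Yes", "Yes", NA, "")

# Hypothetical helper: only exact "Yes"/"No" map to 1/0; anything else is NA.
recode_yes_no <- function(x) {
  out <- rep(NA_real_, length(x))
  out[x %in% "Yes"] <- 1
  out[x %in% "No"]  <- 0
  out
}

strict <- recode_yes_no(answers)
loose  <- ifelse(answers == "Yes", 1, 0)  # same logic as the if_else() calls above

# Rows where the two columns disagree held a value that was neither "Yes" nor "No".
data.frame(answers, strict, loose)
```

If the two columns agree on the real data, the simpler recode used above is safe as it stands.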

2 · Research Questions & Hypotheses

RQ1 — Does time spent alone nonlinearly predict social event attendance? H1: The relationship between alone-time and social attendance is not linear (EDF > 1 in a GAM).

RQ2 — Does the shape of that relationship differ across the distribution of social activity? H2: The smooth function estimated by a QGAM will differ between low and high quantiles.

RQ3 — Which behavioural variables best distinguish Introverts from Extroverts? H3: Drained-after-socialising and alone-time will be the strongest Introvert predictors in a logistic GLM.


3 · Exploratory Analysis

scatter <- clean %>%
  ggplot(
    aes(
      x     = alone,
      y     = social,
      color = personality
    )
  ) +
  geom_point(alpha = 0.35) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(
    title = "Time Alone vs Social Attendance",
    x     = "Time Spent Alone",
    y     = "Social Event Attendance",
    color = "Personality"
  ) +
  theme_minimal()

scatter

The relationship is clearly not linear, so we use GAMs to account for the non-linearity.

clean %>%
  ggplot(
    aes(
      x    = personality,
      y    = alone,
      fill = personality
    )
  ) +
  geom_violin(alpha = 0.6) +
  geom_boxplot(width = 0.15) +
  labs(
    title = "Distribution of Time Alone by Personality Type",
    x     = NULL,
    y     = "Time Spent Alone"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

The violin plots reveal a clear “personality gap.” Introverts show a much wider distribution at higher levels of solitary time, while Extroverts are tightly clustered at the lower end. However, the overlap in the middle of these plots is exactly why a simple average won’t tell the whole story. If the groups were perfectly separated, we wouldn’t need advanced statistics; because they “bleed” into each other, we need to find the non-linear thresholds where one identity typically transitions into the other.

clean %>%
  count(personality) %>%
  mutate(pct = n / sum(n) * 100)

Roughly 51.4% Extrovert and 48.5% Introvert, so the two groups are well balanced.


4 · Method 1: GAM

A Generalised Additive Model swaps out the rigid straight-line term you’d get from OLS for a flexible smooth function s(alone). The key thing is we don’t have to decide the shape upfront — the data figures it out. REML penalisation keeps the smooth from getting too wiggly.
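
To make the role of the penalty concrete, here is a minimal sketch on simulated data (not the survey). It fits the same smooth three times with mgcv: once with the penalty chosen by REML, once with the smoothing parameter `sp` fixed near zero, and once with `sp` fixed very large. The data-generating curve and the two `sp` values are arbitrary choices for illustration.

```r
library(mgcv)

set.seed(123)
n <- 400
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.5)              # simulated data, made-up curve

fit_reml   <- gam(y ~ s(x), method = "REML")  # penalty chosen by REML
fit_wiggly <- gam(y ~ s(x), sp = 1e-8)        # penalty almost switched off
fit_stiff  <- gam(y ~ s(x), sp = 1e8)         # penalty very heavy

# fit$edf holds one value per coefficient; subtract 1 to drop the intercept.
edf <- c(
  wiggly = sum(fit_wiggly$edf) - 1,
  reml   = sum(fit_reml$edf) - 1,
  stiff  = sum(fit_stiff$edf) - 1
)
round(edf, 2)
```

A heavy penalty pushes the smooth towards a straight line (EDF near 1), a negligible one lets it use almost the whole basis, and REML picks a value in between from the data.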

fit.gam <- gam(
  social ~ s(alone),
  data   = clean,
  method = "REML"
)

summary(fit.gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## social ~ s(alone)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.96335    0.03108   127.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df     F p-value    
## s(alone) 8.725  8.978 628.2  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.66   Deviance explained = 66.1%
## -REML = 5631.9  Scale est. = 2.8022    n = 2900

Effective Degrees of Freedom (EDF ≈ 8.7): An EDF well above 1 indicates a strongly nonlinear relationship between alone-time and social attendance. The EDF measures how flexible the fitted curve is, not how many “tipping points” it has, and 8.7 sits close to the ceiling of 9 implied by mgcv's default basis dimension (k = 10), so the basis size may be limiting the smooth here.
Statistical Significance (p < 2e-16): The p-value is far below any conventional threshold, so the data are strongly inconsistent with a flat, no-effect smooth. It does not measure the probability that the null hypothesis is true, and with 2,900 observations even modest effects would reach significance.
Adjusted R^2 (0.66): The model explains about 66% of the variance in social attendance using alone-time as the only predictor. That is a high share for a single-variable behavioural model, although roughly a third of the variance remains unexplained.
Intercept (3.96): Because the smooth is centred to average zero over the data, the intercept is the mean social attendance score in the sample (about 4.0), not the predicted attendance at zero alone-time.
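
The reading of the intercept can be checked with a small sketch. The data below are simulated with a made-up “cliff” shape (they are not the survey data); the point is only that, in a Gaussian GAM with a single centred smooth, the intercept matches the sample mean of the response and can differ a lot from the prediction at zero.

```r
library(mgcv)

set.seed(7)
n <- 1000
alone  <- sample(0:11, n, replace = TRUE)       # simulated predictor
social <- 8 / (1 + exp(alone - 4)) + rnorm(n)   # made-up "cliff" shape plus noise

fit <- gam(social ~ s(alone), method = "REML")

intercept    <- unname(coef(fit)["(Intercept)"])
pred_at_zero <- as.numeric(predict(fit, newdata = data.frame(alone = 0)))

# The intercept equals the mean response; the prediction at alone = 0 does not.
c(intercept = intercept, mean_response = mean(social), pred_at_zero = pred_at_zero)
```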

scatter +
  geom_smooth(
    method  = "gam",
    formula = y ~ s(x, bs = "tp")
  )

When we overlay the GAM smooth on the scatter plot, we see why linear regression would have failed us. A straight line would either overestimate social attendance for loners or underestimate it for socialites. The GAM “snakes” through the data, respecting the fact that social energy holds steady for a few hours before hitting a cliff. This flexibility allows the model to respect the “human” nature of the data—where habits are rarely constant.
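
The comparison with a straight line can also be made numerically. The sketch below is an illustration on simulated data with an assumed cliff-shaped curve, not a re-analysis of the survey: it fits a linear model and a GAM to the same data and compares them by AIC and by their predictions on a grid.

```r
library(mgcv)

set.seed(11)
n <- 1000
alone  <- sample(0:11, n, replace = TRUE)       # simulated predictor
social <- 8 / (1 + exp(alone - 4)) + rnorm(n)   # made-up "cliff" shape plus noise

fit_lin    <- lm(social ~ alone)                       # straight-line fit
fit_smooth <- gam(social ~ s(alone), method = "REML")  # penalised smooth

# Lower AIC indicates the better trade-off between fit and complexity.
AIC(fit_lin, fit_smooth)

# Side-by-side predictions show where the line over- and under-shoots.
grid <- data.frame(alone = 0:11)
cbind(
  grid,
  linear = round(predict(fit_lin, grid), 2),
  smooth = round(as.numeric(predict(fit_smooth, grid)), 2)
)
```

Running the same two fits on `clean` would give a direct check of H1 for the real data.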

plot(fit.gam, residuals = TRUE, shade = TRUE,
     xlab = "Time Spent Alone",
     ylab = "Effect on Social Attendance",
     main = "GAM Smooth: s(alone)")



5 · GAM: Residual Diagnostics (DHARMa)

To check the model assumptions, I used the DHARMa package, which builds scaled residuals by simulation: it draws new responses from the fitted model and records where each observed value falls within its simulated distribution. If the model is correctly specified, these residuals are approximately uniform, so departures from uniformity flag problems such as overdispersion or systematic bias.
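
The idea behind simulation-based residuals can be sketched in a few lines of base R. This is a simplified illustration of the principle on simulated data with a plain linear model, not DHARMa's actual implementation (which, among other things, adds randomisation for discrete responses).

```r
set.seed(1)
n <- 500
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n)    # simulated data that really are Gaussian

fit  <- lm(y ~ x)
sims <- as.matrix(simulate(fit, nsim = 250))  # one column per simulated data set

# Scaled residual: share of simulated responses that fall below the observed one.
scaled_res <- rowMeans(sims < y)

# For a well-specified model these quantiles should sit near 0.1, 0.25, 0.5, 0.75, 0.9.
round(quantile(scaled_res, c(0.1, 0.25, 0.5, 0.75, 0.9)), 2)
```

When the model is wrong, for example if the variance changes with the predictor, the scaled residuals pile up near 0 and 1 or bunch in the middle, which is what the QQ plot in DHARMa's output makes visible.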

sim.gam <- simulateResiduals(fit.gam)
plot(sim.gam)

The model summary shows that time spent alone is a strong predictor of social attendance, explaining roughly 66% of the variance. However, the DHARMa diagnostics reveal patterned errors, visible as an S-shaped departure from the diagonal in the QQ plot. The model is not just making random mistakes: it misses systematically in specific zones, which suggests that the spread and shape of the residuals change with alone-time and that a single Gaussian error distribution around the mean is too restrictive.

To address this, we move to QGAMs, which model conditional quantiles of attendance (here the 10th, 50th, and 90th percentiles) instead of only the conditional mean. Quantile models do not assume constant-variance Gaussian errors, so they can describe how the low and high ends of the attendance distribution respond to alone-time even where the mean model fits poorly.


6 · Method 2: QGAM

A Quantile GAM goes one step further than the standard GAM: instead of fitting one smooth to the average response, it fits a separate smooth at each quantile τ you ask for. That lets us ask the actually interesting question — does alone-time affect people at the bottom of the social-attendance distribution the same way it affects people at the top?
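
What makes a fit target a quantile instead of the mean is the loss function. The sketch below shows the standard pinball (check) loss in base R on simulated data and confirms that minimising it over a constant recovers the sample quantile. The helper name `pinball()` is made up for this illustration; qgam itself works with a smoothed version of this loss (the `elf` family that appears in its output).

```r
# Pinball loss for quantile level tau: under-predictions are weighted by tau,
# over-predictions by (1 - tau).
pinball <- function(y, q, tau) {
  u <- y - q
  mean(u * (tau - (u < 0)))
}

set.seed(42)
y    <- rexp(2000)             # simulated, skewed response
taus <- c(0.1, 0.5, 0.9)

# For each tau, find the constant q that minimises the pinball loss.
est <- sapply(taus, function(tau) {
  optimize(function(q) pinball(y, q, tau), interval = range(y))$minimum
})

# The minimisers should agree closely with the empirical quantiles.
round(cbind(tau = taus, pinball_min = est, sample_quantile = quantile(y, taus)), 3)
```

A quantile GAM replaces the constant `q` with an intercept plus a smooth function of the predictor, fitted separately for each `tau`.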

qu <- c(0.1, 0.5, 0.9)

fit.qgam <- mqgam(
  form = social ~ s(alone),
  data = clean,
  qu   = qu
)
## Estimating learning rate. Each dot corresponds to a loss evaluation. 
## qu = 0.5.........done 
## qu = 0.1...............done 
## qu = 0.9................done
qdo(
  obj = fit.qgam,
  qu  = qu,
  fun = summary
)
## [[1]]
## 
## Family: elf 
## Link function: identity 
## 
## Formula:
## social ~ s(alone)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.88683    0.02841   66.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df Chi.sq p-value    
## s(alone) 8.818  8.988   7946  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.64   Deviance explained = 83.5%
## -REML = 5933.7  Scale est. = 1         n = 2900
## 
## [[2]]
## 
## Family: elf 
## Link function: identity 
## 
## Formula:
## social ~ s(alone)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   3.8779     0.0387   100.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df Chi.sq p-value    
## s(alone) 8.733   8.98   3739  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.657   Deviance explained = 53.7%
## -REML = 5885.7  Scale est. = 1         n = 2900
## 
## [[3]]
## 
## Family: elf 
## Link function: identity 
## 
## Formula:
## social ~ s(alone)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  6.26290    0.03493   179.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df Chi.sq p-value    
## s(alone) 8.653  8.965  11043  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.609   Deviance explained =   81%
## -REML = 6388.3  Scale est. = 1         n = 2900

Different Baselines (Intercepts: 1.89 vs 3.88 vs 6.26) The most obvious contrast is the level of each curve. Each intercept is the average fitted quantile over the sample, not the value at zero alone-time: averaged over the data, the 10th-percentile curve sits at about 1.9, the median at about 3.9, and the 90th-percentile curve at about 6.3, more than three times the lowest. At a comparable amount of alone-time, then, the most socially active people attend far more events than the least active, which is consistent with some people having a much larger “social battery” than others.

The “Predictability” Gap (Deviance Explained: 83.5% vs 53.7% vs 81%) Deviance explained is much higher at the 10th (83.5%) and 90th (81%) percentiles than at the median (53.7%). These figures need care: in a quantile model the deviance is built from the quantile loss at that τ, so the percentages may not be directly comparable across quantiles, and they are not classification accuracies. The adjusted R² values (0.64, 0.657, 0.609) are much closer to one another. A cautious reading is that alone-time pins down the lower and upper bounds of attendance at least as well as it pins down the typical value, and that for a middle-of-the-road person other factors, such as mood or the specific event, may matter more.

The Complexity of the “Crash” (edf: 8.82 vs 8.73 vs 8.65) Despite their different baselines, all three smooths have nearly the same EDF (around 8.7). The decline in attendance with alone-time is therefore about equally nonlinear at every quantile: it is not a simple straight-line drop for either the “Socialite” or the “Reserved” end of the distribution. As with the mean GAM, these values are close to the maximum of 9 allowed by the default basis, so they reflect the flexibility the model was given as much as the data.

Taken together, the results show that alone-time is not a “one-size-fits-all” factor. The upper end of the attendance distribution sits at a very different level (intercept 6.26) from the lower end (1.89), while the shape of the decline is similar across quantiles. Deviance explained is above 80% at both extremes but only about 54% at the median, which hints that typical attendance is harder to pin down from alone-time than its bounds are, although deviance explained in a quantile model is not a measure of predictive accuracy. To understand social energy, it helps to look at the extremes and not just the average.

invisible(
  qdo(
    obj = fit.qgam,
    qu  = qu,
    fun = plot
  )
)

The Shared Pattern: All three quantile curves follow a “cliff-edge” trend. Social attendance stays high and steady for a while, but once alone-time passes a threshold (around 4 on the x-axis), attendance drops sharply at every quantile.

The Difference (The Intercepts): By default, plot() draws each smooth centred on zero, so the three panels show the shape of the effect and not its level. The level is carried by the intercepts reported above: the 90th-percentile curve sits much higher than the 10th-percentile curve, which confirms that “Social Butterflies” have a much higher baseline but still show the same sharp drop-off as the “Reserved” group.


7 · Method 3: Logistic GLM

To conclude the analysis, I inverted the research question: instead of predicting social attendance, the goal was to see whether behavior can classify a person as an Introvert or an Extrovert. A logistic Generalized Linear Model (GLM) maps seven behavioral predictors onto the log-odds of being an Introvert. To make the results easier to interpret, the coefficients are exponentiated into odds ratios, which give the multiplicative change in the odds of being an Introvert for a one-unit increase in each predictor, holding the others fixed.

fit.glm <- glm(
  introvert ~ alone + social + outside + friends + posts + stage_fear + drained,
  family = binomial(link = "logit"),
  data   = clean
)

summary(fit.glm)
## 
## Call:
## glm(formula = introvert ~ alone + social + outside + friends + 
##     posts + stage_fear + drained, family = binomial(link = "logit"), 
##     data = clean)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.516988   0.487835  -5.160 2.48e-07 ***
## alone       -0.058811   0.037581  -1.565   0.1176    
## social      -0.033049   0.045848  -0.721   0.4710    
## outside     -0.003837   0.063404  -0.061   0.9517    
## friends      0.064106   0.029907   2.143   0.0321 *  
## posts       -0.076983   0.046240  -1.665   0.0959 .  
## stage_fear   2.888991   0.336435   8.587  < 2e-16 ***
## drained      2.652789   0.318944   8.317  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4017.9  on 2899  degrees of freedom
## Residual deviance: 1476.9  on 2892  degrees of freedom
## AIC: 1492.9
## 
## Number of Fisher Scoring iterations: 5

When we look at what distinguishes an Introvert in this model, the emotional reactions are the ‘dead giveaways.’ Once the other predictors are accounted for, time spent alone (p = 0.12), social attendance (p = 0.47), and going outside (p = 0.95) add little, and friends-circle size is only marginally significant (p = 0.03). Stage Fear and feeling Drained after socializing, by contrast, are very strong predictors (both p < 2e-16) and overshadow every other variable in the model.

# Exponentiate to get odds ratios: OR > 1 raises the odds of Introvert, OR < 1 lowers them
exp(coef(fit.glm))
## (Intercept)       alone      social     outside     friends       posts 
##  0.08070228  0.94288466  0.96749088  0.99617061  1.06620494  0.92590541 
##  stage_fear     drained 
## 17.97515819 14.19356923

By exponentiating the coefficients, we move from the abstract log-odds scale to odds ratios. An odds ratio (OR) of 1.0 would mean a predictor has no association with personality type. Instead, we see very large values: having Stage Fear multiplies the odds of being an Introvert by nearly 18, holding the other predictors fixed. This is a ratio of odds, not of probabilities, so it does not mean an 18-fold higher probability. The result suggests that personality is reflected less in what you do (your schedule) than in how you react to the world (your fear and fatigue).

8 · Conclusions

8.1 · Reflection

This analysis points to a disconnect between how we label our personalities and how we actually behave. We often define Introversion by a preference for solitude, yet the GAM results suggest that solitude acts as a common “social cliff”: once alone-time passes approximately 4 hours, predicted social attendance falls sharply, and the QGAM shows the same drop at the 10th, 50th, and 90th percentiles of attendance. In the logistic GLM, what separates the groups is not the amount of time spent alone but the reported cost of leaving that solitude. “Outward” habits such as post frequency (p ≈ 0.10) and friends-circle size (OR ≈ 1.07) contribute little once the other predictors are included. Instead, the “dead giveaways” for Introversion are internal: Stage Fear and feeling Drained after socializing. On this evidence, being an introvert is less about how much you stay in and more about the “drain” you report when you go out.

8.2 · Summary of Findings

  • Internal Reactions vs. Outward Habits: Internal states are the most powerful predictors of personality. Stage Fear (OR: 17.98) and feeling socially drained (OR: 14.19) are the primary signatures of an introvert.

  • Statistical Significance of Habits: Standard behavioral metrics like Time Spent Alone, Going Outside, and Social Event Attendance were found to be non-significant predictors in the final logistic model (p > 0.05).

  • Complexity of Social Decay: The relationship between solitude and social engagement is strongly non-linear (EDF: 8.7): social attendance does not decline at a steady rate but holds roughly level and then drops sharply at around 4 hours of alone-time.

  • Quantile Insights: Deviance explained is higher at the 10th and 90th percentiles of social attendance (83.5% and 81%) than at the median (53.7%), which suggests that typical attendance is less tightly tied to time spent alone than the extremes are.