Introduction

Social anxiety, also referred to as “social anxiety disorder” or “social phobia,” is defined as a “marked and persistent fear of social or performance situations” (Jefferson, 2001). It often arises in situations where people feel either that they will make those around them uncomfortable or that they will be judged negatively for their actions (Jefferies, 2020). Social anxiety affects an individual’s functioning in many areas of life, including school, work, and social relationships. In the workplace, people who suffer from social anxiety are more likely to be absent and often show poorer work performance. They are also likely to have fewer friends, less likely to marry or have children, and more likely to divorce (Jefferies, 2020).

Many factors have been theorized or shown to influence the diagnosis of social anxiety. They range from lifestyle choices, mental health management, and physical health and physiological symptom management to general associations with demographics such as age and gender.

Research Question

Which demographic, lifestyle, mental, and physical & physiological factors are the strongest predictors of high levels of social anxiety?

Objectives

  1. Describe the distribution of anxiety levels within the sample.
  2. Examine relationships between anxiety and potential predictor variables.
  3. Determine which factors significantly predict anxiety levels using multiple linear regression.
  4. Identify the strongest predictors of anxiety.

Methodology

Data Overview

Data were obtained from the Social Anxiety Data Set developed by Zhang (2024). The data set contains records for 11,000 individuals and includes demographic, lifestyle, mental, and physiological variables related to anxiety. Variables examined in this study included age, gender, occupation, sleep duration, physical activity, caffeine intake, alcohol consumption, smoking status, family history of anxiety, stress level, physiological symptoms, medication use, therapy attendance, recent major life events, diet quality, and overall anxiety severity level.

Variables

Dependent Variable

  • Anxiety Level (1-10 Scale)

Independent Variables

Demographic
  • Age (Continuous)
  • Gender
  • Occupation
Lifestyle
  • Sleep Hours
  • Physical Activity (Hrs. Per Week)
  • Diet Quality (1-10 Scale)
  • Caffeine Intake (Mg. Per Day)
  • Smoking Habit
  • Alcohol Consumption (Drinks Per Week)
Physiological/Psychological Health
  • Stress Level (1-10 Scale)
  • Heart Rate (BPM)
  • Breathing (Breaths Per Minute)
  • Sweating (1-5 Scale)
  • Dizziness
  • Medication Use
  • Therapy Sessions (Per Month)
  • Recent Major Life Event
  • Family History of Anxiety

Statistical Analysis

Data were analyzed in R using RStudio.

The analysis methods included:

  • Descriptive Statistics & Data Distribution Analysis
  • Correlation Analysis
  • Multiple Linear Regression
  • Independent Significant Factor Visuals

Results

Data Preparation

# Load Necessary Packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.3     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(psych)
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(corrplot)
## corrplot 0.95 loaded
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:psych':
## 
##     logit
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(broom)
library(moments)

# Load CSV File: Social Anxiety Data Set
SocAnx <- read.csv("enhanced_anxiety_dataset.csv")

# Clean & View Variable Names
SocAnx <- clean_names(SocAnx)

names(SocAnx)
##  [1] "age"                             "gender"                         
##  [3] "occupation"                      "sleep_hours"                    
##  [5] "physical_activity_hrs_week"      "caffeine_intake_mg_day"         
##  [7] "alcohol_consumption_drinks_week" "smoking"                        
##  [9] "family_history_of_anxiety"       "stress_level_1_10"              
## [11] "heart_rate_bpm"                  "breathing_rate_breaths_min"     
## [13] "sweating_level_1_5"              "dizziness"                      
## [15] "medication"                      "therapy_sessions_per_month"     
## [17] "recent_major_life_event"         "diet_quality_1_10"              
## [19] "anxiety_level_1_10"

Descriptive & Summary Statistics

# Data Set Summary 
summary(SocAnx)
##       age              gender          occupation     sleep_hours    
##  Min.   :18.00   Length   :11000   Length   :11000   Min.   : 2.300  
##  1st Qu.:29.00   N.unique :    3   N.unique :   13   1st Qu.: 5.900  
##  Median :40.00   N.blank  :    0   N.blank  :    0   Median : 6.700  
##  Mean   :40.24   Min.nchar:    4   Min.nchar:    4   Mean   : 6.651  
##  3rd Qu.:51.00   Max.nchar:    6   Max.nchar:   10   3rd Qu.: 7.500  
##  Max.   :64.00                                       Max.   :11.300  
##  physical_activity_hrs_week caffeine_intake_mg_day
##  Min.   : 0.000             Min.   :  0.0         
##  1st Qu.: 1.500             1st Qu.:172.0         
##  Median : 2.800             Median :273.0         
##  Mean   : 2.942             Mean   :286.1         
##  3rd Qu.: 4.200             3rd Qu.:382.0         
##  Max.   :10.100             Max.   :599.0         
##  alcohol_consumption_drinks_week      smoking      family_history_of_anxiety
##  Min.   : 0.000                  Length   :11000   Length   :11000          
##  1st Qu.: 5.000                  N.unique :    2   N.unique :    2          
##  Median :10.000                  N.blank  :    0   N.blank  :    0          
##  Mean   : 9.702                  Min.nchar:    2   Min.nchar:    2          
##  3rd Qu.:15.000                  Max.nchar:    3   Max.nchar:    3          
##  Max.   :19.000                                                             
##  stress_level_1_10 heart_rate_bpm   breathing_rate_breaths_min
##  Min.   : 1.000    Min.   : 60.00   Min.   :12.00             
##  1st Qu.: 3.000    1st Qu.: 76.00   1st Qu.:17.00             
##  Median : 6.000    Median : 92.00   Median :21.00             
##  Mean   : 5.856    Mean   : 90.92   Mean   :20.96             
##  3rd Qu.: 8.000    3rd Qu.:106.00   3rd Qu.:25.00             
##  Max.   :10.000    Max.   :119.00   Max.   :29.00             
##  sweating_level_1_5     dizziness         medication   
##  Min.   :1.000      Length   :11000   Length   :11000  
##  1st Qu.:2.000      N.unique :    2   N.unique :    2  
##  Median :3.000      N.blank  :    0   N.blank  :    0  
##  Mean   :3.081      Min.nchar:    2   Min.nchar:    2  
##  3rd Qu.:4.000      Max.nchar:    3   Max.nchar:    3  
##  Max.   :5.000                                         
##  therapy_sessions_per_month recent_major_life_event diet_quality_1_10
##  Min.   : 0.000             Length   :11000         Min.   : 1.000   
##  1st Qu.: 1.000             N.unique :    2         1st Qu.: 3.000   
##  Median : 2.000             N.blank  :    0         Median : 5.000   
##  Mean   : 2.428             Min.nchar:    2         Mean   : 5.182   
##  3rd Qu.: 4.000             Max.nchar:    3         3rd Qu.: 8.000   
##  Max.   :12.000                                     Max.   :10.000   
##  anxiety_level_1_10
##  Min.   : 1.000    
##  1st Qu.: 2.000    
##  Median : 4.000    
##  Mean   : 3.929    
##  3rd Qu.: 5.000    
##  Max.   :10.000
# Table 1: Summary Statistics Table 
summary_table <- SocAnx %>%
  select(where(is.numeric)) %>%
  summarise(
    across(
      everything(),
      list(
        Mean = ~mean(., na.rm = TRUE),
        SD = ~sd(., na.rm = TRUE),
        Min = ~min(., na.rm = TRUE),
        Max = ~max(., na.rm = TRUE)
      )
    )
  ) %>%
  pivot_longer(
    everything(),
    names_to = c("Variable", ".value"),
    names_pattern = "(.*)_(Mean|SD|Min|Max)"
  )

summary_table
## # A tibble: 12 × 5
##    Variable                          Mean     SD   Min   Max
##    <chr>                            <dbl>  <dbl> <dbl> <dbl>
##  1 age                              40.2   13.2   18    64  
##  2 sleep_hours                       6.65   1.23   2.3  11.3
##  3 physical_activity_hrs_week        2.94   1.83   0    10.1
##  4 caffeine_intake_mg_day          286.   145.     0   599  
##  5 alcohol_consumption_drinks_week   9.70   5.69   0    19  
##  6 stress_level_1_10                 5.86   2.93   1    10  
##  7 heart_rate_bpm                   90.9   17.3   60   119  
##  8 breathing_rate_breaths_min       21.0    5.16  12    29  
##  9 sweating_level_1_5                3.08   1.40   1     5  
## 10 therapy_sessions_per_month        2.43   2.18   0    12  
## 11 diet_quality_1_10                 5.18   2.90   1    10  
## 12 anxiety_level_1_10                3.93   2.12   1    10

The summary table (Table 1) condenses the output of the summary() function above, reporting the mean, standard deviation (SD), minimum, and maximum of each numeric variable in the data set.

Because Table 1 covers only the numeric variables, the categorical variables are summarized separately in a frequency table (Table 2), as seen below.

# Select Categorical Variables
categorical_vars <- SocAnx %>%
  select(where(~!is.numeric(.)))

# Frequency Table - Table 2
for (var in names(categorical_vars)) {
  
  cat("\n\n====================================\n")
  cat("Variable:", var, "\n")
  cat("====================================\n")
  
  print(
    categorical_vars %>%
      count(.data[[var]]) %>%
      mutate(
        Percent = round(100 * n / sum(n), 2)
      )
  )
}
## 
## 
## ====================================
## Variable: gender 
## ====================================
##   gender    n Percent
## 1 Female 3730   33.91
## 2   Male 3657   33.25
## 3  Other 3613   32.85
## 
## 
## ====================================
## Variable: occupation 
## ====================================
##    occupation   n Percent
## 1      Artist 888    8.07
## 2     Athlete 822    7.47
## 3        Chef 858    7.80
## 4      Doctor 842    7.65
## 5    Engineer 833    7.57
## 6  Freelancer 838    7.62
## 7      Lawyer 809    7.35
## 8    Musician 892    8.11
## 9       Nurse 861    7.83
## 10      Other 840    7.64
## 11  Scientist 832    7.56
## 12    Student 878    7.98
## 13    Teacher 807    7.34
## 
## 
## ====================================
## Variable: smoking 
## ====================================
##   smoking    n Percent
## 1      No 5221   47.46
## 2     Yes 5779   52.54
## 
## 
## ====================================
## Variable: family_history_of_anxiety 
## ====================================
##   family_history_of_anxiety    n Percent
## 1                        No 5153   46.85
## 2                       Yes 5847   53.15
## 
## 
## ====================================
## Variable: dizziness 
## ====================================
##   dizziness    n Percent
## 1        No 5328   48.44
## 2       Yes 5672   51.56
## 
## 
## ====================================
## Variable: medication 
## ====================================
##   medication    n Percent
## 1         No 5334   48.49
## 2        Yes 5666   51.51
## 
## 
## ====================================
## Variable: recent_major_life_event 
## ====================================
##   recent_major_life_event    n Percent
## 1                      No 5377   48.88
## 2                     Yes 5623   51.12

Having summarized every variable in the data set, the next step is to check how the numeric variables are distributed. Normality is typically assessed with a formal test such as the Shapiro-Wilk test.

Two caveats apply, however. First, the Shapiro-Wilk test is only appropriate for numeric (continuous) variables. Second, R’s shapiro.test() accepts only sample sizes between 3 and 5,000. Because this data set contains 11,000 observations, the test cannot be run on the full sample. Instead, a skewness and kurtosis table is generated to give a sense of each distribution.
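The sample-size limit can be seen directly. The sketch below uses simulated values rather than the study data (the vector x and the seed are assumptions made for illustration). It shows shapiro.test() rejecting 11,000 observations and, as one possible workaround, running on a random subsample of 5,000.

```r
# Simulated stand-in for one numeric column (not the study data)
set.seed(42)
x <- rnorm(11000)

# shapiro.test() only accepts 3 to 5,000 observations, so this call fails;
# tryCatch() captures the error message instead of stopping the script
full_run <- tryCatch(
  shapiro.test(x),
  error = function(e) conditionMessage(e)
)
print(full_run)

# One possible workaround: test a random subsample of 5,000 values
sub_run <- shapiro.test(sample(x, 5000))
print(sub_run$p.value)
```

With samples this large, even trivial departures from normality tend to produce small p-values, which is a further reason to rely on descriptive measures such as skewness and kurtosis.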

# Using Only Numeric Variables
numeric_vars <- SocAnx %>%
  select(
    age,
    sleep_hours,
    physical_activity_hrs_week,
    caffeine_intake_mg_day,
    alcohol_consumption_drinks_week,
    stress_level_1_10,
    heart_rate_bpm,
    breathing_rate_breaths_min,
    sweating_level_1_5,
    therapy_sessions_per_month,
    diet_quality_1_10,
    anxiety_level_1_10
  )

normality_summary <- data.frame(
  Variable = names(numeric_vars),
  Skewness = sapply(numeric_vars, skewness, na.rm = TRUE),
  Kurtosis = sapply(numeric_vars, kurtosis, na.rm = TRUE)
)

normality_summary
##                                                        Variable    Skewness
## age                                                         age  0.09732294
## sleep_hours                                         sleep_hours -0.22444159
## physical_activity_hrs_week           physical_activity_hrs_week  0.50694082
## caffeine_intake_mg_day                   caffeine_intake_mg_day  0.32389714
## alcohol_consumption_drinks_week alcohol_consumption_drinks_week -0.02333056
## stress_level_1_10                             stress_level_1_10 -0.16084995
## heart_rate_bpm                                   heart_rate_bpm -0.11918763
## breathing_rate_breaths_min           breathing_rate_breaths_min -0.14118551
## sweating_level_1_5                           sweating_level_1_5 -0.07391860
## therapy_sessions_per_month           therapy_sessions_per_month  1.03497541
## diet_quality_1_10                             diet_quality_1_10  0.16192543
## anxiety_level_1_10                           anxiety_level_1_10  1.04385410
##                                 Kurtosis
## age                             1.873626
## sleep_hours                     2.951908
## physical_activity_hrs_week      2.784475
## caffeine_intake_mg_day          2.298327
## alcohol_consumption_drinks_week 1.840801
## stress_level_1_10               1.744294
## heart_rate_bpm                  1.807373
## breathing_rate_breaths_min      1.828122
## sweating_level_1_5              1.736803
## therapy_sessions_per_month      3.586111
## diet_quality_1_10               1.771484
## anxiety_level_1_10              3.949077

Most variables are approximately symmetric, with skewness between -0.5 and 0.5: age, sleep hours, caffeine intake, alcohol consumption, stress level, heart rate, breathing rate, sweating level, and diet quality. Physical activity shows a mild positive skew (0.51), while therapy sessions per month (1.03) and anxiety level (1.04) display moderate positive skewness, indicating that most participants reported lower values and comparatively few reported higher values. The kurtosis column reports Pearson kurtosis as returned by moments::kurtosis(), for which a normal distribution scores 3. Only sleep hours (2.95) and physical activity (2.78) are close to that benchmark. Age, alcohol consumption, stress level, heart rate, breathing rate, sweating level, and diet quality fall between 1.74 and 1.87, close to the value of 1.8 for a uniform distribution, so these variables are symmetric but flatter and lighter-tailed than a normal distribution. Caffeine intake (2.30) lies in between. Therapy sessions (3.59) and anxiety level (3.95) show elevated kurtosis, suggesting heavier tails. Overall, the distributions are symmetric enough for exploratory analysis, with the exception of therapy sessions and anxiety level, but most of them should not be described as normal.
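For reference, both statistics can be written out from their moment definitions. The sketch below is a minimal illustration on simulated data (the helper names skew_moment and kurt_pearson are hypothetical). It is intended to mirror the moment-based formulas behind moments::skewness() and moments::kurtosis(), under which a normal distribution has kurtosis near 3 and a uniform distribution near 1.8.

```r
# Moment-based skewness: third central moment over variance^(3/2)
skew_moment <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Pearson (non-excess) kurtosis: fourth central moment over variance^2
kurt_pearson <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Simulated benchmarks (not the study data)
set.seed(123)
normal_draws  <- rnorm(100000)
uniform_draws <- runif(100000)

k_normal  <- kurt_pearson(normal_draws)
k_uniform <- kurt_pearson(uniform_draws)
s_normal  <- skew_moment(normal_draws)

print(round(c(normal = k_normal, uniform = k_uniform), 2))
```

Comparing a variable's kurtosis against both benchmarks, rather than against 3 alone, helps separate "symmetric" from "normal": a bounded rating scale can be perfectly symmetric and still be much flatter than a bell curve.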

Correlation Analysis

cor_matrix <- cor(
  numeric_vars,
  use = "complete.obs"
)

round(cor_matrix, 2)
##                                   age sleep_hours physical_activity_hrs_week
## age                              1.00       -0.15                       0.04
## sleep_hours                     -0.15        1.00                       0.17
## physical_activity_hrs_week       0.04        0.17                       1.00
## caffeine_intake_mg_day          -0.04       -0.21                      -0.12
## alcohol_consumption_drinks_week -0.02       -0.07                      -0.03
## stress_level_1_10               -0.04       -0.18                      -0.10
## heart_rate_bpm                  -0.03       -0.14                      -0.08
## breathing_rate_breaths_min      -0.01       -0.12                      -0.07
## sweating_level_1_5              -0.02       -0.12                      -0.08
## therapy_sessions_per_month      -0.09       -0.31                      -0.19
## diet_quality_1_10                0.05        0.15                       0.09
## anxiety_level_1_10              -0.07       -0.49                      -0.24
##                                 caffeine_intake_mg_day
## age                                              -0.04
## sleep_hours                                      -0.21
## physical_activity_hrs_week                       -0.12
## caffeine_intake_mg_day                            1.00
## alcohol_consumption_drinks_week                   0.04
## stress_level_1_10                                 0.12
## heart_rate_bpm                                    0.08
## breathing_rate_breaths_min                        0.08
## sweating_level_1_5                                0.08
## therapy_sessions_per_month                        0.22
## diet_quality_1_10                                -0.09
## anxiety_level_1_10                                0.35
##                                 alcohol_consumption_drinks_week
## age                                                       -0.02
## sleep_hours                                               -0.07
## physical_activity_hrs_week                                -0.03
## caffeine_intake_mg_day                                     0.04
## alcohol_consumption_drinks_week                            1.00
## stress_level_1_10                                          0.05
## heart_rate_bpm                                             0.04
## breathing_rate_breaths_min                                 0.02
## sweating_level_1_5                                         0.02
## therapy_sessions_per_month                                 0.06
## diet_quality_1_10                                         -0.03
## anxiety_level_1_10                                         0.10
##                                 stress_level_1_10 heart_rate_bpm
## age                                         -0.04          -0.03
## sleep_hours                                 -0.18          -0.14
## physical_activity_hrs_week                  -0.10          -0.08
## caffeine_intake_mg_day                       0.12           0.08
## alcohol_consumption_drinks_week              0.05           0.04
## stress_level_1_10                            1.00           0.09
## heart_rate_bpm                               0.09           1.00
## breathing_rate_breaths_min                   0.06           0.05
## sweating_level_1_5                           0.08           0.06
## therapy_sessions_per_month                   0.21           0.15
## diet_quality_1_10                           -0.11          -0.09
## anxiety_level_1_10                           0.67           0.19
##                                 breathing_rate_breaths_min sweating_level_1_5
## age                                                  -0.01              -0.02
## sleep_hours                                          -0.12              -0.12
## physical_activity_hrs_week                           -0.07              -0.08
## caffeine_intake_mg_day                                0.08               0.08
## alcohol_consumption_drinks_week                       0.02               0.02
## stress_level_1_10                                     0.06               0.08
## heart_rate_bpm                                        0.05               0.06
## breathing_rate_breaths_min                            1.00               0.05
## sweating_level_1_5                                    0.05               1.00
## therapy_sessions_per_month                            0.14               0.12
## diet_quality_1_10                                    -0.05              -0.08
## anxiety_level_1_10                                    0.16               0.16
##                                 therapy_sessions_per_month diet_quality_1_10
## age                                                  -0.09              0.05
## sleep_hours                                          -0.31              0.15
## physical_activity_hrs_week                           -0.19              0.09
## caffeine_intake_mg_day                                0.22             -0.09
## alcohol_consumption_drinks_week                       0.06             -0.03
## stress_level_1_10                                     0.21             -0.11
## heart_rate_bpm                                        0.15             -0.09
## breathing_rate_breaths_min                            0.14             -0.05
## sweating_level_1_5                                    0.12             -0.08
## therapy_sessions_per_month                            1.00             -0.17
## diet_quality_1_10                                    -0.17              1.00
## anxiety_level_1_10                                    0.52             -0.22
##                                 anxiety_level_1_10
## age                                          -0.07
## sleep_hours                                  -0.49
## physical_activity_hrs_week                   -0.24
## caffeine_intake_mg_day                        0.35
## alcohol_consumption_drinks_week               0.10
## stress_level_1_10                             0.67
## heart_rate_bpm                                0.19
## breathing_rate_breaths_min                    0.16
## sweating_level_1_5                            0.16
## therapy_sessions_per_month                    0.52
## diet_quality_1_10                            -0.22
## anxiety_level_1_10                            1.00

Pearson correlation coefficients were calculated to examine the relationships between anxiety level and the numeric predictor variables. Stress level showed the strongest positive association with anxiety (r = 0.67), meaning that participants reporting higher stress also tended to report higher anxiety. Positive correlations were also observed for therapy sessions per month (0.52), caffeine intake (0.35), heart rate (0.19), breathing rate (0.16), sweating level (0.16), and alcohol consumption (0.10).

Sleep duration (-0.49), physical activity (-0.24), diet quality (-0.22), and age (-0.07) were negatively correlated with anxiety. Healthier lifestyle factors were therefore generally associated with lower anxiety, while the association with age was very weak.

These results are preliminary because each correlation considers one predictor at a time. The same relationships are examined more comprehensively in the multiple regression analysis below.
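To make the ranking explicit, the anxiety column of the correlation matrix can be sorted by absolute value. The sketch below re-enters the rounded coefficients printed above by hand so that it runs on its own (in a live session the same vector would come from cor_matrix[, "anxiety_level_1_10"]). The names r_anxiety, ranked, and t_stat are illustrative.

```r
# Rounded correlations with anxiety_level_1_10, copied from the matrix above
r_anxiety <- c(
  age                             = -0.07,
  sleep_hours                     = -0.49,
  physical_activity_hrs_week      = -0.24,
  caffeine_intake_mg_day          =  0.35,
  alcohol_consumption_drinks_week =  0.10,
  stress_level_1_10               =  0.67,
  heart_rate_bpm                  =  0.19,
  breathing_rate_breaths_min      =  0.16,
  sweating_level_1_5              =  0.16,
  therapy_sessions_per_month      =  0.52,
  diet_quality_1_10               = -0.22
)

# Rank predictors by the strength of the association, ignoring sign
ranked <- r_anxiety[order(abs(r_anxiety), decreasing = TRUE)]
print(ranked)

# t statistic for testing r = 0 with n = 11,000 observations
n <- 11000
t_stat <- r_anxiety * sqrt(n - 2) / sqrt(1 - r_anxiety^2)
print(round(t_stat, 1))
```

With a sample this large, even the weakest correlation in the table is statistically distinguishable from zero, so the size of r is more informative here than its p-value.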

Multiple Linear Regression

# Convert the Variables
SocAnx$gender <- as.factor(SocAnx$gender)
SocAnx$occupation <- as.factor(SocAnx$occupation)
SocAnx$smoking <- as.factor(SocAnx$smoking)
SocAnx$family_history_of_anxiety <- as.factor(SocAnx$family_history_of_anxiety)
SocAnx$dizziness <- as.factor(SocAnx$dizziness)
SocAnx$medication <- as.factor(SocAnx$medication)
SocAnx$recent_major_life_event <- as.factor(SocAnx$recent_major_life_event)

# Fit the Model
model <- lm(
  anxiety_level_1_10 ~
    age +
    gender +
    occupation + 
    sleep_hours +
    physical_activity_hrs_week +
    caffeine_intake_mg_day +
    alcohol_consumption_drinks_week +
    smoking +
    family_history_of_anxiety +
    stress_level_1_10 +
    heart_rate_bpm +
    breathing_rate_breaths_min +
    sweating_level_1_5 +
    dizziness +
    medication +
    therapy_sessions_per_month +
    recent_major_life_event +
    diet_quality_1_10,
  data = SocAnx
)

summary(model)
## 
## Call:
## lm(formula = anxiety_level_1_10 ~ age + gender + occupation + 
##     sleep_hours + physical_activity_hrs_week + caffeine_intake_mg_day + 
##     alcohol_consumption_drinks_week + smoking + family_history_of_anxiety + 
##     stress_level_1_10 + heart_rate_bpm + breathing_rate_breaths_min + 
##     sweating_level_1_5 + dizziness + medication + therapy_sessions_per_month + 
##     recent_major_life_event + diet_quality_1_10, data = SocAnx)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2138 -0.7828 -0.0237  0.7449  3.6739 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      3.0835094  0.1338729  23.033  < 2e-16 ***
## age                             -0.0080684  0.0008349  -9.663  < 2e-16 ***
## genderMale                      -0.0382927  0.0261813  -1.463  0.14361    
## genderOther                     -0.0212835  0.0262715  -0.810  0.41788    
## occupationAthlete                0.0316345  0.0544432   0.581  0.56121    
## occupationChef                  -0.1123977  0.0542775  -2.071  0.03840 *  
## occupationDoctor                -0.2980146  0.0592866  -5.027 5.07e-07 ***
## occupationEngineer              -0.3688120  0.0594689  -6.202 5.78e-10 ***
## occupationFreelancer            -0.0424284  0.0546396  -0.777  0.43746    
## occupationLawyer                -0.2742594  0.0597854  -4.587 4.54e-06 ***
## occupationMusician              -0.0112961  0.0533457  -0.212  0.83230    
## occupationNurse                 -0.1163612  0.0541867  -2.147  0.03178 *  
## occupationOther                  0.0454638  0.0542280   0.838  0.40183    
## occupationScientist             -0.2672181  0.0592913  -4.507 6.65e-06 ***
## occupationStudent               -0.1134104  0.0539362  -2.103  0.03552 *  
## occupationTeacher                0.0610405  0.0548139   1.114  0.26548    
## sleep_hours                     -0.4429993  0.0098440 -45.002  < 2e-16 ***
## physical_activity_hrs_week      -0.0696247  0.0060654 -11.479  < 2e-16 ***
## caffeine_intake_mg_day           0.0029340  0.0001086  27.005  < 2e-16 ***
## alcohol_consumption_drinks_week  0.0095142  0.0018948   5.021 5.21e-07 ***
## smokingYes                       0.0841698  0.0215730   3.902 9.61e-05 ***
## family_history_of_anxietyYes    -0.0806647  0.0253995  -3.176  0.00150 ** 
## stress_level_1_10                0.3784778  0.0038139  99.236  < 2e-16 ***
## heart_rate_bpm                   0.0044503  0.0006335   7.025 2.26e-12 ***
## breathing_rate_breaths_min       0.0123536  0.0021143   5.843 5.28e-09 ***
## sweating_level_1_5               0.0399365  0.0077916   5.126 3.02e-07 ***
## dizzinessYes                     0.0583170  0.0216031   2.699  0.00696 ** 
## medicationYes                    0.0706718  0.0214904   3.289  0.00101 ** 
## therapy_sessions_per_month       0.2477385  0.0064414  38.460  < 2e-16 ***
## recent_major_life_eventYes       0.0379746  0.0215571   1.762  0.07817 .  
## diet_quality_1_10               -0.0328425  0.0038137  -8.612  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.124 on 10969 degrees of freedom
## Multiple R-squared:  0.7203, Adjusted R-squared:  0.7195 
## F-statistic: 941.6 on 30 and 10969 DF,  p-value: < 2.2e-16

A multiple linear regression model was fitted to examine predictors of anxiety level. The overall model was statistically significant, F(30, 10969) = 941.6, p < .001, and explained approximately 72% of the variance in anxiety scores (adjusted R-squared = 0.7195). The individual coefficients are visualized in Figure 1 below.
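As a quick consistency check, the F-statistic and adjusted R-squared can be recovered from the reported R-squared and degrees of freedom. The sketch below plugs in the values printed in the model summary; the variable names are illustrative.

```r
r2     <- 0.7203          # Multiple R-squared from the model summary
k      <- 30              # number of estimated slopes (numerator df)
df_res <- 10969           # residual degrees of freedom
n      <- df_res + k + 1  # number of observations

# F = (explained variance per slope) / (unexplained variance per residual df)
f_stat <- (r2 / k) / ((1 - r2) / df_res)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res

print(round(c(F = f_stat, adj_R2 = adj_r2), 4))
```

Both quantities should agree with the summary output up to the rounding of R-squared.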

Stress level again emerged as the strongest predictor of anxiety, with by far the largest t statistic (t = 99.2): individuals reporting higher stress also reported higher anxiety. Therapy attendance, caffeine intake, alcohol consumption, smoking, heart rate, breathing rate, sweating level, dizziness, and medication use were also associated with higher anxiety scores. Conversely, age, sleep duration, physical activity, and diet quality showed significant negative relationships with anxiety: individuals who slept more, exercised more, reported healthier diets, and were older generally reported lower anxiety. The direction of every numeric predictor matches the correlation analysis above, which indicates that these associations persist after adjustment for the other variables.

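Occupation, as can be seen in the regression results, also contributes to anxiety level. Because occupation is dummy-coded, each coefficient compares one occupation with the reference category, Artist (the level absent from the coefficient table). Doctors, engineers, lawyers, and scientists showed the most strongly significant differences (p < .001), and all four coefficients are negative (between -0.27 and -0.37), meaning that these professions reported lower anxiety than artists after adjustment. Chefs, nurses, and students also scored slightly lower (p < .05), while the remaining occupations did not differ significantly from artists.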
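The sign of a factor coefficient depends on which level serves as the reference. The sketch below illustrates this on simulated data (the group means, sample sizes, and object names are assumptions for illustration, not values from the study): lm() uses the alphabetically first level as the baseline by default, and relevel() changes it.

```r
# Simulated data: three occupations with different mean anxiety
set.seed(7)
occupation <- factor(rep(c("Artist", "Doctor", "Teacher"), each = 50))
group_mean <- c(Artist = 5.0, Doctor = 4.0, Teacher = 5.2)
anxiety <- unname(group_mean[as.character(occupation)]) + rnorm(150, sd = 0.5)

# Default coding: the first level alphabetically ("Artist") is the reference,
# so each coefficient is a difference from the Artist mean
fit_artist <- lm(anxiety ~ occupation)
print(coef(fit_artist))

# Changing the reference level flips the sign of the Artist/Doctor contrast
occupation_ref <- relevel(occupation, ref = "Doctor")
fit_doctor <- lm(anxiety ~ occupation_ref)
print(coef(fit_doctor))
```

A negative coefficient for a level therefore means "lower than the reference group", not "low in absolute terms", and the reference group itself never appears in the coefficient table.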

Family history of anxiety showed a statistically significant but small negative relationship with anxiety after adjustment for the other predictors (b = -0.08, p = .0015).

Gender (p = .14 for male and p = .42 for other, each relative to female) and recent major life events (p = .078) were not statistically significant predictors after controlling for the other variables included in the model.

# Figure 1: Regression Coefficient Plot
coef_df <- tidy(model) 

coef_df <- coef_df[-1, ] 

ggplot( 
  coef_df, 
  aes( 
    x = reorder(term, estimate), 
    y = estimate 
    ) 
  ) + 
  geom_point(size = 3) + 
  geom_errorbar( 
    aes( 
      ymin = estimate - 1.96*std.error, 
      ymax = estimate + 1.96*std.error 
      ), 
    width = 0.2 
    ) + 
  coord_flip() + 
  labs( 
    title = "Figure 1: Predictors of Anxiety Level", 
    x = "Variable", 
    y = "Regression Coefficient" 
    ) + 
  theme_minimal()
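Figure 1 plots unstandardized coefficients, so a predictor measured in large units (caffeine in mg per day) appears close to zero next to a predictor on a 1-10 scale. One common way to put numeric predictors on a comparable footing is to rescale each slope by sd(x) / sd(y). The sketch below shows this on simulated data; the data frame toy, its columns, and the effect sizes are assumptions for illustration, not results from the study.

```r
# Simulated data with two predictors on very different scales
set.seed(1)
n <- 500
toy <- data.frame(
  stress   = sample(1:10, n, replace = TRUE),  # 1-10 rating
  caffeine = runif(n, 0, 600)                  # mg per day
)
toy$anxiety <- 0.4 * toy$stress + 0.003 * toy$caffeine + rnorm(n)

fit <- lm(anxiety ~ stress + caffeine, data = toy)

# Raw slopes: change in anxiety per one unit of each predictor
b <- coef(fit)[-1]

# Standardized slopes: change in anxiety (in SDs) per one SD of each predictor
# (this rescaling applies to numeric predictors only)
beta <- b * sapply(toy[names(b)], sd) / sd(toy$anxiety)

print(round(rbind(raw = b, standardized = beta), 4))
```

On the raw scale the caffeine slope looks negligible next to the stress slope, whereas the standardized slopes are directly comparable. The same rescaling could be applied to the numeric terms of the fitted model before plotting.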

Significant Factors Visuals

With the significant predictors of anxiety identified, the next step is to examine each relationship as it appears in the data set. Most of the significant factors have been grouped into general categories to make them easier to follow.

The visual categories include:

  • lifestyle factors
  • physiological symptoms
  • age & age groups
  • clinical & family factors
  • therapy
  • occupation

Three types of visuals were used to illustrate the findings of the normality summary, correlation analysis, and multiple linear regression: scatter plots (Figures 2 & 3), box/violin plots (Figures 2A, 5, 6 & 9), and histograms or bar charts (Figures 4, 7 & 8).

# Figure 2: Lifestyle Factors vs. Anxiety
Lifestyle <- SocAnx %>%
  select(
    anxiety_level_1_10,
    sleep_hours,
    physical_activity_hrs_week,
    caffeine_intake_mg_day,
    alcohol_consumption_drinks_week,
    diet_quality_1_10
  ) %>%
  pivot_longer(
    cols = -anxiety_level_1_10,
    names_to = "Predictor",
    values_to = "Value"
  )

ggplot(Lifestyle,
       aes(x = Value,
           y = anxiety_level_1_10)) +
  geom_point(alpha = .15) +
  geom_smooth(method = "lm") +
  facet_wrap(~Predictor, scales = "free_x") +
  labs(
    title = "Figure 2: Lifestyle Factors Associated with Anxiety",
    x = "Predictor Value",
    y = "Anxiety Level"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# Figure 2A: Smoking & Anxiety
# Separate Box Plot For Smoking (Smoking = Categorical Factor, Not Numeric & Therefore Cannot Be Included In Figure 2)
ggplot(SocAnx, aes(x = smoking, y = anxiety_level_1_10)) +
  geom_boxplot(fill = "brown", alpha = 0.6) +
  labs(
    title = "Figure 2A: Smoking Status and Anxiety Levels",
    x = "Smoking Status",
    y = "Anxiety Level (1–10)"
  ) +
  theme_minimal()

Figure 2: Lifestyle Factors vs. Anxiety

To illustrate the lifestyle behaviors that were significantly associated with anxiety levels, a series of scatter plots was created comparing anxiety scores with sleep duration, physical activity, caffeine intake, alcohol consumption, and diet quality (Figure 2).

The fitted trend lines suggest that anxiety increases slightly with higher alcohol consumption and caffeine intake, indicating a positive association between these factors and anxiety. Of the two, caffeine intake shows the stronger positive linear relationship.

In contrast, diet quality, physical activity, and sleep duration demonstrate negative relationships with anxiety, with individuals reporting healthier diets, greater levels of physical activity, and longer sleep duration generally exhibiting lower anxiety levels. Among these variables, sleep duration appears to show the strongest inverse linear association with anxiety, followed by physical activity.
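Judging which trend line is "strongest" by eye can be difficult when each facet has its own x-axis scale. One way to make the comparison explicit is to compute a Pearson correlation between each predictor and the anxiety score and rank them by absolute value. The following is a minimal sketch only: the data frame `sim` is simulated for illustration and is not the study data, and only two of the lifestyle columns are mimicked.

```r
# Simulated stand-in for SocAnx (hypothetical values, NOT the study data).
set.seed(1)
n <- 200
sim <- data.frame(
  sleep_hours            = rnorm(n, mean = 7, sd = 1),
  caffeine_intake_mg_day = rnorm(n, mean = 250, sd = 80)
)
# Assumed relationship used only to generate example anxiety scores.
sim$anxiety_level_1_10 <- 5 -
  0.8 * (sim$sleep_hours - 7) +
  0.005 * (sim$caffeine_intake_mg_day - 250) +
  rnorm(n, mean = 0, sd = 1)

# Pearson correlation of each predictor with anxiety.
predictors <- setdiff(names(sim), "anxiety_level_1_10")
r_by_predictor <- sapply(
  predictors,
  function(p) cor(sim[[p]], sim$anxiety_level_1_10)
)

# Rank predictors by the absolute size of the correlation.
r_sorted <- r_by_predictor[order(abs(r_by_predictor), decreasing = TRUE)]
print(round(r_sorted, 2))
```

Applied to the real data, the same pattern would use the five lifestyle columns selected for Figure 2 in place of the simulated ones.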

Figure 2A: Smoking & Anxiety

Because smoking is a categorical rather than a numeric factor, it could not be included in the overall lifestyle factors visual.

The box plot shows only minor variation between the two groups; their distributions look very similar overall. Outlier cases of high anxiety also appear regardless of smoking status.
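When two box plots look nearly identical, a numeric summary can help confirm how small the difference is. The sketch below compares group medians and interquartile ranges and runs a Welch two-sample t-test. It uses simulated data with hypothetical "No"/"Yes" smoking labels (the actual coding of `smoking` in the data set is an assumption here), and the t-test is offered as one reasonable check, not as the method used in this study.

```r
# Simulated stand-in for SocAnx (hypothetical values, NOT the study data).
set.seed(2)
sim <- data.frame(
  smoking = rep(c("No", "Yes"), each = 100),
  # Anxiety scores rounded and clipped to the 1-10 scale.
  anxiety_level_1_10 = pmin(10, pmax(1, round(rnorm(200, mean = 4, sd = 2))))
)

# Median and interquartile range per group (what the box plot displays).
group_medians <- tapply(sim$anxiety_level_1_10, sim$smoking, median)
group_iqr     <- tapply(sim$anxiety_level_1_10, sim$smoking, IQR)

# Welch two-sample t-test on the group means (unequal variances allowed).
welch <- t.test(anxiety_level_1_10 ~ smoking, data = sim)

print(group_medians)
print(group_iqr)
print(welch$p.value)
```

With a sample as large as this one, even a very small difference can be statistically significant, so the group medians are worth reading alongside the p-value.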

# Figure 3: Physiological Symptoms vs Anxiety
Physiology <- SocAnx %>%
  select(
    anxiety_level_1_10,
    stress_level_1_10,
    heart_rate_bpm,
    breathing_rate_breaths_min,
    sweating_level_1_5
  ) %>%
  pivot_longer(
    cols = -anxiety_level_1_10,
    names_to = "Predictor",
    values_to = "Value"
  )

ggplot(Physiology,
       aes(x = Value,
           y = anxiety_level_1_10)) +
  geom_point(alpha = .15) +
  geom_smooth(method = "lm") +
  facet_wrap(~Predictor, scales = "free_x") +
  labs(
    title = "Figure 3: Physiological Indicators of Anxiety",
    x = "Predictor Value",
    y = "Anxiety Level"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# Figure 3A: Dizziness & Anxiety
# Separate Box Plot For Dizziness (Dizziness = Categorical Factor, Not Numeric & Therefore Cannot Be Included In Figure 3)
ggplot(SocAnx, aes(x = dizziness, y = anxiety_level_1_10)) +
  geom_boxplot(fill = "darkgrey", alpha = 0.6) +
  labs(
    title = "Figure 3A: Anxiety Levels by Dizziness",
    x = "Dizziness",
    y = "Anxiety Level (1–10)"
  ) +
  theme_minimal()

Figure 3: Physiological Symptoms vs Anxiety

A series of scatter plots was also created to show the physiological symptoms significantly associated with anxiety. The fitted trend lines suggest that anxiety tends to increase with higher heart rate, breathing rate, sweating level, and stress level.

While higher stress levels may contribute to higher anxiety, the other three factors (heart rate, breathing rate, and sweating level) are more likely indicators of high anxiety than contributors to it.

Figure 3A: Dizziness & Anxiety

Because dizziness is a categorical rather than a numeric factor, it could not be included in the overall physiological factors visual. As with smoking, the box plot shows little variation between the two groups, although lower levels of dizziness still appear to correspond to lower overall anxiety levels.

As with breathing rate, heart rate, and sweating level in Figure 3, dizziness is inferred to be an indicator of high anxiety rather than a contributor or causal factor.

# Figure 4: Age Distribution
ggplot(SocAnx, aes(x = age)) +
  geom_histogram(binwidth = 2, fill = "green", color = "black", alpha = 0.8) +
  labs(
    title = "Figure 4: Distribution of Age",
    x = "Age",
    y = "Frequency"
  ) +
  theme_minimal()

# Figure 5: Anxiety Across Age Groups
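# cut() uses right-closed intervals by default, so the bins are
# [18,30], (30,45], (45,60], (60,80]; the labels are written to match.
SocAnx$age_group <- cut(
  SocAnx$age,
  breaks = c(18, 30, 45, 60, 80),
  labels = c("18-30", "31-45", "46-60", "61+"),
  include.lowest = TRUE
)

table(SocAnx$age_group)
## 
## 18-30 31-45 46-60   61+ 
##  3129  3820  3213   838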
ggplot(SocAnx,
       aes(x = age_group,
           y = anxiety_level_1_10,
           fill = age_group)) +
  geom_boxplot() +
  labs(
    title = "Figure 5: Anxiety Across Age Groups",
    x = "Age Group",
    y = "Anxiety Level"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Figure 4: Age Distribution

Figure 4 is a histogram of the age distribution within this data set. Because the sample is large (n = 11,000), understanding the demographics of the participants helps in interpreting the relationship between age and anxiety, as age is a highly significant factor.

Figure 5: Anxiety Across Age Groups

While Figure 4 examines the overall distribution of age, designated age groups were also analyzed to determine whether anxiety levels differed across distinct stages of life (as seen in Figure 5). Grouping participants into age categories allows for easier comparison of average anxiety levels between demographic groups and may reveal patterns that are less visible in a continuous scatter plot. A box plot was selected because it effectively displays the distribution, median, spread, and potential outliers within each age group, providing a clearer understanding of how anxiety varies across the different age categories.
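The composition of each age group depends on how `cut()` treats ages that fall exactly on a break. By default the intervals are closed on the right, so a boundary age is placed in the lower bin; setting `right = FALSE` places it in the upper bin instead. The short sketch below uses a handful of hypothetical boundary ages (not participant data) with the same breaks as Figure 5 to show the difference.

```r
# Hypothetical boundary ages, chosen only to show how cut() assigns them.
ages   <- c(18, 29, 30, 31, 45, 60, 80)
breaks <- c(18, 30, 45, 60, 80)

# Default: right-closed intervals, so age 30 joins the first bin.
right_closed <- cut(ages, breaks = breaks, include.lowest = TRUE)

# Alternative: left-closed intervals, so age 30 starts the second bin.
left_closed <- cut(ages, breaks = breaks, include.lowest = TRUE, right = FALSE)

boundary_check <- data.frame(age = ages, right_closed, left_closed)
print(boundary_check)
```

Leaving the labels off, as here, makes `cut()` print the interval notation itself, which is a quick way to check that custom labels describe the bins actually produced.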

# Figure 6: Clinical & Family Factors
Clinical <- SocAnx %>%
  select(
    anxiety_level_1_10,
    family_history_of_anxiety,
    medication,
    recent_major_life_event
  ) %>%
  pivot_longer(
    cols = -anxiety_level_1_10,
    names_to = "Factor",
    values_to = "Group"
  )

ggplot(Clinical,
       aes(x = Group,
           y = anxiety_level_1_10,
           fill = Group)) +
  geom_boxplot() +
  facet_wrap(~Factor, scales = "free_x") +
  labs(
    title = "Figure 6: Clinical and Family Factors Associated with Anxiety",
    x = "",
    y = "Anxiety Level"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Figure 6: Clinical & Family Factors

To illustrate the clinical and family factors that were significantly associated with anxiety levels, a series of box plots was generated for family history of anxiety, medication use, and recent major life events (Figure 6). Participants with a family history of anxiety generally reported higher anxiety levels, although some participants without a family history also reported high anxiety. Participants who were not taking medication and those without a recent major life event typically reported lower anxiety levels, while the box plots for medication use and recent life events show higher anxiety in those groups.

# Figure 7: Therapy Sessions and Anxiety
SocAnx$therapy_group <- cut(
  SocAnx$therapy_sessions_per_month,
  breaks = c(0, 1, 4, 8, Inf),
  labels = c("0-1", "2-4", "5-8", "9+"),
  include.lowest = TRUE
)

ggplot(SocAnx, aes(x = anxiety_level_1_10)) +
  geom_histogram(binwidth = 1,
                 fill = "green",
                 color = "black") +
  facet_wrap(~ therapy_group) +
  labs(
    title = "Figure 7: Distribution of Anxiety Levels by Therapy Frequency",
    x = "Anxiety Level",
    y = "Frequency"
  ) +
  theme_minimal()

Figure 7: Therapy Sessions and Anxiety

To illustrate the relationship between monthly therapy session frequency and anxiety levels, a set of histograms faceted by therapy group was generated (Figure 7). Anxiety levels appear higher at lower frequencies of therapy attendance, with severity decreasing as attendance increases. Because people experiencing high anxiety often seek treatment, this inverse relationship is consistent with psychological support helping to mitigate high levels of anxiety.
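Faceted histograms show the shape of each distribution, but the panels can be hard to compare when the groups differ in size. A small summary table of group size and mean anxiety per therapy group gives a direct numeric comparison. The sketch below reuses the same `cut()` breaks as Figure 7 on simulated values; the data are randomly generated with no built-in relationship and say nothing about the actual results.

```r
# Simulated stand-in for SocAnx (hypothetical values, NOT the study data).
set.seed(4)
sim <- data.frame(
  therapy_sessions_per_month = sample(0:12, 120, replace = TRUE),
  anxiety_level_1_10         = sample(1:10, 120, replace = TRUE)
)

# Same grouping as Figure 7.
sim$therapy_group <- cut(
  sim$therapy_sessions_per_month,
  breaks = c(0, 1, 4, 8, Inf),
  labels = c("0-1", "2-4", "5-8", "9+"),
  include.lowest = TRUE
)

# Group size and mean anxiety for each therapy group.
group_n    <- table(sim$therapy_group)
group_mean <- tapply(sim$anxiety_level_1_10, sim$therapy_group, mean)

summary_tbl <- data.frame(
  therapy_group = names(group_mean),
  n             = as.vector(group_n),
  mean_anxiety  = round(as.vector(group_mean), 2)
)
print(summary_tbl)
```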

# Figure 8: Occupation Distribution
ggplot(SocAnx, aes(x = occupation)) +
  geom_bar(fill = "green", color = "black", alpha = 0.85) +
  labs(
    title = "Figure 8: Distribution of Occupation",
    x = "Occupation",
    y = "Count"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Figure 9: Occupation & Anxiety
ggplot(SocAnx, aes(x = occupation, y = anxiety_level_1_10)) +
  geom_violin(fill = "green", alpha = 0.5) +
  geom_boxplot(width = 0.1, fill = "black", alpha = 0.7) +
  labs(
    title = "Figure 9: Anxiety Distribution by Occupation",
    x = "Occupation",
    y = "Anxiety Level (1–10)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Figure 8: Occupation Distribution

A career often takes up a large portion of a person’s life, and most jobs involve not only social interaction but also high-pressure environments where scrutiny from others is common. As mentioned in the introduction, people who suffer from social anxiety are more likely to be absent from work and often have poorer work performance (Jefferies, 2020). Occupation is therefore an important demographic to examine and a potentially important indicator of anxiety. A bar chart (Figure 8) was generated to show the distribution of occupation types in this data set.

Figure 9: Occupation & Anxiety

The relationship between occupation and anxiety level is shown in Figure 9, where a violin plot with overlaid box plots indicates which occupations are associated with reports of higher anxiety. As shown in the visual, doctors, engineers, lawyers and scientists report the highest levels of anxiety (and are the highest in statistical significance), followed by nurses, students and chefs.

Discussion

The purpose of this study was to examine the factors associated with anxiety levels using demographic, lifestyle, physiological, and psychological variables. Results from the descriptive statistics, correlation analysis, and multiple regression model indicate that anxiety is influenced by a complex combination of behavioral, physical, and emotional factors.

Several lifestyle factors demonstrated meaningful relationships with anxiety. As shown in Figure 2, individuals who reported fewer hours of sleep, lower levels of physical activity, poorer diet quality, and greater caffeine consumption generally exhibited higher anxiety scores. These visual patterns were supported by the regression analysis, which found sleep duration, physical activity, and diet quality to be significant negative predictors of anxiety, while caffeine intake was a significant positive predictor. These findings suggest that healthier lifestyle behaviors may serve as protective factors against anxiety symptoms.

Physiological symptoms were also strongly associated with anxiety. Figures 3 and 3A illustrate positive relationships between anxiety and physiological indicators including heart rate, breathing rate, sweating level, and dizziness. Participants experiencing more severe physiological symptoms generally reported higher anxiety levels. These findings are consistent with the physical manifestations commonly associated with anxiety disorders and suggest that physiological arousal is an important indicator of anxiety.

Age also appeared to be related to anxiety levels. Figure 5 compares anxiety levels across age groups and shows that younger participants tended to report higher anxiety scores than older participants.

Clinical and family-related factors were examined in Figure 6. Although family history of anxiety was statistically significant in the regression model, its negative coefficient suggests that after controlling for other predictors, the relationship may be more complex than initially expected. This finding may reflect interactions among variables within the model rather than a direct protective effect of family history.

Figure 7 illustrates the relationship between therapy attendance and anxiety levels. Participants attending a greater number of therapy sessions tended to report lower anxiety scores. Individuals experiencing more severe anxiety are more likely to seek professional treatment to reduce its severity, and the pattern in Figure 7 suggests that this treatment may help.

Occupation type was also found to be a significant contributor to anxiety levels, as shown in Figure 9. Individuals in higher-stress or higher-pressure occupations tended to report higher anxiety levels.

The multiple regression analysis demonstrated strong overall predictive performance, explaining approximately 72% of the variance in anxiety levels (Adjusted R² = 0.7195). Stress level emerged as the strongest predictor of anxiety, highlighting the central role of psychological stress in anxiety experiences. Lifestyle factors, physiological symptoms, and demographic characteristics also contributed significantly to the model, demonstrating that anxiety is influenced by multiple interconnected factors.
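Raw regression coefficients depend on each variable's units, so comparing them directly can mislead (a one-point change in stress is not comparable to a one-milligram change in caffeine). One common way to compare predictor strength is to standardize every variable before fitting, so each coefficient is expressed in standard deviations. The sketch below does this on simulated data with assumed effect sizes; it is not the study's model and only illustrates the technique with three of the predictors.

```r
# Simulated stand-in for SocAnx (hypothetical values, NOT the study data).
set.seed(3)
n <- 300
sim <- data.frame(
  stress_level_1_10      = sample(1:10, n, replace = TRUE),
  sleep_hours            = rnorm(n, mean = 7, sd = 1.2),
  caffeine_intake_mg_day = rnorm(n, mean = 250, sd = 90)
)
# Assumed effect sizes used only to generate example anxiety scores.
sim$anxiety_level_1_10 <- 2 +
  0.6 * sim$stress_level_1_10 -
  0.4 * sim$sleep_hours +
  0.002 * sim$caffeine_intake_mg_day +
  rnorm(n)

# Standardize every column (mean 0, sd 1), then fit the linear model.
sim_z <- as.data.frame(scale(sim))
fit_z <- lm(anxiety_level_1_10 ~ ., data = sim_z)

# Standardized coefficients, intercept dropped, ranked by absolute size.
std_beta <- coef(fit_z)[-1]
print(round(std_beta[order(abs(std_beta), decreasing = TRUE)], 3))

# Adjusted R-squared is unchanged by standardizing the variables.
print(summary(fit_z)$adj.r.squared)
```

Categorical predictors such as occupation would need separate handling, since standardizing dummy variables is harder to interpret.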

Overall, the findings suggest that anxiety is closely associated with stress, physiological symptoms, lifestyle behaviors, and demographic characteristics. Interventions focused on stress management, healthy sleep habits, regular physical activity, and overall wellness may help reduce anxiety levels and improve mental health outcomes.

Conclusion

This study examined the demographic, lifestyle, physiological, and clinical factors associated with anxiety levels using data from 11,000 individuals. Through descriptive statistics, visualizations, correlation analysis, and multiple regression modeling, several significant predictors of anxiety were identified.

The results demonstrated that stress level was the strongest predictor of anxiety, highlighting the important role that psychological stress plays in mental health outcomes. Higher anxiety levels were also associated with increased caffeine consumption, alcohol use, smoking status, physiological symptoms, therapy attendance, and medication use. In contrast, greater sleep duration, higher levels of physical activity, better diet quality, and older age were associated with lower anxiety levels.

The regression model explained approximately 72% of the variation in anxiety scores, indicating that the selected predictors collectively provided a strong explanation of anxiety levels within the data set. These findings suggest that anxiety is influenced by a combination of psychological, behavioral, physiological, and demographic factors rather than a single underlying cause.

Overall, the study emphasizes the importance of healthy lifestyle behaviors and stress management in understanding anxiety. The findings may help inform future research and mental health interventions aimed at reducing anxiety symptoms and improving overall well-being. Future studies should explore these relationships using longitudinal data to better understand causal pathways and changes in anxiety over time.

References

Jefferies, P., & Ungar, M. (2020). Social anxiety in young people: A prevalence study in seven countries. PLOS ONE, 15(9), e0239133. https://doi.org/10.1371/journal.pone.0239133

Jefferson, J. W. (2001). Social anxiety disorder: More than just a little shyness. Primary Care Companion to the Journal of Clinical Psychiatry, 3(1), 4–9. https://doi.org/10.4088/pcc.v03n0102

Stein, M. B., & Stein, D. J. (2008). Social anxiety disorder. The Lancet, 371(9618), 1115–1125. https://doi.org/10.1016/S0140-6736(08)60488-2

Zhang, C. (2024). Social Anxiety Data set [Data set]. Kaggle. https://www.kaggle.com/datasets/natezhang123/social-anxiety-dataset