Risk Factors Associated with Osteoporosis: A Statistical Analysis of Demographic, Lifestyle, and Medical Predictors

Warsha Autar
June 26, 2026

Introduction

Osteoporosis is a progressive skeletal disorder characterised by reduced bone density and an increased risk of fractures, particularly among older adults (Compston et al., 2019). Vaishya et al. (2023) highlight that identifying risk factors that contribute to osteoporosis is essential for early diagnosis, prevention, and targeted clinical intervention. This raises the central research question: Which risk factors serve as the strongest predictors of osteoporosis? This study analyses a patient dataset containing demographic, lifestyle, and health-related variables to explore patterns associated with osteoporosis and assess their predictive value (Doxcian, 2024). Understanding these relationships can support improved risk stratification and earlier clinical identification of high-risk individuals. The primary objectives of this analysis are to:

  • Identify key demographic, lifestyle, and clinical factors associated with osteoporosis.
  • Determine which variables act as the strongest predictors of osteoporosis risk.
  • Evaluate relationships between categorical and continuous variables in relation to disease presence.
  • Develop a predictive framework to support early identification of individuals at risk.
  • Methodology

    This study applies a structured statistical and analytical approach to investigate osteoporosis risk factors using a patient dataset. The analysis begins with exploratory data analysis (EDA), including descriptive statistics to summarise demographic, lifestyle, and clinical characteristics of the population. To assess relationships between variables, non-parametric statistical tests are used where appropriate, alongside independence tests to examine associations between categorical variables and osteoporosis status. Finally, a multivariate logistic regression model is implemented to evaluate the predictive strength of identified risk factors and to determine their relative contribution to osteoporosis risk. This combined approach allows for both hypothesis-driven testing and predictive modelling, ensuring a comprehensive assessment of key diagnostic indicators.

    1 Data Preparation & Early Exploration

    Before conducting formal statistical analyses, it is important to understand the structure and characteristics of the dataset. Exploratory Data Analysis (EDA) provides an overview of the variables available, identifies their measurement types, and determines which statistical methods are appropriate. This preliminary assessment ensures that suitable statistical techniques are selected for each variable throughout the analysis.

    1.1 Libraries and Data loading

    # 1.1 Firstly we load the appropriate libraries
    library(ggplot2)
    library(dplyr)
    
    ## 
    ## Attaching package: 'dplyr'
    
    ## The following objects are masked from 'package:stats':
    ## 
    ##     filter, lag
    
    ## The following objects are masked from 'package:base':
    ## 
    ##     intersect, setdiff, setequal, union
    
    library(patchwork)
    library(pROC)
    
    ## Type 'citation("pROC")' for a citation.
    
    ## 
    ## Attaching package: 'pROC'
    
    ## The following objects are masked from 'package:stats':
    ## 
    ##     cov, smooth, var
    
    library(janitor)
    
    ## 
    ## Attaching package: 'janitor'
    
    ## The following objects are masked from 'package:stats':
    ## 
    ##     chisq.test, fisher.test
    
    # 1.2 Load data
    setwd("/Users/warshaautar/Desktop/Statistics/Autar_Final_Report")
    getwd()
    
    ## [1] "/Users/warshaautar/Desktop/Statistics/Autar_Final_Report"
    
    data <- read.csv("osteoporosis.csv")
    
    #1.3 Descriptive labels
    str(data)
    
    ## 'data.frame':   1958 obs. of  16 variables:
    ##  $ Id                 : int  104866 101999 106567 102316 101944 102265 107447 103065 103040 105960 ...
    ##  $ Age                : int  69 32 89 78 38 41 20 39 70 19 ...
    ##  $ Gender             : chr  "Female" "Female" "Female" "Female" ...
    ##  $ Hormonal.Changes   : chr  "Normal" "Normal" "Postmenopausal" "Normal" ...
    ##  $ Family.History     : chr  "Yes" "Yes" "No" "No" ...
    ##  $ Race.Ethnicity     : chr  "Asian" "Asian" "Caucasian" "Caucasian" ...
    ##  $ Body.Weight        : chr  "Underweight" "Underweight" "Normal" "Underweight" ...
    ##  $ Calcium.Intake     : chr  "Low" "Low" "Adequate" "Adequate" ...
    ##  $ Vitamin.D.Intake   : chr  "Sufficient" "Sufficient" "Sufficient" "Insufficient" ...
    ##  $ Physical.Activity  : chr  "Sedentary" "Sedentary" "Active" "Sedentary" ...
    ##  $ Smoking            : chr  "Yes" "No" "No" "Yes" ...
    ##  $ Alcohol.Consumption: chr  "Moderate" "None" "Moderate" "None" ...
    ##  $ Medical.Conditions : chr  "Rheumatoid Arthritis" "None" "Hyperthyroidism" "Rheumatoid Arthritis" ...
    ##  $ Medications        : chr  "Corticosteroids" "None" "Corticosteroids" "Corticosteroids" ...
    ##  $ Prior.Fractures    : chr  "Yes" "Yes" "No" "No" ...
    ##  $ Osteoporosis       : int  1 1 1 1 1 1 1 1 1 1 ...
    

    The str() function was used to examine the structure of the dataset. This confirmed that the dataset contained both numerical and categorical variables. Numerical variables included Age and Osteoporosis status, while categorical variables included Gender, Race/Ethnicity, Smoking status, Physical Activity level, Calcium Intake, Vitamin D Intake, and several medical risk factors. Understanding the structure of the data was important in determining which statistical tests were appropriate for each variable. Correctly identifying variable types is essential because continuous and categorical variables require different statistical methods. This classification therefore provides the foundation for selecting appropriate hypothesis tests and predictive models throughout the analysis.

    Demographics
  • Age
  • Gender
  • Ethnicity
  • Behavioural
  • Physical Activity
  • Vitamin D Intake
  • Calcium Intake
  • Smoking
  • Alcohol Consumption
  • Medical History
  • Hormonal Changes
  • Family History
  • Body Weight
  • Prior Fractures
  • Medical Conditions
  • Medications
  • In short The exploratory assessment confirmed that the dataset contains an appropriate combination of continuous and categorical variables for investigating potential predictors of osteoporosis.

    1.2 Early data exploration

    Before further statistical analysis of osteoporosis risk, age should be evaluated for distributional properties, while gender should be considered as a stratification variable in subsequent comparisons.

    #1.2.1 Gender Count
    tabyl(data$Gender)
    
    ##  data$Gender   n   percent
    ##       Female 966 0.4933606
    ##         Male 992 0.5066394
    

    The distribution of gender within the dataset was examined using the tabyl() function. The results showed that 966 participants (49.3%) were female and 992 participants (50.7%) were male. This indicates that the sample was almost evenly divided between males and females. This balanced gender distribution reduces the likelihood that any subsequent differences in osteoporosis prevalence are simply the result of unequal representation of males and females.

    #1.2.2 To begin, what are the mean and median ages in the data
    summary(data$Age)
    
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##    18.0    21.0    32.0    39.1    53.0    90.0
    

    This means the youngest age in the data is 18yo, the median age is 32yo, the mean age is 39.1yo, and the oldest age is 90yo. The difference between the mean (39.1 years) and median (32 years) suggests that the age distribution is positively skewed, with a relatively large proportion of younger participants represented in the dataset.

    Before comparing groups statistically, it is also necessary to determine whether Age follows a normal distribution. as Age is the only continuous variable present in the dataset, it is the only variable for which such a normality assessment makes sense. With very large datasets, normality tests often become overly sensitive and may incorrectly detect trivial deviations from normality. Using a sample provides a more realistic assessment, thus a Shapiro-Wilk normality test is performed on a random sample of 500 observations.

    #1.2.3 Shapiro Sample test
    shapiro.test(sample(data$Age, 500))
    
    ## 
    ##  Shapiro-Wilk normality test
    ## 
    ## data:  sample(data$Age, 500)
    ## W = 0.84969, p-value < 2.2e-16
    

    The test produced a W statistic of approximately 0.846 and a p-value smaller than 0.001. Since the p-value was less than the significance level of 0.05, the null hypothesis of normality was rejected. This indicates that age was not normally distributed within the study population. Because the assumption of normality was violated, non-parametric statistical methods were more appropriate than parametric alternatives. Consequently, the Wilcoxon Rank-Sum Test has been selected in preference to an independent samples t-test.

    #1.2.4 Wilcoxon test
    wilcox.test(Age ~ Osteoporosis, data = data)
    
    ## 
    ##  Wilcoxon rank sum test with continuity correction
    ## 
    ## data:  Age by Osteoporosis
    ## W = 93121, p-value < 2.2e-16
    ## alternative hypothesis: true location shift is not equal to 0
    

    The Wilcoxon rank-sum test indicates a statistically significant difference in age distributions between individuals with and without osteoporosis (W = 93121, p < 2.2 × 10⁻¹⁶). This provides strong statistical evidence that individuals diagnosed with osteoporosis are substantially older than those without the condition, suggesting that age is an important explanatory variable for osteoporosis risk.

    #1.2.5 Plotting Figure 1
    ggplot(data, aes(x = Age)) +
      geom_histogram(bins = 25, fill = "salmon", color = "white") +
      theme_minimal() +
      labs(title = "Figure 1: Visual Inspection of Age Distribution")
    
    plot of chunk AGE2

    Figure 1 visually supports the Shapiro-Wilk test by illustrating a positively skewed age distribution. Most participants are younger than 25 years, with progressively fewer observations at older ages. This graphical evidence reinforces the decision to use non-parametric statistical methods.

    Continuous age values are useful for statistical testing but can be difficult to interpret visually in proportional comparisons. Although age remains a continuous variable for statistical modelling, it was categorised into clinically meaningful age groups to facilitate interpretation of prevalence patterns and allow categorical analyses, such as the Chi-square test, to be performed..

    #1.2.6 Age to age category conversion
    data <- data %>%
      mutate(
        Age_Group = cut(
          Age,
          breaks = c(0, 24, 39, 49, 59, 69, 79, 100),
          labels = c("<25", "26-40", "40-49", "50-59", "60-69", "70-79", "80+")
        )
      )
    

    The exploratory analyses consistently demonstrate that age differs substantially between osteoporosis groups and should therefore be considered a key explanatory variable in subsequent modelling.

    2 Chi-Square Independence Evaluation

    The exploratory analyses identified age as a potentially important predictor of osteoporosis. The next stage investigates whether all demographic, lifestyle and medical variables are independently associated with osteoporosis status.

    # 2.1 Listing all the variables
    variables <- c("Age_Group","Smoking", "Hormonal.Changes", "Family.History", "Body.Weight",
                   "Calcium.Intake", "Vitamin.D.Intake", "Physical.Activity",
                   "Alcohol.Consumption", "Medical.Conditions", "Medications", "Prior.Fractures")
    
    # 2.2 Running the Chi-Square test for each one automatically
    results_list <- lapply(variables, function(var) {
      test <- chisq.test(data[[var]], data$Osteoporosis)
    
      data.frame(
        Factor = var,
        X_Squared = round(test$statistic, 3),
        df = test$parameter,
        P_Value = round(test$p.value, 4),
        Significant = ifelse(test$p.value < 0.05, "Yes (p < 0.05)", "No")
      )
    })
    
    # 2.3 Combine them all into one clean table
    chi_square_table <- do.call(rbind, results_list)
    
    chi_square_table
    
    ##                          Factor X_Squared df P_Value    Significant
    ## X-squared             Age_Group  1117.039  6  0.0000 Yes (p < 0.05)
    ## X-squared1              Smoking     0.460  1  0.4978             No
    ## X-squared2     Hormonal.Changes     0.400  1  0.5269             No
    ## X-squared3       Family.History     0.002  1  0.9639             No
    ## X-squared4          Body.Weight     2.367  1  0.1239             No
    ## X-squared5       Calcium.Intake     0.018  1  0.8921             No
    ## X-squared6     Vitamin.D.Intake     0.524  1  0.4693             No
    ## X-squared7    Physical.Activity     0.663  1  0.4155             No
    ## X-squared8  Alcohol.Consumption     0.002  1  0.9639             No
    ## X-squared9   Medical.Conditions     0.296  2  0.8626             No
    ## X-squared10         Medications     2.092  1  0.1481             No
    ## X-squared11     Prior.Fractures     0.400  1  0.5269             No
    

    Chi-square tests of independence were performed to investigate whether osteoporosis status was associated with several categorical risk factors. For each variable, a contingency table was analysed using the chisq.test() function. The resulting p-values for smoking status, hormonal changes, family history, body weight, calcium intake, vitamin D intake, physical activity, alcohol consumption, medical conditions, medication use, and prior fractures were all greater than the 0.05 significance threshold. Because these values remained high, the null hypothesis of independence was retained, indicating insufficient statistical evidence to link these lifestyle or clinical variables directly to an osteoporosis diagnosis within this specific dataset. In contrast, the Age_Group variable produced a p-value significantly below the 0.05 threshold. This allows us to reject the null hypothesis and confirms a statistically significant association between a participant's age bracket and their osteoporosis status. Although many recognised osteoporosis risk factors were not statistically significant, this should not be interpreted as evidence that they are clinically unimportant. Their effects may be comparatively small, may interact with age, or may require an older or more representative study population before statistically significant relationships become apparent.

    3 Proportional Osteoporosis Status

    Here we observe data shapes and partition populations by age boundaries to inspect variations in distribution profiles.

    3.1 Demographic Profiles & Osteoporosis Status

    While statistical hypothesis tests determine whether associations exist, graphical visualisations allow these relationships to be interpreted more intuitively. The following proportional bar charts illustrate how osteoporosis prevalence varies across demographic, lifestyle, and medical categories.

    #Fig.2: Osteoporosis by Gender
    demo1 <- ggplot(data, aes(x = Gender, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes"))+
      theme_minimal() +
      labs(title = "Fig.2: Osteoporosis by Gender", x = "Gender", y = "Proportion", fill = "Osteoporosis")
    scale_fill_discrete(labels = c("0" = "No", "1" = "Yes"))
    
    ## <ggproto object: Class ScaleDiscrete, Scale, gg>
    ##     aesthetics: fill
    ##     axis_order: function
    ##     break_info: function
    ##     break_positions: function
    ##     breaks: waiver
    ##     call: call
    ##     clone: function
    ##     dimension: function
    ##     drop: TRUE
    ##     expand: waiver
    ##     fallback_palette: function
    ##     get_breaks: function
    ##     get_breaks_minor: function
    ##     get_labels: function
    ##     get_limits: function
    ##     get_transformation: function
    ##     guide: legend
    ##     is_discrete: function
    ##     is_empty: function
    ##     labels: No Yes
    ##     limits: NULL
    ##     make_sec_title: function
    ##     make_title: function
    ##     map: function
    ##     map_df: function
    ##     minor_breaks: waiver
    ##     n.breaks.cache: NULL
    ##     na.translate: TRUE
    ##     na.value: grey50
    ##     name: waiver
    ##     palette: NULL
    ##     palette.cache: NULL
    ##     position: left
    ##     range: environment
    ##     rescale: function
    ##     reset: function
    ##     train: function
    ##     train_df: function
    ##     transform: function
    ##     transform_df: function
    ##     super:  <ggproto object: Class ScaleDiscrete, Scale, gg>
    
    #Fig.3:Osteoporosis by Age Group
    demo2 <- ggplot(data, aes(x = Age_Group, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.3:Osteoporosis by Age Group", x = "Age Group", y = "Proportion", fill = "Osteoporosis")
    scale_fill_discrete(labels = c("0" = "No", "1" = "Yes"))
    
    ## <ggproto object: Class ScaleDiscrete, Scale, gg>
    ##     aesthetics: fill
    ##     axis_order: function
    ##     break_info: function
    ##     break_positions: function
    ##     breaks: waiver
    ##     call: call
    ##     clone: function
    ##     dimension: function
    ##     drop: TRUE
    ##     expand: waiver
    ##     fallback_palette: function
    ##     get_breaks: function
    ##     get_breaks_minor: function
    ##     get_labels: function
    ##     get_limits: function
    ##     get_transformation: function
    ##     guide: legend
    ##     is_discrete: function
    ##     is_empty: function
    ##     labels: No Yes
    ##     limits: NULL
    ##     make_sec_title: function
    ##     make_title: function
    ##     map: function
    ##     map_df: function
    ##     minor_breaks: waiver
    ##     n.breaks.cache: NULL
    ##     na.translate: TRUE
    ##     na.value: grey50
    ##     name: waiver
    ##     palette: NULL
    ##     palette.cache: NULL
    ##     position: left
    ##     range: environment
    ##     rescale: function
    ##     reset: function
    ##     train: function
    ##     train_df: function
    ##     transform: function
    ##     transform_df: function
    ##     super:  <ggproto object: Class ScaleDiscrete, Scale, gg>
    
    #Fig.4:Osteoporosis by Race/Ethnicity
    demo3 <- ggplot(data, aes(x = `Race.Ethnicity`, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.4:Osteoporosis by Race/Ethnicity", x = "Race / Ethnicity", y = "Proportion", fill = "Osteoporosis")
    
    
    # Render Page 1
    demo1 / demo2 / demo3 +
      plot_layout(guides = "collect") +
      plot_annotation(title = "Figures 2-4: Demographic Profiles & Osteoporosis Status")
    
    plot of chunk page1

    Figures 2–4 compare osteoporosis prevalence across gender, age group, and race/ethnicity. The proportional bar charts indicate relatively similar osteoporosis prevalence across gender and ethnic groups. In contrast, the age-group visualisation demonstrates a clear increase in osteoporosis prevalence among older participants. This visual pattern closely mirrors the statistical findings obtained from both the Wilcoxon Rank-Sum Test and the Chi-square analysis. The agreement between the graphical and statistical evidence increases confidence that age is genuinely associated with osteoporosis risk within this population.

    3.2 Behavioral & Diet Risk Factors

    We systematically verify whether physical behaviors or lifestyle options generate proportional differences in diagnosis rates.

    #Fig.5: Physical Activity Level
    life1 <- ggplot(data, aes(x = `Physical.Activity`, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.5: Physical Activity Level", x = "Activity Level", y = "Proportion", fill = "Osteoporosis")
    
    #Fig.6: Vitamin D Intake
    life2 <- ggplot(data, aes(x = `Vitamin.D.Intake`, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.6: Vitamin D Intake", x = "Vitamin D", y = "Proportion", fill = "Osteoporosis")
    
    #Fig.7: Calcium Intake
    life3 <- ggplot(data, aes(x = `Calcium.Intake`, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.7: Calcium Intake", x = "Calcium", y = "Proportion", fill = "Osteoporosis")
    
    #Fig.8: Smoking Status
    life4 <- ggplot(data , aes(x = `Smoking`, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.8: Smoking Status", x = "Smoking (Yes only)", y = "Proportion", fill = "Osteoporosis")
    
    #Fig.9: Alcohol Consumption
    life5 <- ggplot(data, aes(x = `Alcohol.Consumption`, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.9: Alcohol Consumption", x = "Alcohol Level", y = "Proportion", fill = "Osteoporosis")
    
    #Render Page 2
    (life1 + life2 + life3 + life4 + life5) +
      plot_layout(ncol = 2, guides = "collect") +
      plot_annotation(title = "Figures 5-9: Behavioral & Diet Risk Factors")
    
    plot of chunk page2

    Figures 5–9 present proportional comparisons across physical activity, vitamin D intake, calcium intake, smoking status, and alcohol consumption. Across all lifestyle variables, osteoporosis prevalence remains relatively consistent between categories. No substantial visual differences are evident, supporting the non-significant Chi-square test results. Although lifestyle factors are recognised contributors to osteoporosis in clinical research, their independent effects appear limited within this dataset. One explanation may be that age accounts for much of the variation in osteoporosis risk, reducing the apparent influence of these behaviours.

    3.3 Medical History & Clinical Factors

    Medical background states are grouped to view diagnostic variance proportions clearly.

    #Fig.10: Hormonal Status
    med1 <- ggplot(data, aes(x = `Hormonal.Changes`, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.10: Hormonal Status", x = "Hormonal Changes", y = "Proportion", fill = "Osteoporosis")
    
    #Fig.11: Family History
    med2 <- ggplot(data, aes(x = `Family.History`, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.11: Family History", x = "Family History", y = "Proportion", fill = "Osteoporosis")
    
    #Fig.12: Prior Fractures
    med3 <- ggplot(data, aes(x = `Prior.Fractures`, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.12: Prior Fractures", x = "Prior Fractures", y = "Proportion", fill = "Osteoporosis")
    
    #Fig.13: Medical Conditions
    med4 <- ggplot(data, aes(x = `Medical.Conditions`, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.13: Medical Conditions", x = "Medical Conditions", y = "Proportion", fill = "Osteoporosis")
    
    #Fig.14: Medications Use
    med5 <- ggplot(data, aes(x = Medications, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.14: Medications Use", x = "Medications", y = "Proportion", fill = "Osteoporosis")
    
    # Render Page 3
    (med1 + med2 + med3 + med4 + med5) +
      plot_layout(ncol = 2, guides = "collect") +
      plot_annotation(title = "Figures 10-14: Medical History & Clinical Factors; Hormonal Changes, Family History, Prior Fractures")
    
    plot of chunk page3

    Figures 10–14 examine proportional osteoporosis prevalence across hormonal changes, family history, prior fractures, medical conditions, and medication use. Similar to the lifestyle variables, osteoporosis prevalence remained relatively stable across all medical history categories. These visual observations support the Chi-square analyses, which found no statistically significant associations between these variables and osteoporosis status. Importantly, the absence of statistical significance does not necessarily imply that these variables have no clinical relevance. Their influence may be masked by the dominant effect of age or require a study population with greater representation of older individuals. Among all proportional plots, Figure 3 (Age Group) remains the clearest illustration of increasing osteoporosis prevalence with advancing age.

    4 Predictive Modeling Architecture

    Descriptive and inferential analyses consistently identified age as the strongest candidate predictor. Logistic regression was therefore used to quantify how effectively age predicts osteoporosis and to determine whether additional variables improve predictive performance. Logistic regression was selected because osteoporosis status is a binary outcome (present or absent). The model estimates the probability that an individual has osteoporosis based on one or more predictor variables.

    4.1 Baseline Standalone Age Framework

    Summary statistics and Figure 15 demonstrate clear separation between the age distributions of participants with and without osteoporosis, suggesting that age alone may provide considerable predictive information. To establish a baseline, age was first modelled as the sole predictor of osteoporosis.

    #4.1.1 Splits data and generates a summary
    data %>%
      group_by(factor(Osteoporosis)) %>%
      summarise(
        Average_Age = mean(Age),
        Median_Age = median(Age),
        Min_Age = min(Age),
        Max_Age = max(Age),
        Total_Patients = n()
      )
    
    ## # A tibble: 2 × 6
    ##   `factor(Osteoporosis)` Average_Age Median_Age Min_Age Max_Age Total_Patients
    ##   <fct>                        <dbl>      <int>   <int>   <int>          <int>
    ## 1 0                             24.3         22      18      38            979
    ## 2 1                             53.9         53      18      90            979
    
    # 4.1.2 Draw a Boxplot to once again visually prove how different the ages are
    ggplot(data, aes(x = factor(Osteoporosis), y = Age, fill = factor(Osteoporosis))) +
      geom_boxplot(alpha = 0.7, outlier.color = "red") +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(
        title = "Figure 15: Age Distribution by Osteoporosis Status",
        subtitle = "Proving that age profiles do not overlap significantly between groups",
        x = "Has Osteoporosis? (No vs Yes)",
        y = "Age (Years)",
        fill = "Osteoporosis"
      )
    
    plot of chunk impactofage
    # 4.1.3 Fit a model using ONLY Age as a predictor
    age_only_model <- glm(Osteoporosis ~ Age,
                           data = data,
                           family = binomial)
    
    # 4.1.4 Look at the statistical significance of Age by itself
    summary(age_only_model)
    
    ## 
    ## Call:
    ## glm(formula = Osteoporosis ~ Age, family = binomial, data = data)
    ## 
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -5.25057    0.23864   -22.0   <2e-16 ***
    ## Age          0.15624    0.00762    20.5   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 2714.4  on 1957  degrees of freedom
    ## Residual deviance: 1391.3  on 1956  degrees of freedom
    ## AIC: 1395.3
    ## 
    ## Number of Fisher Scoring iterations: 6
    
    # 4.1.5 Generate risk probabilities based solely on a patient's age
    data$predicted_risk_age <- predict(age_only_model, type = "response")
    
    # 4.1.6 Calculate the ROC curve data
    roc_age <- roc(data$Osteoporosis, data$predicted_risk_age)
    
    ## Setting levels: control = 0, case = 1
    
    ## Setting direction: controls < cases
    
    # 4.1.7 Plot the ROC curve for Age alone
    plot(roc_age,
         col = "darkgreen",   # Using a different color to distinguish it
         lwd = 3,
         main = paste("Figure 16: Age Standalone Accuracy (AUC =", round(auc(roc_age), 2), ")"))
    
    plot of chunk impactofage

    A descriptive and modelling approach was first used to examine how age differs between individuals with and without osteoporosis. Summary statistics were calculated for each group, including mean, median, minimum, maximum, and total sample size, in order to provide an initial understanding of how age is distributed across disease status. This was followed by a boxplot, which visually demonstrates a clear separation in age profiles between the two groups, suggesting limited overlap and a strong age gradient in osteoporosis status. To quantify this relationship, a logistic regression model was then fitted using age as the sole predictor of osteoporosis status. The model output indicates that age is statistically significant when considered independently, suggesting that it is a strong standalone predictor of disease risk in this dataset. Predicted probabilities were subsequently generated from the model, and a ROC curve was constructed to evaluate its discriminatory performance, with the resulting AUC demonstrating that age alone provides substantial predictive accuracy for distinguishing between individuals with and without osteoporosis in this sample.

    4.2 Multivariable Structural Framework, everything except age

    To determine whether lifestyle and clinical variables independently predict osteoporosis, a second logistic regression model was fitted excluding age. Unlike the baseline model, none of the remaining predictor variables achieved statistical significance. This suggests that behavioural, dietary, and clinical variables provide limited predictive information when age is omitted.

    # 4.2.1 Fit the names of Multivariable Logistic Regression Model (EXCLUDING Age)
    model_no_age <- glm(
      factor(Osteoporosis) ~ Gender + Hormonal.Changes + Family.History +
                             Race.Ethnicity + Body.Weight + Calcium.Intake +
                             Vitamin.D.Intake + Physical.Activity + Smoking +
                             Alcohol.Consumption + Medical.Conditions +
                             Medications + Prior.Fractures,
      data = data,
      family = binomial
    )
    
    # 4.2.2 Look at the statistical significance of the remaining variables
    summary(model_no_age)
    
    ## 
    ## Call:
    ## glm(formula = factor(Osteoporosis) ~ Gender + Hormonal.Changes + 
    ##     Family.History + Race.Ethnicity + Body.Weight + Calcium.Intake + 
    ##     Vitamin.D.Intake + Physical.Activity + Smoking + Alcohol.Consumption + 
    ##     Medical.Conditions + Medications + Prior.Fractures, family = binomial, 
    ##     data = data)
    ## 
    ## Coefficients:
    ##                                         Estimate Std. Error z value Pr(>|z|)
    ## (Intercept)                            -0.068462   0.182740  -0.375    0.708
    ## GenderMale                              0.044567   0.090839   0.491    0.624
    ## Hormonal.ChangesPostmenopausal          0.063024   0.090745   0.695    0.487
    ## Family.HistoryYes                      -0.007754   0.090748  -0.085    0.932
    ## Race.EthnicityAsian                    -0.025464   0.110924  -0.230    0.818
    ## Race.EthnicityCaucasian                -0.036122   0.110277  -0.328    0.743
    ## Body.WeightUnderweight                  0.136553   0.090922   1.502    0.133
    ## Calcium.IntakeLow                      -0.011927   0.090798  -0.131    0.895
    ## Vitamin.D.IntakeSufficient              0.070556   0.090795   0.777    0.437
    ## Physical.ActivitySedentary              0.077870   0.091066   0.855    0.392
    ## SmokingYes                             -0.065567   0.090996  -0.721    0.471
    ## Alcohol.ConsumptionNone                -0.011252   0.090743  -0.124    0.901
    ## Medical.ConditionsNone                 -0.055819   0.110536  -0.505    0.614
    ## Medical.ConditionsRheumatoid Arthritis -0.009249   0.111181  -0.083    0.934
    ## MedicationsNone                        -0.132634   0.090754  -1.461    0.144
    ## Prior.FracturesYes                      0.064861   0.090990   0.713    0.476
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 2714.4  on 1957  degrees of freedom
    ## Residual deviance: 2706.3  on 1942  degrees of freedom
    ## AIC: 2738.3
    ## 
    ## Number of Fisher Scoring iterations: 3
    
    # 4.2.3 Generate risk probabilities based on this new model
    data$predicted_risk_no_age <- predict(model_no_age, type = "response")
    
    # 4.2.4 Calculate the ROC curve data
    # Making sure response is matching the format used in glm
    roc_no_age <- roc(data$Osteoporosis, data$predicted_risk_no_age)
    
    ## Setting levels: control = 0, case = 1
    
    ## Setting direction: controls < cases
    
    # 4.2.5 Plot the ROC Curve for the model without Age
    plot(
      roc_no_age,
      col = "darkblue",
      lwd = 3,
      main = paste("Figure 17: Model Accuracy Without Age (AUC =", round(auc(roc_no_age), 2), ")")
    )
    
    plot of chunk EVERYTHING

    A multivariable logistic regression model was fitted using osteoporosis status as the binary dependent variable against all lifestyle options and clinical backgrounds, intentionally excluding raw age to isolate their baseline impacts. This model assesses whether factors like diet, smoking, or medical conditions can reliably predict diagnostic outcomes on their own. The output summary reveals that when age is omitted, none of the remaining clinical or lifestyle coefficients achieve statistical significance. This indicates that without controlling for age timelines, individual categorical habits do not build a strong predictive foundation for identifying osteoporosis risk profiles in this sample population.

    5 Comparative Diagnostic Evaluation

    Finally, we compute information fit diagnostics (AIC) alongside Receiver Operating Characteristic (ROC) curves to quantify predictive accuracy enhancements gained by expanding the clinical model.

    # 5.1.1 Baseline Model: Age Only
    model_baseline <- glm(
      factor(Osteoporosis) ~ Age,
      data = data,
      family = binomial
    )
    
    # 5.1.2 Clinical Only Model: All Factors WITHOUT Age
    model_clinical_only <- glm(
      factor(Osteoporosis) ~ Gender + Hormonal.Changes + Family.History +
                             Race.Ethnicity + Body.Weight + Calcium.Intake +
                             Vitamin.D.Intake + Physical.Activity + Smoking +
                             Alcohol.Consumption + Medical.Conditions +
                             Medications + Prior.Fractures,
      data = data,
      family = binomial
    )
    
    # 5.1.3 Expanded Model: Age + All Clinical & Lifestyle Factors
    model_expanded <- glm(
      factor(Osteoporosis) ~ Age + Gender + Hormonal.Changes + Family.History +
                             Race.Ethnicity + Body.Weight + Calcium.Intake +
                             Vitamin.D.Intake + Physical.Activity + Smoking +
                             Alcohol.Consumption + Medical.Conditions +
                             Medications + Prior.Fractures,
      data = data,
      family = binomial
    )
    
    
    # 5.2.2 COMPUTE INFORMATION FIT DIAGNOSTICS (AIC)
    aic_baseline <- AIC(model_baseline)
    aic_clinical <- AIC(model_clinical_only)
    aic_expanded <- AIC(model_expanded)
    
    cat("--- Information Fit Diagnostics ---\n")
    
    ## --- Information Fit Diagnostics ---
    
    cat("Baseline Model (Age Only) AIC:       ", round(aic_baseline, 2), "\n")
    
    ## Baseline Model (Age Only) AIC:        1395.28
    
    cat("Clinical Only Model (No Age) AIC:    ", round(aic_clinical, 2), "\n")
    
    ## Clinical Only Model (No Age) AIC:     2738.26
    
    cat("Expanded Model (Age + Clinical) AIC: ", round(aic_expanded, 2), "\n\n")
    
    ## Expanded Model (Age + Clinical) AIC:  1410.66
    
    # 5.3.1 Generate predictions
    prob_baseline <- predict(model_baseline, type = "response")
    prob_clinical <- predict(model_clinical_only, type = "response")
    prob_expanded <- predict(model_expanded, type = "response")
    
    # 5.3.2 Calculate ROC
    roc_baseline <- roc(data$Osteoporosis, prob_baseline)
    
    ## Setting levels: control = 0, case = 1
    
    ## Setting direction: controls < cases
    
    roc_clinical <- roc(data$Osteoporosis, prob_clinical)
    
    ## Setting levels: control = 0, case = 1
    ## Setting direction: controls < cases
    
    roc_expanded <- roc(data$Osteoporosis, prob_expanded)
    
    ## Setting levels: control = 0, case = 1
    ## Setting direction: controls < cases
    
    # 5.3.3 Run DeLong's test (Example comparing Clinical Only to Fully Expanded)
    delong_test <- roc.test(roc_clinical, roc_expanded, method = "delong")
    print(delong_test)
    
    ## 
    ##  DeLong's test for two correlated ROC curves
    ## 
    ## data:  roc_clinical and roc_expanded
    ## Z = -25.979, p-value < 2.2e-16
    ## alternative hypothesis: true difference in AUC is not equal to 0
    ## 95 percent confidence interval:
    ##  -0.3976022 -0.3418174
    ## sample estimates:
    ## AUC of roc1 AUC of roc2 
    ##   0.5363502   0.9060599
    
    # 5.4.1 Plot the baseline curve first
    plot(
      roc_baseline,
      col = "darkgreen",
      lwd = 3,
      main = "Figure 18: Comparative Diagnostic Evaluation",
      xlab = "False Positive Rate (1 - Specificity)",
      ylab = "True Positive Rate (Sensitivity)"
    )
    
    # 5.4.2 Add the Clinical Only (No Age) curve to the same plot
    plot(
      roc_clinical,
      col = "darkblue",      # Distinct color for the new line
      lwd = 3,
      add = TRUE
    )
    
    # 5.4.3 Add the expanded model curve to the same plot
    plot(
      roc_expanded,
      col = "red",
      lwd = 3,
      add = TRUE
    )
    
    # 5.4.4 Add an updated legend to distinguish all three models
    legend(
      "bottomright",
      legend = c(
        paste("Age Only (AUC =", round(auc(roc_baseline), 2), ", AIC =", round(aic_baseline, 0), ")"),
        paste("Clinical Only (AUC =", round(auc(roc_clinical), 2), ", AIC =", round(aic_clinical, 0), ")"),
        paste("Age + Clinical (AUC =", round(auc(roc_expanded), 2), ", AIC =", round(aic_expanded, 0), ")")
      ),
      col = c("darkgreen", "darkblue", "red"),
      lwd = 3,
      bty = "n"
    )
    
    plot of chunk Final comparison

    The final stage compared three competing predictive models: Age only Clinical variables only Age combined with all clinical variables Model performance was evaluated using both the Area Under the ROC Curve (AUC) and the Akaike Information Criterion (AIC) (Google Developers, 2025; Mohammed et al., 2015). The ROC curve measures predictive discrimination, while the AIC evaluates overall model quality by balancing goodness-of-fit against model complexity. Higher AUC values and lower AIC values indicate superior model performance (Google Developers, 2025; Mohammed et al., 2015). The comparative analysis demonstrated that the Age-only model already achieved strong predictive performance. The Clinical-only model performed considerably worse, confirming that behavioural and medical variables alone contribute relatively little predictive information. Although combining age with additional predictors produced a modest improvement in overall model fit, the improvement over the Age-only model was relatively small. This suggests that age explains most of the variation in osteoporosis risk captured within this dataset. The consistency of these findings across ROC analysis, AIC comparison and logistic regression strengthens confidence in the conclusion that age is the dominant predictor of osteoporosis.

    Discussion & Conclusion

    Discussion

    Before interpreting the findings in detail, four consistent patterns emerged throughout the analysis:

  • Age was the only variable identified as statistically significant across all analytical methods.
  • Lifestyle and clinical variables showed little evidence of independent associations with osteoporosis.
  • The age-only logistic regression model demonstrated strong predictive performance.
  • Adding additional predictors produced only marginal improvements in model accuracy.
  • The findings of this study demonstrate remarkable consistency across descriptive statistics, inferential testing, graphical visualisation and predictive modelling. The Wilcoxon Rank-Sum Test showed that participants with osteoporosis were significantly older than those without the condition (W = 93121, p < 2.2 × 10⁻¹⁶), while the Chi-square analysis identified Age Group as the only categorical variable significantly associated with osteoporosis status. These statistical findings were reinforced visually through the proportional bar charts and boxplots, all of which illustrated increasing osteoporosis prevalence with advancing age. The agreement between multiple statistical approaches increases confidence in the robustness of the findings. The predictive modelling produced similar conclusions. The age-only logistic regression model demonstrated strong discriminatory ability, whereas models excluding age performed considerably worse. Furthermore, adding lifestyle and medical variables produced only limited improvements in predictive performance, indicating that age captures most of the explainable variation in osteoporosis risk within this sample. These findings are consistent with established clinical evidence that bone mineral density declines with advancing age, increasing susceptibility to osteoporosis. However, unlike some previous studies, recognised risk factors such as smoking, calcium intake, vitamin D intake and physical activity were not statistically significant. One explanation may be the relatively young study population, with a median age of only 32 years. This age imbalance may have reduced the ability to detect weaker associations that become more evident in older populations. Overall, the consistency between descriptive statistics, hypothesis testing, graphical exploration and predictive modelling provides compelling evidence that age is the dominant predictor of osteoporosis within this dataset.

    Conclusion

    This study aimed to determine which demographic, lifestyle and clinical factors are the strongest predictors of osteoporosis. Across every statistical method employed—including descriptive statistics, the Shapiro-Wilk normality test, the Wilcoxon Rank-Sum Test, Chi-square analysis, proportional visualisations and logistic regression—age consistently emerged as the strongest predictor of osteoporosis status. In contrast, lifestyle and medical variables demonstrated limited independent predictive value within this sample. Although the expanded predictive model incorporating additional variables produced a small improvement over the age-only model, the majority of predictive performance was attributable to age. These findings suggest that age captures most of the variation in osteoporosis risk represented within this dataset. From a clinical perspective, the results reinforce the importance of age-based osteoporosis screening and early intervention strategies for older adults. Nevertheless, the relatively young age distribution of the study population represents an important limitation and may have reduced the ability to detect associations involving other recognised risk factors. Future research should investigate more representative populations with a greater proportion of older adults to better understand the independent contributions of lifestyle and clinical factors. Such studies would help determine whether these variables provide meaningful predictive value beyond the strong influence of age.

    References

  • Compston, J.E., McClung, M.R. and Leslie, W.D. (2019) ‘Osteoporosis’, The Lancet, 393(10169), pp. 364–376. https://doi.org/10.1016/S0140-6736(18)32112-3
  • Docxian (2024) Osteoporosis Risk Prediction. Available at: https://www.kaggle.com/code/docxian/osteoporosis-risk-prediction (Accessed: 26 June 2026).
  • Google Developers (2025) Classification: ROC Curve and AUC. Available at: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
  • Mohammed, E.A., Naugler, C. and Far, B.H. (2015). ‘Chapter 32 - Emerging Business Intelligence Framework for a Clinical Laboratory Through Big Data Analytics’, in Tran, Q.N. and Arabnia, H. (eds.) Emerging Trends in Computational Biology, Bioinformatics, and Systems Biology: Algorithms and Software Tools. Amsterdam: Elsevier, pp. 581–596. doi: 10.1016/B978-0-12-802508-6.00032-6.
  • Vaishya, R., Iyengar, K.P., Jain, V.K. and Vaish, A. (2023) 'Demystifying the Risk Factors and Preventive Measures for Osteoporosis', Indian Journal of Orthopaedics, 57(Suppl 1), pp. 94–104. doi: 10.1007/s43465-023-00998-0.