Risk Factors Associated with Osteoporosis: A Statistical Analysis of Demographic, Lifestyle, and Medical Predictors

Warsha Autar
June 26, 2026

Introduction

Osteoporosis is a progressive skeletal disorder characterized by reduced bone density and an increased risk of fractures, particularly among older adults. Identifying the factors that contribute to osteoporosis is essential for early diagnosis, prevention, and targeted clinical intervention. This analysis examines a patient dataset containing demographic, lifestyle, and health-related variables to explore patterns associated with osteoporosis occurrence and assess their predictive value. This integrated report performs a rigorous exploratory and multivariate statistical analysis to determine key demographic, lifestyle, and clinical markers associated with osteoporosis risk. By combining descriptive statistics, non-parametric tests, independence tests, and predictive logistic regression frameworks, we identify critical points of diagnostic validation.

1 Data Preparation & Early Exploration

Before executing analytics workflows, the underlying data repository is fetched, and outcome metrics are cleanly structured into factor classifications with descriptive labels.

1.1 Libraries and Data loading

# 1.1 Firstly we load the appropriate libraries
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(patchwork)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
# 1.2 Load data
setwd("/Users/warshaautar/Desktop/Statistics")
getwd()
## [1] "/Users/warshaautar/Desktop/Statistics"
data <- read.csv("osteoporosis.csv")

#1.3 Descriptive labels
str(data)
## 'data.frame':   1958 obs. of  16 variables:
##  $ Id                 : int  104866 101999 106567 102316 101944 102265 107447 103065 103040 105960 ...
##  $ Age                : int  69 32 89 78 38 41 20 39 70 19 ...
##  $ Gender             : chr  "Female" "Female" "Female" "Female" ...
##  $ Hormonal.Changes   : chr  "Normal" "Normal" "Postmenopausal" "Normal" ...
##  $ Family.History     : chr  "Yes" "Yes" "No" "No" ...
##  $ Race.Ethnicity     : chr  "Asian" "Asian" "Caucasian" "Caucasian" ...
##  $ Body.Weight        : chr  "Underweight" "Underweight" "Normal" "Underweight" ...
##  $ Calcium.Intake     : chr  "Low" "Low" "Adequate" "Adequate" ...
##  $ Vitamin.D.Intake   : chr  "Sufficient" "Sufficient" "Sufficient" "Insufficient" ...
##  $ Physical.Activity  : chr  "Sedentary" "Sedentary" "Active" "Sedentary" ...
##  $ Smoking            : chr  "Yes" "No" "No" "Yes" ...
##  $ Alcohol.Consumption: chr  "Moderate" "None" "Moderate" "None" ...
##  $ Medical.Conditions : chr  "Rheumatoid Arthritis" "None" "Hyperthyroidism" "Rheumatoid Arthritis" ...
##  $ Medications        : chr  "Corticosteroids" "None" "Corticosteroids" "Corticosteroids" ...
##  $ Prior.Fractures    : chr  "Yes" "Yes" "No" "No" ...
##  $ Osteoporosis       : int  1 1 1 1 1 1 1 1 1 1 ...

The str() function was used to examine the structure of the dataset. This confirmed that the dataset contained both numerical and categorical variables. Numerical variables included Age and Osteoporosis status, while categorical variables included Gender, Race/Ethnicity, Smoking status, Physical Activity level, Calcium Intake, Vitamin D Intake, and several medical risk factors. Understanding the structure of the data was important in determining which statistical tests were appropriate for each variable. This means that the datacollection for the osteoporosis analysis consists of different kinds of variable

Demographics
  • Age
  • Gender
  • Ethnicity
  • Behavioural
  • Physical Activity
  • Vitamin D Intake
  • Calcium Intake
  • Smoking
  • Alcohol Consumption
  • Medical History
  • Hormonal Changes
  • Family History
  • Body Weight
  • Prior Fractures
  • Medical Conditions
  • Medications
  • 1.2 Early data exploration

    Before further statistical analysis of osteoporosis risk, age should be evaluated for distributional properties, while gender should be considered as a stratification variable in subsequent comparisons.

    #1.2.1 Gender Count
    tabyl(data$Gender)
    
    ##  data$Gender   n   percent
    ##       Female 966 0.4933606
    ##         Male 992 0.5066394
    

    The distribution of gender within the dataset was examined using the tabyl() function. The results showed that 966 participants (49.3%) were female and 992 participants (50.7%) were male. This indicates that the sample was almost evenly divided between males and females, reducing the likelihood of gender imbalance influencing the results.

    #1.2.2 To begin, what are the mean and median ages in the data
    summary(data$Age)
    
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##    18.0    21.0    32.0    39.1    53.0    90.0
    

    This means the youngest age in the data is 18yo, the median age is 32yo, the mean age is 39.1yo, and the oldest age is 90yo.

    Before comparing groups statistically, it is also necessary to determine whether Age follows a normal distribution. With very large datasets, normality tests often become overly sensitive and may incorrectly detect trivial deviations from normality. Using a sample provides a more realistic assessment, thus a Shapiro-Wilk normality test is performed on a random sample of 500 observations.

    #1.2.3 Shapiro Sample test
    shapiro.test(sample(data$Age, 500))
    
    ## 
    ##  Shapiro-Wilk normality test
    ## 
    ## data:  sample(data$Age, 500)
    ## W = 0.85277, p-value < 2.2e-16
    

    The test produced a W statistic of approximately 0.846 and a p-value smaller than 0.001. Since the p-value was less than the significance level of 0.05, the null hypothesis of normality was rejected. This indicates that age was not normally distributed within the study population. Consequently, non-parametric statistical methods were considered more appropriate for analysing age differences between groups. Because Age was found to be non-normal, a: Wilcoxon Rank-Sum Test was selected instead of a traditional t-test.

    #1.2.4 Wilcoxon test
    wilcox.test(Age ~ Osteoporosis, data = data)
    
    ## 
    ##  Wilcoxon rank sum test with continuity correction
    ## 
    ## data:  Age by Osteoporosis
    ## W = 93121, p-value < 2.2e-16
    ## alternative hypothesis: true location shift is not equal to 0
    

    The Wilcoxon rank-sum test indicates a statistically significant difference in age distributions between individuals with and without osteoporosis (W = 93121, p < 2.2 × 10⁻¹⁶). This indicates a statistically significant difference in the central tendency of age between the two groups (group with osteoporosis, group withou osteoporosis).

    #1.2.5 Plotting Figure 1
    ggplot(data, aes(x = Age)) +
      geom_histogram(bins = 25, fill = "salmon", color = "white") +
      theme_minimal() +
      labs(title = "Figure 1: Visual Inspection of Age Distribution")
    
    plot of chunk AGE2

    The visual depicts the non-normal distribution of the age range is and how there is a large group of research participants below the age of 25.

    Continuous age values are useful for statistical testing but can be difficult to interpret visually in proportional comparisons. Age is therefore converted into clinically meaningful categories. Additionally age groups make trends easier to visualise.

    #1.2.6 Age to age category conversion
    data <- data %>%
      mutate(
        Age_Group = cut(
          Age,
          breaks = c(0, 24, 39, 49, 59, 69, 79, 100),
          labels = c("<25", "26-40", "40-49", "50-59", "60-69", "70-79", "80+")
        )
      )
    

    2 Chi-Square Independence Evaluation

    To establish statistical associations comprehensively, categorical variables are passed through automated Chi-Square testing loops.

    # 2.1 Listing all the variables
    variables <- c("Age_Group","Smoking", "Hormonal.Changes", "Family.History", "Body.Weight",
                   "Calcium.Intake", "Vitamin.D.Intake", "Physical.Activity",
                   "Alcohol.Consumption", "Medical.Conditions", "Medications", "Prior.Fractures")
    
    # 2.2 Running the Chi-Square test for each one automatically
    results_list <- lapply(variables, function(var) {
      test <- chisq.test(data[[var]], data$Osteoporosis)
    
      data.frame(
        Factor = var,
        X_Squared = round(test$statistic, 3),
        df = test$parameter,
        P_Value = round(test$p.value, 4),
        Significant = ifelse(test$p.value < 0.05, "Yes (p < 0.05)", "No")
      )
    })
    
    # 2.3 Combine them all into one clean table
    chi_square_table <- do.call(rbind, results_list)
    
    chi_square_table
    
    ##                          Factor X_Squared df P_Value    Significant
    ## X-squared             Age_Group  1117.039  6  0.0000 Yes (p < 0.05)
    ## X-squared1              Smoking     0.460  1  0.4978             No
    ## X-squared2     Hormonal.Changes     0.400  1  0.5269             No
    ## X-squared3       Family.History     0.002  1  0.9639             No
    ## X-squared4          Body.Weight     2.367  1  0.1239             No
    ## X-squared5       Calcium.Intake     0.018  1  0.8921             No
    ## X-squared6     Vitamin.D.Intake     0.524  1  0.4693             No
    ## X-squared7    Physical.Activity     0.663  1  0.4155             No
    ## X-squared8  Alcohol.Consumption     0.002  1  0.9639             No
    ## X-squared9   Medical.Conditions     0.296  2  0.8626             No
    ## X-squared10         Medications     2.092  1  0.1481             No
    ## X-squared11     Prior.Fractures     0.400  1  0.5269             No
    

    Chi-square tests of independence were performed to investigate whether osteoporosis status was associated with several categorical risk factors. For each variable, a contingency table was analyzed using the chisq.test() function. The resulting p-values for smoking status, hormonal changes, family history, body weight, calcium intake, vitamin D intake, physical activity, alcohol consumption, medical conditions, medication use, and prior fractures were all greater than the 0.05 significance threshold. Because these values remained high, the null hypothesis of independence was retained, indicating insufficient statistical evidence to link these lifestyle or clinical variables directly to an osteoporosis diagnosis within this specific dataset. In contrast, the Age_Group variable produced a p-value significantly below the 0.05 threshold. This allows us to reject the null hypothesis and confirms a statistically significant association between a participant's age bracket and their osteoporosis status.

    3 Proportional Osteoporosis Status

    Here we observe data shapes and partition populations by age boundaries to inspect variations in distribution profiles.

    3.1 Demographic Profiles & Osteoporosis Status

    The layout below details relative distributions across gender assignments, age brackets, and ethnic classifications.

    #Fig.2: Osteoporosis by Gender
    demo1 <- ggplot(data, aes(x = Gender, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.2: Osteoporosis by Gender", x = "Gender", y = "Proportion", fill = "Osteoporosis")
    scale_fill_discrete(labels = c("0" = "No", "1" = "Yes"))
    
    ## <ggproto object: Class ScaleDiscrete, Scale, gg>
    ##     aesthetics: fill
    ##     axis_order: function
    ##     break_info: function
    ##     break_positions: function
    ##     breaks: waiver
    ##     call: call
    ##     clone: function
    ##     dimension: function
    ##     drop: TRUE
    ##     expand: waiver
    ##     fallback_palette: function
    ##     get_breaks: function
    ##     get_breaks_minor: function
    ##     get_labels: function
    ##     get_limits: function
    ##     get_transformation: function
    ##     guide: legend
    ##     is_discrete: function
    ##     is_empty: function
    ##     labels: No Yes
    ##     limits: NULL
    ##     make_sec_title: function
    ##     make_title: function
    ##     map: function
    ##     map_df: function
    ##     minor_breaks: waiver
    ##     n.breaks.cache: NULL
    ##     na.translate: TRUE
    ##     na.value: grey50
    ##     name: waiver
    ##     palette: NULL
    ##     palette.cache: NULL
    ##     position: left
    ##     range: environment
    ##     rescale: function
    ##     reset: function
    ##     train: function
    ##     train_df: function
    ##     transform: function
    ##     transform_df: function
    ##     super:  <ggproto object: Class ScaleDiscrete, Scale, gg>
    
    #Fig.3:Osteoporosis by Age Group
    demo2 <- ggplot(data, aes(x = Age_Group, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.3:Osteoporosis by Age Group", x = "Age Group", y = "Proportion", fill = "Osteoporosis")
    scale_fill_discrete(labels = c("0" = "No", "1" = "Yes"))
    
    ## <ggproto object: Class ScaleDiscrete, Scale, gg>
    ##     aesthetics: fill
    ##     axis_order: function
    ##     break_info: function
    ##     break_positions: function
    ##     breaks: waiver
    ##     call: call
    ##     clone: function
    ##     dimension: function
    ##     drop: TRUE
    ##     expand: waiver
    ##     fallback_palette: function
    ##     get_breaks: function
    ##     get_breaks_minor: function
    ##     get_labels: function
    ##     get_limits: function
    ##     get_transformation: function
    ##     guide: legend
    ##     is_discrete: function
    ##     is_empty: function
    ##     labels: No Yes
    ##     limits: NULL
    ##     make_sec_title: function
    ##     make_title: function
    ##     map: function
    ##     map_df: function
    ##     minor_breaks: waiver
    ##     n.breaks.cache: NULL
    ##     na.translate: TRUE
    ##     na.value: grey50
    ##     name: waiver
    ##     palette: NULL
    ##     palette.cache: NULL
    ##     position: left
    ##     range: environment
    ##     rescale: function
    ##     reset: function
    ##     train: function
    ##     train_df: function
    ##     transform: function
    ##     transform_df: function
    ##     super:  <ggproto object: Class ScaleDiscrete, Scale, gg>
    
    #Fig.4:Osteoporosis by Race/Ethnicity
    demo3 <- ggplot(data, aes(x = `Race.Ethnicity`, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.4:Osteoporosis by Race/Ethnicity", x = "Race / Ethnicity", y = "Proportion", fill = "Osteoporosis")
    
    
    # Render Page 1
    demo1 / demo2 / demo3 +
      plot_layout(guides = "collect") +
      plot_annotation(title = "Figures 2-4: Demographic Profiles & Osteoporosis Status")
    
    plot of chunk page1

    These proportional diagnostics and visuals prove that the risk of osteoporosis remains relatively even, except for certain age groups. The older the age group, the higher the presence of osteoporosis. This falls in line with the Chi-test results.

    .

    3.2 Behavioral & Diet Risk Factors

    We systematically verify whether physical behaviors or lifestyle options generate proportional differences in diagnosis rates.

    #Fig.5: Physical Activity Level
    life1 <- ggplot(data, aes(x = `Physical.Activity`, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.5: Physical Activity Level", x = "Activity Level", y = "Proportion", fill = "Osteoporosis")
    
    #Fig.6: Vitamin D Intake
    life2 <- ggplot(data, aes(x = `Vitamin.D.Intake`, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.6: Vitamin D Intake", x = "Vitamin D", y = "Proportion", fill = "Osteoporosis")
    
    #Fig.7: Calcium Intake
    life3 <- ggplot(data, aes(x = `Calcium.Intake`, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.7: Calcium Intake", x = "Calcium", y = "Proportion", fill = "Osteoporosis")
    
    #Fig.8: Smoking Status
    life4 <- ggplot(data , aes(x = `Smoking`, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.8: Smoking Status", x = "Smoking (Yes only)", y = "Proportion", fill = "Osteoporosis")
    
    #Fig.9: Alcohol Consumption
    life5 <- ggplot(data, aes(x = `Alcohol.Consumption`, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.9: Alcohol Consumption", x = "Alcohol Level", y = "Proportion", fill = "Osteoporosis")
    
    #Render Page 2
    (life1 + life2 + life3 + life4 + life5) +
      plot_layout(ncol = 2, guides = "collect") +
      plot_annotation(title = "Figures 5-9: Behavioral & Diet Risk Factors")
    
    plot of chunk page2

    These proportional diagnostics and visuals prove that the risk of osteoporosis remains relatively even, which is in line with the Chi-test results.

    .

    3.3 Medical History & Clinical Factors

    Medical background states are grouped to view diagnostic variance proportions clearly.

    #Fig.10: Hormonal Status
    med1 <- ggplot(data, aes(x = `Hormonal.Changes`, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.10: Hormonal Status", x = "Hormonal Changes", y = "Proportion", fill = "Osteoporosis")
    
    #Fig.11: Family History
    med2 <- ggplot(data, aes(x = `Family.History`, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.11: Family History", x = "Family History", y = "Proportion", fill = "Osteoporosis")
    
    #Fig.12: Prior Fractures
    med3 <- ggplot(data, aes(x = `Prior.Fractures`, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.12: Prior Fractures", x = "Prior Fractures", y = "Proportion", fill = "Osteoporosis")
    
    #Fig.13: Medical Conditions
    med4 <- ggplot(data, aes(x = `Medical.Conditions`, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.13: Medical Conditions", x = "Medical Conditions", y = "Proportion", fill = "Osteoporosis")
    
    #Fig.14: Medications Use
    med5 <- ggplot(data, aes(x = Medications, fill = factor(Osteoporosis))) +
      geom_bar(position = "fill", alpha = 0.9) +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(title = "Fig.14: Medications Use", x = "Medications", y = "Proportion", fill = "Osteoporosis")
    
    # Render Page 3
    (med1 + med2 + med3 + med4 + med5) +
      plot_layout(ncol = 2, guides = "collect") +
      plot_annotation(title = "Figures 10-14: Medical History & Clinical Factors; Hormonal Changes, Family History, Prior Fractures")
    
    plot of chunk page3

    A series of stacked bar charts were created to explore the relationship between osteoporosis status and demographic characteristics. These charts displayed osteoporosis prevalence according to gender, age group, and race/ethnicity. The graphs were arranged together using the patchwork package to create a single demographic profile figure. The age-group visualisation showed the clearest pattern, with osteoporosis becoming increasingly common in older age categories. This visual trend was consistent with the statistical findings obtained from the Wilcoxon Rank-Sum Test.

    4 Predictive Modeling Architecture

    We deploy two distinct logistic regression configurations: a baseline tracking model restricted purely to age metrics, and a full multivariable modeling configuration processing all lifestyle and medical indices.

    4.1 Baseline Standalone Age Framework

    To begin the baseline involves isolating age as the sole independent predictor to establish a foundational benchmark for risk determination.

    #4.1.1 Splits data and generates a summary
    data %>%
      group_by(factor(Osteoporosis)) %>%
      summarise(
        Average_Age = mean(Age),
        Median_Age = median(Age),
        Min_Age = min(Age),
        Max_Age = max(Age),
        Total_Patients = n()
      )
    
    ## # A tibble: 2 × 6
    ##   `factor(Osteoporosis)` Average_Age Median_Age Min_Age Max_Age Total_Patients
    ##   <fct>                        <dbl>      <int>   <int>   <int>          <int>
    ## 1 0                             24.3         22      18      38            979
    ## 2 1                             53.9         53      18      90            979
    
    # 4.1.2 Draw a Boxplot to once again visually prove how different the ages are
    # This graph will clearly show that the groups barely overlap!
    ggplot(data, aes(x = factor(Osteoporosis), y = Age, fill = factor(Osteoporosis))) +
      geom_boxplot(alpha = 0.7, outlier.color = "red") +
      scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) +
      theme_minimal() +
      labs(
        title = "The Primary Driver: Age Distribution by Osteoporosis Status",
        subtitle = "Proving that age profiles do not overlap significantly between groups",
        x = "Has Osteoporosis? (No vs Yes)",
        y = "Age (Years)",
        fill = "Osteoporosis"
      )
    
    plot of chunk impactofage
    # 4.1.3 Fit a model using ONLY Age as a predictor
    age_only_model <- glm(Osteoporosis ~ Age,
                           data = data,
                           family = binomial)
    
    # 4.1.4 Look at the statistical significance of Age by itself
    summary(age_only_model)
    
    ## 
    ## Call:
    ## glm(formula = Osteoporosis ~ Age, family = binomial, data = data)
    ## 
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -5.25057    0.23864   -22.0   <2e-16 ***
    ## Age          0.15624    0.00762    20.5   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 2714.4  on 1957  degrees of freedom
    ## Residual deviance: 1391.3  on 1956  degrees of freedom
    ## AIC: 1395.3
    ## 
    ## Number of Fisher Scoring iterations: 6
    
    # 4.1.5 Generate risk probabilities based solely on a patient's age
    data$predicted_risk_age <- predict(age_only_model, type = "response")
    
    # 4.1.6 Calculate the ROC curve data
    roc_age <- roc(data$Osteoporosis, data$predicted_risk_age)
    
    ## Setting levels: control = 0, case = 1
    
    ## Setting direction: controls < cases
    
    # 4.1.7 Plot the ROC curve for Age alone
    plot(roc_age,
         col = "darkgreen",   # Using a different color to distinguish it
         lwd = 3,
         main = paste("Age Standalone Accuracy (AUC =", round(auc(roc_age), 2), ")"))
    
    plot of chunk impactofage

    4.2 Multivariable Structural Framework, everything except age

    Secondly this multivariable framework integrates the remaining lifestyle and medical indices, intentionally excluding age to isolate the predictive power of behavioral and clinical factors.

    # 4.2 XXXX NEED TO LOOK AT THIS Summary Statistics for Categorical Variables by Osteoporosis Status
    # Since everything is categorical, let's look at counts for key risk factors
    summary_stats <- data %>%
      group_by(Osteoporosis = factor(Osteoporosis)) %>%
      summarise(
        Total_Patients = n(),
        # Counts for Calcium Intake levels
        Calcium_Low = sum(Calcium.Intake == "Low", na.rm = TRUE),
        Calcium_Adequate = sum(Calcium.Intake == "Adequate", na.rm = TRUE),
        # Counts for Vitamin D Intake levels
        VitD_Insufficient = sum(Vitamin.D.Intake == "Insufficient", na.rm = TRUE),
        VitD_Sufficient = sum(Vitamin.D.Intake == "Sufficient", na.rm = TRUE),
        # Count of Smokers
        Smokers = sum(Smoking == "Yes", na.rm = TRUE)
      )
    print(summary_stats)
    
    ## # A tibble: 2 × 7
    ##   Osteoporosis Total_Patients Calcium_Low Calcium_Adequate VitD_Insufficient
    ##   <fct>                 <int>       <int>            <int>             <int>
    ## 1 0                       979         504              475               482
    ## 2 1                       979         500              479               465
    ## # ℹ 2 more variables: VitD_Sufficient <int>, Smokers <int>
    
    # 4.2.XXX Fit the Multivariable Logistic Regression Model (EXCLUDING Age)
    # All column names have been updated to match R's dot-notation conversion of your dataset
    model_no_age <- glm(
      factor(Osteoporosis) ~ Gender + Hormonal.Changes + Family.History +
                             Race.Ethnicity + Body.Weight + Calcium.Intake +
                             Vitamin.D.Intake + Physical.Activity + Smoking +
                             Alcohol.Consumption + Medical.Conditions +
                             Medications + Prior.Fractures,
      data = data,
      family = binomial
    )
    
    # 4.2.3 Look at the statistical significance of the remaining variables
    summary(model_no_age)
    
    ## 
    ## Call:
    ## glm(formula = factor(Osteoporosis) ~ Gender + Hormonal.Changes + 
    ##     Family.History + Race.Ethnicity + Body.Weight + Calcium.Intake + 
    ##     Vitamin.D.Intake + Physical.Activity + Smoking + Alcohol.Consumption + 
    ##     Medical.Conditions + Medications + Prior.Fractures, family = binomial, 
    ##     data = data)
    ## 
    ## Coefficients:
    ##                                         Estimate Std. Error z value Pr(>|z|)
    ## (Intercept)                            -0.068462   0.182740  -0.375    0.708
    ## GenderMale                              0.044567   0.090839   0.491    0.624
    ## Hormonal.ChangesPostmenopausal          0.063024   0.090745   0.695    0.487
    ## Family.HistoryYes                      -0.007754   0.090748  -0.085    0.932
    ## Race.EthnicityAsian                    -0.025464   0.110924  -0.230    0.818
    ## Race.EthnicityCaucasian                -0.036122   0.110277  -0.328    0.743
    ## Body.WeightUnderweight                  0.136553   0.090922   1.502    0.133
    ## Calcium.IntakeLow                      -0.011927   0.090798  -0.131    0.895
    ## Vitamin.D.IntakeSufficient              0.070556   0.090795   0.777    0.437
    ## Physical.ActivitySedentary              0.077870   0.091066   0.855    0.392
    ## SmokingYes                             -0.065567   0.090996  -0.721    0.471
    ## Alcohol.ConsumptionNone                -0.011252   0.090743  -0.124    0.901
    ## Medical.ConditionsNone                 -0.055819   0.110536  -0.505    0.614
    ## Medical.ConditionsRheumatoid Arthritis -0.009249   0.111181  -0.083    0.934
    ## MedicationsNone                        -0.132634   0.090754  -1.461    0.144
    ## Prior.FracturesYes                      0.064861   0.090990   0.713    0.476
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 2714.4  on 1957  degrees of freedom
    ## Residual deviance: 2706.3  on 1942  degrees of freedom
    ## AIC: 2738.3
    ## 
    ## Number of Fisher Scoring iterations: 3
    
    # 4.4.4 Generate risk probabilities based on this new model
    data$predicted_risk_no_age <- predict(model_no_age, type = "response")
    
    # 4.4.5 Calculate the ROC curve data
    # Making sure response is matching the format used in glm
    roc_no_age <- roc(data$Osteoporosis, data$predicted_risk_no_age)
    
    ## Setting levels: control = 0, case = 1
    
    ## Setting direction: controls < cases
    
    # 4.4.6 Plot the ROC Curve for the model without Age
    plot(
      roc_no_age,
      col = "darkblue",
      lwd = 3,
      main = paste("Model Accuracy Without Age (AUC =", round(auc(roc_no_age), 2), ")")
    )
    
    plot of chunk EVERYTHING

    A multivariable logistic regression model was fitted using osteoporosis status as the binary dependent variable against all lifestyle options and clinical backgrounds, intentionally excluding raw age to isolate their baseline impacts. This model assesses whether factors like diet, smoking, or medical conditions can reliably predict diagnostic outcomes on their own. The output summary reveals that when age is omitted, none of the remaining clinical or lifestyle coefficients achieve statistical significance. This indicates that without controlling for age timelines, individual categorical habits do not build a strong predictive foundation for identifying osteoporosis risk profiles in this sample population.

    5 Comparative Diagnostic Evaluation

    Finally, we compute information fit diagnostics (AIC) alongside Receiver Operating Characteristic (ROC) curves to quantify predictive accuracy enhancements gained by expanding the clinical model.

    # 5.1.1 Baseline Model: Age Only
    model_baseline <- glm(
      factor(Osteoporosis) ~ Age,
      data = data,
      family = binomial
    )
    
    # 5.1.2 Expanded Model: Age + All Clinical & Lifestyle Factors
    model_expanded <- glm(
      factor(Osteoporosis) ~ Age + Gender + Hormonal.Changes + Family.History +
                             Race.Ethnicity + Body.Weight + Calcium.Intake +
                             Vitamin.D.Intake + Physical.Activity + Smoking +
                             Alcohol.Consumption + Medical.Conditions +
                             Medications + Prior.Fractures,
      data = data,
      family = binomial
    )
    
    
    # 5.2.2 COMPUTE INFORMATION FIT DIAGNOSTICS (AIC)
    aic_baseline <- AIC(model_baseline)
    aic_expanded <- AIC(model_expanded)
    
    cat("--- Information Fit Diagnostics ---\n")
    
    ## --- Information Fit Diagnostics ---
    
    cat("Baseline Model (Age Only) AIC: ", round(aic_baseline, 2), "\n")
    
    ## Baseline Model (Age Only) AIC:  1395.28
    
    cat("Expanded Model (Age + All) AIC:", round(aic_expanded, 2), "\n")
    
    ## Expanded Model (Age + All) AIC: 1410.66
    
    cat("AIC Difference:                ", round(aic_baseline - aic_expanded, 2), "\n\n")
    
    ## AIC Difference:                 -15.38
    
    # Note: A lower AIC is better. If the expanded model's AIC is NOT significantly
    # lower than the baseline, the extra variables are just adding noise.
    
    
    #5.3.1 Generate predictions
    prob_baseline <- predict(model_baseline, type = "response")
    prob_expanded <- predict(model_expanded, type = "response")
    
    #5.3.2 Calculate ROC
    roc_baseline <- roc(data$Osteoporosis, prob_baseline)
    
    ## Setting levels: control = 0, case = 1
    
    ## Setting direction: controls < cases
    
    roc_expanded <- roc(data$Osteoporosis, prob_expanded)
    
    ## Setting levels: control = 0, case = 1
    ## Setting direction: controls < cases
    
    #5.3.3 Run DeLong's test to see if the two AUCs are statistically different
    delong_test <- roc.test(roc_baseline, roc_expanded, method = "delong")
    print(delong_test)
    
    ## 
    ##  DeLong's test for two correlated ROC curves
    ## 
    ## data:  roc_baseline and roc_expanded
    ## Z = -1.9997, p-value = 0.04553
    ## alternative hypothesis: true difference in AUC is not equal to 0
    ## 95 percent confidence interval:
    ##  -6.373593e-03 -6.394478e-05
    ## sample estimates:
    ## AUC of roc1 AUC of roc2 
    ##   0.9028412   0.9060599
    
    #5.4.1 Plot the baseline curve first
    plot(
      roc_baseline,
      col = "darkgreen",
      lwd = 3,
      main = "Comparative Diagnostic Evaluation",
      xlab = "False Positive Rate (1 - Specificity)",
      ylab = "True Positive Rate (Sensitivity)"
    )
    
    #5.4.2 Add the expanded model curve to the same plot
    plot(
      roc_expanded,
      col = "darkblue",
      lwd = 3,
      add = TRUE
    )
    
    #5.4.3 Add a legend to distinguish the models, including AIC and AUC metrics
    legend(
      "bottomright",
      legend = c(
        paste("Age Only (AUC =", round(auc(roc_baseline), 2), ", AIC =", round(aic_baseline, 0), ")"),
        paste("Age + Clinical (AUC =", round(auc(roc_expanded), 2), ", AIC =", round(aic_expanded, 0), ")")
      ),
      col = c("darkgreen", "darkblue"),
      lwd = 3,
      bty = "n" # Removes the legend border box
    )
    
    plot of chunk Final comparison

    A logistic regression model was fitted using osteoporosis status as the dependent variable and age as the independent variable. Logistic regression is appropriate when the outcome variable is binary, as osteoporosis status was coded as either present or absent. The model assessed whether age could predict the likelihood of osteoporosis. Given the large age difference observed between the osteoporosis and non-osteoporosis groups, age emerged as a strong predictor of osteoporosis status. The results indicate that the probability of osteoporosis increases as age increases.

    Conclusion

    The analysis identified age as the strongest factor associated with osteoporosis in this dataset. Participants diagnosed with osteoporosis were substantially older than those without osteoporosis, and this difference was highly statistically significant. In contrast, no statistically significant associations were detected between osteoporosis status and the other demographic, lifestyle, or medical variables examined. It should also be noted that the study population contained a relatively large proportion of younger participants, as reflected by the median age of 32 years. This imbalance in the age distribution may have influenced the results, as younger individuals generally have a lower risk of osteoporosis. Consequently, some associations between osteoporosis and other risk factors may have been more difficult to detect due to the underrepresentation of older age groups, who are typically at greater risk of developing the condition. Overall, the findings suggest that age is the primary predictor of osteoporosis within this sample population, highlighting the importance of age-related changes in bone health and osteoporosis risk. .