Risk Factors Associated with Osteoporosis: A Statistical Analysis of Demographic, Lifestyle, and Medical Predictors
Introduction
Osteoporosis is a progressive skeletal disorder characterized by reduced bone density and an increased risk of fractures, particularly among older adults. Identifying the factors that contribute to osteoporosis is essential for early diagnosis, prevention, and targeted clinical intervention. This analysis examines a patient dataset containing demographic, lifestyle, and health-related variables to explore patterns associated with osteoporosis occurrence and assess their predictive value. This integrated report performs a rigorous exploratory and multivariate statistical analysis to determine key demographic, lifestyle, and clinical markers associated with osteoporosis risk. By combining descriptive statistics, non-parametric tests, independence tests, and predictive logistic regression frameworks, we identify critical points of diagnostic validation.
1 Data Preparation & Early Exploration
Before executing analytics workflows, the underlying data repository is fetched, and outcome metrics are cleanly structured into factor classifications with descriptive labels.
1.1 Libraries and Data loading
# 1.1 Firstly we load the appropriate libraries library(ggplot2) library(dplyr)
library(patchwork) library(pROC)
library(janitor)
# 1.2 Load data setwd("/Users/warshaautar/Desktop/Statistics") getwd()
## [1] "/Users/warshaautar/Desktop/Statistics"
data <- read.csv("osteoporosis.csv") #1.3 Descriptive labels str(data)
## 'data.frame': 1958 obs. of 16 variables: ## $ Id : int 104866 101999 106567 102316 101944 102265 107447 103065 103040 105960 ... ## $ Age : int 69 32 89 78 38 41 20 39 70 19 ... ## $ Gender : chr "Female" "Female" "Female" "Female" ... ## $ Hormonal.Changes : chr "Normal" "Normal" "Postmenopausal" "Normal" ... ## $ Family.History : chr "Yes" "Yes" "No" "No" ... ## $ Race.Ethnicity : chr "Asian" "Asian" "Caucasian" "Caucasian" ... ## $ Body.Weight : chr "Underweight" "Underweight" "Normal" "Underweight" ... ## $ Calcium.Intake : chr "Low" "Low" "Adequate" "Adequate" ... ## $ Vitamin.D.Intake : chr "Sufficient" "Sufficient" "Sufficient" "Insufficient" ... ## $ Physical.Activity : chr "Sedentary" "Sedentary" "Active" "Sedentary" ... ## $ Smoking : chr "Yes" "No" "No" "Yes" ... ## $ Alcohol.Consumption: chr "Moderate" "None" "Moderate" "None" ... ## $ Medical.Conditions : chr "Rheumatoid Arthritis" "None" "Hyperthyroidism" "Rheumatoid Arthritis" ... ## $ Medications : chr "Corticosteroids" "None" "Corticosteroids" "Corticosteroids" ... ## $ Prior.Fractures : chr "Yes" "Yes" "No" "No" ... ## $ Osteoporosis : int 1 1 1 1 1 1 1 1 1 1 ...
The str() function was used to examine the structure of the dataset. This confirmed that the dataset contained both numerical and categorical variables. Numerical variables included Age and Osteoporosis status, while categorical variables included Gender, Race/Ethnicity, Smoking status, Physical Activity level, Calcium Intake, Vitamin D Intake, and several medical risk factors. Understanding the structure of the data was important in determining which statistical tests were appropriate for each variable. This means that the datacollection for the osteoporosis analysis consists of different kinds of variable
DemographicsBehavioural
Medical History
1.2 Early data exploration
Before further statistical analysis of osteoporosis risk, age should be evaluated for distributional properties, while gender should be considered as a stratification variable in subsequent comparisons.
#1.2.1 Gender Count tabyl(data$Gender)
## data$Gender n percent ## Female 966 0.4933606 ## Male 992 0.5066394
The distribution of gender within the dataset was examined using the tabyl() function. The results showed that 966 participants (49.3%) were female and 992 participants (50.7%) were male. This indicates that the sample was almost evenly divided between males and females, reducing the likelihood of gender imbalance influencing the results.
#1.2.2 To begin, what are the mean and median ages in the data summary(data$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 18.0 21.0 32.0 39.1 53.0 90.0
This means the youngest age in the data is 18yo, the median age is 32yo, the mean age is 39.1yo, and the oldest age is 90yo.
Before comparing groups statistically, it is also necessary to determine whether Age follows a normal distribution. With very large datasets, normality tests often become overly sensitive and may incorrectly detect trivial deviations from normality. Using a sample provides a more realistic assessment, thus a Shapiro-Wilk normality test is performed on a random sample of 500 observations.
#1.2.3 Shapiro Sample test shapiro.test(sample(data$Age, 500))
## ## Shapiro-Wilk normality test ## ## data: sample(data$Age, 500) ## W = 0.85277, p-value < 2.2e-16
The test produced a W statistic of approximately 0.846 and a p-value smaller than 0.001. Since the p-value was less than the significance level of 0.05, the null hypothesis of normality was rejected. This indicates that age was not normally distributed within the study population. Consequently, non-parametric statistical methods were considered more appropriate for analysing age differences between groups. Because Age was found to be non-normal, a: Wilcoxon Rank-Sum Test was selected instead of a traditional t-test.
#1.2.4 Wilcoxon test wilcox.test(Age ~ Osteoporosis, data = data)
## ## Wilcoxon rank sum test with continuity correction ## ## data: Age by Osteoporosis ## W = 93121, p-value < 2.2e-16 ## alternative hypothesis: true location shift is not equal to 0
The Wilcoxon rank-sum test indicates a statistically significant difference in age distributions between individuals with and without osteoporosis (W = 93121, p < 2.2 × 10⁻¹⁶). This indicates a statistically significant difference in the central tendency of age between the two groups (group with osteoporosis, group withou osteoporosis).
#1.2.5 Plotting Figure 1 ggplot(data, aes(x = Age)) + geom_histogram(bins = 25, fill = "salmon", color = "white") + theme_minimal() + labs(title = "Figure 1: Visual Inspection of Age Distribution")
The visual depicts the non-normal distribution of the age range is and how there is a large group of research participants below the age of 25.
Continuous age values are useful for statistical testing but can be difficult to interpret visually in proportional comparisons. Age is therefore converted into clinically meaningful categories. Additionally age groups make trends easier to visualise.
#1.2.6 Age to age category conversion data <- data %>% mutate( Age_Group = cut( Age, breaks = c(0, 24, 39, 49, 59, 69, 79, 100), labels = c("<25", "26-40", "40-49", "50-59", "60-69", "70-79", "80+") ) )
2 Chi-Square Independence Evaluation
To establish statistical associations comprehensively, categorical variables are passed through automated Chi-Square testing loops.
# 2.1 Listing all the variables variables <- c("Age_Group","Smoking", "Hormonal.Changes", "Family.History", "Body.Weight", "Calcium.Intake", "Vitamin.D.Intake", "Physical.Activity", "Alcohol.Consumption", "Medical.Conditions", "Medications", "Prior.Fractures") # 2.2 Running the Chi-Square test for each one automatically results_list <- lapply(variables, function(var) { test <- chisq.test(data[[var]], data$Osteoporosis) data.frame( Factor = var, X_Squared = round(test$statistic, 3), df = test$parameter, P_Value = round(test$p.value, 4), Significant = ifelse(test$p.value < 0.05, "Yes (p < 0.05)", "No") ) }) # 2.3 Combine them all into one clean table chi_square_table <- do.call(rbind, results_list) chi_square_table
## Factor X_Squared df P_Value Significant ## X-squared Age_Group 1117.039 6 0.0000 Yes (p < 0.05) ## X-squared1 Smoking 0.460 1 0.4978 No ## X-squared2 Hormonal.Changes 0.400 1 0.5269 No ## X-squared3 Family.History 0.002 1 0.9639 No ## X-squared4 Body.Weight 2.367 1 0.1239 No ## X-squared5 Calcium.Intake 0.018 1 0.8921 No ## X-squared6 Vitamin.D.Intake 0.524 1 0.4693 No ## X-squared7 Physical.Activity 0.663 1 0.4155 No ## X-squared8 Alcohol.Consumption 0.002 1 0.9639 No ## X-squared9 Medical.Conditions 0.296 2 0.8626 No ## X-squared10 Medications 2.092 1 0.1481 No ## X-squared11 Prior.Fractures 0.400 1 0.5269 No
Chi-square tests of independence were performed to investigate whether osteoporosis status was associated with several categorical risk factors. For each variable, a contingency table was analyzed using the chisq.test() function. The resulting p-values for smoking status, hormonal changes, family history, body weight, calcium intake, vitamin D intake, physical activity, alcohol consumption, medical conditions, medication use, and prior fractures were all greater than the 0.05 significance threshold. Because these values remained high, the null hypothesis of independence was retained, indicating insufficient statistical evidence to link these lifestyle or clinical variables directly to an osteoporosis diagnosis within this specific dataset. In contrast, the Age_Group variable produced a p-value significantly below the 0.05 threshold. This allows us to reject the null hypothesis and confirms a statistically significant association between a participant's age bracket and their osteoporosis status.
3 Proportional Osteoporosis Status
Here we observe data shapes and partition populations by age boundaries to inspect variations in distribution profiles.
3.1 Demographic Profiles & Osteoporosis Status
The layout below details relative distributions across gender assignments, age brackets, and ethnic classifications.
#Fig.2: Osteoporosis by Gender demo1 <- ggplot(data, aes(x = Gender, fill = factor(Osteoporosis))) + geom_bar(position = "fill", alpha = 0.9) + scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) + theme_minimal() + labs(title = "Fig.2: Osteoporosis by Gender", x = "Gender", y = "Proportion", fill = "Osteoporosis") scale_fill_discrete(labels = c("0" = "No", "1" = "Yes"))
## <ggproto object: Class ScaleDiscrete, Scale, gg> ## aesthetics: fill ## axis_order: function ## break_info: function ## break_positions: function ## breaks: waiver ## call: call ## clone: function ## dimension: function ## drop: TRUE ## expand: waiver ## fallback_palette: function ## get_breaks: function ## get_breaks_minor: function ## get_labels: function ## get_limits: function ## get_transformation: function ## guide: legend ## is_discrete: function ## is_empty: function ## labels: No Yes ## limits: NULL ## make_sec_title: function ## make_title: function ## map: function ## map_df: function ## minor_breaks: waiver ## n.breaks.cache: NULL ## na.translate: TRUE ## na.value: grey50 ## name: waiver ## palette: NULL ## palette.cache: NULL ## position: left ## range: environment ## rescale: function ## reset: function ## train: function ## train_df: function ## transform: function ## transform_df: function ## super: <ggproto object: Class ScaleDiscrete, Scale, gg>
#Fig.3:Osteoporosis by Age Group demo2 <- ggplot(data, aes(x = Age_Group, fill = factor(Osteoporosis))) + geom_bar(position = "fill", alpha = 0.9) + scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) + theme_minimal() + labs(title = "Fig.3:Osteoporosis by Age Group", x = "Age Group", y = "Proportion", fill = "Osteoporosis") scale_fill_discrete(labels = c("0" = "No", "1" = "Yes"))
## <ggproto object: Class ScaleDiscrete, Scale, gg> ## aesthetics: fill ## axis_order: function ## break_info: function ## break_positions: function ## breaks: waiver ## call: call ## clone: function ## dimension: function ## drop: TRUE ## expand: waiver ## fallback_palette: function ## get_breaks: function ## get_breaks_minor: function ## get_labels: function ## get_limits: function ## get_transformation: function ## guide: legend ## is_discrete: function ## is_empty: function ## labels: No Yes ## limits: NULL ## make_sec_title: function ## make_title: function ## map: function ## map_df: function ## minor_breaks: waiver ## n.breaks.cache: NULL ## na.translate: TRUE ## na.value: grey50 ## name: waiver ## palette: NULL ## palette.cache: NULL ## position: left ## range: environment ## rescale: function ## reset: function ## train: function ## train_df: function ## transform: function ## transform_df: function ## super: <ggproto object: Class ScaleDiscrete, Scale, gg>
#Fig.4:Osteoporosis by Race/Ethnicity demo3 <- ggplot(data, aes(x = `Race.Ethnicity`, fill = factor(Osteoporosis))) + geom_bar(position = "fill", alpha = 0.9) + scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) + theme_minimal() + labs(title = "Fig.4:Osteoporosis by Race/Ethnicity", x = "Race / Ethnicity", y = "Proportion", fill = "Osteoporosis") # Render Page 1 demo1 / demo2 / demo3 + plot_layout(guides = "collect") + plot_annotation(title = "Figures 2-4: Demographic Profiles & Osteoporosis Status")
These proportional diagnostics and visuals prove that the risk of osteoporosis remains relatively even, except for certain age groups. The older the age group, the higher the presence of osteoporosis. This falls in line with the Chi-test results.
.
3.2 Behavioral & Diet Risk Factors
We systematically verify whether physical behaviors or lifestyle options generate proportional differences in diagnosis rates.
#Fig.5: Physical Activity Level life1 <- ggplot(data, aes(x = `Physical.Activity`, fill = factor(Osteoporosis))) + geom_bar(position = "fill", alpha = 0.9) + scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) + theme_minimal() + labs(title = "Fig.5: Physical Activity Level", x = "Activity Level", y = "Proportion", fill = "Osteoporosis") #Fig.6: Vitamin D Intake life2 <- ggplot(data, aes(x = `Vitamin.D.Intake`, fill = factor(Osteoporosis))) + geom_bar(position = "fill", alpha = 0.9) + scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) + theme_minimal() + labs(title = "Fig.6: Vitamin D Intake", x = "Vitamin D", y = "Proportion", fill = "Osteoporosis") #Fig.7: Calcium Intake life3 <- ggplot(data, aes(x = `Calcium.Intake`, fill = factor(Osteoporosis))) + geom_bar(position = "fill", alpha = 0.9) + scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) + theme_minimal() + labs(title = "Fig.7: Calcium Intake", x = "Calcium", y = "Proportion", fill = "Osteoporosis") #Fig.8: Smoking Status life4 <- ggplot(data , aes(x = `Smoking`, fill = factor(Osteoporosis))) + geom_bar(position = "fill", alpha = 0.9) + scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) + theme_minimal() + labs(title = "Fig.8: Smoking Status", x = "Smoking (Yes only)", y = "Proportion", fill = "Osteoporosis") #Fig.9: Alcohol Consumption life5 <- ggplot(data, aes(x = `Alcohol.Consumption`, fill = factor(Osteoporosis))) + geom_bar(position = "fill", alpha = 0.9) + scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) + theme_minimal() + labs(title = "Fig.9: Alcohol Consumption", x = "Alcohol Level", y = "Proportion", fill = "Osteoporosis") #Render Page 2 (life1 + life2 + life3 + life4 + life5) + plot_layout(ncol = 2, guides = "collect") + plot_annotation(title = "Figures 5-9: Behavioral & Diet Risk Factors")
These proportional diagnostics and visuals prove that the risk of osteoporosis remains relatively even, which is in line with the Chi-test results.
.
3.3 Medical History & Clinical Factors
Medical background states are grouped to view diagnostic variance proportions clearly.
#Fig.10: Hormonal Status med1 <- ggplot(data, aes(x = `Hormonal.Changes`, fill = factor(Osteoporosis))) + geom_bar(position = "fill", alpha = 0.9) + scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) + theme_minimal() + labs(title = "Fig.10: Hormonal Status", x = "Hormonal Changes", y = "Proportion", fill = "Osteoporosis") #Fig.11: Family History med2 <- ggplot(data, aes(x = `Family.History`, fill = factor(Osteoporosis))) + geom_bar(position = "fill", alpha = 0.9) + scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) + theme_minimal() + labs(title = "Fig.11: Family History", x = "Family History", y = "Proportion", fill = "Osteoporosis") #Fig.12: Prior Fractures med3 <- ggplot(data, aes(x = `Prior.Fractures`, fill = factor(Osteoporosis))) + geom_bar(position = "fill", alpha = 0.9) + scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) + theme_minimal() + labs(title = "Fig.12: Prior Fractures", x = "Prior Fractures", y = "Proportion", fill = "Osteoporosis") #Fig.13: Medical Conditions med4 <- ggplot(data, aes(x = `Medical.Conditions`, fill = factor(Osteoporosis))) + geom_bar(position = "fill", alpha = 0.9) + scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) + theme_minimal() + labs(title = "Fig.13: Medical Conditions", x = "Medical Conditions", y = "Proportion", fill = "Osteoporosis") #Fig.14: Medications Use med5 <- ggplot(data, aes(x = Medications, fill = factor(Osteoporosis))) + geom_bar(position = "fill", alpha = 0.9) + scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) + theme_minimal() + labs(title = "Fig.14: Medications Use", x = "Medications", y = "Proportion", fill = "Osteoporosis") # Render Page 3 (med1 + med2 + med3 + med4 + med5) + plot_layout(ncol = 2, guides = "collect") + plot_annotation(title = "Figures 10-14: Medical History & Clinical Factors; Hormonal Changes, Family History, Prior Fractures")
A series of stacked bar charts were created to explore the relationship between osteoporosis status and demographic characteristics. These charts displayed osteoporosis prevalence according to gender, age group, and race/ethnicity. The graphs were arranged together using the patchwork package to create a single demographic profile figure. The age-group visualisation showed the clearest pattern, with osteoporosis becoming increasingly common in older age categories. This visual trend was consistent with the statistical findings obtained from the Wilcoxon Rank-Sum Test.
4 Predictive Modeling Architecture
We deploy two distinct logistic regression configurations: a baseline tracking model restricted purely to age metrics, and a full multivariable modeling configuration processing all lifestyle and medical indices.
4.1 Baseline Standalone Age Framework
To begin the baseline involves isolating age as the sole independent predictor to establish a foundational benchmark for risk determination.
#4.1.1 Splits data and generates a summary data %>% group_by(factor(Osteoporosis)) %>% summarise( Average_Age = mean(Age), Median_Age = median(Age), Min_Age = min(Age), Max_Age = max(Age), Total_Patients = n() )
## # A tibble: 2 × 6 ## `factor(Osteoporosis)` Average_Age Median_Age Min_Age Max_Age Total_Patients ## <fct> <dbl> <int> <int> <int> <int> ## 1 0 24.3 22 18 38 979 ## 2 1 53.9 53 18 90 979
# 4.1.2 Draw a Boxplot to once again visually prove how different the ages are # This graph will clearly show that the groups barely overlap! ggplot(data, aes(x = factor(Osteoporosis), y = Age, fill = factor(Osteoporosis))) + geom_boxplot(alpha = 0.7, outlier.color = "red") + scale_fill_discrete(labels = c("0" = "No", "1" = "Yes")) + theme_minimal() + labs( title = "The Primary Driver: Age Distribution by Osteoporosis Status", subtitle = "Proving that age profiles do not overlap significantly between groups", x = "Has Osteoporosis? (No vs Yes)", y = "Age (Years)", fill = "Osteoporosis" )
# 4.1.3 Fit a model using ONLY Age as a predictor age_only_model <- glm(Osteoporosis ~ Age, data = data, family = binomial) # 4.1.4 Look at the statistical significance of Age by itself summary(age_only_model)
## ## Call: ## glm(formula = Osteoporosis ~ Age, family = binomial, data = data) ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -5.25057 0.23864 -22.0 <2e-16 *** ## Age 0.15624 0.00762 20.5 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 2714.4 on 1957 degrees of freedom ## Residual deviance: 1391.3 on 1956 degrees of freedom ## AIC: 1395.3 ## ## Number of Fisher Scoring iterations: 6
# 4.1.5 Generate risk probabilities based solely on a patient's age data$predicted_risk_age <- predict(age_only_model, type = "response") # 4.1.6 Calculate the ROC curve data roc_age <- roc(data$Osteoporosis, data$predicted_risk_age)
# 4.1.7 Plot the ROC curve for Age alone plot(roc_age, col = "darkgreen", # Using a different color to distinguish it lwd = 3, main = paste("Age Standalone Accuracy (AUC =", round(auc(roc_age), 2), ")"))
4.2 Multivariable Structural Framework, everything except age
Secondly this multivariable framework integrates the remaining lifestyle and medical indices, intentionally excluding age to isolate the predictive power of behavioral and clinical factors.
# 4.2 XXXX NEED TO LOOK AT THIS Summary Statistics for Categorical Variables by Osteoporosis Status # Since everything is categorical, let's look at counts for key risk factors summary_stats <- data %>% group_by(Osteoporosis = factor(Osteoporosis)) %>% summarise( Total_Patients = n(), # Counts for Calcium Intake levels Calcium_Low = sum(Calcium.Intake == "Low", na.rm = TRUE), Calcium_Adequate = sum(Calcium.Intake == "Adequate", na.rm = TRUE), # Counts for Vitamin D Intake levels VitD_Insufficient = sum(Vitamin.D.Intake == "Insufficient", na.rm = TRUE), VitD_Sufficient = sum(Vitamin.D.Intake == "Sufficient", na.rm = TRUE), # Count of Smokers Smokers = sum(Smoking == "Yes", na.rm = TRUE) ) print(summary_stats)
## # A tibble: 2 × 7 ## Osteoporosis Total_Patients Calcium_Low Calcium_Adequate VitD_Insufficient ## <fct> <int> <int> <int> <int> ## 1 0 979 504 475 482 ## 2 1 979 500 479 465 ## # ℹ 2 more variables: VitD_Sufficient <int>, Smokers <int>
# 4.2.XXX Fit the Multivariable Logistic Regression Model (EXCLUDING Age) # All column names have been updated to match R's dot-notation conversion of your dataset model_no_age <- glm( factor(Osteoporosis) ~ Gender + Hormonal.Changes + Family.History + Race.Ethnicity + Body.Weight + Calcium.Intake + Vitamin.D.Intake + Physical.Activity + Smoking + Alcohol.Consumption + Medical.Conditions + Medications + Prior.Fractures, data = data, family = binomial ) # 4.2.3 Look at the statistical significance of the remaining variables summary(model_no_age)
## ## Call: ## glm(formula = factor(Osteoporosis) ~ Gender + Hormonal.Changes + ## Family.History + Race.Ethnicity + Body.Weight + Calcium.Intake + ## Vitamin.D.Intake + Physical.Activity + Smoking + Alcohol.Consumption + ## Medical.Conditions + Medications + Prior.Fractures, family = binomial, ## data = data) ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -0.068462 0.182740 -0.375 0.708 ## GenderMale 0.044567 0.090839 0.491 0.624 ## Hormonal.ChangesPostmenopausal 0.063024 0.090745 0.695 0.487 ## Family.HistoryYes -0.007754 0.090748 -0.085 0.932 ## Race.EthnicityAsian -0.025464 0.110924 -0.230 0.818 ## Race.EthnicityCaucasian -0.036122 0.110277 -0.328 0.743 ## Body.WeightUnderweight 0.136553 0.090922 1.502 0.133 ## Calcium.IntakeLow -0.011927 0.090798 -0.131 0.895 ## Vitamin.D.IntakeSufficient 0.070556 0.090795 0.777 0.437 ## Physical.ActivitySedentary 0.077870 0.091066 0.855 0.392 ## SmokingYes -0.065567 0.090996 -0.721 0.471 ## Alcohol.ConsumptionNone -0.011252 0.090743 -0.124 0.901 ## Medical.ConditionsNone -0.055819 0.110536 -0.505 0.614 ## Medical.ConditionsRheumatoid Arthritis -0.009249 0.111181 -0.083 0.934 ## MedicationsNone -0.132634 0.090754 -1.461 0.144 ## Prior.FracturesYes 0.064861 0.090990 0.713 0.476 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 2714.4 on 1957 degrees of freedom ## Residual deviance: 2706.3 on 1942 degrees of freedom ## AIC: 2738.3 ## ## Number of Fisher Scoring iterations: 3
# 4.4.4 Generate risk probabilities based on this new model data$predicted_risk_no_age <- predict(model_no_age, type = "response") # 4.4.5 Calculate the ROC curve data # Making sure response is matching the format used in glm roc_no_age <- roc(data$Osteoporosis, data$predicted_risk_no_age)
# 4.4.6 Plot the ROC Curve for the model without Age plot( roc_no_age, col = "darkblue", lwd = 3, main = paste("Model Accuracy Without Age (AUC =", round(auc(roc_no_age), 2), ")") )
A multivariable logistic regression model was fitted using osteoporosis status as the binary dependent variable against all lifestyle options and clinical backgrounds, intentionally excluding raw age to isolate their baseline impacts. This model assesses whether factors like diet, smoking, or medical conditions can reliably predict diagnostic outcomes on their own. The output summary reveals that when age is omitted, none of the remaining clinical or lifestyle coefficients achieve statistical significance. This indicates that without controlling for age timelines, individual categorical habits do not build a strong predictive foundation for identifying osteoporosis risk profiles in this sample population.
5 Comparative Diagnostic Evaluation
Finally, we compute information fit diagnostics (AIC) alongside Receiver Operating Characteristic (ROC) curves to quantify predictive accuracy enhancements gained by expanding the clinical model.
# 5.1.1 Baseline Model: Age Only model_baseline <- glm( factor(Osteoporosis) ~ Age, data = data, family = binomial ) # 5.1.2 Expanded Model: Age + All Clinical & Lifestyle Factors model_expanded <- glm( factor(Osteoporosis) ~ Age + Gender + Hormonal.Changes + Family.History + Race.Ethnicity + Body.Weight + Calcium.Intake + Vitamin.D.Intake + Physical.Activity + Smoking + Alcohol.Consumption + Medical.Conditions + Medications + Prior.Fractures, data = data, family = binomial ) # 5.2.2 COMPUTE INFORMATION FIT DIAGNOSTICS (AIC) aic_baseline <- AIC(model_baseline) aic_expanded <- AIC(model_expanded) cat("--- Information Fit Diagnostics ---\n")
## --- Information Fit Diagnostics ---
cat("Baseline Model (Age Only) AIC: ", round(aic_baseline, 2), "\n")
## Baseline Model (Age Only) AIC: 1395.28
cat("Expanded Model (Age + All) AIC:", round(aic_expanded, 2), "\n")
## Expanded Model (Age + All) AIC: 1410.66
cat("AIC Difference: ", round(aic_baseline - aic_expanded, 2), "\n\n")
## AIC Difference: -15.38
# Note: A lower AIC is better. If the expanded model's AIC is NOT significantly # lower than the baseline, the extra variables are just adding noise. #5.3.1 Generate predictions prob_baseline <- predict(model_baseline, type = "response") prob_expanded <- predict(model_expanded, type = "response") #5.3.2 Calculate ROC roc_baseline <- roc(data$Osteoporosis, prob_baseline)
roc_expanded <- roc(data$Osteoporosis, prob_expanded)
#5.3.3 Run DeLong's test to see if the two AUCs are statistically different delong_test <- roc.test(roc_baseline, roc_expanded, method = "delong") print(delong_test)
## ## DeLong's test for two correlated ROC curves ## ## data: roc_baseline and roc_expanded ## Z = -1.9997, p-value = 0.04553 ## alternative hypothesis: true difference in AUC is not equal to 0 ## 95 percent confidence interval: ## -6.373593e-03 -6.394478e-05 ## sample estimates: ## AUC of roc1 AUC of roc2 ## 0.9028412 0.9060599
#5.4.1 Plot the baseline curve first plot( roc_baseline, col = "darkgreen", lwd = 3, main = "Comparative Diagnostic Evaluation", xlab = "False Positive Rate (1 - Specificity)", ylab = "True Positive Rate (Sensitivity)" ) #5.4.2 Add the expanded model curve to the same plot plot( roc_expanded, col = "darkblue", lwd = 3, add = TRUE ) #5.4.3 Add a legend to distinguish the models, including AIC and AUC metrics legend( "bottomright", legend = c( paste("Age Only (AUC =", round(auc(roc_baseline), 2), ", AIC =", round(aic_baseline, 0), ")"), paste("Age + Clinical (AUC =", round(auc(roc_expanded), 2), ", AIC =", round(aic_expanded, 0), ")") ), col = c("darkgreen", "darkblue"), lwd = 3, bty = "n" # Removes the legend border box )
A logistic regression model was fitted using osteoporosis status as the dependent variable and age as the independent variable. Logistic regression is appropriate when the outcome variable is binary, as osteoporosis status was coded as either present or absent. The model assessed whether age could predict the likelihood of osteoporosis. Given the large age difference observed between the osteoporosis and non-osteoporosis groups, age emerged as a strong predictor of osteoporosis status. The results indicate that the probability of osteoporosis increases as age increases.
Conclusion
The analysis identified age as the strongest factor associated with osteoporosis in this dataset. Participants diagnosed with osteoporosis were substantially older than those without osteoporosis, and this difference was highly statistically significant. In contrast, no statistically significant associations were detected between osteoporosis status and the other demographic, lifestyle, or medical variables examined. It should also be noted that the study population contained a relatively large proportion of younger participants, as reflected by the median age of 32 years. This imbalance in the age distribution may have influenced the results, as younger individuals generally have a lower risk of osteoporosis. Consequently, some associations between osteoporosis and other risk factors may have been more difficult to detect due to the underrepresentation of older age groups, who are typically at greater risk of developing the condition. Overall, the findings suggest that age is the primary predictor of osteoporosis within this sample population, highlighting the importance of age-related changes in bone health and osteoporosis risk. .