Final Project

Introduction

Research Question:
Do BMI, age, and physical activity level predict whether a person has diabetes?

This project uses the Diabetes Health Indicators Dataset from the Behavioral Risk Factor Surveillance System, also known as BRFSS, from 2015. The dataset contains health and lifestyle information from adults in the United States. Each row represents one respondent, and the dataset includes variables related to diabetes, body health, physical activity, general health, and other lifestyle factors.

The full dataset has 253,680 observations and 22 variables. For this project, I focus on four variables. Diabetes_binary is the outcome variable, where 0 means no diabetes and 1 means diabetes. BMI is body mass index. Age is the respondent's age category, coded as an integer from 1 to 13, with higher values indicating older age groups. PhysActivity indicates whether the person reported physical activity in the past 30 days (0 = no, 1 = yes).
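Because Age is stored as a category code rather than as years, a lookup vector can make the codes easier to read. The sketch below assumes the standard BRFSS 2015 five-year age grouping (the _AGEG5YR variable); the labels are not taken from this report and should be checked against the BRFSS codebook. The names age_labels and example_age_codes are hypothetical.

```r
# Assumed BRFSS 2015 _AGEG5YR coding; verify against the codebook before use.
age_labels <- c("18-24", "25-29", "30-34", "35-39", "40-44", "45-49",
                "50-54", "55-59", "60-64", "65-69", "70-74", "75-79", "80+")

# Hypothetical example codes, not rows from the dataset
example_age_codes <- c(1, 9, 13)

# Index the label vector by the integer code
age_labels[example_age_codes]
```

Under this assumed coding, a one-category increase in Age corresponds to moving up one five-year age band.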

I chose this topic because diabetes is a major public health issue, and BMI, age, and physical activity are commonly connected to diabetes risk. I wanted to test whether these variables can predict diabetes status using a logistic regression model.

Source: https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators

Data Analysis

To prepare the data, I selected the variables needed for the research question, dropped any rows with missing values, and converted the diabetes outcome and physical activity variables into factors with readable labels. The prepared dataset still has 253,680 rows, so no rows were removed for missing values in these four variables. The main functions used were select() and mutate() from dplyr and drop_na() from tidyr.

dim(data)

## [1] 253680     22

head(data)

##   Diabetes_binary HighBP HighChol CholCheck BMI Smoker Stroke
## 1               0      1        1         1  40      1      0
## 2               0      0        0         0  25      1      0
## 3               0      1        1         1  28      0      0
## 4               0      1        0         1  27      0      0
## 5               0      1        1         1  24      0      0
## 6               0      1        1         1  25      1      0
##   HeartDiseaseorAttack PhysActivity Fruits Veggies HvyAlcoholConsump
## 1                    0            0      0       1                 0
## 2                    0            1      0       0                 0
## 3                    0            0      1       0                 0
## 4                    0            1      1       1                 0
## 5                    0            1      1       1                 0
## 6                    0            1      1       1                 0
##   AnyHealthcare NoDocbcCost GenHlth MentHlth PhysHlth DiffWalk Sex Age
## 1             1           0       5       18       15        1   0   9
## 2             0           1       3        0        0        0   0   7
## 3             1           1       5       30       30        1   0   9
## 4             1           0       2        0        0        0   0  11
## 5             1           0       2        3        0        0   0  11
## 6             1           0       2        0        2        0   1  10
##   Education Income
## 1         4      3
## 2         6      1
## 3         4      8
## 4         3      6
## 5         5      4
## 6         6      8

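library(dplyr)    # select(), mutate(), group_by(), summarise(), %>%
library(tidyr)    # drop_na()
library(ggplot2)  # ggplot(), geom_boxplot(), geom_bar()
library(caret)    # confusionMatrix()
library(pROC)     # roc(), auc()

# Keep the four project variables, drop incomplete rows, and label the binary variables
project_data <- data %>%
  select(Diabetes_binary, BMI, Age, PhysActivity) %>%
  drop_na() %>%
  mutate(
    Diabetes_binary = factor(Diabetes_binary, levels = c(0, 1), labels = c("No Diabetes", "Diabetes")),
    PhysActivity = factor(PhysActivity, levels = c(0, 1), labels = c("No Activity", "Activity"))
  )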

dim(project_data)

## [1] 253680      4

summary(project_data)

##     Diabetes_binary        BMI             Age              PhysActivity   
##  No Diabetes:218334   Min.   :12.00   Min.   : 1.000   No Activity: 61760  
##  Diabetes   : 35346   1st Qu.:24.00   1st Qu.: 6.000   Activity   :191920  
##                       Median :27.00   Median : 8.000                       
##                       Mean   :28.38   Mean   : 8.032                       
##                       3rd Qu.:31.00   3rd Qu.:10.000                       
##                       Max.   :98.00   Max.   :13.000

project_data %>%
  group_by(Diabetes_binary) %>%
  summarise(
    count = n(),
    mean_BMI = round(mean(BMI), 2),
    mean_age_category = round(mean(Age), 2)
  )

## # A tibble: 2 × 4
##   Diabetes_binary  count mean_BMI mean_age_category
##   <fct>            <int>    <dbl>             <dbl>
## 1 No Diabetes     218334     27.8              7.81
## 2 Diabetes         35346     31.9              9.38

ggplot(project_data, aes(x = Diabetes_binary, y = BMI, fill = Diabetes_binary)) +
  geom_boxplot() +
  labs(
    title = "BMI by Diabetes Status",
    x = "Diabetes Status",
    y = "BMI"
  )

ggplot(project_data, aes(x = Diabetes_binary, fill = PhysActivity)) +
  geom_bar(position = "fill") +
  labs(
    title = "Physical Activity by Diabetes Status",
    x = "Diabetes Status",
    y = "Proportion",
    fill = "Physical Activity"
  )

Statistical Analysis

Logistic regression is an appropriate method for this project because the outcome variable is binary. The model estimates the probability that a person has diabetes from BMI, age category, and physical activity, and it can include several predictors at once.

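The final model is:

log(p / (1 - p)) = b0 + b1 * BMI + b2 * Age + b3 * PhysActivity

where p is the probability that a respondent has diabetes, and PhysActivity equals 1 for respondents who reported activity and 0 otherwise.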

model <- glm(Diabetes_binary ~ BMI + Age + PhysActivity,
             data = project_data,
             family = binomial)

summary(model)

## 
## Call:
## glm(formula = Diabetes_binary ~ BMI + Age + PhysActivity, family = binomial, 
##     data = project_data)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -6.0119182  0.0408038 -147.34   <2e-16 ***
## BMI                   0.0870482  0.0008736   99.65   <2e-16 ***
## Age                   0.2208479  0.0023672   93.30   <2e-16 ***
## PhysActivityActivity -0.4231666  0.0129595  -32.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 204847  on 253679  degrees of freedom
## Residual deviance: 182948  on 253676  degrees of freedom
## AIC: 182956
## 
## Number of Fisher Scoring iterations: 5

odds_ratios <- exp(coef(model))
odds_ratios

##          (Intercept)                  BMI                  Age 
##          0.002449385          1.090949280          1.247133765 
## PhysActivityActivity 
##          0.654969502

confint_odds <- exp(confint(model))

## Waiting for profiling to be done...

confint_odds

##                            2.5 %      97.5 %
## (Intercept)          0.002260744 0.002652877
## BMI                  1.089085679 1.092821479
## Age                  1.241370135 1.252942810
## PhysActivityActivity 0.638554907 0.671832163

The coefficients from logistic regression are interpreted in terms of log odds. A positive coefficient means the predictor increases the log odds of diabetes. A negative coefficient means the predictor decreases the log odds of diabetes.

The odds ratios make the model easier to interpret. An odds ratio above 1 means the predictor is associated with higher odds of diabetes. An odds ratio below 1 means the predictor is associated with lower odds of diabetes.
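The link between coefficients and odds ratios can be checked by hand. The sketch below copies the estimates and standard errors from the model summary, exponentiates them, and computes Wald 95% intervals. The Wald interval is a simpler approximation than the profile-likelihood interval that confint() reports, so small differences in the last digits are expected. The object names (est, se, wald_ci) are illustrative.

```r
# Estimates and standard errors copied from the summary(model) output
est <- c(BMI = 0.0870482, Age = 0.2208479, PhysActivityActivity = -0.4231666)
se  <- c(BMI = 0.0008736, Age = 0.0023672, PhysActivityActivity = 0.0129595)

# Odds ratio = exp(coefficient); percent change in odds = (OR - 1) * 100
odds_ratio <- exp(est)
pct_change <- (odds_ratio - 1) * 100

# Wald 95% interval on the log-odds scale, then exponentiated
z <- qnorm(0.975)
wald_ci <- cbind(lower = exp(est - z * se), upper = exp(est + z * se))

round(cbind(odds_ratio, pct_change, wald_ci), 3)
```

With a sample this large, the Wald intervals should agree closely with the profile-likelihood intervals from exp(confint(model)).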

The odds ratio for BMI is 1.091 (95% CI 1.089 to 1.093). For each one-unit increase in BMI, the odds of having diabetes are multiplied by about 1.091, an increase of about 9.1 percent, holding age and physical activity constant.

The odds ratio for Age is 1.247 (95% CI 1.241 to 1.253). For each one-category increase in age, the odds of having diabetes are multiplied by about 1.247, an increase of about 24.7 percent, holding BMI and physical activity constant. Because Age enters the model as a number, this assumes the same multiplicative change between every pair of adjacent age categories.

The odds ratio for physical activity is 0.655 (95% CI 0.639 to 0.672). People who reported physical activity have odds of diabetes about 34.5 percent lower than people who reported no activity, holding BMI and age constant.
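Odds ratios describe relative changes, so it can help to turn the coefficients into a predicted probability for one example person. The sketch below uses the coefficients from the model summary and a hypothetical respondent (BMI 30, age category 9) chosen only for illustration.

```r
# Coefficients copied from the summary(model) output
b <- c(intercept = -6.0119182, BMI = 0.0870482, Age = 0.2208479, activity = -0.4231666)

# Hypothetical respondent: BMI 30, age category 9, reports physical activity
log_odds <- b[["intercept"]] + b[["BMI"]] * 30 + b[["Age"]] * 9 + b[["activity"]] * 1

# plogis() is the inverse logit: 1 / (1 + exp(-x))
prob_active <- plogis(log_odds)

# Same respondent without physical activity (remove the activity term)
prob_inactive <- plogis(log_odds - b[["activity"]])

round(c(active = prob_active, inactive = prob_inactive), 3)
```

For this hypothetical respondent the predicted probability is roughly 0.14 with physical activity and roughly 0.20 without. Both values are well below 0.5, which is relevant when reading the classification results. The same numbers could be obtained from the fitted object with predict(model, newdata = ..., type = "response"), provided the new data frame codes PhysActivity as a factor with the same labels.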

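The p-values in the model summary show whether each predictor is statistically significant. All three predictors have p-values below 2e-16, far below the 0.05 threshold, so each is statistically significant. With more than 250,000 observations, even small effects reach statistical significance, so the odds ratios say more about the size of each effect than the p-values do.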

Model Performance and Diagnostics

For logistic regression, model performance is checked using a confusion matrix, accuracy, sensitivity, specificity, ROC curve, and AUC. The confusion matrix shows how many diabetes and non-diabetes cases the model classifies correctly. The ROC curve and AUC show how well the model separates people with diabetes from people without diabetes.

predicted_prob <- predict(model, type = "response")

predicted_class <- ifelse(predicted_prob > 0.5, "Diabetes", "No Diabetes")

predicted_class <- factor(predicted_class, levels = c("No Diabetes", "Diabetes"))

conf_matrix <- confusionMatrix(predicted_class, project_data$Diabetes_binary, positive = "Diabetes")

conf_matrix

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    No Diabetes Diabetes
##   No Diabetes      216568    34080
##   Diabetes           1766     1266
##                                         
##                Accuracy : 0.8587        
##                  95% CI : (0.8573, 0.86)
##     No Information Rate : 0.8607        
##     P-Value [Acc > NIR] : 0.9979        
##                                         
##                   Kappa : 0.0449        
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.035817      
##             Specificity : 0.991911      
##          Pos Pred Value : 0.417546      
##          Neg Pred Value : 0.864032      
##              Prevalence : 0.139333      
##          Detection Rate : 0.004991      
##    Detection Prevalence : 0.011952      
##       Balanced Accuracy : 0.513864      
##                                         
##        'Positive' Class : Diabetes      
##

accuracy <- conf_matrix$overall["Accuracy"]
sensitivity <- conf_matrix$byClass["Sensitivity"]
specificity <- conf_matrix$byClass["Specificity"]

accuracy

## Accuracy 
## 0.858696

sensitivity

## Sensitivity 
##  0.03581735

specificity

## Specificity 
##   0.9919115

actual_numeric <- ifelse(project_data$Diabetes_binary == "Diabetes", 1, 0)

roc_result <- roc(actual_numeric, predicted_prob)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(roc_result,
     main = "ROC Curve for Diabetes Prediction")

auc_value <- auc(roc_result)
auc_value

## Area under the curve: 0.747

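The model accuracy is 0.859, meaning the model correctly classified about 85.9 percent of respondents. This is slightly below the no-information rate of 0.861, which is the accuracy obtained by predicting "No Diabetes" for everyone, so accuracy alone overstates how useful the model is.

The sensitivity is 0.036. Sensitivity is the share of people who actually have diabetes that the model classifies as having diabetes. At the 0.5 cutoff, the model identified only 1,266 of the 35,346 respondents with diabetes.

The specificity is 0.992. Specificity is the share of people who do not have diabetes that the model classifies as not having diabetes. The model correctly classified 216,568 of the 218,334 respondents without diabetes.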
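These three metrics follow directly from the four counts in the confusion matrix. The sketch below recomputes them by hand as a check, using short illustrative names (acc, sens, spec, nir) so they do not overwrite the objects created earlier.

```r
# Counts copied from the confusion matrix (positive class = "Diabetes")
tp <- 1266    # predicted Diabetes, actually Diabetes
fn <- 34080   # predicted No Diabetes, actually Diabetes
fp <- 1766    # predicted Diabetes, actually No Diabetes
tn <- 216568  # predicted No Diabetes, actually No Diabetes
n  <- tp + fn + fp + tn

acc  <- (tp + tn) / n      # share of all respondents classified correctly
sens <- tp / (tp + fn)     # share of actual diabetes cases detected
spec <- tn / (tn + fp)     # share of actual non-cases correctly cleared

# Accuracy from always predicting the majority class ("No Diabetes")
nir <- (tn + fp) / n

round(c(accuracy = acc, sensitivity = sens, specificity = spec, no_info_rate = nir), 4)
```

The comparison of acc with nir shows why accuracy is a weak summary when one class makes up about 86 percent of the data.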

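The AUC value is 0.747. AUC measures how well the predicted probabilities separate people with diabetes from people without diabetes across all possible cutoffs. An AUC of 0.5 means the model is no better than random guessing, while an AUC of 1.0 means perfect separation. An AUC of 0.747 indicates moderate discrimination, which suggests the low sensitivity comes largely from using a 0.5 cutoff when only about 14 percent of respondents have diabetes, rather than from the predictors carrying no signal.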
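The difference between a cutoff-based metric and AUC can be illustrated with a small simulation. The sketch below does not use the BRFSS data: it simulates a rare binary outcome with one predictor, compares sensitivity and specificity at the default 0.5 cutoff and at a cutoff equal to the observed prevalence, and computes AUC from ranks (the Mann-Whitney form). All names and simulation settings are assumptions made for illustration.

```r
set.seed(42)

# Simulated stand-in data (not the BRFSS data): one predictor, rare positive class
n <- 20000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2.2 + 1.0 * x))

fit <- glm(y ~ x, family = binomial)
p   <- predict(fit, type = "response")

# Sensitivity and specificity at a given probability cutoff
metrics_at <- function(cutoff) {
  pred <- as.integer(p > cutoff)
  c(cutoff      = cutoff,
    sensitivity = sum(pred == 1 & y == 1) / sum(y == 1),
    specificity = sum(pred == 0 & y == 0) / sum(y == 0))
}

# Compare the default 0.5 cutoff with a cutoff at the observed prevalence
cutoffs   <- c(0.5, mean(y))
by_cutoff <- t(sapply(cutoffs, metrics_at))

# Rank-based (Mann-Whitney) AUC: does not depend on any cutoff
n1 <- sum(y == 1)
n0 <- sum(y == 0)
auc_rank <- (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

round(by_cutoff, 3)
round(auc_rank, 3)
```

Lowering the cutoff trades specificity for sensitivity while the AUC stays the same, because AUC depends only on how the predicted probabilities are ranked. For the fitted diabetes model, the coords() function in pROC is one way to inspect this trade-off on roc_result.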

Discussion of Results

This logistic regression model examined whether BMI, age, and physical activity predict diabetes status. As expected, higher BMI and older age category were associated with higher odds of diabetes, while reported physical activity was associated with lower odds. All three predictors were statistically significant (p < 2e-16).

Each one-unit increase in BMI was associated with about 9 percent higher odds of diabetes (odds ratio 1.091), and each one-category increase in age with about 25 percent higher odds (odds ratio 1.247). Respondents who reported physical activity had about 35 percent lower odds of diabetes than those who did not (odds ratio 0.655). These are associations in survey data and do not by themselves show that any predictor causes diabetes.

The confusion matrix and AUC give a mixed picture of model performance. The AUC of 0.747 indicates that the predicted probabilities separate the two groups moderately well, but at the 0.5 cutoff the model detects only about 3.6 percent of respondents with diabetes, and its accuracy (0.859) does not exceed the no-information rate (0.861). The model is therefore more useful for describing how these factors relate to diabetes risk than for classifying individual respondents.

Conclusion and Future Directions

This project examined whether BMI, age, and physical activity level predict whether a person has diabetes. Logistic regression was used because the outcome variable was binary. The model showed how each predictor was associated with the odds of diabetes while holding the other predictors constant.

Overall, the results suggest that BMI, age, and physical activity are useful variables for understanding diabetes risk. Higher BMI and older age were associated with higher odds of diabetes, while physical activity was associated with lower odds. As a classifier at the 0.5 cutoff, however, the model missed most respondents with diabetes.

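A limitation of this project is that only three predictors were used. Diabetes is affected by many other factors, such as diet, income, smoking, blood pressure, cholesterol, and access to healthcare. The model was also evaluated on the same data used to fit it and at a single 0.5 cutoff, so the reported performance may not carry over to new respondents or to other cutoffs. Future research could include more predictors, evaluate the model on held-out data, or compare different classification models to improve prediction accuracy.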
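One way to compare candidate models fairly is to fit them on one part of the data and score them on the rest. The sketch below illustrates the idea on simulated data, not the BRFSS data: it makes a 70/30 split, fits a smaller and a larger logistic regression on the training rows, and compares rank-based AUC on the held-out rows. The variable names, split proportion, and simulation settings are all assumptions.

```r
set.seed(7)

# Simulated stand-in data (not the BRFSS data) with two predictors
n  <- 10000
x1 <- rnorm(n)
x2 <- rnorm(n)
sim <- data.frame(x1 = x1, x2 = x2,
                  y = rbinom(n, 1, plogis(-1.5 + 0.8 * x1 + 0.6 * x2)))

# 70/30 train/test split
train_idx <- sample(seq_len(n), size = 0.7 * n)
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Two candidate models fit on the training rows only
fit_small <- glm(y ~ x1,      data = train, family = binomial)
fit_large <- glm(y ~ x1 + x2, data = train, family = binomial)

# Rank-based (Mann-Whitney) AUC for predicted probabilities p and 0/1 outcome y
auc_rank <- function(p, y) {
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Score both models on the held-out rows
test_auc <- c(
  small = auc_rank(predict(fit_small, newdata = test, type = "response"), test$y),
  large = auc_rank(predict(fit_large, newdata = test, type = "response"), test$y)
)
round(test_auc, 3)
```

Applied to this project, the same pattern would split project_data before fitting and compare the three-predictor model with models that add variables such as HighBP or HighChol.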

References

CDC Behavioral Risk Factor Surveillance System. Diabetes Health Indicators Dataset, BRFSS 2015.
https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators

Kuhn, M. caret package documentation.
https://topepo.github.io/caret/

Robin, X. pROC package documentation.
https://cran.r-project.org/package=pROC