Data Dive — Regression Modeling

Introduction

The purpose of this data dive is to gain practical experience in statistical analysis by running an ANOVA test and building a regression model. Using the BRFSS 2015 diabetes health indicators dataset, we examine how BMI relates to general health status and to age.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(dplyr)
library(ggplot2)

Data Preparation

First, we load the dataset.

dataset <- read_delim("C:/Users/Akshay Dembra/Downloads/Stats_Selected_Dataset/diabetes_binary_5050split_health_indicators_BRFSS2015_1.csv", delim = ",")

## Rows: 70692 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(dataset)

Selecting Variables

Response Variable

The response should be a continuous variable. We select BMI, a key health indicator:

# Selecting BMI as the response variable
response_variable <- dataset$BMI

Explanatory Variable

For the categorical explanatory variable we use GenHlth, a self-rated general health score:

# Convert GenHlth to a factor for ANOVA analysis
dataset$GenHlth <- as.factor(dataset$GenHlth)
explanatory_variable <- dataset$GenHlth

Consolidate Categories

GenHlth has five levels (coded 1 to 5), not more than 10, so consolidation is optional; we still group them into three broader categories to simplify the comparison. In the BRFSS codebook the scale runs from 1 = excellent to 5 = poor, so codes 1 and 2 map to “Good”, 3 and 4 to “Average”, and 5 to “Poor”:

# Consolidate GenHlth codes (1 = excellent ... 5 = poor) into three groups
dataset$GenHlth <- dplyr::recode(dataset$GenHlth,
                              `1` = "Good",
                              `2` = "Good",
                              `3` = "Average",
                              `4` = "Average",
                              `5` = "Poor")

# Check the new levels
levels(dataset$GenHlth)

## [1] "Good"    "Average" "Poor"
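
A quick way to confirm that a consolidation like this behaved as intended is to cross-tabulate the original codes against the new groups. The sketch below is only an illustration: it uses a small made-up vector (`codes`) rather than the real dataset, placeholder group names, and base R's `levels<-` with a named list instead of `dplyr::recode`.

```r
# Toy stand-in for a 1-5 coded survey item (values are made up for illustration)
codes <- factor(c(1, 2, 2, 3, 4, 4, 5, 1, 3, 5))

# Collapse five codes into three groups; the group names are placeholders
grouped <- codes
levels(grouped) <- list(codes_1_2 = c("1", "2"),
                        codes_3_4 = c("3", "4"),
                        code_5    = "5")

# Cross-tabulate old codes against new groups: each row should have
# non-zero counts in exactly one column
check <- table(original = codes, grouped = grouped)
print(check)
```

If any original code shows counts in more than one column, or in none, the mapping has a mistake.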

ANOVA Test

Null Hypothesis

The null hypothesis is that mean BMI is the same in all three general health categories; the alternative is that at least one category mean differs.

Conduct ANOVA

We will perform an ANOVA test to evaluate this hypothesis:

# Perform ANOVA test
anova_result <- aov(BMI ~ GenHlth, data = dataset)
summary(anova_result)

##                Df  Sum Sq Mean Sq F value Pr(>F)    
## GenHlth         2  228957  114478    2417 <2e-16 ***
## Residuals   70689 3348597      47                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation of Results

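P-value: The p-value for GenHlth is reported as < 2e-16, far below the 0.05 threshold, so we reject the null hypothesis: mean BMI differs significantly across the general health categories.

F-statistic: The F value of 2417 is the ratio of between-group variability (mean square 114478) to within-group variability (mean square 47). A value this large means the differences between category means are far greater than chance variation within categories would produce.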
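
The F value can be reproduced by hand from the sums of squares and degrees of freedom in the ANOVA table. The sketch below copies those numbers from the output above; because the table entries are rounded, the result matches only after rounding. The eta-squared line is an addition not present in the original analysis: it is a common effect-size measure giving the share of total variance associated with the grouping.

```r
# Values copied from the ANOVA table above
ss_between <- 228957   # Sum Sq for GenHlth
ss_within  <- 3348597  # Sum Sq for Residuals
df_between <- 2
df_within  <- 70689

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of between-group to within-group mean squares
f_stat <- ms_between / ms_within

# Eta-squared: share of total BMI variance associated with the grouping
eta_sq <- ss_between / (ss_between + ss_within)

# Upper-tail p-value from the F distribution
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

print(round(c(F = f_stat, eta_sq = eta_sq), 3))
```

The F value rounds to 2417, matching the table, while eta-squared comes out near 0.064: the grouping is associated with roughly 6% of the variance in BMI, which puts the very small p-value in perspective.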

Conclusion

Because p < 0.05, there is enough evidence to conclude that mean BMI varies across the general health categories. Had p been ≥ 0.05, we could not have concluded that the categories differ.

With more than 70,000 observations, even small differences reach statistical significance, so the result should be read alongside the size of the differences (see the boxplot below). ANOVA also shows association only: these results indicate that general health status and BMI are related, not that one causes the other. Even so, they can help health professionals decide where targeted interventions or further research into specific health categories may be worthwhile.
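
A significant ANOVA says only that at least one group mean differs, not which ones. One common follow-up, not part of the original analysis, is Tukey's HSD, available in base R as `TukeyHSD()`. The sketch below runs it on synthetic data with made-up group names and values; on the real data the call would presumably be `TukeyHSD(anova_result)`.

```r
set.seed(42)

# Synthetic stand-in data: three groups with different true means
# (group names and values are made up, not taken from the BRFSS dataset)
toy <- data.frame(
  group = factor(rep(c("g1", "g2", "g3"), each = 50)),
  y     = c(rnorm(50, mean = 27), rnorm(50, mean = 30), rnorm(50, mean = 33))
)

fit <- aov(y ~ group, data = toy)

# Tukey's Honest Significant Difference: all pairwise comparisons,
# with p-values adjusted for multiple testing
tukey <- TukeyHSD(fit)
print(tukey)
```

Each row of the result gives the estimated difference between two group means, a confidence interval, and an adjusted p-value.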

ANOVA Boxplot

To visualize differences in BMI across different general health categories, we can use a boxplot:

# Boxplot for BMI by General Health
ggplot(dataset, aes(x = GenHlth, y = BMI)) +
  geom_boxplot() +
  labs(title = "BMI by General Health Category", x = "General Health", y = "BMI")

Linear Regression Model

Select Continuous Predictor

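We now choose a numeric predictor for BMI and use Age. Note that in this dataset Age is not measured in years: it is an ordinal code from 1 to 13 for age bands (mostly five years wide), which we treat as numeric here: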

# Use Age as a predictor for BMI
predictor_variable <- dataset$Age

Build Regression Model

We will construct a linear regression model using Age to predict BMI:

# Build linear regression model
lm_model <- lm(BMI ~ Age, data = dataset)
summary(lm_model)

## 
## Call:
## lm(formula = BMI ~ Age, data = dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.624  -4.720  -1.299   3.280  68.376 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.684471   0.084794  361.87   <2e-16 ***
## Age         -0.096398   0.009374  -10.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.109 on 70690 degrees of freedom
## Multiple R-squared:  0.001494,   Adjusted R-squared:  0.00148 
## F-statistic: 105.7 on 1 and 70690 DF,  p-value: < 2.2e-16

Interpretation of Coefficients

Intercept Interpretation:

The intercept in a linear regression model is the predicted value of the response (here, BMI) when the predictor (Age) equals zero.

In our case, the intercept is 30.684, which means the model predicts a BMI of about 30.7 when Age is zero. Age in this dataset is an ordinal code starting at 1, so zero lies outside the observed range, and the intercept is best read as the anchor of the fitted line rather than as a realistic prediction.

Coefficient Interpretations:

Age Coefficient:
- The coefficient for Age is -0.096, which means that each one-unit increase in Age is associated with a decrease of about 0.096 in predicted BMI.
- The coefficient is statistically significant (t = -10.28, p < 2e-16), but the effect is small: across twelve units of Age the predicted BMI changes by only about 1.2 points.
Model Fit:
- The multiple R-squared is 0.001494, so Age explains roughly 0.15% of the variance in BMI. The residual standard error of 7.109 BMI points is large relative to the Age effect.
- Statistical significance here reflects the large sample size (70,692 rows) rather than a strong relationship; a straight line in Age is a weak predictor of BMI.
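
To make the fitted linear model concrete, the coefficients printed by `summary(lm_model)` above can be plugged into the regression equation by hand. In the sketch below, `predict_bmi` is a hypothetical helper, and the Age range of 1 to 13 is an assumption based on the BRFSS age-band coding, not something shown in the output.

```r
# Coefficients copied from the summary(lm_model) output above
intercept <- 30.684471
slope_age <- -0.096398

# Hypothetical helper: fitted BMI for a given value of the Age variable
predict_bmi <- function(age) intercept + slope_age * age

# Fitted values at the lowest and highest Age codes (assumed to be 1 and 13)
fitted_low  <- predict_bmi(1)
fitted_high <- predict_bmi(13)

print(c(low = fitted_low, high = fitted_high,
        difference = fitted_high - fitted_low))
```

Under these assumptions the fitted BMI falls from about 30.6 to about 29.4 across the whole Age range, a change of roughly 1.2 points, which is small next to the residual standard error of 7.109.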

Regression Plot

To visualize the relationship between Age and BMI, we can use a scatter plot with a regression line:

# Scatter plot with regression line
ggplot(dataset, aes(x = Age, y = BMI)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "Relationship between Age and BMI", x = "Age", y = "BMI")

## `geom_smooth()` using formula = 'y ~ x'
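
When the predictor takes only a handful of distinct values, a scatter plot overplots heavily, and a table of group means can be easier to read. The sketch below shows one way to compute such a table with base R's `aggregate()`. The data are synthetic and the column names (`age_code`, `y`) are placeholders, not columns from the real dataset.

```r
set.seed(1)

# Synthetic stand-in: a discrete predictor with 13 codes and a noisy response
toy <- data.frame(age_code = sample(1:13, 500, replace = TRUE))
toy$y <- 30 - 0.1 * toy$age_code + rnorm(500, sd = 7)

# Mean response for each code of the predictor
group_means <- aggregate(y ~ age_code, data = toy, FUN = mean)

print(group_means)
```

Plotting these means against the codes would show whether a straight line is a reasonable summary of the trend or whether the pattern is curved.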

Conclusion

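In this data dive, we explored relationships between health indicators using ANOVA and linear regression. The ANOVA showed that mean BMI differs significantly across general health categories (F = 2417, p < 2e-16). The linear regression found a statistically significant but very weak negative association between Age and BMI (slope -0.096, R-squared 0.0015). Both analyses describe associations in survey data and do not by themselves show that general health status or age causes changes in BMI.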