knitr::opts_chunk$set(echo = TRUE)
Is there a statistically significant difference in mean diabetes prevalence between males and females in the United States?
The dataset used for this analysis is the Diabetes Prevalence dataset (diabetes.prev) from OpenIntro. It contains over 3,000 observations, each representing one U.S. county identified by its FIPS code and county name. For each county, the dataset reports estimated diabetes prevalence separately for men and for women.
For this analysis, the primary variables of interest are diabetes prevalence (diabetes_prev), a quantitative variable measured as a percentage, and sex (sex), a categorical variable with two levels: Male and Female. Both are created by reshaping the source columns percent.men.diabetes and percent.women.diabetes into long format. The dataset was accessed from the OpenIntro website and suits this analysis because of its large sample size and public health relevance.
Exploratory data analysis was conducted to understand the distribution of diabetes prevalence and to compare patterns between males and females. The data were reshaped to one row per county and sex, missing values were excluded when computing summary statistics (na.rm = TRUE), and a boxplot was created to examine group differences prior to hypothesis testing.
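The reshaping and missing-value handling described above can be sketched on a toy data frame. The column names mirror those used in this analysis, but the three counties and their values are invented for illustration, and drop_na() is only one possible way to remove incomplete rows, not necessarily the approach taken here.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: three made-up counties, one missing value
toy <- data.frame(
  County = c("A", "B", "C"),
  percent.men.diabetes = c(11.0, NA, 12.5),
  percent.women.diabetes = c(10.0, 9.5, 11.0)
)

# Reshape to one row per county and sex, then drop rows with no prevalence
toy_long <- toy %>%
  pivot_longer(
    cols = c(percent.men.diabetes, percent.women.diabetes),
    names_to = "sex",
    values_to = "diabetes_prev"
  ) %>%
  mutate(sex = ifelse(sex == "percent.men.diabetes", "Male", "Female")) %>%
  drop_na(diabetes_prev)

# Count the remaining observations in each group
count(toy_long, sex)
```

Dropping rows explicitly makes the group sizes visible before testing, whereas na.rm = TRUE silently skips missing values inside each summary function.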
# Load required libraries
library(dplyr)
library(tidyr)
library(ggplot2)
# Import dataset
diabetes <- read.csv("diabetes.prev.csv")
# Reshape data to long format for analysis
diabetes_long <- diabetes %>%
  select(FIPS.Codes, County, percent.men.diabetes, percent.women.diabetes) %>%
  pivot_longer(
    cols = c(percent.men.diabetes, percent.women.diabetes),
    names_to = "sex",
    values_to = "diabetes_prev"
  ) %>%
  mutate(sex = ifelse(sex == "percent.men.diabetes", "Male", "Female"))
# Summary statistics by sex
summary_stats <- diabetes_long %>%
  group_by(sex) %>%
  summarise(
    mean_prevalence = mean(diabetes_prev, na.rm = TRUE),
    median_prevalence = median(diabetes_prev, na.rm = TRUE),
    sd_prevalence = sd(diabetes_prev, na.rm = TRUE),
    .groups = "drop"
  )
summary_stats
## # A tibble: 2 × 4
##   sex    mean_prevalence median_prevalence sd_prevalence
##   <chr>            <dbl>             <dbl>         <dbl>
## 1 Female            10.2              10            2.46
## 2 Male              11.2              11.2          2.14
# Boxplot of diabetes prevalence by sex
ggplot(diabetes_long, aes(x = sex, y = diabetes_prev, fill = sex)) +
  geom_boxplot() +
  scale_fill_manual(values = c("skyblue", "pink")) +
  labs(
    title = "Diabetes Prevalence by Sex",
    x = "Sex",
    y = "Diabetes Prevalence (%)"
  ) +
  theme_minimal()
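A Welch two-sample t-test was conducted to determine whether mean diabetes prevalence differs significantly between males and females. The test suits a quantitative response compared across two groups and does not assume equal variances, which is relevant here because the sample standard deviations differ (2.46 for females, 2.14 for males). The test treats the two groups as independent. Because each county contributes both a male and a female value, that assumption should be read with caution.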
Let μ₁ represent the mean diabetes prevalence for males and μ₂ represent the mean diabetes prevalence for females.
Null Hypothesis (H₀): μ₁ = μ₂
Alternative Hypothesis (Hₐ): μ₁ ≠ μ₂
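For reference, the standard Welch test statistic and its approximate degrees of freedom (the Welch-Satterthwaite formula) can be written as follows. The symbols for the sample means, standard deviations, and sizes are introduced here for illustration and follow the subscripts above (1 for males, 2 for females).

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^{2}}
{\dfrac{\left(s_1^2/n_1\right)^{2}}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^{2}}{n_2 - 1}}
```

R orders the levels of sex alphabetically, so t.test() computes the difference as Female minus Male. This flips the sign of t relative to the expression above but leaves the two-sided p-value unchanged.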
# Welch Two-Sample t-test
t_test_result <- t.test(diabetes_prev ~ sex,
                        data = diabetes_long,
                        alternative = "two.sided")
t_test_result
##
## Welch Two Sample t-test
##
## data: diabetes_prev by sex
## t = -16.197, df = 6166.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -1.0579333 -0.8294991
## sample estimates:
## mean in group Female   mean in group Male
##             10.24403             11.18775
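The Welch test above treats male and female values as independent samples, yet each county supplies one value for each sex. One possible sensitivity check, not part of the original analysis, is a paired t-test on the within-county differences. The sketch below uses synthetic data only: the column names match the wide-format source columns, while the values, the sample size n, and the county_effect helper are hypothetical.

```r
set.seed(1)  # synthetic data only; these are not the OpenIntro values

# Hypothetical wide-format data: one row per county, both sexes measured
n <- 50
county_effect <- rnorm(n, mean = 10, sd = 2)
toy <- data.frame(
  percent.men.diabetes = county_effect + 1 + rnorm(n, sd = 0.5),
  percent.women.diabetes = county_effect + rnorm(n, sd = 0.5)
)

# Paired t-test on within-county differences (women minus men)
paired_result <- t.test(toy$percent.women.diabetes,
                        toy$percent.men.diabetes,
                        paired = TRUE)
paired_result
```

A paired test removes county-to-county variation shared by both sexes, so it can give a narrower confidence interval than the Welch test when male and female prevalence are strongly correlated within counties.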
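Interpretation:
The p-value (< 2.2e-16) is far below α = 0.05, so we reject the null hypothesis: there is a statistically significant difference in mean diabetes prevalence between males and females.
The sample means are 10.24% for females and 11.19% for males. The 95% confidence interval for the difference (Female minus Male) runs from -1.06 to -0.83 percentage points, so mean prevalence among males is estimated to be roughly 0.8 to 1.1 percentage points higher than among females.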
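With several thousand observations per group, even a modest difference can yield a very small p-value, so a standardized effect size helps gauge practical importance. The sketch below computes Cohen's d, which is an addition and not part of the original analysis. It uses the means from the t-test output and the rounded standard deviations from the summary table, and it assumes roughly equal group sizes when pooling the standard deviations.

```r
# Values copied from the summary table and the t-test output
mean_female <- 10.24403
mean_male   <- 11.18775
sd_female   <- 2.46  # rounded in the tibble printout
sd_male     <- 2.14  # rounded in the tibble printout

# Pooled SD, assuming roughly equal group sizes (an assumption)
sd_pooled <- sqrt((sd_female^2 + sd_male^2) / 2)

# Cohen's d: standardized mean difference (Male minus Female)
cohens_d <- (mean_male - mean_female) / sd_pooled
round(cohens_d, 2)
```

Under these assumptions d comes out at roughly 0.41, which conventional rules of thumb place between a small and a medium effect.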
This study examined whether diabetes prevalence differs between males and females in the United States. The Welch two-sample t-test found a statistically significant difference (t = -16.197, df = 6166.6, p < 2.2e-16), with mean prevalence about 0.94 percentage points higher among males (11.19%) than among females (10.24%).
Future research directions include:
Investigating diabetes prevalence across different age groups and states.
Examining trends over multiple years to assess temporal changes.
Incorporating additional variables, such as obesity rates or socioeconomic factors, into regression models to identify potential predictors of diabetes prevalence.
OpenIntro Statistics. Diabetes Prevalence Dataset. https://www.openintro.org/data/index.php?data=diabetes.prev