Project 2 Final Paper

knitr::opts_chunk$set(echo = TRUE)

Introduction

Is there a statistically significant difference in mean diabetes prevalence between males and females in the United States?

The dataset used for this analysis is the Diabetes Prevalence dataset (diabetes.prev) from OpenIntro. The dataset contains over 3,000 observations. As used in this analysis, each observation represents one U.S. county, identified by its FIPS code and county name, and records the estimated percentage of men and the estimated percentage of women with diabetes in that county.

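For this analysis, the primary variables of interest are diabetes prevalence (diabetes_prev), a quantitative variable measured as a percentage, and sex (sex), a categorical variable with two levels: Male and Female. Both variables are constructed by reshaping the original columns percent.men.diabetes and percent.women.diabetes into long format, so each county contributes one male and one female value. This dataset was accessed from the OpenIntro Statistics website and is suitable for statistical analysis due to its large sample size and public health relevance.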

Data Analysis

Exploratory data analysis was conducted to understand the distribution of diabetes prevalence and to compare patterns between males and females. The data were reshaped to long format, summary statistics (mean, median, and standard deviation) were calculated by sex with missing values excluded, and a boxplot was created to examine group differences prior to hypothesis testing.
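
Before relying on na.rm = TRUE, it can help to see how many values are actually missing in each group. The following is a minimal base R sketch of that check. The data frame toy_long and its values are made up for illustration and only mimic the shape of the reshaped data; they are not taken from the OpenIntro file.

```r
# Toy stand-in for the reshaped data; these values are invented for illustration.
toy_long <- data.frame(
  sex = c("Male", "Female", "Male", "Female", "Male", "Female"),
  diabetes_prev = c(11.2, 10.1, NA, 9.8, 12.0, NA)
)

# Count missing prevalence values within each sex
missing_by_sex <- tapply(is.na(toy_long$diabetes_prev), toy_long$sex, sum)
missing_by_sex

# Keep only rows with an observed prevalence value
toy_clean <- toy_long[!is.na(toy_long$diabetes_prev), ]
nrow(toy_clean)
```

If the counts differ noticeably between groups, that would be worth reporting alongside the summary statistics.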

# Load required libraries

library(dplyr)

library(tidyr)
library(ggplot2)

# Import dataset

diabetes <- read.csv("diabetes.prev.csv")

# Reshape data to long format for analysis

diabetes_long <- diabetes %>%
  select(FIPS.Codes, County, percent.men.diabetes, percent.women.diabetes) %>%
  pivot_longer(
    cols = c(percent.men.diabetes, percent.women.diabetes),
    names_to = "sex",
    values_to = "diabetes_prev"
  ) %>%
  mutate(sex = ifelse(sex == "percent.men.diabetes", "Male", "Female"))

# Summary statistics by sex

summary_stats <- diabetes_long %>%
  group_by(sex) %>%
  summarise(
    mean_prevalence = mean(diabetes_prev, na.rm = TRUE),
    median_prevalence = median(diabetes_prev, na.rm = TRUE),
    sd_prevalence = sd(diabetes_prev, na.rm = TRUE),
    .groups = "drop"
  )
summary_stats

## # A tibble: 2 × 4
##   sex    mean_prevalence median_prevalence sd_prevalence
##   <chr>            <dbl>             <dbl>         <dbl>
## 1 Female            10.2              10            2.46
## 2 Male              11.2              11.2          2.14

# Boxplot of diabetes prevalence by sex

ggplot(diabetes_long, aes(x = sex, y = diabetes_prev, fill = sex)) +
  geom_boxplot() +
  scale_fill_manual(values = c("skyblue", "pink")) +
  labs(
    title = "Diabetes Prevalence by Sex",
    x = "Sex",
    y = "Diabetes Prevalence (%)"
  ) +
  theme_minimal()

Statistical Analysis

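A Welch two-sample t-test was conducted to determine whether there is a statistically significant difference in mean diabetes prevalence between males and females. This test was chosen because the response variable is quantitative and equal variances cannot be assumed (the sample standard deviations are 2.14 for males and 2.46 for females). The test treats the two groups as independent samples. Because each county contributes both a male and a female estimate, the observations are in fact matched by county, so the independence assumption should be treated with caution.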

Let μ₁ represent the mean diabetes prevalence for males and μ₂ represent the mean diabetes prevalence for females.

Null Hypothesis (H₀): μ₁ = μ₂

Alternative Hypothesis (Hₐ): μ₁ ≠ μ₂
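
For reference, the Welch test statistic and its approximate degrees of freedom can be written as follows. This is the standard textbook form rather than anything specific to this dataset. The symbols \bar{x}_i, s_i, and n_i (sample mean, sample standard deviation, and sample size of group i) are notation introduced here, with group 1 as males and group 2 as females to match μ₁ and μ₂ above.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx
\frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^{2}}
     {\dfrac{\left(s_1^2/n_1\right)^{2}}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^{2}}{n_2 - 1}}
```

Note that R's formula interface orders the groups alphabetically, so t.test reports the difference as Female minus Male. The sign of the reported statistic is therefore the opposite of the one given by this formula; its magnitude and the two-sided p-value are unaffected.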

# Welch Two-Sample t-test

t_test_result <- t.test(diabetes_prev ~ sex,
                        data = diabetes_long,
                        alternative = "two.sided")
t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  diabetes_prev by sex
## t = -16.197, df = 6166.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -1.0579333 -0.8294991
## sample estimates:
## mean in group Female   mean in group Male 
##             10.24403             11.18775
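
Because the reshaping step takes percent.men.diabetes and percent.women.diabetes from the same county row, the male and female values are matched by county. One possible sensitivity check, which is not part of the original analysis, is a paired t-test on the wide-format columns. The sketch below uses a made-up data frame, toy_wide, with invented county labels and values, and also shows that the paired test is equivalent to a one-sample t-test on the within-county differences.

```r
# Toy wide-format data; county labels and percentages are invented for illustration.
toy_wide <- data.frame(
  County = c("A", "B", "C", "D", "E"),
  percent.men.diabetes = c(11.0, 12.3, 9.8, 10.5, 13.1),
  percent.women.diabetes = c(10.1, 11.9, 9.0, 10.2, 11.8)
)

# Paired t-test: compares men and women within the same county
paired_result <- t.test(toy_wide$percent.men.diabetes,
                        toy_wide$percent.women.diabetes,
                        paired = TRUE)

# Equivalent formulation: one-sample t-test on the within-county differences
diffs <- toy_wide$percent.men.diabetes - toy_wide$percent.women.diabetes
one_sample_result <- t.test(diffs, mu = 0)

paired_result
```

Applied to the real data, this would use the two percentage columns of diabetes directly instead of the long-format diabetes_long.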

Interpretation:

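The decision rule is to reject the null hypothesis if the p-value is less than α = 0.05 and to fail to reject it otherwise. Here the p-value (p < 2.2e-16) is far below 0.05, so we reject the null hypothesis: there is a statistically significant difference in mean diabetes prevalence between males and females.

The sample means are 10.24% for females and 11.19% for males. The 95% confidence interval for the difference in means (Female minus Male) is (-1.06, -0.83), so mean prevalence for males is estimated to be between about 0.83 and 1.06 percentage points higher than for females.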
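
The comparison against α can also be made in code rather than by reading the printed output. The sketch below assumes a significance level of 0.05 and uses simulated data (toy, generated with rnorm) purely to produce a test object; the component names p.value, conf.int, and estimate are those of the object returned by t.test.

```r
# Simulated stand-in data; not the OpenIntro values.
set.seed(1)
toy <- data.frame(
  sex = rep(c("Female", "Male"), each = 30),
  diabetes_prev = c(rnorm(30, mean = 10, sd = 2.5),
                    rnorm(30, mean = 11, sd = 2.0))
)

# Welch two-sample t-test (var.equal = FALSE is the default)
res <- t.test(diabetes_prev ~ sex, data = toy, alternative = "two.sided")

alpha <- 0.05
reject_null <- res$p.value < alpha   # TRUE means reject H0 at level alpha
ci <- res$conf.int                   # 95% CI for Female minus Male
mean_diff <- unname(res$estimate[1] - res$estimate[2])

reject_null
ci
mean_diff
```

With the real analysis, the same components can be read from t_test_result.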

Conclusion and Future Directions

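This study examined whether diabetes prevalence differs between males and females in the United States. The Welch two-sample t-test found a statistically significant difference (t = -16.197, df = 6166.6, p < 2.2e-16): mean prevalence was about 0.94 percentage points higher for males (11.19%) than for females (10.24%). Because the unit of analysis is the county, these figures are unweighted averages of county-level estimates rather than national prevalence rates.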

Future research directions include:

Investigating diabetes prevalence across different age groups and states.

Examining trends over multiple years to assess temporal changes.

Incorporating additional variables, such as obesity rates or socioeconomic factors, into regression models to identify potential predictors of diabetes prevalence.
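
As a rough illustration of the regression idea in the last item, the sketch below fits a linear model with sex and one additional predictor. Everything in it is hypothetical: the variable obesity_rate, the data frame toy_reg, and the simulated values are assumptions made for the example and do not come from the OpenIntro dataset.

```r
# Simulated data for illustration only; obesity_rate is a hypothetical variable.
set.seed(42)
n <- 50
toy_reg <- data.frame(
  sex = rep(c("Female", "Male"), each = n / 2),
  obesity_rate = runif(n, min = 20, max = 40)
)
toy_reg$diabetes_prev <- 4 + 0.2 * toy_reg$obesity_rate +
  ifelse(toy_reg$sex == "Male", 1, 0) + rnorm(n, sd = 1)

# Linear model: prevalence as a function of sex and the extra predictor
fit <- lm(diabetes_prev ~ sex + obesity_rate, data = toy_reg)

# Coefficient table: estimate, standard error, t value, and p-value per term
coef_table <- summary(fit)$coefficients
coef_table
```

In such a model the sexMale coefficient would estimate the male-female difference after adjusting for the other predictor, which is one way to probe whether the gap found here persists once other factors are accounted for.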

References

OpenIntro Statistics. Diabetes Prevalence Dataset. https://www.openintro.org/data/index.php?data=diabetes.prev