---
title: "Project 3"
author: "Hadiyah Sumter"
date: "2025-11-22"
output: html_document
---
setwd("~/Desktop/DATA-101")
diabetes <- read.csv("diabetes.prev.csv")
#inspect the dataset
summary(diabetes)
## State FIPS.Codes County num.men.diabetes
## Length:3143 Min. : 1001 Length:3143 Min. : 4
## Class :character 1st Qu.:18178 Class :character 1st Qu.: 475
## Mode :character Median :29177 Mode :character Median : 1103
## Mean :30390 Mean : 3470
## 3rd Qu.:45082 3rd Qu.: 2670
## Max. :56045 Max. :270967
## percent.men.diabetes num.women.diabetes percent.women.diabetes
## Min. : 3.70 Min. : 3 Min. : 2.80
## 1st Qu.: 9.75 1st Qu.: 433 1st Qu.: 8.60
## Median :11.20 Median : 1014 Median :10.00
## Mean :11.19 Mean : 3364 Mean :10.24
## 3rd Qu.:12.70 3rd Qu.: 2510 3rd Qu.:11.70
## Max. :17.70 Max. :276397 Max. :21.10
## num.men.obese percent.men.obese num.women.obese percent.women.obese
## Min. : 11 Min. :11.30 Min. : 8 Min. : 9.90
## 1st Qu.: 1344 1st Qu.:29.70 1st Qu.: 1267 1st Qu.:27.20
## Median : 3125 Median :32.20 Median : 3061 Median :30.10
## Mean : 9998 Mean :31.71 Mean : 10154 Mean :30.18
## 3rd Qu.: 7907 3rd Qu.:34.30 3rd Qu.: 7774 3rd Qu.:33.10
## Max. :765328 Max. :43.40 Max. :771584 Max. :52.60
## num.men.inactive.leisure num.women.inactive.leisure
## Min. : 8 Min. : 7
## 1st Qu.: 1134 1st Qu.: 1210
## Median : 2577 Median : 2893
## Mean : 7659 Mean : 9336
## 3rd Qu.: 6162 3rd Qu.: 7096
## Max. :559500 Max. :707300
## percent.women.inactive.liesure
## Min. : 9.6
## 1st Qu.:24.6
## Median :28.4
## Mean :28.4
## 3rd Qu.:32.3
## Max. :43.2
head(diabetes)
## State FIPS.Codes County num.men.diabetes percent.men.diabetes
## 1 Alabama 1001 Autauga County 2224 12.1
## 2 Alabama 1003 Baldwin County 8181 12.4
## 3 Alabama 1005 Barbour County 1440 12.9
## 4 Alabama 1007 Bibb County 1013 11.0
## 5 Alabama 1009 Blount County 2865 14.0
## 6 Alabama 1011 Bullock County 693 15.3
## num.women.diabetes percent.women.diabetes num.men.obese percent.men.obese
## 1 2336 11.6 5910 31.3
## 2 8017 11.3 19990 29.0
## 3 1505 15.7 4265 37.7
## 4 893 11.3 3738 40.2
## 5 2975 13.9 6954 33.5
## 6 743 20.2 1822 39.9
## num.women.obese percent.women.obese num.men.inactive.leisure
## 1 6274 30.5 4902
## 2 18255 24.5 15650
## 3 4217 44.5 3242
## 4 3188 40.0 2853
## 5 6834 31.3 5177
## 6 1829 50.2 1331
## num.women.inactive.leisure percent.women.inactive.liesure
## 1 6406 31.1
## 2 20450 27.5
## 3 3587 37.9
## 4 2877 36.1
## 5 6952 31.8
## 6 1387 38.1
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
Does the percentage of obesity in men and women in U.S. counties predict the prevalence of diabetes in that county? Diabetes is a major public health concern in the United States, and understanding its relationship with obesity can help identify counties at greater health risk. This study examines whether county-level obesity percentages for adult men and women are significant predictors of diabetes prevalence. The analysis uses the diabetes.prev dataset, which includes 3,143 observations, each representing a U.S. county. While the dataset contains 14 variables covering diabetes, obesity, and leisure-time physical inactivity, this project focuses on four key variables: percent.men.diabetes, percent.women.diabetes, percent.men.obese, and percent.women.obese. These continuous, percentage-based variables make it possible to evaluate how obesity rates relate to diabetes levels at the population level.
The dataset used in this project provides county-level public health estimates and is suitable for regression modeling due to its large sample size and numeric structure. Since the outcome of interest, diabetes prevalence, is continuous and there are two predictors, multiple linear regression is an appropriate modeling approach. The data come from OpenIntro, a public educational resource that provides cleaned and well-documented datasets for statistical analysis. The diabetes.prev dataset can be accessed through the following link: https://www.openintro.org/data/index.php?data=diabetes.prev
In this section, I prepare the dataset for regression by selecting the relevant variables, checking for missing values, and creating an overall diabetes rate to use as the outcome variable. I begin by exploring the dataset using summary statistics to understand the distribution of obesity and diabetes percentages across U.S. counties. Then, I use dplyr functions such as filter(), select(), mutate(), and summarise() to clean and organize the data, ensuring that the variables are properly formatted for analysis. I also create exploratory visualizations, including scatterplots and trend lines, to examine the relationships between obesity rates and diabetes prevalence. This exploratory data analysis helps identify patterns, evaluate assumptions, and confirm that multiple linear regression is an appropriate modeling approach.
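As a minimal sketch of the missing-value check described above, the base R snippet below counts NA values per column and keeps only the complete rows. The small `toy` data frame is hypothetical and stands in for the county data; only the two column names are taken from the dataset, and the values (including the NAs) are made up for illustration.

```r
# Hypothetical stand-in for the county data (values are made up for illustration)
toy <- data.frame(
  percent.men.obese   = c(31.3, NA, 37.7, 40.2),
  percent.women.obese = c(30.5, 24.5, 44.5, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only rows with no missing values in any column
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```

The same two calls applied to the real data frame would show whether the complete-case filter actually drops any counties.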
# EDA step: check structure of dataset
str(diabetes)
## 'data.frame': 3143 obs. of 14 variables:
## $ State : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ FIPS.Codes : int 1001 1003 1005 1007 1009 1011 1013 1015 1017 1019 ...
## $ County : chr "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
## $ num.men.diabetes : int 2224 8181 1440 1013 2865 693 1064 5589 1728 1371 ...
## $ percent.men.diabetes : num 12.1 12.4 12.9 11 14 15.3 15.4 13.5 14.4 14.1 ...
## $ num.women.diabetes : int 2336 8017 1505 893 2975 743 1400 6557 2132 1325 ...
## $ percent.women.diabetes : num 11.6 11.3 15.7 11.3 13.9 20.2 16.5 14.2 15.6 13.1 ...
## $ num.men.obese : int 5910 19990 4265 3738 6954 1822 2327 13013 4574 3355 ...
## $ percent.men.obese : num 31.3 29 37.7 40.2 33.5 39.9 33.7 31.5 37.8 33.9 ...
## $ num.women.obese : int 6274 18255 4217 3188 6834 1829 3187 15094 5727 3216 ...
## $ percent.women.obese : num 30.5 24.5 44.5 40 31.3 50.2 37.8 32.5 41.5 31.6 ...
## $ num.men.inactive.leisure : int 4902 15650 3242 2853 5177 1331 2096 12540 3716 2704 ...
## $ num.women.inactive.leisure : int 6406 20450 3587 2877 6952 1387 3175 16930 5301 3520 ...
## $ percent.women.inactive.liesure: num 31.1 27.5 37.9 36.1 31.8 38.1 37.7 36.5 38.4 34.6 ...
# EDA final step: Summary statistics to explore data
summary(diabetes)
## State FIPS.Codes County num.men.diabetes
## Length:3143 Min. : 1001 Length:3143 Min. : 4
## Class :character 1st Qu.:18178 Class :character 1st Qu.: 475
## Mode :character Median :29177 Mode :character Median : 1103
## Mean :30390 Mean : 3470
## 3rd Qu.:45082 3rd Qu.: 2670
## Max. :56045 Max. :270967
## percent.men.diabetes num.women.diabetes percent.women.diabetes
## Min. : 3.70 Min. : 3 Min. : 2.80
## 1st Qu.: 9.75 1st Qu.: 433 1st Qu.: 8.60
## Median :11.20 Median : 1014 Median :10.00
## Mean :11.19 Mean : 3364 Mean :10.24
## 3rd Qu.:12.70 3rd Qu.: 2510 3rd Qu.:11.70
## Max. :17.70 Max. :276397 Max. :21.10
## num.men.obese percent.men.obese num.women.obese percent.women.obese
## Min. : 11 Min. :11.30 Min. : 8 Min. : 9.90
## 1st Qu.: 1344 1st Qu.:29.70 1st Qu.: 1267 1st Qu.:27.20
## Median : 3125 Median :32.20 Median : 3061 Median :30.10
## Mean : 9998 Mean :31.71 Mean : 10154 Mean :30.18
## 3rd Qu.: 7907 3rd Qu.:34.30 3rd Qu.: 7774 3rd Qu.:33.10
## Max. :765328 Max. :43.40 Max. :771584 Max. :52.60
## num.men.inactive.leisure num.women.inactive.leisure
## Min. : 8 Min. : 7
## 1st Qu.: 1134 1st Qu.: 1210
## Median : 2577 Median : 2893
## Mean : 7659 Mean : 9336
## 3rd Qu.: 6162 3rd Qu.: 7096
## Max. :559500 Max. :707300
## percent.women.inactive.liesure
## Min. : 9.6
## 1st Qu.:24.6
## Median :28.4
## Mean :28.4
## 3rd Qu.:32.3
## Max. :43.2
# Clean data
diabetes_clean <- diabetes |>
  # fix the spelling mistake in the original file
  rename(percent.women.inactive.leisure = percent.women.inactive.liesure) |>
  # keep only counties that have both diabetes and obesity data
  filter(complete.cases(percent.men.diabetes, percent.women.diabetes,
                        percent.men.obese, percent.women.obese))
# Select only variables needed for modeling
df_model <- diabetes_clean |>
  select(percent.men.diabetes,
         percent.women.diabetes,
         percent.men.obese,
         percent.women.obese)
# Create new variable: overall diabetes rate
df_model <- df_model |>
  mutate(diabetes_rate = (percent.men.diabetes + percent.women.diabetes) / 2)
# Descriptive statistics using summarise()
df_summary <- df_model |>
  summarise(
    mean_diabetes = mean(diabetes_rate, na.rm = TRUE),
    mean_men_obese = mean(percent.men.obese, na.rm = TRUE),
    mean_women_obese = mean(percent.women.obese, na.rm = TRUE)
  )
df_summary
## mean_diabetes mean_men_obese mean_women_obese
## 1 10.71589 31.70589 30.18291
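The diabetes_rate variable is a simple average of the two percentages, which weights men and women equally. One possible alternative, not used in this analysis, is to back out the implied adult populations from the counts and percentages and pool them. The sketch below does this for the first two counties shown in the head() output. It assumes each percentage is simply 100 × count / population (the dataset documentation may define it differently, for example with age adjustment), and the names `pop_men`, `pop_women`, `simple_rate`, and `pooled_rate` are my own.

```r
# First two counties from the head() output (Autauga and Baldwin, Alabama)
counties <- data.frame(
  num.men.diabetes       = c(2224, 8181),
  percent.men.diabetes   = c(12.1, 12.4),
  num.women.diabetes     = c(2336, 8017),
  percent.women.diabetes = c(11.6, 11.3)
)

# Assumption: percent = 100 * count / population, so population = count / (percent / 100)
pop_men   <- counties$num.men.diabetes   / (counties$percent.men.diabetes   / 100)
pop_women <- counties$num.women.diabetes / (counties$percent.women.diabetes / 100)

# Simple average, as used for diabetes_rate in this project
simple_rate <- (counties$percent.men.diabetes + counties$percent.women.diabetes) / 2

# Pooled rate: total cases over total implied population
pooled_rate <- 100 * (counties$num.men.diabetes + counties$num.women.diabetes) /
  (pop_men + pop_women)

print(round(cbind(simple_rate, pooled_rate), 2))
```

For these two counties the pooled and simple versions differ only slightly, which suggests the simple average is a reasonable outcome when the male and female populations are of similar size.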
# Male & Female obesity in one figure
df_model |>
  pivot_longer(cols = c(percent.men.obese, percent.women.obese),
               names_to = "sex",
               values_to = "obesity_rate") |>
  mutate(sex = recode(sex,
                      percent.men.obese = "Men",
                      percent.women.obese = "Women")) |>
  ggplot(aes(x = obesity_rate, y = diabetes_rate)) +
  geom_point(alpha = .6) +
  geom_smooth(method = "lm") +
  facet_wrap(~ sex) +
  labs(title = "Obesity vs Diabetes Rate by Sex",
       x = "Percent Obese",
       y = "Diabetes Rate (%)")
## `geom_smooth()` using formula = 'y ~ x'
To examine whether obesity rates in men and women predict diabetes prevalence across U.S. counties, I fit a multiple linear regression model using the lm() function. The outcome variable, diabetes_rate, represents the average diabetes percentage for adult men and women in each county. The predictors included in the model are percent.men.obese and percent.women.obese, which reflect obesity prevalence for men and women. The regression summary provides the estimated coefficients, standard errors, t-values, and p-values for each predictor. These statistics indicate how strongly each obesity variable is related to diabetes prevalence and whether the relationships are statistically significant.
The coefficients from the regression model show how diabetes prevalence changes when obesity percentages increase. A positive coefficient means that as obesity rates rise, diabetes prevalence also tends to rise. If the p-values for the predictors are below 0.05, this suggests a statistically significant relationship. In the context of this research question, significant positive coefficients for male or female obesity indicate that counties with higher obesity levels also tend to have higher diabetes rates. Comparing the sizes of the coefficients helps determine whether male obesity or female obesity has a stronger association with diabetes prevalence. Overall, the model results help evaluate how obesity contributes to diabetes patterns across U.S. counties.
model <- lm(diabetes_rate ~ percent.men.obese + percent.women.obese,
            data = df_model)
summary(model)
##
## Call:
## lm(formula = diabetes_rate ~ percent.men.obese + percent.women.obese,
## data = df_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3428 -1.0464 -0.0287 0.9840 5.6168
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.97710 0.22432 8.814 < 2e-16 ***
## percent.men.obese -0.07509 0.01395 -5.381 7.95e-08 ***
## percent.women.obese 0.36841 0.01108 33.253 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.558 on 3140 degrees of freedom
## Multiple R-squared: 0.5223, Adjusted R-squared: 0.522
## F-statistic: 1717 on 2 and 3140 DF, p-value: < 2.2e-16
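To make the coefficient interpretation concrete, the sketch below plugs the rounded estimates from the summary output into the fitted equation by hand. It evaluates the prediction at the mean obesity percentages reported in df_summary and then shows the change implied by a one-point increase in each predictor. The helper `predict_rate` is hypothetical, written only for this illustration; with the fitted model object available, predict() would do the same job.

```r
# Rounded estimates copied from the summary(model) output
b0      <- 1.97710
b_men   <- -0.07509
b_women <- 0.36841

# Hypothetical helper: the fitted equation written out by hand
predict_rate <- function(men_obese, women_obese) {
  b0 + b_men * men_obese + b_women * women_obese
}

# Prediction at the mean obesity percentages from df_summary
at_means <- predict_rate(31.70589, 30.18291)
print(at_means)

# Change in the prediction for a one-point increase in each predictor,
# holding the other predictor fixed
delta_men   <- predict_rate(32.70589, 30.18291) - at_means
delta_women <- predict_rate(31.70589, 31.18291) - at_means
print(c(men = delta_men, women = delta_women))
```

Because a least-squares fit with an intercept passes through the sample means, the first value should land very close to the mean diabetes rate of about 10.72 reported in df_summary. The two differences simply reproduce the slopes, including the negative sign on male obesity.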
I check five standard assumptions for the multiple linear regression model diabetes_rate ~ percent.men.obese + percent.women.obese.
# Same specification as `model` above, refit under a separate name for the diagnostics
diabetes_lm <- lm(diabetes_rate ~ percent.men.obese + percent.women.obese,
                  data = df_model)
par(mfrow = c(2,2))
plot(diabetes_lm)
par(mfrow = c(1,1))
Linearity: Residuals-vs-Fitted (top-left) is a roughly horizontal band with no curve, so the linear functional form is adequate.
Homoscedasticity: Scale-Location plot (bottom-left) shows only a mild widening toward the right; the spread is acceptably constant for county-level data.
Normality: Normal Q-Q plot (top-right) points stay close to the diagonal; tails deviate only slightly, so residual normality is reasonable.
Independence: Counties are separate administrative units and the residual series displays no systematic drift, so independence is assumed.
Influence: Residuals-vs-Leverage (bottom-right) reveals no Cook’s-distance points above the 0.5 contour; no single county exerts undue influence.
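The visual checks can be complemented with a few numbers. The sketch below is illustrative only: it fits a model to simulated data (not the county data) so that it runs on its own, then extracts the largest Cook's distance and a simple summary of the Q-Q plot. The simulated coefficients and the names `sim`, `fit`, `max_cook`, and `qq_cor` are assumptions made for the example.

```r
set.seed(101)

# Simulated stand-in data; the real analysis would use df_model instead
n <- 200
sim <- data.frame(men_obese = rnorm(n, mean = 32, sd = 4))
sim$women_obese   <- 0.9 * sim$men_obese + rnorm(n, mean = 1, sd = 2)
sim$diabetes_rate <- 2 + 0.3 * sim$women_obese + rnorm(n, sd = 1.5)

fit <- lm(diabetes_rate ~ men_obese + women_obese, data = sim)

# Influence: largest Cook's distance (the plot flags points beyond the 0.5 contour)
max_cook <- max(cooks.distance(fit))

# Normality: correlation between sorted standardized residuals and normal quantiles;
# values near 1 correspond to points lying close to the Q-Q diagonal
std_res <- rstandard(fit)
qq_cor  <- cor(sort(std_res), qnorm(ppoints(n)))

print(c(max_cook = max_cook, qq_cor = qq_cor))
```

Swapping `fit` for the model fitted to the county data would give numeric counterparts to the bottom-right and top-right diagnostic panels.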
# Quick multicollinearity check via correlation
cor(df_model[, c("percent.men.obese", "percent.women.obese")])
## percent.men.obese percent.women.obese
## percent.men.obese 1.0000000 0.8720081
## percent.women.obese 0.8720081 1.0000000
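The correlation between male and female obesity is strong (≈ 0.87) and above the 0.80 rule of thumb, so multicollinearity is a concern. It inflates the standard errors of the two obesity coefficients and makes their individual estimates, including their signs, harder to interpret, although it does not affect the overall fit of the model. Apart from this, the diagnostic plots support the validity of the regression inference.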
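With exactly two predictors, the variance inflation factor follows directly from their correlation: VIF = 1 / (1 - r^2). The sketch below applies that identity to the correlation printed in the matrix above. The names `vif_two` and `se_inflation` are hypothetical, and the common rule-of-thumb cutoffs for VIF (5 or 10) are conventions rather than strict tests.

```r
# Correlation between percent.men.obese and percent.women.obese, from the cor() output
r <- 0.8720081

# Variance inflation factor for a model with exactly two predictors
vif_two <- 1 / (1 - r^2)

# Corresponding inflation of each coefficient's standard error
se_inflation <- sqrt(vif_two)

print(round(c(vif = vif_two, se_inflation = se_inflation), 2))
```

A VIF of roughly 4.2 means the variance of each obesity coefficient is about four times what it would be if the two predictors were uncorrelated, so each standard error is about twice as large.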
The regression results show that both male and female county-level obesity percentages are statistically significant predictors of diabetes prevalence (p < 0.001), but their coefficients differ in sign. A one-percentage-point increase in female obesity is associated with a 0.37 percentage-point increase in diabetes prevalence, holding male obesity constant; a one-percentage-point increase in male obesity is associated with a 0.08 percentage-point decrease, holding female obesity constant. Because the two obesity measures are strongly correlated (r ≈ 0.87), the negative male coefficient should not be read as evidence that male obesity lowers diabetes prevalence; it may instead reflect the difficulty of separating two overlapping predictors. The model explains roughly 52% of the county-to-county variance (adjusted R² = 0.52), and the overall F-test is highly significant (F = 1717 on 2 and 3140 degrees of freedom), indicating that obesity accounts for a substantial share of the variation in diabetes rates. Limitations remain: the analysis is ecological (county averages), so we cannot infer individual-level relationships; potential confounders such as age distribution, income, and racial composition are omitted; and the linear functional form may miss interaction effects. Future work could add interaction terms between male and female obesity, include socio-economic covariates, or employ mixed-effects models that nest counties within states to account for spatial correlation.
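As a rough sketch of the first extension mentioned above, the code below compares a main-effects model with one that adds an interaction term, using a nested-model F test from anova(). It runs on simulated data (not the county data) so that it is self-contained; the simulated values, the absence of a true interaction in them, and the names `sim`, `fit_main`, `fit_int`, and `cmp` are all assumptions for illustration.

```r
set.seed(2025)

# Simulated stand-in data; the real analysis would use df_model instead
n <- 300
sim <- data.frame(percent.men.obese = rnorm(n, mean = 32, sd = 4))
sim$percent.women.obese <- 0.9 * sim$percent.men.obese + rnorm(n, mean = 1, sd = 2)
sim$diabetes_rate <- 2 + 0.3 * sim$percent.women.obese + rnorm(n, sd = 1.5)

# Main-effects model, same form as the one fitted in this project
fit_main <- lm(diabetes_rate ~ percent.men.obese + percent.women.obese, data = sim)

# `*` expands to both main effects plus their interaction
fit_int <- lm(diabetes_rate ~ percent.men.obese * percent.women.obese, data = sim)

# Nested-model F test: does the interaction term improve the fit?
cmp <- anova(fit_main, fit_int)
print(cmp)
```

On the county data, a small p-value in the second row of this table would suggest that the association between one sex's obesity rate and diabetes depends on the other's.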