
---
title: "Project 3"
author: "Hadiyah Sumter"
date: "2025-11-22"
output: html_document
---

setwd("~/Desktop/DATA-101")
diabetes <- read.csv("diabetes.prev.csv")
# Inspect the dataset
head(diabetes)

##     State FIPS.Codes         County num.men.diabetes percent.men.diabetes
## 1 Alabama       1001 Autauga County             2224                 12.1
## 2 Alabama       1003 Baldwin County             8181                 12.4
## 3 Alabama       1005 Barbour County             1440                 12.9
## 4 Alabama       1007    Bibb County             1013                 11.0
## 5 Alabama       1009  Blount County             2865                 14.0
## 6 Alabama       1011 Bullock County              693                 15.3
##   num.women.diabetes percent.women.diabetes num.men.obese percent.men.obese
## 1               2336                   11.6          5910              31.3
## 2               8017                   11.3         19990              29.0
## 3               1505                   15.7          4265              37.7
## 4                893                   11.3          3738              40.2
## 5               2975                   13.9          6954              33.5
## 6                743                   20.2          1822              39.9
##   num.women.obese percent.women.obese num.men.inactive.leisure
## 1            6274                30.5                     4902
## 2           18255                24.5                    15650
## 3            4217                44.5                     3242
## 4            3188                40.0                     2853
## 5            6834                31.3                     5177
## 6            1829                50.2                     1331
##   num.women.inactive.leisure percent.women.inactive.liesure
## 1                       6406                           31.1
## 2                      20450                           27.5
## 3                       3587                           37.9
## 4                       2877                           36.1
## 5                       6952                           31.8
## 6                       1387                           38.1

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)

Introduction:

Does the percentage of obese men and women in a U.S. county predict the prevalence of diabetes in that county? Diabetes is a major public health concern in the United States, and understanding its relationship with obesity can help identify counties at greater health risk. This study examines whether county-level obesity percentages for adult men and women are significant predictors of diabetes prevalence. The analysis uses the diabetes.prev dataset, which includes 3,143 observations, each representing a U.S. county. The dataset contains 14 variables: county identifiers plus counts and percentages for diabetes, obesity, and leisure-time physical inactivity. This project focuses on four of them: percent.men.diabetes, percent.women.diabetes, percent.men.obese, and percent.women.obese. These continuous, percentage-based variables make it possible to evaluate how obesity rates relate to diabetes levels at the population level.

The dataset used in this project provides county-level public health estimates and is suitable for regression modeling because of its large sample size and numeric variables. Since the outcome of interest, diabetes prevalence, is continuous, multiple linear regression is an appropriate modeling approach. The data come from OpenIntro, a public educational resource that provides cleaned and well-documented datasets for statistical analysis. The diabetes.prev dataset can be accessed through the following link: https://www.openintro.org/data/index.php?data=diabetes.prev

Data Analysis:

In this section, I prepare the dataset for regression by selecting the relevant variables, checking for missing values, and creating an overall diabetes rate to use as the outcome variable. I begin by exploring the dataset with str() and summary statistics to understand the distribution of obesity and diabetes percentages across U.S. counties. Then, I use dplyr functions such as filter(), select(), mutate(), and summarise() to clean and organize the data, ensuring that the variables are properly formatted for analysis. I also create exploratory visualizations, including scatterplots with trend lines, to examine the relationships between obesity rates and diabetes prevalence. This exploratory data analysis helps identify patterns, evaluate assumptions, and confirm that multiple linear regression is an appropriate modeling approach.
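
The missing-value check described above can be made explicit before any rows are filtered. The following is a minimal sketch on a small made-up data frame (`toy` is hypothetical and is not the county data); the same two calls could be applied to `diabetes` with the four modeling columns.

```r
# Toy stand-in for the county table: values and NAs are made up for illustration
toy <- data.frame(
  percent.men.diabetes   = c(12.1, NA, 12.9, 11.0),
  percent.women.diabetes = c(11.6, 11.3, 15.7, 11.3),
  percent.men.obese      = c(31.3, 29.0, 37.7, NA),
  percent.women.obese    = c(30.5, 24.5, 44.5, 40.0)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Number of rows that would survive a complete-case filter
n_complete <- sum(complete.cases(toy))
n_complete
```

Comparing `n_complete` with `nrow()` of the original table shows how many counties a complete-case filter would drop.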

# EDA step: check structure of dataset
str(diabetes)

## 'data.frame':    3143 obs. of  14 variables:
##  $ State                         : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ FIPS.Codes                    : int  1001 1003 1005 1007 1009 1011 1013 1015 1017 1019 ...
##  $ County                        : chr  "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
##  $ num.men.diabetes              : int  2224 8181 1440 1013 2865 693 1064 5589 1728 1371 ...
##  $ percent.men.diabetes          : num  12.1 12.4 12.9 11 14 15.3 15.4 13.5 14.4 14.1 ...
##  $ num.women.diabetes            : int  2336 8017 1505 893 2975 743 1400 6557 2132 1325 ...
##  $ percent.women.diabetes        : num  11.6 11.3 15.7 11.3 13.9 20.2 16.5 14.2 15.6 13.1 ...
##  $ num.men.obese                 : int  5910 19990 4265 3738 6954 1822 2327 13013 4574 3355 ...
##  $ percent.men.obese             : num  31.3 29 37.7 40.2 33.5 39.9 33.7 31.5 37.8 33.9 ...
##  $ num.women.obese               : int  6274 18255 4217 3188 6834 1829 3187 15094 5727 3216 ...
##  $ percent.women.obese           : num  30.5 24.5 44.5 40 31.3 50.2 37.8 32.5 41.5 31.6 ...
##  $ num.men.inactive.leisure      : int  4902 15650 3242 2853 5177 1331 2096 12540 3716 2704 ...
##  $ num.women.inactive.leisure    : int  6406 20450 3587 2877 6952 1387 3175 16930 5301 3520 ...
##  $ percent.women.inactive.liesure: num  31.1 27.5 37.9 36.1 31.8 38.1 37.7 36.5 38.4 34.6 ...

# EDA step: summary statistics to explore the data
summary(diabetes)

##     State             FIPS.Codes       County          num.men.diabetes
##  Length:3143        Min.   : 1001   Length:3143        Min.   :     4  
##  Class :character   1st Qu.:18178   Class :character   1st Qu.:   475  
##  Mode  :character   Median :29177   Mode  :character   Median :  1103  
##                     Mean   :30390                      Mean   :  3470  
##                     3rd Qu.:45082                      3rd Qu.:  2670  
##                     Max.   :56045                      Max.   :270967  
##  percent.men.diabetes num.women.diabetes percent.women.diabetes
##  Min.   : 3.70        Min.   :     3     Min.   : 2.80         
##  1st Qu.: 9.75        1st Qu.:   433     1st Qu.: 8.60         
##  Median :11.20        Median :  1014     Median :10.00         
##  Mean   :11.19        Mean   :  3364     Mean   :10.24         
##  3rd Qu.:12.70        3rd Qu.:  2510     3rd Qu.:11.70         
##  Max.   :17.70        Max.   :276397     Max.   :21.10         
##  num.men.obese    percent.men.obese num.women.obese  percent.women.obese
##  Min.   :    11   Min.   :11.30     Min.   :     8   Min.   : 9.90      
##  1st Qu.:  1344   1st Qu.:29.70     1st Qu.:  1267   1st Qu.:27.20      
##  Median :  3125   Median :32.20     Median :  3061   Median :30.10      
##  Mean   :  9998   Mean   :31.71     Mean   : 10154   Mean   :30.18      
##  3rd Qu.:  7907   3rd Qu.:34.30     3rd Qu.:  7774   3rd Qu.:33.10      
##  Max.   :765328   Max.   :43.40     Max.   :771584   Max.   :52.60      
##  num.men.inactive.leisure num.women.inactive.leisure
##  Min.   :     8           Min.   :     7            
##  1st Qu.:  1134           1st Qu.:  1210            
##  Median :  2577           Median :  2893            
##  Mean   :  7659           Mean   :  9336            
##  3rd Qu.:  6162           3rd Qu.:  7096            
##  Max.   :559500           Max.   :707300            
##  percent.women.inactive.liesure
##  Min.   : 9.6                  
##  1st Qu.:24.6                  
##  Median :28.4                  
##  Mean   :28.4                  
##  3rd Qu.:32.3                  
##  Max.   :43.2

# Clean data
diabetes_clean <- diabetes |>
  # fix the spelling mistake in the original file
  rename(percent.women.inactive.leisure = percent.women.inactive.liesure) |>
  # keep only counties that have both diabetes and obesity data
  filter(complete.cases(percent.men.diabetes, percent.women.diabetes,
                        percent.men.obese,  percent.women.obese))

# Select only variables needed for modeling 
df_model <- diabetes_clean |>
  select(percent.men.diabetes,
         percent.women.diabetes,
         percent.men.obese,
         percent.women.obese)


# Create new variable: overall diabetes rate
# (unweighted mean of the men's and women's percentages)
df_model <- df_model |>
  mutate(diabetes_rate = (percent.men.diabetes + percent.women.diabetes) / 2)

# Descriptive statistics using summarise() 
df_summary <- df_model |>
  summarise(
    mean_diabetes = mean(diabetes_rate, na.rm = TRUE),
    mean_men_obese = mean(percent.men.obese, na.rm = TRUE),
    mean_women_obese = mean(percent.women.obese, na.rm = TRUE)
  )

df_summary

##   mean_diabetes mean_men_obese mean_women_obese
## 1      10.71589       31.70589         30.18291

# Male & Female obesity in one figure 
df_model |>
  pivot_longer(cols = c(percent.men.obese, percent.women.obese),
               names_to = "sex",
               values_to = "obesity_rate") |>
  mutate(sex = recode(sex, 
                      percent.men.obese   = "Men",
                      percent.women.obese = "Women")) |>
  ggplot(aes(x = obesity_rate, y = diabetes_rate)) +
  geom_point(alpha = .6) +
  geom_smooth(method = "lm") +
  facet_wrap(~ sex) +
  labs(title = "Obesity vs Diabetes Rate by Sex",
       x = "Percent Obese",
       y = "Diabetes Rate (%)")

## `geom_smooth()` using formula = 'y ~ x'

Regression Analysis:

To examine whether obesity rates in men and women predict diabetes prevalence across U.S. counties, I fit a multiple linear regression model using the lm() function. The outcome variable, diabetes_rate, represents the average diabetes percentage for adult men and women in each county. The predictors included in the model are percent.men.obese and percent.women.obese, which reflect obesity prevalence for men and women. The regression summary provides the estimated coefficients, standard errors, t-values, and p-values for each predictor. These statistics indicate how strongly each obesity variable is related to diabetes prevalence and whether the relationships are statistically significant.

The coefficients from the regression model show how diabetes prevalence changes when one obesity percentage increases by one point while the other is held constant. A positive coefficient means that diabetes prevalence tends to rise with that obesity rate; a negative coefficient means it tends to fall. A p-value below 0.05 for a predictor suggests a statistically significant relationship. Because each coefficient is a partial effect, its sign and size can differ from the simple pairwise relationship seen in a scatterplot when the two predictors are correlated with each other. Comparing the sizes of the coefficients helps determine whether male obesity or female obesity has the stronger association with diabetes prevalence. Overall, the model results help evaluate how obesity relates to diabetes patterns across U.S. counties.

model <- lm(diabetes_rate ~ percent.men.obese + percent.women.obese,
            data = df_model)

summary(model)

## 
## Call:
## lm(formula = diabetes_rate ~ percent.men.obese + percent.women.obese, 
##     data = df_model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3428 -1.0464 -0.0287  0.9840  5.6168 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.97710    0.22432   8.814  < 2e-16 ***
## percent.men.obese   -0.07509    0.01395  -5.381 7.95e-08 ***
## percent.women.obese  0.36841    0.01108  33.253  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.558 on 3140 degrees of freedom
## Multiple R-squared:  0.5223, Adjusted R-squared:  0.522 
## F-statistic:  1717 on 2 and 3140 DF,  p-value: < 2.2e-16
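
Reading the estimates off the output above, the fitted equation is approximately diabetes_rate = 1.977 - 0.075 * percent.men.obese + 0.368 * percent.women.obese. As a worked example, the sketch below plugs in a hypothetical county where 30% of men and 30% of women are obese; the county values are assumptions chosen for illustration, and the coefficients are copied by hand from the printed summary.

```r
# Coefficients copied from the summary(model) output above
b0      <- 1.97710   # intercept
b_men   <- -0.07509  # percent.men.obese
b_women <- 0.36841   # percent.women.obese

# Hypothetical county (illustrative values, not a row from the data)
men_obese   <- 30
women_obese <- 30

# Predicted average diabetes rate, in percent
predicted_rate <- b0 + b_men * men_obese + b_women * women_obese
predicted_rate  # about 10.78
```

With the fitted object available, `predict(model, newdata = data.frame(percent.men.obese = 30, percent.women.obese = 30))` should give the same value up to rounding of the copied coefficients.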

Model Assumptions and Diagnostics:

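I check five standard assumptions for the multiple linear regression model diabetes_rate ~ percent.men.obese + percent.women.obese: linearity, homoscedasticity, normality of residuals, independence, and the absence of unduly influential observations.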

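# Same specification as `model` above, refit under a new name for the diagnostics
diabetes_lm <- lm(diabetes_rate ~ percent.men.obese + percent.women.obese,
                  data = df_model)

# Show the four base-R diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(diabetes_lm)

# Restore the default single-panel layout
par(mfrow = c(1, 1))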

Linearity: Residuals-vs-Fitted (top-left) is a roughly horizontal band with no curve, so the linear functional form is adequate.

Homoscedasticity: Scale-Location plot (bottom-left) shows only a mild widening toward the right; the spread is acceptably constant for county-level data.

Normality: Normal Q-Q plot (top-right) points stay close to the diagonal; tails deviate only slightly, so residual normality is reasonable.

Independence: Counties are separate administrative units and the residual series displays no systematic drift, so independence is assumed.

Influence: Residuals-vs-Leverage (bottom-right) reveals no Cook's-distance points above the 0.5 contour; no single county exerts undue influence.
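
The influence check can also be done numerically rather than by eye. The sketch below is only an illustration on synthetic data (the `sim` data frame and its generating values are assumptions, not the county data); on the real model the same two lines would be `cooks.distance(diabetes_lm)` and the `which()` call.

```r
set.seed(1)

# Synthetic stand-in for df_model: two correlated predictors and a noisy outcome
n  <- 200
x1 <- rnorm(n, mean = 32, sd = 4)
x2 <- 0.9 * x1 + rnorm(n, sd = 2)
y  <- 2 + 0.3 * x2 + rnorm(n, sd = 1.5)
sim <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = sim)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# Observations beyond the 0.5 contour drawn in the Residuals-vs-Leverage plot
flagged <- which(cd > 0.5)

max(cd)          # largest Cook's distance in the fit
length(flagged)  # number of observations above the 0.5 threshold
```

An empty `flagged` vector is the numeric counterpart of seeing no points outside the 0.5 contour.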

# Quick multicollinearity check via correlation
cor(df_model[, c("percent.men.obese", "percent.women.obese")])

##                     percent.men.obese percent.women.obese
## percent.men.obese           1.0000000           0.8720081
## percent.women.obese         0.8720081           1.0000000

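The correlation between male and female obesity is strong (≈ 0.87), above the 0.80 rule of thumb, so multicollinearity is a concern. The individual coefficients, and in particular the negative sign on male obesity, should therefore be interpreted with caution, although the overall fit and the F-test are not affected. In other respects, the diagnostic plots support the validity of the regression inference.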
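
A variance inflation factor (VIF) gives a second view of the same question. With exactly two predictors, the VIF of each one equals 1 / (1 - r^2), where r is their correlation, so it can be computed directly from the matrix printed above. This is a small sketch that copies r by hand; a packaged alternative such as `car::vif(model)` would require the car package, which this report does not load.

```r
# Correlation between the two predictors, copied from the matrix above
r <- 0.8720081

# With exactly two predictors, each VIF equals 1 / (1 - r^2)
vif <- 1 / (1 - r^2)
vif  # about 4.17
```

A VIF near 4 means the variance of each slope estimate is roughly four times what it would be if the two obesity rates were uncorrelated.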

Conclusion & Future Directions:

The regression results show that both male and female county-level obesity percentages are statistically significant predictors of diabetes prevalence (p < 0.001), but their coefficients have opposite signs. A one-percentage-point increase in female obesity is associated with a 0.37 percentage-point increase in diabetes prevalence, holding male obesity constant; a one-percentage-point increase in male obesity is associated with a 0.08 percentage-point decrease, holding female obesity constant. Because the two obesity rates are strongly correlated (r ≈ 0.87), the negative male coefficient should not be read as evidence that male obesity lowers diabetes rates; it describes only the association that remains after female obesity is accounted for. The model explains roughly 52% of the county-to-county variance (adjusted R² = 0.52), and the overall F-test is highly significant (F = 1717 on 2 and 3140 degrees of freedom), indicating that obesity accounts for a substantial share of the variation in diabetes rates.

Limitations remain: the analysis is ecological (county averages), so we cannot infer individual-level relationships; potential confounders such as age distribution, income, and racial composition are omitted; and the linear functional form may miss interaction effects. Future work could add interaction terms between male and female obesity, include socio-economic covariates, or employ mixed-effects models that nest counties within states to account for spatial correlation.
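
As an illustration of the first future direction, the sketch below fits a model with an interaction term and compares it with the main-effects model using a nested F-test. It runs on synthetic data (the `sim` data frame and its column names are assumptions for illustration only), so its output says nothing about the real counties.

```r
set.seed(2)

# Synthetic stand-in for df_model
n <- 300
men_obese     <- rnorm(n, mean = 32, sd = 4)
women_obese   <- 0.9 * men_obese + rnorm(n, sd = 2)
diabetes_rate <- 2 + 0.3 * women_obese + rnorm(n, sd = 1.5)
sim <- data.frame(diabetes_rate, men_obese, women_obese)

# Main-effects model, as in the report
main_fit <- lm(diabetes_rate ~ men_obese + women_obese, data = sim)

# `*` expands to both main effects plus their interaction
int_fit <- lm(diabetes_rate ~ men_obese * women_obese, data = sim)

# Nested-model F-test: does the interaction term improve the fit?
cmp <- anova(main_fit, int_fit)
cmp
```

On the real data the corresponding formula would be `diabetes_rate ~ percent.men.obese * percent.women.obese` with `data = df_model`.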