Statewise Analysis of Diabetes Prevalence and Related Factors

Author

Balemlay Azimeraw

Introduction

I am using the diabete.prev.csv dataset. The variables are state, flips_ code( Federal Information Processing Standards (FIPS) codes,), county, num_men_diabetes, percent_ men_diabetes, num_women_ diabetes, percent_ women_diabetes, num_ men_obese, percent_men_obese, num_women_ obese, percent_women_obese, num_men_inactive_leisure, num_women_inactive_leisure, percent_women_inactive_liesure. I am going to explore the correlation between the average percent of women inactive leisure and the average percent of women with diabetes. The other thing I want to see is the correlation between the average percent of men with diabetes and the average percent of men obese. The source is the CDC.

These are the packages that I used for this project

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggfortify)
Warning: package 'ggfortify' was built under R version 4.4.3
library(ggrepel)
Warning: package 'ggrepel' was built under R version 4.4.3
library(plotly)
Warning: package 'plotly' was built under R version 4.4.3

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(scales)

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor
library(highcharter)
Warning: package 'highcharter' was built under R version 4.4.3
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
library(RColorBrewer)
setwd("C:/Users/ebale/OneDrive/Desktop/DATA110")
diabete <- read_csv("diabetes.prev .csv")
Rows: 3143 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): State, County
dbl (12): FIPS.Codes, num.men.diabetes, percent.men.diabetes, num.women.diab...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning the data.

#!is.na() is checking in every column if there is a na value.
diabete1 <- !is.na(diabete$percent.men.diabetes) & !is.na(diabete$State) & !is.na(diabete$percent.women.diabetes) & !is.na(diabete$percent.men.obese)& !is.na(diabete$percent.women.inactive.liesure) 

Changing capital letter to small letter and changing full stop to underscore

# this code change capital letter of colum name to lower letter
colnames(diabete) <- tolower(colnames(diabete))

# Replace full stops (periods) with underscores
colnames(diabete) <- gsub("\\.", "_", colnames(diabete))

Grouping state into four regions

 # this code is grouping stat in to four regions
northeast_states <- c("Maine", "New Hampshire", "Vermont", "Massachusetts", "Rhode Island", "Connecticut", "New York", "New Jersey", "Pennsylvania")
midwest_states <- c("Ohio", "Indiana", "Illinois", "Michigan", "Wisconsin", "Missouri", "Iowa", "Minnesota", "North Dakota", "South Dakota", "Nebraska", "Kansas")
south_states <- c("Delaware", "Maryland", "Virginia", "North Carolina", "South Carolina", "Georgia", "Florida", "Alabama", "Mississippi", "Tennessee", "Kentucky", "Arkansas", "Louisiana", "Oklahoma", "Texas", "West Virginia")
west_states <- c("Montana", "Wyoming", "Colorado", "New Mexico", "Idaho", "Utah", "Arizona", "Nevada", "Washington", "Oregon", "California", "Alaska", "Hawaii")

Create a region column based on the state

# Create a region column based on the state
diabete$region <- ifelse(diabete$state %in% northeast_states, "Northeast",
                                ifelse(diabete$state %in% midwest_states, "Midwest",
                                       ifelse(diabete$state %in% south_states, "South", "West")))
head(diabete)
# A tibble: 6 × 15
  state   fips_codes county         num_men_diabetes percent_men_diabetes
  <chr>        <dbl> <chr>                     <dbl>                <dbl>
1 Alabama       1001 Autauga County             2224                 12.1
2 Alabama       1003 Baldwin County             8181                 12.4
3 Alabama       1005 Barbour County             1440                 12.9
4 Alabama       1007 Bibb County                1013                 11  
5 Alabama       1009 Blount County              2865                 14  
6 Alabama       1011 Bullock County              693                 15.3
# ℹ 10 more variables: num_women_diabetes <dbl>, percent_women_diabetes <dbl>,
#   num_men_obese <dbl>, percent_men_obese <dbl>, num_women_obese <dbl>,
#   percent_women_obese <dbl>, num_men_inactive_leisure <dbl>,
#   num_women_inactive_leisure <dbl>, percent_women_inactive_liesure <dbl>,
#   region <chr>

Grouping and summarizing data

 # Aggregate the data by state and region
aggregated_data <- diabete %>%
  group_by(state, region) %>% # grouping by state and region
  summarise(
    avg_percent_men_diabetes = mean(percent_men_diabetes),
    avg_percent_men_obese = mean(percent_men_obese),
    avg_percent_women_diabetes = mean(percent_women_diabetes),
 avg_percent_women_inactive_liesure = mean(percent_women_inactive_liesure) ) # summarise by making their mean
`summarise()` has grouped output by 'state'. You can override using the
`.groups` argument.
# View the aggregated data
aggregated_data
# A tibble: 51 × 6
# Groups:   state [51]
   state                region    avg_percent_men_diabetes avg_percent_men_obese
   <chr>                <chr>                        <dbl>                 <dbl>
 1 Alabama              South                        13.9                   34.3
 2 Alaska               West                          7.34                  30.3
 3 Arizona              West                         10.8                   27.5
 4 Arkansas             South                        13.3                   35.3
 5 California           West                          8.64                  25.1
 6 Colorado             West                          6.82                  21.4
 7 Connecticut          Northeast                     9.01                  27.0
 8 Delaware             South                        11.3                   30.4
 9 District of Columbia West                          7.9                   18.9
10 Florida              South                        12.4                   31.2
# ℹ 41 more rows
# ℹ 2 more variables: avg_percent_women_diabetes <dbl>,
#   avg_percent_women_inactive_liesure <dbl>

Here is the code of the first graph

plot1 <- ggplot(aggregated_data,
       aes(x = avg_percent_women_inactive_liesure,
           y = avg_percent_women_diabetes)) + # this code tell what variable will be in the x - axis and in the y - axis to plot the graph.
  geom_point(size = 4, color = "blue") +  # Adds the scatter plot points with blue color
  geom_smooth(method = "lm", se = FALSE, linetype = "dotdash", size = 0.4, color = "red") +  # Adds a red linear regression line without the confidence interval shading
    labs(
    title = "Women Inactive Leisure vs Women Diabetes in Each State",  # Adds a title tothe plot                                                                                                                                                           
    x = "Average Percent of Women Inactive Leisure",                 # Label for the x-axis
    y = "Average Percent of Women with Diabetes",                    # Label for the y-axis
    caption = "Source: CDC"                                          # Caption at the bottom of the plot
  ) +
# this tell title,
  theme_minimal(base_size = 14)  # Minimal theme with larger text size
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
# Display the plot
plot1
`geom_smooth()` using formula = 'y ~ x'

Calculating correlation

# this code calculate correlation  beteen average percent of women inactive liesure and average percent women diabetes
cor(aggregated_data$avg_percent_women_inactive_liesure,aggregated_data$avg_percent_women_diabetes)
[1] 0.8657397

Fit the linear regression model and summarize the result of the linear model

# Fit a linear regression model to predict the average percent of women with diabetes
# based on the average percent of women who are inactive in leisure activities
fit1 <- lm(avg_percent_women_diabetes ~ avg_percent_women_inactive_liesure, data = aggregated_data)

# Summarize the results of the linear model
# The summary will provide important statistical information like coefficients, R-squared, and p-values
summary(fit1)

Call:
lm(formula = avg_percent_women_diabetes ~ avg_percent_women_inactive_liesure, 
    data = aggregated_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.67603 -0.72202 -0.08543  0.82128  2.38441 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         1.00202    0.72723   1.378    0.175    
avg_percent_women_inactive_liesure  0.32558    0.02689  12.108 2.43e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9957 on 49 degrees of freedom
Multiple R-squared:  0.7495,    Adjusted R-squared:  0.7444 
F-statistic: 146.6 on 1 and 49 DF,  p-value: 2.427e-16

This is the analysis of summary

the model equation is avg_percent_women_diabetes = 0.326(avg_percent_women_inactive_liesure) + 1.002 The p-value of 2.43e-16 is extremely small (close to zero). It is less than the common significance threshold of 0.05. This means we can reject the null hypothesis and conclude that there is strong evidence to suggest that avg_percent_women_inactive_liesure is statistically significantly related to avg_percent_women_diabetes. Adjusted R-squared measures the model’s goodness of fit. The Adjusted R-squared value is 0.7444, which means that about 74.44% of the variation in the percentage of women with diabetes can be explained by the percentage of women who are inactive during leisure time. High Adjusted R-squared (above 0.7) generally indicates that the model fits the data well and that the predictor variable (avg_percent_women_inactive_liesure) is a good predictor of the outcome variable (avg_percent_women_diabetes).

Diagnostic plots

autoplot(fit1, 1:4, nrow=2, ncol=2) # # Generate and arrange 4 diagnostic plots for the linear model (fit1) in a 2x2 grid

Analysis of diagnostic plot

1, Residuals are scattered around zero, but there is some curvature in the pattern. This suggests the model may not be fully capturing the relationship between variables, indicating possible non-linearity. Points 38 and 41 suggest they may be influential data 2, Normal Q-Q indicates that the residuals are approximately normal but might have some skewness or heavy tails. 3, Scale-Location Plot is suggesting potential heteroscedasticity. Point 38 and 41 reinforce their influence on model 4, Cooks Distance Plot is suggested observation38 and 43 have significant higher cooks distances than other points.

Here is the code of the second graph

region_colors <- c("Northeast" = "black",   # Assigning blue color to Northeast region
                   "Midwest" = "blue",   # Assigning green color to Midwest region
                   "South" = "red",       # Assigning red color to South region
                   "West" = "green")     # Assigning purple color to West region

# Create the plot
plot2 <- ggplot(aggregated_data,    # ggplot is being used to create the plot
                aes(x = avg_percent_men_obese,     # Set x-axis as average percent of men with obese
                    y = avg_percent_men_diabetes,    # Set y-axis as average percent of men with diabetes
                    color = region,                  # Color the points based on the 'region' variable
                    )) +  # Add diabetes % to the tooltip
  geom_point(size = 4) +  # Adds the scatter plot points with a size of 4
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, linetype = "dotdash", size = 0.6, color = "red") + 
   geom_text_repel(aes(label = state, color = "blue",max.overlaps = 5 ), nudge_x = 0.005)+   # add text to each point                                                                  # Adds a linear regression line (lm) with no confidence intervals (se = FALSE)
  scale_color_manual(values = region_colors) +  # Manually set the colors for each region based on the 'region_colors' vector
  labs(
    title = "Average Percent of Men with Diabetes vs Average Percent of Men Obese in Each State",  # Set plot title
    x = "Average Percent of Men Obese",  # Set x-axis label
    y = "Average Percent of Men with Diabetes",  # Set y-axis label
     caption = "Source: CDC")+ # Set the caption at the bottom of the plot
  
  theme_minimal(base_size = 10)  # Use a minimal theme with a base text size of 14
Warning in geom_text_repel(aes(label = state, color = "blue", max.overlaps =
5), : Ignoring unknown aesthetics: max.overlaps
# Display the plot
plot2
Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

Essay

For cleaning, I use different techniques. The first one is, !is. na(data$column name). This code helped me to see if there is a missing value in each column. I cleaned state, percent of men diabetes, percent of women diabetes, percent of men obese, and percent of women inactive leisure. I got it all true; that means there is no missing value on those columns. The second one that I use is colnames(diabete) <- tolower(colnames(diabete)) to change some of the column from capital to lower letter. For example, state, county, fips_code. Lastly, I use colnames(diabete) <- gsub(“\.”, “_”, colnames(diabete)) to change full stop into underscore for the column name. The first paragraph shows a correlation between the average percent of women inactive leisure and average percent of women with diabetes. On the graph, I see that they have a positive relationship. I got 0.866 correlation, which is close to 1, and it shows a strong correlation between them. From the second graph, I see that there is a positive correlation between the average percent of obese men and the average percent of men with diabetes. When the average percent of obesity increases, the average percent of men with diabetes also increases. I was doing to see which gender has the highest average percent of diabetes in each state, but I ended up with only two categorical colors by gender. Because you are looking at least three colors, I did not include it here.