Statewise Analysis of Diabetes Prevalence and Related Factors
Author
Balemlay Azimeraw
Introduction
I am using the diabete.prev.csv dataset. The variables are state, flips_ code( Federal Information Processing Standards (FIPS) codes,), county, num_men_diabetes, percent_ men_diabetes, num_women_ diabetes, percent_ women_diabetes, num_ men_obese, percent_men_obese, num_women_ obese, percent_women_obese, num_men_inactive_leisure, num_women_inactive_leisure, percent_women_inactive_liesure. I am going to explore the correlation between the average percent of women inactive leisure and the average percent of women with diabetes. The other thing I want to see is the correlation between the average percent of men with diabetes and the average percent of men obese. The source is the CDC.
These are the packages that I used for this project
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggfortify)
Warning: package 'ggfortify' was built under R version 4.4.3
library(ggrepel)
Warning: package 'ggrepel' was built under R version 4.4.3
library(plotly)
Warning: package 'plotly' was built under R version 4.4.3
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(scales)
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
library(highcharter)
Warning: package 'highcharter' was built under R version 4.4.3
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Rows: 3143 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): State, County
dbl (12): FIPS.Codes, num.men.diabetes, percent.men.diabetes, num.women.diab...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Cleaning the data.
#!is.na() is checking in every column if there is a na value.diabete1 <-!is.na(diabete$percent.men.diabetes) &!is.na(diabete$State) &!is.na(diabete$percent.women.diabetes) &!is.na(diabete$percent.men.obese)&!is.na(diabete$percent.women.inactive.liesure)
Changing capital letter to small letter and changing full stop to underscore
# this code change capital letter of colum name to lower lettercolnames(diabete) <-tolower(colnames(diabete))# Replace full stops (periods) with underscorescolnames(diabete) <-gsub("\\.", "_", colnames(diabete))
Grouping state into four regions
# this code is grouping stat in to four regionsnortheast_states <-c("Maine", "New Hampshire", "Vermont", "Massachusetts", "Rhode Island", "Connecticut", "New York", "New Jersey", "Pennsylvania")midwest_states <-c("Ohio", "Indiana", "Illinois", "Michigan", "Wisconsin", "Missouri", "Iowa", "Minnesota", "North Dakota", "South Dakota", "Nebraska", "Kansas")south_states <-c("Delaware", "Maryland", "Virginia", "North Carolina", "South Carolina", "Georgia", "Florida", "Alabama", "Mississippi", "Tennessee", "Kentucky", "Arkansas", "Louisiana", "Oklahoma", "Texas", "West Virginia")west_states <-c("Montana", "Wyoming", "Colorado", "New Mexico", "Idaho", "Utah", "Arizona", "Nevada", "Washington", "Oregon", "California", "Alaska", "Hawaii")
Create a region column based on the state
# Create a region column based on the statediabete$region <-ifelse(diabete$state %in% northeast_states, "Northeast",ifelse(diabete$state %in% midwest_states, "Midwest",ifelse(diabete$state %in% south_states, "South", "West")))head(diabete)
# A tibble: 6 × 15
state fips_codes county num_men_diabetes percent_men_diabetes
<chr> <dbl> <chr> <dbl> <dbl>
1 Alabama 1001 Autauga County 2224 12.1
2 Alabama 1003 Baldwin County 8181 12.4
3 Alabama 1005 Barbour County 1440 12.9
4 Alabama 1007 Bibb County 1013 11
5 Alabama 1009 Blount County 2865 14
6 Alabama 1011 Bullock County 693 15.3
# ℹ 10 more variables: num_women_diabetes <dbl>, percent_women_diabetes <dbl>,
# num_men_obese <dbl>, percent_men_obese <dbl>, num_women_obese <dbl>,
# percent_women_obese <dbl>, num_men_inactive_leisure <dbl>,
# num_women_inactive_leisure <dbl>, percent_women_inactive_liesure <dbl>,
# region <chr>
Grouping and summarizing data
# Aggregate the data by state and regionaggregated_data <- diabete %>%group_by(state, region) %>%# grouping by state and regionsummarise(avg_percent_men_diabetes =mean(percent_men_diabetes),avg_percent_men_obese =mean(percent_men_obese),avg_percent_women_diabetes =mean(percent_women_diabetes),avg_percent_women_inactive_liesure =mean(percent_women_inactive_liesure) ) # summarise by making their mean
`summarise()` has grouped output by 'state'. You can override using the
`.groups` argument.
# View the aggregated dataaggregated_data
# A tibble: 51 × 6
# Groups: state [51]
state region avg_percent_men_diabetes avg_percent_men_obese
<chr> <chr> <dbl> <dbl>
1 Alabama South 13.9 34.3
2 Alaska West 7.34 30.3
3 Arizona West 10.8 27.5
4 Arkansas South 13.3 35.3
5 California West 8.64 25.1
6 Colorado West 6.82 21.4
7 Connecticut Northeast 9.01 27.0
8 Delaware South 11.3 30.4
9 District of Columbia West 7.9 18.9
10 Florida South 12.4 31.2
# ℹ 41 more rows
# ℹ 2 more variables: avg_percent_women_diabetes <dbl>,
# avg_percent_women_inactive_liesure <dbl>
Here is the code of the first graph
plot1 <-ggplot(aggregated_data,aes(x = avg_percent_women_inactive_liesure,y = avg_percent_women_diabetes)) +# this code tell what variable will be in the x - axis and in the y - axis to plot the graph.geom_point(size =4, color ="blue") +# Adds the scatter plot points with blue colorgeom_smooth(method ="lm", se =FALSE, linetype ="dotdash", size =0.4, color ="red") +# Adds a red linear regression line without the confidence interval shadinglabs(title ="Women Inactive Leisure vs Women Diabetes in Each State", # Adds a title tothe plot x ="Average Percent of Women Inactive Leisure", # Label for the x-axisy ="Average Percent of Women with Diabetes", # Label for the y-axiscaption ="Source: CDC"# Caption at the bottom of the plot ) +# this tell title,theme_minimal(base_size =14) # Minimal theme with larger text size
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
# Display the plotplot1
`geom_smooth()` using formula = 'y ~ x'
Calculating correlation
# this code calculate correlation beteen average percent of women inactive liesure and average percent women diabetescor(aggregated_data$avg_percent_women_inactive_liesure,aggregated_data$avg_percent_women_diabetes)
[1] 0.8657397
Fit the linear regression model and summarize the result of the linear model
# Fit a linear regression model to predict the average percent of women with diabetes# based on the average percent of women who are inactive in leisure activitiesfit1 <-lm(avg_percent_women_diabetes ~ avg_percent_women_inactive_liesure, data = aggregated_data)# Summarize the results of the linear model# The summary will provide important statistical information like coefficients, R-squared, and p-valuessummary(fit1)
Call:
lm(formula = avg_percent_women_diabetes ~ avg_percent_women_inactive_liesure,
data = aggregated_data)
Residuals:
Min 1Q Median 3Q Max
-1.67603 -0.72202 -0.08543 0.82128 2.38441
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.00202 0.72723 1.378 0.175
avg_percent_women_inactive_liesure 0.32558 0.02689 12.108 2.43e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9957 on 49 degrees of freedom
Multiple R-squared: 0.7495, Adjusted R-squared: 0.7444
F-statistic: 146.6 on 1 and 49 DF, p-value: 2.427e-16
This is the analysis of summary
the model equation is avg_percent_women_diabetes = 0.326(avg_percent_women_inactive_liesure) + 1.002 The p-value of 2.43e-16 is extremely small (close to zero). It is less than the common significance threshold of 0.05. This means we can reject the null hypothesis and conclude that there is strong evidence to suggest that avg_percent_women_inactive_liesure is statistically significantly related to avg_percent_women_diabetes. Adjusted R-squared measures the model’s goodness of fit. The Adjusted R-squared value is 0.7444, which means that about 74.44% of the variation in the percentage of women with diabetes can be explained by the percentage of women who are inactive during leisure time. High Adjusted R-squared (above 0.7) generally indicates that the model fits the data well and that the predictor variable (avg_percent_women_inactive_liesure) is a good predictor of the outcome variable (avg_percent_women_diabetes).
Diagnostic plots
autoplot(fit1, 1:4, nrow=2, ncol=2) # # Generate and arrange 4 diagnostic plots for the linear model (fit1) in a 2x2 grid
Analysis of diagnostic plot
1, Residuals are scattered around zero, but there is some curvature in the pattern. This suggests the model may not be fully capturing the relationship between variables, indicating possible non-linearity. Points 38 and 41 suggest they may be influential data 2, Normal Q-Q indicates that the residuals are approximately normal but might have some skewness or heavy tails. 3, Scale-Location Plot is suggesting potential heteroscedasticity. Point 38 and 41 reinforce their influence on model 4, Cooks Distance Plot is suggested observation38 and 43 have significant higher cooks distances than other points.
Here is the code of the second graph
region_colors <-c("Northeast"="black", # Assigning blue color to Northeast region"Midwest"="blue", # Assigning green color to Midwest region"South"="red", # Assigning red color to South region"West"="green") # Assigning purple color to West region# Create the plotplot2 <-ggplot(aggregated_data, # ggplot is being used to create the plotaes(x = avg_percent_men_obese, # Set x-axis as average percent of men with obesey = avg_percent_men_diabetes, # Set y-axis as average percent of men with diabetescolor = region, # Color the points based on the 'region' variable )) +# Add diabetes % to the tooltipgeom_point(size =4) +# Adds the scatter plot points with a size of 4geom_smooth(method ="lm", formula = y ~ x, se =FALSE, linetype ="dotdash", size =0.6, color ="red") +geom_text_repel(aes(label = state, color ="blue",max.overlaps =5 ), nudge_x =0.005)+# add text to each point # Adds a linear regression line (lm) with no confidence intervals (se = FALSE)scale_color_manual(values = region_colors) +# Manually set the colors for each region based on the 'region_colors' vectorlabs(title ="Average Percent of Men with Diabetes vs Average Percent of Men Obese in Each State", # Set plot titlex ="Average Percent of Men Obese", # Set x-axis labely ="Average Percent of Men with Diabetes", # Set y-axis labelcaption ="Source: CDC")+# Set the caption at the bottom of the plottheme_minimal(base_size =10) # Use a minimal theme with a base text size of 14
Warning in geom_text_repel(aes(label = state, color = "blue", max.overlaps =
5), : Ignoring unknown aesthetics: max.overlaps
# Display the plotplot2
Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
Essay
For cleaning, I use different techniques. The first one is, !is. na(data$column name). This code helped me to see if there is a missing value in each column. I cleaned state, percent of men diabetes, percent of women diabetes, percent of men obese, and percent of women inactive leisure. I got it all true; that means there is no missing value on those columns. The second one that I use is colnames(diabete) <- tolower(colnames(diabete)) to change some of the column from capital to lower letter. For example, state, county, fips_code. Lastly, I use colnames(diabete) <- gsub(“\.”, “_”, colnames(diabete)) to change full stop into underscore for the column name. The first paragraph shows a correlation between the average percent of women inactive leisure and average percent of women with diabetes. On the graph, I see that they have a positive relationship. I got 0.866 correlation, which is close to 1, and it shows a strong correlation between them. From the second graph, I see that there is a positive correlation between the average percent of obese men and the average percent of men with diabetes. When the average percent of obesity increases, the average percent of men with diabetes also increases. I was doing to see which gender has the highest average percent of diabetes in each state, but I ended up with only two categorical colors by gender. Because you are looking at least three colors, I did not include it here.