How does the proportion of people with diabetes differ by U.S. state and gender?
I decided to investigate how the proportion of people with diabetes differs by U.S. state and gender using 2012 CDC data. The data set I used is called diabetes. It comes from the CDC and is available at https://www.openintro.org/data/index.php?data=diabetes.prev . I chose this data set because my family has struggled with diabetes over the years on both sides, and both my male and female relatives have been affected. I therefore wanted to investigate whether the percent of women with diabetes differs from the percent of men with diabetes. My data set has 3143 observations (one per county) and 14 variables. The variables I focus on are:
-percent.women.diabetes : the percentage of women with diabetes in a county.
-percent.men.diabetes : the percentage of men with diabetes in a county.
-State : the name of the U.S. state (or the District of Columbia) the county belongs to.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("~/Desktop/Data 101") #setting working directory
diabetes<- read_csv ("diabetes.prev.csv") # my dataset
I started my analysis by checking the first rows and the structure of the data with the 'head' and 'str' functions. I then cleaned the column names: I used 'gsub' to replace the '.' between words with underscores and 'tolower' to make all column names lowercase. Next, I checked for NA's with 'colSums(is.na(...))'. Every column returned 0, so the data has no NA's. I then used the dplyr functions select, group_by and summarise. First, I used select to keep the variables I was focused on: state, percent of women with diabetes and percent of men with diabetes. Second, I grouped by state and took the mean of the county percentages for women and for men, which gives one average value per gender for each state. Note that this is an unweighted mean of county percentages, so small and large counties count equally. Lastly, I used summarise with 'max' to get the highest state average for men and for women. The maximums are about 13.9 percent for men and 14.6 percent for women, a gap of roughly 0.7 percentage points. Alabama's average for men (13.9) matches the maximum at the displayed precision, but its average for women (14.3) is below 14.6, so the highest value for women belongs to a different state. Finally, I plotted a graph to show the average percent with diabetes by U.S. state and gender in 2012.
head(diabetes)
## # A tibble: 6 × 14
## State FIPS.Codes County num.men.diabetes percent.men.diabetes
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 Alabama 1001 Autauga County 2224 12.1
## 2 Alabama 1003 Baldwin County 8181 12.4
## 3 Alabama 1005 Barbour County 1440 12.9
## 4 Alabama 1007 Bibb County 1013 11
## 5 Alabama 1009 Blount County 2865 14
## 6 Alabama 1011 Bullock County 693 15.3
## # ℹ 9 more variables: num.women.diabetes <dbl>, percent.women.diabetes <dbl>,
## # num.men.obese <dbl>, percent.men.obese <dbl>, num.women.obese <dbl>,
## # percent.women.obese <dbl>, num.men.inactive.leisure <dbl>,
## # num.women.inactive.leisure <dbl>, percent.women.inactive.liesure <dbl>
str(diabetes)
## spc_tbl_ [3,143 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ State : chr [1:3143] "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ FIPS.Codes : num [1:3143] 1001 1003 1005 1007 1009 ...
## $ County : chr [1:3143] "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
## $ num.men.diabetes : num [1:3143] 2224 8181 1440 1013 2865 ...
## $ percent.men.diabetes : num [1:3143] 12.1 12.4 12.9 11 14 15.3 15.4 13.5 14.4 14.1 ...
## $ num.women.diabetes : num [1:3143] 2336 8017 1505 893 2975 ...
## $ percent.women.diabetes : num [1:3143] 11.6 11.3 15.7 11.3 13.9 20.2 16.5 14.2 15.6 13.1 ...
## $ num.men.obese : num [1:3143] 5910 19990 4265 3738 6954 ...
## $ percent.men.obese : num [1:3143] 31.3 29 37.7 40.2 33.5 39.9 33.7 31.5 37.8 33.9 ...
## $ num.women.obese : num [1:3143] 6274 18255 4217 3188 6834 ...
## $ percent.women.obese : num [1:3143] 30.5 24.5 44.5 40 31.3 50.2 37.8 32.5 41.5 31.6 ...
## $ num.men.inactive.leisure : num [1:3143] 4902 15650 3242 2853 5177 ...
## $ num.women.inactive.leisure : num [1:3143] 6406 20450 3587 2877 6952 ...
## $ percent.women.inactive.liesure: num [1:3143] 31.1 27.5 37.9 36.1 31.8 38.1 37.7 36.5 38.4 34.6 ...
## - attr(*, "spec")=
## .. cols(
## .. State = col_character(),
## .. FIPS.Codes = col_double(),
## .. County = col_character(),
## .. num.men.diabetes = col_double(),
## .. percent.men.diabetes = col_double(),
## .. num.women.diabetes = col_double(),
## .. percent.women.diabetes = col_double(),
## .. num.men.obese = col_double(),
## .. percent.men.obese = col_double(),
## .. num.women.obese = col_double(),
## .. percent.women.obese = col_double(),
## .. num.men.inactive.leisure = col_double(),
## .. num.women.inactive.leisure = col_double(),
## .. percent.women.inactive.liesure = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
names(diabetes) <- gsub("[(). \\-]", "_", names(diabetes)) # replace ., (), spaces and hyphens with underscores
names(diabetes) <- gsub("_$", "", names(diabetes)) # remove trailing underscore
names(diabetes) <- tolower(names(diabetes)) # lowercase
head(diabetes) #verify
## # A tibble: 6 × 14
## state fips_codes county num_men_diabetes percent_men_diabetes
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 Alabama 1001 Autauga County 2224 12.1
## 2 Alabama 1003 Baldwin County 8181 12.4
## 3 Alabama 1005 Barbour County 1440 12.9
## 4 Alabama 1007 Bibb County 1013 11
## 5 Alabama 1009 Blount County 2865 14
## 6 Alabama 1011 Bullock County 693 15.3
## # ℹ 9 more variables: num_women_diabetes <dbl>, percent_women_diabetes <dbl>,
## # num_men_obese <dbl>, percent_men_obese <dbl>, num_women_obese <dbl>,
## # percent_women_obese <dbl>, num_men_inactive_leisure <dbl>,
## # num_women_inactive_leisure <dbl>, percent_women_inactive_liesure <dbl>
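The renaming steps above can also be wrapped in a small helper so they are easy to check on a few names before applying them to the whole data set. This is only a sketch: 'tidy_names' is a hypothetical name introduced here for illustration, not a function from any package, and it writes the hyphen last in the character class instead of escaping it.

```r
# Hypothetical helper (not from a package): same three steps as above.
tidy_names <- function(x) {
  x <- gsub("[(). -]", "_", x)  # ., (), spaces and hyphens become underscores
  x <- gsub("_$", "", x)        # drop one trailing underscore, if any
  tolower(x)                    # lowercase everything
}

# Two real column names from this data set plus one made-up name
# that exercises the parentheses and the trailing underscore.
example_names <- c("FIPS.Codes", "percent.men.diabetes", "Rate (pct)")
cleaned <- tidy_names(example_names)
cleaned
```

With a helper like this, the cleaning step would become a single line such as names(diabetes) <- tidy_names(names(diabetes)).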
colSums(is.na (diabetes)) #check for NA's
## state fips_codes
## 0 0
## county num_men_diabetes
## 0 0
## percent_men_diabetes num_women_diabetes
## 0 0
## percent_women_diabetes num_men_obese
## 0 0
## percent_men_obese num_women_obese
## 0 0
## percent_women_obese num_men_inactive_leisure
## 0 0
## num_women_inactive_leisure percent_women_inactive_liesure
## 0 0
The data is free of NA’s.
selected_columns<-diabetes |>
select (state, percent_women_diabetes,percent_men_diabetes) #selecting my columns
selected_columns
## # A tibble: 3,143 × 3
## state percent_women_diabetes percent_men_diabetes
## <chr> <dbl> <dbl>
## 1 Alabama 11.6 12.1
## 2 Alabama 11.3 12.4
## 3 Alabama 15.7 12.9
## 4 Alabama 11.3 11
## 5 Alabama 13.9 14
## 6 Alabama 20.2 15.3
## 7 Alabama 16.5 15.4
## 8 Alabama 14.2 13.5
## 9 Alabama 15.6 14.4
## 10 Alabama 13.1 14.1
## # ℹ 3,133 more rows
women_and_men <-selected_columns |>
group_by(state) |>
summarise(prop_men= mean(percent_men_diabetes), prop_women= mean(percent_women_diabetes))
women_and_men
## # A tibble: 51 × 3
## state prop_men prop_women
## <chr> <dbl> <dbl>
## 1 Alabama 13.9 14.3
## 2 Alaska 7.34 7.43
## 3 Arizona 10.8 9.32
## 4 Arkansas 13.3 11.7
## 5 California 8.64 7.52
## 6 Colorado 6.82 5.78
## 7 Connecticut 9.01 7.65
## 8 Delaware 11.3 9.93
## 9 District of Columbia 7.9 8.3
## 10 Florida 12.4 10.9
## # ℹ 41 more rows
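Because the step above takes a plain mean of county percentages, a county with a few thousand residents counts as much as one with hundreds of thousands. As a rough sketch of an alternative, the state figure can be weighted by county size. The example below uses the six Alabama counties printed by head() earlier. It rests on an assumption that is not confirmed by the data documentation: that each percentage is approximately the count divided by the county's population of that gender, so the population can be backed out as count / (percent / 100). If the CDC percentages are age-adjusted, these implied populations are only approximate.

```r
library(dplyr)

# Six Alabama counties as printed by head(diabetes) above.
alabama_head <- tibble(
  county = c("Autauga", "Baldwin", "Barbour", "Bibb", "Blount", "Bullock"),
  num_men_diabetes = c(2224, 8181, 1440, 1013, 2865, 693),
  percent_men_diabetes = c(12.1, 12.4, 12.9, 11, 14, 15.3)
)

men_summary <- alabama_head |>
  # ASSUMPTION: percent ~ 100 * count / population, so population ~ count / (percent / 100)
  mutate(implied_men = num_men_diabetes / (percent_men_diabetes / 100)) |>
  summarise(
    unweighted = mean(percent_men_diabetes),                          # what the analysis above uses
    weighted   = 100 * sum(num_men_diabetes) / sum(implied_men)       # counties weighted by size
  )
men_summary
```

In this small example the weighted figure comes out lower than the unweighted one, because the largest county (Baldwin) has a below-average percentage. The same pattern may or may not hold for full states.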
max_men_and_women2<-women_and_men |>
summarise(max_men= max(prop_men), max_women= max(prop_women))
max_men_and_women2
## # A tibble: 1 × 2
## max_men max_women
## <dbl> <dbl>
## 1 13.9 14.6
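The highest state averages are about 13.9 percent for men and 14.6 percent for women, a gap of roughly 0.7 percentage points. Alabama's average for men (13.9) matches the maximum at the displayed precision, but its average for women (14.3) is below 14.6, so a different state has the highest average for women. 'max' reports only the value, not the state it belongs to.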
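Since max() returns only the value, one way to see which state a maximum belongs to is dplyr's slice_max(), which keeps the whole row. The sketch below runs on the ten rows of women_and_men printed earlier (values as displayed, so rounded), not on the full 51-row table. Within these ten rows Alabama is on top for both genders, but on the full table the result for women should differ, because the overall maximum for women (14.6) is higher than Alabama's 14.3.

```r
library(dplyr)

# First ten rows of women_and_men as printed above (rounded as displayed).
women_and_men_head <- tibble(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California",
            "Colorado", "Connecticut", "Delaware", "District of Columbia", "Florida"),
  prop_men   = c(13.9, 7.34, 10.8, 13.3, 8.64, 6.82, 9.01, 11.3, 7.9, 12.4),
  prop_women = c(14.3, 7.43, 9.32, 11.7, 7.52, 5.78, 7.65, 9.93, 8.3, 10.9)
)

# Keep the row with the largest value, so the state name comes along with it.
top_men <- women_and_men_head |>
  slice_max(prop_men, n = 1) |>
  select(state, prop_men)

top_women <- women_and_men_head |>
  slice_max(prop_women, n = 1) |>
  select(state, prop_women)

top_men
top_women
```

Replacing women_and_men_head with the full women_and_men table would give the states behind the 13.9 and 14.6 maximums.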
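# Two separate geom_bar layers are not dodged against each other, so the bars
# overlap. Reshaping to one row per state and gender lets a single layer dodge.
women_and_men |>
  pivot_longer(c(prop_men, prop_women), names_to = "gender", values_to = "avg_percent") |>
  mutate(gender = if_else(gender == "prop_men", "Men", "Women")) |>
  ggplot(aes(x = state, y = avg_percent, fill = gender)) +
  geom_col(position = "dodge") + # side-by-side bars for men and women
  labs(title = "Average Diabetes Prevalence by Gender and State (2012)",
       x = "State",
       y = "Average Percent with Diabetes",
       fill = "Gender") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, size = 7))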
The investigation showed that the average percentages of women and men with diabetes differ across U.S. states in 2012, but only slightly.
Within each state the averages for men and women are very similar. In the ten states printed above, men have the higher average in seven (Arizona, Arkansas, California, Colorado, Connecticut, Delaware and Florida) by roughly 1 to 1.6 percentage points, while women are slightly higher in Alabama, Alaska and the District of Columbia. The single highest state average, however, is for women (14.6 versus 13.9 for men). This suggests that women and men have comparable rates of diabetes in America and that gender does not appear to make a large difference. No formal test was run here, so this is a descriptive finding only.
A potential next step is to find a more recent data set that covers the entire U.S. and check whether I get the same results for the difference between men and women. Additionally, I could make another graph, or fit a model, that uses other factors in the data set, such as obesity or leisure-time inactivity, as predictors of diabetes for each gender.
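The comparison between men and women can be made more concrete by computing the gap for each state and, as one possible check, running a paired t-test on the state averages. The sketch below again uses only the ten rows of women_and_men printed earlier (rounded as displayed), so its numbers describe those ten states and not the whole country. Treating state averages as paired observations is an assumption made here for illustration, not something the original analysis did.

```r
library(dplyr)

# First ten rows of women_and_men as printed above (rounded as displayed).
women_and_men_head <- tibble(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California",
            "Colorado", "Connecticut", "Delaware", "District of Columbia", "Florida"),
  prop_men   = c(13.9, 7.34, 10.8, 13.3, 8.64, 6.82, 9.01, 11.3, 7.9, 12.4),
  prop_women = c(14.3, 7.43, 9.32, 11.7, 7.52, 5.78, 7.65, 9.93, 8.3, 10.9)
)

gaps <- women_and_men_head |>
  mutate(
    gap    = prop_men - prop_women,              # positive: men higher
    higher = if_else(gap > 0, "Men", "Women")
  )

# How many of these states have men higher, and how many women higher?
higher_counts <- count(gaps, higher)
higher_counts

# Paired comparison: each state contributes one men/women pair.
paired_result <- t.test(gaps$prop_men, gaps$prop_women, paired = TRUE)
paired_result
```

Running the same code on the full women_and_men table would show whether the pattern in these ten states holds across all 51 rows.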