Project 1 data 101

My research question

What is the difference in proportions of diabetes by U.S. state and gender?

Introduction

I decided to investigate the difference in proportions of diabetes by U.S. state and gender using 2012 CDC data. The data set I used is called diabetes.prev, and I load it below as diabetes. The data come from the CDC and are available at https://www.openintro.org/data/index.php?data=diabetes.prev. I chose this data set because my family has struggled with diabetes over the years on both sides, and both male and female relatives have been affected. Therefore, I wanted to investigate whether there is a difference between the percentage of women with diabetes and the percentage of men with diabetes. My data set has 3,143 observations and 14 variables.

The variables I chose were:

-percent.women.diabetes : the percentage of women with diabetes in a county.

-percent.men.diabetes : the percentage of men with diabetes in a county.

-State : the name of a state in the United States of America.

Loading libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Setting the working directory

setwd("~/Desktop/Data 101")                # set the working directory
diabetes <- read_csv("diabetes.prev.csv")  # load my data set

Data analysis:

I started analyzing my data by checking its first rows and structure with the ‘head’ and ‘str’ functions. I then cleaned the column names by using ‘gsub’ to replace the ‘.’ between words with underscores. Next, I used ‘tolower’ to make all the column names lowercase. After that, I checked for NA’s using ‘colSums’ together with ‘is.na’; every column had a count of zero, so the data had no NA’s.

Additionally, I used the dplyr functions select, group_by and summarise. First, I used select to keep the variables I was focused on: state, percent of women with diabetes, and percent of men with diabetes. Second, I grouped by state and took the mean of the county percentages for women and for men, so that each state has one value per gender. Lastly, I found the maximum state average for men and for women using summarise with ‘max’. The highest averages were 13.9 percent for men and 14.6 percent for women, a difference of roughly 0.7 percentage points. Alabama’s average for men matches the maximum of 13.9, but its average for women is 14.3, so the maximum for women belongs to another state. Finally, I plotted a graph to show the difference in proportions of diabetes by U.S. state and gender in 2012.

Looking at the data set

head(diabetes)

## # A tibble: 6 × 14
##   State   FIPS.Codes County         num.men.diabetes percent.men.diabetes
##   <chr>        <dbl> <chr>                     <dbl>                <dbl>
## 1 Alabama       1001 Autauga County             2224                 12.1
## 2 Alabama       1003 Baldwin County             8181                 12.4
## 3 Alabama       1005 Barbour County             1440                 12.9
## 4 Alabama       1007 Bibb County                1013                 11  
## 5 Alabama       1009 Blount County              2865                 14  
## 6 Alabama       1011 Bullock County              693                 15.3
## # ℹ 9 more variables: num.women.diabetes <dbl>, percent.women.diabetes <dbl>,
## #   num.men.obese <dbl>, percent.men.obese <dbl>, num.women.obese <dbl>,
## #   percent.women.obese <dbl>, num.men.inactive.leisure <dbl>,
## #   num.women.inactive.leisure <dbl>, percent.women.inactive.liesure <dbl>

str(diabetes)

## spc_tbl_ [3,143 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ State                         : chr [1:3143] "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ FIPS.Codes                    : num [1:3143] 1001 1003 1005 1007 1009 ...
##  $ County                        : chr [1:3143] "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
##  $ num.men.diabetes              : num [1:3143] 2224 8181 1440 1013 2865 ...
##  $ percent.men.diabetes          : num [1:3143] 12.1 12.4 12.9 11 14 15.3 15.4 13.5 14.4 14.1 ...
##  $ num.women.diabetes            : num [1:3143] 2336 8017 1505 893 2975 ...
##  $ percent.women.diabetes        : num [1:3143] 11.6 11.3 15.7 11.3 13.9 20.2 16.5 14.2 15.6 13.1 ...
##  $ num.men.obese                 : num [1:3143] 5910 19990 4265 3738 6954 ...
##  $ percent.men.obese             : num [1:3143] 31.3 29 37.7 40.2 33.5 39.9 33.7 31.5 37.8 33.9 ...
##  $ num.women.obese               : num [1:3143] 6274 18255 4217 3188 6834 ...
##  $ percent.women.obese           : num [1:3143] 30.5 24.5 44.5 40 31.3 50.2 37.8 32.5 41.5 31.6 ...
##  $ num.men.inactive.leisure      : num [1:3143] 4902 15650 3242 2853 5177 ...
##  $ num.women.inactive.leisure    : num [1:3143] 6406 20450 3587 2877 6952 ...
##  $ percent.women.inactive.liesure: num [1:3143] 31.1 27.5 37.9 36.1 31.8 38.1 37.7 36.5 38.4 34.6 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   State = col_character(),
##   ..   FIPS.Codes = col_double(),
##   ..   County = col_character(),
##   ..   num.men.diabetes = col_double(),
##   ..   percent.men.diabetes = col_double(),
##   ..   num.women.diabetes = col_double(),
##   ..   percent.women.diabetes = col_double(),
##   ..   num.men.obese = col_double(),
##   ..   percent.men.obese = col_double(),
##   ..   num.women.obese = col_double(),
##   ..   percent.women.obese = col_double(),
##   ..   num.men.inactive.leisure = col_double(),
##   ..   num.women.inactive.leisure = col_double(),
##   ..   percent.women.inactive.liesure = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Cleaning the data

names(diabetes) <- gsub("[(). \\-]", "_", names(diabetes)) # replace ., (, ), space and - with an underscore
names(diabetes) <- gsub("_$", "", names(diabetes))  # remove trailing underscore
names(diabetes) <- tolower(names(diabetes))         # lowercase

head(diabetes) #verify

## # A tibble: 6 × 14
##   state   fips_codes county         num_men_diabetes percent_men_diabetes
##   <chr>        <dbl> <chr>                     <dbl>                <dbl>
## 1 Alabama       1001 Autauga County             2224                 12.1
## 2 Alabama       1003 Baldwin County             8181                 12.4
## 3 Alabama       1005 Barbour County             1440                 12.9
## 4 Alabama       1007 Bibb County                1013                 11  
## 5 Alabama       1009 Blount County              2865                 14  
## 6 Alabama       1011 Bullock County              693                 15.3
## # ℹ 9 more variables: num_women_diabetes <dbl>, percent_women_diabetes <dbl>,
## #   num_men_obese <dbl>, percent_men_obese <dbl>, num_women_obese <dbl>,
## #   percent_women_obese <dbl>, num_men_inactive_leisure <dbl>,
## #   num_women_inactive_leisure <dbl>, percent_women_inactive_liesure <dbl>

Handling NA’s

colSums(is.na(diabetes)) # count the NA's in each column

##                          state                     fips_codes 
##                              0                              0 
##                         county               num_men_diabetes 
##                              0                              0 
##           percent_men_diabetes             num_women_diabetes 
##                              0                              0 
##         percent_women_diabetes                  num_men_obese 
##                              0                              0 
##              percent_men_obese                num_women_obese 
##                              0                              0 
##            percent_women_obese       num_men_inactive_leisure 
##                              0                              0 
##     num_women_inactive_leisure percent_women_inactive_liesure 
##                              0                              0

The data is free of NA’s.

Selecting my specific columns

selected_columns <- diabetes |>
  select(state, percent_women_diabetes, percent_men_diabetes) # selecting my columns
selected_columns

## # A tibble: 3,143 × 3
##    state   percent_women_diabetes percent_men_diabetes
##    <chr>                    <dbl>                <dbl>
##  1 Alabama                   11.6                 12.1
##  2 Alabama                   11.3                 12.4
##  3 Alabama                   15.7                 12.9
##  4 Alabama                   11.3                 11  
##  5 Alabama                   13.9                 14  
##  6 Alabama                   20.2                 15.3
##  7 Alabama                   16.5                 15.4
##  8 Alabama                   14.2                 13.5
##  9 Alabama                   15.6                 14.4
## 10 Alabama                   13.1                 14.1
## # ℹ 3,133 more rows

Grouping by state and summarizing the proportions of percent of women and men with diabetes

women_and_men <- selected_columns |>
  group_by(state) |>
  summarise(prop_men = mean(percent_men_diabetes), prop_women = mean(percent_women_diabetes))
women_and_men

## # A tibble: 51 × 3
##    state                prop_men prop_women
##    <chr>                   <dbl>      <dbl>
##  1 Alabama                 13.9       14.3 
##  2 Alaska                   7.34       7.43
##  3 Arizona                 10.8        9.32
##  4 Arkansas                13.3       11.7 
##  5 California               8.64       7.52
##  6 Colorado                 6.82       5.78
##  7 Connecticut              9.01       7.65
##  8 Delaware                11.3        9.93
##  9 District of Columbia     7.9        8.3 
## 10 Florida                 12.4       10.9 
## # ℹ 41 more rows
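To make the research question more concrete, the gap between women and men can be computed for each state. The following is only a minimal sketch: it uses the first three rows of the table above as sample data rather than all 51, and the names `sample_states`, `state_gaps` and `diff_women_minus_men` are my own, not part of the original analysis.

```r
library(dplyr)

# Sample rows copied from the state-level table above (3 of the 51 rows)
sample_states <- tibble::tibble(
  state      = c("Alabama", "Alaska", "Arizona"),
  prop_men   = c(13.9, 7.34, 10.8),
  prop_women = c(14.3, 7.43, 9.32)
)

# Positive values mean women have the higher average, negative values mean men do
state_gaps <- sample_states |>
  mutate(diff_women_minus_men = prop_women - prop_men) |>
  arrange(desc(abs(diff_women_minus_men)))  # largest gap first, in either direction
state_gaps
```

Applying the same `mutate` step to `women_and_men` would give the gap for every state, which shows both the size and the direction of the difference.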

Which state has the highest proportion of diabetes for women and men?

max_men_and_women2 <- women_and_men |>
  summarise(max_men = max(prop_men), max_women = max(prop_women))
max_men_and_women2

## # A tibble: 1 × 2
##   max_men max_women
##     <dbl>     <dbl>
## 1    13.9      14.6

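The highest state averages are 13.9 percent for men and 14.6 percent for women, a difference of roughly 0.7 percentage points. This output reports only the maximum values, not the states they belong to. Alabama’s average for men (13.9) matches the maximum at the displayed precision, but its average for women (14.3) is below 14.6, so the highest average for women belongs to a different state.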
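The summarise call above returns the maximum values but not the states they belong to. One possible way to get both is dplyr's `slice_max`, sketched below. The helper `top_state` is hypothetical (my own name), and the sample data are only the first three rows of the state-level table, so the result here is not the answer for the full data set.

```r
library(dplyr)

# Hypothetical helper: return the row(s) holding the largest value of one column
top_state <- function(data, column) {
  data |>
    slice_max({{ column }}, n = 1, with_ties = TRUE) |>  # keep ties if several states share the maximum
    select(state, {{ column }})
}

# Sample rows copied from the state-level table above (3 of the 51 rows)
sample_states <- tibble::tibble(
  state      = c("Alabama", "Alaska", "Arizona"),
  prop_men   = c(13.9, 7.34, 10.8),
  prop_women = c(14.3, 7.43, 9.32)
)

top_men   <- top_state(sample_states, prop_men)
top_women <- top_state(sample_states, prop_women)
top_men
top_women
```

On the full data, `top_state(women_and_men, prop_men)` and `top_state(women_and_men, prop_women)` would name the state behind each maximum.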

Visualization

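# position = "dodge" only separates bars within a single layer, so two geom_bar
# layers would draw the women's bars on top of the men's bars.
# Reshaping to one row per state and gender lets one layer place them side by side.
women_and_men |>
  pivot_longer(c(prop_men, prop_women), names_to = "gender", values_to = "avg_percent") |>
  mutate(gender = if_else(gender == "prop_men", "Men", "Women")) |>
  ggplot(aes(x = state, y = avg_percent, fill = gender)) +
  geom_col(position = "dodge") +
  labs(title = "Average Diabetes Prevalence by Gender and State (2012)",
       x = "State",
       y = "Average Percent with Diabetes",
       fill = "Gender") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, size = 7))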

Conclusion and Future Directions:

Key findings of the analysis

The analysis shows that the average percentages of women and men with diabetes differed across U.S. states in 2012, but only slightly.

Discussion of results

I noticed that the proportions of men and women with diabetes are very similar within each state, with only slight differences. The direction of the difference varies by state: among the ten states displayed in the state-level table, men have the higher average in seven (Arizona, Arkansas, California, Colorado, Connecticut, Delaware and Florida) and women in three (Alabama, Alaska and the District of Columbia). This means that women and men have comparable proportions of diabetes in America and that gender does not make a large difference when comparing diabetes.

Potential avenues

One potential avenue is to find a more recent data set that covers the entire U.S. and test whether I get the same results for the difference in proportions of men and women with diabetes. Another is to make an additional graph for each gender that incorporates other factors as predictors, such as obesity or physical inactivity.
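As a rough illustration of the second avenue, obesity could be tried as a predictor of diabetes for one gender. This is only a sketch under clear assumptions: it uses the first ten county values for men printed in the str output (all Alabama counties), not the full data, and the names `men_sample`, `r_men` and `fit_men` are my own. A sample this small cannot support any conclusion.

```r
# First ten county values for men, copied from the str() output above
men_sample <- data.frame(
  percent_men_diabetes = c(12.1, 12.4, 12.9, 11, 14, 15.3, 15.4, 13.5, 14.4, 14.1),
  percent_men_obese    = c(31.3, 29, 37.7, 40.2, 33.5, 39.9, 33.7, 31.5, 37.8, 33.9)
)

# Correlation between the obesity and diabetes percentages in this small sample
r_men <- cor(men_sample$percent_men_obese, men_sample$percent_men_diabetes)
r_men

# Simple linear model with obesity as the single predictor
fit_men <- lm(percent_men_diabetes ~ percent_men_obese, data = men_sample)
coef(fit_men)
```

The same two steps could be repeated with the women's columns and with all counties to compare the relationship across genders.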

References

https://www.openintro.org/data/index.php?data=diabetes.prev