A link to the data set online or a copy of the original data file: [link](https://www.nyc.gov/site/doh/data/data-sets/community-health-survey-public-use-data.page)

A link to the codebook online or a copy of the codebook: [link](https://www.nyc.gov/assets/doh/downloads/pdf/episrv/chs2020-codebook.pdf)

Explanation of the research question

What is the purpose of the research?

diabetes20 (Have you ever been told by a doctor, nurse or other health professional that you have diabetes?) 1) Yes; 2) No

What are the predictors?

exercise = exercise20 (During the past 30 days, other than your regular job, did you participate in any physical activities or exercises such as running, calisthenics, golf, gardening, or walking for exercise?)

healthy diet = nutrition1 (Thinking about nutrition…how many total servings of fruit and/or vegetables did you eat yesterday? A serving would equal one medium apple, a handful of broccoli, or a cup of carrots.)

unhealthy drinks = nsodasugarperday20 (Number of soda and other sugar sweetened beverages consumed. Standardized to per day)

Sex = birthsex (Sex assigned at birth: What was your sex assigned at birth? Male or female?)

  1) Male; 2) Female 

Description of the data set

When was it collected?

The data set was collected in 2020 from a random sample of adults aged 18 and older in New York City.

How was it collected?

The data was collected through a phone survey.

How many observations and variables are there?

  8,781 observations and 142 variables

How were the outcome and predictors measured? (data types, categories, values)

  1. exercise20 (During the past 30 days, other than your regular job, did you participate in any physical activities or exercises such as running, calisthenics, golf, gardening, or walking for exercise?)

     1) Yes; 2) No

     Categorical, coded as a factor

  2. nutrition1 (Thinking about nutrition…how many total servings of fruit and/or vegetables did you eat yesterday? A serving would equal one medium apple, a handful of broccoli, or a cup of carrots.)

     Continuous, coded as numeric

  3. nsodasugarperday20 (Number of soda and other sugar sweetened beverages consumed. Standardized to per day)

     Continuous, coded as numeric

  4. birthsex (Sex assigned at birth: What was your sex assigned at birth? Male or female?)

     1) Male; 2) Female

     Categorical, coded as a factor

  5. diabetes20 (Have you ever been told by a doctor, nurse or other health professional that you have diabetes?) (OUTCOME)

     1) Yes; 2) No

     Categorical, coded as a factor

library(package = "tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(package = "haven")
library(package = "table1")
## 
## Attaching package: 'table1'
## 
## The following objects are masked from 'package:base':
## 
##     units, units<-
library(descr)
library(labelled)

Import the data

check_point1 <- read_sas("C:/Users/kalib/OneDrive/Desktop/Check Point 1/chs2020_public.sas7bdat")

Cleaning the data

Re-coding and labeling variables, ensuring missing values are properly coded

CP2Small <- check_point1 %>% select(exercise20,nutrition1,nsodasugarperday20,birthsex,diabetes20)

Cleaning data

CP2Clean <- CP2Small %>%
  mutate(birthsex = recode_factor(birthsex,
                               `1` = "Male", 
                               `2` = "Female"))%>%
  mutate(exercise20 = recode_factor(exercise20,
                                `1` = "Yes",
                                `2` = "No"))%>%
  mutate(diabetes20 = recode_factor(diabetes20,
                                  `1` = "Yes",
                                  `2` = "No"))%>%
  mutate(nutrition1=as.numeric(nutrition1))%>%
  mutate(nsodasugarperday20 = as.numeric(nsodasugarperday20)) %>%
  drop_na()

summary(CP2Clean)
##  exercise20   nutrition1     nsodasugarperday20   birthsex    diabetes20
##  Yes:6191   Min.   : 0.000   Min.   : 0.00000   Male  :3749   Yes:1048  
##  No :2331   1st Qu.: 1.000   1st Qu.: 0.00000   Female:4773   No :7474  
##             Median : 2.000   Median : 0.03306                           
##             Mean   : 2.322   Mean   : 0.53772                           
##             3rd Qu.: 3.000   3rd Qu.: 0.49469                           
##             Max.   :50.000   Max.   :21.42857
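Because `drop_na()` removes every row with a missing value in any selected column, it can help to count the missing values per variable before dropping them. The following is a minimal base R sketch on a made-up data frame; the `toy` object and its values are hypothetical and are not the CHS data.

```r
# Hypothetical toy data standing in for some of the selected CHS variables
toy <- data.frame(
  exercise20 = c(1, 2, NA, 1, 2),
  nutrition1 = c(2, NA, 3, 0, 1),
  diabetes20 = c(2, 2, 1, NA, 2)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Rows that are complete on every column (the rows drop_na() would keep)
rows_kept <- sum(complete.cases(toy))
rows_dropped <- nrow(toy) - rows_kept

print(na_per_column)
print(c(kept = rows_kept, dropped = rows_dropped))
```

In this report the same comparison is 8,781 imported observations against 8,522 observations after cleaning, so 259 rows were dropped.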

Descriptive Statistics

#Histogram representation of continuous variables
# Nutrition 
CP2Clean %>% 
  ggplot(aes(x = nutrition1)) + 
  geom_histogram() +
  labs(x = "Servings of fruit and/or vegetables yesterday", y = "Number of participants",
       title = "Nutrition") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Histogram representation of continuous variables ## Number of Soda-sugar Per day

CP2Clean %>% 
  ggplot(aes(x = nsodasugarperday20)) + 
  geom_histogram() +
  labs(x = "Number of soda and other sugar-sweetened beverages per day",
       y = "Number of participants",
       title = "Soda and other sugar-sweetened beverages per day") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Bar graph representation of categorical variables ## Exercise

CP2Clean %>% 
  ggplot(aes(x = exercise20)) + 
  geom_bar() +
  labs(x = "Exercise", y = 'Number of Participants',
       title = "Exercise20") +
  theme_bw()

# Bar graph representation of categorical variables ## Birthsex

#Bar graph representation of categorical variables
#Birthsex
CP2Clean %>% 
  ggplot(aes(x = birthsex)) + 
  geom_bar() +
  labs(x = "Birthsex", y = 'Number of Participants',
       title = "Birthsex") +
  theme_bw()

# Bar graph representation of categorical variables

#Diabetes
CP2Clean %>% 
  ggplot(aes(x = diabetes20)) + 
  geom_bar() +
  labs(x = "Diabetes", y = 'Number of Participants',
       title = "Diabetes 20") +
  theme_bw()

# Table of all variables in relation to diabetes

label(CP2Clean$exercise20)= "Exercise (Physical Activity)"
label(CP2Clean$nutrition1)= "Number of servings"
label(CP2Clean$diabetes20)= "Diabetes by a healthcareworker"
label(CP2Clean$nsodasugarperday20)= "# of Soda &sugar per day"
label(CP2Clean$birthsex)= "Birthsex"

table1(~ diabetes20+ exercise20 + nutrition1 + nsodasugarperday20 + birthsex | diabetes20,
       render.continuous = c(.="median(IQR)"),
       data=CP2Clean)
| | Yes (N=1048) | No (N=7474) | Overall (N=8522) |
|---|---|---|---|
| Diabetes by a healthcareworker | | | |
| Yes | 1048 (100%) | 0 (0%) | 1048 (12.3%) |
| No | 0 (0%) | 7474 (100%) | 7474 (87.7%) |
| Exercise (Physical Activity) | | | |
| Yes | 659 (62.9%) | 5532 (74.0%) | 6191 (72.6%) |
| No | 389 (37.1%) | 1942 (26.0%) | 2331 (27.4%) |
| Number of servings | | | |
| median(IQR) | 2.00(2.00) | 2.00(2.00) | 2.00(2.00) |
| # of Soda &sugar per day | | | |
| median(IQR) | 0(0.286) | 0.0661(0.571) | 0.0331(0.495) |
| Birthsex | | | |
| Male | 496 (47.3%) | 3253 (43.5%) | 3749 (44.0%) |
| Female | 552 (52.7%) | 4221 (56.5%) | 4773 (56.0%) |
 title = ("Diabetes status")

The table above compares two groups: those who have been told by a healthcare worker that they have diabetes (N=1048) and those who have not (N=7474). In terms of physical activity, 62.9% of the diabetes group reported being physically active, compared to 74.0% of the non-diabetes group. Overall, 72.6% of the analytic sample (N=8522) reported being physically active.

The median number of servings consumed by both groups is the same at 2.00, indicating a similar dietary pattern in this regard. As expected, all individuals in the diabetes group were diagnosed by a healthcare worker, while none in the non-diabetes group were.

The median number of sodas and other sugar-sweetened beverages consumed per day is slightly higher in the non-diabetes group (0.0661) than in the diabetes group (0), with an overall median of 0.0331 for the analytic sample.

In terms of birth sex, 47.3% of the diabetes group are male and 52.7% are female, while in the non-diabetes group, 43.5% are male and 56.5% are female. Overall, 44.0% of the total population are male and 56.0% are female.
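The `median(IQR)` entries in the table appear to report the interquartile range as a single number, the third quartile minus the first quartile. As a small check under that assumption, the overall values can be recomputed from the quartiles printed by `summary(CP2Clean)` earlier in this report; the variable names below are illustrative only.

```r
# Quartiles copied from the summary(CP2Clean) output in this report
nutrition_q <- c(q1 = 1.000, q3 = 3.000)
soda_q      <- c(q1 = 0.00000, q3 = 0.49469)

# Interquartile range = third quartile minus first quartile
iqr_nutrition <- unname(nutrition_q["q3"] - nutrition_q["q1"])
iqr_soda      <- unname(soda_q["q3"] - soda_q["q1"])

print(iqr_nutrition)       # overall IQR for servings
print(round(iqr_soda, 3))  # overall IQR for soda and sugar-sweetened beverages
```

These reproduce the 2.00 and 0.495 shown in the Overall column.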

INFERENTIAL STATISTICS

Birthsex & diabetes bar graph with percentages

CP2Clean %>% 
  group_by(birthsex, diabetes20) %>% 
  count() %>% 
  group_by(birthsex) %>% 
  mutate(perc.birthsex = 100 * n / sum(n)) %>% 
  ggplot(aes(x = birthsex, y = perc.birthsex, fill = diabetes20)) +
  geom_col(position = "dodge") + 
  theme_minimal() +
  labs(x = "Birthsex",
       y = "Percentage",
       fill = "Diabetes") +
  coord_flip()

From the graph above, the percentage of males who have diabetes (13.2%) is slightly higher than the percentage of females who do (11.6%).
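The within-sex percentages plotted above can be recomputed directly from the cell counts reported in this document. This is a minimal sketch using `prop.table()`; the object name `tab` is illustrative.

```r
# Counts from the report: rows are birth sex, columns are diabetes status
tab <- matrix(c(496, 3253,
                552, 4221),
              nrow = 2, byrow = TRUE,
              dimnames = list(birthsex = c("Male", "Female"),
                              diabetes = c("Yes", "No")))

# Row percentages: share of each sex with and without diabetes
row_pct <- round(100 * prop.table(tab, margin = 1), 1)
print(row_pct)
```

Using `margin = 1` gives percentages within each birth sex, which matches the `group_by(birthsex)` step in the plotting code; `margin = 2` would instead give the column percentages shown in the descriptive table.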

Running a chi-square test since both variables are categorical (Diabetes and Birthsex)

Checking Chi-square assumptions

1. Observations are independent: Met
2. Both variables must be categorical (nominal or ordinal): Met
3. Expected values should be 5 or higher in at least 80% of groups: Met
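The expected-count assumption can be checked numerically rather than by eye. The sketch below rebuilds the 2x2 table from the counts reported in this document and uses base R's `chisq.test()`; the object names are illustrative.

```r
# Counts from the report: rows are diabetes status, columns are birth sex
tab <- matrix(c(496,  552,
                3253, 4221),
              nrow = 2, byrow = TRUE,
              dimnames = list(diabetes = c("Yes", "No"),
                              birthsex = c("Male", "Female")))

# Pearson chi-square without the Yates continuity correction
test <- chisq.test(tab, correct = FALSE)

# Expected counts under independence and the share of cells that are >= 5
expected <- test$expected
share_ok <- mean(expected >= 5)

print(round(expected))
print(share_ok)
print(unname(round(test$statistic, 3)))
```

Every expected count is far above 5 (the smallest is about 461), so the third assumption holds for all four cells.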

Null hypothesis (H0): There is no relationship between Birthsex and Diabetes

Alternate hypothesis (HA): There is a relationship between Birthsex and Diabetes

chi-square test for Birthsex and Diabetes

CrossTable(y = CP2Clean$birthsex,
           x = CP2Clean$diabetes20,
           chisq = TRUE,
           expected = TRUE,
           sresid = TRUE,
           prop.c = FALSE,
           prop.t = FALSE,
           prop.chisq = FALSE)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## |              Expected N | 
## |           N / Row Total | 
## |            Std Residual | 
## |-------------------------|
## 
## ==============================================
##                        CP2Clean$birthsex
## CP2Clean$diabetes20      Male   Female   Total
## ----------------------------------------------
## Yes                       496      552    1048
##                           461      587        
##                         0.473    0.527   0.123
##                         1.628   -1.443        
## ----------------------------------------------
## No                       3253     4221    7474
##                          3288     4186        
##                         0.435    0.565   0.877
##                        -0.610    0.540        
## ----------------------------------------------
## Total                    3749     4773    8522
## ==============================================
## 
## Statistics for All Table Factors
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 = 5.398041      d.f. = 1      p = 0.0202 
## 
## Pearson's Chi-squared test with Yates' continuity correction 
## ------------------------------------------------------------
## Chi^2 = 5.244755      d.f. = 1      p = 0.022

All expected cell counts are well above 5 (the smallest is 461), so the chi-square assumption is met and the result above can be used. Fisher's exact test is also run as an additional check.

Running a Fisher’s exact test

fisher.test(CP2Clean$birthsex,CP2Clean$diabetes20)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  CP2Clean$birthsex and CP2Clean$diabetes20
## p-value = 0.02184
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.021880 1.330009
## sample estimates:
## odds ratio 
##   1.165909

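There is a statistically significant relationship between birth sex and diabetes (chi-square p = 0.0202; Fisher's exact test p = 0.02184). The odds ratio of 1.17 (95% CI 1.02 to 1.33) indicates that males had higher odds of reporting a diabetes diagnosis than females.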
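As a rough cross-check of the odds ratio, the sample odds ratio and a Wald confidence interval can be computed by hand from the four cell counts. Note that `fisher.test()` reports a conditional maximum likelihood estimate with an exact interval, so the values below are expected to be close to, but not identical with, its output. The variable names are illustrative.

```r
# Cell counts from the report
male_yes   <- 496;  male_no   <- 3253
female_yes <- 552;  female_no <- 4221

# Sample odds ratio: odds of diabetes among males over odds among females
odds_ratio <- (male_yes / male_no) / (female_yes / female_no)

# Wald 95% confidence interval on the log odds ratio scale
se_log_or <- sqrt(1 / male_yes + 1 / male_no + 1 / female_yes + 1 / female_no)
wald_ci   <- exp(log(odds_ratio) + c(-1, 1) * qnorm(0.975) * se_log_or)

print(round(odds_ratio, 3))
print(round(wald_ci, 3))
```

An interval that excludes 1 agrees with the significant test result.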

Exercise & diabetes bar graph with percentages

CP2Clean %>% 
  group_by(exercise20, diabetes20) %>% 
  count() %>% 
  group_by(exercise20) %>% 
  mutate(perc.exercise20 = 100 * n / sum(n)) %>% 
  ggplot(aes(x = exercise20, y = perc.exercise20, fill = diabetes20)) +
  geom_col(position = "dodge") + 
  theme_minimal() +
  labs(x = "Exercise",
       y = "Percentage",
       fill = "Diabetes") +
  coord_flip()

Most participants do not have diabetes whether or not they exercise. However, the share with diabetes is higher among those who do not exercise (16.7%) than among those who do (10.6%).

Running a chi-square test since both variables are categorical (Exercise and Diabetes)

Checking Chi-square assumptions

1. Observations are independent: Met
2. Both variables must be categorical (nominal or ordinal): Met
3. Expected values should be 5 or higher in at least 80% of groups: Met

Null hypothesis (H0): There is no relationship between Exercise and Diabetes

Alternate hypothesis (HA): There is a relationship between Exercise and Diabetes

Chi-square test for Exercise and Diabetes

CrossTable(y = CP2Clean$exercise20,
           x = CP2Clean$diabetes20,
           chisq = TRUE,
           expected = TRUE,
           sresid = TRUE,
           prop.c = FALSE,
           prop.t = FALSE,
           prop.chisq = FALSE)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## |              Expected N | 
## |           N / Row Total | 
## |            Std Residual | 
## |-------------------------|
## 
## ==============================================
##                        CP2Clean$exercise20
## CP2Clean$diabetes20       Yes       No   Total
## ----------------------------------------------
## Yes                       659      389    1048
##                         761.3    286.7        
##                         0.629    0.371   0.123
##                        -3.709    6.045        
## ----------------------------------------------
## No                       5532     1942    7474
##                        5429.7   2044.3        
##                         0.740    0.260   0.877
##                         1.389   -2.264        
## ----------------------------------------------
## Total                    6191     2331    8522
## ==============================================
## 
## Statistics for All Table Factors
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 = 57.34907      d.f. = 1      p = 3.65e-14 
## 
## Pearson's Chi-squared test with Yates' continuity correction 
## ------------------------------------------------------------
## Chi^2 = 56.79008      d.f. = 1      p = 4.85e-14

All expected cell counts are well above 5 (the smallest is 286.7), so the chi-square assumption is met and the result above can be used. Fisher's exact test is also run as an additional check.

Running a Fisher’s exact test

fisher.test(CP2Clean$exercise20,CP2Clean$diabetes20)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  CP2Clean$exercise20 and CP2Clean$diabetes20
## p-value = 1.873e-13
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.5183390 0.6829982
## sample estimates:
## odds ratio 
##  0.5947621

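There is a statistically significant relationship between exercise and diabetes (chi-square p = 3.65e-14; Fisher's exact test p = 1.873e-13). The odds ratio of 0.59 (95% CI 0.52 to 0.68) indicates that participants who exercised had lower odds of reporting a diabetes diagnosis than those who did not.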
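The "Std Residual" rows in the `CrossTable()` output appear to be Pearson residuals, (observed - expected) / sqrt(expected), which show which cells contribute most to the chi-square statistic. A minimal base R sketch that recomputes them from the reported counts (object names are illustrative):

```r
# Counts from the report: rows are diabetes status, columns are exercise
tab <- matrix(c(659,  389,
                5532, 1942),
              nrow = 2, byrow = TRUE,
              dimnames = list(diabetes = c("Yes", "No"),
                              exercise = c("Yes", "No")))

test <- chisq.test(tab, correct = FALSE)

# Pearson residual for each cell: (observed - expected) / sqrt(expected)
pearson_resid <- (tab - test$expected) / sqrt(test$expected)
print(round(pearson_resid, 3))
```

The largest residual belongs to the cell for participants with diabetes who did not exercise, meaning that group is observed more often than independence would predict.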

Number of Sugar per day & diabetes

# Number of Sugar per day & diabetes box-plot
CP2Clean %>%
  ggplot(aes(x = diabetes20, y = nsodasugarperday20)) + geom_boxplot(aes(fill = diabetes20), alpha = .5)

There is a slight difference in the median between participants with and without diabetes. Many outliers are observed.

A t-test is considered because the grouping variable (diabetes) is binary and the other variable is continuous

Assumptions for the T-test

1. Continuous variable and two independent groups: MET
2. Independent observations: MET
3. Normal distribution in each group: FAILED
4. Equal variances within each group: FAILED

T-test Assumption Checking normality

#T-test Assumption Checking normality
# Names must match the factor levels ("No", "Yes") exactly, including case
custom_colors <- c("No" = "#b6ddfa", "Yes" = "#c4bdff")
CP2Clean %>% 
  ggplot(aes(x = nsodasugarperday20, fill= diabetes20)) + 
  geom_histogram() +
  scale_fill_manual(values = custom_colors) +
  theme_minimal() +
  facet_grid(cols = vars(diabetes20))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Both distributions are skewed, therefore failing the normal distribution assumption

Levene Test: Check Homogeneity

car::leveneTest(y = nsodasugarperday20 ~ diabetes20, data = CP2Clean)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    1  13.749 0.0002102 ***
##       8520                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null hypothesis of equal variances is rejected (p = 0.0002102), meaning the variances differ significantly between the two groups. The equal-variance assumption is therefore not met.
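For readers curious about what Levene's test with `center = median` does, it is equivalent to a one-way ANOVA on each observation's absolute deviation from its own group median. The sketch below shows this in base R on simulated skewed data; the data, seed, and object names are made up for illustration and are not the CHS sample.

```r
set.seed(42)  # simulated data, not the CHS sample

# Two skewed groups with different spread
group <- factor(rep(c("Yes", "No"), each = 500))
value <- c(rexp(500, rate = 1), rexp(500, rate = 0.4))

# Absolute deviation of each value from its own group's median
abs_dev <- abs(value - ave(value, group, FUN = median))

# One-way ANOVA on the deviations; its F test is the Levene statistic
levene  <- anova(lm(abs_dev ~ group))
f_value <- levene[["F value"]][1]
p_value <- levene[["Pr(>F)"]][1]

print(c(F = f_value, p = p_value))
```

A small p-value here, as in the report's output, indicates that the spread of the variable differs between the groups.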

Since some assumptions were not met, an alternative bivariate test is needed. The Mann-Whitney U test will be used instead.

Mann-Whitney U test

# Mann-Whitney U test

wilcox.test(formula = CP2Clean$nsodasugarperday20 ~ CP2Clean$diabetes20,
            paired = FALSE)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  CP2Clean$nsodasugarperday20 by CP2Clean$diabetes20
## W = 3292347, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

There is a statistically significant difference in the number of sodas and other sugar-sweetened beverages per day between participants with and without diabetes (W = 3292347, p-value < 2.2e-16).

Nutritional & diabetes box-plot

# Nutritional & diabetes box-plot
CP2Clean %>%
  ggplot(aes(x = diabetes20, y = nutrition1)) + geom_boxplot(aes(fill = diabetes20))

Number of servings: The median number of servings per day was the same for both groups: 2.00. This means that in each group at least half of the people ate 2 servings or fewer and at least half ate 2 servings or more.

A t-test is considered because the grouping variable (diabetes) is binary and the other variable is continuous

Assumptions for the T-test

1. Continuous variable and two independent groups: MET
2. Independent observations: MET
3. Normal distribution in each group: FAILED
4. Equal variances within each group: FAILED

#T-test Assumption Checking normality
# Names must match the factor levels ("No", "Yes") exactly, including case
custom_colors <- c("No" = "#b6ddfa", "Yes" = "#c4bdff")
CP2Clean %>% 
  ggplot(aes(x = nutrition1, fill= diabetes20)) + 
  geom_histogram() +
  scale_fill_manual(values = custom_colors) +
  theme_minimal() +
  facet_grid(cols = vars(diabetes20))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Levene Test: Check Homogeneity

car::leveneTest(y = nutrition1 ~ diabetes20, data = CP2Clean)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    1  12.925 0.0003261 ***
##       8520                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null hypothesis of equal variances is rejected (p = 0.0003261), meaning the variances differ significantly between the two groups. The equal-variance assumption is therefore not met.

Since some assumptions were not met, an alternative bivariate test is needed. The Mann-Whitney U test will be used instead.

Mann-Whitney U test

wilcox.test(formula = CP2Clean$nutrition1 ~ CP2Clean$diabetes20,
            paired = FALSE)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  CP2Clean$nutrition1 by CP2Clean$diabetes20
## W = 3491271, p-value = 5.393e-09
## alternative hypothesis: true location shift is not equal to 0

There is a statistically significant difference in the number of fruit and vegetable servings between participants with and without diabetes (W = 3491271, p-value = 5.393e-09).
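A p-value alone does not show the direction or size of a difference. One simple effect size for the Mann-Whitney U test is W divided by the product of the two group sizes, which estimates the probability that a randomly chosen member of the first group has a higher value than a randomly chosen member of the second group (ties counted as one half). The sketch below uses the W statistics and group sizes reported in this document, and it assumes the first group is the diabetes "Yes" level because that is the first factor level after recoding.

```r
# Group sizes from the report
n_yes <- 1048   # diabetes: Yes (assumed to be the first group in wilcox.test)
n_no  <- 7474   # diabetes: No

# W statistics reported by wilcox.test() in this document
w <- c(soda = 3292347, servings = 3491271)

# Estimated probability that a "Yes" respondent reports a higher value
# than a "No" respondent (0.5 would mean no tendency either way)
prob_superiority <- w / (n_yes * n_no)
print(round(prob_superiority, 3))
```

Under that assumption, both values fall below 0.5, which would mean the diabetes group tends to report lower values on both variables.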

Summary of the results

The study aimed to identify factors predisposing New York City adults to diabetes. It found statistically significant relationships between diabetes and each variable examined: birth sex (\(p < 0.05\)), exercise (\(p < 0.05\)), daily sugar-soda intake (\(p < 0.05\)), and nutrition (\(p < 0.05\)). The data compared two groups: those diagnosed with diabetes (N=1048) and those not diagnosed (N=7474). Physical activity was reported by 62.9% of the diabetes group and 74.0% of the non-diabetes group, with 72.6% of the analytic sample (N=8522) reporting physical activity. These findings suggest that exercise, nutrition, and sugar-soda intake are associated with diabetes prevalence among city dwellers.

Discussion of Results

The purpose of this study was to examine the factors that predispose adults in New York City to diabetes. The results showed significant associations between diabetes and birth sex, exercise, the number of sodas and other sugar-sweetened beverages per day, and nutrition. These findings are consistent with previous studies that have identified these factors as potential risk factors or protective factors for diabetes in urban populations (Chen et al., 2016; Huang et al., 2019; Park et al., 2018; Sattar et al., 2019; Zhang et al., 2017).

The study found that males were more likely to have diabetes than females, which could be explained by biological, behavioral, or social factors. This is contrary to studies that have found diabetes to be more common among females than among males. For example, females may have higher rates of obesity, gestational diabetes, or polycystic ovary syndrome, which are known to increase the risk of diabetes (Chen et al., 2016). Females may also face more barriers to accessing health care, physical activity, or healthy food options, especially in low-income or minority communities (Huang et al., 2019).

The study also found that exercise was inversely associated with diabetes, meaning that those who reported being physically active were less likely to have diabetes than those who were not. This is in line with the evidence that physical activity can improve glucose metabolism, insulin sensitivity, and cardiovascular health, and reduce the risk of obesity and other chronic diseases (Park et al., 2018). The study also revealed that the majority of the sample (72.6%) reported being physically active, which suggests that there is a high level of awareness and motivation for exercise among city dwellers.

Another significant finding was that the number of sodas and other sugar-sweetened beverages per day differed by diabetes status. In this sample, however, the median intake was lower in the diabetes group (0 per day) than in the non-diabetes group (0.0661 per day), so the data do not show that those who consumed more were more likely to have diabetes. Because the survey is cross-sectional, one possible explanation is that people reduce sugary drinks after a diagnosis. The literature shows that sugar-sweetened beverages can increase the risk of diabetes by inducing weight gain, inflammation, and insulin resistance (Sattar et al., 2019). The study also showed that the mean number of sodas and other sugar-sweetened beverages per day was 0.54, which is about half of the recommended limit of one per day by the American Heart Association (AHA, 2020).

Finally, the study found that nutrition was inversely associated with diabetes, meaning that those who reported eating more servings of fruit and vegetables were less likely to have diabetes than those who reported eating fewer. This is in accordance with the research that indicates that a healthy diet can prevent or delay the onset of diabetes by providing adequate nutrients, fiber, and antioxidants, and avoiding excess calories, fat, and sugar (Zhang et al., 2017). The study also indicated that the median number of servings was 2, which is lower than the optimal 5 servings, suggesting that there is room for improvement in dietary quality among city dwellers.

Based on these results, the study suggests that there are modifiable factors that can influence the risk of diabetes among adults in New York City. Therefore, the study recommends that public health interventions target these factors and promote healthy behaviors and lifestyles among urban populations. For example, interventions could aim to increase access to and affordability of health care, physical activity, and healthy food options, especially for males and low-income or minority groups. Interventions could also aim to reduce the consumption of sugar-sweetened beverages and increase awareness of and adherence to dietary guidelines.

Furthermore, because this analysis is bivariate and cross-sectional, future research should explore the causal mechanisms and the interactions of these factors, as well as the potential impact of other environmental, genetic, or psychosocial factors on diabetes risk among city dwellers.

References:

AHA. (2020). Added Sugars. https://www.heart.org/en/healthy-living/healthy-eating/eat-smart/sugar/added-sugars

Chen, L., Magliano, D. J., & Zimmet, P. Z. (2016). The worldwide epidemiology of type 2 diabetes mellitus—present and future perspectives. Nature Reviews Endocrinology, 8(4), 228–236. https://doi.org/10.1038/nrendo.2011.183

Huang, J., Qi, S., Huang, Y., & Feng, S. (2019). Gender differences in the prevalence of diabetes and prediabetes in the Chinese adult population: A systematic review and meta-analysis. Diabetes Research and Clinical Practice, 156, 107840. https://doi.org/10.1016/j.diabres.2019.107840

Park, S., Lee, J., Kim, Y., & Lee, S. (2018). Physical activity and diabetes mellitus. Journal of Exercise Rehabilitation, 14(4), 649–656. https://doi.org/10.12965/jer.1836298.149

Sattar, N., Gill, J. M., & Lean, M. E. (2019). ABC of obesity: Obesity, insulin resistance, and diabetes. BMJ, 333(7576), 989–992. https://doi.org/10.1136/bmj.333.7576.989

Zhang, X., Liu, S., Liu, Y., Du, H., Chen, X., Liu, F., Wang, C., & Sun, C. (2017). Dietary patterns, food groups and type 2 diabetes mellitus: A systematic review and meta-analysis of cohort studies. Journal of Diabetes Investigation, 8(4), 518–527. https://doi.org/10.1111/jdi.12614