Pima Indians Diabetes Analysis

Author

Wafaa mohamed shahen

Diabetes Project

This project is made after I finished a month of extensive R learning I studied the book “R for data science” and joined Negida academy “3-day R challenge”, I still have a lot to learn but i will apply what I learned so far in this document.

I want to use Pima Indians dataset to test the relationship between age, BMI and other diabetes prediction factors and developing diabetes

loading libraries

library(tidyverse)
library(gtsummary)

reading the dataset

First I will read the data into R

diabetes <- read_csv("diabetes.csv")

head(diabetes)

# A tibble: 6 × 9
  Pregnancies Glucose BloodPressure SkinThickness Insulin   BMI
        <dbl>   <dbl>         <dbl>         <dbl>   <dbl> <dbl>
1           6     148            72            35       0  33.6
2           1      85            66            29       0  26.6
3           8     183            64             0       0  23.3
4           1      89            66            23      94  28.1
5           0     137            40            35     168  43.1
6           5     116            74             0       0  25.6
# ℹ 3 more variables: DiabetesPedigreeFunction <dbl>, Age <dbl>, Outcome <dbl>

The names are camel-case I will change them to snake-case

diabetes <- diabetes %>% 
  janitor::clean_names("snake")

Cleaning the data

I want to add age groups as well as BMI groups and I also noticed that NA is represented in some columns as 0 , I could have solved this problem with read_csv() but that would have changed the 0 values in pregnancies and outcome as well.

diabetes_na <- diabetes %>% 
  group_by(outcome) %>% 
  mutate(across(c(glucose:age), \(x) na_if(x, 0), .names = "{.col}"),
         age_groups = case_when(
           age >= 20 & age < 40 ~ "young adult",
           age >= 40 & age < 60 ~ "middle aged",
           age >= 60            ~ "old"
         ),
         bmi_groups = case_when(
           bmi < 18.5             ~ "under weight",
           bmi >= 18.5 & bmi < 25 ~ "normal weight",
           bmi >= 25 & bmi < 30   ~ "over weight",
           bmi >= 30 & bmi < 35   ~ "class 1 obesity",
           bmi >= 35 & bmi < 40   ~ "class 2 obesity",
           bmi >= 40              ~ "class 3 obesity"
         )) %>% 
  ungroup()

Now I will convert some variables to factors as they represent categories

fctr <- function(x)  {
  as.factor(x)
}

diabetes <- diabetes_na %>% 
  mutate(across(c(pregnancies, outcome, age_groups, bmi_groups), fctr))

And I will add levels to age groups and BMI groups

bmi_levels <- c("under weight", "normal weight", "over weight",
                "class 1 obesity", "class 2 obesity", "class 3 obesity")
diabetes$bmi_groups <-  factor(diabetes$bmi_groups,
                                    levels = bmi_levels)
age_levels <- c("young adult", "middle aged", "old")
diabetes$age_groups <- factor(diabetes$age_groups, levels = age_levels)

Descriptives table

Code

desc_table <- diabetes %>%
  drop_na() %>% 
  mutate(pregnancies = as.numeric(as.character(pregnancies))) %>% 
  tbl_summary(by = outcome, 
              statistic = list(
                all_categorical() ~ "{n} ({p}%)",
                all_continuous() ~ "{median} ({p25}, {p75})"
                          ),
              label = list(age = "Age",
                           pregnancies = "Pregnancies",
                           glucose = "Glucose level",
                           blood_pressure = "Blood pressure",
                           skin_thickness = "Skin thickness",
                           insulin = "Insulin",
                           bmi = "BMI",
                           diabetes_pedigree_function = "Diabetes predigree function",
                           outcome = "Outcome",
                           age_groups = "Age groups",
                           bmi_groups = "BMI groups")) %>% 
  modify_header(
    stat_1 ~ "**Non-diabetic**",
    stat_2 ~ "**Diabetic**"
  ) %>% 
  add_p(test = list(
    all_continuous() ~ "wilcox.test",
    all_categorical() ~ "fisher.test"
  ), test.arg = bmi_groups ~ list(simulate.p.value = TRUE))

desc_table

Characteristic	Non-diabetic¹	Diabetic¹	p-value²
Pregnancies	2.0 (1.0, 4.0)	3.0 (1.0, 7.0)	<0.001
Glucose level	108 (94, 126)	145 (124, 172)	<0.001
Blood pressure	70 (60, 76)	74 (66, 82)	<0.001
Skin thickness	27 (18, 34)	33 (26, 40)	<0.001
Insulin	105 (66, 165)	170 (127, 240)	<0.001
BMI	31 (26, 36)	35 (32, 38)	<0.001
Diabetes predigree function	0.41 (0.26, 0.63)	0.55 (0.33, 0.79)	<0.001
Age	25 (22, 30)	33 (27, 43)	<0.001
Age groups			<0.001
young adult	233 (89%)	83 (64%)
middle aged	25 (9.5%)	46 (35%)
old	4 (1.5%)	1 (0.8%)
BMI groups			<0.001
under weight	1 (0.4%)	0 (0%)
normal weight	42 (16%)	2 (1.5%)
over weight	67 (26%)	18 (14%)
class 1 obesity	73 (28%)	47 (36%)
class 2 obesity	51 (19%)	36 (28%)
class 3 obesity	28 (11%)	27 (21%)
¹ Median (Q1, Q3); n (%)
² Wilcoxon rank sum test; Fisher’s exact test; Fisher’s Exact Test for Count Data with simulated p-value (based on 2000 replicates)

The table shows statistically significant difference between the two groups with diabetic patients having higher BMI, larger median age and going through more pregnancies than non-diabetic groups

visualization

After cleaning I can start making plots that demonstrate data distribution and relationships

This plot is a box plot that shows the relationship between BMI and glucose levels faceted by outcome (diabetic, non-diabetic)

Code

diabetes %>% 
  filter(!is.na(bmi_groups)) %>% 
  ggplot(aes(y = glucose, x = bmi_groups, fill = outcome, color = outcome))+
  geom_boxplot(alpha = 0.7)+
  facet_wrap(~outcome, labeller = as_labeller(c("0" = "non diabetics", "1" = "diabetics")))+
  theme_minimal()+
  labs(title = "Relationship between BMI and Glucose levels",
       subtitle = "High BMI is associated with higher blood glucose levels",
       x = "BMI", y = "Glucose")+
  scale_fill_manual(values = c("1" = "#FFA07A", "0" = "#FFEC8B"),
                    labels = c("1" = "diabetic", "0" = "non-diabetic"),
                    guide = "none")+
  scale_color_manual(values = c("1" = "#EE4000", "0" = "#CD8500"),
                     labels = c("1" = "diabetic", "0" = "non-diabetic"),
                     guide = "none")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This box plot shows that high BMI is associated with higher glucose levels especially in diabetic, surprisingly in the diabetic group patients with normal weight have the highest median glucose levels between all BMI groups, this might be because of type 1 diabetes which is not necessarily associated with high BMI unlike type 2, it might also be cause by advanced type 2 cases with wasting due to inability of the cells to utilize glucose, this needs further investigations.

This plot is a histogram that shows glucose levels in the two outcome groups

Code

diabetes %>%
  ggplot(aes(x = glucose, fill = outcome))+
  geom_density(aes(y = after_stat(count)), alpha = 0.5, color = NA)+
  theme_minimal()+
  labs(title = "Histogram showing glucose levels", x = "Glucose")+
  scale_fill_manual(values = c("1" = "#FF4500", "0" = "#FFA500"),
                    labels = c("1" = "diabetic", "0" = "non-diabetic"))

Non-diabetics have glucose levels mainly around 100 mg/dl with diabetics having glucose levels around higher values

I found that there is a correlation between diabetes_pedigree_function and age so I made a plot that shows this risk

Code

diabetes %>% 
  mutate(
    outcome = as.numeric(as.character(outcome)),
    age = round(age / 5) * 5,
    diabetes_pedigree_function = round(diabetes_pedigree_function / 0.1) * 0.1
  ) %>% 
  group_by(age, diabetes_pedigree_function) %>% 
  summarise(
    mean_outcome = mean(outcome, na.rm = TRUE), .groups = "drop"
  ) %>% 
  complete(age = seq(20, 85, by = 5),
           diabetes_pedigree_function = seq(0, 2.6, by = 0.1)) %>% 
  ggplot(aes(x = diabetes_pedigree_function, y = age, fill = mean_outcome))+
  geom_tile()+
  theme_minimal()+
  scale_x_continuous(breaks = seq(0, 2.6, by = 0.5))+
  scale_y_continuous(breaks = seq(20, 85, by = 10))+
  scale_fill_viridis_c(na.value = "grey90", name = "mean outcome")+
  labs(title = "diabetes risk by age and genetic risk score",
       x = "Diabetes predigree function",
       y = "Age")

The last plot is a correlation heat map between all the variables

Code

corr <- diabetes_na %>% 
  select(where(is.numeric)) %>% 
  cor(method = "spearman")
cor_data <- as.data.frame(corr)
cor_data <- cor_data %>% 
  rownames_to_column(var = "var_1")
cor_vars <- cor_data %>% 
  pivot_longer(where(is.numeric),
               names_to = c("var_2"),
               values_to = "correlation") %>% 
  mutate(var_2 = str_replace_all(string = c(var_2),
                                 pattern = "_",
                                 replacement =  " "), 
         var_1 = str_replace_all(string = c(var_1),
                                 pattern = "_",
                                 replacement =  " "))
cor_vars %>% 
  ggplot(aes(x = var_1, y = var_2, fill = correlation))+
  geom_tile()+
  scale_fill_gradient2(low = "#2166AC", mid = "white", high = "#D6604D",
                       midpoint = 0, limit = c(-1, 1), na.value = "white")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  labs(x = NULL, y = NULL,
       title = "glucose and age show the strongest association with diabetes outcome")

There seem to be a correlation between the patient’s age , the number of pregnancies and diabetes pedigree function and developing diabetes

Hypothesis testing

My hypothesis (H1): high BMI increases the chance of developing diabetes

Null hypothesis (H0): high BMI does not affect the development of diabetes

I will start by checking if the variables are normally distributed¹

diabetes %>% 
  select(where(is.numeric)) %>% 
  lapply(shapiro.test)

It appears all the numeric variables are not normally distributed so I will use non-parametric tests.

wilcox.test(bmi ~ outcome, data = diabetes)

# p-value < 2.2e-16

P-value = < 0.0001 this shows that BMI significantly affects diabetes development

I want to test for the number of pregnancies, age, glucose and diabetes pedigree function as well

wilcox.test(pregnancies ~ outcome, data = diabetes_na)

# p-value = 3.745e-08

wilcox.test(age ~ outcome, data = diabetes)

# p-value < 2.2e-16

wilcox.test(glucose ~ outcome, data = diabetes)

# p-value < 2.2e-16

wilcox.test(diabetes_pedigree_function ~ outcome, data = diabetes)

# p-value = 1.197e-06

All tests show statistically significant relationships theorizing that diabetes is affected by more than one risk factor.

Conclusion

This analysis of the Pima Indians dataset revealed that diabetic patients differ significantly from non-diabetic patients across all measured variables (p < 0.001).

From the correlation heat map it appears that the number of pregnancies, age and diabetes pedigree function have the strongest correlation with developing diabetes.

Non diabetics had thier sugar levels around 100 mg/dl while diabetics had sugar levels mainly between 100 and 200 mg/dl , this and the wilcox test suggest a positive relationship between glucose levels and diabetes.

The observation that normal-weight patients in the diabetic group had the highest median glucose levels suggests high proportion of type 1 diabetics in the sample or it might also be cause by advanced type 2 diabetes cases with wasting due to inability of the cells to utilize glucose .

This dataset was originally introduced by Smith et al. (1988) to predict diabetes in Pima Indian women using clinical measurements. my analysis replicates and extends their exploratory approach using modern R tools.

Footnotes

I wrote this code before the correlation heat map and I decided the test according to it.↩︎