Pima Indians Diabetes Analysis

Author

Wafaa mohamed shahen

Diabetes Project

This project is made after I finished a month of extensive R learning I studied the book “R for data science” and joined Negida academy “3-day R challenge”, I still have a lot to learn but i will apply what I learned so far in this document.

I want to use Pima Indians dataset to test the relationship between age, BMI and other diabetes prediction factors and developing diabetes


loading libraries

library(tidyverse)
library(gtsummary)

reading the dataset

First I will read the data into R

diabetes <- read_csv("diabetes.csv")
head(diabetes)
# A tibble: 6 × 9
  Pregnancies Glucose BloodPressure SkinThickness Insulin   BMI
        <dbl>   <dbl>         <dbl>         <dbl>   <dbl> <dbl>
1           6     148            72            35       0  33.6
2           1      85            66            29       0  26.6
3           8     183            64             0       0  23.3
4           1      89            66            23      94  28.1
5           0     137            40            35     168  43.1
6           5     116            74             0       0  25.6
# ℹ 3 more variables: DiabetesPedigreeFunction <dbl>, Age <dbl>, Outcome <dbl>

The names are camel-case I will change them to snake-case

diabetes <- diabetes %>% 
  janitor::clean_names("snake")

Cleaning the data

I want to add age groups as well as BMI groups and I also noticed that NA is represented in some columns as 0 , I could have solved this problem with read_csv() but that would have changed the 0 values in pregnancies and outcome as well.

diabetes_na <- diabetes %>% 
  group_by(outcome) %>% 
  mutate(across(c(glucose:age), \(x) na_if(x, 0), .names = "{.col}"),
         age_groups = case_when(
           age >= 20 & age < 40 ~ "young adult",
           age >= 40 & age < 60 ~ "middle aged",
           age >= 60            ~ "old"
         ),
         bmi_groups = case_when(
           bmi < 18.5             ~ "under weight",
           bmi >= 18.5 & bmi < 25 ~ "normal weight",
           bmi >= 25 & bmi < 30   ~ "over weight",
           bmi >= 30 & bmi < 35   ~ "class 1 obesity",
           bmi >= 35 & bmi < 40   ~ "class 2 obesity",
           bmi >= 40              ~ "class 3 obesity"
         )) %>% 
  ungroup()

Now I will convert some variables to factors as they represent categories

fctr <- function(x)  {
  as.factor(x)
}

diabetes <- diabetes_na %>% 
  mutate(across(c(pregnancies, outcome, age_groups, bmi_groups), fctr))

And I will add levels to age groups and BMI groups

bmi_levels <- c("under weight", "normal weight", "over weight",
                "class 1 obesity", "class 2 obesity", "class 3 obesity")
diabetes$bmi_groups <-  factor(diabetes$bmi_groups,
                                    levels = bmi_levels)
age_levels <- c("young adult", "middle aged", "old")
diabetes$age_groups <- factor(diabetes$age_groups, levels = age_levels)

Descriptives table

Code
desc_table <- diabetes %>%
  drop_na() %>% 
  mutate(pregnancies = as.numeric(as.character(pregnancies))) %>% 
  tbl_summary(by = outcome, 
              statistic = list(
                all_categorical() ~ "{n} ({p}%)",
                all_continuous() ~ "{median} ({p25}, {p75})"
                          ),
              label = list(age = "Age",
                           pregnancies = "Pregnancies",
                           glucose = "Glucose level",
                           blood_pressure = "Blood pressure",
                           skin_thickness = "Skin thickness",
                           insulin = "Insulin",
                           bmi = "BMI",
                           diabetes_pedigree_function = "Diabetes predigree function",
                           outcome = "Outcome",
                           age_groups = "Age groups",
                           bmi_groups = "BMI groups")) %>% 
  modify_header(
    stat_1 ~ "**Non-diabetic**",
    stat_2 ~ "**Diabetic**"
  ) %>% 
  add_p(test = list(
    all_continuous() ~ "wilcox.test",
    all_categorical() ~ "fisher.test"
  ), test.arg = bmi_groups ~ list(simulate.p.value = TRUE))

desc_table
Characteristic Non-diabetic1 Diabetic1 p-value2
Pregnancies 2.0 (1.0, 4.0) 3.0 (1.0, 7.0) <0.001
Glucose level 108 (94, 126) 145 (124, 172) <0.001
Blood pressure 70 (60, 76) 74 (66, 82) <0.001
Skin thickness 27 (18, 34) 33 (26, 40) <0.001
Insulin 105 (66, 165) 170 (127, 240) <0.001
BMI 31 (26, 36) 35 (32, 38) <0.001
Diabetes predigree function 0.41 (0.26, 0.63) 0.55 (0.33, 0.79) <0.001
Age 25 (22, 30) 33 (27, 43) <0.001
Age groups

<0.001
    young adult 233 (89%) 83 (64%)
    middle aged 25 (9.5%) 46 (35%)
    old 4 (1.5%) 1 (0.8%)
BMI groups

<0.001
    under weight 1 (0.4%) 0 (0%)
    normal weight 42 (16%) 2 (1.5%)
    over weight 67 (26%) 18 (14%)
    class 1 obesity 73 (28%) 47 (36%)
    class 2 obesity 51 (19%) 36 (28%)
    class 3 obesity 28 (11%) 27 (21%)
1 Median (Q1, Q3); n (%)
2 Wilcoxon rank sum test; Fisher’s exact test; Fisher’s Exact Test for Count Data with simulated p-value (based on 2000 replicates)

The table shows statistically significant difference between the two groups with diabetic patients having higher BMI, larger median age and going through more pregnancies than non-diabetic groups


visualization

After cleaning I can start making plots that demonstrate data distribution and relationships

  • This plot is a box plot that shows the relationship between BMI and glucose levels faceted by outcome (diabetic, non-diabetic)
Code
diabetes %>% 
  filter(!is.na(bmi_groups)) %>% 
  ggplot(aes(y = glucose, x = bmi_groups, fill = outcome, color = outcome))+
  geom_boxplot(alpha = 0.7)+
  facet_wrap(~outcome, labeller = as_labeller(c("0" = "non diabetics", "1" = "diabetics")))+
  theme_minimal()+
  labs(title = "Relationship between BMI and Glucose levels",
       subtitle = "High BMI is associated with higher blood glucose levels",
       x = "BMI", y = "Glucose")+
  scale_fill_manual(values = c("1" = "#FFA07A", "0" = "#FFEC8B"),
                    labels = c("1" = "diabetic", "0" = "non-diabetic"),
                    guide = "none")+
  scale_color_manual(values = c("1" = "#EE4000", "0" = "#CD8500"),
                     labels = c("1" = "diabetic", "0" = "non-diabetic"),
                     guide = "none")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This box plot shows that high BMI is associated with higher glucose levels especially in diabetic, surprisingly in the diabetic group patients with normal weight have the highest median glucose levels between all BMI groups, this might be because of type 1 diabetes which is not necessarily associated with high BMI unlike type 2, it might also be cause by advanced type 2 cases with wasting due to inability of the cells to utilize glucose, this needs further investigations.

  • This plot is a histogram that shows glucose levels in the two outcome groups
Code
diabetes %>%
  ggplot(aes(x = glucose, fill = outcome))+
  geom_density(aes(y = after_stat(count)), alpha = 0.5, color = NA)+
  theme_minimal()+
  labs(title = "Histogram showing glucose levels", x = "Glucose")+
  scale_fill_manual(values = c("1" = "#FF4500", "0" = "#FFA500"),
                    labels = c("1" = "diabetic", "0" = "non-diabetic"))

Non-diabetics have glucose levels mainly around 100 mg/dl with diabetics having glucose levels around higher values

  • I found that there is a correlation between diabetes_pedigree_function and age so I made a plot that shows this risk
Code
diabetes %>% 
  mutate(
    outcome = as.numeric(as.character(outcome)),
    age = round(age / 5) * 5,
    diabetes_pedigree_function = round(diabetes_pedigree_function / 0.1) * 0.1
  ) %>% 
  group_by(age, diabetes_pedigree_function) %>% 
  summarise(
    mean_outcome = mean(outcome, na.rm = TRUE), .groups = "drop"
  ) %>% 
  complete(age = seq(20, 85, by = 5),
           diabetes_pedigree_function = seq(0, 2.6, by = 0.1)) %>% 
  ggplot(aes(x = diabetes_pedigree_function, y = age, fill = mean_outcome))+
  geom_tile()+
  theme_minimal()+
  scale_x_continuous(breaks = seq(0, 2.6, by = 0.5))+
  scale_y_continuous(breaks = seq(20, 85, by = 10))+
  scale_fill_viridis_c(na.value = "grey90", name = "mean outcome")+
  labs(title = "diabetes risk by age and genetic risk score",
       x = "Diabetes predigree function",
       y = "Age")

  • The last plot is a correlation heat map between all the variables
Code
corr <- diabetes_na %>% 
  select(where(is.numeric)) %>% 
  cor(method = "spearman")
cor_data <- as.data.frame(corr)
cor_data <- cor_data %>% 
  rownames_to_column(var = "var_1")
cor_vars <- cor_data %>% 
  pivot_longer(where(is.numeric),
               names_to = c("var_2"),
               values_to = "correlation") %>% 
  mutate(var_2 = str_replace_all(string = c(var_2),
                                 pattern = "_",
                                 replacement =  " "), 
         var_1 = str_replace_all(string = c(var_1),
                                 pattern = "_",
                                 replacement =  " "))
cor_vars %>% 
  ggplot(aes(x = var_1, y = var_2, fill = correlation))+
  geom_tile()+
  scale_fill_gradient2(low = "#2166AC", mid = "white", high = "#D6604D",
                       midpoint = 0, limit = c(-1, 1), na.value = "white")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  labs(x = NULL, y = NULL,
       title = "glucose and age show the strongest association with diabetes outcome")

There seem to be a correlation between the patient’s age , the number of pregnancies and diabetes pedigree function and developing diabetes


Hypothesis testing

My hypothesis (H1): high BMI increases the chance of developing diabetes

Null hypothesis (H0): high BMI does not affect the development of diabetes

I will start by checking if the variables are normally distributed1

diabetes %>% 
  select(where(is.numeric)) %>% 
  lapply(shapiro.test)

It appears all the numeric variables are not normally distributed so I will use non-parametric tests.

wilcox.test(bmi ~ outcome, data = diabetes)

# p-value < 2.2e-16

P-value = < 0.0001 this shows that BMI significantly affects diabetes development

I want to test for the number of pregnancies, age, glucose and diabetes pedigree function as well

wilcox.test(pregnancies ~ outcome, data = diabetes_na)

# p-value = 3.745e-08
wilcox.test(age ~ outcome, data = diabetes)

# p-value < 2.2e-16
wilcox.test(glucose ~ outcome, data = diabetes)

# p-value < 2.2e-16
wilcox.test(diabetes_pedigree_function ~ outcome, data = diabetes)

# p-value = 1.197e-06

All tests show statistically significant relationships theorizing that diabetes is affected by more than one risk factor.

Conclusion

This analysis of the Pima Indians dataset revealed that diabetic patients differ significantly from non-diabetic patients across all measured variables (p < 0.001).

From the correlation heat map it appears that the number of pregnancies, age and diabetes pedigree function have the strongest correlation with developing diabetes.

Non diabetics had thier sugar levels around 100 mg/dl while diabetics had sugar levels mainly between 100 and 200 mg/dl , this and the wilcox test suggest a positive relationship between glucose levels and diabetes.

The observation that normal-weight patients in the diabetic group had the highest median glucose levels suggests high proportion of type 1 diabetics in the sample or it might also be cause by advanced type 2 diabetes cases with wasting due to inability of the cells to utilize glucose .

This dataset was originally introduced by Smith et al. (1988) to predict diabetes in Pima Indian women using clinical measurements. my analysis replicates and extends their exploratory approach using modern R tools.

Footnotes

  1. I wrote this code before the correlation heat map and I decided the test according to it.↩︎