library(tidyverse)
library(gtsummary)Pima Indians Diabetes Analysis
Diabetes Project
This project is made after I finished a month of extensive R learning I studied the book “R for data science” and joined Negida academy “3-day R challenge”, I still have a lot to learn but i will apply what I learned so far in this document.
I want to use Pima Indians dataset to test the relationship between age, BMI and other diabetes prediction factors and developing diabetes
loading libraries
reading the dataset
First I will read the data into R
diabetes <- read_csv("diabetes.csv")head(diabetes)# A tibble: 6 × 9
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 6 148 72 35 0 33.6
2 1 85 66 29 0 26.6
3 8 183 64 0 0 23.3
4 1 89 66 23 94 28.1
5 0 137 40 35 168 43.1
6 5 116 74 0 0 25.6
# ℹ 3 more variables: DiabetesPedigreeFunction <dbl>, Age <dbl>, Outcome <dbl>
The names are camel-case I will change them to snake-case
diabetes <- diabetes %>%
janitor::clean_names("snake")Cleaning the data
I want to add age groups as well as BMI groups and I also noticed that NA is represented in some columns as 0 , I could have solved this problem with read_csv() but that would have changed the 0 values in pregnancies and outcome as well.
diabetes_na <- diabetes %>%
group_by(outcome) %>%
mutate(across(c(glucose:age), \(x) na_if(x, 0), .names = "{.col}"),
age_groups = case_when(
age >= 20 & age < 40 ~ "young adult",
age >= 40 & age < 60 ~ "middle aged",
age >= 60 ~ "old"
),
bmi_groups = case_when(
bmi < 18.5 ~ "under weight",
bmi >= 18.5 & bmi < 25 ~ "normal weight",
bmi >= 25 & bmi < 30 ~ "over weight",
bmi >= 30 & bmi < 35 ~ "class 1 obesity",
bmi >= 35 & bmi < 40 ~ "class 2 obesity",
bmi >= 40 ~ "class 3 obesity"
)) %>%
ungroup()Now I will convert some variables to factors as they represent categories
fctr <- function(x) {
as.factor(x)
}
diabetes <- diabetes_na %>%
mutate(across(c(pregnancies, outcome, age_groups, bmi_groups), fctr))And I will add levels to age groups and BMI groups
bmi_levels <- c("under weight", "normal weight", "over weight",
"class 1 obesity", "class 2 obesity", "class 3 obesity")
diabetes$bmi_groups <- factor(diabetes$bmi_groups,
levels = bmi_levels)
age_levels <- c("young adult", "middle aged", "old")
diabetes$age_groups <- factor(diabetes$age_groups, levels = age_levels)Descriptives table
Code
desc_table <- diabetes %>%
drop_na() %>%
mutate(pregnancies = as.numeric(as.character(pregnancies))) %>%
tbl_summary(by = outcome,
statistic = list(
all_categorical() ~ "{n} ({p}%)",
all_continuous() ~ "{median} ({p25}, {p75})"
),
label = list(age = "Age",
pregnancies = "Pregnancies",
glucose = "Glucose level",
blood_pressure = "Blood pressure",
skin_thickness = "Skin thickness",
insulin = "Insulin",
bmi = "BMI",
diabetes_pedigree_function = "Diabetes predigree function",
outcome = "Outcome",
age_groups = "Age groups",
bmi_groups = "BMI groups")) %>%
modify_header(
stat_1 ~ "**Non-diabetic**",
stat_2 ~ "**Diabetic**"
) %>%
add_p(test = list(
all_continuous() ~ "wilcox.test",
all_categorical() ~ "fisher.test"
), test.arg = bmi_groups ~ list(simulate.p.value = TRUE))
desc_table| Characteristic | Non-diabetic1 | Diabetic1 | p-value2 |
|---|---|---|---|
| Pregnancies | 2.0 (1.0, 4.0) | 3.0 (1.0, 7.0) | <0.001 |
| Glucose level | 108 (94, 126) | 145 (124, 172) | <0.001 |
| Blood pressure | 70 (60, 76) | 74 (66, 82) | <0.001 |
| Skin thickness | 27 (18, 34) | 33 (26, 40) | <0.001 |
| Insulin | 105 (66, 165) | 170 (127, 240) | <0.001 |
| BMI | 31 (26, 36) | 35 (32, 38) | <0.001 |
| Diabetes predigree function | 0.41 (0.26, 0.63) | 0.55 (0.33, 0.79) | <0.001 |
| Age | 25 (22, 30) | 33 (27, 43) | <0.001 |
| Age groups | <0.001 | ||
| young adult | 233 (89%) | 83 (64%) | |
| middle aged | 25 (9.5%) | 46 (35%) | |
| old | 4 (1.5%) | 1 (0.8%) | |
| BMI groups | <0.001 | ||
| under weight | 1 (0.4%) | 0 (0%) | |
| normal weight | 42 (16%) | 2 (1.5%) | |
| over weight | 67 (26%) | 18 (14%) | |
| class 1 obesity | 73 (28%) | 47 (36%) | |
| class 2 obesity | 51 (19%) | 36 (28%) | |
| class 3 obesity | 28 (11%) | 27 (21%) | |
| 1 Median (Q1, Q3); n (%) | |||
| 2 Wilcoxon rank sum test; Fisher’s exact test; Fisher’s Exact Test for Count Data with simulated p-value (based on 2000 replicates) | |||
The table shows statistically significant difference between the two groups with diabetic patients having higher BMI, larger median age and going through more pregnancies than non-diabetic groups
visualization
After cleaning I can start making plots that demonstrate data distribution and relationships
- This plot is a box plot that shows the relationship between BMI and glucose levels faceted by outcome (diabetic, non-diabetic)
Code
diabetes %>%
filter(!is.na(bmi_groups)) %>%
ggplot(aes(y = glucose, x = bmi_groups, fill = outcome, color = outcome))+
geom_boxplot(alpha = 0.7)+
facet_wrap(~outcome, labeller = as_labeller(c("0" = "non diabetics", "1" = "diabetics")))+
theme_minimal()+
labs(title = "Relationship between BMI and Glucose levels",
subtitle = "High BMI is associated with higher blood glucose levels",
x = "BMI", y = "Glucose")+
scale_fill_manual(values = c("1" = "#FFA07A", "0" = "#FFEC8B"),
labels = c("1" = "diabetic", "0" = "non-diabetic"),
guide = "none")+
scale_color_manual(values = c("1" = "#EE4000", "0" = "#CD8500"),
labels = c("1" = "diabetic", "0" = "non-diabetic"),
guide = "none")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))This box plot shows that high BMI is associated with higher glucose levels especially in diabetic, surprisingly in the diabetic group patients with normal weight have the highest median glucose levels between all BMI groups, this might be because of type 1 diabetes which is not necessarily associated with high BMI unlike type 2, it might also be cause by advanced type 2 cases with wasting due to inability of the cells to utilize glucose, this needs further investigations.
- This plot is a histogram that shows glucose levels in the two outcome groups
Code
diabetes %>%
ggplot(aes(x = glucose, fill = outcome))+
geom_density(aes(y = after_stat(count)), alpha = 0.5, color = NA)+
theme_minimal()+
labs(title = "Histogram showing glucose levels", x = "Glucose")+
scale_fill_manual(values = c("1" = "#FF4500", "0" = "#FFA500"),
labels = c("1" = "diabetic", "0" = "non-diabetic"))Non-diabetics have glucose levels mainly around 100 mg/dl with diabetics having glucose levels around higher values
- I found that there is a correlation between
diabetes_pedigree_functionandageso I made a plot that shows this risk
Code
diabetes %>%
mutate(
outcome = as.numeric(as.character(outcome)),
age = round(age / 5) * 5,
diabetes_pedigree_function = round(diabetes_pedigree_function / 0.1) * 0.1
) %>%
group_by(age, diabetes_pedigree_function) %>%
summarise(
mean_outcome = mean(outcome, na.rm = TRUE), .groups = "drop"
) %>%
complete(age = seq(20, 85, by = 5),
diabetes_pedigree_function = seq(0, 2.6, by = 0.1)) %>%
ggplot(aes(x = diabetes_pedigree_function, y = age, fill = mean_outcome))+
geom_tile()+
theme_minimal()+
scale_x_continuous(breaks = seq(0, 2.6, by = 0.5))+
scale_y_continuous(breaks = seq(20, 85, by = 10))+
scale_fill_viridis_c(na.value = "grey90", name = "mean outcome")+
labs(title = "diabetes risk by age and genetic risk score",
x = "Diabetes predigree function",
y = "Age")- The last plot is a correlation heat map between all the variables
Code
corr <- diabetes_na %>%
select(where(is.numeric)) %>%
cor(method = "spearman")
cor_data <- as.data.frame(corr)
cor_data <- cor_data %>%
rownames_to_column(var = "var_1")
cor_vars <- cor_data %>%
pivot_longer(where(is.numeric),
names_to = c("var_2"),
values_to = "correlation") %>%
mutate(var_2 = str_replace_all(string = c(var_2),
pattern = "_",
replacement = " "),
var_1 = str_replace_all(string = c(var_1),
pattern = "_",
replacement = " "))
cor_vars %>%
ggplot(aes(x = var_1, y = var_2, fill = correlation))+
geom_tile()+
scale_fill_gradient2(low = "#2166AC", mid = "white", high = "#D6604D",
midpoint = 0, limit = c(-1, 1), na.value = "white")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
labs(x = NULL, y = NULL,
title = "glucose and age show the strongest association with diabetes outcome")There seem to be a correlation between the patient’s age , the number of pregnancies and diabetes pedigree function and developing diabetes
Hypothesis testing
My hypothesis (H1): high BMI increases the chance of developing diabetes
Null hypothesis (H0): high BMI does not affect the development of diabetes
I will start by checking if the variables are normally distributed1
diabetes %>%
select(where(is.numeric)) %>%
lapply(shapiro.test)It appears all the numeric variables are not normally distributed so I will use non-parametric tests.
wilcox.test(bmi ~ outcome, data = diabetes)
# p-value < 2.2e-16P-value = < 0.0001 this shows that BMI significantly affects diabetes development
I want to test for the number of pregnancies, age, glucose and diabetes pedigree function as well
wilcox.test(pregnancies ~ outcome, data = diabetes_na)
# p-value = 3.745e-08wilcox.test(age ~ outcome, data = diabetes)
# p-value < 2.2e-16wilcox.test(glucose ~ outcome, data = diabetes)
# p-value < 2.2e-16wilcox.test(diabetes_pedigree_function ~ outcome, data = diabetes)
# p-value = 1.197e-06All tests show statistically significant relationships theorizing that diabetes is affected by more than one risk factor.
Conclusion
This analysis of the Pima Indians dataset revealed that diabetic patients differ significantly from non-diabetic patients across all measured variables (p < 0.001).
From the correlation heat map it appears that the number of pregnancies, age and diabetes pedigree function have the strongest correlation with developing diabetes.
Non diabetics had thier sugar levels around 100 mg/dl while diabetics had sugar levels mainly between 100 and 200 mg/dl , this and the wilcox test suggest a positive relationship between glucose levels and diabetes.
The observation that normal-weight patients in the diabetic group had the highest median glucose levels suggests high proportion of type 1 diabetics in the sample or it might also be cause by advanced type 2 diabetes cases with wasting due to inability of the cells to utilize glucose .
This dataset was originally introduced by Smith et al. (1988) to predict diabetes in Pima Indian women using clinical measurements. my analysis replicates and extends their exploratory approach using modern R tools.
Footnotes
I wrote this code before the correlation heat map and I decided the test according to it.↩︎