Diabetes is a group of metabolic diseases characterized by chronic hyperglycemia, which can result from defects in insulin secretion, insulin action, or both (Kharroubi and Darwish 2015). Throughout this analysis, we examine multiple variables and their association with diabetes. We first analyze each variable and its correlation with diabetes. From there, we group the variables that share similar correlations (positive, negative, or neutral) and analyze them further. Lastly, we fit a logistic regression model (a binomial generalized linear model) to identify which variables are good predictors of diabetes and which are not.
The data analysis will be broken down by the following sections:
Pearson’s Correlation - Heatmap
Blood Pressure and Skin Thickness vs Diabetes
Pregnancies, Age, Diabetes Pedigree Function, and Insulin vs Diabetes
Glucose Levels and BMI vs Diabetes
Linear Regression Model - Binomial
Good Predictors of Diabetes - Analysis
Units:
Pregnancies: number of times pregnant
Glucose: in mg/dL
Blood pressure: in mm Hg - based on the numbers, it seems that our values are of DBP (diastolic blood pressure)
SkinThickness: triceps skin fold thickness in mm
Insulin: in mu U/ml
BMI: in kg/m^2
DiabetesPedigreeFunction: likelihood of having diabetes based on familial history
Age: in years
Outcome: 0 if no diabetes and 1 if diabetic
summary(diabetes_data)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
str(diabetes_data)
## tibble [768 x 9] (S3: tbl_df/tbl/data.frame)
## $ Pregnancies : int [1:768] 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int [1:768] 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int [1:768] 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int [1:768] 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int [1:768] 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num [1:768] 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num [1:768] 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int [1:768] 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int [1:768] 1 0 1 0 1 0 1 0 1 1 ...
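The summary above reports a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI, and values such as a blood pressure or BMI of zero are not physiologically plausible. The following is a minimal sketch of how such zeros could be counted and recoded, assuming (this is not stated in the source) that a zero in these five columns means "not measured". The data frame `toy` and the vector `zero_as_missing` are hypothetical names; the values are the first five rows printed by `str()`.

```r
# Toy data frame: first five rows as printed by str(diabetes_data)
toy <- data.frame(
  Glucose       = c(148, 85, 183, 89, 137),
  BloodPressure = c(72, 66, 64, 66, 40),
  SkinThickness = c(35, 29, 0, 23, 35),
  Insulin       = c(0, 0, 0, 94, 168),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1)
)

# Assumption: in these columns a zero stands for a missing measurement
zero_as_missing <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

# Count the zeros per column before recoding
zero_counts <- colSums(toy[zero_as_missing] == 0)

# Recode zeros to NA so they no longer pull the group means down
toy_clean <- toy
toy_clean[zero_as_missing] <- lapply(
  toy_clean[zero_as_missing],
  function(x) replace(x, x == 0, NA)
)

print(zero_counts)
print(colMeans(toy_clean, na.rm = TRUE))
```

If the assumption holds, rerunning the comparisons below on the recoded columns may give different means and p-values, particularly for Insulin and SkinThickness, where zeros are frequent.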
diabetes_data$Outcome <- as.factor(diabetes_data$Outcome)
# `data` is assumed to be an all-numeric copy of the data set (Outcome kept as 0/1),
# because cor() cannot be applied to the factor column created above
ggcorrplot(cor(data), hc.order = TRUE, lab = TRUE, lab_size = 3) +
  labs(title = "Correlation Between Variables and Outcome",
       subtitle = "Neutral and Positive Correlation",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))
Based on this correlation plot, we are more interested in how outcome fares with the other numeric variables.
Blood pressure and skin thickness do not appear to show either a positive or negative correlation with Outcome.
Pregnancy, age, diabetes pedigree function, and insulin have a slightly positive correlation with outcome.
Glucose and BMI have a strong positive correlation with Outcome.
This will be our overview analysis, and we will now dive deeper into each section. We will see whether there is a statistically significant relationship between each variable and the diabetic outcome, and whether the results of the heat map correspond to the logistic regression model fitted at the end of our analysis.
We want to see if Blood Pressure and Skin Thickness are good indicators of Diabetes. According to our heat map in the previous section, both Blood Pressure and Skin Thickness did not show a positive or negative correlation to diabetes.
ggplot(diabetes_data, aes(x = Outcome, y = BloodPressure)) +
geom_boxplot(fill = wes_palette("GrandBudapest2", n = 2)) +
theme_dark() +
labs(x = "Diabetic",
y = "Blood Pressure",
title = "No Statistically Significant Difference",
subtitle = "Between the Avg BP of Diabetic and Non-Diabetic",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
scale_x_discrete(limits = c("0", "1"),
labels = c("No", "Yes"))
t.test(diabetes_data$BloodPressure ~ diabetes_data$Outcome, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: diabetes_data$BloodPressure by diabetes_data$Outcome
## t = -1.7131, df = 471.31, p-value = 0.08735
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.669580 0.388326
## sample estimates:
## mean in group 0 mean in group 1
## 68.18400 70.82463
Based on the t-test, there is no statistically significant difference at the 5% level between the average blood pressure of diabetic and non-diabetic patients (p = 0.087).
Zero also lies within the 95% confidence interval; therefore, we fail to reject the null hypothesis that the two average blood pressures are equal.
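To make the test statistic less of a black box, here is a minimal sketch of how Welch's t statistic, its degrees of freedom, and the two-sided p-value are computed, checked against `t.test()`. The vectors `group0` and `group1` are hypothetical values chosen for illustration, not rows from the data set.

```r
# Hypothetical samples standing in for the two Outcome groups
group0 <- c(68, 72, 64, 70, 74, 66)
group1 <- c(70, 78, 72, 76, 80, 74, 68)

n0 <- length(group0)
n1 <- length(group1)

# Squared standard error of each group mean
se2_0 <- var(group0) / n0
se2_1 <- var(group1) / n1

# Welch's t statistic: difference in means over the combined standard error
t_stat <- (mean(group0) - mean(group1)) / sqrt(se2_0 + se2_1)

# Welch-Satterthwaite approximation of the degrees of freedom
df <- (se2_0 + se2_1)^2 / (se2_0^2 / (n0 - 1) + se2_1^2 / (n1 - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

# The built-in test should agree with the manual calculation
builtin <- t.test(group0, group1, var.equal = FALSE)
print(c(manual_t = t_stat, manual_df = df, manual_p = p_value))
print(builtin)
```

Because the variances are not pooled, the degrees of freedom are generally not whole numbers, which is why the outputs in this report show values such as df = 471.31.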
Next, we will analyze skin thickness in the same manner:
ggplot(diabetes_data, aes(x = Outcome, y = SkinThickness)) +
geom_boxplot(fill = wes_palette("GrandBudapest2", n = 2)) +
theme_dark() +
labs(x = "Diabetic",
y = "Tricep Skin Thickness",
title = "Borderline Statistically Significant Difference",
subtitle = "Between the Avg Skin Thickness of Diabetic and Non-Diabetic",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
scale_x_discrete(limits = c("0", "1"),
labels = c("No", "Yes"))
t.test(diabetes_data$SkinThickness ~ diabetes_data$Outcome, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: diabetes_data$SkinThickness by diabetes_data$Outcome
## t = -1.9706, df = 472.1, p-value = 0.04936
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.993281565 -0.007076644
## sample estimates:
## mean in group 0 mean in group 1
## 19.66400 22.16418
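Based on the t-test, the difference between the average triceps skin thickness of diabetic and non-diabetic patients is only borderline significant at the 5% level (p = 0.049), and the 95% confidence interval just barely excludes zero.
As with blood pressure, the evidence for a relationship between skin thickness and diabetic outcome is weak.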
As shown in our heat map, Pregnancy, Age, Diabetes Pedigree Function, and Insulin all showed a slight positive correlation to diabetes outcome.
Let’s analyze each separately and run a t.test to show statistical significance of each variable.
diabetes_data$Outcome <- as.factor(diabetes_data$Outcome)
ggplot(diabetes_data, aes(x = Outcome, y = Pregnancies)) +
geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
theme_dark() +
labs(x = "Diabetic",
y = "Number of Pregnancies",
title = "Slight Statistically Significant Difference",
subtitle = "Between the Avg # of Pregnancies of Diabetic and Non-Diabetic",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
scale_x_discrete(limits = c("0", "1"),
labels = c("No", "Yes"))
# Showing a positive relation: diabetic patients have more pregnancies on average than non-diabetic patients
t.test(diabetes_data$Pregnancies ~ diabetes_data$Outcome, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: diabetes_data$Pregnancies by diabetes_data$Outcome
## t = -5.907, df = 455.96, p-value = 6.822e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.089219 -1.046125
## sample estimates:
## mean in group 0 mean in group 1
## 3.298000 4.865672
ggplot(diabetes_data, aes(x = Outcome, y = Age)) +
geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
theme_dark() +
labs(x = "Diabetic",
y = "Age",
title = "Slight Statistically Significant Difference",
subtitle = "Between the Avg Age of Diabetic and Non-Diabetic",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
scale_x_discrete(limits = c("0", "1"),
labels = c("No", "Yes"))
# Showing a positive relation: diabetic patients are older on average than non-diabetic patients
t.test(diabetes_data$Age ~ diabetes_data$Outcome, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: diabetes_data$Age by diabetes_data$Outcome
## t = -6.9207, df = 575.78, p-value = 1.202e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.545092 -4.209236
## sample estimates:
## mean in group 0 mean in group 1
## 31.19000 37.06716
ggplot(diabetes_data, aes(x = Outcome, y = DiabetesPedigreeFunction)) +
geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
theme_dark() +
labs(x = "Diabetic",
y = "Diabetes Pedigree Function Score",
title = "Slight Statistically Significant Difference",
subtitle = "Between the Avg Diabetes Pedigree Function Score of Diabetic and Non-Diabetic",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
scale_x_discrete(limits = c("0", "1"),
labels = c("No", "Yes"))
# Showing a positive relation: diabetic patients have a higher diabetes pedigree function score on average than non-diabetic patients
t.test(diabetes_data$DiabetesPedigreeFunction ~ diabetes_data$Outcome, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: diabetes_data$DiabetesPedigreeFunction by diabetes_data$Outcome
## t = -4.5768, df = 454.51, p-value = 6.1e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.17262065 -0.06891135
## sample estimates:
## mean in group 0 mean in group 1
## 0.429734 0.550500
ggplot(diabetes_data, aes(x = Outcome, y = Insulin)) +
geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
theme_dark() +
labs(x = "Diabetic",
y = "Insulin (mu U/ml)",
title = "Slight Statistically Significant Difference",
subtitle = "Between the Avg Insulin Level of Diabetic and Non-Diabetic",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
scale_x_discrete(limits = c("0", "1"),
labels = c("No", "Yes"))
# Showing a positive relation: diabetic patients have on average higher Insulin levels than non-diabetic
t.test(diabetes_data$Insulin ~ diabetes_data$Outcome, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: diabetes_data$Insulin by diabetes_data$Outcome
## t = -3.3009, df = 415.75, p-value = 0.001047
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -50.32820 -12.75944
## sample estimates:
## mean in group 0 mean in group 1
## 68.7920 100.3358
All four variables above show a statistically significant positive relationship with diabetic outcome: in each case the diabetic group has the higher mean.
None of the 95% confidence intervals includes zero. Therefore, we reject the null hypothesis of equal means for each variable.
ggplot(diabetes_data, aes(x = Outcome, y = Glucose)) +
geom_boxplot(fill = wes_palette("Darjeeling1", n = 2)) +
theme_dark() +
labs(x = "Diabetic",
y = "Glucose (mg/dL)",
title = "Strong Statistically Significant Difference",
subtitle = "Between the Avg Glucose Level of Diabetic and Non-Diabetic",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
scale_x_discrete(limits = c("0", "1"),
labels = c("No", "Yes"))
t.test(diabetes_data$Glucose ~ diabetes_data$Outcome, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: diabetes_data$Glucose by diabetes_data$Outcome
## t = -13.752, df = 461.33, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -35.74707 -26.80786
## sample estimates:
## mean in group 0 mean in group 1
## 109.9800 141.2575
ggplot(diabetes_data, aes(x = Outcome, y = BMI)) +
geom_boxplot(fill = wes_palette("Darjeeling1", n = 2)) +
theme_dark() +
labs(x = "Diabetic",
y = expression("BMI (kg/" * m^2 * ")"),
title = "Strong Statistically Significant Difference",
subtitle = "Between the Avg BMI of Diabetic and Non-Diabetic",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
scale_x_discrete(limits = c("0", "1"),
labels = c("No", "Yes"))
t.test(diabetes_data$BMI ~ diabetes_data$Outcome, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: diabetes_data$BMI by diabetes_data$Outcome
## t = -8.6193, df = 573.47, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.940864 -3.735811
## sample estimates:
## mean in group 0 mean in group 1
## 30.30420 35.14254
predicted <- glm(Outcome ~ ., family = "binomial", data = diabetes_data)
summary(predicted)
##
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = diabetes_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5566 -0.7274 -0.4159 0.7267 2.9297
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.4046964 0.7166359 -11.728 < 2e-16 ***
## Pregnancies 0.1231823 0.0320776 3.840 0.000123 ***
## Glucose 0.0351637 0.0037087 9.481 < 2e-16 ***
## BloodPressure -0.0132955 0.0052336 -2.540 0.011072 *
## SkinThickness 0.0006190 0.0068994 0.090 0.928515
## Insulin -0.0011917 0.0009012 -1.322 0.186065
## BMI 0.0897010 0.0150876 5.945 2.76e-09 ***
## DiabetesPedigreeFunction 0.9451797 0.2991475 3.160 0.001580 **
## Age 0.0148690 0.0093348 1.593 0.111192
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 993.48 on 767 degrees of freedom
## Residual deviance: 723.45 on 759 degrees of freedom
## AIC: 741.45
##
## Number of Fisher Scoring iterations: 5
As shown in the summary, Pregnancies, Glucose, and BMI have the smallest p-values of all variables (as indicated by the three stars), indicating the strongest statistical significance.
Next, Diabetes Pedigree Function and Blood Pressure are statistically significant but with slightly higher p-values.
Lastly, Age, Insulin, and Skin Thickness all have large p-values (greater than 5%), which suggests they add little predictive value once the other variables are in the model.
Certain variables, such as Pregnancy, Blood Pressure, Age, and Insulin, have shifted in significance in comparison to their Pearson correlation value. This is because the logistic regression model does not assess each variable separately; it estimates each variable's effect on diabetic outcome while holding the other variables fixed.
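The estimates in the summary are on the log-odds scale, which is hard to read directly. A minimal sketch of converting them to odds ratios follows; the coefficients are copied from the summary above, while the names `coefs`, `odds_ratios`, and `or_glucose_10` are introduced here for illustration only.

```r
# Coefficients of the significant predictors, copied from summary(predicted)
coefs <- c(
  Pregnancies              = 0.1231823,
  Glucose                  = 0.0351637,
  BloodPressure            = -0.0132955,
  BMI                      = 0.0897010,
  DiabetesPedigreeFunction = 0.9451797
)

# exp() turns a log-odds coefficient into an odds ratio per one-unit increase,
# holding the other variables fixed
odds_ratios <- exp(coefs)

# Odds ratio for a 10 mg/dL increase in glucose instead of 1 mg/dL
or_glucose_10 <- exp(10 * coefs[["Glucose"]])

print(round(odds_ratios, 3))
print(round(or_glucose_10, 3))
```

An odds ratio above 1 means the odds of diabetes rise as the variable increases, and a value below 1 (as for BloodPressure here) means they fall, given the other variables. With the fitted object available, `exp(coef(predicted))` should give the same ratios for all terms at once.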
Now we will plot the graph of our generalized linear model to see if it effectively captures the relationship. If so, closer to zero should represent no diabetes and closer to one should represent diabetes:
probability_data <- data.frame(fitted.values = predicted$fitted.values, outcome = diabetes_data$Outcome)
probability_data <- probability_data %>%
arrange(fitted.values)
probability_data <- probability_data %>%
mutate(rank = 1:nrow(probability_data))
ggplot(probability_data, aes(x = rank, y = fitted.values, color = outcome)) +
geom_point(alpha = 1, shape = 1, stroke = 2) +
xlab("Rank") +
ylab("Predicted Probability of Having Diabetes")
Our graph makes sense, which suggests that our generalized linear model above (the predicted object) captures the relationship between the variables and diabetic outcome reasonably well.
Given a sample of the same variables, we can predict the likelihood of an individual having diabetes.
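As a rough illustration of such a prediction, the sketch below applies the coefficients from the summary to a single hypothetical patient. The patient's values are invented for this example and the names `coefs`, `patient`, `log_odds`, and `probability` are not part of the original analysis.

```r
# Coefficients copied from summary(predicted)
coefs <- c(
  "(Intercept)"            = -8.4046964,
  Pregnancies              = 0.1231823,
  Glucose                  = 0.0351637,
  BloodPressure            = -0.0132955,
  SkinThickness            = 0.0006190,
  Insulin                  = -0.0011917,
  BMI                      = 0.0897010,
  DiabetesPedigreeFunction = 0.9451797,
  Age                      = 0.0148690
)

# Hypothetical patient (values invented for illustration); the leading 1
# multiplies the intercept
patient <- c(
  "(Intercept)"            = 1,
  Pregnancies              = 2,
  Glucose                  = 150,
  BloodPressure            = 72,
  SkinThickness            = 30,
  Insulin                  = 100,
  BMI                      = 34,
  DiabetesPedigreeFunction = 0.5,
  Age                      = 40
)

# Linear predictor (log-odds), then the inverse logit to get a probability
log_odds <- sum(coefs * patient[names(coefs)])
probability <- plogis(log_odds)

print(probability)
```

With the fitted model in memory, the same number should come from `predict(predicted, newdata = <one-row data frame>, type = "response")`, which avoids copying coefficients by hand.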
Now that we know Pregnancy, Glucose, BMI, Diabetes Pedigree Function, and Blood Pressure are all good predictors of Diabetes, we can dive deeper and examine each individually:
preg_dp <- ggplot(diabetes_data, aes(x = Pregnancies, fill = Outcome)) +
geom_density(size = 1, alpha = .5)
gluc_dp <- ggplot(diabetes_data, aes(x = Glucose, fill = Outcome)) +
geom_density(size = 1, alpha = .5)
bmi_dp <- ggplot(diabetes_data, aes(x = BMI, fill = Outcome)) +
geom_density(size = 1, alpha = .5)
dpf_dp <- ggplot(diabetes_data, aes(x = DiabetesPedigreeFunction, fill = Outcome)) +
geom_density(size = 1, alpha = .5)
bp_dp <- ggplot(diabetes_data, aes(x = BloodPressure, fill = Outcome)) +
geom_density(size = 1, alpha = .5)
multiplot(preg_dp, gluc_dp, bmi_dp, dpf_dp, bp_dp, cols = 2)
Corrado et al. (2003) suggest that the incidence of gestational diabetes may be increased in multiple pregnancies, along with a decrease in insulin sensitivity that is modified by several factors (some of which have been explored within this analysis).
According to Mellitus (2005), diabetes can be diagnosed from plasma glucose levels: in a patient with classic symptoms of hyperglycemia or hyperglycemic crisis, a random plasma glucose greater than or equal to 200 mg/dL, or a fasting plasma glucose greater than or equal to 126 mg/dL.
A normal BMI is between 18.5 and 25. However, as BMI increases, so does the risk of developing type 2 diabetes. With a BMI of 30 to 39.99 there is a 20.1% greater risk, and with a BMI greater than 40 there is a 38.8% greater risk (Gray et al. 2015).
Unfortunately, Diabetes Pedigree Function scores seem to be specific to this particular data set; therefore, normal ranges have not been documented. However, the larger the score, the higher the likelihood of diabetes onset based on family medical history.
In regards to hypertension and diabetes, a diastolic blood pressure of 90-99 mmHg should be treated in diabetic patients (Volpe et al. 2015).
We notice that the strong predictors (those with the smallest p-values) have visibly shifted density plots (the two groups do not overlap completely) and/or a much wider density in one group.
Variables such as Diabetes Pedigree Function and Blood Pressure, which are statistically significant (p-value below 5%) but are not as strong predictors as the top three (Pregnancies, Glucose, and BMI), have largely overlapping density plots with minimal rightward shifts and similar widths.
Although the densities for diabetes pedigree function and blood pressure differ only slightly between the groups, the difference is still statistically significant, as shown by our logistic regression model.
Overlapping density plots give us a good understanding of the spread within our good predictors and how they vary between diabetic and non-diabetic patients.
Uncontrolled diabetes can lead to stupor, coma, or even death if not treated. (Kharroubi and Darwish 2015) Therefore, understanding the different variables and their correlation to diabetes is extremely important to early diagnosis/treatment. As shown in this analysis, there are great predictors of diabetes, good predictors of diabetes, and not good predictors of diabetes. Pregnancy, glucose levels, and BMI scores are all great predictors of diabetes as they had the smallest p-value. Diabetes pedigree function scores and blood pressure are good predictors of diabetes as they had slightly higher p-values yet were still statistically significant. When all of the good and great predictors are analyzed together, we can determine the likelihood of diabetes within a patient. Within our data set, the majority of diabetic patients have 1-8 pregnancies, a glucose level of 125-175mg/dL, and a BMI score of 32-40. They also had a diabetes pedigree function score of .25-.75 and a blood pressure of 75-90mmHg. Knowing these values and how they compare to “normal” ranges will allow us to make a confident prediction for diagnosis.
Corrado, Francesco, Francesco Caputo, Graziella Facciolà, and Alfredo Mancuso. 2003. “Gestational Glucose Intolerance in Multiple Pregnancy.” Diabetes Care 26 (5): 1646–6.
Gray, Natallia, Gabriel Picone, Frank Sloan, and Arseniy Yashkin. 2015. “The Relationship Between BMI and Onset of Diabetes Mellitus and Its Complications.” Southern Medical Journal 108 (1): 29.
Kharroubi, Akram T, and Hisham M Darwish. 2015. “Diabetes Mellitus: The Epidemic of the Century.” World Journal of Diabetes 6 (6): 850.
Mellitus, Diabetes. 2005. “Diagnosis and Classification of Diabetes Mellitus.” Diabetes Care 28 (S37): S5–S10.
Volpe, Massimo, Allegra Battistoni, Carmine Savoia, and Giuliano Tocci. 2015. “Understanding and Treating Hypertension in Diabetic Populations.” Cardiovascular Diagnosis and Therapy 5 (5): 353.