Abstract

Diabetes is a group of metabolic diseases characterized by chronic hyperglycemia, which can result from defects in insulin secretion, insulin action, or both (Kharroubi and Darwish 2015). In this analysis, we examine several clinical variables and their relationship to diabetes. We first analyze each variable and its correlation with the diabetes outcome. We then group variables that share a similar correlation (positive, negative, or neutral) and analyze each group further. Lastly, we fit a logistic regression model (a binomial generalized linear model) to determine which variables are good predictors of diabetes and which are not.

The data analysis will be broken down by the following sections:

Units:

summary(diabetes_data)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000
str(diabetes_data)
## tibble [768 x 9] (S3: tbl_df/tbl/data.frame)
##  $ Pregnancies             : int [1:768] 6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int [1:768] 148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int [1:768] 72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int [1:768] 35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int [1:768] 0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num [1:768] 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num [1:768] 0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int [1:768] 50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int [1:768] 1 0 1 0 1 0 1 0 1 1 ...
diabetes_data$Outcome <- as.factor(diabetes_data$Outcome)
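
The summary above reports a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI. A value of zero is not physiologically plausible for these measurements, so it may be a placeholder for a missing reading. This is an assumption, since the source does not document how missing values were coded. The sketch below counts zeros using only the first ten rows printed by str(), typed by hand into a hypothetical `first_rows` data frame.

```r
# First ten rows as printed by str(diabetes_data) above, entered by hand.
# `first_rows` is a hypothetical name used only for this illustration.
first_rows <- data.frame(
  Glucose       = c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125),
  BloodPressure = c(72, 66, 64, 66, 40, 74, 50, 0, 70, 96),
  SkinThickness = c(35, 29, 0, 23, 35, 0, 32, 0, 45, 0),
  Insulin       = c(0, 0, 0, 94, 168, 0, 88, 0, 543, 0),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31, 35.3, 30.5, 0)
)

# Count how many zeros each column contains
zero_counts <- colSums(first_rows == 0)
print(zero_counts)

# Assumption: treat zeros as missing (NA) so they do not pull the means down
cleaned <- first_rows
cleaned[cleaned == 0] <- NA
print(colMeans(cleaned, na.rm = TRUE))
```

Running the same count on the full diabetes_data would show how many rows are affected before deciding whether to recode them.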

Pearson’s Correlation - Heat Map

ggcorrplot(cor(data), hc.order = TRUE, lab = TRUE, lab_size = 3) +
  labs(title = "Correlation Between Variables and Outcome",
       subtitle = "Neutral and Positive Correlation",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

This heat map serves as our overview; we will now look more closely at each group of variables. We will test whether each variable differs significantly between diabetic and non-diabetic patients, and whether the results of the heat map agree with the logistic regression model fitted at the end of our analysis.
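
Because the outcome is coded 0/1, its Pearson correlation with a numeric variable is a point-biserial correlation, and testing that correlation is equivalent to a pooled-variance two-sample t-test. The sketch below illustrates this on simulated data; the `toy` data frame and its parameters are made up for illustration and are not the diabetes data. The sections that follow use Welch's t-test, which does not assume equal variances, so their p-values can differ slightly from the correlation test.

```r
set.seed(1)

# Simulated stand-in data (hypothetical values, NOT the real data set)
n <- 200
outcome <- rbinom(n, size = 1, prob = 0.35)
glucose <- rnorm(n, mean = 110 + 30 * outcome, sd = 25)  # shifted by outcome
blood_pressure <- rnorm(n, mean = 70, sd = 12)           # unrelated to outcome
toy <- data.frame(glucose, blood_pressure, outcome)

# Correlation of every variable with the 0/1 outcome (one heat map column)
r_outcome <- cor(toy)[, "outcome"]
print(round(r_outcome, 3))

# Test of the Pearson correlation between glucose and the outcome
cor_result <- cor.test(toy$glucose, toy$outcome)

# Pooled-variance t-test on the same variable, split by outcome
pooled_result <- t.test(glucose ~ outcome, data = toy, var.equal = TRUE)

# The two tests share the same |t| statistic and the same p-value
print(c(cor_t = abs(unname(cor_result$statistic)),
        pooled_t = abs(unname(pooled_result$statistic))))
```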

Blood Pressure and Skin Thickness vs Diabetes

We want to see if Blood Pressure and Skin Thickness are good indicators of diabetes. According to the heat map in the previous section, neither Blood Pressure nor Skin Thickness showed a notable positive or negative correlation with diabetes.

ggplot(diabetes_data, aes(x = Outcome, y = BloodPressure)) +
  geom_boxplot(fill = wes_palette("GrandBudapest2", n = 2)) +
  theme_dark() +
  labs(x = "Diabetic", 
       y = "Blood Pressure",
       title = "No Statistically Significant Difference",
       subtitle = "Between the Avg BP of Diabetic and Non-Diabetic",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  scale_x_discrete(limits = c("0", "1"),
                   labels = c("No", "Yes"))

t.test(diabetes_data$BloodPressure ~ diabetes_data$Outcome, mu = 0, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  diabetes_data$BloodPressure by diabetes_data$Outcome
## t = -1.7131, df = 471.31, p-value = 0.08735
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.669580  0.388326
## sample estimates:
## mean in group 0 mean in group 1 
##        68.18400        70.82463
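
To make the output above easier to read, the following sketch computes Welch's t statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand and compares them with t.test(). The data are simulated: the group sizes of 500 and 268 are implied by the Outcome mean of 0.349 over 768 rows, while the standard deviations are made-up assumptions. The helper `welch_by_hand` is a hypothetical name for illustration.

```r
set.seed(42)

# Simulated stand-ins for the two groups (NOT the real measurements)
group0 <- rnorm(500, mean = 68, sd = 18)  # non-diabetic
group1 <- rnorm(268, mean = 71, sd = 21)  # diabetic

# Hypothetical helper: Welch's unequal-variance t-test computed by hand
welch_by_hand <- function(x, y) {
  vx <- var(x) / length(x)  # squared standard error of mean(x)
  vy <- var(y) / length(y)  # squared standard error of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  p_value <- 2 * pt(-abs(t_stat), df)
  c(t = t_stat, df = df, p.value = p_value)
}

manual <- welch_by_hand(group0, group1)
builtin <- t.test(group0, group1, var.equal = FALSE)

print(manual)
print(c(t = unname(builtin$statistic),
        df = unname(builtin$parameter),
        p.value = builtin$p.value))
```

A negative t statistic here simply means the first group (non-diabetic) has the lower mean. When the 95 percent confidence interval for the difference contains zero, as it does for blood pressure above, the p-value is greater than 0.05.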

Next, we will analyze skin thickness in the same manner:

ggplot(diabetes_data, aes(x = Outcome, y = SkinThickness)) +
  geom_boxplot(fill = wes_palette("GrandBudapest2", n = 2)) +
  theme_dark() +
  labs(x = "Diabetic", 
       y = "Tricep Skin Thickness",
       title = "Borderline Statistically Significant Difference",
       subtitle = "Between the Avg Skin Thickness of Diabetic and Non-Diabetic",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  scale_x_discrete(limits = c("0", "1"),
                   labels = c("No", "Yes"))

t.test(diabetes_data$SkinThickness ~ diabetes_data$Outcome, mu = 0, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  diabetes_data$SkinThickness by diabetes_data$Outcome
## t = -1.9706, df = 472.1, p-value = 0.04936
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.993281565 -0.007076644
## sample estimates:
## mean in group 0 mean in group 1 
##        19.66400        22.16418

Pregnancy, Age, Diabetes Pedigree Function, and Insulin vs Diabetes

As shown in our heat map, Pregnancy, Age, Diabetes Pedigree Function, and Insulin all showed a slight positive correlation to diabetes outcome.

Let’s analyze each variable separately and run a t-test to assess its statistical significance.

diabetes_data$Outcome <- as.factor(diabetes_data$Outcome)

ggplot(diabetes_data, aes(x = Outcome, y = Pregnancies)) +
  geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
  theme_dark() +
  labs(x = "Diabetic", 
       y = "Number of Pregnancies",
       title = "Slight Statistically Significant Difference",
       subtitle = "Between the Avg # of Pregnancies of Diabetic and Non-Diabetic",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  scale_x_discrete(limits = c("0", "1"),
                   labels = c("No", "Yes"))

# Showing a positive relation: diabetic patients have more pregnancies on average than non-diabetic patients
t.test(diabetes_data$Pregnancies ~ diabetes_data$Outcome, mu = 0, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  diabetes_data$Pregnancies by diabetes_data$Outcome
## t = -5.907, df = 455.96, p-value = 6.822e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.089219 -1.046125
## sample estimates:
## mean in group 0 mean in group 1 
##        3.298000        4.865672
ggplot(diabetes_data, aes(x = Outcome, y = Age)) +
  geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
  theme_dark() +
  labs(x = "Diabetic", 
       y = "Age",
       title = "Slight Statistically Significant Difference",
       subtitle = "Between the Avg Age of Diabetic and Non-Diabetic",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  scale_x_discrete(limits = c("0", "1"),
                   labels = c("No", "Yes"))

# Showing a positive relation: diabetic patients are older on average than non-diabetic patients
t.test(diabetes_data$Age ~ diabetes_data$Outcome, mu = 0, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  diabetes_data$Age by diabetes_data$Outcome
## t = -6.9207, df = 575.78, p-value = 1.202e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.545092 -4.209236
## sample estimates:
## mean in group 0 mean in group 1 
##        31.19000        37.06716
ggplot(diabetes_data, aes(x = Outcome, y = DiabetesPedigreeFunction)) +
  geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
  theme_dark() +
  labs(x = "Diabetic", 
       y = "Diabetes Pedigree Function Score",
       title = "Slight Statistically Significant Difference",
       subtitle = "Between the Avg Diabetes Pedigree Function Score of Diabetic and Non-Diabetic",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  scale_x_discrete(limits = c("0", "1"),
                   labels = c("No", "Yes"))

# Showing a positive relation: diabetic patients have a higher diabetes pedigree function score on average than non-diabetic patients
t.test(diabetes_data$DiabetesPedigreeFunction ~ diabetes_data$Outcome, mu = 0, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  diabetes_data$DiabetesPedigreeFunction by diabetes_data$Outcome
## t = -4.5768, df = 454.51, p-value = 6.1e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.17262065 -0.06891135
## sample estimates:
## mean in group 0 mean in group 1 
##        0.429734        0.550500
ggplot(diabetes_data, aes(x = Outcome, y = Insulin)) +
  geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
  theme_dark() +
  labs(x = "Diabetic", 
       y = "Insulin (mu U/ml)",
       title = "Slight Statistically Significant Difference",
       subtitle = "Between the Avg Insulin Level of Diabetic and Non-Diabetic",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  scale_x_discrete(limits = c("0", "1"),
                   labels = c("No", "Yes"))

# Showing a positive relation: diabetic patients have on average higher Insulin levels than non-diabetic
t.test(diabetes_data$Insulin ~ diabetes_data$Outcome, mu = 0, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  diabetes_data$Insulin by diabetes_data$Outcome
## t = -3.3009, df = 415.75, p-value = 0.001047
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -50.32820 -12.75944
## sample estimates:
## mean in group 0 mean in group 1 
##         68.7920        100.3358

Glucose Levels and BMI vs Diabetes

ggplot(diabetes_data, aes(x = Outcome, y = Glucose)) +
  geom_boxplot(fill = wes_palette("Darjeeling1", n = 2)) +
  theme_dark() +
  labs(x = "Diabetic", 
       y = "Glucose (mg/dL)",
       title = "Strong Statistically Significant Difference",
       subtitle = "Between the Avg Glucose Level of Diabetic and Non-Diabetic",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  scale_x_discrete(limits = c("0", "1"),
                   labels = c("No", "Yes"))

t.test(diabetes_data$Glucose ~ diabetes_data$Outcome, mu = 0, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  diabetes_data$Glucose by diabetes_data$Outcome
## t = -13.752, df = 461.33, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -35.74707 -26.80786
## sample estimates:
## mean in group 0 mean in group 1 
##        109.9800        141.2575
ggplot(diabetes_data, aes(x = Outcome, y = BMI)) +
  geom_boxplot(fill = wes_palette("Darjeeling1", n = 2)) +
  theme_dark() +
  labs(x = "Diabetic", 
       y = "BMI (kg/m^2)",
       title = "Strong Statistically Significant Difference",
       subtitle = "Between the Avg BMI of Diabetic and Non-Diabetic",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  scale_x_discrete(limits = c("0", "1"),
                   labels = c("No", "Yes"))

t.test(diabetes_data$BMI ~ diabetes_data$Outcome, mu = 0, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  diabetes_data$BMI by diabetes_data$Outcome
## t = -8.6193, df = 573.47, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.940864 -3.735811
## sample estimates:
## mean in group 0 mean in group 1 
##        30.30420        35.14254

Logistic Regression Model - Binomial

predicted <- glm(Outcome ~ ., family = "binomial", data = diabetes_data)
summary(predicted)
## 
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = diabetes_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5566  -0.7274  -0.4159   0.7267   2.9297  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -8.4046964  0.7166359 -11.728  < 2e-16 ***
## Pregnancies               0.1231823  0.0320776   3.840 0.000123 ***
## Glucose                   0.0351637  0.0037087   9.481  < 2e-16 ***
## BloodPressure            -0.0132955  0.0052336  -2.540 0.011072 *  
## SkinThickness             0.0006190  0.0068994   0.090 0.928515    
## Insulin                  -0.0011917  0.0009012  -1.322 0.186065    
## BMI                       0.0897010  0.0150876   5.945 2.76e-09 ***
## DiabetesPedigreeFunction  0.9451797  0.2991475   3.160 0.001580 ** 
## Age                       0.0148690  0.0093348   1.593 0.111192    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 993.48  on 767  degrees of freedom
## Residual deviance: 723.45  on 759  degrees of freedom
## AIC: 741.45
## 
## Number of Fisher Scoring iterations: 5
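
The deviance figures in the summary above can be used for a quick check of overall fit. The sketch below works only from the printed numbers: it forms the likelihood-ratio statistic against the intercept-only model, recovers the reported AIC, and computes McFadden's pseudo R-squared. McFadden's measure is not part of the original analysis and is added here purely as an illustration. The AIC line assumes an ungrouped 0/1 response, for which the residual deviance equals minus twice the log-likelihood.

```r
# Values copied from the glm summary above
null_deviance  <- 993.48
null_df        <- 767
resid_deviance <- 723.45
resid_df       <- 759

# Likelihood-ratio test of the full model against the intercept-only model
lr_stat <- null_deviance - resid_deviance
lr_df   <- null_df - resid_df
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# AIC = deviance + 2 * (number of coefficients), for a 0/1 response
n_coef <- lr_df + 1  # eight predictors plus the intercept
aic <- resid_deviance + 2 * n_coef

# McFadden's pseudo R-squared (illustrative addition)
mcfadden_r2 <- 1 - resid_deviance / null_deviance

print(c(lr_stat = lr_stat, lr_df = lr_df, lr_p = lr_p,
        aic = aic, mcfadden_r2 = mcfadden_r2))
```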

Certain variables, such as Pregnancy, Blood Pressure, Age, and Insulin, have shifted in significance compared with their Pearson correlation values. This is because the logistic regression model does not assess each variable separately; it estimates the effect of each variable on the diabetic outcome while holding the other variables constant.
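
As a reading aid for the coefficient table, the sketch below uses the printed estimates and standard errors for three of the predictors. Each estimate is a change in the log-odds of diabetes per one-unit increase in the variable, so exponentiating it gives an odds ratio. The z value is the estimate divided by its standard error, and the p-value is the two-sided normal tail probability. The variable names are chosen for illustration only.

```r
# Estimates and standard errors copied from the summary table above
estimate <- c(Pregnancies = 0.1231823, Glucose = 0.0351637,
              BloodPressure = -0.0132955)
std_error <- c(Pregnancies = 0.0320776, Glucose = 0.0037087,
               BloodPressure = 0.0052336)

# Odds ratio per one-unit increase, holding the other variables fixed
odds_ratio <- exp(estimate)

# Odds ratio for a 10 mg/dL increase in glucose
odds_ratio_glucose_10 <- exp(10 * estimate[["Glucose"]])

# Wald z statistic and two-sided p-value, as reported by summary()
z_value <- estimate / std_error
p_value <- 2 * pnorm(-abs(z_value))

print(round(cbind(odds_ratio, z_value, p_value), 6))
print(odds_ratio_glucose_10)
```

An odds ratio above one (Pregnancies, Glucose) means the odds of diabetes rise with the variable, while an odds ratio below one (BloodPressure) means they fall once the other variables are accounted for.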

Now we will plot the fitted probabilities from our generalized linear model to see if it effectively separates the two groups. If so, patients without diabetes should have fitted values closer to zero and patients with diabetes should have fitted values closer to one:

# Pair each fitted probability with the observed outcome
probability_data <- data.frame(fitted.values = predicted$fitted.values, outcome = diabetes_data$Outcome)

# Sort by fitted probability and record each patient's rank
probability_data <- probability_data %>%
  arrange(fitted.values) %>%
  mutate(rank = row_number())

ggplot(probability_data, aes(x = rank, y = fitted.values, color = outcome)) +
  geom_point(alpha = 1, shape = 1, stroke = 2) +
  xlab("Rank") +
  ylab("Predicted Probability of Having Diabetes")
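
Each fitted value in the plot is the inverse logit of a linear combination of the coefficients and a patient's measurements. The sketch below reproduces this by hand for the first two patients, using the coefficients from the summary and the values printed by str() earlier. The str() output is rounded, so the results are approximate; the names `beta` and `patients` are hypothetical.

```r
# Coefficients copied from the glm summary above
beta <- c(Intercept = -8.4046964, Pregnancies = 0.1231823,
          Glucose = 0.0351637, BloodPressure = -0.0132955,
          SkinThickness = 0.0006190, Insulin = -0.0011917,
          BMI = 0.0897010, DiabetesPedigreeFunction = 0.9451797,
          Age = 0.0148690)

# First two patients as printed by str(); the leading 1 is the intercept term
patients <- rbind(
  patient1 = c(1, 6, 148, 72, 35, 0, 33.6, 0.627, 50),  # observed Outcome = 1
  patient2 = c(1, 1,  85, 66, 29, 0, 26.6, 0.351, 31)   # observed Outcome = 0
)

# Linear predictor (log-odds), then inverse logit to get a probability
log_odds <- as.vector(patients %*% beta)
probability <- plogis(log_odds)
names(probability) <- rownames(patients)

print(round(probability, 3))
```

Under these inputs the first patient, who is diabetic, lands well above 0.5 and the second, who is not, lands close to zero, which is the separation the plot is meant to show.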

Good Predictors of Diabetes

Now that we know that Pregnancy, Glucose, BMI, Diabetes Pedigree Function, and Blood Pressure are all statistically significant predictors of diabetes in the model, we can examine each one individually:

preg_dp <- ggplot(diabetes_data, aes(x = Pregnancies, fill = Outcome)) +
  geom_density(size = 1, alpha = .5)

gluc_dp <- ggplot(diabetes_data, aes(x = Glucose, fill = Outcome)) +
  geom_density(size = 1, alpha = .5)

bmi_dp <- ggplot(diabetes_data, aes(x = BMI, fill = Outcome)) +
  geom_density(size = 1, alpha = .5)

dpf_dp <- ggplot(diabetes_data, aes(x = DiabetesPedigreeFunction, fill = Outcome)) +
  geom_density(size = 1, alpha = .5)

bp_dp <- ggplot(diabetes_data, aes(x = BloodPressure, fill = Outcome)) +
  geom_density(size = 1, alpha = .5)

multiplot(preg_dp, gluc_dp, bmi_dp, dpf_dp, bp_dp, cols = 2)

As noted by Corrado et al. (2003), it has been suggested that the incidence of gestational diabetes may be increased in multiple pregnancies, in addition to a decrease in insulin sensitivity that is modified by several factors (some of which have been explored within this analysis).

According to (Mellitus 2005), plasma glucose level is one of the criteria for diagnosing diabetes. The criteria stated in the article include: in a patient with classic symptoms of hyperglycemia or hyperglycemic crisis, a random plasma glucose greater than or equal to 200 mg/dL, or a fasting plasma glucose greater than or equal to 126 mg/dL.

A normal BMI is between 18.5 and 25. However, as BMI increases, so does the risk of developing type 2 diabetes. With a BMI of 30 to 39.99 there is a 20.1% greater risk, and with a BMI greater than 40 there is a 38.8% greater risk (Gray et al. 2015).

Unfortunately, Diabetes Pedigree Function scores seem to be specific to this particular data set; therefore, normal ranges have not been documented. However, the larger the score, the higher the likelihood of diabetes onset based on family medical history.

In regard to hypertension and diabetes, a diastolic blood pressure of 90-99 mmHg should be treated in diabetic patients (Volpe et al. 2015).

We notice that the strong predictors (those with small p-values) have a slightly shifted density plot (they do not overlap completely) and/or have a much wider density.

Conclusion

Uncontrolled diabetes can lead to stupor, coma, or even death if not treated (Kharroubi and Darwish 2015). Therefore, understanding the different variables and their correlation to diabetes is extremely important for early diagnosis and treatment. As shown in this analysis, there are great predictors of diabetes, good predictors of diabetes, and poor predictors of diabetes. Pregnancy, glucose levels, and BMI scores are all great predictors of diabetes, as they had the smallest p-values in the logistic regression model. Diabetes pedigree function scores and blood pressure are good predictors of diabetes, as they had slightly higher p-values yet were still statistically significant. When all of the good and great predictors are analyzed together, we can estimate the likelihood of diabetes in a patient. Within our data set, the majority of diabetic patients have 1-8 pregnancies, a glucose level of 125-175 mg/dL, and a BMI score of 32-40. They also had a diabetes pedigree function score of 0.25-0.75 and a blood pressure of 75-90 mmHg. Knowing these values and how they compare to “normal” ranges will allow us to make a more confident prediction for diagnosis.

Bibliography

Corrado, Francesco, Francesco Caputo, Graziella Facciolà, and Alfredo Mancuso. 2003. “Gestational Glucose Intolerance in Multiple Pregnancy.” Diabetes Care 26 (5): 1646–6.

Gray, Natallia, Gabriel Picone, Frank Sloan, and Arseniy Yashkin. 2015. “The Relationship Between BMI and Onset of Diabetes Mellitus and Its Complications.” Southern Medical Journal 108 (1): 29.

Kharroubi, Akram T, and Hisham M Darwish. 2015. “Diabetes Mellitus: The Epidemic of the Century.” World Journal of Diabetes 6 (6): 850.

Mellitus, Diabetes. 2005. “Diagnosis and Classification of Diabetes Mellitus.” Diabetes Care 28 (S37): S5–S10.

Volpe, Massimo, Allegra Battistoni, Carmine Savoia, and Giuliano Tocci. 2015. “Understanding and Treating Hypertension in Diabetic Populations.” Cardiovascular Diagnosis and Therapy 5 (5): 353.