Introduction

The data analyzed here were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases. They pertain to women of Pima Indian heritage, at least 21 years of age and living near Phoenix, Arizona, and include each woman's number of pregnancies, her age, whether or not she is diabetic, and various vitals. This report describes the process used to find a quantitative model that can predict whether or not women of this community will have diabetes.

Data

The data has 768 records and 9 columns. The column names and the first record can be seen here:

##   pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1        6     148       72      35       0 33.6    0.627  50      pos
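
The report does not say how PimaData was loaded. As a minimal sketch, assuming it is the PimaIndiansDiabetes data frame shipped with the mlbench package (an assumption, although its column names and first record match the output above), the dimensions and first record could be checked like this:

```r
# Assumption: PimaData is the PimaIndiansDiabetes data frame from mlbench.
library(mlbench)
data(PimaIndiansDiabetes)
PimaData <- PimaIndiansDiabetes

dim(PimaData)             # number of records and columns
head(PimaData, 1)         # first record, as printed above
table(PimaData$diabetes)  # class counts for the outcome
```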

Processing

The following logistic regression model was fit to all columns to determine which of them are statistically significant predictors of diabetes.

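library(caret)

# Logistic regression (glm) on all predictors, scored by 10-fold cross-validated ROC
fit <- train(diabetes ~ ., data = PimaData, method = "glm",
             metric = "ROC",
             trControl = trainControl(
               method = "cv",
               number = 10,
               verboseIter = FALSE,
               classProbs = TRUE,                  # class probabilities are needed for ROC
               summaryFunction = twoClassSummary   # reports ROC, Sens, and Spec
             ))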

This model produced a cross-validated ROC (area under the ROC curve) of 0.8329715. According to the glm model, the pregnant, glucose, mass, and pedigree columns were statistically significant. The remaining columns were excluded from the following modeling.
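
One way the significance check described above might be carried out is sketched below. It uses base R's glm on a small simulated data frame (toy, a hypothetical stand-in, since the report's data are not reproduced here) and a 0.05 cutoff. The cutoff is an assumption, as the report does not state the threshold used. For a caret fit such as the one above, the underlying glm object is available as fit$finalModel and can be summarized the same way.

```r
# Hypothetical stand-in data; not the report's PimaData.
set.seed(1)
n <- 200
toy <- data.frame(
  glucose = rnorm(n, mean = 120, sd = 30),
  age     = rnorm(n, mean = 35, sd = 10)
)
# The simulated outcome depends on glucose only.
p <- plogis(-6 + 0.05 * toy$glucose)
toy$diabetes <- factor(ifelse(runif(n) < p, "pos", "neg"),
                       levels = c("neg", "pos"))

# Logistic regression on all predictors.
fit_glm <- glm(diabetes ~ ., data = toy, family = binomial)

# Coefficient table: Estimate, Std. Error, z value, Pr(>|z|).
coefs <- coef(summary(fit_glm))
pvals <- coefs[-1, "Pr(>|z|)"]             # drop the intercept row
significant <- names(pvals)[pvals < 0.05]  # assumed 0.05 cutoff
print(significant)
```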

Model

The process used to develop the first model was repeated considering only the statistically significant columns. The new ROC is 0.8284017, which is slightly lower than the 0.8329715 of the first model, a difference of about 0.005.
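
As a rough sketch of this step, the reduced formula can be built from the four retained column names, and the two reported ROC values can be compared directly. The variable names below (keep, reduced_formula, roc_full, roc_reduced) are illustrative and do not come from the original analysis.

```r
# Columns reported as statistically significant in the first model.
keep <- c("pregnant", "glucose", "mass", "pedigree")

# Builds: diabetes ~ pregnant + glucose + mass + pedigree
reduced_formula <- reformulate(keep, response = "diabetes")
print(reduced_formula)

# ROC values as reported in the text.
roc_full    <- 0.8329715
roc_reduced <- 0.8284017
roc_change  <- roc_reduced - roc_full  # negative means the reduced model scored lower
print(roc_change)
```

The resulting formula could be passed to train in place of diabetes ~ . with the other arguments unchanged.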

Future Work

This model for predicting diabetes in women of Pima Indian heritage is a good starting point. However, more work is needed to improve it before it can properly assist the community in its future health efforts.