Introduction

This is an R Markdown document on classifying diabetes in the Pima Indian women dataset. The dataset is retrieved from the mlbench R package; the MASS package describes the data as follows:

A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, was tested for diabetes according to World Health Organization criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases. We used the 532 complete records after dropping the (mainly missing) data on serum insulin.

Note that the mlbench copy used here retains all 768 records; in several columns, implausible zero values appear to stand in for the missing measurements.

Exploratory Data Analysis

library(mlbench)  # provides the PimaIndiansDiabetes dataset
library(caret)    # preprocessing, partitioning, and model training
data(PimaIndiansDiabetes)

dim(PimaIndiansDiabetes)
## [1] 768   9

The dataset has 768 rows and 9 columns.

sum(is.na(PimaIndiansDiabetes))
## [1] 0

The dataset has no explicit NA values, although, as we will see, some zeros look suspiciously like coded missing values.

Now let’s look at the columns of the dataset.

str(PimaIndiansDiabetes)
## 'data.frame':    768 obs. of  9 variables:
##  $ pregnant: num  6 1 8 1 0 5 3 10 2 8 ...
##  $ glucose : num  148 85 183 89 137 116 78 115 197 125 ...
##  $ pressure: num  72 66 64 66 40 74 50 0 70 96 ...
##  $ triceps : num  35 29 0 23 35 0 32 0 45 0 ...
##  $ insulin : num  0 0 0 94 168 0 88 0 543 0 ...
##  $ mass    : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ pedigree: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ age     : num  50 31 32 21 33 30 26 29 53 54 ...
##  $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...

Column Descriptions:

pregnant: Number of times pregnant

glucose: Plasma glucose concentration (glucose tolerance test)

pressure: Diastolic blood pressure (mm Hg)

triceps: Triceps skin fold thickness (mm)

insulin: 2-Hour serum insulin (mu U/ml)

mass: Body mass index (weight in kg/(height in m)^2)

pedigree: Diabetes pedigree function

age: Age (years)

diabetes: Class variable (test for diabetes)
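
Several of these measurements cannot plausibly be zero, yet zeros show up in the str() output above. As a quick sanity check (a minimal sketch; the choice of columns is my assumption based on the descriptions), we can count them:

# Count zeros in columns where zero is not a plausible measurement;
# these likely encode missing values
zero_cols <- c("glucose", "pressure", "triceps", "insulin", "mass")
sapply(PimaIndiansDiabetes[zero_cols], function(x) sum(x == 0))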

Let’s check out the distribution of positives and negatives.

plot(PimaIndiansDiabetes$diabetes)

Our dataset has a negative (non-diabetic) rate of roughly 65.1 %.
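
The exact proportion can be computed directly from the class counts:

# Share of each class in the outcome variable
prop.table(table(PimaIndiansDiabetes$diabetes))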

Preprocessing

We start with some preprocessing before building the model.

# Remove near-zero-variance columns with caret's nearZeroVar
pimaData <- PimaIndiansDiabetes
nz <- nearZeroVar(pimaData, saveMetrics = TRUE)
pimaData <- pimaData[, !nz$nzv]
colnames(pimaData)
## [1] "pregnant" "glucose"  "pressure" "triceps"  "insulin"  "mass"    
## [7] "pedigree" "age"      "diabetes"

None of the columns have near-zero variance. Now let’s check the correlation between columns.

# Remove diabetes column before comparison
diabetes <- pimaData$diabetes
pimaData$diabetes <- NULL

# Remove columns with high pairwise correlation
df.correlated <- findCorrelation(cor(pimaData), cutoff = 0.60, verbose = TRUE, exact = TRUE)
## All correlations <= 0.6

# Guard the empty case: pimaData[, -integer(0)] would drop every column
if (length(df.correlated) > 0) {
  pimaData <- pimaData[, -df.correlated]
}

All pairwise correlations are below the 0.6 cutoff, so no columns need to be removed.
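
We can confirm this directly with a quick check of the correlation matrix:

# Largest absolute pairwise correlation, ignoring the diagonal
cors <- cor(pimaData)
max(abs(cors[upper.tri(cors)]))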

Partition

We can now divide pimaData into training and testing partitions.

# Bind back the diabetes column
pimaData <- cbind(diabetes, pimaData)

set.seed(1234)  # an arbitrary fixed seed, so the split is reproducible
index <- createDataPartition(y = pimaData$diabetes, p = 0.8)[[1]]

pima.train <- pimaData[index,]
pima.test  <- pimaData[-index,]
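
Because createDataPartition stratifies on the outcome, both partitions should preserve the roughly 65/35 class balance; a quick check:

# Class balance should be similar in both partitions
prop.table(table(pima.train$diabetes))
prop.table(table(pima.test$diabetes))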

Now let’s train our model: a logistic regression (method = "glm") evaluated with repeated 10-fold cross-validation, selecting on ROC AUC.

fit <- train(diabetes ~ ., data = pima.train, method = "glm",
             metric = "ROC",           # select on area under the ROC curve
             trControl = trainControl(
               method = "repeatedcv",  # 10-fold cross-validation...
               number = 10,
               repeats = 10,           # ...repeated 10 times
               classProbs = TRUE,      # class probabilities, needed for ROC
               summaryFunction = twoClassSummary,
               verboseIter = TRUE))

# Predict class probabilities; keep the probability of the positive class
pred <- predict(fit, pima.test, type = "prob")$pos

Testing our Model

fitted.results <- ifelse(pred > 0.5, 'pos', 'neg')

misClassificationError <- mean(fitted.results != pima.test$diabetes)
print(paste('Accuracy', 1 - misClassificationError))
## [1] "Accuracy 0.784313725490196"

The accuracy of our glm model is roughly 78.4 %, comfortably above the 65.1 % majority-class baseline. To improve on this model, one might try gradient boosting methods such as XGBoost or LightGBM.
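
As a starting point, caret exposes XGBoost through method = "xgbTree" (a minimal sketch, assuming the xgboost package is installed; tuning is left to caret's defaults):

# Gradient boosting via caret, selected on ROC as before
fit_xgb <- train(diabetes ~ ., data = pima.train, method = "xgbTree",
                 metric = "ROC",
                 trControl = trainControl(method = "cv", number = 10,
                                          classProbs = TRUE,
                                          summaryFunction = twoClassSummary))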