Introduction
For this assignment, we will analyze a popular dataset from Kaggle.com focused on the detection and classification of diabetes in women. Known as the “Pima Indian Diabetes Dataset,” it has been used extensively in machine learning applications.
https://www.kaggle.com/datasets/whenamancodes/predict-diabities?select=diabetes.csv
I will explore this dataset and apply two machine learning topics of interest from this semester:
Clustering: I am interested in what types of clusters are produced when applying algorithms such as KMeans to our dataset. Specifically, I want to know how clustering can help reveal which groups of features are most associated with the presence of diabetes.
Classification: I am interested in testing a few classification methods, including random forest and SVM, on our dataset. How accurate will our results be, and when looking at feature importance plots, will we be able to draw parallels with what we learned from unsupervised learning?
Libraries and Data
library(tidyverse)
library(factoextra)
library(dplyr)
library(tidyr)
library(rpart)
library(rpart.plot)
library(lubridate)
library(skimr)
library(stringr)
library(corrplot)
library(ggplot2)
library(fpp3)
library(caret)
library(readr)
library(GGally)
library(tidymodels)
library(randomForest)
library(e1071)
library(caTools)
library(gridExtra)
# Load Datasets
diabetes <- "/Users/alecmccabe/Desktop/data 622/homework4/diabetes.csv"
data <- read_csv(diabetes)
Exploratory Data Analysis
colnames(data) <- c("Pregnancies","Glucose","Blood_pressure","Skin_thick","Insulin","Bmi","Family_history","Age","Class")
dim(data)
## [1] 768 9
There are 768 observations with 9 columns each: 8 features plus the target.
head(data)
Features include:
- Pregnancies: number of pregnancies the individual has had
- Glucose: measured glucose levels of patient
- BloodPressure: measured blood pressure of patient
- SkinThickness: measured skin thickness of patient
- Insulin: measured insulin levels of patient
- BMI: measured body mass index of patient
- DiabetesPedigreeFunction: a function scoring the risk of type 2 diabetes based on family history. A higher value implies a higher likelihood.
- Age: age of patient
And the target variable is described below:
- Outcome: target variable. Does the patient have diabetes (1 = yes, 0 = no)?
skim(data)
| Name | data |
| Number of rows | 768 |
| Number of columns | 9 |
| Column type frequency: | |
| numeric | 9 |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Pregnancies | 0 | 1 | 3.85 | 3.37 | 0.00 | 1.00 | 3.00 | 6.00 | 17.00 | ▇▃▂▁▁ |
| Glucose | 0 | 1 | 120.89 | 31.97 | 0.00 | 99.00 | 117.00 | 140.25 | 199.00 | ▁▁▇▆▂ |
| Blood_pressure | 0 | 1 | 69.11 | 19.36 | 0.00 | 62.00 | 72.00 | 80.00 | 122.00 | ▁▁▇▇▁ |
| Skin_thick | 0 | 1 | 20.54 | 15.95 | 0.00 | 0.00 | 23.00 | 32.00 | 99.00 | ▇▇▂▁▁ |
| Insulin | 0 | 1 | 79.80 | 115.24 | 0.00 | 0.00 | 30.50 | 127.25 | 846.00 | ▇▁▁▁▁ |
| Bmi | 0 | 1 | 31.99 | 7.88 | 0.00 | 27.30 | 32.00 | 36.60 | 67.10 | ▁▃▇▂▁ |
| Family_history | 0 | 1 | 0.47 | 0.33 | 0.08 | 0.24 | 0.37 | 0.63 | 2.42 | ▇▃▁▁▁ |
| Age | 0 | 1 | 33.24 | 11.76 | 21.00 | 24.00 | 29.00 | 41.00 | 81.00 | ▇▃▁▁▁ |
| Class | 0 | 1 | 0.35 | 0.48 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
We can see from the above skim function that we are not missing any data. Certain fields show left or right skews, which we can handle in subsequent steps.
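To back up the skew observation with a number, here is a quick optional check (a sketch using e1071::skewness, which is already loaded above); positive values suggest right skew, negative values left skew.
# Sample skewness for each numeric column (sketch)
data %>%
  keep(is.numeric) %>%
  summarise(across(everything(), e1071::skewness))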
Feature Distributions
data %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free", ncol = 3) +
geom_density(col = 'red') +
  geom_histogram(aes(y = stat(density)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
With respect to our target variable, we can see that the dataset is not balanced. We will want to resample our dataset in order to balance the target.
Data Balance
Our data is not balanced, which will be a problem for some of our classification analysis later on. We will deal with this in subsequent steps.
data$Class <- as.factor(data$Class)
prop.table(table(data$Class))
##
## 0 1
## 0.6510417 0.3489583
Feature Scaling
In order to compare features, we should scale them to the same magnitude.
features <- data %>% select(-c("Class"))
features <- as.data.frame(lapply(features, scale))
features %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free", ncol = 3) +
geom_density(col = 'red') +
  geom_histogram(aes(y = stat(density)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Checking Outliers
reshape2::melt(features) %>%
ggplot(aes(y = value)) +
geom_boxplot() +
  facet_wrap(variable~., scales = "free")
## No id variables; using all as measure variables
Some features show outliers in the data, including Insulin, BMI, DiabetesPedigreeFunction, and Age. However, looking at the skim() output from before, I don't believe any of these are true outliers. The oldest person in this study is 81 years old, which is perfectly acceptable. As for BMI, the largest value is 67, which is again within the realm of possibility. For our analysis, we will not touch these outliers. Instead, we will scale the data and use tree-based models when determining feature importance.
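As a quick sanity check on the unscaled data (a small sketch), we can confirm the maxima quoted above:
# Maximum values of the columns flagged as potential outliers
data %>%
  summarise(across(c(Insulin, Bmi, Family_history, Age), max))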
Check Correlation
#correlation matrix
corrplot(cor(features, use="complete"), method="number", type="upper", diag=F, tl.col="black", tl.srt=40, tl.cex=0.9, number.cex=0.85, title="Diagnostic measurements", mar=c(0,0,1,0))The features do not display concerning levels of collinearity. Not surprisingly, Age and # of Pregnancies are somewhat correlated. As woman age, they are more likely to have had more children.
Clustering
In K-means clustering, we want to define clusters so that the total intra-cluster variation is minimized. Each cluster is represented by its center (centroid). The main disadvantages of this method are its sensitivity to outliers, its dependence on the initial random selection of cluster centers, and the need to choose the number of clusters in advance.
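Formally, for K clusters $C_1, \dots, C_K$ with centroids $\mu_k$, K-means minimizes the total within-cluster sum of squares, the same "wss" criterion plotted by the elbow method below:
$$\mathrm{WSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$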
# optimal number of clusters - elbow
k1=fviz_nbclust(features, FUN = kmeans, method = "wss") +
geom_vline(xintercept = 3, linetype = 2, col="blue") +
labs(subtitle = "Elbow method - kmeans")
# optimal number of clusters - silhouette
k2=fviz_nbclust(features, FUN = kmeans, method = "silhouette") +
labs(subtitle = "Silhouette method - kmeans")
grid.arrange(k1, k2, ncol=2)
According to the output above, we can see that using either 2 or 3 clusters will produce the strongest results. We will use 3.
kmeans_3 <- eclust(features, "kmeans", k=3, hc_metric="euclidean", graph=FALSE)
# plots
k=fviz_cluster(list(data=features, cluster=kmeans_3$cluster),
ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
a <- fviz_silhouette(kmeans_3)
## cluster size ave.sil.width
## 1 1 334 0.25
## 2 2 214 0.10
## 3 3 220 0.15
grid.arrange(k, a, ncol=2)
Our data has a lot of overlap, and looking at the results above we can see that some of the values in the silhouette plot are negative, which further reinforces the idea of imperfect separation. That being said, we still see some clearly defined clusters and can begin profiling.
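If we want to quantify that overlap, a small sketch (assuming eclust() attaches its usual silinfo component) counts how many observations fall on the wrong side of their cluster boundary:
# Number of observations with a negative silhouette width (sketch; assumes kmeans_3$silinfo exists)
sum(kmeans_3$silinfo$widths[, "sil_width"] < 0)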
The below code compares the true tag for diabetes from our data against the cluster labels created with KMeans.
table(data$Class, kmeans_3$cluster)
##
## 1 2 3
## 0 288 99 113
## 1 46 115 107
And the below code here shows us our cluster definitions, with values for each dimension for all cluster centroids.
aggregate(features, by=list(cluster=kmeans_3$cluster), mean)
Describing the Clusters:
Cluster 1: “Young and Fit, with few pregnancies” - Lowest Age - Lowest Pregnancies - Lowest BMI
Cluster 2: “Unhealthy, Family History” - Highest Family History Score - Highest Blood Pressure - Highest Glucose Levels - Highest Skin thickness
Cluster 3: “Older, with the Most Pregnancies” - Highest Age - Highest Number of Pregnancies
Interestingly, by performing KMeans clustering on our data, we are able to create distinct groupings of the data and compare them against the presence of diabetes. We can see that, naturally, cluster 1 produces a very low rate of diabetes. This makes sense, as this cluster is dominated by younger, healthier women who have not yet had many pregnancies.
What is really interesting is comparing clusters 2 and 3. Cluster 2 has the highest prevalence of diabetes, which makes sense as this cluster is the least healthy and also has the strongest family history. However, cluster 3 has an almost equal rate of diabetes, which indicates how influential pregnancies are on the onset of diabetes in women!
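To quantify this comparison, a short sketch converts the contingency table above into within-cluster diabetes rates (column proportions):
# Proportion of diabetic (Class = 1) patients within each cluster
prop.table(table(data$Class, kmeans_3$cluster), margin = 2)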
Classification
Data Splitting
We will be using the caret package to run our analysis. First, we will upsample the minority class to balance the target, then split the data into training and test sets using an 80% split.
upsampled_data <- upSample(x = data[, -ncol(data)],
                           y = data$Class)
set.seed(1)
sample <- sample.split(upsampled_data$Class, SplitRatio = 0.8)
train_X <- subset(upsampled_data %>% select(-Class), sample == TRUE)
test_X <- subset(upsampled_data %>% select(-Class), sample == FALSE)
train_y <- subset(upsampled_data$Class, sample == TRUE)
test_y <- subset(upsampled_data$Class, sample == FALSE)
Decision Tree Model
While we will not be using decision trees as our final classifier, they are useful tools for assessing the dependencies among features for classification.
# fit a decision tree on all features (Class is a factor, so we use method = "class")
dt <- rpart(Class ~ .,
            method="class", data = data)
dt %>% rpart.plot(box.palette="RdBu", shadow.col="gray", nn=TRUE)
From the above tree, we can see that glucose, age, and BMI are all highly influential features for predicting diabetes. For individuals with low glucose levels, we typically rely on the family history predictor to determine diabetes risk.
Random Forest Model
# note: randomForest() has no preProcess argument and the tree-count argument is ntree;
# any transformation (e.g. Yeo-Johnson) would need to be applied before fitting
rf_model <- randomForest(train_X, train_y,
                         importance = TRUE,
                         ntree = 1000)
rf_pred <- predict(rf_model, newdata = test_X)
plot(rf_model)
randomForest::varImpPlot(rf_model,
                         main = "Variable Importance for Random Forest Model")
postResample(pred = rf_pred, obs = test_y)
## Accuracy Kappa
## 0.84 0.68
The Random Forest model produced very strong results, with an accuracy of 84% on a balanced dataset!
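Since accuracy alone can hide class-level errors in a medical setting, a quick sketch using caret's confusionMatrix reports sensitivity and specificity for the random forest predictions:
# Class-level performance for the random forest predictions (sketch)
confusionMatrix(data = rf_pred, reference = test_y)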
SVM Model
svmR_model <- train(train_X[,-1] %>% as.matrix(),
train_y %>% as_vector(),
method = "svmRadial",
preProc = c("YeoJohnson", "center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmR_pred <- predict(svmR_model, newdata = test_X[,-1] %>% as.matrix())
svmR_model$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1024
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.157252214714693
##
## Number of Support Vectors : 373
##
## Objective Function Value : -54265.66
## Training error : 0.01125
postResample(pred = svmR_pred, obs = test_y)
## Accuracy Kappa
## 0.76 0.52
The SVM model did not perform as well as the random forest model. SVMs tend to shine on high-dimensional datasets; this dataset contains fewer than ten features, so SVM may not be the most suitable model for the job.
Conclusion
We see some very interesting results after finalizing our analysis! From both an unsupervised and a supervised learning perspective, we derive meaningful insights. Additionally, the findings from the two approaches support each other.
From our clustering analysis, we were able to split our data into three logical groups, or clusters. These three groupings were essentially: - young, healthy women - women with health problems - older women who have had more children
We identified that groups 2 and 3 were most susceptible to developing diabetes, with group 2 being the most susceptible. From this we learn that diabetes risk is primarily driven by specific health metrics, family history, and the number of pregnancies a woman has had.
Looking at our classification / supervised analysis, we can ascertain that the most important features for predicting diabetes are: - Glucose (health) - Age - BMI (health) - Family History
Pregnancies was also influential, but not in the top five. This helps explain why group 2 from our clustering analysis had a higher prevalence of diabetes than group 3.