Early Stage Diabetes Risk Prediction

1 Introduction

Early diagnosis can keep the disease from worsening, or even allow it to be treated completely while it is still in its early stages. Here I will analyze a data set of people with and without diabetes and come up with a possible solution to this problem.

Here is a little preview of the data we will be using:

data <- read.csv("../data/diabetes_data_upload.csv", stringsAsFactors = T)
head(data)

And here are the libraries that will help our analysis:

library(dplyr) # for data manipulation
library(ggplot2) # for data visualization
library(randomForest) # to build the predictive model using Random Forest algorithm
library(MLmetrics) # for model evaluation

2 Data Profiling

Before we get into the detailed analysis, we need to understand the data first.

2.1 Dimensions

dim(data)
[1] 520  17

The data has 520 rows and 17 columns.

2.2 Attributes

colnames(data)
 [1] "Age"                "Gender"             "Polyuria"          
 [4] "Polydipsia"         "sudden.weight.loss" "weakness"          
 [7] "Polyphagia"         "Genital.thrush"     "visual.blurring"   
[10] "Itching"            "Irritability"       "delayed.healing"   
[13] "partial.paresis"    "muscle.stiffness"   "Alopecia"          
[16] "Obesity"            "class"             

Most of the attributes are self-explanatory, except for some medical terms, explained below:

  1. Polyuria : The production of large volumes of urine
  2. Polydipsia : An intense thirst which leads to drinking large quantities of water
  3. Polyphagia : Excessive eating or appetite
  4. Partial Paresis : Partial muscle weakness or paralysis
  5. Alopecia : Loss of hair

2.3 Missing Values

We need to check whether the data contains missing values or not.

colSums(is.na(data))
               Age             Gender           Polyuria         Polydipsia 
                 0                  0                  0                  0 
sudden.weight.loss           weakness         Polyphagia     Genital.thrush 
                 0                  0                  0                  0 
   visual.blurring            Itching       Irritability    delayed.healing 
                 0                  0                  0                  0 
   partial.paresis   muscle.stiffness           Alopecia            Obesity 
                 0                  0                  0                  0 
             class 
                 0 

Fortunately, there are no missing values in the data.
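
Had there been missing values, the simplest options would be dropping the incomplete rows or imputing them. A hypothetical sketch, shown only for completeness and not needed for this data:

# hypothetical: drop rows with any NA, or impute a numeric column instead
# data <- na.omit(data)
# data$Age[is.na(data$Age)] <- median(data$Age, na.rm = TRUE)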

2.4 Target Variable

The target variable is a binary class (Positive or Negative). Ideally, the classes should be balanced; otherwise, we would have to upsample the minority class.

prop.table(table(data$class))

 Negative  Positive 
0.3846154 0.6153846 

As we can see, the data is not perfectly balanced: it leans toward the Positive class. But the level of imbalance is still tolerable (not extreme, such as a 1:10 ratio).
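
If the imbalance were severe, a minimal upsampling sketch in base R could look like the following (hypothetical; we do not need it for this data set):

# hypothetical upsampling: resample the minority class to match the majority
# set.seed(42)
# minority <- data[data$class == "Negative", ]
# majority <- data[data$class == "Positive", ]
# balanced <- rbind(majority,
#                   minority[sample(nrow(minority), nrow(majority), replace = TRUE), ])
# prop.table(table(balanced$class))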

3 Data Analysis and Visualization

The next step is to analyze the data. By doing so, we might be able to get some early insights. This is usually done by posing questions about the data and attempting to answer them with various data mining and/or visualization techniques.

1. How is age distributed for people with and without diabetes?

p1 <- ggplot(data[data$class=="Positive",], aes(x=Age)) + 
        geom_density(fill="skyblue") + theme_minimal() + labs(title = "With Diabetes")

p2 <- ggplot(data[data$class=="Negative",], aes(x=Age)) + 
        geom_density(fill="skyblue") + theme_minimal() + labs(title = "Without Diabetes")

gridExtra::grid.arrange(p1, p2, ncol=1)

What can we infer from these graphs?

  • The age of people with diabetes is roughly normally distributed, with a mean around 50 years old (see the quick check below)
  • Only a few people under 25 years old have diabetes
  • For people without diabetes, ages are spread fairly evenly across the whole range
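
As a quick numeric check of the first point:

# numeric summary of Age within each class
tapply(data$Age, data$class, summary)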

2. How does gender relate to the target variable?

gender_class <- as.data.frame(table(data$Gender, data$class))
gender_class$per <- gender_class$Freq/sum(gender_class$Freq)*100
colnames(gender_class) <- c("Gender", "Class", "Frequency", "Percentage")
gender_class[order(gender_class$Percentage),]

We can visualize the table for a better overall view:

ggplot(gender_class, aes(Gender, Frequency, fill=Class)) + geom_col(position = "dodge") + 
  theme_minimal() + scale_fill_manual(values = c("green4", "red4"))

What can we conclude from this graph?

  • Although both genders are spread fairly evenly within the positive class, only a very small percentage of females fall in the negative class. We might say that, in this data set, women are more likely to have diabetes, but men are at risk too (the test below puts a number on this association)
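
To quantify this association rather than judge it by eye, a chi-squared test of independence is one option:

# chi-squared test of independence between Gender and class
chisq.test(table(data$Gender, data$class))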

3. Which variables correlate most strongly with the target variable?

target <- data$class
levels(target) <- c("No", "Yes") # relabel the classes so they match the Yes/No symptom columns
columns <- c()
linear <- c()
reversed <- c()

# for each symptom column, record how often its Yes/No value agrees with the class
for(i in 3:(ncol(data)-1)){
  columns <- c(columns, colnames(data)[i])
  similar <- data[,i]==target
  linear <- c(linear, prop.table(table(similar))[2]*100)     # % of rows that agree
  reversed <- c(reversed, prop.table(table(similar))[1]*100) # % of rows that disagree
}

results <- data.frame(columns, linear, reversed)
results <- results[order(results$linear, decreasing = T),]
rownames(results) <- 1:14
results

Based on the table above:

  • Polyuria and Polydipsia have a similarity higher than 80% with the diabetes label
  • The other variables have a similarity below 70%, so none of them can work as a single predictor

The first point means that Polyuria or Polydipsia alone can classify whether someone is diabetic with an accuracy above 80%. In machine learning, variables with a correlation to the target higher than 70% are usually removed, so we drop them here. We can simply conclude that people with Polyuria and/or Polydipsia have a very high risk of diabetes.

data <- data %>% select(-c(Polyuria, Polydipsia))

We will later use the remaining variables to create a predictive model.
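
As an aside, the agreement table built by the loop above can be computed more compactly. An equivalent sketch (it would have to run before Polyuria and Polydipsia are dropped, so it is left commented out here):

# equivalent, more compact version of the agreement loop
# vars <- colnames(data)[3:(ncol(data) - 1)]
# linear <- sapply(vars, function(v) mean(data[[v]] == target) * 100)
# data.frame(columns = vars, linear = linear, reversed = 100 - linear)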

4 Modeling

To evaluate model performance, we have to split the data into a train set and a test set, usually in a 75:25 proportion.

library(rsample)
set.seed(42) # seed the RNG so the random 75:25 split is reproducible
splitted <- initial_split(data, prop = .75, strata = "class")
train <- training(splitted)
test <- testing(splitted)

4.1 Logistic Regression

First, we will use a logistic regression model.

set.seed(42)

model1 <- glm(class~., train, family = "binomial")
predicted <- round(predict(model1, test, type = "response")) # probabilities thresholded at 0.5
predicted <- ifelse(predicted==1, "Positive", "Negative")

Accuracy(predicted, test$class)
[1] 0.8153846
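
Accuracy alone can hide class-specific errors, so it is worth breaking the predictions down with a quick confusion matrix:

# confusion matrix: rows are predictions, columns are actual classes
table(predicted = predicted, actual = test$class)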

4.2 Random Forest

set.seed(42)

model2 <- randomForest(class~., train)
Accuracy(predict(model2, test), test$class)
[1] 0.9230769

Both models produce high accuracy. Such scores should be read with caution, though: since the data set is relatively small, this may well be a case of overfitting.
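
One way to get a less optimistic performance estimate on a small data set is k-fold cross-validation. A minimal base R sketch (my addition, not part of the original analysis):

# minimal 5-fold cross-validation for the random forest model
set.seed(42)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(data)))
cv_acc <- sapply(1:k, function(i) {
  fit <- randomForest(class ~ ., data[folds != i, ])
  Accuracy(predict(fit, data[folds == i, ]), data$class[folds == i])
})
mean(cv_acc)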

5 Variable Importance

We can apply the model for a quick diagnosis, but beyond that we'd like to find out which variables have the highest impact on the outcome, so that patients can start adjusting their lifestyle around those variables.

summary(model1)

Call:
glm(formula = class ~ ., family = "binomial", data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3404  -0.4420   0.1398   0.4242   2.7048  

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)            0.372273   0.778382   0.478 0.632461    
Age                   -0.005119   0.018422  -0.278 0.781112    
GenderMale            -2.495479   0.436503  -5.717 1.08e-08 ***
sudden.weight.lossYes  1.578770   0.368963   4.279 1.88e-05 ***
weaknessYes            0.438702   0.380966   1.152 0.249505    
PolyphagiaYes          0.726740   0.426973   1.702 0.088742 .  
Genital.thrushYes      2.465916   0.455098   5.418 6.01e-08 ***
visual.blurringYes     0.764359   0.452850   1.688 0.091433 .  
ItchingYes            -1.005941   0.422515  -2.381 0.017273 *  
IrritabilityYes        1.692227   0.436482   3.877 0.000106 ***
delayed.healingYes    -0.223034   0.407645  -0.547 0.584291    
partial.paresisYes     1.638146   0.385037   4.255 2.10e-05 ***
muscle.stiffnessYes    0.259082   0.442972   0.585 0.558635    
AlopeciaYes           -0.732612   0.384885  -1.903 0.056981 .  
ObesityYes            -0.040079   0.448588  -0.089 0.928808    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 519.70  on 389  degrees of freedom
Residual deviance: 256.45  on 375  degrees of freedom
AIC: 286.45

Number of Fisher Scoring iterations: 6

Earlier, we identified Polyuria and Polydipsia as variables that play a major role in determining whether a person has diabetes. Here, Gender, Sudden Weight Loss, Genital Thrush, Irritability, and Partial Paresis also prove significant as variables associated with the risk of diabetes.
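
For a programmatic view of the same thing, the significant coefficients can be pulled straight out of the summary:

# coefficients with a p-value below 0.05
coefs <- summary(model1)$coefficients
coefs[coefs[, "Pr(>|z|)"] < 0.05, ]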

This is further validated by the random forest model:

importance <- as.data.frame(model2$importance)
importance[order(importance$MeanDecreaseGini, decreasing=T), , drop=F]

The list runs from the most important variable to the least important: the higher the MeanDecreaseGini, the more important the variable.
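
The randomForest package also ships a built-in plot of the same measure, which is easier to scan:

# dot chart of variable importance, most important at the top
varImpPlot(model2)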

6 Closing

6.1 Conclusion

  1. Age for people with diabetes is roughly normally distributed, with a mean around 50 years old
  2. In this data set, women are more likely to have diabetes
  3. Polyuria and Polydipsia are the most significant variables in diagnosing diabetes
  4. Sudden weight loss, genital thrush, irritability, and partial paresis are among the variables significantly associated with an increased risk of diabetes

6.2 Solution

Finally, I can present a solution to this problem in the form of a dashboard, which displays whether someone has a high risk of diabetes, as well as the factors they should be aware of. With this dashboard, users can hopefully be more aware of their condition and take better care of themselves based on the diagnosis result.
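
A minimal sketch of the idea behind such a dashboard, i.e. turning user-supplied symptoms into a risk label (hypothetical; this is not the actual app code):

# hypothetical core of the dashboard: map model probabilities to a risk label
predict_risk <- function(newdata, model = model2) {
  prob <- predict(model, newdata, type = "prob")[, "Positive"]
  ifelse(prob > 0.5, "High risk", "Low risk")
}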

You can find the dashboard here: https://rogate16.shinyapps.io/early_risk/