Early Stage Diabetes Risk Prediction
1 Introduction
Early diagnosis can help people keep diabetes from getting worse, or even treat it completely early on. Here I will analyze a data set of people with and without diabetes and come up with a possible solution to this problem.
Here is a little preview of the data we will be using
data <- read.csv("../data/diabetes_data_upload.csv", stringsAsFactors = T)
head(data)
And here are the libraries that will help our analysis
library(dplyr) # for data manipulation
library(ggplot2) # for data visualization
library(randomForest) # to build the predictive model using Random Forest algorithm
library(MLmetrics) # for model evaluation
2 Data Profiling
Before we get into the detailed analysis, we need to understand the data first
2.1 Dimensions
dim(data)
[1] 520 17
The data has 520 rows and 17 columns
2.2 Attributes
colnames(data)
[1] "Age" "Gender" "Polyuria"
[4] "Polydipsia" "sudden.weight.loss" "weakness"
[7] "Polyphagia" "Genital.thrush" "visual.blurring"
[10] "Itching" "Irritability" "delayed.healing"
[13] "partial.paresis" "muscle.stiffness" "Alopecia"
[16] "Obesity" "class"
Most of the attributes are self-explanatory, except for a few medical terms, which are explained below
- Polyuria : The production of large volumes of urine
- Polydipsia : An intense thirst which leads to drinking large quantities of water
- Polyphagia : Excessive eating or appetite
- Partial Paresis : Partial muscle weakness or paralysis
- Alopecia : Loss of hair
2.3 Missing Values
We need to check whether the data contain missing values or not
colSums(is.na(data))
               Age             Gender           Polyuria         Polydipsia
                 0                  0                  0                  0
sudden.weight.loss           weakness         Polyphagia     Genital.thrush
                 0                  0                  0                  0
   visual.blurring            Itching       Irritability    delayed.healing
                 0                  0                  0                  0
   partial.paresis   muscle.stiffness           Alopecia            Obesity
                 0                  0                  0                  0
             class
                 0
Fortunately there are no missing values in the data
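One caveat is that `is.na()` only catches values R has already parsed as `NA`; in a CSV, missing entries are sometimes stored as empty or whitespace-only strings. The sketch below shows one way to count both kinds. The `toy` data frame and the helper name `count_blank_or_na` are made up for illustration and are not part of the original analysis.

```r
# Hypothetical toy data with one NA, one empty string and one blank string
toy <- data.frame(
  Age      = c(40, NA, 58),
  Polyuria = c("Yes", "", "No"),
  class    = c("Positive", "Negative", " "),
  stringsAsFactors = FALSE
)

# Count entries that are NA or contain only whitespace, column by column
count_blank_or_na <- function(df) {
  sapply(df, function(col) {
    sum(is.na(col) | trimws(as.character(col)) == "")
  })
}

missing_counts <- count_blank_or_na(toy)
print(missing_counts)
```

Applied to the real data, a result of all zeros would confirm that no missing values are hiding as blank strings.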
2.4 Target Variable
The target variable is a binary class (Positive or Negative). We generally want the two classes to be reasonably balanced; otherwise we would have to rebalance the data, for example by upsampling the minority class
prop.table(table(data$class))
Negative Positive
0.3846154 0.6153846
As we can see, the data is not perfectly balanced and is skewed toward the Positive class (about 62% versus 38%). But the level of imbalance is still tolerable (it is nowhere near an extreme level such as a 1:10 ratio)
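If the imbalance had been more severe, upsampling the minority class could look roughly like the sketch below. It uses a small made-up data frame (`toy`) rather than the diabetes data, and all variable names are illustrative. In practice this would be applied to the training set only, so that duplicated rows do not leak into the test set.

```r
set.seed(42)

# Hypothetical imbalanced data: 3 Negative rows, 7 Positive rows
toy <- data.frame(
  x     = 1:10,
  class = factor(c(rep("Negative", 3), rep("Positive", 7)))
)

counts   <- table(toy$class)
minority <- names(counts)[which.min(counts)]
n_extra  <- max(counts) - min(counts)   # rows needed to even out the classes

# Resample minority rows with replacement and append them to the data
minority_rows <- which(toy$class == minority)
extra_rows <- minority_rows[sample.int(length(minority_rows), n_extra, replace = TRUE)]
balanced <- toy[c(seq_len(nrow(toy)), extra_rows), ]

print(table(balanced$class))
```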
3 Data Analysis and Visualization
The next step is to analyze the data. By doing so, we might be able to get some early insights about the data. This is usually done by posing questions about the data and attempting to answer them with various data mining and/or visualization techniques
1. How is age distributed for people with and without diabetes?
p1 <- ggplot(data[data$class=="Positive",], aes(x=Age)) +
geom_density(fill="skyblue") + theme_minimal() + labs(title = "With Diabetes")
p2 <- ggplot(data[data$class=="Negative",], aes(x=Age)) +
geom_density(fill="skyblue") + theme_minimal() + labs(title = "Without Diabetes")
gridExtra::grid.arrange(p1, p2, ncol=1)
What can we infer from these graphs?
- The age of people with diabetes is roughly normally distributed, with a mean of around 50 years
- Only a few people under 25 years old have diabetes
- For people without diabetes, age is spread fairly evenly across the whole range
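Visual impressions like these can be backed up with numbers. The sketch below summarises age per class and runs a Shapiro-Wilk normality test on each group. It runs on simulated data (`toy`), generated only so that the example is self-contained, so the values it prints say nothing about the real data set.

```r
set.seed(1)

# Simulated ages: roughly normal for Positive, roughly uniform for Negative
toy <- data.frame(
  Age   = c(round(rnorm(60, mean = 50, sd = 12)),
            round(runif(40, min = 20, max = 80))),
  class = rep(c("Positive", "Negative"), times = c(60, 40))
)

# Five-number summary plus mean of Age within each class
age_summary <- tapply(toy$Age, toy$class, summary)
print(age_summary)

# Shapiro-Wilk p-value per class; a small p-value suggests non-normality
normality_p <- tapply(toy$Age, toy$class, function(a) shapiro.test(a)$p.value)
print(normality_p)
```

On the real data, the same two calls with `data$Age` and `data$class` would show whether the "roughly normal" reading of the density plot holds up.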
2. How is gender related to the target variable?
gender_class <- as.data.frame(table(data$Gender, data$class))
gender_class$per <- gender_class$Freq/sum(gender_class$Freq)*100
colnames(gender_class) <- c("Gender", "Class", "Frequency", "Percentage")
gender_class[order(gender_class$Percentage),]
We can visualise the table for a better overall view
ggplot(gender_class, aes(Gender, Frequency, fill=Class)) + geom_col(position = "dodge") +
theme_minimal() + scale_fill_manual(values = c("green4", "red4"))
What can we conclude from this graph?
- Both genders are represented fairly evenly in the positive class, but only a very small share of the negative class is female. We might say that women in this data set are more likely to have diabetes, but men are at risk too
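Percentages of the grand total can be hard to compare across genders when the groups differ in size. A possible complement is to compute the share of each class within each gender and to test the association with a chi-squared test, as in this sketch. The counts in `toy` are invented for illustration and do not come from the diabetes data.

```r
# Hypothetical counts: 20 women (17 positive), 30 men (15 positive)
toy <- data.frame(
  Gender = rep(c("Female", "Male"), times = c(20, 30)),
  class  = c(rep("Positive", 17), rep("Negative", 3),
             rep("Positive", 15), rep("Negative", 15))
)

counts <- table(toy$Gender, toy$class)

# margin = 1 makes each row (gender) sum to 100%
row_share <- prop.table(counts, margin = 1) * 100
print(round(row_share, 1))

# Chi-squared test of independence between gender and class
gender_test <- chisq.test(counts)
print(gender_test$p.value)
```

With the real data, `table(data$Gender, data$class)` would take the place of `counts`.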
3. Which variables are most strongly associated with the target variable?
target <- data$class
levels(target) <- c("No", "Yes") # relabel so the target matches the Yes/No symptom columns
columns <- c()
linear <- c()
reversed <- c()
# columns 3 to 16 are the Yes/No symptoms (Age, Gender and class are skipped)
for(i in 3:(ncol(data)-1)){
  columns <- c(columns, colnames(data)[i])
  similar <- data[,i]==target # TRUE where the symptom agrees with the diagnosis
  linear <- c(linear, mean(similar)*100) # % of rows that agree
  reversed <- c(reversed, mean(!similar)*100) # % of rows that disagree
}
results <- data.frame(columns, linear, reversed)
results <- results[order(results$linear, decreasing = TRUE),]
rownames(results) <- seq_len(nrow(results))
results
Based on the table above:
- Polyuria and Polydipsia have a similarity higher than 80% with diabetes
This means that Polyuria or Polydipsia alone can classify someone as diabetic or not with an accuracy above 80%. We can simply conclude that people with Polyuria and/or Polydipsia have a very high risk of diabetes. Following a rule of thumb of removing variables with a correlation above 70%, I drop both of them from the data before modeling
data <- data %>% select(-c(Polyuria, Polydipsia))
- The other variables have a similarity below 70%, so they can’t perform as a single predictor
We will later use these variables to create a predictive model
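The similarity percentage used here is simply the accuracy of the rule "predict Positive whenever the symptom is Yes". As a rough sketch, it can also be computed without an explicit loop. The `toy` data frame below is invented for illustration, and the column names `linear` and `reversed` mirror the ones used in the table.

```r
# Hypothetical toy data with two symptoms and the diagnosis
toy <- data.frame(
  Polyuria = c("Yes", "Yes", "No", "No", "Yes", "No"),
  Itching  = c("Yes", "No", "Yes", "No", "No", "Yes"),
  class    = c("Positive", "Positive", "Negative", "Negative", "Positive", "Positive"),
  stringsAsFactors = FALSE
)

has_diabetes <- toy$class == "Positive"
symptoms     <- setdiff(names(toy), "class")

# Share of rows (in %) where "symptom present" matches "diabetes present"
agreement <- sapply(toy[symptoms], function(col) {
  mean((col == "Yes") == has_diabetes) * 100
})

results_toy <- data.frame(linear = agreement, reversed = 100 - agreement)
results_toy <- results_toy[order(results_toy$linear, decreasing = TRUE), , drop = FALSE]
print(results_toy)
```

Using `mean()` on the logical vector also keeps working when a symptom agrees with the target in every row or in none.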
4 Modeling
To evaluate the model performance, we have to split the data into a train set and a test set. A common choice is a 75-25 split
library(rsample)
set.seed(42) # set the seed before the random split so the partition is reproducible
splitted <- initial_split(data, prop = .75, strata = "class")
train <- training(splitted)
test <- testing(splitted)
4.1 Logistic Regression
First we will use the logistic regression model
set.seed(42)
model1 <- glm(class~., train, family = "binomial")
predicted <- round(predict(model1, test, type = "response"))
predicted <- ifelse(predicted==1, "Positive", "Negative")
Accuracy(predicted, test$class)
[1] 0.8153846
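Accuracy alone does not show which kind of mistake the model makes, and for a screening tool a missed diabetic (false negative) is usually more costly than a false alarm. The sketch below derives sensitivity and specificity from a confusion matrix. The `actual` and `predicted` vectors are made up for illustration; with the real model they would be `test$class` and the predicted labels.

```r
# Hypothetical true and predicted labels for ten patients
lv <- c("Negative", "Positive")
actual    <- factor(c("Positive", "Positive", "Positive", "Positive", "Negative",
                      "Negative", "Negative", "Positive", "Negative", "Positive"),
                    levels = lv)
predicted <- factor(c("Positive", "Positive", "Negative", "Positive", "Negative",
                      "Positive", "Negative", "Positive", "Negative", "Positive"),
                    levels = lv)

# Rows are predictions, columns are the true classes
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["Positive", "Positive"] / sum(cm[, "Positive"]) # diabetics correctly flagged
specificity <- cm["Negative", "Negative"] / sum(cm[, "Negative"]) # healthy correctly cleared

print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```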
4.2 Random Forest
set.seed(42)
model2 <- randomForest(class~., train)
Accuracy(predict(model2, test), test$class)
[1] 0.9230769
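Both models reach good accuracy on the test set, about 82% for logistic regression and 92% for random forest. Because the data set is relatively small, results like these can hide overfitting, so they should be read with some caution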
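One way to get a steadier estimate on a small data set is k-fold cross-validation, where every row is used for testing exactly once. The following is a minimal base R sketch for a logistic regression, run on simulated data (`toy`, with invented predictors `symptom_a` and `symptom_b`). It is not the evaluation used above, only an illustration of the idea.

```r
set.seed(42)

# Simulated binary predictors and an outcome that depends on them
n <- 200
toy <- data.frame(
  symptom_a = rbinom(n, 1, 0.5),
  symptom_b = rbinom(n, 1, 0.3)
)
p_positive <- plogis(-1 + 2 * toy$symptom_a + 1.5 * toy$symptom_b)
toy$class <- factor(ifelse(runif(n) < p_positive, "Positive", "Negative"),
                    levels = c("Negative", "Positive"))

# Randomly assign each row to one of k folds
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Fit on k - 1 folds, measure accuracy on the held-out fold
fold_accuracy <- sapply(seq_len(k), function(i) {
  fit  <- glm(class ~ ., data = toy[fold != i, ], family = binomial)
  prob <- predict(fit, newdata = toy[fold == i, ], type = "response")
  pred <- ifelse(prob > 0.5, "Positive", "Negative")
  mean(pred == as.character(toy$class[fold == i]))
})

print(round(fold_accuracy, 3))
print(c(mean = mean(fold_accuracy), sd = sd(fold_accuracy)))
```

The spread across folds gives a sense of how much a single 75-25 split could vary by chance.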
5 Variable Importance
We can apply the model for a quick diagnosis, but we would also like to find out which variables have the highest impact on the outcome, so that patients can start managing their lifestyle with those variables in mind
summary(model1)
Call:
glm(formula = class ~ ., family = "binomial", data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3404 -0.4420 0.1398 0.4242 2.7048
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.372273 0.778382 0.478 0.632461
Age -0.005119 0.018422 -0.278 0.781112
GenderMale -2.495479 0.436503 -5.717 1.08e-08 ***
sudden.weight.lossYes 1.578770 0.368963 4.279 1.88e-05 ***
weaknessYes 0.438702 0.380966 1.152 0.249505
PolyphagiaYes 0.726740 0.426973 1.702 0.088742 .
Genital.thrushYes 2.465916 0.455098 5.418 6.01e-08 ***
visual.blurringYes 0.764359 0.452850 1.688 0.091433 .
ItchingYes -1.005941 0.422515 -2.381 0.017273 *
IrritabilityYes 1.692227 0.436482 3.877 0.000106 ***
delayed.healingYes -0.223034 0.407645 -0.547 0.584291
partial.paresisYes 1.638146 0.385037 4.255 2.10e-05 ***
muscle.stiffnessYes 0.259082 0.442972 0.585 0.558635
AlopeciaYes -0.732612 0.384885 -1.903 0.056981 .
ObesityYes -0.040079 0.448588 -0.089 0.928808
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 519.70 on 389 degrees of freedom
Residual deviance: 256.45 on 375 degrees of freedom
AIC: 286.45
Number of Fisher Scoring iterations: 6
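Earlier, we identified Polyuria and Polydipsia as variables that play a major role in determining whether a person has diabetes. Here, Gender, Sudden Weight Loss, Genital Thrush, Irritability, and Partial Paresis are all statistically significant (p < 0.001). The four symptoms have positive coefficients, so they are associated with a higher risk of diabetes, while the negative coefficient for GenderMale indicates that men have lower odds than women in this data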
This is further validated by the random forest model
importance <- as.data.frame(model2$importance)
importance[order(importance$MeanDecreaseGini, decreasing=T), , drop=F]
The list goes from the most important variables to the least important, with a higher MeanDecreaseGini indicating a more important variable.
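The logistic regression coefficients reported in the summary are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios, as sketched below for the five most significant terms. The estimates and standard errors are copied from the summary output, and the intervals are approximate 95% Wald intervals (estimate ± 1.96 standard errors), which is my choice for illustration rather than something computed in the original analysis.

```r
# Estimates and standard errors taken from summary(model1)
estimate <- c(GenderMale            = -2.495479,
              sudden.weight.lossYes =  1.578770,
              Genital.thrushYes     =  2.465916,
              IrritabilityYes       =  1.692227,
              partial.paresisYes    =  1.638146)
std_error <- c(0.436503, 0.368963, 0.455098, 0.436482, 0.385037)

# Odds ratio and approximate 95% Wald interval for each term
odds_ratio <- exp(estimate)
ci_lower   <- exp(estimate - 1.96 * std_error)
ci_upper   <- exp(estimate + 1.96 * std_error)

or_table <- data.frame(odds_ratio, ci_lower, ci_upper)
print(round(or_table[order(or_table$odds_ratio, decreasing = TRUE), ], 2))
```

Read this way, genital thrush multiplies the odds of a positive class by roughly 12 with the other variables held fixed, while being male multiplies them by roughly 0.08.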
6 Closing
6.1 Conclusion
- The age of people with diabetes is roughly normally distributed, with a mean of around 50 years
- In this data set, women are more likely to have diabetes than men
- Polyuria and Polydipsia are the strongest single indicators of diabetes
- Sudden weight loss, genital thrush, irritability, and partial paresis are among the variables associated with a significantly higher risk of diabetes
6.2 Solution
Finally, I can present a solution to this problem in the form of a dashboard, which displays whether someone has a high risk of diabetes, as well as the factors they should be aware of. With this dashboard, users can hopefully become more aware of their condition and take better care of themselves based on the result.
You can find the dashboard here : https://rogate16.shinyapps.io/early_risk/