Diabetes is a chronic (long-lasting) health condition that affects how the body converts food into energy. It hampers the body's ability to make sufficient insulin or to use it effectively, which leaves too much sugar in the bloodstream. Over a prolonged period, excess blood sugar can lead to serious health problems such as heart disease, vision loss and kidney disease. There is no cure for diabetes yet, but it can be managed through healthier eating, weight loss and taking medicine as prescribed by a medical practitioner. This analysis therefore aims to train a model that can identify diabetes in patients using the k-nearest neighbors (kNN) algorithm.
Train a model that can predict diabetes in patients.
The data used for this analysis is from the National Institute of Diabetes and Digestive and Kidney Diseases and is made available on Kaggle by Mehmet Akturk.
The dataset contains entries only from women at least 21 years of age and of Pima Indian heritage, with the following features:
Pregnancies: Number of times pregnant.
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
BloodPressure: Diastolic blood pressure (mm Hg).
SkinThickness: Triceps skin fold thickness (mm).
Insulin: 2-Hour serum insulin (mu U/ml).
BMI: Body mass index (weight in kg/(height in m)^2).
DiabetesPedigreeFunction: Diabetes pedigree function.
Age: Age (years).
Outcome: Class variable (0 or 1) with the class value 1 representing those who tested positive for diabetes.
library(dplyr)
library(tidyr)
library(forcats)
library(ggplot2)
library(janitor)
library(gmodels)
library(class)
library(corrplot)
diabetes_df <- read.csv("diabetes.csv")
str(diabetes_df)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
The dataset contains data entries from 768 patients, with all the features/attributes being numeric values.
get_dupes(diabetes_df)
## No variable names specified - using all columns.
## No duplicate combinations found of: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome
## [1] Pregnancies Glucose BloodPressure
## [4] SkinThickness Insulin BMI
## [7] DiabetesPedigreeFunction Age Outcome
## [10] dupe_count
## <0 rows> (or 0-length row.names)
There are no duplicate entries in the dataset.
# Recode the 0/1 outcome as labelled categories, listing "Diabetic" first
diabetes_df$Outcome <- as.character(diabetes_df$Outcome)
diabetes_df <- diabetes_df %>%
  mutate(Outcome = fct_recode(Outcome, "Diabetic" = "1", "Non Diabetic" = "0"))
diabetes_df$Outcome <- factor(diabetes_df$Outcome,
                              levels = c("Diabetic", "Non Diabetic"),
                              labels = c("Diabetic", "Non Diabetic"))
Here, the values of the Outcome attribute are transformed from 1s and 0s to “Diabetic” and “Non Diabetic” to make the diagnosis easier to read.
diabetes_correlation_df <- diabetes_df[-9]
diabetes_correlation_df <- cor(diabetes_correlation_df)
corrplot(diabetes_correlation_df, method = "color", type = "lower", addCoef.col = "black", col = COL2("RdYlBu"), number.cex = 0.8, tl.cex = 0.8)
There are moderate positive correlations between Age and Pregnancies, and between Insulin and SkinThickness. This indicates that older patients tended to have had more pregnancies, and that patients with higher 2-hour serum insulin levels tended to have thicker triceps skin folds.
Weak positive correlations can also be observed between other pairs of attributes, including Insulin and Glucose, BMI and SkinThickness, BloodPressure and BMI, and Age and BloodPressure.
ggplot(data = diabetes_df, aes(x = Age)) + geom_histogram(bins = 30, color = "blue", fill = "lightblue") + facet_wrap(~Outcome) + theme_dark() + ylab("Number of Patients") + labs(title = "Age(s) of Patients")
The ages of the patients are skewed to the right, with most patients between 20 and 40 years old.
ggplot(data = diabetes_df, aes(x = BMI)) + geom_histogram(bins = 30, color = "blue", fill = "lightblue") + facet_wrap(~Outcome) + theme_dark() + ylab("Number of Patients") + labs(title = "BMI of Patients")
From the histogram above, the BMI distribution is roughly symmetric, but outliers with a BMI of 0 are clearly visible. A BMI of zero is impossible, which indicates that these entries are most likely errors or missing values recorded as 0.
ggplot(data = diabetes_df, aes(x = BloodPressure)) + geom_histogram(bins = 30, color = "blue", fill = "lightblue") + facet_wrap(~Outcome) + theme_dark() + ylab("Number of Patients") + labs(title = "Patient Blood Pressure")
As with the previous chart, outliers are also present in the blood pressure attribute. Because the outlying value is 0, these entries must be errors: a patient's diastolic blood pressure cannot drop to zero.
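One way to gauge how widespread such zero entries are is to count them per column. The sketch below is illustrative only: it uses the first ten rows of four columns as printed in the str() output above rather than the full file, and the name preview_df is hypothetical.

```r
# First ten rows of four columns, copied from the str() output above.
# preview_df is a hypothetical name used only for this illustration.
preview_df <- data.frame(
  BloodPressure = c(72, 66, 64, 66, 40, 74, 50, 0, 70, 96),
  SkinThickness = c(35, 29, 0, 23, 35, 0, 32, 0, 45, 0),
  Insulin       = c(0, 0, 0, 94, 168, 0, 88, 0, 543, 0),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31, 35.3, 30.5, 0)
)

# Count how many entries in each column are exactly zero.
zero_counts <- colSums(preview_df == 0)
print(zero_counts)
```

The same call could be applied to the numeric columns of the full data frame, for example colSums(diabetes_df[1:8] == 0). A zero is a legitimate value for Pregnancies, so that column would need to be read differently from the others.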
# Min-max scaling: maps each feature onto the range 0 to 1
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
diabetes_df_n <- as.data.frame(lapply(diabetes_df[1:8], normalize))
k-nearest neighbors classifies each observation using the Euclidean distance (the straight-line distance you would measure with a ruler between two points), so the features are normalized to a common 0-to-1 scale to ensure that each one contributes equally to the distance formula.
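The effect of rescaling on the distance can be seen in a small sketch. It uses three patients (rows 1, 2 and 5 of the str() output above) and two features; the helper euclidean() and the names toy and toy_n are hypothetical and exist only for this illustration.

```r
# Min-max scaling, as defined in the analysis.
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Euclidean distance between two numeric vectors (hypothetical helper).
euclidean <- function(a, b) {
  sqrt(sum((a - b)^2))
}

# Rows 1, 2 and 5 of the dataset, as shown in the str() output.
toy <- data.frame(
  Glucose = c(148, 85, 137),
  DiabetesPedigreeFunction = c(0.627, 0.351, 2.288)
)

# Raw distances from the first patient: Glucose dominates because its
# values are two orders of magnitude larger than the pedigree function.
raw_1_2 <- euclidean(unlist(toy[1, ]), unlist(toy[2, ]))
raw_1_3 <- euclidean(unlist(toy[1, ]), unlist(toy[3, ]))

# After scaling, both features lie in [0, 1] and contribute comparably.
toy_n <- as.data.frame(lapply(toy, normalize))
scaled_1_2 <- euclidean(unlist(toy_n[1, ]), unlist(toy_n[2, ]))
scaled_1_3 <- euclidean(unlist(toy_n[1, ]), unlist(toy_n[3, ]))

print(c(raw_1_2 = raw_1_2, raw_1_3 = raw_1_3))
print(c(scaled_1_2 = scaled_1_2, scaled_1_3 = scaled_1_3))
```

On the raw values, the large gap in DiabetesPedigreeFunction between the first and third patients barely changes their distance, whereas after scaling it accounts for most of it.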
diabetes_df_train <- diabetes_df_n[1:668, ]
diabetes_df_test <- diabetes_df_n[669:768, ]
diabetes_train_labels <- diabetes_df[1:668, 9]
diabetes_test_labels <- diabetes_df[669:768, 9]
Finally, the dataset is split into two parts: the first 668 rows are used to train the model, and the remaining 100 rows are used to test its accuracy.
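The split above takes rows in file order. If the rows were not stored in random order, a shuffled split of the same sizes is a common alternative. The following is only a sketch of that alternative, not the method used in this analysis; the seed value and the names n_total, n_test, test_idx and train_idx are assumptions made for illustration.

```r
set.seed(123)  # arbitrary seed, chosen only to make the sketch reproducible

n_total <- 768  # number of rows reported by str()
n_test  <- 100  # same test-set size as the sequential split

# Draw 100 distinct row indices at random for the test set;
# every remaining row goes to the training set.
test_idx  <- sample(n_total, n_test)
train_idx <- setdiff(seq_len(n_total), test_idx)

print(length(train_idx))
print(length(test_idx))
```

These index vectors could then be used in place of 1:668 and 669:768 when subsetting diabetes_df_n and the Outcome column.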
diabetes_prediction <- knn(train = diabetes_df_train, test = diabetes_df_test, cl = diabetes_train_labels, k = 27)
The knn() function is used above to classify the test set. The value used for k is 27, which is the square root of the total sample size used for the analysis (768, whose square root is about 27.7), rounded down.
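The arithmetic behind this choice of k can be checked directly. This is a minimal sketch of the square-root rule of thumb; the variable names are hypothetical, and rounding down is an assumption that happens to match the k = 27 passed to knn() above.

```r
n_total <- 768           # total number of rows in the dataset
k_raw   <- sqrt(n_total) # roughly 27.7
k       <- floor(k_raw)  # rounding down gives the k used in the knn() call

# With two classes, an odd k means a majority vote cannot end in a tie.
is_odd <- k %% 2 == 1

print(k)
print(is_odd)
```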
CrossTable(y = diabetes_prediction, x = diabetes_test_labels, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | diabetes_prediction
## diabetes_test_labels | Diabetic | Non Diabetic | Row Total |
## ---------------------|--------------|--------------|--------------|
## Diabetic | 22 | 15 | 37 |
## | 0.595 | 0.405 | 0.370 |
## | 0.815 | 0.205 | |
## | 0.220 | 0.150 | |
## ---------------------|--------------|--------------|--------------|
## Non Diabetic | 5 | 58 | 63 |
## | 0.079 | 0.921 | 0.630 |
## | 0.185 | 0.795 | |
## | 0.050 | 0.580 | |
## ---------------------|--------------|--------------|--------------|
## Column Total | 27 | 73 | 100 |
## | 0.270 | 0.730 | |
## ---------------------|--------------|--------------|--------------|
##
##
The CrossTable function is used above to assess the accuracy of the model by comparing the known values to the values predicted by the model. Of the 100 test patients, 37 were diabetic and 63 were non-diabetic; the model correctly predicted 22 of the diabetic patients and 58 of the non-diabetic patients, giving an accuracy of 80%.
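The accuracy can be recomputed from the four cell counts in the table, together with the per-class rates that CrossTable reports as row proportions. The sketch below copies the counts from the output above; the object name confusion is hypothetical.

```r
# Counts copied from the CrossTable output
# (rows = actual labels, columns = predicted labels).
confusion <- matrix(
  c(22, 15,
     5, 58),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    actual    = c("Diabetic", "Non Diabetic"),
    predicted = c("Diabetic", "Non Diabetic")
  )
)

# Accuracy: correctly classified patients over all patients.
accuracy <- sum(diag(confusion)) / sum(confusion)

# Sensitivity: share of actual diabetic patients the model identified.
sensitivity <- confusion["Diabetic", "Diabetic"] / sum(confusion["Diabetic", ])

# Specificity: share of actual non-diabetic patients the model identified.
specificity <- confusion["Non Diabetic", "Non Diabetic"] /
  sum(confusion["Non Diabetic", ])

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 3))
```

Looking at the two classes separately shows that the model recognizes non-diabetic patients far more reliably than diabetic ones, which a single accuracy figure hides.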
# Drop rows with impossible zero values, then re-normalize the features
diabetes_df <- diabetes_df %>% filter(BMI > 0) %>% filter(BloodPressure > 0)
diabetes_df_n <- as.data.frame(lapply(diabetes_df[1:8], normalize))
diabetes_df_train <- diabetes_df_n[1:629, ]
diabetes_df_test <- diabetes_df_n[630:729, ]
diabetes_train_labels <- diabetes_df[1:629, 9]
diabetes_test_labels <- diabetes_df[630:729, 9]
diabetes_prediction <- knn(train = diabetes_df_train, test = diabetes_df_test, cl = diabetes_train_labels, k = 27)
CrossTable(y = diabetes_prediction, x = diabetes_test_labels, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | diabetes_prediction
## diabetes_test_labels | Diabetic | Non Diabetic | Row Total |
## ---------------------|--------------|--------------|--------------|
## Diabetic | 19 | 20 | 39 |
## | 0.487 | 0.513 | 0.390 |
## | 0.826 | 0.260 | |
## | 0.190 | 0.200 | |
## ---------------------|--------------|--------------|--------------|
## Non Diabetic | 4 | 57 | 61 |
## | 0.066 | 0.934 | 0.610 |
## | 0.174 | 0.740 | |
## | 0.040 | 0.570 | |
## ---------------------|--------------|--------------|--------------|
## Column Total | 23 | 77 | 100 |
## | 0.230 | 0.770 | |
## ---------------------|--------------|--------------|--------------|
##
##
The presence of the outliers did not hamper the accuracy of the model: removing them reduced the accuracy from 80% to 76%.
Thank you for taking your time to go through my analysis. Any feedback is welcome.