Artificial Intelligence and Data Science Foundation

1. Title:

Applying Data Science techniques to predict the onset of Diabetes among women in rural Bangalore, India.

A Case Study of The Dekhabhaal Clinic.

2. Abstract:

Scope:

Data Science is an umbrella field that’s concerned with everything related to data acquisition, data cleaning, preparation and analysis.

It deals with structured data, such as tables and relational data; semi-structured data, such as key-value pair files in hash tables and JSON files; and unstructured data, such as sound waves, signals, images and blob files.

Data Science encompasses:

Data Analytics,

Data Analysis,

Big Data and

Machine Learning.

A Data Scientist gathers data from multiple sources and applies machine learning, predictive analytics, and/or sentiment analysis to extract critical information from the collected data sets.

They understand data from a business point of view and are able to provide accurate predictions and insights that can be used to make critical business decisions.

Purpose:

The purpose of this report is to demonstrate the effect of applying Data Science techniques to improve business efficiency in The Dekhabhaal Clinic, in rural Bangalore, India.

Data:

The data set for this report is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the data set is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the data set.

Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The data set consists of 15000 observations, 9 independent variables and 1 dependent variable.

The raw data set is available on GitHub.

Target Audience:

This report is targeted at current and future students, as well as business owners who are interested in practical applications of Data Science to improve business processes.

Methods:

The methods include:

Data extraction

Data visualization

Feature transformation

Correcting data-imbalance

Feature selection

Splitting data into train and test

Modelling

Prediction

Evaluation

Results:

After applying Data Science techniques, the results show a 50% reduction in transportation costs per patient, an improvement of more than 500% in service-level-agreement (SLA) times, and a reduction in the average cost per test. Kindly see Figure 1 below.

Figure 1: Business Process Improvements via Data Science

3. Introduction:

With the pandemic still raging, it’s imperative for hospitals, especially in rural areas, to be equipped with the right tools to detect the onset of underlying conditions such as Diabetes.

A typical Glucose-Tolerance-Test (GTT) for detecting Diabetes costs an average of 400 Rupees in Bangalore, India, an amount too expensive for the average rural resident. In addition, the GTT involves 2 visits: one for sample extraction and a second, around 48 to 72 hours later, to pick up the result.

Therefore the new Analyst at The Dekhabhaal Clinic in Bangalore was left asking a few scientific questions:

How can we use Data Science to reduce the number of visits per test?

How can we use Data Science to reduce the resolution time per test?

How can Data Science help us reduce the cost of the GTT test?

Therefore our hypothesis here is…

“Data Science can be used to improve the GTT resolution time from an average of 40 hours to an average of 6 hours.”

This report shows how the Analyst at The Dekhabhaal Clinic was able to prove the above hypothesis using Data Science and Machine Learning, by training a model that addresses the 3 questions above and achieves an AUC score of over 90%.

4. Methods:

The overall idea is to extract the data set of 15000 observations of women, containing relevant variables, in order to train a machine learning model to predict if a new patient is diabetic or non-diabetic.

Data Extraction

The raw data is extracted from GitHub and read into a data frame. It contains 15000 observations and 10 variables, including the target variable Diabetic. Luckily, it has no missing values. Let’s see the first 3 rows of the data below.

Cross section of some numerical variable plots

Feature Visualization and Transformation:

For features whose distributions are strongly skewed, we apply a log transformation to bring them closer to a normal distribution, as seen below.

Applying log10 transformation
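
The effect of the transformation can be illustrated with a minimal sketch on simulated data. The column name `serum_insulin`, the simulated values, and the `+ 1` offset are assumptions made for illustration; they are not taken from the report's data set.

```r
# Minimal sketch: log10-transform a right-skewed feature.
# The data are simulated; `serum_insulin` is a hypothetical column name.
set.seed(42)
serum_insulin <- rlnorm(1000, meanlog = 4.5, sdlog = 0.8)

# Sample skewness: the third standardized moment
# (close to 0 for a symmetric distribution)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Adding 1 before taking the log guards against zeros,
# for which log10() would return -Inf (an assumption of this sketch)
log_insulin <- log10(serum_insulin + 1)

skew_before <- skewness(serum_insulin)
skew_after  <- skewness(log_insulin)
cat("skewness before:", round(skew_before, 2), "\n")
cat("skewness after: ", round(skew_after, 2), "\n")
```

A skewness much closer to zero after the transformation suggests the feature is now more symmetric, which is the property the plots in this section are meant to show.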

Checking Class distribution:

The data set contains about 67% non-diabetic and 33% diabetic observations. This means the data is imbalanced, as the class proportions below show.

prop.table(table(diabetes_df$Diabetic))

## 
##         0         1 
## 0.6666667 0.3333333

An imbalanced data set biases the model toward the dominant class: it achieves a high recognition rate for that class while under-detecting the minority class. Summary metrics such as the F1 score can therefore be unreliable on imbalanced data, so we shall use the ROSE package to balance the data.
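
To see why imbalance can make a summary metric misleading, consider a minimal sketch with simulated labels in the same 2:1 ratio. The "model" here is a deliberately naive one that always predicts the majority class; it is an illustration, not the model built in this report.

```r
# Simulated labels mirroring the 2:1 class ratio (not the clinic's data)
y_true <- c(rep(0, 10000), rep(1, 5000))

# A naive "model" that always predicts the majority class (non-diabetic)
y_pred <- rep(0, length(y_true))

accuracy    <- mean(y_pred == y_true)
# Share of actual diabetics that the model detects
sensitivity <- sum(y_pred == 1 & y_true == 1) / sum(y_true == 1)
# Share of actual non-diabetics that the model detects
specificity <- sum(y_pred == 0 & y_true == 0) / sum(y_true == 0)

cat("accuracy:   ", round(accuracy, 3), "\n")   # 0.667
cat("sensitivity:", sensitivity, "\n")          # 0
cat("specificity:", specificity, "\n")          # 1
```

The naive model looks reasonable on accuracy alone, yet it never identifies a single diabetic patient, which is the failure mode that balancing the training data is meant to reduce.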

Splitting The Data:

Before balancing the data, let’s save 1500 random samples as test data from each class

# First, select all observations in class 0 and class 1 separately
class_0 <- subset(diabetes_df, Diabetic == 0)
class_1 <- subset(diabetes_df, Diabetic == 1)

# Now choose random samples from class_0 and 1
test_0 <- class_0[sample(nrow(class_0), 1500, replace=FALSE), ]
test_1 <- class_1[sample(nrow(class_1), 1500, replace=FALSE), ]

# Now, join the data sets vertically
x_test <- rbind(test_0, test_1)

# Finally, remove the test data completely from main data
df2 <- subset(diabetes_df, !(diabetes_df$PatientID %in% x_test$PatientID))
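
The same splitting idea can be checked on a small toy data frame. This is a sketch under stated assumptions: `toy_df`, its size, and the per-class test size of 30 are made up for illustration, and rows are selected by index rather than by PatientID.

```r
set.seed(1)

# Toy data with the same 2:1 class ratio (hypothetical, not the clinic's data)
toy_df <- data.frame(
  PatientID = 1:300,
  Diabetic  = rep(c(0, 1), times = c(200, 100))
)

n_test <- 30  # per-class test size (the report uses 1500)

# Sample row indices within each class, then stack the two samples
idx_0 <- sample(which(toy_df$Diabetic == 0), n_test)
idx_1 <- sample(which(toy_df$Diabetic == 1), n_test)
toy_test  <- toy_df[c(idx_0, idx_1), ]
toy_train <- toy_df[-c(idx_0, idx_1), ]

# Sanity checks: equal class counts in the test set, no patient in both sets,
# and no rows lost
print(table(toy_test$Diabetic))
print(length(intersect(toy_train$PatientID, toy_test$PatientID)))
print(nrow(toy_train) + nrow(toy_test) == nrow(toy_df))
```

Note that a test set built this way holds equal numbers of each class, so its class ratio differs from the 67/33 ratio of the full data.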

Balancing The Dataset:

We use the ROSE package. ROSE (Random Over-Sampling Examples) generates artificial data using sampling methods and a smoothed bootstrap approach. The package also provides accuracy functions, such as roc.curve, for evaluating the results quickly.

library(ROSE)

## Loaded ROSE 0.0-4

df2 <- ROSE(Diabetic ~ ., data = df2, seed = 1)$data

# Let's see the new spread
prop.table(table(df2$Diabetic))

## 
##         0         1 
## 0.5031326 0.4968674

Balanced Data
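
The idea of a smoothed bootstrap can be sketched in base R. This is a simplified illustration and not ROSE's actual algorithm: the noise scale `h`, the feature names `glucose` and `bmi`, and the simulated data are all assumptions, and the class sizes are fixed at exactly 150 each, whereas ROSE's output above is only approximately balanced.

```r
set.seed(1)

# Simulated imbalanced data; `glucose` and `bmi` are hypothetical feature names
toy <- data.frame(
  glucose  = c(rnorm(200, 100, 15), rnorm(100, 130, 20)),
  bmi      = c(rnorm(200, 28, 4),   rnorm(100, 33, 5)),
  Diabetic = rep(c(0, 1), times = c(200, 100))
)

# Draw n rows of one class with replacement, then jitter each feature with
# Gaussian noise. The noise scale (h * column sd) is an arbitrary choice here,
# not the bandwidth rule that ROSE uses.
smoothed_resample <- function(df, class, n, h = 0.25) {
  rows <- df[df$Diabetic == class, ]
  out  <- rows[sample(nrow(rows), n, replace = TRUE), ]
  for (col in setdiff(names(out), "Diabetic")) {
    out[[col]] <- out[[col]] + rnorm(n, mean = 0, sd = h * sd(rows[[col]]))
  }
  out
}

balanced <- rbind(
  smoothed_resample(toy, class = 0, n = 150),
  smoothed_resample(toy, class = 1, n = 150)
)

print(prop.table(table(balanced$Diabetic)))
```

Because every synthetic row is a jittered copy of a real row, the balanced data contains values that never occurred in the original, which is worth keeping in mind when inspecting columns that were integers before balancing.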

Feature-Selection:

Just by eyeballing the data set, we can see that the PatientID column has no value for model building, so let’s remove it from our training and test sets.

# removing PatientID column
df2 <- subset(df2, select = -PatientID)
x_test <- subset(x_test, select = -PatientID)

# Splitting y_test from x_test
y_test <- x_test$Diabetic
x_test <- subset(x_test, select = -Diabetic)

head(x_test, 3)

Model Building:

Next we build a decision tree model on the training data

library(rpart)

tree_model <- rpart(Diabetic ~ ., data = df2)
print('done')

## [1] "done"

Model Prediction:

Make predictions on unseen data

pred <- predict(tree_model, newdata = x_test)
pred[1:10]

##     10527      5722     11913      9523      2714      7989       158      2333 
## 0.5504431 0.1135802 0.1978239 0.1135802 0.1135802 0.1135802 0.4727273 0.1135802 
##      7198      9882 
## 0.1135802 0.1135802
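
The predictions above are scores between 0 and 1 rather than class labels. One likely explanation, offered here as an assumption, is that Diabetic is stored as a numeric 0/1 column, in which case rpart fits a regression tree and each prediction is the share of diabetics in the matching leaf. A minimal sketch on simulated data (the `glucose` feature and the simulated relationship are hypothetical):

```r
library(rpart)

set.seed(1)
n <- 600
toy <- data.frame(glucose = runif(n, 70, 200))
# Simulated rule: higher glucose makes the positive class more likely
toy$Diabetic <- rbinom(n, 1, plogis((toy$glucose - 135) / 15))

# With a numeric 0/1 response, rpart defaults to method = "anova"
reg_tree <- rpart(Diabetic ~ glucose, data = toy)
scores   <- predict(reg_tree, newdata = toy)

# Mean of Diabetic among the training rows that share each row's leaf;
# reg_tree$where records the leaf each training row falls into
leaf_mean_per_row <- ave(toy$Diabetic, reg_tree$where)

# The predicted score equals the leaf mean, i.e. the leaf's share of diabetics
max_diff <- max(abs(unname(scores) - leaf_mean_per_row))
print(max_diff)
print(range(scores))
```

If class labels or class probabilities from a classification tree are preferred, an alternative is to convert the response to a factor and fit with `method = "class"`, then call `predict()` with `type = "prob"`.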

5. Results:

The results display the model’s predictive performance on the test data. Note that we set a threshold of 0.5. This means that predictions with a probability score below 0.5 are classified as non-diabetic, while those above 0.5 are classified as diabetic.

Furthermore, as seen in the sample predictions above, each observation has a score that indicates how strongly the model leans toward diabetic or non-diabetic. In other words, observations with scores below but close to 0.5, though classified as non-diabetic, are close to the diabetic threshold, and such patients should take the necessary precautions.
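
Applying the threshold to the ten sample scores printed earlier can be sketched as follows. The "borderline" band of 0.1 below the threshold is a hypothetical choice made for illustration, as is treating a score of exactly 0.5 as non-diabetic.

```r
# The ten sample scores shown in the prediction output
scores <- c(0.5504431, 0.1135802, 0.1978239, 0.1135802, 0.1135802,
            0.1135802, 0.4727273, 0.1135802, 0.1135802, 0.1135802)

threshold <- 0.5

# Scores above the threshold are labelled diabetic; a score of exactly 0.5
# is treated as non-diabetic here (an assumption of this sketch)
label <- ifelse(scores > threshold, "diabetic", "non-diabetic")

# Hypothetical "watch" band: non-diabetic but within 0.1 of the threshold
borderline <- scores <= threshold & scores >= threshold - 0.1

result <- data.frame(score = scores, label = label, borderline = borderline)
print(result)
```

In this sample, only the first score crosses the threshold, while the seventh (0.4727273) falls in the borderline band and would be a candidate for follow-up.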

See the Area Under the Curve (AUC) for our model below.

# takes y_true, y_pred
roc.curve(y_test, pred)

## Area under the curve (AUC): 0.912

With an AUC of about 0.912, or roughly 91%, the model has generalized well and can be used on future unseen data.
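
For intuition, the AUC can be read as the probability that a randomly chosen diabetic patient receives a higher score than a randomly chosen non-diabetic patient. A minimal base-R sketch of that rank-based calculation follows; `auc_rank` is a hypothetical helper written for this illustration, and the labels and scores are made up, not taken from the test set.

```r
# AUC via the rank-sum (Mann-Whitney) formulation
auc_rank <- function(y_true, score) {
  r     <- rank(score)             # average ranks handle tied scores
  n_pos <- sum(y_true == 1)
  n_neg <- sum(y_true == 0)
  (sum(r[y_true == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Tiny made-up example (not the clinic's data)
y_true <- c(0, 0, 0, 1, 0, 1, 1, 1)
score  <- c(0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90)

auc_value <- auc_rank(y_true, score)
print(auc_value)  # 0.9375: 15 of the 16 positive/negative pairs are ordered correctly
```

Because the AUC depends only on how the scores rank patients, it does not change with the 0.5 threshold used to assign class labels.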

6. Discussion:

The machine learning process we just explored is the same process that the Analyst at The Dekhabhaal Clinic in Bangalore applied to train a model that can be used to make future classifications on unseen data. Remember the initial critical questions…

How can we use Data Science to reduce the number of visits per test?

How can we use Data Science to reduce the resolution time per test?

How can Data Science help us reduce the cost of the GTT test?

With this trained model deployed, patients no longer need to visit the clinic twice for each test. Now, they get their results the same day, in an average of 5 hours. This has also reduced the resolution SLA time for the GTT.

Although the cost of the GTT itself has remained largely stable at around 400 Rupees, reducing the cost of visits and improving the resolution time has made the process easier for patients. In extreme cases, patients who are knowledgeable about the data extraction process can stay at home, send in their data, and let the model predict their likelihood of Diabetes.

7. Acknowledgements:

Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261–265). IEEE Computer Society Press.

Sushant Banerjee of the Dekhabhaal Clinic, Bangalore, India.

8. Literature:

For the practical analysis done for this report, kindly see the RPubs document via the link

In-Course-Assessment-Report

Lawrence Alaso Krukrubo (A0115333@tees.ac.uk)

7/26/2021