In the age of machine learning and artificial intelligence, everyone wonders how to leverage advanced analytics to improve their decision making. The most common questions we get from our clients are: What is machine learning? What can Launch do with clients’ data? What steps are involved in predictive modeling? What do clients get at the end?
In a nutshell, predictive modeling is a process that uses descriptive data, called features, to forecast the outcomes of a target variable, and machine learning is the use of automated algorithms that build predictive models through an iterative learning procedure. At Launch we use client data to train predictive models by conducting a “study” that consists of the seven steps depicted in the diagram below. Upon completion of the study, a client has a predictive model that can be operationalized for a report, dashboard, or app.
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The population for this study was the Pima Indian population near Phoenix, Arizona, which the institute has studied continuously since 1965 because of its high incidence rate of diabetes. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
The objective of this study is to predict whether a patient has diabetes based on diagnostic measurements.
Data: 768 records on 9 variables:

- Pregnancies: number of times pregnant
- Glucose: plasma glucose concentration at 2 hours in an oral glucose tolerance test
- Blood Pressure: diastolic blood pressure (mm Hg)
- Skin Thickness: triceps skin fold thickness (mm)
- Insulin: 2-hour serum insulin (mu U/ml)
- BMI: body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: the DPF uses information from parents, grandparents, full and half siblings, full and half aunts and uncles, and first cousins. It provides a measure of the expected genetic influence of affected and unaffected relatives on the subject’s eventual diabetes risk.
- Age: age (years)
- Outcome: class variable (0 = no diabetes, 1 = diabetes)

Analysis: Several machine learning classification algorithms will be tested against this data.
Exploratory analysis lets us dig into the data using descriptive statistics and visualizations. The goal is to learn which features of the data can best be leveraged for predictions.
A matrix of correlations and scatter plots provides useful insight into relationships between pairs of variables. Below, we will look for correlations as close to 1 or -1 as possible, which would indicate a strong positive or negative correlation between any two variables in the data set. A correlation of 0 means no detectable relationship between the variables. The matrix below combines these tools for effective and rapid modelling decisions.
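To make the numbers in such a matrix concrete, here is a minimal sketch of how a pairwise Pearson correlation matrix can be computed. The function names and the toy values are our own illustrative assumptions; they are not rows from the Pima dataset and this is not the code used to produce the matrix in this study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Pairwise correlations for a dict of column name -> list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Made-up illustrative values, NOT rows from the Pima dataset.
toy = {
    "Glucose": [85, 90, 110, 130, 150, 170],
    "BMI": [22.0, 26.5, 24.0, 31.0, 29.5, 35.0],
    "Age": [50, 45, 38, 33, 29, 21],
}

matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Each variable correlates perfectly with itself, so the diagonal is always 1, and the matrix is symmetric, which is why such plots often show only one triangle.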
This visualization shows the distribution of each predictor variable, separated into those without diabetes (0) and those with diabetes (1). You can see a considerable difference in the Glucose distribution between 0 and 1. We will keep this variable in mind as we run our models.
This visualization shows the same data using boxplots:
Imputation: Glucose, Blood Pressure, SkinThickness, Insulin, and BMI all had zero values in their columns. Since it is not possible for a person to have a value of 0 for any of these measurements, we treated the zeros as missing and performed predictive imputation using a random forest method for those variables.
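The first half of this step, flagging impossible zeros as missing, can be sketched as follows. Note that the fill-in step below uses a plain median as a simpler stand-in; it is not the random forest imputation used in this study. The column names assume the headers of the Kaggle CSV, and the rows are made up for illustration.

```python
from statistics import median

# Columns where a zero cannot be a real measurement (per the text above).
# Names assume the Kaggle CSV headers.
ZERO_MEANS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

def zeros_to_missing(rows):
    """Return copies of the rows with impossible zeros replaced by None."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for col in ZERO_MEANS_MISSING:
            if new_row.get(col) == 0:
                new_row[col] = None
        cleaned.append(new_row)
    return cleaned

def impute_median(rows, col):
    """Fill missing values in one column with the median of the observed values.

    A simple stand-in for illustration; the study used random forest imputation.
    """
    observed = [r[col] for r in rows if r[col] is not None]
    fill = median(observed)
    for r in rows:
        if r[col] is None:
            r[col] = fill
    return rows

# Made-up rows, NOT taken from the Pima dataset.
rows = [
    {"Pregnancies": 0, "Glucose": 120, "Insulin": 0},
    {"Pregnancies": 2, "Glucose": 0, "Insulin": 90},
    {"Pregnancies": 1, "Glucose": 100, "Insulin": 130},
]

cleaned = zeros_to_missing(rows)
for col in ("Glucose", "Insulin"):
    impute_median(cleaned, col)
print(cleaned)
```

Pregnancies is deliberately left out of the list, because zero pregnancies is a valid value.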
Transformations: To bring the distribution of Insulin closer to normal for our models, we applied a log transformation.
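As a rough illustration of why a log transformation helps, the sketch below compares the skewness of a right-skewed variable before and after taking logs. The values are made up (they are not the study's Insulin measurements), and the helper names are our own.

```python
import math

def log_transform(values):
    """Natural log of each value; assumes all values are strictly positive."""
    return [math.log(v) for v in values]

def skewness(values):
    """Sample skewness: third central moment divided by variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up, right-skewed values standing in for serum insulin (mu U/ml).
insulin = [15, 40, 60, 80, 100, 130, 180, 300, 850]
log_insulin = log_transform(insulin)

print("skewness before:", round(skewness(insulin), 2))
print("skewness after: ", round(skewness(log_insulin), 2))
```

The log is only defined for positive values, which is one reason the zero values need to be imputed before this step.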
Here is a look at the same correlation matrix after the imputations:
Here are the confusion matrices for each model, shown as four-fold displays. A four-fold display shows the frequencies in a 2 x 2 table in a way that depicts the odds ratio. In this display, the frequency in each cell is shown by a quarter circle. An association between the variables (odds ratio != 1) is shown by the tendency of diagonally opposite cells in one direction to differ in size from those in the opposite direction, and we use color and shading to show this direction. In essence, the larger the two green quarter circles are, the better the accuracy of the model. You can see that all models performed similarly, with the Random Forest model performing slightly better than the rest.
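For readers who want to see the odds ratio as a number, here is a small sketch that computes it from a 2 x 2 table. The counts are taken from the Random Forest confusion matrix printed near the end of this post; the function name is our own.

```python
def odds_ratio(table):
    """Odds ratio of a 2 x 2 table given as [[a, b], [c, d]]: (a * d) / (b * c)."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

# Rows are predictions (0, 1); columns are reference values (0, 1).
rf_table = [[139, 33], [13, 46]]

ratio = odds_ratio(rf_table)
print(round(ratio, 2))
```

This works out to about 14.9, far from 1, which is what a strong association between predicted and actual classes looks like.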
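Here is a table that summarizes how each model performed. Sensitivity is computed with class 0 (no diabetes) as the positive class, and the P-value tests whether accuracy exceeds the no-information rate (NIR):
Model | Accuracy | Sensitivity | P-Value [Acc > NIR] | 95% CI |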
---|---|---|---|---|
RandomForest | 0.8009 | 0.9145 | 1.279e-06 | (0.7435, 0.8504) |
GBM | 0.7922 | 0.8947 | 5.468e-06 | (0.7341, 0.8426) |
GLM | 0.7922 | 0.9079 | 5.468e-06 | (0.7341, 0.8426) |
SVM | 0.7835 | 0.9079 | 2.112e-05 | (0.7248, 0.8349) |
Lasso | 0.7662 | 0.9079 | 2.359e-04 | (0.7063, 0.8192) |
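The P-value column can be approximated with a short calculation. The sketch below assumes it comes from a one-sided exact binomial test of accuracy against the no-information rate, which is how R's caret package reports "P-Value [Acc > NIR]"; that the study used caret is our assumption, based on the output format at the end of this post. The counts are from the Random Forest confusion matrix.

```python
from math import comb

def binom_upper_tail(successes, n, p):
    """P(X >= successes) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(successes, n + 1))

correct = 139 + 46   # correctly classified patients (diagonal of the matrix)
total = 231          # all patients in the evaluation set
nir = 152 / 231      # no-information rate: share of the larger class

p_value = binom_upper_tail(correct, total, nir)
print(f"{p_value:.3e}")
```

If the assumption holds, the result should land close to the value reported for the Random Forest model. A small P-value means the model is doing meaningfully better than always guessing the most common class.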
Random Forest works by building a large set of classification or regression trees. Each tree is built from a random subsample (with replacement) of the training data. At each split in each tree, a random subsample of the predictor variables is considered. The predictions of the trees are then combined: averaged for regression, or put to a majority vote for classification.
Random Forests are among the most accurate machine learning algorithms for prediction. However, they are not a silver bullet: Random Forests are often slow to compute and difficult for non-experts to interpret.
While no model is perfect, we have found that the Random Forest model best fits the requirements of this project. Picking the wrong tool for the job could lead to many problems later on, so this step ensures the best possible functionality.
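The mechanics described above can be sketched in a few lines. This is a deliberately tiny toy, not the implementation used in this study: each "tree" is a one-split decision stump, so drawing a random feature subset per tree is the same as drawing one per split. All names and data values are illustrative assumptions.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in feature_ids:
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Train stumps on bootstrap samples, each seeing a random feature subset."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]            # sample with replacement
        feats = rng.sample(range(len(rows[0])), n_features)   # random feature subset
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Majority vote over all stumps."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

# Made-up [glucose, bmi] rows and labels, NOT taken from the Pima dataset.
rows = [[80, 20], [85, 22], [95, 25], [100, 23], [150, 32], [160, 36], [170, 40], [180, 34]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(rows, labels)
print(predict(forest, [200, 45]), predict(forest, [70, 18]))
```

Real random forests grow much deeper trees, but the three ingredients are the same: bootstrap samples, random feature subsets, and a combined vote.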
This plot visualizes the Random Forest model error. The red line indicates the error rate when predicting that a person does not have diabetes, the green line indicates the error rate when predicting that a person does have diabetes, and the black line indicates the overall error rate. In short, on the test data we were better at identifying people who do not have diabetes than people who do.
Below we visualize the relative importance of the variables. This bar chart shows how much each variable contributes to the prediction power of the Random Forest model. If you drop the top variable (Glucose) from the model, its prediction power will be greatly reduced. On the other hand, if you drop one of the bottom variables, there might not be much impact on the prediction power of the model.
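One common way to measure this kind of importance is permutation importance: shuffle one variable's values and see how much accuracy drops. The sketch below illustrates the idea only; we do not know which importance measure produced the chart in this study, and the stand-in model and data are hypothetical.

```python
import random

def accuracy(model, rows, labels):
    """Share of rows where the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, feature, n_repeats=20, seed=0):
    """Average drop in accuracy after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        shuffled = [dict(r, **{feature: v}) for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(model, shuffled, labels))
    return sum(drops) / n_repeats

# Hypothetical stand-in model: a fixed glucose cutoff, NOT the study's Random Forest.
def toy_model(row):
    return 1 if row["Glucose"] > 125 else 0

# Made-up rows and labels, NOT taken from the Pima dataset.
rows = [
    {"Glucose": 90, "Age": 50}, {"Glucose": 100, "Age": 23}, {"Glucose": 110, "Age": 41},
    {"Glucose": 140, "Age": 30}, {"Glucose": 160, "Age": 25}, {"Glucose": 180, "Age": 47},
]
labels = [0, 0, 0, 1, 1, 1]

importance = {f: permutation_importance(toy_model, rows, labels, f) for f in ("Glucose", "Age")}
print(importance)
```

Because the stand-in model ignores Age entirely, shuffling Age changes nothing and its importance is exactly zero, while shuffling Glucose hurts accuracy.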
Here is a final look at the predictive outputs for the random forest model.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 139 33
## 1 13 46
##
## Accuracy : 0.8009
## 95% CI : (0.7435, 0.8504)
## No Information Rate : 0.658
## P-Value [Acc > NIR] : 1.279e-06
##
## Kappa : 0.5289
## Mcnemar's Test P-Value : 0.005088
##
## Sensitivity : 0.9145
## Specificity : 0.5823
## Pos Pred Value : 0.8081
## Neg Pred Value : 0.7797
## Prevalence : 0.6580
## Detection Rate : 0.6017
## Detection Prevalence : 0.7446
## Balanced Accuracy : 0.7484
##
## 'Positive' Class : 0
##
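Most of the statistics above follow directly from the four counts in the confusion matrix. The sketch below recomputes them, treating class 0 as the positive class as the output states; the function and variable names are our own.

```python
def confusion_stats(tp, fn, fp, tn):
    """Summary statistics for a 2 x 2 confusion matrix.

    Counts are relative to whichever class is treated as 'positive'.
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Agreement expected by chance, used by Cohen's kappa.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "prevalence": (tp + fn) / total,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": (accuracy - expected) / (1 - expected),
    }

# Positive class is 0: predicted 0 & actual 0 = 139, predicted 1 & actual 0 = 13,
# predicted 0 & actual 1 = 33, predicted 1 & actual 1 = 46.
stats = confusion_stats(tp=139, fn=13, fp=33, tn=46)
for name, value in stats.items():
    print(f"{name:>18}: {value:.4f}")
```

Because the positive class is 0, the high sensitivity (0.9145) describes how well the model identifies patients without diabetes, while the lower specificity (0.5823) describes how well it identifies patients with diabetes.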
Kaggle.com, Datasets. Pima Indians Diabetes Database. Retrieved from https://www.kaggle.com/uciml/pima-indians-diabetes-database.
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261–265). IEEE Computer Society Press.