“Numbers have an important story to tell. They rely on you to give them a voice” -Stephen Few
This project aims to explore a machine learning algorithm discussed in our Data Science 4001 class. Through the completion of this project, the goal is to obtain detailed knowledge of one of the topics as well as to tell an important story with the results. Although the analysis itself is essential, the ability to interpret the results efficiently and successfully is vital. The aim of this project is to successfully create two machine learning models and to tell an unbiased ‘story’, or conclusion, with our findings.
Diabetes is a chronic health condition that impairs the body’s production and use of insulin. This results in an abnormal breakdown of carbohydrates and elevated levels of glucose in the blood and urine. Diabetes can lead to many complications, including blindness, kidney failure, heart attacks, stroke, and lower limb amputation (World Health Organization).
In the United States alone, more than 34 million individuals have diabetes. More significantly, more than 88 million individuals are pre-diabetic. This disease is the seventh leading cause of death in the United States, and its prevalence keeps increasing significantly worldwide. Specifically, the number of individuals with diabetes in low- and middle-income countries is rising more rapidly than in high-income countries (CDC).
There are two main types of diabetes, Type 1 and Type 2. Type 1 is characterized by “deficient insulin production and requires daily administration of insulin.” On the other hand, Type 2 is said to result from “the body’s ineffective use of insulin.” Type 2 diabetes is more common in adults, particularly individuals who are overweight or have low physical activity (CDC).
Understanding the different medical predictors that can lead to diabetes can help individuals educate themselves and take the precautions necessary to avoid becoming pre-diabetic or diabetic. This is especially important for those who may develop Type 2 Diabetes in their lifetime. This analysis will examine some of those medical predictors to assess their significance with respect to the diabetes outcome.
There have been a plethora of studies analyzing Type 2 Diabetes; however, because the disease is so multi-faceted, researchers have been using specific approaches to reduce the “complexity” of the disease by studying a population that has limited genetic and environmental variability (Reference 3). A population that fits this description is the Pima Native Americans of Arizona; another useful characteristic of this population is that it is said to have the “highest reported prevalence of diabetes [specifically Type 2 Diabetes] of any population in the world” (Reference 3). Multiple genetic studies have been conducted to identify susceptibility genes in Pima Native Americans.
Although the dataset we are utilizing does not state that the patients have only Type 2 Diabetes, nor that the Pima Native American group was selected because of its limited genetic and environmental variability, it can reasonably be assumed that this is the case. By analyzing a sample of a population that has limited genetic and environmental variability (as well as a restriction on age), our results are more specific to the factors that lead to diabetes rather than to other health concerns. It is, therefore, imperative to look at our results from a perspective that acknowledges this assumption. The more knowledgeable one is about the dataset and where it comes from, the better and less biased the analysis and results will be.
Before selecting a machine learning algorithm, having a clear understanding of the data is critical. We must familiarize ourselves with the origin, size, key characteristics, behavior, and type of data.
The dataset selected for this project originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases, but it can also be found on Kaggle here.
The dataset is composed of eight medical predictor variables and one target variable, Outcome. The medical predictor variables are number of pregnancies, glucose, blood pressure, skin thickness, insulin level, BMI, diabetes pedigree function, and age. There are 768 observations, or patients in this case. All of these observations are for women ages 21 and up of Pima Native American heritage; therefore, gender will not be a variable considered in this analysis. Because several constraints were placed on the selection of these patients, the data being analyzed is considered a sample; if it had been representative of all of the Pima Native Americans, it would have been considered a population.
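For reference, a minimal sketch of how the data can be loaded is shown below; the file name diabetes.csv is an assumption about where the Kaggle download was saved. The structure of the resulting data frame follows.
# Load the raw Pima diabetes data; the file name is an assumption about the local download
df <- read.csv("diabetes.csv")
str(df) #inspect the variable types and the first few values of each column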
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
df$Outcome <- factor(df$Outcome) #change Outcome to a factor with levels 0 and 1
summary(df)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## 0:500
## 1:268
##
##
##
##
Looking at the structure of the data frame, the Outcome column is stored as an integer; this must be converted to a factor with levels 0, representing “tested negative for diabetes,” and 1, representing “tested positive for diabetes.”
df[, 3:7][df[, 3:7] == 0] <- NA #replace zeros in columns 3 through 7 with NA, since they represent missing measurements
df <- na.omit(df) #remove rows that contain NA's
Looking at the structure of the data, there appear to be no NA’s present. However, certain medical predictor variables have values equal to 0. For example, Blood Pressure, Skin Thickness, BMI, and Insulin should realistically never be 0. Therefore, these zero values are treated as missing measurements, and the corresponding rows are removed because they are not meaningful.
nm.rows <- data.frame("Number of Rows" = nrow(df))
kable(nm.rows)
| Number.of.Rows |
|---|
| 393 |
With these modifications, the dataset is restricted to 393 observations/patients. Understanding the size and structure of the dataset allows for the best selection of the Machine Learning Algorithm that will be implemented.
Prior to choosing our machine learning algorithms and continuing with our project, we wanted to do some exploratory data analysis to achieve an even better understanding of diabetes and the Pima Native Americans. We used the “plotly” package for data visualization. More about the package and its specific use can be found here.
df.new <- df %>% mutate(age_factor = case_when(
Age >= 20 & Age <= 29 ~ '20-29',
29 < Age & Age <= 39 ~ '30-39',
39 < Age & Age <= 49 ~ '40-49',
49 < Age & Age <= 59 ~ '50-59',
59 < Age ~ '60 +'))
colors<- c("#A569BD", "#EC7063", "#16A085", "#F1C40F", "#95A5A6")
fig <- plot_ly(df.new, labels = ~age_factor, values = ~as.numeric(as.factor(age_factor)), type = 'pie',
textposition = 'inside',
textinfo = 'label+percent',
insidetextfont = list(color = '#FFFFFF'),
hoverinfo = 'text',
text = ~paste( age_factor, ' years'),
marker = list(colors = colors,
line = list(color = '#FFFFFF', width = 1)),
#The 'pull' attribute can also be used to create space between the sectors
showlegend = FALSE)
fig <- fig %>% layout(title = 'Ages of Patients',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig
We created a pie chart of the ages of the women in the dataset to better understand how many younger women were represented compared to older women. The pie chart shows an overwhelming majority of younger women, with the most represented age group being 21-29 (making up 36.3%). One thing to note is that as the age group increases, its proportion of the entire dataset decreases. In the next step, we want to compare this to the women who tested positive, to see whether we find a similar trend or another trend altogether relating diabetes diagnoses to age.
w.diabetes <- (1:nrow(df.new))[df.new[ ,"Outcome"] == 1]
w.diabetes2 <- df.new[w.diabetes, ]
colors<- c( "#EC7063","#A569BD", "#F1C40F","#16A085", "#95A5A6")
fig3 <- plot_ly(w.diabetes2, labels = ~age_factor, values = ~as.numeric(as.factor(age_factor)), type = 'pie',
textposition = 'inside',
textinfo = 'label+percent',
insidetextfont = list(color = '#FFFFFF'),
hoverinfo = 'text',
text = ~paste( age_factor, ' years'),
marker = list(colors = colors,
line = list(color = '#FFFFFF', width = 1)),
#The 'pull' attribute can also be used to create space between the sectors
showlegend = FALSE)
fig3 <- fig3 %>% layout(title = 'Ages of Patients with Diabetes',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig3
Looking at this age breakdown, we find that ages 50-59 have the largest proportional representation among the women who were diagnosed with diabetes, followed by ages 40-49, 30-39, 20-29, and 60+, which remains the least represented. This trend is nearly the converse of our initial findings, indicating that among those who have diabetes, the 50-59 age group is proportionally over-represented. Thus, this is an indication that age may play a role in diabetes prevalence.
ggplot(df, aes(x = Outcome, y = BMI, fill = Outcome)) + geom_boxplot() + labs(title = "Distribution of Diabetes by BMI")
ggplot(df, aes(x = Outcome, y = Age, fill = Outcome)) + geom_boxplot() + labs(title = "Distribution of Diabetes by Age")
ggplot(df, aes(x = Outcome, y = BloodPressure, fill = Outcome)) + geom_boxplot() +
labs(title = "Distribution of Diabetes by Blood Pressure")
In addition, for our EDA, we decided to analyze some of the preliminary relationships in our dataset using boxplots. With the usage of boxplots, we were hoping to see whether there was a difference in the distribution of some of our quantitative variables between those who have diabetes and those who do not. If the distributions largely overlap, the variable may not be as important in driving our classification efforts; however, if we notice a non-trivial difference in the distributions, this may be an indication that the variable is important in differentiating between those who have diabetes and those who do not. For this portion of the EDA, we decided to look at BMI, Age, and Blood Pressure. From the three plots, we find that BMI and Blood Pressure have relatively similar distributions between those with diabetes and those without, though slightly higher for those with diabetes. Thus, these variables may not be as important in our classification efforts as we had anticipated. Looking at Age, however, and consistent with our findings from the previous EDA, we find that the distribution of Age for those with diabetes lies noticeably above that of those without diabetes. Although this is not a drastic disparity, it is more noticeable than in the other two plots and is consistent with the previous EDA. This means that Age is a variable worth noting and continuing to explore in our later analysis.
When conducting our classification procedure, it is also imperative to ensure that we have a relatively balanced dataset. To do this, we will check the base rate and confirm that there are no serious threats to the balance between the two classes.
tbl <- table(df$Outcome)
baserate <- tbl[2] / sum(tbl)
paste0("The base rate is: ", round(baserate, 6) * 100, "%")
## [1] "The base rate is: 33.0789%"
With a base rate of 33.08%, although this is not the perfect 50/50 split we would prefer in an ideal world, it certainly does not raise any serious red flags about class imbalance. Thus, we can proceed, but we should still keep in mind that patients with diabetes, coded as 1’s, are less represented in the dataset than those without diabetes, coded as 0’s.
Choosing an Algorithm (Reference Five)
Multiple factors were in consideration to determine the best machine learning algorithm for the given dataset. After exploratory data analysis, the appropriate machine learning algorithm was determined with the help of the graphic above. Since the data is predicting a category (whether someone is diabetic or not), the data is labeled, and we have less than a thousand observations, we can utilize a Support Vector Machine as our classification algorithm. Although we have not discussed this algorithm in class, we will go into detail about the algorithm and the steps necessary to create our model.
Our group also decided to include a K-Nearest Neighbors model, as this algorithm is also appropriate for our dataset. In fact, the graphic above points to KNN as an alternative to SVM. This is a model we learned about in class; therefore, we felt that including it would be beneficial in showcasing our understanding of the DS 4001 material.
Reference Six
Objective of SVM: The goal of the Support Vector Machine algorithm is to determine a hyperplane in an N-dimensional space that distinctly classifies the data points. Note: a hyperplane can be a line, a plane, or a higher-dimensional surface, depending on how many variables are being analyzed.
Support Vector Machine (SVM) is a supervised type of machine learning algorithm that can be used for both regression and classification models (although, it is often more preferred for classification). It looks at the extremes of a dataset and draws a decision boundary (aka hyperplane). By creating this hyperplane, the SVM allows for the segregation between two classes. For this project, the two segregated groups are patients with diabetes and those without diabetes.
SVM aims to maximize the margin, the distance between the hyperplane and the nearest data points of each class. However, with SVM there is often a trade-off between bias and variance when the hyperplane boundary is created between the classes. If we use the largest possible margin that perfectly separates the training data, we are using what is called the Maximal Margin Classifier; however, this is highly sensitive to outliers in the training data. If we instead use a threshold that is not as sensitive to outliers, allowing some misclassifications of training points, then we are using what is called the Soft Margin.
Here, our support vectors are the data points that are found to be closest to the hyperplane and influence the position/orientation of the hyperplane. These points are critical as they help build the SVM model by maximizing the margin of the classifier; if these points were to be deleted, the position of the hyperplane would be modified as a result.
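To make these ideas concrete, the short sketch below fits a linear SVM to a handful of fabricated two-dimensional points (purely illustrative, not part of our dataset) and extracts the support vectors that define the margin.
library(e1071)
# Two made-up clusters in 2-D, one per class (illustration only)
toy <- data.frame(x1 = c(1, 1.5, 2, 6, 6.5, 7),
                  x2 = c(1, 2, 1, 6, 7, 6),
                  class = factor(c(0, 0, 0, 1, 1, 1)))
toy_fit <- svm(class ~ ., data = toy, type = "C-classification", kernel = "linear")
toy[toy_fit$index, ] #the support vectors: the points closest to the separating hyperplane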
class(df$Outcome)
## [1] "factor"
The ‘Outcome’ column is already of class ‘factor’; therefore, we do not need to modify it and can move forward with our machine learning algorithm. It is important to note that 1’s represent patients with diabetes, whereas 0’s represent patients without diabetes.
# Splitting the dataset into the Training set and Test set
#install.packages('caTools')
library(caTools)
# Creating a 70/30 split
splitting_data <- sample(1:nrow(df),
round(0.7 * nrow(df), 0),
replace = FALSE)
#Creating the train and test data
svm_train <- df[splitting_data, ] #Should contain 70% of data points
svm_test <- df[-splitting_data, ] #Should contain 30% of data points
#Checking to ensure steps above were done correctly
size_of_training <- nrow(svm_train)
size_of_total <- nrow(df)
size_of_test <- nrow(svm_test)
#Verification
paste("The Training Set contains", toString(round(size_of_training/size_of_total,2)*100), "% of the total data")
## [1] "The Training Set contains 70 % of the total data"
paste("The Testing Set contains", toString(round(size_of_test/size_of_total,2)*100), "% of the total data")
## [1] "The Testing Set contains 30 % of the total data"
For the model, we decided to stick to a 70/30 split for the training vs testing data set size.
# Feature Scaling
svm_train[-9] = scale(svm_train[-9])
svm_test[-9] = scale(svm_test[-9])
Scaling is one of the most important steps in pre-processing the data. By scaling, data scientists are able to normalize the ranges of the independent variables (features). This is extremely important, for example, when the data contain features measured on very different scales or with outliers large in magnitude; scaling prevents such features from dominating the model and allows it to remain unimpaired.
Feature scaling can be the determining factor between a weak machine learning model and a strong machine learning model. Therefore, it is imperative to scale for certain machine learning algorithms. With the feature scaling complete, we proceed with our analysis.
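One refinement worth mentioning (not applied above, so stated only as an alternative sketch) is to standardize the test set using the training set’s means and standard deviations, so the test data are mapped onto exactly the scale the model was trained on; one would use this in place of the two scaling lines above.
# Alternative scaling sketch: reuse the training set's centering and scaling values
train_scaled <- scale(svm_train[-9])
train_center <- attr(train_scaled, "scaled:center") #training means
train_sd <- attr(train_scaled, "scaled:scale") #training standard deviations
test_scaled <- scale(svm_test[-9], center = train_center, scale = train_sd)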
#install.packages('e1071')
library(e1071)
classifier <- svm(formula = Outcome ~ .,
data = svm_train,
type = 'C-classification', #Default
kernel = 'linear') #The kernel used in training and predicting
# Predicting the test set results
y_pred <- predict(classifier, newdata = svm_test[-9])
# Making a Confusion Matrix
#install.packages("caret")
library(caret)
cm <- confusionMatrix(svm_test$Outcome,y_pred, positive = "1")
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 70 9
## 1 19 20
##
## Accuracy : 0.7627
## 95% CI : (0.6756, 0.8362)
## No Information Rate : 0.7542
## P-Value [Acc > NIR] : 0.46462
##
## Kappa : 0.4266
##
## Mcnemar's Test P-Value : 0.08897
##
## Sensitivity : 0.6897
## Specificity : 0.7865
## Pos Pred Value : 0.5128
## Neg Pred Value : 0.8861
## Prevalence : 0.2458
## Detection Rate : 0.1695
## Detection Prevalence : 0.3305
## Balanced Accuracy : 0.7381
##
## 'Positive' Class : 1
##
After creating our confusion matrix, we can see that the accuracy of the model is 76.27%, which means the model is performing moderately well. We can then look at the specificity value of 78.65%, the true negative rate: our model identifies patients without diabetes correctly 78.65% of the time. Similarly, the sensitivity value of 68.97% is the true positive rate: we identify patients with diabetes correctly 68.97% of the time. Specificity and sensitivity are important in our analysis, as they are the values we care about optimizing; the higher both of these values are, the better our model is at distinguishing patients who are diabetic from those who are not.
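As a sanity check, these rates can also be recomputed directly from the raw counts stored in the confusion matrix object (a small sketch using the cm object created above).
tab <- cm$table #rows correspond to the first argument of confusionMatrix(), columns to the second
sens <- tab["1", "1"] / sum(tab[, "1"]) #20 / (20 + 9) = 0.6897
spec <- tab["0", "0"] / sum(tab[, "0"]) #70 / (70 + 19) = 0.7865
round(c(sensitivity = sens, specificity = spec), 4)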
Similar to other models, we can consider improving our model by finding the optimal hyper-parameters. For the SVM model, we will consider the gamma value and the cost value. The gamma value defines how far the influence of a single training example reaches: the higher the gamma value, the more weight points close to the decision boundary carry relative to points farther away, and thus the more flexible and non-linear the decision boundary becomes. Conversely, smaller gamma values place more weight on points farther from the decision boundary, giving a more linear decision boundary. (Note that gamma only has an effect for non-linear kernels; e1071’s svm() defaults to a radial kernel when none is specified, as in the tuning step below.) Our C value, or cost, controls the trade-off between correctly classifying the training points and keeping a smooth decision boundary. As we increase C, misclassified training points are penalized more heavily, which lowers the bias of the model at the expense of increased variance. If we inflate C too much, we run the risk of overfitting the model.
We can now use a tuning function to tune our SVM function and run a new and, hopefully, improved model.
## Optimizing Gamma and Cost Parameters
library(e1071)
obj <- tune(svm, Outcome~., data = svm_test,
ranges = list(gamma = 2^(-1:1),
cost = 2^(2:4)),
tunecontrol = tune.control(sampling = "fix"))
summary(obj)
##
## Parameter tuning of 'svm':
##
## - sampling method: fixed training/validation set
##
## - best parameters:
## gamma cost
## 2 4
##
## - best performance: 0.3
##
## - Detailed performance results:
## gamma cost error dispersion
## 1 0.5 4 0.375 NA
## 2 1.0 4 0.325 NA
## 3 2.0 4 0.300 NA
## 4 0.5 8 0.375 NA
## 5 1.0 8 0.325 NA
## 6 2.0 8 0.300 NA
## 7 0.5 16 0.375 NA
## 8 1.0 16 0.325 NA
## 9 2.0 16 0.300 NA
plot(obj)
From the given output, we will choose to use a gamma value of 2, and a cost value of 4. We will now rerun our model and compare its performance to the original model with the default gamma and cost values.
tuned.svm <- svm(formula = Outcome ~ .,
data = svm_train,
type = 'C-classification', #Default
kernel = 'linear',
gamma = 2,
cost = 4)
summary(tuned.svm)
##
## Call:
## svm(formula = Outcome ~ ., data = svm_train, type = "C-classification",
## kernel = "linear", gamma = 2, cost = 4)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 4
##
## Number of Support Vectors: 129
##
## ( 65 64 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
# Predicting the test set results
y_pred.tune <- predict(tuned.svm, newdata = svm_test[-9])
# Making a Confusion Matrix
#install.packages("caret")
library(caret)
cm.tune <- confusionMatrix(svm_test$Outcome,
y_pred.tune,
positive = "1")
cm.tune
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 70 9
## 1 19 20
##
## Accuracy : 0.7627
## 95% CI : (0.6756, 0.8362)
## No Information Rate : 0.7542
## P-Value [Acc > NIR] : 0.46462
##
## Kappa : 0.4266
##
## Mcnemar's Test P-Value : 0.08897
##
## Sensitivity : 0.6897
## Specificity : 0.7865
## Pos Pred Value : 0.5128
## Neg Pred Value : 0.8861
## Prevalence : 0.2458
## Detection Rate : 0.1695
## Detection Prevalence : 0.3305
## Balanced Accuracy : 0.7381
##
## 'Positive' Class : 1
##
After tuning our model, we can see that we have the same accuracy, specificity, and sensitivity rates. This is likely because we kept the linear kernel in the re-fitted model, for which the gamma parameter has no effect, and the change in cost did not alter the fitted decision boundary enough to change any test-set predictions. Although this is not the desired result, there are multiple ways to avoid it in the future. Some of these include dealing with missing data and outliers so that more data is present. Furthermore, feature engineering and feature selection can be critical options as well. Although there are many ways to optimize the SVM model, another option is to select an algorithm that works better for the given dataset. As a result, we will be moving on to the KNN model.
Although our model’s specificity and sensitivity rates remained the same, in the context of the problem it is beneficial to ensure that both of these class prediction rates are optimized; however, we should especially focus on the sensitivity, the rate at which diabetic patients are correctly identified. For those who do have diabetes, it is imperative that they receive the necessary and proper medical attention, and misclassifying them as non-diabetic could pose a serious threat to their health. Falsely classifying someone as diabetic is also undesirable, but it is less of a threat to one’s health than a false negative.
Originally, the SVM model performed with 76.27% accuracy, which remained the same after the model tuning. Specificity and sensitivity also remained the same at a value of 78.65% and 68.97%, respectively. In order for this model to perform better, there must be a higher sensitivity and specificity rate, which would in turn help increase the accuracy.
Therefore we should, and will, consider other methods to see if other models better fit the needs of our scenario. Thus, we will now consider the K Nearest Neighbors classification method.
Objective of KNN: KNN assumes that similar things exist in close proximity. Using this concept, the algorithm predicts the group an input belongs to by looking at the “k” closest data points (also known as neighbors).
KNN is a supervised machine learning algorithm; it is considered both non-parametric and lazy (it builds no explicit model during training and so makes no generalizations until prediction time). KNN works by classifying a given data point relative to the points already existing in the dataset: it compares the input to the nearest points surrounding it and determines its classification by a majority vote among the k points it is closest/most similar to. KNN mainly involves two hyperparameters, the k value and the distance function. The graphic above shows a very basic example of how this algorithm functions.
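As a toy illustration of that idea (with fabricated points used purely for illustration), the sketch below classifies one new observation by a majority vote among its three nearest neighbors using the knn() function from the class package.
library(class)
# Made-up 2-D training points: three of class "0" and three of class "1"
toy_train <- data.frame(x1 = c(1, 2, 1.5, 8, 9, 8.5),
                        x2 = c(1, 1.5, 2, 8, 8.5, 9))
toy_labels <- factor(c(0, 0, 0, 1, 1, 1))
toy_test <- data.frame(x1 = 7.5, x2 = 8) #one new point to classify
knn(train = toy_train, test = toy_test, cl = toy_labels, k = 3) #predicted class: "1"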
#Determine the split between diabetic and not diabetic then calculate the base rate
split <- table(df$Outcome)
non_diab <- split[1] / sum(split) #0 represents not diabetic (negative for diabetes)
diab <- split[2] / sum(split) #1 represents diabetic (positive for diabetes)
data.frame(c('Diabetic' = diab,'Not Diabetic' = non_diab))
## c.Diabetic...diab...Not.Diabetic....non_diab.
## Diabetic.1 0.3307888
## Not Diabetic.0 0.6692112
Looking at the base rates, we can see that about 33.1% of our dataset is made up of patients who are diabetic and 66.9% of our dataset is made up of patients who are not diabetic.
df2<- df[,-9]
correlations <- cor(df2)
correlations
## Pregnancies Glucose BloodPressure SkinThickness
## Pregnancies 1.00000000 0.2014035 0.21270810 0.09464568
## Glucose 0.20140352 1.0000000 0.20324808 0.20341477
## BloodPressure 0.21270810 0.2032481 1.00000000 0.23173429
## SkinThickness 0.09464568 0.2034148 0.23173429 1.00000000
## Insulin 0.08084792 0.5800602 0.09758405 0.18421068
## BMI -0.02391278 0.2128775 0.30362624 0.66491559
## DiabetesPedigreeFunction 0.00873705 0.1438075 -0.01640256 0.16169449
## Age 0.68011954 0.3461216 0.29899652 0.16954075
## Insulin BMI DiabetesPedigreeFunction
## Pregnancies 0.08084792 -0.02391278 0.00873705
## Glucose 0.58006018 0.21287749 0.14380748
## BloodPressure 0.09758405 0.30362624 -0.01640256
## SkinThickness 0.18421068 0.66491559 0.16169449
## Insulin 1.00000000 0.22805016 0.13746453
## BMI 0.22805016 1.00000000 0.15983348
## DiabetesPedigreeFunction 0.13746453 0.15983348 1.00000000
## Age 0.21923205 0.07156561 0.08647941
## Age
## Pregnancies 0.68011954
## Glucose 0.34612159
## BloodPressure 0.29899652
## SkinThickness 0.16954075
## Insulin 0.21923205
## BMI 0.07156561
## DiabetesPedigreeFunction 0.08647941
## Age 1.00000000
It is important to check that our variables are not highly correlated before running the model. Although we can see at first glance that there are no variables with extreme correlations, we will continue analyzing the correlations as if we expected highly correlated variables to be present, for learning purposes.
cormat<-signif(cor(df2),2)
col<- colorRampPalette(c( "white","#00203FFF"))(10)
heatmap(cormat, col=col, symm=TRUE)
In order to analyze the correlation values, we can create a heatmap to get a visual understanding. Here, we can see that the highest correlations have a dark blue color, and the lowest correlations have a white color. Although this heatmap does not show us the specific correlation value, it can help us get a sense of which variables might have to be removed before running KNN. Our next step will be removing the variables with high correlations; we proceed with our analysis.
#Removing variables with high correlations (.7 or higher, -.7 or lower)
correlations %>% # start with the correlation matrix (from earlier step)
as.table() %>% as.data.frame() %>%
subset(Var1 != Var2 & abs(Freq)>=0.7) %>% # omit diagonal and keep significant correlations
filter(!duplicated(paste0(pmax(as.character(Var1), as.character(Var2)), pmin(as.character(Var1), as.character(Var2))))) %>%
# keep only unique occurrences, as.character because Var1 and Var2 are factors
arrange(desc(Freq)) %>% # sort by Freq (aka correlation value)
formattable() #create a visually more appealing chart to refer to
| Var1 | Var2 | Freq |
|---|---|---|
We took the correlation matrix from the earlier step and filtered it so that we could see the unique pairs of variables with correlations at or above the abs(0.7) threshold. We omitted relationships on the diagonal (perfect correlations of abs(1.0)); these values are not relevant to our analysis, as they show the correlation of a variable with itself (e.g., BMI vs. BMI). The chart created in the step above shows all of the variable pairs with a correlation at or above the abs(0.7) threshold.
When looking at the correlation matrix, using abs(0.7) as the threshold for high correlation, no variable pairs meet this criterion. Therefore, all of the variables will remain in the dataset for the purposes of our analysis. We could also have looked at the initial chart of all of the correlation values to determine this; however, ensuring that our code produces the same result makes our analysis less prone to human error. This would be especially important if we were analyzing a dataset with more variables. We only have eight predictor variables to worry about in this case, which keeps checking the correlation values simple.
We do not need to subset the dataframe based on the above analysis; therefore, we continue with the KNN model.
# Splitting the dataset into the Training set and Test set
#install.packages('caTools')
library(caTools)
# Creating a 70/30 split
splitting_data2 <- sample(1:nrow(df),
round(0.7 * nrow(df), 0),
replace = FALSE)
#Creating the train and test data
knn_train <- df[splitting_data2, ] #Should contain 70% of data points
knn_test <- df[-splitting_data2, ] #Should contain 30% of data points
knn_train_labels <- as.data.frame(knn_train[, 9]) #true Outcome labels for the training rows
#Checking to ensure steps above were done correctly
size_of_training2 <- nrow(knn_train)
size_of_total <- nrow(df)
size_of_test2 <- nrow(knn_test)
#Verification
paste("The Training Set contains", toString(round(size_of_training2/size_of_total,2)*100), "% of the total data")
## [1] "The Training Set contains 70 % of the total data"
paste("The Testing Set contains", toString(round(size_of_test2/size_of_total,2)*100), "% of the total data")
## [1] "The Testing Set contains 30 % of the total data"
For the model, we decided to stick to a 70/30 split for the training vs testing data set size.
# Feature Scaling
knn_train[-9] = scale(knn_train[-9])
knn_test[-9] = scale(knn_test[-9])
As with the SVM model, scaling is of paramount importance, since the scales of the values for each variable may differ. It is best practice for KNN to normalize the data onto a common scale before proceeding.
library(class) #the knn() function comes from the 'class' package
diab_test_pred_knn <- knn(train = knn_train,
test = knn_test,
cl = knn_train[, "Outcome"], #<- category for true classification
k=3,
use.all = TRUE) #<- controls ties between class assignments: if TRUE, all distances equal to the kth largest are included
# View the output
table(diab_test_pred_knn)
## diab_test_pred_knn
## 0 1
## 83 35
Here, we can see that our model determined 83 “Not Diabetic/Negative” groupings and 35 “Diabetic/Positive” groupings.
There are multiple factors to consider when selecting an initial k value. Ideally, the value for k will be an odd integer, as this avoids the problem of breaking ties in a binary classifier. In this situation, we decided on an initial k by taking the square root of the number of variables; since this dataset contains nine variables (eight predictors plus the outcome), our starting k value is 3. Since 3 is an odd integer, this is a good starting point for our analysis.
Having a large k makes the model less sensitive to noise; however, if k becomes too large, the decision boundary is over-smoothed and points in smaller or borderline groups can be misclassified. On the other hand, having a small k means that noise has a higher influence on the result. Depending on the size of the training set, the data scientist may choose a larger or smaller k value. The selection of k is a difficult process that often requires multiple rounds of tuning; parameter tuning, of k in this case, is extremely important for better accuracy. One simple heuristic data scientists use for selecting a k value is to set k = sqrt(n), where n is the number of observations in the training dataset. It is important to note that there is no single “official” way of selecting a k value that is superior to all others.
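A quick sketch of that square-root heuristic is shown below, with the result bumped to the next odd integer so that ties cannot occur in a binary classifier; note that this is only a possible starting value, not the k of 3 used above.
# Square-root-of-n heuristic for an initial k (starting value only)
k_start <- floor(sqrt(nrow(knn_train)))
if (k_start %% 2 == 0) k_start <- k_start + 1 #force an odd k to avoid ties
k_start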
Reference Seven
The graphic above shows the importance of selecting a k that is appropriate for the dataset. For example, if the k value chosen were 3, the test input would be classified as belonging to ‘Class B’; increasing the k value to 7 causes the test input to be classified as belonging to ‘Class A’. The small sketch below reproduces this behavior.
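The following sketch uses made-up points (not the graphic’s actual data) arranged so that the predicted class flips between k = 3 and k = 7, mirroring the behavior just described.
library(class)
# Fabricated points around a test input at the origin (illustration only)
toy_x <- data.frame(x1 = c(0.4, 0, 7, 0.6, 0, -0.8, 0),
                    x2 = c(0, 0.5, 0, 0, 0.7, 0, -0.9))
toy_cl <- factor(c("B", "B", "B", "A", "A", "A", "A"))
test_pt <- data.frame(x1 = 0, x2 = 0)
knn(toy_x, test_pt, cl = toy_cl, k = 3) #nearest 3 neighbors: two B's, one A -> "B"
knn(toy_x, test_pt, cl = toy_cl, k = 7) #all 7 neighbors: four A's, three B's -> "A"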
#Creating a confusion matrix and getting detailed information about other metrics
confusionMatrix(as.factor(diab_test_pred_knn), as.factor(knn_test$Outcome), positive = "1", dnn=c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 74 9
## 1 6 29
##
## Accuracy : 0.8729
## 95% CI : (0.799, 0.9271)
## No Information Rate : 0.678
## P-Value [Acc > NIR] : 8.767e-07
##
## Kappa : 0.7027
##
## Mcnemar's Test P-Value : 0.6056
##
## Sensitivity : 0.7632
## Specificity : 0.9250
## Pos Pred Value : 0.8286
## Neg Pred Value : 0.8916
## Prevalence : 0.3220
## Detection Rate : 0.2458
## Detection Prevalence : 0.2966
## Balanced Accuracy : 0.8441
##
## 'Positive' Class : 1
##
With the confusion matrix created, we can see that our model has 74 true negatives and 29 true positives. In plain terms, this means that 74 patients were correctly identified as not having diabetes, and 29 patients were correctly identified as having diabetes.
The accuracy rate is about 87.29%. Compared to the base rate of roughly 67/33, this accuracy rate is performing well. However, we can continue looking at other metrics to get a better sense of the model. We have a sensitivity rate of 76.32%. This is the true positive rate, meaning that 76.32% of diabetic patients were correctly identified as diabetic. Similarly, we have a specificity rate of 92.50%. This is the true negative rate, meaning that 92.50% of non-diabetic patients were correctly identified as non-diabetic. Ideally, we would want to increase both the sensitivity and specificity rates; however, since sensitivity is significantly lower than specificity at a k of 3, increasing our true positive rate is critical in optimizing our model. Although the metrics overall show that this is not poor performance, we certainly can do better. To do this, we can find the value of k that maximizes our overall accuracy.
#Define a chooseK() helper function, then use sapply() to test odd k values from 1 to 21, with knn_train as the training set, knn_test as the validation set, and the "Outcome" column supplying the class labels. Store the results in "new_k".
chooseK = function(k, train_set, val_set, train_class, val_class){
class_knn = knn(train = train_set,
test = val_set,
cl = train_class,
k = k,
use.all = TRUE)
conf_mat2 = table(class_knn, val_class)
accu = sum(conf_mat2[row(conf_mat2) == col(conf_mat2)]) / sum(conf_mat2)
cbind(k = k, accuracy = accu)
}
new_k <- sapply(seq(1, 21, by = 2), #<- set k to be odd number from 1 to 21
function(x) chooseK(x,
train_set = knn_train,
val_set = knn_test,
train_class = knn_train[, "Outcome"],
val_class = knn_test[, "Outcome"]))
new_k
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1.0000000 3.0000000 5.0000000 7.0000000 9.0000000 11.0000000 13.0000000
## [2,] 0.9067797 0.8728814 0.8813559 0.8898305 0.8728814 0.8728814 0.8813559
## [,8] [,9] [,10] [,11]
## [1,] 15.0000000 17.0000000 19.0000000 21.0000000
## [2,] 0.8813559 0.8728814 0.8728814 0.8728814
Here, we are creating a chooseK function which we will use to test different k values from 1 to 21 (choosing only the odd numbers). This will be beneficial in determining the ‘best’ or most optimal k value for this model.
#Visualizing the difference in accuracy based on K
new_k <- data.frame(k_value = new_k[1, ], accuracy = new_k[2, ])
new_k
## k_value accuracy
## 1 1 0.9067797
## 2 3 0.8728814
## 3 5 0.8813559
## 4 7 0.8898305
## 5 9 0.8728814
## 6 11 0.8728814
## 7 13 0.8813559
## 8 15 0.8813559
## 9 17 0.8728814
## 10 19 0.8728814
## 11 21 0.8728814
Here, we are creating an easier, more visually appealing way to view the results of the step prior. We can see that a k value of 1 gives us the highest accuracy rate; however, in the context of the model, a k level of 1 does not make sense as it would imply overfitting to the data. The second highest peak of the accuracy rate is at a k value of 7. At a k of 7, we have an accuracy of 88.98%. This goes to show that our initial k value of 3 did not yield the most accurate model. Therefore, we should change our model so that it utilizes a k value of 7.
#Use ggplot to visualize model accuracy as a function of k and identify the k to select
ggplot(new_k, aes(x = k_value, y = accuracy)) +
geom_line(color = "blue", size = 1.5) +
geom_point(size = 3) +
labs(title = 'K Value Versus Overall Model Accuracy',
x = "K Value",
y= "Model Accuracy")
The graph above is a visual representation of the chart displayed earlier; it contains the same information. Again, the optimal value for k is found at 7, and we want to keep our k value a positive, odd integer. To avoid overfitting while maintaining adequate accuracy, we can thus use k = 7 as our optimal k value and proceed with our analysis.
# Rerun the model with "optimal" k
updated_diab_knn <- knn(train = knn_train,
test = knn_test,
cl = knn_train[, "Outcome"], #<- category for true classification
k=7,
use.all = TRUE) #<- controls ties between class assignments: if TRUE, all distances equal to the kth largest are included
# View the output
table(updated_diab_knn)
## updated_diab_knn
## 0 1
## 85 33
As mentioned in the prior step, a k value of 7 was selected as the optimal k value for this model. Here, we can see that our model determined 85 “Not Diabetic/Negative” groupings and 33 “Diabetic/Positive” groupings.
In the next step, the specifics for the model with k = 7 are shown.
#Use the confusion matrix function to measure the quality of the new model
confusionMatrix(as.factor(updated_diab_knn), as.factor(knn_test$Outcome), positive = "1", dnn=c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 76 9
## 1 4 29
##
## Accuracy : 0.8898
## 95% CI : (0.819, 0.94)
## No Information Rate : 0.678
## P-Value [Acc > NIR] : 7.039e-08
##
## Kappa : 0.7387
##
## Mcnemar's Test P-Value : 0.2673
##
## Sensitivity : 0.7632
## Specificity : 0.9500
## Pos Pred Value : 0.8788
## Neg Pred Value : 0.8941
## Prevalence : 0.3220
## Detection Rate : 0.2458
## Detection Prevalence : 0.2797
## Balanced Accuracy : 0.8566
##
## 'Positive' Class : 1
##
With the confusion matrix created, we can see that our model has 76 true negatives and 29 true positives. In plain terms, this means that 76 patients were correctly identified as not having diabetes, and 29 patients were correctly identified as having diabetes.
The accuracy rate is 88.98%, which means that the model is performing fairly well. As with our previous analysis, we can continue looking at other metrics to get a wider sense of the model. We have a sensitivity rate of 76.32%. This is the true positive rate, meaning that 76.32% of diabetic patients were correctly identified as diabetic. Similarly, we have a specificity rate of 95%. This is the true negative rate, meaning that 95% of non-diabetic patients were correctly identified as non-diabetic.
| Metric | k= 3 | k = 7 |
|---|---|---|
| Accuracy | 87.29% | 88.98% |
| Specificity | 92.50% | 95.00% |
| Sensitivity | 76.32% | 76.32% |
As we can see displayed above, accuracy and specificity went up by a decent amount when using a k of 7. Sensitivity remained the same at 76.32%. In future models, the aim may be to increase the rate of sensitivity since a rate of 76.32% can be viewed as poor to moderate, especially when compared to the other metrics. However, generally, we are content with the increase in accuracy and specificity as that was the goal of the optimization with a k value of 7.
Although we did not run into this issue with our model, there may be situations where in order to increase the accuracy of the model, there is a trade-off between the specificity and sensitivity rates. For example, at a high accuracy rate, there may be times when specificity goes down and sensitivity goes up, or vice versa. In situations like this, it is critical for the data scientist to determine which metric is more important.
As mentioned before, our model was at its optimal performance when a k value of 7 was utilized. This allowed the accuracy and specificity to be at their highest: 88.98% and 95%, respectively, both relatively high values. Compared to the other two metrics, sensitivity was lower at a rate of 76.32%; even though sensitivity is not ‘low’ in this case, it could be improved upon in future analysis.
A way to improve our KNN model could be to collect more data. Although we had several hundred observations to work with, access to a significantly larger number of observations would allow the model to adjust accordingly; this could result in higher accuracy, specificity, and sensitivity rates.
Although we went into detail on how both SVM and KNN function, it is important to note some differences between the two algorithms. Both SVM and KNN are similar in the sense that they can be used for regression and classification purposes. In this project, they were both utilized for classification of whether a patient has diabetes or not. In the real world, both algorithms are widely used, and there is not a clear answer as to which algorithm is considered “better”.
KNN does have some advantages over SVM in certain situations. For example, if the dataset has many points in a low-dimensional space, then KNN can be the better choice; KNN can also be more intuitive to use than SVM. Nonetheless, KNN presents challenges when trying to determine the right k value; for a more accurate model, the k value must be appropriate, which can be difficult to achieve. Likewise, if the amount of training data is much larger than the number of features, KNN may outperform SVM.
There are also certain circumstances in which SVM is preferred over KNN. Typically, if the dataset has few points in a high-dimensional space, then a linear SVM is deemed better. SVM can often have a faster run time than KNN, and it also deals with outliers much better than the KNN algorithm. Similarly, if there are many features and relatively little training data, SVM can outperform KNN.
At the end of the day, the selection of an algorithm is entirely dependent on the data scientist and the dataset that is selected. There can be multiple advantages and disadvantages of using different algorithms. Having a clear understanding of each algorithm is crucial for this reason.
Overall, across both the SVM and KNN models, in terms of accuracy, specificity, and sensitivity, as well as other metrics not discussed in the analysis, the KNN model out-performed the SVM model. In fact, the KNN model had an accuracy of 88.98% after using an optimized k value, whereas the SVM model had an accuracy of only 76.27% even after parameter tuning. This was also true for the specificity and sensitivity values: both metrics were notably higher in the KNN model, which helps explain its higher accuracy. The reason for this lies in the topology of the class distributions; that is, it could be an indication that the dataset is not easily separable using the linear decision planes that we allowed SVM to use.
The basic SVM model, which is what we utilized in our analysis, uses linear hyperplanes to separate the classes. On the other hand, the KNN model can generate a highly convoluted decision boundary, since it is driven by the raw training data itself. Since the KNN model provided good results, this suggests that the classes, whether a patient is diabetic or not, can be separated well by boundaries driven directly by the training data, even if those boundaries are not linear. A future implementation on a dataset of similar data points should utilize scatter plots in the EDA to try different transforms of the input data; having a visual of the data points makes it easier to see whether the classes are linearly separable. In addition, if there exists a set of transforms that makes the classes more linearly separable, then linear (or close-to-linear) classification techniques, like the basic SVM model we used, will function better. These transforms can include, but are not limited to, taking each metric to a power or taking logs of the metrics.
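As a sketch of what that future check could look like, the plot below shows two log-transformed features against each other, colored by Outcome; the choice of Glucose and Insulin and the log transform are assumptions made purely for illustration.
library(ggplot2)
# Illustrative scatter plot of two log-transformed features, colored by diabetes status
# (log is used here only as an example transform on the cleaned data)
ggplot(df, aes(x = log(Glucose), y = log(Insulin), color = Outcome)) +
  geom_point(alpha = 0.6) +
  labs(title = "Log-Transformed Glucose vs. Insulin by Outcome",
       x = "log(Glucose)", y = "log(Insulin)")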
One limitation of the dataset, to reiterate, is the restrictions set on the dataset. Specifically, the patients included were all females that were 21+ and of Pima Native American heritage. Initially, this provided our group with 768 observations (in this case, patients); however, after looking at the data, it was clear that there were meaningless 0’s within the dataset. Once these were removed, our group was left with 393 observations/patients. Although 393 rows is not a significantly small sample to analyze, most machine learning algorithms function better with more data points. Having more observations could help significantly increase the accuracy rate within our models and make the models more robust to new incoming data.
As mentioned before, a reason why this study was restricted to individuals of Pima Native American heritage could be the population’s limited genetic and environmental variability, which makes analyzing the causes of diabetes within the sample much simpler. However, there are a plethora of ways to continue analyzing this disease. Although more than three different approaches are possible, our group has highlighted the three we believe make the most sense as next steps.
Depending on whether the data scientist is interested in understanding Type 1 or Type 2 Diabetes, the dataset could be more detailed and indicate whether a patient has Type 1 or Type 2 Diabetes, which our dataset did not specify. This could be extremely important in determining factors that may be significant for one type of diabetes but not the other; it would allow data scientists to draw out similarities and differences between the two varieties of the same disease.
The dataset could also contain a wider selection of individuals in terms of age and gender. The age of a patient is extremely important because the onset of Type 1 Diabetes in children tends to peak at ages 4-7 and 10-14 (Reference 4). Although both types of diabetes can occur at any age, Type 1 Diabetes is especially common in children and young adults, since it results from the immune system rather than dietary choices. More analysis that solidifies this concept could be beneficial for doctors and diabetes researchers.
Similarly, looking at both females and males could be beneficial in determining which factors are more important in making an individual susceptible to diabetes. Because human physiology differs between the two sexes, different variables may lead to a diabetes diagnosis in women compared to men. Seeing how this changes with age could be an interesting research question for future analysis.
Because our dataset only included patients of Pima Native American heritage, our results are not necessarily applicable to other groups of individuals. Therefore, a future study that would be encouraged is analyzing patients from diverse backgrounds. Not every population has the limited genetic and environmental variability of the Pima Native Americans; therefore, it is critical to repeat this study on a dataset that is more inclusive.