Introduction

This homework applies the support vector machine (SVM) algorithm to the dataset used in Homework 2, the diabetes health indicators dataset. The dataset was compiled from a telephone survey conducted by the CDC in 2015. The survey asked about health-related risk behaviors, chronic health conditions, and the use of preventative services. Also included are age, education, income, location, and race, to name a few. Three .csv files are available for analysis; the one used in this homework is diabetes_012_health_indicators_BRFSS2015.csv. This file contains 253,680 survey responses (observations) and 21 features. The response variable is multiclass, with 3 classes: 0 for no diabetes, 1 for prediabetes, and 2 for diabetes. The author of the dataset points out that there is a class imbalance.

This dataset includes the following variables:

Diabetes_012: 0 = no diabetes 1 = prediabetes 2 = diabetes
HighBP: 0 = no high BP 1 = high BP
HighChol: 0 = no high cholesterol 1 = high cholesterol
CholCheck: 0 = no cholesterol check in 5 years 1 = yes cholesterol check in 5 years
BMI: Body Mass Index
Smoker: Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] 0 = no 1 = yes
Stroke: (Ever told) you had a stroke. 0 = no 1 = yes
HeartDiseaseorAttack: coronary heart disease (CHD) or myocardial infarction (MI) 0 = no 1 = yes
PhysActivity: physical activity in past 30 days - not including job 0 = no 1 = yes
Fruits: Consume Fruit 1 or more times per day 0 = no 1 = yes
Veggies: Consume Vegetables 1 or more times per day 0 = no 1 = yes
HvyAlcoholConsump: Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) 0 = no 1 = yes
AnyHealthcare: Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc. 0 = no 1 = yes
NoDocbcCost: Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? 0 = no 1 = yes
GenHlth: Would you say that in general your health is: scale 1-5:
- 1 = excellent
- 2 = very good
- 3 = good
- 4 = fair
- 5 = poor
MentHlth: Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?
- 1 - 30: number of days
- 88: None
- 77: Don’t know/Not sure
- 99: Refused
PhysHlth: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?
- 1 - 30: number of days
- 88: None
- 77: Don’t know/Not sure
- 99: Refused
- BLANK: Not asked or Missing
DiffWalk: Do you have serious difficulty walking or climbing stairs? 0 = no 1 = yes
Sex: 0 = female 1 = male
Age: 13-level age category:
- 1 = 18-24
- 2 = 25-29
- 3 = 30-34
- 4 = 35-39
- 5 = 40-44
- 6 = 45-49
- 7 = 50-54
- 8 = 55-59
- 9 = 60-64
- 10 = 65-69
- 11 = 70-74
- 12 = 75-79
- 13 = 80 or older
Education: Education level; scale 1-6:
- 1 = Never attended school or only kindergarten
- 2 = Grades 1 through 8 (Elementary)
- 3 = Grades 9 through 11 (Some high school)
- 4 = Grade 12 or GED (High school graduate)
- 5 = College 1 year to 3 years (Some college or technical school)
- 6 = College 4 years or more (College graduate)
Income: Income scale; scale 1-8:
- 1 = less than $10,000
- 2 = less than $15,000
- 3 = less than $20,000
- 4 = less than $25,000
- 5 = less than $35,000
- 6 = less than $50,000
- 7 = less than $75,000
- 8 = $75,000 or more

The goals of this analysis were to:

compare the results of the SVM algorithm with those of the decision tree algorithm selected in the previous homework
determine which algorithm is recommended for more accurate results
determine whether the recommended algorithm is better suited to regression or classification scenarios
state whether we agree with the recommendation, and why

Importing Data

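library(tidyverse)  # provides read_csv() and the dplyr verbs used below

# "f" reads a column as a factor, "n" as a number (BMI, column 5, is the only numeric column)
diabetes_data <- read_csv(
  file = "diabetes_012_health_indicators_BRFSS2015.csv",
  col_types = "ffffnfffffffffffffffff")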

Exploratory Data Analysis

A summary of the Diabetes Health Indicators Dataset is provided below:

summary(diabetes_data)

##  Diabetes_012 HighBP       HighChol     CholCheck         BMI       
##  0.0:213703   1.0:108829   1.0:107591   1.0:244210   Min.   :12.00  
##  2.0: 35346   0.0:144851   0.0:146089   0.0:  9470   1st Qu.:24.00  
##  1.0:  4631                                          Median :27.00  
##                                                      Mean   :28.38  
##                                                      3rd Qu.:31.00  
##                                                      Max.   :98.00  
##                                                                     
##  Smoker       Stroke       HeartDiseaseorAttack PhysActivity Fruits      
##  1.0:112423   0.0:243388   0.0:229787           0.0: 61760   0.0: 92782  
##  0.0:141257   1.0: 10292   1.0: 23893           1.0:191920   1.0:160898  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##  Veggies      HvyAlcoholConsump AnyHealthcare NoDocbcCost  GenHlth    
##  1.0:205841   0.0:239424        1.0:241263    0.0:232326   5.0:12081  
##  0.0: 47839   1.0: 14256        0.0: 12417    1.0: 21354   3.0:75646  
##                                                            2.0:89084  
##                                                            4.0:31570  
##                                                            1.0:45299  
##                                                                       
##                                                                       
##     MentHlth         PhysHlth      DiffWalk      Sex              Age       
##  0.0    :175680   0.0    :160052   1.0: 42675   0.0:141974   9.0    :33244  
##  2.0    : 13054   30.0   : 19400   0.0:211005   1.0:111706   10.0   :32194  
##  30.0   : 12088   2.0    : 14764                             8.0    :30832  
##  5.0    :  9030   1.0    : 11388                             7.0    :26314  
##  1.0    :  8538   3.0    :  8495                             11.0   :23533  
##  3.0    :  7381   5.0    :  7622                             6.0    :19819  
##  (Other): 27909   (Other): 31959                             (Other):87744  
##  Education        Income     
##  4.0: 62750   8.0    :90385  
##  6.0:107325   7.0    :43219  
##  3.0:  9478   6.0    :36470  
##  5.0: 69910   5.0    :25883  
##  2.0:  4043   4.0    :20135  
##  1.0:   174   3.0    :15994  
##               (Other):21594

The factors above have been recoded for readability.

diabetes_data <- diabetes_data %>%
  mutate(
    Diabetes_012 = dplyr::recode(Diabetes_012, '0.0' = 'No Diabetes', '1.0' = 'Prediabetes', '2.0' = 'Diabetes'),
    CholCheck = dplyr::recode(CholCheck, '1.0' = 'Yes Chol Check in 5 years', '0.0' = 'No Chol Check in 5 Years'),
    AnyHealthcare = dplyr::recode(AnyHealthcare, '1.0' = 'Has Insurance', '0.0' = 'No Insurance'),
    GenHlth = dplyr::recode(GenHlth, '5.0' = 'Poor', '4.0' = 'Fair', '3.0' = 'Good', '2.0' = 'Very Good', '1.0' = "Excellent"),
    Age = dplyr::recode(Age, '1.0' = '18-24', '2.0' = '25-29', '3.0' = '30-34', '4.0' = '35-39', '5.0' = '40-44',
                 '6.0' = '45-49', '7.0' = '50-54', '8.0' = '55-59', '9.0' = '60-64', '10.0' = '65-69',
                 '11.0'='70-74', '12.0' = '75-79', '13.0' = '>=80'),
    Education = dplyr::recode(Education, '1.0' = 'No School/Kindergarten', '2.0' = 'Grades 1-8', '3.0' = 'Grades 9 - 11',
                       '4.0' = 'Grade 12/GED', '5.0' = '1-3 Yrs College', '6.0' = '>= 4 Yrs College'),
    Income = dplyr::recode(Income, '1.0' = '<10K', '2.0' = '10K<=Income<15K', '3.0' = '15K<=Income<20K', '4.0' = '20K<=Income<25K',
                    '5.0' = '25K<=Income<35K', '6.0' = '35K<=Income<50K', '7.0' = '50K<=Income<75K', '8.0' = 'Income>=75K')
  )

summary(diabetes_data)

##       Diabetes_012    HighBP       HighChol    
##  No Diabetes:213703   1.0:108829   1.0:107591  
##  Diabetes   : 35346   0.0:144851   0.0:146089  
##  Prediabetes:  4631                            
##                                                
##                                                
##                                                
##                                                
##                      CholCheck           BMI        Smoker       Stroke      
##  Yes Chol Check in 5 years:244210   Min.   :12.00   1.0:112423   0.0:243388  
##  No Chol Check in 5 Years :  9470   1st Qu.:24.00   0.0:141257   1.0: 10292  
##                                     Median :27.00                            
##                                     Mean   :28.38                            
##                                     3rd Qu.:31.00                            
##                                     Max.   :98.00                            
##                                                                              
##  HeartDiseaseorAttack PhysActivity Fruits       Veggies      HvyAlcoholConsump
##  0.0:229787           0.0: 61760   0.0: 92782   1.0:205841   0.0:239424       
##  1.0: 23893           1.0:191920   1.0:160898   0.0: 47839   1.0: 14256       
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##        AnyHealthcare    NoDocbcCost       GenHlth         MentHlth     
##  Has Insurance:241263   0.0:232326   Poor     :12081   0.0    :175680  
##  No Insurance : 12417   1.0: 21354   Good     :75646   2.0    : 13054  
##                                      Very Good:89084   30.0   : 12088  
##                                      Fair     :31570   5.0    :  9030  
##                                      Excellent:45299   1.0    :  8538  
##                                                        3.0    :  7381  
##                                                        (Other): 27909  
##     PhysHlth      DiffWalk      Sex              Age       
##  0.0    :160052   1.0: 42675   0.0:141974   60-64  :33244  
##  30.0   : 19400   0.0:211005   1.0:111706   65-69  :32194  
##  2.0    : 14764                             55-59  :30832  
##  1.0    : 11388                             50-54  :26314  
##  3.0    :  8495                             70-74  :23533  
##  5.0    :  7622                             45-49  :19819  
##  (Other): 31959                             (Other):87744  
##                   Education                  Income     
##  Grade 12/GED          : 62750   Income>=75K    :90385  
##  >= 4 Yrs College      :107325   50K<=Income<75K:43219  
##  Grades 9 - 11         :  9478   35K<=Income<50K:36470  
##  1-3 Yrs College       : 69910   25K<=Income<35K:25883  
##  Grades 1-8            :  4043   20K<=Income<25K:20135  
##  No School/Kindergarten:   174   15K<=Income<20K:15994  
##                                  (Other)        :21594

Everything in the summary seems to fall within reasonable expectations. The summary also revealed that there were no missing values in this dataset.

Figure 1: Histogram of BMI (the only continuous feature) in the Diabetes Health Indicators Dataset

Figure 1 shows that BMI is right-skewed. This right skewness could also have been deduced from the summary: for the BMI variable, the maximum is 98, while the mean (28.38) exceeds the median (27) and the minimum is 12.
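The skewness of BMI can also be checked numerically from the quartiles in the summary. The sketch below uses the Bowley (quartile) skewness coefficient, which is a measure chosen here for illustration and is not part of the original analysis; the variable names are hypothetical and the numbers are copied from the summary output.

```r
# Summary statistics for BMI, copied from the summary() output above
bmi_q1 <- 24
bmi_median <- 27
bmi_q3 <- 31
bmi_mean <- 28.38

# Bowley (quartile) skewness: positive values indicate a longer right tail
bowley_skew <- (bmi_q3 + bmi_q1 - 2 * bmi_median) / (bmi_q3 - bmi_q1)

# A mean above the median is a second, informal sign of right skew
mean_above_median <- bmi_mean > bmi_median

print(round(bowley_skew, 3))
print(mean_above_median)
```

Both checks point in the same direction as the histogram: the coefficient is positive and the mean is larger than the median.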

Boxplot

Figure 2: Boxplots for the Diabetes Health Indicators Dataset

The boxplots in Figure 2 support the theoretical effects of some of the variables. Based on the age boxplot, older people are more likely to have heart disease. Based on the MaxHR boxplot, people with a lower maximum heart rate are, on average, more likely to have heart disease. Finally, people with a higher Oldpeak are, on average, more likely to have heart disease, which makes sense given that an Oldpeak equal to ± 1 is indicative of a serious health condition.

Examining Feature Multicollinearity for Continuous and Categorical Variables

Finally, it is imperative to understand which features are correlated with each other in order to address and avoid multicollinearity within our models. By using a correlation plot, we can visualize the relationships between certain features. The correlation plot is only able to determine the correlation for continuous variables.

corrplot(diabetes_correlations$correlations, 
         method = 'number',
         type = 'lower',
         diag = FALSE,
         number.cex = 1,
         tl.cex = 1)

Figure 3: Multicollinearity plot for continuous predictor variables

Calkins indicates that “…correlation coefficients whose magnitude are between 0.3 and 0.5 indicate variables which have a low correlation”. The correlation with the largest magnitude is 0.52. While this is above the upper bound of the “low correlation” range, it exceeds that bound by only 0.02. Therefore, it is reasonable to say that the correlations across the entire dataset are low.

Interestingly, the correlation between Education and Income is low, even though multiple studies show a strong correlation between education and income:

Study 1: https://research.stlouisfed.org/publications/page1-econ/2017/01/03/education-income-and-wealth#:~:text=Research%20indicates%20that%20the%20level,of%20debt%20relative%20to%20assets.
Study 2: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4534330/
Study 3: https://www.jstor.org/stable/3445959

Class Imbalance

prop.table(table(select(diabetes_data, Diabetes_012)))

## Diabetes_012
## No Diabetes    Diabetes Prediabetes 
##  0.84241170  0.13933302  0.01825528

The output above shows the percentage of respondents who do not have diabetes (84.2%), who have diabetes (13.9%), and who are prediabetic (1.8%). Note that the approach used here to deal with class imbalance requires a response variable with only 2 classes. Since respondents with prediabetes make up only 1.8% of the observations, all observations where Diabetes_012 == Prediabetes were removed from the dataset. Removing a class slightly changed the nature of this study. Instead of determining whether someone is not at risk of diabetes, at risk of being prediabetic, or at risk of being diabetic, the model now determines only whether a person is or is not at risk of diabetes. Modeling and predicting the remaining classes still yielded valuable insights.

# Drop the Prediabetes observations, then remove the now-empty factor level
diabetes_data <- diabetes_data %>%
  filter(Diabetes_012 != 'Prediabetes') %>%
  mutate(Diabetes_012 = droplevels(Diabetes_012))

prop.table(table(select(diabetes_data, Diabetes_012)))

## Diabetes_012
## No Diabetes    Diabetes 
##   0.8580761   0.1419239
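The proportions before and after removing the prediabetes class follow directly from the class counts in the summary. The short check below reproduces that arithmetic; the variable names are illustrative and the counts are copied from the summary output.

```r
# Class counts copied from the summary() output
class_counts <- c("No Diabetes" = 213703, "Diabetes" = 35346, "Prediabetes" = 4631)

# Proportions with all three classes
prop_all <- class_counts / sum(class_counts)

# Proportions after dropping the Prediabetes class
two_class <- class_counts[c("No Diabetes", "Diabetes")]
prop_two <- two_class / sum(two_class)

print(round(prop_all, 4))
print(round(prop_two, 4))
```

Dropping 4,631 observations leaves 249,049 rows, so the share of each remaining class rises slightly.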

Stratified Random Sampling

In its current form, the dataset is too large for the SVM to converge within an acceptable time frame. Therefore, the dataset was reduced to 25% of its observations using stratified random sampling (SplitRatio = 0.25 below). Stratifying on the response ensures that the class distribution of Diabetes_012 in the sample matches its distribution in the full dataset. The sample.split function from the caTools package performs the stratified random sampling.

set.seed(123)
stratified_bool_vector <- sample.split(diabetes_data$Diabetes_012, SplitRatio = 0.25)
diabetes_data_stratified <- subset(diabetes_data, stratified_bool_vector == TRUE)

diabetes_data_stratified %>%
  select(Diabetes_012) %>%
  table() %>%
  prop.table()

## Diabetes_012
## No Diabetes    Diabetes 
##   0.8580836   0.1419164

The output above shows that the class proportions of Diabetes_012 in the sample are close to those of the original dataset.
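The idea behind stratified sampling can be illustrated in base R without caTools. The sketch below is not the sample.split implementation: stratified_sample is a hypothetical helper, and the toy data frame only mimics the roughly 86/14 class split of the real data.

```r
set.seed(123)

# Toy data with roughly the same class split as the report (illustrative only)
toy <- data.frame(
  id = 1:1000,
  Diabetes_012 = factor(
    rep(c("No Diabetes", "Diabetes"), times = c(860, 140)),
    levels = c("No Diabetes", "Diabetes")
  )
)

# Hypothetical helper: keep a fraction `ratio` of the rows within each class
stratified_sample <- function(data, class_col, ratio) {
  rows_by_class <- split(seq_len(nrow(data)), data[[class_col]])
  keep <- unlist(lapply(rows_by_class, function(rows) {
    rows[sample.int(length(rows), size = round(length(rows) * ratio))]
  }))
  data[sort(keep), , drop = FALSE]
}

toy_sample <- stratified_sample(toy, "Diabetes_012", ratio = 0.25)
print(prop.table(table(toy_sample$Diabetes_012)))
```

Because rows are drawn separately within each class, the sample keeps the class proportions of the full data regardless of which individual rows are selected.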

Splitting Data into Testing and Training

Handling outliers and noise and normalizing the data, which are common preprocessing steps, are not needed when creating a decision tree model (Practical Machine Learning in R, page 301). The dataset is split into training (75%) and testing (25%) sets.

set.seed(123)
original_split <- caTools::sample.split(diabetes_data_stratified$Diabetes_012, SplitRatio = 0.75)
diabetes_data_stratified_train <-  subset(diabetes_data_stratified, original_split == TRUE)
diabetes_data_stratified_test <- subset(diabetes_data_stratified, original_split == FALSE)

prop.table(table(select(diabetes_data_stratified, Diabetes_012)))

## Diabetes_012
## No Diabetes    Diabetes 
##   0.8580836   0.1419164

prop.table(table(select(diabetes_data_stratified_train, Diabetes_012)))

## Diabetes_012
## No Diabetes    Diabetes 
##   0.8580851   0.1419149

prop.table(table(select(diabetes_data_stratified_test, Diabetes_012)))

## Diabetes_012
## No Diabetes    Diabetes 
##    0.858079    0.141921

The output above shows that there is a significant class imbalance in all three sets. To address it, SMOTE from the DMwR package is applied to the training dataset only.

diabetes_data_stratified_train <- SMOTE(Diabetes_012 ~ ., data.frame(diabetes_data_stratified_train), perc.over = 100, perc.under = 200)
prop.table(table(select(diabetes_data_stratified_train, Diabetes_012)))

## Diabetes_012
## No Diabetes    Diabetes 
##         0.5         0.5
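The 50/50 split follows from the two percentage arguments. The sketch below works through the arithmetic, assuming that perc.over is the percentage of synthetic minority cases to generate and that perc.under is taken relative to the number of synthetic cases, as the DMwR documentation describes. smote_sizes is a hypothetical helper and the minority count of 1000 is a placeholder, not a value from the report.

```r
# Hypothetical helper: expected class sizes after DMwR-style SMOTE
# (assumes perc.under is relative to the number of synthetic minority cases)
smote_sizes <- function(n_minority, perc_over, perc_under) {
  n_synthetic <- n_minority * perc_over / 100
  c(
    minority = n_minority + n_synthetic,      # originals plus synthetic cases
    majority = n_synthetic * perc_under / 100 # randomly retained majority cases
  )
}

# Placeholder minority count; the ratio does not depend on it
sizes <- smote_sizes(n_minority = 1000, perc_over = 100, perc_under = 200)
print(sizes / sum(sizes))
```

With perc.over = 100 the minority class doubles, and perc.under = 200 keeps twice as many majority cases as were synthesized, so both classes end up the same size.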

Fitting the Decision Tree Model with the Reduced Dataset

In order to create a fair comparison between models, the decision tree algorithm was fit to the same reduced dataset produced by stratified random sampling. The results of this fit are shown below.

diabetes_decision_tree_stratified <- rpart(
  Diabetes_012 ~ .,
  method = "class",
  data = diabetes_data_stratified_train
)

rpart.plot(diabetes_decision_tree_stratified)

Figure 4: Decision tree for the diabetes health indicators dataset using all of the available features

varImp(diabetes_decision_tree_stratified) %>%
  tibble::rownames_to_column() %>%
  dplyr::rename("variable" = rowname) %>%
  dplyr::arrange(Overall) %>%
  dplyr::mutate(variable = forcats::fct_inorder(variable)) %>%
  filter(Overall > 0) %>%
  ggplot(aes(x = variable, y = Overall)) + geom_col() + coord_flip()

Figure 5: Variable importance plot for the decision tree model which uses all of the available features

diabetes_decision_tree_stratified_pred <- predict(diabetes_decision_tree_stratified, diabetes_data_stratified_test, type = "class")
confusionMatrix(diabetes_decision_tree_stratified_pred, diabetes_data_stratified_test$Diabetes_012, positive = "Diabetes")

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    No Diabetes Diabetes
##   No Diabetes        9296      593
##   Diabetes           4060     1616
##                                           
##                Accuracy : 0.7011          
##                  95% CI : (0.6938, 0.7082)
##     No Information Rate : 0.8581          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2584          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7316          
##             Specificity : 0.6960          
##          Pos Pred Value : 0.2847          
##          Neg Pred Value : 0.9400          
##              Prevalence : 0.1419          
##          Detection Rate : 0.1038          
##    Detection Prevalence : 0.3647          
##       Balanced Accuracy : 0.7138          
##                                           
##        'Positive' Class : Diabetes        
##
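The headline statistics in the confusion matrix output can be reproduced by hand from the four cell counts. The sketch below does this for the decision tree results above, treating Diabetes as the positive class; the variable names are illustrative.

```r
# Cell counts from the decision tree confusion matrix (positive class: Diabetes)
tp <- 1616  # predicted Diabetes, actually Diabetes
fn <- 593   # predicted No Diabetes, actually Diabetes
fp <- 4060  # predicted Diabetes, actually No Diabetes
tn <- 9296  # predicted No Diabetes, actually No Diabetes
n <- tp + fn + fp + tn

accuracy <- (tp + tn) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
pos_pred_value <- tp / (tp + fp)

# Cohen's kappa: agreement beyond what the row and column totals imply by chance
expected_agreement <- ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n^2
cohen_kappa <- (accuracy - expected_agreement) / (1 - expected_agreement)

print(round(c(accuracy = accuracy, kappa = cohen_kappa,
              sensitivity = sensitivity, specificity = specificity,
              ppv = pos_pred_value), 4))
```

The low positive predictive value comes from the 4,060 false positives, which outnumber the 1,616 true positives because the test set keeps the original class imbalance.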

roc_decision_tree <-
prediction(
  predictions = predict(diabetes_decision_tree_stratified, diabetes_data_stratified_test, type = "prob")[, "No Diabetes"],
  labels = diabetes_data_stratified_test$Diabetes_012
)
roc_perf_decision_tree <- performance(roc_decision_tree, measure = "tpr", x.measure = "fpr")
plot(roc_perf_decision_tree, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)

Figure 6: ROC curve for the decision tree model using all of the available features

auc_decision_tree <- performance(roc_decision_tree, measure = "auc")
diabetes_decision_tree_auc <- unlist(slot(auc_decision_tree,"y.values"))
paste("Calculated AUC: ", diabetes_decision_tree_auc)

## [1] "Calculated AUC:  0.738143435923529"

Fitting the Support Vector Machine Model

The svm function in R allowed for the generation of an SVM model that uses all of the features in the training set to predict Diabetes_012. Several kernels were used in order to compare their performance with one another and with the results from the previous homework. These kernels include:

linear
polynomial
radial basis
sigmoid

These are all of the kernels available in the svm function in R.
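For reference, the four kernels can be written as small base R functions. This is a sketch following the kernel formulas given in the e1071 documentation for svm, not the library's internal code; the function names, the example vectors, and the gamma value are illustrative assumptions.

```r
# Kernel functions for two numeric vectors u and v (illustrative sketch)
linear_kernel <- function(u, v) sum(u * v)

polynomial_kernel <- function(u, v, gamma, coef0 = 0, degree = 3) {
  (gamma * sum(u * v) + coef0)^degree
}

radial_kernel <- function(u, v, gamma) exp(-gamma * sum((u - v)^2))

sigmoid_kernel <- function(u, v, gamma, coef0 = 0) {
  tanh(gamma * sum(u * v) + coef0)
}

# Example vectors and gamma chosen only for illustration
u <- c(1, 2)
v <- c(3, 4)
gamma <- 0.5

k_linear <- linear_kernel(u, v)
k_polynomial <- polynomial_kernel(u, v, gamma)
k_radial <- radial_kernel(u, v, gamma)
k_sigmoid <- sigmoid_kernel(u, v, gamma)
print(c(linear = k_linear, polynomial = k_polynomial,
        radial = k_radial, sigmoid = k_sigmoid))
```

The defaults of degree = 3 and coef0 = 0 match the values reported in the model summaries in the following sections.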

Linear Kernel SVM

diabetes_svm_linear <- svm(
  Diabetes_012 ~ .,
  kernel = "linear",
  type = "C-classification",
  data = diabetes_data_stratified_train
)

summary(diabetes_svm_linear)

## 
## Call:
## svm(formula = Diabetes_012 ~ ., data = diabetes_data_stratified_train, 
##     kernel = "linear", type = "C-classification")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  13470
## 
##  ( 6731 6739 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  No Diabetes Diabetes

diabetes_svm_pred_linear <- predict(diabetes_svm_linear, diabetes_data_stratified_test, type = "class")
confusionMatrix(diabetes_svm_pred_linear, diabetes_data_stratified_test$Diabetes_012, positive = "Diabetes")

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    No Diabetes Diabetes
##   No Diabetes       10332      745
##   Diabetes           3024     1464
##                                          
##                Accuracy : 0.7579         
##                  95% CI : (0.751, 0.7646)
##     No Information Rate : 0.8581         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.305          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.66274        
##             Specificity : 0.77358        
##          Pos Pred Value : 0.32620        
##          Neg Pred Value : 0.93274        
##              Prevalence : 0.14192        
##          Detection Rate : 0.09406        
##    Detection Prevalence : 0.28834        
##       Balanced Accuracy : 0.71816        
##                                          
##        'Positive' Class : Diabetes       
##

roc_pred_linear <-
prediction(
  predictions = as.numeric(diabetes_svm_pred_linear),
  labels = as.numeric(diabetes_data_stratified_test$Diabetes_012)
)
roc_perf_linear <- performance(roc_pred_linear, measure = "tpr", x.measure = "fpr")
plot(roc_perf_linear, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)

Figure 7: ROC curve for the linear kernel SVM model using all of the available features

auc_perf_linear <- performance(roc_pred_linear, measure = "auc")
diabetes_auc_linear <- unlist(slot(auc_perf_linear,"y.values"))
paste("Calculated AUC: ", diabetes_auc_linear)

## [1] "Calculated AUC:  0.718164114215431"
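Note that the AUC above is computed from predicted class labels rather than continuous scores, so the ROC curve has a single operating point and the AUC reduces to the balanced accuracy. The sketch below checks this with a rank-based AUC on label vectors rebuilt from the linear kernel confusion matrix; auc_rank is a hypothetical helper, not a ROCR function.

```r
# Cell counts from the linear kernel confusion matrix (positive class: Diabetes)
tp <- 1464
fn <- 745
fp <- 3024
tn <- 10332

# Rebuild true labels and hard 0/1 predictions from the counts
labels <- c(rep(1, tp + fn), rep(0, fp + tn))
scores <- c(rep(1, tp), rep(0, fn), rep(1, fp), rep(0, tn))

# Rank-based (Mann-Whitney) AUC; tied scores share the average rank
auc_rank <- function(scores, labels) {
  ranks <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(ranks[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_hard <- auc_rank(scores, labels)
balanced_accuracy <- (tp / (tp + fn) + tn / (tn + fp)) / 2
print(c(auc = auc_hard, balanced_accuracy = balanced_accuracy))
```

A ROC curve with more than one operating point would need continuous scores; in e1071 these are available through predict with decision.values = TRUE. The same observation applies to the AUC values of the other SVM kernels, which also match their balanced accuracies.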

Polynomial Kernel SVM

diabetes_svm_polynomial <- svm(
  Diabetes_012 ~ .,
  kernel = "polynomial",
  type = "C-classification",
  data = diabetes_data_stratified_train
)

summary(diabetes_svm_polynomial)

## 
## Call:
## svm(formula = Diabetes_012 ~ ., data = diabetes_data_stratified_train, 
##     kernel = "polynomial", type = "C-classification")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  3 
##      coef.0:  0 
## 
## Number of Support Vectors:  23631
## 
##  ( 11812 11819 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  No Diabetes Diabetes

diabetes_svm_pred_polynomial <- predict(diabetes_svm_polynomial, diabetes_data_stratified_test, type = "class")
confusionMatrix(diabetes_svm_pred_polynomial, diabetes_data_stratified_test$Diabetes_012, positive = "Diabetes")

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    No Diabetes Diabetes
##   No Diabetes        6766      248
##   Diabetes           6590     1961
##                                           
##                Accuracy : 0.5607          
##                  95% CI : (0.5528, 0.5685)
##     No Information Rate : 0.8581          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1794          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8877          
##             Specificity : 0.5066          
##          Pos Pred Value : 0.2293          
##          Neg Pred Value : 0.9646          
##              Prevalence : 0.1419          
##          Detection Rate : 0.1260          
##    Detection Prevalence : 0.5494          
##       Balanced Accuracy : 0.6972          
##                                           
##        'Positive' Class : Diabetes        
##

roc_pred_polynomial <-
prediction(
  predictions = as.numeric(diabetes_svm_pred_polynomial),
  labels = as.numeric(diabetes_data_stratified_test$Diabetes_012)
)
roc_perf_polynomial <- performance(roc_pred_polynomial, measure = "tpr", x.measure = "fpr")
plot(roc_perf_polynomial, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)

Figure 8: ROC curve for the polynomial kernel SVM model using all of the available features

auc_perf_polynomial <- performance(roc_pred_polynomial, measure = "auc")
diabetes_auc_polynomial <- unlist(slot(auc_perf_polynomial,"y.values"))
paste("Calculated AUC: ", diabetes_auc_polynomial)

## [1] "Calculated AUC:  0.697160402236976"

Radial Basis Kernel SVM

diabetes_svm_radial <- svm(
  Diabetes_012 ~ .,
  kernel = "radial",
  type = "C-classification",
  data = diabetes_data_stratified_train
)

summary(diabetes_svm_radial)

## 
## Call:
## svm(formula = Diabetes_012 ~ ., data = diabetes_data_stratified_train, 
##     kernel = "radial", type = "C-classification")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  13699
## 
##  ( 6847 6852 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  No Diabetes Diabetes

diabetes_svm_pred_radial <- predict(diabetes_svm_radial, diabetes_data_stratified_test, type = "class")
confusionMatrix(diabetes_svm_pred_radial, diabetes_data_stratified_test$Diabetes_012, positive = "Diabetes")

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    No Diabetes Diabetes
##   No Diabetes       10262      693
##   Diabetes           3094     1516
##                                           
##                Accuracy : 0.7567          
##                  95% CI : (0.7499, 0.7634)
##     No Information Rate : 0.8581          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3128          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.6863          
##             Specificity : 0.7683          
##          Pos Pred Value : 0.3289          
##          Neg Pred Value : 0.9367          
##              Prevalence : 0.1419          
##          Detection Rate : 0.0974          
##    Detection Prevalence : 0.2962          
##       Balanced Accuracy : 0.7273          
##                                           
##        'Positive' Class : Diabetes        
##

roc_pred_radial <-
prediction(
  predictions = as.numeric(diabetes_svm_pred_radial),
  labels = as.numeric(diabetes_data_stratified_test$Diabetes_012)
)
roc_perf_radial <- performance(roc_pred_radial, measure = "tpr", x.measure = "fpr")
plot(roc_perf_radial, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)

Figure 9: ROC curve for the radial kernel SVM model using all of the available features

auc_perf_radial <- performance(roc_pred_radial, measure = "auc")
diabetes_auc_radial <- unlist(slot(auc_perf_radial,"y.values"))
paste("Calculated AUC: ", diabetes_auc_radial)

## [1] "Calculated AUC:  0.727313600830602"

Sigmoid Kernel SVM

diabetes_svm_sigmoid <- svm(
  Diabetes_012 ~ .,
  kernel = "sigmoid",
  type = "C-classification",
  data = diabetes_data_stratified_train
)

summary(diabetes_svm_sigmoid)

## 
## Call:
## svm(formula = Diabetes_012 ~ ., data = diabetes_data_stratified_train, 
##     kernel = "sigmoid", type = "C-classification")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  sigmoid 
##        cost:  1 
##      coef.0:  0 
## 
## Number of Support Vectors:  14048
## 
##  ( 7018 7030 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  No Diabetes Diabetes

diabetes_svm_pred_sigmoid <- predict(diabetes_svm_sigmoid, diabetes_data_stratified_test, type = "class")
confusionMatrix(diabetes_svm_pred_sigmoid, diabetes_data_stratified_test$Diabetes_012, positive = "Diabetes")

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    No Diabetes Diabetes
##   No Diabetes       10318      749
##   Diabetes           3038     1460
##                                           
##                Accuracy : 0.7567          
##                  95% CI : (0.7499, 0.7634)
##     No Information Rate : 0.8581          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3026          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.6609          
##             Specificity : 0.7725          
##          Pos Pred Value : 0.3246          
##          Neg Pred Value : 0.9323          
##              Prevalence : 0.1419          
##          Detection Rate : 0.0938          
##    Detection Prevalence : 0.2890          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : Diabetes        
##

roc_pred_sigmoid <-
prediction(
  predictions = as.numeric(diabetes_svm_pred_sigmoid),
  labels = as.numeric(diabetes_data_stratified_test$Diabetes_012)
)
roc_perf_sigmoid <- performance(roc_pred_sigmoid, measure = "tpr", x.measure = "fpr")
plot(roc_perf_sigmoid, main = "ROC Curve", col = "green", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)

Figure 10: ROC curve for the sigmoid kernel SVM model using all of the available features

auc_perf_sigmoid <- performance(roc_pred_sigmoid, measure = "auc")
diabetes_auc_sigmoid <- unlist(slot(auc_perf_sigmoid,"y.values"))
paste("Calculated AUC: ", diabetes_auc_sigmoid)

## [1] "Calculated AUC:  0.716734618147791"
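
The calculated AUC matches the Balanced Accuracy reported by `confusionMatrix()` (0.7167), and this is not a coincidence. Because `prediction()` receives predicted class labels rather than continuous scores, the ROC curve consists of a single operating point joined to (0, 0) and (1, 1), and the trapezoidal area under it reduces to (sensitivity + specificity) / 2. The base R sketch below checks this using the confusion matrix counts. The variable names are illustrative only.

```r
# Single operating point implied by the confusion matrix above
tpr <- 1460 / (1460 + 749)    # sensitivity
fpr <- 3038 / (3038 + 10318)  # 1 - specificity

# ROC "curve" built from hard labels: (0, 0) -> (fpr, tpr) -> (1, 1)
x <- c(0, fpr, 1)
y <- c(0, tpr, 1)

# Trapezoidal rule for the area under the piecewise-linear curve
auc_single_point  <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
balanced_accuracy <- (tpr + (1 - fpr)) / 2

print(c(AUC = auc_single_point, BalancedAccuracy = balanced_accuracy))
```

A smoother ROC curve would require continuous scores (decision values or class probabilities) in place of predicted labels. Whether the fitted model exposes these depends on how it was trained.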

Table of Metrics

data1 <- tribble(
  ~"",~"Accuracy",~"Kappa",~"Sensitivity",~"Specificity",~"AUC",
  "Decision Tree (HW 2)", "0.7083", "0.2711", "0.7429", "0.7026", "0.7468",
  "Decision Tree (Stratified)", "0.6996","0.2613","0.7438","0.6923","0.7378",
  "Linear Kernel SVM","0.7606","0.3193","0.6886","0.7725","0.7305",
  "Polynomial Kernel SVM", "0.5440","0.1642","0.8791","0.4886","0.6839",
  "Radial Kernel SVM", "0.7589","0.3242","0.7076","0.7674","0.7375",
  "Sigmoid Kernel SVM", "0.7593","0.3159","0.6849","0.7716","0.7282"
)
knitr::kable((data1), booktabs = TRUE)

	Accuracy	Kappa	Sensitivity	Specificity	AUC
Decision Tree (HW 2)	0.7083	0.2711	0.7429	0.7026	0.7468
Decision Tree (Stratified)	0.6996	0.2613	0.7438	0.6923	0.7378
Linear Kernel SVM	0.7606	0.3193	0.6886	0.7725	0.7305
Polynomial Kernel SVM	0.5440	0.1642	0.8791	0.4886	0.6839
Radial Kernel SVM	0.7589	0.3242	0.7076	0.7674	0.7375
Sigmoid Kernel SVM	0.7593	0.3159	0.6849	0.7716	0.7282

Table 1: Metrics for different model types

The metrics shown in Table 1 differ from the results shown in the “Fitting the Decision Tree Model with the Reduced Dataset” and “Fitting the Support Vector Machine Model” sections because the stratified sample is drawn at random. Each run of the markdown file therefore produced slightly different metric values.
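
One way to keep the table and the earlier sections in agreement is to fix the random seed before sampling. The sketch below illustrates the idea with a toy stratified sample in base R. The data frame, the `stratified_sample` helper, the sampling fraction, and the seed value are all hypothetical and are not taken from the original analysis.

```r
# Toy data with an imbalanced two-level outcome (hypothetical, for illustration)
toy <- data.frame(
  id    = 1:1000,
  class = factor(rep(c("No Diabetes", "Diabetes"), times = c(860, 140)))
)

# Draw the same fraction of rows from each level of the strata column
stratified_sample <- function(data, strata, frac) {
  rows_by_class <- split(seq_len(nrow(data)), data[[strata]])
  idx <- unlist(lapply(rows_by_class, function(rows) {
    rows[sample.int(length(rows), round(frac * length(rows)))]
  }))
  data[sort(idx), ]
}

# Resetting the seed before each draw makes the draws identical
set.seed(123)
first_draw <- stratified_sample(toy, "class", 0.1)
set.seed(123)
second_draw <- stratified_sample(toy, "class", 0.1)

print(identical(first_draw$id, second_draw$id))  # TRUE: same seed, same rows
print(table(first_draw$class))                   # 14 Diabetes, 86 No Diabetes
```

With a fixed seed, every knit of the document would draw the same rows, so the metrics quoted in the text and in the table would stay in sync.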

Comparison Between Decision Tree Model from Homework 2 and Decision Tree Model using Stratified Randomly Sampled Dataset

To create a fair comparison between the decision tree model and the SVMs, the decision tree model was refit to the stratified randomly sampled dataset. As expected, this sampling method produced a decision tree that differs from the model generated using the entire dataset. The refit tree made several splits drawing on 5 of the 21 available features, among them GenHlth, DiffWalk, HighBP, and HeartDiseaseorAttack. In Homework 2, only 3 splits were made, based on GenHlth, DiffWalk, and HighBP. The variable importance plot generated from the refit decision tree shows that GenHlth, DiffWalk, HighBP, and BMI are all significant features. In Homework 2, when using all of the data, HeartDiseaseorAttack was significant in addition to these features. Table 1 also shows a slightly lower accuracy (0.6996 vs. 0.7083) and a slightly lower AUC (0.7378 vs. 0.7468) for the refit tree. These results show that the tree built from the sampled dataset differs structurally from the tree built from the complete dataset, which suggests that the sampling process influenced the decision paths and node splits. The importance attributed to individual predictors also varied between the two models: variables deemed significant in the full-dataset model may diminish or shift in importance in the sampled-dataset model. Sampling can therefore affect both variable selection and model interpretation.

Comparison Between Linear, Polynomial, Radial, and Sigmoid Kernel SVMs

Several conclusions can be drawn from the SVM metrics in Table 1. The linear kernel SVM has the highest accuracy (0.7606), followed closely by the sigmoid (0.7593) and radial (0.7589) kernels, while the polynomial kernel SVM has the lowest accuracy (0.5440). The radial kernel SVM has the highest Kappa (0.3242), narrowly ahead of the linear kernel (0.3193), indicating slightly better agreement between predicted and observed values, while the polynomial kernel SVM has the lowest Kappa (0.1642). The polynomial kernel SVM has the highest sensitivity (0.8791), meaning it correctly identifies the most individuals with diabetes, but it also has by far the lowest specificity (0.4886), so it misclassifies more than half of the individuals without diabetes. Among the remaining kernels, the radial kernel has the highest sensitivity (0.7076), and the linear kernel has the highest specificity of all four models (0.7725). The radial kernel SVM has the highest AUC (0.7375), which indicates that it does a reasonably good job of discriminating between individuals with and without diabetes, while the polynomial kernel SVM has the lowest AUC (0.6839). Overall, the linear and radial kernel SVMs perform well across accuracy, Kappa, specificity, and AUC and appear to be the most promising models. The sigmoid kernel SVM performs similarly but is slightly lower on most metrics. The polynomial kernel SVM performs worst on every metric except sensitivity, suggesting it may not be well suited for diabetes prediction.
With that in mind, the linear kernel SVM was selected for comparison against the decision tree model.
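
The per-metric rankings can be checked directly against Table 1. The base R sketch below re-enters the four SVM rows as numbers and reports the best and worst kernel for each metric. The object names (`svm_metrics`, `best`, `worst`) are illustrative only.

```r
# SVM rows of Table 1, entered as numbers
svm_metrics <- data.frame(
  Accuracy    = c(0.7606, 0.5440, 0.7589, 0.7593),
  Kappa       = c(0.3193, 0.1642, 0.3242, 0.3159),
  Sensitivity = c(0.6886, 0.8791, 0.7076, 0.6849),
  Specificity = c(0.7725, 0.4886, 0.7674, 0.7716),
  AUC         = c(0.7305, 0.6839, 0.7375, 0.7282),
  row.names   = c("Linear", "Polynomial", "Radial", "Sigmoid")
)

# Name of the kernel with the highest and lowest value in each column
best  <- sapply(svm_metrics, function(col) rownames(svm_metrics)[which.max(col)])
worst <- sapply(svm_metrics, function(col) rownames(svm_metrics)[which.min(col)])

print(rbind(best, worst))
```

`which.max()` and `which.min()` return the first match, so a tie would go to the kernel listed first. There are no ties in Table 1.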

Comparison Between Best Performing SVM for Current Homework and Decision Tree Model from Stratified Random Sampling

To create a fair comparison, only the SVM and the decision tree model that were fit to the stratified randomly sampled dataset are compared in this section. Sampling introduces bias through the selection of specific samples, and comparing two models fit to two different datasets would introduce inconsistency in evaluation, making it difficult to draw meaningful conclusions.

Based on the metrics in Table 1, the linear kernel SVM has a higher accuracy (0.7606) than the decision tree model with stratified sampling (0.6996). It also has a higher Kappa (0.3193 vs. 0.2613), indicating better agreement between predicted and observed values. The decision tree has a higher sensitivity (0.7438 vs. 0.6886), so it is better at correctly identifying individuals with diabetes, while the linear kernel SVM has a higher specificity (0.7725 vs. 0.6923), so it is better at correctly identifying individuals without diabetes. The AUC of the linear kernel SVM (0.7305) is slightly lower than that of the decision tree (0.7378), but the two values are comparable, which shows that both models are capable of discerning whether or not someone has diabetes. In summary, the linear kernel SVM exhibits better overall performance than the decision tree model with stratified sampling, particularly in terms of accuracy, Kappa, and specificity.

Summary/Conclusions

The analysis compares several machine learning models for diabetes prediction based on health indicators. The decision tree model exhibits structural differences and slightly lower performance when fit to the stratified randomly sampled dataset than when fit to the full dataset. Among the SVM models, the linear kernel SVM has the highest accuracy and specificity, while the radial kernel SVM has the highest Kappa and AUC. In a direct comparison, the linear kernel SVM outperforms the decision tree model from stratified sampling in accuracy, Kappa, and specificity, although the decision tree model has a higher sensitivity. Both models show comparable AUC values, indicating that both can distinguish individuals with diabetes from those without. When using a stratified sample, one must also be aware of factors such as generalization, potential biases, shifts in variable importance, and consistency in evaluation. Overall, the linear kernel SVM emerges as the most promising model for diabetes prediction, offering a balanced mix of accuracy, specificity, and interpretability. Further validation and exploration are recommended to ensure robust predictions suitable for clinical applications.

Decision Trees vs. SVMs in My Current Area of Expertise

Article 1A: Decision Tree Ensembles to Predict Coronavirus Disease 2019 Infection: A Comparative Study

Link: https://www.hindawi.com/journals/complexity/2021/5550344/

Machine learning was used to create a model capable of predicting who might have COVID-19. The researchers used a dataset based on standard lab tests that were easy to gather. Their study focused on decision tree ensembles, which are known for their accuracy, adaptability, and ability to handle imbalanced data. Several tree-based methods were used: a single decision tree, a random forest model, a bagging model, an XGBoost model, and an AdaBoost model. Several techniques were also applied to deal with the imbalanced data, including SMOTEBoost, SMOTEBagging, RUSBoost, and RUSBagging. The researchers compared the models using several measures, such as F-measure, precision, recall, and area under curves such as ROC-AUC. For accuracy and precision, the random forest without balancing performed best. The balanced random forest model performed best for recall, F1, and AUPRC, and the RUSBagging model performed best for AUROC. Ultimately, their findings showed that decision tree ensembles tailored for imbalanced datasets perform best. They also found that including a person’s age as a factor improves the accuracy of these decision tree models. From a building science standpoint, effective disease detection models can influence HVAC building operation. For instance, if a model determined that temperature, lighting, and recycled airflow were significant factors in COVID-19 spread, then adjusting thermostat settings, turning off lights when not in use, and reducing the amount of recycled air within the building are all measures that might not only mitigate the effects of COVID-19 but also reduce energy consumption.

Article 1B: A novel approach to predict COVID-19 using support vector machine

Link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/

In this study, an SVM was fit to a dataset containing COVID-19 patient data in order to create a model capable of predicting whether or not a person was infected with COVID-19. The response variable contained 3 levels: not infected, mildly infected, and severely infected. To create a highly accurate model with very few missed predictions, the cost parameter was set to 10. When the performance of the SVM model was compared to that of the other supervised models, the SVM model was the most accurate. Accuracy was very important in this research because of the severity of COVID-19. Because the disease continues to evolve, the results will also change as new features are derived from the evolving symptoms. Just as COVID-19 patient data contains various factors such as symptoms, demographics, and medical history, building energy consumption data can include numerous variables like weather conditions, building size, occupancy patterns, and HVAC system efficiency. SVMs can effectively analyze such diverse datasets to make predictions. In the study, the SVM model was optimized for accuracy, aiming to minimize missed predictions. Similarly, in building science, accurate prediction of energy consumption is important for optimizing building operations, ensuring energy efficiency, and minimizing costs.

Article 2: Data Mining Using a Support Vector Machine, Decision Tree, Logistic Regression, and Random Forest for Pneumonia Prediction and Classification

Link: http://infor.seaninstitute.org/index.php/infokum/article/view/402/333

Predicting pneumonia accurately is crucial because it can lead to better patient care and outcomes. Early detection and treatment of pneumonia can prevent complications and reduce the severity of the illness. By using machine learning models, healthcare providers can make faster and more accurate diagnoses, leading to timely treatment. The goal of the study was to find the best way to predict and classify pneumonia cases using four different machine learning models: Support Vector Machine (SVM), Decision Tree, Logistic Regression, and Random Forest. Although the SVM model was more accurate than the decision tree model, the logistic regression model was the most accurate of the four. Just as early detection of pneumonia can lead to timely treatment and better patient outcomes, early detection of energy inefficiencies or anomalies in building energy consumption can lead to timely optimization and cost savings. Machine learning models can help detect these inefficiencies by analyzing patterns in energy consumption data. The study’s focus on accuracy and consistency highlights the importance of using reliable models for energy consumption prediction. A model that predicts energy usage accurately can help in budgeting, planning, and optimizing energy resources.

Article 3: Utility of Support Vector Machine and Decision Tree to Identify the Prognosis of Metformin Poisoning in the United States

Link: https://bmcpharmacoltoxicol.biomedcentral.com/articles/10.1186/s40360-022-00588-0

Metformin is a popular diabetes drug that has a fatality rate of 30 to 50% if unsafe levels are taken. This research was therefore undertaken to generate a data-driven model for early prognosis prediction in order to reduce the death rate. In this study, a decision tree model and an SVM model were fit to a dataset consisting of poisoning records collected by the American Association of Poison Control Centers. For the decision tree model, the ranking of the features showed that acidosis was the most important feature, followed by hypoglycemia and electrolyte abnormality. Both algorithms were found to be powerful in determining whether or not someone was poisoned from unsafe levels of metformin, but the SVM model predicted the correct outcome more precisely when the models were evaluated on a test set (70-30 split). From a building science perspective, the SVM algorithm can also be used for regression, and the results of this research suggest that an SVM model could be used to predict energy consumption accurately, allowing building managers and engineers to implement more efficient strategies that reduce energy costs and environmental impact in NYC buildings. The decision tree model was able to display the most important features. Similarly, in analyzing building energy consumption, we might find that factors like building insulation quality, HVAC system efficiency, and occupancy patterns are key predictors. Using machine learning, we could develop models that accurately predict energy consumption and identify potential energy-saving opportunities for buildings in NYC.

Article 4: Comparison of SVM Algorithms with Decision Trees for Accurate Recognition to Handwritten Digits to Improve the Accuracy Value

Link: https://versita.com/menuscript/index.php/Versita/article/view/832/911

The goal of this study was to build an effective system for recognizing handwritten digits. The decision tree and SVM algorithms were both applied to the dataset. Sklearn, a popular machine learning library in Python, has a built-in digits dataset which contains roughly 1,800 8x8 sample images of handwritten digits. Between the two algorithms, the support vector machine had a higher average accuracy than the decision tree, which shows that the support vector machine is more effective for this task. Both methods showed good results when tested using a T-Test (p<0.001) with a 95% confidence level. This relates to my area of expertise because many energy consumption meters within NYC are traditional analog meters. This is problematic because newer energy consumption meters are connected to a building automation system from which the energy consumption can be downloaded onto a computer for analysis, while the older analog meters are not connected to any such system and do not have any IoT (Internet of Things) capabilities. Therefore, as a test to track the energy consumption readouts from an analog meter, I installed a time-lapse camera pointed at the face of the meter, which takes a picture every minute. The results from this research article can therefore be used to determine which algorithm would be most accurate in converting the images from the time-lapse camera to meter readout values.

Decision Trees vs. SVMs in My Current Area of Expertise

Based on the research that I have done on decision trees vs. SVMs, I want to conclude with several key distinctions between the two algorithms that pertain to my area of expertise.

SVMs would be beneficial for my area of expertise because they are effective in handling complex, high-dimensional data, which is common in building energy consumption datasets that include variables such as weather conditions, building characteristics, and occupancy patterns. In my field, a year's worth of energy consumption data is needed to construct a model because New York goes through four seasons of weather changes, which have a significant impact on energy consumption. SVMs can be applied to predict energy consumption patterns in buildings by analyzing historical data and extracting patterns related to factors like weather conditions, building occupancy, and HVAC system efficiency. They can also be used for anomaly detection to identify unusual energy usage patterns that may indicate equipment malfunction or inefficient operation. However, SVMs are computationally expensive and require careful tuning of hyperparameters, such as the choice of kernel and the regularization parameter.
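
The hyperparameter tuning mentioned above can be sketched as a small cross-validated grid search. The example below assumes the `e1071` package is installed and uses its `tune.svm()` helper. The energy dataset is synthetic, and the column names (`temp_f`, `occupancy`, `kwh`), the formula used to generate consumption, and the candidate grid are all hypothetical choices made for illustration, not values from any real building.

```r
library(e1071)

# Synthetic daily energy data (hypothetical): consumption rises as the outdoor
# temperature moves away from 65 F and as occupancy increases.
set.seed(42)
n <- 150
energy <- data.frame(
  temp_f    = runif(n, 20, 95),
  occupancy = runif(n, 0, 1)
)
energy$kwh <- 50 + 0.02 * (energy$temp_f - 65)^2 +
  30 * energy$occupancy + rnorm(n, sd = 3)

# Grid search over the cost (regularization) and gamma (kernel width)
# parameters, scored by 5-fold cross-validation. A numeric response makes
# svm() fit a regression model.
tuned <- tune.svm(
  kwh ~ ., data = energy,
  kernel = "radial",
  cost   = c(1, 10, 100),
  gamma  = c(0.1, 1),
  tunecontrol = tune.control(cross = 5)
)

print(tuned$best.parameters)   # combination with the lowest cross-validated error
print(tuned$best.performance)  # mean squared error for that combination
```

Each extra candidate value multiplies the number of model fits, which is one reason SVM tuning becomes expensive on a full year of interval data.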

Decision trees are easy to interpret and understand, making them suitable for explaining energy consumption prediction models to building stakeholders and decision-makers. Decision trees can also handle feature interactions and nonlinear relationships automatically, which is advantageous in capturing relationships between input features and energy consumption. They can also be applied to predictive maintenance by identifying decision rules based on historical data to predict equipment failures or energy inefficiencies. However, they are prone to overfitting, and a single simple tree may be unable to capture highly complex relationships, which could warrant the use of a more flexible model such as a neural network.

In summary, one must be cognizant of the characteristics of the dataset, the need for model interpretability, and the computational resources available before deciding whether to use a decision tree or an SVM.

Homework 3

Peter Phung

2024-04-19