HOMEWORK #3

Perform an analysis of the dataset used in Homework #2 using the SVM algorithm. Compare the results with the results from previous homework.

Based on the articles:

https://www.hindawi.com/journals/complexity/2021/5550344/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/

Search for academic content (at least 3 articles) that compare the use of decision trees vs SVMs in your current area of expertise.

Which algorithm is recommended to get more accurate results?

Is it better for classification or regression scenarios?

Do you agree with the recommendations? Why?

Format: R file & essay

Decision Tree - Titanic from HW2

Analysis shown below:

Dataset selection

https://gist.github.com/fyyying/4aa5b471860321d7b47fd881898162b7#file-titanic_dataset-csv

About this file:

Titanic dataset

Variables:

PassengerId
Survived
Pclass
Name
Sex
Age
SibSp
Parch
Ticket
Fare
Cabin
Embarked

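library(dplyr)        # %>% and select()
library(rpart)        # rpart() decision tree
library(rpart.plot)   # rpart.plot() tree diagram
library(e1071)        # svm() and tune()
library(fastDummies)  # dummy_cols()
# define the filename (manual procedure)
filename1 <- "C:/Users/Lisa/OneDrive/CUNY/622/HW2/titanic_dataset.csv"
# load the CSV file from the local directory
dataset1 <- read.csv(filename1, header=TRUE)
str(dataset1)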

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

summary(dataset1)

##   PassengerId       Survived          Pclass         Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.00   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.00   Class :character  
##  Median :446.0   Median :0.0000   Median :3.00   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.31                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.00                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.00                     
##                                   NA's   :1                        
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
##

We need to convert the following variables to factors: Survived, Pclass, Sex, and Embarked. Blank strings are first set to NA.

#Set blanks to NA

dataset1[dataset1==""] <- NA
#dataset1<-dataset1[complete.cases(dataset1), ]


dataset1$Survived <- as.factor(dataset1$Survived)
dataset1$Pclass <- as.factor(dataset1$Pclass)
dataset1$Sex <- as.factor(dataset1$Sex)
dataset1$Embarked<-as.factor(dataset1$Embarked)
str(dataset1)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  NA "C85" NA "C123" ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...

And now we keep the following variables in the analysis:

Pclass
Sex
Age
SibSp
Parch
Fare
Embarked

and keep only the complete cases.

ds2 <- dataset1 %>%
  select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked)
dim(ds2)

## [1] 891   8

#keep only complete cases

#ds2[ds2==""] <- NA
ds2<-ds2[complete.cases(ds2), ]

dim(ds2)

## [1] 711   8
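
Before dropping rows, it can help to see which columns account for the missing values. The following is a minimal sketch on a small made-up data frame (`toy` and its values are illustrative and are not taken from the Titanic file):

```r
# Toy data frame standing in for dataset1 (values are made up for illustration)
toy <- data.frame(
  Age      = c(22, NA, 26, 35, NA),
  Fare     = c(7.25, 71.28, 7.92, 53.10, 8.05),
  Embarked = c("S", "C", "", "S", "Q"),
  stringsAsFactors = FALSE
)

# Treat empty strings as missing
toy$Embarked[toy$Embarked == ""] <- NA

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only rows with no missing values
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```

Applied to the real data, the same `colSums(is.na(...))` call would show how many of the dropped rows are due to Age versus the other columns.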

head(ds2)

##   Survived Pclass    Sex Age SibSp Parch    Fare Embarked
## 1        0      3   male  22     1     0  7.2500        S
## 2        1      1 female  38     1     0 71.2833        C
## 3        1      3 female  26     0     0  7.9250        S
## 4        1      1 female  35     1     0 53.1000        S
## 5        0      3   male  35     0     0  8.0500        S
## 7        0      1   male  54     0     0 51.8625        S

summary(ds2)

##  Survived Pclass      Sex           Age            SibSp       
##  0:424    1:183   female:259   Min.   : 0.42   Min.   :0.0000  
##  1:287    2:173   male  :452   1st Qu.:20.00   1st Qu.:0.0000  
##           3:355                Median :28.00   Median :0.0000  
##                                Mean   :29.64   Mean   :0.5148  
##                                3rd Qu.:38.00   3rd Qu.:1.0000  
##                                Max.   :80.00   Max.   :5.0000  
##      Parch             Fare        Embarked
##  Min.   :0.0000   Min.   :  0.00   C:130   
##  1st Qu.:0.0000   1st Qu.:  8.05   Q: 28   
##  Median :0.0000   Median : 15.55   S:553   
##  Mean   :0.4332   Mean   : 34.57           
##  3rd Qu.:1.0000   3rd Qu.: 32.75           
##  Max.   :6.0000   Max.   :512.33

Decision Tree - predict Survival - Titanic

Let’s predict survival using a decision tree algorithm.

Splitting the data

set.seed(1234)
sample_set<-sample(nrow(ds2), round(nrow(ds2)*.75), replace=FALSE)

ds_train<-ds2[sample_set,]
ds_test<-ds2[-sample_set,]

round(prop.table(table(select(ds2,Survived))),2)

## 
##   0   1 
## 0.6 0.4

round(prop.table(table(select(ds_train,Survived))),2)

## 
##    0    1 
## 0.58 0.42

round(prop.table(table(select(ds_test,Survived))),2)

## 
##    0    1 
## 0.64 0.36
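
The class proportions drift between the training set (0.58/0.42) and the test set (0.64/0.36) because the split is a simple random sample. One option is to sample within each class so that both parts keep roughly the original proportions. The following is a minimal base-R sketch on made-up labels; the helper `stratified_index` is hypothetical and was not part of the original analysis:

```r
# Hypothetical helper: draw a fraction p of row indices within each class
stratified_index <- function(y, p = 0.75) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    # Index with sample.int() to avoid the sample() pitfall on length-1 vectors
    idx[sample.int(length(idx), round(length(idx) * p))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1234)
# Made-up labels with a 60/40 class balance
y <- factor(rep(c(0, 1), times = c(60, 40)))
train_idx <- stratified_index(y, p = 0.75)

# Class proportions in the training part and in the held-out part
print(prop.table(table(y[train_idx])))
print(prop.table(table(y[-train_idx])))
```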

Training the model

set.seed(123)
ds_mod<-rpart(Survived~., 
                        method="class",
                        data=ds_train
                        )
rpart.plot(ds_mod)

### Test the tree model

Survived_pred<-predict(ds_mod,ds_test, type="class")

pred_table1<-table(ds_test$Survived,Survived_pred)

pred_table1

##    Survived_pred
##      0  1
##   0 95 19
##   1 21 43

sum(diag(pred_table1)/nrow(ds_test))

## [1] 0.7752809

The decision tree predicts survival with about 78% accuracy on the test data.
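
Accuracy alone hides how the errors are distributed across the two classes. As a sketch, the confusion matrix printed above can be rebuilt by hand to obtain sensitivity and specificity for the survived class (the counts are copied from the output above; the variable names are illustrative):

```r
# Confusion matrix reported above: rows = actual, columns = predicted
cm <- matrix(c(95, 21, 19, 43), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Share of all test passengers classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Share of actual survivors predicted as survivors
sensitivity <- cm["1", "1"] / sum(cm["1", ])

# Share of actual non-survivors predicted as non-survivors
specificity <- cm["0", "0"] / sum(cm["0", ])

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 3))
```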

SVM on the Titanic dataset

For SVM classification, we represent the categorical variables with dummy variables.

Specifically, the categorical variables Sex and Embarked are converted into dummy variables in both the training and test sets.

Furthermore, the variables must be put on a common scale, which we do by setting scale=TRUE in svm().

set.seed(1)
#ds_train2<-dummy.data.frame(data=ds_train, sep="_")
#ds_test2<-dummy.data.frame(data=ds_test, sep="_")

ds_train2 <- dummy_cols(ds_train, select_columns = c('Sex', 'Embarked'),
           remove_selected_columns = TRUE)

ds_test2 <- dummy_cols(ds_test, select_columns = c('Sex', 'Embarked'),
           remove_selected_columns = TRUE)
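
Encoding the training and test sets separately relies on both sets producing the same dummy columns. One way to guard against a level that is missing from one set is to fix the factor levels up front and build the indicators from them. The following is a base-R sketch using `model.matrix()` on made-up data; it is an alternative illustration and not the `dummy_cols()` approach used above:

```r
# Toy training and test sets (made-up values); the test set has no "Q" rows
levels_embarked <- c("C", "Q", "S")
train <- data.frame(
  Embarked = factor(c("S", "C", "Q", "S"), levels = levels_embarked),
  Fare     = c(7.25, 71.28, 8.46, 53.10)
)
test <- data.frame(
  Embarked = factor(c("S", "C"), levels = levels_embarked),
  Fare     = c(8.05, 30.00)
)

# One indicator column per level; "- 1" drops the intercept so no level is omitted
x_train <- model.matrix(~ Embarked + Fare - 1, data = train)
x_test  <- model.matrix(~ Embarked + Fare - 1, data = test)

# Both matrices share the same columns because the factor levels are shared
print(colnames(x_train))
print(identical(colnames(x_train), colnames(x_test)))

# Standardize Fare using the training mean and standard deviation only
fare_mean <- mean(train$Fare)
fare_sd   <- sd(train$Fare)
x_train[, "Fare"] <- (x_train[, "Fare"] - fare_mean) / fare_sd
x_test[, "Fare"]  <- (x_test[, "Fare"] - fare_mean) / fare_sd
```

Scaling the test set with the training statistics keeps the two sets on the same scale without letting test data influence the preprocessing.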

Training set after dummy variables and scaling

svmfit<-svm(Survived~., data=ds_train2, kernel="linear", cost=10,scale=TRUE)

summary(svmfit)

## 
## Call:
## svm(formula = Survived ~ ., data = ds_train2, kernel = "linear", 
##     cost = 10, scale = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  251
## 
##  ( 126 125 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

Tuning model

We used a cost parameter of 10. Let’s tune the cost using 10-fold cross-validation over a range of values.

set.seed(1)
tune.out<-tune(svm,Survived~., data=ds_train2, kernel="linear", ranges=list(cost=c(0.001,0.01,0.1, 1,5,10,100)))


summary(tune.out)

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##  0.01
## 
## - best performance: 0.2306778 
## 
## - Detailed performance results:
##    cost     error dispersion
## 1 1e-03 0.3266247 0.07325866
## 2 1e-02 0.2306778 0.06183791
## 3 1e-01 0.2306778 0.06183791
## 4 1e+00 0.2363382 0.05812666
## 5 5e+00 0.2363382 0.05812666
## 6 1e+01 0.2363382 0.05812666
## 7 1e+02 0.2363382 0.05812666

Cost = 0.01 gives the lowest cross-validation error rate (0.2307), tied with cost = 0.1.
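
To make the cross-validation step concrete, the following is a minimal base-R sketch of k-fold cross-validation that returns a mean error and its standard deviation across folds, mirroring the error and dispersion columns above. The function `cv_error` and the majority-class stand-in classifier are hypothetical; a real run would pass functions that fit and predict with `svm()`:

```r
# Generic k-fold cross-validation error for any classifier.
# fit_fun(train) returns a model; pred_fun(model, newdata) returns predicted labels.
cv_error <- function(data, target, fit_fun, pred_fun, k = 10) {
  # Randomly assign each row to one of k folds of near-equal size
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  fold_err <- sapply(seq_len(k), function(i) {
    held_out <- folds == i
    model <- fit_fun(data[!held_out, , drop = FALSE])
    pred  <- pred_fun(model, data[held_out, , drop = FALSE])
    # Misclassification rate on the held-out fold
    mean(pred != data[[target]][held_out])
  })
  c(error = mean(fold_err), dispersion = sd(fold_err))
}

# Stand-in classifier: always predict the most frequent training class
fit_majority  <- function(train) names(which.max(table(train$y)))
pred_majority <- function(model, newdata) rep(model, nrow(newdata))

set.seed(1)
# Made-up labels with a 60/40 class balance
toy <- data.frame(y = factor(rep(c("0", "1"), times = c(60, 40))))
res <- cv_error(toy, "y", fit_majority, pred_majority, k = 10)
print(res)
```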

Let’s look at the best model.

bestmod<-tune.out$best.model
summary(bestmod)

## 
## Call:
## best.tune(method = svm, train.x = Survived ~ ., data = ds_train2, 
##     ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)), kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.01 
## 
## Number of Support Vectors:  297
## 
##  ( 149 148 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

Predict on the test set

We use the best model to make predictions on the test set.

ypred<-predict(bestmod,ds_test2[,-1])

pred_tablesvm<-table(predict=ypred, truth=ds_test2$Survived)

pred_tablesvm

##        truth
## predict  0  1
##       0 98 17
##       1 16 47

sum(diag(pred_tablesvm/nrow(ds_test2)))

## [1] 0.8146067

The SVM predicts survival with about 81% accuracy on the test data.

Both the decision tree (78%) and the SVM (81%) beat the 64% majority-class baseline on the test set, with the SVM about 4 percentage points ahead (145 versus 138 correct predictions out of 178).
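
With only 178 test passengers, the gap between the two accuracies should be read with some caution. As a rough check, the sketch below computes an exact binomial confidence interval for each accuracy from the counts in the two confusion matrices above. This is an informal comparison and not a formal test; a paired test such as McNemar's would need the per-passenger predictions from both models:

```r
# Correct predictions out of 178 test rows, from the two confusion matrices above
n_test  <- 178
correct <- c(tree = 95 + 43, svm = 98 + 47)

# Exact (Clopper-Pearson) 95% confidence interval for each accuracy
ci <- sapply(correct, function(k) binom.test(k, n_test)$conf.int)
rownames(ci) <- c("lower", "upper")
print(round(ci, 3))

# TRUE if the two intervals overlap
print(ci["lower", "svm"] < ci["upper", "tree"])
```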

Further Discussion

HW 3 Articles

Article #1 Machine Learning Techniques in Drug Discovery and Development

Manne, R. (2021). Machine learning techniques in drug discovery and development. International Journal of Applied Research, 7(4), 21-28. https://www.researchgate.net/publication/350707374_Machine_Learning_Techniques_in_Drug_Discovery_and_Development

The author reviews several machine learning and deep learning techniques used in the pharmaceutical industry along the drug development pathway. Drug development spans target identification (targets such as proteins, DNA mutations, and biomarkers) through clinical trials. Manne presents several machine learning techniques and explains that SVM is important in drug discovery because of its ability to distinguish between active and inactive compounds. Decision trees may be used to predict drug likeness and drug properties such as absorption and penetration. However, the author points out a concern that changes in the data can lead to differing results. In addition, the size of the dataset may cause problems. I agree with the author that the two techniques can be used within the same drug development effort for different purposes.

Article #2 Medical decision-making based on the exploration of a personalized medicine dataset

Kadi, H., Rebbah, M., Meftah, B., & Lézoray, O. (2021). Medical decision-making based on the exploration of a personalized medicine dataset. Informatics in Medicine Unlocked, 23, 100561. https://www.sciencedirect.com/science/article/pii/S2352914821000514

Kadi et al. seek to explore automatic medical decision-making based on a personalized medicine dataset. They tested three distance measures to analyze patient similarities. The authors used PCA and other algorithms for dimension reduction and four classifiers, including SVM and random forest (RF), to classify patients. Their analysis also considers three scenarios. Kadi et al. conclude by recommending RF as the classifier. The study was quite extensive, weighing several options before arriving at a final recommendation. This is a powerful methodology for reviewing a medical dataset.

Article #3 Detecting Credit Card Fraud by Decision Trees and Support Vector Machines

Sahin, Y., & Duman, E. (2011). Detecting credit card fraud by decision trees and support vector machines. http://www.iaeng.org/publication/IMECS2011/IMECS2011_pp442-447.pdf?msclkid=39f78880b8bd11ecbe154fab1b37ca9b

The losses from credit card fraud are enormous, and detecting fraud is imperative. Sahin and Duman compare SVM and decision tree algorithms for fraud detection using real data. The authors assess three decision tree methods and four SVM methods. Sahin and Duman suggest that the decision tree approaches are superior to SVM. However, on larger datasets, SVM and decision trees perform similarly. They suggest that SVM may overfit the training set, but as the dataset size increases the models become comparable. Interestingly, SVM detects fewer fraud cases. Given the authors' conclusions, the decision tree methods seem to be superior in this study.

Overall, the quality and characteristics of the dataset should drive the selection of the algorithm. In machine learning, fitting several different models under several scenarios seems to be the standard protocol for answering the question at hand. Decision trees tend to handle categorical data better. Computational resources should also be taken into account, since SVM consumes more of them. Both decision trees and SVM can be used for classification and for regression. Whether a problem is classification or regression depends on whether the target variable is categorical or continuous. In decision tree regression, all observations in a bin (leaf) share the same predicted value. The SVM counterpart is support vector regression (SVR), which is similar to SVM with some modifications. Again, running several different models and comparing them on a common metric, such as RMSE for regression or accuracy for classification, seems to be the practice for determining the best model.
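
To illustrate the point that every observation in a regression tree leaf shares one predicted value, and how RMSE summarizes the fit, here is a minimal sketch with a hand-built one-split tree on made-up data. The threshold of 5 and the `rmse` helper are illustrative assumptions; a fitted tree would choose its own split:

```r
# Made-up data: a continuous target that jumps between x = 4 and x = 6
x <- c(1, 2, 3, 4, 6, 7, 8, 9)
y <- c(10, 12, 11, 13, 30, 29, 31, 32)

# A one-split "tree": each observation falls into a leaf by comparing x to a threshold
leaf <- ifelse(x < 5, "left", "right")

# Every observation in a leaf gets that leaf's mean as its prediction
pred <- ave(y, leaf)

# Root mean squared error between actual and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

print(unique(pred))
print(rmse(y, pred))
```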

HW3

Lisa Szdziak

4/9/2022