HOMEWORK #3
Perform an analysis of the dataset used in Homework #2 using the SVM algorithm. Compare the results with the results from previous homework.
Based on articles
https://www.hindawi.com/journals/complexity/2021/5550344/ https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/
Search for academic content (at least 3 articles) that compare the use of decision trees vs SVMs in your current area of expertise.
Which algorithm is recommended to get more accurate results?
Is it better for classification or regression scenarios?
Do you agree with the recommendations? Why?
Format: R file & essay
Analysis shown below:
https://gist.github.com/fyyying/4aa5b471860321d7b47fd881898162b7#file-titanic_dataset-csv
About this file:
Titanic dataset
Variables:
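# packages used below (assumed: the original listing omits the library() calls)
library(dplyr); library(rpart); library(rpart.plot)
library(fastDummies); library(e1071)
# define the filename (manual procedure)
filename1 <- "C:/Users/Lisa/OneDrive/CUNY/622/HW2/titanic_dataset.csv"
# load the CSV file from the local directory
dataset1 <- read.csv(filename1, header=TRUE)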
str(dataset1)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
summary(dataset1)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.00 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.00 Class :character
## Median :446.0 Median :0.0000 Median :3.00 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.31
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.00
## Max. :891.0 Max. :1.0000 Max. :3.00
## NA's :1
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
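We set blank strings to NA and convert Survived, Pclass, Sex, and Embarked to factors. (Survived and Pclass are read in as integers; Sex and Embarked as character.)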
#Set blanks to NA
dataset1[dataset1==""] <- NA
#dataset1<-dataset1[complete.cases(dataset1), ]
dataset1$Survived <- as.factor(dataset1$Survived)
dataset1$Pclass <- as.factor(dataset1$Pclass)
dataset1$Sex <- as.factor(dataset1$Sex)
dataset1$Embarked<-as.factor(dataset1$Embarked)
str(dataset1)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr NA "C85" NA "C123" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
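Before dropping rows, it can help to count how many values each column is missing. The sketch below is a minimal illustration on a small made-up data frame (`toy` is hypothetical, not the Titanic file); it applies the same blank-to-NA idiom and then counts NAs per column and the rows that `complete.cases()` would keep.

```r
# Toy data frame standing in for the real file (values are made up)
toy <- data.frame(
  Age      = c(22, NA, 26, 35),
  Cabin    = c("", "C85", "", "C123"),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# Same idiom as above: treat empty strings as missing
toy[toy == ""] <- NA

# Number of missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows that complete.cases() would keep
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

Running the same two calls on `dataset1` would show which columns drive the loss of rows when only complete cases are kept.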
We keep the following variables in the analysis and then retain only the complete cases:
ds2 <- dataset1 %>%
select(Survived,
Pclass,
Sex,
Age,
SibSp,
Parch,
Fare,
Embarked)
dim(ds2)
## [1] 891 8
#keep only complete cases
#ds2[ds2==""] <- NA
ds2<-ds2[complete.cases(ds2), ]
dim(ds2)
## [1] 711 8
head(ds2)
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 1 0 3 male 22 1 0 7.2500 S
## 2 1 1 female 38 1 0 71.2833 C
## 3 1 3 female 26 0 0 7.9250 S
## 4 1 1 female 35 1 0 53.1000 S
## 5 0 3 male 35 0 0 8.0500 S
## 7 0 1 male 54 0 0 51.8625 S
summary(ds2)
## Survived Pclass Sex Age SibSp
## 0:424 1:183 female:259 Min. : 0.42 Min. :0.0000
## 1:287 2:173 male :452 1st Qu.:20.00 1st Qu.:0.0000
## 3:355 Median :28.00 Median :0.0000
## Mean :29.64 Mean :0.5148
## 3rd Qu.:38.00 3rd Qu.:1.0000
## Max. :80.00 Max. :5.0000
## Parch Fare Embarked
## Min. :0.0000 Min. : 0.00 C:130
## 1st Qu.:0.0000 1st Qu.: 8.05 Q: 28
## Median :0.0000 Median : 15.55 S:553
## Mean :0.4332 Mean : 34.57
## 3rd Qu.:1.0000 3rd Qu.: 32.75
## Max. :6.0000 Max. :512.33
Let’s predict survival using a decision tree. We first split the data 75/25 into training and test sets and check the class proportions in each.
set.seed(1234)
sample_set<-sample(nrow(ds2), round(nrow(ds2)*.75), replace=FALSE)
ds_train<-ds2[sample_set,]
ds_test<-ds2[-sample_set,]
round(prop.table(table(select(ds2,Survived))),2)
##
## 0 1
## 0.6 0.4
round(prop.table(table(select(ds_train,Survived))),2)
##
## 0 1
## 0.58 0.42
round(prop.table(table(select(ds_test,Survived))),2)
##
## 0 1
## 0.64 0.36
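The simple random split above leaves the survivor share at 42% in the training set and 36% in the test set. One option, not used in the original analysis, is a stratified split that samples the same fraction within each class. The sketch below runs on made-up labels; `stratified_split` is a hypothetical helper written for this illustration.

```r
# Hypothetical helper: sample a fixed fraction of row indices within each class
stratified_split <- function(y, frac = 0.75) {
  idx_by_class <- split(seq_along(y), y)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), round(length(idx) * frac))]
  }), use.names = FALSE)
  sort(train_idx)
}

set.seed(1234)
# Made-up labels with a 60/40 class mix (not the Titanic data)
y <- factor(rep(c(0, 1), times = c(120, 80)))

train_idx <- stratified_split(y, frac = 0.75)
y_train <- y[train_idx]
y_test  <- y[-train_idx]

# Class proportions in each partition
print(round(prop.table(table(y_train)), 2))
print(round(prop.table(table(y_test)), 2))
```

Because the sampling happens inside each class, both partitions keep the overall class mix up to rounding.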
set.seed(123)
ds_mod<-rpart(Survived~.,
method="class",
data=ds_train
)
rpart.plot(ds_mod)
### Test the tree model
Survived_pred<-predict(ds_mod,ds_test, type="class")
pred_table1<-table(ds_test$Survived,Survived_pred)
pred_table1
## Survived_pred
## 0 1
## 0 95 19
## 1 21 43
sum(diag(pred_table1)/nrow(ds_test))
## [1] 0.7752809
The decision tree predicts survival with about 78% accuracy on the test data.
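Accuracy alone hides how the errors are distributed between the two classes. As a small worked example, the sketch below re-enters the confusion matrix printed above by hand and derives accuracy, sensitivity, and specificity from it; the object name `cm_tree` is introduced only for this illustration.

```r
# Confusion matrix from the decision tree output above
# (rows = actual class, columns = predicted class)
cm_tree <- matrix(c(95, 21, 19, 43), nrow = 2,
                  dimnames = list(actual = c("0", "1"),
                                  predicted = c("0", "1")))

accuracy    <- sum(diag(cm_tree)) / sum(cm_tree)
sensitivity <- cm_tree["1", "1"] / sum(cm_tree["1", ])  # survivors correctly identified
specificity <- cm_tree["0", "0"] / sum(cm_tree["0", ])  # non-survivors correctly identified

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 3))
```

The same three lines can be applied to any 2x2 table, as long as the row and column orientation (actual vs. predicted) is checked first.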
For SVM classification, we represent the categorical variables as dummy variables.
Specifically, Sex and Embarked are converted into dummy variables in both the training and test sets. Pclass is left as a factor, which the formula interface of svm() expands on its own.
Furthermore, we set scale=TRUE so that the variables are standardized and contribute on comparable scales.
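To make the two preprocessing steps concrete, here is a minimal base-R sketch on made-up rows (not the Titanic data). It uses `model.matrix()` for the dummy coding and `scale()` for standardization as stand-ins; the analysis itself relies on `dummy_cols()` and on the internal scaling of `svm()`.

```r
# Made-up rows for illustration (not the Titanic data)
toy <- data.frame(
  Sex  = factor(c("male", "female", "female", "male")),
  Age  = c(22, 38, 26, 35),
  Fare = c(7.25, 71.28, 7.93, 53.10)
)

# One 0/1 column per level of Sex; "- 1" drops the intercept so both levels appear
sex_dummies <- model.matrix(~ Sex - 1, data = toy)
print(sex_dummies)

# Standardize numeric columns to mean 0 and standard deviation 1
scaled <- scale(toy[, c("Age", "Fare")])
print(round(colMeans(scaled), 10))
print(apply(scaled, 2, sd))
```

Without scaling, a variable with a large range such as Fare would dominate the distance computations that the SVM margin depends on.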
set.seed(1)
#ds_train2<-dummy.data.frame(data=ds_train, sep="_")
#ds_test2<-dummy.data.frame(data=ds_test, sep="_")
ds_train2 <- dummy_cols(ds_train, select_columns = c('Sex', 'Embarked'),
remove_selected_columns = TRUE)
ds_test2 <- dummy_cols(ds_test, select_columns = c('Sex', 'Embarked'),
remove_selected_columns = TRUE)
svmfit<-svm(Survived~., data=ds_train2, kernel="linear", cost=10,scale=TRUE)
summary(svmfit)
##
## Call:
## svm(formula = Survived ~ ., data = ds_train2, kernel = "linear",
## cost = 10, scale = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
##
## Number of Support Vectors: 251
##
## ( 126 125 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
We used a cost parameter of 10. Let’s tune the cost with 10-fold cross-validation over a range of values.
set.seed(1)
tune.out<-tune(svm,Survived~., data=ds_train2, kernel="linear", ranges=list(cost=c(0.001,0.01,0.1, 1,5,10,100)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.01
##
## - best performance: 0.2306778
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-03 0.3266247 0.07325866
## 2 1e-02 0.2306778 0.06183791
## 3 1e-01 0.2306778 0.06183791
## 4 1e+00 0.2363382 0.05812666
## 5 5e+00 0.2363382 0.05812666
## 6 1e+01 0.2363382 0.05812666
## 7 1e+02 0.2363382 0.05812666
Cost = 0.01 gives the lowest cross-validation error rate (0.2307). Cost = 0.1 ties with it, and the tuner reports the smaller of the two values.
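The selection can be checked by hand. The sketch below copies the cost and error columns from the tuning output into a data frame (`cv` is a name introduced for this illustration) and looks up the minimum, which also makes the tie between two cost values visible.

```r
# Cross-validation results copied from the tuning output above
cv <- data.frame(
  cost  = c(0.001, 0.01, 0.1, 1, 5, 10, 100),
  error = c(0.3266247, 0.2306778, 0.2306778, 0.2363382,
            0.2363382, 0.2363382, 0.2363382)
)

# All cost values that reach the minimum CV error
tied <- cv$cost[cv$error == min(cv$error)]
print(tied)

# which.min() returns the first minimum, i.e. the smallest tied cost
best_cost <- cv$cost[which.min(cv$error)]
print(best_cost)

# Estimated CV accuracy at the selected cost
print(1 - min(cv$error))
```

The differences between cost values of 0.01 and above are small relative to the reported dispersion, so the choice of cost matters little here once it is above 0.001.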
Let’s look at the best model.
bestmod<-tune.out$best.model
summary(bestmod)
##
## Call:
## best.tune(method = svm, train.x = Survived ~ ., data = ds_train2,
## ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)), kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.01
##
## Number of Support Vectors: 297
##
## ( 149 148 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
We use the best model to make predictions on the test set:
ypred<-predict(bestmod,ds_test2[,-1])
pred_tablesvm<-table(predict=ypred, truth=ds_test2$Survived)
pred_tablesvm
## truth
## predict 0 1
## 0 98 17
## 1 16 47
sum(diag(pred_tablesvm/nrow(ds_test2)))
## [1] 0.8146067
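The SVM predicts survival with about 81% accuracy on the test data.
Both the decision tree (78%) and the SVM (81%) clearly outperform the 64% accuracy that always predicting the majority class (did not survive) would achieve on the test set. The SVM's edge of about 3 percentage points corresponds to 7 more correct predictions out of 178 test cases.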
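With only 178 test cases, it is worth asking how much uncertainty surrounds each accuracy figure. As a rough sketch, the code below computes an exact binomial confidence interval for each model from the counts in the two confusion matrices. It treats the two accuracies as independent, which is a simplifying assumption: both models were scored on the same passengers, and a paired comparison such as McNemar's test would need the per-case predictions, which are not shown in the output.

```r
n_test       <- 178  # size of the test set (sum of either confusion matrix)
correct_tree <- 95 + 43
correct_svm  <- 98 + 47

# Exact (Clopper-Pearson) 95% confidence intervals for each accuracy
ci_tree <- binom.test(correct_tree, n_test)$conf.int
ci_svm  <- binom.test(correct_svm,  n_test)$conf.int

print(round(ci_tree, 3))
print(round(ci_svm, 3))

# TRUE if the two intervals overlap
overlap <- ci_tree[2] > ci_svm[1] && ci_svm[2] > ci_tree[1]
print(overlap)
```

Overlapping intervals suggest that the gap between the two models on this single split should be read with caution rather than as a firm ranking.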
HW 3 Articles
Article #1 Machine Learning Techniques in Drug Discovery and Development
Manne, R. (2021). Machine learning techniques in drug discovery and development. International Journal of Applied Research, 7(4), 21-28. https://www.researchgate.net/publication/350707374_Machine_Learning_Techniques_in_Drug_Discovery_and_Development
The author reviews several machine learning and deep learning techniques used in the pharmaceutical industry along the pathway to drug development. Drug development includes target identification (involving proteins, DNA mutations, and biomarkers) and clinical trials. Manne presents several machine learning techniques and explains that SVM is important in drug discovery because of its ability to distinguish between active and inactive compounds. Decision trees may be used to predict drug likeness and drug properties such as absorption and penetration. However, the author raises the concern that changes in the data can cause differing results, and that the size of the dataset may cause problems as well. I agree with the author's view that the two techniques can be used within the same drug development effort for different purposes.
Article #2 Medical decision-making based on the exploration of a personalized medicine dataset
Kadi, H., Rebbah, M., Meftah, B., & Lézoray, O. (2021). Medical decision-making based on the exploration of a personalized medicine dataset. Informatics in Medicine Unlocked, 23, 100561. https://www.sciencedirect.com/science/article/pii/S2352914821000514
Kadi et al. explore automatic medical decision-making based on a personalized medicine dataset. They tested 3 distance measurements to analyze patient similarities. The authors used PCA and other algorithms for dimension reduction and 4 classifiers, including SVM and random forest (RF), to classify patients. Their analysis also considers 3 scenarios. Kadi et al. conclude by recommending RF as the classifier. The study was quite extensive and weighed several options before arriving at a final recommendation. This is a powerful methodology for reviewing a medical dataset.
Article #3 Detecting Credit Card Fraud by Decision Trees and Support Vector Machines
Sahin, Y., & Duman, E. (2011). Detecting credit card fraud by decision trees and support vector machines. http://www.iaeng.org/publication/IMECS2011/IMECS2011_pp442-447.pdf?msclkid=39f78880b8bd11ecbe154fab1b37ca9b
The losses from credit card fraud are enormous, and the need to detect fraud is imperative. Sahin and Duman compare SVM and decision tree algorithms for fraud detection using real data. The authors assess 3 decision tree methods and 4 SVM methods. Sahin and Duman suggest that the decision tree approaches are superior to SVM. They note that SVM may overfit the training set, but as the dataset size increases the two kinds of model become comparable. Interestingly, the SVM detects fewer fraud cases. Given the authors' conclusions, the decision tree methods seem to be superior in this study.
Overall, the quality and characteristics of the dataset should drive the selection of the algorithm. In machine learning, fitting several different models under several scenarios appears to be the standard protocol for answering the question at hand. Decision trees tend to be better suited to categorical data. Computational resources should also be taken into account, with SVM consuming more of them.
Decision trees and SVM can both be used for classification and for regression. Whether a problem is one of classification or regression depends on whether the target variable is categorical or continuous. In a decision tree regression, all observations that fall into the same leaf receive the same predicted value. The SVM counterpart is support vector regression (SVR), which is similar to SVM with some modifications. Again, the usual practice is to run several different models and compare them on a metric such as RMSE for regression, or accuracy for classification, to help determine the best model.
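The regression case described above can be sketched in a few lines. The example below is an illustration only: it uses the built-in `mtcars` data as a stand-in regression problem (an assumption, unrelated to the Titanic analysis), fits a regression tree with `rpart()` and a support vector regression with `svm()` from e1071, and compares them by RMSE on a held-out set. The `rmse` helper is defined here for the example.

```r
library(rpart)   # regression trees
library(e1071)   # svm(); fits support vector regression for a numeric response

set.seed(1)
# Built-in mtcars data used as a stand-in regression problem (assumption)
train_idx <- sample(nrow(mtcars), 24)
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

tree_fit <- rpart(mpg ~ ., data = train, method = "anova")
svr_fit  <- svm(mpg ~ ., data = train)  # numeric response, so eps-regression by default

# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

tree_pred <- predict(tree_fit, test)
svr_pred  <- predict(svr_fit, test)

# A regression tree predicts one constant per leaf
n_leaves <- sum(tree_fit$frame$var == "<leaf>")
print(n_leaves)
print(length(unique(tree_pred)))

print(c(tree = rmse(test$mpg, tree_pred), svr = rmse(test$mpg, svr_pred)))
```

The number of distinct tree predictions can never exceed the number of leaves, which is the "same predicted value per bin" behavior in practice, whereas the SVR predictions vary continuously with the inputs.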