Perform an analysis of the dataset used in Homework #2 using the SVM algorithm. Compare the results with those from the previous homework. Base the discussion on the following articles:
https://www.hindawi.com/journals/complexity/2021/5550344/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/
Search for academic content (at least 3 articles) that compares the use of decision trees vs. SVMs in your current area of expertise. Which algorithm is recommended for more accurate results? Is it better suited to classification or regression scenarios? Do you agree with the recommendations? Why? Format: R file & essay. Due date: Sunday, April 24, 2022, end of day.
Based on the topics presented, bring a dataset of your choice and create a decision tree to solve a classification or regression problem and predict the outcome for a particular feature of the data. Switch variables to generate two decision trees and compare the results. Create a random forest for regression and analyze the results.
Description: Gas mileage, horsepower, and other information for 392 vehicles (the Auto data set from the ISLR package).
Format: A data frame with 392 observations on the following 9 variables:
- mpg: miles per gallon
- cylinders: number of cylinders, between 4 and 8
- displacement: engine displacement (cu. inches)
- horsepower: engine horsepower
- weight: vehicle weight (lbs.)
- acceleration: time to accelerate from 0 to 60 mph (sec.)
- year: model year (modulo 100)
- origin: origin of car (1. American, 2. European, 3. Japanese)
- name: vehicle name
The original data contained 408 observations, but 16 observations with missing values were removed.
Source: This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition.
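The analysis below relies on several R packages whose library() calls are not shown in the knitted output, so a setup chunk along the following lines is assumed:
# assumed setup chunk (library calls not shown in the original output)
library(ISLR)          # Carseats and Auto data sets
library(rpart)         # decision trees
library(rpart.plot)    # tree plots
library(randomForest)  # random forests (also called below via randomForest::)
library(e1071)         # svm()
library(dplyr)         # %>% and mutate()
library(skimr)         # skim() summaries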
# Get the list of data sets contained in package
x <- data(package = "ISLR")
x$results[, "Item"]
## [1] "Auto" "Caravan" "Carseats" "College" "Credit" "Default"
## [7] "Hitters" "Khan" "NCI60" "OJ" "Portfolio" "Smarket"
## [13] "Wage" "Weekly"
colnames(x)
## NULL
As you can see, the ISLR package provides data sets ranging from Auto to Weekly.
data(Carseats)
# Get the variable names
names(Carseats)
## [1] "Sales" "CompPrice" "Income" "Advertising" "Population"
## [6] "Price" "ShelveLoc" "Age" "Education" "Urban"
## [11] "US"
dim(Carseats)
## [1] 400 11
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
As suggested by the assignment instructions, I will build a classification tree to analyze the Carseats data set. Carseats is a simulated data set containing sales of child car seats at 400 different stores, so it has 400 observations and 11 variables. I am interested in predicting Sales from the other variables in the data set. Since Sales is a continuous variable, we first need to recode it as a binary variable: the new variable, High, takes the value Yes if Sales exceeds 8 and No otherwise.
High = ifelse(Carseats$Sales <=8, "No", "Yes")
Carseats=data.frame(Carseats,High)
Carseats.H <- Carseats[,-1]
Carseats.H$High = as.factor(Carseats$High)
class(Carseats.H$High)
## [1] "factor"
set.seed(888)
thetrain = sample(1:nrow(Carseats.H), 200)
Carseats.thetrain=Carseats.H[thetrain,]
Carseats.thetest=Carseats.H[-thetrain,]
High.thetest=High[-thetrain]
My first step is to fit a classification tree on the training set to predict High using all variables except Sales (remember that High was derived from Sales).
The cp value (complexity parameter) is a stopping parameter. It speeds up the search for splits because splits that do not improve the fit by at least cp are pruned away before the search goes too far.
If the goal is to grow a deep tree, the default value of 0.01 may be too restrictive, which is why a smaller value of 0.008 is used below.
fit.thetree = rpart(High ~ ., data=Carseats.thetrain, method = "class", cp=0.008)
fit.thetree
## n= 200
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 200 87 No (0.56500000 0.43500000)
## 2) Price>=96.5 161 57 No (0.64596273 0.35403727)
## 4) ShelveLoc=Bad 42 3 No (0.92857143 0.07142857) *
## 5) ShelveLoc=Good,Medium 119 54 No (0.54621849 0.45378151)
## 10) Advertising< 8.5 65 19 No (0.70769231 0.29230769)
## 20) CompPrice< 144.5 51 9 No (0.82352941 0.17647059) *
## 21) CompPrice>=144.5 14 4 Yes (0.28571429 0.71428571) *
## 11) Advertising>=8.5 54 19 Yes (0.35185185 0.64814815)
## 22) ShelveLoc=Medium 36 18 No (0.50000000 0.50000000)
## 44) Education>=13.5 21 6 No (0.71428571 0.28571429) *
## 45) Education< 13.5 15 3 Yes (0.20000000 0.80000000) *
## 23) ShelveLoc=Good 18 1 Yes (0.05555556 0.94444444) *
## 3) Price< 96.5 39 9 Yes (0.23076923 0.76923077)
## 6) CompPrice< 99 8 3 No (0.62500000 0.37500000) *
## 7) CompPrice>=99 31 4 Yes (0.12903226 0.87096774) *
# Visualizing
rpart.plot(fit.thetree)
pred.thetree = predict(fit.thetree, Carseats.thetest, type = "class")
table(pred.thetree,High.thetest)
## High.thetest
## pred.thetree No Yes
## No 90 32
## Yes 33 45
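From the confusion matrix above, the test accuracy of the unpruned classification tree can be computed directly (a small check using only the counts shown):
# accuracy = correct predictions / total test observations
conf <- table(pred.thetree, High.thetest)
sum(diag(conf)) / sum(conf)   # (90 + 45) / 200 = 0.675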
#plotcp(fit.thetree)
printcp(fit.thetree)
##
## Classification tree:
## rpart(formula = High ~ ., data = Carseats.thetrain, method = "class",
## cp = 0.008)
##
## Variables actually used in tree construction:
## [1] Advertising CompPrice Education Price ShelveLoc
##
## Root node error: 87/200 = 0.435
##
## n= 200
##
## CP nsplit rel error xerror xstd
## 1 0.241379 0 1.00000 1.00000 0.080587
## 2 0.091954 1 0.75862 0.98851 0.080476
## 3 0.068966 3 0.57471 0.87356 0.078901
## 4 0.051724 4 0.50575 0.73563 0.075827
## 5 0.022989 6 0.40230 0.73563 0.075827
## 6 0.008000 7 0.37931 0.65517 0.073379
# cp value with the lowest cross-validated error (xerror)
fit.thetree$cptable[which.min(fit.thetree$cptable[,"xerror"]),"CP"]
## [1] 0.008
Next we prune the classification tree, choosing the value of cp (the complexity parameter) that leads to the lowest estimated test error.
Note that the optimal value of cp is the one with the lowest xerror in the previous output, which is the error estimated on the cross-validation data.
bestcp <-fit.thetree$cptable[which.min(fit.thetree$cptable[,"xerror"]),"CP"]
pruned.thetree <- prune(fit.thetree, cp = bestcp)
rpart.plot(pruned.thetree)
pred.prune = predict(pruned.thetree, Carseats.thetest, type="class")
table(pred.prune, High.thetest)
## High.thetest
## pred.prune No Yes
## No 90 32
## Yes 33 45
# drop the High variable (column 12) so Sales can be modeled directly
Carseats.S <- Carseats[,-12]
set.seed(999)
thetrain = sample(1:nrow(Carseats.S), 200)
Carseats.thetrain=Carseats.S[thetrain,]
Carseats.thetest=Carseats.S[-thetrain,]
Analysis of variance (ANOVA) consists of calculations that quantify the levels of variability within a regression model and form the basis for tests of significance. The basic regression idea, DATA = FIT + RESIDUAL, is written observation by observation as (y_i − ȳ) = (ŷ_i − ȳ) + (y_i − ŷ_i).
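As a quick numeric illustration of this decomposition (a hypothetical check using a simple lm() fit on the same training data, not part of the tree analysis itself):
# total SS = model SS + residual SS for a least-squares fit with an intercept
fit.lm <- lm(Sales ~ Price, data = Carseats.thetrain)
sst <- sum((Carseats.thetrain$Sales - mean(Carseats.thetrain$Sales))^2)
ssr <- sum((fitted(fit.lm) - mean(Carseats.thetrain$Sales))^2)
sse <- sum(residuals(fit.lm)^2)
all.equal(sst, ssr + sse)   # TRUE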
fit.thetree = rpart(Sales ~ ., data=Carseats.thetrain, method="anova", cp=0.008)
#summary(fit.thetree)
fit.thetree
## n= 200
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 200 1605.306000 7.289650
## 2) ShelveLoc=Bad,Medium 161 982.688100 6.655714
## 4) Price>=94.5 135 646.254000 6.136593
## 8) ShelveLoc=Bad 46 163.515600 4.679565
## 16) Population< 106 11 48.512490 3.319091 *
## 17) Population>=106 35 88.244510 5.107143
## 34) Age>=33.5 28 65.860070 4.787857 *
## 35) Age< 33.5 7 8.112371 6.384286 *
## 9) ShelveLoc=Medium 89 334.610500 6.889663
## 18) Price>=127 29 83.957080 5.622069
## 36) Advertising< 3.5 13 18.137710 4.636154 *
## 37) Advertising>=3.5 16 42.915940 6.423125 *
## 19) Price< 127 60 181.534500 7.502333
## 38) Age>=60.5 25 70.170620 6.565600
## 76) CompPrice< 118.5 9 10.483560 5.307778 *
## 77) CompPrice>=118.5 16 37.438540 7.273125 *
## 39) Age< 60.5 35 73.758030 8.171429
## 78) Advertising< 6 18 29.197650 7.171667 *
## 79) Advertising>=6 17 7.519200 9.230000 *
## 5) Price< 94.5 26 111.153100 9.351154
## 10) Advertising< 9 15 70.662090 8.597333 *
## 11) Advertising>=9 11 20.344090 10.379090 *
## 3) ShelveLoc=Good 39 290.814100 9.906667
## 6) Price>=135 9 15.959890 6.391111 *
## 7) Price< 135 30 130.252300 10.961330
## 14) Age>=62 7 12.926490 9.168571 *
## 15) Age< 62 23 87.980690 11.506960
## 30) Urban=No 9 22.282160 9.887778 *
## 31) Urban=Yes 14 26.934240 12.547860 *
rpart.plot(fit.thetree)
fit.thetree$variable.importance
## ShelveLoc Price Age Advertising CompPrice Income
## 483.59509 474.09259 139.48793 87.61880 78.87432 57.47487
## US Population Urban Education
## 55.24076 54.08655 38.76430 30.12734
pred.thetree = predict(fit.thetree, Carseats.thetest)
The mean squared error (MSE) measures how close a model's predictions are to the observed values. It takes the differences between predictions and observations (the "errors"), squares them (which removes negative signs and gives more weight to larger differences), and averages them: MSE = (1/n) Σ (y_i − ŷ_i)². The lower the MSE, the better the forecast.
# mean square error
mse <- mean((pred.thetree - Carseats.thetest$Sales)^2)
mse
## [1] 4.530078
# CP value
printcp(fit.thetree)
##
## Regression tree:
## rpart(formula = Sales ~ ., data = Carseats.thetrain, method = "anova",
## cp = 0.008)
##
## Variables actually used in tree construction:
## [1] Advertising Age CompPrice Population Price ShelveLoc
## [7] Urban
##
## Root node error: 1605.3/200 = 8.0265
##
## n= 200
##
## CP nsplit rel error xerror xstd
## 1 0.2066921 0 1.00000 1.00668 0.094663
## 2 0.1403352 1 0.79331 0.87310 0.081991
## 3 0.0922740 2 0.65297 0.69819 0.067697
## 4 0.0900774 3 0.56070 0.69214 0.067207
## 5 0.0430565 4 0.47062 0.57741 0.056675
## 6 0.0234260 5 0.42756 0.57185 0.061614
## 7 0.0230742 6 0.40414 0.59280 0.063329
## 8 0.0212139 7 0.38106 0.59789 0.063341
## 9 0.0166688 9 0.33864 0.60690 0.064391
## 10 0.0142673 10 0.32197 0.64061 0.063234
## 11 0.0138594 11 0.30770 0.62653 0.061521
## 12 0.0125502 12 0.29384 0.63228 0.061349
## 13 0.0088906 13 0.28129 0.64558 0.060469
## 14 0.0080000 14 0.27240 0.64923 0.060476
bestcp <- fit.thetree$cptable[which.min(fit.thetree$cptable[,"xerror"]),"CP"]
bestcp
## [1] 0.02342595
Pruning at the cp value with the lowest cross-validated error yields a much simpler tree that should generalize about as well as the full tree; a simpler model is usually better suited to a production environment even when its test error is similar (here the pruned regression tree's test MSE, computed below, is in fact slightly higher than the unpruned tree's). However, other factors can also influence decision tree modeling, such as building a tree on unbalanced classes. These factors were not accounted for in this demonstration, but it is important to examine them when formulating a live model.
pruned.thetree <- prune(fit.thetree, cp = bestcp)
# Visualize the pruned tree
rpart.plot(pruned.thetree)
# Checking the order of variable importance
pruned.thetree$variable.importance
## ShelveLoc Price Age Income CompPrice Education
## 479.932017 442.221990 35.353913 33.396181 29.246370 7.150235
## Advertising Population
## 3.220173 2.383412
Decision trees are easy to validate and interpret as predictive models, they are useful for quantitative analysis of business problems, they can be checked against the results of statistical tests, and they naturally support classification problems with several classes.
# Use the test data to evaluate performance of the pruned regression tree
pred.prune = predict(pruned.thetree, Carseats.thetest)
# Calculate the MSE for the pruned tree
mse <- mean((pred.prune - Carseats.thetest$Sales)^2)
mse
## [1] 4.897713
Random forest is an ensemble learning technique based on the bagging algorithm. It builds many trees on bootstrap samples of the data and combines their outputs, which reduces the overfitting and variance of a single decision tree and therefore improves accuracy.
Random forests can be used to solve both classification and regression problems (a classification sketch follows after this list).
Random forests work well with both categorical and continuous variables.
Some random forest implementations can handle missing values (the randomForest package in R requires them to be imputed or omitted, e.g. with rfImpute or na.omit).
No feature scaling is required: random forests use rule-based splits rather than distance calculations, so standardization and normalization are unnecessary.
# random forest using all predictors
# columns 1:11 are all the original Carseats variables (High was already dropped)
modFit.rf <- randomForest::randomForest(Carseats.thetrain$Sales ~ ., data = Carseats.thetrain[,c(1:11)])
modFit.rf
##
## Call:
## randomForest(formula = Carseats.thetrain$Sales ~ ., data = Carseats.thetrain[, c(1:11)])
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 3.232277
## % Var explained: 59.73
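The forest's variable importance can also be inspected, analogous to the variable.importance vectors examined for the rpart trees (a sketch; this output was not part of the original run):
# node-impurity (IncNodePurity) importance for the fitted forest
randomForest::importance(modFit.rf)
randomForest::varImpPlot(modFit.rf)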
forest_pred <- predict(modFit.rf, Carseats.thetest)
table(forest_pred)
## forest_pred
## 4.49766066666666         4.553272 4.74268461904762 4.84429633333333
##                1                1                1                1
## ... (remaining output omitted: the forest's predictions are continuous, so nearly every one of the 200 predicted values is unique and each count is 1)
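Since tabulating continuous predictions is not very informative, a more direct comparison with the single-tree results is the forest's test-set MSE (a sketch; this value was not computed in the original run, so no number is reported):
# test-set MSE for the random forest, comparable to the tree MSEs of 4.53 and 4.90 above
forest_mse <- mean((forest_pred - Carseats.thetest$Sales)^2)
forest_mse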
# install.packages("skimr")
skim(Carseats.thetrain)
Name | Carseats.thetrain |
Number of rows | 200 |
Number of columns | 11 |
_______________________ | |
Column type frequency: | |
factor | 3 |
numeric | 8 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
ShelveLoc | 0 | 1 | FALSE | 3 | Med: 107, Bad: 54, Goo: 39 |
Urban | 0 | 1 | FALSE | 2 | Yes: 142, No: 58 |
US | 0 | 1 | FALSE | 2 | Yes: 125, No: 75 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Sales | 0 | 1 | 7.29 | 2.84 | 0.37 | 5.20 | 7.38 | 9.31 | 15.63 | ▁▇▇▃▁ |
CompPrice | 0 | 1 | 125.44 | 15.43 | 77.00 | 115.00 | 125.00 | 135.00 | 162.00 | ▁▃▇▆▂ |
Income | 0 | 1 | 66.49 | 27.17 | 22.00 | 42.00 | 67.00 | 87.00 | 120.00 | ▇▆▇▆▅ |
Advertising | 0 | 1 | 6.12 | 6.30 | 0.00 | 0.00 | 5.00 | 11.00 | 26.00 | ▇▂▃▁▁ |
Population | 0 | 1 | 261.65 | 151.17 | 12.00 | 128.25 | 271.00 | 398.50 | 508.00 | ▇▆▅▇▆ |
Price | 0 | 1 | 116.46 | 24.04 | 24.00 | 100.00 | 118.00 | 131.00 | 191.00 | ▁▂▇▅▁ |
Age | 0 | 1 | 53.95 | 16.58 | 25.00 | 39.75 | 55.00 | 66.00 | 80.00 | ▇▆▇▇▇ |
Education | 0 | 1 | 13.81 | 2.54 | 10.00 | 12.00 | 14.00 | 16.00 | 18.00 | ▇▇▃▇▆ |
Finally, following the assignment, I fit a support vector machine on the same training data, this time predicting Income from the remaining variables with a linear kernel and cost 10.
set.seed(69)
my_svm <- svm(Income ~ ., data = Carseats.thetrain, kernel = "linear", cost = 10, scale = TRUE)
summary(my_svm)
##
## Call:
## svm(formula = Income ~ ., data = Carseats.thetrain, kernel = "linear",
## cost = 10, scale = TRUE)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: linear
## cost: 10
## gamma: 0.08333333
## epsilon: 0.1
##
##
## Number of Support Vectors: 185
set.seed(888)
pred <- predict(my_svm, newdata=Carseats.thetest)
plot(my_svm, Carseats.thetest)  # note: plot.svm is designed for classification fits, so it may not produce a useful plot for this eps-regression model
Carseats.thetest$pred <- predict(my_svm, newdata=Carseats.thetest)
the_rmse <- Carseats.thetrain %>%
mutate(residual = Income - pred) %>%
summarize(rmse = sqrt(mean(residual^2)))
print(the_rmse)
## rmse
## 1 32.36961
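Note that the pipeline above takes Income from the training frame while pred holds predictions for the test set (the two just happen to have the same length), so the reported RMSE mixes the two splits. A corrected test-set RMSE would look like this (sketch only; the value is not reported here):
# RMSE of the SVM on the held-out test data
the_rmse_test <- Carseats.thetest %>%
  mutate(residual = Income - pred) %>%
  summarize(rmse = sqrt(mean(residual^2)))
print(the_rmse_test)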
summary: SVMs use the kernel trick to solve non-linear problems, whereas decision trees partition the input space into hyper-rectangles. Decision trees are better suited to categorical data and deal with collinearity better than SVMs.
summary: The biggest difference between the two algorithms is that SVM uses the kernel trick to turn a linearly non-separable problem into a linearly separable one (unless of course we use the linear kernel), while decision trees (and forests and boosted trees based on them, both to a lesser extent due to the nature of the ensemble algorithms) split the input space into hyper-rectangles according to the target. Usually one will work better than the other in a given situation, but in high-dimensional spaces it is hard to tell which unless something about the data suggests one over the other. Trying both is the preferred method, though the winner is hardly obvious in most cases. Most of the time, people use a validation set not only to optimize hyperparameters but also to choose between algorithms. It is not perfect, but it often works. Also, categorical inputs cannot be fed to an SVM directly; SVMs operate only on numeric data, so the categories must first be encoded numerically.
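For example, the factor columns in Carseats can be expanded into numeric dummy columns before being passed to an SVM (a small sketch with base R's model.matrix; when the formula interface of svm() is used, a similar expansion is typically done internally):
# expand the factors (ShelveLoc, Urban, US) into 0/1 indicator columns
X <- model.matrix(Income ~ . - 1, data = Carseats.thetrain)
head(colnames(X))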
summary: In the end, if you have the computational resources to do so, try both. See which class of models performs best on your holdout/validation/test dataset(s). If your data is highly structured, gradient boosting methods will likely perform very well, and oftentimes you can train a high-performing booster in less time than it takes to fit an SVM. If your data includes categorical features, be aware that the performance of tree-based methods often suffers when these features are one-hot encoded (see here for a great discussion of why this occurs). So either use another encoding strategy such as target encoding, or use a library with native handling of categorical features, such as H2O (see here).
If the data are mostly categorical, I would first prefer a decision tree. Decision trees have many advantages: they are highly interpretable and they automatically handle the multicollinearity problem. If the data are very sparse, I would prefer an SVM. Some people ask why an SVM is not as good as a decision tree on the same data; possibilities include an inappropriate kernel (e.g. a linear kernel for a non-linear problem) or a poor choice of kernel and regularisation hyper-parameters. Good model selection (choice of kernel and hyper-parameter tuning) is the key to getting good performance from SVMs; they can only be expected to give good results when used correctly. SVMs can also take a long time to train, especially when the choice of kernel and, in particular, the regularisation parameter means that almost all of the data end up as support vectors (the sparsity of SVMs is a handy by-product, nothing more). Lastly, there is no a priori superiority of any classifier over the others, so the best classifier for a particular task is itself task-dependent. However, there is compelling theory for the SVM suggesting it is likely to be a better choice than many other approaches for many problems.
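To illustrate the model-selection point, the kernel and regularisation hyper-parameters can be searched by cross-validation with e1071::tune (a hypothetical sketch on the same training data; not run here, so no results are shown):
# cross-validated grid search over cost and gamma for a radial-kernel SVM
set.seed(123)
tuned <- e1071::tune(svm, Income ~ ., data = Carseats.thetrain,
                     kernel = "radial",
                     ranges = list(cost = c(0.1, 1, 10, 100),
                                   gamma = c(0.01, 0.1, 1)))
summary(tuned)              # table of cross-validation errors
best_svm <- tuned$best.model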
SVM is one of the supervised algorithms mostly used for classification problems, and it has several general advantages. SVM is a very helpful method when we do not know much about the data; it can be used for data such as images, text, and audio, and for data that is not regularly distributed or has an unknown distribution. SVM provides a very useful technique known as the kernel: by applying an appropriate kernel function we can solve complex problems. A kernel lets us choose a function that is not necessarily linear and can take different forms depending on the data it operates on, so it is non-parametric. Classification normally carries the strong assumption that the samples are linearly separable, but with the introduction of a kernel the input data can be mapped into a high-dimensional space, avoiding the need for this assumption: K(x1, x2) = 〈f(x1), f(x2)〉, where K is the kernel function, x1 and x2 are n-dimensional inputs, f is a function that maps the n-dimensional space into an m-dimensional space, and 〈·, ·〉 denotes the dot product. SVMs generally do not suffer from overfitting and perform well when there is a clear margin of separation between classes. SVMs can be used when the number of samples is less than the number of dimensions and perform well in terms of memory. SVMs also generalize well to out-of-sample data, and prediction is relatively fast because classifying a new sample only requires evaluating the kernel function against the support vectors. Another important advantage of the SVM algorithm is that it can handle high-dimensional data, which is a great help given its many uses and applications in the machine learning field.
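As a tiny numeric illustration of the kernel identity above (a hypothetical toy example, unrelated to the Carseats data): for the homogeneous quadratic kernel K(x, z) = (x . z)^2 on 2-dimensional inputs, the implicit feature map is f(x) = (x1^2, sqrt(2)*x1*x2, x2^2), and both sides of the identity agree without ever forming f explicitly.
# kernel trick check: K(u, w) computed directly equals <f(u), f(w)>
u <- c(1, 2); w <- c(3, 4)
K_direct  <- sum(u * w)^2                                  # (u . w)^2 = 121
f <- function(v) c(v[1]^2, sqrt(2) * v[1] * v[2], v[2]^2)  # explicit feature map
K_feature <- sum(f(u) * f(w))                              # also 121
all.equal(K_direct, K_feature)                             # TRUE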