Perform an analysis of the dataset used in Homework #2 using the SVM algorithm. Compare the results with the results from previous homework.
Based on articles
https://www.hindawi.com/journals/complexity/2021/5550344/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/
Search for academic content (at least 3 articles) that compare the use of decision trees vs SVMs in your current area of expertise. Which algorithm is recommended to get more accurate results? Is it better for classification or regression scenarios? Do you agree with the recommendations? Why?
Format: R file & essay
For Homework 2, I used the NYC OpenData set on HIV/AIDS Diagnoses by Neighborhood, Sex, and Race/Ethnicity data set. I wanted to understand if the diagnoses of HIV in NYC was driven by location (neighborhood) and also sex, RACE.ETHNICITY. I sought to predict the number of “TOTAL.NUMBER.OF.HIV.DIAGNOSES” variable and find out whether variables “Neighborhood..U.H.F.”, “SEX” and “RACE.ETHNICITY” had any relationship with this dependent variable.
There appears to be some problem uploading and reading the file from GitHub, I was able to read the exact same file from my local directory.
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.2
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.3.2
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ recipes 1.0.8
## ✔ dials 1.2.0 ✔ rsample 1.2.0
## ✔ dplyr 1.1.2 ✔ tibble 3.2.1
## ✔ ggplot2 3.4.3 ✔ tidyr 1.3.0
## ✔ infer 1.0.5 ✔ tune 1.1.2
## ✔ modeldata 1.2.0 ✔ workflows 1.1.3
## ✔ parsnip 1.1.1 ✔ workflowsets 1.0.1
## ✔ purrr 1.0.2 ✔ yardstick 1.2.0
## Warning: package 'dials' was built under R version 4.3.2
## Warning: package 'infer' was built under R version 4.3.2
## Warning: package 'modeldata' was built under R version 4.3.2
## Warning: package 'parsnip' was built under R version 4.3.2
## Warning: package 'recipes' was built under R version 4.3.2
## Warning: package 'rsample' was built under R version 4.3.2
## Warning: package 'tune' was built under R version 4.3.2
## Warning: package 'workflows' was built under R version 4.3.2
## Warning: package 'workflowsets' was built under R version 4.3.2
## Warning: package 'yardstick' was built under R version 4.3.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ rsample::permutations() masks e1071::permutations()
## ✖ recipes::step() masks stats::step()
## ✖ tune::tune() masks parsnip::tune(), e1071::tune()
## • Learn how to get started at https://www.tidymodels.org/start/
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ lubridate 1.9.2 ✔ stringr 1.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ stringr::fixed() masks recipes::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ readr::spec() masks yardstick::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DataExplorer)
library(rpart)
##
## Attaching package: 'rpart'
##
## The following object is masked from 'package:dials':
##
## prune
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:ggplot2':
##
## margin
##
## The following object is masked from 'package:dplyr':
##
## combine
library(openxlsx)
library(lubridate)
library(readxl)
HIV <- read.csv("C:\\Users\\tparker3\\OneDrive - Bell Partners, Inc\\Documents\\HIV_AIDS_Diagnoses_by_Neighborhood__Sex__and_Race_Ethnicity_20231029.csv")
Once the file is loaded, we proceed with reviewing the structure and content of the data set to understand which machine learning algorithms can be applied in this context.
Of the 11 variables in the file, 1). TOTAL.NUMBER.OF.HIV.DIAGNOSES, 2). HIV.DIAGNOSES.PER.100.000.POPULATION, 3). PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES, 4). TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES, 5). TOTAL.NUMBER.OF.AIDS.DIAGNOSES and 6). AIDS.DIAGNOSES.PER.100.000.POPULATION need to be converted from char too int.
Additionally, the two char variables SEX and RACE.ETHNICITY need to be converted to factor data type.
I handle both of these conversions in the data clean-up section below.
head(HIV)
## YEAR Borough Neighborhood..U.H.F. SEX RACE.ETHNICITY
## 1 2010 Greenpoint Male Black
## 2 2011 Stapleton - St. George Female Native American
## 3 2010 Southeast Queens Male All
## 4 2012 Upper Westside Female Unknown
## 5 2013 Willowbrook Male Unknown
## 6 2013 East Flatbush - Flatbush Male Black
## TOTAL.NUMBER.OF.HIV.DIAGNOSES HIV.DIAGNOSES.PER.100.000.POPULATION
## 1 6 330.4
## 2 0 0
## 3 23 25.4
## 4 0 0
## 5 0 0
## 6 54 56.5
## TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES
## 1 0
## 2 0
## 3 5
## 4 0
## 5 0
## 6 8
## PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES
## 1 0
## 2 0
## 3 21.7
## 4 0
## 5 0
## 6 14.8
## TOTAL.NUMBER.OF.AIDS.DIAGNOSES AIDS.DIAGNOSES.PER.100.000.POPULATION
## 1 5 275.3
## 2 0 0
## 3 14 15.4
## 4 0 0
## 5 0 0
## 6 33 34.5
colnames(HIV)
## [1] "YEAR"
## [2] "Borough"
## [3] "Neighborhood..U.H.F."
## [4] "SEX"
## [5] "RACE.ETHNICITY"
## [6] "TOTAL.NUMBER.OF.HIV.DIAGNOSES"
## [7] "HIV.DIAGNOSES.PER.100.000.POPULATION"
## [8] "TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES"
## [9] "PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES"
## [10] "TOTAL.NUMBER.OF.AIDS.DIAGNOSES"
## [11] "AIDS.DIAGNOSES.PER.100.000.POPULATION"
print(nrow(HIV))
## [1] 8976
print(ncol(HIV))
## [1] 11
This data file has 8,976 rows and 11 columns. The summary and plot_missing functions show there are missing values for 6 variables listed below:
1). TOTAL.NUMBER.OF.HIV.DIAGNOSES 2). HIV.DIAGNOSES.PER.100.000.POPULATION 3). TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES 4). PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES 5). TOTAL.NUMBER.OF.AIDS.DIAGNOSES and 6). AIDS.DIAGNOSES.PER.100.000.POPULATION
data_prepared <- HIV
data_prepared$SEX <- as.factor(data_prepared$SEX)
data_prepared$RACE.ETHNICITY <- as.factor(data_prepared$RACE.ETHNICITY)
data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES <- as.integer(data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES)
## Warning: NAs introduced by coercion
data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION <- as.integer(data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION)
## Warning: NAs introduced by coercion
data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES <- as.integer(data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES)
## Warning: NAs introduced by coercion
data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES <- as.integer(data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES)
## Warning: NAs introduced by coercion
data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES <- as.integer(data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES)
## Warning: NAs introduced by coercion
data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION <- as.integer(data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION)
## Warning: NAs introduced by coercion
str(data_prepared)
## 'data.frame': 8976 obs. of 11 variables:
## $ YEAR : int 2010 2011 2010 2012 2013 2013 2013 2013 2012 2010 ...
## $ Borough : chr "" "" "" "" ...
## $ Neighborhood..U.H.F. : chr "Greenpoint" "Stapleton - St. George" "Southeast Queens" "Upper Westside" ...
## $ SEX : Factor w/ 3 levels "All","Female",..: 3 2 3 2 3 3 2 2 3 1 ...
## $ RACE.ETHNICITY : Factor w/ 11 levels "All","Asian/Pacific Islander",..: 4 8 1 10 10 4 8 10 10 1 ...
## $ TOTAL.NUMBER.OF.HIV.DIAGNOSES : int 6 0 23 0 0 54 0 0 0 14 ...
## $ HIV.DIAGNOSES.PER.100.000.POPULATION : int 330 0 25 0 0 56 0 0 0 5 ...
## $ TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES : int 0 0 5 0 0 8 0 0 0 5 ...
## $ PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES: int 0 0 21 0 0 14 0 0 0 35 ...
## $ TOTAL.NUMBER.OF.AIDS.DIAGNOSES : int 5 0 14 0 0 33 0 0 0 12 ...
## $ AIDS.DIAGNOSES.PER.100.000.POPULATION : int 275 0 15 0 0 34 0 0 0 4 ...
dim(data_prepared)
## [1] 8976 11
plot_missing(data_prepared)
summary(data_prepared)
## YEAR Borough Neighborhood..U.H.F. SEX
## Min. :2010 Length:8976 Length:8976 All :2192
## 1st Qu.:2013 Class :character Class :character Female:3392
## Median :2017 Mode :character Mode :character Male :3392
## Mean :2016
## 3rd Qu.:2020
## Max. :2021
##
## RACE.ETHNICITY TOTAL.NUMBER.OF.HIV.DIAGNOSES
## All :1528 Min. : 0.00
## Black :1352 1st Qu.: 0.00
## White :1352 Median : 2.00
## Asian/Pacific\nIslander:1008 Mean : 21.01
## Latino/Hispanic :1008 3rd Qu.: 12.00
## Other/Unknown :1008 Max. :3353.00
## (Other) :1720 NA's :16
## HIV.DIAGNOSES.PER.100.000.POPULATION
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 9.00
## Mean : 24.95
## 3rd Qu.: 33.00
## Max. :821.00
## NA's :84
## TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 3.924
## 3rd Qu.: 2.000
## Max. :680.000
## NA's :4
## PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 11.00
## Mean : 15.69
## 3rd Qu.: 23.00
## Max. :100.00
## NA's :1895
## TOTAL.NUMBER.OF.AIDS.DIAGNOSES AIDS.DIAGNOSES.PER.100.000.POPULATION
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 1.00 Median : 4.00
## Mean : 13.52 Mean : 15.93
## 3rd Qu.: 7.00 3rd Qu.: 19.00
## Max. :2611.00 Max. :565.00
## NA's :13 NA's :81
We substitute the median for missing values since the distribution is left-skewed for all 6 variables.
# Identifying all 6 variables are left-skewed
hist(data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES)
hist(data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION)
hist(data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES)
hist(data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES)
hist(data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES)
hist(data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION)
# Substituting median value for missing data
data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES[is.na(data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES)] <- median(data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES, na.rm=TRUE)
data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION[is.na(data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION)] <- median(data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION, na.rm=TRUE)
data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES[is.na(data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES)] <- median(data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES, na.rm=TRUE)
data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES[is.na(data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES)] <- median(data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES, na.rm=TRUE)
data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES[is.na(data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES)] <- median(data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES, na.rm=TRUE)
data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION[is.na(data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION)] <- median(data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION, na.rm=TRUE)
summary(data_prepared)
## YEAR Borough Neighborhood..U.H.F. SEX
## Min. :2010 Length:8976 Length:8976 All :2192
## 1st Qu.:2013 Class :character Class :character Female:3392
## Median :2017 Mode :character Mode :character Male :3392
## Mean :2016
## 3rd Qu.:2020
## Max. :2021
##
## RACE.ETHNICITY TOTAL.NUMBER.OF.HIV.DIAGNOSES
## All :1528 Min. : 0.00
## Black :1352 1st Qu.: 0.00
## White :1352 Median : 2.00
## Asian/Pacific\nIslander:1008 Mean : 20.98
## Latino/Hispanic :1008 3rd Qu.: 12.00
## Other/Unknown :1008 Max. :3353.00
## (Other) :1720
## HIV.DIAGNOSES.PER.100.000.POPULATION
## Min. : 0.0
## 1st Qu.: 0.0
## Median : 9.0
## Mean : 24.8
## 3rd Qu.: 32.0
## Max. :821.0
##
## TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 3.922
## 3rd Qu.: 2.000
## Max. :680.000
##
## PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES
## Min. : 0.0
## 1st Qu.: 0.0
## Median : 11.0
## Mean : 14.7
## 3rd Qu.: 20.0
## Max. :100.0
##
## TOTAL.NUMBER.OF.AIDS.DIAGNOSES AIDS.DIAGNOSES.PER.100.000.POPULATION
## Min. : 0.0 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 0.00
## Median : 1.0 Median : 4.00
## Mean : 13.5 Mean : 15.83
## 3rd Qu.: 7.0 3rd Qu.: 19.00
## Max. :2611.0 Max. :565.00
##
I would like to predict the number of “TOTAL.NUMBER.OF.HIV.DIAGNOSES” variable and find out whether variables “Neighborhood..U.H.F.”, “SEX” and “RACE.ETHNICITY” have any relationship with this dependent variable. We take a look at the “SEX” variable to see if it is evenly distributed among our “RACE.ETHNICITY” variable.
At first glance, we can see that the number of Females and Males in the data seem to be evenly distributed.
Next step, we partition the data into training (80%) and test (20%) in order to measure model performance.
# Two-way table of factor variables
xtabs(~ SEX + RACE.ETHNICITY, data = data_prepared)
## RACE.ETHNICITY
## SEX All Asian/Pacific Islander Asian/Pacific\nIslander Black Hispanic
## All 512 0 336 336 0
## Female 508 172 336 508 172
## Male 508 172 336 508 172
## RACE.ETHNICITY
## SEX Latino/Hispanic Multiracial Native American Other/Unknown Unknown
## All 336 0 0 336 0
## Female 336 172 172 336 172
## Male 336 172 172 336 172
## RACE.ETHNICITY
## SEX White
## All 336
## Female 508
## Male 508
# Partition data - train (80%) & test (20%)
set.seed(1234)
ind <- sample(2, nrow(data_prepared), replace = T, prob = c(0.8, 0.2))
SVM_train <- data_prepared[ind==1,]
SVM_test <- data_prepared[ind==2,]
The following code constructs the SVM model based. It attempts to predict the HIV diagnoses based on neighborhood, sex, race and ethnicity.
svm_model <- svm(TOTAL.NUMBER.OF.HIV.DIAGNOSES ~ Neighborhood..U.H.F. + SEX + RACE.ETHNICITY,
data = SVM_train,
kernel="polynomial",
scale=FALSE)
svm_model
##
## Call:
## svm(formula = TOTAL.NUMBER.OF.HIV.DIAGNOSES ~ Neighborhood..U.H.F. +
## SEX + RACE.ETHNICITY, data = SVM_train, kernel = "polynomial",
## scale = FALSE)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## gamma: 0.01282051
## coef.0: 0
## epsilon: 0.1
##
##
## Number of Support Vectors: 6756
Next, we will run the model and calculate the RMSE for when the model is run with the test data.
SVM_test$pred <- predict(svm_model, newdata = SVM_test)
rmse <- SVM_test %>%
mutate(residual = TOTAL.NUMBER.OF.HIV.DIAGNOSES - pred) %>%
summarize(rmse = sqrt(mean(residual^2)))
rmse
## rmse
## 1 113.7579
The RMSE was worse with the SVM model than with the random forest model used in Homework 2. There were 3 models used in Homework 2 with RMSE outlined for each below:
Model 1: RMSE -> 105.9462 Model 2: RMSE -> 108.1461 Model 3 (Random forest model): RMSE -> 93.85277 SVM model: RMSE -> 113.7579
As can be seen, RMSE is worse with the SVM model than with the prior models. Hence, I would choose the random forest model in this case.
Search for academic content (at least 3 articles) that compare the use of decision trees vs SVMs in your current area of expertise. Which algorithm is recommended to get more accurate results? Is it better for classification or regression scenarios? Do you agree with the recommendations? Why?
In order to estimate real estate prices, three machine learning algorithms, Support Vector Machine (SVM), Random Forest (RF), and gradient boosting machine (GBM), were employed by [27]. The authors then examined the results associated with these three algorithms while applying these techniques to a data sample of roughly 40,000 housing transactions over the course of more than 18 years in Hong Kong. When compared to SVM, RF and GBM demonstrated superior performance in terms of predictive power, while RF and GBM performed equally well. In terms of three performance criteria (MSE, RMSE, and MAPE), GBM surpasses SVM while doing marginally better than RF in terms of error minimization. As a result, this paper shows that RF and GBM are very effective methods for making precise predictions of real estate prices because their results are comparable.
In this study conducted in Alanya city, one of the most important tourism regions in Turkey, real estate valuation was performed using machine learning algorithms. In the current research conducted on 200 samples using the SVM, kNN, and RF algorithms, the best result was achieved with the SVM algorithm (0.73). The spatial distribution of the values obtained by real estate valuation with machine learning algorithms was examined on the maps generated using GIS technology.
After going through various stages, namely pre-processing, modeling, and evaluation, this research concludes: Pre-processing with cleaning and feature selection using Principal Component Analysis (PCA) produces 5 independent variables used in the model, namely X12 (Return on Assets), X11 (Return on Equity), X9 (Net Profit Margin), X16 (Earning Per Share), and X10 (Operating Profit Margin). The best model formed is the SVM model with the Radial Basis Function (RBF) kernel with the value of parameter sigma = 1 and C = 1.0. This model has an accuracy value of 82.99% and an error rate of 17.01% in predicting the financial distress of property and real estate companies listed on the Indonesian Stock Exchange.
All three examples above are regression scenarios where a certain numeric value is being predicted and in 2 of 3 cases, SVM model is being recommended while GB/random forest modeling were more accurate in predicting real estate prices in the first instance. I do agree with the recommendations the authors make and really believe it is case dependent on when random forest modeling or SVM may be better suited. Having said that, decision trees in general are suited better for categorical data and deal with collinearity better than SVM model. For this homework though, I prefer the random forest model based on better RMSE.