Homework3

Homework 3:

Perform an analysis of the dataset used in Homework #2 using the SVM algorithm. Compare the results with the results from previous homework.

Based on articles

https://www.hindawi.com/journals/complexity/2021/5550344/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/

Search for academic content (at least 3 articles) that compare the use of decision trees vs SVMs in your current area of expertise. Which algorithm is recommended to get more accurate results? Is it better for classification or regression scenarios? Do you agree with the recommendations? Why?

Format: R file & essay

Solution:

For Homework 2, I used the NYC OpenData set on HIV/AIDS Diagnoses by Neighborhood, Sex, and Race/Ethnicity data set. I wanted to understand if the diagnoses of HIV in NYC was driven by location (neighborhood) and also sex, RACE.ETHNICITY. I sought to predict the number of “TOTAL.NUMBER.OF.HIV.DIAGNOSES” variable and find out whether variables “Neighborhood..U.H.F.”, “SEX” and “RACE.ETHNICITY” had any relationship with this dependent variable.

There appears to be some problem uploading and reading the file from GitHub, I was able to read the exact same file from my local directory.

library(e1071)

## Warning: package 'e1071' was built under R version 4.3.2

library(tidymodels)

## Warning: package 'tidymodels' was built under R version 4.3.2

## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

## ✔ broom        1.0.5     ✔ recipes      1.0.8
## ✔ dials        1.2.0     ✔ rsample      1.2.0
## ✔ dplyr        1.1.2     ✔ tibble       3.2.1
## ✔ ggplot2      3.4.3     ✔ tidyr        1.3.0
## ✔ infer        1.0.5     ✔ tune         1.1.2
## ✔ modeldata    1.2.0     ✔ workflows    1.1.3
## ✔ parsnip      1.1.1     ✔ workflowsets 1.0.1
## ✔ purrr        1.0.2     ✔ yardstick    1.2.0

## Warning: package 'dials' was built under R version 4.3.2

## Warning: package 'infer' was built under R version 4.3.2

## Warning: package 'modeldata' was built under R version 4.3.2

## Warning: package 'parsnip' was built under R version 4.3.2

## Warning: package 'recipes' was built under R version 4.3.2

## Warning: package 'rsample' was built under R version 4.3.2

## Warning: package 'tune' was built under R version 4.3.2

## Warning: package 'workflows' was built under R version 4.3.2

## Warning: package 'workflowsets' was built under R version 4.3.2

## Warning: package 'yardstick' was built under R version 4.3.2

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard()        masks scales::discard()
## ✖ dplyr::filter()         masks stats::filter()
## ✖ dplyr::lag()            masks stats::lag()
## ✖ rsample::permutations() masks e1071::permutations()
## ✖ recipes::step()         masks stats::step()
## ✖ tune::tune()            masks parsnip::tune(), e1071::tune()
## • Learn how to get started at https://www.tidymodels.org/start/

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ lubridate 1.9.2     ✔ stringr   1.5.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ stringr::fixed()    masks recipes::fixed()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ readr::spec()       masks yardstick::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(DataExplorer)
library(rpart)

## 
## Attaching package: 'rpart'
## 
## The following object is masked from 'package:dials':
## 
##     prune

library(randomForest)

## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

library(openxlsx)
library(lubridate)
library(readxl)

Importing dataset

HIV <- read.csv("C:\\Users\\tparker3\\OneDrive - Bell Partners, Inc\\Documents\\HIV_AIDS_Diagnoses_by_Neighborhood__Sex__and_Race_Ethnicity_20231029.csv")

Exploratory Analysis

Once the file is loaded, we proceed with reviewing the structure and content of the data set to understand which machine learning algorithms can be applied in this context.

Of the 11 variables in the file, 1). TOTAL.NUMBER.OF.HIV.DIAGNOSES, 2). HIV.DIAGNOSES.PER.100.000.POPULATION, 3). PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES, 4). TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES, 5). TOTAL.NUMBER.OF.AIDS.DIAGNOSES and 6). AIDS.DIAGNOSES.PER.100.000.POPULATION need to be converted from char too int.

Additionally, the two char variables SEX and RACE.ETHNICITY need to be converted to factor data type.

I handle both of these conversions in the data clean-up section below.

head(HIV)

##   YEAR Borough     Neighborhood..U.H.F.    SEX  RACE.ETHNICITY
## 1 2010                       Greenpoint   Male           Black
## 2 2011           Stapleton - St. George Female Native American
## 3 2010                 Southeast Queens   Male             All
## 4 2012                   Upper Westside Female         Unknown
## 5 2013                      Willowbrook   Male         Unknown
## 6 2013         East Flatbush - Flatbush   Male           Black
##   TOTAL.NUMBER.OF.HIV.DIAGNOSES HIV.DIAGNOSES.PER.100.000.POPULATION
## 1                             6                                330.4
## 2                             0                                    0
## 3                            23                                 25.4
## 4                             0                                    0
## 5                             0                                    0
## 6                            54                                 56.5
##   TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES
## 1                                             0
## 2                                             0
## 3                                             5
## 4                                             0
## 5                                             0
## 6                                             8
##   PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES
## 1                                                                   0
## 2                                                                   0
## 3                                                                21.7
## 4                                                                   0
## 5                                                                   0
## 6                                                                14.8
##   TOTAL.NUMBER.OF.AIDS.DIAGNOSES AIDS.DIAGNOSES.PER.100.000.POPULATION
## 1                              5                                 275.3
## 2                              0                                     0
## 3                             14                                  15.4
## 4                              0                                     0
## 5                              0                                     0
## 6                             33                                  34.5

colnames(HIV)

##  [1] "YEAR"                                                               
##  [2] "Borough"                                                            
##  [3] "Neighborhood..U.H.F."                                               
##  [4] "SEX"                                                                
##  [5] "RACE.ETHNICITY"                                                     
##  [6] "TOTAL.NUMBER.OF.HIV.DIAGNOSES"                                      
##  [7] "HIV.DIAGNOSES.PER.100.000.POPULATION"                               
##  [8] "TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES"                      
##  [9] "PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES"
## [10] "TOTAL.NUMBER.OF.AIDS.DIAGNOSES"                                     
## [11] "AIDS.DIAGNOSES.PER.100.000.POPULATION"

print(nrow(HIV))

## [1] 8976

print(ncol(HIV))

## [1] 11

Data clean-up, summary and visualization

This data file has 8,976 rows and 11 columns. The summary and plot_missing functions show there are missing values for 6 variables listed below:

1). TOTAL.NUMBER.OF.HIV.DIAGNOSES 2). HIV.DIAGNOSES.PER.100.000.POPULATION 3). TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES 4). PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES 5). TOTAL.NUMBER.OF.AIDS.DIAGNOSES and 6). AIDS.DIAGNOSES.PER.100.000.POPULATION

data_prepared <- HIV

data_prepared$SEX <- as.factor(data_prepared$SEX)
data_prepared$RACE.ETHNICITY <- as.factor(data_prepared$RACE.ETHNICITY)
data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES <- as.integer(data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES)

## Warning: NAs introduced by coercion

data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION <- as.integer(data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION)

## Warning: NAs introduced by coercion

data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES <- as.integer(data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES)

## Warning: NAs introduced by coercion

data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES <- as.integer(data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES)

## Warning: NAs introduced by coercion

data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES <- as.integer(data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES)

## Warning: NAs introduced by coercion

data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION <- as.integer(data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION)

## Warning: NAs introduced by coercion

str(data_prepared)

## 'data.frame':    8976 obs. of  11 variables:
##  $ YEAR                                                               : int  2010 2011 2010 2012 2013 2013 2013 2013 2012 2010 ...
##  $ Borough                                                            : chr  "" "" "" "" ...
##  $ Neighborhood..U.H.F.                                               : chr  "Greenpoint" "Stapleton - St. George" "Southeast Queens" "Upper Westside" ...
##  $ SEX                                                                : Factor w/ 3 levels "All","Female",..: 3 2 3 2 3 3 2 2 3 1 ...
##  $ RACE.ETHNICITY                                                     : Factor w/ 11 levels "All","Asian/Pacific Islander",..: 4 8 1 10 10 4 8 10 10 1 ...
##  $ TOTAL.NUMBER.OF.HIV.DIAGNOSES                                      : int  6 0 23 0 0 54 0 0 0 14 ...
##  $ HIV.DIAGNOSES.PER.100.000.POPULATION                               : int  330 0 25 0 0 56 0 0 0 5 ...
##  $ TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES                      : int  0 0 5 0 0 8 0 0 0 5 ...
##  $ PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES: int  0 0 21 0 0 14 0 0 0 35 ...
##  $ TOTAL.NUMBER.OF.AIDS.DIAGNOSES                                     : int  5 0 14 0 0 33 0 0 0 12 ...
##  $ AIDS.DIAGNOSES.PER.100.000.POPULATION                              : int  275 0 15 0 0 34 0 0 0 4 ...

dim(data_prepared)

## [1] 8976   11

plot_missing(data_prepared)

summary(data_prepared)

##       YEAR        Borough          Neighborhood..U.H.F.     SEX      
##  Min.   :2010   Length:8976        Length:8976          All   :2192  
##  1st Qu.:2013   Class :character   Class :character     Female:3392  
##  Median :2017   Mode  :character   Mode  :character     Male  :3392  
##  Mean   :2016                                                        
##  3rd Qu.:2020                                                        
##  Max.   :2021                                                        
##                                                                      
##                  RACE.ETHNICITY TOTAL.NUMBER.OF.HIV.DIAGNOSES
##  All                    :1528   Min.   :   0.00              
##  Black                  :1352   1st Qu.:   0.00              
##  White                  :1352   Median :   2.00              
##  Asian/Pacific\nIslander:1008   Mean   :  21.01              
##  Latino/Hispanic        :1008   3rd Qu.:  12.00              
##  Other/Unknown          :1008   Max.   :3353.00              
##  (Other)                :1720   NA's   :16                   
##  HIV.DIAGNOSES.PER.100.000.POPULATION
##  Min.   :  0.00                      
##  1st Qu.:  0.00                      
##  Median :  9.00                      
##  Mean   : 24.95                      
##  3rd Qu.: 33.00                      
##  Max.   :821.00                      
##  NA's   :84                          
##  TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES
##  Min.   :  0.000                              
##  1st Qu.:  0.000                              
##  Median :  0.000                              
##  Mean   :  3.924                              
##  3rd Qu.:  2.000                              
##  Max.   :680.000                              
##  NA's   :4                                    
##  PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES
##  Min.   :  0.00                                                     
##  1st Qu.:  0.00                                                     
##  Median : 11.00                                                     
##  Mean   : 15.69                                                     
##  3rd Qu.: 23.00                                                     
##  Max.   :100.00                                                     
##  NA's   :1895                                                       
##  TOTAL.NUMBER.OF.AIDS.DIAGNOSES AIDS.DIAGNOSES.PER.100.000.POPULATION
##  Min.   :   0.00                Min.   :  0.00                       
##  1st Qu.:   0.00                1st Qu.:  0.00                       
##  Median :   1.00                Median :  4.00                       
##  Mean   :  13.52                Mean   : 15.93                       
##  3rd Qu.:   7.00                3rd Qu.: 19.00                       
##  Max.   :2611.00                Max.   :565.00                       
##  NA's   :13                     NA's   :81

Substituting missing values for the 6 variables identified earlier

We substitute the median for missing values since the distribution is left-skewed for all 6 variables.

# Identifying all 6 variables are left-skewed

hist(data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES)

hist(data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION)

hist(data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES)

hist(data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES)

hist(data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES)

hist(data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION)

# Substituting median value for missing data

data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES[is.na(data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES)] <- median(data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES, na.rm=TRUE)

data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION[is.na(data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION)] <- median(data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION, na.rm=TRUE)

data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES[is.na(data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES)] <- median(data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES, na.rm=TRUE)

data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES[is.na(data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES)] <- median(data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES, na.rm=TRUE)

data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES[is.na(data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES)] <- median(data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES, na.rm=TRUE)

data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION[is.na(data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION)] <- median(data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION, na.rm=TRUE)

summary(data_prepared)

##       YEAR        Borough          Neighborhood..U.H.F.     SEX      
##  Min.   :2010   Length:8976        Length:8976          All   :2192  
##  1st Qu.:2013   Class :character   Class :character     Female:3392  
##  Median :2017   Mode  :character   Mode  :character     Male  :3392  
##  Mean   :2016                                                        
##  3rd Qu.:2020                                                        
##  Max.   :2021                                                        
##                                                                      
##                  RACE.ETHNICITY TOTAL.NUMBER.OF.HIV.DIAGNOSES
##  All                    :1528   Min.   :   0.00              
##  Black                  :1352   1st Qu.:   0.00              
##  White                  :1352   Median :   2.00              
##  Asian/Pacific\nIslander:1008   Mean   :  20.98              
##  Latino/Hispanic        :1008   3rd Qu.:  12.00              
##  Other/Unknown          :1008   Max.   :3353.00              
##  (Other)                :1720                                
##  HIV.DIAGNOSES.PER.100.000.POPULATION
##  Min.   :  0.0                       
##  1st Qu.:  0.0                       
##  Median :  9.0                       
##  Mean   : 24.8                       
##  3rd Qu.: 32.0                       
##  Max.   :821.0                       
##                                      
##  TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES
##  Min.   :  0.000                              
##  1st Qu.:  0.000                              
##  Median :  0.000                              
##  Mean   :  3.922                              
##  3rd Qu.:  2.000                              
##  Max.   :680.000                              
##                                               
##  PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES
##  Min.   :  0.0                                                      
##  1st Qu.:  0.0                                                      
##  Median : 11.0                                                      
##  Mean   : 14.7                                                      
##  3rd Qu.: 20.0                                                      
##  Max.   :100.0                                                      
##                                                                     
##  TOTAL.NUMBER.OF.AIDS.DIAGNOSES AIDS.DIAGNOSES.PER.100.000.POPULATION
##  Min.   :   0.0                 Min.   :  0.00                       
##  1st Qu.:   0.0                 1st Qu.:  0.00                       
##  Median :   1.0                 Median :  4.00                       
##  Mean   :  13.5                 Mean   : 15.83                       
##  3rd Qu.:   7.0                 3rd Qu.: 19.00                       
##  Max.   :2611.0                 Max.   :565.00                       
##

Model building

I would like to predict the number of “TOTAL.NUMBER.OF.HIV.DIAGNOSES” variable and find out whether variables “Neighborhood..U.H.F.”, “SEX” and “RACE.ETHNICITY” have any relationship with this dependent variable. We take a look at the “SEX” variable to see if it is evenly distributed among our “RACE.ETHNICITY” variable.

At first glance, we can see that the number of Females and Males in the data seem to be evenly distributed.

Next step, we partition the data into training (80%) and test (20%) in order to measure model performance.

# Two-way table of factor variables
xtabs(~ SEX + RACE.ETHNICITY, data = data_prepared)

##         RACE.ETHNICITY
## SEX      All Asian/Pacific Islander Asian/Pacific\nIslander Black Hispanic
##   All    512                      0                     336   336        0
##   Female 508                    172                     336   508      172
##   Male   508                    172                     336   508      172
##         RACE.ETHNICITY
## SEX      Latino/Hispanic Multiracial Native American Other/Unknown Unknown
##   All                336           0               0           336       0
##   Female             336         172             172           336     172
##   Male               336         172             172           336     172
##         RACE.ETHNICITY
## SEX      White
##   All      336
##   Female   508
##   Male     508

# Partition data - train (80%) & test (20%)
set.seed(1234)
ind <- sample(2, nrow(data_prepared), replace = T, prob = c(0.8, 0.2))
SVM_train <- data_prepared[ind==1,]
SVM_test <- data_prepared[ind==2,]

SVM modeling

The following code constructs the SVM model based. It attempts to predict the HIV diagnoses based on neighborhood, sex, race and ethnicity.

svm_model <- svm(TOTAL.NUMBER.OF.HIV.DIAGNOSES ~ Neighborhood..U.H.F. + SEX + RACE.ETHNICITY,
                 data = SVM_train,
                 kernel="polynomial",
                 scale=FALSE)

svm_model

## 
## Call:
## svm(formula = TOTAL.NUMBER.OF.HIV.DIAGNOSES ~ Neighborhood..U.H.F. + 
##     SEX + RACE.ETHNICITY, data = SVM_train, kernel = "polynomial", 
##     scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  3 
##       gamma:  0.01282051 
##      coef.0:  0 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  6756

Analysis of model

Next, we will run the model and calculate the RMSE for when the model is run with the test data.

SVM_test$pred <- predict(svm_model, newdata = SVM_test)

rmse <- SVM_test %>% 
  mutate(residual = TOTAL.NUMBER.OF.HIV.DIAGNOSES - pred) %>%
  summarize(rmse = sqrt(mean(residual^2)))

rmse

##       rmse
## 1 113.7579

Conclusion:

The RMSE was worse with the SVM model than with the random forest model used in Homework 2. There were 3 models used in Homework 2 with RMSE outlined for each below:

Model 1: RMSE -> 105.9462 Model 2: RMSE -> 108.1461 Model 3 (Random forest model): RMSE -> 93.85277 SVM model: RMSE -> 113.7579

As can be seen, RMSE is worse with the SVM model than with the prior models. Hence, I would choose the random forest model in this case.

Further questions:

Solution:

Article 1 - The Use of Machine Learning in Real Estate Research (by Lennon H.T. Choy and Winky K.O. Ho) -https://www.mdpi.com/2073-445X/12/4/740

In order to estimate real estate prices, three machine learning algorithms, Support Vector Machine (SVM), Random Forest (RF), and gradient boosting machine (GBM), were employed by [27]. The authors then examined the results associated with these three algorithms while applying these techniques to a data sample of roughly 40,000 housing transactions over the course of more than 18 years in Hong Kong. When compared to SVM, RF and GBM demonstrated superior performance in terms of predictive power, while RF and GBM performed equally well. In terms of three performance criteria (MSE, RMSE, and MAPE), GBM surpasses SVM while doing marginally better than RF in terms of error minimization. As a result, this paper shows that RF and GBM are very effective methods for making precise predictions of real estate prices because their results are comparable.

Article 2 - Using Machine Learning Algorithms for Predicting Real Estate Values in Tourism Centers - chrome-extension://efaidnbmnnnibpcajpcglclefindmkaj/https://assets.researchsquare.com/files/rs-1757533/v1/eea1c08e-4bee-4d16-ad81-0b8fbd8ab90b.pdf?c=1660761016

In this study conducted in Alanya city, one of the most important tourism regions in Turkey, real estate valuation was performed using machine learning algorithms. In the current research conducted on 200 samples using the SVM, kNN, and RF algorithms, the best result was achieved with the SVM algorithm (0.73). The spatial distribution of the values obtained by real estate valuation with machine learning algorithms was examined on the maps generated using GIS technology.

Article 3 - Support Vector Machine (SVM) as Financial Distress Model Prediction in Property and Real Estate Companies (By Ni Wayan Dewinta Ayuni, Ni Nengah Lasmini, Agus Adi Putrawan)

After going through various stages, namely pre-processing, modeling, and evaluation, this research concludes: Pre-processing with cleaning and feature selection using Principal Component Analysis (PCA) produces 5 independent variables used in the model, namely X12 (Return on Assets), X11 (Return on Equity), X9 (Net Profit Margin), X16 (Earning Per Share), and X10 (Operating Profit Margin). The best model formed is the SVM model with the Radial Basis Function (RBF) kernel with the value of parameter sigma = 1 and C = 1.0. This model has an accuracy value of 82.99% and an error rate of 17.01% in predicting the financial distress of property and real estate companies listed on the Indonesian Stock Exchange.

Final comments:

All three examples above are regression scenarios where a certain numeric value is being predicted and in 2 of 3 cases, SVM model is being recommended while GB/random forest modeling were more accurate in predicting real estate prices in the first instance. I do agree with the recommendations the authors make and really believe it is case dependent on when random forest modeling or SVM may be better suited. Having said that, decision trees in general are suited better for categorical data and deal with collinearity better than SVM model. For this homework though, I prefer the random forest model based on better RMSE.

Homework3

2023-12-02

Homework 3:

Solution:

Importing dataset

Exploratory Analysis

Data clean-up, summary and visualization

Substituting missing values for the 6 variables identified earlier

Model building

SVM modeling

Analysis of model

Conclusion:

Further questions:

Solution:

Article 1 - The Use of Machine Learning in Real Estate Research (by Lennon H.T. Choy and Winky K.O. Ho) -https://www.mdpi.com/2073-445X/12/4/740

Article 2 - Using Machine Learning Algorithms for Predicting Real Estate Values in Tourism Centers - chrome-extension://efaidnbmnnnibpcajpcglclefindmkaj/https://assets.researchsquare.com/files/rs-1757533/v1/eea1c08e-4bee-4d16-ad81-0b8fbd8ab90b.pdf?c=1660761016

Article 3 - Support Vector Machine (SVM) as Financial Distress Model Prediction in Property and Real Estate Companies (By Ni Wayan Dewinta Ayuni, Ni Nengah Lasmini, Agus Adi Putrawan)

Final comments: