Decision Trees Algorithms
Pre-work
Read this blog: https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees which shows some of the issues with decision trees.
Choose a data set from a source in Assignment #1, or another data set of your choice.
Assignment work:
Based on the latest topics presented, choose a data set of your choice and create a Decision Tree where you can solve a classification problem and predict the outcome of a particular feature or detail of the data used.
Switch variables* to generate 2 decision trees and compare the results. Create a random forest and analyze the results.
Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem?
Deliverable:
Essay (minimum 500 word document)
Write a short essay explaining your analysis, and how you would address the concerns in the blog (listed in pre-work)
Exploratory Analysis using R or Python (submit code + errors + analysis as notebook or copy/paste to document)
Due Date: Sunday October 29th, end of day
We are trying to train 2 different decision trees to compare bias and variance - so switch the features used for the first node (split) to force a different decision tree (How did the performance change?)
You will create 3 models: 2 x decision trees (to compare variance) and a random forest
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DataExplorer)
library(rpart)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(openxlsx)
library(lubridate)
library(readxl)
I used the NYC OpenData set on HIV/AIDS Diagnoses by Neighborhood, Sex, and Race/Ethnicity data set. I wanted to understand if the diagnoses of HIV in NYC was driven by location (neighborhood) and also sex, RACE.ETHNICITY.
The data set contains 11 variables, listed below: YEAR, Borough, Neighborhood..U.H.F., SEX, RACE.ETHNICITY, TOTAL.NUMBER.OF.HIV.DIAGNOSES, HIV.DIAGNOSES.PER.100.000.POPULATION, TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES, PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES, TOTAL.NUMBER.OF.AIDS.DIAGNOSES and AIDS.DIAGNOSES.PER.100.000.POPULATION.
I’ve uploaded the file from my personal directory as I kept running into the error reading from the GitHub directory - Warning: incomplete final line found by readTableHeader on Warning: incomplete final line found by readTableHeader on ‘https://github.com/tponnada/hello-world/blob/master/HIV_AIDS_Diagnoses_by_Neighborhood__Sex__and_Race_Ethnicity_20231029.csv’
There appears to be some problem uploading and reading the file from GitHub, I was able to read the exact same file from my local directory.
HIV <- read.csv("C:\\Users\\tparker3\\OneDrive - Bell Partners, Inc\\Documents\\HIV_AIDS_Diagnoses_by_Neighborhood__Sex__and_Race_Ethnicity_20231029.csv")
Once the file is loaded, we proceed with reviewing the structure and content of the data set to understand which machine learning algorithms can be applied in this context.
Of the 11 variables in the file, 1). TOTAL.NUMBER.OF.HIV.DIAGNOSES, 2). HIV.DIAGNOSES.PER.100.000.POPULATION, 3). PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES, 4). TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES, 5). TOTAL.NUMBER.OF.AIDS.DIAGNOSES and 6). AIDS.DIAGNOSES.PER.100.000.POPULATION need to be converted from char too int.
Additionally, the two char variables SEX and RACE.ETHNICITY need to be converted to factor data type.
I handle both of these conversions in the data clean-up section below.
head(HIV)
## YEAR Borough Neighborhood..U.H.F. SEX RACE.ETHNICITY
## 1 2010 Greenpoint Male Black
## 2 2011 Stapleton - St. George Female Native American
## 3 2010 Southeast Queens Male All
## 4 2012 Upper Westside Female Unknown
## 5 2013 Willowbrook Male Unknown
## 6 2013 East Flatbush - Flatbush Male Black
## TOTAL.NUMBER.OF.HIV.DIAGNOSES HIV.DIAGNOSES.PER.100.000.POPULATION
## 1 6 330.4
## 2 0 0
## 3 23 25.4
## 4 0 0
## 5 0 0
## 6 54 56.5
## TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES
## 1 0
## 2 0
## 3 5
## 4 0
## 5 0
## 6 8
## PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES
## 1 0
## 2 0
## 3 21.7
## 4 0
## 5 0
## 6 14.8
## TOTAL.NUMBER.OF.AIDS.DIAGNOSES AIDS.DIAGNOSES.PER.100.000.POPULATION
## 1 5 275.3
## 2 0 0
## 3 14 15.4
## 4 0 0
## 5 0 0
## 6 33 34.5
colnames(HIV)
## [1] "YEAR"
## [2] "Borough"
## [3] "Neighborhood..U.H.F."
## [4] "SEX"
## [5] "RACE.ETHNICITY"
## [6] "TOTAL.NUMBER.OF.HIV.DIAGNOSES"
## [7] "HIV.DIAGNOSES.PER.100.000.POPULATION"
## [8] "TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES"
## [9] "PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES"
## [10] "TOTAL.NUMBER.OF.AIDS.DIAGNOSES"
## [11] "AIDS.DIAGNOSES.PER.100.000.POPULATION"
print(nrow(HIV))
## [1] 8976
print(ncol(HIV))
## [1] 11
This data file has 8,976 rows and 11 columns. The summary and plot_missing functions show there are missing values for 6 variables listed below:
1). TOTAL.NUMBER.OF.HIV.DIAGNOSES 2). HIV.DIAGNOSES.PER.100.000.POPULATION 3). TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES 4). PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES 5). TOTAL.NUMBER.OF.AIDS.DIAGNOSES and 6). AIDS.DIAGNOSES.PER.100.000.POPULATION
data_prepared <- HIV
data_prepared$SEX <- as.factor(data_prepared$SEX)
data_prepared$RACE.ETHNICITY <- as.factor(data_prepared$RACE.ETHNICITY)
data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES <- as.integer(data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES)
## Warning: NAs introduced by coercion
data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION <- as.integer(data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION)
## Warning: NAs introduced by coercion
data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES <- as.integer(data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES)
## Warning: NAs introduced by coercion
data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES <- as.integer(data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES)
## Warning: NAs introduced by coercion
data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES <- as.integer(data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES)
## Warning: NAs introduced by coercion
data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION <- as.integer(data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION)
## Warning: NAs introduced by coercion
str(data_prepared)
## 'data.frame': 8976 obs. of 11 variables:
## $ YEAR : int 2010 2011 2010 2012 2013 2013 2013 2013 2012 2010 ...
## $ Borough : chr "" "" "" "" ...
## $ Neighborhood..U.H.F. : chr "Greenpoint" "Stapleton - St. George" "Southeast Queens" "Upper Westside" ...
## $ SEX : Factor w/ 3 levels "All","Female",..: 3 2 3 2 3 3 2 2 3 1 ...
## $ RACE.ETHNICITY : Factor w/ 11 levels "All","Asian/Pacific Islander",..: 4 8 1 10 10 4 8 10 10 1 ...
## $ TOTAL.NUMBER.OF.HIV.DIAGNOSES : int 6 0 23 0 0 54 0 0 0 14 ...
## $ HIV.DIAGNOSES.PER.100.000.POPULATION : int 330 0 25 0 0 56 0 0 0 5 ...
## $ TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES : int 0 0 5 0 0 8 0 0 0 5 ...
## $ PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES: int 0 0 21 0 0 14 0 0 0 35 ...
## $ TOTAL.NUMBER.OF.AIDS.DIAGNOSES : int 5 0 14 0 0 33 0 0 0 12 ...
## $ AIDS.DIAGNOSES.PER.100.000.POPULATION : int 275 0 15 0 0 34 0 0 0 4 ...
dim(data_prepared)
## [1] 8976 11
plot_missing(data_prepared)
summary(data_prepared)
## YEAR Borough Neighborhood..U.H.F. SEX
## Min. :2010 Length:8976 Length:8976 All :2192
## 1st Qu.:2013 Class :character Class :character Female:3392
## Median :2017 Mode :character Mode :character Male :3392
## Mean :2016
## 3rd Qu.:2020
## Max. :2021
##
## RACE.ETHNICITY TOTAL.NUMBER.OF.HIV.DIAGNOSES
## All :1528 Min. : 0.00
## Black :1352 1st Qu.: 0.00
## White :1352 Median : 2.00
## Asian/Pacific\nIslander:1008 Mean : 21.01
## Latino/Hispanic :1008 3rd Qu.: 12.00
## Other/Unknown :1008 Max. :3353.00
## (Other) :1720 NA's :16
## HIV.DIAGNOSES.PER.100.000.POPULATION
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 9.00
## Mean : 24.95
## 3rd Qu.: 33.00
## Max. :821.00
## NA's :84
## TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 3.924
## 3rd Qu.: 2.000
## Max. :680.000
## NA's :4
## PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 11.00
## Mean : 15.69
## 3rd Qu.: 23.00
## Max. :100.00
## NA's :1895
## TOTAL.NUMBER.OF.AIDS.DIAGNOSES AIDS.DIAGNOSES.PER.100.000.POPULATION
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 1.00 Median : 4.00
## Mean : 13.52 Mean : 15.93
## 3rd Qu.: 7.00 3rd Qu.: 19.00
## Max. :2611.00 Max. :565.00
## NA's :13 NA's :81
We substitute the median for missing values since the distribution is left-skewed for all 6 variables.
# Identifying all 6 variables are left-skewed
hist(data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES)
hist(data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION)
hist(data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES)
hist(data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES)
hist(data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES)
hist(data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION)
# Substituting median value for missing data
data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES[is.na(data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES)] <- median(data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES, na.rm=TRUE)
data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION[is.na(data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION)] <- median(data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION, na.rm=TRUE)
data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES[is.na(data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES)] <- median(data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES, na.rm=TRUE)
data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES[is.na(data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES)] <- median(data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES, na.rm=TRUE)
data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES[is.na(data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES)] <- median(data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES, na.rm=TRUE)
data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION[is.na(data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION)] <- median(data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION, na.rm=TRUE)
summary(data_prepared)
## YEAR Borough Neighborhood..U.H.F. SEX
## Min. :2010 Length:8976 Length:8976 All :2192
## 1st Qu.:2013 Class :character Class :character Female:3392
## Median :2017 Mode :character Mode :character Male :3392
## Mean :2016
## 3rd Qu.:2020
## Max. :2021
##
## RACE.ETHNICITY TOTAL.NUMBER.OF.HIV.DIAGNOSES
## All :1528 Min. : 0.00
## Black :1352 1st Qu.: 0.00
## White :1352 Median : 2.00
## Asian/Pacific\nIslander:1008 Mean : 20.98
## Latino/Hispanic :1008 3rd Qu.: 12.00
## Other/Unknown :1008 Max. :3353.00
## (Other) :1720
## HIV.DIAGNOSES.PER.100.000.POPULATION
## Min. : 0.0
## 1st Qu.: 0.0
## Median : 9.0
## Mean : 24.8
## 3rd Qu.: 32.0
## Max. :821.0
##
## TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 3.922
## 3rd Qu.: 2.000
## Max. :680.000
##
## PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES
## Min. : 0.0
## 1st Qu.: 0.0
## Median : 11.0
## Mean : 14.7
## 3rd Qu.: 20.0
## Max. :100.0
##
## TOTAL.NUMBER.OF.AIDS.DIAGNOSES AIDS.DIAGNOSES.PER.100.000.POPULATION
## Min. : 0.0 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 0.00
## Median : 1.0 Median : 4.00
## Mean : 13.5 Mean : 15.83
## 3rd Qu.: 7.0 3rd Qu.: 19.00
## Max. :2611.0 Max. :565.00
##
I would like to predict the number of “TOTAL.NUMBER.OF.HIV.DIAGNOSES” variable and find out whether variables “Neighborhood..U.H.F.”, “SEX” and “RACE.ETHNICITY” have any relationship with this dependent variable. We take a look at the “SEX” variable to see if it is evenly distributed among our “RACE.ETHNICITY” variable.
At first glance, we can see that the number of Females and Males in the data seem to be evenly distributed.
Next step, we partition the data into training (80%) and test (20%) in order to measure model performance.
# Two-way table of factor variables
xtabs(~ SEX + RACE.ETHNICITY, data = data_prepared)
## RACE.ETHNICITY
## SEX All Asian/Pacific Islander Asian/Pacific\nIslander Black Hispanic
## All 512 0 336 336 0
## Female 508 172 336 508 172
## Male 508 172 336 508 172
## RACE.ETHNICITY
## SEX Latino/Hispanic Multiracial Native American Other/Unknown Unknown
## All 336 0 0 336 0
## Female 336 172 172 336 172
## Male 336 172 172 336 172
## RACE.ETHNICITY
## SEX White
## All 336
## Female 508
## Male 508
# Partition data - train (80%) & test (20%)
set.seed(1234)
ind <- sample(2, nrow(data_prepared), replace = T, prob = c(0.8, 0.2))
train <- data_prepared[ind==1,]
test <- data_prepared[ind==2,]
We run the first decision tree model to predict number of “TOTAL.NUMBER.OF.HIV.DIAGNOSES” with the variable “Neighborhood..U.H.F.” below:
first_model <- rpart(TOTAL.NUMBER.OF.HIV.DIAGNOSES ~ Neighborhood..U.H.F., method="anova", train)
# Detailed summary of splits
summary(first_model)
## Call:
## rpart(formula = TOTAL.NUMBER.OF.HIV.DIAGNOSES ~ Neighborhood..U.H.F.,
## data = train, method = "anova")
## n= 7194
##
## CP nsplit rel error xerror xstd
## 1 0.1191236 0 1.0000000 1.0002275 0.2400782
## 2 0.0100000 1 0.8808764 0.8834338 0.2186504
##
## Variable importance
## Neighborhood..U.H.F.
## 100
##
## Node number 1: 7194 observations, complexity param=0.1191236
## mean=20.91423, MSE=10976.63
## left son=2 (6578 obs) right son=3 (616 obs)
## Primary splits:
## Neighborhood..U.H.F. splits as RLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL, improve=0.1191236, (0 missing)
##
## Node number 2: 6578 observations
## mean=9.848586, MSE=485.1932
##
## Node number 3: 616 observations
## mean=139.0795, MSE=107739.6
# Visualize cross-validation results
plotcp(first_model)
As we can see below, the R-square increases with the number of splits, and the X Relative Error decreases as the number of splits increase.
# create additional plots
par(mfrow=c(1,2))
rsq.rpart(first_model)
##
## Regression tree:
## rpart(formula = TOTAL.NUMBER.OF.HIV.DIAGNOSES ~ Neighborhood..U.H.F.,
## data = train, method = "anova")
##
## Variables actually used in tree construction:
## [1] Neighborhood..U.H.F.
##
## Root node error: 78965872/7194 = 10977
##
## n= 7194
##
## CP nsplit rel error xerror xstd
## 1 0.11912 0 1.00000 1.00023 0.24008
## 2 0.01000 1 0.88088 0.88343 0.21865
# plot tree
plot(first_model, uniform=TRUE, main="Regression Tree for Number of HIV diagnoses ")
text(first_model, use.n=TRUE, all=TRUE, cex=.8)
## Warning in labels.rpart(x, minlength = minlength): more than 52 levels in a
## predicting factor, truncated for printout
# make predictions
p <- predict(first_model, test)
# Root Mean Square Error
sqrt(mean((test$TOTAL.NUMBER.OF.HIV.DIAGNOSES - p)^2))
## [1] 105.9462
# R-square
(cor(test$TOTAL.NUMBER.OF.HIV.DIAGNOSES, p))^2
## [1] 0.1071033
The RMSE and R-square results are not satisfying when making predictions based off the first model.
The second decision tree is devised to predict number of “TOTAL.NUMBER.OF.HIV.DIAGNOSES” with the variable “SEX” and “RACE.ETHNICITY” below. Comparing the RMSE and R-square results with those achieved in model 1, we see that there is essentially no improvement. RMSE moves from 105.9462 to 108.1461 and R-square moves from 0.1071033 to 0.07019709.
second_model <- rpart(TOTAL.NUMBER.OF.HIV.DIAGNOSES ~ SEX + RACE.ETHNICITY, method="anova", train)
summary(second_model)
## Call:
## rpart(formula = TOTAL.NUMBER.OF.HIV.DIAGNOSES ~ SEX + RACE.ETHNICITY,
## data = train, method = "anova")
## n= 7194
##
## CP nsplit rel error xerror xstd
## 1 0.04287737 0 1.0000000 1.0002142 0.2401039
## 2 0.01840571 1 0.9571226 0.9581364 0.2316735
## 3 0.01000000 2 0.9387169 0.9400997 0.2274470
##
## Variable importance
## RACE.ETHNICITY SEX
## 70 30
##
## Node number 1: 7194 observations, complexity param=0.04287737
## mean=20.91423, MSE=10976.63
## left son=2 (5981 obs) right son=3 (1213 obs)
## Primary splits:
## RACE.ETHNICITY splits as RLLLLLLLLLL, improve=0.04287737, (0 missing)
## SEX splits as RLL, improve=0.01303778, (0 missing)
##
## Node number 2: 5981 observations
## mean=11.14429, MSE=2212.04
##
## Node number 3: 1213 observations, complexity param=0.01840571
## mean=69.08739, MSE=51401.33
## left son=6 (404 obs) right son=7 (809 obs)
## Primary splits:
## SEX splits as RLR, improve=0.02331078, (0 missing)
##
## Node number 6: 404 observations
## mean=20.10396, MSE=2232.554
##
## Node number 7: 809 observations
## mean=93.54883, MSE=74158.76
plotcp(second_model)
par(mfrow=c(1,2)) # two plots on one page
rsq.rpart(second_model) # visualize cross-validation results
##
## Regression tree:
## rpart(formula = TOTAL.NUMBER.OF.HIV.DIAGNOSES ~ SEX + RACE.ETHNICITY,
## data = train, method = "anova")
##
## Variables actually used in tree construction:
## [1] RACE.ETHNICITY SEX
##
## Root node error: 78965872/7194 = 10977
##
## n= 7194
##
## CP nsplit rel error xerror xstd
## 1 0.042877 0 1.00000 1.00021 0.24010
## 2 0.018406 1 0.95712 0.95814 0.23167
## 3 0.010000 2 0.93872 0.94010 0.22745
plot(second_model, uniform=TRUE, main="Regression Tree for Number of HIV diagnoses")
text(second_model, use.n=TRUE, all=TRUE, cex=.8)
# make predictions
p2 <- predict(second_model, test)
# Root Mean Square Error
sqrt(mean((test$TOTAL.NUMBER.OF.HIV.DIAGNOSES - p2)^2))
## [1] 108.1461
# R-square
(cor(test$TOTAL.NUMBER.OF.HIV.DIAGNOSES, p2))^2
## [1] 0.07019709
The final model is the random forest model with all 3 predictor variables. From the results below, we can see that the percentage of variance explained with this model is about 30%, which is better than the first two models. We see that the RMSE and R-square improve significantly with the random forest model which adds extra predictors into the model.
train2 <- train
test2 <- test
third_model <- randomForest(TOTAL.NUMBER.OF.HIV.DIAGNOSES ~ Neighborhood..U.H.F. + SEX + RACE.ETHNICITY, importance = TRUE, na.action = na.omit, train2)
print(third_model)
##
## Call:
## randomForest(formula = TOTAL.NUMBER.OF.HIV.DIAGNOSES ~ Neighborhood..U.H.F. + SEX + RACE.ETHNICITY, data = train2, importance = TRUE, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 7768.409
## % Var explained: 29.23
plot(third_model)
# make predictions
p3 <- predict(third_model, test2)
# Root Mean Square Error
sqrt(mean((test2$TOTAL.NUMBER.OF.HIV.DIAGNOSES - p3)^2))
## [1] 93.85277
# R-square
(cor(test2$TOTAL.NUMBER.OF.HIV.DIAGNOSES, p3))^2
## [1] 0.3762122
Based upon running the two decision trees and the random forest, the random forest model with three predictors appears to be the best model but it could be due to over fitting. Even though it’s complex, I would choose the random forest model due to the better fit.
Additionally, based on real cases where decision trees have gone “wrong”, machine learning algorithm version of decision tree appears to be simpler than actually drawing a tree. The machine-learning version allows the user to interpret the results in a simpler way than the visual depiction of an actual tree would. As decisions become more complex with more nodes, algorithm can help simplify the results.