Homework 2

Decision Trees Algorithms

Pre-work

Read this blog: https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees which shows some of the issues with decision trees.

Choose a data set from a source in Assignment #1, or another data set of your choice.

Assignment work:

Based on the latest topics presented, choose a data set of your choice and create a Decision Tree where you can solve a classification problem and predict the outcome of a particular feature or detail of the data used.

Switch variables* to generate 2 decision trees and compare the results. Create a random forest and analyze the results.

Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem?

Deliverable:

Essay (minimum 500 word document)

Write a short essay explaining your analysis, and how you would address the concerns in the blog (listed in pre-work)

Exploratory Analysis using R or Python (submit code + errors + analysis as notebook or copy/paste to document)

Due Date: Sunday October 29th, end of day

  1. We are trying to train 2 different decision trees to compare bias and variance - so switch the features used for the first node (split) to force a different decision tree (How did the performance change?)

  2. You will create 3 models: 2 x decision trees (to compare variance) and a random forest

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DataExplorer)
library(rpart)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(openxlsx)
library(lubridate)
library(readxl)

Solution:

I used the NYC OpenData set on HIV/AIDS Diagnoses by Neighborhood, Sex, and Race/Ethnicity data set. I wanted to understand if the diagnoses of HIV in NYC was driven by location (neighborhood) and also sex, RACE.ETHNICITY.

The data set contains 11 variables, listed below: YEAR, Borough, Neighborhood..U.H.F., SEX, RACE.ETHNICITY, TOTAL.NUMBER.OF.HIV.DIAGNOSES, HIV.DIAGNOSES.PER.100.000.POPULATION, TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES, PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES, TOTAL.NUMBER.OF.AIDS.DIAGNOSES and AIDS.DIAGNOSES.PER.100.000.POPULATION.

I’ve uploaded the file from my personal directory as I kept running into the error reading from the GitHub directory - Warning: incomplete final line found by readTableHeader on Warning: incomplete final line found by readTableHeader on ‘https://github.com/tponnada/hello-world/blob/master/HIV_AIDS_Diagnoses_by_Neighborhood__Sex__and_Race_Ethnicity_20231029.csv

There appears to be some problem uploading and reading the file from GitHub, I was able to read the exact same file from my local directory.

HIV <- read.csv("C:\\Users\\tparker3\\OneDrive - Bell Partners, Inc\\Documents\\HIV_AIDS_Diagnoses_by_Neighborhood__Sex__and_Race_Ethnicity_20231029.csv")

Exploratory Analysis

Once the file is loaded, we proceed with reviewing the structure and content of the data set to understand which machine learning algorithms can be applied in this context.

Of the 11 variables in the file, 1). TOTAL.NUMBER.OF.HIV.DIAGNOSES, 2). HIV.DIAGNOSES.PER.100.000.POPULATION, 3). PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES, 4). TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES, 5). TOTAL.NUMBER.OF.AIDS.DIAGNOSES and 6). AIDS.DIAGNOSES.PER.100.000.POPULATION need to be converted from char too int.

Additionally, the two char variables SEX and RACE.ETHNICITY need to be converted to factor data type.

I handle both of these conversions in the data clean-up section below.

head(HIV)
##   YEAR Borough     Neighborhood..U.H.F.    SEX  RACE.ETHNICITY
## 1 2010                       Greenpoint   Male           Black
## 2 2011           Stapleton - St. George Female Native American
## 3 2010                 Southeast Queens   Male             All
## 4 2012                   Upper Westside Female         Unknown
## 5 2013                      Willowbrook   Male         Unknown
## 6 2013         East Flatbush - Flatbush   Male           Black
##   TOTAL.NUMBER.OF.HIV.DIAGNOSES HIV.DIAGNOSES.PER.100.000.POPULATION
## 1                             6                                330.4
## 2                             0                                    0
## 3                            23                                 25.4
## 4                             0                                    0
## 5                             0                                    0
## 6                            54                                 56.5
##   TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES
## 1                                             0
## 2                                             0
## 3                                             5
## 4                                             0
## 5                                             0
## 6                                             8
##   PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES
## 1                                                                   0
## 2                                                                   0
## 3                                                                21.7
## 4                                                                   0
## 5                                                                   0
## 6                                                                14.8
##   TOTAL.NUMBER.OF.AIDS.DIAGNOSES AIDS.DIAGNOSES.PER.100.000.POPULATION
## 1                              5                                 275.3
## 2                              0                                     0
## 3                             14                                  15.4
## 4                              0                                     0
## 5                              0                                     0
## 6                             33                                  34.5
colnames(HIV)
##  [1] "YEAR"                                                               
##  [2] "Borough"                                                            
##  [3] "Neighborhood..U.H.F."                                               
##  [4] "SEX"                                                                
##  [5] "RACE.ETHNICITY"                                                     
##  [6] "TOTAL.NUMBER.OF.HIV.DIAGNOSES"                                      
##  [7] "HIV.DIAGNOSES.PER.100.000.POPULATION"                               
##  [8] "TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES"                      
##  [9] "PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES"
## [10] "TOTAL.NUMBER.OF.AIDS.DIAGNOSES"                                     
## [11] "AIDS.DIAGNOSES.PER.100.000.POPULATION"
print(nrow(HIV))
## [1] 8976
print(ncol(HIV))
## [1] 11

Data clean-up, summary and visualization

This data file has 8,976 rows and 11 columns. The summary and plot_missing functions show there are missing values for 6 variables listed below:

1). TOTAL.NUMBER.OF.HIV.DIAGNOSES 2). HIV.DIAGNOSES.PER.100.000.POPULATION 3). TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES 4). PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES 5). TOTAL.NUMBER.OF.AIDS.DIAGNOSES and 6). AIDS.DIAGNOSES.PER.100.000.POPULATION

data_prepared <- HIV

data_prepared$SEX <- as.factor(data_prepared$SEX)
data_prepared$RACE.ETHNICITY <- as.factor(data_prepared$RACE.ETHNICITY)
data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES <- as.integer(data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES)
## Warning: NAs introduced by coercion
data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION <- as.integer(data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION)
## Warning: NAs introduced by coercion
data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES <- as.integer(data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES)
## Warning: NAs introduced by coercion
data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES <- as.integer(data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES)
## Warning: NAs introduced by coercion
data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES <- as.integer(data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES)
## Warning: NAs introduced by coercion
data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION <- as.integer(data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION)
## Warning: NAs introduced by coercion
str(data_prepared)
## 'data.frame':    8976 obs. of  11 variables:
##  $ YEAR                                                               : int  2010 2011 2010 2012 2013 2013 2013 2013 2012 2010 ...
##  $ Borough                                                            : chr  "" "" "" "" ...
##  $ Neighborhood..U.H.F.                                               : chr  "Greenpoint" "Stapleton - St. George" "Southeast Queens" "Upper Westside" ...
##  $ SEX                                                                : Factor w/ 3 levels "All","Female",..: 3 2 3 2 3 3 2 2 3 1 ...
##  $ RACE.ETHNICITY                                                     : Factor w/ 11 levels "All","Asian/Pacific Islander",..: 4 8 1 10 10 4 8 10 10 1 ...
##  $ TOTAL.NUMBER.OF.HIV.DIAGNOSES                                      : int  6 0 23 0 0 54 0 0 0 14 ...
##  $ HIV.DIAGNOSES.PER.100.000.POPULATION                               : int  330 0 25 0 0 56 0 0 0 5 ...
##  $ TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES                      : int  0 0 5 0 0 8 0 0 0 5 ...
##  $ PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES: int  0 0 21 0 0 14 0 0 0 35 ...
##  $ TOTAL.NUMBER.OF.AIDS.DIAGNOSES                                     : int  5 0 14 0 0 33 0 0 0 12 ...
##  $ AIDS.DIAGNOSES.PER.100.000.POPULATION                              : int  275 0 15 0 0 34 0 0 0 4 ...
dim(data_prepared)
## [1] 8976   11
plot_missing(data_prepared)

summary(data_prepared)
##       YEAR        Borough          Neighborhood..U.H.F.     SEX      
##  Min.   :2010   Length:8976        Length:8976          All   :2192  
##  1st Qu.:2013   Class :character   Class :character     Female:3392  
##  Median :2017   Mode  :character   Mode  :character     Male  :3392  
##  Mean   :2016                                                        
##  3rd Qu.:2020                                                        
##  Max.   :2021                                                        
##                                                                      
##                  RACE.ETHNICITY TOTAL.NUMBER.OF.HIV.DIAGNOSES
##  All                    :1528   Min.   :   0.00              
##  Black                  :1352   1st Qu.:   0.00              
##  White                  :1352   Median :   2.00              
##  Asian/Pacific\nIslander:1008   Mean   :  21.01              
##  Latino/Hispanic        :1008   3rd Qu.:  12.00              
##  Other/Unknown          :1008   Max.   :3353.00              
##  (Other)                :1720   NA's   :16                   
##  HIV.DIAGNOSES.PER.100.000.POPULATION
##  Min.   :  0.00                      
##  1st Qu.:  0.00                      
##  Median :  9.00                      
##  Mean   : 24.95                      
##  3rd Qu.: 33.00                      
##  Max.   :821.00                      
##  NA's   :84                          
##  TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES
##  Min.   :  0.000                              
##  1st Qu.:  0.000                              
##  Median :  0.000                              
##  Mean   :  3.924                              
##  3rd Qu.:  2.000                              
##  Max.   :680.000                              
##  NA's   :4                                    
##  PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES
##  Min.   :  0.00                                                     
##  1st Qu.:  0.00                                                     
##  Median : 11.00                                                     
##  Mean   : 15.69                                                     
##  3rd Qu.: 23.00                                                     
##  Max.   :100.00                                                     
##  NA's   :1895                                                       
##  TOTAL.NUMBER.OF.AIDS.DIAGNOSES AIDS.DIAGNOSES.PER.100.000.POPULATION
##  Min.   :   0.00                Min.   :  0.00                       
##  1st Qu.:   0.00                1st Qu.:  0.00                       
##  Median :   1.00                Median :  4.00                       
##  Mean   :  13.52                Mean   : 15.93                       
##  3rd Qu.:   7.00                3rd Qu.: 19.00                       
##  Max.   :2611.00                Max.   :565.00                       
##  NA's   :13                     NA's   :81

Substituting missing values for the 6 variables identified earlier

We substitute the median for missing values since the distribution is left-skewed for all 6 variables.

# Identifying all 6 variables are left-skewed

hist(data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES)

hist(data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION)

hist(data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES)

hist(data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES)

hist(data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES)

hist(data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION)

# Substituting median value for missing data

data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES[is.na(data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES)] <- median(data_prepared$TOTAL.NUMBER.OF.HIV.DIAGNOSES, na.rm=TRUE)

data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION[is.na(data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION)] <- median(data_prepared$HIV.DIAGNOSES.PER.100.000.POPULATION, na.rm=TRUE)

data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES[is.na(data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES)] <- median(data_prepared$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES, na.rm=TRUE)

data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES[is.na(data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES)] <- median(data_prepared$PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES, na.rm=TRUE)

data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES[is.na(data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES)] <- median(data_prepared$TOTAL.NUMBER.OF.AIDS.DIAGNOSES, na.rm=TRUE)

data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION[is.na(data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION)] <- median(data_prepared$AIDS.DIAGNOSES.PER.100.000.POPULATION, na.rm=TRUE)

summary(data_prepared)
##       YEAR        Borough          Neighborhood..U.H.F.     SEX      
##  Min.   :2010   Length:8976        Length:8976          All   :2192  
##  1st Qu.:2013   Class :character   Class :character     Female:3392  
##  Median :2017   Mode  :character   Mode  :character     Male  :3392  
##  Mean   :2016                                                        
##  3rd Qu.:2020                                                        
##  Max.   :2021                                                        
##                                                                      
##                  RACE.ETHNICITY TOTAL.NUMBER.OF.HIV.DIAGNOSES
##  All                    :1528   Min.   :   0.00              
##  Black                  :1352   1st Qu.:   0.00              
##  White                  :1352   Median :   2.00              
##  Asian/Pacific\nIslander:1008   Mean   :  20.98              
##  Latino/Hispanic        :1008   3rd Qu.:  12.00              
##  Other/Unknown          :1008   Max.   :3353.00              
##  (Other)                :1720                                
##  HIV.DIAGNOSES.PER.100.000.POPULATION
##  Min.   :  0.0                       
##  1st Qu.:  0.0                       
##  Median :  9.0                       
##  Mean   : 24.8                       
##  3rd Qu.: 32.0                       
##  Max.   :821.0                       
##                                      
##  TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES
##  Min.   :  0.000                              
##  1st Qu.:  0.000                              
##  Median :  0.000                              
##  Mean   :  3.922                              
##  3rd Qu.:  2.000                              
##  Max.   :680.000                              
##                                               
##  PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES
##  Min.   :  0.0                                                      
##  1st Qu.:  0.0                                                      
##  Median : 11.0                                                      
##  Mean   : 14.7                                                      
##  3rd Qu.: 20.0                                                      
##  Max.   :100.0                                                      
##                                                                     
##  TOTAL.NUMBER.OF.AIDS.DIAGNOSES AIDS.DIAGNOSES.PER.100.000.POPULATION
##  Min.   :   0.0                 Min.   :  0.00                       
##  1st Qu.:   0.0                 1st Qu.:  0.00                       
##  Median :   1.0                 Median :  4.00                       
##  Mean   :  13.5                 Mean   : 15.83                       
##  3rd Qu.:   7.0                 3rd Qu.: 19.00                       
##  Max.   :2611.0                 Max.   :565.00                       
## 

Model building

I would like to predict the number of “TOTAL.NUMBER.OF.HIV.DIAGNOSES” variable and find out whether variables “Neighborhood..U.H.F.”, “SEX” and “RACE.ETHNICITY” have any relationship with this dependent variable. We take a look at the “SEX” variable to see if it is evenly distributed among our “RACE.ETHNICITY” variable.

At first glance, we can see that the number of Females and Males in the data seem to be evenly distributed.

Next step, we partition the data into training (80%) and test (20%) in order to measure model performance.

# Two-way table of factor variables
xtabs(~ SEX + RACE.ETHNICITY, data = data_prepared)
##         RACE.ETHNICITY
## SEX      All Asian/Pacific Islander Asian/Pacific\nIslander Black Hispanic
##   All    512                      0                     336   336        0
##   Female 508                    172                     336   508      172
##   Male   508                    172                     336   508      172
##         RACE.ETHNICITY
## SEX      Latino/Hispanic Multiracial Native American Other/Unknown Unknown
##   All                336           0               0           336       0
##   Female             336         172             172           336     172
##   Male               336         172             172           336     172
##         RACE.ETHNICITY
## SEX      White
##   All      336
##   Female   508
##   Male     508
# Partition data - train (80%) & test (20%)
set.seed(1234)
ind <- sample(2, nrow(data_prepared), replace = T, prob = c(0.8, 0.2))
train <- data_prepared[ind==1,]
test <- data_prepared[ind==2,]

First Decision Tree

We run the first decision tree model to predict number of “TOTAL.NUMBER.OF.HIV.DIAGNOSES” with the variable “Neighborhood..U.H.F.” below:

first_model <- rpart(TOTAL.NUMBER.OF.HIV.DIAGNOSES ~ Neighborhood..U.H.F., method="anova", train)

# Detailed summary of splits
summary(first_model)
## Call:
## rpart(formula = TOTAL.NUMBER.OF.HIV.DIAGNOSES ~ Neighborhood..U.H.F., 
##     data = train, method = "anova")
##   n= 7194 
## 
##          CP nsplit rel error    xerror      xstd
## 1 0.1191236      0 1.0000000 1.0002275 0.2400782
## 2 0.0100000      1 0.8808764 0.8834338 0.2186504
## 
## Variable importance
## Neighborhood..U.H.F. 
##                  100 
## 
## Node number 1: 7194 observations,    complexity param=0.1191236
##   mean=20.91423, MSE=10976.63 
##   left son=2 (6578 obs) right son=3 (616 obs)
##   Primary splits:
##       Neighborhood..U.H.F. splits as  RLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL, improve=0.1191236, (0 missing)
## 
## Node number 2: 6578 observations
##   mean=9.848586, MSE=485.1932 
## 
## Node number 3: 616 observations
##   mean=139.0795, MSE=107739.6
# Visualize cross-validation results
plotcp(first_model) 

As we can see below, the R-square increases with the number of splits, and the X Relative Error decreases as the number of splits increase.

# create additional plots
par(mfrow=c(1,2))
rsq.rpart(first_model)
## 
## Regression tree:
## rpart(formula = TOTAL.NUMBER.OF.HIV.DIAGNOSES ~ Neighborhood..U.H.F., 
##     data = train, method = "anova")
## 
## Variables actually used in tree construction:
## [1] Neighborhood..U.H.F.
## 
## Root node error: 78965872/7194 = 10977
## 
## n= 7194 
## 
##        CP nsplit rel error  xerror    xstd
## 1 0.11912      0   1.00000 1.00023 0.24008
## 2 0.01000      1   0.88088 0.88343 0.21865

# plot tree
plot(first_model, uniform=TRUE, main="Regression Tree for Number of HIV diagnoses ")
text(first_model, use.n=TRUE, all=TRUE, cex=.8)
## Warning in labels.rpart(x, minlength = minlength): more than 52 levels in a
## predicting factor, truncated for printout

Make predictions using the first model

# make predictions
p <- predict(first_model, test)

# Root Mean Square Error
sqrt(mean((test$TOTAL.NUMBER.OF.HIV.DIAGNOSES - p)^2))
## [1] 105.9462
# R-square
(cor(test$TOTAL.NUMBER.OF.HIV.DIAGNOSES, p))^2
## [1] 0.1071033

The RMSE and R-square results are not satisfying when making predictions based off the first model.

Second Decision Tree

The second decision tree is devised to predict number of “TOTAL.NUMBER.OF.HIV.DIAGNOSES” with the variable “SEX” and “RACE.ETHNICITY” below. Comparing the RMSE and R-square results with those achieved in model 1, we see that there is essentially no improvement. RMSE moves from 105.9462 to 108.1461 and R-square moves from 0.1071033 to 0.07019709.

second_model <- rpart(TOTAL.NUMBER.OF.HIV.DIAGNOSES ~ SEX + RACE.ETHNICITY, method="anova", train)
summary(second_model)
## Call:
## rpart(formula = TOTAL.NUMBER.OF.HIV.DIAGNOSES ~ SEX + RACE.ETHNICITY, 
##     data = train, method = "anova")
##   n= 7194 
## 
##           CP nsplit rel error    xerror      xstd
## 1 0.04287737      0 1.0000000 1.0002142 0.2401039
## 2 0.01840571      1 0.9571226 0.9581364 0.2316735
## 3 0.01000000      2 0.9387169 0.9400997 0.2274470
## 
## Variable importance
## RACE.ETHNICITY            SEX 
##             70             30 
## 
## Node number 1: 7194 observations,    complexity param=0.04287737
##   mean=20.91423, MSE=10976.63 
##   left son=2 (5981 obs) right son=3 (1213 obs)
##   Primary splits:
##       RACE.ETHNICITY splits as  RLLLLLLLLLL, improve=0.04287737, (0 missing)
##       SEX            splits as  RLL,         improve=0.01303778, (0 missing)
## 
## Node number 2: 5981 observations
##   mean=11.14429, MSE=2212.04 
## 
## Node number 3: 1213 observations,    complexity param=0.01840571
##   mean=69.08739, MSE=51401.33 
##   left son=6 (404 obs) right son=7 (809 obs)
##   Primary splits:
##       SEX splits as  RLR, improve=0.02331078, (0 missing)
## 
## Node number 6: 404 observations
##   mean=20.10396, MSE=2232.554 
## 
## Node number 7: 809 observations
##   mean=93.54883, MSE=74158.76
plotcp(second_model)

par(mfrow=c(1,2)) # two plots on one page
rsq.rpart(second_model) # visualize cross-validation results
## 
## Regression tree:
## rpart(formula = TOTAL.NUMBER.OF.HIV.DIAGNOSES ~ SEX + RACE.ETHNICITY, 
##     data = train, method = "anova")
## 
## Variables actually used in tree construction:
## [1] RACE.ETHNICITY SEX           
## 
## Root node error: 78965872/7194 = 10977
## 
## n= 7194 
## 
##         CP nsplit rel error  xerror    xstd
## 1 0.042877      0   1.00000 1.00021 0.24010
## 2 0.018406      1   0.95712 0.95814 0.23167
## 3 0.010000      2   0.93872 0.94010 0.22745

plot(second_model, uniform=TRUE, main="Regression Tree for Number of HIV diagnoses")
text(second_model, use.n=TRUE, all=TRUE, cex=.8)

# make predictions
p2 <- predict(second_model, test)

# Root Mean Square Error
sqrt(mean((test$TOTAL.NUMBER.OF.HIV.DIAGNOSES - p2)^2))
## [1] 108.1461
# R-square
(cor(test$TOTAL.NUMBER.OF.HIV.DIAGNOSES, p2))^2
## [1] 0.07019709

Random Forest

The final model is the random forest model with all 3 predictor variables. From the results below, we can see that the percentage of variance explained with this model is about 30%, which is better than the first two models. We see that the RMSE and R-square improve significantly with the random forest model which adds extra predictors into the model.

train2 <- train
test2 <- test

third_model <- randomForest(TOTAL.NUMBER.OF.HIV.DIAGNOSES ~ Neighborhood..U.H.F. + SEX + RACE.ETHNICITY, importance = TRUE, na.action = na.omit, train2)

print(third_model)
## 
## Call:
##  randomForest(formula = TOTAL.NUMBER.OF.HIV.DIAGNOSES ~ Neighborhood..U.H.F. +      SEX + RACE.ETHNICITY, data = train2, importance = TRUE, na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 7768.409
##                     % Var explained: 29.23
plot(third_model)

# make predictions
p3 <- predict(third_model, test2)

# Root Mean Square Error
sqrt(mean((test2$TOTAL.NUMBER.OF.HIV.DIAGNOSES - p3)^2))
## [1] 93.85277
# R-square
(cor(test2$TOTAL.NUMBER.OF.HIV.DIAGNOSES, p3))^2
## [1] 0.3762122

Conclusion:

Based upon running the two decision trees and the random forest, the random forest model with three predictors appears to be the best model but it could be due to over fitting. Even though it’s complex, I would choose the random forest model due to the better fit.

Additionally, based on real cases where decision trees have gone “wrong”, machine learning algorithm version of decision tree appears to be simpler than actually drawing a tree. The machine-learning version allows the user to interpret the results in a simpler way than the visual depiction of an actual tree would. As decisions become more complex with more nodes, algorithm can help simplify the results.