R Markdown

Assignment

Choose a data set:

You get to decide which data set you want to work on. The data set must be different from the ones used in previous home works. You can work on a problem from your job, or something you are interested in. You may also obtain a data set from sites such as Kaggle, Data.Gov, Census Bureau, USGS or other open data portals.

Select one of the methodologies studied in weeks 1-10, and another methodology from weeks 11-15 to apply in the new data set selected.

To complete this task:

  1. Describe the problem you are trying to solve.
  2. Describe your data set and what you did to prepare the data for analysis.
  3. Methodologies you used for analyzing the data
  4. What’s the purpose of the analysis performed
  5. Make your conclusions from your analysis. Please be sure to address the business impact (it could be of any domain) of your solution.

Deliverable

Your final presentation (essay or video) should include:

The traditional R file or Python file and essay, An Essay (minimum 500 word document) or Video ( 5 to 8 minutes recording) Include the execution and explanation of your code. The video can be recorded on any platform of your choice (Youtube, Free Cam).

library(e1071)
## Warning: package 'e1071' was built under R version 4.3.2
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DataExplorer)
library(rpart)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(openxlsx)
library(lubridate)
library(readxl)
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 4.3.2
## 
## Attaching package: 'neuralnet'
## 
## The following object is masked from 'package:dplyr':
## 
##     compute

Describe the problem you are trying to solve:

Solution:

I used the NYC Open Data set on Pregnancy-Associated Mortality by Race/Ethnicity, Underlying cause and Borough of Residence data set. I wanted to understand if Pregnancy-Associated Mortality was driven by location (Borough) and also Race/Ethnicity.

The data set contains 6 variables, listed below: 1). Year, 2). Related, 3). Underlying_cause, 4). Race/ethnicity, 5). Borough and 6). Deaths

I’ve uploaded the file from my personal directory as I kept running into the error reading from the GitHub directory - Warning: incomplete final line found by readTableHeader on Warning: incomplete final line found by readTableHeader on ‘https://github.com/tponnada/hello-world/blob/master/Pregnancy-Associated_Mortality_20231217.csv

There appears to be some problem uploading and reading the file from GitHub, I was able to read the exact same file from my local directory..

preg <- read.csv("C:\\Users\\tparker3\\OneDrive - Bell Partners, Inc\\Documents\\Pregnancy-Associated_Mortality_20231217.csv")

Describe your data set and what you did to prepare the data for analysis

Exploratory Analysis

Once the file is loaded, we proceed with reviewing the structure and content of the data set to understand which machine learning algorithms can be applied in this context.

No conversion was needed looking at the 6 variables in the file - Year and Death variables were of type int and the other variables were of type char. The summary and plot_missing functions show there are no missing values for any of the 6 variables.

1). Year, 2). Related, 3). Underlying_cause, 4). Race/ethnicity, 5). Borough and 6). Deaths

head(preg)
##   Year Related Underlying_cause Race.ethnicity       Borough Deaths
## 1 2016     All              All            All         Bronx      5
## 2 2016     All              All            All      Brooklyn     12
## 3 2016     All              All            All     Manhattan      4
## 4 2016     All              All            All        Queens      8
## 5 2016     All              All            All Staten Island      4
## 6 2016     All              All            All Rest of State      3
colnames(preg)
## [1] "Year"             "Related"          "Underlying_cause" "Race.ethnicity"  
## [5] "Borough"          "Deaths"
print(nrow(preg))
## [1] 198
print(ncol(preg))
## [1] 6
str(preg)
## 'data.frame':    198 obs. of  6 variables:
##  $ Year            : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ Related         : chr  "All" "All" "All" "All" ...
##  $ Underlying_cause: chr  "All" "All" "All" "All" ...
##  $ Race.ethnicity  : chr  "All" "All" "All" "All" ...
##  $ Borough         : chr  "Bronx" "Brooklyn" "Manhattan" "Queens" ...
##  $ Deaths          : int  5 12 4 8 4 3 4 15 9 8 ...
dim(preg)
## [1] 198   6
plot_missing(preg)

summary(preg)
##       Year        Related          Underlying_cause   Race.ethnicity    
##  Min.   :2016   Length:198         Length:198         Length:198        
##  1st Qu.:2017   Class :character   Class :character   Class :character  
##  Median :2018   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2018                                                           
##  3rd Qu.:2019                                                           
##  Max.   :2020                                                           
##    Borough              Deaths      
##  Length:198         Min.   : 0.000  
##  Class :character   1st Qu.: 1.000  
##  Mode  :character   Median : 3.000  
##                     Mean   : 4.803  
##                     3rd Qu.: 6.000  
##                     Max.   :26.000

Methodologies you used for analyzing the data

Model building

I would like to predict the number of “Deaths” variable and find out whether variables “Underlying_cause”, “Race.ethnicity” and “Borough” have any relationship with this dependent variable.

Next step, we partition the data into training (80%) and test (20%) in order to measure model performance.

# Partition data - train (80%) & test (20%)

set.seed(1234)
ind <- sample(2, nrow(preg), replace = T, prob = c(0.8, 0.2))
train <- preg[ind == 1,]
test <- preg[ind == 2,]

First Decision Tree

We run the first decision tree model to predict number of “Deaths” with the variable “Borough” below:

first_model <- rpart(Deaths ~ Borough, method ="anova", train)

# Detailed summary of splits
summary(first_model)
## Call:
## rpart(formula = Deaths ~ Borough, data = train, method = "anova")
##   n= 160 
## 
##           CP nsplit rel error    xerror      xstd
## 1 0.15734571      0 1.0000000 1.0163891 0.1857685
## 2 0.03011169      1 0.8426543 0.9206611 0.1967758
## 3 0.01000000      2 0.8125426 0.8834268 0.1942098
## 
## Variable importance
## Borough 
##     100 
## 
## Node number 1: 160 observations,    complexity param=0.1573457
##   mean=4.88125, MSE=21.52965 
##   left son=2 (131 obs) right son=3 (29 obs)
##   Primary splits:
##       Borough splits as  LRRLRLL, improve=0.1573457, (0 missing)
## 
## Node number 2: 131 observations
##   mean=4.015267, MSE=17.74022 
## 
## Node number 3: 29 observations,    complexity param=0.03011169
##   mean=8.793103, MSE=19.95719 
##   left son=6 (19 obs) right son=7 (10 obs)
##   Primary splits:
##       Borough splits as  -LR-L--, improve=0.1792233, (0 missing)
## 
## Node number 6: 19 observations
##   mean=7.421053, MSE=15.50693 
## 
## Node number 7: 10 observations
##   mean=11.4, MSE=18.04
# Visualize cross-validation results
plotcp(first_model) 

As we can see below, the R-square increases with the number of splits, and the X Relative Error decreases as the number of splits increase.

# create additional plots
par(mfrow = c(1,2))
rsq.rpart(first_model)
## 
## Regression tree:
## rpart(formula = Deaths ~ Borough, data = train, method = "anova")
## 
## Variables actually used in tree construction:
## [1] Borough
## 
## Root node error: 3444.7/160 = 21.53
## 
## n= 160 
## 
##         CP nsplit rel error  xerror    xstd
## 1 0.157346      0   1.00000 1.01639 0.18577
## 2 0.030112      1   0.84265 0.92066 0.19678
## 3 0.010000      2   0.81254 0.88343 0.19421

# plot tree

plot(first_model, uniform = TRUE, main = "Regression Tree for Number of Deaths")
text(first_model, use.n = TRUE, all = TRUE, cex =.8)

Make predictions using the first model

# make predictions
p <- predict(first_model, test)

# Root Mean Square Error
sqrt(mean((test$Deaths - p)^2))
## [1] 4.884643
# R-square
(cor(test$Deaths, p))^2
## [1] 0.0617929

The RMSE and R-square results are not satisfying when making predictions based off the first model.

Second Decision Tree

The second decision tree is devised to predict number of “Deaths” with the variable “Underlying_cause” and “Race.ethnicity” below. Comparing the RMSE and R-square results with those achieved in model 1, we see that there is an improvement in Model 2 compared with model 1. RMSE moves from 4.884643 to 3.379278 and R-square moves from 0.0617929 to 0.5502431.

second_model <- rpart(Deaths ~ Underlying_cause + Race.ethnicity, method = "anova", train)
summary(second_model)
## Call:
## rpart(formula = Deaths ~ Underlying_cause + Race.ethnicity, data = train, 
##     method = "anova")
##   n= 160 
## 
##           CP nsplit rel error    xerror      xstd
## 1 0.28981373      0 1.0000000 1.0095115 0.1857030
## 2 0.11425007      1 0.7101863 0.7937076 0.1270433
## 3 0.01458420      2 0.5959362 0.6771606 0.1063049
## 4 0.01254889      3 0.5813520 0.6871915 0.1054926
## 5 0.01000000      4 0.5688031 0.6789787 0.1050948
## 
## Variable importance
##   Race.ethnicity Underlying_cause 
##               72               28 
## 
## Node number 1: 160 observations,    complexity param=0.2898137
##   mean=4.88125, MSE=21.52965 
##   left son=2 (144 obs) right son=3 (16 obs)
##   Primary splits:
##       Race.ethnicity   splits as  LLRRLL, improve=0.2898137, (0 missing)
##       Underlying_cause splits as  RLLLLLLLLLLLLL, improve=0.2430042, (0 missing)
## 
## Node number 2: 144 observations,    complexity param=0.1142501
##   mean=4.048611, MSE=12.92125 
##   left son=4 (77 obs) right son=5 (67 obs)
##   Primary splits:
##       Underlying_cause splits as  RLLLLLLLLLLLLL, improve=0.211517600, (0 missing)
##       Race.ethnicity   splits as  LL--LR, improve=0.008009186, (0 missing)
##   Surrogate splits:
##       Race.ethnicity splits as  LR--RR, agree=0.639, adj=0.224, (0 split)
## 
## Node number 3: 16 observations
##   mean=12.375, MSE=36.60938 
## 
## Node number 4: 77 observations,    complexity param=0.01254889
##   mean=2.506494, MSE=2.795412 
##   left son=8 (35 obs) right son=9 (42 obs)
##   Primary splits:
##       Underlying_cause splits as  -LRRRRLLRLRLLL, improve=0.2008286, (0 missing)
## 
## Node number 5: 67 observations,    complexity param=0.0145842
##   mean=5.820896, MSE=18.68434 
##   left son=10 (7 obs) right son=11 (60 obs)
##   Primary splits:
##       Race.ethnicity splits as  RL--LR, improve=0.04013165, (0 missing)
## 
## Node number 8: 35 observations
##   mean=1.685714, MSE=0.8440816 
## 
## Node number 9: 42 observations
##   mean=3.190476, MSE=3.39229 
## 
## Node number 10: 7 observations
##   mean=3.285714, MSE=3.061224 
## 
## Node number 11: 60 observations
##   mean=6.116667, MSE=19.66972
plotcp(second_model)

par(mfrow = c(1,2)) # two plots on one page
rsq.rpart(second_model) # visualize cross-validation results
## 
## Regression tree:
## rpart(formula = Deaths ~ Underlying_cause + Race.ethnicity, data = train, 
##     method = "anova")
## 
## Variables actually used in tree construction:
## [1] Race.ethnicity   Underlying_cause
## 
## Root node error: 3444.7/160 = 21.53
## 
## n= 160 
## 
##         CP nsplit rel error  xerror    xstd
## 1 0.289814      0   1.00000 1.00951 0.18570
## 2 0.114250      1   0.71019 0.79371 0.12704
## 3 0.014584      2   0.59594 0.67716 0.10630
## 4 0.012549      3   0.58135 0.68719 0.10549
## 5 0.010000      4   0.56880 0.67898 0.10509

plot(second_model, uniform = TRUE, main = "Regression Tree for Number of Deaths")
text(second_model, use.n = TRUE, all = TRUE, cex = .8)

# make predictions
p2 <- predict(second_model, test)

# Root Mean Square Error
sqrt(mean((test$Deaths - p2)^2))
## [1] 3.379278
# R-square
(cor(test$Deaths, p2))^2
## [1] 0.5502431

Random Forest

Next we use the random forest model with all 3 predictor variables. From the results below, we can see that the percentage of variance explained with this model is about 40%. We see that the RMSE and R-square worsen slightly with the random forest model when compared to Model 2 above but using all three independent variables gives more comfort than using just two in the second model above. Hence, I recommend using Random Forest modeling.

train2 <- train
test2 <- test

third_model <- randomForest(Deaths ~ Borough + Underlying_cause + Race.ethnicity, importance = TRUE, na.action = na.omit, train2)

print(third_model)
## 
## Call:
##  randomForest(formula = Deaths ~ Borough + Underlying_cause +      Race.ethnicity, data = train2, importance = TRUE, na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 12.99198
##                     % Var explained: 39.66
plot(third_model)

# make predictions
p3 <- predict(third_model, test2)

# Root Mean Square Error
sqrt(mean((test2$Deaths - p3)^2))
## [1] 3.812101
# R-square
(cor(test2$Deaths, p3))^2
## [1] 0.4790389

Neural networking

Next we use neural network modeling to see if the results are any better. The following code attempts to predict the Deaths based on Borough, Underlying_cause and Race.ethnicity. Unfortunately, this did not work for me. Although, I converted the categorical variables, I was still getting error for neural networking, specifically that “Error in [.data.frame(data, , model.list$variables) : undefined columns selected”.

# Partition data - train (80%) & test (20%)

m <- model.matrix(~ Year + Related + Underlying_cause + Race.ethnicity + Borough + Deaths, data = preg)
head(m)
##   (Intercept) Year RelatedPregnancy-associated but not related
## 1           1 2016                                           0
## 2           1 2016                                           0
## 3           1 2016                                           0
## 4           1 2016                                           0
## 5           1 2016                                           0
## 6           1 2016                                           0
##   RelatedPregnancy-related RelatedPregnancy-Related RelatedUnable to Determine
## 1                        0                        0                          0
## 2                        0                        0                          0
## 3                        0                        0                          0
## 4                        0                        0                          0
## 5                        0                        0                          0
## 6                        0                        0                          0
##   Underlying_causeAsthma/ pulmonary conditions Underlying_causeCancer
## 1                                            0                      0
## 2                                            0                      0
## 3                                            0                      0
## 4                                            0                      0
## 5                                            0                      0
## 6                                            0                      0
##   Underlying_causeCardiovascular Conditions Underlying_causeEmbolism
## 1                                         0                        0
## 2                                         0                        0
## 3                                         0                        0
## 4                                         0                        0
## 5                                         0                        0
## 6                                         0                        0
##   Underlying_causeHemorrhage Underlying_causeHomicide
## 1                          0                        0
## 2                          0                        0
## 3                          0                        0
## 4                          0                        0
## 5                          0                        0
## 6                          0                        0
##   Underlying_causeInfection/Sepsis
## 1                                0
## 2                                0
## 3                                0
## 4                                0
## 5                                0
## 6                                0
##   Underlying_causeMental Health Conditions (Overdose related to substance use disorder)
## 1                                                                                     0
## 2                                                                                     0
## 3                                                                                     0
## 4                                                                                     0
## 5                                                                                     0
## 6                                                                                     0
##   Underlying_causeMental Health Conditions (Suicide) Underlying_causeOther
## 1                                                  0                     0
## 2                                                  0                     0
## 3                                                  0                     0
## 4                                                  0                     0
## 5                                                  0                     0
## 6                                                  0                     0
##   Underlying_causePregnancy-induced hypertension
## 1                                              0
## 2                                              0
## 3                                              0
## 4                                              0
## 5                                              0
## 6                                              0
##   Underlying_causeUnintentional Injury Underlying_causeUnknown COD
## 1                                    0                           0
## 2                                    0                           0
## 3                                    0                           0
## 4                                    0                           0
## 5                                    0                           0
## 6                                    0                           0
##   Race.ethnicityAsian/Pacific Islander Race.ethnicityBlack non-Latina
## 1                                    0                              0
## 2                                    0                              0
## 3                                    0                              0
## 4                                    0                              0
## 5                                    0                              0
## 6                                    0                              0
##   Race.ethnicityLatina Race.ethnicityOther Race.ethnicityWhite non-Latina
## 1                    0                   0                              0
## 2                    0                   0                              0
## 3                    0                   0                              0
## 4                    0                   0                              0
## 5                    0                   0                              0
## 6                    0                   0                              0
##   BoroughBronx BoroughBrooklyn BoroughManhattan BoroughQueens
## 1            1               0                0             0
## 2            0               1                0             0
## 3            0               0                1             0
## 4            0               0                0             1
## 5            0               0                0             0
## 6            0               0                0             0
##   BoroughRest of State BoroughStaten Island Deaths
## 1                    0                    0      5
## 2                    0                    0     12
## 3                    0                    0      4
## 4                    0                    0      8
## 5                    0                    1      4
## 6                    1                    0      3
names(m) <- make.names(names(m))

set.seed(245)
data_rows <- floor(0.80 * nrow(m))
train_indices <- sample(c(1:nrow(m)), data_rows)
train_data <- m[train_indices,]
test_data <- m[-train_indices,]

#nnet_model <- neuralnet(Deaths ~ Borough + Underlying_cause + Race.ethnicity, data = test_data, hidden = 1, err.fct = "sse", linear.output #= FALSE)

#plot(nnet_model, rep = "best")

What’s the purpose of the analysis performed and conclusions:

Solution:

The purpose of the analysis as we see was predicting number of “Deaths/Pregnancy-associated Mortality” using the independent variables “Neighborhood”, “Underlying cause” and “Race/ethnicity”. We used decision trees, random forest and neural networks to attempt to predict the number of deaths. The first decision tree used Neighborhoods while the second used “underlying cause” and “Race/ethnicity”. The random forest model then uses all three independent variables and one that I feel most confident with, the model has similar RMSE and R-squared values as the second decisions tree and has reasonable variance. This means that all three factors neighborhood, underlying cause and race/ethnicity can impact pregnancy mortality which has implications for public policy. For example, government policies providing assistance/subsidies can be targeted towards neighborhoods and race/ethnicity profiles that have higher pregnancy-associated moralities.