Choose a data set:
You get to decide which data set you want to work on. The data set must be different from the ones used in previous homeworks. You can work on a problem from your job, or something you are interested in. You may also obtain a data set from sites such as Kaggle, Data.gov, the Census Bureau, USGS or other open data portals.
Select one of the methodologies studied in weeks 1-10 and another methodology from weeks 11-15, and apply both to the newly selected data set.
To complete this task:
Your final presentation should include:
The R or Python source file, plus either an essay (a document of at least 500 words) or a video (a 5- to 8-minute recording) that covers the execution and explanation of your code. The video can be recorded on any platform of your choice (YouTube, Free Cam).
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.2
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DataExplorer)
library(rpart)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(openxlsx)
library(lubridate)
library(readxl)
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 4.3.2
##
## Attaching package: 'neuralnet'
##
## The following object is masked from 'package:dplyr':
##
## compute
I used the NYC Open Data set on Pregnancy-Associated Mortality by Race/Ethnicity, Underlying Cause and Borough of Residence. I wanted to understand whether pregnancy-associated mortality is driven by location (borough) and by race/ethnicity.
The data set contains 6 variables: (1) Year, (2) Related, (3) Underlying_cause, (4) Race/ethnicity, (5) Borough and (6) Deaths.
I loaded the file from my personal directory because I kept running into an error when reading from the GitHub repository: “Warning: incomplete final line found by readTableHeader on ‘https://github.com/tponnada/hello-world/blob/master/Pregnancy-Associated_Mortality_20231217.csv’”. The exact same file reads without issue from my local directory.
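A likely cause is that the github.com “blob” URL serves GitHub’s HTML page rather than the raw CSV, which is what triggers the incomplete-final-line warning. Reading from the corresponding raw.githubusercontent.com address should avoid this; a sketch (not run, and assuming the file sits on the master branch):
# Not run: read the raw CSV directly instead of the GitHub HTML "blob" page
# preg <- read.csv("https://raw.githubusercontent.com/tponnada/hello-world/master/Pregnancy-Associated_Mortality_20231217.csv")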
preg <- read.csv("C:\\Users\\tparker3\\OneDrive - Bell Partners, Inc\\Documents\\Pregnancy-Associated_Mortality_20231217.csv")
Once the file is loaded, we proceed with reviewing the structure and content of the data set to understand which machine learning algorithms can be applied in this context.
Looking at the 6 variables, no type conversion was needed: Year and Deaths are of type int, and the remaining variables (Related, Underlying_cause, Race.ethnicity and Borough) are of type chr. The summary() and plot_missing() output below shows there are no missing values in any of the 6 variables.
head(preg)
## Year Related Underlying_cause Race.ethnicity Borough Deaths
## 1 2016 All All All Bronx 5
## 2 2016 All All All Brooklyn 12
## 3 2016 All All All Manhattan 4
## 4 2016 All All All Queens 8
## 5 2016 All All All Staten Island 4
## 6 2016 All All All Rest of State 3
colnames(preg)
## [1] "Year" "Related" "Underlying_cause" "Race.ethnicity"
## [5] "Borough" "Deaths"
print(nrow(preg))
## [1] 198
print(ncol(preg))
## [1] 6
str(preg)
## 'data.frame': 198 obs. of 6 variables:
## $ Year : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ Related : chr "All" "All" "All" "All" ...
## $ Underlying_cause: chr "All" "All" "All" "All" ...
## $ Race.ethnicity : chr "All" "All" "All" "All" ...
## $ Borough : chr "Bronx" "Brooklyn" "Manhattan" "Queens" ...
## $ Deaths : int 5 12 4 8 4 3 4 15 9 8 ...
dim(preg)
## [1] 198 6
plot_missing(preg)
summary(preg)
## Year Related Underlying_cause Race.ethnicity
## Min. :2016 Length:198 Length:198 Length:198
## 1st Qu.:2017 Class :character Class :character Class :character
## Median :2018 Mode :character Mode :character Mode :character
## Mean :2018
## 3rd Qu.:2019
## Max. :2020
## Borough Deaths
## Length:198 Min. : 0.000
## Class :character 1st Qu.: 1.000
## Mode :character Median : 3.000
## Mean : 4.803
## 3rd Qu.: 6.000
## Max. :26.000
I would like to predict the “Deaths” variable and find out whether the variables “Underlying_cause”, “Race.ethnicity” and “Borough” have any relationship with this dependent variable.
Next, we partition the data into training (80%) and test (20%) sets in order to measure model performance.
# Partition data - train (80%) & test (20%)
set.seed(1234)
ind <- sample(2, nrow(preg), replace = TRUE, prob = c(0.8, 0.2))
train <- preg[ind == 1,]
test <- preg[ind == 2,]
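A quick sanity check on the split sizes (the rpart output below reports n = 160 for the training set, which leaves 38 of the 198 rows for testing):
# Verify the approximate 80/20 split
nrow(train)
nrow(test)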
We fit the first decision tree model below, predicting the number of “Deaths” from the variable “Borough”:
first_model <- rpart(Deaths ~ Borough, data = train, method = "anova")
# Detailed summary of splits
summary(first_model)
## Call:
## rpart(formula = Deaths ~ Borough, data = train, method = "anova")
## n= 160
##
## CP nsplit rel error xerror xstd
## 1 0.15734571 0 1.0000000 1.0163891 0.1857685
## 2 0.03011169 1 0.8426543 0.9206611 0.1967758
## 3 0.01000000 2 0.8125426 0.8834268 0.1942098
##
## Variable importance
## Borough
## 100
##
## Node number 1: 160 observations, complexity param=0.1573457
## mean=4.88125, MSE=21.52965
## left son=2 (131 obs) right son=3 (29 obs)
## Primary splits:
## Borough splits as LRRLRLL, improve=0.1573457, (0 missing)
##
## Node number 2: 131 observations
## mean=4.015267, MSE=17.74022
##
## Node number 3: 29 observations, complexity param=0.03011169
## mean=8.793103, MSE=19.95719
## left son=6 (19 obs) right son=7 (10 obs)
## Primary splits:
## Borough splits as -LR-L--, improve=0.1792233, (0 missing)
##
## Node number 6: 19 observations
## mean=7.421053, MSE=15.50693
##
## Node number 7: 10 observations
## mean=11.4, MSE=18.04
# Visualize cross-validation results
plotcp(first_model)
As we can see below, the R-square increases with the number of splits, and the X Relative Error decreases as the number of splits increases.
# create additional plots
par(mfrow = c(1,2))
rsq.rpart(first_model)
##
## Regression tree:
## rpart(formula = Deaths ~ Borough, data = train, method = "anova")
##
## Variables actually used in tree construction:
## [1] Borough
##
## Root node error: 3444.7/160 = 21.53
##
## n= 160
##
## CP nsplit rel error xerror xstd
## 1 0.157346 0 1.00000 1.01639 0.18577
## 2 0.030112 1 0.84265 0.92066 0.19678
## 3 0.010000 2 0.81254 0.88343 0.19421
# plot tree
plot(first_model, uniform = TRUE, main = "Regression Tree for Number of Deaths")
text(first_model, use.n = TRUE, all = TRUE, cex =.8)
# make predictions
p <- predict(first_model, test)
# Root Mean Square Error
sqrt(mean((test$Deaths - p)^2))
## [1] 4.884643
# R-square
(cor(test$Deaths, p))^2
## [1] 0.0617929
The RMSE and R-square results from the first model are not satisfactory.
The second decision tree predicts the number of “Deaths” from the variables “Underlying_cause” and “Race.ethnicity”. Comparing its RMSE and R-square with those of model 1, model 2 is a clear improvement: RMSE falls from 4.884643 to 3.379278 and R-square rises from 0.0617929 to 0.5502431.
second_model <- rpart(Deaths ~ Underlying_cause + Race.ethnicity, data = train, method = "anova")
summary(second_model)
## Call:
## rpart(formula = Deaths ~ Underlying_cause + Race.ethnicity, data = train,
## method = "anova")
## n= 160
##
## CP nsplit rel error xerror xstd
## 1 0.28981373 0 1.0000000 1.0095115 0.1857030
## 2 0.11425007 1 0.7101863 0.7937076 0.1270433
## 3 0.01458420 2 0.5959362 0.6771606 0.1063049
## 4 0.01254889 3 0.5813520 0.6871915 0.1054926
## 5 0.01000000 4 0.5688031 0.6789787 0.1050948
##
## Variable importance
## Race.ethnicity Underlying_cause
## 72 28
##
## Node number 1: 160 observations, complexity param=0.2898137
## mean=4.88125, MSE=21.52965
## left son=2 (144 obs) right son=3 (16 obs)
## Primary splits:
## Race.ethnicity splits as LLRRLL, improve=0.2898137, (0 missing)
## Underlying_cause splits as RLLLLLLLLLLLLL, improve=0.2430042, (0 missing)
##
## Node number 2: 144 observations, complexity param=0.1142501
## mean=4.048611, MSE=12.92125
## left son=4 (77 obs) right son=5 (67 obs)
## Primary splits:
## Underlying_cause splits as RLLLLLLLLLLLLL, improve=0.211517600, (0 missing)
## Race.ethnicity splits as LL--LR, improve=0.008009186, (0 missing)
## Surrogate splits:
## Race.ethnicity splits as LR--RR, agree=0.639, adj=0.224, (0 split)
##
## Node number 3: 16 observations
## mean=12.375, MSE=36.60938
##
## Node number 4: 77 observations, complexity param=0.01254889
## mean=2.506494, MSE=2.795412
## left son=8 (35 obs) right son=9 (42 obs)
## Primary splits:
## Underlying_cause splits as -LRRRRLLRLRLLL, improve=0.2008286, (0 missing)
##
## Node number 5: 67 observations, complexity param=0.0145842
## mean=5.820896, MSE=18.68434
## left son=10 (7 obs) right son=11 (60 obs)
## Primary splits:
## Race.ethnicity splits as RL--LR, improve=0.04013165, (0 missing)
##
## Node number 8: 35 observations
## mean=1.685714, MSE=0.8440816
##
## Node number 9: 42 observations
## mean=3.190476, MSE=3.39229
##
## Node number 10: 7 observations
## mean=3.285714, MSE=3.061224
##
## Node number 11: 60 observations
## mean=6.116667, MSE=19.66972
plotcp(second_model)
par(mfrow = c(1,2)) # two plots on one page
rsq.rpart(second_model) # visualize cross-validation results
##
## Regression tree:
## rpart(formula = Deaths ~ Underlying_cause + Race.ethnicity, data = train,
## method = "anova")
##
## Variables actually used in tree construction:
## [1] Race.ethnicity Underlying_cause
##
## Root node error: 3444.7/160 = 21.53
##
## n= 160
##
## CP nsplit rel error xerror xstd
## 1 0.289814 0 1.00000 1.00951 0.18570
## 2 0.114250 1 0.71019 0.79371 0.12704
## 3 0.014584 2 0.59594 0.67716 0.10630
## 4 0.012549 3 0.58135 0.68719 0.10549
## 5 0.010000 4 0.56880 0.67898 0.10509
plot(second_model, uniform = TRUE, main = "Regression Tree for Number of Deaths")
text(second_model, use.n = TRUE, all = TRUE, cex = .8)
# make predictions
p2 <- predict(second_model, test)
# Root Mean Square Error
sqrt(mean((test$Deaths - p2)^2))
## [1] 3.379278
# R-square
(cor(test$Deaths, p2))^2
## [1] 0.5502431
Next we fit a random forest model with all 3 predictor variables. From the results below, this model explains about 40% of the variance. Its RMSE and R-square on the test set are slightly worse than those of model 2, but using all three independent variables gives more comfort than using only two, so I recommend the random forest model.
train2 <- train
test2 <- test
third_model <- randomForest(Deaths ~ Borough + Underlying_cause + Race.ethnicity, data = train2, importance = TRUE, na.action = na.omit)
print(third_model)
##
## Call:
## randomForest(formula = Deaths ~ Borough + Underlying_cause + Race.ethnicity, data = train2, importance = TRUE, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 12.99198
## % Var explained: 39.66
plot(third_model)
# make predictions
p3 <- predict(third_model, test2)
# Root Mean Square Error
sqrt(mean((test2$Deaths - p3)^2))
## [1] 3.812101
# R-square
(cor(test2$Deaths, p3))^2
## [1] 0.4790389
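Because the random forest was fit with importance = TRUE, the contribution of each predictor can also be inspected. A brief sketch (not run in the original output):
# Predictor importance: %IncMSE and IncNodePurity for each of the three variables
importance(third_model)
varImpPlot(third_model, main = "Variable Importance - Random Forest")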
Next we use a neural network to see if the results are any better. The following code attempts to predict Deaths from Borough, Underlying_cause and Race.ethnicity. Unfortunately, this did not work for me: although I converted the categorical variables to dummy variables, neuralnet still returned the error “Error in [.data.frame(data, , model.list$variables) : undefined columns selected”.
# Dummy-code the categorical variables with model.matrix()
m <- model.matrix(~ Year + Related + Underlying_cause + Race.ethnicity + Borough + Deaths, data = preg)
head(m)
## (Intercept) Year RelatedPregnancy-associated but not related
## 1 1 2016 0
## 2 1 2016 0
## 3 1 2016 0
## 4 1 2016 0
## 5 1 2016 0
## 6 1 2016 0
## RelatedPregnancy-related RelatedPregnancy-Related RelatedUnable to Determine
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Underlying_causeAsthma/ pulmonary conditions Underlying_causeCancer
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## Underlying_causeCardiovascular Conditions Underlying_causeEmbolism
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## Underlying_causeHemorrhage Underlying_causeHomicide
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## Underlying_causeInfection/Sepsis
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## Underlying_causeMental Health Conditions (Overdose related to substance use disorder)
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## Underlying_causeMental Health Conditions (Suicide) Underlying_causeOther
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## Underlying_causePregnancy-induced hypertension
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## Underlying_causeUnintentional Injury Underlying_causeUnknown COD
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## Race.ethnicityAsian/Pacific Islander Race.ethnicityBlack non-Latina
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## Race.ethnicityLatina Race.ethnicityOther Race.ethnicityWhite non-Latina
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## BoroughBronx BoroughBrooklyn BoroughManhattan BoroughQueens
## 1 1 0 0 0
## 2 0 1 0 0
## 3 0 0 1 0
## 4 0 0 0 1
## 5 0 0 0 0
## 6 0 0 0 0
## BoroughRest of State BoroughStaten Island Deaths
## 1 0 0 5
## 2 0 0 12
## 3 0 0 4
## 4 0 0 8
## 5 0 1 4
## 6 1 0 3
colnames(m) <- make.names(colnames(m))  # m is a matrix, so use colnames(), not names()
set.seed(245)
data_rows <- floor(0.80 * nrow(m))
train_indices <- sample(c(1:nrow(m)), data_rows)
train_data <- m[train_indices,]
test_data <- m[-train_indices,]
#nnet_model <- neuralnet(Deaths ~ Borough + Underlying_cause + Race.ethnicity, data = test_data, hidden = 1, err.fct = "sse", linear.output #= FALSE)
#plot(nnet_model, rep = "best")
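A sketch of how the neuralnet call could be made to run, assuming the dummy-coded matrix m built above (the names nn_df, nn_formula, nn_train, nn_test and p4 are introduced here for illustration). The error arises because the formula refers to the original factor names (Borough, Underlying_cause, Race.ethnicity) while the data contains only their dummy-coded columns, so the formula has to be built over the model-matrix column names instead:
# Sketch (not run): fit the neural network on the dummy-coded columns rather
# than the original factors, which caused the "undefined columns selected" error
nn_df <- as.data.frame(m[, -1])                    # drop the intercept column
colnames(nn_df) <- make.names(colnames(nn_df))     # syntactically valid column names
nn_df$Year <- as.numeric(scale(nn_df$Year))        # put Year on a scale comparable to the 0/1 dummies
predictors <- setdiff(colnames(nn_df), "Deaths")
nn_formula <- as.formula(paste("Deaths ~", paste(predictors, collapse = " + ")))
nn_train <- nn_df[train_indices, ]
nn_test <- nn_df[-train_indices, ]
set.seed(245)
# linear.output = TRUE because Deaths is a numeric regression target;
# stepmax may need to be raised if the network does not converge
nnet_model <- neuralnet(nn_formula, data = nn_train, hidden = 1, err.fct = "sse", linear.output = TRUE)
plot(nnet_model, rep = "best")
# Test-set RMSE for comparison with the tree-based models
p4 <- compute(nnet_model, nn_test[, predictors])$net.result
sqrt(mean((nn_test$Deaths - p4)^2))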
The purpose of the analysis was to predict the number of “Deaths” (pregnancy-associated mortality) using the independent variables “Borough”, “Underlying_cause” and “Race.ethnicity”. We used decision trees, a random forest and a neural network to attempt to predict the number of deaths. The first decision tree used only Borough, while the second used Underlying_cause and Race.ethnicity. The random forest model uses all three independent variables and is the one I feel most confident in: it has RMSE and R-squared values similar to the second decision tree and explains a reasonable share of the variance. This suggests that all three factors (borough, underlying cause and race/ethnicity) can affect pregnancy-associated mortality, which has implications for public policy. For example, government assistance or subsidies could be targeted toward the boroughs and race/ethnicity groups with higher pregnancy-associated mortality.
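For reference, the test-set metrics reported above can be collected side by side (values copied from the output shown earlier):
# Test-set performance of the three models, using the values reported above
model_comparison <- data.frame(
  Model = c("Decision tree: Borough",
            "Decision tree: Underlying_cause + Race.ethnicity",
            "Random forest: all three predictors"),
  RMSE = c(4.884643, 3.379278, 3.812101),
  R_squared = c(0.0617929, 0.5502431, 0.4790389)
)
model_comparison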