Exploratory analysis and essay
Assignment
Choose a dataset You get to decide which dataset you want to work on. The data set must be different from the ones used in previous homeworks You can work on a problem from your job, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.Gov, Census Bureau, USGS or other open data portals.
Select one of the methodologies studied in weeks 1-10, and another methodology from weeks 11-15 to apply in the new dataset selected.
To complete this task:.
Deliverable
library(Amelia)
library(car)
library(caret)
library(corrplot)
library(Cubist)
library(DataExplorer)
library(dplyr)
library(e1071)
library(earth)
library(forcats)
library(forecast)
library(fpp3)
library(gbm)
library(ggplot2)
library(ggforce)
library(gridExtra)
library(kableExtra)
library(MASS)
library(Metrics)
library(mice)
library(mlbench)
library(party)
library(psych)
library(pROC)
library(randomForest)
library(RANN)
library(RColorBrewer)
library(readr)
library(readxl)
library(rpart)
library(rpart.plot)
library(stringr)
library(summarytools)
library(tidyr)
library(tidymodels)
library(VIM)
library(earth)
library(randomForest)
Data retrived came from Centers of Disease Control and Prevention website. The data selected was the 2018-2022 Underlying Cause of Deaths by Single: Race Categories. The queries completed from this site is limited to 75,000 observations. In order to limit the data, State was filtered to strictly NYS and data was collected in yearly batches then merged.
We will first load in the data that is required for this analysis.
cdc_ucd_df_2018 <- as_tibble(read_tsv(url_git_2018,
show_col_types = FALSE)
)%>%
dplyr::select(-1)%>%
rename(Race = `Single Race 6`,
`Race Code` = `Single Race 6 Code`)
cdc_ucd_df_2019 <- as_tibble(read_tsv(url_git_2019,
show_col_types = FALSE)
)%>%
dplyr::select(-1)%>%
rename(Race = `Single Race 6`,
`Race Code` = `Single Race 6 Code`)
cdc_ucd_df_2020 <- as_tibble(read_tsv(url_git_2020,
show_col_types = FALSE)
)%>%
dplyr::select(-1)%>%
rename(Race = `Single Race 6`,
`Race Code` = `Single Race 6 Code`)
cdc_ucd_df_2021 <- as_tibble(read_tsv(url_git_2021,
show_col_types = FALSE)
)%>%
dplyr::select(-1)%>%
rename(Race = `Single Race 6`,
`Race Code` = `Single Race 6 Code`)
cdc_ucd_df_2022 <- as_tibble(read_tsv(url_git_2022,
show_col_types = FALSE)
)%>%
dplyr::select(-1)%>%
rename(Race = `Single Race 6`,
`Race Code` = `Single Race 6 Code`)
cdc_ucd_df <- bind_rows(cdc_ucd_df_2018,
cdc_ucd_df_2020,
cdc_ucd_df_2021,
cdc_ucd_df_2022)
First, we can preview our dataset.
glimpse(cdc_ucd_df)
## Rows: 12,356
## Columns: 13
## $ Year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, …
## $ `Year Code` <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, …
## $ County <chr> "Albany County, NY", "Albany County, NY", "Alban…
## $ `County Code` <dbl> 36001, 36001, 36001, 36001, 36001, 36001, 36001,…
## $ Gender <chr> "Female", "Female", "Female", "Female", "Female"…
## $ `Gender Code` <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F"…
## $ Race <chr> "White", "White", "White", "White", "White", "Wh…
## $ `Race Code` <chr> "2106-3", "2106-3", "2106-3", "2106-3", "2106-3"…
## $ `Cause of death` <chr> "Septicaemia, unspecified", "Colon, unspecified …
## $ `Cause of death Code` <chr> "A41.9", "C18.9", "C25.9", "C34.9", "C50.9", "C5…
## $ Deaths <dbl> 19, 10, 24, 78, 37, 16, 25, 18, 143, 14, 73, 21,…
## $ Population <dbl> 119942, 119942, 119942, 119942, 119942, 119942, …
## $ `Crude Rate` <chr> "Unreliable", "Unreliable", "20.0", "65.0", "30.…
The dataset consists of 12,356 rows and 13 columns. Most of the variables are categorical, except for the “Deaths” column indicating the count for this type of observation.
We can take also take a look at the summary statistics for each of the numeric variables.
describe(cdc_ucd_df)
Year is notable being calculated due to its formatting, which will need addressing.Deaths, have a noteable average of 38.16 but a standard deviation of 79.76, indicating a large window of fluctuating deaths.
summary(cdc_ucd_df)
## Year Year Code County County Code
## Min. :2018 Min. :2018 Length:12356 Min. :36001
## 1st Qu.:2020 1st Qu.:2020 Class :character 1st Qu.:36039
## Median :2021 Median :2021 Mode :character Median :36061
## Mean :2020 Mean :2020 Mean :36061
## 3rd Qu.:2022 3rd Qu.:2022 3rd Qu.:36083
## Max. :2022 Max. :2022 Max. :36123
## NA's :225 NA's :225 NA's :225
## Gender Gender Code Race Race Code
## Length:12356 Length:12356 Length:12356 Length:12356
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Cause of death Cause of death Code Deaths Population
## Length:12356 Length:12356 Min. : 10.00 Min. : 4648
## Class :character Class :character 1st Qu.: 13.00 1st Qu.: 83932
## Mode :character Mode :character Median : 19.00 Median :280155
## Mean : 38.16 Mean :278184
## 3rd Qu.: 35.00 3rd Qu.:485306
## Max. :2341.00 Max. :668250
## NA's :225 NA's :225
## Crude Rate
## Length:12356
## Class :character
## Mode :character
##
##
##
##
apply(cdc_ucd_df, 2, function(x) sum(is.na(x)))
## Year Year Code County County Code
## 225 225 225 225
## Gender Gender Code Race Race Code
## 225 225 225 225
## Cause of death Cause of death Code Deaths Population
## 225 225 225 225
## Crude Rate
## 225
We can view if any variable is without NAs below
data.frame(missing = colSums(is.na(cdc_ucd_df))) |>
filter(missing == 0) |>
rownames()
## character(0)
Considering all values have 225 NA, it is important to understand how much this would impact the overall data.
plot_missing(cdc_ucd_df,
missing_only = T,
ggtheme = theme_classic(),
theme_config = list(legend.position = c("right")),
geom_label_args = list("size" = 3, "label.padding" = unit(0.1, "lines")))
VIM::aggr(cdc_ucd_df, numbers=T, sortVars=T, bars = FALSE,
cex.axis = .6)
##
## Variables sorted by number of missings:
## Variable Count
## Year 0.01820978
## Year Code 0.01820978
## County 0.01820978
## County Code 0.01820978
## Gender 0.01820978
## Gender Code 0.01820978
## Race 0.01820978
## Race Code 0.01820978
## Cause of death 0.01820978
## Cause of death Code 0.01820978
## Deaths 0.01820978
## Population 0.01820978
## Crude Rate 0.01820978
We can see that all 11 variables is missing 1.8% of values, which means the NA count of 66 observations noted from the summary is not of great concern, therefore I will actively make the decision to remove it.
cdc_ucd_df<-na.omit(cdc_ucd_df)
# kable(cdc_ucd_df$`Cause of death`, format = "html", row.names = TRUE) %>%
# kable_styling(full_width = FALSE)
We will now take a look at the distributions of the numeric variables.
DataExplorer::plot_histogram(cdc_ucd_df, nrow = 4L, ncol = 4L, ggtheme = theme_classic())
It appears none of the numeric values County Code,
Population or Deaths is normally
distributed
# Create bar plot for gender distribution
ggplot(cdc_ucd_df, aes(x = Gender, fill = Gender)) +
geom_bar() +
labs(title = "Gender Distribution", x = "Gender", y = "Frequency") +
theme_minimal() + # Change the theme to minimal
theme(legend.position = "none") + # Remove legend
scale_fill_manual(values = c("Male" = "skyblue",
"Female" = "pink")) # Custom fill colors
# Create bar plot for race distribution
ggplot(cdc_ucd_df, aes(x = str_wrap(`Race`, width = 10),
fill = `Race`)) +
geom_bar() +
labs(title = "Race Distribution", x = "Race", y = "Frequency") +
theme_minimal() + # Change the theme to minimal
theme(legend.position = "none",
axis.text.x = element_text(angle = 45,
hjust = 1)) + # Remove legend
scale_fill_manual(values = c("Asian" = "lightgreen",
"Black" = "lightblue",
"White" = "lightcoral")) # Custom fill colors
It also appears that deaths among men are higher than women, and among races, more deaths occurred for individuals classified as “white”.
# Calculate frequency of each cause of death
top_10_ca_freq<- table(cdc_ucd_df$`Cause of death`)
# Select the top 10 causes of death
top_10_ca <- names(sort(top_10_ca_freq, decreasing = TRUE))[1:10]
# Filter data to include only the top 10 causes of death
top_10_ca_data <- subset(cdc_ucd_df, `Cause of death` %in% top_10_ca)
# Create the plot with sorted values
ggplot(top_10_ca_data, aes(x = reorder(str_wrap(`Cause of death`, width = 23), -table(`Cause of death`)[`Cause of death`]), fill = `Cause of death`)) +
geom_bar() +
labs(title = "Top 10 Causes of Death", x = "Cause", y = "Frequency") +
theme_minimal() + # Change the theme to minimal
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) + # Rotate x-axis labels by 45 degrees
coord_flip()
My main concern with the data is the most prominent causes of death in NYC so above I identified the top 10 for 2018-2022, and see if my models will help identify the most prominent causes it expects. I am actually surprised that COVID-19 is not ranked #1 considering the years my dataset consists of.
# Calculate frequency of each cause of death by race
ca_freq_rc <- table(cdc_ucd_df$Race, cdc_ucd_df$`Cause of death`,
cdc_ucd_df$Gender)
# Convert the frequency table to a data frame
ca_freq_rc_df <- as.data.frame.table(ca_freq_rc)
# Rename columns
names(ca_freq_rc_df) <- c("Race", "Cause","Gender", "Frequency")
# Sort by frequency in descending order
ca_freq_rc_df <-
ca_freq_rc_df[order(ca_freq_rc_df$Frequency,
decreasing = TRUE),]
unique(cdc_ucd_df$Gender)
## [1] "Female" "Male"
# Filter to keep only the top 3 causes of death for each race
top_3_causes <- do.call(rbind,
lapply(split(ca_freq_rc_df,
list(ca_freq_rc_df$Race, ca_freq_rc_df$Gender)),
function(x) {
race_gender <- unique(x$Race)[1]
gender <- unique(x$Gender)[1]
head(x[order(-x$Frequency), ], 3)
}))
top_3_causes_m <- subset(top_3_causes, Gender == "Male")
top_3_causes_f <- subset(top_3_causes, Gender == "Female")
top_3_causes_o <- subset(top_3_causes, Gender == "NA")
# Create the plot for males
ggplot(top_3_causes_m, aes(x = Race, y = Frequency, fill = Cause)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Top 3 Causes of Death by Race (Males)", x = "Race", y = "Frequency") +
theme_minimal() +
theme(legend.position = "bottom") + # Position the legend at the bottom
scale_fill_discrete(labels = function(x) str_wrap(x, width = 10))+
# Manually wrap legend labels
coord_flip()
# Create the plot for females
ggplot(top_3_causes_f, aes(x = Race, y = Frequency, fill = Cause)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Top 3 Causes of Death by Race (Female)", x = "Race", y = "Frequency") +
theme_minimal() +
theme(legend.position = "bottom") + # Position the legend at the bottom
scale_fill_discrete(labels = function(x) str_wrap(x, width = 10))+
# Manually wrap legend labels
coord_flip()
# Create the plot for females
ggplot(top_3_causes_o, aes(x = Race, y = Frequency, fill = Cause)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Top 3 Causes of Death by Race (Niether Male or Female)", x = "Race", y = "Frequency") +
theme_minimal() +
theme(legend.position = "bottom") + # Position the legend at the bottom
scale_fill_discrete(labels = function(x) str_wrap(x, width = 10))+
# Manually wrap legend labels
coord_flip()
I looked into how much, top causes of death may vary by race and gender, and decided in my model, going into that level of granularity may not be needed at the present time. I can however explore looking to this at a future date for my own purposes.
(unreliable_count <- sum(cdc_ucd_df$`Crude Rate` == "Unreliable", na.rm = TRUE))
## [1] 6420
Crude Rate also appears to have to many Unreliable
values, approximately 6420, therefore I removed the column
altogether.
First redundant categorical data which is any variable labeled
Code. I also remove
cdc_model<-cdc_ucd_df%>%
dplyr::select(-c(`Year Code`,`County Code`, `Race Code`,`Gender Code`,`Cause of death Code`, `Crude Rate`))
From here, the multiple classifications are set with
as.factor and Gender is simply set to
character. From there we preprocess the data, and use
predict() for our model.
cdc_model <- cdc_model %>%
mutate(
County = as.factor(`County`),
Gender = as.factor(Gender),
Race = as.factor(str_trim(Race)),
`Cause of death` = as.factor(`Cause of death`),
Year = as.Date(paste0(Year, "-01-01")) # Convert Year to Date with January 1st as the date
) %>%
predict(preProcess(., method = c("center", "scale")), .)
Running predict to normalize the data made the values appear unlikely with negative decimals. For the purposes of this project I will move forward but resolving this for later interpretation may prove difficult.
SVM is a model that is ideal for high-dimensional data, so I
attempted to utilize it for this data set and use of
eps-regression is based on the fact that the annual data
can be considered continous.
# Set seed for reproducibility
set.seed(1234)
#
# # Process the data: trim whitespace and convert to factors
# cdc_model <- cdc_model %>%
# mutate(
# County = as.factor(str_trim(County)),
# Gender = as.factor(str_trim(Gender)),
# Race = as.factor(str_trim(Race)),
# `Cause of death` = as.factor(str_trim(`Cause of death`))
# )
# Remove rows with NA values
cdc_model <- na.omit(cdc_model)
# Split the data
training.samples <- cdc_model$Deaths %>%
createDataPartition(p = 0.8, list = FALSE)
train_df <- cdc_model[training.samples, ]
test_df <- cdc_model[-training.samples, ]
# Identify and remove constant variables in the training set
constant_vars <- sapply(train_df, function(x) length(unique(x)) == 1)
train_df <- train_df[, !constant_vars]
# Ensure the same columns are removed from the test set
test_df <- test_df[, colnames(train_df)]
# Fit the SVM model
svm_model <- svm(formula = Deaths ~ ., data = train_df, type = 'eps-regression')
# Print the SVM model
print(svm_model)
##
## Call:
## svm(formula = Deaths ~ ., data = train_df, type = "eps-regression")
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.00390625
## epsilon: 0.1
##
##
## Number of Support Vectors: 5171
# Predict using the SVM model
predictions_SVM <- predict(svm_model, newdata = test_df)
# Combine predictions with the test dataset
predictions_SVM <- data.frame(Predicted = predictions_SVM, test_df)
# Print predictions
# print(predictions_SVM)
Note SVM-Kernel: radial is the default
predictions_SVM <- predict(svm_model, newdata = test_df) %>%
bind_cols(test_df)
## New names:
## • `` -> `...1`
predictions_SVM$...1 <- as.numeric(predictions_SVM$...1)
MAE <- MAE(predictions_SVM$Deaths, predictions_SVM$...1)
RMSE <- RMSE(predictions_SVM$Deaths, predictions_SVM$...1)
R2 <- R2(predictions_SVM$Deaths, predictions_SVM$...1)
# Create a data frame to store the results
a_svm <- data.frame(Model = "SVM",
MAE = MAE,
RMSE = RMSE,
R2 = R2)
# Print the results
print(a_svm)
## Model MAE RMSE R2
## 1 SVM 0.4065106 1.089014 0.2689261
Not suprisingly the model did very poorly. Choosing a dataset with only 1 numeric value was a drastic change from previous data, therefore learning how to best utilize SVM or manipulate the data for accuracy is something I will look further into.
Random Forest Regression Tree is an obvious choice considering most of the data is categorical. The only challenge I foresee is creating a clear visual.
# Calculate frequency of each cause of death
top_10_ca_freq<- table(cdc_model$`Cause of death`)
# Select the top 10 causes of death
top_10_ca <- names(sort(top_10_ca_freq, decreasing = TRUE))[1:10]
# Filter data to include only the top 10 causes of death
top_10_ca_data <- subset(cdc_model, `Cause of death` %in% top_10_ca)
top_10_ca_data<-top_10_ca_data%>%
dplyr::select(-Population)
set.seed(1234)
cdc_rf_model <- top_10_ca_data
#split
training_cdc_samples <- cdc_rf_model$Deaths %>%
createDataPartition(p = 0.8, list = FALSE)
train_cdc <- cdc_rf_model[training_cdc_samples, ]
test_cdc <- cdc_rf_model[-training_cdc_samples, ]
#train using rpart, cp- complexity, smaller # = more complexity,
#method- anova is for regression
tree_cdc <- rpart(Deaths ~., data = train_cdc, cp = 0.004, method = 'anova')
#visualize
# rpart.plot(tree_cdc)
# print(tree_1k1)
# Open a PNG graphics device
png("tree_cdc.png", width = 1000, height = 600, res=300) # Adjust width and height as needed
# Plot the tree using rpart.plot
rpart.plot(tree_cdc)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
# Close the graphics device and save the plot as "tree_cdc.png"
dev.off()
## png
## 2
Because the Tree model was difficult to see I made it into a PNG. However because of there were an excessive amount of variables, most labels are still difficult to view.The PNG will be provided with my submission.
# Open a PNG graphics device with high resolution
png("tree_cdc_hd.png", width = 2040, height = 1200, res = 300) # 300 DPI
# Plot the tree using rpart.plot with custom text settings
rpart.plot(tree_cdc, extra = 101, type = 3, under = TRUE, faclen = 0, varlen = 0, snip = TRUE,
cex = .3, # Increase font size
branch.lty = 1, branch.lwd = .5, # Set branch line type and width
main = "Decision Tree for CDC Data", # Add a main title
split.cex = .5, split.box.col = "lightblue", split.border.col = "blue") # Customize split nodes
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
## Warning: ignoring snip=TRUE for png device
# Close the graphics device and save the plot as "tree_cdc_hd.png"
dev.off()
## png
## 2
Predictions
predictions_tree <- predict(tree_cdc, newdata = test_cdc) %>%
bind_cols(test_cdc )
predictions_tree$...1 <- as.numeric(predictions_tree$...1)
decision_tree_model <- data.frame(Model = "Decision Tree 1",
MAE = ModelMetrics::mae(predictions_tree$Deaths, predictions_tree$...1),
#rmse Root Mean Squared Error
RMSE = ModelMetrics::rmse(predictions_tree$Deaths, predictions_tree$...1),
#r squared
R2 = caret::R2(predictions_tree$Deaths, predictions_tree$...1)
)
decision_tree_model
The objective of my project is to utilize recent mortality data to train a model that can potentially predict its impact on specific demographics. The data was sourced from the Center for Disease Control (CDC) website, specifically from the dataset titled “2018-2022 Underlying Cause of Death by Single Race Categories.” This dataset is primarily categorical, presenting a personal challenge since my previous work throughout the semester primarily involved numerical data.
This project aims to identify underlying causes of death that may disproportionately affect different communities. Factors such as social disparity, limited access to healthy foods, excessive access to fast food, limited access to parks and outdoor activities, and availability of sports facilities by region could all be contributing factors. Although this project does not delve into these specific factors, it applies the skills learned this semester to localize these underlying causes of death by county and demographic. Visualizations created for race and gender groupings illustrate the potential of this data to inform stakeholders and decision-makers, with the goal of improving the lives of various regions and communities.
Data collection faced obstacles due to the CDC website’s maximum query output of 75,000 observations. To mitigate this challenge, the data was limited to New York State and collected in annual batches before merging. During my exploratory data analysis (EDA), I prioritized data integrity, assessing for missing values and identifying 225 NA observations. Given their negligible impact on the dataset, these observations were omitted from the models. The numerical data included deaths and population counts, which were not normalized. With an average death toll of 38 and a standard deviation of 79, there was concern about the model’s fit. Plotting the distributions confirmed that the death count was right-skewed and not normalized.
All categorical data included both coded and descriptive text formats. For this project, the descriptive text was used to enhance readability. Redundant codes were removed, categories were set as factors, and the year was treated as a date value for preprocessing.
Given that Support Vector Machine (SVM) models perform well with high-dimensional data, I employed this technique for the dataset. Based on guidance from R documentation and Stack Exchange sites, I determined that ‘eps-regression’ was the optimal parameter for the ‘type’ argument, considering the continuous nature of the annual data. However, the results were as follows:
| Model | MAE | RMSE | \(R^2\) |
|---|---|---|---|
| SVM | 0.4065 | 1.0890 | 0.2689 |
The SVM model proved to be a poor fit.
Subsequently, I utilized a random forest model, generating the image tree_cdc_hd.png to better visualize the results. The results were:
| Model | MAE | RMSE | \(R^2\) |
|---|---|---|---|
| Decision Tree 1 | 0.4034 | 0.8857 | 0.6272 |
The decision tree model demonstrated a better fit compared to the SVM model.
In this project, I deliberately retained the noted errors as learning opportunities for future endeavors. Firstly, while I believe my data selection was appropriate, I need to further investigate the transformation of numerical values. Specifically, the normalization process via the predict() function resulted in negative decimal values, posing a challenge in accurately predicting future outcomes. Additionally, using the Code definitions for categories in the Random Forest model could have enhanced the model’s visualization. However, this introduced interpretation challenges. Moving forward, I intend to continue refining this dataset to address these common interpretative issues, thereby improving the clarity and utility of my models in similar future projects.