This dataset from the CDC lists the leading causes of death by year, cause, location (state), number of deaths, and age-adjusted death rate. I will analyze the data week by week, looking for patterns and answering questions that arise along the way.
My goal is to show which causes of death are most prevalent across the states and how the causes have changed over time.
After this surface-level analysis is complete, I will begin to peel back the layers of data to understand the possible why and how behind the results. This may lead to further data diving to help explain the findings.
To begin, I will run a basic analysis and visualization of each vector, then use the findings to formulate questions for further investigation of the data.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
deaths <- read.csv('NCHS.csv')
In this dataset, the Deaths and Age Adjusted Death Rate columns are read in as character strings because their values contain commas as thousands separators. As a result, basic R functions such as min and max cannot be applied to them properly. Both columns need to be converted to a numeric format before meaningful insights can be extracted.
# Strip the thousands separators, then convert both columns to numeric
deaths <- deaths |>
  mutate(
    Deaths = as.numeric(gsub(",", "", Deaths)),
    Age.adjusted.Death.Rate = as.numeric(gsub(",", "", Age.adjusted.Death.Rate))
  )
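To illustrate why the conversion is needed, here is a minimal sketch in base R. The values in `raw_deaths` are made up for illustration and only mimic the comma formatting of the raw columns.

```r
# Made-up values formatted like the raw columns (commas as thousands separators).
raw_deaths <- c("2,813,503", "612", "1,718")

# Direct coercion turns every value that contains a comma into NA.
direct <- suppressWarnings(as.numeric(raw_deaths))

# Removing the commas first yields usable numbers.
cleaned <- as.numeric(gsub(",", "", raw_deaths, fixed = TRUE))

print(direct)
print(cleaned)
print(max(cleaned))
```

Since the tidyverse is already loaded, `readr::parse_number()` would likely be a reasonable alternative, as its default locale treats the comma as a grouping mark.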
deaths |>
  summarise(
    min_year = min(Year, na.rm = TRUE),
    max_year = max(Year, na.rm = TRUE),
    q1_year = quantile(Year, 0.25, na.rm = TRUE),
    med_year = median(Year, na.rm = TRUE),
    q3_year = quantile(Year, 0.75, na.rm = TRUE)
  )
## min_year max_year q1_year med_year q3_year
## 1 1999 2017 2003 2008 2013
deaths |>
  summarise(
    min_death = min(Deaths, na.rm = TRUE),
    max_death = max(Deaths, na.rm = TRUE),
    avg_death = mean(Deaths, na.rm = TRUE),
    q1_death = quantile(Deaths, 0.25, na.rm = TRUE),
    med_death = median(Deaths, na.rm = TRUE),
    q3_death = quantile(Deaths, 0.75, na.rm = TRUE)
  )
## min_death max_death avg_death q1_death med_death q3_death
## 1 21 2813503 15459.91 612 1718.5 5756.5
deaths |>
  summarise(
    min_aadr = min(Age.adjusted.Death.Rate, na.rm = TRUE),
    max_aadr = max(Age.adjusted.Death.Rate, na.rm = TRUE),
    avg_aadr = mean(Age.adjusted.Death.Rate, na.rm = TRUE),
    q1_aadr = quantile(Age.adjusted.Death.Rate, 0.25, na.rm = TRUE),
    med_aadr = median(Age.adjusted.Death.Rate, na.rm = TRUE),
    q3_aadr = quantile(Age.adjusted.Death.Rate, 0.75, na.rm = TRUE)
  )
## min_aadr max_aadr avg_aadr q1_aadr med_aadr q3_aadr
## 1 2.6 1087.3 127.5639 19.2 35.9 151.725
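The three summaries above follow the same pattern, so they could be folded into one reusable function. The sketch below is only one possible approach: `five_number_summary` is a hypothetical helper name, and `toy` holds made-up values rather than rows from the dataset.

```r
# Hypothetical helper: min, quartiles, median, mean, and max of one numeric vector.
five_number_summary <- function(x) {
  data.frame(
    min    = min(x, na.rm = TRUE),
    q1     = unname(quantile(x, 0.25, na.rm = TRUE)),
    median = median(x, na.rm = TRUE),
    mean   = mean(x, na.rm = TRUE),
    q3     = unname(quantile(x, 0.75, na.rm = TRUE)),
    max    = max(x, na.rm = TRUE)
  )
}

# Made-up data with one missing value to show that NAs are ignored.
toy <- data.frame(
  Deaths = c(10, 20, 30, 40, NA),
  Rate   = c(1.5, 2.5, 3.5, 4.5, 5.5)
)

# Apply the helper to every column and stack the results, one row per column.
result <- do.call(rbind, lapply(toy, five_number_summary))
print(result)
```

With the real data, the same call could be applied to `deaths[c("Year", "Deaths", "Age.adjusted.Death.Rate")]`.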
deaths |>
  count(X113.Cause.Name)
## X113.Cause.Name n
## 1 Accidents (unintentional injuries) (V01-X59,Y85-Y86) 988
## 2 All Causes 988
## 3 Alzheimer's disease (G30) 988
## 4 Cerebrovascular diseases (I60-I69) 988
## 5 Chronic lower respiratory diseases (J40-J47) 988
## 6 Diabetes mellitus (E10-E14) 988
## 7 Diseases of heart (I00-I09,I11,I13,I20-I51) 988
## 8 Influenza and pneumonia (J09-J18) 988
## 9 Intentional self-harm (suicide) (*U03,X60-X84,Y87.0) 988
## 10 Malignant neoplasms (C00-C97) 988
## 11 Nephritis, nephrotic syndrome and nephrosis (N00-N07,N17-N19,N25-N27) 988
deaths |>
  count(Cause.Name)
## Cause.Name n
## 1 All causes 988
## 2 Alzheimer's disease 988
## 3 CLRD 988
## 4 Cancer 988
## 5 Diabetes 988
## 6 Heart disease 988
## 7 Influenza and pneumonia 988
## 8 Kidney disease 988
## 9 Stroke 988
## 10 Suicide 988
## 11 Unintentional injuries 988
deaths |>
  count(State)
## State n
## 1 Alabama 209
## 2 Alaska 209
## 3 Arizona 209
## 4 Arkansas 209
## 5 California 209
## 6 Colorado 209
## 7 Connecticut 209
## 8 Delaware 209
## 9 District of Columbia 209
## 10 Florida 209
## 11 Georgia 209
## 12 Hawaii 209
## 13 Idaho 209
## 14 Illinois 209
## 15 Indiana 209
## 16 Iowa 209
## 17 Kansas 209
## 18 Kentucky 209
## 19 Louisiana 209
## 20 Maine 209
## 21 Maryland 209
## 22 Massachusetts 209
## 23 Michigan 209
## 24 Minnesota 209
## 25 Mississippi 209
## 26 Missouri 209
## 27 Montana 209
## 28 Nebraska 209
## 29 Nevada 209
## 30 New Hampshire 209
## 31 New Jersey 209
## 32 New Mexico 209
## 33 New York 209
## 34 North Carolina 209
## 35 North Dakota 209
## 36 Ohio 209
## 37 Oklahoma 209
## 38 Oregon 209
## 39 Pennsylvania 209
## 40 Rhode Island 209
## 41 South Carolina 209
## 42 South Dakota 209
## 43 Tennessee 209
## 44 Texas 209
## 45 United States 209
## 46 Utah 209
## 47 Vermont 209
## 48 Virginia 209
## 49 Washington 209
## 50 West Virginia 209
## 51 Wisconsin 209
## 52 Wyoming 209
When examining the two cause-name vectors, it initially struck me as odd that every unique value occurs the same number of times (988). On closer inspection, this is because the dataset covers 19 years and 52 locations (the 50 states plus the District of Columbia and a national "United States" entry), and 19 * 52 = 988.
The same question may arise for the State vector, where every unique value occurs 209 times. This is because there are 11 cause categories (10 specific causes plus "All causes") and 19 years, and 11 * 19 = 209.
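The arithmetic above can be checked with a small sketch. The `panel` data frame below is a hypothetical stand-in built with placeholder labels, assuming (as the counts suggest) that every year, location, and cause combination appears exactly once.

```r
# Hypothetical stand-in: one row per year x location x cause combination.
panel <- expand.grid(
  Year = 1999:2017,                    # 19 years
  State = paste("Location", 1:52),     # 50 states + DC + "United States"
  Cause.Name = paste("Cause", 1:11),   # 10 specific causes + "All causes"
  stringsAsFactors = FALSE
)

n_years  <- length(unique(panel$Year))
n_states <- length(unique(panel$State))
n_causes <- length(unique(panel$Cause.Name))

# Rows per cause should equal years x locations; rows per location, years x causes.
rows_per_cause <- table(panel$Cause.Name)
rows_per_state <- table(panel$State)

print(unique(as.vector(rows_per_cause)))  # 988
print(unique(as.vector(rows_per_state)))  # 209
print(nrow(panel))                        # 10868
```

Running the same `table()` checks on the real `deaths` data frame would confirm whether the dataset is in fact a complete grid.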
deaths |>
  # Keep only the national rows so deaths are not counted once per state and again nationally
  filter(
    Cause.Name != "All causes",
    State == "United States"
  ) |>
  group_by(Cause.Name) |>
  summarise(total_deaths = sum(Deaths, na.rm = TRUE)) |>
  ggplot(aes(
    x = total_deaths,
    y = reorder(Cause.Name, total_deaths)
  )) +
  geom_col(fill = "steelblue") +
  scale_x_continuous(
    labels = label_number(scale = 1e-6, suffix = "M")
  ) +
  labs(
    title = "Total Deaths by Cause",
    x = "Total Deaths (Millions)",
    y = "Cause of Death"
  ) +
  theme_minimal()
This is an introductory look at the Cause.Name vector to see which cause accounts for the most deaths cumulatively over the years in the dataset. According to this analysis, heart disease and cancer were the two leading causes of death in the United States from 1999 to 2017. Further analysis of the other vectors will help build a bird's-eye view of the data as a foundation for subsequent investigation.
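One point worth keeping in mind when summing Deaths across rows is that the State vector contains a national "United States" entry alongside the individual states. The sketch below uses made-up counts and assumes the national row equals the sum of the state rows, to show how a sum over all rows would count each death twice.

```r
# Made-up counts for illustration: two placeholder states plus the national row.
toy <- data.frame(
  State      = c("Alpha", "Beta", "United States",
                 "Alpha", "Beta", "United States"),
  Cause.Name = c("Heart disease", "Heart disease", "Heart disease",
                 "Cancer", "Cancer", "Cancer"),
  Deaths     = c(100, 150, 250, 90, 120, 210)
)

# Summing every row counts each death once in its state and once nationally.
all_rows <- aggregate(Deaths ~ Cause.Name, data = toy, FUN = sum)

# Restricting to the national row gives the actual total per cause.
national <- aggregate(
  Deaths ~ Cause.Name,
  data = toy[toy$State == "United States", ],
  FUN = sum
)

print(all_rows)
print(national)
```

The ranking of causes is the same either way in this toy example, but the totals differ by a factor of two.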
deaths |>
  # Keep only the national rows so each death is counted once per year
  filter(
    Cause.Name == "All causes",
    State == "United States"
  ) |>
  group_by(Year) |>
  summarise(total_deaths = sum(Deaths, na.rm = TRUE)) |>
  ggplot(aes(
    x = Year,
    y = total_deaths
  )) +
  geom_line(color = "steelblue", linewidth = 1) +
  scale_x_continuous(
    breaks = seq(min(deaths$Year), max(deaths$Year), by = 2)
  ) +
  scale_y_continuous(
    limits = c(0, NA),
    breaks = seq(0, 6e6, by = 1e6),
    labels = label_number(scale = 1e-6, suffix = "M")
  ) +
  labs(
    title = "Total Deaths per Year (All Causes)",
    x = "Year",
    y = "Total Deaths (Millions)"
  ) +
  theme_minimal()
This visualization was built to show whether any trend exists in the total number of deaths from all causes in the United States across the years in the dataset. In this introductory analysis, total deaths generally increased over the years, with a few exceptions that may require further analysis.
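The exceptions to the upward trend could be located by computing the year-over-year change and flagging the years where it is negative. This is a minimal sketch with made-up totals, not values from the dataset.

```r
# Made-up yearly totals for illustration only.
yearly <- data.frame(
  Year = 2001:2006,
  total_deaths = c(100, 104, 103, 108, 112, 110)
)

# Change from the previous year; the first year has no predecessor.
yearly$change <- c(NA, diff(yearly$total_deaths))

# Years in which the total fell compared with the year before.
decline_years <- yearly$Year[!is.na(yearly$change) & yearly$change < 0]

print(yearly)
print(decline_years)
```

Applied to the real yearly totals, `decline_years` would list the years that merit a closer look.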
deaths |>
  filter(
    Cause.Name == "All causes",
    State != "United States"
  ) |>
  group_by(State) |>
  summarise(total_deaths = sum(Deaths, na.rm = TRUE)) |>
  slice_max(total_deaths, n = 10) |>
  ggplot(aes(
    x = total_deaths,
    y = reorder(State, total_deaths)
  )) +
  geom_col(fill = "steelblue") +
  scale_x_continuous(
    limits = c(0, NA),
    labels = label_number(scale = 1e-6, suffix = "M")
  ) +
  labs(
    title = "Top 10 States by Total Deaths (All Years, All Causes)",
    x = "Total Deaths (Millions)",
    y = "State"
  ) +
  theme_minimal()
This visualization of total deaths from all causes across all years, limited to the top 10 states, is an introductory look at which states have the most deaths. Further investigation is needed to see whether underlying factors skew these results or explain them in ways that go beyond the data in this particular dataset.
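One factor that could skew raw counts is population size, since states with more residents will generally record more deaths. The Age.adjusted.Death.Rate column offers one possible check within the dataset itself. The sketch below uses made-up values and placeholder state names to show how a ranking by count can differ from a ranking by rate.

```r
# Made-up values: a populous state can lead on raw counts but not on rate.
toy <- data.frame(
  State = c("Large", "Medium", "Small"),
  Deaths = c(5000, 2000, 400),
  Age.adjusted.Death.Rate = c(650.0, 700.0, 900.0)
)

# Rank states from highest to lowest on each measure.
by_count <- toy$State[order(-toy$Deaths)]
by_rate  <- toy$State[order(-toy$Age.adjusted.Death.Rate)]

print(by_count)
print(by_rate)
```

Whether the real data shows a similar reversal is an open question for later analysis.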
This first data dive provided an opportunity to understand the dataset and identify which vectors needed to be cleaned before analysis. At this point, the minimums, maximums, averages, and quantiles of each numerical vector have been analyzed. Furthermore, the categorical vectors have been analyzed to see which unique values exist and how many times each appears in the dataset.
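Visualizations were created to answer three basic analytical questions: which causes account for the most total deaths, how total deaths per year have changed, and which states have the most total deaths.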
These findings lay the foundation for further examination of the dataset over the upcoming weeks. Possible questions from this introductory dive include:
The next analysis may focus on these questions and others as deeper dives into the data occur.