This dataset from the CDC records the leading causes of death by year, cause, location (state), number of deaths, and age-adjusted death rate. I will analyze the data week by week, looking for patterns and answering questions that arise along the way.
My goal is to show which causes of death are most prevalent across the states and how the causes have changed over time.
Once this surface-level analysis is complete, I will peel back the layers of the data to understand the possible why and how behind the results. This may lead to further data dives to help explain the findings.
To begin, I will run a basic analysis and visualization of the vectors, then use the findings to formulate questions for further investigation.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
deaths <- read.csv('NCHS.csv')
In this dataset, the Deaths and Age.adjusted.Death.Rate columns are read in as character strings because their values contain commas. Basic R functions such as min and max would then treat the values as text rather than numbers, so filtering and summarizing do not work properly. Both columns must be converted to numeric before any meaningful insight can be extracted.
# Strip the commas, then convert both columns to numeric
deaths <- deaths |>
  mutate(
    Deaths = as.numeric(gsub(",", "", Deaths)),
    Age.adjusted.Death.Rate = as.numeric(gsub(",", "", Age.adjusted.Death.Rate))
  )
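To see why the conversion is needed, here is a minimal base R sketch on a hypothetical three-value vector (`raw` is made up for illustration and is not read from NCHS.csv). It shows that `as.numeric()` returns `NA` for comma-formatted strings and that stripping the commas first recovers the numbers.

```r
# Hypothetical comma-formatted values standing in for a raw character column
raw <- c("2,813,503", "612", "1,087.3")

# Direct conversion yields NA (with a warning) wherever a comma is present
direct <- suppressWarnings(as.numeric(raw))

# Removing the commas first gives clean numeric values
clean <- as.numeric(gsub(",", "", raw, fixed = TRUE))

print(direct)
print(clean)
```

Since readr is attached through tidyverse, `readr::parse_number()` would be an alternative that handles the grouping commas in one step.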
deaths |>
summarise(
min_year = min(Year, na.rm = TRUE),
max_year = max(Year, na.rm = TRUE),
q1_year = quantile(Year, 0.25, na.rm = TRUE),
med_year = median(Year, na.rm = TRUE),
q3_year = quantile(Year, 0.75, na.rm = TRUE)
)
## min_year max_year q1_year med_year q3_year
## 1 1999 2017 2003 2008 2013
The Year variable spans 1999 to 2017, giving 19 years of coverage. The median (2008) sits at the midpoint of that range, and the first and third quartiles (2003 and 2013) are symmetric around it, which indicates that observations are spread evenly across years rather than concentrated in particular periods. Trends over time are therefore not distorted by uneven coverage. Further investigation could explore whether key changes in outcomes align with particular periods within this range.
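The summary values above match what a perfectly balanced set of years would produce. As a rough check, the base R sketch below builds a hypothetical Year vector in which each year from 1999 to 2017 appears equally often and computes the same statistics. The repeat count of 52 * 11 is an assumption taken from the location and cause counts reported later in this analysis.

```r
# Hypothetical balanced Year vector: every year repeated the same number of times
# (52 locations * 11 causes is an assumption, not read from the file)
years <- rep(1999:2017, each = 52 * 11)

balanced_summary <- c(
  min    = min(years),
  q1     = quantile(years, 0.25, names = FALSE),
  median = median(years),
  q3     = quantile(years, 0.75, names = FALSE),
  max    = max(years)
)

print(balanced_summary)
```

Running `deaths |> count(Year)` on the real data would confirm the balance directly.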
deaths |>
summarise(
min_death = min(Deaths, na.rm = TRUE),
max_death = max(Deaths, na.rm = TRUE),
avg_death = mean(Deaths, na.rm = TRUE),
q1_death = quantile(Deaths, 0.25, na.rm = TRUE),
med_death = median(Deaths, na.rm = TRUE),
q3_death = quantile(Deaths, 0.75, na.rm = TRUE)
)
## min_death max_death avg_death q1_death med_death q3_death
## 1 21 2813503 15459.91 612 1718.5 5756.5
The Deaths variable ranges from 21 to 2,813,503, a substantial spread across observations. The mean (15,459.91) is roughly nine times the median (1,718.5) and exceeds even the third quartile (5,756.5), which indicates a strongly right-skewed distribution in which a small number of very large counts pull the average upward. Half of all observations fall between 612 and 5,756.5 deaths. The largest values most likely come from aggregate rows, since the dataset includes an "All causes" category and a "United States" location alongside the individual causes and states. Differences in state population and in cause of death therefore explain much of the variation in raw counts.
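The degree of skew can be read off the summary output itself. The sketch below re-enters the reported values by hand (it does not recompute them from the data) and derives a few simple indicators; the variable names are only illustrative.

```r
# Values copied from the summarise() output above
deaths_summary <- c(
  min = 21, q1 = 612, median = 1718.5,
  mean = 15459.91, q3 = 5756.5, max = 2813503
)

# How far the mean sits above the median
mean_to_median <- unname(deaths_summary["mean"] / deaths_summary["median"])

# A mean above the third quartile is a strong sign of right skew
mean_above_q3 <- unname(deaths_summary["mean"] > deaths_summary["q3"])

# Width of the middle 50% of observations
iqr <- unname(deaths_summary["q3"] - deaths_summary["q1"])

# How extreme the maximum is relative to the third quartile
max_to_q3 <- unname(deaths_summary["max"] / deaths_summary["q3"])

print(c(mean_to_median = mean_to_median, iqr = iqr, max_to_q3 = max_to_q3))
print(mean_above_q3)
```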
deaths |>
summarise(
min_aadr = min(Age.adjusted.Death.Rate, na.rm = TRUE),
max_aadr = max(Age.adjusted.Death.Rate, na.rm = TRUE),
avg_aadr = mean(Age.adjusted.Death.Rate, na.rm = TRUE),
q1_aadr = quantile(Age.adjusted.Death.Rate, 0.25, na.rm = TRUE),
med_aadr = median(Age.adjusted.Death.Rate, na.rm = TRUE),
q3_aadr = quantile(Age.adjusted.Death.Rate, 0.75, na.rm = TRUE)
)
## min_aadr max_aadr avg_aadr q1_aadr med_aadr q3_aadr
## 1 2.6 1087.3 127.5639 19.2 35.9 151.725
The age-adjusted death rate ranges from 2.6 to 1,087.3, a far narrower spread than the raw death counts because rates are not scaled by population size. The distribution is still right-skewed: the mean (127.56) is well above the median (35.9), and the interquartile range runs from 19.2 to 151.725. The high end is likely driven by the "All causes" rows, whose rates combine every cause. Because the rate is adjusted for population age structure, it is better suited than raw counts for comparisons across states or time periods, and the remaining differences are less likely to reflect demographic composition and more likely to reflect underlying risk factors. Further analysis could examine how these rates change over time or differ across causes.
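To illustrate what age adjustment does, here is a minimal sketch of direct standardization, in which age-specific rates are weighted by a fixed standard population. All numbers, age bands, and names below are hypothetical and chosen only for illustration; they are not the CDC's actual weights or rates.

```r
# Hypothetical age-specific death rates per 100,000, identical in both states
age_rates <- c(under_45 = 100, age_45_64 = 600, age_65_plus = 4000)

# Hypothetical age mixes (each set of shares sums to 1)
younger_state <- c(0.65, 0.25, 0.10)
older_state   <- c(0.50, 0.28, 0.22)
standard_pop  <- c(0.60, 0.26, 0.14)  # assumed standard population shares

# Crude rates weight by each state's own age mix, so they differ
crude_younger <- sum(age_rates * younger_state)
crude_older   <- sum(age_rates * older_state)

# Adjusted rates weight by the same standard mix, so equal risk gives equal rates
adjusted_younger <- sum(age_rates * standard_pop)
adjusted_older   <- sum(age_rates * standard_pop)

print(c(crude_younger = crude_younger, crude_older = crude_older))
print(c(adjusted_younger = adjusted_younger, adjusted_older = adjusted_older))
```

In this sketch the older state has a higher crude rate even though the risk within each age band is identical, which is the distortion that age adjustment removes.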
deaths |>
count(X113.Cause.Name)
## X113.Cause.Name n
## 1 Accidents (unintentional injuries) (V01-X59,Y85-Y86) 988
## 2 All Causes 988
## 3 Alzheimer's disease (G30) 988
## 4 Cerebrovascular diseases (I60-I69) 988
## 5 Chronic lower respiratory diseases (J40-J47) 988
## 6 Diabetes mellitus (E10-E14) 988
## 7 Diseases of heart (I00-I09,I11,I13,I20-I51) 988
## 8 Influenza and pneumonia (J09-J18) 988
## 9 Intentional self-harm (suicide) (*U03,X60-X84,Y87.0) 988
## 10 Malignant neoplasms (C00-C97) 988
## 11 Nephritis, nephrotic syndrome and nephrosis (N00-N07,N17-N19,N25-N27) 988
deaths |>
count(Cause.Name)
## Cause.Name n
## 1 All causes 988
## 2 Alzheimer's disease 988
## 3 CLRD 988
## 4 Cancer 988
## 5 Diabetes 988
## 6 Heart disease 988
## 7 Influenza and pneumonia 988
## 8 Kidney disease 988
## 9 Stroke 988
## 10 Suicide 988
## 11 Unintentional injuries 988
deaths |>
count(State)
## State n
## 1 Alabama 209
## 2 Alaska 209
## 3 Arizona 209
## 4 Arkansas 209
## 5 California 209
## 6 Colorado 209
## 7 Connecticut 209
## 8 Delaware 209
## 9 District of Columbia 209
## 10 Florida 209
## 11 Georgia 209
## 12 Hawaii 209
## 13 Idaho 209
## 14 Illinois 209
## 15 Indiana 209
## 16 Iowa 209
## 17 Kansas 209
## 18 Kentucky 209
## 19 Louisiana 209
## 20 Maine 209
## 21 Maryland 209
## 22 Massachusetts 209
## 23 Michigan 209
## 24 Minnesota 209
## 25 Mississippi 209
## 26 Missouri 209
## 27 Montana 209
## 28 Nebraska 209
## 29 Nevada 209
## 30 New Hampshire 209
## 31 New Jersey 209
## 32 New Mexico 209
## 33 New York 209
## 34 North Carolina 209
## 35 North Dakota 209
## 36 Ohio 209
## 37 Oklahoma 209
## 38 Oregon 209
## 39 Pennsylvania 209
## 40 Rhode Island 209
## 41 South Carolina 209
## 42 South Dakota 209
## 43 Tennessee 209
## 44 Texas 209
## 45 United States 209
## 46 Utah 209
## 47 Vermont 209
## 48 Virginia 209
## 49 Washington 209
## 50 West Virginia 209
## 51 Wisconsin 209
## 52 Wyoming 209
Looking at the two cause-name vectors, it struck me as odd at first that every unique value occurs the same number of times (988). On closer examination, the reason is that the dataset covers 19 years and 52 locations (the 50 states plus the District of Columbia and a United States aggregate), and 19 * 52 = 988.
The same question arises for State, where every unique value occurs 209 times. This is because there are 11 cause categories (10 specific causes plus "All causes") and 19 years, and 11 * 19 = 209.
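The arithmetic can be checked with a small base R sketch that builds a hypothetical full grid of year, location, and cause combinations. The grid uses integer placeholders rather than real names, and it assumes one row per combination with none missing.

```r
years       <- 1999:2017  # 19 years
n_locations <- 52         # 50 states + District of Columbia + "United States"
n_causes    <- 11         # 10 specific causes + "All causes"

# One row per year/location/cause combination (placeholder IDs, not real names)
grid <- expand.grid(
  Year     = years,
  Location = seq_len(n_locations),
  Cause    = seq_len(n_causes)
)

rows_per_cause    <- unique(as.vector(table(grid$Cause)))
rows_per_location <- unique(as.vector(table(grid$Location)))
total_rows        <- nrow(grid)

print(c(rows_per_cause = rows_per_cause,
        rows_per_location = rows_per_location,
        total_rows = total_rows))
```

If `nrow(deaths)` equals the grid's row count, the real dataset has no missing combinations.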
deaths |>
filter(Cause.Name != "All causes", State != "United States") |> # drop the national aggregate row so state deaths are not counted twice
group_by(Cause.Name) |>
summarise(total_deaths = sum(Deaths, na.rm = TRUE)) |>
ggplot(aes(
x = total_deaths,
y = reorder(Cause.Name, total_deaths)
)) +
geom_col(fill = "steelblue") +
scale_x_continuous(
labels = label_number(scale = 1e-6, suffix = "M")
) +
labs(
title = "Total Deaths by Cause",
x = "Total Deaths (Millions)",
y = "Cause of Death"
) +
theme_minimal()
This visualization summarizes total deaths by cause across all 19 years, highlighting which causes contribute the largest share of cumulative mortality. Because every cause has the same number of rows (988), the ordering reflects how many deaths are attributed to each cause, not differences in reporting coverage. It does not measure relative risk at the individual level, since raw counts depend on population size and age structure. The chart is therefore best read as a descriptive overview of how mortality is distributed across causes, rather than as a comparison of severity or likelihood. Further analysis using age-adjusted rates or year-specific trends would provide more meaningful insight into relative risk.
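One caution applies to any total summed across State: the column contains a "United States" row alongside the individual states. The toy example below uses made-up numbers and assumes the national row equals the sum of the state rows (which this sketch does not verify against the real data). It shows how keeping both doubles the total.

```r
# Hypothetical two-state data where the national row is the sum of the states
toy <- data.frame(
  State      = c("Texas", "Vermont", "United States",
                 "Texas", "Vermont", "United States"),
  Cause.Name = c("Cancer", "Cancer", "Cancer",
                 "Stroke", "Stroke", "Stroke"),
  Deaths     = c(220, 30, 250, 90, 10, 100),
  stringsAsFactors = FALSE
)

# Summing every row counts each death twice
with_aggregate <- tapply(toy$Deaths, toy$Cause.Name, sum)

# Dropping the national row gives the intended totals
states_only <- subset(toy, State != "United States")
without_aggregate <- tapply(states_only$Deaths, states_only$Cause.Name, sum)

print(with_aggregate)
print(without_aggregate)
```

The ordering of causes is the same either way, but the axis scale is only correct when one level of aggregation is used.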
deaths |>
filter(Cause.Name == "All causes", State != "United States") |> # drop the national aggregate row so each year's total is not doubled
group_by(Year) |>
summarise(total_deaths = sum(Deaths, na.rm = TRUE)) |>
ggplot(aes(
x = Year,
y = total_deaths
)) +
geom_line(color = "steelblue", linewidth = 1) +
scale_x_continuous(
breaks = seq(min(deaths$Year), max(deaths$Year), by = 2)
) +
scale_y_continuous(
limits = c(0, NA),
breaks = seq(0, 6e6, by = 1e6),
labels = label_number(scale = 1e-6, suffix = "M")
) +
labs(
title = "Total Deaths per Year (All Causes)",
x = "Year",
y = "Total Deaths (Millions)"
) +
theme_minimal()
This visualization shows total deaths per year across all causes, providing a high-level view of how overall mortality changes over time in the dataset. Because the values are raw totals, the trend may reflect population growth as much as any change in individual risk, so an increase over time is not by itself evidence of worsening health outcomes. This view is most useful for identifying broad temporal patterns and for giving context to more detailed analyses, such as cause-specific or population-adjusted death rates, which are better suited for comparison.
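A small sketch shows how raw totals can rise while the underlying rate stays flat. The figures are hypothetical, and the `population` column is an assumption for illustration only, since the dataset itself has no population column.

```r
# Hypothetical national figures for two years (not taken from the dataset)
trend <- data.frame(
  Year       = c(2000, 2010),
  deaths     = c(2400000, 2640000),
  population = c(280e6, 308e6)
)

# Crude death rate per 100,000 people
trend$crude_rate <- trend$deaths / trend$population * 1e5

# Percent change in raw deaths between the two years
deaths_pct_change <- (trend$deaths[2] - trend$deaths[1]) / trend$deaths[1] * 100

print(trend)
print(deaths_pct_change)
```

Here deaths grow because the population grows by the same proportion, so the crude rate does not move.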
deaths |>
filter(
Cause.Name == "All causes",
State != "United States"
) |>
group_by(State) |>
summarise(total_deaths = sum(Deaths, na.rm = TRUE)) |>
slice_max(total_deaths, n = 10) |>
ggplot(aes(
x = total_deaths,
y = reorder(State, total_deaths)
)) +
geom_col(fill = "steelblue") +
scale_x_continuous(
limits = c(0, NA),
labels = label_number(scale = 1e-6, suffix = "M")
) +
labs(
title = "Top 10 States by Total Deaths (All Years, All Causes)",
x = "Total Deaths (Millions)",
y = "State"
) +
theme_minimal()
The states in the top 10 by total deaths largely reflect differences in population size rather than underlying risk. Every state has the same number of rows (209) covering the same 19 years, so the ranking is not an artifact of reporting coverage. States with larger populations simply record more deaths, and raw counts do not normalize for population or demographic structure. These rankings should therefore be read as descriptive of cumulative totals within the dataset rather than as an indicator of relative severity or individual risk. This highlights the importance of complementing total counts with population-adjusted measures, such as age-adjusted death rates, for more meaningful comparisons.
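The difference between the two kinds of ranking can be sketched with a toy table. The state labels and all values below are hypothetical placeholders, not figures from the dataset.

```r
# Hypothetical states with made-up totals and age-adjusted rates
states <- data.frame(
  State             = c("A", "B", "C"),
  total_deaths      = c(500000, 120000, 30000),
  age_adjusted_rate = c(650, 900, 720),
  stringsAsFactors  = FALSE
)

# Ranking by raw count follows population size
rank_by_count <- states$State[order(-states$total_deaths)]

# Ranking by age-adjusted rate can look quite different
rank_by_rate <- states$State[order(-states$age_adjusted_rate)]

print(rank_by_count)
print(rank_by_rate)
```

On the real data, a comparable ranking could be built from the Age.adjusted.Death.Rate column for the "All causes" rows, averaged or taken for a single year.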
This first data dive was an opportunity to understand the dataset and identify which vectors needed cleaning before analysis. So far, the minimums, maximums, averages, and quantiles of each numerical vector have been examined, and the categorical vectors have been checked for their unique values and how often each appears.
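Visualizations were created to answer the basic analytical questions behind the three charts above: which causes account for the most total deaths, how total deaths change from year to year, and which 10 states have the most total deaths.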
These findings lay the foundation for further examination of the dataset in the upcoming weeks. Possible questions from this introductory dive include:
The next analysis may focus on these questions and others as deeper dives into the data occur.