First Data Dive

Data Synopsis

This dataset comes from the CDC and records the leading causes of death by year, cause, location (state), number of deaths, and age-adjusted death rate. I will analyze the data week by week, looking for patterns and answering questions that arise along the way.

Data Analysis Goals

Through this analysis, my goal is to show which causes of death are most prevalent across the states and how the causes have changed over time.

After this surface-level analysis is complete, I will begin to peel back the layers of data to understand the possible why and how behind the results. This may lead to further data diving to help explain the findings.

To begin, I will complete a basic analysis and visualization of each vector, then formulate questions from the findings to answer through further investigation of the data.

Loading the Libraries

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor

Reading the Data

deaths <- read.csv('NCHS.csv')

Cleaning the Data

In this dataset, the Deaths and Age Adjusted Death Rate columns are read in as character strings because their values contain commas. As a result, basic R functions such as min and max do not work properly on them. Both columns must be converted to a numeric format before meaningful insights can be extracted.

# Strip the commas, then convert both columns to numeric
deaths <- deaths |>
  mutate(
    Deaths = as.numeric(gsub(",", "", Deaths)),
    Age.adjusted.Death.Rate = as.numeric(gsub(",", "", Age.adjusted.Death.Rate))
  )
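As a quick check on this kind of conversion, the sketch below applies the same comma-stripping step to a few made-up strings and lists any values that fail to convert. The sample values and the names `raw`, `cleaned`, and `failed` are illustrative assumptions, not taken from NCHS.csv.

```r
# Hypothetical raw values, as they might look after read.csv()
raw <- c("2,813,503", "612", "1,087.3", "n/a")

# Same cleaning step: drop the commas, then convert to numeric.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
cleaned <- suppressWarnings(as.numeric(gsub(",", "", raw, fixed = TRUE)))

# Values that were present in the raw data but could not be converted
failed <- raw[is.na(cleaned) & !is.na(raw)]

print(cleaned)
print(failed)
```

If `failed` is empty on the real columns, every non-missing string was converted and no values were silently turned into NA.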

Min/Max/Mean/Median/Quantile Summaries

Year Vector Analysis

deaths |>
  summarise(
    min_year = min(Year, na.rm = TRUE),
    max_year = max(Year, na.rm = TRUE),
    q1_year = quantile(Year, 0.25, na.rm = TRUE),
    med_year = median(Year, na.rm = TRUE),
    q3_year = quantile(Year, 0.75, na.rm = TRUE)
  )
##   min_year max_year q1_year med_year q3_year
## 1     1999     2017    2003     2008    2013

Summary and Insight:

The Year variable runs from 1999 to 2017, giving the dataset 19 years of coverage. The quartiles (2003, 2008, and 2013) fall at roughly even intervals across that range, which is what an even spread of observations across years would produce, rather than a concentration in any particular period. This matters when interpreting trends: patterns over time are unlikely to be artifacts of some years being more heavily represented than others. Further investigation could explore whether key changes in outcomes align with particular periods within this range.
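One way to judge how evenly the years are represented is to compare the reported quartiles with those of a perfectly balanced layout. The sketch below assumes every year from 1999 to 2017 has the same number of rows (11 cause categories times 52 locations, matching the counts reported later in this dive). The names `rows_per_year`, `years`, and `year_quartiles` are illustrative.

```r
# Assumption: each year has 11 cause categories x 52 locations = 572 rows
rows_per_year <- 11 * 52
years <- rep(1999:2017, each = rows_per_year)

# Quartiles of a perfectly balanced Year column
year_quartiles <- quantile(years, c(0.25, 0.5, 0.75))
print(year_quartiles)
```

If the balanced layout gives the same quartiles as the real Year column, that is consistent with the observations being spread evenly across years.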

Deaths Vector Analysis

deaths |>
  summarise(
    min_death = min(Deaths, na.rm = TRUE),
    max_death = max(Deaths, na.rm = TRUE),
    avg_death = mean(Deaths, na.rm = TRUE),
    q1_death = quantile(Deaths, 0.25, na.rm = TRUE),
    med_death = median(Deaths, na.rm = TRUE),
    q3_death = quantile(Deaths, 0.75, na.rm = TRUE)
  )
##   min_death max_death avg_death q1_death med_death q3_death
## 1        21   2813503  15459.91      612    1718.5   5756.5

Summary and Insight:

The Deaths variable ranges from 21 to 2,813,503, indicating substantial variation in the number of deaths across observations. The mean (about 15,460) is roughly nine times the median (1,718.5), which points to a strongly right-skewed distribution in which a small number of very large counts pull the average upward. The middle half of the observations falls between 612 and 5,756.5 deaths. The largest values are likely aggregate rows, since the dataset includes a United States location and an All causes category alongside the individual states and causes. Those rows should be separated out before comparing states or causes, and contextual factors such as time and population size are likely to explain much of the remaining variation.
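To illustrate how aggregate rows can distort summary statistics, the sketch below compares the mean and median of a small made-up table with and without its "United States" and "All causes" rows. All counts are hypothetical, and `toy`, `is_aggregate`, and `detail` are illustrative names.

```r
# Hypothetical counts: 3 states plus a national row, 2 causes plus "All causes"
toy <- data.frame(
  State      = rep(c("Alabama", "Alaska", "Arizona", "United States"), each = 3),
  Cause.Name = rep(c("All causes", "Cancer", "Stroke"), times = 4),
  Deaths     = c(50000, 10000, 3000,
                 4000, 1000, 200,
                 55000, 11000, 2500,
                 109000, 22000, 5700)
)

# Flag rows that are totals over states or over causes
is_aggregate <- toy$State == "United States" | toy$Cause.Name == "All causes"
detail <- toy[!is_aggregate, ]

summary_all    <- c(mean = mean(toy$Deaths), median = median(toy$Deaths))
summary_detail <- c(mean = mean(detail$Deaths), median = median(detail$Deaths))

print(summary_all)
print(summary_detail)
```

In this toy table the aggregate rows raise both the mean and the median, so the same comparison on the real data would show how much of the skew comes from totals rather than from individual state and cause combinations.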

Age Adjusted Death Rate Vector Analysis

deaths |>
  summarise(
    min_aadr = min(Age.adjusted.Death.Rate, na.rm = TRUE),
    max_aadr = max(Age.adjusted.Death.Rate, na.rm = TRUE),
    avg_aadr = mean(Age.adjusted.Death.Rate, na.rm = TRUE),
    q1_aadr = quantile(Age.adjusted.Death.Rate, 0.25, na.rm = TRUE),
    med_aadr = median(Age.adjusted.Death.Rate, na.rm = TRUE),
    q3_aadr = quantile(Age.adjusted.Death.Rate, 0.75, na.rm = TRUE)
  )
##   min_aadr max_aadr avg_aadr q1_aadr med_aadr q3_aadr
## 1      2.6   1087.3 127.5639    19.2     35.9 151.725

Summary and Insight:

Age-adjusted death rates range from 2.6 to 1,087.3, a far narrower spread than the raw death counts because a rate does not scale with population size. The distribution is still right-skewed: the mean (about 127.6) is well above the median (35.9), and the interquartile range runs from 19.2 to about 151.7. Because the rate adjusts for population age structure, it is better suited than raw counts for comparisons across states or time periods, and observed differences are less driven by demographic composition and more likely related to underlying risk factors. The highest values likely come from the All causes rows, so cause-specific comparisons should exclude them. Further analysis could examine how these rates change over time or differ across causes.
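The dataset does not document how its rates were calculated, so the sketch below only illustrates the general idea of direct age standardization with made-up numbers. Two hypothetical populations share the same age-specific death rates but have different age structures. The age groups, populations, and standard weights are all assumptions chosen for illustration.

```r
# Hypothetical age-specific death rates per 100,000 (same in both populations)
rate <- c(young = 100, old = 1000)

# Hypothetical populations with different age structures
pop_a <- c(young = 900000, old = 100000)
pop_b <- c(young = 500000, old = 500000)

# Hypothetical standard population weights (must sum to 1)
std_weights <- c(young = 0.8, old = 0.2)

# Deaths implied by the rates and populations
deaths_a <- rate * pop_a / 1e5
deaths_b <- rate * pop_b / 1e5

# Crude rate: total deaths over total population, per 100,000
crude_rate <- function(deaths, pop) sum(deaths) / sum(pop) * 1e5

# Directly standardized rate: weighted sum of age-specific rates
adjusted_rate <- function(deaths, pop, w) sum(w * deaths / pop * 1e5)

crude_a <- crude_rate(deaths_a, pop_a)
crude_b <- crude_rate(deaths_b, pop_b)
adj_a <- adjusted_rate(deaths_a, pop_a, std_weights)
adj_b <- adjusted_rate(deaths_b, pop_b, std_weights)

print(c(crude_a = crude_a, crude_b = crude_b, adj_a = adj_a, adj_b = adj_b))
```

The crude rates differ only because population B is older, while the adjusted rates agree, which is why an age-adjusted rate is the fairer basis for comparing groups.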

Categorical Analysis

X113 Cause Name Vector Analysis

deaths |>
  count(X113.Cause.Name)
##                                                          X113.Cause.Name   n
## 1                   Accidents (unintentional injuries) (V01-X59,Y85-Y86) 988
## 2                                                             All Causes 988
## 3                                              Alzheimer's disease (G30) 988
## 4                                     Cerebrovascular diseases (I60-I69) 988
## 5                           Chronic lower respiratory diseases (J40-J47) 988
## 6                                            Diabetes mellitus (E10-E14) 988
## 7                            Diseases of heart (I00-I09,I11,I13,I20-I51) 988
## 8                                      Influenza and pneumonia (J09-J18) 988
## 9                   Intentional self-harm (suicide) (*U03,X60-X84,Y87.0) 988
## 10                                         Malignant neoplasms (C00-C97) 988
## 11 Nephritis, nephrotic syndrome and nephrosis (N00-N07,N17-N19,N25-N27) 988

Cause Name Vector Analysis

deaths |>
  count(Cause.Name)
##                 Cause.Name   n
## 1               All causes 988
## 2      Alzheimer's disease 988
## 3                     CLRD 988
## 4                   Cancer 988
## 5                 Diabetes 988
## 6            Heart disease 988
## 7  Influenza and pneumonia 988
## 8           Kidney disease 988
## 9                   Stroke 988
## 10                 Suicide 988
## 11  Unintentional injuries 988

State Vector Analysis

deaths |>
  count(State)
##                   State   n
## 1               Alabama 209
## 2                Alaska 209
## 3               Arizona 209
## 4              Arkansas 209
## 5            California 209
## 6              Colorado 209
## 7           Connecticut 209
## 8              Delaware 209
## 9  District of Columbia 209
## 10              Florida 209
## 11              Georgia 209
## 12               Hawaii 209
## 13                Idaho 209
## 14             Illinois 209
## 15              Indiana 209
## 16                 Iowa 209
## 17               Kansas 209
## 18             Kentucky 209
## 19            Louisiana 209
## 20                Maine 209
## 21             Maryland 209
## 22        Massachusetts 209
## 23             Michigan 209
## 24            Minnesota 209
## 25          Mississippi 209
## 26             Missouri 209
## 27              Montana 209
## 28             Nebraska 209
## 29               Nevada 209
## 30        New Hampshire 209
## 31           New Jersey 209
## 32           New Mexico 209
## 33             New York 209
## 34       North Carolina 209
## 35         North Dakota 209
## 36                 Ohio 209
## 37             Oklahoma 209
## 38               Oregon 209
## 39         Pennsylvania 209
## 40         Rhode Island 209
## 41       South Carolina 209
## 42         South Dakota 209
## 43            Tennessee 209
## 44                Texas 209
## 45        United States 209
## 46                 Utah 209
## 47              Vermont 209
## 48             Virginia 209
## 49           Washington 209
## 50        West Virginia 209
## 51            Wisconsin 209
## 52              Wyoming 209

Summary of Categorical Vectors

When analyzing the two cause name vectors, it struck me as odd at first that every unique value occurs the same number of times (988). On closer examination, the reason became clear: the dataset covers 19 years and 52 locations (the 50 states plus the District of Columbia and United States), and 19 * 52 is 988.

The same question may arise at first glance for the State vector, where every unique value occurs 209 times. This is because there are 11 cause categories (10 causes plus All causes) and 19 years, and 11 * 19 is 209.
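The arithmetic above can be reproduced by building every combination of year, location, and cause and counting the rows. This is a minimal sketch with placeholder location and cause names, not the real dataset.

```r
# Placeholder labels standing in for the 52 locations and 11 cause categories
years     <- 1999:2017
locations <- paste("Location", 1:52)
causes    <- paste("Cause", 1:11)

# One row per year x location x cause combination
grid <- expand.grid(
  Year = years,
  State = locations,
  Cause.Name = causes,
  stringsAsFactors = FALSE
)

rows_per_cause <- table(grid$Cause.Name)  # 19 years * 52 locations
rows_per_state <- table(grid$State)       # 19 years * 11 causes

print(nrow(grid))
print(unique(as.vector(rows_per_cause)))
print(unique(as.vector(rows_per_state)))
```

Comparing `nrow(grid)` with `nrow(deaths)` would confirm whether the real data has exactly one row per combination.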

Visualizations of the Data

Overall Cause of Death Visualization

deaths |>
  # Drop aggregate rows so deaths are not counted twice: "All causes" overlaps
  # every cause, and the "United States" rows appear to be national totals
  filter(
    Cause.Name != "All causes",
    State != "United States"
  ) |>
  group_by(Cause.Name) |>
  summarise(total_deaths = sum(Deaths, na.rm = TRUE)) |>
  ggplot(aes(
    x = total_deaths,
    y = reorder(Cause.Name, total_deaths)
  )) +
  geom_col(fill = "steelblue") +
  scale_x_continuous(
    labels = label_number(scale = 1e-6, suffix = "M")
  ) +
  labs(
    title = "Total Deaths by Cause",
    x = "Total Deaths (Millions)",
    y = "Cause of Death"
  ) +
  theme_minimal()

Summary and Insight:

This visualization summarizes total deaths aggregated by cause across the entire dataset, highlighting which causes contribute the largest share of cumulative mortality. Every cause is reported for the same years and locations, so the ordering reflects differences in the number of deaths attributed to each cause rather than differences in reporting coverage. It does not reflect relative risk at the individual level. As a result, this chart is best interpreted as a descriptive overview of how mortality is distributed across causes in the dataset, rather than as a comparison of severity or likelihood. Further analysis using age-adjusted rates or time-specific trends would provide more meaningful insight into relative risk.
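When summing Deaths across rows, it is worth checking whether national rows are being added on top of the state rows. The sketch below uses made-up counts and assumes each "United States" row is the total of the state rows for that cause. That assumption should be verified against the real data.

```r
# Hypothetical counts in which the national row equals the sum of the states
toy <- data.frame(
  State      = rep(c("Alabama", "Alaska", "United States"), times = 2),
  Cause.Name = rep(c("Cancer", "Stroke"), each = 3),
  Deaths     = c(10000, 1000, 11000, 3000, 200, 3200)
)

is_national <- toy$State == "United States"

# Totals per cause from state rows only, national rows only, and all rows
state_sum <- tapply(toy$Deaths[!is_national], toy$Cause.Name[!is_national], sum)
national  <- tapply(toy$Deaths[is_national], toy$Cause.Name[is_national], sum)
naive_sum <- tapply(toy$Deaths, toy$Cause.Name, sum)

print(rbind(state_sum, national, naive_sum))
```

In this toy table, summing every row doubles each total. The ordering of causes is unaffected, but the scale on the axis would be.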

Deaths by Year Visualization

deaths |>
  # Keep state-level rows only: the "United States" rows appear to be
  # national totals, so including them would count each death twice
  filter(
    Cause.Name == "All causes",
    State != "United States"
  ) |>
  group_by(Year) |>
  summarise(total_deaths = sum(Deaths, na.rm = TRUE)) |>
  ggplot(aes(
    x = Year,
    y = total_deaths
  )) +
  geom_line(color = "steelblue", linewidth = 1) +
  scale_x_continuous(
    breaks = seq(min(deaths$Year), max(deaths$Year), by = 2)
  ) +
  scale_y_continuous(
    limits = c(0, NA),
    breaks = seq(0, 6e6, by = 1e6),
    labels = label_number(scale = 1e-6, suffix = "M")
  ) +
  labs(
    title = "Total Deaths per Year (All Causes)",
    x = "Year",
    y = "Total Deaths (Millions)"
  ) +
  theme_minimal()

Summary and Insight:

This visualization shows total deaths per year across all causes, providing a high-level view of how annual mortality changes over the period covered by the dataset. Because the values are raw totals, the overall trend reflects population growth as well as any change in underlying risk, so increases over time are not necessarily evidence of worsening health outcomes. This view is most useful for identifying broad temporal patterns and contextualizing more detailed analyses, such as cause-specific or population-adjusted death rates, which are better suited for comparative interpretation.
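Dips in a yearly series can be located by computing the change from one year to the next and keeping the years where it is negative. The sketch below uses hypothetical totals, not values from the dataset, and the names `yearly`, `change`, and `dips` are illustrative.

```r
# Hypothetical yearly totals (not taken from NCHS.csv)
yearly <- data.frame(
  Year = 2001:2006,
  total_deaths = c(2400000, 2440000, 2450000, 2400000, 2450000, 2430000)
)

# Year-over-year change; the first year has no previous year to compare with
yearly$change <- c(NA, diff(yearly$total_deaths))

# Years in which the total fell relative to the previous year
dips <- yearly[!is.na(yearly$change) & yearly$change < 0, ]

print(dips)
```

Applied to the summarised yearly totals, the same steps would list the years worth investigating further.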

Total Deaths by State Visualization (Top 10 States)

deaths |>
  filter(
    Cause.Name == "All causes",
    State != "United States"
  ) |>
  group_by(State) |>
  summarise(total_deaths = sum(Deaths, na.rm = TRUE)) |>
  slice_max(total_deaths, n = 10) |>
  ggplot(aes(
    x = total_deaths,
    y = reorder(State, total_deaths)
  )) +
  geom_col(fill = "steelblue") +
  scale_x_continuous(
    limits = c(0, NA),
    labels = label_number(scale = 1e-6, suffix = "M")
  ) +
  labs(
    title = "Top 10 States by Total Deaths (All Years, All Causes)",
    x = "Total Deaths (Millions)",
    y = "State"
  ) +
  theme_minimal()

Summary and Insight:

The states appearing in the top 10 by total deaths largely reflect differences in population size rather than underlying risk. Every state is covered for the same 19 years, so a state with a larger population will rank higher on raw death counts simply because this metric does not account for population size or demographic structure. As a result, these rankings should be interpreted as descriptive of cumulative totals within the dataset rather than as an indicator of relative severity or individual risk. This highlights the importance of complementing total counts with population-adjusted measures, such as age-adjusted death rates, for more meaningful comparisons.
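The gap between ranking by raw counts and ranking by rates can be seen in a small made-up example. The state labels, totals, and rates below are hypothetical and chosen only to show that the two orderings need not agree.

```r
# Hypothetical states: State A is large, State B has the highest rate
toy <- data.frame(
  State        = c("State A", "State B", "State C"),
  total_deaths = c(500000, 120000, 60000),
  aadr         = c(650, 900, 780)  # age-adjusted death rate, made up
)

# Ranking by raw total deaths versus by age-adjusted death rate
by_count <- toy$State[order(toy$total_deaths, decreasing = TRUE)]
by_rate  <- toy$State[order(toy$aadr, decreasing = TRUE)]

print(by_count)
print(by_rate)
```

Here the state with the most deaths has the lowest rate, which is the kind of reversal a population-adjusted comparison is meant to reveal.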

Weekly Data Dive Summary

This first data dive provided an opportunity to understand the dataset and identify which vectors needed to be cleaned before analysis. At this point, the minimums, maximums, averages, and quartiles of the numerical vectors have been analyzed. Furthermore, the categorical vectors have been examined to see which unique values exist and how many times each appears in the dataset.

Visualizations were created to answer three basic analytical questions:

  1. Which causes of death accounted for the most deaths across the years?
  2. How did total deaths change from year to year?
  3. Which states had the most deaths across the years?

These findings lay the foundation for further examination of the dataset in the upcoming weeks. Possible questions from this introductory dive include:

  1. Why are those the top 10 states and what other factors could contribute to them being the top 10?
  2. What caused the dips in the trend of total deaths across the years?
  3. What states represent the highest number of deaths for each cause name?

The next analysis may focus on these questions and others as deeper dives into the data occur.