First Data Dive

Data Synopsis

This dataset comes from the CDC and records the leading causes of death by year, cause, location (state), number of deaths, and age-adjusted death rate. I will analyze the data week by week, looking for patterns and answering questions that arise along the way.

Data Analysis Goals

Through this analysis, my goal is to show which causes of death are most prevalent across the states and how those causes have changed over time.

After this surface level analysis is complete, I will begin to peel back the layers of data to understand the possible why and how behind the results. This may lead to further data-diving to help explain the findings.

To begin, I will complete a basic analysis and visualization of each vector, then formulate questions from the findings to answer through further investigation of the data.

Loading the Libraries

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.2

## Warning: package 'ggplot2' was built under R version 4.5.2

## Warning: package 'tibble' was built under R version 4.5.2

## Warning: package 'tidyr' was built under R version 4.5.2

## Warning: package 'readr' was built under R version 4.5.2

## Warning: package 'purrr' was built under R version 4.5.2

## Warning: package 'stringr' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

Reading the Data

deaths <- read.csv('NCHS.csv')

Cleaning the Data

In this dataset, the Deaths and Age Adjusted Death Rate columns are read in as character strings because their values contain commas as thousands separators. As a result, basic R functions such as min and max do not behave as expected on them. Both columns need to be converted to numeric before meaningful summaries can be extracted.

# Strip thousands separators, then convert both columns to numeric
deaths <- deaths |>
  mutate(
    Deaths = as.numeric(gsub(",", "", Deaths)),
    Age.adjusted.Death.Rate = as.numeric(gsub(",", "", Age.adjusted.Death.Rate))
  )
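To illustrate why the comma matters, here is a minimal base R sketch on a few made-up strings in the same format. The `raw` vector is hypothetical and is not read from NCHS.csv.

```r
# Hypothetical strings formatted like the Deaths column
raw <- c("2,813,503", "21", "1,718")

# Direct conversion fails for any value containing a comma (yields NA)
naive <- suppressWarnings(as.numeric(raw))

# Removing the commas first lets every value convert cleanly
cleaned <- as.numeric(gsub(",", "", raw, fixed = TRUE))

print(naive)
print(cleaned)
```

Any value that still converts to NA after this step would point to some other non-numeric content in the column and would be worth inspecting separately.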

Min/Max/Mean/Median/Quantile Summaries

Year Vector Analysis

deaths |>
  summarise(
    min_year = min(Year, na.rm = TRUE),
    max_year = max(Year, na.rm = TRUE),
    q1_year = quantile(Year, 0.25, na.rm = TRUE),
    med_year = median(Year, na.rm = TRUE),
    q3_year = quantile(Year, 0.75, na.rm = TRUE)
  )

##   min_year max_year q1_year med_year q3_year
## 1     1999     2017    2003     2008    2013

Summary and Insight:

The Year variable runs from 1999 to 2017, giving 19 years of coverage. The median (2008) sits at the midpoint of that range, and the quartiles (2003 and 2013) are each five years from the median. This symmetry is what a dataset with the same number of rows in every year would produce, so no period appears to be over- or under-represented. Trends over time can therefore be read without worrying that some years carry more observations than others. Further investigation could explore whether key changes in outcomes align with particular periods within this range.
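As a rough check on how the quartiles relate to the spread of rows across years, the sketch below builds a stand-in year vector and summarises it. It assumes every year contributes the same number of rows (52 locations times 11 cause categories, per the category counts reported in this document); the vector is simulated, not read from the file.

```r
# Assumed structure: 19 years, each with 52 locations x 11 causes = 572 rows
years <- rep(1999:2017, each = 52 * 11)

# Quartiles and median of the balanced year vector
year_quartiles <- quantile(years, c(0.25, 0.5, 0.75), names = FALSE)
print(year_quartiles)

# Every year should contribute the same number of rows
rows_per_year <- table(years)
print(unique(as.vector(rows_per_year)))
```

If the real data were unbalanced across years, the quartiles would drift away from this evenly spaced pattern, so comparing the two is a quick way to spot uneven coverage.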

Deaths Vector Analysis

deaths |>
  summarise(
    min_death = min(Deaths, na.rm = TRUE),
    max_death = max(Deaths, na.rm = TRUE),
    avg_death = mean(Deaths, na.rm = TRUE),
    q1_death = quantile(Deaths, 0.25, na.rm = TRUE),
    med_death = median(Deaths, na.rm = TRUE),
    q3_death = quantile(Deaths, 0.75, na.rm = TRUE)
  )

##   min_death max_death avg_death q1_death med_death q3_death
## 1        21   2813503  15459.91      612    1718.5   5756.5

Summary and Insight:

The Deaths variable ranges from 21 to 2,813,503, a spread of roughly five orders of magnitude. The mean (15,459.91) is about nine times the median (1,718.5), which indicates a strongly right-skewed distribution in which a small number of very large counts pull the average upward. Half of the observations fall between 612 and 5,756.5. The largest values most likely come from aggregate rows, since the file includes a "United States" location and an "All causes" category alongside the individual states and causes. Those rows should be separated from state- and cause-level rows before counts are compared.
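One likely contributor to the extreme values is that the file mixes aggregate rows (State equal to "United States", Cause.Name equal to "All causes") with state- and cause-level rows. The sketch below uses a small invented data frame, `toy`, to show how much the mean and median can move once those aggregates are set aside. The numbers are made up for illustration only.

```r
# Hypothetical rows: one national aggregate, two states, two cause categories
toy <- data.frame(
  State = c("United States", "United States", "Alaska", "Alaska",
            "Texas", "Texas"),
  Cause.Name = c("All causes", "Suicide", "All causes", "Suicide",
                 "All causes", "Suicide"),
  Deaths = c(1000, 40, 100, 10, 900, 30)
)

# Keep only state-level rows for individual causes
detail <- subset(toy, State != "United States" & Cause.Name != "All causes")

# Compare the centre of the distribution with and without aggregate rows
all_rows    <- c(mean = mean(toy$Deaths), median = median(toy$Deaths))
detail_rows <- c(mean = mean(detail$Deaths), median = median(detail$Deaths))

print(all_rows)
print(detail_rows)
```

Applying the same `subset()` condition to the real data before summarising would show how much of the skew comes from the aggregates and how much from genuine differences between states and causes.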

Age Adjusted Death Rate Vector Analysis

deaths |>
  summarise(
    min_aadr = min(Age.adjusted.Death.Rate, na.rm = TRUE),
    max_aadr = max(Age.adjusted.Death.Rate, na.rm = TRUE),
    avg_aadr = mean(Age.adjusted.Death.Rate, na.rm = TRUE),
    q1_aadr = quantile(Age.adjusted.Death.Rate, 0.25, na.rm = TRUE),
    med_aadr = median(Age.adjusted.Death.Rate, na.rm = TRUE),
    q3_aadr = quantile(Age.adjusted.Death.Rate, 0.75, na.rm = TRUE)
  )

##   min_aadr max_aadr avg_aadr q1_aadr med_aadr q3_aadr
## 1      2.6   1087.3 127.5639    19.2     35.9 151.725

Summary and Insight:

The age-adjusted death rate ranges from 2.6 to 1,087.3, with a median of 35.9 and an interquartile range of 19.2 to 151.7. The mean (127.6) is about 3.6 times the median, so the distribution is still right-skewed, though much less so than raw death counts, where the mean is about nine times the median. The largest rates most likely belong to "All causes" rows, which combine every cause of death. Because the rate is adjusted for population age structure, it is better suited than raw counts for comparisons across states or time periods, and differences in it are less driven by demographic composition and more likely related to underlying risk factors. Further analysis could examine how these rates change over time or differ across causes.
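Because Deaths and the age-adjusted rate are on different scales, their spreads are easier to compare with unit-free ratios. The sketch below reuses the summary values printed above; the helper name `relative_shape` is my own and not part of any package.

```r
# Values copied from the two summary tables above
deaths_stats <- c(q1 = 612, median = 1718.5, q3 = 5756.5, mean = 15459.91)
aadr_stats   <- c(q1 = 19.2, median = 35.9, q3 = 151.725, mean = 127.5639)

# Unit-free measures of spread and skew: IQR / median and mean / median
relative_shape <- function(s) {
  c(
    iqr_over_median  = unname((s["q3"] - s["q1"]) / s["median"]),
    mean_over_median = unname(s["mean"] / s["median"])
  )
}

shape <- rbind(
  Deaths = relative_shape(deaths_stats),
  AADR   = relative_shape(aadr_stats)
)
print(round(shape, 2))
```

On these figures the mean-to-median ratio is about 9 for Deaths and about 3.6 for the rate, while the interquartile range relative to the median is about 3.0 and 3.7 respectively. The rate is therefore less skewed by extreme values, but its middle half is not narrower in relative terms.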

Categorical Analysis

X113 Cause Name Vector Analysis

deaths |>
  count(X113.Cause.Name)

##                                                          X113.Cause.Name   n
## 1                   Accidents (unintentional injuries) (V01-X59,Y85-Y86) 988
## 2                                                             All Causes 988
## 3                                              Alzheimer's disease (G30) 988
## 4                                     Cerebrovascular diseases (I60-I69) 988
## 5                           Chronic lower respiratory diseases (J40-J47) 988
## 6                                            Diabetes mellitus (E10-E14) 988
## 7                            Diseases of heart (I00-I09,I11,I13,I20-I51) 988
## 8                                      Influenza and pneumonia (J09-J18) 988
## 9                   Intentional self-harm (suicide) (*U03,X60-X84,Y87.0) 988
## 10                                         Malignant neoplasms (C00-C97) 988
## 11 Nephritis, nephrotic syndrome and nephrosis (N00-N07,N17-N19,N25-N27) 988

Cause Name Vector Analysis

deaths |>
  count(Cause.Name)

##                 Cause.Name   n
## 1               All causes 988
## 2      Alzheimer's disease 988
## 3                     CLRD 988
## 4                   Cancer 988
## 5                 Diabetes 988
## 6            Heart disease 988
## 7  Influenza and pneumonia 988
## 8           Kidney disease 988
## 9                   Stroke 988
## 10                 Suicide 988
## 11  Unintentional injuries 988

State Vector Analysis

deaths |>
  count(State)

##                   State   n
## 1               Alabama 209
## 2                Alaska 209
## 3               Arizona 209
## 4              Arkansas 209
## 5            California 209
## 6              Colorado 209
## 7           Connecticut 209
## 8              Delaware 209
## 9  District of Columbia 209
## 10              Florida 209
## 11              Georgia 209
## 12               Hawaii 209
## 13                Idaho 209
## 14             Illinois 209
## 15              Indiana 209
## 16                 Iowa 209
## 17               Kansas 209
## 18             Kentucky 209
## 19            Louisiana 209
## 20                Maine 209
## 21             Maryland 209
## 22        Massachusetts 209
## 23             Michigan 209
## 24            Minnesota 209
## 25          Mississippi 209
## 26             Missouri 209
## 27              Montana 209
## 28             Nebraska 209
## 29               Nevada 209
## 30        New Hampshire 209
## 31           New Jersey 209
## 32           New Mexico 209
## 33             New York 209
## 34       North Carolina 209
## 35         North Dakota 209
## 36                 Ohio 209
## 37             Oklahoma 209
## 38               Oregon 209
## 39         Pennsylvania 209
## 40         Rhode Island 209
## 41       South Carolina 209
## 42         South Dakota 209
## 43            Tennessee 209
## 44                Texas 209
## 45        United States 209
## 46                 Utah 209
## 47              Vermont 209
## 48             Virginia 209
## 49           Washington 209
## 50        West Virginia 209
## 51            Wisconsin 209
## 52              Wyoming 209

Summary of Categorical Vectors

When analyzing the two cause name vectors, it initially struck me as odd that every unique value occurs the same number of times (988). On closer examination, this is because the data covers 19 years and 52 locations (the 50 states plus the District of Columbia and a United States total), and 19 * 52 is 988.

The same question arises for State, where every unique value occurs 209 times. This is because there are 11 cause categories (10 causes plus "All causes") and 19 years, and 11 * 19 is 209.
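The arithmetic above can be reproduced by building a fully crossed grid with the same dimensions. This is only a sketch: the location and cause labels are placeholders, and it assumes the real file contains exactly one row per year, location, and cause combination.

```r
# Assumed dimensions, taken from the counts above
years  <- 1999:2017                  # 19 years
places <- paste("Location", 1:52)    # 50 states + DC + United States (placeholder names)
causes <- paste("Cause", 1:11)       # 10 causes + "All causes" (placeholder names)

# One row for every year x location x cause combination
grid <- expand.grid(
  Year = years, State = places, Cause.Name = causes,
  stringsAsFactors = FALSE
)

# Number of rows per cause and per location in the balanced grid
rows_per_cause <- unique(as.vector(table(grid$Cause.Name)))
rows_per_state <- unique(as.vector(table(grid$State)))

print(nrow(grid))
print(rows_per_cause)
print(rows_per_state)
```

A real dataset whose counts match this grid has no missing combinations, which is worth knowing before interpreting any group as rare.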

Visualizations of the Data

Overall Cause of Death Visualization

deaths |>
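  # Also drop the "United States" rows, a national aggregate that would double count the states
  filter(Cause.Name != "All causes", State != "United States") |>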
  group_by(Cause.Name) |>
  summarise(total_deaths = sum(Deaths, na.rm = TRUE)) |>
  ggplot(aes(
    x = total_deaths,
    y = reorder(Cause.Name, total_deaths)
  )) +
  geom_col(fill = "steelblue") +
  scale_x_continuous(
    labels = label_number(scale = 1e-6, suffix = "M")
  ) +
  labs(
    title = "Total Deaths by Cause",
    x = "Total Deaths (Millions)",
    y = "Cause of Death"
  ) +
  theme_minimal()

Summary and Insight:

This visualization summarizes total deaths by cause across all 19 years, highlighting which causes account for the largest share of deaths in the dataset. Because every cause is reported for the same locations and years, the ordering reflects how many deaths each cause produced, not differences in reporting coverage. It does not reflect relative risk at the individual level, since raw counts are not adjusted for population size or age structure. The chart is therefore best read as a descriptive overview of how mortality is distributed across causes rather than as a comparison of severity or likelihood. Further analysis using age-adjusted rates or time-specific trends would provide more meaningful insight into relative risk.

Deaths by Year Visualization

deaths |>
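  # Also drop the "United States" rows, a national aggregate that would double count the states
  filter(Cause.Name == "All causes", State != "United States") |>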
  group_by(Year) |>
  summarise(total_deaths = sum(Deaths, na.rm = TRUE)) |>
  ggplot(aes(
    x = Year,
    y = total_deaths
  )) +
  geom_line(color = "steelblue", linewidth = 1) +
  scale_x_continuous(
    breaks = seq(min(deaths$Year), max(deaths$Year), by = 2)
  ) +
  scale_y_continuous(
    limits = c(0, NA),
    breaks = seq(0, 6e6, by = 1e6),
    labels = label_number(scale = 1e-6, suffix = "M")
  ) +
  labs(
    title = "Total Deaths per Year (All Causes)",
    x = "Year",
    y = "Total Deaths (Millions)"
  ) +
  theme_minimal()

Summary and Insight:

This visualization shows total deaths per year across all causes, providing a high-level view of how annual mortality changes over time in the dataset. Because the values are raw totals, the upward trend partly reflects population growth and should not be read on its own as evidence of worsening health outcomes. This view is most useful for identifying broad temporal patterns and contextualizing more detailed analyses, such as cause-specific or population-adjusted death rates, which are better suited for comparative interpretation.

Total Deaths by State Visualization (Top 10 States)

deaths |>
  filter(
    Cause.Name == "All causes",
    State != "United States"
  ) |>
  group_by(State) |>
  summarise(total_deaths = sum(Deaths, na.rm = TRUE)) |>
  slice_max(total_deaths, n = 10) |>
  ggplot(aes(
    x = total_deaths,
    y = reorder(State, total_deaths)
  )) +
  geom_col(fill = "steelblue") +
  scale_x_continuous(
    limits = c(0, NA),
    labels = label_number(scale = 1e-6, suffix = "M")
  ) +
  labs(
    title = "Top 10 States by Total Deaths (All Years, All Causes)",
    x = "Total Deaths (Millions)",
    y = "State"
  ) +
  theme_minimal()

Summary and Insight:

The states in the top 10 by total deaths largely reflect differences in population size rather than underlying risk. Every state is reported for the same 19 years, so states with larger populations rise to the top simply because raw death counts are not normalized for population or demographic structure. These rankings should therefore be interpreted as descriptive of cumulative totals within the dataset rather than as an indicator of relative severity or individual risk. This highlights the importance of complementing total counts with population-adjusted measures, such as age-adjusted death rates, for more meaningful comparisons.
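To make the difference between raw counts and adjusted rates concrete, the sketch below ranks three invented states both ways. All values in `toy` are hypothetical; only the column names follow the dataset.

```r
# Hypothetical states with made-up counts and age-adjusted rates (per 100,000)
toy <- data.frame(
  State  = c("State A", "State B", "State C"),
  Deaths = c(250000, 60000, 9000),
  Age.adjusted.Death.Rate = c(650.2, 910.4, 780.9)
)

# Rank from highest to lowest under each measure
rank_by_count <- toy$State[order(-toy$Deaths)]
rank_by_rate  <- toy$State[order(-toy$Age.adjusted.Death.Rate)]

print(rank_by_count)
print(rank_by_rate)
```

In this made-up case the state with the most deaths has the lowest adjusted rate, which is the kind of reversal a population-adjusted view of the real data could reveal.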

Weekly Data Dive Summary

This first data dive provided an opportunity to understand the dataset and what vectors needed to be cleaned in order to properly perform analysis. At this point, the minimums, maximums, averages, and quantiles of each numerical vector have been analyzed. Furthermore, the categorical vectors have been analyzed to see what unique values exist and how many times they appear in the dataset.

Visualizations were created to answer three basic analytical questions:

Which causes of death accounted for the most deaths across the years?
How did total deaths change from year to year?
Which states had the most deaths across the years?

These findings lay the foundation for further examination of the dataset over the upcoming weeks. Possible questions from this introductory dive include:

Why are those the top 10 states and what other factors could contribute to them being the top 10?
What caused the dips in the trend of total deaths across the years?
What states represent the highest number of deaths for each cause name?

The next analysis may focus on these questions and others as deeper dives into the data occur.

Week 3 Data Dive

Data Dive Objective

The goal of this week’s data dive is to examine groups of observations within the dataset and assess how likely those groups are to occur given the overall structure of the data. This type of analysis introduces a basic form of anomaly detection, where less frequent groups are interpreted as having lower probability of occurrence relative to more common groups.

The focus of this analysis is not to infer causation, but rather to identify rare versus common groupings, interpret what those groupings represent in context, and formulate testable hypotheses for why certain groups appear less frequently than others.

Grouped Analysis 1: Death Counts by Cause Name

cause_group <- deaths |>
  filter(Cause.Name != "All causes") |>
  group_by(Cause.Name) |>
  summarise(
    total_deaths = sum(Deaths, na.rm = TRUE),
    count = n()
  )
cause_group

## # A tibble: 10 × 3
##    Cause.Name              total_deaths count
##    <chr>                          <dbl> <int>
##  1 Alzheimer's disease          2989632   988
##  2 CLRD                         5189854   988
##  3 Cancer                      21687288   988
##  4 Diabetes                     2799886   988
##  5 Heart disease               24445280   988
##  6 Influenza and pneumonia      2189282   988
##  7 Kidney disease               1717226   988
##  8 Stroke                       5453046   988
##  9 Suicide                      1394032   988
## 10 Unintentional injuries       4695640   988

Interpretation & Probability Context

Each group in this data frame represents a specific cause of death, and the count column gives the number of rows for that cause across all locations and years. If a single row were randomly selected from the filtered data, the probability of selecting a particular cause would be that cause's row count divided by the total number of rows.

Here every cause has exactly 988 rows, so the row-based probability is the same for all ten causes (988 / 9,880 = 0.10). Row counts alone therefore do not single out any rare group; this uniformity reflects the reporting structure, not medical severity. Rarity appears only when each group is weighted by total_deaths, that is, when the random draw is a recorded death rather than a row. Note that these totals include the "United States" rows, so they are roughly double the national figures; the shares between causes are unaffected as long as the national row equals the sum of the states.

Weighted by deaths, the lowest-probability group is Suicide, with 1,394,032 of the 72,561,166 deaths in the table (about 1.9%), compared with about 33.7% for Heart disease. Suicide is therefore tagged as the low-probability (anomalous) group.
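The two notions of probability discussed here (a random row versus a random recorded death) can be computed directly from the table printed above. This is a sketch in base R; the column names `p_row`, `p_death`, and `low_probability` are my own additions.

```r
# Values copied from the cause_group output above
cause_group <- data.frame(
  Cause.Name = c("Alzheimer's disease", "CLRD", "Cancer", "Diabetes",
                 "Heart disease", "Influenza and pneumonia", "Kidney disease",
                 "Stroke", "Suicide", "Unintentional injuries"),
  total_deaths = c(2989632, 5189854, 21687288, 2799886, 24445280,
                   2189282, 1717226, 5453046, 1394032, 4695640),
  count = 988
)

# Probability of drawing each cause when sampling a random row
cause_group$p_row <- cause_group$count / sum(cause_group$count)

# Probability of drawing each cause when sampling a random recorded death
cause_group$p_death <- cause_group$total_deaths / sum(cause_group$total_deaths)

# Tag the group with the smallest death-weighted probability
cause_group$low_probability <- cause_group$p_death == min(cause_group$p_death)

print(cause_group[order(cause_group$p_death),
                  c("Cause.Name", "p_row", "p_death", "low_probability")])
```

Since every cause has the same row count, `p_row` is identical across groups, and only `p_death` separates common causes from rare ones.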

Hypothesis

A testable hypothesis is that causes with lower death-weighted probabilities correspond to conditions that are less prevalent nationally. Because every cause has the same 988 rows, a later introduction into the reporting timeline cannot explain the difference.

Visualization

ggplot(cause_group, aes(
  x = total_deaths,
  y = reorder(Cause.Name, total_deaths)
)) +
  geom_col(fill = "steelblue") +
  scale_x_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
  labs(
    title = "Total Deaths by Cause",
    x = "Total Deaths (Millions)",
    y = "Cause of Death"
  ) +
  theme_minimal()

Grouped Analysis 2: State-Level Deaths

state_group <- deaths |>
  filter(
    Cause.Name == "All causes",
    State != "United States"
  ) |>
  group_by(State) |>
  summarise(
    total_deaths = sum(Deaths, na.rm = TRUE),
    count = n()
  )

state_group

## # A tibble: 51 × 3
##    State                total_deaths count
##    <chr>                       <dbl> <int>
##  1 Alabama                    914067    19
##  2 Alaska                      67789    19
##  3 Arizona                    895865    19
##  4 Arkansas                   555553    19
##  5 California                4575252    19
##  6 Colorado                   599361    19
##  7 Connecticut                562638    19
##  8 Delaware                   145173    19
##  9 District of Columbia        99121    19
## 10 Florida                   3334759    19
## # ℹ 41 more rows

Interpretation & Probability Context

Each state represents a group of observations aggregated across all years. If a row were randomly selected from the filtered data, the probability of selecting a particular state would be that state's row count divided by the total number of rows.

Every state has exactly 19 rows (one per year), so the row-based probability is identical across the 51 groups (19 / 969, about 2%). Differences emerge only when states are weighted by total_deaths, which is expected given differences in population size and total mortality volume across states.

The lowest-probability states are those with the smallest death totals. Among the rows printed above, Alaska (67,789) and the District of Columbia (99,121) are the smallest, and they are tagged as low-probability (anomalous) groups, reflecting structural characteristics of the dataset rather than missing data.

Hypothesis

A testable hypothesis is that states with smaller populations consistently appear as lower-probability groups, when weighted by deaths, because fewer deaths are recorded there each year.
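One way to make the tagging explicit is a simple cutoff on each state's share of deaths. The sketch below uses only the ten rows printed above, so the shares are relative to that subset and not to all 51 states, and the 2% cutoff is an arbitrary choice for illustration.

```r
# Totals copied from the ten rows printed above (not the full 51-row table)
state_group <- data.frame(
  State = c("Alabama", "Alaska", "Arizona", "Arkansas", "California",
            "Colorado", "Connecticut", "Delaware", "District of Columbia",
            "Florida"),
  total_deaths = c(914067, 67789, 895865, 555553, 4575252,
                   599361, 562638, 145173, 99121, 3334759)
)

# Death-weighted probability of each state within this ten-row subset
state_group$p_death <- state_group$total_deaths / sum(state_group$total_deaths)

# Hypothetical cutoff: tag any state holding under 2% of the subset's deaths
cutoff <- 0.02
state_group$low_probability <- state_group$p_death < cutoff

flagged <- sort(state_group$State[state_group$low_probability])
print(flagged)
```

Running the same steps on the full state_group table would change the shares, so the cutoff would need to be revisited there.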

Visualization

ggplot(state_group, aes(
  x = total_deaths,
  y = reorder(State, total_deaths)
)) +
  geom_col(fill = "steelblue") +
  scale_x_continuous(
    labels = label_number(scale = 1e-6, suffix = "M")
  ) +
  labs(
    title = "Total Deaths by State (All Causes)",
    x = "Total Deaths (Millions)",
    y = "State"
  ) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 8),
    plot.margin = margin(10, 10, 10, 20)
  )

Grouped Analysis 3: Deaths by Year

year_group <- deaths |>
  filter(Cause.Name == "All causes") |>
  group_by(Year) |>
  summarise(
    total_deaths = sum(Deaths, na.rm = TRUE),
    count = n()
  )

year_group

## # A tibble: 19 × 3
##     Year total_deaths count
##    <int>        <dbl> <int>
##  1  1999      4782798    52
##  2  2000      4806702    52
##  3  2001      4832850    52
##  4  2002      4886774    52
##  5  2003      4896576    52
##  6  2004      4795230    52
##  7  2005      4896034    52
##  8  2006      4852528    52
##  9  2007      4847424    52
## 10  2008      4943968    52
## 11  2009      4874326    52
## 12  2010      4936870    52
## 13  2011      5030916    52
## 14  2012      5086558    52
## 15  2013      5193986    52
## 16  2014      5252836    52
## 17  2015      5425260    52
## 18  2016      5488496    52
## 19  2017      5627006    52

Interpretation & Probability Context

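Each year has exactly the same number of observations (52), as the dataset structure is consistent across time. As a result, the probability of selecting any particular year in a random row draw is equal (52 / 988 = 1/19, about 5.3%). Note that the 52 rows per year include the "United States" row, so each yearly total is roughly double the national figure.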

Because no year has substantially fewer observations than others, no low‑probability or anomalous group is observed at the year level. This suggests that rarity in this dataset is more likely to emerge from categorical or geographic groupings rather than from time alone.

Hypothesis

A testable hypothesis is that notable changes in total deaths by year are driven by demographic shifts or exceptional events rather than sampling imbalance.
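As a starting point for testing this hypothesis, the years in which total deaths fell can be located with a year-over-year difference. The sketch below reuses the totals from the year_group table above; the variable names `change` and `dip_years` are my own.

```r
# Totals copied from the year_group output above
years <- 1999:2017
total_deaths <- c(4782798, 4806702, 4832850, 4886774, 4896576, 4795230,
                  4896034, 4852528, 4847424, 4943968, 4874326, 4936870,
                  5030916, 5086558, 5193986, 5252836, 5425260, 5488496,
                  5627006)

# Change from the previous year (one fewer element than total_deaths)
change <- diff(total_deaths)

# Years whose total is lower than the year before
dip_years <- years[-1][change < 0]
print(dip_years)
```

The flagged years are the natural candidates to examine for demographic shifts or exceptional events.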

Visualization

ggplot(year_group, aes(
  x = Year,
  y = total_deaths
)) +
  geom_line(color = "steelblue", linewidth = 1) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
  labs(
    title = "Total Deaths by Year",
    x = "Year",
    y = "Total Deaths (Millions)"
  ) +
  theme_minimal()

Combination Analysis: State and Cause of Death

state_cause_combo <- deaths |>
  filter(
    State != "United States",
    Cause.Name != "All causes"
  ) |>
  count(State, Cause.Name)

state_cause_combo

##                    State              Cause.Name  n
## 1                Alabama     Alzheimer's disease 19
## 2                Alabama                    CLRD 19
## 3                Alabama                  Cancer 19
## 4                Alabama                Diabetes 19
## 5                Alabama           Heart disease 19
## 6                Alabama Influenza and pneumonia 19
## 7                Alabama          Kidney disease 19
## 8                Alabama                  Stroke 19
## 9                Alabama                 Suicide 19
## 10               Alabama  Unintentional injuries 19
## 11                Alaska     Alzheimer's disease 19
## 12                Alaska                    CLRD 19
## 13                Alaska                  Cancer 19
## 14                Alaska                Diabetes 19
## 15                Alaska           Heart disease 19
## 16                Alaska Influenza and pneumonia 19
## 17                Alaska          Kidney disease 19
## 18                Alaska                  Stroke 19
## 19                Alaska                 Suicide 19
## 20                Alaska  Unintentional injuries 19
## 21               Arizona     Alzheimer's disease 19
## 22               Arizona                    CLRD 19
## 23               Arizona                  Cancer 19
## 24               Arizona                Diabetes 19
## 25               Arizona           Heart disease 19
## 26               Arizona Influenza and pneumonia 19
## 27               Arizona          Kidney disease 19
## 28               Arizona                  Stroke 19
## 29               Arizona                 Suicide 19
## 30               Arizona  Unintentional injuries 19
## 31              Arkansas     Alzheimer's disease 19
## 32              Arkansas                    CLRD 19
## 33              Arkansas                  Cancer 19
## 34              Arkansas                Diabetes 19
## 35              Arkansas           Heart disease 19
## 36              Arkansas Influenza and pneumonia 19
## 37              Arkansas          Kidney disease 19
## 38              Arkansas                  Stroke 19
## 39              Arkansas                 Suicide 19
## 40              Arkansas  Unintentional injuries 19
## 41            California     Alzheimer's disease 19
## 42            California                    CLRD 19
## 43            California                  Cancer 19
## 44            California                Diabetes 19
## 45            California           Heart disease 19
## 46            California Influenza and pneumonia 19
## 47            California          Kidney disease 19
## 48            California                  Stroke 19
## 49            California                 Suicide 19
## 50            California  Unintentional injuries 19
## 51              Colorado     Alzheimer's disease 19
## 52              Colorado                    CLRD 19
## 53              Colorado                  Cancer 19
## 54              Colorado                Diabetes 19
## 55              Colorado           Heart disease 19
## 56              Colorado Influenza and pneumonia 19
## 57              Colorado          Kidney disease 19
## 58              Colorado                  Stroke 19
## 59              Colorado                 Suicide 19
## 60              Colorado  Unintentional injuries 19
## 61           Connecticut     Alzheimer's disease 19
## 62           Connecticut                    CLRD 19
## 63           Connecticut                  Cancer 19
## 64           Connecticut                Diabetes 19
## 65           Connecticut           Heart disease 19
## 66           Connecticut Influenza and pneumonia 19
## 67           Connecticut          Kidney disease 19
## 68           Connecticut                  Stroke 19
## 69           Connecticut                 Suicide 19
## 70           Connecticut  Unintentional injuries 19
## 71              Delaware     Alzheimer's disease 19
## 72              Delaware                    CLRD 19
## 73              Delaware                  Cancer 19
## 74              Delaware                Diabetes 19
## 75              Delaware           Heart disease 19
## 76              Delaware Influenza and pneumonia 19
## 77              Delaware          Kidney disease 19
## 78              Delaware                  Stroke 19
## 79              Delaware                 Suicide 19
## 80              Delaware  Unintentional injuries 19
## 81  District of Columbia     Alzheimer's disease 19
## 82  District of Columbia                    CLRD 19
## 83  District of Columbia                  Cancer 19
## 84  District of Columbia                Diabetes 19
## 85  District of Columbia           Heart disease 19
## 86  District of Columbia Influenza and pneumonia 19
## 87  District of Columbia          Kidney disease 19
## 88  District of Columbia                  Stroke 19
## 89  District of Columbia                 Suicide 19
## 90  District of Columbia  Unintentional injuries 19
## 91               Florida     Alzheimer's disease 19
## 92               Florida                    CLRD 19
## 93               Florida                  Cancer 19
## 94               Florida                Diabetes 19
## 95               Florida           Heart disease 19
## 96               Florida Influenza and pneumonia 19
## 97               Florida          Kidney disease 19
## 98               Florida                  Stroke 19
## 99               Florida                 Suicide 19
## 100              Florida  Unintentional injuries 19
## 101              Georgia     Alzheimer's disease 19
## 102              Georgia                    CLRD 19
## 103              Georgia                  Cancer 19
## 104              Georgia                Diabetes 19
## 105              Georgia           Heart disease 19
## 106              Georgia Influenza and pneumonia 19
## 107              Georgia          Kidney disease 19
## 108              Georgia                  Stroke 19
## 109              Georgia                 Suicide 19
## 110              Georgia  Unintentional injuries 19
## 111               Hawaii     Alzheimer's disease 19
## 112               Hawaii                    CLRD 19
## 113               Hawaii                  Cancer 19
## 114               Hawaii                Diabetes 19
## 115               Hawaii           Heart disease 19
## 116               Hawaii Influenza and pneumonia 19
## 117               Hawaii          Kidney disease 19
## 118               Hawaii                  Stroke 19
## 119               Hawaii                 Suicide 19
## 120               Hawaii  Unintentional injuries 19
## 121                Idaho     Alzheimer's disease 19
## 122                Idaho                    CLRD 19
## 123                Idaho                  Cancer 19
## 124                Idaho                Diabetes 19
## 125                Idaho           Heart disease 19
## 126                Idaho Influenza and pneumonia 19
## 127                Idaho          Kidney disease 19
## 128                Idaho                  Stroke 19
## 129                Idaho                 Suicide 19
## 130                Idaho  Unintentional injuries 19
## 131             Illinois     Alzheimer's disease 19
## 132             Illinois                    CLRD 19
## 133             Illinois                  Cancer 19
## 134             Illinois                Diabetes 19
## 135             Illinois           Heart disease 19
## 136             Illinois Influenza and pneumonia 19
## 137             Illinois          Kidney disease 19
## 138             Illinois                  Stroke 19
## 139             Illinois                 Suicide 19
## 140             Illinois  Unintentional injuries 19
## 141              Indiana     Alzheimer's disease 19
## 142              Indiana                    CLRD 19
## 143              Indiana                  Cancer 19
## 144              Indiana                Diabetes 19
## 145              Indiana           Heart disease 19
## 146              Indiana Influenza and pneumonia 19
## 147              Indiana          Kidney disease 19
## 148              Indiana                  Stroke 19
## 149              Indiana                 Suicide 19
## 150              Indiana  Unintentional injuries 19
## 151                 Iowa     Alzheimer's disease 19
## 152                 Iowa                    CLRD 19
## 153                 Iowa                  Cancer 19
## 154                 Iowa                Diabetes 19
## 155                 Iowa           Heart disease 19
## 156                 Iowa Influenza and pneumonia 19
## 157                 Iowa          Kidney disease 19
## 158                 Iowa                  Stroke 19
## 159                 Iowa                 Suicide 19
## 160                 Iowa  Unintentional injuries 19
## 161               Kansas     Alzheimer's disease 19
## 162               Kansas                    CLRD 19
## 163               Kansas                  Cancer 19
## 164               Kansas                Diabetes 19
## 165               Kansas           Heart disease 19
## 166               Kansas Influenza and pneumonia 19
## 167               Kansas          Kidney disease 19
## 168               Kansas                  Stroke 19
## 169               Kansas                 Suicide 19
## 170               Kansas  Unintentional injuries 19
## 171             Kentucky     Alzheimer's disease 19
## 172             Kentucky                    CLRD 19
## 173             Kentucky                  Cancer 19
## 174             Kentucky                Diabetes 19
## 175             Kentucky           Heart disease 19
## 176             Kentucky Influenza and pneumonia 19
## 177             Kentucky          Kidney disease 19
## 178             Kentucky                  Stroke 19
## 179             Kentucky                 Suicide 19
## 180             Kentucky  Unintentional injuries 19
## 181            Louisiana     Alzheimer's disease 19
## 182            Louisiana                    CLRD 19
## 183            Louisiana                  Cancer 19
## 184            Louisiana                Diabetes 19
## 185            Louisiana           Heart disease 19
## 186            Louisiana Influenza and pneumonia 19
## 187            Louisiana          Kidney disease 19
## 188            Louisiana                  Stroke 19
## 189            Louisiana                 Suicide 19
## 190            Louisiana  Unintentional injuries 19
## 191                Maine     Alzheimer's disease 19
## 192                Maine                    CLRD 19
## 193                Maine                  Cancer 19
## 194                Maine                Diabetes 19
## 195                Maine           Heart disease 19
## 196                Maine Influenza and pneumonia 19
## 197                Maine          Kidney disease 19
## 198                Maine                  Stroke 19
## 199                Maine                 Suicide 19
## 200                Maine  Unintentional injuries 19
## 201             Maryland     Alzheimer's disease 19
## 202             Maryland                    CLRD 19
## 203             Maryland                  Cancer 19
## 204             Maryland                Diabetes 19
## 205             Maryland           Heart disease 19
## 206             Maryland Influenza and pneumonia 19
## 207             Maryland          Kidney disease 19
## 208             Maryland                  Stroke 19
## 209             Maryland                 Suicide 19
## 210             Maryland  Unintentional injuries 19
## 211        Massachusetts     Alzheimer's disease 19
## 212        Massachusetts                    CLRD 19
## 213        Massachusetts                  Cancer 19
## 214        Massachusetts                Diabetes 19
## 215        Massachusetts           Heart disease 19
## 216        Massachusetts Influenza and pneumonia 19
## 217        Massachusetts          Kidney disease 19
## 218        Massachusetts                  Stroke 19
## 219        Massachusetts                 Suicide 19
## 220        Massachusetts  Unintentional injuries 19
## 221             Michigan     Alzheimer's disease 19
## 222             Michigan                    CLRD 19
## 223             Michigan                  Cancer 19
## 224             Michigan                Diabetes 19
## 225             Michigan           Heart disease 19
## 226             Michigan Influenza and pneumonia 19
## 227             Michigan          Kidney disease 19
## 228             Michigan                  Stroke 19
## 229             Michigan                 Suicide 19
## 230             Michigan  Unintentional injuries 19
## 231            Minnesota     Alzheimer's disease 19
## 232            Minnesota                    CLRD 19
## 233            Minnesota                  Cancer 19
## 234            Minnesota                Diabetes 19
## 235            Minnesota           Heart disease 19
## 236            Minnesota Influenza and pneumonia 19
## 237            Minnesota          Kidney disease 19
## 238            Minnesota                  Stroke 19
## 239            Minnesota                 Suicide 19
## 240            Minnesota  Unintentional injuries 19
## 241          Mississippi     Alzheimer's disease 19
## 242          Mississippi                    CLRD 19
## 243          Mississippi                  Cancer 19
## 244          Mississippi                Diabetes 19
## 245          Mississippi           Heart disease 19
## 246          Mississippi Influenza and pneumonia 19
## 247          Mississippi          Kidney disease 19
## 248          Mississippi                  Stroke 19
## 249          Mississippi                 Suicide 19
## 250          Mississippi  Unintentional injuries 19
## 251             Missouri     Alzheimer's disease 19
## 252             Missouri                    CLRD 19
## 253             Missouri                  Cancer 19
## 254             Missouri                Diabetes 19
## 255             Missouri           Heart disease 19
## 256             Missouri Influenza and pneumonia 19
## 257             Missouri          Kidney disease 19
## 258             Missouri                  Stroke 19
## 259             Missouri                 Suicide 19
## 260             Missouri  Unintentional injuries 19
## 261              Montana     Alzheimer's disease 19
## 262              Montana                    CLRD 19
## 263              Montana                  Cancer 19
## 264              Montana                Diabetes 19
## 265              Montana           Heart disease 19
## 266              Montana Influenza and pneumonia 19
## 267              Montana          Kidney disease 19
## 268              Montana                  Stroke 19
## 269              Montana                 Suicide 19
## 270              Montana  Unintentional injuries 19
## 271             Nebraska     Alzheimer's disease 19
## 272             Nebraska                    CLRD 19
## 273             Nebraska                  Cancer 19
## 274             Nebraska                Diabetes 19
## 275             Nebraska           Heart disease 19
## 276             Nebraska Influenza and pneumonia 19
## 277             Nebraska          Kidney disease 19
## 278             Nebraska                  Stroke 19
## 279             Nebraska                 Suicide 19
## 280             Nebraska  Unintentional injuries 19
## 281               Nevada     Alzheimer's disease 19
## 282               Nevada                    CLRD 19
## 283               Nevada                  Cancer 19
## 284               Nevada                Diabetes 19
## 285               Nevada           Heart disease 19
## 286               Nevada Influenza and pneumonia 19
## 287               Nevada          Kidney disease 19
## 288               Nevada                  Stroke 19
## 289               Nevada                 Suicide 19
## 290               Nevada  Unintentional injuries 19
## 291        New Hampshire     Alzheimer's disease 19
## 292        New Hampshire                    CLRD 19
## 293        New Hampshire                  Cancer 19
## 294        New Hampshire                Diabetes 19
## 295        New Hampshire           Heart disease 19
## 296        New Hampshire Influenza and pneumonia 19
## 297        New Hampshire          Kidney disease 19
## 298        New Hampshire                  Stroke 19
## 299        New Hampshire                 Suicide 19
## 300        New Hampshire  Unintentional injuries 19
## 301           New Jersey     Alzheimer's disease 19
## 302           New Jersey                    CLRD 19
## 303           New Jersey                  Cancer 19
## 304           New Jersey                Diabetes 19
## 305           New Jersey           Heart disease 19
## 306           New Jersey Influenza and pneumonia 19
## 307           New Jersey          Kidney disease 19
## 308           New Jersey                  Stroke 19
## 309           New Jersey                 Suicide 19
## 310           New Jersey  Unintentional injuries 19
## 311           New Mexico     Alzheimer's disease 19
## 312           New Mexico                    CLRD 19
## 313           New Mexico                  Cancer 19
## 314           New Mexico                Diabetes 19
## 315           New Mexico           Heart disease 19
## 316           New Mexico Influenza and pneumonia 19
## 317           New Mexico          Kidney disease 19
## 318           New Mexico                  Stroke 19
## 319           New Mexico                 Suicide 19
## 320           New Mexico  Unintentional injuries 19
## 321             New York     Alzheimer's disease 19
## 322             New York                    CLRD 19
## 323             New York                  Cancer 19
## 324             New York                Diabetes 19
## 325             New York           Heart disease 19
## 326             New York Influenza and pneumonia 19
## 327             New York          Kidney disease 19
## 328             New York                  Stroke 19
## 329             New York                 Suicide 19
## 330             New York  Unintentional injuries 19
## 331       North Carolina     Alzheimer's disease 19
## 332       North Carolina                    CLRD 19
## 333       North Carolina                  Cancer 19
## 334       North Carolina                Diabetes 19
## 335       North Carolina           Heart disease 19
## 336       North Carolina Influenza and pneumonia 19
## 337       North Carolina          Kidney disease 19
## 338       North Carolina                  Stroke 19
## 339       North Carolina                 Suicide 19
## 340       North Carolina  Unintentional injuries 19
## 341         North Dakota     Alzheimer's disease 19
## 342         North Dakota                    CLRD 19
## 343         North Dakota                  Cancer 19
## 344         North Dakota                Diabetes 19
## 345         North Dakota           Heart disease 19
## 346         North Dakota Influenza and pneumonia 19
## 347         North Dakota          Kidney disease 19
## 348         North Dakota                  Stroke 19
## 349         North Dakota                 Suicide 19
## 350         North Dakota  Unintentional injuries 19
## 351                 Ohio     Alzheimer's disease 19
## 352                 Ohio                    CLRD 19
## 353                 Ohio                  Cancer 19
## 354                 Ohio                Diabetes 19
## 355                 Ohio           Heart disease 19
## 356                 Ohio Influenza and pneumonia 19
## 357                 Ohio          Kidney disease 19
## 358                 Ohio                  Stroke 19
## 359                 Ohio                 Suicide 19
## 360                 Ohio  Unintentional injuries 19
## 361             Oklahoma     Alzheimer's disease 19
## 362             Oklahoma                    CLRD 19
## 363             Oklahoma                  Cancer 19
## 364             Oklahoma                Diabetes 19
## 365             Oklahoma           Heart disease 19
## 366             Oklahoma Influenza and pneumonia 19
## 367             Oklahoma          Kidney disease 19
## 368             Oklahoma                  Stroke 19
## 369             Oklahoma                 Suicide 19
## 370             Oklahoma  Unintentional injuries 19
## 371               Oregon     Alzheimer's disease 19
## 372               Oregon                    CLRD 19
## 373               Oregon                  Cancer 19
## 374               Oregon                Diabetes 19
## 375               Oregon           Heart disease 19
## 376               Oregon Influenza and pneumonia 19
## 377               Oregon          Kidney disease 19
## 378               Oregon                  Stroke 19
## 379               Oregon                 Suicide 19
## 380               Oregon  Unintentional injuries 19
## 381         Pennsylvania     Alzheimer's disease 19
## 382         Pennsylvania                    CLRD 19
## 383         Pennsylvania                  Cancer 19
## 384         Pennsylvania                Diabetes 19
## 385         Pennsylvania           Heart disease 19
## 386         Pennsylvania Influenza and pneumonia 19
## 387         Pennsylvania          Kidney disease 19
## 388         Pennsylvania                  Stroke 19
## 389         Pennsylvania                 Suicide 19
## 390         Pennsylvania  Unintentional injuries 19
## 391         Rhode Island     Alzheimer's disease 19
## 392         Rhode Island                    CLRD 19
## 393         Rhode Island                  Cancer 19
## 394         Rhode Island                Diabetes 19
## 395         Rhode Island           Heart disease 19
## 396         Rhode Island Influenza and pneumonia 19
## 397         Rhode Island          Kidney disease 19
## 398         Rhode Island                  Stroke 19
## 399         Rhode Island                 Suicide 19
## 400         Rhode Island  Unintentional injuries 19
## 401       South Carolina     Alzheimer's disease 19
## 402       South Carolina                    CLRD 19
## 403       South Carolina                  Cancer 19
## 404       South Carolina                Diabetes 19
## 405       South Carolina           Heart disease 19
## 406       South Carolina Influenza and pneumonia 19
## 407       South Carolina          Kidney disease 19
## 408       South Carolina                  Stroke 19
## 409       South Carolina                 Suicide 19
## 410       South Carolina  Unintentional injuries 19
## 411         South Dakota     Alzheimer's disease 19
## 412         South Dakota                    CLRD 19
## 413         South Dakota                  Cancer 19
## 414         South Dakota                Diabetes 19
## 415         South Dakota           Heart disease 19
## 416         South Dakota Influenza and pneumonia 19
## 417         South Dakota          Kidney disease 19
## 418         South Dakota                  Stroke 19
## 419         South Dakota                 Suicide 19
## 420         South Dakota  Unintentional injuries 19
## 421            Tennessee     Alzheimer's disease 19
## 422            Tennessee                    CLRD 19
## 423            Tennessee                  Cancer 19
## 424            Tennessee                Diabetes 19
## 425            Tennessee           Heart disease 19
## 426            Tennessee Influenza and pneumonia 19
## 427            Tennessee          Kidney disease 19
## 428            Tennessee                  Stroke 19
## 429            Tennessee                 Suicide 19
## 430            Tennessee  Unintentional injuries 19
## 431                Texas     Alzheimer's disease 19
## 432                Texas                    CLRD 19
## 433                Texas                  Cancer 19
## 434                Texas                Diabetes 19
## 435                Texas           Heart disease 19
## 436                Texas Influenza and pneumonia 19
## 437                Texas          Kidney disease 19
## 438                Texas                  Stroke 19
## 439                Texas                 Suicide 19
## 440                Texas  Unintentional injuries 19
## 441                 Utah     Alzheimer's disease 19
## 442                 Utah                    CLRD 19
## 443                 Utah                  Cancer 19
## 444                 Utah                Diabetes 19
## 445                 Utah           Heart disease 19
## 446                 Utah Influenza and pneumonia 19
## 447                 Utah          Kidney disease 19
## 448                 Utah                  Stroke 19
## 449                 Utah                 Suicide 19
## 450                 Utah  Unintentional injuries 19
## 451              Vermont     Alzheimer's disease 19
## 452              Vermont                    CLRD 19
## 453              Vermont                  Cancer 19
## 454              Vermont                Diabetes 19
## 455              Vermont           Heart disease 19
## 456              Vermont Influenza and pneumonia 19
## 457              Vermont          Kidney disease 19
## 458              Vermont                  Stroke 19
## 459              Vermont                 Suicide 19
## 460              Vermont  Unintentional injuries 19
## 461             Virginia     Alzheimer's disease 19
## 462             Virginia                    CLRD 19
## 463             Virginia                  Cancer 19
## 464             Virginia                Diabetes 19
## 465             Virginia           Heart disease 19
## 466             Virginia Influenza and pneumonia 19
## 467             Virginia          Kidney disease 19
## 468             Virginia                  Stroke 19
## 469             Virginia                 Suicide 19
## 470             Virginia  Unintentional injuries 19
## 471           Washington     Alzheimer's disease 19
## 472           Washington                    CLRD 19
## 473           Washington                  Cancer 19
## 474           Washington                Diabetes 19
## 475           Washington           Heart disease 19
## 476           Washington Influenza and pneumonia 19
## 477           Washington          Kidney disease 19
## 478           Washington                  Stroke 19
## 479           Washington                 Suicide 19
## 480           Washington  Unintentional injuries 19
## 481        West Virginia     Alzheimer's disease 19
## 482        West Virginia                    CLRD 19
## 483        West Virginia                  Cancer 19
## 484        West Virginia                Diabetes 19
## 485        West Virginia           Heart disease 19
## 486        West Virginia Influenza and pneumonia 19
## 487        West Virginia          Kidney disease 19
## 488        West Virginia                  Stroke 19
## 489        West Virginia                 Suicide 19
## 490        West Virginia  Unintentional injuries 19
## 491            Wisconsin     Alzheimer's disease 19
## 492            Wisconsin                    CLRD 19
## 493            Wisconsin                  Cancer 19
## 494            Wisconsin                Diabetes 19
## 495            Wisconsin           Heart disease 19
## 496            Wisconsin Influenza and pneumonia 19
## 497            Wisconsin          Kidney disease 19
## 498            Wisconsin                  Stroke 19
## 499            Wisconsin                 Suicide 19
## 500            Wisconsin  Unintentional injuries 19
## 501              Wyoming     Alzheimer's disease 19
## 502              Wyoming                    CLRD 19
## 503              Wyoming                  Cancer 19
## 504              Wyoming                Diabetes 19
## 505              Wyoming           Heart disease 19
## 506              Wyoming Influenza and pneumonia 19
## 507              Wyoming          Kidney disease 19
## 508              Wyoming                  Stroke 19
## 509              Wyoming                 Suicide 19
## 510              Wyoming  Unintentional injuries 19

Interpretation of Combinations

This data frame lists every observed combination of state and cause of death; the listing ends at row 510. Every combination appears in the dataset, indicating that reporting is structurally complete. In the rows printed above, each combination has the same count (n = 19), so any combination with a smaller count would stand out as rare.

State–cause combinations with the smallest counts are considered low‑probability (anomalous) groups. These rare combinations occur when low‑prevalence causes intersect with smaller state populations, making them statistically unlikely to be selected in a random row draw. Their rarity reflects expected structural patterns rather than errors or omissions.
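As a minimal sketch of the random row draw idea, the probability of drawing a given state and cause combination is its row count divided by the total number of rows. The toy data frame below is hypothetical (it is not the CDC data), and the `State` and `Year` column names are assumptions; only `Cause.Name` and `n` appear in the code above.

```r
library(dplyr)

# Hypothetical toy data: 2 states x 2 causes x 3 years.
# `State` and `Year` are assumed column names.
toy <- expand.grid(
  State = c("Iowa", "Maine"),
  Cause.Name = c("Cancer", "Suicide"),
  Year = 1999:2001,
  stringsAsFactors = FALSE
)

# Drop one row on purpose so that one combination is rarer than the rest.
toy <- toy[!(toy$State == "Maine" &
             toy$Cause.Name == "Suicide" &
             toy$Year == 2001), ]

# Count rows per combination and convert counts to draw probabilities.
combo_prob <- toy %>%
  count(State, Cause.Name) %>%
  mutate(prob = n / sum(n)) %>%
  arrange(prob)

print(combo_prob)
```

When every combination has the same count, every `prob` value is identical, and row counts alone cannot rank the combinations by rarity.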

Hypothesis

A testable hypothesis is that rare state–cause combinations are driven by the interaction of low national prevalence and smaller population sizes.
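One possible way to probe this hypothesis, sketched under stated assumptions, is to compare each combination's observed deaths with the count expected if state size and cause prevalence acted independently (state total times cause total, divided by the grand total). The toy numbers are invented for illustration, `Deaths` and `State` are assumed column names, and the real data would first need to be summed across years.

```r
library(dplyr)

# Hypothetical, already-aggregated toy data (one row per state and cause).
# `State` and `Deaths` are assumed column names; the values are made up.
toy <- data.frame(
  State = rep(c("Texas", "Wyoming"), each = 2),
  Cause.Name = rep(c("Heart disease", "Suicide"), times = 2),
  Deaths = c(4000, 300, 100, 20)
)

grand_total <- sum(toy$Deaths)

expected_vs_observed <- toy %>%
  group_by(State) %>%
  mutate(state_total = sum(Deaths)) %>%      # proxy for state size
  group_by(Cause.Name) %>%
  mutate(cause_total = sum(Deaths)) %>%      # proxy for cause prevalence
  ungroup() %>%
  mutate(
    expected = state_total * cause_total / grand_total,
    ratio = Deaths / expected                # > 1 means more deaths than expected
  ) %>%
  arrange(Deaths)

print(expected_vs_observed)
```

If the smallest observed counts line up with the smallest expected counts, that is consistent with the hypothesis; ratios far from 1 point to combinations that size and prevalence alone do not explain.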

Visualization

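# One horizontal bar per cause. geom_col() stacks the per-state rows,
# so each bar's length is the total n for that cause across all states.
ggplot(state_cause_combo, aes(
  x = n,
  y = reorder(Cause.Name, n)  # order causes by their mean n
)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Frequency of State–Cause Combinations",
    x = "Number of Observations",
    y = "Cause of Death"
  ) +
  theme_minimal()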

Weekly Data Dive Summary

This week’s data dive examined how probability and rarity emerge from grouping observations within the dataset. By analyzing causes of death, states, years, and state–cause combinations, low‑probability (anomalous) groups were identified based on their relatively small number of observations.

These findings demonstrate how probability can be used as a descriptive analytical tool to understand structure and rarity within complex datasets, rather than to infer causation. Future analyses may explore how these group‑level probabilities change when population‑adjusted measures, such as age‑adjusted death rates, are introduced.
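As a rough sketch of that follow-up, the grouped summary below reports the row count alongside the mean age-adjusted death rate for each combination. The data are hypothetical, and `Age.adjusted.Death.Rate`, `State`, and `Year` are assumed column names.

```r
library(dplyr)

# Hypothetical toy data: 2 states x 2 causes x 2 years.
# `Age.adjusted.Death.Rate` is an assumed column name; the values are made up.
toy <- data.frame(
  State = rep(c("Iowa", "Maine"), each = 4),
  Cause.Name = rep(rep(c("Cancer", "Suicide"), each = 2), times = 2),
  Year = rep(c(1999, 2000), times = 4),
  Age.adjusted.Death.Rate = c(190, 186, 11, 12, 205, 201, 13, 14)
)

# Row counts and mean rates per state and cause combination.
rate_summary <- toy %>%
  group_by(State, Cause.Name) %>%
  summarise(
    n = n(),
    mean_rate = mean(Age.adjusted.Death.Rate),
    .groups = "drop"
  ) %>%
  arrange(mean_rate)

print(rate_summary)
```

In this toy example every group has the same `n`, yet the mean rates still separate the combinations, which is the kind of contrast a rate-based analysis could surface.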

Dataset Summary

Brian Blandino

2026-01-27
