This dataset comes from the CDC and records the leading causes of death by year, cause, location (state), number of deaths, and age-adjusted death rate. I will analyze the data week by week, looking for patterns and answering questions that arise along the way.
My goal is to show which causes of death are most prevalent across the states and how those causes have changed over time.
Once this surface-level analysis is complete, I will peel back the layers of the data to understand the possible why and how behind the results. This may lead to further data dives to help explain the findings.
To begin, I will complete a basic analysis and visualization of the vectors, then formulate questions from the findings to answer through further investigation of the data.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
deaths <- read.csv('NCHS.csv')
In this dataset, the Deaths and Age.adjusted.Death.Rate columns are read in as character strings because their values contain commas as thousands separators. As a result, basic R functions such as min() and max() cannot be applied to them meaningfully. The columns must be converted to numeric before any meaningful insights can be extracted.
# Strip the thousands separators, then convert both columns to numeric
deaths <- deaths |>
  mutate(
    Deaths = as.numeric(gsub(",", "", Deaths)),
    Age.adjusted.Death.Rate = as.numeric(gsub(",", "", Age.adjusted.Death.Rate))
  )
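To illustrate why this conversion is needed, here is a minimal sketch on a made-up character vector (the values are illustrative and are not rows taken from the dataset). Calling as.numeric() directly on comma-formatted strings yields NA, whereas removing the commas first gives usable numbers.

```r
# Hypothetical comma-formatted counts, as read.csv() would import them
raw <- c("1,234", "56", "2,813,503")

# Direct conversion fails for any value that contains a comma (gives NA)
direct <- suppressWarnings(as.numeric(raw))

# Removing the thousands separators first gives proper numerics
cleaned <- as.numeric(gsub(",", "", raw, fixed = TRUE))

print(direct)
print(cleaned)
print(max(cleaned))
```

If readr is already loaded, readr::parse_number() is an alternative that handles grouping marks in one step.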
deaths |>
summarise(
min_year = min(Year, na.rm = TRUE),
max_year = max(Year, na.rm = TRUE),
q1_year = quantile(Year, 0.25, na.rm = TRUE),
med_year = median(Year, na.rm = TRUE),
q3_year = quantile(Year, 0.75, na.rm = TRUE)
)
## min_year max_year q1_year med_year q3_year
## 1 1999 2017 2003 2008 2013
The Year variable spans 1999 to 2017, so the dataset covers 19 years. The quartiles (2003, 2008, and 2013) are evenly spaced across that range, which is what an even distribution of observations over the years would produce; no period appears to be over- or under-represented. This matters when interpreting trends, because patterns observed over time are then unlikely to be an artifact of uneven data density. Further investigation could explore whether key changes in outcomes align with particular periods within this range.
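One way to check how evenly the years are represented is to compare the observed quartiles with those of a perfectly balanced panel. The sketch below builds a synthetic Year vector with the same number of rows in every year from 1999 to 2017. The figure of 572 rows per year is an assumption, taken from 11 cause categories times 52 locations.

```r
# Synthetic, perfectly balanced panel: every year appears equally often.
# 11 cause categories x 52 locations = 572 rows per year (assumed structure).
rows_per_year <- 11 * 52
years <- rep(1999:2017, each = rows_per_year)

# Quartiles of the balanced panel (default quantile type)
balanced_q <- quantile(years, probs = c(0.25, 0.5, 0.75))
print(balanced_q)

# Distinct row counts per year; a single value means the panel is balanced
rows_by_year <- unique(as.vector(table(years)))
print(rows_by_year)
```

The balanced panel gives quartiles of 2003, 2008, and 2013, the same values reported for the real data. Running table(deaths$Year) on the actual data frame would confirm whether each year really has the same number of rows.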
deaths |>
summarise(
min_death = min(Deaths, na.rm = TRUE),
max_death = max(Deaths, na.rm = TRUE),
avg_death = mean(Deaths, na.rm = TRUE),
q1_death = quantile(Deaths, 0.25, na.rm = TRUE),
med_death = median(Deaths, na.rm = TRUE),
q3_death = quantile(Deaths, 0.75, na.rm = TRUE)
)
## min_death max_death avg_death q1_death med_death q3_death
## 1 21 2813503 15459.91 612 1718.5 5756.5
The Deaths variable ranges from 21 to 2,813,503, indicating substantial variation in the number of deaths across observations. The mean (about 15,460) is roughly nine times the median (1,718.5), which points to a strongly right-skewed distribution in which a small number of very large counts pull the average upward. Half of all observations fall between 612 and 5,756.5 deaths. The largest values most likely come from aggregate rows, since the data include an "All causes" category and a "United States" entry alongside the individual causes and states; these rows should be separated out before comparing states or causes.
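The effect of aggregate rows on the mean and median can be seen in a small sketch. The data frame below is entirely made up (two hypothetical states plus a national row); it only illustrates how including a total alongside its parts pulls the summary statistics upward.

```r
# Toy data (made-up numbers): two states plus a national aggregate row
toy <- data.frame(
  State  = c("A", "B", "United States", "A", "B", "United States"),
  Cause  = c("Cancer", "Cancer", "Cancer", "Stroke", "Stroke", "Stroke"),
  Deaths = c(100, 300, 400, 20, 60, 80)
)

# Summary statistics with the aggregate rows included
mean_all   <- mean(toy$Deaths)
median_all <- median(toy$Deaths)

# Summary statistics after dropping the aggregate rows
states_only   <- toy[toy$State != "United States", ]
mean_states   <- mean(states_only$Deaths)
median_states <- median(states_only$Deaths)

print(c(mean_all = mean_all, median_all = median_all))
print(c(mean_states = mean_states, median_states = median_states))
```

In this toy case both statistics drop once the national rows are removed, which is why it matters to know which rows a summary includes.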
deaths |>
summarise(
min_aadr = min(Age.adjusted.Death.Rate, na.rm = TRUE),
max_aadr = max(Age.adjusted.Death.Rate, na.rm = TRUE),
avg_aadr = mean(Age.adjusted.Death.Rate, na.rm = TRUE),
q1_aadr = quantile(Age.adjusted.Death.Rate, 0.25, na.rm = TRUE),
med_aadr = median(Age.adjusted.Death.Rate, na.rm = TRUE),
q3_aadr = quantile(Age.adjusted.Death.Rate, 0.75, na.rm = TRUE)
)
## min_aadr max_aadr avg_aadr q1_aadr med_aadr q3_aadr
## 1 2.6 1087.3 127.5639 19.2 35.9 151.725
The age-adjusted death rate ranges from 2.6 to 1,087.3, a far narrower spread than the raw death counts. It is still right-skewed: the mean (127.6) is well above the median (35.9), and the interquartile range runs from 19.2 to 151.7. The highest rates presumably belong to the "All causes" rows. Because the rate accounts for population age structure, it is better suited than raw counts for comparisons across groups or time periods, and observed differences are less driven by demographic composition and more likely related to underlying risk factors. Further analysis could examine how these rates change over time or differ across categories.
deaths |>
count(X113.Cause.Name)
## X113.Cause.Name n
## 1 Accidents (unintentional injuries) (V01-X59,Y85-Y86) 988
## 2 All Causes 988
## 3 Alzheimer's disease (G30) 988
## 4 Cerebrovascular diseases (I60-I69) 988
## 5 Chronic lower respiratory diseases (J40-J47) 988
## 6 Diabetes mellitus (E10-E14) 988
## 7 Diseases of heart (I00-I09,I11,I13,I20-I51) 988
## 8 Influenza and pneumonia (J09-J18) 988
## 9 Intentional self-harm (suicide) (*U03,X60-X84,Y87.0) 988
## 10 Malignant neoplasms (C00-C97) 988
## 11 Nephritis, nephrotic syndrome and nephrosis (N00-N07,N17-N19,N25-N27) 988
deaths |>
count(Cause.Name)
## Cause.Name n
## 1 All causes 988
## 2 Alzheimer's disease 988
## 3 CLRD 988
## 4 Cancer 988
## 5 Diabetes 988
## 6 Heart disease 988
## 7 Influenza and pneumonia 988
## 8 Kidney disease 988
## 9 Stroke 988
## 10 Suicide 988
## 11 Unintentional injuries 988
deaths |>
count(State)
## State n
## 1 Alabama 209
## 2 Alaska 209
## 3 Arizona 209
## 4 Arkansas 209
## 5 California 209
## 6 Colorado 209
## 7 Connecticut 209
## 8 Delaware 209
## 9 District of Columbia 209
## 10 Florida 209
## 11 Georgia 209
## 12 Hawaii 209
## 13 Idaho 209
## 14 Illinois 209
## 15 Indiana 209
## 16 Iowa 209
## 17 Kansas 209
## 18 Kentucky 209
## 19 Louisiana 209
## 20 Maine 209
## 21 Maryland 209
## 22 Massachusetts 209
## 23 Michigan 209
## 24 Minnesota 209
## 25 Mississippi 209
## 26 Missouri 209
## 27 Montana 209
## 28 Nebraska 209
## 29 Nevada 209
## 30 New Hampshire 209
## 31 New Jersey 209
## 32 New Mexico 209
## 33 New York 209
## 34 North Carolina 209
## 35 North Dakota 209
## 36 Ohio 209
## 37 Oklahoma 209
## 38 Oregon 209
## 39 Pennsylvania 209
## 40 Rhode Island 209
## 41 South Carolina 209
## 42 South Dakota 209
## 43 Tennessee 209
## 44 Texas 209
## 45 United States 209
## 46 Utah 209
## 47 Vermont 209
## 48 Virginia 209
## 49 Washington 209
## 50 West Virginia 209
## 51 Wisconsin 209
## 52 Wyoming 209
When examining the two cause-name vectors, it initially struck me as odd that every unique value occurs the same number of times (988). On closer inspection the reason is clear: the dataset covers 19 years and 52 locations (the 50 states plus the District of Columbia and the United States aggregate), and 19 * 52 = 988.
The same question arises for the states, since each unique value occurs 209 times. This is because there are 11 cause categories (10 individual causes plus "All causes") and 19 years, and 11 * 19 = 209.
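The row counts can be reproduced by rebuilding the implied grid of years, locations, and cause categories. This is a sketch with placeholder labels (the location and cause names are hypothetical stand-ins), assuming every combination appears exactly once.

```r
# Rebuild the implied grid of years x locations x cause categories
years     <- 1999:2017                 # 19 years
locations <- paste("Location", 1:52)   # placeholders: 50 states + DC + United States
causes    <- paste("Cause", 1:11)      # placeholders: 10 causes + "All causes"

grid <- expand.grid(
  Year = years,
  State = locations,
  Cause.Name = causes,
  stringsAsFactors = FALSE
)

rows_per_cause <- unique(as.vector(table(grid$Cause.Name)))  # 19 * 52
rows_per_state <- unique(as.vector(table(grid$State)))       # 11 * 19
total_rows     <- nrow(grid)                                 # 19 * 52 * 11

print(c(rows_per_cause = rows_per_cause,
        rows_per_state = rows_per_state,
        total_rows = total_rows))
```

Comparing total_rows with nrow(deaths) on the real data would confirm whether the dataset is a complete grid with no missing combinations.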
deaths |>
  # Exclude the aggregate rows so that no death is counted twice
  filter(
    Cause.Name != "All causes",
    State != "United States"
  ) |>
  group_by(Cause.Name) |>
  summarise(total_deaths = sum(Deaths, na.rm = TRUE)) |>
  ggplot(aes(
    x = total_deaths,
    y = reorder(Cause.Name, total_deaths)
  )) +
  geom_col(fill = "steelblue") +
  scale_x_continuous(
    labels = label_number(scale = 1e-6, suffix = "M")
  ) +
  labs(
    title = "Total Deaths by Cause",
    x = "Total Deaths (Millions)",
    y = "Cause of Death"
  ) +
  theme_minimal()
This visualization summarizes total deaths aggregated by cause across the entire dataset, highlighting which causes contribute the largest share of cumulative mortality. Because every cause is reported for the same locations and years, the ordering reflects differences in the number of deaths recorded rather than differences in reporting coverage, and it does not describe risk at the individual level. The chart is therefore best interpreted as a descriptive overview of how mortality is distributed across causes in the dataset, rather than as a comparison of severity or likelihood. Further analysis using age-adjusted rates or time-specific trends would provide more meaningful insight into relative risk.
deaths |>
  # Keep state-level rows only; the "United States" row already holds the national total
  filter(
    Cause.Name == "All causes",
    State != "United States"
  ) |>
  group_by(Year) |>
  summarise(total_deaths = sum(Deaths, na.rm = TRUE)) |>
  ggplot(aes(
    x = Year,
    y = total_deaths
  )) +
  geom_line(color = "steelblue", linewidth = 1) +
  scale_x_continuous(
    breaks = seq(min(deaths$Year), max(deaths$Year), by = 2)
  ) +
  scale_y_continuous(
    limits = c(0, NA),
    breaks = seq(0, 3e6, by = 5e5),
    labels = label_number(scale = 1e-6, suffix = "M")
  ) +
  labs(
    title = "Total Deaths per Year (All Causes)",
    x = "Year",
    y = "Total Deaths (Millions)"
  ) +
  theme_minimal()
This visualization shows total deaths per year across all causes, providing a high-level view of how overall mortality changes over time in the dataset. Because the values are raw totals, an increase over time may simply reflect population growth and should be read as descriptive of scale, not necessarily as evidence of worsening health outcomes or of year-to-year shifts in individual risk. This view is most useful for identifying broad temporal patterns and contextualizing more detailed analyses, such as cause-specific or population-adjusted death rates, which are better suited for comparative interpretation.
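Because the State vector includes a "United States" entry alongside the individual states, it is worth checking whether a yearly sum mixes state rows with the national row. The sketch below uses made-up numbers for two hypothetical states to show how such a sum counts every death twice.

```r
# Toy data (made-up numbers): two states plus a national row for each year
toy <- data.frame(
  Year   = c(2016, 2016, 2016, 2017, 2017, 2017),
  State  = c("A", "B", "United States", "A", "B", "United States"),
  Deaths = c(10, 30, 40, 12, 33, 45)
)

# Summing every row counts each death in its state and again in the national row
naive_total <- tapply(toy$Deaths, toy$Year, sum)

# Summing the state rows only
is_state    <- toy$State != "United States"
state_total <- tapply(toy$Deaths[is_state], toy$Year[is_state], sum)

# National rows, for comparison with the state sums
national <- toy$Deaths[!is_state]

print(naive_total)
print(state_total)
print(national)
```

On the real data, the same comparison (state sums against the "United States" rows) would show whether the national row is an exact total of the states.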
deaths |>
filter(
Cause.Name == "All causes",
State != "United States"
) |>
group_by(State) |>
summarise(total_deaths = sum(Deaths, na.rm = TRUE)) |>
slice_max(total_deaths, n = 10) |>
ggplot(aes(
x = total_deaths,
y = reorder(State, total_deaths)
)) +
geom_col(fill = "steelblue") +
scale_x_continuous(
limits = c(0, NA),
labels = label_number(scale = 1e-6, suffix = "M")
) +
labs(
title = "Top 10 States by Total Deaths (All Years, All Causes)",
x = "Total Deaths (Millions)",
y = "State"
) +
theme_minimal()
The states appearing in the top 10 by total deaths largely reflect differences in population size rather than underlying risk. Every state is reported for the same 19 years, so the ranking is not affected by differing reporting periods; it is driven by raw death counts, which do not account for population size or demographic structure. As a result, these rankings should be interpreted as descriptive of cumulative totals within the dataset rather than as an indicator of relative severity or individual risk. This highlights the importance of complementing total counts with population-adjusted measures, such as age-adjusted death rates, for more meaningful comparisons.
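The difference between ranking by raw counts and ranking by an adjusted rate can be sketched with three hypothetical states. All names and numbers below are invented for illustration and do not come from the dataset.

```r
# Toy data (made-up numbers): raw counts and hypothetical age-adjusted rates
toy <- data.frame(
  State  = c("Large", "Medium", "Small"),
  Deaths = c(250000, 60000, 9000),
  Rate   = c(650.2, 810.5, 905.1)
)

# Ranking by raw death counts favors the most populous state
by_count <- toy$State[order(-toy$Deaths)]

# Ranking by the adjusted rate can reverse the order entirely
by_rate <- toy$State[order(-toy$Rate)]

print(by_count)
print(by_rate)
```

In this toy case the two rankings are exact opposites, which is the situation the raw-count chart cannot reveal.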
This first data dive provided an opportunity to understand the dataset and identify which vectors needed to be cleaned before analysis. At this point, the minimums, maximums, averages, and quantiles of each numerical vector have been analyzed. Furthermore, the categorical vectors have been examined to see which unique values exist and how many times each appears in the dataset.
Visualizations were created to answer three basic analytical questions: which causes account for the most total deaths, how total deaths change from year to year, and which states record the most total deaths.
These findings lay the foundation for analyzing the dataset and for further examination over the upcoming weeks. Possible questions from this introductory dive include how age-adjusted death rates differ across causes and states and how they change over time.
The next analysis may focus on these questions and others as deeper dives into the data occur.
The goal of this week’s data dive is to examine groups of observations within the dataset and assess how likely those groups are to occur given the overall structure of the data. This type of analysis introduces a basic form of anomaly detection, where less frequent groups are interpreted as having lower probability of occurrence relative to more common groups.
The focus of this analysis is not to infer causation, but rather to identify rare versus common groupings, interpret what those groupings represent in context, and formulate testable hypotheses for why certain groups appear less frequently than others.
cause_group <- deaths |>
  # Exclude the aggregate rows so that no death is counted twice
  filter(
    Cause.Name != "All causes",
    State != "United States"
  ) |>
  group_by(Cause.Name) |>
  summarise(
    total_deaths = sum(Deaths, na.rm = TRUE),
    count = n()
  )
cause_group
## # A tibble: 10 × 3
## Cause.Name total_deaths count
## <chr> <dbl> <int>
## 1 Alzheimer's disease 1494816 969
## 2 CLRD 2594927 969
## 3 Cancer 10843644 969
## 4 Diabetes 1399943 969
## 5 Heart disease 12222640 969
## 6 Influenza and pneumonia 1094641 969
## 7 Kidney disease 858613 969
## 8 Stroke 2726523 969
## 9 Suicide 697016 969
## 10 Unintentional injuries 2347820 969
Each group in this data frame represents a specific cause of death, and count gives the number of rows for that cause across all locations and years. If a single row were randomly selected from the dataset, the probability of drawing a particular cause would equal that cause's row count divided by the total number of rows.
In this dataset every cause has the same number of rows, so each of the ten causes is equally likely to be drawn, with probability 1/10. Row counts therefore reflect the reporting structure and categorical coverage of the data and say nothing about rarity or medical severity.
Because no cause has fewer rows than the others, rarity has to be judged by total_deaths instead. On that measure Suicide has the smallest total and Kidney disease the second smallest: among these ten causes, a randomly selected recorded death is least likely to fall in these categories. Suicide is therefore tagged as the low-probability (anomalous) group.
A testable hypothesis is that causes with a lower share of total deaths correspond to conditions that are less prevalent nationally. Since every cause is reported for all 19 years, a later introduction into the reporting timeline can be ruled out as an explanation.
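The selection probabilities discussed here can be computed directly from a grouped summary. The sketch below uses a made-up summary table with placeholder cause names, and contrasts the probability of drawing a row from each group with the probability of drawing a death from each group.

```r
# Made-up group summary: equal row counts, unequal death totals
groups <- data.frame(
  Cause        = c("Cause A", "Cause B", "Cause C", "Cause D"),
  count        = c(50, 50, 50, 50),
  total_deaths = c(6000, 2500, 1000, 500)
)

# Probability that a randomly drawn row belongs to each group
groups$p_row <- groups$count / sum(groups$count)

# Probability that a randomly drawn death belongs to each group
groups$p_death <- groups$total_deaths / sum(groups$total_deaths)

# Flag the group with the smallest share of deaths as the low-probability group
groups$low_prob <- groups$p_death == min(groups$p_death)

print(groups)
```

When every group has the same row count, p_row is identical across groups and cannot single out an anomaly; p_death is the measure that separates common from rare groups.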
ggplot(cause_group, aes(
x = total_deaths,
y = reorder(Cause.Name, total_deaths)
)) +
geom_col(fill = "steelblue") +
scale_x_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
labs(
title = "Total Deaths by Cause",
x = "Total Deaths (Millions)",
y = "Cause of Death"
) +
theme_minimal()
state_group <- deaths |>
filter(
Cause.Name == "All causes",
State != "United States"
) |>
group_by(State) |>
summarise(
total_deaths = sum(Deaths, na.rm = TRUE),
count = n()
)
state_group
## # A tibble: 51 × 3
## State total_deaths count
## <chr> <dbl> <int>
## 1 Alabama 914067 19
## 2 Alaska 67789 19
## 3 Arizona 895865 19
## 4 Arkansas 555553 19
## 5 California 4575252 19
## 6 Colorado 599361 19
## 7 Connecticut 562638 19
## 8 Delaware 145173 19
## 9 District of Columbia 99121 19
## 10 Florida 3334759 19
## # ℹ 41 more rows
Each state represents a group of observations aggregated across all years. If a row were randomly selected from this filtered data, the probability of selecting a particular state would equal its row count divided by the total number of rows.
Every state has exactly 19 rows (one per year), so all 51 locations are equally likely in a random row draw, with probability 1/51. Row counts do not vary with population size or total mortality volume.
Low-probability groups therefore have to be identified by total_deaths rather than by row count. Among the rows displayed above, Alaska (67,789), the District of Columbia (99,121), and Delaware (145,173) have the smallest totals. States like these are tagged as low-probability (anomalous) groups, reflecting structural characteristics of the dataset rather than missing data.
A testable hypothesis is that states with smaller populations consistently account for the smallest shares of total deaths in every year.
ggplot(state_group, aes(
x = total_deaths,
y = reorder(State, total_deaths)
)) +
geom_col(fill = "steelblue") +
scale_x_continuous(
labels = label_number(scale = 1e-6, suffix = "M")
) +
labs(
title = "Total Deaths by State (All Causes)",
x = "Total Deaths (Millions)",
y = "State"
) +
theme_minimal() +
theme(
axis.text.y = element_text(size = 8),
plot.margin = margin(10, 10, 10, 20)
)
year_group <- deaths |>
  # Keep state-level rows only; the "United States" row already holds the national total
  filter(
    Cause.Name == "All causes",
    State != "United States"
  ) |>
  group_by(Year) |>
  summarise(
    total_deaths = sum(Deaths, na.rm = TRUE),
    count = n()
  )
year_group
## # A tibble: 19 × 3
## Year total_deaths count
## <int> <dbl> <int>
## 1 1999 2391399 51
## 2 2000 2403351 51
## 3 2001 2416425 51
## 4 2002 2443387 51
## 5 2003 2448288 51
## 6 2004 2397615 51
## 7 2005 2448017 51
## 8 2006 2426264 51
## 9 2007 2423712 51
## 10 2008 2471984 51
## 11 2009 2437163 51
## 12 2010 2468435 51
## 13 2011 2515458 51
## 14 2012 2543279 51
## 15 2013 2596993 51
## 16 2014 2626418 51
## 17 2015 2712630 51
## 18 2016 2744248 51
## 19 2017 2813503 51
Each year represents a group with exactly the same number of observations, as the dataset structure is consistent across time. As a result, the probability of selecting any particular year in a random row draw is equal: 1/19.
Because no year has fewer observations than the others, no low-probability or anomalous group is observed at the year level. This suggests that rarity in this dataset is more likely to emerge from categorical or geographic groupings than from time alone.
A testable hypothesis is that notable changes in total deaths by year are driven by demographic shifts or exceptional events rather than sampling imbalance.
ggplot(year_group, aes(
x = Year,
y = total_deaths
)) +
geom_line(color = "steelblue", linewidth = 1) +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
labs(
title = "Total Deaths by Year",
x = "Year",
y = "Total Deaths (Millions)"
) +
theme_minimal()
state_cause_combo <- deaths |>
filter(
State != "United States",
Cause.Name != "All causes"
) |>
count(State, Cause.Name)
state_cause_combo
## State Cause.Name n
## 1 Alabama Alzheimer's disease 19
## 2 Alabama CLRD 19
## 3 Alabama Cancer 19
## 4 Alabama Diabetes 19
## 5 Alabama Heart disease 19
## 6 Alabama Influenza and pneumonia 19
## 7 Alabama Kidney disease 19
## 8 Alabama Stroke 19
## 9 Alabama Suicide 19
## 10 Alabama Unintentional injuries 19
## 11 Alaska Alzheimer's disease 19
## 12 Alaska CLRD 19
## 13 Alaska Cancer 19
## 14 Alaska Diabetes 19
## 15 Alaska Heart disease 19
## 16 Alaska Influenza and pneumonia 19
## 17 Alaska Kidney disease 19
## 18 Alaska Stroke 19
## 19 Alaska Suicide 19
## 20 Alaska Unintentional injuries 19
## 21 Arizona Alzheimer's disease 19
## 22 Arizona CLRD 19
## 23 Arizona Cancer 19
## 24 Arizona Diabetes 19
## 25 Arizona Heart disease 19
## 26 Arizona Influenza and pneumonia 19
## 27 Arizona Kidney disease 19
## 28 Arizona Stroke 19
## 29 Arizona Suicide 19
## 30 Arizona Unintentional injuries 19
## 31 Arkansas Alzheimer's disease 19
## 32 Arkansas CLRD 19
## 33 Arkansas Cancer 19
## 34 Arkansas Diabetes 19
## 35 Arkansas Heart disease 19
## 36 Arkansas Influenza and pneumonia 19
## 37 Arkansas Kidney disease 19
## 38 Arkansas Stroke 19
## 39 Arkansas Suicide 19
## 40 Arkansas Unintentional injuries 19
## 41 California Alzheimer's disease 19
## 42 California CLRD 19
## 43 California Cancer 19
## 44 California Diabetes 19
## 45 California Heart disease 19
## 46 California Influenza and pneumonia 19
## 47 California Kidney disease 19
## 48 California Stroke 19
## 49 California Suicide 19
## 50 California Unintentional injuries 19
## 51 Colorado Alzheimer's disease 19
## 52 Colorado CLRD 19
## 53 Colorado Cancer 19
## 54 Colorado Diabetes 19
## 55 Colorado Heart disease 19
## 56 Colorado Influenza and pneumonia 19
## 57 Colorado Kidney disease 19
## 58 Colorado Stroke 19
## 59 Colorado Suicide 19
## 60 Colorado Unintentional injuries 19
## 61 Connecticut Alzheimer's disease 19
## 62 Connecticut CLRD 19
## 63 Connecticut Cancer 19
## 64 Connecticut Diabetes 19
## 65 Connecticut Heart disease 19
## 66 Connecticut Influenza and pneumonia 19
## 67 Connecticut Kidney disease 19
## 68 Connecticut Stroke 19
## 69 Connecticut Suicide 19
## 70 Connecticut Unintentional injuries 19
## 71 Delaware Alzheimer's disease 19
## 72 Delaware CLRD 19
## 73 Delaware Cancer 19
## 74 Delaware Diabetes 19
## 75 Delaware Heart disease 19
## 76 Delaware Influenza and pneumonia 19
## 77 Delaware Kidney disease 19
## 78 Delaware Stroke 19
## 79 Delaware Suicide 19
## 80 Delaware Unintentional injuries 19
## 81 District of Columbia Alzheimer's disease 19
## 82 District of Columbia CLRD 19
## 83 District of Columbia Cancer 19
## 84 District of Columbia Diabetes 19
## 85 District of Columbia Heart disease 19
## 86 District of Columbia Influenza and pneumonia 19
## 87 District of Columbia Kidney disease 19
## 88 District of Columbia Stroke 19
## 89 District of Columbia Suicide 19
## 90 District of Columbia Unintentional injuries 19
## 91 Florida Alzheimer's disease 19
## 92 Florida CLRD 19
## 93 Florida Cancer 19
## 94 Florida Diabetes 19
## 95 Florida Heart disease 19
## 96 Florida Influenza and pneumonia 19
## 97 Florida Kidney disease 19
## 98 Florida Stroke 19
## 99 Florida Suicide 19
## 100 Florida Unintentional injuries 19
## 101 Georgia Alzheimer's disease 19
## 102 Georgia CLRD 19
## 103 Georgia Cancer 19
## 104 Georgia Diabetes 19
## 105 Georgia Heart disease 19
## 106 Georgia Influenza and pneumonia 19
## 107 Georgia Kidney disease 19
## 108 Georgia Stroke 19
## 109 Georgia Suicide 19
## 110 Georgia Unintentional injuries 19
## 111 Hawaii Alzheimer's disease 19
## 112 Hawaii CLRD 19
## 113 Hawaii Cancer 19
## 114 Hawaii Diabetes 19
## 115 Hawaii Heart disease 19
## 116 Hawaii Influenza and pneumonia 19
## 117 Hawaii Kidney disease 19
## 118 Hawaii Stroke 19
## 119 Hawaii Suicide 19
## 120 Hawaii Unintentional injuries 19
## 121 Idaho Alzheimer's disease 19
## 122 Idaho CLRD 19
## 123 Idaho Cancer 19
## 124 Idaho Diabetes 19
## 125 Idaho Heart disease 19
## 126 Idaho Influenza and pneumonia 19
## 127 Idaho Kidney disease 19
## 128 Idaho Stroke 19
## 129 Idaho Suicide 19
## 130 Idaho Unintentional injuries 19
## 131 Illinois Alzheimer's disease 19
## 132 Illinois CLRD 19
## 133 Illinois Cancer 19
## 134 Illinois Diabetes 19
## 135 Illinois Heart disease 19
## 136 Illinois Influenza and pneumonia 19
## 137 Illinois Kidney disease 19
## 138 Illinois Stroke 19
## 139 Illinois Suicide 19
## 140 Illinois Unintentional injuries 19
## 141 Indiana Alzheimer's disease 19
## 142 Indiana CLRD 19
## 143 Indiana Cancer 19
## 144 Indiana Diabetes 19
## 145 Indiana Heart disease 19
## 146 Indiana Influenza and pneumonia 19
## 147 Indiana Kidney disease 19
## 148 Indiana Stroke 19
## 149 Indiana Suicide 19
## 150 Indiana Unintentional injuries 19
## 151 Iowa Alzheimer's disease 19
## 152 Iowa CLRD 19
## 153 Iowa Cancer 19
## 154 Iowa Diabetes 19
## 155 Iowa Heart disease 19
## 156 Iowa Influenza and pneumonia 19
## 157 Iowa Kidney disease 19
## 158 Iowa Stroke 19
## 159 Iowa Suicide 19
## 160 Iowa Unintentional injuries 19
## 161 Kansas Alzheimer's disease 19
## 162 Kansas CLRD 19
## 163 Kansas Cancer 19
## 164 Kansas Diabetes 19
## 165 Kansas Heart disease 19
## 166 Kansas Influenza and pneumonia 19
## 167 Kansas Kidney disease 19
## 168 Kansas Stroke 19
## 169 Kansas Suicide 19
## 170 Kansas Unintentional injuries 19
## 171 Kentucky Alzheimer's disease 19
## 172 Kentucky CLRD 19
## 173 Kentucky Cancer 19
## 174 Kentucky Diabetes 19
## 175 Kentucky Heart disease 19
## 176 Kentucky Influenza and pneumonia 19
## 177 Kentucky Kidney disease 19
## 178 Kentucky Stroke 19
## 179 Kentucky Suicide 19
## 180 Kentucky Unintentional injuries 19
## 181 Louisiana Alzheimer's disease 19
## 182 Louisiana CLRD 19
## 183 Louisiana Cancer 19
## 184 Louisiana Diabetes 19
## 185 Louisiana Heart disease 19
## 186 Louisiana Influenza and pneumonia 19
## 187 Louisiana Kidney disease 19
## 188 Louisiana Stroke 19
## 189 Louisiana Suicide 19
## 190 Louisiana Unintentional injuries 19
## 191 Maine Alzheimer's disease 19
## 192 Maine CLRD 19
## 193 Maine Cancer 19
## 194 Maine Diabetes 19
## 195 Maine Heart disease 19
## 196 Maine Influenza and pneumonia 19
## 197 Maine Kidney disease 19
## 198 Maine Stroke 19
## 199 Maine Suicide 19
## 200 Maine Unintentional injuries 19
## 201 Maryland Alzheimer's disease 19
## 202 Maryland CLRD 19
## 203 Maryland Cancer 19
## 204 Maryland Diabetes 19
## 205 Maryland Heart disease 19
## 206 Maryland Influenza and pneumonia 19
## 207 Maryland Kidney disease 19
## 208 Maryland Stroke 19
## 209 Maryland Suicide 19
## 210 Maryland Unintentional injuries 19
## 211 Massachusetts Alzheimer's disease 19
## 212 Massachusetts CLRD 19
## 213 Massachusetts Cancer 19
## 214 Massachusetts Diabetes 19
## 215 Massachusetts Heart disease 19
## 216 Massachusetts Influenza and pneumonia 19
## 217 Massachusetts Kidney disease 19
## 218 Massachusetts Stroke 19
## 219 Massachusetts Suicide 19
## 220 Massachusetts Unintentional injuries 19
## 221 Michigan Alzheimer's disease 19
## 222 Michigan CLRD 19
## 223 Michigan Cancer 19
## 224 Michigan Diabetes 19
## 225 Michigan Heart disease 19
## 226 Michigan Influenza and pneumonia 19
## 227 Michigan Kidney disease 19
## 228 Michigan Stroke 19
## 229 Michigan Suicide 19
## 230 Michigan Unintentional injuries 19
## 231 Minnesota Alzheimer's disease 19
## 232 Minnesota CLRD 19
## 233 Minnesota Cancer 19
## 234 Minnesota Diabetes 19
## 235 Minnesota Heart disease 19
## 236 Minnesota Influenza and pneumonia 19
## 237 Minnesota Kidney disease 19
## 238 Minnesota Stroke 19
## 239 Minnesota Suicide 19
## 240 Minnesota Unintentional injuries 19
## 241 Mississippi Alzheimer's disease 19
## 242 Mississippi CLRD 19
## 243 Mississippi Cancer 19
## 244 Mississippi Diabetes 19
## 245 Mississippi Heart disease 19
## 246 Mississippi Influenza and pneumonia 19
## 247 Mississippi Kidney disease 19
## 248 Mississippi Stroke 19
## 249 Mississippi Suicide 19
## 250 Mississippi Unintentional injuries 19
## 251 Missouri Alzheimer's disease 19
## 252 Missouri CLRD 19
## 253 Missouri Cancer 19
## 254 Missouri Diabetes 19
## 255 Missouri Heart disease 19
## 256 Missouri Influenza and pneumonia 19
## 257 Missouri Kidney disease 19
## 258 Missouri Stroke 19
## 259 Missouri Suicide 19
## 260 Missouri Unintentional injuries 19
## 261 Montana Alzheimer's disease 19
## 262 Montana CLRD 19
## 263 Montana Cancer 19
## 264 Montana Diabetes 19
## 265 Montana Heart disease 19
## 266 Montana Influenza and pneumonia 19
## 267 Montana Kidney disease 19
## 268 Montana Stroke 19
## 269 Montana Suicide 19
## 270 Montana Unintentional injuries 19
## 271 Nebraska Alzheimer's disease 19
## 272 Nebraska CLRD 19
## 273 Nebraska Cancer 19
## 274 Nebraska Diabetes 19
## 275 Nebraska Heart disease 19
## 276 Nebraska Influenza and pneumonia 19
## 277 Nebraska Kidney disease 19
## 278 Nebraska Stroke 19
## 279 Nebraska Suicide 19
## 280 Nebraska Unintentional injuries 19
## 281 Nevada Alzheimer's disease 19
## 282 Nevada CLRD 19
## 283 Nevada Cancer 19
## 284 Nevada Diabetes 19
## 285 Nevada Heart disease 19
## 286 Nevada Influenza and pneumonia 19
## 287 Nevada Kidney disease 19
## 288 Nevada Stroke 19
## 289 Nevada Suicide 19
## 290 Nevada Unintentional injuries 19
## 291 New Hampshire Alzheimer's disease 19
## 292 New Hampshire CLRD 19
## 293 New Hampshire Cancer 19
## 294 New Hampshire Diabetes 19
## 295 New Hampshire Heart disease 19
## 296 New Hampshire Influenza and pneumonia 19
## 297 New Hampshire Kidney disease 19
## 298 New Hampshire Stroke 19
## 299 New Hampshire Suicide 19
## 300 New Hampshire Unintentional injuries 19
## 301 New Jersey Alzheimer's disease 19
## 302 New Jersey CLRD 19
## 303 New Jersey Cancer 19
## 304 New Jersey Diabetes 19
## 305 New Jersey Heart disease 19
## 306 New Jersey Influenza and pneumonia 19
## 307 New Jersey Kidney disease 19
## 308 New Jersey Stroke 19
## 309 New Jersey Suicide 19
## 310 New Jersey Unintentional injuries 19
## 311 New Mexico Alzheimer's disease 19
## 312 New Mexico CLRD 19
## 313 New Mexico Cancer 19
## 314 New Mexico Diabetes 19
## 315 New Mexico Heart disease 19
## 316 New Mexico Influenza and pneumonia 19
## 317 New Mexico Kidney disease 19
## 318 New Mexico Stroke 19
## 319 New Mexico Suicide 19
## 320 New Mexico Unintentional injuries 19
## 321 New York Alzheimer's disease 19
## 322 New York CLRD 19
## 323 New York Cancer 19
## 324 New York Diabetes 19
## 325 New York Heart disease 19
## 326 New York Influenza and pneumonia 19
## 327 New York Kidney disease 19
## 328 New York Stroke 19
## 329 New York Suicide 19
## 330 New York Unintentional injuries 19
## 331 North Carolina Alzheimer's disease 19
## 332 North Carolina CLRD 19
## 333 North Carolina Cancer 19
## 334 North Carolina Diabetes 19
## 335 North Carolina Heart disease 19
## 336 North Carolina Influenza and pneumonia 19
## 337 North Carolina Kidney disease 19
## 338 North Carolina Stroke 19
## 339 North Carolina Suicide 19
## 340 North Carolina Unintentional injuries 19
## 341 North Dakota Alzheimer's disease 19
## 342 North Dakota CLRD 19
## 343 North Dakota Cancer 19
## 344 North Dakota Diabetes 19
## 345 North Dakota Heart disease 19
## 346 North Dakota Influenza and pneumonia 19
## 347 North Dakota Kidney disease 19
## 348 North Dakota Stroke 19
## 349 North Dakota Suicide 19
## 350 North Dakota Unintentional injuries 19
## 351 Ohio Alzheimer's disease 19
## 352 Ohio CLRD 19
## 353 Ohio Cancer 19
## 354 Ohio Diabetes 19
## 355 Ohio Heart disease 19
## 356 Ohio Influenza and pneumonia 19
## 357 Ohio Kidney disease 19
## 358 Ohio Stroke 19
## 359 Ohio Suicide 19
## 360 Ohio Unintentional injuries 19
## 361 Oklahoma Alzheimer's disease 19
## 362 Oklahoma CLRD 19
## 363 Oklahoma Cancer 19
## 364 Oklahoma Diabetes 19
## 365 Oklahoma Heart disease 19
## 366 Oklahoma Influenza and pneumonia 19
## 367 Oklahoma Kidney disease 19
## 368 Oklahoma Stroke 19
## 369 Oklahoma Suicide 19
## 370 Oklahoma Unintentional injuries 19
## 371 Oregon Alzheimer's disease 19
## 372 Oregon CLRD 19
## 373 Oregon Cancer 19
## 374 Oregon Diabetes 19
## 375 Oregon Heart disease 19
## 376 Oregon Influenza and pneumonia 19
## 377 Oregon Kidney disease 19
## 378 Oregon Stroke 19
## 379 Oregon Suicide 19
## 380 Oregon Unintentional injuries 19
## 381 Pennsylvania Alzheimer's disease 19
## 382 Pennsylvania CLRD 19
## 383 Pennsylvania Cancer 19
## 384 Pennsylvania Diabetes 19
## 385 Pennsylvania Heart disease 19
## 386 Pennsylvania Influenza and pneumonia 19
## 387 Pennsylvania Kidney disease 19
## 388 Pennsylvania Stroke 19
## 389 Pennsylvania Suicide 19
## 390 Pennsylvania Unintentional injuries 19
## 391 Rhode Island Alzheimer's disease 19
## 392 Rhode Island CLRD 19
## 393 Rhode Island Cancer 19
## 394 Rhode Island Diabetes 19
## 395 Rhode Island Heart disease 19
## 396 Rhode Island Influenza and pneumonia 19
## 397 Rhode Island Kidney disease 19
## 398 Rhode Island Stroke 19
## 399 Rhode Island Suicide 19
## 400 Rhode Island Unintentional injuries 19
## 401 South Carolina Alzheimer's disease 19
## 402 South Carolina CLRD 19
## 403 South Carolina Cancer 19
## 404 South Carolina Diabetes 19
## 405 South Carolina Heart disease 19
## 406 South Carolina Influenza and pneumonia 19
## 407 South Carolina Kidney disease 19
## 408 South Carolina Stroke 19
## 409 South Carolina Suicide 19
## 410 South Carolina Unintentional injuries 19
## 411 South Dakota Alzheimer's disease 19
## 412 South Dakota CLRD 19
## 413 South Dakota Cancer 19
## 414 South Dakota Diabetes 19
## 415 South Dakota Heart disease 19
## 416 South Dakota Influenza and pneumonia 19
## 417 South Dakota Kidney disease 19
## 418 South Dakota Stroke 19
## 419 South Dakota Suicide 19
## 420 South Dakota Unintentional injuries 19
## 421 Tennessee Alzheimer's disease 19
## 422 Tennessee CLRD 19
## 423 Tennessee Cancer 19
## 424 Tennessee Diabetes 19
## 425 Tennessee Heart disease 19
## 426 Tennessee Influenza and pneumonia 19
## 427 Tennessee Kidney disease 19
## 428 Tennessee Stroke 19
## 429 Tennessee Suicide 19
## 430 Tennessee Unintentional injuries 19
## 431 Texas Alzheimer's disease 19
## 432 Texas CLRD 19
## 433 Texas Cancer 19
## 434 Texas Diabetes 19
## 435 Texas Heart disease 19
## 436 Texas Influenza and pneumonia 19
## 437 Texas Kidney disease 19
## 438 Texas Stroke 19
## 439 Texas Suicide 19
## 440 Texas Unintentional injuries 19
## 441 Utah Alzheimer's disease 19
## 442 Utah CLRD 19
## 443 Utah Cancer 19
## 444 Utah Diabetes 19
## 445 Utah Heart disease 19
## 446 Utah Influenza and pneumonia 19
## 447 Utah Kidney disease 19
## 448 Utah Stroke 19
## 449 Utah Suicide 19
## 450 Utah Unintentional injuries 19
## 451 Vermont Alzheimer's disease 19
## 452 Vermont CLRD 19
## 453 Vermont Cancer 19
## 454 Vermont Diabetes 19
## 455 Vermont Heart disease 19
## 456 Vermont Influenza and pneumonia 19
## 457 Vermont Kidney disease 19
## 458 Vermont Stroke 19
## 459 Vermont Suicide 19
## 460 Vermont Unintentional injuries 19
## 461 Virginia Alzheimer's disease 19
## 462 Virginia CLRD 19
## 463 Virginia Cancer 19
## 464 Virginia Diabetes 19
## 465 Virginia Heart disease 19
## 466 Virginia Influenza and pneumonia 19
## 467 Virginia Kidney disease 19
## 468 Virginia Stroke 19
## 469 Virginia Suicide 19
## 470 Virginia Unintentional injuries 19
## 471 Washington Alzheimer's disease 19
## 472 Washington CLRD 19
## 473 Washington Cancer 19
## 474 Washington Diabetes 19
## 475 Washington Heart disease 19
## 476 Washington Influenza and pneumonia 19
## 477 Washington Kidney disease 19
## 478 Washington Stroke 19
## 479 Washington Suicide 19
## 480 Washington Unintentional injuries 19
## 481 West Virginia Alzheimer's disease 19
## 482 West Virginia CLRD 19
## 483 West Virginia Cancer 19
## 484 West Virginia Diabetes 19
## 485 West Virginia Heart disease 19
## 486 West Virginia Influenza and pneumonia 19
## 487 West Virginia Kidney disease 19
## 488 West Virginia Stroke 19
## 489 West Virginia Suicide 19
## 490 West Virginia Unintentional injuries 19
## 491 Wisconsin Alzheimer's disease 19
## 492 Wisconsin CLRD 19
## 493 Wisconsin Cancer 19
## 494 Wisconsin Diabetes 19
## 495 Wisconsin Heart disease 19
## 496 Wisconsin Influenza and pneumonia 19
## 497 Wisconsin Kidney disease 19
## 498 Wisconsin Stroke 19
## 499 Wisconsin Suicide 19
## 500 Wisconsin Unintentional injuries 19
## 501 Wyoming Alzheimer's disease 19
## 502 Wyoming CLRD 19
## 503 Wyoming Cancer 19
## 504 Wyoming Diabetes 19
## 505 Wyoming Heart disease 19
## 506 Wyoming Influenza and pneumonia 19
## 507 Wyoming Kidney disease 19
## 508 Wyoming Stroke 19
## 509 Wyoming Suicide 19
## 510 Wyoming Unintentional injuries 19
This data frame lists every observed combination of state and cause of death, with n giving the number of rows for each combination. All combinations appear in the dataset, indicating that reporting is structurally complete. Note that in the rows printed above every combination has the same count (n = 19), so this portion of the output does not itself show any combination occurring less frequently than another.
State–cause combinations with the smallest counts would be considered low‑probability (anomalous) groups, because they are the least likely to be selected in a random row draw. Where such combinations exist, they would be expected to arise when low‑prevalence causes intersect with smaller state populations, so their rarity would reflect expected structural patterns rather than errors or omissions.
A testable hypothesis is that rare state–cause combinations are driven by the interaction of low national prevalence and smaller population sizes.
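As a rough check on how likely each combination is to come up in a random row draw, the counts can be converted to probabilities. The sketch below is illustrative only: it uses a small toy data frame (toy_combo, a hypothetical name) in place of state_cause_combo, assumes the columns are named State, Cause.Name, and n, and sticks to base R so it runs without extra packages. The four toy rows copy counts from the output above.

```r
# Toy stand-in for state_cause_combo. The column names State, Cause.Name,
# and n are assumptions about how the real data frame is laid out.
toy_combo <- data.frame(
  State      = c("Ohio", "Ohio", "Wyoming", "Wyoming"),
  Cause.Name = c("Cancer", "Suicide", "Cancer", "Suicide"),
  n          = c(19, 19, 19, 19)
)

# Probability that a randomly drawn row belongs to each combination.
toy_combo$prob <- toy_combo$n / sum(toy_combo$n)

# Sort ascending so the lowest-probability combinations come first.
toy_combo <- toy_combo[order(toy_combo$prob), ]
print(toy_combo)

# Gap between the most and least likely combination.
prob_spread <- max(toy_combo$prob) - min(toy_combo$prob)
print(prob_spread)
```

If every combination has the same count, as in the rows printed above, prob_spread is zero and no combination is rarer than another by this measure.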
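# geom_col() stacks one segment per state, so each bar's length is the
# total n for that cause; order the causes by that same total.
ggplot(state_cause_combo, aes(
  x = n,
  y = reorder(Cause.Name, n, FUN = sum)
)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Frequency of State–Cause Combinations",
    x = "Number of Observations",
    y = "Cause of Death"
  ) +
  theme_minimal()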
This week’s data dive examined how probability and rarity emerge from grouping observations within the dataset. Grouping by cause of death, state, year, and state–cause combination made it possible to identify low‑probability (anomalous) groups from their relatively small number of observations.
These findings demonstrate how probability can be used as a descriptive analytical tool to understand structure and rarity within complex datasets, rather than to infer causation. Future analyses may explore how these group‑level probabilities change when population‑adjusted measures, such as age‑adjusted death rates, are introduced.
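One way such a comparison might be set up is sketched below. It is only a sketch under stated assumptions: the data frame toy, the cause labels, and every value in it are hypothetical, and the column name Age.adjusted.Death.Rate is assumed rather than taken from the dataset. It places the count-based share of rows next to the mean age-adjusted rate for each cause.

```r
# Hypothetical toy rows: names and values are illustrative, not CDC figures.
toy <- data.frame(
  Cause.Name              = c("Cause A", "Cause A", "Cause B", "Cause B"),
  Age.adjusted.Death.Rate = c(200, 180, 15, 25)
)

# Share of rows per cause (the count-based probability used so far).
row_share <- prop.table(table(toy$Cause.Name))

# Mean age-adjusted rate per cause (a population-adjusted view).
mean_rate <- tapply(toy$Age.adjusted.Death.Rate, toy$Cause.Name, mean)

# Put both measures side by side for comparison.
comparison <- data.frame(
  Cause.Name = names(mean_rate),
  row_share  = as.numeric(row_share[names(mean_rate)]),
  mean_rate  = as.numeric(mean_rate)
)
print(comparison)
```

In this toy case both causes have the same share of rows but very different mean rates, which is the kind of contrast a rate-based view could surface.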