First Data Dive

Data Synopsis

This dataset comes from the CDC and records the leading causes of death by year, cause, location (state), number of deaths, and age-adjusted death rate. I will analyze the data week by week, looking for patterns and answering questions that arise along the way.

Data Analysis Goals

My goal for this analysis is to show which causes of death are most prevalent across the states and how the causes have changed over time.

After this surface level analysis is complete, I will begin to peel back the layers of data to understand the possible why and how behind the results. This may lead to further data-diving to help explain the findings.

To begin, I will run a basic analysis and visualization of each vector, then use the findings to formulate questions for further investigation of the data.

Loading the Libraries

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor

Reading the Data

deaths <- read.csv('NCHS.csv')

Cleaning the Data

In this dataset, the Deaths and Age.adjusted.Death.Rate columns are read in as character strings because their values contain commas. As a result, basic R functions such as min and max do not treat them as numbers, and filtering on them does not work properly. Both columns must be converted to a numeric format to extract meaningful insights.

deaths <- deaths |>
  mutate(
    # Strip the thousands separators, then convert each column to numeric
    Deaths = as.numeric(gsub(",", "", Deaths)),
    Age.adjusted.Death.Rate = as.numeric(gsub(",", "", Age.adjusted.Death.Rate))
  )
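
As a small illustration of why the commas matter, the sketch below converts a few comma-formatted strings with and without the gsub step. The three strings are illustrative inputs only (two of them mirror values that appear in the summaries of this report), and the variable names are hypothetical.

```r
# Illustrative comma-formatted strings, as they might appear after read.csv()
raw <- c("2,813,503", "612", "1,087.3")

# Direct conversion: any value containing a comma becomes NA (with a warning)
naive <- suppressWarnings(as.numeric(raw))

# Removing the commas first lets every value convert cleanly
cleaned <- as.numeric(gsub(",", "", raw, fixed = TRUE))

print(naive)
print(cleaned)
```

The readr function parse_number() is another option for this kind of cleanup, since it treats the comma as a grouping mark by default; the gsub approach is kept here because it matches the code used in this report.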

Min/Max/Mean/Median/Quantile Summaries

Year Vector Analysis

deaths |>
  summarise(
    min_year = min(Year, na.rm = TRUE),
    max_year = max(Year, na.rm = TRUE),
    q1_year = quantile(Year, 0.25, na.rm = TRUE),
    med_year = median(Year, na.rm = TRUE),
    q3_year = quantile(Year, 0.75, na.rm = TRUE)
  )
##   min_year max_year q1_year med_year q3_year
## 1     1999     2017    2003     2008    2013

Deaths Vector Analysis

deaths |>
  summarise(
    min_death = min(Deaths, na.rm = TRUE),
    max_death = max(Deaths, na.rm = TRUE),
    avg_death = mean(Deaths, na.rm = TRUE),
    q1_death = quantile(Deaths, 0.25, na.rm = TRUE),
    med_death = median(Deaths, na.rm = TRUE),
    q3_death = quantile(Deaths, 0.75, na.rm = TRUE)
  )
##   min_death max_death avg_death q1_death med_death q3_death
## 1        21   2813503  15459.91      612    1718.5   5756.5

Age Adjusted Death Rate Vector Analysis

deaths |>
  summarise(
    min_aadr = min(Age.adjusted.Death.Rate, na.rm = TRUE),
    max_aadr = max(Age.adjusted.Death.Rate, na.rm = TRUE),
    avg_aadr = mean(Age.adjusted.Death.Rate, na.rm = TRUE),
    q1_aadr = quantile(Age.adjusted.Death.Rate, 0.25, na.rm = TRUE),
    med_aadr = median(Age.adjusted.Death.Rate, na.rm = TRUE),
    q3_aadr = quantile(Age.adjusted.Death.Rate, 0.75, na.rm = TRUE)
  )
##   min_aadr max_aadr avg_aadr q1_aadr med_aadr q3_aadr
## 1      2.6   1087.3 127.5639    19.2     35.9 151.725
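
The three summaries above repeat the same pattern, so one possible refactor is a small helper that returns the same statistics for any numeric vector. This is only a sketch: summarise_vector is a hypothetical name, and the toy values are made up for illustration rather than taken from the CDC file.

```r
# Hypothetical helper: min, quartiles, mean, and max for one numeric vector
summarise_vector <- function(x) {
  qs <- quantile(x, c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  data.frame(
    min    = min(x, na.rm = TRUE),
    q1     = qs[1],
    median = qs[2],
    mean   = mean(x, na.rm = TRUE),
    q3     = qs[3],
    max    = max(x, na.rm = TRUE)
  )
}

# Made-up toy values (including one NA) to show the helper in isolation
toy_values <- c(10, 20, 30, 40, NA)
toy_summary <- summarise_vector(toy_values)
print(toy_summary)
```

With the cleaned data loaded, the same helper could be called as summarise_vector(deaths$Deaths) or summarise_vector(deaths$Age.adjusted.Death.Rate).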

Categorical Analysis

X113 Cause Name Vector Analysis

deaths |>
  count(X113.Cause.Name)
##                                                          X113.Cause.Name   n
## 1                   Accidents (unintentional injuries) (V01-X59,Y85-Y86) 988
## 2                                                             All Causes 988
## 3                                              Alzheimer's disease (G30) 988
## 4                                     Cerebrovascular diseases (I60-I69) 988
## 5                           Chronic lower respiratory diseases (J40-J47) 988
## 6                                            Diabetes mellitus (E10-E14) 988
## 7                            Diseases of heart (I00-I09,I11,I13,I20-I51) 988
## 8                                      Influenza and pneumonia (J09-J18) 988
## 9                   Intentional self-harm (suicide) (*U03,X60-X84,Y87.0) 988
## 10                                         Malignant neoplasms (C00-C97) 988
## 11 Nephritis, nephrotic syndrome and nephrosis (N00-N07,N17-N19,N25-N27) 988

Cause Name Vector Analysis

deaths |>
  count(Cause.Name)
##                 Cause.Name   n
## 1               All causes 988
## 2      Alzheimer's disease 988
## 3                     CLRD 988
## 4                   Cancer 988
## 5                 Diabetes 988
## 6            Heart disease 988
## 7  Influenza and pneumonia 988
## 8           Kidney disease 988
## 9                   Stroke 988
## 10                 Suicide 988
## 11  Unintentional injuries 988

State Vector Analysis

deaths |>
  count(State)
##                   State   n
## 1               Alabama 209
## 2                Alaska 209
## 3               Arizona 209
## 4              Arkansas 209
## 5            California 209
## 6              Colorado 209
## 7           Connecticut 209
## 8              Delaware 209
## 9  District of Columbia 209
## 10              Florida 209
## 11              Georgia 209
## 12               Hawaii 209
## 13                Idaho 209
## 14             Illinois 209
## 15              Indiana 209
## 16                 Iowa 209
## 17               Kansas 209
## 18             Kentucky 209
## 19            Louisiana 209
## 20                Maine 209
## 21             Maryland 209
## 22        Massachusetts 209
## 23             Michigan 209
## 24            Minnesota 209
## 25          Mississippi 209
## 26             Missouri 209
## 27              Montana 209
## 28             Nebraska 209
## 29               Nevada 209
## 30        New Hampshire 209
## 31           New Jersey 209
## 32           New Mexico 209
## 33             New York 209
## 34       North Carolina 209
## 35         North Dakota 209
## 36                 Ohio 209
## 37             Oklahoma 209
## 38               Oregon 209
## 39         Pennsylvania 209
## 40         Rhode Island 209
## 41       South Carolina 209
## 42         South Dakota 209
## 43            Tennessee 209
## 44                Texas 209
## 45        United States 209
## 46                 Utah 209
## 47              Vermont 209
## 48             Virginia 209
## 49           Washington 209
## 50        West Virginia 209
## 51            Wisconsin 209
## 52              Wyoming 209

Summary of Categorical Vectors

When analyzing the two cause name vectors, it initially struck me as odd that every unique value had the same number of occurrences (988). On closer examination, the reason is that the data cover 19 years and 52 State values (the 50 states plus the District of Columbia and United States), and 19 * 52 = 988.

The same question may arise for the State vector, since each unique value appears 209 times. This is because there are 11 cause values (including "All causes") and 19 years, and 11 * 19 = 209.
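
The counting argument can be checked with a small sketch that builds a fully crossed grid of the same shape. The state and cause labels here are placeholders, and the sketch assumes every year, state, and cause combination appears exactly once, which is what the equal counts above suggest.

```r
# Dimensions taken from the summaries above: 19 years, 52 State values, 11 causes
years  <- 1999:2017
states <- paste("State", 1:52)   # placeholder labels, not the real names
causes <- paste("Cause", 1:11)   # placeholder labels, not the real names

# One row per Year x State x Cause combination
grid <- expand.grid(
  Year = years,
  State = states,
  Cause = causes,
  stringsAsFactors = FALSE
)

rows_per_cause <- table(grid$Cause)  # expect 19 * 52 rows for each cause
rows_per_state <- table(grid$State)  # expect 11 * 19 rows for each state

print(nrow(grid))
print(unique(as.vector(rows_per_cause)))
print(unique(as.vector(rows_per_state)))
```

If the real data frame is balanced in the same way, nrow(deaths) should equal 19 * 52 * 11.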

Visualizations of the Data

Overall Cause of Death Visualization

deaths |>
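  # Use the national rows only, so state rows are not added on top of the "United States" totals
  filter(Cause.Name != "All causes", State == "United States") |>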
  group_by(Cause.Name) |>
  summarise(total_deaths = sum(Deaths, na.rm = TRUE)) |>
  ggplot(aes(
    x = total_deaths,
    y = reorder(Cause.Name, total_deaths)
  )) +
  geom_col(fill = "steelblue") +
  scale_x_continuous(
    labels = label_number(scale = 1e-6, suffix = "M")
  ) +
  labs(
    title = "Total Deaths by Cause",
    x = "Total Deaths (Millions)",
    y = "Cause of Death"
  ) +
  theme_minimal()

Summary Analysis

This is an introductory analysis of the Cause.Name vector to see which cause is most prevalent cumulatively across the years of the dataset. According to this analysis, heart disease and cancer were the two leading causes of death in the United States from 1999 to 2017. Further analysis of the vectors will help build a bird's-eye view of the data as a foundation for subsequent investigation.

Deaths by Year Visualization

deaths |>
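  # Use the national rows only, so state rows are not added on top of the "United States" totals
  filter(Cause.Name == "All causes", State == "United States") |>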
  group_by(Year) |>
  summarise(total_deaths = sum(Deaths, na.rm = TRUE)) |>
  ggplot(aes(
    x = Year,
    y = total_deaths
  )) +
  geom_line(color = "steelblue", linewidth = 1) +
  scale_x_continuous(
    breaks = seq(min(deaths$Year), max(deaths$Year), by = 2)
  ) +
  scale_y_continuous(
    limits = c(0, NA),
    breaks = seq(0, 6e6, by = 1e6),
    labels = label_number(scale = 1e-6, suffix = "M")
  ) +
  labs(
    title = "Total Deaths per Year (All Causes)",
    x = "Year",
    y = "Total Deaths (Millions)"
  ) +
  theme_minimal()

Summary Analysis

This visualization was built to show whether any trend exists in the total number of deaths from all causes in the United States per year. In this introductory analysis, total deaths have increased overall across the years, with a few exceptions that may require further analysis.
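
One caution when totaling Deaths: the State vector contains a "United States" entry alongside the individual states, so a sum that spans every State value can count each death twice. The toy sketch below uses made-up numbers and assumes the "United States" row is the national total of the state rows.

```r
# Made-up toy rows: two states plus a national row that equals their sum
toy <- data.frame(
  Year   = c(2000, 2000, 2000),
  State  = c("State A", "State B", "United States"),
  Deaths = c(40, 60, 100)
)

# Summing every row counts the national total on top of the state rows
sum_all_rows <- sum(toy$Deaths)

# Either keep only the state rows, or only the national row
sum_states_only <- sum(toy$Deaths[toy$State != "United States"])
national_only   <- toy$Deaths[toy$State == "United States"]

print(c(all_rows = sum_all_rows, states_only = sum_states_only, national = national_only))
```

In this toy case the all-rows sum is twice the true total, while the other two approaches agree with each other.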

Total Deaths by State Visualization (Top 10 States)

deaths |>
  filter(
    Cause.Name == "All causes",
    State != "United States"
  ) |>
  group_by(State) |>
  summarise(total_deaths = sum(Deaths, na.rm = TRUE)) |>
  slice_max(total_deaths, n = 10) |>
  ggplot(aes(
    x = total_deaths,
    y = reorder(State, total_deaths)
  )) +
  geom_col(fill = "steelblue") +
  scale_x_continuous(
    limits = c(0, NA),
    labels = label_number(scale = 1e-6, suffix = "M")
  ) +
  labs(
    title = "Top 10 States by Total Deaths (All Years, All Causes)",
    x = "Total Deaths (Millions)",
    y = "State"
  ) +
  theme_minimal()

Summary Analysis

This visualization of total deaths from all causes across all years, limited to the top 10 states, is an introductory look at which states have the most deaths. Further investigation is needed to see whether underlying factors skew or explain these results, including factors not captured in this dataset.

Weekly Data Dive Summary

This first data dive provided an opportunity to understand the dataset and what vectors needed to be cleaned in order to properly perform analysis. At this point, the minimums, maximums, averages, and quantiles of each numerical vector have been analyzed. Furthermore, the categorical vectors have been analyzed to see what unique values exist and how many times they appear in the dataset.

Visualizations were created to answer three basic analytical questions:

  1. Which causes of death accounted for the most deaths across the years?
  2. How did total deaths from all causes change from year to year?
  3. Which states had the most deaths across the years?

These findings lay the foundation for analyzing the dataset, which will be examined further over the upcoming weeks. Possible questions from this introductory dive include:

  1. Why are those the top 10 states and what other factors could contribute to them being the top 10?
  2. What caused the dips in the trend of total deaths across the years?
  3. Which states account for the highest number of deaths for each cause name?

The next analysis may focus on these questions and others as deeper dives into the data occur.