For this exploratory analysis I will use the U.S. Chronic Disease Indicators dataset from the Centers for Disease Control and Prevention (CDC). This is an extensive dataset with 34 variables and nearly 1 million observations. It includes information about the prevalence of several health conditions, grouped into “topics” (e.g., alcohol use, asthma, COPD, and cardiovascular disease), for each U.S. state as well as the District of Columbia and Puerto Rico, from 2001 to 2021. The data will be cleaned by selecting the relevant variables, including year, location, topic, and data value; filtering for the desired disease topics, such as COPD, cardiovascular disease, and diabetes; and removing missing (NA) values. This analysis aims to answer the following question: Is Chronic Obstructive Pulmonary Disease (COPD) significantly associated with a higher prevalence of cardiovascular disease and diabetes?
My initial interest in this topic was inspired by a 2022 article in the BC Medical Journal (BCMJ), which reports that “old age” has officially been removed from the International Classification of Diseases (ICD) as a cause of death. Lately, there has also been an ongoing discussion in the medical podcast world that “old age” is rarely a cause of death for Americans anyway. More and more Americans are succumbing to the same handful of chronic illnesses, and I thought this project would be an opportune time to explore some of the implications.
#Loading packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
#Reading in the US Chronic Disease dataset and viewing the heading.
chronic <- read_csv("USChronicDiseaseIndicators.csv")
## Rows: 999999 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): LocationAbbr, LocationDesc, DataSource, Topic, Question, DataValue...
## dbl (6): YearStart, YearEnd, DataValueAlt, LowConfidenceLimit, HighConfiden...
## lgl (10): Response, StratificationCategory2, Stratification2, Stratification...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(chronic)
## # A tibble: 6 × 34
## YearStart YearEnd LocationAbbr LocationDesc DataSource Topic Question Response
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <lgl>
## 1 2015 2015 AR Arkansas NVSS Alco… Chronic… NA
## 2 2018 2018 AR Arkansas NVSS Alco… Chronic… NA
## 3 2015 2015 CA California NVSS Alco… Chronic… NA
## 4 2015 2015 CO Colorado NVSS Alco… Chronic… NA
## 5 2013 2013 DC District of… NVSS Alco… Chronic… NA
## 6 2017 2017 HI Hawaii NVSS Alco… Chronic… NA
## # ℹ 26 more variables: DataValueUnit <chr>, DataValueType <chr>,
## # DataValue <chr>, DataValueAlt <dbl>, DataValueFootnoteSymbol <chr>,
## # DatavalueFootnote <chr>, LowConfidenceLimit <dbl>,
## # HighConfidenceLimit <dbl>, StratificationCategory1 <chr>,
## # Stratification1 <chr>, StratificationCategory2 <lgl>,
## # Stratification2 <lgl>, StratificationCategory3 <lgl>,
## # Stratification3 <lgl>, GeoLocation <chr>, ResponseID <lgl>, …
#Removing the word "POINT" from the observations (coordinates) under Geolocation in case I explore with mapping later.
chronic$GeoLocation <- gsub("POINT", "", as.character(chronic$GeoLocation))
head(chronic)
## # A tibble: 6 × 34
## YearStart YearEnd LocationAbbr LocationDesc DataSource Topic Question Response
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <lgl>
## 1 2015 2015 AR Arkansas NVSS Alco… Chronic… NA
## 2 2018 2018 AR Arkansas NVSS Alco… Chronic… NA
## 3 2015 2015 CA California NVSS Alco… Chronic… NA
## 4 2015 2015 CO Colorado NVSS Alco… Chronic… NA
## 5 2013 2013 DC District of… NVSS Alco… Chronic… NA
## 6 2017 2017 HI Hawaii NVSS Alco… Chronic… NA
## # ℹ 26 more variables: DataValueUnit <chr>, DataValueType <chr>,
## # DataValue <chr>, DataValueAlt <dbl>, DataValueFootnoteSymbol <chr>,
## # DatavalueFootnote <chr>, LowConfidenceLimit <dbl>,
## # HighConfidenceLimit <dbl>, StratificationCategory1 <chr>,
## # Stratification1 <chr>, StratificationCategory2 <lgl>,
## # Stratification2 <lgl>, StratificationCategory3 <lgl>,
## # Stratification3 <lgl>, GeoLocation <chr>, ResponseID <lgl>, …
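If the coordinates are needed for mapping later, stripping the word "POINT" still leaves the parentheses and both numbers in a single string. The following is a minimal sketch of splitting them into numeric columns. It assumes the values follow a `POINT (longitude latitude)` layout; the toy rows, the `geo_demo` and `geo_parsed` objects, and the `lon` and `lat` column names are hypothetical and not part of the original analysis.

```r
library(tidyverse)

# Toy rows standing in for chronic$GeoLocation. The coordinates are made up
# and the "POINT (lon lat)" layout is an assumption about the real column.
geo_demo <- tibble(
  LocationAbbr = c("AA", "BB", "CC"),
  GeoLocation  = c("POINT (-92.5 34.75)", "POINT (-120.25 37.5)", NA)
)

geo_parsed <- geo_demo |>
  # Strip the "POINT" label and both parentheses, then trim whitespace.
  mutate(GeoLocation = trimws(gsub("POINT|\\(|\\)", "", GeoLocation))) |>
  # Split "lon lat" into two columns and convert them to numbers.
  # Rows with a missing GeoLocation stay NA in both new columns.
  separate(GeoLocation, into = c("lon", "lat"), sep = "\\s+", convert = TRUE)

geo_parsed
```

Keeping longitude and latitude as separate numeric columns is usually what mapping functions expect, but the column order should be checked against the real data before plotting.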
#Selecting the variables I want to explore by name, dropping the national "US" aggregate rows, and removing rows with missing (NA) values in the DataValueAlt variable.
chronic1 <- chronic |>
  select(YearStart, LocationAbbr, LocationDesc, DataSource, Topic, Question,
         DataValueAlt, GeoLocation, TopicID) |>
  filter(!is.na(DataValueAlt), LocationAbbr != "US")
head(chronic1)
## # A tibble: 6 × 9
## YearStart LocationAbbr LocationDesc DataSource Topic Question DataValueAlt
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2015 AR Arkansas NVSS Alco… Chronic… 266
## 2 2018 AR Arkansas NVSS Alco… Chronic… 267
## 3 2015 CA California NVSS Alco… Chronic… 3502
## 4 2015 CO Colorado NVSS Alco… Chronic… 276
## 5 2013 DC District of Col… NVSS Alco… Chronic… 37
## 6 2017 HI Hawaii NVSS Alco… Chronic… 51
## # ℹ 2 more variables: GeoLocation <chr>, TopicID <chr>
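Before dropping rows, it can help to see how many missing values each column has and how many rows the filter removes. This is a minimal sketch on invented toy data; the `demo`, `na_counts`, `demo_clean`, and `rows_dropped` names are hypothetical, and only the filter condition comes from the code above.

```r
library(tidyverse)

# Toy stand-in for `chronic`; the values are invented for illustration only.
demo <- tibble(
  LocationAbbr = c("AR", "CA", "US", "HI"),
  Topic        = c("Alcohol", "Alcohol", "Asthma", "Asthma"),
  DataValueAlt = c(266, NA, 10.5, 51)
)

# Count the missing values in every column.
na_counts <- demo |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Apply the same filter as above and record how many rows it drops.
demo_clean <- demo |>
  filter(!is.na(DataValueAlt), LocationAbbr != "US")
rows_dropped <- nrow(demo) - nrow(demo_clean)

na_counts
rows_dropped
```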
#Seeing what chronic health topics are included in the dataset.
unique(chronic1$Topic)
## [1] "Alcohol"
## [2] "Arthritis"
## [3] "Asthma"
## [4] "Cancer"
## [5] "Cardiovascular Disease"
## [6] "Chronic Kidney Disease"
## [7] "Chronic Obstructive Pulmonary Disease"
## [8] "Diabetes"
## [9] "Disability"
## [10] "Immunization"
## [11] "Mental Health"
## [12] "Nutrition, Physical Activity, and Weight Status"
## [13] "Older Adults"
## [14] "Oral Health"
## [15] "Overarching Conditions"
## [16] "Reproductive Health"
## [17] "Tobacco"
# Visualizing the number of records for each chronic disease topic in the dataset.
# fill (rather than color) shades the whole bar instead of only its outline.
chronic1 |>
  ggplot(aes(x = TopicID, fill = Topic)) +
  geom_bar(alpha = 0.8) +
  coord_flip() +
  labs(x = "Chronic Disease",
       y = "Count",
       title = "Distribution of US Chronic Diseases",
       caption = "Source: Centers for Disease Control and Prevention")
According to BMC Pulmonary Medicine, “COPD is significantly associated with a higher prevalence of some CVDs, including coronary heart disease, heart failure, heart attack, and diabetes”. The chart above shows that COPD, cardiovascular disease, and diabetes have similar numbers of records in the dataset, which might support the claim of a significant association between the three diseases, although the bars count rows in the dataset rather than cases of disease. The record count for cancer is also comparable to these three diseases. I will expand my analysis to include cancer and attempt to explore whether the four diseases are strongly correlated or just coincidentally similar.
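Since `geom_bar()` with only an `x` aesthetic counts rows, the numbers behind the bars can be tabulated directly with `count()`. The sketch below uses invented toy data. The `demo`, `topic_counts`, and `question_counts` names and the `Q1` to `Q3` labels are hypothetical, while `Topic` and `Question` are columns from the dataset.

```r
library(tidyverse)

# Toy stand-in for `chronic1`; the values and question labels are invented.
demo <- tibble(
  Topic        = c("Diabetes", "Diabetes", "Cancer", "Diabetes", "Cancer"),
  Question     = c("Q1", "Q2", "Q3", "Q1", "Q3"),
  DataValueAlt = c(9.1, 120, 428, 10.2, 141)
)

# Rows per topic: the same numbers the bar chart draws.
topic_counts <- demo |>
  count(Topic, sort = TRUE)

# Rows per topic and question: shows which indicators make up each topic.
question_counts <- demo |>
  count(Topic, Question)

topic_counts
question_counts
```

Breaking the counts down by `Question` makes it easier to see whether two topics have similar bars because of similar disease burden or simply because a similar number of indicators is tracked for each.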
# Filtering for the four possibly associated chronic diseases.
chronic2 <- chronic1 |>
  filter(Topic %in% c("Cardiovascular Disease",
                      "Chronic Obstructive Pulmonary Disease",
                      "Diabetes",
                      "Cancer"))
head(chronic2)
## # A tibble: 6 × 9
## YearStart LocationAbbr LocationDesc DataSource Topic Question DataValueAlt
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2009 MI Michigan Death Certifi… Canc… Cancer … 9
## 2 2013 MT Montana Death Certifi… Canc… Cancer … 132
## 3 2016 IA Iowa BRFSS Canc… Mammogr… 77.5
## 4 2020 NY New York BRFSS Canc… Mammogr… 82
## 5 2010 AK Alaska Statewide cen… Canc… Invasiv… 428.
## 6 2013 AK Alaska Statewide cen… Canc… Invasiv… 141
## # ℹ 2 more variables: GeoLocation <chr>, TopicID <chr>
hchart(chronic2$TopicID, type = "column")
chronic2 |>
  ggplot(aes(x = TopicID, fill = Topic)) +
  geom_bar(alpha = 0.8) +
  labs(x = "Chronic Disease",
       y = "Count",
       title = "Distribution of US Chronic Diseases",
       caption = "Source: Centers for Disease Control and Prevention")
In conclusion, this exploration was a bust. Several of the visualizations I attempted did not work and I couldn’t determine why, although I suspect it has something to do with the variable types. I think the exploration would have benefited from another quantitative variable, such as population. The visualizations hint at a possible relationship between the four diseases: COPD, cardiovascular disease, diabetes, and cancer. However, I don’t know how I could have used the data to determine whether the similarities were coincidental or reflected actual correlations.
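One possible starting point, offered as a minimal sketch rather than a tested analysis, is to put one comparable value per location and topic side by side and compute a correlation between two topics. The sketch assumes a single comparable measure per topic has already been chosen (for example, one `Question` reported as a prevalence percentage), which the real data would need extra filtering to achieve. The toy values, the location codes, and the `demo`, `wide`, and `result` names are all invented for illustration.

```r
library(tidyverse)

# Invented values for five hypothetical locations; NOT CDC data.
# Assumes one comparable measure (e.g. a prevalence percentage) per topic.
demo <- tribble(
  ~LocationAbbr, ~Topic,                                  ~DataValueAlt,
  "AA",          "Chronic Obstructive Pulmonary Disease", 5,
  "AA",          "Diabetes",                              9,
  "BB",          "Chronic Obstructive Pulmonary Disease", 6,
  "BB",          "Diabetes",                              10,
  "CC",          "Chronic Obstructive Pulmonary Disease", 7,
  "CC",          "Diabetes",                              12,
  "DD",          "Chronic Obstructive Pulmonary Disease", 8,
  "DD",          "Diabetes",                              11,
  "EE",          "Chronic Obstructive Pulmonary Disease", 9,
  "EE",          "Diabetes",                              13
)

# One row per location, one column per topic.
wide <- demo |>
  pivot_wider(names_from = Topic, values_from = DataValueAlt)

# Pearson correlation test between the two topic columns.
result <- cor.test(
  wide[["Chronic Obstructive Pulmonary Disease"]],
  wide[["Diabetes"]]
)

result$estimate  # correlation coefficient
result$p.value   # p-value for the null hypothesis of zero correlation
```

A correlation across states describes locations rather than individual people, so even a strong result here would be suggestive and not proof that the diseases occur together in the same patients.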
https://bcmj.org/blog/old-age-no-longer-diagnosis-cause-death
https://bmcpulmmed.biomedcentral.com/articles/10.1186/s12890-023-02606-1