Intro

For this exploratory analysis I will use the US Chronic Disease Indicators dataset from the Centers for Disease Control and Prevention (CDC). This is an extensive dataset with 34 variables and nearly 1 million observations. It includes information about the prevalence of several chronic conditions, grouped into “topics” (e.g., alcohol use, asthma, COPD, cardiovascular disease), for each U.S. state as well as the District of Columbia and Puerto Rico, from 2001 to 2021. The data will be cleaned by selecting the relevant variables, including year, location, topic, and data value; filtering for the desired disease topics, such as COPD, cardiovascular disease, and diabetes; and removing NA and missing values. This analysis aims to answer the following question: Is chronic obstructive pulmonary disease (COPD) significantly associated with a higher prevalence of cardiovascular disease and diabetes?

My initial interest in this topic was inspired by a 2022 article in the BC Medical Journal (BCMJ) which reports that “old age” has officially been removed from the International Classification of Diseases (ICD) as a cause of death. As of late, there has also been an ongoing discussion in the medical podcast world that “old age” is rarely a cause of death for Americans anyway. More and more Americans are succumbing to the same handful of chronic illnesses, and I thought this project would be an opportune time to explore some of these implications.

# Loading packages

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo 
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
# Reading in the US Chronic Disease dataset and viewing the first rows.

chronic <- read_csv("USChronicDiseaseIndicators.csv")
## Rows: 999999 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): LocationAbbr, LocationDesc, DataSource, Topic, Question, DataValue...
## dbl  (6): YearStart, YearEnd, DataValueAlt, LowConfidenceLimit, HighConfiden...
## lgl (10): Response, StratificationCategory2, Stratification2, Stratification...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(chronic)
## # A tibble: 6 × 34
##   YearStart YearEnd LocationAbbr LocationDesc DataSource Topic Question Response
##       <dbl>   <dbl> <chr>        <chr>        <chr>      <chr> <chr>    <lgl>   
## 1      2015    2015 AR           Arkansas     NVSS       Alco… Chronic… NA      
## 2      2018    2018 AR           Arkansas     NVSS       Alco… Chronic… NA      
## 3      2015    2015 CA           California   NVSS       Alco… Chronic… NA      
## 4      2015    2015 CO           Colorado     NVSS       Alco… Chronic… NA      
## 5      2013    2013 DC           District of… NVSS       Alco… Chronic… NA      
## 6      2017    2017 HI           Hawaii       NVSS       Alco… Chronic… NA      
## # ℹ 26 more variables: DataValueUnit <chr>, DataValueType <chr>,
## #   DataValue <chr>, DataValueAlt <dbl>, DataValueFootnoteSymbol <chr>,
## #   DatavalueFootnote <chr>, LowConfidenceLimit <dbl>,
## #   HighConfidenceLimit <dbl>, StratificationCategory1 <chr>,
## #   Stratification1 <chr>, StratificationCategory2 <lgl>,
## #   Stratification2 <lgl>, StratificationCategory3 <lgl>,
## #   Stratification3 <lgl>, GeoLocation <chr>, ResponseID <lgl>, …
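The column specification above lists ten columns guessed as logical, which typically happens when readr sees only missing values in the rows it samples for type guessing. A minimal sketch of declaring the types up front is shown here on a tiny inline CSV rather than the real file. The three rows, their values, and the `~` marker are invented for illustration; only the column names come from the output above.

```r
library(readr)

# Inline stand-in for the real file; all values are made up for illustration.
csv_text <- I("YearStart,LocationAbbr,DataValue,DataValueAlt,Response
2015,AR,266,266,
2018,AR,~,,
2015,CA,3502,3502,")

demo <- read_csv(
  csv_text,
  col_types = cols(
    YearStart    = col_integer(),
    LocationAbbr = col_character(),
    DataValue    = col_character(),  # parsed as <chr> in the real file too
    DataValueAlt = col_double(),     # numeric version used in the analysis
    Response     = col_character()   # would otherwise be guessed as logical
  )
)

# Inspect the types that were actually applied.
spec(demo)
```

Declaring the types also silences the column specification message and makes a later type mismatch easier to spot.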
# Removing the word "POINT" from the coordinates in the GeoLocation column in case I explore mapping later.

chronic$GeoLocation <- gsub("POINT", "", as.character(chronic$GeoLocation))
head(chronic)
## # A tibble: 6 × 34
##   YearStart YearEnd LocationAbbr LocationDesc DataSource Topic Question Response
##       <dbl>   <dbl> <chr>        <chr>        <chr>      <chr> <chr>    <lgl>   
## 1      2015    2015 AR           Arkansas     NVSS       Alco… Chronic… NA      
## 2      2018    2018 AR           Arkansas     NVSS       Alco… Chronic… NA      
## 3      2015    2015 CA           California   NVSS       Alco… Chronic… NA      
## 4      2015    2015 CO           Colorado     NVSS       Alco… Chronic… NA      
## 5      2013    2013 DC           District of… NVSS       Alco… Chronic… NA      
## 6      2017    2017 HI           Hawaii       NVSS       Alco… Chronic… NA      
## # ℹ 26 more variables: DataValueUnit <chr>, DataValueType <chr>,
## #   DataValue <chr>, DataValueAlt <dbl>, DataValueFootnoteSymbol <chr>,
## #   DatavalueFootnote <chr>, LowConfidenceLimit <dbl>,
## #   HighConfidenceLimit <dbl>, StratificationCategory1 <chr>,
## #   Stratification1 <chr>, StratificationCategory2 <lgl>,
## #   Stratification2 <lgl>, StratificationCategory3 <lgl>,
## #   Stratification3 <lgl>, GeoLocation <chr>, ResponseID <lgl>, …
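For mapping, the coordinates would probably need to be numeric columns rather than a single string. The sketch below pulls longitude and latitude out with a regular expression. It assumes the strings look like `POINT (lon lat)`, which is not visible in the printed output, and the two rows are toy values, so the pattern may need adjusting for the real column.

```r
library(dplyr)
library(stringr)

# Toy rows; the "(lon lat)" layout is an assumption about the raw column.
# The second row mimics a value after "POINT" has already been removed.
geo_demo <- tibble(
  LocationAbbr = c("AR", "HI"),
  GeoLocation  = c("POINT (-92.27 34.75)", " (-157.86 21.30)")
)

# Capture the two numbers inside the parentheses.
# Column 1 of the result is the full match; columns 2 and 3 are the groups.
coords <- str_match(
  geo_demo$GeoLocation,
  "\\(\\s*(-?[0-9.]+)\\s+(-?[0-9.]+)\\s*\\)"
)

geo_demo <- geo_demo |>
  mutate(
    lon = as.numeric(coords[, 2]),
    lat = as.numeric(coords[, 3])
  )

geo_demo
```

Because the pattern anchors on the parentheses, it works whether or not the word "POINT" is still present.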
# Selecting the variables I want to explore, dropping the national aggregate rows (LocationAbbr "US"), and removing rows with NA values in the DataValueAlt variable.

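chronic1 <- chronic |>
  # Selecting by name instead of position, so the code survives a change in column order
  select(YearStart, LocationAbbr, LocationDesc, DataSource, Topic, Question,
         DataValueAlt, GeoLocation, TopicID) |>
  filter(!is.na(DataValueAlt), LocationAbbr != "US")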
head(chronic1)
## # A tibble: 6 × 9
##   YearStart LocationAbbr LocationDesc     DataSource Topic Question DataValueAlt
##       <dbl> <chr>        <chr>            <chr>      <chr> <chr>           <dbl>
## 1      2015 AR           Arkansas         NVSS       Alco… Chronic…          266
## 2      2018 AR           Arkansas         NVSS       Alco… Chronic…          267
## 3      2015 CA           California       NVSS       Alco… Chronic…         3502
## 4      2015 CO           Colorado         NVSS       Alco… Chronic…          276
## 5      2013 DC           District of Col… NVSS       Alco… Chronic…           37
## 6      2017 HI           Hawaii           NVSS       Alco… Chronic…           51
## # ℹ 2 more variables: GeoLocation <chr>, TopicID <chr>
# Seeing what chronic health topics are included in the dataset.

unique(chronic1$Topic)
##  [1] "Alcohol"                                        
##  [2] "Arthritis"                                      
##  [3] "Asthma"                                         
##  [4] "Cancer"                                         
##  [5] "Cardiovascular Disease"                         
##  [6] "Chronic Kidney Disease"                         
##  [7] "Chronic Obstructive Pulmonary Disease"          
##  [8] "Diabetes"                                       
##  [9] "Disability"                                     
## [10] "Immunization"                                   
## [11] "Mental Health"                                  
## [12] "Nutrition, Physical Activity, and Weight Status"
## [13] "Older Adults"                                   
## [14] "Oral Health"                                    
## [15] "Overarching Conditions"                         
## [16] "Reproductive Health"                            
## [17] "Tobacco"
# Visualizing the number of records for each chronic disease topic included in the dataset.

chronic1 |>
  ggplot(aes(x=TopicID, fill = Topic)) +
  geom_bar(alpha=0.8) +
  coord_flip() +
  labs(x = "Chronic Disease",
       y = "Count",
       title = "Distribution of US Chronic Diseases",
       caption = "Source: Centers for Disease Control and Prevention")

According to BMC Pulmonary Medicine, “COPD is significantly associated with a higher prevalence of some CVDs, including coronary heart disease, heart failure, heart attack, and diabetes”. The chart above shows that COPD, cardiovascular disease, and diabetes have similar bar heights. Since the bars count records in the dataset rather than measure prevalence directly, this can only hint at the claimed association between the three diseases. Cancer has a comparable number of records. I will expand my analysis to include cancer and attempt to explore whether the four diseases are strongly correlated or just coincidentally similar.
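A bar chart built with `geom_bar()` and only an `x` aesthetic counts rows, so it is worth keeping the number of records per topic separate from the values those records report. The toy example below uses made-up topics and numbers, not CDC data, to show how the two can diverge.

```r
library(dplyr)

# Made-up records: topic "A" has many rows with low values,
# topic "B" has few rows with high values.
toy <- tibble(
  Topic        = c("A", "A", "A", "A", "B", "B"),
  DataValueAlt = c(2, 3, 2, 3, 20, 22)
)

summary_tbl <- toy |>
  group_by(Topic) |>
  summarise(
    n_records  = n(),                # what geom_bar() draws by default
    mean_value = mean(DataValueAlt)  # the reported values themselves
  )

summary_tbl
```

In the real dataset a mean like this would only be meaningful after filtering to a single Question and DataValueType, since different indicators use different units.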

# Filtering for the four possibly associated chronic diseases.

chronic2 <- chronic1 |>
  filter(Topic %in% c("Cardiovascular Disease",
                      "Chronic Obstructive Pulmonary Disease",
                      "Diabetes",
                      "Cancer"))
head(chronic2)
## # A tibble: 6 × 9
##   YearStart LocationAbbr LocationDesc DataSource     Topic Question DataValueAlt
##       <dbl> <chr>        <chr>        <chr>          <chr> <chr>           <dbl>
## 1      2009 MI           Michigan     Death Certifi… Canc… Cancer …          9  
## 2      2013 MT           Montana      Death Certifi… Canc… Cancer …        132  
## 3      2016 IA           Iowa         BRFSS          Canc… Mammogr…         77.5
## 4      2020 NY           New York     BRFSS          Canc… Mammogr…         82  
## 5      2010 AK           Alaska       Statewide cen… Canc… Invasiv…        428. 
## 6      2013 AK           Alaska       Statewide cen… Canc… Invasiv…        141  
## # ℹ 2 more variables: GeoLocation <chr>, TopicID <chr>
hchart(chronic2$TopicID, type = "column")
chronic2 |>
  ggplot(aes(x=TopicID, fill = Topic)) +
  geom_bar(alpha=0.8) +
  labs(x = "Chronic Disease",
       y = "Count",
       title = "Distribution of US Chronic Diseases",
       caption = "Source: Centers for Disease Control and Prevention")                  

Conclusion

In conclusion, this exploration was a bust. Several of the visualizations I attempted did not work and I couldn’t determine why, although I suspect it has something to do with the variable types. I think the exploration would have benefited from another quantitative variable, such as population. The visualizations hint at a possible relationship between the four diseases: COPD, cardiovascular disease, diabetes, and cancer. However, I don’t know how I could have used the data to determine whether the associations were coincidental or actual correlations.
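One possible route for a follow-up is to reshape the data so that each state is a row and each disease is a column, then test the correlation between columns. The sketch below is only an illustration on invented numbers: the state codes `S1` to `S5` and the percentages are hypothetical, and it assumes the real data has first been filtered down to one comparable prevalence value per state and topic (for example, a single Question, DataValueType, and year).

```r
library(dplyr)
library(tidyr)

# Hypothetical state-level prevalence values (percent), invented for illustration.
toy <- tibble(
  LocationAbbr = rep(c("S1", "S2", "S3", "S4", "S5"), times = 2),
  Topic        = rep(c("COPD", "Diabetes"), each = 5),
  DataValueAlt = c(5.1, 6.3, 7.8, 4.9, 8.4,      # COPD
                   9.0, 10.2, 12.5, 8.7, 13.1)   # Diabetes
)

# One row per state, one column per topic.
wide <- toy |>
  pivot_wider(names_from = Topic, values_from = DataValueAlt)

# Pearson correlation across states, with a significance test.
result <- cor.test(wide$COPD, wide$Diabetes)

result$estimate  # correlation coefficient
result$p.value   # p-value for the null hypothesis of zero correlation
```

A correlation across states would describe how state-level rates move together. It would not by itself show that individuals with COPD are more likely to have diabetes or cardiovascular disease.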

Bibliography

https://bcmj.org/blog/old-age-no-longer-diagnosis-cause-death

https://bmcpulmmed.biomedcentral.com/articles/10.1186/s12890-023-02606-1

https://err.ersjournals.com/content/27/149/180057