Final Project D607

Introduction

The U.S. Chronic Disease Indicators (CDI) dataset is a comprehensive resource covering a range of chronic diseases, including cardiovascular disease, diabetes, and cancer, for the period 2000 to 2020. The main goal of this analysis is to explore the trends and patterns of these chronic diseases across the United States over recent years. By loading, cleaning, and tidying the data, we aim to derive statistics and insights that can inform public health strategies and interventions and ultimately help improve health outcomes in the U.S.

This dataset is located at: https://catalog.data.gov/dataset/u-s-chronic-disease-indicators-cdi

#Research Question

“What are the geographic and demographic patterns of the top 10 chronic diseases, including cardiovascular disease, diabetes, and cancer, across the United States over the past decade, and how can these insights inform targeted public health interventions to improve overall health outcomes?”

##Wrangling the Data

#Loading the Data

The first step in data analysis is to load the dataset into a usable format. In this case, we’re using the read_csv() function from the readr package to load a CSV file into a data frame called chronic_disease_data. The head() function is then used to inspect the initial rows of the data frame and get a sense of its structure.
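As a minimal sketch of the loading step, the snippet below reads a small inline CSV with read_csv() and inspects it with head(). The inline sample and the na and col_types arguments are illustrative assumptions, not the settings used in this report. They show one way DataValue could be parsed as numeric at load time if "-" were the only non-numeric placeholder.

```r
library(readr)

# Hypothetical three-row sample standing in for the full CDI file
sample_csv <- I("YearStart,LocationDesc,Topic,DataValue
2019,Ohio,Alcohol,12.5
2019,Arizona,Alcohol,-
2010,Oregon,Cardiovascular Disease,7.1
")

# Treat blanks and "-" as NA and parse DataValue as a number on read
sample_data <- read_csv(
  sample_csv,
  na = c("", "-"),
  col_types = cols(
    YearStart = col_integer(),
    LocationDesc = col_character(),
    Topic = col_character(),
    DataValue = col_double()
  )
)

# Inspect the first rows and the parsed column types
head(sample_data)
```

Declaring column types up front also silences the column specification message that read_csv() prints by default.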

#Cleaning the Data

Once the data is loaded, it must be cleaned to ensure accuracy and consistency. This involves tasks like handling missing values, correcting data types, and removing irrelevant or inconsistent data. In this case, we replace the "-" placeholder in DataValue with NA, drop rows with a missing DataValue, and convert the column to numeric. Any values that fail numeric conversion are then imputed with the column mean. Other cleaning steps might involve handling outliers or standardizing the data.
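The cleaning steps can be sketched on a small hypothetical data frame. Note the assumption: this sketch converts to numeric before dropping missing values, so every non-numeric entry is removed in one pass. That ordering differs from the chunk used in this report, which drops NAs first and imputes values that fail conversion.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical rows; "-" and "n/a" stand in for non-numeric entries
toy <- tibble(
  Topic = c("Asthma", "Asthma", "Diabetes", "Diabetes"),
  DataValue = c("12.5", "-", "7.1", "n/a")
)

cleaned <- toy %>%
  mutate(
    DataValue = na_if(DataValue, "-"),                  # placeholder to NA
    DataValue = suppressWarnings(as.numeric(DataValue)) # other text to NA
  ) %>%
  drop_na(DataValue)                                    # keep numeric rows only

print(cleaned)
```

Dropping rather than imputing keeps the summary statistics based only on reported values, at the cost of fewer rows.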

#Tidying the Data

The final step in preparing the data for analysis is tidying it, which means organizing the data into a clear and consistent structure. Here, we identify and remove outliers using the interquartile range (IQR) method: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are dropped so that extreme values do not dominate the results. We then apply min-max normalization to the DataValue column within each Topic, rescaling values to the range 0 to 1 to ease comparison across topics.
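To make the two steps concrete, here is a minimal base R sketch on a hypothetical vector. The values, the min_max() helper, and its guard for a zero range are assumptions added for illustration; the report applies the same scaling formula within each Topic.

```r
# Hypothetical values with one extreme observation
values <- c(2, 4, 5, 6, 8, 50)

# IQR fences: anything outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] is dropped
q1 <- quantile(values, 0.25, names = FALSE)
q3 <- quantile(values, 0.75, names = FALSE)
iqr_value <- q3 - q1
lower_fence <- q1 - 1.5 * iqr_value
upper_fence <- q3 + 1.5 * iqr_value
kept <- values[values >= lower_fence & values <= upper_fence]

# Min-max scaling to [0, 1]; returns zeros when all values are equal
min_max <- function(x) {
  value_range <- max(x) - min(x)
  if (value_range == 0) {
    return(rep(0, length(x)))
  }
  (x - min(x)) / value_range
}

scaled <- min_max(kept)
print(kept)
print(scaled)
```

The zero-range guard matters for groups with a single distinct value, where the unguarded formula would divide by zero and produce NaN.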

# Install required libraries (if not already installed)
if(!require(tidyverse)) install.packages("tidyverse")

## Loading required package: tidyverse

## Warning: package 'tidyverse' was built under R version 4.4.2

## Warning: package 'ggplot2' was built under R version 4.4.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

if(!require(readr)) install.packages("readr")
if(!require(fastDummies)) install.packages("fastDummies")

## Loading required package: fastDummies

## Warning: package 'fastDummies' was built under R version 4.4.2

if(!require(tidyr)) install.packages("tidyr")
if(!require(forcats)) install.packages("forcats")
if(!require(ggpubr)) install.packages("ggpubr")

## Loading required package: ggpubr

## Warning: package 'ggpubr' was built under R version 4.4.2

# Load the necessary libraries
library(tidyverse)
library(readr)
library(fastDummies)
library(tidyr)
library(forcats)
library(ggpubr)

# Set the file path
file_path <- "C:/Users/Dell/Downloads/U.S._Chronic_Disease_Indicators__CDI___2023_Release.csv"

# Load the dataset
chronic_disease_data <- read_csv(file_path)

## Rows: 1185676 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (19): LocationAbbr, LocationDesc, DataSource, Topic, Question, DataValue...
## dbl  (5): YearStart, YearEnd, DataValueAlt, LowConfidenceLimit, HighConfiden...
## lgl (10): Response, StratificationCategory2, Stratification2, Stratification...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# View the first rows to inspect the dataset
head(chronic_disease_data)

## # A tibble: 6 × 34
##   YearStart YearEnd LocationAbbr LocationDesc DataSource Topic Question Response
##       <dbl>   <dbl> <chr>        <chr>        <chr>      <chr> <chr>    <lgl>   
## 1      2010    2010 OR           Oregon       NVSS       Card… Mortali… NA      
## 2      2019    2019 AZ           Arizona      YRBSS      Alco… Alcohol… NA      
## 3      2019    2019 OH           Ohio         YRBSS      Alco… Alcohol… NA      
## 4      2019    2019 US           United Stat… YRBSS      Alco… Alcohol… NA      
## 5      2015    2015 VI           Virgin Isla… YRBSS      Alco… Alcohol… NA      
## 6      2020    2020 AL           Alabama      PRAMS      Alco… Alcohol… NA      
## # ℹ 26 more variables: DataValueUnit <chr>, DataValueType <chr>,
## #   DataValue <chr>, DataValueAlt <dbl>, DataValueFootnoteSymbol <chr>,
## #   DatavalueFootnote <chr>, LowConfidenceLimit <dbl>,
## #   HighConfidenceLimit <dbl>, StratificationCategory1 <chr>,
## #   Stratification1 <chr>, StratificationCategory2 <lgl>,
## #   Stratification2 <lgl>, StratificationCategory3 <lgl>,
## #   Stratification3 <lgl>, GeoLocation <chr>, ResponseID <lgl>, …

# Step 1: Select Relevant Columns
tidy_data <- chronic_disease_data %>%
  select(
    YearStart, YearEnd, LocationDesc, DataSource, Topic, Question,
    DataValue, StratificationCategory1, Stratification1
  ) %>%
  rename(
    Year = YearStart,
    Location = LocationDesc,
    Category = StratificationCategory1,
    Subgroup = Stratification1
  )

# Step 2: Clean Missing or Inconsistent Data
# Replace the "-" placeholder in DataValue with NA
tidy_data <- tidy_data %>%
  mutate(DataValue = na_if(DataValue, "-")) %>%
  drop_na(DataValue) # Drop rows with NA in DataValue

# Ensure DataValue is numeric
tidy_data$DataValue <- as.numeric(tidy_data$DataValue)

## Warning: NAs introduced by coercion

# Impute remaining NAs in numeric columns with the column mean
tidy_data <- tidy_data %>%
  mutate(across(where(is.numeric), ~ifelse(is.na(.), mean(., na.rm = TRUE), .)))

# Step 3: Outlier Detection and Handling
# Identify and handle outliers in DataValue
Q1 <- quantile(tidy_data$DataValue, 0.25)
Q3 <- quantile(tidy_data$DataValue, 0.75)
iqr_value <- Q3 - Q1 # Named iqr_value to avoid masking stats::IQR()
tidy_data <- tidy_data %>%
  filter(DataValue >= (Q1 - 1.5 * iqr_value) & DataValue <= (Q3 + 1.5 * iqr_value))

# Step 4: Normalize DataValue
tidy_data <- tidy_data %>%
  group_by(Topic) %>%
  mutate(DataValue = (DataValue - min(DataValue)) / (max(DataValue) - min(DataValue))) %>%
  ungroup()


# Step 5: Save Cleaned Data
write_csv(tidy_data, "cleaned_chronic_disease_data.csv")
print("Cleaned dataset saved to 'cleaned_chronic_disease_data.csv'")

## [1] "Cleaned dataset saved to 'cleaned_chronic_disease_data.csv'"

The following chunk identifies the 10 topics with the largest summed DataValue in the dataset. Because DataValue was normalized within each topic, these totals partly reflect how many records each topic has.

# Top 10 chronic diseases based on summed values
top_diseases <- tidy_data %>%
  group_by(Topic) %>%
  summarize(Total_Value = sum(DataValue, na.rm = TRUE)) %>%
  arrange(desc(Total_Value)) %>%
  slice(1:10) # Top 10 diseases

# View the top 10 diseases
print(top_diseases)

## # A tibble: 10 × 2
##    Topic                                           Total_Value
##    <chr>                                                 <dbl>
##  1 Cardiovascular Disease                               29532.
##  2 Diabetes                                             23708.
##  3 Cancer                                               22994.
##  4 Nutrition, Physical Activity, and Weight Status      21366.
##  5 Chronic Obstructive Pulmonary Disease                20970.
##  6 Arthritis                                            20470.
##  7 Tobacco                                              10943.
##  8 Asthma                                                8692.
##  9 Overarching Conditions                                8518.
## 10 Oral Health                                           8483.
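A ranking by summed values can be read alongside the record count and the mean for each topic. The sketch below uses hypothetical topics and values (not CDI data) to show how a topic with many small values can outrank one with a few large values when ranked by sum.

```r
library(dplyr)
library(tibble)

# Hypothetical normalized values: Topic A has many small values,
# Topic B has few large ones
toy <- tibble(
  Topic = c(rep("Topic A", 6), rep("Topic B", 2)),
  DataValue = c(0.2, 0.3, 0.2, 0.3, 0.2, 0.3, 0.9, 0.5)
)

topic_ranking <- toy %>%
  group_by(Topic) %>%
  summarize(
    Total_Value = sum(DataValue),  # grows with the number of records
    Records = n(),                 # how many rows feed the total
    Mean_Value = mean(DataValue)   # typical value, independent of row count
  ) %>%
  arrange(desc(Total_Value))

print(topic_ranking)
```

Here Topic A ranks first by Total_Value while Topic B has the higher Mean_Value, so reporting both columns helps separate data volume from typical level.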

##Summary statistics

# Step 6: Enhanced Summary Statistics
# Include only the top 10 chronic diseases
top_diseases <- c("Cardiovascular Disease", "Diabetes", "Cancer", 
                  "Nutrition, Physical Activity, and Weight Status", 
                  "Chronic Obstructive Pulmonary Disease", "Arthritis", 
                  "Tobacco", "Asthma", "Overarching Conditions", "Oral Health")

enhanced_summary <- tidy_data %>%
  filter(Topic %in% top_diseases) %>%
  group_by(Topic, Year, Location, Category, Subgroup) %>%
  summarize(
    Average_Value = mean(DataValue, na.rm = TRUE),
    Median_Value = median(DataValue, na.rm = TRUE),
    Std_Dev = sd(DataValue, na.rm = TRUE),
    Count = n(),
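    # 2.5th and 97.5th percentiles of the observed values (an empirical
    # 95% range, not a confidence interval for the mean)
    Confidence_Lower = quantile(DataValue, 0.025, na.rm = TRUE),
    Confidence_Upper = quantile(DataValue, 0.975, na.rm = TRUE),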
    .groups = "drop"
  )

# View enhanced summary statistics
print("Enhanced Summary Statistics:")

## [1] "Enhanced Summary Statistics:"

print(enhanced_summary)

## # A tibble: 47,809 × 11
##    Topic      Year Location Category Subgroup Average_Value Median_Value Std_Dev
##    <chr>     <dbl> <chr>    <chr>    <chr>            <dbl>        <dbl>   <dbl>
##  1 Arthritis  2011 Alabama  Gender   Female           0.502        0.481  0.104 
##  2 Arthritis  2011 Alabama  Gender   Male             0.440        0.445  0.109 
##  3 Arthritis  2011 Alabama  Overall  Overall          0.469        0.45   0.101 
##  4 Arthritis  2011 Alabama  Race/Et… Black, …         0.489        0.527  0.116 
##  5 Arthritis  2011 Alabama  Race/Et… Hispanic         0.191        0.191  0.0932
##  6 Arthritis  2011 Alabama  Race/Et… Multira…         0.493        0.493  0.0193
##  7 Arthritis  2011 Alabama  Race/Et… Other, …         0.500        0.479  0.196 
##  8 Arthritis  2011 Alabama  Race/Et… White, …         0.473        0.444  0.106 
##  9 Arthritis  2011 Alaska   Gender   Female           0.389        0.347  0.123 
## 10 Arthritis  2011 Alaska   Gender   Male             0.359        0.306  0.117 
## # ℹ 47,799 more rows
## # ℹ 3 more variables: Count <int>, Confidence_Lower <dbl>,
## #   Confidence_Upper <dbl>

# Write the enhanced summary statistics to a CSV file
write_csv(enhanced_summary, "enhanced_summary_statistics.csv")
print("Enhanced summary statistics saved to 'enhanced_summary_statistics.csv'")

## [1] "Enhanced summary statistics saved to 'enhanced_summary_statistics.csv'"

# Calculate overall summary statistics
overall_summary <- enhanced_summary %>%
  group_by(Topic) %>%
  summarize(
    Average_Value = mean(Average_Value, na.rm = TRUE),
    Median_Value = median(Median_Value, na.rm = TRUE),
    Std_Dev = mean(Std_Dev, na.rm = TRUE),
    Count = sum(Count, na.rm = TRUE),
    Confidence_Lower = min(Confidence_Lower, na.rm = TRUE),
    Confidence_Upper = max(Confidence_Upper, na.rm = TRUE)
  )

# Print the overall summary statistics
print("Overall Summary Statistics:")

## [1] "Overall Summary Statistics:"

print(overall_summary)

## # A tibble: 10 × 7
##    Topic              Average_Value Median_Value Std_Dev  Count Confidence_Lower
##    <chr>                      <dbl>        <dbl>   <dbl>  <int>            <dbl>
##  1 Arthritis                  0.352        0.339   0.116  54809        0.00812  
##  2 Asthma                     0.190        0.194   0.127  38172        0        
##  3 Cancer                     0.271        0.149   0.174 101429        0        
##  4 Cardiovascular Di…         0.346        0.304   0.206  84195        0.0000450
##  5 Chronic Obstructi…         0.286        0.267   0.196  70862        0        
##  6 Diabetes                   0.285        0.314   0.160  77630        0.000360 
##  7 Nutrition, Physic…         0.349        0.311   0.194  63165        0        
##  8 Oral Health                0.503        0.532   0.168  16945        0        
##  9 Overarching Condi…         0.189        0.101   0.184  49635        0        
## 10 Tobacco                    0.269        0.218   0.181  36670        0        
## # ℹ 1 more variable: Confidence_Upper <dbl>

# Write the overall summary statistics to a CSV file
write_csv(overall_summary, "overall_summary_statistics.csv")
print("Overall summary statistics saved to 'overall_summary_statistics.csv'")

## [1] "Overall summary statistics saved to 'overall_summary_statistics.csv'"

#Interpretation:

The analysis of chronic disease data reveals several key insights. Cardiovascular disease, COPD, and overarching conditions exhibit high variability (average within-group standard deviations of 0.206, 0.196, and 0.184), indicating a wide range of normalized values. Several conditions also display skewed distributions: for cancer the mean (0.271) sits well above the median (0.149), while for diabetes the mean (0.285) sits below the median (0.314), suggesting an uneven distribution of values.

Regarding data coverage, cancer and cardiovascular disease have extensive data (101,429 and 84,195 records), while oral health has a much smaller dataset (16,945 records). Arthritis shows a moderate average (0.352) with the lowest variability (0.116), and oral health shows the highest average (0.503) with moderate variability (0.168). In contrast, cancer, cardiovascular disease, and COPD exhibit greater variability (0.174, 0.206, and 0.196). Because DataValue was normalized within each topic, these averages describe where values fall within each topic’s range rather than absolute prevalence.

Understanding these trends and variations is crucial for identifying areas of concern, targeting specific interventions, and informing public health policies to address the burden of chronic diseases.
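One quick way to check the skew described in this interpretation is to compare each topic's mean with its median. The sketch below copies four rows from the overall summary table printed above; the Mean_Minus_Median and Skew_Hint columns are names introduced here for illustration. Since the table holds averages of group means and medians of group medians, the gap is only a rough indicator.

```r
# Values copied from the overall summary table printed earlier in this report
overall <- data.frame(
  Topic = c("Cancer", "Diabetes", "Arthritis", "Oral Health"),
  Average_Value = c(0.271, 0.285, 0.352, 0.503),
  Median_Value = c(0.149, 0.314, 0.339, 0.532)
)

# Positive gap: mean above median (hint of right skew); negative: the reverse
overall$Mean_Minus_Median <- overall$Average_Value - overall$Median_Value
overall$Skew_Hint <- ifelse(overall$Mean_Minus_Median > 0, "right", "left")

print(overall)
```

Cancer shows by far the largest gap, which is consistent with a small number of high values pulling its mean upward.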

Including Plots

The visualizations below help us to provide insights for making conclusions and recommendations.

# Ensure necessary libraries are loaded
library(dplyr)
library(ggplot2)

# Assuming tidy_data is your cleaned data frame
data <- tidy_data

# Top 10 locations by summed DataValue
# (Location holds state, territory, and national names, not cities)
top_cities <- data %>%
  group_by(Location) %>%
  summarize(Total_Value = sum(DataValue, na.rm = TRUE)) %>%
  arrange(desc(Total_Value)) %>%
  slice(1:10) # Top 10 locations

# Bar plot
city_plot <- ggplot(top_cities, aes(x = reorder(Location, -Total_Value), y = Total_Value, fill = Location)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  labs(title = "Top 10 Locations with Most Chronic Disease Cases",
       x = "Location", y = "Total Cases") +
  theme_minimal() +
  coord_flip() # Horizontal bars for clarity

print(city_plot)

#Interpretation: Locations with larger populations may naturally have more chronic disease cases. Normalizing the data by population size would give a better picture of the prevalence of chronic diseases in each location. Socioeconomic factors like income, education, and access to healthcare can also influence prevalence, and analyzing them for the top 10 locations could provide valuable insights.

Lifestyle factors such as diet, physical activity, and smoking habits can also contribute to the development of chronic diseases.
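The population adjustment suggested in this interpretation can be sketched as a rate per 100,000 residents. All names and numbers below are hypothetical and are not taken from the CDI dataset; a real analysis would need population figures from an external source joined on Location.

```r
# Hypothetical case counts and populations; not taken from the CDI dataset
locations <- data.frame(
  Location = c("Location A", "Location B"),
  Cases = c(50000, 12000),
  Population = c(10000000, 1500000)
)

# Cases per 100,000 residents, so locations of different size are comparable
locations$Rate_Per_100k <- locations$Cases / locations$Population * 100000

# Rank by rate instead of raw case count
ranked <- locations[order(-locations$Rate_Per_100k), ]
print(ranked)
```

In this toy example Location A has more cases, but Location B has the higher rate, which is the kind of reversal a raw total can hide.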

# Filter Gender-specific rows
gender_data <- tidy_data %>%
  filter(Category == "Gender") %>%
  group_by(Subgroup) %>%
  summarize(Total_Value = sum(DataValue, na.rm = TRUE))

# Calculate percentages
gender_data <- gender_data %>%
  mutate(Percentage = Total_Value / sum(Total_Value) * 100)

# Pie chart with percentage labels
gender_pie <- ggplot(gender_data, aes(x = "", y = Total_Value, fill = Subgroup)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), position = position_stack(vjust = 0.5)) +
  labs(title = "Chronic Disease Cases by Gender", fill = "Gender") +
  theme_void()

print(gender_pie)

#Interpretation: Females have a slightly higher representation than males in the dataset, so gender-specific health strategies could be developed to address the unique health needs of women.

# Summarize values by Year
time_trend <- tidy_data %>%
  group_by(Year) %>%
  summarize(Total_Value = sum(DataValue, na.rm = TRUE))

# Line plot
time_plot <- ggplot(time_trend, aes(x = Year, y = Total_Value)) +
  geom_line(color = "blue", linewidth = 1) + # linewidth replaces size for lines (ggplot2 >= 3.4.0)
  geom_point(color = "red") +
  labs(title = "Trends of Chronic Diseases Over Time",
       x = "Year", y = "Total Cases") +
  theme_minimal()

print(time_plot)

#Interpretation: The line plot shows a clear upward trend in the number of chronic disease cases over the years, with a significant increase starting around 2009. This suggests a potential rise in the prevalence or diagnosis of chronic diseases during this period.
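A visual trend can be backed with a simple slope estimate. The sketch below fits a linear model to hypothetical yearly totals (not values from this report) and extracts the average change per year. It is one possible check, not the method used to produce the plot.

```r
# Hypothetical yearly totals; not taken from the CDI dataset
trend <- data.frame(
  Year = 2008:2012,
  Total_Value = c(10, 11, 15, 18, 22)
)

# Ordinary least squares fit of total against year
fit <- lm(Total_Value ~ Year, data = trend)

# Estimated average change in Total_Value per year
slope <- coef(fit)[["Year"]]
print(slope)
```

A positive slope supports an upward trend, although a rising total can also come from more records being reported in later years, so comparing record counts per year is a useful companion check.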

##Conclusions

Geographic Clustering: The presence of multiple states in the top 10 suggests potential geographic clusters of chronic diseases. Further investigation could explore factors like environmental conditions, lifestyle habits, and healthcare access in these regions.

Socioeconomic Factors: Socioeconomic factors like income, education, and occupation can influence the prevalence of chronic diseases. Analyzing these factors in the top 10 locations could provide valuable insights.

Healthcare Access and Utilization: Access to quality healthcare and healthcare utilization rates can impact the diagnosis and management of chronic diseases. It would be beneficial to examine these factors in the top 10 locations.

Lifestyle Factors: Lifestyle factors like diet, physical activity, and smoking can contribute to the development of chronic diseases. Investigating these factors in the top 10 locations could identify areas for potential intervention.

Genetic Predisposition: Genetic factors may play a role in the development of certain chronic diseases. Further research is needed to explore the genetic factors contributing to chronic diseases in these populations.


Jose Fuentes

2024-12-18
