The U.S. Chronic Disease Indicators (CDI) dataset is a comprehensive resource covering chronic diseases such as cardiovascular disease, diabetes, and cancer over the period 2000 to 2020. The main goal of this analysis is to explore trends and patterns in these chronic diseases across the United States over recent years. By loading, cleaning, and tidying the data, we aim to derive summary statistics and insights that can inform public health strategies and interventions and, ultimately, help improve health outcomes in the U.S.
This dataset is located at: https://catalog.data.gov/dataset/u-s-chronic-disease-indicators-cdi
#Research Question
“What are the geographic and demographic patterns of the top 10 chronic diseases, including cardiovascular disease, diabetes, and cancer, across the United States over the past decade, and how can these insights inform targeted public health interventions to improve overall health outcomes?”
##Wrangling the Data
#Loading the Data
The first step in data analysis is to load the dataset into a usable format. In this case, we’re using the read_csv() function from the readr package to load a CSV file into a data frame called chronic_disease_data. The head() function is then used to inspect the initial rows of the data frame and get a sense of its structure.
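As a minimal sketch of the loading step, the example below reads a small inline CSV with explicit column types. The three-column toy file and its values are made up for illustration and stand in for the much wider CDI export. DataValue is kept as text on the assumption that it can contain non-numeric placeholders, which is consistent with the column specification printed for the real file.

```r
library(readr)

# Toy CSV standing in for the CDI export (values are invented for illustration)
csv_text <- "YearStart,Topic,DataValue\n2019,Diabetes,10.5\n2019,Asthma,-\n2020,Cancer,7.2\n"

toy <- read_csv(
  I(csv_text),                      # I() marks the string as literal data, not a file path
  col_types = cols(
    YearStart = col_integer(),
    Topic     = col_character(),
    DataValue = col_character()     # kept as text because placeholders such as "-" occur
  )
)

# Inspect the first rows, as done for the full dataset
head(toy)
```

Declaring the column types up front also silences the column specification message and makes a later numeric conversion an explicit, visible step.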
#Cleaning the Data
Once the data is loaded, it’s crucial to clean it to ensure accuracy and consistency. This involves tasks like handling missing values, correcting data types, and removing irrelevant or inconsistent data. In this case, we replace the "-" placeholder in DataValue with NA, drop rows where DataValue is missing, convert the column to numeric, and mean-impute any values that fail to convert. Other cleaning steps might involve handling outliers or standardizing the data.
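The cleaning idea can be sketched on a toy tibble. This is one possible ordering, not necessarily the exact sequence used in this report: it converts DataValue to numeric before dropping missing rows, so every value that cannot be used is removed in a single step. The toy values are invented.

```r
library(dplyr)
library(tidyr)

# Toy data: one "-" placeholder and one genuinely missing value
toy <- tibble(
  Topic     = c("Diabetes", "Asthma", "Cancer", "Tobacco"),
  DataValue = c("10.5", "-", "7.2", NA)
)

cleaned <- toy %>%
  mutate(
    DataValue = na_if(DataValue, "-"),   # placeholder becomes NA
    DataValue = as.numeric(DataValue)    # text becomes numeric
  ) %>%
  drop_na(DataValue)                     # remove rows without a usable value

cleaned
```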
#Tidying the Data
The final step in preparing the data for analysis is tidying it. This involves organizing the data into a clear and consistent structure. In this specific case, we identify and remove outliers using the Interquartile Range (IQR) method to limit the influence of extreme values. Additionally, we min-max normalize the DataValue column within each Topic so that all values fall between 0 and 1, facilitating comparison and analysis across different topics.
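To make the IQR rule concrete, here is a small worked example on an invented vector. Values outside the range from Q1 - 1.5 × IQR to Q3 + 1.5 × IQR are treated as outliers. The variable names are illustrative.

```r
# Toy values with one obvious extreme observation
values <- c(2, 4, 5, 6, 7, 9, 40)

q1 <- quantile(values, 0.25, names = FALSE)   # first quartile
q3 <- quantile(values, 0.75, names = FALSE)   # third quartile
iqr_value <- q3 - q1                          # interquartile range

lower <- q1 - 1.5 * iqr_value                 # lower fence
upper <- q3 + 1.5 * iqr_value                 # upper fence

# Keep only values inside the fences
kept <- values[values >= lower & values <= upper]
kept
```

With R's default quantile method, Q1 is 4.5 and Q3 is 8 here, so the fences are -0.75 and 13.25 and only the value 40 is dropped.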
# Install required libraries (if not already installed)
if(!require(tidyverse)) install.packages("tidyverse")
## Loading required package: tidyverse
## Warning: package 'tidyverse' was built under R version 4.4.2
## Warning: package 'ggplot2' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
if(!require(readr)) install.packages("readr")
if(!require(fastDummies)) install.packages("fastDummies")
## Loading required package: fastDummies
## Warning: package 'fastDummies' was built under R version 4.4.2
if(!require(tidyr)) install.packages("tidyr")
if(!require(forcats)) install.packages("forcats")
if(!require(ggpubr)) install.packages("ggpubr")
## Loading required package: ggpubr
## Warning: package 'ggpubr' was built under R version 4.4.2
# Load the necessary libraries
library(tidyverse)
library(readr)
library(fastDummies)
library(tidyr)
library(forcats)
library(ggpubr)
# Set the file path
file_path <- "C:/Users/Dell/Downloads/U.S._Chronic_Disease_Indicators__CDI___2023_Release.csv"
# Load the dataset
chronic_disease_data <- read_csv(file_path)
## Rows: 1185676 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (19): LocationAbbr, LocationDesc, DataSource, Topic, Question, DataValue...
## dbl (5): YearStart, YearEnd, DataValueAlt, LowConfidenceLimit, HighConfiden...
## lgl (10): Response, StratificationCategory2, Stratification2, Stratification...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View the first rows to inspect the dataset
head(chronic_disease_data)
## # A tibble: 6 × 34
## YearStart YearEnd LocationAbbr LocationDesc DataSource Topic Question Response
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <lgl>
## 1 2010 2010 OR Oregon NVSS Card… Mortali… NA
## 2 2019 2019 AZ Arizona YRBSS Alco… Alcohol… NA
## 3 2019 2019 OH Ohio YRBSS Alco… Alcohol… NA
## 4 2019 2019 US United Stat… YRBSS Alco… Alcohol… NA
## 5 2015 2015 VI Virgin Isla… YRBSS Alco… Alcohol… NA
## 6 2020 2020 AL Alabama PRAMS Alco… Alcohol… NA
## # ℹ 26 more variables: DataValueUnit <chr>, DataValueType <chr>,
## # DataValue <chr>, DataValueAlt <dbl>, DataValueFootnoteSymbol <chr>,
## # DatavalueFootnote <chr>, LowConfidenceLimit <dbl>,
## # HighConfidenceLimit <dbl>, StratificationCategory1 <chr>,
## # Stratification1 <chr>, StratificationCategory2 <lgl>,
## # Stratification2 <lgl>, StratificationCategory3 <lgl>,
## # Stratification3 <lgl>, GeoLocation <chr>, ResponseID <lgl>, …
# Step 1: Select Relevant Columns
tidy_data <- chronic_disease_data %>%
select(
YearStart, YearEnd, LocationDesc, DataSource, Topic, Question,
DataValue, StratificationCategory1, Stratification1
) %>%
rename(
Year = YearStart,
Location = LocationDesc,
Category = StratificationCategory1,
Subgroup = Stratification1
)
# Step 2: Clean Missing or Inconsistent Data
# Replace the "-" placeholder in DataValue with NA
# (read_csv() already reads blank cells as NA by default)
tidy_data <- tidy_data %>%
  mutate(DataValue = na_if(DataValue, "-")) %>%
  drop_na(DataValue) # Drop rows with NA in DataValue
# Ensure DataValue is numeric; non-numeric strings that remain become NA here
tidy_data$DataValue <- as.numeric(tidy_data$DataValue)
## Warning: NAs introduced by coercion
# Handle remaining missing values by mean imputation
# (applies to every numeric column, including Year and YearEnd)
tidy_data <- tidy_data %>%
  mutate(across(where(is.numeric), ~ifelse(is.na(.), mean(., na.rm = TRUE), .)))
# Step 3: Outlier Detection and Handling
# Identify and handle outliers in DataValue
Q1 <- quantile(tidy_data$DataValue, 0.25)
Q3 <- quantile(tidy_data$DataValue, 0.75)
iqr_value <- Q3 - Q1 # named iqr_value to avoid masking stats::IQR()
tidy_data <- tidy_data %>%
  filter(DataValue >= (Q1 - 1.5 * iqr_value) & DataValue <= (Q3 + 1.5 * iqr_value))
# Step 4: Normalize DataValue
tidy_data <- tidy_data %>%
group_by(Topic) %>%
mutate(DataValue = (DataValue - min(DataValue)) / (max(DataValue) - min(DataValue))) %>%
ungroup()
# Step 5: Save Cleaned Data
write_csv(tidy_data, "cleaned_chronic_disease_data.csv")
print("Cleaned dataset saved to 'cleaned_chronic_disease_data.csv'")
## [1] "Cleaned dataset saved to 'cleaned_chronic_disease_data.csv'"
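The per-topic normalization in Step 4 divides by max minus min, which is zero when every value in a topic is identical. The sketch below shows one way to guard against that case. The helper `min_max()` is hypothetical and not part of the pipeline above, and returning 0 for a constant group is an assumption made for illustration.

```r
library(dplyr)

# Hypothetical helper: min-max scaling that returns 0 for constant groups
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    return(rep(0, length(x)))          # avoid division by zero
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy data: topic "B" has no spread at all
toy <- tibble(
  Topic     = c("A", "A", "A", "B", "B"),
  DataValue = c(10, 20, 30, 5, 5)
)

scaled <- toy %>%
  group_by(Topic) %>%
  mutate(Scaled = min_max(DataValue)) %>%
  ungroup()

scaled
```

Without the guard, the constant topic would produce NaN values that propagate into later summaries.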
The following chunk identifies the top 10 chronic disease topics in this dataset, ranked by the sum of their normalized DataValue.
# Top 10 chronic diseases based on summed values
top_diseases <- tidy_data %>%
group_by(Topic) %>%
summarize(Total_Value = sum(DataValue, na.rm = TRUE)) %>%
arrange(desc(Total_Value)) %>%
slice(1:10) # Top 10 diseases
# View the top 10 diseases
print(top_diseases)
## # A tibble: 10 × 2
## Topic Total_Value
## <chr> <dbl>
## 1 Cardiovascular Disease 29532.
## 2 Diabetes 23708.
## 3 Cancer 22994.
## 4 Nutrition, Physical Activity, and Weight Status 21366.
## 5 Chronic Obstructive Pulmonary Disease 20970.
## 6 Arthritis 20470.
## 7 Tobacco 10943.
## 8 Asthma 8692.
## 9 Overarching Conditions 8518.
## 10 Oral Health 8483.
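A ranking by summed values depends on how many rows each topic has as well as on how large the values are. The toy example below, with invented numbers, shows how a sum-based ranking and a mean-based ranking can disagree. It is a sketch for comparison, not a replacement for the ranking above.

```r
library(dplyr)

# Topic "A" has many small values, topic "B" has few large ones
toy <- tibble(
  Topic     = c(rep("A", 6), rep("B", 2)),
  DataValue = c(rep(0.2, 6), rep(0.5, 2))
)

ranking <- toy %>%
  group_by(Topic) %>%
  summarize(
    Total_Value = sum(DataValue),    # grows with the number of rows
    Mean_Value  = mean(DataValue),   # independent of the number of rows
    Rows        = n()
  ) %>%
  arrange(desc(Total_Value))

ranking
```

Here "A" ranks first by total while "B" has the higher mean, so reporting the row count next to the total helps readers interpret the ranking.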
##Summary statistics
# Step 6: Enhanced Summary Statistics
# Include only the top 10 chronic diseases
top_diseases <- c("Cardiovascular Disease", "Diabetes", "Cancer",
"Nutrition, Physical Activity, and Weight Status",
"Chronic Obstructive Pulmonary Disease", "Arthritis",
"Tobacco", "Asthma", "Overarching Conditions", "Oral Health")
enhanced_summary <- tidy_data %>%
  filter(Topic %in% top_diseases) %>%
  group_by(Topic, Year, Location, Category, Subgroup) %>%
  summarize(
    Average_Value = mean(DataValue, na.rm = TRUE),
    Median_Value = median(DataValue, na.rm = TRUE),
    Std_Dev = sd(DataValue, na.rm = TRUE),
    Count = n(),
    # Empirical 2.5th and 97.5th percentiles of DataValue within each group
    # (a spread of the observations, not a confidence interval for the mean)
    Confidence_Lower = quantile(DataValue, 0.025, na.rm = TRUE),
    Confidence_Upper = quantile(DataValue, 0.975, na.rm = TRUE),
    .groups = "drop"
  )
# View enhanced summary statistics
print("Enhanced Summary Statistics:")
## [1] "Enhanced Summary Statistics:"
print(enhanced_summary)
## # A tibble: 47,809 × 11
## Topic Year Location Category Subgroup Average_Value Median_Value Std_Dev
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Arthritis 2011 Alabama Gender Female 0.502 0.481 0.104
## 2 Arthritis 2011 Alabama Gender Male 0.440 0.445 0.109
## 3 Arthritis 2011 Alabama Overall Overall 0.469 0.45 0.101
## 4 Arthritis 2011 Alabama Race/Et… Black, … 0.489 0.527 0.116
## 5 Arthritis 2011 Alabama Race/Et… Hispanic 0.191 0.191 0.0932
## 6 Arthritis 2011 Alabama Race/Et… Multira… 0.493 0.493 0.0193
## 7 Arthritis 2011 Alabama Race/Et… Other, … 0.500 0.479 0.196
## 8 Arthritis 2011 Alabama Race/Et… White, … 0.473 0.444 0.106
## 9 Arthritis 2011 Alaska Gender Female 0.389 0.347 0.123
## 10 Arthritis 2011 Alaska Gender Male 0.359 0.306 0.117
## # ℹ 47,799 more rows
## # ℹ 3 more variables: Count <int>, Confidence_Lower <dbl>,
## # Confidence_Upper <dbl>
# Write the enhanced summary statistics to a CSV file
write_csv(enhanced_summary, "enhanced_summary_statistics.csv")
print("Enhanced summary statistics saved to 'enhanced_summary_statistics.csv'")
## [1] "Enhanced summary statistics saved to 'enhanced_summary_statistics.csv'"
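# Calculate overall summary statistics per topic from the group-level summaries
overall_summary <- enhanced_summary %>%
  group_by(Topic) %>%
  summarize(
    Average_Value = mean(Average_Value, na.rm = TRUE),      # mean of group means
    Median_Value = median(Median_Value, na.rm = TRUE),      # median of group medians
    Std_Dev = mean(Std_Dev, na.rm = TRUE),                  # average within-group SD, not the overall SD
    Count = sum(Count, na.rm = TRUE),
    Confidence_Lower = min(Confidence_Lower, na.rm = TRUE), # smallest group-level 2.5th percentile
    Confidence_Upper = max(Confidence_Upper, na.rm = TRUE)  # largest group-level 97.5th percentile
  )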
# Print the overall summary statistics
print("Overall Summary Statistics:")
## [1] "Overall Summary Statistics:"
print(overall_summary)
## # A tibble: 10 × 7
## Topic Average_Value Median_Value Std_Dev Count Confidence_Lower
## <chr> <dbl> <dbl> <dbl> <int> <dbl>
## 1 Arthritis 0.352 0.339 0.116 54809 0.00812
## 2 Asthma 0.190 0.194 0.127 38172 0
## 3 Cancer 0.271 0.149 0.174 101429 0
## 4 Cardiovascular Di… 0.346 0.304 0.206 84195 0.0000450
## 5 Chronic Obstructi… 0.286 0.267 0.196 70862 0
## 6 Diabetes 0.285 0.314 0.160 77630 0.000360
## 7 Nutrition, Physic… 0.349 0.311 0.194 63165 0
## 8 Oral Health 0.503 0.532 0.168 16945 0
## 9 Overarching Condi… 0.189 0.101 0.184 49635 0
## 10 Tobacco 0.269 0.218 0.181 36670 0
## # ℹ 1 more variable: Confidence_Upper <dbl>
# Write the overall summary statistics to a CSV file
write_csv(overall_summary, "overall_summary_statistics.csv")
print("Overall summary statistics saved to 'overall_summary_statistics.csv'")
## [1] "Overall summary statistics saved to 'overall_summary_statistics.csv'"
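The Confidence_Lower and Confidence_Upper columns are computed with quantile(), so they describe where the middle 95% of observations fall. A confidence interval for the mean is a different quantity. The sketch below contrasts the two on an invented vector, using a t-based interval as one common choice.

```r
# Toy values on the normalized 0 to 1 scale
x <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)

# Empirical 2.5th and 97.5th percentiles (spread of the observations)
pct <- quantile(x, c(0.025, 0.975), names = FALSE)

# t-based 95% confidence interval for the mean
n  <- length(x)
se <- sd(x) / sqrt(n)
ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se

pct
ci
```

For this vector the interval for the mean is narrower than the percentile range, which is the usual pattern because the standard error shrinks as the number of observations grows.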
#Interpretation:
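The analysis of chronic disease data reveals several key insights. Because DataValue was min-max normalized within each topic, the figures in the summary table sit on a 0 to 1 scale and describe relative levels rather than raw prevalence. Cardiovascular disease (average within-group standard deviation 0.206), COPD (0.196), and nutrition, physical activity, and weight status (0.194) show the highest variability, with overarching conditions (0.184) close behind. Several conditions also appear skewed: for cancer the average of the group means (0.271) is well above the median of the group medians (0.149), and overarching conditions show a similar gap (0.189 versus 0.101), which points to a right-skewed distribution. Diabetes shows a smaller gap in the opposite direction (0.285 versus 0.314).
Regarding data coverage, cancer (101,429 records) and cardiovascular disease (84,195) have the most data, while oral health (16,945) has the least. Arthritis shows the lowest variability (0.116). Oral health has the highest average normalized value (0.503), followed by arthritis (0.352), nutrition, physical activity, and weight status (0.349), and cardiovascular disease (0.346), although normalized averages are not directly comparable to prevalence across topics.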
Understanding these trends and variations is crucial for identifying areas of concern, targeting specific interventions, and informing public health policies to address the burden of chronic diseases.
The visualizations below support the conclusions and recommendations that follow.
# Ensure necessary libraries are loaded
library(dplyr)
library(ggplot2)
# Assuming tidy_data is your cleaned data frame
data <- tidy_data
# Top 10 locations (states and territories) by summed normalized DataValue
top_locations <- data %>%
  group_by(Location) %>%
  summarize(Total_Value = sum(DataValue, na.rm = TRUE)) %>%
  arrange(desc(Total_Value)) %>%
  slice(1:10) # Top 10 locations
# Bar plot
location_plot <- ggplot(top_locations, aes(x = reorder(Location, -Total_Value), y = Total_Value, fill = Location)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  labs(title = "Top 10 Locations by Summed Normalized DataValue",
       x = "Location", y = "Summed normalized DataValue") +
  theme_minimal() +
  coord_flip() # Horizontal bars for clarity
print(location_plot)
#Interpretation: Locations with larger populations, or simply more rows in the dataset, may naturally show higher totals. Normalizing by population size would give a better picture of the prevalence of chronic diseases in each location. Socioeconomic factors such as income, education, and access to healthcare can also influence prevalence, so analyzing them for the top 10 locations could provide valuable insights.
Lifestyle factors such as diet, physical activity, and smoking habits can also contribute to the development of chronic diseases.
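Population normalization could be sketched as follows. The location names, totals, and population figures are all invented for illustration; a real analysis would join an external population table, which is not part of the CDI columns used here.

```r
library(dplyr)

# Hypothetical totals per location (invented numbers)
totals <- tibble(
  Location    = c("State A", "State B"),
  Total_Value = c(500, 300)
)

# Hypothetical population lookup table (invented numbers)
population <- tibble(
  Location   = c("State A", "State B"),
  Population = c(10e6, 2e6)
)

per_capita <- totals %>%
  left_join(population, by = "Location") %>%
  mutate(Per_100k = Total_Value / Population * 1e5) %>%   # rate per 100,000 residents
  arrange(desc(Per_100k))

per_capita
```

In this toy case the smaller location ranks first once population is taken into account, even though its raw total is lower.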
# Filter Gender-specific rows
gender_data <- tidy_data %>%
filter(Category == "Gender") %>%
group_by(Subgroup) %>%
summarize(Total_Value = sum(DataValue, na.rm = TRUE))
# Calculate percentages
gender_data <- gender_data %>%
mutate(Percentage = Total_Value / sum(Total_Value) * 100)
# Pie chart with percentage labels
gender_pie <- ggplot(gender_data, aes(x = "", y = Total_Value, fill = Subgroup)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
geom_text(aes(label = paste0(round(Percentage, 1), "%")), position = position_stack(vjust = 0.5)) +
labs(title = "Chronic Disease Cases by Gender", fill = "Gender") +
theme_void()
print(gender_pie)
#Interpretation: With a slightly higher representation of females in the dataset, gender-specific health strategies could be developed to address the unique health needs of women.
# Summarize values by Year
time_trend <- tidy_data %>%
group_by(Year) %>%
summarize(Total_Value = sum(DataValue, na.rm = TRUE))
# Line plot
time_plot <- ggplot(time_trend, aes(x = Year, y = Total_Value)) +
  geom_line(color = "blue", linewidth = 1) + # linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0)
  geom_point(color = "red") +
  labs(title = "Trends of Chronic Diseases Over Time",
       x = "Year", y = "Total Cases") +
  theme_minimal()
print(time_plot)
#Interpretation: The line plot shows a clear upward trend in the summed values over the years, with a marked increase starting around 2009. This suggests a potential rise in the prevalence or diagnosis of chronic diseases during this period, although part of the increase may reflect more records being reported in later years.
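One way to check whether a rising yearly total reflects higher values or simply more records is to compare the sum with the mean and the row count per year. The toy data below is invented: every value is identical, yet the total still rises because the later year has more rows.

```r
library(dplyr)

# Toy data: same value every time, but three rows in 2011 versus one in 2010
toy <- tibble(
  Year      = c(2010, 2011, 2011, 2011),
  DataValue = c(0.4, 0.4, 0.4, 0.4)
)

by_year <- toy %>%
  group_by(Year) %>%
  summarize(
    Total_Value = sum(DataValue),    # sensitive to the number of rows
    Mean_Value  = mean(DataValue),   # level per record
    Rows        = n()
  )

by_year
```

Plotting Mean_Value or Rows alongside Total_Value would show which of the two is driving the trend.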
##Conclusions
Geographic Clustering: The presence of multiple states in the top 10 suggests potential geographic clusters of chronic diseases. Further investigation could explore factors like environmental conditions, lifestyle habits, and healthcare access in these regions.
Socioeconomic Factors: Socioeconomic factors like income, education, and occupation can influence the prevalence of chronic diseases. Analyzing these factors in the top 10 locations could provide valuable insights.
Healthcare Access and Utilization: Access to quality healthcare and healthcare utilization rates can impact the diagnosis and management of chronic diseases. It would be beneficial to examine these factors in the top 10 locations.
Lifestyle Factors: Lifestyle factors like diet, physical activity, and smoking can contribute to the development of chronic diseases. Investigating these factors in the top 10 locations could identify areas for potential intervention.
Genetic Predisposition: Genetic factors may play a role in the development of certain chronic diseases. Further research is needed to explore the genetic factors contributing to chronic diseases in these populations.