# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the dataset from a local CSV file
hindicator <- read.csv("500CitiesHealthIndicatorsRegions.cdc.csv")
head(hindicator)
## Year StateAbbr StateDesc Region CityName GeographicLevel DataSource
## 1 2017 CA California West Hawthorne Census Tract BRFSS
## 2 2017 CA California West Hawthorne City BRFSS
## 3 2017 CA California West Hayward City BRFSS
## 4 2017 CA California West Hayward City BRFSS
## 5 2017 CA California West Hemet City BRFSS
## 6 2017 CA California West Indio Census Tract BRFSS
## Category Measure
## 1 Health Outcomes Arthritis among adults aged >=18 Years
## 2 Unhealthy Behaviors Current smoking among adults aged >=18 Years
## 3 Health Outcomes Coronary heart disease among adults aged >=18 Years
## 4 Unhealthy Behaviors Obesity among adults aged >=18 Years
## 5 Prevention Cholesterol screening among adults aged >=18 Years
## 6 Health Outcomes Arthritis among adults aged >=18 Years
## DataValueTypeID Data_Value_Type Data_Value PopulationCount CategoryID
## 1 CrdPrv Crude prevalence 14.6 4407 HLTHOUT
## 2 CrdPrv Crude prevalence 15.4 84293 UNHBEH
## 3 AgeAdjPrv Age-adjusted prevalence 4.8 144186 HLTHOUT
## 4 CrdPrv Crude prevalence 24.2 144186 UNHBEH
## 5 AgeAdjPrv Age-adjusted prevalence 78.0 78657 PREVENT
## 6 CrdPrv Crude prevalence 22.0 5006 HLTHOUT
## MeasureId Short_Question_Text
## 1 ARTHRITIS Arthritis
## 2 CSMOKING Current Smoking
## 3 CHD Coronary Heart Disease
## 4 OBESITY Obesity
## 5 CHOLSCREEN Cholesterol Screening
## 6 ARTHRITIS Arthritis
The dataset used in this analysis contains health indicators for U.S. cities, including measures such as obesity, smoking, and chronic diseases. Each observation represents one health measure for a city or census tract and is described by 16 variables.
To prepare the dataset for visualization, I will first filter the data to include only city-level observations and focus on a single health measure to simplify the analysis. I will then select relevant variables such as population size, region, and the chosen health indicator to create a clear multivariable graph.
# Filter to city level
health_clean <- hindicator |>
  filter(GeographicLevel == "City")
unique(health_clean$Short_Question_Text)
## [1] "Current Smoking"
## [2] "Coronary Heart Disease"
## [3] "Obesity"
## [4] "Cholesterol Screening"
## [5] "Binge Drinking"
## [6] "COPD"
## [7] "Mammography"
## [8] "Teeth Loss"
## [9] "Current Asthma"
## [10] "Chronic Kidney Disease"
## [11] "Stroke"
## [12] "Dental Visit"
## [13] "Physical Inactivity"
## [14] "Sleep <7 hours"
## [15] "Diabetes"
## [16] "High Blood Pressure"
## [17] "Arthritis"
## [18] "Cancer (except skin)"
## [19] "Annual Checkup"
## [20] "Pap Smear Test"
## [21] "Physical Health"
## [22] "Mental Health"
## [23] "Health Insurance"
## [24] "High Cholesterol"
## [25] "Core preventive services for older men"
## [26] "Colorectal Cancer Screening"
## [27] "Core preventive services for older women"
## [28] "Taking BP Medication"
This output shows the different health measures available at the city level.
# Filter to only Obesity
health_clean <- health_clean |>
  filter(Short_Question_Text == "Obesity")
# Select relevant columns
health_clean <- health_clean |>
  select(Year, StateDesc, Region, CityName, Data_Value, PopulationCount)
# Check structure
str(health_clean)
## 'data.frame': 1000 obs. of 6 variables:
## $ Year : int 2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
## $ StateDesc : chr "California" "California" "Alabama" "California" ...
## $ Region : chr "West" "West" "South" "West" ...
## $ CityName : chr "Hayward" "Lakewood" "Mobile" "Corona" ...
## $ Data_Value : num 24.2 22.1 38.2 26.8 26.5 30.2 20.5 26.7 23.3 22.4 ...
## $ PopulationCount: int 144186 80048 195111 152374 153015 494665 135161 89861 76815 1307402 ...
After filtering to the city-level obesity measure and selecting the relevant variables, the cleaned dataset contains 1000 rows and 6 columns. This makes it easier to create a multivariable visualization showing how obesity rates vary by population and region.
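The file name refers to 500 cities, so 1000 rows suggests that each city may appear more than once, for example once per `Data_Value_Type` (both "Crude prevalence" and "Age-adjusted prevalence" appear at the city level in the `head()` output above). The following is a minimal sketch of how one could check this and keep a single value type before dropping that column. The `toy` data frame and its values are made up for illustration and are not taken from the real file.

```r
library(dplyr)

# Hypothetical toy data mimicking city-level obesity rows (values are made up)
toy <- data.frame(
  StateDesc       = c("California", "California", "Alabama", "Alabama"),
  CityName        = c("Hayward", "Hayward", "Mobile", "Mobile"),
  Data_Value_Type = c("Crude prevalence", "Age-adjusted prevalence",
                      "Crude prevalence", "Age-adjusted prevalence"),
  Data_Value      = c(24.2, 24.5, 38.2, 37.9)
)

# Count rows per city; state is included because city names can repeat across states
rows_per_city <- toy |>
  count(StateDesc, CityName, name = "n_rows")

# Keep one value type so that each city contributes exactly one row
toy_one_type <- toy |>
  filter(Data_Value_Type == "Age-adjusted prevalence")

print(rows_per_city)
print(toy_one_type)
```

If the same check on the real data shows two rows per city, applying the `Data_Value_Type` filter before `select()` would leave one point per city in the plot.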
I chose a scatterplot because it shows the relationship between two continuous variables (city population and obesity rate) and lets color encode a third variable, region.
ggplot(health_clean, aes(x = PopulationCount,
                         y = Data_Value,
                         color = Region)) +
  # Semi-transparent points reduce overplotting
  geom_point(alpha = 0.6, size = 4) +
  # Log scale for population
  scale_x_log10(labels = scales::comma) +
  # Labels and titles
  labs(title = "Population and Health Indicators in U.S. Cities",
       subtitle = "Obesity rates by city population",
       x = "City Population (log scale)",
       y = "Obesity Rate (%)",
       color = "Region",
       caption = "Source: CDC 500 Cities health indicators (BRFSS)") +
  # Colors for regions
  scale_color_manual(values = c("Midwest" = "red", "North" = "skyblue",
                                "South" = "darkgreen", "West" = "orange")) +
  theme_minimal()
For this assignment, I used a dataset containing health indicators for
U.S. cities, focusing specifically on obesity rates. I first cleaned the
data by filtering it to city-level observations of the obesity measure
and selecting the relevant variables: year, state, region, city name,
obesity rate, and population size. I then created a scatterplot using
ggplot2, where the x-axis represents city population (on a log scale)
and the y-axis represents obesity rate. Color represents the region
(Midwest, North, South, or West), and each point represents one
city-level observation. Finally, I applied a minimal theme to improve
the appearance of the graph.
From the visualization, I observed that most cities in the dataset have smaller populations, as many points are clustered on the left side of the graph. Obesity rates vary widely across cities, ranging from about 15% to nearly 50%. I also noticed that cities in the South tend to have higher obesity rates, while cities in the West generally have lower rates. Overall, there does not appear to be a strong relationship between population size and obesity rate, as the points are scattered without a clear trend.
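The visual impressions above could be backed up with a couple of summary numbers. The sketch below computes the median obesity rate per region and the correlation between log10 population and obesity rate. It runs on a small made-up data frame (`toy` is hypothetical and its values are not results from the real data); the same calls would apply to `health_clean`.

```r
library(dplyr)

# Hypothetical toy data with the same column names as health_clean (values are made up)
toy <- data.frame(
  Region          = c("South", "South", "West", "West", "Midwest", "Midwest"),
  PopulationCount = c(195111, 80000, 144186, 1307402, 90000, 500000),
  Data_Value      = c(38.2, 35.0, 24.2, 22.4, 31.0, 33.5)
)

# Median obesity rate per region, highest first
region_summary <- toy |>
  group_by(Region) |>
  summarise(n_rows = n(),
            median_obesity = median(Data_Value),
            .groups = "drop") |>
  arrange(desc(median_obesity))

# Pearson correlation on the log10 population scale used in the plot
pop_obesity_cor <- cor(log10(toy$PopulationCount), toy$Data_Value)

print(region_summary)
print(pop_obesity_cor)
```

A correlation near zero on the real data would support the observation that population size and obesity rate are not strongly related, while the regional medians would quantify the gap between the South and the West.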