Assignment 7

Population and Health Indicators in U.S. Cities

# Load necessary libraries
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.3

## Warning: package 'ggplot2' was built under R version 4.5.2

## Warning: package 'readr' was built under R version 4.5.2

## Warning: package 'dplyr' was built under R version 4.5.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Read the dataset from a local CSV file
hindicator <- read.csv("500CitiesHealthIndicatorsRegions.cdc.csv")
head(hindicator)

##   Year StateAbbr  StateDesc Region  CityName GeographicLevel DataSource
## 1 2017        CA California   West Hawthorne    Census Tract      BRFSS
## 2 2017        CA California   West Hawthorne            City      BRFSS
## 3 2017        CA California   West   Hayward            City      BRFSS
## 4 2017        CA California   West   Hayward            City      BRFSS
## 5 2017        CA California   West     Hemet            City      BRFSS
## 6 2017        CA California   West     Indio    Census Tract      BRFSS
##              Category                                             Measure
## 1     Health Outcomes              Arthritis among adults aged >=18 Years
## 2 Unhealthy Behaviors        Current smoking among adults aged >=18 Years
## 3     Health Outcomes Coronary heart disease among adults aged >=18 Years
## 4 Unhealthy Behaviors                Obesity among adults aged >=18 Years
## 5          Prevention  Cholesterol screening among adults aged >=18 Years
## 6     Health Outcomes              Arthritis among adults aged >=18 Years
##   DataValueTypeID         Data_Value_Type Data_Value PopulationCount CategoryID
## 1          CrdPrv        Crude prevalence       14.6            4407    HLTHOUT
## 2          CrdPrv        Crude prevalence       15.4           84293     UNHBEH
## 3       AgeAdjPrv Age-adjusted prevalence        4.8          144186    HLTHOUT
## 4          CrdPrv        Crude prevalence       24.2          144186     UNHBEH
## 5       AgeAdjPrv Age-adjusted prevalence       78.0           78657    PREVENT
## 6          CrdPrv        Crude prevalence       22.0            5006    HLTHOUT
##    MeasureId    Short_Question_Text
## 1  ARTHRITIS              Arthritis
## 2   CSMOKING        Current Smoking
## 3        CHD Coronary Heart Disease
## 4    OBESITY                Obesity
## 5 CHOLSCREEN  Cholesterol Screening
## 6  ARTHRITIS              Arthritis

The dataset used in this analysis contains health indicators for U.S. cities, including measures such as obesity, smoking, and chronic diseases. Each observation represents one health measure for a city or census tract, described by 16 variables.

Dataset Exploration and Cleaning

To prepare the dataset for visualization, I will first filter the data to include only city-level observations and focus on a single health measure to simplify the analysis. I will then select relevant variables such as population size, region, and the chosen health indicator to create a clear multivariable graph.

# Filter to city level
health_clean <- hindicator |>
  filter(GeographicLevel == "City")

unique(health_clean$Short_Question_Text)

##  [1] "Current Smoking"                         
##  [2] "Coronary Heart Disease"                  
##  [3] "Obesity"                                 
##  [4] "Cholesterol Screening"                   
##  [5] "Binge Drinking"                          
##  [6] "COPD"                                    
##  [7] "Mammography"                             
##  [8] "Teeth Loss"                              
##  [9] "Current Asthma"                          
## [10] "Chronic Kidney Disease"                  
## [11] "Stroke"                                  
## [12] "Dental Visit"                            
## [13] "Physical Inactivity"                     
## [14] "Sleep <7 hours"                          
## [15] "Diabetes"                                
## [16] "High Blood Pressure"                     
## [17] "Arthritis"                               
## [18] "Cancer (except skin)"                    
## [19] "Annual Checkup"                          
## [20] "Pap Smear Test"                          
## [21] "Physical Health"                         
## [22] "Mental Health"                           
## [23] "Health Insurance"                        
## [24] "High Cholesterol"                        
## [25] "Core preventive services for older men"  
## [26] "Colorectal Cancer Screening"             
## [27] "Core preventive services for older women"
## [28] "Taking BP Medication"

This output shows the different health measures available at the city level.

# Filter to only Obesity
health_clean <- health_clean |>
  filter(Short_Question_Text == "Obesity")

# Select relevant columns
health_clean <- health_clean |>
  select(Year, StateDesc, Region, CityName, Data_Value, PopulationCount)

# Check structure
str(health_clean)

## 'data.frame':    1000 obs. of  6 variables:
##  $ Year           : int  2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
##  $ StateDesc      : chr  "California" "California" "Alabama" "California" ...
##  $ Region         : chr  "West" "West" "South" "West" ...
##  $ CityName       : chr  "Hayward" "Lakewood" "Mobile" "Corona" ...
##  $ Data_Value     : num  24.2 22.1 38.2 26.8 26.5 30.2 20.5 26.7 23.3 22.4 ...
##  $ PopulationCount: int  144186 80048 195111 152374 153015 494665 135161 89861 76815 1307402 ...

After filtering the dataset to include only the obesity measure, the cleaned dataset contains 1000 rows and 6 columns. This makes it easier to create a multivariable visualization showing how obesity rates vary by population and region.
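The file name suggests roughly 500 cities, so 1000 rows may mean that some cities appear more than once, for example once per prevalence type (the `head()` output shows city-level rows with both `Crude prevalence` and `Age-adjusted prevalence`). A minimal sketch of how that could be checked, using a small made-up data frame named `toy` in place of the real data (the city names and values are illustrative only, not taken from the dataset):

```r
library(dplyr)

# Toy stand-in for city-level obesity rows (hypothetical values, not real data)
toy <- data.frame(
  CityName        = c("A", "A", "B", "B", "C"),
  StateDesc       = c("X", "X", "X", "X", "Y"),
  Data_Value_Type = c("Crude prevalence", "Age-adjusted prevalence",
                      "Crude prevalence", "Age-adjusted prevalence",
                      "Crude prevalence"),
  Data_Value      = c(24.2, 25.0, 30.1, 29.4, 22.0)
)

# Count rows per city; any n_rows > 1 means the city would be plotted more than once
rows_per_city <- toy |>
  count(StateDesc, CityName, name = "n_rows")

# One way to keep a single row per city: choose one prevalence type
toy_one_per_city <- toy |>
  filter(Data_Value_Type == "Crude prevalence")
```

On the real data, the same filter would have to run before the `select()` step, since `Data_Value_Type` is not among the six columns kept.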

Multivariable Visualization

I chose a scatterplot because city population and obesity rate are both continuous variables, and color can add region as a third variable.

ggplot(health_clean, aes(x = PopulationCount, 
                         y = Data_Value, 
                         color = Region)) +
  
  # Semi-transparent points to reduce overplotting
  geom_point(alpha = 0.6, size = 4) +
  
  # Log scale for population 
  scale_x_log10(labels = scales::comma) +
  
  # Labels and titles
  labs( title = "Population and Health Indicators in U.S. Cities",
        subtitle = "Obesity rates by city population",
        x = "City Population",
        y = "Obesity Rate (%)",
        color = "Region",
        caption = "Source: CDC 500 Cities (BRFSS)") +
  
  # Colors for Regions
  scale_color_manual(values = c("Midwest" = "red", "North"   = "skyblue",
                                "South"   = "darkgreen", "West"    = "orange")) +
  
  theme_minimal()

For this assignment, I used a dataset containing health indicators for U.S. cities, focusing specifically on obesity rates. I first cleaned the data by keeping only city-level observations for the obesity measure and selecting the relevant variables: year, state, region, city name, obesity rate, and population size. I then created a scatterplot using ggplot2, where the x-axis represents city population on a log scale and the y-axis represents obesity rate. Color represents the region (Midwest, North, South, and West), and each point represents one city-level observation. Finally, I applied a minimal theme to improve the appearance of the graph.

From the visualization, I observed that most cities in the dataset have smaller populations, as many points are clustered on the left side of the graph. Obesity rates vary widely across cities, ranging from about 15% to nearly 50%. I also noticed that cities in the South tend to have higher obesity rates, while cities in the West generally have lower rates. Overall, there does not appear to be a strong relationship between population size and obesity rate, as the points are scattered without a clear trend.
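These visual impressions could be backed with a few numbers. The sketch below assumes a data frame shaped like `health_clean`; here a small made-up `toy` data frame with illustrative values stands in for it. It computes the median obesity rate per region and the correlation between log10 population and obesity rate:

```r
library(dplyr)

# Toy stand-in with the same column names as health_clean (hypothetical values)
toy <- data.frame(
  Region          = c("South", "South", "West", "West", "Midwest", "Midwest"),
  Data_Value      = c(36.0, 34.0, 22.0, 24.0, 30.0, 32.0),
  PopulationCount = c(100000, 1000000, 100000, 1000000, 100000, 1000000)
)

# Median obesity rate and number of rows per region, highest median first
region_summary <- toy |>
  group_by(Region) |>
  summarise(median_obesity = median(Data_Value), n = n()) |>
  arrange(desc(median_obesity))

# Pearson correlation on the log10 scale, matching the log-scaled x-axis
pop_obesity_cor <- cor(log10(toy$PopulationCount), toy$Data_Value)
```

Replacing `toy` with `health_clean` would give the regional medians and the correlation for the real data. A correlation near zero would be consistent with the lack of a clear trend in the plot.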

Leyla C

2026-04-07
