Healthy Cities GIS Assignment

Author

Maisha Ann Subin

Load the libraries and set the working directory

library(tidyverse)
library(tidyr)
library(ggplot2)
setwd("/Users/maishasubin/Downloads")
cities500 <- read_csv("500CitiesLocalHealthIndicators.cdc.csv")
data(cities500)

The GeoLocation variable has (lat, long) format

Split GeoLocation (lat, long) into two columns: lat and long

latlong <- cities500|>
  mutate(GeoLocation = str_replace_all(GeoLocation, "[()]", ""))|>
  separate(GeoLocation, into = c("lat", "long"), sep = ",", convert = TRUE)
head(latlong)
# A tibble: 6 × 25
   Year StateAbbr StateDesc  CityName  GeographicLevel DataSource Category      
  <dbl> <chr>     <chr>      <chr>     <chr>           <chr>      <chr>         
1  2017 CA        California Hawthorne Census Tract    BRFSS      Health Outcom…
2  2017 CA        California Hawthorne City            BRFSS      Unhealthy Beh…
3  2017 CA        California Hayward   City            BRFSS      Health Outcom…
4  2017 CA        California Hayward   City            BRFSS      Unhealthy Beh…
5  2017 CA        California Hemet     City            BRFSS      Prevention    
6  2017 CA        California Indio     Census Tract    BRFSS      Health Outcom…
# ℹ 18 more variables: UniqueID <chr>, Measure <chr>, Data_Value_Unit <chr>,
#   DataValueTypeID <chr>, Data_Value_Type <chr>, Data_Value <dbl>,
#   Low_Confidence_Limit <dbl>, High_Confidence_Limit <dbl>,
#   Data_Value_Footnote_Symbol <chr>, Data_Value_Footnote <chr>,
#   PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
#   MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>

Filter the dataset

Remove the StateDesc that includes the United Sates, select Prevention as the category (of interest), filter for only measuring crude prevalence and select only 2017.

latlong_clean <- latlong |>
  filter(StateDesc != "United States") |>
  filter(Data_Value_Type == "Crude prevalence") |>
  filter(Year == 2017)
head(latlong_clean)
# A tibble: 6 × 25
   Year StateAbbr StateDesc  CityName  GeographicLevel DataSource Category      
  <dbl> <chr>     <chr>      <chr>     <chr>           <chr>      <chr>         
1  2017 CA        California Hawthorne Census Tract    BRFSS      Health Outcom…
2  2017 CA        California Hawthorne City            BRFSS      Unhealthy Beh…
3  2017 CA        California Hayward   City            BRFSS      Unhealthy Beh…
4  2017 CA        California Indio     Census Tract    BRFSS      Health Outcom…
5  2017 CA        California Inglewood Census Tract    BRFSS      Health Outcom…
6  2017 CA        California Lakewood  City            BRFSS      Unhealthy Beh…
# ℹ 18 more variables: UniqueID <chr>, Measure <chr>, Data_Value_Unit <chr>,
#   DataValueTypeID <chr>, Data_Value_Type <chr>, Data_Value <dbl>,
#   Low_Confidence_Limit <dbl>, High_Confidence_Limit <dbl>,
#   Data_Value_Footnote_Symbol <chr>, Data_Value_Footnote <chr>,
#   PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
#   MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>

What variables are included? (can any of them be removed?)

names(latlong_clean)
 [1] "Year"                       "StateAbbr"                 
 [3] "StateDesc"                  "CityName"                  
 [5] "GeographicLevel"            "DataSource"                
 [7] "Category"                   "UniqueID"                  
 [9] "Measure"                    "Data_Value_Unit"           
[11] "DataValueTypeID"            "Data_Value_Type"           
[13] "Data_Value"                 "Low_Confidence_Limit"      
[15] "High_Confidence_Limit"      "Data_Value_Footnote_Symbol"
[17] "Data_Value_Footnote"        "PopulationCount"           
[19] "lat"                        "long"                      
[21] "CategoryID"                 "MeasureId"                 
[23] "CityFIPS"                   "TractFIPS"                 
[25] "Short_Question_Text"       

Remove the variables that will not be used in the assignment

latlong_clean2 <- latlong_clean |>
  select(-DataSource,-Data_Value_Unit, -DataValueTypeID, -Low_Confidence_Limit, -High_Confidence_Limit, -Data_Value_Footnote_Symbol, -Data_Value_Footnote)
head(latlong_clean2)
# A tibble: 6 × 18
   Year StateAbbr StateDesc  CityName  GeographicLevel Category UniqueID Measure
  <dbl> <chr>     <chr>      <chr>     <chr>           <chr>    <chr>    <chr>  
1  2017 CA        California Hawthorne Census Tract    Health … 0632548… Arthri…
2  2017 CA        California Hawthorne City            Unhealt… 632548   Curren…
3  2017 CA        California Hayward   City            Unhealt… 633000   Obesit…
4  2017 CA        California Indio     Census Tract    Health … 0636448… Arthri…
5  2017 CA        California Inglewood Census Tract    Health … 0636546… Diagno…
6  2017 CA        California Lakewood  City            Unhealt… 639892   Obesit…
# ℹ 10 more variables: Data_Value_Type <chr>, Data_Value <dbl>,
#   PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
#   MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>

The new dataset “Prevention” is a manageable dataset now.

For your assignment, work with a cleaned dataset.

1. Once you run the above code and learn how to filter in this format, filter this dataset however you choose so that you have a subset with no more than 900 observations.

Filter chunk here

# 1 filtering for Florida, as I want to explore that state and filtering for Na's
florida_data <- latlong_clean2[latlong_clean2$StateDesc == "Florida", ]|>
   filter(!is.na(CityName) & !is.na(Short_Question_Text) &!is.na(TractFIPS) & !is.na(Data_Value) & !is.na(PopulationCount))

head(florida_data)
# A tibble: 6 × 18
   Year StateAbbr StateDesc CityName   GeographicLevel Category UniqueID Measure
  <dbl> <chr>     <chr>     <chr>      <chr>           <chr>    <chr>    <chr>  
1  2017 FL        Florida   Cape Coral Census Tract    Health … 1210275… Diagno…
2  2017 FL        Florida   Davie      Census Tract    Health … 1216475… High b…
3  2017 FL        Florida   Davie      Census Tract    Prevent… 1216475… Choles…
4  2017 FL        Florida   Hialeah    Census Tract    Prevent… 1230000… Visits…
5  2017 FL        Florida   Hialeah    Census Tract    Prevent… 1230000… Choles…
6  2017 FL        Florida   Hollywood  Census Tract    Health … 1232000… High b…
# ℹ 10 more variables: Data_Value_Type <chr>, Data_Value <dbl>,
#   PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
#   MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
# to see all the categories in Short question text column
unique(florida_data$Short_Question_Text)
 [1] "Diabetes"               "High Blood Pressure"    "Cholesterol Screening" 
 [4] "Annual Checkup"         "Coronary Heart Disease" "Obesity"               
 [7] "Current Asthma"         "Current Smoking"        "Cancer (except skin)"  
[10] "Binge Drinking"         "Physical Health"        "Arthritis"             
[13] "Health Insurance"       "Physical Inactivity"    "High Cholesterol"      
[16] "COPD"                   "Taking BP Medication"   "Mental Health"         
[19] "Stroke"                 "Chronic Kidney Disease"
# filtering for 5 diseases 
florida_data2 <- florida_data |> filter(Short_Question_Text %in% 
  c("High Blood Pressure", "Obesity", "Physical Inactivity", "Cholesterol Screening", "Mental Health"))
# plot to understand the data
ggplot(florida_data2, aes(x = CityName, y = Short_Question_Text, fill = Data_Value)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightyellow", high = "red") +
  labs(title = "Heatmap of Disease Frequency by City (Florida 2017)",
       x = "City", y = "Health Problem", fill = "Data Value (%)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# filtering for just 1 diease: cholesterol as it seems significant 
florida_data3 <- florida_data |> filter(Short_Question_Text %in% 
  c("High Blood Pressure")) |>
  filter(PopulationCount > 3000) # taking population count greater than 3000 to condense my data set
# filtering to get average value of data values for same cities
florida_data4 <- florida_data3 %>%
  group_by(CityName, Short_Question_Text) %>%
  mutate(avg_val = mean(Data_Value)) %>%  
  ungroup() %>% 
  select(-Data_Value, -TractFIPS, -StateAbbr, -UniqueID, -CategoryID, -CityFIPS, -MeasureId, -Category, -GeographicLevel)  # Remove the original Data_Value column and other columns
# filtering to get the final data set to work with
florida_data5 <- florida_data4 %>%
  filter(Short_Question_Text == "High Blood Pressure") %>%
  group_by(CityName) %>%
  mutate(PopulationCount = sum(PopulationCount)) %>%  # Recalculate total pop
  slice(1) %>%  # Keep one row per city (with other info), used chatGPT's help to keep rest of the columns 
  ungroup()

2. Based on the GIS tutorial (Japan earthquakes), create one plot about something in your subsetted dataset.

First plot chunk here

# Plotting a bar graph for average values of Data_Value by City
ggplot(florida_data5, aes(x = CityName, y = avg_val, fill = PopulationCount)) + 
  geom_bar(stat = "identity", color = "black") +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +  # Define gradient colors
  scale_y_continuous(limits = c(0, 60)) +  # Set y-axis limit manually
  labs(
    title = "Average Data Value for 'High Blood Pressure' by\nCity with Population Gradient",
    x = "City",
    y = "Average Value",
    caption = "Source: CDC"
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # Rotate x-axis labels for better readability

3. Now create a map of your subsetted dataset.

First map chunk here

library(leaflet)

# Filter for High Blood Pressure
hbp_data <- florida_data5 %>%
  filter(Short_Question_Text == "High Blood Pressure") # Using florida_data5 data set

leaflet() |>
  setView(lng = -81.5158, lat = 27.6648, zoom = 6) |> # Setting log lat
  addProviderTiles("Esri.WorldImagery") |>
  addCircles(
    data = florida_data5,
    radius = florida_data5$avg_val * 500,
    color = "#14010d",
    fillColor = "#FFDE21",
    fillOpacity = 0.4
)
Assuming "long" and "lat" are longitude and latitude, respectively

4. Refine your map to include a mouse-click tooltip

# Including pop-ups
popupflorida<- paste0(
                         "<b>High Blood Pressure Instances </b> <br>",
      "<b>City: </b>", florida_data5$CityName, "<br>", #<b> is bold, <br> is line break
      "<b>Average Value: </b>", florida_data5$avg_val, "<br>",
      "<b>Population Count: </b>", florida_data5$PopulationCount, "<br>"
    )

Refined map chunk here

# With pop up
leaflet() |>
  setView(lng = -81.5158, lat = 27.6648, zoom = 6) |>
  addProviderTiles("Esri.WorldImagery") |>
  addCircles(
    data = florida_data5,
    radius = florida_data5$avg_val * 500,
    color = "#14010d",
    fillColor = "#FFDE21",
    fillOpacity = 0.5,
    popup = popupflorida
  )

5. Write a paragraph

Plot 1: To start, I created a heatmap to get a clearer understanding of my dataset and help decide which health conditions to focus on. From the many available options, I narrowed it down to five by using facet_wrap charts. These visuals helped me see which “Short Question Text” entries had more complete and meaningful data.

Plot 2: After filtering down to 846 observations, I created a bar chart focused on high blood pressure. I placed cities on the x-axis and the average data value (calculated earlier) on the y-axis, using population count as the fill gradient. This gave a clean and direct view of how average values varied across cities. I wanted to use less colors for this graph or make a different plot but wasn’t sure which one to go ahead with.

Map 1: For the first map, I used the “Esri.WorldImagery” basemap to visualize the filtered high blood pressure data by city. From what I saw, there wasn’t a clear correlation between a specific city and higher or lower blood pressure rates.

Map 2: My final map included popups for each point, showing the city name, average value, and population count. This made the data more interactive and easier to explore.