Healthy Cities GIS Assignment

Author

Karen Pesca

Load the libraries and set the working directory

library(tidyverse)
library(tidyr)
setwd("/Users/karenlizethpp/Library/Mobile Documents/com~apple~CloudDocs/Data 110")
cities500 <- read_csv("500CitiesLocalHealthIndicators.cdc.csv")

Source: “CDC: 500 Cities Project: 2016 to 2019”

The GeoLocation variable has (lat, long) format

Split GeoLocation (lat, long) into two columns: lat and long

latlong <- cities500|>
  mutate(GeoLocation = str_replace_all(GeoLocation, "[()]", ""))|>
  separate(GeoLocation, into = c("lat", "long"), sep = ",", convert = TRUE)
head(latlong)

# A tibble: 6 × 25
   Year StateAbbr StateDesc  CityName  GeographicLevel DataSource Category      
  <dbl> <chr>     <chr>      <chr>     <chr>           <chr>      <chr>         
1  2017 CA        California Hawthorne Census Tract    BRFSS      Health Outcom…
2  2017 CA        California Hawthorne City            BRFSS      Unhealthy Beh…
3  2017 CA        California Hayward   City            BRFSS      Health Outcom…
4  2017 CA        California Hayward   City            BRFSS      Unhealthy Beh…
5  2017 CA        California Hemet     City            BRFSS      Prevention    
6  2017 CA        California Indio     Census Tract    BRFSS      Health Outcom…
# ℹ 18 more variables: UniqueID <chr>, Measure <chr>, Data_Value_Unit <chr>,
#   DataValueTypeID <chr>, Data_Value_Type <chr>, Data_Value <dbl>,
#   Low_Confidence_Limit <dbl>, High_Confidence_Limit <dbl>,
#   Data_Value_Footnote_Symbol <chr>, Data_Value_Footnote <chr>,
#   PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
#   MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
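
The split above can be tried on a small scale first. The following is a minimal sketch on made-up toy coordinates (the city names and values are assumptions, not rows from the CDC file), using the same str_replace_all() and separate() steps:

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy data standing in for the GeoLocation column (values are made up)
toy <- tibble(
  CityName    = c("City A", "City B"),
  GeoLocation = c("(33.9, -118.3)", "(37.6, -122.1)")
)

toy_split <- toy |>
  # Drop the parentheses so only "lat, long" remains
  mutate(GeoLocation = str_replace_all(GeoLocation, "[()]", "")) |>
  # Split on the comma; convert = TRUE turns the two pieces into numbers
  separate(GeoLocation, into = c("lat", "long"), sep = ",", convert = TRUE)

toy_split
```

Without convert = TRUE, lat and long would stay as character columns and could not be passed directly to a map.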

Filter the dataset

Remove the rows where StateDesc is “United States”, keep only crude prevalence values, and keep only 2017. Note that the chunk below does not filter Category to Prevention, so all categories remain in the output.

latlong_clean <- latlong |>
  filter(StateDesc != "United States") |>
  filter(Data_Value_Type == "Crude prevalence") |>
  filter(Year == 2017)
head(latlong_clean)

# A tibble: 6 × 25
   Year StateAbbr StateDesc  CityName  GeographicLevel DataSource Category      
  <dbl> <chr>     <chr>      <chr>     <chr>           <chr>      <chr>         
1  2017 CA        California Hawthorne Census Tract    BRFSS      Health Outcom…
2  2017 CA        California Hawthorne City            BRFSS      Unhealthy Beh…
3  2017 CA        California Hayward   City            BRFSS      Unhealthy Beh…
4  2017 CA        California Indio     Census Tract    BRFSS      Health Outcom…
5  2017 CA        California Inglewood Census Tract    BRFSS      Health Outcom…
6  2017 CA        California Lakewood  City            BRFSS      Unhealthy Beh…
# ℹ 18 more variables: UniqueID <chr>, Measure <chr>, Data_Value_Unit <chr>,
#   DataValueTypeID <chr>, Data_Value_Type <chr>, Data_Value <dbl>,
#   Low_Confidence_Limit <dbl>, High_Confidence_Limit <dbl>,
#   Data_Value_Footnote_Symbol <chr>, Data_Value_Footnote <chr>,
#   PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
#   MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
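
If a Prevention-only subset is wanted, one more condition could be added to the filter. This is a minimal sketch on made-up toy rows (the rows are assumptions; only the column names and the "Prevention" and "Crude prevalence" labels come from the output above):

```r
library(dplyr)

# Toy rows that mimic the columns used for filtering (values are made up)
toy <- tibble(
  Year            = c(2017, 2017, 2016, 2017),
  StateDesc       = c("California", "United States", "California", "Florida"),
  Category        = c("Prevention", "Prevention", "Prevention", "Health Outcomes"),
  Data_Value_Type = c("Crude prevalence", "Crude prevalence",
                      "Crude prevalence", "Age-adjusted prevalence")
)

# Conditions separated by commas inside one filter() are combined with AND,
# which gives the same rows as chaining several filter() calls
toy_prevention <- toy |>
  filter(
    StateDesc != "United States",
    Category == "Prevention",              # extra condition, not in the chunk above
    Data_Value_Type == "Crude prevalence",
    Year == 2017
  )

toy_prevention
```

Only the first toy row passes all four conditions.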

What variables are included? (can any of them be removed?)

names(latlong_clean)

 [1] "Year"                       "StateAbbr"                 
 [3] "StateDesc"                  "CityName"                  
 [5] "GeographicLevel"            "DataSource"                
 [7] "Category"                   "UniqueID"                  
 [9] "Measure"                    "Data_Value_Unit"           
[11] "DataValueTypeID"            "Data_Value_Type"           
[13] "Data_Value"                 "Low_Confidence_Limit"      
[15] "High_Confidence_Limit"      "Data_Value_Footnote_Symbol"
[17] "Data_Value_Footnote"        "PopulationCount"           
[19] "lat"                        "long"                      
[21] "CategoryID"                 "MeasureId"                 
[23] "CityFIPS"                   "TractFIPS"                 
[25] "Short_Question_Text"

Remove the variables that will not be used in the assignment

latlong_clean2 <- latlong_clean |>
  select(-DataSource, -Data_Value_Unit, -DataValueTypeID,
         -Low_Confidence_Limit, -High_Confidence_Limit,
         -Data_Value_Footnote_Symbol, -Data_Value_Footnote)
head(latlong_clean2)

# A tibble: 6 × 18
   Year StateAbbr StateDesc  CityName  GeographicLevel Category UniqueID Measure
  <dbl> <chr>     <chr>      <chr>     <chr>           <chr>    <chr>    <chr>  
1  2017 CA        California Hawthorne Census Tract    Health … 0632548… Arthri…
2  2017 CA        California Hawthorne City            Unhealt… 632548   Curren…
3  2017 CA        California Hayward   City            Unhealt… 633000   Obesit…
4  2017 CA        California Indio     Census Tract    Health … 0636448… Arthri…
5  2017 CA        California Inglewood Census Tract    Health … 0636546… Diagno…
6  2017 CA        California Lakewood  City            Unhealt… 639892   Obesit…
# ℹ 10 more variables: Data_Value_Type <chr>, Data_Value <dbl>,
#   PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
#   MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>

#unique(md$CityName)

The new dataset, latlong_clean2, is now a manageable size.

For your assignment, work with a cleaned dataset.

1. Once you run the above code and learn how to filter in this format, filter this dataset however you choose so that you have a subset with no more than 900 observations.

Filter chunk here

I filtered the data to keep only the tracts in Florida where the diabetes rate is 15% or higher, and I made sure it only includes the diabetes question at the census tract level.

latlongfl <- latlong_clean2 |>
  filter(StateAbbr == "FL",
         Data_Value >= 15,
         GeographicLevel == "Census Tract",
         Short_Question_Text == "Diabetes")

head(latlongfl)

# A tibble: 6 × 18
   Year StateAbbr StateDesc CityName   GeographicLevel Category UniqueID Measure
  <dbl> <chr>     <chr>     <chr>      <chr>           <chr>    <chr>    <chr>  
1  2017 FL        Florida   Deerfield… Census Tract    Health … 1216725… Diagno…
2  2017 FL        Florida   Hialeah    Census Tract    Health … 1230000… Diagno…
3  2017 FL        Florida   Gainesvil… Census Tract    Health … 1225175… Diagno…
4  2017 FL        Florida   Hialeah    Census Tract    Health … 1230000… Diagno…
5  2017 FL        Florida   Boynton B… Census Tract    Health … 1207875… Diagno…
6  2017 FL        Florida   Hialeah    Census Tract    Health … 1230000… Diagno…
# ℹ 10 more variables: Data_Value_Type <chr>, Data_Value <dbl>,
#   PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
#   MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
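
The assignment asks for no more than 900 observations, so it may be worth checking the row count after filtering. A minimal sketch, assuming a hypothetical helper named within_limit() (not part of the original code) and a made-up toy table:

```r
library(dplyr)

# Hypothetical helper (not part of the original assignment code):
# returns TRUE when a data frame has at most `limit` rows
within_limit <- function(df, limit = 900) {
  nrow(df) <= limit
}

# Toy values standing in for Data_Value (made up)
toy <- tibble(Data_Value = c(12.1, 15.0, 18.4, 9.7))
toy_subset <- toy |> filter(Data_Value >= 15)

nrow(toy_subset)          # number of rows kept by the filter
within_limit(toy_subset)  # checks the row count against the 900-row limit
```

The same two calls could be run on latlongfl to confirm the subset meets the requirement.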

2. Based on the GIS tutorial (Japan earthquakes), create one plot about something in your subsetted dataset.

First plot chunk here:

I filtered the dataset to get the top 5 cities with the highest diabetes percentages at the tract level using max().

#  Filter top 5 cities with highest diabetes %
top_cities <- latlongfl |>
  group_by(CityName) |>
  summarize(max_diabetes = max(Data_Value, na.rm = TRUE)) |>
  arrange(desc(max_diabetes)) |>
  slice(1:5) |>
  pull(CityName)

latlong_top5 <- latlongfl |>
  filter(CityName %in% top_cities,
         Short_Question_Text == "Diabetes")

Pull function:

https://www.statology.org/dplyr-pull/
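
As an alternative to arrange() followed by slice(), dplyr also offers slice_max(). The following is a minimal sketch on made-up toy values (city names and rates are assumptions) that picks the top two cities instead of five:

```r
library(dplyr)

# Toy tract-level rates for three cities (values are made up)
toy <- tibble(
  CityName   = c("A", "A", "B", "B", "C", "C"),
  Data_Value = c(15.2, 21.0, 16.1, 18.3, 25.4, 15.0)
)

top_two <- toy |>
  group_by(CityName) |>
  summarize(max_diabetes = max(Data_Value, na.rm = TRUE)) |>
  # slice_max() keeps the n rows with the largest values of max_diabetes
  slice_max(max_diabetes, n = 2, with_ties = FALSE) |>
  # pull() extracts one column as a plain vector
  pull(CityName)

top_two
```

Setting with_ties = FALSE keeps exactly n rows, which matches what slice(1:5) does; the default, with_ties = TRUE, may return more rows when values are tied.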

Now that I have the 5 cities with the highest rates of diabetes in adults, I created a scatterplot to show diabetes percentages by population across tracts, with a panel for each of those top cities using facet_wrap.

# Create a Scatterplot of diabetes % by population, faceted by city

ggplot(latlong_top5, aes(x = PopulationCount, y = Data_Value, color = CityName)) +
  geom_point(alpha = 0.6, size=3) +
  facet_wrap(~CityName) +
  scale_color_viridis_d() +
  labs(
    title = "Diabetes by Population in the Top 5 \nCities with Highest Rates in Florida (2017)",
    x = "Population",
    y = "Diabetes %",
    color = "City",
    caption = "Source: CDC, 500 Cities Project (2016–2019)"
  ) +
  theme_bw() +
  theme(strip.text = element_text(size = 10, face = "bold"), 
        axis.text.x = element_text(angle = 45, hjust = 1), 
        plot.title = element_text(hjust = 0.5))

After this scatterplot, I would like to create a bar graph that shows the average diabetes rate in the top 5 cities in Florida by city. I’ll also need it for mapping later.

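First function (the code below uses dplyr's first(), loaded with the tidyverse; the linked gdata function behaves similarly):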

https://www.rdocumentation.org/packages/gdata/versions/3.0.1/topics/first

#Top 5 cities with highest average diabetes %, total population summed
diabetesfl <- latlongfl |>
  group_by(CityName) |>
  summarize(
    mean_diabetes = mean(Data_Value, na.rm = TRUE),
    lat = first(lat),
    long = first(long),
    PopulationCount = sum(PopulationCount, na.rm = TRUE)
  ) |>
  arrange(desc(mean_diabetes)) |>
  slice(1:5)

#Create Bar Graph for Average Diabetes % in Top 5 Cities
ggplot(diabetesfl, aes(x = reorder(CityName, -mean_diabetes),
                       y = mean_diabetes,
                       fill = mean_diabetes)) +
  geom_col() +  # same result as geom_bar(stat = "identity")
  geom_text(aes(label = paste0(round(mean_diabetes, 1), "%")),
            vjust = -0.5, size = 3) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Top 5 Cities with the Highest Average\n Rate of Diabetes in Adults in Florida (2017)",
       x = "City",
       y = "Average % with Diabetes",
       fill = "Diabetes %",
       caption = "Source: CDC, 500 Cities Project (2016–2019)") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        plot.title = element_text(hjust = 0.5))
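
Because the summary above uses first(lat) and first(long), each city is placed at the coordinates of whichever tract happens to come first. One possible alternative, shown here as a minimal sketch on made-up toy values (the column names lat_first, lat_center and long_center are assumptions for illustration), is to average the tract coordinates:

```r
library(dplyr)

# Toy tracts for two cities (coordinates and rates are made up)
toy <- tibble(
  CityName   = c("A", "A", "A", "B"),
  lat        = c(25.0, 26.0, 27.0, 30.0),
  long       = c(-80.0, -81.0, -82.0, -85.0),
  Data_Value = c(16, 18, 20, 15)
)

city_points <- toy |>
  group_by(CityName) |>
  summarize(
    mean_diabetes = mean(Data_Value, na.rm = TRUE),
    lat_first     = first(lat),               # position of the first tract listed
    lat_center    = mean(lat, na.rm = TRUE),  # average position of all tracts
    long_center   = mean(long, na.rm = TRUE)
  )

city_points
```

For city A the two approaches differ (25 versus 26 degrees latitude), while for a city with a single tract they give the same point.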

3. Now create a map of your subsetted dataset.

First map chunk here

Florida lat=27.994402, long= -81.760254.

Lat and Long Florida:

-https://www.latlong.net/place/florida-usa-15262.html

-https://leaflet-extras.github.io/leaflet-providers/preview/

florida_lon <- -81.760254
florida_lat <- 27.994402

# Calculate the average diabetes rate by city, using Florida tracts with a rate of 15% or higher

diabetesfl2 <- latlongfl |>
  group_by(CityName) |>
  summarize(
    mean_diabetes = mean(Data_Value, na.rm = TRUE),
    lat = first(lat),
    long = first(long),
    PopulationCount = sum(PopulationCount, na.rm = TRUE)
  ) |>
  arrange(desc(mean_diabetes))

#Create a map
library(leaflet)
leaflet() |>
  setView(florida_lon,florida_lat, zoom = 6.5) |>  
  addProviderTiles("Stadia.Outdoors") |>
  addCircles(
    data = diabetesfl2,
    lat = diabetesfl2$lat,
    lng = diabetesfl2$long,
    radius = diabetesfl2$mean_diabetes)

Adding radius scale:

-https://r-graph-gallery.com/182-add-circles-rectangles-on-leaflet-map.html

library(leaflet)

leaflet() |>
  setView(lng = florida_lon, lat = florida_lat, zoom = 7) |>  
  addProviderTiles("Stadia.Outdoors") |>
  addCircles(
    data = diabetesfl2,
    lat = ~lat,
    lng = ~long,
    radius = ~mean_diabetes * 500,  # scaled up for better visibility
    color = "#eb1e17",
    fillOpacity = 0.5
  )
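
The scaling is needed because addCircles() takes its radius in meters, so an unscaled rate of about 15 to 20 draws a circle only a few meters wide. A minimal sketch of the difference, on made-up toy points and with the default addTiles() basemap swapped in for the provider tiles (both are assumptions for illustration):

```r
library(leaflet)

# Toy city points (coordinates and rates are made up)
toy <- data.frame(
  CityName      = c("City A", "City B"),
  lat           = c(25.8, 28.5),
  long          = c(-80.3, -81.4),
  mean_diabetes = c(19.5, 16.2)
)

m <- leaflet(toy) |>
  addTiles() |>
  # addCircles(): radius is in meters, so the circle changes size on screen with zoom
  addCircles(lng = ~long, lat = ~lat, radius = ~mean_diabetes * 500) |>
  # addCircleMarkers(): radius is in pixels, so the marker keeps its screen size
  addCircleMarkers(lng = ~long, lat = ~lat, radius = ~mean_diabetes / 2)

m
```

Which one fits better depends on whether the circles should represent a real area on the ground or simply stay readable at every zoom level.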

4. Refine your map to include a mouse-click tooltip

Refined map chunk here

Adding tooltips and using round() for better readability on the map.

# Create popup
popupdiabetes <- paste0(
  "<b>City: </b>", diabetesfl2$CityName, "<br>",
  "<b>Diabetes Prevalence: </b>", round(diabetesfl2$mean_diabetes, 1), "%<br>",
  "<b>Population: </b>", diabetesfl2$PopulationCount
)

# Create the interactive map
leaflet() |>
  setView(lng = florida_lon, lat = florida_lat, zoom = 6.4) |>
  addProviderTiles("Stadia.Outdoors") |>
  addCircles(
    data = diabetesfl2,
    lat = ~lat,
    lng = ~long,
    radius = ~mean_diabetes * 1000,  # larger scale factor than the previous map
    color = "#8a0b07",      
    fillColor = "#ed3434",             
    fillOpacity = 0.5,
    popup = popupdiabetes
  )
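
The population figure in the popup could also be made easier to read with thousands separators. A minimal sketch on made-up toy values (the data frame toy and the name popup_text are assumptions standing in for diabetesfl2 and popupdiabetes):

```r
# Toy values standing in for diabetesfl2 (made up for illustration)
toy <- data.frame(
  CityName        = c("City A", "City B"),
  mean_diabetes   = c(17.34, 19.86),
  PopulationCount = c(238300, 52000)
)

popup_text <- paste0(
  "<b>City: </b>", toy$CityName, "<br>",
  "<b>Diabetes Prevalence: </b>", round(toy$mean_diabetes, 1), "%<br>",
  # formatC() with big.mark adds thousands separators to whole numbers
  "<b>Population: </b>", formatC(toy$PopulationCount, format = "d", big.mark = ",")
)

popup_text[1]
```

The resulting character vector can be passed to the popup argument in the same way as before.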

5. Write a paragraph

In a paragraph, describe the plots you created and what they show.

In this assignment, I focused on Florida because it is one of the states with the largest immigrant populations, and I wanted to explore how the percentage of adults with diabetes varies across cities. My first plot is a scatterplot of diabetes percentage against population for the census tracts in the five cities with the highest tract-level rates, with one panel per city. This helped me see how the numbers vary within each city. In the second plot, I created a bar graph that shows the average diabetes rate by city for the five cities with the highest averages. What I found interesting is that some cities with high individual tract rates did not necessarily have the highest averages. On the map, I displayed the 25 Florida cities that have census tracts with a diabetes rate of 15% or higher, with each circle sized by the city's average rate across those tracts, showing how those cities are distributed across the state. From this, I observed that the highest rates tend to appear in the south of Florida, where most immigrant populations are located. This makes it interesting for future analysis to focus on factors like race or gender in these areas, especially for chronic illnesses like diabetes. It could help identify trends and support better decision-making to improve health outcomes in these communities.