Healthy Cities GIS Assignment

Author

Karim MardamBey

Load the libraries and set the working directory

library(tidyverse)  # attaches readr, dplyr, tidyr, stringr, and ggplot2
library(leaflet)
setwd("~/Desktop/Data Science MC/Data Science 110")
cities500 <- read_csv("500CitiesLocalHealthIndicators.cdc.csv")

The GeoLocation variable has (lat, long) format

Split GeoLocation (lat, long) into two columns: lat and long

latlong <- cities500|>
  mutate(GeoLocation = str_replace_all(GeoLocation, "[()]", ""))|>
  separate(GeoLocation, into = c("lat", "long"), sep = ",", convert = TRUE)
head(latlong)
# A tibble: 6 × 25
   Year StateAbbr StateDesc  CityName  GeographicLevel DataSource Category      
  <dbl> <chr>     <chr>      <chr>     <chr>           <chr>      <chr>         
1  2017 CA        California Hawthorne Census Tract    BRFSS      Health Outcom…
2  2017 CA        California Hawthorne City            BRFSS      Unhealthy Beh…
3  2017 CA        California Hayward   City            BRFSS      Health Outcom…
4  2017 CA        California Hayward   City            BRFSS      Unhealthy Beh…
5  2017 CA        California Hemet     City            BRFSS      Prevention    
6  2017 CA        California Indio     Census Tract    BRFSS      Health Outcom…
# ℹ 18 more variables: UniqueID <chr>, Measure <chr>, Data_Value_Unit <chr>,
#   DataValueTypeID <chr>, Data_Value_Type <chr>, Data_Value <dbl>,
#   Low_Confidence_Limit <dbl>, High_Confidence_Limit <dbl>,
#   Data_Value_Footnote_Symbol <chr>, Data_Value_Footnote <chr>,
#   PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
#   MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
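
The split can be checked on a small made-up example. The sketch below uses a hypothetical two-row tibble named `toy` (the coordinates are illustrative, not taken from the CDC file). Its `sep` pattern also consumes the space after the comma, which assumes the strings are formatted as "(lat, long)".

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical toy data in the same "(lat, long)" text format
toy <- tibble(
  CityName    = c("City A", "City B"),
  GeoLocation = c("(33.91, -118.35)", "(37.63, -122.10)")
)

toy_split <- toy |>
  # Strip the parentheses so only "lat, long" remains
  mutate(GeoLocation = str_replace_all(GeoLocation, "[()]", "")) |>
  # Split on the comma plus any following spaces; convert = TRUE turns
  # the two pieces into numeric columns
  separate(GeoLocation, into = c("lat", "long"), sep = ",\\s*", convert = TRUE)

toy_split
```

Checking that `lat` and `long` come back as numeric columns is a quick way to confirm the split worked before mapping.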

Filter the dataset

Remove the rows whose StateDesc is "United States", keep only the crude prevalence values, and keep only 2017.

latlong_clean <- latlong |>
  filter(StateDesc != "United States") |>
  filter(Data_Value_Type == "Crude prevalence") |>
  filter(Year == 2017)
head(latlong_clean)
# A tibble: 6 × 25
   Year StateAbbr StateDesc  CityName  GeographicLevel DataSource Category      
  <dbl> <chr>     <chr>      <chr>     <chr>           <chr>      <chr>         
1  2017 CA        California Hawthorne Census Tract    BRFSS      Health Outcom…
2  2017 CA        California Hawthorne City            BRFSS      Unhealthy Beh…
3  2017 CA        California Hayward   City            BRFSS      Unhealthy Beh…
4  2017 CA        California Indio     Census Tract    BRFSS      Health Outcom…
5  2017 CA        California Inglewood Census Tract    BRFSS      Health Outcom…
6  2017 CA        California Lakewood  City            BRFSS      Unhealthy Beh…
# ℹ 18 more variables: UniqueID <chr>, Measure <chr>, Data_Value_Unit <chr>,
#   DataValueTypeID <chr>, Data_Value_Type <chr>, Data_Value <dbl>,
#   Low_Confidence_Limit <dbl>, High_Confidence_Limit <dbl>,
#   Data_Value_Footnote_Symbol <chr>, Data_Value_Footnote <chr>,
#   PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
#   MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
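
The same conditions can also be written in a single filter() call, because comma-separated conditions are combined with AND. The sketch below runs on a made-up tibble named `toy` with hypothetical values. The second step shows where a condition such as Category == "Prevention" could go if only one category were of interest.

```r
library(dplyr)

# Hypothetical rows that mimic the columns used in the filter
toy <- tibble(
  Year            = c(2017, 2017, 2016, 2017, 2017),
  StateDesc       = c("California", "United States", "California",
                      "New York", "New York"),
  Category        = c("Prevention", "Prevention", "Health Outcomes",
                      "Health Outcomes", "Prevention"),
  Data_Value_Type = c("Crude prevalence", "Crude prevalence", "Crude prevalence",
                      "Crude prevalence", "Age-adjusted prevalence")
)

# Comma-separated conditions are combined with AND
toy_clean <- toy |>
  filter(
    StateDesc != "United States",
    Data_Value_Type == "Crude prevalence",
    Year == 2017
  )

# Optional extra step: keep a single category
toy_prevention <- toy_clean |>
  filter(Category == "Prevention")

# Count rows per category to see what the first filter kept
count(toy_clean, Category)
```

Counting by Category after filtering makes it easy to see whether the subset still mixes several categories.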

What variables are included? (can any of them be removed?)

names(latlong_clean)
 [1] "Year"                       "StateAbbr"                 
 [3] "StateDesc"                  "CityName"                  
 [5] "GeographicLevel"            "DataSource"                
 [7] "Category"                   "UniqueID"                  
 [9] "Measure"                    "Data_Value_Unit"           
[11] "DataValueTypeID"            "Data_Value_Type"           
[13] "Data_Value"                 "Low_Confidence_Limit"      
[15] "High_Confidence_Limit"      "Data_Value_Footnote_Symbol"
[17] "Data_Value_Footnote"        "PopulationCount"           
[19] "lat"                        "long"                      
[21] "CategoryID"                 "MeasureId"                 
[23] "CityFIPS"                   "TractFIPS"                 
[25] "Short_Question_Text"       

Remove the variables that will not be used in the assignment

latlong_clean2 <- latlong_clean |>
  select(-DataSource,-Data_Value_Unit, -DataValueTypeID, -Low_Confidence_Limit, -High_Confidence_Limit, -Data_Value_Footnote_Symbol, -Data_Value_Footnote)
head(latlong_clean2)
# A tibble: 6 × 18
   Year StateAbbr StateDesc  CityName  GeographicLevel Category UniqueID Measure
  <dbl> <chr>     <chr>      <chr>     <chr>           <chr>    <chr>    <chr>  
1  2017 CA        California Hawthorne Census Tract    Health … 0632548… Arthri…
2  2017 CA        California Hawthorne City            Unhealt… 632548   Curren…
3  2017 CA        California Hayward   City            Unhealt… 633000   Obesit…
4  2017 CA        California Indio     Census Tract    Health … 0636448… Arthri…
5  2017 CA        California Inglewood Census Tract    Health … 0636546… Diagno…
6  2017 CA        California Lakewood  City            Unhealt… 639892   Obesit…
# ℹ 10 more variables: Data_Value_Type <chr>, Data_Value <dbl>,
#   PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
#   MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
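
When the list of unwanted columns grows, it can be kept in a character vector and passed to select() through any_of(), which ignores names that are not present. This is a minimal sketch on a made-up tibble; the values and the `drop_cols` name are illustrative.

```r
library(dplyr)

# Hypothetical data with a few of the columns from the real file
toy <- tibble(
  CityName             = c("City A", "City B"),
  Data_Value           = c(12.3, 8.9),
  DataSource           = c("BRFSS", "BRFSS"),
  Low_Confidence_Limit = c(11.0, 7.5)
)

# Columns to drop; "Data_Value_Footnote" is absent from toy on purpose
drop_cols <- c("DataSource", "Low_Confidence_Limit", "Data_Value_Footnote")

# any_of() skips names that do not exist instead of raising an error
toy_small <- toy |>
  select(-any_of(drop_cols))

names(toy_small)
```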

The new dataset, latlong_clean2, is now a manageable size.

For your assignment, work with a cleaned dataset.

1. Once you run the above code and learn how to filter in this format, filter this dataset however you choose so that you have a subset with no more than 900 observations.

Filter chunk here

# Filter for a specific state, then drop rows with missing values
newyork_data <- latlong_clean2 |>
  filter(StateDesc == "New York") |>
  filter(!is.na(Data_Value), !is.na(PopulationCount), !is.na(Measure))

unique(newyork_data$Short_Question_Text)
 [1] "Stroke"                 "Cholesterol Screening"  "COPD"                  
 [4] "Diabetes"               "Chronic Kidney Disease" "Physical Inactivity"   
 [7] "Mental Health"          "Obesity"                "Physical Health"       
[10] "Health Insurance"       "Annual Checkup"         "Arthritis"             
[13] "Taking BP Medication"   "High Blood Pressure"    "Binge Drinking"        
[16] "High Cholesterol"       "Coronary Heart Disease" "Current Smoking"       
[19] "Cancer (except skin)"   "Current Asthma"        
# Filters by Short Question Text, City Name, and Population Count less than 1500
newyork_data2 <- newyork_data |>
  filter(Short_Question_Text %in% c("Current Smoking", "Current Asthma", "Stroke")) |>
  filter(CityName %in% c("New York", "Buffalo", "Rochester", "Syracuse", "Albany"), PopulationCount < 1500)
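
Because the assignment caps the subset at 900 observations, it is worth confirming the row count after filtering. The sketch below shows one way to do that on a made-up tibble. The `toy` data and its values are hypothetical; with the real data, `newyork_data2` would take its place.

```r
library(dplyr)

# Hypothetical rows with the three columns used in the filter
toy <- tibble(
  CityName            = c("Buffalo", "Buffalo", "Albany", "Yonkers", "Albany"),
  Short_Question_Text = c("Stroke", "Obesity", "Current Asthma",
                          "Stroke", "Current Smoking"),
  PopulationCount     = c(1200, 900, 1450, 1100, 2300)
)

toy_subset <- toy |>
  filter(
    Short_Question_Text %in% c("Current Smoking", "Current Asthma", "Stroke"),
    CityName %in% c("Buffalo", "Albany"),
    PopulationCount < 1500
  )

# Confirm the subset respects the 900-row limit
nrow(toy_subset)
stopifnot(nrow(toy_subset) <= 900)

# See how many rows each city and measure contributes
count(toy_subset, CityName, Short_Question_Text)
```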

2. Based on the GIS tutorial (Japan earthquakes), create one plot about something in your subsetted dataset.

First plot chunk here

ggplot(newyork_data2, aes(x = CityName, y = Data_Value, color = Short_Question_Text)) +
  geom_jitter(width = 0.35, alpha = 0.7, size = 2.5) +
  labs(title = "Health Measures for 5 New York Cities", x = "City", y = "Crude Prevalence", color = "Health Measure", caption = "500 Healthy Cities (CDC)") +
  theme_bw(base_size = 8)

3. Now create a map of your subsetted dataset.

First map chunk here

leaflet() |>
  setView(lng = -73.97213, lat = 40.76260, zoom = 5) |>
  addProviderTiles("OpenStreetMap.Mapnik") |>
  addCircles(
    data = newyork_data2,
    lng = ~long,  # name the coordinate columns explicitly instead of letting leaflet guess
    lat = ~lat,
    radius = ~Data_Value,
    color = "#e20f32",
    fillColor = "#713c65",
    fillOpacity = 0.6
  )

4. Refine your map to include a mouse-click tooltip

Refined map chunk here

# Build the HTML shown when a circle is clicked
popupNY <- paste0(
  "<b>City:</b> ", newyork_data2$CityName, "<br>",
  "<b>Measure:</b> ", newyork_data2$Short_Question_Text, "<br>",
  "<b>Value:</b> ", newyork_data2$Data_Value, "%"
)
leaflet() |>
  setView(lng = -73.97213, lat = 40.76260, zoom = 5) |>
  addProviderTiles("OpenStreetMap.Mapnik") |>
  addCircles(
    data = newyork_data2,
    lng = ~long,
    lat = ~lat,
    radius = newyork_data2$Data_Value,
    color = "#e20f32",
    fillColor = "#713c65",
    fillOpacity = 0.6,
    popup = popupNY
  )
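
One thing to keep in mind is that addCircles() interprets radius in meters, so raw prevalence values draw circles only a few meters across, which are hard to see at a low zoom level. A possible alternative, sketched below on made-up points, is addCircleMarkers(), whose radius is measured in screen pixels. Building the popup inside a formula also keeps it tied to the data rows. The `toy` data frame, its coordinates, and the square-root scaling are illustrative assumptions, not part of the assignment data.

```r
library(leaflet)

# Hypothetical points; coordinates and values are made up for illustration
toy <- data.frame(
  CityName            = c("City A", "City B"),
  Short_Question_Text = c("Stroke", "Current Smoking"),
  Data_Value          = c(3.1, 18.4),
  lat                 = c(40.76, 42.89),
  long                = c(-73.97, -78.88)
)

toy_map <- leaflet(data = toy) |>
  addProviderTiles("OpenStreetMap.Mapnik") |>
  addCircleMarkers(
    lng = ~long,
    lat = ~lat,
    # Radius is in pixels here; the square root keeps large values from dominating
    radius = ~sqrt(Data_Value) * 3,
    color = "#e20f32",
    fillColor = "#713c65",
    fillOpacity = 0.6,
    # The formula is evaluated against the data, one popup per row
    popup = ~paste0(
      "<b>City:</b> ", CityName, "<br>",
      "<b>Measure:</b> ", Short_Question_Text, "<br>",
      "<b>Value:</b> ", Data_Value, "%"
    )
  )

toy_map
```

Pixel-based markers stay the same size on screen at every zoom level, which suits a statewide view. Meter-based circles are the better choice when the circle should represent a real distance on the ground.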

5. Write a paragraph

Plot 1 is a scatter plot showing the crude prevalence (in percent) of three health indicators (Current Smoking, Current Asthma, and Stroke) in five major New York cities (New York City, Buffalo, Rochester, Syracuse, and Albany). I used a jitter plot so that overlapping points are spread apart and easier to read. The plot shows that current smoking is widely reported, while stroke is reported less often. Plot 2 is an interactive map of the same subset. It draws one circle per observation for the three health measures, and clicking a circle shows its city, measure, and corresponding data value.