Healthy Cities GIS Assignment

Author

Emilio Sanchez San Martin

Load the libraries and set the working directory

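library(tidyverse) # attaches dplyr, tidyr, stringr, readr, and ggplot2
library(plotly)
library(leaflet)
# Adjust this path to wherever the course folder lives on your machine
setwd("~/Downloads/Data 101 and Data 110 class/Data 110")
cities500 <- read_csv("Data Sets/500CitiesLocalHealthIndicators.cdc.csv")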

The GeoLocation variable stores each coordinate pair as text in (lat, long) format

Split GeoLocation (lat, long) into two columns: lat and long

latlong <- cities500 |>
  mutate(GeoLocation = str_replace_all(GeoLocation, "[()]", "")) |>
  separate(GeoLocation, into = c("lat", "long"), sep = ",", convert = TRUE)
head(latlong)
# A tibble: 6 × 25
   Year StateAbbr StateDesc  CityName  GeographicLevel DataSource Category      
  <dbl> <chr>     <chr>      <chr>     <chr>           <chr>      <chr>         
1  2017 CA        California Hawthorne Census Tract    BRFSS      Health Outcom…
2  2017 CA        California Hawthorne City            BRFSS      Unhealthy Beh…
3  2017 CA        California Hayward   City            BRFSS      Health Outcom…
4  2017 CA        California Hayward   City            BRFSS      Unhealthy Beh…
5  2017 CA        California Hemet     City            BRFSS      Prevention    
6  2017 CA        California Indio     Census Tract    BRFSS      Health Outcom…
# ℹ 18 more variables: UniqueID <chr>, Measure <chr>, Data_Value_Unit <chr>,
#   DataValueTypeID <chr>, Data_Value_Type <chr>, Data_Value <dbl>,
#   Low_Confidence_Limit <dbl>, High_Confidence_Limit <dbl>,
#   Data_Value_Footnote_Symbol <chr>, Data_Value_Footnote <chr>,
#   PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
#   MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
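
To see what the two steps above do to a single value, here is a minimal sketch on a made-up tibble. The coordinates and the names `toy` and `toy_split` are invented for illustration and are not rows from the CDC file.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Made-up coordinates in the same "(lat, long)" text format as GeoLocation
toy <- tibble(GeoLocation = c("(39.29, -76.61)", "(33.91, -118.35)"))

toy_split <- toy |>
  # Drop the parentheses so only "lat, long" remains
  mutate(GeoLocation = str_replace_all(GeoLocation, "[()]", "")) |>
  # Split on the comma; convert = TRUE turns the text pieces into numbers
  separate(GeoLocation, into = c("lat", "long"), sep = ",", convert = TRUE)

toy_split
```

Without `convert = TRUE`, both new columns would stay as character columns and could not be passed to a map as coordinates.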

Filter the dataset

Remove the rows where StateDesc is "United States", keep only the Prevention category, keep only rows that measure crude prevalence, and keep only the year 2017.

latlong_clean <- latlong |>
  filter(StateDesc != "United States") |>
  filter(Category == "Prevention") |>
  filter(Data_Value_Type == "Crude prevalence") |>
  filter(Year == 2017)
head(latlong_clean)
# A tibble: 6 × 25
   Year StateAbbr StateDesc  CityName   GeographicLevel DataSource Category  
  <dbl> <chr>     <chr>      <chr>      <chr>           <chr>      <chr>     
1  2017 AL        Alabama    Montgomery City            BRFSS      Prevention
2  2017 CA        California Concord    City            BRFSS      Prevention
3  2017 CA        California Concord    City            BRFSS      Prevention
4  2017 CA        California Fontana    City            BRFSS      Prevention
5  2017 CA        California Richmond   Census Tract    BRFSS      Prevention
6  2017 FL        Florida    Davie      Census Tract    BRFSS      Prevention
# ℹ 18 more variables: UniqueID <chr>, Measure <chr>, Data_Value_Unit <chr>,
#   DataValueTypeID <chr>, Data_Value_Type <chr>, Data_Value <dbl>,
#   Low_Confidence_Limit <dbl>, High_Confidence_Limit <dbl>,
#   Data_Value_Footnote_Symbol <chr>, Data_Value_Footnote <chr>,
#   PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
#   MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
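
The four chained `filter()` calls above could also be written as one call with several conditions, since conditions separated by commas are combined with AND. The sketch below checks this on a made-up tibble. The rows and the names `toy`, `chained`, and `combined` are invented for illustration.

```r
library(dplyr)

# Made-up rows that mimic the columns used in the filter
toy <- tibble(
  StateDesc = c("United States", "Maryland", "Maryland", "Maryland"),
  Category = c("Prevention", "Prevention", "Health Outcomes", "Prevention"),
  Data_Value_Type = c("Crude prevalence", "Crude prevalence",
                      "Crude prevalence", "Age-adjusted prevalence"),
  Year = c(2017, 2017, 2017, 2017)
)

# One condition per filter() call, as in the chunk above
chained <- toy |>
  filter(StateDesc != "United States") |>
  filter(Category == "Prevention") |>
  filter(Data_Value_Type == "Crude prevalence") |>
  filter(Year == 2017)

# All conditions in a single filter() call
combined <- toy |>
  filter(
    StateDesc != "United States",
    Category == "Prevention",
    Data_Value_Type == "Crude prevalence",
    Year == 2017
  )

# Compare the two results
all.equal(chained, combined)
```

Either form is fine. The chained version makes it easy to comment out one condition at a time while exploring.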

What variables are included? (Can any of them be removed?)

names(latlong_clean)
 [1] "Year"                       "StateAbbr"                 
 [3] "StateDesc"                  "CityName"                  
 [5] "GeographicLevel"            "DataSource"                
 [7] "Category"                   "UniqueID"                  
 [9] "Measure"                    "Data_Value_Unit"           
[11] "DataValueTypeID"            "Data_Value_Type"           
[13] "Data_Value"                 "Low_Confidence_Limit"      
[15] "High_Confidence_Limit"      "Data_Value_Footnote_Symbol"
[17] "Data_Value_Footnote"        "PopulationCount"           
[19] "lat"                        "long"                      
[21] "CategoryID"                 "MeasureId"                 
[23] "CityFIPS"                   "TractFIPS"                 
[25] "Short_Question_Text"       

Remove the variables that will not be used in the assignment

prevention <- latlong_clean |>
  select(
    -DataSource, -Data_Value_Unit, -DataValueTypeID,
    -Low_Confidence_Limit, -High_Confidence_Limit,
    -Data_Value_Footnote_Symbol, -Data_Value_Footnote
  )
head(prevention)
# A tibble: 6 × 18
   Year StateAbbr StateDesc  CityName  GeographicLevel Category UniqueID Measure
  <dbl> <chr>     <chr>      <chr>     <chr>           <chr>    <chr>    <chr>  
1  2017 AL        Alabama    Montgome… City            Prevent… 151000   Choles…
2  2017 CA        California Concord   City            Prevent… 616000   Visits…
3  2017 CA        California Concord   City            Prevent… 616000   Choles…
4  2017 CA        California Fontana   City            Prevent… 624680   Visits…
5  2017 CA        California Richmond  Census Tract    Prevent… 0660620… Choles…
6  2017 FL        Florida    Davie     Census Tract    Prevent… 1216475… Choles…
# ℹ 10 more variables: Data_Value_Type <chr>, Data_Value <dbl>,
#   PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
#   MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
# Keep only the Maryland rows
md <- prevention |>
  filter(StateAbbr == "MD")
head(md)
# A tibble: 6 × 18
   Year StateAbbr StateDesc CityName  GeographicLevel Category  UniqueID Measure
  <dbl> <chr>     <chr>     <chr>     <chr>           <chr>     <chr>    <chr>  
1  2017 MD        Maryland  Baltimore Census Tract    Preventi… 2404000… "Chole…
2  2017 MD        Maryland  Baltimore Census Tract    Preventi… 2404000… "Visit…
3  2017 MD        Maryland  Baltimore Census Tract    Preventi… 2404000… "Visit…
4  2017 MD        Maryland  Baltimore Census Tract    Preventi… 2404000… "Curre…
5  2017 MD        Maryland  Baltimore Census Tract    Preventi… 2404000… "Curre…
6  2017 MD        Maryland  Baltimore Census Tract    Preventi… 2404000… "Visit…
# ℹ 10 more variables: Data_Value_Type <chr>, Data_Value <dbl>,
#   PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
#   MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>

The new dataset “prevention” is now a manageable size, and so is the Maryland subset md.

For your assignment, work with the cleaned “Prevention” dataset

1. Once you run the above code, filter this dataset one more time for any particular subset.

Filter chunk here

# I want to keep only the rows where Short_Question_Text is "Cholesterol Screening" in the md data set, to see the prevalence of Cholesterol Screening in Maryland.

cholesterol <- md |>
  filter(Short_Question_Text == "Cholesterol Screening")

2. Based on the GIS tutorial (Japan earthquakes), create one plot about something in your subsetted dataset.

First plot chunk here

# I want to make a scatterplot of the Cholesterol Screening data for Maryland, ideally with custom colors.
# Turn off scientific notation so the axis labels show full numbers
options(scipen=999)

p1 <- cholesterol |>
  ggplot(aes(x = PopulationCount, y = Data_Value)) +
  geom_point() +
  labs(title = "Cholesterol Screening in Maryland", x = "Population Count", y = "Data Value")

# Convert the ggplot into an interactive plotly chart

p1 <- ggplotly(p1)
p1

Oh wow, the point with the highest population count has a data value of 82.7! How interesting. I wonder which part of Baltimore this is.

I will make the graph again, but this time I will leave out the point with the highest population count.
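
One way to find out which row that is would be to pull out the row with the largest PopulationCount and look at its GeographicLevel. The data contains both City and Census Tract rows, so the largest point may be the city-wide row rather than one neighborhood, but that is an assumption to check. The sketch below uses a made-up tibble, with invented values and the invented names `toy`, `largest`, and `tracts_only`. On the real data the same calls would be applied to `cholesterol`.

```r
library(dplyr)

# Made-up stand-in for the cholesterol data: one city-level row and two tracts
toy <- tibble(
  GeographicLevel = c("City", "Census Tract", "Census Tract"),
  PopulationCount = c(600000, 4200, 900),
  Data_Value = c(80, 85, 90)
)

# Pull out the row with the largest population and inspect its level
largest <- toy |>
  slice_max(PopulationCount, n = 1)
largest

# Keeping only tracts removes a city-wide row without a hard-coded cutoff
tracts_only <- toy |>
  filter(GeographicLevel == "Census Tract")
tracts_only
```

Filtering on GeographicLevel states the intent more directly than a population threshold, but it only helps if the largest point really is a city-level row.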

p2 <- cholesterol |>
  filter(PopulationCount < 100000) |>
  ggplot(aes(x = PopulationCount, y = Data_Value)) +
  geom_point() +
  labs(title = "Cholesterol Screening in Maryland", x = "Population Count", y = "Data Value")

p2 <- ggplotly(p2)
p2

One area in Baltimore with a population of 856 has a high Cholesterol Screening data value of 90.1. The area with the largest population in this plot, 6572, also has a high data value of 87.6.

Out of curiosity, I want to see the correlation between these two variables. Ignore any NA values.

cor(cholesterol$PopulationCount, cholesterol$Data_Value, use = "complete.obs") # use = "complete.obs" skips rows with NA (I used AI to find this option)
[1] 0.002759384

This number is about 0.0028, which is very close to 0. This means that there is essentially no linear correlation between the population count and the Cholesterol Screening data value in Maryland.
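
The `use = "complete.obs"` argument matters because `cor()` returns NA by default as soon as any pair has a missing value. The sketch below shows this on made-up vectors. The names `pop`, `val`, and `keep` and all the numbers are invented for illustration.

```r
# Made-up vectors with one missing population value
pop <- c(1000, 2000, 3000, NA, 5000)
val <- c(80, 82, 79, 85, 81)

# Default behavior: the result is NA because one pair is incomplete
cor(pop, val)

# complete.obs computes the correlation on the four complete pairs only
cor(pop, val, use = "complete.obs")

# Same result as dropping the incomplete pairs by hand
keep <- complete.cases(pop, val)
cor(pop[keep], val[keep])
```

Note that a correlation near 0 only rules out a linear relationship, and a single extreme point can change the value noticeably, so it may be worth computing it with and without the largest population.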

3. Now create a map of your subsetted dataset.

First map chunk here

# Map the Cholesterol Screening data for Maryland with leaflet, using the lat and long variables. This first version has no tooltip yet.

leaflet(cholesterol) |>
  addTiles() |>
  addCircleMarkers(lng = ~long, lat = ~lat)

4. Refine your map to include a mouseover tooltip

Refined map chunk here

popupGIS <- paste0(
  "<b>Population Count: </b>", cholesterol$PopulationCount, "<br>",
  "<b>Data Value: </b>", cholesterol$Data_Value, "<br>"
)

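The code above builds the text that appears when you click a circle marker on the map (in leaflet, the popup argument opens on click, while the label argument is the one that shows on hover). I will also make each circle bigger depending on its Data_Value.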

leaflet(cholesterol) |>
  addTiles() |>
  addCircleMarkers(lng = ~long, lat = ~lat,
                   color = "black",
                   radius = ~Data_Value / 7,
                   fillColor = "blue",
                   popup = popupGIS)
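
Since the task asks for a mouseover tooltip, a possible variant passes the text to the `label` argument of `addCircleMarkers()`, which leaflet shows on hover. The sketch below is one way to do it on a made-up data frame. The coordinates and the names `toy`, `hover_text`, and `m` are invented for illustration, and the label is kept as plain text (no HTML tags) to keep the example simple.

```r
library(leaflet)

# Made-up points near Baltimore, standing in for the cholesterol data
toy <- data.frame(
  lat = c(39.29, 39.31),
  long = c(-76.61, -76.63),
  PopulationCount = c(856, 6572),
  Data_Value = c(90.1, 87.6)
)

# One plain-text label per row, shown when the mouse hovers over a marker
hover_text <- paste0(
  "Population Count: ", toy$PopulationCount,
  " | Data Value: ", toy$Data_Value
)

m <- leaflet(toy) |>
  addTiles() |>
  addCircleMarkers(
    lng = ~long, lat = ~lat,
    color = "black",
    fillColor = "blue",
    radius = ~Data_Value / 7,
    label = hover_text # label shows on hover; popup shows on click
  )
m
```

Both arguments can be used on the same markers, so a short label on hover and a longer popup on click is also an option.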

5. Write a paragraph

In a paragraph, describe the plots you created and what they show.

My map shows the prevalence of Cholesterol Screening in Maryland. Using the given data, I mapped the population count and the Cholesterol Screening data value (a crude prevalence) for each location in Baltimore, Maryland. The scatterplot shows the relationship between the population count and the Cholesterol Screening data value. I made it to see whether population size had anything to do with the data value, that is, whether areas with larger populations tend to have higher screening values. The correlation is about 0.0028, which is very close to 0. This means there is essentially no linear correlation between the population and the Cholesterol Screening data value, so the areas with the biggest populations did not have noticeably higher values. Some smaller populations, however, did have high Cholesterol Screening values, so population size does not seem to predict how common screening is around Baltimore. The scatterplot also shows that the point with the highest population count has a data value of 82.7. The map shows the different locations in Baltimore, where a larger blue circle marker indicates a higher data value. I noticed that for a population count of 22, no data value was given. The rest of the data scattered around the map seem to be pretty close to one another. I wish I had done a better job of making the circles bigger for higher data values and smaller for the lowest ones, to show a clearer image. However, I think I did a good job of making the map and the scatterplot.
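
One possible way to make the circle sizes differ more clearly is to rescale the data values onto a fixed radius range before mapping, instead of dividing by a constant. The sketch below uses a hypothetical helper, `rescale_radius()`, which is not part of leaflet. The radius limits of 3 and 15 pixels and the values in `vals` are assumptions chosen for illustration.

```r
# Hypothetical helper: map values linearly onto a radius range (in pixels)
rescale_radius <- function(x, min_radius = 3, max_radius = 15) {
  rng <- range(x, na.rm = TRUE)
  # If every value is identical, fall back to the midpoint radius
  if (rng[1] == rng[2]) {
    return(rep((min_radius + max_radius) / 2, length(x)))
  }
  # Smallest value gets min_radius, largest gets max_radius
  min_radius + (x - rng[1]) / (rng[2] - rng[1]) * (max_radius - min_radius)
}

# Made-up data values on a similar scale to the ones discussed above
vals <- c(72.5, 82.7, 87.6, 90.1)
radii <- rescale_radius(vals)
radii
```

In the map, this could be used as `radius = ~rescale_radius(Data_Value)`. Dividing by 7 turns values between roughly 70 and 90 into radii between about 10 and 13, which is hard to tell apart by eye, while min-max rescaling spreads the same values across the whole chosen range.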