Split GeoLocation (lat, long) into two columns: lat and long
latlong <- cities500 |>
  mutate(GeoLocation = str_replace_all(GeoLocation, "[()]", "")) |>
  separate(GeoLocation, into = c("lat", "long"), sep = ",", convert = TRUE)
head(latlong)
## # A tibble: 6 × 25
## Year StateAbbr StateDesc CityName GeographicLevel DataSource Category
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2017 CA California Hawthorne Census Tract BRFSS Health Outcom…
## 2 2017 CA California Hawthorne City BRFSS Unhealthy Beh…
## 3 2017 CA California Hayward City BRFSS Health Outcom…
## 4 2017 CA California Hayward City BRFSS Unhealthy Beh…
## 5 2017 CA California Hemet City BRFSS Prevention
## 6 2017 CA California Indio Census Tract BRFSS Health Outcom…
## # ℹ 18 more variables: UniqueID <chr>, Measure <chr>, Data_Value_Unit <chr>,
## # DataValueTypeID <chr>, Data_Value_Type <chr>, Data_Value <dbl>,
## # Low_Confidence_Limit <dbl>, High_Confidence_Limit <dbl>,
## # Data_Value_Footnote_Symbol <chr>, Data_Value_Footnote <chr>,
## # PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
## # MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
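To see what the split does on a single value, here is a minimal sketch on a two-row toy table. The table `toy` and its coordinates are made up for illustration (they are not rows from cities500), and I am assuming GeoLocation is stored as text of the form "(lat, long)".

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical stand-in for cities500; coordinates are illustrative only
toy <- tibble(
  CityName = c("Rochester", "Hawthorne"),
  GeoLocation = c("(43.16448, -77.6096)", "(33.9, -118.35)")
)

toy_split <- toy |>
  # Strip the surrounding parentheses so only "lat, long" remains
  mutate(GeoLocation = str_replace_all(GeoLocation, "[()]", "")) |>
  # Split on the comma; convert = TRUE turns the two pieces into numbers
  separate(GeoLocation, into = c("lat", "long"), sep = ",", convert = TRUE)

toy_split
```

Without `convert = TRUE` both new columns would stay character, and the map code later on would not be able to use them as coordinates.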
Remove the rows where StateDesc is "United States", keep only the Prevention category (the category of interest), keep only crude prevalence measurements, and keep only 2017.
latlong_clean <- latlong |>
  filter(StateDesc != "United States") |>
  filter(Category == "Prevention") |>
  filter(Data_Value_Type == "Crude prevalence") |>
  filter(Year == 2017)
head(latlong_clean)
## # A tibble: 6 × 25
## Year StateAbbr StateDesc CityName GeographicLevel DataSource Category
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2017 AL Alabama Montgomery City BRFSS Prevention
## 2 2017 CA California Concord City BRFSS Prevention
## 3 2017 CA California Concord City BRFSS Prevention
## 4 2017 CA California Fontana City BRFSS Prevention
## 5 2017 CA California Richmond Census Tract BRFSS Prevention
## 6 2017 FL Florida Davie Census Tract BRFSS Prevention
## # ℹ 18 more variables: UniqueID <chr>, Measure <chr>, Data_Value_Unit <chr>,
## # DataValueTypeID <chr>, Data_Value_Type <chr>, Data_Value <dbl>,
## # Low_Confidence_Limit <dbl>, High_Confidence_Limit <dbl>,
## # Data_Value_Footnote_Symbol <chr>, Data_Value_Footnote <chr>,
## # PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
## # MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
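The four chained filters could also be written as one call, since `filter()` combines comma-separated conditions with AND. A small sketch on a made-up table (`toy` and its rows are hypothetical, chosen so that only one row survives):

```r
library(dplyr)

# Hypothetical rows; only the second one passes every condition
toy <- tibble(
  StateDesc = c("United States", "California", "California", "Florida"),
  Category = c("Prevention", "Prevention", "Health Outcomes", "Prevention"),
  Data_Value_Type = c("Crude prevalence", "Crude prevalence",
                      "Crude prevalence", "Age-adjusted prevalence"),
  Year = c(2017, 2017, 2017, 2017)
)

# Same style as the chunk above: one condition per filter()
chained <- toy |>
  filter(StateDesc != "United States") |>
  filter(Category == "Prevention") |>
  filter(Data_Value_Type == "Crude prevalence") |>
  filter(Year == 2017)

# Equivalent single call: conditions separated by commas are ANDed
combined <- toy |>
  filter(
    StateDesc != "United States",
    Category == "Prevention",
    Data_Value_Type == "Crude prevalence",
    Year == 2017
  )
```

Both versions keep the same rows; which one to use is mostly a matter of taste.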
Now that we have filtered for only the Prevention category, we no longer need the Category variable. Similarly, we no longer need Data_Value_Type.
## [1] "Year" "StateAbbr"
## [3] "StateDesc" "CityName"
## [5] "GeographicLevel" "DataSource"
## [7] "Category" "UniqueID"
## [9] "Measure" "Data_Value_Unit"
## [11] "DataValueTypeID" "Data_Value_Type"
## [13] "Data_Value" "Low_Confidence_Limit"
## [15] "High_Confidence_Limit" "Data_Value_Footnote_Symbol"
## [17] "Data_Value_Footnote" "PopulationCount"
## [19] "lat" "long"
## [21] "CategoryID" "MeasureId"
## [23] "CityFIPS" "TractFIPS"
## [25] "Short_Question_Text"
prevention <- latlong_clean |>
  select(
    -DataSource, -Data_Value_Unit, -DataValueTypeID,
    -Low_Confidence_Limit, -High_Confidence_Limit,
    -Data_Value_Footnote_Symbol, -Data_Value_Footnote
  )
head(prevention)
## # A tibble: 6 × 18
## Year StateAbbr StateDesc CityName GeographicLevel Category UniqueID Measure
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2017 AL Alabama Montgome… City Prevent… 151000 Choles…
## 2 2017 CA California Concord City Prevent… 616000 Visits…
## 3 2017 CA California Concord City Prevent… 616000 Choles…
## 4 2017 CA California Fontana City Prevent… 624680 Visits…
## 5 2017 CA California Richmond Census Tract Prevent… 0660620… Choles…
## 6 2017 FL Florida Davie Census Tract Prevent… 1216475… Choles…
## # ℹ 10 more variables: Data_Value_Type <chr>, Data_Value <dbl>,
## # PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
## # MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
## # A tibble: 6 × 18
## Year StateAbbr StateDesc CityName GeographicLevel Category UniqueID Measure
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2017 MD Maryland Baltimore Census Tract Preventi… 2404000… "Chole…
## 2 2017 MD Maryland Baltimore Census Tract Preventi… 2404000… "Visit…
## 3 2017 MD Maryland Baltimore Census Tract Preventi… 2404000… "Visit…
## 4 2017 MD Maryland Baltimore Census Tract Preventi… 2404000… "Curre…
## 5 2017 MD Maryland Baltimore Census Tract Preventi… 2404000… "Curre…
## 6 2017 MD Maryland Baltimore Census Tract Preventi… 2404000… "Visit…
## # ℹ 10 more variables: Data_Value_Type <chr>, Data_Value <dbl>,
## # PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
## # MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
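The earlier remark that Category and Data_Value_Type are no longer needed can be checked rather than assumed: after filtering, a column with a single distinct value carries no information. The sketch below finds and drops such columns on a made-up table (`toy` and its values are hypothetical). Note that Category and Data_Value_Type still appear in the 18 columns printed above, so this is an optional extra step, not something the chunks above do.

```r
library(dplyr)

# Hypothetical stand-in for the filtered data; values are illustrative only
toy <- tibble(
  CityName = c("Concord", "Fontana", "Davie"),
  Category = "Prevention",
  Data_Value_Type = "Crude prevalence",
  Data_Value = c(70.1, 65.4, 72.3)
)

# Count distinct values per column; constant columns have exactly one
n_unique <- sapply(toy, n_distinct)
constant_cols <- names(n_unique)[n_unique == 1]

# Drop the constant columns by name
toy_slim <- toy |>
  select(-all_of(constant_cols))

names(toy_slim)
```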
The new dataset, prevention, is now a manageable size.
For your assignment, work with the cleaned prevention dataset.
Filter chunk here
NY_Rochester <- prevention |>
  filter(StateAbbr == "NY") |>
  filter(CityName == "Rochester")
head(NY_Rochester)
## # A tibble: 6 × 18
## Year StateAbbr StateDesc CityName GeographicLevel Category UniqueID Measure
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2017 NY New York Rochester Census Tract Preventi… 3663000… "Curre…
## 2 2017 NY New York Rochester Census Tract Preventi… 3663000… "Visit…
## 3 2017 NY New York Rochester Census Tract Preventi… 3663000… "Chole…
## 4 2017 NY New York Rochester Census Tract Preventi… 3663000… "Curre…
## 5 2017 NY New York Rochester Census Tract Preventi… 3663000… "Chole…
## 6 2017 NY New York Rochester Census Tract Preventi… 3663000… "Visit…
## # ℹ 10 more variables: Data_Value_Type <chr>, Data_Value <dbl>,
## # PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
## # MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
Plotting these got me thinking: everybody should be going for an annual checkup, but how might that depend on where you live?
ggplot(NY_Rochester, aes(x = MeasureId, y = Data_Value, fill = Short_Question_Text)) +
  geom_boxplot() +
  labs(title = "Crude Prevalence of Preventative Treatments/Issues",
       x = "Treatment/Issue Type",
       y = "Crude Prevalence")
## Warning: Removed 16 rows containing non-finite values (`stat_boxplot()`).
ggplot(NY_Rochester, aes(x = MeasureId, y = Data_Value/PopulationCount, fill = Short_Question_Text)) +
  geom_boxplot() +
  labs(title = "Relative Prevalence of Preventative Treatments/Issues",
       x = "Treatment/Issue Type",
       y = "Relative Prevalence")
## Warning: Removed 16 rows containing non-finite values (`stat_boxplot()`).
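Both plots warn that 16 rows were dropped for having non-finite values. To see where rows like that sit, one option is to count them per measure before plotting. This is only a sketch on a made-up table: `toy` and its numbers are hypothetical, and the MeasureId labels are placeholders.

```r
library(dplyr)

# Hypothetical stand-in for NY_Rochester with some missing Data_Value entries
toy <- tibble(
  MeasureId = c("CHECKUP", "CHECKUP", "BPMED", "BPMED"),
  Data_Value = c(70.2, NA, 55.0, NA)
)

# is.finite() is FALSE for NA, NaN, Inf and -Inf, which is what the
# boxplot warning is counting
missing_by_measure <- toy |>
  group_by(MeasureId) |>
  summarise(n_missing = sum(!is.finite(Data_Value)), .groups = "drop")

missing_by_measure
```

Running the same summary on the real data would show whether the missing values are spread evenly or concentrated in one measure.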
First map chunk here
## [1] 43.16448
## [1] -77.6096
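The two values printed above are the same coordinates passed to `setView()` below, and `Rochester_Checkups` is used in the map chunks without its definition being shown. One plausible way to get both is sketched here on a made-up table. Everything in it is an assumption: `toy_rochester` and its coordinates are hypothetical, I am assuming the annual checkup rows are labelled "Annual Checkup" in Short_Question_Text, and I am assuming the map centre is the mean of the lat and long columns.

```r
library(dplyr)

# Hypothetical stand-in for NY_Rochester; coordinates are illustrative only
toy_rochester <- tibble(
  Short_Question_Text = c("Annual Checkup", "Annual Checkup", "Health Insurance"),
  lat = c(43.15, 43.17, 43.16),
  long = c(-77.60, -77.62, -77.61)
)

# Assumption: keep only the annual checkup rows
toy_checkups <- toy_rochester |>
  filter(Short_Question_Text == "Annual Checkup")

# Assumption: centre the map on the mean coordinates of those rows
center_lat <- mean(toy_checkups$lat, na.rm = TRUE)
center_long <- mean(toy_checkups$long, na.rm = TRUE)

center_lat
center_long
```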
leaflet() |>
  setView(lng = -77.6096, lat = 43.16448, zoom = 12) |>
  addProviderTiles("Esri.NatGeoWorldMap") |>
  addCircles(
    data = Rochester_Checkups,
    radius = 5000*Rochester_Checkups$Data_Value/Rochester_Checkups$PopulationCount
  )
## Assuming "long" and "lat" are longitude and latitude, respectively
Refined map chunk here. Note that the mean relative frequency of annual checkups is roughly 3.0%.
rel_freq <- Rochester_Checkups$Data_Value/Rochester_Checkups$PopulationCount
tooltip <- paste0(
  "<b>Visit type: </b>", Rochester_Checkups$Short_Question_Text, "<br>",
  "<b>Raw Count: </b>", Rochester_Checkups$Data_Value, "<br>",
  "<b>Population: </b>", Rochester_Checkups$PopulationCount, "<br>",
  "<b>Relative Frequency: </b>", paste(100*round(rel_freq, digits = 4),"%"), "<br>"
)
leaflet() |>
  setView(lng = -77.6096, lat = 43.16448, zoom = 12) |>
  addProviderTiles("Esri.NatGeoWorldMap") |>
  addCircles(
    data = Rochester_Checkups,
    radius = 5000*Rochester_Checkups$Data_Value/Rochester_Checkups$PopulationCount,
    popup = tooltip
  )
## Assuming "long" and "lat" are longitude and latitude, respectively
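The tooltip builds its percentage label with `paste()`, which puts a space before the percent sign and drops trailing zeros. If a fixed number of decimals is preferred, `sprintf()` is one alternative. A small base R sketch with made-up relative frequencies (the values in `rel_freq_toy` are hypothetical):

```r
# Hypothetical relative frequencies, not taken from the data
rel_freq_toy <- c(0.030214, 0.0275, 0.1)

# Same approach as the tooltip above: round, scale, then paste with "%"
pct_paste <- paste(100 * round(rel_freq_toy, digits = 4), "%")

# Alternative: two fixed decimals and no space; "%%" prints a literal percent sign
pct_sprintf <- sprintf("%.2f%%", 100 * rel_freq_toy)

pct_paste
pct_sprintf
```

The `sprintf()` version keeps every label the same width, which makes the popups a little easier to compare across circles.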
It will be useful to look at the following two pictures while thinking about the significance of this map:
My first plot, the side-by-side boxplots, shows that in Rochester, NY, the crude and relative frequencies are similar across the preventative measures, aside from lack of access to health insurance, which is lower. It also shows that the differences in variation are mostly eliminated when switching from raw counts to relative frequencies.
My map visualizes the relative frequencies of annual checkups by location in Rochester in 2017. The variation is only a couple of percentage points, but it visually bears some correlation with household incomes. It seems that the relative frequencies of checkups in the northern and eastern outskirts have the most consistent inverse relationship with proximity to hospitals and median household income. There is a big question of why people in central Rochester (e.g. near Clifford Avenue) have such high relative rates of annual checkups despite lower household incomes. Perhaps there is a nearby hospital that I have missed. It is notable that the southeastern region shows slightly elevated relative frequencies of annual checkups, correlating with higher median household incomes.
If I were to do this again, I would like to somehow include the incomes and hospitals on the map itself (this seems like it might be easier in Tableau), and also explore the relationship between lack of health insurance and annual checkups.
Coding:
- https://www.r-bloggers.com/2022/09/how-to-concatenate-strings-in-r/#google_vignette
- https://appsilon.com/leaflet-geomaps/
Data:
- https://www.cdc.gov/places/about/500-cities-2016-2019/index.html
- https://statisticalatlas.com/place/New-York/Rochester/Household-Income
- https://www.google.com/maps/search/hospitals+in+rochester/
Memes:
- https://knowyourmeme.com/memes/helth