Healthy Cities GIS Assignment
Load the libraries and set the working directory
library(tidyverse)
library(tidyr)
library(plotly)
library(leaflet)
setwd("~/Downloads/Data 101 and Data 110 class/Data 110")
cities500 <- read_csv("Data Sets/500CitiesLocalHealthIndicators.cdc.csv")
The GeoLocation variable has (lat, long) format
Split GeoLocation (lat, long) into two columns: lat and long
latlong <- cities500 |>
  mutate(GeoLocation = str_replace_all(GeoLocation, "[()]", "")) |>
  separate(GeoLocation, into = c("lat", "long"), sep = ",", convert = TRUE)
head(latlong)
# A tibble: 6 × 25
Year StateAbbr StateDesc CityName GeographicLevel DataSource Category
<dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 2017 CA California Hawthorne Census Tract BRFSS Health Outcom…
2 2017 CA California Hawthorne City BRFSS Unhealthy Beh…
3 2017 CA California Hayward City BRFSS Health Outcom…
4 2017 CA California Hayward City BRFSS Unhealthy Beh…
5 2017 CA California Hemet City BRFSS Prevention
6 2017 CA California Indio Census Tract BRFSS Health Outcom…
# ℹ 18 more variables: UniqueID <chr>, Measure <chr>, Data_Value_Unit <chr>,
# DataValueTypeID <chr>, Data_Value_Type <chr>, Data_Value <dbl>,
# Low_Confidence_Limit <dbl>, High_Confidence_Limit <dbl>,
# Data_Value_Footnote_Symbol <chr>, Data_Value_Footnote <chr>,
# PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
# MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
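One way to sanity-check the split is to confirm that both new columns are numeric and fall inside valid coordinate ranges. The sketch below runs the same two steps on a small made-up tibble (`toy`, with invented city names and coordinates) instead of the real file, so everything in it is an assumption for illustration only.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical stand-in for cities500: names and coordinates are made up
toy <- tibble(
  CityName    = c("A", "B", "C"),
  GeoLocation = c("(33.9, -118.3)", "(37.6, -122.1)", "(39.3, -76.6)")
)

# Same steps as above: strip the parentheses, then split on the comma
toy_latlong <- toy |>
  mutate(GeoLocation = str_replace_all(GeoLocation, "[()]", "")) |>
  separate(GeoLocation, into = c("lat", "long"), sep = ",", convert = TRUE)

# Both columns should be numeric, in range, and free of missing values
check <- toy_latlong |>
  summarise(
    lat_ok    = is.numeric(lat) && all(between(lat, -90, 90)),
    long_ok   = is.numeric(long) && all(between(long, -180, 180)),
    n_missing = sum(is.na(lat) | is.na(long))
  )
check
```

If `n_missing` were above zero on the real data, some GeoLocation strings probably did not follow the (lat, long) pattern and would be worth inspecting before mapping.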
Filter the dataset
Remove the rows where StateDesc is "United States", keep only Prevention as the category of interest, keep only crude prevalence measurements, and keep only the year 2017.
latlong_clean <- latlong |>
  filter(StateDesc != "United States") |>
  filter(Category == "Prevention") |>
  filter(Data_Value_Type == "Crude prevalence") |>
  filter(Year == 2017)
head(latlong_clean)
# A tibble: 6 × 25
Year StateAbbr StateDesc CityName GeographicLevel DataSource Category
<dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 2017 AL Alabama Montgomery City BRFSS Prevention
2 2017 CA California Concord City BRFSS Prevention
3 2017 CA California Concord City BRFSS Prevention
4 2017 CA California Fontana City BRFSS Prevention
5 2017 CA California Richmond Census Tract BRFSS Prevention
6 2017 FL Florida Davie Census Tract BRFSS Prevention
# ℹ 18 more variables: UniqueID <chr>, Measure <chr>, Data_Value_Unit <chr>,
# DataValueTypeID <chr>, Data_Value_Type <chr>, Data_Value <dbl>,
# Low_Confidence_Limit <dbl>, High_Confidence_Limit <dbl>,
# Data_Value_Footnote_Symbol <chr>, Data_Value_Footnote <chr>,
# PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
# MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
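The four filter() calls keep only rows that satisfy all four conditions, so they could also be written as one call with comma-separated conditions. A minimal sketch on a made-up tibble (`toy`, whose rows are invented for illustration) suggests both forms return the same rows:

```r
library(dplyr)

# Hypothetical rows: only the second one meets all four conditions
toy <- tibble(
  StateDesc       = c("United States", "Maryland", "Maryland", "Alabama"),
  Category        = c("Prevention", "Prevention", "Health Outcomes", "Prevention"),
  Data_Value_Type = c("Crude prevalence", "Crude prevalence",
                      "Crude prevalence", "Age-adjusted prevalence"),
  Year            = c(2017, 2017, 2017, 2017)
)

# Chained version, as in the assignment code
chained <- toy |>
  filter(StateDesc != "United States") |>
  filter(Category == "Prevention") |>
  filter(Data_Value_Type == "Crude prevalence") |>
  filter(Year == 2017)

# Single call: conditions separated by commas are combined with AND
combined <- toy |>
  filter(
    StateDesc != "United States",
    Category == "Prevention",
    Data_Value_Type == "Crude prevalence",
    Year == 2017
  )

all.equal(chained, combined)
```

Either style works; the chained form can be easier to debug because each step can be commented out on its own.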
What variables are included? (can any of them be removed?)
names(latlong_clean)
 [1] "Year" "StateAbbr"
[3] "StateDesc" "CityName"
[5] "GeographicLevel" "DataSource"
[7] "Category" "UniqueID"
[9] "Measure" "Data_Value_Unit"
[11] "DataValueTypeID" "Data_Value_Type"
[13] "Data_Value" "Low_Confidence_Limit"
[15] "High_Confidence_Limit" "Data_Value_Footnote_Symbol"
[17] "Data_Value_Footnote" "PopulationCount"
[19] "lat" "long"
[21] "CategoryID" "MeasureId"
[23] "CityFIPS" "TractFIPS"
[25] "Short_Question_Text"
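One possible way to approach that question is to look for columns that hold the same value in every row, since after filtering they no longer distinguish one row from another. The sketch below uses a made-up tibble (`toy`, with invented values) and a helper column name (`n_values`) chosen only for this example.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for latlong_clean: values are made up
toy <- tibble(
  Year       = c(2017, 2017, 2017),
  DataSource = c("BRFSS", "BRFSS", "BRFSS"),
  CityName   = c("Concord", "Fontana", "Davie"),
  Data_Value = c(71.2, 68.4, 75.0)
)

# Count the distinct values in every column, one row per variable
distinct_counts <- toy |>
  summarise(across(everything(), n_distinct)) |>
  pivot_longer(everything(), names_to = "variable", values_to = "n_values")

# Columns with a single distinct value are candidates for removal
constant_cols <- distinct_counts |>
  filter(n_values == 1) |>
  pull(variable)

constant_cols
```

A constant column is only a candidate: a column such as Year may still be worth keeping for labels even when every row shares the same value.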
Remove the variables that will not be used in the assignment
prevention <- latlong_clean |>
  select(-DataSource, -Data_Value_Unit, -DataValueTypeID, -Low_Confidence_Limit,
         -High_Confidence_Limit, -Data_Value_Footnote_Symbol, -Data_Value_Footnote)
head(prevention)
# A tibble: 6 × 18
Year StateAbbr StateDesc CityName GeographicLevel Category UniqueID Measure
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2017 AL Alabama Montgome… City Prevent… 151000 Choles…
2 2017 CA California Concord City Prevent… 616000 Visits…
3 2017 CA California Concord City Prevent… 616000 Choles…
4 2017 CA California Fontana City Prevent… 624680 Visits…
5 2017 CA California Richmond Census Tract Prevent… 0660620… Choles…
6 2017 FL Florida Davie Census Tract Prevent… 1216475… Choles…
# ℹ 10 more variables: Data_Value_Type <chr>, Data_Value <dbl>,
# PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
# MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
md <- prevention |>
  filter(StateAbbr == "MD")
head(md)
# A tibble: 6 × 18
Year StateAbbr StateDesc CityName GeographicLevel Category UniqueID Measure
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2017 MD Maryland Baltimore Census Tract Preventi… 2404000… "Chole…
2 2017 MD Maryland Baltimore Census Tract Preventi… 2404000… "Visit…
3 2017 MD Maryland Baltimore Census Tract Preventi… 2404000… "Visit…
4 2017 MD Maryland Baltimore Census Tract Preventi… 2404000… "Curre…
5 2017 MD Maryland Baltimore Census Tract Preventi… 2404000… "Curre…
6 2017 MD Maryland Baltimore Census Tract Preventi… 2404000… "Visit…
# ℹ 10 more variables: Data_Value_Type <chr>, Data_Value <dbl>,
# PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
# MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
The new dataset “Prevention” is now manageable, and so is md.
For your assignment, work with the cleaned “Prevention” dataset
1. Once you run the above code, filter this dataset one more time for any particular subset.
Filter chunk here
# I want to keep only the rows where Short_Question_Text is "Cholesterol Screening" in the md data set, to see the prevalence of Cholesterol Screening in Maryland.
cholesterol <- md |>
  filter(Short_Question_Text == "Cholesterol Screening")
2. Based on the GIS tutorial (Japan earthquakes), create one plot about something in your subsetted dataset.
First plot chunk here
# I want to make a scatterplot of the Cholesterol Screening data for Maryland.
# Turn off scientific notation on the axes
options(scipen = 999)
p1 <- cholesterol |>
  ggplot(aes(x = PopulationCount, y = Data_Value)) +
  geom_point() +
  labs(title = "Cholesterol Screening in Maryland", x = "Population Count", y = "Data Value")
# Convert the ggplot into an interactive plotly chart
p1 <- ggplotly(p1)
p1
Oh wow, the point with the highest population count has a data value of 82.7! How interesting. I wonder which part of Baltimore this is.
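To find out which row that point is, one option is to pull the row with the largest PopulationCount and look at its GeographicLevel. The dataset contains both City and Census Tract rows, so the largest value may turn out to be a citywide total and not a single neighborhood. The sketch below uses a made-up tibble (`toy`) in place of `cholesterol`; all of its values are invented.

```r
library(dplyr)

# Hypothetical stand-in for `cholesterol`: all values are made up
toy <- tibble(
  CityName        = c("Baltimore", "Baltimore", "Baltimore"),
  GeographicLevel = c("City", "Census Tract", "Census Tract"),
  PopulationCount = c(600000, 3200, 850),
  Data_Value      = c(80.0, 85.5, 90.0)
)

# Keep the single row with the largest population and the columns that identify it
largest <- toy |>
  slice_max(PopulationCount, n = 1) |>
  select(CityName, GeographicLevel, PopulationCount, Data_Value)

largest
```

Running the same two steps on `cholesterol` would show whether the outlying point is a city-level row, which would also explain why it sits so far from the other points.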
I will make the graph again, but this time I won’t use the highest population count.
p2 <- cholesterol |>
  filter(PopulationCount < 100000) |>
  ggplot(aes(x = PopulationCount, y = Data_Value)) +
  geom_point() +
  labs(title = "Cholesterol Screening in Maryland", x = "Population Count", y = "Data Value")
p2 <- ggplotly(p2)
p2
A location in Baltimore with a population of 856 had a high data value of 90.1 for Cholesterol Screening. The location with the highest remaining population, 6572, also had a high data value of 87.6.
Out of curiosity, I want to see the correlation between these two variables. Ignore any NA values.
cor(cholesterol$PopulationCount, cholesterol$Data_Value, use = "complete.obs") # I used AI to find out how to ignore NA values.
[1] 0.002759384
This number is about 0.0028, which is very close to 0. This means that there is essentially no linear correlation between the population count and the data value of Cholesterol Screening in Maryland.
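To see what `use = "complete.obs"` does, the small sketch below compares it with the default behavior and with dropping incomplete pairs by hand. The vectors `x` and `y` are made-up values for illustration, not taken from the dataset.

```r
# Made-up values for illustration; the last pair has a missing y
x <- c(10, 20, 30, 40, 50)
y <- c(1, 3, 2, 5, NA)

# With the default use = "everything", any NA makes the result NA
r_default <- cor(x, y)

# "complete.obs" drops the incomplete pair before computing Pearson's r
r_complete <- cor(x, y, use = "complete.obs")

# Same result as removing the incomplete pair by hand
keep <- complete.cases(x, y)
r_manual <- cor(x[keep], y[keep])

c(default = r_default, complete = r_complete, manual = r_manual)
```

Pearson's r only measures linear association, and a single very large value can strongly influence it, so it may be worth recomputing the correlation with the same PopulationCount < 100000 filter used for the second plot.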
3. Now create a map of your subsetted dataset.
First map chunk here
# I want to make a map of the Cholesterol Screening data for Maryland. I will use the lat and long variables to make the map. I will use leaflet to make the map. I won't include a tooltip
leaflet(cholesterol) |>
  addTiles() |>
  addCircleMarkers(lng = ~long, lat = ~lat)
4. Refine your map to include a mouseover tooltip
Refined map chunk here
popupGIS <- paste0(
  "<b>Population Count: </b>", cholesterol$PopulationCount, "<br>",
  "<b>Data Value: </b>", cholesterol$Data_Value, "<br>"
)
Above is the popup text that will show up when you click a circle marker on the map (the popup argument opens on click rather than on hover). I will also make the circles bigger depending on Data_Value.
leaflet(cholesterol) |>
  addTiles() |>
  addCircleMarkers(lng = ~long, lat = ~lat,
                   color = "black",
                   radius = ~(Data_Value)/7,
                   fillColor = "blue",
                   popup = popupGIS)
5. Write a paragraph
In a paragraph, describe the plots you created and what they show.
My map shows the prevalence of Cholesterol Screening in Maryland. From the given data, I am mapping the population count and the Cholesterol Screening data value (crude prevalence) for locations in Baltimore, Maryland. The scatterplot shows the relationship between the population count and the data value. I made it to see whether population size had anything to do with the data value, that is, whether more populated areas tended to have higher screening values. The correlation is about 0.0028, which is very close to 0. That means there was essentially no linear correlation between the population and the data value of Cholesterol Screening, so the most populated areas did not necessarily have the highest screening values. Some smaller populations did, in fact, have high Cholesterol Screening values. The scatterplot also shows that the point with the highest population count has a data value of 82.7. The map shows the different locations in Baltimore as blue circle markers whose size increases with the data value. I noticed that for a population count of 22, there was no data value given. The rest of the data scattered around the map seems to be pretty close to one another. I wish I could have done a better job of making the circles bigger for higher data values and smaller for the lowest ones, to show a better image; however, I think I did a good job of making the map and the scatterplot.
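The wish for circles whose size tracks the data value more visibly could be approached by rescaling Data_Value onto a wider radius range before mapping, because dividing values that are all fairly close together by 7 leaves the radii nearly identical. The sketch below also passes the text through the `label` argument, which leaflet shows on hover (`popup` opens on click). It is only a sketch under assumptions: `toy` is a made-up data frame standing in for `cholesterol`, and the 3 to 15 pixel range is an arbitrary choice.

```r
library(leaflet)
library(scales)
library(htmltools)

# Hypothetical stand-in for `cholesterol`: coordinates and values are made up
toy <- data.frame(
  lat             = c(39.29, 39.31, 39.33),
  long            = c(-76.61, -76.59, -76.63),
  PopulationCount = c(900, 3200, 6500),
  Data_Value      = c(90.0, 70.5, 87.5)
)

# Stretch Data_Value onto radii between 3 and 15 pixels (arbitrary range)
toy$radius <- rescale(toy$Data_Value, to = c(3, 15))

# One HTML label per row; wrapping in HTML() keeps the <b> tags from being escaped
hover_labels <- lapply(
  paste0(
    "<b>Population Count: </b>", toy$PopulationCount, "<br>",
    "<b>Data Value: </b>", toy$Data_Value
  ),
  HTML
)

m <- leaflet(toy) |>
  addTiles() |>
  addCircleMarkers(
    lng = ~long, lat = ~lat,
    color = "black",
    fillColor = "blue",
    radius = ~radius,
    label = hover_labels
  )

# Display the map (in an interactive session or a rendered document)
m
```

With this rescaling, the lowest data value always gets the smallest circle and the highest gets the largest, regardless of how close the raw values are. Rows with a missing Data_Value would get a missing radius, so they may need to be filtered out first.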