library(tidyverse)
library(tidyr)
setwd("C:/Users/dburkart/Desktop/DATA 110/data")
cities500 <- read_csv("500CitiesLocalHealthIndicators.cdc.csv")
data(cities500)
GIS Assignment - 500 Healthy Cities
Load the libraries and set the working directory
The GeoLocation variable has (lat, long) format
Split GeoLocation (lat, long) into two columns: lat and long
latlong <- cities500 |>
  mutate(GeoLocation = str_replace_all(GeoLocation, "[()]", "")) |>
  separate(GeoLocation, into = c("lat", "long"), sep = ",", convert = TRUE)
head(latlong)
# A tibble: 6 × 25
Year StateAbbr StateDesc CityName GeographicLevel DataSource Category
<dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 2017 CA California Hawthorne Census Tract BRFSS Health Outcom…
2 2017 CA California Hawthorne City BRFSS Unhealthy Beh…
3 2017 CA California Hayward City BRFSS Health Outcom…
4 2017 CA California Hayward City BRFSS Unhealthy Beh…
5 2017 CA California Hemet City BRFSS Prevention
6 2017 CA California Indio Census Tract BRFSS Health Outcom…
# ℹ 18 more variables: UniqueID <chr>, Measure <chr>, Data_Value_Unit <chr>,
# DataValueTypeID <chr>, Data_Value_Type <chr>, Data_Value <dbl>,
# Low_Confidence_Limit <dbl>, High_Confidence_Limit <dbl>,
# Data_Value_Footnote_Symbol <chr>, Data_Value_Footnote <chr>,
# PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
# MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
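To see what the split does on its own, here is a minimal sketch on a two-row toy table. The city names and coordinates are invented for illustration and are not taken from the CDC file; only the two cleaning steps mirror the code above.

```r
library(tidyverse)

# Hypothetical stand-in for the GeoLocation column (values invented for illustration)
toy <- tibble(
  CityName = c("City A", "City B"),
  GeoLocation = c("(33.9, -118.3)", "(37.6, -122.1)")
)

toy_split <- toy |>
  # Drop the opening and closing parentheses
  mutate(GeoLocation = str_replace_all(GeoLocation, "[()]", "")) |>
  # Split on the comma; convert = TRUE turns both pieces into numbers
  separate(GeoLocation, into = c("lat", "long"), sep = ",", convert = TRUE)

toy_split
```

Without convert = TRUE, lat and long would stay as character columns and could not be used directly as map coordinates.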
Filter the dataset
Remove the rows where StateDesc is “United States”, keep only the Prevention category (the category of interest), keep only crude prevalence measurements, and keep only the year 2017.
latlong_clean <- latlong |>
  filter(StateDesc != "United States") |>
  filter(Category == "Prevention") |>
  filter(Data_Value_Type == "Crude prevalence") |>
  filter(Year == 2017)
head(latlong_clean)
# A tibble: 6 × 25
Year StateAbbr StateDesc CityName GeographicLevel DataSource Category
<dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 2017 AL Alabama Montgomery City BRFSS Prevention
2 2017 CA California Concord City BRFSS Prevention
3 2017 CA California Concord City BRFSS Prevention
4 2017 CA California Fontana City BRFSS Prevention
5 2017 CA California Richmond Census Tract BRFSS Prevention
6 2017 FL Florida Davie Census Tract BRFSS Prevention
# ℹ 18 more variables: UniqueID <chr>, Measure <chr>, Data_Value_Unit <chr>,
# DataValueTypeID <chr>, Data_Value_Type <chr>, Data_Value <dbl>,
# Low_Confidence_Limit <dbl>, High_Confidence_Limit <dbl>,
# Data_Value_Footnote_Symbol <chr>, Data_Value_Footnote <chr>,
# PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
# MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
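The four filter() calls could also be written as one call, since conditions separated by commas are combined with AND. The sketch below shows this on a four-row toy table; the rows are invented for illustration, and only the column names and conditions come from the code above.

```r
library(tidyverse)

# Hypothetical rows (invented); each one fails a different condition except Alabama
toy <- tibble(
  StateDesc = c("United States", "Alabama", "California", "Florida"),
  Category = c("Prevention", "Prevention", "Health Outcomes", "Prevention"),
  Data_Value_Type = c("Crude prevalence", "Crude prevalence",
                      "Crude prevalence", "Age-adjusted prevalence"),
  Year = c(2017, 2017, 2017, 2017)
)

toy_clean <- toy |>
  filter(
    StateDesc != "United States",            # drop the national summary row
    Category == "Prevention",                # keep the category of interest
    Data_Value_Type == "Crude prevalence",   # keep crude prevalence only
    Year == 2017                             # keep 2017 only
  )

toy_clean
```

Both forms should return the same rows; the single call is just shorter to read.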
What variables are included? (can any of them be removed?)
names(latlong_clean)
[1] "Year" "StateAbbr"
[3] "StateDesc" "CityName"
[5] "GeographicLevel" "DataSource"
[7] "Category" "UniqueID"
[9] "Measure" "Data_Value_Unit"
[11] "DataValueTypeID" "Data_Value_Type"
[13] "Data_Value" "Low_Confidence_Limit"
[15] "High_Confidence_Limit" "Data_Value_Footnote_Symbol"
[17] "Data_Value_Footnote" "PopulationCount"
[19] "lat" "long"
[21] "CategoryID" "MeasureId"
[23] "CityFIPS" "TractFIPS"
[25] "Short_Question_Text"
Remove the variables that will not be used in the assignment
prevention <- latlong_clean |>
  select(-DataSource, -Data_Value_Unit, -DataValueTypeID, -Low_Confidence_Limit, -High_Confidence_Limit, -Data_Value_Footnote_Symbol, -Data_Value_Footnote)
head(prevention)
# A tibble: 6 × 18
Year StateAbbr StateDesc CityName GeographicLevel Category UniqueID Measure
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2017 AL Alabama Montgome… City Prevent… 151000 Choles…
2 2017 CA California Concord City Prevent… 616000 Visits…
3 2017 CA California Concord City Prevent… 616000 Choles…
4 2017 CA California Fontana City Prevent… 624680 Visits…
5 2017 CA California Richmond Census Tract Prevent… 0660620… Choles…
6 2017 FL Florida Davie Census Tract Prevent… 1216475… Choles…
# ℹ 10 more variables: Data_Value_Type <chr>, Data_Value <dbl>,
# PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
# MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
md <- prevention |>
  filter(StateAbbr == "MD")
head(md)
# A tibble: 6 × 18
Year StateAbbr StateDesc CityName GeographicLevel Category UniqueID Measure
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2017 MD Maryland Baltimore Census Tract Preventi… 2404000… "Chole…
2 2017 MD Maryland Baltimore Census Tract Preventi… 2404000… "Visit…
3 2017 MD Maryland Baltimore Census Tract Preventi… 2404000… "Visit…
4 2017 MD Maryland Baltimore Census Tract Preventi… 2404000… "Curre…
5 2017 MD Maryland Baltimore Census Tract Preventi… 2404000… "Curre…
6 2017 MD Maryland Baltimore Census Tract Preventi… 2404000… "Visit…
# ℹ 10 more variables: Data_Value_Type <chr>, Data_Value <dbl>,
# PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
# MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
unique(md$CityName)
[1] "Baltimore"
The new dataset, prevention, is now a manageable size.
For your assignment, work with a cleaned dataset.
1. Once you run the above code, filter this dataset one more time for any particular subset with no more than 900 observations.
Filter chunk here
us <- latlong |>
  filter(GeographicLevel == "City") |>
  filter(Short_Question_Text == "Chronic Kidney Disease") |>
  filter(DataValueTypeID == "AgeAdjPrv") |>
  filter(Year == 2017) |>
  filter(StateAbbr != "AK") |>
  filter(StateAbbr != "HI")
head(us)
# A tibble: 6 × 25
Year StateAbbr StateDesc CityName GeographicLevel DataSource Category
<dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 2017 CA California Menifee City BRFSS Health …
2 2017 CT Connecticut New Britain City BRFSS Health …
3 2017 FL Florida Lakeland City BRFSS Health …
4 2017 PA Pennsylvania Pittsburgh City BRFSS Health …
5 2017 SC South Carolin Rock Hill City BRFSS Health …
6 2017 TX Texas College Sta… City BRFSS Health …
# ℹ 18 more variables: UniqueID <chr>, Measure <chr>, Data_Value_Unit <chr>,
# DataValueTypeID <chr>, Data_Value_Type <chr>, Data_Value <dbl>,
# Low_Confidence_Limit <dbl>, High_Confidence_Limit <dbl>,
# Data_Value_Footnote_Symbol <chr>, Data_Value_Footnote <chr>,
# PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
# MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
db <- us |>
  select(-GeographicLevel, -DataSource, -Category, -UniqueID, -Data_Value_Unit, -DataValueTypeID, -Data_Value_Type, -Low_Confidence_Limit, -High_Confidence_Limit, -Data_Value_Footnote_Symbol, -Data_Value_Footnote, -CategoryID, -TractFIPS)
head(db)
# A tibble: 6 × 12
Year StateAbbr StateDesc CityName Measure Data_Value PopulationCount lat
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2017 CA California Menifee Chroni… 2.9 77519 33.7
2 2017 CT Connecticut New Bri… Chroni… 3.2 73206 41.7
3 2017 FL Florida Lakeland Chroni… 3.3 97422 28.1
4 2017 PA Pennsylvania Pittsbu… Chroni… 3 305704 40.4
5 2017 SC South Carol… Rock Hi… Chroni… 3.2 66154 34.9
6 2017 TX Texas College… Chroni… 2.9 93857 30.6
# ℹ 4 more variables: long <dbl>, MeasureId <chr>, CityFIPS <dbl>,
# Short_Question_Text <chr>
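Because the assignment caps the subset at 900 observations, it can help to check the row count right after filtering. The sketch below does that on a toy table; the state codes are a hypothetical sample chosen for illustration, and %in% is used as a compact alternative to two separate != filters.

```r
library(tidyverse)

# Hypothetical sample of state abbreviations (not the real dataset)
toy <- tibble(StateAbbr = c("CA", "AK", "HI", "TX", "PA"))

# Drop Alaska and Hawaii in one condition
toy_subset <- toy |>
  filter(!StateAbbr %in% c("AK", "HI"))

# Stop with an error if the subset is larger than the assignment allows
stopifnot(nrow(toy_subset) <= 900)

nrow(toy_subset)
```

Running the same stopifnot() check on the real subset would confirm whether it meets the limit before any plotting starts.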
2. Based on the GIS tutorial (Japan earthquakes), create one plot about something in your subsetted dataset.
First plot chunk here
p1 <- db |>
  ggplot(aes(x = lat, y = Data_Value)) +
  geom_point(alpha = 0.5, color = "#9c4016") +
  labs(title = "Prevalence of Chronic Kidney Disease by Latitude in the Contiguous United States",
       x = "Latitude",
       y = "Prevalence (%)",
       caption = "Source: Centers for Disease Control and Prevention (CDC),\nDivision of Population Health, Epidemiology and Surveillance Branch") +
  theme_classic() +
  scale_y_continuous(limits = c(0, 5)) +
  # linewidth replaces the size argument for lines, deprecated in ggplot2 3.4.0
  geom_vline(xintercept = 37, linetype = "dotdash", linewidth = 0.5, color = "black") +
  # annotate() draws each label once instead of once per data row
  annotate("text", x = 43, y = 1, label = "Northern States", size = 3.5, color = "black") +
  annotate("text", x = 31, y = 1, label = "Southern States", size = 3.5, color = "black")
p1
3. Now create a map of your subsetted dataset.
First map chunk here
library(leaflet)
Warning: package 'leaflet' was built under R version 4.4.3
leaflet() |>
setView(lng = -94.57857, lat = 39.09973, zoom =4) |>
addProviderTiles("OpenStreetMap.Mapnik") |>
addCircles(
data = db,
radius = (db$Data_Value*10000),
color = "#9c4016",
fillColor = "#de9126",
fillOpacity = 0.25
)
Assuming "long" and "lat" are longitude and latitude, respectively
4. Refine your map to include a mouse-click tooltip
Refined map chunk here
popups <- paste0(
  "<b>State: </b>", db$StateDesc, "<br>",
  "<b>City: </b>", db$CityName, "<br>",
  "<b>Population: </b>", db$PopulationCount, "<br>",
  "<strong>Prevalence (%): </strong>", db$Data_Value, "<br>"
)
leaflet() |>
  setView(lng = -94.57857, lat = 39.09973, zoom = 4) |>
  addProviderTiles("OpenStreetMap.Mapnik") |>
  addCircles(
    data = db,
    radius = db$Data_Value * 10000,
    color = "#9c4016",
    fillColor = "#de9126",
    fillOpacity = 0.5,
    popup = popups
  )
Assuming "long" and "lat" are longitude and latitude, respectively
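The popup text is just an HTML string built with paste0(), so it can be checked without drawing a map. The sketch below builds one popup for the Pittsburgh row shown in head(db) above. The thousands separator added with format() is an optional refinement I am assuming would be wanted; it is not part of the original map.

```r
# One row copied from the head(db) output above
toy <- data.frame(
  StateDesc = "Pennsylvania",
  CityName = "Pittsburgh",
  PopulationCount = 305704,
  Data_Value = 3
)

# Same structure as the popups vector, with a comma added to the population
toy_popup <- paste0(
  "<b>State: </b>", toy$StateDesc, "<br>",
  "<b>City: </b>", toy$CityName, "<br>",
  "<b>Population: </b>", format(toy$PopulationCount, big.mark = ","), "<br>",
  "<strong>Prevalence (%): </strong>", toy$Data_Value, "<br>"
)

toy_popup
```

Printing the string before passing it to addCircles() makes it easier to spot a missing tag or label.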
5. Write a paragraph
In a paragraph, describe the plots you created and what they show.
I created these plots because I wanted to know whether chronic kidney disease is more prevalent in southern US states than in northern US states (divided at the 37th parallel), possibly because of dietary differences between the two groups. Although I did not run a formal statistical analysis, the graph and map that I created do not show any obvious difference between southern and northern states. The graph is a scatter plot with latitude on the x axis and prevalence of chronic kidney disease, the response variable, on the y axis. A vertical line at x = 37 divides the data into the two state groups. The map shows the geographical distribution of the data, with the size of each circle corresponding to the prevalence. Since there wasn’t much variation in the prevalence values, all of the circles are of similar size. Furthermore, most of the data seems to come from metropolitan areas (Los Angeles, New York City, Chicago), so it does not cover the United States fully and therefore doesn’t provide even anecdotal evidence towards answering the original question. I chose the color orange for both plots because orange is the color for kidney disease awareness.
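One simple way to put numbers behind the north/south comparison would be to label each city by which side of the 37th parallel it falls on and then summarise each group. The sketch below uses only the six rows printed by head(db) above, with latitudes as rounded in that printout, so its output says nothing about the full dataset. The region column and the choice of the mean as the summary are my own assumptions.

```r
library(tidyverse)

# The six rows shown by head(db); not representative of all 498 cities
toy <- tibble(
  CityName = c("Menifee", "New Britain", "Lakeland",
               "Pittsburgh", "Rock Hill", "College Station"),
  lat = c(33.7, 41.7, 28.1, 40.4, 34.9, 30.6),
  Data_Value = c(2.9, 3.2, 3.3, 3.0, 3.2, 2.9)
)

by_region <- toy |>
  # Hypothetical grouping column: at or above 37 degrees counts as North
  mutate(region = if_else(lat >= 37, "North", "South")) |>
  group_by(region) |>
  summarise(
    n = n(),                               # number of cities in the group
    mean_prevalence = mean(Data_Value),    # average prevalence (%)
    .groups = "drop"
  )

by_region
```

Applied to db instead of the toy table, the same pipeline would give a rough comparison of the two groups, though a proper test would still be needed before drawing conclusions.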