library(tidyverse)
library(tidyr)
setwd("C:/Users/Saima Abbas/Downloads")
cities500 <- read_csv("500CitiesLocalHealthIndicators.cdc.csv")
data(cities500)
Healthy Cities GIS Assignment
Load the libraries and set the working directory
The GeoLocation variable has (lat, long) format
Split GeoLocation (lat, long) into two columns: lat and long
latlong <- cities500 |>
  mutate(GeoLocation = str_replace_all(GeoLocation, "[()]", "")) |>
  separate(GeoLocation, into = c("lat", "long"), sep = ",", convert = TRUE)
head(latlong)
# A tibble: 6 × 25
   Year StateAbbr StateDesc  CityName  GeographicLevel DataSource Category
  <dbl> <chr>     <chr>      <chr>     <chr>           <chr>      <chr>
1  2017 CA        California Hawthorne Census Tract    BRFSS      Health Outcom…
2  2017 CA        California Hawthorne City            BRFSS      Unhealthy Beh…
3  2017 CA        California Hayward   City            BRFSS      Health Outcom…
4  2017 CA        California Hayward   City            BRFSS      Unhealthy Beh…
5  2017 CA        California Hemet     City            BRFSS      Prevention
6  2017 CA        California Indio     Census Tract    BRFSS      Health Outcom…
# ℹ 18 more variables: UniqueID <chr>, Measure <chr>, Data_Value_Unit <chr>,
# DataValueTypeID <chr>, Data_Value_Type <chr>, Data_Value <dbl>,
# Low_Confidence_Limit <dbl>, High_Confidence_Limit <dbl>,
# Data_Value_Footnote_Symbol <chr>, Data_Value_Footnote <chr>,
# PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
# MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
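To see what the split does in isolation, here is a minimal sketch on a made-up two-row table. The `toy` data frame and its coordinates are hypothetical stand-ins, not rows from the CDC file; the pipeline is the same as above.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical stand-in for cities500 (coordinates are made up)
toy <- tibble(
  CityName = c("Danbury", "Hayward"),
  GeoLocation = c("(41.40, -73.47)", "(37.63, -122.10)")
)

toy_latlong <- toy |>
  # Strip the parentheses so only "lat, long" is left
  mutate(GeoLocation = str_replace_all(GeoLocation, "[()]", "")) |>
  # Split at the comma; convert = TRUE turns the two pieces into numbers
  separate(GeoLocation, into = c("lat", "long"), sep = ",", convert = TRUE)

# Both new columns should be numeric, ready for mapping
str(toy_latlong)
```

Without `convert = TRUE` the two new columns would stay as character strings, which leaflet could not use as coordinates.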
Filter the dataset
Remove the rows where StateDesc is "United States", keep only Connecticut (StateAbbr "CT"), select Prevention as the category of interest, keep only crude prevalence measurements, and keep only 2017.
latlong_clean <- latlong |>
  filter(StateDesc != "United States") |>
  filter(Data_Value_Type == "Crude prevalence") |>
  filter(Year == 2017) |>
  filter(StateAbbr == "CT") |>
  filter(Category == "Prevention")
head(latlong_clean)
# A tibble: 6 × 25
   Year StateAbbr StateDesc   CityName  GeographicLevel DataSource Category
  <dbl> <chr>     <chr>       <chr>     <chr>           <chr>      <chr>
1  2017 CT        Connecticut Danbury   City            BRFSS      Prevention
2  2017 CT        Connecticut New Haven Census Tract    BRFSS      Prevention
3  2017 CT        Connecticut New Haven Census Tract    BRFSS      Prevention
4  2017 CT        Connecticut Danbury   Census Tract    BRFSS      Prevention
5  2017 CT        Connecticut Stamford  Census Tract    BRFSS      Prevention
6  2017 CT        Connecticut Norwalk   Census Tract    BRFSS      Prevention
# ℹ 18 more variables: UniqueID <chr>, Measure <chr>, Data_Value_Unit <chr>,
# DataValueTypeID <chr>, Data_Value_Type <chr>, Data_Value <dbl>,
# Low_Confidence_Limit <dbl>, High_Confidence_Limit <dbl>,
# Data_Value_Footnote_Symbol <chr>, Data_Value_Footnote <chr>,
# PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
# MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
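As a side note, several `filter()` calls in a row keep the same rows as one `filter()` call with the conditions separated by commas. The sketch below shows this on a small made-up table; `toy` and its values are hypothetical, and only the column names come from the dataset.

```r
library(dplyr)

# Hypothetical stand-in for latlong (values are made up)
toy <- tibble(
  StateAbbr = c("CT", "CT", "CA", "US"),
  StateDesc = c("Connecticut", "Connecticut", "California", "United States"),
  Year = c(2017, 2016, 2017, 2017),
  Category = rep("Prevention", 4),
  Data_Value_Type = rep("Crude prevalence", 4)
)

# One filter() per condition, as in the chunk above
chained <- toy |>
  filter(StateDesc != "United States") |>
  filter(Data_Value_Type == "Crude prevalence") |>
  filter(Year == 2017) |>
  filter(StateAbbr == "CT") |>
  filter(Category == "Prevention")

# The same conditions in a single call; commas act like "and"
combined <- toy |>
  filter(
    StateDesc != "United States",
    Data_Value_Type == "Crude prevalence",
    Year == 2017,
    StateAbbr == "CT",
    Category == "Prevention"
  )

# Both versions keep the same number of rows
c(chained = nrow(chained), combined = nrow(combined))
```

Which form to use is mostly a matter of taste; the chained version makes it easy to comment out one condition at a time while exploring.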
What variables are included? (can any of them be removed?)
names(latlong_clean)
 [1] "Year"                  "StateAbbr"
 [3] "StateDesc"             "CityName"
 [5] "GeographicLevel"       "DataSource"
 [7] "Category"              "UniqueID"
 [9] "Measure"               "Data_Value_Unit"
[11] "DataValueTypeID"       "Data_Value_Type"
[13] "Data_Value"            "Low_Confidence_Limit"
[15] "High_Confidence_Limit" "Data_Value_Footnote_Symbol"
[17] "Data_Value_Footnote"   "PopulationCount"
[19] "lat"                   "long"
[21] "CategoryID"            "MeasureId"
[23] "CityFIPS"              "TractFIPS"
[25] "Short_Question_Text"
Remove the variables that will not be used in the assignment
latlong_clean2 <- latlong_clean |>
  select(-DataSource, -Data_Value_Unit, -DataValueTypeID,
         -Low_Confidence_Limit, -High_Confidence_Limit,
         -Data_Value_Footnote_Symbol, -Data_Value_Footnote)
head(latlong_clean2)
# A tibble: 6 × 18
   Year StateAbbr StateDesc   CityName GeographicLevel Category UniqueID Measure
  <dbl> <chr>     <chr>       <chr>    <chr>           <chr>    <chr>    <chr>
1  2017 CT        Connecticut Danbury  City            Prevent… 918430   "Chole…
2  2017 CT        Connecticut New Hav… Census Tract    Prevent… 0952000… "Chole…
3  2017 CT        Connecticut New Hav… Census Tract    Prevent… 0952000… "Visit…
4  2017 CT        Connecticut Danbury  Census Tract    Prevent… 0918430… "Visit…
5  2017 CT        Connecticut Stamford Census Tract    Prevent… 0973000… "Curre…
6  2017 CT        Connecticut Norwalk  Census Tract    Prevent… 0955990… "Curre…
# ℹ 10 more variables: Data_Value_Type <chr>, Data_Value <dbl>,
# PopulationCount <dbl>, lat <dbl>, long <dbl>, CategoryID <chr>,
# MeasureId <chr>, CityFIPS <dbl>, TractFIPS <dbl>, Short_Question_Text <chr>
The new dataset “Prevention” is a manageable dataset now.
For your assignment, work with a cleaned dataset.
1. Once you run the above code and learn how to filter this complicated dataset, perform your own investigation by filtering this dataset however you choose so that you have a subset with no more than 900 observations.
Filtering the data
Filter chunk here (you may need multiple chunks)
To start, I want to look at (as usual) the DMV area and the health issues surrounding it. (I was going to do health outcomes, but everything with that got all funky.)
dmv_health <- latlong |>
  filter(StateAbbr %in% c("DC", "MD", "VA")) |>
  filter(StateDesc != "United States") |>
  filter(Data_Value_Type == "Crude prevalence") |>
  filter(Year == 2017) |>
  filter(Category == "Unhealthy Behaviors")
Over here I mostly followed what was done before and created a data frame. First I filtered for the DMV, and then I filtered for the category I was interested in, which was unhealthy behaviors.
dmv_health2 <- dmv_health |>
  select(-DataSource, -Data_Value_Unit, -DataValueTypeID,
         -Low_Confidence_Limit, -High_Confidence_Limit,
         -Data_Value_Footnote_Symbol, -Data_Value_Footnote)
Over here I was just following along with what was done earlier and dropping variables that weren’t needed using the select function.
nrow(dmv_health2)
[1] 3572
I wanted to check how many rows I had, and it came up to 3572. That was too much, so I needed to figure out a way to take it down.
# Take the first 900 rows of the trimmed data frame (dmv_health2),
# so the columns dropped with select() stay dropped
dmv_health2 <- dmv_health2 |>
  head(n = 900)
So I had a hard time figuring out how to get the observations down to no more than 900, and I tried a couple of things to no avail. So I used the head function to keep the first 900 observations in the data frame. The head function is used earlier in this assignment, but not really in the same way. Once I did this, though, it worked! I resorted to Google to find out how to use it properly, and the source is cited below!!
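One thing to keep in mind is that `head()` keeps whichever rows happen to come first in the file, so some cities may be left out entirely. A possible alternative, sketched here on made-up data, is to filter on a column such as GeographicLevel and then check the row count. The `toy_dmv` table is hypothetical, and whether the city-level rows of the real data come out at 900 or fewer has not been checked here.

```r
library(dplyr)

# Hypothetical stand-in for dmv_health2 (values are made up)
toy_dmv <- tibble(
  CityName = c("Baltimore", "Baltimore", "Washington", "Washington"),
  GeographicLevel = c("City", "Census Tract", "City", "Census Tract"),
  Measure = rep("Obesity", 4),
  Data_Value = c(20.1, 25.3, 18.4, 22.0)
)

# Keep only the city-level estimates instead of the first n rows
toy_city <- toy_dmv |>
  filter(GeographicLevel == "City")

# Check the size against the assignment's limit before moving on
nrow(toy_city) <= 900
```

The same idea works with any other column, for example keeping a single Measure or a single state, as long as the final `nrow()` check passes.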
2. Based on the GIS tutorial (Japan earthquakes), create one plot about something in your subsetted dataset.
First plot chunk here
So I thought that for a non-map plot it would make sense to look at the different health issues affecting DMV cities.
dmv_health2 |>
  # Average the crude prevalence for each measure within each city
  group_by(CityName, Measure) |>
  summarize(mean_value = mean(Data_Value, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = reorder(CityName, -mean_value), y = mean_value, fill = Measure)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Prevalence of Unhealthy Behaviors by City in the DMV Area (2017)",
    x = "City",
    y = "Crude Prevalence (%)",
    fill = "Unhealthy Behavior",
    caption = "Source: CDC 500 Cities Dataset"
  ) +
  theme_light() +
  theme(legend.position = "right",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 8),
        legend.key.size = unit(0.5, "cm"),
        axis.text.x = element_text(angle = 45, hjust = 1))
So for this barplot I used the standard ggplot format, and first grouped by city and measure (the measure specifies the issue). Then, using the summarize function, I averaged the crude prevalence (Data_Value) for each issue per city. Then I filled each bar with a different color for each issue. There were a few struggles, to be honest, with the cosmetic parts of this plot, but I’m actually a little proud of it. It’s definitely not the most amazing plot ever, but I think it’s better than what I’ve made before. It looked a lot more interesting and full before I realized I had more than 900 observations, as did the map.
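To make the grouping and averaging step easier to picture, here is a minimal sketch on a made-up table. The `toy` data and its city names are hypothetical; only the column names match the dataset.

```r
library(dplyr)

# Hypothetical stand-in for dmv_health2 (values are made up)
toy <- tibble(
  CityName = c("A", "A", "A", "B"),
  Measure = c("Obesity", "Obesity", "Smoking", "Obesity"),
  Data_Value = c(30, 34, 18, NA)
)

toy_means <- toy |>
  # One group per city and measure combination
  group_by(CityName, Measure) |>
  # Average within each group, ignoring missing values
  summarize(mean_value = mean(Data_Value, na.rm = TRUE), .groups = "drop")

toy_means
```

City A has two obesity rows (30 and 34), so its bar would sit at 32. City B has only a missing value, so its mean comes out as NaN and ggplot would drop that bar with a warning.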
3. Now create a map of your subsetted dataset.
I struggled quite a bit!
First map chunk here
# leaflet()
library(leaflet)
Warning: package 'leaflet' was built under R version 4.4.3
leaflet(dmv_health2) |>
  addTiles() |>
  addCircles(lng = ~long, lat = ~lat, weight = 1,
             radius = 500,
             color = "red",
             stroke = FALSE, fillOpacity = 0.5)
So over here I honestly just followed along with what was done in the notes. I added tiles to provide a background and tried to experiment with other backgrounds through addTiles, but they weren’t showing up. I wanted to make it more visually interesting, but I’m running out of time.
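For a different background, one option that may work is `addProviderTiles()`, which takes a named basemap from leaflet's built-in `providers` list in place of the default `addTiles()`. This is only a sketch: `toy_points` is a hypothetical two-row table with rough coordinates, and `CartoDB.Positron` is just one of the available basemaps.

```r
library(leaflet)

# Hypothetical stand-in for dmv_health2 (coordinates are approximate)
toy_points <- data.frame(
  CityName = c("Washington", "Baltimore"),
  lat = c(38.9, 39.3),
  long = c(-77.0, -76.6)
)

toy_map <- leaflet(toy_points) |>
  # Swap the default OpenStreetMap tiles for a light gray basemap
  addProviderTiles(providers$CartoDB.Positron) |>
  addCircles(lng = ~long, lat = ~lat, weight = 1, radius = 500,
             color = "red", stroke = FALSE, fillOpacity = 0.5)
```

Typing `toy_map` on its own line in a chunk displays the map. Running `names(providers)` lists the other basemap names that can be tried.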
4. Refine your map to include a mouse-click tooltip
Refined map chunk here
So for this I also just looked at the notes to see what to do. For the tooltip I added what’s important: the city, the issue (Measure), and its rate (Data_Value).
leaflet(dmv_health2) |>
  addTiles() |>
  addCircleMarkers(
    lng = ~long, lat = ~lat,
    radius = 4,
    color = "darkred",
    stroke = FALSE, fillOpacity = 0.7,
    # Text shown when a point is clicked
    popup = ~paste0("<b>City:</b> ", CityName,
                    "<br><b>Measure:</b> ", Measure,
                    "<br><b>Value:</b> ", Data_Value, "%")
  )
Over here, I adjusted what the tooltip is meant to display. When clicking on a point, you should be able to see the city, the issue, and its rate.
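One way the map could be made more informative is to color each point by its value and add a hover label next to the click popup. The sketch below assumes a made-up `toy_points` table with rough coordinates and invented values. `colorNumeric()`, `addLegend()`, and the `label` argument are standard leaflet features, but the palette and legend title are just illustrative choices.

```r
library(leaflet)

# Hypothetical stand-in for dmv_health2 (values and coordinates are made up)
toy_points <- data.frame(
  CityName = c("Washington", "Baltimore", "Chesapeake"),
  Measure = "Obesity",
  Data_Value = c(22.5, 31.0, 28.4),
  lat = c(38.9, 39.3, 36.8),
  long = c(-77.0, -76.6, -76.3)
)

# Map each value to a color on a yellow-to-red scale
pal <- colorNumeric(palette = "YlOrRd", domain = toy_points$Data_Value)

toy_map <- leaflet(toy_points) |>
  addTiles() |>
  addCircleMarkers(
    lng = ~long, lat = ~lat, radius = 4,
    color = ~pal(Data_Value),
    stroke = FALSE, fillOpacity = 0.7,
    # label shows on hover, popup shows on click
    label = ~CityName,
    popup = ~paste0("<b>City:</b> ", CityName,
                    "<br><b>Measure:</b> ", Measure,
                    "<br><b>Value:</b> ", Data_Value, "%")
  ) |>
  addLegend(position = "bottomright", pal = pal, values = ~Data_Value,
            title = "Crude prevalence (%)")
```

With several measures in the same data frame, it would probably make sense to filter to one Measure first so the color scale compares like with like.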
5. Write a paragraph
In a paragraph, describe the plots you created and what they show.
In these graphs I focused on various health concerns across cities in the DMV. These aren’t exactly health issues per se, but behaviors that can affect people’s health: chronic problems like drinking, obesity, smoking, and more. The barplot shows the average rates of these issues across different DMV cities. Each issue is colored and labeled, and the most prevalent appears to be obesity among adults. I figure that makes sense: obesity has been a pretty big problem in America for a while, so it would be a big issue in the DMV area as well. In second place would be a lack of leisure time. I’m surprised this issue is here, but I suppose it plays a bigger role through how it affects mental health. The map tells a somewhat similar story. It shows the cities that are densely packed with these different issues. I suppose we can note the areas in which these issues have the highest prevalence, and assume that it’s connected to the environment and poor health conditions. The most issue-heavy areas appear to be Chesapeake, Washington, and Baltimore.
Source that helped
(2010, April 18). Select first 4 rows of a data frame in R. Stack Overflow. https://stackoverflow.com/questions/2667673/select-first-4-rows-of-a-data-frame-in-r
I went through the thread ^