library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
df <- read_csv('BeachEColiPredictions.csv')
## Rows: 7050 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Beach Name, Date, Prediction Source, RecordID, Location
## dbl (3): Predicted Level, Latitude, Longitude
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(df) <- gsub(" ", "", names(df))
df <- df %>%
  mutate(
    month = strtoi(substr(Date, 1, 2)),
    day = strtoi(substr(Date, 4, 5)),
    year = strtoi(substr(Date, 7, 10))) %>%
  mutate(Date = as.Date(Date, "%m/%d/%Y"))
head(df)
I think E. coli levels will spike at the same time across multiple nearby beaches.
I think E. coli levels will show year-to-year trends.
I think E. coli levels will be generally higher at the beaches closest to the Chicago River.
Which beaches have the highest and lowest E. coli levels?
Which years and months have the highest and lowest E. coli levels?
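One way to probe the first hypothesis is to check how strongly the daily predictions at different beaches move together. The sketch below is only an illustration: `beach_correlations` is a hypothetical helper name, it assumes the `Date`, `BeachName` and `PredictedLevel` columns prepared above, and it does not account for how close the beaches actually are to each other.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: correlation matrix of daily predicted levels between beaches.
beach_correlations <- function(data) {
  wide <- data %>%
    # Collapse to one value per beach per date in case of duplicate records.
    group_by(Date, BeachName) %>%
    summarise(level = mean(PredictedLevel), .groups = "drop") %>%
    # One row per date, one column per beach.
    pivot_wider(names_from = BeachName, values_from = level)

  # Drop the Date column and compare each pair of beaches on the dates they share.
  cor(select(wide, -Date), use = "pairwise.complete.obs")
}

# Usage, assuming `df` has been prepared as above:
# beach_correlations(df)
```

High correlations between beaches that sit next to each other on the map would be consistent with shared spikes, but the matrix by itself says nothing about geography; it would need to be read alongside `Latitude` and `Longitude`.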
library(sf)
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1; sf_use_s2() is TRUE
chi_map <- read_sf("https://raw.githubusercontent.com/thisisdaryn/data/master/geo/chicago/Comm_Areas.geojson")
ggplot(data = chi_map) +
  geom_sf() +
  geom_point(data = df, aes(x = Longitude, y = Latitude, color = BeachName), size = 4) +
  ggtitle("Locations of Beaches in Dataset")
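# Sum the predictions over all beaches for each date, then plot the daily total per year.
df %>%
  group_by(year, Date) %>%
  summarise(dayTotal = sum(PredictedLevel), .groups = "drop") %>%
  ggplot(aes(Date, dayTotal)) +
  geom_line() +
  facet_wrap(~year, scales = "free") +
  ggtitle("Total predicted E. coli level across all beaches by year")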
No clear trends stand out from this plot alone. 2020 appears to have only 2 data points, presumably because of COVID.
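The 2020 observation can be checked directly by counting records per year rather than reading it off the plot. This is a minimal sketch, assuming the `year` and `Date` columns created earlier; `count_by_year` is a hypothetical helper name.

```r
library(dplyr)

# Hypothetical helper: number of records and distinct dates in each year.
count_by_year <- function(data) {
  data %>%
    group_by(year) %>%
    summarise(
      n_rows = n(),               # total predictions recorded that year
      n_dates = n_distinct(Date), # days with at least one prediction
      .groups = "drop"
    )
}

# Usage, assuming `df` has been prepared as above:
# count_by_year(df)
```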
df %>%
  filter(year == 2022) %>%
  ggplot(aes(Date, PredictedLevel)) +
  geom_col() +
  facet_wrap(~BeachName) +
  ggtitle("Predicted Ecoli levels by beach in 2022")
We can see that 31st beach has consistently much higher levels, followed by Foster and 12th. At the low end, Rogers, Albion, Leone, and Jarvis beaches stay low.
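The ranking read off the facets can be backed up with a numeric summary per beach. The following is a sketch only, assuming the `year`, `BeachName` and `PredictedLevel` columns used above; `beach_means` and its `yr` argument are hypothetical names.

```r
library(dplyr)

# Hypothetical helper: mean and median predicted level per beach for one year,
# sorted from highest to lowest mean.
beach_means <- function(data, yr) {
  data %>%
    filter(year == yr) %>%
    group_by(BeachName) %>%
    summarise(
      mean_level = mean(PredictedLevel, na.rm = TRUE),
      median_level = median(PredictedLevel, na.rm = TRUE),
      n = n(),
      .groups = "drop"
    ) %>%
    arrange(desc(mean_level))
}

# Usage, assuming `df` has been prepared as above:
# beach_means(df, 2022)
```

Reporting the median next to the mean helps show whether a beach is high on most days or only because of a few large spikes.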
df %>%
  group_by(Date) %>%
  summarise(AvgEcoli = mean(PredictedLevel)) %>%
  arrange(desc(AvgEcoli))
This isn't very telling, but August 29th had the highest average E. coli prediction.
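A single top date says little, so a coarser view by year and month may be more useful for the question of which periods are highest and lowest. This is a minimal sketch, assuming the `year` and `month` columns created earlier; `monthly_means` is a hypothetical helper name.

```r
library(dplyr)

# Hypothetical helper: average predicted level per year and month,
# sorted from highest to lowest average.
monthly_means <- function(data) {
  data %>%
    group_by(year, month) %>%
    summarise(
      avg_level = mean(PredictedLevel, na.rm = TRUE),
      n_predictions = n(), # how many predictions the average is based on
      .groups = "drop"
    ) %>%
    arrange(desc(avg_level))
}

# Usage, assuming `df` has been prepared as above:
# monthly_means(df)
```

Keeping `n_predictions` in the output makes it easier to spot months whose average rests on only a handful of records.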
df %>%
  group_by(PredictionSource) %>%
  summarise(avgEcoli = mean(PredictedLevel)) %>%
  ggplot(aes(PredictionSource, avgEcoli)) +
  geom_col()