Importing Libraries and Dataset

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
df <- read_csv('BeachEColiPredictions.csv')
## Rows: 7050 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Beach Name, Date, Prediction Source, RecordID, Location
## dbl (3): Predicted Level, Latitude, Longitude
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
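# Remove spaces from column names, e.g. "Beach Name" becomes "BeachName"
names(df) <- gsub(" ", "", names(df))
# Date arrives as an MM/DD/YYYY string: pull out integer month, day and year
# columns first, then convert Date itself to a proper Date object
df <- df %>% 
  mutate(
    month = strtoi(substr(Date, 1, 2), base = 10L), 
    day = strtoi(substr(Date, 4, 5), base = 10L), 
    year = strtoi(substr(Date, 7, 10), base = 10L)) %>% 
  mutate(Date = as.Date(Date, "%m/%d/%Y"))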
head(df)

What kind of things can we attempt to find in this data?

  • Identifying trends in the predicted level of E. coli
    • By date
    • By location/beach
    • By prediction source
  • Identifying trends in when the predictions were made
    • Were more tests conducted at certain times/locations?
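
One rough way to look at the second bullet is to count how many prediction rows exist per beach, month or year. The sketch below assumes the number of rows is a reasonable stand-in for how often a beach was assessed, which the data does not confirm. `count_predictions` and `n_predictions` are hypothetical names introduced here, not part of the original notebook.

```r
library(dplyr)

# Hypothetical helper: count prediction rows per group, largest first.
# `data` is assumed to contain the grouping columns passed in `...`
# (for example BeachName, month or year as created above).
count_predictions <- function(data, ...) {
  data %>%
    count(..., name = "n_predictions") %>%
    arrange(desc(n_predictions))
}
```

Calling `count_predictions(df, BeachName)` or `count_predictions(df, year, month)` on the data loaded above would then show whether some beaches or periods are covered more heavily than others.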

Some hypotheses the data could help test:

I think E. coli levels will come in spikes that show up across multiple nearby beaches at the same time.

I think E. coli levels will show year-to-year trends.

I think E. coli levels will generally be higher at the beaches closest to the Chicago River.
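
The first hypothesis, that spikes appear at several beaches at once, could be checked roughly by correlating the daily predicted levels between beaches. This is only a sketch: `beach_correlations` is a hypothetical helper, it assumes the cleaned column names `Date`, `BeachName` and `PredictedLevel`, and it says nothing about which beaches are actually near each other.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: pairwise correlation of daily predicted levels.
# Each beach becomes one column, each date one row.
beach_correlations <- function(data) {
  wide <- data %>%
    group_by(Date, BeachName) %>%
    # collapse to one value per beach and date in case of repeated rows
    summarise(level = mean(PredictedLevel, na.rm = TRUE), .groups = "drop") %>%
    pivot_wider(names_from = BeachName, values_from = level)

  # pairwise.complete.obs keeps beach pairs that share only some dates
  cor(select(wide, -Date), use = "pairwise.complete.obs")
}
```

High correlations between beaches would be consistent with shared spikes; comparing them against the `Latitude` and `Longitude` columns would be the next step to see whether distance matters.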

Questions:

Which beaches have the highest and lowest predicted E. coli levels?

Which years/months have the highest and lowest predicted E. coli levels?
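
Both questions come down to averaging `PredictedLevel` within a group and sorting. A minimal sketch, assuming the mean is an acceptable summary (a few extreme days could dominate it); `mean_level_by` is a hypothetical helper name.

```r
library(dplyr)

# Hypothetical helper: mean predicted level per group, highest first.
# `n` is kept so that groups with very few rows are easy to spot.
mean_level_by <- function(data, ...) {
  data %>%
    group_by(...) %>%
    summarise(
      avgEcoli = mean(PredictedLevel, na.rm = TRUE),
      n = n(),
      .groups = "drop"
    ) %>%
    arrange(desc(avgEcoli))
}
```

For example, `mean_level_by(df, BeachName)` would rank beaches and `mean_level_by(df, year, month)` would rank months within years.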

library(sf)
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1; sf_use_s2() is TRUE
chi_map <- read_sf("https://raw.githubusercontent.com/thisisdaryn/data/master/geo/chicago/Comm_Areas.geojson")

ggplot(data = chi_map) + 
  geom_sf() +
  geom_point(data=df, aes(x=Longitude, y=Latitude, color=BeachName), size=4) +
  ggtitle("Locations of Beaches in Dataset")

df %>%
  group_by(year, Date) %>%
  # one row per date: total predicted level across all beaches
  summarise(dayTotal = sum(PredictedLevel), .groups = "drop") %>%
ggplot(aes(Date, dayTotal)) +
  geom_line() +
  facet_wrap(~year, scales="free") +
  ggtitle("Total predicted E. coli level across all beaches by year")

No clear trend stands out from this plot alone. The 2020 panel has only 2 data points, which looks like an effect of COVID.
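
The thin 2020 panel can be confirmed by tabulating how much data each year contains. A small sketch, assuming the `year` and `Date` columns created earlier; `coverage_by_year` and its output column names are hypothetical.

```r
library(dplyr)

# Hypothetical helper: rows, distinct dates and date range per year.
coverage_by_year <- function(data) {
  data %>%
    group_by(year) %>%
    summarise(
      n_rows = n(),
      n_dates = n_distinct(Date),
      first_date = min(Date),
      last_date = max(Date),
      .groups = "drop"
    )
}
```

Running `coverage_by_year(df)` would show directly how few rows and dates 2020 has compared with the other years.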

df %>% 
  filter(year == 2022) %>% 
ggplot(aes(Date, PredictedLevel)) +
  geom_col() +
  facet_wrap(~BeachName) +
  ggtitle("Predicted E. coli levels by beach in 2022")

In 2022, 31st beach has much higher predicted levels than the others fairly consistently, with Foster and 12th behind it. At the low end, Rogers, Albion, Leone and Jarvis beaches stay low throughout.

df %>% 
  group_by(Date) %>%
  summarise(AvgEcoli = mean(PredictedLevel)) %>%
  arrange(desc(AvgEcoli))

This isn’t that telling on its own, but August 29th had the highest average predicted E. coli level.

df %>% 
  group_by(PredictionSource) %>%
  summarise(avgEcoli = mean(PredictedLevel)) %>%
ggplot(aes(PredictionSource, avgEcoli)) +
  geom_col()

What did we learn?