library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
df <- read_csv("snap.csv")
## Rows: 2494 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): District
## dbl (25): Year, Quarter, District Code, Front End Detection System (FEDS) Ca...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df)
# replace spaces with dots and drop commas so the column names are easier to type
names(df) <- str_replace_all(names(df), c(" " = ".", "," = ""))
head(df)
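To make the renaming step easier to follow, here is a minimal sketch of how `str_replace_all()` behaves with a named vector. The two column names are hypothetical examples in the style of the file, not the real header.

```r
library(stringr)

# Hypothetical column names, for illustration only
old_names <- c("District Code", "FEDS Cases, No Errors")

# Each name is a pattern and each value is its replacement;
# the replacements are applied one after the other
new_names <- str_replace_all(old_names, c(" " = ".", "," = ""))
new_names
```

Names without spaces or commas come back unchanged, which is why `Year` and `Quarter` can be used as-is in the plots.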
ggplot(df) +
  geom_col(aes(x = Quarter, y = FEDS.Total.Investigated)) +
  facet_wrap(~Year)
Most quarterly bars sit around 50,000. There are peaks in Q3 of both 2015 and 2017. There are some noticeable gaps in the first half of 2016, and from Q1 2020 onward. I wonder if this is from fewer cases or less data? Let's try to find out.
# convert Year to a factor so it acts as a categorical variable
df$Year <- as.factor(df$Year)
# add a column with a value of 1 for every row
df$Count <- 1
# plot the number of data points (rows) for each year
ggplot(df) +
  geom_col(aes(x = Year, y = Count))
Interesting: we do see a dip in data points in 2016, most likely because some data was not reported (this also shows up in the total cases chart). In 2020 and 2021, however, there are actually more data points than in 2019, so the rapid drop in cases is not a reporting gap. COVID is the most likely cause.
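The "fewer cases or less data" question can also be answered in a single table. This is a rough sketch on made-up numbers (the `toy` tibble is not the SNAP data); it puts the number of rows and the total investigated cases side by side for each year.

```r
library(dplyr)
library(ggplot2)

# Made-up rows, for illustration only
toy <- tibble::tibble(
  Year = factor(c(2019, 2019, 2020, 2020, 2020)),
  FEDS.Total.Investigated = c(400, 600, 10, 20, 30)
)

# Rows per year (how much data) next to total cases per year (how many cases)
by_year <- toy %>%
  group_by(Year) %>%
  summarise(
    n_rows = n(),
    total_investigated = sum(FEDS.Total.Investigated, na.rm = TRUE)
  )
by_year

# geom_bar() counts rows per x value by default, so no helper column is needed
p <- ggplot(toy) +
  geom_bar(aes(x = Year))
```

If a year has as many rows as the others but a much smaller total, the drop comes from the cases themselves and not from missing reports.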
# plot the percent of cases with no errors
ggplot(df) +
  geom_dotplot(aes(x = FEDS.Cases.No.Errors, fill = Year), binwidth = 1) +
  scale_y_continuous(NULL, breaks = NULL) +
  xlab("Percent of cases with no errors")
This is… interesting. All of the data points with 100% no errors (0% errors) are from 2020 onward. Given what we found earlier, this is most likely due to very small sample sizes. The column for 0% no errors (100% errors) is literally off the chart. Let's see if we can adjust the scale to see it better.
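Before adjusting the scale, the small-sample explanation could be checked directly. The sketch below uses made-up rows and assumes that `FEDS.Total.Investigated` is the number of cases behind each percentage, which the file itself would need to confirm.

```r
library(dplyr)

# Made-up rows, for illustration only; the real values come from snap.csv
toy <- tibble::tibble(
  Year = c(2019, 2019, 2020, 2021),
  FEDS.Total.Investigated = c(500, 800, 2, 1),
  FEDS.Cases.No.Errors = c(40, 55, 100, 100)
)

# Compare how many cases stand behind the 100% rows and behind all other rows
size_check <- toy %>%
  mutate(all_no_errors = FEDS.Cases.No.Errors == 100) %>%
  group_by(all_no_errors) %>%
  summarise(
    n_rows = n(),
    median_investigated = median(FEDS.Total.Investigated, na.rm = TRUE)
  )
size_check
```

A much smaller median for the 100% rows would support the idea that those percentages come from only a handful of cases.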
ggplot(df) +
  geom_dotplot(aes(x = FEDS.Cases.No.Errors, fill = Year), binwidth = 1, alpha = 0.2, stackratio = 0.4) +
  scale_y_continuous(NULL, breaks = NULL)
This is really not that useful, but we can at least compare the scale of the values at 0% with the rest of the chart. Finally, I'd like to split the graph by year instead of using color.
ggplot(df) +
  geom_dotplot(aes(x = Year, y = FEDS.Cases.No.Errors), binwidth = 1, binaxis = "y", stackdir = "center") +
  ylab("Percent of cases with no errors")
This is cool, but kind of ruined by the overwhelming number of data points at the top and bottom. Let's try a violin plot just for fun.
ggplot(df, aes(Year, FEDS.Cases.No.Errors)) +
  geom_violin()
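This looks wrong… why are the top and bottom not the widest? This is probably caused by the kernel density estimate that geom_violin() uses to draw the shape: it smooths the spikes at 0% and 100% instead of showing raw counts, and by default it trims the tails at the range of the data.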
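To get a feel for that smoothing, here is a small sketch with base R's `density()` standing in for what the violin does. The numbers are made up, and ggplot2's exact settings may differ from these defaults.

```r
# Made-up percentages with spikes at 0 and 100, for illustration only
x <- c(rep(0, 50), rep(100, 50), seq(10, 90, by = 10))

# Raw counts: the spikes at 0 and 100 dominate
raw_counts <- table(x)
raw_counts

# Kernel density estimate, evaluated only inside the range of the data
d <- density(x, from = 0, to = 100)

# Approximate share of the estimated density that falls inside [0, 100];
# the rest was smoothed past the edges and is cut off
mass_inside <- sum(d$y) * (d$x[2] - d$x[1])
mass_inside
```

A good part of the density from the two spikes is spread beyond 0 and 100 and then cut off, so the ends of the shape look much narrower than the raw counts suggest.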