“Public Assistance and Supplemental Nutrition Assistance (SNAP) Program Fraud Prevention Performance Measures: Beginning 2013”

importing libraries

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

loading csv

df <- read_csv("snap.csv")

## Rows: 2494 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): District
## dbl (25): Year, Quarter, District Code, Front End Detection System (FEDS) Ca...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(df)

handling spaces in column names

names(df) <- str_replace_all(names(df), c(" " = ".", "," = ""))
head(df)
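As a small illustration of what the renaming step does, here is a sketch on a few made-up column names (the names in `raw_names` are hypothetical, not taken from snap.csv). With a named vector, `str_replace_all()` applies each pattern in turn: spaces become dots, then commas are dropped.

```r
library(stringr)

# Hypothetical column names, invented for illustration only
raw_names <- c("District Code", "FEDS Total Investigated", "Cases, No Errors")

# Same call as in the notebook: " " -> "." first, then "," -> ""
clean_names <- str_replace_all(raw_names, c(" " = ".", "," = ""))
print(clean_names)
```

Other characters, such as parentheses, are left as they are, so a name that contains them would still need backticks when used in `aes()`.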

graph 1: Plotting number of cases in each quarter as bar plot faceted by year

ggplot(df) +
  geom_col(aes(x = Quarter, y = FEDS.Total.Investigated)) +
  facet_wrap(~Year)

The quarterly totals average around 50,000, with peaks in Q3 of both 2015 and 2017. There are some noticeable gaps in the first half of 2016, and from Q1 2020 onward. I wonder if this is from fewer cases or less data. Let's try to find out.

graph 2: plotting number of datapoints per year as bar graph

# convert Year to a factor so it acts as a categorical variable
df$Year <- as.factor(df$Year)
# add a helper column with a value of 1 for every row
df$Count <- 1
# plot the number of datapoints (rows) each year
ggplot(df) +
  geom_col(aes(x = Year, y = Count))

Interesting: we do see a dip in datapoints in 2016, most likely because some data was not reported (this shows up in the total cases chart too). In 2020 and 2021, however, there are actually more datapoints than in 2019, so the rapid drop in cases is not a reporting gap. COVID is the most likely cause.
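The "fewer cases or less data" question can also be checked in a single table that puts the row count next to the case total for each year. This is only a sketch: `toy` is a made-up stand-in for `df`, and its numbers are invented for illustration.

```r
library(dplyr)

# Toy stand-in for df; the values are made up for illustration only
toy <- tibble(
  Year = c(2019, 2019, 2020, 2020, 2020),
  FEDS.Total.Investigated = c(120, 80, 5, 0, 10)
)

# For each year: how many rows were reported, and how many cases they add up to
per_year <- toy %>%
  group_by(Year) %>%
  summarise(
    rows = n(),
    total_investigated = sum(FEDS.Total.Investigated, na.rm = TRUE)
  )
print(per_year)
```

A year with many rows but a small total points to fewer cases, while a year with few rows points to missing data.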

graph 3: plotting percent of cases with NO errors in a dotplot

# plot percent of cases with no errors
ggplot(df) +
    geom_dotplot(aes(x = FEDS.Cases.No.Errors, fill = Year), binwidth = 1) +
    scale_y_continuous(NULL, breaks = NULL) +
    xlab("Percent of cases with no errors")

This is… interesting. All of the datapoints with 100% no errors (0% errors) are from 2020 onward. Given what we found earlier (more datapoints but far fewer cases), this is most likely due to very small sample sizes. The column for 0% no errors (100% errors) is literally off the chart. Let's see if we can adjust the scale to see it better.
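The small-sample guess could be checked by listing the rows that sit at 100% together with how many cases each one covers. The sketch below uses a made-up `toy` table and assumes, as the text does, that `FEDS.Cases.No.Errors` holds a percentage.

```r
library(dplyr)

# Toy stand-in for df; the values are made up for illustration only
toy <- tibble(
  Year = c(2019, 2020, 2021, 2021),
  FEDS.Total.Investigated = c(150, 2, 1, 90),
  FEDS.Cases.No.Errors = c(40, 100, 100, 55)
)

# Keep only the rows at 100% and show how many cases stand behind each
perfect <- toy %>%
  filter(FEDS.Cases.No.Errors == 100) %>%
  select(Year, FEDS.Total.Investigated)
print(perfect)
```

If the real rows at 100% also have only a handful of investigated cases, that would support the small-sample explanation.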

graph 4: plotting percent of cases with NO errors in a dotplot with adjusted alpha and stack ratio

ggplot(df) +
    geom_dotplot(aes(x = FEDS.Cases.No.Errors, fill = Year), binwidth = 1, alpha = 0.2, stackratio = 0.4) +
    scale_y_continuous(NULL, breaks = NULL)

This is really not that useful, but we can at least compare the scale of the values at 0% with the rest of the chart. Finally, I'd like to split the graph by year instead of using color.

graph 5: plotting percent of cases with NO errors in a dotplot split by year

ggplot(df) +
    geom_dotplot(aes(x = Year, y = FEDS.Cases.No.Errors), binwidth = 1, binaxis = "y", stackdir = "center") +
    # with binaxis = "y" the percentage is on the y axis, so keep that axis and label it
    ylab("Percent of cases with no errors")

This is cool, but kind of ruined by the overwhelming number of datapoints at the top and bottom. Let's try a violin plot just for fun.

graph 6?: plotting percent of cases with NO errors in a violin plot split by year

ggplot(df, aes(Year, FEDS.Cases.No.Errors)) +
    geom_violin()

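This looks wrong… why are the top and bottom not the widest? It is probably caused by the kernel density estimate that geom_violin() uses to draw the shape, which smooths the spikes at 0% and 100% into the neighboring values.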
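One way to probe the violin question is to run a kernel density estimate directly on data with the same shape. This is a sketch under assumptions: the vector `x` is made up (spikes at 0 and 100 with a thin spread in between), and base R's `density()` over the observed range only roughly mimics what a trimmed violin computes.

```r
# Made-up percentages: spikes at 0 and 100 plus a thin spread in between
x <- c(rep(0, 20), rep(100, 20), rep(seq(5, 95, by = 5), each = 4))

# Kernel density estimate evaluated only over the observed range [0, 100]
d <- density(x, from = 0, to = 100)

peak_at <- d$x[which.max(d$y)]  # where the smoothed curve is tallest
edge_height <- d$y[1]           # height of the curve exactly at 0

cat("peak at:", round(peak_at, 1), "\n")
cat("height at 0 relative to peak:", round(edge_height / max(d$y), 2), "\n")
```

Because every other value lies to one side of a boundary spike, the smoothed curve keeps rising a little past 0 (and past 100 from the other side), so its widest point ends up slightly inside the range rather than at the edge.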

B02

Ethan Niser

9-16-2022