library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(stringr)
setwd("C:/Users/dxj94/OneDrive/Documents/School/MC/DATA 110/Datasets")
bagtax <- read_csv("Bag_Tax_Clean.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   `File ID` = col_double(),
##   Account = col_double(),
##   `Date From` = col_character(),
##   `Date To` = col_character(),
##   `Bag Count` = col_double(),
##   `Amount Collected` = col_double(),
##   `Amount Due` = col_double(),
##   `Amount Retained` = col_double(),
##   `Create Date` = col_character(),
##   `Vendor Name` = col_character(),
##   City = col_character(),
##   State = col_character(),
##   `Zip code` = col_double(),
##   Location1 = col_character()
## )
view(bagtax)

Introduction of the Dataset

This dataset was found in montgomerycountymd.gov. This is a dataset of bag tax. On May 3, 2011, a legislation (Bill 8-11) was enacted in Montgomery county. This legislation imposes an exercise tax on certain carryout bags provided to customers at certain retail establishments. A tax in the amount of 5 cents is levied and imposed on customer for each carryout bag. Each retail establishment may retain 1 cent from each 5-cent tax to cover the administrative expense of collecting and remitting the tax to the county.

This dataset represents information that has been captured since the legislation went into effect. There are fourteen variables including both quantitative and categorical variables.

Description of Variables

File ID: Unique transaction number that is generated when a tax payment is submitted by a vendor

Account: Unique number that is assigned by the system when a vendor registers

Date From: The start date of the filing period

Date To: The end date of the filing period

Bag Count: The total number of bags that a vendor has issued to customers during the filing period

Amount Collected: The amount that a vendor collected from customers during the filing period

Amount Due: The amount that a vendor is required to pay based on the number of bags that were issued during the filing period (number of bags * 4 cents)

Amount Retained: The amount that a vendor is eligible to retain based on the number of bags issued during the filing period (number of bags * 1 cent)

Create Date: The date when the vendor entered filing period information and number of bags issued during the filing period

Vendor Name: The vendor’s business name

City: The vendor’s business City address

State: The vendor’s business State address

Zip code: The vendor’s business Zip address

Location1: The vendor’s City, State, and Zip address combined

Clean Up the Dataset

Make all column names lowercase and replace all spaces between words with underscore.

names(bagtax) <- tolower(names(bagtax))
names(bagtax) <- gsub(" ","_",names(bagtax))
head(bagtax)
## # A tibble: 6 x 14
##   file_id account date_from date_to    bag_count amount_collected amount_due
##     <dbl>   <dbl> <chr>     <chr>          <dbl>            <dbl>      <dbl>
## 1   68385     863 10/1/2020 10/31/2020       454             22.7       18.2
## 2   65348    1300 8/23/2019 6/26/2020       8000            400        320  
## 3   57483    3642 7/1/2019  7/31/2019       3424            171.       137. 
## 4   68185    3838 10/1/2020 10/31/2020       390             19.5       15.6
## 5   68289     512 10/1/2020 10/31/2020      6820            341        273. 
## 6   68165    1302 10/1/2020 10/31/2020      5689            284.       228. 
## # ... with 7 more variables: amount_retained <dbl>, create_date <chr>,
## #   vendor_name <chr>, city <chr>, state <chr>, zip_code <dbl>, location1 <chr>

I created a new variable “year” by selecting the last four character of create_date column. It is the year that the tax was collected for each record.

year <- str_sub(bagtax$create_date, -4, -1)
bagtax_new <- mutate(bagtax, year)
year <- as.numeric(year)
head(bagtax_new)
## # A tibble: 6 x 15
##   file_id account date_from date_to    bag_count amount_collected amount_due
##     <dbl>   <dbl> <chr>     <chr>          <dbl>            <dbl>      <dbl>
## 1   68385     863 10/1/2020 10/31/2020       454             22.7       18.2
## 2   65348    1300 8/23/2019 6/26/2020       8000            400        320  
## 3   57483    3642 7/1/2019  7/31/2019       3424            171.       137. 
## 4   68185    3838 10/1/2020 10/31/2020       390             19.5       15.6
## 5   68289     512 10/1/2020 10/31/2020      6820            341        273. 
## 6   68165    1302 10/1/2020 10/31/2020      5689            284.       228. 
## # ... with 8 more variables: amount_retained <dbl>, create_date <chr>,
## #   vendor_name <chr>, city <chr>, state <chr>, zip_code <dbl>,
## #   location1 <chr>, year <chr>

Visualization and Analysis

I created a new dataframe called bagtax1. This dataframe contains three variables: vendor name, year, and total number of bags for that year.

bagtax1 <- bagtax_new %>%
  select(account, vendor_name, year, amount_collected) %>%
  group_by(vendor_name, year) %>%
  summarise(amount_collected_year = sum(amount_collected)) %>%
  arrange(desc(amount_collected_year))
## `summarise()` has grouped output by 'vendor_name'. You can override using the `.groups` argument.
head(bagtax1)
## # A tibble: 6 x 3
## # Groups:   vendor_name [1]
##   vendor_name            year  amount_collected_year
##   <chr>                  <chr>                 <dbl>
## 1 Giant of Maryland, LLC 2020                824578.
## 2 Giant of Maryland, LLC 2015                720069.
## 3 Giant of Maryland, LLC 2018                701668.
## 4 Giant of Maryland, LLC 2016                698741.
## 5 Giant of Maryland, LLC 2019                698634.
## 6 Giant of Maryland, LLC 2017                665263.

In the following chunk, I selected the most popular grocery vendors and made a line graph. The amount of tax collected is directly associated with number of bags provided to customers. I’m expecting that bag tax is an incentive for customers to use reusable bags instead of purchasing plastic bags every time they go shopping. However, we can’t conclude that from the graph showing below. The data drops in 2021 because the dataset only include January and February data in 2021. Target and Walmart had a slightly decrease from 2018 to 2020. Giant also had a slightly decrease from 2015 - 2017. Except for that, majority of the time, we don’t see much decrease in the sale of bag in grocery retailers. The purpose of the legislation isn’t to generate more revenue, but to change customers’ behavior. Plastic pollution is a serious environmental issue we need to be aware and deal with, though the behavior is not easy to change. 5 cents per bag might not look much for one time shopping, but it can accumulate and make a big difference. With the bag tax revenue, the government will have more budget to manage the pollution and other environmental issues.

bagtax1 %>%
  filter(vendor_name == "Giant of Maryland, LLC" | 
         vendor_name == "Whole Foods Market Group Inc" |
         vendor_name == "Harris Teeter" |
         vendor_name == "Trader Joe's Company" |
         vendor_name == "Target Corporation" |
         vendor_name == "Wal-Mart Stores Inc" ) %>%
  ggplot(bagtax1, mapping = aes(x = year, y = amount_collected_year, color = vendor_name, group = vendor_name)) +
  geom_line()+
  xlab("Year")+
  ylab("Amount of Tax Collected Per Year")+
  ggtitle("Montgomery County Grocery Retailers Bag Tax Collection Year 2012 - 2021")

I also made a voilin plot to show the distribution of tax collection per filing period for each grocery retailer. Except for Giant, all other five retailers have a median between $7,500 - $15,000 and have no or few outliers. Giant is the largest grocery retailer in Montgomery county and it has the highest bag tax collection per filing period

bagtax_new %>%
  filter(vendor_name == "Giant of Maryland, LLC" | 
         vendor_name == "Whole Foods Market Group Inc" |
         vendor_name == "Harris Teeter" |
         vendor_name == "Trader Joe's Company" |
         vendor_name == "Target Corporation" |
         vendor_name == "Wal-Mart Stores Inc" ) %>%
  ggplot(bagtax1, mapping = aes(y = vendor_name, x = amount_collected))+
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))+
  xlab("Amount of Tax Collected Per Filing Period")+
  ylab("Grocery Retailers")+
  ggtitle("Montgomery County Grocery Retailer Bag Tax Collection Violon Plot")

Anything you wish you could have included?

I think it would be great if the dataset can have another variable to separate vendor categories, such as grocery store, restaurants, apparal, etc. It would be easier if we want to focus on only one category or to compare among different categories.

In addition, the filing period, date_from and date_to, have many errors. I assume because all of these information are manually entered by vendors when they report and file the bag tax. These errors make these two columns difficult to use. Instead, I use the create_date which is the timestamp when each vendor file the tax.