library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.6 v dplyr 1.0.4
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(stringr)
setwd("C:/Users/dxj94/OneDrive/Documents/School/MC/DATA 110/Datasets")
bagtax <- read_csv("Bag_Tax_Clean.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## `File ID` = col_double(),
## Account = col_double(),
## `Date From` = col_character(),
## `Date To` = col_character(),
## `Bag Count` = col_double(),
## `Amount Collected` = col_double(),
## `Amount Due` = col_double(),
## `Amount Retained` = col_double(),
## `Create Date` = col_character(),
## `Vendor Name` = col_character(),
## City = col_character(),
## State = col_character(),
## `Zip code` = col_double(),
## Location1 = col_character()
## )
view(bagtax)
This dataset was found in montgomerycountymd.gov. This is a dataset of bag tax. On May 3, 2011, a legislation (Bill 8-11) was enacted in Montgomery county. This legislation imposes an exercise tax on certain carryout bags provided to customers at certain retail establishments. A tax in the amount of 5 cents is levied and imposed on customer for each carryout bag. Each retail establishment may retain 1 cent from each 5-cent tax to cover the administrative expense of collecting and remitting the tax to the county.
This dataset represents information that has been captured since the legislation went into effect. There are fourteen variables including both quantitative and categorical variables.
File ID: Unique transaction number that is generated when a tax payment is submitted by a vendor
Account: Unique number that is assigned by the system when a vendor registers
Date From: The start date of the filing period
Date To: The end date of the filing period
Bag Count: The total number of bags that a vendor has issued to customers during the filing period
Amount Collected: The amount that a vendor collected from customers during the filing period
Amount Due: The amount that a vendor is required to pay based on the number of bags that were issued during the filing period (number of bags * 4 cents)
Amount Retained: The amount that a vendor is eligible to retain based on the number of bags issued during the filing period (number of bags * 1 cent)
Create Date: The date when the vendor entered filing period information and number of bags issued during the filing period
Vendor Name: The vendor’s business name
City: The vendor’s business City address
State: The vendor’s business State address
Zip code: The vendor’s business Zip address
Location1: The vendor’s City, State, and Zip address combined
Make all column names lowercase and replace all spaces between words with underscore.
names(bagtax) <- tolower(names(bagtax))
names(bagtax) <- gsub(" ","_",names(bagtax))
head(bagtax)
## # A tibble: 6 x 14
## file_id account date_from date_to bag_count amount_collected amount_due
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 68385 863 10/1/2020 10/31/2020 454 22.7 18.2
## 2 65348 1300 8/23/2019 6/26/2020 8000 400 320
## 3 57483 3642 7/1/2019 7/31/2019 3424 171. 137.
## 4 68185 3838 10/1/2020 10/31/2020 390 19.5 15.6
## 5 68289 512 10/1/2020 10/31/2020 6820 341 273.
## 6 68165 1302 10/1/2020 10/31/2020 5689 284. 228.
## # ... with 7 more variables: amount_retained <dbl>, create_date <chr>,
## # vendor_name <chr>, city <chr>, state <chr>, zip_code <dbl>, location1 <chr>
I created a new variable “year” by selecting the last four character of create_date column. It is the year that the tax was collected for each record.
year <- str_sub(bagtax$create_date, -4, -1)
bagtax_new <- mutate(bagtax, year)
year <- as.numeric(year)
head(bagtax_new)
## # A tibble: 6 x 15
## file_id account date_from date_to bag_count amount_collected amount_due
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 68385 863 10/1/2020 10/31/2020 454 22.7 18.2
## 2 65348 1300 8/23/2019 6/26/2020 8000 400 320
## 3 57483 3642 7/1/2019 7/31/2019 3424 171. 137.
## 4 68185 3838 10/1/2020 10/31/2020 390 19.5 15.6
## 5 68289 512 10/1/2020 10/31/2020 6820 341 273.
## 6 68165 1302 10/1/2020 10/31/2020 5689 284. 228.
## # ... with 8 more variables: amount_retained <dbl>, create_date <chr>,
## # vendor_name <chr>, city <chr>, state <chr>, zip_code <dbl>,
## # location1 <chr>, year <chr>
I created a new dataframe called bagtax1. This dataframe contains three variables: vendor name, year, and total number of bags for that year.
bagtax1 <- bagtax_new %>%
select(account, vendor_name, year, amount_collected) %>%
group_by(vendor_name, year) %>%
summarise(amount_collected_year = sum(amount_collected)) %>%
arrange(desc(amount_collected_year))
## `summarise()` has grouped output by 'vendor_name'. You can override using the `.groups` argument.
head(bagtax1)
## # A tibble: 6 x 3
## # Groups: vendor_name [1]
## vendor_name year amount_collected_year
## <chr> <chr> <dbl>
## 1 Giant of Maryland, LLC 2020 824578.
## 2 Giant of Maryland, LLC 2015 720069.
## 3 Giant of Maryland, LLC 2018 701668.
## 4 Giant of Maryland, LLC 2016 698741.
## 5 Giant of Maryland, LLC 2019 698634.
## 6 Giant of Maryland, LLC 2017 665263.
In the following chunk, I selected the most popular grocery vendors and made a line graph. The amount of tax collected is directly associated with number of bags provided to customers. I’m expecting that bag tax is an incentive for customers to use reusable bags instead of purchasing plastic bags every time they go shopping. However, we can’t conclude that from the graph showing below. The data drops in 2021 because the dataset only include January and February data in 2021. Target and Walmart had a slightly decrease from 2018 to 2020. Giant also had a slightly decrease from 2015 - 2017. Except for that, majority of the time, we don’t see much decrease in the sale of bag in grocery retailers. The purpose of the legislation isn’t to generate more revenue, but to change customers’ behavior. Plastic pollution is a serious environmental issue we need to be aware and deal with, though the behavior is not easy to change. 5 cents per bag might not look much for one time shopping, but it can accumulate and make a big difference. With the bag tax revenue, the government will have more budget to manage the pollution and other environmental issues.
bagtax1 %>%
filter(vendor_name == "Giant of Maryland, LLC" |
vendor_name == "Whole Foods Market Group Inc" |
vendor_name == "Harris Teeter" |
vendor_name == "Trader Joe's Company" |
vendor_name == "Target Corporation" |
vendor_name == "Wal-Mart Stores Inc" ) %>%
ggplot(bagtax1, mapping = aes(x = year, y = amount_collected_year, color = vendor_name, group = vendor_name)) +
geom_line()+
xlab("Year")+
ylab("Amount of Tax Collected Per Year")+
ggtitle("Montgomery County Grocery Retailers Bag Tax Collection Year 2012 - 2021")
I also made a voilin plot to show the distribution of tax collection per filing period for each grocery retailer. Except for Giant, all other five retailers have a median between $7,500 - $15,000 and have no or few outliers. Giant is the largest grocery retailer in Montgomery county and it has the highest bag tax collection per filing period
bagtax_new %>%
filter(vendor_name == "Giant of Maryland, LLC" |
vendor_name == "Whole Foods Market Group Inc" |
vendor_name == "Harris Teeter" |
vendor_name == "Trader Joe's Company" |
vendor_name == "Target Corporation" |
vendor_name == "Wal-Mart Stores Inc" ) %>%
ggplot(bagtax1, mapping = aes(y = vendor_name, x = amount_collected))+
geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))+
xlab("Amount of Tax Collected Per Filing Period")+
ylab("Grocery Retailers")+
ggtitle("Montgomery County Grocery Retailer Bag Tax Collection Violon Plot")
I think it would be great if the dataset can have another variable to separate vendor categories, such as grocery store, restaurants, apparal, etc. It would be easier if we want to focus on only one category or to compare among different categories.
In addition, the filing period, date_from and date_to, have many errors. I assume because all of these information are manually entered by vendors when they report and file the bag tax. These errors make these two columns difficult to use. Instead, I use the create_date which is the timestamp when each vendor file the tax.