First Dataset

What is the best candy?

I choose the article SO MUCH CANDY DATA, SERIOUSLY. This data is based on a survey conducted in 2017 on what people would like to get for Halloween. The orignal survey looks like [this] (https://www.scq.ubc.ca/wp-content/uploads/2017/10/candyhierarchysurvey2017.pdf). The data recorded was very wide with some manual entries for country and state.There are also a lot of missing data where the question was not answered completely.

I am interested to find out the kind of candy people wish to get on Halloween in 2017 and create a ranking for it.

Issues

  • Biggest problem is data corruption caused by human error. Spelling mistakes, answer is not uniform, and NULL answers.
  • Counties and States not spelled correctly will be part of the margin of error and be omitted from analysis.
  • Preference is not numerical
  • Data set where answers are incomplete.
  • Columns contain unidentified characters.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
url <-"https://raw.githubusercontent.com/amily52131/DATA607/refs/heads/main/Project_2/candyhierarchy2017.csv"
file<- read.csv(url, fileEncoding = "UTF-8")
# Taking care of empty spaces and set them to NA
file[file ==""] <- NA
# Changing the ranking for candy to number for analysis later
file[file=="MEH"] <- 2
file[file=="JOY"] <- 3
file[file=="DESPAIR"] <- 1

glimpse(file)
## Rows: 2,460
## Columns: 120
## $ Internal.ID                                                                           <int> …
## $ Q1..GOING.OUT.                                                                        <chr> …
## $ Q2..GENDER                                                                            <chr> …
## $ Q3..AGE                                                                               <chr> …
## $ Q4..COUNTRY                                                                           <chr> …
## $ Q5..STATE..PROVINCE..COUNTY..ETC                                                      <chr> …
## $ Q6...100.Grand.Bar                                                                    <chr> …
## $ Q6...Anonymous.brown.globs.that.come.in.black.and.orange.wrappers..a.k.a..Mary.Janes. <chr> …
## $ Q6...Any.full.sized.candy.bar                                                         <chr> …
## $ Q6...Black.Jacks                                                                      <chr> …
## $ Q6...Bonkers..the.candy.                                                              <chr> …
## $ Q6...Bonkers..the.board.game.                                                         <chr> …
## $ Q6...Bottle.Caps                                                                      <chr> …
## $ Q6...Box.o.Raisins                                                                    <chr> …
## $ Q6...Broken.glow.stick                                                                <chr> …
## $ Q6...Butterfinger                                                                     <chr> …
## $ Q6...Cadbury.Creme.Eggs                                                               <chr> …
## $ Q6...Candy.Corn                                                                       <chr> …
## $ Q6...Candy.that.is.clearly.just.the.stuff.given.out.for.free.at.restaurants           <chr> …
## $ Q6...Caramellos                                                                       <chr> …
## $ Q6...Cash..or.other.forms.of.legal.tender                                             <chr> …
## $ Q6...Chardonnay                                                                       <chr> …
## $ Q6...Chick.o.Sticks..we.donÕt.know.what.that.is.                                      <chr> …
## $ Q6...Chiclets                                                                         <chr> …
## $ Q6...Coffee.Crisp                                                                     <chr> …
## $ Q6...Creepy.Religious.comics.Chick.Tracts                                             <chr> …
## $ Q6...Dental.paraphenalia                                                              <chr> …
## $ Q6...Dots                                                                             <chr> …
## $ Q6...Dove.Bars                                                                        <chr> …
## $ Q6...Fuzzy.Peaches                                                                    <chr> …
## $ Q6...Generic.Brand.Acetaminophen                                                      <chr> …
## $ Q6...Glow.sticks                                                                      <chr> …
## $ Q6...Goo.Goo.Clusters                                                                 <chr> …
## $ Q6...Good.N..Plenty                                                                   <chr> …
## $ Q6...Gum.from.baseball.cards                                                          <chr> …
## $ Q6...Gummy.Bears.straight.up                                                          <chr> …
## $ Q6...Hard.Candy                                                                       <chr> …
## $ Q6...Healthy.Fruit                                                                    <chr> …
## $ Q6...Heath.Bar                                                                        <chr> …
## $ Q6...Hershey.s.Dark.Chocolate                                                         <chr> …
## $ Q6...HersheyÕs.Milk.Chocolate                                                         <chr> …
## $ Q6...Hershey.s.Kisses                                                                 <chr> …
## $ Q6...Hugs..actual.physical.hugs.                                                      <chr> …
## $ Q6...Jolly.Rancher..bad.flavor.                                                       <chr> …
## $ Q6...Jolly.Ranchers..good.flavor.                                                     <chr> …
## $ Q6...JoyJoy..Mit.Iodine..                                                             <chr> …
## $ Q6...Junior.Mints                                                                     <chr> …
## $ Q6...Senior.Mints                                                                     <chr> …
## $ Q6...Kale.smoothie                                                                    <chr> …
## $ Q6...Kinder.Happy.Hippo                                                               <chr> …
## $ Q6...Kit.Kat                                                                          <chr> …
## $ Q6...LaffyTaffy                                                                       <chr> …
## $ Q6...LemonHeads                                                                       <chr> …
## $ Q6...Licorice..not.black.                                                             <chr> …
## $ Q6...Licorice..yes.black.                                                             <chr> …
## $ Q6...Lindt.Truffle                                                                    <chr> …
## $ Q6...Lollipops                                                                        <chr> …
## $ Q6...Mars                                                                             <chr> …
## $ Q6...Maynards                                                                         <chr> …
## $ Q6...Mike.and.Ike                                                                     <chr> …
## $ Q6...Milk.Duds                                                                        <chr> …
## $ Q6...Milky.Way                                                                        <chr> …
## $ Q6...Regular.M.Ms                                                                     <chr> …
## $ Q6...Peanut.M.MÕs                                                                     <chr> …
## $ Q6...Blue.M.M.s                                                                       <chr> …
## $ Q6...Red.M.M.s                                                                        <chr> …
## $ Q6...Green.Party.M.M.s                                                                <chr> …
## $ Q6...Independent.M.M.s                                                                <chr> …
## $ Q6...Abstained.from.M.M.ing.                                                          <chr> …
## $ Q6...Minibags.of.chips                                                                <chr> …
## $ Q6...Mint.Kisses                                                                      <chr> …
## $ Q6...Mint.Juleps                                                                      <chr> …
## $ Q6...Mr..Goodbar                                                                      <chr> …
## $ Q6...Necco.Wafers                                                                     <chr> …
## $ Q6...Nerds                                                                            <chr> …
## $ Q6...Nestle.Crunch                                                                    <chr> …
## $ Q6...Now.n.Laters                                                                     <chr> …
## $ Q6...Peeps                                                                            <chr> …
## $ Q6...Pencils                                                                          <chr> …
## $ Q6...Pixy.Stix                                                                        <chr> …
## $ Q6...Real.Housewives.of.Orange.County.Season.9.Blue.Ray                               <chr> …
## $ Q6...ReeseÕs.Peanut.Butter.Cups                                                       <chr> …
## $ Q6...Reese.s.Pieces                                                                   <chr> …
## $ Q6...Reggie.Jackson.Bar                                                               <chr> …
## $ Q6...Rolos                                                                            <chr> …
## $ Q6...Sandwich.sized.bags.filled.with.BooBerry.Crunch                                  <chr> …
## $ Q6...Skittles                                                                         <chr> …
## $ Q6...Smarties..American.                                                              <chr> …
## $ Q6...Smarties..Commonwealth.                                                          <chr> …
## $ Q6...Snickers                                                                         <chr> …
## $ Q6...Sourpatch.Kids..i.e..abominations.of.nature.                                     <chr> …
## $ Q6...Spotted.Dick                                                                     <chr> …
## $ Q6...Starburst                                                                        <chr> …
## $ Q6...Sweet.Tarts                                                                      <chr> …
## $ Q6...Swedish.Fish                                                                     <chr> …
## $ Q6...Sweetums..a.friend.to.diabetes.                                                  <chr> …
## $ Q6...Take.5                                                                           <chr> …
## $ Q6...Tic.Tacs                                                                         <chr> …
## $ Q6...Those.odd.marshmallow.circus.peanut.things                                       <chr> …
## $ Q6...Three.Musketeers                                                                 <chr> …
## $ Q6...Tolberone.something.or.other                                                     <chr> …
## $ Q6...Trail.Mix                                                                        <chr> …
## $ Q6...Twix                                                                             <chr> …
## $ Q6...Vials.of.pure.high.fructose.corn.syrup..for.main.lining.into.your.vein           <chr> …
## $ Q6...Vicodin                                                                          <chr> …
## $ Q6...Whatchamacallit.Bars                                                             <chr> …
## $ Q6...White.Bread                                                                      <chr> …
## $ Q6...Whole.Wheat.anything                                                             <chr> …
## $ Q6...York.Peppermint.Patties                                                          <chr> …
## $ Q7..JOY.OTHER                                                                         <chr> …
## $ Q8..DESPAIR.OTHER                                                                     <chr> …
## $ Q9..OTHER.COMMENTS                                                                    <chr> …
## $ Q10..DRESS                                                                            <chr> …
## $ X                                                                                     <chr> …
## $ Q11..DAY                                                                              <chr> …
## $ Q12..MEDIA..Daily.Dish.                                                               <int> …
## $ Q12..MEDIA..Science.                                                                  <int> …
## $ Q12..MEDIA..ESPN.                                                                     <int> …
## $ Q12..MEDIA..Yahoo.                                                                    <int> …
## $ Click.Coordinates..x..y.                                                              <chr> …

Cleaning Up States column

Issues

  • Match pattern with State Names spelled out to the two letter State abbreviations
  • Parse the cell to extract the states
  • Leave all the other states from other countries alone

Solutions

  • Have a state table for matching with the full name and the two character abbreviation
  • Created a function to breakdown the State string and isolate string that match the State abbreviations and
US_States <- file$Q5..STATE..PROVINCE..COUNTY..ETC

# State Patterns

sname <- c("District of Columbia" ,  "Alabama"  , "Alaska"  , "American Samoa" ,   "Arizona" ,  "Arkansas" ,  "California" , "Colorado" , "Connecticut" , "Delaware" , "Florida" , "Georgia" , "Guam" , "Hawaii"  ,  "Idaho"  ,  "Illinois"  ,  "Indiana"  ,  "Iowa" , "Kansas" , "Kentucky" , "Louisiana" , "Maine" , "Maryland" , "Massachusetts" , "Michigan" , "Minnesota" , "Mississippi" , "Missouri" , "Montana" , "Nebraska" , "Nevada" , "New Hampshire" , "New Jersey" , "New Mexico" , "New York" , "North Carolina" , "North Dakota" , "Ohio" , "Oklahoma" , "Oregon" , "Pennsylvania" , "Puerto Rico" , "Rhode Island" , "South Carolina" , "South Dakota" , "Tennessee" , "Texas" , "Utah" , "Vermont" , "Virginia" , "Washington" , "West Virginia" , "Wisconsin" , "Wyoming")
states2 <- c("DC" , "AL" , "AK" , "AS" , "AZ" , "AR" , "CA" , "CO" , "CT" , "DE" , "FL" , "GA" , "GU" , "HI" , "ID" , "IL" , "IN" , "IA" , "KS" , "KY" , "LA" , "ME" , "MD" , "MA" , "MI" , "MN" , "MS" , "MO" , "MT" , "NE" , "NV" , "NH" , "NJ" , "NM" , "NY" , "NC" , "ND" , "OH" , "OK" , "OR" , "PA" , "PR" , "RI" , "SC" , "SD" , "TN" , "TX" , "UT" , "VT" , "VA" , "WA" , "WV" , "WI" , "WY")

spattern <- data.frame(sname ,  states2)

# first iteration of states to convert state name to two letter abbreviation
for(s in spattern$sname){
  States <- agrep(s ,  US_States , ignore.case = TRUE, max.distance = 1)
  US_States[States] <- toupper(spattern$states2[spattern$sname==s])
}

#ST <- as.data.frame(US_States) %>% distinct()

# helper function to check if state pattern exists in the string
return_logic <- function(spattern, string){
  b <- FALSE
  w <- strsplit(string, "[, ;.]+") #%>% unlist()
  for (c in w){
    if(2 %in%str_length(c) && TRUE %in% grepl(spattern, c, ignore.case = TRUE)){
      b <- TRUE
    }
  }
  returnValue(b)
}
# second iteration of states to convert state name to two letter abbreviation
for (s in spattern$states2) {
  for (x in US_States) {
    if (return_logic(s,x) == TRUE){
      US_States[US_States==x] <- s
    }
  }
}

ST <- as.data.frame(US_States) %>% distinct() %>% arrange(US_States)
head(ST)
##   US_States
## 1         1
## 2        48
## 3         A
## 4        AB
## 5        AK
## 6        AL
# Overwrite the result of states cleaned
file <- file %>% mutate(States = US_States)

Cleaning Up Country Column

  • The unidentified characters really became a problem in this part. So I used regular expression and approximate match to get the likely county that the user inputed.
# Clean up country names 
Countries <- file$Q4..COUNTRY %>% toupper() %>% str_trim()

# Pattern to match
pat <- c("USA","U.S.A", "U.S","United States","Costa Rica","America")
# Other countries
opat <- c("England", "Canada", "Mexico","Germany")

# Match pattern where several instance of misspelled United States of America
# Assign the Country to America
for (p in pat){
  matches <- agrep(p, Countries,ignore.case = TRUE, max.distance = 1)
  Countries[matches] <- "USA"
}
# Match pattern for other countries
for (p in opat){
  matches <- agrep(p, Countries,ignore.case = TRUE, max.distance = 1)
  Countries[matches] <- toupper(p)
}
# Some left the US State as the Country
for(s in spattern$sname){
  matches <- grep(s ,  Countries , ignore.case = TRUE)
  Countries[matches] <- "USA"
}

Co <- as.data.frame(Countries) %>% distinct() %>% arrange(Countries)

file <- file %>% mutate(Country = Countries)

Extracting Valid Data

  • Data where country is USA
  • State is a valid USA State
  • All the Candy with the ranking data
# Remove special characters from the column names
names(file) <- gsub("Õ", "'", names(file))
names(file) <- gsub("\\."," ",names(file))
# Valid US data.
vd_US <- file %>% 
  mutate(notPerfect =  file %>% 
           select(starts_with("Q6")) %>%
           is.na %>% rowSums, .after = starts_with("Q5")) %>% 
  filter(Country == "USA" & States %in% spattern$states2 & notPerfect == 0) %>% 
  select(ID = 'Internal ID', 
         Gender = starts_with("Q2"), 
         Age = starts_with("Q3"),
         Country,
         States,
         starts_with("Q6"))  %>% 
 # rename_with(~gsub("^Q6   ","", .), starts_with("Q6"))
  pivot_longer(
    cols = starts_with("Q6"),
    names_to = "Candy_Catagory",
    values_to = "Rank",
    values_drop_na = TRUE
  )

# Extract the values for candy type
vd_US$Candy_Catagory <- gsub("Q6   ","",vd_US$Candy_Catagory)
# Convert Rank as integer
vd_US$Rank <- as.integer(vd_US$Rank)
glimpse(vd_US)
## Rows: 89,095
## Columns: 7
## $ ID             <int> 90272821, 90272821, 90272821, 90272821, 90272821, 90272…
## $ Gender         <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male",…
## $ Age            <chr> "44", "44", "44", "44", "44", "44", "44", "44", "44", "…
## $ Country        <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA",…
## $ States         <chr> "NM", "NM", "NM", "NM", "NM", "NM", "NM", "NM", "NM", "…
## $ Candy_Catagory <chr> "100 Grand Bar", "Anonymous brown globs that come in bl…
## $ Rank           <int> 2, 1, 3, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 3, 2, 1, 1, 1…

Analyze

We will look at the top 10 candy from the survey.

# Candy Rank
candy_rank <- vd_US %>% 
  group_by(Candy_Catagory) %>% 
  summarize(Ranking = sum(Rank)) %>% 
  arrange(desc(Ranking))

# Plot the first 10 candy categories from rank
candy_rank %>% 
  head(10) %>%  
  ggplot(aes(y = reorder(Candy_Catagory, Ranking), x = Ranking)) +
  geom_bar(stat = "identity") + 
  labs(
    y = "Candy Category",
    x = "Score",
    title = "Most Wanted Candy Category for Halloween 2017"
  )

Conclusion

This data set is difficult to clean up because there were many fields where it was freely entered by the participants of the survey. I tried using regular expression and did my best to clean up the location part of the dataset so I can do my analysis only on US data. My analysis shows that the most popular candy for this data set is any full size candy bar follow by Twix.

Second Data Set

Audio Book and rating

This dataset from Kaggle Audible Dataset aimed to understand how the audio book market has been growing over the years.This uncleaned data needs cleaning up in each observations to make the data type standardized. I am only looking at films in the English language for this project

Load the data

# Get the list of audible books from csv file uploaded to Github
url <-"https://raw.githubusercontent.com/amily52131/DATA607/refs/heads/main/Project_2/audible_uncleaned.csv"
audible_file<- read.csv(url)
# Filter only the English books
audible_file <- audible_file %>% 
  filter(language == "English")

Clean up the Data

The authors column was cleaned up by removing “Writtenby” so that we can do analysis with author name. The narrator column was cleaned up by removing “Narratedby” so that we can do analysis with narrator name. Standardized the movie run time length by extracting the hours and minutes from the time field and calculated a new column called total_length. Separated ratings and reviews into two columns since rating is always out of five we don’t have to specify in the column.

# Cleaning up the columns
# Authors column remove "Writtenby"
audible_file$author <- gsub("Writtenby:", "", audible_file$author)
# Narrator column remove "Narratedby:"
audible_file$narrator <- gsub("Narratedby:", "", audible_file$narrator)
# Extract hours and minutes to total_length in minutes
audible_file <- audible_file %>% 
  mutate(hours = as.numeric(gsub("([0-9]+) hr.*", "\\1",audible_file$time)),
         minutes = as.numeric(gsub(".* ([0-9]+) min.*$", "\\1",audible_file$time)),
         hours = ifelse(is.na(hours), 0, hours),
         minutes = ifelse(is.na(minutes), as.numeric(gsub("([0-9]+) min.*$", "\\1",audible_file$time)) , minutes),
         minutes = ifelse(is.na(minutes), 0, minutes),
         total_length = hours * 60 + minutes
  )
## Warning: There were 3 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `hours = as.numeric(gsub("([0-9]+) hr.*", "\\1",
##   audible_file$time))`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.
# Standardize date format
audible_file$releasedate <- dmy(audible_file$releasedate)

# Extract ratings and reviews
audible_file <- audible_file %>% 
  mutate(
  star_rating = as.numeric(gsub(" out of 5 stars.*", "", audible_file$stars)),
  reviews = as.numeric(gsub(".*stars([0-9]+) rating.*", "\\1", stars))
  )
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `star_rating = as.numeric(gsub(" out of 5 stars.*", "",
##   audible_file$stars))`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.

Analyze

With the data cleaned up we can now answer some questions as such as of the top rated audio books what is the most common length of audio book? With the histogram plotted out we can see the the most common length of audio books 4 stars above is 3 hours and 19 minutes to 9 hours and 59 minutes.

common_length <- audible_file %>% 
  filter(star_rating > 4) %>% 
  ggplot(aes(x = total_length)) +
  geom_histogram(bins = 20)
common_length

# get the information for histogram
plot_data <- ggplot_build(common_length)
# extract the bin data
bin_data <- plot_data$data[[1]]
# get the top 5 most counted container and their minutes value
bin_data  %>% 
  select(count, xmin, xmax) %>% 
  mutate(
         min_hour = ifelse(xmin<0, 0, floor(xmin/60)),
         min_min = xmin %% 60,
         max_hour = floor (xmax/60),
         max_min = xmax %% 60,
         ) %>% 
  head(5) %>% 
  arrange(desc(count)) %>% 
  print()
##   count      xmin      xmax min_hour  min_min max_hour  max_min
## 1  5323  199.7368  599.2105        3 19.73684        9 59.21053
## 2  2636  599.2105  998.6842        9 59.21053       16 38.68421
## 3  1507 -199.7368  199.7368        0 40.26316        3 19.73684
## 4   598  998.6842 1398.1579       16 38.68421       23 18.15789
## 5   177 1398.1579 1797.6316       23 18.15789       29 57.63158

Conclusion

This data set is not tidy because some cells contain more than one type of information like in the case of the category “stars”. To resolve this issue and make the data tidy, I use regular expression to extract the numbers for stars rating and separated the number of reviewers to two columns. I also converted the movie time from text form to total_length so we can analyze using the total audio length in minutes. I found out that the most common length of audio length for ratings more than 4 is from 3 hours and 19 minutes to 9 hours and 59 minutes.

Third Data Set

DOHMH New York City Restaurant Inspection Results

I am using dataset from DOHMH New York City Restaurant Inspection Results where it contains the violation citations from every full or special program inspection conducted up to three years prior to the most recent inspection for restaurants and college cafeterias. This data was compiled from more than one data source and contains many blank values making it not tidy.

Load Data

Since this dataset is very large, the file is not uploaded to github and I will be running this locally on my computer.

# Get the list of restaurant violation citations
url <-"DOHMH_New_York_City_Restaurant_Inspection_Results.csv"
doh_file<- read.csv(url)

Data Cleaning

This dataset is fairly regular but because there are a lot of empty values it makes it hard to run any analysis. For data cleaning I will mainly focus on handling the missing data.

# Handle "New Establishment"
# Per the data index provided by DOH those with inspection data "01/01/1900" are new establishments created so does not have violation, grade, or any information

# NE is variable for new establishment
NE <- "01/01/1900"
# NV is variable for action = No violations were recorded at the time of this inspection.
NV <- "No violations were recorded at the time of this inspection."
# RO is a variable for action = Establishment re-opened by DOHMH.
RO <- "Establishment re-opened by DOHMH."

doh_file <- doh_file %>% 
  mutate(
    # when establishment is new
    CUISINE.DESCRIPTION = ifelse(INSPECTION.DATE == NE, "NEW ESTABLISHMENT", CUISINE.DESCRIPTION),
    ACTION = ifelse(INSPECTION.DATE == NE, "NEW ESTABLISHMENT", ACTION),
    VIOLATION.CODE = ifelse(INSPECTION.DATE == NE, "NEW ESTABLISHMENT", VIOLATION.CODE),
    VIOLATION.DESCRIPTION = ifelse(INSPECTION.DATE == NE, "NEW ESTABLISHMENT", VIOLATION.DESCRIPTION),
    INSPECTION.TYPE = ifelse(INSPECTION.DATE == NE, "NEW ESTABLISHMENT", INSPECTION.TYPE),
    
    # When no violation was recorded No violations were recorded at the time of this inspection.
    VIOLATION.CODE = ifelse(ACTION == NV, "NONE", VIOLATION.CODE),
    VIOLATION.DESCRIPTION = ifelse(ACTION == NV, "NO VIOLATION RECORDED", VIOLATION.DESCRIPTION),
    
    # When store is reopened by DOHMH
    VIOLATION.CODE = ifelse(ACTION == RO, "Re-opened", VIOLATION.CODE),
    VIOLATION.DESCRIPTION = ifelse(ACTION == RO, "Establishment Re-Opened by DOHMH", VIOLATION.DESCRIPTION),
    
   # convert Inspection Date to date object 
   INSPECTION.DATE = as.Date(INSPECTION.DATE, format = "%m/%d/%Y")
  )
  
# Any field that does not have a value will be set to NA
doh_file[doh_file ==""] <- NA

Analyze

With the data being tidy now, we can do an analysis on violations over time by violation code. With the graph it is interesting to see that from the early year of 2020 to mid 2021 there were no violations cited probably due to COVID pandemic.

# List of monthly citations without the new establishments where violation code is not missing
monthly_citation_type <- doh_file %>% 
  filter(INSPECTION.DATE>1/1/1900, !is.na(VIOLATION.CODE)) %>% 
  mutate(month = floor_date(INSPECTION.DATE, "month")) %>% 
  group_by(month, VIOLATION.CODE) %>% 
  summarise(type_count = n(), .groups = "drop") %>% 
  #left_join(daily_citation) %>% 
  ggplot(aes(x = month, y = type_count, fill = VIOLATION.CODE))+
  geom_bar(stat = "identity") +
  labs(title = "Monthly Violation Counts by Violation Code",
       x = "Inspection Date",
       y = "Number of Violations"
       ) +
  theme(legend.position = "none") +
  coord_flip()

monthly_citation_type

Conclusion

This data from Department of Health with restaurant violation citation has many missing values due to the way this report was put together from other datasets not visible to me. I did my best to filled the categorical values of missing data if I know the reason that it was missing such as the new establishment listing if the inspection date is 1/1/1900. The analysis I did was trying to see if there was a trend for violation citations and it is interesting to see that from early 2020 to end of 2021 there were no citations likely due to COVID. It is also interesting to see that the number of citations spiked way up after pandemic. This may be due to the way this report was put together for citations of the last three years or since the last inspection.