I choose the article SO MUCH CANDY DATA, SERIOUSLY. This data is based on a survey conducted in 2017 on what people would like to get for Halloween. The orignal survey looks like [this] (https://www.scq.ubc.ca/wp-content/uploads/2017/10/candyhierarchysurvey2017.pdf). The data recorded was very wide with some manual entries for country and state.There are also a lot of missing data where the question was not answered completely.
I am interested to find out the kind of candy people wish to get on Halloween in 2017 and create a ranking for it.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
url <-"https://raw.githubusercontent.com/amily52131/DATA607/refs/heads/main/Project_2/candyhierarchy2017.csv"
file<- read.csv(url, fileEncoding = "UTF-8")
# Taking care of empty spaces and set them to NA
file[file ==""] <- NA
# Changing the ranking for candy to number for analysis later
file[file=="MEH"] <- 2
file[file=="JOY"] <- 3
file[file=="DESPAIR"] <- 1
glimpse(file)
## Rows: 2,460
## Columns: 120
## $ Internal.ID <int> …
## $ Q1..GOING.OUT. <chr> …
## $ Q2..GENDER <chr> …
## $ Q3..AGE <chr> …
## $ Q4..COUNTRY <chr> …
## $ Q5..STATE..PROVINCE..COUNTY..ETC <chr> …
## $ Q6...100.Grand.Bar <chr> …
## $ Q6...Anonymous.brown.globs.that.come.in.black.and.orange.wrappers..a.k.a..Mary.Janes. <chr> …
## $ Q6...Any.full.sized.candy.bar <chr> …
## $ Q6...Black.Jacks <chr> …
## $ Q6...Bonkers..the.candy. <chr> …
## $ Q6...Bonkers..the.board.game. <chr> …
## $ Q6...Bottle.Caps <chr> …
## $ Q6...Box.o.Raisins <chr> …
## $ Q6...Broken.glow.stick <chr> …
## $ Q6...Butterfinger <chr> …
## $ Q6...Cadbury.Creme.Eggs <chr> …
## $ Q6...Candy.Corn <chr> …
## $ Q6...Candy.that.is.clearly.just.the.stuff.given.out.for.free.at.restaurants <chr> …
## $ Q6...Caramellos <chr> …
## $ Q6...Cash..or.other.forms.of.legal.tender <chr> …
## $ Q6...Chardonnay <chr> …
## $ Q6...Chick.o.Sticks..we.donÕt.know.what.that.is. <chr> …
## $ Q6...Chiclets <chr> …
## $ Q6...Coffee.Crisp <chr> …
## $ Q6...Creepy.Religious.comics.Chick.Tracts <chr> …
## $ Q6...Dental.paraphenalia <chr> …
## $ Q6...Dots <chr> …
## $ Q6...Dove.Bars <chr> …
## $ Q6...Fuzzy.Peaches <chr> …
## $ Q6...Generic.Brand.Acetaminophen <chr> …
## $ Q6...Glow.sticks <chr> …
## $ Q6...Goo.Goo.Clusters <chr> …
## $ Q6...Good.N..Plenty <chr> …
## $ Q6...Gum.from.baseball.cards <chr> …
## $ Q6...Gummy.Bears.straight.up <chr> …
## $ Q6...Hard.Candy <chr> …
## $ Q6...Healthy.Fruit <chr> …
## $ Q6...Heath.Bar <chr> …
## $ Q6...Hershey.s.Dark.Chocolate <chr> …
## $ Q6...HersheyÕs.Milk.Chocolate <chr> …
## $ Q6...Hershey.s.Kisses <chr> …
## $ Q6...Hugs..actual.physical.hugs. <chr> …
## $ Q6...Jolly.Rancher..bad.flavor. <chr> …
## $ Q6...Jolly.Ranchers..good.flavor. <chr> …
## $ Q6...JoyJoy..Mit.Iodine.. <chr> …
## $ Q6...Junior.Mints <chr> …
## $ Q6...Senior.Mints <chr> …
## $ Q6...Kale.smoothie <chr> …
## $ Q6...Kinder.Happy.Hippo <chr> …
## $ Q6...Kit.Kat <chr> …
## $ Q6...LaffyTaffy <chr> …
## $ Q6...LemonHeads <chr> …
## $ Q6...Licorice..not.black. <chr> …
## $ Q6...Licorice..yes.black. <chr> …
## $ Q6...Lindt.Truffle <chr> …
## $ Q6...Lollipops <chr> …
## $ Q6...Mars <chr> …
## $ Q6...Maynards <chr> …
## $ Q6...Mike.and.Ike <chr> …
## $ Q6...Milk.Duds <chr> …
## $ Q6...Milky.Way <chr> …
## $ Q6...Regular.M.Ms <chr> …
## $ Q6...Peanut.M.MÕs <chr> …
## $ Q6...Blue.M.M.s <chr> …
## $ Q6...Red.M.M.s <chr> …
## $ Q6...Green.Party.M.M.s <chr> …
## $ Q6...Independent.M.M.s <chr> …
## $ Q6...Abstained.from.M.M.ing. <chr> …
## $ Q6...Minibags.of.chips <chr> …
## $ Q6...Mint.Kisses <chr> …
## $ Q6...Mint.Juleps <chr> …
## $ Q6...Mr..Goodbar <chr> …
## $ Q6...Necco.Wafers <chr> …
## $ Q6...Nerds <chr> …
## $ Q6...Nestle.Crunch <chr> …
## $ Q6...Now.n.Laters <chr> …
## $ Q6...Peeps <chr> …
## $ Q6...Pencils <chr> …
## $ Q6...Pixy.Stix <chr> …
## $ Q6...Real.Housewives.of.Orange.County.Season.9.Blue.Ray <chr> …
## $ Q6...ReeseÕs.Peanut.Butter.Cups <chr> …
## $ Q6...Reese.s.Pieces <chr> …
## $ Q6...Reggie.Jackson.Bar <chr> …
## $ Q6...Rolos <chr> …
## $ Q6...Sandwich.sized.bags.filled.with.BooBerry.Crunch <chr> …
## $ Q6...Skittles <chr> …
## $ Q6...Smarties..American. <chr> …
## $ Q6...Smarties..Commonwealth. <chr> …
## $ Q6...Snickers <chr> …
## $ Q6...Sourpatch.Kids..i.e..abominations.of.nature. <chr> …
## $ Q6...Spotted.Dick <chr> …
## $ Q6...Starburst <chr> …
## $ Q6...Sweet.Tarts <chr> …
## $ Q6...Swedish.Fish <chr> …
## $ Q6...Sweetums..a.friend.to.diabetes. <chr> …
## $ Q6...Take.5 <chr> …
## $ Q6...Tic.Tacs <chr> …
## $ Q6...Those.odd.marshmallow.circus.peanut.things <chr> …
## $ Q6...Three.Musketeers <chr> …
## $ Q6...Tolberone.something.or.other <chr> …
## $ Q6...Trail.Mix <chr> …
## $ Q6...Twix <chr> …
## $ Q6...Vials.of.pure.high.fructose.corn.syrup..for.main.lining.into.your.vein <chr> …
## $ Q6...Vicodin <chr> …
## $ Q6...Whatchamacallit.Bars <chr> …
## $ Q6...White.Bread <chr> …
## $ Q6...Whole.Wheat.anything <chr> …
## $ Q6...York.Peppermint.Patties <chr> …
## $ Q7..JOY.OTHER <chr> …
## $ Q8..DESPAIR.OTHER <chr> …
## $ Q9..OTHER.COMMENTS <chr> …
## $ Q10..DRESS <chr> …
## $ X <chr> …
## $ Q11..DAY <chr> …
## $ Q12..MEDIA..Daily.Dish. <int> …
## $ Q12..MEDIA..Science. <int> …
## $ Q12..MEDIA..ESPN. <int> …
## $ Q12..MEDIA..Yahoo. <int> …
## $ Click.Coordinates..x..y. <chr> …
US_States <- file$Q5..STATE..PROVINCE..COUNTY..ETC
# State Patterns
sname <- c("District of Columbia" , "Alabama" , "Alaska" , "American Samoa" , "Arizona" , "Arkansas" , "California" , "Colorado" , "Connecticut" , "Delaware" , "Florida" , "Georgia" , "Guam" , "Hawaii" , "Idaho" , "Illinois" , "Indiana" , "Iowa" , "Kansas" , "Kentucky" , "Louisiana" , "Maine" , "Maryland" , "Massachusetts" , "Michigan" , "Minnesota" , "Mississippi" , "Missouri" , "Montana" , "Nebraska" , "Nevada" , "New Hampshire" , "New Jersey" , "New Mexico" , "New York" , "North Carolina" , "North Dakota" , "Ohio" , "Oklahoma" , "Oregon" , "Pennsylvania" , "Puerto Rico" , "Rhode Island" , "South Carolina" , "South Dakota" , "Tennessee" , "Texas" , "Utah" , "Vermont" , "Virginia" , "Washington" , "West Virginia" , "Wisconsin" , "Wyoming")
states2 <- c("DC" , "AL" , "AK" , "AS" , "AZ" , "AR" , "CA" , "CO" , "CT" , "DE" , "FL" , "GA" , "GU" , "HI" , "ID" , "IL" , "IN" , "IA" , "KS" , "KY" , "LA" , "ME" , "MD" , "MA" , "MI" , "MN" , "MS" , "MO" , "MT" , "NE" , "NV" , "NH" , "NJ" , "NM" , "NY" , "NC" , "ND" , "OH" , "OK" , "OR" , "PA" , "PR" , "RI" , "SC" , "SD" , "TN" , "TX" , "UT" , "VT" , "VA" , "WA" , "WV" , "WI" , "WY")
spattern <- data.frame(sname , states2)
# first iteration of states to convert state name to two letter abbreviation
for(s in spattern$sname){
States <- agrep(s , US_States , ignore.case = TRUE, max.distance = 1)
US_States[States] <- toupper(spattern$states2[spattern$sname==s])
}
#ST <- as.data.frame(US_States) %>% distinct()
# helper function to check if state pattern exists in the string
return_logic <- function(spattern, string){
b <- FALSE
w <- strsplit(string, "[, ;.]+") #%>% unlist()
for (c in w){
if(2 %in%str_length(c) && TRUE %in% grepl(spattern, c, ignore.case = TRUE)){
b <- TRUE
}
}
returnValue(b)
}
# second iteration of states to convert state name to two letter abbreviation
for (s in spattern$states2) {
for (x in US_States) {
if (return_logic(s,x) == TRUE){
US_States[US_States==x] <- s
}
}
}
ST <- as.data.frame(US_States) %>% distinct() %>% arrange(US_States)
head(ST)
## US_States
## 1 1
## 2 48
## 3 A
## 4 AB
## 5 AK
## 6 AL
# Overwrite the result of states cleaned
file <- file %>% mutate(States = US_States)
# Clean up country names
Countries <- file$Q4..COUNTRY %>% toupper() %>% str_trim()
# Pattern to match
pat <- c("USA","U.S.A", "U.S","United States","Costa Rica","America")
# Other countries
opat <- c("England", "Canada", "Mexico","Germany")
# Match pattern where several instance of misspelled United States of America
# Assign the Country to America
for (p in pat){
matches <- agrep(p, Countries,ignore.case = TRUE, max.distance = 1)
Countries[matches] <- "USA"
}
# Match pattern for other countries
for (p in opat){
matches <- agrep(p, Countries,ignore.case = TRUE, max.distance = 1)
Countries[matches] <- toupper(p)
}
# Some left the US State as the Country
for(s in spattern$sname){
matches <- grep(s , Countries , ignore.case = TRUE)
Countries[matches] <- "USA"
}
Co <- as.data.frame(Countries) %>% distinct() %>% arrange(Countries)
file <- file %>% mutate(Country = Countries)
# Remove special characters from the column names
names(file) <- gsub("Õ", "'", names(file))
names(file) <- gsub("\\."," ",names(file))
# Valid US data.
vd_US <- file %>%
mutate(notPerfect = file %>%
select(starts_with("Q6")) %>%
is.na %>% rowSums, .after = starts_with("Q5")) %>%
filter(Country == "USA" & States %in% spattern$states2 & notPerfect == 0) %>%
select(ID = 'Internal ID',
Gender = starts_with("Q2"),
Age = starts_with("Q3"),
Country,
States,
starts_with("Q6")) %>%
# rename_with(~gsub("^Q6 ","", .), starts_with("Q6"))
pivot_longer(
cols = starts_with("Q6"),
names_to = "Candy_Catagory",
values_to = "Rank",
values_drop_na = TRUE
)
# Extract the values for candy type
vd_US$Candy_Catagory <- gsub("Q6 ","",vd_US$Candy_Catagory)
# Convert Rank as integer
vd_US$Rank <- as.integer(vd_US$Rank)
glimpse(vd_US)
## Rows: 89,095
## Columns: 7
## $ ID <int> 90272821, 90272821, 90272821, 90272821, 90272821, 90272…
## $ Gender <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male",…
## $ Age <chr> "44", "44", "44", "44", "44", "44", "44", "44", "44", "…
## $ Country <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA",…
## $ States <chr> "NM", "NM", "NM", "NM", "NM", "NM", "NM", "NM", "NM", "…
## $ Candy_Catagory <chr> "100 Grand Bar", "Anonymous brown globs that come in bl…
## $ Rank <int> 2, 1, 3, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 3, 2, 1, 1, 1…
We will look at the top 10 candy from the survey.
# Candy Rank
candy_rank <- vd_US %>%
group_by(Candy_Catagory) %>%
summarize(Ranking = sum(Rank)) %>%
arrange(desc(Ranking))
# Plot the first 10 candy categories from rank
candy_rank %>%
head(10) %>%
ggplot(aes(y = reorder(Candy_Catagory, Ranking), x = Ranking)) +
geom_bar(stat = "identity") +
labs(
y = "Candy Category",
x = "Score",
title = "Most Wanted Candy Category for Halloween 2017"
)
This data set is difficult to clean up because there were many fields where it was freely entered by the participants of the survey. I tried using regular expression and did my best to clean up the location part of the dataset so I can do my analysis only on US data. My analysis shows that the most popular candy for this data set is any full size candy bar follow by Twix.
This dataset from Kaggle Audible Dataset aimed to understand how the audio book market has been growing over the years.This uncleaned data needs cleaning up in each observations to make the data type standardized. I am only looking at films in the English language for this project
# Get the list of audible books from csv file uploaded to Github
url <-"https://raw.githubusercontent.com/amily52131/DATA607/refs/heads/main/Project_2/audible_uncleaned.csv"
audible_file<- read.csv(url)
# Filter only the English books
audible_file <- audible_file %>%
filter(language == "English")
The authors column was cleaned up by removing “Writtenby” so that we can do analysis with author name. The narrator column was cleaned up by removing “Narratedby” so that we can do analysis with narrator name. Standardized the movie run time length by extracting the hours and minutes from the time field and calculated a new column called total_length. Separated ratings and reviews into two columns since rating is always out of five we don’t have to specify in the column.
# Cleaning up the columns
# Authors column remove "Writtenby"
audible_file$author <- gsub("Writtenby:", "", audible_file$author)
# Narrator column remove "Narratedby:"
audible_file$narrator <- gsub("Narratedby:", "", audible_file$narrator)
# Extract hours and minutes to total_length in minutes
audible_file <- audible_file %>%
mutate(hours = as.numeric(gsub("([0-9]+) hr.*", "\\1",audible_file$time)),
minutes = as.numeric(gsub(".* ([0-9]+) min.*$", "\\1",audible_file$time)),
hours = ifelse(is.na(hours), 0, hours),
minutes = ifelse(is.na(minutes), as.numeric(gsub("([0-9]+) min.*$", "\\1",audible_file$time)) , minutes),
minutes = ifelse(is.na(minutes), 0, minutes),
total_length = hours * 60 + minutes
)
## Warning: There were 3 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `hours = as.numeric(gsub("([0-9]+) hr.*", "\\1",
## audible_file$time))`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.
# Standardize date format
audible_file$releasedate <- dmy(audible_file$releasedate)
# Extract ratings and reviews
audible_file <- audible_file %>%
mutate(
star_rating = as.numeric(gsub(" out of 5 stars.*", "", audible_file$stars)),
reviews = as.numeric(gsub(".*stars([0-9]+) rating.*", "\\1", stars))
)
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `star_rating = as.numeric(gsub(" out of 5 stars.*", "",
## audible_file$stars))`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
With the data cleaned up we can now answer some questions as such as of the top rated audio books what is the most common length of audio book? With the histogram plotted out we can see the the most common length of audio books 4 stars above is 3 hours and 19 minutes to 9 hours and 59 minutes.
common_length <- audible_file %>%
filter(star_rating > 4) %>%
ggplot(aes(x = total_length)) +
geom_histogram(bins = 20)
common_length
# get the information for histogram
plot_data <- ggplot_build(common_length)
# extract the bin data
bin_data <- plot_data$data[[1]]
# get the top 5 most counted container and their minutes value
bin_data %>%
select(count, xmin, xmax) %>%
mutate(
min_hour = ifelse(xmin<0, 0, floor(xmin/60)),
min_min = xmin %% 60,
max_hour = floor (xmax/60),
max_min = xmax %% 60,
) %>%
head(5) %>%
arrange(desc(count)) %>%
print()
## count xmin xmax min_hour min_min max_hour max_min
## 1 5323 199.7368 599.2105 3 19.73684 9 59.21053
## 2 2636 599.2105 998.6842 9 59.21053 16 38.68421
## 3 1507 -199.7368 199.7368 0 40.26316 3 19.73684
## 4 598 998.6842 1398.1579 16 38.68421 23 18.15789
## 5 177 1398.1579 1797.6316 23 18.15789 29 57.63158
This data set is not tidy because some cells contain more than one type of information like in the case of the category “stars”. To resolve this issue and make the data tidy, I use regular expression to extract the numbers for stars rating and separated the number of reviewers to two columns. I also converted the movie time from text form to total_length so we can analyze using the total audio length in minutes. I found out that the most common length of audio length for ratings more than 4 is from 3 hours and 19 minutes to 9 hours and 59 minutes.
I am using dataset from DOHMH New York City Restaurant Inspection Results where it contains the violation citations from every full or special program inspection conducted up to three years prior to the most recent inspection for restaurants and college cafeterias. This data was compiled from more than one data source and contains many blank values making it not tidy.
Since this dataset is very large, the file is not uploaded to github and I will be running this locally on my computer.
# Get the list of restaurant violation citations
url <-"DOHMH_New_York_City_Restaurant_Inspection_Results.csv"
doh_file<- read.csv(url)
This dataset is fairly regular but because there are a lot of empty values it makes it hard to run any analysis. For data cleaning I will mainly focus on handling the missing data.
# Handle "New Establishment"
# Per the data index provided by DOH those with inspection data "01/01/1900" are new establishments created so does not have violation, grade, or any information
# NE is variable for new establishment
NE <- "01/01/1900"
# NV is variable for action = No violations were recorded at the time of this inspection.
NV <- "No violations were recorded at the time of this inspection."
# RO is a variable for action = Establishment re-opened by DOHMH.
RO <- "Establishment re-opened by DOHMH."
doh_file <- doh_file %>%
mutate(
# when establishment is new
CUISINE.DESCRIPTION = ifelse(INSPECTION.DATE == NE, "NEW ESTABLISHMENT", CUISINE.DESCRIPTION),
ACTION = ifelse(INSPECTION.DATE == NE, "NEW ESTABLISHMENT", ACTION),
VIOLATION.CODE = ifelse(INSPECTION.DATE == NE, "NEW ESTABLISHMENT", VIOLATION.CODE),
VIOLATION.DESCRIPTION = ifelse(INSPECTION.DATE == NE, "NEW ESTABLISHMENT", VIOLATION.DESCRIPTION),
INSPECTION.TYPE = ifelse(INSPECTION.DATE == NE, "NEW ESTABLISHMENT", INSPECTION.TYPE),
# When no violation was recorded No violations were recorded at the time of this inspection.
VIOLATION.CODE = ifelse(ACTION == NV, "NONE", VIOLATION.CODE),
VIOLATION.DESCRIPTION = ifelse(ACTION == NV, "NO VIOLATION RECORDED", VIOLATION.DESCRIPTION),
# When store is reopened by DOHMH
VIOLATION.CODE = ifelse(ACTION == RO, "Re-opened", VIOLATION.CODE),
VIOLATION.DESCRIPTION = ifelse(ACTION == RO, "Establishment Re-Opened by DOHMH", VIOLATION.DESCRIPTION),
# convert Inspection Date to date object
INSPECTION.DATE = as.Date(INSPECTION.DATE, format = "%m/%d/%Y")
)
# Any field that does not have a value will be set to NA
doh_file[doh_file ==""] <- NA
With the data being tidy now, we can do an analysis on violations over time by violation code. With the graph it is interesting to see that from the early year of 2020 to mid 2021 there were no violations cited probably due to COVID pandemic.
# List of monthly citations without the new establishments where violation code is not missing
monthly_citation_type <- doh_file %>%
filter(INSPECTION.DATE>1/1/1900, !is.na(VIOLATION.CODE)) %>%
mutate(month = floor_date(INSPECTION.DATE, "month")) %>%
group_by(month, VIOLATION.CODE) %>%
summarise(type_count = n(), .groups = "drop") %>%
#left_join(daily_citation) %>%
ggplot(aes(x = month, y = type_count, fill = VIOLATION.CODE))+
geom_bar(stat = "identity") +
labs(title = "Monthly Violation Counts by Violation Code",
x = "Inspection Date",
y = "Number of Violations"
) +
theme(legend.position = "none") +
coord_flip()
monthly_citation_type
This data from Department of Health with restaurant violation citation has many missing values due to the way this report was put together from other datasets not visible to me. I did my best to filled the categorical values of missing data if I know the reason that it was missing such as the new establishment listing if the inspection date is 1/1/1900. The analysis I did was trying to see if there was a trend for violation citations and it is interesting to see that from early 2020 to end of 2021 there were no citations likely due to COVID. It is also interesting to see that the number of citations spiked way up after pandemic. This may be due to the way this report was put together for citations of the last three years or since the last inspection.