Chocolate is a food product made from cocoa beans mixed with fat, sugar and other ingredients that could differs from each type of chocolate. This particular session is narrowly focused on plain dark chocolate type. Which by definition, dark chocolate contains 50-90% cocoa solid, cocoa butter, and sugar.
The rating which will be heavily analyzed is a database from “Flavors of Cacao” which can be retrieved from Tidy Tuesday. The ratings (as the Flavors of Cacao claimed) do not reflect health benefits, social missions, or organic status.
#Library Preparation
library(tidytuesdayR)
library(tidyverse)
library(ggplot2)
library(scales)chocolate <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv')Data Glossary
| Variable | Description |
|---|---|
| ref | Reference ID, The highest REF numbers were the last entries made. |
| company_manufacturer | Manufacturer name |
| company_location | Manufacturer region |
| review_date | Review date (year) |
| country_of_bean_origin | Country of origin |
| specific_bean_origin_or_bar_name | Specific bean or bar name |
| cocoa_percent | Cocoa percent (% chocolate) |
| ingredients | Ingredients, (“#” = represents the number of ingredients in the chocolate; B = Beans, S = Sugar, S* = Sweetener other than white cane or beet sugar, C = Cocoa Butter, V = Vanilla, L = Lecithin, Sa = Salt) |
| most_memorable_characteristics | Most Memorable Characteristics column is a summary review of the most memorable characteristics of that bar. Terms generally relate to anything from texture, flavor, overall opinion, etc. separated by ‘,’ |
| rating | rating between 1-5 |
Based on data source: “Flavors of Cacao”
| Rating Scale | Interpretation |
|---|---|
| 4.0 - 5.0 | Outstanding |
| 3.5 - 3.9 | Highly Recommended |
| 3.0 - 3.49 | Recommended |
| 2.0 - 2.9 | Disappointing |
| 1.0 - 1.9 | Unpleasant |
any(duplicated(chocolate))>> [1] FALSE
#No duplicate dataanyNA(chocolate)>> [1] TRUE
#Check NA in each column
map(chocolate, ~sum(is.na(.)))>> $ref
>> [1] 0
>>
>> $company_manufacturer
>> [1] 0
>>
>> $company_location
>> [1] 0
>>
>> $review_date
>> [1] 0
>>
>> $country_of_bean_origin
>> [1] 0
>>
>> $specific_bean_origin_or_bar_name
>> [1] 0
>>
>> $cocoa_percent
>> [1] 0
>>
>> $ingredients
>> [1] 87
>>
>> $most_memorable_characteristics
>> [1] 0
>>
>> $rating
>> [1] 0
#Proportion of NA to all data
prop.table(table(is.na(chocolate$ingredients)))*100>>
>> FALSE TRUE
>> 96.561265 3.438735
Based on the proportion of NA value, I’m not going to drop any NA
value now, considering that NA value doesn’t have significant number
(and exists only in one column). I’ll include the observation which
contained NA value as long as the observation isn’t related to
ingredients column, and apply drop_na or
na.omit for any analysis that related to
ingredients column.
Inspect Data Frame
chocolateData Frame Adjustments:
ref (wouldn’t be analyzed)cocoa_percent columningredients column (separate into new column to place the
number of ingredients)chocolate <- chocolate %>%
mutate(n_ingredients = str_split(ingredients, "(- |-)",simplify = T)[,1], #Split between number and components, take number and input to column n_ingredients
ingredients = str_split(ingredients, "(- |-)",simplify = T)[,2], #Split between number and components, take number and input to column ingredients
cocoa_percent = str_remove(cocoa_percent, "%"), #remove %
review_date = as.factor(review_date)) %>%
select(company_manufacturer, company_location, review_date, country_of_bean_origin, specific_bean_origin_or_bar_name, cocoa_percent, n_ingredients, everything(), -c(ref),) #rearrange column (optional)
#change data type
chocolate[, c("cocoa_percent", "n_ingredients")] <- lapply(chocolate[, c("cocoa_percent", "n_ingredients")], as.numeric)
str(chocolate)>> tibble [2,530 × 10] (S3: tbl_df/tbl/data.frame)
>> $ company_manufacturer : chr [1:2530] "5150" "5150" "5150" "5150" ...
>> $ company_location : chr [1:2530] "U.S.A." "U.S.A." "U.S.A." "U.S.A." ...
>> $ review_date : Factor w/ 16 levels "2006","2007",..: 14 14 14 16 16 16 16 7 7 8 ...
>> $ country_of_bean_origin : chr [1:2530] "Tanzania" "Dominican Republic" "Madagascar" "Fiji" ...
>> $ specific_bean_origin_or_bar_name: chr [1:2530] "Kokoa Kamili, batch 1" "Zorzal, batch 1" "Bejofo Estate, batch 1" "Matasawalevu, batch 1" ...
>> $ cocoa_percent : num [1:2530] 76 76 76 68 72 80 68 70 63 70 ...
>> $ n_ingredients : num [1:2530] 3 3 3 3 3 3 3 4 4 4 ...
>> $ ingredients : chr [1:2530] "B,S,C" "B,S,C" "B,S,C" "B,S,C" ...
>> $ most_memorable_characteristics : chr [1:2530] "rich cocoa, fatty, bready" "cocoa, vegetal, savory" "cocoa, blackberry, full body" "chewy, off, rubbery" ...
>> $ rating : num [1:2530] 3.25 3.5 3.75 3 3 3.25 3.5 3.5 3.75 2.75 ...
chocolate#Inspect rating distribution
par(mfrow = c(2,1))
hist(chocolate$rating, breaks = 10, main = "Rating Distribution",xlab = NULL, ylab = NULL)
boxplot(chocolate$rating, horizontal = T, main = "Rating Distribution")par(mfrow = c(1,1))Based on the plot above, we can conclude:
Rating Distribution through the Years
ggplot(chocolate, aes(x = review_date, y = rating)) +
geom_boxplot( fill = "#8cafff", col = "#09202d", outlier.colour = "red") +
coord_flip() +
labs(x = NULL, y = NULL,
title = "Review Distribution in Each Year") +
theme_minimal()From the plot above we can conclude that:
#Change rating into categorical
chocolate.Rating <- chocolate %>%
mutate(rating = case_when(between(rating, 4.0, 5.0 )~ "Oustanding",
between(rating, 3.5, 3.9)~ "Highly Recommended",
between(rating, 3.0, 3.49)~ "Recommended",
between(rating, 2.0, 2.9)~ "Disappointing",
between(rating, 1.0, 1.9)~ "Unpleasant",
T ~ as.character(rating)),
rating = as.factor(rating))
#Rearrange level
chocolate.Rating$rating <- factor(chocolate.Rating$rating,
levels = c("Unpleasant", "Disappointing", "Recommended", "Highly Recommended", "Oustanding"))
chocolate.Ratingggplot(chocolate.Rating, aes(x = cocoa_percent, group = rating, fill =rating )) +
geom_density( alpha = .6, show.legend = F) +
facet_wrap(~rating, scales = "free") +
scale_x_continuous(breaks = seq(40, 100, 10),
limits = c(40, 100)) +
scale_y_continuous(breaks = seq(0, 0.4, 0.1),
limit = c(0, 0.4)) +
labs(x = NULL, y = NULL,
title = "Cocoa Percentage Frequency by Rating Category") +
theme_minimal() +
theme(strip.placement = "outside",
strip.background = element_rect(fill = "black"),
panel.background = element_rect(colour = "grey"),
strip.text = element_text(color = "white", size = 9),
plot.title = element_text(family = "serif", size = 20, face = "bold", hjust = 0.5))insight:
#Subset relevant data
popular.country.origin <- chocolate.Rating %>%
group_by(country_of_bean_origin) %>%
summarise(Total = n()) %>%
arrange(desc(Total))
ggplot(popular.country.origin[1:15,], aes(x = Total, y = reorder(country_of_bean_origin, Total) )) +
geom_segment(aes(y = reorder(country_of_bean_origin, Total), yend = reorder(country_of_bean_origin, Total),
x = 0 , xend = Total, col = Total), show.legend = F) +
geom_point(size= 4, aes(col = Total), alpha = .5, show.legend = F) +
scale_color_gradient(high = "#228e40" , low = "#3cffcd") +
geom_text(aes(label = Total), fontface = "bold", size = 3, hjust = 1, color = "blue", x=0) +
theme_minimal() +
scale_fill_gradient(high = "#ffcd42" , low = "#6b7781") +
scale_x_continuous(limit = c(0, 280),
breaks = seq(0, 300, 50)) +
labs(x = NULL, y = NULL,
title = "Popular Country of Bean Origin") +
theme(axis.text = element_text(size = 11),
plot.title = element_text(family = "serif", size = 18, face = "bold", hjust = 0.5))popular.beans <- chocolate %>%
group_by(specific_bean_origin_or_bar_name) %>%
summarise(Total = n()) %>%
arrange(desc(Total)) %>%
head(20)
popular.beans#List the popular bean only
popular.beans <- popular.beans$specific_bean_origin_or_bar_name
#Filter based on popular beans only
plot.popular.beans <- chocolate %>%
filter(specific_bean_origin_or_bar_name %in% popular.beans)
# Rearrange level of Country to most popular beans to least popular, so facet plot will be ordered as level
plot.popular.beans$specific_bean_origin_or_bar_name <- factor(plot.popular.beans$specific_bean_origin_or_bar_name,
levels = c("Madagascar", "Ecuador","Peru","Dominican Republic","Chuao","Venezuela",
"Kokoa Kamili","Ghana","Papua New Guinea","Sambirano", "Belize","Ocumare",
"Oko Caribe","Tanzania","Ucayali", "Alto Beni", "Maya Mountain","Porcelana",
"Vietnam","Brazil")) ggplot(plot.popular.beans, aes(x = specific_bean_origin_or_bar_name, y= rating, group = specific_bean_origin_or_bar_name )) +
geom_boxplot(outlier.shape = NA) +
geom_hline(aes(yintercept = mean(rating)), data = chocolate, col = "orange", linetype = 2, size = 0.8) +
geom_count(aes(col = rating), alpha = 0.6, show.legend = T) +
scale_x_discrete(position = "top", labels = scales::wrap_format(2)) +
scale_y_continuous(breaks = seq(1.5, 4, .25), position = "right") +
scale_color_gradient(low = "#c1857e", high = "#3a3a98") +
scale_size_area(max_size = 6) +
labs(x = NULL, y = NULL,
caption = "Popular Beans & It's Rating Distribution") +
theme_minimal() +
theme(axis.text.x.top = element_text( size = 11, vjust = 0.5),
axis.text.y.right = element_text( face = "bold"),
plot.caption = element_text(size = 24, face = "bold", hjust = 0.5, family = "serif"),
legend.position = "left") +
coord_flip()Insights:
#Filter by low rating
low.rating <- chocolate.Rating %>%
filter(rating == c("Unpleasant", "Disappointing"))
#Tidy up data
low.character <- str_split(low.rating$most_memorable_characteristics, ",") #Split value
low.character <- unlist(low.character) #Convert into vector (unlist)
low.character <- str_trim(low.character, "left") #Remove left/beginning white space only
#Calculate Characteristic
low.character.pop <- sort(round(prop.table(table(low.character))*100, 2), decreasing = T)
low.character.pop <- data.frame(low.character.pop)
#Filter good rating
high.rating <- chocolate.Rating %>%
filter(rating == c("Highly Recommended", "Oustanding"))
#Tidy up data
good.character <- str_split(high.rating$most_memorable_characteristics, ",")
good.character <- unlist(good.character)
good.character <- str_trim(good.character, "left")
#Calculate Characteristic
good.character.pop <- sort(round(prop.table(table(good.character))*100, 2), decreasing = T)
good.character.pop <- data.frame(good.character.pop)
#Join data frame to compare between low and high rate character
Character.rate <- left_join(good.character.pop, low.character.pop, by = c("good.character" = "low.character"))
#Rename column dataframe
names(Character.rate) <- c("character", "High Rating", "Low Rating")
Character.rate#Negative texture Characteristic plotting
negative.texture <- Character.rate %>%
filter(character %in% c("dry", "sandy", "some large grits", "oily", "fatty", "chalky", "powdery", "waxy"))
negative.texture <- pivot_longer(data = negative.texture,
cols = -(character),
names_to = "Rating",
values_to = "value")
# Plotting
ggplot(negative.texture, aes(y = value, x = character)) +
geom_col(aes(fill = Rating), position = "dodge") +
geom_text(aes(label = value, group = Rating), hjust = 0.5,
vjust= -.5, position = position_dodge(width = .9),
fontface = "bold", color = "#4e5c68") +
scale_fill_manual(values = c("#219141", "#b1342a")) +
theme_minimal()+
labs(x = NULL,
y = "porportion from rating group",
title = "Negative Texture in Each Rating Group") +
theme(plot.title = element_text(size = 18, face = "bold", family = "serif"),
axis.title.y = element_text(size = 13, hjust = 0),
axis.text = element_text(size = 12),
legend.position = "bottom")
Insights:
other.Character <- Character.rate %>%
filter(character != c("dry", "sandy", "some large grits", "oily", "fatty", "chalky", "powdery", "waxy")) %>%
na.omit() %>%
head(25)
other.Character <- pivot_longer(data = other.Character,
cols = -(character),
names_to = "Rating",
values_to = "values")
ggplot(other.Character, aes(x = Rating, y = character, fill = values))+
scale_fill_gradient(low = "#dcedfe", high = "#780ba1") +
geom_tile(color = "black", width = 1) +
labs(y = NULL, x = NULL,
title = "Characteristic Frequency Found based on Group Rating") +
theme(
axis.text = element_text(size= 12),
panel.grid = element_line(colour = "black"),
panel.background = element_rect(fill = "black"),
plot.title = element_text(size = 16, face = "bold"))#Tidy up data
lo.rtg.ingredient <- drop_na(low.rating )
lo.rtg.ingredient <- str_split(lo.rtg.ingredient$ingredients, ",") #Split value
lo.rtg.ingredient <- unlist(lo.rtg.ingredient) #Convert into vector (unlist)
#Calculate Ingredients Frequency from a rating group
lo.rtg.ingredient <- sort(round(prop.table(table(lo.rtg.ingredient))*100, 2), decreasing = T)
lo.rtg.df <- data.frame(lo.rtg.ingredient)
#Tidy up data
hi.rtg.ingredient <- drop_na(high.rating)
hi.rtg.ingredient <- str_split(hi.rtg.ingredient$ingredients, ",") #Split value
hi.rtg.ingredient <- unlist(hi.rtg.ingredient) #Convert into vector (unlist)
#Calculate Ingredients Frequency from a rating group
hi.rtg.ingredient <- sort(round(prop.table(table(hi.rtg.ingredient))*100, 2), decreasing = T)
hi.rtg.df <- data.frame(hi.rtg.ingredient)
lo.rtg.ingredient>> lo.rtg.ingredient
>> B S C L V S* Sa
>> 31.41 29.84 21.66 7.58 7.58 1.32 0.60
hi.rtg.ingredient>> hi.rtg.ingredient
>> B S C L V S* Sa
>> 33.49 32.94 22.49 6.33 3.85 0.55 0.34
#Join data frame to compare between low and high rate character
ingredients.df <- left_join(hi.rtg.df, lo.rtg.df, by = c("hi.rtg.ingredient" = "lo.rtg.ingredient"))
colnames(ingredients.df) <- c("Ingredients", "High Rating", "Low Rating")
ingredients.df <- ingredients.df %>%
mutate(Ingredients = case_when(Ingredients == "B" ~ "Beans",
Ingredients == "S" ~ "Sugar",
Ingredients == "S*" ~ "Sweetener",
Ingredients == "C" ~ "Cocoa Butter",
Ingredients == "V" ~ "Vanilla",
Ingredients == "L" ~ "Lecithin",
Ingredients == "Sa" ~ "Salt",
T ~ as.character(Ingredients))) %>%
pivot_longer(cols = -(Ingredients),
values_to = "value",
names_to = "Rating")# Plotting
ggplot(ingredients.df, aes(y = value, x = Ingredients)) +
geom_col(aes(fill = Rating), position = "dodge") +
geom_text(aes(label = value, group = Rating), hjust = 0.5,
vjust= -.5, position = position_dodge(width = .9),
fontface = "bold", color = "#4e5c68") +
scale_fill_manual(values = c("#ffe643", "#4e5c68")) +
theme_minimal()+
labs(x = NULL,
y = "porportion from rating group",
title = "Frequency of: Ingredients Found on Group Rating") +
theme(plot.title = element_text(size = 18, face = "bold", family = "serif"),
axis.title.y = element_text(size = 13, hjust = 0),
axis.text = element_text(size = 12),
legend.position = "bottom")igd.n.lo <- low.rating %>%
mutate(rating = "Low Rating") %>%
na.omit()
igd.n.hi <- high.rating %>%
mutate(rating = "High Rating") %>%
na.omit()
ign.n <- rbind(igd.n.lo,igd.n.hi)
ggplot(ign.n, aes(x = rating, y = n_ingredients, group = rating)) +
geom_boxplot(aes(fill = rating), outlier.size = 3, outlier.colour = "red", show.legend = F) +
scale_y_continuous(breaks = seq(0,6,1)) +
labs(y = "Number of Ingredients", x = NULL,
title = "Distribution Number of Ingredients in Group Rating") +
theme_minimal()+
theme(axis.text = element_text(size = 12, face = "bold"),
plot.title = element_text(face = "bold", size = 18))Insights
low rated chocolates have longer range variation of number ingredients used compared to high rating,and yet we can say that there are more chocolates in low rating used 5-6 ingredients, whereas only few population in high rated chocolate use 5-6 ingredients.
#Prepare data
hi.rate.manufacture <- high.rating %>%
group_by(company_manufacturer) %>%
summarise(counts = n()) %>%
filter(counts > 4) %>%
arrange(desc(counts))
ggplot(hi.rate.manufacture, aes(x = counts , y = reorder(company_manufacturer, counts))) +
geom_col( show.legend = F, aes(fill = counts)) +
geom_text(aes(label = counts), x = 0, hjust = 1, color = "red", fontface = "bold")+
scale_fill_gradient(low = "#a0ed08" , high = "#257caf") +
labs(x = NULL, y = NULL,
title = "Most Manufactures Produced High Rated Chocolate") +
theme_minimal() +
theme(plot.title = element_text(family = "serif", size = 16, face = "bold", hjust = 0.5))