PREFACE

Chocolate is a food product made from cocoa beans mixed with fat, sugar and other ingredients that could differs from each type of chocolate. This particular session is narrowly focused on plain dark chocolate type. Which by definition, dark chocolate contains 50-90% cocoa solid, cocoa butter, and sugar.

The rating which will be heavily analyzed is a database from “Flavors of Cacao” which can be retrieved from Tidy Tuesday. The ratings (as the Flavors of Cacao claimed) do not reflect health benefits, social missions, or organic status.


LIBRARY PREPARATION

#Library Preparation
library(tidytuesdayR)
library(tidyverse)
library(ggplot2)
library(scales)

READ DATA FROM TIDYTUESDAY

chocolate <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv')

Data Glossary

Variable Description
ref Reference ID, The highest REF numbers were the last entries made.
company_manufacturer Manufacturer name
company_location Manufacturer region
review_date Review date (year)
country_of_bean_origin Country of origin
specific_bean_origin_or_bar_name Specific bean or bar name
cocoa_percent Cocoa percent (% chocolate)
ingredients Ingredients, (“#” = represents the number of ingredients in the chocolate; B = Beans, S = Sugar, S* = Sweetener other than white cane or beet sugar, C = Cocoa Butter, V = Vanilla, L = Lecithin, Sa = Salt)
most_memorable_characteristics Most Memorable Characteristics column is a summary review of the most memorable characteristics of that bar. Terms generally relate to anything from texture, flavor, overall opinion, etc. separated by ‘,’
rating rating between 1-5

Review Guide

Based on data source: “Flavors of Cacao”

Rating Scale Interpretation
4.0 - 5.0 Outstanding
3.5 - 3.9 Highly Recommended
3.0 - 3.49 Recommended
2.0 - 2.9 Disappointing
1.0 - 1.9 Unpleasant

OBJECTIVES

  1. Rating distribution
  2. Cocoa percentage frequency distribution in each rating classification
  3. Popular country of bean origin
  4. Popular beans & it’s rating distribution
  5. Analyze chocolate characteristics in rating group
  6. Ingredients composition based on rating group
  7. Most Manufactures Produced High Rated Chocolate

DATA WRANGLING

  1. Check duplicate data
any(duplicated(chocolate))
>> [1] FALSE
#No duplicate data
  1. Check NA/NULL Values
anyNA(chocolate)
>> [1] TRUE
  1. Inspect which column contains NA value
#Check NA in each column
map(chocolate, ~sum(is.na(.)))
>> $ref
>> [1] 0
>> 
>> $company_manufacturer
>> [1] 0
>> 
>> $company_location
>> [1] 0
>> 
>> $review_date
>> [1] 0
>> 
>> $country_of_bean_origin
>> [1] 0
>> 
>> $specific_bean_origin_or_bar_name
>> [1] 0
>> 
>> $cocoa_percent
>> [1] 0
>> 
>> $ingredients
>> [1] 87
>> 
>> $most_memorable_characteristics
>> [1] 0
>> 
>> $rating
>> [1] 0
  1. Calculate NA value proportion
#Proportion of NA to all data
prop.table(table(is.na(chocolate$ingredients)))*100
>> 
>>     FALSE      TRUE 
>> 96.561265  3.438735

Based on the proportion of NA value, I’m not going to drop any NA value now, considering that NA value doesn’t have significant number (and exists only in one column). I’ll include the observation which contained NA value as long as the observation isn’t related to ingredients column, and apply drop_na or na.omit for any analysis that related to ingredients column.

Inspect Data Frame

chocolate

Data Frame Adjustments:

  1. Remove column: ref (wouldn’t be analyzed)
  2. Remove % character in cocoa_percent column
  3. Separate number of ingredients and ingredients components from ingredients column (separate into new column to place the number of ingredients)
  4. Data type adjustments :
  • review_date : to factor
  • cocoa_percent: to numeric
chocolate <- chocolate %>% 
  mutate(n_ingredients = str_split(ingredients, "(- |-)",simplify = T)[,1], #Split between number and components, take number and input to column n_ingredients
         ingredients = str_split(ingredients, "(- |-)",simplify = T)[,2], #Split between number and components, take number and input to column ingredients
         cocoa_percent = str_remove(cocoa_percent, "%"), #remove %
         review_date = as.factor(review_date)) %>% 
  select(company_manufacturer, company_location, review_date, country_of_bean_origin, specific_bean_origin_or_bar_name, cocoa_percent, n_ingredients, everything(), -c(ref),) #rearrange column (optional)

#change data type
chocolate[, c("cocoa_percent", "n_ingredients")] <- lapply(chocolate[, c("cocoa_percent", "n_ingredients")], as.numeric)

str(chocolate)
>> tibble [2,530 × 10] (S3: tbl_df/tbl/data.frame)
>>  $ company_manufacturer            : chr [1:2530] "5150" "5150" "5150" "5150" ...
>>  $ company_location                : chr [1:2530] "U.S.A." "U.S.A." "U.S.A." "U.S.A." ...
>>  $ review_date                     : Factor w/ 16 levels "2006","2007",..: 14 14 14 16 16 16 16 7 7 8 ...
>>  $ country_of_bean_origin          : chr [1:2530] "Tanzania" "Dominican Republic" "Madagascar" "Fiji" ...
>>  $ specific_bean_origin_or_bar_name: chr [1:2530] "Kokoa Kamili, batch 1" "Zorzal, batch 1" "Bejofo Estate, batch 1" "Matasawalevu, batch 1" ...
>>  $ cocoa_percent                   : num [1:2530] 76 76 76 68 72 80 68 70 63 70 ...
>>  $ n_ingredients                   : num [1:2530] 3 3 3 3 3 3 3 4 4 4 ...
>>  $ ingredients                     : chr [1:2530] "B,S,C" "B,S,C" "B,S,C" "B,S,C" ...
>>  $ most_memorable_characteristics  : chr [1:2530] "rich cocoa, fatty, bready" "cocoa, vegetal, savory" "cocoa, blackberry, full body" "chewy, off, rubbery" ...
>>  $ rating                          : num [1:2530] 3.25 3.5 3.75 3 3 3.25 3.5 3.5 3.75 2.75 ...
chocolate

ANALYSIS

Question 1

  1. Rating Distribution
#Inspect rating distribution

par(mfrow = c(2,1))

hist(chocolate$rating, breaks = 10, main = "Rating Distribution",xlab = NULL, ylab = NULL)
boxplot(chocolate$rating, horizontal = T, main = "Rating Distribution")

par(mfrow = c(1,1))

Based on the plot above, we can conclude:

  1. Rating distribution is left skewed, which means that most of the population are in the higher rating. The tail indicates there only few chocolates rated below 2.0 and considered as outliers.
  2. 50% of the population is rated between range 3.0 - 3.5 with the average of rating around 3.25

Rating Distribution through the Years

ggplot(chocolate, aes(x = review_date, y = rating)) +
  geom_boxplot( fill = "#8cafff", col = "#09202d", outlier.colour = "red") +
  coord_flip() +
  labs(x = NULL, y = NULL,
       title = "Review Distribution in Each Year") +
  theme_minimal()

From the plot above we can conclude that:

  1. Start from the year 2010 the average of rating is increasing compared to the year 2006 - 2009
  2. Start from the year 2017 to 2021 the variation of rating is decreasing, in fact there are no chocolate rated below 2

Question 2

  1. Cocoa percentage frequency distribution in each rating classification
#Change rating into categorical
chocolate.Rating <- chocolate %>% 
  mutate(rating = case_when(between(rating, 4.0, 5.0 )~ "Oustanding",
                            between(rating, 3.5, 3.9)~ "Highly Recommended",
                            between(rating, 3.0, 3.49)~ "Recommended",
                            between(rating, 2.0, 2.9)~ "Disappointing",
                            between(rating, 1.0, 1.9)~ "Unpleasant",
                            T ~ as.character(rating)),
         rating = as.factor(rating))


#Rearrange level 
chocolate.Rating$rating <- factor(chocolate.Rating$rating,
                                          levels = c("Unpleasant", "Disappointing", "Recommended", "Highly Recommended", "Oustanding"))

chocolate.Rating
ggplot(chocolate.Rating, aes(x = cocoa_percent, group = rating, fill =rating )) +
  geom_density( alpha = .6, show.legend = F) +
  facet_wrap(~rating, scales = "free") +
  scale_x_continuous(breaks = seq(40, 100, 10),
                     limits = c(40, 100)) +
  scale_y_continuous(breaks = seq(0, 0.4, 0.1),
                     limit = c(0, 0.4)) +
  labs(x = NULL, y = NULL,
       title = "Cocoa Percentage Frequency by Rating Category") +
  theme_minimal() +
  theme(strip.placement = "outside",
        strip.background = element_rect(fill = "black"), 
        panel.background = element_rect(colour = "grey"),
        strip.text = element_text(color = "white", size = 9),
        plot.title = element_text(family = "serif", size = 20, face = "bold", hjust = 0.5))

insight:

  1. Unpleasant rating has bigger range variance of cocoa percentage
  2. other than unpleasant rating, other rating are more likely to use cocoa for around 70% (under 80%)

Question 3

  1. Popular country of bean origin
#Subset relevant data
popular.country.origin <- chocolate.Rating %>% 
  group_by(country_of_bean_origin) %>% 
  summarise(Total = n()) %>% 
  arrange(desc(Total))

ggplot(popular.country.origin[1:15,], aes(x = Total, y = reorder(country_of_bean_origin, Total) )) +
  geom_segment(aes(y = reorder(country_of_bean_origin, Total), yend = reorder(country_of_bean_origin, Total), 
                   x = 0 , xend = Total, col = Total), show.legend = F) +
  geom_point(size= 4, aes(col = Total), alpha = .5, show.legend = F) +
  scale_color_gradient(high = "#228e40" , low = "#3cffcd") +
  geom_text(aes(label = Total), fontface = "bold", size = 3, hjust = 1, color = "blue", x=0) +
  theme_minimal() +
  scale_fill_gradient(high = "#ffcd42" , low = "#6b7781") +
  scale_x_continuous(limit = c(0, 280),
                     breaks = seq(0, 300, 50)) +
  labs(x = NULL, y = NULL,
       title = "Popular Country of Bean Origin") +
  theme(axis.text = element_text(size = 11),
        plot.title = element_text(family = "serif", size = 18, face = "bold", hjust = 0.5))


Question 4

  1. Popular beans & it’s rating distribution
popular.beans <- chocolate %>% 
  group_by(specific_bean_origin_or_bar_name) %>% 
  summarise(Total = n()) %>% 
  arrange(desc(Total)) %>% 
  head(20)

popular.beans
#List the popular bean only
popular.beans <- popular.beans$specific_bean_origin_or_bar_name

#Filter based on popular beans only
plot.popular.beans <- chocolate %>% 
  filter(specific_bean_origin_or_bar_name %in% popular.beans)


# Rearrange level of Country to most popular beans to least popular, so facet plot will be ordered as level
plot.popular.beans$specific_bean_origin_or_bar_name <- factor(plot.popular.beans$specific_bean_origin_or_bar_name,
                        levels = c("Madagascar", "Ecuador","Peru","Dominican Republic","Chuao","Venezuela",
                                   "Kokoa Kamili","Ghana","Papua New Guinea","Sambirano", "Belize","Ocumare",
                                   "Oko Caribe","Tanzania","Ucayali", "Alto Beni", "Maya Mountain","Porcelana",
                                   "Vietnam","Brazil")) 
ggplot(plot.popular.beans, aes(x = specific_bean_origin_or_bar_name, y= rating, group = specific_bean_origin_or_bar_name )) +
  geom_boxplot(outlier.shape = NA) +
  geom_hline(aes(yintercept = mean(rating)), data = chocolate, col = "orange", linetype = 2, size = 0.8) +
  geom_count(aes(col = rating), alpha = 0.6, show.legend = T) +
  scale_x_discrete(position = "top", labels = scales::wrap_format(2)) +
  scale_y_continuous(breaks = seq(1.5, 4, .25), position = "right") +
  scale_color_gradient(low = "#c1857e", high = "#3a3a98") +
  scale_size_area(max_size = 6) +
  labs(x = NULL, y = NULL, 
       caption = "Popular Beans & It's Rating Distribution") +
  theme_minimal() +
  theme(axis.text.x.top = element_text( size = 11, vjust = 0.5),
        axis.text.y.right = element_text( face = "bold"),
        plot.caption = element_text(size = 24, face = "bold", hjust = 0.5, family = "serif"),
        legend.position = "left") +
  coord_flip()

Insights:

  1. 3 most popular beans origin are: Madagascar, Ecuador, Peru (order from above, we can see from the population count described by the size of the dot)
  2. Compared to beans origin from Peru, Ecuador and Madagascar beans has average rating above average of all rating (average described by the orange line)

Question 5

  1. Analyze chocolate characteristics in rating group
#Filter by low rating
low.rating <- chocolate.Rating %>% 
  filter(rating == c("Unpleasant", "Disappointing"))
 
#Tidy up data
low.character <- str_split(low.rating$most_memorable_characteristics, ",") #Split value
low.character <- unlist(low.character) #Convert into vector (unlist)
low.character <- str_trim(low.character, "left") #Remove left/beginning white space only


#Calculate Characteristic
low.character.pop <- sort(round(prop.table(table(low.character))*100, 2), decreasing = T)
low.character.pop <- data.frame(low.character.pop)

#Filter good rating
high.rating <- chocolate.Rating %>% 
  filter(rating == c("Highly Recommended", "Oustanding"))


#Tidy up data
good.character <- str_split(high.rating$most_memorable_characteristics, ",")
good.character <- unlist(good.character)
good.character <- str_trim(good.character, "left")


#Calculate Characteristic
good.character.pop <- sort(round(prop.table(table(good.character))*100, 2), decreasing = T)
good.character.pop <- data.frame(good.character.pop)

#Join data frame to compare between low and high rate character
Character.rate <- left_join(good.character.pop, low.character.pop, by = c("good.character" = "low.character"))

#Rename column dataframe
names(Character.rate) <-  c("character", "High Rating", "Low Rating")
Character.rate
#Negative texture Characteristic plotting

negative.texture <- Character.rate %>% 
  filter(character %in% c("dry", "sandy", "some large grits", "oily", "fatty", "chalky", "powdery", "waxy"))

negative.texture <- pivot_longer(data = negative.texture,
             cols = -(character),
             names_to = "Rating",
             values_to = "value")

# Plotting
ggplot(negative.texture, aes(y = value, x = character)) +
  geom_col(aes(fill = Rating), position = "dodge") +
  geom_text(aes(label = value, group = Rating), hjust = 0.5, 
            vjust= -.5,  position = position_dodge(width = .9),
            fontface = "bold", color = "#4e5c68") +
  scale_fill_manual(values = c("#219141", "#b1342a")) +
  theme_minimal()+
  labs(x = NULL,
       y = "porportion from rating group",
       title = "Negative Texture in Each Rating Group") +
  theme(plot.title = element_text(size = 18, face = "bold", family = "serif"),
        axis.title.y = element_text(size = 13, hjust = 0),
        axis.text = element_text(size = 12),
        legend.position = "bottom")

Insights:

  1. It’s pretty clear that low rated chocolates have higher rate of “negative values in texture”
  2. Though some of high rating chocolates possibly has the negative value in texture, the rate is considered low than the low rated chocolate
other.Character <- Character.rate %>% 
  filter(character != c("dry", "sandy", "some large grits", "oily", "fatty", "chalky", "powdery", "waxy")) %>% 
  na.omit() %>% 
  head(25)

other.Character <- pivot_longer(data = other.Character,
                  cols = -(character),
                  names_to = "Rating",
                  values_to = "values")


ggplot(other.Character, aes(x = Rating, y = character, fill = values))+
  scale_fill_gradient(low = "#dcedfe", high = "#780ba1") +
  geom_tile(color = "black", width = 1) +
  labs(y = NULL, x = NULL,
       title = "Characteristic Frequency Found based on Group Rating") +
  theme(
    axis.text = element_text(size= 12),
    panel.grid = element_line(colour = "black"),
    panel.background = element_rect(fill = "black"),
    plot.title = element_text(size = 16, face = "bold"))

  1. Strong characteristics which a high rated chocolate has are: cocoa, creamy, and nutty
  2. Strong characteristics which a low rated chocolate has are: sweet, fatty, honey, and sour

Question 6

  1. Ingredients composition of based on rating group
#Tidy up data
lo.rtg.ingredient <- drop_na(low.rating )
lo.rtg.ingredient <- str_split(lo.rtg.ingredient$ingredients, ",") #Split value
lo.rtg.ingredient <- unlist(lo.rtg.ingredient) #Convert into vector (unlist)

#Calculate Ingredients Frequency from a rating group
lo.rtg.ingredient <- sort(round(prop.table(table(lo.rtg.ingredient))*100, 2), decreasing = T)
lo.rtg.df <- data.frame(lo.rtg.ingredient)


#Tidy up data
hi.rtg.ingredient <- drop_na(high.rating)
hi.rtg.ingredient <- str_split(hi.rtg.ingredient$ingredients, ",") #Split value
hi.rtg.ingredient <- unlist(hi.rtg.ingredient) #Convert into vector (unlist)

#Calculate Ingredients Frequency from a rating group
hi.rtg.ingredient <- sort(round(prop.table(table(hi.rtg.ingredient))*100, 2), decreasing = T)
hi.rtg.df <- data.frame(hi.rtg.ingredient)

lo.rtg.ingredient
>> lo.rtg.ingredient
>>     B     S     C     L     V    S*    Sa 
>> 31.41 29.84 21.66  7.58  7.58  1.32  0.60
hi.rtg.ingredient
>> hi.rtg.ingredient
>>     B     S     C     L     V    S*    Sa 
>> 33.49 32.94 22.49  6.33  3.85  0.55  0.34
#Join data frame to compare between low and high rate character
ingredients.df <- left_join(hi.rtg.df, lo.rtg.df, by = c("hi.rtg.ingredient" = "lo.rtg.ingredient"))
colnames(ingredients.df) <- c("Ingredients", "High Rating", "Low Rating")

ingredients.df <- ingredients.df %>% 
  mutate(Ingredients = case_when(Ingredients == "B" ~ "Beans",
                                 Ingredients == "S" ~ "Sugar",
                                 Ingredients == "S*" ~ "Sweetener",
                                 Ingredients == "C" ~ "Cocoa Butter",
                                 Ingredients == "V" ~ "Vanilla",
                                 Ingredients == "L" ~ "Lecithin",
                                 Ingredients == "Sa" ~ "Salt",
                                 T ~ as.character(Ingredients))) %>% 
  pivot_longer(cols = -(Ingredients),
               values_to = "value",
               names_to = "Rating")
# Plotting
ggplot(ingredients.df, aes(y = value, x = Ingredients)) +
  geom_col(aes(fill = Rating), position = "dodge") +
  geom_text(aes(label = value, group = Rating), hjust = 0.5, 
            vjust= -.5,  position = position_dodge(width = .9),
            fontface = "bold", color = "#4e5c68") +
  scale_fill_manual(values = c("#ffe643", "#4e5c68")) +
  theme_minimal()+
  labs(x = NULL,
       y = "porportion from rating group",
       title = "Frequency of: Ingredients Found on Group Rating") +
  theme(plot.title = element_text(size = 18, face = "bold", family = "serif"),
        axis.title.y = element_text(size = 13, hjust = 0),
        axis.text = element_text(size = 12),
        legend.position = "bottom")

igd.n.lo <- low.rating %>% 
  mutate(rating =  "Low Rating") %>% 
  na.omit()
igd.n.hi <- high.rating %>% 
  mutate(rating = "High Rating") %>% 
  na.omit()

ign.n <- rbind(igd.n.lo,igd.n.hi)


ggplot(ign.n, aes(x = rating, y = n_ingredients, group = rating)) +
  geom_boxplot(aes(fill = rating), outlier.size = 3, outlier.colour = "red", show.legend = F) +
  scale_y_continuous(breaks = seq(0,6,1)) +
  labs(y = "Number of Ingredients", x = NULL,
       title = "Distribution Number of Ingredients in Group Rating") +
  theme_minimal()+
  theme(axis.text = element_text(size = 12, face = "bold"),
        plot.title = element_text(face = "bold", size = 18))

Insights

low rated chocolates have longer range variation of number ingredients used compared to high rating,and yet we can say that there are more chocolates in low rating used 5-6 ingredients, whereas only few population in high rated chocolate use 5-6 ingredients.


Question 7

  1. Most Manufactures Produced High Rated Chocolate
#Prepare data
hi.rate.manufacture <- high.rating %>% 
  group_by(company_manufacturer) %>% 
  summarise(counts = n()) %>% 
  filter(counts > 4) %>% 
  arrange(desc(counts))

 ggplot(hi.rate.manufacture, aes(x = counts , y = reorder(company_manufacturer, counts))) +
   geom_col( show.legend = F, aes(fill = counts)) +
   geom_text(aes(label = counts), x = 0, hjust = 1, color = "red", fontface = "bold")+
  scale_fill_gradient(low = "#a0ed08" , high = "#257caf") +
  labs(x = NULL, y = NULL,
       title = "Most Manufactures Produced High Rated Chocolate") +
  theme_minimal() +
  theme(plot.title = element_text(family = "serif", size = 16, face = "bold", hjust = 0.5))