PREFACE

Chocolate is a food product made from cocoa beans mixed with fat, sugar and other ingredients that could differs from each type of chocolate. This particular session is narrowly focused on plain dark chocolate type. Which by definition, dark chocolate contains 50-90% cocoa solid, cocoa butter, and sugar.

The rating which will be heavily analyzed is a database from “Flavors of Cacao” which can be retrieved from Tidy Tuesday. The ratings (as the Flavors of Cacao claimed) do not reflect health benefits, social missions, or organic status.

LIBRARY PREPARATION

#Library Preparation
library(tidytuesdayR)
library(tidyverse)
library(ggplot2)
library(scales)

READ DATA FROM TIDYTUESDAY

chocolate <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv')

Data Glossary

Variable	Description
ref	Reference ID, The highest REF numbers were the last entries made.
company_manufacturer	Manufacturer name
company_location	Manufacturer region
review_date	Review date (year)
country_of_bean_origin	Country of origin
specific_bean_origin_or_bar_name	Specific bean or bar name
cocoa_percent	Cocoa percent (% chocolate)
ingredients	Ingredients, (“#” = represents the number of ingredients in the chocolate; B = Beans, S = Sugar, S* = Sweetener other than white cane or beet sugar, C = Cocoa Butter, V = Vanilla, L = Lecithin, Sa = Salt)
most_memorable_characteristics	Most Memorable Characteristics column is a summary review of the most memorable characteristics of that bar. Terms generally relate to anything from texture, flavor, overall opinion, etc. separated by ‘,’
rating	rating between 1-5

Review Guide

Based on data source: “Flavors of Cacao”

Rating Scale	Interpretation
4.0 - 5.0	Outstanding
3.5 - 3.9	Highly Recommended
3.0 - 3.49	Recommended
2.0 - 2.9	Disappointing
1.0 - 1.9	Unpleasant

OBJECTIVES

Rating distribution
Cocoa percentage frequency distribution in each rating classification
Popular country of bean origin
Popular beans & it’s rating distribution
Analyze chocolate characteristics in rating group
Ingredients composition based on rating group
Most Manufactures Produced High Rated Chocolate

DATA WRANGLING

Check duplicate data

any(duplicated(chocolate))

>> [1] FALSE

#No duplicate data

Check NA/NULL Values

anyNA(chocolate)

>> [1] TRUE

Inspect which column contains NA value

#Check NA in each column
map(chocolate, ~sum(is.na(.)))

>> $ref
>> [1] 0
>> 
>> $company_manufacturer
>> [1] 0
>> 
>> $company_location
>> [1] 0
>> 
>> $review_date
>> [1] 0
>> 
>> $country_of_bean_origin
>> [1] 0
>> 
>> $specific_bean_origin_or_bar_name
>> [1] 0
>> 
>> $cocoa_percent
>> [1] 0
>> 
>> $ingredients
>> [1] 87
>> 
>> $most_memorable_characteristics
>> [1] 0
>> 
>> $rating
>> [1] 0

Calculate NA value proportion

#Proportion of NA to all data
prop.table(table(is.na(chocolate$ingredients)))*100

>> 
>>     FALSE      TRUE 
>> 96.561265  3.438735

Based on the proportion of NA value, I’m not going to drop any NA value now, considering that NA value doesn’t have significant number (and exists only in one column). I’ll include the observation which contained NA value as long as the observation isn’t related to ingredients column, and apply drop_na or na.omit for any analysis that related to ingredients column.

Inspect Data Frame

chocolate

Data Frame Adjustments:

Remove column: ref (wouldn’t be analyzed)
Remove % character in cocoa_percent column
Separate number of ingredients and ingredients components from ingredients column (separate into new column to place the number of ingredients)
Data type adjustments :

review_date : to factor
cocoa_percent: to numeric

chocolate <- chocolate %>% 
  mutate(n_ingredients = str_split(ingredients, "(- |-)",simplify = T)[,1], #Split between number and components, take number and input to column n_ingredients
         ingredients = str_split(ingredients, "(- |-)",simplify = T)[,2], #Split between number and components, take number and input to column ingredients
         cocoa_percent = str_remove(cocoa_percent, "%"), #remove %
         review_date = as.factor(review_date)) %>% 
  select(company_manufacturer, company_location, review_date, country_of_bean_origin, specific_bean_origin_or_bar_name, cocoa_percent, n_ingredients, everything(), -c(ref),) #rearrange column (optional)

#change data type
chocolate[, c("cocoa_percent", "n_ingredients")] <- lapply(chocolate[, c("cocoa_percent", "n_ingredients")], as.numeric)

str(chocolate)

>> tibble [2,530 × 10] (S3: tbl_df/tbl/data.frame)
>>  $ company_manufacturer            : chr [1:2530] "5150" "5150" "5150" "5150" ...
>>  $ company_location                : chr [1:2530] "U.S.A." "U.S.A." "U.S.A." "U.S.A." ...
>>  $ review_date                     : Factor w/ 16 levels "2006","2007",..: 14 14 14 16 16 16 16 7 7 8 ...
>>  $ country_of_bean_origin          : chr [1:2530] "Tanzania" "Dominican Republic" "Madagascar" "Fiji" ...
>>  $ specific_bean_origin_or_bar_name: chr [1:2530] "Kokoa Kamili, batch 1" "Zorzal, batch 1" "Bejofo Estate, batch 1" "Matasawalevu, batch 1" ...
>>  $ cocoa_percent                   : num [1:2530] 76 76 76 68 72 80 68 70 63 70 ...
>>  $ n_ingredients                   : num [1:2530] 3 3 3 3 3 3 3 4 4 4 ...
>>  $ ingredients                     : chr [1:2530] "B,S,C" "B,S,C" "B,S,C" "B,S,C" ...
>>  $ most_memorable_characteristics  : chr [1:2530] "rich cocoa, fatty, bready" "cocoa, vegetal, savory" "cocoa, blackberry, full body" "chewy, off, rubbery" ...
>>  $ rating                          : num [1:2530] 3.25 3.5 3.75 3 3 3.25 3.5 3.5 3.75 2.75 ...

chocolate

ANALYSIS

Question 1

Rating Distribution

#Inspect rating distribution

par(mfrow = c(2,1))

hist(chocolate$rating, breaks = 10, main = "Rating Distribution",xlab = NULL, ylab = NULL)
boxplot(chocolate$rating, horizontal = T, main = "Rating Distribution")

par(mfrow = c(1,1))

Based on the plot above, we can conclude:

Rating distribution is left skewed, which means that most of the population are in the higher rating. The tail indicates there only few chocolates rated below 2.0 and considered as outliers.
50% of the population is rated between range 3.0 - 3.5 with the average of rating around 3.25

Rating Distribution through the Years

ggplot(chocolate, aes(x = review_date, y = rating)) +
  geom_boxplot( fill = "#8cafff", col = "#09202d", outlier.colour = "red") +
  coord_flip() +
  labs(x = NULL, y = NULL,
       title = "Review Distribution in Each Year") +
  theme_minimal()

From the plot above we can conclude that:

Start from the year 2010 the average of rating is increasing compared to the year 2006 - 2009
Start from the year 2017 to 2021 the variation of rating is decreasing, in fact there are no chocolate rated below 2

Question 2

Cocoa percentage frequency distribution in each rating classification

#Change rating into categorical
chocolate.Rating <- chocolate %>% 
  mutate(rating = case_when(between(rating, 4.0, 5.0 )~ "Oustanding",
                            between(rating, 3.5, 3.9)~ "Highly Recommended",
                            between(rating, 3.0, 3.49)~ "Recommended",
                            between(rating, 2.0, 2.9)~ "Disappointing",
                            between(rating, 1.0, 1.9)~ "Unpleasant",
                            T ~ as.character(rating)),
         rating = as.factor(rating))


#Rearrange level 
chocolate.Rating$rating <- factor(chocolate.Rating$rating,
                                          levels = c("Unpleasant", "Disappointing", "Recommended", "Highly Recommended", "Oustanding"))

chocolate.Rating

ggplot(chocolate.Rating, aes(x = cocoa_percent, group = rating, fill =rating )) +
  geom_density( alpha = .6, show.legend = F) +
  facet_wrap(~rating, scales = "free") +
  scale_x_continuous(breaks = seq(40, 100, 10),
                     limits = c(40, 100)) +
  scale_y_continuous(breaks = seq(0, 0.4, 0.1),
                     limit = c(0, 0.4)) +
  labs(x = NULL, y = NULL,
       title = "Cocoa Percentage Frequency by Rating Category") +
  theme_minimal() +
  theme(strip.placement = "outside",
        strip.background = element_rect(fill = "black"), 
        panel.background = element_rect(colour = "grey"),
        strip.text = element_text(color = "white", size = 9),
        plot.title = element_text(family = "serif", size = 20, face = "bold", hjust = 0.5))

insight:

Unpleasant rating has bigger range variance of cocoa percentage
other than unpleasant rating, other rating are more likely to use cocoa for around 70% (under 80%)

Question 3

Popular country of bean origin

#Subset relevant data
popular.country.origin <- chocolate.Rating %>% 
  group_by(country_of_bean_origin) %>% 
  summarise(Total = n()) %>% 
  arrange(desc(Total))

ggplot(popular.country.origin[1:15,], aes(x = Total, y = reorder(country_of_bean_origin, Total) )) +
  geom_segment(aes(y = reorder(country_of_bean_origin, Total), yend = reorder(country_of_bean_origin, Total), 
                   x = 0 , xend = Total, col = Total), show.legend = F) +
  geom_point(size= 4, aes(col = Total), alpha = .5, show.legend = F) +
  scale_color_gradient(high = "#228e40" , low = "#3cffcd") +
  geom_text(aes(label = Total), fontface = "bold", size = 3, hjust = 1, color = "blue", x=0) +
  theme_minimal() +
  scale_fill_gradient(high = "#ffcd42" , low = "#6b7781") +
  scale_x_continuous(limit = c(0, 280),
                     breaks = seq(0, 300, 50)) +
  labs(x = NULL, y = NULL,
       title = "Popular Country of Bean Origin") +
  theme(axis.text = element_text(size = 11),
        plot.title = element_text(family = "serif", size = 18, face = "bold", hjust = 0.5))

Question 4

Popular beans & it’s rating distribution

popular.beans <- chocolate %>% 
  group_by(specific_bean_origin_or_bar_name) %>% 
  summarise(Total = n()) %>% 
  arrange(desc(Total)) %>% 
  head(20)

popular.beans

#List the popular bean only
popular.beans <- popular.beans$specific_bean_origin_or_bar_name

#Filter based on popular beans only
plot.popular.beans <- chocolate %>% 
  filter(specific_bean_origin_or_bar_name %in% popular.beans)


# Rearrange level of Country to most popular beans to least popular, so facet plot will be ordered as level
plot.popular.beans$specific_bean_origin_or_bar_name <- factor(plot.popular.beans$specific_bean_origin_or_bar_name,
                        levels = c("Madagascar", "Ecuador","Peru","Dominican Republic","Chuao","Venezuela",
                                   "Kokoa Kamili","Ghana","Papua New Guinea","Sambirano", "Belize","Ocumare",
                                   "Oko Caribe","Tanzania","Ucayali", "Alto Beni", "Maya Mountain","Porcelana",
                                   "Vietnam","Brazil"))

ggplot(plot.popular.beans, aes(x = specific_bean_origin_or_bar_name, y= rating, group = specific_bean_origin_or_bar_name )) +
  geom_boxplot(outlier.shape = NA) +
  geom_hline(aes(yintercept = mean(rating)), data = chocolate, col = "orange", linetype = 2, size = 0.8) +
  geom_count(aes(col = rating), alpha = 0.6, show.legend = T) +
  scale_x_discrete(position = "top", labels = scales::wrap_format(2)) +
  scale_y_continuous(breaks = seq(1.5, 4, .25), position = "right") +
  scale_color_gradient(low = "#c1857e", high = "#3a3a98") +
  scale_size_area(max_size = 6) +
  labs(x = NULL, y = NULL, 
       caption = "Popular Beans & It's Rating Distribution") +
  theme_minimal() +
  theme(axis.text.x.top = element_text( size = 11, vjust = 0.5),
        axis.text.y.right = element_text( face = "bold"),
        plot.caption = element_text(size = 24, face = "bold", hjust = 0.5, family = "serif"),
        legend.position = "left") +
  coord_flip()

Insights:

3 most popular beans origin are: Madagascar, Ecuador, Peru (order from above, we can see from the population count described by the size of the dot)
Compared to beans origin from Peru, Ecuador and Madagascar beans has average rating above average of all rating (average described by the orange line)

Question 5

Analyze chocolate characteristics in rating group

#Filter by low rating
low.rating <- chocolate.Rating %>% 
  filter(rating == c("Unpleasant", "Disappointing"))
 
#Tidy up data
low.character <- str_split(low.rating$most_memorable_characteristics, ",") #Split value
low.character <- unlist(low.character) #Convert into vector (unlist)
low.character <- str_trim(low.character, "left") #Remove left/beginning white space only


#Calculate Characteristic
low.character.pop <- sort(round(prop.table(table(low.character))*100, 2), decreasing = T)
low.character.pop <- data.frame(low.character.pop)

#Filter good rating
high.rating <- chocolate.Rating %>% 
  filter(rating == c("Highly Recommended", "Oustanding"))


#Tidy up data
good.character <- str_split(high.rating$most_memorable_characteristics, ",")
good.character <- unlist(good.character)
good.character <- str_trim(good.character, "left")


#Calculate Characteristic
good.character.pop <- sort(round(prop.table(table(good.character))*100, 2), decreasing = T)
good.character.pop <- data.frame(good.character.pop)

#Join data frame to compare between low and high rate character
Character.rate <- left_join(good.character.pop, low.character.pop, by = c("good.character" = "low.character"))

#Rename column dataframe
names(Character.rate) <-  c("character", "High Rating", "Low Rating")
Character.rate

#Negative texture Characteristic plotting

negative.texture <- Character.rate %>% 
  filter(character %in% c("dry", "sandy", "some large grits", "oily", "fatty", "chalky", "powdery", "waxy"))

negative.texture <- pivot_longer(data = negative.texture,
             cols = -(character),
             names_to = "Rating",
             values_to = "value")

# Plotting
ggplot(negative.texture, aes(y = value, x = character)) +
  geom_col(aes(fill = Rating), position = "dodge") +
  geom_text(aes(label = value, group = Rating), hjust = 0.5, 
            vjust= -.5,  position = position_dodge(width = .9),
            fontface = "bold", color = "#4e5c68") +
  scale_fill_manual(values = c("#219141", "#b1342a")) +
  theme_minimal()+
  labs(x = NULL,
       y = "porportion from rating group",
       title = "Negative Texture in Each Rating Group") +
  theme(plot.title = element_text(size = 18, face = "bold", family = "serif"),
        axis.title.y = element_text(size = 13, hjust = 0),
        axis.text = element_text(size = 12),
        legend.position = "bottom")

Insights:

It’s pretty clear that low rated chocolates have higher rate of “negative values in texture”
Though some of high rating chocolates possibly has the negative value in texture, the rate is considered low than the low rated chocolate

other.Character <- Character.rate %>% 
  filter(character != c("dry", "sandy", "some large grits", "oily", "fatty", "chalky", "powdery", "waxy")) %>% 
  na.omit() %>% 
  head(25)

other.Character <- pivot_longer(data = other.Character,
                  cols = -(character),
                  names_to = "Rating",
                  values_to = "values")


ggplot(other.Character, aes(x = Rating, y = character, fill = values))+
  scale_fill_gradient(low = "#dcedfe", high = "#780ba1") +
  geom_tile(color = "black", width = 1) +
  labs(y = NULL, x = NULL,
       title = "Characteristic Frequency Found based on Group Rating") +
  theme(
    axis.text = element_text(size= 12),
    panel.grid = element_line(colour = "black"),
    panel.background = element_rect(fill = "black"),
    plot.title = element_text(size = 16, face = "bold"))

Strong characteristics which a high rated chocolate has are: cocoa, creamy, and nutty
Strong characteristics which a low rated chocolate has are: sweet, fatty, honey, and sour

Question 6

Ingredients composition of based on rating group

#Tidy up data
lo.rtg.ingredient <- drop_na(low.rating )
lo.rtg.ingredient <- str_split(lo.rtg.ingredient$ingredients, ",") #Split value
lo.rtg.ingredient <- unlist(lo.rtg.ingredient) #Convert into vector (unlist)

#Calculate Ingredients Frequency from a rating group
lo.rtg.ingredient <- sort(round(prop.table(table(lo.rtg.ingredient))*100, 2), decreasing = T)
lo.rtg.df <- data.frame(lo.rtg.ingredient)


#Tidy up data
hi.rtg.ingredient <- drop_na(high.rating)
hi.rtg.ingredient <- str_split(hi.rtg.ingredient$ingredients, ",") #Split value
hi.rtg.ingredient <- unlist(hi.rtg.ingredient) #Convert into vector (unlist)

#Calculate Ingredients Frequency from a rating group
hi.rtg.ingredient <- sort(round(prop.table(table(hi.rtg.ingredient))*100, 2), decreasing = T)
hi.rtg.df <- data.frame(hi.rtg.ingredient)

lo.rtg.ingredient

>> lo.rtg.ingredient
>>     B     S     C     L     V    S*    Sa 
>> 31.41 29.84 21.66  7.58  7.58  1.32  0.60

hi.rtg.ingredient

>> hi.rtg.ingredient
>>     B     S     C     L     V    S*    Sa 
>> 33.49 32.94 22.49  6.33  3.85  0.55  0.34

#Join data frame to compare between low and high rate character
ingredients.df <- left_join(hi.rtg.df, lo.rtg.df, by = c("hi.rtg.ingredient" = "lo.rtg.ingredient"))
colnames(ingredients.df) <- c("Ingredients", "High Rating", "Low Rating")

ingredients.df <- ingredients.df %>% 
  mutate(Ingredients = case_when(Ingredients == "B" ~ "Beans",
                                 Ingredients == "S" ~ "Sugar",
                                 Ingredients == "S*" ~ "Sweetener",
                                 Ingredients == "C" ~ "Cocoa Butter",
                                 Ingredients == "V" ~ "Vanilla",
                                 Ingredients == "L" ~ "Lecithin",
                                 Ingredients == "Sa" ~ "Salt",
                                 T ~ as.character(Ingredients))) %>% 
  pivot_longer(cols = -(Ingredients),
               values_to = "value",
               names_to = "Rating")

# Plotting
ggplot(ingredients.df, aes(y = value, x = Ingredients)) +
  geom_col(aes(fill = Rating), position = "dodge") +
  geom_text(aes(label = value, group = Rating), hjust = 0.5, 
            vjust= -.5,  position = position_dodge(width = .9),
            fontface = "bold", color = "#4e5c68") +
  scale_fill_manual(values = c("#ffe643", "#4e5c68")) +
  theme_minimal()+
  labs(x = NULL,
       y = "porportion from rating group",
       title = "Frequency of: Ingredients Found on Group Rating") +
  theme(plot.title = element_text(size = 18, face = "bold", family = "serif"),
        axis.title.y = element_text(size = 13, hjust = 0),
        axis.text = element_text(size = 12),
        legend.position = "bottom")

igd.n.lo <- low.rating %>% 
  mutate(rating =  "Low Rating") %>% 
  na.omit()
igd.n.hi <- high.rating %>% 
  mutate(rating = "High Rating") %>% 
  na.omit()

ign.n <- rbind(igd.n.lo,igd.n.hi)


ggplot(ign.n, aes(x = rating, y = n_ingredients, group = rating)) +
  geom_boxplot(aes(fill = rating), outlier.size = 3, outlier.colour = "red", show.legend = F) +
  scale_y_continuous(breaks = seq(0,6,1)) +
  labs(y = "Number of Ingredients", x = NULL,
       title = "Distribution Number of Ingredients in Group Rating") +
  theme_minimal()+
  theme(axis.text = element_text(size = 12, face = "bold"),
        plot.title = element_text(face = "bold", size = 18))

Insights

low rated chocolates have longer range variation of number ingredients used compared to high rating,and yet we can say that there are more chocolates in low rating used 5-6 ingredients, whereas only few population in high rated chocolate use 5-6 ingredients.

Question 7

Most Manufactures Produced High Rated Chocolate

#Prepare data
hi.rate.manufacture <- high.rating %>% 
  group_by(company_manufacturer) %>% 
  summarise(counts = n()) %>% 
  filter(counts > 4) %>% 
  arrange(desc(counts))

 ggplot(hi.rate.manufacture, aes(x = counts , y = reorder(company_manufacturer, counts))) +
   geom_col( show.legend = F, aes(fill = counts)) +
   geom_text(aes(label = counts), x = 0, hjust = 1, color = "red", fontface = "bold")+
  scale_fill_gradient(low = "#a0ed08" , high = "#257caf") +
  labs(x = NULL, y = NULL,
       title = "Most Manufactures Produced High Rated Chocolate") +
  theme_minimal() +
  theme(plot.title = element_text(family = "serif", size = 16, face = "bold", hjust = 0.5))

Dark Chocolate Rating

Talia Thaib

2022-11-01