About Chocolate

Chocolate is one of the most popular candies in the world. Each year, residents of the United States collectively eat more than 2.8 billion pounds (1.3 billion kilograms) of it. However, not all chocolate bars are created equal! This dataset contains expert ratings of over 1,700 individual chocolate bars, along with information on their regional origin, percentage of cocoa, the variety of cacao bean used, and where the beans were grown.

Chocolate bars across the world

This database is narrowly focused on plain dark chocolate with an aim of appreciating the flavors of the cacao when made into chocolate.

# Read the data frame from the CSV file

cacao <- read.csv("data/Chocolate/flavors_of_cacao.csv")

head(cacao)
# Preview the information inside the data frame

str(cacao)
## 'data.frame':    1795 obs. of  9 variables:
##  $ CompanyÂ...Maker.if.known.      : chr  "A. Morin" "A. Morin" "A. Morin" "A. Morin" ...
##  $ Specific.Bean.Origin.or.Bar.Name: chr  "Agua Grande" "Kpime" "Atsane" "Akata" ...
##  $ REF                             : int  1876 1676 1676 1680 1704 1315 1315 1315 1319 1319 ...
##  $ Review.Date                     : int  2016 2015 2015 2015 2015 2014 2014 2014 2014 2014 ...
##  $ Cocoa.Percent                   : chr  "63%" "70%" "70%" "70%" ...
##  $ Company.Location                : chr  "France" "France" "France" "France" ...
##  $ Rating                          : num  3.75 2.75 3 3.5 3.5 2.75 3.5 3.5 3.75 4 ...
##  $ Bean.Type                       : chr  " " " " " " " " ...
##  $ Broad.Bean.Origin               : chr  "Sao Tome" "Togo" "Togo" "Togo" ...

In this report, we try to identify the best type of cacao bean for chocolate bars by finding out which company made the highest-rated chocolate bars, and by analyzing whether the cocoa percentage and the selected bean type have anything to do with the bars' ratings.

To do that, we first have to clean up our data by removing the special characters, which might be caused by an imperfect conversion of the original data to “.csv”, and by changing the data types of some columns in our data frame. Afterwards, we can start to subset our data to find the values we are interested in.
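One type change that is not part of the cleanup in this report, but that can be handy, is turning the percentage text (for example "63%") into numbers so that the values can be sorted or compared numerically. The following is a minimal sketch using a few illustrative values rather than the real column; the name cocoa_numeric is only a placeholder.

```r
# Illustrative values only, in the same text format as the Cocoa.Percent column
cocoa_percent <- factor(c("63%", "70%", "72.5%", "100%"))

# Go through as.character() first: calling as.numeric() directly on a factor
# would return the internal level codes instead of the percentages
cocoa_numeric <- as.numeric(sub("%", "", as.character(cocoa_percent), fixed = TRUE))

# Numeric values sort by size, unlike the text labels, where "100%" sorts first
sort(cocoa_numeric)
```

The as.character() step is harmless when the column is still plain text and necessary once it has been converted to a factor.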

# Rename the column names and remove special characters from the data frame

names(cacao)[names(cacao) %in% c("CompanyÂ...Maker.if.known.",
                                 "Specific.Bean.Origin.or.Bar.Name",
                                 "Broad.Bean.Origin")] <- c("Company.Names",
                                                            "Chocolate.Bar.Names",
                                                            "Bean.Origin")

cacao[c("Bean.Type", "Bean.Origin")] <- gsub("[Â]", NA_character_,
                                             unlist(cacao[c("Bean.Type", "Bean.Origin")]))

cacao[is.na(cacao)] <- "Missing"
# Convert some of the columns with repeating values into "Factor" for better memory efficiency.

cacao[c("Company.Names",
        "Cocoa.Percent",
        "Company.Location",
        "Bean.Type",
        "Bean.Origin")] <- 
lapply(cacao[c("Company.Names",
               "Cocoa.Percent",
               "Company.Location",
               "Bean.Type",
               "Bean.Origin")], as.factor)


Most Produced Chocolate Bars

Dark chocolate bars, as the way to truly enjoy a refined cocoa taste, come in many different cocoa percentages and bean types.

# Plot the cocoa percentage

plot(droplevels(cacao$Cocoa.Percent), cex.names=0.6, cex.axis=0.6, las=2, col=1:length(unique(cacao$Cocoa.Percent)), main="Most Produced Cocoa Percentage for Dark Chocolate Bars")

According to the bar plot above, the dark chocolate bars produced most often have a cocoa percentage of 70%. With such a wide selection at the same percentage, some of these bars surely have a more refined taste than others in the same category, as each company competes against the other brands in the hope of dominating the market.
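The counts behind a bar plot like this one can also be read as numbers. The sketch below tabulates a small illustrative vector (not the real data) and picks out the most frequent label; percent_counts and most_common are placeholder names.

```r
# Illustrative values only; in the report the real column is cacao$Cocoa.Percent
cocoa_percent <- c("70%", "75%", "70%", "72%", "70%", "75%")

# Count how often each percentage occurs, most frequent first
percent_counts <- sort(table(cocoa_percent), decreasing = TRUE)

# Label of the most common percentage
most_common <- names(percent_counts)[1]
most_common
```

Applied to the real column, the same two steps should give the value that the tallest bar in the plot represents.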

# Plot the bean type

plot(droplevels(cacao[cacao$Bean.Type != "Missing",]$Bean.Type), cex.names=0.6, cex.axis=0.6, las=2, col=1:length(unique(cacao$Bean.Type)), main="Most Used types of Beans for Dark Chocolate Bars")




Apparently, among the bars with a known bean type, most were produced using Trinitario cacao beans. A likely reason is that Trinitario is one of the three best-regarded types of cacao beans in the world. Although Criollo is categorized as the best in terms of quality, it is more challenging to cultivate because the plant is less resistant to disease than the others, which makes it the most expensive of the three. Forastero, which comes second after Criollo, has a more powerful but less aromatic flavor that can sometimes be bitter and more acidic. Trinitario, however, has the most powerful cocoa taste of the three, and it is generally less acidic and bitter than Forastero, which can make it more pleasant to most people’s palates.

Flavors of Cacao Rating System

The reviewer of the 1795 chocolate bars from across the world rates each bar on a scale from 1 (lowest) to 5 (highest) to give us a brief description of how it tastes. The complete rating scale is as follows:

      5 = Elite (Transcending beyond the ordinary limits)
      4 = Premium (Superior flavor development, character and style)
      3 = Satisfactory (3.0) to praiseworthy (3.75) (well made with special qualities)
      2 = Disappointing (Passable but contains at least one significant flaw)
      1 = Unpleasant (mostly unpalatable)


With all of these ratings in mind, are there any chocolate bars with a perfect score, that is, a rating of 5 (elite)?
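The scale above can also be attached to numeric ratings as labels. The sketch below uses cut() for that. It assumes that a fractional rating belongs to the band of its whole number (so 2.75 counts as Disappointing), which the scale only states explicitly for the range 3.0 to 3.75. The ratings vector is illustrative and the name bands is a placeholder.

```r
# Illustrative ratings only, not taken from the dataset
ratings <- c(1, 2.75, 3, 3.75, 4, 5)

# Left-closed bands: [1,2), [2,3), [3,4), [4,5), [5,Inf)
# Assumption: a fractional rating falls into the band of its whole number
bands <- cut(ratings,
             breaks = c(1, 2, 3, 4, 5, Inf),
             right  = FALSE,
             labels = c("Unpleasant", "Disappointing",
                        "Satisfactory to praiseworthy", "Premium", "Elite"))

# Pair each rating with its label
data.frame(rating = ratings, band = bands)
```

A quick table(bands) on the real Rating column would then show how many bars fall into each band, including whether any bar reaches Elite.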

The Distribution of Chocolate Bars According to The Ratings

# Plot the chocolate bars ratings counts

hist(cacao$Rating, col="skyblue", main="Histogram of Chocolate Bar Rating", xlab="Rating", ylab="Frequency")

According to the plot above, most of the chocolate bars were rated at around a satisfactory level. However, as our objective is to find the best-quality chocolate bars, we should remove the bars with ratings below the satisfactory level (3).

# Keep the bars rated at or above the satisfactory level and count the rows left in the data frame

good_quality_cacao <- cacao[cacao$Rating >= 3,]

nrow(good_quality_cacao)
## [1] 1346
# Plot the good quality chocolate bars

hist(good_quality_cacao$Rating, col="skyblue", main="Histogram of The Good Quality Chocolate Bar Rating", xlab="Rating", ylab="Frequency")

Going from a total of 1795 bars down to 1346 good-quality bars, we can observe that the ratings cluster around 3.5. We also find that a small number of chocolate bars reached the perfect score of 5.0. The next step is to look at those chocolate bars.
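As a quick check of how much data the filter kept, the two row counts reported above can be turned into a share. This is plain arithmetic on the numbers printed by str(cacao) and nrow(good_quality_cacao); the variable names are placeholders.

```r
total_bars <- 1795   # rows in the full data frame, from str(cacao)
good_bars  <- 1346   # rows with Rating >= 3, from nrow(good_quality_cacao)

# Share of bars rated at or above the satisfactory level
share_good <- good_bars / total_bars
round(100 * share_good, 1)   # roughly 75 percent

# Number of bars removed by the filter
total_bars - good_bars
```

So about three quarters of the rated bars are at least satisfactory, and 449 bars fall below that level.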

The Best Rated Chocolate Bars

# Subset to only the chocolate bars with a rating of 5, convert factor columns back to character for previewing, and preview the elite-rated data

elite_cacao <- cacao[cacao$Rating == 5,]

str(c(elite_cacao[sapply(elite_cacao, is.character)],
      lapply(elite_cacao[sapply(elite_cacao,is.factor)], as.character),
      elite_cacao[sapply(elite_cacao, is.numeric)]))
## List of 9
##  $ Chocolate.Bar.Names: chr [1:2] "Chuao" "Toscano Black"
##  $ Company.Names      : chr [1:2] "Amedei" "Amedei"
##  $ Cocoa.Percent      : chr [1:2] "70%" "70%"
##  $ Company.Location   : chr [1:2] "Italy" "Italy"
##  $ Bean.Type          : chr [1:2] "Trinitario" "Blend"
##  $ Bean.Origin        : chr [1:2] "Venezuela" "Missing"
##  $ REF                : int [1:2] 111 40
##  $ Review.Date        : int [1:2] 2007 2006
##  $ Rating             : num [1:2] 5 5

In 2006 and 2007, the Italian company Amedei produced two chocolate bars, Chuao and Toscano Black, that earned the Elite rating, meaning that according to the reviewer their taste transcends the ordinary limits. A tempting explanation for this achievement is the recipe: both bars have a cocoa percentage of 70%, and they use Trinitario and Blend beans respectively. The Chuao beans originated from Venezuela, while the origin for Toscano Black is sadly Missing, perhaps because the reviewer was not told which beans went into that particular blend. However, can we really conclude that the recipe is the reason?

The Cocoa Percentage and Type of Beans

To test the hypothesis above, we will first check each criterion separately (the cocoa percentage and the bean type) to see whether it also appears among the chocolate bars rated below the elite bars. After that, we will combine the criteria to find the chocolate bars that match the elite bars on all of them.

# Subset the data of cocoa which have quality below elite

below_elite_cacao <- cacao[cacao$Rating < 5,]
# Observe the percentages which appear in the chocolate bars rated below the elite chocolate bars

unique(below_elite_cacao$Cocoa.Percent)
##  [1] 63%   70%   60%   80%   88%   72%   55%   75%   65%   85%   73%   64%  
## [13] 66%   68%   50%   100%  77%   90%   71%   83%   78%   74%   76%   86%  
## [25] 82%   69%   91%   42%   61%   73.5% 62%   67%   58%   60.5% 79%   81%  
## [37] 57%   72.5% 56%   46%   89%   99%   84%   53%   87%  
## 45 Levels: 100% 42% 46% 50% 53% 55% 56% 57% 58% 60% 60.5% 61% 62% 63% ... 99%
# Is the elite chocolate bars' percentage also present among the chocolate bars with lower ratings?

unique(elite_cacao$Cocoa.Percent) %in% unique(below_elite_cacao$Cocoa.Percent)
## [1] TRUE
# Bean types present in the chocolate bars rated lower than the elite chocolate bars

unique(below_elite_cacao$Bean.Type)
##  [1] Missing                  Criollo                  Trinitario              
##  [4] Forastero (Arriba)       Forastero                Forastero (Nacional)    
##  [7] Criollo, Trinitario      Criollo (Porcelana)      Blend                   
## [10] Trinitario (85% Criollo) Forastero (Catongo)      Forastero (Parazinho)   
## [13] Trinitario, Criollo      CCN51                    Criollo (Ocumare)       
## [16] Nacional                 Criollo (Ocumare 61)     Criollo (Ocumare 77)    
## [19] Criollo (Ocumare 67)     Criollo (Wild)           Beniano                 
## [22] Amazon mix               Trinitario, Forastero    Forastero (Arriba) ASS  
## [25] Criollo, +               Amazon                   Amazon, ICS             
## [28] EET                      Blend-Forastero,Criollo  Trinitario (Scavina)    
## [31] Criollo, Forastero       Matina                   Forastero(Arriba, CCN)  
## [34] Nacional (Arriba)        Forastero (Arriba) ASSS  Forastero, Trinitario   
## [37] Forastero (Amelonado)                             Trinitario, Nacional    
## [40] Trinitario (Amelonado)   Trinitario, TCGA         Criollo (Amarru)        
## 42 Levels:  Amazon Amazon mix Amazon, ICS Beniano ... Trinitario, TCGA
# Are the bean types used to produce the elite chocolate bars present among the lower-rated chocolate bars?

unique(elite_cacao$Bean.Type) %in% unique(below_elite_cacao$Bean.Type)
## [1] TRUE TRUE

All the criteria, tested separately, were also found among the chocolate bars rated below elite. Nevertheless, are there any chocolate bars that match the elite chocolate bars on all criteria at once?

# Subset chocolate bars which have 70% cocoa percentage and bean types "Trinitario" and "Blend"

bec_elite <- below_elite_cacao[below_elite_cacao$Cocoa.Percent == "70%" &
                                 below_elite_cacao$Bean.Type %in%
                                 c("Trinitario", "Blend"),]
# Preview the data frame

head(bec_elite)

A total of 156 chocolate bars have the same cocoa percentage and the same bean types. If we look at the bean origin, we can see that beans of the same type come from many different countries. Does the bean origin have any correlation with the ratings given to the chocolate bars?

To find the answer, we have to subset the data frame by bean type. Sadly, because the origin of the Blend bean type is Missing, we cannot use it, as it might give us a misleading result in the end. As a result, we will remove the Blend bean type from our data frame.

# Remove the blend bean type and order the ratings from the highest to the lowest

bec_trinitario <- bec_elite[bec_elite$Bean.Type == "Trinitario",]

bec_trinitario <- bec_trinitario[order(bec_trinitario$Rating, decreasing=TRUE),]
# Preview the highest rating chocolate bars with Trinitario as the bean type

bec_trinitario
# Plot a bar chart to find the ratio of ratings for each origin of the bean types

plot_bar <- table(bec_trinitario[bec_trinitario$Rating >= 1,]$Rating,
                  as.character(bec_trinitario[bec_trinitario$Rating >= 1,]$Bean.Origin))

barplot(plot_bar, cex.names=0.6, cex.axis=0.6, las=2, col=1:length(unique(bec_trinitario$Rating)), legend=T, args.legend = list(x = 8.5, ncol=2, title="Ratings"), main="Ratio of Chocolate Bar Rating for Each Bean Type Origin")




The bar plot above shows that Trinitario beans originating from Venezuela have the highest number of bars rated 4 (premium). However, chocolate bars made with beans from that same country also received ratings below the satisfactory level (3). Furthermore, Madagascar, another origin of Trinitario beans, has an overall higher number of chocolate bars rated at or above the satisfactory level, with generally fewer ratings below it, relative to the beans from Venezuela.
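A comparison like this, read off the bar plot, could also be expressed as numbers per origin. The sketch below uses tapply() on a tiny made-up data frame; the ratings in it are invented for illustration and are not taken from the dataset, and toy, share_satisfactory and mean_rating are placeholder names.

```r
# Made-up ratings for illustration only; NOT taken from the dataset
toy <- data.frame(
  Bean.Origin = c("Venezuela", "Venezuela", "Venezuela", "Madagascar", "Madagascar"),
  Rating      = c(4, 2.75, 3.5, 3.5, 3.25)
)

# Share of bars per origin rated satisfactory (3) or better
share_satisfactory <- tapply(toy$Rating >= 3, toy$Bean.Origin, mean)

# Mean rating per origin
mean_rating <- tapply(toy$Rating, toy$Bean.Origin, mean)

share_satisfactory
mean_rating
```

With the real data, bec_trinitario would take the place of toy. Origins with only one or two bars would need to be read with care, since a single rating can dominate the share.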

Conclusion

In conclusion, it appears that the quality of the elite chocolate bars was not determined by the type of beans and/or the cocoa percentage alone, as chocolate bars with the same cocoa percentage, the same bean type, and even the same bean origin also appear among the lower-rated bars. What is more, beans of the same type but from a different country than the elite bars' beans have overall better ratings below 5 (elite) than beans from the origin that received the highest rating from the reviewer.

Can We Rely on The Ratings?

We can definitely rely on the ratings as a reference when choosing among the many different brands available throughout the world and at your local store. However, we do have to explore the taste ourselves, as people’s taste buds differ in their preferred balance of sweetness, bitterness, and creaminess in a chocolate bar.

How The Ratings Were Made By The Reviewer

Each chocolate is evaluated from a combination of both objective qualities and subjective interpretation. A rating here only represents an experience with one bar from one batch. Batch numbers, vintages and review dates are included in the database when known. The ratings do not reflect health benefits, social missions, or organic status.

Flavor is the most important component of the Flavors of Cacao ratings. Diversity, balance, intensity and purity of flavors are all considered. It is possible for a straightforward single-note chocolate to rate as high as a complex flavor profile that changes throughout. Genetics, terroir, post-harvest techniques, processing and storage can all be discussed when considering the flavor component.

Texture has a great impact on the overall experience and it is also possible for texture-related issues to impact flavor. It is a good way to evaluate the maker's vision, attention to detail and level of proficiency.

Aftermelt is the experience after the chocolate has melted. Higher quality chocolate will linger and be long lasting and enjoyable. Since the aftermelt is the last impression you get from the chocolate, it receives equal importance in the overall rating.

Overall Opinion is really where the ratings reflect a subjective opinion. Ideally it is the reviewer's evaluation of whether or not the components above worked together and an opinion on the flavor development, character and style. It is also here where each chocolate can usually be summarized by the most prominent impressions that you would remember about each chocolate.

Extra

When dealing with big data, where the number of rows and columns can reach the millions, the performance of our program suffers because of the large amount of memory needed to run it, and not many ordinary computers or laptops can handle the stress. Beginners such as myself often work by trial and error, assigning this and that to randomly named objects in the environment. We have to understand that this can put a lot of stress on our computers, so cleaning up after we are finished with our trial and error is good practice on the way to becoming a good programmer. To do that, we can look at the Environment window on the right side of the RStudio IDE, or we can list the objects using the code below:

# Preview the list of data and variables defined in the environment

ls()
## [1] "bec_elite"          "bec_trinitario"     "below_elite_cacao" 
## [4] "cacao"              "elite_cacao"        "good_quality_cacao"
## [7] "plot_bar"

We can then choose which data and/or variables we want to remove:

# Delete data or variables one by one from the environment

# rm("data_or_variable_name")

We can also remove all the data and variables and re-run all the code, which is a simpler option for those of us who have many unused data or variables:

# Remove all data and variables from the environment

# rm(list = ls())

After that, we have to run all the code from the top to get the data and variables we use back.
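To decide what is worth removing, it can help to see how much memory each object takes. The sketch below works in a separate scratch environment, so it does not touch the real workspace. The object names (scratch, labels_chr, labels_fct) are placeholders, and the data is generated on the spot. It also illustrates why repeated text values tend to take less memory when stored as a factor.

```r
# Work in a scratch environment so the sketch leaves the global one alone
scratch <- new.env()
assign("labels_chr", rep(c("France", "Italy", "Togo"), 10000), envir = scratch)
assign("labels_fct", factor(get("labels_chr", envir = scratch)), envir = scratch)

# Size of every object in the environment, in bytes, largest first
sizes <- sapply(ls(scratch),
                function(nm) as.numeric(object.size(get(nm, envir = scratch))))
sort(sizes, decreasing = TRUE)

# Remove a single object by name and confirm that it is gone
rm("labels_chr", envir = scratch)
ls(scratch)
```

Dropping the envir argument applies the same calls to the global environment, which is where the objects listed by ls() above live.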