Who does not like chocolate?

Fib Gro

1/21/2022

Introduction

Overview of Chocolate

Chocolate is one of the most popular food products in the world. Not only does chocolate taste deliciously sweet, but consuming dark chocolate may also reduce the risk of some health conditions. Dark chocolate is also believed to act on neurotransmitter systems such as serotonin, endorphins and dopamine, which improve our mood. If you want to benefit from serotonin, dark chocolate containing 85% cocoa has the highest serotonin level, at approximately 2.9 micrograms per gram (Journal of Chromatography A, 2012).

The history of chocolate can be traced back to Mesoamerica, in present-day Mexico, where chocolate was used during rituals and also served as medicine. Centuries later, the Mayans used chocolate as a ritual drink called “xocolatl”. Mayan chocolate was made from roasted cacao seeds mixed with water, cornmeal and chillies. In 1526, the Spanish explorer Hernán Cortés introduced cocoa seeds to Spain. Chocolate then became popular in Europe, especially among the wealthy (History, 2020). In 1828, during the Industrial Revolution, the processing of chocolate was transformed: cocoa powder is produced by pressing the cocoa butter out of roasted cocoa beans, and the powder, mixed with liquids, is solidified into a chocolate bar. Today, chocolate is served in many forms, such as ice cream, cookies and milk.

Dataset Information

The dataset is the Flavors of Cacao ratings, originally collected and compiled by Brady Brelinski, a founding member of the Manhattan Chocolate Society. For this project, the dataset was downloaded from Kaggle, where the data has been modified. The dataset focuses solely on dark chocolate, in order to appreciate the original flavours of the cocoa. It includes expert ratings of more than 1,700 chocolate bars reviewed between 2006 and 2017, together with each bar's region of origin, bean variety and cocoa percentage. The ratings do not reflect health benefits or organic content. The Flavors of Cacao rating system is divided into five classes:

5 = Elite (transcending beyond the ordinary limits)
4 = Premium (superior flavour development, character and style)
3 = Satisfactory (3.0) to praiseworthy (3.75) (well made with special qualities)
2 = Disappointing (passable but contains at least one significant flaw)
1 = Unpleasant (mostly unpalatable)

Project Expectation

This project is part of the Learn by Building (LBB) assignment. It applies what has been learned about R Markdown, exploratory data analysis in R and publishing to RPubs. The goals of the project are:

Discover where the cocoa beans with the highest average rating are grown.
Discover which countries manufacture chocolate bars with the highest average rating.
Discover the relationship between the cocoa percentage of a chocolate bar and its rating.

Data Observation

First, import the dataset flavors_of_cacao.csv from the working directory and assign it to an object called Choco. Then, print the first six rows by using the head() function.

Choco <- read.csv("flavors_of_cacao.csv")
head(Choco)

#>   Company...Maker.if.known. Specific.Bean.Origin.or.Bar.Name  REF Review.Date
#> 1                  A. Morin                      Agua Grande 1876        2016
#> 2                  A. Morin                            Kpime 1676        2015
#> 3                  A. Morin                           Atsane 1676        2015
#> 4                  A. Morin                            Akata 1680        2015
#> 5                  A. Morin                           Quilla 1704        2015
#> 6                  A. Morin                         Carenero 1315        2014
#>   Cocoa.Percent Company.Location Rating Bean.Type Broad.Bean.Origin
#> 1           63%           France   3.75                    Sao Tome
#> 2           70%           France   2.75                        Togo
#> 3           70%           France   3.00                        Togo
#> 4           70%           France   3.50                        Togo
#> 5           70%           France   3.50                        Peru
#> 6           70%           France   2.75   Criollo         Venezuela

Inspect the data by using str().

We found that there are 1,795 observations with 9 columns.
The data description :
- Company...Maker.if.known. is the name of the company manufacturing the chocolate bar.
- Specific.Bean.Origin.or.Bar.Name is the specific region of origin of the chocolate bar.
- REF is a review identification number.
- Review.Date is the publication date of the review.
- Cocoa.Percent is the percentage cocoa content in the chocolate bar.
- Company.Location is the location of the company.
- Rating is the expert rating as mentioned above.
- Bean.Type is the variety of beans used.
- Broad.Bean.Origin is the region of origin of the bean.

str(Choco)

#> 'data.frame':    1795 obs. of  9 variables:
#>  $ Company...Maker.if.known.       : chr  "A. Morin" "A. Morin" "A. Morin" "A. Morin" ...
#>  $ Specific.Bean.Origin.or.Bar.Name: chr  "Agua Grande" "Kpime" "Atsane" "Akata" ...
#>  $ REF                             : int  1876 1676 1676 1680 1704 1315 1315 1315 1319 1319 ...
#>  $ Review.Date                     : int  2016 2015 2015 2015 2015 2014 2014 2014 2014 2014 ...
#>  $ Cocoa.Percent                   : chr  "63%" "70%" "70%" "70%" ...
#>  $ Company.Location                : chr  "France" "France" "France" "France" ...
#>  $ Rating                          : num  3.75 2.75 3 3.5 3.5 2.75 3.5 3.5 3.75 4 ...
#>  $ Bean.Type                       : chr  " " " " " " " " ...
#>  $ Broad.Bean.Origin               : chr  "Sao Tome" "Togo" "Togo" "Togo" ...

Observe Rating and Cocoa.Percent columns.

The rating has a range between 1 and 5 with the median and mean being relatively close.
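The cocoa percentage ranges between 42 and 100 with the median and mean being around 70. Because Cocoa.Percent is still stored as text at this point, these figures do not appear in the summary below; they only become visible once the column is converted to numeric.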

summary(Choco)

#>  Company...Maker.if.known. Specific.Bean.Origin.or.Bar.Name      REF      
#>  Length:1795               Length:1795                      Min.   :   5  
#>  Class :character          Class :character                 1st Qu.: 576  
#>  Mode  :character          Mode  :character                 Median :1069  
#>                                                             Mean   :1036  
#>                                                             3rd Qu.:1502  
#>                                                             Max.   :1952  
#>   Review.Date   Cocoa.Percent      Company.Location       Rating     
#>  Min.   :2006   Length:1795        Length:1795        Min.   :1.000  
#>  1st Qu.:2010   Class :character   Class :character   1st Qu.:2.875  
#>  Median :2013   Mode  :character   Mode  :character   Median :3.250  
#>  Mean   :2012                                         Mean   :3.186  
#>  3rd Qu.:2015                                         3rd Qu.:3.500  
#>  Max.   :2017                                         Max.   :5.000  
#>   Bean.Type         Broad.Bean.Origin 
#>  Length:1795        Length:1795       
#>  Class :character   Class :character  
#>  Mode  :character   Mode  :character  
#>                                       
#>                                       
#>
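Since summary() only reports length and class for a character column, the cocoa percentage range has to be computed from a numeric copy. A minimal sketch, assuming a handful of made-up percentage strings (`cocoa_chr` and `cocoa_num` are hypothetical names, not objects from this report):

```r
# Hypothetical sample of percentage strings, shaped like Choco$Cocoa.Percent
cocoa_chr <- c("63%", "70%", "42%", "100%", "72.5%")

# Strip the % sign (fixed = TRUE treats "%" literally) and convert to numeric
cocoa_num <- as.numeric(sub("%", "", cocoa_chr, fixed = TRUE))

# Numeric summaries now work as expected
summary(cocoa_num)
range(cocoa_num)
```

The same two steps can be applied to a temporary copy of the real column without modifying the data frame.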

Data Cleaning

The objectives of data cleaning in this project are:

Rename variables where necessary.
Observe and fix any misspellings or typos.
Verify and transform data types.
Check for any missing values.

Rename Columns

There are two columns with long names. Let’s rename them and then confirm the change by applying head().

Company...Maker.if.known. -> Company.Name
Specific.Bean.Origin.or.Bar.Name -> Region.Bar

names(Choco)[names(Choco) == "Company...Maker.if.known."] <- "Company.Name"
names(Choco)[names(Choco) == "Specific.Bean.Origin.or.Bar.Name"] <- "Region.Bar"
head(Choco)

#>   Company.Name  Region.Bar  REF Review.Date Cocoa.Percent Company.Location
#> 1     A. Morin Agua Grande 1876        2016           63%           France
#> 2     A. Morin       Kpime 1676        2015           70%           France
#> 3     A. Morin      Atsane 1676        2015           70%           France
#> 4     A. Morin       Akata 1680        2015           70%           France
#> 5     A. Morin      Quilla 1704        2015           70%           France
#> 6     A. Morin    Carenero 1315        2014           70%           France
#>   Rating Bean.Type Broad.Bean.Origin
#> 1   3.75                    Sao Tome
#> 2   2.75                        Togo
#> 3   3.00                        Togo
#> 4   3.50                        Togo
#> 5   3.50                        Peru
#> 6   2.75   Criollo         Venezuela
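When more than a couple of columns need renaming, a lookup vector keeps the old and new names side by side. This is only a sketch on a one-row toy data frame; `toy`, `new_names` and `hit` are illustrative names and the values are made up:

```r
# Toy data frame carrying the two long column names (values are illustrative)
toy <- data.frame(
  "Company...Maker.if.known." = "A. Morin",
  "Specific.Bean.Origin.or.Bar.Name" = "Agua Grande",
  REF = 1876L
)

# Named vector: old name -> new name
new_names <- c(
  "Company...Maker.if.known." = "Company.Name",
  "Specific.Bean.Origin.or.Bar.Name" = "Region.Bar"
)

# Rename only the columns that appear in the lookup
hit <- names(toy) %in% names(new_names)
names(toy)[hit] <- unname(new_names[names(toy)[hit]])
names(toy)
```

Columns that are not listed in the lookup, such as REF here, keep their names.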

Misspellings and Validity of Data

Company.Location

First, we observe company location values by using unique() and sort().

sort(unique(Choco$Company.Location))

#>  [1] "Amsterdam"         "Argentina"         "Australia"        
#>  [4] "Austria"           "Belgium"           "Bolivia"          
#>  [7] "Brazil"            "Canada"            "Chile"            
#> [10] "Colombia"          "Costa Rica"        "Czech Republic"   
#> [13] "Denmark"           "Domincan Republic" "Ecuador"          
#> [16] "Eucador"           "Fiji"              "Finland"          
#> [19] "France"            "Germany"           "Ghana"            
#> [22] "Grenada"           "Guatemala"         "Honduras"         
#> [25] "Hungary"           "Iceland"           "India"            
#> [28] "Ireland"           "Israel"            "Italy"            
#> [31] "Japan"             "Lithuania"         "Madagascar"       
#> [34] "Martinique"        "Mexico"            "Netherlands"      
#> [37] "New Zealand"       "Niacragua"         "Nicaragua"        
#> [40] "Peru"              "Philippines"       "Poland"           
#> [43] "Portugal"          "Puerto Rico"       "Russia"           
#> [46] "Sao Tome"          "Scotland"          "Singapore"        
#> [49] "South Africa"      "South Korea"       "Spain"            
#> [52] "St. Lucia"         "Suriname"          "Sweden"           
#> [55] "Switzerland"       "U.K."              "U.S.A."           
#> [58] "Venezuela"         "Vietnam"           "Wales"

We found that some of the values are not valid, and we also spotted some misspellings.

Amsterdam is not a country and should be categorized as the Netherlands.
Wales and Scotland are located in the U.K.
Some misspellings:
- Niacragua should be Nicaragua.
- Domincan Republic should be the Dominican Republic.
- Eucador should be Ecuador.

Let’s fix these values. Originally, Company.Location had 60 unique values; after cleaning, it has 55.

Choco$Company.Location[Choco$Company.Location ==  "Amsterdam"] <- "Netherlands"
Choco$Company.Location[Choco$Company.Location ==  "Niacragua"] <- "Nicaragua"
Choco$Company.Location[Choco$Company.Location ==  "Domincan Republic"] <- "Dominican Republic"
Choco$Company.Location[Choco$Company.Location ==  "Eucador"] <- "Ecuador"
Choco$Company.Location[Choco$Company.Location ==  "Wales" | Choco$Company.Location ==  "Scotland"] <- "U.K."
sort(unique(Choco$Company.Location))

#>  [1] "Argentina"          "Australia"          "Austria"           
#>  [4] "Belgium"            "Bolivia"            "Brazil"            
#>  [7] "Canada"             "Chile"              "Colombia"          
#> [10] "Costa Rica"         "Czech Republic"     "Denmark"           
#> [13] "Dominican Republic" "Ecuador"            "Fiji"              
#> [16] "Finland"            "France"             "Germany"           
#> [19] "Ghana"              "Grenada"            "Guatemala"         
#> [22] "Honduras"           "Hungary"            "Iceland"           
#> [25] "India"              "Ireland"            "Israel"            
#> [28] "Italy"              "Japan"              "Lithuania"         
#> [31] "Madagascar"         "Martinique"         "Mexico"            
#> [34] "Netherlands"        "New Zealand"        "Nicaragua"         
#> [37] "Peru"               "Philippines"        "Poland"            
#> [40] "Portugal"           "Puerto Rico"        "Russia"            
#> [43] "Sao Tome"           "Singapore"          "South Africa"      
#> [46] "South Korea"        "Spain"              "St. Lucia"         
#> [49] "Suriname"           "Sweden"             "Switzerland"       
#> [52] "U.K."               "U.S.A."             "Venezuela"         
#> [55] "Vietnam"
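The same corrections can also be written as a single lookup table, which is easier to extend than one assignment per value. A small sketch on an illustrative vector (`loc`, `fixes` and `hit` are hypothetical names; the mapping simply repeats the corrections listed above):

```r
# Illustrative values only, not the full Company.Location column
loc <- c("Amsterdam", "France", "Eucador", "Wales", "U.S.A.", "Niacragua")

# Named vector: wrong value -> corrected value
fixes <- c(
  "Amsterdam" = "Netherlands",
  "Niacragua" = "Nicaragua",
  "Domincan Republic" = "Dominican Republic",
  "Eucador" = "Ecuador",
  "Wales" = "U.K.",
  "Scotland" = "U.K."
)

# Recode only the values found in the lookup; everything else is untouched
hit <- loc %in% names(fixes)
loc[hit] <- unname(fixes[loc[hit]])
loc
```

Values without an entry in `fixes`, such as France, pass through unchanged, so the lookup can be rerun safely.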

Broad.Bean.Origin

Next, we check the Broad.Bean.Origin column by using unique() and sort().

sort(unique(Choco$Broad.Bean.Origin))

#>   [1] ""                              " "                            
#>   [3] "Africa, Carribean, C. Am."     "Australia"                    
#>   [5] "Belize"                        "Bolivia"                      
#>   [7] "Brazil"                        "Burma"                        
#>   [9] "Cameroon"                      "Carribean"                    
#>  [11] "Carribean(DR/Jam/Tri)"         "Central and S. America"       
#>  [13] "Colombia"                      "Colombia, Ecuador"            
#>  [15] "Congo"                         "Cost Rica, Ven"               
#>  [17] "Costa Rica"                    "Cuba"                         
#>  [19] "Dom. Rep., Madagascar"         "Domincan Republic"            
#>  [21] "Dominican Rep., Bali"          "Dominican Republic"           
#>  [23] "DR, Ecuador, Peru"             "Ecuador"                      
#>  [25] "Ecuador, Costa Rica"           "Ecuador, Mad., PNG"           
#>  [27] "El Salvador"                   "Fiji"                         
#>  [29] "Gabon"                         "Ghana"                        
#>  [31] "Ghana & Madagascar"            "Ghana, Domin. Rep"            
#>  [33] "Ghana, Panama, Ecuador"        "Gre., PNG, Haw., Haiti, Mad"  
#>  [35] "Grenada"                       "Guat., D.R., Peru, Mad., PNG" 
#>  [37] "Guatemala"                     "Haiti"                        
#>  [39] "Hawaii"                        "Honduras"                     
#>  [41] "India"                         "Indonesia"                    
#>  [43] "Indonesia, Ghana"              "Ivory Coast"                  
#>  [45] "Jamaica"                       "Liberia"                      
#>  [47] "Mad., Java, PNG"               "Madagascar"                   
#>  [49] "Madagascar & Ecuador"          "Malaysia"                     
#>  [51] "Martinique"                    "Mexico"                       
#>  [53] "Nicaragua"                     "Nigeria"                      
#>  [55] "Panama"                        "Papua New Guinea"             
#>  [57] "Peru"                          "Peru, Belize"                 
#>  [59] "Peru, Dom. Rep"                "Peru, Ecuador"                
#>  [61] "Peru, Ecuador, Venezuela"      "Peru, Mad., Dom. Rep."        
#>  [63] "Peru, Madagascar"              "Peru(SMartin,Pangoa,nacional)"
#>  [65] "Philippines"                   "PNG, Vanuatu, Mad"            
#>  [67] "Principe"                      "Puerto Rico"                  
#>  [69] "Samoa"                         "Sao Tome"                     
#>  [71] "Sao Tome & Principe"           "Solomon Islands"              
#>  [73] "South America"                 "South America, Africa"        
#>  [75] "Sri Lanka"                     "St. Lucia"                    
#>  [77] "Suriname"                      "Tanzania"                     
#>  [79] "Tobago"                        "Togo"                         
#>  [81] "Trinidad"                      "Trinidad-Tobago"              
#>  [83] "Trinidad, Ecuador"             "Trinidad, Tobago"             
#>  [85] "Uganda"                        "Vanuatu"                      
#>  [87] "Ven, Bolivia, D.R."            "Ven, Trinidad, Ecuador"       
#>  [89] "Ven., Indonesia, Ecuad."       "Ven., Trinidad, Mad."         
#>  [91] "Ven.,Ecu.,Peru,Nic."           "Venez,Africa,Brasil,Peru,Mex" 
#>  [93] "Venezuela"                     "Venezuela, Carribean"         
#>  [95] "Venezuela, Dom. Rep."          "Venezuela, Ghana"             
#>  [97] "Venezuela, Java"               "Venezuela, Trinidad"          
#>  [99] "Venezuela/ Ghana"              "Vietnam"                      
#> [101] "West Africa"

We found that some of the names are inconsistent or misspelled. We also need to replace characters such as /, () and & with a consistent separator. Let’s clean the values to make them more readable and consistent.

# Expand abbreviations and use ", " as the only separator between origins
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Ven., Indonesia, Ecuad."] <- "Venezuela, Indonesia, Ecuador"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Venezuela, Dom. Rep."] <- "Venezuela, Dominican Republic"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Venezuela/ Ghana"] <- "Venezuela, Ghana"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Ven., Trinidad, Mad."] <- "Venezuela, Trinidad, Madagascar"
# Keep all four origins (Peru was previously dropped from the replacement)
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Ven.,Ecu.,Peru,Nic."] <- "Venezuela, Ecuador, Peru, Nicaragua"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Peru, Dom. Rep"] <- "Peru, Dominican Republic"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Mad., Java, PNG"] <- "Madagascar, Java, PNG"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Dominican Rep., Bali"] <- "Dominican Republic, Bali"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Sao Tome & Principe"] <- "Sao Tome, Principe"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Africa, Carribean, C. Am."] <- "Africa, Carribean, Central America"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Trinidad-Tobago"] <- "Trinidad, Tobago"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "PNG, Vanuatu, Mad"] <- "PNG, Vanuatu, Madagascar"
# No stray closing parenthesis in the replacement
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Carribean(DR/Jam/Tri)"] <- "Carribean, Dominican Republic, Jamaica, Trinidad"
# Replacement values carry no trailing spaces
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Madagascar & Ecuador"] <- "Madagascar, Ecuador"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Dom. Rep., Madagascar"] <- "Dominican Republic, Madagascar"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Domincan Republic"] <- "Dominican Republic"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Venez,Africa,Brasil,Peru,Mex"] <- "Venezuela, Africa, Brazil, Peru, Mexico"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Peru, Mad., Dom. Rep."] <- "Peru, Madagascar, Dominican Republic"

Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Ghana, Domin. Rep"] <- "Ghana, Dominican Republic"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Central and S. America"] <- "Central America, South America"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Ghana & Madagascar"] <- "Ghana, Madagascar"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Guat., D.R., Peru, Mad., PNG"] <- "Guatemala, Dominican Republic, Peru, Madagascar, PNG"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Gre., PNG, Haw., Haiti, Mad"] <- "Grenada, PNG, Hawaii, Haiti, Madagascar"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "DR, Ecuador, Peru"] <- "Dominican Republic, Ecuador, Peru"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Ecuador, Mad., PNG"] <- "Ecuador, Madagascar, PNG"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Cost Rica, Ven"] <- "Costa Rica, Venezuela"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Ven, Bolivia, D.R."] <- "Venezuela, Bolivia, Dominican Republic"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Peru(SMartin,Pangoa,nacional)"] <- "Peru, St. Martin, Pangoa, Nacional"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  "Ven, Trinidad, Ecuador"] <- "Venezuela, Trinidad, Ecuador"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin ==  ""] <- "Venezuela"

head(sort(unique(Choco$Broad.Bean.Origin)),10)

#>  [1] " "                                               
#>  [2] "Africa, Carribean, Central America"              
#>  [3] "Australia"                                       
#>  [4] "Belize"                                          
#>  [5] "Bolivia"                                         
#>  [6] "Brazil"                                          
#>  [7] "Burma"                                           
#>  [8] "Cameroon"                                        
#>  [9] "Carribean"                                       
#> [10] "Carribean, Dominican Republic, Jamaica, Trinidad"
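Part of this clean-up, namely the separators, could in principle be automated before the abbreviations are expanded by hand. The sketch below is one possible approach on made-up input (`origin` and `clean` are hypothetical names). It assumes that "/", "&" and "-" only ever act as separators, which would not hold for a place name that itself contains a hyphen:

```r
# Illustrative values with inconsistent separators
origin <- c("Venezuela/ Ghana", "Ghana & Madagascar", "Trinidad-Tobago",
            "Ven.,Ecu.,Peru,Nic.", "Peru")

# Turn "/", "&" and "-" (plus any surrounding spaces) into a comma
clean <- gsub("[[:space:]]*[/&-][[:space:]]*", ",", origin)

# Normalise every comma to "comma + single space"
clean <- gsub("[[:space:]]*,[[:space:]]*", ", ", clean)

# Remove leading and trailing whitespace
clean <- trimws(clean)
clean
```

Abbreviations such as "Ven." or "Ecu." are left as they are, so a manual mapping like the one above is still needed for those.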

Bean.Type

Now, observe the Bean.Type column by using unique() and sort().

sort(unique(Choco$Bean.Type))

#>  [1] ""                         " "                       
#>  [3] "Amazon"                   "Amazon mix"              
#>  [5] "Amazon, ICS"              "Beniano"                 
#>  [7] "Blend"                    "Blend-Forastero,Criollo" 
#>  [9] "CCN51"                    "Criollo"                 
#> [11] "Criollo (Amarru)"         "Criollo (Ocumare 61)"    
#> [13] "Criollo (Ocumare 67)"     "Criollo (Ocumare 77)"    
#> [15] "Criollo (Ocumare)"        "Criollo (Porcelana)"     
#> [17] "Criollo (Wild)"           "Criollo, +"              
#> [19] "Criollo, Forastero"       "Criollo, Trinitario"     
#> [21] "EET"                      "Forastero"               
#> [23] "Forastero (Amelonado)"    "Forastero (Arriba)"      
#> [25] "Forastero (Arriba) ASS"   "Forastero (Arriba) ASSS" 
#> [27] "Forastero (Catongo)"      "Forastero (Nacional)"    
#> [29] "Forastero (Parazinho)"    "Forastero, Trinitario"   
#> [31] "Forastero(Arriba, CCN)"   "Matina"                  
#> [33] "Nacional"                 "Nacional (Arriba)"       
#> [35] "Trinitario"               "Trinitario (85% Criollo)"
#> [37] "Trinitario (Amelonado)"   "Trinitario (Scavina)"    
#> [39] "Trinitario, Criollo"      "Trinitario, Forastero"   
#> [41] "Trinitario, Nacional"     "Trinitario, TCGA"

As we can see above, the bean types are categorized by their main variety and sub-variety. According to Pohlan and Perez and to Bar and Cocoa, the main varieties of the cacao plant are forastero, criollo, trinitario and nacional. Let’s group the values under their main variety; a label that names two varieties is assigned to the one named first.

Choco$Bean.Type[Choco$Bean.Type %in% c("Criollo (Ocumare 67)", "Criollo (Wild)", "Criollo (Ocumare 77)", "Criollo, +" , "Criollo (Amarru)", "Criollo (Ocumare)", "Criollo, Forastero", "Criollo (Ocumare 61)" , "Criollo (Porcelana)", "Criollo, Trinitario")] <- "Criollo"
Choco$Bean.Type[Choco$Bean.Type %in% c("Forastero (Arriba) ASS", "Forastero (Parazinho)", "Forastero (Arriba) ASSS", "Forastero, Trinitario", "Forastero (Amelonado)", "Forastero (Arriba)", "Forastero (Catongo)", "Forastero (Nacional)", "Forastero(Arriba, CCN)", "Blend-Forastero,Criollo")] <- "Forastero"
Choco$Bean.Type[Choco$Bean.Type %in% c("Trinitario (85% Criollo)", "Trinitario (Amelonado)", "Trinitario (Scavina)" , "Trinitario, Criollo", "Trinitario, Forastero","Trinitario, Nacional", "Trinitario, TCGA")] <- "Trinitario"
Choco$Bean.Type[Choco$Bean.Type %in% c("Amazon mix", "Amazon, ICS")] <- "Amazon"
Choco$Bean.Type[Choco$Bean.Type %in% c("Nacional (Arriba)")] <- "Nacional" 

sort(unique(Choco$Bean.Type))

#>  [1] ""           " "          "Amazon"     "Beniano"    "Blend"     
#>  [6] "CCN51"      "Criollo"    "EET"        "Forastero"  "Matina"    
#> [11] "Nacional"   "Trinitario"
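The rule "assign each label to the first main variety it names" can also be expressed with a regular expression instead of listing every label. A minimal sketch on a few illustrative labels (`bean`, `main`, `m`, `found` and `collapsed` are hypothetical names); it does not cover the Amazon grouping, which would still need its own rule:

```r
# Illustrative labels taken from the list above
bean <- c("Criollo (Porcelana)", "Forastero (Arriba) ASS", "Trinitario, Criollo",
          "Blend-Forastero,Criollo", "Nacional (Arriba)", "Matina")
main <- c("Criollo", "Forastero", "Trinitario", "Nacional")

# Locate the first main-variety name in each label (-1 when none is present)
m <- regexpr(paste(main, collapse = "|"), bean)
found <- as.integer(m) > 0

# Replace the label with the matched variety; keep labels without a match
collapsed <- bean
collapsed[found] <- regmatches(bean, m)
collapsed
```

Labels that mention none of the four main varieties, such as Matina, are returned unchanged.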

Datatype Transformation

Several columns should be transformed into an appropriate data type.

Company.Name, Region.Bar, Company.Location, Broad.Bean.Origin and Bean.Type are character columns and should be transformed into factors.
Cocoa.Percent needs the % character removed and a change to numeric.
Review.Date needs to be transformed to Date.

The five character columns are transformed in one step by using the lapply() function.

Choco[, c("Company.Name", "Region.Bar", "Company.Location", "Broad.Bean.Origin", "Bean.Type" )] <- lapply(Choco[, c("Company.Name", "Region.Bar", "Company.Location", "Broad.Bean.Origin", "Bean.Type" )], as.factor)

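Column Cocoa.Percent is transformed from character to numeric. First, we apply the gsub() function to remove the % character, and then change the data type by using as.numeric(). Review.Date holds only a year, so as.Date() with format="%Y" fills in the month and day from the date on which the code is run, which is why the output below shows 02-15; only the year is meaningful. Don’t forget to confirm all the data type transformations by using str().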

# Transform Cocoa.Percent into Numeric type
Choco$Cocoa.Percent <- gsub("%","",as.character(Choco$Cocoa.Percent))
Choco$Cocoa.Percent <- as.numeric(Choco$Cocoa.Percent)

# Transform Review.Date into Date type
Choco$Review.Date <- as.Date(as.character(Choco$Review.Date), format="%Y")

str(Choco)

#> 'data.frame':    1795 obs. of  9 variables:
#>  $ Company.Name     : Factor w/ 416 levels "A. Morin","Acalli",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ Region.Bar       : Factor w/ 1039 levels "\"heirloom\", Arriba Nacional",..: 15 494 68 16 813 175 288 923 805 731 ...
#>  $ REF              : int  1876 1676 1676 1680 1704 1315 1315 1315 1319 1319 ...
#>  $ Review.Date      : Date, format: "2016-02-15" "2015-02-15" ...
#>  $ Cocoa.Percent    : num  63 70 70 70 70 70 70 70 70 70 ...
#>  $ Company.Location : Factor w/ 55 levels "Argentina","Australia",..: 17 17 17 17 17 17 17 17 17 17 ...
#>  $ Rating           : num  3.75 2.75 3 3.5 3.5 2.75 3.5 3.5 3.75 4 ...
#>  $ Bean.Type        : Factor w/ 12 levels ""," ","Amazon",..: 2 2 2 2 2 7 2 7 7 2 ...
#>  $ Broad.Bean.Origin: Factor w/ 97 levels " ","Africa, Carribean, Central America",..: 68 78 78 78 55 84 17 84 84 55 ...
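Because a year-only format borrows the month and day from the current date, rerunning the report on another day gives different dates. One way to make the conversion reproducible is to pin each year to 1 January. This is an alternative sketch on made-up years, not the conversion used in this report (`years` and `review_date` are hypothetical names):

```r
# Illustrative review years
years <- c(2016L, 2015L, 2014L)

# Build a full ISO date string so that no component is left unspecified
review_date <- as.Date(paste0(years, "-01-01"))
review_date

# The year can still be recovered for grouping or plotting
format(review_date, "%Y")
```

Keeping the column as an integer year would be another reasonable choice, since the data carries no month or day information.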

Drop Levels

For data visualization purposes, we need to drop unused levels from the columns Broad.Bean.Origin, Bean.Type and Company.Location.

Choco$Broad.Bean.Origin <- droplevels(Choco$Broad.Bean.Origin)
Choco$Bean.Type <- droplevels(Choco$Bean.Type)
Choco$Company.Location <- droplevels(Choco$Company.Location)

Treatment of Missing Values

Check for missing values in all columns by using colSums() and is.na(). The check reports no NA values in the data frame.

colSums(is.na(Choco))

#>      Company.Name        Region.Bar               REF       Review.Date 
#>                 0                 0                 0                 0 
#>     Cocoa.Percent  Company.Location            Rating         Bean.Type 
#>                 0                 0                 0                 0 
#> Broad.Bean.Origin 
#>                 0

Interestingly, is.na() does not tell the whole story: blank strings are not counted as NA, and there are blank values in Bean.Type (888 rows) and Broad.Bean.Origin (73 rows), as the tables below show. Because almost 50% of Bean.Type is blank, we keep the column in the data but do not use it in the analysis.

head(as.data.frame(table(Choco$Bean.Type)))

#>      Var1 Freq
#> 1            1
#> 2          887
#> 3  Amazon    5
#> 4 Beniano    3
#> 5   Blend   41
#> 6   CCN51    1

head(as.data.frame(table(Choco$Broad.Bean.Origin)))

#>                                 Var1 Freq
#> 1                                      73
#> 2 Africa, Carribean, Central America    1
#> 3                          Australia    3
#> 4                             Belize   49
#> 5                            Bolivia   57
#> 6                             Brazil   58
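If the blanks should be counted as missing, they can be converted to NA before calling is.na(). The sketch below uses a made-up vector (`bean_type` and `is_blank` are hypothetical names) and assumes the blanks are empty strings or ordinary spaces; if they turn out to be another invisible character, such as a non-breaking space, the check would need to be extended:

```r
# Illustrative values: two real labels, one space-only string, one empty string
bean_type <- c("Criollo", " ", "", "Trinitario")

# TRUE where nothing is left after trimming ordinary whitespace
is_blank <- !nzchar(trimws(bean_type))

# Mark blanks as missing so that is.na() and colSums(is.na()) can see them
bean_type[is_blank] <- NA
sum(is.na(bean_type))
```

With real NA values in place, functions such as table() drop the blanks by default unless useNA is set.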

Data Manipulation, Wrangling and Visualization

Question No.1

Count how often each bean origin has been reviewed and assign the result to a variable called Choco.bean. The table suggests that beans from Venezuela, Ecuador, the Dominican Republic, Peru and Madagascar have the highest numbers of reviews.

Choco.bean <- as.data.frame(table(Choco$Broad.Bean.Origin))
colnames(Choco.bean) <- c("Broad.Bean.Origin", "Number.Reviews")
Choco.bean <- Choco.bean[order(-Choco.bean$Number.Reviews),]
head(Choco.bean)

#>     Broad.Bean.Origin Number.Reviews
#> 84          Venezuela            215
#> 22            Ecuador            193
#> 18 Dominican Republic            166
#> 55               Peru            165
#> 45         Madagascar            145
#> 1                                 73

Now, observe a summary of the number of reviews in the data frame Choco.bean.

The number of reviews per bean origin ranges between 1 and 215. The range is wide, but most origins have only a few reviews, as the low median of 3 confirms. Let’s verify this with a histogram.
The histogram suggests that most bean origins have been reviewed between 0 and 20 times.
My assumption: an origin with more reviews represents its population more accurately than an origin with only a few reviews. In this case, I use only the bean origins that have been reviewed more than 30 times. Why this number? According to Investopedia, a sample size greater than 30 is often considered sufficient for the Central Limit Theorem (CLT) to hold, so that the sample mean gives a reasonable estimate of the population mean.

# Summary of number of reviews
summary(Choco.bean$Number.Reviews)

#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1.00    1.00    3.00   18.51   11.00  215.00

# histogram for number of reviews.
hist(Choco.bean$Number.Reviews, 
     xlab = "Number of Reviews", main = "Histogram of Number Reviews for Bean Origin")

Thus, filter Choco.bean to keep the bean origins that have more than 30 reviews. This leaves 17 rows, one of which is an empty value.

Choco.bean <- Choco.bean[Choco.bean$Number.Reviews> 30,]
Choco.bean

#>     Broad.Bean.Origin Number.Reviews
#> 84          Venezuela            215
#> 22            Ecuador            193
#> 18 Dominican Republic            166
#> 55               Peru            165
#> 45         Madagascar            145
#> 1                                 73
#> 51          Nicaragua             60
#> 6              Brazil             58
#> 5             Bolivia             57
#> 4              Belize             49
#> 54   Papua New Guinea             42
#> 12           Colombia             40
#> 15         Costa Rica             38
#> 96            Vietnam             38
#> 76           Tanzania             34
#> 28              Ghana             33
#> 79           Trinidad             33

Let’s drop the empty value, which is the sixth row.

Choco.bean <- Choco.bean[-c(6),] 
Choco.bean

#>     Broad.Bean.Origin Number.Reviews
#> 84          Venezuela            215
#> 22            Ecuador            193
#> 18 Dominican Republic            166
#> 55               Peru            165
#> 45         Madagascar            145
#> 51          Nicaragua             60
#> 6              Brazil             58
#> 5             Bolivia             57
#> 4              Belize             49
#> 54   Papua New Guinea             42
#> 12           Colombia             40
#> 15         Costa Rica             38
#> 96            Vietnam             38
#> 76           Tanzania             34
#> 28              Ghana             33
#> 79           Trinidad             33
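Dropping a row by position works here, but it depends on the blank value landing in the sixth row. As a hedged alternative, the threshold and the blank check can be combined in one name-based filter. The sketch uses a made-up vector and a small threshold for illustration (`origin`, `min_reviews`, `counts` and `keep` are hypothetical names), and again assumes the blank is an ordinary space:

```r
# Illustrative origins: 4 x Venezuela, 3 x Peru, 3 x blank, 1 x Togo
origin <- c(rep("Venezuela", 4), rep("Peru", 3), rep(" ", 3), "Togo")
min_reviews <- 2  # illustrative threshold; the text above uses 30

# Count reviews per origin
counts <- table(origin)

# Keep origins above the threshold whose name is not blank
keep <- names(counts)[as.vector(counts) > min_reviews &
                        nzchar(trimws(names(counts)))]
keep
```

The resulting vector of names can be used with %in% to subset the full data frame.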

Create a data frame called Choco.bean.rating containing the bean origins that have been reviewed more than 30 times, as listed in Choco.bean$Broad.Bean.Origin. Convert Broad.Bean.Origin to a factor and drop the unused levels.

Choco.bean.rating <- Choco[Choco$Broad.Bean.Origin %in% Choco.bean$Broad.Bean.Origin,]
Choco.bean.rating$Broad.Bean.Origin <- as.factor(Choco.bean.rating$Broad.Bean.Origin)
Choco.bean.rating$Broad.Bean.Origin <- droplevels(Choco.bean.rating$Broad.Bean.Origin)
head(Choco.bean.rating)

#>    Company.Name     Region.Bar  REF Review.Date Cocoa.Percent Company.Location
#> 5      A. Morin         Quilla 1704  2015-02-15            70           France
#> 6      A. Morin       Carenero 1315  2014-02-15            70           France
#> 8      A. Morin   Sur del Lago 1315  2014-02-15            70           France
#> 9      A. Morin Puerto Cabello 1319  2014-02-15            70           France
#> 10     A. Morin        Pablino 1319  2014-02-15            70           France
#> 12     A. Morin     Madagascar 1011  2013-02-15            70           France
#>    Rating Bean.Type Broad.Bean.Origin
#> 5    3.50                        Peru
#> 6    2.75   Criollo         Venezuela
#> 8    3.50   Criollo         Venezuela
#> 9    3.75   Criollo         Venezuela
#> 10   4.00                        Peru
#> 12   3.00   Criollo        Madagascar

Compute the mean, median, standard deviation and number of reviews per bean origin from Choco.bean.rating, and combine them into a data frame called bean.rating.final.

# Compute the mean, median, standard deviation and number of reviews per bean origin
bean.rating.median <- aggregate(Rating ~ Broad.Bean.Origin, Choco.bean.rating, median)
bean.rating.mean <- aggregate(Rating ~ Broad.Bean.Origin, Choco.bean.rating, mean)
bean.rating.sd <- aggregate(Rating ~ Broad.Bean.Origin, Choco.bean.rating, sd)
bean.rating.count <- aggregate(Rating ~ Broad.Bean.Origin, Choco.bean.rating, length)

# Merge the four summaries into one data frame called bean.rating.final.
MyMerge <- function(x, y){
  df <- merge(x, y, by= "Broad.Bean.Origin", all.x= TRUE, all.y= TRUE)
  return(df)
}

bean.rating.final <- Reduce(MyMerge, list(bean.rating.mean, bean.rating.median, bean.rating.sd,bean.rating.count ))

# Rename the columns and sort the rows by descending median, then ascending standard deviation.
colnames(bean.rating.final) <- c('Origin.Bean.Type','Mean','Median','St.Dev', 'Number.Reviews')
bean.rating.final <- bean.rating.final[order(-bean.rating.final[,3], bean.rating.final[,4]), ]
bean.rating.final

#>      Origin.Bean.Type     Mean Median    St.Dev Number.Reviews
#> 16            Vietnam 3.315789  3.375 0.3166773             38
#> 3              Brazil 3.284483  3.375 0.4174071             58
#> 1              Belize 3.234694  3.250 0.3203520             49
#> 10          Nicaragua 3.200000  3.250 0.3900239             60
#> 6  Dominican Republic 3.206325  3.250 0.3997792            166
#> 14           Trinidad 3.204545  3.250 0.4072322             33
#> 9          Madagascar 3.265517  3.250 0.4106024            145
#> 11   Papua New Guinea 3.291667  3.250 0.4128894             42
#> 2             Bolivia 3.197368  3.250 0.4218815             57
#> 4            Colombia 3.225000  3.250 0.4414429             40
#> 13           Tanzania 3.205882  3.250 0.4785972             34
#> 15          Venezuela 3.241860  3.250 0.5002254            215
#> 12               Peru 3.137879  3.250 0.5018305            165
#> 7             Ecuador 3.134715  3.250 0.5272393            193
#> 5          Costa Rica 3.144737  3.125 0.4296116             38
#> 8               Ghana 3.090909  3.000 0.5513413             33
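The four `aggregate()` calls and the `Reduce()`/`merge()` step could also be collapsed into a single call. This is a minimal sketch on made-up data, not the method used above; `toy` and `summarise.rating` are hypothetical names.

```r
# Toy ratings for two hypothetical origins
toy <- data.frame(
  Broad.Bean.Origin = c("A", "A", "A", "B", "B", "B"),
  Rating            = c(3.0, 3.5, 4.0, 2.5, 2.75, 3.0)
)

# One function returning all four statistics as a named vector
summarise.rating <- function(r) {
  c(Mean = mean(r), Median = median(r), St.Dev = sd(r), Number.Reviews = length(r))
}

toy.summary <- aggregate(Rating ~ Broad.Bean.Origin, toy, summarise.rating)

# aggregate() stores the four statistics as one matrix column; flatten it
toy.summary <- do.call(data.frame, toy.summary)

# Sort by descending median, then ascending standard deviation
toy.summary <- toy.summary[order(-toy.summary$Rating.Median,
                                 toy.summary$Rating.St.Dev), ]
toy.summary
```

The flattened columns are named `Rating.Mean`, `Rating.Median`, `Rating.St.Dev` and `Rating.Number.Reviews`, so they can be renamed with `colnames()` as before.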

For visualization: construct a boxplot of rating by bean origin, then extract the outliers.

# Creating boxplot of beans origin and its rating. 
boxplot.bean<- plot(y=Choco.bean.rating$Rating, x=Choco.bean.rating$Broad.Bean.Origin,
        xlab = "origin Bean", ylab = "Rating",
        col = c("red", "yellow", "blue", "brown", "orange", "green", "violet", "light yellow", "pink", "light green", "white",  "purple", "grey", "dark blue", "red","dark green"),
        main = "Boxplot Origin Bean to Rating")

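# Pull out the outliers. Note: this matches on the rating value only, so it keeps
# every row whose rating equals an outlier value found in any origin.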
out.bean <- Choco.bean.rating[Choco.bean.rating$Rating %in% boxplot.bean$out, c("Company.Name", "Cocoa.Percent", "Rating", "Broad.Bean.Origin")]
head(out.bean[order(-out.bean$Rating),])

#>         Company.Name Cocoa.Percent Rating Broad.Bean.Origin
#> 79            Amedei            70      5         Venezuela
#> 287      Cacao Barry            75      2          Tanzania
#> 288      Cacao Barry            72      2         Venezuela
#> 340 Caoni (Tulicorp)            77      2           Ecuador
#> 628           Escazu            65      2        Costa Rica
#> 677     French Broad            81      2              Peru
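The `%in%` filter above compares rating values only, so a rating that is an outlier for one origin is also picked up for origins where it lies inside the whiskers. A minimal sketch of per-group matching follows. It uses made-up data and relies on the `group` and `names` components that `boxplot()` returns; `toy`, `out.key` and `row.key` are hypothetical names.

```r
# Toy data: a rating of 1 is an outlier for origin A but not for origin B
toy <- data.frame(
  Origin = factor(rep(c("A", "B"), each = 7)),
  Rating = c(3, 3, 3.25, 3.25, 3.5, 3.5, 1,
             1, 1.5, 2, 2, 2.5, 3, 3)
)

# Compute the boxplot statistics without drawing the plot
stats <- boxplot(Rating ~ Origin, data = toy, plot = FALSE)

# Build "origin rating" keys for the outliers and for every row
out.key <- paste(stats$names[stats$group], stats$out)
row.key <- paste(toy$Origin, toy$Rating)

# Keep a row only if its rating is an outlier within its own origin
toy.out <- toy[row.key %in% out.key, ]
toy.out
```

Here only the row from origin A is kept, whereas matching on the rating value alone would also return the rating of 1 from origin B.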

Insight:

The median is used to measure the central tendency of the ratings because it is robust to outliers.

Based on the bean.rating.final data frame, two countries (Vietnam and Brazil) share the highest median rating, 3.375. However, Vietnam has the lower standard deviation, meaning its ratings are less spread out. Thus, Vietnam is the origin with the best-rated cocoa beans by median rating, followed by Brazil, Belize, Nicaragua, and the Dominican Republic.

Most of the bean origins (11 of the 16) are located in Latin America and the Caribbean.

Based on the boxplot, cocoa beans from these origins are mostly rated between 3 and 3.75, which can be categorized as “Satisfactory” to “Praiseworthy”.

There are several outliers, most of them with a low rating between 1 and 2. Only one outlier has a perfect score of 5: a bar made from cocoa beans from Venezuela.

Question No.2

We will treat question No.2 in the same way as the first question. First, keep only the company locations that have been reviewed more than 30 times and assign the result to an object called Choco.com.

# Create a data frame called Choco.comp holding the number of reviews per company location.
Choco.comp <- as.data.frame(table(Choco$Company.Location))
colnames(Choco.comp) <- c("Company.Location", "Number.Review")

# Keep only the locations with more than 30 reviews, assign the result to Choco.com, and sort by number of reviews.
Choco.com <- Choco.comp[Choco.comp$Number.Review > 30, ]
Choco.com <- Choco.com[order(-Choco.com$Number.Review),]
Choco.com

#>    Company.Location Number.Review
#> 53           U.S.A.           764
#> 17           France           156
#> 7            Canada           125
#> 52             U.K.           107
#> 28            Italy            63
#> 14          Ecuador            55
#> 2         Australia            49
#> 4           Belgium            40
#> 51      Switzerland            38
#> 18          Germany            35

Ten company locations have more than 30 reviews. The United States stands out with by far the most reviews. This could be because the US has many chocolate companies, or because a relatively small number of companies were reviewed repeatedly over the years. In other words, we want to know how the number of reviews relates to the number of companies in each country.

To answer this question, first build a frequency table of company name against company location and store it as a data frame called company. It lists every combination of company and location with its number of reviews. A count of 0 means the company is not located in that country, so drop the rows where the number of reviews is 0.

company <- as.data.frame(table(Choco$Company.Name, Choco$Company.Location))
colnames(company) <- c("Company.Name", "Company.Location", "Number.Reviews")
company <- company[company$Number.Reviews != 0, ]
head(company)

#>                        Company.Name Company.Location Number.Reviews
#> 103 Compania de Chocolate (Salgado)        Argentina              5
#> 337                         Salgado        Argentina              4
#> 439                     Bahen & Co.        Australia              5
#> 464                          Bright        Australia              4
#> 524                          Cravve        Australia              7
#> 526                        Daintree        Australia              2

From the company data frame, create a frequency table of company location and store it as a data frame called country. It gives the number of companies in each country. Then merge country with Choco.comp so that the number of reviews and the number of companies appear side by side for each country, and assign the result to a data frame called com.mer. Finally, construct a scatter plot to visualize the relationship.

country <- as.data.frame(table(company$Company.Location))
colnames(country) <- c("Company.Location", "Number.Companies")

com.mer <- merge(Choco.comp, country, by="Company.Location")
head(com.mer[order(-com.mer$Number.Companies),])

#>    Company.Location Number.Review Number.Companies
#> 53           U.S.A.           764              175
#> 52             U.K.           107               24
#> 17           France           156               22
#> 7            Canada           125               20
#> 14          Ecuador            55               14
#> 2         Australia            49               10

# Create a scatter plot 
plot(x=com.mer$Number.Review, y=com.mer$Number.Companies,
     xlab = "Number of Review", ylab = "Number of Companies",
     main = "Scatterplot Number of Review and Number of Company",
     pch = 9)

abline(lm(formula = Number.Companies~Number.Review,
       data=com.mer),
       col=10, 
       lwd=2, 
       lty=3)
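The strength of the relationship drawn above can also be put into a number with `cor()`. The sketch below uses only the six rows of com.mer printed earlier, so the values are illustrative and not the result for the full table; `top6` is a hypothetical name.

```r
# The six rows of com.mer shown in the printed output above
top6 <- data.frame(
  Company.Location = c("U.S.A.", "U.K.", "France", "Canada", "Ecuador", "Australia"),
  Number.Review    = c(764, 107, 156, 125, 55, 49),
  Number.Companies = c(175, 24, 22, 20, 14, 10)
)

# Pearson correlation: sensitive to the single very large U.S.A. point
r.pearson <- cor(top6$Number.Review, top6$Number.Companies)

# Spearman rank correlation: based on ranks, so less driven by one extreme value
r.spearman <- cor(top6$Number.Review, top6$Number.Companies, method = "spearman")

c(pearson = r.pearson, spearman = r.spearman)
```

Comparing the two coefficients is one way to see how much of the apparent correlation comes from the U.S.A. alone.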

Insight: Based on the scatter plot and the com.mer data frame, the U.S.A. has by far the largest number of chocolate companies (175) and of reviews (764, around 42% of the total samples). The scatter plot suggests a strong positive relationship between the number of reviews and the number of companies in each country, so the high review count for the United States reflects its large number of companies rather than repeated reviews of a few. But does having the most reviews and the most chocolate bar companies mean that its chocolate bars also have a high rating? To find the answer, let’s move on!

Now, create a data frame called Choco.com.rating containing only the reviews from the company locations listed in Choco.com$Company.Location.

Choco.com.rating <- Choco[Choco$Company.Location %in% Choco.com$Company.Location,]
Choco.com.rating$Company.Location <- droplevels(as.factor(Choco.com.rating$Company.Location))
head(Choco.com.rating)

#>   Company.Name  Region.Bar  REF Review.Date Cocoa.Percent Company.Location
#> 1     A. Morin Agua Grande 1876  2016-02-15            63           France
#> 2     A. Morin       Kpime 1676  2015-02-15            70           France
#> 3     A. Morin      Atsane 1676  2015-02-15            70           France
#> 4     A. Morin       Akata 1680  2015-02-15            70           France
#> 5     A. Morin      Quilla 1704  2015-02-15            70           France
#> 6     A. Morin    Carenero 1315  2014-02-15            70           France
#>   Rating Bean.Type Broad.Bean.Origin
#> 1   3.75                    Sao Tome
#> 2   2.75                        Togo
#> 3   3.00                        Togo
#> 4   3.50                        Togo
#> 5   3.50                        Peru
#> 6   2.75   Criollo         Venezuela

Compute the median, standard deviation and number of reviews per company location from Choco.com.rating, and combine them into a data frame called com.final.

# Compute the median, standard deviation and number of reviews of the rating per company location
com.median <- aggregate(Rating ~ Company.Location, Choco.com.rating, median)
com.sd <- aggregate(Rating ~ Company.Location, Choco.com.rating, sd)
com.count <- aggregate(Rating ~ Company.Location, Choco.com.rating, length)

# Merge the three summaries into one data frame.
MyMerge <- function(x, y){
  df <- merge(x, y, by= "Company.Location", all.x= TRUE, all.y= TRUE)
  return(df)
}
com.final <- Reduce(MyMerge, list(com.median, com.sd, com.count ))

# Rename the columns and sort the rows by descending median, then ascending standard deviation
colnames(com.final) <- c('Company.Location','Median','St.Dev', 'Number.Reviews')
com.final[order(-com.final[,2], com.final[,3]), ]

#>    Company.Location Median    St.Dev Number.Reviews
#> 1         Australia   3.50 0.4177070             49
#> 3            Canada   3.25 0.4236268            125
#> 10           U.S.A.   3.25 0.4419656            764
#> 8       Switzerland   3.25 0.4665176             38
#> 6           Germany   3.25 0.4757789             35
#> 5            France   3.25 0.5466148            156
#> 7             Italy   3.25 0.5984437             63
#> 2           Belgium   3.25 0.8178448             40
#> 9              U.K.   3.00 0.5004847            107
#> 4           Ecuador   3.00 0.5630679             55

Create a new column called rating_int in Choco.com.rating and construct a mosaic plot of rating interval against company location. The area of each rectangle is proportional to the number of reviews in that combination of location and interval.

# Create a function to classify the rating into right-closed intervals,
# e.g. a rating of exactly 3 falls in "2 - 3" and a rating of exactly 4 in "3 - 4".
convert_int <- function(y){ 
    if(y <= 2){
      "0 - 2" 
    }else if(y <= 3){
      "2 - 3" 
    }else if(y <= 4){
      "3 - 4" 
    }else{
      "4 - 5" 
    }  
}

# Create a new column called rating_int and construct a mosaic plot. 
Choco.com.rating$rating_int <- sapply(X = Choco.com.rating$Rating, FUN = convert_int) 
Choco.com.rating$rating_int  <- as.factor(Choco.com.rating$rating_int)
plot(xtabs(~ Company.Location + rating_int, Choco.com.rating),
        col = c("green", "orange", "red", "light blue"),
        xlab = "Company Location", ylab = "Interval Rating",
        main = "Mosaic plot for Company Location and Interval of Rating")
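As a vectorized alternative to classifying ratings one at a time with `sapply()`, base R's `cut()` can bin the whole column in one call. This is a minimal sketch on made-up ratings, assuming right-closed intervals (a rating of exactly 3 goes to "2 - 3"); `ratings` is a hypothetical vector.

```r
# Made-up ratings covering the interval boundaries
ratings <- c(1, 2, 2.75, 3, 3.25, 4, 5)

# Bin into (0,2], (2,3], (3,4], (4,5]; include.lowest also admits a rating of 0
rating_int <- cut(ratings,
                  breaks = c(0, 2, 3, 4, 5),
                  labels = c("0 - 2", "2 - 3", "3 - 4", "4 - 5"),
                  include.lowest = TRUE)

data.frame(ratings, rating_int)
```

`cut()` returns a factor directly, so no separate `as.factor()` step is needed, and the boundary behavior is stated in one place.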

Insights:

Based on the mosaic plot, the United States has the largest share of the reviews, shown by its column being the widest.

In each country, the largest proportion of ratings falls in the interval between 3 and 4, whereas ratings between 0 and 2 are rare.

Based on the com.final data frame, the median ratings of the 10 company locations fall in a narrow range, from 3 to 3.5.

Australia is the country that produces chocolate bars with the highest median rating, followed by Canada, the U.S.A., Switzerland and Germany.

Question No.3

Check the correlation value by applying cor() and construct a scatter plot to observe the correlation between rating and percentage of cocoa.

# Check correlation
cor(Choco$Cocoa.Percent, Choco$Rating)

#> [1] -0.1648202

# Create a scatter plot
plot(x=Choco$Cocoa.Percent, y=Choco$Rating,
     xlab = "Percentage of Cocoa", ylab = "Rating",
     main = "Scatterplot Cocoa's Percentage and Rating",
     pch = 9)

abline(lm(formula = Rating ~ Cocoa.Percent,
       data=Choco),
       col=10, 
       lwd=2, 
       lty=3)

The scatter plot shows a weak negative correlation, close to none, between cocoa percentage and rating. We can explore this in more detail with a boxplot of rating by cocoa percentage interval.

# Create a function convert_coc (interval of cocoa's percentage)
convert_coc <- function(y){ 
    if(y <= 50){
      y <- "0-50" 
    }else 
      if(y > 50 & y <= 60){
      y <- "50-60" 
    }else 
      if(y > 60 & y <= 70){
      y <- "60-70" 
    }else 
      if(y > 70 & y <= 80){
      y <- "70-80" 
    }else 
      if(y > 80 & y <= 90){
      y <- "80-90" 
    }else{
      y <- "90-100" 
    }  
}

# Apply the function to every cocoa percentage and store the result in a new column called cocoa_int
Choco$cocoa_int <- sapply(X = Choco$Cocoa.Percent, FUN = convert_coc) 
Choco$cocoa_int <- as.factor(Choco$cocoa_int)

# Create a boxplot
plot(x=Choco$cocoa_int,
     y=Choco$Rating,
     xlab = "Cocoa Percentage Interval", ylab = "Rating",
     col = c("red", "yellow", "blue", "brown", "orange", "green", "violet"),
     main = "Boxplot Percentage of Cocoa to Rating")
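The medians read off the boxplot can be checked numerically with `tapply()`. The sketch below runs on made-up ratings, so its numbers are not the dataset's; `toy`, `med` and `n` are hypothetical names.

```r
# Made-up ratings for three cocoa percentage intervals
toy <- data.frame(
  cocoa_int = factor(c("60-70", "60-70", "70-80", "70-80", "70-80",
                       "90-100", "90-100")),
  Rating    = c(3.25, 3.5, 3, 3.25, 3.75, 1.5, 2.5)
)

# Median rating per interval
med <- tapply(toy$Rating, toy$cocoa_int, median)

# Number of bars per interval, to judge how reliable each median is
n <- table(toy$cocoa_int)

data.frame(Median = med, Number.Reviews = as.vector(n))
```

Printing the group sizes next to the medians helps when an interval, such as the highest one, contains only a few bars.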

Insights:

There is a weak negative correlation between the rating and the cocoa percentage, which is confirmed by the correlation coefficient of about -0.16.

Based on the scatter plot, the cocoa content is concentrated between 60 and 80%, with ratings between 2.5 and 4.

According to the boxplot, the chocolate bars with 60-70% and 70-80% cocoa have the highest median rating, at approximately 3.25.

The highest interval of cocoa content (90-100%) has the lowest median rating, at 2, and also a wide spread.

The boxplot shows the rating rising from low cocoa percentages up to 60-80% cocoa content, then falling for higher cocoa percentages.

Outliers mostly fall at low ratings.

Conclusion

Where are cocoa beans with the highest average rating grown?

Cocoa beans with the highest median rating are harvested in Vietnam, followed by Brazil, Belize, Nicaragua, and the Dominican Republic. According to Legacy Chocolate, cocoa beans are mostly grown in tropical regions. Cocoa beans require constant warm temperatures between 65 and 90 degrees Fahrenheit to survive, with annual rainfall of around 40-100 inches.

Which countries manufacture chocolate bars with the highest average rating?

The U.S.A. has the largest number of reviewed chocolate bars, accounting for around 42% of the total samples. Australia manufactures the chocolate bars with the highest median rating, at 3.5, followed by Canada, whose median of 3.25 comes with the lowest standard deviation among the countries tied at that value.

What is the relationship between a chocolate bar’s cocoa percentage and its rating?

There is a weak negative correlation between rating and cocoa percentage. Chocolate bars with a cocoa percentage between 60 and 80% appear to have the highest median rating, as opposed to chocolate bars with a cocoa content between 90 and 100%.