Introduction
Overview of Chocolate
Chocolate is one of the most popular food products in the world. Not only does chocolate have a deliciously sweet taste, but consuming dark chocolate may also reduce the risk of some health conditions. Dark chocolate is also believed to be linked to neurotransmitter systems such as serotonin, endorphins and dopamine, which can improve our mood. If you want to benefit from serotonin, dark chocolate containing 85% cocoa has the highest serotonin level, at approximately 2.9 micrograms per gram (Journal of Chromatography A, 2012).
The history of chocolate can be traced back to Mesoamerica, in present-day Mexico, where chocolate was used during rituals and also served as medicine. Centuries later, the Mayans consumed chocolate as a ritual drink called “xocolatl”. Mayan chocolate was produced from roasted cacao seeds mixed with water, cornmeal and chillies. In 1526, the Spanish explorer Hernán Cortés introduced cocoa seeds to Spain. From then on, chocolate grew in popularity in Europe, especially among the wealthy (History, 2020). In 1828, during the Industrial Revolution, the processing of chocolate was transformed: cocoa powder is produced by pressing the cocoa butter out of roasted cocoa beans, and the powder, mixed with liquids, is solidified into a chocolate bar. Today, chocolate is served in many forms, such as ice cream, cookies and milk.
Dataset Information
The dataset is called Flavors of Cacao. Its ratings were originally collected and compiled by Brady Brelinski, a Founding Member of the Manhattan Chocolate Society. In this project, the dataset is taken from the Kaggle website, where the data has been modified. The dataset focuses solely on dark chocolate in order to appreciate the original flavours of the cocoa. It includes expert ratings of more than 1,700 chocolate bars reviewed between 2005 and 2016, together with the region of origin, the variety of bean, and the percentage of cocoa. The Flavors of Cacao rating system is divided into five classes. The ratings themselves do not reflect health benefits or organic content.
- 5 = Elite (transcending beyond the ordinary limits)
- 4 = Premium (superior flavour development, character and style)
- 3 = Satisfactory (3.0) to Praiseworthy (3.75) (well made with special qualities)
- 2 = Disappointing (passable but contains at least one significant flaw)
- 1 = Unpleasant (mostly unpalatable)
Project Expectation
This project is part of the Learn by Building (LBB) assignment. It applies what we have learned about R Markdown, exploratory data analysis in R, and publishing to RPubs. The goals of the project are:
- Discover where the cocoa beans with the highest average rating are grown.
- Discover which countries manufacture the chocolate bars with the highest average rating.
- Discover the relationship between the cocoa percentage of a chocolate bar and its rating.
Data Observation
First, import the dataset flavors_of_cacao.csv from the working directory and assign it to an object called Choco. Then, print the first six rows by using the head() function.
Choco <- read.csv("flavors_of_cacao.csv")
head(Choco)
#> Company...Maker.if.known. Specific.Bean.Origin.or.Bar.Name REF Review.Date
#> 1 A. Morin Agua Grande 1876 2016
#> 2 A. Morin Kpime 1676 2015
#> 3 A. Morin Atsane 1676 2015
#> 4 A. Morin Akata 1680 2015
#> 5 A. Morin Quilla 1704 2015
#> 6 A. Morin Carenero 1315 2014
#> Cocoa.Percent Company.Location Rating Bean.Type Broad.Bean.Origin
#> 1 63% France 3.75 Sao Tome
#> 2 70% France 2.75 Togo
#> 3 70% France 3.00 Togo
#> 4 70% France 3.50 Togo
#> 5 70% France 3.50 Peru
#> 6 70% France 2.75 Criollo Venezuela
Inspect the data by using str().
- We found that there are 1,795 observations with 9 columns.
- The data description :
- Company...Maker.if.known. is the name of the company manufacturing the chocolate bar.
- Specific.Bean.Origin.or.Bar.Name is the specific region of origin of the chocolate bar.
- REF is a review identification number.
- Review.Date is the publication date of the review.
- Cocoa.Percent is the percentage cocoa content in the chocolate bar.
- Company.Location is the location of the company.
- Rating is the expert rating as mentioned above.
- Bean.Type is the variety of beans used.
- Broad.Bean.Origin is the region of origin of the bean.
str(Choco)
#> 'data.frame': 1795 obs. of 9 variables:
#> $ Company...Maker.if.known. : chr "A. Morin" "A. Morin" "A. Morin" "A. Morin" ...
#> $ Specific.Bean.Origin.or.Bar.Name: chr "Agua Grande" "Kpime" "Atsane" "Akata" ...
#> $ REF : int 1876 1676 1676 1680 1704 1315 1315 1315 1319 1319 ...
#> $ Review.Date : int 2016 2015 2015 2015 2015 2014 2014 2014 2014 2014 ...
#> $ Cocoa.Percent : chr "63%" "70%" "70%" "70%" ...
#> $ Company.Location : chr "France" "France" "France" "France" ...
#> $ Rating : num 3.75 2.75 3 3.5 3.5 2.75 3.5 3.5 3.75 4 ...
#> $ Bean.Type : chr " " " " " " " " ...
#> $ Broad.Bean.Origin : chr "Sao Tome" "Togo" "Togo" "Togo" ...
Observe Rating and Cocoa.Percent columns.
- The rating has a range between 1 and 5 with the median and mean being relatively close.
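- The cocoa percentage ranges between 42 and 100 with the median and mean being around 70. (Cocoa.Percent is still stored as text at this point, so summary() only reports its length and class; the range becomes visible once the column is converted to numeric.)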
summary(Choco)
#> Company...Maker.if.known. Specific.Bean.Origin.or.Bar.Name REF
#> Length:1795 Length:1795 Min. : 5
#> Class :character Class :character 1st Qu.: 576
#> Mode :character Mode :character Median :1069
#> Mean :1036
#> 3rd Qu.:1502
#> Max. :1952
#> Review.Date Cocoa.Percent Company.Location Rating
#> Min. :2006 Length:1795 Length:1795 Min. :1.000
#> 1st Qu.:2010 Class :character Class :character 1st Qu.:2.875
#> Median :2013 Mode :character Mode :character Median :3.250
#> Mean :2012 Mean :3.186
#> 3rd Qu.:2015 3rd Qu.:3.500
#> Max. :2017 Max. :5.000
#> Bean.Type Broad.Bean.Origin
#> Length:1795 Length:1795
#> Class :character Class :character
#> Mode :character Mode :character
#>
#>
#>
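Because Cocoa.Percent is a character column at this stage, its range can only be checked after stripping the % sign. The snippet below is a minimal sketch of that check on a small made-up vector (cocoa_chr is a hypothetical stand-in, not the real column):

```r
# Hypothetical stand-in for Choco$Cocoa.Percent (values invented for illustration)
cocoa_chr <- c("63%", "70%", "70%", "100%", "42%")

# Remove the literal "%" and convert the remaining text to numbers
cocoa_num <- as.numeric(sub("%", "", cocoa_chr, fixed = TRUE))

# Numeric summary: minimum, quartiles, mean and maximum
summary(cocoa_num)
range(cocoa_num)
```

Applied to the real column, the same two steps would show the range and central values quoted above without modifying Choco itself.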
Data Cleaning
The objectives of data cleaning in this project are:
- Rename variable names if necessary
- Observe and fix any misspelling or typos.
- Verify and transform data type.
- Observe for any missing values
Rename Columns
Two columns have long names. Let’s rename them and then confirm the change by applying head().
- Company...Maker.if.known. –> Company.Name
- Specific.Bean.Origin.or.Bar.Name –> Region.Bar
names(Choco)[names(Choco) == "Company...Maker.if.known."] <- "Company.Name"
names(Choco)[names(Choco) == "Specific.Bean.Origin.or.Bar.Name"] <- "Region.Bar"
head(Choco)
#> Company.Name Region.Bar REF Review.Date Cocoa.Percent Company.Location
#> 1 A. Morin Agua Grande 1876 2016 63% France
#> 2 A. Morin Kpime 1676 2015 70% France
#> 3 A. Morin Atsane 1676 2015 70% France
#> 4 A. Morin Akata 1680 2015 70% France
#> 5 A. Morin Quilla 1704 2015 70% France
#> 6 A. Morin Carenero 1315 2014 70% France
#> Rating Bean.Type Broad.Bean.Origin
#> 1 3.75 Sao Tome
#> 2 2.75 Togo
#> 3 3.00 Togo
#> 4 3.50 Togo
#> 5 3.50 Peru
#> 6 2.75 Criollo Venezuela
Misspelling and validity of data.
Company.Location
First, we observe company location values by using unique() and sort().
sort(unique(Choco$Company.Location))
#> [1] "Amsterdam" "Argentina" "Australia"
#> [4] "Austria" "Belgium" "Bolivia"
#> [7] "Brazil" "Canada" "Chile"
#> [10] "Colombia" "Costa Rica" "Czech Republic"
#> [13] "Denmark" "Domincan Republic" "Ecuador"
#> [16] "Eucador" "Fiji" "Finland"
#> [19] "France" "Germany" "Ghana"
#> [22] "Grenada" "Guatemala" "Honduras"
#> [25] "Hungary" "Iceland" "India"
#> [28] "Ireland" "Israel" "Italy"
#> [31] "Japan" "Lithuania" "Madagascar"
#> [34] "Martinique" "Mexico" "Netherlands"
#> [37] "New Zealand" "Niacragua" "Nicaragua"
#> [40] "Peru" "Philippines" "Poland"
#> [43] "Portugal" "Puerto Rico" "Russia"
#> [46] "Sao Tome" "Scotland" "Singapore"
#> [49] "South Africa" "South Korea" "Spain"
#> [52] "St. Lucia" "Suriname" "Sweden"
#> [55] "Switzerland" "U.K." "U.S.A."
#> [58] "Venezuela" "Vietnam" "Wales"
We found that some of the values are incorrect and others are misspelled.
- Amsterdam is not a country and should be categorized as the Netherlands.
- Wales and Scotland are part of the U.K.
- Some misspellings:
- Niacragua should be Nicaragua.
- Domincan Republic should be Dominican Republic.
- Eucador should be Ecuador.
Let’s fix them. Originally, Company.Location had 60 unique values; after cleaning, it has 55.
Choco$Company.Location[Choco$Company.Location == "Amsterdam"] <- "Netherlands"
Choco$Company.Location[Choco$Company.Location == "Niacragua"] <- "Nicaragua"
Choco$Company.Location[Choco$Company.Location == "Domincan Republic"] <- "Dominican Republic"
Choco$Company.Location[Choco$Company.Location == "Eucador"] <- "Ecuador"
Choco$Company.Location[Choco$Company.Location == "Wales" | Choco$Company.Location == "Scotland"] <- "U.K."
sort(unique(Choco$Company.Location))
#> [1] "Argentina" "Australia" "Austria"
#> [4] "Belgium" "Bolivia" "Brazil"
#> [7] "Canada" "Chile" "Colombia"
#> [10] "Costa Rica" "Czech Republic" "Denmark"
#> [13] "Dominican Republic" "Ecuador" "Fiji"
#> [16] "Finland" "France" "Germany"
#> [19] "Ghana" "Grenada" "Guatemala"
#> [22] "Honduras" "Hungary" "Iceland"
#> [25] "India" "Ireland" "Israel"
#> [28] "Italy" "Japan" "Lithuania"
#> [31] "Madagascar" "Martinique" "Mexico"
#> [34] "Netherlands" "New Zealand" "Nicaragua"
#> [37] "Peru" "Philippines" "Poland"
#> [40] "Portugal" "Puerto Rico" "Russia"
#> [43] "Sao Tome" "Singapore" "South Africa"
#> [46] "South Korea" "Spain" "St. Lucia"
#> [49] "Suriname" "Sweden" "Switzerland"
#> [52] "U.K." "U.S.A." "Venezuela"
#> [55] "Vietnam"
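The one-assignment-per-value corrections above can also be written as a single lookup table. This is only a sketch of that alternative on a toy vector; loc and fixes are hypothetical names, and the mapping simply repeats the corrections listed above.

```r
# Toy stand-in for Choco$Company.Location (hypothetical values)
loc <- c("Amsterdam", "France", "Eucador", "Wales", "Niacragua", "U.K.")

# Named vector: names are the wrong spellings, values are the corrections
fixes <- c(
  "Amsterdam"         = "Netherlands",
  "Niacragua"         = "Nicaragua",
  "Domincan Republic" = "Dominican Republic",
  "Eucador"           = "Ecuador",
  "Wales"             = "U.K.",
  "Scotland"          = "U.K."
)

# Replace only the entries that appear in the lookup table
hit <- loc %in% names(fixes)
loc[hit] <- unname(fixes[loc[hit]])
loc
```

Keeping all corrections in one named vector makes it easier to spot duplicates and to add a new correction later.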
Broad.Bean.Origin
Next, we check the Broad.Bean.Origin column by using unique() and sort().
sort(unique(Choco$Broad.Bean.Origin))
#> [1] "" " "
#> [3] "Africa, Carribean, C. Am." "Australia"
#> [5] "Belize" "Bolivia"
#> [7] "Brazil" "Burma"
#> [9] "Cameroon" "Carribean"
#> [11] "Carribean(DR/Jam/Tri)" "Central and S. America"
#> [13] "Colombia" "Colombia, Ecuador"
#> [15] "Congo" "Cost Rica, Ven"
#> [17] "Costa Rica" "Cuba"
#> [19] "Dom. Rep., Madagascar" "Domincan Republic"
#> [21] "Dominican Rep., Bali" "Dominican Republic"
#> [23] "DR, Ecuador, Peru" "Ecuador"
#> [25] "Ecuador, Costa Rica" "Ecuador, Mad., PNG"
#> [27] "El Salvador" "Fiji"
#> [29] "Gabon" "Ghana"
#> [31] "Ghana & Madagascar" "Ghana, Domin. Rep"
#> [33] "Ghana, Panama, Ecuador" "Gre., PNG, Haw., Haiti, Mad"
#> [35] "Grenada" "Guat., D.R., Peru, Mad., PNG"
#> [37] "Guatemala" "Haiti"
#> [39] "Hawaii" "Honduras"
#> [41] "India" "Indonesia"
#> [43] "Indonesia, Ghana" "Ivory Coast"
#> [45] "Jamaica" "Liberia"
#> [47] "Mad., Java, PNG" "Madagascar"
#> [49] "Madagascar & Ecuador" "Malaysia"
#> [51] "Martinique" "Mexico"
#> [53] "Nicaragua" "Nigeria"
#> [55] "Panama" "Papua New Guinea"
#> [57] "Peru" "Peru, Belize"
#> [59] "Peru, Dom. Rep" "Peru, Ecuador"
#> [61] "Peru, Ecuador, Venezuela" "Peru, Mad., Dom. Rep."
#> [63] "Peru, Madagascar" "Peru(SMartin,Pangoa,nacional)"
#> [65] "Philippines" "PNG, Vanuatu, Mad"
#> [67] "Principe" "Puerto Rico"
#> [69] "Samoa" "Sao Tome"
#> [71] "Sao Tome & Principe" "Solomon Islands"
#> [73] "South America" "South America, Africa"
#> [75] "Sri Lanka" "St. Lucia"
#> [77] "Suriname" "Tanzania"
#> [79] "Tobago" "Togo"
#> [81] "Trinidad" "Trinidad-Tobago"
#> [83] "Trinidad, Ecuador" "Trinidad, Tobago"
#> [85] "Uganda" "Vanuatu"
#> [87] "Ven, Bolivia, D.R." "Ven, Trinidad, Ecuador"
#> [89] "Ven., Indonesia, Ecuad." "Ven., Trinidad, Mad."
#> [91] "Ven.,Ecu.,Peru,Nic." "Venez,Africa,Brasil,Peru,Mex"
#> [93] "Venezuela" "Venezuela, Carribean"
#> [95] "Venezuela, Dom. Rep." "Venezuela, Ghana"
#> [97] "Venezuela, Java" "Venezuela, Trinidad"
#> [99] "Venezuela/ Ghana" "Vietnam"
#> [101] "West Africa"
We found that some of the names are inconsistent or misspelled. We also need to remove characters such as /, () and &. Let’s clean the column to make it more readable and consistent.
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Ven., Indonesia, Ecuad."] <- "Venezuela, Indonesia, Ecuador"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Venezuela, Dom. Rep."] <- "Venezuela, Dominican Republic"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Venezuela/ Ghana"] <- "Venezuela, Ghana"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Ven., Trinidad, Mad."] <- "Venezuela, Trinidad, Madagascar"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Ven.,Ecu.,Peru,Nic."] <- "Venezuela, Ecuador, Peru, Nicaragua"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Peru, Dom. Rep"] <- "Peru, Dominican Republic"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Mad., Java, PNG"] <- "Madagascar, Java, PNG"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Dominican Rep., Bali"] <- "Dominican Republic, Bali"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Sao Tome & Principe"] <- "Sao Tome, Principe"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Africa, Carribean, C. Am."] <- "Africa, Carribean, Central America"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Trinidad-Tobago"] <- "Trinidad, Tobago"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "PNG, Vanuatu, Mad"] <- "PNG, Vanuatu, Madagascar"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Carribean(DR/Jam/Tri)"] <- "Carribean, Dominican Republic, Jamaica, Trinidad"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Madagascar & Ecuador"] <- "Madagascar, Ecuador"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Dom. Rep., Madagascar"] <- "Dominican Republic, Madagascar"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Domincan Republic"] <- "Dominican Republic"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Venez,Africa,Brasil,Peru,Mex"] <- "Venezuela, Africa, Brazil, Peru, Mexico"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Peru, Mad., Dom. Rep."] <- "Peru, Madagascar, Dominican Republic"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Ghana, Domin. Rep"] <- "Ghana, Dominican Republic"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Central and S. America"] <- "Central America, South America"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Ghana & Madagascar"] <- "Ghana, Madagascar"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Guat., D.R., Peru, Mad., PNG"] <- "Guatemala, Dominican Republic, Peru, Madagascar, PNG"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Gre., PNG, Haw., Haiti, Mad"] <- "Grenada, PNG, Hawaii, Haiti, Madagascar"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "DR, Ecuador, Peru"] <- "Dominican Republic, Ecuador, Peru"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Ecuador, Mad., PNG"] <- "Ecuador, Madagascar, PNG"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Cost Rica, Ven"] <- "Costa Rica, Venezuela"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Ven, Bolivia, D.R."] <- "Venezuela, Bolivia, Dominican Republic"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Peru(SMartin,Pangoa,nacional)"] <- "Peru, St. Martin, Pangoa, Nacional"
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == "Ven, Trinidad, Ecuador"] <- "Venezuela, Trinidad, Ecuador"
# Rows whose origin is an empty string are assigned to Venezuela
Choco$Broad.Bean.Origin[Choco$Broad.Bean.Origin == ""] <- "Venezuela"
head(sort(unique(Choco$Broad.Bean.Origin)),10)
#> [1] " "
#> [2] "Africa, Carribean, Central America"
#> [3] "Australia"
#> [4] "Belize"
#> [5] "Bolivia"
#> [6] "Brazil"
#> [7] "Burma"
#> [8] "Cameroon"
#> [9] "Carribean"
#> [10] "Carribean, Dominican Republic, Jamaica, Trinidad"
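Part of this cleaning (the separators /, & and -, uneven spacing after commas, and trailing spaces) follows a regular pattern, so it could be handled with regular expressions before the abbreviations are expanded by hand. The sketch below shows that idea on a few values taken from the list above; it is an alternative illustration, not the method used in this project, and origin and clean are hypothetical names.

```r
# A few raw values copied from the unique() listing above
origin <- c("Venezuela/ Ghana", "Ghana & Madagascar", "Trinidad-Tobago",
            "Madagascar, Ecuador ", "Venez,Africa,Brasil")

# 1. Turn "/", "&" and "-" (with any surrounding spaces) into ", "
clean <- gsub("\\s*[/&-]\\s*", ", ", origin)
# 2. Make every comma be followed by exactly one space
clean <- gsub("\\s*,\\s*", ", ", clean)
# 3. Drop leading and trailing spaces
clean <- trimws(clean)
clean
```

Note that the hyphen rule would also split genuinely hyphenated names, so it should only be used after checking the unique values, as done above. Abbreviations such as "Venez" or "Mad." still need an explicit mapping.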
Bean.Type
Now, observe the Bean.Type column by using unique() and sort().
sort(unique(Choco$Bean.Type))
#> [1] "" " "
#> [3] "Amazon" "Amazon mix"
#> [5] "Amazon, ICS" "Beniano"
#> [7] "Blend" "Blend-Forastero,Criollo"
#> [9] "CCN51" "Criollo"
#> [11] "Criollo (Amarru)" "Criollo (Ocumare 61)"
#> [13] "Criollo (Ocumare 67)" "Criollo (Ocumare 77)"
#> [15] "Criollo (Ocumare)" "Criollo (Porcelana)"
#> [17] "Criollo (Wild)" "Criollo, +"
#> [19] "Criollo, Forastero" "Criollo, Trinitario"
#> [21] "EET" "Forastero"
#> [23] "Forastero (Amelonado)" "Forastero (Arriba)"
#> [25] "Forastero (Arriba) ASS" "Forastero (Arriba) ASSS"
#> [27] "Forastero (Catongo)" "Forastero (Nacional)"
#> [29] "Forastero (Parazinho)" "Forastero, Trinitario"
#> [31] "Forastero(Arriba, CCN)" "Matina"
#> [33] "Nacional" "Nacional (Arriba)"
#> [35] "Trinitario" "Trinitario (85% Criollo)"
#> [37] "Trinitario (Amelonado)" "Trinitario (Scavina)"
#> [39] "Trinitario, Criollo" "Trinitario, Forastero"
#> [41] "Trinitario, Nacional" "Trinitario, TCGA"
As we can see above, the bean types are categorized by their main variety and sub-variety. According to an article by Pohlan and Perez and to Bar and Cocoa, the main varieties of the cacao plant are forastero, criollo, trinitario, and nacional. Let’s group some of the values under their main variety.
Choco$Bean.Type[Choco$Bean.Type %in% c("Criollo (Ocumare 67)", "Criollo (Wild)", "Criollo (Ocumare 77)", "Criollo, +" , "Criollo (Amarru)", "Criollo (Ocumare)", "Criollo, Forastero", "Criollo (Ocumare 61)" , "Criollo (Porcelana)", "Criollo, Trinitario")] <- "Criollo"
Choco$Bean.Type[Choco$Bean.Type %in% c("Forastero (Arriba) ASS", "Forastero (Parazinho)", "Forastero (Arriba) ASSS", "Forastero, Trinitario", "Forastero (Amelonado)", "Forastero (Arriba)", "Forastero (Catongo)", "Forastero (Nacional)", "Forastero(Arriba, CCN)", "Blend-Forastero,Criollo")] <- "Forastero"
Choco$Bean.Type[Choco$Bean.Type %in% c("Trinitario (85% Criollo)", "Trinitario (Amelonado)", "Trinitario (Scavina)" , "Trinitario, Criollo", "Trinitario, Forastero","Trinitario, Nacional", "Trinitario, TCGA")] <- "Trinitario"
Choco$Bean.Type[Choco$Bean.Type %in% c("Amazon mix", "Amazon, ICS")] <- "Amazon"
Choco$Bean.Type[Choco$Bean.Type %in% c("Nacional (Arriba)")] <- "Nacional"
sort(unique(Choco$Bean.Type))
#> [1] "" " " "Amazon" "Beniano" "Blend"
#> [6] "CCN51" "Criollo" "EET" "Forastero" "Matina"
#> [11] "Nacional" "Trinitario"
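Since most values start with the main variety followed by a sub-variety, the grouping could also be approximated by keeping only the leading word. The following is a rough sketch of that shortcut (bean and main are hypothetical names). It does not reproduce the manual grouping exactly: for example, it maps "Blend-Forastero,Criollo" to "Blend", whereas the code above assigns it to Forastero.

```r
# A few raw Bean.Type values copied from the listing above
bean <- c("Criollo (Porcelana)", "Forastero(Arriba, CCN)", "Trinitario, TCGA",
          "Amazon mix", "Nacional (Arriba)", "CCN51", "Blend-Forastero,Criollo")

# Keep the leading run of letters and digits, drop everything after it
main <- sub("^([A-Za-z0-9]+).*$", "\\1", bean)
main
```

Values that do not start with a letter or digit (such as the blank entries) are left unchanged by this pattern.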
Datatype Transformation
Several columns should be transformed into a more appropriate data type.
- Company.Name, Region.Bar, Company.Location, Broad.Bean.Origin and Bean.Type are character columns and should be transformed into factors.
- Cocoa.Percent should be changed to numeric after removing the % character.
- Review.Date should be transformed to Date.
The character columns are converted to factors in a single step by using the lapply() function.
Choco[, c("Company.Name", "Region.Bar", "Company.Location", "Broad.Bean.Origin", "Bean.Type" )] <- lapply(Choco[, c("Company.Name", "Region.Bar", "Company.Location", "Broad.Bean.Origin", "Bean.Type" )], as.factor)
Column Cocoa.Percent is transformed from character to numeric. First, we apply the gsub() function to remove the % character and then change the data type by using as.numeric(). Review.Date is converted to Date in the same block. Don’t forget to confirm all of the data type transformations by using str().
# Transform Cocoa.Percent into Numeric type
Choco$Cocoa.Percent <- gsub("%","",as.character(Choco$Cocoa.Percent))
Choco$Cocoa.Percent <- as.numeric(Choco$Cocoa.Percent)
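# Transform Review.Date into Date type. Only the year is supplied, so as.Date() fills in the current month and day (hence the "-02-15" in the output below)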
Choco$Review.Date <- as.Date(as.character(Choco$Review.Date), format="%Y")
str(Choco)
#> 'data.frame': 1795 obs. of 9 variables:
#> $ Company.Name : Factor w/ 416 levels "A. Morin","Acalli",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ Region.Bar : Factor w/ 1039 levels "\"heirloom\", Arriba Nacional",..: 15 494 68 16 813 175 288 923 805 731 ...
#> $ REF : int 1876 1676 1676 1680 1704 1315 1315 1315 1319 1319 ...
#> $ Review.Date : Date, format: "2016-02-15" "2015-02-15" ...
#> $ Cocoa.Percent : num 63 70 70 70 70 70 70 70 70 70 ...
#> $ Company.Location : Factor w/ 55 levels "Argentina","Australia",..: 17 17 17 17 17 17 17 17 17 17 ...
#> $ Rating : num 3.75 2.75 3 3.5 3.5 2.75 3.5 3.5 3.75 4 ...
#> $ Bean.Type : Factor w/ 12 levels ""," ","Amazon",..: 2 2 2 2 2 7 2 7 7 2 ...
#> $ Broad.Bean.Origin: Factor w/ 97 levels " ","Africa, Carribean, Central America",..: 68 78 78 78 55 84 17 84 84 55 ...
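When as.Date() is given only a year with format = "%Y", the missing month and day are taken from the day the code is run, so the result changes from run to run. A reproducible alternative is to pin every review to a fixed day of its year. This is a minimal sketch on made-up years; review_year and review_date are hypothetical names, and 1 January is an arbitrary choice.

```r
# Hypothetical stand-in for the integer Review.Date column
review_year <- c(2016L, 2015L, 2014L)

# Build an unambiguous ISO date string (year-01-01) before converting
review_date <- as.Date(paste0(review_year, "-01-01"))
review_date

# The year can always be recovered from the Date
format(review_date, "%Y")
```

If only the year is ever needed, keeping Review.Date as an integer would be an equally reasonable choice.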
Drop Levels
For data visualization purposes, we need to drop unused levels for the columns Broad.Bean.Origin, Bean.Type and Company.Location.
Choco$Broad.Bean.Origin <- droplevels(Choco$Broad.Bean.Origin)
Choco$Bean.Type <- droplevels(Choco$Bean.Type)
Choco$Company.Location <- droplevels(Choco$Company.Location)
Treatment Missing Values
Check missing values for all columns by using colSums() and is.na(). We found no missing values in the data frame.
colSums(is.na(Choco))
#> Company.Name Region.Bar REF Review.Date
#> 0 0 0 0
#> Cocoa.Percent Company.Location Rating Bean.Type
#> 0 0 0 0
#> Broad.Bean.Origin
#> 0
Interestingly, although is.na() reports no missing values, there are empty values in Bean.Type (888 rows) and Broad.Bean.Origin (73 rows), as the tables below show: blank strings are not counted as NA. Because almost 50% of Bean.Type is empty, we keep the column in the data but do not use it in the analysis.
head(as.data.frame(table(Choco$Bean.Type)))
#> Var1 Freq
#> 1 1
#> 2 887
#> 3 Amazon 5
#> 4 Beniano 3
#> 5 Blend 41
#> 6 CCN51 1
head(as.data.frame(table(Choco$Broad.Bean.Origin)))
#> Var1 Freq
#> 1 73
#> 2 Africa, Carribean, Central America 1
#> 3 Australia 3
#> 4 Belize 49
#> 5 Bolivia 57
#> 6 Brazil 58
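To make blank entries visible to is.na(), they can be converted to NA explicitly. The snippet below is a small sketch on a made-up vector (bean_type is a hypothetical name). It assumes the blanks are ordinary spaces; if they turn out to be another whitespace character, such as a non-breaking space, the whitespace argument of trimws() would need to be adjusted.

```r
# Hypothetical stand-in for a character column with blank entries
bean_type <- c(" ", "Criollo", "", "Trinitario", " ")

# Treat empty strings and whitespace-only strings as missing
bean_type[trimws(bean_type) == ""] <- NA

# is.na() now counts them
sum(is.na(bean_type))
```

This has to be done while the column is still character (before as.factor()), or the blank factor levels would need to be recoded instead.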
Data Manipulation, Wrangling and Visualization
Question No.1
Check the frequency of each bean origin (how many times it has been reviewed) and assign it to a variable called Choco.bean. The table shows that beans from Venezuela, Ecuador, the Dominican Republic, Peru and Madagascar have the highest numbers of reviews.
Choco.bean <- as.data.frame(table(Choco$Broad.Bean.Origin))
colnames(Choco.bean) <- c("Broad.Bean.Origin", "Number.Reviews")
Choco.bean <- Choco.bean[order(-Choco.bean$Number.Reviews),]
head(Choco.bean)
#> Broad.Bean.Origin Number.Reviews
#> 84 Venezuela 215
#> 22 Ecuador 193
#> 18 Dominican Republic 166
#> 55 Peru 165
#> 45 Madagascar 145
#> 1 73
Now, observe a summary of Number.Reviews in the data frame Choco.bean.
- The number of reviews per bean origin ranges between 1 and 215. The range is wide, but most origins have only a few reviews (the median is just 3). Let’s confirm this by creating a histogram.
- The histogram shows that most bean origins have been reviewed between 0 and 20 times.
- My assumption: origins with more reviews represent the population more accurately than origins with fewer reviews. In this case, I only use bean origins that have been reviewed more than 30 times. Why this number? According to Investopedia, a sample size greater than 30 is often considered sufficient for the Central Limit Theorem (CLT) to hold, so that the sample mean gives a reasonable estimate of the population mean.
# Summary of number of reviews
summary(Choco.bean$Number.Reviews)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.00 1.00 3.00 18.51 11.00 215.00
# histogram for number of reviews.
hist(Choco.bean$Number.Reviews,
xlab = "Number of Reviews", main = "Histogram of Number Reviews for Bean Origin")
Thus, filter Choco.bean to keep only the bean origins that have more than 30 reviews. We found 17 rows, one of which is an empty value.
Choco.bean <- Choco.bean[Choco.bean$Number.Reviews> 30,]
Choco.bean
#> Broad.Bean.Origin Number.Reviews
#> 84 Venezuela 215
#> 22 Ecuador 193
#> 18 Dominican Republic 166
#> 55 Peru 165
#> 45 Madagascar 145
#> 1 73
#> 51 Nicaragua 60
#> 6 Brazil 58
#> 5 Bolivia 57
#> 4 Belize 49
#> 54 Papua New Guinea 42
#> 12 Colombia 40
#> 15 Costa Rica 38
#> 96 Vietnam 38
#> 76 Tanzania 34
#> 28 Ghana 33
#> 79 Trinidad 33
Let’s drop the empty value, which is the sixth row of the sorted table.
Choco.bean <- Choco.bean[-c(6),]
Choco.bean
#> Broad.Bean.Origin Number.Reviews
#> 84 Venezuela 215
#> 22 Ecuador 193
#> 18 Dominican Republic 166
#> 55 Peru 165
#> 45 Madagascar 145
#> 51 Nicaragua 60
#> 6 Brazil 58
#> 5 Bolivia 57
#> 4 Belize 49
#> 54 Papua New Guinea 42
#> 12 Colombia 40
#> 15 Costa Rica 38
#> 96 Vietnam 38
#> 76 Tanzania 34
#> 28 Ghana 33
#> 79 Trinidad 33
Create a data frame called Choco.bean.rating containing only the bean origins that have been reviewed more than 30 times, as listed in Choco.bean$Broad.Bean.Origin. Convert Broad.Bean.Origin to a factor and drop the unused levels.
Choco.bean.rating <- Choco[Choco$Broad.Bean.Origin %in% Choco.bean$Broad.Bean.Origin,]
Choco.bean.rating$Broad.Bean.Origin <- as.factor(Choco.bean.rating$Broad.Bean.Origin)
Choco.bean.rating$Broad.Bean.Origin <- droplevels(Choco.bean.rating$Broad.Bean.Origin)
head(Choco.bean.rating)
#> Company.Name Region.Bar REF Review.Date Cocoa.Percent Company.Location
#> 5 A. Morin Quilla 1704 2015-02-15 70 France
#> 6 A. Morin Carenero 1315 2014-02-15 70 France
#> 8 A. Morin Sur del Lago 1315 2014-02-15 70 France
#> 9 A. Morin Puerto Cabello 1319 2014-02-15 70 France
#> 10 A. Morin Pablino 1319 2014-02-15 70 France
#> 12 A. Morin Madagascar 1011 2013-02-15 70 France
#> Rating Bean.Type Broad.Bean.Origin
#> 5 3.50 Peru
#> 6 2.75 Criollo Venezuela
#> 8 3.50 Criollo Venezuela
#> 9 3.75 Criollo Venezuela
#> 10 4.00 Peru
#> 12 3.00 Criollo Madagascar
Create a data frame containing the mean, median, standard deviation and number of reviews from the data frame Choco.bean.rating, and assign it to a variable called bean.rating.final.
# Create variables each containing mean, median, standard deviation, number of reviews
bean.rating.median <- aggregate(Rating ~ Broad.Bean.Origin, Choco.bean.rating, median)
bean.rating.mean <- aggregate(Rating ~ Broad.Bean.Origin, Choco.bean.rating, mean)
bean.rating.sd <- aggregate(Rating ~ Broad.Bean.Origin, Choco.bean.rating, sd)
bean.rating.count <- aggregate(Rating ~ Broad.Bean.Origin, Choco.bean.rating, length)
# Merge the four data frames into one called bean.rating.final.
MyMerge <- function(x, y){
  df <- merge(x, y, by= "Broad.Bean.Origin", all.x= TRUE, all.y= TRUE)
  return(df)
}
bean.rating.final <- Reduce(MyMerge, list(bean.rating.mean, bean.rating.median, bean.rating.sd, bean.rating.count))
# Rename the columns and order the rows by descending median, then ascending standard deviation.
colnames(bean.rating.final) <- c('Origin.Bean.Type','Mean','Median','St.Dev', 'Number.Reviews')
bean.rating.final <- bean.rating.final[order(-bean.rating.final[,3], bean.rating.final[,4]), ]
bean.rating.final
#> Origin.Bean.Type Mean Median St.Dev Number.Reviews
#> 16 Vietnam 3.315789 3.375 0.3166773 38
#> 3 Brazil 3.284483 3.375 0.4174071 58
#> 1 Belize 3.234694 3.250 0.3203520 49
#> 10 Nicaragua 3.200000 3.250 0.3900239 60
#> 6 Dominican Republic 3.206325 3.250 0.3997792 166
#> 14 Trinidad 3.204545 3.250 0.4072322 33
#> 9 Madagascar 3.265517 3.250 0.4106024 145
#> 11 Papua New Guinea 3.291667 3.250 0.4128894 42
#> 2 Bolivia 3.197368 3.250 0.4218815 57
#> 4 Colombia 3.225000 3.250 0.4414429 40
#> 13 Tanzania 3.205882 3.250 0.4785972 34
#> 15 Venezuela 3.241860 3.250 0.5002254 215
#> 12 Peru 3.137879 3.250 0.5018305 165
#> 7 Ecuador 3.134715 3.250 0.5272393 193
#> 5 Costa Rica 3.144737 3.125 0.4296116 38
#> 8 Ghana 3.090909 3.000 0.5513413 33
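The four aggregate() calls and the merge can also be collapsed into one call whose function returns several statistics at once. The sketch below shows this on a toy data frame; ratings and stats are hypothetical names, and the column names produced (rating.mean and so on) differ from those used in bean.rating.final.

```r
# Toy data: two origins with a handful of ratings (invented values)
ratings <- data.frame(
  origin = c("A", "A", "A", "B", "B"),
  rating = c(3, 3.5, 4, 2.5, 3.5)
)

# One aggregate() call returning four statistics per origin
stats <- aggregate(
  rating ~ origin, data = ratings,
  FUN = function(x) c(mean = mean(x), median = median(x), sd = sd(x), n = length(x))
)

# aggregate() stores the statistics as a matrix column; flatten it into regular columns
stats <- do.call(data.frame, stats)

# Order by descending median, then ascending standard deviation
stats <- stats[order(-stats$rating.median, stats$rating.sd), ]
stats
```

This avoids the repeated merges and the duplicated "Rating" column names that have to be renamed afterwards.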
For visualization: construct a boxplot of the rating for each bean origin. Then, pull out the outliers.
# Creating boxplot of beans origin and its rating.
boxplot.bean<- plot(y=Choco.bean.rating$Rating, x=Choco.bean.rating$Broad.Bean.Origin,
xlab = "origin Bean", ylab = "Rating",
col = c("red", "yellow", "blue", "brown", "orange", "green", "violet", "light yellow", "pink", "light green", "white", "purple", "grey", "dark blue", "red","dark green"),
main = "Boxplot Origin Bean to Rating")
# Pull out all outliers
out.bean <- Choco.bean.rating[Choco.bean.rating$Rating %in% boxplot.bean$out, c("Company.Name", "Cocoa.Percent", "Rating", "Broad.Bean.Origin")]
head(out.bean[order(-out.bean$Rating),])
#> Company.Name Cocoa.Percent Rating Broad.Bean.Origin
#> 79 Amedei 70 5 Venezuela
#> 287 Cacao Barry 75 2 Tanzania
#> 288 Cacao Barry 72 2 Venezuela
#> 340 Caoni (Tulicorp) 77 2 Ecuador
#> 628 Escazu 65 2 Costa Rica
#> 677 French Broad 81 2 Peru
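One caveat about the extraction above: matching Rating against boxplot.bean$out selects every row whose rating equals an outlier value in any group, even if that rating is not an outlier within the row's own origin. A per-group check could look like the sketch below, which uses a toy data frame (ratings and is_outlier are hypothetical names) and the same 1.5 × IQR rule that boxplot() applies by default.

```r
# Toy data: a rating of 2 is an outlier in group A but not in group B
ratings <- data.frame(
  origin = c(rep("A", 7), rep("B", 7)),
  rating = c(3, 3, 3.25, 3.25, 3.5, 3.5, 2,
             2, 2.5, 3, 3, 3.5, 4, 2)
)

# Flag outliers within each origin separately.
# ave() returns numbers, so TRUE/FALSE come back as 1/0.
is_outlier <- ave(
  ratings$rating, ratings$origin,
  FUN = function(x) x %in% boxplot.stats(x)$out
) == 1

# Only the row that is an outlier within its own group is kept
ratings[is_outlier, ]
```

Here only the seventh row is flagged, whereas matching on the value 2 alone would also select the two rows with rating 2 in group B.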
Insight:
- The median is used to measure the central tendency of the rating because it is not sensitive to outliers.
- Based on the bean.rating.final data frame, two countries (Vietnam and Brazil) share the highest median of 3.375. However, the standard deviation for Vietnam is lower than for Brazil, meaning that its ratings are more tightly clustered. Thus, Vietnam is the origin whose cocoa beans have the highest median rating, followed by Brazil, Belize, Nicaragua, and the Dominican Republic.
- Most of the bean origins are located in South America, Central America and the Caribbean.
- Based on the boxplot, the cocoa farms located in those regions produce cocoa beans with a rating between 3 and 3.75, which can be categorized as “Satisfactory” to “Praiseworthy”.
- Most of the outliers have a low rating between 1 and 2. There is only one outlier with a perfect score of 5: a bar made from cocoa beans from Venezuela.
Question No.2
We will treat question No.2 in the same way as the first question. First, filter the data for company locations that have been reviewed more than 30 times and assign the result to an object called Choco.com.
# Create a data frame called Choco.comp with the number of reviews for each company location.
Choco.comp <- as.data.frame(table(Choco$Company.Location))
colnames(Choco.comp) <- c("Company.Location", "Number.Review")
# Keep only the locations with more than 30 reviews and assign the result to Choco.com.
Choco.com <- Choco.comp[Choco.comp$Number.Review > 30, ]
Choco.com <- Choco.com[order(-Choco.com$Number.Review),]
Choco.com
#> Company.Location Number.Review
#> 53 U.S.A. 764
#> 17 France 156
#> 7 Canada 125
#> 52 U.K. 107
#> 28 Italy 63
#> 14 Ecuador 55
#> 2 Australia 49
#> 4 Belgium 40
#> 51 Switzerland 38
#> 18 Germany 35
Ten company locations have more than 30 reviews. It is interesting to note that the United States has by far the largest number of reviews. This could be because there are many chocolate companies in the US, or because a relatively small number of companies were reviewed repeatedly over the years. In other words, we want to know how the number of reviews relates to the number of companies in each country.
To answer this question, let’s first build the frequency table of company name against company location and store it as a data frame called company. This data frame lists every combination of company and location with its number of reviews. A count of 0 means that the company is not located in that country, so we drop the rows where the number of reviews is 0.
company <- as.data.frame(table(Choco$Company.Name, Choco$Company.Location))
colnames(company) <- c("Company.Name", "Company.Location", "Number.Reviews")
company <- company[company$Number.Reviews != 0, ]
head(company)
#> Company.Name Company.Location Number.Reviews
#> 103 Compania de Chocolate (Salgado) Argentina 5
#> 337 Salgado Argentina 4
#> 439 Bahen & Co. Australia 5
#> 464 Bright Australia 4
#> 524 Cravve Australia 7
#> 526 Daintree Australia 2
From the company data frame, create a frequency table of company location and assign it to a data frame called country. This data frame gives the number of companies in each country. Next, merge the country and Choco.comp data frames so that we can see the number of reviews and the number of companies for each country, and assign the result to a data frame called com.mer. Finally, construct a scatter plot to visualize the correlation.
country <- as.data.frame(table(company$Company.Location))
colnames(country) <- c("Company.Location", "Number.Companies")
com.mer <- merge(Choco.comp, country, by="Company.Location")
head(com.mer[order(-com.mer$Number.Companies),])
#> Company.Location Number.Review Number.Companies
#> 53 U.S.A. 764 175
#> 52 U.K. 107 24
#> 17 France 156 22
#> 7 Canada 125 20
#> 14 Ecuador 55 14
#> 2 Australia 49 10
# Create a scatter plot
plot(x=com.mer$Number.Review, y=com.mer$Number.Companies,
     xlab = "Number of Reviews", ylab = "Number of Companies",
     main = "Scatterplot of Number of Reviews and Number of Companies",
     pch = 9)
abline(lm(formula = Number.Companies~Number.Review,
data=com.mer),
col=10,
lwd=2,
lty=3)
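The relationship drawn in the scatter plot can also be expressed as a number with cor(). The sketch below uses only the six rows printed by head() above, so its values describe that subset and not the full com.mer data frame. The name top6 is hypothetical.

```r
# The six rows printed above; the full com.mer data frame has more rows.
top6 <- data.frame(
  Company.Location = c("U.S.A.", "U.K.", "France", "Canada", "Ecuador", "Australia"),
  Number.Review    = c(764, 107, 156, 125, 55, 49),
  Number.Companies = c(175, 24, 22, 20, 14, 10)
)

# Pearson correlation; sensitive to the single very large U.S.A. point.
pearson <- cor(top6$Number.Review, top6$Number.Companies)

# Spearman rank correlation; less affected by one extreme value.
spearman <- cor(top6$Number.Review, top6$Number.Companies, method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```

Comparing the two coefficients is a quick way to check how much of the apparent linear relationship is driven by one dominant country.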
Insight: Based on the scatter plot and the com.mer data frame, the U.S.A. has the largest number of chocolate companies, accounting for around 42% of the total samples. Because the number of reviews and the number of companies per country are highly correlated, the large number of reviews for United States-based companies reflects the large number of companies there. But even though the United States has the most reviews and the most chocolate bar companies, does that mean its chocolate bars also have a high rating? To find the answer, let's move on!
Now, create a data frame called Choco.com.rating containing only the rows of Choco whose company location appears in Choco.com$Company.Location.
Choco.com.rating <- Choco[Choco$Company.Location %in% Choco.com$Company.Location,]
Choco.com.rating$Company.Location <- droplevels(Choco.com.rating$Company.Location)
head(Choco.com.rating)
#> Company.Name Region.Bar REF Review.Date Cocoa.Percent Company.Location
#> 1 A. Morin Agua Grande 1876 2016-02-15 63 France
#> 2 A. Morin Kpime 1676 2015-02-15 70 France
#> 3 A. Morin Atsane 1676 2015-02-15 70 France
#> 4 A. Morin Akata 1680 2015-02-15 70 France
#> 5 A. Morin Quilla 1704 2015-02-15 70 France
#> 6 A. Morin Carenero 1315 2014-02-15 70 France
#> Rating Bean.Type Broad.Bean.Origin
#> 1 3.75 Sao Tome
#> 2 2.75 Togo
#> 3 3.00 Togo
#> 4 3.50 Togo
#> 5 3.50 Peru
#> 6 2.75 Criollo Venezuela
Create a data frame containing the median, the standard deviation and the number of reviews of the ratings in Choco.com.rating, and assign it to a variable called com.final.
# Create variable each containing median, standard deviation, frequency for rating
com.median <- aggregate(Rating ~ Company.Location, Choco.com.rating, median)
com.sd <- aggregate(Rating ~ Company.Location, Choco.com.rating, sd)
com.count <- aggregate(Rating ~ Company.Location, Choco.com.rating, length)
# Merge the three data frames into one.
MyMerge <- function(x, y){
df <- merge(x, y, by= "Company.Location", all.x= TRUE, all.y= TRUE)
return(df)
}
com.final <- Reduce(MyMerge, list(com.median, com.sd, com.count ))
# Rename columns, then order by median (descending) and standard deviation (ascending)
colnames(com.final) <- c('Company.Location','Median','St.Dev', 'Number.Reviews')
com.final[order(-com.final[,2], com.final[,3]), ]
#> Company.Location Median St.Dev Number.Reviews
#> 1 Australia 3.50 0.4177070 49
#> 3 Canada 3.25 0.4236268 125
#> 10 U.S.A. 3.25 0.4419656 764
#> 8 Switzerland 3.25 0.4665176 38
#> 6 Germany 3.25 0.4757789 35
#> 5 France 3.25 0.5466148 156
#> 7 Italy 3.25 0.5984437 63
#> 2 Belgium 3.25 0.8178448 40
#> 9 U.K. 3.00 0.5004847 107
#> 4 Ecuador 3.00 0.5630679 55
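As an alternative to three aggregate() calls followed by a merge, the same kind of summary table can be built in one pass. This is a sketch on made-up ratings, not the method used above. The names toy, stats and toy.final are hypothetical.

```r
# Hypothetical toy ratings for two locations (not from the real dataset).
toy <- data.frame(
  Company.Location = c("X", "X", "X", "Y", "Y"),
  Rating           = c(3.0, 3.5, 4.0, 2.5, 3.5)
)

# One aggregate() call returning several statistics per group.
stats <- aggregate(
  Rating ~ Company.Location, data = toy,
  FUN = function(r) c(Median = median(r), St.Dev = sd(r), Number.Reviews = length(r))
)

# aggregate() stores the statistics as a matrix column; flatten it to plain columns.
toy.final <- do.call(data.frame, stats)
colnames(toy.final) <- c("Company.Location", "Median", "St.Dev", "Number.Reviews")

# Highest median first, ties broken by the smaller standard deviation.
toy.final <- toy.final[order(-toy.final$Median, toy.final$St.Dev), ]
print(toy.final)
```

Computing all statistics in one call avoids the repeated merges, at the cost of the extra flattening step.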
Create a new column called rating_int in the data frame Choco.com.rating and construct a mosaic plot to visualize the rating interval against the company location. The area of each rectangle represents the proportion of reviews in that combination of location and rating interval.
# Create a function to classify the rating.
# A rating of exactly 3 is placed in "3 - 4" and a rating of exactly 4 in "4 - 5".
convert_int <- function(y){
  if(y <= 2){
    "0 - 2"
  }else if(y < 3){
    "2 - 3"
  }else if(y < 4){
    "3 - 4"
  }else{
    "4 - 5"
  }
}
# Create a new column called rating_int and construct a mosaic plot.
Choco.com.rating$rating_int <- sapply(X = Choco.com.rating$Rating, FUN = convert_int)
Choco.com.rating$rating_int <- as.factor(Choco.com.rating$rating_int)
plot(xtabs(~ Company.Location + rating_int, Choco.com.rating),
     col = c("green", "orange", "red", "light blue"),
     xlab = "Company Location", ylab = "Interval Rating",
     main = "Mosaic plot for Company Location and Interval of Rating")
Insights:
- Based on the mosaic plot, the United States accounts for the largest share of reviews, as shown by the largest rectangular area.
- For every country, the largest proportion of ratings falls in the interval between 3 and 4, in contrast to the interval between 0 and 2.
- Based on the com.final data frame, the median ratings of the 10 company locations are close to each other, ranging from 3 to 3.5.
- Australia is the country that produces chocolate bars with the highest median rating, followed by Canada, the USA, Switzerland and Germany.
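The rating intervals can also be assigned to a whole vector at once instead of one value at a time through sapply(). The sketch below assumes that a rating of exactly 3 belongs to "3 - 4" and a rating of exactly 4 to "4 - 5". The function classify_rating and the sample ratings are hypothetical.

```r
# Assumption: a rating of exactly 3 belongs to "3 - 4" and exactly 4 to "4 - 5".
classify_rating <- function(rating) {
  ifelse(rating <= 2, "0 - 2",
  ifelse(rating <  3, "2 - 3",
  ifelse(rating <  4, "3 - 4", "4 - 5")))
}

# Hypothetical sample ratings, including the interval boundaries.
ratings <- c(1.5, 2, 2.75, 3, 3.5, 4, 5)
intervals <- factor(classify_rating(ratings),
                    levels = c("0 - 2", "2 - 3", "3 - 4", "4 - 5"))
print(data.frame(ratings, intervals))
```

Testing the boundary values 2, 3 and 4 explicitly is worthwhile here, because whole-number ratings are common and a gap in the conditions would silently move them into the wrong interval.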
Question No.3
Check the correlation value by applying cor() and construct a scatter plot to observe the correlation between rating and percentage of cocoa.
# Check correlation
cor(Choco$Cocoa.Percent, Choco$Rating)
#> [1] -0.1648202
# Create a scatter plot
plot(x=Choco$Cocoa.Percent, y=Choco$Rating,
     xlab = "Percentage of Cocoa", ylab = "Rating",
     main = "Scatterplot of Cocoa Percentage and Rating",
     pch = 9)
abline(lm(formula = Rating ~ Cocoa.Percent,
data=Choco),
col=10,
lwd=2,
lty=3)
The scatter plot shows a weak negative correlation, close to no correlation, between cocoa percentage and rating. We can explore this relationship in more detail by creating a boxplot of the rating for each interval of cocoa percentage.
# Create a function convert_coc (interval of cocoa's percentage)
convert_coc <- function(y){
if(y <= 50){
y <- "0-50"
}else
if(y > 50 & y <= 60){
y <- "50-60"
}else
if(y > 60 & y <= 70){
y <- "60-70"
}else
if(y > 70 & y <= 80){
y <- "70-80"
}else
if(y > 80 & y <= 90){
y <- "80-90"
}else{
y <- "90 to 100"
}
}
# Apply the function and store the result in a new column called cocoa_int
Choco$cocoa_int <- sapply(X = Choco$Cocoa.Percent, FUN = convert_coc)
Choco$cocoa_int <- as.factor(Choco$cocoa_int)
# Create a boxplot
plot(x=Choco$cocoa_int,
y=Choco$Rating,
xlab = "Cocoa Percentage Interval", ylab = "Rating",
col = c("red", "yellow", "blue", "brown", "orange", "green", "violet"),
main = "Boxplot Percentage of Cocoa to Rating")
Insights:
- There is a weak negative correlation between the rating and the cocoa percentage, which is confirmed by the correlation value of about -0.16.
- Based on the scatter plot, the cocoa content is concentrated between 60 and 80%, with ratings between 2.5 and 4.
- According to the boxplot, the chocolate bars with 60-70% and 70-80% cocoa have the highest median rating at approximately 3.25.
- The highest interval of cocoa content (90-100%) has the lowest median rating at 2. It also has a wide distribution.
- The boxplot shows the rating increasing from low cocoa percentages up to 60-80% cocoa content and then decreasing for higher cocoa percentages.
- Outliers mostly fall at low ratings.
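The cocoa percentage intervals used for the boxplot can also be produced with the base R function cut(). This is a minimal sketch on made-up percentages, not the code used above. The vector cocoa is hypothetical, and the breaks and labels are chosen to mirror convert_coc.

```r
# Hypothetical cocoa percentages, including the interval boundaries.
cocoa <- c(42, 50, 55, 60, 70, 72, 80, 85, 90, 100)

cocoa_int <- cut(
  cocoa,
  breaks = c(0, 50, 60, 70, 80, 90, 100),
  labels = c("0-50", "50-60", "60-70", "70-80", "80-90", "90 to 100"),
  right = TRUE,           # intervals are closed on the right, e.g. (50, 60]
  include.lowest = TRUE   # so that a value of exactly 0 would fall into "0-50"
)
print(table(cocoa_int))
```

cut() returns a factor directly, so no separate as.factor() step is needed. One difference to keep in mind: values outside 0 to 100 become NA here, whereas convert_coc would still assign them to the first or last interval.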
Conclusion
Where are cocoa beans with the highest average rating grown?
Cocoa beans with the highest average rating are harvested in Vietnam, followed by Brazil, Belize, Nicaragua, and the Dominican Republic. According to Legacy Chocolate, cocoa beans are mostly grown in tropical regions. Cocoa beans require constant warm temperatures between 65 and 90 degrees Fahrenheit to survive, with annual rainfall of around 40-100 inches.
Which countries manufacture chocolate bars with the highest average rating?
The U.S.A. has the largest number of chocolate companies, accounting for around 42% of the total samples. Australia manufactures the chocolate bars with the highest median rating at 3.5, followed by Canada with a median rating of 3.25.
What is the relationship between a chocolate bar's cocoa percentage and its rating?
There is a weak negative correlation between rating and cocoa percentage. Chocolate bars with a cocoa percentage between 60 and 80% appear to have the highest median rating, whereas chocolate bars with a cocoa content between 90 and 100% have the lowest.