Introduction
Chocolate is one of the most popular candies in the world. Each year, residents of the United States collectively eat more than 2.8 billion pounds. However, not all chocolate bars are created equal! This dataset contains expert ratings of over 1,700 individual chocolate bars, along with information on their regional origin, percentage of cocoa, the variety of chocolate bean used, and where the beans were grown. The dataset was provided by Kaggle.
Import Library
library(ggplot2)
Read File
We can use read.csv() to read a .csv file.
choco <- read.csv("flavors_of_cacao.csv")
rmarkdown::paged_table(choco)
Data Preparation
colnames(choco)
## [1] "CompanyÂ...Maker.if.known." "Specific.Bean.Origin.or.Bar.Name"
## [3] "REF" "Review.Date"
## [5] "Cocoa.Percent" "Company.Location"
## [7] "Rating" "Bean.Type"
## [9] "Broad.Bean.Origin"
Some of the column names are long and hard to type, so we'll rename the columns.
colnames(choco) <- c("Company", "Specific_Origin_Bean", "ID", "Year", "Cocoa_Percent", "Company_Location", "Rating", "Bean_Type", "Broad_Bean_Origin")
colnames(choco)
## [1] "Company" "Specific_Origin_Bean" "ID"
## [4] "Year" "Cocoa_Percent" "Company_Location"
## [7] "Rating" "Bean_Type" "Broad_Bean_Origin"
Check Data Structure
We can use str() to inspect the data structure, such as the type of each column.
str(choco)
## 'data.frame': 1795 obs. of 9 variables:
## $ Company : chr "A. Morin" "A. Morin" "A. Morin" "A. Morin" ...
## $ Specific_Origin_Bean: chr "Agua Grande" "Kpime" "Atsane" "Akata" ...
## $ ID : int 1876 1676 1676 1680 1704 1315 1315 1315 1319 1319 ...
## $ Year : int 2016 2015 2015 2015 2015 2014 2014 2014 2014 2014 ...
## $ Cocoa_Percent : chr "63%" "70%" "70%" "70%" ...
## $ Company_Location : chr "France" "France" "France" "France" ...
## $ Rating : num 3.75 2.75 3 3.5 3.5 2.75 3.5 3.5 3.75 4 ...
## $ Bean_Type : chr "Â " "Â " "Â " "Â " ...
## $ Broad_Bean_Origin : chr "Sao Tome" "Togo" "Togo" "Togo" ...
Company
: Name of the company manufacturing the bar.
Specific_Origin_Bean
: The specific geo-region of origin for the bar.
ID
: A value linked to when the review was entered in the database. Higher = more recent.
Year
: Year of publication of the review.
Cocoa_Percent
: Cocoa percentage (darkness) of the chocolate bar being reviewed.
Company_Location
: Manufacturer base country.
Rating
: Expert rating for the bar.
Bean_Type
: The variety (breed) of bean used, if provided.
Broad_Bean_Origin
: The broad geo-region of origin for the bean.
The data has 1,795 observations and 9 variables. From the str() output, we see that some data types aren't correct yet, so we'll convert them.
choco$Company <- as.character(choco$Company)
choco$Specific_Origin_Bean <- as.factor(choco$Specific_Origin_Bean)
choco$ID <- as.character(choco$ID)
# keep the first two characters to drop the "%" sign (note: "100%" would become 10)
choco$Cocoa_Percent <- as.numeric(substr(choco$Cocoa_Percent,1,2))
choco$Company_Location <- as.factor(choco$Company_Location)
choco$Bean_Type <- as.factor(choco$Bean_Type)
choco$Broad_Bean_Origin <- as.factor(choco$Broad_Bean_Origin)
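Because substr() keeps only the first two characters, a value such as "100%" is read as 10 and "72.5%" as 72, which may be where the minimum of 10 in the later summary comes from. A minimal sketch of a more tolerant conversion, using a made-up vector (cocoa_raw, cocoa_num and substr_num are illustrative names, not part of the dataset):

```r
# Hypothetical sample values; the real column is choco$Cocoa_Percent
cocoa_raw <- c("63%", "70%", "100%", "72.5%")

# Remove the literal "%" sign, then convert the remaining text to numeric
cocoa_num <- as.numeric(sub("%", "", cocoa_raw, fixed = TRUE))

# For comparison: the two-character approach truncates longer values
substr_num <- as.numeric(substr(cocoa_raw, 1, 2))

print(cocoa_num)
print(substr_num)
```

Switching the notebook to this conversion would change the numbers reported in the later sections, so the outputs there would need to be regenerated.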
Are there missing values?
anyNA(choco)
## [1] FALSE
anyNA() finds no NA values in our data, so we can continue exploring it.
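Note that anyNA() only detects NA, not blank strings. The str() output shows Bean_Type values that look empty, which suggests blanks stored as whitespace or a non-breaking space. A rough sketch of how such values could be counted, assuming the blanks are ordinary spaces or non-breaking spaces (bean_type below is a made-up vector, not the real column):

```r
# Hypothetical sample: real names, a non-breaking space, an empty string,
# a plain space, and a true NA
bean_type <- c("Trinitario", "\u00a0", "", "Criollo", " ", NA)

# Turn non-breaking spaces into plain spaces, then trim surrounding whitespace
cleaned <- trimws(gsub("\u00a0", " ", bean_type, fixed = TRUE))

# Treat both NA and empty-after-trimming values as missing
is_blank <- is.na(cleaned) | cleaned == ""

n_blank <- sum(is_blank)
print(n_blank)
```

If the file was read with the wrong encoding, the blanks may contain extra stray characters, in which case the pattern would need adjusting.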
Data Exploration
summary(choco)
## Company Specific_Origin_Bean ID
## Length:1795 Madagascar : 57 Length:1795
## Class :character Peru : 45 Class :character
## Mode :character Ecuador : 42 Mode :character
## Dominican Republic: 37
## Venezuela : 21
## Chuao : 19
## (Other) :1574
## Year Cocoa_Percent Company_Location Rating
## Min. :2006 Min. :10.00 U.S.A. :764 Min. :1.000
## 1st Qu.:2010 1st Qu.:70.00 France :156 1st Qu.:2.875
## Median :2013 Median :70.00 Canada :125 Median :3.250
## Mean :2012 Mean :70.69 U.K. : 96 Mean :3.186
## 3rd Qu.:2015 3rd Qu.:75.00 Italy : 63 3rd Qu.:3.500
## Max. :2017 Max. :99.00 Ecuador: 54 Max. :5.000
## (Other):537
## Bean_Type Broad_Bean_Origin
## Â :887 Venezuela :214
## Trinitario :419 Ecuador :193
## Criollo :153 Peru :165
## Forastero : 87 Madagascar :145
## Forastero (Nacional): 52 Dominican Republic:141
## Blend : 41 Â : 73
## (Other) :156 (Other) :864
From the summary, we can see that:
1. The reviews were published from 2006 to 2017.
2. Cocoa percentage ranges from a minimum of 10% to a maximum of 99%.
3. The most common company locations are the U.S.A., France, Canada, the U.K., Italy, and Ecuador.
4. Ratings range from 1 to 5.
Where are the best cocoa beans grown?
agg_persen <- aggregate(Cocoa_Percent ~ Broad_Bean_Origin, data = choco, FUN = mean)
head(agg_persen[order(-agg_persen$Cocoa_Percent),],3)
## Broad_Bean_Origin Cocoa_Percent
## 61 Peru, Ecuador 99
## 36 Guat., D.R., Peru, Mad., PNG 88
## 99 Venezuela/ Ghana 85
Measured by average cocoa percentage, the top beans were grown in Peru, Ecuador, with an average of 99% cocoa.
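A group mean is easier to interpret when the number of bars behind it is known, since an origin with a single bar can top the ranking. A minimal sketch of reporting the mean and the count together, using a small made-up data frame (bars, origin, cocoa and n_bars are illustrative names):

```r
# Hypothetical sample data; the real analysis would use choco
bars <- data.frame(
  origin = c("Peru", "Peru", "Peru", "Togo", "Blend X"),
  cocoa  = c(70, 72, 74, 70, 99)
)

# Mean cocoa percentage per origin
avg <- aggregate(cocoa ~ origin, data = bars, FUN = mean)

# Number of bars per origin
n <- aggregate(cocoa ~ origin, data = bars, FUN = length)
names(n)[2] <- "n_bars"

# Combine both and sort by mean, highest first
result <- merge(avg, n, by = "origin")
result <- result[order(-result$cocoa), ]
print(result)
```

In this sample, "Blend X" ranks first on the strength of a single bar, which is the kind of case the count column makes visible.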
What cocoa percentage gets the best rating?
agg_rate <- aggregate(Rating ~ Cocoa_Percent, data = choco, FUN = max)
head(agg_rate[order(-agg_rate$Rating),],3)
## Cocoa_Percent Rating
## 20 70 5
## 10 60 4
## 13 63 4
Based on the maximum rating, the best chocolate contains 70% cocoa: it is the only percentage with a bar rated 5.
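Using FUN = max picks the single best bar in each group, which can differ from the group that rates best on average. A small sketch of the difference, using made-up ratings (ratings, max_by and mean_by are illustrative names):

```r
# Hypothetical sample: one outstanding 70% bar among weaker ones
ratings <- data.frame(
  cocoa  = c(70, 70, 70, 75, 75),
  rating = c(5, 2, 2, 3.5, 3.5)
)

# Best single bar per cocoa percentage
max_by <- aggregate(rating ~ cocoa, data = ratings, FUN = max)

# Typical bar per cocoa percentage
mean_by <- aggregate(rating ~ cocoa, data = ratings, FUN = mean)

best_by_max  <- max_by$cocoa[which.max(max_by$rating)]
best_by_mean <- mean_by$cocoa[which.max(mean_by$rating)]
print(c(best_by_max, best_by_mean))
```

Which summary is appropriate depends on the question: the maximum answers "where did the best bar come from", the mean answers "which group tends to rate higher".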
Which countries produce the highest-rated bars?
agg_country_rate <- aggregate(Rating ~ Company_Location, data = choco, FUN = max)
head(agg_country_rate[order(-agg_country_rate$Rating),],3)
## Company_Location Rating
## 30 Italy 5
## 3 Australia 4
## 5 Belgium 4
Italy produced the highest-rated chocolate bars, with a maximum rating of 5.
Which bean type gives the highest cocoa percentage?
agg_bean_percent <- aggregate(Cocoa_Percent ~ Bean_Type, data = choco, FUN = mean)
head(agg_bean_percent[order(-agg_bean_percent$Cocoa_Percent),],3)
## Bean_Type Cocoa_Percent
## 15 Criollo (Ocumare) 80.0
## 19 Criollo, Forastero 76.5
## 30 Forastero(Arriba, CCN) 75.0
Bars made with the Criollo (Ocumare) bean type have the highest average cocoa percentage, at 80%.
Which company produced the best-rated chocolate in 2017?
# filter year = 2017
choco_2017 <- choco[choco$Year == 2017, ]
# get the average rating by company and company location in 2017
best_choco_2017 <- aggregate(Rating ~ Company + Company_Location, data = choco_2017, FUN = mean)
best_choco_2017[order(-best_choco_2017$Rating),]
## Company Company_Location Rating
## 6 Dick Taylor U.S.A. 3.750000
## 1 Smooth Chocolator, The Australia 3.500000
## 3 Alexandre Netherlands 3.500000
## 7 French Broad U.S.A. 3.500000
## 9 Madre U.S.A. 3.500000
## 2 Soul Canada 3.375000
## 8 Letterpress U.S.A. 3.375000
## 10 Spencer U.S.A. 3.333333
## 4 Beau Cacao U.K. 3.125000
## 5 Dalloway U.S.A. 2.750000
## 11 Xocolla U.S.A. 2.625000
ggplot(best_choco_2017, aes(y = reorder(Company, Rating), x = Rating, fill = Company_Location)) +
  geom_col() +
  geom_text(aes(label = round(Rating, 2)), hjust = 1.4) +
  labs(x = "Rating", y = "Company", title = "Best Chocolate Company in 2017") +
  theme_minimal()
The best-rated chocolate in 2017 was produced by Dick Taylor, a company located in the U.S.A., with an average rating of 3.75. We also see that 7 of the 11 companies reviewed in 2017 are from the U.S.A.
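The number of companies per country can be checked with table() rather than counted by eye. A short sketch, with the Company_Location column retyped from the printed table above (in the notebook itself, best_choco_2017 already exists; location_counts is an illustrative name):

```r
# Company locations copied from the 2017 output, in printed order
best_choco_2017 <- data.frame(
  Company_Location = c("U.S.A.", "Australia", "Netherlands", "U.S.A.",
                       "U.S.A.", "Canada", "U.S.A.", "U.S.A.",
                       "U.K.", "U.S.A.", "U.S.A.")
)

# Count companies per location, most frequent first
location_counts <- sort(table(best_choco_2017$Company_Location),
                        decreasing = TRUE)
print(location_counts)
```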
Conclusion
From this exploration, we learned that:
1. The U.S.A. is the most common company location, with 764 of the 1,795 reviewed bars.
2. Dick Taylor, a company located in the U.S.A., produced the best-rated chocolate in 2017, with an average rating of 3.75.
3. Bars made with the Criollo (Ocumare) bean type have the highest average cocoa percentage, at 80%.
4. Measured by average cocoa percentage, the top beans were grown in Peru, Ecuador, with an average of 99% cocoa.
5. Bars containing 70% cocoa reached the highest rating (5), which suggests that the expert reviewers favor chocolate with around 70% cocoa.