Data Preparation and Data Exploration of Chocolate Dataset

Verawaty

8/24/2020

Introduction

Chocolate is one of the most popular candies in the world. Each year, residents of the United States collectively eat more than 2.8 billion pounds of it. However, not all chocolate bars are created equal! This dataset contains expert ratings of over 1,700 individual chocolate bars, along with information on their regional origin, percentage of cocoa, the variety of chocolate bean used, and where the beans were grown. The dataset was obtained from Kaggle.

Import Library

library(ggplot2)

Read File

We can use read.csv() to read a .csv file.

choco <- read.csv("flavors_of_cacao.csv")
rmarkdown::paged_table(choco)
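
As a small illustration of why imported column names can look unusual, read.csv() by default (check.names = TRUE) passes headers through make.names(), which replaces spaces, parentheses, and hyphens with dots. The sketch below uses a temporary file with a made-up header and a single made-up row, not the real Kaggle file.

```r
# Write a tiny CSV whose header is not a valid R name (hypothetical data).
tmp <- tempfile(fileext = ".csv")
writeLines(c('"Company (Maker-if known)","Rating"', '"A. Morin",3.75'), tmp)

# Default behaviour: headers are converted to syntactically valid names.
default_names <- names(read.csv(tmp))

# check.names = FALSE keeps the header text exactly as written in the file.
raw_names <- names(read.csv(tmp, check.names = FALSE))

print(default_names)
print(raw_names)
unlink(tmp)
```

Keeping the raw names makes them harder to use with `$`, so renaming the columns after import is usually the more convenient route.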

Data Preparation

colnames(choco)
## [1] "CompanyÂ...Maker.if.known."       "Specific.Bean.Origin.or.Bar.Name"
## [3] "REF"                              "Review.Date"                     
## [5] "Cocoa.Percent"                    "Company.Location"                
## [7] "Rating"                           "Bean.Type"                       
## [9] "Broad.Bean.Origin"

Some of the column names are long and awkward to type, so we’ll rename the columns.

colnames(choco) <- c("Company", "Specific_Origin_Bean", "ID", "Year", "Cocoa_Percent", "Company_Location", "Rating", "Bean_Type", "Broad_Bean_Origin")
colnames(choco)
## [1] "Company"              "Specific_Origin_Bean" "ID"                  
## [4] "Year"                 "Cocoa_Percent"        "Company_Location"    
## [7] "Rating"               "Bean_Type"            "Broad_Bean_Origin"
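
Assigning a full vector to colnames() depends on the column order staying the same. As a possible alternative, the sketch below renames columns by their old names; the small data frame and the new_names mapping are hypothetical stand-ins for the chocolate data.

```r
# Hypothetical two-column data frame standing in for the chocolate data.
df <- data.frame(REF = c(1876, 1676), Review.Date = c(2016, 2015))

# Map old names to new names; columns not listed here keep their names.
new_names <- c(REF = "ID", Review.Date = "Year")

# Find where each old name sits, then overwrite only those positions.
idx <- match(names(new_names), names(df))
names(df)[idx] <- unname(new_names)

print(names(df))
```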

Check Data Structure

We can use str() to inspect the data structure, such as the data type of each column.

str(choco)
## 'data.frame':    1795 obs. of  9 variables:
##  $ Company             : chr  "A. Morin" "A. Morin" "A. Morin" "A. Morin" ...
##  $ Specific_Origin_Bean: chr  "Agua Grande" "Kpime" "Atsane" "Akata" ...
##  $ ID                  : int  1876 1676 1676 1680 1704 1315 1315 1315 1319 1319 ...
##  $ Year                : int  2016 2015 2015 2015 2015 2014 2014 2014 2014 2014 ...
##  $ Cocoa_Percent       : chr  "63%" "70%" "70%" "70%" ...
##  $ Company_Location    : chr  "France" "France" "France" "France" ...
##  $ Rating              : num  3.75 2.75 3 3.5 3.5 2.75 3.5 3.5 3.75 4 ...
##  $ Bean_Type           : chr  " " " " " " " " ...
##  $ Broad_Bean_Origin   : chr  "Sao Tome" "Togo" "Togo" "Togo" ...

Company : Name of the company manufacturing the bar.
Specific_Origin_Bean : The specific geo-region of origin for the bar.
ID : A value linked to when the review was entered in the database. Higher = more recent.
Year : Year of publication of the review.
Cocoa_Percent : Cocoa percentage (darkness) of the chocolate bar being reviewed.
Company_Location : Manufacturer base country.
Rating : Expert rating for the bar.
Bean_Type : The variety (breed) of bean used, if provided.
Broad_Bean_Origin : The broad geo-region of origin for the bean.

Our data has 1,795 observations and 9 variables. From the str() output, we see that some data types are not correct yet, so we’ll change them.

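# Free-text names and identifiers are stored as character
choco$Company <- as.character(choco$Company)
choco$Specific_Origin_Bean <- as.factor(choco$Specific_Origin_Bean)
choco$ID <- as.character(choco$ID)
# Keep the first two characters of strings such as "63%" and convert them to numeric
choco$Cocoa_Percent <- as.numeric(substr(choco$Cocoa_Percent,1,2))
# Columns with repeated labels are stored as factors
choco$Company_Location <- as.factor(choco$Company_Location)
choco$Bean_Type <- as.factor(choco$Bean_Type)
choco$Broad_Bean_Origin <- as.factor(choco$Broad_Bean_Origin)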
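
Taking the first two characters works for values like "63%", but it would shorten "100%" to 10 and "72.5%" to 72. A minimal sketch of a more general conversion is shown below; parse_percent is a hypothetical helper, not part of the original analysis, and the input strings are made up.

```r
# Convert strings such as "63%" to numbers by removing the percent sign.
# parse_percent is a hypothetical helper name used only for this sketch.
parse_percent <- function(x) {
  as.numeric(sub("%", "", x, fixed = TRUE))
}

pct <- c("63%", "70%", "72.5%", "100%")

# Removing the sign keeps every digit of the value.
print(parse_percent(pct))

# Taking two characters truncates "72.5%" to 72 and "100%" to 10.
print(as.numeric(substr(pct, 1, 2)))
```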

Are there missing values?

anyNA(choco)
## [1] FALSE

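There are no NA values in our data, so we can continue to explore it. Note that anyNA() does not count blank strings, such as the " " entries that str() shows in Bean_Type.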
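
Blank entries can be counted and recoded separately from NA. The sketch below is one way to do this on a made-up vector; it assumes R 3.6 or later for the whitespace argument of trimws(), and it assumes the blanks may be ordinary spaces or non-breaking spaces.

```r
# Hypothetical bean-type values: blanks, a non-breaking space, and real labels.
bean_type <- c(" ", "Trinitario", "\u00a0", "Criollo", "")

# Trim all Unicode horizontal and vertical whitespace, then flag empty strings.
is_blank <- trimws(bean_type, whitespace = "[\\h\\v]") == ""
print(sum(is_blank))

# Recode blanks as NA so that anyNA() and summary() report them.
bean_type[is_blank] <- NA
print(anyNA(bean_type))
```

Applied before as.factor(), a step like this would keep blank labels from appearing as a separate factor level.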

Data Exploration

summary(choco)
##    Company                  Specific_Origin_Bean      ID           
##  Length:1795        Madagascar        :  57      Length:1795       
##  Class :character   Peru              :  45      Class :character  
##  Mode  :character   Ecuador           :  42      Mode  :character  
##                     Dominican Republic:  37                        
##                     Venezuela         :  21                        
##                     Chuao             :  19                        
##                     (Other)           :1574                        
##       Year      Cocoa_Percent   Company_Location     Rating     
##  Min.   :2006   Min.   :10.00   U.S.A. :764      Min.   :1.000  
##  1st Qu.:2010   1st Qu.:70.00   France :156      1st Qu.:2.875  
##  Median :2013   Median :70.00   Canada :125      Median :3.250  
##  Mean   :2012   Mean   :70.69   U.K.   : 96      Mean   :3.186  
##  3rd Qu.:2015   3rd Qu.:75.00   Italy  : 63      3rd Qu.:3.500  
##  Max.   :2017   Max.   :99.00   Ecuador: 54      Max.   :5.000  
##                                 (Other):537                     
##                 Bean_Type            Broad_Bean_Origin
##                     :887   Venezuela         :214    
##  Trinitario          :419   Ecuador           :193    
##  Criollo             :153   Peru              :165    
##  Forastero           : 87   Madagascar        :145    
##  Forastero (Nacional): 52   Dominican Republic:141    
##  Blend               : 41                    : 73    
##  (Other)             :156   (Other)           :864

From the summary, we can see the following:
1. The reviews were published between 2006 and 2017.
2. The cocoa percentage ranges from 10% to 99%. Because substr(choco$Cocoa_Percent,1,2) keeps only the first two characters, a value such as "100%" would be read as 10, so the minimum may be an artifact of the conversion.
3. The most common company locations are the U.S.A., France, Canada, the U.K., Italy, and Ecuador.
4. Ratings range from 1 to 5.

Where are the best cocoa beans grown?

agg_persen <- aggregate(Cocoa_Percent ~ Broad_Bean_Origin, data = choco, FUN = mean)
head(agg_persen[order(-agg_persen$Cocoa_Percent),],3)
##               Broad_Bean_Origin Cocoa_Percent
## 61                Peru, Ecuador            99
## 36 Guat., D.R., Peru, Mad., PNG            88
## 99             Venezuela/ Ghana            85

Taking the average cocoa percentage as the measure, the top origin is "Peru, Ecuador", whose bars average 99% cocoa. This ranks origins by cocoa content, not by rating.
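
A group mean can be driven by a single bar, so it may help to report the number of bars next to each average. The sketch below uses made-up data and an arbitrary threshold of two bars; the column names origin, cocoa, and n_bars are assumptions for illustration only.

```r
# Hypothetical data: origin "B" has a single very dark bar.
toy <- data.frame(
  origin = c("A", "A", "A", "B", "C", "C"),
  cocoa  = c(70, 72, 74, 99, 65, 75)
)

# Mean cocoa percentage and number of bars per origin.
mean_by_origin  <- aggregate(cocoa ~ origin, data = toy, FUN = mean)
count_by_origin <- aggregate(cocoa ~ origin, data = toy, FUN = length)
names(count_by_origin)[2] <- "n_bars"
by_origin <- merge(mean_by_origin, count_by_origin, by = "origin")

# Keep origins with at least two bars (the threshold is an arbitrary choice).
reliable <- by_origin[by_origin$n_bars >= 2, ]
print(reliable[order(-reliable$cocoa), ])
```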

What percentage of cocoa gets the best rating?

agg_rate <- aggregate(Rating ~ Cocoa_Percent, data = choco, FUN = max)
head(agg_rate[order(-agg_rate$Rating),],3)
##    Cocoa_Percent Rating
## 20            70      5
## 10            60      4
## 13            63      4

Based on the maximum rating, the best-rated chocolate contains 70% cocoa.
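
The maximum per group shows only the single best bar at each cocoa percentage. To see every bar that shares the top rating, one option is to filter on the overall maximum, as sketched below with made-up values (the toy data frame is hypothetical).

```r
# Hypothetical ratings for five bars.
toy <- data.frame(
  cocoa  = c(70, 70, 60, 63, 75),
  rating = c(5, 5, 4, 4, 3.5)
)

# All rows that share the maximum rating, including ties.
top <- toy[toy$rating == max(toy$rating), ]
print(top)

# Number of top-rated bars at each cocoa percentage.
print(table(top$cocoa))
```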

Which countries produce the highest-rated bars?

agg_country_rate <- aggregate(Rating ~ Company_Location, data = choco, FUN = max)
head(agg_country_rate[order(-agg_country_rate$Rating),],3)
##    Company_Location Rating
## 30            Italy      5
## 3         Australia      4
## 5           Belgium      4

Italy produces the highest-rated chocolate bars, with a maximum rating of 5.
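
Ranking by the maximum rewards a single outstanding bar, while the mean reflects a country's typical bar. The sketch below computes both, plus the number of bars, in one aggregate() call on made-up ratings; the data and the column names location and rating are hypothetical.

```r
# Hypothetical ratings: one country has a single top bar and weaker ones.
toy <- data.frame(
  location = c("Italy", "Italy", "Italy", "Belgium", "Belgium", "Belgium"),
  rating   = c(5, 2, 2.5, 4, 3.75, 3.5)
)

# FUN returns three values per group, stored as a matrix column named "rating".
res <- aggregate(rating ~ location, data = toy,
                 FUN = function(r) c(max = max(r), mean = mean(r), n = length(r)))

# Flatten the matrix column into ordinary columns: max, mean, n.
stats <- data.frame(location = as.character(res$location), res$rating)
print(stats[order(-stats$mean), ])
```

In this made-up example the two rankings disagree, which is why it can be worth checking both before naming a "best" country.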

Which bean type is associated with the highest cocoa percentage?

agg_bean_percent <- aggregate(Cocoa_Percent ~ Bean_Type, data = choco, FUN = mean)
head(agg_bean_percent[order(-agg_bean_percent$Cocoa_Percent),],3)
##                 Bean_Type Cocoa_Percent
## 15      Criollo (Ocumare)          80.0
## 19     Criollo, Forastero          76.5
## 30 Forastero(Arriba, CCN)          75.0

On average, bars made with the Criollo (Ocumare) bean type have the highest cocoa percentage, at 80%.

Which company produced the best-rated chocolate in 2017?

#filter year = 2017
choco_2017 <- choco[choco$Year == 2017, ]

#get the average of rating based on company and company location in 2017
best_choco_2017 <- aggregate(Rating ~ Company + Company_Location, data = choco_2017, FUN = mean)
best_choco_2017[order(-best_choco_2017$Rating),]
##                   Company Company_Location   Rating
## 6             Dick Taylor           U.S.A. 3.750000
## 1  Smooth Chocolator, The        Australia 3.500000
## 3               Alexandre      Netherlands 3.500000
## 7            French Broad           U.S.A. 3.500000
## 9                   Madre           U.S.A. 3.500000
## 2                    Soul           Canada 3.375000
## 8             Letterpress           U.S.A. 3.375000
## 10                Spencer           U.S.A. 3.333333
## 4              Beau Cacao             U.K. 3.125000
## 5                Dalloway           U.S.A. 2.750000
## 11                Xocolla           U.S.A. 2.625000
ggplot(best_choco_2017, aes(y = reorder(Company, Rating), x = Rating, fill = Company_Location)) +
  geom_col() +
  geom_text(aes(label = round(Rating, 2)), hjust = 1.4) +
  labs(x = "Rating", y = "Company", title = "Best Chocolate Company in 2017") +
  theme_minimal()

The best-rated chocolate in 2017 was produced by Dick Taylor, a company located in the U.S.A., with an average rating of 3.75. We also see that 7 of the 11 companies reviewed in 2017 are from the U.S.A.
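
The count of companies per country can be checked with table(). The sketch below retypes the Company_Location column from the 2017 result printed above, so it runs without the original file; with the real data, table(best_choco_2017$Company_Location) should give the same kind of summary.

```r
# Company locations copied from the 2017 result table, in printed order.
locations <- c("U.S.A.", "Australia", "Netherlands", "U.S.A.", "U.S.A.",
               "Canada", "U.S.A.", "U.S.A.", "U.K.", "U.S.A.", "U.S.A.")

# Count companies per country and sort from most to fewest.
counts <- sort(table(locations), decreasing = TRUE)
print(counts)
```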

Conclusion

From this exploration, we learned the following:
1. The largest share of reviewed bars (764 of 1,795) comes from companies based in the U.S.A.
2. Dick Taylor, a company located in the U.S.A., produced the best-rated chocolate in 2017, with an average rating of 3.75.
3. On average, bars made with the Criollo (Ocumare) bean type have the highest cocoa percentage, at 80%.
4. Bars made from beans grown in "Peru, Ecuador" have the highest average cocoa percentage, at 99%.
5. The only cocoa percentage with a maximum rating of 5 is 70%, which suggests that the expert reviewers favor bars at around that level.