1 🍫 Introduction

We are often dazzled by the sheer number of chocolate brands in grocery stores and shops. In this project we use EDA and basic modeling to explore the chocolate bar rating dataset and answer three questions:

Which country produces the highest-rated chocolate bars?
Which locations produce the top-quality cacao beans?
Is there a relationship between cacao percentage and the customer rating?

We hope this information gives you a rough idea of which chocolate to pick for your taste the next time you go shopping.

Cocao Dataset

datatable(cocao)

1.1 🌎 Chocolate / Cacao Industry World Map

We first look at the geographic distribution of chocolate companies and of cacao bean origins.

1.1.1 Chocolate Company

North America and Australia dominate the chocolate industry!

# Group by country
colnames(cocao) <- c("company", "bean.bar.orgin", "ref","date" , "percent", "location","rating", "beantype" ,"origin")

commap <- group_by(cocao, location)
commap1 <- summarise(commap,  count=n())
map1 <- joinCountryData2Map(commap1, joinCode="NAME", nameJoinColumn="location")

## 51 codes from your data successfully matched countries in the map
## 9 codes from your data failed to match with a country code in the map
## 192 codes from the map weren't represented in your data

mapCountryData(map1, nameColumnToPlot="count", mapTitle="Chocolate Company Distribution" , colourPalette = "negpos8")
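The join report above says that 9 location names did not match a country on the map. One possible remedy, sketched here in base R, is to recode such names before calling joinCountryData2Map. The lookup entries and the helper recode_country are illustrative assumptions, not part of the original analysis; the names that actually fail should be read from the join report.

```r
# Hypothetical lookup: dataset spelling -> map spelling.
# These pairs are assumptions for illustration and are not verified
# against the name table used by the map package.
name_fixes <- c("U.S.A." = "United States", "U.K." = "United Kingdom")

# Replace every value that appears in the lookup, leave the rest untouched.
recode_country <- function(x, fixes) {
  hit <- x %in% names(fixes)
  x[hit] <- unname(fixes[x[hit]])
  x
}

toy <- c("U.S.A.", "France", "U.K.", "Canada")
fixed <- recode_country(toy, name_fixes)
fixed
```

Applied to the notebook's data, the same helper would be called on a character version of the location column before the group_by step.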

1.1.2 Cacao Origin

Africa, Australia and some South American countries produce most of the cacao beans.

# Group by country
omap <- group_by(cocao, origin)
omap1 <- summarise(omap,  count=n())
map2 <- joinCountryData2Map(omap1, joinCode="NAME", nameJoinColumn="origin")

## 47 codes from your data successfully matched countries in the map
## 54 codes from your data failed to match with a country code in the map
## 196 codes from the map weren't represented in your data

mapCountryData(map2, nameColumnToPlot="count", mapTitle="Cocao Origin Distribution" , colourPalette = "negpos8")

2 Countries and the Highest-Rated Chocolate Bars

The dataset is broken down into chocolate company, bean origin, bean type, company location, cacao percentage, and rating. We first group the records by company location, then take the average rating of each group and plot it. The graph shows that Chile, with an average score of 3.75, outranks all other countries and produces the highest-rated chocolate bars. If we also take production volume into consideration (approximated here by the number of rated bars), Canada and France produce fine-quality chocolate in high volumes. The U.S.A. produces a large quantity of chocolate with decent flavor.

#average rating by location
loca <- group_by(cocao, location)
good <- summarise(loca,  count=n(),
                  rate1= mean(rating))
good1<- arrange(good, desc(rate1))

# scatter plot
ggplot(good1,aes(x=reorder(location,rate1), y=rate1)) +geom_point(aes(size=count, colour=factor(rate1)), alpha=1/2) + theme_minimal(base_size = 9)+
  theme(axis.text.x = element_text(angle = 90, hjust = 1) , legend.position="none") +
  labs(x="Country", y="Chocolate Rating", title="Chocolate Rating vs Country")

t1<- ggplot(good1,aes(x=reorder(location,rate1), y=rate1)) +geom_point(aes(size=count, colour=factor(rate1)), alpha=1/2) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1) ,legend.position="none", panel.background = element_rect(fill="white")) +
  labs(x="Country", y="Chocolate Rating", title="Chocolate Rating vs Country")
t1
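The ranking above depends on both the mean rating and the number of bars behind it. As a minimal sketch on made-up data (the toy values and the name good_toy are assumptions, and base R tapply stands in for the dplyr calls), the following shows how a location with a single bar can top the ranking:

```r
# Toy ratings: location "B" has only one bar.
toy <- data.frame(
  location = c("A", "A", "A", "B", "C", "C"),
  rating   = c(3.0, 3.5, 4.0, 3.75, 2.5, 3.0)
)

# Count and mean rating per location.
counts <- tapply(toy$rating, toy$location, length)
means  <- tapply(toy$rating, toy$location, mean)

good_toy <- data.frame(
  location = names(means),
  count    = as.vector(counts),
  rate1    = as.vector(means)
)

# Sort by mean rating, highest first.
good_toy <- good_toy[order(-good_toy$rate1), ]
good_toy
```

Here "B" ranks first on the strength of one observation, which is why the plot encodes the count as point size.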

3 🧩 Cacao solid percentage and rating

We first draw a simple scatter plot of cacao percentage against rating. As we can see, both the highest-rated and the lowest-rated chocolate bars contain 70% cacao, so we cannot draw a conclusion from this graph alone. We improve the analysis by grouping the data by cacao percentage, taking the average rating of each percentage group, and plotting that average.

3.1 1. Scatter plot

#convert percentage to numerical
cocao$pct = as.numeric(gsub("\\%", "", cocao$percent))

#scatter plot
ggplot(cocao,aes(x=pct, y=rating)) +geom_point(aes(colour=factor(location)))+theme_minimal()  +theme( legend.position="bottom",legend.key.width=unit(0.2,"cm"),legend.key.height=unit(0.2,"cm")) +
  xlab("cocao Percent(%) ") + ylab("Chocolate Bar Rating") + 
  ggtitle("Scatter plot of cocao Percent vs Chocolate Bar Rating")
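The conversion step at the top of this chunk can be tried in isolation. This is a small sketch on made-up strings (the vector percent is an assumption); it uses sub with fixed = TRUE, which treats "%" as a literal character and so avoids the regular-expression escape:

```r
# Toy percentage strings in the same format as the dataset column.
percent <- c("70%", "63%", "72.5%", "100%")

# Strip the percent sign, then convert to numeric.
pct <- as.numeric(sub("%", "", percent, fixed = TRUE))
pct   # numeric values 70, 63, 72.5 and 100
```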

3.2 2. Scatter plot of Average value

From this graph we can see that bars with 50% cacao have the highest average rating. However, this is not persuasive, because the sample size of 50% cacao bars is small. Taking sample size into account, a 70% cacao bar is still our best choice, and any chocolate bar with 65% to 75% cacao looks pretty good! We will test this conjecture by building an SVM model.

#rating by pct
pctdata <- group_by(cocao, pct)
gdpct <- summarise(pctdata,  count=n() ,rate2= mean(rating))
gdpct1<- arrange(gdpct, desc(rate2))

ggplot(gdpct1,aes(x=pct, y=rate2)) +
  geom_point(aes(size=count, colour=factor(pct)), alpha=1/2) +theme_minimal()+
  theme(legend.position="none")
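One way to make the sample-size caveat concrete is to report a standard error next to each group mean. This is a minimal sketch on made-up ratings (the toy data, the helper se and the table stats are assumptions, not the notebook's results):

```r
# Toy data: two bars at 50% cacao, six bars at 70% cacao.
toy <- data.frame(
  pct    = c(50, 50, 70, 70, 70, 70, 70, 70),
  rating = c(3.5, 4.0, 3.0, 3.25, 3.5, 3.0, 3.75, 3.5)
)

# Standard error of the mean for one group of ratings.
se <- function(r) sd(r) / sqrt(length(r))

stats <- data.frame(
  pct   = sort(unique(toy$pct)),
  count = as.vector(tapply(toy$rating, toy$pct, length)),
  mean  = as.vector(tapply(toy$rating, toy$pct, mean)),
  se    = as.vector(tapply(toy$rating, toy$pct, se))
)
stats
```

In this toy table the 50% group has the higher mean but also the larger standard error, so its lead is less certain than the raw average suggests.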

3.3 Correlation Plot for Year, Rating, and Cacao Percentage

corrplot(cor(cocao[c("date","rating","pct")]), method="color",col=colorRampPalette(c("#C8F3B3","#F5E3B3","#F9DCD1"))(100),type="upper",tl.srt=90,tl.col="black")
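The coefficients behind this plot come from cor, whose default is the Pearson correlation, and Pearson correlation only measures linear association. As a small sketch on made-up numbers, a variable can be fully determined by another and still have zero correlation with it:

```r
# y depends entirely on x, but the dependence is not linear.
x <- -3:3
y <- x^2

r <- cor(x, y)   # Pearson correlation (the default method)
r
```

A coefficient near zero therefore rules out a straight-line trend, not every kind of relationship.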

3.4 Modeling

The correlation plot shows no clear linear relationship between cacao percentage and rating, so we examine the relationship with an SVM. The first graph below shows that the SVM model predicts reasonably well when the cacao percentage is roughly between 60% and 85%. We then take the model one step further by categorizing the cacao percentage: the category is the floor of (cacao percentage / 10). For example, a bar with at least 40% and less than 50% cacao falls into category 4. We then run the SVM prediction on these 6 categories. In our analysis, the prediction accuracy improved noticeably. So if we are handed a chocolate bar at random, our best bet for the highest rating is a cacao percentage between 70% and 80%.

model.c<- svm(formula= rate2 ~ pct+count, data=gdpct)
summary(model.c)

## 
## Call:
## svm(formula = rate2 ~ pct + count, data = gdpct)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.5 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  39

ggplot()+ geom_point(aes(x=gdpct$pct, y=gdpct$rate2), colour="pink")+ geom_line(aes(x=gdpct$pct, y=predict(model.c, newdata=gdpct)),color="cyan")+ theme_minimal()+
  ggtitle('SVM Predicted Rating value vs cocao Percentage ') +xlab(" cocao Percentage") +ylab("Rating")

c1<-mutate(cocao, 
       p1=floor(pct/10)
       )

p1data <- group_by(c1, p1)
gdp1 <- summarise(p1data,  count=n() ,rate3= mean(rating))
gdp1<- arrange(gdp1, desc(rate3))

ggplot(gdp1,aes(x=p1, y=rate3)) +
  geom_point(aes(size=count, colour=factor(rate3)), alpha=1/2) +
  geom_line(linetype="dashed", colour="gold")+ theme_minimal()+
  # theme() comes after theme_minimal(), which would otherwise reset the legend setting
  theme(legend.position="none") +
  labs(y="Chocolate Bar Rating ", x="cocao Percent Category" , title="Chocolate Bar Rating vs cocao Percent Category")
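The category rule used for p1 can be checked on a few made-up percentages (the vector below is an assumption for illustration):

```r
# Toy cacao percentages, including the boundary values 50 and 100.
pct <- c(42, 50, 65, 70, 75, 89, 100)

# Category = floor of (percentage / 10).
p1 <- floor(pct / 10)
p1          # 4 5 6 7 7 8 10

# Number of bars per category.
table(p1)
```

Note that a boundary value such as 50 lands in category 5, not 4, and that a 100% bar forms its own category 10.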

model.c1<- svm(formula= rate3 ~ p1+count, data=gdp1)
summary(model.c1)

ggplot()+ geom_point(aes(x=gdp1$p1, y=gdp1$rate3), colour="red")+ geom_line(aes(x=gdp1$p1, y=predict(model.c1, newdata=gdp1)),color="blue")+ theme_minimal()+
  ggtitle('SVM Predicted Rating vs cocao Percent Category ') +xlab("cocao Percentage") +ylab("Rating")
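The two SVM fits are compared visually here. One possible way to put a number on the fit is the root mean squared error (RMSE) between observed and predicted values. The helper rmse and the toy vectors below are assumptions for illustration, not results from the notebook:

```r
# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Made-up observed ratings and two sets of made-up predictions.
actual <- c(3.0, 3.2, 3.4, 3.1)
pred_a <- c(3.1, 3.0, 3.5, 3.3)
pred_b <- c(3.0, 3.2, 3.3, 3.1)

err_a <- rmse(actual, pred_a)
err_b <- rmse(actual, pred_b)
c(err_a, err_b)   # the smaller value indicates the closer fit
```

With the notebook's objects, the call would look like rmse(gdpct$rate2, predict(model.c, newdata = gdpct)). Because the two models are fitted to differently aggregated targets, their RMSE values are only a rough guide rather than a strict comparison.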

4 🌲 Cacao bean Origin

In this section we begin by cleaning the data. Since there is a large number of different origins, and many origin entries contain more than one country, we keep only the origins that appear at least 5 times. We then draw a scatter plot of rating against cacao bean origin. We observe that Haiti has the highest-quality cacao beans. Venezuela, Madagascar and Brazil also produce cacao beans in both high quantity and high quality. This also explains our first analysis: Madagascar produces a large quantity of decent-quality chocolate because it processes good cacao beans!

bean <- group_by(cocao, origin )
bean1 <- summarise(bean,  count=n() ,rate5= mean(rating))
# clean the data by keeping origins with count >= 5

bean2<- filter(bean1, count>=5)

# Scatter plot
ggplot(bean2,aes(x=rate5, y=origin)) +
  geom_point(aes(size=count, colour=factor(origin)), alpha=1/2) + theme_minimal(base_size = 9)+
  theme(legend.position="none")+ labs(x="Chocolate Bar Rating ", y=" Bean Origin" , title="Chocolate Bar Rating vs cocao Bean Origins")

5 Additional Exploration of the Best Chocolate Bars

5.1 Chocolate Bar Rating by Year

Now we visualize the reviews from 2006 to 2017.

ggplot(cocao,aes(x=rating, fill=as.factor(date)))+geom_density(alpha=0.6)+
  theme_minimal()+facet_wrap(~as.factor(date))+guides(fill="none") +labs(x="Rating", y="Density") +ggtitle("Chocolate Bar Review by Year")

5.2 Chocolate Bar Rating by Country and Year

cyear<- c1 %>% rowwise %>%
  filter(!is.na(location))%>%
  group_by(location,date)%>%
  filter(n()>=9)%>%
  summarise(count=n(), rate6=mean(rating))%>%
  ungroup()


tyear<- ggplot(cyear, aes(x=date,y=rate6, size=count, colour=as.factor(location),alpha=0.05,  group = 1))+ geom_point()+
  geom_line(aes(x=date, y=rate6, colour=as.factor(location)))+
  facet_wrap(~location, ncol=3)+ theme_minimal(base_size = 8) +theme(axis.text.x=element_text(angle=45,hjust=1))+ guides(colour="none")+ scale_alpha(guide="none")+
  labs(x="Year of Review", y="Average Rating", title="Chocolate Average Rating by Year")
tyear

5.3 Best Chocolate Bar with large production volume

In this section we examine the best chocolate bars more broadly: we take the mean rating of the chocolate bars grouped by bean origin and company location, and keep only the groups with at least 20 records. From the heatmap we can see that the U.S.A. produces the most chocolate bars, whereas chocolate made in France from Venezuelan cacao beans has the highest quality. This is consistent with our previous analysis: France produces exquisite chocolate, and Venezuela is one of the origins of top cacao beans.

orig <- group_by(cocao, origin , location)
orig1 <- summarise(orig,  count=n() ,rate4= mean(rating))
orig2<- filter(orig1, count>=20)

ggplot(orig2, aes(location, origin, fill=rate4))+
  geom_tile(colour = "white")+
scale_fill_gradient(low="#fce6c8", high="#4fcbac") + theme_minimal()+
  labs(x="Country ", y="Bean Origin" , title="Chocolate Bar Heatmap" ,subtitle=" Country vs Bean Origin" , fill="Rating")

5.4 Chocolate bar by Company

Word Cloud

names <- cocao %>%
  group_by(company) %>%
  dplyr::summarize(count = n()) 
set.seed(88)
wordcloud(names$company, freq=names$count, min.freq=1,max.words=100 ,random.order=FALSE,rot.per=0.35,colors=brewer.pal(8, "Dark2"), main="Company")

Violin Plot

comp<- cocao %>% rowwise %>%
  filter(!is.na(company))%>%
  group_by(company)%>%
  filter(n()>=10)%>%
  mutate(rate5=mean(rating))

vc <- ggplot(comp,aes(x=reorder(company,rating), y=rating, fill=rate5))+geom_violin()+
      coord_flip()+ theme_minimal(base_size=7) +
      scale_fill_continuous(name="Average Rating",low="#F9D8D9",high="#96316C") +
      expand_limits(y=c(0.5))+
      labs(x="Company", y="Rating")+ ggtitle("Chocolate Company and Rating")
vc

TreeMap for Chocolate Company

#company count
keys1 <- group_by(cocao, location,company )
keys2 <- dplyr::summarise(keys1,  count=n())
keys2 <- arrange(keys2, desc(count))
keys3  <-filter(keys2, count >8)
# company treemap
treemap(keys3, index=c("location","company"), vSize="count", type="index", 
        palette="Accent", title="Top Chocolate Company", fontsize.title=6)

5.5 Absolute Best Chocolate Bar

From the bar chart we can see that the only two chocolate bars with a 5-star rating come from Italy, one with beans from Venezuela and the other from A. More interestingly, both are produced by the same company, Amedei, with 70% cacao. Looking forward to trying them one day!

best<- filter(cocao, rating >4)
ggplot(best, aes(origin, rating))+geom_bar(stat="identity", aes(fill=company)) + theme_minimal()+
labs(x="Bean Origin ", y="Rating score" , title="Best Chocolate Bar")

fig<- plot_ly (type="treemap", labels=keys3$company,
               parents=keys3$location,
               values=keys3$count)
fig

6 Statistical Analysis 101 - ANOVA

ANOVA (analysis of variance) is a statistical technique used to check whether the means of two or more groups are significantly different from each other.

\[F = \frac{\text{variance between groups}}{\text{variance within groups}}\]

\[F= \frac{MS_{groups}}{MS_{Error}}\]
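To make the ratio concrete, here is a minimal sketch that computes F by hand on made-up data and compares it with the value reported by aov. The toy data frame and all variable names are assumptions for illustration:

```r
# Toy data: three groups with four observations each.
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 4)),
  y     = c(3.0, 3.5, 3.25, 3.75,
            2.5, 3.0, 2.75, 3.25,
            3.5, 4.0, 3.75, 3.5)
)

grand <- mean(toy$y)                       # overall mean
means <- tapply(toy$y, toy$group, mean)    # group means
n     <- tapply(toy$y, toy$group, length)  # group sizes
k     <- nlevels(toy$group)                # number of groups
N     <- nrow(toy)                         # total observations

# Between-group and within-group sums of squares.
ss_groups <- sum(n * (means - grand)^2)
ss_error  <- sum((toy$y - means[as.character(toy$group)])^2)

# Mean squares and the F ratio.
ms_groups <- ss_groups / (k - 1)
ms_error  <- ss_error / (N - k)
f_manual  <- ms_groups / ms_error

# The same statistic as reported by aov().
f_aov <- summary(aov(y ~ group, data = toy))[[1]][["F value"]][1]
all.equal(f_manual, f_aov)
```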

Assumptions:

Independence of observations
Normal Distribution
Equal variances

6.1 Mini Example: Analysis of the mean rating of chocolate bars from the U.S.A., France, and Canada

a<- cocao[cocao$location %in% c("U.S.A.","France","Canada"),]
a1<- group_by(a, location) %>%
     summarise(count=n(), mean=mean(rating, na.rm=TRUE),
               sd=sd(rating,na.rm=TRUE))

a1

## # A tibble: 3 × 4
##   location count  mean    sd
##   <chr>    <int> <dbl> <dbl>
## 1 Canada     125  3.32 0.424
## 2 France     156  3.25 0.547
## 3 U.S.A.     764  3.15 0.442

a2<- a %>% rowwise %>%
  filter(!is.na(location))%>%
  group_by(location)%>%
  filter(n()>=10)%>%
  mutate(mean=mean(rating))

Check ANOVA assumptions:

QQ plot for Normal Distribution

We can see that the ratings deviate from a normal distribution, so the normality assumption is not met.

ggplot(a, aes(sample = rating, colour = factor(location))) +
  stat_qq() + theme_minimal()+
  stat_qq_line()

Equal Variance:

From the Bartlett homogeneity test we can see that p < 0.05, which means the variances of the three groups are significantly different from each other. We therefore conclude that we can't use ANOVA.

var<- bartlett.test(rating ~ location, data= a) 
var

## 
##  Bartlett test of homogeneity of variances
## 
## data:  rating by location
## Bartlett's K-squared = 14.232, df = 2, p-value = 0.0008121

Null Hypothesis: There is no difference in mean chocolate bar rating among the U.S.A., France, and Canada.

H0: \(\mu_{USA} = \mu_{France} = \mu_{Canada}\)

Interpretation: the \(p\)-value is below 0.05, so we reject the null hypothesis, which means the mean chocolate bar rating differs significantly for at least one of France, the U.S.A., and Canada.

ANOVA doesn't fit the data, so this result is NOT reliable, as we can also see from the box plot.

res.aov <-aov(rating ~ location, data=a)
summary(res.aov)

##               Df Sum Sq Mean Sq F value   Pr(>F)    
## location       2   3.82  1.9094   9.143 0.000116 ***
## Residuals   1042 217.60  0.2088                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
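As a quick check, the F value in this table follows from the formula \(F = MS_{groups}/MS_{Error}\), using the rounded sums of squares and degrees of freedom printed above:

\[MS_{groups} = \frac{3.82}{2} \approx 1.91, \qquad MS_{Error} = \frac{217.60}{1042} \approx 0.2088, \qquad F = \frac{1.9094}{0.2088} \approx 9.14\]

The small gap to the reported 9.143 comes from rounding in the printed table.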

bc <- ggplot(a2,aes(x=reorder(location,rating), y=rating, fill=mean))+geom_boxplot()+
      theme_minimal(base_size=10) +
      scale_fill_continuous(name="Average Rating",low="#D2ECF2",high="#78B0BD") +
      expand_limits(y=c(0.5))+
      labs(x="Country", y="Rating")+ ggtitle("ANOVA example for Chocolate Bar Rating")
bc
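Since the equal-variance and normality assumptions do not hold here, one possible alternative (not part of the original analysis) is Welch's one-way test, which does not assume equal variances, or the rank-based Kruskal-Wallis test. Both are available in base R. The sketch below runs them on simulated data; the toy data frame and its parameters are assumptions, so its p-values say nothing about the chocolate ratings:

```r
set.seed(1)

# Simulated ratings for three groups with unequal sizes and spreads.
toy <- data.frame(
  location = factor(rep(c("A", "B", "C"), times = c(30, 40, 50))),
  rating   = c(rnorm(30, mean = 3.3, sd = 0.40),
               rnorm(40, mean = 3.2, sd = 0.60),
               rnorm(50, mean = 3.1, sd = 0.45))
)

# Welch's one-way test: compares means without assuming equal variances.
welch <- oneway.test(rating ~ location, data = toy, var.equal = FALSE)

# Kruskal-Wallis test: rank-based, no normality assumption.
kw <- kruskal.test(rating ~ location, data = toy)

welch$p.value
kw$p.value
```

On the notebook's data, the same calls would use data = a in place of the toy data frame.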

7 Summary

This notebook showed that North America has a huge chocolate industry, while the cacao beans come from African and South American countries. It is interesting that bars with 50% cacao have the highest average rating, although that group is small; maybe they are not too sweet and not too bitter! Next time I go to Italy, I will try the Amedei chocolate!

8 Reference:

[1] https://en.wikipedia.org/wiki/Analysis_of_variance

[2] https://www.kaggle.com/willcanniford/chocolate-bar-ratings-extensive-eda

Cocao

fangya

Updated: 2022-03-03