0.1 Chocolate Bar Analysis

0.2 Which country produces the highest-rated bars on average?

The data set has columns for chocolate company, bean origin, bean type, company location, percentage of cocoa, and rating. We first group the bars by country (the company location), take the average rating for each country, and plot it. The following graph shows that Chile, with an average score of 3.75, outranks all the other countries and produces the highest-rated chocolate bars. If we take production volume into consideration, Canada and France produce fine-quality chocolate in high volumes. Surprisingly, Madagascar produces a large quantity of chocolate with decent flavor.

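# packages used throughout: dplyr (grouping), ggplot2 (plots), e1071 (svm)
library(dplyr)
library(ggplot2)
library(e1071)

colnames(cocoa) <- c("company", "bean.bar.orgin", "ref","date" , "percent", "location","rating", "beantype" ,"origin")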


#average rating by location
loca <- group_by(cocoa, location)
good <- summarise(loca,  count=n(),
                  rate1= mean(rating))
good1<- arrange(good, desc(rate1))

ggplot(good1,aes(x=reorder(location,rate1), y=rate1)) +
  geom_point(aes(size=count, colour=factor(rate1)), alpha=1/2) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1) , legend.position="none") +
  # name the y and title arguments so that labs() actually applies them
  labs(x="Country", y="Chocolate Rating", title="Chocolate Rating vs Country")
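
The grouping step above can be checked by hand on a tiny table. The following is a minimal sketch in base R; the `toy` data frame and its values are invented for illustration and are not taken from the chocolate data. It shows why the bar count is worth plotting alongside the mean: a location can top the ranking on the strength of a single bar.

```r
# Toy data, invented for illustration only (not from the chocolate data set)
toy <- data.frame(
  location = c("A", "B", "B", "B", "C", "C"),
  rating   = c(3.75, 3.50, 3.25, 3.75, 3.00, 3.50)
)

# Mean rating and number of bars per location
means  <- tapply(toy$rating, toy$location, mean)
counts <- tapply(toy$rating, toy$location, length)

avg <- data.frame(
  location = names(means),
  rate1    = as.vector(means),
  count    = as.vector(counts)
)

# Sort from highest to lowest mean rating
avg <- avg[order(-avg$rate1), ]
print(avg)
```

Here location "A" ranks first with only one bar, while "B" has a slightly lower mean backed by three bars.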

0.3 What is the relationship between cocoa solid percentage and rating?

We first draw a simple scatter plot of cocoa percentage against rating. As we can see, both the highest-rated and the lowest-rated chocolate bars contain 70% cocoa, so we cannot draw a conclusion from this graph alone. We improve the analysis by grouping the data by cocoa percentage, taking the average rating within each percentage group, and plotting that average against the percentage.

0.3.1 1. Scatter plot

#convert percentage to numerical
cocoa$pct = as.numeric(gsub("\\%", "", cocoa$percent))

ggplot(cocoa,aes(x=pct, y=rating)) +geom_point(aes(colour=factor(location))) +
  theme( legend.position="bottom",legend.key.width=unit(0.2,"cm"),legend.key.height=unit(0.2,"cm")) +
  xlab("Cocoa Percent (%)") + ylab("Chocolate Bar Rating") +
  ggtitle("Scatter plot of Cocoa Percent vs Chocolate Bar Rating")
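
The percentage conversion used above can be tried in isolation. This is a small sketch with made-up input strings in the same "NN%" form as the `percent` column; it uses `sub()` with `fixed = TRUE`, which treats "%" as a literal character so no escaping is needed.

```r
# Example strings in the same "NN%" form as the percent column (made up)
percent <- c("70%", "72.5%", "100%")

# fixed = TRUE matches "%" literally instead of as a regular expression
pct <- as.numeric(sub("%", "", percent, fixed = TRUE))
print(pct)
```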

0.3.2 2. Scatter plot of average value

From this graph we can see that the highest average rating occurs at 50% cocoa. However, this result is not persuasive, because the sample size of 50% cocoa bars is too small. Taking sample size into account, 70% cocoa bars are still our best choice. Moreover, any chocolate bar with 65% to 75% cocoa looks pretty good! We will try to support this conjecture by building an SVM model.

#rating by pct
pctdata <- group_by(cocoa, pct)
gdpct <- summarise(pctdata,  count=n() ,rate2= mean(rating))
gdpct1<- arrange(gdpct, desc(rate2))

ggplot(gdpct1,aes(x=pct, y=rate2)) +
  geom_point(aes(size=count, colour=factor(pct)), alpha=1/2) +
  theme(legend.position="none") 
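
One way to make the sample-size concern concrete is to report a standard error next to each group mean and to set aside groups that are too small. The sketch below uses base R on invented numbers; `min_count` is a hypothetical threshold chosen only for the example.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  pct    = c(50, 70, 70, 70, 70, 75, 75, 75),
  rating = c(3.75, 3.50, 3.00, 3.25, 3.75, 3.25, 3.50, 3.00)
)

n  <- tapply(toy$rating, toy$pct, length)
m  <- tapply(toy$rating, toy$pct, mean)
se <- tapply(toy$rating, toy$pct, sd) / sqrt(n)  # NA when a group has one bar

by_pct <- data.frame(
  pct   = as.numeric(names(m)),
  count = as.vector(n),
  avg   = as.vector(m),
  se    = as.vector(se)
)

min_count <- 3  # hypothetical cut-off, not taken from the analysis
reliable  <- by_pct[by_pct$count >= min_count, ]
print(reliable)
```

In this toy table the 50% group has the highest mean but only one bar, so it has no standard error and is dropped by the cut-off.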

0.3.3 3. Modeling

It is obvious that there is no linear relationship between cocoa percentage and rating, so we examine the relationship with an SVM instead. The first graph below shows that the SVM model gives some good predicted values when the cocoa percentage is roughly between 60% and 85%.

Next, we take the model one step further by categorizing the cocoa percentage. Our approach is to take the floor of (cocoa percentage / 10). For example, when the cocoa percentage is at least 40% and below 50%, the cocoa category is 4. We then run the SVM prediction on these 6 categories. With this change, the accuracy of the predicted values increased significantly. If we are handed a chocolate bar at random, our best bet for the highest rating is a cocoa percentage between 70% and 80%.

model.c<- svm(formula= rate2 ~ pct+count, data=gdpct)
summary(model.c)
## 
## Call:
## svm(formula = rate2 ~ pct + count, data = gdpct)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.5 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  39
ggplot() +
  geom_point(aes(x=gdpct$pct, y=gdpct$rate2), colour="pink") +
  # newdata is the full argument name of predict() for svm models
  geom_line(aes(x=gdpct$pct, y=predict(model.c, newdata=gdpct)), colour="cyan") +
  ggtitle("SVM Predicted Rating vs Cocoa Percentage") + xlab("Cocoa Percentage") + ylab("Rating")

c1<-mutate(cocoa, 
       p1=floor(pct/10)
       )
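
The binning rule can be seen on a few sample values. This is a minimal sketch with made-up percentages; the `labels` vector is added here only to show which range each category covers (lower bound included, upper bound excluded).

```r
# Made-up percentages to illustrate the floor(pct / 10) rule
pct <- c(42, 49.5, 50, 65, 70, 75, 100)

p1 <- floor(pct / 10)
print(p1)

# Human-readable range for each category, e.g. category 4 covers 40 up to 50
labels <- paste0(p1 * 10, "-", p1 * 10 + 10, "%")
print(labels)
```

Note that 49.5 still falls in category 4, while exactly 50 moves to category 5.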

p1data <- group_by(c1, p1)
gdp1 <- summarise(p1data,  count=n() ,rate3= mean(rating))
gdp1<- arrange(gdp1, desc(rate3))

ggplot(gdp1,aes(x=p1, y=rate3)) +
  geom_point(aes(size=count, colour=factor(rate3)), alpha=1/2) +
  theme(legend.position="none") +geom_line(linetype="dashed", colour="gold")+
  labs(y="Chocolate Bar Rating ", x="Cocoa Percent Category" , title="Chocolate Bar Rating vs Cocoa Percent Category")

model.c1<- svm(formula= rate3 ~ p1+count, data=gdp1)
# summarise the category-level model (model.c1)
summary(model.c1)
ggplot() +
  geom_point(aes(x=gdp1$p1, y=gdp1$rate3), colour="red") +
  geom_line(aes(x=gdp1$p1, y=predict(model.c1, newdata=gdp1)), colour="blue") +
  ggtitle("SVM Predicted Rating vs Cocoa Percent Category") + xlab("Cocoa Percent Category") + ylab("Rating")
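
The comparison between the two fits is judged by eye from the plots. One possible way to put a number on it is the root-mean-square error between observed and predicted values. The sketch below is an assumption about how this could be done, not part of the original analysis; `rmse` is a hypothetical helper and the vectors are invented.

```r
# Hypothetical helper: root-mean-square error between two numeric vectors
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Invented values, for illustration only
observed  <- c(3.00, 3.25, 3.50)
predicted <- c(3.10, 3.25, 3.30)
print(rmse(observed, predicted))
```

With the fitted models, the same helper could be called as `rmse(gdpct$rate2, predict(model.c, newdata = gdpct))` and `rmse(gdp1$rate3, predict(model.c1, newdata = gdp1))`. Both would be in-sample errors on differently aggregated data, so they give only a rough comparison.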

0.4 Where are the best beans grown?

In this section we begin by cleaning the data. Because there is a large number of distinct origins, and many origin entries contain more than one country, we keep only the origins that appear at least 5 times. We then draw a scatter plot of rating vs. cocoa bean origin. We observe that Haiti has the highest-quality cocoa beans. Furthermore, Venezuela, Madagascar and Brazil also produce cocoa beans of high quantity and quality. This also explains our first analysis: Madagascar produces a large quantity of decent-quality chocolate because it has good cocoa beans!

bean <- group_by(cocoa, origin )
bean1 <- summarise(bean,  count=n() ,rate5= mean(rating))
# clean the data by keeping origins with count >= 5
bean2<- filter(bean1, count>=5)

# Scatter plot
ggplot(bean2,aes(x=rate5, y=origin)) +
  geom_point(aes(size=count, colour=factor(origin)), alpha=1/2) +
  theme(legend.position="none") +
  labs(x="Chocolate Bar Rating", y="Bean Origin", title="Chocolate Bar Rating vs Cocoa Bean Origins")
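
The count filter above removes multi-country origins only indirectly, because such entries tend to be rare. A more explicit approach could flag them first. The sketch below is an assumption: it treats a comma, slash, or ampersand as a sign that an origin entry names several countries, which may not match how the real column is written. The strings and the threshold are invented.

```r
# Invented origin strings, for illustration only
origin <- c("Venezuela", "Peru, Ecuador", "Ghana & Madagascar", "Haiti", "Haiti")

# Assumed separators between countries: comma, slash, ampersand
multi <- grepl("[,/&]", origin)

# Keep single-country entries, then count how often each appears
single_origin <- origin[!multi]
counts <- table(single_origin)

min_count <- 2  # toy threshold; the analysis above uses 5
keep <- names(counts)[counts >= min_count]
print(keep)
```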

0.5 Additional Exploration of the Best Chocolate Bars

0.5.1 1. Best Chocolate Bar with large production volume

In this section, we examine the best chocolate bars more broadly. To do so, we take the mean rating for each combination of bean origin and company location, and keep only the combinations with at least 20 records. From the heatmap we can see that the USA produces the most chocolate bars, whereas chocolate made in France with cocoa beans from Venezuela has the highest quality. This is consistent with our previous analysis: France produces exquisite chocolate, and Venezuela is one of the top cocoa bean origins.

orig <- group_by(cocoa, origin , location)
orig1 <- summarise(orig,  count=n() ,rate4= mean(rating))
orig2<- filter(orig1, count>=20)

ggplot(orig2, aes(location, origin, fill=rate4)) +
  geom_tile(colour = "white") +
  scale_fill_gradient(low="green", high="red") +
  labs(x="Country", y="Bean Origin", title="Chocolate Bar Heatmap", subtitle="Country vs Bean Origin", fill="Rating")
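
The two-key grouping behind the heatmap can be reproduced on a toy table. This is a minimal base R sketch; the data, the single-letter names, and the threshold `min_count` are invented for illustration (the analysis above uses 20).

```r
# Toy data, invented for illustration only
toy <- data.frame(
  origin   = c("X", "X", "X", "Y", "Y", "X"),
  location = c("P", "P", "P", "P", "Q", "Q"),
  rating   = c(3.50, 4.00, 3.00, 3.25, 3.75, 2.75)
)

# Mean rating and record count for each origin-location pair
means  <- aggregate(rating ~ origin + location, data = toy, FUN = mean)
counts <- aggregate(rating ~ origin + location, data = toy, FUN = length)
names(means)[3]  <- "rate4"
names(counts)[3] <- "count"

pairs <- merge(means, counts, by = c("origin", "location"))

min_count <- 3  # toy threshold
frequent  <- pairs[pairs$count >= min_count, ]
print(frequent)
```

Only the pair with enough records survives the filter, which is why the heatmap shows relatively few tiles.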

0.5.2 2. Absolute Best Chocolate Bar

From the bar chart we can see that the only two chocolate bars with a 5-star rating both come from Italy, one with beans originating from Venezuela and the other from A. More interestingly, both are produced by the same company, Amedei, with 70% cocoa. Looking forward to trying them one day!

best<- filter(cocoa, rating >4)
ggplot(best, aes(origin, rating))+geom_bar(stat="identity", aes(fill=location)) +
labs(x="Bean Origin ", y="Rating score" , title="Best Chocolate Bar")