We are often dazzled by the sheer number of chocolate brands on offer in grocery stores and shops. In this project we use EDA and basic modeling to explore the chocolate bar rating dataset and answer:
Which country produces the highest-rated chocolate bars?
Which locations produce the top-quality cocoa beans?
Is there a relationship between cocoa percentage and the customer rating?
We hope this information gives you a rough idea, so that next time you go shopping you know which chocolate best suits your taste.
Cocoa Dataset
datatable(cocao)
First, we look at the geographic distribution of chocolate companies and of cocoa bean origins.
North America and Australia dominate the chocolate industry!
# Group by country
colnames(cocao) <- c("company", "bean.bar.orgin", "ref","date" , "percent", "location","rating", "beantype" ,"origin")
commap <- group_by(cocao, location)
commap1 <- summarise(commap, count=n())
map1 <- joinCountryData2Map(commap1, joinCode="NAME", nameJoinColumn="location")
## 51 codes from your data successfully matched countries in the map
## 9 codes from your data failed to match with a country code in the map
## 192 codes from the map weren't represented in your data
mapCountryData(map1, nameColumnToPlot="count", mapTitle="Chocolate Company Distribution" , colourPalette = "negpos8")
Africa, Australia and some South American countries produce most of the cocoa beans.
# Group by bean origin
omap <- group_by(cocao, origin)
omap1 <- summarise(omap, count=n())
map2 <- joinCountryData2Map(omap1, joinCode="NAME", nameJoinColumn="origin")
## 47 codes from your data successfully matched countries in the map
## 54 codes from your data failed to match with a country code in the map
## 196 codes from the map weren't represented in your data
mapCountryData(map2, nameColumnToPlot="count", mapTitle="Cocao Origin Distribution" , colourPalette = "negpos8")
The dataset is broken down into chocolate company, bean origin, bean type, company location, percentage of cocoa, and rating. We first group the records by country, then take the average rating and plot it. The following graph shows that Chile, with an average score of 3.75, outranks all other countries and produces the highest-rated chocolate bars. If we take production volume into consideration, Canada and France produce fine-quality chocolate in high volumes. Of course, the U.S.A. produces a large quantity of chocolate with decent flavor.
# average rating by location
loca <- group_by(cocao, location)
good <- summarise(loca, count=n(),
                  rate1= mean(rating))
good1 <- arrange(good, desc(rate1))
# scatter plot
ggplot(good1,aes(x=reorder(location,rate1), y=rate1)) +geom_point(aes(size=count, colour=factor(rate1)), alpha=1/2) + theme_minimal(base_size = 9)+
  theme(axis.text.x = element_text(angle = 90, hjust = 1) , legend.position="none") +
  labs(x="Country", y="Chocolate Rating", title="Chocolate Rating vs Country")
# same plot stored in t1, with a white panel background instead of theme_minimal()
t1 <- ggplot(good1,aes(x=reorder(location,rate1), y=rate1)) +geom_point(aes(size=count, colour=factor(rate1)), alpha=1/2) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1) ,legend.position="none", panel.background = element_rect(fill="white")) +
  labs(x="Country", y="Chocolate Rating", title="Chocolate Rating vs Country")
t1
We first draw a simple scatter plot of cocoa percentage against rating. As we can see, both the highest- and the lowest-rated chocolate bars contain 70% cocoa, so we cannot draw a conclusion from this graph. We improve the analysis by grouping the data by cocoa percentage, taking the average rating of each percentage group, and plotting that average instead.
# convert percentage to numerical
cocao$pct = as.numeric(gsub("\\%", "", cocao$percent))
#scatter plot
ggplot(cocao,aes(x=pct, y=rating)) +geom_point(aes(colour=factor(location)))+theme_minimal() +theme( legend.position="bottom",legend.key.width=unit(0.2,"cm"),legend.key.height=unit(0.2,"cm")) +
xlab("cocao Percent(%) ") + ylab("Chocolate Bar Rating") +
ggtitle("Scatter plot of cocao Percent vs Chocolate Bar Rating")
From this graph we can see that bars with 50% cocoa have the highest average rating. However, this is not persuasive, because the sample size of 50% cocoa bars is not large enough. Taking sample sizes into account, the 70% cocoa bar is still our best choice. Moreover, chocolate bars with 65% to 75% cocoa all look pretty good! We will try to support this conjecture by building an SVM model.
# rating by pct
pctdata <- group_by(cocao, pct)
gdpct <- summarise(pctdata, count=n() ,rate2= mean(rating))
gdpct1 <- arrange(gdpct, desc(rate2))
ggplot(gdpct1,aes(x=pct, y=rate2)) +
geom_point(aes(size=count, colour=factor(pct)), alpha=1/2) +theme_minimal()+
theme(legend.position="none")
corrplot(cor(cocao[c("date","rating","pct")]), method="color",col=colorRampPalette(c("#C8F3B3","#F5E3B3","#F9DCD1"))(100),type="upper",tl.srt=90,tl.col="black")
The correlation plot shows no clear linear relationship between cocoa percentage and rating, so we examine the relationship with an SVM instead. The following graph shows that the SVM model produces some good predicted values when the cocoa percentage is roughly between 60% and 85%. Next, we take the model one step further by categorizing the cocoa percentage: we take the floor of (cocoa percentage / 10). For example, when the cocoa percentage is between 40% and 50%, the cocoa category is 4. We then run the SVM prediction on these 6 categories. In our analysis, the predictions follow the averaged ratings much more closely after this step. So, given a chocolate bar at random, our best bet for the highest rating is a cocoa percentage between 70% and 80%.
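As a quick illustration of the binning rule described above, the following base R sketch maps a few percentages to their categories. The values in `pct` are hypothetical, chosen only to show where the category boundaries fall; they are not taken from the dataset.

```r
# Hypothetical cocoa percentages (made up for illustration)
pct <- c(42, 50, 64.5, 70, 75, 88, 100)

# Category = floor(pct / 10), so 40 <= pct < 50 maps to category 4
category <- floor(pct / 10)

# Show each percentage next to its category
data.frame(pct = pct, category = category)
```

Note that a bar with exactly 50% cocoa lands in category 5 rather than 4, and a 100% bar gets a category of its own (10).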
model.c <- svm(formula= rate2 ~ pct+count, data=gdpct)
summary(model.c)
##
## Call:
## svm(formula = rate2 ~ pct + count, data = gdpct)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.5
## epsilon: 0.1
##
##
## Number of Support Vectors: 39
ggplot()+ geom_point(aes(x=gdpct$pct, y=gdpct$rate2), colour="pink")+ geom_line(aes(x=gdpct$pct, y=predict(model.c, newdata=gdpct)),color="cyan")+ theme_minimal()+
ggtitle('SVM Predicted Rating value vs cocao Percentage ') +xlab(" cocao Percentage") +ylab("Rating")
c1 <- mutate(cocao,
             p1=floor(pct/10)
)
p1data <- group_by(c1, p1)
gdp1 <- summarise(p1data, count=n() ,rate3= mean(rating))
gdp1 <- arrange(gdp1, desc(rate3))
ggplot(gdp1,aes(x=p1, y=rate3)) +
  geom_point(aes(size=count, colour=factor(rate3)), alpha=1/2) +
  geom_line(linetype="dashed", colour="gold")+
  # apply theme_minimal() first so it does not reset legend.position
  theme_minimal()+ theme(legend.position="none") +
  labs(y="Chocolate Bar Rating ", x="Cocoa Percent Category" , title="Chocolate Bar Rating vs Cocoa Percent Category")
model.c1 <- svm(formula= rate3 ~ p1+count, data=gdp1)
# note: this prints the first model again; use summary(model.c1) to inspect the category model
summary(model.c)
##
## Call:
## svm(formula = rate2 ~ pct + count, data = gdpct)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.5
## epsilon: 0.1
##
##
## Number of Support Vectors: 39
ggplot()+ geom_point(aes(x=gdp1$p1, y=gdp1$rate3), colour="red")+ geom_line(aes(x=gdp1$p1, y=predict(model.c1, newdata=gdp1)),color="blue")+ theme_minimal()+
ggtitle('SVM Predicted Rating vs cocao Percent Category ') +xlab("cocao Percentage") +ylab("Rating")
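Whether the binned model really predicts better can be checked with an error metric rather than by eye. The sketch below defines a root mean squared error helper in base R. The name `rmse` is introduced here purely for illustration and is not part of the original notebook, and the numbers in the toy call are made up.

```r
# Root mean squared error between observed and predicted values
# (`rmse` is a hypothetical helper, not part of the notebook)
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Toy check with made-up numbers
observed  <- c(3.0, 3.2, 3.3, 3.1)
predicted <- c(3.1, 3.2, 3.2, 3.0)
rmse(observed, predicted)

# With the objects fitted in this section, one could compute for instance:
# rmse(gdpct$rate2, predict(model.c,  newdata = gdpct))
# rmse(gdp1$rate3,  predict(model.c1, newdata = gdp1))
```

Both values would be in-sample errors on two different targets (per-percentage averages versus per-category averages), so a lower error for the binned model may partly reflect the smoother target rather than a better model.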
In this section we begin by cleaning the data. Since there is a large number of distinct origins, and many origin entries contain more than one country, we keep only the origins that appear at least 5 times. We then draw a scatter plot of rating vs. cocoa bean origin. We observe that Haiti has the highest-quality cocoa beans. Furthermore, Venezuela, Madagascar and Brazil also produce cocoa beans in high quantity and quality. This also helps explain our first analysis: Madagascar produces chocolate of decent quality in large quantity because it processes good cocoa beans!
bean <- group_by(cocao, origin )
bean1 <- summarise(bean, count=n() ,rate5= mean(rating))
# clean the data by selecting count >= 5
bean2 <- filter(bean1, count>=5)
# Scatter plot
ggplot(bean2,aes(x=rate5, y=origin)) +
  geom_point(aes(size=count, colour=factor(origin)), alpha=1/2) + theme_minimal(base_size = 9)+
  theme(legend.position="none")+ labs(x="Chocolate Bar Rating ", y=" Bean Origin" , title="Chocolate Bar Rating vs Cocoa Bean Origins")
Now we will visualize the reviews from 2006 to 2017.
ggplot(cocao,aes(x=rating, fill=as.factor(date)))+geom_density(alpha=0.6)+
theme_minimal()+facet_wrap(~as.factor(date))+guides(fill="none") +labs(x="Rating", y="Density") +ggtitle("Chocolate Bar Review by Year")
cyear <- c1 %>% rowwise %>%
  filter(!is.na(location))%>%
  group_by(location,date)%>%
  filter(n()>=9)%>%
  summarise(count=n(), rate6=mean(rating))%>%
  ungroup()
tyear <- ggplot(cyear, aes(x=date,y=rate6, size=count, colour=as.factor(location),alpha=0.05, group = 1))+ geom_point()+
  geom_line(aes(x=date, y=rate6, colour=as.factor(location)))+
  facet_wrap(~location, ncol=3)+ theme_minimal(base_size = 8) +theme(axis.text.x=element_text(angle=45,hjust=1))+ guides(colour=FALSE)+ scale_alpha(guide=F)+
  labs(x="Year of Review", y="Average Rating", title="Chocolate Average Rating by Year")
tyear
In this section, we examine the best chocolate bars more broadly. To do so, we take the mean chocolate bar rating grouped by bean origin and company location, and keep only the combinations with at least 20 records. From the heatmap we can see that the USA produces the most chocolate bars, whereas chocolate produced in France from cocoa beans originating in Venezuela has the highest quality. This is consistent with our previous analysis: France produces exquisite chocolate, while Venezuela is one of the origins of top cocoa beans.
orig <- group_by(cocao, origin , location)
orig1 <- summarise(orig, count=n() ,rate4= mean(rating))
orig2 <- filter(orig1, count>=20)
ggplot(orig2, aes(location, origin, fill=rate4))+
geom_tile(colour = "white")+
scale_fill_gradient(low="#fce6c8", high="#4fcbac") + theme_minimal()+
labs(x="Country ", y="Bean Origin" , title="Chocolate Bar Heatmap" ,subtitle=" Country vs Bean Origin" , fill="Rating")
Word Cloud
names <- cocao %>%
  group_by(company) %>%
  dplyr::summarize(count = n())
set.seed(88)
wordcloud(names$company, freq=names$count, min.freq=1,max.words=100 ,random.order=FALSE,rot.per=0.35,colors=brewer.pal(8, "Dark2"), main="Company")
Violin Plot
comp <- cocao %>% rowwise %>%
  filter(!is.na(company))%>%
  group_by(company)%>%
  filter(n()>=10)%>%
  mutate(rate5=mean(rating))
vc <- ggplot(comp,aes(x=reorder(company,rating), y=rating, fill=rate5))+geom_violin()+
  coord_flip()+ theme_minimal(base_size=7) +
  scale_fill_continuous(name="Average Rating",low="#F9D8D9",high="#96316C") +
  expand_limits(y=c(0.5))+
  labs(x="Company", y="Rating")+ ggtitle("Chocolate Company and Rating")
vc
TreeMap for Chocolate Company
# company count
keys1 <- group_by(cocao, location,company )
keys2 <- dplyr::summarise(keys1, count=n())
keys2 <- arrange(keys2, desc(count))
keys3 <- filter(keys2, count >8)
# company treemap
treemap(keys3, index=c("location","company"), vSize="count", type="index",
palette="Accent", title="Top Chocolate Company", fontsize.title=6)
From the bar chart we can see that the only two 5-star chocolate bars come from Italy, one with bean origin Venezuela and the other from A. More interestingly, both are produced by the same company, Amedei, with 70% cocoa. Looking forward to trying them one day!
best <- filter(cocao, rating >4)
ggplot(best, aes(origin, rating))+geom_bar(stat="identity", aes(fill=company)) + theme_minimal()+
  labs(x="Bean Origin ", y="Rating score" , title="Best Chocolate Bar")
fig <- plot_ly (type="treemap", labels=keys3$company,
                parents=keys3$location,
                values=keys3$count)
fig
ANOVA: Analysis of variance is a statistical technique used to check whether the means of two or more groups are significantly different from each other.
\[F = \frac{\text{variance between groups}}{\text{variance within groups}}\]
\[F= \frac{MS_{groups}}{MS_{Error}}\]
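To make the ratio concrete, here is a minimal base R sketch that computes the F statistic by hand and cross-checks it against R's built-in ANOVA table. The data frame `toy` and its values are hypothetical and are not drawn from the chocolate dataset. The same ratio can be read off the ANOVA table later in this notebook, where 1.9094 / 0.2088 is approximately 9.14.

```r
# Hypothetical ratings for three groups (made-up numbers, not the dataset)
toy <- data.frame(
  group  = rep(c("A", "B", "C"), each = 4),
  rating = c(3.0, 3.5, 3.25, 3.75,
             2.5, 3.0, 2.75, 3.25,
             3.5, 4.0, 3.75, 3.25)
)

grand_mean  <- mean(toy$rating)
group_means <- tapply(toy$rating, toy$group, mean)
group_sizes <- tapply(toy$rating, toy$group, length)

k <- length(group_means)  # number of groups
n <- nrow(toy)            # total number of observations

# Between-group and within-group sums of squares
ss_groups <- sum(group_sizes * (group_means - grand_mean)^2)
ss_error  <- sum((toy$rating - group_means[toy$group])^2)

# Mean squares and their ratio
ms_groups <- ss_groups / (k - 1)
ms_error  <- ss_error / (n - k)
f_stat    <- ms_groups / ms_error

# Cross-check against the F value from R's own ANOVA table
f_builtin <- anova(lm(rating ~ group, data = toy))[["F value"]][1]
all.equal(f_stat, f_builtin)
```

The degrees of freedom used here, k - 1 for groups and n - k for error, correspond to the Df column in the output of `summary(aov(...))`.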
Assumptions:
a <- cocao[cocao$location %in% c("U.S.A.","France","Canada"),]
a1 <- group_by(a, location) %>%
  summarise(count=n(), mean=mean(rating, na.rm=TRUE),
            sd=sd(rating,na.rm=TRUE))
a1
## # A tibble: 3 × 4
## location count mean sd
## <chr> <int> <dbl> <dbl>
## 1 Canada 125 3.32 0.424
## 2 France 156 3.25 0.547
## 3 U.S.A. 764 3.15 0.442
a2 <- a %>% rowwise %>%
  filter(!is.na(location))%>%
  group_by(location)%>%
  filter(n()>=10)%>%
  mutate(mean=mean(rating))
QQ plot for Normal Distribution
We can see that the ratings do not follow the reference line, so the data fails the normality check.
ggplot(a, aes(sample = rating, colour = factor(location))) +
stat_qq() + theme_minimal()+
stat_qq_line()
Equal Variance:
From the Bartlett homogeneity test we can see that p < 0.05, which means the variances of the three groups are significantly different from each other. We therefore conclude that we cannot use ANOVA.
var <- bartlett.test(rating ~ location, data= a)
var
##
## Bartlett test of homogeneity of variances
##
## data: rating by location
## Bartlett's K-squared = 14.232, df = 2, p-value = 0.0008121
Null hypothesis: there is no difference in mean chocolate bar rating between the U.S.A., France, and Canada.
H0: \(\mu_{USA} = \mu_{France} = \mu_{Canada}\)
Interpretation: since \(p < 0.05\), we reject the null hypothesis, which means that at least one of the mean chocolate bar ratings for France, the U.S.A., and Canada differs significantly from the others.
However, because the data violates the ANOVA assumptions, this result is NOT reliable, as we can also see from the box plot.
res.aov <- aov(rating ~ location, data=a)
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## location 2 3.82 1.9094 9.143 0.000116 ***
## Residuals 1042 217.60 0.2088
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bc <- ggplot(a2,aes(x=reorder(location,rating), y=rating, fill=mean))+geom_boxplot()+
  theme_minimal(base_size=10) +
  scale_fill_continuous(name="Average Rating",low="#D2ECF2",high="#78B0BD") +
  expand_limits(y=c(0.5))+
  labs(x="Country", y="Rating")+ ggtitle("ANOVA example for Chocolate Bar Rating")
bc
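Since the checks above suggest that the equal-variance and normality assumptions do not hold, one option this notebook does not pursue is a test that relaxes them. The sketch below, on made-up data, shows two such tests from base R's stats package: Welch's one-way test, which does not assume equal variances, and the Kruskal-Wallis rank-sum test, which does not assume normality. The data frame `toy` and its group labels are hypothetical.

```r
set.seed(1)  # make the made-up data reproducible

# Hypothetical ratings with unequal spread across three groups
toy <- data.frame(
  location = factor(rep(c("X", "Y", "Z"), each = 30)),
  rating   = c(rnorm(30, mean = 3.3, sd = 0.4),
               rnorm(30, mean = 3.2, sd = 0.6),
               rnorm(30, mean = 3.1, sd = 0.4))
)

# Welch's one-way test: does not assume equal variances
welch <- oneway.test(rating ~ location, data = toy, var.equal = FALSE)

# Kruskal-Wallis rank-sum test: does not assume normality
kw <- kruskal.test(rating ~ location, data = toy)

welch$p.value
kw$p.value
```

On the real data the same calls would take `data = a` in place of `toy`. Which test is more appropriate depends on which assumption is the bigger concern.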
This notebook showed that North America has a huge chocolate industry, while the cocoa beans come from African and South American countries. It is interesting that bars with 50% cocoa have the highest average rating in this dataset, even if that is based on a small sample; maybe they are not too sweet and not too bitter! Next time I go to Italy, I will try the Amedei chocolate!