First, we import the original cacao dataset. We also format the percent cocoa to integer values
cacao <- read.csv('flavors_of_cacao original.csv', header = TRUE, sep = ',')
cacao <- as.data.frame(cacao, stringsAsFactors=FALSE) #this fixes some bugs I faced later on
cacao$Cocoa.Percent <- as.integer(sub("%", "", cacao$Cocoa.Percent))
colnames(cacao) <- c('company','name','REF','review_date','cocoa_percent',
'company_loc','rating','bean_type','bean_origin')
When we imported the data, empty spaces were turned to a weird A character with an accent over it. We clean up the data here
'%!in%' <- function(x,y)!('%in%'(x,y))
to_NA <- function(x, start){ #start is the index of the first non-NA piece of data in levels(x)
for (i in 1:length(x)){
if (x[i] %!in% levels(x)[start:length(levels(x))]){
x[i] <- NA
}
}
return(x)
}
cacao$bean_type <- to_NA(cacao$bean_type, 3)
cacao$bean_origin <- to_NA(cacao$bean_origin, 3)
Here we add latitudes of the place where each cacao bean was grown, as it may be interesting to see how distance from equator affects the rating. The code is adapted from http://www.storybench.org/geocode-csv-addresses-r/
require(ggmap)
require(proto)
require(RDSTK)
cacao$bean_origin <- as.character(cacao$bean_origin)
#fixing some spelling errors in our data
cacao$bean_origin[which(cacao$bean_origin == 'Domincan Republic')] <- 'Dominican Republic'
cacao$bean_origin[which(cacao$bean_origin == 'Peru(SMartin,Pangoa,nacional)')] <- 'Peru'
#loop through the addresses to get the latitude of each bean origin
for(i in 1:nrow(cacao)){
if (!is.na(cacao$bean_origin[i])){
# Print("Working...")
result <- geocode(cacao$bean_origin[i], output = "latlona", source = "dsk", messaging = FALSE)
cacao$lat[i] <- as.numeric(result[2])
}
}
#the data science kit we used to find longitude had no information on the two locations below, so we add it manually
cacao$lat[which(cacao$bean_origin == 'Carribean(DR/Jam/Tri)')] <- 21.4691
cacao$lat[which(cacao$bean_origin == 'West Africa')] <- 13.5317
Let’s look at the chocolates’ ratings and percent cocoa makeup. The lowest and highest scores possible are 1 and 5, respectively, according to the rubric at http://flavorsofcacao.com/review_guide.html. This dataset contains chocolate bars at both ends of the spectrum, with the average being a score of 3.19. The percent cocoa also varies quite a bit.
cacao[cacao$rating == 1, c(1,2,7)] #these bars did the worst
## company name rating
## 327 Callebaut Baking 1
## 438 Claudio Corallo Principe 1
## 466 Cote d' Or (Kraft) Sensations Intense 1
## 1176 Neuhaus (Callebaut) Dark 1
cacao[cacao$rating == 5, c(1,2,7)] #while these bars did the best
## company name rating
## 79 Amedei Chuao 5
## 87 Amedei Toscano Black 5
summary(cacao[, c(5, 7)])
## cocoa_percent rating
## Min. : 42.0 Min. :1.000
## 1st Qu.: 70.0 1st Qu.:2.875
## Median : 70.0 Median :3.250
## Mean : 71.7 Mean :3.186
## 3rd Qu.: 75.0 3rd Qu.:3.500
## Max. :100.0 Max. :5.000
Let’s look at a histogram of years in which the chocolates were reviewed. The trend seems to be that there were more reviews as time went on, except for 2017, maybe when the data stopped being collected.
a <- min(cacao$review_date)
b <- max(cacao$review_date)
hist_year <- hist(cacao$review_date + 0.001, xaxt='n', xlab = 'review year',
ylab = 'Number of chocolates', main = 'Review years of chocolates')
axis(side = 1, at = hist_year$mids, labels = seq(a, b))
Finally, let’s look at where the companies are that make the chocolate. I shamelessly adapted the code from here: https://stackoverflow.com/questions/24136868/plot-map-with-values-for-countries-as-color-in-r
#gives the number of chocolates produced by each country
company_counts <- data.frame(country = levels(cacao$company_loc),
count = as.numeric(table(cacao$company_loc)))
require(rworldmap)
#create a map-shaped window
mapDevice('x11')
spdf <- joinCountryData2Map(company_counts, joinCode = 'NAME', nameJoinColumn = 'country')
mapCountryData(spdf, nameColumnToPlot = 'count', catMethod = 'fixedWidth', mapTitle = 'Where are All the Chocolate Companies?')
head(company_counts[order(-company_counts$count),])
## country count
## 57 U.S.A. 764
## 19 France 156
## 8 Canada 125
## 56 U.K. 96
## 30 Italy 63
## 15 Ecuador 54
Looks like America has the most chocolate companies represented in this dataset. But where are the cacao beans themselves from? The following map shows that they come from places around the equator with hot climates, as expected. Note: some chocolates used cacao bean from several countries. The map only reflects chocolates using beans from a single country
#gives the number of cacao beans from each country
cacao$bean_origin <- factor(cacao$bean_origin)
bean_counts <- data.frame(country = levels(cacao$bean_origin),
count = as.numeric(table(cacao$bean_origin)))
mapDevice('x11')
spdf <- joinCountryData2Map(bean_counts, joinCode = 'NAME', nameJoinColumn = 'country')
mapCountryData(spdf, nameColumnToPlot = 'count', catMethod = 'fixedWidth',
mapTitle = 'Where are All the Cacao Beans?')
abline(h = 0) #to draw the equator
Now that we understand our data better, a reasonable question to ask is how we can predict what rating a specific chocolate produced will receive. The 1-5 scale used is categorized into labels, from elite to unpleasant. I got labels from http://flavorsofcacao.com/review_guide.html. First we put a label with each rating
classify <- function(rating){
len <- length(rating)
labels <- rep(NA, len)
for (i in 1:len){
if (rating[i] >= 4.5){
labels[i] <- 'Elite'
}
else if (rating[i] >= 3.5) {
labels[i] <- 'Premium'
}
else if (rating[i] >= 2.5) {
labels[i] <- 'Satisfactory'
}
else if (rating[i] >= 1.5) {
labels[i] <- 'Dissapointing'
}
else {
labels[i] <- 'Unpleasant'
}
}
labels <- as.factor(labels)
return(labels)
}
label <- classify(cacao$rating)
The predictive model we are going to use is a decision tree. It is useful because it takes continuous and categorical data (unlike SVMs and multiple regression models) and can give predictions of more than two output categories (unlike logistic regression, which deals with dichotomous results). The decision tree also is a good visual representation of our predictive model
I tried making a tree with “tree <- rpart(label ~ review_date + cocoa_percent + company_loc + bean_type + bean_origin, data = cacao)” at first, but I quickly learned that it was unreasonable to try to have a different branch for each country as soon as my R session terminated. Let’s limit the scope of our data
First of all, there are too many countries represented in the company_loc column. The map of where all the chocolate companies are shows that America, France, and Canada are the three countries with the most chocolate companies represented. Three branches based on the company’s country location is manageable for a decision tree
#we remove data from other countries, since the ratings of other countries' chocolates may be too variable and decrease the predictive accuracy of our model
cacao <- cacao[which(cacao$company_loc == 'U.S.A.' | cacao$company_loc == 'Canada' |
cacao$company_loc =='France'),]
We also want to utilize the information provided by the kind of cacao bean used. Chocolate bloggers have discussed in-depth the difference in taste of chocolate made from a single type of bean versus chocolate from a blend of beans (https://thechocolatejournalist.com/single-origin-vs-blend/). They also discuss the main kinds of beans: criollo, forastero, trinitario, and the nacional bean, discovered in 2011 (https://blog.barandcocoa.com/about-chocolate/varieties-of-cocoa-beans/). Here we categorize the many labels of cacao beans into the broad groups
cacao$bean_type <- factor(cacao$bean_type)
cacao$bean_type <- as.character(cacao$bean_type)
criollo <- c("Criollo", "Criollo (Ocumare)", "Criollo (Porcelana)")
cacao$bean_type[which(cacao$bean_type %in% criollo)] <- 'criollo'
forastero <- c("Forastero", "Forastero (Amelonado)", "Forastero (Arriba)", "Forastero (Parazinho)")
cacao$bean_type[which(cacao$bean_type %in% forastero)] <- 'forastero'
nacional <- c("Forastero (Nacional)", "Nacional", "Nacional (Arriba)")
cacao$bean_type[which(cacao$bean_type %in% nacional)] <- 'nacional'
trinitario <- c("Trinitario", "Trinitario (Amelonado)")
cacao$bean_type[which(cacao$bean_type %in% trinitario)] <- 'trinitario'
blend <- c("Amazon, ICS", "Blend", "Criollo, +", "Criollo, Forastero",
"Criollo, Trinitario", "Forastero, Trinitario", "Trinitario, Criollo",
"Trinitario, Forastero", "Trinitario, Nacional")
cacao$bean_type[which(cacao$bean_type %in% blend)] <- 'blend'
#as was done for company location, we remove chocolates that don't fall into any of these groups because they may decrease the tree's accuracy
cacao <- cacao[which(cacao$bean_type == 'blend' | cacao$bean_type == 'criollo' |
cacao$bean_type == 'forastero' | cacao$bean_type == 'nacional' |
cacao$bean_type == 'trinitario'),]
Finally, we make our tree, as well as plot the importance of each predictive variable. Looks like cocoa percent has the largest impact. This makes sense, since percent cocoa is something you can taste, which directly affects the rating
require(rattle)
require(rpart)
label <- classify(cacao$rating)
tree <- rpart(label ~ review_date + cocoa_percent + company_loc +
bean_type + lat, data = cacao, control = rpart.control(cp = 0.01))
fancyRpartPlot(tree)
tree$variable.importance
## cocoa_percent review_date lat bean_type company_loc
## 16.349254 8.299967 4.829181 3.187326 0.911426
How accurate is our model? Approxamitely 2/3 of the time
pred <- predict(tree)
maxprob <- colnames(pred)[max.col(pred, ties.method="first")]
table <- table(label, maxprob)
table
## maxprob
## label Premium Satisfactory
## Dissapointing 3 7
## Premium 126 76
## Satisfactory 64 173
(table[2,1] + table[3,2])/sum(table)
## [1] 0.6659243
mydata <- read.csv('flavors_of_cacao original.csv', header = TRUE, sep = ',')
mydata <- as.data.frame(mydata, stringsAsFactors=FALSE) #this fixes some bugs I faced later on
mydata$Cocoa.Percent <- as.integer(sub("%", "", mydata$Cocoa.Percent))
colnames(mydata) <- c('company','name','REF','review_date','cocoa_percent',
'company_loc','rating','bean_type','bean_origin')
cocoa_percent <- as.numeric(mydata[,5])
rating <- as.numeric(mydata[,7])
rate_high <- c()
rate_low <- c()
len <- length(cocoa_percent)
for (i in 1:len) {
if (cocoa_percent[i] > 70) {
rate_high <- c(rate_high, rating[i])
} else {
rate_low <- c(rate_low, rating[i])
}
}
len_high <- length(rate_high)
len_low <- length(rate_low)
temp1 <- 0
temp2 <- 0
for (i in 1:len_high) {
temp1 <- temp1 + rate_high[i]
}
for (j in 1:len_low) {
temp2 <- temp2 + rate_low[j]
}
avg_rate_high <- temp1 / len_high
avg_rate_low <- temp2 / len_low
difference <- avg_rate_high - avg_rate_low
temp3 <- 0
temp4 <- 0
for (i in 1:len_high) {
temp3 <- temp3 + (rate_high[i]-avg_rate_high)^2
}
for (i in 1:len_low) {
temp4 <- temp4 + (rate_low[i]-avg_rate_low)^2
}
var_high <- temp3/(len_high - 1)
var_low <- temp4/(len_low - 1)
t <- (difference - 0) / sqrt((var_high/len_high) + (var_low/len_low))
pt(-abs(t),len-2)
## [1] 6.398543e-10
#t-value = -6.102098
#P-value = 6.398543e-10