Importing and formatting data

First, we import the original cacao dataset. We also format the percent cocoa to integer values

cacao <- read.csv('flavors_of_cacao original.csv', header = TRUE, sep = ',')
cacao <- as.data.frame(cacao, stringsAsFactors=FALSE) #this fixes some bugs I faced later on
cacao$Cocoa.Percent <- as.integer(sub("%", "", cacao$Cocoa.Percent))
colnames(cacao) <- c('company','name','REF','review_date','cocoa_percent',
                     'company_loc','rating','bean_type','bean_origin')

Cleaning up the data

When we imported the data, empty spaces were turned to a weird A character with an accent over it. We clean up the data here

'%!in%' <- function(x,y)!('%in%'(x,y))
to_NA <- function(x, start){ #start is the index of the first non-NA piece of data in levels(x)
  for (i in 1:length(x)){
    if (x[i] %!in% levels(x)[start:length(levels(x))]){
      x[i] <- NA
    }
  }
  return(x)
}
cacao$bean_type <- to_NA(cacao$bean_type, 3)
cacao$bean_origin <- to_NA(cacao$bean_origin, 3)

Adding latitudes

Here we add latitudes of the place where each cacao bean was grown, as it may be interesting to see how distance from equator affects the rating. The code is adapted from http://www.storybench.org/geocode-csv-addresses-r/

require(ggmap)
require(proto)
require(RDSTK)

cacao$bean_origin <- as.character(cacao$bean_origin)

#fixing some spelling errors in our data
cacao$bean_origin[which(cacao$bean_origin == 'Domincan Republic')] <- 'Dominican Republic'
cacao$bean_origin[which(cacao$bean_origin == 'Peru(SMartin,Pangoa,nacional)')] <- 'Peru'

#loop through the addresses to get the latitude of each bean origin
for(i in 1:nrow(cacao)){
  if (!is.na(cacao$bean_origin[i])){
    # Print("Working...")
    result <- geocode(cacao$bean_origin[i], output = "latlona", source = "dsk", messaging = FALSE)
    cacao$lat[i] <- as.numeric(result[2])
  }
}

#the data science kit we used to find longitude had no information on the two locations below, so we add it manually
cacao$lat[which(cacao$bean_origin == 'Carribean(DR/Jam/Tri)')] <- 21.4691
cacao$lat[which(cacao$bean_origin == 'West Africa')] <- 13.5317

Exploring our dataset

Let’s look at the chocolates’ ratings and percent cocoa makeup. The lowest and highest scores possible are 1 and 5, respectively, according to the rubric at http://flavorsofcacao.com/review_guide.html. This dataset contains chocolate bars at both ends of the spectrum, with the average being a score of 3.19. The percent cocoa also varies quite a bit.

cacao[cacao$rating == 1, c(1,2,7)] #these bars did the worst

##                  company               name rating
## 327            Callebaut             Baking      1
## 438      Claudio Corallo           Principe      1
## 466   Cote d' Or (Kraft) Sensations Intense      1
## 1176 Neuhaus (Callebaut)               Dark      1

cacao[cacao$rating == 5, c(1,2,7)] #while these bars did the best

##    company          name rating
## 79  Amedei         Chuao      5
## 87  Amedei Toscano Black      5

summary(cacao[, c(5, 7)])

##  cocoa_percent       rating     
##  Min.   : 42.0   Min.   :1.000  
##  1st Qu.: 70.0   1st Qu.:2.875  
##  Median : 70.0   Median :3.250  
##  Mean   : 71.7   Mean   :3.186  
##  3rd Qu.: 75.0   3rd Qu.:3.500  
##  Max.   :100.0   Max.   :5.000

Let’s look at a histogram of years in which the chocolates were reviewed. The trend seems to be that there were more reviews as time went on, except for 2017, maybe when the data stopped being collected.

a <- min(cacao$review_date)
b <- max(cacao$review_date)
hist_year <- hist(cacao$review_date + 0.001, xaxt='n', xlab = 'review year', 
                  ylab = 'Number of chocolates', main = 'Review years of chocolates')
axis(side = 1, at = hist_year$mids, labels = seq(a, b))

Finally, let’s look at where the companies are that make the chocolate. I shamelessly adapted the code from here: https://stackoverflow.com/questions/24136868/plot-map-with-values-for-countries-as-color-in-r

#gives the number of chocolates produced by each country
company_counts <- data.frame(country = levels(cacao$company_loc), 
                             count = as.numeric(table(cacao$company_loc)))

require(rworldmap)

#create a map-shaped window
mapDevice('x11')
spdf <- joinCountryData2Map(company_counts, joinCode = 'NAME', nameJoinColumn = 'country')
mapCountryData(spdf, nameColumnToPlot = 'count', catMethod = 'fixedWidth', mapTitle = 'Where are All the Chocolate Companies?')

head(company_counts[order(-company_counts$count),])

##    country count
## 57  U.S.A.   764
## 19  France   156
## 8   Canada   125
## 56    U.K.    96
## 30   Italy    63
## 15 Ecuador    54

Looks like America has the most chocolate companies represented in this dataset. But where are the cacao beans themselves from? The following map shows that they come from places around the equator with hot climates, as expected. Note: some chocolates used cacao bean from several countries. The map only reflects chocolates using beans from a single country

#gives the number of cacao beans from each country
cacao$bean_origin <- factor(cacao$bean_origin)

bean_counts <- data.frame(country = levels(cacao$bean_origin), 
                             count = as.numeric(table(cacao$bean_origin)))

mapDevice('x11')
spdf <- joinCountryData2Map(bean_counts, joinCode = 'NAME', nameJoinColumn = 'country')
mapCountryData(spdf, nameColumnToPlot = 'count', catMethod = 'fixedWidth', 
               mapTitle = 'Where are All the Cacao Beans?')
abline(h = 0) #to draw the equator

Classifying chocolate based on rating

Now that we understand our data better, a reasonable question to ask is how we can predict what rating a specific chocolate produced will receive. The 1-5 scale used is categorized into labels, from elite to unpleasant. I got labels from http://flavorsofcacao.com/review_guide.html. First we put a label with each rating

classify <- function(rating){
  len <- length(rating)
  labels <- rep(NA, len)
  
  for (i in 1:len){
    if (rating[i] >= 4.5){
      labels[i] <- 'Elite'
    } 
    
    else if (rating[i] >= 3.5) {
      labels[i] <- 'Premium'
    } 
    
    else if (rating[i] >= 2.5) {
      labels[i] <- 'Satisfactory'
    } 
    
    else if (rating[i] >= 1.5) {
      labels[i] <- 'Dissapointing'
    } 
    
    else {
      labels[i] <- 'Unpleasant'
    }
  }
  
  labels <- as.factor(labels)
  return(labels)
}

label <- classify(cacao$rating)

Finalizing data to use

The predictive model we are going to use is a decision tree. It is useful because it takes continuous and categorical data (unlike SVMs and multiple regression models) and can give predictions of more than two output categories (unlike logistic regression, which deals with dichotomous results). The decision tree also is a good visual representation of our predictive model

I tried making a tree with “tree <- rpart(label ~ review_date + cocoa_percent + company_loc + bean_type + bean_origin, data = cacao)” at first, but I quickly learned that it was unreasonable to try to have a different branch for each country as soon as my R session terminated. Let’s limit the scope of our data

First of all, there are too many countries represented in the company_loc column. The map of where all the chocolate companies are shows that America, France, and Canada are the three countries with the most chocolate companies represented. Three branches based on the company’s country location is manageable for a decision tree

#we remove data from other countries, since the ratings of other countries' chocolates may be too variable and decrease the predictive accuracy of our model
cacao <- cacao[which(cacao$company_loc == 'U.S.A.' | cacao$company_loc == 'Canada' | 
               cacao$company_loc =='France'),]

We also want to utilize the information provided by the kind of cacao bean used. Chocolate bloggers have discussed in-depth the difference in taste of chocolate made from a single type of bean versus chocolate from a blend of beans (https://thechocolatejournalist.com/single-origin-vs-blend/). They also discuss the main kinds of beans: criollo, forastero, trinitario, and the nacional bean, discovered in 2011 (https://blog.barandcocoa.com/about-chocolate/varieties-of-cocoa-beans/). Here we categorize the many labels of cacao beans into the broad groups

cacao$bean_type <- factor(cacao$bean_type)
cacao$bean_type <- as.character(cacao$bean_type)

criollo <- c("Criollo", "Criollo (Ocumare)", "Criollo (Porcelana)")
cacao$bean_type[which(cacao$bean_type %in% criollo)] <- 'criollo'

forastero <- c("Forastero", "Forastero (Amelonado)", "Forastero (Arriba)", "Forastero (Parazinho)")
cacao$bean_type[which(cacao$bean_type %in% forastero)] <- 'forastero'

nacional <- c("Forastero (Nacional)", "Nacional", "Nacional (Arriba)")
cacao$bean_type[which(cacao$bean_type %in% nacional)] <- 'nacional'

trinitario <- c("Trinitario", "Trinitario (Amelonado)")
cacao$bean_type[which(cacao$bean_type %in% trinitario)] <- 'trinitario'

blend <- c("Amazon, ICS", "Blend", "Criollo, +", "Criollo, Forastero", 
           "Criollo, Trinitario", "Forastero, Trinitario", "Trinitario, Criollo", 
           "Trinitario, Forastero", "Trinitario, Nacional")
cacao$bean_type[which(cacao$bean_type %in% blend)] <- 'blend'

#as was done for company location, we remove chocolates that don't fall into any of these groups because they may decrease the tree's accuracy
cacao <- cacao[which(cacao$bean_type == 'blend' | cacao$bean_type == 'criollo' | 
               cacao$bean_type == 'forastero' | cacao$bean_type == 'nacional' |
               cacao$bean_type == 'trinitario'),]

Making our decision tree

Finally, we make our tree, as well as plot the importance of each predictive variable. Looks like cocoa percent has the largest impact. This makes sense, since percent cocoa is something you can taste, which directly affects the rating

require(rattle)
require(rpart)

label <- classify(cacao$rating)

tree <- rpart(label ~ review_date + cocoa_percent + company_loc + 
              bean_type + lat, data = cacao, control = rpart.control(cp = 0.01))
fancyRpartPlot(tree)

tree$variable.importance

## cocoa_percent   review_date           lat     bean_type   company_loc 
##     16.349254      8.299967      4.829181      3.187326      0.911426

How accurate is our model? Approxamitely 2/3 of the time

pred <- predict(tree)
maxprob <- colnames(pred)[max.col(pred, ties.method="first")]
table <- table(label, maxprob)
table

##                maxprob
## label           Premium Satisfactory
##   Dissapointing       3            7
##   Premium           126           76
##   Satisfactory       64          173

(table[2,1] + table[3,2])/sum(table)

## [1] 0.6659243

Hypothesis testing

mydata <- read.csv('flavors_of_cacao original.csv', header = TRUE, sep = ',')
mydata <- as.data.frame(mydata, stringsAsFactors=FALSE) #this fixes some bugs I faced later on
mydata$Cocoa.Percent <- as.integer(sub("%", "", mydata$Cocoa.Percent))
colnames(mydata) <- c('company','name','REF','review_date','cocoa_percent',
                     'company_loc','rating','bean_type','bean_origin')

cocoa_percent <- as.numeric(mydata[,5])
rating <- as.numeric(mydata[,7])
rate_high <- c()
rate_low <- c()
len <- length(cocoa_percent)

for (i in 1:len) {
    if (cocoa_percent[i] > 70) {
        rate_high <- c(rate_high, rating[i])
    } else {
        rate_low <- c(rate_low, rating[i])
    }
}

len_high <- length(rate_high)
len_low <- length(rate_low)
temp1 <- 0
temp2 <- 0
for (i in 1:len_high) {
    temp1 <- temp1 + rate_high[i]
}

for (j in 1:len_low) {
    temp2 <- temp2 + rate_low[j]
}
avg_rate_high <- temp1 / len_high
avg_rate_low <- temp2 / len_low
difference <- avg_rate_high - avg_rate_low

temp3 <- 0
temp4 <- 0
for (i in 1:len_high) {
    temp3 <- temp3 + (rate_high[i]-avg_rate_high)^2
}
for (i in 1:len_low) {
    temp4 <- temp4 + (rate_low[i]-avg_rate_low)^2
}
var_high <- temp3/(len_high - 1)
var_low <- temp4/(len_low - 1)

t <- (difference - 0) / sqrt((var_high/len_high) + (var_low/len_low))
pt(-abs(t),len-2)

## [1] 6.398543e-10

#t-value = -6.102098
#P-value = 6.398543e-10

Flavors of Cacao