Load and clean the data

Firstly, let’s read in the packages that I am going to be using during the analysis.

library(tidyverse)
library(ggplot2)
library(readr)
library(stringr)
library(lubridate)

The data here is provided to us in a csv file and kept in the input directory, which is inside the current working directory. I just simply load it here using the read_csv function from the readr package.

I don’t know why I got error when I read the file Rows: 1795 Columns: 9── Column specification

full_data <- read_csv('~/Downloads/flavors_of_cacao.csv')

What is dim?

dim(full_data)

## [1] 1795    9

We seem to have just under 1,800 observations, or chocolates, and 9 variables. We can check the variables that we have by checking the column names of the data. It just like colnames

names(full_data)

## [1] "Company \n(Maker-if known)"        "Specific Bean Origin\nor Bar Name"
## [3] "REF"                               "Review\nDate"                     
## [5] "Cocoa\nPercent"                    "Company\nLocation"                
## [7] "Rating"                            "Bean\nType"                       
## [9] "Broad Bean\nOrigin"

If we look at the column names we can see that they contain spaces as well as some line breaks, which definintely isn’t good, so I will remove them using gsub. What did he remove? I saw no differece?

# Remove and replace spaces and line breaks
names(full_data) <- tolower(gsub(pattern = '[[:space:]+]', '_', names(full_data)))

# View the new column names 
names(full_data)

## [1] "company _(maker-if_known)"        "specific_bean_origin_or_bar_name"
## [3] "ref"                              "review_date"                     
## [5] "cocoa_percent"                    "company_location"                
## [7] "rating"                           "bean_type"                       
## [9] "broad_bean_origin"

That is much more manageable now. I have things about the janitor package that deals with things like that, perhaps in a better way so that will be something that I explore in the future.

View the first few rows of the data.

head(full_data)

company _(maker-if_known)	specific_bean_origin_or_bar_name	ref	review_date	cocoa_percent	company_location	rating	bean_type	broad_bean_origin
A. Morin	Agua Grande	1876	2016	63%	France	3.75		Sao Tome
A. Morin	Kpime	1676	2015	70%	France	2.75		Togo
A. Morin	Atsane	1676	2015	70%	France	3.00		Togo
A. Morin	Akata	1680	2015	70%	France	3.50		Togo
A. Morin	Quilla	1704	2015	70%	France	3.50		Peru
A. Morin	Carenero	1315	2014	70%	France	2.75	Criollo	Venezuela

It looks like the first row of the data is duplicated from the header, so I will remove that as it definitely isn’t an observation. Why duplicated? I spot nothing

full_data <- full_data[-1,] # Remove first row

Another piece of data cleaning that we should do is remove the % sign from the cocoa_percent column. This will mean that we can have the column as numeric, which makes much more sense than having it as a string or category. Why can’t we turn it into numeric?? I understand gsub is remove % and turn the value into text, but sapply and fuction(x) is making me confuse

# Remove the % sign from the cocoa percent column
full_data$cocoa_percent <- sapply(full_data$cocoa_percent, function(x) gsub("%", "", x))

We can now retype the columns to convert the cocoa_percent to a numeric. type_convert() is confusing, is the function can auto recognize the value and convert it to string or numeric??

# Retype the columns
full_data <- type_convert(full_data)

# Have a look at the new data types
map_df(full_data, class)

company _(maker-if_known)	specific_bean_origin_or_bar_name	ref	review_date	cocoa_percent	company_location	rating	bean_type	broad_bean_origin
character	character	numeric	numeric	numeric	character	numeric	character	character

We can convert the year column to a date type using the lubridate package.

# Mutate a column that converts the year into a date type 
full_data <- full_data %>%
  mutate(review_date = ymd(paste(review_date, 1, 1, sep="-")))

We can use the newly created date column that we mutated to create a simple summary table grouping by the year.

avg_ratings <- full_data %>% 
  group_by(review_date) %>% # Group by new date column
  summarise(avg_rating = mean(rating), n_ratings = n()) # Summary stats

# Print our new table
avg_ratings

review_date	avg_rating	n_ratings
2006-01-01	3.125000	72
2007-01-01	3.162338	77
2008-01-01	2.994624	93
2009-01-01	3.073171	123
2010-01-01	3.148649	111
2011-01-01	3.256061	165
2012-01-01	3.178205	195
2013-01-01	3.197011	184
2014-01-01	3.189271	247
2015-01-01	3.246491	285
2016-01-01	3.223624	218
2017-01-01	3.312500	24

Here we have created a basic summary table of the average ratings and the number of reviewed observations for each year in the range of our data.

Insights

By year

Faceted distribution of ratings

ggplot(full_data, aes(x = rating, fill = as.factor(review_date))) + 
  geom_density(alpha = .5) + 
  theme_minimal() +
  facet_wrap(~ as.factor(year(review_date))) + 
  guides(fill = FALSE) + labs(x = 'Rating', y = 'Density')

While the peaks throughout the years seems to remain fairly constant, you can see that the most recent year doesn’t have the spread of the first years.
We can analyse this further by creating a summary table and visualisation while grouping by the year.

Spread of ratings

standard_deviations <- 
  full_data %>% group_by(review_date) %>%
  summarise(count = n(), avg = mean(rating), sd = sd(rating)) 

standard_deviations

review_date	count	avg	sd
2006-01-01	72	3.125000	0.7691224
2007-01-01	77	3.162338	0.6998193
2008-01-01	93	2.994624	0.5442118
2009-01-01	123	3.073171	0.4591195
2010-01-01	111	3.148649	0.4663426
2011-01-01	165	3.256061	0.4899536
2012-01-01	195	3.178205	0.4835962
2013-01-01	184	3.197011	0.4461178
2014-01-01	247	3.189271	0.4148615
2015-01-01	285	3.246491	0.3810960
2016-01-01	218	3.223624	0.4200386
2017-01-01	24	3.312500	0.3317444

We seem to see that the standard deviation of rating seems to be decreasing as time passes. This is more clearly seen when you visualise the trend.

I have included the number of reviews per year here using the size aesthetic of geom_point to give an idea of the sample size for each year as time passes.

standard_deviations %>% # Use our previous summary table
  ggplot(aes(review_date, sd)) + 
  geom_point(aes(size = count)) + # Change point size depending on number of reviews
  geom_line(size = .5) + 
  theme_minimal() + 
  scale_x_date(date_breaks = '1 year', date_labels = '%Y') + 
  scale_size(breaks = seq(0,250,50), name = 'Number of ratings') + 
  labs(x = 'Date', 
       y = 'Standard deviation of ratings', 
       title = 'Standard deviation of review ratings over time')

Upon visualisation of the data we can see that 2017 might simply have a smaller spread due to the smaller number of ratings that have been taken in that year. As the number of ratings taken in the year increases, we say see it rise to near that of the previous 6/7 years, although there does appear to be a definite downward trend, implying that reviewers are giving less extreme scores.

Extreme ratings

low_scores <- full_data %>% 
  group_by(review_date) %>%
  summarise(less_than_2.5 = sum(rating < 2.5), count = n(), perc = round(less_than_2.5 / count, 2))

low_scores

review_date	less_than_2.5	count	perc
2006-01-01	13	72	0.18
2007-01-01	7	77	0.09
2008-01-01	7	93	0.08
2009-01-01	6	123	0.05
2010-01-01	5	111	0.05
2011-01-01	5	165	0.03
2012-01-01	6	195	0.03
2013-01-01	4	184	0.02
2014-01-01	5	247	0.02
2015-01-01	3	285	0.01
2016-01-01	2	218	0.01
2017-01-01	0	24	0.00

Having a closer look, by year, at the reviews that were given a rating of less than 2.5, you can clearly see that the number of those poor scores is going down, despite the number of reviews going up! 18% of reviews in 2006 (13/72) where less than 2.5, while only 5 reviews fell below that threshold since 2015!

Once again, plotting this trend makes it much easier to digest.

low_scores %>%
  ggplot() + geom_line(aes(review_date, perc * 100)) + 
  theme_minimal() + 
  scale_x_date(date_breaks = '1 year', date_labels = '%Y') + 
  labs(x = 'Date', 
       y = 'Percentage',
       title = 'Percentage of ratings below 2.5 by year')

Frequent ratings over time

So we have established that the spread of the ratings is getting smaller, and reviewers are giving less low scores below 2.5.

We can further visualise these features of the data by exploring other plots.

full_data %>%
  ggplot(aes(review_date, rating)) + 
  geom_count(aes(colour = ..n..)) + 
  geom_line(data = avg_ratings, aes(review_date, avg_rating), colour = 'red') + # Use average ratings table from earlier
  scale_x_date(date_breaks = '1 year', date_labels = '%Y') + 
  scale_size_continuous(name = 'Count') + scale_color_continuous(name = 'Count') + 
  expand_limits(y = 0) + # Expand limits to include 0
  theme_minimal() + 
  labs(x = 'Review year', y = 'Rating')

We can see that the spread of the reviews is definitely narrowing, with the lower tail retracting and reviews now generally falling between 3 and 4 in 2015 onwards.

Companies and Makers

Company location

We can investigate whether the location of the company has an impact on the quality of the product.

full_data %>%
  group_by(company_location) %>% 
  filter(n() > 10) %>% 
  mutate(avg = mean(rating)) %>%
  ggplot() + 
  geom_boxplot(aes(reorder(company_location, avg), rating, fill = avg)) + 
  scale_fill_continuous(low = '#ffffcc', high = '#fc4e2a', name = "Average rating") + 
  coord_flip() + 
  theme_minimal() + 
  labs(x = 'Company Location', y = 'Rating') +
  expand_limits(y = c(0,5))

Master makers

As the column in the data is company/maker, I will extract the maker so that we can compare the makers against one another to see if there are any master makers hidden in the data.

Once I have split the company and maker from the column then I will select those makers that have at least 3 reviewed products.

colnames(full_data)[1] <- 'company_maker'
makers <- full_data %>%
  rowwise %>%
  mutate(maker = str_match(company_maker, '\\((.*)\\)')[2]) %>% 
  filter(!is.na(maker)) %>% # If a maker was noted
  group_by(maker) %>% # Grouping
  filter(n() >= 3) %>% # Minimum product filter
  mutate(avg = mean(rating)) # Get average rating 

ggplot(makers) + 
  geom_boxplot(aes(x = reorder(maker, rating, FUN = mean), y = rating, fill = avg)) + 
  labs(x = 'Maker', y = 'Rating') + 
  coord_flip() + 
  theme_minimal() + 
  scale_fill_continuous(name = "Average rating") +
  expand_limits(y = c(0,5))

It appears as those Cinagra is our master chocolatier.

Company records over time

I split the company/maker column again, this time selecting the company, and filter by those with at least 10 reviews.

companies <- full_data %>%
  rowwise() %>%
  mutate(company = str_trim(str_split(company_maker, '\\(')[[1]][1])) %>%
  group_by(company) %>% 
  filter(n() > 10) %>% 
  mutate(avg = mean(rating))

By plotting the count of reviews for ratings using the size aesthetic again, we are also able to see the distribution of scores for the companies.

companies  %>%
  ggplot(aes(x = reorder(as.factor(company), rating, FUN = mean), y = rating)) + 
  geom_count(alpha = .1) + 
  geom_point(aes(x = as.factor(company), y = avg, colour = avg)) + 
  theme_minimal() + 
  coord_flip() + 
  labs(x = 'Company', y = 'Rating') + 
  scale_color_continuous(name = 'Average rating', breaks = seq(3,4,.25)) + 
  scale_size_continuous(name = 'Number of ratings', breaks = seq(0,14,2))

This gives us an overview of companies overall, but doesn’t show us trends for those companies over time. We can investigate that using facet_wrap and plotting against the review_date.

# Which of these companies are improving
full_data %>%
  rowwise() %>%
  mutate(company = str_trim(str_split(company_maker, '\\(')[[1]][1])) %>%
  group_by(company) %>% 
  filter(n() > 10) %>% 
  group_by(company, review_date) %>%
  summarise(count = n(), avg = mean(rating)) %>%
  ungroup() %>%
  ggplot() + 
    geom_point(aes(x = review_date, y = avg, size = count, colour = as.factor(company), alpha = 0.05)) +
    geom_line(aes(x = review_date, y = avg, colour = as.factor(company))) + 
    facet_wrap(~ company, ncol = 3) + 
    scale_x_date(date_labels = '%Y', date_breaks = '2 year') + 
    theme_minimal() + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
    guides(colour = FALSE) + 
  labs(x = 'Year of review', y = 'Average rating', title = "Company average ratings over time") + 
    scale_size_continuous(breaks = seq(4, 16, 2), name = 'Number of reviews') + 
    scale_alpha(guide = F)

Some appear to have shown improvements, but such increases can probably just be put down to small number of reviews for a given year. To see whether the company is truly improving we would require a greater number of product reviews for the current year.

Cocoa percent

Let’s investigate to see whether there is a relationship between cocoa percentage and the chocolate’s rating.

full_data %>%
  ggplot(aes(x = cocoa_percent, y = rating)) +
  geom_jitter(alpha = .75) + 
  coord_cartesian(ylim = c(0,5)) +
  labs(x = 'Cocoa percentage', y = 'Rating') + 
  theme_minimal() + 
  geom_smooth(method = 'lm', se = FALSE, col = 'red')

There doesn’t appear to be a strong relationship between cocoa percent and rating here. Let’s have a look at the model separately.

model <- lm(formula = rating ~ cocoa_percent, 
            data = full_data)

summary(model)

## 
## Call:
## lm(formula = rating ~ cocoa_percent, data = full_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.20675 -0.31981  0.04325  0.31806  1.79325 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.075165   0.126835  32.130  < 2e-16 ***
## cocoa_percent -0.012406   0.001762  -7.041 2.72e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4717 on 1792 degrees of freedom
## Multiple R-squared:  0.02692,    Adjusted R-squared:  0.02637 
## F-statistic: 49.57 on 1 and 1792 DF,  p-value: 2.716e-12

With an adjusted R-squared of 0.02662 it is fair to say that this isn’t a very good model, but there is a relationship between the rating and the cocoa_percent, it just isn’t the most informative. The negative slope implies to us that the higher the cocoa_percent then the lower the rating of the chocolate.

If we were going to build a model to predict the rating of the chocolate then we would definitely have to take more variables into consideration.

Considering time with predictions

We have already seen the impact that time has had on predictions. We can see that the spread of the ratings has decreased as time has passed. Perhaps factoring this into consideration will help us improve our prediction power.

ggplot(full_data, aes(x = cocoa_percent, y = rating, group = factor(year(review_date)))) +
  geom_jitter(alpha = .75) + 
  coord_cartesian(ylim = c(0,5)) +
  labs(x = 'Cocoa percentage', y = 'Rating') + 
  theme_minimal() + 
  geom_smooth(method = 'lm', se = FALSE, aes(colour = factor(year(review_date)))) + 
  scale_colour_discrete(name = 'Review year')

A plot like this is good at showing us that there is definitely deviation throughout the years regarding how cocoa percent relates to the rating, but if we wanted to know more about which years seem to diverge then faceting by year(review_date) might be more effective. Let’s have a look.

ggplot(full_data, aes(x = cocoa_percent, y = rating, group = factor(year(review_date)))) +
  geom_jitter(alpha = .5, shape = 21, size = 1) + 
  coord_cartesian(ylim = c(0,5)) +
  labs(x = 'Cocoa percentage', y = 'Rating', title = 'Rating and cocoa percent faceted by year of review') + 
  theme_minimal() + 
  geom_smooth(method = 'lm', se = FALSE, aes(colour = factor(year(review_date)))) + 
  facet_wrap(~ year(review_date)) + 
  guides(colour = FALSE) + # Remove colour legend as faceting makes this clearer anyway
  theme(panel.spacing = unit(1, "lines")) # Just increase the spacing between plots slightly

Both the above plots illustrate to us that the slope of the line changes with year. The first one let’s us evaluate that there are differences year on year and we can compare them, but it is quite messy. While faceting makes it slightly more difficult to directly compare the linear slopes but makes evaluating one at a time much more manageable.

We can see how adding year to the linear predictor impacts the quality of our model.

model <- lm(formula = rating ~ cocoa_percent + factor(year(review_date)), 
            data = full_data)

summary(model)

## 
## Call:
## lm(formula = rating ~ cocoa_percent + factor(year(review_date)), 
##     data = full_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.18802 -0.27862  0.02444  0.30343  1.86240 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    4.019413   0.136438  29.460  < 2e-16 ***
## cocoa_percent                 -0.012597   0.001758  -7.168 1.11e-12 ***
## factor(year(review_date))2007  0.050426   0.076777   0.657   0.5114    
## factor(year(review_date))2008 -0.108974   0.073556  -1.482   0.1386    
## factor(year(review_date))2009 -0.058845   0.069481  -0.847   0.3972    
## factor(year(review_date))2010  0.020868   0.070849   0.295   0.7684    
## factor(year(review_date))2011  0.130679   0.066129   1.976   0.0483 *  
## factor(year(review_date))2012  0.059859   0.064572   0.927   0.3540    
## factor(year(review_date))2013  0.087963   0.065122   1.351   0.1769    
## factor(year(review_date))2014  0.080056   0.062744   1.276   0.2022    
## factor(year(review_date))2015  0.134265   0.061781   2.173   0.0299 *  
## factor(year(review_date))2016  0.108679   0.063656   1.707   0.0879 .  
## factor(year(review_date))2017  0.194324   0.110359   1.761   0.0784 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4682 on 1781 degrees of freedom
## Multiple R-squared:  0.04706,    Adjusted R-squared:  0.04064 
## F-statistic:  7.33 on 12 and 1781 DF,  p-value: 2.798e-13

This shows us that some of the years have a ‘signficant’ impact on the response variable, but overall the impact isn’t as substantial as we might hope for.

We can note that the R-squared value has risen from 0.02662 to 0.04097 so it is in the right direction but still pretty awful.

Considering other model types

Our linear models have yielded very little success; while this is likely to be indicative that the relationships and predictive power just isn’t present in the data, I thought that I would just give it a go using a random forest rather than a plain linear model.

Random forest models are probably more associated with classification tasks, rather than the regression task that we have set ourselves here. We will see later what that means for our predictions. Let’s first build the model using the rpart package.

library(rpart)
set.seed(1000) # Set the seed to make results reproducible
random_forest_model <- rpart(formula = 'rating ~ cocoa_percent + factor(year(review_date))', data = full_data)

# Calling summary on a random forest model produces a large output so I have commented out for the sake of presentation space
# summary(random_forest_model)

Now that we have our model, let’s make some predictions and compare them to the actual ratings that were given.

forest_pred = predict(random_forest_model, full_data)
table(forest_pred)

## forest_pred
##             2.27 2.85843373493976 2.97938144329897 3.13590604026846 
##               25               83               97              149 
## 3.23940972222222 
##             1440

This is what I was talking about earlier. The predictions are binned because of the structure of the tree and decision trees in general. In our case the inputs are branched and reach a terminal node where they are given a numeric value as the prediction for that particular observation; 1 of 5 possible predictions in our case.

We should also note that one of the predictions is greatly favoured: 3.23976405274115

This doesn’t necessarily mean that the random forest is ineffective, and perhaps with some fine tuning we could improve that, but we want to be careful not to overfit the data by making the trees too deep.

library(Metrics) # Load metrics library for the rmse function
forest_rmse <- rmse(actual = full_data$rating, predicted = forest_pred)
linear_rmse <- rmse(actual = full_data$rating, predicted = predict(model, full_data))

# Display the two results
data.frame(forest_rmse, linear_rmse)

forest_rmse	linear_rmse
0.4548661	0.4664957

It does appear that the random forest model performs slightly better than the linear model that we ran earlier.

Adding company location

Our model is lacking because of the predictors we are using to predict our response variables; the strength of the relationship with rating is not strong enough to magically produce a good prediction model.

From the exploratory analysis, I can see that there appears to be a loose correlation with the location of the company and the rating. Perhaps including this will improve our predictive power, and reduce the RMSE.

class(full_data$company_location) # Check the class of the location - it is a character class

## [1] "character"

n_distinct(full_data$company_location) # There are 60 distinct company locations in our data

## [1] 60

I’m interested to see whether the random forest model will be able to handle the 60 categories categorical variable, or whether we have to one-hot encode it for example. For some algorithms, where a matrix is required, we could use vtreat to do just that and create a sparse matrix where the categories are spread across columns.

random_forest_model_2 <- rpart(formula = 'rating ~ cocoa_percent + factor(year(review_date)) + factor(company_location)', data = full_data)
forest_pred_location = predict(random_forest_model_2, full_data)
table(forest_pred_location)

## forest_pred_location
##             2.27 2.78977272727273 2.97297297297297 3.17489495798319 
##               25               44              148              952 
##           3.3168 
##              625

We can already see a difference! We have 6 output classes or terminal nodes with our second tree.

library(caret) # Load package
varImp(random_forest_model_2) # View the variable importance

	Overall
cocoa_percent	0.1003184
factor(company_location)	0.0948902
factor(year(review_date))	0.0683489

We can see that company_location has jumped over the year of review and is more important for our predictive power. cocoa_percent still remains the most important predictor variable. Let’s have a look and see if it has had a positive impact on the RMSE.

forest_location_rmse <- rmse(actual = full_data$rating, predicted = forest_pred_location)
data.frame(forest_rmse, linear_rmse, forest_location_rmse)

forest_rmse	linear_rmse	forest_location_rmse
0.4548661	0.4664957	0.4506148

Yet more improvement!

NB: we are training and testing our models on the same dataset, which isn’t good practice and could be the reason for the difference in results. We would need to test the out-of-sample error using cross-validation for a better indication of model strength.

Broad bean origin

Let’s have a look at the unique origins that are listed in the dataset. This is easily achieved using the unique function.

# Look at unique origins
unique(full_data$broad_bean_origin)

##   [1] "Togo"                          "Peru"                         
##   [3] "Venezuela"                     "Cuba"                         
##   [5] "Panama"                        "Madagascar"                   
##   [7] "Brazil"                        "Ecuador"                      
##   [9] "Colombia"                      "Burma"                        
##  [11] "Papua New Guinea"              "Bolivia"                      
##  [13] "Fiji"                          "Mexico"                       
##  [15] "Indonesia"                     "Trinidad"                     
##  [17] "Vietnam"                       "Nicaragua"                    
##  [19] "Tanzania"                      "Dominican Republic"           
##  [21] "Ghana"                         "Belize"                       
##  [23] " "                             "Jamaica"                      
##  [25] "Grenada"                       "Guatemala"                    
##  [27] "Honduras"                      "Costa Rica"                   
##  [29] "Domincan Republic"             "Haiti"                        
##  [31] "Congo"                         "Philippines"                  
##  [33] "Malaysia"                      "Dominican Rep., Bali"         
##  [35] "Venez,Africa,Brasil,Peru,Mex"  "Gabon"                        
##  [37] "Ivory Coast"                   "Carribean"                    
##  [39] "Sri Lanka"                     "Puerto Rico"                  
##  [41] "Sao Tome"                      "Uganda"                       
##  [43] "Martinique"                    "Sao Tome & Principe"          
##  [45] "Vanuatu"                       "Australia"                    
##  [47] "Liberia"                       "Ecuador, Costa Rica"          
##  [49] "West Africa"                   "Hawaii"                       
##  [51] "St. Lucia"                     "Cost Rica, Ven"               
##  [53] "Peru, Madagascar"              "Venezuela, Trinidad"          
##  [55] "Trinidad, Tobago"              "Ven, Trinidad, Ecuador"       
##  [57] "South America, Africa"         "India"                        
##  [59] "Africa, Carribean, C. Am."     "Tobago"                       
##  [61] "Ven., Indonesia, Ecuad."       "Trinidad-Tobago"              
##  [63] "Peru, Ecuador, Venezuela"      "Venezuela, Dom. Rep."         
##  [65] "Colombia, Ecuador"             "Solomon Islands"              
##  [67] "Nigeria"                       "Peru, Belize"                 
##  [69] "Peru, Mad., Dom. Rep."         NA                             
##  [71] "PNG, Vanuatu, Mad"             "El Salvador"                  
##  [73] "South America"                 "Samoa"                        
##  [75] "Ghana, Domin. Rep"             "Trinidad, Ecuador"            
##  [77] "Cameroon"                      "Venezuela, Java"              
##  [79] "Venezuela/ Ghana"              "Venezuela, Ghana"             
##  [81] "Indonesia, Ghana"              "Peru(SMartin,Pangoa,nacional)"
##  [83] "Principe"                      "Central and S. America"       
##  [85] "Ven., Trinidad, Mad."          "Carribean(DR/Jam/Tri)"        
##  [87] "Ghana & Madagascar"            "Ven.,Ecu.,Peru,Nic."          
##  [89] "Madagascar & Ecuador"          "Guat., D.R., Peru, Mad., PNG" 
##  [91] "Peru, Dom. Rep"                "Dom. Rep., Madagascar"        
##  [93] "Gre., PNG, Haw., Haiti, Mad"   "Mad., Java, PNG"              
##  [95] "Ven, Bolivia, D.R."            "DR, Ecuador, Peru"            
##  [97] "Suriname"                      "Peru, Ecuador"                
##  [99] "Ecuador, Mad., PNG"            "Ghana, Panama, Ecuador"       
## [101] "Venezuela, Carribean"

Since we are looking at the broad bean origin during this part of the analysis we are going to want to remove those rows that don’t appear to have a known bean origin.
Using the unique values above we are able to look through the origins that are included in the data set and use that information to filter all the observations accordingly.

beans <- full_data %>% 
    filter(!is.na(broad_bean_origin),     # Remove those with NA values
           !nchar(broad_bean_origin) < 2) # Remove those where the origin is too short to be a realistic palce

Most frequent bean origin

Now that we have filtered the data (beans) to include just those observations with broad bean origins we are able to working out which origins occur the most frequently, or are the most widely used in these chocolates.

# Most frequent broad bean origin
beans %>% 
  group_by(broad_bean_origin) %>% # Group by origin
  filter(n() > 10) %>% # Limit to those with at least 10 observations
  mutate(count = n()) %>% # Add the count column
  ggplot(aes(x = reorder(broad_bean_origin, count))) + 
  geom_bar() + 
  coord_flip() + 
  theme_minimal() + 
  labs(x = 'Bean origin', y = 'Count', title = 'Most frequently used broad bean origins')

Looking at this simple graph we are able to see that the most popular broad bean origins are Venezuela, Ecuador and Peru. Madagascar, probably more famous for its vanilla, is in fourth place.

Origin and ratings

Does the origin of the bean have a noticeable impact on the rating of the chocolate?
We can visualise the answer to this question by looking at the statistics of the ratings when they are grouped by their broad bean origin.

beans %>% 
  group_by(broad_bean_origin) %>% 
  filter(n() > 10) %>% # Keep only those with more than 10 observations
  mutate(count = n()) %>%
  ggplot() + 
  geom_boxplot(aes(x = broad_bean_origin, y = rating)) + 
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 75, hjust = 1)) + 
  labs(x = 'Broad bean origin', y = 'Rating')

Having seen the boxplot, we can see that there doesn’t appear to be any relationship between broad bean origin and rating. The company or the maker that is producing the product itself must be a more important indicator of the overall rating of the chocolate.

Companies and bean origin heatmap

We can use the subset which contains only the companies with the most observations to generate a heatmap showing the interactions between company and broad bean origin.

companies %>% 
  filter(!is.na(broad_bean_origin), !nchar(broad_bean_origin) < 2) %>% 
  group_by(company, broad_bean_origin) %>%
  summarise(count = n()) %>%
  filter(count > 0) %>%
  ggplot() + 
  geom_tile(aes(broad_bean_origin, company, fill = count)) + 
  coord_equal() +
  scale_fill_continuous(name = 'Count') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 5)) +
  theme(axis.text.y = element_text(size = 5)) + 
  labs(x = 'Broad bean origin', y = 'Company')

We can see that there are only a few bright spots on what is a quite busy plot. The popular locations for broad bean origin make the vertical straight lines on the plot.
You can see clear vertical lines at Venezuela, Peru and Ecuador as we would expect.

Conclusions

Over time the average rating isn’t increasing but the instances of extreme ratings is descreasing, resulting in a smaller spread of ratings
The country of manufacter doesn’t appear to have a large impact on rating, but both the maker and company can be seen to impact the average rating of a chocolate
There isn’t a strong relationship between the cocoa percent and rating of the chocolate. Fitting a linear model showed us that we would need to include more features to build a good prediction model
- The addition of year to the model as a factor improved the model, but it is still poor
- Using the trends spotted in our EDA, the inclusion of company location improved our predictive power further
- A random forest model seems to perform better overall than a simple linear model
Broad bean origin seems to be well represented by the big five:
- Venezuela
- Ecuador
- Peru
- Madagascar
- Dominican Republic
However, the broad bean origin of the bean doesn’t seem to determine the quality, or rating, of the chocolate

Thanks for taking the time to have a look at my analysis and visualisations, if you enjoyed it or learned something then please feel free to upvote the kernel ;)
I am still looking to improve this kernel and would greatly appreciate any feedback or advice that you could share with me.

Chocolate Bar Ratings: Extensive EDA

Will Canniford