Source of the dataset: https://www.kaggle.com/datasets/rtatman/chocolate-bar-ratings
Chocolate is one of the most popular candies in the world. Each year, residents of the United States collectively eat more than 2.8 billions pounds. However, not all chocolate bars are created equal! This dataset contains expert ratings of over 1,700 individual chocolate bars, along with information on their regional origin, percentage of cocoa, the variety of chocolate bean used and where the beans were grown.
Each chocolate is evaluated from a combination of both objective qualities and subjective interpretation. A rating here only represents an experience with one bar from one batch. Batch numbers, vintages and review dates are included in the database when known.
The database is narrowly focused on plain dark chocolate with an aim of appreciating the flavors of the cacao when made into chocolate. The ratings do not reflect health benefits, social missions, or organic status.
Flavor is the most important component of the Flavors of Cacao ratings. Diversity, balance, intensity and purity of flavors are all considered. It is possible for a straight forward single note chocolate to rate as high as a complex flavor profile that changes throughout. Genetics, terroir, post harvest techniques, processing and storage can all be discussed when considering the flavor component.
Texture has a great impact on the overall experience and it is also possible for texture related issues to impact flavor. It is a good way to evaluate the makers vision, attention to detail and level of proficiency.
Aftermelt is the experience after the chocolate has melted. Higher quality chocolate will linger and be long lasting and enjoyable. Since the aftermelt is the last impression you get from the chocolate, it receives equal importance in the overall rating.
Overall Opinion is really where the ratings reflect a subjective opinion. Ideally it is my evaluation of whether or not the components above worked together and an opinion on the flavor development, character and style. It is also here where each chocolate can usually be summarized by the most prominent impressions that you would remember about each chocolate.
company_maker_if_known : Name of the company
manufacturing the bar.
specific_bean_origin_or_bar_name : The specific
geo-region of origin for the bar.
ref : A value linked to when the review was entered in
the database. Higher = more recent.
review_date : Date of publication of the review.
cocoa_percent : Cocoa percentage (darkness) of the
chocolate bar being reviewed.
company_location : Manufacturer base country.
rating : Expert rating for the bar.
bean_type : The variety (breed) of bean used, if
provided.
broad_bean_origin : The broad geo-region of origin for
the bean.
In this analysis, the dataset will be used to answer the following questions:
In this analysis, we will be using the following packages:
ggplot2, dplyr, readxl,
janitor
library(ggplot2)
library(dplyr)
library(readxl)
library(janitor)
Let’s load our dataset and clean the column names to ensure the uniqueness and consistent of it.
df_cocoa = read_excel("cocoa_data.xlsx")
df_cocoa = clean_names(df_cocoa)
head(df_cocoa)
## # A tibble: 6 × 9
## company_maker_i…¹ speci…² ref revie…³ cocoa…⁴ compa…⁵ rating bean_…⁶ broad…⁷
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr>
## 1 A. Morin Agua G… 1876 2016 0.63 France 3.75 <NA> Sao To…
## 2 A. Morin Kpime 1676 2015 0.7 France 2.75 <NA> Togo
## 3 A. Morin Atsane 1676 2015 0.7 France 3 <NA> Togo
## 4 A. Morin Akata 1680 2015 0.7 France 3.5 <NA> Togo
## 5 A. Morin Quilla 1704 2015 0.7 France 3.5 <NA> Peru
## 6 A. Morin Carene… 1315 2014 0.7 France 2.75 Criollo Venezu…
## # … with abbreviated variable names ¹company_maker_if_known,
## # ²specific_bean_origin_or_bar_name, ³review_date, ⁴cocoa_percent,
## # ⁵company_location, ⁶bean_type, ⁷broad_bean_origin
Let’s check for duplicate values.
if (count(df_cocoa) == count(unique(df_cocoa))) {
print("There are no duplicate values")
} else {
print("There are duplicates")
}
## [1] "There are no duplicate values"
The above code compares the unique data and the original data we have. If the counts are equal that means we don’t have any duplicate values in this dataframe.
And for that, there are no duplicate values in this dataframe.
But, before we start with our analysis, let’s check our
rating column to see if there’s any missing values. And if
there is, we are going to remove it, because it won’t make any sense for
a row without a rating as that is our metric for answering these
questions.
sum(is.na(df_cocoa$rating) == TRUE)
## [1] 0
In the code above, we count all of the NA values we have
in rating column. We got 0 that means we don’t
have any missing values for the rating column.
We can now start with our analysis.
To answer this question, we will be using two variables:
broad_bean_origin and rating.
Why broad_bean_origin?
broad_bean_origin contains
the specific geo-region of origin for the bean.Why I didn’t choose company_location?
company_locaiton contains the manufacturer base
country. Not all manufacturer use the same ingredients. The location of
a manufacturer can impact the quality of the ingredients used in the
product.
For example, if a manufacturer is located in a region with best cocoa, this can positively impact the taste of a chocolate produced by the manufacturer.
Now, let’s make this data frame.
best_cocoa_grown <- df_cocoa %>%
select(broad_bean_origin, rating) %>%
group_by(broad_bean_origin) %>%
summarise(average_rating = mean(rating))
head(best_cocoa_grown)
## # A tibble: 6 × 2
## broad_bean_origin average_rating
## <chr> <dbl>
## 1 Africa, Carribean, C. Am. 2.75
## 2 Australia 3.25
## 3 Belize 3.23
## 4 Bolivia 3.20
## 5 Brazil 3.28
## 6 Burma 3
The above code grouped all the values of
broad_bean_origin and calculated the average rating. The
average_rating will be the metric to determine the cocoa
beans grown.
Now, let’s prepare this data to count how many scored on each ratings.
cocoa_count <- c(
sum(best_cocoa_grown$average_rating < 6 & best_cocoa_grown$average_rating >= 5),
sum(best_cocoa_grown$average_rating < 5 & best_cocoa_grown$average_rating >= 4),
sum(best_cocoa_grown$average_rating < 4 & best_cocoa_grown$average_rating >= 3),
sum(best_cocoa_grown$average_rating < 3 & best_cocoa_grown$average_rating >= 2),
sum(best_cocoa_grown$average_rating < 2 & best_cocoa_grown$average_rating >= 1)
)
df_cocoa_count <- data.frame(
Ratings = c("5", "4", "3", "2", "1"),
Counts = cocoa_count
)
ggplot(df_cocoa_count, aes(Ratings, Counts)) + geom_bar(stat = "identity", fill = "blue") +
geom_text(aes(label = Counts), nudge_y = 2) +
labs(title = "Counts of each Ratings")
There are six broad_bean_origin value who scored the
average rating of 4.
We are only interested in the best cocoa beans grown so, let’s filter this data greater than or equal to the average rating of 4.
best_cocoa_grown %>%
filter(average_rating >= 4)
## # A tibble: 6 × 2
## broad_bean_origin average_rating
## <chr> <dbl>
## 1 Dom. Rep., Madagascar 4
## 2 Gre., PNG, Haw., Haiti, Mad 4
## 3 Guat., D.R., Peru, Mad., PNG 4
## 4 Peru, Dom. Rep 4
## 5 Ven, Bolivia, D.R. 4
## 6 Venezuela, Java 4
Six rows scored the same average rating of 4.
Here are the countries associated with each region mentioned:
Dom. Rep. refers to the Dominican Republic, and
Madagascar refers to the island country off the
southeastern coast of Africa.Gre. may refer to Greece, PNG to Papua New
Guinea, Haw. to Hawaii, Haiti to the Caribbean
country, and Mad to Madagascar.Guat. refers to Guatemala, D.R. to the
Dominican Republic, Peru to the South American country,
Mad. to Madagascar, and PNG to Papua New
Guinea.Peru refers to the South American country, and
Dom. Rep. refers to the Dominican Republic.Ven refers to Venezuela, Bolivia to the
South American country, and D.R. to the Dominican
Republic.Venezuela refers to the South American country, and
Java refers to the Indonesian island known for producing
coffee beans.Therefore, Dominican Republic, Madagascar, Greece, Papua New Guinea, Hawaii, Haiti, Guatemala, Peru, Venezuela, Bolivia, and Java have the best cocoa beans grown.
In this question, we are talking about the COUNTRY
and the HIGHEST-RATED BARS. And as I explained in the
previous question, the location of the manufacturer can impact the
product so, we are going to use the company_location and
the rating columns.
Let’s prepare our data to determine the highest-rated bars.
highest_rated_bar <- df_cocoa %>%
select(company_location, rating) %>%
group_by(company_location) %>%
summarise(highest_rated = max(rating))
head(highest_rated_bar)
## # A tibble: 6 × 2
## company_location highest_rated
## <chr> <dbl>
## 1 Amsterdam 3.75
## 2 Argentina 3.75
## 3 Australia 4
## 4 Austria 3.75
## 5 Belgium 4
## 6 Bolivia 3.75
In this code, we grouped each value of the
company_location and got their highest ratings. With this
data, we can now determine the highgest-rated bars.
But let’s see first how many scored on each ratings, let’s prepare it.
highest_count <- c(
sum(highest_rated_bar$highest_rated < 6 & highest_rated_bar$highest_rated >= 5),
sum(highest_rated_bar$highest_rated < 5 & highest_rated_bar$highest_rated >= 4),
sum(highest_rated_bar$highest_rated < 4 & highest_rated_bar$highest_rated >= 3),
sum(highest_rated_bar$highest_rated < 3 & highest_rated_bar$highest_rated >= 2),
sum(highest_rated_bar$highest_rated < 2 & highest_rated_bar$highest_rated >= 1)
)
df_highest_count <- data.frame(
Ratings = c("5", "4", "3", "2", "1"),
Counts = highest_count
)
ggplot(df_highest_count, aes(Ratings, Counts)) + geom_bar(stat = "identity", fill = "blue") +
geom_text(aes(label = Counts), nudge_y = 2) +
labs(title = "Counts of each Ratings")
In this visualization, we can see that one country scored the perfect 5.
Let’s look at the value.
filter(highest_rated_bar, highest_rated >= 5)
## # A tibble: 1 × 2
## company_location highest_rated
## <chr> <dbl>
## 1 Italy 5
Italy is the highest-rated bar.
Then, Is Italy using the best cocoa beans grown? Let’s find out.
df_cocoa %>%
select(company_location, rating, broad_bean_origin) %>%
filter(company_location == "Italy", rating == 5)
## # A tibble: 2 × 3
## company_location rating broad_bean_origin
## <chr> <dbl> <chr>
## 1 Italy 5 Venezuela
## 2 Italy 5 <NA>
As we can see here, Italy is using beans from Venezuela. And if we are going to look at our value from the previous question.
best_cocoa_grown %>%
filter(average_rating >= 4)
## # A tibble: 6 × 2
## broad_bean_origin average_rating
## <chr> <dbl>
## 1 Dom. Rep., Madagascar 4
## 2 Gre., PNG, Haw., Haiti, Mad 4
## 3 Guat., D.R., Peru, Mad., PNG 4
## 4 Peru, Dom. Rep 4
## 5 Ven, Bolivia, D.R. 4
## 6 Venezuela, Java 4
Venezuela is one of the best cocoa beans grown who scored 4.
Therefore, the location of the manufacturer can potentially affect the taste of the product, its just one of many factors that can impact the final flavor of a product.
In this question, we will be using the columns
cocoa_percent and rating to determine the
correlation between cocoa solids percentage and rating.
Let’s create our dataframe.
corr_cocoa_rating <- df_cocoa %>%
select(cocoa_percent, rating)
head(corr_cocoa_rating)
## # A tibble: 6 × 2
## cocoa_percent rating
## <dbl> <dbl>
## 1 0.63 3.75
## 2 0.7 2.75
## 3 0.7 3
## 4 0.7 3.5
## 5 0.7 3.5
## 6 0.7 2.75
Let’ visualize it.
corr_r <- cor(corr_cocoa_rating$cocoa_percent, corr_cocoa_rating$rating)
ggplot(corr_cocoa_rating, aes(x = cocoa_percent, y = rating)) +
geom_point() +
geom_smooth(formula = y ~ x, method = lm, se = FALSE) +
annotate("text", x = 0.5, y = 4.5, label = paste("R = ", format(round(corr_r, 2), nsmall = 2))) +
labs(title = "Correlation of Cocoa Percentage and Ratings")
There are little or no correlation of
cocoa_percent and
rating.
In this analysis, we created three questions to determine the following:
Six of them tied for the best cocoa beans grown. And Italy is the highest rated bar that uses one of the best cocoa beans which is from Venezuela. Although, we saw that there is almost no correlation with the cocoa percentage and rating.