This is a DataCamp competition about a specialty foods import company that wants to expand into gourmet chocolate bars.

1 Background

The background of the challenge states that:
“Your boss needs your team to research this market to inform your initial approach to potential suppliers. After finding valuable chocolate bar ratings online, you need to explore if the chocolate bars with the highest ratings share any characteristics that could help you narrow your search for suppliers (e.g., cacao percentage, bean country of origin, etc.)”
The challenges to solve are:

What is the average rating by country of origin?
How many bars were reviewed for each of those countries?
Create plots to visualize findings for questions 1 and 2.
Is the cacao bean’s origin an indicator of quality?
[Optional] How does cocoa content relate to rating? What is the average cocoa content for bars with higher ratings (above 3.5)?
[Optional 2] Your research indicates that some consumers want to avoid bars with lecithin. Compare the average rating of bars with and without lecithin (L in the ingredients).
Summarize your findings

2 Analysis

2.1 Preparation for analyses

suppressPackageStartupMessages({
  library(tidyverse)   # also attaches readr, ggplot2, stringr and dplyr
  library(treemap)
  library(highcharter)
})

2.2 Loading data

df <- read.csv("chocolate_bars.CSV")

2.3 Understanding and cleaning the data

There are 11 different columns in the data set

print (ncol(df))

## [1] 11

There are 2530 rows in the data set

print(nrow(df))

## [1] 2530

Names of variables

names(df)

##  [1] "id"               "manufacturer"     "company_location" "year_reviewed"   
##  [5] "bean_origin"      "bar_name"         "cocoa_percent"    "num_ingredients" 
##  [9] "ingredients"      "review"           "rating"

Renaming the bean_origin column to country

names(df)[names(df) == "bean_origin"] <- "country"

First 5 rows of the data set

head(df, n = 5)

##     id manufacturer company_location year_reviewed            country
## 1 2454         5150           U.S.A.          2019           Tanzania
## 2 2458         5150           U.S.A.          2019 Dominican Republic
## 3 2454         5150           U.S.A.          2019         Madagascar
## 4 2542         5150           U.S.A.          2021               Fiji
## 5 2546         5150           U.S.A.          2021          Venezuela
##                 bar_name cocoa_percent num_ingredients ingredients
## 1  Kokoa Kamili, batch 1            76               3       B,S,C
## 2        Zorzal, batch 1            76               3       B,S,C
## 3 Bejofo Estate, batch 1            76               3       B,S,C
## 4  Matasawalevu, batch 1            68               3       B,S,C
## 5  Sur del Lago, batch 1            72               3       B,S,C
##                              review rating
## 1         rich cocoa, fatty, bready   3.25
## 2            cocoa, vegetal, savory   3.50
## 3      cocoa, blackberry, full body   3.75
## 4               chewy, off, rubbery   3.00
## 5 fatty, earthy, moss, nutty,chalky   3.00

Number of distinct bar names reviewed

n_distinct(df$bar_name)

## [1] 1605

16 years of review (2006 to 2021)

df %>% distinct(year_reviewed)

##    year_reviewed
## 1           2019
## 2           2021
## 3           2012
## 4           2013
## 5           2014
## 6           2015
## 7           2016
## 8           2018
## 9           2020
## 10          2011
## 11          2009
## 12          2010
## 13          2017
## 14          2007
## 15          2008
## 16          2006

Number of chocolate bar manufacturers

n_distinct(df$manufacturer)

## [1] 580
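
Because mean() and cor() return NA when any input value is missing, it can help to count missing or empty values per column before moving on. The following is only a minimal sketch on a small made-up data frame: toy and its values are hypothetical, and with the real data df would take its place.

```r
# Hypothetical toy data standing in for df; values are made up.
toy <- data.frame(
  country       = c("Peru", "Ghana", "Peru", NA),
  cocoa_percent = c(70, 72, NA, 75),
  ingredients   = c("B,S,C", "", "B,S,C,L", "B,S"),
  stringsAsFactors = FALSE
)

# Count NA values per column.
na_counts <- colSums(is.na(toy))

# Count empty strings per column (read.csv leaves blank text cells as "").
empty_counts <- sapply(toy, function(col) sum(!is.na(col) & col == ""))

na_counts
empty_counts
```

If either count is non-zero for a column used later, one option is to pass na.rm = TRUE to mean() or to filter those rows out first.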

2.4 Solving the challenges

2.4.1 Challenge 1

What is the average rating by country of origin?
The country of origin is the country column, which was formerly named bean_origin.

avg_rating_country <- df %>%
  group_by(country) %>%
  summarize(average_rating = mean(rating))
avg_rating_country

## # A tibble: 62 x 2
##    country   average_rating
##    <chr>              <dbl>
##  1 Australia           3.25
##  2 Belize              3.23
##  3 Blend               3.04
##  4 Bolivia             3.18
##  5 Brazil              3.26
##  6 Burma               3   
##  7 Cameroon            3.08
##  8 China               3.5 
##  9 Colombia            3.20
## 10 Congo               3.32
## # ... with 52 more rows

2.4.2 Challenge 2

How many bars were reviewed for each of those countries?

No_of_bars_reviewed_per_country <- df %>%
  group_by(country) %>%
  summarize(no_of_bars_reviewed = n_distinct(bar_name))
No_of_bars_reviewed_per_country

## # A tibble: 62 x 2
##    country   no_of_bars_reviewed
##    <chr>                   <int>
##  1 Australia                   3
##  2 Belize                     40
##  3 Blend                     140
##  4 Bolivia                    57
##  5 Brazil                     55
##  6 Burma                       1
##  7 Cameroon                    3
##  8 China                       1
##  9 Colombia                   55
## 10 Congo                       9
## # ... with 52 more rows
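
Some countries have only one reviewed bar (China and Burma in the output above), so their averages rest on a single rating. One possible way to read the two tables together is to compute both figures in one summary and keep only countries with enough reviews. This is a sketch on made-up data, not part of the original analysis: toy, its values and the min_reviews threshold are all assumptions.

```r
library(dplyr)

# Hypothetical toy data standing in for df; values are made up.
toy <- data.frame(
  country = c("Peru", "Peru", "Peru", "Ghana", "Ghana", "China"),
  rating  = c(3.50, 3.00, 3.25, 2.75, 3.25, 3.50)
)

# min_reviews is an illustrative threshold, not a value from the analysis.
min_reviews <- 2

country_summary <- toy %>%
  group_by(country) %>%
  summarize(
    average_rating = mean(rating),
    n_reviews      = n()          # rows (reviews) per country
  ) %>%
  filter(n_reviews >= min_reviews) %>%
  arrange(desc(average_rating))

country_summary
```

Note that n() counts rows (reviews), whereas the analysis above counts distinct bar names with n_distinct(bar_name); which one is appropriate depends on how "bars reviewed" is interpreted.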

2.4.3 Challenge 3

Create plots to visualize findings for questions 1 and 2.

Plot 1: average rating by country of origin (question 1)

Using the treemap chart

avg_rating_country_chart <- avg_rating_country %>%
  hchart(
    "treemap",
    hcaes(x = country, value = average_rating, color = average_rating)
  ) %>%
  hc_title(
    text = "<b>Average rating by country</b>",
    margin = 20,
    align = "center",
    style = list(color = "#22A884", useHTML = TRUE)
  )
avg_rating_country_chart # hover over each country to see its average rating

2.4.3.0.1 Plot 2

Number of bars reviewed for each country (question 2)

No_of_bars_reviewed_per_country_chart <- No_of_bars_reviewed_per_country %>%
  hchart(
    "treemap",
    hcaes(x = country, value = no_of_bars_reviewed, color = no_of_bars_reviewed)
  ) %>%
  hc_title(
    text = "<b>Number of bars reviewed for each country</b>",
    margin = 20,
    align = "center",
    style = list(color = "#22A884", useHTML = TRUE)
  )
No_of_bars_reviewed_per_country_chart # hover over each country to see the number of bars reviewed

2.4.4 Challenge 4

Is the cocoa bean’s origin an indicator of quality? Quality is assessed here by the rating, so the question is whether rating varies with country, the bean’s origin.
Because country is a categorical variable, a correlation coefficient does not apply to it directly; a scatter plot of rating against country is used instead as a visual check for any relationship.

A scatter plot to show if there’s any form of relationship between country and quality

ggplot(df) +
  aes(x = country, y = rating) +
  geom_point(colour = "#0c4c8a") +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        panel.background = element_blank(), axis.line = element_line(colour = "black"))

The plot shows no obvious pattern or trend in rating across countries. On this visual evidence, bean origin does not appear to be a strong indicator of quality.
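
A scatter plot with a categorical x-axis can hide differences between groups, so a formal comparison may be a useful complement. The sketch below is not the method used in this report: it swaps in a Kruskal-Wallis rank-sum test (kruskal.test from base R) and box plots, run on made-up data. The toy data frame and its values are hypothetical.

```r
# Hypothetical toy data standing in for df; values are made up.
toy <- data.frame(
  country = rep(c("Peru", "Ghana", "Belize"), each = 5),
  rating  = c(3.50, 3.25, 3.75, 3.00, 3.50,
              2.75, 3.00, 3.25, 2.50, 3.00,
              3.25, 3.50, 3.00, 3.25, 3.75)
)

# Kruskal-Wallis rank-sum test: do ratings differ between countries?
kw <- kruskal.test(rating ~ country, data = toy)
kw$p.value   # a small p-value suggests at least one country differs

# Box plots show the spread of ratings within each country.
boxplot(rating ~ country, data = toy, ylab = "rating")
```

With the real data, countries that have very few reviews would probably need to be grouped or excluded before such a test is meaningful.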

2.4.5 Challenge 5

How does cocoa content relate to rating? What is the average cocoa content for bars with higher ratings (above 3.5)?

2.4.5.1 How cocoa content (cocoa percent) relates to rating

A scatter plot is again used to show the relationship.

ggplot(df) +
  aes(x = rating, y = cocoa_percent) +
  geom_point(colour = "#0c4c8a") +
  theme_minimal() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        panel.background = element_blank(), axis.line = element_line(colour = "black"))

The plot shows no obvious pattern or trend, which suggests that any relationship between cocoa content and rating is weak.

cor(df$rating, df$cocoa_percent)

## [1] -0.1466896

The correlation coefficient is -0.1466896 (about -0.15), which indicates a very weak negative relationship between cocoa content and rating.
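
cor() returns only the coefficient. If a confidence interval or p-value is wanted, cor.test() from base R provides them, and method = "spearman" gives a rank-based alternative that may suit ratings recorded in 0.25 steps. A minimal sketch on made-up data (toy and its values are hypothetical):

```r
# Hypothetical toy data; values are made up.
toy <- data.frame(
  cocoa_percent = c(60, 65, 70, 70, 72, 75, 80, 85, 90, 100),
  rating        = c(3.25, 3.50, 3.75, 3.50, 3.25, 3.50, 3.00, 3.25, 2.75, 2.00)
)

# Pearson correlation with a confidence interval and p-value.
pearson <- cor.test(toy$cocoa_percent, toy$rating)
pearson$estimate
pearson$conf.int

# Spearman rank correlation; exact = FALSE avoids a warning when ties are present.
spearman <- cor.test(toy$cocoa_percent, toy$rating,
                     method = "spearman", exact = FALSE)
spearman$estimate
```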

2.4.5.2 Average cocoa content for bars with higher ratings (above 3.5)

Avg_cocoa_content <- df %>%
  filter(rating > 3.5) %>%
  group_by(bar_name) %>%
  summarize(average_percent = mean(cocoa_percent), rating) %>%
  # desc() takes a single column, so sort by average_percent only
  arrange(desc(average_percent))
Avg_cocoa_content

## # A tibble: 412 x 3
## # Groups:   bar_name [349]
##    bar_name                               average_percent rating
##    <chr>                                            <dbl>  <dbl>
##  1 Dark, Central and S. America                        90   3.75
##  2 Crazy 88, Guat., D.R., Peru, Mad., PNG              88   4   
##  3 Upala, Batch 12                                     82   3.75
##  4 Carenero Superior                                   80   3.75
##  5 Fortissima                                          80   3.75
##  6 Peru, Awagum bar                                    80   3.75
##  7 Trinidad                                            80   3.75
##  8 Vanua Levu, Matasawalevu                            80   3.75
##  9 Costa Esmeralda, Batch 30                           78   3.75
## 10 Guadalcanal                                         78   3.75
## # ... with 402 more rows
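
The table above lists averages per bar name. The challenge question can also be read as asking for one overall figure, which could be obtained by dropping the grouping step. The sketch below shows that idea on made-up data; toy and its values are hypothetical, so the number it prints says nothing about the real data set.

```r
library(dplyr)

# Hypothetical toy data standing in for df; values are made up.
toy <- data.frame(
  cocoa_percent = c(70, 72, 80, 68, 75),
  rating        = c(3.75, 4.00, 3.50, 3.75, 3.00)
)

# Keep bars rated above 3.5, then average their cocoa content.
high_rated <- toy %>%
  filter(rating > 3.5) %>%
  summarize(
    average_percent = mean(cocoa_percent),
    n_bars          = n()
  )

high_rated
```

Replacing toy with df would give the overall average cocoa content of the bars rated above 3.5 in the real data.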

2.4.6 Challenge 6

Your research indicates that some consumers want to avoid bars with lecithin. Compare the average rating of bars with and without lecithin (L in the ingredients).

# create a new column that is TRUE when lecithin (L) is present in the ingredients
df$contains_lecithin <- str_detect(df$ingredients, "L")
# view the first 10 values
head(df$contains_lecithin, n = 10)

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

# with the new column added, compute the average rating for each group
average_rating_about_Lecithin <- df %>%
  group_by(contains_lecithin) %>%
  summarize(average_rating = mean(rating))
average_rating_about_Lecithin

## # A tibble: 2 x 2
##   contains_lecithin average_rating
##   <lgl>                      <dbl>
## 1 FALSE                       3.21
## 2 TRUE                        3.15

On average, bars without lecithin are rated slightly higher than bars with it (3.21 versus 3.15).
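
The gap between the two averages is small, so it may be worth checking whether it is larger than chance variation would produce. The sketch below is not part of the original analysis: it uses a Welch two-sample t-test (t.test from base R) on made-up data. The toy data frame and its values are hypothetical, and grepl() stands in for str_detect() only to keep the example free of extra packages.

```r
# Hypothetical toy data; values are made up.
toy <- data.frame(
  ingredients = c("B,S,C", "B,S", "B,S,C,L", "B,S,C,V,L", "B,S,C",
                  "B,S,C,L", "B,S", "B,S,C,V,L,Sa"),
  rating      = c(3.50, 3.25, 3.00, 3.25, 3.75, 2.75, 3.50, 3.00)
)

# Flag lecithin; fixed = TRUE treats "L" as a literal string, not a regex.
toy$contains_lecithin <- grepl("L", toy$ingredients, fixed = TRUE)

# Welch two-sample t-test comparing mean ratings of the two groups.
tt <- t.test(rating ~ contains_lecithin, data = toy)
tt$estimate   # group means
tt$p.value
```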

2.5 World map

Map showing manufacturing countries and corresponding number of bars manufactured

Datacamp challenge

Adebolu Temitope

5/12/2022