This is a DataCamp competition involving a specialty foods import company that wants to expand into gourmet chocolate bars.
The background of the challenge states that:
“Your boss needs your team to research this market to inform your initial approach to potential suppliers. After finding valuable chocolate bar ratings online, you need to explore if the chocolate bars with the highest ratings share any characteristics that could help you narrow your search for suppliers (e.g., cacao percentage, bean country of origin, etc.)”
The challenge questions are worked through one by one below, after the packages and data are loaded.
# Load packages quietly; tidyverse already attaches readr, ggplot2, stringr and dplyr
suppressPackageStartupMessages({
  library(tidyverse)
  library(treemap)
  library(highcharter)
})
df<- read.csv("chocolate_bars.CSV")
There are 11 different columns in the data set
print (ncol(df))
## [1] 11
There are 2530 rows in the data set
print(nrow(df))
## [1] 2530
Names of variables
names(df)
## [1] "id" "manufacturer" "company_location" "year_reviewed"
## [5] "bean_origin" "bar_name" "cocoa_percent" "num_ingredients"
## [9] "ingredients" "review" "rating"
Renaming the bean_origin column (column 5) to country
colnames(df)[5] <- "country"
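Renaming by position works only while bean_origin stays in column 5. As a minimal sketch of a name-based alternative, assuming dplyr is available, the rename can be written as below. The `toy` data frame is made up for illustration and only its column names matter.

```r
library(dplyr)

# Toy stand-in for the real data (made-up values, hypothetical rows)
toy <- data.frame(
  id = c(1, 2),
  bean_origin = c("Tanzania", "Fiji"),
  rating = c(3.25, 3.00)
)

# rename() takes new_name = old_name, so the column position no longer matters
toy <- toy %>% rename(country = bean_origin)

names(toy)
```

Applied to the real data, the same call would be `df <- df %>% rename(country = bean_origin)`.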
First 5 rows of the data set
head(df, n=5)
## id manufacturer company_location year_reviewed country
## 1 2454 5150 U.S.A. 2019 Tanzania
## 2 2458 5150 U.S.A. 2019 Dominican Republic
## 3 2454 5150 U.S.A. 2019 Madagascar
## 4 2542 5150 U.S.A. 2021 Fiji
## 5 2546 5150 U.S.A. 2021 Venezuela
## bar_name cocoa_percent num_ingredients ingredients
## 1 Kokoa Kamili, batch 1 76 3 B,S,C
## 2 Zorzal, batch 1 76 3 B,S,C
## 3 Bejofo Estate, batch 1 76 3 B,S,C
## 4 Matasawalevu, batch 1 68 3 B,S,C
## 5 Sur del Lago, batch 1 72 3 B,S,C
## review rating
## 1 rich cocoa, fatty, bready 3.25
## 2 cocoa, vegetal, savory 3.50
## 3 cocoa, blackberry, full body 3.75
## 4 chewy, off, rubbery 3.00
## 5 fatty, earthy, moss, nutty,chalky 3.00
Number of distinct bar names reviewed
n_distinct(df$bar_name)
## [1] 1605
16 years of reviews (2006 to 2021)
df %>% distinct(year_reviewed)
## year_reviewed
## 1 2019
## 2 2021
## 3 2012
## 4 2013
## 5 2014
## 6 2015
## 7 2016
## 8 2018
## 9 2020
## 10 2011
## 11 2009
## 12 2010
## 13 2017
## 14 2007
## 15 2008
## 16 2006
Number of chocolate bar manufacturers
n_distinct(df$manufacturer)
## [1] 580
What is the average rating by country of origin?
The country of origin is stored in the country column, which was formerly named bean_origin.
avg_rating_country <- df %>%
  group_by(country) %>%
  summarize(average_rating = mean(rating))
avg_rating_country
## # A tibble: 62 x 2
## country average_rating
## <chr> <dbl>
## 1 Australia 3.25
## 2 Belize 3.23
## 3 Blend 3.04
## 4 Bolivia 3.18
## 5 Brazil 3.26
## 6 Burma 3
## 7 Cameroon 3.08
## 8 China 3.5
## 9 Colombia 3.20
## 10 Congo 3.32
## # ... with 52 more rows
How many bars were reviewed for each of those countries?
No_of_bars_reviewed_per_country <- df %>%
  group_by(country) %>%
  summarize(no_of_bars_reviewed = n_distinct(bar_name))
No_of_bars_reviewed_per_country
## # A tibble: 62 x 2
## country no_of_bars_reviewed
## <chr> <int>
## 1 Australia 3
## 2 Belize 40
## 3 Blend 140
## 4 Bolivia 57
## 5 Brazil 55
## 6 Burma 1
## 7 Cameroon 3
## 8 China 1
## 9 Colombia 55
## 10 Congo 9
## # ... with 52 more rows
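Because some countries have only one or two bars behind their average, it can help to read the two tables side by side. The following is a minimal sketch, not part of the original analysis, that computes both values in a single summary and sorts by average rating. The `toy` data frame and the name `country_summary` are made up for illustration.

```r
library(dplyr)

# Made-up ratings; the column names mirror the ones used above
toy <- data.frame(
  country  = c("Belize", "Belize", "Fiji", "Peru", "Peru", "Peru"),
  bar_name = c("A", "B", "C", "D", "D", "E"),
  rating   = c(3.5, 3.0, 3.25, 3.75, 3.25, 3.5)
)

# One row per country with both the average rating and the number of distinct bars
country_summary <- toy %>%
  group_by(country) %>%
  summarize(
    average_rating      = mean(rating),
    no_of_bars_reviewed = n_distinct(bar_name),
    .groups = "drop"
  ) %>%
  arrange(desc(average_rating))

country_summary
```

With the real data, replacing `toy` with `df` would give one table in which a high average backed by a single bar is easy to spot.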
Create plots to visualize findings for questions 1 and 2
Plot 1: average rating by country of origin (question 1)
Using the Treemap chart
avg_rating_country_chart<- avg_rating_country %>%
hchart(
"treemap",
hcaes(x = country, value = average_rating, color = average_rating)
)%>%
hc_title(
text = "<b>Average rating by each country</b>",
margin = 20,
align = "center",
style = list(color = "#22A884", useHTML = TRUE)
)
avg_rating_country_chart # hover over each country to see its average rating
Plot 2: number of bars reviewed for each country (question 2)
No_of_bars_reviewed_per_country_chart <- No_of_bars_reviewed_per_country %>%
hchart(
"treemap",
hcaes(x = country, value = no_of_bars_reviewed, color = no_of_bars_reviewed)
)%>%
hc_title(
text = "<b>Number of bars reviewed for each country</b>",
margin = 20,
align = "center",
style = list(color = "#22A884", useHTML = TRUE)
)
No_of_bars_reviewed_per_country_chart # hover over each country to see the number of bars reviewed
Is the cocoa bean’s origin an indicator of quality?
Quality is assessed here by the rating given, so the question becomes whether there is a relationship between country (the bean’s origin) and rating.
One way to detect a relationship between two variables is to construct a scatter plot.
A scatter plot to show whether there is any form of relationship between country and rating
ggplot(df) +
aes(x = country, y = rating) +
geom_point(colour = "#0c4c8a") +
theme_minimal()+
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(), axis.line = element_line(colour = "black"))
The diagram above shows no visible pattern or trend. This suggests no clear relationship between country and rating, so on this evidence bean origin does not appear to be an indicator of quality.
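Since country is a categorical variable, a scatter plot can only hint at differences between groups. One possible complement, which is not part of the original analysis, is a Kruskal-Wallis rank sum test of whether the rating distributions differ by country, together with per-country medians. The sketch below uses base R only; the `toy` data and the three origins "A", "B" and "C" are made up for illustration.

```r
# Made-up ratings for three hypothetical origins
toy <- data.frame(
  country = rep(c("A", "B", "C"), each = 5),
  rating  = c(3.00, 3.25, 3.25, 3.50, 3.00,
              3.25, 3.50, 3.00, 3.25, 3.50,
              3.50, 3.25, 3.00, 3.25, 3.75)
)

# Kruskal-Wallis rank sum test: do the rating distributions differ by country?
kw <- kruskal.test(rating ~ country, data = toy)
kw$p.value

# Per-country medians give a quick view of how far apart the groups are
aggregate(rating ~ country, data = toy, FUN = median)
```

On the real data the call would be `kruskal.test(rating ~ country, data = df)`. Countries with very few reviewed bars would contribute little information, so the result should be read alongside the counts per country.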
How does cocoa content relate to rating? What is the average cocoa content for bars with higher ratings (above 3.5)?
A scatter plot is again constructed to show the relationship
ggplot(df) +
aes(x = rating, y = cocoa_percent) +
geom_point(colour = "#0c4c8a") +
theme_minimal()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(), axis.line = element_line(colour = "black"))
This diagram also shows no clear pattern or trend, which suggests little to no relationship between cocoa content and rating
cor(df$rating, df$cocoa_percent)
## [1] -0.1466896
The correlation coefficient is -0.1466896 ≈ -0.15, which indicates a very weak negative relationship between cocoa content and rating
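A coefficient on its own does not say how precisely it is estimated. As a minimal sketch, `cor.test()` from base R returns the same Pearson coefficient along with a confidence interval and a p-value. The two vectors below are made-up stand-ins for `df$cocoa_percent` and `df$rating`, not values from the data set.

```r
# Made-up values standing in for df$cocoa_percent and df$rating
cocoa_percent <- c(60, 65, 70, 70, 72, 75, 80, 85, 90, 100)
rating        <- c(3.25, 3.50, 3.75, 3.50, 3.25, 3.50, 3.00, 3.25, 2.75, 2.50)

# Pearson correlation with a p-value and a 95% confidence interval
ct <- cor.test(cocoa_percent, rating)

ct$estimate  # the correlation coefficient, identical to cor(cocoa_percent, rating)
ct$conf.int  # 95% confidence interval around the coefficient
ct$p.value
```

For the real data the call would be `cor.test(df$cocoa_percent, df$rating)`.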
Avg_cocoa_content <- df %>%
  filter(rating > 3.5) %>%
  group_by(bar_name) %>%
  summarize(average_percent = mean(cocoa_percent), rating) %>%
  # desc() takes a single column, so sort by average cocoa content only
  arrange(desc(average_percent))
Avg_cocoa_content
## # A tibble: 412 x 3
## # Groups: bar_name [349]
## bar_name average_percent rating
## <chr> <dbl> <dbl>
## 1 Dark, Central and S. America 90 3.75
## 2 Crazy 88, Guat., D.R., Peru, Mad., PNG 88 4
## 3 Upala, Batch 12 82 3.75
## 4 Carenero Superior 80 3.75
## 5 Fortissima 80 3.75
## 6 Peru, Awagum bar 80 3.75
## 7 Trinidad 80 3.75
## 8 Vanua Levu, Matasawalevu 80 3.75
## 9 Costa Esmeralda, Batch 30 78 3.75
## 10 Guadalcanal 78 3.75
## # ... with 402 more rows
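The table above lists cocoa content per bar name. The question also asks for a single average across all bars rated above 3.5, which can be sketched as follows. The `toy` data frame and the name `high_rated_avg` are made up for illustration, so the numbers here say nothing about the real data set.

```r
library(dplyr)

# Made-up bars; the column names mirror the ones used above
toy <- data.frame(
  bar_name      = c("A", "B", "C", "D"),
  cocoa_percent = c(70, 72, 80, 75),
  rating        = c(3.75, 4.00, 3.50, 3.25)
)

# Keep only bars rated above 3.5, then average cocoa content over all of them
high_rated_avg <- toy %>%
  filter(rating > 3.5) %>%
  summarize(
    average_cocoa_percent = mean(cocoa_percent),
    bars = n()
  )

high_rated_avg
```

Replacing `toy` with `df` would give the overall average cocoa content for the highly rated bars in one row.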
Your research indicates that some consumers want to avoid bars with lecithin. Compare the average rating of bars with and without lecithin (L in the ingredients).
# create a new column that is TRUE or FALSE for the presence of lecithin in the ingredients
df$contains_lecithin<-str_detect(df$ingredients,"L")
# view the first 10 values
head(df$contains_lecithin,n=10)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
# with the new column added, compute the average rating for both cases
average_rating_about_Lecithin <- df %>%
  group_by(contains_lecithin) %>%
  summarize(average_rating = mean(rating))
average_rating_about_Lecithin
## # A tibble: 2 x 2
## contains_lecithin average_rating
## <lgl> <dbl>
## 1 FALSE 3.21
## 2 TRUE 3.15
The average rating is therefore slightly higher for bars without lecithin (3.21) than for bars with it (3.15)
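A gap of this size is small, so it may be worth checking group sizes and whether the difference is larger than chance variation. The sketch below is one way to do that, not part of the original analysis: it counts the bars in each group and runs a Welch two-sample t-test from base R. The `toy` data frame is made up for illustration.

```r
library(stringr)

# Made-up ingredient lists and ratings
toy <- data.frame(
  ingredients = c("B,S,C", "B,S", "B,S,C,L", "B,S,C,V,L",
                  "B,S,C", "B,S,L", "B,S,C,L", "B,S"),
  rating      = c(3.50, 3.25, 3.00, 3.25, 3.75, 3.00, 3.50, 3.25)
)

# fixed("L") matches the literal letter rather than treating "L" as a regex
toy$contains_lecithin <- str_detect(toy$ingredients, fixed("L"))

# Group sizes matter when judging a small difference in means
table(toy$contains_lecithin)

# Welch two-sample t-test (the default for t.test with two groups)
tt <- t.test(rating ~ contains_lecithin, data = toy)
tt$estimate  # mean rating in each group
tt$p.value
```

On the real data the test would be `t.test(rating ~ contains_lecithin, data = df)`.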
Map showing manufacturing countries and corresponding number of bars manufactured