Note: This Project is built upon the Scenario
included in the Course Challenge of ‘Data
Analysis with R Programming’, provided by Google on
Coursera.
Introduction.
In this Scenario, the task is to analyze data from a company called
‘Chocolate and Tea’, a chain of cafes. The findings and output of the
analysis are then to be shared with the Stakeholders to improve the
overall service quality and enrich the company’s chocolate bar menu.
As described in the course, the main motivation for this analysis is
that the company aims to serve chocolate bars that are highly rated by
professional critics, to keep its list aligned with the latest ratings,
and to ensure that the list contains bars from a variety of countries.
Specifically, the company aims to determine which countries produce the
highest-rated bars of super dark chocolate (chocolate with a high
percentage of cocoa).
Data Source. For this project, the ‘Chocolate
Bar Ratings’ dataset will be used, which consists of 9
columns.
According to the description of the dataset, the cacao flavor is
rated on a scale of 1 to 5:
5: Elite
4: Premium
3: Satisfactory
2: Disappointing
1: Unpleasant
Step 1. Installing and Loading the Necessary
Packages.
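# Install the packages (only needed once per machine)
install.packages("tidyverse")
install.packages("janitor")
# Load the packages
library(tidyverse)
library(janitor)
# Read the dataset into a tibble
chocolate_bar_ratings <- read_csv("flavors_of_cacao.csv")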
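Calling install.packages() on every run re-downloads packages that may already be present. As a minimal sketch, the check below lists only the packages that are still missing; `missing_packages` is a hypothetical helper name introduced here for illustration and is not part of the course material.

```r
# Return the subset of `pkgs` that is not installed yet.
# `missing_packages` is a hypothetical helper, not a function from any package.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "janitor"))
to_install

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install call is left commented out so that running the sketch has no side effects.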
To get a quick overview of the dataset, we run the head()
function:
head(chocolate_bar_ratings)
## # A tibble: 6 × 9
## Company \n(Make…¹ Speci…² REF Revie…³ Cocoa…⁴ Compa…⁵ Rating Bean\…⁶ Broad…⁷
## <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 A. Morin Agua G… 1876 2016 63% France 3.75 Sao To…
## 2 A. Morin Kpime 1676 2015 70% France 2.75 Togo
## 3 A. Morin Atsane 1676 2015 70% France 3 Togo
## 4 A. Morin Akata 1680 2015 70% France 3.5 Togo
## 5 A. Morin Quilla 1704 2015 70% France 3.5 Peru
## 6 A. Morin Carene… 1315 2014 70% France 2.75 Criollo Venezu…
## # … with abbreviated variable names ¹`Company \n(Maker-if known)`,
## # ²`Specific Bean Origin\nor Bar Name`, ³`Review\nDate`, ⁴`Cocoa\nPercent`,
## # ⁵`Company\nLocation`, ⁶`Bean\nType`, ⁷`Broad Bean\nOrigin`
Step 2. Cleaning the Data.
A glimpse at the dataset shows that some of the column names are
inconsistent (they contain spaces, line breaks, or other special
characters), which makes them hard to work with.
The clean_names() function in the ‘janitor’ package will be used to
standardize the column names, and the cleaned dataset will be assigned
to ‘cleaned_df’:
cleaned_df <- chocolate_bar_ratings %>%
  clean_names()
Now, it is a good idea to check the output of the code:
colnames(cleaned_df)
## [1] "company_maker_if_known" "specific_bean_origin_or_bar_name"
## [3] "ref" "review_date"
## [5] "cocoa_percent" "company_location"
## [7] "rating" "bean_type"
## [9] "broad_bean_origin"
The column names have been successfully cleaned and are ready to be used
now.
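To see the renaming in isolation, the sketch below applies clean_names() to a one-row toy data frame. The three column names are copied from the raw headers shown in Step 1, and the values come from row 5 of the head() output; the object names `raw` and `cleaned` are assumptions made for this example.

```r
library(janitor)

# Toy data frame; the names mimic three of the raw column headers,
# including the embedded line breaks ("\n").
raw <- data.frame(a = "A. Morin", b = "70%", c = "Peru")
names(raw) <- c("Company \n(Maker-if known)", "Cocoa\nPercent", "Broad Bean\nOrigin")

# clean_names() converts the names to lower snake_case.
cleaned <- clean_names(raw)
names(cleaned)
```

The resulting names should match the corresponding entries in the colnames() output above.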
Step 3. Filtering the Data.
According to the instructions in the course challenge, any rating
greater than or equal to 3.9 points is considered a high rating, and a
bar is considered super dark chocolate if its cocoa percent is greater
than or equal to 75%.
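A new subset of cleaned_df will be created using the filter() function
and pipes. Because cocoa_percent is stored as text (for example ‘75%’),
it is converted to a number with parse_number() inside the comparison:
cleaned_df_filtered <- cleaned_df %>%
  filter(rating >= 3.9 & parse_number(cocoa_percent) >= 75)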
head(cleaned_df_filtered)
## # A tibble: 6 × 9
## company_maker_i…¹ speci…² ref revie…³ cocoa…⁴ compa…⁵ rating bean_…⁶ broad…⁷
## <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 Amedei Nine 111 2007 75% Italy 4 Blend
## 2 Bonnat Kaori 1339 2014 75% France 4 Brazil
## 3 Bonnat Haiti 629 2011 75% France 4 Haiti
## 4 Bonnat Madaga… 629 2011 75% France 4 Criollo Madaga…
## 5 Bonnat Porcel… 199 2008 75% France 4 Crioll… Venezu…
## 6 Bonnat Ocumar… 32 2006 75% France 4 Venezu…
## # … with abbreviated variable names ¹company_maker_if_known,
## # ²specific_bean_origin_or_bar_name, ³review_date, ⁴cocoa_percent,
## # ⁵company_location, ⁶bean_type, ⁷broad_bean_origin
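The cocoa_percent column is of type <chr>, so comparing it directly with a number is a text comparison rather than a numeric one. The sketch below uses made-up toy rows (not taken from the dataset) to show where the two differ; the names `toy`, `cocoa_num`, and `super_dark` are assumptions introduced for illustration.

```r
library(dplyr)
library(readr)

# Toy rows (made up for illustration), with cocoa stored as text.
toy <- tibble(
  rating        = c(4, 4, 4, 3),
  cocoa_percent = c("75%", "100%", "70%", "80%")
)

# Text comparison: 75 is coerced to "75", and "100%" sorts before it
# because the character "1" comes before "7".
"100%" >= 75

# Converting to numbers first gives the intended numeric comparison.
super_dark <- toy %>%
  mutate(cocoa_num = parse_number(cocoa_percent)) %>%
  filter(rating >= 3.9, cocoa_num >= 75)

super_dark$cocoa_percent
```

With the numeric comparison, a 100% bar is kept; with the text comparison it would be dropped.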
Step 4. Building Visuals based on the Cleaned and Filtered Dataset.
The next step is to find out which companies produce the highly rated
super dark chocolate bars that pass the filters introduced in Step 3.
For this purpose, a bar chart will be created using the ggplot2 package
and its functions.
ggplot(data = cleaned_df_filtered) +
  geom_bar(mapping = aes(x = company_maker_if_known))
Step 5. Improving the Readability of the Visual.
The bar chart created in Step 4 is basic, and its readability should
be improved. Specifically:
The visual should be given a title, and the x and y axes should be
renamed
A caption about the review years should be added to the plot
The background of the plot should be removed
The names on the x axis should be clearly visible
# Earliest and latest review years, used in the caption
min_date <- min(cleaned_df_filtered$review_date)
max_date <- max(cleaned_df_filtered$review_date)
ggplot(data = cleaned_df_filtered) +
  geom_bar(mapping = aes(x = company_maker_if_known), fill = 'lightblue') +
  labs(
    title = "Top Manufacturers of Highly Rated Chocolate Bars",
    caption = paste0("Data is from ", min_date, " to ", max_date),
    x = "Company",
    y = "Number of Highly Rated Bars"
  ) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90))
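One further readability option, not part of the course instructions, is to sort the bars from most to least frequent so the leading companies stand out. The sketch below is one possible way to do this with fct_infreq() from the forcats package (loaded with the tidyverse); the toy data and the `company_ordered` column are made up for illustration.

```r
library(ggplot2)
library(forcats)

# Toy data (made up for illustration): one row per highly rated bar.
toy <- data.frame(
  company_maker_if_known = c("Maker B", "Maker A", "Maker B",
                             "Maker C", "Maker B", "Maker A")
)

# fct_infreq() orders the factor levels by how often each value occurs.
toy$company_ordered <- fct_infreq(toy$company_maker_if_known)
levels(toy$company_ordered)

p <- ggplot(data = toy) +
  geom_bar(mapping = aes(x = company_ordered), fill = "lightblue") +
  labs(x = "Company", y = "Number of Highly Rated Bars") +
  theme_classic() +
  # hjust/vjust line the rotated labels up with their tick marks
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```

Printing `p` draws the chart with the most frequent company on the left.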
To see in which countries these top manufacturers are located, the
following visual will be built:
ggplot(data = cleaned_df_filtered) +
  geom_bar(mapping = aes(x = company_location), fill = 'cyan') +
  labs(title = "Countries where Top Manufacturers are Located",
       subtitle = "This chart shows how many highly rated bars come from manufacturers in each country",
       x = "Country", y = "Number of Highly Rated Bars") +
  theme_classic()
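With only an x aesthetic, geom_bar() counts rows, so each bar's height is the number of qualifying chocolate bars per country, not the number of distinct companies. If the number of distinct manufacturers per country is wanted instead, one possible approach is n_distinct(); the toy data and the `per_country` name below are made up for illustration.

```r
library(dplyr)

# Toy data (made up for illustration): one row per highly rated bar.
toy <- tibble(
  company_maker_if_known = c("Maker A", "Maker A", "Maker B", "Maker C", "Maker C"),
  company_location       = c("France", "France", "France", "Italy", "Italy")
)

per_country <- toy %>%
  group_by(company_location) %>%
  summarize(
    bars          = n(),                                # rows, as geom_bar() counts them
    manufacturers = n_distinct(company_maker_if_known)  # distinct companies
  ) %>%
  arrange(desc(manufacturers))

per_country
```

In this toy example France has three bars but only two distinct manufacturers, which shows how the two measures can diverge.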
According to the plots created in Step 5, the top two manufacturers are
Bonnat and Pralus, while the top two manufacturing countries are France
and the USA.
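Reading a ranking off a bar chart can be error-prone when several bars have similar heights. As a hedged sketch, the same ranking can be checked numerically with count(); the toy data and the `top_companies` name are made up for illustration, and on the real data `toy` would be replaced by cleaned_df_filtered.

```r
library(dplyr)

# Toy data (made up for illustration): one row per highly rated bar.
toy <- tibble(
  company_maker_if_known = c("Maker B", "Maker A", "Maker B",
                             "Maker C", "Maker B", "Maker A")
)

# Count bars per company, sort in descending order, keep the top two.
top_companies <- toy %>%
  count(company_maker_if_known, sort = TRUE, name = "bars") %>%
  slice_head(n = 2)

top_companies
```

The same pattern with company_location in place of company_maker_if_known would rank the countries.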