Firstly, let’s read in the packages that I am going to be using during the analysis.
The data here is provided to us in a csv file and kept in the input
directory, which is inside the current working directory. I just simply
load it here using the read_csv function from the
readr package.
I don’t know why I got error when I read the file
Rows: 1795 Columns: 9── Column specification
What is dim?
## [1] 1795 9
We seem to have just under 1,800 observations, or chocolates, and 9 variables. We can check the variables that we have by checking the column names of the data. It just like colnames
## [1] "Company \n(Maker-if known)" "Specific Bean Origin\nor Bar Name"
## [3] "REF" "Review\nDate"
## [5] "Cocoa\nPercent" "Company\nLocation"
## [7] "Rating" "Bean\nType"
## [9] "Broad Bean\nOrigin"
If we look at the column names we can see that they contain spaces as
well as some line breaks, which definintely isn’t good, so I will remove
them using gsub. What did he remove? I saw no
differece?
# Remove and replace spaces and line breaks
names(full_data) <- tolower(gsub(pattern = '[[:space:]+]', '_', names(full_data)))
# View the new column names
names(full_data)## [1] "company _(maker-if_known)" "specific_bean_origin_or_bar_name"
## [3] "ref" "review_date"
## [5] "cocoa_percent" "company_location"
## [7] "rating" "bean_type"
## [9] "broad_bean_origin"
That is much more manageable now. I have things about the
janitor package that deals with things like that, perhaps
in a better way so that will be something that I explore in the
future.
View the first few rows of the data.
| company _(maker-if_known) | specific_bean_origin_or_bar_name | ref | review_date | cocoa_percent | company_location | rating | bean_type | broad_bean_origin |
|---|---|---|---|---|---|---|---|---|
| A. Morin | Agua Grande | 1876 | 2016 | 63% | France | 3.75 | Sao Tome | |
| A. Morin | Kpime | 1676 | 2015 | 70% | France | 2.75 | Togo | |
| A. Morin | Atsane | 1676 | 2015 | 70% | France | 3.00 | Togo | |
| A. Morin | Akata | 1680 | 2015 | 70% | France | 3.50 | Togo | |
| A. Morin | Quilla | 1704 | 2015 | 70% | France | 3.50 | Peru | |
| A. Morin | Carenero | 1315 | 2014 | 70% | France | 2.75 | Criollo | Venezuela |
It looks like the first row of the data is duplicated from the header, so I will remove that as it definitely isn’t an observation. Why duplicated? I spot nothing
Another piece of data cleaning that we should do is remove the % sign from the cocoa_percent column. This will mean that we can have the column as numeric, which makes much more sense than having it as a string or category. Why can’t we turn it into numeric?? I understand gsub is remove % and turn the value into text, but sapply and fuction(x) is making me confuse
# Remove the % sign from the cocoa percent column
full_data$cocoa_percent <- sapply(full_data$cocoa_percent, function(x) gsub("%", "", x))We can now retype the columns to convert the cocoa_percent to a numeric. type_convert() is confusing, is the function can auto recognize the value and convert it to string or numeric??
# Retype the columns
full_data <- type_convert(full_data)
# Have a look at the new data types
map_df(full_data, class)| company _(maker-if_known) | specific_bean_origin_or_bar_name | ref | review_date | cocoa_percent | company_location | rating | bean_type | broad_bean_origin |
|---|---|---|---|---|---|---|---|---|
| character | character | numeric | numeric | numeric | character | numeric | character | character |
We can convert the year column to a date type using the
lubridate package.
# Mutate a column that converts the year into a date type
full_data <- full_data %>%
mutate(review_date = ymd(paste(review_date, 1, 1, sep="-")))We can use the newly created date column that we mutated to create a simple summary table grouping by the year.
avg_ratings <- full_data %>%
group_by(review_date) %>% # Group by new date column
summarise(avg_rating = mean(rating), n_ratings = n()) # Summary stats
# Print our new table
avg_ratings| review_date | avg_rating | n_ratings |
|---|---|---|
| 2006-01-01 | 3.125000 | 72 |
| 2007-01-01 | 3.162338 | 77 |
| 2008-01-01 | 2.994624 | 93 |
| 2009-01-01 | 3.073171 | 123 |
| 2010-01-01 | 3.148649 | 111 |
| 2011-01-01 | 3.256061 | 165 |
| 2012-01-01 | 3.178205 | 195 |
| 2013-01-01 | 3.197011 | 184 |
| 2014-01-01 | 3.189271 | 247 |
| 2015-01-01 | 3.246491 | 285 |
| 2016-01-01 | 3.223624 | 218 |
| 2017-01-01 | 3.312500 | 24 |
Here we have created a basic summary table of the average ratings and the number of reviewed observations for each year in the range of our data.
ggplot(full_data, aes(x = rating, fill = as.factor(review_date))) +
geom_density(alpha = .5) +
theme_minimal() +
facet_wrap(~ as.factor(year(review_date))) +
guides(fill = FALSE) + labs(x = 'Rating', y = 'Density')While the peaks throughout the years seems to remain fairly constant,
you can see that the most recent year doesn’t have the spread of the
first years.
We can analyse this further by creating a summary table and
visualisation while grouping by the year.
standard_deviations <-
full_data %>% group_by(review_date) %>%
summarise(count = n(), avg = mean(rating), sd = sd(rating))
standard_deviations| review_date | count | avg | sd |
|---|---|---|---|
| 2006-01-01 | 72 | 3.125000 | 0.7691224 |
| 2007-01-01 | 77 | 3.162338 | 0.6998193 |
| 2008-01-01 | 93 | 2.994624 | 0.5442118 |
| 2009-01-01 | 123 | 3.073171 | 0.4591195 |
| 2010-01-01 | 111 | 3.148649 | 0.4663426 |
| 2011-01-01 | 165 | 3.256061 | 0.4899536 |
| 2012-01-01 | 195 | 3.178205 | 0.4835962 |
| 2013-01-01 | 184 | 3.197011 | 0.4461178 |
| 2014-01-01 | 247 | 3.189271 | 0.4148615 |
| 2015-01-01 | 285 | 3.246491 | 0.3810960 |
| 2016-01-01 | 218 | 3.223624 | 0.4200386 |
| 2017-01-01 | 24 | 3.312500 | 0.3317444 |
We seem to see that the standard deviation of rating seems to be decreasing as time passes. This is more clearly seen when you visualise the trend.
I have included the number of reviews per year here using the
size aesthetic of geom_point to give an idea
of the sample size for each year as time passes.
standard_deviations %>% # Use our previous summary table
ggplot(aes(review_date, sd)) +
geom_point(aes(size = count)) + # Change point size depending on number of reviews
geom_line(size = .5) +
theme_minimal() +
scale_x_date(date_breaks = '1 year', date_labels = '%Y') +
scale_size(breaks = seq(0,250,50), name = 'Number of ratings') +
labs(x = 'Date',
y = 'Standard deviation of ratings',
title = 'Standard deviation of review ratings over time')Upon visualisation of the data we can see that 2017 might simply have a smaller spread due to the smaller number of ratings that have been taken in that year. As the number of ratings taken in the year increases, we say see it rise to near that of the previous 6/7 years, although there does appear to be a definite downward trend, implying that reviewers are giving less extreme scores.
low_scores <- full_data %>%
group_by(review_date) %>%
summarise(less_than_2.5 = sum(rating < 2.5), count = n(), perc = round(less_than_2.5 / count, 2))
low_scores| review_date | less_than_2.5 | count | perc |
|---|---|---|---|
| 2006-01-01 | 13 | 72 | 0.18 |
| 2007-01-01 | 7 | 77 | 0.09 |
| 2008-01-01 | 7 | 93 | 0.08 |
| 2009-01-01 | 6 | 123 | 0.05 |
| 2010-01-01 | 5 | 111 | 0.05 |
| 2011-01-01 | 5 | 165 | 0.03 |
| 2012-01-01 | 6 | 195 | 0.03 |
| 2013-01-01 | 4 | 184 | 0.02 |
| 2014-01-01 | 5 | 247 | 0.02 |
| 2015-01-01 | 3 | 285 | 0.01 |
| 2016-01-01 | 2 | 218 | 0.01 |
| 2017-01-01 | 0 | 24 | 0.00 |
Having a closer look, by year, at the reviews that were given a rating of less than 2.5, you can clearly see that the number of those poor scores is going down, despite the number of reviews going up! 18% of reviews in 2006 (13/72) where less than 2.5, while only 5 reviews fell below that threshold since 2015!
Once again, plotting this trend makes it much easier to digest.
low_scores %>%
ggplot() + geom_line(aes(review_date, perc * 100)) +
theme_minimal() +
scale_x_date(date_breaks = '1 year', date_labels = '%Y') +
labs(x = 'Date',
y = 'Percentage',
title = 'Percentage of ratings below 2.5 by year')So we have established that the spread of the ratings is getting smaller, and reviewers are giving less low scores below 2.5.
We can further visualise these features of the data by exploring other plots.
full_data %>%
ggplot(aes(review_date, rating)) +
geom_count(aes(colour = ..n..)) +
geom_line(data = avg_ratings, aes(review_date, avg_rating), colour = 'red') + # Use average ratings table from earlier
scale_x_date(date_breaks = '1 year', date_labels = '%Y') +
scale_size_continuous(name = 'Count') + scale_color_continuous(name = 'Count') +
expand_limits(y = 0) + # Expand limits to include 0
theme_minimal() +
labs(x = 'Review year', y = 'Rating')We can see that the spread of the reviews is definitely narrowing, with the lower tail retracting and reviews now generally falling between 3 and 4 in 2015 onwards.
We can investigate whether the location of the company has an impact on the quality of the product.
full_data %>%
group_by(company_location) %>%
filter(n() > 10) %>%
mutate(avg = mean(rating)) %>%
ggplot() +
geom_boxplot(aes(reorder(company_location, avg), rating, fill = avg)) +
scale_fill_continuous(low = '#ffffcc', high = '#fc4e2a', name = "Average rating") +
coord_flip() +
theme_minimal() +
labs(x = 'Company Location', y = 'Rating') +
expand_limits(y = c(0,5))As the column in the data is company/maker, I will extract the maker so that we can compare the makers against one another to see if there are any master makers hidden in the data.
Once I have split the company and maker from the column then I will select those makers that have at least 3 reviewed products.
colnames(full_data)[1] <- 'company_maker'
makers <- full_data %>%
rowwise %>%
mutate(maker = str_match(company_maker, '\\((.*)\\)')[2]) %>%
filter(!is.na(maker)) %>% # If a maker was noted
group_by(maker) %>% # Grouping
filter(n() >= 3) %>% # Minimum product filter
mutate(avg = mean(rating)) # Get average rating
ggplot(makers) +
geom_boxplot(aes(x = reorder(maker, rating, FUN = mean), y = rating, fill = avg)) +
labs(x = 'Maker', y = 'Rating') +
coord_flip() +
theme_minimal() +
scale_fill_continuous(name = "Average rating") +
expand_limits(y = c(0,5))It appears as those Cinagra is our master chocolatier.
I split the company/maker column again, this time selecting the company, and filter by those with at least 10 reviews.
companies <- full_data %>%
rowwise() %>%
mutate(company = str_trim(str_split(company_maker, '\\(')[[1]][1])) %>%
group_by(company) %>%
filter(n() > 10) %>%
mutate(avg = mean(rating))By plotting the count of reviews for ratings using the
size aesthetic again, we are also able to see the
distribution of scores for the companies.
companies %>%
ggplot(aes(x = reorder(as.factor(company), rating, FUN = mean), y = rating)) +
geom_count(alpha = .1) +
geom_point(aes(x = as.factor(company), y = avg, colour = avg)) +
theme_minimal() +
coord_flip() +
labs(x = 'Company', y = 'Rating') +
scale_color_continuous(name = 'Average rating', breaks = seq(3,4,.25)) +
scale_size_continuous(name = 'Number of ratings', breaks = seq(0,14,2))This gives us an overview of companies overall, but doesn’t show us
trends for those companies over time. We can investigate that using
facet_wrap and plotting against the
review_date.
# Which of these companies are improving
full_data %>%
rowwise() %>%
mutate(company = str_trim(str_split(company_maker, '\\(')[[1]][1])) %>%
group_by(company) %>%
filter(n() > 10) %>%
group_by(company, review_date) %>%
summarise(count = n(), avg = mean(rating)) %>%
ungroup() %>%
ggplot() +
geom_point(aes(x = review_date, y = avg, size = count, colour = as.factor(company), alpha = 0.05)) +
geom_line(aes(x = review_date, y = avg, colour = as.factor(company))) +
facet_wrap(~ company, ncol = 3) +
scale_x_date(date_labels = '%Y', date_breaks = '2 year') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
guides(colour = FALSE) +
labs(x = 'Year of review', y = 'Average rating', title = "Company average ratings over time") +
scale_size_continuous(breaks = seq(4, 16, 2), name = 'Number of reviews') +
scale_alpha(guide = F)Some appear to have shown improvements, but such increases can probably just be put down to small number of reviews for a given year. To see whether the company is truly improving we would require a greater number of product reviews for the current year.
Let’s investigate to see whether there is a relationship between cocoa percentage and the chocolate’s rating.
full_data %>%
ggplot(aes(x = cocoa_percent, y = rating)) +
geom_jitter(alpha = .75) +
coord_cartesian(ylim = c(0,5)) +
labs(x = 'Cocoa percentage', y = 'Rating') +
theme_minimal() +
geom_smooth(method = 'lm', se = FALSE, col = 'red')There doesn’t appear to be a strong relationship between cocoa percent and rating here. Let’s have a look at the model separately.
##
## Call:
## lm(formula = rating ~ cocoa_percent, data = full_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.20675 -0.31981 0.04325 0.31806 1.79325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.075165 0.126835 32.130 < 2e-16 ***
## cocoa_percent -0.012406 0.001762 -7.041 2.72e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4717 on 1792 degrees of freedom
## Multiple R-squared: 0.02692, Adjusted R-squared: 0.02637
## F-statistic: 49.57 on 1 and 1792 DF, p-value: 2.716e-12
With an adjusted R-squared of 0.02662 it is fair to say that this
isn’t a very good model, but there is a relationship between the rating
and the cocoa_percent, it just isn’t the most informative.
The negative slope implies to us that the higher the
cocoa_percent then the lower the rating of the
chocolate.
If we were going to build a model to predict the rating of the chocolate then we would definitely have to take more variables into consideration.
We have already seen the impact that time has had on predictions. We can see that the spread of the ratings has decreased as time has passed. Perhaps factoring this into consideration will help us improve our prediction power.
ggplot(full_data, aes(x = cocoa_percent, y = rating, group = factor(year(review_date)))) +
geom_jitter(alpha = .75) +
coord_cartesian(ylim = c(0,5)) +
labs(x = 'Cocoa percentage', y = 'Rating') +
theme_minimal() +
geom_smooth(method = 'lm', se = FALSE, aes(colour = factor(year(review_date)))) +
scale_colour_discrete(name = 'Review year')A plot like this is good at showing us that there is definitely
deviation throughout the years regarding how cocoa percent relates to
the rating, but if we wanted to know more about which
years seem to diverge then faceting by year(review_date)
might be more effective. Let’s have a look.
ggplot(full_data, aes(x = cocoa_percent, y = rating, group = factor(year(review_date)))) +
geom_jitter(alpha = .5, shape = 21, size = 1) +
coord_cartesian(ylim = c(0,5)) +
labs(x = 'Cocoa percentage', y = 'Rating', title = 'Rating and cocoa percent faceted by year of review') +
theme_minimal() +
geom_smooth(method = 'lm', se = FALSE, aes(colour = factor(year(review_date)))) +
facet_wrap(~ year(review_date)) +
guides(colour = FALSE) + # Remove colour legend as faceting makes this clearer anyway
theme(panel.spacing = unit(1, "lines")) # Just increase the spacing between plots slightlyBoth the above plots illustrate to us that the slope of the line changes with year. The first one let’s us evaluate that there are differences year on year and we can compare them, but it is quite messy. While faceting makes it slightly more difficult to directly compare the linear slopes but makes evaluating one at a time much more manageable.
We can see how adding year to the linear predictor impacts the quality of our model.
model <- lm(formula = rating ~ cocoa_percent + factor(year(review_date)),
data = full_data)
summary(model)##
## Call:
## lm(formula = rating ~ cocoa_percent + factor(year(review_date)),
## data = full_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.18802 -0.27862 0.02444 0.30343 1.86240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.019413 0.136438 29.460 < 2e-16 ***
## cocoa_percent -0.012597 0.001758 -7.168 1.11e-12 ***
## factor(year(review_date))2007 0.050426 0.076777 0.657 0.5114
## factor(year(review_date))2008 -0.108974 0.073556 -1.482 0.1386
## factor(year(review_date))2009 -0.058845 0.069481 -0.847 0.3972
## factor(year(review_date))2010 0.020868 0.070849 0.295 0.7684
## factor(year(review_date))2011 0.130679 0.066129 1.976 0.0483 *
## factor(year(review_date))2012 0.059859 0.064572 0.927 0.3540
## factor(year(review_date))2013 0.087963 0.065122 1.351 0.1769
## factor(year(review_date))2014 0.080056 0.062744 1.276 0.2022
## factor(year(review_date))2015 0.134265 0.061781 2.173 0.0299 *
## factor(year(review_date))2016 0.108679 0.063656 1.707 0.0879 .
## factor(year(review_date))2017 0.194324 0.110359 1.761 0.0784 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4682 on 1781 degrees of freedom
## Multiple R-squared: 0.04706, Adjusted R-squared: 0.04064
## F-statistic: 7.33 on 12 and 1781 DF, p-value: 2.798e-13
This shows us that some of the years have a ‘signficant’ impact on the response variable, but overall the impact isn’t as substantial as we might hope for.
We can note that the R-squared value has risen from 0.02662 to 0.04097 so it is in the right direction but still pretty awful.
Our linear models have yielded very little success; while this is likely to be indicative that the relationships and predictive power just isn’t present in the data, I thought that I would just give it a go using a random forest rather than a plain linear model.
Random forest models are probably more associated with classification
tasks, rather than the regression task that we have set ourselves here.
We will see later what that means for our predictions. Let’s first build
the model using the rpart package.
library(rpart)
set.seed(1000) # Set the seed to make results reproducible
random_forest_model <- rpart(formula = 'rating ~ cocoa_percent + factor(year(review_date))', data = full_data)
# Calling summary on a random forest model produces a large output so I have commented out for the sake of presentation space
# summary(random_forest_model)Now that we have our model, let’s make some predictions and compare them to the actual ratings that were given.
## forest_pred
## 2.27 2.85843373493976 2.97938144329897 3.13590604026846
## 25 83 97 149
## 3.23940972222222
## 1440
This is what I was talking about earlier. The predictions are binned because of the structure of the tree and decision trees in general. In our case the inputs are branched and reach a terminal node where they are given a numeric value as the prediction for that particular observation; 1 of 5 possible predictions in our case.
We should also note that one of the predictions is greatly favoured: 3.23976405274115
This doesn’t necessarily mean that the random forest is ineffective, and perhaps with some fine tuning we could improve that, but we want to be careful not to overfit the data by making the trees too deep.
library(Metrics) # Load metrics library for the rmse function
forest_rmse <- rmse(actual = full_data$rating, predicted = forest_pred)
linear_rmse <- rmse(actual = full_data$rating, predicted = predict(model, full_data))
# Display the two results
data.frame(forest_rmse, linear_rmse)| forest_rmse | linear_rmse |
|---|---|
| 0.4548661 | 0.4664957 |
It does appear that the random forest model performs slightly better than the linear model that we ran earlier.
Our model is lacking because of the predictors we are using to predict our response variables; the strength of the relationship with rating is not strong enough to magically produce a good prediction model.
From the exploratory analysis, I can see that there appears to be a loose correlation with the location of the company and the rating. Perhaps including this will improve our predictive power, and reduce the RMSE.
## [1] "character"
## [1] 60
I’m interested to see whether the random forest model will be able to
handle the 60 categories categorical variable, or whether we have to
one-hot encode it for example. For some algorithms, where a matrix is
required, we could use vtreat to do just that and create a
sparse matrix where the categories are spread across columns.
random_forest_model_2 <- rpart(formula = 'rating ~ cocoa_percent + factor(year(review_date)) + factor(company_location)', data = full_data)
forest_pred_location = predict(random_forest_model_2, full_data)
table(forest_pred_location)## forest_pred_location
## 2.27 2.78977272727273 2.97297297297297 3.17489495798319
## 25 44 148 952
## 3.3168
## 625
We can already see a difference! We have 6 output classes or terminal nodes with our second tree.
| Overall | |
|---|---|
| cocoa_percent | 0.1003184 |
| factor(company_location) | 0.0948902 |
| factor(year(review_date)) | 0.0683489 |
We can see that company_location has jumped over the
year of review and is more important for our predictive power.
cocoa_percent still remains the most important predictor
variable. Let’s have a look and see if it has had a positive impact on
the RMSE.
forest_location_rmse <- rmse(actual = full_data$rating, predicted = forest_pred_location)
data.frame(forest_rmse, linear_rmse, forest_location_rmse)| forest_rmse | linear_rmse | forest_location_rmse |
|---|---|---|
| 0.4548661 | 0.4664957 | 0.4506148 |
Yet more improvement!
NB: we are training and testing our models on the same dataset, which isn’t good practice and could be the reason for the difference in results. We would need to test the out-of-sample error using cross-validation for a better indication of model strength.
Let’s have a look at the unique origins that are listed in the
dataset. This is easily achieved using the unique
function.
## [1] "Togo" "Peru"
## [3] "Venezuela" "Cuba"
## [5] "Panama" "Madagascar"
## [7] "Brazil" "Ecuador"
## [9] "Colombia" "Burma"
## [11] "Papua New Guinea" "Bolivia"
## [13] "Fiji" "Mexico"
## [15] "Indonesia" "Trinidad"
## [17] "Vietnam" "Nicaragua"
## [19] "Tanzania" "Dominican Republic"
## [21] "Ghana" "Belize"
## [23] " " "Jamaica"
## [25] "Grenada" "Guatemala"
## [27] "Honduras" "Costa Rica"
## [29] "Domincan Republic" "Haiti"
## [31] "Congo" "Philippines"
## [33] "Malaysia" "Dominican Rep., Bali"
## [35] "Venez,Africa,Brasil,Peru,Mex" "Gabon"
## [37] "Ivory Coast" "Carribean"
## [39] "Sri Lanka" "Puerto Rico"
## [41] "Sao Tome" "Uganda"
## [43] "Martinique" "Sao Tome & Principe"
## [45] "Vanuatu" "Australia"
## [47] "Liberia" "Ecuador, Costa Rica"
## [49] "West Africa" "Hawaii"
## [51] "St. Lucia" "Cost Rica, Ven"
## [53] "Peru, Madagascar" "Venezuela, Trinidad"
## [55] "Trinidad, Tobago" "Ven, Trinidad, Ecuador"
## [57] "South America, Africa" "India"
## [59] "Africa, Carribean, C. Am." "Tobago"
## [61] "Ven., Indonesia, Ecuad." "Trinidad-Tobago"
## [63] "Peru, Ecuador, Venezuela" "Venezuela, Dom. Rep."
## [65] "Colombia, Ecuador" "Solomon Islands"
## [67] "Nigeria" "Peru, Belize"
## [69] "Peru, Mad., Dom. Rep." NA
## [71] "PNG, Vanuatu, Mad" "El Salvador"
## [73] "South America" "Samoa"
## [75] "Ghana, Domin. Rep" "Trinidad, Ecuador"
## [77] "Cameroon" "Venezuela, Java"
## [79] "Venezuela/ Ghana" "Venezuela, Ghana"
## [81] "Indonesia, Ghana" "Peru(SMartin,Pangoa,nacional)"
## [83] "Principe" "Central and S. America"
## [85] "Ven., Trinidad, Mad." "Carribean(DR/Jam/Tri)"
## [87] "Ghana & Madagascar" "Ven.,Ecu.,Peru,Nic."
## [89] "Madagascar & Ecuador" "Guat., D.R., Peru, Mad., PNG"
## [91] "Peru, Dom. Rep" "Dom. Rep., Madagascar"
## [93] "Gre., PNG, Haw., Haiti, Mad" "Mad., Java, PNG"
## [95] "Ven, Bolivia, D.R." "DR, Ecuador, Peru"
## [97] "Suriname" "Peru, Ecuador"
## [99] "Ecuador, Mad., PNG" "Ghana, Panama, Ecuador"
## [101] "Venezuela, Carribean"
Since we are looking at the broad bean origin during this part of the
analysis we are going to want to remove those rows that don’t appear to
have a known bean origin.
Using the unique values above we are able to look through the origins
that are included in the data set and use that information to filter all
the observations accordingly.
beans <- full_data %>%
filter(!is.na(broad_bean_origin), # Remove those with NA values
!nchar(broad_bean_origin) < 2) # Remove those where the origin is too short to be a realistic palceNow that we have filtered the data (beans) to include
just those observations with broad bean origins we are able to working
out which origins occur the most frequently, or are the most widely used
in these chocolates.
# Most frequent broad bean origin
beans %>%
group_by(broad_bean_origin) %>% # Group by origin
filter(n() > 10) %>% # Limit to those with at least 10 observations
mutate(count = n()) %>% # Add the count column
ggplot(aes(x = reorder(broad_bean_origin, count))) +
geom_bar() +
coord_flip() +
theme_minimal() +
labs(x = 'Bean origin', y = 'Count', title = 'Most frequently used broad bean origins')Looking at this simple graph we are able to see that the most popular broad bean origins are Venezuela, Ecuador and Peru. Madagascar, probably more famous for its vanilla, is in fourth place.
Does the origin of the bean have a noticeable impact on the
rating of the chocolate?
We can visualise the answer to this question by looking at the
statistics of the ratings when they are grouped by their broad bean
origin.
beans %>%
group_by(broad_bean_origin) %>%
filter(n() > 10) %>% # Keep only those with more than 10 observations
mutate(count = n()) %>%
ggplot() +
geom_boxplot(aes(x = broad_bean_origin, y = rating)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 75, hjust = 1)) +
labs(x = 'Broad bean origin', y = 'Rating')Having seen the boxplot, we can see that there doesn’t appear to be any relationship between broad bean origin and rating. The company or the maker that is producing the product itself must be a more important indicator of the overall rating of the chocolate.
We can use the subset which contains only the companies with the most observations to generate a heatmap showing the interactions between company and broad bean origin.
companies %>%
filter(!is.na(broad_bean_origin), !nchar(broad_bean_origin) < 2) %>%
group_by(company, broad_bean_origin) %>%
summarise(count = n()) %>%
filter(count > 0) %>%
ggplot() +
geom_tile(aes(broad_bean_origin, company, fill = count)) +
coord_equal() +
scale_fill_continuous(name = 'Count') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 5)) +
theme(axis.text.y = element_text(size = 5)) +
labs(x = 'Broad bean origin', y = 'Company')We can see that there are only a few bright spots on what is a quite
busy plot. The popular locations for broad bean origin make the vertical
straight lines on the plot.
You can see clear vertical lines at Venezuela, Peru and Ecuador as we
would expect.
Thanks for taking the time to have a look at my analysis and
visualisations, if you enjoyed it or learned something then please feel
free to upvote the kernel ;)
I am still looking to improve this kernel and would greatly appreciate
any feedback or advice that you could share with me.