As part of the course “Developing Data Products” we’ve been asked to produce a small project creating a webpage using R Markdown which features plots created by plotly. I will host the web page on RPubs.
I have chosen to use a relatively small dataset available from Kaggle containing reviews of 2,200+ Scotch whiskies.
Thanks to user Koki Ando for providing the dataset, which was collected from the Whisky Advocate website.
I will use this dataset to demonstrate:
The ggplotly() function to convert a ggplot2 plot to a plotly graph
Note: You will need to download the data from Kaggle and unzip it into your working directory.
You will need the following packages installed:
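The package list itself did not survive in this copy of the page. Judging by the functions called further down (ggplot(), ggplotly(), plot_ly() and the %>% pipe), ggplot2 and plotly are presumably the ones meant; that is an inference rather than something stated here. A minimal sketch that checks for them before loading:

```r
# Packages inferred from the functions used in this write-up (an assumption)
needed <- c("ggplot2", "plotly")

# requireNamespace() returns FALSE instead of erroring when a package is absent
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not_installed <- needed[!installed]

if (length(not_installed) > 0) {
  message("Install first: install.packages(c(",
          paste0('"', not_installed, '"', collapse = ", "), "))")
} else {
  library(ggplot2)
  library(plotly)  # plotly also re-exports the %>% pipe
}
```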
First, let's import the data from scotch_review.csv and clean it up, mainly using gsub().
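Before the real cleaning code, a toy illustration may help. The code in this section recovers the ABV from the end of each name string. The sketch below applies the same idea to three made-up names (they are not rows from the dataset) and uses a simplified pattern, so treat it as an illustration of the approach rather than the exact cleaning used here:

```r
# Made-up names in the "<name>, <abv>%" shape the cleaning code assumes
names_demo <- c("Glen Example 12 year old, 43%",
                "Loch Sample Cask Strength, 1990 vintage, 57.1% ABV",
                "Ben Placeholder 18 year old")   # no ABV in the name

# Keep only the text after the last comma (".*" is greedy)
last_field <- sub(".*,", "", names_demo)

# Drop the "ABV" label and the percent sign, then trim whitespace
abv_text <- trimws(gsub("ABV|%", "", last_field))

# Anything that is not a clean number becomes NA
# (as.numeric() would normally warn about this; silenced here)
abv <- suppressWarnings(as.numeric(abv_text))
abv
```

The third name has no comma and no percentage, so nothing numeric is left to convert. That is the kind of row behind the "NAs introduced by coercion" warnings.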
whiskey <- read.csv("./scotch_review.csv")
# REMOVE A LEADING "." OR A TRAILING "," FROM THE NAME
whiskey$name <- gsub("^\\.|\\,$","", whiskey$name)
# TAKE THE TEXT AFTER THE LAST COMMA IN THE NAME TO GET ABV
whiskey$ABV <- as.numeric(sub("%","", gsub("ABV|â€™|'|$|â€¨|â€","", sub('.*\\,', '', whiskey$name))))
## Warning: NAs introduced by coercion
whiskey$price <- as.character(whiskey$price)
whiskey$price_num <- as.numeric(gsub(",|\\set", "",whiskey$price))
## Warning: NAs introduced by coercion
sum(is.na(whiskey$price_num)) #6NAs
## [1] 6
sum(is.na(whiskey$ABV))
## [1] 29
#29 NA VALUES - STILL LEAVES US WITH A GOOD SIZED DATASET
# ASSUME ANY WHISKEY < 46% IS CHILL FILTERED
# ANYTHING ABOVE 50% is CASK STRENGTH
whiskey$chill_filter <- ifelse(whiskey$ABV >= 46, "N", "Y")
whiskey$cask_strength <- ifelse(whiskey$ABV > 50, "Y", "N")
whiskey$std_ABV <- ifelse(whiskey$ABV == 40, "Y", "N")
whiskey$price_band <- ifelse(whiskey$price_num < 30, "A. <$30",
  ifelse(whiskey$price_num <= 50, "B. $30-$50",
  ifelse(whiskey$price_num <= 100, "D. $50-$100",
  ifelse(whiskey$price_num <= 250, "E. $100-$250",
  ifelse(whiskey$price_num <= 1000, "F. $250-$1000",
  ifelse(whiskey$price_num > 1000, "G. $1000+", NA))))))
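As an aside, base R's cut() can express the same kind of banding more compactly than nested ifelse() calls. The sketch below is an alternative, not the code used for the plots: the prices are invented, the labels are my own (lettered consecutively), and with right = TRUE a price of exactly $30 falls into the first band, whereas the nested ifelse() above puts it in the second.

```r
# Hypothetical prices; NA stands in for a price that failed to parse
price_demo <- c(25, 30, 50, 99.99, 250, 1000, 15000, NA)

# Band edges mirror the thresholds above; labels are illustrative
band_breaks <- c(-Inf, 30, 50, 100, 250, 1000, Inf)
band_labels <- c("A. up to $30", "B. $30-$50", "C. $50-$100",
                 "D. $100-$250", "E. $250-$1000", "F. $1000+")

# right = TRUE makes every band (lower, upper]; NA prices stay NA
price_band_demo <- cut(price_demo, breaks = band_breaks,
                       labels = band_labels, right = TRUE)

table(price_band_demo, useNA = "ifany")
```

cut() returns a factor whose levels follow the order of the breaks, so the bands keep their intended order on a plot axis without relying on alphabetical prefixes.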
Now let's make some plots and try to draw some conclusions. Prices are in US dollars and may therefore be quite different from prices in other parts of the world.
g <- ggplot(whiskey, aes(x = price_num, y = review.point, colour = category, label = name)) +
  geom_point(position = "jitter", size = 0.4) +
  ggtitle("Whisky Review Score by Price and Category") +
  xlab("Price ($)") +
  ylab("Review Score") +
  theme(legend.position = "bottom", legend.title = element_blank(), legend.background = element_blank())
ggplotly(g) %>%
  layout(legend = list(
    orientation = "v",
    x = 0.65,
    y = 0.01
  ))
In this plot you may want to zoom into the left portion of the graph, since a couple of extremely expensive whiskies skew the plot. One benefit of plotly is that it lets us zoom in to overcome issues like this, whereas in ggplot I might have had to play around with the xlim options to frame the graph well.
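If you would rather have the plot open already zoomed in, one option is to set the initial x-axis range. The sketch below uses made-up prices and an arbitrary 80th-percentile cut-off (both are assumptions, not values from the dataset). The plotly part only runs if the package is installed.

```r
# Hypothetical, heavily right-skewed prices (not taken from the dataset)
price_demo <- c(35, 48, 60, 75, 90, 120, 150, 220, 400, 15000, 157000)

# One possible cut-off for the visible window: the 80th percentile of price
upper <- unname(quantile(price_demo, 0.80, na.rm = TRUE))

# Share of points that fall inside the initial window
share_visible <- mean(price_demo <= upper)

# Set the initial x-axis window; points outside remain reachable by zooming out
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(x = price_demo, y = seq_along(price_demo),
                       type = "scatter", mode = "markers")
  p <- plotly::layout(p, xaxis = list(range = c(0, upper)))
}
```

In a static ggplot2 plot, coord_cartesian(xlim = c(0, upper)) would play a similar role without dropping the points outside the window.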
It's hard to draw conclusions from this graph:
x <- list(title = "Chill Filtered")
y <- list(title = "Review Score")
g2 <- plot_ly(whiskey, x = ~chill_filter, y = ~review.point, color = ~chill_filter, type = "box") %>%
  layout(xaxis = x, yaxis = y, title = "Does Using Chill Filtering impact review score? (Y = Chill Filtered)")
g2
## Warning: Ignoring 29 observations
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Here we can see:
x <- list(title = "Cask Strength")
y <- list(title = "Review Score")
# CASK STRENGTH
g3 <- plot_ly(whiskey, x = ~cask_strength, y = ~review.point, color = ~cask_strength, type = "box") %>%
  layout(xaxis = x, yaxis = y, title = "Does bottling at Cask Strength impact review score? (Y = Cask Strength)")
g3
## Warning: Ignoring 29 observations
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
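The RColorBrewer warnings are harmless. They appear to come from plotly falling back on the "Set2" palette, which has a minimum of three colours, while these box plots have only two groups. One way that, as far as I know, avoids them is to pass an explicit vector to the colors argument of plot_ly(). The data frame and hex values below are invented for illustration:

```r
# Toy stand-in for the whiskey data frame (values are invented)
demo <- data.frame(
  cask_strength = c("Y", "Y", "Y", "N", "N", "N"),
  review.point  = c(90, 88, 92, 85, 87, 86)
)

# One colour per group; any two valid colours would do
two_colours <- c("#1b9e77", "#d95f02")

# Supplying colours explicitly means no Brewer palette has to be looked up
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(demo, x = ~cask_strength, y = ~review.point,
                       color = ~cask_strength, colors = two_colours,
                       type = "box")
}
```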
x <- list(title = "Standard ABV (40%)")
y <- list(title = "Review Score")
# STD ABV
g4 <- plot_ly(whiskey, x = ~std_ABV, y = ~review.point, color = ~std_ABV, type = "box") %>%
  layout(xaxis = x, yaxis = y, title = "Does bottling at std ABV (40%) impact review score? (Y = Std ABV)")
g4
## Warning: Ignoring 29 observations
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Here we have our first real conclusive insights!!
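A visual gap between two boxes can be backed up with numbers. One possible approach is to compare group medians and run a Wilcoxon rank-sum test, which does not assume normally distributed scores. The scores below are invented to show the mechanics and say nothing about the real dataset:

```r
# Invented scores for two groups (not values from the dataset)
demo <- data.frame(
  cask_strength = rep(c("Y", "N"), each = 6),
  review.point  = c(90, 91, 88, 92, 89, 93,   # "Y" group
                    85, 87, 86, 88, 84, 86)   # "N" group
)

# Median score per group
group_medians <- tapply(demo$review.point, demo$cask_strength, median)
group_medians

# Wilcoxon rank-sum test; exact = FALSE because the scores contain ties
test <- wilcox.test(review.point ~ cask_strength, data = demo, exact = FALSE)
test$p.value
```

On the real data the same two calls would take whiskey in place of demo. Rows with a missing ABV have NA in the grouping column and drop out of both calculations.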
x <- list(title = "Price Band")
y <- list(title = "Review Score")
g5 <- plot_ly(whiskey, x = ~price_band, y = ~review.point, color = ~price_band, type = "box") %>%
  layout(xaxis = x, yaxis = y, title = "Whisky Review Score by Price Band")
g5
## Warning: Ignoring 6 observations
The first plot was a little hard to read, so I grouped prices into bands (chosen arbitrarily by me) to see whether there was any sign that the more expensive whiskies score higher in the reviews.
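The same question can also be put numerically. A minimal sketch, using invented rows rather than the real data: the median score per band summarises each box, and a Spearman rank correlation measures whether score tends to rise with price without being dominated by a few very expensive bottles.

```r
# Invented rows shaped like the columns used above (not real reviews)
demo <- data.frame(
  price_num    = c(25, 28, 40, 45, 80, 95, 300, 1200),
  review.point = c(84, 86, 85, 88, 87, 90, 91, 93),
  price_band   = c("A", "A", "B", "B", "C", "C", "D", "E")
)

# Median review score within each price band
band_medians <- tapply(demo$review.point, demo$price_band, median)
band_medians

# Rank-based correlation between price and score
rho <- cor(demo$price_num, demo$review.point, method = "spearman")
rho
```

On the real data, cor() would need use = "complete.obs" because a few prices are NA.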
We can see from the plot here:
While it's hard to draw conclusions from a dataset of reviews with relatively little differentiation on a 100-point scale, to summarise what we’ve found…
Finally, I can conclude that all reviews are biased towards people's tastes: I have tried Johnnie Walker Blue, the top-ranked whisky here, and was a bit meh about it. My personal favourite Scotch (after trying many bottles!), Kilkerran 12, is ranked a criminally low 88, and at a bargain $45 I’m inclined to urge you, if you are going to pick up a bottle, to forget all about the rankings above and pick up a bottle of this instead!
“There is no bad whisky. There are only some whiskies that aren’t as good as others” Raymond Chandler