Introduction

As part of the course “Developing Data Products” we’ve been asked to produce a small project creating a webpage using R Markdown which features plots created by plotly. I will host the web page on RPubs.

I have chosen to use a relatively small dataset available from Kaggle containing reviews of 2,200+ Scotch whiskies.

Thanks to user Koki Ando for providing the dataset, which was collected from the Whisky Advocate website.

I will use this dataset to demonstrate:

Note: You will need to download the data from Kaggle and unzip it into your working directory.

Required Packages

You will need the following packages installed:
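
Judging from the functions used further down (ggplot(), ggplotly(), plot_ly() and %>%), ggplot2 and plotly are the two packages needed; that list is inferred from the code rather than being exhaustive. A minimal sketch that checks for them before loading, where missing_packages is a hypothetical helper name:

```r
# Packages inferred from the function calls in this document (an assumption)
required <- c("ggplot2", "plotly")

# Hypothetical helper: return the names of packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "),
          ". Install them with install.packages().")
} else {
  library(ggplot2)
  library(plotly)
}
```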

Data Processing

First let’s import the data and carry out the following operations:

whiskey <- read.csv("./scotch_review.csv")
#REMOVE A LEADING "." OR A TRAILING "," FROM THE NAME
whiskey$name <- gsub("^\\.|\\,$","", whiskey$name)
#TAKE THE TEXT AFTER THE LAST COMMA IN THE NAME AND STRIP IT DOWN TO THE ABV NUMBER
whiskey$ABV <- as.numeric(sub("%","", gsub("ABV|’|'|$|\n|â€","", sub('.*\\,', '', whiskey$name))))
## Warning: NAs introduced by coercion
whiskey$price <- as.character(whiskey$price)
whiskey$price_num <- as.numeric(gsub(",|\\set", "",whiskey$price))
## Warning: NAs introduced by coercion
sum(is.na(whiskey$price_num)) #6NAs
## [1] 6
sum(is.na(whiskey$ABV))
## [1] 29
#29 NA VALUES - LEAVES US WITH A GOOD SIZED DATASET
# ASSUME ANY WHISKEY < 46% IS CHILL FILTERED
# ANYTHING ABOVE 50% is CASK STRENGTH

whiskey$chill_filter <- ifelse(whiskey$ABV >= 46, "N", "Y")
whiskey$cask_strength <- ifelse(whiskey$ABV > 50, "Y", "N")
whiskey$std_ABV <- ifelse(whiskey$ABV == 40, "Y", "N")
whiskey$price_band <- ifelse(whiskey$price_num < 30, "A. <$30",
                      ifelse(whiskey$price_num <= 50, "B. $30-$50",
                      ifelse(whiskey$price_num <= 100, "C. $50-$100",
                      ifelse(whiskey$price_num <= 250, "D. $100-$250",
                      ifelse(whiskey$price_num <= 1000, "E. $250-$1000",
                      ifelse(whiskey$price_num > 1000, "F. $1000+", NA))))))
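
The ABV extraction above works by taking whatever follows the last comma in the name and deleting known clutter. As a rough alternative sketch, the number directly in front of the % sign can be pulled out with a single regular expression. The names below are made up to mimic the dataset's format and extract_abv is a hypothetical helper, so the NA count on the real data may well differ from the 29 reported above.

```r
# Made-up names in the "..., NN%" shape used by the dataset (illustration only)
toy_names <- c(
  "Example Distillery 12 year old, 43%",
  "Another Example Cask Strength, 57.1% ABV",
  "No Strength Listed"
)

# Hypothetical helper: return the number that sits just before a "%" sign,
# or NA when the name has no such number
extract_abv <- function(name) {
  hit <- regexpr("[0-9]+(\\.[0-9]+)?(?=\\s*%)", name, perl = TRUE)
  found <- !is.na(hit) & hit > 0
  out <- rep(NA_real_, length(name))
  # regmatches() returns only the matched elements, in order
  out[found] <- as.numeric(regmatches(name, hit))
  out
}

toy_abv <- extract_abv(toy_names)
```

The lookahead `(?=\\s*%)` is what stops an age statement such as "12 year old" from being mistaken for the strength.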

PLOTS

Now let’s make some plots and try to draw some conclusions. Prices are in US dollars and may therefore be quite different from prices in other parts of the world.

1. USING GGPLOTLY: Simple plot of price vs Review Score

g <- ggplot(whiskey, aes(x = price_num, y = review.point, colour= category, label = name)) +
     geom_point(position = "jitter", size = 0.4) +
     ggtitle("Whisky Review Score by Price and Category") +
     xlab("Price ($)") +
     ylab("Review Score") +
     theme(legend.position = "bottom", legend.title = element_blank(), legend.background = element_blank())


ggplotly(g) %>% 
  layout(legend = list(
    orientation = "v",
    x = 0.65,
    y= 0.01
  )
)

In this plot you may want to zoom into the left portion of the graph, since a couple of extremely expensive whiskies skew the plot. One benefit of plotly is that it lets us zoom in to overcome issues like this, whereas in ggplot I might have had to play around with the xlim options to set the graph optimally.
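
For comparison, a static ggplot needs an explicit choice to get the same effect. A minimal sketch on made-up data (toy is not the whisky dataset, and the 500 cut-off is arbitrary): coord_cartesian() zooms the view without dropping points, while scale_x_log10() spreads out a skewed price axis.

```r
library(ggplot2)

# Made-up prices and scores with one extreme outlier, for illustration only
toy <- data.frame(
  price_num    = c(25, 40, 60, 85, 120, 300, 150000),
  review.point = c(82, 85, 86, 88, 90, 92, 96)
)

base_plot <- ggplot(toy, aes(x = price_num, y = review.point)) +
  geom_point()

# Option 1: zoom the view to the cheaper bottles; the outlier stays in the data
zoomed <- base_plot + coord_cartesian(xlim = c(0, 500))

# Option 2: keep every point visible by putting price on a log scale
logged <- base_plot + scale_x_log10()
```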

It’s hard to draw firm conclusions from this graph, but a few things stand out:

  • The majority of reviews are for single malt scotch whiskies
  • There is one bottle priced over $150,000!!
  • There are 3 whiskies rated 97: Johnnie Walker Blue Label (a blend), Black Bowmore 1964 (single malt) and Bowmore 46. Clearly the reviewers are fans of Bowmore.
  • There are no scores below 63, and scores are tightly clustered around 85-95
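
These observations can also be read straight from the data frame rather than by hovering over points. A small sketch using a made-up stand-in for whiskey with the same column names (the rows are invented):

```r
# Invented rows that mimic the columns used in the plot
toy <- data.frame(
  name         = c("Bottle A", "Bottle B", "Bottle C", "Bottle D"),
  category     = c("Single Malt Scotch", "Blended Scotch Whisky",
                   "Single Malt Scotch", "Single Malt Scotch"),
  review.point = c(97, 97, 88, 63),
  stringsAsFactors = FALSE
)

# Number of reviews per category, largest first
category_counts <- sort(table(toy$category), decreasing = TRUE)

# Lowest and highest review scores
score_range <- range(toy$review.point, na.rm = TRUE)

# Names of the bottles sharing the top score
top_bottles <- toy$name[toy$review.point == max(toy$review.point, na.rm = TRUE)]
```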

2. PLOT_LY: Does Chill Filtering have an impact on score??

x <- list(
  title = "Chill Filtered"
)
y <- list(
  title = "Review Score"
)
g2 <- plot_ly(whiskey, x = ~chill_filter, y = ~review.point, color = ~chill_filter, type = "box") %>%
  layout(xaxis = x, yaxis = y, title = "Does Using Chill Filtering impact review score? (Y = Chill Filtered)")
g2
## Warning: Ignoring 29 observations
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

Here we can see:

  • Median scores for non chill filtered and chill filtered whiskies are identical
  • Chill filtered whiskies have a narrower interquartile range, so their scores are more tightly clustered around the median.
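
The numbers behind a box plot can be checked directly. A minimal sketch, on invented scores rather than the real data, that computes the median and interquartile range per group with base R. Quartile conventions may differ slightly between IQR() and the plotted boxes, so treat small discrepancies as expected.

```r
# Invented scores: same median in both groups, wider spread for "N"
toy <- data.frame(
  chill_filter = c("Y", "Y", "Y", "Y", "N", "N", "N", "N"),
  review.point = c(84, 86, 87, 89, 80, 85, 88, 93)
)

# Median and interquartile range of review score for each group
group_median <- tapply(toy$review.point, toy$chill_filter, median)
group_iqr    <- tapply(toy$review.point, toy$chill_filter, IQR)
```

The same two lines work for cask_strength or std_ABV by swapping the grouping column.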

3. PLOT_LY: Does bottling at cask strength have an impact on score??

x <- list(
  title = "Cask Strength"
)
y <- list(
  title = "Review Score"
)

#CASK STRENGTH
g3 <- plot_ly(whiskey, x = ~cask_strength, y = ~review.point, color = ~cask_strength, type = "box") %>%
  layout(xaxis = x, yaxis = y, title = "Does bottling at Cask Strength impact review score? (Y = Cask Strength)")
g3
## Warning: Ignoring 29 observations
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

Here we can see:

  • Median scores are identical for whiskies bottled at cask strength and those that aren’t
  • The upper quartile is higher for cask strength whiskies, but only by a point

4. PLOT_LY: Does bottling at standard ABV have an impact on score??

x <- list(
  title = "Standard ABV (40%)"
)
y <- list(
  title = "Review Score"
)

#std_ABV
g4 <- plot_ly(whiskey, x = ~std_ABV, y = ~review.point, color = ~std_ABV, type = "box") %>%
  layout(xaxis = x, yaxis = y, title = "Does bottling at std ABV (40%) impact review score? (Y = Std ABV)")
g4
## Warning: Ignoring 29 observations
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

Here we have our first real conclusive insights!!

  • The median score of whiskies bottled at a non-standard ABV is higher than that of whiskies bottled at the standard (and minimum required) 40% ABV
  • Both the upper and lower quartiles are higher for non-standard ABV whiskies

5. PLOT_LY: Does price really matter??

x <- list(
  title = "Price Band"
)
y <- list(
  title = "Review Score"
)

g5 <- plot_ly(whiskey, x = ~price_band, y = ~review.point, color = ~price_band, type = "box") %>%
  layout(xaxis = x, yaxis = y, title = "Whisky Review Score by Price Band")
g5
## Warning: Ignoring 6 observations

The first plot was a little hard to read, so I grouped prices into bands, set arbitrarily by me, to see whether there was any sign that more expensive whiskies score higher in the reviews.

We can see from the plot here:

  • There is a very clear upward trend: the median increases with every jump in price band
  • The lower fence, however, remains fairly static until we reach whiskies over $250, so paying more doesn’t guarantee you a good bottle!
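
The banding and the per-band medians behind this plot can be sketched in a few lines. This version uses cut() instead of the nested ifelse() calls from the data processing step; the prices and scores are invented, and the band labels are illustrative rather than the exact ones used above.

```r
# Invented prices and scores, including one missing price
toy <- data.frame(
  price_num    = c(20, 28, 45, 50, 80, 100, 200, 600, 1500, NA),
  review.point = c(80, 82, 84, 85, 86, 88, 89, 91, 94, 90)
)

# right = TRUE makes each band include its upper limit (30 < x <= 50, ...).
# Note: a price of exactly 30 lands in the first band here, whereas the
# ifelse() version puts it in the second band.
toy$price_band <- cut(
  toy$price_num,
  breaks = c(-Inf, 30, 50, 100, 250, 1000, Inf),
  labels = c("<=$30", "$30-$50", "$50-$100", "$100-$250", "$250-$1000", "$1000+"),
  right  = TRUE
)

# Median score per band; rows with a missing band are left out
band_medians <- tapply(toy$review.point, toy$price_band, median)

# TRUE when the median rises with every step up in price band
medians_increase <- all(diff(as.numeric(band_medians)) > 0)
```

Because cut() returns a factor whose levels are already in price order, the letter prefixes used to force alphabetical ordering are not needed in this variant.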

SUMMARY

It’s hard to draw firm conclusions from a dataset of reviews that shows relatively little differentiation for a 100-point scale, but to summarise what we’ve found:

  • More expensive whiskies score higher (or are the reviewers getting blinded by the price tag and better packaging?)
  • Whiskies bottled above 40% ABV have higher scores.
  • There is no clear evidence that chill filtering harms whisky. Given the doubt, it would therefore be better not to chill filter (I am biased)!

Finally, I can conclude that all reviews are biased towards people’s tastes. I have tried Johnnie Walker Blue, one of the top-ranked whiskies here, and was a bit meh about it. Meanwhile my personal favourite scotch (after trying many bottles!), Kilkerran 12, is ranked a criminally low 88 and costs a bargain $45. So if you are going to pick up a bottle, I’m inclined to urge you to forget all about the rankings above and pick up a bottle of this instead!

“There is no bad whisky. There are only some whiskies that aren’t as good as others” Raymond Chandler