Developing Data Products: Plotly - A quick analysis of Whisky

Introduction

As part of the course “Developing Data Products” we’ve been asked to produce a small project creating a webpage using R Markdown which features plots created by plotly. I will host the web page on RPubs.

I have chosen to use a relatively small dataset available from Kaggle containing reviews of 2,200+ Scotch whiskies.

Thanks to user Koki Ando for providing the dataset, which was collected from the Whisky Advocate website.

I will use this dataset to demonstrate:

Plotting directly from the plotly package
Using ggplot to create a plot and the ggplotly() function to convert to a plotly graph

Note: You will need to download the data from Kaggle and unzip it into your working directory.

Required Packages

You will need the following packages installed:

plotly
ggplot2
dplyr

Data Processing

First, let’s import the data and carry out the following operations:

Read in the .csv file scotch_review.csv
Clean up the name field, which on inspection has some unusual characters, using gsub()
Convert the price field from character to numeric. A small number of values will be left as NA; given more time we could have imputed these.
Create a price band field to show the impact of price more clearly.
Create a field called ABV (alcohol by volume) to see whether whiskies bottled at higher ABVs tend to get higher scores than those bottled at the standard 40%.
Create a Chill Filtered flag to compare scores by whether or not the whisky was likely chill filtered. I’ve assumed anything bottled below 46% ABV is chill filtered, since non-chill-filtered whiskies are typically bottled at 46% or higher. Chill filtering removes residue and cloudiness from whisky and is often disliked by whisky purists.
Create a Cask Strength flag to investigate scores by whether or not the whisky was likely bottled at cask strength. I’ve assumed cask strength is anything over 50% ABV: the whisky has been taken directly from the cask and not watered down.
Finally, create a Standard ABV flag which identifies whiskies bottled at the standard 40%, the bare minimum legal requirement to qualify as whisky.
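
Before running these steps on the full file, it can help to see the two conversions on a handful of values. The sketch below is illustrative only: the names and prices are made up, it assumes names end in ", NN%", and it leaves out the cleanup of unusual characters.

```r
# Hypothetical example names, loosely following a "name, ABV%" pattern.
# These are not rows from scotch_review.csv.
toy_names <- c("Example Malt 12 year old, 46%",
               "Example Blend, 40%",
               "Example Cask Strength, 57.1%",
               "Example With No Strength Listed")

# Keep only the text after the last comma, then drop the percent sign.
abv_text <- sub("%", "", sub(".*,", "", toy_names))

# Anything that is not a number becomes NA (the coercion warning is expected).
toy_abv <- suppressWarnings(as.numeric(abv_text))

# Hypothetical price strings: thousands separators must go before conversion.
toy_price     <- c("45", "1,250", "price on request")
toy_price_num <- suppressWarnings(as.numeric(gsub(",", "", toy_price)))

print(toy_abv)
print(toy_price_num)
```

The name without a trailing strength and the non-numeric price both come out as NA, which is where the "NAs introduced by coercion" warnings come from.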

library(plotly)
library(ggplot2)
library(dplyr)

whiskey <- read.csv("./scotch_review.csv")
#REMOVE A LEADING "." OR A TRAILING "," FROM THE NAME
whiskey$name <- gsub("^\\.|\\,$","", whiskey$name)
#TAKE THE TEXT AFTER THE LAST COMMA IN THE NAME TO GET ABV
whiskey$ABV <- as.numeric(sub("%","", gsub("ABV|Ã¢â¬â¢|'|$|Ã¢â¬Â¨|Ã¢â¬","", sub('.*\\,', '', whiskey$name))))

## Warning: NAs introduced by coercion

whiskey$price <- as.character(whiskey$price)
whiskey$price_num <- as.numeric(gsub(",|\\set", "",whiskey$price))

## Warning: NAs introduced by coercion

sum(is.na(whiskey$price_num)) #6NAs

## [1] 6

sum(is.na(whiskey$ABV))

## [1] 29

#29 NA VALUES - STILL LEAVES US WITH A GOOD SIZED DATASET
# ASSUME ANY WHISKEY < 46% IS CHILL FILTERED
# ANYTHING ABOVE 50% is CASK STRENGTH

whiskey$chill_filter <- ifelse(whiskey$ABV >= 46, "N", "Y")
whiskey$cask_strength <- ifelse(whiskey$ABV > 50, "Y", "N")
whiskey$std_ABV <- ifelse(whiskey$ABV == 40, "Y", "N")
whiskey$price_band <- ifelse(whiskey$price_num < 30, "A. <$30", ifelse(whiskey$price_num <= 50, "B. $30-$50",
                          ifelse(whiskey$price_num <= 100, "D. $50-$100",ifelse(whiskey$price_num <= 250, "E. $100-$250",
                            ifelse(whiskey$price_num <= 1000, "F. $250-$1000",ifelse(whiskey$price_num > 1000, "G. $1000+", NA))))))
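
The nested ifelse() calls can be hard to read, so here is a rough sketch of the same banding using base R's cut(). It runs on made-up prices and keeps the band labels used above. One assumption to flag: with right = TRUE a price of exactly $30 lands in band A rather than band B, so it is not a drop-in replacement at that one boundary.

```r
# Hypothetical prices, not taken from the dataset.
toy_price <- c(25, 30, 50, 75, 250, 999, 1500, NA)

# Band edges; right = TRUE means each band includes its upper edge,
# e.g. (30, 50] and (50, 100].
breaks <- c(-Inf, 30, 50, 100, 250, 1000, Inf)
labels <- c("A. <$30", "B. $30-$50", "D. $50-$100",
            "E. $100-$250", "F. $250-$1000", "G. $1000+")

# cut() returns a factor whose levels are already in band order;
# missing prices stay NA.
toy_band <- cut(toy_price, breaks = breaks, labels = labels, right = TRUE)

print(data.frame(price = toy_price, band = toy_band))
```

Because the result is an ordered set of factor levels, the bands would plot in price order even without the letter prefixes.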

PLOTS

Now let’s make some plots and try to draw some conclusions. Prices are in US dollars and may therefore be quite different from prices in other parts of the world.

1. USING GGPLOTLY: Simple plot of price vs Review Score

g <- ggplot(whiskey, aes(x = price_num, y = review.point, colour= category, label = name)) +
     geom_point(position = "jitter", size = 0.4) +
     ggtitle("Whisky Review Score by Price and Category") +
     xlab("Price ($)") +
     ylab("Review Score") +
     theme(legend.position = "bottom", legend.title = element_blank(), legend.background = element_blank())


ggplotly(g) %>% 
  layout(legend = list(
    orientation = "v",
    x = 0.65,
    y= 0.01
  )
)

In this plot you may want to zoom into the left portion of the graph, since a couple of extremely expensive whiskies skew the plot. One benefit of plotly is that it lets us zoom in to overcome issues like this, whereas in ggplot I might have had to play around with the xlim options to frame the graph well.
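
If you would rather the plot opened already zoomed in, one option is to set the initial x-axis range from a percentile of the prices. This is only a sketch on simulated data; the 99th percentile cut-off and the toy_price and toy_score vectors are my own assumptions, not part of the analysis above.

```r
# Simulated prices with one extreme outlier, standing in for price_num.
set.seed(1)
toy_price <- c(round(runif(200, 20, 400)), 157000)
toy_score <- c(round(runif(200, 70, 95)), 90)

# Use the 99th percentile as the right-hand edge of the initial view.
upper <- unname(quantile(toy_price, 0.99, na.rm = TRUE))

# Only build the plot if plotly is installed. The range is just the
# starting view, so the reader can still zoom out to see the outlier.
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(x = toy_price, y = toy_score,
                       type = "scatter", mode = "markers")
  p <- plotly::layout(p, xaxis = list(range = c(0, upper)))
}

print(upper)
```

No points are dropped this way; the outlier is simply off-screen until the viewer resets the axes.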

It’s hard to draw conclusions from this graph, but a few things stand out:

The majority of reviews are for single malt Scotch whiskies
There is one bottle over $150,000!
There are 3 whiskies rated 97: Johnnie Walker Blue Label (a blend), Black Bowmore 1964 (single malt) and Bowmore 46. Clearly the reviewers are fans of Bowmore.
There are no scores below 63, and scores are concentrated around 85-95
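
Observations like these can also be checked numerically without reading them off the plot. The sketch below uses a small made-up data frame with the same column names as the dataset (name, category, review.point); the rows and category labels are hypothetical.

```r
# Hypothetical rows; only the column names mirror the real dataset.
toy <- data.frame(
  name = c("Whisky A", "Whisky B", "Whisky C", "Whisky D", "Whisky E"),
  category = c("Single Malt Scotch", "Blended Scotch Whisky",
               "Single Malt Scotch", "Single Malt Scotch",
               "Blended Malt Scotch Whisky"),
  review.point = c(97, 97, 88, 63, 90),
  stringsAsFactors = FALSE
)

# How many reviews fall in each category, largest first?
category_counts <- sort(table(toy$category), decreasing = TRUE)

# The lowest and highest scores awarded.
score_range <- range(toy$review.point)

# Every whisky that shares the top score.
top_rated <- toy$name[toy$review.point == max(toy$review.point)]

print(category_counts)
print(score_range)
print(top_rated)
```

Swapping toy for the whiskey data frame should give the corresponding figures for the real reviews.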

2. PLOT_LY: Does Chill Filtering have an impact on score??

x <- list(
  title = "Chill Filtered"
)
y <- list(
  title = "Review Score"
)
g2 <- plot_ly(whiskey, x = ~chill_filter, y = ~review.point, color = ~chill_filter, type = "box") %>%
  layout(xaxis = x, yaxis = y, title = "Does Using Chill Filtering impact review score? (Y = Chill Filtered)")
g2

## Warning: Ignoring 29 observations

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

Here we can see:

Median scores for non-chill-filtered and chill-filtered whiskies are identical
Chill-filtered whiskies have a smaller interquartile range, so their scores are more concentrated around the median.
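
The median and interquartile range behind a box plot can be computed directly per group. This is a minimal sketch on invented scores, chosen so that the two groups share a median but differ in spread. Note that R's default quantile method may give slightly different quartiles from the ones plotly draws.

```r
# Hypothetical scores; the NA flag mimics a whisky with no ABV recorded.
toy <- data.frame(
  chill_filter = c("Y", "Y", "Y", "Y", "N", "N", "N", "N", NA),
  review.point = c(85, 86, 87, 88, 82, 85, 88, 91, 90)
)

# tapply() groups the scores by flag; rows with an NA flag are left out.
medians <- tapply(toy$review.point, toy$chill_filter, median)
iqrs    <- tapply(toy$review.point, toy$chill_filter, IQR)

print(medians)
print(iqrs)
```

The same two lines work for cask_strength and std_ABV by changing the grouping column.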

3. PLOT_LY: Does bottling at cask strength have an impact on score??

x <- list(
  title = "Cask Strength"
)
y <- list(
  title = "Review Score"
)

#CASK STRENGTH
g3 <- plot_ly(whiskey, x = ~cask_strength, y = ~review.point, color = ~cask_strength, type = "box") %>%
  layout(xaxis = x, yaxis = y, title = "Does bottling at Cask Strength impact review score? (Y = Cask Strength)")
g3

## Warning: Ignoring 29 observations

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

Median scores are identical for whiskies bottled at cask strength and those that aren’t
The upper end of the interquartile range is higher for cask strength whiskies, but only by a point

4. PLOT_LY: Does bottling at standard ABV have an impact on score??

x <- list(
  title = "Standard ABV (40%)"
)
y <- list(
  title = "Review Score"
)

#std_ABV
g4 <- plot_ly(whiskey, x = ~std_ABV, y = ~review.point, color = ~std_ABV, type = "box") %>%
  layout(xaxis = x, yaxis = y, title = "Does bottling at std ABV (40%) impact review score? (Y = Std ABV)")
g4

## Warning: Ignoring 29 observations

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

Here we have our first real conclusive insights!!

The median score of whiskies bottled at a non-standard ABV is higher than that of whiskies bottled at the standard (and minimum required) 40% ABV
Both the upper and lower quartiles are higher for non-standard ABV whiskies

5. PLOT_LY: Does price really matter??

x <- list(
  title = "Price Band"
)
y <- list(
  title = "Review Score"
)

g5 <- plot_ly(whiskey, x = ~price_band, y = ~review.point, color = ~price_band, type = "box") %>%
  layout(xaxis = x, yaxis = y, title = "Whisky Review Score by Price Band")
g5

## Warning: Ignoring 6 observations

The first plot was a little hard to read, so using price bands, which I set arbitrarily, I wanted to see if there was any sign that the more expensive whiskies score higher in the reviews.

We can see from the plot here:

There is a very clear upwards trend: the median increases with every jump in price band
The lower fence, however, remains fairly static up until we reach whiskies over $250, so paying more doesn’t guarantee you a good bottle!
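
For readers unfamiliar with the term, the lower fence is the end of the lower whisker: the lowest score that is not flagged as an outlier. Here is a rough sketch of computing it per price band with base R's boxplot.stats(), on invented scores. boxplot.stats() uses hinges and a 1.5 x IQR rule, so its values may not match plotly's fences exactly.

```r
# Hypothetical scores for two of the price bands.
toy <- data.frame(
  price_band = rep(c("A. <$30", "E. $100-$250"), each = 6),
  review.point = c(75, 80, 82, 84, 86, 88,
                   75, 86, 90, 92, 94, 96)
)

# The first element of $stats is the end of the lower whisker.
lower_fence <- function(scores) boxplot.stats(scores)$stats[1]

fences  <- tapply(toy$review.point, toy$price_band, lower_fence)
medians <- tapply(toy$review.point, toy$price_band, median)

print(fences)
print(medians)
```

In this toy example the median rises with price while the lower fence stays put, which is the pattern described above.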

SUMMARY

While it’s hard to draw conclusions from a dataset of reviews with relatively little differentiation on a 100-point scale, to summarise what we’ve found…

More expensive whiskies score higher (or are the reviewers being blinded by the price tag and better packaging?)
Whiskies bottled above 40% have higher scores.
There is no clear evidence that chill filtering makes whisky worse. Given that doubt, I would still rather it were left non-chill-filtered (I am biased)!

Finally, I can conclude that all reviews are biased towards people’s tastes. I have tried Johnnie Walker Blue, one of the top-ranked whiskies here, and was a bit meh about it. Meanwhile my personal favourite Scotch (after trying many bottles!), Kilkerran 12, is ranked a criminally low 88 for a bargain $45. So if you are going to pick up a bottle, I’m inclined to urge you to forget all about the rankings above and pick up a bottle of this instead!

“There is no bad whisky. There are only some whiskies that aren’t as good as others” Raymond Chandler