PART 1

Introduction

Have you ever wondered whether paying a few extra dollars for a nice bottle of scotch is worth it? That is what we are here to find out. I will analyze the overall scotch market to determine whether you should spend those extra hard-earned dollars. I will also conduct a sentiment analysis using data scraped from tweets mentioning certain companies and compare it to the professional review scores and descriptions in a separate data set. Through all of this I hope to provide a better understanding of why and how scotches earn their scores and whether the general public agrees. This is especially interesting to me because I have been drinking scotch for some years now but have yet to truly refine my taste, and these are questions I have always been curious about.

The data for the initial analysis in Part 2 was collected from Kaggle, a website that hosts a plethora of data sets that are free to download. The data consists of 2,247 observations, each of which is a unique bottle of scotch. The variables are listed in the table below, and the data can be accessed via this link:

https://myxavier-my.sharepoint.com/:x:/g/personal/kaderlia_xavier_edu/EcHe_AB0-ChGo3M8vGqIXVABpOKrtw2UTo5zuhAhALt9Cw?download=1

Variables

Variables in Dataset

| Variable | Type | Explanation |
|---|---|---|
| ID | Numeric | Unique ID for the bottle |
| name | Character | Name of the bottle |
| alc_percent | Numeric | Percentage of alcohol by volume |
| age | Numeric | How many years the scotch was aged |
| category | Character | Type of scotch |
| review_point | Numeric | Professional review score |
| price | Numeric | Market price of the bottle |
| currency | Character | Currency that the price is in |
| description | Character | Professional description of bottle |

Scotch Bottle Data Table

The table below displays all of the bottles in the data set with their respective variables that are listed and explained in the table above.

Note : Not all bottles have reported ages

Initial Summary Statistics

| category | average_rating | average_price | median_price | average_age | average_alc_percent |
|---|---|---|---|---|---|
| Blended Malt Scotch Whisky | 87.65909 | 130.0379 | 70 | 18.90323 | 0.4674242 |
| Blended Scotch Whisky | 87.23697 | 1003.7630 | 68 | 20.36000 | 0.4292417 |
| Grain Scotch Whisky | 86.50000 | 272.4286 | 125 | 32.22222 | 0.4992593 |
| Single Grain Whisky | 85.50877 | 219.5789 | 124 | 29.22917 | 0.5077193 |
| Single Malt Scotch | 86.60858 | 657.4942 | 115 | 20.44828 | 0.4847894 |

Interpretation :

Overall, most of these numbers are to be expected. Please note that there is a $150,000 bottle of Blended Scotch that pulls up that category's average price, which is why its average price (about 1,004) sits so far above its median price (68). One interesting thing to note is that Grain Scotch has one of the lower average ratings while also having the highest median price, the highest average age, and the second-highest average alcohol percentage. I will explore these relationships further throughout the remainder of this article.
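To make the mean-versus-median point concrete, here is a minimal sketch in base R. The data frame `whisky_toy` and its values are invented for illustration and are not taken from the Kaggle data; only the column names follow the variable table. It shows how one very expensive bottle can drag a category's average price far above its median.

```r
# Toy data, invented for illustration (not the real Kaggle values)
whisky_toy <- data.frame(
  category = c("Blended Scotch Whisky", "Blended Scotch Whisky",
               "Blended Scotch Whisky", "Single Malt Scotch",
               "Single Malt Scotch"),
  price = c(50, 70, 150000, 100, 130)
)

# Average and median price per category; na.rm = TRUE guards against
# missing values, which the real data is known to contain for age
avg_price <- tapply(whisky_toy$price, whisky_toy$category, mean, na.rm = TRUE)
med_price <- tapply(whisky_toy$price, whisky_toy$category, median, na.rm = TRUE)

summary_toy <- data.frame(
  category = names(avg_price),
  average_price = as.numeric(avg_price),
  median_price = as.numeric(med_price)
)
print(summary_toy)
```

In this toy example the Blended Scotch average is in the tens of thousands while its median stays at 70, which is the same pattern seen in the summary table.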

PART 2

Note : All bottles priced above $1,000 have been removed from the visuals below so that they do not distort the scaling and further analysis

Relationship between price and review score

Interpretation :

The visual above plots the relationship between price, on the x-axis, and review points, on the y-axis. At first glance there does not appear to be much of a noticeable relationship. However, the variability of review points decreases as the price of the bottle increases. There are still some very highly rated scotches at relatively low price levels, but they are scattered about.
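One way to check the "variability shrinks as price rises" observation numerically is to compare the standard deviation of review points across price bands. The sketch below is only an illustration: the `bottles` data frame is made up, and the band boundaries (100, 500, 1000) are arbitrary assumptions rather than anything used in the original analysis.

```r
# Toy data, invented for illustration
bottles <- data.frame(
  price = c(40, 60, 80, 90, 300, 450, 700, 900, 5000),
  review_point = c(78, 92, 84, 95, 88, 90, 91, 92, 97)
)

# Drop bottles priced above $1,000, mirroring the note at the top of Part 2
under_1000 <- bottles[bottles$price <= 1000, ]

# Assign each remaining bottle to a price band (boundaries are an assumption)
under_1000$band <- cut(under_1000$price,
                       breaks = c(0, 100, 500, 1000),
                       labels = c("0-100", "100-500", "500-1000"))

# Standard deviation of review points within each band
spread <- tapply(under_1000$review_point, under_1000$band, sd)
print(spread)
```

If the spread falls from the cheapest band to the most expensive one, as it does in this toy data, that matches the narrowing seen in the scatter plot.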

Relationship between price and alcohol percentage

Interpretation :

As expected, there does not appear to be a relationship between price and alcohol percentage. However, from the perspective of a college student, it is interesting to see that there does appear to be a higher average alcohol percentage among the cheaper scotches. For someone who is perhaps not drinking for the flavor and also does not have much disposable income, like a college student, this might be the route that they choose.

Relationship between price and years aged

Interpretation :

Here we do see an upward-sloping relationship between price and years aged. Logically speaking, there are multiple inputs when creating a product, one of them being time, and time is money. When something takes longer to produce, odds are that it takes more resources, such as storage space and employee time. This could be why many of the cheaper scotches do not seem to be aged for very long and why the more expensive bottles, on average, have spent more years in the aging process. Like they say, it ages like a fine wine, or I guess in this case like a fine scotch.
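The upward slope can also be summarized with a correlation and a simple one-variable fit. This is a rough sketch on invented numbers (the `aged` data frame is hypothetical), and it includes a missing age because, as noted earlier, not all bottles report one.

```r
# Toy data, invented for illustration; one bottle has no reported age
aged <- data.frame(
  age = c(10, 12, 15, 18, 21, 25, NA, 30),
  price = c(45, 55, 80, 110, 160, 250, 60, 400)
)

# Correlation between age and price, using only rows where both are present
r <- cor(aged$age, aged$price, use = "complete.obs")

# Simple linear fit of price on age; lm() drops rows with missing values by default
fit <- lm(price ~ age, data = aged)
slope <- coef(fit)[["age"]]

print(r)
print(slope)
```

A positive correlation and a positive slope are what "upward sloping" means here; the slope is read as the change in price associated with one additional year of aging.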

Relationship between age and alcohol percent

Interpretation :

For someone who does not know much about the alcohol aging process, the visual above does a decent job of explaining this relationship. I had always thought that the aging process had a lot to do with the alcohol percentage of the bottle. This, however, does not appear to be the case. The alcohol itself is created during the initial fermentation process, when sugar is converted into alcohol, and there is no obvious relationship between the age of the bottle and its alcohol percentage. What we do still see, consistent with the earlier visuals, is that many of the cheaper bottles have been aged for less time and have a higher alcohol percentage.

Best bang for your buck (price per point)

Interpretation :

There are many different types of scotch, and the odds are that the casual drinker cannot tell the difference when it comes to flavor. The visual above shows how much a review point costs on average for each category. The lowest is Blended Malt Scotch and the highest is Single Grain. Looking back at the summary statistics, Blended Malt Scotch has the highest average rating while Single Grain has the lowest. Taking both of these statements into consideration, the best bang for your buck most likely lies with Blended Malt Scotch, while the worst falls with Single Grain.
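A price-per-point figure can be computed in more than one way, and the report does not say which was used. The sketch below assumes the per-bottle ratio `price / review_point` averaged within each category; the `toy` data frame and its values are invented for illustration.

```r
# Toy data, invented for illustration
toy <- data.frame(
  category = c("Blended Malt Scotch Whisky", "Blended Malt Scotch Whisky",
               "Single Grain Whisky", "Single Grain Whisky"),
  price = c(60, 90, 120, 200),
  review_point = c(88, 90, 84, 86)
)

# Cost of one review point for each bottle
toy$price_per_point <- toy$price / toy$review_point

# Average cost per point within each category, cheapest first
ppp <- sapply(split(toy$price_per_point, toy$category), mean)
ppp_sorted <- sort(ppp)
print(ppp_sorted)
```

The first entry of `ppp_sorted` is the category where a review point is cheapest on average. An alternative definition, average price divided by average rating, can rank categories differently when a few bottles are very expensive.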

PART 3

For this section I will be bringing in data that has been scraped from Twitter. The tweets included in this analysis all mention the scotch brand Johnnie Walker. They will then be compared to the professional descriptions/reviews of 221 Johnnie Walker bottles. The following visuals take into account words that have been labeled as positive or negative, but please note that this labeling was done automatically and does not fully account for the manner in which the words were used.

Johnnie Walker Twitter comparison to primary data (sentiment analysis)

Interpretation :

The first of the two visuals uses the scraped Twitter data, while the second uses the professional descriptions of the bottles. Overall, the tweets seem to be extremely positive, mentioning words such as stunning, loving, perfect, and awards. On the other hand, there is one word, bash, that is labeled as negative. However, this word could have been used to describe some sort of party, like a “birthday bash” at which they loved the Johnnie Walker bottle that they had purchased. Moving on to the professional descriptions, once again a majority of the words are positive, and there are even some negative words that could be interpreted as positive depending on the type of scotch that you enjoy. Personally, I love a complex scotch with all sorts of different flavors, especially smoky ones. There was one word that did confuse me: dusty. After doing some research, I found that it can be used to describe an older bottle, something that has been aged for a long time or has been sitting around collecting dust. So this too could be considered a positive depending on your preferences.
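The automatic labeling described above can be sketched as a simple word lookup. The report does not say which lexicon or package was used, so the example below uses a tiny hand-made lexicon and two invented tweets in base R. It also shows why "bash" gets counted as negative regardless of context.

```r
# Invented example tweets (not from the scraped data)
tweets <- c("Loving this stunning bottle",
            "Perfect gift for a birthday bash")

# Tiny hand-made sentiment lexicon, hypothetical and for illustration only
lexicon <- data.frame(
  word = c("loving", "stunning", "perfect", "bash"),
  sentiment = c("positive", "positive", "positive", "negative"),
  stringsAsFactors = FALSE
)

# Lowercase the text and split it into words on anything that is not a letter
words <- unlist(strsplit(tolower(tweets), "[^a-z]+"))
words <- words[words != ""]

# Keep only the words found in the lexicon, along with their labels
matched <- lexicon[match(words, lexicon$word, nomatch = 0), ]

# Count how many matched words carry each label
print(table(matched$sentiment))
```

Because the lookup sees each word in isolation, "birthday bash" contributes one negative word even though the tweet as a whole is positive, which is the limitation noted in the interpretation.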

PART 4

Predictive analysis : Attempt to predict high points with regression

One of the best tools for assessing the effect of each aspect of a scotch is multiple regression. We want to build the best regression model possible for describing the effect that the variables have on the score a bottle receives.

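The generalized regression equation we begin with is: \[\text{ReviewScore}_i = \alpha + \beta_1 \text{AlcoholPercent}_i + \beta_2 \text{age}_i + \beta_3 \text{price}_i + \sum_{k} \gamma_k \text{category}_{k,i} + \varepsilon_i\] where the category terms are indicator variables for each type of scotch other than the baseline category.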

## 
## Call:
## lm(formula = review_point ~ alc_percent + age + price + category, 
##     data = whisky)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.2337  -2.4091   0.2969   2.6358   8.4701 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    8.161e+01  1.171e+00  69.713  < 2e-16 ***
## alc_percent                    3.789e+00  1.956e+00   1.938  0.05290 .  
## age                            1.597e-01  1.189e-02  13.438  < 2e-16 ***
## price                          2.350e-05  2.649e-05   0.887  0.37512    
## categoryBlended Scotch Whisky  1.459e+00  7.935e-01   1.838  0.06629 .  
## categoryGrain Scotch Whisky   -3.013e+00  1.156e+00  -2.606  0.00928 ** 
## categorySingle Grain Whisky   -2.405e+00  9.006e-01  -2.671  0.00767 ** 
## categorySingle Malt Scotch    -1.585e-01  7.022e-01  -0.226  0.82144    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.842 on 1198 degrees of freedom
##   (1041 observations deleted due to missingness)
## Multiple R-squared:  0.1572, Adjusted R-squared:  0.1522 
## F-statistic: 31.91 on 7 and 1198 DF,  p-value: < 2.2e-16

Interpretation :

The table above shows the results of the regression equation mentioned prior. The category coefficients are measured relative to Blended Malt Scotch Whisky, the category left out of the table as the baseline. Looking at our variables, it is very interesting to see that price was not statistically significant in determining rating. With the price of the bottle being determined prior to review, I would have expected the price of lower-reviewed bottles to drop, which it may have, and the price of higher-reviewed bottles to adjust accordingly. This is something that would be neat to look into at another time. Moving on, one variable was significant at the .001 level: age. It has a small positive effect on the point scale, about 0.16 points per additional year. Two variables were significant at the .01 level: the Grain Scotch and Single Grain category variables, which both had a relatively large negative effect on the point scale (roughly -3.0 and -2.4 points). The final two variables, significant only at the 0.1 level, were alcohol percentage and the Blended Scotch category. Both of these had a relatively large positive coefficient. These are all things that we expected to see after the initial analysis in Part 2. Finally, I would like to draw your attention to the adjusted R-squared, which is only 0.1522, meaning that the variables in our equation account for only around 15% of the variation in the point scale. There are many other things that could truly impact how a scotch is rated.
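To see what these coefficients mean in points, it can help to plug in a bottle by hand. The sketch below copies the estimates from the regression output and applies them to a hypothetical bottle (Blended Malt, 12 years old, 43% alcohol, priced at 60). It assumes `alc_percent` is stored as a fraction such as 0.43, which is what the averages in the summary statistics suggest but is not stated outright.

```r
# Estimates copied from the regression output above
b <- c(intercept = 81.61, alc_percent = 3.789, age = 0.1597, price = 2.350e-05)

# Hypothetical bottle in the baseline category (Blended Malt), so no category term
abv <- 0.43       # assumed to be a fraction, not 43
years <- 12
bottle_price <- 60

pred <- b[["intercept"]] +
  b[["alc_percent"]] * abv +
  b[["age"]] * years +
  b[["price"]] * bottle_price
print(round(pred, 2))

# Points associated with ten additional years of aging, everything else equal
ten_years <- b[["age"]] * 10
print(ten_years)
```

Under these assumptions, ten extra years of aging moves the predicted score by about 1.6 points, while the price term contributes almost nothing at this price level, which lines up with price not being statistically significant.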

Closing Remarks :

Overall, through our analysis, we can see that there are many things that affect the rating of a scotch. From age to type of scotch, there is something out there for everyone at any price level. To better understand what creates a favorable rating, a flavor profile analysis might be in order. Thank you, and best of luck finding the best bottle for you!