An Analysis of Scotch Reviews as a Normal Distribution

As an data scientist student the central limit theorem is one of the most powerful tools in your belt. The theorem in short allows you to approximate any distribution of random events as the normal distribution. I decided to pick a fairly arbitrary, yet interesting data set and see if I could successfully demonstrate the central limit theorem.

I have decided to investigate a set of roughly 2200 scotch reviews as the source of my investigation. I will look at the review each whiskey got, and attempt to see if it follows a normal distribution. Thereby verifying the central limit theorem.

Starting off we see the distribution of review points.

scotch <- read.csv("scotch_review.csv")
review <- scotch$review.point
par(mar = c(4, 4, .1, .1))
rePlot <- hist(review, breaks = 60:100)

We can already begin to see a rough normal shape here. A peak occurs between 85 and 90 points with a decrease in score from then on.

Now let’s look at the counts in each of these bins, and convert them to a probability. Here we see for each review score the corresponding probability of a review of that score.

reviewCounts <- rePlot$counts
reviewProb <- reviewCounts / sum(reviewCounts)
length(reviewProb)
## [1] 40
plot(61:100, reviewProb, type = 'h')

Now suppose we select some scotch from this list of 2200+ different varieties. What would be the odds we get one with a high score? Suppose we take 100 bottles off the shelf that is our data set.

randomBottles <- sample(review, size = 100, replace = T)
hist(randomBottles)

The distribution features two peaks. Perhaps squinting you could call it a normal distribution, but we can do better to prove our point. Suppose now we pick 100 bottles, twice and get the average of each review. Then we may get a more accurate representation of the samples.

bottleAverage <- (sample(review, size = 100, replace = T) + 
                  sample(review, size = 100, replace = T)) / 2
hist(bottleAverage, main = "Average of 2 Samples")

This seems to lead to a fitting distribution, lets repeat this process for a couple values.

n <- 10000

bottleMatx <- matrix(sample(review, size = n, replace = T) ,ncol = n, nrow = 10000)

hist(rowMeans(bottleMatx[1:5,]), main = "Average of 5 Samples")

hist(rowMeans(bottleMatx[1:10,]), main = "Average of 10 Samples")

hist(rowMeans(bottleMatx[1:25,]), main = "Average of 25 Samples")

hist(rowMeans(bottleMatx[1:50,]), main = "Average of 50 Samples")

hist(rowMeans(bottleMatx[1:100,]), main = "Average of 100 Samples")

Finally after a couple iterations one can see a symmetrical distribution like that of the normal distribution.