Imputing Missing Data and Analyzing Regional Variations in Scotch Whisky Smokiness

Author

Saurabh C Srivastava

Published

February 23, 2025

Objective of the Analysis

The objective of this analysis is to handle missing data in the Scotch Single Malts dataset (whisky_rev.csv) using Multiple Imputation by Chained Equations (MICE) and to explore regional variations in whisky smokiness. The dataset contains missing values in five columns (distillery, brand, region, honey, and nutty), which are imputed with the Random Forest method of the mice package so that every record can be used in the analysis. The study then examines whether smokiness varies by region, using data visualization to look for patterns in the distribution of smokiness scores across whisky-producing areas.

Brief Description of the Code

This R script imputes missing data in a whisky dataset and explores regional variation in whisky smokiness through visualization and a simple linear fit.

1. Data Loading & Preprocessing

library(tidyverse)   # Data manipulation and visualization
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(mice)        # Handling missing data with multiple imputation

Attaching package: 'mice'

The following object is masked from 'package:stats':

    filter

The following objects are masked from 'package:base':

    cbind, rbind
library(naniar)      # Visualizing missing data

whisky = read.csv("whisky_rev.csv", header = TRUE)

2. Handling Missing Data with MICE (Multiple Imputation by Chained Equations)

  • Converts the categorical variables (distillery, brand, region) into factors.

  • Uses naniar::vis_miss(whisky) to visualize missing values (NAs). The plot shows which cells of the dataset are missing and each column's share of missing values.

set.seed(42)
whisky$distillery <- factor(whisky$distillery)
whisky$brand <- factor(whisky$brand)
whisky$region <- factor(whisky$region)

naniar::vis_miss(whisky)
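
As a numeric complement to the missingness plot, the count and share of missing values per column can be tabulated with base R. The sketch below uses a small made-up data frame (toy, with invented values) in place of whisky, since the CSV itself is not reproduced here.

```r
# Toy stand-in for the whisky data; all values are invented for illustration
toy <- data.frame(
  region = factor(c("islay", NA, "speyside", "highland")),
  honey  = c(2, NA, NA, 1),
  smoky  = c(4, 1, 2, 2)
)

# Number and percentage of missing values per column
n_missing   <- colSums(is.na(toy))
pct_missing <- round(100 * colMeans(is.na(toy)), 1)

# One row per column of the data, named after that column
missing_summary <- data.frame(n_missing, pct_missing)
missing_summary
```

Applied to the real data, colSums(is.na(whisky)) would give the per-column counts that the plot shows graphically.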

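  • Imputes the missing values with the Random Forest method (method = "rf").

  • The mice() function imputes each incomplete variable from the other variables in the dataset. With its default settings it runs 5 iterations and produces 5 imputed datasets, as the iteration log shows.

  • The call ends with a warning about 121 logged events. The events displayed show mice dropping predictors, including the character column postcode and a number of distillery dummy variables. Note that distillery, a 76-level identifier, is itself imputed and used as a predictor here.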

nlevels(whisky$distillery)
[1] 76
nlevels(whisky$brand)
[1] 2
nlevels(whisky$region)
[1] 6
imp_whisky = mice(data = whisky, method = "rf")

 iter imp variable
  1   1  distillery  brand  region  honey  nutty
  1   2  distillery  brand  region  honey  nutty
  1   3  distillery  brand  region  honey  nutty
  1   4  distillery  brand  region  honey  nutty
  1   5  distillery  brand  region  honey  nutty
  2   1  distillery  brand  region  honey  nutty
  2   2  distillery  brand  region  honey  nutty
  2   3  distillery  brand  region  honey  nutty
  2   4  distillery  brand  region  honey  nutty
  2   5  distillery  brand  region  honey  nutty
  3   1  distillery  brand  region  honey  nutty
  3   2  distillery  brand  region  honey  nutty
  3   3  distillery  brand  region  honey  nutty
  3   4  distillery  brand  region  honey  nutty
  3   5  distillery  brand  region  honey  nutty
  4   1  distillery  brand  region  honey  nutty
  4   2  distillery  brand  region  honey  nutty
  4   3  distillery  brand  region  honey  nutty
  4   4  distillery  brand  region  honey  nutty
  4   5  distillery  brand  region  honey  nutty
  5   1  distillery  brand  region  honey  nutty
  5   2  distillery  brand  region  honey  nutty
  5   3  distillery  brand  region  honey  nutty
  5   4  distillery  brand  region  honey  nutty
  5   5  distillery  brand  region  honey  nutty
Warning: Number of logged events: 121
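
The warning above points to the logged events stored in the mids object. One possible way to reduce such events is to inspect them and to exclude identifier-like columns from the predictor set. The following is a minimal sketch on invented data, not the author's procedure: the data frame toy and its column id (standing in for a high-cardinality column such as distillery) are hypothetical, and the "rf" method assumes that a random forest backend package is installed alongside mice.

```r
library(mice)

set.seed(1)
n <- 60

# Invented data: 'id' plays the role of an identifier with one level per row
toy <- data.frame(
  id     = factor(paste0("d", seq_len(n))),
  region = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  smoky  = sample(0:4, n, replace = TRUE),
  honey  = sample(0:4, n, replace = TRUE)
)
toy$honey[sample(n, 8)]  <- NA
toy$region[sample(n, 5)] <- NA

# Use "rf" only for columns that actually have missing values
meth <- make.method(toy)
meth[meth != ""] <- "rf"

# Columns of the predictor matrix are predictors: switch 'id' off everywhere
pred <- make.predictorMatrix(toy)
pred[, "id"] <- 0

imp <- mice(toy, method = meth, predictorMatrix = pred,
            m = 5, seed = 42, printFlag = FALSE)

# NULL if nothing was logged, otherwise a data frame of events
imp$loggedEvents
```

Passing seed to mice() makes the imputation reproducible independently of any earlier set.seed() call.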
imp_whisky
Class: mids
Number of multiple imputations:  5 
Imputation methods:
distillery      brand     region       body  sweetness      smoky  medicinal 
      "rf"       "rf"       "rf"         ""         ""         ""         "" 
   tobacco      honey      spicy      winey      nutty      malty     fruity 
        ""       "rf"         ""         ""       "rf"         ""         "" 
    floral   postcode   latitude  longitude 
        ""         ""         ""         "" 
PredictorMatrix:
           distillery brand region body sweetness smoky medicinal tobacco honey
distillery          0     1      1    1         1     1         1       1     1
brand               1     0      1    1         1     1         1       1     1
region              1     1      0    1         1     1         1       1     1
body                1     1      1    0         1     1         1       1     1
sweetness           1     1      1    1         0     1         1       1     1
smoky               1     1      1    1         1     0         1       1     1
           spicy winey nutty malty fruity floral postcode latitude longitude
distillery     1     1     1     1      1      1        0        1         1
brand          1     1     1     1      1      1        0        1         1
region         1     1     1     1      1      1        0        1         1
body           1     1     1     1      1      1        0        1         1
sweetness      1     1     1     1      1      1        0        1         1
smoky          1     1     1     1      1      1        0        1         1
Number of logged events:  121 
  it im    dep     meth out
1  0  0        constant postcode
2  1  1  brand       rf df set to 1. # observed cases: 68  # predictors: 95
3  1  1  brand       rf distilleryAultmore, distilleryBalblair, distilleryBalvenie, distilleryBlair Athol, distilleryBowmore, distilleryBruichladdich, distilleryDailuaine, distilleryDalwhinnie, distilleryEdradour, distilleryGlenallachie, distilleryGlenkinchie, distilleryKnockando, distilleryLongmorn, distilleryMortlach, distilleryStrathmill, distilleryTamnavulin, regionhighland, regionislands, regionislay, regionlowland, smoky, medicinal, tobacco, honey, malty, latitude
4  1  1 region       rf df set to 1. # observed cases: 73  # predictors: 91
5  1  1 region       rf distilleryBalmenach, distilleryBalvenie, distilleryBenrinnes, distilleryBruichladdich, distilleryFettercairn, distilleryGlen Spey, distilleryIsle of Jura, distilleryKnockando, distilleryStrathmill, brandB, sweetness, medicinal, honey, floral
6  1  1  honey       rf df set to 1. # observed cases: 70  # predictors: 95
summary(imp_whisky)
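  • Extracts a completed dataset with mice::complete() and stores it in whisky_df. With default arguments, complete() returns only the first of the five imputed datasets.

  • Re-checks for missing data after imputation.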

whisky_df = mice::complete(imp_whisky)
str(whisky_df)
'data.frame':   86 obs. of  18 variables:
 $ distillery: Factor w/ 76 levels "Aberfeldy","Aberlour",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ brand     : Factor w/ 2 levels "A","B": 1 1 1 1 2 1 1 2 1 2 ...
 $ region    : Factor w/ 6 levels "campbeltown ",..: 2 6 2 4 2 3 5 6 6 2 ...
 $ body      : int  2 3 1 4 2 2 0 2 2 2 ...
 $ sweetness : int  2 3 3 1 2 3 2 3 2 3 ...
 $ smoky     : int  2 1 2 4 2 1 0 1 1 2 ...
 $ medicinal : int  0 0 0 4 0 1 0 0 0 1 ...
 $ tobacco   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ honey     : int  2 4 2 0 1 1 1 1 1 0 ...
 $ spicy     : int  1 3 0 2 1 1 1 1 0 2 ...
 $ winey     : int  2 2 0 0 1 1 0 2 0 0 ...
 $ nutty     : int  2 2 2 1 2 0 1 2 2 2 ...
 $ malty     : int  2 3 2 2 3 1 2 2 2 1 ...
 $ fruity    : int  2 3 3 1 1 1 3 2 2 2 ...
 $ floral    : int  2 2 2 0 1 2 3 1 2 1 ...
 $ postcode  : chr  "PH15 2EB" "AB38 9PJ" "AB5 5LI" "PA42 7EB" ...
 $ latitude  : int  286580 326340 352960 141560 355350 194050 247670 340754 340754 270820 ...
 $ longitude : int  749680 842570 839320 646220 829140 649950 672610 848623 848623 885770 ...
naniar::vis_miss(whisky_df)

head(whisky_df)
  distillery brand   region body sweetness smoky medicinal tobacco honey spicy
1  Aberfeldy     A highland    2         2     2         0       0     2     1
2   Aberlour     A speyside    3         3     1         0       0     4     3
3     AnCnoc     A highland    1         3     2         0       0     2     0
4     Ardbeg     A    islay    4         1     4         4       0     0     2
5    Ardmore     B highland    2         2     2         0       0     1     1
6      Arran     A  islands    2         3     1         1       0     1     1
  winey nutty malty fruity floral postcode latitude longitude
1     2     2     2      2      2 PH15 2EB   286580    749680
2     2     2     3      3      2 AB38 9PJ   326340    842570
3     0     2     2      3      2  AB5 5LI   352960    839320
4     0     1     2      1      0 PA42 7EB   141560    646220
5     1     2     3      1      1 AB54 4NH   355350    829140
6     1     0     1      1      2 KA27 8HJ   194050    649950
class(whisky_df)
[1] "data.frame"
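
Because mice::complete() with default arguments returns only the first imputed dataset, the between-imputation uncertainty is not carried into later steps. A sketch of how all imputations could be used instead is given below. It runs on nhanes, a small example dataset bundled with mice, as a stand-in for the whisky data, and the model bmi ~ age is purely illustrative.

```r
library(mice)

# nhanes ships with mice; it is used here only as a stand-in dataset
imp <- mice(nhanes, m = 5, method = "pmm", seed = 42, printFlag = FALSE)

# Default action = 1: the first imputed dataset only
first_only <- complete(imp)

# All 5 imputed datasets stacked, with .imp and .id columns added
all_long <- complete(imp, action = "long")

# Fit the same model on every imputed dataset, then pool with Rubin's rules
fits   <- with(imp, lm(bmi ~ age))
pooled <- pool(fits)
summary(pooled)
```

For purely descriptive plots a single completed dataset may be adequate, but pooled estimates are the usual choice when standard errors or tests are reported.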

3. Analyzing Whisky Smokiness by Region

  • Visualizes whisky smokiness (smoky) across the different regions.

  • Uses a color-coded, jittered scatter plot to show how smokiness scores are distributed within each whisky-producing region.

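whisky_df %>% 
  ggplot(aes(y = smoky, x = region, col = region)) +
  # geom_jitter() alone draws each whisky once; combining it with geom_point() plots every point twice
  # height = 0 keeps the integer smokiness scores at their true values
  geom_jitter(size = 3, width = 0.2, height = 0) +
  theme(legend.position = "none")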
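
The str() output shows the first region level as "campbeltown " with a trailing space, which would carry over into axis labels and into any matching on region names. A minimal sketch of cleaning factor levels with base R follows; it uses a made-up factor rather than the real column.

```r
# Invented example factor with a trailing space in one level
region <- factor(c("campbeltown ", "highland", "islay", "campbeltown "))

# Strip leading/trailing whitespace from the level labels;
# the underlying integer codes are unchanged
levels(region) <- trimws(levels(region))

levels(region)
```

For the real data, the equivalent would be levels(whisky_df$region) <- trimws(levels(whisky_df$region)), applied while region is still a factor.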

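  • Calculates the mean smokiness per region and plots it.

  • Overlays a linear fit (geom_smooth(method = "lm")). Because region is converted to integer codes that follow the alphabetical order of the factor levels (1 = campbeltown through 6 = speyside), the slope of this line reflects that arbitrary ordering rather than a genuine trend.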

str(whisky_df$region)
 Factor w/ 6 levels "campbeltown ",..: 2 6 2 4 2 3 5 6 6 2 ...
table(whisky_df$region)

campbeltown      highland      islands        islay      lowland     speyside 
           2           25            5            6            5           43 
# Integer codes follow the alphabetical order of the factor levels (1 = campbeltown through 6 = speyside)
whisky_df$region = as.integer(whisky_df$region)

whisky_df %>% group_by(region) %>%
  summarize(mean_smokiness = mean(smoky)) %>%
  ggplot(aes(y = mean_smokiness, x = region)) +
  geom_point(size = 2) +   # one point per region; jittering the means would misplace them
  geom_smooth(method = "lm", se = TRUE) +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  labs(title = "Mean Smokiness by Region",
       x = "Region",
       y = "Mean Smokiness",
       caption = "Saurabh")
`geom_smooth()` using formula = 'y ~ x'
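
Since region is a nominal category, one alternative to a linear fit over integer codes is to summarise smokiness per region (with group sizes) and to compare the groups with a rank-based test. The sketch below is an illustration on invented scores, not a result from the whisky data; the data frame toy is hypothetical, and the Kruskal-Wallis test is a swapped-in technique that the original analysis does not use.

```r
library(dplyr)

# Invented smokiness scores for three regions
toy <- data.frame(
  region = factor(c("islay", "islay", "islay",
                    "speyside", "speyside", "speyside",
                    "highland", "highland", "highland")),
  smoky  = c(4, 3, 4, 1, 2, 1, 2, 2, 1)
)

# Group size and mean per region, ordered from smokiest to least smoky
region_summary <- toy %>%
  group_by(region) %>%
  summarize(n = n(), mean_smokiness = mean(smoky), .groups = "drop") %>%
  arrange(desc(mean_smokiness))

region_summary

# Rank-based test of whether smokiness differs between regions
kw <- kruskal.test(smoky ~ region, data = toy)
kw
```

Reporting n next to each mean matters for the real data, where group sizes range from 2 (campbeltown) to 43 (speyside).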

Conclusion

This analysis imputes the missing values in the whisky dataset using Random Forest-based imputation (mice package), so that all 86 whiskies can be included when exploring whisky characteristics. The visualization of missing data after imputation (naniar::vis_miss(whisky_df)) confirmed that the imputation filled the gaps. The analysis works with a single completed dataset, so it does not reflect the variability across the five imputations.

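The study also investigates whether whisky smokiness varies by region. The results suggest that some regional variation in smokiness exists: the mean smokiness by region plot highlights how certain whisky-producing regions tend to have higher average smokiness scores, which could be indicative of regional production styles and ingredient differences. Region is a nominal category, however, so the linear fit over alphabetically ordered region codes should not be read as a trend. The means for the smaller groups (campbeltown with 2 whiskies, islands and lowland with 5 each, islay with 6) also rest on very few observations.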

This approach provides a data-driven method to explore whisky classifications, making it useful for whisky enthusiasts, industry professionals, and researchers interested in understanding regional influences on whisky characteristics. Future research could incorporate additional whisky attributes such as peat level, age, and alcohol content to further refine the analysis.