Imputing Missing Data and Analyzing Regional Variations in Scotch Whisky Smokiness
Author
Saurabh C Srivastava
Published
February 23, 2025
Objective of the Analysis
The objective of this analysis is to handle missing data in the Scotch Single Malts dataset (whisky_rev.csv) using Multiple Imputation by Chained Equations (MICE) and to explore regional variations in whisky smokiness. The dataset contains missing values, which are filled in with Random Forest-based imputation so that the analysis can run on a complete dataset. The study then examines whether whisky smokiness varies by region, using data visualization to look for patterns in the distribution of smokiness scores across whisky-producing areas.
Brief Description of the Code
This R script handles missing data in a whisky dataset and explores regional variations in whisky smokiness using data imputation, visualization, and statistical analysis.
1. Data Loading & Preprocessing
library(tidyverse) # Data manipulation and visualization
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(mice) # Handling missing data with multiple imputation
Attaching package: 'mice'
The following object is masked from 'package:stats':
filter
The following objects are masked from 'package:base':
cbind, rbind
2. Handling Missing Data with MICE (Multiple Imputation by Chained Equations)
The script converts the categorical variables (distillery, brand, region) into factors.
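The conversion can be sketched as follows. This is a minimal illustration, not the original script: it assumes the file is read with readr::read_csv() and that the columns are named distillery, brand, region and smoky. The toy rows and the temporary file are hypothetical stand-ins for whisky_rev.csv.

```r
library(dplyr)
library(readr)

# Hypothetical stand-in for whisky_rev.csv, written to a temporary file
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "distillery,brand,region,smoky",
  "Distillery A,Brand A,Islay,4",
  "Distillery B,Brand B,Speyside,1",
  "Distillery C,Brand C,Highlands,NA",
  "Distillery D,Brand D,Islay,3"
), toy_path)

# Read the file and turn the three categorical columns into factors
whisky <- read_csv(toy_path, show_col_types = FALSE) %>%
  mutate(across(c(distillery, brand, region), as.factor))

# Inspect column types: three factors and one numeric score
str(whisky)
```

With the real data, toy_path would be replaced by the path to whisky_rev.csv.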
The function naniar::vis_miss(whisky) visualizes the missing values (NAs) in the dataset, giving a graphical overview of where they occur.
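The missing-data check and the Random Forest imputation might look like the sketch below. It runs on hypothetical toy data: the column rich is invented here purely to serve as a predictor, and the values of m, maxit and seed are illustrative choices rather than the settings used in the analysis. Depending on the installed mice version, method = "rf" relies on a separate forest package such as ranger or randomForest being available.

```r
library(mice)    # multiple imputation
library(naniar)  # missing-data visualization

set.seed(42)
n <- 60

# Hypothetical toy data; `rich` is an invented score used only as a predictor
whisky <- data.frame(
  region = factor(sample(c("Highlands", "Islay", "Lowlands", "Speyside"),
                         n, replace = TRUE)),
  smoky  = sample(1:5, n, replace = TRUE),
  rich   = sample(1:5, n, replace = TRUE)
)
whisky$smoky[sample(n, 10)] <- NA  # remove some scores so there is something to impute

# Plot of missing cells per column, before imputation
miss_plot <- vis_miss(whisky)
print(miss_plot)

# Random Forest imputation via chained equations (illustrative settings)
imp <- mice(whisky, m = 5, method = "rf", maxit = 5,
            seed = 123, printFlag = FALSE)

# Extract the first of the m completed datasets
whisky_df <- mice::complete(imp, 1)

# Count of remaining missing values per column
colSums(is.na(whisky_df))
```

Calling vis_miss(whisky_df) on the completed data is a quick way to confirm that no gaps remain. Note that working with a single completed dataset ignores the between-imputation variability that the m imputations are meant to capture.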
whisky_df$region = as.integer(whisky_df$region)
whisky_df %>%
  group_by(region) %>%
  summarize(mean_smokiness = mean(smoky)) %>%
  ggplot(aes(y = mean_smokiness, x = region)) +
  geom_point(size = 2) +
  geom_jitter() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(caption = "Saurabh") +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  labs(title = "Mean Smokiness by Region", x = "Region", y = "Mean Smokiness")
`geom_smooth()` using formula = 'y ~ x'
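One caveat when reading this plot: as.integer() applied to a factor returns the internal level codes, which by default follow the alphabetical order of the region labels, so the slope of the fitted line depends on that ordering. The sketch below shows one alternative that keeps region categorical. It uses hypothetical toy data in place of the imputed whisky_df, and the Kruskal-Wallis test is an assumed choice for illustration, not a method taken from the original analysis.

```r
library(dplyr)
library(ggplot2)

set.seed(1)
# Hypothetical toy data standing in for the imputed whisky_df
toy <- data.frame(
  region = factor(rep(c("Highlands", "Islay", "Lowlands", "Speyside"), each = 15)),
  smoky  = c(sample(1:4, 15, replace = TRUE),   # Highlands
             sample(3:5, 15, replace = TRUE),   # Islay
             sample(1:2, 15, replace = TRUE),   # Lowlands
             sample(1:3, 15, replace = TRUE))   # Speyside
)

# Mean smokiness per region, with region kept as a categorical variable
region_means <- toy %>%
  group_by(region) %>%
  summarize(mean_smokiness = mean(smoky), n = n())
print(region_means)

# Distribution of scores per region; no ordering of regions is implied
p <- ggplot(toy, aes(x = region, y = smoky)) +
  geom_boxplot() +
  labs(title = "Smokiness by Region", x = "Region", y = "Smokiness")
print(p)

# Rank-based test of whether score distributions differ across regions
kt <- kruskal.test(smoky ~ region, data = toy)
print(kt)
```

A boxplot per region shows the spread within each region as well as the differences between them, which a single mean per region hides.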
Conclusion
This analysis successfully imputes missing values in the whisky dataset using Random Forest-based imputation (mice package), allowing for a more reliable exploration of whisky characteristics. The visualization of missing data (naniar::vis_miss()) confirmed that the imputation process effectively filled in gaps, improving dataset completeness.
The study also investigates whether whisky smokiness varies by region. The results suggest that while some regional variations in smokiness exist, the relationship is not strictly linear. The mean smokiness by region plot highlights how certain whisky-producing regions tend to have higher average smokiness scores, which could be indicative of regional production styles and ingredient differences.
This approach provides a data-driven method to explore whisky classifications, making it useful for whisky enthusiasts, industry professionals, and researchers interested in understanding regional influences on whisky characteristics. Future research could incorporate additional whisky attributes such as peat level, age, and alcohol content to further refine the analysis.