Chapter 6: Data Preprocessing

Statistics for Data Science

Author

Pai

Published

January 1, 2026


1 Chapter Overview

Raw data is almost never ready for analysis. In practice, data scientists spend the majority of their time — estimates range from 60% to 80% — cleaning, transforming, and preparing data before any modeling begins. This is not merely a technical chore: the decisions made during preprocessing directly shape the conclusions drawn from the analysis. A model trained on poorly preprocessed data is unreliable regardless of its algorithmic sophistication.

This chapter covers:

  • Why Preprocessing Matters — the preprocessing pipeline and its role in data science
  • Handling Missing Data — types of missingness, deletion strategies, and imputation
  • Outlier Detection and Treatment — IQR, z-score, and multivariate methods
  • Data Transformation — log, Box-Cox, normalization, and standardization
  • Encoding Categorical Variables — dummy, one-hot, ordinal, and target encoding
  • Feature Scaling — min-max, z-score, and robust scaling
  • Data Integration and Reshaping — merging, pivoting, and tidy data principles
NoteLearning Objectives

By the end of this chapter, you will be able to:

  1. Describe the role of preprocessing in the data science pipeline.
  2. Identify the type of missingness and apply appropriate handling strategies.
  3. Detect and treat outliers using univariate and multivariate methods.
  4. Apply appropriate transformations to correct distributional issues.
  5. Encode categorical variables correctly for statistical and machine learning models.
  6. Scale features appropriately for distance-based and gradient-based algorithms.
  7. Reshape and integrate data using tidyverse tools following tidy data principles.

2 Why Preprocessing Matters

2.1 Introduction

The phrase “garbage in, garbage out” (GIGO) captures a fundamental truth in data science: the quality of any analysis is bounded by the quality of its input data. Missing values, outliers, inconsistent formatting, wrong data types, and poorly scaled features can silently distort statistical estimates, inflate model error, and lead to conclusions that seem analytically sound but are empirically baseless. Preprocessing is the systematic process of transforming raw data into a form suitable for analysis.

2.2 Theory

2.2.1 The Data Preprocessing Pipeline

A typical preprocessing pipeline consists of sequential steps applied before modeling:

Raw Data
    ↓
1. Data Inspection      — understand structure, types, missingness
    ↓
2. Missing Data         — identify, understand, and handle NAs
    ↓
3. Outlier Treatment    — detect and decide: remove, cap, or keep
    ↓
4. Transformation       — correct distributional shape
    ↓
5. Encoding             — convert categorical to numeric
    ↓
6. Scaling              — normalize feature magnitudes
    ↓
7. Integration          — merge sources, reshape to tidy format
    ↓
Clean Data → Analysis / Modeling

The order matters: outliers should be addressed before scaling; missing data should be handled before transformation; encoding precedes scaling for newly created numeric features.

2.2.2 Why Each Step Matters

Consequences of skipping preprocessing steps
Issue Consequence if Ignored Preprocessing Fix
Missing values Biased estimates, model errors Imputation or deletion
Outliers Inflated variance, distorted means Detection + treatment
Skewed distributions Violated normality assumptions Log/Box-Cox transformation
Mixed scales Distance-based models dominated by large-scale features Standardization
Raw categories Most algorithms require numeric input Encoding
Inconsistent formats Merge failures, incorrect grouping Cleaning + reshaping

2.2.3 The Leakage Problem

A critical principle in preprocessing: all transformations must be fitted on training data only and then applied to test data. Fitting a scaler or imputer on the full dataset before splitting allows test set information to “leak” into the training process, producing overly optimistic performance estimates. This is why preprocessing pipelines (like recipes in R or sklearn.pipeline in Python) are best practice — they enforce the correct order of operations and prevent leakage.

2.3 Example: The Cost of Skipping Preprocessing

Example 6.1. A data scientist trains a regression model to predict hospital readmission risk. The dataset has:

  • 18% missing values in the key predictor “number of prior admissions” (not handled → model drops these rows, losing 18% of data and biasing toward patients with complete records)
  • One outlier patient with 47 prior admissions (not handled → inflates the regression slope)
  • Age in years and income in dollars (not scaled → income dominates distance calculations)
  • Insurance type as a string variable (not encoded → model crashes or silently drops it)

Each issue independently degrades model performance. Together, they produce a model that is unreliable despite sophisticated methodology.

2.4 R Example: Initial Data Inspection

# --- Use airquality as a running example ---
data(airquality)

# Step 1: Structural overview
cat("=== Dataset Structure ===\n")
=== Dataset Structure ===
glimpse(airquality)
Rows: 153
Columns: 6
$ Ozone   <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, 18, 14, …
$ Solar.R <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256, 290, 27…
$ Wind    <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6, 6.9, 9…
$ Temp    <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68, 58, 64…
$ Month   <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
$ Day     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
# Step 2: Missing value summary
cat("\n=== Missing Values ===\n")

=== Missing Values ===
miss_summary <- data.frame(
  Variable    = names(airquality),
  Type        = sapply(airquality, class),
  N_Missing   = colSums(is.na(airquality)),
  Pct_Missing = round(colMeans(is.na(airquality)) * 100, 1)
)
rownames(miss_summary) <- NULL

kable(miss_summary,
      caption   = "Variable Overview and Missing Data",
      col.names = c("Variable","Type","N Missing","% Missing")) |>
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE) |>
  column_spec(4, bold = TRUE,
              color = ifelse(miss_summary$Pct_Missing > 0,
                             "tomato", "darkgreen"))
Variable Overview and Missing Data
Variable Type N Missing % Missing
Ozone integer 37 24.2
Solar.R integer 7 4.6
Wind numeric 0 0.0
Temp integer 0 0.0
Month integer 0 0.0
Day integer 0 0.0
# Step 3: Basic descriptive statistics
summary(airquality)
     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
     Month            Day      
 Min.   :5.000   Min.   : 1.0  
 1st Qu.:6.000   1st Qu.: 8.0  
 Median :7.000   Median :16.0  
 Mean   :6.993   Mean   :15.8  
 3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :9.000   Max.   :31.0  
                               

Code explanation:

  • glimpse() provides a compact structural overview: variable names, types, and first few values — always the first step.
  • Building a custom miss_summary data frame is more informative than is.na() alone — it quantifies the scale of the missing data problem per variable.
  • summary() gives a quick distributional picture including min, max, quartiles, and NA counts — useful for spotting impossible values and gross outliers.

2.5 Exercises

TipExercise 6.1

Load the mtcars dataset and perform a full initial inspection:

  1. Check variable types with str() and glimpse(). Which variables are stored incorrectly?
  2. Create a missing value summary table. Are there any missing values?
  3. Use summary() to identify any suspicious values (impossible ranges, extreme outliers).
  4. Write a 100-word “data quality report” describing the state of the data before preprocessing.

3 Handling Missing Data

3.1 Introduction

Missing data is one of the most pervasive problems in real-world datasets. How missing values are handled has a profound effect on the validity of subsequent analysis — and the wrong strategy can introduce bias far worse than the original missingness. The first step is always to understand why data is missing, because the appropriate handling strategy depends on the mechanism of missingness.

3.2 Theory

3.2.1 Types of Missingness

Little and Rubin (1987) established the canonical taxonomy of missing data mechanisms:

Missing Completely At Random (MCAR): The probability of missingness is unrelated to any variable — observed or missing. The missing data is a random subsample of the complete data.

\[P(\text{missing} \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(\text{missing})\]

Example: A sensor randomly malfunctions regardless of the measurement value. MCAR is the most benign mechanism — complete case analysis (listwise deletion) is unbiased, though it loses power.

Missing At Random (MAR): The probability of missingness depends on observed variables but not on the missing values themselves — after conditioning on observed data.

\[P(\text{missing} \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(\text{missing} \mid Y_{\text{obs}})\]

Example: Older patients are less likely to complete a digital survey (age is observed; the missing survey responses are not related to their content). MAR is the most common mechanism and supports valid imputation.

Missing Not At Random (MNAR): The probability of missingness depends on the missing values themselves — even after conditioning on observed data.

\[P(\text{missing} \mid Y_{\text{obs}}, Y_{\text{mis}}) \neq P(\text{missing} \mid Y_{\text{obs}})\]

Example: People with very high incomes decline to report income. MNAR is the most problematic — no standard statistical method can fully correct for it without additional assumptions or external data.

3.2.2 Strategies for Handling Missing Data

1. Complete Case Analysis (Listwise Deletion)

Remove all rows containing any missing value. Simple and default in most software.

  • ✓ Valid and unbiased under MCAR
  • ✗ Loses power; biased under MAR or MNAR
  • ✗ Can lose substantial data if missingness is spread across many variables

2. Mean / Median / Mode Imputation

Replace missing values with the variable’s mean (continuous), median (skewed/ordinal), or mode (categorical).

  • ✓ Simple and fast
  • ✗ Underestimates variance (artificially concentrates values at center)
  • ✗ Distorts correlations between variables
  • ✗ Valid only approximately under MCAR

3. Multiple Imputation (MI)

Generate \(m\) complete datasets by imputing missing values \(m\) times using a statistical model (typically a multivariate normal or predictive mean matching), analyze each, and combine results using Rubin’s rules.

\[\hat{\theta} = \frac{1}{m}\sum_{i=1}^{m}\hat{\theta}_i, \qquad \text{Var}(\hat{\theta}) = \bar{W} + \left(1 + \frac{1}{m}\right)B\]

where \(\bar{W}\) is the within-imputation variance and \(B\) is the between-imputation variance.

  • ✓ Valid under MAR; gold standard for missing data handling
  • ✓ Correctly propagates uncertainty from missingness
  • ✗ More complex to implement and interpret

4. Model-Based Imputation

Use a regression model to predict missing values from other variables (K-nearest neighbors imputation, random forest imputation).

  • ✓ Preserves relationships between variables better than mean imputation
  • ✓ Works well for MAR data
  • ✗ Risk of overfitting if not cross-validated

3.2.3 Diagnosing the Missingness Mechanism

Missingness diagnostics
Test Purpose
Compare observed vs. missing on other variables Check for MAR patterns
Little’s MCAR test Formal test of MCAR assumption
Missing data pattern visualization Identify which variables co-occur in missingness

3.3 Example: Missing Data in a Clinical Study

Example 6.2. A clinical dataset records blood pressure, cholesterol, age, and BMI for 500 patients. Cholesterol has 22% missing values. Investigation reveals: older patients (age > 65) are less likely to have cholesterol measured (they were referred to a different lab). Since age is observed and the missingness depends on age (not on the cholesterol value itself), this is MAR — multiple imputation using age and other observed variables as predictors is the appropriate strategy.

If instead, patients with very high cholesterol avoided the test (fearing bad news), this would be MNAR — no standard imputation method can fully correct for this without external information.

3.4 R Example: Missing Data Analysis and Imputation

# --- Missing data visualization ---
# airquality: Ozone (24 missing) and Solar.R (7 missing)

# Pattern of missingness
naniar::vis_miss(airquality) +
  labs(title = "Missing Data Pattern: airquality Dataset") +
  theme_minimal(base_size = 12)

# --- Explore MAR: does missingness in Ozone relate to other vars? ---
airquality <- airquality |>
  mutate(ozone_missing = is.na(Ozone))

# Compare Temp and Wind between missing and non-missing Ozone groups
missing_comparison <- airquality |>
  group_by(ozone_missing) |>
  summarise(
    n         = n(),
    Mean_Temp = round(mean(Temp,  na.rm = TRUE), 2),
    Mean_Wind = round(mean(Wind,  na.rm = TRUE), 2),
    Mean_Solar = round(mean(Solar.R, na.rm = TRUE), 2),
    .groups   = "drop"
  )

kable(missing_comparison,
      caption   = "Variable Means by Ozone Missingness Status",
      col.names = c("Ozone Missing?","n","Mean Temp",
                    "Mean Wind","Mean Solar.R")) |>
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE)
Variable Means by Ozone Missingness Status
Ozone Missing? n Mean Temp Mean Wind Mean Solar.R
FALSE 116 77.87 9.86 184.80
TRUE 37 77.92 10.26 189.51
# --- Compare imputation methods ---
airquality_clean <- airquality |> dplyr::select(-ozone_missing)

# Method 1: Complete case analysis
cc_data <- na.omit(airquality_clean)
cat("Complete cases:", nrow(cc_data),
    "of", nrow(airquality_clean), "\n")
Complete cases: 111 of 153 
# Method 2: Mean imputation
mean_imputed <- airquality_clean |>
  mutate(
    Ozone    = ifelse(is.na(Ozone),
                       mean(Ozone, na.rm = TRUE), Ozone),
    Solar.R  = ifelse(is.na(Solar.R),
                       mean(Solar.R, na.rm = TRUE), Solar.R)
  )

# Method 3: Multiple imputation via mice
set.seed(42)
mice_out  <- mice(airquality_clean, m = 5,
                   method = "pmm", printFlag = FALSE)
mi_data   <- complete(mice_out, 1)  # first imputed dataset

# Compare distributions of imputed Ozone
imp_compare <- data.frame(
  Method = c("Complete Cases","Mean Imputation",
             "Multiple Imputation (MI)"),
  n      = c(nrow(cc_data), nrow(mean_imputed), nrow(mi_data)),
  Mean_Ozone = round(c(mean(cc_data$Ozone),
                        mean(mean_imputed$Ozone),
                        mean(mi_data$Ozone)), 2),
  SD_Ozone   = round(c(sd(cc_data$Ozone),
                        sd(mean_imputed$Ozone),
                        sd(mi_data$Ozone)), 2)
)

kable(imp_compare,
      caption   = "Imputation Method Comparison: Ozone Variable",
      col.names = c("Method","n","Mean Ozone","SD Ozone")) |>
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE)
Imputation Method Comparison: Ozone Variable
Method n Mean Ozone SD Ozone
Complete Cases 111 42.10 33.28
Mean Imputation 153 42.13 28.69
Multiple Imputation (MI) 153 41.50 30.91
# --- Visualize effect of imputation on distribution ---
dist_df <- bind_rows(
  data.frame(Ozone  = cc_data$Ozone,
             Method = "Complete Cases"),
  data.frame(Ozone  = mean_imputed$Ozone,
             Method = "Mean Imputation"),
  data.frame(Ozone  = mi_data$Ozone,
             Method = "Multiple Imputation")
)

ggplot(dist_df, aes(x = Ozone, fill = Method)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c("steelblue","tomato","seagreen")) +
  labs(title    = "Ozone Distribution Under Different Imputation Methods",
       subtitle = "Mean imputation artificially spikes at the mean; MI preserves shape",
       x        = "Ozone (ppb)", y = "Density",
       fill     = "Method") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

Code explanation:

  • naniar::vis_miss() produces a tile plot of missingness — each column is a variable, each row an observation, and missing values appear in a contrasting color.
  • mice(data, m=5, method="pmm") runs multiple imputation with predictive mean matching — the most robust method for continuous variables. m=5 creates 5 imputed datasets.
  • The density plot reveals why mean imputation is problematic: it creates an artificial spike at the mean, underestimating variance and distorting the distribution.

3.5 Exercises

TipExercise 6.2

Using the airquality dataset:

  1. Test whether Ozone missingness is related to Month using a chi-square test or proportion comparison. What type of missingness does this suggest?
  2. Implement KNN imputation using VIM::kNN(). Compare the distribution of imputed Ozone to mean imputation and MI.
  3. After multiple imputation (m=5), pool results using mice::pool() for a linear regression of Ozone on Temp and Wind. Report the pooled coefficients.
TipExercise 6.3

Create a dataset with MNAR missingness: simulate 200 observations of income where values above 100,000 are set to NA (high earners hide income).

  1. Apply mean imputation and MI. Compare estimates of the mean income to the true value.
  2. Show that neither method recovers the true mean. What does this demonstrate about MNAR?

4 Outlier Detection and Treatment

4.1 Introduction

An outlier is an observation that deviates markedly from the general pattern of the data. Outliers arise from data entry errors, measurement failures, genuine extreme values, or population heterogeneity. The key question is always: Is this outlier an error to be corrected, or a genuine extreme value to be retained? Blindly removing outliers is as dangerous as blindly keeping them — context and domain knowledge are essential.

4.2 Theory

4.2.1 Univariate Methods

IQR Method (Tukey’s Fences):

An observation is a potential outlier if it falls below the lower fence or above the upper fence:

\[\text{Lower fence} = Q_1 - 1.5 \times \text{IQR}\] \[\text{Upper fence} = Q_3 + 1.5 \times \text{IQR}\]

Using a multiplier of 3.0 instead of 1.5 defines extreme outliers. This method is robust — it uses quartiles rather than the mean, so it is not affected by the outliers it is trying to detect.

Z-Score Method:

An observation is an outlier if its standardized value exceeds a threshold (typically \(|z| > 3\)):

\[z_i = \frac{x_i - \bar{x}}{s}\]

This method is sensitive to the outliers themselves (since \(\bar{x}\) and \(s\) are both affected), making it less reliable than the IQR method for heavily contaminated data. The modified z-score using the median and MAD (median absolute deviation) is more robust:

\[\tilde{z}_i = \frac{0.6745(x_i - \tilde{x})}{\text{MAD}}, \qquad \text{MAD} = \text{median}(|x_i - \tilde{x}|)\]

Observations with \(|\tilde{z}_i| > 3.5\) are flagged as outliers.

4.2.2 Multivariate Methods

Mahalanobis Distance:

For multivariate data, univariate methods miss outliers that are unusual in the joint distribution but not in any individual variable. The Mahalanobis distance measures how far an observation is from the multivariate centroid, accounting for correlations:

\[D^2_i = (\mathbf{x}_i - \bar{\mathbf{x}})^T \mathbf{S}^{-1} (\mathbf{x}_i - \bar{\mathbf{x}})\]

Under multivariate normality, \(D^2_i \sim \chi^2(p)\) where \(p\) is the number of variables. Observations with \(D^2_i > \chi^2_{p, 0.975}\) are flagged.

4.2.3 Treatment Options

Outlier treatment options
Treatment When to Use
Remove Confirmed data entry errors or measurement failures
Cap (Winsorize) Genuine extreme values; replace with fence values
Transform Log or sqrt reduces the influence of large values
Keep Genuine extreme values that are part of the population
Robust methods Use median-based estimators instead of mean-based
WarningNever Remove Outliers Automatically

Always investigate before removing. An outlier might be the most scientifically interesting observation. Document every removal decision with a justification.

4.3 Example: Outlier Detection

Example 6.3. In a dataset of 100 monthly salaries (THB), the IQR method flags a salary of 850,000 THB as an outlier (all others are below 120,000). Investigation reveals this is the CEO — a genuine extreme value, not an error. The appropriate treatment is to retain it for descriptive analysis but report separately when computing “typical” employee salary, or to use the median as the summary statistic.

4.4 R Example: Outlier Detection and Treatment

# --- Univariate outlier detection ---
data(airquality)
ozone <- na.omit(airquality$Ozone)

# IQR method
Q1  <- quantile(ozone, 0.25)
Q3  <- quantile(ozone, 0.75)
IQR_val <- Q3 - Q1
lower   <- Q1 - 1.5 * IQR_val
upper   <- Q3 + 1.5 * IQR_val

# Z-score method
z_scores <- abs(scale(ozone))

# Modified z-score (robust)
MAD_val  <- mad(ozone)
mod_z    <- abs(0.6745 * (ozone - median(ozone)) / MAD_val)

outlier_summary <- data.frame(
  Method           = c("IQR (1.5×IQR)",
                        "IQR (3.0×IQR — extreme)",
                        "Z-score (|z|>3)",
                        "Modified Z-score (|z̃|>3.5)"),
  N_Flagged        = c(
    sum(ozone < lower | ozone > upper),
    sum(ozone < Q1 - 3*IQR_val | ozone > Q3 + 3*IQR_val),
    sum(z_scores > 3),
    sum(mod_z > 3.5)
  ),
  Values_Flagged   = c(
    paste(sort(ozone[ozone < lower | ozone > upper]),
          collapse = ", "),
    paste(sort(ozone[ozone < Q1-3*IQR_val |
                       ozone > Q3+3*IQR_val]),
          collapse = ", "),
    paste(sort(ozone[z_scores > 3]), collapse = ", "),
    paste(sort(ozone[mod_z > 3.5]), collapse = ", ")
  )
)

kable(outlier_summary,
      caption   = "Outlier Detection: Ozone Variable",
      col.names = c("Method","N Flagged","Values")) |>
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE)
Outlier Detection: Ozone Variable
Method N Flagged Values
IQR (1.5×IQR) 2 135, 168
IQR (3.0×IQR — extreme) 0
Z-score (&#124;z&#124;>3) 1 168
Modified Z-score (&#124;z̃&#124;>3.5) | |168
# --- Multivariate outlier: Mahalanobis distance ---
mtcars_cont <- mtcars[, c("mpg","hp","wt","disp")]

maha_dist  <- mahalanobis(mtcars_cont,
                            colMeans(mtcars_cont),
                            cov(mtcars_cont))
chi_crit   <- qchisq(0.975, df = ncol(mtcars_cont))

outlier_flag <- maha_dist > chi_crit

cat("Mahalanobis Distance Outliers (χ²₄, p=0.975 threshold =",
    round(chi_crit, 2), "):\n")
Mahalanobis Distance Outliers (χ²₄, p=0.975 threshold = 11.14 ):
print(mtcars[outlier_flag,
             c("mpg","hp","wt","disp")])
              mpg  hp   wt disp
Maserati Bora  15 335 3.57  301
# --- Winsorization (capping) ---
winsorize <- function(x, lower_p = 0.05, upper_p = 0.95) {
  lo <- quantile(x, lower_p, na.rm = TRUE)
  hi <- quantile(x, upper_p, na.rm = TRUE)
  pmin(pmax(x, lo), hi)
}

ozone_wins <- winsorize(ozone, 0.05, 0.95)

cat("\nOzone before Winsorization: mean =",
    round(mean(ozone), 2), " sd =", round(sd(ozone), 2), "\n")

Ozone before Winsorization: mean = 42.13  sd = 32.99 
cat("Ozone after Winsorization:  mean =",
    round(mean(ozone_wins), 2), " sd =",
    round(sd(ozone_wins), 2), "\n")
Ozone after Winsorization:  mean = 41.25  sd = 30.06 
# --- Visualize outliers ---
ozone_df <- data.frame(
  ozone  = ozone,
  outlier = ozone > upper | ozone < lower
)

p1 <- ggplot(ozone_df, aes(x = "", y = ozone)) +
  geom_boxplot(fill = "steelblue", alpha = 0.6,
               outlier.shape = NA) +
  geom_jitter(aes(color = outlier), width = 0.15,
              size = 2.5, alpha = 0.8) +
  geom_hline(yintercept = c(lower, upper),
             color = "tomato", linetype = "dashed",
             linewidth = 0.9) +
  scale_color_manual(values = c("FALSE" = "steelblue",
                                 "TRUE"  = "tomato")) +
  labs(title    = "A. IQR Outlier Detection",
       subtitle = paste0("Red dashed = fences [",
                          round(lower,1), ", ",
                          round(upper,1), "]"),
       x = "", y = "Ozone (ppb)",
       color = "Outlier") +
  theme_minimal(base_size = 12)

p2 <- ggplot(data.frame(car   = rownames(mtcars),
                         maha  = maha_dist,
                         flag  = outlier_flag),
             aes(x = reorder(car, maha), y = maha,
                 fill = flag)) +
  geom_col(color = "white") +
  geom_hline(yintercept = chi_crit, color = "tomato",
             linetype = "dashed", linewidth = 1) +
  scale_fill_manual(values = c("FALSE" = "steelblue",
                                "TRUE"  = "tomato")) +
  coord_flip() +
  labs(title    = "B. Mahalanobis Distance (mtcars)",
       subtitle = paste0("Red = exceeds χ²₄ threshold (",
                          round(chi_crit,1), ")"),
       x = "Car", y = "Mahalanobis Distance²",
       fill = "Outlier") +
  theme_minimal(base_size = 10) +
  theme(legend.position = "none")

p1 + p2

Code explanation:

  • mahalanobis(x, center, cov) computes squared Mahalanobis distances; qchisq(0.975, df=p) gives the critical value.
  • The winsorize() function clamps extreme values at specified percentiles — a common and principled alternative to deletion.
  • The bar chart of Mahalanobis distances (Panel B) makes it easy to rank observations by multivariate extremity and identify which cars are unusual across all dimensions simultaneously.

4.5 Exercises

TipExercise 6.4

Using the mtcars dataset:

  1. Apply IQR and modified z-score outlier detection to hp. Do both methods agree?
  2. Compute Mahalanobis distances using mpg, hp, wt, and qsec. Which cars are multivariate outliers?
  3. Compare the regression of mpg ~ hp + wt with and without the identified outliers. How much do the coefficients change?
TipExercise 6.5

For the airquality Ozone variable:

  1. Winsorize at the 5th and 95th percentiles. Plot original vs. winsorized distributions.
  2. Compare the mean and SD before and after winsorization.
  3. Would you recommend winsorization or log transformation for this variable? Justify using skewness and normality tests.

5 Data Transformation

5.1 Introduction

Many statistical methods assume that variables are approximately normally distributed, have constant variance, or are measured on a linear scale. When these assumptions are violated — as they often are with real data — transformations can correct the distributional shape, stabilize variance, and linearize relationships. Transformations are applied to the variable itself, not to the model; after transformation, the standard methods can proceed validly.

5.2 Theory

5.2.1 Log Transformation

The most common transformation for right-skewed, positive data:

\[x' = \log(x) \quad \text{(natural log)} \qquad \text{or} \qquad x' = \log_{10}(x)\]

When to use: Right-skewed distributions (income, reaction times, biological measurements, counts). Requires \(x > 0\); use \(\log(x + c)\) for data containing zeros, where \(c\) is a small constant.

Effect: Compresses large values more than small ones, pulling the right tail toward the center.

5.2.2 Square Root Transformation

\[x' = \sqrt{x}\]

When to use: Count data (Poisson-distributed), moderate right skew. Less aggressive than log; applicable when \(x \geq 0\).

5.2.3 Box-Cox Transformation

A flexible family of power transformations that includes log and square root as special cases:

\[x'(\lambda) = \begin{cases} \frac{x^\lambda - 1}{\lambda} & \lambda \neq 0 \\ \log(x) & \lambda = 0 \end{cases}\]

The optimal \(\lambda\) is estimated by maximum likelihood to maximize normality. Common values: \(\lambda = 1\) (no transformation), \(\lambda = 0.5\) (square root), \(\lambda = 0\) (log), \(\lambda = -1\) (reciprocal).

5.2.4 Yeo-Johnson Transformation

An extension of Box-Cox that handles zero and negative values:

\[\psi(x, \lambda) = \begin{cases} [(x+1)^\lambda - 1]/\lambda & x \geq 0, \lambda \neq 0 \\ \log(x+1) & x \geq 0, \lambda = 0 \\ -[(-x+1)^{2-\lambda} - 1]/(2-\lambda) & x < 0, \lambda \neq 2 \\ -\log(-x+1) & x < 0, \lambda = 2 \end{cases}\]

5.2.5 Normalization vs. Standardization

These are scaling operations, covered in detail in Section 6, but distinguished here from transformation:

  • Transformation changes the distributional shape (skewness, tail behavior).
  • Scaling/Normalization changes the location and spread without changing the shape.

5.3 Example: Choosing a Transformation

Example 6.4. Household income data has skewness = 3.8, indicating severe right skew. Three transformations are compared:

Transformation Skewness After Shapiro-Wilk p
None 3.8 < 0.001
Square root 1.9 0.008
Log 0.4 0.21
Box-Cox (\(\lambda = 0.12\)) 0.1 0.38

Box-Cox achieves the best normalization, followed closely by log transformation. For most practical purposes, log transformation is preferred for its interpretability (log-income differences correspond to percentage differences).

5.4 R Example: Data Transformations

# --- Compare transformations on Ozone ---
data(airquality)
ozone <- na.omit(airquality$Ozone)

# Compute skewness before and after
transforms <- data.frame(
  Transformation = c("None","Square Root",
                      "Log","Box-Cox"),
  Data           = list(
    ozone,
    sqrt(ozone),
    log(ozone),
    NA  # filled below
  )
)

# Box-Cox optimal lambda
bc <- MASS::boxcox(ozone ~ 1, lambda = seq(-2, 2, 0.1),
                    plotit = FALSE)
lambda_opt <- bc$x[which.max(bc$y)]
cat("Optimal Box-Cox lambda:", round(lambda_opt, 3), "\n")
Optimal Box-Cox lambda: 0.2 
ozone_bc <- (ozone^lambda_opt - 1) / lambda_opt

# Summary table
transform_summary <- data.frame(
  Transformation = c("None","Square Root","Log",
                      paste0("Box-Cox (λ=",
                             round(lambda_opt,2), ")")),
  Mean           = round(c(mean(ozone), mean(sqrt(ozone)),
                            mean(log(ozone)), mean(ozone_bc)), 3),
  SD             = round(c(sd(ozone), sd(sqrt(ozone)),
                            sd(log(ozone)), sd(ozone_bc)), 3),
  Skewness       = round(c(moments::skewness(ozone),
                            moments::skewness(sqrt(ozone)),
                            moments::skewness(log(ozone)),
                            moments::skewness(ozone_bc)), 3),
  SW_p           = round(c(shapiro.test(ozone)$p.value,
                            shapiro.test(sqrt(ozone))$p.value,
                            shapiro.test(log(ozone))$p.value,
                            shapiro.test(ozone_bc)$p.value), 4)
)

kable(transform_summary,
      caption   = "Transformation Comparison: Ozone Variable",
      col.names = c("Transformation","Mean","SD",
                    "Skewness","Shapiro-Wilk p")) |>
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE) |>
  column_spec(4, bold = TRUE,
              color = ifelse(abs(transform_summary$Skewness) < 0.5,
                             "darkgreen","tomato"))
Transformation Comparison: Ozone Variable
Transformation Mean SD Skewness Shapiro-Wilk p
None 42.129 32.988 1.226 0.0000
Square Root 6.023 2.430 0.514 0.0044
Log 3.419 0.865 -0.555 0.0147
Box-Cox (λ=0.2) 5.050 1.676 -0.024 0.3399
# --- Visual comparison of transformations ---
trans_df <- data.frame(
  None         = ozone,
  Square_Root  = sqrt(ozone),
  Log          = log(ozone),
  Box_Cox      = ozone_bc
) |>
  pivot_longer(everything(),
               names_to  = "Transformation",
               values_to = "Value") |>
  mutate(Transformation = factor(
    Transformation,
    levels = c("None","Square_Root","Log","Box_Cox"),
    labels = c("None","Square Root","Log",
               paste0("Box-Cox (λ=",round(lambda_opt,2),")"))))

ggplot(trans_df, aes(x = Value)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins  = 25, fill  = "steelblue",
                 color = "white", alpha = 0.8) +
  geom_density(color = "tomato", linewidth = 1) +
  facet_wrap(~Transformation, scales = "free", ncol = 2) +
  labs(title    = "Effect of Transformations on Ozone Distribution",
       subtitle = "Log and Box-Cox achieve approximate normality",
       x = "Transformed Value", y = "Density") +
  theme_minimal(base_size = 12)

Code explanation:

  • MASS::boxcox(y ~ 1, lambda) estimates the optimal \(\lambda\) by maximum likelihood. The plotit = FALSE option suppresses the automatic plot so we can create our own.
  • pivot_longer() reshapes the wide data frame of transformed values into a long format suitable for facet_wrap().
  • facet_wrap(scales = "free") allows each panel to have its own axis range — essential when the transformed scales differ dramatically.

5.5 Exercises

TipExercise 6.6

Using the mtcars dataset:

  1. Apply log, square root, and Box-Cox transformations to hp. Compare skewness and Shapiro-Wilk p-values.
  2. Apply the Yeo-Johnson transformation using bestNormalize::yeojohnson(). Does it outperform Box-Cox?
  3. For mpg, determine the optimal Box-Cox \(\lambda\). Is a transformation needed?

6 Encoding Categorical Variables

6.1 Introduction

Most statistical models and machine learning algorithms operate on numeric data. Encoding converts categorical variables into numeric representations that preserve the relevant information while meeting the mathematical requirements of the algorithm. Choosing the wrong encoding can introduce artificial ordering, create near-singular matrices, or lose important information about category relationships.

6.2 Theory

6.2.1 Dummy Coding (Treatment Coding)

For a categorical variable with \(k\) levels, create \(k-1\) binary (0/1) dummy variables, leaving one category as the reference (baseline):

Category D1 (Arts) D2 (Business)
Science (reference) 0 0
Arts 1 0
Business 0 1

The coefficient for each dummy variable represents the difference from the reference category. This is the standard encoding for linear regression and ANOVA — using \(k\) dummies causes perfect multicollinearity (dummy variable trap).

6.2.2 One-Hot Encoding

Create \(k\) binary variables, one per category — each is 1 if the observation belongs to that category, 0 otherwise. Unlike dummy coding, no reference category is dropped.

Used in machine learning algorithms (tree-based models, neural networks) that do not suffer from multicollinearity. Problematic for linear regression without regularization.

6.2.3 Ordinal Encoding

Assign consecutive integers to ordered categories:

\[\text{Low} \to 1, \quad \text{Medium} \to 2, \quad \text{High} \to 3\]

Valid for ordinal variables where the ordering is meaningful and the assumption of equal spacing is reasonable. Never apply ordinal encoding to nominal variables — it imposes a spurious ordering.

6.2.4 Target Encoding (Mean Encoding)

Replace each category with the mean of the target variable for that category:

\[\text{Encoded value for category } c = \frac{1}{n_c}\sum_{i: x_i = c} y_i\]

Powerful for high-cardinality categorical variables (many categories), but prone to overfitting — requires smoothing or cross-validation. Should never be applied using the full dataset; use out-of-fold means in a cross-validation framework.

6.2.5 Binary Encoding

Convert the integer-encoded category to binary representation using log₂(k) bits. More compact than one-hot for high-cardinality variables.

6.2.6 Encoding Selection Guide

Categorical encoding selection guide
Variable Type Recommended Encoding Notes
Nominal, few levels (\(k \leq 10\)) Dummy (regression) or One-hot (ML) Drop reference for regression
Nominal, many levels (\(k > 10\)) Target or Binary encoding Avoid one-hot explosion
Ordinal, equal spacing Ordinal integer Verify spacing assumption
Ordinal, unequal spacing Dummy or contrast coding Preserves non-linearity
Binary Keep as 0/1 No encoding needed

6.3 Example: Encoding Faculty Variable

Example 6.5. A dataset contains faculty (Science, Arts, Business) and outcome satisfaction. For a linear regression:

  • Dummy coding (reference = Science): Creates Arts (1/0) and Business (1/0). Coefficients represent mean satisfaction difference from Science.
  • One-hot (for a random forest): Creates Science, Arts, Business — all three binary columns.
  • Ordinal (inappropriate here): Would assign 1, 2, 3 — implying Arts is “twice” Science, which is meaningless.

6.4 R Example: Categorical Encoding

# --- Build example dataset ---
set.seed(42)
n <- 200
student_data <- data.frame(
  faculty      = sample(c("Science","Arts","Business"),
                         n, replace = TRUE,
                         prob = c(0.4, 0.35, 0.25)),
  year         = sample(c("First","Second","Third","Fourth"),
                         n, replace = TRUE),
  satisfaction = NA
)
student_data$satisfaction <-
  60 +
  ifelse(student_data$faculty == "Science", 10,
  ifelse(student_data$faculty == "Arts",    7, 0)) +
  ifelse(student_data$year == "Fourth", 8,
  ifelse(student_data$year == "Third",  5,
  ifelse(student_data$year == "Second", 3, 0))) +
  rnorm(n, 0, 8)

cat("Original data (first 6 rows):\n")
Original data (first 6 rows):
head(student_data)
   faculty   year satisfaction
1 Business Fourth     51.99257
2 Business Second     65.67022
3  Science  First     79.37060
4 Business Fourth     84.47631
5     Arts  Third     60.98511
6     Arts  First     57.79316
# === 1. DUMMY CODING (reference = Business) ===
student_data$faculty <- factor(student_data$faculty,
                                levels = c("Business",
                                           "Science","Arts"))
dummy_encoded <- model.matrix(~ faculty, data = student_data)
cat("\nDummy coding (first 6 rows):\n")

Dummy coding (first 6 rows):
head(dummy_encoded)
  (Intercept) facultyScience facultyArts
1           1              0           0
2           1              0           0
3           1              1           0
4           1              0           0
5           1              0           1
6           1              0           1
# === 2. ONE-HOT ENCODING ===
onehot_encoded <- model.matrix(~ faculty - 1,
                                data = student_data)
cat("One-hot encoding (first 6 rows):\n")
One-hot encoding (first 6 rows):
head(onehot_encoded)
  facultyBusiness facultyScience facultyArts
1               1              0           0
2               1              0           0
3               0              1           0
4               1              0           0
5               0              0           1
6               0              0           1
# === 3. ORDINAL ENCODING ===
student_data$year_ordinal <- as.integer(factor(
  student_data$year,
  levels  = c("First","Second","Third","Fourth"),
  ordered = TRUE
))

cat("Ordinal encoding (year):\n")
Ordinal encoding (year):
head(student_data[, c("year","year_ordinal")])
    year year_ordinal
1 Fourth            4
2 Second            2
3  First            1
4 Fourth            4
5  Third            3
6  First            1
# === 4. TARGET ENCODING ===
faculty_means <- student_data |>
  group_by(faculty) |>
  summarise(target_enc = mean(satisfaction), .groups = "drop")

student_data <- student_data |>
  left_join(faculty_means, by = "faculty")

cat("\nTarget encoding (faculty → mean satisfaction):\n")

Target encoding (faculty → mean satisfaction):
print(faculty_means)
# A tibble: 3 × 2
  faculty  target_enc
  <fct>         <dbl>
1 Business       65.1
2 Science        73.9
3 Arts           69.5
# Compare: dummy regression vs. target-encoded model
model_dummy  <- lm(satisfaction ~ faculty,
                    data = student_data)
model_target <- lm(satisfaction ~ target_enc,
                    data = student_data)

cat("\nDummy model R²:  ",
    round(summary(model_dummy)$r.squared, 4), "\n")

Dummy model R²:   0.1669 
cat("Target enc. R²: ",
    round(summary(model_target)$r.squared, 4), "\n")
Target enc. R²:  0.1669 

Code explanation:

  • model.matrix(~ factor) creates dummy-coded columns automatically, dropping the reference level. model.matrix(~ factor - 1) creates one-hot encoding (no reference dropped).
  • factor(..., ordered = TRUE) followed by as.integer() implements ordinal encoding, preserving the category order.
  • Target encoding is implemented via group_by() |> summarise() + left_join(). In production, this should be done within cross-validation folds to prevent data leakage.

6.5 Exercises

TipExercise 6.7

Using the iris dataset:

  1. Apply dummy coding to Species (reference = setosa). Verify with model.matrix().
  2. Apply one-hot encoding. Confirm you have 3 columns instead of 2.
  3. Fit a linear regression of Petal.Length ~ Species using dummy encoding. Interpret each coefficient.
  4. When would you prefer target encoding over one-hot for the Species variable?

7 Feature Scaling

7.1 Introduction

Many statistical and machine learning algorithms are sensitive to the scale of input features. K-nearest neighbors computes Euclidean distances — a variable measured in thousands will dominate one measured in single digits. Gradient descent converges faster when features are on similar scales. Principal Component Analysis (Chapter 7) is entirely scale-dependent. Feature scaling brings all features to a comparable range without changing their distributional shape.

7.2 Theory

7.2.1 Min-Max Normalization

Scales each feature to the range \([0, 1]\):

\[x'_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}\]

Properties: - Preserves the shape of the original distribution - Sensitive to outliers (extreme values compress all others toward 0 or 1) - Appropriate for algorithms requiring bounded input (neural networks with sigmoid activation)

7.2.2 Z-Score Standardization

Centers the variable at 0 and scales to unit variance:

\[x'_i = \frac{x_i - \bar{x}}{s}\]

Properties: - Transformed variable has mean = 0, SD = 1 - Does not bound the range (values can exceed \(\pm 3\)) - Robust to different scales; appropriate for most algorithms - Assumes approximately normal distribution for optimal behavior

7.2.3 Robust Scaling

Uses the median and IQR instead of mean and SD:

\[x'_i = \frac{x_i - \tilde{x}}{\text{IQR}}\]

Properties: - Robust to outliers (median and IQR are not affected by extremes) - Appropriate when data contains outliers that cannot be removed

7.2.4 MaxAbs Scaling

Divides by the maximum absolute value, scaling to \([-1, 1]\):

\[x'_i = \frac{x_i}{\max(|x|)}\]

Useful for sparse data (preserves zero entries).

7.2.5 When to Scale

Feature scaling requirements by algorithm
Algorithm Scaling Needed? Recommended Method
Linear/logistic regression Yes (for comparability) Z-score
K-nearest neighbors Yes (essential) Z-score or min-max
PCA Yes (essential) Z-score
Decision trees / Random forests No None
Neural networks Yes Min-max or Z-score
SVM Yes (essential) Z-score or min-max
K-means clustering Yes (essential) Z-score

7.3 Example: Effect of Scaling on KNN

Example 6.6. A dataset contains age (range: 20–70 years) and income (range: 20,000–500,000 THB). Without scaling, the Euclidean distance between two observations is dominated entirely by income differences — a 1-year age difference (Δ = 1) is dwarfed by a 1,000 THB income difference. After Z-score standardization, both variables contribute equally to the distance metric.

7.4 R Example: Feature Scaling

# --- Compare scaling methods ---
data(mtcars)
vars_to_scale <- mtcars[, c("mpg","hp","wt","disp")]

# Z-score standardization
z_scaled <- scale(vars_to_scale)

# Min-max normalization
minmax_scale <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
minmax_scaled <- apply(vars_to_scale, 2, minmax_scale)

# Robust scaling
robust_scale <- function(x) {
  (x - median(x)) / IQR(x)
}
robust_scaled <- apply(vars_to_scale, 2, robust_scale)

# Summary comparison
scaling_summary <- function(mat, name) {
  data.frame(
    Method    = name,
    Variable  = colnames(mat),
    Mean      = round(colMeans(mat), 4),
    SD        = round(apply(mat, 2, sd), 4),
    Min       = round(apply(mat, 2, min), 4),
    Max       = round(apply(mat, 2, max), 4)
  )
}

bind_rows(
  scaling_summary(as.data.frame(vars_to_scale), "Original"),
  scaling_summary(as.data.frame(z_scaled),      "Z-Score"),
  scaling_summary(as.data.frame(minmax_scaled),  "Min-Max"),
  scaling_summary(as.data.frame(robust_scaled),  "Robust")
) |>
  filter(Variable == "hp") |>
  kable(caption   = "Scaling Methods Comparison: hp variable",
        col.names = c("Method","Variable","Mean",
                      "SD","Min","Max")) |>
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE)
Scaling Methods Comparison: hp variable
Method Variable Mean SD Min Max
hp...1 Original hp 146.6875 68.5629 52.0000 335.0000
hp...2 Z-Score hp 0.0000 1.0000 -1.3810 2.7466
hp...3 Min-Max hp 0.3346 0.2423 0.0000 1.0000
hp...4 Robust hp 0.2837 0.8211 -0.8503 2.5389
# --- Visualize effect of scaling on all variables ---
orig_long <- as.data.frame(vars_to_scale) |>
  mutate(Method = "Original") |>
  pivot_longer(-Method, names_to = "Variable",
               values_to = "Value")

z_long <- as.data.frame(z_scaled) |>
  mutate(Method = "Z-Score") |>
  pivot_longer(-Method, names_to = "Variable",
               values_to = "Value")

mm_long <- as.data.frame(minmax_scaled) |>
  mutate(Method = "Min-Max") |>
  pivot_longer(-Method, names_to = "Variable",
               values_to = "Value")

all_scaled <- bind_rows(orig_long, z_long, mm_long)
all_scaled$Method <- factor(all_scaled$Method,
                              levels = c("Original",
                                         "Z-Score","Min-Max"))

ggplot(all_scaled, aes(x = Variable, y = Value,
                        fill = Variable)) +
  geom_boxplot(alpha = 0.7, outlier.size = 1) +
  scale_fill_brewer(palette = "Set2") +
  facet_wrap(~Method, scales = "free_y", ncol = 1) +
  labs(title    = "Feature Scaling: Original vs. Z-Score vs. Min-Max",
       subtitle = "Z-score centers at 0; min-max bounds to [0,1]",
       x        = "Variable", y = "Value") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

Code explanation:

  • scale(data) performs Z-score standardization for all columns simultaneously — the most concise method in R.
  • apply(mat, 2, fun) applies a function column-wise (2 = columns, 1 = rows) — used to apply custom min-max and robust scaling functions.
  • The three-panel boxplot immediately shows why scaling is necessary: in the “Original” panel, disp and hp dwarf mpg and wt — after scaling, all variables are comparable.

7.5 Exercises

TipExercise 6.8

Using the iris dataset (numeric variables only):

  1. Apply Z-score, min-max, and robust scaling to all four variables.
  2. For each scaling method, verify: (i) mean after Z-score ≈ 0, (ii) range after min-max ∈ [0,1].
  3. Perform K-means clustering (\(k = 3\)) on unscaled vs. Z-scored data. Do the cluster assignments differ? Which matches the true species grouping better?

8 Data Integration and Reshaping

8.1 Introduction

Real-world data rarely arrives in a single, perfectly structured table. Data scientists routinely work with multiple source tables that must be merged, and with data that is organized in ways that do not match the format required for analysis. Tidy data principles define the target structure, and reshaping operations transform data between wide and long formats. Mastering these skills is essential for the practical data pipeline.

8.2 Theory

8.2.1 Tidy Data Principles

Tidy data (Wickham, 2014) follows three rules:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each observational unit forms a table.

Most raw data violates at least one of these rules — column headers are values (not variable names), multiple variables are stored in one column, or variables are stored in both rows and columns.

8.2.2 Pivoting: Wide ↔︎ Long

Wide format: Each time point or category is a separate column. Common in spreadsheets and data entry.

Long format: Each observation has its own row, with a column for the category identifier. Required by most R visualization and modeling functions.

Wide:                    Long:
id | week1 | week2       id | week | score
1  | 80    | 85          1  | 1    | 80
2  | 72    | 78          1  | 2    | 85
                         2  | 1    | 72
                         2  | 2    | 78

8.2.3 Joining Tables

dplyr join types
Join Type Result
inner_join Only rows with matches in both tables
left_join All rows from left table; NAs for non-matches from right
right_join All rows from right table; NAs for non-matches from left
full_join All rows from both; NAs where no match
anti_join Rows in left with NO match in right

8.2.4 String and Date Handling

Common integration tasks include:

  • Parsing dates with lubridate::ymd(), dmy(), mdy()
  • Cleaning strings with stringr::str_trim(), str_to_lower(), str_replace()
  • Splitting compound columns with tidyr::separate()
  • Uniting columns with tidyr::unite()

8.3 Example: Reshaping for Analysis

Example 6.7. A student performance dataset is stored in wide format with columns score_Q1, score_Q2, score_Q3, score_Q4. For a repeated-measures analysis (Chapter 3) or a time series plot, we need long format with columns quarter and score. pivot_longer() achieves this in one step.

8.4 R Example: Data Integration and Reshaping

# --- Create a wide-format dataset ---
set.seed(42)
student_wide <- data.frame(
  student_id = 1:20,
  faculty    = sample(c("Science","Arts","Business"),
                       20, replace = TRUE),
  score_Q1   = round(rnorm(20, 70, 10)),
  score_Q2   = round(rnorm(20, 72, 9)),
  score_Q3   = round(rnorm(20, 75, 8)),
  score_Q4   = round(rnorm(20, 78, 7))
)

cat("Wide format (first 4 rows):\n")
Wide format (first 4 rows):
head(student_wide, 4)
  student_id faculty score_Q1 score_Q2 score_Q3 score_Q4
1          1 Science       70       64       89       84
2          2 Science       83       81       80       89
3          3 Science       80       66       52       71
4          4 Science       79       62       83       75
# --- Pivot to long format ---
student_long <- student_wide |>
  pivot_longer(
    cols      = starts_with("score_"),
    names_to  = "quarter",
    values_to = "score"
  ) |>
  mutate(quarter = str_remove(quarter, "score_"))

cat("\nLong format (first 8 rows):\n")

Long format (first 8 rows):
head(student_long, 8)
# A tibble: 8 × 4
  student_id faculty quarter score
       <int> <chr>   <chr>   <dbl>
1          1 Science Q1         70
2          1 Science Q2         64
3          1 Science Q3         89
4          1 Science Q4         84
5          2 Science Q1         83
6          2 Science Q2         81
7          2 Science Q3         80
8          2 Science Q4         89
# --- Pivot back to wide ---
student_back_wide <- student_long |>
  pivot_wider(
    names_from  = quarter,
    values_from = score,
    names_prefix = "score_"
  )
cat("\nReturned to wide (first 4 rows):\n")

Returned to wide (first 4 rows):
head(student_back_wide, 4)
# A tibble: 4 × 6
  student_id faculty score_Q1 score_Q2 score_Q3 score_Q4
       <int> <chr>      <dbl>    <dbl>    <dbl>    <dbl>
1          1 Science       70       64       89       84
2          2 Science       83       81       80       89
3          3 Science       80       66       52       71
4          4 Science       79       62       83       75
# --- Joining tables ---
# Faculty information table
faculty_info <- data.frame(
  faculty     = c("Science","Arts","Business"),
  dean        = c("Dr. Smith","Dr. Lee","Dr. Park"),
  building    = c("Building A","Building B","Building C")
)

# Left join: keep all students, add faculty info
student_full <- student_wide |>
  left_join(faculty_info, by = "faculty")

cat("\nAfter left join (first 4 rows):\n")

After left join (first 4 rows):
head(student_full[, c("student_id","faculty",
                        "dean","score_Q1")], 4)
  student_id faculty      dean score_Q1
1          1 Science Dr. Smith       70
2          2 Science Dr. Smith       83
3          3 Science Dr. Smith       80
4          4 Science Dr. Smith       79
# --- Visualize trends in long format ---
student_long |>
  group_by(faculty, quarter) |>
  summarise(mean_score = mean(score),
            se         = sd(score)/sqrt(n()),
            .groups    = "drop") |>
  ggplot(aes(x = quarter, y = mean_score,
             group = faculty, color = faculty)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_score - se,
                     ymax = mean_score + se),
                width = 0.1) +
  scale_color_brewer(palette = "Set1") +
  labs(title    = "Mean Score by Quarter and Faculty",
       subtitle = "Long format enables grouped time-series visualization",
       x        = "Quarter", y = "Mean Score",
       color    = "Faculty") +
  theme_minimal(base_size = 13)

Code explanation:

  • pivot_longer(cols, names_to, values_to) converts wide to long format. starts_with("score_") selects all score columns by prefix.
  • str_remove(quarter, "score_") cleans the quarter labels from “score_Q1” to “Q1” — a common post-pivot cleanup step.
  • pivot_wider(names_from, values_from) reverses the operation — useful for creating summary tables after analysis.
  • left_join(y, by) merges two tables on a key variable, preserving all rows from the left table.

8.5 Exercises

TipExercise 6.9

The built-in tidyr::who dataset contains tuberculosis (TB) counts in wide format with many messy column names encoding country, year, age group, sex, and diagnosis method.

  1. Use pivot_longer() to convert to long format.
  2. Use separate() to split the messy column names into meaningful variables.
  3. Filter to Thailand (country == "Thailand") and plot TB counts over time by sex and age group.
TipExercise 6.10

Create two data frames: one with student demographics (id, name, program) and one with exam scores (id, exam, score). Simulate 5 students and 3 exams each.

  1. Inner join the two tables and explain what happens if a student has no exam record.
  2. Left join and full join — how do the results differ?
  3. Pivot the joined data so each exam is a column. Compute the average score per student.

9 Chapter Lab Activity: Full Preprocessing Pipeline with msleep

9.1 Objectives

In this lab you will apply the complete preprocessing pipeline — from initial inspection through missing data handling, outlier treatment, transformation, encoding, scaling, and reshaping — to the msleep dataset (mammal sleep data from ggplot2). This mirrors a real data science workflow where raw data must be made analysis-ready before modeling.

9.2 The Dataset

# msleep: sleep patterns of 83 mammals
data(msleep, package = "ggplot2")

cat("Dimensions:", nrow(msleep), "rows ×",
    ncol(msleep), "columns\n\n")
Dimensions: 83 rows × 11 columns
# Overview
glimpse(msleep)
Rows: 83
Columns: 11
$ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater shor…
$ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "Bra…
$ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "carn…
$ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "Art…
$ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "dome…
$ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0, 5…
$ sleep_rem    <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8, …
$ sleep_cycle  <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333, N…
$ awake        <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0, 1…
$ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000, 0…
$ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.04…
# Missing value map
miss_summary_lab <- data.frame(
  Variable    = names(msleep),
  Type        = sapply(msleep, class),
  N_Missing   = colSums(is.na(msleep)),
  Pct_Missing = round(colMeans(is.na(msleep)) * 100, 1)
) |> filter(N_Missing > 0)
rownames(miss_summary_lab) <- NULL

kable(miss_summary_lab,
      caption   = "Variables with Missing Data — msleep",
      col.names = c("Variable","Type",
                    "N Missing","% Missing")) |>
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE) |>
  column_spec(4, bold = TRUE, color = "tomato")
Variables with Missing Data — msleep
Variable Type N Missing % Missing
vore character 7 8.4
conservation character 29 34.9
sleep_rem numeric 22 26.5
sleep_cycle numeric 51 61.4
brainwt numeric 27 32.5

9.3 Lab Task 1: Missing Data Handling

# Impute numeric missing values with median (simple approach)
# Impute sleep_rem and brainwt using mice
msleep_num <- msleep |>
  dplyr::select(sleep_total, sleep_rem, sleep_cycle, awake, brainwt, bodywt)

set.seed(42)
mice_lab <- mice(msleep_num, m = 3,
                  method = "pmm", printFlag = FALSE)
msleep_imputed_num <- complete(mice_lab, 1)

# Attach back with non-numeric columns
msleep_proc <- msleep |>
  dplyr::select(name, genus, vore, order, conservation) |>
  bind_cols(msleep_imputed_num)

cat("Missing values after imputation:\n")
Missing values after imputation:
print(colSums(is.na(msleep_proc)))
        name        genus         vore        order conservation  sleep_total 
           0            0            7            0           29            0 
   sleep_rem  sleep_cycle        awake      brainwt       bodywt 
           0            0            0            0            0 

9.4 Lab Task 2: Outlier Detection

# IQR outlier detection for bodywt and brainwt
detect_outliers_iqr <- function(x, var_name) {
  Q1  <- quantile(x, 0.25, na.rm = TRUE)
  Q3  <- quantile(x, 0.75, na.rm = TRUE)
  IQR_v <- Q3 - Q1
  flags <- x < Q1 - 1.5*IQR_v | x > Q3 + 1.5*IQR_v
  data.frame(
    Variable = var_name,
    N_Outliers = sum(flags, na.rm = TRUE),
    Outlier_Values = paste(
      round(sort(x[flags]), 1), collapse = ", "
    )
  )
}

bind_rows(
  detect_outliers_iqr(msleep_proc$bodywt,  "bodywt"),
  detect_outliers_iqr(msleep_proc$brainwt, "brainwt"),
  detect_outliers_iqr(msleep_proc$sleep_total, "sleep_total")
) |>
  kable(caption = "IQR Outlier Detection — msleep") |>
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE)
IQR Outlier Detection — msleep
Variable N_Outliers Outlier_Values
bodywt 11 161.5, 162.6, 173.3, 187, 207.5, 521, 600, 800, 900, 2547, 6654
brainwt 11 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.4, 0.7, 1.3, 4.6, 5.7
sleep_total 0

9.5 Lab Task 3: Transformation

# bodywt and brainwt are extremely right-skewed — apply log
msleep_proc <- msleep_proc |>
  mutate(
    log_bodywt  = log(bodywt  + 1),
    log_brainwt = log(brainwt + 1)
  )

transform_check <- data.frame(
  Variable      = c("bodywt","log(bodywt+1)",
                     "brainwt","log(brainwt+1)"),
  Skewness      = round(c(
    moments::skewness(msleep_proc$bodywt),
    moments::skewness(msleep_proc$log_bodywt),
    moments::skewness(msleep_proc$brainwt),
    moments::skewness(msleep_proc$log_brainwt)
  ), 3)
)

kable(transform_check,
      caption = "Skewness Before and After Log Transformation") |>
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE) |>
  column_spec(2, bold = TRUE,
              color = ifelse(
                abs(transform_check$Skewness) < 1,
                "darkgreen","tomato"))
Skewness Before and After Log Transformation
Variable Skewness
bodywt 7.230
log(bodywt+1) 1.144
brainwt 5.836
log(brainwt+1) 4.575

9.6 Lab Task 4: Encoding and Scaling

# Encode vore (dietary classification) as dummy variables
msleep_proc$vore <- factor(
  msleep_proc$vore,
  levels = c("herbi","carni","omni","insecti")
)
msleep_proc$vore[is.na(msleep_proc$vore)] <- "herbi"

# Dummy encode (reference = herbi)
vore_dummies <- model.matrix(~ vore, data = msleep_proc)

# Z-score scale numeric features
features_to_scale <- msleep_proc |>
  dplyr::select(sleep_total, sleep_rem, awake,
         log_bodywt, log_brainwt)

scaled_features <- as.data.frame(scale(features_to_scale))
names(scaled_features) <- paste0(names(scaled_features),
                                   "_scaled")

# Verify scaling
kable(data.frame(
  Variable = names(features_to_scale),
  Mean_Before = round(colMeans(features_to_scale), 3),
  SD_Before   = round(apply(features_to_scale, 2, sd), 3),
  Mean_After  = round(colMeans(scaled_features), 3),
  SD_After    = round(apply(scaled_features, 2, sd), 3)
),
caption   = "Z-Score Scaling Verification",
col.names = c("Variable","Mean Before","SD Before",
              "Mean After","SD After")) |>
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE)
Z-Score Scaling Verification
Variable Mean Before SD Before Mean After SD After
sleep_total sleep_total 10.434 4.450 0 1
sleep_rem sleep_rem 1.913 1.184 0 1
awake awake 13.567 4.452 0 1
log_bodywt log_bodywt 1.951 2.208 0 1
log_brainwt log_brainwt 0.119 0.302 0 1

9.7 Lab Task 5: Final Clean Dataset and Visualization

# Assemble final preprocessed dataset
msleep_final <- bind_cols(
  msleep_proc |> dplyr::select(name, vore, order),
  scaled_features
)

cat("Final preprocessed dataset dimensions:",
    nrow(msleep_final), "×", ncol(msleep_final), "\n\n")
Final preprocessed dataset dimensions: 83 × 8 
cat("No missing values:", sum(is.na(msleep_final)) == 0, "\n")
No missing values: TRUE 
# Visualize the full pipeline outcome
p1 <- ggplot(msleep_proc, aes(x = bodywt)) +
  geom_histogram(bins  = 20, fill  = "tomato",
                 color = "white", alpha = 0.8) +
  labs(title = "A. bodywt: Before (raw)", x = "bodywt") +
  theme_minimal(base_size = 11)

p2 <- ggplot(msleep_proc, aes(x = log_bodywt)) +
  geom_histogram(bins  = 20, fill  = "seagreen",
                 color = "white", alpha = 0.8) +
  labs(title = "B. bodywt: After log transform",
       x = "log(bodywt+1)") +
  theme_minimal(base_size = 11)

p3 <- ggplot(scaled_features,
             aes(x = sleep_total_scaled)) +
  geom_histogram(bins  = 20, fill  = "steelblue",
                 color = "white", alpha = 0.8) +
  labs(title = "C. sleep_total: After Z-scaling",
       x = "sleep_total (scaled)") +
  theme_minimal(base_size = 11)

p4 <- msleep_proc |>
  filter(!is.na(vore)) |>
  ggplot(aes(x = vore, y = log_bodywt, fill = vore)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "D. log(bodywt) by Diet Type",
       x = "Diet (vore)", y = "log(bodywt+1)") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none")

(p1 + p2) / (p3 + p4)

9.8 Lab Discussion Questions

Answer the following in writing (100–150 words each):

  1. Missing Data Strategy: brainwt has 27% missing values. Is multiple imputation appropriate here, or would you recommend a different strategy? What type of missingness (MCAR/MAR/MNAR) do you suspect for brain weight in mammals?

  2. Outlier Decision: The elephant has extreme bodywt and brainwt values flagged as outliers. Should it be removed? What are the consequences of removal for a study of mammal sleep patterns?

  3. Transformation Justification: Why is log transformation more appropriate than square root for bodywt? Use the skewness values from Lab Task 3 to support your answer.

  4. Encoding Choice: vore has 4 levels including some NAs. Justify your reference category choice for dummy encoding. How would target encoding differ, and what leakage risk would it introduce?

  5. Scaling and Algorithms: If you were to cluster mammals by sleep patterns using K-means (Chapter 8), which preprocessing steps from this lab are essential, and which are optional? Justify each decision.


10 Chapter Summary

This chapter built a complete data preprocessing toolkit essential for any data science project:

  • Why preprocessing matters — the GIGO principle; each step in the pipeline has a distinct purpose and must be applied in the correct order; the leakage problem requires all fitting to occur on training data only.
  • Missing data — MCAR, MAR, and MNAR require different strategies; multiple imputation is the gold standard under MAR; no method fully corrects MNAR.
  • Outlier detection — IQR and modified z-score for univariate detection; Mahalanobis distance for multivariate; treatment depends on whether outliers are errors or genuine extreme values.
  • Transformation — log and Box-Cox correct right skew and stabilize variance; Yeo-Johnson handles negative values; transformation changes shape, not location or spread.
  • Encoding — dummy coding for regression (drop reference); one-hot for ML (no reference dropped); ordinal for ordered categories; target encoding for high cardinality with cross-validation.
  • Feature scaling — Z-score for most algorithms; min-max for bounded ranges; robust scaling when outliers are present; tree-based models do not require scaling.
  • Data integration and reshaping — tidy data principles; pivot_longer/wider for format conversion; left/inner/full_join for table merging.
ImportantKey Formulas to Know

IQR Outlier Fences: \[\text{Lower} = Q_1 - 1.5 \times \text{IQR}, \quad \text{Upper} = Q_3 + 1.5 \times \text{IQR}\]

Mahalanobis Distance: \[D^2_i = (\mathbf{x}_i - \bar{\mathbf{x}})^T \mathbf{S}^{-1} (\mathbf{x}_i - \bar{\mathbf{x}}) \sim \chi^2(p)\]

Box-Cox Transformation: \[x'(\lambda) = \begin{cases} (x^\lambda - 1)/\lambda & \lambda \neq 0 \\ \log(x) & \lambda = 0 \end{cases}\]

Z-Score Standardization: \[x'_i = \frac{x_i - \bar{x}}{s}\]

Min-Max Normalization: \[x'_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}\]

Post-Stratification Weight: \[w_i = \frac{p_i^{\text{pop}}}{p_i^{\text{sample}}}\]


End of Chapter 6. Proceed to Chapter 7: Data Dimension Reduction.