Step 1: Randomly Split Data into Training and Test Sets

# Load the data set
setwd("C:/DDS 8501 Titanic")
titanic <- read.csv("train.csv")

# Select quantitative variables and remove missing values
titanic_quant <- titanic[, c("Age", "Fare", "SibSp", "Parch")]
titanic_quant <- na.omit(titanic_quant)

# Set seed for reproducibility
set.seed(42)

# Perform an 80/20 train-test split
train_indices <- sample(seq_len(nrow(titanic_quant)), size = floor(0.8 * nrow(titanic_quant)))
train_set <- titanic_quant[train_indices, ]
test_set <- titanic_quant[-train_indices, ]

Explanation:
Using set.seed() ensures reproducibility. The 80/20 split retains a sufficient training sample while holding out a test set for evaluating model generalization.
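A quick sanity check (assuming the split above has been run) confirms the resulting set sizes and that no observation appears in both sets:

# Verify the split: row counts and no overlap between training and test rows
nrow(train_set)   # roughly 80% of the complete cases
nrow(test_set)    # the remaining ~20%
length(intersect(rownames(train_set), rownames(test_set)))   # should be 0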

Step 2: Evaluate Transformations for Quantitative Variables

Analysis of Distributions (Training Set Only)

  1. Age
    • Distribution: Slightly right-skewed.
    • Decision: No transformation applied.
      • Age is naturally interpretable and mildly skewed. Transforming it may reduce transparency without significant modeling benefit.
  2. Fare
    • Distribution: Heavily right-skewed, with extreme outliers.
    • Decision: Log transformation recommended.
      • Logarithmic transformation (log1p) stabilizes variance and compresses outliers while maintaining ordinal relationships.
  3. SibSp
    • Distribution: Zero-inflated, with most values at 0.
    • Decision: No transformation.
      • Discrete count data is left untransformed. Decision tree–based models can handle this directly.
  4. Parch
    • Distribution: Zero-inflated and right-skewed.
    • Decision: No transformation.
      • Similar rationale as SibSp.

Test Set Not Used
Only the training set is used to evaluate transformations. Using the test set introduces data leakage and compromises the integrity of model evaluation.
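The distributional judgments above can be reproduced on the training set with basic summaries and histograms. The sketch below is one way to do so; the numeric skewness check assumes the e1071 package is installed (base summary() and hist() alone are sufficient otherwise).

# Inspect training-set distributions before deciding on transformations
summary(train_set)

# Visual check for skew and zero inflation
hist(train_set$Age,  main = "Age (training set)",  xlab = "Age")
hist(train_set$Fare, main = "Fare (training set)", xlab = "Fare")

# Optional numeric skewness check (requires the e1071 package)
library(e1071)
sapply(train_set, skewness)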

Step 3: Apply Transformations to the Training Data

# Apply log1p transformation to Fare
train_set$Fare <- log1p(train_set$Fare)

Explanation:
The log1p() function computes log(1 + x). It is preferred over log() here because log(0) is undefined (negative infinity), whereas log1p(0) = 0, so zero fares are handled cleanly. The transformation reduces skew, compresses extreme values, and improves symmetry, which benefits models that assume roughly normal or symmetric inputs.
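A brief illustration of the compression effect (the fare values below are purely illustrative):

# log1p leaves a zero fare at zero and compresses large fares
log1p(0)        # 0
log1p(7.25)     # about 2.11
log1p(512.33)   # about 6.24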

Step 4: Apply Transformations to the Test Data

# Apply log1p transformation to Fare in the test set
test_set$Fare <- log1p(test_set$Fare)

Explanation:
The same transformation applied to the training data must be applied to the test data to ensure consistency. This avoids distributional mismatch between datasets, which can lead to unreliable predictions.

Step 5: Min-Max Scaling on the Training Set

# Define a min-max scaling function
min_max_scale <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Apply to training data
train_scaled <- as.data.frame(lapply(train_set, min_max_scale))

Explanation:
Min-max scaling rescales each feature to the range [0, 1]. This is important for models sensitive to magnitude (e.g., KNN, neural networks) because it prevents any single feature from dominating simply due to its units. Only the training data should be used to calculate scaling parameters, preserving test set independence.

Step 6: Apply Scaling to the Test Set

# Capture training min and max values
train_mins <- sapply(train_set, min)
train_maxs <- sapply(train_set, max)

# Apply scaling using training set parameters
test_scaled <- as.data.frame(mapply(function(x, min_val, max_val) {
  (x - min_val) / (max_val - min_val)
}, test_set, train_mins, train_maxs))

Explanation:
The test data is scaled using the same min and max values from the training set. This maintains consistency and prevents information leakage, ensuring fair model evaluation.
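One practical consequence of reusing the training parameters is that a test observation outside the training range will scale to a value slightly below 0 or above 1. The check below, and the optional clamping step (an addition not required by the workflow above), illustrate how this can be handled:

# Check whether any scaled test values fall outside [0, 1]
sapply(test_scaled, range)

# Optional: clamp out-of-range values to [0, 1] if a strict range is required
test_scaled[] <- lapply(test_scaled, function(x) pmin(pmax(x, 0), 1))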

Step 7: Interpretability of Variables for Forecasting

Transformations and scaling can introduce challenges in interpreting the original meaning of quantitative variables, particularly in applied forecasting contexts. For example, once a variable such as Fare undergoes a logarithmic transformation, its values no longer represent raw monetary amounts but instead reflect the natural log of fare plus one. This alters the scale and magnitude of differences, making it more difficult to explain model outcomes to non-technical stakeholders (James et al., 2021). Similarly, applying min-max scaling re-expresses all feature values within a [0, 1] range, effectively removing their original units and potentially diminishing interpretability for domain-specific applications.

Despite these limitations, such preprocessing operations serve a critical function in enhancing the quality of data used for predictive modeling. By reducing skewness, stabilizing variance, and harmonizing feature magnitudes, transformations and scaling contribute to improved algorithmic stability, faster convergence, and reduced model bias (Kuhn & Johnson, 2019). These effects are particularly valuable when deploying models in production environments, where consistent performance and reproducibility are essential. In short, while some interpretability may be sacrificed, the overall quality and usability of the data for machine learning tasks are substantially improved through these preprocessing steps (Han et al., 2022).
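When predictions or feature values must be reported in their original units, the preprocessing steps can be inverted. The sketch below assumes the objects created in Steps 3 through 6 (test_scaled, train_mins, and train_maxs) are still available:

# Undo min-max scaling for Fare using the training-set parameters
fare_unscaled <- test_scaled$Fare * (train_maxs["Fare"] - train_mins["Fare"]) + train_mins["Fare"]

# Undo the log1p transformation to recover fares in the original monetary units
fare_original <- expm1(fare_unscaled)
head(fare_original)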

Step 8: Importance of Transformations and Scaling

Transformations and scaling are essential preprocessing techniques in predictive modeling because they address data irregularities that can negatively impact model accuracy, stability, and interpretability. Raw data often contain skewed distributions, outliers, or features on vastly different scales—each of which can bias a model’s learning process and degrade performance (Kuhn & Johnson, 2019).

For instance, logarithmic transformations are commonly used to stabilize variance and reduce right-skewness in variables such as income or fare. These transformations bring the data closer to the approximate normality assumed by many classical statistical models, such as linear regression, and reduce the leverage of extreme values in other learners, such as support vector machines (James et al., 2021). In this project, log-transforming the Fare variable minimized the influence of extreme values while preserving the rank order of observations.

Min-max scaling brings all quantitative variables into a [0, 1] range, ensuring that features contribute equally to model training. This is particularly important for distance-based algorithms like K-nearest neighbors or gradient descent–based neural networks, which are sensitive to variable magnitude (Han et al., 2022). Without scaling, variables with larger ranges can dominate model behavior, leading to suboptimal predictions.
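A toy calculation with made-up values for two hypothetical passengers shows how an unscaled Fare difference dominates a Euclidean distance, while scaling restores balance:

# Two hypothetical passengers: similar ages, very different fares
p1 <- c(Age = 22, Fare = 7)
p2 <- c(Age = 38, Fare = 512)

# Unscaled: the Fare difference dominates the distance
sqrt(sum((p1 - p2)^2))    # about 505

# Scaled to [0, 1] (assuming an Age range of 0-80 and a Fare range of 0-512 for illustration)
p1_s <- c(Age = 22 / 80, Fare = 7 / 512)
p2_s <- c(Age = 38 / 80, Fare = 512 / 512)
sqrt(sum((p1_s - p2_s)^2))    # about 1.01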

Moreover, consistent application of transformations and scaling improves model generalization and supports reproducibility. While some interpretability is sacrificed—especially with transformed and rescaled variables—the benefit of more reliable and accurate predictions typically outweighs this cost. In summary, transformations correct for skewed or heavy-tailed distributions, while scaling harmonizes variable magnitudes. Together, they enhance the performance and stability of predictive models by creating a cleaner, more uniform feature space for learning.

Step 9: Summary of Data Preprocessing

The preprocessing phase involved a series of structured, methodologically justified steps to prepare the Titanic dataset for predictive modeling. First, the dataset was randomly split into an 80% training set and a 20% test set using set.seed() for reproducibility. Only the training set was used to evaluate distributions and determine the necessity of transformations to prevent data leakage.

Exploratory analysis revealed that the Fare variable exhibited substantial right skew and extreme outliers. A logarithmic transformation (log1p) was applied to reduce this skew and improve normality. This approach aligns with best practices in data preprocessing, as transforming skewed data can enhance model performance by stabilizing variance and making the data more normally distributed (James et al., 2021). Other variables—Age, SibSp, and Parch—were retained in their original form based on their mild skewness or discrete count nature, ensuring interpretability and modeling integrity.

Min-max scaling was subsequently applied to all quantitative variables in the training set to standardize feature ranges between 0 and 1. This scaling technique is particularly beneficial for algorithms sensitive to the magnitude of input features, such as K-nearest neighbors and neural networks, as it ensures that each feature contributes equally to the model’s learning process (Han et al., 2022). The same scaling method was applied to the test set using the training set’s minimum and maximum values to preserve consistency and avoid data leakage.

Each step in this preprocessing workflow was selected to enhance model performance, stability, and generalizability. The transformations improved distributional properties, while scaling ensured equal contribution of all features during model training. These techniques together created a robust and clean dataset aligned with best practices in predictive analytics (Kuhn & Johnson, 2019).

References

  • Han, J., Pei, J., & Kamber, M. (2022). Data mining: Concepts and techniques (4th ed.). Morgan Kaufmann.
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning with applications in R (2nd ed.). Springer.
  • Kuhn, M., & Johnson, K. (2019). Feature engineering and selection: A practical approach for predictive models. CRC Press.