Split, Transform and Scale the Data Set - Titanic.csv


1. Prepare the Data (Load, Filter, Clean).


setwd("/Users/whinton/src/rstudio/tim8501")
df <- read.csv("titanic.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
#df <- titanic ## make copy of original dataset to data frame df

Show Initial Missing Values.

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2
## ####################################################################################
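The counts above are the kind of result a per-column NA tally produces. The following is a minimal sketch, not the code used in this study: the `toy` data frame and its values are hypothetical, and it assumes blank strings (as in Cabin and Embarked) should be treated as missing.

```r
# Toy stand-in for the Titanic data (hypothetical values, not the real file)
toy <- data.frame(
  Age      = c(22, NA, 26, NA),
  Cabin    = c("", "C85", "", ""),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# Assumption: empty strings in character columns count as missing
toy[] <- lapply(toy, function(col) {
  if (is.character(col)) col[col %in% ""] <- NA
  col
})

# Number of missing values per column
missing_counts <- colSums(is.na(toy))
print(missing_counts)
```

When reading from a file, passing `na.strings = c("", "NA")` to `read.csv()` achieves the same conversion at load time.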

Perform Pre-processing, Imputation and Show Filtered/Cleaned Data.

## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0        0        0        0        0        0        0
## ####################################################################################
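The cleaning step can be sketched as follows. This is an illustration under stated assumptions rather than the study's own functions: the `raw` data frame is hypothetical, and mean imputation for Age and most-frequent-value imputation for Embarked are assumed choices. The dropped columns match those absent from the output above.

```r
# Toy stand-in for the raw data (hypothetical values)
raw <- data.frame(
  PassengerId = 1:5,
  Survived    = c(0, 1, 1, 0, 0),
  Age         = c(22, NA, 26, 35, NA),
  Cabin       = c(NA, "C85", NA, NA, NA),
  Embarked    = c("S", "C", NA, "S", "S"),
  stringsAsFactors = FALSE
)

# Drop identifier-like and mostly-missing columns
drop_cols <- c("PassengerId", "Name", "Ticket", "Cabin")
clean <- raw[, !(names(raw) %in% drop_cols)]

# Assumption: impute Age with the mean of the observed values
clean$Age[is.na(clean$Age)] <- mean(clean$Age, na.rm = TRUE)

# Assumption: impute Embarked with its most frequent value
mode_embarked <- names(which.max(table(clean$Embarked)))
clean$Embarked[is.na(clean$Embarked)] <- mode_embarked

print(colSums(is.na(clean)))
```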

2. Show Pre-Transformed Statistics of Quantitative Variables


Descriptive Statistics of Quantitative Variables

##     Var Obs Mean Median Variance St.Dev  Range   IQR Skewness Kurtosis Outliers
## 1   Age 891 29.7   29.7   169.05     13  79.58    13     0.43     3.95       66
## 2 SibSp 891 0.52      0     1.22    1.1      8     1     3.69    20.77       46
## 3 Parch 891 0.38      0     0.65   0.81      6     0     2.74    12.72      213
## 4  Fare 891 32.2  14.45  2469.44  49.69 512.33 23.09     4.78     36.2      116
## ####################################################################################
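The shape statistics in the table can be computed with a few base R helpers. This is a sketch, not the study's code: the function names and the sample vector `x` are hypothetical, the outlier count assumes the 1.5 × IQR rule, and the kurtosis is on the non-excess scale (a normal distribution scores 3), which appears to match the values reported above.

```r
# Moment-based skewness: 0 for a symmetric distribution
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based (non-excess) kurtosis: 3 for a normal distribution
kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Count points more than 1.5 * IQR beyond the quartiles (assumed rule)
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

x <- c(0, 0, 0, 0, 1, 1, 2, 8)  # hypothetical counts with a long right tail
cat("Skewness:", round(skewness(x), 2), "\n")
cat("Kurtosis:", round(kurtosis(x), 2), "\n")
cat("Outliers:", count_outliers(x), "\n")
```

Under this rule, a variable whose IQR is 0 (such as Parch) has every non-zero value counted as an outlier, which explains its large outlier count.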
## Quantiles Data Frame of Quantitative Variables
##     qAge qFare qSibSp qParch
## 0%   0.4   0.0      0      0
## 5%   6.0   7.2      0      0
## 25% 22.0   7.9      0      0
## 50% 29.7  14.5      0      0
## 75% 35.0  31.0      1      0
## 95% 54.0 112.1      3      2
## 10% 16.0   7.6      0      0
## ####################################################################################
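A quantile table of this shape can be built by applying `quantile()` to each column. The sketch below uses a hypothetical `toy` data frame and lists the probabilities in ascending order.

```r
# Probabilities of interest, in ascending order
probs <- c(0, 0.05, 0.10, 0.25, 0.50, 0.75, 0.95)

# Toy stand-in for the quantitative columns (hypothetical values)
toy <- data.frame(
  Age   = c(2, 18, 22, 30, 35, 54, 80),
  SibSp = c(0, 0, 0, 1, 1, 3, 8)
)

# One column of quantiles per variable, one row per probability
q_table <- sapply(toy, quantile, probs = probs)
print(round(q_table, 1))
```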

3. Split Data into Training and Test Sets


## Seed 2066 set for reproducibility
## Training set size: 712
## Testing set size: 179
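A seeded random split producing these sizes can be sketched as follows. The 80/20 ratio and the use of `sample()` are assumptions inferred from the reported sizes (floor(0.8 × 891) = 712), and `toy` is a hypothetical stand-in for the cleaned data frame, so the selected rows will not necessarily match those in this study.

```r
set.seed(2066)                       # seed reported in the output above
n <- 891                             # number of rows in the cleaned data
toy <- data.frame(id = seq_len(n))   # hypothetical stand-in data frame

# Assumption: 80% of rows go to training, the rest to testing
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- toy[train_idx, , drop = FALSE]
test  <- toy[-train_idx, , drop = FALSE]

cat("Training set size:", nrow(train), "\n")
cat("Testing set size:", nrow(test), "\n")
```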

4. Assess Transformations for Quantitative Variables.


## #################################################################
## Check Stats, Skewness, Kurtosis & Distributions of TRAINING Set
## #################################################################
## 
## Age:
## Minimum: 0.67
## Maximum: 80
## Mean: 29.37
## Outliers: 60
## Median: 29.699
## Variance: 165.002
## Skewness: 0.4892794
## Kurtosis: 4.086198
## 
## 
## Fare:
## Minimum: 0
## Maximum: 512.329
## Mean: 31.839
## Outliers: 92
## Median: 14.427
## Variance: 2395.699
## Skewness: 4.661264
## Kurtosis: 34.31577
## 
## 
## SibSp:
## Minimum: 0
## Maximum: 8
## Mean: 0.555
## Outliers: 43
## Median: 0
## Variance: 1.392
## Skewness: 3.577952
## Kurtosis: 19.11013
## 
## 
## Parch:
## Minimum: 0
## Maximum: 6
## Mean: 0.383
## Outliers: 170
## Median: 0
## Variance: 0.642
## Skewness: 2.649649
## Kurtosis: 12.15382
## 
## 
## Plots of Training Set for Assessment

## #################################################################
## Check Stats, Skewness, Kurtosis & Distributions of TEST Set
## #################################################################
## 
## Age:
## Minimum: 0.42
## Maximum: 74
## Mean: 31.007
## Outliers: 6
## Median: 29.699
## Variance: 184.026
## Skewness: 0.2213026
## Kurtosis: 3.589415
## 
## 
## Fare:
## Minimum: 0
## Maximum: 512.329
## Mean: 33.659
## Outliers: 24
## Median: 14.5
## Variance: 2775.187
## Skewness: 5.127016
## Kurtosis: 41.2421
## 
## 
## SibSp:
## Minimum: 0
## Maximum: 4
## Mean: 0.397
## Outliers: 3
## Median: 0
## Variance: 0.499
## Skewness: 2.622303
## Kurtosis: 12.43628
## 
## 
## Parch:
## Minimum: 0
## Maximum: 6
## Mean: 0.383
## Outliers: 170
## Median: 0
## Variance: 0.642
## Skewness: 2.649649
## Kurtosis: 12.15382
## 
## 
## Plots of TEST Set for Assessment


5. Apply Transformations to Training Set.


## 
## See Appendix for Alternative Box-Cox Transformation for (Fare).
## 
## Plots of TRANSFORMED TRAINING Set
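The report does not list which transformations were applied, so the following is only a sketch of the general pattern. It assumes a `log1p` transformation for Fare; the `transform_features` helper and the `train` rows are hypothetical.

```r
# Apply fixed, parameter-free transformations (assumed choice: log1p on Fare)
transform_features <- function(d) {
  d$Fare <- log1p(d$Fare)  # log(1 + x), defined for zero fares
  d
}

train <- data.frame(  # hypothetical rows
  Age   = c(22, 38, 26, 35),
  Fare  = c(0, 7.25, 71.28, 512.33),
  SibSp = c(1, 1, 0, 4)
)

train_t <- transform_features(train)
print(round(train_t, 3))
```

Because `log1p` has no fitted parameters, the same function can be applied unchanged to the test set.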


6. Apply Transformations to Test Set.


## 
## See Appendix for Alternative Box-Cox Transformation for (Fare).
## 
## Plots of TRANSFORMED TEST Set


7. Min-Max Scaling on the Transformed Training Set


## #########################################################
## Check First Few Rows of SCALED Vars in TRAINING Set
## #########################################################
## 
## First few rows of the scaled variables:
##     Age_scaled Fare_scaled SibSp_scaled Parch_scaled
## 717  0.4705660  0.44409922        0.000    0.0000000
## 602  0.3659286  0.01541158        0.000    0.0000000
## 513  0.4453548  0.05130978        0.000    0.0000000
## 195  0.5461994  0.05410740        0.000    0.0000000
## 230  0.3659286  0.04970769        0.375    0.1666667
## 71   0.3949326  0.02049464        0.000    0.0000000
## 
## Plot of Min-Max SCALED TRAINING Variables:
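Min-max scaling maps each value x to (x - min) / (max - min). A minimal sketch of how the `_scaled` columns could be produced is shown here; the `min_max` helper and the `train_t` values are hypothetical.

```r
# Rescale a numeric vector to the range [0, 1]
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

train_t <- data.frame(  # hypothetical values
  Age   = c(0.67, 22, 29.7, 80),
  SibSp = c(0, 0, 3, 8)
)

# Scale every column and label the results
scaled <- as.data.frame(lapply(train_t, min_max))
names(scaled) <- paste0(names(scaled), "_scaled")
print(round(scaled, 4))
```

With a training maximum of 8, a SibSp of 3 maps to 3/8 = 0.375, the same value seen for row 230 above.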


8. Min-Max Scaling on the Transformed Test Set


## #########################################################
## Check First Few Rows of SCALED TEST Set Vars
## #########################################################
## 
## First few rows of the scaled variables:
##    Age_scaled Fare_scaled SibSp_scaled Parch_scaled
## 6  0.39792223  0.01650950         0.00          0.0
## 10 0.18456102  0.05869429         0.25          0.0
## 11 0.04865453  0.03259623         0.25          0.2
## 12 0.78254961  0.05182215         0.00          0.0
## 14 0.52432726  0.06104473         0.25          1.0
## 18 0.39792223  0.02537431         0.00          0.0
## 
## Plot of Min-Max SCALED TEST set Variables:
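The 0.25 steps in SibSp_scaled above are consistent with dividing by the test set's own maximum of 4. One common alternative, sketched here with hypothetical helper names and values, is to learn the minimum and range on the training set and reuse them for the test set, so that a given raw value maps to the same scaled value in both.

```r
# Learn scaling parameters from one vector, apply them to another
fit_min_max   <- function(x) list(min = min(x), range = max(x) - min(x))
apply_min_max <- function(x, p) (x - p$min) / p$range

train_sibsp <- c(0, 1, 3, 8)  # hypothetical training values (max 8)
test_sibsp  <- c(0, 1, 4)     # hypothetical test values (max 4)

p <- fit_min_max(train_sibsp)
test_with_train_params <- apply_min_max(test_sibsp, p)
test_with_own_params   <- apply_min_max(test_sibsp, fit_min_max(test_sibsp))

print(test_with_train_params)
print(test_with_own_params)
```

With training parameters, test values outside the training range can fall below 0 or above 1, which is expected behavior rather than an error.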


9. Summary and Findings.


This part of the EDA study examines the roles that transformations and min-max scaling play in preprocessing the Titanic dataset for predictive modeling. By addressing skewness, outliers, differing scales, and convergence issues, these steps can improve the accuracy, stability, and interpretability of models. They matter most in the later stages of a machine learning workflow, where feature engineering can strongly influence the success of the final model.

Data Preparation

Data preparation is largely boilerplate at this point: reusable functions handle missing data, imputation, and filtering/cleaning. See the prior studies and papers in the References section.

Splitting the Data

Splitting the dataset into training and testing samples is crucial because it helps detect overfitting and ensures that the model’s performance is evaluated on unseen data. We train the model on one part (the training set) and test it on another (the testing set), which shows whether the model generalizes to new data rather than memorizing the training data. Testing on a separate dataset allows us to calculate reliable metrics such as accuracy, precision, recall, or root mean square error, depending on the problem type. Splitting also guards against data leakage, where information from the testing set influences the training phase and leads to overly optimistic performance estimates; this protection holds only if every preprocessing step is fitted on the training set alone. Random splitting with a set seed makes the split reproducible across runs, which supports comparison and debugging.

Assessing and Applying Transformations

Non-normality can negatively impact statistical analyses and models that assume normally distributed data. The primary indicators of non-normality are descriptive statistics (outliers, skewness, and kurtosis) and a histogram of the variable’s distribution. Below, I outline the indicators of non-normality for each variable.

  1. Age: indicators of non-normality include skewness, kurtosis, the shape of the histogram, and outliers at the minimum or maximum values.
  2. Fare: heavily right-skewed. As with Age, the indicators include skewness, outliers, and the histogram, plus a large difference between mean and median.
  3. SibSp: discrete, with small integer values. Non-normality shows up as a step-like histogram and a right skew, because most passengers travelled with no siblings or spouses (many zeros, low mean and variance) and few had higher counts.
  4. Parch: discrete and dominated by zeros. Indicators are right skewness, a histogram with a peak at 0 and a long tail, and a low mean and variance.
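One way to act on these indicators is to compare skewness before and after a few candidate transformations. The sketch below is illustrative only: the `fare` values are hypothetical, and square root and `log1p` are assumed candidates rather than the transformations chosen in this study.

```r
# Moment-based skewness: 0 for a symmetric distribution
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Hypothetical right-skewed fares with one extreme value
fare <- c(0, 7.25, 7.9, 8.05, 13, 14.45, 26, 31, 71.28, 512.33)

# Candidate transformations, all defined at zero
candidates <- list(raw = identity, sqrt = sqrt, log1p = log1p)

# Skewness of the variable under each candidate
skew_by_transform <- sapply(candidates, function(f) skewness(f(fare)))
print(round(skew_by_transform, 2))
```

The candidate whose skewness is closest to 0 is a reasonable starting point, to be confirmed with a histogram.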

Perform and Apply Min-Max Scaling

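Min-max scaling is a type of normalization that transforms all values of a variable into the range 0 to 1. It is useful for several reasons, with one caveat.

  1. Feature Comparability: variables measured on different scales end up on a common 0-to-1 scale.
  2. Improved Model Convergence: gradient-based optimizers tend to converge faster when features share a similar scale.
  3. Maintains Relationships: the rescaling is linear, so the ordering and relative spacing of values are preserved.
  4. Outlier Sensitivity: the minimum and maximum define the scale, so a single extreme value compresses the remaining values into a narrow band. This is one reason to transform skewed variables before scaling.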

Importance of Transforming and Scaling

Transformations and min-max scaling are essential preprocessing techniques when preparing the Titanic dataset (or any dataset) for predictive modeling. They help improve model performance, convergence, and interpretability in several ways. Transformations such as Box-Cox, logarithmic, square root, and power transformations are applied to normalize distributions, reduce skewness and stabilize variance.

  1. Dealing with Skewness: Variables like Fare in the Titanic dataset are strongly right-skewed, meaning a few extreme values (outliers) can dominate and distort the results of models such as linear regression or distance-based clustering. Tree-based models are largely unaffected, since they split on the ordering of feature values.

  2. Stabilizing Variance: Variance that changes across the range of a variable can negatively affect models like linear regression and gradient-based models. Square root or logarithmic transformations stabilize variance, enabling models to capture relationships better.

  3. Improved Model Assumptions: Some models make distributional assumptions (e.g., linear regression assumes normally distributed errors), and others, such as logistic regression and neural networks, often train more reliably on roughly symmetric features. Transformations help to meet these assumptions, thereby improving model reliability.

  4. Outlier Mitigation: Variables like Fare and Age often have outliers that can skew predictions. Transformations reduce the influence of outliers without removing valuable data points.

Min-max scaling normalizes feature values to a fixed range (e.g., 0 to 1). This is crucial when variables have different scales, as in the Titanic dataset. For example, Fare ranges from 0 to about 512, while Age ranges from under 1 to 80. Discrete variables like SibSp and Parch have much smaller ranges, 0 to 8 and 0 to 6 respectively.

Transformed and scaled features align better with the models’ assumptions, making results and predictions easier to explain to stakeholders. Specific examples in the Titanic dataset include the following.

  1. Age: Has a number of outliers and a slight skew. A transformation brings it closer to a normal distribution, which can improve its predictive power (e.g., when predicting survival likelihood).
  2. Fare: Right-skewed with extreme values. A Box-Cox transformation brings it closer to normal, and scaling ensures it does not dominate other features.
  3. SibSp and Parch: Count variables with small ranges. Scaling ensures fair treatment compared to continuous variables like Fare and Age.

Summary

Together, transformations and scaling can improve predictive models by enhancing performance, reducing overfitting, facilitating algorithm selection, and improving interpretability. Models tend to be more robust and accurate when features are normalized, as they capture underlying relationships with less distortion from skewness, outliers, or scale differences. Normalized features can also reduce the risk of overfitting in models that rely on weights or distance measures, because comparable ranges limit the tendency to assign excessive importance to a single variable. Properly scaled and transformed data also widen the choice of algorithms suited to the problem.

References

Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). Springer. https://link.springer.com/book/10.1007/978-0-387-21706-2

Frost, J. (2020). Introduction to Statistics: An Intuitive Guide for Analyzing Data and Unlocking Discoveries. Statistics by Jim Publishing. https://statisticsbyjim.com/

Hinton, W. (2024). From Univariate to Bivariate and Multivariate Analysis. Available at Rpubs. https://www.rpubs.com/whinton/.

Packt Publishing. (2018). R programming for statistics and data science (Media from Packt Publishing available freely through O’Reilly Media Inc.). https://learning.oreilly.com/course/r-programming-for/9781789950298/.

Datar, R., & Garg, H. (2019). Hands-on exploratory data analysis with R: Become an expert in exploratory data analysis using R packages. O’Reilly Media, Inc. 

Prabhakaran, S. (2023). The complete ggplot2 tutorial. R-statistics.co. https://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html

Smeaton, A. (2003). NIST/SEMATECH Engineering Statistics Handbook. https://www.itl.nist.gov/div898/handbook/

Kabacoff, R. (2022). R in Action (3rd ed.). O’Reilly Online Learning. https://learning.oreilly.com/library/view/r-in-action/9781617296055/

This study was conducted by Will Hinton.


Appendix: Additional Illustrations


# Box-Cox transformation (requires strictly positive values)
# The Box-Cox transformation is a power transformation used to bring
# non-normal data closer to a normal distribution. It is commonly used in
# statistical modeling to improve normality and to stabilize the variance.

library(MASS)  # provides boxcox()

cat("\nAlternative Box-Cox Transformation for Fare (See ?boxcox() documentation):","\n")
## 
## Alternative Box-Cox Transformation for Fare (See ?boxcox() documentation):
fare_no_zeros <- titanic$Fare + 1  # shift by 1 so zero fares become positive
# Profile the log-likelihood over a grid of candidate lambda values
boxcox_fare <- boxcox(lm(fare_no_zeros ~ 1), lambda = seq(-2, 2, by = 0.1))

# Pick the lambda with the highest log-likelihood
optimal_lambda <- boxcox_fare$x[which.max(boxcox_fare$y)]
# Apply the transformation; lambda = 0 corresponds to the log transform
titanic$Fare_boxcox <- if (abs(optimal_lambda) < 1e-8) {
  log(fare_no_zeros)
} else {
  (fare_no_zeros^optimal_lambda - 1) / optimal_lambda
}