setwd("/Users/whinton/src/rstudio/tim8501")
df <- read.csv("titanic.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
#df <- titanic ## make copy of original dataset to data frame df
Show Initial Missing Values.
## PassengerId    Survived      Pclass        Name         Sex         Age
##           0           0           0           0           0         177
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked
##           0           0           0           0         687           2
## ####################################################################################
Perform Pre-processing, Imputation and Show Filtered/Cleaned Data.
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked
##        0        0        0        0        0        0        0        0
## ####################################################################################
Descriptive Statistics of Quantitative Variables
##     Var Obs Mean Median Variance St.Dev  Range   IQR Skewness Kurtosis Outliers
## 1   Age 891 29.7   29.7   169.05     13  79.58    13     0.43     3.95       66
## 2 SibSp 891 0.52      0     1.22    1.1      8     1     3.69    20.77       46
## 3 Parch 891 0.38      0     0.65   0.81      6     0     2.74    12.72      213
## 4  Fare 891 32.2  14.45  2469.44  49.69 512.33 23.09     4.78     36.2      116
## ####################################################################################
## Quantiles Data Frame of Quantitative Variables
##     qAge qFare qSibSp qParch
## 0%   0.4   0.0      0      0
## 5%   6.0   7.2      0      0
## 10% 16.0   7.6      0      0
## 25% 22.0   7.9      0      0
## 50% 29.7  14.5      0      0
## 75% 35.0  31.0      1      0
## 95% 54.0 112.1      3      2
## ####################################################################################
## Seed 2066 set for reproducibility
## Training set size: 712
## Testing set size: 179
## #################################################################
## Check Stats, Skewness, Kurtosis & Distributions of TRAINING Set
## #################################################################
##
## Age:
## Minimum: 0.67
## Maximum: 80
## Mean: 29.37
## Outliers: 60
## Median: 29.699
## Variance: 165.002
## Skewness: 0.4892794
## Kurtosis: 4.086198
##
##
## Fare:
## Minimum: 0
## Maximum: 512.329
## Mean: 31.839
## Outliers: 92
## Median: 14.427
## Variance: 2395.699
## Skewness: 4.661264
## Kurtosis: 34.31577
##
##
## SibSp:
## Minimum: 0
## Maximum: 8
## Mean: 0.555
## Outliers: 43
## Median: 0
## Variance: 1.392
## Skewness: 3.577952
## Kurtosis: 19.11013
##
##
## Parch:
## Minimum: 0
## Maximum: 6
## Mean: 0.383
## Outliers: 170
## Median: 0
## Variance: 0.642
## Skewness: 2.649649
## Kurtosis: 12.15382
##
##
## Plots of Training Set for Assessment
## #################################################################
## Check Stats, Skewness, Kurtosis & Distributions of TEST Set
## #################################################################
##
## Age:
## Minimum: 0.42
## Maximum: 74
## Mean: 31.007
## Outliers: 6
## Median: 29.699
## Variance: 184.026
## Skewness: 0.2213026
## Kurtosis: 3.589415
##
##
## Fare:
## Minimum: 0
## Maximum: 512.329
## Mean: 33.659
## Outliers: 24
## Median: 14.5
## Variance: 2775.187
## Skewness: 5.127016
## Kurtosis: 41.2421
##
##
## SibSp:
## Minimum: 0
## Maximum: 4
## Mean: 0.397
## Outliers: 3
## Median: 0
## Variance: 0.499
## Skewness: 2.622303
## Kurtosis: 12.43628
##
##
## Parch:
## Minimum: 0
## Maximum: 6
## Mean: 0.383
## Outliers: 170
## Median: 0
## Variance: 0.642
## Skewness: 2.649649
## Kurtosis: 12.15382
##
##
## Plots of TEST Set for Assessment
##
## See Appendix for Alternative Box-Cox Transformation for (Fare).
##
## Plots of TRANSFORMED TRAINING Set
##
## See Appendix for Alternative Box-Cox Transformation for (Fare).
##
## Plots of TRANSFORMED TEST Set
## #########################################################
## Check First Few Rows of SCALED Vars in TRAINING Set
## #########################################################
##
## First few rows of the scaled variables:
##     Age_scaled Fare_scaled SibSp_scaled Parch_scaled
## 717  0.4705660  0.44409922        0.000    0.0000000
## 602  0.3659286  0.01541158        0.000    0.0000000
## 513  0.4453548  0.05130978        0.000    0.0000000
## 195  0.5461994  0.05410740        0.000    0.0000000
## 230  0.3659286  0.04970769        0.375    0.1666667
## 71   0.3949326  0.02049464        0.000    0.0000000
##
## Plot of Min-Max SCALED TRAINING Variables:
## #########################################################
## Check First Few Rows of SCALED TEST Set Vars
## #########################################################
##
## First few rows of the scaled variables:
##    Age_scaled Fare_scaled SibSp_scaled Parch_scaled
## 6  0.39792223  0.01650950         0.00          0.0
## 10 0.18456102  0.05869429         0.25          0.0
## 11 0.04865453  0.03259623         0.25          0.2
## 12 0.78254961  0.05182215         0.00          0.0
## 14 0.52432726  0.06104473         0.25          1.0
## 18 0.39792223  0.02537431         0.00          0.0
##
## Plot of Min-Max SCALED TEST set Variables:
This segment of the EDA study examines the roles that transformations and min-max scaling play in preprocessing the Titanic dataset for predictive modeling. By addressing skewness, outliers, differing scales, and convergence issues, these steps can improve the accuracy, stability, and interpretability of models. They are particularly useful in the later stages of a machine learning workflow, where effective feature engineering can strongly influence the success of the final model.
Data Preparation
Data preparation is largely boilerplate at this point: reusable functions handle missing data, perform imputation, and filter and clean the dataset. See the prior studies and papers listed in the References section.
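As a rough illustration of these preparation steps, the sketch below counts missing values, imputes Age and Embarked, and keeps the eight columns that appear in the cleaned output. The data frame `toy` and the helper `impute_mode()` are hypothetical, since the study's own reusable functions are not shown here. Mean imputation for Age is an assumption, suggested by the identical mean and median (29.7) in the descriptive table.

```r
# Hypothetical toy data standing in for titanic.csv (not the real file)
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 1),
  Pclass   = c(3, 1, 3, 1, 2),
  Sex      = factor(c("male", "female", "female", "male", "female")),
  Age      = c(22, 38, NA, 54, NA),
  SibSp    = c(1, 1, 0, 0, 0),
  Parch    = c(0, 0, 0, 0, 2),
  Fare     = c(7.25, 71.28, 7.92, 51.86, 13.00),
  Cabin    = c(NA, "C85", NA, "E46", NA),
  Embarked = c("S", "C", NA, "S", "S"),
  stringsAsFactors = FALSE
)

# Count missing values per column (the "initial missing values" step)
missing_before <- colSums(is.na(toy))

# Hypothetical helper: replace NA with the most frequent non-missing value
impute_mode <- function(x) {
  mode_value <- names(which.max(table(x)))
  x[is.na(x)] <- mode_value
  x
}

# Impute Age with the mean (an assumption) and Embarked with the mode
toy$Age[is.na(toy$Age)] <- mean(toy$Age, na.rm = TRUE)
toy$Embarked <- factor(impute_mode(toy$Embarked))

# Keep only the columns used downstream; Cabin is dropped
keep <- c("Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")
clean <- toy[, keep]
missing_after <- colSums(is.na(clean))

print(missing_before)
print(missing_after)
```

Dropping Cabin rather than imputing it is a design choice that fits a column with 687 of 891 values missing, although the study does not state its reasoning.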
Splitting the Data
Splitting the dataset into training and testing samples is crucial because it allows the model’s performance to be evaluated on unseen data, which makes overfitting detectable. We train the model on one part (the training set) and test it on the other (the testing set). This checks that the model generalizes to new data rather than just memorizing the training data. Testing on a separate dataset allows us to calculate reliable metrics such as accuracy, precision, recall, or root mean square error, depending on the problem type. Splitting also guards against data leakage, where information from the testing set influences the training phase and leads to overly optimistic performance estimates, provided that preprocessing decisions are based on the training set alone. Random splitting with a set seed (2066 here) ensures consistent results across multiple runs, allowing comparisons and debugging.
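A minimal sketch of a seeded random split follows. The 80/20 ratio is an assumption, chosen because it reproduces the reported sizes (712 and 179 of 891 rows); the data frame `cleaned` is a hypothetical stand-in, and the study may have used a different splitting helper.

```r
# Hypothetical stand-in for the cleaned data: 891 rows, as in the Titanic set
n <- 891
cleaned <- data.frame(id = seq_len(n), Fare = seq(0, 512, length.out = n))

set.seed(2066)                 # seed reported in the output above
train_size <- floor(0.8 * n)   # 80/20 split is an assumption
train_idx  <- sample(seq_len(n), size = train_size)

# Rows drawn by sample() form the training set; the rest form the test set
train <- cleaned[train_idx, ]
test  <- cleaned[-train_idx, ]

cat("Training set size:", nrow(train), "\n")
cat("Testing set size:", nrow(test), "\n")
```

Because the test rows are selected by negative indexing, no row can appear in both sets.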
Assessing and Applying Transformations
Non-normality can negatively affect statistical analyses and models that assume normally distributed data. The primary indicators of non-normality are descriptive statistics (outlier counts, skewness, and kurtosis) and a histogram of the variable’s distribution. Below, I outline the indicators of non-normality for each variable and suggest suitable transformations.
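The indicators named here can be computed in base R, as in the sketch below. The function names are hypothetical, and the definitions are assumptions: moment-based skewness, non-excess kurtosis (3 for a normal distribution), and the 1.5 × IQR boxplot rule for outliers. Under that rule an IQR of 0 flags every non-zero value, which would explain the large outlier count reported for Parch.

```r
# Moment-based skewness and (non-excess) kurtosis; a normal distribution
# has skewness 0 and kurtosis 3 under these definitions
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Count points outside the 1.5 * IQR fences (Tukey's boxplot rule)
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

# Hypothetical right-skewed sample, loosely resembling a fare-like variable
fare_like <- c(7, 7, 8, 8, 8, 9, 10, 13, 14, 15, 26, 30, 31, 80, 512)

cat("Skewness:", round(skewness(fare_like), 2), "\n")
cat("Kurtosis:", round(kurtosis(fare_like), 2), "\n")
cat("Outliers:", count_outliers(fare_like), "\n")
```

Positive skewness together with kurtosis well above 3 indicates a long right tail, the pattern the Fare statistics show.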
Perform and Apply Min-Max Scaling
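This type of normalization transforms all variable values into the range 0 to 1. It is useful for several reasons, including feature comparability, improved model convergence, and preservation of the relative spacing between values. Min-max scaling is itself sensitive to outliers, however, because a single extreme value sets the range and compresses the remaining values.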
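Min-max scaling maps each value x to (x - min) / (max - min). The sketch below uses a hypothetical helper `min_max()` and made-up Parch-like values. Reusing the training range for the test set is a common convention assumed here, not something the study states.

```r
# Min-max scaling: x' = (x - min) / (max - min), mapping values to [0, 1]
min_max <- function(x, lo = min(x), hi = max(x)) {
  if (hi == lo) return(rep(0, length(x)))  # constant column: avoid 0/0
  (x - lo) / (hi - lo)
}

# Hypothetical training and test values for a Parch-like count variable
parch_train <- c(0, 0, 1, 2, 6)
parch_test  <- c(0, 1, 5)

# Scale the training set with its own range
parch_train_scaled <- min_max(parch_train)

# Reuse the training range for the test set so both share one scale
# (assumed convention; test values outside the training range would
# fall outside [0, 1])
parch_test_scaled <- min_max(parch_test,
                             lo = min(parch_train),
                             hi = max(parch_train))

print(round(parch_train_scaled, 4))
print(round(parch_test_scaled, 4))
```

Scaling each set with its own minimum and maximum is the alternative; in that case the same raw value can map to different scaled values in the two sets.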
Importance of Transforming and Scaling
Transformations and min-max scaling are essential preprocessing techniques when preparing the Titanic dataset (or any dataset) for predictive modeling. They help improve model performance, convergence, and interpretability in several ways. Transformations such as Box-Cox, logarithmic, square root, and power transformations are applied to normalize distributions, reduce skewness and stabilize variance.
Dealing with Skewness: Variables like Fare in the Titanic dataset are right-skewed (skewness 4.78 here), meaning that a long tail of extreme values can dominate and distort models such as linear regression or distance-based clustering. Tree-based models are less affected, since their splits depend only on the ordering of values.
Stabilizing Variance: Variance that changes across the range of a variable can negatively affect linear regression and gradient-based models. Square root or logarithmic transformations stabilize variance, enabling models to capture relationships better.
Improved Model Assumptions: Some models make distributional assumptions. Linear regression, for example, assumes normally distributed errors for its inference, and other methods (e.g., logistic regression and neural networks) tend to behave better numerically when features are roughly symmetric. Transformations help to meet these assumptions, thereby improving model reliability.
Outlier Mitigation: Variables like Fare and Age often have outliers that can skew predictions. Transformations reduce the influence of outliers without removing valuable data points.
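The effect of these transformations on skewness can be checked directly, as in the sketch below. The `fare` values are hypothetical, and `skewness()` is a moment-based helper defined here rather than a function taken from the study.

```r
# Moment-based skewness (0 for a symmetric distribution)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Hypothetical right-skewed fares, including a zero fare
fare <- c(0, 7.25, 7.9, 8.05, 13, 14.45, 26, 31, 71.28, 263, 512.33)

fare_log  <- log1p(fare)   # log(1 + x), defined at x = 0
fare_sqrt <- sqrt(fare)    # milder compression than the log

# Compare skewness before and after each transformation
skew_table <- data.frame(
  Transform = c("none", "sqrt", "log1p"),
  Skewness  = round(c(skewness(fare), skewness(fare_sqrt), skewness(fare_log)), 2)
)
print(skew_table)
```

Both transformations keep every observation and preserve the ordering of values; they only pull the extreme fares closer to the bulk of the data.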
Min-max scaling normalizes feature values to a range (e.g., 0 to 1). This is crucial when variables have different scales, as in the Titanic dataset. For example, Fare ranges from 0 to about 512, while Age ranges from 0.42 to 80. Discrete variables like SibSp and Parch have much smaller ranges (0 to 8 and 0 to 6, respectively).
Transformed and scaled features align more closely with the models’ assumptions, making results and predictions easier to explain to stakeholders. Specific examples in the Titanic dataset include the following.
Age. Has some outliers and a slight skew (0.43). A transformation brings it closer to a normal distribution, which can improve its predictive power (e.g., when predicting survival likelihood).
Fare. Right-skewed with extreme values. A Box-Cox transformation brings it closer to normal, and scaling ensures it does not dominate other features.
SibSp and Parch. These are count variables with small ranges. Scaling ensures fair treatment compared to continuous variables like Fare and Age.
Summary
Together, transformations and scaling can improve predictive models in four ways: better model performance, less overfitting, easier algorithm selection, and better interpretability. Models tend to be more robust and accurate when features are normalized, as they capture underlying relationships with less bias from skewness, outliers, or scale differences. Normalized features can also reduce the risk of overfitting, especially in models that rely on weights or distance measures, because comparable ranges keep any single variable from receiving excessive importance. Properly scaled and transformed data also make it easier to select an appropriate algorithm for the problem.
References
Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). Springer. https://link.springer.com/book/10.1007/978-0-387-21706-2
Frost, J. (2020). Introduction to Statistics: An Intuitive Guide for Analyzing Data and Unlocking Discoveries. Statistics by Jim Publishing. https://statisticsbyjim.com/
Hinton, W. (2024). From Univariate to Bivariate and Multivariate Analysis. RPubs. https://www.rpubs.com/whinton/
Packt Publishing. (2018). R programming for statistics and data science (Media from Packt Publishing available freely through O’Reilly Media Inc.). https://learning.oreilly.com/course/r-programming-for/9781789950298/
Datar, R., & Garg, H. (2019). Hands-on exploratory data analysis with R: Become an expert in exploratory data analysis using R packages. O’Reilly Media, Inc.
Prabhakaran, S. (2023). The complete ggplot2 tutorial. R-statistics.co. https://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html
Smeaton, A. (2003). NIST/SEMATECH Engineering Statistics Handbook. https://www.itl.nist.gov/div898/handbook/
Kabacoff, R. (2022). R in Action (3rd ed.). O’Reilly Online Learning. https://learning.oreilly.com/library/view/r-in-action/9781617296055/
This study was conducted by Will Hinton.
Appendix
# Box-Cox transformation (requires strictly positive values, so zeros must be shifted)
# The Box-Cox transformation is a power transformation used to bring
# non-normal data closer to a normal distribution.
# It is commonly used in statistical modeling to improve
# the normality of the data and to stabilize the variance.
cat("\nAlternative Box-Cox Transformation for Fare (See ?boxcox() documentation):","\n")
##
## Alternative Box-Cox Transformation for Fare (See ?boxcox() documentation):
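library(MASS) # provides boxcox()
fare_no_zeros <- df$Fare + 1 # shift by 1 so zero fares become strictly positive
boxcox_fare <- boxcox(lm(fare_no_zeros ~ 1), lambda = seq(-2, 2, by = 0.1))
optimal_lambda <- boxcox_fare$x[which.max(boxcox_fare$y)] # lambda with the highest log-likelihood
# Apply the power transform; lambda near 0 is the log limit and avoids dividing by zero
df$Fare_boxcox <- if (abs(optimal_lambda) < 1e-8) log(fare_no_zeros) else
  (fare_no_zeros^optimal_lambda - 1) / optimal_lambda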
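For reference, the one-parameter Box-Cox transform applied in this appendix can be written as follows. Here y denotes the shifted fare (Fare + 1) and λ the value that maximizes the profile log-likelihood returned by boxcox(); the symbols are notation introduced for this note, not taken from the study.

```latex
y^{(\lambda)} =
\begin{cases}
  \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
  \log y, & \lambda = 0,
\end{cases}
\qquad y > 0 .
```

The λ = 0 case is the limit of the first branch as λ approaches 0, which is why a logarithmic transformation is often treated as a special case of Box-Cox.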