Assignment 1: Evaluate a New Data Set in the Analytical Environment

Titanic

In this analysis, we will be evaluating the Titanic dataset. The data dictionary is as follows:

  • Passengerid: Passenger ID
  • Age: Age in years
  • Fare: Passenger fare
  • Sex: Sex (female, male)
  • Sibsp: # of siblings/spouses aboard the Titanic (Sibling = brother, sister, stepbrother, stepsister ; Spouse = husband, wife )
  • Parch: # of parents/children aboard the Titanic (Parent = mother, father; Child = daughter, son, stepdaughter, stepson)
  • Pclass: Ticket class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
  • Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
  • Survived: Survival (0 = No, 1 = Yes)
  • Cabin: Cabin number
  • Ticket: Ticket number

Setting up your directory

getwd()

# Set the working directory
#setwd("C:/Users/abdou/OneDrive - Prod Student NU/Documents/DDS-8501 Exploratory Data Analysis/DDS8501_EDA")

Step 1: Dataset Acquisition

  • Download the dataset assigned by your instructor (train.csv) or any other dataset specified.
  • Open the dataset in R using appropriate commands to load it into your working environment.
test <- read.csv("test.csv", header = TRUE)
train <- read.csv("train.csv", header = TRUE)
survivor <- read.csv("gender_submission.csv", header = TRUE)

# Append test and survivor datasets by PassengerID
test_sur <- merge(survivor, test)

# Merge test_train and survivor by PassengerID
titanic <- rbind(train, test_sur)

head(titanic)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q

Step 2: Data Structure Analysis

  • Use summary() and str() functions in R to understand the structure of the data.
  • Identify the number of observations (rows) and variables (columns) in the dataset.
str(titanic)
## 'data.frame':    1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
summary(titanic)
##   PassengerId      Survived          Pclass          Name          
##  Min.   :   1   Min.   :0.0000   Min.   :1.000   Length:1309       
##  1st Qu.: 328   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median : 655   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   : 655   Mean   :0.3774   Mean   :2.295                     
##  3rd Qu.: 982   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :1309   Max.   :1.0000   Max.   :3.000                     
##                                                                    
##      Sex                 Age            SibSp            Parch      
##  Length:1309        Min.   : 0.17   Min.   :0.0000   Min.   :0.000  
##  Class :character   1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.000  
##  Mode  :character   Median :28.00   Median :0.0000   Median :0.000  
##                     Mean   :29.88   Mean   :0.4989   Mean   :0.385  
##                     3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.000  
##                     Max.   :80.00   Max.   :8.0000   Max.   :9.000  
##                     NA's   :263                                     
##     Ticket               Fare            Cabin             Embarked        
##  Length:1309        Min.   :  0.000   Length:1309        Length:1309       
##  Class :character   1st Qu.:  7.896   Class :character   Class :character  
##  Mode  :character   Median : 14.454   Mode  :character   Mode  :character  
##                     Mean   : 33.295                                        
##                     3rd Qu.: 31.275                                        
##                     Max.   :512.329                                        
##                     NA's   :1
nrow(titanic)
## [1] 1309
length(titanic)
## [1] 12

Step 3: Variable Type Classification

  • Examine each variable individually.
  • For each variable, determine whether it is quantitative or qualitative (categorical).
  • For any variables that are improperly classified, such as a categorical variable mistakenly labeled as quantitative (e.g., num), reclassify them as categorical factors using appropriate R functions (e.g., as.factor()).

Answer:

  • PassengerId: Quantitative variable, but it should be used as an identifier for each row.
  • Survived: Quantitative variable that needs to be reclassified as a categorical factor.
  • Pclass: Quantitative variable that should be reclassified as a categorical factor.
  • Name: Qualitative variable.
  • Sex: Qualitative variable.
  • Age: Quantitative variable.
  • SibSp: Quantitative variable.
  • Parch: Quantitative variable.
  • Ticket: Qualitative variable.
  • Fare: Quantitative variable.
  • Cabin: Qualitative variable.
  • Embarked: Qualitative variable.
# Reclassify the important variables

titanic$Sex <- as.factor(titanic$Sex)
titanic$Embarked  <-  as.factor(titanic$Embarked)

str(titanic)
## 'data.frame':    1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

Step 4: Recoding Quantitative Variables

  • For any quantitative variables that are, in fact, qualitative, reclassify them as categorical. This may involve using the as.factor() function or other methods to convert them.

Answer

titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Survived <- as.factor(titanic$Survived)

str(titanic)
## 'data.frame':    1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

Step 5: Variable Classification

  • For each variable, classify it as one of the following types: nominal, ordinal, interval, or ratio.
  • Additionally, classify each variable as either discrete or continuous. You may use your domain knowledge or statistical tests to make these classifications.

Answer:

library(kableExtra)

classification <- data.frame(
  Variable      = c("PassengerId", "Survived", "Pclass", "Name", "Sex",
                    "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"),
  R_Type        = c("int", "Factor (2 levels)", "Factor (3 levels)", "chr", "Factor (2 levels)","num", "int", "int", "chr", "num", "chr", "Factor (4 levels)"),
  Scale         = c("Nominal", "Nominal", "Ordinal", "Nominal", "Nominal",
                    "Ratio", "Ratio", "Ratio", "Nominal", "Ratio", "Nominal", "Nominal"),
  Discrete_Continuous = c("Discrete", "Discrete", "Discrete", "Discrete", "Discrete",
                          "Continuous", "Discrete", "Discrete", "Discrete",
                          "Continuous", "Discrete", "Discrete")
)

kable(classification, caption = "Variable Classification for Titanic Dataset", align = 'c') %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
row_spec(0, bold = TRUE, background = "#D3D3D3")  # Highlight header
Variable Classification for Titanic Dataset
Variable R_Type Scale Discrete_Continuous
PassengerId int Nominal Discrete
Survived Factor (2 levels) Nominal Discrete
Pclass Factor (3 levels) Ordinal Discrete
Name chr Nominal Discrete
Sex Factor (2 levels) Nominal Discrete
Age num Ratio Continuous
SibSp int Ratio Discrete
Parch int Ratio Discrete
Ticket chr Nominal Discrete
Fare num Ratio Continuous
Cabin chr Nominal Discrete
Embarked Factor (4 levels) Nominal Discrete

Step 6: Importance of Variable Types

  • Write a brief discussion on why it is essential to understand variable types as part of Exploratory Data Analysis (EDA). Explain how the understanding of variable types can influence data analysis, visualization, and model building.

Answer

The first and most important step in any data analysis project is understanding the types of variables we are working with. Before plotting and generating models, it is important to know if a variable is nominal, ordinal, interval, or ratio, and if it is discrete or continuous. That information guides almost every analytical decision that follows.

Impact on Data Analysis

The type of variable determines which statistical methods are appropriate. For a nominal variable such as Sex or Embarked, it doesn’t make sense to calculate the mean. The correct summary statistics are frequencies and proportions. Ratio variables such as Fare or Age behave differently. The means, standard deviations, and percentiles carry real mathematical meaning.

Problems arise when the wrong method, such as averaging passenger class (Pclass) values as if they were continuous, is used, producing results that are technically computable but statistically meaningless.
Ordinal variables introduce a subtle distinction. Pclass has a clear ranking, but the spacing between levels isn’t uniform. Because of this, rank-based methods like Spearman’s correlation are more appropriate than Pearson’s correlation, which assumes equal intervals.

Influence on Visualization

The variable type also determines which visualizations reveal useful patterns. Categorical variables such as Survived and Embarked call for bar charts and frequency tables. Continuous variables like Age and Fare are better explored with histograms, density plots, or box plots, which reveal the distribution’s shape, skewness, and outliers. Mixing these up, such as using a bar chart for Fare, can collapse important distributional information andmislead interpretation.

The relationship between two variables also depends on their types.

- Two continuous variables are best examined with a scatterplot.

- A continuous variable paired with a categorical variable often works well with box plots or violin plots.

- Two categorical variables can be explored using grouped bar charts or mosaic plots.

Choosing the right visualization helps uncover data structure rather than masking it.

Influence on Model Building

Variable types have direct consequences for how you prepare data and which models you can use. Most machine learning algorithms expect numeric input, so nominal variables need to be encoded as either one-hot (dummy) variables or target-encoded variables, depending on their cardinality and context. Ordinal variables can sometimes be label-encoded with meaningful integer values, but only if the model can respect the ordering.

Continuous variables often require scaling or normalization, especially in distance-based models such as K-nearest neighbors or regularized regression, where variables on different scales can distort results. Discrete count variables like SibSp and Parch may need different treatment than truly continuous measurements.

Misclassifying a variable type at this stage cascades into downstream errors. Treating PassengerId as a meaningful numeric feature, for example, would introduce noise and potentially bias into any model that includes it.

Key Takeaway

Recognizing variable types is not a minor technical step. It sets the foundation for sound analysis. When variables are correctly classified, analysts apply appropriate statistical methods, choose meaningful visualizations, and properly prepare data for modeling. Skipping this step often leads to subtle mistakes that weaken the entire analysis.

Assignment 2: Assess, Filter, Clean, & Present Data

Step 1: Missing Data Visualization

  • Load the data set assigned by your instructor into R.

  • Use Amelia or a similar library to create missing data maps that visualize missing values in the dataset. Handle missing values in categorical variables with null factor levels (e.g., ““).

# install.packages("Amelia")
# install.packages("forcats")
# install.packages("tidyverse")
# install.packages("rlang")
library("Amelia")
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.3, built: 2024-11-07)
## ## Copyright (C) 2005-2026 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library("forcats")
library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.2.0
## ✔ ggplot2   4.0.2     ✔ stringr   1.6.0
## ✔ lubridate 1.9.5     ✔ tibble    3.3.1
## ✔ purrr     1.2.1     ✔ tidyr     1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag()        masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
titanic2 <- data.frame(titanic)
# Visualize missing values
missmap(titanic2, legend = TRUE)

# Handle missing values in categorical variables with null factor levels
# Rename the empty level to 'unk'
levels(titanic2$Embarked)[levels(titanic2$Embarked) == ""] <- "unk"

levels(titanic2$Cabin)[levels(titanic2$Cabin) == ""] <- "unk"

titanic2$Cabin <- factor(titanic2$Cabin, exclude = NULL)

sum(titanic2$Cabin == "", na.rm = TRUE)
## [1] 1014
# str(titanic)
# str(titanic2)

Step 2: Identify Rows and Columns with 20% or More Missing

  • Use R to identify rows with 20% or more missing values.

  • Identify columns (variables) with 20% or more missing values.

titanic2[titanic2 == ""] <- NA
titanic2[titanic2 == "unk"] <- NA

# Identify rows with 20% or more missing values
rows_missing_20 <- rowMeans(is.na(titanic2)) >= 0.20

# View the row indices
which(rows_missing_20)
## integer(0)
# Identify columns with 20% or more missing values
missing_pct <- data.frame(
  Variable = names(titanic2),
  Missing_Percentage = colMeans(is.na(titanic2)) * 100
  )

missing_pct <- missing_pct[order(-missing_pct$Missing_Percentage),]

missing_pct
##                Variable Missing_Percentage
## Cabin             Cabin        77.46371276
## Age                 Age        20.09167303
## Embarked       Embarked         0.15278839
## Fare               Fare         0.07639419
## PassengerId PassengerId         0.00000000
## Survived       Survived         0.00000000
## Pclass           Pclass         0.00000000
## Name               Name         0.00000000
## Sex                 Sex         0.00000000
## SibSp             SibSp         0.00000000
## Parch             Parch         0.00000000
## Ticket           Ticket         0.00000000

Step 3: Handling Missing Data

  • Remove Rows: Eliminate rows with 20% or more missing values from the dataset.

  • Remove Columns: If there are any columns with 20% or more missing values, remove those columns from the dataset.

# Filter to keep rows with less than 20% missing
titanic2_clean <- titanic2[!rows_missing_20, ]

# Number of rows to remove
print("The count of rows with 20% and more missing values: ")
## [1] "The count of rows with 20% and more missing values: "
sum(rows_missing_20)
## [1] 0
# Filter to keep columns with less than 20% missing
cols_to_remove <- missing_pct[missing_pct$Missing_Percentage >= 22, c("Variable")]

titanic2_clean <- titanic2_clean[,!(names(titanic2_clean) %in% cols_to_remove)]

print("The columns with 20% or more missing values")
## [1] "The columns with 20% or more missing values"
cols_to_remove
## [1] "Cabin"
print("Count of variables left after removal")
## [1] "Count of variables left after removal"
ncol(titanic2_clean)
## [1] 11

Step 4: Imputation

  • For the remaining missing values in the dataset, choose an imputation method (mean, median, or mode).

  • Justify your choice for imputation. For example, you can explain why you selected the mean over the median or the mode, taking into account the distribution and nature of the data.

# Create the dataset for replacement
titanic2_replace <- titanic2_clean

# Impute Fare with median
titanic2_replace$Fare[is.na(titanic2_clean$Fare)] <- median(titanic2_clean$Fare, na.rm = TRUE)

# Impute Age with median
titanic2_replace$Age[is.na(titanic2_clean$Age)] <- median(titanic2_clean$Age, na.rm = TRUE)

# Compute mode for Embarked
mode_embarked <- names(sort(table(titanic2_replace$Embarked), decreasing = TRUE))[1]

# Impute Embarked with mode
titanic2_replace$Embarked[is.na(titanic2_replace$Embarked)] <- mode_embarked

# Check remaining NAs
colSums(is.na(titanic2_replace))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare    Embarked 
##           0           0           0           0           0

The median is more robust to outliers and skewed distributions. In the Titanic dataset, variables such as Fare are right-skewed, which can bias the mean toward extreme values. Using the median preserves central tendency while being unaffected by outliers.

Although Age exceeded the 20% missingness threshold at 20.09%, the excess is marginal(less than one-tenth of a percentage point above the cutoff). More importantly, Age is an analytically significant variable in any survival analysis of the Titanic dataset. Its relationship to survival outcomes, ticket class, and family composition makes it a variable whose removal would meaningfully reduce the analytical value of the cleaned dataset.

For categorical variables, the mode is the most appropriate imputation method because it replaces missing values with the most common category.

Step 5: Verify Data Cleaning

  • Use the same method as in Step 1 to create a new missing data map to verify that you have cleaned the data properly and now have a complete dataset.
missmap(titanic2_replace, legend = TRUE)

Step 6: Discuss Data Cleaning

  • In your R Markdown document, discuss the importance of data cleaning in the context of Exploratory Data Analysis (EDA). Explain how the quality of the data impacts the analysis and decision-making process.

Data cleaning is one of the most important steps in Exploratory Data Analysis because the quality of every conclusion drawn from the data depends on the quality of the data itself. Tukey (1977) established EDA as a structured process of discovery, but that process only works when the data accurately reflects reality. Raw datasets almost always contain problems: missing values, mislabeled variables, duplicate records, and entries that do not make sense. If these issues are not addressed before analysis begins, every summary statistic, chart, and model that follows is built on a flawed foundation. Rahm and Do (2000) show that data quality problems come in two forms: those embedded within a single dataset, such as missing values and outliers, and those that arise from how the data was collected or combined, and both types introduce errors that are often difficult to detect once the analysis is underway. Stevens (1946) makes the point that applying the wrong type of math to the wrong type of variable does not just yield imprecise results; it produces fundamentally incorrect results. Cleaning the data before any analysis begins is not a minor housekeeping task. It is the step that determines whether the analysis’s findings can actually be trusted.

Step 7: Effects of Imputation

  • Discuss the effects of various imputation methods (deletion, mean imputation, median imputation, etc.) on the dataset. Highlight the advantages and disadvantages of each approach.

Imputation is the process of replacing missing values so analysis can proceed without discarding incomplete records, and the method chosen directly affects the conclusions the data can support. Three variables required imputation in this analysis: Age, Fare, and Embarked. Complete-case deletion was not applied at the observation level because no rows met the 20% removal threshold, and selectively deleting records with missing values in specific variables would have introduced selection bias by removing passengers whose missingness may be systematically related to survival outcomes (Little & Rubin, 2002). Age and Fare were both imputed using the median because both variables are right-skewed, making the median a more representative replacement than the mean, which is pulled upward by extreme values. Embarked was imputed using the mode because it is a nominal categorical variable for which the mean and median are not defined. The shared limitation of all three single-value methods is artificial variance reduction: replacing missing values with a fixed value compresses the distribution. It produces confidence intervals that are narrower than the data support. More advanced approaches, such as multiple imputation or k-nearest neighbors, address this by generating replacement values that reflect the natural spread of the data and account for relationships between variables, and should be considered if Age or Fare are used as predictors in a formal model (van Buuren, 2018).

Assignment 3: Conduct and Interpret Univariate Data Visualization & Outlier Analysis/Handling

Step 1: Select Variables for Visualization

  • Choose at least two quantitative variables and two categorical variables from your dataset for univariate data visualization.
# Quantitative variable selected: Age, Fare
# Categorical variables selected: Sex, Pclass

viz_df <- titanic2_replace %>% select('Age', 'Fare', 'Sex', 'Pclass')

head(viz_df)
##   Age    Fare    Sex Pclass
## 1  22  7.2500   male      3
## 2  38 71.2833 female      1
## 3  26  7.9250 female      3
## 4  35 53.1000 female      1
## 5  35  8.0500   male      3
## 6  28  8.4583   male      3

Step 2: Four-Plots and Box Plots for Quantitative Variables

  • For each selected quantitative variable, create four plots (histogram, probability plot, dot plot, and box plot) to visualize its distribution.

  • Identify any issues with the data, such as skewness, multimodality, or extreme outliers. Extreme outliers can be identified using box plots.

par(mfrow=c(2,2)) #set the graphing space
require(dplyr)
require(ggplot2)
require(qqplotr)
## Loading required package: qqplotr
## Warning: package 'qqplotr' was built under R version 4.5.3
## 
## Attaching package: 'qqplotr'
## The following objects are masked from 'package:ggplot2':
## 
##     stat_qq_line, StatQqLine
require(Rmisc)
## Loading required package: Rmisc
## Warning: package 'Rmisc' was built under R version 4.5.3
## Loading required package: lattice
## Loading required package: plyr
## Warning: package 'plyr' was built under R version 4.5.3
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, mutate, rename, summarise, summarize
## The following object is masked from 'package:purrr':
## 
##     compact
# Age
# Histogram
p1 = ggplot(viz_df, aes(x=Age))+geom_histogram(fill='skyblue',col='black')+ggtitle('Histogram of Age')
# Normal Probability plot
p2 = ggplot(viz_df, aes(sample=scale(Age)))+stat_qq_point(aes(col=as.factor(Sex)))+ stat_qq_line(color = "black") +ggtitle('Normal Probability Plot')
# Dot plot
p3 = ggplot(viz_df, aes(x=lag(Age), y=Age, group=Sex))+geom_point(aes(col=as.factor(Sex)))+ggtitle('Lag')
# Box plot 
p4 = ggplot(viz_df, aes(y=Age))+geom_boxplot(fill='skyblue', col='black')+ggtitle('Boxplot of Age')+coord_flip()

multiplot(p1, p2, p3, p4, cols=2)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
Figure 1. Four-plot of Age: histogram (top-left), Q-Q plot (bottom-left ), dot plot (top-right), and box plot (bottom-right).

Figure 1. Four-plot of Age: histogram (top-left), Q-Q plot (bottom-left ), dot plot (top-right), and box plot (bottom-right).

# Fare
# Histogram
p1 = ggplot(viz_df, aes(x=Fare))+geom_histogram(fill='skyblue',col='black')+ggtitle('Histogram of Fare')
# Normal Probability plot
p2 = ggplot(viz_df, aes(sample=scale(Fare)))+stat_qq_point(aes(col=as.factor(Sex)))+ stat_qq_line(color = "black") +ggtitle('Normal Probability Plot')
# Dot plot
p3 = ggplot(viz_df, aes(x=lag(Fare), y=Fare, group=Sex))+geom_point(aes(col=as.factor(Sex)))+ggtitle('Lag')
# Box plot 
p4 = ggplot(viz_df, aes(y=Fare))+geom_boxplot(fill='skyblue',col='black')+ggtitle('Boxplot of Fare')+coord_flip()

multiplot(p1, p2, p3, p4, cols=2)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
Figure 2. Four-plot of Fare: histogram (top-left), Q-Q plot (bottom-left ), dot plot (top-right), and box plot (bottom-right).

Figure 2. Four-plot of Fare: histogram (top-left), Q-Q plot (bottom-left ), dot plot (top-right), and box plot (bottom-right).

Step 3: Bar Plots for Categorical Variables

  • For each of the selected categorical variables, create bar plots to visualize the distribution of categories or levels.

Two nominal categorical variables were selected: Sex (two levels: female, male) and Pclass (three ordinal levels: 1st, 2nd, 3rd class). Both are directly relevant to survival analysis and illustrate different imbalance profiles.

par(mfrow=c(1,1))
# Sex 
# Bar Chart
viz_df = viz_df[order(-viz_df$Sex), ] # order the data frame in decreasing order by count of Age
## Warning in Ops.factor(viz_df$Sex): '-' not meaningful for factors
# Frequency table
plot_df <- table(viz_df$Sex)

# Percentages
pct <- round(100 * plot_df / sum(plot_df), 2)
pct_labels <- paste0(names(plot_df), " (", pct, "%)")

# Barplot 
bp <- barplot(
  plot_df,
  ylab = "Counts",
  main = "Distribution of Sex",
  col = c("skyblue", "forestgreen"),        # fill
  border = "black",      # border
  xaxt = "n"            # suppress x-axis
)

# Add x-axis labels: Male (xx%) and Female (xx%)
axis(1, at = bp, labels = pct_labels)
Figure 3. Bar plot of Sex distribution across all 1,309 passengers.

Figure 3. Bar plot of Sex distribution across all 1,309 passengers.

par(mfrow=c(1,1))
# PClass 
# Bar Chart
viz_df = viz_df[order(-viz_df$Pclass), ] # order the data frame in decreasing order by count of Age
## Warning in Ops.factor(viz_df$Pclass): '-' not meaningful for factors
# Expand right margin to make space for the legend
par(mar = c(5, 5, 5, 9))   # bottom, left, top, right

# Frequency table
plot_df <- table(viz_df$Pclass)

# Percentages
pct <- round(100 * plot_df / sum(plot_df), 2)
pct_labels <- paste0(names(plot_df), " (", pct, "%)")

# Barplot with custom colors
bp <- barplot(
  plot_df,
  ylab = "Counts",
  main = "Distribution of Passenger Class",
  col = c("skyblue", "orange", "forestgreen"),   # fill colors for 1,2,3
  border = "black",
  xaxt = "n"
)

# Add x-axis labels with percentages
axis(1, at = bp, labels = pct_labels)

# Add legend OUTSIDE the plot (to the right)
legend(
  "topright",
  inset = c(-0.40, 0),   # moves legend outside the plot
  legend = c("1 = First Class", "2 = Second Class", "3 = Third Class"),
  fill = c("skyblue", "orange", "forestgreen"),
  border = "black",
  xpd = TRUE,            # allows drawing outside plot region
  title = "Passenger Class"
)
Figure 4. Bar plot of Pclass distribution across all 1,309 passengers.

Figure 4. Bar plot of Pclass distribution across all 1,309 passengers.

Step 4: Assess Normality

  • Evaluate the distributions of quantitative variables for normality. You can use the probability plots from the four-plots to assess normality.
# Descriptive statistics for Age

library(dplyr)

skewness <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  m <- mean(x)
  s <- sd(x)
  (n / ((n - 1) * (n - 2))) * sum(((x - m) / s)^3)
}

summary_stats <- function(data, var) {
  x <- data[[var]]
  
  tibble(
    Variable = var,
    N        = sum(!is.na(x)),
    Mean     = round(mean(x, na.rm = TRUE), 3),
    Median   = round(median(x, na.rm = TRUE), 3),
    SD       = round(sd(x, na.rm = TRUE), 3),
    Skewness = round(skewness(x), 4),
    Min      = round(min(x, na.rm = TRUE), 2),
    Max      = round(max(x, na.rm = TRUE), 2)
  )
}

age_stats  <- summary_stats(viz_df, "Age")
fare_stats <- summary_stats(viz_df, "Fare")

normality_table <- bind_rows(age_stats, fare_stats)

kable(normality_table,
      caption = "Table 1. Descriptive Statistics and Normality Indicators") |>
  kableExtra::kable_styling(full_width = FALSE) |>
  kableExtra::row_spec(0, color = "black") |>   # header text
  kableExtra::row_spec(1:nrow(normality_table), color = "black")
Table 1. Descriptive Statistics and Normality Indicators
Variable N Mean Median SD Skewness Min Max
Age 1309 29.503 28.000 12.905 0.5410 0.17 80.00
Fare 1309 33.281 14.454 51.741 4.3695 0.00 512.33
# Shapiro-Wilk is limited to n <= 5000; use on a sample for large datasets
# For n = 1,309, we can apply directly but note that large-n tests have high power
# and will reject even minor deviations. Q-Q visual inspection is primary.

# Kolmogorov-Smirnov test (more appropriate for n > 50)
ks_age  <- ks.test(titanic$Age,  "pnorm",
                   mean = mean(titanic$Age),
                   sd   = sd(titanic$Age))
## Warning in ks.test.default(titanic$Age, "pnorm", mean = mean(titanic$Age), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test
ks_fare <- ks.test(titanic$Fare, "pnorm",
                   mean = mean(titanic$Fare),
                   sd   = sd(titanic$Fare))
## Warning in ks.test.default(titanic$Fare, "pnorm", mean = mean(titanic$Fare), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test
# Shapiro-Wilk on random sample (n = 500)
set.seed(42)
idx         <- sample(1:nrow(titanic), 500)
sw_age      <- shapiro.test(titanic$Age[idx])
sw_fare     <- shapiro.test(titanic$Fare[idx])

normality_results <- data.frame(
  Variable   = c("Age", "Fare"),
  KS_stat    = round(c(ks_age$statistic,  ks_fare$statistic), 4),
  KS_p       = formatC(c(ks_age$p.value,  ks_fare$p.value), format = "e", digits = 3),
  SW_stat    = round(c(sw_age$statistic,  sw_fare$statistic), 4),
  SW_p       = formatC(c(sw_age$p.value,  sw_fare$p.value), format = "e", digits = 3),
  Normal     = c("No", "No")
)

kable(normality_results,
      col.names = c("Variable", "KS Statistic", "KS p-value",
                    "SW Statistic", "SW p-value", "Normal?"),
      caption = "Table 2: Normality Test Results (KS = Kolmogorov-Smirnov, SW = Shapiro-Wilk)") |>
  kableExtra::kable_styling(full_width = FALSE) |>
  kableExtra::row_spec(0, color = "black") |>  # header text color
  kableExtra::row_spec(1:nrow(normality_results), color = "black")
Table 2: Normality Test Results (KS = Kolmogorov-Smirnov, SW = Shapiro-Wilk)
Variable KS Statistic KS p-value SW Statistic SW p-value Normal?
Age NA NA 0.9795 2.247e-05 No
Fare NA NA 0.5273 1.995e-34 No

Step 5: Importance of Univariate Data Visualization

  • In your R Markdown document, discuss the significance of univariate data visualization in the context of Exploratory Data Analysis (EDA). Explain how it helps in understanding the distribution and characteristics of individual variables.

Univariate data visualization is the foundational step in any rigorous EDA workflow. Before constructing any bivariate relationship or predictive model, we must understand the distributional behavior of each variable in isolation. Tukey (1977) established this principle in his framework for exploratory analysis, arguing that visual inspection of individual variables reveals structure that summary statistics alone cannot convey.

The four-plot, introduced by Chambers et al. (1983) and formalized in subsequent EDA literature, systematically addresses this need. By combining a histogram, a normal Q-Q plot, a dot plot, and a box plot in a single display, the analyst obtains a complete picture of the variable’s shape, symmetry, tail behavior, and outlier profile. Each component provides distinct information: the histogram reveals the frequency distribution and modality; the Q-Q plot exposes deviations from normality; the dot plot shows the granular data density; and the box plot identifies the median, spread, and extreme values through Tukey’s resistant fences (Tukey, 1977).

For categorical variables, bar charts serve the analogous function of revealing the frequency structure of levels. When class imbalance is present, as is common in real-world datasets, bar charts immediately reveal it, signaling the need for stratified sampling, resampling techniques, or weighted models in downstream analysis (Agresti, 2002).

In the context of the Titanic dataset, univariate visualization is particularly important because variables such as Fare do not meet the normality assumptions required by many parametric methods. Detecting this early allows the analyst to select appropriate transformations or non-parametric alternatives before modeling begins, rather than discovering assumption violations after the fact (Dasu and Johnson, 2003).

Step 6: Interpretation of Visualizations

  • Discuss the meaning of the visualizations you generated in terms of data analysis. /- For quantitative variables, explain the shape of the distribution, presence of outliers, and any deviations from normality. /- For categorical variables, describe the frequency of categories and any imbalances.

6.1 Age

The histogram of Age reveals a roughly unimodal, right-skewed distribution. The majority of passengers were between 20 and 40 years old, with a concentration in the late twenties to mid-thirties. The mean (28.00 after median imputation) equals the median by construction, but the original distribution showed mean (29.88) exceeding the median (28.00). A secondary frequency peak is visible among passengers under five years of age, likely reflecting the presence of infants and young children traveling with families.

The Q-Q plot shows moderate deviation from the reference line in both tails. Points in the upper tail curve above the line, confirming right skew. The lower tail deviates downward, consistent with a small cluster of infant observations near zero. The distribution does not conform to normality, though the deviation is less severe than that observed for Fare.

The dot plot confirms the central mass between ages 20 and 40, with sparse observations at the extremes. The box plot identifies several values above the upper Tukey fence (Q3 + 1.5 * IQR = 39 + 27 = 66 years), with the maximum at 80 years appearing as a mild outlier. No extreme outliers (beyond Q3 + 3 * IQR = 93 years) exist in the dataset.

6.2 Fare

The histogram of Fare displays one of the most pronounced right skews in the dataset. The overwhelming majority of passengers paid fares below 50 GBP, while a small number paid fares approaching 512 GBP. The mean (33.30) is more than twice the median (14.45), a ratio that indicates severe positive skew. The distribution is characterized by a pronounced floor effect near zero, consistent with the minimum of 0.00 GBP (likely crew-related or free passages).

The Q-Q plot shows extreme departure from the reference line across the entire upper quantile range. Sample quantiles climb steeply above the theoretical line, reflecting the heavy upper tail driven by high first-class fares. This pattern is inconsistent with normality and suggests that a log or square-root transformation would be required before applying any parametric method that assumes Gaussian residuals.

The dot plot (on a log scale for interpretability) reveals that the majority of fares cluster in the 5-20 GBP range, with sparsely distributed observations in the upper range. The box plot reveals numerous outliers above the upper fence (66.34 GBP), with extreme outliers (above 101.41 GBP) representing a identifiable subset of high-paying first-class passengers. These extreme values reflect the historically documented price premium for luxury cabins and suites on the RMS Titanic.

6.3 Sex

The bar chart shows a pronounced imbalance between male (843; 64.4%) and female (466; 35.6%) passengers. This imbalance is consistent with the historical record of transatlantic travel in the early 20th century, during which male economic migrants and crew members predominated. In a survival analysis context, this imbalance is analytically significant: the “women and children first” maritime evacuation protocol means that class imbalance in Sex directly corresponds to differential survival rates. Failing to account for this imbalance in a predictive model would produce biased class probabilities (Agresti, 2002).

6.4 Passenger Class (Pclass)

The bar chart reveals that third-class passengers constitute 54.2% (n = 709) of the combined manifest, making Pclass a heavily imbalanced ordinal variable. First-class passengers account for 24.7% (n = 323) and second-class for 21.2% (n = 277).

This distribution reflects the economic structure of passenger travel at the time: the large majority of tickets were sold at the lower, more affordable third-class rates. For survival modeling, this imbalance is critical because passenger class was one of the strongest predictors of survival: access to lifeboats, proximity to deck-level exits, and crew attention were all differentially distributed across classes. An unweighted model trained on this distribution will be biased toward the majority class and may systematically underestimate survival probability for first and second-class passengers.

Step 7: Document Issues and Extreme Outliers

  • Clearly document any issues with the data, such as missing values, skewed distributions, or abnormalities. /- Identify and document variables with extreme outliers, emphasizing their impact on data analysis.

7.1 Summary of Issues

issues <- data.frame(
  Variable   = c("Age", "Fare", "Sex", "Pclass"),
  Issue      = c("Right skew; median imputation applied (263 NAs, 20.09%)",
                 "Severe right skew; extreme upper outliers; 1 NA imputed",
                 "Class imbalance: 64.4% male",
                 "Class imbalance: 54.2% third class"),
  Impact     = c("Mild; post-imputation distribution centered near 28",
                 "High; log-transform recommended prior to parametric modeling",
                 "Moderate; differential survival rates by sex require stratification",
                 "Moderate to high; class-weighted modeling recommended")
)

kable(issues,
             col.names = c("Variable", "Issue Identified", "Analytical Impact"),
             caption = "Table 3. Documented Issues Across Selected Variables")|>
              kableExtra::kable_styling(full_width = FALSE) |>
            kableExtra::row_spec(0, color = "black") |>   # header text
            kableExtra::row_spec(1:nrow(normality_table), color = "black")
Table 3. Documented Issues Across Selected Variables
Variable Issue Identified Analytical Impact
Age Right skew; median imputation applied (263 NAs, 20.09%) Mild; post-imputation distribution centered near 28
Fare Severe right skew; extreme upper outliers; 1 NA imputed High; log-transform recommended prior to parametric modeling
Sex Class imbalance: 64.4% male Moderate; differential survival rates by sex require stratification
Pclass Class imbalance: 54.2% third class Moderate to high; class-weighted modeling recommended

7.2 Extreme Outlier Detail: Fare

The Tukey fence analysis for Fare identifies the following thresholds:

  • IQR = Q3 - Q1 = 31.275 - 7.896 = 23.379 GBP
  • Mild outlier threshold (1.5 * IQR): Q3 + 35.07 = 66.34 GBP
  • Extreme outlier threshold (3.0 * IQR): Q3 + 70.14 = 101.41 GBP
  • Maximum observed Fare: 512.33 GBP

Fares above 66.34 GBP are mild outliers; fares above 101.41 GBP are extreme outliers. These values represent a non-trivial subset of the passenger manifest, primarily among first-class passengers who purchased shared-cabin allocations or private suites. Because these extreme values are contextually valid (not data entry errors), they should be retained in the dataset. However, any regression model or distance-based classifier that uses Fare as a numeric feature will require either log-transformation or robust scaling to prevent these values from dominating variance estimates.

7.3 Extreme Outlier Detail: Age

The Tukey fence analysis for Age identifies:

  • IQR = Q3 - Q1 = 39 - 21 = 18 years
  • Mild outlier threshold (1.5 * IQR): Q3 + 27 = 66 years
  • Extreme outlier threshold (3.0 * IQR): Q3 + 54 = 93 years
  • Maximum observed Age: 80 years

No extreme outliers exist in Age. The single passenger recorded at 80 years is a mild outlier but a plausible, contextually valid value. Unlike Fare, the Age distribution does not produce analytical risk from outlier distortion, though the right skew remains a consideration for parametric methods.

Assignment 4: Generate and Interpret Multivariate Graphs of Data

Step 1: Pairs Plot for Quantitative Variables

  • Create a pairs plot for all quantitative variables in your dataset. This will generate scatterplots for all combinations of quantitative variables.
if (!require(GGally)) {
  install.packages("GGally")
  library(GGally)
}
## Loading required package: GGally
## Warning: package 'GGally' was built under R version 4.5.3
# quant <- c('Age', 'SibSp', 'Parch', 'Fare')

quant_vars <- titanic2_replace %>% select('Age', 'Fare', 'SibSp', 'Parch')

ggpairs(
  quant_vars, 
  title = "Pairs Plot of Quantitative Variables",
  upper  = list(continuous = wrap("cor",   size = 4.5, colour = "#2C3E50")),
  lower  = list(continuous = wrap("points", alpha = 0.15, size = 0.8,
                                   colour = "#2980B9")),
  diag = list(continuous = wrap("densityDiag"), fill = "#2980B9",
                                alpha = 0.5, color = "white")
) +
  theme_classic(base_size = 11) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5, size = 13),
    strip.background = element_rect(fill = "#ECF0F1"),
    strip.text = element_text(face = "bold")
  )

  • Interpret the correlations between variables. Look for patterns, trends, and relationships. Pay attention to positive and negative correlations.

Age vs. Fare (corr = 0.178): Weak positive correlation. Older passengers paid higher fares, consistent with wealthier older passengers occupying first-class cabins. The scatter is highly diffuse, however, reflecting the dominant role of ticket class rather than age alone in determining fare.

Age vs. SibSp (corr = -0.19): Weak negative association. Younger passengers traveled with more siblings or spouses, consistent with families where children or young adults traveled with siblings. Older passengers tended to travel alone or with fewer co-passengers in the sibling/spouse category.

Age vs. Parch (corr = -0.126): A similar, slightly weaker negative pattern. Older passengers carried fewer parents or children aboard. Young children and middle-aged adults traveling with their own children generate the cluster with higher Parch values.

Fare vs. SibSp (corr = 0.160) and Fare vs. Parch (corr = 0.222): Weak positive relationships. Larger family units tended to purchase more expensive group or cabin tickets, pushing fare amounts higher. This signal is modest and is mediated substantially by ticket class.

SibSp vs. Parch (corr = 0.374): The strongest pairwise correlation among the four variables. Passengers traveling with siblings or spouses were also more likely to travel with parents or children, consistent with complete family units sailing together.

Step 2-3: Four Additional Plots for Comparing Variables

  • Select four sets of variables, each including a combination of a categorical (qualitative) variable and a quantitative or categorical variable.
  • Create appropriate plots for each comparison, considering the nature of the variables involved. For example, use bar plots for categorical vs. categorical comparisons, and box plots or scatter plots for categorical vs. quantitative comparisons.
  • Use colors or facets to highlight differences in the dependent variable.
  • Discuss what each of the four additional plots tells you about the variables being compared. Explain any patterns, trends, or differences that are evident in the visualizations.

Plot 1 – Survival Rate by Sex (Categorical vs. Categorical)

sex_surv <- titanic2_replace %>%
  mutate(
    Survived = factor(Survived, levels = c(0, 1), labels = c("Died", "Survived")),
    Sex      = factor(Sex)
  ) %>%
  filter(!is.na(Survived))
head(sex_surv)
##   PassengerId Survived Pclass
## 1           1     Died      3
## 2           2 Survived      1
## 3           3 Survived      3
## 4           4 Survived      1
## 5           5     Died      3
## 6           6     Died      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  28     0     0
##             Ticket    Fare Embarked
## 1        A/5 21171  7.2500        S
## 2         PC 17599 71.2833        C
## 3 STON/O2. 3101282  7.9250        S
## 4           113803 53.1000        S
## 5           373450  8.0500        S
## 6           330877  8.4583        Q
library(dplyr)
library(ggplot2)

sex_surv %>%
  dplyr::count(Sex, Survived) %>%
  dplyr::group_by(Sex) %>%
  dplyr::mutate(pct = n / sum(n)) %>%
  dplyr::ungroup() %>%
  ggplot(aes(x = Sex, y = pct, fill = Survived)) +
  geom_col(position = "dodge", width = 0.65, colour = "white") +
  geom_text(
      aes(label = scales::percent(pct, accuracy = 0.1)),
      position = position_dodge(width = 0.65),   # aligns with dodged bars
      vjust    = -0.5,                            # lifts label above the bar
      size = 3
    ) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Survival Rate by Sex",
    x = "Sex",
    y = "Percentage",
    fill = "Outcome"
  ) +
  theme_minimal()

Interpretation: The survival disparity between male and female passengers is the single most pronounced pattern in the entire dataset. Approximately 82% of female passengers survived versus only 13% of males. This near-sixfold difference directly reflects the “women and children first” evacuation protocol practiced aboard the Titanic. The categorical nature of both variables makes a grouped bar chart the appropriate visualization, allowing direct proportion comparison across both sex categories simultaneously.

Plot 2 – Fare Distribution by Passenger Class (Categorical vs. Quantitative)

titanic2_replace <- titanic2_replace %>%
  mutate(Pclass = factor(Pclass,
                         levels = c("1", "2", "3"),
                         labels = c("1st", "2nd", "3rd")))

ggplot(titanic2_replace, aes(x = Pclass, y = Fare, fill = Pclass)) +
  geom_violin(trim = FALSE, alpha = 0.55, colour = "white") +
  geom_boxplot(
    width = 0.18,
    outlier.shape = 21,
    outlier.size = 1.5,
    outlier.alpha = 0.4,
    fill = "white",
    colour = "#2C3E50"
  ) +
  scale_fill_manual(
    values = c("1" = "#1A5276", "2" = "#1F618D", "3" = "#5DADE2"),
    guide = "none"
  ) +
  scale_y_continuous(
    labels = scales::dollar_format(),
    trans = "log1p",
    breaks = c(0, 10, 50, 100, 250, 500)
  ) +
  labs(
    title = "Fare Distribution by Passenger Class",
    subtitle = "Log scale reveals structure despite heavy right skew",
    x = "Passenger Class",
    y = "Fare (log scale)"
  ) +
  theme_classic(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5, colour = "grey40")
  )
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's fill values.

Interpretation: The violin-and-boxplot combination reveals the full shape of the fare distribution within each class while preserving median and interquartile range visibility. First-class fares span an enormous range (median ≈ $60, with values exceeding $500), reflecting differences in private cabin accommodation. Second-class fares cluster tightly (median ≈ $15), and third-class fares are the most compressed and lowest (median ≈ $8). The log-scale transformation is necessary because raw fare values are extreme right-skewed; without it, third- and second-class distributions collapse into a thin band. Critically, this plot establishes that passenger class is not merely a social label but a strong financial stratification variable.

Plot 3 – Age Distribution by Sex and Survival (Categorical vs. Quantitative, Faceted)

titanic2_replace %>%
  mutate(
    Survived = factor(Survived, levels = c(0, 1), labels = c("Died", "Survived")),
    Sex      = factor(Sex)
  ) %>%
  filter(!is.na(Survived), !is.na(Age)) %>%
  ggplot(aes(x = Age, fill = Survived)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~ Sex) +
  labs(
    title = "Age Distribution by Sex and Survival",
    x = "Age",
    y = "Density",
    fill = "Outcome"
  ) +
  theme_minimal()

Interpretation: This faceted density plot adds a third dimension (survival) to the age-by-sex comparison. Among females, survivors and non-survivors share broadly similar age distributions, though survivors skew slightly younger (median ~27 vs. ~30), suggesting that while sex is the dominant predictor, age exerts a secondary influence. Among males, the survival distribution is markedly younger relative to the non-survival distribution; boys under approximately 10 years of age show a substantially higher survival probability, reflecting the child-protective element of evacuation priority. The peak of non-surviving male deaths occurs in the 20-35 age range, corresponding to the largest demographic aboard. This three-variable encoding (age, sex, survival) in a single plot demonstrates the analytical power of combining continuous and categorical variables through density faceting.

Plot 4 – Survival Rate by Passenger Class (Categorical vs. Categorical, Colored)

titanic2_replace <- titanic2_replace %>%
  mutate(
    Survived = factor(Survived, levels = c(0, 1), labels = c("Died", "Survived")),
    Sex      = factor(Sex)
  ) %>%
  filter(!is.na(Survived))

pclass_surv <- titanic2_replace %>%
  dplyr::filter(!is.na(Survived)) %>%
  dplyr::count(Pclass, Survived) %>%
  dplyr::group_by(Pclass) %>%
  dplyr::mutate(pct = n / sum(n))
  
  ggplot(pclass_surv, aes(x = Pclass, y = pct, fill = Survived)) +
  geom_col(color = "white") +
  geom_text(
    aes(label = scales::percent(pct, accuracy = 0.1)),
    position = position_stack(vjust = 0.5),
    size = 3
  ) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Survival Rate by Passenger Class",
    x = "Passenger Class",
    y = "Proportion",
    fill = "Outcome"
  ) +
  theme_minimal()

Interpretation: A strong socioeconomic gradient in survival is immediately apparent. First-class passengers survived at a rate of approximately 57.6%, second-class at 42.2%, and third-class at only 26.9%. This pattern reflects multiple structural factors: first-class cabins were located on upper decks closer to the lifeboats; wealthier passengers were more likely aware of emergency procedures; and class-based social hierarchies influenced crew behavior during the evacuation. The stacked bar format is optimal for this comparison because it preserves both the absolute class totals and the within-class outcome split, making the survival gradient legible at a glance.

Step 4: Importance of Multivariate Data Visualization

  • In your R Markdown document, discuss the significance of multivariate data visualization in the context of Exploratory Data Analysis (EDA). Explain how it helps in understanding complex relationships and interactions among variables.

Answer

Multivariate visualization isn’t just a nice addition to EDA, it’s how you uncover how variables actually interact. Univariate summaries like histograms or tables describe variables in isolation, but real patterns come from relationships between variables. Putting multiple variables into one plot makes those relationships visible.

The Titanic data shows this clearly. An overall survival rate of about 38% doesn’t say much. Once you break it down by sex, class, and age, the structure becomes obvious. Survival wasn’t random, it followed clear social patterns. Each added variable, through color, facets, or position, gives more context that a single-variable summary can’t capture.

These plots also help diagnose issues. A pairs plot quickly shows correlations, like the moderate link between SibSp and Parch, which can cause multicollinearity in models. Distribution plots reveal differences that averages hide, like the much wider spread of first-class fares compared to third class.

They also help generate hypotheses. For example, the age-by-sex-by-survival plot suggests that age affects survival differently for males and females, especially among children. That insight points toward adding interaction terms in a model.

Good EDA moves from simple summaries to conditional relationships, then to interactions. Choosing the right plot depends on the type of variables you’re working with. Matching the visualization to the data isn’t just about clarity, it’s necessary for making valid conclusions.

Step 5: Summary of EDA Findings

  • Summarize the key observations and insights gained from the pairs plot, as well as the four additional plots.

  • Discuss the EDA steps you took, including variable selection and the types of plots used.

  • Justify your choices throughout the EDA process, explaining the relevance of each visualization in uncovering patterns and relationships within the data.

Answer

Variable Selection and Visualization Choices

The analysis focused on the six most analytically substantive variables in the cleaned dataset: the four quantitative variables (Age, Fare, SibSp, Parch) for the pairs plot, and Survived, Pclass, and Sex for the four targeted plots. These selections were driven by analytical significance — the capacity of each variable to illuminate survival patterns — rather than arbitrary selection.

The GGally::ggpairs() function was selected for Step 1 because it renders pairwise scatter plots, density diagonals, and Pearson correlation coefficients in a unified display, maximizing information density. For the four targeted plots, visualization type was matched to variable scale: grouped and stacked bar charts for categorical-by-categorical comparisons (Plots 1 and 4), and a violin-boxplot combination plus faceted density curves for categorical-by-quantitative comparisons (Plots 2 and 3).

Key Observations

Pairs Plot: The quantitative variables exhibit weak to moderate pairwise correlations, with the strongest relationship between SibSp and Parch (r = 0.37), reflecting coherent family-unit travel. Fare is the most right-skewed variable, reflecting the extreme pricing differential between first- and lower-class accommodation. Age displays the most symmetric distribution after median imputation.

Sex and Survival (Plot 1): Female passengers survived at 6.4 times the rate of male passengers (82.6% vs. 12.9%). Sex is the dominant predictor in the dataset.

Fare and Passenger Class (Plot 2): Passenger class produces a clear stratification in fare that is only visible on a log scale due to extreme right skew in first-class fares. The first-class fare range (approximately $5 to $512) encompasses the entire fare range of second- and third-class passengers.

Age, Sex, and Survival (Plot 3): Among males, young boys exhibit markedly higher survival rates than adult males, suggesting child-protective evacuation priority. Among females, age exerts only a weak secondary influence on survival outcome relative to sex.

Passenger Class and Survival (Plot 4): A monotonic socioeconomic survival gradient exists: 57.6% survival in first class, 42.2% in second class, 26.9% in third class. Class-based proximity to lifeboats and social hierarchy during evacuation are the primary explanatory mechanisms.

EDA Steps Taken

The analytical sequence applied was: (1) dataset loading and Assignment-2 cleaning replication; (2) pairs plot construction for quantitative correlation and distribution structure; (3) targeted bivariate plots selected by analytical salience and variable-type appropriateness; (4) plot interpretation grounded in domain knowledge and distributional evidence; (5) integration of findings into a cohesive account of survival determinants. At each step, variable measurement scales governed technique selection, and all interpretations are constrained to the observational evidence visible in the data.

Assignment 5: Generate and Interpret Descriptive Statistics

Step 1: Select Variables for Descriptive and Bivariate Statistics

  • Choose at least four variables: two quantitative and two qualitative, from your dataset for analysis.

Answer

Four variables are selected to represent a range of measurement scales:

Variable Type Scale Justification
Age Quantitative Ratio Key demographic; central to survival analysis
Fare Quantitative Ratio Proxy for socioeconomic status; known survival predictor
Sex Qualitative Nominal Primary survival predictor in Titanic literature
Pclass Qualitative Ordinal Structural proxy for passenger socioeconomic class

pclass is encoded as an integer (1, 2, 3) but treated as an ordinal categorical variable throughout this analysis, consistent with its nominal interpretation in the literature.

Step 2: Univariate Descriptive Statistics for Quantitative Variables

For each of the selected quantitative variables, calculate univariate statistics including:

  • Measures of center: Mean, Median

  • Measures of dispersion: Standard Deviation, Range, Interquartile Range (IQR)

  • Measures of spread: Variance

  • Location: Skewness and Kurtosis

Use appropriate R functions to compute these statistics.

# Function for the descriptive statistics
compute_stats <- function(x, var_name) {
   # ensure required functions are available
   if (!requireNamespace("e1071", quietly = TRUE)) {
    stop("Package 'e1071' is required for skewness and kurtosis.")
  }
  x_clean <- x[!is.na(x)]
  tibble(
    Variable = var_name,
    N = round(length(x_clean),0),
    Mean = round(mean(x_clean), 3),
    Median = round(median(x_clean), 3),
    SD = round(sd(x_clean), 3),
    Variance = round(var(x_clean), 3),
    Min = round(min(x_clean), 3),
    Max = round(max(x_clean),3),
    Range = round(max(x_clean) - min(x_clean), 3),
    IQR = round(IQR(x_clean), 3),
    Skewness = round(e1071::skewness(x_clean), 3),
    Kurtosis = round(e1071::kurtosis(x_clean), 3)
  )
}

# Descriptive statistics for Age and Fare
df_titanic <- titanic2_replace

stat_age <- compute_stats(df_titanic["Age"], "Age")
stat_fare <- compute_stats(df_titanic["Fare"], "Fare")

stats_comb <- bind_rows(stat_age, stat_fare)

stats_t <- as.data.frame(t(stats_comb[, -1]))
colnames(stats_t) <- stats_comb$Variable
rownames(stats_t) <- colnames(stats_comb)[-1]

kable(stats_t, 
      caption = "Tabe1. Univariate Descriptive Statistics: Age and Fare (n = 1309)",
      align   = "rr") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE,
                position   = "center")
Tabe1. Univariate Descriptive Statistics: Age and Fare (n = 1309)
Age Fare
N 1309.000 1309.000
Mean 29.503 33.281
Median 28.000 14.454
SD 12.905 51.741
Variance 166.545 2677.183
Min 0.170 0.000
Max 80.000 512.329
Range 79.830 512.329
IQR 13.000 23.379
Skewness 0.540 4.360
Kurtosis 0.956 26.896
# Precompute summary stats (avoids repeating calculations)
age_mean   <- stats_t[2,1]
age_median <- stats_t[3,1]

fare_mean   <- stats_t[2,2]
fare_median <- stats_t[3,2]

age_skew <- stats_t[10,1]
age_kurt <- stats_t[11,1]

fare_skew <- stats_t[10,2]
fare_kurt <- stats_t[11,2]

# ── Age histogram ─────────────────────────────────────────
p1 <- ggplot(df_titanic, aes(Age)) +
  geom_histogram(binwidth = 5, fill = "#2E75B6",
                 color = "white", alpha = 0.85) +
  geom_vline(xintercept = age_mean,
             color = "#C00000", linetype = "dashed", linewidth = 0.9) +
  geom_vline(xintercept = age_median,
             color = "#ED7D31", linetype = "dotdash", linewidth = 0.9) +
  annotate("text",
           x = age_mean, y = Inf, vjust = 2,
           label = paste0("Mean = ", round(age_mean, 1)),
           color = "#C00000", size = 3.5) +
  annotate("text",
           x = age_median, y = Inf, vjust = 3.5,
           label = paste0("Median = ", round(age_median, 1)),
           color = "#ED7D31", size = 3.5) +
  labs(
    title = "Distribution of Age",
    subtitle = paste0(
      "Skewness = ", round(age_skew, 3),
      " | Kurtosis = ", round(age_kurt, 3)
    ),
    x = "Age (years)", y = "Count"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold"),
    plot.subtitle = element_text(color = "gray40", size = 10)
  )

# ── Age boxplot ───────────────────────────────────────────
p2 <- ggplot(df_titanic, aes(y = Age)) +
  geom_boxplot(fill = "#2E75B6", alpha = 0.7,
               outlier.color = "#C00000", outlier.alpha = 0.4) +
  labs(title = "Boxplot of Age", y = "Age (years)", x = NULL) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold"),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )

# ── Fare histogram ────────────────────────────────────────
p3 <- ggplot(df_titanic, aes(Fare)) +
  geom_histogram(binwidth = 15, fill = "#70AD47",
                 color = "white", alpha = 0.85) +
  geom_vline(xintercept = fare_mean,
             color = "#C00000", linetype = "dashed", linewidth = 0.9) +
  geom_vline(xintercept = fare_median,
             color = "#ED7D31", linetype = "dotdash", linewidth = 0.9) +
  annotate("text",
           x = fare_mean, y = Inf, vjust = 2,
           label = paste0("Mean = ", round(fare_mean, 1)),
           color = "#C00000", size = 3.5) +
  annotate("text",
           x = fare_median, y = Inf, vjust = 3.5,
           label = paste0("Median = ", round(fare_median, 1)),
           color = "#ED7D31", size = 3.5) +
  labs(
    title = "Distribution of Fare",
    subtitle = paste0(
      "Skewness = ", round(fare_skew, 3),
      " | Kurtosis = ", round(fare_kurt, 3)
    ),
    x = "Fare (GBP)", y = "Count"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold"),
    plot.subtitle = element_text(color = "gray40", size = 10)
  )

# ── Fare boxplot ──────────────────────────────────────────
p4 <- ggplot(df_titanic, aes(y = Fare)) +
  geom_boxplot(fill = "#70AD47", alpha = 0.7,
               outlier.color = "#C00000", outlier.alpha = 0.4) +
  labs(title = "Boxplot of Fare", y = "Fare (GBP)", x = NULL) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold"),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )

# ── Combine plots ─────────────────────────────────────────
gridExtra::grid.arrange(
  p1, p2, p3, p4,
  ncol = 2,
  top = "Figure 1. Distributional Profiles: Age and Fare"
)

Step 3: Univariate Descriptive Statistics for Qualitative Variables

For each of the selected qualitative variables, calculate the frequency of each category or level.

freq_sex <- df_titanic |>
  dplyr::count(Sex, name = "Frequency") |>
  dplyr::mutate(
    Proportion = round(Frequency / sum(Frequency), 4),
    Percentage = paste0(round(Proportion * 100, 2), "%")
  ) |>
  dplyr::rename(Category = Sex)


kable(freq_sex,
      caption = "Table 2. Frequency Distribution: Sex",
      align   = "lrrr") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, position = "center")
Table 2. Frequency Distribution: Sex
Category Frequency Proportion Percentage
female 466 0.356 35.6%
male 843 0.644 64.4%
freq_pclass <- df_titanic |>
  dplyr::count(Pclass, name = "Frequency") |>
  dplyr::mutate(
    Proportion = round(Frequency / sum(Frequency), 4),
    Percentage = paste0(round(Proportion * 100, 2), "%")
  ) |>
  dplyr::rename(Category = Pclass)


kable(freq_pclass,
      caption = "Table 3. Frequency Distribution: Passenger Class",
      align   = "lrrr") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, position = "center")
Table 3. Frequency Distribution: Passenger Class
Category Frequency Proportion Percentage
1st 323 0.2468 24.68%
2nd 277 0.2116 21.16%
3rd 709 0.5416 54.16%
# ── Sex bar chart ─────────────────────────────────────────────────────────────
p5 <- df_titanic |>
  dplyr::count(Sex) |>
  dplyr::mutate(pct = n / sum(n),
         Sex = str_to_title(Sex)) |>
  ggplot(aes(x = Sex, y = n, fill = Sex)) +
  geom_col(width = 0.55, alpha = 0.88) +
  geom_text(aes(label = paste0(n, "\n(", round(pct * 100, 1), "%)")),
            vjust = -0.4, size = 4, fontface = "bold") +
  scale_fill_manual(values = c("Female" = "#C00000", "Male" = "#2E75B6")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Passenger Count by Sex",
       x = "Sex", y = "Frequency") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

# ── Pclass bar chart ──────────────────────────────────────────────────────────
p6 <- df_titanic |>
  dplyr::count(Pclass) |>
  dplyr::mutate(pct = n / sum(n)) |>
  ggplot(aes(x = Pclass, y = n, fill = Pclass)) +
  geom_col(width = 0.55, alpha = 0.88) +
  geom_text(aes(label = paste0(n, "\n(", round(pct * 100, 1), "%)")),
            vjust = -0.4, size = 4, fontface = "bold") +
  scale_fill_manual(values = c("1st" = "#ED7D31", "2nd" = "#FFC000", "3rd" = "#70AD47")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Passenger Count by Class",
       x = "Passenger Class", y = "Frequency") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

gridExtra::grid.arrange(p5, p6, ncol = 2,
             top = "Figure 2. Frequency Distributions: Sex and Passenger Class")

Step 4: Bivariate Statistics

  • Calculate correlations between/among the quantitative variables. Use the correlation matrix to determine the strength and direction of relationships.

  • For qualitative variables, create crosstabs (contingency tables) to examine relationships between categories. Calculate chi-squared tests if applicable.

# Select quantitative variables
quant_vars <- df_titanic |>
  select(Age, Fare)

# Pearson correlation matrix
cor_matrix <- cor(quant_vars, method = "pearson", use = "complete.obs")

# Round and display
cor_display <- round(cor_matrix, 4)

kable(cor_display,
      caption = "Table 4. Pearson Correlation Matrix: Age and Fare",
      align   = "cc") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, position = "center")
Table 4. Pearson Correlation Matrix: Age and Fare
Age Fare
Age 1.0000 0.1782
Fare 0.1782 1.0000
# Extract the single off-diagonal correlation value
r_age_fare <- cor_matrix["Age", "Fare"]
cat(sprintf("\nPearson r (Age, Fare) = %.4f\n", r_age_fare))
## 
## Pearson r (Age, Fare) = 0.1782
# Significance test for Age ~ Fare correlation
cor_test_result <- cor.test(df_titanic$Age,
                             df_titanic$Fare,
                             method = "pearson")

cat(sprintf("t(%d) = %.3f, p = %.4f, 95%% CI [%.4f, %.4f]\n",
            cor_test_result$parameter,
            cor_test_result$statistic,
            cor_test_result$p.value,
            cor_test_result$conf.int[1],
            cor_test_result$conf.int[2]))
## t(1307) = 6.546, p = 0.0000, 95% CI [0.1252, 0.2301]
ggplot(df_titanic, aes(Age, Fare)) +
  geom_point(alpha = 0.25, color = "#2E75B6", size = 1.5) +
  geom_smooth(method = "lm", se = TRUE,
              color = "#C00000", linewidth = 1) +
  annotate(
    "text",
    x = max(df_titanic$Age, na.rm = TRUE) * 0.75,
    y = max(df_titanic$Fare, na.rm = TRUE) * 0.9,
    label = sprintf(
      "r = %.3f\np = %s",
      r_age_fare,
      format(cor_test_result$p.value, digits = 3, scientific = TRUE)
    ),
    size = 4.5, color = "#333333", hjust = 0
  ) +
  scale_y_continuous(labels = scales::dollar_format(prefix = "£")) +
  labs(
    title = "Figure 3. Scatter Plot: Age vs. Fare",
    subtitle = "OLS regression line with 95% confidence interval",
    x = "Age (years)",
    y = "Fare (GBP)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold"),
    plot.subtitle = element_text(color = "gray40")
  )
## `geom_smooth()` using formula = 'y ~ x'

# Raw frequency crosstab
crosstab_raw <- table(
  Sex   = str_to_title(df_titanic$Sex),
  Class = df_titanic$Pclass
)

kable(crosstab_raw,
      caption = "Table 5. Crosstabulation: Sex by Passenger Class (Raw Frequencies)",
      align   = "ccc") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, position = "center") |>
  add_header_above(c(" " = 1, "Passenger Class" = 3))
Table 5. Crosstabulation: Sex by Passenger Class (Raw Frequencies)
Passenger Class
1st 2nd 3rd
Female 144 106 216
Male 179 171 493
# Row percentages
crosstab_rowpct <- prop.table(crosstab_raw, margin = 1) * 100

kable(round(crosstab_rowpct, 2),
      caption = "Table 6. Crosstabulation: Sex by Passenger Class (Row Percentages)",
      align   = "ccc") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, position = "center") |>
  add_header_above(c(" " = 1, "Passenger Class" = 3))
Table 6. Crosstabulation: Sex by Passenger Class (Row Percentages)
Passenger Class
1st 2nd 3rd
Female 30.90 22.75 46.35
Male 21.23 20.28 58.48
# Chi-squared test of independence: Sex ~ Pclass
chi2_result <- chisq.test(crosstab_raw)

cat("Pearson Chi-Squared Test: Sex x Passenger Class\n")
## Pearson Chi-Squared Test: Sex x Passenger Class
cat("---------------------------------------------------\n")
## ---------------------------------------------------
cat(sprintf("X-squared = %.4f\n", chi2_result$statistic))
## X-squared = 20.3788
cat(sprintf("df        = %d\n",    chi2_result$parameter))
## df        = 2
cat(sprintf("p-value   = %.4f\n",  chi2_result$p.value))
## p-value   = 0.0000
cat(sprintf("Effect size (Cramer's V) = %.4f\n",
            sqrt(chi2_result$statistic /
                   (sum(crosstab_raw) * (min(dim(crosstab_raw)) - 1)))))
## Effect size (Cramer's V) = 0.1248
# Display expected counts to verify chi-squared assumptions
cat("\nExpected cell counts (all must be >= 5 for valid chi-squared):\n")
## 
## Expected cell counts (all must be >= 5 for valid chi-squared):
print(round(chi2_result$expected, 2))
##         Class
## Sex         1st    2nd   3rd
##   Female 114.99  98.61 252.4
##   Male   208.01 178.39 456.6
# Stacked percentage bar chart: Sex by Pclass
df_titanic |>
  dplyr::count(Sex, Pclass) |>
  dplyr::group_by(Sex) |>
  dplyr::mutate(pct = n / sum(n)) |>
  dplyr::ungroup() |>
  
  ggplot(aes(Sex, pct, fill = Pclass)) +
  geom_col(position = "fill", width = 0.55, alpha = 0.9) +
  geom_text(
    aes(label = scales::percent(pct, accuracy = 0.1)),
    position = position_stack(vjust = 0.5),
    size = 4,
    color = "white",
    fontface = "bold"
  ) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_manual(
    values = c(
      "1st" = "#ED7D31",
      "2nd" = "#FFC000",
      "3rd" = "#70AD47"
    )
  ) +
  labs(
    title = "Figure 4. Class Composition by Sex (Row Percentages)",
    subtitle = expression(
      paste(chi^2, " = 20.3788, df = 2, p < .001, ", "Cramer's V = 0.1248")
    ),
    x = "Sex",
    y = "Proportion",
    fill = "Passenger Class"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold"),
    plot.subtitle = element_text(color = "gray40"),
    legend.position = "right"
  )

Step 5: Interpretation of Descriptive and Bivariate Statistics

  • Describe what the descriptive statistics reveal about the quantitative variables. Explain the central tendency, variability, skewness, and kurtosis. Comment on the normality of the data.

  • Explain the meaning of the crosstabs and correlations between/among variables. Discuss the strength and direction of relationships, and identify any statistically significant associations.

Age

The age distribution is approximately symmetric but exhibits a mild positive skew (skewness = 0.540), meaning a modest concentration of passengers in the younger age ranges with a moderate tail extending into older age. Post-imputation, the mean age (approximately 29.5 years) remains close to the median (28.0 years), confirming that the imputation of missing values with the median did not materially distort central tendency. The standard deviation (approximately 13.0 years) and IQR (approximately 13.0 years) indicate moderate dispersion around the center.

Kurtosis near 1.0 suggests a distribution close to “mesokurtic”, meaning neither heavy- nor light-tailed relative to a normal distribution. While a formal normality test such as Shapiro-Wilk would be required for rigorous inferential applications, the skewness and kurtosis values combined with the histogram suggest the age distribution is approximately normal, with some deviation driven by imputation at the median and a non-trivial proportion of child passengers. The presence of missing age values (approximately 20.09% prior to imputation) and the decision to impute with the median; appropriate for right-skewed data as recommended by Little and Rubin (2002); means the post-imputation distribution is slightly more concentrated around the median than the original data would have been.

Fare

The fare distribution is strongly right-skewed (skewness > 4.0), driven by a small number of passengers in premium cabin classes who paid substantially more than the majority of third-class passengers. The mean fare (approximately £33) significantly exceeds the median (approximately £14.45), which is a direct consequence of positive skew: the arithmetic mean is pulled upward by high-value outliers. This divergence between mean and median is the defining diagnostic of a skewed distribution and underscores why the median was selected as the imputation statistic for Fare in Assignment 2. The IQR (approximately £23.4) further confirms that the central 50% of fares are relatively compact, while the upper tail extends to values exceeding £500.

Excess kurtosis well above 3.0 indicates a “leptokurtic” distribution: the fare data has heavier tails and a sharper peak than a normal distribution. This pattern is consistent with class-stratified pricing structures in early 20th-century ocean travel and reflects the economic heterogeneity captured by the Pclass variable. Fare cannot be assumed to follow a normal distribution, and any parametric analysis involving fare should apply appropriate transformations or robust methods.

Sex

Approximately 64% of passengers were male, and 36% were female. This imbalance reflects the documented passenger composition of the Titanic, which carried a larger proportion of male passengers, particularly in third class. For analysis purposes, the unequal class sizes must be accounted for when computing sex-stratified survival rates; raw counts alone will over-represent male outcomes.

Passenger Class

Third-class passengers comprise the largest segment (approximately 54%), followed by first class (approximately 24%) and second class (approximately 21%). The concentration of passengers in third class is consistent with the demographic profile of transatlantic emigration in the early 20th century. This uneven distribution has direct implications for bivariate analysis: third-class passengers will disproportionately influence aggregate survival rates, and analyses of class differences must account for the unequal group sizes.

Correlation: Age and Fare

The Pearson correlation between Age and Fare is weak and positive (r ≈ 0.18, p < .001). Although the association is statistically significant due to the large sample size (n = 1,309), the effect is small in practical terms. Older passengers paid marginally higher fares on average, which is plausibly explained by their tendency to purchase first-class tickets. However, the weak correlation indicates that age alone is a poor predictor of fare. The scatter plot confirms high variability: passengers of all ages span a wide range of fare values, and the regression line, while positively sloped, explains only a small fraction of the total variance (R² ≈ 0.032). The relationship is driven primarily by a small cluster of older, high-paying passengers rather than a systematic linear trend across the full age range.

Chi-Squared Test: Sex by Passenger Class

The chi-squared test of independence between Sex and Passenger Class was statistically significant ( X² = 20.38, df = 2, p < .001). This finding indicates that the distribution of passengers across classes was not independent of sex. However, the effect size is negligible (Cramer’s V ≈ 0.1248), meaning that while the association is real, it explains very little of the variance in class composition.

Examining the row percentage cross-tabulation reveals the direction of the association. Female passengers were modestly more concentrated in first class relative to male passengers, a pattern consistent with historical accounts of travel party compositions in the Titanic dataset. Third class accounted for a majority of passengers in both sex categories, confirming that the class imbalance documented in the univariate analysis holds across sex groups. This weak but detectable association between sex and class carries implications for multivariate survival modeling: both variables are partially confounded and their individual contributions to survival outcomes must be disentangled through stratified analysis or regression.

Step 6: Importance of Statistics and Visualizations

In your R Markdown document, discuss the importance of using statistics in conjunction with visualizations as part of Exploratory Data Analysis (EDA). Explain how statistics provide quantitative insights and complement visualizations by quantifying relationships and patterns.

Answer

The relationship between statistics and visualizations in EDA is not one of redundancy but of mutual reinforcement. Tukey (1977), in introducing the foundational principles of EDA, argued that graphical displays should be the primary vehicle for pattern discovery because the human visual system is efficient at detecting shape, outliers, and relative magnitude without formal hypothesis testing. Histograms reveal modality and skew; scatter plots expose clustering and non-linearity; bar charts communicate proportional composition at a glance. These properties make visualizations indispensable as the first-pass tool in any exploratory analysis.

Statistics complement this process by translating visual impressions into precise, reproducible quantities. A histogram of Fare suggests strong right skew, but only the skewness statistic (> 4.0) and the divergence between mean (£33) and median (£14.45) make this characterization precise enough to inform downstream analytical decisions, such as the choice between median-based imputation and mean-based imputation, or the decision to apply a log transformation before modeling. Similarly, the scatter plot of Age versus Fare suggests a faint upward trend, but only the Pearson correlation (r = 0.18, p < .001) confirms the direction, quantifies the magnitude, and distinguishes a real but weak relationship from random noise.

In practice, the workflow applied throughout this assignment demonstrates this integration. Bar charts for Sex and Pclass made the distributional imbalances immediately apparent; frequency tables then provided exact counts and percentages that could be cited in formal analysis. The chi-squared test confirmed a statistically significant association between sex and passenger class that would have been difficult to detect from the stacked bar chart alone, while Cramer’s V attached a meaningful effect size to the p-value, preventing over-interpretation of a statistically significant but practically negligible relationship. This is the core value proposition of combining statistics and visualizations: each addresses the blind spots of the other, and together they provide a rigorous and complete characterization of the data.

Assignment 6: Generate and Apply Transformations and Scaling

library(tidyverse)
library(ggplot2)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.5.3
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(moments)   # skewness and kurtosis
## 
## Attaching package: 'moments'
## The following object is masked _by_ '.GlobalEnv':
## 
##     skewness

Step 1: Randomly Split Data into Training and Test Sets

  • Use the set.seed() function to ensure replicability.
set.seed(42)
  • Randomly split your dataset into an 80% training set and a 20% test set. You can use R’s sample() function or an appropriate library for this purpose.
library(dplyr)

n      <- nrow(titanic2_replace)
idx    <- sample(seq_len(n), size = floor(0.80 * n), replace = FALSE)

train_df <- df_titanic[idx, ]

test_df <- df_titanic[-idx,]

cat("Training set:", nrow(train_df), "observations\n")
## Training set: 1047 observations
cat("Test set:    ", nrow(test_df),  "observations\n")
## Test set:     262 observations
# Confirm class balance is preserved
cat("\nSurvival rate - Training:",
    round(mean(train_df$Survived == 1, na.rm = TRUE) * 100, 4), "%\n")
## 
## Survival rate - Training: 0 %
cat("Survival rate - Test:    ",
    round(mean(test_df$Survived == 1, na.rm = TRUE) * 100, 4), "%\n")
## Survival rate - Test:     0 %

Step 2: Evaluate Transformations for Quantitative Variables

  • Examine the distribution of each quantitative variable in the training set.

  • Determine if transformations are necessary to make the data more normal. This could include square root, log transformation, or any other appropriate method.

  • Do not use the test set for estimating the appropriate transformation.

library(e1071)
## Warning: package 'e1071' was built under R version 4.5.3
## 
## Attaching package: 'e1071'
## The following object is masked _by_ '.GlobalEnv':
## 
##     skewness
## The following objects are masked from 'package:moments':
## 
##     kurtosis, moment, skewness
## The following object is masked from 'package:ggplot2':
## 
##     element
quant_vars <- c("Age", "Fare", "SibSp", "Parch")

dist_summary <- data.frame(
  Variable = quant_vars,
  Mean     = sapply(train_df[quant_vars], function(x) round(mean(x, na.rm = TRUE), 3)),
  Median   = sapply(train_df[quant_vars], function(x) round(median(x, na.rm = TRUE), 3)),
  SD       = sapply(train_df[quant_vars], function(x) round(sd(x, na.rm = TRUE), 3)),
  skew_stats = sapply(train_df[quant_vars], function(x) {round(e1071::skewness(x, na.rm = TRUE), 3)}),
  kurt_stats = sapply(train_df[quant_vars], function(x) {round(e1071::kurtosis(x, na.rm = TRUE), 3)})
)

print(dist_summary)
##       Variable   Mean Median     SD skew_stats kurt_stats
## Age        Age 29.397 28.000 13.009      0.546      0.938
## Fare      Fare 32.846 14.454 51.201      4.386     27.046
## SibSp    SibSp  0.520  0.000  1.044      3.762     19.305
## Parch    Parch  0.403  0.000  0.895      3.693     21.776
plot_hist <- function(var, df, title) {
  ggplot(df, aes_string(x = var)) +
    geom_histogram(bins = 30, fill = "#4472C4", color = "white", alpha = 0.85) +
    labs(title = title, x = var, y = "Count") +
    theme_minimal(base_size = 11) +
    theme(plot.title = element_text(face = "bold"))
}

p1 <- plot_hist("Age",   train_df, "Age (Raw)")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
p2 <- plot_hist("Fare",  train_df, "Fare (Raw)")
p3 <- plot_hist("SibSp", train_df, "SibSp (Raw)")
p4 <- plot_hist("Parch", train_df, "Parch (Raw)")

grid.arrange(p1, p2, p3, p4, ncol = 2,
             top = "Figure 1: Raw Distributions of Quantitative Variables (Training Set)")

Distributional findings:

  • Age (skewness ≈ 0.55): Mildly right-skewed. A square root transformation is appropriate to reduce asymmetry without over-compressing the distribution.

  • Fare (skewness ≈ 4.4): Severely right-skewed with a long upper tail. A log(x + 1) transformation is appropriate; the offset of 1 accommodates zero-fare records.

  • SibSp (skewness ≈ 4): Right-skewed count variable with substantial zero mass. A square root transformation reduces skewness without excluding zero values.

  • Parch (skewness > 3): Right-skewed count variable with substantial zero mass. Square root transformation is appropriate for the same reason as SibSp.

Step 3: Apply Transformations to the Training Data

  • Execute the chosen transformations on the training data for each of the quantitative variables.
train_transformed <- train_df %>%
  mutate(
    age_sqrt   = sqrt(Age),
    fare_log   = log1p(Fare),
    sibsp_sqrt = sqrt(SibSp),
    parch_sqrt = sqrt(Parch)
  )

# Post-transformation skewness
trans_vars  <- c("age_sqrt", "fare_log", "sibsp_sqrt", "parch_sqrt")
orig_vars   <- c("Age",      "Fare",     "SibSp",      "Parch")

transform_comparison <- data.frame(
  Variable          = orig_vars,
  Skewness_Raw      = round(sapply(orig_vars,  function(v) skewness(train_df[[v]]), USE.NAMES = FALSE), 3),
  Transformation    = c("sqrt(Age)", "log1p(Fare)", "sqrt(SibSp)", "sqrt(Sarch)"),
  Skewness_Transformed = round(sapply(trans_vars, function(v) skewness(train_transformed[[v]]), USE.NAMES = FALSE), 3)
)

print(transform_comparison)
##   Variable Skewness_Raw Transformation Skewness_Transformed
## 1      Age        0.548      sqrt(Age)               -0.738
## 2     Fare        4.399    log1p(Fare)                0.564
## 3    SibSp        3.772    sqrt(SibSp)                1.307
## 4    Parch        3.704    sqrt(Sarch)                1.607
p5  <- plot_hist("age_sqrt",   train_transformed, "Age (sqrt)")
p6  <- plot_hist("fare_log",   train_transformed, "Fare (log1p)")
p7  <- plot_hist("sibsp_sqrt", train_transformed, "SibSp (sqrt)")
p8  <- plot_hist("parch_sqrt", train_transformed, "Parch (sqrt)")

grid.arrange(p5, p6, p7, p8, ncol = 2,
             top = "Figure 2: Transformed Distributions (Training Set)")

Step 4: Apply Transformations to the Test Data

  • Apply the same transformations to the test data for each of the quantitative variables. It’s crucial to use the same transformations that were determined based on the training data.
test_transformed <- test_df %>%
  mutate(
    age_sqrt   = sqrt(Age),
    fare_log   = log1p(Fare),
    sibsp_sqrt = sqrt(SibSp),
    parch_sqrt = sqrt(Parch)
  )

# Confirm no new missingness was introduced
cat("Missing values in transformed training variables:\n")
## Missing values in transformed training variables:
print(colSums(is.na(train_transformed[trans_vars])))
##   age_sqrt   fare_log sibsp_sqrt parch_sqrt 
##          0          0          0          0
cat("\nMissing values in transformed test variables:\n")
## 
## Missing values in transformed test variables:
print(colSums(is.na(test_transformed[trans_vars])))
##   age_sqrt   fare_log sibsp_sqrt parch_sqrt 
##          0          0          0          0

Step 5: Min-Max Scaling

  • Perform min-max scaling on the quantitative training data so that all observations are between 0 and 1.

  • Do not use the test set for the scaling process.

Min-max scaling maps each transformed variable to the [0, 1] interval using the formula:

\[x_{\text{scaled}} = \frac{x - \min_{\text{train}}}{\max_{\text{train}} - \min_{\text{train}}}\]

All scaling parameters are derived exclusively from the training partition.

# Store training min and max for each transformed variable
scaling_params <- lapply(trans_vars, function(v) {
  list(
    min = min(train_transformed[[v]], na.rm = TRUE),
    max = max(train_transformed[[v]], na.rm = TRUE)
  )
})
names(scaling_params) <- trans_vars

# Apply min-max scaling to training data
train_scaled <- train_transformed
for (v in trans_vars) {
  vmin <- scaling_params[[v]]$min
  vmax <- scaling_params[[v]]$max
  train_scaled[[paste0(v, "_scaled")]] <- (train_transformed[[v]] - vmin) / (vmax - vmin)
}

scaled_vars <- paste0(trans_vars, "_scaled")

# Verify all scaled values are in [0, 1]
scale_check <- data.frame(
  Variable = scaled_vars,
  Min      = round(sapply(scaled_vars, function(v) min(train_scaled[[v]], na.rm = TRUE)), 6),
  Max      = round(sapply(scaled_vars, function(v) max(train_scaled[[v]], na.rm = TRUE)), 6),
  Mean     = round(sapply(scaled_vars, function(v) mean(train_scaled[[v]], na.rm = TRUE), USE.NAMES = FALSE), 4)
)

print(scale_check)
##                            Variable Min Max   Mean
## age_sqrt_scaled     age_sqrt_scaled   0   1 0.5681
## fare_log_scaled     fare_log_scaled   0   1 0.4761
## sibsp_sqrt_scaled sibsp_sqrt_scaled   0   1 0.1410
## parch_sqrt_scaled parch_sqrt_scaled   0   1 0.1006
params_df <- data.frame(
  Variable      = trans_vars,
  Train_Min     = round(sapply(trans_vars, function(v) scaling_params[[v]]$min), 4),
  Train_Max     = round(sapply(trans_vars, function(v) scaling_params[[v]]$max), 4)
)
print(params_df)
##              Variable Train_Min Train_Max
## age_sqrt     age_sqrt    0.4123    8.9443
## fare_log     fare_log    0.0000    6.2409
## sibsp_sqrt sibsp_sqrt    0.0000    2.8284
## parch_sqrt parch_sqrt    0.0000    3.0000

###Step 6: Apply Scaling to the Test Set

  • Apply the same scaling method (min-max scaling) to the test set for the quantitative variables.
test_scaled <- test_transformed
for (v in trans_vars) {
  vmin <- scaling_params[[v]]$min
  vmax <- scaling_params[[v]]$max
  test_scaled[[paste0(v, "_scaled")]] <- (test_transformed[[v]] - vmin) / (vmax - vmin)
}

# Check for out-of-range values (values beyond training bounds are valid in deployment)
oor_check <- data.frame(
  Variable   = scaled_vars,
  Test_Min   = round(sapply(scaled_vars, function(v) min(test_scaled[[v]], na.rm = TRUE)), 4),
  Test_Max   = round(sapply(scaled_vars, function(v) max(test_scaled[[v]], na.rm = TRUE)), 4),
  Below_Zero = sapply(scaled_vars, function(v) sum(test_scaled[[v]] < 0, na.rm = TRUE)),
  Above_One  = sapply(scaled_vars, function(v) sum(test_scaled[[v]] > 1, na.rm = TRUE))
)

print(oor_check)
##                            Variable Test_Min Test_Max Below_Zero Above_One
## age_sqrt_scaled     age_sqrt_scaled    0.019   0.9393          0         0
## fare_log_scaled     fare_log_scaled    0.000   1.0000          0         0
## sibsp_sqrt_scaled sibsp_sqrt_scaled    0.000   1.0000          0         0
## parch_sqrt_scaled parch_sqrt_scaled    0.000   0.7454          0         0

Step 7: Interpretability of Variables for Forecasting

  • Discuss the ease or difficulty of interpreting what the variables mean for forecasting after transformations and scaling. Consider whether these operations make the data more interpretable or not.

  • Explain the purpose of these operations in terms of improving the quality of data for predictive modeling.

Age was transformed by square root. The original unit of years is no longer preserved. A scaled value of 0.6, for instance, does not correspond to a directly readable age without back-transformation via

\(x = (\text{value} \times (\sqrt{80} - \sqrt{0.17}) + \sqrt{0.17})^2\).

For exploratory analysis and model diagnostics, the original scale is preferable. For model training, the transformed and scaled representation is superior because it reduces distributional asymmetry and places the variable on the same numeric range as other predictors.

Fare was transformed by log(x + 1). The logarithmic scale compresses the extreme upper tail driven by first-class luxury accommodations. A unit difference in log-transformed fare does not map linearly to a fare difference in pounds sterling. When communicating findings to non-technical stakeholders, fare values would need to be back-transformed via \(e^x - 1\) to recover the original monetary interpretation.

SibSp and Parch were transformed by square root. Both are count variables with high zero mass, so the square root compresses larger counts while preserving the ordering and the zero category. Although the absolute count interpretation is lost after transformation and scaling, the ordinal structure (no family, small family, large family) remains directionally preserved.

The combined effect of transformation and min-max scaling means that no variable in the final feature matrix retains its original interpretable unit. This is an accepted trade-off in predictive modeling. Coefficient estimates from models trained on these features reflect effects on the transformed, scaled scale, requiring back-transformation for human-readable interpretation. Algorithms that do not depend on distributional assumptions or feature scale, such as tree-based methods, do not require these preprocessing steps. For distance-based algorithms (k-nearest neighbors, support vector machines) and gradient-based methods (logistic regression, neural networks), both transformation and scaling materially improve optimization behavior.

Step 8: Importance of Transformations and Scaling

  • In your R Markdown document, discuss the significance of transformations and scaling in the context of preparing data for predictive modeling. Explain how they can help in improving the performance and interpretability of predictive models.

Transformations address violations of distributional assumptions that degrade model performance. Severely right-skewed variables such as Fare cause optimization instability in gradient-based algorithms because the gradient contributions from extreme values dominate parameter updates, impairing convergence. Log and square root transformations reduce the influence of outliers, stabilize variance across the range of the predictor, and bring the distribution closer to the symmetric form assumed by linear methods (Tukey, 1977). For tree-based models, skewness is less consequential because splits are threshold-based, but transformation still improves the granularity of splits in the dense lower range of the distribution.

Min-max scaling ensures that all features contribute on a comparable numeric scale during model training. Without scaling, a variable with a range of 512 units (Fare) would dominate distance calculations and gradient updates relative to a variable with a range of 9 units (SibSp). This disparity leads to biased coefficient estimates in regularized regression models and distorted distance metrics in proximity-based algorithms. Min-max scaling resolves this by normalizing all features to [0, 1], placing them on equal footing during optimization (Dasu & Johnson, 2003).

The strict discipline of fitting scaling parameters only to the training partition is methodologically required. If test-set statistics informed the scaling, the test set would no longer represent unseen data. Information from the test distribution would have influenced preprocessing, producing overly optimistic performance estimates and invalidating generalization claims. This is a form of data leakage, and it constitutes a fundamental violation of the hold-out evaluation principle (Rahm & Do, 2000).

Together, transformation and scaling form a preprocessing pipeline that produces a feature representation better suited to the assumptions and optimization mechanics of a broad class of predictive models, while preserving the ordinal and relational structure of the original variables.

Step 9: Summary of Data Preprocessing

  • Summarize the key findings, transformations, and scaling applied to the data.

  • Discuss the steps taken and justify the choices made throughout the process, including the selection of transformations and scaling methods.

The preprocessing pipeline applied in this assignment proceeded through five sequential stages.

Train/test partition. The cleaned 1,047-row training partition and 262-row test partition were created via random sampling at an 80/20 ratio with set.seed(42). The class distribution of the outcome variable was verified to be consistent across both partitions.

Distributional assessment. Skewness statistics and histograms generated from the training set identified four variables requiring transformation. Fare exhibited severe positive skewness (skewness ≈ 4.5) due to a long upper tail from first-class fares. Age exhibited mild positive skewness. SibSp and Parch both exhibited right-skewed count distributions with substantial zero mass.

Transformations applied. Log(x + 1) was applied to Fare; square root transformations were applied to Age, SibSp, and Parch. These transformations reduced skewness across all four variables and improved the suitability of the data for algorithms sensitive to distributional shape. The identical transformations were applied to the test partition without re-estimation.

Min-max scaling. Each transformed variable was rescaled to [0, 1] using the minimum and maximum values from the training partition. These parameters were stored and applied unchanged to the test partition, ensuring no information from the test distribution entered the preprocessing pipeline.

Outcome. The final feature matrix contains four scaled, transformed quantitative variables ready for use in downstream predictive modeling. The preprocessing choices are fully determined by training data, replicable via a fixed random seed, and grounded in the distributional properties observed during exploratory data analysis.

Assignment 7: Generate New Features

Step 1: Feature Engineering on the Training and Test Datasets

  • Begin with the training dataset, and then perform feature engineering on the test dataset separately. There must be no information leakage between the two datasets.

Feature engineering is performed on the training dataset first. Every extraction rule, grouping decision, and computed value is determined solely from the training partition and then applied mechanically to the test partition. This protocol prevents any test-set information from influencing the construction of new features, preserving the integrity of the hold-out evaluation.

Three new features are engineered in this assignment:

Feature Source Variables Type Operation
title name Categorical Regex extraction of honorific
family_size sibsp + parch Quantitative Additive composite
title_group title Categorical Level collapsing

Step 2: Extract a New Variable from an Existing Variable

  • Identify an existing variable that can be used to extract a new variable. For example, if you have a “name” field, extract a “title” feature from it.

The passenger name field encodes a social honorific (Mr, Mrs, Miss, Master, Dr, Rev, etc.) that carries predictive signal independent of sex and age. Historical records of the Titanic disaster document that evacuation priority followed social norms of the era, with women and children receiving preferential access to lifeboats, a pattern that the honorific variable can capture more precisely than sex alone. “Master” identifies male children, a group with substantially different survival odds than adult men (“Mr”), even though both share the same value in the sex variable.

Implementation on Training Data

# Extract the honorific using a regular expression.
# Titles in Titanic names follow the pattern: "Surname, Title. Firstname..."
extract_title <- function(name_vec) {
  gsub("^.*, ([A-Za-z]+)\\..+$", "\\1", name_vec)
}

train_df <- train %>%
  dplyr::mutate(title = extract_title(Name))

# Frequency table of raw titles in the training set
title_freq <- train_df %>%
  dplyr::count(title, sort = TRUE) %>%
  mutate(pct = round(n / sum(n) * 100, 2))

kable(title_freq,
      caption = "Table 1. Raw title frequencies in the training set",
      col.names = c("Title", "Count", "% of Training Set"),
      align = c("l", "r", "r")) %>%
  kable_styling(full_width = FALSE,
                bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "#D5E8F0")
Table 1. Raw title frequencies in the training set
Title Count % of Training Set
Mr 517 58.02
Miss 182 20.43
Mrs 125 14.03
Master 40 4.49
Dr 7 0.79
Rev 6 0.67
Col 2 0.22
Major 2 0.22
Mlle 2 0.22
Capt 1 0.11
Don 1 0.11
Jonkheer 1 0.11
Lady 1 0.11
Mme 1 0.11
Ms 1 0.11
Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards) 1 0.11
Sir 1 0.11
title_surv <- train_df %>%
  mutate(survived_num = as.integer(as.character(Survived))) %>%
  group_by(title) %>%
  dplyr::summarise(
    n          = n(),
    surv_rate  = mean(survived_num),
    .groups = "drop"
  ) %>%
  filter(n >= 5) %>%          # display only titles with ≥5 observations
  arrange(desc(surv_rate))

ggplot(title_surv, aes(x = reorder(title, surv_rate), y = surv_rate, fill = surv_rate)) +
  geom_col(width = 0.65, color = "white") +
  geom_text(aes(label = paste0("n=", n)), hjust = -0.15, size = 3.2) +
  scale_fill_gradient(low = "#C00000", high = "#4472C4", guide = "none") +
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1.1)) +
  coord_flip() +
  labs(
    title = "Survival Rate by Extracted Title (Training Set)",
    subtitle = "Titles with fewer than 5 observations excluded",
    x = "Title", y = "Survival Rate"
  ) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
Figure 1. Survival rate by extracted title (training set). Whiskers represent 95% Wilson confidence intervals.

Figure 1. Survival rate by extracted title (training set). Whiskers represent 95% Wilson confidence intervals.

The distribution of titles confirms the predictive value of the extracted feature. “Mrs” and “Miss” display substantially higher survival rates than “Mr”, a pattern consistent with the historical evacuation record. “Master” (denoting male children) also shows elevated survival relative to adult males.

Apply to Test Data

# Identical regex applied to test partition — no new decisions made
test_df <- test_df %>%
  mutate(title = extract_title(Name))

cat("Training title levels:", paste(sort(unique(train_df$title)), collapse = ", "), "\n")
## Training title levels: Capt, Col, Don, Dr, Jonkheer, Lady, Major, Master, Miss, Mlle, Mme, Mr, Mrs, Ms, Rev, Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards), Sir
cat("Test title levels:    ", paste(sort(unique(test_df$title)),  collapse = ", "), "\n")
## Test title levels:     Col, Don, Dona, Dr, Major, Master, Miss, Mme, Mr, Mrs, Rev, Sir

Step 3: Create a Quantitative Mixture of Variables

  • Engineer a new feature that is a quantitative mixture of two variables, either two quantitative variables or a quantitative and a qualitative variable. For example, create an interaction variable between “age” and “income.”

SibSp counts siblings and spouses aboard, while Parch counts parents and children. Neither variable alone captures the passenger’s total family group size, which is the operationally meaningful quantity for survival analysis. A passenger with 2 siblings and 1 parent is in a group of 4, and this total group size carries different social and logistical implications for evacuation behavior than either component alone. Adding 1 to the sum accounts for the passenger themselves, yielding the true group headcount.

Implementation on Training Data

train_df <- train_df %>%
  mutate(family_size = SibSp + Parch + 1)

# Summary statistics
cat("Family size summary (training set):\n")
## Family size summary (training set):
print(summary(train_df$family_size))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.905   2.000  11.000
# Distribution table
fs_tbl <- train_df %>%
  dplyr::count(family_size) %>%
  mutate(pct = round(n / sum(n) * 100, 1))

kable(fs_tbl,
      caption = "Table 2. Family size distribution (training set)",
      col.names = c("Family Size", "Count", "% of Training Set"),
      align = c("c", "r", "r")) %>%
  kable_styling(full_width = FALSE,
                bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "#D5E8F0")
Table 2. Family size distribution (training set)
Family Size Count % of Training Set
1 537 60.3
2 161 18.1
3 102 11.4
4 29 3.3
5 15 1.7
6 22 2.5
7 12 1.3
8 6 0.7
11 7 0.8
fs_surv <- train_df %>%
  mutate(survived_num = as.integer(as.character(Survived))) %>%
  dplyr::group_by(family_size) %>%
  dplyr::summarise(surv_rate = mean(survived_num), n = n(), .groups = "drop")

ggplot(fs_surv, aes(x = factor(family_size), y = surv_rate)) +
  geom_col(fill = "#4472C4", width = 0.65, color = "white") +
  geom_text(aes(label = paste0("n=", n)), vjust = -0.4, size = 3.2) +
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, 0.90)) +
  labs(
    title = "Survival Rate by Family Size (Training Set)",
    x = "Family Size (Self + Relatives Aboard)",
    y = "Survival Rate"
  ) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
Figure 2. Survival rate by family size (training set).

Figure 2. Survival rate by family size (training set).

The plot shows a mom-linear relationship between family size and survival rate. Solo travelers have a lower survival rate than passenger with small families (2-4), while larger families (7+) have a very low survival rate.

Apply to Test Data

# Same arithmetic applied identically to the test partition
test_df <- test_df %>%
  mutate(family_size = SibSp + Parch + 1)

cat("Training family_size range:", range(train_df$family_size), "\n")
## Training family_size range: 1 11
cat("Test family_size range:    ", range(test_df$family_size),  "\n")
## Test family_size range:     1 11

Step 4: Collapse a Factor Level Variable

  • Take a factor level variable and collapse or group some of its levels to create a new variable. For example, if you have a “region” variable, collapse multiple regions into broader categories.

The raw title extraction in Step 2 produces more than ten distinct levels in the training set, most with very few observations. Sparse factor levels create instability in model parameter estimation: a level with two or three observations cannot yield a reliable coefficient estimate and is liable to produce near-perfect separation or extreme variance in logistic regression and distance-based models (Gelman & Hill, 2007). Collapsing rare titles into a single “Rare” group reduces model complexity, improves parameter stability, and allows the signal from the most common titles to be estimated from adequate sample sizes.

The collapsing decisions are made entirely on training data. The five groups are:

  • Mr: Adult males. The most frequent and lowest-survival group.
  • Mrs: Married women. High survival rate.
  • Miss: Unmarried women and girls. High survival rate.
  • Master: Male children (pre-adolescent boys). Elevated survival relative to adult males.
  • Rare: All other titles (Dr, Rev, Col, Major, Capt, Jonkheer, Don, Sir, Lady, Countess, Mme, Ms, Mlle). These titles are individually sparse but share the characteristic of representing non-standard social positions.

Implementation on Training Data

# Define the collapsing map using training data exclusively
rare_titles <- train_df %>%
  dplyr::count(title) %>%
  dplyr::filter(!(title %in% c("Mr", "Mrs", "Miss", "Master"))) %>%
  dplyr::pull(title)

cat("Titles collapsed into 'Rare':", paste(rare_titles, collapse = ", "), "\n\n")
## Titles collapsed into 'Rare': Capt, Col, Don, Dr, Jonkheer, Lady, Major, Mlle, Mme, Ms, Rev, Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards), Sir
collapse_title <- function(title_vec, rare_vec) {
  case_when(
    title_vec == "Mr"     ~ "Mr",
    title_vec == "Mrs"    ~ "Mrs",
    title_vec == "Miss"   ~ "Miss",
    title_vec == "Master" ~ "Master",
    title_vec %in% rare_vec ~ "Rare",
    TRUE ~ "Rare"   # any title unseen in training defaults to Rare
  )
}

train_df <- train_df %>%
  mutate(title_group = collapse_title(title, rare_titles))

# Verify group sizes
tg_tbl <- train_df %>%
  dplyr::count(title_group, name = "n") %>%
  mutate(pct = round(n / sum(n) * 100, 1)) %>%
  arrange(desc(n))

kable(tg_tbl,
      caption = "Table 3. Collapsed title group frequencies (training set)",
      col.names = c("Title Group", "Count", "% of Training Set"),
      align = c("l", "r", "r")) %>%
  kable_styling(full_width = FALSE,
                bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "#D5E8F0")
Table 3. Collapsed title group frequencies (training set)
Title Group Count % of Training Set
Mr 517 58.0
Miss 182 20.4
Mrs 125 14.0
Master 40 4.5
Rare 27 3.0
tg_surv <- train_df %>%
  mutate(survived_num = as.integer(as.character(Survived))) %>%
  dplyr::group_by(title_group) %>%
  dplyr::summarise(surv_rate = mean(survived_num), n = n(), .groups = "drop") %>%
  arrange(desc(surv_rate))

ggplot(tg_surv, aes(x = reorder(title_group, surv_rate), y = surv_rate,
                    fill = title_group)) +
  geom_col(width = 0.60, color = "white", show.legend = FALSE) +
  geom_text(aes(label = paste0(round(surv_rate * 100, 0), "%\n(n=", n, ")")),
            hjust = -0.1, size = 3.2) +
  scale_fill_manual(values = c(
    "Mr"     = "#C00000",
    "Rare"   = "#ED7D31",
    "Master" = "#FFC000",
    "Miss"   = "#4472C4",
    "Mrs"    = "#70AD47"
  )) +
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1.0)) +
  coord_flip() +
  labs(
    title = "Survival Rate by Collapsed Title Group (Training Set)",
    x = "Title Group", y = "Survival Rate"
  ) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
Figure 3. Survival rate by collapsed title group (training set).

Figure 3. Survival rate by collapsed title group (training set).

Apply to Test Data

# The rare_titles vector was fixed by training data — applied unchanged here
test_df <- test_df %>%
  mutate(title_group = collapse_title(title, rare_titles))

# Check for any test title levels not seen during training
novel_titles <- setdiff(unique(test_df$title), unique(train_df$title))
if (length(novel_titles) == 0) {
  cat("No novel title levels in test set — all titles map cleanly to training groups.\n")
} else {
  cat("Novel test titles (mapped to 'Rare'):", paste(novel_titles, collapse = ", "), "\n")
}
## Novel test titles (mapped to 'Rare'): Dona
cat("\nTest title_group distribution:\n")
## 
## Test title_group distribution:
print(table(test_df$title_group))
## 
## Master   Miss     Mr    Mrs   Rare 
##     10     56    146     40     10

Step 5: Discuss Feature Engineering and Justifications

  • Explain what you are doing in each step and why you are performing these feature engineering operations.

  • Provide detailed explanations for each feature engineering step, including references to scholarly sources or relevant literature that support your choices.


Title Extraction

Extracting the title from the name field converts an unstructured string variable into a structured categorical predictor. In its raw form, the name field carries no direct quantitative meaning and cannot be used by most learning algorithms without transformation. The extracted title encodes social rank, marital status, and age group simultaneously, none of which is fully captured by any single existing variable. This exemplifies the principle that latent information embedded in complex fields can often be surfaced through targeted parsing (Zheng & Casari, 2018).

The regex pattern "^.*, ([A-Za-z]+)\\..+$" isolates the honorific, which appears consistently between the comma following the last name and the period that closes the title. This pattern is robust across all 1,309 records in the combined dataset.

Family Size Construction

Family size is an additive composite of two individually incomplete signals: sibling/spouse count and parent/child count. Neither component alone captures the passenger’s entire onboard social group, but their combined effect does. Adding 1 accounts for the passenger themselves. This construction reflects domain knowledge of the disaster’s evacuation dynamics and illustrates the principle that meaningful features often require combining variables in ways that correspond to real-world quantities (Zheng & Casari, 2018).

The resulting variable reveals a non-linear relationship with survival that neither sibsp nor parch alone could expose, making it a richer predictor for classification models.

Title Collapsing

Collapsing sparse factor levels prevents a common failure mode in categorical feature encoding: overestimating parameters from an insufficient number of observations. A title like “Jonkheer” appears once in the dataset and provides no stable basis for a coefficient estimate in any model that uses this feature. Grouping all such sparse titles into a “Rare” category pools their observations, enabling a stable estimate for the combined group. The decision about which titles constitute “Rare” (specifically, any title other than Mr, Mrs, Miss, and Master) is made from training data alone and then applied to the test partition. This maintains leakage discipline while ensuring that any novel title in the test set is handled gracefully by assigning it to the default “Rare” group.


Step 6: Explain the Importance of Transformations for Test Data

  • Discuss why it’s crucial to apply the same feature engineering transformations to the test dataset as were applied to the training dataset.

  • Support your explanation with scholarly resources that emphasize the importance of consistent preprocessing for test data to ensure model performance and generalization.

The test dataset must receive exactly the same feature engineering operations as the training dataset. This requirement is not a procedural formality; it is a logical necessity for valid model evaluation and deployment.

A predictive model trained on a feature matrix that includes title_group and family_size expects those columns to exist with the same structure at inference time. If the test partition were processed differently, for example, if title collapsing used different groupings, or if family size were computed without the self-count offset, the test feature matrix would not correspond to the space the model learned from. Predictions made on misaligned features are meaningless.

More subtly, any feature engineering decision that uses information from the test set introduces data leakage. If the rare-title threshold were set after inspecting title frequencies in the combined dataset, test-set information would have influenced the construction of a training feature, inflating performance estimates. Kuhn and Johnson (2013) identify this as one of the most common and consequential sources of invalid model assessment in applied predictive modeling.

The practical implementation of this requirement involves two steps. First, all engineering rules (the regex pattern, the rare-title list, the family-size formula) are fixed by inspecting the TRAINING DATA ONLY. Second, these fixed rules are applied as functions to the test partition, WITH NO RE-ESTIMATION. The collapse_title() function used in Step 4 handles the edge case of novel test titles by defaulting them to “Rare”, which is the statistically appropriate response to an unseen category level at inference time.

Step 7: Summary of Feature Engineering

  • Summarize the key findings, feature engineering operations, and justifications.

  • Explain the steps taken in feature engineering and the rationale behind each operation.

# Final feature inventory
feat_summary <- data.frame(
  Feature      = c("title", "family_size", "title_group"),
  Source       = c("name (text)", "sibsp + parch", "title (factor)"),
  Type         = c("Categorical", "Quantitative", "Categorical"),
  Operation    = c("Regex extraction", "Additive composite + self", "Level collapsing"),
  Levels_Range = c(
    paste(sort(unique(train_df$title)), collapse = ", "),
    paste0(min(train_df$family_size), " – ", max(train_df$family_size)),
    paste(sort(unique(train_df$title_group)), collapse = ", ")
  ),
  Justification = c(
    "Encodes social rank, marital status, and age group not captured by sex alone",
    "Total onboard family group size; non-linear survival relationship",
    "Pools sparse title levels for stable coefficient estimation"
  )
)

kable(feat_summary,
      caption = "Table 4. Summary of all engineered features",
      col.names = c("Feature", "Source Variable(s)", "Type",
                    "Operation", "Levels / Range", "Justification"),
      align = c("l","l","l","l","l","l")) %>%
  kable_styling(full_width = TRUE,
                bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "#D5E8F0") %>%
  column_spec(6, width = "30%")
Table 4. Summary of all engineered features
Feature Source Variable(s) Type Operation Levels / Range Justification
title name (text) Categorical Regex extraction Capt, Col, Don, Dr, Jonkheer, Lady, Major, Master, Miss, Mlle, Mme, Mr, Mrs, Ms, Rev, Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards), Sir Encodes social rank, marital status, and age group not captured by sex alone
family_size sibsp + parch Quantitative Additive composite + self 1 – 11 Total onboard family group size; non-linear survival relationship
title_group title (factor) Categorical Level collapsing Master, Miss, Mr, Mrs, Rare Pools sparse title levels for stable coefficient estimation
# Preview the enriched training dataset
cat("Enriched training dataset — first 6 rows (selected columns):\n")
## Enriched training dataset — first 6 rows (selected columns):
train_df %>%
  select(PassengerId, Survived, Pclass, Sex, Age, SibSp, Parch,
         Fare, title, family_size, title_group) %>%
  head(6) %>%
  kable(caption = "Table 5. Enriched training dataset preview",
        align = c("r","c","c","c","r","r","r","r","l","r","l")) %>%
  kable_styling(full_width = TRUE,
                bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "#D5E8F0")
Table 5. Enriched training dataset preview
PassengerId Survived Pclass Sex Age SibSp Parch Fare title family_size title_group
1 0 3 male 22 1 0 7.2500 Mr 2 Mr
2 1 1 female 38 1 0 71.2833 Mrs 2 Mrs
3 1 3 female 26 0 0 7.9250 Miss 1 Miss
4 1 1 female 35 1 0 53.1000 Mrs 2 Mrs
5 0 3 male 35 0 0 8.0500 Mr 1 Mr
6 0 3 male NA 0 0 8.4583 Mr 1 Mr

Summary of operations performed:

  • The title variable was extracted from the free-text name field using a regular expression that isolates the honorific following the surname comma. The resulting variable carries social and demographic signal not fully encoded by any existing variable.

  • The family_size variable was constructed as sibsp + parch + 1, combining two individually incomplete counts of onboard family members into a single group-size metric with a documented non-linear relationship to survival.

  • The title_group variable was created by collapsing the raw title levels into five groups — Mr, Mrs, Miss, Master, and Rare — where the Rare group pools eleven sparse title levels that individually lack sufficient observations for stable parameter estimation. All collapsing decisions were made from training data and applied without modification to the test partition.