Introduction

This report presents a structured Exploratory Data Analysis (EDA) of the Titanic training data, covering univariate and bivariate statistics with accompanying narrative discussion. The purpose of EDA is to understand underlying patterns, distributions, and relationships before proceeding to modeling or hypothesis testing. The descriptive metrics and visual checks quantify those patterns and help assess whether the subsequent analyses are appropriate.


Step 1: Variable Selection

Description: Four variables—two quantitative and two qualitative—were selected based on relevance to the research question and data completeness.

Rationale: Balancing continuous measurements and categorical distinctions enables a comprehensive understanding of data structure.

  • Quantitative Variables:
    1. Age (continuous numeric variable capturing passenger age)
    2. Fare (continuous numeric variable representing ticket fare paid)
  • Qualitative Variables:
    1. Survived (binary categorical variable indicating survival status)
    2. Pclass (ordinal categorical variable representing passenger class)
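# Load packages and the Titanic training data
library(tidyverse)  # read_csv(), select(), drop_na(), %>%
library(moments)    # assumed source of skewness() and kurtosis()
library(corrplot)   # corrplot()
setwd("C:/DDS 8501 Titanic")
titanic <- read_csv("train.csv")
# Split the selected variables by measurement type
quant_vars <- titanic %>% select(Age, Fare)
qual_vars  <- titanic %>% select(Survived, Pclass)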
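
Because data completeness was one of the selection criteria, it can help to quantify missingness for the chosen columns. The following is a minimal sketch in base R, not part of the original analysis: `missing_summary` is a hypothetical helper, and `toy` is a made-up data frame standing in for `titanic`.

```r
# Hypothetical helper: count and share of missing values per column.
missing_summary <- function(df) {
  n_missing <- colSums(is.na(df))
  data.frame(
    variable    = names(df),
    n_missing   = as.integer(n_missing),
    pct_missing = round(100 * n_missing / nrow(df), 2),
    row.names   = NULL
  )
}

# Made-up toy data (NOT the Titanic file), used only to show the output shape.
toy <- data.frame(
  Age      = c(22, NA, 38, NA),
  Fare     = c(7.25, 8.05, 71.28, 13),
  Survived = c(0, 0, 1, 1),
  Pclass   = c(3, 3, 1, 2)
)
missing_summary(toy)
```

On the real data, the same helper could be applied to the four selected columns of `titanic`.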

Step 2: Univariate Descriptive Statistics (Quantitative)

Description: Measures of center, dispersion, skewness, and kurtosis were computed for each quantitative variable, following principles laid out in Tukey (1977).

Rationale: Summarizing data through these metrics informs distributional characteristics and guides assumptions for modeling (Wickham & Grolemund, 2017).

# Summary overview
psych::describe(quant_vars)
##      vars   n mean    sd median trimmed   mad  min    max  range skew kurtosis
## Age     1 714 29.7 14.53  28.00   29.27 13.34 0.42  80.00  79.58 0.39     0.16
## Fare    2 891 32.2 49.69  14.45   21.38 10.24 0.00 512.33 512.33 4.77    33.12
##        se
## Age  0.54
## Fare 1.66
# Range, IQR, min, max
quant_vars %>%
  summarise(
    Age_min    = min(Age, na.rm = TRUE),
    Age_max    = max(Age, na.rm = TRUE),
    Age_range  = Age_max - Age_min,
    Age_IQR    = IQR(Age, na.rm = TRUE),
    Fare_min   = min(Fare, na.rm = TRUE),
    Fare_max   = max(Fare, na.rm = TRUE),
    Fare_range = Fare_max - Fare_min,
    Fare_IQR   = IQR(Fare, na.rm = TRUE)
  )
## # A tibble: 1 × 8
##   Age_min Age_max Age_range Age_IQR Fare_min Fare_max Fare_range Fare_IQR
##     <dbl>   <dbl>     <dbl>   <dbl>    <dbl>    <dbl>      <dbl>    <dbl>
## 1    0.42      80      79.6    17.9        0     512.       512.     23.1
# Distribution shape metrics
sapply(quant_vars, skewness, na.rm = TRUE)
##       Age      Fare 
## 0.3882899 4.7792533
sapply(quant_vars, kurtosis, na.rm = TRUE)
##       Age      Fare 
##  3.168637 36.204289

Results:

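  • Age: n = 714 (177 of 891 values missing), mean = 29.70, median = 28.0, SD = 14.53, range = 79.58 (min = 0.42, max = 80), IQR = 17.88, skewness = 0.39, kurtosis = 3.17 (Pearson scale, where a normal distribution has 3; psych::describe reports an excess-kurtosis estimate, 0.16).
  • Fare: n = 891, mean = 32.20, median = 14.45, SD = 49.69, range = 512.33 (min = 0.00, max = 512.33), IQR = 23.09, skewness = 4.78, kurtosis = 36.20 (Pearson scale; excess-kurtosis estimate from psych::describe: 33.12).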
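
The two kurtosis figures printed in this step (0.16 vs. 3.17 for Age, 33.12 vs. 36.20 for Fare) differ because they sit on different scales. The sketch below shows moment-based definitions in base R and one way to reconcile the numbers. The function names are hypothetical, and the conversion assumes that psych::describe uses its default "type 3" estimator; that assumption is not checked against the package source here.

```r
# Moment-based shape statistics (population moments, divisor n).
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

pearson_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2   # equals 3 for a normal distribution
}

# Assumed conversion from Pearson kurtosis to the excess-kurtosis
# estimate shown by psych::describe (taken to be its "type 3" form).
to_excess_type3 <- function(pearson_kurt, n) {
  pearson_kurt * (1 - 1 / n)^2 - 3
}

age_excess  <- to_excess_type3(3.168637, n = 714)   # Age
fare_excess <- to_excess_type3(36.204289, n = 891)  # Fare
round(c(Age = age_excess, Fare = fare_excess), 2)
```

Under that assumption the converted values round to 0.16 and 33.12, matching the psych::describe output, so the two tables agree once the scale is accounted for.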

Step 3: Univariate Descriptive Statistics (Qualitative)

Description: Frequencies and proportions of categories were calculated for each qualitative variable in accordance with standard practices (Wickham & Grolemund, 2017).

Rationale: Understanding category distribution reveals group imbalances and potential biases, informing subsequent sampling strategies (Little & Rubin, 2019).

# Proportion tables for each qualitative variable (rounded to 3 decimals)
table(titanic$Survived) %>% prop.table() %>% round(3)
## 
##     0     1 
## 0.616 0.384
table(titanic$Pclass)   %>% prop.table() %>% round(3)
## 
##     1     2     3 
## 0.242 0.207 0.551

Results:

  • Survived: 0 (did not survive) = 549 (61.62%), 1 (survived) = 342 (38.38%)
  • Pclass: 1st = 216 (24.24%), 2nd = 184 (20.65%), 3rd = 491 (55.11%)
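
The code in this step prints proportions only, while the results also quote counts. As a sketch of how both can be shown side by side, the block below re-enters the reported counts and derives the percentages from them. `freq_table` is a hypothetical helper, not a function from the original analysis.

```r
# Counts as reported in this step (Survived: 0/1; Pclass: 1/2/3).
survived_n <- c(`0` = 549, `1` = 342)
pclass_n   <- c(`1` = 216, `2` = 184, `3` = 491)

# Hypothetical helper: counts and percentages in one table.
freq_table <- function(counts) {
  data.frame(
    level     = names(counts),
    n         = as.integer(counts),
    percent   = round(100 * as.numeric(counts) / sum(counts), 2),
    row.names = NULL
  )
}

freq_table(survived_n)
freq_table(pclass_n)
```

With the data loaded, the counts themselves would come from `table(titanic$Survived)` and `table(titanic$Pclass)` without the `prop.table()` step.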

Step 4: Bivariate Statistics

4.1 Correlation Between Quantitative Variables

Description: The Pearson correlation coefficient between Age and Fare was computed and visualized.

Rationale: Quantifying linear relationships identifies variable associations and potential multicollinearity issues.

# Keep only rows where both Age and Fare are observed
age_fare <- titanic %>% select(Age, Fare) %>% drop_na()
cor_matrix <- cor(age_fare)
corrplot(cor_matrix, method = "shade", addCoef.col = "black")

# Pearson correlation test
test <- cor.test(age_fare$Age, age_fare$Fare)
test$estimate; test$p.value
##        cor 
## 0.09606669
## [1] 0.01021628

Results:

  • Pearson r = 0.096 (p = 0.0102, n = 714 complete pairs), indicating a weak but statistically significant positive association between Age and Fare; r² ≈ 0.009, so Age accounts for under 1% of the variance in Fare.
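
To make the link between the coefficient and its p-value explicit, the sketch below recomputes the test statistic from the reported r using the standard t transformation for a Pearson correlation. The sample size n = 714 is an assumption based on the number of non-missing Age values reported in Step 2.

```r
r <- 0.09606669   # estimate reported above
n <- 714          # assumed number of complete Age-Fare pairs

df     <- n - 2
t_stat <- r * sqrt(df) / sqrt(1 - r^2)   # t statistic for H0: rho = 0
p_val  <- 2 * pt(-abs(t_stat), df)       # two-sided p-value

c(t = t_stat, df = df, p = p_val, r_squared = r^2)
```

With several hundred observations even a small r can reach significance, which is why the shared variance (r²) is a useful companion to the p-value here.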

4.2 Crosstabs for Qualitative Variables

Description: A contingency table of Survived by Pclass was created and a chi-squared test of independence was performed.

Rationale: Assessing categorical associations determines whether observed group differences are statistically significant.

ct <- table(titanic$Survived, titanic$Pclass)
ct
##    
##       1   2   3
##   0  80  97 372
##   1 136  87 119
chisq <- chisq.test(ct)
chisq$statistic; chisq$p.value
## X-squared 
##   102.889
## [1] 4.549252e-23

Results:

  • Survival by Class:
    • Class 1: 62.96% survived
    • Class 2: 47.28% survived
    • Class 3: 24.24% survived
  • Chi-squared test: χ²(2) = 102.89, p < 0.001, indicating a significant association between passenger class and survival.
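
The per-class survival rates and an effect size can be derived directly from the contingency table. The sketch below re-enters the table printed above and adds Cramér's V, which is not part of the original analysis and is offered only as one common way to gauge the strength of the association.

```r
# Contingency table re-entered from the output above
# (rows: Survived 0/1; columns: Pclass 1/2/3).
ct <- matrix(
  c( 80, 97, 372,
    136, 87, 119),
  nrow = 2, byrow = TRUE,
  dimnames = list(Survived = c("0", "1"), Pclass = c("1", "2", "3"))
)

# Column percentages: share of each outcome within each class.
col_pct <- round(100 * prop.table(ct, margin = 2), 2)
col_pct

# Cramér's V as an effect size for the chi-squared test.
chi <- chisq.test(ct)
cramers_v <- sqrt(unname(chi$statistic) / (sum(ct) * (min(dim(ct)) - 1)))
cramers_v
```

Cramér's V comes out at roughly 0.34, which is often read as a moderate association, so the very small p-value partly reflects the sample size of 891.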

Step 5: Interpretation of Descriptive and Bivariate Statistics

Description: Numerical findings were synthesized into actionable insights.

Rationale: Linking statistics to research objectives guides data preprocessing, modeling decisions, and hypothesis generation (Tukey, 1977).

  • Age Distribution: mild right skew (skewness = 0.39) with mean (29.70) > median (28.0); SD = 14.53 indicates a wide spread of ages.

  • Fare Distribution: highly right-skewed (skewness = 4.78) and heavy-tailed (kurtosis = 36.20); the median (14.45) is far below the mean (32.20), which points to extreme high fares. A log transformation is worth considering before modeling; because the minimum fare is 0, log(Fare + 1) avoids undefined values.

  • Correlation: the low Age-Fare correlation (r = 0.096, p = 0.0102) indicates minimal collinearity between these two variables.

  • Categorical Association: statistically significant link between class and survival, χ²(2) = 102.89, p < 0.001; first-class passengers had the highest survival rate (62.96%) and third-class passengers the lowest (24.24%).
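
Since Fare has a minimum of 0, a plain logarithm is undefined for some passengers. The sketch below illustrates the difference between `log()` and `log1p()` using only the summary values reported in Step 2. The column name `Fare_log` in the commented usage line is an assumption, not something from the original analysis.

```r
# Fare summary values reported in Step 2 (min, median, max).
fare_points <- c(min = 0, median = 14.45, max = 512.33)

plain_log <- log(fare_points)     # -Inf at zero: unusable for zero fares
safe_log  <- log1p(fare_points)   # log(1 + x): defined at zero

plain_log
round(safe_log, 2)

# Hypothetical usage on the full data:
# titanic <- titanic %>% mutate(Fare_log = log1p(Fare))
```

`log1p()` keeps the ordering of fares while compressing the long right tail, which is the property the log transformation is meant to provide.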


Step 6: Importance of Statistics and Visualizations

Description: Discussion of statistics and visuals as complementary EDA tools.

Rationale: Combining numerical metrics with graphs ensures comprehensive pattern detection and validation (Tukey, 1977; Wickham & Grolemund, 2017).

  • Statistics provide precision through exact metrics and hypothesis testing.
  • Visualizations offer intuitive pattern recognition and outlier detection.

Example: A boxplot of Fare highlights the extreme high values that drive the skewness seen in the mean vs. median comparison.
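
A minimal sketch of how a boxplot flags such values is given below. The vector `toy_fare` is made up for illustration and is NOT the Titanic data; it simply contains one extreme value so that the 1.5 × IQR rule has something to flag.

```r
# Made-up toy fares (NOT the Titanic data), including one extreme value.
toy_fare <- c(7.25, 8.05, 13, 14.45, 26, 31, 512.33)

stats <- boxplot.stats(toy_fare)  # Tukey five-number summary plus flagged points
stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values lying more than 1.5 * IQR beyond the hinges

# Draw the plot; flagged values appear as individual points.
boxplot(toy_fare, horizontal = TRUE, main = "Toy fares (illustrative)")
```

On the real data the equivalent call would be `boxplot(titanic$Fare, horizontal = TRUE)`, with `boxplot.stats(titanic$Fare)$out` listing the flagged fares.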


Step 7: Summary of EDA Findings

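  • Age distribution: mean = 29.70, mildly right-skewed (skewness = 0.39), SD = 14.53.
  • Fare distribution: mean = 32.20, highly right-skewed (skewness = 4.78), SD = 49.69; a log transform is recommended, using log(Fare + 1) because Fare includes zeros.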
  • Survival probability varies markedly by class (62.96% in 1st class vs. 24.24% in 3rd class).
  • Correlation between Age and Fare is weak but significant (r = 0.096, p = 0.0102).

Step 8: References

  1. Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
  2. Wickham, H., & Grolemund, G. (2017). R for Data Science. O’Reilly Media.
  3. Little, R. J., & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd ed.). Wiley.