This report presents a structured Exploratory Data Analysis (EDA) covering univariate and bivariate statistics, complemented by narrative discussions. The purpose of EDA is to understand underlying patterns, distributions, and relationships before proceeding to modeling or hypothesis testing. Descriptive metrics and visual checks quantify these patterns and indicate whether the planned follow-up analyses are appropriate.
Description: Four variables—two quantitative and two qualitative—were selected based on relevance to the research question and data completeness.
Rationale: Balancing continuous measurements and categorical distinctions enables a comprehensive understanding of data structure.
- Age (continuous numeric variable capturing passenger age)
- Fare (continuous numeric variable representing ticket fare paid)
- Survived (binary categorical variable indicating survival status)
- Pclass (ordinal categorical variable representing passenger class)
# Load titanic
library(tidyverse)  # read_csv(), select(), summarise(), drop_na(), %>%
setwd("C:/DDS 8501 Titanic")
titanic <- read_csv("train.csv")
quant_vars <- titanic %>% select(Age, Fare)
qual_vars <- titanic %>% select(Survived, Pclass)
Description: Measures of center (mean, median), dispersion (SD, range, IQR), and shape (skewness, kurtosis) were computed for each quantitative variable, following the principles laid out in Tukey (1977).
Rationale: Summarizing data through these metrics informs distributional characteristics and guides assumptions for modeling (Wickham & Grolemund, 2017).
# Summary overview
psych::describe(quant_vars)
## vars n mean sd median trimmed mad min max range skew kurtosis
## Age 1 714 29.7 14.53 28.00 29.27 13.34 0.42 80.00 79.58 0.39 0.16
## Fare 2 891 32.2 49.69 14.45 21.38 10.24 0.00 512.33 512.33 4.77 33.12
## se
## Age 0.54
## Fare 1.66
# Range, IQR, min, max
quant_vars %>%
summarise(
Age_min = min(Age, na.rm = TRUE),
Age_max = max(Age, na.rm = TRUE),
Age_range = Age_max - Age_min,
Age_IQR = IQR(Age, na.rm = TRUE),
Fare_min = min(Fare, na.rm = TRUE),
Fare_max = max(Fare, na.rm = TRUE),
Fare_range = Fare_max - Fare_min,
Fare_IQR = IQR(Fare, na.rm = TRUE)
)
## # A tibble: 1 × 8
## Age_min Age_max Age_range Age_IQR Fare_min Fare_max Fare_range Fare_IQR
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.42 80 79.6 17.9 0 512. 512. 23.1
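# Distribution shape metrics
# skewness() and kurtosis() are assumed to come from the moments package;
# its kurtosis is not excess kurtosis, so a normal distribution scores 3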
sapply(quant_vars, skewness, na.rm = TRUE)
## Age Fare
## 0.3882899 4.7792533
sapply(quant_vars, kurtosis, na.rm = TRUE)
## Age Fare
## 3.168637 36.204289
Results:
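- Age (n = 714 non-missing of 891): mean = 29.70, median = 28.0, SD = 14.53, range = 79.58 (min = 0.42, max = 80), IQR = 17.88, skewness = 0.39, kurtosis = 3.17 (reported as excess kurtosis, 0.16, in the psych::describe output).
- Fare (n = 891): mean = 32.20, median = 14.45, SD = 49.69, range = 512.33 (min = 0.00, max = 512.33), IQR = 23.09, skewness = 4.78, kurtosis = 36.20 (reported as excess kurtosis, 33.12, in the psych::describe output).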
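The two kurtosis figures in this section differ by convention, not by error. The sketch below writes the moment-based formulas out in base R and converts the reported values to the excess scale. The helper names (`central_moment`, `moment_skewness`, `moment_kurtosis`, `to_excess_type3`) are illustrative and not part of the report, and the conversion assumes `psych::describe()` was run with its default `type = 3` estimator.

```r
# Illustrative helpers (not from the report): moment-based shape statistics.
central_moment <- function(x, k) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^k)
}
moment_skewness <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5
moment_kurtosis <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

# Convert a kurtosis on the "normal = 3" scale to the excess-kurtosis scale
# (assumed psych::describe default, type = 3) for n observations.
to_excess_type3 <- function(kurt, n) kurt * ((n - 1) / n)^2 - 3

# Values reported above: Age (n = 714) and Fare (n = 891)
age_excess  <- to_excess_type3(3.168637, 714)
fare_excess <- to_excess_type3(36.204289, 891)
round(c(Age = age_excess, Fare = fare_excess), 2)
```

Under these assumptions the converted values round to 0.16 and 33.12, matching the kurtosis column of the describe output, so both tables describe the same distributions.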
Description: Frequencies and proportions of categories were calculated for each qualitative variable in accordance with standard practices (Wickham & Grolemund, 2017).
Rationale: Understanding category distribution reveals group imbalances and potential biases, informing subsequent sampling strategies (Little & Rubin, 2019).
# Frequency and proportion tables
table(titanic$Survived) %>% prop.table() %>% round(3)
##
## 0 1
## 0.616 0.384
table(titanic$Pclass) %>% prop.table() %>% round(3)
##
## 1 2 3
## 0.242 0.207 0.551
Results:
- Survived: 0 (did not survive) = 549 (61.62%), 1 (survived) = 342 (38.38%)
- Pclass: 1st = 216 (24.24%), 2nd = 184 (20.65%), 3rd = 491 (55.11%)
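The code above prints proportions only, while the results also quote counts. A minimal sketch of producing both together is shown here, using the counts reported above so that it runs without the data file; `freq_table` is a hypothetical helper name.

```r
# Counts as reported in the Results above (891 passengers in total)
survived_counts <- c("0" = 549, "1" = 342)
pclass_counts   <- c("1" = 216, "2" = 184, "3" = 491)

# Hypothetical helper: counts and percentages side by side
freq_table <- function(counts) {
  data.frame(
    level = names(counts),
    n     = unname(counts),
    pct   = unname(round(100 * counts / sum(counts), 2))
  )
}

freq_table(survived_counts)
freq_table(pclass_counts)
```

With the data loaded, the same helper could probably be fed `c(table(titanic$Survived))`, which turns the one-way table into a named vector.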
Description: The Pearson correlation coefficient between Age and Fare was computed on complete cases and visualized.
Rationale: Quantifying linear relationships identifies variable associations and potential multicollinearity issues.
library(corrplot)
# Keep only rows where both Age and Fare are observed
a <- titanic %>% select(Age, Fare) %>% drop_na()
cor_matrix <- cor(a)
corrplot(cor_matrix, method = "shade", addCoef.col = "black")
# Pearson correlation test
test <- cor.test(a$Age, a$Fare)
test$estimate; test$p.value
## cor
## 0.09606669
## [1] 0.01021628
Results:
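- Pearson r = 0.096 (p = 0.0102), indicating a weak but statistically significant positive association between Age and Fare. With r² ≈ 0.009, Age accounts for under 1% of the variance in Fare, so the significance mainly reflects the sample size.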
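To separate statistical significance from practical size, the reported coefficient can be unpacked by hand. This is a sketch only: it assumes n = 714 complete Age/Fare pairs (Age has 714 non-missing values and Fare has none missing), and it adds a Fisher z confidence interval that the original output does not show.

```r
r <- 0.09606669   # Pearson r reported above
n <- 714          # assumed number of complete Age/Fare pairs

r_squared <- r^2                               # share of variance explained
t_stat    <- r * sqrt(n - 2) / sqrt(1 - r^2)   # t statistic with n - 2 df
p_value   <- 2 * pt(-abs(t_stat), df = n - 2)  # two-sided p-value

# Approximate 95% confidence interval via the Fisher z transformation
z_half <- qnorm(0.975) / sqrt(n - 3)
ci_95  <- tanh(atanh(r) + c(-1, 1) * z_half)

round(c(r_squared = r_squared, t = t_stat, p = p_value), 4)
round(ci_95, 3)
```

The recomputed p-value should agree with the `cor.test` output to about four decimal places, which is a quick check that the assumed n is right.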
Description: A contingency table of Survived by Pclass was built, and a chi-squared test of independence was applied to it.
Rationale: Assessing categorical associations determines whether observed group differences are statistically significant.
ct <- table(titanic$Survived, titanic$Pclass)
ct
##
## 1 2 3
## 0 80 97 372
## 1 136 87 119
chisq <- chisq.test(ct)
chisq$statistic; chisq$p.value
## X-squared
## 102.889
## [1] 4.549252e-23
Results:
- Survival by Class:
  - Class 1: 62.96% survived
  - Class 2: 47.28% survived
  - Class 3: 24.24% survived
- Chi-squared test: χ²(2) = 102.89, p < 0.001, indicating a significant association between passenger class and survival.
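The survival percentages and the test can be reproduced directly from the printed contingency table. The sketch below rebuilds the table from those counts and also computes Cramér's V as an effect-size measure; Cramér's V is an addition here, not something the original analysis reported.

```r
# Contingency table rebuilt from the counts printed above
ct <- matrix(
  c(80, 97, 372,
    136, 87, 119),
  nrow = 2, byrow = TRUE,
  dimnames = list(Survived = c("0", "1"), Pclass = c("1", "2", "3"))
)

# Column percentages: share surviving within each class
surv_pct <- round(100 * prop.table(ct, margin = 2)["1", ], 2)

# Chi-squared test and Cramér's V as an effect-size measure
chi <- chisq.test(ct)
cramers_v <- sqrt(unname(chi$statistic) / (sum(ct) * (min(dim(ct)) - 1)))

surv_pct
chi$expected          # expected counts under independence
round(cramers_v, 2)
```

Cramér's V comes out at about 0.34, which complements the p-value: the test says the association is unlikely to be chance, while V describes how large it is.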
Description: Numerical findings were synthesized into actionable insights.
Rationale: Linking statistics to research objectives guides data preprocessing, modeling decisions, and hypothesis generation (Tukey, 1977).
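Age Distribution: slight right skew (skewness = 0.39) with mean (29.70) > median (28.0); kurtosis (3.17) is close to the normal value of 3, and SD = 14.53 indicates a wide spread of ages.
Fare Distribution: highly right-skewed (skewness = 4.78) and heavy-tailed (kurtosis = 36.20); the median (14.45) is far below the mean (32.20), which points to a small number of very high fares. A log transformation is a reasonable candidate before modeling; because the minimum fare is 0, log(Fare + 1) would be needed rather than log(Fare).
Correlation: the weak Age-Fare correlation (r = 0.096, p = 0.0102) suggests that multicollinearity between these two variables is unlikely to be a concern.
Categorical Association: passenger class and survival are significantly associated, χ²(2) = 102.89, p < 0.001; survival falls from 62.96% in first class to 24.24% in third class.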
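Since the minimum fare is 0, a plain logarithm is undefined for some passengers. The sketch below uses only the Fare summary values reported earlier to show the difference between `log()` and `log1p()`; `log1p` is offered as one possible choice, not as the transformation the analysis committed to.

```r
# Fare summary values reported above: minimum, median, mean, maximum
fare_summary <- c(min = 0, median = 14.45, mean = 32.20, max = 512.33)

log(fare_summary)     # the zero fare maps to -Inf
log1p(fare_summary)   # log(1 + x) keeps every value finite

# Ratio of maximum to median before and after the transform
unname(fare_summary["max"] / fare_summary["median"])
unname(log1p(fare_summary["max"]) / log1p(fare_summary["median"]))
```

The second ratio is much smaller than the first, which is the sense in which the transform pulls the long right tail in toward the bulk of the data.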
Description: Statistics and visuals are discussed as complementary EDA tools.
Rationale: Combining numerical metrics with graphs ensures comprehensive pattern detection and validation (Tukey, 1977; Wickham & Grolemund, 2017).
Example: A boxplot of Fare highlights the extreme high values that drive the skewness seen in the mean vs. median comparison.
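The points a boxplot draws beyond its whiskers follow Tukey's 1.5 × IQR rule, which can be sketched numerically. `tukey_fences` is a hypothetical helper, and `toy_fare` holds made-up values for illustration only, not Titanic data. Note that `boxplot()` itself uses hinges, which can differ slightly from the quantiles used here.

```r
# Hypothetical helper: Tukey's 1.5 * IQR fences for a numeric vector
tukey_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  c(lower = q[1] - k * (q[2] - q[1]), upper = q[2] + k * (q[2] - q[1]))
}

# Made-up fares for illustration only (not taken from the Titanic data)
toy_fare <- c(7, 8, 8, 10, 13, 15, 26, 31, 80, 500)

fences  <- tukey_fences(toy_fare)
flagged <- toy_fare[toy_fare < fences["lower"] | toy_fare > fences["upper"]]
flagged   # values that would be plotted as individual points
```

With the data loaded, `tukey_fences(titanic$Fare)` would give the corresponding cut-offs for the real fares.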