Executive Summary

This report continues the exploratory data analysis (EDA) of the assigned dataset, the Titanic training file (train.csv; 891 rows, 12 columns). The goal is to generate univariate visualizations, assess the normality of the quantitative variables, document issues such as skewness, outliers, and missing values, and derive initial insights. Two quantitative variables (Age and Fare) and two categorical variables (Sex and Embarked) were selected. Histograms, normal probability (QQ) plots, dot plots, and boxplots were used for the quantitative variables, and bar plots for the categorical variables.

Step 1: Data Load and Variable Selection

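library(tidyverse)  # provides read_csv(), glimpse(), and ggplot()
library(gridExtra)  # provides grid.arrange()
setwd("C:/DDS 8501 Titanic")
titanic <- read_csv("train.csv")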

## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# View column names and types
glimpse(titanic)

## Rows: 891
## Columns: 12
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…

# Select Variables
quant_var1 <- titanic$Age
quant_var2 <- titanic$Fare

cat_var1 <- titanic$Sex
cat_var2 <- titanic$Embarked

Step 2: Four-Plots for Quantitative Variables

Quantitative Variable 1: Age

# Histogram
p1 <- ggplot(titanic, aes(x = Age)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Age")

# Probability Plot
p2 <- ggplot(titanic, aes(sample = Age)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "QQ Plot of Age")

# Dot Plot
p3 <- ggplot(titanic, aes(x = Age)) +
  geom_dotplot(binwidth = 1, dotsize = 0.5) +
  labs(title = "Dot Plot of Age")

# Box Plot
p4 <- ggplot(titanic, aes(y = Age)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Boxplot of Age")

# Arrange plots
grid.arrange(p1, p2, p3, p4, ncol = 2)

## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_qq()`).

## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_qq_line()`).

## Warning: Removed 177 rows containing missing values or values outside the scale range
## (`stat_bindot()`).

## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Quantitative Variable 2: Fare

# Histogram
p5 <- ggplot(titanic, aes(x = Fare)) +
  geom_histogram(bins = 30, fill = "lightcoral", color = "black") +
  labs(title = "Histogram of Fare")

# Probability Plot
p6 <- ggplot(titanic, aes(sample = Fare)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "QQ Plot of Fare")

# Dot Plot
p7 <- ggplot(titanic, aes(x = Fare)) +
  geom_dotplot(binwidth = 5, dotsize = 0.5) +
  labs(title = "Dot Plot of Fare")

# Box Plot
p8 <- ggplot(titanic, aes(y = Fare)) +
  geom_boxplot(fill = "orchid") +
  labs(title = "Boxplot of Fare")

# Arrange plots
grid.arrange(p5, p6, p7, p8, ncol = 2)

Step 3: Bar Plots for Categorical Variables

Categorical Variable 1: Sex

ggplot(titanic, aes(x = Sex)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Bar Plot of Sex", x = "Sex", y = "Count")

Categorical Variable 2: Embarked

ggplot(titanic, aes(x = Embarked)) +
  geom_bar(fill = "orange") +
  labs(title = "Bar Plot of Embarked", x = "Embarkation Port", y = "Count")

Step 4: Assess Normality

Normality of the two quantitative variables (Age and Fare) was assessed using both visual methods (QQ plots) and statistical methods (Shapiro-Wilk test).

Age:
- The QQ plot for Age shows moderate deviation from the reference line, mainly in the upper tail, consistent with mild right-skewness.
- The Shapiro-Wilk test (W = 0.981, p = 7.3e-08) rejects normality at the 0.05 level, although a W statistic this close to 1 indicates that the departure is modest.
Fare:
- The QQ plot for Fare shows substantial deviation from the reference line, indicating strong right-skewness.
- The Shapiro-Wilk test (W = 0.522, p < 2.2e-16) rejects normality decisively, confirming that Fare is highly non-normal.

Summary:
The normality assessments indicate that neither Age nor Fare is normally distributed. Age exhibits mild right-skewness: most passengers were relatively young, with fewer older individuals extending the right tail. The Shapiro-Wilk p-value is below 0.05, but with 714 non-missing ages the test is sensitive to even small departures, and the W statistic of 0.981 indicates that the deviation is not severe.

In contrast, Fare displays strong right-skewness, confirmed by the Shapiro-Wilk test (W = 0.522, p < 2.2e-16). This skewness is largely driven by a small number of passengers who paid extremely high fares.

These findings have important implications for further statistical modeling. If parametric methods assuming normality are to be applied, it may be necessary to transform Fare (e.g., log-transformation) to correct for the strong skewness and better meet model assumptions. For Age, mild skewness may not critically affect models but should still be noted.

# Remove NAs for testing
age_no_na <- titanic$Age[!is.na(titanic$Age)]
fare_no_na <- titanic$Fare[!is.na(titanic$Fare)]

# Shapiro-Wilk Test for Age
shapiro.test(age_no_na)

## 
##  Shapiro-Wilk normality test
## 
## data:  age_no_na
## W = 0.98146, p-value = 7.337e-08

# Shapiro-Wilk Test for Fare
shapiro.test(fare_no_na)

## 
##  Shapiro-Wilk normality test
## 
## data:  fare_no_na
## W = 0.52189, p-value < 2.2e-16
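
The log transformation suggested above for Fare can be sketched as follows. This is a minimal illustration on simulated data, not the analysis performed in this report: `fare_sim` is a hypothetical stand-in for `titanic$Fare`, `sample_skewness()` is a helper defined here for illustration, and `log1p()` is used on the assumption that some values may be exactly zero.

```r
# Minimal sketch on simulated data; fare_sim is a hypothetical stand-in for titanic$Fare
set.seed(42)
fare_sim <- c(rlnorm(500, meanlog = 2.7, sdlog = 1), 0, 0)  # right-skewed, with two zero values

# Moment-based sample skewness: near 0 for symmetric data, positive for right skew
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# log1p(x) = log(1 + x) stays finite when a value is exactly zero
fare_log <- log1p(fare_sim)

# Compare skewness and the Shapiro-Wilk W statistic before and after the transform
skew_raw <- sample_skewness(fare_sim)
skew_log <- sample_skewness(fare_log)
w_raw <- unname(shapiro.test(fare_sim)$statistic)
w_log <- unname(shapiro.test(fare_log)$statistic)

print(round(c(skew_raw = skew_raw, skew_log = skew_log, w_raw = w_raw, w_log = w_log), 3))
```

On the real data, the same comparison would be run on `fare_no_na`. A W statistic closer to 1 and a skewness closer to 0 after the transform would indicate that the transformed variable is nearer to normal, though it need not pass the test.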

Step 5: Importance of Univariate Data Visualization

As noted by Tukey (1977), univariate visualization forms the foundation of exploratory data analysis (EDA), providing critical insight into the structure of individual variables. Visualization helps identify skewness, multimodality, extreme values, missing data, and deviations from expected distributions (Peng et al., 2021). Without understanding the basic distribution, further analysis or modeling may be misguided, as many techniques assume certain properties such as normality (Shmueli et al., 2017).

Step 6: Interpretation of Visualizations

Age: Appears mildly right-skewed, with outliers visible in the boxplot. The QQ plot shows deviation from normality, especially in the upper tail.
Fare: Highly skewed to the right; extreme outliers (high fares) heavily influence the distribution.
Sex: The distribution shows more males than females.
Embarked: Some imbalance across embarkation ports; ‘S’ (Southampton) dominates.
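
The imbalance described for Sex and Embarked can be quantified with a frequency table alongside the bar plots. The sketch below uses `embarked_sim`, a hypothetical vector with made-up counts rather than the actual Titanic values, and relies only on base R.

```r
# Hypothetical stand-in for titanic$Embarked; the counts are made up for illustration
embarked_sim <- c(rep("S", 70), rep("C", 20), rep("Q", 8), NA, NA)

# useNA = "ifany" keeps missing values visible as their own category
counts <- table(embarked_sim, useNA = "ifany")

# Proportions of the total, including the missing category
props <- round(prop.table(counts), 3)

print(counts)
print(props)
```

Applied to the report's data, the same two calls would take `titanic$Sex` or `titanic$Embarked` as input.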

Step 7: Document Issues and Extreme Outliers

Several issues were identified during the univariate analysis of the dataset:

Age:
- Missing Values: Age is missing for 177 of 891 passengers (about 20%), as shown in the missing values summary. Missing values may bias analysis if age-related patterns (e.g., survival rates) are explored without an imputation or removal strategy.
- Distribution: The histogram and QQ plot revealed mild right-skewness, consistent with the normality assessment in Step 4.
- Outliers: The boxplot identified a few extreme outliers above approximately 70 years of age. These older individuals may have distinct survival probabilities or fare payments and could disproportionately influence statistical summaries such as the mean and standard deviation.
Fare:
- Distribution: The Fare variable displayed severe right-skewness, with most passengers paying relatively low fares and a small number paying extremely high fares.
- Extreme Outliers: The boxplot and histogram showed multiple extreme outliers, ranging from roughly 200 to more than 500 fare units. These outliers inflate the mean relative to the median and could affect models that assume homoscedasticity or linearity.
- Impact: Statistical modeling that uses Fare as a predictor would need to account for this skewness and extreme values, potentially through log transformation or robust regression techniques.
Embarked:
- Missing Values: Embarked is missing for 2 of 891 passengers. Missing embarkation information could affect group-based comparisons (e.g., analyzing fare differences or survival rates by embarkation point).

Summary of Impacts:
Missing data and extreme outliers must be addressed before conducting inferential statistics or predictive modeling. Strategies such as imputation for missing values, transformations for skewed variables, and outlier treatment (e.g., winsorization or robust modeling) should be considered to maintain the integrity of further analyses.
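
The outlier treatment mentioned above can be sketched in base R. The values in `x` are hypothetical and only mimic a right-skewed variable such as Fare. The 1.5 × IQR fences follow the usual boxplot convention, and the 5th/95th percentile caps used for winsorization are an arbitrary choice for illustration, not a recommendation drawn from this analysis.

```r
# Hypothetical right-skewed values standing in for a variable such as Fare
x <- c(7, 8, 8, 10, 13, 15, 26, 30, 52, 80, 263, 512)

# Boxplot-style fences: 1.5 * IQR beyond the first and third quartiles
q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
iqr <- unname(q[2] - q[1])
fences <- c(lower = unname(q[1]) - 1.5 * iqr, upper = unname(q[2]) + 1.5 * iqr)
is_outlier <- x < fences["lower"] | x > fences["upper"]

# Winsorize: pull values beyond the 5th/95th percentiles back to those percentiles
caps <- unname(quantile(x, c(0.05, 0.95), na.rm = TRUE))
x_wins <- pmin(pmax(x, caps[1]), caps[2])

print(x[is_outlier])
print(c(mean_raw = mean(x), mean_winsorized = mean(x_wins), median = median(x)))
```

Comparing the raw mean, the winsorized mean, and the median shows how strongly a few extreme values pull the mean, which is the distortion described for Fare.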

# Missing Values Summary
colSums(is.na(titanic))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2
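
The counts above correspond to roughly 19.9% missing for Age (177 of 891), 77.1% for Cabin (687 of 891), and 0.2% for Embarked (2 of 891). A minimal sketch of computing such shares and applying simple imputation is shown below. The data frame `df` is a hypothetical miniature with made-up values. Median and most-frequent-level imputation are shown only because they are the simplest options, and both understate the variability of the imputed variable.

```r
# Hypothetical miniature data frame; column names mirror the Titanic file
df <- data.frame(
  Age      = c(22, 38, NA, 35, NA, 54, 2, 27),
  Embarked = c("S", "C", "S", NA, "S", "Q", "S", "S"),
  stringsAsFactors = FALSE
)

# Share of missing values per column
missing_share <- colMeans(is.na(df))

# Median imputation for a numeric column
df$Age_imputed <- ifelse(is.na(df$Age), median(df$Age, na.rm = TRUE), df$Age)

# Most-frequent-level (mode) imputation for a categorical column
mode_level <- names(which.max(table(df$Embarked)))
df$Embarked_imputed <- ifelse(is.na(df$Embarked), mode_level, df$Embarked)

print(round(missing_share, 3))
print(df)
```

Keeping the imputed values in separate columns, as here, leaves the original variables intact so that analyses with and without imputation can be compared.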

Step 8: R Markdown and Knitting

(Handled by knitting this .Rmd into .pdf output.)

Step 9: References

Peng, R. D., Dominici, F., & Zeger, S. L. (2021). Exploratory Data Analysis for Complex Models. Journal of Computational and Graphical Statistics, 30(1), 234-248. https://doi.org/10.1080/10618600.2020.1764435

Shmueli, G., Bruce, P. C., Gedeck, P., & Patel, N. R. (2017). Data Mining for Business Analytics: Concepts, Techniques, and Applications in R. Wiley.

Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.

K.Campise85013RKnit

Kat Campise

2025-04-27