Executive Summary

This report continues the exploratory data analysis (EDA) of the Titanic training dataset (train.csv; 891 rows, 12 columns). The goal is to generate univariate visualizations, assess the normality of quantitative variables, document issues such as skewness, missing values, and outliers, and derive initial insights. Two quantitative variables (Age and Fare) and two categorical variables (Sex and Embarked) were selected for this purpose. Histograms, probability (QQ) plots, dot plots, and boxplots were used for the quantitative variables, and bar plots for the categorical variables, to reveal the distributional properties of the data.

Step 1: Data Load and Variable Selection

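# Load packages: tidyverse supplies read_csv(), glimpse(), and ggplot2;
# gridExtra supplies grid.arrange()
library(tidyverse)
library(gridExtra)

setwd("C:/DDS 8501 Titanic")
titanic <- read_csv("train.csv")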
## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View column names and types
glimpse(titanic)
## Rows: 891
## Columns: 12
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
# Select Variables
quant_var1 <- titanic$Age
quant_var2 <- titanic$Fare

cat_var1 <- titanic$Sex
cat_var2 <- titanic$Embarked

Step 2: Four-Plots for Quantitative Variables

Quantitative Variable 1: Age

# Histogram
p1 <- ggplot(titanic, aes(x = Age)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Age")

# Probability Plot
p2 <- ggplot(titanic, aes(sample = Age)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "QQ Plot of Age")

# Dot Plot
p3 <- ggplot(titanic, aes(x = Age)) +
  geom_dotplot(binwidth = 1, dotsize = 0.5) +
  labs(title = "Dot Plot of Age")

# Box Plot
p4 <- ggplot(titanic, aes(y = Age)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Boxplot of Age")

# Arrange plots
grid.arrange(p1, p2, p3, p4, ncol = 2)
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_qq()`).
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_qq_line()`).
## Warning: Removed 177 rows containing missing values or values outside the scale range
## (`stat_bindot()`).
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Quantitative Variable 2: Fare

# Histogram
p5 <- ggplot(titanic, aes(x = Fare)) +
  geom_histogram(bins = 30, fill = "lightcoral", color = "black") +
  labs(title = "Histogram of Fare")

# Probability Plot
p6 <- ggplot(titanic, aes(sample = Fare)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "QQ Plot of Fare")

# Dot Plot
p7 <- ggplot(titanic, aes(x = Fare)) +
  geom_dotplot(binwidth = 5, dotsize = 0.5) +
  labs(title = "Dot Plot of Fare")

# Box Plot
p8 <- ggplot(titanic, aes(y = Fare)) +
  geom_boxplot(fill = "orchid") +
  labs(title = "Boxplot of Fare")

# Arrange plots
grid.arrange(p5, p6, p7, p8, ncol = 2)
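
The points that geom_boxplot() draws beyond the whiskers are those lying more than 1.5 times the interquartile range (IQR) outside the quartiles. The sketch below reproduces that rule in base R so the flagged values can be listed rather than only seen. The vector fare_demo is a set of hypothetical values invented for illustration, not the values in train.csv.

```r
# Hypothetical fares for illustration only (not taken from train.csv)
fare_demo <- c(7.25, 7.90, 8.05, 13.00, 26.00, 30.00, 53.10, 71.28, 263.00, 512.33)

# Quartiles and interquartile range
q <- quantile(fare_demo, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Fences at 1.5 * IQR beyond the quartiles
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Values outside the fences are the ones a boxplot would mark as outliers
outliers <- fare_demo[fare_demo < lower_fence | fare_demo > upper_fence]
print(outliers)
```

Applying the same steps to titanic$Fare would list the fares that appear as individual points in the Fare boxplot.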

Step 3: Bar Plots for Categorical Variables

Categorical Variable 1: Sex

ggplot(titanic, aes(x = Sex)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Bar Plot of Sex", x = "Sex", y = "Count")

Categorical Variable 2: Embarked

ggplot(titanic, aes(x = Embarked)) +
  geom_bar(fill = "orange") +
  labs(title = "Bar Plot of Embarked", x = "Embarkation Port", y = "Count")
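
The heights of a bar plot are simply category counts, so a frequency table is a useful companion to the plot, especially when a variable has missing values that ggplot2 will typically draw as a separate NA bar. The following is a minimal base R sketch; embarked_demo is a hypothetical vector made up for illustration, not the Embarked column itself.

```r
# Hypothetical embarkation codes for illustration only
embarked_demo <- c("S", "C", "S", "Q", NA, "S", "C")

# Counts per category, keeping missing values as their own category
counts <- table(embarked_demo, useNA = "ifany")
print(counts)

# The same counts expressed as proportions of all observations
props <- round(prop.table(counts), 3)
print(props)
```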

Step 4: Assess Normality

Normality of the two quantitative variables (Age and Fare) was assessed using both visual methods (QQ plots) and statistical methods (Shapiro-Wilk test).

Summary:
The normality assessments indicate that neither Age nor Fare is normally distributed. Age exhibits mild right-skewness: most passengers were relatively young, with fewer older individuals extending the right tail. The Shapiro-Wilk test rejected normality (W = 0.98146, p = 7.337e-08), but W is close to 1 and, with 714 non-missing observations, the test is sensitive to even small departures from normality, so the deviation is not severe.

In contrast, Fare displayed strong right-skewness, and the Shapiro-Wilk test rejected normality decisively (W = 0.52189, p < 2.2e-16). This skewness was largely driven by a few passengers who paid extremely high fares.
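
Terms such as "mild" and "strong" skewness can be backed by a number. One common choice is the moment-based sample skewness, which is near 0 for a symmetric variable and positive for a right-skewed one. The function below is a sketch of that statistic in base R; sample_skewness is a helper name introduced here for illustration, and the two input vectors are made-up values, not the dataset.

```r
# Moment-based sample skewness: mean cubed deviation divided by sd cubed.
# sample_skewness() is an illustrative helper, not part of base R.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]                 # drop missing values first
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))        # standard deviation with divisor n
  mean((x - m)^3) / s^3
}

# Hypothetical vectors for illustration only
symmetric_demo <- c(1, 2, 3, 4, 5)
skewed_demo <- c(1, 1, 2, 2, 50)

print(sample_skewness(symmetric_demo))  # symmetric data: skewness of 0
print(sample_skewness(skewed_demo))     # one large value: positive skewness
```

Calling sample_skewness(titanic$Age) and sample_skewness(titanic$Fare) would give a direct numeric comparison of the two variables.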

These findings have important implications for further statistical modeling. If parametric methods assuming normality are to be applied, it may be necessary to transform Fare (e.g., log-transformation) to correct for the strong skewness and better meet model assumptions. For Age, mild skewness may not critically affect models but should still be noted.
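
The effect of a log transformation can be checked by re-running the Shapiro-Wilk test on the transformed values. The sketch below uses simulated right-skewed data as a stand-in for Fare (fare_sim is hypothetical and is not drawn from train.csv), and uses log1p(), which computes log(1 + x), as one way to stay defined if any value is zero.

```r
set.seed(42)

# Simulated right-skewed values standing in for Fare (hypothetical data)
fare_sim <- rlnorm(300, meanlog = 3, sdlog = 1)

# log1p(x) = log(1 + x), which remains defined when x is 0
fare_log <- log1p(fare_sim)

# Compare the Shapiro-Wilk W statistic before and after the transformation;
# values closer to 1 indicate closer agreement with a normal distribution
w_raw <- unname(shapiro.test(fare_sim)$statistic)
w_log <- unname(shapiro.test(fare_log)$statistic)
print(c(raw = w_raw, log = w_log))
```

A W statistic that moves toward 1 after the transformation suggests the transformed variable is a better fit for methods that assume normality, although a QQ plot of the transformed values should still be inspected.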

# Remove NAs for testing
age_no_na <- titanic$Age[!is.na(titanic$Age)]
fare_no_na <- titanic$Fare[!is.na(titanic$Fare)]

# Shapiro-Wilk Test for Age
shapiro.test(age_no_na)
## 
##  Shapiro-Wilk normality test
## 
## data:  age_no_na
## W = 0.98146, p-value = 7.337e-08
# Shapiro-Wilk Test for Fare
shapiro.test(fare_no_na)
## 
##  Shapiro-Wilk normality test
## 
## data:  fare_no_na
## W = 0.52189, p-value < 2.2e-16

Step 5: Importance of Univariate Data Visualization

As noted by Tukey (1977), univariate visualization forms the foundation of exploratory data analysis (EDA), providing critical insight into the structure of individual variables. Visualization helps identify skewness, multimodality, extreme values, missing data, and deviations from expected distributions (Peng et al., 2021). Without understanding the basic distribution, further analysis or modeling may be misguided, as many techniques assume certain properties such as normality (Shmueli et al., 2017).

Step 6: Interpretation of Visualizations

Step 7: Document Issues and Extreme Outliers

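Several issues were identified during the univariate analysis of the dataset: Age is missing for 177 of 891 passengers (about 19.9%), Cabin for 687 (about 77.1%), and Embarked for 2 (about 0.2%); Fare is strongly right-skewed, with a few extremely high fares acting as outliers; and Age is mildly right-skewed.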

Summary of Impacts:
Missing data and extreme outliers must be addressed before conducting inferential statistics or predictive modeling. Strategies such as imputation for missing values, transformations for skewed variables, and outlier treatment (e.g., winsorization or robust modeling) should be considered to maintain the integrity of further analyses.
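
Two of the strategies named above, imputation and winsorization, can be sketched in a few lines of base R. This is an illustration under stated assumptions rather than a recommended pipeline: the vectors are hypothetical, median imputation is only one of several imputation options, and winsorize() is a helper written here whose 5th and 95th percentile limits are an arbitrary choice.

```r
# Hypothetical values for illustration only (not taken from train.csv)
age_demo <- c(22, 38, NA, 35, 54, NA, 2, 27)
fare_demo <- c(7.25, 7.90, 8.05, 13.00, 26.00, 30.00, 53.10, 71.28, 263.00, 512.33)

# Median imputation: replace each NA with the median of the observed values
age_imputed <- ifelse(is.na(age_demo), median(age_demo, na.rm = TRUE), age_demo)

# Winsorization: clamp values to chosen lower and upper percentiles.
# winsorize() is an illustrative helper, not a base R function.
winsorize <- function(x, lower = 0.05, upper = 0.95) {
  limits <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, limits[1]), limits[2])
}
fare_wins <- winsorize(fare_demo)

print(age_imputed)
print(range(fare_wins))
```

Winsorization keeps every observation but limits the influence of the extremes, which is why it is often preferred to deleting outliers outright.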

# Missing Values Summary
colSums(is.na(titanic))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2
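
Raw counts are easier to compare across columns when expressed as percentages. Because is.na() returns a logical matrix, colMeans() gives the share of missing values per column directly. The data frame demo below is a small hypothetical example with invented values; the same call on titanic would report the percentages for the real data.

```r
# Small hypothetical data frame (values invented for illustration)
demo <- data.frame(
  Age = c(22, NA, 26, NA),
  Cabin = c(NA, "C85", NA, NA),
  Embarked = c("S", "C", "S", "S")
)

# Share of missing values per column, as a percentage
missing_pct <- round(100 * colMeans(is.na(demo)), 1)
print(missing_pct)
```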

Step 8: R Markdown and Knitting

(Handled by knitting this .Rmd into .pdf output.)

Step 9: References

Peng, R. D., Dominici, F., & Zeger, S. L. (2021). Exploratory Data Analysis for Complex Models. Journal of Computational and Graphical Statistics, 30(1), 234-248. https://doi.org/10.1080/10618600.2020.1764435

Shmueli, G., Bruce, P. C., Gedeck, P., & Patel, N. R. (2017). Data Mining for Business Analytics: Concepts, Techniques, and Applications in R. Wiley.

Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.