This report continues the exploratory data analysis (EDA) of the assigned dataset. The goal is to generate univariate visualizations, assess the normality of quantitative variables, document issues such as skewness, outliers, and missing values, and derive initial insights. Two quantitative variables (Age and Fare) and two categorical variables (Sex and Embarked) were selected for this purpose. Histograms, probability (QQ) plots, dot plots, boxplots, and bar plots were used to reveal the distributional properties of the data.
library(tidyverse)  # read_csv(), glimpse(), ggplot2
library(gridExtra)  # grid.arrange()
setwd("C:/DDS 8501 Titanic")
titanic <- read_csv("train.csv")
## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View column names and types
glimpse(titanic)
## Rows: 891
## Columns: 12
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
# Select Variables
quant_var1 <- titanic$Age
quant_var2 <- titanic$Fare
cat_var1 <- titanic$Sex
cat_var2 <- titanic$Embarked
# Histogram
p1 <- ggplot(titanic, aes(x = Age)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Age")
# Probability Plot
p2 <- ggplot(titanic, aes(sample = Age)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "QQ Plot of Age")
# Dot Plot
p3 <- ggplot(titanic, aes(x = Age)) +
  geom_dotplot(binwidth = 1, dotsize = 0.5) +
  labs(title = "Dot Plot of Age")
# Box Plot
p4 <- ggplot(titanic, aes(y = Age)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Boxplot of Age")
# Arrange plots
grid.arrange(p1, p2, p3, p4, ncol = 2)
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_qq()`).
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_qq_line()`).
## Warning: Removed 177 rows containing missing values or values outside the scale range
## (`stat_bindot()`).
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
# Histogram
p5 <- ggplot(titanic, aes(x = Fare)) +
  geom_histogram(bins = 30, fill = "lightcoral", color = "black") +
  labs(title = "Histogram of Fare")
# Probability Plot
p6 <- ggplot(titanic, aes(sample = Fare)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "QQ Plot of Fare")
# Dot Plot
p7 <- ggplot(titanic, aes(x = Fare)) +
  geom_dotplot(binwidth = 5, dotsize = 0.5) +
  labs(title = "Dot Plot of Fare")
# Box Plot
p8 <- ggplot(titanic, aes(y = Fare)) +
  geom_boxplot(fill = "orchid") +
  labs(title = "Boxplot of Fare")
# Arrange plots
grid.arrange(p5, p6, p7, p8, ncol = 2)
# Bar Plot of Sex
ggplot(titanic, aes(x = Sex)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Bar Plot of Sex", x = "Sex", y = "Count")
# Bar Plot of Embarked
ggplot(titanic, aes(x = Embarked)) +
  geom_bar(fill = "orange") +
  labs(title = "Bar Plot of Embarked", x = "Embarkation Port", y = "Count")
Normality of the two quantitative variables (Age and
Fare) was assessed using both visual methods (QQ plots) and
statistical methods (Shapiro-Wilk test).
Summary:
The normality assessments indicate that neither Age nor Fare is normally
distributed. Age exhibits mild right-skewness: most passengers were
relatively young, with fewer older individuals extending the right tail.
The Shapiro-Wilk test rejected normality for Age (W = 0.981,
p = 7.3e-08), but the deviation was not severe: W is close to 1, and
with 714 non-missing observations the test is sensitive to even small
departures from normality.
In contrast, Fare displayed strong right-skewness, confirmed by the Shapiro-Wilk test (W = 0.522, p < 2.2e-16), indicating substantial non-normality. This skewness was largely driven by a few passengers who paid extremely high fares.
These findings have important implications for further statistical modeling. If parametric methods assuming normality are to be applied, it may be necessary to transform Fare (e.g., log-transformation) to correct for the strong skewness and better meet model assumptions. For Age, mild skewness may not critically affect models but should still be noted.
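The contrast between mild and strong skewness can also be put on a numeric scale. The sketch below computes the moment-based sample skewness coefficient in base R. The helper name `sample_skewness` and the simulated vectors are illustrative assumptions and were not part of the original analysis.

```r
# Moment-based sample skewness: g1 = m3 / m2^(3/2),
# where m2 and m3 are the second and third central moments.
# `sample_skewness` is a hypothetical helper, not a base R function.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]   # drop missing values
  d <- x - mean(x)    # deviations from the mean
  m2 <- mean(d^2)
  m3 <- mean(d^3)
  m3 / m2^1.5
}

# Simulated stand-ins so the snippet runs without train.csv (assumption):
set.seed(1)
symmetric_x <- rnorm(500)                        # roughly symmetric
skewed_x <- rlnorm(500, meanlog = 0, sdlog = 1)  # strongly right-skewed

skew_symmetric <- sample_skewness(symmetric_x)
skew_skewed <- sample_skewness(skewed_x)
print(c(symmetric = skew_symmetric, right_skewed = skew_skewed))
```

With the report's data loaded, the same helper could be applied as `sample_skewness(titanic$Age)` and `sample_skewness(titanic$Fare)`. Values near 0 suggest symmetry, and larger positive values suggest a heavier right tail.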
# Remove NAs for testing
age_no_na <- titanic$Age[!is.na(titanic$Age)]
fare_no_na <- titanic$Fare[!is.na(titanic$Fare)]
# Shapiro-Wilk Test for Age
shapiro.test(age_no_na)
##
## Shapiro-Wilk normality test
##
## data: age_no_na
## W = 0.98146, p-value = 7.337e-08
# Shapiro-Wilk Test for Fare
shapiro.test(fare_no_na)
##
## Shapiro-Wilk normality test
##
## data: fare_no_na
## W = 0.52189, p-value < 2.2e-16
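The log transformation suggested for Fare can be checked by re-running the Shapiro-Wilk test on the transformed values. The following is a minimal sketch under stated assumptions: `compare_log_transform` is a hypothetical helper, the input is assumed to be non-negative, and a simulated log-normal sample stands in for Fare so the snippet runs on its own.

```r
# Compare the Shapiro-Wilk W statistic before and after a log transform.
# log1p(x) = log(1 + x) is used so that any zero values stay finite
# (assumption: the variable is non-negative, as a fare would be).
compare_log_transform <- function(x) {
  x <- x[!is.na(x)]
  c(W_raw = unname(shapiro.test(x)$statistic),
    W_log = unname(shapiro.test(log1p(x))$statistic))
}

# Simulated right-skewed stand-in (not the Titanic data):
set.seed(42)
fare_like <- rlnorm(800, meanlog = 3, sdlog = 1)

w_values <- compare_log_transform(fare_like)
print(round(w_values, 3))
```

On the report's data the call would be `compare_log_transform(titanic$Fare)`. A W closer to 1 after transformation would suggest the log scale is nearer to normal, though it does not guarantee that the transformed variable passes the test.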
As noted by Tukey (1977), univariate visualization forms the foundation of exploratory data analysis (EDA), providing critical insight into the structure of individual variables. Visualization helps identify skewness, multimodality, extreme values, missing data, and deviations from expected distributions (Peng et al., 2021). Without understanding the basic distribution, further analysis or modeling may be misguided, as many techniques assume certain properties such as normality (Shmueli et al., 2017).
Several issues were identified during the univariate analysis of the dataset:
Age: contains 177 missing values, as shown in the missing values
summary. Missing values may bias analysis if age-related patterns (e.g.,
survival rates) are explored without imputation or removal strategies.
Fare: displayed severe right-skewness, with most passengers paying
relatively low fares and a small number paying extremely high fares.
Using Fare as a predictor would need to account for this skewness and
the extreme values, potentially through log transformation or robust
regression techniques.
Embarked: has 2 missing values. Missing embarkation information could
affect group-based comparisons (e.g., analyzing fare differences or
survival rates by embarkation point).
Summary of Impacts:
Missing data and extreme outliers must be addressed before conducting
inferential statistics or predictive modeling. Strategies such as
imputation for missing values, transformations for skewed variables, and
outlier treatment (e.g., winsorization or robust modeling) should be
considered to maintain the integrity of further analyses.
# Missing Values Summary
colSums(is.na(titanic))
## PassengerId    Survived      Pclass        Name         Sex         Age
##           0           0           0           0           0         177
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked
##           0           0           0           0         687           2
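Two of the strategies named in the summary of impacts, imputation and winsorization, can be sketched in base R as follows. The helper names `impute_median` and `winsorize`, the percentile limits, and the toy vectors are all illustrative assumptions. Median imputation is only one simple option and does not account for why values are missing.

```r
# Median imputation: replace NA with the median of the observed values.
# `impute_median` is a hypothetical helper.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Winsorization: cap values outside the chosen quantiles at those quantiles.
# The default 1st/99th percentile limits are an illustrative choice.
winsorize <- function(x, lower = 0.01, upper = 0.99) {
  limits <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  pmin(pmax(x, limits[[1]]), limits[[2]])
}

# Toy vectors for illustration (not taken from train.csv):
age_toy <- c(22, 38, NA, 35, NA, 54)
fare_toy <- c(5, 8, 10, 12, 20, 500)

age_imputed <- impute_median(age_toy)
fare_capped <- winsorize(fare_toy, lower = 0.05, upper = 0.95)
print(age_imputed)
print(fare_capped)
```

Applied to the report's data, these would be called as `impute_median(titanic$Age)` and `winsorize(titanic$Fare)`. Whether to impute, cap, or use a robust model instead is a judgment that depends on the downstream analysis.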
(Handled by knitting this .Rmd into .pdf
output.)
Peng, R. D., Dominici, F., & Zeger, S. L. (2021). Exploratory Data Analysis for Complex Models. Journal of Computational and Graphical Statistics, 30(1), 234-248. https://doi.org/10.1080/10618600.2020.1764435
Shmueli, G., Bruce, P. C., Gedeck, P., & Patel, N. R. (2017). Data Mining for Business Analytics: Concepts, Techniques, and Applications in R. Wiley.
Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.