TM5140–EDA and Data Visualization

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

Data source: https://www.kaggle.com/code/varimp/a-mostly-tidyverse-tour-of-the-titanic/report

Below is a brief overview of the variables included in the data set:

PassengerId: A unique identifier for each passenger (just a row number, not useful for prediction)
Survived: Indicates whether the passenger survived (1) or died (0)
Pclass: The passenger’s ticket class (1 = first, 2 = second, 3 = third)
Name: The name of the passenger
Sex: The passenger’s gender
Age: The age of the passenger
SibSp: The number of siblings/spouses aboard the Titanic
Parch: The number of parents/children aboard the Titanic
Ticket: The ticket number
Fare: The fare paid by the passenger
Cabin: The cabin number
Embarked: The port of embarkation (S = Southampton, C = Cherbourg, Q = Queenstown)

setwd("C:/Users/asank/Desktop/My Personal/01-JobWork/UOM Docs/TM5140")

1. Data Loading and Preprocessing

Before any analysis, it is essential to load the data correctly and set appropriate data types. Converting PassengerId to character (it is merely an identifier) and categorical variables such as Sex, Survived, Pclass, and Embarked to factors ensures that later analyses treat these variables appropriately.

library(tidyverse)

# Load the data set (adjust path as necessary)
path <- here::here("Datasets", "titanic.csv")
titanic <- read_csv(path)

# Quick look at the data
head(titanic)

# A tibble: 6 × 12
  PassengerId Survived Pclass Name    Sex     Age SibSp Parch Ticket  Fare Cabin
        <dbl>    <dbl>  <dbl> <chr>   <chr> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
1           1        0      3 Braund… male     22     1     0 A/5 2…  7.25 <NA> 
2           2        1      1 Cuming… fema…    38     1     0 PC 17… 71.3  C85  
3           3        1      3 Heikki… fema…    26     0     0 STON/…  7.92 <NA> 
4           4        1      1 Futrel… fema…    35     1     0 113803 53.1  C123 
5           5        0      3 Allen,… male     35     0     0 373450  8.05 <NA> 
6           6        0      3 Moran,… male     NA     0     0 330877  8.46 <NA> 
# ℹ 1 more variable: Embarked <chr>

colnames(titanic)

 [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
 [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
[11] "Cabin"       "Embarked"

# Data preprocessing: Converting variables to appropriate types
titanic <- titanic |>
  mutate(PassengerId = as.character(PassengerId),
         Sex = factor(Sex),
         Survived = factor(Survived, levels = c(1,0), labels = c("survived", "died")),
         Pclass = factor(Pclass),
         Embarked = factor(Embarked))
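
As a quick sanity check on the conversion above, the following minimal sketch applies the same factor() call to a small made-up vector (the values in `toy` are illustrative and not taken from the data set). It shows that listing the levels as c(1, 0) makes "survived" the first level, which is why survivors appear first in later tables and legends.

```r
# Toy 0/1 codes (hypothetical values, for illustration only)
toy <- c(0, 1, 1, 0, 0)

# Same conversion as in the preprocessing step
toy_f <- factor(toy, levels = c(1, 0), labels = c("survived", "died"))

levels(toy_f)      # order of levels follows the `levels` argument
table(toy_f)       # counts per level, in level order
as.integer(toy_f)  # underlying integer codes: 1 = survived, 2 = died
```

The integer codes matter whenever a function treats the factor numerically: "survived" is stored as 1 and "died" as 2, not as the original 1 and 0.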

2. Descriptive Statistics

Descriptive statistics provide an overview of the data’s central tendency, dispersion, and overall distribution.

The summary() function offers a quick way to compute several measures of central tendency and variability at once.
For numeric variables (such as Age and Fare) it reports the minimum, first quartile (25%), median, mean, third quartile (75%), and maximum, along with the number of missing values; for factors it reports the count in each level.

# Overall summary
summary(titanic)

 PassengerId            Survived   Pclass      Name               Sex     
 Length:891         survived:342   1:216   Length:891         female:314  
 Class :character   died    :549   2:184   Class :character   male  :577  
 Mode  :character                  3:491   Mode  :character               
                                                                          
                                                                          
                                                                          
                                                                          
      Age            SibSp           Parch           Ticket         
 Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Length:891        
 1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000   Class :character  
 Median :28.00   Median :0.000   Median :0.0000   Mode  :character  
 Mean   :29.70   Mean   :0.523   Mean   :0.3816                     
 3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000                     
 Max.   :80.00   Max.   :8.000   Max.   :6.0000                     
 NA's   :177                                                        
      Fare           Cabin           Embarked  
 Min.   :  0.00   Length:891         C   :168  
 1st Qu.:  7.91   Class :character   Q   : 77  
 Median : 14.45   Mode  :character   S   :644  
 Mean   : 32.20                      NA's:  2  
 3rd Qu.: 31.00                                
 Max.   :512.33
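
Note that summary() reports NA counts for numeric and factor columns but not for character columns such as Cabin. A minimal sketch of counting missing values in every column, using a small made-up data frame (the values in `toy` are hypothetical), could look like this:

```r
# Toy data frame with a few missing values (hypothetical values)
toy <- data.frame(
  Age   = c(22, NA, 26, 35, NA),
  Fare  = c(7.25, 71.3, 7.92, 53.1, 8.05),
  Cabin = c(NA, "C85", NA, "C123", NA)
)

# Number of missing values per column
na_counts <- colSums(is.na(toy))
na_counts

# Share of missing values per column, in percent
na_share <- round(100 * colMeans(is.na(toy)), 1)
na_share
```

The same two calls should work unchanged on the titanic tibble.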

The describe() function in the psych package is similar to summary() but is aimed at quantitative data.
In addition to the mean and median, it reports the standard deviation, the trimmed mean (10% by default), the median absolute deviation (mad), range, skew, kurtosis, and standard error (se).
n is the number of non-missing values.

# Detailed summary
library(psych)
subdata <- titanic |>
  select(Age, SibSp, Parch, Fare)
describe(subdata)

      vars   n  mean    sd median trimmed   mad  min    max  range skew
Age      1 714 29.70 14.53  28.00   29.27 13.34 0.42  80.00  79.58 0.39
SibSp    2 891  0.52  1.10   0.00    0.27  0.00 0.00   8.00   8.00 3.68
Parch    3 891  0.38  0.81   0.00    0.18  0.00 0.00   6.00   6.00 2.74
Fare     4 891 32.20 49.69  14.45   21.38 10.24 0.00 512.33 512.33 4.77
      kurtosis   se
Age       0.16 0.54
SibSp    17.73 0.04
Parch     9.69 0.03
Fare     33.12 1.66

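A brief interpretation: The skewness of Age (0.39) is close to zero, suggesting a roughly symmetric distribution. Fare, by contrast, is strongly right-skewed (skew 4.77) with very high kurtosis (33.12), indicating heavy tails and extreme values; its mean (32.20) is more than twice its median (14.45).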
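
To make the trimmed mean, skewness, and kurtosis more concrete, here is a minimal base R sketch on a made-up sample (the vector `x` is hypothetical). It uses the plain moment-based formulas; describe() may apply a slightly different small-sample convention, so its figures need not match these to the last digit.

```r
# Toy right-skewed sample (hypothetical values)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

mean(x)              # pulled upwards by the single large value
mean(x, trim = 0.1)  # 10% trimmed mean: drops the lowest and highest value here
median(x)

# Central moments
m  <- mean(x)
m2 <- mean((x - m)^2)
m3 <- mean((x - m)^3)
m4 <- mean((x - m)^4)

# Moment-based skewness and excess kurtosis
skewness        <- m3 / m2^(3/2)
excess_kurtosis <- m4 / m2^2 - 3

skewness         # positive: long right tail
excess_kurtosis  # positive: heavier tails than a normal distribution
```

The trimmed mean sits much closer to the median than the ordinary mean does, which is the same pattern Fare shows in the describe() output.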

# Group-wise summaries by Survival Status
titanic |>
  group_by(Survived) |>
  summarise(mean_Fare = mean(Fare),
            sd_Fare = sd(Fare),
            median_Fare = median(Fare),
            IQR_Fare = IQR(Fare))

# A tibble: 2 × 5
  Survived mean_Fare sd_Fare median_Fare IQR_Fare
  <fct>        <dbl>   <dbl>       <dbl>    <dbl>
1 survived      48.4    66.6        26       44.5
2 died          22.1    31.4        10.5     18.1

A brief interpretation: Survivors paid more on average than non-survivors (mean fare 48.4 vs. 22.1; median 26 vs. 10.5), which may reflect differences in socio-economic status. Fares among survivors were also more variable (standard deviation 66.6 vs. 31.4).
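
Fare has no missing values, so mean(Fare) works as written. Age has 177 missing values, so the same pattern applied to Age would need na.rm = TRUE. A minimal sketch on made-up data (the tibble `toy` is hypothetical):

```r
library(dplyr)

# Toy data (hypothetical values): Age contains a missing value
toy <- tibble(
  Survived = factor(c("survived", "survived", "died", "died", "died")),
  Age      = c(30, NA, 20, 40, 60)
)

age_by_group <- toy |>
  group_by(Survived) |>
  summarise(
    n_known    = sum(!is.na(Age)),         # non-missing ages per group
    mean_Age   = mean(Age, na.rm = TRUE),  # without na.rm = TRUE, "survived" would be NA
    median_Age = median(Age, na.rm = TRUE)
  )

age_by_group
```

Reporting n_known next to the mean makes it visible how many observations each group summary is actually based on.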

# Group-wise summaries by Survival and Passenger Class
titanic |>
  group_by(Survived, Pclass) |>
  summarise(mean_Fare = mean(Fare),
            sd_Fare = sd(Fare),
            median_Fare = median(Fare),
            IQR_Fare = IQR(Fare))

# A tibble: 6 × 6
# Groups:   Survived [2]
  Survived Pclass mean_Fare sd_Fare median_Fare IQR_Fare
  <fct>    <fct>      <dbl>   <dbl>       <dbl>    <dbl>
1 survived 1           95.6    85.3       78.0     60.5 
2 survived 2           22.1    10.9       21       13.2 
3 survived 3           13.7    10.7        8.52     8.1 
4 died     1           64.7    60.7       44.8     51.2 
5 died     2           19.4    15.3       13       14.5 
6 died     3           13.7    12.1        8.05     7.39

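A brief interpretation: Fare differs far more between classes than between survival groups. First-class survivors paid a mean fare of 95.6 compared with 13.7 for third-class survivors, and within first class survivors paid more than those who died (95.6 vs. 64.7), whereas in third class the mean fare was the same (13.7) for both groups. This points to a link between class, fare, and survival rather than an effect of fare alone.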

# Using describeBy function in psych package
subdata <- titanic |>
  select(Age, SibSp, Parch, Fare, Survived)
describeBy(subdata, group = subdata$Survived)


 Descriptive statistics by group 
group: survived
         vars   n  mean    sd median trimmed   mad  min    max  range skew
Age         1 290 28.34 14.95     28   28.18 13.34 0.42  80.00  79.58 0.18
SibSp       2 342  0.47  0.71      0    0.36  0.00 0.00   4.00   4.00 1.94
Parch       3 342  0.46  0.77      0    0.31  0.00 0.00   5.00   5.00 1.74
Fare        4 342 48.40 66.60     26   34.32 25.81 0.00 512.33 512.33 3.83
Survived    5 342  1.00  0.00      1    1.00  0.00 1.00   1.00   0.00  NaN
         kurtosis   se
Age         -0.10 0.88
SibSp        5.31 0.04
Parch        3.47 0.04
Fare        20.21 3.60
Survived      NaN 0.00
------------------------------------------------------------ 
group: died
         vars   n  mean    sd median trimmed   mad min max range skew kurtosis
Age         1 424 30.63 14.17   28.0   29.76 11.86   1  74    73 0.58     0.25
SibSp       2 549  0.55  1.29    0.0    0.22  0.00   0   8     8 3.50    14.27
Parch       3 549  0.33  0.82    0.0    0.11  0.00   0   6     6 3.30    13.23
Fare        4 549 22.12 31.39   10.5   15.34  4.85   0 263   263 4.53    26.01
Survived    5 549  2.00  0.00    2.0    2.00  0.00   2   2     0  NaN      NaN
           se
Age      0.69
SibSp    0.05
Parch    0.04
Fare     1.34
Survived 0.00

3. Visual Presentations

Visualization not only makes the data more accessible but also highlights patterns and relationships that might not be apparent from tables alone. The appropriate form of plot or chart depends on the nature of the data. Below are several plots, each with the reason for choosing it.

3.1. Bar Charts for Qualitative/Categorical Data

Passenger Class Composition

Why a Bar Chart?

Bar charts are ideal for categorical data. They enable straightforward comparison of counts across groups, in this case passenger classes.

pclassdata <- titanic |>
  mutate(Pclass = recode(Pclass, "1" = "First", "2" = "Second", "3" = "Third")) |>
  count(Pclass) |>
  mutate(Percentage = round(prop.table(n)*100, 3)) |>
  setNames(c("Passenger Class", "Count", "Percentage"))

knitr::kable(pclassdata)

Passenger Class	Count	Percentage
First	216	24.242
Second	184	20.651
Third	491	55.107

# Plotting count of passengers by class
ggplot(data = pclassdata, aes(x = `Passenger Class`, y = Count)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_text(aes(label = Count), vjust = -0.3, size = 4.5) +
  ggtitle("Composition of Passengers by Passenger Class") +
  theme_minimal() +
  theme(axis.text = element_text(size = 15), axis.title = element_text(size = 15))

This chart shows that third class had the highest number of passengers, followed by first and second classes.

# Plotting percentage of passengers by class
ggplot(data = pclassdata, aes(x = `Passenger Class`, y = Percentage)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_text(aes(label = paste(Percentage, "%")), vjust = -0.3, size = 4.5) +
  ggtitle("Composition of Passengers by Passenger Class") +
  theme_minimal() +
  theme(axis.text = element_text(size = 15), axis.title = element_text(size = 15))

Stacked Bar Chart: Survival by Class

Why a Stacked/Composite Bar Chart?

Stacked bar charts make it possible to compare two or more qualitative variables (nominal or ordinal). Here they are used to compare sub-groups (survived vs. died) within each passenger class, which helps reveal differences in survival across classes.

survivalpclasstab <- titanic |>
  mutate(Pclass = recode(Pclass, "1" = "First", "2" = "Second", "3" = "Third"),
         Survived = recode(Survived, "survived" = "Survived", "died" = "Died")) |>
  count(Pclass, Survived)

survivalpclasstab

# A tibble: 6 × 3
  Pclass Survived     n
  <fct>  <fct>    <int>
1 First  Survived   136
2 First  Died        80
3 Second Survived    87
4 Second Died        97
5 Third  Survived   119
6 Third  Died       372

knitr::kable(survivalpclasstab |>
               pivot_wider(names_from = Pclass, values_from = n))

Survived	First	Second	Third
Survived	136	87	119
Died	80	97	372

# Stacked bar chart for survival by class
ggplot(data = survivalpclasstab, aes(x = Pclass, y = n, fill = Survived)) +
  geom_bar(stat = "identity") +
  ggtitle("Passenger Composition by Class and Survival Status") +
  labs(x = "Passenger Class", y = "Count") +
  theme_minimal() +
  theme(axis.text = element_text(size = 15), axis.title = element_text(size = 15))

Although the bars show counts, the chart makes clear that first class had the highest survival share: 136 of 216 first-class passengers survived, compared with 87 of 184 in second class and 119 of 491 in third class.
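
Because the classes differ in size, a stacked chart of counts makes survival shares harder to read than it needs to be. One option, sketched below, is to rescale each bar to the same height with position = "fill". The counts are copied from the table above; the object names `counts` and `p_fill` are illustrative.

```r
library(ggplot2)

# Counts by class and survival status, copied from the table above
counts <- data.frame(
  Pclass   = factor(rep(c("First", "Second", "Third"), each = 2)),
  Survived = factor(rep(c("Survived", "Died"), times = 3),
                    levels = c("Survived", "Died")),
  n        = c(136, 80, 87, 97, 119, 372)
)

# position = "fill" rescales every bar to height 1, so segments show proportions
p_fill <- ggplot(counts, aes(x = Pclass, y = n, fill = Survived)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = function(x) paste0(100 * x, "%")) +
  labs(x = "Passenger Class", y = "Share of passengers") +
  theme_minimal()

p_fill
```

The trade-off is that the filled chart hides the class sizes, so it complements the count-based chart rather than replacing it.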

Multiple Bar Chart/Grouped Bar Chart/Clustered Bar Chart

Multiple bar charts also compare two or more qualitative variables, but the bars for each sub-group are drawn side by side rather than stacked, which makes the sub-group values easier to compare directly.

survivalpclasstab1 <- survivalpclasstab |>
  group_by(Pclass) |>
  mutate(prop = round(n/sum(n), 4)*100) |>
  select(-n)

survivalpclasstab1

# A tibble: 6 × 3
# Groups:   Pclass [3]
  Pclass Survived  prop
  <fct>  <fct>    <dbl>
1 First  Survived  63.0
2 First  Died      37.0
3 Second Survived  47.3
4 Second Died      52.7
5 Third  Survived  24.2
6 Third  Died      75.8

knitr::kable(survivalpclasstab1 |>
               pivot_wider(names_from = Pclass, values_from = prop))

Survived	First	Second	Third
Survived	62.96	47.28	24.24
Died	37.04	52.72	75.76

# Multiple bar chart by passenger class and survival status
ggplot(data = survivalpclasstab1, aes(x = Pclass, y = prop, fill = Survived)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  ggtitle("Passenger Composition by Class and Survival Status") +
  geom_text(aes(label = prop), vjust = 1.6, color="black", position = position_dodge(0.9), size = 4.5) +
  labs(x = "Passenger Class", y = "Percentage") +
  theme_minimal() +
  theme(axis.text = element_text(size = 15), axis.title = element_text(size = 15))

3.2. Visualizing Quantitative Data

Histogram of Fare

Why a Histogram?

Histograms are used to display the distribution of a continuous variable. The Fare histogram shows the spread, central tendency, and potential outliers.

ggplot(data = titanic, aes(x = Fare)) +
  geom_histogram(binwidth = 15, color = "black", fill = "white") +
  ggtitle("Histogram of Fare") +
  theme_minimal()

The histogram of Fare reveals that most fares are relatively low with a long tail towards higher values, indicating the presence of expensive tickets (often corresponding to higher classes).

ggplot(data = titanic, aes(x = Fare)) +
  geom_histogram(binwidth = 15, color = "black", fill = "white") +
  ggtitle("Histograms of Fare by Class") +
  facet_wrap(vars(Pclass), nrow = 3, scales = "free_y") +
  theme_minimal() +  # apply the complete theme first ...
  theme(axis.text = element_text(size = 15),  # ... so these tweaks are not overwritten
        axis.title = element_text(size = 15),
        legend.position = "bottom")

Box Plot of Fare

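Why a Box Plot/Box and Whisker Plot?

Box plots summarize the distribution of a numeric variable through its five-number summary: the minimum, the first quartile (Q1), the median, the third quartile (Q3), and the maximum. The box spans Q1 to Q3 with a line at the median. In geom_boxplot(), the whiskers extend by default to the most extreme values within 1.5 times the interquartile range of the box, and points beyond them are drawn individually as potential outliers.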
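
The quantities behind a box plot can be computed by hand. The sketch below does this in base R for a made-up sample (the vector `x` is hypothetical), assuming the common 1.5 times IQR rule for flagging outliers, which is also the default of the coef argument in geom_boxplot().

```r
# Toy sample with one extreme value (hypothetical values)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

q   <- quantile(x, probs = c(0.25, 0.5, 0.75))  # Q1, median, Q3
iqr <- unname(q[3] - q[1])                      # interquartile range

lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[3]) + 1.5 * iqr

# Points beyond the fences are drawn individually as outliers
outliers <- x[x < lower_fence | x > upper_fence]

# Whiskers end at the most extreme observations inside the fences
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

q
c(lower_fence = lower_fence, upper_fence = upper_fence)
outliers
whiskers
```

Here the value 50 lies above the upper fence, so it would be plotted as a separate point while the upper whisker stops at 9.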

ggplot(data = titanic, aes(y = Fare)) +
  geom_boxplot() +
  ggtitle("Box Plot of Fare") +
  theme_minimal()

Plotting Age by Pclass can reveal differences in age distribution among classes.

ggplot(data = titanic, aes(x = Pclass, y = Age)) +
  geom_boxplot() +
  ggtitle("Box Plot of Age by Passenger Class") +
  theme_minimal()

Compare the medians, box heights, and whisker lengths across the three classes to judge whether one class tends to be older or younger than the others and how much the age ranges differ.

Plotting Fare by Pclass can reveal differences in fare distribution among classes.

ggplot(data = titanic, aes(x = Pclass, y = Fare)) +
  geom_boxplot() +
  ggtitle("Box Plot of Fare by Passenger Class") +
  theme_minimal()

Violin Plot of Age by Survival Status

Why a Violin Plot?

A violin plot combines the benefits of a box plot and a density plot: it shows the full shape of the distribution of a continuous variable rather than only its summary statistics. This is particularly useful for spotting multiple modes or skewness, for example in Age across survival status.

ggplot(data = titanic, aes(x = Survived, y = Age)) +
  geom_violin() +
  geom_jitter(width = 0.2, alpha = 0.5) +
  ggtitle("Violin Plot of Age by Survival Status") +
  theme_minimal() +
  labs(x = "Survived", y = "Age")

This visualization may show, for example, that survivors have a different age distribution than those who died, suggesting age as a factor in survival.

ggplot(data = titanic, aes(x = Pclass, y = Fare)) +
  geom_violin() +
  ggtitle("Violin Plot of Fare by Passenger Class") +
  theme_minimal() +
  labs(x = "Passenger Class", y = "Fare")

The width of the violin plot at different Fare values represents the density of passengers paying that fare.
A wider section means more passengers paid that fare, while narrow sections indicate fewer passengers.
First Class (Pclass = 1): The violin plot is wider at higher Fare values, indicating that first-class passengers generally paid higher fares. The distribution is spread out, meaning fares varied significantly. There is a long upper tail, showing that some passengers paid extremely high fares (outliers).
Second Class (Pclass = 2): The fare distribution is narrower and more concentrated compared to first class. Most second-class passengers paid moderate fares, with fewer extreme values.
Third Class (Pclass = 3): The violin is widest at the lower Fare values, indicating that most third-class passengers paid very little. There are few high-fare outliers, meaning fare variation in third class is low.
Skewness and Outliers: First-class fares show right-skewed distribution, meaning a few passengers paid significantly more than the majority. Third-class fares are concentrated at the lower end, with very few expensive tickets.
Insights:
- The violin plot shows how ticket prices were strongly linked to passenger class.
- Higher-class passengers paid significantly more, which aligns with historical records that first-class accommodations were luxurious and expensive.
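- Fares in first class stand well above those in the other classes, while second- and third-class fares overlap more (mean fares of roughly 19 to 22 in second class vs. 13.7 in third class in the group-wise summary).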

3.3. Scatter Plots and Correlation

Scatter Plot: Age vs. Fare

Why a Scatter Plot?

Scatter plots are ideal for exploring relationships between two continuous variables. They help identify correlations, clusters, or outliers.

ggplot(data = titanic, aes(x = Age, y = Fare)) +
  geom_point() +
  ggtitle("Scatter Plot of Age vs. Fare") +
  theme_minimal()

This plot shows how Age and Fare relate to each other.

data <- titanic |> drop_na()
cor(data$Age, data$Fare)

[1] -0.09242386

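The correlation between Age and Fare is very weak (a coefficient of about -0.092), indicating that age is not strongly associated with the fare paid. Note that drop_na() without arguments removes every row with a missing value in any column, including Cabin, so this coefficient is computed only on passengers for whom all variables are recorded; drop_na(Age, Fare) would keep every passenger with both values known.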
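
How much data a correlation is based on depends on how missing values are removed. The following minimal sketch uses a made-up data frame (the values in `toy` are hypothetical) to contrast dropping rows with a missing value in any column against dropping rows only where Age or Fare is missing.

```r
library(tidyr)

# Toy data (hypothetical values): Cabin is missing far more often than Age or Fare
toy <- data.frame(
  Age   = c(22, 38, 26, 35, NA, 54),
  Fare  = c(7.25, 71.3, 7.92, 53.1, 8.46, 51.9),
  Cabin = c(NA, "C85", NA, "C123", NA, NA)
)

all_complete <- drop_na(toy)             # drops rows with an NA in any column
age_fare     <- drop_na(toy, Age, Fare)  # drops rows only if Age or Fare is NA

nrow(all_complete)  # rows left when every column must be complete
nrow(age_fare)      # rows left when only Age and Fare must be complete

# Correlation on every row where both Age and Fare are known
r_pairs <- cor(age_fare$Age, age_fare$Fare)

# Base R equivalent without creating a filtered copy
r_base <- cor(toy$Age, toy$Fare, use = "complete.obs")

c(r_pairs = r_pairs, r_base = r_base)
```

On the Titanic data the two approaches keep very different numbers of rows, because Cabin is missing for most passengers, so it is worth checking nrow() before interpreting the coefficient.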

Scatter Plot with Density Contours

Adding density contours to the scatter plot provides insight into where the data points are most concentrated.

ggplot(data = titanic, aes(x = Age, y = Fare)) +
  geom_point() +
  geom_density_2d() + # Contours of a 2D density estimate
  ggtitle("Age vs. Fare with Density Contours") +
  theme_minimal()

The contours highlight regions with a high density of observations, helping to reveal clusters that may be hidden in a standard scatter plot.

ggplot(data = titanic, aes(x = Age, y = Fare, colour = Pclass)) +
  geom_point(alpha = 0.5) +
  scale_color_viridis_d() +
  theme_minimal()

3.4. Pairwise Relationships with `ggpairs`

Why use ggpairs?

The GGally::ggpairs() function creates a matrix of plots showing pairwise relationships. This multivariate visualization is very effective for identifying correlations and potential interactions among several variables.

data_for_pairs <- titanic |>
  select(Survived, Pclass, Sex, Age, Fare)

GGally::ggpairs(data_for_pairs, aes(color = Survived, fill = Survived)) +
  scale_color_viridis_d() +
  scale_fill_viridis_d()

The matrix provides insights into how Age, Fare, and other variables interact with each other. For example, you can quickly compare distributions and see how survival status may differentiate clusters in the data.

titanic |>
  select(Age, SibSp, Parch, Fare) |>
  GGally::ggpairs()

4. Additional Illustrative Example: Time Series Plot

Although the Titanic data set isn’t time series data, let’s see a time series example (using dummy data) to show how trends over time can be visualized.

A time series is a sequence of observations taken sequentially in time.

# Dummy time series data (values are random; call set.seed() first for reproducible output)
time_data <- data.frame(
  day = as.Date("2024-01-01") + 0:364,
  value = runif(365) + seq(-140, 224)^2 / 10000)
head(time_data)

         day    value
1 2024-01-01 2.934478
2 2024-01-02 1.937903
3 2024-01-03 2.122333
4 2024-01-04 1.943358
5 2024-01-05 2.565018
6 2024-01-06 2.717201

p_time <- ggplot(time_data, aes(x = day, y = value)) +
  geom_line(color = "darkblue") +
  ggtitle("Time Series Example") +
  theme_minimal()
print(p_time)

The COVID-19 Data Hub (https://covid19datahub.io/) provides access to global COVID-19 data, including confirmed cases, deaths, recoveries, and government response measures. Below is a short time series data analysis plan based on this data.

Installing and Loading Required Packages

#install.packages("COVID19")
#install.packages("tidyverse")

library(COVID19)   # For COVID-19 data
library(tidyverse) # For data manipulation & visualization

Retrieve COVID-19 Data

The COVID19 package allows you to retrieve COVID-19 data for different countries and regions.

# Load COVID-19 data for a specific country (e.g., Sri Lanka)
covid_data <- covid19(country = "Sri Lanka", level = 1)

We have invested a lot of time and effort in creating COVID-19 Data
Hub, please cite the following when using it:

  Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open
  Source Software 5(51):2376, doi: 10.21105/joss.02376

The implementation details and the latest version of the data are
described in:

  Guidotti, E., (2022), "A worldwide epidemiological database for
  COVID-19 at fine-grained spatial resolution", Sci Data 9(1):112, doi:
  10.1038/s41597-022-01245-1
To print citations in BibTeX format use:
 > print(citation('COVID19'), bibtex=TRUE)

To hide this message use 'verbose = FALSE'.

# View the first few rows
#head(covid_data)

Time Series Analysis

COVID-19 Cases Over Time

ggplot(covid_data, aes(x = date, y = confirmed)) +
  geom_line(color = "blue") +
  ggtitle("COVID-19 Confirmed Cases Over Time") +
  xlab("Date") +
  ylab("Total Cases") +
  theme_minimal()

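The plot shows the cumulative number of confirmed cases over time. A cumulative series does not fall (apart from occasional data corrections), so waves of infection appear as stretches where the curve rises steeply rather than as peaks.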
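
To see waves as peaks, the cumulative series can be converted to daily new cases by differencing. The sketch below assumes that `confirmed` holds a running total, as the "Total Cases" axis label suggests, and uses made-up numbers (the tibble `toy` is hypothetical and is not real COVID-19 data).

```r
library(dplyr)

# Toy cumulative counts (hypothetical values, not real COVID-19 data)
toy <- tibble(
  date      = as.Date("2021-01-01") + 0:5,
  confirmed = c(10, 15, 25, 25, 40, 42)
)

toy_daily <- toy |>
  arrange(date) |>
  mutate(new_cases = confirmed - lag(confirmed))  # first day has no previous value, so NA

toy_daily
```

Plotting new_cases against date with geom_line() would then show each wave as a rise and fall rather than as a steeper slope.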

COVID-19 Deaths Over Time

ggplot(covid_data, aes(x = date, y = deaths)) +
  geom_line(color = "red") +
  ggtitle("COVID-19 Deaths Over Time") +
  xlab("Date") +
  ylab("Total Deaths") +
  theme_minimal()

The plot shows cumulative deaths over time. Steep sections of the curve mark periods of high mortality.

Case Fatality Rate (CFR) Over Time

Case Fatality Rate (CFR) = (Deaths / Confirmed Cases) * 100

covid_data <- covid_data |>
  mutate(CFR = (deaths / confirmed) * 100)

ggplot(covid_data, aes(x = date, y = CFR)) +
  geom_line(color = "purple") +
  ggtitle("Case Fatality Rate (CFR) Over Time") +
  xlab("Date") +
  ylab("CFR (%)") +
  theme_minimal()

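The plot shows how the cumulative case fatality rate changed over time. A falling CFR may reflect improved treatment or vaccination, but it can also result from wider testing that detects more mild cases, so it should be interpreted with care.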
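
One practical detail of the CFR calculation is that it is undefined while no cases have been confirmed, since 0 / 0 gives NaN in R. A minimal sketch of guarding against this, using made-up numbers (the tibble `toy` and the column names CFR_raw and CFR_safe are hypothetical):

```r
library(dplyr)

# Toy cumulative counts (hypothetical values, not real COVID-19 data)
toy <- tibble(
  date      = as.Date("2020-03-01") + 0:4,
  confirmed = c(0, 0, 50, 200, 400),
  deaths    = c(0, 0, 1, 3, 4)
)

toy_cfr <- toy |>
  mutate(
    # 0 / 0 gives NaN on the first days
    CFR_raw  = (deaths / confirmed) * 100,
    # explicit NA where CFR is undefined
    CFR_safe = if_else(confirmed > 0, (deaths / confirmed) * 100, NA_real_)
  )

toy_cfr
```

Both versions plot the same line, but the explicit NA makes the undefined period visible in tables and avoids NaN propagating into later summaries.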