The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
Data source: https://www.kaggle.com/code/varimp/a-mostly-tidyverse-tour-of-the-titanic/report
Below is a brief overview of the variables included in the data set:
PassengerId: A unique identifier for each passenger (just a row number, not useful for prediction)
Survived: Indicates whether the passenger survived (1) or died (0)
Pclass: The passenger’s class (first, second, or third)
Name: The name of the passenger
Sex: The passenger’s gender
Age: The age of the passenger
SibSp: The number of siblings/spouses aboard the Titanic
Parch: The number of parents/children aboard the Titanic
Ticket: The ticket number
Fare: The fare paid by the passenger
Cabin: The cabin number
Embarked: The port of embarkation (S = Southampton, C = Cherbourg, Q = Queenstown)
setwd("C:/Users/asank/Desktop/My Personal/01-JobWork/UOM Docs/TM5140")
Before any analysis, it is essential to load the data correctly and set the proper data types. Converting PassengerId to character (it is merely an identifier) and categorical variables such as Sex, Survived, Pclass, and Embarked to factors ensures that later analyses treat these variables appropriately.
library(tidyverse)
# Load the data set (adjust path as necessary)
path <- here::here("Datasets", "titanic.csv")
titanic <- read_csv(path)
# Quick look at the data
head(titanic)
# A tibble: 6 × 12
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
<dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
1 1 0 3 Braund… male 22 1 0 A/5 2… 7.25 <NA>
2 2 1 1 Cuming… fema… 38 1 0 PC 17… 71.3 C85
3 3 1 3 Heikki… fema… 26 0 0 STON/… 7.92 <NA>
4 4 1 1 Futrel… fema… 35 1 0 113803 53.1 C123
5 5 0 3 Allen,… male 35 0 0 373450 8.05 <NA>
6 6 0 3 Moran,… male NA 0 0 330877 8.46 <NA>
# ℹ 1 more variable: Embarked <chr>
colnames(titanic)
[1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
[6] "Age" "SibSp" "Parch" "Ticket" "Fare"
[11] "Cabin" "Embarked"
# Data preprocessing: Converting variables to appropriate types
titanic <- titanic |>
mutate(PassengerId = as.character(PassengerId),
Sex = factor(Sex),
Survived = factor(Survived, levels = c(1,0), labels = c("survived", "died")),
Pclass = factor(Pclass),
Embarked = factor(Embarked))
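To make the effect of the levels and labels arguments concrete, here is a minimal base R sketch on a toy vector (the values are illustrative and not taken from the data set). It shows that the order given in levels determines both the label mapping and the internal integer codes, which is why numeric summaries of the Survived factor later report 1 for "survived" and 2 for "died".

```r
# Toy vector standing in for the raw 0/1 Survived column (illustrative values)
raw <- c(0, 1, 1, 0, 1)

# levels = c(1, 0) makes "survived" the first level and "died" the second
f <- factor(raw, levels = c(1, 0), labels = c("survived", "died"))

levels(f)      # level order: "survived" "died"
as.integer(f)  # internal codes: 2 1 1 2 1
table(f)       # counts per level, reported in level order
```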
Descriptive statistics provide an overview of the data’s central tendency, dispersion, and overall distribution.
The summary() function offers a quick way to compute several measures of central tendency and variability at once.
For numeric variables (such as Age and Fare) it reports the minimum, maximum, median, mean, the 25% and 75% quantiles, and the number of missing values (NA's), if any. For factors it reports the count in each level.
# Overall summary
summary(titanic)
PassengerId Survived Pclass Name Sex
Length:891 survived:342 1:216 Length:891 female:314
Class :character died :549 2:184 Class :character male :577
Mode :character 3:491 Mode :character
Age SibSp Parch Ticket
Min. : 0.42 Min. :0.000 Min. :0.0000 Length:891
1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000 Class :character
Median :28.00 Median :0.000 Median :0.0000 Mode :character
Mean :29.70 Mean :0.523 Mean :0.3816
3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
Max. :80.00 Max. :8.000 Max. :6.0000
NA's :177
Fare Cabin Embarked
Min. : 0.00 Length:891 C :168
1st Qu.: 7.91 Class :character Q : 77
Median : 14.45 Mode :character S :644
Mean : 32.20 NA's: 2
3rd Qu.: 31.00
Max. :512.33
A function similar to summary() is describe() in the psych package.
This function is useful when you are dealing with quantitative data. Unlike summary(), it also reports other measures such as the trimmed mean (10% trimming by default), the median absolute deviation (mad), skew, kurtosis, range, and standard error (se). n is the number of non-missing values.
# Detailed summary
library(psych)
subdata <- titanic |>
select(Age, SibSp, Parch, Fare)
describe(subdata)
vars n mean sd median trimmed mad min max range skew
Age 1 714 29.70 14.53 28.00 29.27 13.34 0.42 80.00 79.58 0.39
SibSp 2 891 0.52 1.10 0.00 0.27 0.00 0.00 8.00 8.00 3.68
Parch 3 891 0.38 0.81 0.00 0.18 0.00 0.00 6.00 6.00 2.74
Fare 4 891 32.20 49.69 14.45 21.38 10.24 0.00 512.33 512.33 4.77
kurtosis se
Age 0.16 0.54
SibSp 17.73 0.04
Parch 9.69 0.03
Fare 33.12 1.66
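The trimmed and mad columns can be hard to interpret without an example. The sketch below uses base R on a small made-up vector with one extreme value; it assumes the columns correspond to mean(x, trim = 0.1) and mad(x), which is consistent with the output above but is stated here as an assumption.

```r
# Made-up data with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

plain_mean   <- mean(x)              # pulled upward by the value 100
trimmed_mean <- mean(x, trim = 0.1)  # drops the lowest and highest 10% first
mad_value    <- mad(x)               # 1.4826 * median(|x - median(x)|)

plain_mean    # 14.5
trimmed_mean  # 5.5, the mean of 2..9
mad_value     # 3.7065
```

Both the trimmed mean and the MAD are far less sensitive to the single extreme value than the ordinary mean and standard deviation, which is why they are useful for a variable like Fare.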
A brief interpretation: A skewness near zero (0.39 for Age) suggests a roughly symmetric distribution, while the high skewness (4.77) and kurtosis (33.12) of Fare indicate a strongly right-skewed distribution with heavy tails or outliers.
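For readers who want to see what skewness and kurtosis measure, here is a minimal sketch of the classical moment-based definitions in base R. The helper names skewness() and excess_kurtosis() are hypothetical, and psych offers several variants through its type argument that apply different small-sample adjustments, so these values will not match describe() exactly.

```r
# Moment-based skewness: third central moment scaled by variance^(3/2)
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment scaled by variance^2, minus 3
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

symmetric    <- c(1, 2, 3, 4, 5)          # made-up symmetric data
right_skewed <- c(1, 1, 1, 2, 2, 3, 50)   # made-up data with a long right tail

skewness(symmetric)         # 0
skewness(right_skewed)      # clearly positive
excess_kurtosis(symmetric)  # -1.3
```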
# Group-wise summaries by Survival Status
titanic |>
group_by(Survived) |>
summarise(mean_Fare = mean(Fare),
sd_Fare = sd(Fare),
median_Fare = median(Fare),
IQR_Fare = IQR(Fare))
# A tibble: 2 × 5
Survived mean_Fare sd_Fare median_Fare IQR_Fare
<fct> <dbl> <dbl> <dbl> <dbl>
1 survived 48.4 66.6 26 44.5
2 died 22.1 31.4 10.5 18.1
A brief interpretation: Comparing these summary statistics between survivors and non-survivors provides insights into fare differences. For example, survivors paid a higher fare on average (mean 48.4 vs. 22.1, median 26 vs. 10.5), which may reflect their socio-economic status.
# Group-wise summaries by Survival and Passenger Class
titanic |>
group_by(Survived, Pclass) |>
summarise(mean_Fare = mean(Fare),
sd_Fare = sd(Fare),
median_Fare = median(Fare),
IQR_Fare = IQR(Fare))
# A tibble: 6 × 6
# Groups: Survived [2]
Survived Pclass mean_Fare sd_Fare median_Fare IQR_Fare
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 survived 1 95.6 85.3 78.0 60.5
2 survived 2 22.1 10.9 21 13.2
3 survived 3 13.7 10.7 8.52 8.1
4 died 1 64.7 60.7 44.8 51.2
5 died 2 19.4 15.3 13 14.5
6 died 3 13.7 12.1 8.05 7.39
A brief interpretation: This additional breakdown shows, for example, that first-class survivors paid far more on average than third-class survivors (mean fare 95.6 vs. 13.7), and that within first class survivors paid more than those who died (95.6 vs. 64.7), suggesting a potential link between class, fare, and survival probability.
# Using describeBy function in psych package
subdata <- titanic |>
select(Age, SibSp, Parch, Fare, Survived)
describeBy(subdata, group = subdata$Survived)
Descriptive statistics by group
group: survived
vars n mean sd median trimmed mad min max range skew
Age 1 290 28.34 14.95 28 28.18 13.34 0.42 80.00 79.58 0.18
SibSp 2 342 0.47 0.71 0 0.36 0.00 0.00 4.00 4.00 1.94
Parch 3 342 0.46 0.77 0 0.31 0.00 0.00 5.00 5.00 1.74
Fare 4 342 48.40 66.60 26 34.32 25.81 0.00 512.33 512.33 3.83
Survived 5 342 1.00 0.00 1 1.00 0.00 1.00 1.00 0.00 NaN
kurtosis se
Age -0.10 0.88
SibSp 5.31 0.04
Parch 3.47 0.04
Fare 20.21 3.60
Survived NaN 0.00
------------------------------------------------------------
group: died
vars n mean sd median trimmed mad min max range skew kurtosis
Age 1 424 30.63 14.17 28.0 29.76 11.86 1 74 73 0.58 0.25
SibSp 2 549 0.55 1.29 0.0 0.22 0.00 0 8 8 3.50 14.27
Parch 3 549 0.33 0.82 0.0 0.11 0.00 0 6 6 3.30 13.23
Fare 4 549 22.12 31.39 10.5 15.34 4.85 0 263 263 4.53 26.01
Survived 5 549 2.00 0.00 2.0 2.00 0.00 2 2 0 NaN NaN
se
Age 0.69
SibSp 0.05
Parch 0.04
Fare 1.34
Survived 0.00
Visualization not only makes the data more accessible but also highlights patterns and relationships that might not be apparent from tables alone. The appropriate type of plot depends on the nature of the data. Below are several plots along with reasons for choosing each.
Passenger Class Composition
Why a Bar Chart?
Bar charts are ideal for categorical data. They enable straightforward comparison of counts across groups, in this case passenger classes.
pclassdata <- titanic |>
mutate(Pclass = recode(Pclass, "1" = "First", "2" = "Second", "3" = "Third")) |>
count(Pclass) |>
mutate(Percentage = round(prop.table(n)*100, 3)) |>
setNames(c("Passenger Class", "Count", "Percentage"))
knitr::kable(pclassdata)
Passenger Class | Count | Percentage |
---|---|---|
First | 216 | 24.242 |
Second | 184 | 20.651 |
Third | 491 | 55.107 |
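The percentage column comes from prop.table(), which divides each count by the total. As a quick check, the following base R sketch reproduces the percentages from the counts in the table above (the vector is typed in by hand here rather than computed from the data set).

```r
# Class counts copied from the table above
counts <- c(First = 216, Second = 184, Third = 491)

# prop.table() divides each count by sum(counts)
pct <- round(prop.table(counts) * 100, 3)

sum(counts)  # 891 passengers in total
pct          # 24.242 20.651 55.107
```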
# Plotting count of passengers by class
ggplot(data = pclassdata, aes(x = `Passenger Class`, y = Count)) +
geom_bar(stat = "identity", fill = "skyblue") +
geom_text(aes(label = Count), vjust = -0.3, size = 4.5) +
ggtitle("Composition of Passengers by Passenger Class") +
theme_minimal() +
theme(axis.text = element_text(size = 15), axis.title = element_text(size = 15))
This chart shows that third class had the highest number of passengers, followed by first and second classes.
# Plotting percentage of passengers by class
ggplot(data = pclassdata, aes(x = `Passenger Class`, y = Percentage)) +
geom_bar(stat = "identity", fill = "skyblue") +
geom_text(aes(label = paste(Percentage, "%")), vjust = -0.3, size = 4.5) +
ggtitle("Composition of Passengers by Passenger Class") +
theme_minimal() +
theme(axis.text = element_text(size = 15), axis.title = element_text(size = 15))
Stacked Bar Chart: Survival by Class
Why a Stacked/Composite Bar Chart?
Stacked bar charts allow us to compare two or more qualitative variables (nominal or ordinal). Here we use one to compare sub-groups (survived vs. died) within each passenger class, which helps reveal differences in survival across classes.
survivalpclasstab <- titanic |>
mutate(Pclass = recode(Pclass, "1" = "First", "2" = "Second", "3" = "Third"),
Survived = recode(Survived, "survived" = "Survived", "died" = "Died")) |>
count(Pclass, Survived)
survivalpclasstab
# A tibble: 6 × 3
Pclass Survived n
<fct> <fct> <int>
1 First Survived 136
2 First Died 80
3 Second Survived 87
4 Second Died 97
5 Third Survived 119
6 Third Died 372
knitr::kable(survivalpclasstab |>
pivot_wider(names_from = Pclass, values_from = n))
Survived | First | Second | Third |
---|---|---|---|
Survived | 136 | 87 | 119 |
Died | 80 | 97 | 372 |
# Stacked bar chart for survival by class
ggplot(data = survivalpclasstab, aes(x = Pclass, y = n, fill = Survived)) +
geom_bar(stat = "identity") +
ggtitle("Passenger Composition by Class and Survival Status") +
labs(x = "Passenger Class", y = "Count") +
theme_minimal() +
theme(axis.text = element_text(size = 15), axis.title = element_text(size = 15))
Although the bars show counts rather than percentages, the chart shows that a larger share of first-class passengers survived (136 of 216) than of second-class (87 of 184) or third-class passengers (119 of 491).
Multiple Bar Chart/Grouped Bar Chart/Clustered Bar Chart
Multiple bar charts also allow us to compare two or more qualitative variables, but the bars for each sub-group are drawn side by side rather than stacked.
survivalpclasstab1 <- survivalpclasstab |>
group_by(Pclass) |>
mutate(prop = round(n/sum(n), 4)*100) |>
select(-n)
survivalpclasstab1
# A tibble: 6 × 3
# Groups: Pclass [3]
Pclass Survived prop
<fct> <fct> <dbl>
1 First Survived 63.0
2 First Died 37.0
3 Second Survived 47.3
4 Second Died 52.7
5 Third Survived 24.2
6 Third Died 75.8
knitr::kable(survivalpclasstab1 |>
pivot_wider(names_from = Pclass, values_from = prop))
Survived | First | Second | Third |
---|---|---|---|
Survived | 62.96 | 47.28 | 24.24 |
Died | 37.04 | 52.72 | 75.76 |
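The percentages above are computed within each class, so every column sums to 100. A base R sketch of the same idea, using the counts from the earlier table typed in by hand, is prop.table() with margin = 2 (column-wise proportions).

```r
# Counts of survival status (rows) by passenger class (columns),
# copied from the survival-by-class table
counts <- matrix(
  c(136, 80, 87, 97, 119, 372),
  nrow = 2,
  dimnames = list(Survived = c("Survived", "Died"),
                  Pclass   = c("First", "Second", "Third"))
)

# margin = 2 divides each cell by its column (class) total
pct <- round(prop.table(counts, margin = 2) * 100, 2)

pct
colSums(pct)  # each class sums to 100
```

Using margin = 1 instead would give the class composition within each survival status, which answers a different question.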
# Multiple bar chart by passenger class and survival status
ggplot(data = survivalpclasstab1, aes(x = Pclass, y = prop, fill = Survived)) +
geom_bar(stat = "identity", position = position_dodge()) +
ggtitle("Passenger Composition by Class and Survival Status") +
geom_text(aes(label = prop), vjust = 1.6, color="black", position = position_dodge(0.9), size = 4.5) +
labs(x = "Passenger Class", y = "Percentage") +
theme_minimal() +
theme(axis.text = element_text(size = 15), axis.title = element_text(size = 15))
Histogram of Fare
Why a Histogram?
Histograms are used to display the distribution of a continuous variable. The Fare histogram shows the spread, central tendency, and potential outliers.
ggplot(data = titanic, aes(x = Fare)) +
geom_histogram(binwidth = 15, color = "black", fill = "white") +
ggtitle("Histogram of Fare") +
theme_minimal()
The histogram of Fare reveals that most fares are relatively low, with a long tail towards higher values, indicating the presence of expensive tickets (often corresponding to higher classes).
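To see what a bin width of 15 means, the sketch below bins a few made-up fare values by hand in base R. The break points starting at 0 are an assumption made for illustration; ggplot2 chooses its own bin alignment by default, so the exact edges in the plot may differ.

```r
# Made-up fares for illustration
fares <- c(7.25, 7.92, 8.05, 8.46, 13, 26, 31, 53.1, 71.3)

# Bins of width 15: [0,15), [15,30), [30,45), [45,60), [60,75)
breaks <- seq(0, 75, by = 15)
bins   <- cut(fares, breaks = breaks, right = FALSE)

bin_counts <- table(bins)
bin_counts  # the bar heights a histogram with these bins would show
```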
ggplot(data = titanic, aes(x = Fare)) +
geom_histogram(binwidth = 15, color = "black", fill = "white") +
ggtitle("Histograms of Fare by Class") +
facet_wrap(vars(Pclass), nrow = 3, scales = "free_y") +
# Apply the complete theme first so the tweaks below are not overwritten
theme_minimal() +
theme(axis.text = element_text(size = 15),
axis.title = element_text(size = 15),
legend.position = "bottom")
Box Plot of Fare
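Why a Box Plot/Box and Whisker Plot?
Box plots summarize the distribution of a numeric variable and display its quartiles and outliers. A box plot is built from the five-number summary of the data: the smallest value, Q1, the median, Q3, and the largest value. In geom_boxplot(), the whiskers extend only to the most extreme observations within 1.5 × IQR of the box, and points beyond them are drawn individually as outliers.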
ggplot(data = titanic, aes(y = Fare)) +
geom_boxplot() +
ggtitle("Box Plot of Fare") +
theme_minimal()
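The quantities behind a box plot can be computed directly. The following base R sketch uses made-up fare values and the 1.5 × IQR rule (the default coef = 1.5 in geom_boxplot()) to find the quartiles, the fences, and the points that would be drawn as outliers.

```r
# Made-up fares for illustration
x <- c(7, 8, 8, 10, 13, 14, 26, 31, 53, 71, 512)

q   <- quantile(x, c(0.25, 0.5, 0.75))  # Q1, median, Q3
iqr <- IQR(x)                           # Q3 - Q1

# Observations outside these fences are flagged as outliers
lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[3]) + 1.5 * iqr
outliers    <- x[x < lower_fence | x > upper_fence]

q         # 9, 14, 42
iqr       # 33
outliers  # 512
```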
Plotting Age by Pclass can reveal differences in age distribution among classes.
ggplot(data = titanic, aes(x = Pclass, y = Age)) +
geom_boxplot() +
ggtitle("Box Plot of Age by Passenger Class") +
theme_minimal()
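This plot lets you compare the median age and the spread of ages across classes. In this data set, first-class passengers tend to be older than those in second and third class. Note that ggplot2 drops the 177 passengers with a missing Age and issues a warning when drawing the plot.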
Plotting Fare by Pclass can reveal differences in fare distribution among classes.
ggplot(data = titanic, aes(x = Pclass, y = Fare)) +
geom_boxplot() +
ggtitle("Box Plot of Fare by Passenger Class") +
theme_minimal()
Violin Plot of Age by Survival Status
Why a Violin Plot?
A violin plot combines the benefits of a box plot and a density plot to show the distribution of a continuous variable. Rather than only summary statistics, it displays the full shape of the distribution, which is particularly useful for spotting multiple modes or skewness, for example in Age across survival status.
ggplot(data = titanic, aes(x = Survived, y = Age)) +
geom_violin() +
geom_jitter(width = 0.2, alpha = 0.5) +
ggtitle("Violin Plot of Age by Survival Status") +
theme_minimal() +
labs(x = "Survived", y = "Age")
This visualization may show, for example, that survivors have a different age distribution than those who died, suggesting age as a factor in survival.
ggplot(data = titanic, aes(x = Pclass, y = Fare)) +
geom_violin() +
ggtitle("Violin Plot of Fare by Passenger Class") +
theme_minimal() +
labs(x = "Passenger Class", y = "Fare")
The width of the violin at a given Fare value represents the density of passengers paying that fare. A wider section means more passengers paid a fare near that value, while narrow sections indicate fewer passengers.
First Class (Pclass = 1): The violin plot is wider at higher Fare values, indicating that first-class passengers generally paid higher fares. The distribution is spread out, meaning fares varied significantly. There is a long upper tail, showing that some passengers paid extremely high fares (outliers).
Second Class (Pclass = 2): The fare distribution is narrower and more concentrated compared to first class. Most second-class passengers paid moderate fares, with fewer extreme values.
Third Class (Pclass = 3): The violin is widest at the lower Fare values, indicating that most third-class passengers paid very little. There are few high-fare outliers, meaning fare variation in third class is low.
Skewness and Outliers: First-class fares show a right-skewed distribution, meaning a few passengers paid significantly more than the majority. Third-class fares are concentrated at the lower end, with very few expensive tickets.
Insights:
Scatter Plot: Age vs. Fare
Why a Scatter Plot?
Scatter plots are ideal for exploring relationships between two continuous variables. They help identify correlations, clusters, or outliers.
ggplot(data = titanic, aes(x = Age, y = Fare)) +
geom_point() +
ggtitle("Scatter Plot of Age vs. Fare") +
theme_minimal()
This plot shows how Age and Fare relate to each other.
data <- titanic |> drop_na()
cor(data$Age, data$Fare)
[1] -0.09242386
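The correlation between Age and Fare is very weak (a coefficient of about -0.092), indicating that age is not strongly associated with the fare paid. Note that drop_na() without arguments removes every row that has a missing value in any column, including Cabin, so this coefficient is based only on passengers with no missing fields. Calling drop_na(Age, Fare) instead would keep all passengers for whom both variables are recorded.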
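How missing values are dropped determines which rows enter the correlation. The base R sketch below uses a small made-up data frame (the values are illustrative, not taken from the data set) to contrast removing rows with any missing field, which is what drop_na() does when called without arguments, with removing only rows where Age or Fare is missing.

```r
# Made-up data: Age and Cabin contain missing values
df <- data.frame(
  Age   = c(22, 38, NA, 35, 54, 2),
  Fare  = c(7.25, 71.28, 8.46, 53.10, 51.86, 21.08),
  Cabin = c(NA, "C85", NA, "C123", "E46", NA)
)

# Drop a row if ANY column is missing (like drop_na() with no arguments)
all_complete <- na.omit(df)

# Drop a row only if Age or Fare is missing (like drop_na(Age, Fare))
age_fare_complete <- df[complete.cases(df[, c("Age", "Fare")]), ]

nrow(all_complete)       # 3 rows survive
nrow(age_fare_complete)  # 5 rows survive

# cor() can also skip incomplete pairs itself
r_subset   <- cor(age_fare_complete$Age, age_fare_complete$Fare)
r_builtin  <- cor(df$Age, df$Fare, use = "complete.obs")
```

The two approaches can give noticeably different coefficients when a mostly unrelated column such as Cabin has many missing values, so it is worth checking how many rows remain before interpreting the result.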
Scatter Plot with Density Contours
Adding density contours to the scatter plot provides insight into where the data points are most concentrated.
ggplot(data = titanic, aes(x = Age, y = Fare)) +
geom_point() +
geom_density_2d() + # Contours of a 2D density estimate
ggtitle("Age vs. Fare with Density Contours") +
theme_minimal()
The contours highlight regions with a high density of observations, helping to reveal clusters that may be hidden in a standard scatter plot.
ggplot(data = titanic, aes(x = Age, y = Fare, colour = Pclass)) +
geom_point(alpha = 0.5) +
scale_color_viridis_d() +
theme_minimal()
ggpairs
Why use ggpairs?
The GGally::ggpairs() function creates a matrix of plots showing pairwise relationships. This multivariate visualization is very effective for identifying correlations and potential interactions among several variables.
data_for_pairs <- titanic |>
select(Survived, Pclass, Sex, Age, Fare)
GGally::ggpairs(data_for_pairs, aes(color = Survived, fill = Survived)) +
scale_color_viridis_d() +
scale_fill_viridis_d()
The matrix provides insights into how Age, Fare, and other variables interact with each other. For example, you can quickly compare distributions and see how survival status may differentiate clusters in the data.
titanic |>
select(Age, SibSp, Parch, Fare) |>
GGally::ggpairs()
Although the Titanic data set isn’t time series data, let’s look at a time series example (using dummy data) to show how trends over time can be visualized.
A time series is a sequence of observations taken sequentially in time.
# Dummy time series data
time_data <- data.frame(
day = as.Date("2024-01-01") + 0:364,
value = runif(365) + seq(-140, 224)^2 / 10000)
head(time_data)
day value
1 2024-01-01 2.934478
2 2024-01-02 1.937903
3 2024-01-03 2.122333
4 2024-01-04 1.943358
5 2024-01-05 2.565018
6 2024-01-06 2.717201
p_time <- ggplot(time_data, aes(x = day, y = value)) +
geom_line(color = "darkblue") +
ggtitle("Time Series Example") +
theme_minimal()
print(p_time)
The COVID-19 Data Hub (https://covid19datahub.io/) provides access to global COVID-19 data, including confirmed cases, deaths, recoveries, and government response measures. Below is a short time series data analysis plan based on this data.
Installing and Loading Required Packages
#install.packages("COVID19")
#install.packages("tidyverse")
library(COVID19) # For COVID-19 data
library(tidyverse) # For data manipulation & visualization
Retrieve COVID-19 Data
The COVID19 package allows you to retrieve COVID-19 data for different countries and regions.
# Load COVID-19 data for a specific country (e.g., Sri Lanka)
covid_data <- covid19(country = "Sri Lanka", level = 1)
We have invested a lot of time and effort in creating COVID-19 Data
Hub, please cite the following when using it:
Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open
Source Software 5(51):2376, doi: 10.21105/joss.02376
The implementation details and the latest version of the data are
described in:
Guidotti, E., (2022), "A worldwide epidemiological database for
COVID-19 at fine-grained spatial resolution", Sci Data 9(1):112, doi:
10.1038/s41597-022-01245-1
To print citations in BibTeX format use:
> print(citation('COVID19'), bibtex=TRUE)
To hide this message use 'verbose = FALSE'.
# View the first few rows
#head(covid_data)
Time Series Analysis
COVID-19 Cases Over Time
ggplot(covid_data, aes(x = date, y = confirmed)) +
geom_line(color = "blue") +
ggtitle("COVID-19 Confirmed Cases Over Time") +
xlab("Date") +
ylab("Total Cases") +
theme_minimal()
The plot shows the cumulative number of confirmed COVID-19 cases over time. Because the count is cumulative, waves of infection appear as steep sections of the curve rather than as peaks.
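To see waves as peaks, the cumulative counts can be converted into daily new cases by differencing. The sketch below uses a short made-up cumulative series in base R; applying the same idea to covid_data$confirmed is an assumption about how one might proceed, not part of the analysis above.

```r
# Made-up cumulative case counts for eight consecutive days
cumulative <- c(0, 2, 5, 5, 9, 20, 20, 26)

# Daily new cases: first value as is, then day-to-day differences
daily <- c(cumulative[1], diff(cumulative))

daily          # 0 2 3 0 4 11 0 6
cumsum(daily)  # recovers the cumulative series
```

With real data, reporting corrections can make a cumulative series drop occasionally, which would show up as negative daily values and may need separate handling.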
COVID-19 Deaths Over Time
ggplot(covid_data, aes(x = date, y = deaths)) +
geom_line(color = "red") +
ggtitle("COVID-19 Deaths Over Time") +
xlab("Date") +
ylab("Total Deaths") +
theme_minimal()
The plot shows cumulative deaths over time. Steep sections of the curve mark periods of high mortality, which helps identify the waves with the most deaths.
Case Fatality Rate (CFR) Over Time
Case Fatality Rate (CFR) = (Deaths / Confirmed Cases) * 100
covid_data <- covid_data |>
mutate(CFR = (deaths / confirmed) * 100)
ggplot(covid_data, aes(x = date, y = CFR)) +
geom_line(color = "purple") +
ggtitle("Case Fatality Rate (CFR) Over Time") +
xlab("Date") +
ylab("CFR (%)") +
theme_minimal()
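The plot shows how the cumulative case fatality rate (total deaths divided by total confirmed cases) has changed over time. Decreases in CFR may indicate improved treatment or vaccination, but they can also result from wider testing that detects more mild cases.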
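One practical detail of the CFR calculation is that dates with zero confirmed cases produce 0/0, which R evaluates to NaN. The following base R sketch, on made-up counts, shows the issue and one possible guard that returns NA for those dates instead.

```r
# Made-up cumulative counts; the first two days have no confirmed cases
confirmed <- c(0, 0, 10, 50, 200)
deaths    <- c(0, 0, 0, 1, 5)

# Direct formula: 0 / 0 gives NaN on the first two days
cfr_raw <- deaths / confirmed * 100

# Guarded version: only compute CFR where at least one case is confirmed
cfr <- ifelse(confirmed > 0, deaths / confirmed * 100, NA_real_)

cfr_raw  # NaN NaN 0 2 2.5
cfr      # NA  NA  0 2 2.5
```

ggplot2 simply leaves gaps for missing values in a line plot, so the guard mainly matters when the CFR column is summarised or exported.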