2025-09-14Last week, my coworkers and I had a spirited debate on cruises, with one of them swearing they will never go on a cruise in their life for a variety of reasons! This naturally got me interested in ship accidents.
I found the .csv file from https://vincentarelbundock.github.io/Rdatasets/datasets.html it’s under ShipAccidents.
Loading in the data and previewing it:
#loads libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## âś” dplyr 1.1.4 âś” readr 2.1.5
## âś” forcats 1.0.0 âś” stringr 1.5.1
## âś” ggplot2 3.5.2 âś” tibble 3.3.0
## âś” lubridate 1.9.4 âś” tidyr 1.3.1
## âś” purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Loads data
ship_data <- read.csv("ShipAccidents.csv")
# Previews the first few rows
head(ship_data)
## rownames type construction operation service incidents
## 1 1 A 1960-64 1960-74 127 0
## 2 2 A 1960-64 1975-79 63 0
## 3 3 A 1965-69 1960-74 1095 3
## 4 4 A 1965-69 1975-79 1095 4
## 5 5 A 1970-74 1960-74 1512 6
## 6 6 A 1970-74 1975-79 3353 18
Let’s look at some statistics regarding incidents:
# Calculate mean, median, standard deviation, min, max
# Calculate summary statistics of ship_data$incidents
mean_incidents <- mean(ship_data$incidents)
median_incidents <- median(ship_data$incidents)
sd_incidents <- sd(ship_data$incidents)
min_incidents <- min(ship_data$incidents)
max_incidents <- max(ship_data$incidents)
# Print results as sentences
print(paste("The mean number of incidents is", round(mean_incidents, 2)))
## [1] "The mean number of incidents is 8.9"
print(paste("The median number of incidents is", round(median_incidents, 2)))
## [1] "The median number of incidents is 2"
print(paste("The standard deviation of incidents is", round(sd_incidents, 2)))
## [1] "The standard deviation of incidents is 14.96"
print(paste("The minimum number of incidents is", min_incidents))
## [1] "The minimum number of incidents is 0"
print(paste("The maximum number of incidents is", max_incidents))
## [1] "The maximum number of incidents is 58"
Considering the minimum number of incidents is 0 and the maximum is 58, that means there are a few ships skewing the data. The median number of incidents is 2, so most ships don’t have many incidents, but because of some outliers with higher incident counts, the mean is pulled up to around 9.8.
Is there a correlation between ship service (how long a ship is active) and amount of incidents?
correlation <- cor(ship_data$service, ship_data$incidents)
print(paste("The correlation between service years and incidents is",
round(correlation, 2)))
## [1] "The correlation between service years and incidents is 0.85"
0.85 implies a strong relationship between years of service and rate of incidents!
I am curious about the ships with higher incidents
# Filter ships with more than 10 incidents
high_incidents <- subset(ship_data, incidents > 10)
# Preview results
head(high_incidents)
## rownames type construction operation service incidents
## 6 6 A 1970-74 1975-79 3353 18
## 8 8 A 1975-79 1975-79 2244 11
## 9 9 B 1960-64 1960-74 44882 39
## 10 10 B 1960-64 1975-79 17176 29
## 11 11 B 1965-69 1960-74 28609 58
## 12 12 B 1965-69 1975-79 20370 53
print(paste("The number of ships with more than 10 incidents is",
nrow(high_incidents)))
## [1] "The number of ships with more than 10 incidents is 11"
print(paste("The average years of service for these ships is",
round(mean(high_incidents$service / 365), )))
## [1] "The average years of service for these ships is 37"
#note that the column $service is in days, so I divided by 365 to get the years.
There are 11 ships with more than 10 incidents in the dataset. On average, these ships have been in service for about 37 years.
Now let’s see how it looks like in a nice graph
# Plots service vs incidents
ggplot(ship_data, aes(x = service / 365, y = incidents)) +
geom_point() +
labs(
x = "Service (years)",
y = "Incidents",
title = "Ship Service Years vs Incidents"
)
TL;DR don’t go on an old ship!