library(dplyr)
library(ggplot2)
library(tidyr)
library(plotly)
tsunami_data <- read.csv("DAT301_Proj1_sources.csv")
2025-09-24
This tsunami data set was originally compiled to document the destruction caused by tsunamis. Although that was its main focus, it records enough additional detail to answer questions beyond that topic. The questions I asked were:
1. What month had the most tsunami occurrences?
2. What were the shortest and tallest tsunamis recorded?
3. Is there a relationship between month and tsunami height?
4. What is the probability of a tsunami occurring each month?
What month had the most tsunami occurrences?
# Filter NA values
filtered_month <- tsunami_data %>%
  select(MONTH) %>%
  filter(!is.na(MONTH))
# Count of each occurrence of corresponding month numbers
tsunami_month <- filtered_month %>%
  summarise(
    January = sum(MONTH == 1),
    February = sum(MONTH == 2),
    March = sum(MONTH == 3),
    April = sum(MONTH == 4),
    May = sum(MONTH == 5),
    June = sum(MONTH == 6),
    July = sum(MONTH == 7),
    August = sum(MONTH == 8),
    September = sum(MONTH == 9),
    October = sum(MONTH == 10),
    November = sum(MONTH == 11),
    December = sum(MONTH == 12)) %>%
  pivot_longer(everything(),
               names_to = "Month",
               values_to = "Count")
# store the months in chronological order; otherwise they sort alphabetically
month_order <- c("January", "February", "March", "April", "May", "June",
                 "July", "August", "September", "October", "November", "December")
# convert Month to an ordered factor using that order, then sort the rows by it
new_tsunami_month <- tsunami_month %>%
  mutate(month = factor(Month, levels = month_order, ordered = TRUE)) %>%
  arrange(month)
# output table
new_tsunami_month
## # A tibble: 12 × 3
##    Month     Count month
##    <chr>     <int> <ord>
##  1 January     213 January
##  2 February    187 February
##  3 March       209 March
##  4 April       184 April
##  5 May         203 May
##  6 June        159 June
##  7 July        185 July
##  8 August      239 August
##  9 September   220 September
## 10 October     212 October
## 11 November    234 November
## 12 December    198 December
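R ships with a built-in `month.name` vector, so the month labels do not have to be typed by hand. A minimal sketch of the same kind of count in base R, assuming a small made-up vector of month numbers (`toy_months` is illustrative and is not the project data):

```r
# Made-up month numbers standing in for tsunami_data$MONTH (assumption)
toy_months <- c(1, 8, 8, 11, 3, NA, 8, 11, 12)

# Index the built-in month.name vector, then fix the level order so that
# tables and plots stay chronological instead of alphabetical
month_factor <- factor(month.name[toy_months], levels = month.name)

# table() drops NA by default and reports zero for months with no events
toy_counts <- table(month_factor)
toy_counts["August"]
```

Because the factor carries all twelve levels, months with no events still appear with a count of zero, which keeps a bar plot's x-axis complete.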
# bar plot
tsunami_month_plot <- ggplot(new_tsunami_month,
                             aes(x = month, y = Count, fill = Month)) +
  geom_bar(stat = "identity") +
  labs(title = "Tsunami Count by Month",
       x = "Month",
       y = "Count") +
  theme(legend.position = "none")
# display the plot
tsunami_month_plot
Looking at the graph, August has the most recorded tsunamis (239), followed by November (234) and September (220). In this data set, tsunamis were therefore recorded most often from late summer through fall.
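One way to check whether these monthly differences are larger than chance variation is a chi-squared goodness-of-fit test. The sketch below is an added illustration, not part of the original analysis. It reuses the counts from the table printed above and assumes every month is equally likely, which ignores the slightly different month lengths.

```r
# Monthly counts copied from the table printed above
counts <- c(January = 213, February = 187, March = 209, April = 184,
            May = 203, June = 159, July = 185, August = 239,
            September = 220, October = 212, November = 234, December = 198)

# With a single vector, chisq.test() runs a goodness-of-fit test against
# equal probabilities for every category (here: every month)
uniform_test <- chisq.test(counts)

# Expected count per month under the equal-probability assumption
expected_per_month <- unname(uniform_test$expected[1])

# Observed minus expected shows which months drive any difference
round(counts - uniform_test$expected, 1)

# A small p-value would suggest the monthly counts are not uniform
uniform_test$p.value
```

A test like this only says whether the counts differ from a uniform spread. It does not explain why, and it treats every recorded event as independent.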
What were the shortest and tallest tsunamis recorded?
filtered_heights <- tsunami_data %>%
  select(MAXIMUM_HEIGHT) %>%
  filter(!is.na(MAXIMUM_HEIGHT))
tsunami_height <- filtered_heights %>%
  summarise(
    min_height = min(MAXIMUM_HEIGHT),
    max_height = max(MAXIMUM_HEIGHT)
  )
tsunami_height
##   min_height max_height
## 1       0.01      524.6
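The summary gives only the two extreme values. To see which records they belong to, the rows can be looked up by position. A minimal base R sketch, assuming a made-up data frame (`toy` and its values are illustrative) that uses the same two column names as the project data:

```r
# Made-up records; only the column names match the real data (assumption)
toy <- data.frame(
  MONTH = c(3, 7, 7, 12, NA),
  MAXIMUM_HEIGHT = c(1.5, NA, 0.2, 30.0, 4.1)
)

# na.rm = TRUE skips missing heights without a separate filter step
height_range <- range(toy$MAXIMUM_HEIGHT, na.rm = TRUE)

# which.max() and which.min() ignore NA and return the row position
tallest <- toy[which.max(toy$MAXIMUM_HEIGHT), ]
shortest <- toy[which.min(toy$MAXIMUM_HEIGHT), ]
tallest
```

Keeping the whole row makes it possible to check the context of an extreme value, for example the month it was recorded in.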
Is there a relationship between month and tsunami height?
# filter out NA values
filtered_month_height <- tsunami_data %>%
  select(MONTH, MAXIMUM_HEIGHT) %>%
  filter(!is.na(MONTH), !is.na(MAXIMUM_HEIGHT))
# create linear model: height is the response, month is the predictor,
# so the fitted values are heights and match the plot's y-axis
filtered_month_height_mod <- lm(MAXIMUM_HEIGHT ~ MONTH, data = filtered_month_height)
x <- filtered_month_height$MONTH
y <- filtered_month_height$MAXIMUM_HEIGHT
xax <- list(
  title = "Month")
yax <- list(
  title = "Tsunami Height")
# create scatter plot
fig <- plot_ly(x = x, y = y, type = "scatter",
               mode = "markers", name = "data",
               width = 800, height = 430) %>%
  # add fitted linear regression line
  add_lines(x = x, y = fitted(filtered_month_height_mod),
            name = "fitted") %>%
  layout(title = "Month vs. Tsunami Height", xaxis = xax, yaxis = yax)
config(fig, displaylogo = FALSE)
As the regression line in the Month vs. Tsunami Height plot suggests, there
appears to be no linear relationship between month and tsunami height.
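A visual impression like this can be backed with numbers. The sketch below is an added illustration that runs on simulated data, not the project data: `toy`, its values, and the `MONTH_F` helper column are assumptions. It reads the slope and R-squared from a linear fit, then runs a Kruskal-Wallis test, which treats month as a category rather than a number on a line.

```r
# Simulated stand-in data with no built-in month effect (assumption)
set.seed(1)
toy <- data.frame(
  MONTH = sample(1:12, 200, replace = TRUE),
  MAXIMUM_HEIGHT = rexp(200, rate = 0.5)
)

# Linear fit with height as the response and month as the predictor
fit <- lm(MAXIMUM_HEIGHT ~ MONTH, data = toy)
slope <- coef(fit)[["MONTH"]]          # change in height per month step
r_squared <- summary(fit)$r.squared    # share of variance explained

# Months are cyclic (December is next to January), so also compare
# height distributions across months as unordered groups
toy$MONTH_F <- factor(toy$MONTH)
kw <- kruskal.test(MAXIMUM_HEIGHT ~ MONTH_F, data = toy)

c(slope = slope, r_squared = r_squared, kw_p_value = kw$p.value)
```

An R-squared near zero together with a large Kruskal-Wallis p-value would be consistent with the conclusion above. A small p-value would suggest that some months differ even though no straight-line trend exists.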
What is the probability of a tsunami occurring each month?
# count how many tsunamis occurred each month using filtered data from Q1
monthly_counts <- table(filtered_month)
# sum of tsunamis from all months
total_tsunamis <- sum(monthly_counts)
# calculate probability
percentage <- monthly_counts / total_tsunamis
# convert the numbers of the months into names
names(percentage) <- month_order[as.numeric(names(percentage))]
# paste0() avoids the extra spaces that paste() would insert into each label
pie(percentage,
    main = "Probability of Tsunamis Each Month",
    labels = paste0(names(percentage), " (",
                    round(percentage * 100, 1), "%)"))
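As a worked check of the probability step, the same shares can be computed directly from the monthly counts reported in the Q1 table. This is a sketch of the arithmetic only. The results are empirical proportions of recorded events, so they estimate which month a recorded tsunami fell in rather than the chance that a tsunami happens in a given month.

```r
# Monthly counts as reported in the Q1 table
counts <- c(January = 213, February = 187, March = 209, April = 184,
            May = 203, June = 159, July = 185, August = 239,
            September = 220, October = 212, November = 234, December = 198)

# prop.table() divides each count by the total, i.e. counts / sum(counts)
prob <- prop.table(counts)

# Build "Month (x.x%)" labels in one step with sprintf()
labels <- sprintf("%s (%.1f%%)", names(prob), 100 * prob)

# Label of the month with the largest share
labels[which.max(prob)]
# "August (9.8%)"
```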