2025-09-24

Libraries

library(dplyr)  
library(ggplot2)  
library(tidyr)  
library(plotly)  
tsunami_data <- read.csv("DAT301_Proj1_sources.csv")  

Introduction

This tsunami data set was originally compiled to document the destruction caused by tsunamis. Although it was collected with that single focus, it holds enough detail to answer other questions as well. The questions I asked were:
1. What month had the most tsunami occurrences?
2. What were the shortest and tallest tsunamis recorded?
3. Is there a relationship between month and tsunami height?
4. What is the probability of a tsunami occurring each month?

Question 1

What month had the most tsunami occurrences?

# Keep only the MONTH column and drop rows where it is NA
filtered_month <- tsunami_data %>%
  select(MONTH) %>%
  filter(!is.na(MONTH))

# Count the tsunamis recorded in each month (MONTH is coded 1-12),
# then reshape to one row per month
tsunami_month <- filtered_month %>%
  summarise(
    January = sum(MONTH==1),
    February = sum(MONTH==2),
    March = sum(MONTH==3),
    April = sum(MONTH==4),
    May = sum(MONTH==5),
    June = sum(MONTH==6),
    July = sum(MONTH==7),
    August = sum(MONTH==8),
    September = sum(MONTH==9),
    October = sum(MONTH==10),
    November = sum(MONTH==11),
    December = sum(MONTH==12)) %>%
  pivot_longer(everything(),
               names_to="Month", 
               values_to="Count")
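
The twelve `sum()` calls above can also be written as a single tabulation. The following is a minimal sketch in base R, not the method used in this analysis; the `months` vector is hypothetical toy data standing in for `tsunami_data$MONTH`, and `tsunami_month_alt` is an illustrative name.

```r
# Hypothetical month codes (1-12) standing in for tsunami_data$MONTH;
# NA marks a record with no month.
months <- c(1, 8, 8, 11, NA, 3, 8, 11, 9, 12)

# Dropping NA and tabulating against fixed levels keeps months with zero
# records and stores the counts in calendar order.
month_counts <- table(factor(months[!is.na(months)],
                             levels = 1:12, labels = month.name))

# Same long layout as tsunami_month: one row per month.
tsunami_month_alt <- data.frame(Month = names(month_counts),
                                Count = as.integer(month_counts))
tsunami_month_alt
```

Because the factor levels are fixed up front, a month with no records still appears with a count of 0 instead of being dropped.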

Question 1

# define the chronological month order; otherwise the months sort alphabetically
month_order <- c("January", "February", "March", "April", "May", "June",
                 "July", "August", "September", "October", "November", "December")

# convert Month to an ordered factor so the rows and plot follow calendar order
new_tsunami_month <- tsunami_month %>%
  mutate(month = factor(Month, levels = month_order, ordered = TRUE)) %>%
  arrange(month)
# output table
new_tsunami_month
## # A tibble: 12 × 3
##    Month     Count month    
##    <chr>     <int> <ord>    
##  1 January     213 January  
##  2 February    187 February 
##  3 March       209 March    
##  4 April       184 April    
##  5 May         203 May      
##  6 June        159 June     
##  7 July        185 July     
##  8 August      239 August   
##  9 September   220 September
## 10 October     212 October  
## 11 November    234 November 
## 12 December    198 December
# bar plot
tsunami_month_plot <- ggplot(new_tsunami_month,
       aes(x = month, y = Count, fill = Month)) +
  geom_col() +  # same bars as geom_bar(stat = "identity")
  labs(title = "Tsunami Count by Month",
       x = "Month",
       y = "Count") +
  theme(legend.position = "none")

Question 1 Graph

Question 1 Conclusion

Looking at the graph, the most tsunamis were recorded in August (239), followed by November (234) and September (220). This suggests that tsunamis in this data set occurred most often from late summer through fall.

Question 2

What were the shortest and tallest tsunamis recorded?

filtered_heights <- tsunami_data %>%
  select(MAXIMUM_HEIGHT) %>%
  filter(!is.na(MAXIMUM_HEIGHT))
  
tsunami_height <- filtered_heights %>%
  summarise(
    min_height = min(MAXIMUM_HEIGHT),
    max_height = max(MAXIMUM_HEIGHT)
  )

tsunami_height
##   min_height max_height
## 1       0.01      524.6

Question 3

Is there a relationship between month and tsunami height?

#filter out NA values
filtered_month_height <- tsunami_data %>%
  select(MONTH, MAXIMUM_HEIGHT) %>%
  filter(!is.na(MONTH), !is.na(MAXIMUM_HEIGHT))

#create linear model: height is the response (y-axis), month the predictor (x-axis)
filtered_month_height_mod <- lm(MAXIMUM_HEIGHT ~ MONTH, data = filtered_month_height)
x <- filtered_month_height$MONTH
y <- filtered_month_height$MAXIMUM_HEIGHT

xax <- list(
  title="Month")

yax <- list(
  title="Tsunami Height")

#create scatter plot
fig <- plot_ly(x=x, y=y, type="scatter", 
               mode="markers", name="data",
               width=800, height=430) %>%
  # add fitted linear regression line
  add_lines(x = x, y = fitted(filtered_month_height_mod), 
            name="fitted") %>%
  layout(title="Month vs. Tsunami Height",xaxis=xax, yaxis=yax)

Question 3 Data

config(fig, displaylogo=FALSE)

Question 3 Conclusion

As shown by the regression line in the Month vs. Tsunami Height plot, there
appears to be no linear relationship between month and tsunami height.
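
The visual impression from the regression line can be backed by numbers from the model summary. The following is a minimal sketch on simulated data, not on the tsunami data set: `toy`, `toy_mod`, `slope_row`, and `r_squared` are illustrative names, and the heights are generated independently of month, so the values it prints say nothing about the real data.

```r
# Simulated stand-in for filtered_month_height: heights are drawn
# independently of month, so no relationship is built in.
set.seed(1)
toy <- data.frame(MONTH = sample(1:12, 200, replace = TRUE),
                  MAXIMUM_HEIGHT = rexp(200, rate = 0.5))

# Height is the response, month the predictor.
toy_mod <- lm(MAXIMUM_HEIGHT ~ MONTH, data = toy)

# Slope estimate with its standard error, t value, and p-value,
# plus the share of variance in height explained by month.
slope_row <- coef(summary(toy_mod))["MONTH", ]
r_squared <- summary(toy_mod)$r.squared

slope_row
r_squared
```

A slope close to zero with a large p-value, together with an R-squared near zero, would support the conclusion of no linear relationship. Because month is cyclic (December is adjacent to January), a straight-line fit is only a rough check.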

Question 4

What is the probability of a tsunami occurring each month?

# count how many tsunamis occurred each month using filtered data from Q1
monthly_counts <- table(filtered_month$MONTH)

# sum of tsunamis from all months
total_tsunamis <- sum(monthly_counts)

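# estimate each month's probability as its share of all recorded tsunamis
# (a proportion between 0 and 1, converted to a percentage only for the labels)
percentage <- monthly_counts / total_tsunamis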

# convert the numbers of the months into names
names(percentage) <- month_order[as.numeric(names(percentage))]

Question 4 Chart

pie(percentage,
    main = "Probability of Tsunamis Each Month",
    labels = paste0(names(percentage), " (",
                    round(percentage * 100, 1), "%)"))
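
The shares behind the pie chart can be reproduced by hand from the monthly counts printed in the Question 1 table. The following is a small sketch of that arithmetic; `counts` and `shares` are illustrative names, and the numbers are copied from that output table rather than recomputed from the CSV.

```r
# Monthly counts copied from the Question 1 output table.
counts <- c(January = 213, February = 187, March = 209, April = 184,
            May = 203, June = 159, July = 185, August = 239,
            September = 220, October = 212, November = 234, December = 198)

# Each month's share of all recorded tsunamis.
shares <- counts / sum(counts)

# Shares as percentages, rounded the same way as the pie chart labels.
round(shares * 100, 1)

# The month with the largest share.
names(which.max(shares))
```

These are empirical shares of the recorded events: each value is the fraction of the tsunamis in the data set that fell in that month.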