Project2.0

Part A

Introduction

The data set I have chosen records the number of documented shark attacks in Australia, and their severity, dating back to the late 1700s. I chose it because, although more people visit our beaches than ever before, the level of care we give victims of these attacks matters, and our gain in knowledge appears to be reflected in the drastic drop in fatal attacks since the year 2000.

The data were published by the Australian Shark-Incident Database, a collaboration between the Taronga Conservation Society Australia, Flinders University, and the NSW Department of Primary Industries. The data set contains 1209 shark attacks recorded between 1791 and 2023, described by 60 variables. The variables I will focus on are the year, found in Shark_Data.Incident.year, and the injury sustained, found in Victim.injury.

Initial Hypothesis

I hypothesize that the more recent the encounter, the less likely it is to result in an injury or a death.

Reading data

library(readxl)
library(ggplot2)
library(knitr)
Warning: package 'knitr' was built under R version 4.3.3
Shark_Data <- read_excel("C:/Users/liamw/OneDrive/Desktop/ENVX1002/Project 2/Project2_template_2024/data/Shark_Data.xlsx")
Warning: Expecting logical in U1207 / R1207C21: got 'yes'
Warning: Expecting logical in U1208 / R1208C21: got 'yes'
Warning: Expecting logical in U1209 / R1209C21: got 'no'
New names:
• `` -> `...60`

Filtering data

I am filtering the data to include only unprovoked attacks. This keeps the observations comparable, since provoked incidents arise under different circumstances and all data points should be taken from as equal a viewpoint as possible.

#Filter to unprovoked incidents (filter() comes from dplyr)
#library(dplyr)
#Shark_Data <- filter(Shark_Data, `Provoked/unprovoked` == "unprovoked")
#Create data frame
#Shark_Data <- data.frame(Shark_Data)
#Filter out mis-entered injury values
#Shark_Data <- filter(Shark_Data, Shark_Data.Victim.injury %in% c("injured", "fatal", "uninjured"))

#Turn categorical data into factored data.
#injury <- as.factor(Shark_Data$Shark_Data.Victim.injury)
#state <- as.factor(Shark_Data$Shark_Data.State)

#Final dataframe
#Shark_Data <- data.frame(Shark_Data$Shark_Data.Incident.year, state, injury)

#Split data by state
#Shark_DataNSW <- filter(Shark_Data, Shark_Data$state == "NSW")

#Shark_DataQLD <- filter(Shark_Data, Shark_Data$state == "QLD")
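The filtering steps above can be sketched on a small made-up data frame. This is only an illustration in base R: the object `incidents` and its column names (`year`, `state`, `provoked`, `injury`) are hypothetical stand-ins, not the real spreadsheet headers, and the six rows are invented.

```r
# Hypothetical toy data; column names and values are illustrative assumptions.
incidents <- data.frame(
  year     = c(1935, 1988, 2004, 2015, 2019, 2021),
  state    = c("NSW", "QLD", "NSW", "WA", "QLD", "NSW"),
  provoked = c("unprovoked", "provoked", "unprovoked",
               "unprovoked", "unprovoked", "unprovoked"),
  injury   = c("fatal", "injured", "injured", "fatal", "uninjured", "Injured "),
  stringsAsFactors = FALSE
)

valid_injuries <- c("fatal", "injured", "uninjured")

# Keep unprovoked incidents whose injury label is one of the three valid values.
# The mis-entered label "Injured " in the last row is dropped by this step.
clean <- subset(incidents, provoked == "unprovoked" & injury %in% valid_injuries)

# Store the categorical columns as factors.
clean$injury <- factor(clean$injury, levels = valid_injuries)
clean$state  <- factor(clean$state)

# Split into the two states compared later in the report.
clean_nsw <- subset(clean, state == "NSW")
clean_qld <- subset(clean, state == "QLD")

nrow(clean)      # 4 rows survive the filter
nrow(clean_nsw)  # 2
nrow(clean_qld)  # 1
```

Using `%in%` with a vector of valid labels does the same job as chaining several `==` comparisons with `|`, and is easier to extend if another category is added.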

Analysis

Exploratory Data Analysis (EDA)

# Histogram NSW
# Determine the maximum y value needed
#max_y <- 25

#ggplot(Shark_DataNSW, aes(x = Shark_Data.Shark_Data.Incident.year, fill = injury)) +
#  geom_histogram(position = "dodge", bins = 30, width = 1) +
 # scale_fill_manual(values = c("black", "red", "skyblue")) +  # Manually setting colors
  #labs(x = "Year", y = "Count", title = "Histogram of shark attacks in New South Wales") +
  #ylim(0, max_y) +  # Set the y-axis limits to ensure max height
  #xlim(1750, 2050) +
  #theme(
   # plot.title = element_text(hjust = 0.5),  # Center the title
    #plot.background = element_rect(fill = "white"),  # White background
    #panel.background = element_rect(fill = "white"),  # White panel background
    #panel.grid.major = element_line(color = "gray"),  # Major grid lines
    #panel.grid.minor = element_blank(),  # Remove minor grid lines
    #axis.line = element_line(color = "black"),  # Axis lines
    #axis.text = element_text(color = "black"),  # Axis text color
    #axis.title = element_text(color = "black"),  # Axis title color
    #legend.position = "right",  # Legend position
    #legend.background = element_rect(fill = "white"),  # White legend background
    #legend.title = element_text(color = "black"),  # Legend title color
    #legend.text = element_text(color = "black")  # Legend text color
  #)

# Histogram QLD
#ggplot(Shark_DataQLD, aes(x = Shark_Data.Shark_Data.Incident.year, fill = injury)) +
 # geom_histogram(position = "dodge", bins = 30, width = 1) +
  #scale_fill_manual(values = c("black", "red", "skyblue")) +  # Manually setting colors
  #labs(x = "Year", y = "Count", title = "Histogram of shark attacks in Queensland") +
  #ylim(0, max_y) +  # Set the y-axis limits to ensure max height
  #xlim(1750, 2050) +
  #theme(
   # plot.title = element_text(hjust = 0.5),  # Center the title
    #plot.background = element_rect(fill = "white"),  # White background
    #panel.background = element_rect(fill = "white"),  # White panel background
    #panel.grid.major = element_line(color = "gray"),  # Major grid lines
    #panel.grid.minor = element_blank(),  # Remove minor grid lines
    #axis.line = element_line(color = "black"),  # Axis lines
    #axis.text = element_text(color = "black"),  # Axis text color
    #axis.title = element_text(color = "black"),  # Axis title color
    #legend.position = "right",  # Legend position
    #legend.background = element_rect(fill = "white"),  # White legend background
    #legend.title = element_text(color = "black"),  # Legend title color
    #legend.text = element_text(color = "black")  # Legend text color
  #)
#knitr::kable(Shark_Data)


Boxplot

#ggplot(data = Shark_Data, aes(x = Shark_Data.Shark_Data.Incident.year, y = injury)) +
 # geom_boxplot() +
  #labs(x = "Year", y = "Injury") +
  #ggtitle("Box plot of injury by year")

Summary Statistics

# Summary statistics
#Shark_Data %>%
 #   group_by(injury) %>%
  #  summarise(
   #     mean = mean(Shark_Data.Shark_Data.Incident.year),
    #    median = median(Shark_Data.Shark_Data.Incident.year),
     #   sd = sd(Shark_Data.Shark_Data.Incident.year),
      #  IQR = IQR(Shark_Data.Shark_Data.Incident.year),
       # n = n(),
        #skewness = skewness(Shark_Data.Shark_Data.Incident.year)
    #) %>%
    #kable()
#summary(Shark_Data)

As the histograms above (hist-injuryNSW and hist-injuryQLD) show, NSW has recorded a higher total number of shark attacks. This is likely due in part to the population difference between the two states, with New South Wales having a population of over 8 million compared to Queensland's 5 million.

Furthermore, the trend in injury outcomes over the past 200 years suggests an increase in beach-goer safety: in the box plot (boxplot-injury), the uninjured category has the smallest IQR of the three conditions (1975-2005), so uninjured outcomes are concentrated in recent decades. The drop in fatal outcomes relative to injuries may be attributable to an increase in the quality of healthcare.
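As a minimal sketch of how the IQR comparison between injury categories can be computed, the example below uses invented incident years. The data frame `toy` and its values are assumptions made for illustration and are not taken from the shark data.

```r
# Hypothetical incident years for each outcome, for illustration only.
toy <- data.frame(
  year   = c(1900, 1930, 1950, 1970,   # fatal
             1960, 1980, 2000, 2010,   # injured
             1980, 1990, 2000, 2010),  # uninjured
  injury = rep(c("fatal", "injured", "uninjured"), each = 4)
)

# Interquartile range of incident year within each outcome category.
iqr_by_injury <- tapply(toy$year, toy$injury, IQR)
iqr_by_injury

# The category whose years are most tightly clustered.
narrowest <- names(which.min(iqr_by_injury))
narrowest
```

A small IQR only says that the middle half of the years for that outcome fall close together. It does not by itself say where on the timeline they fall, so it is worth reading alongside the quartiles or the box plot.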

Hypothesis Testing (HATPC)

Hypothesis

Our statistical hypothesis is:

\(H_0\) : The difference between shark incidents in NSW and QLD is equal to zero.

\(H_1\) : The difference between shark incidents in NSW and QLD is not equal to zero.

Assumptions

The assumption of normality will be checked visually with a Q-Q plot and formally with the Shapiro-Wilk test. Comparing the mean and the median alone is not a reliable check here, because with a data set this large two close values do not establish that the data are normal.

Transforming data

#Bind <- rbind(Shark_DataNSW,Shark_DataQLD)
#Shark_Data_Diff <- tibble(Bind)
#Shark_Data_Diff
#logNSW <- log10(Shark_DataNSW$Shark_Data.Shark_Data.Incident.year)
#logQLD <- log10(Shark_DataQLD$Shark_Data.Shark_Data.Incident.year)

Shapiro Test

#Shap <- log10(Shark_Data_Diff$Shark_Data.Shark_Data.Incident.year)
#shapiro.test(Shap)

The Shapiro-Wilk normality test suggests that the data are not normally distributed, as the p-value is below 0.05.
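To show how the Shapiro-Wilk p-value is read, here is a small sketch on simulated data rather than the shark data. Both samples and the names `normal_sample` and `skewed_sample` are assumptions for illustration; the skewed sample is built to pile up near recent years, loosely like incident counts that rise over time.

```r
set.seed(42)

# Simulated "years": one roughly bell-shaped, one left-skewed towards 2023.
normal_sample <- rnorm(200, mean = 1980, sd = 25)
skewed_sample <- 2023 - rexp(200, rate = 1 / 30)

# shapiro.test() accepts between 3 and 5000 observations.
p_normal <- shapiro.test(normal_sample)$p.value
p_skewed <- shapiro.test(skewed_sample)$p.value

# A p-value below 0.05 is evidence against normality.
p_normal
p_skewed
p_skewed < 0.05
```

With large samples the test can flag even mild departures from normality, which is one reason to read it together with a Q-Q plot.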

QQPlot

# QQ plot
#ggplot(Shark_Data_Diff, aes(sample = Shark_Data_Diff$Shark_Data.Shark_Data.Incident.year)) +
 #   stat_qq() +
  #  stat_qq_line() +
   # theme_minimal()

The Q-Q plot shows that the data follow the normal reference line up until about 2000, after which the points trend away from it, indicating a departure from normality.

Mean/median comparison

#Shark_Data_Diff %>%
 # summarise(
  #  mean = mean(Shark_Data_Diff$Shark_Data.Shark_Data.Incident.year),
   # median = median(Shark_Data_Diff$Shark_Data.Shark_Data.Incident.year)
  #) %>%
  #kable()

Although two relatively close values would ordinarily lead us to assume our data are normal, we cannot assume this given the size of our data set.
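A quick simulated example of why a close mean and median are not enough: a flat (uniform) spread of years is symmetric, so its mean and median nearly agree, yet it is clearly not normal. The sample `flat_sample` is invented for illustration; only its range, 1791 to 2023, is borrowed from the data description.

```r
set.seed(1)

# Simulated years spread evenly across the period covered by the data set.
flat_sample <- runif(1000, min = 1791, max = 2023)

# Mean and median are close relative to the 232-year range...
gap <- abs(mean(flat_sample) - median(flat_sample))
gap

# ...but the Shapiro-Wilk test still rejects normality.
p_flat <- shapiro.test(flat_sample)$p.value
p_flat < 0.05
```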

T-test

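# Welch two-sample t-test on incident year, NSW vs QLD (the states have different numbers of incidents, so the samples cannot be paired)
#fit <- t.test(Shark_DataNSW$Shark_Data.Shark_Data.Incident.year, Shark_DataQLD$Shark_Data.Shark_Data.Incident.year)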

#fit

# Extract results
#t_value <- fit$statistic |>
 # round(3) # round to 3 decimal places
#p_value <- fit$p.value |>
 # round(3) # round to 3 decimal places
#df <- fit$parameter |>  # extract degrees of freedom
 # round(0) # round to 0 decimal places
#conf_int <- fit$conf.int |>  # extract confidence interval
 # round(0)   # round to 0 decimal places

The results are t = {r} t_value, df = {r} df and p = {r} p_value.
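For reference, a minimal sketch of extracting these quantities from a `t.test()` result, run on simulated data. The two samples `years_a` and `years_b` are invented and merely stand in for two groups of incident years. The sketch uses R's default Welch two-sample test, which is an assumption about the intended comparison rather than a statement of the method used in this report.

```r
set.seed(7)

# Two simulated groups of different sizes (hypothetical values).
years_a <- rnorm(120, mean = 1975, sd = 30)
years_b <- rnorm(80,  mean = 1990, sd = 30)

# Welch two-sample t-test (var.equal = FALSE is the default).
fit <- t.test(years_a, years_b)

# Pull out the pieces reported in the text.
t_value  <- round(unname(fit$statistic), 3)  # t statistic
df       <- round(unname(fit$parameter), 0)  # Welch degrees of freedom
p_value  <- signif(fit$p.value, 3)           # p-value
conf_int <- round(fit$conf.int, 1)           # 95% CI for the difference in means

# Rejecting at the 5% level and a 95% CI that excludes zero are the same decision.
reject_h0     <- fit$p.value < 0.05
excludes_zero <- fit$conf.int[1] > 0 || fit$conf.int[2] < 0
reject_h0 == excludes_zero
```

Rounding is applied only to the reported values; the decision itself is made on the unrounded p-value and interval.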

Conclusion

Our p-value ({r} p_value) is less than 0.05, so we reject the null hypothesis. Furthermore, the 95% confidence interval ({r} conf_int) does not contain zero. Our analysis therefore suggests that the difference in shark incidents between NSW and QLD is significantly different from zero.

This means we reject \(H_0\) in favour of the alternative hypothesis \(H_1\) : The difference between NSW and QLD is not equal to zero. This is in line with the results of the histograms above.

Part B

In his article “Changing patterns of shark attacks in Australian waters”, John West analyses shark-related incidents in Australia from 1900 to 2010. Focusing heavily on the 1990-2010 period, West states: “The majority of attacks happened in New South Wales”, with NSW accounting for 39% of all Australian shark attacks compared to Queensland's 23%.
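As a rough check of the population explanation offered in Part A, the shares quoted from West can be set against the approximate state populations given earlier. This uses only figures already stated in this report; the populations are rounded and are not matched to West's study period, so the comparison is indicative only.

```r
# Shares of Australian shark attacks quoted from West (2011).
share_nsw <- 0.39
share_qld <- 0.23

# Approximate populations as stated in Part A (rounded).
pop_nsw <- 8e6
pop_qld <- 5e6

share_ratio <- share_nsw / share_qld  # about 1.70
pop_ratio   <- pop_nsw / pop_qld      # 1.6

round(share_ratio, 2)
pop_ratio
```

The two ratios are of similar size, which is consistent with population accounting for much of the gap between the states, although it does not rule out other factors.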

Part C

West, J. (2011). Changing patterns of shark attacks in Australian waters. Marine and Freshwater Research, 62(6), 744-754.

ChatGPT 3.5 (2024) - Artificial Intelligence Software