Setup

Importing the PlaneCrashes.csv dataset and building data structures for years and annual deaths. The dataset is available on Kaggle: https://www.kaggle.com/datasets/abeperez/historical-plane-crash-data

The purpose of this experiment is to determine whether plane crashes and their associated fatalities have declined over roughly the last 50 years, and whether any such decline is statistically significant.

pc <- read.csv("PlaneCrashes.csv")
Year <- pc$Year
annual_deaths <- aggregate(`Total.fatalities` ~ Year, data = pc, sum) # total plane crash fatalities per year
head(annual_deaths)
##   Year Total.fatalities
## 1 1918               29
## 2 1919               50
## 3 1920               77
## 4 1921               78
## 5 1922              105
## 6 1923               65

1) Bar Plot of Annual Fatalities Between 1918 and 2022

This bar plot shows the total plane crash fatalities in each year. I chose a bar plot because “Year” is discrete rather than continuous.

barplot(annual_deaths$Total.fatalities,
        names.arg = annual_deaths$Year,
        main = "Annual Plane Crash Fatalities",
        xlab = "Year",
        ylab = "Total Fatalities",
        las = 2) # las = 2 rotates the axis labels so more year labels fit

2) Mode of “Year” in the Dataset

The mode of “Year” is the year in which the most plane crashes occurred.

annual_crashes <- table(Year) # frequency table: number of crashes per year
worst_year_crashes <- names(sort(annual_crashes, decreasing = TRUE))[1]
# The mode: sort the frequency table in decreasing order and take the
# year (the name) attached to the largest count
worst_year_crashes
## [1] "1944"

3) Using Linear Regression to Test Whether the Decline in Annual Crash Frequency Is Statistically Significant

Linear regression finds a downward trend: the coefficient on years is -0.6691, i.e., roughly 0.67 fewer crashes per year on average. However, with a p-value of 0.1464, the trend is not statistically significant at the 0.05 level.
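
Note that regressing annual_crashes on years <- 1918:2022 implicitly assumes every year in that range appears in the table at least once; otherwise the two vectors would differ in length. A quick check (a sketch, not part of the original output) confirms the alignment:

# Sketch: the table's names should be exactly the years 1918 through 2022
stopifnot(identical(names(annual_crashes), as.character(1918:2022)))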

years <- 1918:2022
model <- lm(annual_crashes ~ years) # simple linear trend of crash counts over time
summary(model)
## 
## Call:
## lm(formula = annual_crashes ~ years)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -281.56  -70.00  -15.12   53.85  655.83 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 1589.8089   900.6654   1.765   0.0805 .
## years         -0.6691     0.4571  -1.464   0.1464  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 142 on 103 degrees of freedom
## Multiple R-squared:  0.02037,    Adjusted R-squared:  0.01086 
## F-statistic: 2.142 on 1 and 103 DF,  p-value: 0.1464
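
To see the fitted trend against the data (a sketch built from the objects above):

# Sketch: scatter of annual crash counts with the fitted regression line
plot(years, as.numeric(annual_crashes),
     main = "Annual Plane Crashes with Linear Trend",
     xlab = "Year", ylab = "Number of Crashes")
abline(model, col = "red") # slope of -0.6691 crashes per year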

4) Histogram of Annual Plane Crashes

With breaks = 104, the histogram requests roughly one bin per year of the 1918-2022 range, so the bar heights show how many crashes were recorded in each year.

hist(Year, ylim = c(0, 1000), breaks = 104,
     main = "Number of Plane Crashes Per Year",
     ylab = "Number of Plane Crashes")

5) Performing a Wilcoxon Rank Sum Test on Split Data

I chose the Wilcoxon test because it is non-parametric, making it better suited than a t-test to a non-normal distribution. I split the data at 1977 because it was the year of the deadliest aviation accident in history, the Tenerife disaster, in which two fully loaded 747s collided on the runway in the Canary Islands after one of them began its takeoff without clearance.

Unfortunately, a p-value of 0.9871 is far too large to conclude that annual fatalities shifted between the pre- and post-Tenerife periods. (The Wilcoxon rank sum test compares the locations of the two distributions rather than their means.)

pre_77 <- annual_deaths[annual_deaths$Year <= 1977, "Total.fatalities"]  # 1918-1977
post_77 <- annual_deaths[annual_deaths$Year > 1977, "Total.fatalities"]  # 1978-2022
wilcox.test(pre_77, post_77) # two-sample Wilcoxon rank sum test
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  pre_77 and post_77
## W = 1347, p-value = 0.9871
## alternative hypothesis: true location shift is not equal to 0
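
To visualize the comparison the test is making, one can look at the two periods side by side (a sketch using the split vectors above):

# Sketch: medians and a side-by-side boxplot of annual fatalities per period
median(pre_77); median(post_77)
boxplot(pre_77, post_77,
        names = c("1918-1977", "1978-2022"),
        main = "Annual Fatalities Before and After Tenerife",
        ylab = "Total Fatalities")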