This notebook imports the PlaneCrashes.csv dataset and builds data structures for crash years and annual deaths. The dataset can be found on Kaggle: https://www.kaggle.com/datasets/abeperez/historical-plane-crash-data
The purpose of this experiment is to determine whether plane crashes and their associated fatalities have declined over roughly the last 50 years and, if so, whether that decline is statistically significant.
pc <- read.csv("PlaneCrashes.csv")
Year <- pc$Year
annual_deaths <- aggregate(`Total.fatalities` ~ Year, data = pc, sum) #total plane crash fatalities per unique year
head(annual_deaths)
## Year Total.fatalities
## 1 1918 29
## 2 1919 50
## 3 1920 77
## 4 1921 78
## 5 1922 105
## 6 1923 65
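To make the `aggregate` step concrete, here is a minimal sketch on a made-up data frame. The `toy` data and its values are hypothetical and are not taken from PlaneCrashes.csv. It also shows one behavior worth knowing: the formula interface drops rows with missing values before summing.

```r
# Hypothetical toy data, NOT from PlaneCrashes.csv
toy <- data.frame(
  Year = c(2000, 2000, 2001, 2001),
  Total.fatalities = c(3, 5, 10, NA)
)

# Sum fatalities within each year. The formula interface of aggregate()
# uses na.action = na.omit by default, so the row with NA is dropped
# before the sum is computed.
toy_deaths <- aggregate(Total.fatalities ~ Year, data = toy, sum)
print(toy_deaths)
```

If crashes with unknown fatality counts should be flagged rather than silently skipped, checking `sum(is.na(pc$Total.fatalities))` before aggregating is one way to see how many rows are affected.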
This barplot shows the total plane crash fatalities for each year. I chose a barplot for this because “Year” is discrete rather than continuous.
barplot(annual_deaths$Total.fatalities,
names.arg = annual_deaths$Year, main = "Annual Plane Crash Fatalities", xlab = "Year", ylab = "Total Fatalities", las = 2)
# "las = 2" draws the axis labels perpendicular to the axes, so the year labels are vertical and more of them fit
annual_crashes <- table(Year)
worst_year_crashes <- names(sort(annual_crashes, decreasing = TRUE))[1]
# This finds the modal year: build a frequency table, sort it in descending order, and take the name of the first entry
worst_year_crashes
## [1] "1944"
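As a small illustration of the table-and-sort approach, the sketch below runs it on a made-up vector of years (`toy_years` is hypothetical, not the real data) and compares it with `which.max`, which returns the position of the first maximum without sorting the whole table.

```r
# Hypothetical toy vector of crash years, NOT from PlaneCrashes.csv
toy_years <- c(1944, 1944, 1945, 1946, 1944, 1945)

# Frequency table: one count per distinct year
counts <- table(toy_years)

# Approach used above: sort descending, take the name of the first entry
by_sort <- names(sort(counts, decreasing = TRUE))[1]

# Alternative: which.max() gives the first maximum; its name is the year
by_which_max <- names(which.max(counts))

print(by_sort)
print(by_which_max)
```

Both approaches return the year as a character string, so wrap the result in `as.integer()` if it is needed as a number.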
Linear regression found a downward trend in the annual number of crashes, evidenced by a years coefficient of -0.6691 (roughly 0.67 fewer crashes with each passing year), but with a p-value of 0.1464 the trend is not statistically significant at the 0.05 level.
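years <- as.integer(names(annual_crashes)) # the years actually present in the table (1918 to 2022 here), in the same order as the counts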
model <- lm(annual_crashes ~ years)
summary(model)
##
## Call:
## lm(formula = annual_crashes ~ years)
##
## Residuals:
## Min 1Q Median 3Q Max
## -281.56 -70.00 -15.12 53.85 655.83
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1589.8089 900.6654 1.765 0.0805 .
## years -0.6691 0.4571 -1.464 0.1464
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 142 on 103 degrees of freedom
## Multiple R-squared: 0.02037, Adjusted R-squared: 0.01086
## F-statistic: 2.142 on 1 and 103 DF, p-value: 0.1464
hist(Year, ylim = c(0,1000), main = "Number of Plane Crashes Per Year",
ylab = "Number of Plane Crashes",breaks = 104)
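Rather than reading the slope and p-value off the printed regression summary, they can be pulled out programmatically. The sketch below does this on a made-up data frame; the `toy` data, its column names, and the 0.05 threshold are assumptions for illustration only.

```r
# Hypothetical toy data, NOT from PlaneCrashes.csv
toy <- data.frame(
  year    = 2000:2009,
  crashes = c(12, 15, 11, 14, 10, 13, 9, 12, 8, 11)
)

# Fit crashes as a linear function of year
fit <- lm(crashes ~ year, data = toy)

# coef(summary(fit)) is a matrix with one row per term and the columns
# "Estimate", "Std. Error", "t value" and "Pr(>|t|)"
coefs   <- coef(summary(fit))
slope   <- coefs["year", "Estimate"]
p_value <- coefs["year", "Pr(>|t|)"]

# Assumed significance threshold of 0.05
significant <- p_value < 0.05

cat("slope:", slope, "\n")
cat("p-value:", p_value, "\n")
cat("significant at 0.05:", significant, "\n")
```

The same indexing should work on the model above by using the row name `"years"` instead of `"year"`.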
I chose to perform a Wilcoxon rank-sum test for this because it is non-parametric, making it a better fit than a t-test for data that are not normally distributed. I chose to split the data at 1977 because it was the year of the deadliest aviation accident in history, the Tenerife disaster, where two fully loaded 747s collided on the runway in the Canary Islands after one of them began to take off without clearance.
Unfortunately, a p-value of 0.9871 is far too large to deem the difference in annual fatalities between the pre- and post-Tenerife years statistically significant. Note that the Wilcoxon test compares the location (ranks) of the two groups rather than their means.
pre_77 <- annual_deaths[annual_deaths$Year <= 1977, "Total.fatalities"]
post_77 <- annual_deaths[annual_deaths$Year > 1977, "Total.fatalities"]
wilcox.test(pre_77, post_77)
##
## Wilcoxon rank sum test with continuity correction
##
## data: pre_77 and post_77
## W = 1347, p-value = 0.9871
## alternative hypothesis: true location shift is not equal to 0
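The split-and-test pattern above can be sketched on a small made-up data frame. The `toy` values below are hypothetical and chosen only to make the example runnable; they say nothing about the real dataset.

```r
# Hypothetical toy data, NOT from PlaneCrashes.csv
toy <- data.frame(
  Year = 1970:1985,
  Total.fatalities = c(120, 95, 140, 110, 130, 105, 125, 150,
                       90, 115, 100, 135, 85, 145, 80, 108)
)

# Split at the cutoff year: 1977 and earlier versus after 1977
pre  <- toy[toy$Year <= 1977, "Total.fatalities"]
post <- toy[toy$Year >  1977, "Total.fatalities"]

# Two-sided Wilcoxon rank-sum test: compares the ranks of the two
# groups (a location shift), not their means
res <- wilcox.test(pre, post)

# The result is a list; the test statistic and p-value can be read directly
print(res$statistic)
print(res$p.value)
```

If the question is specifically whether fatalities went down after the cutoff, a one-sided test via the `alternative` argument of `wilcox.test` may be closer to the hypothesis than the two-sided default.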