Airplane Crash Analysis

Nupur Biswas (s3889979) Deep Sanjaykumar Patel (s3908901) Sai Divya Praneetha Buddiga (s3890087)

Last updated: 06 June, 2022

Introduction

The goal of this project is to use Kaggle’s Historical Plane Crash Data (1918-2022) by Abe Ceasar Perez to test whether the phase of a flight (for example, ascent or descent) affects its fatality count. We use the statistical techniques and the R programming language taught during the course.

Plane Crashes in San Francisco

Image: Asiana Flight 214, arriving from Seoul, South Korea, broke apart and burst into flames as it crashed while landing at San Francisco International Airport. The plane’s tail, landing gear and one of its engines were ripped off.
Credit: Jed Jacobsohn/Reuters
Source article by Norimitsu Onishi and Ravi Somaiya, July 6, 2013

Problem Statement

Hypothesis: The number of casualties is independent of flight phase.

Data

Source: Kaggle’s Historical Plane Crash Data (1918-2022) by Abe Ceasar Perez.

Data Cont.

Descriptive Statistics and Visualisation

The key variables are total fatalities (a count per crash, shown on an axis from 0 to 500 in the plots) and flight phase (Flight, Landing, Parking, Takeoff, Taxiing, Unknown).

The figures quoted here are the largest fatality counts recorded for a single crash in each category, as listed in the descriptive statistics tables. The deadliest crash in the Flight phase killed 520 people, and the deadliest landing crash killed 301; landing accidents can stem from problems such as bad weather, collisions between planes, or rough landings that injure people inside the aircraft. The worst accident during parking killed 60 people; such accidents happen after the aircraft has landed and come to rest, for example because of a fuel problem or because an aircraft strays too close to the aerobridge and makes contact. Takeoff and landing are nearly identical, both in their maxima (298 and 301) and in their mean fatalities per crash (5.74 and 5.62).

Taxiing means that a plane moves along the ground under its own power, for example before takeoff. The worst taxiing accident killed 335 people; possible causes of taxiing accidents include a lack of information during taxiing or an engine failure. The worst crash with an unknown flight phase killed 27 people. By region, Asia has the highest single-crash maximum (520 fatalities) and Oceania the lowest among the named regions (97 fatalities).

The data contained many missing and null values. We handled them by first counting the empty entries, then replacing those entries with NA, and finally replacing the NA values with "Unknown" so that they appear as their own category in the visualisations.
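The cleaning steps described above might look like the following minimal sketch. The toy data frame and its values are invented for illustration; only the column names Flight_phase and Total_fatalities come from the analysis code.

```r
# Toy data standing in for the crash data set (values are made up).
toy_df <- data.frame(
  Flight_phase = c("Flight", "", "Takeoff (climb)", " ", "Parking"),
  Total_fatalities = c(3, 0, 12, 1, 0),
  stringsAsFactors = FALSE
)

# Step 1: count the empty (or whitespace-only) entries in each column.
empty_counts <- sapply(
  toy_df,
  function(col) sum(trimws(as.character(col)) == "", na.rm = TRUE)
)
print(empty_counts)

# Step 2: replace empty strings with NA.
phase <- toy_df$Flight_phase
phase[trimws(phase) == ""] <- NA

# Step 3: replace NA with the label "Unknown" so that it shows up
# as its own category in plots and tables.
phase[is.na(phase)] <- "Unknown"
toy_df$Flight_phase <- phase

print(table(toy_df$Flight_phase))
```

Keeping the intermediate NA step makes it possible to count the missing values with is.na() before they are relabelled.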

Plotting the number of fatalities according to the flight’s phase

#plotting the number of fatalities according to the flight's phase
ggplot(crash_analysis_df, aes(x = Flight_phase, y = Total_fatalities))+
  geom_bar(
    aes(fill = Total_fatalities), stat = "identity", color = "white",
    position = position_dodge(0.9)
  )
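With stat = "identity", ggplot2 draws one bar for every row of the data rather than one bar per phase. If the quantity of interest is the total (or the maximum) per phase, one option is to aggregate before plotting. A minimal sketch with made-up values, using base R's aggregate():

```r
# Toy data (values are made up); column names follow the analysis code.
toy_df <- data.frame(
  Flight_phase = c("Flight", "Flight", "Takeoff (climb)", "Parking", "Takeoff (climb)"),
  Total_fatalities = c(3, 10, 12, 0, 5)
)

# Sum of fatalities within each phase.
totals <- aggregate(Total_fatalities ~ Flight_phase, data = toy_df, FUN = sum)

# Largest single-crash fatality count within each phase.
maxima <- aggregate(Total_fatalities ~ Flight_phase, data = toy_df, FUN = max)

print(totals)
print(maxima)
```

Either aggregated data frame can then be passed to ggplot() with geom_col(), which gives exactly one bar per phase.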

Plotting the number of fatalities according to region

#because we don't know in which region these crashes occurred, we recode the level "World" as "Unknown"
crash_analysis_df[crash_analysis_df == "World"] <- 'Unknown'  
#plotting the number of fatalities according to region
ggplot(crash_analysis_df, aes(x = Region, y = Total_fatalities))+
  geom_bar(
    aes(fill = Total_fatalities), stat = "identity", color = "white",
    position = position_dodge(0.9)
  )

Descriptive Statistics Cont.

#Calculating descriptive statistics for flight phase
crash_analysis_df1 %>% group_by(Flight_phase) %>% summarise(Min = min(Total_fatalities,na.rm = TRUE),
                                         Q1 = quantile(Total_fatalities,probs = .25,na.rm = TRUE),
                                         Median = median(Total_fatalities,na.rm = TRUE),
                                         Q3 = quantile(Total_fatalities,probs = .75,na.rm = TRUE),
                                         Max = max(Total_fatalities,na.rm = TRUE),
                                         Mean = mean(Total_fatalities,na.rm = TRUE),
                                         SD = sd(Total_fatalities,na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(Total_fatalities))) -> table1
knitr::kable(table1)
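| Flight_phase                  | Min | Q1 | Median | Q3 | Max | Mean      | SD        | n     | Missing |
|-------------------------------|-----|----|--------|----|-----|-----------|-----------|-------|---------|
| Flight                        | 0   | 0  | 2      | 6  | 520 | 5.8649123 | 16.143541 | 11400 | 0       |
| Landing (descent or approach) | 0   | 0  | 0      | 4  | 301 | 5.6219050 | 17.298880 | 10016 | 0       |
| Parking                       | 0   | 0  | 0      | 1  | 60  | 2.4414414 | 8.343407  | 111   | 0       |
| Takeoff (climb)               | 0   | 0  | 1      | 5  | 298 | 5.7394833 | 17.523279 | 6038  | 0       |
| Taxiing                       | 0   | 0  | 0      | 0  | 335 | 2.2161017 | 22.029572 | 236   | 0       |
| Unknown                       | 0   | 0  | 0      | 0  | 27  | 0.1830986 | 1.383726  | 639   | 0       |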

Descriptive Statistics Cont.

#Calculating descriptive statistics for region
crash_analysis_df1 %>% group_by(Region) %>% summarise(Min = min(Total_fatalities,na.rm = TRUE),
                                                            Q1 = quantile(Total_fatalities,probs = .25,na.rm = TRUE),
                                                            Median = median(Total_fatalities,na.rm = TRUE),
                                                            Q3 = quantile(Total_fatalities,probs = .75,na.rm = TRUE),
                                                            Max = max(Total_fatalities,na.rm = TRUE),
                                                            Mean = mean(Total_fatalities,na.rm = TRUE),
                                                            SD = sd(Total_fatalities,na.rm = TRUE),
                                                            n = n(),
                                                            Missing = sum(is.na(Total_fatalities))) -> table2
knitr::kable(table2)
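| Region          | Min | Q1 | Median | Q3 | Max | Mean      | SD        | n    | Missing |
|-----------------|-----|----|--------|----|-----|-----------|-----------|------|---------|
| Africa          | 0   | 0  | 0      | 5  | 298 | 7.200765  | 21.246157 | 2092 | 0       |
| Antarctica      | 0   | 0  | 0      | 2  | 257 | 5.672414  | 33.673035 | 58   | 0       |
| Asia            | 0   | 0  | 2      | 7  | 520 | 8.869144  | 23.683322 | 5678 | 0       |
| Central America | 0   | 0  | 1      | 5  | 189 | 5.113671  | 13.730333 | 1302 | 0       |
| Europe          | 0   | 0  | 1      | 4  | 346 | 4.748082  | 14.891611 | 6649 | 0       |
| North America   | 0   | 0  | 1      | 4  | 273 | 3.501161  | 10.672970 | 8179 | 0       |
| Oceania         | 0   | 0  | 1      | 4  | 97  | 3.051829  | 5.943702  | 1312 | 0       |
| South America   | 0   | 0  | 2      | 6  | 228 | 6.389257  | 15.522604 | 2569 | 0       |
| Unknown         | 10  | 10 | 10     | 10 | 10  | 10.000000 | NA        | 1    | 0       |
| World           | 0   | 1  | 5      | 9  | 329 | 9.485000  | 23.563358 | 600  | 0       |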

Applying hypothesis

#applying cor.test on flight phase against the fatalities
cor.test(crash_analysis_df1$Flight_phase,crash_analysis_df1$Total_fatalities)
## 
##  Pearson's product-moment correlation
## 
## data:  crash_analysis_df1$Flight_phase and crash_analysis_df1$Total_fatalities
## t = -4.899, df = 28438, p-value = 9.684e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.04064709 -0.01742238
## sample estimates:
##         cor 
## -0.02903866
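The t statistic and p-value in this output follow directly from the sample correlation and the degrees of freedom. The sketch below recomputes them from the two printed values using the standard relation t = r * sqrt(df) / sqrt(1 - r^2); the variable names are illustrative.

```r
# Values copied from the cor.test output above.
r  <- -0.02903866   # sample correlation
df <- 28438         # degrees of freedom (n - 2)

# t statistic for testing H0: true correlation = 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

# Share of variance explained by the linear association.
r_squared <- r^2

print(round(t_stat, 3))
print(p_value)
print(r_squared)
```

Because r^2 is below 0.001, the linear association is very weak; with more than 28,000 observations even a correlation this small produces a small p-value.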

Applying hypothesis Cont.

#applying t.test on flight phase against the fatalities
t.test(crash_analysis_df1$Flight_phase,crash_analysis_df1$Total_fatalities)
## 
##  Welch Two Sample t-test
## 
## data:  crash_analysis_df1$Flight_phase and crash_analysis_df1$Total_fatalities
## t = -34.545, df = 28778, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.634081 -3.243838
## sample estimates:
## mean of x mean of y 
##  2.142440  5.581399
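t.test() and cor.test() need numeric input, so Flight_phase must have been converted to numbers before these calls. The reported mean of x (2.142440) is consistent with the six phases being coded 1 to 6 in alphabetical order; this coding is an inference from the output, not something stated in the analysis. The check below uses the group sizes n from the flight-phase table:

```r
# Group sizes from the flight-phase table, in alphabetical order of phase.
n_by_phase <- c(
  "Flight"                        = 11400,
  "Landing (descent or approach)" = 10016,
  "Parking"                       = 111,
  "Takeoff (climb)"               = 6038,
  "Taxiing"                       = 236,
  "Unknown"                       = 639
)

# Assumed coding: 1 for the first level, 2 for the second, and so on.
codes <- seq_along(n_by_phase)

# Mean of the integer codes across all crashes.
mean_code <- sum(codes * n_by_phase) / sum(n_by_phase)

print(sum(n_by_phase))
print(mean_code)
```

Under that coding, the t-test compares the average category code with the average fatality count, so its result says little about how fatalities differ between phases.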

Applying hypothesis Cont.

#applying cor.test on region against the fatalities
cor.test(crash_analysis_df1$Region,crash_analysis_df1$Total_fatalities)
## 
##  Pearson's product-moment correlation
## 
## data:  crash_analysis_df1$Region and crash_analysis_df1$Total_fatalities
## t = -10.124, df = 28438, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.07149628 -0.04833543
## sample estimates:
##         cor 
## -0.05992392

Applying hypothesis Cont.

#applying t.test on region against the fatalities
t.test(crash_analysis_df1$Region,crash_analysis_df1$Total_fatalities)
## 
##  Welch Two Sample t-test
## 
## data:  crash_analysis_df1$Region and crash_analysis_df1$Total_fatalities
## t = -5.7073, df = 29221, p-value = 1.159e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.7662348 -0.3744825
## sample estimates:
## mean of x mean of y 
##  5.011041  5.581399

Hypothesis Testing

We applied a Pearson correlation test (cor.test) and a Welch two-sample t-test (t.test) to test our hypothesis in this investigation.

We expected the flight phase and region variables to show which level has the greatest effect on fatalities.
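When the grouping variable is categorical, as flight phase is, a test that compares the fatality distribution across groups is one alternative to correlating category codes. The sketch below uses the Kruskal-Wallis rank-sum test from base R on simulated data. It is not the method used in this analysis, and the simulated counts are not taken from the data set.

```r
set.seed(1)

# Simulated fatality counts for three phases (illustrative only).
sim_df <- data.frame(
  Flight_phase = factor(rep(c("Flight", "Landing", "Takeoff"), each = 50)),
  Total_fatalities = c(rpois(50, 6), rpois(50, 5), rpois(50, 2))
)

# H0: the distribution of fatalities is the same in every phase.
kw <- kruskal.test(Total_fatalities ~ Flight_phase, data = sim_df)
print(kw)
```

A rank-based test is chosen here because the fatality counts are heavily skewed (most crashes have few deaths, a few have hundreds), which makes mean-based comparisons sensitive to outliers.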

Hypothesis Testing

#fitting a linear model of Region on Flight_phase (both variables are numerically coded)
model2 <- lm(Region ~ Flight_phase, data = crash_analysis_df1)
model2 %>% summary()
## 
## Call:
## lm(formula = Region ~ Flight_phase, data = crash_analysis_df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0969 -1.8714 -0.0218  0.9782  5.2790 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.172138   0.022518 229.688   <2e-16 ***
## Flight_phase -0.075193   0.009002  -8.353   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.96 on 28438 degrees of freedom
## Multiple R-squared:  0.002448,   Adjusted R-squared:  0.002413 
## F-statistic: 69.78 on 1 and 28438 DF,  p-value: < 2.2e-16

Hypothesis Testing Cont.

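\[r = \frac{\sum_{i}(x_{i} - \bar{x})(y_{i} - \bar{y})}{\sqrt{\sum_{i}(x_{i} - \bar{x})^{2}\,\sum_{i}(y_{i} - \bar{y})^{2}}}\]

\[\begin{aligned} r &= \text{correlation coefficient}\\ x_{i} &= \text{values of the x-variable in a sample}\\ \bar{x} &= \text{mean of the values of the x-variable}\\ y_{i} &= \text{values of the y-variable in a sample}\\ \bar{y} &= \text{mean of the values of the y-variable} \end{aligned}\]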
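The Pearson coefficient reported by cor.test can be computed by hand from the quantities defined above. The following sketch uses a small made-up sample and compares the manual result with R's built-in cor(); the vectors x and y are purely illustrative.

```r
# Small made-up sample.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)

x_bar <- mean(x)
y_bar <- mean(y)

# r = sum((x_i - x_bar) * (y_i - y_bar)) /
#     sqrt(sum((x_i - x_bar)^2) * sum((y_i - y_bar)^2))
r_manual <- sum((x - x_bar) * (y - y_bar)) /
  sqrt(sum((x - x_bar)^2) * sum((y - y_bar)^2))

# Built-in equivalent (Pearson is the default method).
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
```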

Discussion

References