Nupur Biswas (s3889979) Deep Sanjaykumar Patel (s3908901) Sai Divya Praneetha Buddiga (s3890087)
Last updated: 06 June, 2022
Image source: Asiana Flight 214, arriving from Seoul, South Korea, broke apart and burst into flames as it crashed while landing at San Francisco International Airport. The plane's tail, landing gear and one of its engines were ripped off.
Credit: Jed Jacobsohn/Reuters
By Norimitsu Onishi and Ravi Somaiya
July 6, 2013
Hypothesis: Casualties are independent of flight phase.
We use cor.test() and t.test() for the hypothesis tests in this investigation.
Cor test: cor.test() tests for association/correlation between paired samples. It returns both the correlation coefficient and the significance level (p-value) of the correlation. Its method argument is a string indicating which correlation coefficient to calculate and test.
t-test: We use it to determine whether the means of two groups are equal to each other. The classic form of the test assumes that both groups are sampled from normal distributions with equal variances; the Welch variant reported in the output below does not require equal variances.
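As a rough illustration of the two functions described above, the following sketch runs cor.test() and t.test() on made-up data. The vectors x, y, group_a and group_b are hypothetical and are not taken from the crash dataset.

```r
# Toy data (made up for illustration, not from the crash dataset)
set.seed(42)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)

# cor.test(): Pearson correlation by default; `method` selects the coefficient
ct <- cor.test(x, y, method = "pearson")
ct$estimate   # correlation coefficient
ct$p.value    # significance level (p-value) of the correlation

# t.test(): compares the means of two groups
group_a <- rnorm(30, mean = 0)
group_b <- rnorm(30, mean = 1)
tt_welch  <- t.test(group_a, group_b)                    # default: Welch test, unequal variances allowed
tt_pooled <- t.test(group_a, group_b, var.equal = TRUE)  # classic test, equal variances assumed
tt_welch$p.value
tt_pooled$p.value
```

Setting var.equal = TRUE is what matches the equal-variance assumption stated above; without it, R falls back to the Welch test.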
The important variables are total fatalities (plotted on an axis running from 0 to 500) and flight phase (Flight, Landing, Parking, Takeoff, Taxiing, Unknown).
The data run from 1918 to 2019. According to the summary table below, the deadliest single crash in the in-flight phase killed 520 people, followed by taxiing (335), landing (301) and takeoff (298), so takeoff and landing have nearly identical peaks. Landing accidents can result from problems such as bad weather, collisions between planes, or rough landings that cause injuries inside the aircraft. Parking accidents, which happen once the aircraft has landed and is being parked, peak at 60 deaths; they can be caused by fuel problems or by a plane straying too close to the aerobridge and making contact. Taxiing means that a plane moves along the ground under its own power, for example before takeoff; the 335 deaths in this phase may stem from a lack of information during taxiing or from an engine failure. Crashes with an unknown phase peak at 27 deaths. By region, Asia has the highest single-crash toll (520) and Oceania the lowest among the named regions (97).
We had a lot of missing data and null values. We handled them by first counting the empty entries, then replacing those entries with NA, and finally replacing the NA values with "Unknown" so that they are easier to visualize.
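The three cleaning steps described above can be sketched as follows. This is a minimal base R example on a made-up data frame (toy and its values are hypothetical); the actual cleaning code for the crash data is not shown in this report.

```r
# Toy data frame (hypothetical values); empty strings stand in for missing entries
toy <- data.frame(
  Flight_phase = c("Flight", "", "Taxiing", "", "Parking"),
  Region       = c("Asia", "Europe", "", "Oceania", "Asia"),
  stringsAsFactors = FALSE
)

# Step 1: count the empty strings in each column
empty_counts <- colSums(toy == "")
empty_counts

# Step 2: replace the empty strings with NA
toy[toy == ""] <- NA

# Step 3: replace NA with "Unknown" so the category shows up in plots
toy[is.na(toy)] <- "Unknown"
toy
```

Keeping the columns as character vectors (stringsAsFactors = FALSE) matters here, because assigning a new label to a factor column would require adding the level first.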
# Plotting the number of fatalities according to the flight's phase
library(ggplot2)
ggplot(crash_analysis_df, aes(x = Flight_phase, y = Total_fatalities)) +
  geom_bar(
    aes(fill = Total_fatalities), stat = "identity", color = "white",
    position = position_dodge(0.9)
  )
# Because we don't know in which region these crashes occurred, we recode the region "World" as "Unknown"
crash_analysis_df[crash_analysis_df == "World"] <- 'Unknown'
# Plotting the number of fatalities according to region
ggplot(crash_analysis_df, aes(x = Region, y = Total_fatalities)) +
  geom_bar(
    aes(fill = Total_fatalities), stat = "identity", color = "white",
    position = position_dodge(0.9)
  )
We use the knitr::kable function to print nice HTML tables.
# Calculating descriptive statistics for flight phase
library(dplyr)
table1 <- crash_analysis_df1 %>%
  group_by(Flight_phase) %>%
  summarise(
    Min = min(Total_fatalities, na.rm = TRUE),
    Q1 = quantile(Total_fatalities, probs = .25, na.rm = TRUE),
    Median = median(Total_fatalities, na.rm = TRUE),
    Q3 = quantile(Total_fatalities, probs = .75, na.rm = TRUE),
    Max = max(Total_fatalities, na.rm = TRUE),
    Mean = mean(Total_fatalities, na.rm = TRUE),
    SD = sd(Total_fatalities, na.rm = TRUE),
    n = n(),
    Missing = sum(is.na(Total_fatalities))
  )
knitr::kable(table1)

| Flight_phase | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| Flight | 0 | 0 | 2 | 6 | 520 | 5.8649123 | 16.143541 | 11400 | 0 |
| Landing (descent or approach) | 0 | 0 | 0 | 4 | 301 | 5.6219050 | 17.298880 | 10016 | 0 |
| Parking | 0 | 0 | 0 | 1 | 60 | 2.4414414 | 8.343407 | 111 | 0 |
| Takeoff (climb) | 0 | 0 | 1 | 5 | 298 | 5.7394833 | 17.523279 | 6038 | 0 |
| Taxiing | 0 | 0 | 0 | 0 | 335 | 2.2161017 | 22.029572 | 236 | 0 |
| Unknown | 0 | 0 | 0 | 0 | 27 | 0.1830986 | 1.383726 | 639 | 0 |
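
The table above summarises fatalities per crash. If the quantity of interest is the total number of deaths in each phase, the per-crash values have to be summed first. A minimal sketch with dplyr on a made-up data frame follows; toy and its values are hypothetical, and only the column names mirror those used above.

```r
library(dplyr)

# Toy data (hypothetical values): one row per crash
toy <- data.frame(
  Flight_phase     = c("Flight", "Flight", "Taxiing", "Parking", "Parking"),
  Total_fatalities = c(10, 3, 0, 1, 2)
)

phase_totals <- toy %>%
  group_by(Flight_phase) %>%
  summarise(
    Total = sum(Total_fatalities, na.rm = TRUE),  # all deaths recorded in the phase
    Worst = max(Total_fatalities, na.rm = TRUE),  # deadliest single crash in the phase
    .groups = "drop"
  )
phase_totals
```

Passing phase_totals to ggplot() with geom_col(aes(x = Flight_phase, y = Total)) would then draw one bar per phase whose height is the summed death toll.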
#Calculating descriptive statistics for region
table2 <- crash_analysis_df1 %>%
  group_by(Region) %>%
  summarise(
    Min = min(Total_fatalities, na.rm = TRUE),
    Q1 = quantile(Total_fatalities, probs = .25, na.rm = TRUE),
    Median = median(Total_fatalities, na.rm = TRUE),
    Q3 = quantile(Total_fatalities, probs = .75, na.rm = TRUE),
    Max = max(Total_fatalities, na.rm = TRUE),
    Mean = mean(Total_fatalities, na.rm = TRUE),
    SD = sd(Total_fatalities, na.rm = TRUE),
    n = n(),
    Missing = sum(is.na(Total_fatalities))
  )
knitr::kable(table2)

| Region | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| Africa | 0 | 0 | 0 | 5 | 298 | 7.200765 | 21.246157 | 2092 | 0 |
| Antarctica | 0 | 0 | 0 | 2 | 257 | 5.672414 | 33.673035 | 58 | 0 |
| Asia | 0 | 0 | 2 | 7 | 520 | 8.869144 | 23.683322 | 5678 | 0 |
| Central America | 0 | 0 | 1 | 5 | 189 | 5.113671 | 13.730333 | 1302 | 0 |
| Europe | 0 | 0 | 1 | 4 | 346 | 4.748082 | 14.891611 | 6649 | 0 |
| North America | 0 | 0 | 1 | 4 | 273 | 3.501161 | 10.672970 | 8179 | 0 |
| Oceania | 0 | 0 | 1 | 4 | 97 | 3.051829 | 5.943702 | 1312 | 0 |
| South America | 0 | 0 | 2 | 6 | 228 | 6.389257 | 15.522604 | 2569 | 0 |
| Unknown | 10 | 10 | 10 | 10 | 10 | 10.000000 | NA | 1 | 0 |
| World | 0 | 1 | 5 | 9 | 329 | 9.485000 | 23.563358 | 600 | 0 |
#applying cor.test on flight phase against the fatalities
cor.test(crash_analysis_df1$Flight_phase, crash_analysis_df1$Total_fatalities)
##
## Pearson's product-moment correlation
##
## data: crash_analysis_df1$Flight_phase and crash_analysis_df1$Total_fatalities
## t = -4.899, df = 28438, p-value = 9.684e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.04064709 -0.01742238
## sample estimates:
## cor
## -0.02903866
#applying t.test on flight phase against the fatalities
t.test(crash_analysis_df1$Flight_phase, crash_analysis_df1$Total_fatalities)
##
## Welch Two Sample t-test
##
## data: crash_analysis_df1$Flight_phase and crash_analysis_df1$Total_fatalities
## t = -34.545, df = 28778, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.634081 -3.243838
## sample estimates:
## mean of x mean of y
## 2.142440 5.581399
#applying cor.test on region against the fatalities
cor.test(crash_analysis_df1$Region, crash_analysis_df1$Total_fatalities)
##
## Pearson's product-moment correlation
##
## data: crash_analysis_df1$Region and crash_analysis_df1$Total_fatalities
## t = -10.124, df = 28438, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07149628 -0.04833543
## sample estimates:
## cor
## -0.05992392
#applying t.test on region against the fatalities
t.test(crash_analysis_df1$Region, crash_analysis_df1$Total_fatalities)
##
## Welch Two Sample t-test
##
## data: crash_analysis_df1$Region and crash_analysis_df1$Total_fatalities
## t = -5.7073, df = 29221, p-value = 1.159e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.7662348 -0.3744825
## sample estimates:
## mean of x mean of y
## 5.011041 5.581399
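We applied the correlation test and the t-test for the hypothesis testing in this investigation. All four tests above return p-values below 0.05, although the correlations themselves are very weak (about -0.03 for flight phase and -0.06 for region). Because Flight_phase and Region enter these tests as numeric codes, the results depend on how the levels are numbered and should be read with caution.
We expect the flight phase and region variables to show which level has the greatest effect on fatalities.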
##
## Call:
## lm(formula = Region ~ Flight_phase, data = crash_analysis_df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0969 -1.8714 -0.0218 0.9782 5.2790
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.172138 0.022518 229.688 <2e-16 ***
## Flight_phase -0.075193 0.009002 -8.353 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.96 on 28438 degrees of freedom
## Multiple R-squared: 0.002448, Adjusted R-squared: 0.002413
## F-statistic: 69.78 on 1 and 28438 DF, p-value: < 2.2e-16
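Because Flight_phase and Region are categorical, a test designed for grouped data may be easier to interpret than a correlation or regression on numeric level codes. The sketch below uses the Kruskal-Wallis test and a one-way ANOVA, which are different techniques from the cor.test(), t.test() and lm() calls above and are swapped in here purely for illustration. The data frame toy is made up and does not reproduce any result from the crash dataset.

```r
# Made-up data: fatalities per crash for three hypothetical phases
set.seed(1)
toy <- data.frame(
  Flight_phase     = factor(rep(c("Flight", "Landing", "Taxiing"), each = 40)),
  Total_fatalities = c(rpois(40, 6), rpois(40, 5), rpois(40, 2))
)

# Kruskal-Wallis rank-sum test: do the fatality distributions differ across phases?
kw <- kruskal.test(Total_fatalities ~ Flight_phase, data = toy)
kw$p.value

# One-way ANOVA is the parametric counterpart (assumes roughly normal residuals)
aov_fit <- aov(Total_fatalities ~ Flight_phase, data = toy)
summary(aov_fit)
```

A small p-value from either test would suggest that fatalities are not independent of flight phase, without relying on an arbitrary ordering of the phases.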
\[
\begin{aligned}
r &= \text{correlation coefficient} \\
x_{i} &= \text{values of the } x\text{-variable in a sample} \\
\bar{x} &= \text{mean of the values of the } x\text{-variable} \\
y_{i} &= \text{values of the } y\text{-variable in a sample} \\
\bar{y} &= \text{mean of the values of the } y\text{-variable}
\end{aligned}
\]
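In terms of these symbols, the standard sample formula for the Pearson correlation coefficient, which is the coefficient cor.test() reports by default, is:

\[
r = \frac{\sum_{i} (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sqrt{\sum_{i} (x_{i} - \bar{x})^{2} \, \sum_{i} (y_{i} - \bar{y})^{2}}}
\]

The test statistic follows from r and the number of pairs n (n is a symbol introduced here and is not defined in the list above):

\[
t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^{2}}}, \qquad df = n - 2
\]

As a check against the flight-phase output, r = -0.02903866 with n - 2 = 28438 gives t of approximately -4.899, which agrees with the reported value.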