A statistical study to find out the most incident prone and most incident free Airlines within the two time intervals : 1985 - 1999 and 2000 - 2014. For this analyses, I will perform Chi-Square test, F-test and T-test.
For this project, I have used to following link to obtain data :
x <- getURL("https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv")
Airline <- read.csv(text = x)
nrow(Airline)
## [1] 56
Each case is the summary for each airline’s incidents. Total 56 cases.
This is an observational study to find out the most incident pront and most incident free Airlines for the given two time intervals.
For my data analyses, I would like to consider the total number of incidents occurred for each Airlines for the two time periods.
To do this, I have created a subset of the original data by considering Incidents only.
# Max Incidents
AirIncident <- data.frame(Airline$airline , Airline$avail_seat_km_per_week, Airline$incidents_85_99, Airline$incidents_00_14)
datatable(AirIncident)
It is unfair to compare Aer Lingus with Aeroflot since ASK (KM X No.of Seats/Week) for Aeroflot is much greater than Aer Lingus. Hence , lets add a column which can quantify the incidents based on ASK. This column will let us know that after how much KM/Seat, there is an incident reported for an Airlines.
P.S : Available seat kilometers (ASKs) is defined as the number of seats multiplied by the number of kilometers the airline flies.
AirIncident <- mutate(AirIncident, ASK_per_Inc85_99 = Airline.avail_seat_km_per_week/Airline.incidents_85_99 )
AirIncident <- mutate(AirIncident, ASK_per_Inc85_99 = round(ASK_per_Inc85_99, 0) )
AirIncident <- mutate(AirIncident, ASK_per_Inc00_14 = Airline.avail_seat_km_per_week/Airline.incidents_00_14 )
AirIncident <- mutate(AirIncident, ASK_per_Inc00_14 = round(ASK_per_Inc00_14, 0) )
colnames(AirIncident) <- c("Airline","ASK", "Incidents85_99","Incidents00_14", "ASK_per_Inc85", "ASK_per_Inc00")
datatable(AirIncident)
As seen above the two new columns added are ASK_per_Inc<interval> which signifies that a particular Airlines is susceptible to an incident every ASK_per_Inc
In order to do further analysis , we need to divide our data into two parts. The first part will consist of Airline data where incident happened and the second set of data will consist of Airlines with zero incidents. From the above it is quite apparent that the second set of data would be very small.
### Analyses for time period 1985 - 1999
## Airlines with atleast one incident
Incident85_99 <- sqldf('select Airline, Incidents85_99, ASK_per_Inc85 from AirIncident where Incidents85_99 > 0 group by ASK_per_Inc85')
nrow(Incident85_99)
## [1] 53
## Airlines with no incident
NoIncident85_99 <- sqldf('select Airline, ASK, Incidents85_99 from AirIncident where Incidents85_99 = 0')
#NoIncident85_99 <- AirIncident[is.infinite(AirIncident$ASK_per_Inc85),]
nrow(NoIncident85_99)
## [1] 3
p1 <- plot_ly(Incident85_99, x = Airline, y = ASK_per_Inc85, type = 'bar') %>%
layout(title = "Lower value(Y axis) signifies Airlines most susceptible to incidents(1985-99)",
xaxis = list(title = ""),
yaxis = list(title = "KM/Incident"))
p1
It is pretty obvious from the above plot that most prone to incident is Aeroflot as every 15 Million KMs , there is an incident reported for Aeroflot. Since there are 53 Airlines mentioned in the above plot its difficult to analyse data for every airline. Lets concentrate our study on ten most incident prone Airlines.
MostIncident85_99 <- head(sqldf('select * from Incident85_99 ORDER BY ASK_per_Inc85 ASC'), 10)
p2 <- plot_ly(MostIncident85_99, x = Airline, y = ASK_per_Inc85, type = 'bar') %>%
layout(title = "Top 10 Airlines susceptible to incidents (1985-99)",
xaxis = list(title = ""),
yaxis = list(title = "KM/Incident"))
p2
The above plot without any degree of ambiguity shows the top 10 Incident Prone Airlines for the period 1985 - 1999.
The top 3 most incident prone Airlines are :
In order to get Most Safest Airline we cannot refer to Plot 1 since Plot 1 only has data for Airlines with at least one incidents but we do have Airlines with zero incident. Now the challenge here is to declare winner amongst the Airlines with No Incidents. I have devised the below plot which will tell which Airlines is safest amongst the zero incident Airline.
p3 <- plot_ly(NoIncident85_99, x = Airline, y = ASK, type = 'bar') %>%
layout(title = "Airlines with Zero Incident",
xaxis = list(title = ""),
yaxis = list(title = "Distance travelled in KM every WeekXSeat"))
p3
Without any doubt, Cathay Pacific is the most incident free Airline, since amongst all the Zero Incident Airlines Cathay has to fly significantly more distance. If I have to order the above three Airlines, I would follow the below Order.
Most Incident free Airlines for time interval 1985 - 1999.
Lets follow the same approach for Airlines data for interval 2000 - 2014
### Analyses for time period 2000 - 2014
## Airlines with atleast one incident
Incident00_14 <- sqldf('select Airline, Incidents00_14, ASK_per_Inc00 from AirIncident where Incidents00_14 > 0 group by ASK_per_Inc00')
nrow(Incident00_14)
## [1] 47
p4 <- plot_ly(Incident00_14, x = Airline, y = ASK_per_Inc00, type = 'bar') %>%
layout(title = "Lower value(Y axis) signifies Airlines most susceptible to incidents(2000-14)",
xaxis = list(title = ""),
yaxis = list(title = "KM/Incident"))
p4
As seen in the above plot, the airlines most susceptible to incident in Pakistan International. Lets choose the top 10 most susceptible Airlines from the above data set.
MostIncident00_14 <- head(sqldf('select * from Incident00_14 ORDER BY ASK_per_Inc00 ASC'), 10)
p5 <- plot_ly(MostIncident00_14, x = Airline, y = ASK_per_Inc00, type = 'bar') %>%
layout(title = "Top 10 Airlines susceptible to incidents (2000-14)",
xaxis = list(title = ""),
yaxis = list(title = "KM/Incident"))
p5
As seen above the top three most incident prone airlines for the interval 2000 - 2014 are :
In order to get most incident free Airlines we need to select Airlines with no incidents reported
## Airlines with no incident
NoIncident00_14 <- sqldf('select Airline, ASK, Incidents00_14 from AirIncident where Incidents00_14 = 0 ORDER BY ASK DESC')
nrow(NoIncident00_14)
## [1] 9
p6 <- plot_ly(NoIncident00_14, x = Airline, y = ASK, type = 'bar') %>%
layout(title = "Airlines with Zero Incident",
xaxis = list(title = ""),
yaxis = list(title = "Distance travelled in KM every WeekXSeat"))
p6
As seen above, Japan Airlines is the most incident free Airline, since amongst all the Zero Incident Airlines Japan has flown maximum distance. If I have to order the top three Airlines, I would follow the below Order.
Most Incident free Airlines for time interval 2000 - 2014.
Since in order to compare the different airlines fairly, we introduced two additional columns (ASK_per_Inc<interval>) in the data set. This was based on the assumption that column ASK (KM X No.of Seats/Week) impacts the incident as it would be unfair to compare two Airlines with significant difference between the ASK. Lets see if there is any relationship between Incident and ASK. To do this we will perform Chi-square test.
Lets do a Chi-Square test for data in time interval 1985 -1999
tbl1 = table(AirIncident$ASK, AirIncident$Incidents85_99)
chisq.test(tbl1)
##
## Pearson's Chi-squared test
##
## data: tbl1
## X-squared = 1008, df = 990, p-value = 0.3383
Analyses : As p-value = 0.3383 which is greater than 5 %, which means we accept that null hypotheses i.e we accept that there is a relationship between Incidents and ASK.
Lets do a Chi-Square test for data in time interval 2000 - 2014
tbl2 = table(AirIncident$ASK, AirIncident$Incidents00_14)
chisq.test(tbl2)
##
## Pearson's Chi-squared test
##
## data: tbl2
## X-squared = 728, df = 715, p-value = 0.3597
Analyses : As p-value = 0.3597 which is greater than 5 %, which means we accept that null hypotheses i.e we accept that there is a relationship between Incidents and ASK.
Lets do a f-test to compare variances of Incidents of the two intervals ( i.e. 1985 - 1999 vs 2000 - 2014)
var.test(AirIncident$Incidents85_99, AirIncident$Incidents00_14)
##
## F test to compare two variances
##
## data: AirIncident$Incidents85_99 and AirIncident$Incidents00_14
## F = 5.8957, num df = 55, denom df = 55, p-value = 6.251e-10
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 3.456088 10.057293
## sample estimates:
## ratio of variances
## 5.895667
Since p=value is less than 0.05 (p-value = 6.251e-10), we accept the null hypothesis that the two variances are significantly different.
Lets a do a t-test two check if there is any difference in the mean number of incidents between the two time intervals (1985-1999 , 2000-2014)
t.test(AirIncident$Incidents85_99, AirIncident$Incidents00_14, var.equal = FALSE, paired = TRUE)
##
## Paired t-test
##
## data: AirIncident$Incidents85_99 and AirIncident$Incidents00_14
## t = 2.2624, df = 55, p-value = 0.02764
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.348704 5.758439
## sample estimates:
## mean of the differences
## 3.053571
As p-value is less than 0.05 (p-value = 0.02764), we reject the null hypothesis. therefore we can conclude that there is a significant difference in the mean number of incidents in the two time intervals (1985-1999 vs 2000-2014)
bp1 <- ggplot(AirIncident, aes(x="", y=Incidents85_99))+geom_point(aes(color = Airline))+geom_boxplot()+ labs(title="Incidents between 1985-99", x="Airlines", y="No. of Incidents")
ggplotly(bp1)
bp2 <- ggplot(AirIncident, aes(x="", y=Incidents00_14))+geom_point(aes(color = Airline))+geom_boxplot()+ labs(title="Incidents between 2000-14", x="Airlines", y="No. of Incidents")
ggplotly(bp2)
The two box plots further confirm our findings of t-test. The mean of incidents occurred in the two time intervals is significantly different. As per the box plot 1 (1985-99), the mean of incidents is 4 , where as the mean for incidents between the time frame 2000-2014 is 3. This means that the Airlines in the time frame 2000-2014 were more incident free when compared to the Airlines in the time frame 1985-1999.
As per our test and analyses, following is our conclusion: