Final Project Data 606

Introduction

A statistical study to find out the most incident prone and most incident free Airlines within the two time intervals : 1985 - 1999 and 2000 - 2014. For this analyses, I will perform Chi-Square test, F-test and T-test.

Data

For this project, I have used to following link to obtain data :

Airline Data

x <- getURL("https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv")
Airline <- read.csv(text = x)

Cases

nrow(Airline)

## [1] 56

Each case is the summary for each airline’s incidents. Total 56 cases.

Type of study

This is an observational study to find out the most incident pront and most incident free Airlines for the given two time intervals.

Exploratory data analysis

For my data analyses, I would like to consider the total number of incidents occurred for each Airlines for the two time periods.

Study I - 1985 - 1999

To do this, I have created a subset of the original data by considering Incidents only.

# Max Incidents
AirIncident <- data.frame(Airline$airline , Airline$avail_seat_km_per_week, Airline$incidents_85_99, Airline$incidents_00_14)
datatable(AirIncident)

Add additional column

It is unfair to compare Aer Lingus with Aeroflot since ASK (KM X No.of Seats/Week) for Aeroflot is much greater than Aer Lingus. Hence , lets add a column which can quantify the incidents based on ASK. This column will let us know that after how much KM/Seat, there is an incident reported for an Airlines.

P.S : Available seat kilometers (ASKs) is defined as the number of seats multiplied by the number of kilometers the airline flies.

AirIncident <- mutate(AirIncident, ASK_per_Inc85_99 =    Airline.avail_seat_km_per_week/Airline.incidents_85_99 )
AirIncident <- mutate(AirIncident, ASK_per_Inc85_99 = round(ASK_per_Inc85_99, 0) )

AirIncident <- mutate(AirIncident, ASK_per_Inc00_14 = Airline.avail_seat_km_per_week/Airline.incidents_00_14 )
AirIncident <- mutate(AirIncident, ASK_per_Inc00_14 = round(ASK_per_Inc00_14, 0) )

colnames(AirIncident) <- c("Airline","ASK", "Incidents85_99","Incidents00_14", "ASK_per_Inc85", "ASK_per_Inc00")
datatable(AirIncident)

As seen above the two new columns added are ASK_per_Inc<interval> which signifies that a particular Airlines is susceptible to an incident every ASK_per_Inc KM. Obviously, lower the ASK_per_Inc, more the Airline is susceptible to an incident. The only exception to this rule are those Airlines which did not have a single incident in the given time period. For such Airlines, the ASK_per_Inc would be NA which also means that these are some of the safest Airlines.

Analysis

In order to do further analysis , we need to divide our data into two parts. The first part will consist of Airline data where incident happened and the second set of data will consist of Airlines with zero incidents. From the above it is quite apparent that the second set of data would be very small.

### Analyses for time period 1985 - 1999

## Airlines with atleast one incident
Incident85_99 <- sqldf('select Airline, Incidents85_99, ASK_per_Inc85 from AirIncident where  Incidents85_99 > 0 group by ASK_per_Inc85')
nrow(Incident85_99)

## [1] 53

## Airlines with no incident
NoIncident85_99 <- sqldf('select Airline, ASK, Incidents85_99 from AirIncident where Incidents85_99 = 0')
#NoIncident85_99 <- AirIncident[is.infinite(AirIncident$ASK_per_Inc85),]
nrow(NoIncident85_99)

## [1] 3

p1 <- plot_ly(Incident85_99, x = Airline, y = ASK_per_Inc85, type = 'bar') %>%
  layout(title = "Lower value(Y axis) signifies Airlines most susceptible to incidents(1985-99)",
         xaxis = list(title = ""),
         yaxis = list(title = "KM/Incident"))
 p1

It is pretty obvious from the above plot that most prone to incident is Aeroflot as every 15 Million KMs , there is an incident reported for Aeroflot. Since there are 53 Airlines mentioned in the above plot its difficult to analyse data for every airline. Lets concentrate our study on ten most incident prone Airlines.

Most Incident Prone Airlines for 1985 - 1999

MostIncident85_99 <- head(sqldf('select  * from Incident85_99 ORDER BY ASK_per_Inc85 ASC'), 10)
p2 <- plot_ly(MostIncident85_99, x = Airline, y = ASK_per_Inc85, type = 'bar') %>%
  layout(title = "Top 10 Airlines susceptible to incidents (1985-99)",
         xaxis = list(title = ""),
         yaxis = list(title = "KM/Incident"))
 p2

The above plot without any degree of ambiguity shows the top 10 Incident Prone Airlines for the period 1985 - 1999.

The top 3 most incident prone Airlines are :

Aerflot : Incident reported every 15758846 KM
Ethiopian Airlines : Incident reported every 19542426 KM
Pakistan International : Incident reported every 43570392 KM

Most Incident free Airline between 1985 - 1999

In order to get Most Safest Airline we cannot refer to Plot 1 since Plot 1 only has data for Airlines with at least one incidents but we do have Airlines with zero incident. Now the challenge here is to declare winner amongst the Airlines with No Incidents. I have devised the below plot which will tell which Airlines is safest amongst the zero incident Airline.

p3 <- plot_ly(NoIncident85_99, x = Airline, y = ASK, type = 'bar') %>%
  layout(title = "Airlines with Zero Incident",
         xaxis = list(title = ""),
         yaxis = list(title = "Distance travelled in KM every WeekXSeat"))
 p3

Without any doubt, Cathay Pacific is the most incident free Airline, since amongst all the Zero Incident Airlines Cathay has to fly significantly more distance. If I have to order the above three Airlines, I would follow the below Order.

Most Incident free Airlines for time interval 1985 - 1999.

Cathay Pacific : No incident report for 2582459303 KM
Air Portugal : No incident report for 619130754 KM
Hawaiian Airlines : No incident report for 493877795 KM

Study II - 2000 - 2014

Lets follow the same approach for Airlines data for interval 2000 - 2014

### Analyses for time period 2000 - 2014

## Airlines with atleast one incident
Incident00_14 <- sqldf('select Airline, Incidents00_14, ASK_per_Inc00 from AirIncident where  Incidents00_14 > 0 group by ASK_per_Inc00')
nrow(Incident00_14)

## [1] 47

p4 <- plot_ly(Incident00_14, x = Airline, y = ASK_per_Inc00, type = 'bar') %>%
  layout(title = "Lower value(Y axis) signifies Airlines most susceptible to incidents(2000-14)",
         xaxis = list(title = ""),
         yaxis = list(title = "KM/Incident"))
 p4

Most incident prone Airline (2000 - 2014)

As seen in the above plot, the airlines most susceptible to incident in Pakistan International. Lets choose the top 10 most susceptible Airlines from the above data set.

MostIncident00_14 <- head(sqldf('select  * from Incident00_14 ORDER BY ASK_per_Inc00 ASC'), 10)
p5 <- plot_ly(MostIncident00_14, x = Airline, y = ASK_per_Inc00, type = 'bar') %>%
  layout(title = "Top 10 Airlines susceptible to incidents (2000-14)",
         xaxis = list(title = ""),
         yaxis = list(title = "KM/Incident"))
 p5

As seen above the top three most incident prone airlines for the interval 2000 - 2014 are :

Pakistan International : Incident reported for every 34856314 KM
Saudi Arabian : Incident reported for every 78152173 KM
Sri Lankan / AirLanka : Incident reported for every 81395744 KM

Most Incident free Airline between 2000 - 2014

In order to get most incident free Airlines we need to select Airlines with no incidents reported

## Airlines with no incident
NoIncident00_14 <- sqldf('select Airline, ASK, Incidents00_14 from AirIncident where Incidents00_14 = 0 ORDER BY ASK DESC')
nrow(NoIncident00_14)

## [1] 9

p6 <- plot_ly(NoIncident00_14, x = Airline, y = ASK, type = 'bar') %>%
  layout(title = "Airlines with Zero Incident",
         xaxis = list(title = ""),
         yaxis = list(title = "Distance travelled in KM every WeekXSeat"))
 p6

As seen above, Japan Airlines is the most incident free Airline, since amongst all the Zero Incident Airlines Japan has flown maximum distance. If I have to order the top three Airlines, I would follow the below Order.

Most Incident free Airlines for time interval 2000 - 2014.

Japan Airlines : No incident report for 1574217531 KM
Virgin Atlantic : No incident report for 1005248585 KM
LAN Airlines : No incident report for 1001965891 KM

Inference

Since in order to compare the different airlines fairly, we introduced two additional columns (ASK_per_Inc<interval>) in the data set. This was based on the assumption that column ASK (KM X No.of Seats/Week) impacts the incident as it would be unfair to compare two Airlines with significant difference between the ASK. Lets see if there is any relationship between Incident and ASK. To do this we will perform Chi-square test.

Chi-Square Test for independence for time interval 1985 - 1999

Lets do a Chi-Square test for data in time interval 1985 -1999

tbl1 = table(AirIncident$ASK, AirIncident$Incidents85_99)
chisq.test(tbl1)

## 
##  Pearson's Chi-squared test
## 
## data:  tbl1
## X-squared = 1008, df = 990, p-value = 0.3383

Analyses : As p-value = 0.3383 which is greater than 5 %, which means we accept that null hypotheses i.e we accept that there is a relationship between Incidents and ASK.

Chi-Square Test for independence for time interval 2000 - 2014

Lets do a Chi-Square test for data in time interval 2000 - 2014

tbl2 = table(AirIncident$ASK, AirIncident$Incidents00_14)
chisq.test(tbl2)

## 
##  Pearson's Chi-squared test
## 
## data:  tbl2
## X-squared = 728, df = 715, p-value = 0.3597

Analyses : As p-value = 0.3597 which is greater than 5 %, which means we accept that null hypotheses i.e we accept that there is a relationship between Incidents and ASK.

F-test to compare two variance

Lets do a f-test to compare variances of Incidents of the two intervals ( i.e. 1985 - 1999 vs 2000 - 2014)

var.test(AirIncident$Incidents85_99, AirIncident$Incidents00_14)

## 
##  F test to compare two variances
## 
## data:  AirIncident$Incidents85_99 and AirIncident$Incidents00_14
## F = 5.8957, num df = 55, denom df = 55, p-value = 6.251e-10
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##   3.456088 10.057293
## sample estimates:
## ratio of variances 
##           5.895667

Since p=value is less than 0.05 (p-value = 6.251e-10), we accept the null hypothesis that the two variances are significantly different.

Sample two paired T-Test

Lets a do a t-test two check if there is any difference in the mean number of incidents between the two time intervals (1985-1999 , 2000-2014)

t.test(AirIncident$Incidents85_99, AirIncident$Incidents00_14, var.equal = FALSE, paired = TRUE)

## 
##  Paired t-test
## 
## data:  AirIncident$Incidents85_99 and AirIncident$Incidents00_14
## t = 2.2624, df = 55, p-value = 0.02764
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.348704 5.758439
## sample estimates:
## mean of the differences 
##                3.053571

As p-value is less than 0.05 (p-value = 0.02764), we reject the null hypothesis. therefore we can conclude that there is a significant difference in the mean number of incidents in the two time intervals (1985-1999 vs 2000-2014)

bp1 <- ggplot(AirIncident, aes(x="", y=Incidents85_99))+geom_point(aes(color = Airline))+geom_boxplot()+ labs(title="Incidents between 1985-99", x="Airlines", y="No. of Incidents")
ggplotly(bp1)

bp2 <- ggplot(AirIncident, aes(x="", y=Incidents00_14))+geom_point(aes(color = Airline))+geom_boxplot()+ labs(title="Incidents between 2000-14", x="Airlines", y="No. of Incidents")
ggplotly(bp2)

The two box plots further confirm our findings of t-test. The mean of incidents occurred in the two time intervals is significantly different. As per the box plot 1 (1985-99), the mean of incidents is 4 , where as the mean for incidents between the time frame 2000-2014 is 3. This means that the Airlines in the time frame 2000-2014 were more incident free when compared to the Airlines in the time frame 1985-1999.

Conclusion

As per our test and analyses, following is our conclusion:

Most incident prone Airline for time interval 1999-1985: Aeroflot
Most incident free Airline for time interval 1999-1985: Cathay Pacific
Most incident prone Airline for time interval 2000-2014: Pakistan International
Most incident free Airline for time interval 2000-2014: Japan Airlines
The ASK (KM X No.of Seats/Week) has an impact on incidents.
There is an improvement in incidents for the time interval 2000-2014 when compared with incidents for the time interval 1999-1985