This document is my project report which consists of Airplane Crash Analysis from the year 1908.

1.Synopsis

Problem Statement
This exploratory data analysis of the airplane crash data analyzes the crash trend for over 100 years beginning from the year 1908 to 2008. It is particularly interesting to observe the trend of airplane crashes and the reasons behind them, as air travel is the one of the most common transport medium these days. It is also important to examine our progress in overcoming the crashes.

Approach
The data used for this analysis is a public dataset hosted by open Data by Socrata. Various data cleaning steps were performed to work on a tidy dataset. After calculations, graphs were plotted to visualize the results and come to a conclusion.

Summary of the insights from this analysis
This analysis provides insights in observing the trend of aircrash over the years. It shows the percent of fatalities observed due to the crash. The analysis also help in determining which airline operator and types are worst to fly with. We also observe the top 10 countries which we should avoid to esacpe the crash. All these topics will be addressed and analyzed.

2.Packages Required

Following packages were used to perform the exercise:

library(readr)    # to read functions to import xls file
library(stringr)  # to use regx and other string functions
library(tidyverse)  # to manipulate data
library(dplyr)      # to manipulate data
library(ggplot2)    # to plot graph
library(readr)      # to read flat/tabular text files
library(lubridate)  # to manipulate as date
library(tm)         # to perform text mining operations (for wordcloud here)
library(caret)      # to spilt data and and select featured data
library(wordcloud)  # to write most reasons for crash in a cloud form
library(gridExtra)  # to arrange multiple grid based plots on a page
library(RColorBrewer)# to have nice color palettes
library(DT)         # to have html representation of the data

3.Data Preparation Section

3a.Orginal Source

The data set I used is a public dataset: “Airplane Crashes and Fatalities Since 1908” which is hosted by open Data by Socrata at: https://opendata.socrata.com/Government/Airplane-Crashes-and-Fatalities-Since-1908/q2te-8cvq

3b.Data Description

My data consists of 5268 observations and 13 variables and it represents the full history of airplanes crashes throughout the world:

  1. Date - The date on which the flight crashed.
  2. Time - The time at which flight crashed.
  3. Location - Location of the crash
  4. Operator - The name of the flight operator
  5. Flight - Flight Number of the airplane that crashed
  6. Route - The Route of the flight
  7. Type - The type of flight carrier
  8. Registration - Description unavailable. This variable wouldn’t be used for analysis.
  9. cn.In - Description unavailable.
  10. Aboard - The number of passenger on board
  11. Fatalities - The number of deaths
  12. Ground - Description unavailable.
  13. Summary - Brief summary

3c.Data Cleaning

Importing Dataset

AirplaneCrashURL <- "https://raw.githubusercontent.com/ApurvaBhoite/AirCrash/master/Airplane_Crashes_and_Fatalities_Since_1908.csv"
AirplaneCrash <- read.csv(AirplaneCrashURL, stringsAsFactors = FALSE ) 
AirplaneCrash <- as_tibble(AirplaneCrash)
  1. Missing Values
    1. Aboard has total 22 missing values.
    2. Fatalities has total 12 missing values.
    3. Ground has total 22 missing values.
      Now, since I had a total of 5268 observations these 32 observations only constitue of 0.6% percent of my data. So these NA values have been omitted.
      #1. Omitting Na values 
      AirplaneCrash <- na.omit(AirplaneCrash)
  1. Spilting date column in day month year
    The date column was spilt into three columns which consisted Month, Day and Year using the separate function.
      #2. Spilting Date Column
      AirplaneCrash <- AirplaneCrash %>% separate(Date, into = c("Month","Day","Year"))
  1. Converting the Location Column to consist of only Country names or Location Description
    Location column contained mixed values from city, state and country. A few values had all the three a few had state and country name, few had only country name. There were also values which constited descriptions such as over the mountain, in the sea. To understand which country had the most aircrashes, cleaning this column was necessary. String operations were performed to extract only the country name.
    AirplaneCrash$Location <- sapply(AirplaneCrash$Location, as.character)
    AirplaneCrash$Location <- gsub(".*,", "", AirplaneCrash$Location)
    #remove white space at beginning
    AirplaneCrash$Location <- str_trim(AirplaneCrash$Location, side = "both")
    #Convert string back to factors
    AirplaneCrash$Location <- sapply(AirplaneCrash$Location, as.factor)

3d.Final Dataset

The velow HTML table is the cleaned dataset with all the data cleaning activities performed on it. This data is in a scrollable format rom top to bottom. The value in the summary column is a lot therefore, one row appears to be large.

datatable(AirplaneCrash ,extensions = 'Buttons', options = list(dom = 'Bfrtip', buttons = I('colvis')))

3e.Summary of Dataset

The Data Description consists of all the description of the variables. However not all variables are relevant to our analysis.In this dataset the most important variables that are used for the analysis are as follows:
1. Year : The dataset can be grouped by year to see the yearly trends.
2. Month : The dataset can be grouped by monthly trends (if any).
3. Location : The column now consists of only one value which is either the country or description of the place.
4. Operator: This Aircraft Operator column is used to understand which operator had maximum crashes.
5. Type: This Aircraft Type column is used to analyze which type of aircraft caused maximum crashes.
6. Aboard: This column is used to determine the percent of fatalities that occured every year.
7. Fatalities: This is count of deaths occured. Helpful in determing the total loss.
8. Summary: This has reasons for the crash. Text mining can be performed on this column to understand the most frequent causes of aircrash.

These variables form the basis of my analysis. Other variables such as Flight, Time, Route, cn.In, Ground aren’t much helpful in deriving any insights.

4.Exploratory Data Analysis

4a.Crash Trend

Decription
Here two plots are combined into a grid view to observe the trend in the crashes over the years and also the months in particular. The Date column that was spilt into Month and Year is used here. A trend line in the Total number of crashes per year shows that number of crashes are reducing from the decade 1968-1978. And it was the maximum in the decade 1968-1978.

The monthwise analysis just gives a confirmation that crashes occur irrespective of the month. That means, the time of the year is not significant influencing parameter.

#Monthly
months <- as.data.frame(table(AirplaneCrash$Month))
A2 <- ggplot(months, aes(Var1, Freq)) + 
      geom_bar(stat = "identity", fill = "Navy", width = 0.3) + 
      xlab("Month") + ylab("Crashes") +
      ggtitle("Total number of crashes per month")

#Yearly
years <- as.data.frame(table(AirplaneCrash$Year))
A1 <- ggplot(years, aes(y = Freq, x = Var1, group = 1))  + 
      geom_line(size = 1, linetype = 1, color = "Navy") + 
      geom_point(size = 3, shape = 20)+ 
      geom_smooth() +
      xlab("Years") + ylab("Crashes") + 
      scale_x_discrete(breaks = seq(from = 1908, to = 2009, by = 10)) + 
      ggtitle("Total number of crashes per year")

grid.arrange(A1, A2, nrow = 2, heights=2:1)
## `geom_smooth()` using method = 'loess'

4b.Fatality Percent

Decription
Firt the Fatalities are grouped from the main Aircrash table grouped by the year. The total number of fatalities and number of passengers aboard is calculated. This information is used to plot the percent fatalities over the year. Here we observe that the percent of fatalities is decreasing with time. This should imply that the safety measures for the people onboard must have increased.

Fatalities <- AirplaneCrash %>% group_by(Year) %>% 
              summarise(total_fatalities = sum(Fatalities), total_passengers = sum(Aboard))

f1 <- ggplot(Fatalities, aes(y = (total_fatalities/total_passengers)*100, x = Year, group = 10))  + 
      geom_line(size = 1, linetype = 1, color = "Red") + 
      geom_point(size = 3, shape = 20) + 
      geom_smooth() +
      xlab("Years") + ylab("% Fatalities") + 
      scale_x_discrete(breaks = seq(from = 1908, to = 2009, by = 10)) +
      ggtitle("Percent of fatalities per year")
f1
## `geom_smooth()` using method = 'loess'

4c.Crash Locations

Description
The Location which was cleaning using string operations, is used here. The data is grouped by the location and the total fatalities for each location is calculated. Here we plot a graph to observe the top 10 countries which encountered the aircrash. It is observed that Russia has had the maximum crashes out of all the Countries.

Location_Crash <-   AirplaneCrash %>% group_by(Location) %>% 
                    summarise(total_fatalities = sum(Fatalities)) %>% arrange(desc(total_fatalities))

L1 <- ggplot(Location_Crash[1:10,], aes(x = reorder(Location, -total_fatalities), y = total_fatalities, alpha = total_fatalities)) + 
      geom_bar(stat = "identity", fill = "maroon", width = 0.5) +
      xlab("Countries") + ylab("Number of fatalities") + 
      ggtitle("Top 10 Countries with Maximum Fatalities")
L1

4d.Aircraft Operators

Description
To understand which Operators caused more crashes the data is grouped by the Operator and arranged in a descending format of the frequency. The graph has been plotted to observe the top 10 operators that caused the aircrash.

crash_operator <-   AirplaneCrash %>% group_by(Operator) %>% 
                    summarise(Freq = n()) %>% arrange(desc(Freq))

operator <- ggplot(crash_operator[1:10,], aes(x = reorder(factor(Operator), Freq), y = Freq, alpha = Freq)) + 
            geom_bar(stat = "identity", fill = "Blue", width = 0.05) + geom_point(stat = "identity") + 
            xlab("Aircraft Operators") + ylab("Crashes") + ggtitle("Top 10 Aircraft Operator causing Aircrash") + 
            coord_flip() 
operator

4e.Aircraft Type

Description
To understand which Types have caused more crashes the data is grouped by the Type and arranged in a descending format of the frequency. The graph has been plotted to observe the top 10 types that caused the aircrash.

crash_type <- AirplaneCrash %>% group_by(Type) %>% 
              summarise(Freq = n()) %>% arrange(desc(Freq))

type <- ggplot(crash_type[1:10,], aes(x = reorder(factor(Type), Freq), y = Freq, alpha = Freq)) + 
        geom_bar(stat = "identity", fill = "Purple", width = 0.05) + geom_point(stat = "identity") + 
        xlab("Types") + ylab("Crashes") + ggtitle("Top 10 Aircraft Type causing Aircrash") +
        coord_flip() 
type  

4f.Reasons for Crash

Description
Here I have experimented with text mining in R on the summary column to form a word cloud that states the reasons for aircrash.

data <- Corpus(VectorSource(AirplaneCrash$Summary))

corpus_clean <- tm_map(data, tolower)
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, PlainTextDocument)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())
corpus_clean <- tm_map(corpus_clean, removeWords, "flight")
corpus_clean <- tm_map(corpus_clean, removeWords, "crashed")
corpus_clean <- tm_map(corpus_clean, removeWords, "plane")
corpus_clean <- tm_map(corpus_clean, removeWords, "aircraft")

tdm <- TermDocumentMatrix(corpus_clean)
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal <- brewer.pal(9, "BuGn")
pal <- pal[-(1:2)]

wordcloud(corpus_clean, max.words = 100, min.freq = 35, random.order = FALSE, colors=pal)

5. Summary

1. What was the general crash trend?
Over the years the aircrash increased year until the decade 1968 - 1978. And then the number of crashes started reducing again, and it dropped considerably in the year 2008. The monthly crashes from 1908 was observed to check if any particular month was significantly responsible for the crash, but no such observation was made. Which implies that the crashes are well distributed through out the year.

2. What percent of life was lost every year?
It was observed that with time there is a decrease in the percent of fatalities. This might imply that constructive measures have been undertaken over the years for the safety of people onboard.

3. Which countries should you avoid to escape the aircrash?
Russia! A maximum number of aircrashes were observed in Russia. One should avoid flying through Russia inorder to have a safe trip. The other countries that followed up were Brazil, Colombia, USSR, France, India, China, Indonesia, Japan, Canada.

4. Which are the worst Operators to Fly with?
Aeroflot, Military - U.S. Air Force are worst operators as they have been responsible for maximum crashes.

5. Which types of the aircrafts are considered prone to crashes?
Douglas Dc-3 types of aircraft are most prone to crashes.

6. What were the common reasons for the aircrashes?
Most crashes occured due to pilots, engine failures, approach, during take-off’s, on the runway, due to weather, mountains, altitudes.