This document is my project report which consists of Airplane Crash Analysis from the year 1908.
Problem Statement
This exploratory data analysis of the airplane crash data analyzes the crash trend for over 100 years beginning from the year 1908 to 2008. It is particularly interesting to observe the trend of airplane crashes and the reasons behind them, as air travel is the one of the most common transport medium these days. It is also important to examine our progress in overcoming the crashes.
Approach
The data used for this analysis is a public dataset hosted by open Data by Socrata. Various data cleaning steps were performed to work on a tidy dataset. After calculations, graphs were plotted to visualize the results and come to a conclusion.
Summary of the insights from this analysis
This analysis provides insights in observing the trend of aircrash over the years. It shows the percent of fatalities observed due to the crash. The analysis also help in determining which airline operator and types are worst to fly with. We also observe the top 10 countries which we should avoid to esacpe the crash. All these topics will be addressed and analyzed.
Following packages were used to perform the exercise:
library(readr) # to read functions to import xls file
library(stringr) # to use regx and other string functions
library(tidyverse) # to manipulate data
library(dplyr) # to manipulate data
library(ggplot2) # to plot graph
library(readr) # to read flat/tabular text files
library(lubridate) # to manipulate as date
library(tm) # to perform text mining operations (for wordcloud here)
library(caret) # to spilt data and and select featured data
library(wordcloud) # to write most reasons for crash in a cloud form
library(gridExtra) # to arrange multiple grid based plots on a page
library(RColorBrewer)# to have nice color palettes
library(DT) # to have html representation of the data
The data set I used is a public dataset: “Airplane Crashes and Fatalities Since 1908” which is hosted by open Data by Socrata at: https://opendata.socrata.com/Government/Airplane-Crashes-and-Fatalities-Since-1908/q2te-8cvq
My data consists of 5268 observations and 13 variables and it represents the full history of airplanes crashes throughout the world:
Importing Dataset
AirplaneCrashURL <- "https://raw.githubusercontent.com/ApurvaBhoite/AirCrash/master/Airplane_Crashes_and_Fatalities_Since_1908.csv"
AirplaneCrash <- read.csv(AirplaneCrashURL, stringsAsFactors = FALSE )
AirplaneCrash <- as_tibble(AirplaneCrash)
#1. Omitting Na values
AirplaneCrash <- na.omit(AirplaneCrash)
#2. Spilting Date Column
AirplaneCrash <- AirplaneCrash %>% separate(Date, into = c("Month","Day","Year"))
AirplaneCrash$Location <- sapply(AirplaneCrash$Location, as.character)
AirplaneCrash$Location <- gsub(".*,", "", AirplaneCrash$Location)
#remove white space at beginning
AirplaneCrash$Location <- str_trim(AirplaneCrash$Location, side = "both")
#Convert string back to factors
AirplaneCrash$Location <- sapply(AirplaneCrash$Location, as.factor)
The velow HTML table is the cleaned dataset with all the data cleaning activities performed on it. This data is in a scrollable format rom top to bottom. The value in the summary column is a lot therefore, one row appears to be large.
datatable(AirplaneCrash ,extensions = 'Buttons', options = list(dom = 'Bfrtip', buttons = I('colvis')))
The Data Description consists of all the description of the variables. However not all variables are relevant to our analysis.In this dataset the most important variables that are used for the analysis are as follows:
1. Year : The dataset can be grouped by year to see the yearly trends.
2. Month : The dataset can be grouped by monthly trends (if any).
3. Location : The column now consists of only one value which is either the country or description of the place.
4. Operator: This Aircraft Operator column is used to understand which operator had maximum crashes.
5. Type: This Aircraft Type column is used to analyze which type of aircraft caused maximum crashes.
6. Aboard: This column is used to determine the percent of fatalities that occured every year.
7. Fatalities: This is count of deaths occured. Helpful in determing the total loss.
8. Summary: This has reasons for the crash. Text mining can be performed on this column to understand the most frequent causes of aircrash.
These variables form the basis of my analysis. Other variables such as Flight, Time, Route, cn.In, Ground aren’t much helpful in deriving any insights.
Decription
Here two plots are combined into a grid view to observe the trend in the crashes over the years and also the months in particular. The Date column that was spilt into Month and Year is used here. A trend line in the Total number of crashes per year shows that number of crashes are reducing from the decade 1968-1978. And it was the maximum in the decade 1968-1978.
The monthwise analysis just gives a confirmation that crashes occur irrespective of the month. That means, the time of the year is not significant influencing parameter.
#Monthly
months <- as.data.frame(table(AirplaneCrash$Month))
A2 <- ggplot(months, aes(Var1, Freq)) +
geom_bar(stat = "identity", fill = "Navy", width = 0.3) +
xlab("Month") + ylab("Crashes") +
ggtitle("Total number of crashes per month")
#Yearly
years <- as.data.frame(table(AirplaneCrash$Year))
A1 <- ggplot(years, aes(y = Freq, x = Var1, group = 1)) +
geom_line(size = 1, linetype = 1, color = "Navy") +
geom_point(size = 3, shape = 20)+
geom_smooth() +
xlab("Years") + ylab("Crashes") +
scale_x_discrete(breaks = seq(from = 1908, to = 2009, by = 10)) +
ggtitle("Total number of crashes per year")
grid.arrange(A1, A2, nrow = 2, heights=2:1)
## `geom_smooth()` using method = 'loess'
Decription
Firt the Fatalities are grouped from the main Aircrash table grouped by the year. The total number of fatalities and number of passengers aboard is calculated. This information is used to plot the percent fatalities over the year. Here we observe that the percent of fatalities is decreasing with time. This should imply that the safety measures for the people onboard must have increased.
Fatalities <- AirplaneCrash %>% group_by(Year) %>%
summarise(total_fatalities = sum(Fatalities), total_passengers = sum(Aboard))
f1 <- ggplot(Fatalities, aes(y = (total_fatalities/total_passengers)*100, x = Year, group = 10)) +
geom_line(size = 1, linetype = 1, color = "Red") +
geom_point(size = 3, shape = 20) +
geom_smooth() +
xlab("Years") + ylab("% Fatalities") +
scale_x_discrete(breaks = seq(from = 1908, to = 2009, by = 10)) +
ggtitle("Percent of fatalities per year")
f1
## `geom_smooth()` using method = 'loess'
Description
The Location which was cleaning using string operations, is used here. The data is grouped by the location and the total fatalities for each location is calculated. Here we plot a graph to observe the top 10 countries which encountered the aircrash. It is observed that Russia has had the maximum crashes out of all the Countries.
Location_Crash <- AirplaneCrash %>% group_by(Location) %>%
summarise(total_fatalities = sum(Fatalities)) %>% arrange(desc(total_fatalities))
L1 <- ggplot(Location_Crash[1:10,], aes(x = reorder(Location, -total_fatalities), y = total_fatalities, alpha = total_fatalities)) +
geom_bar(stat = "identity", fill = "maroon", width = 0.5) +
xlab("Countries") + ylab("Number of fatalities") +
ggtitle("Top 10 Countries with Maximum Fatalities")
L1
Description
To understand which Operators caused more crashes the data is grouped by the Operator and arranged in a descending format of the frequency. The graph has been plotted to observe the top 10 operators that caused the aircrash.
crash_operator <- AirplaneCrash %>% group_by(Operator) %>%
summarise(Freq = n()) %>% arrange(desc(Freq))
operator <- ggplot(crash_operator[1:10,], aes(x = reorder(factor(Operator), Freq), y = Freq, alpha = Freq)) +
geom_bar(stat = "identity", fill = "Blue", width = 0.05) + geom_point(stat = "identity") +
xlab("Aircraft Operators") + ylab("Crashes") + ggtitle("Top 10 Aircraft Operator causing Aircrash") +
coord_flip()
operator
Description
To understand which Types have caused more crashes the data is grouped by the Type and arranged in a descending format of the frequency. The graph has been plotted to observe the top 10 types that caused the aircrash.
crash_type <- AirplaneCrash %>% group_by(Type) %>%
summarise(Freq = n()) %>% arrange(desc(Freq))
type <- ggplot(crash_type[1:10,], aes(x = reorder(factor(Type), Freq), y = Freq, alpha = Freq)) +
geom_bar(stat = "identity", fill = "Purple", width = 0.05) + geom_point(stat = "identity") +
xlab("Types") + ylab("Crashes") + ggtitle("Top 10 Aircraft Type causing Aircrash") +
coord_flip()
type
Description
Here I have experimented with text mining in R on the summary column to form a word cloud that states the reasons for aircrash.
data <- Corpus(VectorSource(AirplaneCrash$Summary))
corpus_clean <- tm_map(data, tolower)
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, PlainTextDocument)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())
corpus_clean <- tm_map(corpus_clean, removeWords, "flight")
corpus_clean <- tm_map(corpus_clean, removeWords, "crashed")
corpus_clean <- tm_map(corpus_clean, removeWords, "plane")
corpus_clean <- tm_map(corpus_clean, removeWords, "aircraft")
tdm <- TermDocumentMatrix(corpus_clean)
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal <- brewer.pal(9, "BuGn")
pal <- pal[-(1:2)]
wordcloud(corpus_clean, max.words = 100, min.freq = 35, random.order = FALSE, colors=pal)
1. What was the general crash trend?
Over the years the aircrash increased year until the decade 1968 - 1978. And then the number of crashes started reducing again, and it dropped considerably in the year 2008. The monthly crashes from 1908 was observed to check if any particular month was significantly responsible for the crash, but no such observation was made. Which implies that the crashes are well distributed through out the year.
2. What percent of life was lost every year?
It was observed that with time there is a decrease in the percent of fatalities. This might imply that constructive measures have been undertaken over the years for the safety of people onboard.
3. Which countries should you avoid to escape the aircrash?
Russia! A maximum number of aircrashes were observed in Russia. One should avoid flying through Russia inorder to have a safe trip. The other countries that followed up were Brazil, Colombia, USSR, France, India, China, Indonesia, Japan, Canada.
4. Which are the worst Operators to Fly with?
Aeroflot, Military - U.S. Air Force are worst operators as they have been responsible for maximum crashes.
5. Which types of the aircrafts are considered prone to crashes?
Douglas Dc-3 types of aircraft are most prone to crashes.
6. What were the common reasons for the aircrashes?
Most crashes occured due to pilots, engine failures, approach, during take-off’s, on the runway, due to weather, mountains, altitudes.