Dataset and source

Source

opendata.socrata.com - A collection of publicly available datasets for exploration and analysis.

Dataset used

Airplane Crashes and Fatalities Since 1908 - Contains the full history of airplane crashes throughout the world from 1908 to 2009. The dataset is hosted by Open Data by Socrata.

Data Description

| Variable | Description | Sample Values |
|---|---|---|
| Date | Date of the airplane crash | Sept 1908 to June 2009 |
| Time | Time of the accident | Time in 24-hour format |
| Location | Location of the accident | City, country (e.g. Cleveland, Ohio) or nearest landmark (e.g. Off West Hartlepool, England) |
| Operator | Owner of the aircraft | e.g. Military - German Navy, Private, US Aerial Mail Service |
| Flight # | Flight number given to each flight | e.g. 2L272, 739/14 |
| Route | Origin and destination of the flight | e.g. Lympne, England - Rotterdam, The Netherlands |
| Aboard | Number of people aboard | Integer values |
| Fatalities | Number of deaths | Integer values (check: should be less than or equal to Aboard) |
| Summary | Summary of the incident and information known afterwards | e.g. Shot down by British aircraft. |
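
The consistency check noted for Fatalities can be sketched as below. This is only an illustration on a toy tibble: `crash_sample` and `suspect_rows` are hypothetical names and the values are made up, not taken from the dataset.

```r
library(dplyr)

# Toy rows standing in for the real data (values are invented for illustration)
crash_sample <- tibble(
  Date       = c("09/17/1908", "07/12/1912", "08/06/1913"),
  Aboard     = c(2L, 5L, NA),
  Fatalities = c(1L, 6L, 1L)
)

# Rows where recorded deaths exceed the people aboard, or where the
# comparison cannot be made because one of the values is missing
suspect_rows <- crash_sample %>%
  filter(is.na(Aboard) | is.na(Fatalities) | Fatalities > Aboard)

suspect_rows
```

Rows flagged this way would need a manual look before deciding whether to correct or drop them.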

Data Import

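library(RSocrata)
library(tidyverse)  # attaches dplyr and tibble, among others

# The URL must be one unbroken string: a line break inside the quotes
# would embed a newline and spaces in the address
Data_url <- "https://opendata.socrata.com/Government/Airplane-Crashes-and-Fatalities-Since-1908/q2te-8cvq"

# Download the dataset and convert it to a tibble
Crash_raw_data <- read.socrata(Data_url) %>%
  as_tibble()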
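
The printed output later in this section shows most columns arriving as factors, with dates in month/day/year form and empty strings for unknown times. A minimal sketch of converting them, assuming that layout holds for the full data, is given here on a toy tibble. `crash_sample` and `crash_typed` are hypothetical names.

```r
library(dplyr)

# Toy stand-in for Crash_raw_data with factor columns
crash_sample <- tibble(
  Date     = factor(c("11/19/1943", "07/25/1967", "09/11/2001")),
  Time     = factor(c("", "10:30", "08:47")),
  Location = factor(c("Kunming, China", "Near Luang Prabang, Laos",
                      "New York City, New York"))
)

crash_typed <- crash_sample %>%
  # Factors to plain character strings first
  mutate(across(where(is.factor), as.character)) %>%
  mutate(
    Date = as.Date(Date, format = "%m/%d/%Y"),  # assumes MM/DD/YYYY
    Time = na_if(Time, "")                      # empty string means unknown
  )

crash_typed
```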

Data Cleaning

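Checking the level of the data, i.e. whether Date and Location together uniquely identify a crash:
# Keep only the rows whose Date/Location combination occurs more than once
Level_of_data_check <- Crash_raw_data %>%
  group_by(Date, Location) %>%
  mutate(rows = n()) %>%
  filter(rows > 1)
Level_of_data_check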
## Source: local data frame [6 x 14]
## Groups: Date, Location [3]
## 
##         Date   Time                 Location
##       <fctr> <fctr>                   <fctr>
## 1 11/19/1943                  Kunming, China
## 2 11/19/1943                  Kunming, China
## 3 07/25/1967  10:30 Near Luang Prabang, Laos
## 4 07/25/1967        Near Luang Prabang, Laos
## 5 09/11/2001  08:47  New York City, New York
## 6 09/11/2001  09:03  New York City, New York
## # ... with 11 more variables: Operator <fctr>, Flight.. <fctr>,
## #   Route <fctr>, Type <fctr>, Registration <fctr>, cn.In <fctr>,
## #   Aboard <int>, Fatalities <int>, Ground <int>, Summary <fctr>,
## #   rows <int>
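Three Date/Location combinations occur twice, so Date and Location alone do not uniquely identify a crash.
Next steps:
Check the number of missing values in each column and decide how to treat them
Remove unwanted columns, then remove the duplicate rows that remain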
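
The two next steps could look roughly like the sketch below. It runs on a toy tibble with invented values; `crash_sample`, `missing_counts` and `crash_dedup` are hypothetical names, and the columns kept in `select()` are only an example of what "unwanted columns" might leave behind.

```r
library(dplyr)

# Toy stand-in for the raw data (values are invented for illustration)
crash_sample <- tibble(
  Date     = c("11/19/1943", "11/19/1943", "07/25/1967", "07/25/1967"),
  Time     = c("", "", "10:30", ""),
  Location = c("Kunming, China", "Kunming, China",
               "Near Luang Prabang, Laos", "Near Luang Prabang, Laos"),
  Aboard   = c(3L, 3L, NA, 4L)
)

# Treat empty strings as missing, then count missing values per column
missing_counts <- crash_sample %>%
  mutate(across(where(is.character), ~ na_if(.x, ""))) %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Keep only the columns of interest, then drop exact duplicate rows
crash_dedup <- crash_sample %>%
  select(Date, Location, Aboard) %>%
  distinct()

missing_counts
crash_dedup
```

Note that `distinct()` only removes rows that match on every remaining column, so rows that share Date and Location but differ elsewhere are kept.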

Analysis

The following are some of the analyses I plan to do:

How has the number of airplane accidents varied over time?
Are there particular routes with a higher number of accidents?
Use text mining to categorize cargo, military, private and civil airplanes (if possible)
Use text mining to identify accidents caused by war or military operations
Use topic modelling on the Summary column to identify common causes of airplane accidents and how they have varied over time
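
As a first pass at the time-trend and text-mining questions, a rough sketch on a toy tibble is shown below. The rows are invented (reusing the sample Operator and Summary values from the data description), the names `crash_sample`, `accidents_per_year` and `crash_flagged` are hypothetical, and the keyword patterns are a guess at a starting point rather than a validated classification.

```r
library(dplyr)

# Toy stand-in for the cleaned data (dates and most summaries are invented)
crash_sample <- tibble(
  Date     = c("09/17/1908", "10/19/1916", "10/19/1916", "05/25/1919"),
  Operator = c("Private", "Military - German Navy",
               "Military - German Navy", "US Aerial Mail Service"),
  Summary  = c("Crashed during a test flight.", "Shot down by British aircraft.",
               "Crashed in a storm.", "Caught fire in flight.")
)

# Number of accidents per calendar year (assumes MM/DD/YYYY dates)
accidents_per_year <- crash_sample %>%
  mutate(Year = as.integer(format(as.Date(Date, format = "%m/%d/%Y"), "%Y"))) %>%
  count(Year, name = "Accidents")

# Simple keyword flags as a stand-in for fuller text mining
crash_flagged <- crash_sample %>%
  mutate(
    Is_military = grepl("military", Operator, ignore.case = TRUE),
    War_related = grepl("shot down|enemy|anti-aircraft", Summary,
                        ignore.case = TRUE)
  )

accidents_per_year
crash_flagged
```

Keyword matching of this kind will miss summaries that describe combat in other words, which is part of the motivation for trying topic modelling on the Summary column.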