This is a proposal to analyse airplane crashes.
The data set I will use is the public data set “Airplane Crashes and Fatalities Since 1908”, hosted on Open Data by Socrata at: https://opendata.socrata.com/Government/Airplane-Crashes-and-Fatalities-Since-1908/q2te-8cvq
The data consist of 5268 observations and 13 variables and cover the history of airplane crashes throughout the world since 1908:
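library(tidyverse)
# Raw CSV copy of the data set, hosted on GitHub
AirplaneCrashURL <- "https://raw.githubusercontent.com/ApurvaBhoite/AirCrash/master/Airplane_Crashes_and_Fatalities_Since_1908.csv"
# Keep text columns as character vectors rather than factors
AirplaneCrash <- read.csv(AirplaneCrashURL, stringsAsFactors = FALSE)
# Convert to a tibble for compact printing
AirplaneCrash <- as_tibble(AirplaneCrash)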
str(AirplaneCrash)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5268 obs. of 13 variables:
## $ Date : chr "09/17/1908" "07/12/1912" "08/06/1913" "09/09/1913" ...
## $ Time : chr "17:18" "06:30" "" "18:30" ...
## $ Location : chr "Fort Myer, Virginia" "AtlantiCity, New Jersey" "Victoria, British Columbia, Canada" "Over the North Sea" ...
## $ Operator : chr "Military - U.S. Army" "Military - U.S. Navy" "Private" "Military - German Navy" ...
## $ Flight.. : chr "" "" "-" "" ...
## $ Route : chr "Demonstration" "Test flight" "" "" ...
## $ Type : chr "Wright Flyer III" "Dirigible" "Curtiss seaplane" "Zeppelin L-1 (airship)" ...
## $ Registration: chr "" "" "" "" ...
## $ cn.In : chr "1" "" "" "" ...
## $ Aboard : int 2 5 1 20 30 41 19 20 22 19 ...
## $ Fatalities : int 1 5 1 14 30 21 19 20 22 19 ...
## $ Ground : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Summary : chr "During a demonstration flight, a U.S. Army flyer flown by Orville Wright nose-dived into the ground from a height of approximat"| __truncated__ "First U.S. dirigible Akron exploded just offshore at an altitude of 1,000 ft. during a test flight." "The first fatal airplane accident in Canada occurred when American barnstormer, John M. Bryant, California aviator was killed." "The airship flew into a thunderstorm and encountered a severe downdraft crashing 20 miles north of Helgoland Island into the se"| __truncated__ ...
summary(AirplaneCrash)
## Date Time Location
## Length:5268 Length:5268 Length:5268
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Operator Flight.. Route
## Length:5268 Length:5268 Length:5268
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Type Registration cn.In Aboard
## Length:5268 Length:5268 Length:5268 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 5.00
## Mode :character Mode :character Mode :character Median : 13.00
## Mean : 27.55
## 3rd Qu.: 30.00
## Max. :644.00
## NA's :22
## Fatalities Ground Summary
## Min. : 0.00 Min. : 0.000 Length:5268
## 1st Qu.: 3.00 1st Qu.: 0.000 Class :character
## Median : 9.00 Median : 0.000 Mode :character
## Mean : 20.07 Mean : 1.609
## 3rd Qu.: 23.00 3rd Qu.: 0.000
## Max. :583.00 Max. :2750.000
## NA's :12 NA's :22
AirplaneCrash
## # A tibble: 5,268 × 13
## Date Time Location
## <chr> <chr> <chr>
## 1 09/17/1908 17:18 Fort Myer, Virginia
## 2 07/12/1912 06:30 AtlantiCity, New Jersey
## 3 08/06/1913 Victoria, British Columbia, Canada
## 4 09/09/1913 18:30 Over the North Sea
## 5 10/17/1913 10:30 Near Johannisthal, Germany
## 6 03/05/1915 01:00 Tienen, Belgium
## 7 09/03/1915 15:20 Off Cuxhaven, Germany
## 8 07/28/1916 Near Jambol, Bulgeria
## 9 09/24/1916 01:00 Billericay, England
## 10 10/01/1916 23:45 Potters Bar, England
## # ... with 5,258 more rows, and 10 more variables: Operator <chr>,
## # Flight.. <chr>, Route <chr>, Type <chr>, Registration <chr>,
## # cn.In <chr>, Aboard <int>, Fatalities <int>, Ground <int>,
## # Summary <chr>
I plan to clean the data in the following way:
All missing integer values (the NA's reported above for Aboard, Fatalities and Ground) will be replaced by the mean of the respective column, computed separately for each aircraft operator.
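A minimal sketch of this imputation step, run on a toy tibble whose values are made up for illustration. The helper name `impute_mean` is hypothetical, and the fallback to the overall column mean (for operators with no observed value at all) is an assumption that is not part of the plan above.

```r
library(dplyr)

# Toy rows modelled on the real columns; the values are invented.
toy <- tibble(
  Operator   = c("Private", "Private", "Private", "Aeroflot", "Aeroflot", "Unknown Op"),
  Aboard     = c(2L, NA, 4L, 10L, 20L, NA),
  Fatalities = c(1L, 3L, NA, NA, 20L, NA),
  Ground     = c(0L, 0L, NA, 0L, 0L, NA)
)

# Hypothetical helper: replace NA by the mean of the non-missing values.
impute_mean <- function(x) {
  x <- as.numeric(x)                 # means are generally not whole numbers
  if (all(is.na(x))) return(x)       # nothing to average; leave as NA
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

toy_imputed <- toy %>%
  group_by(Operator) %>%
  # First pass: mean within each operator
  mutate(across(c(Aboard, Fatalities, Ground), impute_mean)) %>%
  ungroup() %>%
  # Second pass (assumption): overall mean where an operator had no data
  mutate(across(c(Aboard, Fatalities, Ground), impute_mean))
```

Because the imputed means are rarely whole numbers, the three columns become doubles rather than integers; rounding them back is a separate decision.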
Splitting the columns
The Date column will be split into Day, Month and Year to allow better yearly analysis. The Location column will be split into Place and Country to allow country-wise analysis.
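One way the split could look, sketched on three rows copied from the printout above. It assumes dates are always written as MM/DD/YYYY and that the text after the last comma in Location is the country. Neither assumption holds for every row: U.S. locations end in a state ("Fort Myer, Virginia") and some locations have no comma at all.

```r
library(dplyr)
library(tidyr)

# Three rows taken from the data shown above.
toy <- tibble(
  Date     = c("09/17/1908", "08/06/1913", "09/09/1913"),
  Location = c("Fort Myer, Virginia",
               "Victoria, British Columbia, Canada",
               "Over the North Sea")
)

toy_split <- toy %>%
  # "MM/DD/YYYY" -> three integer columns; the original Date column is kept.
  separate(Date, into = c("Month", "Day", "Year"), sep = "/",
           convert = TRUE, remove = FALSE) %>%
  mutate(
    # Assumption: the last comma-separated piece is the country (or U.S. state).
    Country = if_else(grepl(",", Location),
                      trimws(sub("^.*,", "", Location)),
                      NA_character_),
    # Everything before the last comma; the whole string if there is no comma.
    Place   = if_else(grepl(",", Location),
                      trimws(sub(",[^,]*$", "", Location)),
                      Location)
  )
```

Rows where Country comes out as a U.S. state or as NA would still need a lookup or manual rule before any country-wise summary.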
Accurately determining the Operator and Type variables
Currently the observations in these columns are mixed, so various string operations will be necessary to handle and analyse this type of data.
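The kind of string operations meant here might look like the sketch below, which uses the first four Operator and Type values from the output above. The new column names (`IsMilitary`, `OperatorName`, `Manufacturer`) are hypothetical, and taking the first word of Type as the manufacturer is only a crude first guess.

```r
library(dplyr)
library(stringr)

# First four Operator / Type values from the str() output above.
toy <- tibble(
  Operator = c("Military - U.S. Army", "Military - U.S. Navy",
               "Private", "Military - German Navy"),
  Type     = c("Wright Flyer III", "Dirigible",
               "Curtiss seaplane", "Zeppelin L-1 (airship)")
)

toy_clean <- toy %>%
  mutate(
    # Collapse repeated whitespace and trim the ends
    Operator     = str_squish(Operator),
    # Flag operators whose name starts with "Military"
    IsMilitary   = str_detect(Operator, regex("^military", ignore_case = TRUE)),
    # Drop the "Military - " prefix to leave the organisation name
    OperatorName = str_remove(Operator, regex("^military\\s*-\\s*", ignore_case = TRUE)),
    # Crude guess (assumption): first word of Type is the manufacturer
    Manufacturer = word(Type, 1)
  )
```

Values such as "Dirigible" show the limits of the first-word rule: it returns a category of aircraft, not a manufacturer, so the real data will likely need a hand-built lookup as well.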
*Some parts of the analysis may be added or dropped as the project proceeds.