This is a proposal to analyse airplane crashes.

Data Description

The data set I will use is the public dataset “Airplane Crashes and Fatalities Since 1908”, hosted on Open Data by Socrata at: https://opendata.socrata.com/Government/Airplane-Crashes-and-Fatalities-Since-1908/q2te-8cvq

The data consists of 5268 observations and 13 variables, and it records airplane crashes throughout the world since 1908:

  1. Date - The date on which the flight crashed.
  2. Time - The time at which the flight crashed.
  3. Location - The location of the crash.
  4. Operator - The name of the flight operator.
  5. Flight - The flight number of the airplane that crashed (imported as Flight..).
  6. Route - The route of the flight.
  7. Type - The type of aircraft.
  8. Registration - Description unavailable. This variable will not be used for analysis.
  9. cn.In - Description unavailable.
  10. Aboard - The number of people on board.
  11. Fatalities - The number of deaths.
  12. Ground - Description unavailable.
  13. Summary - A brief summary of the reason for the crash.

Data Importing

  1. Data Importing from an online resource
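# Load the tidyverse for as_tibble() and later data manipulation
library(tidyverse)

# Read the CSV copy of the dataset hosted on GitHub
AirplaneCrashURL <- "https://raw.githubusercontent.com/ApurvaBhoite/AirCrash/master/Airplane_Crashes_and_Fatalities_Since_1908.csv"
AirplaneCrash <- read.csv(AirplaneCrashURL, stringsAsFactors = FALSE)
# Convert to a tibble for compact printing
AirplaneCrash <- as_tibble(AirplaneCrash)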
  2. Contents of the data
str(AirplaneCrash)
## Classes 'tbl_df', 'tbl' and 'data.frame':    5268 obs. of  13 variables:
##  $ Date        : chr  "09/17/1908" "07/12/1912" "08/06/1913" "09/09/1913" ...
##  $ Time        : chr  "17:18" "06:30" "" "18:30" ...
##  $ Location    : chr  "Fort Myer, Virginia" "AtlantiCity, New Jersey" "Victoria, British Columbia, Canada" "Over the North Sea" ...
##  $ Operator    : chr  "Military - U.S. Army" "Military - U.S. Navy" "Private" "Military - German Navy" ...
##  $ Flight..    : chr  "" "" "-" "" ...
##  $ Route       : chr  "Demonstration" "Test flight" "" "" ...
##  $ Type        : chr  "Wright Flyer III" "Dirigible" "Curtiss seaplane" "Zeppelin L-1 (airship)" ...
##  $ Registration: chr  "" "" "" "" ...
##  $ cn.In       : chr  "1" "" "" "" ...
##  $ Aboard      : int  2 5 1 20 30 41 19 20 22 19 ...
##  $ Fatalities  : int  1 5 1 14 30 21 19 20 22 19 ...
##  $ Ground      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Summary     : chr  "During a demonstration flight, a U.S. Army flyer flown by Orville Wright nose-dived into the ground from a height of approximat"| __truncated__ "First U.S. dirigible Akron exploded just offshore at an altitude of 1,000 ft. during a test flight." "The first fatal airplane accident in Canada occurred when American barnstormer, John M. Bryant, California aviator was killed." "The airship flew into a thunderstorm and encountered a severe downdraft crashing 20 miles north of Helgoland Island into the se"| __truncated__ ...
  3. Summary of the data
summary(AirplaneCrash)
##      Date               Time             Location        
##  Length:5268        Length:5268        Length:5268       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    Operator           Flight..            Route          
##  Length:5268        Length:5268        Length:5268       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##      Type           Registration          cn.In               Aboard      
##  Length:5268        Length:5268        Length:5268        Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.:  5.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 13.00  
##                                                           Mean   : 27.55  
##                                                           3rd Qu.: 30.00  
##                                                           Max.   :644.00  
##                                                           NA's   :22      
##    Fatalities         Ground           Summary         
##  Min.   :  0.00   Min.   :   0.000   Length:5268       
##  1st Qu.:  3.00   1st Qu.:   0.000   Class :character  
##  Median :  9.00   Median :   0.000   Mode  :character  
##  Mean   : 20.07   Mean   :   1.609                     
##  3rd Qu.: 23.00   3rd Qu.:   0.000                     
##  Max.   :583.00   Max.   :2750.000                     
##  NA's   :12       NA's   :22
  4. Data
AirplaneCrash
## # A tibble: 5,268 × 13
##          Date  Time                           Location
##         <chr> <chr>                              <chr>
## 1  09/17/1908 17:18                Fort Myer, Virginia
## 2  07/12/1912 06:30            AtlantiCity, New Jersey
## 3  08/06/1913       Victoria, British Columbia, Canada
## 4  09/09/1913 18:30                 Over the North Sea
## 5  10/17/1913 10:30         Near Johannisthal, Germany
## 6  03/05/1915 01:00                    Tienen, Belgium
## 7  09/03/1915 15:20              Off Cuxhaven, Germany
## 8  07/28/1916                    Near Jambol, Bulgeria
## 9  09/24/1916 01:00                Billericay, England
## 10 10/01/1916 23:45               Potters Bar, England
## # ... with 5,258 more rows, and 10 more variables: Operator <chr>,
## #   Flight.. <chr>, Route <chr>, Type <chr>, Registration <chr>,
## #   cn.In <chr>, Aboard <int>, Fatalities <int>, Ground <int>,
## #   Summary <chr>

Data Cleaning

I plan to clean the data in the following way:

  1. Missing Values
    1. Aboard has 22 missing values.
    2. Fatalities has 12 missing values.
    3. Ground has 22 missing values.

    Each of these missing integer values will be replaced by the mean of its column, computed within the same aircraft operator.

  2. Splitting the columns
    The Date column will be split into Day, Month and Year to support year-by-year analysis.
    The Location column will be split into Place and Country to support country-wise analysis.

  3. Accurately determining the Operator and Type variables
    Currently the observations in these columns are mixed, so various string operations will be necessary to handle and analyse this type of data.
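
The first two cleaning steps could be sketched roughly as follows. This is only an illustration under stated assumptions, not the final cleaning code: the `crashes` tibble is a toy stand-in for AirplaneCrash (its operators and counts are hypothetical), `impute_mean` is a helper name invented here, the fallback to the overall column mean when an operator has no observed values is an added assumption, and the date format `%m/%d/%Y` is inferred from sample values such as "09/17/1908".

```r
library(dplyr)

# Toy stand-in for AirplaneCrash. Dates and locations are copied from the
# sample rows shown earlier; operators and counts are hypothetical.
crashes <- tibble(
  Date       = c("09/17/1908", "08/06/1913", "09/09/1913", "03/05/1915"),
  Location   = c("Fort Myer, Virginia", "Victoria, British Columbia, Canada",
                 "Over the North Sea", "Tienen, Belgium"),
  Operator   = c("Op A", "Op A", "Op A", "Op B"),
  Aboard     = c(2L, NA, 4L, NA),
  Fatalities = c(1L, 5L, NA, 3L)
)

# Hypothetical helper: replace NAs in x by the mean of its observed values.
# If every value is missing, the values are left as NA.
impute_mean <- function(x) {
  x <- as.numeric(x)
  if (all(is.na(x))) return(x)
  replace(x, is.na(x), mean(x, na.rm = TRUE))
}

cleaned <- crashes %>%
  # Step 1: impute within each operator ...
  group_by(Operator) %>%
  mutate(across(c(Aboard, Fatalities), impute_mean)) %>%
  ungroup() %>%
  # ... then fall back to the overall column mean (an assumption made here)
  mutate(across(c(Aboard, Fatalities), impute_mean)) %>%
  # Step 2a: split Date into Year, Month and Day
  mutate(
    Date  = as.Date(Date, format = "%m/%d/%Y"),
    Year  = as.integer(format(Date, "%Y")),
    Month = as.integer(format(Date, "%m")),
    Day   = as.integer(format(Date, "%d"))
  ) %>%
  # Step 2b: treat the text after the last comma as Country, the rest as Place
  mutate(
    HasComma = grepl(",", Location, fixed = TRUE),
    Place    = ifelse(HasComma, trimws(sub(",[^,]*$", "", Location)), Location),
    Country  = ifelse(HasComma, trimws(sub(".*,", "", Location)), NA_character_)
  ) %>%
  select(-HasComma)

print(cleaned)
```

Two caveats apply to this sketch. The imputed means are generally not whole numbers, so Aboard and Fatalities become numeric rather than integer. Locations in the United States end with a state name (for example "Fort Myer, Virginia"), so the last comma-separated field is not always a country and would need a further mapping step.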

Planned Analysis

  1. Which year had the maximum number of crashes?
  2. Which location in the world has seen the most airplane crashes?
  3. How do the operators rank in descending order of safety?
  4. Which type of aircraft has the best survival rate?
  5. Are there any particularly interesting trends to observe?
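
As a rough illustration of how questions 1 and 4 might be approached, the sketch below counts crashes per year and computes a survival rate per aircraft type. It runs on a hypothetical toy tibble, not the real data, and it assumes a Year column already exists from the cleaning step. Defining the survival rate as 1 - Fatalities / Aboard, aggregated over all crashes of a type, is an assumption made for this example.

```r
library(dplyr)

# Hypothetical toy rows standing in for the cleaned AirplaneCrash tibble.
crashes <- tibble(
  Year       = c(1972L, 1972L, 1972L, 1985L, 1985L),
  Type       = c("Type X", "Type Y", "Type X", "Type Y", "Type X"),
  Aboard     = c(10, 20, 30, 40, 10),
  Fatalities = c(10, 5, 20, 15, 0)
)

# Question 1: number of crashes per year, largest first
crashes_per_year <- crashes %>%
  count(Year, name = "Crashes") %>%
  arrange(desc(Crashes))

# Question 4: survival rate per aircraft type, best first.
# Rows with nobody aboard are dropped to avoid dividing by zero.
survival_by_type <- crashes %>%
  filter(!is.na(Aboard), Aboard > 0) %>%
  group_by(Type) %>%
  summarise(
    TotalAboard     = sum(Aboard),
    TotalFatalities = sum(Fatalities, na.rm = TRUE)
  ) %>%
  mutate(SurvivalRate = 1 - TotalFatalities / TotalAboard) %>%
  arrange(desc(SurvivalRate))

print(crashes_per_year)
print(survival_by_type)
```

Aggregating totals per type before dividing weights large aircraft more heavily than averaging per-crash rates would; which of the two is more appropriate is a choice to revisit during the analysis.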

*Some analyses may be added or dropped as the project proceeds.