Data Loading

Loading Data

Loading the dataframe as a csv file which is in wide format with missing values.

flights_dt <- as_tibble( read.csv("flights data.csv", stringsAsFactors = FALSE,na.strings = c("","N/A")))

Data

Display of the CSV file structure.

kable(flights_dt)
X X.1 Los.Angeles Phoenix San.Diego San.Fransisco Seattle
Alaska On time 497 221 212 503 1841
NA Delayed 62 12 20 102 305
NA NA NA NA NA NA NA
AM West On time 694 4840 383 320 201
NA Delayed 117 415 65 129 61

Data Tidying

Column names for the first two columns are missing. Therefore, it is neccesary to create two columns names in order to transform the dataframe

names(flights_dt)[1:2] <- c("Airline","Status")

Since the destination state from airlines are displayed as column names, the gather function will collapse such states in a row format. The Destination column that will be creared by gather displays the name of the state and the fl_no column displays the number of flighs for each category labeled as “On Time” or “Delayed”. In addition, it removes the empty line that divides both airlines data.

flights_data <- flights_dt %>% gather("Destination","fl_no",3:7,na.rm = TRUE)
flights_data
## # A tibble: 20 x 4
##    Airline Status  Destination   fl_no
##    <chr>   <chr>   <chr>         <int>
##  1 Alaska  On time Los.Angeles     497
##  2 <NA>    Delayed Los.Angeles      62
##  3 AM West On time Los.Angeles     694
##  4 <NA>    Delayed Los.Angeles     117
##  5 Alaska  On time Phoenix         221
##  6 <NA>    Delayed Phoenix          12
##  7 AM West On time Phoenix        4840
##  8 <NA>    Delayed Phoenix         415
##  9 Alaska  On time San.Diego       212
## 10 <NA>    Delayed San.Diego        20
## 11 AM West On time San.Diego       383
## 12 <NA>    Delayed San.Diego        65
## 13 Alaska  On time San.Fransisco   503
## 14 <NA>    Delayed San.Fransisco   102
## 15 AM West On time San.Fransisco   320
## 16 <NA>    Delayed San.Fransisco   129
## 17 Alaska  On time Seattle        1841
## 18 <NA>    Delayed Seattle         305
## 19 AM West On time Seattle         201
## 20 <NA>    Delayed Seattle          61

As shown above, Airline have missing values that correspond to the association of airlines with its respective status(each NA value should have the same airline name as the field above it). In addition, values that represent the number of flights delayed and on time should be displayed in a single row representing the destination of the number of flights and its respective status and airline.

flights_final <-  flights_data %>% 
  select(Destination,Status,fl_no) %>% # selects the destination , the status and the number of flights
  group_by(Status) %>%  # group those flights according to the status, which  causes that flights going to certain places are grouping together. Finally, A row that works as an index for the new dataframe should be added because it is needed for spreading the data eventually
  mutate(id = row_number())


flights_final
## # A tibble: 20 x 4
## # Groups:   Status [2]
##    Destination   Status  fl_no    id
##    <chr>         <chr>   <int> <int>
##  1 Los.Angeles   On time   497     1
##  2 Los.Angeles   Delayed    62     1
##  3 Los.Angeles   On time   694     2
##  4 Los.Angeles   Delayed   117     2
##  5 Phoenix       On time   221     3
##  6 Phoenix       Delayed    12     3
##  7 Phoenix       On time  4840     4
##  8 Phoenix       Delayed   415     4
##  9 San.Diego     On time   212     5
## 10 San.Diego     Delayed    20     5
## 11 San.Diego     On time   383     6
## 12 San.Diego     Delayed    65     6
## 13 San.Fransisco On time   503     7
## 14 San.Fransisco Delayed   102     7
## 15 San.Fransisco On time   320     8
## 16 San.Fransisco Delayed   129     8
## 17 Seattle       On time  1841     9
## 18 Seattle       Delayed   305     9
## 19 Seattle       On time   201    10
## 20 Seattle       Delayed    61    10

Then, the column status shown above is broken into two columns that represents the flights delayed and on time going to a specific state. Then, all the fields are selected except the field created for indexing.

flights_final <- flights_final %>% 
  spread(Status,fl_no) %>% 
  select(-id)

flights_final
## # A tibble: 10 x 3
##    Destination   Delayed `On time`
##    <chr>           <int>     <int>
##  1 Los.Angeles        62       497
##  2 Los.Angeles       117       694
##  3 Phoenix            12       221
##  4 Phoenix           415      4840
##  5 San.Diego          20       212
##  6 San.Diego          65       383
##  7 San.Fransisco     102       503
##  8 San.Fransisco     129       320
##  9 Seattle           305      1841
## 10 Seattle            61       201

Finally, since the airline column was missing, a factor which identifies all the airlines present in the data needs to be created to be added as a column for the transformed dataframe. Since the structure of the dataframe has two rows representing the states names (on individual state for one airline), the factor that contains two levels will be replicated 10 times to match the number of rows in the new dataframe.

Airlines <- flights_data$Airline %>% as.factor() %>% na.omit() %>%  levels()
flights_final <- flights_final %>% add_column(Airline = rep(Airlines,5), .before = "Destination")

Finally, the processed dataframe is shown.

kable(flights_final,align = "c")
Airline Destination Delayed On time
Alaska Los.Angeles 62 497
AM West Los.Angeles 117 694
Alaska Phoenix 12 221
AM West Phoenix 415 4840
Alaska San.Diego 20 212
AM West San.Diego 65 383
Alaska San.Fransisco 102 503
AM West San.Fransisco 129 320
Alaska Seattle 305 1841
AM West Seattle 61 201