Loading the dataframe as a csv file which is in wide format with missing values.
flights_dt <- as_tibble( read.csv("flights data.csv", stringsAsFactors = FALSE,na.strings = c("","N/A")))
Display of the CSV file structure.
kable(flights_dt)
| X | X.1 | Los.Angeles | Phoenix | San.Diego | San.Fransisco | Seattle |
|---|---|---|---|---|---|---|
| Alaska | On time | 497 | 221 | 212 | 503 | 1841 |
| NA | Delayed | 62 | 12 | 20 | 102 | 305 |
| NA | NA | NA | NA | NA | NA | NA |
| AM West | On time | 694 | 4840 | 383 | 320 | 201 |
| NA | Delayed | 117 | 415 | 65 | 129 | 61 |
Column names for the first two columns are missing. Therefore, it is neccesary to create two columns names in order to transform the dataframe
names(flights_dt)[1:2] <- c("Airline","Status")
Since the destination state from airlines are displayed as column names, the gather function will collapse such states in a row format. The Destination column that will be creared by gather displays the name of the state and the fl_no column displays the number of flighs for each category labeled as “On Time” or “Delayed”. In addition, it removes the empty line that divides both airlines data.
flights_data <- flights_dt %>% gather("Destination","fl_no",3:7,na.rm = TRUE)
flights_data
## # A tibble: 20 x 4
## Airline Status Destination fl_no
## <chr> <chr> <chr> <int>
## 1 Alaska On time Los.Angeles 497
## 2 <NA> Delayed Los.Angeles 62
## 3 AM West On time Los.Angeles 694
## 4 <NA> Delayed Los.Angeles 117
## 5 Alaska On time Phoenix 221
## 6 <NA> Delayed Phoenix 12
## 7 AM West On time Phoenix 4840
## 8 <NA> Delayed Phoenix 415
## 9 Alaska On time San.Diego 212
## 10 <NA> Delayed San.Diego 20
## 11 AM West On time San.Diego 383
## 12 <NA> Delayed San.Diego 65
## 13 Alaska On time San.Fransisco 503
## 14 <NA> Delayed San.Fransisco 102
## 15 AM West On time San.Fransisco 320
## 16 <NA> Delayed San.Fransisco 129
## 17 Alaska On time Seattle 1841
## 18 <NA> Delayed Seattle 305
## 19 AM West On time Seattle 201
## 20 <NA> Delayed Seattle 61
As shown above, Airline have missing values that correspond to the association of airlines with its respective status(each NA value should have the same airline name as the field above it). In addition, values that represent the number of flights delayed and on time should be displayed in a single row representing the destination of the number of flights and its respective status and airline.
flights_final <- flights_data %>%
select(Destination,Status,fl_no) %>% # selects the destination , the status and the number of flights
group_by(Status) %>% # group those flights according to the status, which causes that flights going to certain places are grouping together. Finally, A row that works as an index for the new dataframe should be added because it is needed for spreading the data eventually
mutate(id = row_number())
flights_final
## # A tibble: 20 x 4
## # Groups: Status [2]
## Destination Status fl_no id
## <chr> <chr> <int> <int>
## 1 Los.Angeles On time 497 1
## 2 Los.Angeles Delayed 62 1
## 3 Los.Angeles On time 694 2
## 4 Los.Angeles Delayed 117 2
## 5 Phoenix On time 221 3
## 6 Phoenix Delayed 12 3
## 7 Phoenix On time 4840 4
## 8 Phoenix Delayed 415 4
## 9 San.Diego On time 212 5
## 10 San.Diego Delayed 20 5
## 11 San.Diego On time 383 6
## 12 San.Diego Delayed 65 6
## 13 San.Fransisco On time 503 7
## 14 San.Fransisco Delayed 102 7
## 15 San.Fransisco On time 320 8
## 16 San.Fransisco Delayed 129 8
## 17 Seattle On time 1841 9
## 18 Seattle Delayed 305 9
## 19 Seattle On time 201 10
## 20 Seattle Delayed 61 10
Then, the column status shown above is broken into two columns that represents the flights delayed and on time going to a specific state. Then, all the fields are selected except the field created for indexing.
flights_final <- flights_final %>%
spread(Status,fl_no) %>%
select(-id)
flights_final
## # A tibble: 10 x 3
## Destination Delayed `On time`
## <chr> <int> <int>
## 1 Los.Angeles 62 497
## 2 Los.Angeles 117 694
## 3 Phoenix 12 221
## 4 Phoenix 415 4840
## 5 San.Diego 20 212
## 6 San.Diego 65 383
## 7 San.Fransisco 102 503
## 8 San.Fransisco 129 320
## 9 Seattle 305 1841
## 10 Seattle 61 201
Finally, since the airline column was missing, a factor which identifies all the airlines present in the data needs to be created to be added as a column for the transformed dataframe. Since the structure of the dataframe has two rows representing the states names (on individual state for one airline), the factor that contains two levels will be replicated 10 times to match the number of rows in the new dataframe.
Airlines <- flights_data$Airline %>% as.factor() %>% na.omit() %>% levels()
flights_final <- flights_final %>% add_column(Airline = rep(Airlines,5), .before = "Destination")
Finally, the processed dataframe is shown.
kable(flights_final,align = "c")
| Airline | Destination | Delayed | On time |
|---|---|---|---|
| Alaska | Los.Angeles | 62 | 497 |
| AM West | Los.Angeles | 117 | 694 |
| Alaska | Phoenix | 12 | 221 |
| AM West | Phoenix | 415 | 4840 |
| Alaska | San.Diego | 20 | 212 |
| AM West | San.Diego | 65 | 383 |
| Alaska | San.Fransisco | 102 | 503 |
| AM West | San.Fransisco | 129 | 320 |
| Alaska | Seattle | 305 | 1841 |
| AM West | Seattle | 61 | 201 |