Data607-Week-5

X	X.1	Los.Angeles	Phoenix	San.Diego	San.Fransisco	Seattle
Alaska	On time	497	221	212	503	1841
NA	Delayed	62	12	20	102	305
NA	NA	NA	NA	NA	NA	NA
AM West	On time	694	4840	383	320	201
NA	Delayed	117	415	65	129	61

Data Tidying

Column names for the first two columns are missing. Therefore, it is neccesary to create two columns names in order to transform the dataframe

names(flights_dt)[1:2] <- c("Airline","Status")

Since the destination state from airlines are displayed as column names, the gather function will collapse such states in a row format. The Destination column that will be creared by gather displays the name of the state and the fl_no column displays the number of flighs for each category labeled as “On Time” or “Delayed”. In addition, it removes the empty line that divides both airlines data.

flights_data <- flights_dt %>% gather("Destination","fl_no",3:7,na.rm = TRUE)
flights_data

## # A tibble: 20 x 4
##    Airline Status  Destination   fl_no
##    <chr>   <chr>   <chr>         <int>
##  1 Alaska  On time Los.Angeles     497
##  2 <NA>    Delayed Los.Angeles      62
##  3 AM West On time Los.Angeles     694
##  4 <NA>    Delayed Los.Angeles     117
##  5 Alaska  On time Phoenix         221
##  6 <NA>    Delayed Phoenix          12
##  7 AM West On time Phoenix        4840
##  8 <NA>    Delayed Phoenix         415
##  9 Alaska  On time San.Diego       212
## 10 <NA>    Delayed San.Diego        20
## 11 AM West On time San.Diego       383
## 12 <NA>    Delayed San.Diego        65
## 13 Alaska  On time San.Fransisco   503
## 14 <NA>    Delayed San.Fransisco   102
## 15 AM West On time San.Fransisco   320
## 16 <NA>    Delayed San.Fransisco   129
## 17 Alaska  On time Seattle        1841
## 18 <NA>    Delayed Seattle         305
## 19 AM West On time Seattle         201
## 20 <NA>    Delayed Seattle          61

As shown above, Airline have missing values that correspond to the association of airlines with its respective status(each NA value should have the same airline name as the field above it). In addition, values that represent the number of flights delayed and on time should be displayed in a single row representing the destination of the number of flights and its respective status and airline.

flights_final <-  flights_data %>% 
  select(Destination,Status,fl_no) %>% # selects the destination , the status and the number of flights
  group_by(Status) %>%  # group those flights according to the status, which  causes that flights going to certain places are grouping together. Finally, A row that works as an index for the new dataframe should be added because it is needed for spreading the data eventually
  mutate(id = row_number())


flights_final

## # A tibble: 20 x 4
## # Groups:   Status [2]
##    Destination   Status  fl_no    id
##    <chr>         <chr>   <int> <int>
##  1 Los.Angeles   On time   497     1
##  2 Los.Angeles   Delayed    62     1
##  3 Los.Angeles   On time   694     2
##  4 Los.Angeles   Delayed   117     2
##  5 Phoenix       On time   221     3
##  6 Phoenix       Delayed    12     3
##  7 Phoenix       On time  4840     4
##  8 Phoenix       Delayed   415     4
##  9 San.Diego     On time   212     5
## 10 San.Diego     Delayed    20     5
## 11 San.Diego     On time   383     6
## 12 San.Diego     Delayed    65     6
## 13 San.Fransisco On time   503     7
## 14 San.Fransisco Delayed   102     7
## 15 San.Fransisco On time   320     8
## 16 San.Fransisco Delayed   129     8
## 17 Seattle       On time  1841     9
## 18 Seattle       Delayed   305     9
## 19 Seattle       On time   201    10
## 20 Seattle       Delayed    61    10

Then, the column status shown above is broken into two columns that represents the flights delayed and on time going to a specific state. Then, all the fields are selected except the field created for indexing.

flights_final <- flights_final %>% 
  spread(Status,fl_no) %>% 
  select(-id)

flights_final

## # A tibble: 10 x 3
##    Destination   Delayed `On time`
##    <chr>           <int>     <int>
##  1 Los.Angeles        62       497
##  2 Los.Angeles       117       694
##  3 Phoenix            12       221
##  4 Phoenix           415      4840
##  5 San.Diego          20       212
##  6 San.Diego          65       383
##  7 San.Fransisco     102       503
##  8 San.Fransisco     129       320
##  9 Seattle           305      1841
## 10 Seattle            61       201

Finally, since the airline column was missing, a factor which identifies all the airlines present in the data needs to be created to be added as a column for the transformed dataframe. Since the structure of the dataframe has two rows representing the states names (on individual state for one airline), the factor that contains two levels will be replicated 10 times to match the number of rows in the new dataframe.

Airlines <- flights_data$Airline %>% as.factor() %>% na.omit() %>%  levels()
flights_final <- flights_final %>% add_column(Airline = rep(Airlines,5), .before = "Destination")

Finally, the processed dataframe is shown.

kable(flights_final,align = "c")

Airline	Destination	Delayed	On time
Alaska	Los.Angeles	62	497
AM West	Los.Angeles	117	694
Alaska	Phoenix	12	221
AM West	Phoenix	415	4840
Alaska	San.Diego	20	212
AM West	San.Diego	65	383
Alaska	San.Fransisco	102	503
AM West	San.Fransisco	129	320
Alaska	Seattle	305	1841
AM West	Seattle	61	201

Data607-Week-5

Lewris Mota

March 3, 2019

Data Loading

Loading Data

Data

Data Tidying