Assignment - Tidying and Transforming Data
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
getURL <- "https://raw.githubusercontent.com/deepakmongia/Fall2018/master/Arrival_Delays.csv"
arrival.delays.df <- read.csv(getURL, header = TRUE, sep = ",")
colnames(arrival.delays.df)[colnames(arrival.delays.df) == "X"] <- "airline"
colnames(arrival.delays.df)[colnames(arrival.delays.df) == "X.1"] <- "status"
arrival.delays.df
## airline status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
# Removing the blank lines
arrival.delays.df <- arrival.delays.df %>% filter(status == "on time" | status == "delayed")
arrival.delays.df
## airline status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 delayed 117 415 65 129 61
## airline status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
arrival.delays.df.molten <- arrival.delays.df %>% gather("destination", "frequency", 3:7)
arrival.delays.df.molten
## airline status destination frequency
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA delayed Los.Angeles 62
## 3 AM WEST on time Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
## 7 AM WEST on time Phoenix 4840
## 8 AM WEST delayed Phoenix 415
## 9 ALASKA on time San.Diego 212
## 10 ALASKA delayed San.Diego 20
## 11 AM WEST on time San.Diego 383
## 12 AM WEST delayed San.Diego 65
## 13 ALASKA on time San.Francisco 503
## 14 ALASKA delayed San.Francisco 102
## 15 AM WEST on time San.Francisco 320
## 16 AM WEST delayed San.Francisco 129
## 17 ALASKA on time Seattle 1841
## 18 ALASKA delayed Seattle 305
## 19 AM WEST on time Seattle 201
## 20 AM WEST delayed Seattle 61
arrival.delays.df.tidy <- arrival.delays.df.molten[, c(1,3,2,4)]
arrival.delays.df.tidy
## airline destination status frequency
## 1 ALASKA Los.Angeles on time 497
## 2 ALASKA Los.Angeles delayed 62
## 3 AM WEST Los.Angeles on time 694
## 4 AM WEST Los.Angeles delayed 117
## 5 ALASKA Phoenix on time 221
## 6 ALASKA Phoenix delayed 12
## 7 AM WEST Phoenix on time 4840
## 8 AM WEST Phoenix delayed 415
## 9 ALASKA San.Diego on time 212
## 10 ALASKA San.Diego delayed 20
## 11 AM WEST San.Diego on time 383
## 12 AM WEST San.Diego delayed 65
## 13 ALASKA San.Francisco on time 503
## 14 ALASKA San.Francisco delayed 102
## 15 AM WEST San.Francisco on time 320
## 16 AM WEST San.Francisco delayed 129
## 17 ALASKA Seattle on time 1841
## 18 ALASKA Seattle delayed 305
## 19 AM WEST Seattle on time 201
## 20 AM WEST Seattle delayed 61
Now if we see the above data set, it is a tidy one, as each column is a variable, and each row is an observation.
We will be doing some analysis now on this tidy data set.
## # A tibble: 20 x 4
## # Groups: airline, destination, status [20]
## airline destination status frequency
## <fct> <chr> <fct> <int>
## 1 ALASKA Los.Angeles on time 497
## 2 ALASKA Los.Angeles delayed 62
## 3 AM WEST Los.Angeles on time 694
## 4 AM WEST Los.Angeles delayed 117
## 5 ALASKA Phoenix on time 221
## 6 ALASKA Phoenix delayed 12
## 7 AM WEST Phoenix on time 4840
## 8 AM WEST Phoenix delayed 415
## 9 ALASKA San.Diego on time 212
## 10 ALASKA San.Diego delayed 20
## 11 AM WEST San.Diego on time 383
## 12 AM WEST San.Diego delayed 65
## 13 ALASKA San.Francisco on time 503
## 14 ALASKA San.Francisco delayed 102
## 15 AM WEST San.Francisco on time 320
## 16 AM WEST San.Francisco delayed 129
## 17 ALASKA Seattle on time 1841
## 18 ALASKA Seattle delayed 305
## 19 AM WEST Seattle on time 201
## 20 AM WEST Seattle delayed 61
## # A tibble: 20 x 4
## # Groups: airline, destination [?]
## airline destination status count
## <fct> <chr> <fct> <int>
## 1 ALASKA Los.Angeles delayed 62
## 2 ALASKA Los.Angeles on time 497
## 3 ALASKA Phoenix delayed 12
## 4 ALASKA Phoenix on time 221
## 5 ALASKA San.Diego delayed 20
## 6 ALASKA San.Diego on time 212
## 7 ALASKA San.Francisco delayed 102
## 8 ALASKA San.Francisco on time 503
## 9 ALASKA Seattle delayed 305
## 10 ALASKA Seattle on time 1841
## 11 AM WEST Los.Angeles delayed 117
## 12 AM WEST Los.Angeles on time 694
## 13 AM WEST Phoenix delayed 415
## 14 AM WEST Phoenix on time 4840
## 15 AM WEST San.Diego delayed 65
## 16 AM WEST San.Diego on time 383
## 17 AM WEST San.Francisco delayed 129
## 18 AM WEST San.Francisco on time 320
## 19 AM WEST Seattle delayed 61
## 20 AM WEST Seattle on time 201
## # A tibble: 10 x 4
## # Groups: airline, destination [5]
## airline destination status count
## <fct> <chr> <fct> <int>
## 1 ALASKA Los.Angeles delayed 62
## 2 ALASKA Los.Angeles on time 497
## 3 ALASKA Phoenix delayed 12
## 4 ALASKA Phoenix on time 221
## 5 ALASKA San.Diego delayed 20
## 6 ALASKA San.Diego on time 212
## 7 ALASKA San.Francisco delayed 102
## 8 ALASKA San.Francisco on time 503
## 9 ALASKA Seattle delayed 305
## 10 ALASKA Seattle on time 1841
## # A tibble: 10 x 4
## # Groups: airline, destination [5]
## airline destination status count
## <fct> <chr> <fct> <int>
## 1 AM WEST Los.Angeles delayed 117
## 2 AM WEST Los.Angeles on time 694
## 3 AM WEST Phoenix delayed 415
## 4 AM WEST Phoenix on time 4840
## 5 AM WEST San.Diego delayed 65
## 6 AM WEST San.Diego on time 383
## 7 AM WEST San.Francisco delayed 129
## 8 AM WEST San.Francisco on time 320
## 9 AM WEST Seattle delayed 61
## 10 AM WEST Seattle on time 201
Counts break-up of the 2 airlines for city wise on-time and delayed flight counts
Percentage break-ups of the 2 airlines, ratio wise break-ups for each city
Conclusion:
From the last 2 graphs - ratio break-ups for each city signifies that for both the airlines, each city has almost similar trend for the 2 airlies when the ratio of the delayed to the on-time arrivals are considered.
However, the it is also quite visible from the graphs that AM West Airlines also has more chance of a delay as compared to the Alaska Airlines for any of the 5 given cities for which the readings have been considered for this analysis.