Assignment - Tidying and Transforming Data

  1. Loading the required packages
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
  1. Reading the file into a data.frame from the Raw Github link
getURL <- "https://raw.githubusercontent.com/deepakmongia/Fall2018/master/Arrival_Delays.csv"
arrival.delays.df <- read.csv(getURL, header = TRUE, sep = ",")
colnames(arrival.delays.df)[colnames(arrival.delays.df) == "X"] <- "airline"
colnames(arrival.delays.df)[colnames(arrival.delays.df) == "X.1"] <- "status"
arrival.delays.df
##   airline  status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61
# Removing the blank lines
arrival.delays.df <- arrival.delays.df %>% filter(status == "on time" | status == "delayed")
arrival.delays.df
##   airline  status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4         delayed         117     415        65           129      61
  1. Replacing the blank values in the column - airlines with their flight name
##   airline  status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61
arrival.delays.df.molten <- arrival.delays.df %>% gather("destination", "frequency", 3:7)
arrival.delays.df.molten
##    airline  status   destination frequency
## 1   ALASKA on time   Los.Angeles       497
## 2   ALASKA delayed   Los.Angeles        62
## 3  AM WEST on time   Los.Angeles       694
## 4  AM WEST delayed   Los.Angeles       117
## 5   ALASKA on time       Phoenix       221
## 6   ALASKA delayed       Phoenix        12
## 7  AM WEST on time       Phoenix      4840
## 8  AM WEST delayed       Phoenix       415
## 9   ALASKA on time     San.Diego       212
## 10  ALASKA delayed     San.Diego        20
## 11 AM WEST on time     San.Diego       383
## 12 AM WEST delayed     San.Diego        65
## 13  ALASKA on time San.Francisco       503
## 14  ALASKA delayed San.Francisco       102
## 15 AM WEST on time San.Francisco       320
## 16 AM WEST delayed San.Francisco       129
## 17  ALASKA on time       Seattle      1841
## 18  ALASKA delayed       Seattle       305
## 19 AM WEST on time       Seattle       201
## 20 AM WEST delayed       Seattle        61
arrival.delays.df.tidy <- arrival.delays.df.molten[, c(1,3,2,4)]
arrival.delays.df.tidy
##    airline   destination  status frequency
## 1   ALASKA   Los.Angeles on time       497
## 2   ALASKA   Los.Angeles delayed        62
## 3  AM WEST   Los.Angeles on time       694
## 4  AM WEST   Los.Angeles delayed       117
## 5   ALASKA       Phoenix on time       221
## 6   ALASKA       Phoenix delayed        12
## 7  AM WEST       Phoenix on time      4840
## 8  AM WEST       Phoenix delayed       415
## 9   ALASKA     San.Diego on time       212
## 10  ALASKA     San.Diego delayed        20
## 11 AM WEST     San.Diego on time       383
## 12 AM WEST     San.Diego delayed        65
## 13  ALASKA San.Francisco on time       503
## 14  ALASKA San.Francisco delayed       102
## 15 AM WEST San.Francisco on time       320
## 16 AM WEST San.Francisco delayed       129
## 17  ALASKA       Seattle on time      1841
## 18  ALASKA       Seattle delayed       305
## 19 AM WEST       Seattle on time       201
## 20 AM WEST       Seattle delayed        61

Now if we see the above data set, it is a tidy one, as each column is a variable, and each row is an observation.

We will be doing some analysis now on this tidy data set.

## # A tibble: 20 x 4
## # Groups:   airline, destination, status [20]
##    airline destination   status  frequency
##    <fct>   <chr>         <fct>       <int>
##  1 ALASKA  Los.Angeles   on time       497
##  2 ALASKA  Los.Angeles   delayed        62
##  3 AM WEST Los.Angeles   on time       694
##  4 AM WEST Los.Angeles   delayed       117
##  5 ALASKA  Phoenix       on time       221
##  6 ALASKA  Phoenix       delayed        12
##  7 AM WEST Phoenix       on time      4840
##  8 AM WEST Phoenix       delayed       415
##  9 ALASKA  San.Diego     on time       212
## 10 ALASKA  San.Diego     delayed        20
## 11 AM WEST San.Diego     on time       383
## 12 AM WEST San.Diego     delayed        65
## 13 ALASKA  San.Francisco on time       503
## 14 ALASKA  San.Francisco delayed       102
## 15 AM WEST San.Francisco on time       320
## 16 AM WEST San.Francisco delayed       129
## 17 ALASKA  Seattle       on time      1841
## 18 ALASKA  Seattle       delayed       305
## 19 AM WEST Seattle       on time       201
## 20 AM WEST Seattle       delayed        61
## # A tibble: 20 x 4
## # Groups:   airline, destination [?]
##    airline destination   status  count
##    <fct>   <chr>         <fct>   <int>
##  1 ALASKA  Los.Angeles   delayed    62
##  2 ALASKA  Los.Angeles   on time   497
##  3 ALASKA  Phoenix       delayed    12
##  4 ALASKA  Phoenix       on time   221
##  5 ALASKA  San.Diego     delayed    20
##  6 ALASKA  San.Diego     on time   212
##  7 ALASKA  San.Francisco delayed   102
##  8 ALASKA  San.Francisco on time   503
##  9 ALASKA  Seattle       delayed   305
## 10 ALASKA  Seattle       on time  1841
## 11 AM WEST Los.Angeles   delayed   117
## 12 AM WEST Los.Angeles   on time   694
## 13 AM WEST Phoenix       delayed   415
## 14 AM WEST Phoenix       on time  4840
## 15 AM WEST San.Diego     delayed    65
## 16 AM WEST San.Diego     on time   383
## 17 AM WEST San.Francisco delayed   129
## 18 AM WEST San.Francisco on time   320
## 19 AM WEST Seattle       delayed    61
## 20 AM WEST Seattle       on time   201
## # A tibble: 10 x 4
## # Groups:   airline, destination [5]
##    airline destination   status  count
##    <fct>   <chr>         <fct>   <int>
##  1 ALASKA  Los.Angeles   delayed    62
##  2 ALASKA  Los.Angeles   on time   497
##  3 ALASKA  Phoenix       delayed    12
##  4 ALASKA  Phoenix       on time   221
##  5 ALASKA  San.Diego     delayed    20
##  6 ALASKA  San.Diego     on time   212
##  7 ALASKA  San.Francisco delayed   102
##  8 ALASKA  San.Francisco on time   503
##  9 ALASKA  Seattle       delayed   305
## 10 ALASKA  Seattle       on time  1841
## # A tibble: 10 x 4
## # Groups:   airline, destination [5]
##    airline destination   status  count
##    <fct>   <chr>         <fct>   <int>
##  1 AM WEST Los.Angeles   delayed   117
##  2 AM WEST Los.Angeles   on time   694
##  3 AM WEST Phoenix       delayed   415
##  4 AM WEST Phoenix       on time  4840
##  5 AM WEST San.Diego     delayed    65
##  6 AM WEST San.Diego     on time   383
##  7 AM WEST San.Francisco delayed   129
##  8 AM WEST San.Francisco on time   320
##  9 AM WEST Seattle       delayed    61
## 10 AM WEST Seattle       on time   201

Counts break-up of the 2 airlines for city wise on-time and delayed flight counts

Percentage break-ups of the 2 airlines, ratio wise break-ups for each city

Conclusion:

  1. From the last 2 graphs - ratio break-ups for each city signifies that for both the airlines, each city has almost similar trend for the 2 airlies when the ratio of the delayed to the on-time arrivals are considered.

  2. However, the it is also quite visible from the graphs that AM West Airlines also has more chance of a delay as compared to the Alaska Airlines for any of the 5 given cities for which the readings have been considered for this analysis.