This is the Markdown file for the my third homework assignment. The problem statement is as follows:

  1. Before we begin, let’s get libraries tidyr and dplyr, which we’ll use.
## Warning: package 'tidyr' was built under R version 3.6.3
## Warning: package 'dplyr' was built under R version 3.6.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Problem statement synopsized here: Format of a CSV file is given, from which, I created a CSV files, called “Airlines_Performance.csv”. I am supposed to read into R, and use tidyr & dplyr, to transform the data. Then, perform analysis to compare arrival and delays of two airlines.

  1. Yank file into program: After locally creating Airlines_Performance.csv, I uploaded it to my github, and yanked it from there. Github link, for the raw file is given below:

    https://raw.githubusercontent.com/ShovanBiswas/DATA607/master/05/Airlines_Performance.csv

I scraped it from here.

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61
  1. We’ll omit rows and renumber rows (because dropped rows create a gap). name columns 1 and 2, omit NA rows and renumber rows (because dropped rows create a gap), and fill gaps in columns 1 and 2.
  1. Now, we’ll name columns 1 and 2, and then insert names of airlines, in 2nd and 4tn rows of columns 1.
##   Airlines Delay_vs_On_time Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1   ALASKA          on time         497     221       212           503    1841
## 2   ALASKA          delayed          62      12        20           102     305
## 3  AM WEST          on time         694    4840       383           320     201
## 4  AM WEST          delayed         117     415        65           129      61
  1. Now, this dataframe is ready for processing. But, we can make the cities Los.Angeles, Phoenix, San.Diego, San.Francisco and Seattle as values of a variable Location. So, we’ll gather the dataframe.
##    Airlines Delay_vs_On_time     Locations No_of_flts
## 1    ALASKA          on time   Los.Angeles        497
## 2    ALASKA          delayed   Los.Angeles         62
## 3   AM WEST          on time   Los.Angeles        694
## 4   AM WEST          delayed   Los.Angeles        117
## 5    ALASKA          on time       Phoenix        221
## 6    ALASKA          delayed       Phoenix         12
## 7   AM WEST          on time       Phoenix       4840
## 8   AM WEST          delayed       Phoenix        415
## 9    ALASKA          on time     San.Diego        212
## 10   ALASKA          delayed     San.Diego         20
## 11  AM WEST          on time     San.Diego        383
## 12  AM WEST          delayed     San.Diego         65
## 13   ALASKA          on time San.Francisco        503
## 14   ALASKA          delayed San.Francisco        102
## 15  AM WEST          on time San.Francisco        320
## 16  AM WEST          delayed San.Francisco        129
## 17   ALASKA          on time       Seattle       1841
## 18   ALASKA          delayed       Seattle        305
## 19  AM WEST          on time       Seattle        201
## 20  AM WEST          delayed       Seattle         61
  1. Now, we are ready, for some analysis: Computing total number of flights. Subsequently, we’ll use this value.
## [1] 11000
  1. Group level summary at the level of (Airlines, Delay_vs_On_time): Computing total number of flights, and percentage of delays and “on time”, with respect to (wrt) tot_no_of_flts, at group level.
## `summarise()` regrouping output by 'Airlines' (override with `.groups` argument)
## # A tibble: 4 x 4
## # Groups:   Airlines [2]
##   Airlines Delay_vs_On_time No_of_flts_in_grp Perc_of_total
##   <chr>    <chr>                        <int>         <dbl>
## 1 ALASKA   delayed                        501          4.55
## 2 ALASKA   on time                       3274         29.8 
## 3 AM WEST  delayed                        787          7.15
## 4 AM WEST  on time                       6438         58.5

Comment: The report shows that AM WEST has much more flights “on time” than ALASKA airlines.

  1. Group level summary at the level of Airlines: Computing total number of flights, at group level. Subsequently, we’ll use this value.
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   Airlines Grp_tot
##   <chr>      <int>
## 1 ALASKA      3775
## 2 AM WEST     7225
  1. We’ll join Arlns_grps with Arlns_tots, on “Airlines”.
##   Airlines Delay_vs_On_time No_of_flts_in_grp Grp_tot Grp_prc Perc_of_total
## 1   ALASKA          delayed               501    3775   13.27          4.55
## 2   ALASKA          on time              3274    3775   86.73         29.76
## 3  AM WEST          delayed               787    7225   10.89          7.15
## 4  AM WEST          on time              6438    7225   89.11         58.53

Comment: The report clearly shows that AM WEST (89.11%) is only 2.38% better on flights “on time” than ALASKA airlines.

  1. We spread air_tbl.df.tidy, and use it differently.
##    Airlines     Locations delayed on time
## 1    ALASKA   Los.Angeles      62     497
## 2    ALASKA       Phoenix      12     221
## 3    ALASKA     San.Diego      20     212
## 4    ALASKA San.Francisco     102     503
## 5    ALASKA       Seattle     305    1841
## 6   AM WEST   Los.Angeles     117     694
## 7   AM WEST       Phoenix     415    4840
## 8   AM WEST     San.Diego      65     383
## 9   AM WEST San.Francisco     129     320
## 10  AM WEST       Seattle      61     201
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
##   Airlines tot_dlyd_grp tot_ontm_grp
##   <chr>           <dbl>        <dbl>
## 1 ALASKA            501         3274
## 2 AM WEST           787         6438
## # A tibble: 2 x 4
##   Airlines tot_dlyd_grp tot_ontm_grp Grp_tot
##   <chr>           <dbl>        <dbl>   <int>
## 1 ALASKA            501         3274    3775
## 2 AM WEST           787         6438    7225
## # A tibble: 2 x 6
##   Airlines tot_dlyd_grp tot_ontm_grp Grp_tot prc_dlyd_grp prc_otime_grp
##   <chr>           <dbl>        <dbl>   <int>        <dbl>         <dbl>
## 1 ALASKA            501         3274    3775         13.3          86.7
## 2 AM WEST           787         6438    7225         10.9          89.1

Comment: By spreading, we get a different view of the analysis.

Marker: 607-05