Dataframe Transformation
This is the Markdown file for the my third homework assignment. The problem statement is as follows:
- Before we begin, let’s get libraries tidyr and dplyr, which we’ll use.
## Warning: package 'tidyr' was built under R version 3.6.3
## Warning: package 'dplyr' was built under R version 3.6.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Problem statement synopsized here: Format of a CSV file is given, from which, I created a CSV files, called “Airlines_Performance.csv”. I am supposed to read into R, and use tidyr & dplyr, to transform the data. Then, perform analysis to compare arrival and delays of two airlines.
Yank file into program: After locally creating Airlines_Performance.csv, I uploaded it to my github, and yanked it from there. Github link, for the raw file is given below:
https://raw.githubusercontent.com/ShovanBiswas/DATA607/master/05/Airlines_Performance.csv
I scraped it from here.
url <- "https://raw.githubusercontent.com/ShovanBiswas/DATA607/master/05/Airlines_Performance.csv?token=ADFB4WQBJZLFQFHMZ46VLE2742XQQ"
air_tbl <- read.table(url, sep = ',', header = TRUE, stringsAsFactors = FALSE)
air_tbl## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
- We’ll omit rows and renumber rows (because dropped rows create a gap). name columns 1 and 2, omit NA rows and renumber rows (because dropped rows create a gap), and fill gaps in columns 1 and 2.
air_tbl.df <- data.frame(air_tbl %>% drop_na())
row.names(air_tbl.df) <- 1:nrow(air_tbl.df)
air_tbl.df <- data.frame(air_tbl %>% drop_na())- Now, we’ll name columns 1 and 2, and then insert names of airlines, in 2nd and 4tn rows of columns 1.
names(air_tbl.df)[1] <- "Airlines"
names(air_tbl.df)[2] <- "Delay_vs_On_time"
air_tbl.df$Airlines[[2]] <- "ALASKA"
air_tbl.df$Airlines[[4]] <- "AM WEST"
air_tbl.df## Airlines Delay_vs_On_time Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
- Now, this dataframe is ready for processing. But, we can make the cities Los.Angeles, Phoenix, San.Diego, San.Francisco and Seattle as values of a variable Location. So, we’ll gather the dataframe.
air_tbl.df.tidy <- air_tbl.df %>% gather(3:7, key = "Locations", value = "No_of_flts")
air_tbl.df.tidy## Airlines Delay_vs_On_time Locations No_of_flts
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA delayed Los.Angeles 62
## 3 AM WEST on time Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
## 7 AM WEST on time Phoenix 4840
## 8 AM WEST delayed Phoenix 415
## 9 ALASKA on time San.Diego 212
## 10 ALASKA delayed San.Diego 20
## 11 AM WEST on time San.Diego 383
## 12 AM WEST delayed San.Diego 65
## 13 ALASKA on time San.Francisco 503
## 14 ALASKA delayed San.Francisco 102
## 15 AM WEST on time San.Francisco 320
## 16 AM WEST delayed San.Francisco 129
## 17 ALASKA on time Seattle 1841
## 18 ALASKA delayed Seattle 305
## 19 AM WEST on time Seattle 201
## 20 AM WEST delayed Seattle 61
- Now, we are ready, for some analysis: Computing total number of flights. Subsequently, we’ll use this value.
## [1] 11000
- Group level summary at the level of (Airlines, Delay_vs_On_time): Computing total number of flights, and percentage of delays and “on time”, with respect to (wrt) tot_no_of_flts, at group level.
Arlns_grps <- mutate( summarise(group_by(air_tbl.df.tidy, Airlines, Delay_vs_On_time),
No_of_flts_in_grp = sum(No_of_flts)),
Perc_of_total = round((No_of_flts_in_grp / tot_no_of_flts ) * 100, 2) )## `summarise()` regrouping output by 'Airlines' (override with `.groups` argument)
## # A tibble: 4 x 4
## # Groups: Airlines [2]
## Airlines Delay_vs_On_time No_of_flts_in_grp Perc_of_total
## <chr> <chr> <int> <dbl>
## 1 ALASKA delayed 501 4.55
## 2 ALASKA on time 3274 29.8
## 3 AM WEST delayed 787 7.15
## 4 AM WEST on time 6438 58.5
Comment: The report shows that AM WEST has much more flights “on time” than ALASKA airlines.
- Group level summary at the level of Airlines: Computing total number of flights, at group level. Subsequently, we’ll use this value.
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## Airlines Grp_tot
## <chr> <int>
## 1 ALASKA 3775
## 2 AM WEST 7225
- We’ll join Arlns_grps with Arlns_tots, on “Airlines”.
analysis_df <- mutate( inner_join(Arlns_grps, Arlns_tots, by = "Airlines"), Grp_prc = round((No_of_flts_in_grp / Grp_tot) * 100, 2) )
analysis_df <- data.frame(analysis_df$Airlines, analysis_df$Delay_vs_On_time, analysis_df$No_of_flts_in_grp, analysis_df$Grp_tot, analysis_df$Grp_prc, analysis_df$Perc_of_total)
names(analysis_df) <- c("Airlines", "Delay_vs_On_time", "No_of_flts_in_grp", "Grp_tot", "Grp_prc", "Perc_of_total")
analysis_df## Airlines Delay_vs_On_time No_of_flts_in_grp Grp_tot Grp_prc Perc_of_total
## 1 ALASKA delayed 501 3775 13.27 4.55
## 2 ALASKA on time 3274 3775 86.73 29.76
## 3 AM WEST delayed 787 7225 10.89 7.15
## 4 AM WEST on time 6438 7225 89.11 58.53
Comment: The report clearly shows that AM WEST (89.11%) is only 2.38% better on flights “on time” than ALASKA airlines.
- We spread air_tbl.df.tidy, and use it differently.
air_tbl.df.spread <- air_tbl.df.tidy %>% spread(2, key = "Delay_vs_On_time", value = "No_of_flts")
air_tbl.df.spread## Airlines Locations delayed on time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
## Airlines tot_dlyd_grp tot_ontm_grp
## <chr> <dbl> <dbl>
## 1 ALASKA 501 3274
## 2 AM WEST 787 6438
air_tbl.df.spread.sumry.join <- inner_join(air_tbl.df.spread.sumry, Arlns_tots, by = "Airlines")
air_tbl.df.spread.sumry.join## # A tibble: 2 x 4
## Airlines tot_dlyd_grp tot_ontm_grp Grp_tot
## <chr> <dbl> <dbl> <int>
## 1 ALASKA 501 3274 3775
## 2 AM WEST 787 6438 7225
## # A tibble: 2 x 6
## Airlines tot_dlyd_grp tot_ontm_grp Grp_tot prc_dlyd_grp prc_otime_grp
## <chr> <dbl> <dbl> <int> <dbl> <dbl>
## 1 ALASKA 501 3274 3775 13.3 86.7
## 2 AM WEST 787 6438 7225 10.9 89.1
Comment: By spreading, we get a different view of the analysis.
Marker: 607-05