Overview

The chart above describes arrival delays for two airlines across five destinations. Your task is to:

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above.
- You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into\(\ R\), and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines.
Your code should be in an\(\ R \ Markdown\) file, posted to RPUBS, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:
- The URL to the .Rmd file in your GitHub repository. and
- The URL for your rpubs.com web page.

Import Data

Data is imported from the .csv file located in my github Week 5 folder. Function read.csv from the utils package is used.

arrivals_delays <- read.csv(url, as.is = TRUE)

arrivals_delays
X	X.1	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
ALASKA	on time	497	221	212	503	1,841
	delayed	62	12	20	102	305
		NA		NA	NA
AM WEST	on time	694	4,840	383	320	201
	delayed	117	415	65	129	61
NOTE:
The present table reflects the NULL cells and unlabeled columns in the .csv file

Cleanup

The above data set has a few problems that need correcting including:

Removal of row comprised of empty cells () and NA
Relabeling of column names for functional use
Inserting logical data, for blank cells in column 1.

`NA` Removal

The only row requiring removal is row 3. This is done using brackets [ ] in conjunction with - symbol to specify removal of the selected row.

(arrivals_delays <- arrivals_delays[-3,])

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503   1,841
## 2         delayed          62      12        20           102     305
## 4 AM WEST on time         694   4,840       383           320     201
## 5         delayed         117     415        65           129      61

Relabeling of `colnames`

Since all but two columns are labeled correctly, a list of names is best to insert into the data frame.

df_names <-
  c("Airline", "Arrival", "Los Angeles", "Phoenix",
    "San Diego", "San Francisco", "Seattle")

colnames(arrivals_delays) <- df_names

##   Airline Arrival Los Angeles Phoenix San Diego San Francisco Seattle
## 1  ALASKA on time         497     221       212           503   1,841
## 2         delayed          62      12        20           102     305
## 4 AM WEST on time         694   4,840       383           320     201
## 5         delayed         117     415        65           129      61

Inserting Data

The most logical data to place inside the empty cells in column one, is the Airline being referenced. Rather than manually type in the name I chose to reference the Airline name in the row before the blank one. Had this been a larger data set a while loop may have been a helpful tool.

arrivals_delays[2,1]<-arrivals_delays[1,1]
arrivals_delays[4,1]<-arrivals_delays[3,1]

arrivals_delays
Airline	Arrival	Los Angeles	Phoenix	San Diego	San Francisco	Seattle
ALASKA	on time	497	221	212	503	1,841
ALASKA	delayed	62	12	20	102	305
AM WEST	on time	694	4,840	383	320	201
AM WEST	delayed	117	415	65	129	61

Transforming columns and rows

Gather()

Using gather() function along with pipe %>% I reorganize the data so the counts and cities are accounted for on two columns.

key is the name of the new key or column holding the previous column names
value is the name of the column holding the values associated with said key values.

(arrivals_delays_steps<-arrivals_delays %>%
  gather(key = "City",value = "Count", 3:7))

##    Airline Arrival          City Count
## 1   ALASKA on time   Los Angeles   497
## 2   ALASKA delayed   Los Angeles    62
## 3  AM WEST on time   Los Angeles   694
## 4  AM WEST delayed   Los Angeles   117
## 5   ALASKA on time       Phoenix   221
## 6   ALASKA delayed       Phoenix    12
## 7  AM WEST on time       Phoenix 4,840
## 8  AM WEST delayed       Phoenix   415
## 9   ALASKA on time     San Diego   212
## 10  ALASKA delayed     San Diego    20
## 11 AM WEST on time     San Diego   383
## 12 AM WEST delayed     San Diego    65
## 13  ALASKA on time San Francisco   503
## 14  ALASKA delayed San Francisco   102
## 15 AM WEST on time San Francisco   320
## 16 AM WEST delayed San Francisco   129
## 17  ALASKA on time       Seattle 1,841
## 18  ALASKA delayed       Seattle   305
## 19 AM WEST on time       Seattle   201
## 20 AM WEST delayed       Seattle    61

Spread()

Using spread() function

key is the column name or positions we are referencing.
values are still the values affiliated with the key, but graphically the reference is lateral. Technically the shared relationship for these options is a single row.
Columns \(\ 2 \ \&\ 4\) are referenced. e.g.¹

(arrivals_delays_steps<-arrivals_delays_steps %>%
  spread(key = 2,value = 4))

##    Airline          City delayed on time
## 1   ALASKA   Los Angeles      62     497
## 2   ALASKA       Phoenix      12     221
## 3   ALASKA     San Diego      20     212
## 4   ALASKA San Francisco     102     503
## 5   ALASKA       Seattle     305   1,841
## 6  AM WEST   Los Angeles     117     694
## 7  AM WEST       Phoenix     415   4,840
## 8  AM WEST     San Diego      65     383
## 9  AM WEST San Francisco     129     320
## 10 AM WEST       Seattle      61     201

Arrange()

Using the arrange() function I sort the data by city

arrivals_delays_steps %>%
  arrange(City)

##    Airline          City delayed on time
## 1   ALASKA   Los Angeles      62     497
## 2  AM WEST   Los Angeles     117     694
## 3   ALASKA       Phoenix      12     221
## 4  AM WEST       Phoenix     415   4,840
## 5   ALASKA     San Diego      20     212
## 6  AM WEST     San Diego      65     383
## 7   ALASKA San Francisco     102     503
## 8  AM WEST San Francisco     129     320
## 9   ALASKA       Seattle     305   1,841
## 10 AM WEST       Seattle      61     201

Single command

Alternatively all step can occur at once

(arrivals_delays<-arrivals_delays %>%
  gather(key = "City",value = "Count", 3:7)%>%
    spread(key = 2,value = 4) %>%
        arrange(City))

##    Airline          City delayed on time
## 1   ALASKA   Los Angeles      62     497
## 2  AM WEST   Los Angeles     117     694
## 3   ALASKA       Phoenix      12     221
## 4  AM WEST       Phoenix     415   4,840
## 5   ALASKA     San Diego      20     212
## 6  AM WEST     San Diego      65     383
## 7   ALASKA San Francisco     102     503
## 8  AM WEST San Francisco     129     320
## 9   ALASKA       Seattle     305   1,841
## 10 AM WEST       Seattle      61     201

Calculations

The Calculations are as follows:

Total Flights = delayed + on time values by row.
The for loop traverses each row, calculating Total Flights and storing it in the corresponding row.
Delay Rate and On Time Rate are calculated by dividing entire columns:
- delayed/Total Flights
- on time/Total Flights

# for loop to create totals
for(row in 1:nrow(arrivals_delays))
  arrivals_delays$`Total Flights`[row]<-
  as.numeric(arrivals_delays[row,3])+
  as.numeric(gsub(",","",arrivals_delays[row,4]))

#outside of for loop calculations for rates
  arrivals_delays$`Delay Rate`<-
    round( as.numeric( arrivals_delays$delayed ) / arrivals_delays$`Total Flights`,2 ) 

#arrival rates calculations
  arrivals_delays$`On Time Rate`<-
    round( as.numeric( gsub(",","",arrivals_delays$`on time` )) / arrivals_delays$`Total Flights`,2 )

arrivals_delays
Airline	City	delayed	on time	Total Flights	Delay Rate	On Time Rate
ALASKA	Los Angeles	62	497	559	0.11	0.89
AM WEST	Los Angeles	117	694	811	0.14	0.86
ALASKA	Phoenix	12	221	233	0.05	0.95
AM WEST	Phoenix	415	4,840	5255	0.08	0.92
ALASKA	San Diego	20	212	232	0.09	0.91
AM WEST	San Diego	65	383	448	0.15	0.85
ALASKA	San Francisco	102	503	605	0.17	0.83
AM WEST	San Francisco	129	320	449	0.29	0.71
ALASKA	Seattle	305	1,841	2146	0.14	0.86
AM WEST	Seattle	61	201	262	0.23	0.77

Conclusion

The summary below and the corresponding line graphs show, AM West varies greatly on its delay and on time rate by city, specifically regarding San Francisco. ALASKA airlines, however, fluctuates less and is consistently on time more with less delays on all major cities compared to AM WEST.

## # A tibble: 2 x 5
##   Airline `Avg Delay` `Avg Delay Rate` `On Time Avg` `On Time Avg Rate`
## * <chr>         <dbl>            <dbl>         <dbl>              <dbl>
## 1 ALASKA         100.            0.112          655.              0.888
## 2 AM WEST        157.            0.178         1288.              0.822

arrivals_delays[2] and arrivals_delays[4] for columns 2 & 4 respectively:↩︎

DATA607 Week 5 Assignment

Gabriel Campos

March 05 2021

Overview

Import Data

Cleanup

`NA` Removal

Relabeling of `colnames`

Inserting Data

Transforming columns and rows

Gather()

Spread()

Arrange()

Single command

Calculations

Conclusion

DATA607 Week 5 Assignment

Gabriel Campos

March 05 2021

Overview

Import Data

Cleanup

NA Removal

Relabeling of colnames

Inserting Data

Transforming columns and rows

Gather()

Spread()

Arrange()

Single command

Calculations

Conclusion

`NA` Removal

Relabeling of `colnames`