(1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.

I made a CSV file using MS Note Pad and saved it on GitHub.

(2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

Here I use the read.csv function to read in my csv file. It is VERY untidy.

untidy <- read.csv("https://raw.githubusercontent.com/Godbero/CUNY-MSDA-IS607/master/week6.csv", header = TRUE, sep = "/", row.names = NULL, stringsAsFactors = FALSE)
untidy
##   row.names       X     X.1 Los.Angeles Phoenix San.Diego San.Francisco
## 1            ALASKA on time         497     221       212           503
## 2                   delayed          62      12        20           102
## 3                                    NA                NA            NA
## 4           AM WEST on time         694   4,840       383           320
## 5                   delayed         117     415        65           129
##   Seattle
## 1   1,841
## 2     305
## 3        
## 4     201
## 5      61

I begin to tidy the dataframe I read in by assigning column names, getting rid of an extra column, and getting rid of an empty row. I also filled in the airline name on the rows where it wasn’t repeated.

untidy <- subset(untidy, c(TRUE, TRUE, FALSE, TRUE, TRUE), 2:8)
names(untidy)[names(untidy) == "X"] <- "Airline"
names(untidy)[names(untidy) == "X.1"] <- "Arrival"
untidy[2, 1] <- "ALASKA"
untidy[4, 1] <- "AM WEST"
untidy
##   Airline Arrival Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503   1,841
## 2  ALASKA delayed          62      12        20           102     305
## 4 AM WEST on time         694   4,840       383           320     201
## 5 AM WEST delayed         117     415        65           129      61

Now that the data frame is a little less untidy or at least easier to read I am ready to tidy it up. My goal is to make a 4-column dataframe with Airline, Arrival, City, and Count.

library(tidyr)
tidy <- gather(untidy, "City", "Count", 3:7)
tidy
##    Airline Arrival          City Count
## 1   ALASKA on time   Los.Angeles   497
## 2   ALASKA delayed   Los.Angeles    62
## 3  AM WEST on time   Los.Angeles   694
## 4  AM WEST delayed   Los.Angeles   117
## 5   ALASKA on time       Phoenix   221
## 6   ALASKA delayed       Phoenix    12
## 7  AM WEST on time       Phoenix 4,840
## 8  AM WEST delayed       Phoenix   415
## 9   ALASKA on time     San.Diego   212
## 10  ALASKA delayed     San.Diego    20
## 11 AM WEST on time     San.Diego   383
## 12 AM WEST delayed     San.Diego    65
## 13  ALASKA on time San.Francisco   503
## 14  ALASKA delayed San.Francisco   102
## 15 AM WEST on time San.Francisco   320
## 16 AM WEST delayed San.Francisco   129
## 17  ALASKA on time       Seattle 1,841
## 18  ALASKA delayed       Seattle   305
## 19 AM WEST on time       Seattle   201
## 20 AM WEST delayed       Seattle    61

(3) Perform analysis to compare the arrival delays for the two airlines.

Now that I want to perform analysis on the data it will be handy to have the counts in the fourth column be numeric values. I also noticed that my very tidy dataframe may not be that useful for calculations. It may be more useful to have Airline, City, On-time, Delayed. It is the same number of columns and more useful.

tidy$Count <- as.numeric(gsub(",", "", tidy$Count))
tidy <- spread(tidy, "Arrival", Count)
tidy
##    Airline          City delayed on time
## 1   ALASKA   Los.Angeles      62     497
## 2   ALASKA       Phoenix      12     221
## 3   ALASKA     San.Diego      20     212
## 4   ALASKA San.Francisco     102     503
## 5   ALASKA       Seattle     305    1841
## 6  AM WEST   Los.Angeles     117     694
## 7  AM WEST       Phoenix     415    4840
## 8  AM WEST     San.Diego      65     383
## 9  AM WEST San.Francisco     129     320
## 10 AM WEST       Seattle      61     201

Now that I have my new tidier dataframe I can look at some analysis. I would like to add the on-time and delayed numbers together to be total flights by that airline to that city. I can then calculate the percentage of delayed flights and look at a few other things.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
tidy <- mutate(tidy, Total = delayed + `on time`, PercentDelayed = delayed / Total * 100)
tidy <- arrange(tidy, City, PercentDelayed)
tidy
##    Airline          City delayed on time Total PercentDelayed
## 1   ALASKA   Los.Angeles      62     497   559      11.091234
## 2  AM WEST   Los.Angeles     117     694   811      14.426634
## 3   ALASKA       Phoenix      12     221   233       5.150215
## 4  AM WEST       Phoenix     415    4840  5255       7.897241
## 5   ALASKA     San.Diego      20     212   232       8.620690
## 6  AM WEST     San.Diego      65     383   448      14.508929
## 7   ALASKA San.Francisco     102     503   605      16.859504
## 8  AM WEST San.Francisco     129     320   449      28.730512
## 9   ALASKA       Seattle     305    1841  2146      14.212488
## 10 AM WEST       Seattle      61     201   262      23.282443

We can see my this first round of analysis that Alaska Airlines is delayed less often than American West in every city we have in the data. The percentage difference looks to be the closest in Phoenix and the furthest in San Francisco. Let’s try some summarizing to see if we can learn any more.

delays <- tidy %>% group_by(Airline) %>% summarise(MeanPercent = round(mean(PercentDelayed), 0))
delays
## Source: local data frame [2 x 2]
## 
##   Airline MeanPercent
##     (chr)       (dbl)
## 1  ALASKA          11
## 2 AM WEST          18

We see from this summary that Alaska is delayed on arrival 11% of the time and American West is 18% of the time.

(4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:

. The URL to the .Rmd file in your GitHub repository. and https://github.com/Godbero/CUNY-MSDA-IS607/blob/master/IS607-Week6-Assignment.Rmd

. The URL for your rpubs.com web page. http://www.rpubs.com/Godbero/115655