I made a CSV file using MS Note Pad and saved it on GitHub.
Here I use the read.csv function to read in my csv file. It is VERY untidy.
untidy <- read.csv("https://raw.githubusercontent.com/Godbero/CUNY-MSDA-IS607/master/week6.csv", header = TRUE, sep = "/", row.names = NULL, stringsAsFactors = FALSE)
untidy
## row.names X X.1 Los.Angeles Phoenix San.Diego San.Francisco
## 1 ALASKA on time 497 221 212 503
## 2 delayed 62 12 20 102
## 3 NA NA NA
## 4 AM WEST on time 694 4,840 383 320
## 5 delayed 117 415 65 129
## Seattle
## 1 1,841
## 2 305
## 3
## 4 201
## 5 61
I begin to tidy the dataframe I read in by assigning column names, getting rid of an extra column, and getting rid of an empty row. I also filled in the airline name on the rows where it wasn’t repeated.
untidy <- subset(untidy, c(TRUE, TRUE, FALSE, TRUE, TRUE), 2:8)
names(untidy)[names(untidy) == "X"] <- "Airline"
names(untidy)[names(untidy) == "X.1"] <- "Arrival"
untidy[2, 1] <- "ALASKA"
untidy[4, 1] <- "AM WEST"
untidy
## Airline Arrival Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1,841
## 2 ALASKA delayed 62 12 20 102 305
## 4 AM WEST on time 694 4,840 383 320 201
## 5 AM WEST delayed 117 415 65 129 61
Now that the data frame is a little less untidy or at least easier to read I am ready to tidy it up. My goal is to make a 4-column dataframe with Airline, Arrival, City, and Count.
library(tidyr)
tidy <- gather(untidy, "City", "Count", 3:7)
tidy
## Airline Arrival City Count
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA delayed Los.Angeles 62
## 3 AM WEST on time Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
## 7 AM WEST on time Phoenix 4,840
## 8 AM WEST delayed Phoenix 415
## 9 ALASKA on time San.Diego 212
## 10 ALASKA delayed San.Diego 20
## 11 AM WEST on time San.Diego 383
## 12 AM WEST delayed San.Diego 65
## 13 ALASKA on time San.Francisco 503
## 14 ALASKA delayed San.Francisco 102
## 15 AM WEST on time San.Francisco 320
## 16 AM WEST delayed San.Francisco 129
## 17 ALASKA on time Seattle 1,841
## 18 ALASKA delayed Seattle 305
## 19 AM WEST on time Seattle 201
## 20 AM WEST delayed Seattle 61
Now that I want to perform analysis on the data it will be handy to have the counts in the fourth column be numeric values. I also noticed that my very tidy dataframe may not be that useful for calculations. It may be more useful to have Airline, City, On-time, Delayed. It is the same number of columns and more useful.
tidy$Count <- as.numeric(gsub(",", "", tidy$Count))
tidy <- spread(tidy, "Arrival", Count)
tidy
## Airline City delayed on time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
Now that I have my new tidier dataframe I can look at some analysis. I would like to add the on-time and delayed numbers together to be total flights by that airline to that city. I can then calculate the percentage of delayed flights and look at a few other things.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
tidy <- mutate(tidy, Total = delayed + `on time`, PercentDelayed = delayed / Total * 100)
tidy <- arrange(tidy, City, PercentDelayed)
tidy
## Airline City delayed on time Total PercentDelayed
## 1 ALASKA Los.Angeles 62 497 559 11.091234
## 2 AM WEST Los.Angeles 117 694 811 14.426634
## 3 ALASKA Phoenix 12 221 233 5.150215
## 4 AM WEST Phoenix 415 4840 5255 7.897241
## 5 ALASKA San.Diego 20 212 232 8.620690
## 6 AM WEST San.Diego 65 383 448 14.508929
## 7 ALASKA San.Francisco 102 503 605 16.859504
## 8 AM WEST San.Francisco 129 320 449 28.730512
## 9 ALASKA Seattle 305 1841 2146 14.212488
## 10 AM WEST Seattle 61 201 262 23.282443
We can see my this first round of analysis that Alaska Airlines is delayed less often than American West in every city we have in the data. The percentage difference looks to be the closest in Phoenix and the furthest in San Francisco. Let’s try some summarizing to see if we can learn any more.
delays <- tidy %>% group_by(Airline) %>% summarise(MeanPercent = round(mean(PercentDelayed), 0))
delays
## Source: local data frame [2 x 2]
##
## Airline MeanPercent
## (chr) (dbl)
## 1 ALASKA 11
## 2 AM WEST 18
We see from this summary that Alaska is delayed on arrival 11% of the time and American West is 18% of the time.
. The URL to the .Rmd file in your GitHub repository. and https://github.com/Godbero/CUNY-MSDA-IS607/blob/master/IS607-Week6-Assignment.Rmd
. The URL for your rpubs.com web page. http://www.rpubs.com/Godbero/115655