1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.

  2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

First, we load our data into R.

library(tidyr)
library(plyr)
getwd()
## [1] "/Users/omarpineda/Desktop/CUNY SPS MS Data Science/DATA607 Data Acquisition and Management/Week 5"
flights <- read.csv(file="/Users/omarpineda/Desktop/CUNY SPS MS Data Science/DATA607 Data Acquisition and Management/Week 5/flights.csv", header = TRUE)
flights <- flights[-c(3),]  #removes empty row
flights
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska on time         497     221       212           503    1841
## 2  Alaska delayed          62      12        20           102     305
## 4 AM West on time         694    4840       383           320     201
## 5 AM West delayed         117     415        65           129      61

Next, we tidy our data so that it is easier to consume.

colnames(flights)[1:2] <- c("airline", "flightStatus")
flights
##   airline flightStatus Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska      on time         497     221       212           503    1841
## 2  Alaska      delayed          62      12        20           102     305
## 4 AM West      on time         694    4840       383           320     201
## 5 AM West      delayed         117     415        65           129      61
newFlights <- gather(flights, "City", "val", 3:7) %>%
              spread(flightStatus, val)
newFlights
##    airline          City delayed on time
## 1   Alaska   Los.Angeles      62     497
## 2   Alaska       Phoenix      12     221
## 3   Alaska     San.Diego      20     212
## 4   Alaska San.Francisco     102     503
## 5   Alaska       Seattle     305    1841
## 6  AM West   Los.Angeles     117     694
## 7  AM West       Phoenix     415    4840
## 8  AM West     San.Diego      65     383
## 9  AM West San.Francisco     129     320
## 10 AM West       Seattle      61     201
  1. Perform analysis to compare the arrival delays for the two airlines.

Now we do some analysis on our data using ggplot2.

library(ggplot2)
newFlights$totalFlights <- newFlights$delayed + newFlights$`on time`
newFlights$percentDelayed <- newFlights$delayed / newFlights$totalFlights
newFlights
##    airline          City delayed on time totalFlights percentDelayed
## 1   Alaska   Los.Angeles      62     497          559     0.11091234
## 2   Alaska       Phoenix      12     221          233     0.05150215
## 3   Alaska     San.Diego      20     212          232     0.08620690
## 4   Alaska San.Francisco     102     503          605     0.16859504
## 5   Alaska       Seattle     305    1841         2146     0.14212488
## 6  AM West   Los.Angeles     117     694          811     0.14426634
## 7  AM West       Phoenix     415    4840         5255     0.07897241
## 8  AM West     San.Diego      65     383          448     0.14508929
## 9  AM West San.Francisco     129     320          449     0.28730512
## 10 AM West       Seattle      61     201          262     0.23282443
ggplot(newFlights, aes(x = City, y=percentDelayed)) + geom_point(stat = "identity", aes(color = airline)) + ylab("Percentage of Delayed Flights")

AM West has a higher percentage of its flights delayed at every airport compared to Alaska Airlines. Phoenix has the lowest percentage of delayed flights and San Francisco has the highest percentage of delayed flights overall.