Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
First, we load our data into R.
library(tidyr)
library(plyr)
getwd()
## [1] "/Users/omarpineda/Desktop/CUNY SPS MS Data Science/DATA607 Data Acquisition and Management/Week 5"
flights <- read.csv(file="/Users/omarpineda/Desktop/CUNY SPS MS Data Science/DATA607 Data Acquisition and Management/Week 5/flights.csv", header = TRUE)
flights <- flights[-c(3),] #removes empty row
flights
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 Alaska delayed 62 12 20 102 305
## 4 AM West on time 694 4840 383 320 201
## 5 AM West delayed 117 415 65 129 61
Next, we tidy our data so that it is easier to consume.
colnames(flights)[1:2] <- c("airline", "flightStatus")
flights
## airline flightStatus Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 Alaska delayed 62 12 20 102 305
## 4 AM West on time 694 4840 383 320 201
## 5 AM West delayed 117 415 65 129 61
newFlights <- gather(flights, "City", "val", 3:7) %>%
spread(flightStatus, val)
newFlights
## airline City delayed on time
## 1 Alaska Los.Angeles 62 497
## 2 Alaska Phoenix 12 221
## 3 Alaska San.Diego 20 212
## 4 Alaska San.Francisco 102 503
## 5 Alaska Seattle 305 1841
## 6 AM West Los.Angeles 117 694
## 7 AM West Phoenix 415 4840
## 8 AM West San.Diego 65 383
## 9 AM West San.Francisco 129 320
## 10 AM West Seattle 61 201
Now we do some analysis on our data using ggplot2.
library(ggplot2)
newFlights$totalFlights <- newFlights$delayed + newFlights$`on time`
newFlights$percentDelayed <- newFlights$delayed / newFlights$totalFlights
newFlights
## airline City delayed on time totalFlights percentDelayed
## 1 Alaska Los.Angeles 62 497 559 0.11091234
## 2 Alaska Phoenix 12 221 233 0.05150215
## 3 Alaska San.Diego 20 212 232 0.08620690
## 4 Alaska San.Francisco 102 503 605 0.16859504
## 5 Alaska Seattle 305 1841 2146 0.14212488
## 6 AM West Los.Angeles 117 694 811 0.14426634
## 7 AM West Phoenix 415 4840 5255 0.07897241
## 8 AM West San.Diego 65 383 448 0.14508929
## 9 AM West San.Francisco 129 320 449 0.28730512
## 10 AM West Seattle 61 201 262 0.23282443
ggplot(newFlights, aes(x = City, y=percentDelayed)) + geom_point(stat = "identity", aes(color = airline)) + ylab("Percentage of Delayed Flights")
AM West has a higher percentage of its flights delayed at every airport compared to Alaska Airlines. Phoenix has the lowest percentage of delayed flights and San Francisco has the highest percentage of delayed flights overall.