Data 607 - Assignment - Tidying and Transforming Data

Rpubs link: http://rpubs.com/jefflittlejohn/Data_607_Week__5 Github: https://github.com/littlejohnjeff/DATA607_Fall2018/blob/master/Data%20607%20-%20Week%205%20Assignment%20-%20Tidying%20and%20Transforming%20Data%20-%20Jeff%20Littlejohn.Rmd

The chart above describes arrival delays for two airlines across five destinations. Your task is to:

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Please include in your homework submission:
The URL to the .Rmd file in your GitHub repository.??? The URL for your rpubs.com web page.

Let’s start.

#load libraries
library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

Load our data.

#Reading csv file into dataframe called flight_info.
flight_info <- tbl_df(read.csv("https://raw.githubusercontent.com/littlejohnjeff/DATA607_Fall2018/master/Numbersense.csv", header = FALSE, sep = ",", stringsAsFactors = FALSE))
flight_info

## # A tibble: 6 x 7
##   V1      V2      V3          V4      V5        V6            V7     
##   <chr>   <chr>   <chr>       <chr>   <chr>     <chr>         <chr>  
## 1 ""      ""      Los Angeles Phoenix San Diego San Francisco Seattle
## 2 ALASKA  on time 497         221     212       503           1841   
## 3 ""      delayed 62          12      20        102           305    
## 4 ""      ""      ""          ""      ""        ""            ""     
## 5 AM WEST on time 694         4840    383       320           201    
## 6 ""      delayed 117         415     65        129           61

Clean up the dataframe to prepare for fancier operations.

#add column names
flight_col_names <- c("Airline","Arrival_Status",flight_info[1,3:7])
names(flight_info) <- flight_col_names
#add airline name to blank fields in first column
flight_info[3,1] <- flight_info[2,1]
flight_info[6,1] <- flight_info[5,1]
#remove empty 3rd row
flight_info <- flight_info[!apply(flight_info == "", 1, all),]
#remove first row that has city names
flight_info <- flight_info[2:5,]
flight_info

## # A tibble: 4 x 7
##   Airline Arrival_Status `Los Angeles` Phoenix `San Diego` `San Francisco`
##   <chr>   <chr>          <chr>         <chr>   <chr>       <chr>          
## 1 ALASKA  on time        497           221     212         503            
## 2 ALASKA  delayed        62            12      20          102            
## 3 AM WEST on time        694           4840    383         320            
## 4 AM WEST delayed        117           415     65          129            
## # ... with 1 more variable: Seattle <chr>

Now let’s use gather to unpivot the city names into one (tidy!) column.

#first two arguments in gather give names to the columns that are gather/unpivoted
#the minus sign tells gather to leave Airline and Arrival_Status as is
flight_info <- flight_info %>%
  gather(City,Flight_Count,-Airline,-Arrival_Status)

We still need to do one more thing before starting analysis - make sure the Flight_Count field is numeric.

#Convert Flight_Count to numeric field to enable analysis
flight_info <- transform(flight_info, Flight_Count = as.numeric(Flight_Count))

Which airline is more reliable? Let’s create some subsets to facilitate analyis.

#delayed flights
delay_counts <- subset(flight_info, flight_info$Arrival_Status == "delayed")
#Alaska flights - total, on time, delayed
Alaska_flight_info <- subset(flight_info, flight_info$Airline == "ALASKA")
Alaska_ontime_counts <- subset(flight_info, flight_info$Airline == "ALASKA" & flight_info$Arrival_Status == "on time")
Alaska_delay_counts <- subset(flight_info, flight_info$Airline == "ALASKA" & flight_info$Arrival_Status == "delayed")
#Am West flights
Amwest_flight_info <- subset(flight_info, flight_info$Airline == "AM WEST")
Amwest_ontime_counts <- subset(flight_info, flight_info$Airline == "AM WEST" & flight_info$Arrival_Status == "on time")
Amwest_delay_counts <- subset(flight_info, flight_info$Airline == "AM WEST" & flight_info$Arrival_Status == "delayed")
#One check to verify it worked
head(Amwest_delay_counts)

##    Airline Arrival_Status          City Flight_Count
## 4  AM WEST        delayed   Los Angeles          117
## 8  AM WEST        delayed       Phoenix          415
## 12 AM WEST        delayed     San Diego           65
## 16 AM WEST        delayed San Francisco          129
## 20 AM WEST        delayed       Seattle           61

Which airline has a higher percentage of delayed flights?

#Alaska delays
Alaska_delay_sum <- sum(Alaska_delay_counts$Flight_Count)
Alaska_flight_sum <- sum(Alaska_flight_info$Flight_Count)
Alaska_delay_rate <- Alaska_delay_sum/Alaska_flight_sum
print("Alaska delay count:")

## [1] "Alaska delay count:"

Alaska_delay_sum

## [1] 501

print("Alaska delay rate:")

## [1] "Alaska delay rate:"

Alaska_delay_rate

## [1] 0.1327152

#Amwest delays
Amwest_delay_sum <- sum(Amwest_delay_counts$Flight_Count)
Amwest_flight_sum <- sum(Amwest_flight_info$Flight_Count)
Amwest_delay_rate <- Amwest_delay_sum/Amwest_flight_sum
print("Amwest delay count:")

## [1] "Amwest delay count:"

Amwest_delay_sum

## [1] 787

print("Amwest delay rate:")

## [1] "Amwest delay rate:"

Amwest_delay_rate

## [1] 0.1089273

We see that although Amwest has more delays, they actually have a lower rate of delays per flight. Let’s plot the delay counts to see how this looks.

ggplot(flight_info,aes(flight_info$Airline,flight_info$Flight_Count, colour = flight_info$Arrival_Status)) + geom_point()

Amwest has one big on-time outlier, as does Alaska. Let’s check to see if they are both the same city. If not, instead of choosing one airline over the other based on overall delay rate, we might be selective and choose an airline based on which city from which we’re flying.

ggplot(delay_counts,aes(delay_counts$City,delay_counts$Flight_Count, colour = delay_counts$Airline)) + geom_point()

We see that Phoenix is home to the largest number of Amwest delays, and Seattle is home to the biggest portion of Alaska delays, but we can’t determine if that’s due to a large volume of total flights. Let’s get messy with ggplot.

ggplot(flight_info,aes(flight_info$City,flight_info$Flight_Count, colour = flight_info$Airline,flight_info$Arrival_Status)) + geom_text(aes(label=paste0(flight_info$Arrival_Status,flight_info$Airline)))

Based on this more detailed analysis, independent of price and schedule convenience, it looks like we should fly Alaska out of Seattle and San Francisco and Amwest everywhere else.

Data 607 - Assignment - Tidying and Transforming Data

Jeff Littlejohn

September 28, 2018