Tidying and Tansforming Data

For this assigmnent we have 3 steps:

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines.

Step 1 & Step 2

Since I already created the csv file and published it to Github, we are just going to go ahead and pull it in from Github and take a view.

airlines <- read.csv("https://raw.githubusercontent.com/jhumms/DATA607/main/assignment-5/airlines.csv",header = TRUE, sep = ",")
head(airlines)

##       ï..       X Los.Angeles Pheonix San.Diego San.Fransisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

So it has some missing values, the first two column names as well as missing data in the columns and the third row which is empty. Let’s fix those complainets.

Blank row - for this we will remove it if it is empty or null

airlines <- airlines[!apply(is.na(airlines) | airlines == "", 1, all),]

Rename the columns, and we will make them lowercase as well as make any seperated by a ‘.’ with a ’_’

airlines <- airlines %>% rename(carrier = 1, status = 2)
airlines <- airlines %>% rename_all(tolower)
airlines <- rename_with(airlines, ~ tolower(gsub(".", "_", .x, fixed = TRUE)))

Add in any missing data using Zoo’s na.locf which adds the first non na above into the na.

(Unfortunately fill() does not work in this scenario becuase you are working with strings instead of numbers, and thanks to IRTFM on Stackoverflow for the answer)

airlines[airlines==""] <- NA
airlines[1] <- lapply(airlines[1], zoo::na.locf)

Let’s take a look at the data now

head(airlines)

##   carrier  status los_angeles pheonix san_diego san_fransisco seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5 AM WEST delayed         117     415        65           129      61

Much Better!

Now we should be good to go to turn this into long format!

library("reshape2")

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

melt.airlines <- melt(airlines)

## Using carrier, status as id variables

melt.airlines <- melt.airlines %>% rename(destination = 3, amt = 4)
head(melt.airlines)

##   carrier  status destination amt
## 1  ALASKA on time los_angeles 497
## 2  ALASKA delayed los_angeles  62
## 3 AM WEST on time los_angeles 694
## 4 AM WEST delayed los_angeles 117
## 5  ALASKA on time     pheonix 221
## 6  ALASKA delayed     pheonix  12

Step 3

Let’s analyze these two and compare arrival delays for these two airlinesby creating a bar chart to see their on time as well as delayed flights.

ggplot(data=melt.airlines, aes(x=carrier, y=amt, fill=status)) +
  geom_bar(stat="identity", position=position_dodge())

It looks like AM West has a larger number of flights and also a larger number of delays. But, to be fair, let’s compare the overall delays with the total number of flights for each carrier.

delays <- melt.airlines %>% select(carrier, status, amt) %>% group_by(carrier, status) %>% summarise(amt = sum(amt))

## `summarise()` has grouped output by 'carrier'. You can override using the `.groups` argument.

Then we need to cast it so we can get delayed next to on time

cast.delays <- dcast(delays, carrier~...)

## Using amt as value column: use value.var to override.

cast.delays <- rename_with(cast.delays, ~ gsub(" ", "_", .x, fixed = TRUE))

Finally we need to get the percent of delayed by the total number of flights and show the results!

cast.delays <- cast.delays %>% mutate(perc_delayed = delayed/(delayed + on_time))


ggplot(data=cast.delays, aes(x=carrier, y=perc_delayed)) +
  geom_bar(stat="identity")

It looks like Alaska has a higher number flights overall, but how about analyzing the delayed flights by destinations?

Let’s begin by casting the data so we can see on time and delayed by destination

cast.airlines <- dcast(melt.airlines, carrier+destination~...)

## Using amt as value column: use value.var to override.

cast.airlines <- rename_with(cast.airlines, ~ gsub(" ", "_", .x, fixed = TRUE))

Now let’s calculate the percent of time a flight to a destination was delayed

cast.airlines <- cast.airlines %>% mutate(perc_delayed = round(delayed/((on_time+ delayed)), 2))

on_time_airlines<- cast.airlines %>% select(carrier, destination, perc_delayed)

And let’s graph it

ggplot(data=on_time_airlines, aes(x=destination, y=perc_delayed, fill=carrier)) +
  geom_bar(stat="identity", position=position_dodge())

As the reader can see, AM WEST has a higher percentage of delays for each destination, but from what we saw above, Alaska has a higher percentage of delayed flights.