DATA607- Tidying and Transforming Data

Rmarkdown Output

I am using prettydoc with cayman theme for my rmarkdown this week. prettydoc has great documentation in this link https://prettydoc.statr.me/index.html

Problem Statement

(1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. (3) Perform analysis to compare the arrival delays for the two airlines. (4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:  The URL to the .Rmd file in your GitHub repository. and  The URL for your rpubs.com web page.

Git-Hub

The data, .rmd used in this assignment can be found at https://github.com/forhadakbar/data607fall2019/tree/master/Week%2005

Loading package

The tidyverse includes both tidyr and dplyr. I found Amazing documentation here on tidyverse https://www.tidyverse.org/packages/. Kudos to http://hadley.nz/

library(tidyverse)
library(stringr) 
library(knitr)

Create a .csv file

One way is to manually create a .csv and push it to github. Then we can read the file using read.csv from github raw link.

Another way is to create the dataset in R using rbind and then write a table to get a .csv file as output. Then we can read the file using read.csv from either github or working directory. I would write the .csv file to my locale github repro week 05 folder then commit and push to my github using either bash command or git client so that i can get a raw link of the .csv file. I am adding all missing values at 3rd row to replicate same structure given in the assignment and to practice tidying and transformations.

flights <- rbind(c(NA, NA, "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"),
             c("ALASKA", "on time", 497, 221, 212, 503, 1841),
             c(NA, "delayed", 62, 12, 20, 102, 305),
             c(NA, NA, NA, NA, NA, NA, NA),
             c("AM WEST", "on time", 694, 4840, 383, 320, 201),
             c(NA, "delayed", 117, 415, 65, 129, 61))

#write the data in a .csv file
write.table(flights, file = "D:/CUNY-DataScience/Git-Hub/data607fall2019/Week 05/flight.csv", sep = ",", col.names=F, row.names=F)

Read .CSV file into R and fill in missing values

I am using github raw link so that anyone can run the code and have the access to the .csv file created. Then i will Fill in the airline values and have a look using kable function from knitr package.

 flights <-read.csv(paste("https://raw.githubusercontent.com/forhadakbar/data607fall2019/master/Week%2005/flight.csv"))

#Fill in the airline values
flights[2,1] <- flights[1,1]
flights[5,1] <- flights[4,1]
# replace space with underscore to avoid using backtick later
flights[,2] <- sapply(flights[,2], str_replace, " ", "_")
kable(flights)

NA.	NA..1	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
ALASKA	on_time	497	221	212	503	1841
ALASKA	delayed	62	12	20	102	305
NA	NA	NA	NA	NA	NA	NA
AM WEST	on_time	694	4840	383	320	201
AM WEST	delayed	117	415	65	129	61

Tidy dataset

Goals of making any data tidy or perform wrangling on it mainly depends on use case. We can consider following main data wrangling goals when performing any wrangling operation.

Make data suitable to use with a particular piece of software. For example R in our course context
Reveal information by changing format and structure of the data

Now let’s analyse why the the flights dataset is untidy in context of R. R prefers just one format of the data. We are calling it tidy here. There are three interrelated rules which make a dataset tidy:

Each variable must have its own column.
Each observation must have its own row
Each value must have its own cell

So in order to any dataset to be tidy it needs to maintain all three above rules, flights dataset doesn’t fulfill all three rules.

Flights dataset has most common problem as some of the column names are not names of variables, but values of a variable. The column names Los.Angeles, Phoenix, San.Diego San.Francisco and Seattle represent values of the location or state variable, and each row represents five observations, not one. We can use gathering and spreading to tidy this dataset. We can also make separate column for on time and delayed if needed. Also, there is a blank row and two column names are missing. R reading the missing column name as NA. and NA..1

My Approach

First, I will remove the blank row and rename the missing column name. Then i will use gathering from tidyr to get long form of the dataset to make it tidy. Later, i will use spreading from tidyr to get number of flights arrived on time and delay in separate column. Then i will do analysis to get insight like compare airlines performance using dplyr.

I am interested in not only overall airlines performace but also performace by destination city.

I will use pipe (%>%) opertor instead of chaining method as i am fairly new to piping. The history of pipe (%>%) in R and how it gets introduced in tidyverse is amazing to read. In short pipe (%>%) takes the output of one statement and makes it the input of the next statement. When reading it, we can think of it as a “THEN”. I studied pipe operator and it’s history from https://www.datacamp.com/community/tutorials/pipe-r-tutorial

Why tidyr

tidyr is a package that reshapes the layout of a tables. Two main function:

gather() Convert variable into observation
spread () Convert observation into variable

Transformation and using gather()

long_tidy_flights <- flights %>% 
  na.omit() %>% #remove blank row
  rename(airline = NA., status = NA..1) %>% #name missing column name
   gather(3:7, key ="destination", value = "number_of_flights" )#Transform the data from its wide to a more normalized long format
kable(long_tidy_flights)

airline	status	destination	number_of_flights
ALASKA	on_time	Los.Angeles	497
ALASKA	delayed	Los.Angeles	62
AM WEST	on_time	Los.Angeles	694
AM WEST	delayed	Los.Angeles	117
ALASKA	on_time	Phoenix	221
ALASKA	delayed	Phoenix	12
AM WEST	on_time	Phoenix	4840
AM WEST	delayed	Phoenix	415
ALASKA	on_time	San.Diego	212
ALASKA	delayed	San.Diego	20
AM WEST	on_time	San.Diego	383
AM WEST	delayed	San.Diego	65
ALASKA	on_time	San.Francisco	503
ALASKA	delayed	San.Francisco	102
AM WEST	on_time	San.Francisco	320
AM WEST	delayed	San.Francisco	129
ALASKA	on_time	Seattle	1841
ALASKA	delayed	Seattle	305
AM WEST	on_time	Seattle	201
AM WEST	delayed	Seattle	61

Here i used gather() from tidyr package after removing blank row and name misssing column names to transform dataset from its untidy form to a normalized tidy long form which fullfill all three rules a dataset need to fulfil to become tidy for r. gather() convert variable into observation. Here, variables like Los.Angeles, Phoenix, San.Diego, San.Francisco, Seattle became observation.

Using spread ()

wide_tidy_flights<- long_tidy_flights %>%
  spread(key = status, value = number_of_flights)
kable(wide_tidy_flights)

airline	destination	delayed	on_time
ALASKA	Los.Angeles	62	497
ALASKA	Phoenix	12	221
ALASKA	San.Diego	20	212
ALASKA	San.Francisco	102	503
ALASKA	Seattle	305	1841
AM WEST	Los.Angeles	117	694
AM WEST	Phoenix	415	4840
AM WEST	San.Diego	65	383
AM WEST	San.Francisco	129	320
AM WEST	Seattle	61	201

Here i used spread() from tidyr package to transform dataset from its long form to a wide form to have seprate column for on time and delayed filghts in order to better analyse the airlines performance. spread() convert obseration into variables. Here, observation like Delayed and On Time became variables.

why use dplyr

Often datsets contain more informatio than they display. dplyr helps access that information. I will use as much all functions from dplyr to practice. Main function:

select() Extract existing variable
filter() Extract existing observations
mutate() Derive new variable (from existing variable)
summarise() Change the unit of analysis

Use mutate to add columns for total, percent on time, and percent delayed

tidyflights <- wide_tidy_flights %>% 
  mutate(total_flights = delayed + on_time, # use mutate from dplyr to add columns
         percent_on_time = round(on_time/total_flights*100, 2),
         percent_delayed = round(delayed/total_flights*100, 2))
kable(tidyflights)

airline	destination	delayed	on_time	total_flights	percent_on_time	percent_delayed
ALASKA	Los.Angeles	62	497	559	88.91	11.09
ALASKA	Phoenix	12	221	233	94.85	5.15
ALASKA	San.Diego	20	212	232	91.38	8.62
ALASKA	San.Francisco	102	503	605	83.14	16.86
ALASKA	Seattle	305	1841	2146	85.79	14.21
AM WEST	Los.Angeles	117	694	811	85.57	14.43
AM WEST	Phoenix	415	4840	5255	92.10	7.90
AM WEST	San.Diego	65	383	448	85.49	14.51
AM WEST	San.Francisco	129	320	449	71.27	28.73
AM WEST	Seattle	61	201	262	76.72	23.28

Here i used mutate function which takes a data frame and return the data frame with new variables added to it. As i had replace space in “on time” with underscore i didn’t need to use backtick. I used round to round the percentage to 2 decimel points.

Let’s see it visually using ggplot2

ggplot(tidyflights, aes(x=destination, y=percent_delayed, fill=airline)) +
    geom_bar(stat="identity", position=position_dodge(), colour="black") +
    geom_text(aes(label=percent_delayed), vjust=.5, hjust=1,position= position_dodge(width=0.9),  color="black") +
        ggtitle("Percentage of Flights Delayed by Airline by City") +
    xlab("Destination") + ylab("Percentage of Flights Delayed") +
        coord_flip()

ggplot(tidyflights, aes(x=destination, y=percent_on_time, fill=airline)) +
    geom_bar(stat="identity", position=position_dodge(), colour="black") +
    geom_text(aes(label=percent_on_time), vjust=.5, hjust=1,position= position_dodge(width=0.9),  color="black") +
        ggtitle("Percentage of Flights on time by Airline by City") +
    xlab("Destination") + ylab("Percentage of Flights on time") +
        coord_flip()

Arrange to see performance by destination

tidyflights<- tidyflights %>%
  arrange(desc(percent_on_time))
kable(tidyflights)

airline	destination	delayed	on_time	total_flights	percent_on_time	percent_delayed
ALASKA	Phoenix	12	221	233	94.85	5.15
AM WEST	Phoenix	415	4840	5255	92.10	7.90
ALASKA	San.Diego	20	212	232	91.38	8.62
ALASKA	Los.Angeles	62	497	559	88.91	11.09
ALASKA	Seattle	305	1841	2146	85.79	14.21
AM WEST	Los.Angeles	117	694	811	85.57	14.43
AM WEST	San.Diego	65	383	448	85.49	14.51
ALASKA	San.Francisco	102	503	605	83.14	16.86
AM WEST	Seattle	61	201	262	76.72	23.28
AM WEST	San.Francisco	129	320	449	71.27	28.73

Here i used arrange() to arrage the row by percent_on_time by descending order. We can see ALASKA airlines has 4 out of the top five spots for percentage of on time flights.

Use dplyr summarize and select function to compare overall airlines performance

compare_airlines <- tidyflights %>% 
  group_by(airline) %>%
  summarize(on_time = sum(on_time), delayed = sum(delayed), 
            percent_on_time = round(on_time/sum(on_time,delayed)*100,2),
            percent_delay = round(delayed/sum(on_time,delayed)*100,2))%>% # use summarise to change unit of analysis
  select(airline, percent_on_time,percent_delay) # use select to extract columns
compare_airlines %>% arrange(desc(percent_on_time))

## # A tibble: 2 x 3
##   airline percent_on_time percent_delay
##   <fct>             <dbl>         <dbl>
## 1 AM WEST            89.1          10.9
## 2 ALASKA             86.7          13.3

First i group the rows by airlines. Then use summarise change units of analysis. Here we used sum(). Finally use select() to extract variable and then display them in percent_delay order by percent_on_time using arrange().

    ggplot(compare_airlines, aes(x=airline, y=percent_delay, fill=airline)) +
    geom_bar(stat="identity", position=position_dodge(), color="black", width = 0.5) +
    coord_flip() + 
    ggtitle("Overall Percentage of Flights Delayed by Airline") +
    xlab("Airline") + ylab("Percentage of Flights Delayed")

We can see AM WEST overall performance is better than ALASKA. This seems odd as ALASKA airlines has 4 out of the top five spots for percentage of on time flights. I noticed AM West has a very large number of flights to Phoenix with a high on time percentage to that destination. I would like to investigate if we filter out phoenix then what happen to overall performace.

Use dplyr to filter out Phoenix to see the comparison accross the other destinations

filter_Phoenix <- tidyflights %>%
  filter(destination!="Phoenix")%>%
  group_by(airline) %>%
  summarize(on_time = sum(on_time), delayed = sum(delayed), 
            percent_on_time = round(on_time/sum(on_time,delayed)*100,2),
            percent_delay = round(delayed/sum(on_time,delayed)*100,2))%>% # use summarise to change unit of analysis
  select(airline, percent_on_time,percent_delay) # use select to extract columns
filter_Phoenix %>% arrange(desc(percent_on_time))

## # A tibble: 2 x 3
##   airline percent_on_time percent_delay
##   <fct>             <dbl>         <dbl>
## 1 ALASKA             86.2          13.8
## 2 AM WEST            81.1          18.9

kable(filter_Phoenix)

airline	percent_on_time	percent_delay
ALASKA	86.19	13.81
AM WEST	81.12	18.88

When we filter out Phoenix we can see that now ALASKA Airlines has a better on time percentage.

Conclusion

We can conclude that if someone is flying to Phoenix then take AM West. But if flying to any of the other destinations you have a better chance of being on time with ALASKA airlines. We can also see how performance vary depending on destination. In this assignment we learn how to formate dataset, tidy it using tidr and get useful insight using dplyr package.

Source

I used following sources to study and learn all necessary skills to complete this assignment.

kable: https://www.rdocumentation.org/packages/knitr/versions/1.1/topics/kable
Prettydoc: https://prettydoc.statr.me/index.html
Pipe: https://www.datacamp.com/community/tutorials/pipe-r-tutorial
tidyverse: https://www.tidyverse.org/
Book: https://r4ds.had.co.nz/tidy-data.html#spreading-and-gathering
Video on tidyr and dplyr: https://rstudio.com/resources/webinars/data-wrangling-with-r-and-rstudio/?fbclid=IwAR2Elr92Jbv8S2E3XV2NX5e9zlRHxAahOHrzvNQF8gR3uD1yoxIt_U6-xoE