Rmarkdown Output

I am using prettydoc with cayman theme for my rmarkdown this week. prettydoc has great documentation in this link https://prettydoc.statr.me/index.html

Problem Statement

(1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. (3) Perform analysis to compare the arrival delays for the two airlines. (4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:  The URL to the .Rmd file in your GitHub repository. and  The URL for your rpubs.com web page.

Git-Hub

The data, .rmd used in this assignment can be found at https://github.com/forhadakbar/data607fall2019/tree/master/Week%2005

Loading package

The tidyverse includes both tidyr and dplyr. I found Amazing documentation here on tidyverse https://www.tidyverse.org/packages/. Kudos to http://hadley.nz/

Create a .csv file

One way is to manually create a .csv and push it to github. Then we can read the file using read.csv from github raw link.

Another way is to create the dataset in R using rbind and then write a table to get a .csv file as output. Then we can read the file using read.csv from either github or working directory. I would write the .csv file to my locale github repro week 05 folder then commit and push to my github using either bash command or git client so that i can get a raw link of the .csv file. I am adding all missing values at 3rd row to replicate same structure given in the assignment and to practice tidying and transformations.

Read .CSV file into R and fill in missing values

I am using github raw link so that anyone can run the code and have the access to the .csv file created. Then i will Fill in the airline values and have a look using kable function from knitr package.

NA. NA..1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
ALASKA on_time 497 221 212 503 1841
ALASKA delayed 62 12 20 102 305
NA NA NA NA NA NA NA
AM WEST on_time 694 4840 383 320 201
AM WEST delayed 117 415 65 129 61

Tidy dataset

Goals of making any data tidy or perform wrangling on it mainly depends on use case. We can consider following main data wrangling goals when performing any wrangling operation.

Now let’s analyse why the the flights dataset is untidy in context of R. R prefers just one format of the data. We are calling it tidy here. There are three interrelated rules which make a dataset tidy:

So in order to any dataset to be tidy it needs to maintain all three above rules, flights dataset doesn’t fulfill all three rules.

Flights dataset has most common problem as some of the column names are not names of variables, but values of a variable. The column names Los.Angeles, Phoenix, San.Diego San.Francisco and Seattle represent values of the location or state variable, and each row represents five observations, not one. We can use gathering and spreading to tidy this dataset. We can also make separate column for on time and delayed if needed. Also, there is a blank row and two column names are missing. R reading the missing column name as NA. and NA..1

My Approach

First, I will remove the blank row and rename the missing column name. Then i will use gathering from tidyr to get long form of the dataset to make it tidy. Later, i will use spreading from tidyr to get number of flights arrived on time and delay in separate column. Then i will do analysis to get insight like compare airlines performance using dplyr.

I am interested in not only overall airlines performace but also performace by destination city.

I will use pipe (%>%) opertor instead of chaining method as i am fairly new to piping. The history of pipe (%>%) in R and how it gets introduced in tidyverse is amazing to read. In short pipe (%>%) takes the output of one statement and makes it the input of the next statement. When reading it, we can think of it as a “THEN”. I studied pipe operator and it’s history from https://www.datacamp.com/community/tutorials/pipe-r-tutorial

Why tidyr

tidyr is a package that reshapes the layout of a tables. Two main function:

Transformation and using gather()

airline status destination number_of_flights
ALASKA on_time Los.Angeles 497
ALASKA delayed Los.Angeles 62
AM WEST on_time Los.Angeles 694
AM WEST delayed Los.Angeles 117
ALASKA on_time Phoenix 221
ALASKA delayed Phoenix 12
AM WEST on_time Phoenix 4840
AM WEST delayed Phoenix 415
ALASKA on_time San.Diego 212
ALASKA delayed San.Diego 20
AM WEST on_time San.Diego 383
AM WEST delayed San.Diego 65
ALASKA on_time San.Francisco 503
ALASKA delayed San.Francisco 102
AM WEST on_time San.Francisco 320
AM WEST delayed San.Francisco 129
ALASKA on_time Seattle 1841
ALASKA delayed Seattle 305
AM WEST on_time Seattle 201
AM WEST delayed Seattle 61

Here i used gather() from tidyr package after removing blank row and name misssing column names to transform dataset from its untidy form to a normalized tidy long form which fullfill all three rules a dataset need to fulfil to become tidy for r. gather() convert variable into observation. Here, variables like Los.Angeles, Phoenix, San.Diego, San.Francisco, Seattle became observation.

Using spread ()

airline destination delayed on_time
ALASKA Los.Angeles 62 497
ALASKA Phoenix 12 221
ALASKA San.Diego 20 212
ALASKA San.Francisco 102 503
ALASKA Seattle 305 1841
AM WEST Los.Angeles 117 694
AM WEST Phoenix 415 4840
AM WEST San.Diego 65 383
AM WEST San.Francisco 129 320
AM WEST Seattle 61 201

Here i used spread() from tidyr package to transform dataset from its long form to a wide form to have seprate column for on time and delayed filghts in order to better analyse the airlines performance. spread() convert obseration into variables. Here, observation like Delayed and On Time became variables.

why use dplyr

Often datsets contain more informatio than they display. dplyr helps access that information. I will use as much all functions from dplyr to practice. Main function:

Use mutate to add columns for total, percent on time, and percent delayed

airline destination delayed on_time total_flights percent_on_time percent_delayed
ALASKA Los.Angeles 62 497 559 88.91 11.09
ALASKA Phoenix 12 221 233 94.85 5.15
ALASKA San.Diego 20 212 232 91.38 8.62
ALASKA San.Francisco 102 503 605 83.14 16.86
ALASKA Seattle 305 1841 2146 85.79 14.21
AM WEST Los.Angeles 117 694 811 85.57 14.43
AM WEST Phoenix 415 4840 5255 92.10 7.90
AM WEST San.Diego 65 383 448 85.49 14.51
AM WEST San.Francisco 129 320 449 71.27 28.73
AM WEST Seattle 61 201 262 76.72 23.28

Here i used mutate function which takes a data frame and return the data frame with new variables added to it. As i had replace space in “on time” with underscore i didn’t need to use backtick. I used round to round the percentage to 2 decimel points.

Let’s see it visually using ggplot2

Arrange to see performance by destination

airline destination delayed on_time total_flights percent_on_time percent_delayed
ALASKA Phoenix 12 221 233 94.85 5.15
AM WEST Phoenix 415 4840 5255 92.10 7.90
ALASKA San.Diego 20 212 232 91.38 8.62
ALASKA Los.Angeles 62 497 559 88.91 11.09
ALASKA Seattle 305 1841 2146 85.79 14.21
AM WEST Los.Angeles 117 694 811 85.57 14.43
AM WEST San.Diego 65 383 448 85.49 14.51
ALASKA San.Francisco 102 503 605 83.14 16.86
AM WEST Seattle 61 201 262 76.72 23.28
AM WEST San.Francisco 129 320 449 71.27 28.73

Here i used arrange() to arrage the row by percent_on_time by descending order. We can see ALASKA airlines has 4 out of the top five spots for percentage of on time flights.

Use dplyr summarize and select function to compare overall airlines performance

## # A tibble: 2 x 3
##   airline percent_on_time percent_delay
##   <fct>             <dbl>         <dbl>
## 1 AM WEST            89.1          10.9
## 2 ALASKA             86.7          13.3

First i group the rows by airlines. Then use summarise change units of analysis. Here we used sum(). Finally use select() to extract variable and then display them in percent_delay order by percent_on_time using arrange().

We can see AM WEST overall performance is better than ALASKA. This seems odd as ALASKA airlines has 4 out of the top five spots for percentage of on time flights. I noticed AM West has a very large number of flights to Phoenix with a high on time percentage to that destination. I would like to investigate if we filter out phoenix then what happen to overall performace.

Use dplyr to filter out Phoenix to see the comparison accross the other destinations

## # A tibble: 2 x 3
##   airline percent_on_time percent_delay
##   <fct>             <dbl>         <dbl>
## 1 ALASKA             86.2          13.8
## 2 AM WEST            81.1          18.9
airline percent_on_time percent_delay
ALASKA 86.19 13.81
AM WEST 81.12 18.88

When we filter out Phoenix we can see that now ALASKA Airlines has a better on time percentage.

Conclusion

We can conclude that if someone is flying to Phoenix then take AM West. But if flying to any of the other destinations you have a better chance of being on time with ALASKA airlines. We can also see how performance vary depending on destination. In this assignment we learn how to formate dataset, tidy it using tidr and get useful insight using dplyr package.

Source

I used following sources to study and learn all necessary skills to complete this assignment.

kable: https://www.rdocumentation.org/packages/knitr/versions/1.1/topics/kable
Prettydoc: https://prettydoc.statr.me/index.html
Pipe: https://www.datacamp.com/community/tutorials/pipe-r-tutorial
tidyverse: https://www.tidyverse.org/
Book: https://r4ds.had.co.nz/tidy-data.html#spreading-and-gathering
Video on tidyr and dplyr: https://rstudio.com/resources/webinars/data-wrangling-with-r-and-rstudio/?fbclid=IwAR2Elr92Jbv8S2E3XV2NX5e9zlRHxAahOHrzvNQF8gR3uD1yoxIt_U6-xoE