Tidying and Transforming Data


Introduction

This week an exercise in tidying data. We’re provided untidy data in a form of csv format. The requirement is to tidy and transform it into an analyzable table where each row is a single observation and each column a single variable.

Methodology

  • We need to read the data first into R and browse the table
  • Use the tidyr library to tidy the data and dplyr to transform it
  • Perform analysis to compare arrival and delayed time for each air line

Requirements

  1. Create a csv file.
  2. Read information from the csv file
  3. Compare the arrival delays for the two airlines
  4. Publish the R-Markdown file to the Rpubs

Questions

Before starting the analysis process, we have to know what are the information we need to extract from the avaliable data. In other meaning what is the addedd business value from doing this particular analysis. For me, my questions are:

  • What is the status of both delay and on-time arrivals for each air line company per state? gives us an indicator which airline can be reliable or not

  • What is the delay rate per state?

  • Can we rank the airlines per state according to total number of flights?

Steps

Loading the necessary libraries

  1. Read the csv into R

The first step is to read the data into R, I wrote the data on Excel file and save it as .csv file. After that I used the function read.csv() to read the data into R.

X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
ALASKA on time 497 221 212 503 1841
delayed 62 12 20 102 305
NA NA NA NA NA
AM WEST on time 694 4840 383 320 201
delayed 117 415 65 129 62

As illustrated in the table, the data is messy contianing NA rows, and double title rows. We have to clean up the data first before tidy/transform the data.

  1. Remove the NA row and empty rows
X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
1 ALASKA on time 497 221 212 503 1841
2 delayed 62 12 20 102 305
4 AM WEST on time 694 4840 383 320 201
5 delayed 117 415 65 129 62
  1. Rename columns for (first, second) and add airline name to the empty cells in the first column
air_line status Los.Angeles Phoenix San.Diego San.Francisco Seattle
1 ALASKA on time 497 221 212 503 1841
2 delayed 62 12 20 102 305
4 AM WEST on time 694 4840 383 320 201
5 delayed 117 415 65 129 62
air_line status Los.Angeles Phoenix San.Diego San.Francisco Seattle
1 ALASKA on time 497 221 212 503 1841
2 ALASKA delayed 62 12 20 102 305
4 AM WEST on time 694 4840 383 320 201
5 AM WEST delayed 117 415 65 129 62
  1. The table at this shape cannot be analyzed, we have to tidy the data first. To do so, we can use the tidyr package in R. We need also to transpose the table.
  • We can start by generating two columns for the City and Flight
## 'data.frame':    4 obs. of  7 variables:
##  $ air_line     : Factor w/ 3 levels "","ALASKA","AM WEST": 2 2 3 3
##  $ status       : Factor w/ 3 levels "","delayed","on time": 3 2 3 2
##  $ Los.Angeles  : int  497 62 694 117
##  $ Phoenix      : int  221 12 4840 415
##  $ San.Diego    : int  212 20 383 65
##  $ San.Francisco: int  503 102 320 129
##  $ Seattle      : int  1841 305 201 62

Using tidyr package we can tidy the data using the gather() which will create two columns for city names and flights. Here after gathering, I removed the “.” between cities’ long name

air_line status City Flight
ALASKA on time Los Angeles 497
ALASKA delayed Los Angeles 62
AM WEST on time Los Angeles 694
AM WEST delayed Los Angeles 117
ALASKA on time Phoenix 221
ALASKA delayed Phoenix 12
AM WEST on time Phoenix 4840
AM WEST delayed Phoenix 415
ALASKA on time San Diego 212
ALASKA delayed San Diego 20
AM WEST on time San Diego 383
AM WEST delayed San Diego 65
ALASKA on time San Francisco 503
ALASKA delayed San Francisco 102
AM WEST on time San Francisco 320
AM WEST delayed San Francisco 129
ALASKA on time Seattle 1841
ALASKA delayed Seattle 305
AM WEST on time Seattle 201
AM WEST delayed Seattle 62

Then, spread the status column into two columns of delayed and on-time

air_line City delayed on time
ALASKA Los Angeles 62 497
ALASKA Phoenix 12 221
ALASKA San Diego 20 212
ALASKA San Francisco 102 503
ALASKA Seattle 305 1841
AM WEST Los Angeles 117 694
AM WEST Phoenix 415 4840
AM WEST San Diego 65 383
AM WEST San Francisco 129 320
AM WEST Seattle 62 201

Analyze the data

We will get an overview of the data to explore it then will get indeep to have more insight analysis

We can see here that the majority of the flights status arriving Phoenix are on time >5,000 flights and almost 500 delayed flights. That’s makes us more curious to know which airline has the most flights in Phoenix. However, let’s calculate the delay rate per state first.

To calculate the delay rate, I summarize the sum of the delayed time devided on sum of the number of on-time flights inside the aggrigation function group_by

It seems that AM WEST has more delays rate especially for San Franscisco and Seattle with 40.3% and 30.8% respectively compared to ALASKA which has almost half these rates.

Then we have to get the total flights per state

air_line City delayed on time total_num
ALASKA Los Angeles 62 497 559
ALASKA Phoenix 12 221 233
ALASKA San Diego 20 212 232
ALASKA San Francisco 102 503 605
ALASKA Seattle 305 1841 2146
AM WEST Los Angeles 117 694 811
AM WEST Phoenix 415 4840 5255
AM WEST San Diego 65 383 448
AM WEST San Francisco 129 320 449
AM WEST Seattle 62 201 263

After that, get the highest number per airline company

air_line City total_num
AM WEST Phoenix 5255
ALASKA Seattle 2146

It seams that AM WEST is domenating in Phoenix while ALASKA is in Seattle. This comparasion should be clear if we create a ranking system to rank air-lines in each state

From all the previous analysis, we may conclude that AM WEST is the best company in Phoenix; however, ALASKA is the best on Seattle based on the total number of flights.

Conclusion

To have a final conclusion for this analysis, we need more information like tickets prices for instance. However, the preliminary analysis showed that AM WEST is more reputable than ALASKA it has larger number of flights and higher ranking.