1 Assignment – Tidying and Transforming Data

Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1841
delayed 62 12 20 102 305
AM WEST on time 694 4840 383 320 201
delayed 117 415 65 129 61

Table 1: The chart above describes arrival delays for two airlines across five destinations.

1.1 Loading packeges

Firstly, we wil load packages that are used for data tidying and transforming

  • Two packages
    • tidyr
    • dplyr
    • ggplot2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

1.3 Read the information from the .csv file above into R

Let’s read the .CSV file that we create above.

1.4 Modify the table

The table from .CSV file is a little untidy and messy. There are no headers and two NA values that has to be replaced. It will be modified with adding column names, removing unnecessary row and column, and filling NAs with values

1.5 Perform analysis to compare the arrival delays for the two airlines with using tidyr and dplyr.

Let’s take a look at the table above and find the variables.

  • Variables in the table
    • Airline
    • Arrival status
    • City
    • Count

1.5.1 Collapse multiple columns into four columns; (Airline, Arrival_Status, City, Count)

Since the format of delay2 table is untidy, let’s re-structure the table into tidy format with using gather() function that collapse multibple comlumns into four columns.

1.5.2 Return rows with matching specific airline and order by arrival status

Let’s divide into two table according to Airline; Alaska and Am west. To extract the rows of interest, filter() function is used and then arrange() function is used to make an order by arrival status. We will use the pipe operate here because we will pipe the results after one expression into a second expression.

1.5.3 Graph expression

Now we are creating barplot by using ggplot2 package in order to compare observations of each city. Because the total of flights are different between citites, we will generate stacked barplot with normalized height, so that it will be easier to compare the data.

## 'data.frame':    10 obs. of  4 variables:
##  $ Airline       : chr  "ALASKA" "ALASKA" "ALASKA" "ALASKA" ...
##  $ Arrival_Status: chr  "delayed" "delayed" "delayed" "delayed" ...
##  $ city          : chr  "Los_Angeles" "Phoenix" "San Diego" "San Francisco" ...
##  $ count         : chr  "62" "12" "20" "102" ...
## 'data.frame':    10 obs. of  4 variables:
##  $ Airline       : chr  "AM WEST" "AM WEST" "AM WEST" "AM WEST" ...
##  $ Arrival_Status: chr  "delayed" "delayed" "delayed" "delayed" ...
##  $ city          : chr  "Los_Angeles" "Phoenix" "San Diego" "San Francisco" ...
##  $ count         : chr  "117" "415" "65" "129" ...

1.5.4 Calculating percentage of delayed flights for each airline

We will calculate how much proportions of delayed flights and on time flights for each airline. Each airline table is grouped by arrival_status, calculate total count for each arrival status , and then add a column for percentage of each case.

1.6 Conclusion

  1. According to the ALASKA and AM WEST barplots, San Francisco airport has the highest delayed rates among five cities for both airlines.

  2. According to the ALASKA and AM WEST barplots, Phoenix airpot has the lowest delayed rates among five cities for both airlines.

  3. Based on the alaska1 and amwest1 tables, Alaska airline has higher delayed rate in the total number of flights than AM WEST airline. The delayed rate for Alaska is 13% and the rate for AM WEST is 11%. But the total number of flights are different which Alaska has total 3775 flights and AM WEST has total 7225 flights.