Clean Data in, Clear Results out.

    The quality of results relies on the quality of the source data. Therefore, cleaning data is a necessary step before real data analysis can begin.
     Hadley Wickham has outlined a standard ‘tidy’ organization for data where:

  • Each variable forms a column
  • Each observation forms a row
  • Each type of observational unit forms a table

In this demo, we will apply tidy methods to a small but quite messy dataset.



Set up the environment

These are the libraries we will use to execute our code.
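The exact setup chunk isn’t shown; a minimal set consistent with the functions used below would be (the original may have loaded additional packages):

```r
# Packages used throughout this demo
library(dplyr)    # mutate(), group_by(), summarise(), the %>% pipe
library(tidyr)    # pivot_longer(), pivot_wider(), fill()
library(ggplot2)  # plotting
```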


Access the data

    The data was generated in MySQL Workbench with a SQL script, and the result was saved to a .csv file. Both were uploaded to, and are available on, the author’s GitHub account.

Here the .csv is accessed & stored as an R data.frame.
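A sketch of the import step is below; the file name and GitHub URL are placeholders, not the author’s actual paths:

```r
# Hypothetical raw.githubusercontent.com path to the uploaded .csv
url <- "https://raw.githubusercontent.com/<user>/<repo>/main/flights.csv"
flights_raw <- read.csv(url, stringsAsFactors = FALSE)
flights_raw
```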

##         x      xx Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503   1,841
## 2         delayed          62      12        20           102     305
## 3                          NA                NA            NA        
## 4 AM West on time         694   4,840       383           320     201
## 5         delayed         117     415        65           129      61


Data Formatting

The dataframe is messy. There are some basic issues with its format that need to be fixed (a code sketch addressing them follows the list):

  • Column Names: the first two columns don’t have informative labels and some of the destination names are long. Give all the columns short, intuitive names (for example, the cities are all flight destinations, so use their airport codes).
  • Unnecessary Blank Row: remove
  • Blank Carrier Values: fill in the blank ‘Carrier’ value adjacent to each ‘delayed’ entry in the ‘Status’ field with the ‘Carrier’ value from the row above.
  • ‘chr’ -> ‘int’: some features were read as ‘chr’ because of ‘,’ characters in their values. Remove the ‘,’ characters and recast the ‘PHX’ & ‘SEA’ features as ‘int’.
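One way these formatting steps might be carried out, assuming the raw data.frame is named flights_raw as in the import sketch (the handling of blank vs. NA cells is an assumption about how the file was read):

```r
flights <- flights_raw

# 1. Short, intuitive column names (airport codes for the destinations)
names(flights) <- c("Carrier", "Status", "LAX", "PHX", "SAN", "SFO", "SEA")

# 2. Drop the blank spacer row
flights <- flights[!(is.na(flights$Status) | flights$Status == ""), ]

# 3. Fill the blank Carrier cells with the value from the row above
flights$Carrier[flights$Carrier == ""] <- NA
flights <- fill(flights, Carrier)

# 4. Strip the ',' thousands separators and recast as integers
flights$PHX <- as.integer(gsub(",", "", flights$PHX))
flights$SEA <- as.integer(gsub(",", "", flights$SEA))

rownames(flights) <- NULL
flights
```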
##   Carrier  Status LAX  PHX SAN SFO  SEA
## 1  ALASKA on time 497  221 212 503 1841
## 2  ALASKA delayed  62   12  20 102  305
## 3 AM West on time 694 4840 383 320  201
## 4 AM West delayed 117  415  65 129   61

Data Tidying

    Now that the data has gone through some preliminary cleaning, it can be restructured with ease using dplyr & tidyr methods.
    We will ‘tidy’ the dataset by changing the wide shape to a longer version that gives each observation its own row. The columns for each destination (LAX, PHX, SAN, SFO & SEA) will be melted, or stacked, into separate rows in the dataset. To do this, we will use the pivot_longer() function from the tidyr library.
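A sketch of the reshape, assuming the cleaned data.frame from above is named flights:

```r
# Melt the five destination columns into Destination / count pairs
flights_long <- flights %>%
  pivot_longer(
    cols      = LAX:SEA,
    names_to  = "Destination",
    values_to = "count"
  )
flights_long
```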

## # A tibble: 20 x 4
##    Carrier Status  Destination count
##    <chr>   <chr>   <chr>       <int>
##  1 ALASKA  on time LAX           497
##  2 ALASKA  on time PHX           221
##  3 ALASKA  on time SAN           212
##  4 ALASKA  on time SFO           503
##  5 ALASKA  on time SEA          1841
##  6 ALASKA  delayed LAX            62
##  7 ALASKA  delayed PHX            12
##  8 ALASKA  delayed SAN            20
##  9 ALASKA  delayed SFO           102
## 10 ALASKA  delayed SEA           305
## 11 AM West on time LAX           694
## 12 AM West on time PHX          4840
## 13 AM West on time SAN           383
## 14 AM West on time SFO           320
## 15 AM West on time SEA           201
## 16 AM West delayed LAX           117
## 17 AM West delayed PHX           415
## 18 AM West delayed SAN            65
## 19 AM West delayed SFO           129
## 20 AM West delayed SEA            61


    This version of the dataframe allows us to further manipulate the dataset to suit the needs of our analysis. For instance, we would like to make direct comparisons of flight status, so restructuring the ‘on time’ and ‘delayed’ observations into two distinct columns will facilitate this analysis. This is made simple by the pivot_wider() function from the tidyr library.
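A sketch of that step, continuing from flights_long:

```r
# Spread the Status values back out into two columns, one per status
flights_status <- flights_long %>%
  pivot_wider(
    names_from  = Status,
    values_from = count
  )
flights_status
```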

## # A tibble: 10 x 4
##    Carrier Destination `on time` delayed
##    <chr>   <chr>           <int>   <int>
##  1 ALASKA  LAX               497      62
##  2 ALASKA  PHX               221      12
##  3 ALASKA  SAN               212      20
##  4 ALASKA  SFO               503     102
##  5 ALASKA  SEA              1841     305
##  6 AM West LAX               694     117
##  7 AM West PHX              4840     415
##  8 AM West SAN               383      65
##  9 AM West SFO               320     129
## 10 AM West SEA               201      61


    Great! Now that ‘on time’ and ‘delayed’ variables have distinct columns, we can perform some simple calculations. Here we will use the mutate() function from dplyr to add new features to our data.frame. Note that mutate() can derive new columns from both pre-existing features and features generated within the same call.
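A sketch of the mutate() call; the rounding is one way to reproduce the displayed values:

```r
flights_status <- flights_status %>%
  mutate(
    total4Carrier = `on time` + delayed,                       # row total, usable immediately below
    pOT           = round(`on time` / total4Carrier * 100, 1), # % on time
    pD            = round(delayed / total4Carrier * 100, 1)    # % delayed
  )
flights_status
```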

## # A tibble: 10 x 7
##    Carrier Destination `on time` delayed total4Carrier   pOT    pD
##    <chr>   <chr>           <int>   <int>         <int> <dbl> <dbl>
##  1 ALASKA  LAX               497      62           559  88.9  11.1
##  2 ALASKA  PHX               221      12           233  94.8   5.2
##  3 ALASKA  SAN               212      20           232  91.4   8.6
##  4 ALASKA  SFO               503     102           605  83.1  16.9
##  5 ALASKA  SEA              1841     305          2146  85.8  14.2
##  6 AM West LAX               694     117           811  85.6  14.4
##  7 AM West PHX              4840     415          5255  92.1   7.9
##  8 AM West SAN               383      65           448  85.5  14.5
##  9 AM West SFO               320     129           449  71.3  28.7
## 10 AM West SEA               201      61           262  76.7  23.3


    Observe that in the above call to mutate(), a new variable, ‘total4Carrier’, is formed and is immediately available to calculate the subsequent features ‘pOT’ & ‘pD’.



Data Visualization

    Now we are ready to draw some insights from our data set. The current structure of the data facilitates visualization with calls to ggplot().

Let’s start by plotting:

  • Overall Percent Delayed Flights
  • Delayed Flight Percent by Destination
  • Stacked Percent of Flights by Carrier
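A sketch of the first of these calculations (the original chunk evidently printed the result as formatted text, shown below):

```r
# Overall percentage of delayed flights per carrier
flights_status %>%
  group_by(Carrier) %>%
  summarise(pct_delayed = round(sum(delayed) / sum(`on time` + delayed) * 100, 1))
```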



## Delayed flights ALASKA = 13.3 %
## Delayed flights AM WEST =  10.9 %

ALASKA appears to have a higher overall percentage of delayed flights than AM WEST, however…
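A sketch of how the box plot and test described next could be produced from flights_status:

```r
# Distribution of per-destination delay percentages for each carrier
ggplot(flights_status, aes(x = Carrier, y = pD)) +
  geom_boxplot() +
  labs(y = "% delayed flights")

# Wilcoxon rank-sum test comparing the two carriers' delay percentages
wilcox.test(pD ~ Carrier, data = flights_status)
```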


    The figure above is a box plot showing the distribution of the % delayed flights for each airline carrier across the destinations we have data on. There is a trend for AM West flights to have a higher percentage of delayed flights. However, the difference between the two distributions was not found to be significant (p = 0.22, Wilcoxon test).

    This figure is interesting because the per-city distributions show AM West to have higher delay rates than ALASKA, whereas the overall delay rate suggests the opposite (ALASKA 13.3% > AM WEST 10.9%). Let’s visualize the data as a barplot:
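A sketch of the barplot, again built from flights_status:

```r
# % delayed flights per destination, one bar per carrier
ggplot(flights_status, aes(x = Destination, y = pD, fill = Carrier)) +
  geom_col(position = "dodge") +
  labs(y = "% delayed flights")
```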

    The figure above plots the percentage of delayed flights per destination for both of the airline carriers. Although the difference between the grouped delay-percentage distributions was not significant, we can see here that for every destination, AM West consistently has a higher percentage of delayed flights.
    This figure plots the percentage of flights per destination for each carrier. Each of these measures represents the chance that a flight to a given destination is delayed, given the identity of the airline carrier. For example,
\(P_{SAN}( delayed \mid ALASKA ) = 8.6\%\) or 0.086 (if you prefer proportions)

    This information gives us the conditional probability. With it we can answer questions like, “If we randomly chose an Alaska flight with a destination of SAN, what is the probability that it will be delayed?”

By weighting the conditional probability \(P_{SAN}( delayed \mid ALASKA )\) by \(P_{SAN}( ALASKA )\), we would get an estimate of the probability that a flight to SAN is a delayed ALASKA flight.
But first we need to calculate \(P_{SAN}( ALASKA )\)…
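A sketch of the stacked percentage bar plot described below:

```r
# Each carrier's share of total flights to each destination, i.e. P_dest(Carrier);
# position = "fill" normalizes every bar to 100%
ggplot(flights_status, aes(x = Destination, y = total4Carrier, fill = Carrier)) +
  geom_col(position = "fill") +
  labs(y = "Proportion of flights to destination")
```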

    The figure above is a stacked percentage bar plot. It shows the percentage of total flights to a city for each air carrier. This gives the probability that a flight to a given city was from a specific carrier. For example: \(P_{SAN}( ALASKA ) = 34.1\%\) or 0.341 (if you prefer proportions).

    If we imagined, for the sake of this demo, that there were no other airline carriers operating out of these destinations, then this visualization could answer questions like, “If we randomly chose a flight with a destination of SAN, what is the probability that it is an Alaska flight?”

If we continue with our SAN example, we can summarize the information as follows:

Status     ALASKA (34.1%)   AM WEST (65.9%)
on time    91.4%            85.5%
delayed    8.6%             14.5%

    Now we can use Bayes’ Theorem to invert the conditional probability and find a different relation in the data: the probability that a flight to a certain destination is from a certain carrier, given that the flight was delayed, or
\(P_{SAN}( ALASKA \mid delayed )\)
In plain words, this is like asking, “If we randomly chose a delayed flight with a destination of SAN, what is the probability that it was an ALASKA flight?”

Applying Bayes’ Theorem:

\(P_{SAN}( ALASKA \mid delayed ) = \frac{P_{SAN}( delayed \mid ALASKA ) \cdot P_{SAN}( ALASKA )}{P_{SAN}( delayed \mid ALASKA ) \cdot P_{SAN}( ALASKA ) + P_{SAN}( delayed \mid AM\ WEST ) \cdot P_{SAN}( AM\ WEST )}\)

    That looks really long and complicated. Don’t be intimidated! We already have all of the information needed in our data.frame…
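Plugging the SAN values from the table above into the formula gives a quick sanity check:

\(P_{SAN}( ALASKA \mid delayed ) = \frac{0.086 \times 0.341}{0.086 \times 0.341 + 0.145 \times 0.659} \approx \frac{0.029}{0.125} \approx 0.235\)

A sketch of the full calculation with mutate(), continuing from flights_status (the new column names here are illustrative, not the author’s):

```r
flights_bayes <- flights_status %>%
  group_by(Destination) %>%
  mutate(
    p_carrier         = total4Carrier / sum(total4Carrier),  # P_dest(Carrier)
    p_delayed_given_c = delayed / total4Carrier,              # P_dest(delayed | Carrier)
    p_c_given_delayed = p_delayed_given_c * p_carrier /
                          sum(p_delayed_given_c * p_carrier)  # Bayes' theorem
  ) %>%
  ungroup()

# Stacked bars of P_dest(Carrier | delayed) per destination
ggplot(flights_bayes, aes(x = Destination, y = p_c_given_delayed, fill = Carrier)) +
  geom_col() +
  labs(y = "P(Carrier | delayed)")
```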

    The figure above looks very similar to the previous figure, but it represents some very different information. The previous figure shows the marginal probability, or the probability of observing an ALASKA flight or an AM WEST flight at a destination. Before that, we used a bar plot to show the conditional probability of a flight being delayed given we know the airline. The figure above shows a different conditional probability: the chance that a flight is either ALASKA or AM West given we know that the flight was delayed.
    The outcome shown in the figure above is as expected: the probability that a delayed flight is AM WEST is always greater than the probability that a flight is AM WEST, because AM WEST has a consistently higher likelihood of delay. In fact, in the case of SFO flights, even though a given flight to SFO is more likely to be an ALASKA flight (57.4%), a delayed flight is more likely to be AM WEST (55.8%), because AM WEST has a much higher rate of delays (28.7% AM WEST vs 16.9% ALASKA).

Conclusions

    In this demo we practiced tidying and transforming a dataset. We used a small but quite messy dataset that summarized the on-time and delayed status of arrivals to several airports for two airline carriers. After some initial formatting, we used methods from tidyr and dplyr to further clean our data. Making the data orderly is beneficial because it allows time and resources to be focused on generating insights from the data. Once the data was made tidy, we performed a brief analysis to compare arrival delays for the two airlines.

Hopefully this demo reinforced the idea that a clean & tidy dataset is its own reward!

References

All images: Watterson, Bill. The Essential Calvin and Hobbes: A Calvin and Hobbes Treasury. 1988. Print.
Wickham, Hadley. “Tidy data.” Journal of Statistical Software 59.10 (2014): 1-23.
introducing tidyr
tidyr tidyverse
Pimp my Rmd