The quality of results relies on the quality of the source data. Therefore, cleaning data is a necessary step before real data analysis can begin.
Hadley Wickham has outlined a standard ‘tidy’ organization for data where:
In this demo, we will apply tidy methods to a small, but quite messy dataset.
These are the libraries we will use to execute our code
The data was generated in MySQL Workbench with the a SQL script and the result was saved to a .csv file. Both were uploaded and available on the author’s github account.
Here the .csv is accessed & stored as an R data.frame
#the url for the .csv file
flightsURL <-'https://raw.githubusercontent.com/SmilodonCub/DATA607/master/flights.csv'
#read to an R data.frame with read.csv()
flights_df <- read.csv( flightsURL ,stringsAsFactors = FALSE)
#display the result
flights_df## x xx Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1,841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA
## 4 AM West on time 694 4,840 383 320 201
## 5 delayed 117 415 65 129 61
The dataframe is messy. There are some basic issues with the format that need be fixed:
#make some column names that are more intuitive
colnames( flights_df ) <- c('Carrier', 'Status', 'LAX', 'PHX', 'SAN', 'SFO', 'SEA')
#filter out a row if all values are blank
#then, if a cell holds chr values & is filled with an empty string, replace with 'NA'
#then, use the fill() to replace NA with the value above it.
flights_df <- flights_df %>% filter_all(all_vars(!is.na(.))) %>% mutate_if(is.character, list(~na_if(.,""))) %>% fill(Carrier)
#2 columns are currently class 'chr' but should be class 'int'. To fix this,
#must remove ',' from any values and recast as 'int'
flights_df[,c('PHX','SEA')] <- lapply(flights_df[,c('PHX','SEA')],
function(i) as.integer(sub(',', '', i)))
#displar the result
flights_df## Carrier Status LAX PHX SAN SFO SEA
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM West on time 694 4840 383 320 201
## 4 AM West delayed 117 415 65 129 61
Now that the data has gone through some preliminary cleaning, it can be restructured with ease using dplyr & tidyr methods.
We will ‘tidy’ the dataset by changing the wide shape to a longer version that gives each observation it’s own row. The columns for each destination ( LAX, PHX, SAN, SFO & SEA) will be melted, or stacked into seperate rows in the dataset. To do this, we will use the pivot_longer() function from the tidyr library.
#use the tidyr pivot_longer() function from tidyr to pivot long a subset of the columns (LAX:SEA) with the labels to a new column "Destination" and the values to another new column "count"
flights_long <- flights_df %>%
pivot_longer(cols = LAX:SEA, names_to = "Destination", values_to = "count")
#display the result
flights_long## # A tibble: 20 x 4
## Carrier Status Destination count
## <chr> <chr> <chr> <int>
## 1 ALASKA on time LAX 497
## 2 ALASKA on time PHX 221
## 3 ALASKA on time SAN 212
## 4 ALASKA on time SFO 503
## 5 ALASKA on time SEA 1841
## 6 ALASKA delayed LAX 62
## 7 ALASKA delayed PHX 12
## 8 ALASKA delayed SAN 20
## 9 ALASKA delayed SFO 102
## 10 ALASKA delayed SEA 305
## 11 AM West on time LAX 694
## 12 AM West on time PHX 4840
## 13 AM West on time SAN 383
## 14 AM West on time SFO 320
## 15 AM West on time SEA 201
## 16 AM West delayed LAX 117
## 17 AM West delayed PHX 415
## 18 AM West delayed SAN 65
## 19 AM West delayed SFO 129
## 20 AM West delayed SEA 61
This version of the dataframe allows us to further manipulate the dataset to suit the needs of our analysis. For instance, we would like to make direct comparisons of flight status. Therefore, restructuring ‘on time’ and ‘delayed’ observations across two distinct columns would facilitate this analysis. This is made simple by the pivot_wider() function from the tidyr library.
#use the tidyr pivot_wider() function to increase the width by splitting 'Status' into two columns: 'on time' and 'delayed'
flights_status <- flights_long %>% pivot_wider( names_from = Status, values_from = count )
flights_status## # A tibble: 10 x 4
## Carrier Destination `on time` delayed
## <chr> <chr> <int> <int>
## 1 ALASKA LAX 497 62
## 2 ALASKA PHX 221 12
## 3 ALASKA SAN 212 20
## 4 ALASKA SFO 503 102
## 5 ALASKA SEA 1841 305
## 6 AM West LAX 694 117
## 7 AM West PHX 4840 415
## 8 AM West SAN 383 65
## 9 AM West SFO 320 129
## 10 AM West SEA 201 61
Great!, now that ‘on time’ and ‘delayed’ variables have distinct columns, we can perform some simple calculations. Here we will use the mutate() function from dplyr to add new features to our data.frame. Note that we can derive columnwise manipulations of both pre-existing features and features we generate from within our call to mutate().
#use the mutate() function from dplyr to add new features
results_df <- flights_status %>% mutate( total4Carrier = `on time`+ delayed,
pOT = round( `on time`/total4Carrier*100, 1) ,
pD = round( delayed/total4Carrier*100, 1 ) )
results_df## # A tibble: 10 x 7
## Carrier Destination `on time` delayed total4Carrier pOT pD
## <chr> <chr> <int> <int> <int> <dbl> <dbl>
## 1 ALASKA LAX 497 62 559 88.9 11.1
## 2 ALASKA PHX 221 12 233 94.8 5.2
## 3 ALASKA SAN 212 20 232 91.4 8.6
## 4 ALASKA SFO 503 102 605 83.1 16.9
## 5 ALASKA SEA 1841 305 2146 85.8 14.2
## 6 AM West LAX 694 117 811 85.6 14.4
## 7 AM West PHX 4840 415 5255 92.1 7.9
## 8 AM West SAN 383 65 448 85.5 14.5
## 9 AM West SFO 320 129 449 71.3 28.7
## 10 AM West SEA 201 61 262 76.7 23.3
Observe that in the above call the mutate(), a new variable, ‘total’ is formed and is immediately available to calculate the subsequent features ‘pOT’ & ‘pD’.
Now we are ready to draw some insights from our data set. The current structure of the data facilitates visualization with calls to ggplot().
Let’s start by plotting:
#calculate the overal delay rate for the two airline carriers
flights_summary <- colSums(t(flights_df[,c('LAX','PHX','SAN','SFO','SEA')]))
delayedALASKA <- round(flights_summary[2]/(flights_summary[1]+flights_summary [2])*100,1)
delayedAM_WEST <- round(flights_summary[4]/(flights_summary[3]+flights_summary [4])*100,1)
cat( 'Delayed flights ALASKA =', delayedALASKA,'%\nDelayed flights AM WEST = ',delayedAM_WEST,'%')## Delayed flights ALASKA = 13.3 %
## Delayed flights AM WEST = 10.9 %
ALASKA appears to have more delayed flights than AM WEST, however….
#make a boxplot of the % delayed flights for each carrier
p <- ggplot(results_df, aes(x=Carrier, y=pD, fill=Carrier)) +
geom_boxplot() +
stat_compare_means() +
stat_compare_means( aes(label = ..p.signif..), label.x = 0.92, label.y = 27.5) +
ylab( 'Delayed %') +
labs( title = 'Delayed Flight %', subtitle= 'overall airline performance')
p
The figure above is a box plot showing the distribution of the % delayed flights of each airline carrier for the destinations we have data on. There is a trend for AM West flights to have a higher percentage of delayed flights. However, the difference between the two distributions was not found to be significant (p=0.22,Wilkoxon t-test).
This figure is interesting because the distributions shows AM West to have higher delay rates per city than ALASKA whereas the overall delay rate suggests the opposite (ALASKA 13.3% > AM WEST 10.9%). Let’s visualize the data as a barplot:
#make a bar plot of the %delayed flights for each airline for each destination
ggplot( results_df, aes(fill=Carrier, y=pD, x=Destination)) +
geom_bar(position="dodge", stat="identity" ) +
ylab( '% Delayed') +
labs( title = 'Delayed Flight %', subtitle= 'what is the % chance that a flight to a destination is delayed for each Carrier') The figure above plots the percentage of delayed flights per destination for both of the airline carriers. Although there was no difference in significance of the distributions between the grouped delayed percentages, we can see here that for every destination, AM West consistently has a higher percentage of delayed flights.
This figure plots the percentage of flights per destination for each carrier. Each of these measure represents the chance of a flight to a certain destination to be delayed if we know the identity of the airline carrier. For example,
PSAN( delayed | ALASKA ) = 8.6% or 0.086 (if you prefer proportions)
This information gives us the conditional probability. With it we can answer questions like, “If we randomly chose an Alaska flight with a destination of SAN, what is the probability that it will be delayed?”
By weighing the conditional probability PSAN( delayed | ALASKA ) against PSAN( ALASKA ), we would gt an estimate of the probability that a flight to a SAN is a delayed ALASKA flight.
But first we need to calculate PSAN( ALASKA ) ….
#we need to find the total flights for both carriers to a destination.
#pivot_wider the carrier totals by destination...
findT <- results_df %>% pivot_wider( names_from = Destination, values_from = total4Carrier )
#then take a sum of the columns (ignore NA)
totals <- colSums( findT[,c('LAX','PHX','SAN','SFO','SEA')], na.rm = TRUE)
#now we just need to rep the totals*2 and pass he values to a
#new column in our results dataframe
results_df <- results_df %>% mutate( total4All = as.numeric(rep( totals,2)),
pTotal = total4Carrier/total4All )
#for plottong purposes: positions for labels
results_df$labelpos <- ifelse(results_df$Carrier=="ALASKA",
1 - results_df$pTotal/2, results_df$pTotal/2)
# Make a figure with the stacked percentages of flights
#from each carrier to a destination
ggplot(results_df, aes(fill=Carrier, y=pTotal, x=Destination)) +
geom_bar(position="fill", stat="identity") +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(label = paste0(round(pTotal*100,1),"%"),y=labelpos),size = 3) +
ylab( '% of Total Flights to a destination') +
labs( title = '% Flights by Carrier', subtitle= 'what is the chance a flight to a given destination is ALASKA or AM West')The figure above is a stacked percentage bar plot. It shows the percentage of total flights to a city for each air carrier. This gives the probability that a flight to a given city was from a specific carrier. For example: PSAN( ALASKA ) = 34.1% or 0.341 (if you prefere proportions)
If we imagined, for the sake of this demo, that there were no other airline carriers operating out of these destinations then this visualization could answer questions like, “If we randomly chose a flight with a destination of SAN, what is the probability that it is an Alaska flight?”
If we continue with our SAN example, we can summarize the information as follows:
| Status | ALASKA (34.1%) | AM WEST (65.9%) |
|---|---|---|
| on time | 91.4% | 85.5% |
| delayed | 8.6% | 14.5% |
Now we can use Bayes’ Theorem to invert the conditional probability to find a different relation in the data: the Probability of a flight to a certain destination to be from a certain Carrier given the flight was delayed. or,
PSAN( ALASKA | delayed )
In plain words, this is like asking, “If we randomly chose an delayed flight with a destination of SAN, what is the probability that it was an ALASKA flight?”
Applying Bayes’ Theorem:
PSAN( ALASKA | delayed ) = \(\frac{P_{SAN}( delayed | ALASKA ) * P_{SAN}( ALASKA )}{P_{SAN}( delayed | ALASKA ) x P_{SAN}( ALASKA ) + P_{SAN}( delayed | AM WEST ) x P_{SAN}( AM WEST )}\)
That looks really long and complicated. Don’t be intimidated! We already have all of the information needed in our data.frame…
#let's add two new columns that shift the values for the 'pD' & 'pTotal'
#features for the opposite carrier are in the same row.
#this will make calculating the inverse probability a
#relatively simple columnwise opperation
results_df$opppD <- results_df$pD[ c(6:10,1:5)]
results_df$opppTotal <- results_df$pTotal[ c(6:10,1:5)]
#now, apply Bayes!
results_df$invProb <- (results_df$pD*results_df$pTotal)/
(results_df$pD*results_df$pTotal + results_df$opppD*results_df$opppTotal)
#for plotting purposes: label positions
results_df$labelpos <- ifelse(results_df$Carrier=="ALASKA",
1 - results_df$invProb/2, results_df$invProb/2)
#make a stacked barplot showing the chance that, given a flight is delayed,
#that it was either ALASKA or AM WEST
ggplot(results_df, aes(fill=Carrier, y=invProb, x=Destination)) +
geom_bar(position="fill", stat="identity") +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(label = paste0(round(invProb*100,1),"%"),y=labelpos),size = 3) +
ylab( '% Carrier') +
labs( title = '% Chance a delayed flight is from a carrier', subtitle= 'if a flight is delayed, what is the % chance it was ALASKA or AM WEST') The figure above looks very similar to the previous figure, but is represents some very different information. The previous figure shows the marginal probability, or the probability of observing an ALASKA flight or an AM WEST flight at a destination. Before that, we used a bar plot to show the conditional probability of a flight being delayed given we know the airline. The figure above shows a different conditional probability, the chance that a flight is either ALASKA or AM West given we know that the flight was delayed.
The outcome of the figure above is as expected: The probability that a delayed flight is AM WEST is always greater than the probability that a flight is AM WEST because AM WEST has consistently higher likelihood of delay. In fact, for the case of SFO flights, even though a given flight to SFO is more likely to be an ALASKA flight (57.4%), a delayed flight is more likely to be AM WEST (55.8%) because AM WEST has a much higher rate of delays (28.7% AM WEST vs 16.9% ALASKA).
In this demo will practiced tidying and transforming a dataset. We used a small, but quite messy dataset that summarized the on time and delayed status of arrivals to several airports for two airline carriers. After some initial formatting, we used methods from tidyr and dplyer to further clean our data. Making the data orderly is beneficial because it allows time and resources to be focused on generating insights from the data. Once the data was made tidy, we performed a brief analysis to compare arrival delays for the the two airlinees.
Hopefully this demo reinforced the idea that a clean & tidy dataset is it’s own reward!
All images: Watterson, Bill. The Essential Calvin and Hobbes: A Calvin and Hobbes Treasury. , 1988. Print. Wickham, Hadley. “Tidy data.” Journal of Statistical Software 59.10 (2014): 1-23.
introducing tidyr
tidyr tidyverse
Pimp my Rmd