Tidying Data

Here, we will take some messy data and tidy it

Importing the untidy data

#Imports the csv (which I put on github)
dirty <- read.csv('https://raw.githubusercontent.com/davidblumenstiel/data/master/Flight.csv')
dirty
##        X     X.1 Los.Angeles Pheonix San.Diego San.Franscisco Seattle
## 1 ALASKA on time         497     221       212            503    1841
## 2        delayed          62      12        20            102     305
## 3                         NA      NA        NA             NA      NA
## 4 AMWEST on time         694    4840       383            320     201
## 5        delayed         117     415        65            129      61

Cleaning up

#Removing blank rows
df <- na.omit(dirty)
df
##        X     X.1 Los.Angeles Pheonix San.Diego San.Franscisco Seattle
## 1 ALASKA on time         497     221       212            503    1841
## 2        delayed          62      12        20            102     305
## 4 AMWEST on time         694    4840       383            320     201
## 5        delayed         117     415        65            129      61
#Adding column names

names(df)[1:2] <- c("Airline", "Status")

#Appending missing values in "Airline"
#This is a little overkill, but would work if the dataset were larger
j = 0
for (i in df$Airline) {
  j = j + 1
  if (i != "") {
    x = i
  }
  else {
    df$Airline[j] <- x
  }
}
df
##   Airline  Status Los.Angeles Pheonix San.Diego San.Franscisco Seattle
## 1  ALASKA on time         497     221       212            503    1841
## 2  ALASKA delayed          62      12        20            102     305
## 4  AMWEST on time         694    4840       383            320     201
## 5  AMWEST delayed         117     415        65            129      61
#Tidying up the destinations
clean <- gather(df,"Destination","Number Flights",3:7)
head(clean)
##   Airline  Status Destination Number Flights
## 1  ALASKA on time Los.Angeles            497
## 2  ALASKA delayed Los.Angeles             62
## 3  AMWEST on time Los.Angeles            694
## 4  AMWEST delayed Los.Angeles            117
## 5  ALASKA on time     Pheonix            221
## 6  ALASKA delayed     Pheonix             12

Comparing arrivals:

What proportion of flights were on time/delayed for each destination across each airline?
#Creates a vector of the total number of flights per airline and destination
totals <- rep(clean$`Number Flights`[clean$Status == "on time"] + clean$`Number Flights`[clean$Status == "delayed"], each = 2)

#Initiates new column (Proportion Ontime/Delayed)
clean["Proportion Ontime/Delayed"] <- NA

#Fills the new column
clean$`Proportion Ontime/Delayed` <- clean$`Number Flights`/totals
head(clean)
##   Airline  Status Destination Number Flights Proportion Ontime/Delayed
## 1  ALASKA on time Los.Angeles            497                0.88908766
## 2  ALASKA delayed Los.Angeles             62                0.11091234
## 3  AMWEST on time Los.Angeles            694                0.85573366
## 4  AMWEST delayed Los.Angeles            117                0.14426634
## 5  ALASKA on time     Pheonix            221                0.94849785
## 6  ALASKA delayed     Pheonix             12                0.05150215
A little visualization
delayed <- filter(clean,Status == "delayed")

ggplot(delayed, aes(Airline, `Proportion Ontime/Delayed`)) + geom_boxplot()

Although there isn’t too much data to work with, it appears that the ‘AMWEST’ airline had a higher proportion of delayed flights than did the ‘ALASKA’ airline.