Here, we will take some messy data and tidy it
#Imports the csv (which I put on github)
dirty <- read.csv('https://raw.githubusercontent.com/davidblumenstiel/data/master/Flight.csv')
dirty
## X X.1 Los.Angeles Pheonix San.Diego San.Franscisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AMWEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
#Removing blank rows
df <- na.omit(dirty)
df
## X X.1 Los.Angeles Pheonix San.Diego San.Franscisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 4 AMWEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
#Adding column names
names(df)[1:2] <- c("Airline", "Status")
#Appending missing values in "Airline"
#This is a little overkill, but would work if the dataset were larger
j = 0
for (i in df$Airline) {
j = j + 1
if (i != "") {
x = i
}
else {
df$Airline[j] <- x
}
}
df
## Airline Status Los.Angeles Pheonix San.Diego San.Franscisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 4 AMWEST on time 694 4840 383 320 201
## 5 AMWEST delayed 117 415 65 129 61
#Tidying up the destinations
clean <- gather(df,"Destination","Number Flights",3:7)
head(clean)
## Airline Status Destination Number Flights
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA delayed Los.Angeles 62
## 3 AMWEST on time Los.Angeles 694
## 4 AMWEST delayed Los.Angeles 117
## 5 ALASKA on time Pheonix 221
## 6 ALASKA delayed Pheonix 12
#Creates a vector of the total number of flights per airline and destination
totals <- rep(clean$`Number Flights`[clean$Status == "on time"] + clean$`Number Flights`[clean$Status == "delayed"], each = 2)
#Initiates new column (Proportion Ontime/Delayed)
clean["Proportion Ontime/Delayed"] <- NA
#Fills the new column
clean$`Proportion Ontime/Delayed` <- clean$`Number Flights`/totals
head(clean)
## Airline Status Destination Number Flights Proportion Ontime/Delayed
## 1 ALASKA on time Los.Angeles 497 0.88908766
## 2 ALASKA delayed Los.Angeles 62 0.11091234
## 3 AMWEST on time Los.Angeles 694 0.85573366
## 4 AMWEST delayed Los.Angeles 117 0.14426634
## 5 ALASKA on time Pheonix 221 0.94849785
## 6 ALASKA delayed Pheonix 12 0.05150215
delayed <- filter(clean,Status == "delayed")
ggplot(delayed, aes(Airline, `Proportion Ontime/Delayed`)) + geom_boxplot()
Although there isn’t too much data to work with, it appears that the ‘AMWEST’ airline had a higher proportion of delayed flights than did the ‘ALASKA’ airline.