When working with data, you should expect to spend a good amount of time in the clean-up process, but it is not always ‘messy’ or unreadable. For example, data can still be organized in a data-frame in a way that is readable, but at the same time it may not be useful in such presented formats. In these cases, we may have to transpose the data-frame and re-organize to fit our needs.
The data is loaded into a data-frame called ‘flights_df’.
rm(list = ls())
flights_df <- data.frame(
c('ALASKA',NA,NA,'AM WEST',NA),
c('on time','delayed',NA,'on time','delayed'),
c(497,62,NA,694,117),
c(221,12,NA,4840,415),
c(212,20,NA,383,65),
c(503,102,NA,320,129),
c(1841,305,NA,201,61)
)
colnames(flights_df) <- c('','','Los Angeles','Phoenix','San Diego','San Francisco','Seattle')
#view current state of data-frame
flights_df
## Los Angeles Phoenix San Diego San Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
We can now write the ‘flights_df’ data-frame to a CSV file. Within the ‘write.csv()’ function, I specified the ‘row.names’ parameter to ‘FALSE’ to avoid creating non-blank row names. I also set the ‘na’ parameter to “” to replace all ‘NA’ values with blanks. The ‘stringsAsFactors’ parameter is set to ‘FALSE’ to avoid changing the character entries. At first glance it looks correct, but the data-frame still shows up with ‘NA’ values in some of the cells.
#write to csv file
write.csv(flights_df, "flights.csv", row.names = FALSE, na = "")
#read from csv file into R
flights_csv <- read.csv('flights.csv', stringsAsFactors = FALSE, check.names = FALSE)
flights_csv
## Los Angeles Phoenix San Diego San Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
To remove the remaining ‘NA’ values I subsetted the data-frame and set any value that ‘is.na’ to ““. When calling the updated data, it seemed to have removed the blank row entirely. To confirm that this was not the case, I presented the data nicely using the ‘datatable()’ function.
#change NA values to blank
flights_csv[is.na(flights_csv)] <- ""
#check updated data-frame
flights_csv
## Los Angeles Phoenix San Diego San Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
# Display the data frame nicely
datatable(flights_csv)
The previous output looks almost perfect, but the first two columns were given default names, namely “Var.2” and “Var.3”. To avoid this, I set the ‘options’ parameter accordingly. We can now see that our data-frame is formatted exactly as we want it to be.
#specify additional options
datatable(flights_csv, options = list(columnDefs = list(list(targets = c(1, 2), title = ""))))
Before we can pivot, we need to change the column/variable names to a valid format. This means no white-spaces or special characters.
#check data frame again
flights_csv
## Los Angeles Phoenix San Diego San Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
Here, I replace white-spaces with a period and labeled the first two columns that originally had blank titles as ‘Airline’ and ‘Status’. For the blank row entries in the ‘Airline’ column, labeled them according to the proper airline.
#
colnames(flights_csv) <- c('Airline','Status','Los.Angeles','Phoenix','San.Diego','San.Francisco','Seattle')
flights_csv$Airline[2] <- 'ALASKA'
flights_csv$Airline[5] <- 'AM WEST'
flights_csv
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3
## 4 AM WEST on time 694 4840 383 320 201
## 5 AM WEST delayed 117 415 65 129 61
Now that all of the required data is loaded, we need to transform what we have so that it is readable to our liking. In this case, this means pivoting the matrix and aggregating values and columns. To avoid gaps in our pivoted data, I remove the blank row before applying the ‘pivot_longer()’ function.
flights_csv <- flights_csv[-3, ]
p_flights <- pivot_longer(flights_csv,
cols = c('Los.Angeles','Phoenix','San.Diego','San.Francisco','Seattle'),
names_to = "City",
values_to = "Flight.Count")
p_flights
## # A tibble: 20 × 4
## Airline Status City Flight.Count
## <chr> <chr> <chr> <chr>
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San.Diego 212
## 4 ALASKA on time San.Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los.Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San.Diego 20
## 9 ALASKA delayed San.Francisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los.Angeles 694
## 12 AM WEST on time Phoenix 4840
## 13 AM WEST on time San.Diego 383
## 14 AM WEST on time San.Francisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los.Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San.Diego 65
## 19 AM WEST delayed San.Francisco 129
## 20 AM WEST delayed Seattle 61
We need to change the ‘Flight.Count’ column to numeric values to perform mathematical operations. After doing so, we can group the data according to Airline, City, and Status, in that order.
p_flights$Flight.Count <- as.numeric(p_flights$Flight.Count)
#Aggregate data by city and airline
grouped_flights <- p_flights %>%
group_by(Airline, City, Status) %>%
summarize(Flight.Count, .groups = 'drop')
grouped_flights
## # A tibble: 20 × 4
## Airline City Status Flight.Count
## <chr> <chr> <chr> <dbl>
## 1 ALASKA Los.Angeles delayed 62
## 2 ALASKA Los.Angeles on time 497
## 3 ALASKA Phoenix delayed 12
## 4 ALASKA Phoenix on time 221
## 5 ALASKA San.Diego delayed 20
## 6 ALASKA San.Diego on time 212
## 7 ALASKA San.Francisco delayed 102
## 8 ALASKA San.Francisco on time 503
## 9 ALASKA Seattle delayed 305
## 10 ALASKA Seattle on time 1841
## 11 AM WEST Los.Angeles delayed 117
## 12 AM WEST Los.Angeles on time 694
## 13 AM WEST Phoenix delayed 415
## 14 AM WEST Phoenix on time 4840
## 15 AM WEST San.Diego delayed 65
## 16 AM WEST San.Diego on time 383
## 17 AM WEST San.Francisco delayed 129
## 18 AM WEST San.Francisco on time 320
## 19 AM WEST Seattle delayed 61
## 20 AM WEST Seattle on time 201
As shown above, for each airline there are duplicate entries for each city in order to match to either the ‘delayed’ or ‘on time’ status. This gives us a data-frame with 20 rows. We can condense this down to just 10 rows by including the flight status for each city on the same row by creating a ‘delayed’ column and an ‘on time’ column. I relabeled the resulting ‘on time’ column to ‘on.time’ since the former label used a white-space.
#present corresponding values for (on time) and (delayed) side by side
wide_flights <- grouped_flights %>%
pivot_wider(names_from = Status,
values_from = Flight.Count)
colnames(wide_flights)[4] <- 'on.time'
wide_flights
## # A tibble: 10 × 4
## Airline City delayed on.time
## <chr> <chr> <dbl> <dbl>
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
Now that we have data that is readable to our liking, we can create new columns to use for our analysis. Here, I created a column for the percentage of delayed flights relative to all flights at the given destination city and corresponding to a specific airline.
#create new column to show percentage of delayed flights for each row
wide_flights$pct_delayed <- wide_flights$delayed / (wide_flights$delayed + wide_flights$on.time)
wide_flights
## # A tibble: 10 × 5
## Airline City delayed on.time pct_delayed
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 ALASKA Los.Angeles 62 497 0.111
## 2 ALASKA Phoenix 12 221 0.0515
## 3 ALASKA San.Diego 20 212 0.0862
## 4 ALASKA San.Francisco 102 503 0.169
## 5 ALASKA Seattle 305 1841 0.142
## 6 AM WEST Los.Angeles 117 694 0.144
## 7 AM WEST Phoenix 415 4840 0.0790
## 8 AM WEST San.Diego 65 383 0.145
## 9 AM WEST San.Francisco 129 320 0.287
## 10 AM WEST Seattle 61 201 0.233
I decided to create two more columns for the mean of delayed flights as well as the total number of flights for each airline, regardless of destination. This is to compute a percentage for the mean in the same way that I computed a percentage for the individual cities.
mAir_delay <- wide_flights %>%
group_by(Airline) %>%
summarize(mean_delayed = mean(delayed))
mDelay_AMWest <- mAir_delay$mean_delayed[1]
mDelay_Alaska <- mAir_delay$mean_delayed[2]
wide_flights$mean_delay <- NA
wide_flights$mean_delay <- ifelse(wide_flights$Airline == "AM WEST", mDelay_AMWest, wide_flights$mean_delay)
wide_flights$mean_delay <- ifelse(wide_flights$Airline == "ALASKA", mDelay_Alaska, wide_flights$mean_delay)
wide_flights
## # A tibble: 10 × 6
## Airline City delayed on.time pct_delayed mean_delay
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA Los.Angeles 62 497 0.111 157.
## 2 ALASKA Phoenix 12 221 0.0515 157.
## 3 ALASKA San.Diego 20 212 0.0862 157.
## 4 ALASKA San.Francisco 102 503 0.169 157.
## 5 ALASKA Seattle 305 1841 0.142 157.
## 6 AM WEST Los.Angeles 117 694 0.144 100.
## 7 AM WEST Phoenix 415 4840 0.0790 100.
## 8 AM WEST San.Diego 65 383 0.145 100.
## 9 AM WEST San.Francisco 129 320 0.287 100.
## 10 AM WEST Seattle 61 201 0.233 100.
Here I include columns for the total number of delayed flights, total number of flights, and delay percentages for the overall airlines. Then added a final column that calculates the mean across all cities of the percentages of delayed flights for each airline.
mAir_total1 <- wide_flights %>%
group_by(Airline) %>%
summarize(delay_total = sum(delayed))
mAir_total <- wide_flights %>%
group_by(Airline) %>%
summarize(total = sum(delayed, on.time))
mDel_AMWest <- mAir_total1$delay_total[2]
mDel_Alaska <- mAir_total1$delay_total[1]
mTot_AMWest <- mAir_total$total[1]
mTot_Alaska <- mAir_total$total[2]
wide_flights$delay_total <- NA
wide_flights$total <- NA
wide_flights$delay_total <- ifelse(wide_flights$Airline == "AM WEST", mDel_AMWest, wide_flights$delay_total)
wide_flights$delay_total <- ifelse(wide_flights$Airline == "ALASKA", mDel_Alaska, wide_flights$delay_total)
wide_flights$total <- ifelse(wide_flights$Airline == "AM WEST", mTot_AMWest, wide_flights$total)
wide_flights$total <- ifelse(wide_flights$Airline == "ALASKA", mTot_Alaska, wide_flights$total)
wide_flights$airline_delay_pct <- wide_flights$delay_total / wide_flights$total
wide_flights$pct_mean_delayed <- NA
wide_flights$pct_mean_delayed <- wide_flights$mean_delay / wide_flights$total
wide_flights
## # A tibble: 10 × 10
## Airline City delayed on.time pct_delayed mean_delay delay_total total
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA Los.Angeles 62 497 0.111 157. 501 7225
## 2 ALASKA Phoenix 12 221 0.0515 157. 501 7225
## 3 ALASKA San.Diego 20 212 0.0862 157. 501 7225
## 4 ALASKA San.Francis… 102 503 0.169 157. 501 7225
## 5 ALASKA Seattle 305 1841 0.142 157. 501 7225
## 6 AM WEST Los.Angeles 117 694 0.144 100. 787 3775
## 7 AM WEST Phoenix 415 4840 0.0790 100. 787 3775
## 8 AM WEST San.Diego 65 383 0.145 100. 787 3775
## 9 AM WEST San.Francis… 129 320 0.287 100. 787 3775
## 10 AM WEST Seattle 61 201 0.233 100. 787 3775
## # ℹ 2 more variables: airline_delay_pct <dbl>, pct_mean_delayed <dbl>
In my analysis, I decided to find the sample standard deviation based on the percentage points that I computed for the individual cities and the percentages of the means. I did this to compare the delays according variation in the data set.
sample_pct_sd <- wide_flights %>%
group_by(Airline) %>%
reframe(pct_delayed, pct_mean_delayed)
sample_pct_sd$sq_diff <- (sample_pct_sd$pct_delayed - sample_pct_sd$pct_mean_delayed)^2
sample_pct_sd1 <- sample_pct_sd %>%
group_by(Airline) %>%
summarize(pct_sd = sqrt(sum(sq_diff)/(5 - 1)))
#standard deviation of delay percentages for ALASKA
alaska_pct_sd <- sample_pct_sd1$pct_sd[1]
#standard deviation of delay percentages for AM WEST
amWest_pct_sd <- sample_pct_sd1$pct_sd[2]
delayed_df <- data.frame(Airline = wide_flights$Airline,
City = wide_flights$City,
Delayed = wide_flights$delayed,
Pct_Delay = wide_flights$pct_delayed,
#Mean_Pct_Delay = wide_flights$pct_mean_delayed,
#Pct_SD = round(sqrt(((wide_flights$pct_delayed - wide_flights$pct_mean_delayed)^2) / (wide_flights$total - 1)), 5),
#Avg_Pct_SD = NA,
Pct_SD = NA,
Airline_Delay_Pct = wide_flights$airline_delay_pct
)
delayed_df$Pct_SD <- ifelse(wide_flights$Airline == "AM WEST", amWest_pct_sd, delayed_df$Pct_SD)
delayed_df$Pct_SD <- ifelse(wide_flights$Airline == "ALASKA", alaska_pct_sd, delayed_df$Pct_SD)
#write to csv
write.csv(delayed_df, 'flight_analysis.csv')
flights_analysis <- read.csv('flight_analysis.csv')
flights_analysis
## X Airline City Delayed Pct_Delay Pct_SD Airline_Delay_Pct
## 1 1 ALASKA Los.Angeles 62 0.11091234 0.1106926 0.06934256
## 2 2 ALASKA Phoenix 12 0.05150215 0.1106926 0.06934256
## 3 3 ALASKA San.Diego 20 0.08620690 0.1106926 0.06934256
## 4 4 ALASKA San.Francisco 102 0.16859504 0.1106926 0.06934256
## 5 5 ALASKA Seattle 305 0.14212488 0.1106926 0.06934256
## 6 6 AM WEST Los.Angeles 117 0.14426634 0.1878894 0.20847682
## 7 7 AM WEST Phoenix 415 0.07897241 0.1878894 0.20847682
## 8 8 AM WEST San.Diego 65 0.14508929 0.1878894 0.20847682
## 9 9 AM WEST San.Francisco 129 0.28730512 0.1878894 0.20847682
## 10 10 AM WEST Seattle 61 0.23282443 0.1878894 0.20847682
The graphs below show the percentages of delays in two different ways. Both visuals include error bars that show the variability of the data.
Here we see the airlines grouped in two graphs, each with the delay percentages of the individual cities.
ggplot(flights_analysis, aes(x = Pct_Delay, y = City, fill = Airline)) +
facet_wrap(~ Airline) +
geom_bar(stat = "identity") +
geom_errorbar(aes(xmin = Pct_Delay - Pct_SD, xmax = Pct_Delay + Pct_SD), width = 0.2) +
labs(
title = "Delay Percentage by City",
x = "Percentage",
y = "City"
)
Here we see the cities grouped in five graphs. Each shows the delay percentages of the individual airlines.
ggplot(flights_analysis, aes(x = Pct_Delay, y = Airline, fill = City)) +
facet_wrap(~ City) +
geom_bar(stat = "identity") +
geom_errorbar(aes(xmin = Pct_Delay - Pct_SD, xmax = Pct_Delay + Pct_SD), width = 0.2) +
labs(
title = "Delay Percentage by Airline",
x = "Percentage",
y = "Airline"
)
Below I compared the overall delays for each airline and used a bar graph to compare. It is clear that ALASKA Airlines has a significantly smaller delay rate relative to their total number of flights across all cities.
del_group <- flights_analysis %>%
group_by(Airline) %>%
summarize(air_delay_p = mean(Airline_Delay_Pct))
ggplot(del_group, aes(x = Airline, y = air_delay_p, fill = Airline)) + geom_bar(stat = "identity") +
labs(
title = "Overall Delay Percentage by Airline",
x = "Airline",
y = "Percentage"
)
When breaking down the percentage comparison by city, I visualized it as (1) a single stacked graph with the information for all cities and (2) as separate graphs for each city. In either case, we see a significant difference in delay percentages across airlines.
Here the graph is visually appealing, but may be harder to compare as the bars representing the delay percentage for each city are not properly aligned side by side.
ggplot(flights_analysis, aes(x = Airline, y = Pct_Delay, fill = City)) +
geom_bar(stat = "identity", position = "stack") +
labs(
title = "Individual Delay Percentage vs Airline (Stacked)",
x = "Airline",
y = "Percentage"
)
Here the separate graphs represent each and within each graph is a comparison of their delays by . The individual graphs allow easier comparison.
ggplot(flights_analysis, aes(x = Airline, y = Pct_Delay, fill = City)) +
facet_wrap(~ City) +
geom_bar(stat = "identity") +
labs(
title = "Individual Delay Percentage vs Airline (Separated)",
x = "Airline",
y = "Percentage"
)
I likewise visualized it by comparing the delay percentages of each city and filled by airline. I also swapped the axes for better readability.
ggplot(flights_analysis, aes(x = Pct_Delay, y = City, fill = Airline)) +
geom_bar(stat = 'identity') +
labs(
title = "Delay Percentage vs City (Stacked)",
x = "Percentage",
y = "City"
)
Here the separate graphs represent each and within each graph is a comparison of their delays by .
ggplot(flights_analysis, aes(x = Pct_Delay, y = City, fill = Airline)) +
facet_wrap(~ Airline) +
geom_bar(stat = 'identity') +
labs(
title = "Delay Percentage vs City (Separated)",
x = "Percentage",
y = "City"
)
My analysis focused on comparing the spread of the data according to airline as well as comparing percentages of delays. In comparing the standard deviations of overall delayed flight percentages for each airline, we observe that AM West has a standard deviation of 0.1878894, while Alaska has a lower standard deviation of 0.1106926. This indicates that Alaska has more consistent performance in terms of flight delays, with less variability in the percentage of delayed flights compared to AM West. Although both standard deviations are quite small, suggesting that delays for both airlines are fairly stable, Alaska’s slightly smaller standard deviation suggests greater reliability in maintaining a predictable schedule. When comparing the actual percentages of delays, whether by airline overall or individual cities, we see that Alaska has a smaller delay rate across the board. Based on this analysis, Alaska seems to be the better airline to fly with as they have less delays and are more consistent arrival times.