Let’s load the necessary libraries:
# Loading libraries
library(tidyverse)
library(RMySQL)
library(DBI)
library(knitr)
library(tidyr)
library(kableExtra)
I create a table in the flight_data.csv file to store
the data from Week 4 assignment and uploaded it to Github.
This code reads the flight_data.csvfrom my Github URL
“https://raw.githubusercontent.com/farhodibr/CUNY-SPS-MSDS/main/DATA607/LAB4/flight_data.csv”
and creates data table.
# Reading CSV file from my Github and load it to the database
data <- read_csv("https://raw.githubusercontent.com/farhodibr/CUNY-SPS-MSDS/main/DATA607/LAB4/flight_data.csv", col_names = TRUE)
data |>
kable() |>
kable_styling(full_width = F)
| …1 | …2 | Los Angeles | Phoenix | San Diego | San Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | On Time | 497 | 221 | 212 | 503 | 1841 |
| NA | Delayed | 62 | 12 | 20 | 102 | 305 |
| AM WEST | On Time | 694 | 4840 | 383 | 320 | 201 |
| NA | Delayed | 117 | 415 | 65 | 129 | 61 |
The problems why this data table is not tidy and not suited well for analyses:
“Los Angeles”, “Phoenix”, “San Diego”, “San Francisco”, “Seattle” are city names. In a tidy dataset, these should be values in a column named “city”.
“...2” column represents flight’s status and its
values “On Time” and “Delayed” should be variables since they have
numerical values spread across the table
From here I’m tidying the data table to resolve the problems above:
# Checking column names
colnames(data)
## [1] "...1" "...2" "Los Angeles" "Phoenix"
## [5] "San Diego" "San Francisco" "Seattle"
The code below is renaming and standardizing column names, handling
potential missing values in the “Airline” column,
standardizing string values within the “Status” column
(lowercase and underscores), making a readable table format.
# Renaming first two columns
colnames(data)[1:2] <- c("Airline", "Status")
# Filling missing values in the Airline column
data <- data |>
fill(Airline, .direction = "down")
# Turning all column names data in the table lowercase and replacing white spaces with underscore
colnames(data) <- tolower(colnames(data))
colnames(data) <- str_replace_all(colnames(data), " ", "_")
# Replacing white spaces with underscore in all string values in the Status column
data <- data |>
mutate(across(status,
str_replace_all, " ", "_")
) |>
mutate(across(status, tolower))
data |>
kable() |>
kable_styling(full_width = F)
| airline | status | los_angeles | phoenix | san_diego | san_francisco | seattle |
|---|---|---|---|---|---|---|
| ALASKA | on_time | 497 | 221 | 212 | 503 | 1841 |
| ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
| AM WEST | on_time | 694 | 4840 | 383 | 320 | 201 |
| AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
This code first pivoting the table into a long format to create a
“city” variable with city names as its values. At the same
time it takes the data values associated with each city and puts them
into a new column “count”
After that the code pivoting table into wider format to create two
variables “on_time” and “delayed”, and assigns
corresponding numerical values from the”count” column ,
aligned with the corresponding “airline” and
“city” variables. These “on_time” and
“delayed” variables were values of “status”
variable.
# pivoting the data
airline_city_long_table <- data |>
pivot_longer(
cols = -c(airline, status),
names_to = "city",
values_to = "count"
) |>
pivot_wider(
names_from = status,
values_from = count
) |>
mutate(airline = str_to_lower(airline)) |>
mutate(airline = str_replace_all(airline, "\\s+", "_")) |>
arrange(airline, city)
colnames(airline_city_long_table) <- str_replace_all(colnames(airline_city_long_table), " ", "_")
colnames(airline_city_long_table) <- tolower(colnames(airline_city_long_table))
# Viewing the data
airline_city_long_table |>
kable() |>
kable_styling(full_width = F)
| airline | city | on_time | delayed |
|---|---|---|---|
| alaska | los_angeles | 497 | 62 |
| alaska | phoenix | 221 | 12 |
| alaska | san_diego | 212 | 20 |
| alaska | san_francisco | 503 | 102 |
| alaska | seattle | 1841 | 305 |
| am_west | los_angeles | 694 | 117 |
| am_west | phoenix | 4840 | 415 |
| am_west | san_diego | 383 | 65 |
| am_west | san_francisco | 320 | 129 |
| am_west | seattle | 201 | 61 |
This new airline_city_long_tableis a tidy format of the
original data table and meets tidy data principles:
Each variable is a column; each column is a variable.
airline, city, on_time, delayed :each of these has
distinct information and values.
Each observation is a row; each row is an observation.
Each row is a unique observation of airline-specific city route combination , showing its on-time and delayed flight counts for that route.
Each value is a cell; each cell is a single value.
There are no cell has combined info or lists, just one single value per cell.
Below I have done a couple of analyses.
This one creates a separate
flight_status_proportionstable with proportions of on_time
and delayed for each airline-city route observation.
flight_status_proportions <- airline_city_long_table |>
mutate(total_flights = on_time + delayed) |>
mutate(on_time_percentage = round((on_time / total_flights) * 100, 2)) |>
mutate(delayed_percentage = round((delayed / total_flights) * 100, 2)) |>
select(airline, city, on_time_percentage, delayed_percentage)
flight_status_proportions |>
kable() |>
kable_styling(full_width = F)
| airline | city | on_time_percentage | delayed_percentage |
|---|---|---|---|
| alaska | los_angeles | 88.91 | 11.09 |
| alaska | phoenix | 94.85 | 5.15 |
| alaska | san_diego | 91.38 | 8.62 |
| alaska | san_francisco | 83.14 | 16.86 |
| alaska | seattle | 85.79 | 14.21 |
| am_west | los_angeles | 85.57 | 14.43 |
| am_west | phoenix | 92.10 | 7.90 |
| am_west | san_diego | 85.49 | 14.51 |
| am_west | san_francisco | 71.27 | 28.73 |
| am_west | seattle | 76.72 | 23.28 |
Let’s create a column plot to compare on time and delay performances for each specific airline-city route observation:
#analyzing the data
ggplot(
flight_status_proportions,
aes(
x = city,
y = on_time_percentage,
fill = airline
)
) +
geom_col(position = "dodge") +
labs(
title = "On Time Performance",
x = "City",
y = "On Time Percentage"
) +
theme_bw()
From this chart I can say that Alaska Airlines on time performance for each airline-city route is slightly better than AM West’s.
This code creates a plot to compare delayed flights percentage for each airline-city observation:
#analyzing the data
ggplot(
flight_status_proportions,
aes(
x = city,
y = delayed_percentage,
fill = airline
)
) +
geom_col(position = "dodge") +
labs(
title = "Delayed flights comparision",
x = "City",
y = "Delayed Flights Percentage"
) +
theme_bw()
From this plot I can say that AM West has more delayed flights percentage comparing to Alaska airlines. Especially on San Francisco and Seattle routes.
This code creates a new separate table to show total flights on each airline-city route observation:
airline_city_share_table <- airline_city_long_table |>
group_by(airline, city) |>
summarise(
total_flights = sum(on_time + delayed),
) |>
arrange(desc(total_flights))
## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.
airline_city_share_table |>
kable() |>
kable_styling(full_width = F)
| airline | city | total_flights |
|---|---|---|
| am_west | phoenix | 5255 |
| alaska | seattle | 2146 |
| am_west | los_angeles | 811 |
| alaska | san_francisco | 605 |
| alaska | los_angeles | 559 |
| am_west | san_francisco | 449 |
| am_west | san_diego | 448 |
| am_west | seattle | 262 |
| alaska | phoenix | 233 |
| alaska | san_diego | 232 |
Let’s compare each observation on this plot:
ggplot(
airline_city_share_table,
aes(
x = city,
y = total_flights,
fill = airline
)
) +
geom_col(position = "dodge") +
labs(
title = "Total Flights by City",
x = "City",
y = "Total Flights"
) +
theme_bw()
From this plot I can see that AM West’s busiest route is to Phoenix. And Alaska’s busiest route is Seattle. Overall AM West Airlines has more flights combined from all routes than Alaska Airlines.
To see overall performance for each airline, we need to compare totals (sums) of on time and totals (sums) of delayed flights across all cities for each airline. Codes below calculates total numbers and overall percentages (proportions) of on time and delayed flights for each airline.
overall_performance_table <- airline_city_long_table |>
group_by(airline) |>
summarise(
on_time_total = sum(on_time),
delayed_total = sum(delayed),
total_flights = sum(on_time_total, delayed_total)
) |>
select(airline, on_time_total, delayed_total, total_flights)
print(overall_performance_table)
## # A tibble: 2 × 4
## airline on_time_total delayed_total total_flights
## <chr> <dbl> <dbl> <dbl>
## 1 alaska 3274 501 3775
## 2 am_west 6438 787 7225
overall_performance_table |>
kable() |>
kable_styling(full_width = F)
| airline | on_time_total | delayed_total | total_flights |
|---|---|---|---|
| alaska | 3274 | 501 | 3775 |
| am_west | 6438 | 787 | 7225 |
overall_performance_percentage_table <- overall_performance_table |>
group_by(airline) |>
summarise(
on_time_overall_percentage = round((on_time_total / total_flights) * 100, 2),
delayed_overall_percentage = round((delayed_total / total_flights) * 100, 2)
)|>
select(airline,
on_time_overall_percentage,
delayed_overall_percentage
)
print(overall_performance_percentage_table)
## # A tibble: 2 × 3
## airline on_time_overall_percentage delayed_overall_percentage
## <chr> <dbl> <dbl>
## 1 alaska 86.7 13.3
## 2 am_west 89.1 10.9
overall_performance_percentage_table |>
kable() |>
kable_styling(full_width = F)
| airline | on_time_overall_percentage | delayed_overall_percentage |
|---|---|---|
| alaska | 86.73 | 13.27 |
| am_west | 89.11 | 10.89 |
ggplot(overall_performance_percentage_table,
aes(x = airline, y = on_time_overall_percentage, fill = "On time")) +
geom_col() +
geom_text(aes(label = paste0(round(on_time_overall_percentage, 1), "%")),
position = position_stack(vjust = 0.5)) +
labs(title = "Overall on time flights percentage by airline",
x = "Airline",
y = "Percentage")
Conclusion:
From the overall_performance_percentage_tableI can say
that AM West (89.11%) has better overall on time performance than Alaska
(86.73%) in all flights across all cities. I would travel with AM West
according this analyses. But for better booking choice I would like to
see more statistic summary (like medians, means and IQRs of flight delay
times) to see typical delay times and variety of delay times.
Here is the code to connect to MySQL database. My credentials are stored in my Windows environment variables.
# Connecting to the MySQL database using the credentials stored in the environment variables
# con <- dbConnect(
# drv = RMySQL::MySQL(),
# host = Sys.getenv("DB_HOST"),
# port = as.integer(Sys.getenv("DB_PORT")),
# dbname = Sys.getenv("DB_NAME"),
# username = Sys.getenv("DB_USER"),
# password <- Sys.getenv("DB_PASS")
# )
This code creates new tables in MySQL and uploads the data into it
# more analysis
# dbWriteTable(conn = con,
# value = data,
# name = "airline_city_messy_table",
# row.names = FALSE,
# overwrite = TRUE )
#
# dbWriteTable(conn = con,
# value = airline_city_long_table,
# name = "airline_city_long_table",
# row.names = FALSE,
# overwrite = TRUE )
#
# dbWriteTable(conn = con,
# value = flight_status_proportions,
# name = "flight_status_proportions",
# row.names = FALSE,
# overwrite = TRUE )
#
# dbWriteTable(conn = con,
# value = airline_city_share_table,
# name = "airline_city_share_table",
# row.names = FALSE,
# overwrite = TRUE )
#
# dbDisconnect(con)