# Load necessary libraries
library(nycflights23)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(RColorBrewer)
NYC Flights Assignment
Data Visualization for NYC Flights
Data
To start, I load the flights dataset from the nycflights23 package, which contains information on 435,352 flights (the row count reported by str() further down).
# Load the flights dataset
data("flights")
Explore the dataset
I will use the functions head() and names() to explore the dataset.
# Display the first few rows of the dataset
head(flights)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2023     1     1        1           2038       203      328              3
2  2023     1     1       18           2300        78      228            135
3  2023     1     1       31           2344        47      500            426
4  2023     1     1       33           2140       173      238           2352
5  2023     1     1       36           2048       228      223           2252
6  2023     1     1      503            500         3      808            815
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
For an overview of the variables available in the data frame, I list the column names of the flights dataset with names().
# Display the names of the columns in the dataset
names(flights)
 [1] "year"           "month"          "day"            "dep_time"
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "flight"         "tailnum"
[13] "origin"         "dest"           "air_time"       "distance"
[17] "hour"           "minute"         "time_hour"
To view the dimensions and data types of the columns in the flights data frame, I will use the str() function.
# View the dimensions and data types of the dataset
str(flights)
tibble [435,352 × 19] (S3: tbl_df/tbl/data.frame)
$ year : int [1:435352] 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
$ month : int [1:435352] 1 1 1 1 1 1 1 1 1 1 ...
$ day : int [1:435352] 1 1 1 1 1 1 1 1 1 1 ...
$ dep_time : int [1:435352] 1 18 31 33 36 503 520 524 537 547 ...
$ sched_dep_time: int [1:435352] 2038 2300 2344 2140 2048 500 510 530 520 545 ...
$ dep_delay : num [1:435352] 203 78 47 173 228 3 10 -6 17 2 ...
$ arr_time : int [1:435352] 328 228 500 238 223 808 948 645 926 845 ...
$ sched_arr_time: int [1:435352] 3 135 426 2352 2252 815 949 710 818 852 ...
$ arr_delay : num [1:435352] 205 53 34 166 211 -7 -1 -25 68 -7 ...
$ carrier : chr [1:435352] "UA" "DL" "B6" "B6" ...
$ flight : int [1:435352] 628 393 371 1053 219 499 996 981 206 225 ...
$ tailnum : chr [1:435352] "N25201" "N830DN" "N807JB" "N265JB" ...
$ origin : chr [1:435352] "EWR" "JFK" "JFK" "JFK" ...
$ dest : chr [1:435352] "SMF" "ATL" "BQN" "CHS" ...
$ air_time : num [1:435352] 367 108 190 108 80 154 192 119 258 157 ...
$ distance : num [1:435352] 2500 760 1576 636 488 ...
$ hour : num [1:435352] 20 23 23 21 20 5 5 5 5 5 ...
$ minute : num [1:435352] 38 0 44 40 48 0 10 30 20 45 ...
$ time_hour : POSIXct[1:435352], format: "2023-01-01 20:00:00" "2023-01-01 23:00:00" ...
There are 19 variables in the NYC flights dataset. I am interested in the destination variable (dest) across all carriers. Because December includes the long holiday period, I want to know which destinations are the most popular during this month.
Check missing values and clean data
To avoid errors later in the analysis, I first check for missing values and clean the dataset.
The following R code checks whether every row of the flights dataset is complete and prints whether any missing (NA) values were found:
ifelse(mean(complete.cases(flights)) == 1, "No NA Found", "Found NA")
[1] "Found NA"
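The check above only reports whether any NA exists. As a minimal sketch of how to see where the gaps are, the example below counts missing values per column and the number of incomplete rows. It uses a small made-up data frame (toy, with hypothetical values) rather than the real flights data; the same calls could be pointed at flights instead.

```r
# Hypothetical toy data standing in for a few columns of flights
toy <- data.frame(
  dep_delay = c(5, NA, -2, 10),
  arr_delay = c(NA, NA, 3, 7),
  dest      = c("MCO", "MIA", "ATL", "ORD")
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows with at least one NA, i.e. the rows na.omit() would drop
rows_dropped <- sum(!complete.cases(toy))
print(rows_dropped)
```

Knowing how many rows would be dropped, and from which columns the gaps come, makes it easier to judge whether removing incomplete rows is acceptable for the question being asked.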
This dataset contains missing values, so next I remove every row that has at least one.
flights <- na.omit(flights)
Using the dplyr commands
Before creating charts, I convert the character columns to factors. Because the charts are mainly bar charts, factors are convenient: each category is stored once as a level, and the order of the levels controls the order of the bars.
I use a bar graph for the visualization because the destination variable (dest) is categorical, with each value representing a different airport.
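To illustrate why the level order of a factor matters for a bar chart, here is a small sketch using made-up counts (toy_counts is hypothetical and not taken from the flights data). It compares the default alphabetical level order with the order produced by reorder(), the base R function used for the x axis in the plotting code.

```r
# Hypothetical counts for three destinations (not taken from the flights data)
toy_counts <- data.frame(
  dest = c("ATL", "MCO", "MIA"),
  n    = c(30, 50, 40)
)

# A plain factor orders its levels alphabetically
alphabetical <- levels(factor(toy_counts$dest))

# reorder() sorts the levels by a second variable; -n gives descending counts
by_count <- levels(reorder(factor(toy_counts$dest), -toy_counts$n))

print(alphabetical)
print(by_count)
```

With these toy values the alphabetical order is ATL, MCO, MIA, while the reordered levels run from the largest count to the smallest (MCO, MIA, ATL), which is the order in which the bars would be drawn.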
# Convert every character column to a factor (across() supersedes mutate_if())
flights_clean <- flights %>% mutate(across(where(is.character), as.factor))
Create a bargraph categorized by destination
# Count December flights per destination, keep the 10 busiest, and plot them
top_flights_dec <- flights_clean %>%
  filter(month == 12) %>%
  count(dest) %>%
  arrange(desc(n)) %>%
  head(10) %>%
  ggplot(aes(x = reorder(dest, -n), y = n, fill = dest)) +
  geom_col(alpha = 0.8) +
  theme_minimal() +
  scale_fill_brewer(palette = "Paired") +
  geom_label(aes(label = n), vjust = -0.03, label.size = NA, fill = "white") +
  labs(x = "Destinations",
       y = "Count",
       title = "The top travel spots in December",
       caption = "Source: the Bureau of Transportation Statistics - BTS") +
  theme(axis.text.x = element_text(angle = 45))
top_flights_dec
The plot above shows that the top 10 destinations in December are Orlando International Airport (MCO), Miami International Airport (MIA), Hartsfield-Jackson Atlanta International Airport (ATL), Chicago O’Hare International Airport (ORD), Fort Lauderdale-Hollywood International Airport (FLL), Los Angeles International Airport (LAX), Boston Logan International Airport (BOS), Charlotte Douglas International Airport (CLT), San Francisco International Airport (SFO) and Dallas Fort Worth International Airport (DFW).
The bar graph of the top 10 December destinations supports the following observations:
Frequency of Flights: The bar graph shows the number of December flights to each destination, with Orlando International Airport (MCO) receiving the most flights among the top destinations.
Travel Demand: The presence of airports such as Miami International Airport (MIA) and Hartsfield-Jackson Atlanta International Airport (ATL) in the top ranks suggests high travel demand during the holiday season.
Regional and International Connectivity: Los Angeles International Airport (LAX), San Francisco International Airport (SFO), and Dallas Fort Worth International Airport (DFW) are large airports, so flights to them may also feed onward domestic and international connections, although the chart itself only counts flights departing the NYC airports.
Hub Airports: Chicago O’Hare International Airport (ORD) and Charlotte Douglas International Airport (CLT) are major airline hubs, so part of the traffic to them likely consists of passengers connecting onward during the holiday period.
Seasonal Trends: The distribution of flights among these top destinations likely reflects seasonal travel patterns and preferences in December, influenced by factors such as holidays and weather conditions.
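The ranking read off the chart can also be checked as a plain table. The sketch below, which assumes a made-up vector of destination codes (toy_dest is hypothetical, not the real December data), counts flights per destination, keeps the busiest ones, and expresses each as a share of all flights in the sample.

```r
# Hypothetical destination codes, one entry per flight
toy_dest <- c("MCO", "MIA", "MCO", "ATL", "MIA", "MCO")

# Count flights per destination, largest first
dest_counts <- sort(table(toy_dest), decreasing = TRUE)

# Keep the two busiest destinations (the report uses ten)
top2 <- head(dest_counts, 2)

# Share of all flights in the sample, in percent
share <- round(100 * as.vector(top2) / length(toy_dest), 1)
names(share) <- names(top2)

print(top2)
print(share)
```

A table like this complements the bar graph: the chart shows the ranking at a glance, while the shares indicate how much of the month's traffic the top destinations actually account for.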