#Installation method
library(pacman)
p_load(nycflights13, tidyverse)
data("flights") # Loads the data frame "flights"
Data Exploration and Descriptive Analysis with flights Dataset
Loading Packages and Data
First we need to load the flights data, which comes as part of the nycflights13 package.
Data Exploration and Analysis
Data Exploration
The following chunk displays an initial data inspection of the flights dataset, including a preview with the head() function, a structural overview with the str() function, and summary statistics with the summary() function.
# Display the first 7 rows of the flights dataset
head(flights, 7)
# A tibble: 7 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
7  2013     1     1      555            600        -5      913            854
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# Examine the structure of the dataset (variable types and layout)
str(flights)
tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
$ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
$ month : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
$ day : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
$ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
$ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
$ dep_delay : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
$ arr_time : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
$ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
$ arr_delay : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
$ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
$ flight : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
$ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
$ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
$ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
$ air_time : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
$ distance : num [1:336776] 1400 1416 1089 1576 762 ...
$ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
$ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
$ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
# Generate summary statistics for all variables
summary(flights)
      year          month             day           dep_time    sched_dep_time
 Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106
 1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906
 Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359
 Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344
 3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729
 Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359
                                                 NA's   :8255
   dep_delay          arr_time    sched_arr_time    arr_delay
 Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000
 1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000
 Median :  -2.00   Median :1535   Median :1556   Median :  -5.000
 Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895
 3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945   3rd Qu.:  14.000
 Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000
 NA's   :8255      NA's   :8713                  NA's   :9430
   carrier             flight         tailnum            origin
 Length:336776      Min.   :   1   Length:336776      Length:336776
 Class :character   1st Qu.: 553   Class :character   Class :character
 Mode  :character   Median :1496   Mode  :character   Mode  :character
                    Mean   :1972
                    3rd Qu.:3465
                    Max.   :8500
     dest              air_time        distance         hour
 Length:336776      Min.   : 20.0   Min.   :  17   Min.   : 1.00
 Class :character   1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00
 Mode  :character   Median :129.0   Median : 872   Median :13.00
                    Mean   :150.7   Mean   :1040   Mean   :13.18
                    3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00
                    Max.   :695.0   Max.   :4983   Max.   :23.00
                    NA's   :9430
     minute         time_hour
 Min.   : 0.00   Min.   :2013-01-01 05:00:00
 1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00
 Median :29.00   Median :2013-07-03 10:00:00
 Mean   :26.23   Mean   :2013-07-03 05:22:54
 3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00
 Max.   :59.00   Max.   :2013-12-31 23:00:00
The code above provides an initial exploration of the dataset. First, head(flights, 7) displays the first seven rows, giving a quick look at how variables such as year, month, dep_time, arr_delay, carrier, origin, and dest appear in the raw data. Next, str(flights) shows each variable’s type (e.g., integer, numeric, character) and a preview of its values, clarifying how variables like sched_dep_time, flight, tailnum, air_time, and distance are stored. Finally, summary(flights) generates summary statistics for all 19 variables: minimums, maximums, means, and quartiles for the numeric variables, plus counts of missing values where they occur. Together, these commands give an overview of the dataset’s layout, variable types, and key characteristics before deeper analysis. The flights data frame contains 336,776 observations, one per flight.
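The 19 variables are listed below:
- year: year of departure
- month: month of departure
- day: day of departure
- dep_time: actual departure time (HHMM clock format)
- sched_dep_time: scheduled departure time (HHMM clock format)
- dep_delay: departure delay in minutes (negative values are early departures)
- arr_time: actual arrival time (HHMM clock format)
- sched_arr_time: scheduled arrival time (HHMM clock format)
- arr_delay: arrival delay in minutes (negative values are early arrivals)
- carrier: two-letter carrier abbreviation
- flight: flight number
- tailnum: plane tail number
- origin: origin airport
- dest: destination airport
- air_time: time spent in the air, in minutes
- distance: distance between airports, in miles
- hour: hour of the scheduled departure time
- minute: minute of the scheduled departure time
- time_hour: scheduled date and hour of the flight (POSIXct)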
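The type and missing-value information read off str() and summary() can also be gathered into a single table. The following is a minimal sketch in base R; it assumes a small stand-in data frame named toy (hypothetical, built from the first three rows shown earlier, with one arr_delay value replaced by NA purely for illustration) so that it runs without the nycflights13 package.

```r
# Toy stand-in for flights: three rows copied from the head() output,
# with one arr_delay value set to NA for illustration only.
toy <- data.frame(
  dep_time  = c(517L, 533L, 542L),
  dep_delay = c(2, 4, 2),
  arr_delay = c(11, NA, 33),
  carrier   = c("UA", "UA", "AA"),
  stringsAsFactors = FALSE
)

# One row per variable: its storage class and its number of missing values
overview <- data.frame(
  variable  = names(toy),
  class     = vapply(toy, function(col) class(col)[1], character(1)),
  n_missing = colSums(is.na(toy)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(overview)
```

Replacing toy with flights should give the same kind of table for all 19 variables, assuming the package is loaded.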
Average Scheduled Arrival Time and Actual Arrival Time
The following chunk calculates the mean actual and scheduled arrival times while ignoring missing values, then stores those averages in the variables avg_arr_time and avg_sched_arr_time.
# Get average arrival time and scheduled arrival time, ignoring NA values
mean(flights$arr_time, na.rm = TRUE) # Calculates mean of actual arrival times, removing NA values
[1] 1502.055
mean(flights$sched_arr_time, na.rm = TRUE) # Calculates mean of scheduled arrival times, removing NA values
[1] 1536.38
# Store the means as variables
avg_arr_time <- mean(flights$arr_time, na.rm = TRUE) # Saves mean arrival time into avg_arr_time
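avg_sched_arr_time <- mean(flights$sched_arr_time, na.rm = TRUE) # Saves mean scheduled arrival time into avg_sched_arr_time
On average, flights were scheduled to arrive at about 1536 and actually arrived at about 1502, both in HHMM clock format. These means should be read with caution: arr_time and sched_arr_time are clock times stored as integers, so their difference (about 34) is not a number of minutes, and arrivals after midnight wrap around to small values, which can pull a mean down. The arr_delay variable measures lateness directly, and its mean of 6.895 minutes in the summary output indicates that flights arrived slightly late on average. Since the mean departure delay was 12.64 minutes, this suggests that flights made up some time in the air on average.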
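Because arr_time and sched_arr_time are stored as HHMM clock values, one way to compare them in minutes is to convert them to minutes since midnight first. The sketch below assumes a hypothetical helper, hhmm_to_minutes(), which is not part of nycflights13, and uses the first four rows of the head() output as input so that it runs on its own. It does not handle flights that cross midnight.

```r
# Hypothetical helper (not part of nycflights13): convert an HHMM clock
# value such as 1004 into minutes since midnight (10 * 60 + 4 = 604).
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + (hhmm %% 100)
}

# First four rows of flights, copied from the head() output
arr_time       <- c(830, 850, 923, 1004)
sched_arr_time <- c(819, 830, 850, 1022)

# Difference in minutes between actual and scheduled arrival
delay_minutes <- hhmm_to_minutes(arr_time) - hhmm_to_minutes(sched_arr_time)
print(delay_minutes)

# Subtracting the raw HHMM values instead is wrong whenever the two times
# fall in different hours (third row: 923 - 850 gives 73, not 33 minutes)
naive_difference <- arr_time - sched_arr_time
print(naive_difference)
```

For these four rows the converted differences agree with the arr_delay values shown by str(flights) (11, 20, 33, -18), while the naive subtraction does not.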