I aim to conduct a comprehensive analysis of the flights dataset to extract actionable insights that can be utilized by airport management and airlines to improve operational efficiency, enhance customer satisfaction and optimize resource allocation.
This data is made available by Mahoora00135 as a Kaggle dataset with a CC0: Public Domain License.
| Variable | Description |
|---|---|
| id | A unique identifier assigned to each flight record in this dataset. |
| year | The year in which the flight took place. The dataset includes flights from the year 2013 |
| month | The month of the year in which the flight occurred, represented by an integer ranging from 1 January to 12 Dec |
| day | The day of the month on which the flight took place, represented by an integer from 1 to 31 |
| dep_time | The actual departure time of the flight, represented in 24-hour format (hhmm) |
| sched_dep_time | The locally scheduled departure time of the flight, presented in a 24-hour format (hhmm) |
| dep_delay | The delay in flight departure, calculated as the difference (in minutes) between the actual and scheduled departure times. Positive values indicate a delay, while negative values indicate an early departure. |
| arr_time | The actual arrival time of the flight, represented in 24-hour format (hhmm) |
| sched_arr_time | The locally scheduled arrival time of the flight, presented in a 24-hour format (hhmm) |
| arr_delay | The delay in flight arrival, calculated as the difference (in minutes) between the actual and scheduled arrival times. Positive values indicate a delay, while negative values indicate an early arrival |
| carrier | A two-letter code representing the airline carrier responsible for the flight |
| flight | The designated number of the flight |
| tailnum | A unique identifier associated with the aircraft used for the flight |
| origin | A three-letter code signifying the airport from which the flight departed |
| dest | A three-letter code representing the airport at which the flight arrived |
| air_time | The duration of the flight, measured in minutes |
| distance | The total distance (in miles) between the origin and destination airports |
| hour | The hour component of the scheduled departure time, expressed in local time |
| minute | The minute component of the scheduled departure time, expressed in local time |
| time_hour | The scheduled departure time of the flight, represented in local time and formatted as “yyyy-mm-dd hh:mm:ss” |
| name | The full name of the airline carrier responsible for the flight |
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(dplyr)
Flights <- read.csv("C:\\Users\\Asuquo Bassey\\Downloads\\Compressed\\Datasets\\Flights datasets\\flights.csv")
The steps I took for the data wrangling includes Data Discovery, Data structuring and Cleaning which includes reformatting of some wrongly formatted data types, Checking and handling data entry inconsistency/irregularities, handling blank values, checking for duplicates and so on. Perform data quality checks to ensure accuracy and reliability.
View(Flights)
glimpse(Flights)
## Rows: 336,776
## Columns: 21
## $ id <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <dbl> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time <dbl> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance <int> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour <int> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute <int> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour <chr> "2013-01-01 05:00:00", "2013-01-01 05:00:00", "2013-01-…
## $ name <chr> "United Air Lines Inc.", "United Air Lines Inc.", "Amer…
str(Flights)
## 'data.frame': 336776 obs. of 21 variables:
## $ id : int 0 1 2 3 4 5 6 7 8 9 ...
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : num 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : num 830 850 923 1004 812 ...
## $ sched_arr_time: int 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr "UA" "UA" "AA" "B6" ...
## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : int 1400 1416 1089 1576 762 719 1065 229 944 733 ...
## $ hour : int 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : int 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : chr "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
## $ name : chr "United Air Lines Inc." "United Air Lines Inc." "American Airlines Inc." "JetBlue Airways" ...
The view function showed that the dataset contains date, time, text, blank values and some numerical values. Previewing the structure of a flights data frame after running the glimpse and str functions, the following where observed;
Flights %>%
summarize(max_length = max(nchar(time_hour)),
min_length = min(nchar(time_hour))) # maxlength is 19 and min length is 19
## max_length min_length
## 1 19 19
Flights <- Flights %>%
mutate(schd_dep_date = as.Date(paste(year, month, day, sep = "-"))) %>%
select(id, schd_dep_date, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr_delay, carrier,
flight, tailnum, origin, dest, air_time, distance, name, -hour, -minute, -year, -month, -day, -time_hour)
time_columns <- c("dep_time", "sched_dep_time", "arr_time", "sched_arr_time")
Flights <- Flights %>%
mutate(across(all_of(time_columns), ~ as.POSIXct(sprintf("%04d", .), format = "%H%M"))) %>%
mutate(across(all_of(time_columns), ~ format(., format = "%H:%M")))
glimpse(Flights)
## Rows: 336,776
## Columns: 16
## $ id <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ schd_dep_date <date> 2013-01-01, 2013-01-01, 2013-01-01, 2013-01-01, 2013-0…
## $ dep_time <chr> "05:17", "05:33", "05:42", "05:44", "05:54", "05:54", "…
## $ sched_dep_time <chr> "05:15", "05:29", "05:40", "05:45", "06:00", "05:58", "…
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time <chr> "08:30", "08:50", "09:23", "10:04", "08:12", "07:40", "…
## $ sched_arr_time <chr> "08:19", "08:30", "08:50", "10:22", "08:37", "07:28", "…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance <int> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ name <chr> "United Air Lines Inc.", "United Air Lines Inc.", "Amer…
sum(is.na(Flights)) # 44,083 cells
## [1] 44083
null_obs <- Flights[apply(is.na(Flights), 1, any), ]
Out of the 336,776 obs, 9,430 obs had at least 1 NA. Thus 2.8% of the original rows had no information. A figure of 44,083, indicates that there are multiple null values within some observations. I noticed some null values in arr_delay when we had data in both arr_time and sched_arr_time. In a bid to reduce null values, we can calculate arr_delay.
Flights <- Flights %>%
mutate(
arr_delay = ifelse(!is.na(arr_time) & !is.na(sched_arr_time),
as.numeric(gsub(":", "", arr_time)) - as.numeric(gsub(":", "", sched_arr_time)),
NA)
)
This reduced the null values (cells) by 717
sum(is.na(Flights)) #43,366 cells.
## [1] 43366
On investigation of the null observations, it was observed that variables with missing data are air_time, tailnum, dep_time, arr_time.
airline_counts_null<- table(null_obs$name)
airline_counts <- table(Flights$name)
print(airline_counts)
##
## AirTran Airways Corporation Alaska Airlines Inc.
## 3260 714
## American Airlines Inc. Delta Air Lines Inc.
## 32729 48110
## Endeavor Air Inc. Envoy Air
## 18460 26397
## ExpressJet Airlines Inc. Frontier Airlines Inc.
## 54173 685
## Hawaiian Airlines Inc. JetBlue Airways
## 342 54635
## Mesa Airlines Inc. SkyWest Airlines Inc.
## 601 32
## Southwest Airlines Co. United Air Lines Inc.
## 12275 58665
## US Airways Inc. Virgin America
## 20536 5162
print(airline_counts_null)
##
## AirTran Airways Corporation Alaska Airlines Inc.
## 85 5
## American Airlines Inc. Delta Air Lines Inc.
## 782 452
## Endeavor Air Inc. Envoy Air
## 1166 1360
## ExpressJet Airlines Inc. Frontier Airlines Inc.
## 3065 4
## JetBlue Airways Mesa Airlines Inc.
## 586 57
## SkyWest Airlines Inc. Southwest Airlines Co.
## 3 231
## United Air Lines Inc. US Airways Inc.
## 883 705
## Virgin America
## 46
Hawaiian Airlines Inc. is the only airline carrier that does not appear in the null_obs (that does not have a null value in any row). Because of the varying possibilities of null values and ALL airlines but one (Hawaiian Airlines Inc.) having at least 1 null value in 9,430 observations, I am leaving the NA values for now
Airport_Code <- read.csv("C:\\Users\\Asuquo Bassey\\Downloads\\Compressed\\Datasets\\Flights datasets\\Airport Code.csv")
Airport_Code_Geotag <- read.csv("C:\\Users\\Asuquo Bassey\\Downloads\\Compressed\\Datasets\\Flights datasets\\Airport Code_geotag.csv")
glimpse(Flights)
## Rows: 336,776
## Columns: 16
## $ id <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ schd_dep_date <date> 2013-01-01, 2013-01-01, 2013-01-01, 2013-01-01, 2013-0…
## $ dep_time <chr> "05:17", "05:33", "05:42", "05:44", "05:54", "05:54", "…
## $ sched_dep_time <chr> "05:15", "05:29", "05:40", "05:45", "06:00", "05:58", "…
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time <chr> "08:30", "08:50", "09:23", "10:04", "08:12", "07:40", "…
## $ sched_arr_time <chr> "08:19", "08:30", "08:50", "10:22", "08:37", "07:28", "…
## $ arr_delay <dbl> 11, 20, 73, -18, -25, 12, 59, -14, -8, 8, -2, -3, 7, -1…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance <int> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ name <chr> "United Air Lines Inc.", "United Air Lines Inc.", "Amer…
glimpse(Airport_Code)
## Rows: 108
## Columns: 5
## $ Code <chr> "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO", …
## $ Airport.Name <chr> " George Bush Intercontinental Airport", " Miami Interna…
## $ Town.City <chr> " Houston ", " Miami ", " Aguadilla ", " Atlanta ", " Ch…
## $ State..Region <chr> " Texas", " Florida", " Puerto Rico", " Georgia", " Illi…
## $ Country <chr> " USA", " USA", " USA", " USA", " USA", " USA", " …
Airport_Code <- unique(Airport_Code)
Airport_Code_Geotag <- unique(Airport_Code_Geotag)
Flights2 <- left_join(Flights, Airport_Code_Geotag, by = c("origin" = "original_Code"))
# Rename columns for origin airports
colnames(Flights2)[17:22] <- c("Origin_Airport.Name", "Origin_Town.City", "Origin_State..Region", "Origin_Country","Origin_latitude","Origin_longitude")
# Merge Flights dataframe with Airport_Code dataframe for destination airports
Flights2 <- left_join(Flights2, Airport_Code_Geotag, by = c("dest" = "original_Code"))
# Rename columns for destination airports
colnames(Flights2)[23:28] <- c("Dest_Airport.Name", "Dest_Town.City", "Dest_State..Region", "Dest_Country","Dest_latitude","Dest_longitude")
# View the first few rows of the resulting dataframe
head(Flights2)
## id schd_dep_date dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 0 2013-01-01 05:17 05:15 2 08:30 08:19
## 2 1 2013-01-01 05:33 05:29 4 08:50 08:30
## 3 2 2013-01-01 05:42 05:40 2 09:23 08:50
## 4 3 2013-01-01 05:44 05:45 -1 10:04 10:22
## 5 4 2013-01-01 05:54 06:00 -6 08:12 08:37
## 6 5 2013-01-01 05:54 05:58 -4 07:40 07:28
## arr_delay carrier flight tailnum origin dest air_time distance
## 1 11 UA 1545 N14228 EWR IAH 227 1400
## 2 20 UA 1714 N24211 LGA IAH 227 1416
## 3 73 AA 1141 N619AA JFK MIA 160 1089
## 4 -18 B6 725 N804JB JFK BQN 183 1576
## 5 -25 DL 461 N668DN LGA ATL 116 762
## 6 12 UA 1696 N39463 EWR ORD 150 719
## name Origin_Airport.Name
## 1 United Air Lines Inc. Newark Liberty International Airport
## 2 United Air Lines Inc. LaGuardia Airport
## 3 American Airlines Inc. John F. Kennedy International Airport
## 4 JetBlue Airways John F. Kennedy International Airport
## 5 Delta Air Lines Inc. LaGuardia Airport
## 6 United Air Lines Inc. Newark Liberty International Airport
## Origin_Town.City Origin_State..Region Origin_Country Origin_latitude
## 1 Newark New Jersey USA 40.73566
## 2 Queens New York USA 40.71351
## 3 Queens New York USA 40.71351
## 4 Queens New York USA 40.71351
## 5 Queens New York USA 40.71351
## 6 Newark New Jersey USA 40.73566
## Origin_longitude Dest_Airport.Name
## 1 -74.17237 George Bush Intercontinental Airport
## 2 -73.82831 George Bush Intercontinental Airport
## 3 -73.82831 Miami International Airport
## 4 -73.82831 Rafael Hernández Airport
## 5 -73.82831 Hartsfield-Jackson Atlanta International Airport
## 6 -74.17237 O'Hare International Airport
## Dest_Town.City Dest_State..Region Dest_Country Dest_latitude Dest_longitude
## 1 Houston Texas USA 29.75894 -95.36770
## 2 Houston Texas USA 29.75894 -95.36770
## 3 Miami Florida USA 25.77417 -80.19362
## 4 Aguadilla Puerto Rico USA 18.42979 -67.15422
## 5 Atlanta Georgia USA 33.74899 -84.39026
## 6 Chicago Illinois USA 41.87556 -87.62442
# Duplication
# Identify duplicate rows
duplicated_rows <- Flights[duplicated(Flights), ]
# Print or view the duplicated rows
print(duplicated_rows)
## [1] id schd_dep_date dep_time sched_dep_time dep_delay
## [6] arr_time sched_arr_time arr_delay carrier flight
## [11] tailnum origin dest air_time distance
## [16] name
## <0 rows> (or 0-length row.names)
I noticed multiple variables and see possible reasons; tailnum, name of carrier - Aircraft Rotation, Aircraft Substitutions, Leased or Shared Aircraft. Having multiple tail numbers associated with flights is generally acceptable as long as it reflects real-world operations and is consistent with the airline’s operations.
I will do only minimal analysis and visualizations -Airline/Carrier summary, relationship between fleet size and arrival & departure delays and calculation of correlation coefficients (see Below). Then I will export the data to Tableau.
Airline_Summary <- Flights2 %>%
group_by(name) %>%
summarize(Fleet_Size = n_distinct(tailnum, na.rm = TRUE),
Average_Distance = mean(distance, na.rm = TRUE),
Average_Flight_Hours = mean(air_time, na.rm = TRUE),
Average_Departure_Delay = mean(dep_delay, na.rm = TRUE),
Average_Arrival_Delay = mean(arr_delay, na.rm = TRUE),
Max_Individual_Distance_Flown = max(distance)) %>%
arrange(desc(Fleet_Size))
head(Airline_Summary)
## # A tibble: 6 × 7
## name Fleet_Size Average_Distance Average_Flight_Hours Average_Departure_De…¹
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Delta… 629 1237. 174. 9.26
## 2 Unite… 621 1529. 212. 12.1
## 3 Ameri… 601 1340. 189. 8.59
## 4 South… 583 996. 148. 17.7
## 5 Expre… 316 563. 90.1 20.0
## 6 US Ai… 290 553. 88.6 3.78
## # ℹ abbreviated name: ¹Average_Departure_Delay
## # ℹ 2 more variables: Average_Arrival_Delay <dbl>,
## # Max_Individual_Distance_Flown <int>
Airline_Summary %>%
ggplot(aes(Fleet_Size,Average_Departure_Delay,colour=Average_Arrival_Delay))+
geom_point(alpha=0.4,)+
scale_color_gradient(low = "blue", high = "red") + # Color gradient for calories
geom_smooth()+
labs(x = "Fleet_Size",
y = "Average_Departure_Delay",
color = "Average_Arrival_Delay",
title = "Relationship between Fleet Size and Arrival & Departure Delays") +
theme_bw()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
# Calculate correlation coefficients
correlation_delay_arrival <- cor(Flights2$arr_delay, Flights2$dep_delay,use = "complete.obs")
print(correlation_delay_arrival)
## [1] -0.1941299
correlation_distance_arrival <- cor(Flights2$distance,Flights2$arr_delay,use = "complete.obs")
print(correlation_distance_arrival)
## [1] -0.03437079
correlation_flight_time_distance <- cor(Flights2$air_time,Flights2$distance,use = "complete.obs")
print(correlation_distance_arrival)
## [1] -0.03437079
Based on the correlation coefficient results:
Correlation between Departure Delay and Arrival Delay:
The correlation coefficient of approximately -0.19 suggests a weak negative linear relationship between departure delay and arrival delay.
This indicates that, on average, flights with longer departure delays tend to have slightly shorter arrival delays, and vice versa.
However, the strength of this relationship is weak, indicating that departure delay alone may not be a strong predictor of arrival delay.
Correlation between Distance and Arrival Delay:
The correlation coefficient of approximately -0.03 suggests a very weak negative linear relationship between distance traveled and arrival delay.
This indicates that there is little to no meaningful relationship between the distance a flight travels and the delay in its arrival time.
It implies that longer flights do not necessarily experience more or less arrival delay compared to shorter flights.
Correlation between Flight Time and Distance:
The correlation coefficient of approximately -0.03 indicates a very weak negative linear relationship between flight time and distance traveled.
This suggests that, on average, longer flights tend to have slightly shorter flight times, and vice versa.
However, the strength of this relationship is very weak, indicating that flight time alone may not be a strong predictor of the distance traveled.
The dataset was exported as CSV file to Tableau for further analysis and visualizations. Please,Visit Tableau Public for multi-dashboard visualizations to provide insights on different route analysis, airline & airport operations/performance.
# Specify file path to save the .csv file
file_path <- "flights2_data.csv"
# Write the dataframe to a .csv file}
write.csv(Flights, file = file_path, row.names = FALSE)
# Optionally, print a message to confirm the file has been saved
cat("DataFrame saved as", file_path, "\n")
## DataFrame saved as flights2_data.csv
# Get the current working directory
current_directory <- getwd()
print(current_directory)
## [1] "C:/Users/Asuquo Bassey/Downloads/Compressed/Datasets/Flights datasets"
print(unique(Flights$dest))
## [1] "IAH" "MIA" "BQN" "ATL" "ORD" "FLL" "IAD" "MCO" "PBI" "TPA" "LAX" "SFO"
## [13] "DFW" "BOS" "LAS" "MSP" "DTW" "RSW" "SJU" "PHX" "BWI" "CLT" "BUF" "DEN"
## [25] "SNA" "MSY" "SLC" "XNA" "MKE" "SEA" "ROC" "SYR" "SRQ" "RDU" "CMH" "JAX"
## [37] "CHS" "MEM" "PIT" "SAN" "DCA" "CLE" "STL" "MYR" "JAC" "MDW" "HNL" "BNA"
## [49] "AUS" "BTV" "PHL" "STT" "EGE" "AVL" "PWM" "IND" "SAV" "CAK" "HOU" "LGB"
## [61] "DAY" "ALB" "BDL" "MHT" "MSN" "GSO" "CVG" "BUR" "RIC" "GSP" "GRR" "MCI"
## [73] "ORF" "SAT" "SDF" "PDX" "SJC" "OMA" "CRW" "OAK" "SMF" "TUL" "TYS" "OKC"
## [85] "PVD" "DSM" "PSE" "BHM" "CAE" "HDN" "BZN" "MTJ" "EYW" "PSP" "ACK" "BGR"
## [97] "ABQ" "ILM" "MVY" "SBN" "LEX" "CHO" "TVC" "ANC" "LGA"
print(length(unique(Flights$dest)))
## [1] 105
I pulled in data, reviewed and investigated it. Checked for data entry Inconsistencies and irregularities. Managed data types and blanks. Deleted unused columns and created new ones. I was able to show the ability to parsed Date and Time, Detect and correct badly formatted data and check for other error that can skew my analysis. After having a well wrangled data that is well formatted and can be used for further analysis, I proceeded with analysis and visualization without any fear and most especially;
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.3.3 (2024-02-29 ucrt)
## os Windows 11 x64 (build 22631)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_Nigeria.utf8
## ctype English_Nigeria.utf8
## tz Africa/Lagos
## date 2024-04-29
## pandoc 3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## bslib 0.7.0 2024-03-29 [1] CRAN (R 4.3.3)
## cachem 1.0.8 2023-05-01 [1] CRAN (R 4.3.3)
## cli 3.6.2 2023-12-11 [1] CRAN (R 4.3.3)
## colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.3)
## digest 0.6.35 2024-03-11 [1] CRAN (R 4.3.3)
## dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.3)
## evaluate 0.23 2023-11-01 [1] CRAN (R 4.3.3)
## fansi 1.0.6 2023-12-08 [1] CRAN (R 4.3.3)
## farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.3)
## fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.3)
## forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.3)
## generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.3)
## ggplot2 * 3.5.0 2024-02-23 [1] CRAN (R 4.3.3)
## glue 1.7.0 2024-01-09 [1] CRAN (R 4.3.3)
## gtable 0.3.4 2023-08-21 [1] CRAN (R 4.3.3)
## highr 0.10 2022-12-22 [1] CRAN (R 4.3.3)
## hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.3)
## htmltools 0.5.8 2024-03-25 [1] CRAN (R 4.3.3)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.3.3)
## jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.3.3)
## knitr 1.45 2023-10-30 [1] CRAN (R 4.3.3)
## labeling 0.4.3 2023-08-29 [1] CRAN (R 4.3.1)
## lattice 0.22-5 2023-10-24 [2] CRAN (R 4.3.3)
## lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.3)
## lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.3.3)
## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.3)
## Matrix 1.6-5 2024-01-11 [2] CRAN (R 4.3.3)
## mgcv 1.9-1 2023-12-21 [2] CRAN (R 4.3.3)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.3)
## nlme 3.1-164 2023-11-27 [2] CRAN (R 4.3.3)
## pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.3)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.3)
## purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.3)
## R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.3)
## readr * 2.1.5 2024-01-10 [1] CRAN (R 4.3.3)
## rlang 1.1.3 2024-01-10 [1] CRAN (R 4.3.3)
## rmarkdown 2.26 2024-03-05 [1] CRAN (R 4.3.3)
## rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.3.3)
## sass 0.4.9 2024-03-15 [1] CRAN (R 4.3.3)
## scales 1.3.0 2023-11-28 [1] CRAN (R 4.3.3)
## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.3)
## stringi 1.8.3 2023-12-11 [1] CRAN (R 4.3.2)
## stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.3.3)
## tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.3)
## tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.3.3)
## tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.3.3)
## tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.3)
## timechange 0.3.0 2024-01-18 [1] CRAN (R 4.3.3)
## tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.3)
## utf8 1.2.4 2023-10-22 [1] CRAN (R 4.3.3)
## vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.3)
## withr 3.0.0 2024-01-16 [1] CRAN (R 4.3.3)
## xfun 0.43 2024-03-25 [1] CRAN (R 4.3.3)
## yaml 2.3.8 2023-12-11 [1] CRAN (R 4.3.2)
##
## [1] C:/Users/Asuquo Bassey/AppData/Local/R/win-library/4.3
## [2] C:/Program Files/R/R-4.3.3/library
##
## ──────────────────────────────────────────────────────────────────────────────