# I will work on the “US flights from 2015 to 2020” data set
This data set contains information on domestic flights within the United States, including the airline, flight number, departure and arrival times, and various performance metrics. The file loaded here, flights.csv, holds 5,819,079 flights, all of them from 2015.
data <- read.csv("flights.csv", header = TRUE)
To start the analysis, I will first look at the structure and contents of the data frame using the str() and head() functions.
# The str() function gives a summary of the data frame’s structure
str(data)
## 'data.frame': 5819079 obs. of 31 variables:
## $ YEAR : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ MONTH : int 1 1 1 1 1 1 1 1 1 1 ...
## $ DAY : int 1 1 1 1 1 1 1 1 1 1 ...
## $ DAY_OF_WEEK : int 4 4 4 4 4 4 4 4 4 4 ...
## $ AIRLINE : chr "AS" "AA" "US" "AA" ...
## $ FLIGHT_NUMBER : int 98 2336 840 258 135 806 612 2013 1112 1173 ...
## $ TAIL_NUMBER : chr "N407AS" "N3KUAA" "N171US" "N3HYAA" ...
## $ ORIGIN_AIRPORT : chr "ANC" "LAX" "SFO" "LAX" ...
## $ DESTINATION_AIRPORT: chr "SEA" "PBI" "CLT" "MIA" ...
## $ SCHEDULED_DEPARTURE: int 5 10 20 20 25 25 25 30 30 30 ...
## $ DEPARTURE_TIME : int 2354 2 18 15 24 20 19 44 19 33 ...
## $ DEPARTURE_DELAY : int -11 -8 -2 -5 -1 -5 -6 14 -11 3 ...
## $ TAXI_OUT : int 21 12 16 15 11 18 11 13 17 12 ...
## $ WHEELS_OFF : int 15 14 34 30 35 38 30 57 36 45 ...
## $ SCHEDULED_TIME : int 205 280 286 285 235 217 181 273 195 221 ...
## $ ELAPSED_TIME : int 194 279 293 281 215 230 170 249 193 203 ...
## $ AIR_TIME : int 169 263 266 258 199 206 154 228 173 186 ...
## $ DISTANCE : int 1448 2330 2296 2342 1448 1589 1299 2125 1464 1747 ...
## $ WHEELS_ON : int 404 737 800 748 254 604 504 745 529 651 ...
## $ TAXI_IN : int 4 4 11 8 5 6 5 8 3 5 ...
## $ SCHEDULED_ARRIVAL : int 430 750 806 805 320 602 526 803 545 711 ...
## $ ARRIVAL_TIME : int 408 741 811 756 259 610 509 753 532 656 ...
## $ ARRIVAL_DELAY : int -22 -9 5 -9 -21 8 -17 -10 -13 -15 ...
## $ DIVERTED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CANCELLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CANCELLATION_REASON: chr "" "" "" "" ...
## $ AIR_SYSTEM_DELAY : int NA NA NA NA NA NA NA NA NA NA ...
## $ SECURITY_DELAY : int NA NA NA NA NA NA NA NA NA NA ...
## $ AIRLINE_DELAY : int NA NA NA NA NA NA NA NA NA NA ...
## $ LATE_AIRCRAFT_DELAY: int NA NA NA NA NA NA NA NA NA NA ...
## $ WEATHER_DELAY : int NA NA NA NA NA NA NA NA NA NA ...
head(data)
## YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER TAIL_NUMBER ORIGIN_AIRPORT
## 1 2015 1 1 4 AS 98 N407AS ANC
## 2 2015 1 1 4 AA 2336 N3KUAA LAX
## 3 2015 1 1 4 US 840 N171US SFO
## 4 2015 1 1 4 AA 258 N3HYAA LAX
## 5 2015 1 1 4 AS 135 N527AS SEA
## 6 2015 1 1 4 DL 806 N3730B SFO
## DESTINATION_AIRPORT SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY
## 1 SEA 5 2354 -11
## 2 PBI 10 2 -8
## 3 CLT 20 18 -2
## 4 MIA 20 15 -5
## 5 ANC 25 24 -1
## 6 MSP 25 20 -5
## TAXI_OUT WHEELS_OFF SCHEDULED_TIME ELAPSED_TIME AIR_TIME DISTANCE WHEELS_ON
## 1 21 15 205 194 169 1448 404
## 2 12 14 280 279 263 2330 737
## 3 16 34 286 293 266 2296 800
## 4 15 30 285 281 258 2342 748
## 5 11 35 235 215 199 1448 254
## 6 18 38 217 230 206 1589 604
## TAXI_IN SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED CANCELLED
## 1 4 430 408 -22 0 0
## 2 4 750 741 -9 0 0
## 3 11 806 811 5 0 0
## 4 8 805 756 -9 0 0
## 5 5 320 259 -21 0 0
## 6 6 602 610 8 0 0
## CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## LATE_AIRCRAFT_DELAY WEATHER_DELAY
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
To get a sense of the distribution of the variables in the data, I will calculate summary statistics such as the minimum, quartiles, median, mean, and maximum with summary().
summary(data)
## YEAR MONTH DAY DAY_OF_WEEK
## Min. :2015 Min. : 1.000 Min. : 1.0 Min. :1.000
## 1st Qu.:2015 1st Qu.: 4.000 1st Qu.: 8.0 1st Qu.:2.000
## Median :2015 Median : 7.000 Median :16.0 Median :4.000
## Mean :2015 Mean : 6.524 Mean :15.7 Mean :3.927
## 3rd Qu.:2015 3rd Qu.: 9.000 3rd Qu.:23.0 3rd Qu.:6.000
## Max. :2015 Max. :12.000 Max. :31.0 Max. :7.000
##
## AIRLINE FLIGHT_NUMBER TAIL_NUMBER ORIGIN_AIRPORT
## Length:5819079 Min. : 1 Length:5819079 Length:5819079
## Class :character 1st Qu.: 730 Class :character Class :character
## Mode :character Median :1690 Mode :character Mode :character
## Mean :2173
## 3rd Qu.:3230
## Max. :9855
##
## DESTINATION_AIRPORT SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY
## Length:5819079 Min. : 1 Min. : 1 Min. : -82.00
## Class :character 1st Qu.: 917 1st Qu.: 921 1st Qu.: -5.00
## Mode :character Median :1325 Median :1330 Median : -2.00
## Mean :1330 Mean :1335 Mean : 9.37
## 3rd Qu.:1730 3rd Qu.:1740 3rd Qu.: 7.00
## Max. :2359 Max. :2400 Max. :1988.00
## NA's :86153 NA's :86153
## TAXI_OUT WHEELS_OFF SCHEDULED_TIME ELAPSED_TIME
## Min. : 1.00 Min. : 1 Min. : 18.0 Min. : 14
## 1st Qu.: 11.00 1st Qu.: 935 1st Qu.: 85.0 1st Qu.: 82
## Median : 14.00 Median :1343 Median :123.0 Median :118
## Mean : 16.07 Mean :1357 Mean :141.7 Mean :137
## 3rd Qu.: 19.00 3rd Qu.:1754 3rd Qu.:173.0 3rd Qu.:168
## Max. :225.00 Max. :2400 Max. :718.0 Max. :766
## NA's :89047 NA's :89047 NA's :6 NA's :105071
## AIR_TIME DISTANCE WHEELS_ON TAXI_IN
## Min. : 7.0 Min. : 21.0 Min. : 1 Min. : 1.00
## 1st Qu.: 60.0 1st Qu.: 373.0 1st Qu.:1054 1st Qu.: 4.00
## Median : 94.0 Median : 647.0 Median :1509 Median : 6.00
## Mean :113.5 Mean : 822.4 Mean :1471 Mean : 7.43
## 3rd Qu.:144.0 3rd Qu.:1062.0 3rd Qu.:1911 3rd Qu.: 9.00
## Max. :690.0 Max. :4983.0 Max. :2400 Max. :248.00
## NA's :105071 NA's :92513 NA's :92513
## SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED
## Min. : 1 Min. : 1 Min. : -87.00 Min. :0.00000
## 1st Qu.:1110 1st Qu.:1059 1st Qu.: -13.00 1st Qu.:0.00000
## Median :1520 Median :1512 Median : -5.00 Median :0.00000
## Mean :1494 Mean :1476 Mean : 4.41 Mean :0.00261
## 3rd Qu.:1918 3rd Qu.:1917 3rd Qu.: 8.00 3rd Qu.:0.00000
## Max. :2400 Max. :2400 Max. :1971.00 Max. :1.00000
## NA's :92513 NA's :105071
## CANCELLED CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY
## Min. :0.00000 Length:5819079 Min. : 0 Min. : 0
## 1st Qu.:0.00000 Class :character 1st Qu.: 0 1st Qu.: 0
## Median :0.00000 Mode :character Median : 2 Median : 0
## Mean :0.01545 Mean : 13 Mean : 0
## 3rd Qu.:0.00000 3rd Qu.: 18 3rd Qu.: 0
## Max. :1.00000 Max. :1134 Max. :573
## NA's :4755640 NA's :4755640
## AIRLINE_DELAY LATE_AIRCRAFT_DELAY WEATHER_DELAY
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 2 Median : 3 Median : 0
## Mean : 19 Mean : 23 Mean : 3
## 3rd Qu.: 19 3rd Qu.: 29 3rd Qu.: 0
## Max. :1971 Max. :1331 Max. :1211
## NA's :4755640 NA's :4755640 NA's :4755640
Min: The minimum value of the feature.
1st Qu: The first quartile, which is the 25th percentile of the feature.
Median: The middle value of the feature, when the values are sorted in increasing order.
Mean: The average of all the values of the feature.
3rd Qu: The third quartile, which is the 75th percentile of the feature.
Max: The maximum value of the feature.
NA’s: The number of missing values in the feature.
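As a minimal sketch of how these figures arise, the same statistics can be computed one at a time on a small vector. The values below are made up for illustration and are not taken from flights.csv. Note that summary() does not report the standard deviation, so sd() is called separately here.

```r
# Hypothetical delays in minutes, including two missing values
x <- c(-22, -9, 5, -9, -21, 8, NA, 45, 130, NA)

# na.rm = TRUE drops the missing values before each calculation
stats <- c(
  min    = min(x, na.rm = TRUE),
  q1     = unname(quantile(x, 0.25, na.rm = TRUE)),  # 25th percentile
  median = median(x, na.rm = TRUE),
  mean   = mean(x, na.rm = TRUE),
  q3     = unname(quantile(x, 0.75, na.rm = TRUE)),  # 75th percentile
  max    = max(x, na.rm = TRUE),
  sd     = sd(x, na.rm = TRUE),                      # not part of summary()
  n_na   = sum(is.na(x))                             # count of missing values
)
print(stats)
```

In this toy vector the mean (15.875) is well above the median (-2) because of one large value, the same pattern that ARRIVAL_DELAY shows in the summary output above.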
Next, I will use plotting functions to visualize the data.
install.packages("ggplot2")
library(ggplot2)
ggplot(data, aes(x = ARRIVAL_DELAY)) +
  geom_histogram(fill = "blue", alpha = 0.5) +
  xlab("Arrival Delay (minutes)") +
  ylab("Frequency") +
  ggtitle("Histogram of Arrival Delay")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 105071 rows containing non-finite values (`stat_bin()`).
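The two messages above suggest two possible refinements: choosing the bin width explicitly and removing the missing delays before plotting. The following is a sketch on a small made-up data frame, not a rerun on flights.csv. The bin width of 15 minutes and the x-axis limits are arbitrary choices for illustration. The plotting step is wrapped in requireNamespace() so the sketch still runs where ggplot2 is not installed.

```r
# Toy stand-in for the flights data (values are hypothetical)
toy <- data.frame(ARRIVAL_DELAY = c(-22, -9, 5, -9, -21, 8, NA, 45, 130, NA))

# Drop missing delays explicitly instead of relying on the warning
delays <- toy[!is.na(toy$ARRIVAL_DELAY), , drop = FALSE]
n_dropped <- nrow(toy) - nrow(delays)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(delays, aes(x = ARRIVAL_DELAY)) +
    # binwidth replaces the default of 30 bins (15 minutes is an assumption)
    geom_histogram(binwidth = 15, fill = "blue", alpha = 0.5) +
    # zoom in on the bulk of the data without discarding the long tail
    coord_cartesian(xlim = c(-60, 180)) +
    xlab("Arrival Delay (minutes)") +
    ylab("Frequency") +
    ggtitle("Histogram of Arrival Delay")
  # call print(p) to draw the plot
}
```

coord_cartesian() only changes the visible range, so extreme delays still count toward the bins; this matters for a variable whose maximum is 1971 minutes.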
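The histogram shows only how arrival delays are distributed. Whether factors such as the scheduled arrival time, the day of the week, and the airline have an impact on the arrival delay cannot be read from it, so these relationships need to be examined separately.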
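One way to check whether a factor such as the airline or the day of the week is related to arrival delay is to compare a summary statistic per group. The sketch below uses base R's aggregate() on a small made-up data frame. The column names follow the data set, but the values are hypothetical, and the median is only one possible choice of statistic.

```r
# Toy data with the same column names as flights.csv (values are made up)
toy <- data.frame(
  AIRLINE       = c("AS", "AA", "US", "AA", "AS", "DL", "DL", "US"),
  DAY_OF_WEEK   = c(4, 4, 4, 5, 5, 5, 6, 6),
  ARRIVAL_DELAY = c(-22, -9, 5, -9, -21, 8, NA, 30)
)

# Median arrival delay per airline.
# The formula interface drops rows with NA by default (na.action = na.omit).
by_airline <- aggregate(ARRIVAL_DELAY ~ AIRLINE, data = toy, FUN = median)

# Median arrival delay per day of the week
by_day <- aggregate(ARRIVAL_DELAY ~ DAY_OF_WEEK, data = toy, FUN = median)

print(by_airline)
print(by_day)
```

Differences between groups in such a table are descriptive only; they do not by themselves show that the grouping variable causes the delay.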
# After looking at the output above, I note the following
SCHEDULED_ARRIVAL: The minimum value is 1, the 1st quartile 1110, the median 1520, the mean 1494, the 3rd quartile 1918, and the maximum 2400. These values are clock times in hhmm form (1520 is 15:20), so they describe when flights are scheduled to land, not how punctual they are. The 92513 missing values (NA’s) printed in the same block of the summary appear to belong to ARRIVAL_TIME, not to this column.
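Columns such as SCHEDULED_ARRIVAL appear to store clock times as integers in hhmm form (430 for 04:30, with values up to 2400); this reading is an assumption based on the ranges printed by summary(). Under that assumption, a value can be converted to minutes after midnight before any arithmetic is done on it. The helper name hhmm_to_minutes is hypothetical.

```r
# Convert an integer clock time in hhmm form to minutes after midnight.
# Assumption: 2400 denotes midnight at the end of the day.
hhmm_to_minutes <- function(hhmm) {
  hours   <- hhmm %/% 100   # integer division keeps the hour part
  minutes <- hhmm %% 100    # remainder keeps the minute part
  hours * 60 + minutes
}

# Hypothetical scheduled times, including a missing value
scheduled <- c(5, 430, 1520, 2359, 2400, NA)
minutes_after_midnight <- hhmm_to_minutes(scheduled)
print(minutes_after_midnight)
```

Averaging the raw hhmm integers is not a true average clock time, because the values jump from 1159 to 1200 rather than running continuously; converting first avoids that.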
ARRIVAL_DELAY: The minimum value is -87 and the maximum value is 1971, so flights arrived up to 87 minutes before the scheduled arrival time and up to 1971 minutes (nearly 33 hours) after it. The median of -5 shows that the typical flight arrives slightly ahead of schedule, while the mean of 4.41 is pulled upward by a small number of very long delays.
DIVERTED: The minimum and maximum values are 0 and 1, so the column indicates whether a flight was diverted or not. The mean value of 0.00261 means that about 0.26% of flights were diverted.
CANCELLED: The minimum and maximum values are 0 and 1, so the column indicates whether a flight was cancelled or not. The mean value of 0.01545 means that about 1.5% of flights were cancelled.
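The reason the mean can be read as a fraction is that the mean of a 0/1 indicator equals the share of ones. A minimal sketch on a made-up vector (the values are hypothetical):

```r
# Hypothetical indicator: 1 = cancelled, 0 = not cancelled
cancelled <- c(0, 0, 1, 0, 0, 0, 0, 0, 1, 0)

share <- mean(cancelled)       # proportion of flights with value 1
count <- sum(cancelled == 1)   # number of flights with value 1

msg <- sprintf("%.1f%% of flights cancelled (%d of %d)",
               100 * share, count, length(cancelled))
print(msg)
```

Applied to the summary above, a mean of 0.01545 over 5,819,079 flights corresponds to roughly 90,000 cancelled flights.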
CANCELLATION_REASON: The summary reports only the length (5819079) and the class “character”, so this feature is stored as text. It is empty for the six flights shown by head(), none of which were cancelled. The distinct reasons and their frequencies would have to be counted separately.
AIR_SYSTEM_DELAY, SECURITY_DELAY, AIRLINE_DELAY, LATE_AIRCRAFT_DELAY, WEATHER_DELAY: These features have a minimum value of 0, so there are no negative delays, and maximum values between 573 and 1971 minutes, so an individual delay can be substantial. Each feature is missing for 4755640 of the 5819079 flights, so the statistics describe only the roughly 1.06 million flights for which a delay breakdown was recorded. Among those flights the medians lie between 0 and 3 minutes, while the means range from 0 to 23 minutes. Most of these flights therefore lose little time to any single cause, and a minority of long delays raises the averages.
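Both points can be checked directly: table() counts the categories of a character column, and colMeans(is.na(...)) gives the share of missing values per column. The sketch below runs on a small made-up data frame. The reason codes "A" and "B" are placeholders chosen for illustration, not codes confirmed from flights.csv.

```r
# Toy data (hypothetical values; "A" and "B" are placeholder reason codes)
toy <- data.frame(
  CANCELLATION_REASON = c("", "", "B", "", "A", "", "B", ""),
  WEATHER_DELAY       = c(NA, NA, NA, 0, NA, 12, NA, NA),
  AIRLINE_DELAY       = c(NA, NA, NA, 25, NA, 3, NA, NA)
)

# Treat the empty string as "no reason recorded" and count the rest
reason <- toy$CANCELLATION_REASON
reason[reason == ""] <- NA
reason_counts <- table(reason)   # table() leaves out NA by default

# Share of missing values in each delay-cause column
na_share <- colMeans(is.na(toy[c("WEATHER_DELAY", "AIRLINE_DELAY")]))

print(reason_counts)
print(na_share)
```

In the summary output above, 4755640 missing values out of 5819079 rows corresponds to a missing share of about 82% for each delay-cause column.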
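In conclusion, this first look at the “US flights from 2015 to 2020” data set, of which the file analysed here covers 2015 only, shows that most flights arrive close to or slightly ahead of schedule, that about 1.5% of flights are cancelled and about 0.26% are diverted, and that a small number of very long delays dominates the averages. These summaries provide a baseline for future analysis of the factors that contribute to flight delays and cancellations and for decision making in the aviation industry.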