#I will work on the “US flights from 2015 to 2020” data set

This data set is described as containing information on over 7 million domestic flights within the United States from 2015 to 2020, including the airline, flight number, departure and arrival times, and various performance metrics. The file loaded below contains 5,819,079 rows, and every row has YEAR equal to 2015.

data <- read.csv("flights.csv", header = TRUE)

I will start the analysis by looking at the structure and contents of the data frame with the str() and head() functions.

#The str() function gives you a summary of the data frame’s structure

str(data)
## 'data.frame':    5819079 obs. of  31 variables:
##  $ YEAR               : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ MONTH              : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DAY                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DAY_OF_WEEK        : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ AIRLINE            : chr  "AS" "AA" "US" "AA" ...
##  $ FLIGHT_NUMBER      : int  98 2336 840 258 135 806 612 2013 1112 1173 ...
##  $ TAIL_NUMBER        : chr  "N407AS" "N3KUAA" "N171US" "N3HYAA" ...
##  $ ORIGIN_AIRPORT     : chr  "ANC" "LAX" "SFO" "LAX" ...
##  $ DESTINATION_AIRPORT: chr  "SEA" "PBI" "CLT" "MIA" ...
##  $ SCHEDULED_DEPARTURE: int  5 10 20 20 25 25 25 30 30 30 ...
##  $ DEPARTURE_TIME     : int  2354 2 18 15 24 20 19 44 19 33 ...
##  $ DEPARTURE_DELAY    : int  -11 -8 -2 -5 -1 -5 -6 14 -11 3 ...
##  $ TAXI_OUT           : int  21 12 16 15 11 18 11 13 17 12 ...
##  $ WHEELS_OFF         : int  15 14 34 30 35 38 30 57 36 45 ...
##  $ SCHEDULED_TIME     : int  205 280 286 285 235 217 181 273 195 221 ...
##  $ ELAPSED_TIME       : int  194 279 293 281 215 230 170 249 193 203 ...
##  $ AIR_TIME           : int  169 263 266 258 199 206 154 228 173 186 ...
##  $ DISTANCE           : int  1448 2330 2296 2342 1448 1589 1299 2125 1464 1747 ...
##  $ WHEELS_ON          : int  404 737 800 748 254 604 504 745 529 651 ...
##  $ TAXI_IN            : int  4 4 11 8 5 6 5 8 3 5 ...
##  $ SCHEDULED_ARRIVAL  : int  430 750 806 805 320 602 526 803 545 711 ...
##  $ ARRIVAL_TIME       : int  408 741 811 756 259 610 509 753 532 656 ...
##  $ ARRIVAL_DELAY      : int  -22 -9 5 -9 -21 8 -17 -10 -13 -15 ...
##  $ DIVERTED           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CANCELLED          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CANCELLATION_REASON: chr  "" "" "" "" ...
##  $ AIR_SYSTEM_DELAY   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SECURITY_DELAY     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ AIRLINE_DELAY      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LATE_AIRCRAFT_DELAY: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ WEATHER_DELAY      : int  NA NA NA NA NA NA NA NA NA NA ...
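The str() output reports 5819079 observations, and the first values of YEAR are all 2015. A minimal sketch of how the row count and the years covered could be checked directly is shown here. It uses a small hypothetical data frame named `flights_sample` so that it runs on its own; on the real data, the same calls would be applied to `data`.

```r
# Hypothetical stand-in for the loaded data frame (values are illustrative).
flights_sample <- data.frame(
  YEAR  = c(2015, 2015, 2015, 2015),
  MONTH = c(1, 1, 2, 12)
)

n_rows     <- nrow(flights_sample)        # number of flights (rows)
year_range <- range(flights_sample$YEAR)  # earliest and latest year present
year_count <- table(flights_sample$YEAR)  # number of rows per year

print(n_rows)
print(year_range)
print(year_count)
```

If `year_range` returns the same year twice, the file covers a single year only.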

The head() function shows the first six rows of the data frame.

head(data)
##   YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER TAIL_NUMBER ORIGIN_AIRPORT
## 1 2015     1   1           4      AS            98      N407AS            ANC
## 2 2015     1   1           4      AA          2336      N3KUAA            LAX
## 3 2015     1   1           4      US           840      N171US            SFO
## 4 2015     1   1           4      AA           258      N3HYAA            LAX
## 5 2015     1   1           4      AS           135      N527AS            SEA
## 6 2015     1   1           4      DL           806      N3730B            SFO
##   DESTINATION_AIRPORT SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY
## 1                 SEA                   5           2354             -11
## 2                 PBI                  10              2              -8
## 3                 CLT                  20             18              -2
## 4                 MIA                  20             15              -5
## 5                 ANC                  25             24              -1
## 6                 MSP                  25             20              -5
##   TAXI_OUT WHEELS_OFF SCHEDULED_TIME ELAPSED_TIME AIR_TIME DISTANCE WHEELS_ON
## 1       21         15            205          194      169     1448       404
## 2       12         14            280          279      263     2330       737
## 3       16         34            286          293      266     2296       800
## 4       15         30            285          281      258     2342       748
## 5       11         35            235          215      199     1448       254
## 6       18         38            217          230      206     1589       604
##   TAXI_IN SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED CANCELLED
## 1       4               430          408           -22        0         0
## 2       4               750          741            -9        0         0
## 3      11               806          811             5        0         0
## 4       8               805          756            -9        0         0
## 5       5               320          259           -21        0         0
## 6       6               602          610             8        0         0
##   CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY
## 1                                   NA             NA            NA
## 2                                   NA             NA            NA
## 3                                   NA             NA            NA
## 4                                   NA             NA            NA
## 5                                   NA             NA            NA
## 6                                   NA             NA            NA
##   LATE_AIRCRAFT_DELAY WEATHER_DELAY
## 1                  NA            NA
## 2                  NA            NA
## 3                  NA            NA
## 4                  NA            NA
## 5                  NA            NA
## 6                  NA            NA

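To get a sense of the distribution of the variables, I will compute summary statistics with summary(). For each numeric column it reports the minimum, the quartiles, the median, the mean, the maximum, and the number of missing values. It does not report the standard deviation.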

summary(data)
##       YEAR          MONTH             DAY        DAY_OF_WEEK   
##  Min.   :2015   Min.   : 1.000   Min.   : 1.0   Min.   :1.000  
##  1st Qu.:2015   1st Qu.: 4.000   1st Qu.: 8.0   1st Qu.:2.000  
##  Median :2015   Median : 7.000   Median :16.0   Median :4.000  
##  Mean   :2015   Mean   : 6.524   Mean   :15.7   Mean   :3.927  
##  3rd Qu.:2015   3rd Qu.: 9.000   3rd Qu.:23.0   3rd Qu.:6.000  
##  Max.   :2015   Max.   :12.000   Max.   :31.0   Max.   :7.000  
##                                                                
##    AIRLINE          FLIGHT_NUMBER  TAIL_NUMBER        ORIGIN_AIRPORT    
##  Length:5819079     Min.   :   1   Length:5819079     Length:5819079    
##  Class :character   1st Qu.: 730   Class :character   Class :character  
##  Mode  :character   Median :1690   Mode  :character   Mode  :character  
##                     Mean   :2173                                        
##                     3rd Qu.:3230                                        
##                     Max.   :9855                                        
##                                                                         
##  DESTINATION_AIRPORT SCHEDULED_DEPARTURE DEPARTURE_TIME  DEPARTURE_DELAY  
##  Length:5819079      Min.   :   1        Min.   :   1    Min.   : -82.00  
##  Class :character    1st Qu.: 917        1st Qu.: 921    1st Qu.:  -5.00  
##  Mode  :character    Median :1325        Median :1330    Median :  -2.00  
##                      Mean   :1330        Mean   :1335    Mean   :   9.37  
##                      3rd Qu.:1730        3rd Qu.:1740    3rd Qu.:   7.00  
##                      Max.   :2359        Max.   :2400    Max.   :1988.00  
##                                          NA's   :86153   NA's   :86153    
##     TAXI_OUT        WHEELS_OFF    SCHEDULED_TIME   ELAPSED_TIME   
##  Min.   :  1.00   Min.   :   1    Min.   : 18.0   Min.   : 14     
##  1st Qu.: 11.00   1st Qu.: 935    1st Qu.: 85.0   1st Qu.: 82     
##  Median : 14.00   Median :1343    Median :123.0   Median :118     
##  Mean   : 16.07   Mean   :1357    Mean   :141.7   Mean   :137     
##  3rd Qu.: 19.00   3rd Qu.:1754    3rd Qu.:173.0   3rd Qu.:168     
##  Max.   :225.00   Max.   :2400    Max.   :718.0   Max.   :766     
##  NA's   :89047    NA's   :89047   NA's   :6       NA's   :105071  
##     AIR_TIME         DISTANCE        WHEELS_ON        TAXI_IN      
##  Min.   :  7.0    Min.   :  21.0   Min.   :   1    Min.   :  1.00  
##  1st Qu.: 60.0    1st Qu.: 373.0   1st Qu.:1054    1st Qu.:  4.00  
##  Median : 94.0    Median : 647.0   Median :1509    Median :  6.00  
##  Mean   :113.5    Mean   : 822.4   Mean   :1471    Mean   :  7.43  
##  3rd Qu.:144.0    3rd Qu.:1062.0   3rd Qu.:1911    3rd Qu.:  9.00  
##  Max.   :690.0    Max.   :4983.0   Max.   :2400    Max.   :248.00  
##  NA's   :105071                    NA's   :92513   NA's   :92513   
##  SCHEDULED_ARRIVAL  ARRIVAL_TIME   ARRIVAL_DELAY        DIVERTED      
##  Min.   :   1      Min.   :   1    Min.   : -87.00   Min.   :0.00000  
##  1st Qu.:1110      1st Qu.:1059    1st Qu.: -13.00   1st Qu.:0.00000  
##  Median :1520      Median :1512    Median :  -5.00   Median :0.00000  
##  Mean   :1494      Mean   :1476    Mean   :   4.41   Mean   :0.00261  
##  3rd Qu.:1918      3rd Qu.:1917    3rd Qu.:   8.00   3rd Qu.:0.00000  
##  Max.   :2400      Max.   :2400    Max.   :1971.00   Max.   :1.00000  
##                    NA's   :92513   NA's   :105071                     
##    CANCELLED       CANCELLATION_REASON AIR_SYSTEM_DELAY  SECURITY_DELAY   
##  Min.   :0.00000   Length:5819079      Min.   :   0      Min.   :  0      
##  1st Qu.:0.00000   Class :character    1st Qu.:   0      1st Qu.:  0      
##  Median :0.00000   Mode  :character    Median :   2      Median :  0      
##  Mean   :0.01545                       Mean   :  13      Mean   :  0      
##  3rd Qu.:0.00000                       3rd Qu.:  18      3rd Qu.:  0      
##  Max.   :1.00000                       Max.   :1134      Max.   :573      
##                                        NA's   :4755640   NA's   :4755640  
##  AIRLINE_DELAY     LATE_AIRCRAFT_DELAY WEATHER_DELAY    
##  Min.   :   0      Min.   :   0        Min.   :   0     
##  1st Qu.:   0      1st Qu.:   0        1st Qu.:   0     
##  Median :   2      Median :   3        Median :   0     
##  Mean   :  19      Mean   :  23        Mean   :   3     
##  3rd Qu.:  19      3rd Qu.:  29        3rd Qu.:   0     
##  Max.   :1971      Max.   :1331        Max.   :1211     
##  NA's   :4755640   NA's   :4755640     NA's   :4755640

Min: The minimum value of the feature.

1st Qu: The first quartile, which is the 25th percentile of the feature.

Median: The middle value of the feature, when the values are sorted in increasing order.

Mean: The average of all the values of the feature.

3rd Qu: The third quartile, which is the 75th percentile of the feature.

Max: The maximum value of the feature.

NA’s: The number of missing values in the feature.
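summary() does not report the standard deviation. A minimal sketch of how it could be computed alongside the NA counts is shown here, assuming a small hypothetical data frame `toy` whose values are taken from the first rows printed by head() plus one artificial NA row. On the real data, the same calls would be applied to the numeric columns of `data`.

```r
# Hypothetical sample: first six delays from head(), plus one missing row.
toy <- data.frame(
  ARRIVAL_DELAY   = c(-22, -9, 5, -9, -21, 8, NA),
  DEPARTURE_DELAY = c(-11, -8, -2, -5, -1, -5, NA)
)

# Standard deviation per column, ignoring missing values.
delay_sd <- sapply(toy, sd, na.rm = TRUE)

# Number of missing values per column (what summary() lists as NA's).
na_count <- colSums(is.na(toy))

print(delay_sd)
print(na_count)
```

Without `na.rm = TRUE`, sd() returns NA for any column that contains a missing value.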

Next, I will use plotting functions to visualize the data.

install.packages("ggplot2")

library(ggplot2)
ggplot(data, aes(x = ARRIVAL_DELAY)) + 
  geom_histogram(fill = "blue", alpha = 0.5) + 
  xlab("Arrival Delay (minutes)") + 
  ylab("Frequency") + 
  ggtitle("Histogram of Arrival Delay")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 105071 rows containing non-finite values (`stat_bin()`).
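The two messages above come from the default of 30 bins and from the rows where ARRIVAL_DELAY is NA. A sketch of one way to address both is shown here, assuming a small hypothetical data frame `toy` in place of `data`. The bin width of 15 minutes and the displayed range of -60 to 180 minutes are illustrative choices, not values taken from the analysis.

```r
library(ggplot2)

# Hypothetical sample: a few delays in minutes, one long delay, one missing value.
toy <- data.frame(ARRIVAL_DELAY = c(-22, -9, 5, -9, -21, 8, 120, NA))

p <- ggplot(toy, aes(x = ARRIVAL_DELAY)) +
  # Explicit 15-minute bins aligned at 0; na.rm = TRUE drops NA rows silently.
  geom_histogram(binwidth = 15, boundary = 0, na.rm = TRUE,
                 fill = "blue", alpha = 0.5) +
  # Zoom the x axis without discarding the data outside the range.
  coord_cartesian(xlim = c(-60, 180)) +
  xlab("Arrival Delay (minutes)") +
  ylab("Frequency") +
  ggtitle("Histogram of Arrival Delay")

print(p)
```

coord_cartesian() is used instead of xlim() so that flights outside the displayed range are only hidden from view and are not removed before binning.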

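The histogram shows only the distribution of arrival delay. With the default 30 bins and delays ranging from -87 to 1971 minutes, almost all flights fall into the few bins around zero, and the long right tail is barely visible. Whether factors such as the scheduled arrival time, the day of the week, and the airline have an impact on the arrival delay cannot be read from this plot; that would require comparing the delay across groups of those variables.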
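One way to check whether a factor such as the airline is related to arrival delay is to summarize the delay per group. The sketch below assumes a hypothetical data frame `toy` built from the six rows printed by head(); the median is used as the summary because the delay distribution has a long right tail. On the real data, the same call would be applied to `data`.

```r
# Hypothetical sample: the six rows shown by head().
toy <- data.frame(
  AIRLINE       = c("AS", "AA", "US", "AA", "AS", "DL"),
  ARRIVAL_DELAY = c(-22, -9, 5, -9, -21, 8)
)

# Median arrival delay per airline; the formula interface drops NA rows.
by_airline <- aggregate(ARRIVAL_DELAY ~ AIRLINE, data = toy, FUN = median)

print(by_airline)
```

Replacing AIRLINE with DAY_OF_WEEK in the formula gives the same summary per weekday.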

#After looking at the output above, I see the following

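SCHEDULED_ARRIVAL: The values are clock times in HHMM format, not durations. The minimum is 1 (00:01), the 1st quartile is 1110, the median is 1520 (15:20), the 3rd quartile is 1918, and the maximum is 2400, so half of all flights are scheduled to arrive between 11:10 and 19:18. The mean of 1494 is hard to interpret because HHMM values are not evenly spaced numbers (1494 is not a valid clock time). This column has no missing values; the 92513 NA’s in the summary belong to ARRIVAL_TIME. On its own, this column says nothing about punctuality, which is measured by ARRIVAL_DELAY.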
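Because the time columns are stored as HHMM integers, they need to be converted before any arithmetic. A minimal sketch follows, using a hypothetical helper named `hhmm_to_minutes` (not part of the original analysis) and the six rows printed by head(). It does not handle flights that cross midnight.

```r
# Hypothetical helper: convert HHMM clock values to minutes since midnight.
hhmm_to_minutes <- function(x) {
  (x %/% 100) * 60 + (x %% 100)   # hours * 60 + minutes
}

# Values from the six rows shown by head().
scheduled_arrival <- c(430, 750, 806, 805, 320, 602)
arrival_time      <- c(408, 741, 811, 756, 259, 610)

# Difference in minutes between actual and scheduled arrival.
delay_minutes <- hhmm_to_minutes(arrival_time) - hhmm_to_minutes(scheduled_arrival)

print(delay_minutes)
```

For these six rows the result matches the ARRIVAL_DELAY column in the head() output (-22, -9, 5, -9, -21, 8), which supports reading the time columns as HHMM.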

ARRIVAL_DELAY: The minimum is -87 and the maximum is 1971, so flights arrived up to 87 minutes early and up to 1971 minutes (almost 33 hours) late. The median of -5 and the quartiles of -13 and 8 show that most flights arrive within a few minutes of the scheduled time, and at least half arrive early. The mean of 4.41 is higher than the median because a small number of very long delays pull it up. The value is missing for 105071 flights.

DIVERTED: The minimum and maximum values are 0 and 1, indicating that a flight is either diverted or not. The mean value is 0.00261, indicating that a small fraction of flights are diverted.

CANCELLED: The minimum and maximum values are 0 and 1, indicating that a flight is either cancelled or not. The mean value is 0.01545, indicating that a small fraction of flights are cancelled.
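The mean of a 0/1 column is the share of rows equal to 1, which is why the means above can be read as proportions. A short sketch with a hypothetical indicator vector follows; the last two lines only convert the means reported by summary() to percentages.

```r
# Hypothetical 0/1 indicator: one cancelled flight out of ten.
cancelled <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0)

share_cancelled <- mean(cancelled)  # proportion of flights with value 1
count_cancelled <- sum(cancelled)   # number of flights with value 1

# Means reported by summary() above, expressed as percentages.
pct_cancelled <- 100 * 0.01545
pct_diverted  <- 100 * 0.00261

print(c(share = share_cancelled, count = count_cancelled))
print(c(cancelled = pct_cancelled, diverted = pct_diverted))
```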

CANCELLATION_REASON: The summary reports only the length (5819079) and the class “character”, so it does not show which categories occur. In the str() and head() output the value is an empty string for the first rows, none of which were cancelled, so the column has to be tabulated to see the reason codes.

AIR_SYSTEM_DELAY, SECURITY_DELAY, AIRLINE_DELAY, LATE_AIRCRAFT_DELAY, WEATHER_DELAY: Each of these columns has 4755640 NA’s, so the statistics describe only the remaining 1063439 flights (about 18% of the data set). Within that subset the minimum is 0, so there are no negative delays, and the maximum values range from 573 to 1971 minutes. The medians are between 0 and 3 minutes while the means are between 0 and 23 minutes, so for each cause most of these flights have little or no delay and a few very long delays raise the mean.
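The points above can be checked with table() and is.na(). The sketch below assumes a hypothetical data frame `toy`; the reason codes "A" and "B" are placeholders, not codes confirmed by the output shown here. The last lines redo the arithmetic on the NA count reported by summary().

```r
# Hypothetical sample; reason codes "A" and "B" are placeholders.
toy <- data.frame(
  CANCELLED           = c(0, 0, 1, 0, 1),
  CANCELLATION_REASON = c("", "", "B", "", "A"),
  WEATHER_DELAY       = c(NA, 0, NA, 12, NA),
  AIRLINE_DELAY       = c(NA, 25, NA, 0, NA)
)

# Frequency of each reason among cancelled flights only.
reason_counts <- table(toy$CANCELLATION_REASON[toy$CANCELLED == 1])

# Missing values per delay column.
delay_cols <- c("WEATHER_DELAY", "AIRLINE_DELAY")
na_per_col <- colSums(is.na(toy[delay_cols]))

# Rows with a delay breakdown, from the counts reported by summary().
rows_with_breakdown  <- 5819079 - 4755640
share_with_breakdown <- rows_with_breakdown / 5819079

print(reason_counts)
print(na_per_col)
print(rows_with_breakdown)
```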

In conclusion, this first look at the “US flights from 2015 to 2020” data set, as loaded here, shows that most flights arrive close to the scheduled time, that about 1.5% of flights are cancelled and about 0.26% are diverted, and that very long delays are rare but can be extreme. These summaries provide a baseline for further analysis of the factors behind flight delays and cancellations.