Dataset Exploration

The flights dataset provides on-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.

cat('number of rows:',nrow(flights), #336776 
   '\nnumber of columns:',ncol(flights)) #19
## number of rows: 336776 
## number of columns: 19
glimpse(flights) #shows rows, columns, and snummary of all columns
## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
summary(flights) #column names
##       year          month             day           dep_time    sched_dep_time
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906  
##  Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359  
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
##                                                  NA's   :8255                 
##    dep_delay          arr_time    sched_arr_time   arr_delay       
##  Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
##  1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000  
##  Median :  -2.00   Median :1535   Median :1556   Median :  -5.000  
##  Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895  
##  3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945   3rd Qu.:  14.000  
##  Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000  
##  NA's   :8255      NA's   :8713                  NA's   :9430      
##    carrier              flight       tailnum             origin         
##  Length:336776      Min.   :   1   Length:336776      Length:336776     
##  Class :character   1st Qu.: 553   Class :character   Class :character  
##  Mode  :character   Median :1496   Mode  :character   Mode  :character  
##                     Mean   :1972                                        
##                     3rd Qu.:3465                                        
##                     Max.   :8500                                        
##                                                                         
##      dest              air_time        distance         hour      
##  Length:336776      Min.   : 20.0   Min.   :  17   Min.   : 1.00  
##  Class :character   1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00  
##  Mode  :character   Median :129.0   Median : 872   Median :13.00  
##                     Mean   :150.7   Mean   :1040   Mean   :13.18  
##                     3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00  
##                     Max.   :695.0   Max.   :4983   Max.   :23.00  
##                     NA's   :9430                                  
##      minute        time_hour                     
##  Min.   : 0.00   Min.   :2013-01-01 05:00:00.00  
##  1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00.00  
##  Median :29.00   Median :2013-07-03 10:00:00.00  
##  Mean   :26.23   Mean   :2013-07-03 05:22:54.64  
##  3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00.00  
##  Max.   :59.00   Max.   :2013-12-31 23:00:00.00  
## 

Description of columns

Column Name Variable Type Variable Description
year categorical ordinal year of flight departure/arrival
month categorical ordinal month of flight departure/arrival
day categorical ordinal day of flight departure/arrival
dep_time numerical discrete time of flight departure
sched_dep_time numerical discrete scheduled flight departure
dep_delay numerical discrete departure delay in minutes
arr_time numerical discrete time of flight arrival
sched_arr_time numerical discrete scheduled arrival time
arr_delay numerical discrete arrival delay in minutes
carrier categorical nominal abbreviated flight carrier name
flight categorical nominal flight number
tailnum categorical nominal tail number, a license plate of a plane
origin categorical nominal origin city of flight
dest categorical nominal destination city of flight
air_time numerical discrete total flight time in minutes
distance numerical discrete total flight distance
hour numerical discrete scheduled hour departure of flight
minute numerical discrete scheduled minute departure of flight
time_hour datetime variable/numerical continuous departure of flight with date and time

Question 1 (description of anti_join())

What does anti_join(flights, airports, by = c("dest" = "faa")) tell you? What does anti_join(airports, flights, by = c("faa" = "dest")) tell you?

Part A (anti_join(flights, airports, by = c(“dest” = “faa”)))

Answer

To start off I wanted to understand the two datasets used in the anti_join() function. This was ‘flights’ and ‘airports’. I typed ?flights and ?airports in R to get a description of each data set and the columns used in each one. This is what I got:

  • flights: This data set contains data for all the flights that departed NYC. It also includes the data for the arrival airport.
  • airports: This data set gives the latitude and longitude for each airport. This data set does not have a focus on flights that left from a particular region (the flights data set has a focus on a particular region which is New York since it only holds records for flights that departed NYC)

So given this context the code anti_join(flights, airports, by = c("dest" = "faa")) would use the “dest” column in the flights dataset and compare it to the “faa” column values in the airports dataset. After the comparison of the column values R is going to see which values in the “dest” column do not have a match with the values in the “faa” column.

So to conclude, the code anti_join(flights, airports, by = c("dest" = "faa")) displays the destinations (in the flights dataset) that are not present in the airports dataset. One explanation about why this might be the case is that the destinations in the flights dataset might be international destinations. Maybe the airports dataset only includes locations of airports in the U.S only. Another reason is that there might have been a error in the data entry for destinations which is why there was no match between the dest and faa datasets.

Part B (anti_join(airports, flights, by = c(“faa” = “dest”)))

The code anti_join(airports, flights, by = c("faa" = "dest")) works in a similar manner. It would use the “faa” column in the airports dataset and compare it to the “dest” column values in the flights dataset. After the comparison of the column values R is going to see which values in the “faa” column do not have a match with the values in the “dest” column.

Since the “faa” was called first R is going to display the “faa” column values that do not have a match or appear in the “dest” column.

One reason as why there might not be any matches between the “faa” and “dest” column is that the flights dataset only includes the information of flights that left NYC. This means that in the flights dataset the departure airport would be one based in New York. Given this, it can be concluded that the destinations displayed in the anti_join(airports, flights, by = c("faa" = "dest")) statement are showing airport destinations that airplanes (from NYC airports) have not traveled too. Another reason is that there might have been a error in the data entry for destinations which is why there was no match between the “faa” and “dest” datasets.

Question 2 (line graph of two numerical variables)

Find the day during 2013 that had the longest average total delay (arrival + departure). Using the weather data, can you shed any light on what happened on that day?

Answer

I created a object non_cancelled which filters out the cancelled flights in the flights dataset. I did this so I do not see empty data values in my final datasets

not_cancelled <- flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

To answer this problem I had to first find the day in 2013 that had the longest average delay. I found this out by first filtering my data with the filter() function to only include flights that occurred in 2013.

I then used the group_by() function to calculate the average delays per day.

The mutate() function was then used to create a new column in the “not_cancelled” dataset that would display the total delay for every flight.

The summarize() function was used to calculate the average delay per day. Then, arrange() was used to order the average delays from greatest to least.

head() was not required, but I just used it in my code to only see the day that had the greatest delay (this was the first row in the new dataset that was made and the output is shown below)

not_cancelled %>% 
  filter(year == 2013) %>%
  group_by(.,year, month, day) %>% 
  mutate(total_delay = dep_delay + arr_delay) %>%
  summarize(ave_delay = mean(total_delay, na.rm=T),) %>%
  arrange(desc(ave_delay)) %>%
  head(1) 
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 1 × 4
## # Groups:   year, month [1]
##    year month   day ave_delay
##   <int> <int> <int>     <dbl>
## 1  2013     3     8      170.

Now I had to investigate further and understand why there were the highest delays on March 8th by creating two objects (“hour_temp” and “hour_humid”).

“hour_temp” would display a line graph of the average temperature in each hour while the object “hour_humid” would display a line graph of the average humidity in each hour of the day.

To display the two line graphs together I used the function grid_arrange(). I did this for easier analysis of the 2 graphs.

hour_temp <- weather %>%
  filter(year == 2013, month == 3, day == 8) %>%
  group_by(hour) %>%
  summarize(ave_temp = mean(temp), ave_humid = mean(humid)) %>%
  ggplot() +
  geom_line( mapping = aes(x=hour,y=ave_temp))

hour_humid <- weather %>%
  filter(year == 2013, month == 3, day == 8) %>%
  group_by(hour) %>%
  summarize(ave_temp = mean(temp), ave_humid = mean(humid)) %>%
  ggplot() +
  geom_line( mapping = aes(x=hour,y=ave_humid))

grid.arrange(hour_temp, hour_humid)

As you can see in hours 0-13 (approximately) of both graphs the temperature was below 36 degrees and the humidity was above 85 degrees.

The low temperature suggests that the weather in March 8th was cold and it was probably snowing which is why there were a lot of delays on March 8th.

The high humidity is also another explanation as to why there were many delays. In fact, the humidity decreased to 75 degrees at around 16 hours which means that the humidity was high for most of the day.

Question 3 (calculating average speed by airline)

Using planes and flights find the airplane models with the fastest average speeds. Which planes are they? Is this surprising? Why or why not?

Answer

I now want to find which airplane models have the highest average speed. I am then going to see if my results were surprising based on what I know/have learned about the planes.

I used started off by joining the “not_cancelled” dataset to the “planes” dataset. I did this because I did not want all the values in the “planes” dataset and just wanted the values that had a match (through the “tailnum” column). I am joining the datasets since I want the model number and use that to compare the average speed.

I had to first use the mutate() function to calculate the “avg_trip_speed” since there was no existing column in the “flights” dataset that gave this value. The average trip speed is by hour since I convert the “air_time” variable from a minutes to hour format by dividing by 60.

After I get “avg_trip_speed” for each plane I then need to group by model and manufacturer so I can calculate the average speed for each airplane model. The variable “ave_model_speed” makes this calculation. I then use the arrange() function to sort the “ave_model_speed” from greatest to lowest order. This way I know which airplane models have the highest average speed.

not_cancelled %>%
  left_join(planes, by = "tailnum") %>%
  mutate(avg_trip_speed = distance/(air_time/60)) %>%
  group_by(model, manufacturer) %>%
  summarize(ave_model_speed = mean(avg_trip_speed,na.rm=T)) %>%
  arrange(desc(ave_model_speed)) %>%
  head(10)
## `summarise()` has grouped output by 'model'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 3
## # Groups:   model [10]
##    model     manufacturer     ave_model_speed
##    <chr>     <chr>                      <dbl>
##  1 777-222   BOEING                      483.
##  2 A330-243  AIRBUS                      480.
##  3 767-424ER BOEING                      467.
##  4 A321-231  AIRBUS INDUSTRIE            460.
##  5 A330-223  AIRBUS INDUSTRIE            458.
##  6 757-212   BOEING                      456.
##  7 A319-115  AIRBUS                      455.
##  8 767-223   BOEING                      452.
##  9 A330-323  AIRBUS                      450.
## 10 A319-112  AIRBUS                      450.

Now from what I have seen in the dataset. The 3 planes that have the highest average speed are Boeing, Air Bus, and Air Bus Industine. I am not surprised with the results since a search online has informed me that Boeing and Air Buses are well known planes used by many airline companies. All three plane models can travel super fast and are used for long haul flights. The aircraft models have been in the market for many years and are bought by major airline companies such as American Airlines, Delta, United Airlines, etc.

Question 4 (scatter plot with regression line)

Is there a relationship between the age of a plane and its delays?

Answer

Now I am going to find out if there is a relationship between a planes age and its delays. To start I have to join the flights dataset with the planes dataset.

I got the planes age by subtracting the x year (year in flight dataset) with the y year (year in planes dataset). I then used group_by() to get the average departure and arrival delays for each plane.

Here is my scatter plot graph of a planes age and the average departure delay as the plane age increases.

flights %>%
  left_join(planes, by= "tailnum") %>%
  mutate(plane_age = year.x - year.y) %>% 
  group_by(plane_age) %>%
  summarize(ave_dep_delay = mean(dep_delay, na.rm=T),
            ave_arr_delay = mean(arr_delay, na.rm=T)) %>%
   ggplot(mapping = aes(x=plane_age, y=ave_dep_delay)) +
   geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).

The scatterplot shows that there is not a relationship between plane age and departure delays because throughout the planes lifecycle most of the points with a plane age of 0-30 is in the 10-20 range and it does not decrease. There are a few points after the plane age of 30 that have a departure delay below 10.

This means that as the plane gets older (beyond the age of 30) the departure delay does slightly decrease but there are no average departure delays that are below zero.

This graph could possibly suggest that as a plane gets older it experiences less departure delays. But ultimately, no matter what the planes age it would most likely experience a departure delay. There is one outlier point that does have a negative departure delay (this outlier point has a plane age close to 50). But this outlier is only one point and it does not acculturatly represent the data.

Here is my scatter plot graph of a planes age and the average arrival delay as the plane ages.

flights %>%
   left_join(planes, by= "tailnum") %>%
   mutate(plane_age = year.x - year.y) %>% 
   group_by(plane_age) %>%
   summarize(ave_dep_delay = mean(dep_delay, na.rm=T),
             ave_arr_delay = mean(arr_delay, na.rm=T)) %>%
   ggplot(mapping = aes(x=plane_age, y=ave_arr_delay)) +
   geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).

As shown in the scatterplot the relationship between “plane age” and “arrival delays” does not exist, but it is stronger than the relationship between “plane age” and “departure delays”.

This is because most of the points in the scattersplot have a delay between 0-10. Majority of the points in the “plane age** and”departure delays” scatterplot had a delay which was greater than 10 which suggests that the average arrival delay time would most likely be lower (by a few minutes) than the average departure delay time.

So I have concluded that there is no relationship between airplane age and delays. This was surprising since I assumed that airplanes with a low age would have less average delays since the airplane was new/young. However, I did notice that the planes in the “plane age” and “departure delays” scatter plot had higher average delay times then the planes listed in the “plane age” and “arrival delays” scatterplot (but do note that this difference was not big and was only a few minutes – 10 minutes approximately)

A possible explanation for this difference between departure and arrival delays could be that before a plane departures it needs to go through aircraft cleaning and loading the luggage off the plane. When a plane arrives the moment the plane lands on a airport is counted as arriving and the other factors are not taken into consideration (such as aircraft cleaning and loading the luggage off the plane).

NOTE:But as I said before this observation might not hold true because the difference between average departure and arrival delays was only by a few minutes (10 minutes approximately) and nothing too big.