1 Exercises

For this project, we focus on the following datasets: airlines, airports, flights, planes, and weather.We are going to invstigate these datasets to answeer the following questions:

1.1 Exercise

Construct a barplot displaying number of flights per month.

## # A tibble: 16 x 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

## # A tibble: 1,458 x 8
##    faa   name                   lat    lon   alt    tz dst   tzone        
##    <chr> <chr>                <dbl>  <dbl> <int> <dbl> <chr> <chr>        
##  1 04G   Lansdowne Airport     41.1  -80.6  1044    -5 A     America/New_~
##  2 06A   Moton Field Municip~  32.5  -85.7   264    -6 A     America/Chic~
##  3 06C   Schaumburg Regional   42.0  -88.1   801    -6 A     America/Chic~
##  4 06N   Randall Airport       41.4  -74.4   523    -5 A     America/New_~
##  5 09J   Jekyll Island Airpo~  31.1  -81.4    11    -5 A     America/New_~
##  6 0A9   Elizabethton Munici~  36.4  -82.2  1593    -5 A     America/New_~
##  7 0G6   Williams County Air~  41.5  -84.5   730    -5 A     America/New_~
##  8 0G7   Finger Lakes Region~  42.9  -76.8   492    -5 A     America/New_~
##  9 0P2   Shoestring Aviation~  39.8  -76.6  1000    -5 U     America/New_~
## 10 0S9   Jefferson County In~  48.1 -123.    108    -8 A     America/Los_~
## # ... with 1,448 more rows

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

## # A tibble: 3,322 x 9
##    tailnum  year type      manufacturer  model  engines seats speed engine
##    <chr>   <int> <chr>     <chr>         <chr>    <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wi~ EMBRAER       EMB-1~       2    55    NA Turbo~
##  2 N102UW   1998 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  3 N103US   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  4 N104UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  5 N10575   2002 Fixed wi~ EMBRAER       EMB-1~       2    55    NA Turbo~
##  6 N105UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  7 N107US   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  8 N108UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  9 N109UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
## 10 N110UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
## # ... with 3,312 more rows

## # A tibble: 26,115 x 15
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
## # ... with 26,105 more rows, and 5 more variables: wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>

Remember to read about the datasets.

2 Exercises

2.1 Exercise

Construct a barplot displaying number of flights per month.

# Here, we convert these variables to the categorical ones.

flights$month <- as.factor(flights$month)
flights$origin <- as.factor(flights$origin)
flights$dest <- as.factor(flights$dest)

planes$manufacturer <-as.factor(planes$manufacturer)

# Vector Name of months

v_months <- c("1" = "Jan", "2" = "Feb","3" = "Mar", 
              "4" = "Apr", "5" = "May", "6" = "Jun",
              "7" = "Jul", "8" = "Aug","9" = "sep",                 
              "10" = "Oct","11" = "Nov", "12" = "Dec")

Find the number of flights per month.

 flights%>%
  group_by(month)%>% na.omit()%>% # aggregation by months
  summarise(Flights =n())%>%
  pander(big.mark = '.', justify = c('left', 'right'))

month	Flights
1	26.398
2	23.611
3	27.902
4	27.564
5	28.128
6	27.075
7	28.293
8	28.756
9	27.010
10	28.618
11	26.971
12	27.020

Now we can construct a barplot displaying number of flights per month.

nfly_per_month <- flights%>%
  group_by(year, month)%>%
  ggplot(aes(x = month))+
  geom_bar() +
  labs(x = "Months", 
       y = "Number of flights",
       title = "Number of flights per month")+

  scale_x_discrete(labels = v_months) 

  nfly_per_month

# ggplotly(nfly_per_month)  # to makee it interactive.

Now, in the barplot (showing number of flights per month), make a separate bar for each origin.

origin_flights <- flights%>%
  ggplot(aes( x = month))+
  geom_bar(aes(fill = origin), position = "dodge") +
  labs(x = "Months",
       y = "Number of flights", 
       title = "flights for each origin per month") +
  scale_x_discrete(labels = v_months)

 # ggplotly(origin_flights) # make it interactive
 
 origin_flights

List of the number of flights for each origin per month

months_list <- flights%>%
  group_by(month, origin)%>%
  summarise(n_of_flights = n())


  months_list%>%
    pander(big.mark = '.', justify = c('left','center', 'right'))

month	origin	n_of_flights
1	EWR	9.893
1	JFK	9.161
1	LGA	7.950
2	EWR	9.107
2	JFK	8.421
2	LGA	7.423
3	EWR	10.420
3	JFK	9.697
3	LGA	8.717
4	EWR	10.531
4	JFK	9.218
4	LGA	8.581
5	EWR	10.592
5	JFK	9.397
5	LGA	8.807
6	EWR	10.175
6	JFK	9.472
6	LGA	8.596
7	EWR	10.475
7	JFK	10.023
7	LGA	8.927
8	EWR	10.359
8	JFK	9.983
8	LGA	8.985
9	EWR	9.550
9	JFK	8.908
9	LGA	9.116
10	EWR	10.104
10	JFK	9.143
10	LGA	9.642
11	EWR	9.707
11	JFK	8.710
11	LGA	8.851
12	EWR	9.922
12	JFK	9.146
12	LGA	9.067

kable(months_list, "html") %>%
  kable_styling("striped","hover") %>%
  scroll_box(width = "500px", height = "600px")

month	origin	n_of_flights
1	EWR	9893
1	JFK	9161
1	LGA	7950
2	EWR	9107
2	JFK	8421
2	LGA	7423
3	EWR	10420
3	JFK	9697
3	LGA	8717
4	EWR	10531
4	JFK	9218
4	LGA	8581
5	EWR	10592
5	JFK	9397
5	LGA	8807
6	EWR	10175
6	JFK	9472
6	LGA	8596
7	EWR	10475
7	JFK	10023
7	LGA	8927
8	EWR	10359
8	JFK	9983
8	LGA	8985
9	EWR	9550
9	JFK	8908
9	LGA	9116
10	EWR	10104
10	JFK	9143
10	LGA	9642
11	EWR	9707
11	JFK	8710
11	LGA	8851
12	EWR	9922
12	JFK	9146
12	LGA	9067

2.2 Exercise

What are the top-10 destinations and how many flights were made to these?

# The top-10 destinations and number of flights to these.

top10 <-flights%>%na.omit()%>%
  group_by(dest)%>%
  summarise(n_flight= n())%>%
  arrange(desc(n_flight))%>%
  select( dest, n_flight)%>%
  head(10)

# List of top-10 destinations and number of flights to these.

top10%>%
  pander(big.mark = '.', decimal.mark = ',', justify = c('left', 'right'))

dest	n_flight
ATL	16.837
ORD	16.566
LAX	16.026
BOS	15.022
MCO	13.967
CLT	13.674
SFO	13.173
FLL	11.897
MIA	11.593
DCA	9.111

For these 10 destinations, make a barplot illustrating the number of flights from origin to top-10 destination.

nfly_to_top10 <-top10%>%
 ggplot(aes(dest, n_flight))+
  geom_col() +
  labs(x = "Destinations",
       y = "Number of flights",
       title = "flights to top-10 destinations")

# ggplotly(nfly_to_top10) # Make it interactive
 nfly_to_top10

Now, order the bars (destinations) according to the total number of flights to these destinations.

nfly_to_top10 <- top10 %>% 
ggplot( aes(x= reorder(dest,-n_flight),n_flight))+
  geom_bar(stat = "identity")+
  labs(x = "Top-10 destinations",
       y = "Number of flights",
       title = "Number of flights")
 nfly_to_top10

# ggplotly(nfly_to_top10) # To make it interactive

2.3 Exercise

Is the mean air time from each of the three origins different? Further, are these differences statistically significant?

flight_air_time <- flights %>% 
  filter(!is.na(air_time)) %>%
  select(air_time, origin)

flight_air_time

## # A tibble: 327,346 x 2
##    air_time origin
##       <dbl> <fct> 
##  1      227 EWR   
##  2      227 LGA   
##  3      160 JFK   
##  4      183 JFK   
##  5      116 LGA   
##  6      150 EWR   
##  7      158 EWR   
##  8       53 LGA   
##  9      140 JFK   
## 10      138 LGA   
## # ... with 327,336 more rows

# perform anova test 
plot(air_time ~ origin, data = flight_air_time)

flight_air_time %>%
  aov(air_time ~ origin, data = .) %>%
  summary()

##                 Df    Sum Sq  Mean Sq F value Pr(>F)    
## origin           2 1.935e+08 96741314   11817 <2e-16 ***
## Residuals   327343 2.680e+09     8186                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Visualize the mean air time for all origins to see the differences between them.

m_air_time_month <- flights%>%
  group_by(origin, month, day)%>%
  summarise(mean_air =  mean(air_time, na.rm = TRUE))%>%
   arrange(desc(mean_air))

 m_air_plot <- m_air_time_month%>% 
ggplot(aes(origin, mean_air))+ theme_bw() +
  geom_boxplot(alpha = 0.4)+
  labs(x = "Origin",
       y = "Mean of air times",
       title = "Origin mean time") +
   
   # added after submission
   # geom_boxplot(alpha=0.4) +
    stat_summary(fun.y=mean, geom="point", shape=4,
                 size=2, color="red", fill="red") +
   
    theme(legend.position="none")
 
  m_air_plot

 # ggplotly(m_air_plot) # make it interactive

2.4 Exercise

How many weather observations are there for each origin?

# Compute the number of observations for each origin.
weather%>%
  group_by(origin)%>% na.omit()%>%
  summarise(n_observ = n())%>% # Number of observations
  arrange(desc(n_observ))%>%
  pander(big.mark = ',', justify = c('left', 'right'))

origin	n_observ
LGA	1,892
EWR	1,689
JFK	1,399

Convert temperature to degrees Celsius. This will be used in the reminder of this miniproject. (We do this for both temp and dewp using mutate_at)

# Convert degree fahrenheit to degree celcius.
fahrenheit_to_celcius <- function(df)
{
  t_celcius <- (df - 32)* 5/9 
    
    return(t_celcius)
}

# Convert temp and dewp to degree selcius.
weather_dc <-  mutate_at(weather,    vars(temp,dewp),funs(fahrenheit_to_celcius))

head(weather_dc)

## # A tibble: 6 x 15
##   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##   <chr>  <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
## 1 EWR     2013     1     1     1  3.9  -3.3   59.4      270      10.4 
## 2 EWR     2013     1     1     2  3.9  -2.80  61.6      250       8.06
## 3 EWR     2013     1     1     3  3.9  -2.2   64.4      240      11.5 
## 4 EWR     2013     1     1     4  4.4  -2.2   62.2      250      12.7 
## 5 EWR     2013     1     1     5  3.9  -2.2   64.4      260      12.7 
## 6 EWR     2013     1     1     6  3.30 -2.2   67.2      240      11.5 
## # ... with 5 more variables: wind_gust <dbl>, precip <dbl>,
## #   pressure <dbl>, visib <dbl>, time_hour <dttm>

# convert variables  origin and month to fator. 

weather_dc$origin <- as.factor(weather_dc$origin)
weather_dc$month <- as.factor(weather_dc$month)

Construct a graph displaying the temperature at JFK.

 jfk_temp <- weather_dc%>%

  group_by(year, month, day)%>%
  filter(origin =="JFK")%>%
  ggplot(aes(month, temp))+
  geom_boxplot() +
  labs(x = "Months",
       y = "Temperature in degree C",
       title = "Graph temperature at JFK") +
  scale_x_discrete(labels = v_months)+
  
# added after submission
  #  geom_boxplot(alpha=0.4) +
    stat_summary(fun.y=mean, geom="point", shape=4,
                 size=2, color="red", fill="red") +
    theme(legend.position="none") 
# end added

 jfk_temp

 # ggplotly(jfk_temp)

Add a red line showing the mean temperature for each day.

l_mean_temp <- weather_dc%>% # with temp in degree celcius
 mutate(mydate = make_date(year, month, day))%>%
  group_by(year,month,day, mydate)%>%
 
 # filter(origin =="JFK")%>%
  summarise(m_temp = mean(temp, na.rm = TRUE))%>%

  ggplot(aes(mydate, m_temp))+
  geom_point()+
 geom_smooth(se = FALSE, col = "red", method = 'loess') +
  labs(x = "days",
       y = "Temperature",
       title = "Mean temperature for each day")
  l_mean_temp

 # ggplotly(l_mean_temp)

# Solution after submission

mean_temp_orign <- weather_dc%>%
  group_by(year,month, day)%>%
  summarise(mean_temp = mean(temp, na.rm = TRUE))%>%
  ggplot(aes(day,mean_temp))+
  geom_point()+
  #geom_hline()+
   geom_smooth(se = FALSE, col = "red", method = 'loess') +
  
  # added after submission
  
  
  labs(x = "Origin airport ",
       y = "Mean temperature",
       title = "Mean temperature ")
  
 mean_temp_orign

# ggplotly(mean_temp_orign)

Now, visualize the daily mean temperature for each origin airport.

mean_temp_orign <- weather_dc%>%
  group_by(origin,year,month, day)%>%
  summarise(mean_temp = mean(temp, na.rm = TRUE))%>%
  ggplot(aes(origin,mean_temp))+
  geom_boxplot(aes(fill = origin))+
  
  # added after submission
  
  #  geom_boxplot(alpha=0.4) +
    stat_summary(fun.y=mean, geom="point", shape=4, size=2, color="red", fill="red") +
  #  theme(legend.position="none") +
# end added
  
  
  
  #geom_point()+
  
  labs(x = "Origins ",
       y = "Mean temperature",
       title = "Mean temperature ")
  
 mean_temp_orign

# ggplotly(mean_temp_orign)

2.5 Exercise

Investigate if arrival delay is associated with the flight distance (and also departure delay).

In order to investigate if there is any statistical dependence or association between some variables, one can calculate the correlation between these variables. To do so, we will first investigate if these variables are numeric and if there is no missing values.

 select_cols <- flights%>%
  select(arr_delay, dep_delay, distance)  # Select relevant columns from flights

#Take complete cases. 

 flights_complet_cases <- select_cols[complete.cases(select_cols),]

Calculate the correlation between variables

# Correlation test between different variables or columns.

cor_test1 <- cor(flights_complet_cases)

cor_test_tb <- as.data.frame(cor_test1)

 cor_test_tb

##             arr_delay  dep_delay    distance
## arr_delay  1.00000000  0.9148028 -0.06186776
## dep_delay  0.91480276  1.0000000 -0.02168090
## distance  -0.06186776 -0.0216809  1.00000000

Format the correlation data

   kable(cor_test_tb, "html") %>% # format correlation table.
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "bordered", "responsive"))

	arr_delay	dep_delay	distance
arr_delay	1.0000000	0.9148028	-0.0618678
dep_delay	0.9148028	1.0000000	-0.0216809
distance	-0.0618678	-0.0216809	1.0000000

We can also plot the correlation data to visualize the relation between variables.

corrplot(cor_test1, method="color", type = "upper",
         tl.srt = 45,addCoef.col = "black")

As we can see from the different graphs and correlation parameters , there is a strong correlation between arrival delay and departure delay. In the contrast, there is a negative correlation between the distance and arrival delay. In the next step, I will build a linear regression model using these three variables.

Linear model arrival delay and departure delay

# Model arrival delay and departure delay

 del_model1 <- lm(arr_delay ~ dep_delay, data = flights_complet_cases)
coef(del_model1)

## (Intercept)   dep_delay 
##   -5.899493    1.019093

summary(del_model1)

## 
## Call:
## lm(formula = arr_delay ~ dep_delay, data = flights_complet_cases)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -107.587  -11.005   -1.883    8.938  201.938 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.8994935  0.0330195  -178.7   <2e-16 ***
## dep_delay    1.0190929  0.0007864  1295.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.03 on 327344 degrees of freedom
## Multiple R-squared:  0.8369, Adjusted R-squared:  0.8369 
## F-statistic: 1.679e+06 on 1 and 327344 DF,  p-value: < 2.2e-16

from this result, one can conclude that departure delay is an important predictor of arrival delay. The changes of departure delay can explain the changes in arrival delay.

pred_delay <- flights_complet_cases%>%
  add_predictions(del_model1)%>%
  add_residuals(del_model1)
head(pred_delay, 10)

## # A tibble: 10 x 5
##    arr_delay dep_delay distance   pred   resid
##        <dbl>     <dbl>    <dbl>  <dbl>   <dbl>
##  1        11         2     1400  -3.86  14.9  
##  2        20         4     1416  -1.82  21.8  
##  3        33         2     1089  -3.86  36.9  
##  4       -18        -1     1576  -6.92 -11.1  
##  5       -25        -6      762 -12.0  -13.0  
##  6        12        -4      719  -9.98  22.0  
##  7        19        -5     1065 -11.0   30.0  
##  8       -14        -3      229  -8.96  -5.04 
##  9        -8        -3      944  -8.96   0.957
## 10         8        -2      733  -7.94  15.9

  # After submission of report
  pred_plot <- pred_delay%>%
    ggplot(aes(dep_delay, pred))+
    geom_line()+
    geom_point(aes(dep_delay, arr_delay),
               col = "red")  

pred_plot

# Graph arr_delay vs dep_dealy using intercept and slope from the model.

flights_complet_cases%>%
  ggplot(aes(arr_delay, dep_delay))+
  geom_point()+
 geom_abline (intercept = -5.89, slope = 1, col= "red") +
  
  labs(x = "arrivel delay ",
       y = "departure delay",
       title = "arrival vs departure delay ")

Model arrival delay, departure delay and distance

# Model arrival delay, departure delay and distance
del_model2 <- lm(arr_delay ~ dep_delay+ distance, data = flights_complet_cases)
 coef(del_model2)

##  (Intercept)    dep_delay     distance 
## -3.212779441  1.018077208 -0.002550586

summary(del_model2)

## 
## Call:
## lm(formula = arr_delay ~ dep_delay + distance, data = flights_complet_cases)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -107.944  -11.069   -2.019    8.728  205.562 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.213e+00  5.560e-02  -57.78   <2e-16 ***
## dep_delay    1.018e+00  7.823e-04 1301.32   <2e-16 ***
## distance    -2.551e-03  4.259e-05  -59.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.93 on 327343 degrees of freedom
## Multiple R-squared:  0.8386, Adjusted R-squared:  0.8386 
## F-statistic: 8.506e+05 on 2 and 327343 DF,  p-value: < 2.2e-16

From this result, one can conclude that the changes of arrival delay can not be explained by the distance.

2.6 Exercise

Investigate if departure delay is associated with weather conditions at the origin airport. This includes descriptives (mean departure delay), plotting, regression modelling, considering missing values etc.

# We selct the relevant columns from  both flights and weather datasets
# then join  the two subsets.

flight_delay <- flights%>%
  select(year:day, origin, hour, arr_delay, dep_delay, tailnum)
#flight_delay

Join subset flights and planes datasets

fly_w_cd <-  inner_join(flight_delay, weather_dc,
                        by = c("origin" = "origin",
                                 "year" = "year",
                                "month" = "month",
                                  "day" = "day",
                                 "hour" = "hour"))

#c("year", "day","month", "hour", "origin")

# Here  I consider only the completecases from dataset.

 fly_weather_cd <- fly_w_cd[complete.cases(fly_w_cd),]
 
head(fly_weather_cd)

## # A tibble: 6 x 18
##    year month   day origin  hour arr_delay dep_delay tailnum  temp  dewp
##   <dbl> <fct> <int> <fct>  <dbl>     <dbl>     <dbl> <chr>   <dbl> <dbl>
## 1  2013 1         1 LGA        5        20         4 N24211    4.4 -3.90
## 2  2013 1         1 LGA        6       -25        -6 N668DN    4.4 -3.90
## 3  2013 1         1 LGA        6       -14        -3 N829AS    4.4 -3.90
## 4  2013 1         1 LGA        6         8        -2 N3ALAA    4.4 -3.90
## 5  2013 1         1 LGA        6        31        -1 N3DUAA    4.4 -3.90
## 6  2013 1         1 LGA        6        -7         0 N595JB    4.4 -3.90
## # ... with 8 more variables: humid <dbl>, wind_dir <dbl>,
## #   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>

Now we can compute the correlation between numeric relevant variables.

num_cols <- sapply(fly_weather_cd, is.numeric)

fly_weather_cd  <- fly_weather_cd[, num_cols] # numeric columns

delay_weather_cd <- fly_weather_cd%>%
  select(dep_delay, humid,precip, visib,
         wind_dir, wind_speed, temp, dewp)

cor_delay_wc <- cor(delay_weather_cd)

cor_delay_wc

##              dep_delay        humid       precip       visib    wind_dir
## dep_delay   1.00000000  0.122166383  0.048397908 -0.09745755 -0.06842713
## humid       0.12216638  1.000000000  0.262944487 -0.45001107 -0.41160597
## precip      0.04839791  0.262944487  1.000000000 -0.47729425 -0.15145866
## visib      -0.09745755 -0.450011070 -0.477294245  1.00000000  0.20250473
## wind_dir   -0.06842713 -0.411605970 -0.151458658  0.20250473  1.00000000
## wind_speed  0.02100191 -0.009770943  0.003165484 -0.06396121  0.11603287
## temp        0.06124366  0.077493589 -0.016366496  0.04275419 -0.21351987
## dewp        0.10067971  0.484751473  0.073763317 -0.11416287 -0.34985317
##              wind_speed        temp        dewp
## dep_delay   0.021001914  0.06124366  0.10067971
## humid      -0.009770943  0.07749359  0.48475147
## precip      0.003165484 -0.01636650  0.07376332
## visib      -0.063961212  0.04275419 -0.11416287
## wind_dir    0.116032873 -0.21351987 -0.34985317
## wind_speed  1.000000000 -0.31494220 -0.28646035
## temp       -0.314942198  1.00000000  0.90528731
## dewp       -0.286460353  0.90528731  1.00000000

We can Plot the correlation data to see the relations between variables

corrplot(cor_delay_wc, method="color", type = "upper",
         tl.srt = 45,addCoef.col = "black")

#corrplot(cor_delay_wc, method = 'number')

According to correlation plots, there is no strong correlation between departure delay and any of the weather conditions variables and there is a tight competition between positive and negative correlations. To get more information we will visualize the relationship between delay and some of the variable such as temp and `precip.

fly_w_cd%>%
  group_by(precip)%>%
  summarise (dp_delay = mean(dep_delay, na.rm = TRUE))%>%
  ggplot(aes(precip, dp_delay))+
  geom_line()+
 
  geom_point()

As we can see from the graph, the precip doesn’t provide more information explanation about departure delay and the correlation between these two variables is weak.

fly_w_cd%>%
  group_by(temp)%>%na.omit()%>%
  summarise (dp_delay = mean(dep_delay, na.rm = TRUE))%>%
  ggplot(aes(temp, dp_delay))+
 # geom_line(col = "red")+
 
  geom_point()+
 geom_smooth(  se = FALSE, method = 'loess')

Surprisingly, can that the departure delay increase with temperature but this variation is not significant to provide more explanation about departure delay changes. In the next step I will build the linear regression model.

Modelling weather conditions delay model

set.seed(101)
#delay_weather_cd%>%
wc_delay_sample <- sample.split(delay_weather_cd$dep_delay, 
                                SplitRatio = 0.7)

# train set
train <- subset(delay_weather_cd, wc_delay_sample == TRUE)
head(train)

## # A tibble: 6 x 8
##   dep_delay humid precip visib wind_dir wind_speed  temp  dewp
##       <dbl> <dbl>  <dbl> <dbl>    <dbl>      <dbl> <dbl> <dbl>
## 1         4  54.8      0    10      250       15.0   4.4 -3.90
## 2        -1  54.8      0    10      260       16.1   4.4 -3.90
## 3        -3  54.8      0    10      260       16.1   4.4 -3.90
## 4        13  54.8      0    10      260       16.1   4.4 -3.90
## 5        -4  54.8      0    10      260       16.1   4.4 -3.90
## 6        -6  54.8      0    10      260       16.1   4.4 -3.90

#Test set
test <- subset(delay_weather_cd, wc_delay_sample == FALSE)
head(test)

## # A tibble: 6 x 8
##   dep_delay humid precip visib wind_dir wind_speed  temp  dewp
##       <dbl> <dbl>  <dbl> <dbl>    <dbl>      <dbl> <dbl> <dbl>
## 1        -6  54.8      0    10      260       16.1   4.4 -3.90
## 2        -3  54.8      0    10      260       16.1   4.4 -3.90
## 3        -2  54.8      0    10      260       16.1   4.4 -3.90
## 4         0  54.8      0    10      260       16.1   4.4 -3.90
## 5         0  54.8      0    10      260       16.1   4.4 -3.90
## 6        -8  54.8      0    10      260       16.1   4.4 -3.90

Build linear model

model <- lm(dep_delay~ ., data = train)
summary(model)

## 
## Call:
## lm(formula = dep_delay ~ ., data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54.93 -16.97 -11.83  -0.15 754.66 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.352674   6.327225  -0.846 0.397571    
## humid        0.427979   0.062583   6.839 8.09e-12 ***
## precip       3.028329  14.635779   0.207 0.836079    
## visib       -1.605281   0.211482  -7.591 3.24e-14 ***
## wind_dir    -0.006602   0.002347  -2.813 0.004914 ** 
## wind_speed   0.287092   0.036878   7.785 7.11e-15 ***
## temp         0.882015   0.182450   4.834 1.34e-06 ***
## dewp        -0.717097   0.200783  -3.572 0.000355 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.25 on 50911 degrees of freedom
## Multiple R-squared:  0.02183,    Adjusted R-squared:  0.0217 
## F-statistic: 162.3 on 7 and 50911 DF,  p-value: < 2.2e-16

# Residuals

residus <- residuals(model)
residus <- as.data.frame(residus)
#head(residus)

residus%>%
ggplot(aes(residus)) +  geom_histogram(fill='blue',alpha=0.5, bins = 20)

# After submission of report

delay_weather_cd%>%
ggplot(aes(dep_delay)) +  geom_histogram(fill='blue',alpha=0.5, bins = 20)

Plotting model

Prediction computation

delay_predict <- predict(model, test)
predict_res  <- cbind(delay_predict, test$dep_delay)

colnames(predict_res) <- c("predict", "actual")

predict_res<- as.data.frame(predict_res)

 head(predict_res)

##    predict actual
## 1 11.63846     -6
## 2 11.63846     -3
## 3 11.63846     -2
## 4 11.63846      0
## 5 11.63846      0
## 6 11.63846     -8

 head(delay_predict)

##        1        2        3        4        5        6 
## 11.63846 11.63846 11.63846 11.63846 11.63846 11.63846

2.7 Exercise

Is the age of the plane associated to delay?

Here, we start by computing the age of each and store it in new column names age.

# This function compute the age the plane as integer.
cur_year <- as.integer(format(Sys.Date(), "%Y"))

year_diff <- function(x_year, y_year = cur_year)
{
   x_year <- as.integer(x_year)
  
    
 return(y_year - x_year)
 
}

pl_ages <- planes%>%
 mutate(age = year_diff(year))%>%
  select(tailnum, age, seats)
pl_ages <- pl_ages[complete.cases(pl_ages),]


flight_delay <- flights%>%
  select(year:day, origin, hour, arr_delay, dep_delay, tailnum)

#  join planes and flights datasets

join_fly_planes <- flight_delay%>%
  inner_join(pl_ages, by = "tailnum")
  #join_fly_planes
  
 fly_planes <-join_fly_planes%>%
   select(age,dep_delay,arr_delay)
 fly_planes <-fly_planes[complete.cases(fly_planes),]
 fly_planes

## # A tibble: 273,853 x 3
##      age dep_delay arr_delay
##    <int>     <dbl>     <dbl>
##  1    19         2        11
##  2    20         4        20
##  3    28         2        33
##  4     6        -1       -18
##  5    27        -6       -25
##  6     6        -4        12
##  7    18        -5        19
##  8    20        -3       -14
##  9    14        -3        -8
## 10     7        -2        -2
## # ... with 273,843 more rows

Now I will perform some plotting to visualize the graphic (age vs delay)

age_delay <- fly_planes%>%
  group_by(age)%>%
  filter(!is.na(dep_delay)) %>%
  summarise(delay = mean(dep_delay)) %>%
  ggplot(aes(x = age, y = delay)) +
 # geom_point() +
  geom_line() +
  labs(x = "Age of planes",
       y = "Mean of departure delay",
       title = "Graph dep_delay vs age")
 age_delay

#ggplotly(age_delay)

try correlation between age of planes and arrival delay.

cor_delay_age  <-cor(fly_planes)

cor_delay_age

##                   age   dep_delay   arr_delay
## age        1.00000000 -0.01513102 -0.01767153
## dep_delay -0.01513102  1.00000000  0.91674159
## arr_delay -0.01767153  0.91674159  1.00000000

According to the correlation table, There is a weak negative correlation between age and both arrival delay and departure delay. Meaning, the age of the planes can not explained the changes of delay.

Format the correlation data

	age	dep_delay	arr_delay
age	1.0000000	-0.0151310	-0.0176715
dep_delay	-0.0151310	1.0000000	0.9167416
arr_delay	-0.0176715	0.9167416	1.0000000

2.8 Exercise

It seems like the plane manufacturer could use a cleaning. After that, how many manufacturers have more than 200 planes? And how many flights are each manufacturer with more than 200 planes responsible for?

# Find list of manufacturers that have more than 200 planes.
manufact_200_planes <- planes%>%
  filter(!is.na(manufacturer)) %>%
group_by( manufacturer)%>%
  summarise(n_planes =n())%>%
  arrange(desc(n_planes))%>%
  filter(n_planes > 200)
#print list 
  manufact_200_planes%>%
    pander(big.mark = ',', justify = c('left', 'right'))

manufacturer	n_planes
BOEING	1,630
AIRBUS INDUSTRIE	400
BOMBARDIER INC	368
AIRBUS	336
EMBRAER	299

 # Combine the subset manufact_200_planes with flights dataset to find number of flights.
  
num_planes200 <- planes%>%
  inner_join(flights,"tailnum")%>% 
  filter(manufacturer %in%
           manufact_200_planes$manufacturer)%>% 
  
  group_by( manufacturer)%>%
  summarise(n_flight =n())%>%
  arrange(desc(n_flight))

  # print list
  num_planes200%>%
     pander(big.mark = ',', justify = c('left', 'right'))

manufacturer	n_flight
BOEING	82,912
EMBRAER	66,068
AIRBUS	47,302
AIRBUS INDUSTRIE	40,891
BOMBARDIER INC	28,272

 # format number of flights
  
 kable(num_planes200, "html") %>% # format correlation table.
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"))

manufacturer	n_flight
BOEING	82912
EMBRAER	66068
AIRBUS	47302
AIRBUS INDUSTRIE	40891
BOMBARDIER INC	28272

2.9 Exercise

It turns out that Airbus has several main models, e.g. several A320 (A320-211, A320-212 etc.) and so on. Create a frequency table of the main models for Airbus and how many planes there are in each.

# Find number of model for "AIRBUS"

airbus_models <- planes%>% 
  group_by(model)%>%
  filter(manufacturer =="AIRBUS")%>%
  summarise(n_of_planes = n())%>%
  arrange(desc(n_of_planes))

# Split column model to find main models
#airbus_models%>%
  
  split_m <- strsplit( airbus_models$model, "-")

main_mdl <- vector("character", length = length(split_m))

for (m in 1: length(split_m)) {
  
  main_mdl[m] <- split_m[[m]][1]
  
}
# List of main models

m_models <- airbus_models%>%
  mutate(main_model =  main_mdl)

 m_models %>%
  pander(big.mark = ',', justify = c('left','center', 'right'))

model	n_of_planes	main_model
A320-232	129	A320
A320-214	61	A320
A321-231	49	A321
A319-114	31	A319
A321-211	18	A321
A319-112	14	A319
A330-243	14	A330
A319-131	5	A319
A319-132	5	A319
A319-115	3	A319
A320-211	2	A320
A320-212	2	A320
A330-223	2	A330
A330-323	1	A330

 # format list
  
 kable(m_models, "html") %>% # format correlation table.
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"))

model	n_of_planes	main_model
A320-232	129	A320
A320-214	61	A320
A321-231	49	A321
A319-114	31	A319
A321-211	18	A321
A319-112	14	A319
A330-243	14	A330
A319-131	5	A319
A319-132	5	A319
A319-115	3	A319
A320-211	2	A320
A320-212	2	A320
A330-223	2	A330
A330-323	1	A330

# A Frequency table of main models

m_models%>%
  group_by(main_model)%>%
  summarise( num_model = n(), num_of_planes = sum(n_of_planes))%>%
pander(big.mark = ',', justify = c('left','center', 'right'))

main_model	num_model	num_of_planes
A319	5	58
A320	4	194
A321	2	67
A330	3	17

2.10 Exercise

Are larger planes (measured by number of seats) more or less delayed than smaller planes? To answer this question one can visualize the plot (seats vs arr_delay)

delay_seats <- join_fly_planes%>%
  group_by(seats)%>%
  summarise(mean_delay = mean(arr_delay, na.rm = TRUE))%>%
  arrange(desc(seats))

delay_seats%>%
pander(big.mark = ',', justify = c('left', 'right'))

seats	mean_delay
450	120
400	14.33
379	3.29
377	-6.635
330	5.281
300	24.2
292	6.216
290	7.736
275	11.95
269	-4.75
260	-4.5
255	-1.106
222	-8.819
200	7.599
199	2.683
191	0.8034
189	-1.337
182	3.222
179	4.298
178	2.286
172	-0.8459
149	3.388
147	33.47
145	0.3996
142	5.905
140	9.651
139	2.341
128	-3.053
102	3.5
100	17.79
95	8.039
80	8.247
55	15.81
22	7.173
20	8.056
16	-4.295
14	-1.346
12	-10.25
11	22.43
10	5.667
9	4.045
8	7.822
7	-3.034
6	10.67
5	7.137
4	6.365
2	5.497

kable(delay_seats, "html") %>%
  kable_styling() %>%
  scroll_box(width = "500px", height = "600px")

seats	mean_delay
450	120.0000000
400	14.3333333
379	3.2904795
377	-6.6354839
330	5.2810127
300	24.2000000
292	6.2159091
290	7.7356322
275	11.9455645
269	-4.7500000
260	-4.5000000
255	-1.1056853
222	-8.8192771
200	7.5991997
199	2.6828016
191	0.8033501
189	-1.3370618
182	3.2215930
179	4.2978482
178	2.2861217
172	-0.8459043
149	3.3884528
147	33.4731183
145	0.3995519
142	5.9049074
140	9.6505463
139	2.3406593
128	-3.0526316
102	3.5000000
100	17.7885978
95	8.0389666
80	8.2470027
55	15.8059566
22	7.1729958
20	8.0563663
16	-4.2950820
14	-1.3461538
12	-10.2500000
11	22.4347826
10	5.6666667
9	4.0454545
8	7.8216561
7	-3.0338983
6	10.6735395
5	7.1370717
4	6.3654917
2	5.4972527

According to this graph, there is no sign that the larger planes are more delayed than the smaller ones.

One can also try to compute the correlation between age, arr_delay and dep_delay.

	seats	arr_delay	dep_delay
seats	1.0000000	-0.0726318	-0.0541665
arr_delay	-0.0726318	1.0000000	0.9167416
dep_delay	-0.0541665	0.9167416	1.0000000

There is a weak negative correlation between a number of seats and delay meaning that the number of seats (smaller/ larger) can not explain the arrival delay.

## 
## Call:
## lm(formula = seat_delay$arr_delay ~ seat_delay$seats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -91.03  -24.21  -11.82    7.29 1275.86 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      13.333210   0.186109   71.64   <2e-16 ***
## seat_delay$seats -0.045604   0.001197  -38.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.81 on 273851 degrees of freedom
## Multiple R-squared:  0.005275,   Adjusted R-squared:  0.005272 
## F-statistic:  1452 on 1 and 273851 DF,  p-value: < 2.2e-16

##      (Intercept) seat_delay$seats 
##      13.33320972      -0.04560441

2.11 Exercise

On a map (map_data("usa")), plot the airports that has flights to them.

usa <- map_data("usa")


#airports$faa <- as.factor(airports$faa)

fli_dest <- flights%>%
  group_by(dest)%>%
  summarise(n = n())%>%
  inner_join(airports, by =c(dest = "faa"))

## Warning: Column `dest`/`faa` joining factor and character vector, coercing
## into character vector

fli_dest

## # A tibble: 101 x 9
##    dest      n name                lat    lon   alt    tz dst   tzone     
##    <chr> <int> <chr>             <dbl>  <dbl> <int> <dbl> <chr> <chr>     
##  1 ABQ     254 Albuquerque Inte~  35.0 -107.   5355    -7 A     America/D~
##  2 ACK     265 Nantucket Mem      41.3  -70.1    48    -5 A     America/N~
##  3 ALB     439 Albany Intl        42.7  -73.8   285    -5 A     America/N~
##  4 ANC       8 Ted Stevens Anch~  61.2 -150.    152    -9 A     America/A~
##  5 ATL   17215 Hartsfield Jacks~  33.6  -84.4  1026    -5 A     America/N~
##  6 AUS    2439 Austin Bergstrom~  30.2  -97.7   542    -6 A     America/C~
##  7 AVL     275 Asheville Region~  35.4  -82.5  2165    -5 A     America/N~
##  8 BDL     443 Bradley Intl       41.9  -72.7   173    -5 A     America/N~
##  9 BGR     375 Bangor Intl        44.8  -68.8   192    -5 A     America/N~
## 10 BHM     297 Birmingham Intl    33.6  -86.8   644    -6 A     America/C~
## # ... with 91 more rows

Make a similar plot, but now points must have size relative to the number of flights each airport is destination for.

map_data <- ggplot() + 
  geom_polygon(data = usa, aes(long, lat, group = group), fill = "grey") +

geom_point(data =fli_dest, aes(x = lon, y = lat, size = n , color = n)) + 
  borders("state")+
  scale_color_viridis()+
  
  labs(x = "longitude",
       y = "latitude",
       title = "map flights for destinations")+
coord_quickmap()
map_data

#ggplotly(map_data)

2.12 Exercise

Do a principal component analysis of the weather at JFK using the following columns: temp, dewp, humid, wind_dir, wind_speed, precip, visib (only on complete.cases()). How many principal components should be used to capture the variability in the weather data?

weather_dc <-  mutate_at(weather, vars(temp, dewp),funs(fahrenheit_to_celcius))

weather_jfk_df <- weather_dc%>%
  group_by(origin)%>%
filter(origin == "JFK") %>%
   select(origin,temp, dewp, humid, wind_dir, wind_speed, precip, visib)

head(weather_jfk_df)

## # A tibble: 6 x 8
## # Groups:   origin [1]
##   origin  temp  dewp humid wind_dir wind_speed precip visib
##   <chr>  <dbl> <dbl> <dbl>    <dbl>      <dbl>  <dbl> <dbl>
## 1 JFK     3.9  -3.3   59.4      260       12.7      0    10
## 2 JFK     3.9  -3.3   59.4      270       11.5      0    10
## 3 JFK     4.4  -2.80  59.5      260       15.0      0    10
## 4 JFK     4.4  -2.2   62.2      250       17.3      0    10
## 5 JFK     3.9  -2.80  61.6      260       15.0      0    10
## 6 JFK     3.30 -2.80  64.3      260       13.8      0    10

# Here we consider the completecases

weather_jfk<- weather_jfk_df[complete.cases(weather_jfk_df),]
 
weather_jfk <-  weather_jfk[, sapply(weather_jfk, is.numeric)]

summary and correlation plot

Principal component analysis PCA

## # A tibble: 6 x 7
##    temp  dewp humid wind_dir wind_speed precip visib
##   <dbl> <dbl> <dbl>    <dbl>      <dbl>  <dbl> <dbl>
## 1  3.9  -3.3   59.4      260       12.7      0    10
## 2  3.9  -3.3   59.4      270       11.5      0    10
## 3  4.4  -2.80  59.5      260       15.0      0    10
## 4  4.4  -2.2   62.2      250       17.3      0    10
## 5  3.9  -2.80  61.6      260       15.0      0    10
## 6  3.30 -2.80  64.3      260       13.8      0    10

##       temp             dewp             humid           wind_dir    
##  Min.   :-11.10   Min.   :-23.300   Min.   : 15.21   Min.   :  0.0  
##  1st Qu.:  4.40   1st Qu.: -2.800   1st Qu.: 49.19   1st Qu.:140.0  
##  Median : 12.20   Median :  6.100   Median : 65.90   Median :220.0  
##  Mean   : 12.45   Mean   :  5.462   Mean   : 65.26   Mean   :204.2  
##  3rd Qu.: 20.60   3rd Qu.: 14.400   3rd Qu.: 82.81   3rd Qu.:290.0  
##  Max.   : 36.70   Max.   : 25.600   Max.   :100.00   Max.   :360.0  
##    wind_speed         precip             visib       
##  Min.   : 0.000   Min.   :0.000000   Min.   : 0.000  
##  1st Qu.: 6.905   1st Qu.:0.000000   1st Qu.:10.000  
##  Median :11.508   Median :0.000000   Median :10.000  
##  Mean   :11.503   Mean   :0.003948   Mean   : 9.175  
##  3rd Qu.:14.960   3rd Qu.:0.000000   3rd Qu.:10.000  
##  Max.   :42.579   Max.   :0.660000   Max.   :10.000

## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.6100 1.2085 1.0722 0.8732 0.78761 0.64222 0.05141
## Proportion of Variance 0.3703 0.2086 0.1642 0.1089 0.08862 0.05892 0.00038
## Cumulative Proportion  0.3703 0.5789 0.7431 0.8521 0.94070 0.99962 1.00000

## Standard deviations (1, .., p=7):
## [1] 1.60996245 1.20847381 1.07220661 0.87324870 0.78761090 0.64222082
## [7] 0.05141167
## 
## Rotation (n x k) = (7 x 7):
##                   PC1         PC2         PC3         PC4          PC5
## temp        0.4258354 -0.45733088  0.39167074 -0.10637799 -0.061768190
## dewp        0.5565280 -0.26675384  0.27395954  0.05549235  0.004384504
## humid       0.4795867  0.28491715 -0.08595086  0.33572394  0.111121220
## wind_dir   -0.3322328 -0.07787344  0.50057333  0.41389629  0.679300151
## wind_speed -0.2772188  0.17817362  0.64084659  0.06113809 -0.630844077
## precip      0.1471361  0.50302298  0.31168170 -0.71717007  0.336338691
## visib      -0.2646875 -0.58963620 -0.07582321 -0.42838399  0.106211496
##                    PC6           PC7
## temp        0.29644440  0.5942186847
## dewp       -0.08850770 -0.7301595113
## humid      -0.66588955  0.3361009569
## wind_dir   -0.01498101  0.0003908938
## wind_speed -0.28109860 -0.0010084427
## precip     -0.02647824 -0.0039567227
## visib      -0.61720034  0.0281169743

Accordingly, to the different plots and summary table, one can see that more than 94% of weather variability can be captured by 5 PCs. So we choose the 5 first components as principal ones.

Demo Job intterviw at FE. Data Scientist

Youssouf Souare

August 28, 2017