Exploratory Data Analysis of US International Air Traffic Dataset

The main objective of this project is to analyse US international air traffic data and extract insights that give the authorities an overall picture of the traffic: statistical measures such as the average number of passengers, the type of distribution the data follows, probabilities, and validation of assumptions through hypothesis testing. The linear regression model developed here can be used to predict the number of passengers in future years.

Obtaining the Passengers dataset and displaying descriptive statistics

library(fitdistrplus)

## Warning: package 'fitdistrplus' was built under R version 4.1.2

## Loading required package: MASS

## Loading required package: survival

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.0.2     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x dplyr::select() masks MASS::select()

library(e1071) 
library(ggplot2)
library(dplyr)
library(plyr)

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following object is masked from 'package:purrr':
## 
##     compact

library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

library(leaflet)

## Warning: package 'leaflet' was built under R version 4.1.3

library(airportr)

## Warning: package 'airportr' was built under R version 4.1.3

library(rworldmap)

## Warning: package 'rworldmap' was built under R version 4.1.3

## Loading required package: sp

## ### Welcome to rworldmap ###

## For a short introduction type :   vignette('rworldmap')

library(sf)

## Warning: package 'sf' was built under R version 4.1.3

## Linking to GEOS 3.9.1, GDAL 3.2.1, PROJ 7.2.1; sf_use_s2() is TRUE

library(geosphere)

## Warning: package 'geosphere' was built under R version 4.1.3

library(shiny)

## Warning: package 'shiny' was built under R version 4.1.2

## 
## Attaching package: 'shiny'

## The following object is masked from 'package:geosphere':
## 
##     span

library(mapproj)

## Warning: package 'mapproj' was built under R version 4.1.2

## Loading required package: maps

## 
## Attaching package: 'maps'

## The following object is masked from 'package:plyr':
## 
##     ozone

## The following object is masked from 'package:purrr':
## 
##     map

library(EnvStats)

## Warning: package 'EnvStats' was built under R version 4.1.2

## 
## Attaching package: 'EnvStats'

## The following objects are masked from 'package:e1071':
## 
##     kurtosis, skewness

## The following object is masked from 'package:MASS':
## 
##     boxcox

## The following objects are masked from 'package:stats':
## 
##     predict, predict.lm

## The following object is masked from 'package:base':
## 
##     print.default

library(correlation)

## Warning: package 'correlation' was built under R version 4.1.2

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.1.2

## corrplot 0.92 loaded

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(forcats)
library(date)
library(flightplot)

## Warning: package 'flightplot' was built under R version 4.1.2

## 
## Attaching package: 'flightplot'

## The following object is masked from 'package:airportr':
## 
##     airports

library(plotly)

## Warning: package 'plotly' was built under R version 4.1.2

## 
## Attaching package: 'plotly'

## The following objects are masked from 'package:plyr':
## 
##     arrange, mutate, rename, summarise

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:MASS':
## 
##     select

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(knitr)
setwd('C:/Sam/Sem1/Engineering Probability and Statistics/Project/archive (1)')

#-------------------------------------------------------------------------------------------------------------------
# Obtaining the Passengers dataset 
Passengers_df <- read.table('International_Report_Passengers1.csv',  # the dataset to be imported
                        header = TRUE,         # the first row of the file holds the column names
                        sep = ',')
head(Passengers_df)

##    data_dte Year Month usg_apt_id usg_apt usg_wac fg_apt_id fg_apt fg_wac
## 1  5/1/2014 2014     5      14492     RDU      36     11032    CUN    148
## 2  6/1/2007 2007     6      13204     MCO      33     16085    YHZ    951
## 3 12/1/2005 2005    12      11433     DTW      43     10411    AUA    277
## 4  4/1/2003 2003     4      13487     MSP      63     16304    ZIH    148
## 5 12/1/2005 2005    12      12016     GUM       5     11138    CRK    766
## 6  3/1/2007 2007     3      14843     SJU       3     15084    SXM    259
##   airlineid carrier carriergroup       type Scheduled Charter Total
## 1     19534      AM            0 Passengers         0     315   315
## 2     20364      C6            0 Passengers         0     683   683
## 3     20344      RD            1 Passengers         0    1010  1010
## 4     20204      MG            1 Passengers         0     508   508
## 5     20312      TZ            1 Passengers         0      76    76
## 6     20421     SLQ            1 Passengers         0      35    35

summary(Passengers_df)

##    data_dte              Year          Month          usg_apt_id   
##  Length:488087      Min.   :2000   Min.   : 1.000   Min.   :10010  
##  Class :character   1st Qu.:2005   1st Qu.: 3.000   1st Qu.:11618  
##  Mode  :character   Median :2011   Median : 6.000   Median :12478  
##                     Mean   :2010   Mean   : 6.427   Mean   :12777  
##                     3rd Qu.:2015   3rd Qu.: 9.000   3rd Qu.:13891  
##                     Max.   :2019   Max.   :12.000   Max.   :99999  
##    usg_apt             usg_wac        fg_apt_id        fg_apt         
##  Length:488087      Min.   : 1.00   Min.   :10125   Length:488087     
##  Class :character   1st Qu.:22.00   1st Qu.:11868   Class :character  
##  Mode  :character   Median :34.00   Median :13518   Mode  :character  
##                     Mean   :45.14   Mean   :13550                     
##                     3rd Qu.:74.00   3rd Qu.:15246                     
##                     Max.   :93.00   Max.   :16864                     
##      fg_wac        airlineid       carrier           carriergroup  
##  Min.   :106.0   Min.   :19386   Length:488087      Min.   :0.000  
##  1st Qu.:204.0   1st Qu.:19704   Class :character   1st Qu.:0.000  
##  Median :429.0   Median :19977   Mode  :character   Median :1.000  
##  Mean   :472.1   Mean   :20098                      Mean   :0.582  
##  3rd Qu.:736.0   3rd Qu.:20366                      3rd Qu.:1.000  
##  Max.   :975.0   Max.   :22062                      Max.   :1.000  
##      type             Scheduled         Charter            Total       
##  Length:488087      Min.   :     0   Min.   :    0.0   Min.   :     1  
##  Class :character   1st Qu.:   215   1st Qu.:    0.0   1st Qu.:   651  
##  Mode  :character   Median :  4168   Median :    0.0   Median :  4320  
##                     Mean   :  6945   Mean   :  135.9   Mean   :  7081  
##                     3rd Qu.: 10356   3rd Qu.:    0.0   3rd Qu.: 10395  
##                     Max.   :134263   Max.   :58284.0   Max.   :134263

Obtaining the Flights dataset

setwd('C:/Sam/Sem1/Engineering Probability and Statistics/Project/archive (1)')

Flight_df <- read.table('International_Report_Departures1.csv',  # the dataset to be imported
                        header = TRUE,         # the first row of the file holds the column names
                        sep = ',')
head(Flight_df)

##   data_dte Year Month usg_apt_id usg_apt usg_wac fg_apt_id fg_apt fg_wac
## 1 1/1/2000 2000     1      11618     EWR      22     11032    CUN    148
## 2 1/1/2000 2000     1      14771     SFO      91     14739    SDQ    224
## 3 1/1/2000 2000     1      10299     ANC       1     14752    SEL    778
## 4 1/1/2000 2000     1      11193     CVG      44     13605    NAS    204
## 5 1/1/2000 2000     1      13204     MCO      33     14312    PVR    148
## 6 1/1/2000 2000     1      12016     GUM       5     15994    YAP    810
##   airlineid carrier carriergroup       type Scheduled Charter Total
## 1     20377      X9            1 Departures         0       6     6
## 2     20312      TZ            1 Departures         0       1     1
## 3     20190      9S            1 Departures         0       1     1
## 4     20016     LGQ            0 Departures         0      12    12
## 5     20312      TZ            1 Departures         0       1     1
## 6     20177     PFQ            1 Departures         0       4     4

summary(Flight_df)

##    data_dte              Year          Month        usg_apt_id   
##  Length:679644      Min.   :2000   Min.   : 1.0   Min.   :10010  
##  Class :character   1st Qu.:2005   1st Qu.: 3.0   1st Qu.:11618  
##  Mode  :character   Median :2010   Median : 6.0   Median :12892  
##                     Mean   :2010   Mean   : 6.4   Mean   :12813  
##                     3rd Qu.:2015   3rd Qu.: 9.0   3rd Qu.:13495  
##                     Max.   :2020   Max.   :12.0   Max.   :99999  
##    usg_apt             usg_wac        fg_apt_id        fg_apt         
##  Length:679644      Min.   : 1.00   Min.   :10119   Length:679644     
##  Class :character   1st Qu.:22.00   1st Qu.:11874   Class :character  
##  Mode  :character   Median :33.00   Median :13514   Mode  :character  
##                     Mean   :43.35   Mean   :13543                     
##                     3rd Qu.:74.00   3rd Qu.:15129                     
##                     Max.   :93.00   Max.   :16881                     
##      fg_wac        airlineid       carrier           carriergroup   
##  Min.   :106.0   Min.   :19386   Length:679644      Min.   :0.0000  
##  1st Qu.:205.0   1st Qu.:19790   Class :character   1st Qu.:0.0000  
##  Median :429.0   Median :20093   Mode  :character   Median :1.0000  
##  Mean   :473.4   Mean   :20119                      Mean   :0.6147  
##  3rd Qu.:736.0   3rd Qu.:20366                      3rd Qu.:1.0000  
##  Max.   :975.0   Max.   :22067                      Max.   :1.0000  
##      type             Scheduled          Charter             Total        
##  Length:679644      Min.   :   0.00   Min.   :   0.000   Min.   :   1.00  
##  Class :character   1st Qu.:   0.00   1st Qu.:   0.000   1st Qu.:   3.00  
##  Mode  :character   Median :  20.00   Median :   0.000   Median :  24.00  
##                     Mean   :  42.68   Mean   :   1.793   Mean   :  44.47  
##                     3rd Qu.:  61.00   3rd Qu.:   1.000   3rd Qu.:  61.00  
##                     Max.   :1461.00   Max.   :1092.000   Max.   :1461.00

Counting the unique values of key fields in the Passengers dataset

# Unique Airports
Unique_US_aiports <- unique(Passengers_df$usg_apt)
Number_unique_US_aiports <- length(Unique_US_aiports)

# Unique foreign airports that passengers travel to from the US
Unique_foreign_aiports <- unique(Passengers_df$fg_apt)
Number_unique_foreign_aiports <- length(Unique_foreign_aiports)

# Unique Carriers
Unique_carrier <- unique(Passengers_df$carrier)
Number_unique_carrier <- length(Unique_carrier)

# Unique Airline Numbers
Unique_airlineid <- unique(Passengers_df$airlineid)
Number_unique_airlineid <- length(Unique_airlineid)

tab1 <- matrix((c(Number_unique_US_aiports,Number_unique_foreign_aiports,Number_unique_carrier,Number_unique_airlineid)), ncol=1,nrow = 4, byrow=TRUE)
colnames(tab1) <- c('Number of Uniques')
rownames(tab1) <- c('US airports','Foreign Airports','Carriers','Airlineids')
tab1 <- as.table(tab1)
tab1

##                  Number of Uniques
## US airports                    783
## Foreign Airports              1177
## Carriers                       473
## Airlineids                     448
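
The four counts above can also be collected in a single pass over the columns. The following is a minimal sketch in base R; `toy_df` and every value in it are hypothetical stand-ins, not rows from the dataset.

```r
# Hypothetical stand-in for Passengers_df, using the same column names
toy_df <- data.frame(
  usg_apt   = c("JFK", "JFK", "LAX", "MIA"),
  fg_apt    = c("LHR", "CDG", "LHR", "MEX"),
  carrier   = c("BA", "AF", "BA", "BA"),
  airlineid = c(19540, 19532, 19540, 19540)
)

# Count the distinct values of every column in one call
unique_counts <- sapply(toy_df, function(column) length(unique(column)))
unique_counts
```

On the real data the same call could be restricted to the four columns of interest, for example `Passengers_df[c("usg_apt", "fg_apt", "carrier", "airlineid")]`.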

Data Visualization

# Total number of passengers across all airports, per year and per year-month
count_per_year <- Passengers_df %>% group_by(Year) %>% summarise_at(vars(Total),funs(sum(.)))

## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

count_per_year_per_month <- Passengers_df %>% group_by(Year,Month) %>% summarise_at(vars(Total),funs(sum(.)))
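
The `funs()` helper used above is deprecated, as the warning notes. A possible modern equivalent is sketched below on a small made-up data frame (`toy_df` and its values are assumptions for illustration). The explicit `dplyr::` prefix is a precaution, since `plyr` is loaded after `dplyr` in this report and also exports a `summarise()`.

```r
library(dplyr)

# Hypothetical stand-in for Passengers_df
toy_df <- data.frame(
  Year  = c(2000, 2000, 2000, 2001),
  Month = c(1, 1, 2, 1),
  Total = c(100, 200, 300, 400)
)

# Total passengers per year, without funs()
count_per_year_toy <- toy_df %>%
  group_by(Year) %>%
  dplyr::summarise(Total = sum(Total), .groups = "drop")

# Total passengers per year and month
count_per_year_per_month_toy <- toy_df %>%
  group_by(Year, Month) %>%
  dplyr::summarise(Total = sum(Total), .groups = "drop")

count_per_year_toy
count_per_year_per_month_toy
```

Setting `.groups = "drop"` returns an ungrouped result, so later steps do not silently operate per group.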

Plotting total number of passengers of all the airports per year

ggplot(count_per_year, aes(x = Year, y = Total, size = Total/100000)) +
  geom_point(shape=18, color="red")+
  labs(title="Number of passengers per year, 2000-2019") +
  theme(legend.title = element_blank())

Plotting total number of passengers of all the airports per year per month

# Build a real date for each year-month so that the points sort chronologically
# and the linear trend is fitted against time rather than against text labels
count_per_year_per_month$Year_month <- lubridate::make_date(count_per_year_per_month$Year, count_per_year_per_month$Month, 1)

ggplot(count_per_year_per_month, aes(x = Year_month, y = Total)) +
  geom_point(aes(size = Total/1000000), shape = 16, color="green")+
  geom_smooth(method=lm,  linetype="dashed",
              color="darkred", fill="blue") +
  labs(title="Number of passengers per month, 2000-2019") +
  theme(legend.title = element_blank())

## `geom_smooth()` using formula 'y ~ x'

Finding Top busiest airports in US

Passengers_at_airports_count <- Passengers_df %>% group_by(usg_apt) %>% summarise_at(vars(Total),funs(sum(.)))
Passengers_at_airports_count_desc <- Passengers_at_airports_count[with(Passengers_at_airports_count, order(-Total)),]
Passengers_at_airports_count_desc_top <- head(Passengers_at_airports_count_desc,5)

Plotting barplot of top 5 busiest airports in US

ggplot(Passengers_at_airports_count_desc_top, aes(y=Total, x=usg_apt, fill = usg_apt)) +
  geom_bar(position="dodge",stat="identity") +  #stat="identity"
  ggtitle("Top 5 busiest Airports in US") +
  scale_fill_brewer(palette = "Set2")

Violin plot and boxplot of monthly passengers for the top 5 busiest airports in US

Top_5_airports_df <- dplyr::filter(Passengers_df, (usg_apt == "JFK")|(usg_apt=="LAX")|(usg_apt=="MIA")|(usg_apt=="ORD")|(usg_apt=="EWR"))
Top_5_airports_df <- Top_5_airports_df %>% group_by(Year, Month,usg_apt) %>% summarise_at(vars(Total),funs(sum(.)))
Top_5_airports_df

## # A tibble: 1,200 x 4
## # Groups:   Year, Month [240]
##     Year Month usg_apt   Total
##    <int> <int> <chr>     <int>
##  1  2000     1 EWR      572638
##  2  2000     1 JFK     1248503
##  3  2000     1 LAX     1286851
##  4  2000     1 MIA     1466907
##  5  2000     1 ORD      697110
##  6  2000     2 EWR      563892
##  7  2000     2 JFK     1148275
##  8  2000     2 LAX     1203306
##  9  2000     2 MIA     1311275
## 10  2000     2 ORD      644291
## # ... with 1,190 more rows

ggplot(Top_5_airports_df, aes(x = usg_apt, y = Total, col = usg_apt )) +
  geom_violin(trim=FALSE) +
  geom_boxplot(outlier.colour="red", outlier.shape=16,
               outlier.size=2, notch=FALSE)

Finding Top foreign airports that passengers travel to from US

Passengers_to_foreign_dest_count <- Passengers_df %>% group_by(fg_apt) %>% summarise_at(vars(Total),funs(sum(.)))
Passengers_to_foreign_dest_count_desc <- Passengers_to_foreign_dest_count[with(Passengers_to_foreign_dest_count, order(-Total)),]
Passengers_to_foreign_dest_count_desc_top <- head(Passengers_to_foreign_dest_count_desc,5)

Plotting top 5 foreign airports

ggplot(Passengers_to_foreign_dest_count_desc_top, aes(y=Total, x=fg_apt, fill = fg_apt)) +
  geom_bar(position="dodge",stat="identity") +  #stat="identity"
  ggtitle("Top 5 foreign airports from US") +
  scale_fill_brewer(palette = "Paired")

Finding Top carriers in US

Passengers_by_carrier_count <- Passengers_df %>% group_by(carrier) %>% summarise_at(vars(Total),funs(sum(.)))
Passengers_by_carrier_count_desc <- Passengers_by_carrier_count[with(Passengers_by_carrier_count, order(-Total)),]
Passengers_by_carrier_count_desc_top <- head(Passengers_by_carrier_count_desc,10)

Plotting barplot of top 10 Carriers in US

ggplot(Passengers_by_carrier_count_desc_top, aes(y=Total, x=carrier, fill = carrier)) +
  geom_bar(position="dodge",stat="identity") +
  ggtitle("Top 10 Carriers") +
  scale_fill_brewer(palette = "PuOr")

Plotting piechart of top 10 Carriers in US

carrier_pie <- ggplot(Passengers_by_carrier_count_desc_top, aes(x="", y=Total, fill=carrier)) +
  geom_bar(stat="identity", width=1) +
  coord_polar("y", start=0) +scale_fill_brewer(palette="Spectral")
carrier_pie

Plotting connections of JFK airport to top 5 foreign airports

Airport_lat_long_df <- data.frame (Airtport_name = c("JFK","LAX","MIA","ORD","EWR"),
                                   Latitude = c(40.6 ,33.9,25.8, 42,40.7),
                                   Longitude = c(-73.8,-118, -80.3,-87.9,-74.2))
JFK_Foreign_AP_df <- data.frame (Airtport_name = c("LHR","CDG","SDQ","FRA","MEX"),
                                 Latitude  = c(51.5, 49.0, 18.4, 50.0,19.4),
                                 Longitude = c(-0.462,2.55,-69.7,8.57,-99.1))

 worldmap <- getMap(resolution = "coarse")
 # plot world map
 plot(worldmap, xlim = c(-180,180), ylim = c(-90, 90),
      asp = 1, bg = "azure2",border = "lightgrey", col = "wheat1")

 points(Airport_lat_long_df$Longitude,Airport_lat_long_df$Latitude,
        col = rgb(red = 0, green = 0, blue = 1, alpha = 0.3), cex = Passengers_at_airports_count_desc_top$Total/400000000,pch = 20)
 text(Airport_lat_long_df$Longitude,Airport_lat_long_df$Latitude, Airport_lat_long_df$Airtport_name, pos = 4, col = "black",
      cex = 0.3)
 points(JFK_Foreign_AP_df$Longitude,JFK_Foreign_AP_df$Latitude,
        col = rgb(red = 0, green = 0, blue = 1, alpha = 0.3), cex = 0.5,pch = 20)
 text(JFK_Foreign_AP_df$Longitude,JFK_Foreign_AP_df$Latitude, JFK_Foreign_AP_df$Airtport_name, pos = 4, col = "black",
      cex = 0.3)
 for (j in 1:length(JFK_Foreign_AP_df$Airtport_name))
 {
   Inter <- gcIntermediate(c(-73.8 , 40.6), c(JFK_Foreign_AP_df$Longitude[j],JFK_Foreign_AP_df$Latitude[j]), n=100, addStartEnd=TRUE)
   lines(Inter, col="orange", lwd=0)
 }

Descriptive statistics for top 3 busiest airports in US

JFK_df <- dplyr::filter(Passengers_df, (usg_apt == "JFK"))
JFK_df_sorted1 <- JFK_df %>% group_by(Year) %>% summarise_at(vars(Total),funs(sum(.)))
JFK_mean <- mean(JFK_df_sorted1$Total)
JFK_var <- var(JFK_df_sorted1$Total)
JFK_std_dev <- sqrt(JFK_var)
JFK_skewness <- skewness(JFK_df_sorted1$Total)
JFK_kurtosis <- kurtosis(JFK_df_sorted1$Total)

LAX_df <- dplyr::filter(Passengers_df, (usg_apt == "LAX"))
LAX_df_sorted1 <- LAX_df %>% group_by(Year) %>% summarise_at(vars(Total),funs(sum(.)))
LAX_mean <- mean(LAX_df_sorted1$Total)
LAX_var <- var(LAX_df_sorted1$Total)
LAX_std_dev <- sqrt(LAX_var)
LAX_skewness <- skewness(LAX_df_sorted1$Total)
LAX_kurtosis <- kurtosis(LAX_df_sorted1$Total)

MIA_df <- dplyr::filter(Passengers_df, (usg_apt == "MIA"))
MIA_df_sorted1 <- MIA_df %>% group_by(Year) %>% summarise_at(vars(Total),funs(sum(.)))
MIA_mean <- mean(MIA_df_sorted1$Total)
MIA_var <- var(MIA_df_sorted1$Total)
MIA_std_dev <- sqrt(MIA_var)
MIA_skewness <- skewness(MIA_df_sorted1$Total)
MIA_kurtosis <- kurtosis(MIA_df_sorted1$Total)

tab <- matrix(c(JFK_mean,LAX_mean,MIA_mean,JFK_var,LAX_var,MIA_var,JFK_std_dev,LAX_std_dev,MIA_std_dev,JFK_skewness,LAX_skewness,MIA_skewness, JFK_kurtosis,LAX_kurtosis,MIA_kurtosis), ncol=3,nrow = 5, byrow=TRUE)
colnames(tab) <- c('JFK','LAX','MIA')
rownames(tab) <- c('Mean','Variance','Std. deviation','Skewness','Kurtosis')
tab <- as.table(tab)
tab

##                          JFK           LAX           MIA
## Mean            2.358225e+07  1.814302e+07  1.762154e+07
## Variance        4.070393e+13  1.387669e+13  7.362230e+12
## Std. deviation  6.379964e+06  3.725143e+06  2.713343e+06
## Skewness        3.541980e-01  1.225688e+00  2.376995e-01
## Kurtosis       -1.092538e+00  2.531447e-01 -1.795048e+00

PMF and CDF of airlines going from JFK to LHR in year 2000

JFK_freq <- Passengers_df %>%
  select(usg_apt, fg_apt, Year,Month ,airlineid,Total)%>%
  dplyr::filter(usg_apt == 'JFK' & fg_apt == 'LHR' & Year == '2000') %>%
  group_by(airlineid) %>%
  summarise(airlineid_count = n()) %>%
  mutate(pmf = airlineid_count/sum(airlineid_count))%>%
  mutate(cdf = cumsum(pmf))
JFK_freq

## # A tibble: 9 x 4
##   airlineid airlineid_count    pmf    cdf
##       <int>           <int>  <dbl>  <dbl>
## 1     19532               1 0.0123 0.0123
## 2     19533              12 0.148  0.160 
## 3     19540              12 0.148  0.309 
## 4     19616              12 0.148  0.457 
## 5     19682              12 0.148  0.605 
## 6     19805              12 0.148  0.753 
## 7     19977              12 0.148  0.901 
## 8     20045               7 0.0864 0.988 
## 9     20312               1 0.0123 1

Visualizing the ‘PMF’ table using a scatterplot

ggplot(data = JFK_freq , aes(x = airlineid, y = airlineid_count, color = pmf)) +
  geom_point(size = 5)+
  ggtitle("Scatterplot of PMF of airlines going from JFK to LHR in year 2000")

Finding Joint Probability of passengers at top 5 airports taking scheduled flights or chartered flights in the year 2019

Column1 <- dplyr::filter(Passengers_df, ((usg_apt == "JFK") | (usg_apt == "LAX") | (usg_apt == "MIA") | (usg_apt == "ORD") 
                                         | (usg_apt == "EWR")) & (Year == 2019))
Column1 <- Column1 %>% group_by(usg_apt) %>% summarise_at(vars(Scheduled, Charter),funs(sum(.)))
Column1 <- Column1 %>% select(Scheduled, Charter)
Joint_prob <- round(Column1/sum(Column1),3)
Joint_prob <- Joint_prob %>%
  add_column(Airport = c("EWR","JFK","LAX","MIA","ORD"))
Joint_prob

##   Scheduled Charter Airport
## 1     0.131   0.000     EWR
## 2     0.312   0.000     JFK
## 3     0.234   0.000     LAX
## 4     0.194   0.002     MIA
## 5     0.127   0.000     ORD
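
Marginal and conditional probabilities follow directly from a joint probability table like the one above. The sketch below retypes the rounded values printed in the output (so results are only as precise as that rounding); the object names are illustrative.

```r
# Joint probabilities as printed above (rounded to 3 decimals)
joint <- data.frame(
  Airport   = c("EWR", "JFK", "LAX", "MIA", "ORD"),
  Scheduled = c(0.131, 0.312, 0.234, 0.194, 0.127),
  Charter   = c(0.000, 0.000, 0.000, 0.002, 0.000)
)

# Marginal probability of each airport: sum over the flight types
p_airport <- setNames(joint$Scheduled + joint$Charter, joint$Airport)

# Marginal probability of each flight type: sum over the airports
p_type <- colSums(joint[, c("Scheduled", "Charter")])

# Conditional probability of a charter flight given that the airport is MIA
p_charter_given_mia <- joint$Charter[joint$Airport == "MIA"] / p_airport[["MIA"]]

p_airport
p_type
p_charter_given_mia
```

The marginals of each variable sum to 1, which is a quick sanity check on any joint table.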

Plotting the Joint Probability

ggplot(data = Joint_prob , aes(x = Scheduled, y = Charter, color = Airport)) +
  geom_point(size = 5)+
  ggtitle("Joint Probability of passengers at top 5 airports taking scheduled flights or chartered flights in the year 2019")

Goodness of fit for distribution of passengers count per year for JFK airport

fit_n <- fitdist(JFK_df_sorted1$Total, "norm")
summary(fit_n)

## Fitting of the distribution ' norm ' by maximum likelihood 
## Parameters : 
##      estimate Std. Error
## mean 23582251         NA
## sd    6218419         NA
## Loglikelihood:  -341.2393   AIC:  686.4786   BIC:  688.4701 
## Correlation matrix:
## [1] NA

par(mfrow=c(2,2))
plot.legend <- c("normal")
denscomp(list(fit_n), legendtext = plot.legend, xlab = '#passengers at JFK every year', xlegend = 'topleft')
cdfcomp (list(fit_n), legendtext = plot.legend, xlab = '#passengers at JFK every year')
qqcomp (list(fit_n), legendtext = plot.legend, xlab = '#passengers at JFK every year')
ppcomp (list(fit_n), legendtext = plot.legend, xlab = '#passengers at JFK every year')
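
The fitted `sd` of 6218419 is smaller than the standard deviation of 6379964 reported in the descriptive statistics table. A likely explanation is that the maximum likelihood estimate divides by n while `var()` and `sd()` divide by n - 1. The sketch below checks this in base R, using the annual JFK totals as printed for `JFK_df_sorted1` later in this report; the variable names are illustrative.

```r
# Annual JFK totals for 2000-2019, copied from the JFK_df_sorted1 printout
jfk_total <- c(18444274, 15897104, 14780207, 14968157, 17067640,
               18469265, 19528167, 21459711, 21996790, 21420820,
               22731710, 23472841, 24739981, 26196242, 27645230,
               29967272, 32129317, 32936207, 33857402, 33936686)
n <- length(jfk_total)

# Maximum likelihood estimates for a normal distribution
mle_mean <- mean(jfk_total)
mle_sd   <- sqrt(mean((jfk_total - mle_mean)^2))   # divides by n

# Usual sample standard deviation
sample_sd <- sd(jfk_total)                          # divides by n - 1

c(mle_mean = mle_mean, mle_sd = mle_sd, sample_sd = sample_sd)
```

The two estimates differ by the factor sqrt((n - 1) / n), which matters little for large samples but is visible with only 20 yearly values.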

Hypothesis Testing

Confidence level assumed = 95% (significance level = 0.05)

Hypothesis Testing 1:
H0: The mean number of passengers per month at JFK <= 1000000
H1: The mean number of passengers per month at JFK > 1000000
This is a right-tailed test.

JFK_df_sorted2 <- JFK_df %>% group_by(usg_apt,Year,Month) %>% summarise_at(vars(Total),funs(sum(.)))
# Stratified random sample from the JFK dataset: 3 months are drawn from each year,
# because JFK_df_sorted2 is still grouped by Year
JFK_df_sample <- sample_n(JFK_df_sorted2,3)
JFK_df_sample_mean <- mean(JFK_df_sample$Total)
JFK_df_sample_var <- var(JFK_df_sample$Total)
sample_data_length <- nrow(JFK_df_sample)
# Mean number of passengers flying from JFK per month, 2000-2019
JFK_mean <- mean(JFK_df_sorted2$Total)
# Variance of the number of passengers flying from JFK per month, 2000-2019
JFK_var <- var(JFK_df_sorted2$Total)

z <- (JFK_df_sample_mean - 1000000)/sqrt((JFK_var/(sample_data_length)))
df <- data.frame("Z_calc"=z,"P_value"= pnorm(z, lower.tail=FALSE))
df

##     Z_calc      P_value
## 1 12.46183 6.028517e-36
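
The statistic computed above is z = (sample mean - hypothesised mean) / sqrt(population variance / n), with the p-value taken from the upper tail of the standard normal distribution. The sketch below wraps that calculation in a helper; `z_test_right` and the numbers passed to it are hypothetical and chosen only to make the arithmetic easy to follow. Because the sample is drawn at random, the second part illustrates how fixing a seed makes a draw repeatable.

```r
# Hypothetical helper: right-tailed one-sample z-test with a known population variance
z_test_right <- function(sample_mean, mu0, pop_var, n) {
  z <- (sample_mean - mu0) / sqrt(pop_var / n)
  data.frame(Z_calc = z, P_value = pnorm(z, lower.tail = FALSE))
}

# Made-up inputs: the standard error is sqrt(25 / 25) = 1, so z = 105 - 100 = 5
result <- z_test_right(sample_mean = 105, mu0 = 100, pop_var = 25, n = 25)
result

# Fixing the seed before sampling gives the same draw every time
set.seed(42)
first_draw <- sample(1:240, 3)
set.seed(42)
second_draw <- sample(1:240, 3)
identical(first_draw, second_draw)
```

Calling `set.seed()` before `sample_n()` in the analysis would make the reported Z_calc and P_value reproducible across runs.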

Conclusion: As p << 0.05, we reject the null hypothesis and conclude that the mean number of passengers per month at JFK is greater than 1000000.

Hypothesis Testing 2:
H0: The variance of the number of passengers per month at JFK >= 55x10^10
H1: The variance of the number of passengers per month at JFK < 55x10^10
This is a left-tailed test.

varTest(JFK_df_sample$Total, alternative = "less", conf.level = 0.95, 
        sigma.squared = 550000000000, data.name = NULL)

## 
##  Chi-Squared Test on Variance
## 
## data:  JFK_df_sample$Total
## Chi-Squared = 45.75, df = 59, p-value = 0.1034
## alternative hypothesis: true variance is less than 5.5e+11
## 95 percent confidence interval:
##             0 594304794451
## sample estimates:
##     variance 
## 426482264029
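
The chi-squared statistic in this output can be reproduced by hand as (n - 1) * s^2 / sigma0^2, with the p-value taken from the lower tail because the alternative is "less". The sketch below uses only numbers printed in the output above; the variable names are illustrative.

```r
n          <- 60            # df = 59 in the output above, so n = df + 1
sample_var <- 426482264029  # "sample estimates: variance" in the output above
sigma0_sq  <- 550000000000  # hypothesised variance, 55 x 10^10

# Test statistic and left-tail p-value
chi_sq  <- (n - 1) * sample_var / sigma0_sq
p_value <- pchisq(chi_sq, df = n - 1)

c(chi_sq = chi_sq, p_value = p_value)
```

The decision rule compares `p_value` with the 0.05 significance level: the null hypothesis is rejected only when the p-value falls below it.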

Conclusion: As p = 0.1034 > 0.05, we fail to reject the null hypothesis. There is not enough evidence to conclude that the variance is less than 55x10^10.

Hypothesis Testing 3:
H0: The proportion of passengers flying with BA from JFK to LHR in 2008 >= 0.41
H1: The proportion of passengers flying with BA from JFK to LHR in 2008 < 0.41
This is a left-tailed test.

JFK_LHR <- Passengers_df%>% filter(usg_apt == 'JFK' & fg_apt == 'LHR' & Year == '2008' )
JFK_LHR_BA <- Passengers_df%>% filter(usg_apt == 'JFK' & fg_apt == 'LHR' & carrier == 'BA' & Year == '2008')
Num_of_jfk_to_lhr <- JFK_LHR %>% select(Total) %>% summarise_at(vars(Total),funs(sum(.))) %>% arrange(-Total)
Num_of_jfk_to_lhr[1,]

## [1] 2762143

Num_of_jfk_to_lhr_1 <- JFK_LHR_BA %>% select(Total) %>% summarise_at(vars(Total),funs(sum(.))) %>% arrange(-Total)

prop.test(x= Num_of_jfk_to_lhr_1[1,], n=Num_of_jfk_to_lhr[1,], p=0.41, correct = TRUE, conf.level = 0.95,
          alternative = "less")

## 
##  1-sample proportions test with continuity correction
## 
## data:  Num_of_jfk_to_lhr_1[1, ] out of Num_of_jfk_to_lhr[1, ], null probability 0.41
## X-squared = 58.409, df = 1, p-value = 1.065e-14
## alternative hypothesis: true p is less than 0.41
## 95 percent confidence interval:
##  0.0000000 0.4082247
## sample estimates:
##         p 
## 0.4077381
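
The result above can be approximated by hand with the normal approximation z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n). The sketch below uses the sample proportion and total printed in the output and omits the continuity correction, so z^2 is close to, but not exactly, the reported X-squared. The variable names are illustrative.

```r
p_hat <- 0.4077381   # sample proportion from the output above
n     <- 2762143     # total JFK to LHR passengers in 2008, printed earlier
p0    <- 0.41        # hypothesised proportion

# One-sample z statistic for a proportion and its left-tail p-value
std_error <- sqrt(p0 * (1 - p0) / n)
z         <- (p_hat - p0) / std_error
p_value   <- pnorm(z)

# Size of the gap between the hypothesised and observed proportions
difference <- p0 - p_hat

c(z = z, z_squared = z^2, p_value = p_value, difference = difference)
```

With about 2.76 million passengers the standard error is tiny, so even a gap of roughly 0.002 between the observed and hypothesised proportions produces a very small p-value.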

Conclusion: As p << 0.05, we reject the null hypothesis and conclude that the proportion of passengers flying with BA from JFK to LHR in 2008 is less than 0.41.

Linear Regression

From the Goodness of Fit plots, the JFK data (JFK_df_sorted1) appears to be approximately normally distributed. From the correlation plot of the JFK dataset (JFK_df_sorted1), we see that Year and Total are highly correlated. Therefore, a regression is run with the total number of passengers at JFK airport per year as the dependent variable and Year as the independent variable.

JFK_df_sorted1

## # A tibble: 20 x 2
##     Year    Total
##    <int>    <int>
##  1  2000 18444274
##  2  2001 15897104
##  3  2002 14780207
##  4  2003 14968157
##  5  2004 17067640
##  6  2005 18469265
##  7  2006 19528167
##  8  2007 21459711
##  9  2008 21996790
## 10  2009 21420820
## 11  2010 22731710
## 12  2011 23472841
## 13  2012 24739981
## 14  2013 26196242
## 15  2014 27645230
## 16  2015 29967272
## 17  2016 32129317
## 18  2017 32936207
## 19  2018 33857402
## 20  2019 33936686

linear_model <- lm(Total~Year,data = JFK_df_sorted1)
summary(linear_model)

## 
## Call:
## lm(formula = Total ~ Year, data = JFK_df_sorted1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1828619 -1122920  -407858   779640  4779255 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.074e+09  1.281e+08  -16.19 3.59e-12 ***
## Year         1.044e+06  6.377e+04   16.37 2.96e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1645000 on 18 degrees of freedom
## Multiple R-squared:  0.9371, Adjusted R-squared:  0.9336 
## F-statistic:   268 on 1 and 18 DF,  p-value: 2.962e-12

summary(JFK_df_sorted1$Total - linear_model$fitted.values)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1828619 -1122920  -407858        0   779640  4779255

Test_year <- data.frame(Year=c(2016,2017,2018,2019))
predict_passenger_no <- predict(linear_model,newdata = Test_year, interval = 'confidence')
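
The predictions are computed above but not printed. The sketch below refits the same model in base R from the `JFK_df_sorted1` values printed earlier, then compares the fitted value for 2017 with the actual count. The object names are illustrative, and the year 2020 is an assumed example of a year outside the data.

```r
# Annual JFK totals, copied from the JFK_df_sorted1 printout above
jfk_annual <- data.frame(
  Year  = 2000:2019,
  Total = c(18444274, 15897104, 14780207, 14968157, 17067640,
            18469265, 19528167, 21459711, 21996790, 21420820,
            22731710, 23472841, 24739981, 26196242, 27645230,
            29967272, 32129317, 32936207, 33857402, 33936686)
)

model <- lm(Total ~ Year, data = jfk_annual)

# Fitted value for 2017 and its relative error against the actual count
pred_2017   <- unname(stats::predict(model, newdata = data.frame(Year = 2017)))
actual_2017 <- jfk_annual$Total[jfk_annual$Year == 2017]
rel_error   <- (actual_2017 - pred_2017) / actual_2017

# Extrapolation to a year outside the data, with a 95% prediction interval
future <- stats::predict(model, newdata = data.frame(Year = 2020),
                         interval = "prediction")

c(pred_2017 = pred_2017, actual_2017 = actual_2017, rel_error = rel_error)
future
```

`stats::predict` is called explicitly because `EnvStats` masks `predict` in this session. For a single future year, `interval = "prediction"` gives a wider and more appropriate range than `interval = "confidence"`, which only describes uncertainty in the mean trend.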

Conclusion: The residual summary (Median < Mean) shows that the residuals are not symmetric but right-skewed. From the Coefficients output we can conclude the following:
1. Equation of the model: Total number of passengers in year x = 1.044e+06(x) - 2.074e+09
2. The model predicts 31411645 passengers for 2017 against an actual value of 32936207, an error of about 4.6%. Note that 2017 is part of the data used to fit the model, so this is an in-sample comparison.
3. The t-value of 16.37 means that the estimated Year coefficient is 16.37 standard errors away from 0, which is strong evidence that the true slope is not 0.
4. As the p-values in our model are extremely small, there is strong evidence of a linear relationship between Year and Number of passengers.
5. The three asterisks indicate that the Year coefficient is significant at the 0.001 level.
6. The residual standard error is 1645000, so the actual number of passengers per year at JFK typically lies about 1.6M away from the predicted value. As our maximum actual value is about 34M, a typical error of 1.6M (roughly 5%) suggests that the model is a reasonable fit for the data.
7. Year explains ~93.71% of the variation in Number of passengers, our dependent variable. Thus, we can conclude that the model fits the data well.

Samruddhi Kulkarni, Dev Patel, Shubham Chopade