The main objective of the project is to analyse the data and extract insights that give the authorities an overall picture of key statistical measures: the average number of passengers, the type of distribution the data follow, probabilities, and validation of assumptions about the data through hypothesis testing. The linear regression model developed here can be used to predict the number of passengers in future years.
Obtaining the Passengers dataset and displaying descriptive statistics
library(fitdistrplus)
## Warning: package 'fitdistrplus' was built under R version 4.1.2
## Loading required package: MASS
## Loading required package: survival
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.0.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::select() masks MASS::select()
library(e1071)
library(ggplot2)
library(dplyr)
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.1.3
library(airportr)
## Warning: package 'airportr' was built under R version 4.1.3
library(rworldmap)
## Warning: package 'rworldmap' was built under R version 4.1.3
## Loading required package: sp
## ### Welcome to rworldmap ###
## For a short introduction type : vignette('rworldmap')
library(sf)
## Warning: package 'sf' was built under R version 4.1.3
## Linking to GEOS 3.9.1, GDAL 3.2.1, PROJ 7.2.1; sf_use_s2() is TRUE
library(geosphere)
## Warning: package 'geosphere' was built under R version 4.1.3
library(shiny)
## Warning: package 'shiny' was built under R version 4.1.2
##
## Attaching package: 'shiny'
## The following object is masked from 'package:geosphere':
##
## span
library(mapproj)
## Warning: package 'mapproj' was built under R version 4.1.2
## Loading required package: maps
##
## Attaching package: 'maps'
## The following object is masked from 'package:plyr':
##
## ozone
## The following object is masked from 'package:purrr':
##
## map
library(EnvStats)
## Warning: package 'EnvStats' was built under R version 4.1.2
##
## Attaching package: 'EnvStats'
## The following objects are masked from 'package:e1071':
##
## kurtosis, skewness
## The following object is masked from 'package:MASS':
##
## boxcox
## The following objects are masked from 'package:stats':
##
## predict, predict.lm
## The following object is masked from 'package:base':
##
## print.default
library(correlation)
## Warning: package 'correlation' was built under R version 4.1.2
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.2
## corrplot 0.92 loaded
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(forcats)
library(date)
library(flightplot)
## Warning: package 'flightplot' was built under R version 4.1.2
##
## Attaching package: 'flightplot'
## The following object is masked from 'package:airportr':
##
## airports
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.2
##
## Attaching package: 'plotly'
## The following objects are masked from 'package:plyr':
##
## arrange, mutate, rename, summarise
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(knitr)
setwd('C:/Sam/Sem1/Engineering Probability and Statistics/Project/archive (1)')
#-------------------------------------------------------------------------------------------------------------------
# Obtaining the Passengers dataset
Passengers_df <- read.table('International_Report_Passengers1.csv', # the dataset to be imported
header = TRUE, # the first row of the file holds the column names
sep = ',')
head(Passengers_df)
## data_dte Year Month usg_apt_id usg_apt usg_wac fg_apt_id fg_apt fg_wac
## 1 5/1/2014 2014 5 14492 RDU 36 11032 CUN 148
## 2 6/1/2007 2007 6 13204 MCO 33 16085 YHZ 951
## 3 12/1/2005 2005 12 11433 DTW 43 10411 AUA 277
## 4 4/1/2003 2003 4 13487 MSP 63 16304 ZIH 148
## 5 12/1/2005 2005 12 12016 GUM 5 11138 CRK 766
## 6 3/1/2007 2007 3 14843 SJU 3 15084 SXM 259
## airlineid carrier carriergroup type Scheduled Charter Total
## 1 19534 AM 0 Passengers 0 315 315
## 2 20364 C6 0 Passengers 0 683 683
## 3 20344 RD 1 Passengers 0 1010 1010
## 4 20204 MG 1 Passengers 0 508 508
## 5 20312 TZ 1 Passengers 0 76 76
## 6 20421 SLQ 1 Passengers 0 35 35
summary(Passengers_df)
## data_dte Year Month usg_apt_id
## Length:488087 Min. :2000 Min. : 1.000 Min. :10010
## Class :character 1st Qu.:2005 1st Qu.: 3.000 1st Qu.:11618
## Mode :character Median :2011 Median : 6.000 Median :12478
## Mean :2010 Mean : 6.427 Mean :12777
## 3rd Qu.:2015 3rd Qu.: 9.000 3rd Qu.:13891
## Max. :2019 Max. :12.000 Max. :99999
## usg_apt usg_wac fg_apt_id fg_apt
## Length:488087 Min. : 1.00 Min. :10125 Length:488087
## Class :character 1st Qu.:22.00 1st Qu.:11868 Class :character
## Mode :character Median :34.00 Median :13518 Mode :character
## Mean :45.14 Mean :13550
## 3rd Qu.:74.00 3rd Qu.:15246
## Max. :93.00 Max. :16864
## fg_wac airlineid carrier carriergroup
## Min. :106.0 Min. :19386 Length:488087 Min. :0.000
## 1st Qu.:204.0 1st Qu.:19704 Class :character 1st Qu.:0.000
## Median :429.0 Median :19977 Mode :character Median :1.000
## Mean :472.1 Mean :20098 Mean :0.582
## 3rd Qu.:736.0 3rd Qu.:20366 3rd Qu.:1.000
## Max. :975.0 Max. :22062 Max. :1.000
## type Scheduled Charter Total
## Length:488087 Min. : 0 Min. : 0.0 Min. : 1
## Class :character 1st Qu.: 215 1st Qu.: 0.0 1st Qu.: 651
## Mode :character Median : 4168 Median : 0.0 Median : 4320
## Mean : 6945 Mean : 135.9 Mean : 7081
## 3rd Qu.: 10356 3rd Qu.: 0.0 3rd Qu.: 10395
## Max. :134263 Max. :58284.0 Max. :134263
Obtaining the Flights dataset
setwd('C:/Sam/Sem1/Engineering Probability and Statistics/Project/archive (1)')
Flight_df <- read.table('International_Report_Departures1.csv', # the dataset to be imported
header = TRUE, # the first row of the file holds the column names
sep = ',')
head(Flight_df)
## data_dte Year Month usg_apt_id usg_apt usg_wac fg_apt_id fg_apt fg_wac
## 1 1/1/2000 2000 1 11618 EWR 22 11032 CUN 148
## 2 1/1/2000 2000 1 14771 SFO 91 14739 SDQ 224
## 3 1/1/2000 2000 1 10299 ANC 1 14752 SEL 778
## 4 1/1/2000 2000 1 11193 CVG 44 13605 NAS 204
## 5 1/1/2000 2000 1 13204 MCO 33 14312 PVR 148
## 6 1/1/2000 2000 1 12016 GUM 5 15994 YAP 810
## airlineid carrier carriergroup type Scheduled Charter Total
## 1 20377 X9 1 Departures 0 6 6
## 2 20312 TZ 1 Departures 0 1 1
## 3 20190 9S 1 Departures 0 1 1
## 4 20016 LGQ 0 Departures 0 12 12
## 5 20312 TZ 1 Departures 0 1 1
## 6 20177 PFQ 1 Departures 0 4 4
summary(Flight_df)
## data_dte Year Month usg_apt_id
## Length:679644 Min. :2000 Min. : 1.0 Min. :10010
## Class :character 1st Qu.:2005 1st Qu.: 3.0 1st Qu.:11618
## Mode :character Median :2010 Median : 6.0 Median :12892
## Mean :2010 Mean : 6.4 Mean :12813
## 3rd Qu.:2015 3rd Qu.: 9.0 3rd Qu.:13495
## Max. :2020 Max. :12.0 Max. :99999
## usg_apt usg_wac fg_apt_id fg_apt
## Length:679644 Min. : 1.00 Min. :10119 Length:679644
## Class :character 1st Qu.:22.00 1st Qu.:11874 Class :character
## Mode :character Median :33.00 Median :13514 Mode :character
## Mean :43.35 Mean :13543
## 3rd Qu.:74.00 3rd Qu.:15129
## Max. :93.00 Max. :16881
## fg_wac airlineid carrier carriergroup
## Min. :106.0 Min. :19386 Length:679644 Min. :0.0000
## 1st Qu.:205.0 1st Qu.:19790 Class :character 1st Qu.:0.0000
## Median :429.0 Median :20093 Mode :character Median :1.0000
## Mean :473.4 Mean :20119 Mean :0.6147
## 3rd Qu.:736.0 3rd Qu.:20366 3rd Qu.:1.0000
## Max. :975.0 Max. :22067 Max. :1.0000
## type Scheduled Charter Total
## Length:679644 Min. : 0.00 Min. : 0.000 Min. : 1.00
## Class :character 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 3.00
## Mode :character Median : 20.00 Median : 0.000 Median : 24.00
## Mean : 42.68 Mean : 1.793 Mean : 44.47
## 3rd Qu.: 61.00 3rd Qu.: 1.000 3rd Qu.: 61.00
## Max. :1461.00 Max. :1092.000 Max. :1461.00
Counting unique values of key fields in the Passengers dataset
# Unique US airports
Unique_US_airports <- unique(Passengers_df$usg_apt)
Number_unique_US_airports <- length(Unique_US_airports)
# Unique foreign airports that passengers travel to from the US
Unique_foreign_airports <- unique(Passengers_df$fg_apt)
Number_unique_foreign_airports <- length(Unique_foreign_airports)
# Unique carriers
Unique_carrier <- unique(Passengers_df$carrier)
Number_unique_carrier <- length(Unique_carrier)
# Unique airline IDs
Unique_airlineid <- unique(Passengers_df$airlineid)
Number_unique_airlineid <- length(Unique_airlineid)
tab1 <- matrix((c(Number_unique_US_airports,Number_unique_foreign_airports,Number_unique_carrier,Number_unique_airlineid)), ncol=1,nrow = 4, byrow=TRUE)
colnames(tab1) <- c('Number of Uniques')
rownames(tab1) <- c('US airports','Foreign Airports','Carriers','Airlineids')
tab1 <- as.table(tab1)
tab1
## Number of Uniques
## US airports 783
## Foreign Airports 1177
## Carriers 473
## Airlineids 448
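The same kind of count can be collected in one pass instead of one variable per field. The following is a minimal sketch in base R on a small made-up data frame; `toy_df` and its values are illustrative assumptions, not rows from the dataset.

```r
# Illustrative data frame (hypothetical values, not from the dataset)
toy_df <- data.frame(
  usg_apt = c("JFK", "JFK", "LAX", "MIA"),
  fg_apt  = c("LHR", "CDG", "LHR", "MEX"),
  carrier = c("BA", "AF", "BA", "AM"),
  stringsAsFactors = FALSE
)

# Number of distinct values in every column, returned as a named vector
unique_counts <- sapply(toy_df, function(col) length(unique(col)))
unique_counts
```

Applied to the real data, the column selection would be something like `Passengers_df[c("usg_apt", "fg_apt", "carrier", "airlineid")]`.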
Data Visualization
# Finding total number of passengers of all the airports per year and per month
# funs() is deprecated since dplyr 0.8.0, so the summary function is passed directly
count_per_year <- Passengers_df %>% group_by(Year) %>% summarise_at(vars(Total), sum)
count_per_year_per_month <- Passengers_df %>% group_by(Year,Month) %>% summarise_at(vars(Total), sum)
Plotting total number of passengers of all the airports per year
ggplot(count_per_year, aes(x = Year, y = Total, size = Total/100000)) +
geom_point(shape=18, color="red")+
labs(title="Number of passengers per year, 2000-2019") +
theme(legend.title = element_blank())
Plotting total number of passengers of all the airports per month
# Build a real date (first day of each month) so that the x-axis is ordered chronologically
# and geom_smooth() can fit a linear trend; a pasted "Year_Month" string would be sorted alphabetically
count_per_year_per_month$Year_month <- lubridate::make_date(count_per_year_per_month$Year, count_per_year_per_month$Month, 1L)
ggplot(count_per_year_per_month, aes(x = Year_month, y = Total)) +
geom_point(aes(size = Total/1000000), shape = 16, color="green")+
geom_smooth(method=lm, linetype="dashed",
color="darkred", fill="blue") +
labs(title="Number of passengers per month, 2000-2019") +
theme(legend.title = element_blank())
## `geom_smooth()` using formula 'y ~ x'
Finding Top busiest airports in US
Passengers_at_airports_count <- Passengers_df %>% group_by(usg_apt) %>% summarise_at(vars(Total), sum)
Passengers_at_airports_count_desc <- Passengers_at_airports_count[with(Passengers_at_airports_count, order(-Total)),]
Passengers_at_airports_count_desc_top <- head(Passengers_at_airports_count_desc,5)
Plotting barplot of top 5 busiest airports in US
ggplot(Passengers_at_airports_count_desc_top, aes(y=Total, x=usg_apt, fill = usg_apt)) +
geom_bar(position="dodge",stat="identity") + #stat="identity"
ggtitle("Top 5 busiest Airports in US") +
scale_fill_brewer(palette = "Set2")
Boxplot for top 5 busiest airports in US
Top_5_airports_df <- dplyr::filter(Passengers_df, usg_apt %in% c("JFK","LAX","MIA","ORD","EWR"))
Top_5_airports_df <- Top_5_airports_df %>% group_by(Year, Month,usg_apt) %>% summarise_at(vars(Total), sum)
Top_5_airports_df
## # A tibble: 1,200 x 4
## # Groups: Year, Month [240]
## Year Month usg_apt Total
## <int> <int> <chr> <int>
## 1 2000 1 EWR 572638
## 2 2000 1 JFK 1248503
## 3 2000 1 LAX 1286851
## 4 2000 1 MIA 1466907
## 5 2000 1 ORD 697110
## 6 2000 2 EWR 563892
## 7 2000 2 JFK 1148275
## 8 2000 2 LAX 1203306
## 9 2000 2 MIA 1311275
## 10 2000 2 ORD 644291
## # ... with 1,190 more rows
ggplot(Top_5_airports_df, aes(x = usg_apt, y = Total, col = usg_apt )) +
geom_violin(trim=FALSE) +
geom_boxplot(outlier.colour="red", outlier.shape=16,
outlier.size=2, notch=FALSE)
Finding Top foreign airports that passengers travel to from US
Passengers_to_foreign_dest_count <- Passengers_df %>% group_by(fg_apt) %>% summarise_at(vars(Total), sum)
Passengers_to_foreign_dest_count_desc <- Passengers_to_foreign_dest_count[with(Passengers_to_foreign_dest_count, order(-Total)),]
Passengers_to_foreign_dest_count_desc_top <- head(Passengers_to_foreign_dest_count_desc,5)
Plotting top 5 foreign airports
ggplot(Passengers_to_foreign_dest_count_desc_top, aes(y=Total, x=fg_apt, fill = fg_apt)) +
geom_bar(position="dodge",stat="identity") + #stat="identity"
ggtitle("Top 5 foreign airports from US") +
scale_fill_brewer(palette = "Paired")
Finding Top carriers in US
Passengers_by_carrier_count <- Passengers_df %>% group_by(carrier) %>% summarise_at(vars(Total), sum)
Passengers_by_carrier_count_desc <- Passengers_by_carrier_count[with(Passengers_by_carrier_count, order(-Total)),]
Passengers_by_carrier_count_desc_top <- head(Passengers_by_carrier_count_desc,10)
Plotting barplot of top 10 Carriers in US
ggplot(Passengers_by_carrier_count_desc_top, aes(y=Total, x=carrier, fill = carrier)) +
geom_bar(position="dodge",stat="identity") + # bar heights are the Total values themselves
ggtitle("Top 10 Carriers") +
scale_fill_brewer(palette = "PuOr")
Plotting piechart of top 10 Carriers in US
carrier_pie <- ggplot(Passengers_by_carrier_count_desc_top, aes(x="", y=Total, fill=carrier)) +
geom_bar(stat="identity", width=1) +
coord_polar("y", start=0) +scale_fill_brewer(palette="Spectral")
carrier_pie
Plotting connections of JFK airport to top 5 foreign airports
Airport_lat_long_df <- data.frame (Airtport_name = c("JFK","LAX","MIA","ORD","EWR"),
Latitude = c(40.6 ,33.9,25.8, 42,40.7),
Longitude = c(-73.8,-118, -80.3,-87.9,-74.2))
JFK_Foreign_AP_df <- data.frame (Airtport_name = c("LHR","CDG","SDQ","FRA","MEX"),
Latitude = c(51.5, 49.0, 18.4, 50.0,19.4),
Longitude = c(-0.462,2.55,-69.7,8.57,-99.1))
worldmap <- getMap(resolution = "coarse")
# plot world map
plot(worldmap, xlim = c(-180,180), ylim = c(-90, 90),
asp = 1, bg = "azure2",border = "lightgrey", col = "wheat1")
points(Airport_lat_long_df$Longitude,Airport_lat_long_df$Latitude,
col = rgb(red = 0, green = 0, blue = 1, alpha = 0.3), cex = Passengers_at_airports_count_desc_top$Total/400000000,pch = 20)
text(Airport_lat_long_df$Longitude,Airport_lat_long_df$Latitude, Airport_lat_long_df$Airtport_name, pos = 4, col = "black",
cex = 0.3)
points(JFK_Foreign_AP_df$Longitude,JFK_Foreign_AP_df$Latitude,
col = rgb(red = 0, green = 0, blue = 1, alpha = 0.3), cex = 0.5,pch = 20)
# Label the foreign airports with their own codes (not the US airport codes)
text(JFK_Foreign_AP_df$Longitude,JFK_Foreign_AP_df$Latitude, JFK_Foreign_AP_df$Airtport_name, pos = 4, col = "black",
cex = 0.3)
for (j in 1:length(JFK_Foreign_AP_df$Airtport_name))
{
Inter <- gcIntermediate(c(-73.8 , 40.6), c(JFK_Foreign_AP_df$Longitude[j],JFK_Foreign_AP_df$Latitude[j]), n=100, addStartEnd=TRUE)
lines(Inter, col="orange", lwd=0)
}
Descriptive statistics for top 3 busiest airports in US
JFK_df <- dplyr::filter(Passengers_df, (usg_apt == "JFK"))
JFK_df_sorted1 <- JFK_df %>% group_by(Year) %>% summarise_at(vars(Total), sum)
JFK_mean <- mean(JFK_df_sorted1$Total)
JFK_var <- var(JFK_df_sorted1$Total)
JFK_std_dev <- sqrt(JFK_var)
JFK_skewness <- skewness(JFK_df_sorted1$Total)
JFK_kurtosis <- kurtosis(JFK_df_sorted1$Total)
LAX_df <- dplyr::filter(Passengers_df, (usg_apt == "LAX"))
LAX_df_sorted1 <- LAX_df %>% group_by(Year) %>% summarise_at(vars(Total), sum)
LAX_mean <- mean(LAX_df_sorted1$Total)
LAX_var <- var(LAX_df_sorted1$Total)
LAX_std_dev <- sqrt(LAX_var)
LAX_skewness <- skewness(LAX_df_sorted1$Total)
LAX_kurtosis <- kurtosis(LAX_df_sorted1$Total)
MIA_df <- dplyr::filter(Passengers_df, (usg_apt == "MIA"))
MIA_df_sorted1 <- MIA_df %>% group_by(Year) %>% summarise_at(vars(Total), sum)
MIA_mean <- mean(MIA_df_sorted1$Total)
MIA_var <- var(MIA_df_sorted1$Total)
MIA_std_dev <- sqrt(MIA_var)
MIA_skewness <- skewness(MIA_df_sorted1$Total)
MIA_kurtosis <- kurtosis(MIA_df_sorted1$Total)
tab <- matrix(c(JFK_mean,LAX_mean,MIA_mean,JFK_var,LAX_var,MIA_var,JFK_std_dev,LAX_std_dev,MIA_std_dev,JFK_skewness,LAX_skewness,MIA_skewness, JFK_kurtosis,LAX_kurtosis,MIA_kurtosis), ncol=3,nrow = 5, byrow=TRUE)
colnames(tab) <- c('JFK','LAX','MIA')
rownames(tab) <- c('Mean','Variance','Std. deviation','Skewness','Kurtosis')
tab <- as.table(tab)
tab
## JFK LAX MIA
## Mean 2.358225e+07 1.814302e+07 1.762154e+07
## Variance 4.070393e+13 1.387669e+13 7.362230e+12
## Std. deviation 6.379964e+06 3.725143e+06 2.713343e+06
## Skewness 3.541980e-01 1.225688e+00 2.376995e-01
## Kurtosis -1.092538e+00 2.531447e-01 -1.795048e+00
PMF and CDF of airlines flying from JFK to LHR in year 2000 (share of monthly records per airline)
JFK_freq <- Passengers_df %>%
select(usg_apt, fg_apt, Year,Month ,airlineid,Total)%>%
dplyr::filter(usg_apt == 'JFK' & fg_apt == 'LHR' & Year == '2000') %>%
group_by(airlineid) %>%
summarise(airlineid_count = n()) %>%
mutate(pmf = airlineid_count/sum(airlineid_count))%>%
mutate(cdf = cumsum(pmf))
JFK_freq
## # A tibble: 9 x 4
## airlineid airlineid_count pmf cdf
## <int> <int> <dbl> <dbl>
## 1 19532 1 0.0123 0.0123
## 2 19533 12 0.148 0.160
## 3 19540 12 0.148 0.309
## 4 19616 12 0.148 0.457
## 5 19682 12 0.148 0.605
## 6 19805 12 0.148 0.753
## 7 19977 12 0.148 0.901
## 8 20045 7 0.0864 0.988
## 9 20312 1 0.0123 1
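The PMF and CDF columns follow directly from the record counts: each PMF value is a count divided by the total number of records (81 here), and the CDF is the running sum of the PMF. A minimal base R sketch that recomputes them from the counts printed above (the variable names are illustrative):

```r
# Airline IDs and record counts as printed in the JFK_freq table
airlineid    <- c(19532, 19533, 19540, 19616, 19682, 19805, 19977, 20045, 20312)
record_count <- c(1, 12, 12, 12, 12, 12, 12, 7, 1)

# PMF: relative frequency of each airline; CDF: cumulative sum of the PMF
pmf <- record_count / sum(record_count)
cdf <- cumsum(pmf)

pmf_table <- data.frame(airlineid, record_count,
                        pmf = round(pmf, 4), cdf = round(cdf, 3))
pmf_table
```

A valid PMF sums to 1, so the last CDF entry must equal 1, which is a quick sanity check for tables of this kind.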
Visualizing the ‘PMF’ table using a scatterplot
ggplot(data = JFK_freq , aes(x = airlineid, y = airlineid_count, color = pmf)) +
geom_point(size = 5)+ # a constant size belongs outside aes(), otherwise it adds a spurious legend
ggtitle("Scatterplot of PMF of airlines going from JFK to LHR in year 2000")
Finding Joint Probability of passengers at top 5 airports taking scheduled flights or chartered flights in the year 2019
Column1 <- dplyr::filter(Passengers_df, usg_apt %in% c("JFK","LAX","MIA","ORD","EWR") & Year == 2019)
Column1 <- Column1 %>% group_by(usg_apt) %>% summarise_at(vars(Scheduled, Charter), sum)
Column1 <- Column1 %>% select(Scheduled, Charter)
Joint_prob <- round(Column1/sum(Column1),3)
Joint_prob <- Joint_prob %>%
add_column(Airport = c("EWR","JFK","LAX","MIA","ORD"))
Joint_prob
## Scheduled Charter Airport
## 1 0.131 0.000 EWR
## 2 0.312 0.000 JFK
## 3 0.234 0.000 LAX
## 4 0.194 0.002 MIA
## 5 0.127 0.000 ORD
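Once a joint probability table is available, marginal and conditional probabilities follow from row and column sums. The sketch below rebuilds the rounded table printed above as a matrix and derives them; because the inputs are rounded to three decimals, the results are approximate.

```r
# Joint probabilities as printed above (rounded to 3 decimals)
joint <- matrix(
  c(0.131, 0.000,
    0.312, 0.000,
    0.234, 0.000,
    0.194, 0.002,
    0.127, 0.000),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("EWR", "JFK", "LAX", "MIA", "ORD"),
                  c("Scheduled", "Charter"))
)

# Marginal probability of each airport and of each flight type
p_airport <- rowSums(joint)
p_type    <- colSums(joint)

# Conditional probability P(Charter | MIA) = P(Charter and MIA) / P(MIA)
p_charter_given_MIA <- joint["MIA", "Charter"] / p_airport[["MIA"]]

p_airport
p_type
p_charter_given_MIA
```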
Plotting the Joint Probability
ggplot(data = Joint_prob , aes(x = Scheduled, y = Charter, color = Airport)) +
geom_point(size = 5)+ # a constant size belongs outside aes(), otherwise it adds a spurious legend
ggtitle("Joint Probability of passengers at top 5 airports taking scheduled flights or chartered flights in the year 2019")
Goodness of fit for distribution of passengers count per year for JFK airport
fit_n <- fitdist(JFK_df_sorted1$Total, "norm")
summary(fit_n)
## Fitting of the distribution ' norm ' by maximum likelihood
## Parameters :
## estimate Std. Error
## mean 23582251 NA
## sd 6218419 NA
## Loglikelihood: -341.2393 AIC: 686.4786 BIC: 688.4701
## Correlation matrix:
## [1] NA
par(mfrow=c(2,2))
plot.legend <- c("normal")
denscomp(list(fit_n), legendtext = plot.legend, xlab = '#passengers at JFK every year', xlegend = 'topleft')
cdfcomp (list(fit_n), legendtext = plot.legend, xlab = '#passengers at JFK every year')
qqcomp (list(fit_n), legendtext = plot.legend, xlab = '#passengers at JFK every year')
ppcomp (list(fit_n), legendtext = plot.legend, xlab = '#passengers at JFK every year')
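For a normal distribution, the maximum likelihood estimates have a closed form, so the `fitdist()` summary above can be checked by hand. The sketch below uses the 20 yearly JFK totals as they are printed in the regression section of this report; the variable names are illustrative. Note that the ML estimate of the standard deviation divides by n rather than n - 1, which is why it is smaller than the sample standard deviation in the descriptive statistics table.

```r
# Yearly JFK passenger totals, 2000-2019, as printed in this report
jfk_total <- c(18444274, 15897104, 14780207, 14968157, 17067640,
               18469265, 19528167, 21459711, 21996790, 21420820,
               22731710, 23472841, 24739981, 26196242, 27645230,
               29967272, 32129317, 32936207, 33857402, 33936686)
n <- length(jfk_total)

# Closed-form maximum likelihood estimates for a normal distribution
mu_hat <- mean(jfk_total)
sd_hat <- sqrt(mean((jfk_total - mu_hat)^2))  # divides by n, not n - 1

# Log-likelihood at the estimates, and the information criteria (2 parameters)
loglik <- sum(dnorm(jfk_total, mean = mu_hat, sd = sd_hat, log = TRUE))
aic <- 2 * 2 - 2 * loglik
bic <- 2 * log(n) - 2 * loglik

c(mean = mu_hat, sd = sd_hat, loglik = loglik, AIC = aic, BIC = bic)
```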
Hypothesis Testing
Significance level assumed = 0.05 (95% confidence level)
Hypothesis Testing 1:
H0: The mean number of passengers per month at JFK <= 1000000
H1: The mean number of passengers per month at JFK > 1000000
This is a right-tailed test
JFK_df_sorted2 <- JFK_df %>% group_by(usg_apt,Year,Month) %>% summarise_at(vars(Total), sum)
# Stratified random sample from the JFK dataset: JFK_df_sorted2 is still grouped by airport and Year,
# so sample_n() draws 3 months from each of the 20 years (60 rows in total)
JFK_df_sample <- sample_n(JFK_df_sorted2,3)
JFK_df_sample_mean <- mean(JFK_df_sample$Total)
JFK_df_sample_var <- var(JFK_df_sample$Total)
sample_data_length <- nrow(JFK_df_sample)
# Mean number of passengers flying from JFK per month, 2000-2019
JFK_mean <- mean(JFK_df_sorted2$Total)
# Variance of the number of passengers flying from JFK per month, 2000-2019
JFK_var <- var(JFK_df_sorted2$Total)
z <- (JFK_df_sample_mean - 1000000)/sqrt((JFK_var/(sample_data_length)))
df <- data.frame("Z_calc"=z,"P_value"= pnorm(z, lower.tail=FALSE))
df
## Z_calc P_value
## 1 12.46183 6.028517e-36
Conclusion: As p << 0.05, we reject the null hypothesis and conclude that the mean number of passengers per month at JFK is greater than 1000000
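The z statistic used here is the distance between the sample mean and the hypothesised mean, measured in standard errors, and the right-tailed p-value is the upper-tail area of the standard normal beyond it. A minimal sketch follows; `right_tailed_z_test` is a hypothetical helper and the inputs are illustrative numbers, not values from the dataset.

```r
# Hypothetical helper: right-tailed one-sample z-test with known population variance
right_tailed_z_test <- function(sample_mean, mu0, pop_var, n) {
  z <- (sample_mean - mu0) / sqrt(pop_var / n)    # distance in standard errors
  data.frame(Z_calc = z,
             P_value = pnorm(z, lower.tail = FALSE))  # upper-tail probability
}

# Illustrative inputs only: standard error = 200000 / sqrt(64) = 25000
res <- right_tailed_z_test(sample_mean = 1050000, mu0 = 1000000,
                           pop_var = 200000^2, n = 64)
res
```

With these inputs the sample mean lies two standard errors above the hypothesised value, so the null hypothesis would be rejected at the 0.05 level.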
Hypothesis Testing 2:
H0: The variance of the number of passengers per month at JFK >= 55x10^10
H1: The variance of the number of passengers per month at JFK < 55x10^10
This is a left-tailed test
varTest(JFK_df_sample$Total, alternative = "less", conf.level = 0.95,
sigma.squared = 550000000000, data.name = NULL)
##
## Chi-Squared Test on Variance
##
## data: JFK_df_sample$Total
## Chi-Squared = 45.75, df = 59, p-value = 0.1034
## alternative hypothesis: true variance is less than 5.5e+11
## 95 percent confidence interval:
## 0 594304794451
## sample estimates:
## variance
## 426482264029
Conclusion: As p = 0.1034 > 0.05, we fail to reject the null hypothesis; the sample does not provide sufficient evidence that the variance is less than 55x10^10
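The chi-squared statistic reported by `varTest()` can be recomputed by hand from the printed sample variance as (n - 1) s^2 / sigma0^2, with the left-tail area of the chi-squared distribution as the p-value. A minimal sketch using the values from the output above (n = 60 follows from df = 59; the variable names are illustrative):

```r
n         <- 60             # sample size, from df = 59 in the output
s2        <- 426482264029   # sample variance, from the output
sigma0_sq <- 5.5e11         # hypothesised variance

# Test statistic and left-tailed p-value
chi_sq  <- (n - 1) * s2 / sigma0_sq
p_value <- pchisq(chi_sq, df = n - 1)

# One-sided 95% upper confidence bound for the variance
upper_bound <- (n - 1) * s2 / qchisq(0.05, df = n - 1)

c(chi_sq = chi_sq, p_value = p_value, upper_bound = upper_bound)
```

The recomputed p-value is about 0.103, which is above 0.05, so the printed sample does not lead to rejecting the null hypothesis.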
Hypothesis Testing 3:
H0: The proportion of passengers flying BA from JFK to LHR in 2008 >= 0.41
H1: The proportion of passengers flying BA from JFK to LHR in 2008 < 0.41
This is a left-tailed test
JFK_LHR <- Passengers_df %>% dplyr::filter(usg_apt == 'JFK' & fg_apt == 'LHR' & Year == 2008)
JFK_LHR_BA <- JFK_LHR %>% dplyr::filter(carrier == 'BA')
Num_of_jfk_to_lhr <- JFK_LHR %>% select(Total) %>% summarise_at(vars(Total), sum)
Num_of_jfk_to_lhr[1,]
## [1] 2762143
Num_of_jfk_to_lhr_1 <- JFK_LHR_BA %>% select(Total) %>% summarise_at(vars(Total), sum)
prop.test(x= Num_of_jfk_to_lhr_1[1,], n=Num_of_jfk_to_lhr[1,], p=0.41, correct = TRUE, conf.level = 0.95,
alternative = "less")
##
## 1-sample proportions test with continuity correction
##
## data: Num_of_jfk_to_lhr_1[1, ] out of Num_of_jfk_to_lhr[1, ], null probability 0.41
## X-squared = 58.409, df = 1, p-value = 1.065e-14
## alternative hypothesis: true p is less than 0.41
## 95 percent confidence interval:
## 0.0000000 0.4082247
## sample estimates:
## p
## 0.4077381
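Conclusion: As p << 0.05, we reject the null hypothesis and conclude that the proportion of passengers flying BA from JFK to LHR in 2008 is less than 0.41. With about 2.76 million passengers, even the small gap between the observed 0.4077 and 0.41 is statistically significant.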
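The `prop.test()` statistic can be approximated by hand with the continuity-corrected formula X^2 = (|x - n p0| - 0.5)^2 / (n p0 (1 - p0)). The sketch below uses the total count and the rounded sample proportion from the output above, so the result agrees with the printed value only up to rounding; the variable names are illustrative.

```r
n     <- 2762143     # passengers JFK -> LHR in 2008, from the output
p_hat <- 0.4077381   # sample proportion for BA, from the output (rounded)
p0    <- 0.41        # hypothesised proportion

# Continuity-corrected chi-squared statistic
x_sq <- (abs(n * p_hat - n * p0) - 0.5)^2 / (n * p0 * (1 - p0))

# Left-tailed p-value: signed square root compared with the standard normal
z       <- sign(p_hat - p0) * sqrt(x_sq)
p_value <- pnorm(z)

c(x_sq = x_sq, z = z, p_value = p_value)
```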
Linear Regression
The goodness-of-fit plots suggest that the yearly JFK totals (JFK_df_sorted1) are approximately normally distributed. From the correlation plot of the same dataset, we see that Year and Total are highly correlated. Therefore, a regression is run with the total number of passengers from JFK airport per year as the dependent variable and Year as the independent variable.
JFK_df_sorted1
## # A tibble: 20 x 2
## Year Total
## <int> <int>
## 1 2000 18444274
## 2 2001 15897104
## 3 2002 14780207
## 4 2003 14968157
## 5 2004 17067640
## 6 2005 18469265
## 7 2006 19528167
## 8 2007 21459711
## 9 2008 21996790
## 10 2009 21420820
## 11 2010 22731710
## 12 2011 23472841
## 13 2012 24739981
## 14 2013 26196242
## 15 2014 27645230
## 16 2015 29967272
## 17 2016 32129317
## 18 2017 32936207
## 19 2018 33857402
## 20 2019 33936686
linear_model <- lm(Total~Year,data = JFK_df_sorted1)
summary(linear_model)
##
## Call:
## lm(formula = Total ~ Year, data = JFK_df_sorted1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1828619 -1122920 -407858 779640 4779255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.074e+09 1.281e+08 -16.19 3.59e-12 ***
## Year 1.044e+06 6.377e+04 16.37 2.96e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1645000 on 18 degrees of freedom
## Multiple R-squared: 0.9371, Adjusted R-squared: 0.9336
## F-statistic: 268 on 1 and 18 DF, p-value: 2.962e-12
summary(JFK_df_sorted1$Total - linear_model$fitted.values)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1828619 -1122920 -407858 0 779640 4779255
Test_year <- data.frame(Year=c(2016,2017,2018,2019))
predict_passenger_no <- predict(linear_model,newdata = Test_year, interval = 'confidence')
Conclusion: The residual summary (Median < Mean) shows that the residuals are not symmetric but right-skewed. From the Coefficients output we can conclude the following:
1. Equation of the model: Total number of passengers in year x = 1.044e+06(x) - 2.074e+09
2. For 2017 the model predicts 31411645 against an actual value of 32936207, an underestimate of about 1.5M (roughly 4.6%). Note that 2016-2019 were also used to fit the model, so this is an in-sample check rather than a true forecast.
3. The t-value of 16.37 means that the Year coefficient is 16.37 standard errors away from 0, so the slope is clearly different from zero: passenger numbers show a significant linear trend over the years.
4. As the p-values in our model are extremely small, there is strong evidence of a linear relationship between Year and number of passengers.
5. The three asterisks indicate that the Year coefficient is significant at the 0.001 level.
6. The residual standard error shows that, on average, the actual number of passengers per year at JFK deviates from the predicted value by about 1645000 (1.6M). Relative to the maximum actual value of about 33.9M this is roughly 5%, which suggests that the model fits the data reasonably well.
7. Year explains ~93.71% of the variation in the number of passengers, our dependent variable. Thus, we can conclude that the model fits the data very well.
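The regression can be reproduced from the 20 yearly totals printed above, and a simple holdout check gives a fairer picture of predictive accuracy than predicting years that were also used for fitting. The sketch below is one way to do this; the holdout split (fit on 2000-2015, predict 2016-2019) is an added assumption and not part of the original analysis.

```r
# Yearly JFK passenger totals, 2000-2019, as printed in JFK_df_sorted1
jfk <- data.frame(
  Year  = 2000:2019,
  Total = c(18444274, 15897104, 14780207, 14968157, 17067640,
            18469265, 19528167, 21459711, 21996790, 21420820,
            22731710, 23472841, 24739981, 26196242, 27645230,
            29967272, 32129317, 32936207, 33857402, 33936686)
)

# Full-sample fit, as in the report
full_model <- lm(Total ~ Year, data = jfk)
slope      <- unname(coef(full_model)["Year"])
r_squared  <- summary(full_model)$r.squared
pred_2017  <- unname(stats::predict(full_model, newdata = data.frame(Year = 2017)))

# Holdout check (assumption): fit on 2000-2015 only, then predict 2016-2019
train <- subset(jfk, Year <= 2015)
test  <- subset(jfk, Year >= 2016)
holdout_model <- lm(Total ~ Year, data = train)
holdout_pred  <- stats::predict(holdout_model, newdata = test)

# Mean absolute percentage error on the held-out years
holdout_mape <- mean(abs(test$Total - holdout_pred) / test$Total) * 100

c(slope = slope, r_squared = r_squared, pred_2017 = pred_2017)
holdout_mape
```

`stats::predict` is called explicitly because, in the session shown in this report, `predict` is masked by the EnvStats package.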