All the packages that are reuired for this analysis are given below:
#Required packages
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(magrittr)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(infotheo)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ stringr 1.4.0
## ✓ tibble 3.0.3 ✓ forcats 0.5.0
## ✓ purrr 0.3.4
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date() masks base::date()
## x tidyr::extract() masks magrittr::extract()
## x dplyr::filter() masks stats::filter()
## x lubridate::intersect() masks base::intersect()
## x dplyr::lag() masks stats::lag()
## x purrr::set_names() masks magrittr::set_names()
## x lubridate::setdiff() masks base::setdiff()
## x lubridate::union() masks base::union()
library(stringr)
This report briefly explains various data preprocessing steps such as loading, tidying, manipulating, scanning and transforming data along with combining two data sets namely “Operated Flights and Seats” and “Traffic by City Pairs” acquired from official data.gov.au website. Get data - Two data sets (operated_flights_rawdata and city_pairs_rawdata) are loaded in RStudio using read_csv(). The number of rows and columns in data sets are observed and further subsets of these two datasets are produced by selecting only first 2500 rows from each data set. Only 3 columns from first dataset and 5 columns from second data set were selected to form subset_operated_flights and subset_city_pairs data frames respectively. Tidy - The selected rows and columns of both datasets are then observed and two columns namely, Freight_In and Freight_Out in subset_city_pairs dataset were in wide format as Freight doesn’t have its own column and Freight weight in tonnes is spread over two columns. To reorganize this data, data frame was converted from wide to long format with freight type (inbound or outound) having its own column and freight weight in one column. After this step, both datasets are merged into one dataframe merged_data using inner_join() by specifying ForeignPort and International_City as the common columns to be joined on. Two column were converted to factors using as.factor() and new data frame filtered_data was formed by filtering only the USA entries from the Country variable. Then, its structure, attributes and levels are observed. Manipulate – A new variable vacant_seats was created by subtracting Passengers_Total from Max_Seats and this new variable was added to the filtered_data. Further, summary statistics for each numeric variable present in the data frame was calculated using summarise() function. Scan - The presence of any missing values and outliers is checked. Is.na() is used to see the occurrence of missing values in the numeric variables of the data frame. Using Tukey’s method of outlier detection, outliers were found and removed by creating factors of the dataset for every city in USA and applying cap() function. Transformation - At the end, one variable Max_Seats is considered to apply data transformation. Histograms for each factor was plotted to observe the distribution along with a line for its mean and median values. Appropriate transformation techniques were applied and mathematical operators were used for this.
The description of the two datasets used in this investigation is given below: First DataSet: "*International Airlines - Operated Flights and Seats" is taken from Bureau of Infrastructure, Transport and Regional Economics(available at data.gov.au) loacted at following website: https://data.gov.au/dataset/ds-dga-e82787e4-a480-4189-b963-1d0b6088103e/details?q= It contains details about Scheduled services operated by international airlines to and from Australia and operated flights and seats covered by city, airline, route, country and region. Second DataSet: Datasets “International Airlines - Traffic by City Pairs” is taken from Bureau of Infrastructure, Transport and Regional Economics(available at data.gov.au) loacted at following website: https://data.gov.au/dataset/ds-dga-d9fbffaa-836f-4f52-80e8-324249ff269f/details?q= It contains details about Scheduled operations of international airlines operating to and from Australia as well as details about passengers, freight and mail carried between city pairs given by flight number service.
Steps-
Load - Both data sets were available in .csv format. Using the read_csv() function in readr package, these datasets were imported in RStudio. After loading the data, two data frames were created namely operated_flights_rawdata and city_pairs_rawdata.
Subsets - The rows and columns in both datasets are observed and first 2500 rows were selected from each dataset for this investigation as well as all the unecessary columns were removed. After subsetting these data frames, subset_operated_flights and subset_city_pairs are created.
#Loading Data
operated_flights_rawdata <- read_csv("international_airline_activity_opfltsseats.csv")
## Parsed with column specification:
## cols(
## Month = col_double(),
## In_Out = col_character(),
## Australian_City = col_character(),
## International_City = col_character(),
## Airline = col_character(),
## Route = col_character(),
## Port_Country = col_character(),
## Port_Region = col_character(),
## Service_Country = col_character(),
## Service_Region = col_character(),
## Stops = col_double(),
## All_Flights = col_double(),
## Max_Seats = col_double(),
## Year = col_double(),
## Month_num = col_double()
## )
head(operated_flights_rawdata)
city_pairs_rawdata <- read_csv("city_pairs (1).csv")
## Parsed with column specification:
## cols(
## Month = col_double(),
## AustralianPort = col_character(),
## ForeignPort = col_character(),
## Country = col_character(),
## Passengers_In = col_double(),
## `Freight_In_(tonnes)` = col_double(),
## `Mail_In_(tonnes)` = col_double(),
## Passengers_Out = col_double(),
## `Freight_Out_(tonnes)` = col_double(),
## `Mail_Out_(tonnes)` = col_double(),
## Passengers_Total = col_double(),
## `Freight_Total_(tonnes)` = col_double(),
## `Mail_Total_(tonnes)` = col_double(),
## Year = col_double(),
## Month_num = col_double()
## )
head(city_pairs_rawdata)
operated_flights <- operated_flights_rawdata[,-c(1,2,3,6,7,8,9,10,11,12,14,15)]
head(operated_flights)
city_pairs <- city_pairs_rawdata[,-c(1,2,5,7,8,10,12,13,14,15)]
head(city_pairs)
#subsetting to get first 2,500 rows from both data sets
subset_operated_flights <- operated_flights[1:2500,]
subset_city_pairs <- city_pairs[1:2500,]
Here, both datasets were checked if these are in correct format to carry out the analysis or not. The second dataset i.e. subset_city_pairs was observed to be untidy since numerical entries which can be in single column were spread in two columns i.e. Freight_In_(tonnes) and Freight_Out_(tonnes). This shows that data is currently in wide format and needs to be changed to long format to get these two columns: Freight_type and Freight_weight which will contain type of frieght(inbound and outbound entries) and weight of freight in tonnes respectively. Therefore, for tidying this dataset, gather() function was used and data was converted from wide format to long format. Then, values in Freight_type were observed to be of two types Freight_Out_(tonnes) and Freight_In_(tonnes). To understand the column entries clearly, Freight_Out_(tonnes) was changed to Outbound and Freight_In_(tonnes) was changed to inbound. Hence, data is now in a tidy format.
#Tidy & Manipulate Data I
#Wide format to long format using gather() function
subset_city_pairs <- subset_city_pairs %>%
gather(`Freight_In_(tonnes)`,`Freight_Out_(tonnes)`, key="Freight_type", value="Freight_weight")
head(subset_city_pairs)
#replacing Freight_Out_(tonnes) by Out and in by in
subset_city_pairs$Freight_type[subset_city_pairs$Freight_type == "Freight_Out_(tonnes)"] <- "Outbound"
subset_city_pairs$Freight_type[subset_city_pairs$Freight_type == "Freight_In_(tonnes)"] <- "Inbound"
head(subset_city_pairs)
Next, a merged data set is to be created by joining subset_operated_flights and subset_city_pairs. Now, considering only the common values from both these data frame, inner_join() is applied to join the dataframes.
# Understand Section
#merging the data with inner_join() function to get common entries in two datasets
merged_data <- as.data.frame(inner_join(subset_city_pairs, subset_operated_flights,
by=c("ForeignPort"="International_City")))
head(merged_data)
str(merged_data)
## 'data.frame': 291732 obs. of 7 variables:
## $ ForeignPort : chr "Auckland" "Auckland" "Auckland" "Auckland" ...
## $ Country : chr "New Zealand" "New Zealand" "New Zealand" "New Zealand" ...
## $ Passengers_Total: num 2498 2498 2498 2498 2498 ...
## $ Freight_type : chr "Inbound" "Inbound" "Inbound" "Inbound" ...
## $ Freight_weight : num 42.2 42.2 42.2 42.2 42.2 ...
## $ Airline : chr "Air New Zealand" "Freedom Air International" "Garuda Indonesia" "Malaysia Airlines" ...
## $ Max_Seats : num 12624 2556 2296 5404 7733 ...
#View(merged_data)
#checking the datatypes in merged data and converting Country and Airline columns into factors
merged_data$Country <- as.factor(merged_data$Country)
merged_data$Airline <- as.factor(merged_data$Airline)
#Checking freight_weight, Max_seats for the cities in USA
#filtering the merged data by applying filter function
filtered_data <- merged_data %>% filter(Country =="USA")
#checking the structure of filtered_data
str(filtered_data)
## 'data.frame': 19598 obs. of 7 variables:
## $ ForeignPort : chr "Honolulu" "Honolulu" "Honolulu" "Honolulu" ...
## $ Country : Factor w/ 27 levels "Bahrain","Brunei",..: 25 25 25 25 25 25 25 25 25 25 ...
## $ Passengers_Total: num 654 654 654 654 654 654 654 654 654 654 ...
## $ Freight_type : chr "Inbound" "Inbound" "Inbound" "Inbound" ...
## $ Freight_weight : num 6.62 6.62 6.62 6.62 6.62 ...
## $ Airline : Factor w/ 38 levels "Aerolineas Argentinas",..: 3 30 30 30 3 30 30 30 3 30 ...
## $ Max_Seats : num 8455 3116 2437 227 8440 ...
#checking the attributes of filtered_data
#attributes(filtered_data)
#checking the levels Country column of filtered_data
levels(filtered_data$Country)
## [1] "Bahrain" "Brunei" "Canada"
## [4] "China" "Fiji" "France"
## [7] "Germany" "Hong Kong" "Indonesia"
## [10] "Italy" "Japan" "Malaysia"
## [13] "Mauritius" "Nauru" "New Caledonia"
## [16] "New Zealand" "Papua New Guinea" "Philippines"
## [19] "Singapore" "Solomon Islands" "South Africa"
## [22] "Thailand" "UK" "United Arab Emirates"
## [25] "USA" "Vanuatu" "Western Samoa"
levels(filtered_data$Airline)
## [1] "Aerolineas Argentinas" "Air Caledonie"
## [3] "Air Canada" "Air China"
## [5] "Air Mauritius" "Air New Zealand"
## [7] "Air Niugini" "Air Paradise International"
## [9] "Air Vanuatu" "Australian Airlines"
## [11] "British Airways" "Cathay Pacific Airways"
## [13] "China Eastern Airlines" "China Southern Airlines"
## [15] "Emirates" "Fiji Airways"
## [17] "Freedom Air International" "Garuda Indonesia"
## [19] "Gulf Air" "Hawaiian Airlines"
## [21] "Japan Airlines" "LATAM Airlines"
## [23] "Lauda Air" "Malaysia Airlines"
## [25] "Merpati Nusantara Airlines" "Nauru Airlines"
## [27] "Norfolk Jet Express" "Philippine Airlines"
## [29] "Polynesian Airlines" "Qantas Airways"
## [31] "Royal Brunei Airlines" "Royal Tongan Airlines"
## [33] "Singapore Airlines" "Solomon Airlines"
## [35] "South African Airways" "Thai Airways International"
## [37] "United Airlines" "Virgin Australia"
A new variable vacant_seats is calculated by subtracting total number of passengers(Passengers_Toatl) from Maximum seats(Maximum_Seats) available in a flight. This will give the details about how many seats are vacant and were not occupied by the travellers. To calculate this and add it to existsting dataset i.e. filtered_data, mutate() function was used. Then, descriptive statistics for Max_Seats, Freight_weight and vacant seats was calculated using the summarise() function.
#Tidy & Manipulate Data II
#creating a new variable vacant_seats by subtracting total passengers from max seats in a flight
filtered_data <- filtered_data %>% mutate(vacant_seats = Max_Seats-Passengers_Total)
head(filtered_data)
#Tidy and Manipulate data II
#descriptive statistics for Max_Seats, Freight_weight, and vacant_seats
filtered_data %>% group_by(ForeignPort) %>% summarise(Min = min(Max_Seats,na.rm = TRUE),
Q1 = quantile(Max_Seats,probs = .25,na.rm = TRUE),
Median = median(Max_Seats, na.rm = TRUE),
Q3 = quantile(Max_Seats,probs = .75,na.rm = TRUE),
Max = max(Max_Seats,na.rm = TRUE),
Mean = mean(Max_Seats, na.rm = TRUE),
number = n(),
null = sum(is.na(Max_Seats)))
## `summarise()` ungrouping output (override with `.groups` argument)
filtered_data %>% group_by(ForeignPort) %>% summarise(Min = min(Freight_weight,na.rm = TRUE),
Q1 = quantile(Freight_weight,probs = .25,na.rm = TRUE),
Median = median(Freight_weight, na.rm = TRUE),
Q3 = quantile(Freight_weight,probs = .75,na.rm = TRUE),
Max = max(Freight_weight,na.rm = TRUE),
Mean = mean(Freight_weight, na.rm = TRUE),
number = n(),
null = sum(is.na(Freight_weight)))
## `summarise()` ungrouping output (override with `.groups` argument)
filtered_data %>% group_by(ForeignPort) %>% summarise(Min = min(vacant_seats,na.rm = TRUE),
Q1 = quantile(vacant_seats,probs = .25,na.rm = TRUE),
Median = median(vacant_seats, na.rm = TRUE),
Q3 = quantile(vacant_seats,probs = .75,na.rm = TRUE),
Max = max(vacant_seats,na.rm = TRUE),
Mean = mean(vacant_seats, na.rm = TRUE),
number = n(),
null = sum(is.na(vacant_seats)))
## `summarise()` ungrouping output (override with `.groups` argument)
To verify if there are any missing values present in the numeric variable of the dataset, is.na() function is used. No missing values were observed and FALSE was returned after applying the is.na() function. A function to check the presence of infinite or NaN values in every numeric column, is also given below but since, the output is too large, it is commented. (Note: only the head of all the missing values is given here in this report since output is too big. The output for checking the presence of missing values was first checked for every column using is.na(filtered_data$Passengers_Total), but only head of this output is shown in the report).
#Scan I
#Checking if there are any missing values are present
head(is.na(filtered_data$Passengers_Total))
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
head(is.na(filtered_data$Max_Seats))
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
head(is.na(filtered_data$vacant_seats))
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
head(is.na(filtered_data$Freight_weight))
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
# Check every numerical column having infinite or NaN values using a function called is.special
#is.special <- function(x){
# if (is.numeric(x)) (is.infinite(x) | is.nan(x))
#}
# apply this function to the data frame.
#sapply(filtered_data, is.special)
#Scan II
#Checking Outliers by using tukey's method of outlier detection
boxplot_max_seats <- filtered_data %>% boxplot(Max_Seats ~ ForeignPort, data = .,
main="Box Plot for Max seats in USA cities flights",
ylab = "Max_Seats", xlab = "USA cities", col= c("blue"))
boxplot_freight <- filtered_data %>% boxplot(Freight_weight ~ ForeignPort, data = .,
main="Box Plot for freight weight in USA cities flights",
ylab = "Freight weights", xlab = "USA cities", col= c("orange"))
boxplot_vacant_seats <- filtered_data %>% boxplot(vacant_seats ~ ForeignPort, data=.,
main="Box Plot for Vacant seats in USA cities flights",
ylab = "Vacant seats", xlab = "USA cities", col = "green")
boxplot_passengers <- filtered_data %>% boxplot(Passengers_Total ~ ForeignPort, data = .,
main = "Boxplot for Passenger Total in USA cities flight",
ylab = "Passengers_Total", xlab = "USA cities", col=c("pink"))
head(boxplot_max_seats$out)
## [1] 22340 22552 23498 22340 22552 23498
head(boxplot_freight$out)
## [1] 102.89 102.89 102.89 102.89 102.89 102.89
head(boxplot_vacant_seats$out)
## [1] -11035 -11035 -10808 -10904 -9662 -9662
head(boxplot_passengers$out)
## [1] 11262 11262 11262 11262 11262 11262
#Subset filtered_data to f1,f2,f3,f4 and f5 on ForeignPort
f1 <- filtered_data %>% filter(ForeignPort == "Honolulu")
f2 <- filtered_data %>% filter(ForeignPort == "Los Angeles")
f3 <- filtered_data %>% filter(ForeignPort == "New York")
f4 <- filtered_data %>% filter(ForeignPort == "San Francisco")
f5 <- filtered_data %>% filter(ForeignPort == "Seattle")
# Define a function to cap the values outside the limits
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
#Removing outliers using capping
f1$Max_Seats <- f1$Max_Seats %>% cap()
f2$Max_Seats <- f2$Max_Seats %>% cap()
f3$Max_Seats <- f3$Max_Seats %>% cap()
f4$Max_Seats <- f4$Max_Seats %>% cap()
f5$Max_Seats <- f5$Max_Seats %>% cap()
f1$Freight_weight <- f1$Freight_weight %>% cap()
f2$Freight_weight <- f2$Freight_weight %>% cap()
f3$Freight_weight <- f3$Freight_weight %>% cap()
f4$Freight_weight <- f4$Freight_weight %>% cap()
f5$Freight_weight <- f5$Freight_weight %>% cap()
f1$vacant_seats <- f1$vacant_seats %>% cap()
f2$vacant_seats <- f2$vacant_seats %>% cap()
f3$vacant_seats <- f3$vacant_seats %>% cap()
f4$vacant_seats <- f4$vacant_seats %>% cap()
f5$vacant_seats <- f5$vacant_seats %>% cap()
f1$Passengers_Total <- f1$Passengers_Total %>% cap()
f2$Passengers_Total <- f2$Passengers_Total %>% cap()
f3$Passengers_Total <- f3$Passengers_Total %>% cap()
f4$Passengers_Total <- f4$Passengers_Total %>% cap()
f5$Passengers_Total <- f5$Passengers_Total %>% cap()
#Transform
#histograms for Max_Seats in Different city ports of USA
hist(f1$Max_Seats, main="Histogram of Max_Seats in Honolulu", xlab = "Max Seats")
abline(v = mean(f1$Max_Seats), col="red", lwd=2)
abline(v = median(f1$Max_Seats), col="green", lwd=2)
hist(f2$Max_Seats, main="Histogram of Max_Seats in Los Angeles", xlab = "Max Seats")
abline(v = mean(f2$Max_Seats), col="red", lwd=2)
abline(v = median(f2$Max_Seats), col="green", lwd=2)
hist(f3$Max_Seats, main="Histogram of Max_Seats in New York", xlab = "Max Seats")
abline(v = mean(f3$Max_Seats), col="red", lwd=2)
abline(v = median(f3$Max_Seats), col="green", lwd=2)
hist(f4$Max_Seats, main="Histogram of Max_Seats in San Francisco", xlab = "Max Seats")
abline(v = mean(f4$Max_Seats), col="red", lwd=2)
abline(v = median(f4$Max_Seats), col="green", lwd=2)
#Applying appropriate transformation on factors
#Applying boxcox, sqrt, log10 and log transformation on f2$Max_Seats i.e. max seats in Los Angeles
sqrt_LA <- sqrt(f2$Max_Seats)
hist(sqrt_LA)
log_LA <- log10(f2$Max_Seats)
hist(log_LA)
ln_LA <- log(f2$Max_Seats)
hist(ln_LA)
boxcox_LA <- BoxCox(f2$Max_Seats, lambda="auto")
hist(boxcox_LA)
#Applying cube transformation on f3$Max_Seats .i.e max seats in New York
cube_NY <- f3$Max_Seats^3
hist(cube_NY)
#Applying cube and boxcox on f4$Max_Seats i.e. max seats in San Francisco
boxcox_san <- BoxCox(f4$Max_Seats, lambda="auto")
hist(boxcox_san)
cube_san <- f4$Max_Seats^3
hist(cube_san)
[1]. Bureau of Infrastructure Transport and Regional Economics 2016, International Airlines - Operated Flights and Seats, Dataset, Australian Government, Bureau of Infrastructure Transport and Regional Economics, viewed 13 October 2020, https://data.gov.au/dataset/ds-dga-e82787e4-a480-4189-b963-1d0b6088103e/details?q=.
[2]. Bureau of Infrastructure Transport and Regional Economics 2016, International Airlines - Traffic by city pairs, Dataset, Australian Government, Bureau of Infrastructure Transport and Regional Economics, viewed 13 October 2020, https://data.gov.au/dataset/ds-dga-d9fbffaa-836f-4f52-80e8-324249ff269f/details?q=.