MATH2349 Data Wrangling

Required packages

All the packages that are reuired for this analysis are given below:

#Required packages
library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(magrittr)
library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(infotheo)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:magrittr':
## 
##     extract

library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ stringr 1.4.0
## ✓ tibble  3.0.3     ✓ forcats 0.5.0
## ✓ purrr   0.3.4

## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date()        masks base::date()
## x tidyr::extract()         masks magrittr::extract()
## x dplyr::filter()          masks stats::filter()
## x lubridate::intersect()   masks base::intersect()
## x dplyr::lag()             masks stats::lag()
## x purrr::set_names()       masks magrittr::set_names()
## x lubridate::setdiff()     masks base::setdiff()
## x lubridate::union()       masks base::union()

library(stringr)

Executive Summary

This report briefly explains various data preprocessing steps such as loading, tidying, manipulating, scanning and transforming data along with combining two data sets namely “Operated Flights and Seats” and “Traffic by City Pairs” acquired from official data.gov.au website. Get data - Two data sets (operated_flights_rawdata and city_pairs_rawdata) are loaded in RStudio using read_csv(). The number of rows and columns in data sets are observed and further subsets of these two datasets are produced by selecting only first 2500 rows from each data set. Only 3 columns from first dataset and 5 columns from second data set were selected to form subset_operated_flights and subset_city_pairs data frames respectively. Tidy - The selected rows and columns of both datasets are then observed and two columns namely, Freight_In and Freight_Out in subset_city_pairs dataset were in wide format as Freight doesn’t have its own column and Freight weight in tonnes is spread over two columns. To reorganize this data, data frame was converted from wide to long format with freight type (inbound or outound) having its own column and freight weight in one column. After this step, both datasets are merged into one dataframe merged_data using inner_join() by specifying ForeignPort and International_City as the common columns to be joined on. Two column were converted to factors using as.factor() and new data frame filtered_data was formed by filtering only the USA entries from the Country variable. Then, its structure, attributes and levels are observed. Manipulate – A new variable vacant_seats was created by subtracting Passengers_Total from Max_Seats and this new variable was added to the filtered_data. Further, summary statistics for each numeric variable present in the data frame was calculated using summarise() function. Scan - The presence of any missing values and outliers is checked. Is.na() is used to see the occurrence of missing values in the numeric variables of the data frame. Using Tukey’s method of outlier detection, outliers were found and removed by creating factors of the dataset for every city in USA and applying cap() function. Transformation - At the end, one variable Max_Seats is considered to apply data transformation. Histograms for each factor was plotted to observe the distribution along with a line for its mean and median values. Appropriate transformation techniques were applied and mathematical operators were used for this.

Data

The description of the two datasets used in this investigation is given below: First DataSet: "*International Airlines - Operated Flights and Seats" is taken from Bureau of Infrastructure, Transport and Regional Economics(available at data.gov.au) loacted at following website: https://data.gov.au/dataset/ds-dga-e82787e4-a480-4189-b963-1d0b6088103e/details?q= It contains details about Scheduled services operated by international airlines to and from Australia and operated flights and seats covered by city, airline, route, country and region. Second DataSet: Datasets “International Airlines - Traffic by City Pairs” is taken from Bureau of Infrastructure, Transport and Regional Economics(available at data.gov.au) loacted at following website: https://data.gov.au/dataset/ds-dga-d9fbffaa-836f-4f52-80e8-324249ff269f/details?q= It contains details about Scheduled operations of international airlines operating to and from Australia as well as details about passengers, freight and mail carried between city pairs given by flight number service.

Steps-

Load - Both data sets were available in .csv format. Using the read_csv() function in readr package, these datasets were imported in RStudio. After loading the data, two data frames were created namely operated_flights_rawdata and city_pairs_rawdata.
Subsets - The rows and columns in both datasets are observed and first 2500 rows were selected from each dataset for this investigation as well as all the unecessary columns were removed. After subsetting these data frames, subset_operated_flights and subset_city_pairs are created.
Variables in subset_operated_flights: International_City, Airline, and Max_Seats.
- International_City: Names of international cities
- Airline: Names of airlines travelling to different cities
- Max_Seats: Maximum seats in that particular airline
Variables in subset_city_pairs: Foreign_Port, Country, Freight_In_(tonnes), Freight_Out_(tonnes), and Passenger_Total
- Foreign_Port: Name of international port where traffic is uplifted or discharged within a single flight number
- Country: Name of countries based on the international uplift or discharge port within a single flight number
- Freight_In_(tonnes): Freight weight inbound in tonnes to Australia
- Freight_Out_(tonnes): Freight weight outbound in tonnes to Australia
- Passenger_Total: Total number of passengers in flight

#Loading Data 
operated_flights_rawdata <- read_csv("international_airline_activity_opfltsseats.csv")

## Parsed with column specification:
## cols(
##   Month = col_double(),
##   In_Out = col_character(),
##   Australian_City = col_character(),
##   International_City = col_character(),
##   Airline = col_character(),
##   Route = col_character(),
##   Port_Country = col_character(),
##   Port_Region = col_character(),
##   Service_Country = col_character(),
##   Service_Region = col_character(),
##   Stops = col_double(),
##   All_Flights = col_double(),
##   Max_Seats = col_double(),
##   Year = col_double(),
##   Month_num = col_double()
## )

head(operated_flights_rawdata)

city_pairs_rawdata <- read_csv("city_pairs (1).csv")

## Parsed with column specification:
## cols(
##   Month = col_double(),
##   AustralianPort = col_character(),
##   ForeignPort = col_character(),
##   Country = col_character(),
##   Passengers_In = col_double(),
##   `Freight_In_(tonnes)` = col_double(),
##   `Mail_In_(tonnes)` = col_double(),
##   Passengers_Out = col_double(),
##   `Freight_Out_(tonnes)` = col_double(),
##   `Mail_Out_(tonnes)` = col_double(),
##   Passengers_Total = col_double(),
##   `Freight_Total_(tonnes)` = col_double(),
##   `Mail_Total_(tonnes)` = col_double(),
##   Year = col_double(),
##   Month_num = col_double()
## )

head(city_pairs_rawdata)

operated_flights <- operated_flights_rawdata[,-c(1,2,3,6,7,8,9,10,11,12,14,15)]
head(operated_flights)

city_pairs <- city_pairs_rawdata[,-c(1,2,5,7,8,10,12,13,14,15)]
head(city_pairs)

#subsetting to get first 2,500 rows from both data sets
subset_operated_flights <- operated_flights[1:2500,]
subset_city_pairs <- city_pairs[1:2500,]

Tidy & Manipulate Data I

Here, both datasets were checked if these are in correct format to carry out the analysis or not. The second dataset i.e. subset_city_pairs was observed to be untidy since numerical entries which can be in single column were spread in two columns i.e. Freight_In_(tonnes) and Freight_Out_(tonnes). This shows that data is currently in wide format and needs to be changed to long format to get these two columns: Freight_type and Freight_weight which will contain type of frieght(inbound and outbound entries) and weight of freight in tonnes respectively. Therefore, for tidying this dataset, gather() function was used and data was converted from wide format to long format. Then, values in Freight_type were observed to be of two types Freight_Out_(tonnes) and Freight_In_(tonnes). To understand the column entries clearly, Freight_Out_(tonnes) was changed to Outbound and Freight_In_(tonnes) was changed to inbound. Hence, data is now in a tidy format.

#Tidy & Manipulate Data I 
#Wide format to long format using gather() function
subset_city_pairs <- subset_city_pairs %>% 
  gather(`Freight_In_(tonnes)`,`Freight_Out_(tonnes)`, key="Freight_type", value="Freight_weight")
head(subset_city_pairs)

#replacing Freight_Out_(tonnes) by Out and in by in
subset_city_pairs$Freight_type[subset_city_pairs$Freight_type == "Freight_Out_(tonnes)"] <- "Outbound"
subset_city_pairs$Freight_type[subset_city_pairs$Freight_type == "Freight_In_(tonnes)"] <- "Inbound"
head(subset_city_pairs)

Understand

Next, a merged data set is to be created by joining subset_operated_flights and subset_city_pairs. Now, considering only the common values from both these data frame, inner_join() is applied to join the dataframes.

Data was joined on the basis of specifying the common attribute present i.e ForeignPort in subset_city_pairs and Internation_City in subset_operated_flights.
After applying inner join, a new data frame merged_data was created containing these variables: ForeignPort, Country, Passengers_Total, Freight_type, Freight_weight, Airline, and Max_Seats. Then the structure of merged_data is checked using str() function. It was observed that ForeignPort, Country, Freight_type and Airline has character datatype values and Passengers_Total, Freight_weight and Max_Seats has numeric datatype values.
As per the requirement, two variables Country and Airline are converted to factors using as.factor() function.
Then, using the filter() funtion, details specific to USA country was filtered from the merged_data and stored into filtered_data.
Now, to check the structure and attributes of filtered_data, str() and attributes() functions were used.
Moreover, levels in ForeignPort variable was checked using levels() function
Thus, data frame contains 2 factors, 2 characters and 3 numeric type data type variables. Note: Attributes of filtered_data were checked using attributes() function but not included in the report as the output generated was long. But its commented under the following R chunk.

# Understand Section
#merging the data with inner_join() function to get common entries in two datasets
merged_data <- as.data.frame(inner_join(subset_city_pairs, subset_operated_flights, 
                                        by=c("ForeignPort"="International_City")))
head(merged_data)

str(merged_data)

## 'data.frame':    291732 obs. of  7 variables:
##  $ ForeignPort     : chr  "Auckland" "Auckland" "Auckland" "Auckland" ...
##  $ Country         : chr  "New Zealand" "New Zealand" "New Zealand" "New Zealand" ...
##  $ Passengers_Total: num  2498 2498 2498 2498 2498 ...
##  $ Freight_type    : chr  "Inbound" "Inbound" "Inbound" "Inbound" ...
##  $ Freight_weight  : num  42.2 42.2 42.2 42.2 42.2 ...
##  $ Airline         : chr  "Air New Zealand" "Freedom Air International" "Garuda Indonesia" "Malaysia Airlines" ...
##  $ Max_Seats       : num  12624 2556 2296 5404 7733 ...

#View(merged_data)
#checking the datatypes in merged data and converting Country and Airline columns into factors
merged_data$Country <- as.factor(merged_data$Country)
merged_data$Airline <- as.factor(merged_data$Airline)



#Checking freight_weight, Max_seats for the cities in USA
#filtering the merged data by applying filter function
filtered_data <- merged_data %>% filter(Country =="USA")

#checking the structure of filtered_data
str(filtered_data)

## 'data.frame':    19598 obs. of  7 variables:
##  $ ForeignPort     : chr  "Honolulu" "Honolulu" "Honolulu" "Honolulu" ...
##  $ Country         : Factor w/ 27 levels "Bahrain","Brunei",..: 25 25 25 25 25 25 25 25 25 25 ...
##  $ Passengers_Total: num  654 654 654 654 654 654 654 654 654 654 ...
##  $ Freight_type    : chr  "Inbound" "Inbound" "Inbound" "Inbound" ...
##  $ Freight_weight  : num  6.62 6.62 6.62 6.62 6.62 ...
##  $ Airline         : Factor w/ 38 levels "Aerolineas Argentinas",..: 3 30 30 30 3 30 30 30 3 30 ...
##  $ Max_Seats       : num  8455 3116 2437 227 8440 ...

#checking the attributes of filtered_data
#attributes(filtered_data)

#checking the levels Country column of filtered_data
levels(filtered_data$Country)

##  [1] "Bahrain"              "Brunei"               "Canada"              
##  [4] "China"                "Fiji"                 "France"              
##  [7] "Germany"              "Hong Kong"            "Indonesia"           
## [10] "Italy"                "Japan"                "Malaysia"            
## [13] "Mauritius"            "Nauru"                "New Caledonia"       
## [16] "New Zealand"          "Papua New Guinea"     "Philippines"         
## [19] "Singapore"            "Solomon Islands"      "South Africa"        
## [22] "Thailand"             "UK"                   "United Arab Emirates"
## [25] "USA"                  "Vanuatu"              "Western Samoa"

levels(filtered_data$Airline)

##  [1] "Aerolineas Argentinas"      "Air Caledonie"             
##  [3] "Air Canada"                 "Air China"                 
##  [5] "Air Mauritius"              "Air New Zealand"           
##  [7] "Air Niugini"                "Air Paradise International"
##  [9] "Air Vanuatu"                "Australian Airlines"       
## [11] "British Airways"            "Cathay Pacific Airways"    
## [13] "China Eastern Airlines"     "China Southern Airlines"   
## [15] "Emirates"                   "Fiji Airways"              
## [17] "Freedom Air International"  "Garuda Indonesia"          
## [19] "Gulf Air"                   "Hawaiian Airlines"         
## [21] "Japan Airlines"             "LATAM Airlines"            
## [23] "Lauda Air"                  "Malaysia Airlines"         
## [25] "Merpati Nusantara Airlines" "Nauru Airlines"            
## [27] "Norfolk Jet Express"        "Philippine Airlines"       
## [29] "Polynesian Airlines"        "Qantas Airways"            
## [31] "Royal Brunei Airlines"      "Royal Tongan Airlines"     
## [33] "Singapore Airlines"         "Solomon Airlines"          
## [35] "South African Airways"      "Thai Airways International"
## [37] "United Airlines"            "Virgin Australia"

Tidy & Manipulate Data II

A new variable vacant_seats is calculated by subtracting total number of passengers(Passengers_Toatl) from Maximum seats(Maximum_Seats) available in a flight. This will give the details about how many seats are vacant and were not occupied by the travellers. To calculate this and add it to existsting dataset i.e. filtered_data, mutate() function was used. Then, descriptive statistics for Max_Seats, Freight_weight and vacant seats was calculated using the summarise() function.

#Tidy & Manipulate Data II 
#creating a new variable vacant_seats by subtracting total passengers from max seats in a flight
filtered_data <- filtered_data %>% mutate(vacant_seats = Max_Seats-Passengers_Total)
head(filtered_data)

#Tidy and Manipulate data II
#descriptive statistics for Max_Seats, Freight_weight, and vacant_seats
filtered_data %>% group_by(ForeignPort) %>% summarise(Min = min(Max_Seats,na.rm = TRUE),
                                            Q1 = quantile(Max_Seats,probs = .25,na.rm = TRUE),
                                            Median = median(Max_Seats, na.rm = TRUE),
                                            Q3 = quantile(Max_Seats,probs = .75,na.rm = TRUE),
                                            Max = max(Max_Seats,na.rm = TRUE),
                                            Mean = mean(Max_Seats, na.rm = TRUE),
                                            number = n(),
                                            null = sum(is.na(Max_Seats)))

## `summarise()` ungrouping output (override with `.groups` argument)

filtered_data %>% group_by(ForeignPort) %>% summarise(Min = min(Freight_weight,na.rm = TRUE),
                                            Q1 = quantile(Freight_weight,probs = .25,na.rm = TRUE),
                                            Median = median(Freight_weight, na.rm = TRUE),
                                            Q3 = quantile(Freight_weight,probs = .75,na.rm = TRUE),
                                            Max = max(Freight_weight,na.rm = TRUE),
                                            Mean = mean(Freight_weight, na.rm = TRUE),
                                            number = n(),
                                            null = sum(is.na(Freight_weight)))

## `summarise()` ungrouping output (override with `.groups` argument)

filtered_data %>% group_by(ForeignPort) %>% summarise(Min = min(vacant_seats,na.rm = TRUE),
                                            Q1 = quantile(vacant_seats,probs = .25,na.rm = TRUE),
                                            Median = median(vacant_seats, na.rm = TRUE),
                                            Q3 = quantile(vacant_seats,probs = .75,na.rm = TRUE),
                                            Max = max(vacant_seats,na.rm = TRUE),
                                            Mean = mean(vacant_seats, na.rm = TRUE),
                                            number = n(),
                                            null = sum(is.na(vacant_seats)))

## `summarise()` ungrouping output (override with `.groups` argument)

Scan I

To verify if there are any missing values present in the numeric variable of the dataset, is.na() function is used. No missing values were observed and FALSE was returned after applying the is.na() function. A function to check the presence of infinite or NaN values in every numeric column, is also given below but since, the output is too large, it is commented. (Note: only the head of all the missing values is given here in this report since output is too big. The output for checking the presence of missing values was first checked for every column using is.na(filtered_data$Passengers_Total), but only head of this output is shown in the report).

#Scan I
#Checking if there are any missing values are present
head(is.na(filtered_data$Passengers_Total))

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

head(is.na(filtered_data$Max_Seats))

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

head(is.na(filtered_data$vacant_seats))

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

head(is.na(filtered_data$Freight_weight))

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

# Check every numerical column having infinite or NaN values using a function called is.special
#is.special <- function(x){
#  if (is.numeric(x)) (is.infinite(x) | is.nan(x))
#}
# apply this function to the data frame.
#sapply(filtered_data, is.special)

Scan II

This data set contains four numeric variables, Max_Seats, Freights_weight, vacant_seats and Passengers_Total, outliers are scanned in all four of them by using ‘Tukeys method of outlier detection’.
Boxplot(side by side) is plotted for all cities in USA for the four varaibles
Outliers that were detected Outliers found in Max_seats for the following cities in USA: Los Angeles Outliers found in Freight_weight for the following cities in USA: Honolulu, Los Angeles, New York, San Francisco Outliers found in Vacant_seats for the following cities in USA: Honolulu, Los Angeles Outliers found in Passengers_Total for the following cities in USA: Honolulu, Lost Angeles and San Francisco
Subset filtered_data to f1,f2,f3,f4 and f5 vectors on ForeignPort.
A new function cap() was defined to cap the values outside the limits
All the outlier values were removed after cap() function on all the five vectors created for different cities(ForeignPort)

#Scan II
#Checking Outliers by using tukey's method of outlier detection
boxplot_max_seats <- filtered_data %>% boxplot(Max_Seats ~ ForeignPort, data = ., 
                                  main="Box Plot for Max seats in USA cities flights",
                                  ylab = "Max_Seats", xlab = "USA cities", col= c("blue"))

boxplot_freight <- filtered_data %>% boxplot(Freight_weight ~ ForeignPort, data = ., 
                                  main="Box Plot for freight weight in USA cities flights",
                                  ylab = "Freight weights", xlab = "USA cities", col= c("orange"))

boxplot_vacant_seats <- filtered_data %>% boxplot(vacant_seats ~ ForeignPort, data=., 
                                      main="Box Plot for Vacant seats in USA cities flights", 
                                      ylab = "Vacant seats", xlab = "USA cities", col = "green")

boxplot_passengers <- filtered_data %>% boxplot(Passengers_Total ~ ForeignPort, data = ., 
                                    main = "Boxplot for Passenger Total in USA cities flight",
                                    ylab = "Passengers_Total", xlab = "USA cities", col=c("pink"))

head(boxplot_max_seats$out)

## [1] 22340 22552 23498 22340 22552 23498

head(boxplot_freight$out)

## [1] 102.89 102.89 102.89 102.89 102.89 102.89

head(boxplot_vacant_seats$out)

## [1] -11035 -11035 -10808 -10904  -9662  -9662

head(boxplot_passengers$out)

## [1] 11262 11262 11262 11262 11262 11262

#Subset filtered_data to f1,f2,f3,f4 and f5 on ForeignPort
f1 <- filtered_data %>% filter(ForeignPort == "Honolulu")
f2 <- filtered_data %>% filter(ForeignPort == "Los Angeles")
f3 <- filtered_data %>% filter(ForeignPort == "New York")
f4 <- filtered_data %>% filter(ForeignPort == "San Francisco")
f5 <- filtered_data %>% filter(ForeignPort == "Seattle")

# Define a function to cap the values outside the limits
cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}

#Removing outliers using capping
f1$Max_Seats <- f1$Max_Seats %>% cap()
f2$Max_Seats <- f2$Max_Seats %>% cap()
f3$Max_Seats <- f3$Max_Seats %>% cap()
f4$Max_Seats <- f4$Max_Seats %>% cap()
f5$Max_Seats <- f5$Max_Seats %>% cap()
f1$Freight_weight <- f1$Freight_weight %>% cap()
f2$Freight_weight <- f2$Freight_weight %>% cap()
f3$Freight_weight <- f3$Freight_weight %>% cap()
f4$Freight_weight <- f4$Freight_weight %>% cap()
f5$Freight_weight <- f5$Freight_weight %>% cap()
f1$vacant_seats <- f1$vacant_seats %>% cap()
f2$vacant_seats <- f2$vacant_seats %>% cap()
f3$vacant_seats <- f3$vacant_seats %>% cap()
f4$vacant_seats <- f4$vacant_seats %>% cap()
f5$vacant_seats <- f5$vacant_seats %>% cap()
f1$Passengers_Total <- f1$Passengers_Total %>% cap()
f2$Passengers_Total <- f2$Passengers_Total %>% cap()
f3$Passengers_Total <- f3$Passengers_Total %>% cap()
f4$Passengers_Total <- f4$Passengers_Total %>% cap()
f5$Passengers_Total <- f5$Passengers_Total %>% cap()

Transform

Here, only one variable, Max_Seats is considered for applying various transformation techniques. To check the distribution, histogram is plotted for all the ports in USA, i.e. Honolulu, Los Angeles, New York, and San Francisco. A line for mean value and median value is represented by red and green line repectively in the plots.
For Honolulu port(f1$Max_Seats), no transformation was needed as it is already normalised
For Los Angeles port(f2$Max_Seats), boxcox, sqrt, log10 and ln transformation techniques were applied
For New York port(f3$Max_Seats), cube transformation technique was applied
For San Francisco port(f4$Max_Seats), boxcox and cube transformation techniques were applied.

#Transform
#histograms for Max_Seats in Different city ports of USA
hist(f1$Max_Seats, main="Histogram of Max_Seats in Honolulu", xlab = "Max Seats")
abline(v = mean(f1$Max_Seats), col="red", lwd=2)
abline(v = median(f1$Max_Seats), col="green", lwd=2)

hist(f2$Max_Seats, main="Histogram of Max_Seats in Los Angeles", xlab = "Max Seats")
abline(v = mean(f2$Max_Seats), col="red", lwd=2)
abline(v = median(f2$Max_Seats), col="green", lwd=2)

hist(f3$Max_Seats, main="Histogram of Max_Seats in New York", xlab = "Max Seats")
abline(v = mean(f3$Max_Seats), col="red", lwd=2)
abline(v = median(f3$Max_Seats), col="green", lwd=2)

hist(f4$Max_Seats, main="Histogram of Max_Seats in San Francisco", xlab = "Max Seats")
abline(v = mean(f4$Max_Seats), col="red", lwd=2)
abline(v = median(f4$Max_Seats), col="green", lwd=2)

#Applying appropriate transformation on factors 
#Applying boxcox, sqrt, log10 and log transformation on f2$Max_Seats i.e. max seats in Los Angeles
sqrt_LA <- sqrt(f2$Max_Seats)
hist(sqrt_LA)

log_LA <- log10(f2$Max_Seats)
hist(log_LA)

ln_LA <- log(f2$Max_Seats)
hist(ln_LA)

boxcox_LA <- BoxCox(f2$Max_Seats, lambda="auto")
hist(boxcox_LA)

#Applying cube transformation on f3$Max_Seats .i.e max seats in New York
cube_NY <- f3$Max_Seats^3
hist(cube_NY)

#Applying cube and boxcox on f4$Max_Seats i.e. max seats in San Francisco
boxcox_san <- BoxCox(f4$Max_Seats, lambda="auto")
hist(boxcox_san)

cube_san <- f4$Max_Seats^3
hist(cube_san)

References

[1]. Bureau of Infrastructure Transport and Regional Economics 2016, International Airlines - Operated Flights and Seats, Dataset, Australian Government, Bureau of Infrastructure Transport and Regional Economics, viewed 13 October 2020, https://data.gov.au/dataset/ds-dga-e82787e4-a480-4189-b963-1d0b6088103e/details?q=.

[2]. Bureau of Infrastructure Transport and Regional Economics 2016, International Airlines - Traffic by city pairs, Dataset, Australian Government, Bureau of Infrastructure Transport and Regional Economics, viewed 13 October 2020, https://data.gov.au/dataset/ds-dga-d9fbffaa-836f-4f52-80e8-324249ff269f/details?q=.