Introduction

Very big volume of data are normally stored in databases since our local machines seldom have enough space to store them. On many occasions, the datasets might be huge to read the entire dataset into R. The system will simply freeze. In those difficult scenarios, we can resort to performing SQL queries to do most of the processing in the database. Later, we can fetch smaller chunks of the dataset into R for quick analysis.

In this project, we will connect to the Amazon Web Service cloud and use the airlines database, which is a SQL database hosted on AWS created by the authors of MDSR package. The airlines is a pretty big databases with almost 48 million of flight observations. We will run SQL queries on this database starting from simple queries and later extending to more complex queries.

First of all, we need to load the prerequisite libraries.

library(tidyverse)   # for reading in data, graphing, and cleaning
library(dbplyr)      
library(mdsr)        # for accessing some databases 
library(RMySQL)      # for accessing MySQL databases
library(RSQLite) 

Connecting to the Database

Let us connect to the airlines database. We need to provide the database name, host address, user name, and the password to be able to connect to the database.

con_air <- dbConnect(RMySQL::MySQL(), 
                     dbname = "airlines", 
                     host = "mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com", 
                     user = "mdsr_public", 
                     password = "ImhsmflMDSwR")

Checking the Database

We will now get the name of the tables contained in the airlines database.

dbListTables(con_air)
## [1] "airports" "carriers" "flights"  "planes"

Alternatively, we can insert original SQL codes to get the list of tables.

SHOW tables;
4 records
Tables_in_airlines
airports
carriers
flights
planes

So, there are 4 tables in the database.

We will now get the variables names from the flights table.

dbListFields(con_air, "flights")
##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "tailnum"        "flight"        
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "cancelled"      "diverted"       "hour"           "minute"        
## [21] "time_hour"

There are total 21 variables in the flights table.

Querying the Database

First look on the Tables

The following query will give us the variables names and the description from the flights table.

DESCRIBE flights;
Displaying records 1 - 10
Field Type Null Key Default Extra
year smallint(4) YES MUL NA
month smallint(2) YES NA
day smallint(2) YES NA
dep_time smallint(4) YES NA
sched_dep_time smallint(4) YES NA
dep_delay smallint(4) YES NA
arr_time smallint(4) YES NA
sched_arr_time smallint(4) YES NA
arr_delay smallint(4) YES NA
carrier varchar(2) NO MUL

We select every variables from the flights table but returning only 10 rows.

SELECT *
FROM flights
LIMIT 10;
Displaying records 1 - 10
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier tailnum flight origin dest air_time distance cancelled diverted hour minute time_hour
2010 10 1 1 2100 181 159 2320 159 XE N11137 2558 EWR OMA 162 1133 0 0 21 0 2010-10-01 21:00:00
2010 10 1 1 1920 281 230 2214 256 B6 N659JB 562 FLL SWF 131 1119 0 0 19 20 2010-10-01 19:20:00
2010 10 1 3 2355 8 339 334 5 B6 N563JB 701 JFK SJU 196 1597 0 0 23 55 2010-10-01 23:55:00
2010 10 1 5 2200 125 41 2249 112 XE N16559 5982 IAD BNA 82 542 0 0 22 0 2010-10-01 22:00:00
2010 10 1 7 2245 82 104 2347 77 OO N908SW 6433 LAX FAT 37 209 0 0 22 45 2010-10-01 22:45:00
2010 10 1 7 10 -3 451 500 -9 AA N3FRAA 700 LAX DFW 150 1235 0 0 0 10 2010-10-01 00:10:00
2010 10 1 7 2150 137 139 2337 122 DL N347NW 1752 ATL IAD 70 533 0 0 21 50 2010-10-01 21:50:00
2010 10 1 8 15 -7 538 537 1 CO N73283 1740 SMF IAH 193 1609 0 0 0 15 2010-10-01 00:15:00
2010 10 1 8 10 -2 643 645 -2 DL N333NW 2344 LAS CVG 196 1678 0 0 0 10 2010-10-01 00:10:00
2010 10 1 10 2225 105 831 642 109 B6 N585JB 174 SJC JFK 293 2570 0 0 22 25 2010-10-01 22:25:00

We can save the query from SQL as a R object for later use. First we run the query in SQL to check everything is going to the plan. Then we translate the SQL query as a R code then save it. The tbl() function works same as the FROM from the SQL.

query_01<-tbl(src = con_air, sql(
  "SELECT *
    FROM flights
  LIMIT 10"
))
query_01
## # Source:   SQL [?? x 21]
## # Database: mysql 5.7.33-log
## #   [@mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com:/airlines]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <int>    <int>          <int>
##  1  2010    10     1        1           2100       181      159           2320
##  2  2010    10     1        1           1920       281      230           2214
##  3  2010    10     1        3           2355         8      339            334
##  4  2010    10     1        5           2200       125       41           2249
##  5  2010    10     1        7           2245        82      104           2347
##  6  2010    10     1        7             10        -3      451            500
##  7  2010    10     1        7           2150       137      139           2337
##  8  2010    10     1        8             15        -7      538            537
##  9  2010    10     1        8             10        -2      643            645
## 10  2010    10     1       10           2225       105      831            642
## # ... with 13 more variables: arr_delay <int>, carrier <chr>, tailnum <chr>,
## #   flight <int>, origin <chr>, dest <chr>, air_time <int>, distance <int>,
## #   cancelled <int>, diverted <int>, hour <int>, minute <int>, time_hour <chr>

Although the output looks like a data frame in R but it not actually, it’s a special type of table returned.

class(query_01)
## [1] "tbl_MySQLConnection" "tbl_dbi"             "tbl_sql"            
## [4] "tbl_lazy"            "tbl"

Group by and Summary

Now, we will query the maximum and minimum year, total number of flights, and number of distinct carriers from the flights table. This might take some time as we are querying the whole flights table.

SELECT 
 MIN(year) AS min_year,
 MAX(year) AS max_year,
 COUNT(*) AS num_flights,
 COUNT(DISTINCT(carrier)) AS num_carriers
FROM flights
1 records
min_year max_year num_flights num_carriers
2010 2017 47932811 20

Now we will calculate the proportions of flights for each carriers being late for more than 20 minutes. We need to remember that SQL returns 1/0 rather than TRUE/FALSE so it’s easy to calculate and sum up on them. First, we will take a look at the 10 rows of those flights being late more than 20 minutes or not and they are assigned a value of 1 if late and o for not being late.

SELECT carrier, arr_delay, year,
(arr_delay>20) AS late_than_20
FROM flights
LIMIT 10;
Displaying records 1 - 10
carrier arr_delay year late_than_20
XE 159 2010 1
B6 256 2010 1
B6 5 2010 0
XE 112 2010 1
OO 77 2010 1
AA -9 2010 0
DL 122 2010 1
CO 1 2010 0
DL -2 2010 0
B6 109 2010 1

We calculate the proportions for the each of the carriers for delaying more than 20 minutes and we arrange them in descending order. We first do it for only 10000 rows to see whether the codes work fine.

SELECT carrier,
AVG(arr_delay>20) AS prop_delay
FROM (SELECT * FROM flights LIMIT 10000) fl
GROUP BY carrier
ORDER BY prop_delay DESC;
Displaying records 1 - 10
carrier prop_delay
B6 0.4404
OH 0.2269
9E 0.1557
US 0.1476
MQ 0.1243
EV 0.1214
OO 0.1186
CO 0.1175
DL 0.1003
AA 0.0973

As, we can see the carrier B6 has the highest proportion of flights being delayed for more than 20 minutes.

We now run the query on the whole flights table and this might take some time.

SELECT carrier,
AVG(arr_delay>20) AS prop_delay
FROM flights
GROUP BY carrier
ORDER BY prop_delay DESC;
Displaying records 1 - 10
carrier prop_delay
NK 0.1984
B6 0.1927
OH 0.1818
F9 0.1795
EV 0.1773
MQ 0.1697
VX 0.1693
XE 0.1680
AA 0.1553
UA 0.1539

When we ran the query on the whole flights table, now the carrier B6 does not seem to be the highest for being late rather it is NK.

Now, we will find the average delay of each carrier within each year, average distance covered, and arrange them by longest average delay. We run the query on just 1000 rows of the flights table reduce the amount of calculation time.

SELECT year, carrier, 
COUNT(*) AS n_flights,
AVG(arr_delay) AS average_delay,
AVG(distance) AS average_distance
FROM (SELECT * FROM flights LIMIT 1000) fl
GROUP BY year, carrier;
Displaying records 1 - 10
year carrier n_flights average_delay average_distance
2010 9E 47 -10.0213 391.1277
2010 AA 75 6.8133 1057.4667
2010 AS 20 -7.2500 976.7500
2010 B6 31 62.9032 1213.8710
2010 CO 40 -6.6500 1042.5500
2010 DL 114 -5.9298 854.2982
2010 EV 66 -10.8939 411.0909
2010 F9 14 -8.5714 960.2143
2010 FL 36 -11.0833 594.0278
2010 HA 13 0.4615 341.5385

Mutating New Column

We will take a look at the faa and the name columns from the airport table.

SELECT faa, name
FROM airports
LIMIT 10;
Displaying records 1 - 10
faa name
04G Lansdowne Airport
06A Moton Field Municipal Airport
06C Schaumburg Regional
06N Randall Airport
09J Jekyll Island Airport
0A9 Elizabethton Municipal Airport
0G6 Williams County Airport
0G7 Finger Lakes Regional Airport
0P2 Shoestring Aviation Airfield
0S9 Jefferson County Intl

We can effectively use other columns to return a new column. Here we will add the longitude and the latitude columns from the airports table to create a new co-ordinate column or location. Same as the mutate function from dplyr.

SELECT name,
CONCAT("(",lat, ",", lon, ")") AS location
FROM airports
LIMIT 10
Displaying records 1 - 10
name location
Lansdowne Airport (41.1304722,-80.6195833)
Moton Field Municipal Airport (32.4605722,-85.6800278)
Schaumburg Regional (41.9893408,-88.1012428)
Randall Airport (41.4319120,-74.3915611)
Jekyll Island Airport (31.0744722,-81.4277778)
Elizabethton Municipal Airport (36.3712222,-82.1734167)
Williams County Airport (41.4673056,-84.5067778)
Finger Lakes Regional Airport (42.8835647,-76.7812318)
Shoestring Aviation Airfield (39.7948244,-76.6471914)
Jefferson County Intl (48.0538086,-122.8106436)

WHERE

By the following query, We will select now only those flights that left the Bradley International Airport on June 26th, 2013.

SELECT year, month, day, origin, dest, flight, carrier
FROM flights
WHERE year=2013 AND month=6 AND day=26 AND origin="BDL"
LIMIT 10;
Displaying records 1 - 10
year month day origin dest flight carrier
2013 6 26 BDL EWR 4714 EV
2013 6 26 BDL MIA 2015 AA
2013 6 26 BDL DTW 1644 DL
2013 6 26 BDL BWI 2584 WN
2013 6 26 BDL ATL 1065 DL
2013 6 26 BDL DCA 1077 US
2013 6 26 BDL TPA 627 WN
2013 6 26 BDL MSP 797 DL
2013 6 26 BDL CLT 1705 US
2013 6 26 BDL CVG 3787 9E

As we can see, the origin of the flights is BDL short for Bradley International Airport, day is 26, month is 6 and the year is 2013.

Now, we will create a date column by concatenating the year, month, and day column and perform the previous query again.

SELECT
STR_TO_DATE(CONCAT(year, "-", month, "-", day), "%Y-%m-%d") AS date,
origin, flight, carrier
FROM flights
WHERE year=2013 AND month=6 AND day=26 AND origin="BDL"
LIMIT 10;
Displaying records 1 - 10
date origin flight carrier
2013-06-26 BDL 4714 EV
2013-06-26 BDL 2015 AA
2013-06-26 BDL 1644 DL
2013-06-26 BDL 2584 WN
2013-06-26 BDL 1065 DL
2013-06-26 BDL 1077 US
2013-06-26 BDL 627 WN
2013-06-26 BDL 797 DL
2013-06-26 BDL 1705 US
2013-06-26 BDL 3787 9E

GROUP BY

We will select the number of flights by each airlines carrier on 26th June 2013 departed from Bradly International Airport.

SELECT carrier,
COUNT(*) AS num_flights
FROM flights
WHERE year=2013 AND month=6 AND day=26 AND origin="BDL"
GROUP BY carrier;
Displaying records 1 - 10
carrier num_flights
9E 5
AA 4
B6 5
DL 11
EV 5
MQ 5
UA 1
US 7
WN 19
YV 3

We can calculate the the minimum departure time for each airlines carrier with the total number of flights for each airlines.

SELECT carrier,
COUNT(*) AS num_flights,
MIN(dep_time) AS minimum_time
FROM flights
WHERE year=2013 AND month=6 AND day=26 AND origin="BDL"
GROUP BY carrier;
Displaying records 1 - 10
carrier num_flights minimum_time
9E 5 0
AA 4 559
B6 5 719
DL 11 559
EV 5 555
MQ 5 0
UA 1 0
US 7 618
WN 19 601
YV 3 0

From the output of the previous query, we see that the time format stated there is not right. There are 3 digits or 4 digits. In 3 digits time, the first digit is the hour digit and next two digits are the minutes. In the 4 digits case, the first two digits represent the hour and next two digits represent the minutes. We will convert this 3 digits or 4 digits number in to proper time scale.

SELECT carrier,
COUNT(*) AS num_flights,
MAKETIME(
IF(LENGTH(MIN(dep_time))=3, 
LEFT(MIN(dep_time),1),
LEFT(MIN(dep_time),2)),
RIGHT(MIN(dep_time),2),
0) AS minimum_departure_Time
FROM flights
WHERE year=2013 AND month=6 AND day=26 AND origin="BDL"
GROUP BY carrier;
Displaying records 1 - 10
carrier num_flights minimum_departure_Time
9E 5 00:00:00
AA 4 05:59:00
B6 5 07:19:00
DL 11 05:59:00
EV 5 05:55:00
MQ 5 00:00:00
UA 1 00:00:00
US 7 06:18:00
WN 19 06:01:00
YV 3 00:00:00

ORDER BY

Let’s find the most favorable destinations to visit from the Bradley International Airport in 2013.

SELECT dest, COUNT(*) AS num_flights 
FROM flights
WHERE year=2013 AND origin="BDL"
GROUP BY dest
ORDER BY num_flights DESC
LIMIT 6; 
6 records
dest num_flights
ORD 2657
BWI 2613
ATL 2277
CLT 1842
MCO 1789
DTW 1523

We see the top six favorable destinations from the Bradley International ordered by the number of flights.

Now, another question arises, which of these destinations have the lowest average arrival time? Let’s find out.

SELECT dest, COUNT(*) as num_flights,
AVG(arr_delay) AS arrival_delay
FROM flights
WHERE year=2013 AND origin="BDL"
GROUP BY dest
ORDER BY arrival_delay ASC
LIMIT 0,10;
Displaying records 1 - 10
dest num_flights arrival_delay
CLE 57 -13.0702
LAX 127 -10.3071
CVG 708 -7.3701
MSP 981 -3.6636
MIA 404 -3.2723
DCA 204 -2.8971
DTW 1523 -2.1477
DEN 365 -1.2767
LAS 262 -0.3092
CLT 1842 -0.1205

HAVING

As a destination cleaveland has the lowest delayed time. Rather the flights arrived before the schedule. But after digging a while it reveals that in 2013 only 57 flights made the journey to cleaveland. So we will try to filter the flights with more than 2 flights per day.

SELECT dest, 
COUNT(*) AS num_flights,
AVG(arr_delay) AS avg_delay
FROM flights
WHERE year=2013 AND origin="BDL"
GROUP BY dest
HAVING num_flights>2*365
ORDER BY avg_delay ASC
LIMIT 6;
6 records
dest num_flights avg_delay
MSP 981 -3.6636
DTW 1523 -2.1477
CLT 1842 -0.1205
FLL 1011 0.2770
DFW 1062 0.7495
ATL 2277 4.4704

This query reveals that Minneapolis-St. Paul has been the destination with the lowest arrival delay for 2 flights per day from the Bradley International.

Various Types of JOIN Clauses

JOIN

The JOIN function in SQL basically works as the Inner Join. That means only the common rows for the matching keys are returned. The dest and the origin column in the flight table only contains abbreviated names.

SELECT dest, origin, flight, carrier
FROM flights
LIMIT 0,10;
Displaying records 1 - 10
dest origin flight carrier
OMA EWR 2558 XE
SWF FLL 562 B6
SJU JFK 701 B6
BNA IAD 5982 XE
FAT LAX 6433 OO
DFW LAX 700 AA
IAD ATL 1752 DL
IAH SMF 1740 CO
CVG LAS 2344 DL
JFK SJC 174 B6

But, it would be very convenient if we could retrieve their full names. The names of the airports are contained in the name column in the airports table. We can use the JOIN function in this case. We will match the two similar type of columns dest from flight as flight.dest and faa from airports table as airports.faa.

SELECT origin, dest,
airports.name AS destination,
flight, carrier
FROM flights
JOIN airports ON flights.dest=airports.faa
WHERE year=2013 AND month=6 AND day=23 AND origin="BDL"
LIMIT 0,6;
6 records
origin dest destination flight carrier
BDL ORD Chicago Ohare Intl 4147 EV
BDL IAD Washington Dulles Intl 6005 EV
BDL MDW Chicago Midway Intl 3029 WN
BDL EWR Newark Liberty Intl 4714 EV
BDL MIA Miami Intl 2015 AA
BDL BWI Baltimore Washington Intl 2584 WN

The name of the destination airports are now retrieved which are much more readable for us. But the airports names have been saved in the database with shorthand notations to increase the efficiency for space.

Now we will use aliases instead of the full name of the tables airports and flights. This saves a lot of typing and also time. And sometimes, they are quite necessary.

SELECT origin, dest,
a.name AS destination_airport,
flight, carrier
FROM flights AS f
JOIN airports AS a ON f.dest=a.faa
WHERE year=2013 AND month=6 AND day=23 AND origin="BDL"
LIMIT 0,10;
Displaying records 1 - 10
origin dest destination_airport flight carrier
BDL ORD Chicago Ohare Intl 4147 EV
BDL IAD Washington Dulles Intl 6005 EV
BDL MDW Chicago Midway Intl 3029 WN
BDL EWR Newark Liberty Intl 4714 EV
BDL MIA Miami Intl 2015 AA
BDL BWI Baltimore Washington Intl 2584 WN
BDL ATL Hartsfield Jackson Atlanta Intl 1065 DL
BDL MSP Minneapolis St Paul Intl 797 DL
BDL CLT Charlotte Douglas Intl 1141 US
BDL TPA Tampa Intl 627 WN

The output of the query is the same.

The carriers column in the flight table also have shorthand notations but the full names are stored in the separate carriers table. So, we can also join the carriers name from the carriers table with the flight table.

SELECT origin, dest,
a.name AS destination_airports,
c.name AS carriers_name,
flight, f.carrier AS carriers
FROM flights AS f
JOIN airports AS a ON f.dest=a.faa
JOIN carriers AS c ON f.carrier=c.carrier
WHERE year=2013 AND month=6 AND day=23 AND origin="BDL"
LIMIT 0,10;
Displaying records 1 - 10
origin dest destination_airports carriers_name flight carriers
BDL ORD Chicago Ohare Intl ExpressJet Airlines Inc. 4147 EV
BDL IAD Washington Dulles Intl ExpressJet Airlines Inc. 6005 EV
BDL MDW Chicago Midway Intl Southwest Airlines Co. 3029 WN
BDL EWR Newark Liberty Intl ExpressJet Airlines Inc. 4714 EV
BDL MIA Miami Intl American Airlines Inc. 2015 AA
BDL BWI Baltimore Washington Intl Southwest Airlines Co. 2584 WN
BDL ATL Hartsfield Jackson Atlanta Intl Delta Air Lines Inc. 1065 DL
BDL MSP Minneapolis St Paul Intl Delta Air Lines Inc. 797 DL
BDL CLT Charlotte Douglas Intl US Airways Inc. 1141 US
BDL TPA Tampa Intl Southwest Airlines Co. 627 WN

We can also see the cryptic shorthand notations for origin of the airports in the flight table. So, finally we can join origin from the flight table with name in the airports table to get the full name of the airports of origin. But there is a problem. The origin and dest columns are both in the flights table. So, we have to use two different aliases to extract the names from airports table. The airports table will be joined to the flights table using two different aliases.

SELECT flight,
a1.name AS origin_airports,
a2.name AS destination_airports,
c.name AS carriers_name
FROM flights AS f 
JOIN airports AS a1 ON f.origin=a1.faa
JOIN airports AS a2 ON f.dest=a2.faa
JOIN carriers AS c ON f.carrier=c.carrier
WHERE year=2013 AND month=6 AND day=23 AND origin="BDL"
LIMIT 0,10;
Displaying records 1 - 10
flight origin_airports destination_airports carriers_name
4147 Bradley Intl Chicago Ohare Intl ExpressJet Airlines Inc.
6005 Bradley Intl Washington Dulles Intl ExpressJet Airlines Inc.
3029 Bradley Intl Chicago Midway Intl Southwest Airlines Co.
4714 Bradley Intl Newark Liberty Intl ExpressJet Airlines Inc.
2015 Bradley Intl Miami Intl American Airlines Inc.
2584 Bradley Intl Baltimore Washington Intl Southwest Airlines Co.
1065 Bradley Intl Hartsfield Jackson Atlanta Intl Delta Air Lines Inc.
797 Bradley Intl Minneapolis St Paul Intl Delta Air Lines Inc.
1141 Bradley Intl Charlotte Douglas Intl US Airways Inc.
627 Bradley Intl Tampa Intl Southwest Airlines Co.

Now it is evident from the result that the Southwest Airlines fight no 3029 flew from the Bradley airport on 23rd June, 2013 and reached the Chicago Midway Airport.

Left Join

The Left Join function returns all the results for the left table doesn’t matter if all the rows have matching keys or not. There are some missing names for airports in the airports table. We will take a look at a couple of them by using the left join.

SELECT year, month, day,
a.name AS destination_airports,
origin, dest 
FROM flights as f
LEFT JOIN airports AS a ON f.dest=a.faa
WHERE year=2013 AND month=6 AND day=23 AND a.name is NULL
LIMIT 0,10;
Displaying records 1 - 10
year month day destination_airports origin dest
2013 6 23 NA JFK SJU
2013 6 23 NA JFK SJU
2013 6 23 NA JFK PSE
2013 6 23 NA JFK BQN
2013 6 23 NA BOS SJU
2013 6 23 NA TPA SJU
2013 6 23 NA JFK BQN
2013 6 23 NA FLL SJU
2013 6 23 NA JFK SJU
2013 6 23 NA JFK SJU

UNION

UNION is a very useful function when we want to bind two separate queries into one. This is analogous to the R dplyr function bind_rows.

(
SELECT year, month, day, origin, dest, flight, carrier
FROM flights
WHERE year=2013 AND month=6 AND day=26 AND origin="BDL" AND dest="MSP"
)
UNION
(
SELECT year, month, day, origin, dest, flight, carrier
FROM flights
WHERE year=2013 AND month=6 AND day=26 AND origin="JFK" AND dest="ORD"
)
LIMIT 0,10;
Displaying records 1 - 10
year month day origin dest flight carrier
2013 6 26 BDL MSP 797 DL
2013 6 26 BDL MSP 3338 9E
2013 6 26 BDL MSP 1226 DL
2013 6 26 JFK ORD 905 B6
2013 6 26 JFK ORD 1105 B6
2013 6 26 JFK ORD 3523 9E
2013 6 26 JFK ORD 1711 AA
2013 6 26 JFK ORD 105 B6
2013 6 26 JFK ORD 3521 9E
2013 6 26 JFK ORD 3525 9E

Subquery

Subquery is a small query which we can insert inside another big query as a condition or as a table. Although Bradley airport is known as an international airport but it functions most often as a regional airport. So, we are interested to know whether this airport receives any flight from Alaska or Hawaii. We can filter the states outside the lower 48 states using the time zone provided in the airport table.

SELECT faa, name, tz, city
FROM airports AS a
WHERE tz < -8
LIMIT 0,10;
Displaying records 1 - 10
faa name tz city
369 Atmautluak Airport -9 Atmautluak
6K8 Tok Junction Airport -9 Tok
ABL Ambler Airport -9 Ambler
ADK Adak Airport -9 Adak Island
ADQ Kodiak -9 Kodiak
AET Allakaket Airport -9 Allakaket
AFE Kake Airport -9 Kake
AGN Angoon Seaplane Base -9 Angoon
AIN Wainwright Airport -9 Wainwright
AKB Atka Airport -9 Atka

We will use the output returned through faa in this code snippet/subquery in the following query to get the number of flights from Bradley to the airports in the time zone below -8.

SELECT dest, a.name AS destination_airport,
SUM(1) AS totals, COUNT(DISTINCT carrier) AS number_of_carriers
FROM flights AS f
LEFT JOIN airports AS a ON f.dest=a.faa
WHERE year=2013 AND origin="BDL"
AND dest in
(SELECT faa
FROM airports
WHERE tz < -8)
GROUP BY dest;
0 records
dest destination_airport totals number_of_carriers

It returns nothing. That means we have found no airports where flights flew from Bradley airport to the airports in the time zone less than -8 in 2013.

Let’s check whether any flight flew from the Pacific region(time zone < -7) to the Bradley airport.

SELECT dest, a.name AS origin_airport,
SUM(1) AS totals, COUNT(DISTINCT carrier) AS number_of_carriers
FROM flights AS f
LEFT JOIN airports AS a ON f.origin=a.faa
WHERE year=2013 AND dest="BDL"
AND origin IN
(SELECT faa
FROM airports
WHERE tz < -7)
GROUP BY origin;
2 records
dest origin_airport totals number_of_carriers
BDL Mc Carran Intl 262 1
BDL Los Angeles Intl 127 1

Yes! You have found 262 flights from the Mc Carran Intl and 127 flights from the Los Angeles Intl airport to Bradley International airport. Both of the origin airports are below the time zone -7.

This marks the end of this SQL based project. We have almost covered most of the basic and fundamental ways to build queries for simple SQL operations and at the end we have build complex queries using join functions for more specific outcomes.