Very big volume of data are normally stored in databases since our local machines seldom have enough space to store them. On many occasions, the datasets might be huge to read the entire dataset into R. The system will simply freeze. In those difficult scenarios, we can resort to performing SQL queries to do most of the processing in the database. Later, we can fetch smaller chunks of the dataset into R for quick analysis.
In this project, we will connect to the Amazon Web Service cloud and use the airlines database, which is a SQL database hosted on AWS created by the authors of MDSR package. The airlines is a pretty big databases with almost 48 million of flight observations. We will run SQL queries on this database starting from simple queries and later extending to more complex queries.
First of all, we need to load the prerequisite libraries.
Let us connect to the airlines database. We need to provide the database name, host address, user name, and the password to be able to connect to the database.
We will now get the name of the tables contained in the airlines database.
## [1] "airports" "carriers" "flights" "planes"
Alternatively, we can insert original SQL codes to get the list of tables.
| Tables_in_airlines |
|---|
| airports |
| carriers |
| flights |
| planes |
So, there are 4 tables in the database.
We will now get the variables names from the flights table.
## [1] "year" "month" "day" "dep_time"
## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "tailnum" "flight"
## [13] "origin" "dest" "air_time" "distance"
## [17] "cancelled" "diverted" "hour" "minute"
## [21] "time_hour"
There are total 21 variables in the flights table.
The following query will give us the variables names and the description from the flights table.
| Field | Type | Null | Key | Default | Extra |
|---|---|---|---|---|---|
| year | smallint(4) | YES | MUL | NA | |
| month | smallint(2) | YES | NA | ||
| day | smallint(2) | YES | NA | ||
| dep_time | smallint(4) | YES | NA | ||
| sched_dep_time | smallint(4) | YES | NA | ||
| dep_delay | smallint(4) | YES | NA | ||
| arr_time | smallint(4) | YES | NA | ||
| sched_arr_time | smallint(4) | YES | NA | ||
| arr_delay | smallint(4) | YES | NA | ||
| carrier | varchar(2) | NO | MUL |
We select every variables from the flights table but returning only 10 rows.
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | tailnum | flight | origin | dest | air_time | distance | cancelled | diverted | hour | minute | time_hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2010 | 10 | 1 | 1 | 2100 | 181 | 159 | 2320 | 159 | XE | N11137 | 2558 | EWR | OMA | 162 | 1133 | 0 | 0 | 21 | 0 | 2010-10-01 21:00:00 |
| 2010 | 10 | 1 | 1 | 1920 | 281 | 230 | 2214 | 256 | B6 | N659JB | 562 | FLL | SWF | 131 | 1119 | 0 | 0 | 19 | 20 | 2010-10-01 19:20:00 |
| 2010 | 10 | 1 | 3 | 2355 | 8 | 339 | 334 | 5 | B6 | N563JB | 701 | JFK | SJU | 196 | 1597 | 0 | 0 | 23 | 55 | 2010-10-01 23:55:00 |
| 2010 | 10 | 1 | 5 | 2200 | 125 | 41 | 2249 | 112 | XE | N16559 | 5982 | IAD | BNA | 82 | 542 | 0 | 0 | 22 | 0 | 2010-10-01 22:00:00 |
| 2010 | 10 | 1 | 7 | 2245 | 82 | 104 | 2347 | 77 | OO | N908SW | 6433 | LAX | FAT | 37 | 209 | 0 | 0 | 22 | 45 | 2010-10-01 22:45:00 |
| 2010 | 10 | 1 | 7 | 10 | -3 | 451 | 500 | -9 | AA | N3FRAA | 700 | LAX | DFW | 150 | 1235 | 0 | 0 | 0 | 10 | 2010-10-01 00:10:00 |
| 2010 | 10 | 1 | 7 | 2150 | 137 | 139 | 2337 | 122 | DL | N347NW | 1752 | ATL | IAD | 70 | 533 | 0 | 0 | 21 | 50 | 2010-10-01 21:50:00 |
| 2010 | 10 | 1 | 8 | 15 | -7 | 538 | 537 | 1 | CO | N73283 | 1740 | SMF | IAH | 193 | 1609 | 0 | 0 | 0 | 15 | 2010-10-01 00:15:00 |
| 2010 | 10 | 1 | 8 | 10 | -2 | 643 | 645 | -2 | DL | N333NW | 2344 | LAS | CVG | 196 | 1678 | 0 | 0 | 0 | 10 | 2010-10-01 00:10:00 |
| 2010 | 10 | 1 | 10 | 2225 | 105 | 831 | 642 | 109 | B6 | N585JB | 174 | SJC | JFK | 293 | 2570 | 0 | 0 | 22 | 25 | 2010-10-01 22:25:00 |
We can save the query from SQL as a R object for later use. First we run the query in SQL to check everything is going to the plan. Then we translate the SQL query as a R code then save it. The tbl() function works same as the FROM from the SQL.
## # Source: SQL [?? x 21]
## # Database: mysql 5.7.33-log
## # [@mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com:/airlines]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <int> <int> <int>
## 1 2010 10 1 1 2100 181 159 2320
## 2 2010 10 1 1 1920 281 230 2214
## 3 2010 10 1 3 2355 8 339 334
## 4 2010 10 1 5 2200 125 41 2249
## 5 2010 10 1 7 2245 82 104 2347
## 6 2010 10 1 7 10 -3 451 500
## 7 2010 10 1 7 2150 137 139 2337
## 8 2010 10 1 8 15 -7 538 537
## 9 2010 10 1 8 10 -2 643 645
## 10 2010 10 1 10 2225 105 831 642
## # ... with 13 more variables: arr_delay <int>, carrier <chr>, tailnum <chr>,
## # flight <int>, origin <chr>, dest <chr>, air_time <int>, distance <int>,
## # cancelled <int>, diverted <int>, hour <int>, minute <int>, time_hour <chr>
Although the output looks like a data frame in R but it not actually, it’s a special type of table returned.
## [1] "tbl_MySQLConnection" "tbl_dbi" "tbl_sql"
## [4] "tbl_lazy" "tbl"
Now, we will query the maximum and minimum year, total number of flights, and number of distinct carriers from the flights table. This might take some time as we are querying the whole flights table.
SELECT
MIN(year) AS min_year,
MAX(year) AS max_year,
COUNT(*) AS num_flights,
COUNT(DISTINCT(carrier)) AS num_carriers
FROM flights| min_year | max_year | num_flights | num_carriers |
|---|---|---|---|
| 2010 | 2017 | 47932811 | 20 |
Now we will calculate the proportions of flights for each carriers being late for more than 20 minutes. We need to remember that SQL returns 1/0 rather than TRUE/FALSE so it’s easy to calculate and sum up on them. First, we will take a look at the 10 rows of those flights being late more than 20 minutes or not and they are assigned a value of 1 if late and o for not being late.
| carrier | arr_delay | year | late_than_20 |
|---|---|---|---|
| XE | 159 | 2010 | 1 |
| B6 | 256 | 2010 | 1 |
| B6 | 5 | 2010 | 0 |
| XE | 112 | 2010 | 1 |
| OO | 77 | 2010 | 1 |
| AA | -9 | 2010 | 0 |
| DL | 122 | 2010 | 1 |
| CO | 1 | 2010 | 0 |
| DL | -2 | 2010 | 0 |
| B6 | 109 | 2010 | 1 |
We calculate the proportions for the each of the carriers for delaying more than 20 minutes and we arrange them in descending order. We first do it for only 10000 rows to see whether the codes work fine.
SELECT carrier,
AVG(arr_delay>20) AS prop_delay
FROM (SELECT * FROM flights LIMIT 10000) fl
GROUP BY carrier
ORDER BY prop_delay DESC;| carrier | prop_delay |
|---|---|
| B6 | 0.4404 |
| OH | 0.2269 |
| 9E | 0.1557 |
| US | 0.1476 |
| MQ | 0.1243 |
| EV | 0.1214 |
| OO | 0.1186 |
| CO | 0.1175 |
| DL | 0.1003 |
| AA | 0.0973 |
As, we can see the carrier B6 has the highest proportion of flights being delayed for more than 20 minutes.
We now run the query on the whole flights table and this might take some time.
SELECT carrier,
AVG(arr_delay>20) AS prop_delay
FROM flights
GROUP BY carrier
ORDER BY prop_delay DESC;| carrier | prop_delay |
|---|---|
| NK | 0.1984 |
| B6 | 0.1927 |
| OH | 0.1818 |
| F9 | 0.1795 |
| EV | 0.1773 |
| MQ | 0.1697 |
| VX | 0.1693 |
| XE | 0.1680 |
| AA | 0.1553 |
| UA | 0.1539 |
When we ran the query on the whole flights table, now the carrier B6 does not seem to be the highest for being late rather it is NK.
Now, we will find the average delay of each carrier within each year, average distance covered, and arrange them by longest average delay. We run the query on just 1000 rows of the flights table reduce the amount of calculation time.
SELECT year, carrier,
COUNT(*) AS n_flights,
AVG(arr_delay) AS average_delay,
AVG(distance) AS average_distance
FROM (SELECT * FROM flights LIMIT 1000) fl
GROUP BY year, carrier;| year | carrier | n_flights | average_delay | average_distance |
|---|---|---|---|---|
| 2010 | 9E | 47 | -10.0213 | 391.1277 |
| 2010 | AA | 75 | 6.8133 | 1057.4667 |
| 2010 | AS | 20 | -7.2500 | 976.7500 |
| 2010 | B6 | 31 | 62.9032 | 1213.8710 |
| 2010 | CO | 40 | -6.6500 | 1042.5500 |
| 2010 | DL | 114 | -5.9298 | 854.2982 |
| 2010 | EV | 66 | -10.8939 | 411.0909 |
| 2010 | F9 | 14 | -8.5714 | 960.2143 |
| 2010 | FL | 36 | -11.0833 | 594.0278 |
| 2010 | HA | 13 | 0.4615 | 341.5385 |
We will take a look at the faa and the name columns from the airport table.
| faa | name |
|---|---|
| 04G | Lansdowne Airport |
| 06A | Moton Field Municipal Airport |
| 06C | Schaumburg Regional |
| 06N | Randall Airport |
| 09J | Jekyll Island Airport |
| 0A9 | Elizabethton Municipal Airport |
| 0G6 | Williams County Airport |
| 0G7 | Finger Lakes Regional Airport |
| 0P2 | Shoestring Aviation Airfield |
| 0S9 | Jefferson County Intl |
We can effectively use other columns to return a new column. Here we will add the longitude and the latitude columns from the airports table to create a new co-ordinate column or location. Same as the mutate function from dplyr.
| name | location |
|---|---|
| Lansdowne Airport | (41.1304722,-80.6195833) |
| Moton Field Municipal Airport | (32.4605722,-85.6800278) |
| Schaumburg Regional | (41.9893408,-88.1012428) |
| Randall Airport | (41.4319120,-74.3915611) |
| Jekyll Island Airport | (31.0744722,-81.4277778) |
| Elizabethton Municipal Airport | (36.3712222,-82.1734167) |
| Williams County Airport | (41.4673056,-84.5067778) |
| Finger Lakes Regional Airport | (42.8835647,-76.7812318) |
| Shoestring Aviation Airfield | (39.7948244,-76.6471914) |
| Jefferson County Intl | (48.0538086,-122.8106436) |
By the following query, We will select now only those flights that left the Bradley International Airport on June 26th, 2013.
SELECT year, month, day, origin, dest, flight, carrier
FROM flights
WHERE year=2013 AND month=6 AND day=26 AND origin="BDL"
LIMIT 10;| year | month | day | origin | dest | flight | carrier |
|---|---|---|---|---|---|---|
| 2013 | 6 | 26 | BDL | EWR | 4714 | EV |
| 2013 | 6 | 26 | BDL | MIA | 2015 | AA |
| 2013 | 6 | 26 | BDL | DTW | 1644 | DL |
| 2013 | 6 | 26 | BDL | BWI | 2584 | WN |
| 2013 | 6 | 26 | BDL | ATL | 1065 | DL |
| 2013 | 6 | 26 | BDL | DCA | 1077 | US |
| 2013 | 6 | 26 | BDL | TPA | 627 | WN |
| 2013 | 6 | 26 | BDL | MSP | 797 | DL |
| 2013 | 6 | 26 | BDL | CLT | 1705 | US |
| 2013 | 6 | 26 | BDL | CVG | 3787 | 9E |
As we can see, the origin of the flights is BDL short for Bradley International Airport, day is 26, month is 6 and the year is 2013.
Now, we will create a date column by concatenating the year, month, and day column and perform the previous query again.
SELECT
STR_TO_DATE(CONCAT(year, "-", month, "-", day), "%Y-%m-%d") AS date,
origin, flight, carrier
FROM flights
WHERE year=2013 AND month=6 AND day=26 AND origin="BDL"
LIMIT 10;| date | origin | flight | carrier |
|---|---|---|---|
| 2013-06-26 | BDL | 4714 | EV |
| 2013-06-26 | BDL | 2015 | AA |
| 2013-06-26 | BDL | 1644 | DL |
| 2013-06-26 | BDL | 2584 | WN |
| 2013-06-26 | BDL | 1065 | DL |
| 2013-06-26 | BDL | 1077 | US |
| 2013-06-26 | BDL | 627 | WN |
| 2013-06-26 | BDL | 797 | DL |
| 2013-06-26 | BDL | 1705 | US |
| 2013-06-26 | BDL | 3787 | 9E |
We will select the number of flights by each airlines carrier on 26th June 2013 departed from Bradly International Airport.
SELECT carrier,
COUNT(*) AS num_flights
FROM flights
WHERE year=2013 AND month=6 AND day=26 AND origin="BDL"
GROUP BY carrier;| carrier | num_flights |
|---|---|
| 9E | 5 |
| AA | 4 |
| B6 | 5 |
| DL | 11 |
| EV | 5 |
| MQ | 5 |
| UA | 1 |
| US | 7 |
| WN | 19 |
| YV | 3 |
We can calculate the the minimum departure time for each airlines carrier with the total number of flights for each airlines.
SELECT carrier,
COUNT(*) AS num_flights,
MIN(dep_time) AS minimum_time
FROM flights
WHERE year=2013 AND month=6 AND day=26 AND origin="BDL"
GROUP BY carrier;| carrier | num_flights | minimum_time |
|---|---|---|
| 9E | 5 | 0 |
| AA | 4 | 559 |
| B6 | 5 | 719 |
| DL | 11 | 559 |
| EV | 5 | 555 |
| MQ | 5 | 0 |
| UA | 1 | 0 |
| US | 7 | 618 |
| WN | 19 | 601 |
| YV | 3 | 0 |
From the output of the previous query, we see that the time format stated there is not right. There are 3 digits or 4 digits. In 3 digits time, the first digit is the hour digit and next two digits are the minutes. In the 4 digits case, the first two digits represent the hour and next two digits represent the minutes. We will convert this 3 digits or 4 digits number in to proper time scale.
SELECT carrier,
COUNT(*) AS num_flights,
MAKETIME(
IF(LENGTH(MIN(dep_time))=3,
LEFT(MIN(dep_time),1),
LEFT(MIN(dep_time),2)),
RIGHT(MIN(dep_time),2),
0) AS minimum_departure_Time
FROM flights
WHERE year=2013 AND month=6 AND day=26 AND origin="BDL"
GROUP BY carrier;| carrier | num_flights | minimum_departure_Time |
|---|---|---|
| 9E | 5 | 00:00:00 |
| AA | 4 | 05:59:00 |
| B6 | 5 | 07:19:00 |
| DL | 11 | 05:59:00 |
| EV | 5 | 05:55:00 |
| MQ | 5 | 00:00:00 |
| UA | 1 | 00:00:00 |
| US | 7 | 06:18:00 |
| WN | 19 | 06:01:00 |
| YV | 3 | 00:00:00 |
Let’s find the most favorable destinations to visit from the Bradley International Airport in 2013.
SELECT dest, COUNT(*) AS num_flights
FROM flights
WHERE year=2013 AND origin="BDL"
GROUP BY dest
ORDER BY num_flights DESC
LIMIT 6; | dest | num_flights |
|---|---|
| ORD | 2657 |
| BWI | 2613 |
| ATL | 2277 |
| CLT | 1842 |
| MCO | 1789 |
| DTW | 1523 |
We see the top six favorable destinations from the Bradley International ordered by the number of flights.
Now, another question arises, which of these destinations have the lowest average arrival time? Let’s find out.
SELECT dest, COUNT(*) as num_flights,
AVG(arr_delay) AS arrival_delay
FROM flights
WHERE year=2013 AND origin="BDL"
GROUP BY dest
ORDER BY arrival_delay ASC
LIMIT 0,10;| dest | num_flights | arrival_delay |
|---|---|---|
| CLE | 57 | -13.0702 |
| LAX | 127 | -10.3071 |
| CVG | 708 | -7.3701 |
| MSP | 981 | -3.6636 |
| MIA | 404 | -3.2723 |
| DCA | 204 | -2.8971 |
| DTW | 1523 | -2.1477 |
| DEN | 365 | -1.2767 |
| LAS | 262 | -0.3092 |
| CLT | 1842 | -0.1205 |
As a destination cleaveland has the lowest delayed time. Rather the flights arrived before the schedule. But after digging a while it reveals that in 2013 only 57 flights made the journey to cleaveland. So we will try to filter the flights with more than 2 flights per day.
SELECT dest,
COUNT(*) AS num_flights,
AVG(arr_delay) AS avg_delay
FROM flights
WHERE year=2013 AND origin="BDL"
GROUP BY dest
HAVING num_flights>2*365
ORDER BY avg_delay ASC
LIMIT 6;| dest | num_flights | avg_delay |
|---|---|---|
| MSP | 981 | -3.6636 |
| DTW | 1523 | -2.1477 |
| CLT | 1842 | -0.1205 |
| FLL | 1011 | 0.2770 |
| DFW | 1062 | 0.7495 |
| ATL | 2277 | 4.4704 |
This query reveals that Minneapolis-St. Paul has been the destination with the lowest arrival delay for 2 flights per day from the Bradley International.
The JOIN function in SQL basically works as the Inner Join. That means only the common rows for the matching keys are returned. The dest and the origin column in the flight table only contains abbreviated names.
| dest | origin | flight | carrier |
|---|---|---|---|
| OMA | EWR | 2558 | XE |
| SWF | FLL | 562 | B6 |
| SJU | JFK | 701 | B6 |
| BNA | IAD | 5982 | XE |
| FAT | LAX | 6433 | OO |
| DFW | LAX | 700 | AA |
| IAD | ATL | 1752 | DL |
| IAH | SMF | 1740 | CO |
| CVG | LAS | 2344 | DL |
| JFK | SJC | 174 | B6 |
But, it would be very convenient if we could retrieve their full names. The names of the airports are contained in the name column in the airports table. We can use the JOIN function in this case. We will match the two similar type of columns dest from flight as flight.dest and faa from airports table as airports.faa.
SELECT origin, dest,
airports.name AS destination,
flight, carrier
FROM flights
JOIN airports ON flights.dest=airports.faa
WHERE year=2013 AND month=6 AND day=23 AND origin="BDL"
LIMIT 0,6;| origin | dest | destination | flight | carrier |
|---|---|---|---|---|
| BDL | ORD | Chicago Ohare Intl | 4147 | EV |
| BDL | IAD | Washington Dulles Intl | 6005 | EV |
| BDL | MDW | Chicago Midway Intl | 3029 | WN |
| BDL | EWR | Newark Liberty Intl | 4714 | EV |
| BDL | MIA | Miami Intl | 2015 | AA |
| BDL | BWI | Baltimore Washington Intl | 2584 | WN |
The name of the destination airports are now retrieved which are much more readable for us. But the airports names have been saved in the database with shorthand notations to increase the efficiency for space.
Now we will use aliases instead of the full name of the tables airports and flights. This saves a lot of typing and also time. And sometimes, they are quite necessary.
SELECT origin, dest,
a.name AS destination_airport,
flight, carrier
FROM flights AS f
JOIN airports AS a ON f.dest=a.faa
WHERE year=2013 AND month=6 AND day=23 AND origin="BDL"
LIMIT 0,10;| origin | dest | destination_airport | flight | carrier |
|---|---|---|---|---|
| BDL | ORD | Chicago Ohare Intl | 4147 | EV |
| BDL | IAD | Washington Dulles Intl | 6005 | EV |
| BDL | MDW | Chicago Midway Intl | 3029 | WN |
| BDL | EWR | Newark Liberty Intl | 4714 | EV |
| BDL | MIA | Miami Intl | 2015 | AA |
| BDL | BWI | Baltimore Washington Intl | 2584 | WN |
| BDL | ATL | Hartsfield Jackson Atlanta Intl | 1065 | DL |
| BDL | MSP | Minneapolis St Paul Intl | 797 | DL |
| BDL | CLT | Charlotte Douglas Intl | 1141 | US |
| BDL | TPA | Tampa Intl | 627 | WN |
The output of the query is the same.
The carriers column in the flight table also have shorthand notations but the full names are stored in the separate carriers table. So, we can also join the carriers name from the carriers table with the flight table.
SELECT origin, dest,
a.name AS destination_airports,
c.name AS carriers_name,
flight, f.carrier AS carriers
FROM flights AS f
JOIN airports AS a ON f.dest=a.faa
JOIN carriers AS c ON f.carrier=c.carrier
WHERE year=2013 AND month=6 AND day=23 AND origin="BDL"
LIMIT 0,10;| origin | dest | destination_airports | carriers_name | flight | carriers |
|---|---|---|---|---|---|
| BDL | ORD | Chicago Ohare Intl | ExpressJet Airlines Inc. | 4147 | EV |
| BDL | IAD | Washington Dulles Intl | ExpressJet Airlines Inc. | 6005 | EV |
| BDL | MDW | Chicago Midway Intl | Southwest Airlines Co. | 3029 | WN |
| BDL | EWR | Newark Liberty Intl | ExpressJet Airlines Inc. | 4714 | EV |
| BDL | MIA | Miami Intl | American Airlines Inc. | 2015 | AA |
| BDL | BWI | Baltimore Washington Intl | Southwest Airlines Co. | 2584 | WN |
| BDL | ATL | Hartsfield Jackson Atlanta Intl | Delta Air Lines Inc. | 1065 | DL |
| BDL | MSP | Minneapolis St Paul Intl | Delta Air Lines Inc. | 797 | DL |
| BDL | CLT | Charlotte Douglas Intl | US Airways Inc. | 1141 | US |
| BDL | TPA | Tampa Intl | Southwest Airlines Co. | 627 | WN |
We can also see the cryptic shorthand notations for origin of the airports in the flight table. So, finally we can join origin from the flight table with name in the airports table to get the full name of the airports of origin. But there is a problem. The origin and dest columns are both in the flights table. So, we have to use two different aliases to extract the names from airports table. The airports table will be joined to the flights table using two different aliases.
SELECT flight,
a1.name AS origin_airports,
a2.name AS destination_airports,
c.name AS carriers_name
FROM flights AS f
JOIN airports AS a1 ON f.origin=a1.faa
JOIN airports AS a2 ON f.dest=a2.faa
JOIN carriers AS c ON f.carrier=c.carrier
WHERE year=2013 AND month=6 AND day=23 AND origin="BDL"
LIMIT 0,10;| flight | origin_airports | destination_airports | carriers_name |
|---|---|---|---|
| 4147 | Bradley Intl | Chicago Ohare Intl | ExpressJet Airlines Inc. |
| 6005 | Bradley Intl | Washington Dulles Intl | ExpressJet Airlines Inc. |
| 3029 | Bradley Intl | Chicago Midway Intl | Southwest Airlines Co. |
| 4714 | Bradley Intl | Newark Liberty Intl | ExpressJet Airlines Inc. |
| 2015 | Bradley Intl | Miami Intl | American Airlines Inc. |
| 2584 | Bradley Intl | Baltimore Washington Intl | Southwest Airlines Co. |
| 1065 | Bradley Intl | Hartsfield Jackson Atlanta Intl | Delta Air Lines Inc. |
| 797 | Bradley Intl | Minneapolis St Paul Intl | Delta Air Lines Inc. |
| 1141 | Bradley Intl | Charlotte Douglas Intl | US Airways Inc. |
| 627 | Bradley Intl | Tampa Intl | Southwest Airlines Co. |
Now it is evident from the result that the Southwest Airlines fight no 3029 flew from the Bradley airport on 23rd June, 2013 and reached the Chicago Midway Airport.
The Left Join function returns all the results for the left table doesn’t matter if all the rows have matching keys or not. There are some missing names for airports in the airports table. We will take a look at a couple of them by using the left join.
SELECT year, month, day,
a.name AS destination_airports,
origin, dest
FROM flights as f
LEFT JOIN airports AS a ON f.dest=a.faa
WHERE year=2013 AND month=6 AND day=23 AND a.name is NULL
LIMIT 0,10;| year | month | day | destination_airports | origin | dest |
|---|---|---|---|---|---|
| 2013 | 6 | 23 | NA | JFK | SJU |
| 2013 | 6 | 23 | NA | JFK | SJU |
| 2013 | 6 | 23 | NA | JFK | PSE |
| 2013 | 6 | 23 | NA | JFK | BQN |
| 2013 | 6 | 23 | NA | BOS | SJU |
| 2013 | 6 | 23 | NA | TPA | SJU |
| 2013 | 6 | 23 | NA | JFK | BQN |
| 2013 | 6 | 23 | NA | FLL | SJU |
| 2013 | 6 | 23 | NA | JFK | SJU |
| 2013 | 6 | 23 | NA | JFK | SJU |
UNION is a very useful function when we want to bind two separate queries into one. This is analogous to the R dplyr function bind_rows.
(
SELECT year, month, day, origin, dest, flight, carrier
FROM flights
WHERE year=2013 AND month=6 AND day=26 AND origin="BDL" AND dest="MSP"
)
UNION
(
SELECT year, month, day, origin, dest, flight, carrier
FROM flights
WHERE year=2013 AND month=6 AND day=26 AND origin="JFK" AND dest="ORD"
)
LIMIT 0,10;| year | month | day | origin | dest | flight | carrier |
|---|---|---|---|---|---|---|
| 2013 | 6 | 26 | BDL | MSP | 797 | DL |
| 2013 | 6 | 26 | BDL | MSP | 3338 | 9E |
| 2013 | 6 | 26 | BDL | MSP | 1226 | DL |
| 2013 | 6 | 26 | JFK | ORD | 905 | B6 |
| 2013 | 6 | 26 | JFK | ORD | 1105 | B6 |
| 2013 | 6 | 26 | JFK | ORD | 3523 | 9E |
| 2013 | 6 | 26 | JFK | ORD | 1711 | AA |
| 2013 | 6 | 26 | JFK | ORD | 105 | B6 |
| 2013 | 6 | 26 | JFK | ORD | 3521 | 9E |
| 2013 | 6 | 26 | JFK | ORD | 3525 | 9E |
Subquery is a small query which we can insert inside another big query as a condition or as a table. Although Bradley airport is known as an international airport but it functions most often as a regional airport. So, we are interested to know whether this airport receives any flight from Alaska or Hawaii. We can filter the states outside the lower 48 states using the time zone provided in the airport table.
| faa | name | tz | city |
|---|---|---|---|
| 369 | Atmautluak Airport | -9 | Atmautluak |
| 6K8 | Tok Junction Airport | -9 | Tok |
| ABL | Ambler Airport | -9 | Ambler |
| ADK | Adak Airport | -9 | Adak Island |
| ADQ | Kodiak | -9 | Kodiak |
| AET | Allakaket Airport | -9 | Allakaket |
| AFE | Kake Airport | -9 | Kake |
| AGN | Angoon Seaplane Base | -9 | Angoon |
| AIN | Wainwright Airport | -9 | Wainwright |
| AKB | Atka Airport | -9 | Atka |
We will use the output returned through faa in this code snippet/subquery in the following query to get the number of flights from Bradley to the airports in the time zone below -8.
SELECT dest, a.name AS destination_airport,
SUM(1) AS totals, COUNT(DISTINCT carrier) AS number_of_carriers
FROM flights AS f
LEFT JOIN airports AS a ON f.dest=a.faa
WHERE year=2013 AND origin="BDL"
AND dest in
(SELECT faa
FROM airports
WHERE tz < -8)
GROUP BY dest;| dest | destination_airport | totals | number_of_carriers |
|---|
It returns nothing. That means we have found no airports where flights flew from Bradley airport to the airports in the time zone less than -8 in 2013.
Let’s check whether any flight flew from the Pacific region(time zone < -7) to the Bradley airport.
SELECT dest, a.name AS origin_airport,
SUM(1) AS totals, COUNT(DISTINCT carrier) AS number_of_carriers
FROM flights AS f
LEFT JOIN airports AS a ON f.origin=a.faa
WHERE year=2013 AND dest="BDL"
AND origin IN
(SELECT faa
FROM airports
WHERE tz < -7)
GROUP BY origin;| dest | origin_airport | totals | number_of_carriers |
|---|---|---|---|
| BDL | Mc Carran Intl | 262 | 1 |
| BDL | Los Angeles Intl | 127 | 1 |
Yes! You have found 262 flights from the Mc Carran Intl and 127 flights from the Los Angeles Intl airport to Bradley International airport. Both of the origin airports are below the time zone -7.
This marks the end of this SQL based project. We have almost covered most of the basic and fundamental ways to build queries for simple SQL operations and at the end we have build complex queries using join functions for more specific outcomes.