Important Note

The following problems are drawn from the textbook ‘Modern Data Science with R’.

Make the Connection

Here we make the connection with the airlines database in Rstudio

con <- dbConnect(RMySQL::MySQL(),
                 dbname = "airlines",
                 host = "scidb.smith.edu",
                 user = "mth292",
                 password = "RememberPi")
use airlines;

Here we see the different tables within the airlines database.

dbListTables(con)
## [1] "airports" "carriers" "flights"  "planes"   "summary"  "weather"
describe airports;
9 records
Field Type Null Key Default Extra
faa varchar(3) NO PRI
name varchar(255) YES NA
lat decimal(10,7) YES NA
lon decimal(10,7) YES NA
alt int(11) YES NA
tz smallint(4) YES NA
dst char(1) YES NA
city varchar(255) YES NA
country varchar(255) YES NA
describe carriers
2 records
Field Type Null Key Default Extra
carrier varchar(7) NO PRI
name varchar(255) NO

Practice query to make sure connection is up and running.

SELECT origin AS Origin, sum(air_time) AS Total_Time
FROM flights
WHERE year = 2013 AND Month = 6 AND DAY = 20
GROUP BY Origin
Displaying records 1 - 10
Origin Total_Time
ABE 674
ABI 264
ABQ 8732
ABR 83
ABY 100
ACK 104
ACT 127
ACV 509
ADK 146
ADQ 81

Problem 1

How many domestic flights flew into Dallas-Fort Worth on May 14, 1998?

The first five problems are pretty straightforward, so I’ll just let the code do the talking:

SELECT sum(1) As Total
FROM flights
WHERE dest = "DFW" AND year = 1998 AND month = 5 AND day = 14
1 records
Total
678

Problem 2

Find all flights between JFK and SFO in 1994. How many were cancelled? What percentage of the total number of flights were cancelled?

SELECT sum(1) as TotalFlights,sum(cancelled) as TotalCancelled, sum(cancelled) / sum(1) *100 as PercentageCancelled
FROM flights
WHERE origin = "JFK" AND dest = "SFO" AND year = 1994
1 records
TotalFlights TotalCancelled PercentageCancelled
4849 84 1.7323

Problem 3

Of all the destinations from Chicago O’Hare (ORD), which were the most common in 1997?

SELECT dest as Destination, sum(1) as n
FROM flights
WHERE year = 1997 AND origin = "ORD"
GROUP BY dest
ORDER BY n desc
LIMIT 5
5 records
Destination n
MSP 13364
LAX 12743
EWR 12227
DFW 11446
LGA 11433

Problem 4

Which airport had the highest average arrival delay time in 2008?

SELECT dest, sum(1) as numFlights, avg(arr_delay) as MeanArrivalDelay
FROM flights
WHERE year = 2008
GROUP BY dest
ORDER BY MeanArrivalDelay desc
LIMIT 1
1 records
dest numFlights MeanArrivalDelay
MQT 993 28.9013

Problem 5

How many domestic flights came into or flew out of Bradley Airport (BDL) in 2012?

SELECT sum(origin = "BDL") + sum(dest = "BDL") as Total_BDL_Flights
FROM flights
WHERE year = 2012 
1 records
Total_BDL_Flights
41949

Problem 6

List the airline and flight number for all flights between LAX and JFK on September 26, 1990.

The first thing I had to focus on here was to join flights with carriers, then it was pretty easy to get the airline with the flight number.

SELECT c.name as Airline, f.flight as FlightNumber
FROM flights as f
JOIN carriers as c
ON f.carrier = c.carrier
where f.month = 9 and f.year = 1990 and f.day = 26 and f.origin = "LAX" and f.dest = "JFK" 
Displaying records 1 - 10
Airline FlightNumber
Trans World Airways LLC 840
Florida Coastal Airlines 82
American Airlines Inc. 2
United Air Lines Inc. 6
Trans World Airways LLC 904
Florida Coastal Airlines 90
American Airlines Inc. 40
American Airlines Inc. 4
United Air Lines Inc. 8
American Airlines Inc. 32

Problem 7

This and problems 8 and 9 use the lahman database.

List the names of all batters who have at least 300 home runs (HR) and 300 stolen bases (SB) in their careers and rank them by their career batting average.

use lahman;
show tables;
Displaying records 1 - 10
Tables_in_lahman
AllstarFull
Appearances
AwardsManagers
AwardsPlayers
AwardsShareManagers
AwardsSharePlayers
Batting
BattingPost
CollegePlaying
Fielding
SELECT *
FROM AwardsPlayers
WHERE yearID = 1982;
Displaying records 1 - 10
playerID awardID yearID lgID tie notes
lynnfr01 ALCS MVP 1982 AL
conceda01 All-Star Game MVP 1982 ML
suttebr01 Babe Ruth Award 1982 NL RP
vuckope01 Cy Young Award 1982 AL
carltst01 Cy Young Award 1982 NL
bellbu01 Gold Glove 1982 AL 3B
boonebo01 Gold Glove 1982 AL C
evansdw01 Gold Glove 1982 AL OF
guidrro01 Gold Glove 1982 AL P
murphdw01 Gold Glove 1982 AL OF
SELECT *
FROM HallOfFame;
Displaying records 1 - 10
playerID yearid votedBy ballots needed votes inducted category needed_note
aaronha01 1982 BBWAA 415 312 406 Y Player
abbotji01 2005 BBWAA 516 387 13 N Player
adamsba01 1937 BBWAA 201 151 8 N Player
adamsba01 1938 BBWAA 262 197 11 N Player
adamsba01 1939 BBWAA 274 206 11 N Player
adamsba01 1942 BBWAA 233 175 11 N Player
adamsba01 1945 BBWAA 247 186 7 N Player
adamsba01 1946 Nominating Vote 202 0 6 N Player Top 20
adamsba01 1947 BBWAA 161 121 22 N Player
adamsba01 1948 BBWAA 121 91 4 N Player

The first thing I focused on here was joining ‘Batting’ with ‘Master’, so I could get the full names of the players. I also had to use ‘having’ for the first time here so I could filter down to the players with at least 300 home runs and 300 stolen bases.

SELECT concat(m.nameFirst, " ", m.nameLast) as Player, round(sum(H) * 1.0 / sum(AB), 3) AS Avg
FROM Batting as b
JOIN Master as m
on b.playerID = m.playerID
GROUP BY b.playerID
HAVING sum(b.HR) >= 300 AND sum(b.SB) >= 300
ORDER BY Avg desc
8 records
Player Avg
Willie Mays 0.302
Barry Bonds 0.298
Alex Rodriguez 0.295
Carlos Beltran 0.281
Andre Dawson 0.279
Steve Finley 0.271
Bobby Bonds 0.268
Reggie Sanders 0.267

Problem 8

List the names of all pitchers who have at least 300 wins (W) and 3000 strikeouts (SO) in their careers and rank them by career winning percentage (W/(W+L)).

This was very similar to the problem above, except I had to switch to switch to the ‘Pitching’ table here, and I had to calculate the winning percentage.

SELECT concat(m.nameFirst, " ", m.nameLast) as Player, round(sum(p.W)*1.0 / (sum(p.W)+sum(p.L)), 3) AS winningAvg
FROM Pitching as p
JOIN Master as m
on p.playerID = m.playerID
GROUP BY p.playerID
HAVING sum(p.W) >= 300 AND sum(p.SO) >= 3000
ORDER BY winningAvg desc
Displaying records 1 - 10
Player winningAvg
Roger Clemens 0.658
Randy Johnson 0.646
Greg Maddux 0.610
Tom Seaver 0.603
Walter Johnson 0.599
Steve Carlton 0.574
Don Sutton 0.559
Gaylord Perry 0.542
Phil Niekro 0.537
Nolan Ryan 0.526

Problem 9

The attainment of either 500 home runs (HR) or 3,000 hits (H) in a career is considered to be among the greatest achievements to which a batter can aspire. These milestones are thought to guarantee induction into the Baseball Hall of Fame, and yet several players who have attained either milestone have not been inducted into the Hall of Fame. Identify them.

I decided to use a method similar to an example in the practice, where I find the players who don’t show up in the Hall of Fame table, then I filter to the given requirements.

SELECT concat(m.nameFirst, " ", m.nameLast) as Player
FROM Batting as b
JOIN Master as m
on b.playerID = m.playerID
LEFT JOIN (SELECT playerID
            FROM HallOfFame) h ON b.playerID = h.playerID
WHERE h.playerID is null
GROUP BY b.playerID
HAVING (sum(b.HR) >= 500 OR sum(b.H) >= 3000)
6 records
Player
Derek Jeter
David Ortiz
Albert Pujols
Alex Rodriguez
Ichiro Suzuki
Jim Thome

Problem 10

Based on data from 2012 only, and assuming that transportation to the airport is not an issue, would you rather fly out of JFK, LaGuardia (LGA), or Newark (EWR)? Why or why not?

use airlines;

For the sake of variety, let’s pull the necessary data directly into R:

SELECT *
FROM flights
WHERE year = 2012 and origin in ('JFK', 'LGA', 'EWR');

We will use dplyr commands here, since flights is an imported object in your R session.

Since I’ve had many issues with departure delays in my life, I’m going to base this on the average departure delay for each airport. Over last semester’s fall break, I actually got stuck in Nebraska over night by myself because of a flight being continuously delayed. Let’s look at average departure delays for each of these airports:

flights %>% 
  group_by(origin) %>% 
  summarize(mean_dep_delay = mean(dep_delay)) %>% 
  arrange(mean_dep_delay)
## # A tibble: 3 x 2
##   origin mean_dep_delay
##   <chr>           <dbl>
## 1 LGA              6.30
## 2 JFK              9.43
## 3 EWR             15.1

LaGuardia’s airport has the lowest average departure delay by a decent margin, so it’s probably more reliable to fly out of. If transportation is not an issue, I would fly out of LaGuardia.

dbDisconnect(con)
## [1] TRUE