Important Note
The following problems are drawn from the textbook ‘Modern Data Science with R’.
Make the Connection
Here we make the connection with the airlines database in Rstudio
con <- dbConnect(RMySQL::MySQL(),
dbname = "airlines",
host = "scidb.smith.edu",
user = "mth292",
password = "RememberPi")use airlines;Here we see the different tables within the airlines database.
dbListTables(con)## [1] "airports" "carriers" "flights" "planes" "summary" "weather"
describe airports;| Field | Type | Null | Key | Default | Extra |
|---|---|---|---|---|---|
| faa | varchar(3) | NO | PRI | ||
| name | varchar(255) | YES | NA | ||
| lat | decimal(10,7) | YES | NA | ||
| lon | decimal(10,7) | YES | NA | ||
| alt | int(11) | YES | NA | ||
| tz | smallint(4) | YES | NA | ||
| dst | char(1) | YES | NA | ||
| city | varchar(255) | YES | NA | ||
| country | varchar(255) | YES | NA |
describe carriers| Field | Type | Null | Key | Default | Extra |
|---|---|---|---|---|---|
| carrier | varchar(7) | NO | PRI | ||
| name | varchar(255) | NO |
Practice query to make sure connection is up and running.
SELECT origin AS Origin, sum(air_time) AS Total_Time
FROM flights
WHERE year = 2013 AND Month = 6 AND DAY = 20
GROUP BY Origin| Origin | Total_Time |
|---|---|
| ABE | 674 |
| ABI | 264 |
| ABQ | 8732 |
| ABR | 83 |
| ABY | 100 |
| ACK | 104 |
| ACT | 127 |
| ACV | 509 |
| ADK | 146 |
| ADQ | 81 |
Problem 1
How many domestic flights flew into Dallas-Fort Worth on May 14, 1998?
The first five problems are pretty straightforward, so I’ll just let the code do the talking:
SELECT sum(1) As Total
FROM flights
WHERE dest = "DFW" AND year = 1998 AND month = 5 AND day = 14| Total |
|---|
| 678 |
Problem 2
Find all flights between JFK and SFO in 1994. How many were cancelled? What percentage of the total number of flights were cancelled?
SELECT sum(1) as TotalFlights,sum(cancelled) as TotalCancelled, sum(cancelled) / sum(1) *100 as PercentageCancelled
FROM flights
WHERE origin = "JFK" AND dest = "SFO" AND year = 1994| TotalFlights | TotalCancelled | PercentageCancelled |
|---|---|---|
| 4849 | 84 | 1.7323 |
Problem 3
Of all the destinations from Chicago O’Hare (ORD), which were the most common in 1997?
SELECT dest as Destination, sum(1) as n
FROM flights
WHERE year = 1997 AND origin = "ORD"
GROUP BY dest
ORDER BY n desc
LIMIT 5| Destination | n |
|---|---|
| MSP | 13364 |
| LAX | 12743 |
| EWR | 12227 |
| DFW | 11446 |
| LGA | 11433 |
Problem 4
Which airport had the highest average arrival delay time in 2008?
SELECT dest, sum(1) as numFlights, avg(arr_delay) as MeanArrivalDelay
FROM flights
WHERE year = 2008
GROUP BY dest
ORDER BY MeanArrivalDelay desc
LIMIT 1| dest | numFlights | MeanArrivalDelay |
|---|---|---|
| MQT | 993 | 28.9013 |
Problem 5
How many domestic flights came into or flew out of Bradley Airport (BDL) in 2012?
SELECT sum(origin = "BDL") + sum(dest = "BDL") as Total_BDL_Flights
FROM flights
WHERE year = 2012 | Total_BDL_Flights |
|---|
| 41949 |
Problem 6
List the airline and flight number for all flights between LAX and JFK on September 26, 1990.
The first thing I had to focus on here was to join flights with carriers, then it was pretty easy to get the airline with the flight number.
SELECT c.name as Airline, f.flight as FlightNumber
FROM flights as f
JOIN carriers as c
ON f.carrier = c.carrier
where f.month = 9 and f.year = 1990 and f.day = 26 and f.origin = "LAX" and f.dest = "JFK" | Airline | FlightNumber |
|---|---|
| Trans World Airways LLC | 840 |
| Florida Coastal Airlines | 82 |
| American Airlines Inc. | 2 |
| United Air Lines Inc. | 6 |
| Trans World Airways LLC | 904 |
| Florida Coastal Airlines | 90 |
| American Airlines Inc. | 40 |
| American Airlines Inc. | 4 |
| United Air Lines Inc. | 8 |
| American Airlines Inc. | 32 |
Problem 7
This and problems 8 and 9 use the lahman database.
List the names of all batters who have at least 300 home runs (HR) and 300 stolen bases (SB) in their careers and rank them by their career batting average.
use lahman;show tables;| Tables_in_lahman |
|---|
| AllstarFull |
| Appearances |
| AwardsManagers |
| AwardsPlayers |
| AwardsShareManagers |
| AwardsSharePlayers |
| Batting |
| BattingPost |
| CollegePlaying |
| Fielding |
SELECT *
FROM AwardsPlayers
WHERE yearID = 1982;| playerID | awardID | yearID | lgID | tie | notes |
|---|---|---|---|---|---|
| lynnfr01 | ALCS MVP | 1982 | AL | ||
| conceda01 | All-Star Game MVP | 1982 | ML | ||
| suttebr01 | Babe Ruth Award | 1982 | NL | RP | |
| vuckope01 | Cy Young Award | 1982 | AL | ||
| carltst01 | Cy Young Award | 1982 | NL | ||
| bellbu01 | Gold Glove | 1982 | AL | 3B | |
| boonebo01 | Gold Glove | 1982 | AL | C | |
| evansdw01 | Gold Glove | 1982 | AL | OF | |
| guidrro01 | Gold Glove | 1982 | AL | P | |
| murphdw01 | Gold Glove | 1982 | AL | OF |
SELECT *
FROM HallOfFame;| playerID | yearid | votedBy | ballots | needed | votes | inducted | category | needed_note |
|---|---|---|---|---|---|---|---|---|
| aaronha01 | 1982 | BBWAA | 415 | 312 | 406 | Y | Player | |
| abbotji01 | 2005 | BBWAA | 516 | 387 | 13 | N | Player | |
| adamsba01 | 1937 | BBWAA | 201 | 151 | 8 | N | Player | |
| adamsba01 | 1938 | BBWAA | 262 | 197 | 11 | N | Player | |
| adamsba01 | 1939 | BBWAA | 274 | 206 | 11 | N | Player | |
| adamsba01 | 1942 | BBWAA | 233 | 175 | 11 | N | Player | |
| adamsba01 | 1945 | BBWAA | 247 | 186 | 7 | N | Player | |
| adamsba01 | 1946 | Nominating Vote | 202 | 0 | 6 | N | Player | Top 20 |
| adamsba01 | 1947 | BBWAA | 161 | 121 | 22 | N | Player | |
| adamsba01 | 1948 | BBWAA | 121 | 91 | 4 | N | Player |
The first thing I focused on here was joining ‘Batting’ with ‘Master’, so I could get the full names of the players. I also had to use ‘having’ for the first time here so I could filter down to the players with at least 300 home runs and 300 stolen bases.
SELECT concat(m.nameFirst, " ", m.nameLast) as Player, round(sum(H) * 1.0 / sum(AB), 3) AS Avg
FROM Batting as b
JOIN Master as m
on b.playerID = m.playerID
GROUP BY b.playerID
HAVING sum(b.HR) >= 300 AND sum(b.SB) >= 300
ORDER BY Avg desc| Player | Avg |
|---|---|
| Willie Mays | 0.302 |
| Barry Bonds | 0.298 |
| Alex Rodriguez | 0.295 |
| Carlos Beltran | 0.281 |
| Andre Dawson | 0.279 |
| Steve Finley | 0.271 |
| Bobby Bonds | 0.268 |
| Reggie Sanders | 0.267 |
Problem 8
List the names of all pitchers who have at least 300 wins (W) and 3000 strikeouts (SO) in their careers and rank them by career winning percentage (W/(W+L)).
This was very similar to the problem above, except I had to switch to switch to the ‘Pitching’ table here, and I had to calculate the winning percentage.
SELECT concat(m.nameFirst, " ", m.nameLast) as Player, round(sum(p.W)*1.0 / (sum(p.W)+sum(p.L)), 3) AS winningAvg
FROM Pitching as p
JOIN Master as m
on p.playerID = m.playerID
GROUP BY p.playerID
HAVING sum(p.W) >= 300 AND sum(p.SO) >= 3000
ORDER BY winningAvg desc| Player | winningAvg |
|---|---|
| Roger Clemens | 0.658 |
| Randy Johnson | 0.646 |
| Greg Maddux | 0.610 |
| Tom Seaver | 0.603 |
| Walter Johnson | 0.599 |
| Steve Carlton | 0.574 |
| Don Sutton | 0.559 |
| Gaylord Perry | 0.542 |
| Phil Niekro | 0.537 |
| Nolan Ryan | 0.526 |
Problem 9
The attainment of either 500 home runs (HR) or 3,000 hits (H) in a career is considered to be among the greatest achievements to which a batter can aspire. These milestones are thought to guarantee induction into the Baseball Hall of Fame, and yet several players who have attained either milestone have not been inducted into the Hall of Fame. Identify them.
I decided to use a method similar to an example in the practice, where I find the players who don’t show up in the Hall of Fame table, then I filter to the given requirements.
SELECT concat(m.nameFirst, " ", m.nameLast) as Player
FROM Batting as b
JOIN Master as m
on b.playerID = m.playerID
LEFT JOIN (SELECT playerID
FROM HallOfFame) h ON b.playerID = h.playerID
WHERE h.playerID is null
GROUP BY b.playerID
HAVING (sum(b.HR) >= 500 OR sum(b.H) >= 3000)| Player |
|---|
| Derek Jeter |
| David Ortiz |
| Albert Pujols |
| Alex Rodriguez |
| Ichiro Suzuki |
| Jim Thome |
Problem 10
Based on data from 2012 only, and assuming that transportation to the airport is not an issue, would you rather fly out of JFK, LaGuardia (LGA), or Newark (EWR)? Why or why not?
use airlines;For the sake of variety, let’s pull the necessary data directly into R:
SELECT *
FROM flights
WHERE year = 2012 and origin in ('JFK', 'LGA', 'EWR');We will use dplyr commands here, since flights is an imported object in your R session.
Since I’ve had many issues with departure delays in my life, I’m going to base this on the average departure delay for each airport. Over last semester’s fall break, I actually got stuck in Nebraska over night by myself because of a flight being continuously delayed. Let’s look at average departure delays for each of these airports:
flights %>%
group_by(origin) %>%
summarize(mean_dep_delay = mean(dep_delay)) %>%
arrange(mean_dep_delay)## # A tibble: 3 x 2
## origin mean_dep_delay
## <chr> <dbl>
## 1 LGA 6.30
## 2 JFK 9.43
## 3 EWR 15.1
LaGuardia’s airport has the lowest average departure delay by a decent margin, so it’s probably more reliable to fly out of. If transportation is not an issue, I would fly out of LaGuardia.
dbDisconnect(con)## [1] TRUE