library(tidyverse)
library(DBI)
library(duckdb)
library(nycflights13)
library(dbplyr)

Lab 6: R and SQL
Overview
This lab is divided into two parts. In the first part you will practice using joins for data wrangling and analysis on the nycflights dataset. Some of the problems come from Chapter 19 of your book. For the second part, you will download a dataset on the budgets of college sports programs and process it for storage in a relational database (I strongly recommend using duckdb which can be installed using install.packages("duckdb")- duckdb is highly performant, self-contained, and ideally suited both to learning SQL and performing data analysis). Then you will load this database and use dbplyr to perform an analysis. You will also practice using forcats to recode some of the variables as factors (which are supported by duckdb) and using separate_wider_delim to split columns of text data.
You will need to have installed and loaded the following libraries: tidyverse, DBI, duckdb, nycflights13, and dbplyr (loaded at the top of this document).
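If any of these packages are missing, they can first be installed from CRAN, for example:

# one-time installation of the packages used in this lab
install.packages(c("tidyverse", "DBI", "duckdb", "nycflights13", "dbplyr"))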
Problems
Part I: Airline Flight Delays
For the first part of this lab exercise, we will be using the nycflights13 library, which contains several built-in datasets, including planes, which has information on each plane that appears in the data; flights, which has information on individual flights; airports, which has information on individual airports; and weather, which has information on the weather at the origin airports. In order to do this set of lab exercises, you will need to use different types of joins to combine variables from each data frame.
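As a quick orientation (an illustrative sketch, not part of the assignment), the key relationships among these tables look like this:

# Key relationships among the nycflights13 tables (illustrative only):
#   flights$tailnum                   -> planes$tailnum
#   flights$origin, flights$dest      -> airports$faa
#   flights$origin, flights$time_hour -> weather$origin, weather$time_hour
flights |>
  left_join(airports, join_by(dest == faa)) |>
  glimpse()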
Problem 1
- Use the flights and planes tables to compute the mean departure delay of each aircraft that has more than 30 recorded flights in the dataset. Hint: Make note of the fact that the variable year appears in both flights and planes but means different things in each before performing any joins.
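One way to sidestep the year clash (a minimal sketch; year_built is my own name, not from the assignment) is to rename planes' year before joining:

# planes$year is the year of manufacture; flights$year is the year of the flight
flights |>
  left_join(planes |> rename(year_built = year), join_by(tailnum))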
flights2 <- flights |>
  mutate(id = row_number(), .before = 1)

flights2 |>
  left_join(planes, join_by(tailnum)) |>
  group_by(tailnum) |>
  summarise(
    avg_dep_delay = mean(dep_delay, na.rm = TRUE),
    countflights = n()
  ) |>
  filter(countflights >= 30) |>
  arrange(tailnum)

# A tibble: 2,745 × 3
tailnum avg_dep_delay countflights
<chr> <dbl> <int>
1 N0EGMQ 8.49 371
2 N10156 17.8 153
3 N102UW 8 48
4 N103US -3.20 46
5 N104UW 9.94 47
6 N10575 22.7 289
7 N105UW 2.58 45
8 N107US -0.463 41
9 N108UW 4.22 60
10 N109UW 0.104 48
# ℹ 2,735 more rows
- Use anti_join() to identify flights where tailnum does not have a match in planes. Determine the carriers for which this problem is the most common.
flights2 |>
  anti_join(planes, by = "tailnum") |>
  group_by(carrier) |>
  summarize(carrier_missing_tailnum_count = n()) |>
  arrange(desc(carrier_missing_tailnum_count))

# A tibble: 10 × 2
carrier carrier_missing_tailnum_count
<chr> <int>
1 MQ 25397
2 AA 22558
3 UA 1693
4 9E 1044
5 B6 830
6 US 699
7 FL 187
8 DL 110
9 F9 50
10 WN 38
The carriers MQ (Envoy Air) and AA (American Airlines) are the carriers for which this problem is most common.
- Find the airplane model which made the most flights in the dataset, and filter the dataset to contain only flights flown by airplanes of that model, adding a variable which corresponds to the year each of those airplanes was built. Then compute the average departure delay for each year of origin and plot the data. Is there any evidence that older planes have greater departure delays?
While the value NA for model has the highest count of flights, this is not an indication that all flights with a missing model were flown by the same airplane model. It is more likely that they were not, but we really don't know. So we will continue the analysis with the airplane model that has the most flights among those we actually have information for: model = A320-232.
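As a sanity check (a sketch reusing the same join as below), the most-flown non-missing model can be found directly:

flights2 |>
  left_join(planes, join_by(tailnum)) |>
  filter(!is.na(model)) |>
  count(model, sort = TRUE) |>
  slice_head(n = 1)  # should return A320-232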
topmodels <- flights2 |>
  left_join(planes, join_by(tailnum)) |>
  group_by(model) |>
  summarise(
    countflights = n()
  ) |>
  arrange(desc(countflights))

topmodel <- flights2 |>
  left_join(planes, join_by(tailnum)) |>
  filter(model == "A320-232") |>
  rename(year_built = year.y) |>
  group_by(year_built) |>
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) |>
  arrange(year_built)

print(topmodel)

# A tibble: 21 × 2
year_built avg_dep_delay
<int> <dbl>
1 1993 15.6
2 1994 13.7
3 1995 15.2
4 1996 15.5
5 1997 13.2
6 1998 14.1
7 1999 10.0
8 2000 12.0
9 2001 12.8
10 2002 11.9
# ℹ 11 more rows
ggplot(topmodel, aes(x = year_built, y = avg_dep_delay)) +
  geom_point() +
  labs(title = "A320-232: Average departure delay by year built",
       subtitle = "Older builds tend to have greater average departure delays")

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
As indicated in the plot and noted in the subtitle: yes, it appears that older builds tend to have greater average departure delays.
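To put a rough number on the trend, one could fit a simple linear model to the yearly averages (a sketch; note this weights every build year equally regardless of how many flights it contains):

# a negative year_built coefficient means newer builds have smaller average delays
lm(avg_dep_delay ~ year_built, data = topmodel) |>
  summary()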
Problem 2
- Compute the average delay by destination, then join on the airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:
airports |>
  semi_join(flights, join_by(faa == dest)) |>
  ggplot(aes(x = lon, y = lat)) +
  borders("state") +
  geom_point() +
  coord_quickmap()

You might want to use the size or color of the points to display the average delay for each airport.
avg_delay_dest <- flights2 |>
  group_by(dest) |>
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) |>
  inner_join(airports, join_by(dest == faa))

avg_delay_dest |>
  ggplot(aes(x = lon, y = lat, color = avg_delay)) +
  borders("state") +
  geom_point() +
  coord_quickmap() +
  scale_color_gradient(low = "slategrey", high = "cyan")

Part II: Creating and Accessing a Database
In this exercise we will begin with a flat file which contains data on college sports programs throughout the country. The source of the data is a government-run database called Equity in Athletics Data Analysis, though we are working with just a small subset here. You can download this file by clicking here: sports_program_costs.csv. I have also included a data dictionary which gives a quick description of the dataset, which can be downloaded from here: sports_program_data_dictionary.qmd. This file contains information on two types of entities, sports teams and universities; however, the information on both entities is combined into a single table, creating substantial redundancies. This exercise has several goals:
- Load this data into R, then split the data frame into two data frames, one corresponding to colleges and another corresponding to sports teams, related to each other by common keys. Many databases are stored according to normalization rules, which are designed to limit redundancy and to make it easier both to work with the data and to make changes to it. By splitting the data frame we will partially normalize it (but won't go too far).
sportsdata <- read_csv("data/sports_program_costs.csv")

Rows: 132327 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): institution_name, city_txt, state_cd, zip_text, classification_nam...
dbl (20): year, unitid, classification_code, ef_male_count, ef_female_count,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sportsdata

# A tibble: 132,327 × 28
year unitid institution_name city_txt state_cd zip_text classification_code
<dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 2015 100654 Alabama A & M Un… Normal AL 35762 2
2 2015 100654 Alabama A & M Un… Normal AL 35762 2
3 2015 100654 Alabama A & M Un… Normal AL 35762 2
4 2015 100654 Alabama A & M Un… Normal AL 35762 2
5 2015 100654 Alabama A & M Un… Normal AL 35762 2
6 2015 100654 Alabama A & M Un… Normal AL 35762 2
7 2015 100654 Alabama A & M Un… Normal AL 35762 2
8 2015 100654 Alabama A & M Un… Normal AL 35762 2
9 2015 100654 Alabama A & M Un… Normal AL 35762 2
10 2015 100654 Alabama A & M Un… Normal AL 35762 2
# ℹ 132,317 more rows
# ℹ 21 more variables: classification_name <chr>, classification_other <chr>,
# ef_male_count <dbl>, ef_female_count <dbl>, ef_total_count <dbl>,
# sector_cd <dbl>, sector_name <chr>, sportscode <dbl>, partic_men <dbl>,
# partic_women <dbl>, partic_coed_men <dbl>, partic_coed_women <dbl>,
# sum_partic_men <dbl>, sum_partic_women <dbl>, rev_men <dbl>,
# rev_women <dbl>, total_rev_menwomen <dbl>, exp_men <dbl>, …
- Create a relational database using duckdb which contains these two tables.
- Read this database into R and use an SQL query (or queries) to perform an analysis.
Problem 3:
- sports_program_costs.csv contains variables which describe properties of either a sports team or a college. Split the data (loaded above as sportsdata) into two data frames, one called colleges and another called teams. How can you tell which variables describe colleges and which describe teams? Use the data dictionary and observations of how the values vary as you move from college to college to help make the decision easier. Make sure there are primary keys for both the colleges and teams data frames (verify with count()). What are the primary keys in each case, and are they simple keys (one variable) or compound keys (requiring multiple variables)? One of these datasets should contain a foreign key: which one has it, and what variables comprise it?
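As an aside, the key checks performed below all follow one pattern, which could be wrapped in a small hypothetical helper (check_key is my own name, not from the assignment): it returns any combinations of the candidate key columns that appear more than once, so zero rows means those columns form a valid key.

check_key <- function(df, ...) {
  df |>
    distinct() |>
    count(...) |>
    filter(n > 1)
}
# e.g., once colleges exists below, check_key(colleges, year, unitid) should return zero rows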
sportsdata_columns <- colnames(sportsdata)
print(sportsdata_columns)

 [1] "year"                "unitid"              "institution_name"
[4] "city_txt" "state_cd" "zip_text"
[7] "classification_code" "classification_name" "classification_other"
[10] "ef_male_count" "ef_female_count" "ef_total_count"
[13] "sector_cd" "sector_name" "sportscode"
[16] "partic_men" "partic_women" "partic_coed_men"
[19] "partic_coed_women" "sum_partic_men" "sum_partic_women"
[22] "rev_men" "rev_women" "total_rev_menwomen"
[25] "exp_men" "exp_women" "total_exp_menwomen"
[28] "sports"
colleges_columns <- c(
  "year", "unitid", "institution_name", "city_txt", "state_cd", "zip_text",
  "classification_code", "classification_name", "classification_other",
  "ef_male_count", "ef_female_count", "ef_total_count", "sector_cd", "sector_name"
)

teams_columns <- c(
  "year", "unitid", "sportscode", "partic_men", "partic_women",
  "partic_coed_men", "partic_coed_women", "sum_partic_men", "sum_partic_women",
  "rev_men", "rev_women", "total_rev_menwomen", "exp_men", "exp_women",
  "total_exp_menwomen", "sports"
)

colleges <- sportsdata |> select(all_of(colleges_columns))
teams <- sportsdata |> select(all_of(teams_columns))

colleges |> unique() |>
  count(year, institution_name) |> filter(n > 1) |> arrange(desc(n))

# A tibble: 77 × 3
year institution_name n
<dbl> <chr> <int>
1 2015 Columbia College 3
2 2015 Union College 3
3 2015 Westminster College 3
4 2016 Columbia College 3
5 2016 Union College 3
6 2016 Westminster College 3
7 2017 Columbia College 3
8 2017 Union College 3
9 2017 Westminster College 3
10 2018 Columbia College 3
# ℹ 67 more rows
colleges |>
  filter(institution_name == "Columbia College")

# A tibble: 187 × 14
year unitid institution_name city_txt state_cd zip_text classification_code
<dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 2015 112561 Columbia College Sonora CA 95370 17
2 2015 112561 Columbia College Sonora CA 95370 17
3 2015 177065 Columbia College Columbia MO 65216 9
4 2015 177065 Columbia College Columbia MO 65216 9
5 2015 177065 Columbia College Columbia MO 65216 9
6 2015 177065 Columbia College Columbia MO 65216 9
7 2015 177065 Columbia College Columbia MO 65216 9
8 2015 177065 Columbia College Columbia MO 65216 9
9 2015 217934 Columbia College Columbia SC 29203 9
10 2015 217934 Columbia College Columbia SC 29203 9
# ℹ 177 more rows
# ℹ 7 more variables: classification_name <chr>, classification_other <chr>,
# ef_male_count <dbl>, ef_female_count <dbl>, ef_total_count <dbl>,
# sector_cd <dbl>, sector_name <chr>
The above code is how I found out that unitid should be used instead of institution_name: Columbia College came up in the list of names with more than one row per year because three distinct institutions share that name.
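One way to confirm these are three different institutions sharing one name (a minimal sketch):

colleges |>
  filter(institution_name == "Columbia College") |>
  distinct(unitid, city_txt, state_cd)  # expect three rows: the CA, MO, and SC campuses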
colleges |> unique() |>
  count(year, unitid) |> filter(n > 1) |> arrange(desc(n))

# A tibble: 0 × 3
# ℹ 3 variables: year <dbl>, unitid <dbl>, n <int>

This returns 0 rows, so we know that year and unitid together form a key for colleges.
colleges2 <- colleges |> distinct()
colleges2 |> unique() |> count(year, unitid) |> filter(n > 1)

# A tibble: 0 × 3
# ℹ 3 variables: year <dbl>, unitid <dbl>, n <int>

This again returns the expected number of rows: 0.
colleges2 <- colleges2 |>
  mutate(id = row_number(), .before = 1)

The primary key in the data frame we created, colleges2, is a compound key based on year and unitid. I added an id column based on the row number to serve as a simple surrogate key for colleges2.
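A quick check (a minimal sketch) that the new id column is itself unique:

colleges2 |> count(id) |> filter(n > 1)  # expect zero rows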
teams |> unique() |>
  count(year, unitid, sports) |> filter(n > 1) |> arrange(desc(n))

# A tibble: 0 × 4
# ℹ 4 variables: year <dbl>, unitid <dbl>, sports <chr>, n <int>
Since we knew year and unitid formed the compound key in colleges2, I guessed they would be useful for a similarly structured key in teams as well. After looking at the data and the data dictionary, it made sense to add sports to the key too, since we knew the information would be broken down by sport.
Below we can see that the number of distinct rows equals the number of rows in the original teams data frame, so there is no need to repeat the de-duplication steps we used to create colleges2. However, we will add an id column based on the row number to act as a simple primary key for this dataset.
The dataset that contains the foreign key is teams: the combination of unitid and year enables a connection (join) to the colleges2 data frame.
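One way to verify this foreign-key relationship (a sketch using anti_join): any rows returned would be teams with no matching college.

teams |>
  anti_join(colleges2, join_by(year, unitid))  # expect zero rows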
teams |> distinct()

# A tibble: 132,327 × 16
year unitid sportscode partic_men partic_women partic_coed_men
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2015 100654 1 31 NA NA
2 2015 100654 2 19 16 NA
3 2015 100654 3 61 46 NA
4 2015 100654 7 99 NA NA
5 2015 100654 8 9 NA NA
6 2015 100654 15 NA 21 NA
7 2015 100654 16 NA 25 NA
8 2015 100654 22 7 10 NA
9 2015 100654 26 NA 16 NA
10 2015 100654 33 NA 9 NA
# ℹ 132,317 more rows
# ℹ 10 more variables: partic_coed_women <dbl>, sum_partic_men <dbl>,
# sum_partic_women <dbl>, rev_men <dbl>, rev_women <dbl>,
# total_rev_menwomen <dbl>, exp_men <dbl>, exp_women <dbl>,
# total_exp_menwomen <dbl>, sports <chr>
teams <- teams |> mutate(id = row_number(), .before = 1)
teams

# A tibble: 132,327 × 17
id year unitid sportscode partic_men partic_women partic_coed_men
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2015 100654 1 31 NA NA
2 2 2015 100654 2 19 16 NA
3 3 2015 100654 3 61 46 NA
4 4 2015 100654 7 99 NA NA
5 5 2015 100654 8 9 NA NA
6 6 2015 100654 15 NA 21 NA
7 7 2015 100654 16 NA 25 NA
8 8 2015 100654 22 7 10 NA
9 9 2015 100654 26 NA 16 NA
10 10 2015 100654 33 NA 9 NA
# ℹ 132,317 more rows
# ℹ 10 more variables: partic_coed_women <dbl>, sum_partic_men <dbl>,
# sum_partic_women <dbl>, rev_men <dbl>, rev_women <dbl>,
# total_rev_menwomen <dbl>, exp_men <dbl>, exp_women <dbl>,
# total_exp_menwomen <dbl>, sports <chr>
- The variable sector_name contains information about whether a college is public, private nonprofit, or private for-profit, and whether it is a 2-year or a 4-year-or-above college. Split this variable (using separate_wider_delim) into two variables, one of which describes whether the college is Public, Private nonprofit, or Private for-profit, and another which describes how many years the college programs run.
colleges2 |>
  group_by(sector_name) |> summarise(count = n())

# A tibble: 7 × 2
sector_name count
<chr> <int>
1 Private for-profit, 2-year 9
2 Private for-profit, 4-year or above 91
3 Private nonprofit, 2-year 65
4 Private nonprofit, 4-year or above 4424
5 Public, 2-year 2799
6 Public, 4-year or above 2971
7 <NA> 5
We can see that we'll need to split sector_name on a comma followed by a space as the delimiter.
colleges3 <- colleges2 |>
  separate_wider_delim(sector_name, delim = ", ", names = c("sector_type", "sector_years"))

colleges3 |> group_by(sector_type) |> summarise(count = n())

# A tibble: 4 × 2
sector_type count
<chr> <int>
1 Private for-profit 100
2 Private nonprofit 4489
3 Public 5770
4 <NA> 5
colleges3 |> group_by(sector_years) |> summarise(count = n())

# A tibble: 3 × 2
sector_years count
<chr> <int>
1 2-year 2873
2 4-year or above 7486
3 <NA> 5
- Several variables are candidates to be recoded as factors, for example state_cd, zip_text, classification_name, sports, and the sector variables you just created for the previous part. Recode these variables as categorical variables. For the classification variable, use the classification_code to order the factors according to the numeric code.
colleges3 <- colleges3 |>
  mutate(
    state_cd = as_factor(state_cd),
    zip_text = as_factor(zip_text),
    classification_name = as_factor(classification_name),
    sector_type = as_factor(sector_type),
    sector_years = as_factor(sector_years)
  )

colleges3 <- colleges3 |>
  mutate(
    classification_name = fct_reorder(classification_name, classification_code)
  )

teams <- teams |>
  mutate(
    sports = as_factor(sports)
  )

Problem 4
- Using DBI, duckdb, and dbplyr, create a relational database with two tables, writing the sports data frame you created in problem 3 to one and the colleges data frame (also from problem 3) to the other. Write this database to disk. How does the size of the database file compare to the original csv?
con <- dbConnect(duckdb(), "sports")
con |> dbWriteTable("sportsteams", teams, overwrite = TRUE)
con |> dbWriteTable("sportscolleges", colleges3, overwrite = TRUE)

The size of the database file (~12 MB) is about half the size of the original csv file (~25 MB). Nice!
- Use dbplyr to write a query to this database that calculates the top 10 colleges ranked by the average profit (defined as revenue - expenses) of their American football team over the years of data. Print the SQL query that results from your R pipeline using show_query() and then use collect() to show the results of this query.
sportsteams <- tbl(con, "sportsteams")
sportsteams

# Source:   table<sportsteams> [?? x 17]
# Database: DuckDB v1.1.0 [root@Darwin 23.6.0:R 4.4.1//Users/amandaknudsen/AmandaLocal/AmandaR/DATA-607/Lab06/sports]
id year unitid sportscode partic_men partic_women partic_coed_men
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2015 100654 1 31 NA NA
2 2 2015 100654 2 19 16 NA
3 3 2015 100654 3 61 46 NA
4 4 2015 100654 7 99 NA NA
5 5 2015 100654 8 9 NA NA
6 6 2015 100654 15 NA 21 NA
7 7 2015 100654 16 NA 25 NA
8 8 2015 100654 22 7 10 NA
9 9 2015 100654 26 NA 16 NA
10 10 2015 100654 33 NA 9 NA
# ℹ more rows
# ℹ 10 more variables: partic_coed_women <dbl>, sum_partic_men <dbl>,
# sum_partic_women <dbl>, rev_men <dbl>, rev_women <dbl>,
# total_rev_menwomen <dbl>, exp_men <dbl>, exp_women <dbl>,
# total_exp_menwomen <dbl>, sports <fct>
sportscolleges <- tbl(con, "sportscolleges")
sportscolleges

# Source:   table<sportscolleges> [?? x 16]
# Database: DuckDB v1.1.0 [root@Darwin 23.6.0:R 4.4.1//Users/amandaknudsen/AmandaLocal/AmandaR/DATA-607/Lab06/sports]
id year unitid institution_name city_txt state_cd zip_text
<int> <dbl> <dbl> <chr> <chr> <fct> <fct>
1 1 2015 100654 Alabama A & M University Normal AL 35762
2 2 2015 100663 University of Alabama at Birmi… Birming… AL 3529401…
3 3 2015 100706 University of Alabama in Hunts… Huntsvi… AL 35899
4 4 2015 100724 Alabama State University Montgom… AL 3610402…
5 5 2015 100751 The University of Alabama Tuscalo… AL 3548701…
6 6 2015 100760 Central Alabama Community Coll… Alexand… AL 35010
7 7 2015 100830 Auburn University at Montgomery Montgom… AL 3611735…
8 8 2015 100858 Auburn University Auburn … AL 36849
9 9 2015 100937 Birmingham Southern College Birming… AL 35254
10 10 2015 101028 Chattahoochee Valley Community… Phenix … AL 36869
# ℹ more rows
# ℹ 9 more variables: classification_code <dbl>, classification_name <fct>,
# classification_other <chr>, ef_male_count <dbl>, ef_female_count <dbl>,
# ef_total_count <dbl>, sector_cd <dbl>, sector_type <fct>,
# sector_years <fct>
sportsteams |>
  mutate(
    teamprofit = (total_rev_menwomen - total_exp_menwomen)
  ) |>
  filter(sports == "Football") |>
  group_by(unitid) |>
  summarize(
    avg_teamprofit = mean(teamprofit, na.rm = TRUE)
  ) |>
  inner_join(sportscolleges, join_by(unitid)) |>
  select(institution_name, avg_teamprofit) |>
  arrange(desc(avg_teamprofit)) |>
  distinct() |>
  head(10) |>
  show_query()

<SQL>
SELECT DISTINCT q01.*
FROM (
SELECT institution_name, avg_teamprofit
FROM (
SELECT unitid, AVG(teamprofit) AS avg_teamprofit
FROM (
SELECT
sportsteams.*,
(total_rev_menwomen - total_exp_menwomen) AS teamprofit
FROM sportsteams
WHERE (sports = 'Football')
) q01
GROUP BY unitid
) LHS
INNER JOIN sportscolleges
ON (LHS.unitid = sportscolleges.unitid)
) q01
ORDER BY avg_teamprofit DESC
LIMIT 10
sportsteams |>
  mutate(
    teamprofit = (total_rev_menwomen - total_exp_menwomen)
  ) |>
  filter(sports == "Football") |>
  group_by(unitid) |>
  summarize(
    avg_teamprofit = mean(teamprofit, na.rm = TRUE)
  ) |>
  inner_join(sportscolleges, join_by(unitid)) |>
  select(institution_name, avg_teamprofit) |>
  arrange(desc(avg_teamprofit)) |>
  distinct() |>
  head(10) |>
  collect()

# A tibble: 10 × 2
institution_name avg_teamprofit
<chr> <dbl>
1 The University of Texas at Austin 102949317.
2 University of Michigan-Ann Arbor 71516420
3 University of Georgia 69923628.
4 The University of Tennessee-Knoxville 62983384.
5 University of Notre Dame 59923466.
6 University of Oklahoma-Norman Campus 58268012.
7 Ohio State University-Main Campus 56567973
8 Louisiana State University and Agricultural & Mechanical Coll… 55336698.
9 University of Wisconsin-Madison 49896698.
10 Auburn University 49689829
dbDisconnect(con, shutdown = TRUE)  # also shut down the duckdb instance, not just the connection