Lab 6: R and SQL

Author

Amanda Rose Knudsen

Overview

This lab is divided into two parts. In the first part you will practice using joins for data wrangling and analysis on the nycflights dataset. Some of the problems come from Chapter 19 of your book. For the second part, you will download a dataset on the budgets of college sports programs and process it for storage in a relational database (I strongly recommend using duckdb which can be installed using install.packages("duckdb")- duckdb is highly performant, self-contained, and ideally suited both to learning SQL and performing data analysis). Then you will load this database and use dbplyr to perform an analysis. You will also practice using forcats to recode some of the variables as factors (which are supported by duckdb) and using separate_wider_delim to split columns of text data.

You will need to have installed and to the following libraries:

library(tidyverse)
library(DBI)
library(duckdb)
library(nycflights13)
library(dbplyr)

Problems

Part I: Airline Flight Delays

For the first part of this lab exercise, we will be using the nycflights library, which contains several different built in datasets including planes, which has information on each plane that appears in the data; flights, which has information on individual flights; airports, which has information on individual airports; and weather, which has information on the weather that the origin airports. In order to do this set of lab exercises, you will need to use different types of joins to combine variables in each data frame.

Problem 1

  • Use the flights and planes tables to compute the mean departure delay of each aircraft that has more than 30 recorded flights in the dataset. Hint: Make note of the fact that the variable year appears in both flights and planes but means different things in each before performing any joins.
flights2 <- flights |> 
  mutate(id = row_number(), .before = 1)

flights2 |> 
  left_join(planes, join_by(tailnum)) |> 
  group_by(tailnum) |> 
  summarise(
    avg_dep_delay = mean(dep_delay, na.rm = TRUE),
    countflights = n()
    ) |> 
  filter(countflights >= 30) |> 
  arrange(tailnum)
# A tibble: 2,745 × 3
   tailnum avg_dep_delay countflights
   <chr>           <dbl>        <int>
 1 N0EGMQ          8.49           371
 2 N10156         17.8            153
 3 N102UW          8               48
 4 N103US         -3.20            46
 5 N104UW          9.94            47
 6 N10575         22.7            289
 7 N105UW          2.58            45
 8 N107US         -0.463           41
 9 N108UW          4.22            60
10 N109UW          0.104           48
# ℹ 2,735 more rows
  • Use anti-join to identify flights where tailnum does not have a match in plane. Determine the carriers for which this problem is the most common.
flights2 |> 
  anti_join(planes, by = "tailnum") |> 
  group_by(carrier) |> 
  summarize(carrier_missing_tailnum_count = n()) |> 
  arrange(desc(carrier_missing_tailnum_count))
# A tibble: 10 × 2
   carrier carrier_missing_tailnum_count
   <chr>                           <int>
 1 MQ                              25397
 2 AA                              22558
 3 UA                               1693
 4 9E                               1044
 5 B6                                830
 6 US                                699
 7 FL                                187
 8 DL                                110
 9 F9                                 50
10 WN                                 38

The carriers MQ (Envoy Air) and AA (American Airlines) are the carriers for which this problem is most common.

  • Find the airplane model which made the most flights in the dataset, and filter the dataset to contain only flights flown by airplanes of that model, adding a variable which corresponds to the year each those airplanes were built. Then compute the average departure delay for each year of origin and plot the data. Is there any evidence that older planes have more greater departure delays?

While the value NA for model has the “highest count” of flights, this is not an indication all NA values for tailnum are the same airplane model. It’s more likely that the NA values for tailnum are not the same airplane model, but we really don’t know. So, we will continue analysis on the airplane model with the most flights in the dataset that we actually have information for: model = A320-232.

topmodels <- flights2 |>
  left_join(planes, join_by(tailnum)) |>
  group_by(model) |>
  summarise(
    countflights = n()
    ) |>
  arrange(desc(countflights))

topmodel <- flights2 |> 
  left_join(planes, join_by(tailnum)) |> 
  filter(model == "A320-232") |> 
  rename(year_built = year.y) |> 
  group_by(year_built) |> 
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE))  |> 
  arrange(year_built)
 
print(topmodel)
# A tibble: 21 × 2
   year_built avg_dep_delay
        <int>         <dbl>
 1       1993          15.6
 2       1994          13.7
 3       1995          15.2
 4       1996          15.5
 5       1997          13.2
 6       1998          14.1
 7       1999          10.0
 8       2000          12.0
 9       2001          12.8
10       2002          11.9
# ℹ 11 more rows
ggplot(topmodel, aes(x = year_built, y = avg_dep_delay)) +
  geom_point() + 
  labs(title = "A320-232: Average departure delay by year built",
       subtitle = "Older builds tend to have greater average departure delays")
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

As indicated in the plot and noted in the subtitle: yes, it appears that older builds tend to have have greater average departure delays.

Problem 2

  • Compute the average delay by destination, then join on the airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:
airports |>
  semi_join(flights, join_by(faa == dest)) |>
  ggplot(aes(x = lon, y = lat)) +
    borders("state") +
    geom_point() +
    coord_quickmap()

You might want to use the size or color of the points to display the average delay for each airport.

avg_delay_dest <- flights2 |> 
  group_by(dest) |> 
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) |> 
  inner_join(airports, join_by(dest == faa))

avg_delay_dest |> 
  ggplot(aes(x = lon, y = lat, color = avg_delay)) +
  borders("state") +
  geom_point() +
  coord_quickmap() +
  scale_color_gradient(low = "slategrey", high = "cyan")

Part II: Creating and Accessing a Database

In this exercise we will begin with a flat file which contains data on college sports programs throughout the country. The source of the data is a government run database called Equity in Athletics Data Analysis, though we are working with just a small subset here. You can download this file by clicking here: sports_program_costs.csv. I have also included a data dictionary which gives a quick description of the dataset, which can be downloaded from here: sports_program_data_dictionary.qmd. This file contains information on two types of entities: sports teams and universities, however the information on both entities is combined into a single table, creating substantial redundancies. This exercise has several goals:

  1. Load this data into R, split the dataframe into two dataframes, one corresponding to colleges and another corresponding to sports teams, related to each other by common keys. Many databases are stored according to normalization rules, which are designed to limit redundancy and to make it easier to both work with the data and make changes to it. By splitting the data frame we will partially normalize it (but won’t go too far).
sportsdata = read_csv("data/sports_program_costs.csv")
Rows: 132327 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): institution_name, city_txt, state_cd, zip_text, classification_nam...
dbl (20): year, unitid, classification_code, ef_male_count, ef_female_count,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sportsdata
# A tibble: 132,327 × 28
    year unitid institution_name  city_txt state_cd zip_text classification_code
   <dbl>  <dbl> <chr>             <chr>    <chr>    <chr>                  <dbl>
 1  2015 100654 Alabama A & M Un… Normal   AL       35762                      2
 2  2015 100654 Alabama A & M Un… Normal   AL       35762                      2
 3  2015 100654 Alabama A & M Un… Normal   AL       35762                      2
 4  2015 100654 Alabama A & M Un… Normal   AL       35762                      2
 5  2015 100654 Alabama A & M Un… Normal   AL       35762                      2
 6  2015 100654 Alabama A & M Un… Normal   AL       35762                      2
 7  2015 100654 Alabama A & M Un… Normal   AL       35762                      2
 8  2015 100654 Alabama A & M Un… Normal   AL       35762                      2
 9  2015 100654 Alabama A & M Un… Normal   AL       35762                      2
10  2015 100654 Alabama A & M Un… Normal   AL       35762                      2
# ℹ 132,317 more rows
# ℹ 21 more variables: classification_name <chr>, classification_other <chr>,
#   ef_male_count <dbl>, ef_female_count <dbl>, ef_total_count <dbl>,
#   sector_cd <dbl>, sector_name <chr>, sportscode <dbl>, partic_men <dbl>,
#   partic_women <dbl>, partic_coed_men <dbl>, partic_coed_women <dbl>,
#   sum_partic_men <dbl>, sum_partic_women <dbl>, rev_men <dbl>,
#   rev_women <dbl>, total_rev_menwomen <dbl>, exp_men <dbl>, …
  1. Create a relational database using duckdb which contains these two tables.
  2. Read this database into R and a/an SQL query/queries to perform an analysis.

Problem 3:

  • sports_program_data.csv contains variables which either describe properties of a sports team or a college. Split sports_programs_data into two data frames, one called colleges and another called teams. How can you tell which variables describe colleges and which describe teams? Use the data dictionary and observations of how the values vary as you move from college to college to help make the decision easier. Make sure there are primary keys for both the colleges and teams data frames (verify with count)- what are the primary keys in each case and are they simple keys (one variable) or compound keys (require multiple variables). One of these data-sets should contain a foreign key- which one has it and what variables comprise it?
sportsdata_columns <- colnames(sportsdata)

print(sportsdata_columns)
 [1] "year"                 "unitid"               "institution_name"    
 [4] "city_txt"             "state_cd"             "zip_text"            
 [7] "classification_code"  "classification_name"  "classification_other"
[10] "ef_male_count"        "ef_female_count"      "ef_total_count"      
[13] "sector_cd"            "sector_name"          "sportscode"          
[16] "partic_men"           "partic_women"         "partic_coed_men"     
[19] "partic_coed_women"    "sum_partic_men"       "sum_partic_women"    
[22] "rev_men"              "rev_women"            "total_rev_menwomen"  
[25] "exp_men"              "exp_women"            "total_exp_menwomen"  
[28] "sports"              
colleges_columns = c("year", "unitid", "institution_name", "city_txt", "state_cd", "zip_text", "classification_code", "classification_name", "classification_other", "ef_male_count", "ef_female_count", "ef_total_count", "sector_cd", "sector_name")

teams_columns = c("year", "unitid", "sportscode", "partic_men", "partic_women", "partic_coed_men", "partic_coed_women", "sum_partic_men", "sum_partic_women", "rev_men", "rev_women", "total_rev_menwomen", "exp_men", "exp_women", "total_exp_menwomen", "sports")

colleges <- sportsdata |> select(all_of(colleges_columns))

teams <- sportsdata |>  select(all_of(teams_columns))
colleges |>  unique() |> 
  count(year, institution_name) |> filter(n>1) |>  arrange(desc(n))
# A tibble: 77 × 3
    year institution_name        n
   <dbl> <chr>               <int>
 1  2015 Columbia College        3
 2  2015 Union College           3
 3  2015 Westminster College     3
 4  2016 Columbia College        3
 5  2016 Union College           3
 6  2016 Westminster College     3
 7  2017 Columbia College        3
 8  2017 Union College           3
 9  2017 Westminster College     3
10  2018 Columbia College        3
# ℹ 67 more rows
colleges |> 
  filter(institution_name == "Columbia College")
# A tibble: 187 × 14
    year unitid institution_name city_txt state_cd zip_text classification_code
   <dbl>  <dbl> <chr>            <chr>    <chr>    <chr>                  <dbl>
 1  2015 112561 Columbia College Sonora   CA       95370                     17
 2  2015 112561 Columbia College Sonora   CA       95370                     17
 3  2015 177065 Columbia College Columbia MO       65216                      9
 4  2015 177065 Columbia College Columbia MO       65216                      9
 5  2015 177065 Columbia College Columbia MO       65216                      9
 6  2015 177065 Columbia College Columbia MO       65216                      9
 7  2015 177065 Columbia College Columbia MO       65216                      9
 8  2015 177065 Columbia College Columbia MO       65216                      9
 9  2015 217934 Columbia College Columbia SC       29203                      9
10  2015 217934 Columbia College Columbia SC       29203                      9
# ℹ 177 more rows
# ℹ 7 more variables: classification_name <chr>, classification_other <chr>,
#   ef_male_count <dbl>, ef_female_count <dbl>, ef_total_count <dbl>,
#   sector_cd <dbl>, sector_name <chr>

The above code is how I found out that the unitid should be used instead of the institution name… Columbia College came up in the list where there were more than 1 unique year and institution name.

colleges |>  unique() |> 
  count(year, unitid) |> filter(n>1) |>  arrange(desc(n))
# A tibble: 0 × 3
# ℹ 3 variables: year <dbl>, unitid <dbl>, n <int>
# This returns 0 rows: we know that year and unitid are the keys for college.

colleges2 <- colleges |> distinct()
  
colleges2 |>  unique() |>  count(year, unitid) |>  filter(n>1)
# A tibble: 0 × 3
# ℹ 3 variables: year <dbl>, unitid <dbl>, n <int>
# this returns the expected number: 0

colleges2 <- colleges2 |> 
  mutate(id = row_number(), .before = 1)

The primary keys in the dataframe we created called college2 are compound keys, based on the year and the unitid. I added a id based on the row number to act as the colleges2’s key.

teams |>  unique() |> 
  count(year, unitid, sports) |> filter(n>1) |>  arrange(desc(n))
# A tibble: 0 × 4
# ℹ 4 variables: year <dbl>, unitid <dbl>, sports <chr>, n <int>

Since we knew year and unitid were the compound keys in colleges2, I guessed they would be useful in creating a similarly structured teams2 as well. After looking at the data and the data dictionary, it made sense to add sports to this as well, since we knew that the information would be broken down this way.

Below we can see that the dataframe number of distinct rows is equal to the original teams dataframe length, so there’s no need to do the same exact steps that we did to create colleges2. However, we will add an id for row number to act as the primary key for this dataset.

The dataset that contains the foreign key is teams – it is the unitid and year which enables a connection (join) to the colleges2 dataframe.

teams |> distinct()
# A tibble: 132,327 × 16
    year unitid sportscode partic_men partic_women partic_coed_men
   <dbl>  <dbl>      <dbl>      <dbl>        <dbl>           <dbl>
 1  2015 100654          1         31           NA              NA
 2  2015 100654          2         19           16              NA
 3  2015 100654          3         61           46              NA
 4  2015 100654          7         99           NA              NA
 5  2015 100654          8          9           NA              NA
 6  2015 100654         15         NA           21              NA
 7  2015 100654         16         NA           25              NA
 8  2015 100654         22          7           10              NA
 9  2015 100654         26         NA           16              NA
10  2015 100654         33         NA            9              NA
# ℹ 132,317 more rows
# ℹ 10 more variables: partic_coed_women <dbl>, sum_partic_men <dbl>,
#   sum_partic_women <dbl>, rev_men <dbl>, rev_women <dbl>,
#   total_rev_menwomen <dbl>, exp_men <dbl>, exp_women <dbl>,
#   total_exp_menwomen <dbl>, sports <chr>
teams <- teams |> mutate(id = row_number(), .before = 1)

teams
# A tibble: 132,327 × 17
      id  year unitid sportscode partic_men partic_women partic_coed_men
   <int> <dbl>  <dbl>      <dbl>      <dbl>        <dbl>           <dbl>
 1     1  2015 100654          1         31           NA              NA
 2     2  2015 100654          2         19           16              NA
 3     3  2015 100654          3         61           46              NA
 4     4  2015 100654          7         99           NA              NA
 5     5  2015 100654          8          9           NA              NA
 6     6  2015 100654         15         NA           21              NA
 7     7  2015 100654         16         NA           25              NA
 8     8  2015 100654         22          7           10              NA
 9     9  2015 100654         26         NA           16              NA
10    10  2015 100654         33         NA            9              NA
# ℹ 132,317 more rows
# ℹ 10 more variables: partic_coed_women <dbl>, sum_partic_men <dbl>,
#   sum_partic_women <dbl>, rev_men <dbl>, rev_women <dbl>,
#   total_rev_menwomen <dbl>, exp_men <dbl>, exp_women <dbl>,
#   total_exp_menwomen <dbl>, sports <chr>
  • The variable sector_name contains information about whether a college is public, private, non-profit, for-profit, a 2-year college, or a 4-year + college. Split this variable (using separate_wider_delim) into two variables, one of which describes whether the college is a Public, Private nonprofit, or private for-profit, and another which describes how many years the college programs run.
colleges2 |> 
  group_by(sector_name) |> summarise(count = n())
# A tibble: 7 × 2
  sector_name                         count
  <chr>                               <int>
1 Private for-profit, 2-year              9
2 Private for-profit, 4-year or above    91
3 Private nonprofit, 2-year              65
4 Private nonprofit, 4-year or above   4424
5 Public, 2-year                       2799
6 Public, 4-year or above              2971
7 <NA>                                    5

We can see that we’ll need to split the sector_name on a comma as the delimiter.

colleges3 <- colleges2 |> 
  separate_wider_delim(sector_name, delim = ", ", names = c("sector_type", "sector_years"))

colleges3 |> group_by(sector_type) |> summarise(count = n())
# A tibble: 4 × 2
  sector_type        count
  <chr>              <int>
1 Private for-profit   100
2 Private nonprofit   4489
3 Public              5770
4 <NA>                   5
colleges3 |> group_by(sector_years) |> summarise(count = n())
# A tibble: 3 × 2
  sector_years    count
  <chr>           <int>
1 2-year           2873
2 4-year or above  7486
3 <NA>                5
  • Several variables are candidates to be recoded as factors, for example state_cd, zip_text, classification_name, sports, and the sector variables you just created for the previous part. Recode these variables as categorical variables. For the classification variable, use the classification_code to order the factors according to the numeric code.
colleges3 <- colleges3 |> 
  mutate(
    state_cd = as_factor(state_cd),
    zip_text = as_factor(zip_text),
    classification_name = as_factor(classification_name),
    sector_type = as_factor(sector_type),
    sector_years = as_factor(sector_years)
  )

colleges3 <- colleges3 |> 
  mutate(
    classification_name = fct_reorder(classification_name, classification_code)
  )
teams <- teams |> 
  mutate(
    sports = as_factor(sports)
  )

Problem 4

  • Using DBI, duckdb, and dbplyr, create a relational database with two tables, writing the sports data frame you created in problem 3 to one and the colleges data frame (also from problem 3) to the other. Write this database to disk. How does the size of the database file compare to the original csv?
con = dbConnect(duckdb(), "sports")

con |> dbWriteTable("sportsteams", teams, overwrite = TRUE)

con |> dbWriteTable("sportscolleges", colleges3, overwrite = TRUE)

The size of the database file (~12 MB) is half the size of the original csv file (~25 MB). Nice!

  • Use dbplyr to write a query to this database that calculates the top 10 colleges ranked by the average profit (defined as revenue - expenses) of their american football team over the years of data. Print the SQL query that results from your R pipeline using show_query() and then use collect() to show the results of this query.
sportsteams <- tbl(con, "sportsteams")
sportsteams
# Source:   table<sportsteams> [?? x 17]
# Database: DuckDB v1.1.0 [root@Darwin 23.6.0:R 4.4.1//Users/amandaknudsen/AmandaLocal/AmandaR/DATA-607/Lab06/sports]
      id  year unitid sportscode partic_men partic_women partic_coed_men
   <int> <dbl>  <dbl>      <dbl>      <dbl>        <dbl>           <dbl>
 1     1  2015 100654          1         31           NA              NA
 2     2  2015 100654          2         19           16              NA
 3     3  2015 100654          3         61           46              NA
 4     4  2015 100654          7         99           NA              NA
 5     5  2015 100654          8          9           NA              NA
 6     6  2015 100654         15         NA           21              NA
 7     7  2015 100654         16         NA           25              NA
 8     8  2015 100654         22          7           10              NA
 9     9  2015 100654         26         NA           16              NA
10    10  2015 100654         33         NA            9              NA
# ℹ more rows
# ℹ 10 more variables: partic_coed_women <dbl>, sum_partic_men <dbl>,
#   sum_partic_women <dbl>, rev_men <dbl>, rev_women <dbl>,
#   total_rev_menwomen <dbl>, exp_men <dbl>, exp_women <dbl>,
#   total_exp_menwomen <dbl>, sports <fct>
sportscolleges <- tbl(con, "sportscolleges")
sportscolleges
# Source:   table<sportscolleges> [?? x 16]
# Database: DuckDB v1.1.0 [root@Darwin 23.6.0:R 4.4.1//Users/amandaknudsen/AmandaLocal/AmandaR/DATA-607/Lab06/sports]
      id  year unitid institution_name                city_txt state_cd zip_text
   <int> <dbl>  <dbl> <chr>                           <chr>    <fct>    <fct>   
 1     1  2015 100654 Alabama A & M University        Normal   AL       35762   
 2     2  2015 100663 University of Alabama at Birmi… Birming… AL       3529401…
 3     3  2015 100706 University of Alabama in Hunts… Huntsvi… AL       35899   
 4     4  2015 100724 Alabama State University        Montgom… AL       3610402…
 5     5  2015 100751 The University of Alabama       Tuscalo… AL       3548701…
 6     6  2015 100760 Central Alabama Community Coll… Alexand… AL       35010   
 7     7  2015 100830 Auburn University at Montgomery Montgom… AL       3611735…
 8     8  2015 100858 Auburn University               Auburn … AL       36849   
 9     9  2015 100937 Birmingham Southern College     Birming… AL       35254   
10    10  2015 101028 Chattahoochee Valley Community… Phenix … AL       36869   
# ℹ more rows
# ℹ 9 more variables: classification_code <dbl>, classification_name <fct>,
#   classification_other <chr>, ef_male_count <dbl>, ef_female_count <dbl>,
#   ef_total_count <dbl>, sector_cd <dbl>, sector_type <fct>,
#   sector_years <fct>
sportsteams |> 
  mutate(
    teamprofit = (total_rev_menwomen - total_exp_menwomen)
  ) |> 
  filter(sports == "Football") |> 
  group_by(unitid) |> 
  summarize(
    avg_teamprofit = mean(teamprofit, na.rm = TRUE)
  ) |> 
  inner_join(sportscolleges, join_by(unitid)) |> 
  select(institution_name, avg_teamprofit) |> 
  arrange(desc(avg_teamprofit)) |> 
  distinct() |> 
  head(10) |> 
  show_query() 
<SQL>
SELECT DISTINCT q01.*
FROM (
  SELECT institution_name, avg_teamprofit
  FROM (
    SELECT unitid, AVG(teamprofit) AS avg_teamprofit
    FROM (
      SELECT
        sportsteams.*,
        (total_rev_menwomen - total_exp_menwomen) AS teamprofit
      FROM sportsteams
      WHERE (sports = 'Football')
    ) q01
    GROUP BY unitid
  ) LHS
  INNER JOIN sportscolleges
    ON (LHS.unitid = sportscolleges.unitid)
) q01
ORDER BY avg_teamprofit DESC
LIMIT 10
sportsteams |> 
  mutate(
    teamprofit = (total_rev_menwomen - total_exp_menwomen)
  ) |> 
  filter(sports == "Football") |> 
  group_by(unitid) |> 
  summarize(
    avg_teamprofit = mean(teamprofit, na.rm = TRUE)
  ) |> 
  inner_join(sportscolleges, join_by(unitid)) |> 
  select(institution_name, avg_teamprofit) |> 
  arrange(desc(avg_teamprofit)) |> 
  distinct() |> 
  head(10) |> 
  collect()
# A tibble: 10 × 2
   institution_name                                               avg_teamprofit
   <chr>                                                                   <dbl>
 1 The University of Texas at Austin                                  102949317.
 2 University of Michigan-Ann Arbor                                    71516420 
 3 University of Georgia                                               69923628.
 4 The University of Tennessee-Knoxville                               62983384.
 5 University of Notre Dame                                            59923466.
 6 University of Oklahoma-Norman Campus                                58268012.
 7 Ohio State University-Main Campus                                   56567973 
 8 Louisiana State University and Agricultural & Mechanical Coll…      55336698.
 9 University of Wisconsin-Madison                                     49896698.
10 Auburn University                                                   49689829 
dbDisconnect(con)