Relational Data Assignment

Assignment Instructions

Complete all questions below. After completing the assignment, knit your document, and download both your .Rmd and knitted output. Upload your files for peer review.

For each response, include comments detailing your response and what each line does. Ensure you test your functions with sufficient test cases to identify and correct any potential bugs.

Required Libraries

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.2.1

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0

## Warning: package 'ggplot2' was built under R version 4.2.2

## Warning: package 'readr' was built under R version 4.2.2

## Warning: package 'stringr' was built under R version 4.2.2

## Warning: package 'forcats' was built under R version 4.2.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Question 1.

Identify the primary keys in the following datasets. Be sure to show that you have the primary key by showing there are no duplicate entries.

Lahman::Batting babynames::babynames nasaweather::atmos

#The primary Key of the Batting DataSet is the combination of <playerID, stint, yearID, teamID>
#the method anyDuplicated() returned an integer or real vector of length one with value the 1-based 
#index of the first duplicate if any, otherwise 0
library('Lahman')

## Warning: package 'Lahman' was built under R version 4.2.3

library('babynames')

## Warning: package 'babynames' was built under R version 4.2.3

library('nasaweather')

## 
## Attaching package: 'nasaweather'

## The following object is masked from 'package:dplyr':
## 
##     storms

Batting %>% select(playerID, stint, yearID, teamID) %>% anyDuplicated()

## [1] 0

#the primary key is <year, sex, name>
babynames %>% select(year, sex, name) %>% anyDuplicated()

## [1] 0

#the primary key is <lat, long, year, month>
atmos %>% select(lat, long, year, month) %>% anyDuplicated()

## [1] 0

Question 2.

What is the relationship between the “Batting”, “Master”, and “Salaries” tables in the “Lahman” package? What are the keys for each dataset and how do they relate to each other?

#The primary Key of the Batting DataSet is the combination of <playerID, stint, yearID, teamID>
Batting %>% select(playerID, stint, yearID, teamID) %>% anyDuplicated()

## [1] 0

#The primary Key of the Salaries DataSet is the combination of <playerID,yearID, teamID>
Salaries %>% select('yearID', 'teamID', 'playerID') %>% anyDuplicated()

## [1] 0

#The Salaries DataSet Contains the Salaries of each player in the Batting Dataset for each year within a specific team
#These is no table named 'Master' in the Lahman Package build under R version 4.2.3.

Question 3.

Load the “nycflights13” library. Use an appropriate join to add a column containing the airline name to the “flights” dataset present in the library. Be sure to put the carrier code and name in the first two columns of the result so we can see them. Save the result as “flights2”.

library('nycflights13')

## Warning: package 'nycflights13' was built under R version 4.2.3

flights2 <- right_join(airlines,flights, by="carrier")
flights2

## # A tibble: 336,776 × 20
##    carrier name      year month   day dep_time sched_dep_time dep_delay arr_time
##    <chr>   <chr>    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1 9E      Endeavo…  2013     1     1      810            810         0     1048
##  2 9E      Endeavo…  2013     1     1     1451           1500        -9     1634
##  3 9E      Endeavo…  2013     1     1     1452           1455        -3     1637
##  4 9E      Endeavo…  2013     1     1     1454           1500        -6     1635
##  5 9E      Endeavo…  2013     1     1     1507           1515        -8     1651
##  6 9E      Endeavo…  2013     1     1     1530           1530         0     1650
##  7 9E      Endeavo…  2013     1     1     1546           1540         6     1753
##  8 9E      Endeavo…  2013     1     1     1550           1550         0     1844
##  9 9E      Endeavo…  2013     1     1     1552           1600        -8     1749
## 10 9E      Endeavo…  2013     1     1     1554           1600        -6     1701
## # … with 336,766 more rows, and 11 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Question 4.

Use an appropriate join to add the airport name to the “flights2” dataset you got above. The codes and names of the airports are in the “airports” dataset of the “nycflights13” package. Put the carrier and carrier name first followed by the destination and destination name, then everything else.

flights2 <-right_join(airports,flights2, by=c('faa'='dest')) %>%
  select(carrier, name.y, faa, name.x, everything()) %>%
  rename(carrier_name=name.y, dest=faa, dest_name=name.x)
flights2

## # A tibble: 336,776 × 27
##    carrier carrier_name    dest  dest_name     lat   lon   alt    tz dst   tzone
##    <chr>   <chr>           <chr> <chr>       <dbl> <dbl> <dbl> <dbl> <chr> <chr>
##  1 B6      JetBlue Airways ABQ   Albuquerqu…  35.0 -107.  5355    -7 A     Amer…
##  2 B6      JetBlue Airways ABQ   Albuquerqu…  35.0 -107.  5355    -7 A     Amer…
##  3 B6      JetBlue Airways ABQ   Albuquerqu…  35.0 -107.  5355    -7 A     Amer…
##  4 B6      JetBlue Airways ABQ   Albuquerqu…  35.0 -107.  5355    -7 A     Amer…
##  5 B6      JetBlue Airways ABQ   Albuquerqu…  35.0 -107.  5355    -7 A     Amer…
##  6 B6      JetBlue Airways ABQ   Albuquerqu…  35.0 -107.  5355    -7 A     Amer…
##  7 B6      JetBlue Airways ABQ   Albuquerqu…  35.0 -107.  5355    -7 A     Amer…
##  8 B6      JetBlue Airways ABQ   Albuquerqu…  35.0 -107.  5355    -7 A     Amer…
##  9 B6      JetBlue Airways ABQ   Albuquerqu…  35.0 -107.  5355    -7 A     Amer…
## 10 B6      JetBlue Airways ABQ   Albuquerqu…  35.0 -107.  5355    -7 A     Amer…
## # … with 336,766 more rows, and 17 more variables: year <int>, month <int>,
## #   day <int>, dep_time <int>, sched_dep_time <int>, dep_delay <dbl>,
## #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, flight <int>,
## #   tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Question 5.

The “nycflights13” library and the code to create spatial map is provided for you. Now compute the average delay by destination, then join on the airports dataframe so you can show the spatial distribution of delays.

Use the size or colour of the points to display the average delay for each airport.
Add the location of the origin and destination (i.e. the lat and lon) to flights.
Compute the average delay by destination.

Use the textbook for reference.

library(nycflights13)
#Compute de average departure delay by destination and save the result in flights_delay data frame
flights_delay<-flights %>% select(dest,dep_delay) %>% group_by(dest) %>% summarise(Av_delay=mean(dep_delay, na.rm = TRUE))
flights_delay

## # A tibble: 105 × 2
##    dest  Av_delay
##    <chr>    <dbl>
##  1 ABQ      13.7 
##  2 ACK       6.46
##  3 ALB      23.6 
##  4 ANC      12.9 
##  5 ATL      12.5 
##  6 AUS      13.0 
##  7 AVL       8.19
##  8 BDL      17.7 
##  9 BGR      19.5 
## 10 BHM      29.7 
## # … with 95 more rows

#Joining the average departure delay to the "airports" data frame
airports1<- right_join(airports,flights_delay , by=c('faa'='dest'))


airports1 %>%
  ggplot(aes(lon, lat)) +
    borders("state") +
    geom_point(aes(size=Av_delay)) +
    coord_quickmap()

## Warning: Removed 5 rows containing missing values (`geom_point()`).

Question 6.

Use a set operation function to find which airport codes from flights are not in the airports dataset.

setdiff(flights$dest, airports$faa)

## [1] "BQN" "SJU" "STT" "PSE"