Complete all questions below. After completing the assignment, knit your document, and download both your .Rmd and knitted output. Upload your files for peer review.
For each response, include comments detailing your response and what each line does. Ensure you test your functions with sufficient test cases to identify and correct any potential bugs.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.1
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## Warning: package 'ggplot2' was built under R version 4.2.2
## Warning: package 'readr' was built under R version 4.2.2
## Warning: package 'stringr' was built under R version 4.2.2
## Warning: package 'forcats' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Identify the primary keys in the following datasets. Be sure to show that you have the primary key by showing there are no duplicate entries.
Lahman::Batting babynames::babynames nasaweather::atmos
#The primary Key of the Batting DataSet is the combination of <playerID, stint, yearID, teamID>
#the method anyDuplicated() returned an integer or real vector of length one with value the 1-based
#index of the first duplicate if any, otherwise 0
library('Lahman')
## Warning: package 'Lahman' was built under R version 4.2.3
library('babynames')
## Warning: package 'babynames' was built under R version 4.2.3
library('nasaweather')
##
## Attaching package: 'nasaweather'
## The following object is masked from 'package:dplyr':
##
## storms
Batting %>% select(playerID, stint, yearID, teamID) %>% anyDuplicated()
## [1] 0
#the primary key is <year, sex, name>
babynames %>% select(year, sex, name) %>% anyDuplicated()
## [1] 0
#the primary key is <lat, long, year, month>
atmos %>% select(lat, long, year, month) %>% anyDuplicated()
## [1] 0
What is the relationship between the “Batting”, “Master”, and “Salaries” tables in the “Lahman” package? What are the keys for each dataset and how do they relate to each other?
#The primary Key of the Batting DataSet is the combination of <playerID, stint, yearID, teamID>
Batting %>% select(playerID, stint, yearID, teamID) %>% anyDuplicated()
## [1] 0
#The primary Key of the Salaries DataSet is the combination of <playerID,yearID, teamID>
Salaries %>% select('yearID', 'teamID', 'playerID') %>% anyDuplicated()
## [1] 0
#The Salaries DataSet Contains the Salaries of each player in the Batting Dataset for each year within a specific team
#These is no table named 'Master' in the Lahman Package build under R version 4.2.3.
Load the “nycflights13” library. Use an appropriate join to add a column containing the airline name to the “flights” dataset present in the library. Be sure to put the carrier code and name in the first two columns of the result so we can see them. Save the result as “flights2”.
library('nycflights13')
## Warning: package 'nycflights13' was built under R version 4.2.3
flights2 <- right_join(airlines,flights, by="carrier")
flights2
## # A tibble: 336,776 × 20
## carrier name year month day dep_time sched_dep_time dep_delay arr_time
## <chr> <chr> <int> <int> <int> <int> <int> <dbl> <int>
## 1 9E Endeavo… 2013 1 1 810 810 0 1048
## 2 9E Endeavo… 2013 1 1 1451 1500 -9 1634
## 3 9E Endeavo… 2013 1 1 1452 1455 -3 1637
## 4 9E Endeavo… 2013 1 1 1454 1500 -6 1635
## 5 9E Endeavo… 2013 1 1 1507 1515 -8 1651
## 6 9E Endeavo… 2013 1 1 1530 1530 0 1650
## 7 9E Endeavo… 2013 1 1 1546 1540 6 1753
## 8 9E Endeavo… 2013 1 1 1550 1550 0 1844
## 9 9E Endeavo… 2013 1 1 1552 1600 -8 1749
## 10 9E Endeavo… 2013 1 1 1554 1600 -6 1701
## # … with 336,766 more rows, and 11 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Use an appropriate join to add the airport name to the “flights2” dataset you got above. The codes and names of the airports are in the “airports” dataset of the “nycflights13” package. Put the carrier and carrier name first followed by the destination and destination name, then everything else.
flights2 <-right_join(airports,flights2, by=c('faa'='dest')) %>%
select(carrier, name.y, faa, name.x, everything()) %>%
rename(carrier_name=name.y, dest=faa, dest_name=name.x)
flights2
## # A tibble: 336,776 × 27
## carrier carrier_name dest dest_name lat lon alt tz dst tzone
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 B6 JetBlue Airways ABQ Albuquerqu… 35.0 -107. 5355 -7 A Amer…
## 2 B6 JetBlue Airways ABQ Albuquerqu… 35.0 -107. 5355 -7 A Amer…
## 3 B6 JetBlue Airways ABQ Albuquerqu… 35.0 -107. 5355 -7 A Amer…
## 4 B6 JetBlue Airways ABQ Albuquerqu… 35.0 -107. 5355 -7 A Amer…
## 5 B6 JetBlue Airways ABQ Albuquerqu… 35.0 -107. 5355 -7 A Amer…
## 6 B6 JetBlue Airways ABQ Albuquerqu… 35.0 -107. 5355 -7 A Amer…
## 7 B6 JetBlue Airways ABQ Albuquerqu… 35.0 -107. 5355 -7 A Amer…
## 8 B6 JetBlue Airways ABQ Albuquerqu… 35.0 -107. 5355 -7 A Amer…
## 9 B6 JetBlue Airways ABQ Albuquerqu… 35.0 -107. 5355 -7 A Amer…
## 10 B6 JetBlue Airways ABQ Albuquerqu… 35.0 -107. 5355 -7 A Amer…
## # … with 336,766 more rows, and 17 more variables: year <int>, month <int>,
## # day <int>, dep_time <int>, sched_dep_time <int>, dep_delay <dbl>,
## # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, flight <int>,
## # tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
The “nycflights13” library and the code to create spatial map is provided for you. Now compute the average delay by destination, then join on the airports dataframe so you can show the spatial distribution of delays.
Use the textbook for reference.
library(nycflights13)
#Compute de average departure delay by destination and save the result in flights_delay data frame
flights_delay<-flights %>% select(dest,dep_delay) %>% group_by(dest) %>% summarise(Av_delay=mean(dep_delay, na.rm = TRUE))
flights_delay
## # A tibble: 105 × 2
## dest Av_delay
## <chr> <dbl>
## 1 ABQ 13.7
## 2 ACK 6.46
## 3 ALB 23.6
## 4 ANC 12.9
## 5 ATL 12.5
## 6 AUS 13.0
## 7 AVL 8.19
## 8 BDL 17.7
## 9 BGR 19.5
## 10 BHM 29.7
## # … with 95 more rows
#Joining the average departure delay to the "airports" data frame
airports1<- right_join(airports,flights_delay , by=c('faa'='dest'))
airports1 %>%
ggplot(aes(lon, lat)) +
borders("state") +
geom_point(aes(size=Av_delay)) +
coord_quickmap()
## Warning: Removed 5 rows containing missing values (`geom_point()`).
Use a set operation function to find which airport codes from flights are not in the airports dataset.
setdiff(flights$dest, airports$faa)
## [1] "BQN" "SJU" "STT" "PSE"