Loading Libraries:

knitr::opts_chunk$set(warning = FALSE, fig.align = "center", out.width = "85%", 
                      message = FALSE, cache = TRUE)
library(openintro)
library(tidyverse)
library(nycflights13)
library(maps)


Global Theme Set Up:

my_theme <- theme(
  plot.title = element_text(size = rel(1.8), hjust = 0.5, 
                            margin = margin(10,15,10,10)),
  axis.title = element_text(size = rel(1.5), hjust = 0.5),
  axis.title.x = element_text(margin = margin(10,5,5,5)),
  axis.title.y = element_text(margin = margin(5,10,5,5)),
  axis.text = element_text(size = rel(1.1)))

Part I: Lab Exercise


Lab 1:
Explain how you understand the diagram above.

Answer: The diagram represents a relational database schema for flights data, illustrating the relationships between multiple tables. At the center is the flights table, which contains essential flight details such as year, month, day, hour, flight, origin, dest, tailnum, and carrier. The airports table stores airport information and is linked to the flights table through the origin and dest columns, which reference the faa (airport code). The planes table contains aircraft details and connects via the tailnum. Additionally, the airlines table provides airline names and is related to the flights table through the carrier column. Lastly, the weather table stores meteorological data and links to the flights table using year, month, day, hour, and origin, allowing analysis of weather conditions affecting flights. These relationships indicate that the schema is designed to analyze flight performance, delays, and external influences by integrating flights, weather, airport, and airline data.


Lab 2:
Answer whether there exists a primary key in the mpg data set with codes.


Code:

sum(duplicated(mpg))
## [1] 9

Answer: There is no primary key in mpg data set.


Lab 3:
Identify the following relationship to be one-to-many, one-to-one, or many-to-many

1. origin in flights and faa in airports

Answer: Many to one.


2. The variable student_name in a table student information for the college and a table grade score of our course.

Answer: The relationship between student_name and grade score depends on the structure of the table. If student_name contains unique values, the relationship is one-to-many, meaning each student can have multiple grade scores. However, if each student has only one grade score recorded, the relationship is one-to-one. In cases where there are multiple students with the same name, such as Anna or Linda, and each student has multiple scores across different subjects, the relationship becomes many-to-many.


Lab 4:
Add airports information to the data set flights2 for both origin and destination. Rename your columns to differentiate information for dest and origin.


Code:

flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)

flights2 %>%
  left_join(airports, c("origin" = "faa")) 
## # A tibble: 336,776 × 15
##     year month   day  hour origin dest  tailnum carrier name     lat   lon   alt
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>  <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA      Newar…  40.7 -74.2    18
##  2  2013     1     1     5 LGA    IAH   N24211  UA      La Gu…  40.8 -73.9    22
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      John …  40.6 -73.8    13
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      John …  40.6 -73.8    13
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      La Gu…  40.8 -73.9    22
##  6  2013     1     1     5 EWR    ORD   N39463  UA      Newar…  40.7 -74.2    18
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Newar…  40.7 -74.2    18
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      La Gu…  40.8 -73.9    22
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      John …  40.6 -73.8    13
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      La Gu…  40.8 -73.9    22
## # ℹ 336,766 more rows
## # ℹ 3 more variables: tz <dbl>, dst <chr>, tzone <chr>
flights2 %>%
  left_join(airports, c("dest" = "faa")) -> flights_airport

flights_airport %>%
  rename("daylight_saving" = `dst`, "airport_name" = `name`, 
         "departure_airport" = `origin`, "arrival_airport" = `dest`) %>%
  glimpse()
## Rows: 336,776
## Columns: 15
## $ year              <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013…
## $ month             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ hour              <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6…
## $ departure_airport <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LG…
## $ arrival_airport   <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IA…
## $ tailnum           <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N…
## $ carrier           <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6"…
## $ airport_name      <chr> "George Bush Intercontinental", "George Bush Interco…
## $ lat               <dbl> 29.98443, 29.98443, 25.79325, NA, 33.63672, 41.97860…
## $ lon               <dbl> -95.34144, -95.34144, -80.29056, NA, -84.42807, -87.…
## $ alt               <dbl> 97, 97, 8, NA, 1026, 668, 9, 313, 96, 668, 19, 26, 1…
## $ tz                <dbl> -6, -6, -5, NA, -5, -6, -5, -5, -5, -6, -5, -5, -8, …
## $ daylight_saving   <chr> "A", "A", "A", NA, "A", "A", "A", "A", "A", "A", "A"…
## $ tzone             <chr> "America/Chicago", "America/Chicago", "America/New_Y…


Lab 5:
Create a airport map similar to the one above with the color representing the number of flights per day from NYC to each airport.


Code:

flights_per_day <- flights %>%
  group_by(dest) %>%
  summarise(avg_flights_per_day = n()/365) %>%
  left_join(airports, by = c("dest" = "faa")) %>%
  print()
## # A tibble: 105 × 9
##    dest  avg_flights_per_day name             lat    lon   alt    tz dst   tzone
##    <chr>               <dbl> <chr>          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>
##  1 ABQ                0.696  Albuquerque I…  35.0 -107.   5355    -7 A     Amer…
##  2 ACK                0.726  Nantucket Mem   41.3  -70.1    48    -5 A     Amer…
##  3 ALB                1.20   Albany Intl     42.7  -73.8   285    -5 A     Amer…
##  4 ANC                0.0219 Ted Stevens A…  61.2 -150.    152    -9 A     Amer…
##  5 ATL               47.2    Hartsfield Ja…  33.6  -84.4  1026    -5 A     Amer…
##  6 AUS                6.68   Austin Bergst…  30.2  -97.7   542    -6 A     Amer…
##  7 AVL                0.753  Asheville Reg…  35.4  -82.5  2165    -5 A     Amer…
##  8 BDL                1.21   Bradley Intl    41.9  -72.7   173    -5 A     Amer…
##  9 BGR                1.03   Bangor Intl     44.8  -68.8   192    -5 A     Amer…
## 10 BHM                0.814  Birmingham In…  33.6  -86.8   644    -6 A     Amer…
## # ℹ 95 more rows
ggplot(flights_per_day, aes(x = lon, y = lat)) +
  borders("state") +
    geom_point(aes(color = avg_flights_per_day)) +
    scale_color_gradient( low = "palegreen", high = "rosybrown1") +
    coord_quickmap() +
  labs(title = "How Many Flights Departs From NYC?",
       x = "longitude",
       y = "latitude") +
  theme(plot.title = element_text(color = "palegreen4"),
        axis.title = element_text(color = "rosybrown4")) +
  my_theme