Loading Libraries:

knitr::opts_chunk$set(fig.align = "center", warning = FALSE, out.width = "85%", 
                      message = FALSE, cache = TRUE)
library(openintro)
library(tidyverse)
library(nycflights13)
library(lubridate)


Global Theme Set Up:

my_theme <- theme(
  plot.title = element_text(size = rel(1.8), hjust = 0.5),
  plot.margin = margin(1,1,1,1, unit = "cm"),
  axis.title = element_text(size = rel(1.5), hjust = 0.5,
                            margin = margin(15,15,15,15)),
  axis.title.x = element_text(margin = margin(10,5,5,5)),
  axis.title.y = element_text(margin = margin(5,10,5,5)),
  axis.text = element_text(size = rel(1.1)))

Part I: Data Transformation and Visualization with the flights data set


Working with the data set flights in the package nycflights13, answer the following questions by performing necessary data transformation/visualization:

(a) Create a histogram of arrival delays (excluding NAs) for all flights in June and July. Summarizeyour findings.

Code:

Jun_Jul_flights <- filter(flights, month == 6 | month == 7)

ggplot(Jun_Jul_flights) +
  geom_histogram(aes(x = arr_delay), fill = "lightskyblue2", 
                 binwidth = 50, na.rm = TRUE) +
  labs(title = "June & July Arrival Delay",
       x = "Time of delay in minutes",
       y = "Count") +
  scale_x_continuous(breaks = seq(0, 1200, 200)) +
  theme(plot.title = element_text(color = "lightsteelblue4"),
        axis.title = element_text(color = "cadetblue")) +
  my_theme

ggplot(Jun_Jul_flights) +
  geom_histogram(aes(x = arr_delay), na.rm = TRUE, 
                 binwidth = 50, fill = "lightskyblue2") +
  coord_cartesian(ylim = c(0, 100)) +
  scale_x_continuous(breaks = seq(0, 1200, 200)) +
  labs(title = "June & July Arrival Delay (Zoom in)",
       x = "Time of delay in minutes",
       y = "Count") +
  theme(plot.title = element_text(color = "lightsteelblue4"),
        axis.title = element_text(color = "cadetblue")) +
  my_theme

Answer: The first histogram displays the arrival delays for June and July, with a highly skewed distribution. The majority of flights have delays close to zero, while a smaller number of flights experience significant delays extending beyond 200 minutes. The second histogram provides a zoomed-in view, and it highlights that most delays are concentrated below 200 minutes, with fewer flights experiencing extreme delays.

Overall, the data suggests that while delays are common, extreme delays are rare, occurring in only a small proportion of cases. This small proportion of extreme cases indidcates that there might be other factors contributing to these extreme delays, such as bad weather conditions, accidents, or operational issues.

(b) Create a smooth line graph of arrival delays vs departure delays for all flights departing from EWR on the first day of each month. Summarize your findings.

Code:

ewr_1st_day <- filter(flights, origin == "EWR" & day == 1 & month %in% c(1,2,3,4,5,6,7,8,9,10,11,12))

ggplot(ewr_1st_day, aes(x = arr_delay, y = dep_delay)) +
  geom_smooth(linetype = "dashed", linewidth = 2, color = "seagreen") +
  labs(title = "EWR Arrival & Departure Delay in 1st day of Months",
       x = "Arrival delay (minutes)",
       y = "Departure delay (minutes)") +
  theme(plot.title = element_text(color = "orange4"),
        axis.title = element_text(color = "orange3")) +
  my_theme

Answer: The graph shows a strong positive correlation between arrival and departure delays at Newark Liberty International Airport (EWR) on the first day of the months. As arrival delays increase, departure delays also rise proportionally, reaching over 400 minutes for the most extreme cases. This suggests that delays in incoming flights significantly impact outgoing flights.

(c) Find the flights that actually departed with the shortest travel distance. What is its origin and destination airport?

Code:

travel_distance <- flights %>%
  arrange(distance) %>%
  select(origin, dest, distance) %>%
  print()
## # A tibble: 336,776 × 3
##    origin dest  distance
##    <chr>  <chr>    <dbl>
##  1 EWR    LGA         17
##  2 EWR    PHL         80
##  3 EWR    PHL         80
##  4 EWR    PHL         80
##  5 EWR    PHL         80
##  6 EWR    PHL         80
##  7 EWR    PHL         80
##  8 EWR    PHL         80
##  9 EWR    PHL         80
## 10 EWR    PHL         80
## # ℹ 336,766 more rows

Answer: The shortest travel distance between origin and destination is 17 miles, from EWR (Newark Liberty International Airport) to LGA (LaGuardia Airport).

(d) Create a new categorical variable with two labels. Flights with a travel distance shorter than 500 miles are marked as “short-distance”, and otherwise “long-distance”. Create a bar plot to compare the number of flights in each category. Summarize your findings.

Code:

flight_distance <- mutate(flights, distance_label = 
                            cut(flights$distance, c(0, 500, Inf),
                                c('short-distance','long-distance')))

ggplot(flight_distance, aes(x = distance_label)) +
  geom_bar(fill = "darkorange3") +
  labs(title = "Short vs Long Distance Flights",
       x = "Label",
       y = "Count") +
  scale_y_continuous(breaks = seq(0, 200000, 50000)) +
  theme(plot.title = element_text(color = "orchid4"),
        axis.title = element_text(color = "plum4")) +
  my_theme

Answer: The bar chart compares the number of short-distance and long-distance flights. It shows that long-distance flights significantly outnumber short-distance flights, with more than double the count.

(e) Find the destination airport that has the longest average departure delay by creating a graph.

Code:

long_avg_delay <- flights %>%
  select(dest, dep_delay) %>%
  group_by(dest) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = T)) %>%
  arrange(desc(avg_dep_delay)) %>%
  print()
## # A tibble: 105 × 2
##    dest  avg_dep_delay
##    <chr>         <dbl>
##  1 CAE            35.6
##  2 TUL            34.9
##  3 OKC            30.6
##  4 BHM            29.7
##  5 TYS            28.5
##  6 JAC            26.5
##  7 DSM            26.2
##  8 RIC            23.6
##  9 ALB            23.6
## 10 MSN            23.6
## # ℹ 95 more rows
long_avg_delay <- mutate(flights, avg_dep_delay = cut(flights$dep_delay, 
                                                   c(0, 5, 10, 15, 20, 25, 30, 35, Inf),
                                                   c('<=5', '<=10','<=15', '<=20', '<=25',
                                                     '<=30', '<=35', '>35')))

ggplot(long_avg_delay, aes(x = avg_dep_delay)) +
  geom_bar(fill = "goldenrod") +
  labs(title = "Average of Departure Delay Flights",
       x = "Average delay hours by groups",
       y = "Count of flights") +
  scale_y_continuous(breaks = seq(0, 200000, 50000)) +
  theme(plot.title = element_text(color = "lightslateblue"),
        axis.title = element_text(color = "mediumpurple1")) +
  my_theme


(f) Answer the question in (e) without creating a graph.

Answer: The destination airport with the longest average departure delay is CAE (Columbia Metropolitan Airport), with an average departure delay of 35.6 minutes.

(g) Find the carriers with the highest and the lowest average flight speed for all their flights in the data set.

Code:

avg_speed <- flights%>%
  select(carrier, distance, air_time) %>%
  group_by(carrier) %>%
  summarise(avg_air_speed = distance/air_time*60)%>%
  arrange(avg_air_speed)%>%
  print()
## # A tibble: 336,776 × 2
## # Groups:   carrier [16]
##    carrier avg_air_speed
##    <chr>           <dbl>
##  1 US               76.8
##  2 B6               84.7
##  3 9E               92.5
##  4 9E               95.6
##  5 US               96  
##  6 US               96  
##  7 US               97.6
##  8 US               98.0
##  9 US              101. 
## 10 9E              103. 
## # ℹ 336,766 more rows
avg_speed <- flights%>%
  select(carrier, distance, air_time) %>%
  group_by(carrier) %>%
  summarise(avg_air_speed = distance/air_time*60)%>%
  arrange(desc(avg_air_speed))%>%
  print()
## # A tibble: 336,776 × 2
## # Groups:   carrier [16]
##    carrier avg_air_speed
##    <chr>           <dbl>
##  1 DL               703.
##  2 EV               650.
##  3 EV               648 
##  4 EV               641.
##  5 DL               591.
##  6 DL               564 
##  7 B6               557.
##  8 AA               556.
##  9 AA               554.
## 10 AA               554.
## # ℹ 336,766 more rows

Answer: The carrier with the lowest average flight speed is US (US Airways) with a speed of 76.8 mph; the carrier with the highest average flight speed is Delta Air Lines (DL) with a speed of 703 mph.

(h) (Self-study required) Find flights on which weekday (from Monday to Sunday) had the longest departure delay on average. You are allowed to ask AI to give you hints about the function that can return the weekday of a date. But you are not allowed to ask AI to generate codes directly.

Code:

lonest_delay_day <- flights %>%
  mutate(weekday = wday(time_hour, label = TRUE)) %>%
  group_by(weekday) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_dep_delay))

Part II: Analyzing the seattlepets data set


For the following questions, analyze the data set seattlepets in the package openintro. Read the help document and make sure that you understand the basic information about the data set before analysis:

(a) How many species are there in the data set? What are they?

Code:

unique(seattlepets$species)
## [1] "Dog"  "Cat"  "Goat" "Pig"

Answer: There are four species in the seattlepets data set, and they are dog, cat, goat, and pig.

(b) What are the most popular primary breeds for cats and dogs, respectively?

Code:

seattlepets %>%
  filter(species == "Dog" | species == "Cat") %>%
  group_by(species, primary_breed) %>%
  summarise(pribreed_catdog_ct = n(), .groups = "drop") %>%
  arrange(species, desc(pribreed_catdog_ct)) %>%
  print()
## # A tibble: 333 × 3
##    species primary_breed        pribreed_catdog_ct
##    <chr>   <chr>                             <int>
##  1 Cat     Domestic Shorthair                10086
##  2 Cat     Domestic Medium Hair               2146
##  3 Cat     Domestic Longhair                  1365
##  4 Cat     American Shorthair                  860
##  5 Cat     Siamese                             700
##  6 Cat     LaPerm                              358
##  7 Cat     Maine Coon                          290
##  8 Cat     Mix                                 253
##  9 Cat     Russian Blue                        145
## 10 Cat     Ragdoll                             138
## # ℹ 323 more rows

Answer: The most popular primary breed for dog is Labrador Retriever, and the most popular primary breed for cat is Dosmastic Shorthair.

(c) What are the three most common pet names in Seattle?

Code:

seattlepets %>%
  count(animal_name) %>%
  filter(n > 350)
## # A tibble: 4 × 2
##   animal_name     n
##   <chr>       <int>
## 1 Charlie       387
## 2 Lucy          439
## 3 Luna          355
## 4 <NA>          483

Answer: The three most common name among pets are Lucy, Charlie, and Luna, respectively.

(d) What are the ten most common pet names for cats? What are the ten most common pet names for dogs? Write a code to print the result and their frequencies.

Code:

top10_cat <- seattlepets %>%
  filter(species == "Cat", !is.na(animal_name)) %>%
  group_by(animal_name) %>%
  summarise(name_count = n()) %>%
  arrange(desc(name_count)) %>%
  print()
## # A tibble: 7,026 × 2
##    animal_name name_count
##    <chr>            <int>
##  1 Luna               111
##  2 Lucy               102
##  3 Lily                86
##  4 Max                 83
##  5 Bella               82
##  6 Charlie             81
##  7 Oliver              73
##  8 Jack                65
##  9 Sophie              59
## 10 Leo                 54
## # ℹ 7,016 more rows
top10_dog <- seattlepets %>%
  filter(species == "Dog", !is.na(animal_name)) %>%
  group_by(animal_name) %>%
  summarise(name_count = n()) %>%
  arrange(desc(name_count)) %>%
  print()
## # A tibble: 9,752 × 2
##    animal_name name_count
##    <chr>            <int>
##  1 Lucy               337
##  2 Charlie            306
##  3 Bella              249
##  4 Luna               244
##  5 Daisy              221
##  6 Cooper             189
##  7 Lola               187
##  8 Max                186
##  9 Molly              186
## 10 Stella             185
## # ℹ 9,742 more rows

Answer: The top 10 cat’s name are Luna, Lucy, Lily, Max, Bella, Charlie, Oliver, Jack, Sophie, and Leo, respectively. On the other hand, the top 10 dog’s name are Lucy, Charlie, Bella, Luna, Daisy, Cooper, Lola, Max, Molly, and Stella, respectively.

(e) How many names appear more than 100 times in the data set excluding “NA”?

Code:

seattlepets %>%
  count(animal_name) %>%
  filter(n > 100, !is.na(animal_name))
## # A tibble: 56 × 2
##    animal_name     n
##    <chr>       <int>
##  1 Abby          115
##  2 Bailey        157
##  3 Bear          109
##  4 Bella         331
##  5 Buddy         218
##  6 Charlie       387
##  7 Chloe         173
##  8 Coco          147
##  9 Cooper        205
## 10 Daisy         261
## # ℹ 46 more rows

Answer: There are 56 names appear more than 100 times.

(f) For all names that appear more than 100 times in the data set, which has the highest “cat_to_dog” ratio? Which has the lowest? The “cat_to_dog” ratio can be computed this way - if a name appears 200 times, in which 150 are for cats and 50 are for dogs, the ratio is 150/50 = 3.

Code:

seattlepets %>%
  count(species, animal_name) %>%
  filter(!is.na(animal_name), species == "Cat", n > 100) %>%
  arrange(desc(n))
## # A tibble: 2 × 3
##   species animal_name     n
##   <chr>   <chr>       <int>
## 1 Cat     Luna          111
## 2 Cat     Lucy          102
seattlepets %>%
  count(species, animal_name) %>%
  filter(!is.na(animal_name), species == "Dog", n > 100) %>%
  arrange(desc(n))
## # A tibble: 36 × 3
##    species animal_name     n
##    <chr>   <chr>       <int>
##  1 Dog     Lucy          337
##  2 Dog     Charlie       306
##  3 Dog     Bella         249
##  4 Dog     Luna          244
##  5 Dog     Daisy         221
##  6 Dog     Cooper        189
##  7 Dog     Lola          187
##  8 Dog     Max           186
##  9 Dog     Molly         186
## 10 Dog     Stella        185
## # ℹ 26 more rows
102/337
## [1] 0.3026706
111/244
## [1] 0.454918

Answer: Lucy: 102 / 337 = 0.30, Luna: 111 / 244 = 0.45. Therefore, Luna has the highest “Cat to Dog” ratio, and Lucy has the lowest “Cat to Dog” ratio.

(g) Present a question of your own interest related to this data set. Answer your question with analysis or visualization.

Question 1: I would like to find out the number of primary goat breeds and it’s types in the data set.

Code:

map(seattlepets, ~ sum(is.na(.)))
## $license_issue_date
## [1] 0
## 
## $license_number
## [1] 0
## 
## $animal_name
## [1] 483
## 
## $species
## [1] 0
## 
## $primary_breed
## [1] 0
## 
## $secondary_breed
## [1] 29517
## 
## $zip_code
## [1] 397
seattlepets %>%
  count(species, primary_breed) %>%
  filter(species == "Goat")
## # A tibble: 2 × 3
##   species primary_breed     n
##   <chr>   <chr>         <int>
## 1 Goat    Miniature        36
## 2 Goat    Standard          2
Goat <- seattlepets %>%
  filter(species == "Goat")

ggplot(Goat, aes(x = primary_breed)) +
  geom_bar(fill = "thistle3") +
  labs(title = "Primary Breed Type for Goat",
       x = "Type of breed",
       y = "Count") +
  theme(plot.title = element_text(color = "slateblue4"),
        axis.title = element_text(color = "slateblue3")) +
  my_theme 

Answer: The graph has two primary breeds listed: Miniature and Standard. There are 36 Miniature goats and 2 Standard goats, making a total of 38 goats in the dataset. I was initially curious about the different types of goat, but the dataset had far fewer than I expected, so I decided to explore another question instead. 

Question 2: I would like to find out the top 10 areas with the highest number of pet cats in the data set.

Code:

seattlepets %>%
  filter(!is.na(zip_code), species == "Cat") %>%
  group_by(zip_code) %>%
  summarise(petcat_ct = n()) %>%
  arrange(desc(petcat_ct)) %>%
  head(5) %>%
  print()
## # A tibble: 5 × 2
##   zip_code petcat_ct
##   <chr>        <int>
## 1 98103         1655
## 2 98115         1626
## 3 98117         1344
## 4 98125          934
## 5 98122          931
cat_lover <- seattlepets %>%
  filter(!is.na(zip_code), species == "Cat", 
         zip_code %in% c(98103, 98115, 98117, 98125, 98122)) 

ggplot(cat_lover, aes(x = zip_code, fill = zip_code)) +
  geom_bar(fill = "thistle3") +
  labs(title = "Top 5 Areas With The Most Pet Cat",
       x = "Zip code",
       y = "Numbers of cats") +
  theme(plot.title = element_text(color = "slateblue4"),
        axis.title = element_text(color = "slateblue3")) +
  my_theme 

Answer: The graph displays the top 5 areas with the highest number of pet cats. The zip codes included are 98103, 98115, 98117, 98125, and 98122, with pet cat counts of 1,655, 1,626, 1,344, 934, and 931, respectively. Zip code 98103 has the highest number of pet cats, while 98122 has the lowest.