Loading Libraries:
knitr::opts_chunk$set(fig.align = "center", warning = FALSE, out.width = "85%",
message = FALSE, cache = TRUE)
library(openintro)
library(tidyverse)
library(nycflights13)
library(lubridate)
Global Theme Set Up:
my_theme <- theme(
plot.title = element_text(size = rel(1.8), hjust = 0.5),
plot.margin = margin(1,1,1,1, unit = "cm"),
axis.title = element_text(size = rel(1.5), hjust = 0.5,
margin = margin(15,15,15,15)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.1)))
(a) Create a histogram of arrival delays (excluding NAs) for
all flights in June and July. Summarizeyour findings.
Code:
Jun_Jul_flights <- filter(flights, month == 6 | month == 7)
ggplot(Jun_Jul_flights) +
geom_histogram(aes(x = arr_delay), fill = "lightskyblue2",
binwidth = 50, na.rm = TRUE) +
labs(title = "June & July Arrival Delay",
x = "Time of delay in minutes",
y = "Count") +
scale_x_continuous(breaks = seq(0, 1200, 200)) +
theme(plot.title = element_text(color = "lightsteelblue4"),
axis.title = element_text(color = "cadetblue")) +
my_theme
ggplot(Jun_Jul_flights) +
geom_histogram(aes(x = arr_delay), na.rm = TRUE,
binwidth = 50, fill = "lightskyblue2") +
coord_cartesian(ylim = c(0, 100)) +
scale_x_continuous(breaks = seq(0, 1200, 200)) +
labs(title = "June & July Arrival Delay (Zoom in)",
x = "Time of delay in minutes",
y = "Count") +
theme(plot.title = element_text(color = "lightsteelblue4"),
axis.title = element_text(color = "cadetblue")) +
my_theme
Answer: The first histogram displays the arrival delays
for June and July, with a highly skewed distribution. The majority of
flights have delays close to zero, while a smaller number of flights
experience significant delays extending beyond 200 minutes. The second
histogram provides a zoomed-in view, and it highlights that most delays
are concentrated below 200 minutes, with fewer flights experiencing
extreme delays.
Overall, the data suggests that while delays are common, extreme
delays are rare, occurring in only a small proportion of cases. This
small proportion of extreme cases indidcates that there might be other
factors contributing to these extreme delays, such as bad weather
conditions, accidents, or operational issues.
(b) Create a smooth line graph of arrival delays vs departure
delays for all flights departing from EWR on the first day of each
month. Summarize your findings.
Code:
ewr_1st_day <- filter(flights, origin == "EWR" & day == 1 & month %in% c(1,2,3,4,5,6,7,8,9,10,11,12))
ggplot(ewr_1st_day, aes(x = arr_delay, y = dep_delay)) +
geom_smooth(linetype = "dashed", linewidth = 2, color = "seagreen") +
labs(title = "EWR Arrival & Departure Delay in 1st day of Months",
x = "Arrival delay (minutes)",
y = "Departure delay (minutes)") +
theme(plot.title = element_text(color = "orange4"),
axis.title = element_text(color = "orange3")) +
my_theme
Answer: The graph shows a strong positive correlation
between arrival and departure delays at Newark Liberty International
Airport (EWR) on the first day of the months. As arrival delays
increase, departure delays also rise proportionally, reaching over 400
minutes for the most extreme cases. This suggests that delays in
incoming flights significantly impact outgoing flights.
(c) Find the flights that actually departed with the shortest
travel distance. What is its origin and destination
airport?
Code:
travel_distance <- flights %>%
arrange(distance) %>%
select(origin, dest, distance) %>%
print()
## # A tibble: 336,776 × 3
## origin dest distance
## <chr> <chr> <dbl>
## 1 EWR LGA 17
## 2 EWR PHL 80
## 3 EWR PHL 80
## 4 EWR PHL 80
## 5 EWR PHL 80
## 6 EWR PHL 80
## 7 EWR PHL 80
## 8 EWR PHL 80
## 9 EWR PHL 80
## 10 EWR PHL 80
## # ℹ 336,766 more rows
Answer: The shortest travel distance between origin
and destination is 17 miles, from EWR (Newark Liberty International
Airport) to LGA (LaGuardia Airport).
(d) Create a new categorical variable with two labels.
Flights with a travel distance shorter than 500 miles are marked as
“short-distance”, and otherwise “long-distance”. Create a bar plot to
compare the number of flights in each category. Summarize your
findings.
Code:
flight_distance <- mutate(flights, distance_label =
cut(flights$distance, c(0, 500, Inf),
c('short-distance','long-distance')))
ggplot(flight_distance, aes(x = distance_label)) +
geom_bar(fill = "darkorange3") +
labs(title = "Short vs Long Distance Flights",
x = "Label",
y = "Count") +
scale_y_continuous(breaks = seq(0, 200000, 50000)) +
theme(plot.title = element_text(color = "orchid4"),
axis.title = element_text(color = "plum4")) +
my_theme
Answer: The bar chart compares the number of
short-distance and long-distance flights. It shows that long-distance
flights significantly outnumber short-distance flights, with more than
double the count.
(e) Find the destination airport that has the longest average
departure delay by creating a graph.
Code:
long_avg_delay <- flights %>%
select(dest, dep_delay) %>%
group_by(dest) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = T)) %>%
arrange(desc(avg_dep_delay)) %>%
print()
## # A tibble: 105 × 2
## dest avg_dep_delay
## <chr> <dbl>
## 1 CAE 35.6
## 2 TUL 34.9
## 3 OKC 30.6
## 4 BHM 29.7
## 5 TYS 28.5
## 6 JAC 26.5
## 7 DSM 26.2
## 8 RIC 23.6
## 9 ALB 23.6
## 10 MSN 23.6
## # ℹ 95 more rows
long_avg_delay <- mutate(flights, avg_dep_delay = cut(flights$dep_delay,
c(0, 5, 10, 15, 20, 25, 30, 35, Inf),
c('<=5', '<=10','<=15', '<=20', '<=25',
'<=30', '<=35', '>35')))
ggplot(long_avg_delay, aes(x = avg_dep_delay)) +
geom_bar(fill = "goldenrod") +
labs(title = "Average of Departure Delay Flights",
x = "Average delay hours by groups",
y = "Count of flights") +
scale_y_continuous(breaks = seq(0, 200000, 50000)) +
theme(plot.title = element_text(color = "lightslateblue"),
axis.title = element_text(color = "mediumpurple1")) +
my_theme
(f) Answer the question in (e) without creating a
graph.
Answer: The destination airport with the longest average departure delay is CAE (Columbia Metropolitan Airport), with an average departure delay of 35.6 minutes.
(g) Find the carriers with the highest and the lowest average
flight speed for all their flights in the data set.
Code:
avg_speed <- flights%>%
select(carrier, distance, air_time) %>%
group_by(carrier) %>%
summarise(avg_air_speed = distance/air_time*60)%>%
arrange(avg_air_speed)%>%
print()
## # A tibble: 336,776 × 2
## # Groups: carrier [16]
## carrier avg_air_speed
## <chr> <dbl>
## 1 US 76.8
## 2 B6 84.7
## 3 9E 92.5
## 4 9E 95.6
## 5 US 96
## 6 US 96
## 7 US 97.6
## 8 US 98.0
## 9 US 101.
## 10 9E 103.
## # ℹ 336,766 more rows
avg_speed <- flights%>%
select(carrier, distance, air_time) %>%
group_by(carrier) %>%
summarise(avg_air_speed = distance/air_time*60)%>%
arrange(desc(avg_air_speed))%>%
print()
## # A tibble: 336,776 × 2
## # Groups: carrier [16]
## carrier avg_air_speed
## <chr> <dbl>
## 1 DL 703.
## 2 EV 650.
## 3 EV 648
## 4 EV 641.
## 5 DL 591.
## 6 DL 564
## 7 B6 557.
## 8 AA 556.
## 9 AA 554.
## 10 AA 554.
## # ℹ 336,766 more rows
Answer: The carrier with the lowest average flight
speed is US (US Airways) with a speed of 76.8 mph; the carrier with the
highest average flight speed is Delta Air Lines (DL) with a speed of 703
mph.
(h) (Self-study required) Find flights on which weekday (from
Monday to Sunday) had the longest departure delay on average. You are
allowed to ask AI to give you hints about the function that can return
the weekday of a date. But you are not allowed to ask AI to generate
codes directly.
Code:
lonest_delay_day <- flights %>%
mutate(weekday = wday(time_hour, label = TRUE)) %>%
group_by(weekday) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
arrange(desc(avg_dep_delay))
(a) How many species are there in the data set? What are
they?
Code:
unique(seattlepets$species)
## [1] "Dog" "Cat" "Goat" "Pig"
Answer: There are four species in the seattlepets
data set, and they are dog
, cat
,
goat
, and pig
.
(b) What are the most popular primary breeds for cats and
dogs, respectively?
Code:
seattlepets %>%
filter(species == "Dog" | species == "Cat") %>%
group_by(species, primary_breed) %>%
summarise(pribreed_catdog_ct = n(), .groups = "drop") %>%
arrange(species, desc(pribreed_catdog_ct)) %>%
print()
## # A tibble: 333 × 3
## species primary_breed pribreed_catdog_ct
## <chr> <chr> <int>
## 1 Cat Domestic Shorthair 10086
## 2 Cat Domestic Medium Hair 2146
## 3 Cat Domestic Longhair 1365
## 4 Cat American Shorthair 860
## 5 Cat Siamese 700
## 6 Cat LaPerm 358
## 7 Cat Maine Coon 290
## 8 Cat Mix 253
## 9 Cat Russian Blue 145
## 10 Cat Ragdoll 138
## # ℹ 323 more rows
Answer: The most popular primary breed for
dog
is Labrador Retriever
, and the most
popular primary breed for cat
is
Dosmastic Shorthair
.
(c) What are the three most common pet names in
Seattle?
Code:
seattlepets %>%
count(animal_name) %>%
filter(n > 350)
## # A tibble: 4 × 2
## animal_name n
## <chr> <int>
## 1 Charlie 387
## 2 Lucy 439
## 3 Luna 355
## 4 <NA> 483
Answer: The three most common name among pets are
Lucy
, Charlie
, and Luna
,
respectively.
(d) What are the ten most common pet names for cats? What are
the ten most common pet names for dogs? Write a code to print the result
and their frequencies.
Code:
top10_cat <- seattlepets %>%
filter(species == "Cat", !is.na(animal_name)) %>%
group_by(animal_name) %>%
summarise(name_count = n()) %>%
arrange(desc(name_count)) %>%
print()
## # A tibble: 7,026 × 2
## animal_name name_count
## <chr> <int>
## 1 Luna 111
## 2 Lucy 102
## 3 Lily 86
## 4 Max 83
## 5 Bella 82
## 6 Charlie 81
## 7 Oliver 73
## 8 Jack 65
## 9 Sophie 59
## 10 Leo 54
## # ℹ 7,016 more rows
top10_dog <- seattlepets %>%
filter(species == "Dog", !is.na(animal_name)) %>%
group_by(animal_name) %>%
summarise(name_count = n()) %>%
arrange(desc(name_count)) %>%
print()
## # A tibble: 9,752 × 2
## animal_name name_count
## <chr> <int>
## 1 Lucy 337
## 2 Charlie 306
## 3 Bella 249
## 4 Luna 244
## 5 Daisy 221
## 6 Cooper 189
## 7 Lola 187
## 8 Max 186
## 9 Molly 186
## 10 Stella 185
## # ℹ 9,742 more rows
Answer: The top 10 cat’s name are Luna
,
Lucy
, Lily
, Max
,
Bella
, Charlie
, Oliver
,
Jack
, Sophie
, and Leo
,
respectively. On the other hand, the top 10 dog’s name are
Lucy
, Charlie
, Bella
,
Luna
, Daisy
, Cooper
,
Lola
, Max
, Molly
, and
Stella
, respectively.
(e) How many names appear more than 100 times in the data set
excluding “NA”?
Code:
seattlepets %>%
count(animal_name) %>%
filter(n > 100, !is.na(animal_name))
## # A tibble: 56 × 2
## animal_name n
## <chr> <int>
## 1 Abby 115
## 2 Bailey 157
## 3 Bear 109
## 4 Bella 331
## 5 Buddy 218
## 6 Charlie 387
## 7 Chloe 173
## 8 Coco 147
## 9 Cooper 205
## 10 Daisy 261
## # ℹ 46 more rows
Answer: There are 56 names appear more than 100 times.
(f) For all names that appear more than 100 times in the data
set, which has the highest “cat_to_dog” ratio? Which has the lowest? The
“cat_to_dog” ratio can be computed this way - if a name appears 200
times, in which 150 are for cats and 50 are for dogs, the ratio is
150/50 = 3.
Code:
seattlepets %>%
count(species, animal_name) %>%
filter(!is.na(animal_name), species == "Cat", n > 100) %>%
arrange(desc(n))
## # A tibble: 2 × 3
## species animal_name n
## <chr> <chr> <int>
## 1 Cat Luna 111
## 2 Cat Lucy 102
seattlepets %>%
count(species, animal_name) %>%
filter(!is.na(animal_name), species == "Dog", n > 100) %>%
arrange(desc(n))
## # A tibble: 36 × 3
## species animal_name n
## <chr> <chr> <int>
## 1 Dog Lucy 337
## 2 Dog Charlie 306
## 3 Dog Bella 249
## 4 Dog Luna 244
## 5 Dog Daisy 221
## 6 Dog Cooper 189
## 7 Dog Lola 187
## 8 Dog Max 186
## 9 Dog Molly 186
## 10 Dog Stella 185
## # ℹ 26 more rows
102/337
## [1] 0.3026706
111/244
## [1] 0.454918
Answer: Lucy: 102 / 337 = 0.30, Luna: 111 / 244 =
0.45. Therefore, Luna has the highest “Cat
to
Dog
” ratio, and Lucy has the lowest “Cat
to
Dog
” ratio.
(g) Present a question of your own interest related to this
data set. Answer your question with analysis or
visualization.
Question 1: I would like to find out the number of
primary goat breeds and it’s types in the data set.
Code:
map(seattlepets, ~ sum(is.na(.)))
## $license_issue_date
## [1] 0
##
## $license_number
## [1] 0
##
## $animal_name
## [1] 483
##
## $species
## [1] 0
##
## $primary_breed
## [1] 0
##
## $secondary_breed
## [1] 29517
##
## $zip_code
## [1] 397
seattlepets %>%
count(species, primary_breed) %>%
filter(species == "Goat")
## # A tibble: 2 × 3
## species primary_breed n
## <chr> <chr> <int>
## 1 Goat Miniature 36
## 2 Goat Standard 2
Goat <- seattlepets %>%
filter(species == "Goat")
ggplot(Goat, aes(x = primary_breed)) +
geom_bar(fill = "thistle3") +
labs(title = "Primary Breed Type for Goat",
x = "Type of breed",
y = "Count") +
theme(plot.title = element_text(color = "slateblue4"),
axis.title = element_text(color = "slateblue3")) +
my_theme
Answer: The graph has two primary breeds listed:
Miniature
and Standard
. There are 36
Miniature
goats and 2 Standard
goats, making a
total of 38 goats in the dataset. I was initially curious about the
different types of goat
, but the dataset had far fewer than
I expected, so I decided to explore another question instead.
Question 2: I would like to find out the top 10
areas with the highest number of pet cats in the data set.
Code:
seattlepets %>%
filter(!is.na(zip_code), species == "Cat") %>%
group_by(zip_code) %>%
summarise(petcat_ct = n()) %>%
arrange(desc(petcat_ct)) %>%
head(5) %>%
print()
## # A tibble: 5 × 2
## zip_code petcat_ct
## <chr> <int>
## 1 98103 1655
## 2 98115 1626
## 3 98117 1344
## 4 98125 934
## 5 98122 931
cat_lover <- seattlepets %>%
filter(!is.na(zip_code), species == "Cat",
zip_code %in% c(98103, 98115, 98117, 98125, 98122))
ggplot(cat_lover, aes(x = zip_code, fill = zip_code)) +
geom_bar(fill = "thistle3") +
labs(title = "Top 5 Areas With The Most Pet Cat",
x = "Zip code",
y = "Numbers of cats") +
theme(plot.title = element_text(color = "slateblue4"),
axis.title = element_text(color = "slateblue3")) +
my_theme
Answer: The graph displays the top 5 areas with the
highest number of pet cats. The zip codes included are 98103, 98115,
98117, 98125, and 98122, with pet cat counts of 1,655, 1,626, 1,344,
934, and 931, respectively. Zip code 98103 has the highest number of pet
cats, while 98122 has the lowest.