library(tidyverse)
library(openintro)
data(nycflights)
Exercise 1
The three histograms display the same distribution of departure
delays but with different levels of detail due to the choice of
binwidth. All three plots show a strongly right-skewed distribution,
with most flights having delays near zero and a long tail of larger
delays. The histogram with the smallest binwidth (15) reveals more detailed
structure, including a sharp spike around 0 minutes and more variation
across delay values, but it appears noisier. The default histogram
provides a balanced view, showing the overall pattern without excessive
noise. The histogram with the largest binwidth (150) smooths the distribution
significantly, clearly highlighting the overall trend but obscuring
important details such as smaller clusters and variability. Overall,
smaller binwidths reveal fine details, while larger binwidths emphasize
the general shape but may hide meaningful patterns.
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 15)

ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 150)

Exercise 2
A new data frame was created using filter() to include only flights
with destination SFO in February. The number of flights that met this
criterion is 68.
sfo_feb_flights <- nycflights %>%
filter(dest == "SFO", month == 2)
nrow(sfo_feb_flights)
## [1] 68
Exercise 3
The histogram of arrival delays for flights to SFO in February shows
a right-skewed distribution. Most flights are concentrated around values
near zero, with many flights arriving early (negative delays), and a
long right tail representing a small number of flights with large
delays. Because the distribution is skewed and includes extreme values,
the median and interquartile range (IQR) are more appropriate summary
statistics than the mean and standard deviation. The median arrival
delay is -11 minutes, indicating that a typical flight arrives slightly
early. The IQR is 23.2 minutes, showing that the middle 50% of arrival
delays fall within a range of about 23 minutes. Overall, most flights
arrive close to or earlier than scheduled, but a small number of flights
experience large delays, creating the long right tail of the
distribution.
ggplot(sfo_feb_flights, aes(x = arr_delay)) +
geom_histogram(binwidth = 15)

sfo_feb_flights %>%
summarize(median_delay = median(arr_delay, na.rm = TRUE),
IQR_delay = IQR(arr_delay, na.rm = TRUE))
## # A tibble: 1 × 2
## median_delay IQR_delay
## <dbl> <dbl>
## 1 -11 23.2
sfo_feb_flights |>
group_by(origin) |>
summarise(median_dd = median(dep_delay), iqr_dd = IQR(dep_delay), n_flights = n())
## # A tibble: 2 × 4
## origin median_dd iqr_dd n_flights
## <chr> <dbl> <dbl> <int>
## 1 EWR 0.5 5.75 8
## 2 JFK -2.5 15.2 60
Exercise 4
Arrival delays (arr_delay) were summarized by carrier using the
median and interquartile range (IQR). Since arrival delays are
right-skewed with outliers, the median and IQR provide resistant
measures of center and spread. Based on the results, DL and UA have the
most variable arrival delays, each with an IQR of 22 minutes, meaning
the middle 50% of their arrival delays span a wider range than the other
carriers. DL and UA also have negative medians (DL = -15, UA = -10),
meaning their typical flight arrived early, but variability across
flights was still the greatest.
sfo_feb_flights %>%
group_by(carrier) %>%
summarize(median_delay = median(arr_delay, na.rm = TRUE),
IQR_delay = IQR(arr_delay, na.rm = TRUE))
## # A tibble: 5 × 3
## carrier median_delay IQR_delay
## <chr> <dbl> <dbl>
## 1 AA 5 17.5
## 2 B6 -10.5 12.2
## 3 DL -15 22
## 4 UA -10 22
## 5 VX -22.5 21.2
sfo_feb_flights %>%
group_by(carrier) %>%
summarize(
median_delay = median(arr_delay, na.rm = TRUE),
IQR_delay = IQR(arr_delay, na.rm = TRUE)
) %>%
arrange(desc(IQR_delay))
## # A tibble: 5 × 3
## carrier median_delay IQR_delay
## <chr> <dbl> <dbl>
## 1 DL -15 22
## 2 UA -10 22
## 3 VX -22.5 21.2
## 4 AA 5 17.5
## 5 B6 -10.5 12.2
Exercise 5
Choosing the month with the lowest mean departure delay minimizes the
overall expected delay. This approach takes into account all flights,
including those with extreme delays, and is useful if a traveler wants
to reduce the average delay they might experience. However, the mean is
sensitive to outliers, so a few unusually large delays (such as those
caused by severe weather or congestion) can inflate the mean and make a
month appear worse than the typical experience. Choosing the month with
the lowest median departure delay focuses on the typical flight
experience, since the median is resistant to extreme values. This is
especially useful for skewed data, such as departure delays, which often
have a long right tail. However, the median does not capture the risk of
rare but severe delays and may underestimate the likelihood of
experiencing a very long delay. In this dataset, October (month 10) has
both the lowest mean delay (5.88 minutes) and a negative median delay
(-3 minutes), indicating that flights typically depart early and delays
are relatively small and consistent. In contrast, July (month 7) has the
highest mean delay (20.8 minutes) and a median delay of 0 minutes, along
with a much larger IQR (26 minutes). This suggests that while a typical
flight in July may be on time, departure delays are much more variable
and there is a greater risk of experiencing large delays. Overall,
October is the best choice by mean delay and ties with September for the
lowest median (-3 minutes), while July represents a less desirable option
due to its higher average delays and greater variability. This
comparison highlights how seasonal factors, such as
summer travel volume, can increase both the frequency and variability of
delays.
nycflights |>
group_by(month) |>
summarise(mean_dd = mean(dep_delay)) |>
arrange(desc(mean_dd))
## # A tibble: 12 × 2
## month mean_dd
## <int> <dbl>
## 1 7 20.8
## 2 6 20.4
## 3 12 17.4
## 4 4 14.6
## 5 3 13.5
## 6 5 13.3
## 7 8 12.6
## 8 2 10.7
## 9 1 10.2
## 10 9 6.87
## 11 11 6.10
## 12 10 5.88
nycflights %>%
group_by(month) %>%
summarize(mean_dd = mean(dep_delay, na.rm = TRUE)) %>%
arrange(mean_dd)
## # A tibble: 12 × 2
## month mean_dd
## <int> <dbl>
## 1 10 5.88
## 2 11 6.10
## 3 9 6.87
## 4 1 10.2
## 5 2 10.7
## 6 8 12.6
## 7 5 13.3
## 8 3 13.5
## 9 4 14.6
## 10 12 17.4
## 11 6 20.4
## 12 7 20.8
nycflights %>%
group_by(month) %>%
summarize(
mean_dd = mean(dep_delay, na.rm = TRUE),
median_dd = median(dep_delay, na.rm = TRUE),
IQR_dd = IQR(dep_delay, na.rm = TRUE),
n = sum(!is.na(dep_delay))) %>%
arrange(mean_dd)
## # A tibble: 12 × 5
## month mean_dd median_dd IQR_dd n
## <int> <dbl> <dbl> <dbl> <int>
## 1 10 5.88 -3 9 2884
## 2 11 6.10 -2 10 2733
## 3 9 6.87 -3 8 2681
## 4 1 10.2 -2 12 2610
## 5 2 10.7 -2 15 2286
## 6 8 12.6 -1 15 2880
## 7 5 13.3 -1 19 2821
## 8 3 13.5 -1 17 2869
## 9 4 14.6 -2 16 2781
## 10 12 17.4 1 25 2716
## 11 6 20.4 0 25 2732
## 12 7 20.8 0 26 2742
Exercise 6
To determine which NYC airport has the highest on-time departure
percentage, flights were classified as “on time” if the departure delay
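was less than 5 minutes. A stacked bar chart was also created to
compare on-time and delayed flights by airport; because geom_bar()
stacks counts by default, the proportions are judged from the relative
heights of the segments within each bar. The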
results show that LaGuardia Airport (LGA) has the highest on-time
departure rate at 72.8%, followed by JFK (69.4%) and EWR (63.7%). The
visualization supports this finding, as LGA has the largest proportion
of on-time flights (teal) and the smallest proportion of delayed flights
(red), while EWR shows the opposite pattern. Therefore, if selecting an
airport based solely on on-time departure performance, LGA would be the
best choice. The consistency between the numerical summary and the
visual representation strengthens the conclusion that LGA has the best
on-time performance.
nycflights <- nycflights |>
mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
nycflights |>
group_by(origin) |>
summarise(ot_dep_rate = sum(dep_type == "on time") / n()) |>
arrange(desc(ot_dep_rate))
## # A tibble: 3 × 2
## origin ot_dep_rate
## <chr> <dbl>
## 1 LGA 0.728
## 2 JFK 0.694
## 3 EWR 0.637
ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
geom_bar()

Exercise 7
A new variable, avg_speed, was created by dividing distance (miles)
by flight time in hours, converting air_time from minutes to hours.
nycflights <- nycflights %>%
mutate(avg_speed = distance / (air_time / 60))
summary(nycflights$avg_speed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 76.8 357.4 404.0 394.1 438.8 703.4
Exercise 8
The scatterplot shows a positive relationship between distance and
average speed. Flights covering longer distances tend to have higher
average speeds. This pattern likely occurs because longer flights spend
more time at cruising altitude, where planes travel more efficiently and
at higher speeds. In contrast, shorter flights spend more time taking
off and landing, which reduces their average speed.
ggplot(nycflights, aes(x = distance, y = avg_speed)) +
geom_point(alpha = 0.3)

ggplot(nycflights, aes(x = distance, y = avg_speed)) +
geom_point(alpha = 0.3) +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Exercise 9
The scatterplot of departure delay versus arrival delay for American
(AA), Delta (DL), and United (UA) shows a strong positive relationship:
flights that depart later tend to arrive later. However, there are many
observations where flights with small departure delays still arrive on
time or early (negative arrival delays). This suggests that airlines are
often able to make up some lost time during the flight. Based on the
plot, the approximate cutoff point appears to be around 30–60 minutes of
departure delay. Flights delayed less than this range often still arrive
on time, while flights with delays greater than this are much more
likely to arrive late. Overall, this indicates that moderate departure
delays can sometimes be recovered, but larger delays typically result in
late arrivals.
selected_flights <- nycflights %>%
filter(carrier %in% c("AA", "DL", "UA"))
ggplot(selected_flights, aes(x = dep_delay, y = arr_delay, color = carrier)) +
geom_point(alpha = 0.5)

Exercises 3.5.1
Faceting a continuous variable creates a panel for each unique
value, which can result in a very large number of small plots. This is
often not useful because the panels become too numerous, each panel may
contain very few observations, and the plot becomes difficult to
interpret. In practice, continuous variables should be binned or
categorized before faceting.
Empty cells represent combinations of variables that do not exist
in the dataset. For example, if no cars have a certain combination of
drive type and number of cylinders, ggplot still creates the panel, but
it appears empty because no observations fall into that category. These
empty cells highlight missing combinations in the data, not
errors.
First plot: creates one column of plots, with each row showing a
different value of drive type, so the panels are stacked vertically.
Second plot: creates one row of plots, with each column showing a
different value of number of cylinders, so the panels are arranged
horizontally. The . means "no faceting in that direction": in drv ~ .
there are no columns, only rows; in . ~ cyl there are no rows, only
columns.
Advantages of faceting: Separates groups into clear, individual
panels. Makes patterns easier to compare without overlap. Reduces
clutter in dense plots. Disadvantages of faceting: Takes up more space.
Harder to compare values across panels directly. Can become overwhelming
with many categories. Advantages of using color: All data shown in one
plot. Easier direct comparison across groups. More compact.
Disadvantages of color: Overplotting (points overlap). Hard to
distinguish groups if many categories. Can be visually cluttered. With
larger data sets, color can become cluttered, and faceting becomes more
useful for clarity.
In facet_wrap(), the arguments nrow and ncol control the layout
of the panels. The nrow argument specifies the number of rows, while
ncol specifies the number of columns used to arrange the panels. In
contrast, facet_grid() does not include nrow and ncol arguments because
its layout is determined entirely by the variables used for faceting.
The number of rows and columns is fixed based on the number of levels in
each variable, so the user does not manually control the
layout.
You should put the variable with more unique levels in the
columns because screens are typically wider than they are tall. Putting
more levels in columns uses horizontal space efficiently, prevents plots
from becoming too tall and compressed, and makes the layout easier to
read.
ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = cyl))

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)

Exercises 3.7.1
The default geom for stat_summary() is geom_pointrange(). To
rewrite a plot using the geom instead of the stat, you specify stat =
"summary" inside the geom:
geom_pointrange(stat = "summary", fun.data = "mean_cl_normal")
geom_col() creates bar plots using values that are already
present in the dataset, meaning the heights of the bars are taken
directly from a y variable. In contrast, geom_bar() calculates counts
automatically (using stat = "count" by default) and does not require a y
variable. Therefore, geom_col() is used when the data has already been
summarized, while geom_bar() is used for raw data.
Most geoms and stats come in pairs, such as geom_bar() with
stat_count(), geom_histogram() with stat_bin(), and geom_smooth() with
stat_smooth(). These pairs work together such that the stat computes a
transformation of the data, and the geom controls how the result is
displayed. They share the same underlying structure and can often be
interchanged by specifying the stat or geom argument
explicitly.
stat_smooth() computes a smoothed trend line through the data and
often includes a confidence interval around the estimate. Its behavior
is controlled by parameters such as method (e.g., "lm" or "loess"), se
(whether to display the confidence interval), and other arguments like
span for controlling smoothness.
The issue is that geom_bar() groups the data by the discrete variables
in the plot, here cut (and color when it is mapped to fill). As a
result, proportions are calculated within each group rather than across
the entire dataset, so every bar has a proportion of 1 and all bars are
the same height. Setting group = 1 overrides this default grouping so
that the proportions are calculated relative to all observations.
Without it, the graphs display misleading proportions because each
category is normalized separately rather than globally.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))
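The effect of group = 1 on the first plot can be sketched as follows. This is a minimal illustration for the unfilled case only; the object names p_prop and cut_props are made up for the example, and the hand-computed table is included just to check what the bars should show.

```r
library(ggplot2)

# With group = 1 all rows form one group, so prop is each cut's share
# of the whole dataset rather than 1 for every bar.
p_prop <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

# The same proportions computed by hand, for comparison with the bar heights.
cut_props <- prop.table(table(diamonds$cut))
cut_props
```

Printing p_prop should show bars of different heights that sum to 1 across the five cuts.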

Exercises 3.8.1
The main problem with this plot is overplotting, since many
observations share the same values of cty and hwy, causing points to
overlap and obscure the true number of observations. This makes it
difficult to assess the density of the data. The plot can be improved by
using jittering (e.g., geom_jitter()), adding transparency, or using
geom_count() to better display the frequency of overlapping
points.
The amount of jittering is controlled by the width and height
parameters in geom_jitter(). The width argument determines the amount of
horizontal movement, while the height argument controls vertical
movement of points.
geom_jitter() reduces overplotting by adding small random
variation to the position of points, making overlapping observations
visible. In contrast, geom_count() represents overlapping observations
by increasing the size of the points based on their frequency. While
jittering spreads the data visually, geom_count() preserves the exact
values but encodes frequency through point size.
The default position adjustment for geom_boxplot() is position =
"dodge2", which places boxplots side-by-side when there are multiple
groups. This allows for easier comparison between categories.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
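The two remedies described above can be compared side by side. This is a sketch: the width, height, and alpha values are illustrative choices rather than recommended settings, and the object names are made up for the example.

```r
library(ggplot2)

# Jittering: move each point by a small random amount so that
# overlapping observations become visible.
p_jitter <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0.3, alpha = 0.5)

# Counting: keep the exact positions and scale point size by frequency.
p_count <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()

# How much overlap is there? Compare distinct (cty, hwy) pairs with rows.
n_unique_pairs <- nrow(unique(mpg[, c("cty", "hwy")]))
c(rows = nrow(mpg), unique_pairs = n_unique_pairs)
```

The gap between the number of rows and the number of distinct pairs is the overplotting that the plain geom_point() version hides.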

Exercises 3.9.1
A stacked bar chart can be converted into a pie chart by using
coord_polar(theta = "y"), which transforms the y-values into angular
space and wraps the bar chart into a circle.
The labs() function is used to add and modify labels in a ggplot,
including the title, subtitle, axis labels, legend titles, and caption.
The helpers xlab(), ylab(), and ggtitle() are shorthand for setting a
single label through labs(). By improving labeling, labs() makes plots
more interpretable and accessible to a broader audience.
coord_map() applies a true map projection and preserves
geographic accuracy, but it is computationally slower. In contrast,
coord_quickmap() provides a faster approximation of a map projection and
works well for smaller geographic regions where distortion is
minimal.
The plot shows a strong positive relationship between city (cty)
and highway (hwy) mileage: vehicles with higher city MPG also tend to
have higher highway MPG. Most points lie above the reference line,
indicating that highway MPG is generally higher than city MPG. The
function coord_fixed() is important because it ensures equal scaling on
both axes, making the reference line meaningful and allowing accurate
visual comparison between the two variables. The function geom_abline()
adds a diagonal reference line (with slope = 1 and intercept = 0),
representing where city MPG equals highway MPG. This helps highlight the
difference between the two measures and makes it clear that most
vehicles perform better on the highway than in the city.
ggplot(diamonds, aes(x = "", fill = cut)) +
geom_bar(width = 1) +
coord_polar(theta = "y")

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed()
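The visual claim that most points lie above the reference line can also be checked numerically. This is a small sketch; share_above is a name introduced only for this example.

```r
library(ggplot2)  # provides the mpg dataset

# Points above the line y = x are cars whose highway mileage
# exceeds their city mileage.
share_above <- mean(mpg$hwy > mpg$cty)
share_above
```

A value close to 1 agrees with the plot, where nearly all points sit above the line drawn by geom_abline().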

Exercises 4.4
This code does not work because the name used in the second line is
not the name that was assigned: my_varıable is typed with a dotless ı
instead of an i. R requires object names to match exactly, so it cannot
find the object and returns an "object not found" error.
The corrected commands are:
library(tidyverse)
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y =
hwy))
filter(mpg, cyl == 8)
filter(diamonds, carat > 3)
The errors included a misspelled argument name (dota instead of
data), using = instead of == for comparison, and referencing the dataset
diamond instead of diamonds.
- Pressing Alt + Shift + K in RStudio opens the keyboard shortcuts
help menu. This menu provides a list of useful shortcuts to improve
efficiency when coding. The same menu can be accessed by navigating to
Help → Keyboard Shortcuts Help in the RStudio toolbar.
library(tidyverse)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))

## # A tibble: 70 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a6 quattro 4.2 2008 8 auto… 4 16 23 p mids…
## 2 chevrolet c1500 sub… 5.3 2008 8 auto… r 14 20 r suv
## 3 chevrolet c1500 sub… 5.3 2008 8 auto… r 11 15 e suv
## 4 chevrolet c1500 sub… 5.3 2008 8 auto… r 14 20 r suv
## 5 chevrolet c1500 sub… 5.7 1999 8 auto… r 13 17 r suv
## 6 chevrolet c1500 sub… 6 2008 8 auto… r 12 17 r suv
## 7 chevrolet corvette 5.7 1999 8 manu… r 16 26 p 2sea…
## 8 chevrolet corvette 5.7 1999 8 auto… r 15 23 p 2sea…
## 9 chevrolet corvette 6.2 2008 8 manu… r 16 26 p 2sea…
## 10 chevrolet corvette 6.2 2008 8 auto… r 15 25 p 2sea…
## # ℹ 60 more rows
filter(diamonds, carat > 3)
## # A tibble: 32 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 3.01 Premium I I1 62.7 58 8040 9.1 8.97 5.67
## 2 3.11 Fair J I1 65.9 57 9823 9.15 9.02 5.98
## 3 3.01 Premium F I1 62.2 56 9925 9.24 9.13 5.73
## 4 3.05 Premium E I1 60.9 58 10453 9.26 9.25 5.66
## 5 3.02 Fair I I1 65.2 56 10577 9.11 9.02 5.91
## 6 3.01 Fair H I1 56.1 62 10761 9.54 9.38 5.31
## 7 3.65 Fair H I1 67.1 53 11668 9.53 9.48 6.38
## 8 3.24 Premium H I1 62.1 58 12300 9.44 9.4 5.85
## 9 3.22 Ideal I I1 62.6 55 12545 9.49 9.42 5.92
## 10 3.5 Ideal H I1 62.8 57 12587 9.65 9.59 6.03
## # ℹ 22 more rows
---
title: "Lab 2: Intro to Data"
author: "Caitlin Kennedy"
date: "`r Sys.Date()`"
output: openintro::lab_report
---

```{r load-packages, message=FALSE}
library(tidyverse)
library(openintro)
data(nycflights)
```

### Exercise 1

The three histograms display the same distribution of departure delays but with different levels of detail due to the choice of binwidth. All three plots show a strongly right-skewed distribution, with most flights having delays near zero and a long tail of larger delays.
The histogram with the smallest binwidth (15) reveals more detailed structure, including a sharp spike around 0 minutes and more variation across delay values, but it appears noisier. The default histogram provides a balanced view, showing the overall pattern without excessive noise. The histogram with the largest binwidth (150) smooths the distribution significantly, clearly highlighting the overall trend but obscuring important details such as smaller clusters and variability.
Overall, smaller binwidths reveal fine details, while larger binwidths emphasize the general shape but may hide meaningful patterns.

```{r code-chunk-label}
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram()

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 150)
```

### Exercise 2

A new data frame was created using filter() to include only flights with destination SFO in February.
The number of flights that met this criterion is 68.

```{r}

sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)

nrow(sfo_feb_flights)

```

### Exercise 3

The histogram of arrival delays for flights to SFO in February shows a right-skewed distribution. Most flights are concentrated around values near zero, with many flights arriving early (negative delays), and a long right tail representing a small number of flights with large delays.
Because the distribution is skewed and includes extreme values, the median and interquartile range (IQR) are more appropriate summary statistics than the mean and standard deviation.
The median arrival delay is -11 minutes, indicating that a typical flight arrives slightly early. The IQR is 23.2 minutes, showing that the middle 50% of arrival delays fall within a range of about 23 minutes.
Overall, most flights arrive close to or earlier than scheduled, but a small number of flights experience large delays, creating the long right tail of the distribution.

```{r}

ggplot(sfo_feb_flights, aes(x = arr_delay)) +
  geom_histogram(binwidth = 15)


sfo_feb_flights %>%
  summarize(median_delay = median(arr_delay, na.rm = TRUE),
    IQR_delay = IQR(arr_delay, na.rm = TRUE))
  
sfo_feb_flights |>
  group_by(origin) |>
  summarise(median_dd = median(dep_delay), iqr_dd = IQR(dep_delay), n_flights = n())
```
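
To see why the median and IQR are preferred for skewed delays, the sketch below compares how each summary reacts to a single extreme value. The numbers are made up for illustration and are not taken from nycflights.

```r
# Hypothetical arrival delays in minutes; values are illustrative only.
delays       <- c(-20, -15, -11, -8, -5, 0, 10)
with_outlier <- c(delays, 300)  # add one very late flight

# Resistant summaries move only slightly when the outlier is added...
median(delays)        # -8
median(with_outlier)  # -6.5
IQR(delays)
IQR(with_outlier)

# ...while the mean and standard deviation shift substantially.
mean(delays)          # -7
mean(with_outlier)    # 31.375
sd(delays)
sd(with_outlier)
```

One late flight moves the median by 1.5 minutes but moves the mean by more than 38 minutes, which is the behavior the paragraph above relies on.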

### Exercise 4

Arrival delays (arr_delay) were summarized by carrier using the median and interquartile range (IQR). Since arrival delays are right-skewed with outliers, the median and IQR provide resistant measures of center and spread.
Based on the results, DL and UA have the most variable arrival delays, each with an IQR of 22 minutes, meaning the middle 50% of their arrival delays span a wider range than the other carriers.
DL and UA also have negative medians (DL = -15, UA = -10), meaning their typical flight arrived early, but variability across flights was still the greatest.

```{r}

sfo_feb_flights %>%
  group_by(carrier) %>%
  summarize(median_delay = median(arr_delay, na.rm = TRUE),
    IQR_delay = IQR(arr_delay, na.rm = TRUE))


sfo_feb_flights %>%
  group_by(carrier) %>%
  summarize(
    median_delay = median(arr_delay, na.rm = TRUE),
    IQR_delay = IQR(arr_delay, na.rm = TRUE)
  ) %>%
  arrange(desc(IQR_delay))

```

### Exercise 5

Choosing the month with the lowest mean departure delay minimizes the overall expected delay. This approach takes into account all flights, including those with extreme delays, and is useful if a traveler wants to reduce the average delay they might experience. However, the mean is sensitive to outliers, so a few unusually large delays (such as those caused by severe weather or congestion) can inflate the mean and make a month appear worse than the typical experience.
Choosing the month with the lowest median departure delay focuses on the typical flight experience, since the median is resistant to extreme values. This is especially useful for skewed data, such as departure delays, which often have a long right tail. However, the median does not capture the risk of rare but severe delays and may underestimate the likelihood of experiencing a very long delay.
In this dataset, October (month 10) has both the lowest mean delay (5.88 minutes) and a negative median delay (-3 minutes), indicating that flights typically depart early and delays are relatively small and consistent.
In contrast, July (month 7) has the highest mean delay (20.8 minutes) and a median delay of 0 minutes, along with a much larger IQR (26 minutes). This suggests that while a typical flight in July may be on time, departure delays are much more variable and there is a greater risk of experiencing large delays.
Overall, October is the best choice by mean delay and ties with September for the lowest median (-3 minutes), while July represents a less desirable option due to its higher average delays and greater variability.
This comparison highlights how seasonal factors, such as summer travel volume, can increase both the frequency and variability of delays.

```{r}
nycflights |>
  group_by(month) |>
  summarise(mean_dd = mean(dep_delay)) |>
  arrange(desc(mean_dd))

nycflights %>%
  group_by(month) %>%
  summarize(mean_dd = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(mean_dd)

nycflights %>%
  group_by(month) %>%
  summarize(
    mean_dd   = mean(dep_delay, na.rm = TRUE),
    median_dd = median(dep_delay, na.rm = TRUE),
    IQR_dd    = IQR(dep_delay, na.rm = TRUE),
    n         = sum(!is.na(dep_delay))
  ) %>%
  arrange(mean_dd)
```


### Exercise 6

To determine which NYC airport has the highest on-time departure percentage, flights were classified as “on time” if the departure delay was less than 5 minutes.
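A stacked bar chart was also created to compare on-time and delayed flights by airport; because geom_bar() stacks counts by default, the proportions are judged from the relative heights of the segments within each bar.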
The results show that LaGuardia Airport (LGA) has the highest on-time departure rate at 72.8%, followed by JFK (69.4%) and EWR (63.7%).
The visualization supports this finding, as LGA has the largest proportion of on-time flights (teal) and the smallest proportion of delayed flights (red), while EWR shows the opposite pattern.
Therefore, if selecting an airport based solely on on-time departure performance, LGA would be the best choice.
The consistency between the numerical summary and the visual representation strengthens the conclusion that LGA has the best on-time performance.

```{r}
nycflights <- nycflights |>
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))

nycflights |>
  group_by(origin) |>
  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) |>
  arrange(desc(ot_dep_rate))

ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
  geom_bar()

```
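
If the goal is to read proportions directly from the chart, one option is position = "fill", which rescales every bar to the same height. The sketch below uses a small made-up data frame (toy_flights) in place of nycflights so that it runs on its own; the delay values are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data standing in for nycflights (values are made up).
toy_flights <- data.frame(
  origin    = c("EWR", "EWR", "EWR", "EWR", "JFK", "JFK",
                "LGA", "LGA", "LGA", "LGA"),
  dep_delay = c(-3, 12, 40, 1, -5, 30, 0, 2, -1, 25)
)
toy_flights$dep_type <- ifelse(toy_flights$dep_delay < 5, "on time", "delayed")

# position = "fill" rescales each bar to height 1, so the segments
# show proportions instead of counts.
p_fill <- ggplot(data = toy_flights, aes(x = origin, fill = dep_type)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion of flights")

# On-time share per airport, computed directly for comparison with the plot.
ot_rate <- sapply(split(toy_flights$dep_type == "on time", toy_flights$origin), mean)
ot_rate
```

With the real data, swapping toy_flights for nycflights gives bars whose on-time segments can be compared across airports even though the airports handle different numbers of flights.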

### Exercise 7

A new variable, avg_speed, was created by dividing distance (miles) by flight time in hours, converting air_time from minutes to hours.

```{r}


nycflights <- nycflights %>%
  mutate(avg_speed = distance / (air_time / 60))

summary(nycflights$avg_speed)

```
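
As a quick check on the unit conversion, here is the same calculation for a single hypothetical flight (the distance and air time are made up):

```r
# Hypothetical flight: 2,400 miles flown in 360 minutes of air time.
distance <- 2400   # miles
air_time <- 360    # minutes

hours     <- air_time / 60     # 6 hours
avg_speed <- distance / hours  # 400 miles per hour
avg_speed
```

Dividing by air_time without the conversion would give miles per minute instead of miles per hour.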

### Exercise 8

The scatterplot shows a positive relationship between distance and average speed. Flights covering longer distances tend to have higher average speeds.
This pattern likely occurs because longer flights spend more time at cruising altitude, where planes travel more efficiently and at higher speeds. In contrast, shorter flights spend more time taking off and landing, which reduces their average speed.

```{r}

ggplot(nycflights, aes(x = distance, y = avg_speed)) +
  geom_point(alpha = 0.3)


ggplot(nycflights, aes(x = distance, y = avg_speed)) +
  geom_point(alpha = 0.3) +
  geom_smooth()


```

### Exercise 9

The scatterplot of departure delay versus arrival delay for American (AA), Delta (DL), and United (UA) shows a strong positive relationship: flights that depart later tend to arrive later.
However, there are many observations where flights with small departure delays still arrive on time or early (negative arrival delays). This suggests that airlines are often able to make up some lost time during the flight.
Based on the plot, the approximate cutoff point appears to be around 30–60 minutes of departure delay. Flights delayed less than this range often still arrive on time, while flights with delays greater than this are much more likely to arrive late.
Overall, this indicates that moderate departure delays can sometimes be recovered, but larger delays typically result in late arrivals.

```{r}

selected_flights <- nycflights %>%
  filter(carrier %in% c("AA", "DL", "UA"))


ggplot(selected_flights, aes(x = dep_delay, y = arr_delay, color = carrier)) +
  geom_point(alpha = 0.5)


```
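
The cutoff read from the plot could also be examined numerically by grouping flights into departure-delay bands and computing the share that still arrive on time. The sketch below is one way to do that; it uses a made-up data frame (toy_flights) rather than selected_flights, the band boundaries are assumptions chosen for illustration, and "on time" is taken to mean arr_delay <= 0.

```r
library(dplyr)

# Hypothetical toy data; in the lab this would be selected_flights.
toy_flights <- data.frame(
  dep_delay = c(-5, 0, 10, 20, 35, 45, 50, 70, 90, 120),
  arr_delay = c(-15, -10, -2, 5, -1, 20, 30, 60, 85, 110)
)

on_time_by_band <- toy_flights %>%
  # Assumed bands: (-Inf, 0], (0, 30], (30, 60], (60, Inf)
  mutate(delay_band = cut(dep_delay,
                          breaks = c(-Inf, 0, 30, 60, Inf),
                          labels = c("<= 0", "1-30", "31-60", "> 60"))) %>%
  group_by(delay_band) %>%
  summarize(
    n             = n(),
    share_on_time = mean(arr_delay <= 0)  # share arriving on time or early
  )

on_time_by_band
```

The band where share_on_time drops toward zero marks the point past which a departure delay is rarely recovered in the air.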

### Exercises 3.5.1

1. Faceting a continuous variable creates a panel for each unique value, which can result in a very large number of small plots. This is often not useful because the panels become too numerous, each panel may contain very few observations, and the plot becomes difficult to interpret. In practice, continuous variables should be binned or categorized before faceting.

2. Empty cells represent combinations of variables that do not exist in the dataset. For example, if no cars have a certain combination of drive type and number of cylinders, ggplot still creates the panel, but it appears empty because no observations fall into that category. These empty cells highlight missing combinations in the data, not errors.

3. First plot: creates one column of plots, with each row showing a different value of drive type, so the panels are stacked vertically. Second plot: creates one row of plots, with each column showing a different value of number of cylinders, so the panels are arranged horizontally.
The . means "no faceting in that direction": in drv ~ . there are no columns, only rows; in . ~ cyl there are no rows, only columns.

4. Advantages of faceting: Separates groups into clear, individual panels. Makes patterns easier to compare without overlap. Reduces clutter in dense plots.
Disadvantages of faceting: Takes up more space. Harder to compare values across panels directly. Can become overwhelming with many categories.
Advantages of using color: All data shown in one plot. Easier direct comparison across groups. More compact.
Disadvantages of color: Overplotting (points overlap). Hard to distinguish groups if many categories. Can be visually cluttered. 
With larger data sets, color can become cluttered, and faceting becomes more useful for clarity.

5. In facet_wrap(), the arguments nrow and ncol control the layout of the panels. The nrow argument specifies the number of rows, while ncol specifies the number of columns used to arrange the panels. In contrast, facet_grid() does not include nrow and ncol arguments because its layout is determined entirely by the variables used for faceting. The number of rows and columns is fixed based on the number of levels in each variable, so the user does not manually control the layout.

6. You should put the variable with more unique levels in the columns because screens are typically wider than they are tall. Putting more levels in columns uses horizontal space efficiently, prevents plots from becoming too tall and compressed, and makes the layout easier to read. 

```{r}
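# Plot behind question 2: faceting on drv and cyl leaves some panels empty
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)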

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = drv, y = cyl))

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

```
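
As a rough illustration of the binning suggested in answer 1, the sketch below cuts a continuous variable into a few groups before faceting. The choice of displ, the three bins, and the names mpg_binned, displ_bin, and p_binned are assumptions made only for this example.

```r
library(ggplot2)
library(dplyr)

# Bin the continuous variable displ into 3 groups with roughly equal counts.
# cut_number() comes from ggplot2; n = 3 is an arbitrary choice for this sketch.
mpg_binned <- mpg %>%
  mutate(displ_bin = cut_number(displ, n = 3))

# Facet on the binned variable: one panel per bin instead of one per unique value.
p_binned <- ggplot(data = mpg_binned) +
  geom_point(mapping = aes(x = cty, y = hwy)) +
  facet_wrap(~ displ_bin, nrow = 1)

p_binned
```

Faceting on displ directly would instead produce one panel for every distinct engine size in the data.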

### Exercises 3.7.1

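1. The default geom for stat_summary() is geom_pointrange(). To rewrite a plot using the geom instead of the stat, you specify stat = "summary" inside the geom and pass the same summary functions that the original plot used:  
geom_pointrange(mapping = aes(x = cut, y = depth), stat = "summary", fun.min = min, fun.max = max, fun = median)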

2. geom_col() creates bar plots using values that are already present in the dataset, meaning the heights of the bars are taken directly from a y variable. In contrast, geom_bar() calculates counts automatically (using stat = "count" by default) and does not require a y variable. Therefore, geom_col() is used when the data has already been summarized, while geom_bar() is used for raw data.

3. Most geoms and stats come in pairs, such as geom_bar() with stat_count(), geom_histogram() with stat_bin(), and geom_smooth() with stat_smooth(). These pairs work together such that the stat computes a transformation of the data, and the geom controls how the result is displayed. They share the same underlying structure and can often be interchanged by specifying the stat or geom argument explicitly.

4. stat_smooth() computes a smoothed trend line through the data and often includes a confidence interval around the estimate. Its behavior is controlled by parameters such as method (e.g., "lm" or "loess"), se (whether to display the confidence interval), and other arguments like span for controlling smoothness.

5. The issue is that geom_bar() computes prop within each group, and ggplot2 automatically forms groups from the discrete variables in the plot, here cut (and color when it is mapped to fill). Each bar is therefore its own group, its proportion is 1, and all bars come out the same height. In the first plot, setting group = 1 overrides this default grouping and ensures that the proportions are calculated relative to all observations. Without this specification, the graphs display incorrect proportions because each category is normalized separately rather than globally.

```{r}
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))

```
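
The sketch below shows one possible way to correct the two plots above. For the first plot, group = 1 is enough. For the filled plot, group = 1 would collapse the fill groups, so this sketch instead divides each count by the overall total with after_stat(count / sum(count)); that expression is a commonly used workaround and an assumption here, not part of the answers above. The names p_prop and p_fill are only for illustration.

```r
library(ggplot2)

# First plot: treat all rows as one group so prop is relative to the whole dataset.
p_prop <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

# Second plot: keep the fill groups, but normalize each count by the total count.
p_fill <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = color,
                         y = after_stat(count / sum(count))))

p_prop
p_fill
```

In both corrected plots the bar heights differ across cuts, and the heights sum to 1 over the whole plot.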

### Exercises 3.8.1

1. The main problem with this plot is overplotting, since many observations share the same values of cty and hwy, causing points to overlap and obscure the true number of observations. This makes it difficult to assess the density of the data. The plot can be improved by using jittering (e.g., geom_jitter()), adding transparency, or using geom_count() to better display the frequency of overlapping points.

2. The amount of jittering is controlled by the width and height parameters in geom_jitter(). The width argument determines the amount of horizontal movement, while the height argument controls vertical movement of points.

3. geom_jitter() reduces overplotting by adding small random variation to the position of points, making overlapping observations visible. In contrast, geom_count() represents overlapping observations by increasing the size of the points based on their frequency. While jittering spreads the data visually, geom_count() preserves the exact values but encodes frequency through point size.

4. The default position adjustment for geom_boxplot() is position = "dodge2", which places boxplots side-by-side when there are multiple groups. This allows for easier comparison between categories. 


```{r}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point()
```
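
The two remedies discussed in answers 1 to 3 can be sketched as follows. The jitter amounts and the alpha value are arbitrary choices for illustration, and p_jitter and p_count are hypothetical names.

```r
library(ggplot2)

# Jittering: add a small random offset so overlapping points become visible.
# width and height control the horizontal and vertical amount of noise.
p_jitter <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0.3, alpha = 0.5)

# Counting: keep exact positions and map the number of overlapping
# observations at each location to point size.
p_count <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()

p_jitter
p_count
```

Jittering keeps one point per observation but moves it slightly, while geom_count() draws one point per distinct (cty, hwy) pair.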

### Exercises 3.9.1

1. A stacked bar chart can be converted into a pie chart by using coord_polar(theta = "y"), which transforms the y-values into angular space and wraps the bar chart into a circle.

2. The labs() function is used to add and modify labels in a ggplot, including the title, subtitle, axis labels, legend titles, and caption. The helpers xlab(), ylab(), and ggtitle() each set individual labels, whereas labs() handles all of them in a single call. By improving labeling, labs() makes plots more interpretable and accessible to a broader audience.

3. coord_map() applies a true map projection and preserves geographic accuracy, but it is computationally slower. In contrast, coord_quickmap() provides a faster approximation of a map projection and works well for smaller geographic regions where distortion is minimal.

4. The plot shows a strong positive relationship between city (cty) and highway (hwy) mileage: vehicles with higher city MPG also tend to have higher highway MPG. Most points lie above the reference line, indicating that highway MPG is generally higher than city MPG.
The function coord_fixed() is important because it ensures equal scaling on both axes, making the reference line meaningful and allowing accurate visual comparison between the two variables.
The function geom_abline() adds a diagonal reference line (with slope = 1 and intercept = 0), representing where city MPG equals highway MPG. This helps highlight the difference between the two measures and makes it clear that most vehicles perform better on the highway than in the city.

```{r}
ggplot(diamonds, aes(x = "", fill = cut)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")


ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()
```
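
As a small sketch of answer 2, labs() can be added to the second plot above. The label wording is made up for illustration, and p_labeled is a hypothetical name.

```r
library(ggplot2)

# Same scatterplot as above, with a title, a subtitle, and axis labels set in one labs() call.
p_labeled <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline() +
  coord_fixed() +
  labs(
    title = "Highway vs. city fuel economy",
    subtitle = "The reference line marks equal city and highway mileage",
    x = "City fuel economy (mpg)",
    y = "Highway fuel economy (mpg)"
  )

p_labeled
```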

### Exercises 4.4

1. This code does not work because the name typed on the second line does not exactly match the name that was assigned. The object was created as my_variable, but the second line spells it with a dotless ı in place of the i. R requires object names to match character for character, so it treats the misspelled name as a different, undefined object, resulting in the "object not found" error.

2. The corrected commands are:

library(tidyverse)

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

filter(mpg, cyl == 8)

filter(diamonds, carat > 3)

The errors included a misspelled argument name (dota instead of data), a misspelled function name (fliter instead of filter), using = instead of == for comparison, and referencing the dataset diamond instead of diamonds.

3. Pressing Alt + Shift + K in RStudio opens the keyboard shortcuts help menu. This menu provides a list of useful shortcuts to improve efficiency when coding. The same menu can be accessed by navigating to Help → Keyboard Shortcuts Help in the RStudio toolbar.

```{r}

library(tidyverse)

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

filter(mpg, cyl == 8)

filter(diamonds, carat > 3)

```
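
To see why = and == are not interchangeable inside filter(), the sketch below captures the error from the incorrect call instead of letting it stop the script. Wrapping the call in tryCatch() and the names attempt and eight_cyl are choices made only for this illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# cyl = 8 is read as a named argument, not a comparison, so filter() raises an error.
attempt <- tryCatch(
  filter(mpg, cyl = 8),
  error = function(e) e
)
inherits(attempt, "error")

# cyl == 8 is a logical comparison, so filter() keeps only the matching rows.
eight_cyl <- filter(mpg, cyl == 8)
unique(eight_cyl$cyl)
```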

