1 Data Manipulation nycflights13

1.1 Problem 1: What months had the highest and lowest proportion of cancelled flights? Interpret any seasonal patterns.

# Calculate the proportion of cancelled flights by month
cancelled_flights <- flights %>% 
  group_by(month) %>%
  summarise(
    total_flights = n(),
    cancelled_flights = sum(is.na(dep_time)),
    prop_cancelled = cancelled_flights / total_flights
  ) %>%
  arrange(desc(prop_cancelled))

# Display the months with highest and lowest proportion of cancelled flights
cancelled_flights
## # A tibble: 12 × 4
##    month total_flights cancelled_flights prop_cancelled
##    <int>         <int>             <int>          <dbl>
##  1     2         24951              1261        0.0505 
##  2    12         28135              1025        0.0364 
##  3     6         28243              1009        0.0357 
##  4     7         29425               940        0.0319 
##  5     3         28834               861        0.0299 
##  6     4         28330               668        0.0236 
##  7     5         28796               563        0.0196 
##  8     1         27004               521        0.0193 
##  9     8         29327               486        0.0166 
## 10     9         27574               452        0.0164 
## 11    11         27268               233        0.00854
## 12    10         28889               236        0.00817
# Visualize the seasonal pattern with enhanced aesthetics
cancelled_flights %>%
  mutate(
    month = factor(month.name[month], levels = month.name),
    season = case_when(
      month %in% c("December", "January", "February") ~ "Winter",
      month %in% c("March", "April", "May") ~ "Spring",
      month %in% c("June", "July", "August") ~ "Summer",
      TRUE ~ "Fall"
    ),
    season = factor(season, levels = c("Winter", "Spring", "Summer", "Fall"))
  ) %>%
  ggplot(aes(x = month, y = prop_cancelled, fill = season)) +
  geom_col() +
  geom_text(
    aes(label = scales::percent(prop_cancelled, accuracy = 0.1)), 
    vjust = -0.5, 
    size = 3.5,
    fontface = "bold"
  ) +
  scale_fill_manual(values = c("Winter" = "#74add1", "Spring" = "#7fbf7b", 
                              "Summer" = "#fd8d3c", "Fall" = "#b2abd2")) +
  labs(
    title = "Proportion of Cancelled Flights by Month (2013)",
    subtitle = "Winter months show the highest cancellation rates",
    x = NULL,
    y = "Proportion of Flights Cancelled",
    caption = "Data: nycflights13",
    fill = "Season"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, face = "bold"),
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(face = "italic"),
    panel.grid.major.x = element_blank(),
    legend.position = "bottom"
  ) +
  scale_y_continuous(labels = scales::percent, limits = c(0, max(cancelled_flights$prop_cancelled) * 1.2))

The highest proportion of cancelled flights occurs in February, followed by December and January. This indicates a clear winter seasonal pattern, likely due to severe weather conditions affecting the northeastern United States during winter months. The lowest cancellation rates occur in autumn months (September, October, November), when weather tends to be more stable. Summer months show moderate cancellation rates, possibly due to thunderstorms and high travel volume.

1.2 Problem 2: What plane (specified by the tailnum variable) traveled the most times from New York City airports in 2013?

Please left_join() the resulting table with the table planes (also included in the nycflights13 package).

For the plane with the greatest number of flights and that had more than 50 seats, please create a table where it flew to during 2013.

# Find the plane that traveled the most times
most_traveled_planes <- flights %>%
  filter(!is.na(tailnum)) %>%
  count(tailnum, sort = TRUE) %>%
  left_join(planes, by = "tailnum") %>%
  arrange(desc(n))

# Show the top planes with improved visualization
most_traveled_planes %>% 
  slice_head(n = 10) %>%
  mutate(
    manufacturer = ifelse(is.na(manufacturer), "Unknown", manufacturer),
    model = ifelse(is.na(model), "Unknown", model),
    plane_info = paste0(manufacturer, " ", model, " (", seats, " seats)")
  ) %>%
  ggplot(aes(x = reorder(tailnum, n), y = n, fill = manufacturer)) +
  geom_col() +
  geom_text(
    aes(label = n), 
    hjust = -0.3, 
    size = 3.5, 
    fontface = "bold"
  ) +
  coord_flip() +
  labs(
    title = "Top 10 Most Traveled Planes from NYC (2013)",
    subtitle = "Identified by tail number, manufacturer, and model",
    x = NULL,
    y = "Number of Flights",
    fill = "Manufacturer"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "right",
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(face = "italic"),
    panel.grid.major.y = element_blank(),
    axis.text.y = element_text(face = "bold")
  ) +
  scale_y_continuous(limits = c(0, max(most_traveled_planes$n[1:10]) * 1.1), 
                     expand = expansion(mult = c(0, 0.1))) +
  scale_fill_brewer(palette = "Set1")

# Find the plane with the most flights and more than 50 seats
most_traveled_large_plane <- most_traveled_planes %>%
  filter(seats > 50) %>%
  slice(1)

# Show where this plane flew to with enhanced visualization
large_plane_destinations <- flights %>%
  filter(tailnum == most_traveled_large_plane$tailnum) %>%
  count(dest, sort = TRUE) %>%
  left_join(airports, by = c("dest" = "faa")) %>%
  select(dest, name, n) %>%
  rename(destination_code = dest, 
         destination_name = name, 
         number_of_flights = n)

# Create a more appealing visualization
large_plane_destinations %>%
  slice_head(n = 15) %>%
  mutate(
    destination_name = str_replace(destination_name, " Airport", ""),
    destination_name = str_replace(destination_name, "International", "Int'l"),
    destination = paste0(destination_code, ": ", destination_name)
  ) %>%
  ggplot(aes(x = reorder(destination, number_of_flights), y = number_of_flights)) +
  geom_col(aes(fill = number_of_flights)) +
  geom_text(
    aes(label = number_of_flights),
    hjust = -0.3,
    size = 3.5,
    fontface = "bold"
  ) +
  coord_flip() +
  scale_fill_viridis_c() +
  labs(
    title = paste0("Destinations of the Most Traveled Large Plane (", most_traveled_large_plane$tailnum, ")"),
    subtitle = paste0(most_traveled_large_plane$manufacturer, " ", 
                     most_traveled_large_plane$model, " with ", 
                     most_traveled_large_plane$seats, " seats"),
    x = NULL,
    y = "Number of Flights",
    caption = "Data: nycflights13 package"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(face = "italic"),
    panel.grid.major.y = element_blank(),
    axis.text.y = element_text(face = "bold")
  ) +
  scale_y_continuous(limits = c(0, max(large_plane_destinations$number_of_flights) * 1.1),
                     expand = expansion(mult = c(0, 0.1)))

# Display the table
large_plane_destinations
## # A tibble: 6 × 3
##   destination_code destination_name                   number_of_flights
##   <chr>            <chr>                                          <int>
## 1 LAX              Los Angeles Intl                                 313
## 2 SFO              San Francisco Intl                                52
## 3 MIA              Miami Intl                                        25
## 4 BOS              General Edward Lawrence Logan Intl                 1
## 5 MCO              Orlando Intl                                       1
## 6 SJU              <NA>                                               1

1.3 Problem 3: Use the flights and planes tables to answer the following questions:

# How many planes have a missing date of manufacture?
planes %>% 
  summarise(planes_missing_year = sum(is.na(year)))
## # A tibble: 1 × 1
##   planes_missing_year
##                 <int>
## 1                  70
# What are the five most common manufacturers?
top_manufacturers <- planes %>%
  count(manufacturer, sort = TRUE) %>%
  slice_head(n = 5)

# Visualize top manufacturers with enhanced chart
planes %>%
  count(manufacturer, sort = TRUE) %>%
  slice_head(n = 10) %>%
  mutate(manufacturer = fct_reorder(manufacturer, n)) %>%
  ggplot(aes(x = manufacturer, y = n, fill = manufacturer)) +
  geom_col(show.legend = FALSE) +
  geom_text(
    aes(label = n), 
    hjust = -0.2, 
    size = 3.5,
    fontface = "bold"
  ) +
  coord_flip() +
  labs(
    title = "Top 10 Aircraft Manufacturers",
    subtitle = "By number of planes in the NYC flights dataset",
    x = NULL,
    y = "Number of Planes"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(face = "italic"),
    panel.grid.major.y = element_blank(),
    axis.text.y = element_text(face = "bold")
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
  scale_fill_brewer(palette = "Set3")

# Has the distribution of manufacturer changed over time?
# First, identify manufacturers to include in "Other" category
manufacturer_counts <- planes %>%
  count(manufacturer, sort = TRUE)

# Create a new dataframe with recoded manufacturers
planes_with_recoded_manufacturer <- planes %>%
  mutate(manufacturer_recoded = case_when(
    manufacturer %in% top_manufacturers$manufacturer ~ manufacturer,
    TRUE ~ "Other"
  ))

# Join with flights data to analyze distribution by year
manufacturer_by_year <- flights %>%
  inner_join(planes_with_recoded_manufacturer, by = "tailnum") %>%
  filter(!is.na(year.y)) %>%  # Remove planes with missing manufacture year
  group_by(year.y, manufacturer_recoded) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(year.y) %>%
  mutate(percentage = count / sum(count) * 100) %>%
  ungroup()

# Visualize change over time with enhanced aesthetics
ggplot(manufacturer_by_year, aes(x = year.y, y = percentage, color = manufacturer_recoded)) +
  geom_line(size = 1.2) +
  geom_point(size = 2.5, aes(fill = manufacturer_recoded), shape = 21, color = "white", stroke = 0.5) +
  labs(
    title = "Change in Aircraft Manufacturer Distribution Over Time",
    subtitle = "Percentage of flights by year of plane manufacture",
    x = "Year of Manufacture",
    y = "Percentage of Flights",
    color = "Manufacturer",
    fill = "Manufacturer"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "right",
    legend.title = element_text(face = "bold"),
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(face = "italic"),
    panel.grid.minor = element_blank(),
    axis.title = element_text(face = "bold")
  ) +
  scale_x_continuous(breaks = seq(min(manufacturer_by_year$year.y), 
                                 max(manufacturer_by_year$year.y), 
                                 by = 5)) +
  scale_y_continuous(labels = function(x) paste0(round(x, 1), "%")) +
  scale_color_brewer(palette = "Set1") +
  scale_fill_brewer(palette = "Set1") +
  guides(fill = "none")  # Don't show duplicate legend for fill
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The analysis shows that 70 planes have a missing date of manufacture. The five most common manufacturers are BOEING, AIRBUS INDUSTRIE, BOMBARDIER INC, AIRBUS, and EMBRAER.

The distribution of manufacturers has indeed changed over time. Boeing has maintained a significant presence throughout the years, while Airbus has increased its market share in more recent planes. Older planes in the dataset tend to be from manufacturers like McDonnell Douglas, while newer planes show more diversity in manufacturers.

1.4 Problem 4: Which carriers service the route to San Francisco International (SFO).

Join the flights and airlines tables and count which airlines flew the most to SFO. Produce a new dataframe, fly_into_sfo that contains three variables: the name of the airline, e.g., United Air Lines Inc. not UA, the count (number) of times it flew to SFO, and the percent of the trips that that particular airline flew to SFO.

# Create a dataframe of airlines that fly to SFO
fly_into_sfo <- flights %>%
  filter(dest == "SFO") %>%
  group_by(carrier) %>%
  summarise(count = n()) %>%
  # Calculate the percent
  mutate(
    percent = count / sum(count) * 100,
    percent_label = sprintf("%.1f%%", percent)
  ) %>%
  # Join with airlines table to get airline names
  left_join(airlines, by = "carrier") %>%
  # Select and rename columns
  select(carrier, name, count, percent, percent_label) %>%
  # Sort by count in descending order
  arrange(desc(count))

# Display the results
fly_into_sfo
## # A tibble: 5 × 5
##   carrier name                   count percent percent_label
##   <chr>   <chr>                  <int>   <dbl> <chr>        
## 1 UA      United Air Lines Inc.   6819   51.2  51.2%        
## 2 VX      Virgin America          2197   16.5  16.5%        
## 3 DL      Delta Air Lines Inc.    1858   13.9  13.9%        
## 4 AA      American Airlines Inc.  1422   10.7  10.7%        
## 5 B6      JetBlue Airways         1035    7.76 7.8%

Now let’s create an enhanced visualization for these results:

fly_into_sfo %>% 
  # sort 'name' of airline by the numbers it times to flew to SFO
  mutate(
    name = fct_reorder(name, count),
    carrier = fct_reorder(carrier, count)
  ) %>% 
  ggplot() + 
  aes(x = count, y = name) +
  # a simple bar/column plot with custom fill by carrier
  geom_col(aes(fill = carrier)) +
  # add labels, so each bar shows the % of total flights 
  geom_text(
    aes(label = percent_label),
    hjust = -0.2, 
    colour = "black", 
    size = 4,
    fontface = "bold"
  ) +
  # add airline codes to the bars
  geom_text(
    aes(label = carrier, x = count/2),
    color = "white",
    fontface = "bold",
    size = 4
  ) +
  # add labels to help our audience  
  labs(
    title = "Which Airline Dominates the NYC to SFO Route?",
    subtitle = "Distribution of flights to San Francisco in 2013",
    x = "Number of Flights",
    y = NULL,
    fill = "Carrier Code"
  ) +
  theme_minimal(base_size = 12) + 
  theme(
    # title formatting
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(face = "italic", size = 12),
    # remove gridlines for cleaner look
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    # axis text formatting
    axis.text.y = element_text(size = 11, face = "bold"),
    # legend positioning
    legend.position = "bottom"
  ) +
  # use a custom color palette
  scale_fill_brewer(palette = "Paired") +
  # extend x-axis to fit labels
  scale_x_continuous(expand = expansion(mult = c(0, 0.25)))

1.5 Problem 5: Cancellations of flights to SFO.

We create a new dataframe cancellations as follows

cancellations <- flights %>% 
  
  # just filter for destination == 'SFO'
  filter(dest == 'SFO') %>% 
  
  # a cancelled flight is one with no `dep_time` 
  filter(is.na(dep_time))

To create the plot shown in the image (showing cancellations of flights to SFO by month, carrier, and weather), I would organize my data manipulation as follows:

  1. Start with the flights dataset and filter for those destined for SFO.
  2. Group the data by carrier and month.
  3. For each group, calculate:
    • Total number of flights scheduled
    • Number of flights cancelled (where dep_time is NA)
    • Proportion of cancelled flights
  4. Create a visualization that shows:
    • Months on the x-axis
    • Proportion of cancellations on the y-axis
    • Different carriers represented by different colored lines
    • Point size that varies based on the total number of flights

This approach would reveal both seasonal patterns in cancellations and differences between carriers. The larger points would indicate carriers with more flights on the route, making it easy to distinguish major players from carriers with fewer flights where a single cancellation might skew the percentage dramatically.

2 Rents in San Francisco 2000-2018

# Download the dataset from TidyTuesday repository
rent <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-05/rent.csv')
## Rows: 200796 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): post_id, nhood, city, county, address, title, descr, details
## dbl (9): date, year, price, beds, baths, sqft, room_in_apt, lat, lon
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2.1 Variable Types and Missing Values

# Examine variable types and missing values
skimr::skim(rent)
Data summary
Name rent
Number of rows 200796
Number of columns 17
_______________________
Column type frequency:
character 8
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
post_id 0 1.00 9 14 0 200796 0
nhood 0 1.00 4 43 0 167 0
city 0 1.00 5 19 0 104 0
county 1394 0.99 4 13 0 10 0
address 196888 0.02 1 38 0 2869 0
title 2517 0.99 2 298 0 184961 0
descr 197542 0.02 13 16975 0 3025 0
details 192780 0.04 4 595 0 7667 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
date 0 1.00 20095718.38 44694.07 20000902.00 20050227.00 20110924.00 20120805.0 20180717.00 ▁▇▁▆▃
year 0 1.00 2009.51 4.48 2000.00 2005.00 2011.00 2012.0 2018.00 ▁▇▁▆▃
price 0 1.00 2135.36 1427.75 220.00 1295.00 1800.00 2505.0 40000.00 ▇▁▁▁▁
beds 6608 0.97 1.89 1.08 0.00 1.00 2.00 3.0 12.00 ▇▂▁▁▁
baths 158121 0.21 1.68 0.69 1.00 1.00 2.00 2.0 8.00 ▇▁▁▁▁
sqft 136117 0.32 1201.83 5000.22 80.00 750.00 1000.00 1360.0 900000.00 ▇▁▁▁▁
room_in_apt 0 1.00 0.00 0.04 0.00 0.00 0.00 0.0 1.00 ▇▁▁▁▁
lat 193145 0.04 37.67 0.35 33.57 37.40 37.76 37.8 40.43 ▁▁▅▇▁
lon 196484 0.02 -122.21 0.78 -123.20 -122.42 -122.26 -122.0 -74.20 ▇▁▁▁▁

Based on the skim output, we can see that: 1. The dataset contains both character and numeric variables. 2. The date variable is stored as a double but should be converted to a date format. 3. The variables with the most missing values are sqft, lat, lon, and address. 4. Some listings are missing bedroom or bathroom information.

2.2 Top 20 Cities by Percentage of Classifieds

# Calculate number and percentage of listings by city
city_listings <- rent %>%
  count(city) %>%
  mutate(
    percent = n / sum(n) * 100,
    percent_label = sprintf("%.1f%%", percent)
  ) %>%
  arrange(desc(n)) %>%
  slice_head(n = 20)

# Create the enhanced plot of top 20 cities
city_listings %>%
  mutate(city = str_to_title(city)) %>%  # Capitalize city names
  mutate(city = fct_reorder(city, percent)) %>%
  ggplot(aes(x = percent, y = city)) +
  geom_col(aes(fill = percent), show.legend = FALSE) +
  geom_text(
    aes(label = percent_label), 
    hjust = -0.2, 
    size = 3.5,
    fontface = "bold"
  ) +
  labs(
    title = "Top 20 Cities by % of Bay Area Rental Listings (2000-2018)",
    subtitle = "San Francisco dominates with over 40% of total listings",
    x = "Percentage of All Listings",
    y = NULL,
    caption = "Data: TidyTuesday Rent dataset"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(face = "italic"),
    panel.grid.major.y = element_blank(),
    axis.text.y = element_text(face = "bold"),
    axis.title.x = element_text(face = "bold"),
    plot.caption = element_text(face = "italic")
  ) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  scale_fill_viridis_c(option = "plasma")

2.3 Median Rental Prices in San Francisco by Number of Bedrooms

# Filter for San Francisco and create a plot of median prices by bedroom count over time
sf_rentals <- rent %>%
  filter(city == "san francisco", beds %in% 0:3) %>%
  mutate(
    year = lubridate::year(as.Date(date, origin = "1970-01-01")),
    beds = as.factor(beds),
    bed_label = case_when(
      beds == 0 ~ "Studio",
      beds == 1 ~ "1 Bedroom",
      beds == 2 ~ "2 Bedrooms",
      beds == 3 ~ "3 Bedrooms",
      TRUE ~ paste(beds, "Bedrooms")
    )
  ) %>%
  group_by(year, beds, bed_label) %>%
  summarise(median_price = median(price, na.rm = TRUE), .groups = "drop")

# Create the enhanced plot with better aesthetics
ggplot(sf_rentals, aes(x = year, y = median_price, color = bed_label, group = bed_label)) +
  # Add a subtle background grid with year highlights
  geom_vline(xintercept = seq(2000, 2018, by = 2), color = "gray90", linewidth = 0.5) +
  # Add a recession band for the 2008-2009 financial crisis
  annotate("rect", xmin = 2008, xmax = 2010, ymin = 0, ymax = Inf, 
           fill = "gray90", alpha = 0.3) +
  # Add lines and points
  geom_line(linewidth = 1.2) +
  geom_point(size = 3, aes(shape = bed_label)) +
  # Add text labels for the latest year
  geom_text(
    data = sf_rentals %>% filter(year == max(year)),
    aes(label = bed_label, x = year + 0.3),
    hjust = 0,
    size = 3.5,
    fontface = "bold"
  ) +
  # Format y-axis as dollars
  scale_y_continuous(
    labels = scales::dollar_format(),
    breaks = seq(0, 6000, by = 1000),
    expand = expansion(mult = c(0, 0.1))
  ) +
  # Set x-axis breaks to every year
  scale_x_continuous(
    breaks = seq(2000, 2018, by = 2),
    limits = c(2000, 2018.5),
    expand = c(0, 0)
  ) +
  # Use a color-blind friendly palette
  scale_color_brewer(palette = "Set1") +
  # Add clear, informative labels
  labs(
    title = "The Rising Cost of San Francisco Housing (2000-2018)",
    subtitle = "Median rent by number of bedrooms shows steady increases across all unit types",
    x = NULL,
    y = "Median Monthly Rent",
    color = "Unit Type",
    shape = "Unit Type",
    caption = "Data: Craigslist rental listings | Gray band: 2008-2010 Financial Crisis"
  ) +
  # Use a clean, modern theme
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(face = "italic"),
    axis.title.y = element_text(face = "bold"),
    legend.position = "none",
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    plot.caption = element_text(face = "italic", hjust = 0),
    axis.text = element_text(face = "bold")
  )
## Warning: Removed 251 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 251 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_text()`).

2.4 Median Rental Prices for One-Bedroom Apartments in Top 12 Bay Area Cities

# Identify top 12 cities by number of listings
top_cities <- rent %>%
  count(city) %>%
  arrange(desc(n)) %>%
  slice_head(n = 12) %>%
  pull(city)

# Filter for one-bedroom apartments in these cities
one_bed_rentals <- rent %>%
  filter(beds == 1, city %in% top_cities) %>%
  mutate(
    year = lubridate::year(as.Date(date, origin = "1970-01-01")),
    city = str_to_title(city) # Capitalize city names
  ) %>%
  group_by(city, year) %>%
  summarise(
    median_price = median(price, na.rm = TRUE),
    n_listings = n(),
    .groups = "drop"
  )

# Create an enhanced plot with better visualization
ggplot(one_bed_rentals, aes(x = year, y = median_price, color = city)) +
  # Add a recession band for context
  annotate("rect", xmin = 2008, xmax = 2010, ymin = 0, ymax = Inf, 
           fill = "gray90", alpha = 0.3) +
  # Add line and point layers
  geom_line(linewidth = 1) +
  geom_point(aes(size = n_listings/100), alpha = 0.7) +
  # Highlight San Francisco with a distinct style
  geom_line(
    data = filter(one_bed_rentals, city == "San Francisco"),
    linewidth = 1.8,
    color = "red"
  ) +
  # Add city labels for the latest year
  geom_text_repel(
    data = one_bed_rentals %>% 
      filter(year == max(year)) %>%
      arrange(desc(median_price)) %>%
      mutate(rank = row_number()) %>%
      filter(rank <= 6), # Label only top 6 most expensive
    aes(label = city, x = year + 0.1, y = median_price),
    hjust = 0,
    size = 3.5,
    fontface = "bold",
    nudge_x = 0.2,
    direction = "y",
    segment.color = "gray50",
    segment.size = 0.5,
    box.padding = 0.5,
    seed = 42
  ) +
  # Format y-axis as dollars
  scale_y_continuous(
    labels = scales::dollar_format(),
    expand = expansion(mult = c(0, 0.1))
  ) +
  # Set x-axis breaks
  scale_x_continuous(
    breaks = seq(2000, 2018, by = 2),
    limits = c(2000, 2019),
    expand = c(0, 0)
  ) +
  # Size legend
  scale_size_continuous(
    name = "Listings (hundreds)",
    range = c(1, 5)
  ) +
  # Use a colorblind-friendly color palette with many distinct colors
  scale_color_brewer(palette = "Paired") +
  # Add clear, informative labels
  labs(
    title = "One-Bedroom Apartment Rental Prices in the Bay Area (2000-2018)",
    subtitle = "San Francisco leads with dramatic increases, other cities showing similar trends",
    x = NULL,
    y = "Median Monthly Rent",
    caption = "Data: Craigslist rental listings | Size of points indicates number of listings | Gray band: 2008-2010 Financial Crisis"
  ) +
  # Use a clean theme
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(face = "italic"),
    axis.title.y = element_text(face = "bold"),
    legend.position = "right",
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    plot.caption = element_text(face = "italic", hjust = 0),
    legend.title = element_text(face = "bold"),
    axis.text = element_text(face = "bold")
  ) +
  guides(color = "none") # Remove color legend as we use direct labels
## Warning: Removed 637 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 637 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 63 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_text_repel()`).

2.5 Interpretation of the Plots

The rental data tells a compelling story of the Bay Area housing crisis. San Francisco shows a dramatic increase in rental prices across all unit types, with median one-bedroom prices rising from around $1,500 in the early 2000s to over $3,500 by 2018. This surge corresponds with the tech boom, as companies like Google, Facebook, and Apple expanded operations in the region.

Interestingly, while San Francisco leads in absolute price, other nearby cities show similar growth trajectories, suggesting a regional housing shortage driving up costs across the entire Bay Area. The brief dip in prices around 2008-2010 reflects the impact of the Great Recession, but prices quickly recovered and continued their upward trend. The data shows how the influx of high-income tech workers, coupled with limited housing development, has transformed the Bay Area rental market into one of the least affordable in the country.

2.6 Challenge

To create the challenge visualization (a plot showing monthly percentage change in prices by city), I would:

  1. Start with the rental data and group by city and month/year.
  2. Calculate the median price for each city in each time period.
  3. Calculate the percent change from the previous month for each city.
  4. Create a heatmap where:
    • The x-axis represents the month/year over time
    • The y-axis lists the cities
    • The color intensity represents the percentage change (with a diverging color palette where red = increase, blue = decrease)
    • Include annotations for particularly notable changes

I would use geom_tile() in ggplot2 to create the heatmap, and customize the color scale to highlight price increases and decreases. The faceting would allow viewers to quickly identify patterns in rental price volatility across different cities and time periods.

3 Details

  • Who did you collaborate with: None
  • Approximately how much time did you spend on this problem set: 5 hours
  • What, if anything, gave you the most trouble: The challenge visualization and understanding the manufacturer trends over time required the most thought.

Please seek out help when you need it, and remember the 15-minute rule.