nycflights13# Calculate the proportion of cancelled flights by month
cancelled_flights <- flights %>%
group_by(month) %>%
summarise(
total_flights = n(),
cancelled_flights = sum(is.na(dep_time)),
prop_cancelled = cancelled_flights / total_flights
) %>%
arrange(desc(prop_cancelled))
# Display the months with highest and lowest proportion of cancelled flights
cancelled_flights## # A tibble: 12 × 4
## month total_flights cancelled_flights prop_cancelled
## <int> <int> <int> <dbl>
## 1 2 24951 1261 0.0505
## 2 12 28135 1025 0.0364
## 3 6 28243 1009 0.0357
## 4 7 29425 940 0.0319
## 5 3 28834 861 0.0299
## 6 4 28330 668 0.0236
## 7 5 28796 563 0.0196
## 8 1 27004 521 0.0193
## 9 8 29327 486 0.0166
## 10 9 27574 452 0.0164
## 11 11 27268 233 0.00854
## 12 10 28889 236 0.00817
# Visualize the seasonal pattern with enhanced aesthetics
cancelled_flights %>%
mutate(
month = factor(month.name[month], levels = month.name),
season = case_when(
month %in% c("December", "January", "February") ~ "Winter",
month %in% c("March", "April", "May") ~ "Spring",
month %in% c("June", "July", "August") ~ "Summer",
TRUE ~ "Fall"
),
season = factor(season, levels = c("Winter", "Spring", "Summer", "Fall"))
) %>%
ggplot(aes(x = month, y = prop_cancelled, fill = season)) +
geom_col() +
geom_text(
aes(label = scales::percent(prop_cancelled, accuracy = 0.1)),
vjust = -0.5,
size = 3.5,
fontface = "bold"
) +
scale_fill_manual(values = c("Winter" = "#74add1", "Spring" = "#7fbf7b",
"Summer" = "#fd8d3c", "Fall" = "#b2abd2")) +
labs(
title = "Proportion of Cancelled Flights by Month (2013)",
subtitle = "Winter months show the highest cancellation rates",
x = NULL,
y = "Proportion of Flights Cancelled",
caption = "Data: nycflights13",
fill = "Season"
) +
theme_minimal(base_size = 12) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, face = "bold"),
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(face = "italic"),
panel.grid.major.x = element_blank(),
legend.position = "bottom"
) +
scale_y_continuous(labels = scales::percent, limits = c(0, max(cancelled_flights$prop_cancelled) * 1.2))The highest proportion of cancelled flights occurs in February, followed by December and January. This indicates a clear winter seasonal pattern, likely due to severe weather conditions affecting the northeastern United States during winter months. The lowest cancellation rates occur in autumn months (September, October, November), when weather tends to be more stable. Summer months show moderate cancellation rates, possibly due to thunderstorms and high travel volume.
tailnum variable) traveled the most times
from New York City airports in 2013?Please left_join() the resulting table with the table
planes (also included in the nycflights13
package).
For the plane with the greatest number of flights and that had more than 50 seats, please create a table where it flew to during 2013.
# Find the plane that traveled the most times
most_traveled_planes <- flights %>%
filter(!is.na(tailnum)) %>%
count(tailnum, sort = TRUE) %>%
left_join(planes, by = "tailnum") %>%
arrange(desc(n))
# Show the top planes with improved visualization
most_traveled_planes %>%
slice_head(n = 10) %>%
mutate(
manufacturer = ifelse(is.na(manufacturer), "Unknown", manufacturer),
model = ifelse(is.na(model), "Unknown", model),
plane_info = paste0(manufacturer, " ", model, " (", seats, " seats)")
) %>%
ggplot(aes(x = reorder(tailnum, n), y = n, fill = manufacturer)) +
geom_col() +
geom_text(
aes(label = n),
hjust = -0.3,
size = 3.5,
fontface = "bold"
) +
coord_flip() +
labs(
title = "Top 10 Most Traveled Planes from NYC (2013)",
subtitle = "Identified by tail number, manufacturer, and model",
x = NULL,
y = "Number of Flights",
fill = "Manufacturer"
) +
theme_minimal(base_size = 12) +
theme(
legend.position = "right",
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(face = "italic"),
panel.grid.major.y = element_blank(),
axis.text.y = element_text(face = "bold")
) +
scale_y_continuous(limits = c(0, max(most_traveled_planes$n[1:10]) * 1.1),
expand = expansion(mult = c(0, 0.1))) +
scale_fill_brewer(palette = "Set1")# Find the plane with the most flights and more than 50 seats
most_traveled_large_plane <- most_traveled_planes %>%
filter(seats > 50) %>%
slice(1)
# Show where this plane flew to with enhanced visualization
large_plane_destinations <- flights %>%
filter(tailnum == most_traveled_large_plane$tailnum) %>%
count(dest, sort = TRUE) %>%
left_join(airports, by = c("dest" = "faa")) %>%
select(dest, name, n) %>%
rename(destination_code = dest,
destination_name = name,
number_of_flights = n)
# Create a more appealing visualization
large_plane_destinations %>%
slice_head(n = 15) %>%
mutate(
destination_name = str_replace(destination_name, " Airport", ""),
destination_name = str_replace(destination_name, "International", "Int'l"),
destination = paste0(destination_code, ": ", destination_name)
) %>%
ggplot(aes(x = reorder(destination, number_of_flights), y = number_of_flights)) +
geom_col(aes(fill = number_of_flights)) +
geom_text(
aes(label = number_of_flights),
hjust = -0.3,
size = 3.5,
fontface = "bold"
) +
coord_flip() +
scale_fill_viridis_c() +
labs(
title = paste0("Destinations of the Most Traveled Large Plane (", most_traveled_large_plane$tailnum, ")"),
subtitle = paste0(most_traveled_large_plane$manufacturer, " ",
most_traveled_large_plane$model, " with ",
most_traveled_large_plane$seats, " seats"),
x = NULL,
y = "Number of Flights",
caption = "Data: nycflights13 package"
) +
theme_minimal(base_size = 12) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(face = "italic"),
panel.grid.major.y = element_blank(),
axis.text.y = element_text(face = "bold")
) +
scale_y_continuous(limits = c(0, max(large_plane_destinations$number_of_flights) * 1.1),
expand = expansion(mult = c(0, 0.1)))## # A tibble: 6 × 3
## destination_code destination_name number_of_flights
## <chr> <chr> <int>
## 1 LAX Los Angeles Intl 313
## 2 SFO San Francisco Intl 52
## 3 MIA Miami Intl 25
## 4 BOS General Edward Lawrence Logan Intl 1
## 5 MCO Orlando Intl 1
## 6 SJU <NA> 1
flights and planes tables to answer the
following questions:# How many planes have a missing date of manufacture?
planes %>%
summarise(planes_missing_year = sum(is.na(year)))## # A tibble: 1 × 1
## planes_missing_year
## <int>
## 1 70
# What are the five most common manufacturers?
top_manufacturers <- planes %>%
count(manufacturer, sort = TRUE) %>%
slice_head(n = 5)
# Visualize top manufacturers with enhanced chart
planes %>%
count(manufacturer, sort = TRUE) %>%
slice_head(n = 10) %>%
mutate(manufacturer = fct_reorder(manufacturer, n)) %>%
ggplot(aes(x = manufacturer, y = n, fill = manufacturer)) +
geom_col(show.legend = FALSE) +
geom_text(
aes(label = n),
hjust = -0.2,
size = 3.5,
fontface = "bold"
) +
coord_flip() +
labs(
title = "Top 10 Aircraft Manufacturers",
subtitle = "By number of planes in the NYC flights dataset",
x = NULL,
y = "Number of Planes"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(face = "italic"),
panel.grid.major.y = element_blank(),
axis.text.y = element_text(face = "bold")
) +
scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
scale_fill_brewer(palette = "Set3")# Has the distribution of manufacturer changed over time?
# First, identify manufacturers to include in "Other" category
manufacturer_counts <- planes %>%
count(manufacturer, sort = TRUE)
# Create a new dataframe with recoded manufacturers
planes_with_recoded_manufacturer <- planes %>%
mutate(manufacturer_recoded = case_when(
manufacturer %in% top_manufacturers$manufacturer ~ manufacturer,
TRUE ~ "Other"
))
# Join with flights data to analyze distribution by year
manufacturer_by_year <- flights %>%
inner_join(planes_with_recoded_manufacturer, by = "tailnum") %>%
filter(!is.na(year.y)) %>% # Remove planes with missing manufacture year
group_by(year.y, manufacturer_recoded) %>%
summarise(count = n(), .groups = "drop") %>%
group_by(year.y) %>%
mutate(percentage = count / sum(count) * 100) %>%
ungroup()
# Visualize change over time with enhanced aesthetics
ggplot(manufacturer_by_year, aes(x = year.y, y = percentage, color = manufacturer_recoded)) +
geom_line(size = 1.2) +
geom_point(size = 2.5, aes(fill = manufacturer_recoded), shape = 21, color = "white", stroke = 0.5) +
labs(
title = "Change in Aircraft Manufacturer Distribution Over Time",
subtitle = "Percentage of flights by year of plane manufacture",
x = "Year of Manufacture",
y = "Percentage of Flights",
color = "Manufacturer",
fill = "Manufacturer"
) +
theme_minimal(base_size = 12) +
theme(
legend.position = "right",
legend.title = element_text(face = "bold"),
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(face = "italic"),
panel.grid.minor = element_blank(),
axis.title = element_text(face = "bold")
) +
scale_x_continuous(breaks = seq(min(manufacturer_by_year$year.y),
max(manufacturer_by_year$year.y),
by = 5)) +
scale_y_continuous(labels = function(x) paste0(round(x, 1), "%")) +
scale_color_brewer(palette = "Set1") +
scale_fill_brewer(palette = "Set1") +
guides(fill = "none") # Don't show duplicate legend for fill## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The analysis shows that 70 planes have a missing date of manufacture. The five most common manufacturers are BOEING, AIRBUS INDUSTRIE, BOMBARDIER INC, AIRBUS, and EMBRAER.
The distribution of manufacturers has indeed changed over time. Boeing has maintained a significant presence throughout the years, while Airbus has increased its market share in more recent planes. Older planes in the dataset tend to be from manufacturers like McDonnell Douglas, while newer planes show more diversity in manufacturers.
Join the flights and airlines tables and
count which airlines flew the most to SFO. Produce a new dataframe,
fly_into_sfo that contains three variables: the
name of the airline, e.g.,
United Air Lines Inc. not UA, the count
(number) of times it flew to SFO, and the percent of the
trips that that particular airline flew to SFO.
# Create a dataframe of airlines that fly to SFO
fly_into_sfo <- flights %>%
filter(dest == "SFO") %>%
group_by(carrier) %>%
summarise(count = n()) %>%
# Calculate the percent
mutate(
percent = count / sum(count) * 100,
percent_label = sprintf("%.1f%%", percent)
) %>%
# Join with airlines table to get airline names
left_join(airlines, by = "carrier") %>%
# Select and rename columns
select(carrier, name, count, percent, percent_label) %>%
# Sort by count in descending order
arrange(desc(count))
# Display the results
fly_into_sfo## # A tibble: 5 × 5
## carrier name count percent percent_label
## <chr> <chr> <int> <dbl> <chr>
## 1 UA United Air Lines Inc. 6819 51.2 51.2%
## 2 VX Virgin America 2197 16.5 16.5%
## 3 DL Delta Air Lines Inc. 1858 13.9 13.9%
## 4 AA American Airlines Inc. 1422 10.7 10.7%
## 5 B6 JetBlue Airways 1035 7.76 7.8%
Now let’s create an enhanced visualization for these results:
fly_into_sfo %>%
# sort 'name' of airline by the numbers it times to flew to SFO
mutate(
name = fct_reorder(name, count),
carrier = fct_reorder(carrier, count)
) %>%
ggplot() +
aes(x = count, y = name) +
# a simple bar/column plot with custom fill by carrier
geom_col(aes(fill = carrier)) +
# add labels, so each bar shows the % of total flights
geom_text(
aes(label = percent_label),
hjust = -0.2,
colour = "black",
size = 4,
fontface = "bold"
) +
# add airline codes to the bars
geom_text(
aes(label = carrier, x = count/2),
color = "white",
fontface = "bold",
size = 4
) +
# add labels to help our audience
labs(
title = "Which Airline Dominates the NYC to SFO Route?",
subtitle = "Distribution of flights to San Francisco in 2013",
x = "Number of Flights",
y = NULL,
fill = "Carrier Code"
) +
theme_minimal(base_size = 12) +
theme(
# title formatting
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(face = "italic", size = 12),
# remove gridlines for cleaner look
panel.grid.major.y = element_blank(),
panel.grid.minor = element_blank(),
# axis text formatting
axis.text.y = element_text(size = 11, face = "bold"),
# legend positioning
legend.position = "bottom"
) +
# use a custom color palette
scale_fill_brewer(palette = "Paired") +
# extend x-axis to fit labels
scale_x_continuous(expand = expansion(mult = c(0, 0.25)))We create a new dataframe cancellations as follows
cancellations <- flights %>%
# just filter for destination == 'SFO'
filter(dest == 'SFO') %>%
# a cancelled flight is one with no `dep_time`
filter(is.na(dep_time))To create the plot shown in the image (showing cancellations of flights to SFO by month, carrier, and weather), I would organize my data manipulation as follows:
flights dataset and filter for those
destined for SFO.This approach would reveal both seasonal patterns in cancellations and differences between carriers. The larger points would indicate carriers with more flights on the route, making it easy to distinguish major players from carriers with fewer flights where a single cancellation might skew the percentage dramatically.
# Download the dataset from TidyTuesday repository
rent <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-05/rent.csv')## Rows: 200796 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): post_id, nhood, city, county, address, title, descr, details
## dbl (9): date, year, price, beds, baths, sqft, room_in_apt, lat, lon
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
| Name | rent |
| Number of rows | 200796 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| post_id | 0 | 1.00 | 9 | 14 | 0 | 200796 | 0 |
| nhood | 0 | 1.00 | 4 | 43 | 0 | 167 | 0 |
| city | 0 | 1.00 | 5 | 19 | 0 | 104 | 0 |
| county | 1394 | 0.99 | 4 | 13 | 0 | 10 | 0 |
| address | 196888 | 0.02 | 1 | 38 | 0 | 2869 | 0 |
| title | 2517 | 0.99 | 2 | 298 | 0 | 184961 | 0 |
| descr | 197542 | 0.02 | 13 | 16975 | 0 | 3025 | 0 |
| details | 192780 | 0.04 | 4 | 595 | 0 | 7667 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| date | 0 | 1.00 | 20095718.38 | 44694.07 | 20000902.00 | 20050227.00 | 20110924.00 | 20120805.0 | 20180717.00 | ▁▇▁▆▃ |
| year | 0 | 1.00 | 2009.51 | 4.48 | 2000.00 | 2005.00 | 2011.00 | 2012.0 | 2018.00 | ▁▇▁▆▃ |
| price | 0 | 1.00 | 2135.36 | 1427.75 | 220.00 | 1295.00 | 1800.00 | 2505.0 | 40000.00 | ▇▁▁▁▁ |
| beds | 6608 | 0.97 | 1.89 | 1.08 | 0.00 | 1.00 | 2.00 | 3.0 | 12.00 | ▇▂▁▁▁ |
| baths | 158121 | 0.21 | 1.68 | 0.69 | 1.00 | 1.00 | 2.00 | 2.0 | 8.00 | ▇▁▁▁▁ |
| sqft | 136117 | 0.32 | 1201.83 | 5000.22 | 80.00 | 750.00 | 1000.00 | 1360.0 | 900000.00 | ▇▁▁▁▁ |
| room_in_apt | 0 | 1.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.0 | 1.00 | ▇▁▁▁▁ |
| lat | 193145 | 0.04 | 37.67 | 0.35 | 33.57 | 37.40 | 37.76 | 37.8 | 40.43 | ▁▁▅▇▁ |
| lon | 196484 | 0.02 | -122.21 | 0.78 | -123.20 | -122.42 | -122.26 | -122.0 | -74.20 | ▇▁▁▁▁ |
Based on the skim output, we can see that: 1. The dataset contains
both character and numeric variables. 2. The date variable
is stored as a double but should be converted to a date format. 3. The
variables with the most missing values are sqft,
lat, lon, and address. 4. Some
listings are missing bedroom or bathroom information.
# Calculate number and percentage of listings by city
city_listings <- rent %>%
count(city) %>%
mutate(
percent = n / sum(n) * 100,
percent_label = sprintf("%.1f%%", percent)
) %>%
arrange(desc(n)) %>%
slice_head(n = 20)
# Create the enhanced plot of top 20 cities
city_listings %>%
mutate(city = str_to_title(city)) %>% # Capitalize city names
mutate(city = fct_reorder(city, percent)) %>%
ggplot(aes(x = percent, y = city)) +
geom_col(aes(fill = percent), show.legend = FALSE) +
geom_text(
aes(label = percent_label),
hjust = -0.2,
size = 3.5,
fontface = "bold"
) +
labs(
title = "Top 20 Cities by % of Bay Area Rental Listings (2000-2018)",
subtitle = "San Francisco dominates with over 40% of total listings",
x = "Percentage of All Listings",
y = NULL,
caption = "Data: TidyTuesday Rent dataset"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(face = "italic"),
panel.grid.major.y = element_blank(),
axis.text.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
plot.caption = element_text(face = "italic")
) +
scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
scale_fill_viridis_c(option = "plasma")# Filter for San Francisco and create a plot of median prices by bedroom count over time
sf_rentals <- rent %>%
filter(city == "san francisco", beds %in% 0:3) %>%
mutate(
year = lubridate::year(as.Date(date, origin = "1970-01-01")),
beds = as.factor(beds),
bed_label = case_when(
beds == 0 ~ "Studio",
beds == 1 ~ "1 Bedroom",
beds == 2 ~ "2 Bedrooms",
beds == 3 ~ "3 Bedrooms",
TRUE ~ paste(beds, "Bedrooms")
)
) %>%
group_by(year, beds, bed_label) %>%
summarise(median_price = median(price, na.rm = TRUE), .groups = "drop")
# Create the enhanced plot with better aesthetics
ggplot(sf_rentals, aes(x = year, y = median_price, color = bed_label, group = bed_label)) +
# Add a subtle background grid with year highlights
geom_vline(xintercept = seq(2000, 2018, by = 2), color = "gray90", linewidth = 0.5) +
# Add a recession band for the 2008-2009 financial crisis
annotate("rect", xmin = 2008, xmax = 2010, ymin = 0, ymax = Inf,
fill = "gray90", alpha = 0.3) +
# Add lines and points
geom_line(linewidth = 1.2) +
geom_point(size = 3, aes(shape = bed_label)) +
# Add text labels for the latest year
geom_text(
data = sf_rentals %>% filter(year == max(year)),
aes(label = bed_label, x = year + 0.3),
hjust = 0,
size = 3.5,
fontface = "bold"
) +
# Format y-axis as dollars
scale_y_continuous(
labels = scales::dollar_format(),
breaks = seq(0, 6000, by = 1000),
expand = expansion(mult = c(0, 0.1))
) +
# Set x-axis breaks to every year
scale_x_continuous(
breaks = seq(2000, 2018, by = 2),
limits = c(2000, 2018.5),
expand = c(0, 0)
) +
# Use a color-blind friendly palette
scale_color_brewer(palette = "Set1") +
# Add clear, informative labels
labs(
title = "The Rising Cost of San Francisco Housing (2000-2018)",
subtitle = "Median rent by number of bedrooms shows steady increases across all unit types",
x = NULL,
y = "Median Monthly Rent",
color = "Unit Type",
shape = "Unit Type",
caption = "Data: Craigslist rental listings | Gray band: 2008-2010 Financial Crisis"
) +
# Use a clean, modern theme
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(face = "italic"),
axis.title.y = element_text(face = "bold"),
legend.position = "none",
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
plot.caption = element_text(face = "italic", hjust = 0),
axis.text = element_text(face = "bold")
)## Warning: Removed 251 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 251 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_text()`).
# Identify top 12 cities by number of listings
top_cities <- rent %>%
count(city) %>%
arrange(desc(n)) %>%
slice_head(n = 12) %>%
pull(city)
# Filter for one-bedroom apartments in these cities
one_bed_rentals <- rent %>%
filter(beds == 1, city %in% top_cities) %>%
mutate(
year = lubridate::year(as.Date(date, origin = "1970-01-01")),
city = str_to_title(city) # Capitalize city names
) %>%
group_by(city, year) %>%
summarise(
median_price = median(price, na.rm = TRUE),
n_listings = n(),
.groups = "drop"
)
# Create an enhanced plot with better visualization
ggplot(one_bed_rentals, aes(x = year, y = median_price, color = city)) +
# Add a recession band for context
annotate("rect", xmin = 2008, xmax = 2010, ymin = 0, ymax = Inf,
fill = "gray90", alpha = 0.3) +
# Add line and point layers
geom_line(linewidth = 1) +
geom_point(aes(size = n_listings/100), alpha = 0.7) +
# Highlight San Francisco with a distinct style
geom_line(
data = filter(one_bed_rentals, city == "San Francisco"),
linewidth = 1.8,
color = "red"
) +
# Add city labels for the latest year
geom_text_repel(
data = one_bed_rentals %>%
filter(year == max(year)) %>%
arrange(desc(median_price)) %>%
mutate(rank = row_number()) %>%
filter(rank <= 6), # Label only top 6 most expensive
aes(label = city, x = year + 0.1, y = median_price),
hjust = 0,
size = 3.5,
fontface = "bold",
nudge_x = 0.2,
direction = "y",
segment.color = "gray50",
segment.size = 0.5,
box.padding = 0.5,
seed = 42
) +
# Format y-axis as dollars
scale_y_continuous(
labels = scales::dollar_format(),
expand = expansion(mult = c(0, 0.1))
) +
# Set x-axis breaks
scale_x_continuous(
breaks = seq(2000, 2018, by = 2),
limits = c(2000, 2019),
expand = c(0, 0)
) +
# Size legend
scale_size_continuous(
name = "Listings (hundreds)",
range = c(1, 5)
) +
# Use a colorblind-friendly color palette with many distinct colors
scale_color_brewer(palette = "Paired") +
# Add clear, informative labels
labs(
title = "One-Bedroom Apartment Rental Prices in the Bay Area (2000-2018)",
subtitle = "San Francisco leads with dramatic increases, other cities showing similar trends",
x = NULL,
y = "Median Monthly Rent",
caption = "Data: Craigslist rental listings | Size of points indicates number of listings | Gray band: 2008-2010 Financial Crisis"
) +
# Use a clean theme
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(face = "italic"),
axis.title.y = element_text(face = "bold"),
legend.position = "right",
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
plot.caption = element_text(face = "italic", hjust = 0),
legend.title = element_text(face = "bold"),
axis.text = element_text(face = "bold")
) +
guides(color = "none") # Remove color legend as we use direct labels## Warning: Removed 637 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 637 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 63 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_text_repel()`).
The rental data tells a compelling story of the Bay Area housing crisis. San Francisco shows a dramatic increase in rental prices across all unit types, with median one-bedroom prices rising from around $1,500 in the early 2000s to over $3,500 by 2018. This surge corresponds with the tech boom, as companies like Google, Facebook, and Apple expanded operations in the region.
Interestingly, while San Francisco leads in absolute price, other nearby cities show similar growth trajectories, suggesting a regional housing shortage driving up costs across the entire Bay Area. The brief dip in prices around 2008-2010 reflects the impact of the Great Recession, but prices quickly recovered and continued their upward trend. The data shows how the influx of high-income tech workers, coupled with limited housing development, has transformed the Bay Area rental market into one of the least affordable in the country.
To create the challenge visualization (a plot showing monthly percentage change in prices by city), I would:
I would use geom_tile() in ggplot2 to create the
heatmap, and customize the color scale to highlight price increases and
decreases. The faceting would allow viewers to quickly identify patterns
in rental price volatility across different cities and time periods.
Please seek out help when you need it, and remember the 15-minute rule.