Storm Events & Housing Values: INFO 201 Project Report

Author

Varnika Dokka · Ruby Xia · Jacob O’Connor

library(tidyverse)   # readr, dplyr, ggplot2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)   # dates (Intermediate tool)
library(DT)          # searchable tables
library(plotly)      # optional interactivity for charts

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

1 Introduction

Big picture question: How do typical home values (ZHVI) vary across U.S. places with different frequencies of storm events, and how have these patterns changed over time?

Why this matters: housing values relate to household wealth, insurance risks, and regional planning. Understanding whether areas with more frequent severe-weather reports look systematically different in typical prices can inform homeowners, insurers, and local governments. As background, Zillow’s ZHVI provides a model-based “typical” home value series that is widely used for regional comparisons and trend tracking (see Zillow Research ZHVI). On the weather side, NOAA’s NCEI compiles administrative storm reports from the National Weather Service (NWS), creating a long-running public record of event types, locations, and damage metrics (see the NCEI Storm Events Database).

What we aim to show: we do descriptive comparisons (not causal claims). We summarize storm exposure at the state level and compare distributions of metro-level home values across “lower / medium / higher” exposure groups. We also show time structure where appropriate (dates as an intermediate tool per INFO 201).

2 Analysis tied to Dataset 1: Storm Events (NOAA NCEI)

2.1 Questions

  • How many storm events are recorded by state in the selected year?
  • Which states appear at the top of the distribution (exposure proxy) for that year?

2.2 Storm Events Data

Collector/organization. NOAA National Centers for Environmental Information (compiled from NWS reports). Collection page. https://catalog.data.gov/dataset/ncdc-storm-events-database2

File used in this render (example year 1950). https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1950_c20250520.csv.gz

Population & sample. The population of interest is storm exposure across U.S. places. The sample is an administrative listing of reported storm events with timing, place, and impacts for the covered year. This render uses 1950. NCEI's documentation states that only tornado events were recorded from 1950 through 1954, so this file reflects tornado reports rather than the wider mix of hazards (e.g., hail, flood) covered in later years; definitions and reporting practices evolve over time.

This administrative listing offers nationwide, standardized event records suitable for simple exposure summaries by place.

storm <- read_csv(
  "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1950_c20250520.csv.gz",
  show_col_types = FALSE
)

tibble(rows = nrow(storm), columns = ncol(storm))
# A tibble: 1 × 2
   rows columns
  <int>   <int>
1   223      51

This file contains 223 rows and 51 columns and is suitable for simple state-level exposure summaries.

2.3 Data preparation

We parse event start times to datetimes so time-aware checks are possible (INFO 201 dates tool).

# suppressWarnings() hides parse failures, so check sum(is.na(storm$begin_dt)) after this step
storm <- storm |> mutate(begin_dt = suppressWarnings(mdy_hms(BEGIN_DATE_TIME)))
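
A quick way to check the parse, rather than silencing warnings, is to count how many timestamps come back as NA. The sketch below uses invented rows and assumes the details file also carries BEGIN_YEARMONTH, BEGIN_DAY, and BEGIN_TIME columns (an assumption, since they are not shown in this report). Building the datetime from those numeric parts avoids guessing at the text layout of BEGIN_DATE_TIME. The "UTC" time zone is a placeholder; the real events are reported in local time.

```r
library(dplyr)
library(lubridate)

# Toy rows standing in for the storm file; all values are invented.
# Assumption: BEGIN_YEARMONTH is YYYYMM and BEGIN_TIME is HHMM (24-hour clock).
toy_storm <- data.frame(
  BEGIN_YEARMONTH = c(195004, 195004, 195007),
  BEGIN_DAY       = c(28, 29, 5),
  BEGIN_TIME      = c(1445, 1530, 900)
)

toy_storm <- toy_storm |>
  mutate(
    begin_dt = make_datetime(
      year  = as.integer(BEGIN_YEARMONTH) %/% 100L,
      month = as.integer(BEGIN_YEARMONTH) %% 100L,
      day   = as.integer(BEGIN_DAY),
      hour  = as.integer(BEGIN_TIME) %/% 100L,
      min   = as.integer(BEGIN_TIME) %% 100L,
      tz    = "UTC"  # placeholder time zone for the sketch
    )
  )

# Count parse failures explicitly instead of suppressing warnings
n_failed <- sum(is.na(toy_storm$begin_dt))

# A simple time-aware summary: events per calendar month
month_counts <- toy_storm |> count(month = month(begin_dt))
```

If n_failed is greater than zero on the real file, the rows with missing begin_dt are the ones to inspect before any date-based filtering.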

We summarize state exposure as the count of recorded events per state in 1950 to create a simple exposure proxy for comparison later.

state_exposure <- storm |>
  count(STATE, name = "events_1950") |>
  arrange(desc(events_1950))

head(state_exposure, 10)
# A tibble: 10 × 2
   STATE          events_1950
   <chr>                <int>
 1 KANSAS                  33
 2 LOUISIANA               28
 3 OKLAHOMA                25
 4 TEXAS                   20
 5 MISSISSIPPI             16
 6 ARKANSAS                13
 7 ILLINOIS                11
 8 NORTH CAROLINA           9
 9 FLORIDA                  6
10 MISSOURI                 6

Searchable sample (transparency):

DT::datatable(
  storm |>
    select(STATE, CZ_NAME, EVENT_TYPE, begin_dt, INJURIES_DIRECT, DEATHS_DIRECT, DAMAGE_PROPERTY) |>
    head(100),
  options = list(pageLength = 10, scrollX = TRUE),
  filter = "top"
)

2.4 Results

#| fig.alt: Bar chart of top 10 states by number of recorded storm events in 1950; shows relative exposure only for that year.
p_states <- state_exposure |>
  slice_max(events_1950, n = 10) |>
  arrange(events_1950) |>
  ggplot(aes(x = events_1950, y = reorder(STATE, events_1950))) +
  geom_col() +
  labs(
    title = "Top 10 States by Recorded Storm Events (1950 file)",
    x = "Number of events (1950)", y = "State",
    caption = "Source: NOAA NCEI Storm Events (details-ftp_v1.0_d1950_c20250520.csv.gz)"
  )


plotly::ggplotly(p_states)

Interpretation: This offers a relative exposure ranking for 1950 only. Because reporting practices and hazard mixes change over time, this is a limited snapshot.

3 Analysis of Dataset 2: Metro Home Values (Zillow ZHVI)

3.1 Questions

  • What is the distribution of current (latest month) metro ZHVI values?
  • Do median values differ across storm-exposure groups derived from Dataset 1?

3.2 ZHVI Data

Collector/organization. Zillow Research. CSV (smoothed, SA, metro). https://files.zillowstatic.com/research/public_csvs/zhvi/Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv

Population & sample. Population: U.S. metro housing stock. Sample: model-based monthly typical value per metro (includes homes not for sale). Many series begin ~2000.

ZHVI provides a consistent, metro-level ‘typical value’ series that supports distributional comparisons across regions.

zillow <- read_csv(
  "https://files.zillowstatic.com/research/public_csvs/zhvi/Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv",
  show_col_types = FALSE
)

tibble(rows = nrow(zillow), columns = ncol(zillow))
# A tibble: 1 × 2
   rows columns
  <int>   <int>
1   895     318

This file contains 895 rows and 318 columns and supports consistent metro-level comparisons.

3.3 Data Preparation

We keep the latest month and rename it to zhvi for clearer labeling in plots (INFO 201 select() and rename(); R4DS “Data transformation”).

latest_col <- names(zillow)[ncol(zillow)]

# keep only columns that exist in this file
z_latest <- zillow |>
  select(RegionName, StateName, !!latest_col) |>
  rename(zhvi = !!latest_col)

summary(z_latest$zhvi)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  47226  181825  242259  286820  338468 1568567 
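
The latest-month step above relies on the last column being the newest date. A more explicit alternative, sketched here on invented values, is to reshape the wide monthly columns to long form and parse the column names as dates. The assumption (not verified in this report) is that the monthly columns are named like 2024-01-31.

```r
library(dplyr)
library(tidyr)

# Invented two-metro example with two monthly columns
toy_zillow <- data.frame(
  RegionName   = c("Metro A", "Metro B"),
  StateName    = c("KS", "LA"),
  `2024-01-31` = c(200000, 150000),
  `2024-02-29` = c(201000, 151500),
  check.names  = FALSE  # keep the date-like column names as they are
)

# One row per metro per month, with the month stored as a Date
z_long <- toy_zillow |>
  pivot_longer(
    cols      = matches("^[0-9]{4}-[0-9]{2}-[0-9]{2}$"),
    names_to  = "month",
    values_to = "zhvi"
  ) |>
  mutate(month = as.Date(month))

# The latest month is now chosen by date rather than by column position
z_latest_toy <- z_long |>
  filter(month == max(month))
```

The long form also makes it straightforward to plot ZHVI over time, which the wide layout does not support directly.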

We first check overall distribution. Because home values are typically right-skewed, we interpret center with the median.

p_hist <- ggplot(z_latest, aes(x = zhvi)) +
  geom_histogram(binwidth = 50000, boundary = 0, closed = "left") +
  labs(
    title = paste0("Distribution of Metro ZHVI — ", latest_col),
    x = "Typical home value (USD)",
    y = "Count",
    caption = "Source: Zillow Research ZHVI (metro, smoothed, seasonally adjusted)."
  )

plotly::ggplotly(p_hist)

Searchable table:

DT::datatable(z_latest, options = list(pageLength = 10, scrollX = TRUE), filter = "top")

3.4 Linking datasets (Join) and comparison

We compare metro ZHVI distributions across state storm-exposure groups. Exposure is state-level; ZHVI is metro-level. We join by state, then create tertiles (lower/medium/higher) of exposure and make a box plot (aligns with peer feedback suggesting a distributional comparison).

# A. exposure by STATE from the storm file (full state names)
state_exposure <- storm |>
  count(STATE, name = "events_1950")

# B. keep the Zillow columns you need (this file has StateName, not State)
latest_col <- names(zillow)[ncol(zillow)]
z_latest <- zillow |>
  select(RegionName, StateName, !!latest_col) |>
  rename(zhvi = !!latest_col)

# C. JOIN explicitly: StateName (Zillow) ↔ STATE (storm)
z_joined <- z_latest |>
  left_join(state_exposure, by = c("StateName" = "STATE")) |>
  mutate(
    exposure_group = ntile(replace_na(events_1950, 0), 3),
    exposure_group = factor(exposure_group,
                            levels = c(1,2,3),
                            labels = c("Lower exposure","Medium exposure","Higher exposure"))
  )

"StateName" %in% names(z_latest); "STATE" %in% names(state_exposure)
[1] TRUE
[1] TRUE
z_joined |> summarize(metros = n(), matched = sum(!is.na(events_1950)))
# A tibble: 1 × 2
  metros matched
   <int>   <int>
1    895       0
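
A matched count of zero usually means the join keys never line up. One possible cause, stated here as an assumption because the raw key values are not printed in this report, is that the Zillow StateName column holds two-letter postal codes while the storm STATE column holds upper-case full names. The sketch below uses invented rows and R's built-in state.name and state.abb vectors to put both sides on postal codes, then repeats the match check.

```r
library(dplyr)

# Invented rows; on the real data, inspect the keys with distinct() first.
toy_exposure <- data.frame(
  STATE       = c("KANSAS", "LOUISIANA", "OKLAHOMA"),
  events_1950 = c(33L, 28L, 25L)
)
toy_metros <- data.frame(
  RegionName = c("Wichita, KS", "New Orleans, LA", "Seattle, WA", "United States"),
  StateName  = c("KS", "LA", "WA", NA)
)

# Map upper-case full names to postal codes with base R's state.name / state.abb.
# These cover the 50 states only; DC and territories return NA and need manual handling.
toy_exposure <- toy_exposure |>
  mutate(state_abb = state.abb[match(STATE, toupper(state.name))])

# na_matches = "never" stops a missing key on one side from pairing
# with a missing key on the other side.
toy_joined <- toy_metros |>
  left_join(toy_exposure, by = c("StateName" = "state_abb"), na_matches = "never")

match_check <- toy_joined |>
  summarize(metros = n(), matched = sum(!is.na(events_1950)))

# Metro rows whose key has no partner in the exposure table
unmatched <- toy_metros |>
  anti_join(toy_exposure, by = c("StateName" = "state_abb"), na_matches = "never")
```

In this toy case two of the four metro rows match. An unmatched row can mean either that the state had no recorded events or that the key is still coded differently, so the unmatched table is worth reading before any missing counts are replaced with zero.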

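We use a box plot (as suggested in peer feedback) to compare the full distributions of metro ZHVI across the lower/medium/higher exposure groups rather than comparing only means. Note that the check above reports matched = 0: no metro row received a storm count, so replace_na() sets every count to 0 and ntile() splits the tied rows into three equal-sized groups by row order. Until the state keys are harmonized, the groups in the plot do not reflect storm exposure.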

p_box <- z_joined |>
  dplyr::filter(!is.na(exposure_group), !is.na(zhvi)) |>
  ggplot(aes(x = exposure_group, y = zhvi, group = exposure_group)) +  # explicit group
  geom_boxplot() +
  labs(
    title   = paste0("Metro ZHVI by State Storm Exposure Group (", latest_col, ")"),
    x       = "State storm exposure (tertiles from 1950 counts)",
    y       = "Metro ZHVI (USD)",
    caption = "Sources: NOAA NCEI Storm Events (1950 file) joined to Zillow Research ZHVI (latest month)."
  )

p_box

Box plot comparing metro ZHVI across lower, medium, and higher state storm-exposure tertiles; shows distributional differences in typical values.

Interpretation. If the join had matched, differing medians across groups would indicate an association between where more events were recorded in 1950 and typical metro values today. With zero matched rows in this render, the plot cannot support that reading. Even with a working join, exposure comes from a single (tornado-heavy) year and the geographies differ (state vs. metro), so the comparison is descriptive and non-causal.
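
The group medians referred to above can be tabulated directly alongside the box plot. This sketch uses invented values with the same exposure_group and zhvi column names; it is illustrative only and says nothing about the real data.

```r
library(dplyr)

# Invented metro rows, including one missing ZHVI value
toy_joined <- data.frame(
  exposure_group = factor(
    c("Lower exposure", "Lower exposure", "Medium exposure",
      "Medium exposure", "Higher exposure", "Higher exposure"),
    levels = c("Lower exposure", "Medium exposure", "Higher exposure")
  ),
  zhvi = c(300000, 260000, 240000, NA, 180000, 220000)
)

# Median and spread per group, after dropping rows that cannot be plotted
group_summary <- toy_joined |>
  filter(!is.na(exposure_group), !is.na(zhvi)) |>
  group_by(exposure_group) |>
  summarize(
    metros      = n(),
    median_zhvi = median(zhvi),
    iqr_zhvi    = IQR(zhvi),
    .groups     = "drop"
  )
```

Reporting the number of metros per group next to each median helps readers judge how much weight a difference in medians can carry.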

4 General Conclusions

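We set up a transparent descriptive linkage: state-level storm exposure (1950) → metro ZHVI distributions (latest). In this render the join matched no metros (Section 3.4), so the comparison remains a template until the state keys are aligned.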

A box plot yields a clearer distributional comparison than side-by-side means and reflects peer feedback.

Bias/uncertainty. One-year exposure likely underestimates multi-hazard patterns; state-to-metro join can introduce aggregation bias; ZHVI is model-based with coverage differences.

Future directions. Aggregate multi-year exposure (e.g., per-capita or severity-weighted), harmonize geographies (county/metro), and incorporate time windows with INFO 201 dates tools.
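
The multi-year aggregation mentioned above might look like the following sketch. The rows are invented; in practice each year's details file would be read and stacked first, and a YEAR column is assumed to be present. Tertiles are computed once per state here, before any join to metros, which is a design choice of this sketch rather than the method used in this report.

```r
library(dplyr)
library(tidyr)

# Invented event rows spanning two years
toy_events <- data.frame(
  STATE = c("KANSAS", "KANSAS", "KANSAS", "TEXAS", "TEXAS", "OHIO"),
  YEAR  = c(1950, 1950, 1951, 1950, 1951, 1951)
)

all_years <- sort(unique(toy_events$YEAR))

# Events per state per year; state-years with no events are filled with zero
# so that quiet years pull the average down instead of being ignored.
state_year <- toy_events |>
  count(STATE, YEAR, name = "events") |>
  complete(STATE, YEAR = all_years, fill = list(events = 0L))

# Average yearly count per state, then tertiles across states
multi_year <- state_year |>
  group_by(STATE) |>
  summarize(mean_events = mean(events), years = n(), .groups = "drop") |>
  mutate(exposure_group = ntile(mean_events, 3))
```

A per-capita or severity-weighted version would replace the raw count with a rate or a damage-based measure before averaging.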

5 Project Summary

Group Members: Varnika Dokka; Ruby Xia; Jacob O’Connor

Lab Section: Lab AC (Thu 12:30–1:30), Renée Singh

Public CSV links used:

  • Storm Events (example file used): https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1950_c20250520.csv.gz

  • Storm Events collection page: https://catalog.data.gov/dataset/ncdc-storm-events-database2

  • Zillow ZHVI (metro, smoothed, SA): https://files.zillowstatic.com/research/public_csvs/zhvi/Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv

6 Coding Notes

  • Joins (intermediate): Needed to attach state storm-exposure counts to metro rows to enable a distributional comparison across exposure groups; there isn’t a simpler way to align those geographies. Reference: https://r4ds.hadley.nz/joins.html.
  • Dates (intermediate): Parsed event timestamps with lubridate::mdy_hms() so time-aware checks/filters are possible. Reference: https://r4ds.hadley.nz/datetimes.html.
  • Plotly & DT (outside class tools): ggplotly() for light chart interactivity and DT::datatable() for searchable tables; learned from Plotly R “Getting Started” and RStudio DT docs.

7 Group Work Description

Ruby, Jacob, and I worked together for the duration of the project. The three of us chose a topic of interest during our lab section. I pitched the idea of focusing on storm events data and linking it to Zillow housing data, and our TA, Renée, helped us make that connection. We searched for data and settled on two datasets that I found and shared with the group. For the analysis, we worked together to figure out how to load the data, then split the work: Jacob and I mostly wrote the code for the Storm Events data, and Ruby wrote the code for the Zillow housing data, with all three of us pitching in where needed. Ruby also fixed errors in the report (bar charts, etc.), while Jacob and I recorded the presentation for our dashboard. We each drafted the text for our parts, merged the drafts into a full report, and double-checked it together before rendering. I am happy with our group and with how we assigned tasks and finished the report together.