Synopsis

In this assignment, our task is to explore the NOAA Storm Database (the events in the database start in year 1950 and end in November 2011) and provide data analysis that answers two questions related to severe weather events:

  1. Across the United States, which types of events are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

I use R and the tidyverse suite of packages (specifically dplyr and ggplot2) to process, transform, and analyze the data.

My analysis indicates that tornadoes are the most harmful event type with respect to population health, as measured by fatalities and injuries. Floods, by comparison, have the greatest economic consequences, as measured by property and crop damage.

Data Processing

Data Importing

The data were downloaded from the URL provided, and read.csv was then used to read the compressed CSV for analysis.

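library(tidyverse)  # attaches dplyr (glimpse, select, mutate) and ggplot2

url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
temp <- tempfile(fileext = ".bz2")
download.file(url, temp, mode = "wb")  # binary mode keeps the archive intact on Windows
df <- read.csv(bzfile(temp))
unlink(temp)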

Data Characteristics

glimpse(df)
## Rows: 902,297
## Columns: 37
## $ STATE__    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ BGN_DATE   <chr> "4/18/1950 0:00:00", "4/18/1950 0:00:00", "2/20/1951 0:00:0…
## $ BGN_TIME   <chr> "0130", "0145", "1600", "0900", "1500", "2000", "0100", "09…
## $ TIME_ZONE  <chr> "CST", "CST", "CST", "CST", "CST", "CST", "CST", "CST", "CS…
## $ COUNTY     <dbl> 97, 3, 57, 89, 43, 77, 9, 123, 125, 57, 43, 9, 73, 49, 107,…
## $ COUNTYNAME <chr> "MOBILE", "BALDWIN", "FAYETTE", "MADISON", "CULLMAN", "LAUD…
## $ STATE      <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",…
## $ EVTYPE     <chr> "TORNADO", "TORNADO", "TORNADO", "TORNADO", "TORNADO", "TOR…
## $ BGN_RANGE  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ BGN_AZI    <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ BGN_LOCATI <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ END_DATE   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ END_TIME   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ COUNTY_END <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ COUNTYENDN <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ END_RANGE  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ END_AZI    <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ END_LOCATI <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ LENGTH     <dbl> 14.0, 2.0, 0.1, 0.0, 0.0, 1.5, 1.5, 0.0, 3.3, 2.3, 1.3, 4.7…
## $ WIDTH      <dbl> 100, 150, 123, 100, 150, 177, 33, 33, 100, 100, 400, 400, 2…
## $ F          <int> 3, 2, 2, 2, 2, 2, 2, 1, 3, 3, 1, 1, 3, 3, 3, 4, 1, 1, 1, 1,…
## $ MAG        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FATALITIES <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 4, 0, 0, 0, 0,…
## $ INJURIES   <dbl> 15, 0, 2, 2, 2, 6, 1, 0, 14, 0, 3, 3, 26, 12, 6, 50, 2, 0, …
## $ PROPDMG    <dbl> 25.0, 2.5, 25.0, 2.5, 2.5, 2.5, 2.5, 2.5, 25.0, 25.0, 2.5, …
## $ PROPDMGEXP <chr> "K", "K", "K", "K", "K", "K", "K", "K", "K", "K", "M", "M",…
## $ CROPDMG    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ CROPDMGEXP <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ WFO        <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ STATEOFFIC <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ ZONENAMES  <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ LATITUDE   <dbl> 3040, 3042, 3340, 3458, 3412, 3450, 3405, 3255, 3334, 3336,…
## $ LONGITUDE  <dbl> 8812, 8755, 8742, 8626, 8642, 8748, 8631, 8558, 8740, 8738,…
## $ LATITUDE_E <dbl> 3051, 0, 0, 0, 0, 0, 0, 0, 3336, 3337, 3402, 3404, 0, 3432,…
## $ LONGITUDE_ <dbl> 8806, 0, 0, 0, 0, 0, 0, 0, 8738, 8737, 8644, 8640, 0, 8540,…
## $ REMARKS    <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ REFNUM     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …

Data Transformations

The dataset contains 37 columns and 902,297 rows. Not all columns are needed to answer the two questions, so the data were subset to the following columns:

  • EVTYPE: event type
  • FATALITIES: number of fatalities (used in population health analysis)
  • INJURIES: number of injuries (used in population health analysis)
  • PROPDMG: property damage estimate (used in economic analysis)
  • CROPDMG: crop damage estimate (used in economic analysis)
  • PROPDMGEXP: property damage estimate magnitude (used in economic analysis)
  • CROPDMGEXP: crop damage estimate magnitude (used in economic analysis)

df <- df %>%
  select(EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG, PROPDMGEXP, CROPDMGEXP)

To check the class of each column selected, I used sapply.

sapply(df, class)
##      EVTYPE  FATALITIES    INJURIES     PROPDMG     CROPDMG  PROPDMGEXP 
## "character"   "numeric"   "numeric"   "numeric"   "numeric" "character" 
##  CROPDMGEXP 
## "character"

Next, we’ll check if there are any NAs present in the dataset.

sum(is.na(df))
## [1] 0
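
Note that is.na() does not flag empty strings, and the glimpse() output above shows several character columns holding "". The following is a minimal sketch of counting both per column. It runs on a small made-up data frame (`toy` is hypothetical, not the storm data), so the counts it produces say nothing about the real dataset.

```r
# Hypothetical toy data standing in for a few of the selected columns
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL"),
  PROPDMG    = c(25, NA, 0),
  PROPDMGEXP = c("K", "", ""),
  stringsAsFactors = FALSE
)

# Number of NA values in each column
na_counts <- colSums(is.na(toy))

# Number of empty strings in each column (NA comparisons are ignored)
blank_counts <- vapply(
  toy,
  function(col) sum(col == "", na.rm = TRUE),
  integer(1)
)

na_counts
blank_counts
```

Replacing `toy` with `df` would apply the same two checks to the selected storm columns.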

Next, we’ll check the EVTYPE column for unique values. There are 985 unique values in this column. I will not revise its contents, since the focus is on which events are most harmful with respect to population health and the economy; each distinct label is therefore treated as its own event type in the analysis that follows.
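
The report does not show how the count of unique values was obtained; one way is `n_distinct(df$EVTYPE)` from dplyr. The sketch below uses a made-up vector (`toy_events` is hypothetical) to show how labels that differ only in case or spacing inflate such a count. The normalization step is an assumption for illustration and is not applied anywhere in this analysis.

```r
library(dplyr)

# Hypothetical event labels with inconsistent case and spacing
toy_events <- c("TSTM WIND", "Tstm Wind", " TSTM WIND", "FLOOD", "Flood")

# Count of distinct labels exactly as recorded
raw_n <- n_distinct(toy_events)

# Assumed normalization: trim surrounding whitespace, then upper-case
normalized <- toupper(trimws(toy_events))
clean_n <- n_distinct(normalized)

raw_n
clean_n
```

Comparing the two counts gives a rough sense of how much of the label variety is purely typographic.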

Finally, the PROPDMG and CROPDMG columns require additional transformation. Their values are stored without units; the magnitude of each value is given in the corresponding EXP column (PROPDMGEXP or CROPDMGEXP):

  • H means hundreds
  • K means thousands
  • M means millions
  • B means billions

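Each damage value is multiplied by the magnitude given by its exponent code; rows with any other code (including blanks) are assigned a damage of 0. The scaled property and crop damage columns are then summed into a new column called TOTALDMG.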

df <- df %>%
  mutate(
    PROPDMGTOTAL = case_when(
      PROPDMGEXP == "H" ~ PROPDMG * 10^2,
      PROPDMGEXP == "K" ~ PROPDMG * 10^3, 
      PROPDMGEXP == "M" ~ PROPDMG * 10^6,
      PROPDMGEXP == "B" ~ PROPDMG * 10^9,
      TRUE ~ 0 
    ),
    
    CROPDMGTOTAL = case_when(
      CROPDMGEXP == "H" ~ CROPDMG * 10^2,
      CROPDMGEXP == "K" ~ CROPDMG * 10^3,
      CROPDMGEXP == "M" ~ CROPDMG * 10^6,
      CROPDMGEXP == "B" ~ CROPDMG * 10^9,
      TRUE ~ 0
    ),
    
    TOTALDMG = PROPDMGTOTAL + CROPDMGTOTAL
  )
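
The conversion above matches the uppercase codes H, K, M, and B exactly. As an alternative sketch, the same mapping can be written once as a named lookup vector and reused for both damage columns. The helper `scale_damage` and the data frame `toy` are hypothetical names introduced for illustration. Upper-casing the code first is an assumption that differs from the case_when() calls above, which send a lowercase code to 0.

```r
library(dplyr)

# Multipliers for the documented exponent codes
exp_multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: scale a damage value by its exponent code.
# Codes are upper-cased first (an assumption); any code outside H/K/M/B,
# including "", yields 0, mirroring the TRUE ~ 0 fallback above.
scale_damage <- function(value, exp_code) {
  mult <- unname(exp_multiplier[toupper(exp_code)])
  ifelse(is.na(mult), 0, value * mult)
}

# Made-up rows covering each kind of code
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 3, 7),
  PROPDMGEXP = c("K", "m", "B", "", "?"),
  stringsAsFactors = FALSE
)

toy <- toy %>%
  mutate(PROPDMGTOTAL = scale_damage(PROPDMG, PROPDMGEXP))

toy
```

Dropping the toupper() call would reproduce the exact-match behavior of the original conversion.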

Results

Across the United States, which types of events are most harmful with respect to population health?

To determine which events are most harmful with respect to population health, we’ll use the FATALITIES and INJURIES columns.

df %>%
  group_by(EVTYPE) %>%
  summarize(total_fatalities = sum(FATALITIES, na.rm = TRUE)) %>%
  arrange(desc(total_fatalities)) %>%
  slice(1:10) %>%
  ggplot(aes(x = reorder(EVTYPE, total_fatalities), y = total_fatalities)) + 
  geom_col() +
  theme(axis.text.x = element_text(angle = 25, vjust = 1, hjust=1)) +
  labs(
    x = 'Event Type',
    y = "Total Fatalities",
    title = "Total Number of Fatalities for the top 10 Weather Events"
  )

We can conclude based on fatalities that tornadoes are most harmful with respect to population health in the US.

df %>%
  group_by(EVTYPE) %>%
  summarize(total_injuries = sum(INJURIES, na.rm = TRUE)) %>%
  arrange(desc(total_injuries)) %>%
  slice(1:10) %>%
  ggplot(aes(x = reorder(EVTYPE, total_injuries), y = total_injuries)) + 
  geom_col() +
  theme(axis.text.x = element_text(angle = 25, vjust = 1, hjust=1)) +
  labs(
    x = 'Event Type',
    y = "Total Injuries",
    title = "Total Number of Injuries for the top 10 Weather Events"
  )

The previous finding is supported when looking at injuries. Tornadoes once again pose the greatest threat to population health compared to other severe weather events.
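
The two rankings can also be placed side by side in a single table. The sketch below does this on made-up rows (`toy` is hypothetical, so the numbers are not results from the storm data). The `total_casualties` column weights a fatality and an injury equally, which is an assumption for illustration and not a measure used in this report.

```r
library(dplyr)

# Made-up rows standing in for the storm data
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HEAT", "HEAT"),
  FATALITIES = c(4, 1, 2, 3, 0),
  INJURIES   = c(50, 14, 1, 0, 5),
  stringsAsFactors = FALSE
)

health_summary <- toy %>%
  group_by(EVTYPE) %>%
  summarize(
    total_fatalities = sum(FATALITIES, na.rm = TRUE),
    total_injuries   = sum(INJURIES, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  # Assumed combined measure: fatalities and injuries counted equally
  mutate(total_casualties = total_fatalities + total_injuries) %>%
  arrange(desc(total_casualties))

health_summary
```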

Across the United States, which types of events have the greatest economic consequences?

df %>% 
  group_by(EVTYPE) %>%
  summarize(total_dmg = sum(TOTALDMG, na.rm = TRUE)) %>%
  arrange(desc(total_dmg)) %>%
  slice(1:10) %>%
  ggplot(aes(x = reorder(EVTYPE, total_dmg), y = total_dmg)) + 
  geom_col() +
  theme(axis.text.x = element_text(angle = 25, vjust = 1, hjust=1)) +
  labs(
    x = 'Event Type',
    y = "Damages ($)",
    title = "Property and Crop Damage for the top 10 Weather Events"
  )

When looking at property and crop damage, floods have the greatest economic consequences. Interestingly, tornadoes are also among the top 10 events.
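
Raw dollar totals of this size are hard to read, so a companion table expressed in billions of dollars may help. This is a minimal sketch on made-up values (`toy` and `damage_table` are hypothetical names, and the figures are not results from the storm data).

```r
library(dplyr)

# Made-up damage values standing in for TOTALDMG
toy <- data.frame(
  EVTYPE   = c("FLOOD", "FLOOD", "TORNADO", "HAIL"),
  TOTALDMG = c(5e9, 2.5e9, 1.5e9, 3e8),
  stringsAsFactors = FALSE
)

damage_table <- toy %>%
  group_by(EVTYPE) %>%
  summarize(total_dmg = sum(TOTALDMG, na.rm = TRUE), .groups = "drop") %>%
  # Express totals in billions of dollars for readability
  mutate(total_dmg_billions = total_dmg / 1e9) %>%
  arrange(desc(total_dmg)) %>%
  slice_head(n = 10)

damage_table
```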

Session Info for Reproducibility

sessionInfo()
## R version 4.4.3 (2025-02-28)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.4.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] lubridate_1.9.4 forcats_1.0.1   stringr_1.6.0   dplyr_1.1.4    
##  [5] purrr_1.2.0     readr_2.1.6     tidyr_1.3.1     tibble_3.3.0   
##  [9] ggplot2_4.0.1   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.4.3     tidyselect_1.2.1  
##  [5] jquerylib_0.1.4    scales_1.4.0       yaml_2.3.10        fastmap_1.2.0     
##  [9] R6_2.6.1           labeling_0.4.3     generics_0.1.4     knitr_1.50        
## [13] bslib_0.9.0        pillar_1.11.1      RColorBrewer_1.1-3 tzdb_0.5.0        
## [17] rlang_1.1.6        stringi_1.8.7      cachem_1.1.0       xfun_0.54         
## [21] sass_0.4.10        S7_0.2.1           timechange_0.3.0   cli_3.6.5         
## [25] withr_3.0.2        magrittr_2.0.4     digest_0.6.39      grid_4.4.3        
## [29] rstudioapi_0.17.1  hms_1.1.4          lifecycle_1.0.4    vctrs_0.6.5       
## [33] evaluate_1.0.5     glue_1.8.0         farver_2.1.2       rmarkdown_2.30    
## [37] tools_4.4.3        pkgconfig_2.0.3    htmltools_0.5.8.1