Exploratory Analysis of the NOAA Storm Database

Synopsis

This document presents a preliminary analysis of the NOAA Storm Database. The main purpose of this work is to identify the most harmful events in terms of population health and economic damage.

With respect to health impacts, the variable “FATALITIES” is chosen as a proxy for meaningful outcomes of an event. The “INJURIES” variable is not considered in this preliminary work as there is no clear method for comparing deaths with any amount of injuries.

Regarding economic damage, separate analysis for property and crop damages were executed. The reason for doing this is the different nature of the impact. While crop damage is almost exclusively related to business operations, property damage has direct impacts on people’s lives such as housing and city infrastructure.

Data Processing

These libraries are needed

library(tidyverse)
library(anytime)
library(lubridate)

#reading the file
dataCP2 <- read.csv("repdata_data_StormData.csv.bz2")
dataCP2 <- as_tibble(dataCP2)
head(dataCP2)

## # A tibble: 6 x 37
##   STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
##     <dbl> <chr>    <chr>    <chr>      <dbl> <chr>      <chr> <chr>      <dbl>
## 1       1 4/18/19… 0130     CST           97 MOBILE     AL    TORNA…         0
## 2       1 4/18/19… 0145     CST            3 BALDWIN    AL    TORNA…         0
## 3       1 2/20/19… 1600     CST           57 FAYETTE    AL    TORNA…         0
## 4       1 6/8/195… 0900     CST           89 MADISON    AL    TORNA…         0
## 5       1 11/15/1… 1500     CST           43 CULLMAN    AL    TORNA…         0
## 6       1 11/15/1… 2000     CST           77 LAUDERDALE AL    TORNA…         0
## # … with 28 more variables: BGN_AZI <chr>, BGN_LOCATI <chr>, END_DATE <chr>,
## #   END_TIME <chr>, COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>,
## #   END_AZI <chr>, END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <int>,
## #   MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>,
## #   PROPDMGEXP <chr>, CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>,
## #   STATEOFFIC <chr>, ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
## #   LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>

NOAA started recording all events type only from Jan. 1996. Therefore we need to filter out previous years in order to have a consistent comparison. We also removed non relevant variables from now on.

dataCP2$BGN_DATE <- anytime(dataCP2$BGN_DATE, useR = TRUE)
dataCP2 <- filter(dataCP2, year(BGN_DATE) >= 1996)
dataCP5 <- select(dataCP2,EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)

Results

muertes <- dataCP5$FATALITIES > 0
danosprop <- dataCP5$PROPDMG > 0
danosagro <- dataCP5$CROPDMG > 0
dataCP5$PROPDMGEXP <- gsub("K", 1000,dataCP5$PROPDMGEXP)
dataCP5$PROPDMGEXP <- gsub("M", 1000000,dataCP5$PROPDMGEXP)
dataCP5$PROPDMGEXP <- gsub("B", 1000000000,dataCP5$PROPDMGEXP)
dataCP5$CROPDMGEXP <- gsub("K", 1000,dataCP5$CROPDMGEXP)
dataCP5$CROPDMGEXP <- gsub("M", 1000000,dataCP5$CROPDMGEXP)
dataCP5$CROPDMGEXP <- gsub("B", 1000000000,dataCP5$CROPDMGEXP)
dataCPmuertes <- dataCP5[muertes,]
dataCPprop <- dataCP5[danosprop,]
dataCPagro <- dataCP5[danosagro,]

res_muertes <-  dataCPmuertes %>%
                group_by(EVTYPE) %>%
                summarize(tot_muertes = sum(FATALITIES))

## `summarise()` ungrouping output (override with `.groups` argument)

res_muertes <- arrange(res_muertes, desc(tot_muertes))
head(res_muertes)

## # A tibble: 6 x 2
##   EVTYPE         tot_muertes
##   <chr>                <dbl>
## 1 EXCESSIVE HEAT        1797
## 2 TORNADO               1511
## 3 FLASH FLOOD            885
## 4 LIGHTNING              651
## 5 FLOOD                  414
## 6 RIP CURRENT            339

dataCPprop <- mutate(dataCPprop, PROPTOT = PROPDMG * as.numeric(PROPDMGEXP))
res_prop <-     dataCPprop %>%
                group_by(EVTYPE) %>%
                summarize(tot_prop = sum(PROPTOT))

## `summarise()` ungrouping output (override with `.groups` argument)

res_prop <- arrange(res_prop, desc(tot_prop))
head(res_prop)

## # A tibble: 6 x 2
##   EVTYPE                tot_prop
##   <chr>                    <dbl>
## 1 FLOOD             143924318550
## 2 HURRICANE/TYPHOON  69305840000
## 3 STORM SURGE        43193536000
## 4 TORNADO            24616067710
## 5 FLASH FLOOD        15199298410
## 6 HAIL               14594440420

dataCPagro <- mutate(dataCPagro, CROPTOT = CROPDMG * as.numeric(CROPDMGEXP))
res_agro <-     dataCPagro %>%
                group_by(EVTYPE) %>%
                summarize(tot_agro = sum(CROPTOT))

## `summarise()` ungrouping output (override with `.groups` argument)

res_agro <- arrange(res_agro, desc(tot_agro))
head(res_agro)

## # A tibble: 6 x 2
##   EVTYPE               tot_agro
##   <chr>                   <dbl>
## 1 DROUGHT           13367566000
## 2 FLOOD              4974723400
## 3 HURRICANE          2741410000
## 4 HURRICANE/TYPHOON  2607872800
## 5 HAIL               2474934450
## 6 FLASH FLOOD        1333840700

Finally, we have that the top twenty entries in each impact analysis cover over 89% of the total fatalities, 98% of the crop damage and 99 % of the property damages.

sum(res_muertes$tot_muertes[1:20])/sum(res_muertes$tot_muertes)

## [1] 0.8912196

sum(res_agro$tot_agro[1:20])/sum(res_agro$tot_agro)

## [1] 0.9880562

sum(res_prop$tot_prop[1:20])/sum(res_prop$tot_prop)

## [1] 0.9925573

par(mar = c(7, 4, 2, 2) + 0.2)
barplot(res_prop$tot_prop[1:20],names.arg = res_prop$EVTYPE[1:20], cex.axis=.7, las=2, cex.names=.5, main = "Top 20 Property damages per Event type")

par(mar = c(7, 4, 2, 2) + 0.2)
barplot(res_agro$tot_agro[1:20],names.arg = res_agro$EVTYPE[1:20], cex.axis=.7, las=2, cex.names=.5, main = "Top 20 Crop damages per Event type")

par(mar = c(7, 4, 2, 2) + 0.2)
barplot(res_muertes$tot_muertes[1:20],names.arg = res_muertes$EVTYPE[1:20], cex.axis=.7, las=2, cex.names=.5, main = "Top 20 Fatalities per Event type")

Further refinements can combine event types in order to correct some typos in the database, but the analysis presented here give a sound preliminary answer to the man questions.