This document presents a preliminary analysis of the NOAA Storm Database. The main purpose of this work is to identify the most harmful events in terms of population health and economic damage.
With respect to health impacts, the variable “FATALITIES” is chosen as a proxy for meaningful outcomes of an event. The “INJURIES” variable is not considered in this preliminary work as there is no clear method for comparing deaths with any amount of injuries.
Regarding economic damage, separate analysis for property and crop damages were executed. The reason for doing this is the different nature of the impact. While crop damage is almost exclusively related to business operations, property damage has direct impacts on people’s lives such as housing and city infrastructure.
These libraries are needed
library(tidyverse)
library(anytime)
library(lubridate)
#reading the file
dataCP2 <- read.csv("repdata_data_StormData.csv.bz2")
dataCP2 <- as_tibble(dataCP2)
head(dataCP2)
## # A tibble: 6 x 37
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 1 4/18/19… 0130 CST 97 MOBILE AL TORNA… 0
## 2 1 4/18/19… 0145 CST 3 BALDWIN AL TORNA… 0
## 3 1 2/20/19… 1600 CST 57 FAYETTE AL TORNA… 0
## 4 1 6/8/195… 0900 CST 89 MADISON AL TORNA… 0
## 5 1 11/15/1… 1500 CST 43 CULLMAN AL TORNA… 0
## 6 1 11/15/1… 2000 CST 77 LAUDERDALE AL TORNA… 0
## # … with 28 more variables: BGN_AZI <chr>, BGN_LOCATI <chr>, END_DATE <chr>,
## # END_TIME <chr>, COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>,
## # END_AZI <chr>, END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <int>,
## # MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>,
## # PROPDMGEXP <chr>, CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>,
## # STATEOFFIC <chr>, ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
## # LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>
NOAA started recording all events type only from Jan. 1996. Therefore we need to filter out previous years in order to have a consistent comparison. We also removed non relevant variables from now on.
dataCP2$BGN_DATE <- anytime(dataCP2$BGN_DATE, useR = TRUE)
dataCP2 <- filter(dataCP2, year(BGN_DATE) >= 1996)
dataCP5 <- select(dataCP2,EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)
muertes <- dataCP5$FATALITIES > 0
danosprop <- dataCP5$PROPDMG > 0
danosagro <- dataCP5$CROPDMG > 0
dataCP5$PROPDMGEXP <- gsub("K", 1000,dataCP5$PROPDMGEXP)
dataCP5$PROPDMGEXP <- gsub("M", 1000000,dataCP5$PROPDMGEXP)
dataCP5$PROPDMGEXP <- gsub("B", 1000000000,dataCP5$PROPDMGEXP)
dataCP5$CROPDMGEXP <- gsub("K", 1000,dataCP5$CROPDMGEXP)
dataCP5$CROPDMGEXP <- gsub("M", 1000000,dataCP5$CROPDMGEXP)
dataCP5$CROPDMGEXP <- gsub("B", 1000000000,dataCP5$CROPDMGEXP)
dataCPmuertes <- dataCP5[muertes,]
dataCPprop <- dataCP5[danosprop,]
dataCPagro <- dataCP5[danosagro,]
res_muertes <- dataCPmuertes %>%
group_by(EVTYPE) %>%
summarize(tot_muertes = sum(FATALITIES))
## `summarise()` ungrouping output (override with `.groups` argument)
res_muertes <- arrange(res_muertes, desc(tot_muertes))
head(res_muertes)
## # A tibble: 6 x 2
## EVTYPE tot_muertes
## <chr> <dbl>
## 1 EXCESSIVE HEAT 1797
## 2 TORNADO 1511
## 3 FLASH FLOOD 885
## 4 LIGHTNING 651
## 5 FLOOD 414
## 6 RIP CURRENT 339
dataCPprop <- mutate(dataCPprop, PROPTOT = PROPDMG * as.numeric(PROPDMGEXP))
res_prop <- dataCPprop %>%
group_by(EVTYPE) %>%
summarize(tot_prop = sum(PROPTOT))
## `summarise()` ungrouping output (override with `.groups` argument)
res_prop <- arrange(res_prop, desc(tot_prop))
head(res_prop)
## # A tibble: 6 x 2
## EVTYPE tot_prop
## <chr> <dbl>
## 1 FLOOD 143924318550
## 2 HURRICANE/TYPHOON 69305840000
## 3 STORM SURGE 43193536000
## 4 TORNADO 24616067710
## 5 FLASH FLOOD 15199298410
## 6 HAIL 14594440420
dataCPagro <- mutate(dataCPagro, CROPTOT = CROPDMG * as.numeric(CROPDMGEXP))
res_agro <- dataCPagro %>%
group_by(EVTYPE) %>%
summarize(tot_agro = sum(CROPTOT))
## `summarise()` ungrouping output (override with `.groups` argument)
res_agro <- arrange(res_agro, desc(tot_agro))
head(res_agro)
## # A tibble: 6 x 2
## EVTYPE tot_agro
## <chr> <dbl>
## 1 DROUGHT 13367566000
## 2 FLOOD 4974723400
## 3 HURRICANE 2741410000
## 4 HURRICANE/TYPHOON 2607872800
## 5 HAIL 2474934450
## 6 FLASH FLOOD 1333840700
Finally, we have that the top twenty entries in each impact analysis cover over 89% of the total fatalities, 98% of the crop damage and 99 % of the property damages.
sum(res_muertes$tot_muertes[1:20])/sum(res_muertes$tot_muertes)
## [1] 0.8912196
sum(res_agro$tot_agro[1:20])/sum(res_agro$tot_agro)
## [1] 0.9880562
sum(res_prop$tot_prop[1:20])/sum(res_prop$tot_prop)
## [1] 0.9925573
par(mar = c(7, 4, 2, 2) + 0.2)
barplot(res_prop$tot_prop[1:20],names.arg = res_prop$EVTYPE[1:20], cex.axis=.7, las=2, cex.names=.5, main = "Top 20 Property damages per Event type")
par(mar = c(7, 4, 2, 2) + 0.2)
barplot(res_agro$tot_agro[1:20],names.arg = res_agro$EVTYPE[1:20], cex.axis=.7, las=2, cex.names=.5, main = "Top 20 Crop damages per Event type")
par(mar = c(7, 4, 2, 2) + 0.2)
barplot(res_muertes$tot_muertes[1:20],names.arg = res_muertes$EVTYPE[1:20], cex.axis=.7, las=2, cex.names=.5, main = "Top 20 Fatalities per Event type")
Further refinements can combine event types in order to correct some typos in the database, but the analysis presented here give a sound preliminary answer to the man questions.