Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

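This report explores the NOAA Storm Database and aims to answer two basic questions about the severe weather events that happened across the United States between 1950 and 2011:

  • Which types of events are most harmful to population health?
  • Which types of events have the greatest economic consequences?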

Data processing

The storm event data is provided by Coursera as a downloadable file. It contains the data collected for events that occurred from 1950 to the end of November 2011. The data originally comes from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database.

Loading

We first read the storm event data from the bzip2-compressed CSV file. No separate extraction step is needed, because read.csv() decompresses the file transparently.

data <- read.csv("repdata%2Fdata%2FStormData.csv.bz2")

The events data from 1950 to November 2011 has 37 variables and 902297 observations.

dim(data)
## [1] 902297     37
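
As a small illustration of why no explicit decompression step is needed, the sketch below writes a toy CSV through a bzip2 connection and reads it back with a plain read.csv() call. The file, its columns, and its values are made up for the example; only the behaviour of read.csv() on a compressed file is the point.

```r
# Toy data standing in for the storm file (hypothetical values).
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(3, 1))

# Write the CSV through a bzip2-compressing connection.
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path with file(), which detects the compression.
toy.back <- read.csv(path)
dim(toy.back)
```

The same call pattern applies to the real file, which is simply much larger and therefore slower to read.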

Missing data diagnosis

Missing values occur in only a few variables of the dataset (COUNTYENDN, F, LATITUDE, and LATITUDE_E). None of the variables needed by the analysis are affected, so data imputation is not required. The variables needed are:

  • event type
  • fatalities and injuries
  • property damages and crop damages

colSums(is.na(data))
##    STATE__   BGN_DATE   BGN_TIME  TIME_ZONE     COUNTY COUNTYNAME 
##          0          0          0          0          0          0 
##      STATE     EVTYPE  BGN_RANGE    BGN_AZI BGN_LOCATI   END_DATE 
##          0          0          0          0          0          0 
##   END_TIME COUNTY_END COUNTYENDN  END_RANGE    END_AZI END_LOCATI 
##          0          0     902297          0          0          0 
##     LENGTH      WIDTH          F        MAG FATALITIES   INJURIES 
##          0          0     843563          0          0          0 
##    PROPDMG PROPDMGEXP    CROPDMG CROPDMGEXP        WFO STATEOFFIC 
##          0          0          0          0          0          0 
##  ZONENAMES   LATITUDE  LONGITUDE LATITUDE_E LONGITUDE_    REMARKS 
##          0         47          0         40          0          0 
##     REFNUM 
##          0
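
With 37 columns, the affected variables are easy to overlook in the full listing. A minimal sketch, on a toy data frame with made-up values, of how the same count can be filtered down to the columns that actually contain missing values:

```r
# Toy data frame with missing values in two columns (hypothetical values).
toy <- data.frame(
    evtype     = c("tornado", "flood", "hail"),
    fatalities = c(0, 2, 1),
    f          = c(3, NA, NA),
    latitude   = c(3040, NA, 4125)
)

# Count missing values per column, then keep only the affected columns.
na.counts <- colSums(is.na(toy))
na.counts[na.counts > 0]
```

Here only f and latitude are reported, with 2 and 1 missing values respectively.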

Data cleaning

The event type variable contains many similar values, which must be normalized so that aggregations group them together. We also take the opportunity to convert all variable names to lowercase.

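library(dplyr)
names(data) <- tolower(names(data))  # lowercase all column names
# Strip separators and lowercase. Inside the brackets "&-/" forms a range, so
# characters between "&" and "/" (e.g. parentheses, commas) may be removed too.
data.tidy <- data %>% mutate(evtype=tolower(gsub("[/.&-// ]","",evtype)))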
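
To see what this normalization does in practice, the sketch below applies the same gsub() and tolower() calls to a handful of sample labels. The sample vector is chosen for illustration and is not an extract of the dataset.

```r
# A few raw labels of the kind found in the event type variable.
raw <- c("HURRICANE/TYPHOON", "TSTM WIND", "STORM SURGE/TIDE",
         "Heavy Rain/Severe Weather")

# Same transformation as in the cleaning step: strip separators, then lowercase.
clean <- tolower(gsub("[/.&-// ]", "", raw))
clean
```

The four labels become hurricanetyphoon, tstmwind, stormsurgetide, and heavyrainsevereweather. The step only removes separators and case differences; labels that differ in wording, such as tstmwind and thunderstormwind, remain separate groups.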

Results

This section answers both questions, namely which event types are most harmful to population health and which have the greatest economic consequences, based on the analysis of the prepared tidy data.

Events harmful to population health

The variables needed to analyse the impact of events on population health are selected first. Fatalities and injuries are then summed by event type, and their total is used as the overall measure of harm.

data.events.health <- data.tidy %>% select(evtype, fatalities, injuries)

data.events.health.frequency <- summarise(group_by(data.events.health, evtype),
                                          injuries=sum(injuries),
                                          fatalities=sum(fatalities),
                                          frequency=fatalities+injuries)
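
For readers who want to check the logic on a small scale, here is a minimal base R sketch of the same grouped sum. It uses aggregate() instead of dplyr (a swapped-in technique, not the report's code), and the toy records are made up.

```r
# Toy records (hypothetical values), several rows per event type.
toy <- data.frame(
    evtype     = c("tornado", "flood", "tornado", "heat", "flood"),
    fatalities = c(5, 1, 2, 4, 0),
    injuries   = c(40, 3, 10, 6, 2)
)

# Sum both outcome columns by event type.
totals <- aggregate(cbind(fatalities, injuries) ~ evtype, data = toy, FUN = sum)

# Combined harm per event type, computed from the summed columns.
totals$frequency <- totals$fatalities + totals$injuries
totals
```

In this toy case tornado totals 57 (7 fatalities plus 50 injuries), heat 10, and flood 6.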

The ten most harmful event types are obtained by sorting in descending order.

data.events.health.frequency.10 <- data.events.health.frequency %>%
    arrange(desc(frequency)) %>%
    slice(1:10)
data.events.health.frequency.10
## # A tibble: 10 × 4
##              evtype injuries fatalities frequency
##               <chr>    <dbl>      <dbl>     <dbl>
## 1           tornado    91346       5633     96979
## 2     excessiveheat     6525       1903      8428
## 3          tstmwind     6957        504      7461
## 4             flood     6789        470      7259
## 5         lightning     5230        817      6047
## 6              heat     2100        937      3037
## 7        flashflood     1777        978      2755
## 8          icestorm     1975         89      2064
## 9  thunderstormwind     1488        133      1621
## 10      winterstorm     1321        206      1527

Tornadoes are therefore the event type with the greatest impact on population health over the last six decades.

The following bar chart presents the same ranking graphically, on a logarithmic y-axis.

library(ggplot2)
library(scales)

data.events.health.frequency.10$evtype <- reorder(data.events.health.frequency.10$evtype, -data.events.health.frequency.10$frequency)

ggplot(data.events.health.frequency.10, aes(x=evtype,y=frequency)) + 
    geom_bar(stat="identity",fill="steelblue") +
    geom_text(aes(label = frequency), vjust=1.6, color="white", size=3.5) +
    scale_y_log10(labels = trans_format("log10", math_format(10^.x))) +
    theme(axis.text.x = element_text(angle=45,hjust=1,vjust=1.0)) +
    labs(title="Ten most harmful events from 1950 to 2011", x="event")

Events with greatest economic consequences

The variables needed to analyse the economic impact of events are selected first. The damage costs are then calculated and summed by event type.

Some preliminary work is required because the original dataset does not contain the final costs: each damage value is stored as a number together with an exponent code. First, let’s build a table of multipliers for the main exponent codes (blank, H, K, M, and B) used in both exponent variables.

library(data.table)
coefficientsDT <- data.table(x=c("","H","K","M","B"), y=c(1,100,1000,1e+06, 1e+09))
setkey(coefficientsDT)
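
To make the lookup behaviour concrete, here is a minimal base R sketch that uses a named vector in place of the data.table (a swapped-in technique, not the report's code). The helper name scale_for and the sample codes are hypothetical. Note how a code that is missing from the table yields NA.

```r
# Multipliers keyed by exponent code (same values as the table above).
multipliers <- c(H = 100, K = 1000, M = 1e6, B = 1e9)

# Hypothetical helper: map exponent codes to numeric multipliers.
scale_for <- function(code) {
    code <- toupper(as.character(code))
    # Codes not in the table map to NA.
    out <- unname(multipliers[code])
    # A blank code means the value is taken as is.
    out[code == ""] <- 1
    out
}

res <- scale_for(c("K", "m", "", "B", "+"))
res
```

With these inputs the function returns 1000, 1e6, 1, 1e9, and NA. The blank code is handled separately because an empty string cannot be used as a lookup name in a named vector.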

These coefficients are applied to the raw damage values to calculate the final costs. Exponent codes that are not in the table have no match in the lookup and yield NA.

data.events.damages <- data.tidy %>%
    select(evtype, propdmg, propdmgexp, cropdmg, cropdmgexp) %>%
    mutate(
            propdmg.cost=coefficientsDT[as.character(toupper(propdmgexp)),y]*propdmg,
            cropdmg.cost=coefficientsDT[as.character(toupper(cropdmgexp)),y]*cropdmg)

The costs can now be summed by event type.

data.events.damages.costs <- summarise(group_by(data.events.damages, evtype),
                                          propdmg.costs=sum(propdmg.cost),
                                          cropdmg.costs=sum(cropdmg.cost),
                                          costs=sum(propdmg.cost+cropdmg.cost))
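
One property of sum() is worth keeping in mind when reading these totals: a single NA in a group makes the whole group total NA unless na.rm = TRUE is passed. The sketch below shows the difference on made-up costs, using base R tapply() rather than dplyr. Whether dropping such records is appropriate for this analysis is a separate judgment.

```r
# Toy costs (hypothetical values); NA stands for a record without a usable cost.
toy <- data.frame(
    evtype = c("flood", "flood", "hurricane", "hurricane"),
    cost   = c(5e9, NA, 2e9, 1e9)
)

# Default sum(): one NA makes the whole group total NA.
strict <- tapply(toy$cost, toy$evtype, sum)

# na.rm = TRUE: NA records are dropped from the total instead.
lenient <- tapply(toy$cost, toy$evtype, sum, na.rm = TRUE)

strict
lenient
```

Here the strict total for flood is NA, while the lenient total is 5e9. Because arrange() places NA values last, an event type with an NA total cannot appear among the top rows of a descending sort.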

The ten event types with the greatest economic impact are obtained by sorting in descending order. Event types whose total is NA sort last and therefore do not appear in the ranking.

data.events.damages.costs.10 <- data.events.damages.costs %>%
    arrange(desc(costs)) %>%
    slice(1:10)
data.events.damages.costs.10
## # A tibble: 10 × 4
##                    evtype propdmg.costs cropdmg.costs       costs
##                     <chr>         <dbl>         <dbl>       <dbl>
## 1        hurricanetyphoon   69305840000    2607872800 71913712800
## 2              stormsurge   43323536000          5000 43323541000
## 3               hurricane   11868319010    2741910000 14610229010
## 4              riverflood    5118945500    5029459000 10148404500
## 5           tropicalstorm    7703890550     678346000  8382236550
## 6                wildfire    4765114000     295472800  5060586800
## 7          stormsurgetide    4641188000        850000  4642038000
## 8           hurricaneopal    3172846000      19000000  3191846000
## 9          wildforestfire    3001829500     106796830  3108626330
## 10 heavyrainsevereweather    2500000000             0  2500000000

In this ranking, hurricanes and typhoons are the event type with the greatest economic impact over the last six decades.

The following bar chart presents the same ranking graphically, on a logarithmic y-axis.

data.events.damages.costs.10$evtype <- reorder(
    data.events.damages.costs.10$evtype, -data.events.damages.costs.10$costs)

ggplot(data.events.damages.costs.10, aes(x=evtype,y=costs)) + 
    geom_bar(stat="identity",fill="steelblue") +
    geom_text(aes(label = costs), vjust=1.6, color="white", size=2.5) +
    scale_y_log10(labels = trans_format("log10", math_format(10^.x))) +
    theme(axis.text.x = element_text(angle=45,hjust=1,vjust=1.0)) +
    labs(title="Ten events with greatest economic consequences from 1950 to 2011", x="event")

Conclusion

In conclusion, the dataset indicates that tornadoes have had the greatest impact on population health and that, in the ranking above, hurricanes and typhoons have had the greatest economic impact over the last six decades.