Synopsis

The data on severe weather events shows that the event category with by far the highest impact on health is TORNADO, and the one with the highest economic loss is FLOOD. However, much of the data on economic loss is clearly missing, as most events do not include loss estimates. Moreover, the categorization is chaotic and, among other problems, there is a clear overlap between different categories.


Introduction

In order to process the data, we need the following libraries:

library(dplyr)

Data Processing

Downloading Data

The data has been manually downloaded from the URL https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2 as per the assignment instructions. read.csv() is used to read the downloaded data:

stormdata <- read.csv("repdata_data_StormData.csv", header = TRUE, sep = ",")

Note: A more reproducible way would be to download the data from the URL when generating the report. However, downloading from the given URL requires a login; a direct download from R fails with a “403 Forbidden” error. Thus, the data has been downloaded manually.
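If the URL were to become directly accessible, the download step could be folded into the report along the following lines. This is only a sketch: the helper name fetch_storm_data() and its fallback behaviour are assumptions, not part of the original analysis. It reuses a local copy when one exists and stops with a message when the download is refused.

```r
# Hypothetical helper; the name and the fallback behaviour are assumptions.
fetch_storm_data <- function(url, destfile) {
    if (!file.exists(destfile)) {
        ok <- tryCatch({
            download.file(url, destfile, mode = "wb", quiet = TRUE)
            TRUE
        }, error = function(e) FALSE, warning = function(w) FALSE)
        if (!ok) {
            unlink(destfile)  # drop any partial download
            stop("Automatic download failed; place the file at ",
                destfile, " manually.")
        }
    }
    destfile
}

# Usage (not run here): read.csv() can read the bzip2-compressed file directly.
# stormdata <- read.csv(fetch_storm_data(
#     "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#     "StormData.csv.bz2"))
```

Because an existing local file short-circuits the download, the manual workaround described above keeps working unchanged.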

Description & Justification For Data Transformations

We are only looking for events and their consequences in terms of health and economic impact. Since the dataset is huge, we subset it to include only the columns we need for further processing, namely EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP (2). After subsetting, the original stormdata dataset is removed from memory to limit memory usage.

stormdata_subset <- stormdata[c("EVTYPE", "FATALITIES", "INJURIES",
    "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
rm(stormdata)

Numbers for property damage and crop damage must be calculated, because the documentation (1) specifies:

> Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions. If additional precision is available, it may be provided in the narrative part of the entry.

The letters in the PROPDMGEXP and CROPDMGEXP columns are replaced by the corresponding numeric values; any other value is mapped to 0. We use a separate script, function_magnitude.R, to achieve this. The script looks like this:

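# Map a magnitude letter to its numeric multiplier.
# Any other value (blank, lower case, digits, symbols) falls through to 0.
numeric_value <- function(magnitude) {
    numeric_magnitude <- as.numeric(
        switch(
            as.character(magnitude),
            "K" = 10^3,  # thousands
            "M" = 10^6,  # millions
            "B" = 10^9,  # billions
            0)           # default: unrecognized magnitude
        )
    return(numeric_magnitude)
}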
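The same mapping can also be written as a vectorized lookup, which makes it easy to see which codes fall through to the default. The sketch below is an illustration under assumptions: the function name numeric_value_vec() and the toy codes vector are hypothetical and not part of the original script.

```r
# Hypothetical vectorized variant of numeric_value(); the name is an assumption.
numeric_value_vec <- function(magnitude) {
    lookup <- c(K = 10^3, M = 10^6, B = 10^9)
    result <- unname(lookup[as.character(magnitude)])
    result[is.na(result)] <- 0   # same default as the switch() version
    result
}

# Toy input, not taken from the storm data
codes <- c("K", "M", "B", "", "k", "5")
numeric_value_vec(codes)

# Tabulate the codes that silently fall through to the default of 0
table(codes[!codes %in% c("K", "M", "B")])
```

Running the same tabulation on the raw PROPDMGEXP and CROPDMGEXP columns, before they are overwritten, would show how many records are affected by the default.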

The numbers in the PROPDMG and CROPDMG columns are multiplied by the numbers in the PROPDMGEXP and CROPDMGEXP columns, respectively, and the results are placed in the PROPDMGRES and CROPDMGRES columns:

source("function_magnitude.R")
# vapply() returns a plain numeric vector instead of a list column
stormdata_subset$PROPDMGEXP <- vapply(stormdata_subset$PROPDMGEXP,
    numeric_value, numeric(1), USE.NAMES = FALSE)
stormdata_subset$PROPDMGRES <- stormdata_subset$PROPDMG *
    stormdata_subset$PROPDMGEXP
stormdata_subset$CROPDMGEXP <- vapply(stormdata_subset$CROPDMGEXP,
    numeric_value, numeric(1), USE.NAMES = FALSE)
stormdata_subset$CROPDMGRES <- stormdata_subset$CROPDMG *
    stormdata_subset$CROPDMGEXP

We add a HARMFULNESS column holding the total of fatalities and injuries, and an ECONLOSS column holding the total economic loss based on property damage and crop damage.

stormdata_subset$HARMFULNESS <- stormdata_subset$FATALITIES +
    stormdata_subset$INJURIES
stormdata_subset$ECONLOSS <- stormdata_subset$PROPDMGRES +
    stormdata_subset$CROPDMGRES
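To make the arithmetic concrete, the following sketch applies the same steps to a two-row toy data frame. The values are copied from the two records reported in the Analysis section as the single most harmful and single most costly events; the data frame toy itself is an assumption made for illustration.

```r
# Toy data frame built from the two extreme records shown in the Analysis section
toy <- data.frame(
    EVTYPE = c("TORNADO", "FLOOD"),
    FATALITIES = c(42, 0),
    INJURIES = c(1700, 0),
    PROPDMG = c(250, 115),
    PROPDMGEXP = c(1e6, 1e9),   # multipliers after replacing "M" and "B"
    CROPDMG = c(0, 32.5),
    CROPDMGEXP = c(0, 1e6)      # 0 is the default for a missing magnitude
)

# Same derived columns as in the report
toy$PROPDMGRES <- toy$PROPDMG * toy$PROPDMGEXP
toy$CROPDMGRES <- toy$CROPDMG * toy$CROPDMGEXP
toy$HARMFULNESS <- toy$FATALITIES + toy$INJURIES
toy$ECONLOSS <- toy$PROPDMGRES + toy$CROPDMGRES

toy[c("EVTYPE", "HARMFULNESS", "ECONLOSS")]
```

The tornado row yields 42 + 1700 = 1742 persons harmed, and the flood row yields 115 * 10^9 + 32.5 * 10^6 = 115032500000 dollars, matching the values printed for those records.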

Analysis

Types of events that are most harmful to population health

Consequences for health are set out in the table’s FATALITIES and INJURIES columns. Harmfulness is calculated for each event as the sum of these two columns and stored in the added HARMFULNESS column.

The single most harmful event registered:

stormdata_subset[which.max(stormdata_subset$HARMFULNESS), ]
##         EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 157885 TORNADO         42     1700     250      1e+06       0          0
##        PROPDMGRES CROPDMGRES HARMFULNESS ECONLOSS
## 157885    2.5e+08          0        1742  2.5e+08

The most harmful event category:

df_harmful_events <- stormdata_subset %>% group_by(EVTYPE) %>% summarise(sum=sum(HARMFULNESS)) %>% data.frame()
df_harmful_events[which.max(df_harmful_events$sum), ]
##      EVTYPE   sum
## 834 TORNADO 96979

There are a few problems with the data, most importantly the relatively chaotic categorization, where some categories overlap in various ways. However, since one event category stands out in terms of harm to health, no recategorization is necessary to find the category with by far the highest effect on health:

arrange(df_harmful_events, desc(sum))[1:30, ]
##                EVTYPE   sum
## 1             TORNADO 96979
## 2      EXCESSIVE HEAT  8428
## 3           TSTM WIND  7461
## 4               FLOOD  7259
## 5           LIGHTNING  6046
## 6                HEAT  3037
## 7         FLASH FLOOD  2755
## 8           ICE STORM  2064
## 9   THUNDERSTORM WIND  1621
## 10       WINTER STORM  1527
## 11          HIGH WIND  1385
## 12               HAIL  1376
## 13  HURRICANE/TYPHOON  1339
## 14         HEAVY SNOW  1148
## 15           WILDFIRE   986
## 16 THUNDERSTORM WINDS   972
## 17           BLIZZARD   906
## 18                FOG   796
## 19        RIP CURRENT   600
## 20   WILD/FOREST FIRE   557
## 21       RIP CURRENTS   501
## 22          HEAT WAVE   481
## 23         DUST STORM   462
## 24     WINTER WEATHER   431
## 25     TROPICAL STORM   398
## 26          AVALANCHE   394
## 27       EXTREME COLD   391
## 28        STRONG WIND   383
## 29          DENSE FOG   360
## 30         HEAVY RAIN   349

Deciding the second most harmful event category, and so on, is another matter, as we can see from the listing above. Several categories belong together, for instance TSTM WIND, THUNDERSTORM WIND, and THUNDERSTORM WINDS, and merging these into one category affects the end result.
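As a rough illustration of how merging could shift the ranking, the sketch below folds the three thunderstorm-wind spellings from the listing into one label and re-aggregates. The regular expression, the EVTYPE_CLEAN column, and the merged label are assumptions made for illustration, not a vetted recategorization; the totals are copied from the listing above. Base R aggregate() is used instead of dplyr only to keep the snippet free of dependencies.

```r
# Totals copied from the top-30 listing; only four rows are used here
harm <- data.frame(
    EVTYPE = c("EXCESSIVE HEAT", "TSTM WIND",
        "THUNDERSTORM WIND", "THUNDERSTORM WINDS"),
    total = c(8428, 7461, 1621, 972),
    stringsAsFactors = FALSE
)

# Hypothetical normalisation rule: map all three spellings to one label
harm$EVTYPE_CLEAN <- sub("^(TSTM|THUNDERSTORM) WINDS?$",
    "THUNDERSTORM WIND", harm$EVTYPE)

# Re-aggregate on the cleaned label and sort by total, largest first
merged <- aggregate(total ~ EVTYPE_CLEAN, data = harm, FUN = sum)
merged[order(-merged$total), ]
```

With these four rows, the merged thunderstorm-wind total is 10054, which exceeds the 8428 recorded for EXCESSIVE HEAT, so second place would change hands. Other overlapping labels, such as the heat-related ones, would need the same treatment before the ranking could be trusted.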

Types of events that have the greatest economic consequences

The consequences in economic terms are calculated from the PROPDMGRES and CROPDMGRES columns. Economic loss is calculated for each event as the sum of these two columns and stored in the added ECONLOSS column.

The single event with the highest economic loss:

stormdata_subset[which.max(stormdata_subset$ECONLOSS), ]
##        EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 605953  FLOOD          0        0     115      1e+09    32.5      1e+06
##        PROPDMGRES CROPDMGRES HARMFULNESS     ECONLOSS
## 605953   1.15e+11   32500000           0 115032500000

The event category with the highest economic loss:

df_econloss_events <- stormdata_subset %>% group_by(EVTYPE) %>% summarise(sum=sum(ECONLOSS)) %>% data.frame()
df_econloss_events[which.max(df_econloss_events$sum), ]
##     EVTYPE          sum
## 170  FLOOD 150319678250

Again, there are a few problems with the resultant data. At least half of the event categories have no economic loss recorded at all, as we can see from the summary of the per-category totals (the median is 0):

summary(df_econloss_events)
##                    EVTYPE         sum           
##     HIGH SURF ADVISORY:  1   Min.   :0.000e+00  
##   COASTAL FLOOD       :  1   1st Qu.:0.000e+00  
##   FLASH FLOOD         :  1   Median :0.000e+00  
##   LIGHTNING           :  1   Mean   :4.836e+08  
##   TSTM WIND           :  1   3rd Qu.:8.500e+04  
##   TSTM WIND (G45)     :  1   Max.   :1.503e+11  
##  (Other)              :979
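The summary above is computed on per-category totals, so it describes categories rather than individual events. A per-event share could be obtained as sketched below. The toy data frame is an assumption made for illustration; on the real data the same expressions would be applied to stormdata_subset. Note that a loss of 0 may mean either "no damage" or "no estimate entered", which the data alone cannot distinguish.

```r
# Toy data, not taken from the storm data
toy <- data.frame(
    EVTYPE = c("FLOOD", "FLOOD", "HAIL", "HAIL", "FOG"),
    ECONLOSS = c(1e9, 0, 0, 5e3, 0)
)

# Share of individual events with no recorded loss
share_events_zero <- mean(toy$ECONLOSS == 0)

# Share of categories whose total loss is zero
totals <- tapply(toy$ECONLOSS, toy$EVTYPE, sum)
share_categories_zero <- mean(totals == 0)

c(events = share_events_zero, categories = share_categories_zero)
```

In this toy example three of five events, but only one of three categories, have zero loss, which shows that the two shares can differ considerably.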

Then there is the chaotic categorization. However, since one event category stands out in terms of economic loss (as one does for the effect on health above), no recategorization is necessary to find the category with the highest loss:

arrange(df_econloss_events, desc(sum))[1:15, ]
##               EVTYPE          sum
## 1              FLOOD 150319678250
## 2  HURRICANE/TYPHOON  71913712800
## 3            TORNADO  57340613590
## 4        STORM SURGE  43323541000
## 5               HAIL  18752904170
## 6        FLASH FLOOD  17562128610
## 7            DROUGHT  15018672000
## 8          HURRICANE  14610229010
## 9        RIVER FLOOD  10148404500
## 10         ICE STORM   8967041310
## 11    TROPICAL STORM   8382236550
## 12      WINTER STORM   6715441250
## 13         HIGH WIND   5908617560
## 14          WILDFIRE   5060586800
## 15         TSTM WIND   5038935790
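A quick plausibility check on the listing: even if the two hurricane labels were merged, their combined total would stay below that of FLOOD. The sketch below only adds up figures copied from the listing above; which labels belong together is an assumption here, and categories outside the top 15 are not considered.

```r
# Totals copied from the top-15 listing
flood <- 150319678250
hurricane_typhoon <- 71913712800
hurricane <- 14610229010

# Hypothetical merge of the two hurricane labels
hurricane_merged <- hurricane_typhoon + hurricane
hurricane_merged

# TRUE if FLOOD still leads after the merge
flood > hurricane_merged
```

Merging the flood-related labels (FLOOD, FLASH FLOOD, RIVER FLOOD) would only widen the gap.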

Results

The most harmful category of events is by far TORNADO as seen in the bar chart of the 15 most harmful event categories.

par(mar=c(2,4,1,1))
df_top15_harmfulness <- arrange(df_harmful_events, desc(sum))[1:15, ]
barplot(df_top15_harmfulness$sum, legend.text=df_top15_harmfulness$EVTYPE,
    col=rainbow(15), ylab = "Number of persons harmed", 
    ylim = c(0,df_top15_harmfulness$sum[1]))

Figure 1: Harmful events by event category.

The event category causing the highest economic loss is FLOOD, as seen in the bar chart of the 15 event categories causing the highest economic loss.

par(mar=c(2,4,1,1))
df_top15_econloss <- arrange(df_econloss_events, desc(sum))[1:15, ]
barplot(df_top15_econloss$sum, legend.text=df_top15_econloss$EVTYPE,
    col=rainbow(15), ylab = "Loss in US dollars", 
    ylim = c(0,df_top15_econloss$sum[1]))

Figure 2: Economic loss by event category.

Listing all categories, we can clearly see that there is a lot of overlap between categories, which may affect the ranking of events from second place downwards. A recategorization is needed, in cooperation with a domain expert.

References

  1. National Weather Service Instruction 10-1605 (d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf). Downloaded January 26, 2018.

  2. NOAA Storm Data FAQ Page (d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf). Downloaded January 26, 2018.