Synopsis

This analysis takes the Storm Database provided by NOAA and evaluates which types of events are the most damaging in terms of population health (injuries and fatalities) and economic consequences (damage). Ranked by mean impact per event type, Tropical Storm Gordon leads in both categories. We also see that the data is not especially well curated. There doesn’t seem to be a fixed list of event types that can be recorded, which leads to many event categories that contain only a single event, either because of a misspelling or some other typographical detail.

Data Processing

We first read the data file, which has been downloaded to the working directory but is still compressed (bzip2). Since it is a large file and takes some time to read, we set cache=TRUE in the code chunk.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
data <- read.csv("repdata_data_StormData.csv.bz2")
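read.csv can be pointed at the compressed file because R detects the compression when it opens a file for reading. The following is a minimal sketch of that behaviour on a small made-up data frame written to a temporary file; the names toy, tmp and roundtrip are illustrative only and not part of the analysis.

```r
# Toy data standing in for the storm file (values are made up).
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(1, 0))

# Write the CSV through a bzip2 connection to a temporary file.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv opens the file in read mode, where the compression is detected
# automatically, so no explicit decompression step is needed.
roundtrip <- read.csv(tmp)
roundtrip
```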

Results

Harmfulness in terms of population health

The first question we want to answer is which type of event is the most harmful in terms of population health, meaning which type of event causes the most fatalities and/or injuries. To do that, we subset the data so that we can work with a much smaller dataset. For this analysis we need the event type (“EVTYPE”), the fatalities (“FATALITIES”) and the injuries (“INJURIES”). We then add a new variable to this set called “IMPACT”, defined as the number of injuries plus twice the number of fatalities, because fatalities should be weighted more heavily.

harmfuldata <- select(data,EVTYPE, FATALITIES, INJURIES)
harmfuldata <- mutate(harmfuldata, IMPACT = INJURIES + 2*FATALITIES)
harmfuldata <- mutate(harmfuldata, NEWTYPE = toupper(as.character(EVTYPE)))

Notice that we also created the variable NEWTYPE. This is because not all the levels of the factor EVTYPE are in uppercase, and since we want to summarize over that variable, consistent casing is important.
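To illustrate both steps, here is a minimal sketch on made-up toy data (the rows and the name toy are assumptions for illustration). It shows how the weighting enters IMPACT and how upper-casing merges labels that differ only in case.

```r
library(dplyr)

# Three made-up records; "Heat" and "HEAT" differ only in case.
toy <- data.frame(
  EVTYPE     = c("Heat", "HEAT", "Tornado"),
  FATALITIES = c(1, 2, 0),
  INJURIES   = c(3, 0, 5),
  stringsAsFactors = TRUE
)

toy <- toy %>%
  mutate(
    IMPACT  = INJURIES + 2 * FATALITIES,     # fatalities count double
    NEWTYPE = toupper(as.character(EVTYPE))  # normalise the case
  )

toy$IMPACT               # 5 4 5
n_distinct(toy$EVTYPE)   # 3 distinct labels before normalising
n_distinct(toy$NEWTYPE)  # 2 distinct labels afterwards
```

Upper-casing only removes case differences; labels that differ in spelling or spacing remain separate categories.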

Now, to estimate which kind of event is the most harmful, we calculate the minimum, the maximum and the mean value of IMPACT for each event type.

sum_harmful <- harmfuldata %>%
  group_by(NEWTYPE) %>%
  summarise(lower= min(IMPACT), upper=max(IMPACT), m = mean(IMPACT))

Now let’s select the 5 event types with the highest mean impact and arrange them in descending order:

top5 <- top_n(sum_harmful, 5, m)
top5 <- arrange(top5, -m)
top5
## # A tibble: 5 x 4
##   NEWTYPE                    lower upper     m
##   <chr>                      <dbl> <dbl> <dbl>
## 1 TROPICAL STORM GORDON         59    59    59
## 2 TORNADOES, TSTM WIND, HAIL    50    50    50
## 3 WILD FIRES                     0   156    39
## 4 COLD AND SNOW                 28    28    28
## 5 THUNDERSTORMW                 27    27    27

As we can see, the most harmful events by fatalities and injuries are:

- Tropical Storm Gordon
- Tornadoes, TSTM Wind, Hail
- Wild Fires
- Cold and Snow
- THUNDERSTORMW (apparently a truncated or misspelled thunderstorm entry)

It is worth noting that all of the above except the wild fires appear to be event types that occurred only once in the whole dataset: the minimum, the maximum and the mean value of the impact are all identical, which is what a single record would produce.

Nevertheless, let’s look at a quick plot of the data:

ggplot(data = top5, mapping = aes(x = NEWTYPE, y = m)) +
    geom_pointrange(mapping = aes(ymin = lower, ymax = upper))

This happens because EVTYPE is apparently a free-text value provided by the user. We tried to mitigate this by creating the variable NEWTYPE, but obviously we can’t account for every misspelling in the set.

To allow a better analysis, I would recommend building a drop-down menu into the input form, so that the possible event types are predefined and cannot be typed in freely by the user.
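The single-occurrence pattern can also be checked directly rather than inferred from equal minimum, maximum and mean values. The following is a minimal sketch on made-up toy data, assuming an extra count column n and a hypothetical cut-off of at least two records per event type; neither is part of the original analysis.

```r
library(dplyr)

# Made-up impacts: types "A" and "C" occur once, type "B" three times.
toy <- data.frame(
  NEWTYPE = c("A", "B", "B", "B", "C"),
  IMPACT  = c(59, 0, 156, 0, 27)
)

sum_toy <- toy %>%
  group_by(NEWTYPE) %>%
  summarise(
    n     = n(),          # number of records behind each summary row
    lower = min(IMPACT),
    upper = max(IMPACT),
    m     = mean(IMPACT)
  )

# Hypothetical threshold: keep only types recorded at least twice.
frequent <- filter(sum_toy, n >= 2)
frequent
```

With the count available, a ranking by mean can be restricted to event types that have enough records for the mean to be meaningful. The threshold itself is a judgment call.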

Consequences in terms of economy

The second question is which types of events have the greatest economic consequences.

As before, we subset our dataset so that it is no longer as large. We then calculate a variable called TOTALDMG, which is the sum of the property damage (“PROPDMG”) and the crop damage (“CROPDMG”), and again add the variable NEWTYPE, in which EVTYPE is converted to upper case (to account for at least some of the misspellings in the dataset).

consdata <- select(data, EVTYPE, PROPDMG, CROPDMG)
consdata <- mutate(consdata, TOTALDMG = PROPDMG+CROPDMG)
consdata <- mutate(consdata, NEWTYPE = toupper(as.character(EVTYPE)))
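The sum above adds the PROPDMG and CROPDMG figures as recorded. The Storm Data file also carries the columns PROPDMGEXP and CROPDMGEXP, which hold a magnitude code for each figure, so records with different codes are not on the same scale. The following is a minimal sketch on made-up toy data of how the figures could be scaled first, assuming that K, M and B stand for thousands, millions and billions and that any other code means a multiplier of 1. The helper exp_multiplier and the column TOTALDMG_USD are hypothetical names and not part of the original analysis.

```r
library(dplyr)

# Hypothetical helper: map a magnitude code to a numeric multiplier.
# Assumption: K/M/B = 1e3/1e6/1e9 (case-insensitive); anything else = 1.
exp_multiplier <- function(code) {
  code <- toupper(as.character(code))
  mult <- c(K = 1e3, M = 1e6, B = 1e9)[code]
  mult[is.na(mult)] <- 1  # unknown or empty codes fall back to 1
  unname(mult)
}

# Made-up records mixing different magnitude codes.
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 10),
  PROPDMGEXP = c("K", "M", ""),
  CROPDMG    = c(0, 1, 5),
  CROPDMGEXP = c("", "k", "K")
)

toy <- mutate(
  toy,
  TOTALDMG_USD = PROPDMG * exp_multiplier(PROPDMGEXP) +
                 CROPDMG * exp_multiplier(CROPDMGEXP)
)
toy$TOTALDMG_USD
```

In the real file the exponent columns contain further codes, so the fallback to 1 is a simplification that would need checking against the dataset documentation.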

We’re going to use the same technique as before and will group the data by the variable NEWTYPE and calculate the maximum, the minimum and the mean value for the variable TOTALDMG for each type of event.

sum_cons <- consdata %>%
  group_by(NEWTYPE) %>%
  summarise(lower= min(TOTALDMG), upper=max(TOTALDMG), m = mean(TOTALDMG))

Again, let’s look for the 5 event types with the highest mean damage.

top5_new <- top_n(sum_cons, 5, m)
top5_new <- arrange(top5_new, -m)
top5_new
## # A tibble: 5 x 4
##   NEWTYPE                lower upper     m
##   <chr>                  <dbl> <dbl> <dbl>
## 1 TROPICAL STORM GORDON   1000  1000  1000
## 2 COASTAL EROSION          766   766   766
## 3 HEAVY RAIN AND FLOOD     600   600   600
## 4 RIVER AND STREAM FLOOD   400   800   600
## 5 DUST STORM/HIGH WINDS    550   550   550

As we see, Tropical Storm Gordon is again at the top of the list, followed by coastal erosion, heavy rain and flood, river and stream flood, and dust storm/high winds. Let’s do the same plot as before:

ggplot(data = top5_new, mapping = aes(x = NEWTYPE, y = m)) +
    geom_pointrange(mapping = aes(ymin = lower, ymax = upper))

Again we see, in the plot and in the data, that only one event type (river and stream flood) shows a spread between minimum and maximum, i.e. has data on more than one event.