Synopsis

This document provides a short analysis of the impact that storms have on the health and economy of the U.S.A. We do this by analyzing the storm data provided by the National Weather Service. The analysis focuses on two questions, which types of events have the greatest impact on the population health and which types of events have the greatest economic impact. For this analysis we will only look at the United States as a whole and will not examine seperate states.

Health impacts - Tornadoes most hazardous

For the impact on population health, we look for the total number of injuries and fatalities. When summarized by event, we come to the conclusion that tornadoes are by far the greatest cause of injuries and fatalities with a total of 91,346 injuries and 5,633 fatalities.

Economical impacts - Floods most damaging

For the economical impact, we look for the total amount of crop and property damage. A summary shows clearly that floods cause that greatest amount of damage with a total of $150,319,678,257 (150 Billion) worth of damage.

Data Processing

Start by downloading and reading in the original data file.

#First we download the proper file
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "stormdata.csv.bz2")

storm <- read.csv("stormdata.csv.bz2", stringsAsFactors = FALSE)

One problem that exists within the dataset is that the event types (EVTYPE) are not always properly filled out. For example the major cause of injuries, Tornadoes, are written down in various ways.

evfreq  <- data.frame(table(storm$EVTYPE))
kable(evfreq[grepl("tornado", evfreq$Var1, ignore.case=TRUE),])
Var1 Freq
62 COLD AIR TORNADO 1
830 TORNADO 60652
831 TORNADO DEBRIS 1
832 TORNADOES 2
833 TORNADOES, TSTM WIND, HAIL 1
834 TORNADO F0 19
835 TORNADO F1 4
836 TORNADO F2 3
837 TORNADO F3 2
838 TORNADOS 1
839 TORNADO/WATERSPOUT 1
940 WATERSPOUT TORNADO 1
941 WATERSPOUT-TORNADO 2
942 WATERSPOUT/TORNADO 8
943 WATERSPOUT/ TORNADO 2

The occurance of this is minor though. So we’ll treat improperly filled out fields as garbage and won’t try to scrub it clean. It will be a recommendation to the people supplying the dataset to add some checks for entering data.

Before we start creating our analysis datasets we will load the necessary libraries; dplyr for data manipulation and tidyr to transpose columns.

library(dplyr)
library(tidyr)

Next we create two separate data variables, one for the health and one for the economical analysis. In both cases we will create a dataset for viewing and a dataset for a plot generation.

Health

For health we will use the following columns

  • EVTYPE (event)
  • INJURIES (sustained injuries)
  • FATALITIES (fatalities)

Data will be processed in the following manner:

  1. Select the data
  2. Add a “Total” column by combining injuries and fatalities
  3. Group by event type
  4. Summarise injuries, fatalities and the total column for each eventtype
  5. Select the top 10 based on the total column
  6. Arrange on total descending.
health  <- select(storm, event = EVTYPE, injuries=INJURIES, fatalities=FATALITIES) %>%
    mutate(total=injuries + fatalities) %>%
    group_by(event) %>%
    summarise(injuries=sum(injuries), fatalities = sum(fatalities), total=sum(total)) %>%
    top_n(10, total) %>%
    arrange(desc(total))

Now we have a dataset with the top 10 event types causeing the most damage to the health of the US population. We still need to create an extra dataset for the chart. The main difference between the current dataset and the chart dataset is that we combine the injuries and fatalities columns into a single column of observations. This will make it possible to combine both values in a single chart.

health_chart <- 
    gather(health, "type", "casualties", injuries, fatalities) %>%
    arrange(desc(total)) 

Last but not least, we set the levels of the event factor to the order of the dataframe so that the chart will be properly arranged.

health_chart$event  <- factor(health_chart$event, level=rev(unique(health_chart$event)))

Economy

For the economical damage we make use of the columns PROPDMG and CROPDMG for property and crop damage respectively. But there’s a catch.. The values in these two columns need to be combined with the PROPDMGEXP and CROPDMGEXP columns. These columns contain multipliers to add an extra couple of zeroes. Let’s take a look at those two columns first.

data.frame(table(rbind(storm$PROPDMGEXP, storm$CROPDMGEXP)))
##    Var1    Freq
## 1       1084347
## 2     -       1
## 3     ?      15
## 4     +       5
## 5     0     235
## 6     1      25
## 7     2      14
## 8     3       4
## 9     4       4
## 10    5      28
## 11    6       4
## 12    7       5
## 13    8       1
## 14    B      49
## 15    h       1
## 16    H       6
## 17    k      21
## 18    K  706497
## 19    m       8
## 20    M   13324

Just like the EVTYPE file, there’s a lot of junk in here but we can find three modifiers that make sense: k (thousands), m (millions) and b(billions). This is in line with the documentation of the dataset. So we’ll create our own dataset combine the proper multipliers with the letters.

prefix <- as.character(c("k", "m", "b"))
multiplier <- c(10^3, 10^6, 10^9)
conv <- data.frame(prefix, multiplier, stringsAsFactors = FALSE)
kable(conv)
prefix multiplier
k 1e+03
m 1e+06
b 1e+09

This conv dataset can be joined with the damage dataset to get the correct damage amount.

Next we will create the damage dataset. This will include the following steps.

  1. Select the proper required columns from the storm dataset.
  2. set the propertymult and cropmult columns to lower case so a proper join can be made with the conv dataset
  3. Left join the conv dataset on the propertymult column
  4. If a join has been made, multiply propertydamage with the joined multiplier column, otherwise keep the original value
  5. Select only the necessary columns, dropping the multiplier column
  6. Left join the conv dataset on the cropmult column
  7. If a join has been made, multiply cropdamage with the joined multiplier column, otherwise keep the original value
  8. Group by even
  9. Summarise the propertydamage and cropdamage columns and add a third column consisting of a sum of both values added up.
  10. Select the top 10 rows based on the total value of damage.
  11. Arrange on total descending
damage <- 
    select(
        storm
        , event = EVTYPE
        , propertydamage = PROPDMG
        , propertymult = PROPDMGEXP
        , cropdamage= CROPDMG
        , cropmult = CROPDMGEXP) %>%
    mutate(propertymult = tolower(propertymult), cropmult = tolower(cropmult)) %>%
    left_join(conv, by=c("propertymult" ="prefix")) %>%
    mutate(propertydamage = propertydamage * ifelse(is.na(multiplier), 1, multiplier) ) %>%
    select(
        event
        , propertydamage
        , cropdamage
        , cropmult
    )%>%
    left_join(conv, by=c("cropmult" ="prefix")) %>%
    mutate(cropdamage = cropdamage * ifelse(is.na(multiplier), 1, multiplier) ) %>%
    group_by(event) %>%
    summarise(property = sum(propertydamage), crop=sum(cropdamage), total=sum(cropdamage+propertydamage))%>%
    top_n(10, total) %>%
    arrange(desc(total))

    
damage$event  <- factor(damage$event, level=rev(unique(damage$event)))

Now we have a dataset containing the necessary values. Next we’ll create a separate dataset for display in a chart. Just like the health chart, we combine both columns into one. We arrange the rows based on total damage and we set the order of the event factor levels to match this.

damage_chart <- gather(damage,"type", "amount", property, crop) %>%
    arrange(desc(total))

damage_chart$event  <- factor(damage_chart$event, level=rev(unique(damage_chart$event)))

Results

For all results, we use the ggplot library to draw the charts.

    library(ggplot2)

Which types of events are most harmful with respect to the population health?

A top 10 of the most harmful types of events are as follows:

ggplot(health_chart, aes(x = event, y = casualties)) + geom_bar(stat='identity', aes(fill=type)) +coord_flip() + ggtitle("Top 10 events most damaging to population health")

kable(health)
event injuries fatalities total
TORNADO 91346 5633 96979
EXCESSIVE HEAT 6525 1903 8428
TSTM WIND 6957 504 7461
FLOOD 6789 470 7259
LIGHTNING 5230 816 6046
HEAT 2100 937 3037
FLASH FLOOD 1777 978 2755
ICE STORM 1975 89 2064
THUNDERSTORM WIND 1488 133 1621
WINTER STORM 1321 206 1527

As the data shows, tornadoes are most harmful with respect to the population health.

Which types of events have the greatest economic consequences?

A top 10 of the most damaging types of events are as follows:

ggplot(damage_chart, aes(x = event, y = amount/10^9)) + geom_bar(stat='identity', aes(fill=type)) +coord_flip() + ylab("Damage in billion US dollars") + ggtitle("Top 10 events causing most financial damage")

kable(damage)
event property crop total
FLOOD 144657709807 5661968450 150319678257
HURRICANE/TYPHOON 69305840000 2607872800 71913712800
TORNADO 56937160779 414953270 57352114049
STORM SURGE 43323536000 5000 43323541000
HAIL 15732267048 3025954473 18758221521
FLASH FLOOD 16140812067 1421317100 17562129167
DROUGHT 1046106000 13972566000 15018672000
HURRICANE 11868319010 2741910000 14610229010
RIVER FLOOD 5118945500 5029459000 10148404500
ICE STORM 3944927860 5022113500 8967041360

As the data shows, floods have the greatest economical impact. Though a special mention should be given to droughts who have an especially large impact on crop damage.