Analysis of the health and economical impacts of storms within the US

Synopsis

This document provides a short analysis of the impact that storms have on the health and economy of the U.S.A. We do this by analyzing the storm data provided by the National Weather Service. The analysis focuses on two questions, which types of events have the greatest impact on the population health and which types of events have the greatest economic impact. For this analysis we will only look at the United States as a whole and will not examine seperate states.

Health impacts - Tornadoes most hazardous

For the impact on population health, we look for the total number of injuries and fatalities. When summarized by event, we come to the conclusion that tornadoes are by far the greatest cause of injuries and fatalities with a total of 91,346 injuries and 5,633 fatalities.

Economical impacts - Floods most damaging

For the economical impact, we look for the total amount of crop and property damage. A summary shows clearly that floods cause that greatest amount of damage with a total of $150,319,678,257 (150 Billion) worth of damage.

Data Processing

Start by downloading and reading in the original data file.

#First we download the proper file
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "stormdata.csv.bz2")

storm <- read.csv("stormdata.csv.bz2", stringsAsFactors = FALSE)

One problem that exists within the dataset is that the event types (EVTYPE) are not always properly filled out. For example the major cause of injuries, Tornadoes, are written down in various ways.

evfreq  <- data.frame(table(storm$EVTYPE))
kable(evfreq[grepl("tornado", evfreq$Var1, ignore.case=TRUE),])

	Var1	Freq
62	COLD AIR TORNADO	1
830	TORNADO	60652
831	TORNADO DEBRIS	1
832	TORNADOES	2
833	TORNADOES, TSTM WIND, HAIL	1
834	TORNADO F0	19
835	TORNADO F1	4
836	TORNADO F2	3
837	TORNADO F3	2
838	TORNADOS	1
839	TORNADO/WATERSPOUT	1
940	WATERSPOUT TORNADO	1
941	WATERSPOUT-TORNADO	2
942	WATERSPOUT/TORNADO	8
943	WATERSPOUT/ TORNADO	2

The occurance of this is minor though. So we’ll treat improperly filled out fields as garbage and won’t try to scrub it clean. It will be a recommendation to the people supplying the dataset to add some checks for entering data.

Before we start creating our analysis datasets we will load the necessary libraries; dplyr for data manipulation and tidyr to transpose columns.

library(dplyr)
library(tidyr)

Next we create two separate data variables, one for the health and one for the economical analysis. In both cases we will create a dataset for viewing and a dataset for a plot generation.

Health

For health we will use the following columns

EVTYPE (event)
INJURIES (sustained injuries)
FATALITIES (fatalities)

Data will be processed in the following manner:

Select the data
Add a “Total” column by combining injuries and fatalities
Group by event type
Summarise injuries, fatalities and the total column for each eventtype
Select the top 10 based on the total column
Arrange on total descending.

health  <- select(storm, event = EVTYPE, injuries=INJURIES, fatalities=FATALITIES) %>%
    mutate(total=injuries + fatalities) %>%
    group_by(event) %>%
    summarise(injuries=sum(injuries), fatalities = sum(fatalities), total=sum(total)) %>%
    top_n(10, total) %>%
    arrange(desc(total))

Now we have a dataset with the top 10 event types causeing the most damage to the health of the US population. We still need to create an extra dataset for the chart. The main difference between the current dataset and the chart dataset is that we combine the injuries and fatalities columns into a single column of observations. This will make it possible to combine both values in a single chart.

health_chart <- 
    gather(health, "type", "casualties", injuries, fatalities) %>%
    arrange(desc(total))

Last but not least, we set the levels of the event factor to the order of the dataframe so that the chart will be properly arranged.

health_chart$event  <- factor(health_chart$event, level=rev(unique(health_chart$event)))

Economy

For the economical damage we make use of the columns PROPDMG and CROPDMG for property and crop damage respectively. But there’s a catch.. The values in these two columns need to be combined with the PROPDMGEXP and CROPDMGEXP columns. These columns contain multipliers to add an extra couple of zeroes. Let’s take a look at those two columns first.

data.frame(table(rbind(storm$PROPDMGEXP, storm$CROPDMGEXP)))

##    Var1    Freq
## 1       1084347
## 2     -       1
## 3     ?      15
## 4     +       5
## 5     0     235
## 6     1      25
## 7     2      14
## 8     3       4
## 9     4       4
## 10    5      28
## 11    6       4
## 12    7       5
## 13    8       1
## 14    B      49
## 15    h       1
## 16    H       6
## 17    k      21
## 18    K  706497
## 19    m       8
## 20    M   13324

Just like the EVTYPE file, there’s a lot of junk in here but we can find three modifiers that make sense: k (thousands), m (millions) and b(billions). This is in line with the documentation of the dataset. So we’ll create our own dataset combine the proper multipliers with the letters.

prefix <- as.character(c("k", "m", "b"))
multiplier <- c(10^3, 10^6, 10^9)
conv <- data.frame(prefix, multiplier, stringsAsFactors = FALSE)
kable(conv)

prefix	multiplier
k	1e+03
m	1e+06
b	1e+09

This conv dataset can be joined with the damage dataset to get the correct damage amount.

Next we will create the damage dataset. This will include the following steps.

Select the proper required columns from the storm dataset.
set the propertymult and cropmult columns to lower case so a proper join can be made with the conv dataset
Left join the conv dataset on the propertymult column
If a join has been made, multiply propertydamage with the joined multiplier column, otherwise keep the original value
Select only the necessary columns, dropping the multiplier column
Left join the conv dataset on the cropmult column
If a join has been made, multiply cropdamage with the joined multiplier column, otherwise keep the original value
Group by even
Summarise the propertydamage and cropdamage columns and add a third column consisting of a sum of both values added up.
Select the top 10 rows based on the total value of damage.
Arrange on total descending

damage <- 
    select(
        storm
        , event = EVTYPE
        , propertydamage = PROPDMG
        , propertymult = PROPDMGEXP
        , cropdamage= CROPDMG
        , cropmult = CROPDMGEXP) %>%
    mutate(propertymult = tolower(propertymult), cropmult = tolower(cropmult)) %>%
    left_join(conv, by=c("propertymult" ="prefix")) %>%
    mutate(propertydamage = propertydamage * ifelse(is.na(multiplier), 1, multiplier) ) %>%
    select(
        event
        , propertydamage
        , cropdamage
        , cropmult
    )%>%
    left_join(conv, by=c("cropmult" ="prefix")) %>%
    mutate(cropdamage = cropdamage * ifelse(is.na(multiplier), 1, multiplier) ) %>%
    group_by(event) %>%
    summarise(property = sum(propertydamage), crop=sum(cropdamage), total=sum(cropdamage+propertydamage))%>%
    top_n(10, total) %>%
    arrange(desc(total))

    
damage$event  <- factor(damage$event, level=rev(unique(damage$event)))

Now we have a dataset containing the necessary values. Next we’ll create a separate dataset for display in a chart. Just like the health chart, we combine both columns into one. We arrange the rows based on total damage and we set the order of the event factor levels to match this.

damage_chart <- gather(damage,"type", "amount", property, crop) %>%
    arrange(desc(total))

damage_chart$event  <- factor(damage_chart$event, level=rev(unique(damage_chart$event)))

Results

For all results, we use the ggplot library to draw the charts.

    library(ggplot2)

Which types of events are most harmful with respect to the population health?

A top 10 of the most harmful types of events are as follows:

ggplot(health_chart, aes(x = event, y = casualties)) + geom_bar(stat='identity', aes(fill=type)) +coord_flip() + ggtitle("Top 10 events most damaging to population health")

kable(health)

event	injuries	fatalities	total
TORNADO	91346	5633	96979
EXCESSIVE HEAT	6525	1903	8428
TSTM WIND	6957	504	7461
FLOOD	6789	470	7259
LIGHTNING	5230	816	6046
HEAT	2100	937	3037
FLASH FLOOD	1777	978	2755
ICE STORM	1975	89	2064
THUNDERSTORM WIND	1488	133	1621
WINTER STORM	1321	206	1527

As the data shows, tornadoes are most harmful with respect to the population health.

Which types of events have the greatest economic consequences?

A top 10 of the most damaging types of events are as follows:

ggplot(damage_chart, aes(x = event, y = amount/10^9)) + geom_bar(stat='identity', aes(fill=type)) +coord_flip() + ylab("Damage in billion US dollars") + ggtitle("Top 10 events causing most financial damage")

kable(damage)

event	property	crop	total
FLOOD	144657709807	5661968450	150319678257
HURRICANE/TYPHOON	69305840000	2607872800	71913712800
TORNADO	56937160779	414953270	57352114049
STORM SURGE	43323536000	5000	43323541000
HAIL	15732267048	3025954473	18758221521
FLASH FLOOD	16140812067	1421317100	17562129167
DROUGHT	1046106000	13972566000	15018672000
HURRICANE	11868319010	2741910000	14610229010
RIVER FLOOD	5118945500	5029459000	10148404500
ICE STORM	3944927860	5022113500	8967041360

As the data shows, floods have the greatest economical impact. Though a special mention should be given to droughts who have an especially large impact on crop damage.