This document provides a short analysis of the impact that storms have on the health and economy of the U.S.A. We do this by analyzing the storm data provided by the National Weather Service. The analysis focuses on two questions, which types of events have the greatest impact on the population health and which types of events have the greatest economic impact. For this analysis we will only look at the United States as a whole and will not examine seperate states.
For the impact on population health, we look for the total number of injuries and fatalities. When summarized by event, we come to the conclusion that tornadoes are by far the greatest cause of injuries and fatalities with a total of 91,346 injuries and 5,633 fatalities.
For the economical impact, we look for the total amount of crop and property damage. A summary shows clearly that floods cause that greatest amount of damage with a total of $150,319,678,257 (150 Billion) worth of damage.
Start by downloading and reading in the original data file.
#First we download the proper file
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "stormdata.csv.bz2")
storm <- read.csv("stormdata.csv.bz2", stringsAsFactors = FALSE)
One problem that exists within the dataset is that the event types (EVTYPE) are not always properly filled out. For example the major cause of injuries, Tornadoes, are written down in various ways.
evfreq <- data.frame(table(storm$EVTYPE))
kable(evfreq[grepl("tornado", evfreq$Var1, ignore.case=TRUE),])
| Var1 | Freq | |
|---|---|---|
| 62 | COLD AIR TORNADO | 1 |
| 830 | TORNADO | 60652 |
| 831 | TORNADO DEBRIS | 1 |
| 832 | TORNADOES | 2 |
| 833 | TORNADOES, TSTM WIND, HAIL | 1 |
| 834 | TORNADO F0 | 19 |
| 835 | TORNADO F1 | 4 |
| 836 | TORNADO F2 | 3 |
| 837 | TORNADO F3 | 2 |
| 838 | TORNADOS | 1 |
| 839 | TORNADO/WATERSPOUT | 1 |
| 940 | WATERSPOUT TORNADO | 1 |
| 941 | WATERSPOUT-TORNADO | 2 |
| 942 | WATERSPOUT/TORNADO | 8 |
| 943 | WATERSPOUT/ TORNADO | 2 |
The occurance of this is minor though. So we’ll treat improperly filled out fields as garbage and won’t try to scrub it clean. It will be a recommendation to the people supplying the dataset to add some checks for entering data.
Before we start creating our analysis datasets we will load the necessary libraries; dplyr for data manipulation and tidyr to transpose columns.
library(dplyr)
library(tidyr)
Next we create two separate data variables, one for the health and one for the economical analysis. In both cases we will create a dataset for viewing and a dataset for a plot generation.
For health we will use the following columns
Data will be processed in the following manner:
health <- select(storm, event = EVTYPE, injuries=INJURIES, fatalities=FATALITIES) %>%
mutate(total=injuries + fatalities) %>%
group_by(event) %>%
summarise(injuries=sum(injuries), fatalities = sum(fatalities), total=sum(total)) %>%
top_n(10, total) %>%
arrange(desc(total))
Now we have a dataset with the top 10 event types causeing the most damage to the health of the US population. We still need to create an extra dataset for the chart. The main difference between the current dataset and the chart dataset is that we combine the injuries and fatalities columns into a single column of observations. This will make it possible to combine both values in a single chart.
health_chart <-
gather(health, "type", "casualties", injuries, fatalities) %>%
arrange(desc(total))
Last but not least, we set the levels of the event factor to the order of the dataframe so that the chart will be properly arranged.
health_chart$event <- factor(health_chart$event, level=rev(unique(health_chart$event)))
For the economical damage we make use of the columns PROPDMG and CROPDMG for property and crop damage respectively. But there’s a catch.. The values in these two columns need to be combined with the PROPDMGEXP and CROPDMGEXP columns. These columns contain multipliers to add an extra couple of zeroes. Let’s take a look at those two columns first.
data.frame(table(rbind(storm$PROPDMGEXP, storm$CROPDMGEXP)))
## Var1 Freq
## 1 1084347
## 2 - 1
## 3 ? 15
## 4 + 5
## 5 0 235
## 6 1 25
## 7 2 14
## 8 3 4
## 9 4 4
## 10 5 28
## 11 6 4
## 12 7 5
## 13 8 1
## 14 B 49
## 15 h 1
## 16 H 6
## 17 k 21
## 18 K 706497
## 19 m 8
## 20 M 13324
Just like the EVTYPE file, there’s a lot of junk in here but we can find three modifiers that make sense: k (thousands), m (millions) and b(billions). This is in line with the documentation of the dataset. So we’ll create our own dataset combine the proper multipliers with the letters.
prefix <- as.character(c("k", "m", "b"))
multiplier <- c(10^3, 10^6, 10^9)
conv <- data.frame(prefix, multiplier, stringsAsFactors = FALSE)
kable(conv)
| prefix | multiplier |
|---|---|
| k | 1e+03 |
| m | 1e+06 |
| b | 1e+09 |
This conv dataset can be joined with the damage dataset to get the correct damage amount.
Next we will create the damage dataset. This will include the following steps.
damage <-
select(
storm
, event = EVTYPE
, propertydamage = PROPDMG
, propertymult = PROPDMGEXP
, cropdamage= CROPDMG
, cropmult = CROPDMGEXP) %>%
mutate(propertymult = tolower(propertymult), cropmult = tolower(cropmult)) %>%
left_join(conv, by=c("propertymult" ="prefix")) %>%
mutate(propertydamage = propertydamage * ifelse(is.na(multiplier), 1, multiplier) ) %>%
select(
event
, propertydamage
, cropdamage
, cropmult
)%>%
left_join(conv, by=c("cropmult" ="prefix")) %>%
mutate(cropdamage = cropdamage * ifelse(is.na(multiplier), 1, multiplier) ) %>%
group_by(event) %>%
summarise(property = sum(propertydamage), crop=sum(cropdamage), total=sum(cropdamage+propertydamage))%>%
top_n(10, total) %>%
arrange(desc(total))
damage$event <- factor(damage$event, level=rev(unique(damage$event)))
Now we have a dataset containing the necessary values. Next we’ll create a separate dataset for display in a chart. Just like the health chart, we combine both columns into one. We arrange the rows based on total damage and we set the order of the event factor levels to match this.
damage_chart <- gather(damage,"type", "amount", property, crop) %>%
arrange(desc(total))
damage_chart$event <- factor(damage_chart$event, level=rev(unique(damage_chart$event)))
For all results, we use the ggplot library to draw the charts.
library(ggplot2)
A top 10 of the most harmful types of events are as follows:
ggplot(health_chart, aes(x = event, y = casualties)) + geom_bar(stat='identity', aes(fill=type)) +coord_flip() + ggtitle("Top 10 events most damaging to population health")
kable(health)
| event | injuries | fatalities | total |
|---|---|---|---|
| TORNADO | 91346 | 5633 | 96979 |
| EXCESSIVE HEAT | 6525 | 1903 | 8428 |
| TSTM WIND | 6957 | 504 | 7461 |
| FLOOD | 6789 | 470 | 7259 |
| LIGHTNING | 5230 | 816 | 6046 |
| HEAT | 2100 | 937 | 3037 |
| FLASH FLOOD | 1777 | 978 | 2755 |
| ICE STORM | 1975 | 89 | 2064 |
| THUNDERSTORM WIND | 1488 | 133 | 1621 |
| WINTER STORM | 1321 | 206 | 1527 |
As the data shows, tornadoes are most harmful with respect to the population health.
A top 10 of the most damaging types of events are as follows:
ggplot(damage_chart, aes(x = event, y = amount/10^9)) + geom_bar(stat='identity', aes(fill=type)) +coord_flip() + ylab("Damage in billion US dollars") + ggtitle("Top 10 events causing most financial damage")
kable(damage)
| event | property | crop | total |
|---|---|---|---|
| FLOOD | 144657709807 | 5661968450 | 150319678257 |
| HURRICANE/TYPHOON | 69305840000 | 2607872800 | 71913712800 |
| TORNADO | 56937160779 | 414953270 | 57352114049 |
| STORM SURGE | 43323536000 | 5000 | 43323541000 |
| HAIL | 15732267048 | 3025954473 | 18758221521 |
| FLASH FLOOD | 16140812067 | 1421317100 | 17562129167 |
| DROUGHT | 1046106000 | 13972566000 | 15018672000 |
| HURRICANE | 11868319010 | 2741910000 | 14610229010 |
| RIVER FLOOD | 5118945500 | 5029459000 | 10148404500 |
| ICE STORM | 3944927860 | 5022113500 | 8967041360 |
As the data shows, floods have the greatest economical impact. Though a special mention should be given to droughts who have an especially large impact on crop damage.