Author: Toni Rib
In this report I aim to describe both the human and economic consequences of various storm events in the United States between the years 1950 and 2011. Data was obtained from the National Oceanic and Atmospheric Administration’s (NOAA) storm database for the analysis. From the data, I concluded that while Tornados are the overwhelming cause of human death and injury, floods andd hurricanes/typhoons tend to cause the most economic damage as measured in property and crop damage costs.
The raw data for this research was obtained from this link from the Coursera Reproducible Research course website for Peer Assessment 2 on Monday, May 18. However, the data originally came from the United States National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur as well as estimates of any fatalities, injuries, and property/crop damage.
The data is originally in comma-separated values (csv) format and compressed with the bz2 algorithm. The ‘read.csv’ function can read in this type of data without decompressing it first, so I will use that to add the raw data to an R data frame called ‘sdata.’
sdata <- read.csv("repdata-data-StormData.csv.bz2", header = TRUE)
The result is a data set of 37 variables and 902297 observations.
The next two sections will create the tidy data sets that will be used for the analysis. One data set will be used for analysis of population health while the second data set will be used for analysis of economic consequences.
In the first section of our analysis, we are going to be looking at events which are harmful to human health. We assume that ‘harmful to human health’ means there must be at least one injury or fatality related to the observation. We will create a data set that only includes observations with at least one fatality or injury and sort them based on the number of fatalities and injuries.
mdata <- sdata[sdata$INJURIES > 0 | sdata$FATALITIES > 0, ]
## Sort data in decreasing order based on fatalities then injuries
library(plyr)
mdata <- arrange(mdata, FATALITIES, INJURIES, decreasing = TRUE)
One of the columns we are interested in is the type of event (variable EVTYPE). A quick look at this variable using ‘str’ shows that this is a factor variable with 985 levels.
str(mdata$EVTYPE)
## Factor w/ 985 levels " HIGH SURF ADVISORY",..: 275 834 834 834 130 834 834 130 130 834 ...
However, now that we have removed observations with no injuries or fatalities, we can use the aggregate function to see how many event types are still left in our data set.
mdataAggregates <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = mdata, sum)
With this new data set, there are only 220 event types in our data set.
We can also add a new column that contains the total number of fatalities plus injuries for each storm event type. To make the data easier to read, we will sort the data in decreasing order on this new column.
library(plyr)
mdataAggregates$TOTALHARM <- mdataAggregates$INJURIES + mdataAggregates$FATALITIES
mdataAggregates <- arrange(mdataAggregates, TOTALHARM, decreasing = TRUE)
We will look at this data set further in the Results section.
For this section, we are interested in the economic consequences of storms. We will assume that ‘economic consequences’ means total cost of property damage plus crop damage. These are columns ‘PROPDMG’ and ‘CROPDMG’ in the original, raw data set (which we called ‘sdata’) with the associated columns ‘PROPDMGEXP’ and ‘CROPDMGEXP’ which give a one letter symbol signifying the units of the first two columns.
Similar to our previous data processing, we will first remove all observations where no property damage or crop damage was present, as these are not relevant to our analysis. We will also shrink down our data set to only include the columns we are interested in.
econdata <- sdata[sdata$PROPDMG > 0 | sdata$CROPDMG > 0, ]
econSimple <- econdata[ ,c(8, 25, 26, 27, 28)]
library(plyr)
econSimple <- arrange(econSimple, PROPDMG, CROPDMG, decreasing = TRUE)
Before we can aggregate the data for each storm event type, we need to first take into account the fact that the numbers in the ‘PROPDMG’ and ‘CROPDMG’ columns are not all in the same units. We will need to use the ‘PROPDMPEXP’ and ‘CROPDMGEXP’ columns to convert the numbers into actual dollars, all in the same units of $1. First we will take a look at what symbols we have for each column and how many of our observations fall into each category.
## Property Damage Units
table(econSimple$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 4357 1 0 5 209 0 1 1 4 18
## 6 7 8 B h H K m M
## 3 2 0 40 1 6 229057 7 11319
## Crop Damage Units
table(econSimple$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 145037 6 17 0 7 21 97960 1 1982
There are some unusual units in the data, such as ‘?’ or ‘+’ which do not follow the rules outlined in section 2.7 of the National Weather Service Storm Data Documentation which only allows
We do this by adding two additional columns, ‘PROPMAG’ and ‘CROPMAG’ which will be equal to either 0, 1, 1000, 100000, or 1000000000 depending on the values in the ‘PROPDMGEXP’ and ‘CROPDMGEXP’ columns. All rows will initially be set to a value of 0, then only those rows we wish to keep based on the above criteria will be set to the appropriate non-zero value.
After the new columns are added, we can calculate the actual values of the damage by multiplying the columns together.
## Set all initial multipliers to 0
econSimple$PROPMAG <- 0
econSimple$CROPMAG <- 0
## Set correct multipliers for property damage
econSimple$PROPMAG[econSimple$PROPDMGEXP == "B"] <- 1000000000
econSimple$PROPMAG[econSimple$PROPDMGEXP == "M" | econSimple$PROPDMGEXP == "m"] <- 1000000
econSimple$PROPMAG[econSimple$PROPDMGEXP == "K" | econSimple$PROPDMGEXP == "k"] <- 1000
econSimple$PROPMAG[econSimple$PROPDMGEXP == ""] <- 1
## Set correct multipliers for property damage
econSimple$CROPMAG[econSimple$CROPDMGEXP == "B"] <- 1000000000
econSimple$CROPMAG[econSimple$CROPDMGEXP == "M" | econSimple$CROPDMGEXP == "m"] <- 1000000
econSimple$CROPMAG[econSimple$CROPDMGEXP == "K" | econSimple$CROPDMGEXP == "k"] <- 1000
econSimple$CROPMAG[econSimple$CROPDMGEXP == ""] <- 1
## Calculate actual values
econSimple$PROPDMGACT <- econSimple$PROPDMG * econSimple$PROPMAG
econSimple$CROPDMGACT <- econSimple$CROPDMG * econSimple$CROPMAG
Now that we have converted the columns to actual dollars and set the data we are not interested in to zero, we can use the aggregate function to calculate the total damage for both property and crops for each storm event type. We also add a column for the total damage cost which is equal to the property damage cost plus the crop damage cost. For easier viewing, we sort this new ‘TOTALDAMAGE’ column with the highest values at the top.
EconAggregates <- aggregate(cbind(PROPDMGACT, CROPDMGACT) ~ EVTYPE, data = econSimple, sum)
EconAggregates$TOTALDAMAGE <- EconAggregates$PROPDMGACT + EconAggregates$CROPDMGACT
EconAggregates <- arrange(EconAggregates, TOTALDAMAGE, decreasing = TRUE)
head(EconAggregates, 10)
## EVTYPE PROPDMGACT CROPDMGACT TOTALDAMAGE
## 1 FLOOD 144657709807 5661968450 150319678257
## 2 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 3 TORNADO 56937160483 414953110 57352113593
## 4 STORM SURGE 43323536000 5000 43323541000
## 5 HAIL 15732266777 3025954453 18758221230
## 6 FLASH FLOOD 16140811717 1421317100 17562128817
## 7 DROUGHT 1046106000 13972566000 15018672000
## 8 HURRICANE 11868319010 2741910000 14610229010
## 9 RIVER FLOOD 5118945500 5029459000 10148404500
## 10 ICE STORM 3944927810 5022113500 8967041310
We will look at this data set further in the results section.
Now that we have two tidy data sets, we analyze them to answer the following two questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
Looking at the first 10 rows of the ‘mdataAggregates’ data set, we can see there is a clear winner in all 3 columns: tornados.
head(mdataAggregates, 10)
## EVTYPE FATALITIES INJURIES TOTALHARM
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 HEAT 937 2100 3037
## 7 FLASH FLOOD 978 1777 2755
## 8 ICE STORM 89 1975 2064
## 9 THUNDERSTORM WIND 133 1488 1621
## 10 WINTER STORM 206 1321 1527
It should be noted that while some event types could be combined for a more detailed analysis (for example: ‘TSTM WIND’ and ‘THUNDERSTORM WIND’), even adding these two columns together will only account to 10% of the total injuries and fatalities cause by tornados. In fact, tornados count for approximately 62% of ALL fatalities & injuries in the data set.
(mdataAggregates[mdataAggregates$EVTYPE == "TORNADO", 4] / (sum(mdataAggregates$TOTALHARM))) * 100
## [1] 62.29661
Therefore, combining even some of the top ranking events does not remove the TORNADO category from the number one spot.
However, we would also like to look at the top categories below tornados, so we will combine a few of the top categories to get a clear picture of which storm event catgories cause the most damage to humans. Ignoring the TORNADO outlier in the first column, the total fatalities and injuries in columns 2-15 account for almost 80% of the total fatalities and injuries in the data set (again, minus TORNADOS) so we will focus on those columns.
head(mdataAggregates, 15)
## EVTYPE FATALITIES INJURIES TOTALHARM
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 HEAT 937 2100 3037
## 7 FLASH FLOOD 978 1777 2755
## 8 ICE STORM 89 1975 2064
## 9 THUNDERSTORM WIND 133 1488 1621
## 10 WINTER STORM 206 1321 1527
## 11 HIGH WIND 248 1137 1385
## 12 HAIL 15 1361 1376
## 13 HURRICANE/TYPHOON 64 1275 1339
## 14 HEAVY SNOW 127 1021 1148
## 15 WILDFIRE 75 911 986
(sum(mdataAggregates$TOTALHARM[2:15]) / sum(mdataAggregates$TOTALHARM[2:220])) * 100
## [1] 79.1086
In the first 15 columns, we can combine ‘TSTM WIND’ with ‘THUNDERSTORM WIND’; ‘EXCESSIVE HEAT’ with ‘HEAT’; and ‘WINTER STORM’ with ‘HEAVY SNOW.’ We can then re-aggregate and sort the data.
h <- mdataAggregates[1:15, ]
h$EVTYPE[h$EVTYPE == "TSTM WIND"] <- "THUNDERSTORM WIND"
h$EVTYPE[h$EVTYPE == "HEAT"] <- "EXCESSIVE HEAT"
h$EVTYPE[h$EVTYPE == "HEAVY SNOW"] <- "WINTER STORM"
hAgg <- aggregate(cbind(FATALITIES, INJURIES, TOTALHARM) ~ EVTYPE, data = h, sum)
hAgg <- arrange(hAgg, TOTALHARM, decreasing = TRUE)
print(hAgg)
## EVTYPE FATALITIES INJURIES TOTALHARM
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 2840 8625 11465
## 3 THUNDERSTORM WIND 637 8445 9082
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 FLASH FLOOD 978 1777 2755
## 7 WINTER STORM 333 2342 2675
## 8 ICE STORM 89 1975 2064
## 9 HIGH WIND 248 1137 1385
## 10 HAIL 15 1361 1376
## 11 HURRICANE/TYPHOON 64 1275 1339
## 12 WILDFIRE 75 911 986
This data is better viewed in a barplot showing the top categories.
par(las=2, mar = c(11,5,2,2), cex.axis=0.75)
with(data = hAgg, barplot(TOTALHARM, names.arg = hAgg$EVTYPE[1:12],
main = "Total Fatalities & Injuries by Storm Event",
ylab = "# of Fatalities + Injuries",
col = "red"))
From this barplot, we can see that the storm event types that cause the next highest number of fatalities and injuries are excessive heat, thunderstorm wind, floods, and lightning. After lightning, the numbers drop off significantly.
To draw some conclusions about the economic consequences of storms, we will create a stacked barplot that allows us to see not only the total aggregate damage, but also the damage costs broken out by property and crops. We will only include the first 20 columns. To do this, we also need to drop the TOTALDAMAGE column.
ecoSmall <- EconAggregates[1:20, 1:3]
ecoSmall$EVTYPE <- reorder(ecoSmall$EVTYPE, rowSums(ecoSmall[-1]))
library(reshape2)
library(ggplot2)
eco.m <- melt(ecoSmall, id.var = "EVTYPE")
eco.m <- arrange(eco.m, variable, value, decreasing = TRUE)
ggplot(eco.m, aes(x=EVTYPE, y = value, fill = variable)) + geom_bar(stat= 'identity') +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Property and Crop Damage by Storm Event",
x = "Storm Event Type", y ="Cost of Damage in US Dollars")
From this plot we can see that floods cause the most property damage, followed by hurricane/typhoon and tornados. However, the majority of crop damage is caused from drought, which is interested since it is the opposite of flooding. Therefore, while too much rain is bad for property, too little rain is bad for crops.