In this document I identify the storm events from 1950 to 2011 that had the greatest impact on population health and on the economy. The data come from the National Weather Service.
I determine which event types cause the most injuries and fatalities, and which cause the most property and crop damage. With this knowledge in hand, it is easier to judge which types of natural events are the most dangerous and most worth protecting against.
First I download the compressed data file to the current working directory and decompress it with the bunzip2 function from the R.utils package. I cache this step because downloading and reading the file takes a very long time.
library(R.utils)
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destfile <- "./StormData.csv.bz2"
download.file(fileURL, destfile)
bunzip2("StormData.csv.bz2", "StormData.csv", remove = FALSE, skip = TRUE)
## [1] "StormData.csv"
## attr(,"temporary")
## [1] FALSE
stormdata <- read.csv("./StormData.csv", header=TRUE)
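As a possible shortcut, R's connection functions can read bzip2-compressed text directly, which would make the separate decompression step optional. The sketch below shows the idea on a tiny made-up file; the toy data frame and the temporary file are assumptions for illustration and are not part of the storm data.

```r
# Write a tiny CSV through a bzip2 connection (toy data, not the real storm file)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), INJURIES = c(3, 1))
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "wt")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv accepts a connection, so the file is decompressed while it is read
toy_back <- read.csv(bzfile(path), header = TRUE)
toy_back
```

The trade-off is that the file is decompressed again on every read, whereas the bunzip2 approach above pays that cost only once.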
Next I load the dplyr package because it has very useful functions for manipulating data frames.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Only a few columns are relevant for this analysis: Event Type, Fatalities, Injuries, Property Damage, Property Damage Multiplier, Crop Damage, and Crop Damage Multiplier. I subset the data to just these columns to make the work neater and easier. Note that the data are very untidy: the same kind of event can appear under several different labels, depending on how each regional reporting station chose to record it. This may affect the analysis later on, but for the purposes of this report I ignore it.
stormdata <- stormdata[,c(8, 23:28)]
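Selecting by position works only as long as the column order never changes. A minimal sketch of the same subset by column name is shown below, using the names that appear in the code later in this report. The one-row toy data frame and its OTHER column are made up for illustration.

```r
# Toy stand-in for the full data set; OTHER is a made-up extra column
toy <- data.frame(
  OTHER = "x", EVTYPE = "TORNADO", FATALITIES = 1, INJURIES = 5,
  PROPDMG = 25, PROPDMGEXP = "K", CROPDMG = 0, CROPDMGEXP = ""
)

# Keep only the seven columns used in this analysis, selected by name
keep <- c("EVTYPE", "FATALITIES", "INJURIES",
          "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
toy_subset <- toy[, keep]
names(toy_subset)
```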
The first step is to find the total number of injuries and fatalities for each Event Type. To do this, I use the aggregate function from base R's stats package to sum by event type, sort the result with dplyr's arrange function, and keep the top 15 rows for each measure.
injuries <- aggregate(INJURIES~EVTYPE, stormdata, sum)
injuries <- arrange(injuries, desc(INJURIES))
injuries <- injuries[1:15, ]
injuries
## EVTYPE INJURIES
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
## 11 WINTER STORM 1321
## 12 HURRICANE/TYPHOON 1275
## 13 HIGH WIND 1137
## 14 HEAVY SNOW 1021
## 15 WILDFIRE 911
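The aggregate, sort, and truncate steps are the same for every measure in this report, so they could be wrapped in a small function. The sketch below uses base R only and runs on invented toy data; top_events is a hypothetical name, not a function from any package.

```r
# Hypothetical helper: total one numeric column per event type, keep the n largest
top_events <- function(data, column, n = 15) {
  totals <- aggregate(data[[column]], by = list(EVTYPE = data$EVTYPE), FUN = sum)
  names(totals)[2] <- column
  totals <- totals[order(totals[[column]], decreasing = TRUE), ]
  rownames(totals) <- NULL
  head(totals, n)
}

# Invented toy data for illustration
toy <- data.frame(
  EVTYPE   = c("TORNADO", "FLOOD", "TORNADO", "HAIL"),
  INJURIES = c(10, 4, 5, 1)
)
top2 <- top_events(toy, "INJURIES", n = 2)
top2
```

Under these assumptions, a call such as top_events(stormdata, "FATALITIES") would stand in for the three lines used for each measure.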
fatalities <- aggregate(FATALITIES~EVTYPE, stormdata, sum)
fatalities <- arrange(fatalities, desc(FATALITIES))
fatalities <- fatalities[1:15, ]
fatalities
## EVTYPE FATALITIES
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
## 11 WINTER STORM 206
## 12 RIP CURRENTS 204
## 13 HEAT WAVE 172
## 14 EXTREME COLD 160
## 15 THUNDERSTORM WIND 133
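The table above shows the labelling problem noted earlier: TSTM WIND and THUNDERSTORM WIND appear as separate events, as do RIP CURRENT and RIP CURRENTS. This report does not merge them, but a minimal sketch of how such labels could be consolidated follows. The two clean-up rules are assumptions chosen for these four rows only, and the totals are copied from the table.

```r
# Totals copied from the fatalities table above
fatal <- data.frame(
  EVTYPE = c("TSTM WIND", "THUNDERSTORM WIND", "RIP CURRENT", "RIP CURRENTS"),
  FATALITIES = c(504, 133, 368, 204)
)

# Assumed clean-up rules: expand the TSTM abbreviation, drop the plural
label <- toupper(trimws(fatal$EVTYPE))
label <- sub("^TSTM", "THUNDERSTORM", label)
label <- sub("^RIP CURRENTS$", "RIP CURRENT", label)
fatal$EVTYPE <- label

# Re-aggregate so that rows sharing a cleaned label are summed together
merged <- aggregate(FATALITIES ~ EVTYPE, fatal, sum)
merged <- merged[order(merged$FATALITIES, decreasing = TRUE), ]
merged
```

Based only on the rows shown in the table, the merged totals would be 637 fatalities for thunderstorm wind and 572 for rip currents. The full data set may contain further variants that these two rules do not catch.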
Now that the data are sorted by injuries and by fatalities, I plot both using bar plots. These plots should show which kinds of events are worst for public health.
par(mfrow=c(1,2), mar=c(12,4,4,2), cex=.75)
barplot(injuries$INJURIES, names.arg = injuries$EVTYPE, las = 3, ylab = "Number of Injuries", main = "Top 15 Causes of Injuries", col="red")
barplot(fatalities$FATALITIES, names.arg=fatalities$EVTYPE, las=3, ylab="Number of Fatalities", main = "Top 15 Causes of Fatalities", col="blue")
Tornadoes are clearly the most dangerous event with respect to public health, with 91,346 injuries and 5,633 fatalities, far more than any other event type. Excessive heat, though not on the scale of tornadoes, ranks second in fatalities and fourth in injuries. Lightning is fifth in both lists, and flooding appears in the top five of each (Flood for injuries, Flash Flood for fatalities), so it is important to be cautious during any of these events and to follow the appropriate safety guidelines.
This dataset stores property and crop damage in an unusual way. Each kind of damage has two columns. The first contains a number of one to four digits. The second contains a multiplier code of “H,” “K,” “M,” or “B,” which correspond to hundred, thousand, million, or billion, respectively. To get dollar amounts, we multiply the PROPDMG (or CROPDMG) column by the PROPDMGEXP (or CROPDMGEXP) column, with each letter replaced by the corresponding number. Rows whose multiplier is not one of these four letters keep a damage value of 0.
# initialize a new column with the total property damage
stormdata$PROPERTY = 0
# multiply the values in the PROPDMG column by the appropriate multiplier and add it to the new column
stormdata[stormdata$PROPDMGEXP == "H", ]$PROPERTY = stormdata[stormdata$PROPDMGEXP == "H", ]$PROPDMG * 10^2
stormdata[stormdata$PROPDMGEXP == "K", ]$PROPERTY = stormdata[stormdata$PROPDMGEXP == "K", ]$PROPDMG * 10^3
stormdata[stormdata$PROPDMGEXP == "M", ]$PROPERTY = stormdata[stormdata$PROPDMGEXP == "M", ]$PROPDMG * 10^6
stormdata[stormdata$PROPDMGEXP == "B", ]$PROPERTY = stormdata[stormdata$PROPDMGEXP == "B", ]$PROPDMG * 10^9
# initialize a new column with the total crop damage
stormdata$CROP = 0
# multiply the values in the CROPDMG column by the appropriate multiplier and add it to the new column
stormdata[stormdata$CROPDMGEXP == "H", ]$CROP = stormdata[stormdata$CROPDMGEXP == "H", ]$CROPDMG * 10^2
stormdata[stormdata$CROPDMGEXP == "K", ]$CROP = stormdata[stormdata$CROPDMGEXP == "K", ]$CROPDMG * 10^3
stormdata[stormdata$CROPDMGEXP == "M", ]$CROP = stormdata[stormdata$CROPDMGEXP == "M", ]$CROPDMG * 10^6
stormdata[stormdata$CROPDMGEXP == "B", ]$CROP = stormdata[stormdata$CROPDMGEXP == "B", ]$CROPDMG * 10^9
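The eight assignments above could also be expressed as a single lookup. The sketch below is one possible alternative, shown on toy rows; the toy values and the names multiplier and factor_used are assumptions for illustration. Codes outside H, K, M, and B map to 0, which matches the behavior of the code above.

```r
# Toy rows; "" and "?" stand in for multiplier codes outside H/K/M/B
toy <- data.frame(
  PROPDMG    = c(2.5, 10, 3, 1, 7),
  PROPDMGEXP = c("K", "M", "B", "", "?")
)

# Named vector mapping each letter to its power of ten
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Look up each code by name; unmatched codes give NA, which is set to 0
factor_used <- unname(multiplier[match(toy$PROPDMGEXP, names(multiplier))])
factor_used[is.na(factor_used)] <- 0

toy$PROPERTY <- toy$PROPDMG * factor_used
toy
```

The same lookup would apply to CROPDMG and CROPDMGEXP without change, since both pairs of columns use the same letter codes.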
Now that the dollar values have been calculated, we can aggregate and sort the data as we did for injuries and fatalities above. I add property and crop damage together into a single total rather than reporting the two separately for each type of event.
economictoll <- aggregate(PROPERTY+CROP ~ EVTYPE, stormdata, sum)
names(economictoll) <- c("Event", "TotalEconomicToll")
economictoll <- arrange(economictoll, desc(TotalEconomicToll))
economictoll <- economictoll[1:15,]
# Divide by 1 billion because the results are in the billions and it makes the scale of the ensuing plot cleaner
economictoll$TotalEconomicToll <- economictoll$TotalEconomicToll/10^9
With the data sorted by economic toll, it is time to create a bar plot to see which events cause the highest economic toll.
par(mar=c(12,4,4,2))
barplot(economictoll$TotalEconomicToll, names.arg = economictoll$Event, las = 3, ylab = "Total Economic Toll (Billions of $)", main = "Top 15 Events by Total Economic Toll", col = "blue")
Based on this plot, the storm events with the highest economic toll involve precipitation and wind. Flooding is by far the highest, followed by Hurricane/Typhoon and Tornado. These results are plausible: flooding damages both buildings and crops, so it should rank high in both property and crop damage. Hurricanes and typhoons mostly strike the coasts and bring high winds and heavy rain, so most of their damage is probably to property. Tornadoes occur mainly in the middle of the country, where much of the land is cropland, but their high winds can also devastate a city, so they can cause a lot of damage of both kinds.
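The reasoning above about property versus crop damage is not checked in this report, because the two figures are summed before aggregation. A minimal sketch of how the split could be kept is shown below. The dollar figures are invented for illustration and are not results from the storm data.

```r
# Toy damage figures in dollars (invented for illustration)
toy <- data.frame(
  EVTYPE   = c("FLOOD", "FLOOD", "TORNADO", "DROUGHT"),
  PROPERTY = c(5e9, 1e9, 3e9, 1e8),
  CROP     = c(2e9, 0, 1e8, 4e9)
)

# cbind() on the left-hand side sums each damage column separately per event
split_toll <- aggregate(cbind(PROPERTY, CROP) ~ EVTYPE, toy, sum)
split_toll$TOTAL <- split_toll$PROPERTY + split_toll$CROP
split_toll <- split_toll[order(split_toll$TOTAL, decreasing = TRUE), ]
split_toll
```

Keeping both columns would show, for each event type, how much of the total comes from property and how much from crops.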