The U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
Based on total amounts from 1996 to 2011, (and by weighting each injury as 10 percent of one fatality,) tornadoes are the deadliest event type in the United States. However, if you order the event types by average population harm per event, it turns out you are nearly 15 times as likely to be harmed if you find yourself dealing with a tsunami than with a tornado. But, tsunamis occur far less often, which is why they do not show up in the top ten most dangerous types in the first analysis.
Based on total property and crop damage amounts from 1996 to 2011, flooding is the costliest event type in the United States. However, if you order the event types by average damage per event, it turns out each hurricane is more than 130 times as costly as each flooding incident. Although they occur far less often, that disparity in terms of average economic impact means that hurricanes are still the second-costliest event type overall.
Check for the file in your working directory. If it does not already exist, download the noaa.csv.bz2 file. Use read.csv to read in the data.
destfile <- "noaa.csv.bz2"
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if(!file.exists(destfile)) {
download.file(fileURL, destfile, method = "curl")
}
noaa <- read.csv(destfile)
In order to answer the two research questions, you only need to keep the BGN_DATE EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP variables. Use the dplyr package to select only those columns.
library(dplyr)
noaa <- tbl_df(noaa)
noaa <- select(noaa, BGN_DATE, EVTYPE, FATALITIES:CROPDMGEXP)
Next, clean up the EVTYPE column. You’ll notice that there are 985 different factors, but according to the National Weather Service, there are actually just 48 different types of recognized storms.
n_distinct(noaa$EVTYPE)
## [1] 985
Note that these 48 event types were standardized after 1996. So filter by all dates on or after 1996-01-01.
noaa$BGN_DATE <- as.POSIXct(strptime(noaa$BGN_DATE, "%m/%d/%Y %T"))
noaa <- filter(noaa, BGN_DATE >= "1996-01-01")
Before filtering by the list of 48 event types mentioned above, first clean some of the data entry errors. Use the toupper function to standardize the variable.
noaa$EVTYPE <- toupper(noaa$EVTYPE)
Then, clean up some of the abbreviations and extraneous notations.
library(stringr)
noaa$EVTYPE <- str_trim(noaa$EVTYPE)
noaa$EVTYPE <- gsub("TSTM", "THUNDERSTORM", noaa$EVTYPE)
noaa$EVTYPE <- gsub("G[0-9][0-9]", "", noaa$EVTYPE)
noaa$EVTYPE <- gsub("/MIX", "", noaa$EVTYPE)
noaa$EVTYPE <- gsub("WINDS", "WIND", noaa$EVTYPE)
noaa$EVTYPE <- gsub("FLD", "FLOOD", noaa$EVTYPE)
noaa$EVTYPE <- gsub("FLOODING", "FLOOD", noaa$EVTYPE)
noaa$EVTYPE <- gsub("/FOREST ", "", noaa$EVTYPE)
noaa$EVTYPE <- gsub("/TYPHOON", " (TYPHOON)", noaa$EVTYPE)
noaa$EVTYPE <- gsub("/HAIL", "", noaa$EVTYPE)
Using the PDF linked above, copy the event types, and save them to a list vector labeled types. Then filter by types. Check to see that you have ~48 distinct types.
types <- toupper(c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze", "Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather"))
noaa <- filter(noaa, EVTYPE %in% types)
noaa$EVTYPE <- as.factor(noaa$EVTYPE)
n_distinct(noaa$EVTYPE)
## [1] 47
Create a new table called noaaPopHealth. Summarize by event type, and add two variables. The first, total.harm, adds the number of fatalities and injuries (weighted by 10 percent). The second, avg.harm, takes total.harm and divides it by the number of instances of each event to calculate the average impact.
noaaPopHealth <- noaa %>%
group_by(EVTYPE) %>%
summarise(total.harm = sum(FATALITIES, INJURIES*.1), avg.harm = total.harm / length(EVTYPE)) %>%
rename(event.type = EVTYPE)
Prepare to adjust the property damage and crop damage variables by their exponentials. Do so, first, by replacing letters with numbers in the exponential columns.
noaa$PROPDMGEXP <- ifelse(noaa$PROPDMGEXP == "K", 1000, ifelse(noaa$PROPDMGEXP == "M", 1000000, ifelse(noaa$PROPDMGEXP == "B", 1000000000, 1)))
noaa$CROPDMGEXP <- ifelse(noaa$CROPDMGEXP == "K", 1000, ifelse(noaa$CROPDMGEXP == "M", 1000000, ifelse(noaa$CROPDMGEXP == "B", 1000000000, 1)))
Create a new table called noaaEcon. Summarize by event type, and add two variables. The first, total.econdamage, adds the total amounts of property damage and crop damage. The second, avg.econdamage, takes total.econdamage and divides it by the number of instances of each event to calculate the average amount of damage.
noaaEcon <- noaa %>%
group_by(EVTYPE) %>%
summarise(prop.econdamage = sum(PROPDMG*PROPDMGEXP), avg.propdamage = prop.econdamage / length(EVTYPE), crop.econdamage = sum(CROPDMG*CROPDMGEXP), avg.cropdamage = crop.econdamage / length(EVTYPE), total.econdamage = sum(prop.econdamage, crop.econdamage), avg.econdamage = total.econdamage / length(EVTYPE)) %>%
rename(event.type = EVTYPE)
In order to save space, isolate the top ten event types by total and average harm.
topPopHealth <- noaaPopHealth %>%
arrange(desc(total.harm)) %>%
top_n(10, total.harm)
| event.type | total.harm | avg.harm |
|---|---|---|
| TORNADO | 3578 | 0.1545 |
| EXCESSIVE HEAT | 2436 | 1.471 |
| FLOOD | 1090 | 0.04494 |
| LIGHTNING | 1065 | 0.08066 |
| FLASH FLOOD | 1054 | 0.02067 |
| THUNDERSTORM WIND | 888.4 | 0.004208 |
| RIP CURRENT | 360.9 | 0.8354 |
| HEAT | 359.2 | 0.5017 |
| HIGH WIND | 343.3 | 0.01724 |
| WINTER STORM | 320.2 | 0.02829 |
topavgPopHealth <- noaaPopHealth %>%
arrange(desc(avg.harm)) %>%
top_n(10, avg.harm)
| event.type | total.harm | avg.harm |
|---|---|---|
| TSUNAMI | 45.9 | 2.295 |
| HURRICANE (TYPHOON) | 191.5 | 2.176 |
| EXCESSIVE HEAT | 2436 | 1.471 |
| RIP CURRENT | 360.9 | 0.8354 |
| AVALANCHE | 238.6 | 0.6312 |
| HEAT | 359.2 | 0.5017 |
| MARINE STRONG WIND | 16.2 | 0.3375 |
| COLD/WIND CHILL | 96.2 | 0.1785 |
| TORNADO | 3578 | 0.1545 |
| HIGH SURF | 105 | 0.1446 |
Then, using the ggplot2 package, plot both of them together, one on top of the other.
library(ggplot2)
library(gridExtra)
ggtotalharm <- ggplot(topPopHealth, aes(x=reorder(event.type, total.harm), y=total.harm)) + geom_bar(stat="identity") + coord_flip() + xlab("") + ylab("Total Population Harm (Fatalities + Injuries*.1)")
ggavgharm <- ggplot(topavgPopHealth, aes(x=reorder(event.type, avg.harm), y=avg.harm)) + geom_bar(stat="identity") + coord_flip() + xlab("") + ylab("Average Population Harm (Fatalities + Injuries*.1) per Event")
grid.arrange(ggtotalharm, ggavgharm, nrow=2, top="Top Ten Harmful Event Types, by Total and Average Population Harm")
In order to save space, isolate the top ten event types by total and average economic impact.
topEcon <- noaaEcon %>%
arrange(desc(total.econdamage)) %>%
top_n(10, total.econdamage)
| event.type | total.econdamage | avg.econdamage |
|---|---|---|
| FLOOD | 1.489e+11 | 6141521 |
| HURRICANE (TYPHOON) | 7.191e+10 | 817201282 |
| TORNADO | 2.49e+10 | 1075424 |
| HAIL | 1.707e+10 | 82186 |
| FLASH FLOOD | 1.656e+10 | 324644 |
| DROUGHT | 1.441e+10 | 5924236 |
| THUNDERSTORM WIND | 8.93e+09 | 42303 |
| TROPICAL STORM | 8.32e+09 | 12199687 |
| WILDFIRE | 8.163e+09 | 1955139 |
| HIGH WIND | 5.882e+09 | 295425 |
topavgEcon <- noaaEcon %>%
arrange(desc(avg.econdamage)) %>%
top_n(10, avg.econdamage)
| event.type | total.econdamage | avg.econdamage |
|---|---|---|
| HURRICANE (TYPHOON) | 7.191e+10 | 817201282 |
| STORM SURGE/TIDE | 4.642e+09 | 31365122 |
| TROPICAL STORM | 8.32e+09 | 12199687 |
| TSUNAMI | 144082000 | 7204100 |
| FLOOD | 1.489e+11 | 6141521 |
| DROUGHT | 1.441e+10 | 5924236 |
| WILDFIRE | 8.163e+09 | 1955139 |
| ICE STORM | 3.658e+09 | 1946732 |
| TORNADO | 2.49e+10 | 1075424 |
| FROST/FREEZE | 1.105e+09 | 822536 |
To best plot these, use the reshape2 package to melt the data frames. This allows you to create a stacked bar chart, showing the impact each storm has on properties and crops.
library(reshape2)
topEconMelt <- topEcon[,c(1:2, 4)]
topEconMelt <- melt(topEconMelt, id = "event.type")
topavgEconMelt <- topavgEcon[,c(1, 3, 5)]
topavgEconMelt <- melt(topavgEconMelt, id = "event.type")
Using the ggplot2 package, first plot total economic impact.
ggplot(topEconMelt, aes(x=reorder(event.type, value), y=value/1000000)) +
geom_bar(stat="identity", aes(fill = variable)) +
xlab("") +
ylab("Total Economic Impact (Millions USD)") +
scale_fill_discrete(labels = c("Property", "Crops"), name = "Damage") +
theme(axis.text.x = element_text(angle = 45, size=8, hjust = 1, vjust = 1)) +
ggtitle("Top Ten Costly Event Types, by Total Damage Amounts")
Next, plot average economic impact.
ggplot(topavgEconMelt, aes(x=reorder(event.type, value), y=value/1000000)) +
geom_bar(stat="identity", aes(fill = variable)) +
xlab("") +
ylab("Average Economic Impact (Millions USD)") +
scale_fill_discrete(labels = c("Property", "Crops"), name = "Damage") +
theme(axis.text.x = element_text(angle = 45, size=8, hjust = 1, vjust = 1)) +
ggtitle("Top Ten Costly Event Types, by Average Damage Amounts")