This report is the result of Peer Assignment 2 of the “Reproducible Research” course from Johns Hopkins University on Coursera. The goal of the assignment was to explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to determine which types of weather events have the greatest impact on public health and the economy, respectively.
Tornadoes were found to cause by far the most injuries and fatalities. Floods, hurricanes, tornadoes and storm surges are responsible for the highest amounts of property damage, while drought, floods and ice storms cause the most crop damage.
We start by loading the packages required for this project. The data from the NOAA database is then loaded into the data frame stormData. To keep loading time down, this report assumes that the compressed .csv.bz2 file is already located in the working directory.
library(dplyr)
library(ggplot2)
library(gridExtra)
stormData <- read.csv("repdata_data_StormData.csv.bz2")
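The load step can be made a little more defensive. The following is a minimal sketch, assuming the compressed file sits in the working directory under the name used above; the helper name loadStormData is hypothetical, and stringsAsFactors = FALSE is an added choice that keeps the text columns as character vectors.

```r
# Hypothetical helper: fail early with a clear message if the file is missing.
loadStormData <- function(path = "repdata_data_StormData.csv.bz2") {
  if (!file.exists(path)) {
    stop("Storm data file not found: ", path)
  }
  # read.csv reads bzip2-compressed files directly, no manual unpacking needed
  read.csv(path, stringsAsFactors = FALSE)
}
```

With the file in place, `stormData <- loadStormData()` would give the same data frame as the call above, apart from the column types.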
We can subset the data frame to include only the event type, injuries and fatalities, as well as property and crop damage (including their exponents). These are the only columns needed for the analysis in this project.
stormData <- stormData %>%
  select(EVTYPE, INJURIES, FATALITIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
The EVTYPE column contains a lot of different names for events that are seemingly the same. Looking at the top 20 names we see a potential to combine some of these into the same factor. A good example is the various types of thunderstorm winds.
stormData$EVTYPE <- tolower(stormData$EVTYPE)
head(sort(table(stormData$EVTYPE), decreasing = TRUE), 20)
##
##                     hail                tstm wind        thunderstorm wind
##                   288661                   219942                    82564
##                  tornado              flash flood                    flood
##                    60652                    54277                    25327
##       thunderstorm winds                high wind                lightning
##                    20843                    20214                    15754
##               heavy snow               heavy rain             winter storm
##                    15708                    11742                    11433
##           winter weather             funnel cloud         marine tstm wind
##                     7045                     6844                     6175
## marine thunderstorm wind               waterspout              strong wind
##                     5812                     3796                     3569
##     urban/sml stream fld                 wildfire
##                     3392                     2761
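To see how many spellings a single kind of event has, the matching names can be listed with a pattern. The sketch below uses a small made-up vector in place of the lower-cased stormData$EVTYPE; its values are illustrative, although they follow the spellings in the table above.

```r
# Toy stand-in for tolower(stormData$EVTYPE); values are illustrative only.
evtype <- c("tstm wind", "thunderstorm wind", "thunderstorm winds",
            "hail", "marine tstm wind", "tstm wind", "tornado")

# Count every distinct name that mentions a thunderstorm wind
variants <- table(grep("(thunderstorm|tstm).*wind", evtype, value = TRUE))
sort(variants, decreasing = TRUE)
```

Running the same pattern against the full column would show how many event names would be affected before any of them are merged.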
I have decided to combine only the instances of thunderstorm winds, as these events are spread over three frequently occurring event names and will therefore have an impact on the overall totals when combined.
stormData$EVTYPE[grepl("^thunderstorm.*wind.*|^tstm.*wind.*", stormData$EVTYPE)] <- "thunderstorm wind"
head(sort(table(stormData$EVTYPE), decreasing = TRUE), 20)
##
##        thunderstorm wind                     hail                  tornado
##                   324696                   288661                    60652
##              flash flood                    flood                high wind
##                    54277                    25327                    20214
##                lightning               heavy snow               heavy rain
##                    15754                    15708                    11742
##             winter storm           winter weather             funnel cloud
##                    11433                     7045                     6844
##         marine tstm wind marine thunderstorm wind               waterspout
##                     6175                     5812                     3796
##              strong wind     urban/sml stream fld                 wildfire
##                     3569                     3392                     2761
##                 blizzard                  drought
##                     2719                     2488
Note that marine thunderstorm winds remain a separate category, split over two event names (marine thunderstorm wind and marine tstm wind). There is clearly room for further clean-up, but not to a degree that would seriously change the ordering of the most frequent events.
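The marine names are left alone because the pattern used above is anchored to the start of the name. The sketch below illustrates this on a made-up vector standing in for stormData$EVTYPE, and then shows one possible extra step that merges the two marine spellings. That extra step is an assumption for illustration and is not applied in this report.

```r
# Toy stand-in for stormData$EVTYPE; values are illustrative only.
evtype <- c("tstm wind", "thunderstorm winds", "marine tstm wind",
            "marine thunderstorm wind", "hail")

# The "^" anchor restricts the match to names that start with the pattern,
# so names beginning with "marine" are not touched.
isTstm <- grepl("^thunderstorm.*wind.*|^tstm.*wind.*", evtype)
evtype[isTstm] <- "thunderstorm wind"

# Possible extra step (not part of this report): merge the marine spellings
# by expanding the abbreviation.
isMarine <- grepl("^marine", evtype)
evtype[isMarine] <- sub("tstm", "thunderstorm", evtype[isMarine], fixed = TRUE)
table(evtype)
```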
The next processing step is to combine the damage values with their exponent codes, such as K for thousands, M for millions and B for billions, to obtain the actual damage amounts in US dollars.
# Convert a damage value and its exponent code into a dollar amount.
corrDamages <- function(dmg, exp) {
  if (exp %in% c("+"))
    return(dmg * 1)
  else if (exp %in% c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9"))
    return(dmg * 10)            # numeric codes are treated as a factor of 10
  else if (exp %in% c("h", "H"))
    return(dmg * 100)           # hundreds
  else if (exp %in% c("k", "K"))
    return(dmg * 1000)          # thousands
  else if (exp %in% c("m", "M"))
    return(dmg * 1000000)       # millions
  else if (exp %in% c("b", "B"))
    return(dmg * 1000000000)    # billions
  else
    return(0)                   # "", "-", "?" or any unrecognized code
}
stormData$corrPropDmg <- mapply(corrDamages, stormData$PROPDMG, stormData$PROPDMGEXP)
stormData$corrCropDmg <- mapply(corrDamages, stormData$CROPDMG, stormData$CROPDMGEXP)
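Calling corrDamages once per row through mapply can be slow on a table of this size. A vectorized variant is sketched below as an alternative. The helper name expMultiplier and the toy data are assumptions for illustration; the multipliers mirror the ones in corrDamages above, with any unlisted code mapped to 0.

```r
# Hypothetical helper: map a whole vector of exponent codes to multipliers.
expMultiplier <- function(exp) {
  exp <- toupper(as.character(exp))
  mult <- rep(0, length(exp))               # "", "-", "?" and unknown codes
  mult[exp == "+"] <- 1
  mult[exp %in% as.character(0:9)] <- 10    # numeric codes
  mult[exp == "H"] <- 1e2
  mult[exp == "K"] <- 1e3
  mult[exp == "M"] <- 1e6
  mult[exp == "B"] <- 1e9
  mult
}

# Toy data; the values are made up for illustration.
toy <- data.frame(dmg = c(25, 2.5, 1, 3, 7),
                  exp = c("K", "m", "B", "5", ""))
toy$corr <- toy$dmg * expMultiplier(toy$exp)
toy
```

With the real data this would read `stormData$corrPropDmg <- stormData$PROPDMG * expMultiplier(stormData$PROPDMGEXP)`, and likewise for the crop columns.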
Finally, we create data frames containing the total injuries, fatalities, property damage and crop damage per event type, each sorted in descending order so that the event types with the largest impact come first.
injuryDf <- data.frame(stormData %>% group_by(EVTYPE) %>%
  summarize(injuries = sum(INJURIES)) %>% arrange(desc(injuries)))
fatalityDf <- data.frame(stormData %>% group_by(EVTYPE) %>%
  summarize(fatalities = sum(FATALITIES)) %>% arrange(desc(fatalities)))
propDmgDf <- data.frame(stormData %>% group_by(EVTYPE) %>%
  summarize(propDmg = sum(corrPropDmg)) %>% arrange(desc(propDmg)))
cropDmgDf <- data.frame(stormData %>% group_by(EVTYPE) %>%
  summarize(cropDmg = sum(corrCropDmg)) %>% arrange(desc(cropDmg)))
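The group, sum and sort pattern used four times above can be checked on a small example. The data frame below is made up for illustration; only the dplyr verbs are the same as in the report.

```r
library(dplyr)

# Toy data; the values are made up for illustration.
toy <- data.frame(
  EVTYPE   = c("tornado", "flood", "tornado", "hail", "flood"),
  INJURIES = c(10, 2, 5, 0, 1)
)

# One row per event type, total injuries, largest total first
toyInjuries <- toy %>%
  group_by(EVTYPE) %>%
  summarize(injuries = sum(INJURIES)) %>%
  arrange(desc(injuries))

as.data.frame(toyInjuries)
```

The two tornado rows collapse into one row with 15 injuries, followed by flood with 3 and hail with 0.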
Here we will present tables and plots of the top 10 weather events for each type of consequence listed (injuries, fatalities, property damage and crop damage).
We start with tables and a plot of the top 10 events with regard to injuries and fatalities.
head(injuryDf, 10)
## EVTYPE injuries
## 1 tornado 91346
## 2 thunderstorm wind 9469
## 3 flood 6789
## 4 excessive heat 6525
## 5 lightning 5230
## 6 heat 2100
## 7 ice storm 1975
## 8 flash flood 1777
## 9 hail 1361
## 10 winter storm 1321
head(fatalityDf, 10)
## EVTYPE fatalities
## 1 tornado 5633
## 2 excessive heat 1903
## 3 flash flood 978
## 4 heat 937
## 5 lightning 816
## 6 thunderstorm wind 709
## 7 flood 470
## 8 rip current 368
## 9 high wind 248
## 10 avalanche 224
g1 <- ggplot(injuryDf[1:10,], aes(x = reorder(EVTYPE, desc(injuries)), y = injuries)) +
geom_col() + theme(plot.margin = unit(c(2,1,1,1), "lines"), axis.text.x = element_text(angle=30, hjust=1, vjust=1)) + xlab("Event type") + ylab("Total number of injuries")
g2 <- ggplot(fatalityDf[1:10,], aes(x = reorder(EVTYPE, desc(fatalities)), y = fatalities)) +
geom_col() + theme(plot.margin = unit(c(2,1,1,1), "lines"), axis.text.x = element_text(angle=30, hjust=1, vjust=1)) + xlab("Event type") + ylab("Total number of fatalities")
grid.arrange(g1, g2, nrow = 1, top = "Top 10 weather events by number of injuries and fatalities")
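As we can see, tornadoes cause far more injuries and fatalities than any other event type: 91,346 injuries and 5,633 fatalities, against 9,469 injuries for thunderstorm wind and 1,903 fatalities for excessive heat, the respective runners-up.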
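For a rough sense of scale, the tornado share within each top 10 list can be computed from the table values above. Note that these are shares of the top 10 totals only, not of all events in the database.

```r
# Values copied from the two top 10 tables above
injuries <- c("tornado" = 91346, "thunderstorm wind" = 9469, "flood" = 6789,
              "excessive heat" = 6525, "lightning" = 5230, "heat" = 2100,
              "ice storm" = 1975, "flash flood" = 1777, "hail" = 1361,
              "winter storm" = 1321)
fatalities <- c("tornado" = 5633, "excessive heat" = 1903, "flash flood" = 978,
                "heat" = 937, "lightning" = 816, "thunderstorm wind" = 709,
                "flood" = 470, "rip current" = 368, "high wind" = 248,
                "avalanche" = 224)

# Percentage of each top 10 total that is attributed to tornadoes
injuryShare   <- 100 * injuries[["tornado"]] / sum(injuries)
fatalityShare <- 100 * fatalities[["tornado"]] / sum(fatalities)
round(c(injuries = injuryShare, fatalities = fatalityShare), 1)
```

On these figures tornadoes account for roughly 71% of the injuries and 46% of the fatalities within the top 10.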
Finally, we present tables and a plot of the top 10 events with regard to property and crop damage.
head(propDmgDf, 10)
## EVTYPE propDmg
## 1 flood 1.45e+11
## 2 hurricane/typhoon 6.93e+10
## 3 tornado 5.69e+10
## 4 storm surge 4.33e+10
## 5 flash flood 1.61e+10
## 6 hail 1.57e+10
## 7 hurricane 1.19e+10
## 8 thunderstorm wind 9.75e+09
## 9 tropical storm 7.70e+09
## 10 winter storm 6.69e+09
head(cropDmgDf, 10)
## EVTYPE cropDmg
## 1 drought 1.40e+10
## 2 flood 5.66e+09
## 3 river flood 5.03e+09
## 4 ice storm 5.02e+09
## 5 hail 3.03e+09
## 6 hurricane 2.74e+09
## 7 hurricane/typhoon 2.61e+09
## 8 flash flood 1.42e+09
## 9 extreme cold 1.31e+09
## 10 thunderstorm wind 1.22e+09
g1 <- ggplot(propDmgDf[1:10,], aes(x = reorder(EVTYPE, desc(propDmg)), y = propDmg)) +
  geom_col() +
  theme(plot.margin = unit(c(2,1,1,1), "lines"), axis.text.x = element_text(angle=30, hjust=1, vjust=1)) +
  xlab("Event type") + ylab("Total property damage (USD)")
g2 <- ggplot(cropDmgDf[1:10,], aes(x = reorder(EVTYPE, desc(cropDmg)), y = cropDmg)) +
  geom_col() +
  theme(plot.margin = unit(c(2,1,1,1), "lines"), axis.text.x = element_text(angle=30, hjust=1, vjust=1)) +
  xlab("Event type") + ylab("Total crop damage (USD)")
grid.arrange(g1, g2, nrow = 1, top = "Top 10 weather events by total property and crop damage")
Here it is clearly visible that floods, hurricanes, tornadoes and storm surges cause the most property damage, while drought, floods and ice storms cause the most crop damage.