This is the documentation for Course Project 2 for the online course Reproducible Research offered by Johns Hopkins University on Coursera.

Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The project aims to process the dataset and identify which severe weather events occur most frequently in the USA and which cause the most harm to human health and the economy.

Data

The following code downloads the dataset and decompresses it. The download is a CSV file compressed in bz2 format.

# Download and decompress the data only if the CSV is not already present
if (!file.exists("StormData.csv")) {
  library(R.utils)  # provides bunzip2()
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                "stormdata.csv.bz2", method = "curl")
  bunzip2("stormdata.csv.bz2", "StormData.csv")
}
stormdata <- read.csv("StormData.csv")
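
As an aside, base R can usually read a bzip2-compressed CSV through a `bzfile()` connection, so a separate decompression step is not strictly required. The sketch below shows this on a small made-up data frame written to a temporary file. The toy values and the temporary path are illustrative assumptions, not part of the storm dataset.

```r
# Toy data standing in for the storm data (hypothetical values)
toy <- data.frame(EVTYPE = c("HAIL", "TORNADO"), FATALITIES = c(0, 3))

# Write the toy data as a bzip2-compressed CSV in a temporary location
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() decompresses on the fly when given a bzfile() connection
roundtrip <- read.csv(bzfile(path))
roundtrip
```

Applied to the project data, the same idea would be `read.csv(bzfile("stormdata.csv.bz2"))`, assuming the downloaded file is kept under that name.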

What are the dimensions of the dataset?

dim(stormdata)
## [1] 902297     37

Data Processing

Events by Frequency of Occurrence

The following code snippet extracts and displays the ten most frequently occurring severe weather events across the USA since 1950.

# Count the records for each event type and sort from most to least frequent
counts <- as.data.frame(table(stormdata$EVTYPE))
counts <- counts[order(-counts$Freq), ]
top10occur <- counts[1:10, ]
colnames(top10occur) <- c("EVENT", "FREQUENCY")
# One color per bar, reused by the bar plots in the Results section
colorscheme <- c("red", "green", "blue", "yellow", "orange", "purple", "magenta", "skyblue", "cyan", "black")
top10occur
##                  EVENT FREQUENCY
## 244               HAIL    288661
## 856          TSTM WIND    219940
## 760  THUNDERSTORM WIND     82563
## 834            TORNADO     60652
## 153        FLASH FLOOD     54277
## 170              FLOOD     25326
## 786 THUNDERSTORM WINDS     20843
## 359          HIGH WIND     20212
## 464          LIGHTNING     15754
## 310         HEAVY SNOW     15708

Events by Health Impact

The following code snippet extracts and displays the ten severe weather events that have had the greatest impact on health (in terms of fatalities and injuries) since 1950.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Total fatalities per event type, largest first
fatalitycount <- stormdata %>%
  select(EVTYPE, FATALITIES) %>%
  group_by(EVTYPE) %>%
  summarise(total.fatalities = sum(FATALITIES)) %>%
  arrange(-total.fatalities)
top10fatal <- fatalitycount[1:10, ]
# Total injuries per event type, largest first
injurycount <- stormdata %>%
  select(EVTYPE, INJURIES) %>%
  group_by(EVTYPE) %>%
  summarise(total.injuries = sum(INJURIES)) %>%
  arrange(-total.injuries)
top10injury <- injurycount[1:10, ]
colnames(top10fatal)[1] <- "EVENT"
colnames(top10fatal)[2] <- "FATALITIES"
top10fatal
## # A tibble: 10 x 2
##    EVENT          FATALITIES
##    <fct>               <dbl>
##  1 TORNADO              5633
##  2 EXCESSIVE HEAT       1903
##  3 FLASH FLOOD           978
##  4 HEAT                  937
##  5 LIGHTNING             816
##  6 TSTM WIND             504
##  7 FLOOD                 470
##  8 RIP CURRENT           368
##  9 HIGH WIND             248
## 10 AVALANCHE             224
colnames(top10injury)[1] <- "EVENT"
colnames(top10injury)[2] <- "INJURIES"
top10injury
## # A tibble: 10 x 2
##    EVENT             INJURIES
##    <fct>                <dbl>
##  1 TORNADO              91346
##  2 TSTM WIND             6957
##  3 FLOOD                 6789
##  4 EXCESSIVE HEAT        6525
##  5 LIGHTNING             5230
##  6 HEAT                  2100
##  7 ICE STORM             1975
##  8 FLASH FLOOD           1777
##  9 THUNDERSTORM WIND     1488
## 10 HAIL                  1361
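
The grouped totals above can be cross-checked without dplyr. The following is a minimal base R sketch of the same group, sum and rank steps, run on a small made-up data frame. The toy values are assumptions for illustration only.

```r
# Toy records (hypothetical values) with the same column names as the dataset
toy <- data.frame(
  EVTYPE = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(5, 2, 1, 0, 3)
)

# Sum fatalities within each event type
totals <- tapply(toy$FATALITIES, toy$EVTYPE, sum)

# Rank from largest to smallest and keep the top two
ranked <- sort(totals, decreasing = TRUE)
top2 <- head(ranked, 2)
top2
```

On the full dataset, replacing `toy` with `stormdata` and `2` with `10` should give totals comparable to the tables above.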

Events by Economic Impact

The exponent codes in the PROPDMGEXP and CROPDMGEXP columns are interpreted as the following multipliers:
* H/h = × 10^2
* K = × 10^3
* M/m = × 10^6
* B/b = × 10^9
* + = × 1
* digits 0 to 8 = × 10
* -/?/blank = × 0

The following code snippet extracts and displays the ten severe weather events that have caused the most economic impact since 1950.

# Keep only the columns needed for the damage calculation
totaldamage <- stormdata %>% select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

# Exponent symbols found in PROPDMGEXP, listed explicitly so that the pairing
# with Multiplier does not depend on the locale's sort order
Symbol <- c("", "-", "?", "+", "0", "1", "2", "3", "4", "5", "6", "7", "8",
            "B", "h", "H", "K", "m", "M")
Multiplier <- c(0, 0, 0, 1, 10, 10, 10, 10, 10, 10, 10, 10, 10,
                10^9, 10^2, 10^2, 10^3, 10^6, 10^6)
convert.Multiplier <- data.frame(Symbol, Multiplier)

# Look up the multiplier for each row's exponent symbol
totaldamage$Prop.Multiplier <- convert.Multiplier$Multiplier[match(totaldamage$PROPDMGEXP, convert.Multiplier$Symbol)]
totaldamage$Crop.Multiplier <- convert.Multiplier$Multiplier[match(totaldamage$CROPDMGEXP, convert.Multiplier$Symbol)]

# Convert both damage columns to USD and add them
totaldamage <- totaldamage %>%
  mutate(PROPDMG = PROPDMG * Prop.Multiplier,
         CROPDMG = CROPDMG * Crop.Multiplier,
         TOTAL.DMG = PROPDMG + CROPDMG)

# Total per event type, largest first, in billions of USD
totaldamage.total <- totaldamage %>%
  group_by(EVTYPE) %>%
  summarize(TOTAL.DMG.EVTYPE = sum(TOTAL.DMG)) %>%
  arrange(-TOTAL.DMG.EVTYPE) %>%
  mutate(TOTAL.DMG.EVTYPE = TOTAL.DMG.EVTYPE / 10^9)

top10damage <- totaldamage.total[1:10, ]
colnames(top10damage)[1] <- "EVENT"
colnames(top10damage)[2] <- "ECONOMIC IMPACT (in billions of USD)"
top10damage
## # A tibble: 10 x 2
##    EVENT             `ECONOMIC IMPACT (in billions of USD)`
##    <fct>                                              <dbl>
##  1 FLOOD                                             150.  
##  2 HURRICANE/TYPHOON                                  71.9 
##  3 TORNADO                                            57.4 
##  4 STORM SURGE                                        43.3 
##  5 FLASH FLOOD                                        17.6 
##  6 DROUGHT                                            15.0 
##  7 HURRICANE                                          14.6 
##  8 RIVER FLOOD                                        10.1 
##  9 ICE STORM                                           8.97
## 10 TROPICAL STORM                                      8.38
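
One caveat about the lookup above: the symbol table is built from the values in PROPDMGEXP, so any symbol that occurs only in CROPDMGEXP gets `NA` from `match()`, and that `NA` carries through to the sum for its event type. The sketch below is one possible alternative, not the method used for the results above. It keys the multipliers by name, treats lowercase and uppercase letters alike, and maps blank or unknown symbols to 0. The helper name `to_multiplier` and the choice to treat lowercase `k` as 10^3 are assumptions made for illustration.

```r
# Multipliers keyed by exponent symbol, following the interpretation listed above
exp_multiplier <- c(
  "-" = 0, "?" = 0, "+" = 1,
  "H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9
)
# Digits 0 to 8 all map to a multiplier of 10
exp_multiplier[as.character(0:8)] <- 10

# Hypothetical helper: convert a vector of exponent symbols to numeric multipliers
to_multiplier <- function(exp_code) {
  key <- toupper(as.character(exp_code))  # assumption: "k" means the same as "K"
  m <- unname(exp_multiplier[key])
  m[is.na(m)] <- 0                        # blank or unknown symbols contribute nothing
  m
}

# Toy damage figures (hypothetical values)
damage <- c(25, 2.5, 10) * to_multiplier(c("K", "M", ""))
damage
```

Because this changes how some rows are counted, totals computed this way may differ from the table above.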

Results

Events by Frequency of Occurrence

Here are the ten most commonly occurring severe weather events since 1950. Hail is the most frequently recorded event type, with 288,661 records. The event labels are not fully consistent, however: TSTM WIND, THUNDERSTORM WIND and THUNDERSTORM WINDS most likely refer to the same kind of event. Their combined count (323,346) would exceed that of hail.

barplot(top10occur$FREQUENCY, legend.text = top10occur$EVENT, col = colorscheme, main = "Top ten most frequently occurring extreme weather events", xlab = "Event", ylab = "Frequency", ylim = c(0, 300000))
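
The inconsistent labels noted above could be merged before counting. The sketch below shows one possible normalization on a few made-up labels. The function name `normalize_evtype` and the specific rewrite rules are assumptions for illustration, and the real EVTYPE column likely contains many more variants than these rules cover.

```r
# Hypothetical helper: map a few known label variants onto one spelling
normalize_evtype <- function(x) {
  x <- toupper(trimws(x))                # ignore case and surrounding spaces
  x <- gsub("\\s+", " ", x)              # collapse repeated whitespace
  x <- sub("^TSTM", "THUNDERSTORM", x)   # expand the abbreviation
  x <- sub("^THUNDERSTORM WINDS$", "THUNDERSTORM WIND", x)  # singular form
  x
}

# Toy labels (hypothetical sample, not the real data)
toy <- c("TSTM WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS", " hail", "HAIL")
merged <- table(normalize_evtype(toy))
merged
```

Applying such a function to `stormdata$EVTYPE` before calling `table()` would change the ranking shown above, so the rules should be checked against the documented event types first.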

Events by Health Impact

Here are the ten severe weather events that have caused the most harm to health (in terms of fatalities and injuries) since 1950. Tornadoes account for by far the most fatalities (5,633) and injuries (91,346) due to extreme weather events in the USA. The next highest totals are 1,903 fatalities for excessive heat and 6,957 injuries for TSTM WIND.

barplot(top10fatal$FATALITIES, legend.text = top10fatal$EVENT, col = colorscheme, main = "Top ten most fatal extreme weather events", xlab = "Event", ylab = "Fatalities", ylim = c(0, 6000))

barplot(top10injury$INJURIES, legend.text = top10injury$EVENT, col = colorscheme, main = "Top ten extreme weather events with most injuries", xlab = "Event", ylab = "Injuries", ylim = c(0, 100000))

Events by Economic Impact

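Here are the ten severe weather events that have caused the most economic impact since 1950. Floods have the largest economic impact of any extreme weather event type in the USA, at roughly 150 billion USD. This is about twice the total for the next category, HURRICANE/TYPHOON (71.9 billion USD).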

barplot(top10damage$`ECONOMIC IMPACT (in billions of USD)`, legend.text = top10damage$EVENT, col = colorscheme, main = "Top ten extreme weather events with the worst economic impact", xlab = "Event", ylab = "Economic Damage (in billions of USD)", ylim = c(0, 170))

Further Information

More information about the dataset can be found here.

A more detailed description of the events is available here.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
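
The difference in coverage between early and recent years can be inspected by counting records per year. The sketch below does this on a few made-up date strings. It assumes the dataset has a begin-date column named `BGN_DATE` holding month/day/year strings with a trailing time. Both the column name and the layout are assumptions that should be checked against the data.

```r
# Toy dates in the month/day/year layout assumed for BGN_DATE (hypothetical values)
bgn_date <- c("4/18/1950 0:00:00", "6/8/1951 0:00:00",
              "11/28/2011 0:00:00", "11/30/2011 0:00:00", "7/4/2011 0:00:00")

# as.Date() reads only as much of each string as the format needs,
# so the trailing time of day is ignored
year <- as.integer(format(as.Date(bgn_date, format = "%m/%d/%Y"), "%Y"))

# Number of records in each year
events_per_year <- table(year)
events_per_year
```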