This is the documentation for Course Project 2 for the online course Reproducible Research offered by Johns Hopkins University on Coursera.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The project aims to process the dataset and identify the severe weather events that occur most frequently and those that cause the greatest harm to human health and to the economy of the United States.
The following code downloads the dataset and decompresses it. Note that the file is compressed in the bzip2 (bz2) format.
# Download and decompress only if the CSV is not already present
if (!file.exists("StormData.csv")) {
  library(R.utils)  # provides bunzip2()
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                "stormdata.csv.bz", method = "curl")
  bunzip2("stormdata.csv.bz", "StormData.csv")
}
# Load the full dataset into a data frame
stormdata <- read.csv("StormData.csv")
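As a side note, `read.csv()` can also read a bzip2-compressed file directly, because R's file connections detect the compression when a file is opened for reading. The sketch below illustrates this on a tiny made-up file written to a temporary path; the rows are hypothetical and are not taken from the storm database.

```r
# Write a small bzip2-compressed CSV to a temporary file (illustrative data only).
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,3", "HAIL,0"), con)
close(con)

# read.csv() opens the path through a connection that detects the compression.
demo <- read.csv(tmp)
dim(demo)
```

If this behaviour holds for the downloaded file as well, the explicit `bunzip2()` step becomes optional, at the cost of decompressing on every read.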
What are the dimensions of the dataset?
dim(stormdata)
## [1] 902297 37
The following code snippet extracts and displays the ten most frequently occurring severe weather events across the USA since 1950.
counts <- table(stormdata$EVTYPE)
counts <- as.data.frame(counts)
counts <- counts[order(-counts$Freq), ]
top10occur <- counts[1:10, ]
colorscheme <- c("red", "green", "blue", "yellow", "orange", "purple", "magenta", "skyblue", "cyan", "black")
colnames(top10occur)[1] <- "EVENT"
colnames(top10occur)[2] <- "FREQUENCY"
top10occur
## EVENT FREQUENCY
## 244 HAIL 288661
## 856 TSTM WIND 219940
## 760 THUNDERSTORM WIND 82563
## 834 TORNADO 60652
## 153 FLASH FLOOD 54277
## 170 FLOOD 25326
## 786 THUNDERSTORM WINDS 20843
## 359 HIGH WIND 20212
## 464 LIGHTNING 15754
## 310 HEAVY SNOW 15708
The following code snippet extracts and displays the ten severe weather events that have had the greatest impact on public health (measured separately as fatalities and as injuries) since 1950.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
fatalitycount <- stormdata %>% select(EVTYPE, FATALITIES) %>% group_by(EVTYPE) %>%
summarise(total.fatalities = sum(FATALITIES)) %>% arrange(-total.fatalities)
top10fatal <- fatalitycount[1:10, ]
injurycount <- stormdata %>% select(EVTYPE, INJURIES) %>% group_by(EVTYPE) %>%
summarise(total.injuries = sum(INJURIES)) %>% arrange(-total.injuries)
top10injury <- injurycount[1:10, ]
colnames(top10fatal)[1] <- "EVENT"
colnames(top10fatal)[2] <- "FATALITIES"
top10fatal
## # A tibble: 10 x 2
## EVENT FATALITIES
## <fct> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
colnames(top10injury)[1] <- "EVENT"
colnames(top10injury)[2] <- "INJURIES"
top10injury
## # A tibble: 10 x 2
## EVENT INJURIES
## <fct> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
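The two rankings above are computed in separate passes. As a minimal sketch, both totals can also be obtained in one step; this version uses base R's `aggregate()` rather than the dplyr pipeline used in the report, and the `toy` data frame is hypothetical, standing in for `stormdata`.

```r
# Hypothetical miniature of the storm data (values are made up).
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(2, 5, 1, 0, 3),
  INJURIES   = c(10, 0, 4, 2, 1)
)

# Sum fatalities and injuries per event type in a single call.
health <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Rank by fatalities, breaking ties by injuries.
health <- health[order(-health$FATALITIES, -health$INJURIES), ]
health
```

Keeping both totals in one table makes it easier to see event types that rank high on one measure but not the other.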
The exponent codes in the PROPDMGEXP and CROPDMGEXP columns are interpreted as the following multipliers for the PROPDMG and CROPDMG values:
* H/h = × 10^2
* K = × 10^3
* M/m = × 10^6
* B/b = × 10^9
* 0 to 8 = × 10
* + = × 1
* -/?/blank = × 0
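To make the conversion concrete, here is a minimal sketch of the lookup applied to a few made-up values. The names `multiplier` and `to_usd` are hypothetical helpers introduced for illustration; the digit codes 0 to 8 are mapped to 10, following the report's own conversion code.

```r
# Exponent code -> multiplier, following the list above; digit codes 0-8 are
# mapped to 10 as in the report's conversion code.
multiplier <- c(H = 1e2, h = 1e2, K = 1e3, M = 1e6, m = 1e6, B = 1e9, b = 1e9,
                "+" = 1, "-" = 0, "?" = 0,
                setNames(rep(10, 9), as.character(0:8)))

# Hypothetical helper: convert a damage value and its exponent code to USD.
to_usd <- function(value, exp_code) {
  exp_code <- as.character(exp_code)       # guard against factor input
  mult <- unname(multiplier[exp_code])     # unknown codes become NA
  mult[exp_code == ""] <- 0                # a blank code counts as zero
  value * mult
}

# 25 K -> 25,000; 2.5 M -> 2,500,000; blank -> 0; "+" -> value unchanged
to_usd(c(25, 2.5, 10, 5, 3), c("K", "M", "", "+", "4"))
```

Codes that are not in the table yield `NA` rather than a silent zero, which makes unexpected symbols easier to spot.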
The following code snippet extracts and displays the ten severe weather events that have had the greatest economic impact (property damage plus crop damage) since 1950.
totaldamage<- stormdata %>% select(EVTYPE, PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)
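# Exponent codes are listed explicitly so that the mapping does not depend on
# the locale-specific order returned by sort().
# Lowercase "k" is included as 10^3 on the assumption that it means the same as "K";
# any code missing from this table would otherwise produce NA damage values.
Symbol <- c("", "-", "?", "+", "0", "1", "2", "3", "4", "5", "6", "7", "8", "B", "h", "H", "k", "K", "m", "M")
Multiplier <- c(0, 0, 0, 1, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10^9, 10^2, 10^2, 10^3, 10^3, 10^6, 10^6)
convert.Multiplier <- data.frame(Symbol, Multiplier)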
totaldamage$Prop.Multiplier <- convert.Multiplier$Multiplier[match(totaldamage$PROPDMGEXP, convert.Multiplier$Symbol)]
totaldamage$Crop.Multiplier <- convert.Multiplier$Multiplier[match(totaldamage$CROPDMGEXP, convert.Multiplier$Symbol)]
totaldamage <- totaldamage %>%
  mutate(PROPDMG = PROPDMG * Prop.Multiplier) %>%
  mutate(CROPDMG = CROPDMG * Crop.Multiplier) %>%
  mutate(TOTAL.DMG = PROPDMG + CROPDMG)
totaldamage.total <- totaldamage %>%
  group_by(EVTYPE) %>%
  summarize(TOTAL.DMG.EVTYPE = sum(TOTAL.DMG)) %>%
  arrange(-TOTAL.DMG.EVTYPE) %>%
  mutate(TOTAL.DMG.EVTYPE = TOTAL.DMG.EVTYPE / 10^9)
top10damage <- totaldamage.total[1:10, ]
colnames(top10damage)[1] <- "EVENT"
colnames(top10damage)[2] <- "ECONOMIC IMPACT (in billions of USD)"
top10damage
## # A tibble: 10 x 2
## EVENT `ECONOMIC IMPACT (in billions of USD)`
## <fct> <dbl>
## 1 FLOOD 150.
## 2 HURRICANE/TYPHOON 71.9
## 3 TORNADO 57.4
## 4 STORM SURGE 43.3
## 5 FLASH FLOOD 17.6
## 6 DROUGHT 15.0
## 7 HURRICANE 14.6
## 8 RIVER FLOOD 10.1
## 9 ICE STORM 8.97
## 10 TROPICAL STORM 8.38
The ten most frequently occurring severe weather events since 1950 are plotted below. Hail is the most frequently recorded event type in the USA. The event labels appear to be inconsistent, however: TSTM WIND, THUNDERSTORM WIND and THUNDERSTORM WINDS most likely refer to the same kind of event, so the count for thunderstorm winds is split across several categories.
barplot(top10occur$FREQUENCY, legend.text = top10occur$EVENT, col = colorscheme, main = "Top ten most frequently occurring extreme weather events", xlab = "Event", ylab = "Frequency", ylim = c(0, 300000))
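One way to handle the suspected duplicate labels (TSTM WIND, THUNDERSTORM WIND, THUNDERSTORM WINDS) is to normalize the label text before counting. The sketch below is not part of the original analysis: `normalize_evtype` is a hypothetical helper, its two substitution rules are assumptions that cover only these three spellings, and the frequencies are the ones shown in the top-ten table above.

```r
# Hypothetical helper: collapse a few spelling variants of the same event label.
normalize_evtype <- function(x) {
  x <- toupper(trimws(as.character(x)))
  x <- gsub("TSTM", "THUNDERSTORM", x, fixed = TRUE)  # expand the abbreviation
  x <- sub("WINDS$", "WIND", x)                       # drop the trailing plural
  x
}

# Labels and frequencies taken from the top-ten table in this report.
labels <- c("TSTM WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS", "HAIL")
freq   <- c(219940, 82563, 20843, 288661)

# Re-aggregate the counts under the normalized labels.
merged <- vapply(split(freq, normalize_evtype(labels)), sum, numeric(1))
sort(merged, decreasing = TRUE)
```

Under this normalization the three thunderstorm wind labels add up to 323,346 events, which would place thunderstorm wind ahead of hail (288,661). A full cleanup would need to check the remaining EVTYPE values for other variants.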
Here are the ten severe weather events that have caused the most harm to health (in terms of fatalities and injuries) since 1950. Tornadoes account for by far the most fatalities (5,633) and injuries (91,346) from extreme weather events in the USA; the next-highest totals are 1,903 fatalities (EXCESSIVE HEAT) and 6,957 injuries (TSTM WIND).
barplot(top10fatal$FATALITIES, legend.text = top10fatal$EVENT, col = colorscheme, main = "Top ten most fatal extreme weather events", xlab = "Event", ylab = "Fatalities", ylim = c(0, 6000))
barplot(top10injury$INJURIES, legend.text = top10injury$EVENT, col = colorscheme, main = "Top ten extreme weather events with most injuries", xlab = "Event", ylab = "Injuries", ylim = c(0, 100000))
Here are the ten severe weather events that have had the greatest economic impact since 1950. Floods have had the largest economic impact of any extreme weather event type in the USA, at roughly 150 billion USD, about twice that of the next category, HURRICANE/TYPHOON (71.9 billion USD).
barplot(top10damage$`ECONOMIC IMPACT (in billions of USD)`, legend.text = top10damage$EVENT, col = colorscheme, main = "Top ten extreme weather events with the worst economic impact", xlab = "Event", ylab = "Economic Damage (in billions of USD)", ylim = c(0, 170))
More information about the dataset can be found here.
A more detailed description of the events is available here.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
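The difference in coverage between early and recent years could be checked by counting events per year. The following is a minimal sketch only: it assumes the event start date is stored as text in a column such as `BGN_DATE`, in a form like "4/18/1950 0:00:00", and it uses a few made-up dates rather than the real data.

```r
# Assumed date format: month/day/year followed by a time (hypothetical sample values).
bgn_date <- c("4/18/1950 0:00:00", "6/8/1951 0:00:00",
              "11/15/2011 0:00:00", "4/27/2011 0:00:00")

# Parse the date part and extract the year; the trailing time is ignored.
year <- as.integer(format(as.Date(bgn_date, format = "%m/%d/%Y"), "%Y"))

# Number of recorded events per year.
events_per_year <- table(year)
events_per_year
```

Applied to the full dataset, a table like this would show how the number of recorded events changes over time and could guide a decision to restrict the analysis to more recent years.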