by Leandro Jimenez (Dec, 2015)
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
Storm Data [47Mb]
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis.
Your data analysis must address the following questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
Consider writing your report as if it were to be read by a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events. However, there is no need to make any specific recommendations in your report.
In this assignment, we analyzed data on natural events from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. We first read the data and cleaned up some event types by consulting the codebook. We then aggregated fatalities, injuries, property damage, and crop damage by event type using the aggregate function. After processing and analyzing the data, we summarized in tables and figures the events most harmful to human health and those causing the greatest damage to property and crops. The results show that tornado, thunderstorm wind, flood, and excessive heat are the most harmful events to human health, while flood, hurricane, tornado, storm surge, and hail have the greatest economic consequences.
Load the packages that we need and read the compressed file
require(data.table)
## Loading required package: data.table
require(gridExtra)
## Loading required package: gridExtra
require(ggplot2)
## Loading required package: ggplot2
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:data.table':
##
## between, last
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(plyr)
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
##
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
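# Open the bzip2-compressed CSV; bzfile() takes the path and an optional open mode.
csv <- bzfile("./repdata_data_StormData.csv.bz2")
# read.csv2 assumes "," as the decimal mark, so values such as "25.00" are read as character.
stormdata <- read.csv2(csv, sep = ",", stringsAsFactors = FALSE)
# read.csv2 opens and closes the connection itself, so no further cleanup is needed.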
str(stormdata)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : chr "1.00" "1.00" "1.00" "1.00" ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : chr "97.00" "3.00" "57.00" "89.00" ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : chr "0.00" "0.00" "0.00" "0.00" ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: chr "0.00" "0.00" "0.00" "0.00" ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : chr "0.00" "0.00" "0.00" "0.00" ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : chr "14.00" "2.00" "0.10" "0.00" ...
## $ WIDTH : chr "100.00" "150.00" "123.00" "100.00" ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : chr "0.00" "0.00" "0.00" "0.00" ...
## $ FATALITIES: chr "0.00" "0.00" "0.00" "0.00" ...
## $ INJURIES : chr "15.00" "0.00" "2.00" "2.00" ...
## $ PROPDMG : chr "25.00" "2.50" "25.00" "2.50" ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : chr "0.00" "0.00" "0.00" "0.00" ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : chr "3040.00" "3042.00" "3340.00" "3458.00" ...
## $ LONGITUDE : chr "8812.00" "8755.00" "8742.00" "8626.00" ...
## $ LATITUDE_E: chr "3051.00" "0.00" "0.00" "0.00" ...
## $ LONGITUDE_: chr "8806.00" "0.00" "0.00" "0.00" ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : chr "1.00" "2.00" "3.00" "4.00" ...
summary(stormdata)
## STATE__ BGN_DATE BGN_TIME
## Length:902297 Length:902297 Length:902297
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## TIME_ZONE COUNTY COUNTYNAME
## Length:902297 Length:902297 Length:902297
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## STATE EVTYPE BGN_RANGE
## Length:902297 Length:902297 Length:902297
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## BGN_AZI BGN_LOCATI END_DATE
## Length:902297 Length:902297 Length:902297
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## END_TIME COUNTY_END COUNTYENDN END_RANGE
## Length:902297 Length:902297 Mode:logical Length:902297
## Class :character Class :character NA's:902297 Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## END_AZI END_LOCATI LENGTH
## Length:902297 Length:902297 Length:902297
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## WIDTH F MAG FATALITIES
## Length:902297 Min. :0.0 Length:902297 Length:902297
## Class :character 1st Qu.:0.0 Class :character Class :character
## Mode :character Median :1.0 Mode :character Mode :character
## Mean :0.9
## 3rd Qu.:1.0
## Max. :5.0
## NA's :843563
## INJURIES PROPDMG PROPDMGEXP
## Length:902297 Length:902297 Length:902297
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## CROPDMG CROPDMGEXP WFO
## Length:902297 Length:902297 Length:902297
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## STATEOFFIC ZONENAMES LATITUDE
## Length:902297 Length:902297 Length:902297
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## LONGITUDE LATITUDE_E LONGITUDE_
## Length:902297 Length:902297 Length:902297
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## REMARKS REFNUM
## Length:902297 Length:902297
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
head(stormdata,n=3)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1.00 4/18/1950 0:00:00 0130 CST 97.00 MOBILE AL
## 2 1.00 4/18/1950 0:00:00 0145 CST 3.00 BALDWIN AL
## 3 1.00 2/20/1951 0:00:00 1600 CST 57.00 FAYETTE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0.00 0.00
## 2 TORNADO 0.00 0.00
## 3 TORNADO 0.00 0.00
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0.00 14.00 100.00 3 0.00 0.00
## 2 NA 0.00 2.00 150.00 2 0.00 0.00
## 3 NA 0.00 0.10 123.00 2 0.00 0.00
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15.00 25.00 K 0.00
## 2 0.00 2.50 K 0.00
## 3 2.00 25.00 K 0.00
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040.00 8812.00 3051.00 8806.00 1.00
## 2 3042.00 8755.00 0.00 0.00 2.00
## 3 3340.00 8742.00 0.00 0.00 3.00
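The str() output above shows almost every numeric column stored as character (for example "25.00"), because read.csv2 expects a comma as the decimal mark. As a minimal sketch, assuming a small made-up file rather than the real storm data, read.csv can read a bzip2-compressed CSV directly from its path and parse the numbers on the way in. The names toy, path and storm_toy are illustrative only.

```r
# Toy stand-in for the storm file (hypothetical values, not the real data).
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"),
                  FATALITIES = c(0, 2),
                  PROPDMG = c(25.0, 2.5),
                  PROPDMGEXP = c("K", "M"),
                  stringsAsFactors = FALSE)

# Write it as a bzip2-compressed CSV in a temporary location.
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv uses "." as the decimal mark, and R detects the compression when
# a file is opened for reading, so the numeric columns arrive as numeric.
storm_toy <- read.csv(path, stringsAsFactors = FALSE)
str(storm_toy)

unlink(path)  # remove the temporary file
```

With the columns read this way, the later as.numeric conversions would become unnecessary, although they remain harmless.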
Convert fatalities and injuries to numeric. Then convert the event types (EVTYPE) to upper case, list some of them, and finally convert EVTYPE to a factor
stormdata$FATALITIES <- as.numeric(stormdata$FATALITIES)
stormdata$INJURIES <- as.numeric(stormdata$INJURIES)
##EVTYPE
stormdata$EVTYPE <- toupper(stormdata$EVTYPE)
eventtype <- sort(unique(stormdata$EVTYPE))
stormdata$EVTYPE <- as.factor(stormdata$EVTYPE)
## Show some event types
eventtype[1:30]
## [1] "?" "ABNORMALLY DRY"
## [3] "ABNORMALLY WET" "ABNORMAL WARMTH"
## [5] "ACCUMULATED SNOWFALL" "AGRICULTURAL FREEZE"
## [7] "APACHE COUNTY" "ASTRONOMICAL HIGH TIDE"
## [9] "ASTRONOMICAL LOW TIDE" "AVALANCE"
## [11] "AVALANCHE" "BEACH EROSIN"
## [13] "BEACH EROSION" "BEACH EROSION/COASTAL FLOOD"
## [15] "BEACH FLOOD" "BELOW NORMAL PRECIPITATION"
## [17] "BITTER WIND CHILL" "BITTER WIND CHILL TEMPERATURES"
## [19] "BLACK ICE" "BLIZZARD"
## [21] "BLIZZARD AND EXTREME WIND CHIL" "BLIZZARD AND HEAVY SNOW"
## [23] "BLIZZARD/FREEZING RAIN" "BLIZZARD/HEAVY SNOW"
## [25] "BLIZZARD/HIGH WIND" "BLIZZARD SUMMARY"
## [27] "BLIZZARD WEATHER" "BLIZZARD/WINTER STORM"
## [29] "BLOWING DUST" "BLOWING SNOW"
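The list above contains misspelled duplicates of the same event, such as "AVALANCE" next to "AVALANCHE". One possible way to merge such variants before aggregating is sketched below. The sample labels and the correction table fixes are assumptions made for illustration, not an official NOAA mapping.

```r
# Hypothetical sample of raw labels, including misspellings seen in the list above.
raw <- c("  Avalance", "AVALANCHE", "Beach Erosin", "BEACH EROSION", "blizzard ")

# Normalise case and strip surrounding whitespace first.
clean <- toupper(trimws(raw))

# Illustrative correction table: misspelled label -> intended label.
fixes <- c("AVALANCE" = "AVALANCHE", "BEACH EROSIN" = "BEACH EROSION")

# Replace only the labels that appear in the correction table.
hit <- clean %in% names(fixes)
clean[hit] <- unname(fixes[clean[hit]])

table(clean)
```

Applied to stormdata$EVTYPE before the conversion to a factor, a table like this would fold the variants into a single level, so their fatalities and costs are summed together.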
fatalities <- as.data.table(subset(aggregate(FATALITIES ~ EVTYPE, data = stormdata,
FUN = "sum"), FATALITIES > 0))
fatalities <- fatalities[order(-FATALITIES), ]
# keep the five event types with the most fatalities
top5 <- fatalities[1:5, ]
library(ggplot2)
ggplot(data = top5, aes(EVTYPE, FATALITIES, fill = FATALITIES)) + geom_bar(stat = "identity") + xlab("Event") + ylab("Fatalities") + ggtitle("Fatalities caused by Events (top 5) ") + theme(legend.position = "right")
injuries <- as.data.table(subset(aggregate(INJURIES ~ EVTYPE, data = stormdata,
FUN = "sum"), INJURIES > 0))
injuries <- injuries[order(-INJURIES), ]
# keep the five event types with the most injuries
top5i <- injuries[1:5, ]
ggplot(data = top5i, aes(EVTYPE, INJURIES, fill = INJURIES)) + geom_bar(stat = "identity") + xlab("Event") + ylab("Injuries") + ggtitle("Injuries caused by Events (top 5) ") + theme(legend.position = "right")
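Fatalities and injuries are aggregated separately above. As a rough sketch on made-up numbers, both totals can also be computed in one aggregate call by binding the two columns on the left-hand side of the formula. The data frame toy and the result health are hypothetical names.

```r
# Made-up records, only to illustrate the aggregation.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HEAT", "HEAT", "HAIL"),
  FATALITIES = c(3, 1, 2, 5, 0, 0),
  INJURIES   = c(10, 4, 1, 2, 3, 0),
  stringsAsFactors = FALSE
)

# cbind() on the left-hand side sums both columns per event type.
health <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Rank by fatalities, breaking ties by injuries.
health <- health[order(-health$FATALITIES, -health$INJURIES), ]
head(health, 3)
```

For the bar charts, mapping the x aesthetic to reorder(EVTYPE, -FATALITIES) instead of EVTYPE would draw the bars in ranked order rather than alphabetically.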
Check and clean the damage exponents, then calculate the cost of the damage
### check
unique(stormdata$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
stormdata$PROPDMGEXP <- toupper(stormdata$PROPDMGEXP)
unique(stormdata$PROPDMGEXP)
## [1] "K" "M" "" "B" "+" "0" "5" "6" "?" "4" "2" "3" "H" "7" "-" "1" "8"
table(stormdata$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B H K M
## 4 5 1 40 7 424665 11337
#### clean
calcExp <- function(x, exp = "") {
switch(exp, `-` = x * -1, `?` = x, `+` = x, `1` = x, `2` = x * (10^2), `3` = x *
(10^3), `4` = x * (10^4), `5` = x * (10^5), `6` = x * (10^6), `7` = x *
(10^7), `8` = x * (10^8), H = x * 100, K = x * 1000, M = x * 1e+06,
B = x * 1e+09, x)
}
applyCalcExp <- function(vx, vexp) {
if (length(vx) != length(vexp))
stop("Not same size")
result <- rep(0, length(vx))
for (i in 1:length(vx)) {
result[i] <- calcExp(vx[i], vexp[i])
}
result
}
### calculate the cost
stormdata$EconomicCosts <- applyCalcExp(as.numeric(stormdata$PROPDMG), stormdata$PROPDMGEXP)
summary(stormdata$EconomicCosts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.500e+01 0.000e+00 0.000e+00 4.746e+05 5.000e+02 1.150e+11
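The element-by-element loop above can be slow on roughly 900,000 rows, and the summary shows a negative minimum because "-" multiplies the value by -1. A vectorized alternative is sketched below on made-up values. Two choices in it are assumptions and differ from calcExp: digit codes are read as powers of ten, and "-", "?", "+" and the empty string are given a multiplier of 1. The function name exp_multiplier is hypothetical.

```r
# Return one multiplier per exponent code, for a whole vector at once.
exp_multiplier <- function(code) {
  code <- toupper(code)
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

  # Default multiplier of 1 covers "", "-", "?", "+" (an assumption).
  mult <- rep(1, length(code))

  # Letter codes come from the lookup table.
  is_letter <- code %in% names(lookup)
  mult[is_letter] <- unname(lookup[code[is_letter]])

  # Single digits are treated as powers of ten (also an assumption).
  is_digit <- grepl("^[0-9]$", code)
  mult[is_digit] <- 10^as.numeric(code[is_digit])

  mult
}

# Made-up damage values and codes.
toy <- data.frame(PROPDMG = c(25, 2.5, 1, 15, 3),
                  PROPDMGEXP = c("K", "m", "B", "-", "2"),
                  stringsAsFactors = FALSE)
toy$PropCost <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
toy
```

The same function could be applied to CROPDMG and CROPDMGEXP, and the two results added, if crop damage is to be counted in the economic cost.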
Cost per event type
costs <- as.data.table(subset(aggregate(EconomicCosts ~ EVTYPE, data = stormdata,
FUN = "sum"), EconomicCosts > 0))
costs <- costs[order(-EconomicCosts), ]
library(scales)
top25c <- costs[1:25, ]
ggplot(data = top25c, aes(EVTYPE, EconomicCosts, fill = EconomicCosts)) + geom_bar(stat = "identity") + scale_y_continuous(labels = comma) + xlab("Event") + ylab("Economic costs in $") + ggtitle("Economic costs caused by Events (top 25) ") + coord_flip() + theme(legend.position = "right")
As the previous plot suggests, storms, tornadoes and floods are often part of hurricanes. For this reason, we can consider hurricanes the biggest threat to the US economy, as Katrina demonstrated in 2005. It is worth noting that the #1 factor for crop damage is actually drought, an event that shouldn’t be underestimated, especially in the warmest parts of the US.
Hurricanes, tornadoes, storms and floods are the key events that threaten the safety and economy of the US.