In the following report, we will summarize and chart data contained in the NOAA (U.S. National Oceanic and Atmospheric Administration) Storm Database pertaining to severe weather events occurring domestically from 1950-2011. This analysis will specifically focus on the following questions:
Summaries and charts will be used to underscore which events result in the greatest negative impact, thereby signalling to authorities how best to prioritize resources for maximum harm reduction. We found that health was most greatly impacted by tornadoes, while floods had the most deleterious economic consequences.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
Storm Data [47Mb] https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf National Climatic Data Center Storm Events FAQ https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
# Load necessary libraries
library(dplyr)
library(ggplot2)
library(knitr)
library(gridExtra)
library(scales)
library(reshape2)
# Download file
if (!file.exists("StormData.csv.bz2")) {
fileURL <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
download.file(fileURL, destfile='StormData.csv.bz2')
}
# Read into table - only read file if not in cache
if(!file.exists("stormData.Rdata")) {
stormData <- read.csv(bzfile('StormData.csv.bz2'),header=TRUE, stringsAsFactors = FALSE)
save(stormData, file = "stormData.Rdata")
} else {
load("stormData.Rdata")
}
Note: We will be focusing on EVTYPE (type of weather event), FATALITIES (count of fatalities), INJURIES (count of injuries), PROPDMG (a numerical measure of property damage), PROPDMGEXP (an exponent used as a multiplication factor for PROPDMG), CROPDMG (a numerical measure of crop damage), and CROPDMGEXP (an exponent used as a multiplication factor for CROPDMG).
# View structure of the data
str(stormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
# Select relevant fields for analysis
stormAnalysis <- select(stormData, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
str(stormAnalysis)
## 'data.frame': 902297 obs. of 7 variables:
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
The PROPDMGEXP and CROPDMGEXP fields are populated by character values meant to denote numerical values (ex. “K” = 1000). Some values are not applicable and stripped from the data for the purpose of the downstream analysis (ex. “+”, “5”, “6”, “0”).
Once these exponent fields are cleaned, they are assigned the appropriate numerical values. Finally, they are multiplied with their corresponding PROPDMG and CROPDMG columns to arrive at the true total amount.
# Convert exponential Property Damage and Crop Damage to number values
## Find unique values for PROPDMGEXP
unique(stormAnalysis$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
## Assign based on number value. Using 1 for non-assignables, and 0 for null.
stormAnalysis2 <- stormAnalysis
stormAnalysis_PROP <- mutate(stormAnalysis2, PROPDMGEXP_NUM = ifelse(PROPDMGEXP == "K" ,1000,
ifelse(PROPDMGEXP == "M" | PROPDMGEXP == "m" ,1000000,
ifelse(PROPDMGEXP == "B",1000000000,
ifelse(PROPDMGEXP == "H" | PROPDMGEXP == "h", 100,
ifelse(PROPDMGEXP == "", 0, 1))
))))
## Multiply the new numeric exponent field by the property damage to find true value. Add to dataframe as PROP_DAMAGE
stormAnalysis_PROP$PROP_DAMAGE <- stormAnalysis_PROP$PROPDMGEXP_NUM * stormAnalysis_PROP$PROPDMG
rm(stormAnalysis2)
## Follow same steps for Crop Damage...
## Find unique values for CROPDMGEXP
unique(stormAnalysis$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
## Assign based on number value. Using 1 for non-assignables, and 0 for null.
stormAnalysis2 <- stormAnalysis
stormAnalysis_CROP <- mutate(stormAnalysis2, CROPDMGEXP_NUM = ifelse(CROPDMGEXP == "K" | CROPDMGEXP == "k" ,1000,
ifelse(CROPDMGEXP == "M" | CROPDMGEXP == "m" ,1000000,
ifelse(CROPDMGEXP == "B",1000000000,
ifelse(CROPDMGEXP == "0", 0,
ifelse(CROPDMGEXP == "", 0, 1))
))))
## Multiply the new numeric exponent field by the crop damage to find true value. Add to dataframe as CROP_DAMAGE
stormAnalysis_CROP$CROP_DAMAGE <- stormAnalysis_CROP$CROPDMGEXP_NUM * stormAnalysis_CROP$CROPDMG
rm(stormAnalysis2)
Tornadoes are at the top of the list, followed by heat related events and flash floods.
FATALITY <- stormAnalysis %>%
group_by(EVTYPE) %>%
summarise(total = sum(FATALITIES)) %>%
arrange(desc(total))
FATALITY[1:10,]
## Source: local data frame [10 x 2]
##
## EVTYPE total
## (chr) (dbl)
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
Again, Tornadoes are at the top, followed by wind, floods and excessive heat.
INJURY <- stormAnalysis %>%
group_by(EVTYPE) %>%
summarise(total = sum(INJURIES)) %>%
arrange(desc(total))
INJURY[1:10,]
## Source: local data frame [10 x 2]
##
## EVTYPE total
## (chr) (dbl)
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
We observe from the following grid-chart that tornadoes are by far the leading cause of both fatality and injury from the sampled data. The charts display the top ten event types in diminishing order of frequency.
## Display data in side-by-side bar chart form.
### Redefining tables for chart (choosing top 10 and sorting the EVTYPE)
FATALITY <- FATALITY[1:10,]
FATALITY$EVTYPE <- factor(FATALITY$EVTYPE, levels = FATALITY$EVTYPE[order(FATALITY$total)])
INJURY <- INJURY[1:10,]
INJURY$EVTYPE <- factor(INJURY$EVTYPE, levels = INJURY$EVTYPE[order(INJURY$total)])
g1 <- ggplot(FATALITY, aes(x = EVTYPE, y = total)) +
geom_bar(stat = "identity", fill = "green") +
scale_y_continuous() +
coord_flip() +
labs(title = "Top 10 Fatality", x = "Weather Event", y = "Count of Fatalities")
g2 <- ggplot(INJURY, aes(x = EVTYPE, y = total)) +
geom_bar(stat = "identity", fill = "blue") +
scale_y_continuous() +
coord_flip() +
labs(title = "Top 10 Injury", x = "Weather Event", y = "Count of Injuries")
grid.arrange(g1, g2, ncol = 2, top = "Top 10 Harmful Events to Population Health: Fatalities / Injuries")
First, let’s examine a summary of the top ten events causing property damage. Floods, Hurricane/Typhoons and Tornadoes are the top three causes, with floods as the dominant event.
## Top 10 event types impacting property and crop damage
## Top 10 by Property
### Remove all but required fields
PROPERTY_CHART <- select(stormAnalysis_PROP, EVTYPE, PROP_DAMAGE)
PROPERTY <- PROPERTY_CHART %>%
group_by(EVTYPE) %>%
summarise(total = sum(PROP_DAMAGE)) %>%
arrange(desc(total))
PROPERTY10 <- PROPERTY[1:10,]
### display "total" as currency - Create separate table for displayed version (will use numberic version for combined chart below)
PROPERTY_DISPLAY <- PROPERTY10
PROPERTY_DISPLAY$total <- dollar(PROPERTY_DISPLAY$total)
PROPERTY_DISPLAY
## Source: local data frame [10 x 2]
##
## EVTYPE total
## (chr) (chr)
## 1 FLOOD $144,657,709,800
## 2 HURRICANE/TYPHOON $69,305,840,000
## 3 TORNADO $56,937,160,776
## 4 STORM SURGE $43,323,536,000
## 5 FLASH FLOOD $16,140,811,860
## 6 HAIL $15,732,267,486
## 7 HURRICANE $11,868,319,010
## 8 TROPICAL STORM $7,703,890,550
## 9 WINTER STORM $6,688,497,251
## 10 HIGH WIND $5,270,046,295
Next, we’ll look at the top ten weather events causing crop damage. Drought is the top cause, followed by Flood, River Flood and Ice Storm.
## Top 10 by Crop
### Remove all but required fields
CROP_CHART <- select(stormAnalysis_CROP, EVTYPE, CROP_DAMAGE)
CROP <- CROP_CHART %>%
group_by(EVTYPE) %>%
summarise(total = sum(CROP_DAMAGE)) %>%
arrange(desc(total))
CROP10 <- CROP[1:10,]
### display "total" as currency - Create separate table for displayed version (will use numberic version for combined chart below)
CROP_DISPLAY <- CROP
CROP_DISPLAY$total <- dollar(CROP_DISPLAY$total)
CROP_DISPLAY
## Source: local data frame [985 x 2]
##
## EVTYPE total
## (chr) (chr)
## 1 DROUGHT $13,972,566,000
## 2 FLOOD $5,661,968,450
## 3 RIVER FLOOD $5,029,459,000
## 4 ICE STORM $5,022,113,500
## 5 HAIL $3,025,954,450
## 6 HURRICANE $2,741,910,000
## 7 HURRICANE/TYPHOON $2,607,872,800
## 8 FLASH FLOOD $1,421,317,100
## 9 EXTREME COLD $1,292,973,000
## 10 FROST/FREEZE $1,094,086,000
## .. ... ...
Finally, we produce a chart displaying the top ten weather events by total overall damage (property combined with crop). Floods are the clearly dominant cause of economic damage.
## Display totals on a bar chart, thus outlining which events caused the most overall damage
PROP_CROP <- full_join(PROPERTY, CROP, by = "EVTYPE")
PROP_CROP <- mutate(PROP_CROP, GRAND_TOTAL = total.x + total.y)
## Take only top 10 by GRAND_TOTAL
PROP_CROP <- arrange(PROP_CROP, desc(GRAND_TOTAL))
PROP_CROP <- PROP_CROP[1:10,]
## Remove subtotals and display GRAND TOTAL as dollar amount in TOTAL version
PROP_CROP_TOTAL <- select(PROP_CROP, EVTYPE, GRAND_TOTAL)
#PROP_CROP_TOTAL$GRAND_TOTAL <- dollar(PROP_CROP_TOTAL$GRAND_TOTAL)
# Create chart of top 10 total economic damage
## Sort by factor
PROP_CROP_TOTAL$EVTYPE <- factor(PROP_CROP_TOTAL$EVTYPE, levels = PROP_CROP_TOTAL$EVTYPE[order(PROP_CROP_TOTAL$GRAND_TOTAL, decreasing = TRUE)])
g3 <- ggplot(PROP_CROP_TOTAL, aes(x = EVTYPE, y = GRAND_TOTAL)) +
geom_bar(stat = "identity", fill = "red") +
scale_y_continuous(labels = comma) +
#coord_flip() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Top 10 Total Economic Damage", x = "Weather Event", y = "Total Amount (in dollars)")
g3