To find out which weather types caused the most economical damage and damage to human health in the USA between 1950 and 2011, the NOAA Storm Database was explored. The set consists of property damage, crop damage, injuries and fatalities for all registered weather events in the USA for this period. This report shows how the data was processed, analyzed and visualized. Analysis shows that tornado’s cause most damage to properties, crops, and human health. Floods cause much proiperty damage and drought causes most damage to crops. The South and Midwest seem to have the most injuries, fatalities and damage caused by extreme weather. The West seems relatively safe.
In order to succesfully analyze the data, the set is processed as follows:
First, the required packages are attached and the right options are set.
## Attach packages
require(knitr)
require(R.utils)
require(dplyr)
require(lubridate)
require(data.table)
require(reshape2)
require(ggplot2)
require(scales)
require(ggmap)
require(maptools)
require(grid)
require(gridExtra)
require(RColorBrewer)
## Set global chunk options:
opts_chunk$set(cache=TRUE, cache.path = 'DocumentName_cache/', fig.path='figure/')
## Set plotting options
theme_set(theme_grey(base_size = 16))
If necassary, the data is downloaded from the original source (the National Climatic Data Center, National Oceanic and Atmposperic Administration). The file is unzipped, read, and the relevant columns are selected:
BGN_DATE - Start date and time of storm eventSTATE - Location of storm eventEVTYPE - Description of storm eventFATALITIES - Ammount of fatalities per storm eventINJURIES - Ammount of injuries per storm eventPROPDMG - Property damage per storm eventPROPDMGEXP - Character indicating unit size of PROPDMGCROPDMG - Crop damage per storm eventCROPDMGEXP - Character indicating unit size of CROPDMGThe variables are transformed into the right format and put into a a new data frame called data.
if(!file.exists("StormData.csv")) {
download.file(url="https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile="StormData.csv.bz2")
bunzip2(filename="StormData.csv.bz2", destname="StormData.csv", remove=TRUE)
}
file <- read.csv("StormData.csv")
data <- select(file, BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES,
PROPDMG,PROPDMGEXP, CROPDMG, CROPDMGEXP)
data$BGN_DATE <- as.POSIXct(strptime(data$BGN_DATE,"%m/%d/%Y %H:%M:%S"))
In the original set, the damage to property and crops in dollars are coded in two seperate variables: PROPDMG and CROPDMG, and PROPDMGEXP and CROPDMGEXP. Followinging the NOAA Storms Database codebook, PROP- and CROPDMG are multiplied by 1000 when PROP- or CROPDMGEXP is ‘k’, by 1000000 when it is ‘m’, and by 0 when there is no value (cases do not have to match). Values that do not correspond to the k-, or m-multiplier in the NOAA Storm Database Codebook are ignored.
data <- data %>%
## creating new column DMGproperty for property damage in dollars.
mutate(DMGproperty = ifelse(grepl("k", PROPDMGEXP, ignore.case=TRUE, perl=TRUE), PROPDMG*1000,
ifelse(grepl("m", PROPDMGEXP, ignore.case=TRUE, perl=TRUE),PROPDMG*1000000,
ifelse(grepl("", PROPDMGEXP, ignore.case=TRUE, perl=TRUE),0,NA))),
## creating new column DMGcrops for crop damage in dollars
DMGcrops = ifelse(grepl("k", CROPDMGEXP, ignore.case=TRUE, perl=TRUE), CROPDMG*1000,
ifelse(grepl("m", CROPDMGEXP, ignore.case=TRUE, perl=TRUE),CROPDMG*1000000,
ifelse(grepl("", PROPDMGEXP, ignore.case=TRUE, perl=TRUE),0,NA))),
## creating new column DMGtotal which the total damage in dollars
DMGtotal=DMGproperty+DMGcrops
)
Although the National Climatic Data Center distinguishes a total of 48 unique Storm Types ( EVTYPES ), the number of unique EVTYPES in the data lay far beyond that number.Propably as a result of inconsistent coding.
summarize(data,n_distinct(EVTYPE))
## n_distinct(EVTYPE)
## 1 985
One approach to fix this would be to match each of the observations into one of 48 ‘official’ EVTYPES. However, some of the descriptions do not fit nicely into just one of those (e.g.: EXTREME HEAT AND DROUGHT, COLD WINTERWEATHER, etc.). To determine for which event types to use, I first extracted all unique words (seperated by a space) from EVTPE and display them in a frequency table. The table below shows the 50 most frequently used words within EVTYPES:
words <- paste(data$EVTYPE, collapse=" ")
uniqueWords <- strsplit(words, " ")[[1]]
wordCount <- as.data.frame(table(uniqueWords)) %>% arrange(desc(Freq))
head(wordCount,n=50)
## uniqueWords Freq
## 1 WIND 339051
## 2 HAIL 289327
## 3 TSTM 227232
## 4 THUNDERSTORM 109517
## 5 FLOOD 81427
## 6 TORNADO 60685
## 7 FLASH 55043
## 8 HEAVY 27939
## 9 WINDS 22856
## 10 HIGH 22814
## 11 WINTER 19578
## 12 SNOW 17373
## 13 LIGHTNING 15772
## 14 STORM 15024
## 15 MARINE 12614
## 16 RAIN 12086
## 17 WEATHER 7047
## 18 FUNNEL 6981
## 19 CLOUD 6852
## 20 STRONG 3813
## 21 WATERSPOUT 3800
## 22 STREAM 3472
## 23 URBAN/SML 3393
## 24 FLD 3392
## 25 WILDFIRE 2761
## 26 BLIZZARD 2726
## 27 HEAT 2643
## 28 DROUGHT 2497
## 29 ICE 2146
## 30 EXTREME 1915
## 31 FOG 1880
## 32 EXCESSIVE 1716
## 33 CHILL 1580
## 34 COLD/WIND 1541
## 35 FIRE 1463
## 36 WILD/FOREST 1458
## 37 FROST/FREEZE 1342
## 38 DENSE 1306
## 39 FLOODING 1123
## 40 WEATHER/MIX 1104
## 41 SURF 1052
## 42 WIND/HAIL 1031
## 43 COLD 877
## 44 COASTAL 812
## 45 RIP 777
## 46 TROPICAL 757
## 47 LAKE-EFFECT 636
## 48 FLOOD/FLASH 629
## 49 LANDSLIDE 601
## 50 DUST 578
This list of words is used to recode EVTYPE into a comprehensive set of weather types through regex-matching using the data.table package. To make sure no major weather types were missed, the original list of Storm Types in the NOAA Storms Database codebook was consulted.
temp <- data %>% select(BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, DMGproperty, DMGcrops) %>% as.data.table()
## Weather events are recoded using regex and the data.table package
temp[grepl("TSTM|THUNDERSTORM|LIGHTNING", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="thunderstorm"]
temp[grepl("WIND|MICROBURST|(?=.*MICRO)(?=.*BURST)", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="wind"]
temp[grepl("HAIL", EVTYPE, ignore.case=TRUE), EVTYPE:="hail"]
temp[grepl("FLOOD|FLD", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="flood"]
temp[grepl("TORNADO|FUNNEL|WATERSPOUT", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="tornado"]
temp[grepl("SLEET|SNOW", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="sleet and snow"]
temp[grepl("RAIN", EVTYPE, ignore.case=TRUE), EVTYPE:="rain"]
temp[grepl("SURF|TIDE|SURGE|RIP|CURRENT", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="surftide"]
temp[grepl("ICE|FREEZ|FROST|FROZEN|COLD|CHILL", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="cold"]
temp[grepl("BLIZZARD|(?=.*ICE)(?=.*STORM)|(?=.*SNOW)(?=.*STORM)|(?=.*WINTER)(?=.*STORM)|(?=.*LAKE)(?=.*EFFECT)", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="blizzard"]
temp[grepl("DUST", EVTYPE, ignore.case=TRUE), EVTYPE:="dust"]
temp[grepl("WILDFIRE|(?=.*WILD)(?=.*FIRE)|(?=.*FOREST)(?=.*FIRE)", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="fire"]
temp[grepl("HEAT|WARM", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="heat"]
temp[grepl("(?=.*DRY)(?=.*WARM)|DROUGHT", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="drought"]
temp[grepl("FOG", EVTYPE, ignore.case=TRUE), EVTYPE:="fog"]
temp[grepl("HURRICANE|TYPHOON|(?=.*TROPICAL)(?=.*STORM)(?=.*depression)", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="tropical storm"]
temp[grepl("LANDSLIDE", EVTYPE, ignore.case=TRUE), EVTYPE:="landslide"]
temp[grepl("AVALANCHE", EVTYPE, ignore.case=TRUE), EVTYPE:="avalanche"]
## Only observations with any of the new types of weather are kept.
processedData <- temp %>% filter(grepl("^thunderstorm$|^wind$|^hail$|^flood$|^tornado$|^sleet and snow$|^rain$|^surftide$|^cold$|^blizzard$|^dust$|^fire$|^heat$|^drought$|^fog$|^tropical storm$|^landslide$|^avalanche$", EVTYPE, perl=TRUE)) %>% as.data.frame()
processedData$EVTYPE <- as.factor(as.character(processedData$EVTYPE))
A downside of this method is that not all EVTYPE observations will be matched. To verify that not too many observations are discarded this way, we compare the amount of observations of the olf versus the new dataset.
(nrow(temp)-nrow(processedData))/nrow(temp)
## [1] 0.01091215
Approximately 1,1% of the observations is discarded. Considering the the large sample of the data and the exploratory nature of this research, this is justifiable.
The report attempts to answer the following questions:
Across the United States, which types of weather events are most harmful with respect to population health?
Across the United States, which types of weather events have the greatest economic consequences?
The most straight-forward approach is to compare total injuries and fatalities for every type of weather. The code below processes the tidy dataset and displays this information, sorted by the sum of fatalities and injuries per type of weather.
summaryData <- processedData %>%
group_by(EVTYPE) %>%
summarize(count=n(),
fatalities=sum(FATALITIES),
injuries=sum(INJURIES)) %>%
mutate(total=fatalities+injuries) %>%
arrange(desc(total)) %>%
select(EVTYPE, fatalities,injuries, total)
summaryData
## Source: local data frame [18 x 4]
##
## EVTYPE fatalities injuries total
## 1 tornado 5639 91439 97078
## 2 thunderstorm 1571 14775 16346
## 3 heat 3178 9243 12421
## 4 flood 1553 8683 10236
## 5 cold 317 2450 2767
## 6 wind 701 2022 2723
## 7 blizzard 317 2143 2460
## 8 fire 90 1606 1696
## 9 surftide 759 818 1577
## 10 tropical storm 133 1333 1466
## 11 hail 15 1371 1386
## 12 sleet and snow 166 1123 1289
## 13 fog 80 1076 1156
## 14 dust 24 483 507
## 15 rain 107 303 410
## 16 avalanche 224 170 394
## 17 landslide 39 53 92
## 18 drought 0 4 4
From all weather types types, tornado’s were most harmful with respect to human health. Thunderstorms, heat, flood and cold follow as weather types that caused the most fatalities and injuries.
## The data is sorted by total number of victims and aligns the EVTYPES to it in order to sort the categorical variables in the bar chart by total number of victims
summaryData <- arrange(summaryData, -total)
summaryData$EVTYPE <- factor(summaryData$EVTYPE, levels=unique(summaryData$EVTYPE))
## Prepares narrow data
plotData <- melt(summaryData, id.vars=1, measure.vars=c(2,3))
## Prepares basic plot
plot <- ggplot(plotData, aes(x=EVTYPE, y=value, fill=variable))
## Plots
plot +
geom_bar(position="dodge", stat="identity") +
ggtitle("Total Fatalities and Injuries caused by Extreme Weather in the USA\n(1950 - 2011) ") +
ylab("Fatalities / Injuries") +
xlab("Weather types") +
theme(axis.text.x=element_text(angle=90, hjust=1, vjust=.25),
legend.title=element_blank())
Tornado’s and heat have not only caused the most fatalities, the plot above suggests tornado’s and heat, as well as floods and surf tides, are more fatal than other weather types. The proportion of fatalities compared to inuries is larger than for other weather types.
The most straight-forward approach is to compare total crop and property damage for every weather type. The code below processes the tidy dataset displays this information sorted by the sum of property and crop damage.
summaryData1 <- processedData %>%
group_by(EVTYPE) %>%
summarize(count=n(),
property_damage=sum(DMGproperty/1000000000, na.rm=TRUE),
crop_damage=sum(DMGcrops/1000000000, na.rm=TRUE)) %>%
mutate(total=property_damage+crop_damage) %>%
arrange(desc(total)) %>%
select(EVTYPE, property_damage,crop_damage, total)
summaryData1
## Source: local data frame [18 x 4]
##
## EVTYPE property_damage crop_damage total
## 1 tornado 51.70285703 0.4149614 52.11781839
## 2 flood 39.08102497 7.2757472 46.35677217
## 3 hail 14.17404272 3.0468374 17.22088017
## 4 tropical storm 10.95641001 3.9961178 14.95252781
## 5 drought 1.04610600 12.4725660 13.51867200
## 6 thunderstorm 10.71483631 1.2863010 12.00113730
## 7 cold 4.10392271 3.2282908 7.33221351
## 8 fire 5.95656350 0.4032816 6.35984513
## 9 wind 4.98227659 0.9033655 5.88564214
## 10 blizzard 2.34881120 0.1395040 2.48831520
## 11 rain 0.73398619 0.8061528 1.54013899
## 12 surftide 1.51660700 0.0008550 1.51746200
## 13 sleet and snow 1.02479974 0.1346631 1.15946284
## 14 heat 0.02032575 0.5044793 0.52480503
## 15 landslide 0.32470100 0.0200170 0.34471800
## 16 fog 0.02282950 0.0000000 0.02282950
## 17 dust 0.00628763 0.0031000 0.00938763
## 18 avalanche 0.00372180 0.0000000 0.00372180
The data shows that tonado’s and floods have caused the most damage, followed by hail, tropical storms and drought. Also, weather seems to cause more property damage than crop damage, but this is probably because the USA has more property than crops that can be damaged.
## The data is sorted by total number of victims and aligns the EVTYPES to it in order to sort the categorical variables in the bar chart by total number of victims
summaryData1 <- arrange(summaryData1, -total)
summaryData1$EVTYPE <- factor(summaryData1$EVTYPE, levels=unique(summaryData1$EVTYPE))
plotData1 <- melt(summaryData1, id.vars=c(1), measure.vars=c(2,3))
plot <- ggplot(plotData1, aes(x=EVTYPE, y=value, fill=variable))
plot +
geom_bar(position="dodge", stat="identity") +
ggtitle("Total Crop and Property Damage caused by Extreme Weather in the USA\n(1950 - 2011) ") +
xlab("Weather types") +
ylab("Damage (billions)") +
theme(axis.text.x=element_text(angle=90,hjust=1, vjust=.25),
legend.title=element_blank())
The plot shows that crops are relatively vulnarable to drought, heat, rain and cold. It shows that property is relatively vulnbarable to tornado’s, thunderstorms and fire.
For policy implications, it is wise to compare extreme weather patterns between states, because the likelihood of certain storms occurring is likely to be dependent of natural factors.
STATE are compared with a list of official abbreviations. Invalid codes are discarded and the latitudes and longitudes for valid state abbreviations are retrieved from Google using the ggmap-package and stored in a seperate data frame.## Matching observations in STATE to a list of official abbreviations
STATE <- levels(processedData$STATE)
states <- read.csv("states.csv", sep=";")
statesmerge <- merge(STATE, states, by.x=1, by.y=2)
## Retrieve longitudes and latitudes from Google.
lonlat <- geocode(as.character(statesmerge$State))
## Store data in a seperate data frame
geodata <- cbind(statesmerge,x=lonlat[1],y=lonlat[2])
## Property, crop and total damage per state and weather type incl. lon and lat codes
dmgevtypeperstate <- processedData %>%
group_by(STATE, EVTYPE) %>%
summarize(propdmg=(sum(DMGproperty)/1000000000), cropdmg=(sum(DMGcrops)/1000000000))
## Fatalities, injuries and total victims per state and weather type incl. lon and lat codes
victevtypeperstate <- processedData %>%
group_by(STATE, EVTYPE) %>%
summarize(fatalities=sum(FATALITIES), injuries=sum(INJURIES))
cropdmg <- dmgevtypeperstate %>%
top_n(1, cropdmg) %>%
merge(geodata, by.x=1, by.y=1)
propdmg <- dmgevtypeperstate %>%
top_n(1, propdmg) %>%
merge(geodata, by.x=1, by.y=1)
totalinj <- victevtypeperstate %>%
top_n(1, injuries) %>%
merge(geodata, by.x=1, by.y=1)
totalfat <- victevtypeperstate %>%
top_n(1, fatalities) %>%
merge(geodata, by.x=1, by.y=1)
The map below shows that…
## Prepare map and plots
map <- ggmap(get_map("USA", zoom=4, maptype="terrain", color='bw'), legend='right', extent='device')
map +
geom_point(aes(x=lon, y=lat, colour=EVTYPE, size=propdmg), data=propdmg)+
ggtitle("Property Damage") +
scale_size_continuous(range=c(4,12)) +
guides(colour = guide_legend(override.aes = list(size=7)))+
scale_colour_manual(values=c("#8dd3c7","#ffffb3","#bebada","#fb8072","#80b1d3","#fdb462","#b3de69","#fccde5","#d9d9d9","#bc80bd","#ccebc5","#ffed6f","#ffffff", "#000000")) +
guides(colour = guide_legend(title="Weather type", override.aes = list(size=7), ncol=2),
size = guide_legend(title="Damage (billions)", ncol=2))
The map below shows that…
map +
geom_point(aes(x=lon, y=lat, colour=EVTYPE, size=cropdmg), data=cropdmg)+
ggtitle("Crop Damage") +
scale_size_continuous(range=c(4,12)) +
guides(colour = guide_legend(override.aes = list(size=7)))+
scale_colour_manual(values=c("#8dd3c7","#ffffb3","#bebada","#fb8072","#80b1d3","#fdb462","#b3de69","#fccde5","#d9d9d9","#bc80bd","#ccebc5","#ffed6f","#ffffff", "#000000")) +
guides(colour = guide_legend(title="Weather type", override.aes = list(size=7), ncol=2),
size = guide_legend(title="Damage (billions)", ncol=2))
The map below shows that…
map +
geom_point(aes(x=lon, y=lat, colour=EVTYPE, size=injuries), data=totalinj)+
ggtitle("Injuries") +
scale_size_continuous(range=c(4,12)) +
scale_colour_manual(values=c("#8dd3c7","#ffffb3","#bebada","#fb8072","#80b1d3","#fdb462","#b3de69","#fccde5","#d9d9d9","#bc80bd","#ccebc5","#ffed6f","#ffffff", "#000000")) +
guides(colour = guide_legend(title="Weather type", override.aes = list(size=7), ncol=2),
size = guide_legend(title="Injuries", ncol=2))
The map below shows that…
map +
geom_point(aes(x=lon, y=lat, colour=EVTYPE, size=fatalities), data=totalfat)+
ggtitle("Fatalities") +
scale_size_continuous(range=c(4,12)) +
scale_colour_manual(values=c("#8dd3c7","#ffffb3","#bebada","#fb8072","#80b1d3","#fdb462","#b3de69","#fccde5","#d9d9d9","#bc80bd","#ccebc5","#ffed6f","#ffffff", "#000000")) +
guides(colour = guide_legend(title="Weather type", override.aes = list(size=7), ncol=2),
size = guide_legend(title="Fatalities", ncol=2))
Analysis of the NOAA Storms Database shows that between 1950 and 2011…