SUMMARY

To find out which weather types caused the most economical damage and damage to human health in the USA between 1950 and 2011, the NOAA Storm Database was explored. The set consists of property damage, crop damage, injuries and fatalities for all registered weather events in the USA for this period. This report shows how the data was processed, analyzed and visualized. Analysis shows that tornado’s cause most damage to properties, crops, and human health. Floods cause much proiperty damage and drought causes most damage to crops. The South and Midwest seem to have the most injuries, fatalities and damage caused by extreme weather. The West seems relatively safe.

1. DATA PROCESSING

In order to succesfully analyze the data, the set is processed as follows:

  1. Preparing settings and data
  2. Recoding property and crop damage into dollars
  3. Recoding Storm Types

1.1 Preparing settings and data

First, the required packages are attached and the right options are set.

## Attach packages
require(knitr)
require(R.utils)
require(dplyr)
require(lubridate)
require(data.table)
require(reshape2)
require(ggplot2)
require(scales)
require(ggmap)
require(maptools)
require(grid)
require(gridExtra)
require(RColorBrewer)

## Set global chunk options: 
opts_chunk$set(cache=TRUE, cache.path = 'DocumentName_cache/', fig.path='figure/')

## Set plotting options
theme_set(theme_grey(base_size = 16)) 

If necassary, the data is downloaded from the original source (the National Climatic Data Center, National Oceanic and Atmposperic Administration). The file is unzipped, read, and the relevant columns are selected:

  • BGN_DATE - Start date and time of storm event
  • STATE - Location of storm event
  • EVTYPE - Description of storm event
  • FATALITIES - Ammount of fatalities per storm event
  • INJURIES - Ammount of injuries per storm event
  • PROPDMG - Property damage per storm event
  • PROPDMGEXP - Character indicating unit size of PROPDMG
  • CROPDMG - Crop damage per storm event
  • CROPDMGEXP - Character indicating unit size of CROPDMG

The variables are transformed into the right format and put into a a new data frame called data.

if(!file.exists("StormData.csv")) {


download.file(url="https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile="StormData.csv.bz2")
bunzip2(filename="StormData.csv.bz2", destname="StormData.csv", remove=TRUE)
        
}

file <- read.csv("StormData.csv")

data <- select(file, BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES,
               PROPDMG,PROPDMGEXP, CROPDMG, CROPDMGEXP)

data$BGN_DATE <- as.POSIXct(strptime(data$BGN_DATE,"%m/%d/%Y %H:%M:%S"))

1.2. Recoding property- and crop damage

In the original set, the damage to property and crops in dollars are coded in two seperate variables: PROPDMG and CROPDMG, and PROPDMGEXP and CROPDMGEXP. Followinging the NOAA Storms Database codebook, PROP- and CROPDMG are multiplied by 1000 when PROP- or CROPDMGEXP is ‘k’, by 1000000 when it is ‘m’, and by 0 when there is no value (cases do not have to match). Values that do not correspond to the k-, or m-multiplier in the NOAA Storm Database Codebook are ignored.

data <- data %>%

## creating new column DMGproperty for property damage in dollars.

  mutate(DMGproperty = ifelse(grepl("k", PROPDMGEXP, ignore.case=TRUE, perl=TRUE), PROPDMG*1000,
         ifelse(grepl("m", PROPDMGEXP, ignore.case=TRUE, perl=TRUE),PROPDMG*1000000,
         ifelse(grepl("", PROPDMGEXP, ignore.case=TRUE, perl=TRUE),0,NA))),

## creating new column DMGcrops for crop damage in dollars

         DMGcrops = ifelse(grepl("k", CROPDMGEXP, ignore.case=TRUE, perl=TRUE), CROPDMG*1000,
         ifelse(grepl("m", CROPDMGEXP, ignore.case=TRUE, perl=TRUE),CROPDMG*1000000,
         ifelse(grepl("", PROPDMGEXP, ignore.case=TRUE, perl=TRUE),0,NA))),

## creating new column DMGtotal which the total damage in dollars

         DMGtotal=DMGproperty+DMGcrops

        )

1.3. Recoding storm types

Although the National Climatic Data Center distinguishes a total of 48 unique Storm Types ( EVTYPES ), the number of unique EVTYPES in the data lay far beyond that number.Propably as a result of inconsistent coding.

summarize(data,n_distinct(EVTYPE))
##   n_distinct(EVTYPE)
## 1                985

One approach to fix this would be to match each of the observations into one of 48 ‘official’ EVTYPES. However, some of the descriptions do not fit nicely into just one of those (e.g.: EXTREME HEAT AND DROUGHT, COLD WINTERWEATHER, etc.). To determine for which event types to use, I first extracted all unique words (seperated by a space) from EVTPE and display them in a frequency table. The table below shows the 50 most frequently used words within EVTYPES:

words <- paste(data$EVTYPE, collapse=" ")
uniqueWords <- strsplit(words, " ")[[1]]
wordCount <- as.data.frame(table(uniqueWords)) %>% arrange(desc(Freq))

head(wordCount,n=50)
##     uniqueWords   Freq
## 1          WIND 339051
## 2          HAIL 289327
## 3          TSTM 227232
## 4  THUNDERSTORM 109517
## 5         FLOOD  81427
## 6       TORNADO  60685
## 7         FLASH  55043
## 8         HEAVY  27939
## 9         WINDS  22856
## 10         HIGH  22814
## 11       WINTER  19578
## 12         SNOW  17373
## 13    LIGHTNING  15772
## 14        STORM  15024
## 15       MARINE  12614
## 16         RAIN  12086
## 17      WEATHER   7047
## 18       FUNNEL   6981
## 19        CLOUD   6852
## 20       STRONG   3813
## 21   WATERSPOUT   3800
## 22       STREAM   3472
## 23    URBAN/SML   3393
## 24          FLD   3392
## 25     WILDFIRE   2761
## 26     BLIZZARD   2726
## 27         HEAT   2643
## 28      DROUGHT   2497
## 29          ICE   2146
## 30      EXTREME   1915
## 31          FOG   1880
## 32    EXCESSIVE   1716
## 33        CHILL   1580
## 34    COLD/WIND   1541
## 35         FIRE   1463
## 36  WILD/FOREST   1458
## 37 FROST/FREEZE   1342
## 38        DENSE   1306
## 39     FLOODING   1123
## 40  WEATHER/MIX   1104
## 41         SURF   1052
## 42    WIND/HAIL   1031
## 43         COLD    877
## 44      COASTAL    812
## 45          RIP    777
## 46     TROPICAL    757
## 47  LAKE-EFFECT    636
## 48  FLOOD/FLASH    629
## 49    LANDSLIDE    601
## 50         DUST    578

This list of words is used to recode EVTYPE into a comprehensive set of weather types through regex-matching using the data.table package. To make sure no major weather types were missed, the original list of Storm Types in the NOAA Storms Database codebook was consulted.

temp <- data %>% select(BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, DMGproperty, DMGcrops) %>% as.data.table()

## Weather events are recoded using regex and the data.table package
temp[grepl("TSTM|THUNDERSTORM|LIGHTNING", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="thunderstorm"]
temp[grepl("WIND|MICROBURST|(?=.*MICRO)(?=.*BURST)", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="wind"]
temp[grepl("HAIL", EVTYPE, ignore.case=TRUE), EVTYPE:="hail"]
temp[grepl("FLOOD|FLD", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="flood"]
temp[grepl("TORNADO|FUNNEL|WATERSPOUT", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="tornado"]
temp[grepl("SLEET|SNOW", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="sleet and snow"]
temp[grepl("RAIN", EVTYPE, ignore.case=TRUE), EVTYPE:="rain"]
temp[grepl("SURF|TIDE|SURGE|RIP|CURRENT", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="surftide"]
temp[grepl("ICE|FREEZ|FROST|FROZEN|COLD|CHILL", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="cold"]
temp[grepl("BLIZZARD|(?=.*ICE)(?=.*STORM)|(?=.*SNOW)(?=.*STORM)|(?=.*WINTER)(?=.*STORM)|(?=.*LAKE)(?=.*EFFECT)", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="blizzard"]
temp[grepl("DUST", EVTYPE, ignore.case=TRUE), EVTYPE:="dust"]
temp[grepl("WILDFIRE|(?=.*WILD)(?=.*FIRE)|(?=.*FOREST)(?=.*FIRE)", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="fire"]
temp[grepl("HEAT|WARM", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="heat"]
temp[grepl("(?=.*DRY)(?=.*WARM)|DROUGHT", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="drought"]
temp[grepl("FOG", EVTYPE, ignore.case=TRUE), EVTYPE:="fog"]
temp[grepl("HURRICANE|TYPHOON|(?=.*TROPICAL)(?=.*STORM)(?=.*depression)", EVTYPE, perl=TRUE, ignore.case=TRUE), EVTYPE:="tropical storm"]
temp[grepl("LANDSLIDE", EVTYPE, ignore.case=TRUE), EVTYPE:="landslide"]
temp[grepl("AVALANCHE", EVTYPE, ignore.case=TRUE), EVTYPE:="avalanche"]

## Only observations with any of the new types of weather are kept.
processedData <- temp %>% filter(grepl("^thunderstorm$|^wind$|^hail$|^flood$|^tornado$|^sleet and snow$|^rain$|^surftide$|^cold$|^blizzard$|^dust$|^fire$|^heat$|^drought$|^fog$|^tropical storm$|^landslide$|^avalanche$", EVTYPE, perl=TRUE)) %>% as.data.frame()

processedData$EVTYPE <- as.factor(as.character(processedData$EVTYPE))

A downside of this method is that not all EVTYPE observations will be matched. To verify that not too many observations are discarded this way, we compare the amount of observations of the olf versus the new dataset.

(nrow(temp)-nrow(processedData))/nrow(temp)
## [1] 0.01091215

Approximately 1,1% of the observations is discarded. Considering the the large sample of the data and the exploratory nature of this research, this is justifiable.

2. RESULTS

The report attempts to answer the following questions:

  1. Across the United States, which types of weather events are most harmful with respect to population health?

  2. Across the United States, which types of weather events have the greatest economic consequences?

2.1. Across the United States, which types of events are most harmful with respect to population health?

The most straight-forward approach is to compare total injuries and fatalities for every type of weather. The code below processes the tidy dataset and displays this information, sorted by the sum of fatalities and injuries per type of weather.

summaryData <- processedData %>%
  group_by(EVTYPE) %>%
  summarize(count=n(),
    fatalities=sum(FATALITIES),
    injuries=sum(INJURIES)) %>%
  mutate(total=fatalities+injuries) %>%
  arrange(desc(total)) %>%
  select(EVTYPE, fatalities,injuries, total)

summaryData
## Source: local data frame [18 x 4]
## 
##            EVTYPE fatalities injuries total
## 1         tornado       5639    91439 97078
## 2    thunderstorm       1571    14775 16346
## 3            heat       3178     9243 12421
## 4           flood       1553     8683 10236
## 5            cold        317     2450  2767
## 6            wind        701     2022  2723
## 7        blizzard        317     2143  2460
## 8            fire         90     1606  1696
## 9        surftide        759      818  1577
## 10 tropical storm        133     1333  1466
## 11           hail         15     1371  1386
## 12 sleet and snow        166     1123  1289
## 13            fog         80     1076  1156
## 14           dust         24      483   507
## 15           rain        107      303   410
## 16      avalanche        224      170   394
## 17      landslide         39       53    92
## 18        drought          0        4     4

From all weather types types, tornado’s were most harmful with respect to human health. Thunderstorms, heat, flood and cold follow as weather types that caused the most fatalities and injuries.

## The data is sorted by total number of victims and aligns the EVTYPES to it in order to sort the categorical variables in the bar chart by total number of victims
summaryData <- arrange(summaryData, -total)
summaryData$EVTYPE <- factor(summaryData$EVTYPE, levels=unique(summaryData$EVTYPE))

## Prepares narrow data
plotData <- melt(summaryData, id.vars=1, measure.vars=c(2,3))

## Prepares basic plot
plot <- ggplot(plotData, aes(x=EVTYPE, y=value, fill=variable))

## Plots
plot +
        geom_bar(position="dodge", stat="identity") +
        ggtitle("Total Fatalities and Injuries caused by Extreme Weather in the USA\n(1950 - 2011) ") +
        ylab("Fatalities / Injuries") +
        xlab("Weather types") +
        theme(axis.text.x=element_text(angle=90, hjust=1, vjust=.25),
              legend.title=element_blank())

Tornado’s and heat have not only caused the most fatalities, the plot above suggests tornado’s and heat, as well as floods and surf tides, are more fatal than other weather types. The proportion of fatalities compared to inuries is larger than for other weather types.

2.2. Across the United States, which types of events have the greatest economic consequences?

The most straight-forward approach is to compare total crop and property damage for every weather type. The code below processes the tidy dataset displays this information sorted by the sum of property and crop damage.

summaryData1 <- processedData %>%
  group_by(EVTYPE) %>%
  summarize(count=n(),
            property_damage=sum(DMGproperty/1000000000, na.rm=TRUE),
            crop_damage=sum(DMGcrops/1000000000, na.rm=TRUE)) %>%
  mutate(total=property_damage+crop_damage) %>%
  arrange(desc(total)) %>%
  select(EVTYPE, property_damage,crop_damage, total)

summaryData1
## Source: local data frame [18 x 4]
## 
##            EVTYPE property_damage crop_damage       total
## 1         tornado     51.70285703   0.4149614 52.11781839
## 2           flood     39.08102497   7.2757472 46.35677217
## 3            hail     14.17404272   3.0468374 17.22088017
## 4  tropical storm     10.95641001   3.9961178 14.95252781
## 5         drought      1.04610600  12.4725660 13.51867200
## 6    thunderstorm     10.71483631   1.2863010 12.00113730
## 7            cold      4.10392271   3.2282908  7.33221351
## 8            fire      5.95656350   0.4032816  6.35984513
## 9            wind      4.98227659   0.9033655  5.88564214
## 10       blizzard      2.34881120   0.1395040  2.48831520
## 11           rain      0.73398619   0.8061528  1.54013899
## 12       surftide      1.51660700   0.0008550  1.51746200
## 13 sleet and snow      1.02479974   0.1346631  1.15946284
## 14           heat      0.02032575   0.5044793  0.52480503
## 15      landslide      0.32470100   0.0200170  0.34471800
## 16            fog      0.02282950   0.0000000  0.02282950
## 17           dust      0.00628763   0.0031000  0.00938763
## 18      avalanche      0.00372180   0.0000000  0.00372180

The data shows that tonado’s and floods have caused the most damage, followed by hail, tropical storms and drought. Also, weather seems to cause more property damage than crop damage, but this is probably because the USA has more property than crops that can be damaged.

## The data is sorted by total number of victims and aligns the EVTYPES to it in order to sort the categorical variables in the bar chart by total number of victims
summaryData1 <- arrange(summaryData1, -total)
summaryData1$EVTYPE <- factor(summaryData1$EVTYPE, levels=unique(summaryData1$EVTYPE))

plotData1 <- melt(summaryData1, id.vars=c(1), measure.vars=c(2,3))

plot <- ggplot(plotData1, aes(x=EVTYPE, y=value, fill=variable))

plot +
        geom_bar(position="dodge", stat="identity") +
        ggtitle("Total Crop and Property Damage caused by Extreme Weather in the USA\n(1950 - 2011) ") +
        xlab("Weather types") +
        ylab("Damage (billions)") +
        theme(axis.text.x=element_text(angle=90,hjust=1, vjust=.25),
              legend.title=element_blank())

The plot shows that crops are relatively vulnarable to drought, heat, rain and cold. It shows that property is relatively vulnbarable to tornado’s, thunderstorms and fire.

3. Comparing states

For policy implications, it is wise to compare extreme weather patterns between states, because the likelihood of certain storms occurring is likely to be dependent of natural factors.

3.1 Data processing

  • First, The observations in STATE are compared with a list of official abbreviations. Invalid codes are discarded and the latitudes and longitudes for valid state abbreviations are retrieved from Google using the ggmap-package and stored in a seperate data frame.
## Matching observations in STATE to a list of official abbreviations
STATE <- levels(processedData$STATE)
states <- read.csv("states.csv", sep=";") 
statesmerge <- merge(STATE, states, by.x=1, by.y=2)

## Retrieve longitudes and latitudes from Google.
lonlat <- geocode(as.character(statesmerge$State))

## Store data in a seperate data frame
geodata <- cbind(statesmerge,x=lonlat[1],y=lonlat[2])
  • For every state, the weather types that caused the most injuries, fatalities, crop damage and property damage are stored in two data frame.
## Property, crop and total damage per state and weather type incl. lon and lat codes
dmgevtypeperstate <- processedData %>% 
                        group_by(STATE, EVTYPE) %>% 
                        summarize(propdmg=(sum(DMGproperty)/1000000000), cropdmg=(sum(DMGcrops)/1000000000))


## Fatalities, injuries and total victims per state and weather type incl. lon and lat codes
victevtypeperstate <- processedData %>% 
                        group_by(STATE, EVTYPE) %>% 
                        summarize(fatalities=sum(FATALITIES), injuries=sum(INJURIES))
  • Next, the data is prepared for plotting by extracting injuries, fatalities, crop damage, and property damage for all states into separate data frames.
cropdmg <- dmgevtypeperstate %>% 
                top_n(1, cropdmg) %>%
                merge(geodata, by.x=1, by.y=1)

propdmg <- dmgevtypeperstate %>% 
                top_n(1, propdmg) %>%
                merge(geodata, by.x=1, by.y=1)

totalinj <- victevtypeperstate %>% 
                top_n(1, injuries) %>%
                merge(geodata, by.x=1, by.y=1)    

totalfat <- victevtypeperstate %>% 
                top_n(1, fatalities) %>%
                merge(geodata, by.x=1, by.y=1)
  • Finally, the data is plotted.

3.2 Property damage

The map below shows that…

  • Tornado’s caused the most property damage, especially in the South and Midwest.
  • Tropical stroms caused the most property damage in the coastal states in the Southeast.
  • Floods caused the most property damage in the Northeastern States
  • Floods and hail caused the most property damage in the West, except in California were fires caused the most property damage.
## Prepare map and plots

map <- ggmap(get_map("USA", zoom=4, maptype="terrain", color='bw'), legend='right', extent='device')

map  +
        geom_point(aes(x=lon, y=lat, colour=EVTYPE, size=propdmg), data=propdmg)+
        ggtitle("Property Damage") +
        scale_size_continuous(range=c(4,12))  +
        guides(colour = guide_legend(override.aes = list(size=7)))+
        scale_colour_manual(values=c("#8dd3c7","#ffffb3","#bebada","#fb8072","#80b1d3","#fdb462","#b3de69","#fccde5","#d9d9d9","#bc80bd","#ccebc5","#ffed6f","#ffffff", "#000000")) +
        guides(colour = guide_legend(title="Weather type", override.aes = list(size=7), ncol=2),
                size = guide_legend(title="Damage (billions)", ncol=2))

3.3 Crop damage

The map below shows that…

  • Drought caused most of the crop damage, especially in the South and Northeast
  • Flood and hail caused most of the crop damage in the Midwest and North.
  • Crops in California and South Carolina were vulnarable to cold, Alabama and Mississippi to tornado’s, Maine to rain, Utah to wind and Massachusets to thunderstorms.
map  +
        geom_point(aes(x=lon, y=lat, colour=EVTYPE, size=cropdmg), data=cropdmg)+
        ggtitle("Crop Damage") +
        scale_size_continuous(range=c(4,12)) +
        guides(colour = guide_legend(override.aes = list(size=7)))+ 
        scale_colour_manual(values=c("#8dd3c7","#ffffb3","#bebada","#fb8072","#80b1d3","#fdb462","#b3de69","#fccde5","#d9d9d9","#bc80bd","#ccebc5","#ffed6f","#ffffff", "#000000")) +
        guides(colour = guide_legend(title="Weather type", override.aes = list(size=7), ncol=2),
                size = guide_legend(title="Damage (billions)", ncol=2))

3.4 Injuries

The map below shows that…

  • Most injuries were caused by Tornado’s, especially in the South and Midwest
  • Thunderstorms caused most of the injuries in the Northeast, followed by surf tides, heat and tornado’s
  • Thunderstorms caused most injuries in Montana, Idaho, Colorado and Arizona. Blizzards caused most injuries in Wyoming and Utah.
  • Most injuries in Californa were caused by fire, in Oregon by wind, and in Nevada by floods.
map  +
        geom_point(aes(x=lon, y=lat, colour=EVTYPE, size=injuries), data=totalinj)+
        ggtitle("Injuries") +
        scale_size_continuous(range=c(4,12)) +
        scale_colour_manual(values=c("#8dd3c7","#ffffb3","#bebada","#fb8072","#80b1d3","#fdb462","#b3de69","#fccde5","#d9d9d9","#bc80bd","#ccebc5","#ffed6f","#ffffff", "#000000")) +
        guides(colour = guide_legend(title="Weather type", override.aes = list(size=7), ncol=2),
                size = guide_legend(title="Injuries", ncol=2))

3.5 Fatalities

The map below shows that…

  • Most fatalities were caused by tornado’s, especially in the South and Midwest. In Florida surftides caused most of the fatalities.
  • Heat caused relatively many fatalities in Illinois and Pennsylvania. In California and Nevada heat caused the most fatalities as well.
  • Most fatalities were caused by thunderstorms, wind, floods and surftides in the Northeast.
  • In the West, thunderstoms, floods and wind caused the most fatalities.
map  +
        geom_point(aes(x=lon, y=lat, colour=EVTYPE, size=fatalities), data=totalfat)+
        ggtitle("Fatalities") +
        scale_size_continuous(range=c(4,12)) +
        scale_colour_manual(values=c("#8dd3c7","#ffffb3","#bebada","#fb8072","#80b1d3","#fdb462","#b3de69","#fccde5","#d9d9d9","#bc80bd","#ccebc5","#ffed6f","#ffffff", "#000000")) +
        guides(colour = guide_legend(title="Weather type", override.aes = list(size=7), ncol=2),
                size = guide_legend(title="Fatalities", ncol=2))

4. Conclusion

Analysis of the NOAA Storms Database shows that between 1950 and 2011…