Synopsis

We performed an exploratory analysis of data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database in order to address the following questions: a) Which types of weather events are most harmful with respect to population health, and b) which have the greatest economic consequences? We looked at absolute effects summed per event type across all of the United States in time period 1996 to 2010. We determined the top 10 event categories measured in terms of fatalities, injuries, and property/crop damage and found the following: Excessive Heat, Tornadoes, Flash Floods and Lightning are most fatal. Tornadoes, Floods, Excessive Heat and Thunderstorm Winds cause most injuries. Thunderstorm Winds, Flash Floods, Tornadoes and Hail cause most economic damage. Furthermore, damage to property is overall much higher than to crops and that the biggest contributor to crop damage is Hail.

Data Processing

Loading libraries and data

For our analysis we need to load the following libraries.

library(lattice)
library(ggplot2)
library(lubridate)
library(stringdist)
suppressMessages(library(dplyr))

Data was downloaded from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2 and we use this as our raw data set.

if(!file.exists("stormdata.csv.bz2")){
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "stormdata.csv.bz2", method="curl")
}
df.stormdata <- read.csv("stormdata.csv.bz2")

The raw data set consists of 902297 entries of 37 variables.

dim(df.stormdata)
## [1] 902297     37
names(df.stormdata)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Furthermore we load an the official list of event types, extracted from the National Weather Service Storm Data Documentation. Its contents are show below:

eventtypes <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze", "Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Designator", "Event Name", "Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather")
length(eventtypes)
## [1] 50

Convert dates and select relevant variables

We first convert date variable BGN_DATE to type POSIXct

df.stormdata$BGN_DATE <- mdy_hms(df.stormdata$BGN_DATE)

and select only the variables relevant for this analysis

df.stormdata.explore <- subset(df.stormdata, select=c(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))

Checking data quality

A quick summary of this subsetted data set shows that there are no missing values, that a 75% majority of the dataset was generated since 1995 (see BGN_DATE), that multiple of the EVTYPE values do not correspond exactly with elements of eventtypes. Also, the majority of the entries seem to have zero values for FATALITIES, INFURIES, PROPDMG and CROPDMG.

summary(df.stormdata.explore)
##     BGN_DATE                                 EVTYPE      
##  Min.   :1950-01-03 00:00:00   HAIL             :288661  
##  1st Qu.:1995-04-20 00:00:00   TSTM WIND        :219940  
##  Median :2002-03-18 00:00:00   THUNDERSTORM WIND: 82563  
##  Mean   :1998-12-27 23:37:48   TORNADO          : 60652  
##  3rd Qu.:2007-07-28 00:00:00   FLASH FLOOD      : 54277  
##  Max.   :2011-11-30 00:00:00   FLOOD            : 25326  
##                                (Other)          :170878  
##    FATALITIES          INJURIES            PROPDMG          PROPDMGEXP    
##  Min.   :  0.0000   Min.   :   0.0000   Min.   :   0.00          :465934  
##  1st Qu.:  0.0000   1st Qu.:   0.0000   1st Qu.:   0.00   K      :424665  
##  Median :  0.0000   Median :   0.0000   Median :   0.00   M      : 11330  
##  Mean   :  0.0168   Mean   :   0.1557   Mean   :  12.06   0      :   216  
##  3rd Qu.:  0.0000   3rd Qu.:   0.0000   3rd Qu.:   0.50   B      :    40  
##  Max.   :583.0000   Max.   :1700.0000   Max.   :5000.00   5      :    28  
##                                                           (Other):    84  
##     CROPDMG          CROPDMGEXP    
##  Min.   :  0.000          :618413  
##  1st Qu.:  0.000   K      :281832  
##  Median :  0.000   M      :  1994  
##  Mean   :  1.527   k      :    21  
##  3rd Qu.:  0.000   0      :    19  
##  Max.   :990.000   B      :     9  
##                    (Other):     9

Problem 1: The PROPDMGEXP variable, which according to the documentation seems to denote exponential multiplication factors 10^{PROPDMGEXP} for PROPDMG seems to contain invalid values. Also, a single amount should not be stored in two separate variables. The same goes for CROPDMGEXP:

sort(unique(df.stormdata.explore$PROPDMGEXP))
##  [1]   - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
sort(unique(df.stormdata.explore$CROPDMGEXP))
## [1]   ? 0 2 B k K m M
## Levels:  ? 0 2 B k K m M

Investigating this a bit further we see that almost all entries with peculiar values for PROPDMGEXP and CROPDMGEXP are from the time range 1993 to 1995.

summary(df.stormdata.explore[df.stormdata.explore$PROPDMGEXP %in% c("-", "?", "+", "h", "H", "0", "1", "2"),]$BGN_DATE)
##                  Min.               1st Qu.                Median 
## "1993-04-24 00:00:00" "1995-02-03 00:00:00" "1995-05-18 00:00:00" 
##                  Mean               3rd Qu.                  Max. 
## "1995-05-04 14:24:00" "1995-07-14 12:00:00" "2011-07-27 00:00:00"
summary(df.stormdata.explore[df.stormdata.explore$CROPDMGEXP %in% c("?", "0", "2"),]$BGN_DATE)
##                  Min.               1st Qu.                Median 
## "1993-02-12 00:00:00" "1994-10-06 00:00:00" "1995-02-16 00:00:00" 
##                  Mean               3rd Qu.                  Max. 
## "1994-12-05 12:26:40" "1995-06-01 12:00:00" "1995-09-01 00:00:00"

Problem 2: Furthermore we find a long list of 985 different EVTYPE event types (many more than the 50 different categories stored in our eventtypes variable).

# Check EVTYPE categories matching official categories
length(unique(df.stormdata.explore$EVTYPE))
## [1] 985
head(sort(unique(df.stormdata.explore$EVTYPE)), 10)
##  [1] ?                      ABNORMALLY DRY         ABNORMALLY WET        
##  [4] ABNORMAL WARMTH        ACCUMULATED SNOWFALL   AGRICULTURAL FREEZE   
##  [7] APACHE COUNTY          ASTRONOMICAL HIGH TIDE ASTRONOMICAL LOW TIDE 
## [10] AVALANCE              
## 985 Levels: ? ABNORMALLY DRY ABNORMALLY WET ... WND

Select date range with good data quality

We decided to select only data from 1995 until 2010 to work with, for multiple reasons: lower quality of PROPDMGEXP and CROPDMGEXP between 1993 and 1995, the fact that less event types were recorded before 1996 according to the documentation, and because the data from 2011 is not yet complete (in other words: only fully available years with higher data quality were selected).

df.stormdata.clean <- subset(df.stormdata, BGN_DATE > ymd("1995-12-31") & BGN_DATE < ymd("2011-01-01"), select=c(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))

Convert property and crop damage amounts

We introduce new variables PROPERTY_DAMAGE and CROP_DAMAGE (calculated from PROPDMG * 10^PROPDMGEXP and CROPDMG * 10^CROPDMGEXP respectively) and drop the old ones.

multiplier_mapping <- c("0" = 1, "1" = 10, "2" = 100, "3" = 1000, "4" = 10000, "5" = 100000, "6" = 1000000, "7" = 10000000, "8" = 100000000, "H" = 100, "h" = 100, "K" = 1000, "k" = 1000, "M" = 1000000, "m" = 1000000, "B" =1000000000, "b" = 1000000000)
df.stormdata.clean$PROPERTY_DAMAGE <- df.stormdata.clean$PROPDMG * multiplier_mapping[droplevels(df.stormdata.clean$PROPDMGEXP)]
df.stormdata.clean$CROP_DAMAGE <- df.stormdata.clean$CROPDMG * multiplier_mapping[droplevels(df.stormdata.clean$CROPDMGEXP)]
df.stormdata.clean <- subset(df.stormdata.clean, select=-c(PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))

Clean up event types

We try to clean up the EVTYPE values by first converting them to lowercase (to prevent upper/lower mismatch) and adding a few manual substitution rules:

df.stormdata.clean$EVTYPE <- tolower(df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("tstm", "thunderstorm", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^wild/forest fire$", "wildfire", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^thunderstorm wind/hail$", "hail", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^winter weather/mix$", "winter weather", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^extreme cold$", "extreme cold/wind chill", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^snow$", "heavy snow", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^wind$", "strong wind", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^storm surge$", "storm surge/tide", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^landslide$", "debris flow", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^heavy surf/high surf$", "high surf", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^extreme windchill$", "extreme cold/wind chill", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^hurricane$", "hurricane (typhoon)", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^record warmth$|^recored heat$", "excessive heat", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("wintry", "winter", df.stormdata.clean$EVTYPE)

And subsequently trying to matching the raw EVTYPE to the officially used eventtypes that we defined above (correcting for small string differences). We use factor level UNKNOWN for an unmatched event type.

df.stormdata.clean$EVENT_TYPE <- eventtypes[amatch(tolower(df.stormdata.clean$EVTYPE), tolower(eventtypes), maxDist=3, nomatch=NA)]
levels(df.stormdata.clean$EVENT_TYPE) <- c(levels(df.stormdata.clean$EVENT_TYPE), "UNKNOWN")
df.stormdata.clean$EVENT_TYPE[is.na(df.stormdata.clean$EVENT_TYPE)] <- "UNKNOWN"
df.stormdata.clean <- subset(df.stormdata.clean, select=-c(EVTYPE))

This results in 1.02% of all entries having event type UNKNOWN, which we certainly consider acceptable.

sum(df.stormdata.clean$EVENT_TYPE=="UNKNOWN")/length(df.stormdata.clean$EVENT_TYPE)
## [1] 0.01021212

Results

Harmful weather events for population health

Aggregating the number of fatalities and injuries per event type:

df.fatalities <- aggregate(FATALITIES ~ EVENT_TYPE, data = df.stormdata.clean, sum)
df.injuries <- aggregate(INJURIES ~ EVENT_TYPE, data = df.stormdata.clean, sum)
df.fatalities.injuries <- merge(df.fatalities, df.injuries) %>% mutate(SUM.FATALATIES.INJURIES = FATALITIES + INJURIES)

We find a top 10 event types in terms of fatalities:

df.fatalities.injuries %>% arrange(desc(FATALITIES)) %>% head(10)
##                 EVENT_TYPE FATALITIES INJURIES SUM.FATALATIES.INJURIES
## 1           Excessive Heat       1761     6253                    8014
## 2                  Tornado        924    14504                   15428
## 3              Flash Flood        819     1644                    2463
## 4                Lightning        625     3947                    4572
## 5              Rip Current        513      476                     989
## 6                    Flood        435     7475                    7910
## 7        Thunderstorm Wind        317     4656                    4973
## 8  Extreme Cold/Wind Chill        255      107                     362
## 9                High Wind        231     1072                    1303
## 10               Avalanche        214      148                     362

And a top 10 event types causing most injuries:

df.fatalities.injuries %>% arrange(desc(INJURIES)) %>% head(10)
##             EVENT_TYPE FATALITIES INJURIES SUM.FATALATIES.INJURIES
## 1              Tornado        924    14504                   15428
## 2                Flood        435     7475                    7910
## 3       Excessive Heat       1761     6253                    8014
## 4    Thunderstorm Wind        317     4656                    4973
## 5            Lightning        625     3947                    4572
## 6          Flash Flood        819     1644                    2463
## 7             Wildfire         81     1340                    1421
## 8  Hurricane (Typhoon)        125     1321                    1446
## 9         Winter Storm        190     1292                    1482
## 10           High Wind        231     1072                    1303

Visualizing these results shows clearly that Excessive Heat, Tornadoes, Flash Floods and Lightning are most fatal and Tornadoes, Floods, Excessive Heat and Thunderstorm Winds cause most injuries.

df.fatalities.injuries.top.sum <- df.fatalities.injuries %>% arrange(desc(SUM.FATALATIES.INJURIES)) %>% head(10)
barchart(FATALITIES + INJURIES ~ EVENT_TYPE, data=df.fatalities.injuries.top.sum, ylab="Number of people", main="Fatalities and Injuries by Weather Type")
Figure 1: Total number of fatalities (turquoise) and injuries (pink) per type of weather event in the USA between 1996 and 2010.

Figure 1: Total number of fatalities (turquoise) and injuries (pink) per type of weather event in the USA between 1996 and 2010.

Economic consequences of bad weather events

Aggregating the amount of property and crop damage per event type:

df.property.damage <- aggregate(PROPERTY_DAMAGE ~ EVENT_TYPE, data = df.stormdata.clean, sum)
df.crop.damage <- aggregate(CROP_DAMAGE ~ EVENT_TYPE, data = df.stormdata.clean, sum)
df.economic.damage <- merge(df.property.damage, df.crop.damage) %>% mutate(TOTAL.DAMAGE = PROPERTY_DAMAGE + CROP_DAMAGE)

We find a top 10 event types in terms of total damage:

df.economic.damage %>% arrange(desc(TOTAL.DAMAGE)) %>% head(10)
##             EVENT_TYPE PROPERTY_DAMAGE CROP_DAMAGE TOTAL.DAMAGE
## 1    Thunderstorm Wind       199989507    16470685    216460192
## 2          Flash Flood       127070281    16536480    143606761
## 3              Tornado       115642211     7337891    122980102
## 4                 Hail        63363505    49441240    112804745
## 5                Flood        84500695    16910420    101411115
## 6            Lightning        45625086      183644     45808730
## 7            High Wind        33335099     2109740     35444839
## 8              Drought         1336750    14382830     15719580
## 9  Hurricane (Typhoon)        10930453     4353775     15284228
## 10            Wildfire        12960535     1021913     13982448

Visualizing these results we see that Thunderstorm Winds, Flash Floods, Tornadoes and Hail cause most economic damage, that damage to property is overall much higher than to crops and that the biggest contributor to crop damage is Hail.

df.economic.damage.top.sum <- df.economic.damage %>% arrange(desc(TOTAL.DAMAGE)) %>% head(10)
barchart(PROPERTY_DAMAGE + CROP_DAMAGE ~ EVENT_TYPE, data=df.economic.damage.top.sum, ylab="Damage (USD)", main="Economic Damage by Weather Type")
Figure 2: Total economic damage in terms of property (turquoise) and crops (pink) caused by different types of weather event, measured in the USA between 1996 and 2010.

Figure 2: Total economic damage in terms of property (turquoise) and crops (pink) caused by different types of weather event, measured in the USA between 1996 and 2010.