Author: Kendall Giles
Contact: @kendallgiles
In this report we look at the the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, analyzing the data to find weather events most harmful to people and from an economic standpoint. Since severe weather events can cause harm to human populations as well as impact local economies, understanding which weather events are more harmful than others can help governments prioritize resources.
Before analyzing the data using R, we perform simple data cleaning steps to account for missing and corrupted data values.
Our preliminary analysis indicates that, including injuries and fatalities, the most harmful weather events to humans are tornados; after combining property and crop damage, the most harmful economic weather events are floods.
Severe weather events can cause harm to human populations as well as impact local economies–injuries, fatalities, property damage, and crop damage are of particular concern. Since severe weather events can cause harm to human populations as well as impact local economies, understanding which weather events are more harmful than others can help governments prioritize resources.
To help addresses these concenrs, we explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and other weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property/crop damage.
Two main questions are the focus of this analysis:
Across the United States, which types of weather events are most harmful with respect to population health?
Across the United States, which types of weather events have the greatest economic consequences?
This analysis was conducted using R (version 3.1.2) in RStudio. This report, an example of literate (statistical) programming, was created with the package knitr and using the package dplyr, in addition to the default packages loaded by RStudio.
Note that all R processing code used in the analysis is presented in the text of this document–normally some non-essential-to-the-reader code segments are hidden, but for demonstration and transparancy purposes, all code and all outputs are presented herein.
The NOAA storm database covers recorded weather events in the US from 1950 to November 2011. Additional documentation about this dataset includes:
The raw NOAA Storm dataset, a csv file compressed via bzip2, was downloaded. This dataset was uncompressed and placed in the root of the R working directory. The resulting raw .csv data file is 561.6 MB.
The dataset was read into memory. Note that missing values were coded as empty strings, and the header was read to get the variable names.
raw.data <- read.csv("./noaa_storm_data.csv", header=TRUE, na.strings = "")
The raw dataset contains 902297 observations of 37 variables each, and takes up 409.4 Mb of memory:
dim(raw.data)
## [1] 902297 37
print(object.size(raw.data), units="Mb")
## 409.4 Mb
Here are the first few rows of the dataset:
head(raw.data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 <NA> <NA> <NA> <NA> 0
## 2 TORNADO 0 <NA> <NA> <NA> <NA> 0
## 3 TORNADO 0 <NA> <NA> <NA> <NA> 0
## 4 TORNADO 0 <NA> <NA> <NA> <NA> 0
## 5 TORNADO 0 <NA> <NA> <NA> <NA> 0
## 6 TORNADO 0 <NA> <NA> <NA> <NA> 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 <NA> <NA> 14.0 100 3 0 0
## 2 NA 0 <NA> <NA> 2.0 150 2 0 0
## 3 NA 0 <NA> <NA> 0.1 123 2 0 0
## 4 NA 0 <NA> <NA> 0.0 100 2 0 0
## 5 NA 0 <NA> <NA> 0.0 150 2 0 0
## 6 NA 0 <NA> <NA> 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0 <NA> <NA> <NA> <NA>
## 2 0 2.5 K 0 <NA> <NA> <NA> <NA>
## 3 2 25.0 K 0 <NA> <NA> <NA> <NA>
## 4 2 2.5 K 0 <NA> <NA> <NA> <NA>
## 5 2 2.5 K 0 <NA> <NA> <NA> <NA>
## 6 6 2.5 K 0 <NA> <NA> <NA> <NA>
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 <NA> 1
## 2 3042 8755 0 0 <NA> 2
## 3 3340 8742 0 0 <NA> 3
## 4 3458 8626 0 0 <NA> 4
## 5 3412 8642 0 0 <NA> 5
## 6 3450 8748 0 0 <NA> 6
With respect to weather event types, indicated by the variable EVTYPE, there are two main categories of event consequences that were tracked in the dataset:
Harm to the human population, either fatality or injury as the result of a storm, is quantified by the variables: FATALITIES and INJURIES.
Ecomonic harm, which in this dataset encompasses property damage and crop damage, is quantified by the variables PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.
Calculating some statistics about harm to human populations and property:
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
hail.data <- filter(raw.data, EVTYPE=="HAIL")
sum(hail.data$FATALITIES)
## [1] 15
water.data <- filter(raw.data, EVTYPE=="WATERSPOUT")
sum(water.data$INJURIES)
## [1] 29
wind.data <- filter(raw.data, EVTYPE=="WIND DAMAGE")
prop.damage.exponents <- wind.data %>% filter(PROPDMG>0) %>% select(PROPDMGEXP)
table(prop.damage.exponents)
## prop.damage.exponents
## - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 0
sum(wind.data$PROPDMG)
## [1] 77
As an example, in the years encompassed by this dataset, 1950-2011, there were 15 recorded fatalities from hail, 29 recorded injuries from waterspouts, and $77K in recorded wind damage to property.
tsunami.data <- filter(raw.data, EVTYPE=="TSUNAMI")
sum(tsunami.data$FATALITIES)
## [1] 33
But, note that there were 33 recorded fatalities due to tsunamis during the same time period, indicating that some weather events do more harm than others.
There are 985 weather event types recorded in the dataset, such as: TORNADO, TSTM WIND, HAIL, FREEZING RAIN, SNOW, ICE STORM/FLASH FLOOD, SNOW/ICE, WINTER STORM, HURRICANE OPAL/HIGH WINDS, THUNDERSTORM WINDS.
length(unique(raw.data$EVTYPE))
## [1] 985
as.character(unique(raw.data$EVTYPE)[1:10])
## [1] "TORNADO" "TSTM WIND"
## [3] "HAIL" "FREEZING RAIN"
## [5] "SNOW" "ICE STORM/FLASH FLOOD"
## [7] "SNOW/ICE" "WINTER STORM"
## [9] "HURRICANE OPAL/HIGH WINDS" "THUNDERSTORM WINDS"
However, according to the National Weather Service Documentation, there are only meant to be 48 event types recorded. This means that the dataset is not clean–through poor user interface design and data collection methods the data in the dataset has incomplete, wrong, or missing values.
Also, from this document on the NOAA website describing the storm events database, it can be seen that from 1950-1955, only TORNADO events were recorded; from 1955-1996 only TORNADO, THUNDERSTORM WIND, and HAIL events were recorded. It was only starting in 1996 that all 48 event types were recorded. Thus, in the early years, the data are severely biased towards a subset of the weather events, since not all weather events were recorded.
So, for the purposes of this preliminary analysis, only weather events that started in or after 1996 will be used (using the BGN_DATE to get the date the weather event started):
library(dplyr)
relevant.data <- filter(raw.data, as.numeric(format(as.Date(as.character(raw.data$BGN_DATE), "%m/%d/%Y %H:%M:%S"), "%Y")) >= 1996)
dim(relevant.data)
## [1] 653530 37
As noted previously the variables for economic harm are record in PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.
According to the NWS documentation, the only valid damage estimate multipliers are “K” for thousands, “M” for millions, and “B” for billions–eg, 1.55 in the PROPDMG variable and B in the PROPDMGEXP variable means the property damage was $1,550,000,000. So, a PROPDMGEXP value of K corresponds to a PROPDMG multiplier of 1000, M a multiplier of 1000000, and B a multiplier of 1000000000.
However, in looking at the values in the PROPDMGEXP and CROPDMGEXP variables, we see that some values are missing, and the PROPDMPEXP variable has some eponents coded as 0. For this report, damage estimates with missing (NA) values will be removed, under the assumption the dollar amount of the damages was either unknown or unavailable. Exponents with 0 (i.e., indicating an exponent multiplier of 10 to the 0 = 1) will indicate that the damage estimates are in units of dollars.
unique(as.character(relevant.data$PROPDMGEXP))
## [1] "K" NA "M" "B" "0"
unique(as.character(relevant.data$CROPDMGEXP))
## [1] "K" NA "M" "B"
So, removing NA damage estimate records and multiplying the damage estimates by the appropriate exponent:
prop.data <- relevant.data %>% select(EVTYPE, PROPDMG, PROPDMGEXP)
crop.data <- relevant.data %>% select(EVTYPE, CROPDMG, CROPDMGEXP)
dim(prop.data)
## [1] 653530 3
dim(crop.data)
## [1] 653530 3
prop.data <- prop.data[!is.na(prop.data$PROPDMGEXP),]
dim(prop.data)
## [1] 377345 3
crop.data <- crop.data[!is.na(crop.data$CROPDMGEXP),]
dim(crop.data)
## [1] 280461 3
prop.0 <- prop.data %>% filter(as.character(PROPDMGEXP)=="0") %>% select(EVTYPE, PROPDMG) %>% rename(DMG = PROPDMG)
prop.K <- prop.data %>% filter(as.character(PROPDMGEXP)=="K") %>% mutate(PROPDMG = PROPDMG*1000) %>% select(EVTYPE, PROPDMG) %>% rename(DMG = PROPDMG)
prop.M <- prop.data %>% filter(as.character(PROPDMGEXP)=="M") %>% mutate(PROPDMG = PROPDMG*1000000) %>% select(EVTYPE, PROPDMG) %>% rename(DMG = PROPDMG)
prop.B <- prop.data %>% filter(as.character(PROPDMGEXP)=="B") %>% mutate(PROPDMG = PROPDMG*1000000000) %>% select(EVTYPE, PROPDMG) %>% rename(DMG = PROPDMG)
crop.K <- crop.data %>% filter(as.character(CROPDMGEXP)=="K") %>% mutate(CROPDMG = CROPDMG*1000) %>% select(EVTYPE, CROPDMG) %>% rename(DMG = CROPDMG)
crop.M <- crop.data %>% filter(as.character(CROPDMGEXP)=="M") %>% mutate(CROPDMG = CROPDMG*1000000) %>% select(EVTYPE, CROPDMG) %>% rename(DMG = CROPDMG)
crop.B <- crop.data %>% filter( as.character(CROPDMGEXP)=="B") %>% mutate(CROPDMG = CROPDMG*1000000000) %>% select(EVTYPE, CROPDMG) %>% rename(DMG = CROPDMG)
economic.data <- rbind(prop.0, prop.K, prop.M, prop.B, crop.K, crop.M, crop.B )
dim(economic.data)
## [1] 657806 2
Here is a plot of the sum of the number of fatalities by event type over the years 1996-2011:
grouped.data <- group_by(relevant.data, EVTYPE)
sums.by.event <- summarise_each(grouped.data, funs(sum))
plot(1:nrow(sums.by.event), sums.by.event$FATALITIES, main="Sum of Fatalities by Weather Type, 1996-2011", type="p", pch=19, xlab="Weather Type Index", ylab="Sum of Fatalities")
The x-axis of the plot is the index of the weather type event, while the y-axis is the sum of the fatalities over the years 1996-2011. While most of the fatality sums are near zero for most of the weather events, there is one weather event with a large number of fatalities (> 1500).
To find this weather event name, we find the index of the weather event with the largest sum of fatalities, and the resulting max fatal weather event is EXCESSIVE HEAT.
as.character(sums.by.event$EVTYPE[which.max(sums.by.event$FATALITIES)])
## [1] "EXCESSIVE HEAT"
If we perform a similar analysis, but including injuries as well as fatalities, then the most harmful weather event during 1996-2001 is TORNADO.
grouped.data <- relevant.data %>% group_by(EVTYPE) %>% select(EVTYPE, FATALITIES, INJURIES)
sums.by.event <- summarise_each(grouped.data, funs(sum))
plot(1:nrow(sums.by.event), sums.by.event$FATALITIES + sums.by.event$INJURIES, main="Sum of Fatalities + Injuries by Weather Type, 1996-2011", type="p", pch=19, xlab="Weather Type Index", ylab="Sum of Fatalities + Injuries")
as.character(sums.by.event$EVTYPE[which.max(sums.by.event$FATALITIES + sums.by.event$INJURIES)])
## [1] "TORNADO"
To be sure, we arrange the data and look at the top 20 events by sum of fatalities and injuries:
ordered.sums.by.event <- sums.by.event %>% mutate(SUM = FATALITIES + INJURIES) %>% select(EVTYPE, SUM) %>% arrange(desc(SUM))
ordered.sums.by.event[1:20,]
## Source: local data frame [20 x 2]
##
## EVTYPE SUM
## 1 TORNADO 22178
## 2 EXCESSIVE HEAT 8188
## 3 FLOOD 7172
## 4 LIGHTNING 4792
## 5 TSTM WIND 3870
## 6 FLASH FLOOD 2561
## 7 THUNDERSTORM WIND 1530
## 8 WINTER STORM 1483
## 9 HEAT 1459
## 10 HURRICANE/TYPHOON 1339
## 11 HIGH WIND 1318
## 12 WILDFIRE 986
## 13 HEAVY SNOW 805
## 14 FOG 772
## 15 HAIL 720
## 16 WILD/FOREST FIRE 557
## 17 RIP CURRENT 549
## 18 RIP CURRENTS 496
## 19 BLIZZARD 455
## 20 ICE STORM 400
We see for example one event recorded as TSTM WIND and one event labeled as THUNDERSTORM WIND. TSTM WIND does not appear as a valid weather event, though it could be that this event is a mis-labelling of the valid weather event THUNDERSTORM WIND, a corruption perhaps due to a poor user interface or data collection method. If they are in fact meant to be the same weather event, then the SUM values would need to be added together. However, the resulting combination of TSTM WIND and THUNDERSTORM WIND values is still less than the sum for the TORNADO weather event.
Due to the sharp dropoff in sums for all the following weather events, there appears to be no other mislabeled weather events that, when all summed together with the valid weather event label observations, would top the TORNADO sum.
Along similar lines, we can plot the property and crop damage by weather event and find the weather event corresponding to the max property and crop damage sum:
grouped.data <- group_by(economic.data, EVTYPE)
sums.by.event <- summarise_each(grouped.data, funs(sum))
damage.sums <- sums.by.event$DMG
plot(1:nrow(sums.by.event), damage.sums, main="Sum of Crop and Property Damage, 1996-2011", type="p", pch=19, xlab="Weather Type Index", ylab="Sum of Property + Crop Damage")
as.character(sums.by.event$EVTYPE[which.max(damage.sums)])
## [1] "FLOOD"
We find that the most harmful weather event from an economic standpoint is FLOOD.
And as with the analysis on weather event mislabellings with fatality/injury observations, from looking at the first 20 property and crop damage observations from the ordered set, we see no mislabelled observations that, when combined, would sum to greater than that for FLOOD events.
ordered.sums.by.event <- sums.by.event %>% arrange(desc(DMG))
ordered.sums.by.event[1:20,]
## Source: local data frame [20 x 2]
##
## EVTYPE DMG
## 1 FLOOD 148919611950
## 2 HURRICANE/TYPHOON 71913712800
## 3 STORM SURGE 43193541000
## 4 TORNADO 24900370720
## 5 HAIL 17071172870
## 6 FLASH FLOOD 16557105610
## 7 HURRICANE 14554229010
## 8 DROUGHT 14413667000
## 9 TROPICAL STORM 8320186550
## 10 HIGH WIND 5881421660
## 11 WILDFIRE 5054139800
## 12 TSTM WIND 5031941790
## 13 STORM SURGE/TIDE 4642038000
## 14 THUNDERSTORM WIND 3780985440
## 15 ICE STORM 3657908810
## 16 WILD/FOREST FIRE 3108564830
## 17 WINTER STORM 1544687250
## 18 HEAVY RAIN 1313034240
## 19 EXTREME COLD 1308733400
## 20 FROST/FREEZE 1103566000
In this report we looked at the the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, analyzing the data to find weather events most harmful to people and from an economic standpoint.
We discovered that this dataset was not clean–it contained missing and corrupted values. Some data cleaning was performed, but note that other cleaning decisions could be made with more knowledge of the dataset and data collection methods. With this particular dataset, it appeared that some weather event observations were mislabelled, but correcting for those mislabellings would not have changed our results for the specific questions we were asking.
Thus, for the purposes of this preliminary investigation, including injuries and fatalities, in this dataset the most harmful weather event to humans appears to be tornados; after combining property and crop damage, the most harmful economic event appears to be floods.