We performed an exploratory analysis of data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database in order to address the following questions: a) Which types of weather events are most harmful with respect to population health, and b) which have the greatest economic consequences? We looked at absolute effects summed per event type across all of the United States in time period 1996 to 2010. We determined the top 10 event categories measured in terms of fatalities, injuries, and property/crop damage and found the following: Excessive Heat, Tornadoes, Flash Floods and Lightning are most fatal. Tornadoes, Floods, Excessive Heat and Thunderstorm Winds cause most injuries. Thunderstorm Winds, Flash Floods, Tornadoes and Hail cause most economic damage. Furthermore, damage to property is overall much higher than to crops and that the biggest contributor to crop damage is Hail.
For our analysis we need to load the following libraries.
library(lattice)
library(ggplot2)
library(lubridate)
library(stringdist)
suppressMessages(library(dplyr))
Data was downloaded from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2 and we use this as our raw data set.
if(!file.exists("stormdata.csv.bz2")){
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "stormdata.csv.bz2", method="curl")
}
df.stormdata <- read.csv("stormdata.csv.bz2")
The raw data set consists of 902297 entries of 37 variables.
dim(df.stormdata)
## [1] 902297 37
names(df.stormdata)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
Furthermore we load an the official list of event types, extracted from the National Weather Service Storm Data Documentation. Its contents are show below:
eventtypes <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze", "Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Designator", "Event Name", "Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather")
length(eventtypes)
## [1] 50
We first convert date variable BGN_DATE to type POSIXct
df.stormdata$BGN_DATE <- mdy_hms(df.stormdata$BGN_DATE)
and select only the variables relevant for this analysis
df.stormdata.explore <- subset(df.stormdata, select=c(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
A quick summary of this subsetted data set shows that there are no missing values, that a 75% majority of the dataset was generated since 1995 (see BGN_DATE), that multiple of the EVTYPE values do not correspond exactly with elements of eventtypes. Also, the majority of the entries seem to have zero values for FATALITIES, INFURIES, PROPDMG and CROPDMG.
summary(df.stormdata.explore)
## BGN_DATE EVTYPE
## Min. :1950-01-03 00:00:00 HAIL :288661
## 1st Qu.:1995-04-20 00:00:00 TSTM WIND :219940
## Median :2002-03-18 00:00:00 THUNDERSTORM WIND: 82563
## Mean :1998-12-27 23:37:48 TORNADO : 60652
## 3rd Qu.:2007-07-28 00:00:00 FLASH FLOOD : 54277
## Max. :2011-11-30 00:00:00 FLOOD : 25326
## (Other) :170878
## FATALITIES INJURIES PROPDMG PROPDMGEXP
## Min. : 0.0000 Min. : 0.0000 Min. : 0.00 :465934
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00 K :424665
## Median : 0.0000 Median : 0.0000 Median : 0.00 M : 11330
## Mean : 0.0168 Mean : 0.1557 Mean : 12.06 0 : 216
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.50 B : 40
## Max. :583.0000 Max. :1700.0000 Max. :5000.00 5 : 28
## (Other): 84
## CROPDMG CROPDMGEXP
## Min. : 0.000 :618413
## 1st Qu.: 0.000 K :281832
## Median : 0.000 M : 1994
## Mean : 1.527 k : 21
## 3rd Qu.: 0.000 0 : 19
## Max. :990.000 B : 9
## (Other): 9
Problem 1: The PROPDMGEXP variable, which according to the documentation seems to denote exponential multiplication factors 10^{PROPDMGEXP} for PROPDMG seems to contain invalid values. Also, a single amount should not be stored in two separate variables. The same goes for CROPDMGEXP:
sort(unique(df.stormdata.explore$PROPDMGEXP))
## [1] - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
sort(unique(df.stormdata.explore$CROPDMGEXP))
## [1] ? 0 2 B k K m M
## Levels: ? 0 2 B k K m M
Investigating this a bit further we see that almost all entries with peculiar values for PROPDMGEXP and CROPDMGEXP are from the time range 1993 to 1995.
summary(df.stormdata.explore[df.stormdata.explore$PROPDMGEXP %in% c("-", "?", "+", "h", "H", "0", "1", "2"),]$BGN_DATE)
## Min. 1st Qu. Median
## "1993-04-24 00:00:00" "1995-02-03 00:00:00" "1995-05-18 00:00:00"
## Mean 3rd Qu. Max.
## "1995-05-04 14:24:00" "1995-07-14 12:00:00" "2011-07-27 00:00:00"
summary(df.stormdata.explore[df.stormdata.explore$CROPDMGEXP %in% c("?", "0", "2"),]$BGN_DATE)
## Min. 1st Qu. Median
## "1993-02-12 00:00:00" "1994-10-06 00:00:00" "1995-02-16 00:00:00"
## Mean 3rd Qu. Max.
## "1994-12-05 12:26:40" "1995-06-01 12:00:00" "1995-09-01 00:00:00"
Problem 2: Furthermore we find a long list of 985 different EVTYPE event types (many more than the 50 different categories stored in our eventtypes variable).
# Check EVTYPE categories matching official categories
length(unique(df.stormdata.explore$EVTYPE))
## [1] 985
head(sort(unique(df.stormdata.explore$EVTYPE)), 10)
## [1] ? ABNORMALLY DRY ABNORMALLY WET
## [4] ABNORMAL WARMTH ACCUMULATED SNOWFALL AGRICULTURAL FREEZE
## [7] APACHE COUNTY ASTRONOMICAL HIGH TIDE ASTRONOMICAL LOW TIDE
## [10] AVALANCE
## 985 Levels: ? ABNORMALLY DRY ABNORMALLY WET ... WND
We decided to select only data from 1995 until 2010 to work with, for multiple reasons: lower quality of PROPDMGEXP and CROPDMGEXP between 1993 and 1995, the fact that less event types were recorded before 1996 according to the documentation, and because the data from 2011 is not yet complete (in other words: only fully available years with higher data quality were selected).
df.stormdata.clean <- subset(df.stormdata, BGN_DATE > ymd("1995-12-31") & BGN_DATE < ymd("2011-01-01"), select=c(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
We introduce new variables PROPERTY_DAMAGE and CROP_DAMAGE (calculated from PROPDMG * 10^PROPDMGEXP and CROPDMG * 10^CROPDMGEXP respectively) and drop the old ones.
multiplier_mapping <- c("0" = 1, "1" = 10, "2" = 100, "3" = 1000, "4" = 10000, "5" = 100000, "6" = 1000000, "7" = 10000000, "8" = 100000000, "H" = 100, "h" = 100, "K" = 1000, "k" = 1000, "M" = 1000000, "m" = 1000000, "B" =1000000000, "b" = 1000000000)
df.stormdata.clean$PROPERTY_DAMAGE <- df.stormdata.clean$PROPDMG * multiplier_mapping[droplevels(df.stormdata.clean$PROPDMGEXP)]
df.stormdata.clean$CROP_DAMAGE <- df.stormdata.clean$CROPDMG * multiplier_mapping[droplevels(df.stormdata.clean$CROPDMGEXP)]
df.stormdata.clean <- subset(df.stormdata.clean, select=-c(PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
We try to clean up the EVTYPE values by first converting them to lowercase (to prevent upper/lower mismatch) and adding a few manual substitution rules:
df.stormdata.clean$EVTYPE <- tolower(df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("tstm", "thunderstorm", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^wild/forest fire$", "wildfire", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^thunderstorm wind/hail$", "hail", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^winter weather/mix$", "winter weather", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^extreme cold$", "extreme cold/wind chill", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^snow$", "heavy snow", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^wind$", "strong wind", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^storm surge$", "storm surge/tide", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^landslide$", "debris flow", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^heavy surf/high surf$", "high surf", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^extreme windchill$", "extreme cold/wind chill", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^hurricane$", "hurricane (typhoon)", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("^record warmth$|^recored heat$", "excessive heat", df.stormdata.clean$EVTYPE)
df.stormdata.clean$EVTYPE <- gsub("wintry", "winter", df.stormdata.clean$EVTYPE)
And subsequently trying to matching the raw EVTYPE to the officially used eventtypes that we defined above (correcting for small string differences). We use factor level UNKNOWN for an unmatched event type.
df.stormdata.clean$EVENT_TYPE <- eventtypes[amatch(tolower(df.stormdata.clean$EVTYPE), tolower(eventtypes), maxDist=3, nomatch=NA)]
levels(df.stormdata.clean$EVENT_TYPE) <- c(levels(df.stormdata.clean$EVENT_TYPE), "UNKNOWN")
df.stormdata.clean$EVENT_TYPE[is.na(df.stormdata.clean$EVENT_TYPE)] <- "UNKNOWN"
df.stormdata.clean <- subset(df.stormdata.clean, select=-c(EVTYPE))
This results in 1.02% of all entries having event type UNKNOWN, which we certainly consider acceptable.
sum(df.stormdata.clean$EVENT_TYPE=="UNKNOWN")/length(df.stormdata.clean$EVENT_TYPE)
## [1] 0.01021212
Aggregating the number of fatalities and injuries per event type:
df.fatalities <- aggregate(FATALITIES ~ EVENT_TYPE, data = df.stormdata.clean, sum)
df.injuries <- aggregate(INJURIES ~ EVENT_TYPE, data = df.stormdata.clean, sum)
df.fatalities.injuries <- merge(df.fatalities, df.injuries) %>% mutate(SUM.FATALATIES.INJURIES = FATALITIES + INJURIES)
We find a top 10 event types in terms of fatalities:
df.fatalities.injuries %>% arrange(desc(FATALITIES)) %>% head(10)
## EVENT_TYPE FATALITIES INJURIES SUM.FATALATIES.INJURIES
## 1 Excessive Heat 1761 6253 8014
## 2 Tornado 924 14504 15428
## 3 Flash Flood 819 1644 2463
## 4 Lightning 625 3947 4572
## 5 Rip Current 513 476 989
## 6 Flood 435 7475 7910
## 7 Thunderstorm Wind 317 4656 4973
## 8 Extreme Cold/Wind Chill 255 107 362
## 9 High Wind 231 1072 1303
## 10 Avalanche 214 148 362
And a top 10 event types causing most injuries:
df.fatalities.injuries %>% arrange(desc(INJURIES)) %>% head(10)
## EVENT_TYPE FATALITIES INJURIES SUM.FATALATIES.INJURIES
## 1 Tornado 924 14504 15428
## 2 Flood 435 7475 7910
## 3 Excessive Heat 1761 6253 8014
## 4 Thunderstorm Wind 317 4656 4973
## 5 Lightning 625 3947 4572
## 6 Flash Flood 819 1644 2463
## 7 Wildfire 81 1340 1421
## 8 Hurricane (Typhoon) 125 1321 1446
## 9 Winter Storm 190 1292 1482
## 10 High Wind 231 1072 1303
Visualizing these results shows clearly that Excessive Heat, Tornadoes, Flash Floods and Lightning are most fatal and Tornadoes, Floods, Excessive Heat and Thunderstorm Winds cause most injuries.
df.fatalities.injuries.top.sum <- df.fatalities.injuries %>% arrange(desc(SUM.FATALATIES.INJURIES)) %>% head(10)
barchart(FATALITIES + INJURIES ~ EVENT_TYPE, data=df.fatalities.injuries.top.sum, ylab="Number of people", main="Fatalities and Injuries by Weather Type")
Figure 1: Total number of fatalities (turquoise) and injuries (pink) per type of weather event in the USA between 1996 and 2010.
Aggregating the amount of property and crop damage per event type:
df.property.damage <- aggregate(PROPERTY_DAMAGE ~ EVENT_TYPE, data = df.stormdata.clean, sum)
df.crop.damage <- aggregate(CROP_DAMAGE ~ EVENT_TYPE, data = df.stormdata.clean, sum)
df.economic.damage <- merge(df.property.damage, df.crop.damage) %>% mutate(TOTAL.DAMAGE = PROPERTY_DAMAGE + CROP_DAMAGE)
We find a top 10 event types in terms of total damage:
df.economic.damage %>% arrange(desc(TOTAL.DAMAGE)) %>% head(10)
## EVENT_TYPE PROPERTY_DAMAGE CROP_DAMAGE TOTAL.DAMAGE
## 1 Thunderstorm Wind 199989507 16470685 216460192
## 2 Flash Flood 127070281 16536480 143606761
## 3 Tornado 115642211 7337891 122980102
## 4 Hail 63363505 49441240 112804745
## 5 Flood 84500695 16910420 101411115
## 6 Lightning 45625086 183644 45808730
## 7 High Wind 33335099 2109740 35444839
## 8 Drought 1336750 14382830 15719580
## 9 Hurricane (Typhoon) 10930453 4353775 15284228
## 10 Wildfire 12960535 1021913 13982448
Visualizing these results we see that Thunderstorm Winds, Flash Floods, Tornadoes and Hail cause most economic damage, that damage to property is overall much higher than to crops and that the biggest contributor to crop damage is Hail.
df.economic.damage.top.sum <- df.economic.damage %>% arrange(desc(TOTAL.DAMAGE)) %>% head(10)
barchart(PROPERTY_DAMAGE + CROP_DAMAGE ~ EVENT_TYPE, data=df.economic.damage.top.sum, ylab="Damage (USD)", main="Economic Damage by Weather Type")
Figure 2: Total economic damage in terms of property (turquoise) and crops (pink) caused by different types of weather event, measured in the USA between 1996 and 2010.