Weather events are responsible for substantial damage to people and the economy. In this report we identify the weather events in the United States that caused the highest amount of damage. Events from the year 1950 to November 2011 are included in our analysis. According to this data, tornadoes are the most dangerous weather events as measured by number of fatalities and injuries. Most economic damage was caused by hurricanes, as measured by the sum of damage to property and crops. Floods were the most damaging events for crops, while damage to property was again dominated by hurricanes.
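This analysis is based on the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The data was downloaded from here. It is stored as a bzip2-compressed CSV file, which read.csv can read directly. Missing data is coded as empty fields, so we pass the empty string as na.strings. Since there are no comments in the data, we also set the argument comment.char to the empty string, which is already the default for read.csv and is only stated explicitly here.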
data.full <- read.csv("repdata_data_StormData.csv.bz2",
                      na.strings = "",
                      comment.char = "")
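The effect of the two arguments can be checked on a tiny inline table. This is a minimal sketch on made-up rows (the object names csv.text and toy are assumptions, not part of the report): an empty field becomes NA, and a "#" inside a field is kept as data.

```r
# Toy CSV text, not taken from the storm database.
csv.text <- "EVTYPE,PROPDMG,PROPDMGEXP
TORNADO,25,K
HAIL,0,
FLOOD #2,1.5,M"

toy <- read.csv(text = csv.text,
                na.strings = "",          # empty fields are read as NA
                comment.char = "",        # '#' is ordinary data, not a comment
                stringsAsFactors = FALSE) # keep strings as character vectors

print(toy)
```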
Let’s get a feeling for the dataset by looking at its dimensions and the feature names:
data.nrow <- nrow(data.full)
data.ncol <- ncol(data.full)
names(data.full)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
There are 902297 data entries and 37 features in the dataset.
We’re mostly interested in the event type, EVTYPE, the impact on public health measured by the number of fatalities, FATALITIES, and the number of injuries, INJURIES, and the economic impact measured by the property damage, PROPDMG, and the crop damage, CROPDMG.
The damage variables are measured in USD. However, the magnitudes of these values are stored in additional columns, PROPDMGEXP and CROPDMGEXP. The letters K, M and B (in upper or lower case) are interpreted as thousands, millions and billions, respectively. A missing unit is interpreted as no exponent, so the value is multiplied by one. For any other code, the multiplier is set to NA. Luckily, most of the data entries adhere to this code:
table(data.full$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5 6
## 1 8 5 216 25 13 4 4 28 4
## 7 8 B h H K m M
## 5 1 40 1 6 424665 7 11330
table(data.full$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 7 19 1 9 21 281832 1 1994
The conversion from unit code to multiplier for the damage columns is done in a helper function:
unit.to.multiplier <- function(unit)
{
    if (is.na(unit))
    {
        1 # missing entry is interpreted as no exponent -> multiply by one
    }
    else if (unit %in% c("k", "K"))
    {
        1e3
    }
    else if (unit %in% c("m", "M"))
    {
        1e6
    }
    else if (unit %in% c("b", "B"))
    {
        1e9
    }
    else
    {
        NA # any unrecognized factor level is interpreted as NA
    }
}
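The same mapping can also be written as a vectorised lookup, which avoids calling the helper once per row. The following is only a sketch under the rules stated above; the names unit.multipliers and units.to.multipliers are hypothetical and do not appear in the report.

```r
# Hypothetical vectorised variant of the unit lookup (not the report's code).
unit.multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

units.to.multipliers <- function(units)
{
    # as.character() guards against factor input; toupper() folds k/m/b
    codes <- toupper(as.character(units))
    # indexing a named vector by name yields NA for unknown names
    multipliers <- unname(unit.multipliers[codes])
    # a missing unit is interpreted as no exponent -> multiply by one
    multipliers[is.na(units)] <- 1
    multipliers
}

example.units <- c("K", "m", NA, "B", "?", "5")
example.multipliers <- units.to.multipliers(example.units)
print(example.multipliers)
```

Unrecognized codes such as "?" or "5" stay NA, which matches the behaviour of the helper function.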
After converting the damage columns they are added to create a variable for the total damage in USD, TOTALDMG.USD. In addition, we select only the columns we’re interested in and have a look at the first few rows:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data <- data.full %>%
    select(EVTYPE, FATALITIES, INJURIES, PROPDMG:CROPDMGEXP) %>%
    mutate(PROPDMGEXP = sapply(PROPDMGEXP, unit.to.multiplier),
           PROPDMG.USD = PROPDMG * PROPDMGEXP,
           CROPDMGEXP = sapply(CROPDMGEXP, unit.to.multiplier),
           CROPDMG.USD = CROPDMG * CROPDMGEXP,
           TOTALDMG.USD = PROPDMG.USD + CROPDMG.USD)
rm(data.full)
head(data)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 1000 0 1
## 2 TORNADO 0 0 2.5 1000 0 1
## 3 TORNADO 0 2 25.0 1000 0 1
## 4 TORNADO 0 2 2.5 1000 0 1
## 5 TORNADO 0 2 2.5 1000 0 1
## 6 TORNADO 0 6 2.5 1000 0 1
## PROPDMG.USD CROPDMG.USD TOTALDMG.USD
## 1 25000 0 25000
## 2 2500 0 2500
## 3 25000 0 25000
## 4 2500 0 2500
## 5 2500 0 2500
## 6 2500 0 2500
We expect only a few missing values after processing the data. The following call computes the fraction of missing values in each column:
sapply(data, function(x) mean(is.na(x)))
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 3.557587e-04
## CROPDMG CROPDMGEXP PROPDMG.USD CROPDMG.USD TOTALDMG.USD
## 0.000000e+00 2.992363e-05 3.557587e-04 2.992363e-05 3.856823e-04
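In the output above, the missing fraction of TOTALDMG.USD equals the sum of the fractions for PROPDMG.USD and CROPDMG.USD. This is what one would expect if no row has an unrecognized unit in both columns, because addition propagates NA. A minimal sketch with made-up values (the vectors below are illustrative, not taken from the database):

```r
# Toy values, not taken from the storm database.
prop.usd <- c(25000, NA, 2500, 1e6)
crop.usd <- c(0, 0, NA, 5e5)

# Addition propagates NA: the total is missing if either part is missing.
total.usd <- prop.usd + crop.usd

# Fraction of missing values per vector, as in the sapply() call above.
na.share <- sapply(list(PROP = prop.usd, CROP = crop.usd, TOTAL = total.usd),
                   function(x) mean(is.na(x)))
print(na.share)
```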
The impact of weather events on public health can be measured by the total number of fatalities and injuries attributed to a certain event.
data.pop.health <- data %>%
    group_by(EVTYPE) %>%
    summarize(SUMMED.FATALITIES = sum(FATALITIES),
              SUMMED.INJURIES = sum(INJURIES))
head(arrange(data.pop.health, desc(SUMMED.FATALITIES), desc(SUMMED.INJURIES)))
## # A tibble: 6 × 3
## EVTYPE SUMMED.FATALITIES SUMMED.INJURIES
## <fctr> <dbl> <dbl>
## 1 TORNADO 5633 91346
## 2 EXCESSIVE HEAT 1903 6525
## 3 FLASH FLOOD 978 1777
## 4 HEAT 937 2100
## 5 LIGHTNING 816 5230
## 6 TSTM WIND 504 6957
head(arrange(data.pop.health, desc(SUMMED.INJURIES), desc(SUMMED.FATALITIES)))
## # A tibble: 6 × 3
## EVTYPE SUMMED.FATALITIES SUMMED.INJURIES
## <fctr> <dbl> <dbl>
## 1 TORNADO 5633 91346
## 2 TSTM WIND 504 6957
## 3 FLOOD 470 6789
## 4 EXCESSIVE HEAT 1903 6525
## 5 LIGHTNING 816 5230
## 6 HEAT 937 2100
So we see that tornadoes are by far the most harmful weather events, both by fatalities and by injuries. We can also compare the distribution of injuries per event for the 10 event types with the highest total number of injuries. Interestingly, among events with at least one injury, heat causes on average more injuries per event than tornadoes. The total number of injuries is dominated by a few catastrophic tornadoes. Other events such as lightning rarely cause more than a handful of injuries. The distributions are even more skewed for fatalities.
top.pop.health.events <- arrange(data.pop.health, desc(SUMMED.INJURIES))$EVTYPE[1:10]
data.top.pop.health.events <- data %>% filter(EVTYPE %in% top.pop.health.events)
library(ggplot2)
qplot(EVTYPE, log10(INJURIES),
      data = data.top.pop.health.events,
      geom = "boxplot",
      xlab = "Event type",
      ylab = "log10 of injuries",
      main = "Injuries caused by the 10 event types with the most injuries")
## Warning: Removed 736740 rows containing non-finite values (stat_boxplot).
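The warning arises because log10(0) is -Inf, so every event without injuries is dropped from the boxplot. The sketch below reproduces this on made-up counts. The log10(x + 1) transform at the end is only one possible alternative and an assumption on our part, not what the plot above uses.

```r
# Toy injury counts, not real data.
injuries <- c(0, 0, 1, 15, 0, 150)

# log10(0) is -Inf; such values are non-finite and get removed when plotting.
log.injuries <- log10(injuries)
n.dropped <- sum(!is.finite(log.injuries))
print(n.dropped)

# Possible alternative (an assumption): shift by one so zeros map to 0.
log1.injuries <- log10(injuries + 1)
print(log1.injuries)
```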
Economic damage is calculated as the sum of damage to property and to crops.
data.economy <- data %>%
    group_by(EVTYPE) %>%
    summarize(SUMMED.TOTALDMG.USD = sum(TOTALDMG.USD),
              SUMMED.PROPDMG.USD = sum(PROPDMG.USD),
              SUMMED.CROPDMG.USD = sum(CROPDMG.USD))
head(arrange(data.economy, desc(SUMMED.TOTALDMG.USD)))
## # A tibble: 6 × 4
## EVTYPE SUMMED.TOTALDMG.USD SUMMED.PROPDMG.USD
## <fctr> <dbl> <dbl>
## 1 HURRICANE/TYPHOON 71913712800 69305840000
## 2 STORM SURGE 43323541000 43323536000
## 3 HURRICANE 14610229010 11868319010
## 4 RIVER FLOOD 10148404500 5118945500
## 5 TROPICAL STORM 8382236550 7703890550
## 6 WILDFIRE 5060586800 4765114000
## # ... with 1 more variables: SUMMED.CROPDMG.USD <dbl>
head(arrange(data.economy, desc(SUMMED.PROPDMG.USD)))
## # A tibble: 6 × 4
## EVTYPE SUMMED.TOTALDMG.USD SUMMED.PROPDMG.USD
## <fctr> <dbl> <dbl>
## 1 HURRICANE/TYPHOON 71913712800 69305840000
## 2 STORM SURGE 43323541000 43323536000
## 3 HURRICANE 14610229010 11868319010
## 4 TROPICAL STORM 8382236550 7703890550
## 5 RIVER FLOOD 10148404500 5118945500
## 6 WILDFIRE 5060586800 4765114000
## # ... with 1 more variables: SUMMED.CROPDMG.USD <dbl>
head(arrange(data.economy, desc(SUMMED.CROPDMG.USD)))
## # A tibble: 6 × 4
## EVTYPE SUMMED.TOTALDMG.USD SUMMED.PROPDMG.USD
## <fctr> <dbl> <dbl>
## 1 FLOOD NA NA
## 2 RIVER FLOOD 10148404500 5118945500
## 3 ICE STORM NA NA
## 4 HURRICANE 14610229010 11868319010
## 5 HURRICANE/TYPHOON 71913712800 69305840000
## 6 FLASH FLOOD NA NA
## # ... with 1 more variables: SUMMED.CROPDMG.USD <dbl>
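Measured by total and property damage, hurricanes/typhoons are the most devastating weather events, followed by storm surge. Combining the event types HURRICANE/TYPHOON and HURRICANE would make this result even clearer. Note, however, that sum() returns NA for any event type with at least one unrecognized unit code, and arrange() sorts NA last. Event types such as FLOOD, ICE STORM and FLASH FLOOD therefore cannot appear in the total and property rankings above, so those two rankings only cover event types without such entries. Judging by crop damage alone, floods lead to the largest economic losses, followed by river floods and ice storms.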
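Both points, the NA sums and the split hurricane labels, can be illustrated on a toy table. This is a sketch with invented numbers and hypothetical object names (toy, sum.strict, sum.na.rm). Passing na.rm = TRUE is one option rather than the report's method, and it understates the damage of event types whose rows are dropped.

```r
# Toy table with invented damage values, not taken from the storm database.
toy <- data.frame(
    EVTYPE = c("FLOOD", "FLOOD", "HURRICANE", "HURRICANE/TYPHOON", "ICE STORM"),
    TOTALDMG.USD = c(5e9, NA, 2e9, 4e9, 1e9),
    stringsAsFactors = FALSE)

# Hypothetical recoding: fold both hurricane labels into one event type.
toy$EVTYPE[toy$EVTYPE == "HURRICANE/TYPHOON"] <- "HURRICANE"

# A plain sum is NA as soon as one row of the group is NA ...
sum.strict <- tapply(toy$TOTALDMG.USD, toy$EVTYPE, sum)
# ... whereas na.rm = TRUE sums the remaining rows of each group.
sum.na.rm <- tapply(toy$TOTALDMG.USD, toy$EVTYPE, sum, na.rm = TRUE)

print(sum.strict)
print(sum.na.rm)
```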