I explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to establish which events have the biggest impact on U.S. population health and economy. The dataset is checked for major errors and these are corrected. Events are assigned to types that are partially standardized.
The results show that tornades are by far the most detrimental for the population health. Thunderstorm winds, excessive heat and floods follow but their impact is several times smaller. In terms of economic impact hurricanes incur the biggest damage, followed by tornadoes, storm surges/tides and floods.
file <- "StormData.csv.bz2"
data <- read.csv(file)
str(data$EVTYPE)
## Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
985 is too much. Official list contains only 48 event types. Following offical event list is taken from the Storm Data Preparation directive, p.6:
events <- read.table("event_list_raw.txt", sep=";", col.names = "Event.Type")
sapply(events, function(x) substr(x, 1, regexpr(" [A-Z]$", x) - 1))
## Event.Type
## [1,] "Astronomical Low Tide"
## [2,] "Avalanche"
## [3,] "Blizzard"
## [4,] "Coastal Flood"
## [5,] "Cold/Wind Chill"
## [6,] "Debris Flow"
## [7,] "Dense Fog"
## [8,] "Dense Smoke"
## [9,] "Drought"
## [10,] "Dust Devil"
## [11,] "Dust Storm"
## [12,] "Excessive Heat"
## [13,] "Extreme Cold/Wind Chill"
## [14,] "Flash Flood"
## [15,] "Flood"
## [16,] "Frost/Freeze"
## [17,] "Funnel Cloud"
## [18,] "Freezing Fog"
## [19,] "Hail"
## [20,] "Heat"
## [21,] "Heavy Rain"
## [22,] "Heavy Snow"
## [23,] "High Surf"
## [24,] "High Wind"
## [25,] "Hurricane (Typhoon)"
## [26,] "Ice Storm"
## [27,] "Lake-Effect Snow"
## [28,] "Lakeshore Flood"
## [29,] "Lightning"
## [30,] "Marine Hail"
## [31,] "Marine High Wind"
## [32,] "Marine Strong Wind"
## [33,] "Marine Thunderstorm Wind"
## [34,] "Rip Current"
## [35,] "Seiche"
## [36,] "Sleet"
## [37,] "Storm Surge/Tide"
## [38,] "Strong Wind"
## [39,] "Thunderstorm Wind"
## [40,] "Tornado"
## [41,] "Tropical Depression"
## [42,] "Tropical Storm"
## [43,] "Tsunami"
## [44,] "Volcanic Ash"
## [45,] "Waterspout"
## [46,] "Wildfire"
## [47,] "Winter Storm"
## [48,] "Winter Weather"
Event list should be standardized to conform to the offical list before one can decide which event types make the biggest impact. For this end I don’t apply any automatic string transformation. Instead I use a manual method:
To be sure this approach allows corret identification of the most deterimental event types I will compare the resulting ordering of event types with the impact of the left out, unstandardize EVTYPE’s. We will check if standardizing these other EVTYPE’s might change the ordering at the top of the even type list.
To apply this method I need to first check and clean the numerical measures of damage.
The numerical values describing damage are also doubtful in many cases. To correct the errors I check the top highest numbers as these have the biggest impact on the ordering of event types.
Let’s print and review the exponent values.
unique(data$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
We see the some checking and cleaning should be done:
Checking PROPDMGEXP == B:
table(data[data$PROPDMGEXP=='B',]$PROPDMG)
##
## 0.1 1 1.04 1.2 1.3 1.5 1.6 1.7 1.8 2 2.09 2.1
## 1 6 1 1 1 4 1 1 1 1 1 1
## 2.5 2.8 3 4 4.83 5 5.15 5.42 5.88 7.35 10 11.26
## 2 1 2 3 1 2 1 1 1 1 1 1
## 16.93 31.3 115
## 1 1 1
Let’s check the extreme PROPDMG=115 case (115*10^9, the biggest value):
data[data$PROPDMGEXP=='B' & data$PROPDMG==115,]
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME
## 605953 6 1/1/2006 0:00:00 12:00:00 AM PST 55 NAPA
## STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## 605953 CA FLOOD 0 COUNTYWIDE 1/1/2006 0:00:00
## END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI
## 605953 07:00:00 AM 0 NA 0 COUNTYWIDE
## LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 605953 0 0 NA 0 0 0 115 B 32.5
## CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 605953 M MTR CALIFORNIA, Western 3828 12218
## LATITUDE_E LONGITUDE_
## 605953 3828 12218
## REMARKS
## 605953 Major flooding continued into the early hours of January 1st, before the Napa River finally fell below flood stage and the water receeded. Flooding was severe in Downtown Napa from the Napa Creek and the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million.
## REFNUM
## 605953 605943
It looks like it’s an error and should be M instead of B. I checked other top values for exponents B, 8, 7, 6 and they look fine.
unique(data$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
I checked several of the highest values here and they seem OK.
summary(data$PROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 12.06 0.50 5000.00
Values >= 1000 with exponent attached looks suspicious. Let’s check them:
cond <- data$PROPDMG>=1000
dc <- data[cond,]
table(dc$PROPDMG)
##
## 1000 1584 3000 3200 3500 4410 4800 5000
## 10 1 3 1 1 1 1 4
dc <- dc[order(dc$PROPDMG,decreasing=TRUE),]
dc[1,]
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME
## 778568 37 7/26/2009 0:00:00 08:11:00 PM EST 69 FRANKLIN
## STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI
## 778568 NC THUNDERSTORM WIND 1 WNW STALLINGS
## END_DATE END_TIME COUNTY_END COUNTYENDN END_RANGE
## 778568 7/26/2009 0:00:00 08:13:00 PM 0 NA 0
## END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 778568 0 0 NA 50 0 0 5000
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC
## 778568 K 0 K RAH NORTH CAROLINA, Central
## ZONENAMES
## 778568
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_
## 778568 3604 7810 0 0
## REMARKS
## 778568 EPISODE NARRATIVE: Showers and thunderstorms developed in the afternoon and evening along a surface trough across central North Carolina, in a warm and moist atmosphere. A mid level shear axis crossed the area during the evening helping to intensive the thunderstorms. Some of the thunderstorms during the evening became severe and produced several reports of damaging winds.EVENT NARRATIVE: Power-lines were reported down on North Carolina Highway 56. Numerous power outages were reported across eastern portions of Franklin County as well. One tree was blown onto a house on Edward Best Road.
## REFNUM
## 778568 778558
Checking all 22 records with PROPDMG>=1000 shows that all have K exponent. I guess it’s an error and there should either be 5*10^3 or 5000*10^0. This should be corrrected.
summary(data$CROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.527 0.000 990.000
There are no values above 1000. I checked some top values and they look fine.
Other checks show that there are plenty of cases where DMGEXP>0 but no DMG value is provided, both for PROP and CROP. I assume the DMG in such cases should be corrected to 1.
d1 <- subset(data, select = c(EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP))
d1[d1$PROPDMGEXP=='B' & d1$PROPDMG==115,]$PROPDMGEXP <- 'M'
d1[d1$PROPDMG >= 1000,]$PROPDMGEXP <- 0
d1[d1$PROPDMGEXP=='',]$PROPDMGEXP <- 0
d1[d1$CROPDMGEXP=='',]$CROPDMGEXP <- 0
library(plyr)
exponent_fix <- function (x) {
mapvalues(x,
from=c('+','-','?','h','H','k','K','m','M','b','B'),
to = c(NA, NA, NA, '2','2','3','3','6','6','9','9')
)
}
d1$PROPDMGEXP = exponent_fix(d1$PROPDMGEXP)
d1$CROPDMGEXP = exponent_fix(d1$CROPDMGEXP)
Some of these observations might be partially complete but I accept this. There are 21 observations to be removed.
d1 <- na.omit(d1)
d1$PROPDMGEXP <- as.numeric(levels(d1$PROPDMGEXP)[d1$PROPDMGEXP])
d1$CROPDMGEXP <- as.numeric(levels(d1$CROPDMGEXP)[d1$CROPDMGEXP])
There are 197250 such cases for PROP and 261774 for CROP
d1[d1$PROPDMGEXP!=0 & d1$PROPDMG==0,]$PROPDMG <- 1
d1[d1$CROPDMGEXP!=0 & d1$CROPDMG==0,]$CROPDMG <- 1
d1$PROPDMGVAL <- d1$PROPDMG*10^d1$PROPDMGEXP
d1$CROPDMGVAL <- d1$CROPDMG*10^d1$CROPDMGEXP
Once we have the numerical values for damage calculate we can finally standardize the Event Type variable.
d_ft <- aggregate(FATALITIES~EVTYPE, data=d1, sum)
d_in <- aggregate(INJURIES~EVTYPE, data=d1, sum)
d_pr <- aggregate(PROPDMGVAL~EVTYPE, data=d1, sum)
d_cr <- aggregate(CROPDMGVAL~EVTYPE, data=d1, sum)
dft <- d_ft[order(d_ft$FATALITIES, decreasing=TRUE),]
din <- d_in[order(d_in$INJURIES, decreasing=TRUE),]
dpr <- d_pr[order(d_pr$PROPDMGVAL, decreasing=TRUE),]
dcr <- d_cr[order(d_cr$CROPDMGVAL, decreasing=TRUE),]
dft$EVTYPE <- as.character(dft$EVTYPE)
din$EVTYPE <- as.character(din$EVTYPE)
dpr$EVTYPE <- as.character(dpr$EVTYPE)
dcr$EVTYPE <- as.character(dcr$EVTYPE)
Plotting EVTYPE’s impact on health (FATALITIES, INJURIES) and economy (Property Damage, Crop Damage).
plot_damage <- function(df,n=nrow(df)) {
all_other <- if(n < nrow(df)) sum(df[(n+1):nrow(df),2]) else 0
ymax = max(df[1,2], all_other)
plot(df[1:n,2], main=names(df)[2], ylab="Damage", xlab="Event type", ylim=c(0, ymax))
if(all_other > 0) abline(h=all_other)
}
par(mfrow=c(2,2))
n = 40
plot_damage(dft, n)
plot_damage(din, n)
plot_damage(dpr, n)
plot_damage(dcr, n)
The plot shows the damage for top 40 EVTYPE’s for each damage indicator. The horizontal line shows combined damage of all the other EVTYPE’s. It seems that it’s enough to standardize the top 40 EVTYPE’s to account for the major event types. The comined damage of the other EVTYPE’s shouldn’t change the top event types ordering.
Here are all the relevant events:
selected_events = sort(unique(c(
dft[1:n,1],
din[1:n,1],
dpr[1:n,1],
dcr[1:n,1])))
selected_events
## [1] "AGRICULTURAL FREEZE" "AVALANCHE"
## [3] "BLIZZARD" "COASTAL FLOOD"
## [5] "COASTAL FLOODING" "COLD"
## [7] "COLD AND WET CONDITIONS" "COLD/WIND CHILL"
## [9] "Damaging Freeze" "DAMAGING FREEZE"
## [11] "DENSE FOG" "DROUGHT"
## [13] "DUST STORM" "Early Frost"
## [15] "EXCESSIVE HEAT" "EXCESSIVE WETNESS"
## [17] "EXTREME COLD" "EXTREME COLD/WIND CHILL"
## [19] "EXTREME HEAT" "FLASH FLOOD"
## [21] "FLASH FLOOD/FLOOD" "FLASH FLOODING"
## [23] "FLOOD" "FLOOD/FLASH FLOOD"
## [25] "FLOOD/RAIN/WINDS" "FOG"
## [27] "FREEZE" "FROST"
## [29] "FROST/FREEZE" "GLAZE"
## [31] "HAIL" "HAILSTORM"
## [33] "HEAT" "HEAT WAVE"
## [35] "HEAVY RAIN" "HEAVY RAIN/SEVERE WEATHER"
## [37] "HEAVY RAINS" "HEAVY SNOW"
## [39] "HEAVY SURF/HIGH SURF" "HIGH SURF"
## [41] "HIGH WIND" "HIGH WINDS"
## [43] "HURRICANE" "HURRICANE ERIN"
## [45] "HURRICANE OPAL" "HURRICANE/TYPHOON"
## [47] "ICE" "ICE STORM"
## [49] "LANDSLIDE" "LIGHTNING"
## [51] "RIP CURRENT" "RIP CURRENTS"
## [53] "RIVER FLOOD" "SEVERE THUNDERSTORM"
## [55] "SEVERE THUNDERSTORM WINDS" "STORM SURGE"
## [57] "STORM SURGE/TIDE" "STRONG WIND"
## [59] "THUNDERSTORM WIND" "THUNDERSTORM WINDS"
## [61] "TORNADO" "TORNADOES, TSTM WIND, HAIL"
## [63] "TROPICAL STORM" "TSTM WIND"
## [65] "TSTM WIND/HAIL" "TSUNAMI"
## [67] "TYPHOON" "UNSEASONABLY WARM AND DRY"
## [69] "URBAN/SML STREAM FLD" "WILD FIRES"
## [71] "WILD/FOREST FIRE" "WILDFIRE"
## [73] "WIND" "WINTER STORM"
## [75] "WINTER WEATHER" "WINTER WEATHER/MIX"
There are 76 of them. We write them into file and create a dictionary where each EVTYPE on a list have a standardized Event Type. Additionally we add “Other” event type to simplify subsequent calculations.
I wrote the 76 EVTYPE’s into a file and created a dictionary by assigining each of these EVTYPE’s a standardized name.
write.table(selected_events, file="dict_keys.txt", row.names=F, col.names=F)
write.table(events_clean, file="events_clean.txt", row.names=F, col.names=F)
dict = read.table(file="dict.txt", stringsAsFactors = F, col.names=c("EVTYPE","EVENT_TYPE"))
dict
## EVTYPE EVENT_TYPE
## 1 AGRICULTURAL FREEZE Frost/Freeze
## 2 AVALANCHE Avalanche
## 3 BLIZZARD Blizzard
## 4 COASTAL FLOOD Coastal Flood
## 5 COASTAL FLOODING Coastal Flood
## 6 COLD Cold/Wind Chill
## 7 COLD AND WET CONDITIONS Cold/Wind Chill
## 8 COLD/WIND CHILL Cold/Wind Chill
## 9 Damaging Freeze Frost/Freeze
## 10 DAMAGING FREEZE Frost/Freeze
## 11 DENSE FOG Dense Fog
## 12 DROUGHT Drought
## 13 DUST STORM Dust Storm
## 14 Early Frost Frost/Freeze
## 15 EXCESSIVE HEAT Excessive Heat
## 16 EXCESSIVE WETNESS Excessive Wetness
## 17 EXTREME COLD Extreme Cold/Wind Chill
## 18 EXTREME COLD/WIND CHILL Extreme Cold/Wind Chill
## 19 EXTREME HEAT Excessive Heat
## 20 FLASH FLOOD Flash Flood
## 21 FLASH FLOOD/FLOOD Flash Flood
## 22 FLASH FLOODING Flash Flood
## 23 FLOOD Flood
## 24 FLOOD/FLASH FLOOD Flash Flood
## 25 FLOOD/RAIN/WINDS Flood
## 26 FOG Dense Fog
## 27 FROST Frost/Freeze
## 28 FREEZE Frost/Freeze
## 29 FROST/FREEZE Frost/Freeze
## 30 GLAZE Ice Storm
## 31 HAIL Hail
## 32 HAILSTORM Hail
## 33 HEAT Heat
## 34 HEAT WAVE Heat
## 35 HEAVY RAIN Heavy Rain
## 36 HEAVY RAIN/SEVERE WEATHER Heavy Rain
## 37 HEAVY RAINS Heavy Rain
## 38 HEAVY SNOW Heavy Snow
## 39 HEAVY SURF/HIGH SURF High Surf
## 40 HIGH SURF High Surf
## 41 HIGH WIND High Wind
## 42 HIGH WINDS High Wind
## 43 HURRICANE Hurricane (Typhoon)
## 44 HURRICANE ERIN Hurricane (Typhoon)
## 45 HURRICANE OPAL Hurricane (Typhoon)
## 46 HURRICANE/TYPHOON Hurricane (Typhoon)
## 47 ICE Ice Storm
## 48 ICE STORM Ice Storm
## 49 LANDSLIDE Landslide
## 50 LIGHTNING Lightning
## 51 RIP CURRENT Rip Current
## 52 RIP CURRENTS Rip Current
## 53 RIVER FLOOD Flood
## 54 SEVERE THUNDERSTORM Thunderstorm Wind
## 55 SEVERE THUNDERSTORM WINDS Thunderstorm Wind
## 56 STORM SURGE Storm Surge/Tide
## 57 STORM SURGE/TIDE Storm Surge/Tide
## 58 STRONG WIND High Wind
## 59 THUNDERSTORM WIND Thunderstorm Wind
## 60 THUNDERSTORM WINDS Thunderstorm Wind
## 61 TORNADO Tornado
## 62 TORNADOES, TSTM WIND, HAIL Tornado
## 63 TROPICAL STORM Tropical Storm
## 64 TSTM WIND Thunderstorm Wind
## 65 TSTM WIND/HAIL Thunderstorm Wind
## 66 TSUNAMI Tsunami
## 67 TYPHOON Hurricane (Typhoon)
## 68 UNSEASONABLY WARM AND DRY Heat
## 69 URBAN/SML STREAM FLD Flash Flood
## 70 WILD FIRES Wildfire
## 71 WILD/FOREST FIRE Wildfire
## 72 WILDFIRE Wildfire
## 73 WIND High Wind
## 74 WINTER STORM Winter Storm
## 75 WINTER WEATHER Winter Weather
## 76 WINTER WEATHER/MIX Winter Weather
## 77 other other
Confirming that the dictionary contains all the selected EVTYPE’s:
all(sapply(selected_events, function(x) x %in% dict$EVTYPE))
## [1] TRUE
EVENT_TYPE column contains the standardized name for each EVTYPE in the data.
standardize_event_types <- function(df, n = nrow(df)) {
for(i in 1:nrow(df)) {
if(!any(dict$EVTYPE==df[i,"EVTYPE"])) {
df[i,"EVTYPE"] = "other"
}
}
transform(df[1:n,], EVENT_TYPE = mapvalues(df$EVTYPE, dict$EVTYPE, dict$EVENT_TYPE))
}
dftst <- standardize_event_types(dft)
dinst <- standardize_event_types(din)
dprst <- standardize_event_types(dpr)
dcrst <- standardize_event_types(dcr)
Now I aggregate event data and sum up health and economic damages. Event types are sorted by their impact. The “Other” categories are moved to the last rows of the data frames.
dftf <- aggregate(FATALITIES~EVENT_TYPE, data=dftst, sum)
dinf <- aggregate(INJURIES~EVENT_TYPE, data=dinst, sum)
dprf <- aggregate(PROPDMGVAL~EVENT_TYPE, data=dprst, sum)
dcrf <- aggregate(CROPDMGVAL~EVENT_TYPE, data=dcrst, sum)
# final sorting
dftfs <- dftf[order(dftf$FATALITIES, decreasing=TRUE),]
dinfs <- dinf[order(dinf$INJURIES, decreasing=TRUE),]
dprfs <- dprf[order(dprf$PROPDMGVAL, decreasing=TRUE),]
dcrfs <- dcrf[order(dcrf$CROPDMGVAL, decreasing=TRUE),]
# moving "other" to the last row
put_other_at_the_end <- function(df) {
n = nrow(df)
other = which(df$EVENT_TYPE=="other")
if(other>1 & other<n)
index = c(1:(other-1),(other+1):n,other)
else if (other==1)
index = c(2:n,other)
else
index = 1:n
df[index,]
}
dftfs <- put_other_at_the_end(dftfs)
dinfs <- put_other_at_the_end(dinfs)
dprfs <- put_other_at_the_end(dprfs)
dcrfs <- put_other_at_the_end(dcrfs)
Let’s plot the most damageing event types for each damage indicator.
par(mfrow=c(2,2))
barplot_with_the_last_record <- function(df, col, n=3) {
m = nrow(df)
if(n==m)
barplot(height = df[,col], names.arg = df$EVENT_TYPE, cex.names=0.7, main=col, ylab="Damage")
else
barplot(height = df[c(1:n,m),col], names.arg = df$EVENT_TYPE[c(1:n,m)], cex.names=0.7, main=col, ylab="Damage")
}
barplot_with_the_last_record(dftfs,"FATALITIES")
barplot_with_the_last_record(dinfs,"INJURIES")
barplot_with_the_last_record(dprfs,"PROPDMGVAL")
barplot_with_the_last_record(dcrfs,"CROPDMGVAL")
The plot shows the resulting ordering of event types. We see that the nonstandardized EVTYPE’s combined under “other” category can’t change the top ordering of event types.
Let’s plot event types of the highest health and economic impact.
library(tidyr)
library(dplyr)
bind_rows(dftfs, dinfs, dprfs, dcrfs) %>%
gather(dimension, damage, -EVENT_TYPE) %>%
na.omit %>%
spread(dimension, damage) %>%
as.data.frame -> d
rownames(d) <- d$EVENT_TYPE
d <- subset(d,select=-EVENT_TYPE)
n <- 4
par(mfrow=c(2,1))
hc <- c('FATALITIES','INJURIES')
human <- data.frame(value=rowSums(d[,hc]))
barplot(t(as.matrix(d[order(human$value,decreasing=T)[1:n],hc])),legend=hc,main="Health Impact")
ec <- c('PROPDMGVAL','CROPDMGVAL')
econ <- data.frame(value=rowSums(d[,ec]))
barplot(t(as.matrix(d[order(econ$value,decreasing=T)[1:n],ec])),legend=ec,main="Economic Impact")
We can see the the most harmful events in terms of populatino health are by far tornados. Thunderstorm winds, excessive heat and floods occupy the 2nd to 4th positions.
The most detrimental events for the economy are Hurricanes, then tornadoes, storm surges/tides and floods.
The same results where arrive at when standardizing only top 20 EVTYPE’s for each damage indicator. It required standardizing only 39 EVTYPE’s total.