The effect of severe weather events on population health and economic damage in the United States

We analysed the NOAA Storm Database to determine which types of severe weather event have had the greatest impact on population health, measured in fatalities and injuries, and which have had the greatest economic consequences, measured in property and crop damage. TORNADO has by far the largest impact on population health, whereas FLOOD has the biggest economic consequences.

Data Processing

The storm data is available through the Coursera course web site: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2. Because it is a large file, we first download it locally and then read it into a data frame.

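# dplyr provides %>%, group_by() and summarise(); ggplot2 provides the plots
library(dplyr)
library(ggplot2)
file_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(file_url, "./StormData.csv.bz2", mode = 'wb')
df_stormdata <- read.csv("./StormData.csv.bz2")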
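
Because the file is large, it can help to skip the download when a local copy already exists. The following is a minimal sketch of that idea; fetch_once is a hypothetical helper name and is not part of the original analysis.

```r
# Hypothetical helper (an assumption, not part of the original analysis):
# download a file only when no local copy exists yet.
fetch_once <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, dest, mode = "wb")
  }
  # Return the local path so the call can be nested in read.csv()
  invisible(dest)
}
```

In this report the call could look like df_stormdata <- read.csv(fetch_once(file_url, "./StormData.csv.bz2")), so that re-running the analysis reuses the downloaded file.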

Population health

# Some basic views at the data frame
#str(df_stormdata)
#head(df_stormdata, 10)
#table(df_stormdata$FATALITIES)
#table(df_stormdata$INJURIES)

Looking at the structure of the data frame, two variables relate to population health: FATALITIES and INJURIES. To get an impression of how much each event type contributes to each, we summarise both by EVTYPE and plot them in a combined graph. We first look at the summarised data.

sum_fat <- df_stormdata %>% 
           group_by(EVTYPE) %>% 
           summarise(total_number = sum(FATALITIES))
sum_inj <- df_stormdata %>% 
           group_by(EVTYPE) %>% 
           summarise(total_number = sum(INJURIES))

Many EVTYPEs have zero or only a few casualties. To make a good comparison, we plot only the event types with the highest totals. Therefore we keep the event types with more than 100 fatalities and those with more than 1000 injuries.

high_cas_fat <- df_stormdata %>% 
                group_by(EVTYPE) %>% 
                summarise(total_number = sum(FATALITIES)) %>% 
                filter(total_number > 100) %>% 
                droplevels() %>% 
                mutate(type_cas = "Fatalities")
high_cas_inj <- df_stormdata %>% 
                group_by(EVTYPE) %>% 
                summarise(total_number = sum(INJURIES)) %>% 
                filter(total_number > 1000) %>% 
                droplevels() %>% 
                mutate(type_cas = "Injuries")
# call to droplevels() is to drop unused levels for the plot
# type_cas variable is introduced to combine the data frames for a panel plot
high_cas <- rbind(high_cas_fat, high_cas_inj)
# plot
plt <- ggplot(high_cas, aes(x = EVTYPE, y = total_number))
plt <- plt + theme(axis.text.x=element_text(angle=90,size=8,hjust=1,vjust=0.5))
plt + geom_bar(stat = "identity") + facet_grid(type_cas ~ ., scales = "free")

The first thing that stands out is the high peak for TORNADO in both panels, which makes it by far the most health-threatening event. But the high peak also distorts the graph by compressing the other bars. To get a clearer view of the other events, we draw the plot again without the TORNADO event.

high_cas <- high_cas %>% filter(EVTYPE != "TORNADO")
plt <- ggplot(high_cas, aes(x = EVTYPE, y = total_number))
plt <- plt + theme(axis.text.x=element_text(angle=90,size=8,hjust=1,vjust=0.5))
plt + geom_bar(stat = "identity") + facet_grid(type_cas ~ ., scales = "free")

It seems that the top fatalities and top injuries are both caused by largely the same 5 or 6 events, so we can focus on those for our final conclusions.

Economic consequences

# Some basic views at the data frame
#str(df_stormdata)
#head(df_stormdata, 10)
#levels(df_stormdata$PROPDMGEXP)
#levels(df_stormdata$CROPDMGEXP)

To get a picture of the economic consequences of storms, 4 variables seem to be relevant: PROPDMG and CROPDMG, containing the damage amount, plus PROPDMGEXP and CROPDMGEXP, containing a multiplication factor. The idea is to multiply each amount by its multiplication factor, add the PROP and CROP amounts to a total damage for each observation and then summarise it by EVTYPE.
But there is an issue with the multiplication factors. According to the documentation only the factors K, M and B (for thousand, million, billion) should exist, but both variables also contain a range of other values:
PROPDMGEXP: (empty), -, ?, +, 0, 1, 2, 3, 4, 5, 6, 7, 8, B, h, H, K, m, M
CROPDMGEXP: (empty), ?, 0, 2, B, k, K, m, M
It is not clear what these codes mean. To get an idea of the extent of these dubious cases, we count the number of times they occur.
In PROPDMGEXP: 328
In CROPDMGEXP: 49
These counts are very small compared to the entire data set (902297 rows). So a practical solution is to set the multiplication factor to 1 in those cases and use that for our damage calculation. In that calculation, every value other than K, M or B, including the empty value and the lower-case variants, gets a factor of 1.
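
The counts above can be reproduced along the following lines. This is a minimal base R sketch on made-up values, not the original code; it assumes that empty values are not counted as dubious, and exp_codes is a toy stand-in for df_stormdata$PROPDMGEXP.

```r
# Toy stand-in for df_stormdata$PROPDMGEXP (values are made up)
exp_codes <- c("K", "K", "", "M", "0", "h", "B", "?", "m", "K")

# The only codes described in the documentation
documented <- c("K", "M", "B")

# Assumption: "dubious" means non-empty and not one of the documented codes
dubious <- exp_codes[exp_codes != "" & !(exp_codes %in% documented)]

length(dubious)   # number of dubious entries (4 in this toy vector)
table(dubious)    # breakdown per code
```

Applied to the real columns, length(dubious) would give the number of dubious entries for PROPDMGEXP and CROPDMGEXP respectively.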

damage <- df_stormdata %>%
          select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
          mutate(prop_fctr = factor(PROPDMGEXP), 
                 prop_mplr = ifelse(prop_fctr == "K", 1000, 
                                    ifelse(prop_fctr == "M", 1000000,
                                           ifelse(prop_fctr == "B", 1000000000, 
                                                  1
                              ))), 
                 prop_dmg = PROPDMG * prop_mplr
                 ) %>%
          mutate(crop_fctr = factor(CROPDMGEXP), 
                 crop_mplr = ifelse(crop_fctr == "K", 1000, 
                                    ifelse(crop_fctr == "M", 1000000,
                                           ifelse(crop_fctr == "B", 1000000000, 
                                                  1
                              ))), 
                 crop_dmg = CROPDMG * crop_mplr
                 ) %>%
          mutate(tot_dmg = prop_dmg + crop_dmg) %>%
          select(EVTYPE, tot_dmg) %>% 
          group_by(EVTYPE) 
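
As an alternative to the nested ifelse() calls, the mapping from code to multiplier can be written as a named lookup vector. The sketch below uses base R and made-up values; multipliers and exp_to_multiplier are hypothetical names, and the fallback of 1 for every undocumented code mirrors the choice made above.

```r
# Multipliers for the documented codes only
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: map exponent codes to numeric multipliers,
# falling back to 1 for any code that is not K, M or B
exp_to_multiplier <- function(codes) {
  m <- multipliers[as.character(codes)]  # NA for codes not in the table
  m[is.na(m)] <- 1
  unname(m)
}

# Toy values (made up), mirroring PROPDMG and PROPDMGEXP
amount <- c(25, 2.5, 10, 3)
code   <- c("K", "M", "", "h")
amount * exp_to_multiplier(code)   # 25000, 2500000, 10, 3
```

A lookup vector keeps the code-to-factor mapping in one place, which makes it easier to revisit the treatment of the undocumented codes later.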

Results

For the effect on population health, we calculate the total number of fatalities and the total number of injuries for all event types (EVTYPE) and show the top 6 of each.

high_cas_fat <- df_stormdata %>% 
                group_by(EVTYPE) %>% 
                summarise(total_fatalities = sum(FATALITIES)) %>% 
                top_n(n = 6, wt = total_fatalities) %>%
                arrange(desc(total_fatalities))
high_cas_inj <- df_stormdata %>% 
                group_by(EVTYPE) %>% 
                summarise(total_injuries = sum(INJURIES)) %>% 
                top_n(n = 6, wt = total_injuries) %>% 
                arrange(desc(total_injuries))
cbind(high_cas_fat, high_cas_inj)
##           EVTYPE total_fatalities         EVTYPE total_injuries
## 1        TORNADO             5633        TORNADO          91346
## 2 EXCESSIVE HEAT             1903      TSTM WIND           6957
## 3    FLASH FLOOD              978          FLOOD           6789
## 4           HEAT              937 EXCESSIVE HEAT           6525
## 5      LIGHTNING              816      LIGHTNING           5230
## 6      TSTM WIND              504           HEAT           2100

From these figures we see that TORNADO has by far the highest impact on population health. The next most harmful event is EXCESSIVE HEAT, which is the second leading cause of fatalities. (For injuries it ranks 4th, but it is of the same order of magnitude as the 2nd and 3rd causes, TSTM WIND and FLOOD.)
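
To put "by far" into numbers, the TORNADO totals can be compared with the runner-up in each column. This small sketch only re-uses the figures from the table above; the variable names are illustrative.

```r
# Totals copied from the table above
fatalities <- c(TORNADO = 5633, "EXCESSIVE HEAT" = 1903)
injuries   <- c(TORNADO = 91346, "TSTM WIND" = 6957)

# How many times larger is TORNADO than the runner-up?
fat_ratio <- fatalities[["TORNADO"]] / fatalities[["EXCESSIVE HEAT"]]
inj_ratio <- injuries[["TORNADO"]] / injuries[["TSTM WIND"]]
round(c(fatalities = fat_ratio, injuries = inj_ratio), 1)
```

TORNADO accounts for about 3 times as many fatalities as EXCESSIVE HEAT and about 13 times as many injuries as TSTM WIND.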

Lastly, we show the top 5 damage amounts.

summarise(damage, total_damage = sum(tot_dmg)) %>% 
top_n(n = 5, wt = total_damage) %>% 
arrange(desc(total_damage))
## # A tibble: 5 × 2
##              EVTYPE total_damage
##              <fctr>        <dbl>
## 1             FLOOD 150319678257
## 2 HURRICANE/TYPHOON  71913712800
## 3           TORNADO  57340614060
## 4       STORM SURGE  43323541000
## 5              HAIL  18752904943

FLOOD has the biggest economic consequences, over 150 billion US dollars, roughly twice the damage that HURRICANE/TYPHOON causes. TORNADO comes in third place, so in terms of economic damage it is not as prominent as in population health.
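
The raw totals are easier to compare when expressed in billions. The sketch below only rescales the figures from the table above; the variable names are illustrative.

```r
# Totals copied from the table above
total_damage <- c(FLOOD = 150319678257,
                  "HURRICANE/TYPHOON" = 71913712800,
                  TORNADO = 57340614060,
                  "STORM SURGE" = 43323541000,
                  HAIL = 18752904943)

# Same figures in billions, rounded to one decimal
damage_billion <- round(total_damage / 1e9, 1)
damage_billion

# FLOOD relative to the runner-up
flood_ratio <- total_damage[["FLOOD"]] / total_damage[["HURRICANE/TYPHOON"]]
round(flood_ratio, 2)
```

FLOOD comes out at about 150.3 billion, roughly 2.1 times the HURRICANE/TYPHOON total of about 71.9 billion.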