This analysis uses the NOAA Storm Database to determine which types of severe weather events have had the greatest impact on population health, in terms of fatalities and injuries, and which types have had the greatest economic consequences, as measured by property and crop damage. TORNADO has by far the largest impact on population health, whereas FLOOD has the biggest economic consequences.
The storm data is available through the Coursera course web site: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2. Because it is a large file, we first download it locally and then put it in a data frame.
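library(dplyr)    # %>%, group_by(), summarise(), filter(), mutate(), top_n()
library(ggplot2)  # ggplot(), geom_bar(), facet_grid()
file_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(file_url, "./StormData.csv.bz2", mode = 'wb')
df_stormdata <- read.csv("./StormData.csv.bz2")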
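Because the file is large, a re-run of the analysis does not need to download it again. The following is a minimal sketch of one way to skip the download when a local copy already exists. The helper name `fetch_if_missing` and its `downloader` argument are hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not from the original analysis): download `url` to
# `dest` only when `dest` does not exist yet. The `downloader` argument
# defaults to download.file() and can be replaced, e.g. for testing.
fetch_if_missing <- function(url, dest, downloader = download.file) {
  if (!file.exists(dest)) {
    downloader(url, dest, mode = "wb")
  }
  invisible(dest)
}
```

It could be called as `fetch_if_missing(file_url, "./StormData.csv.bz2")` before the `read.csv()` call.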
Population health
# Some basic views of the data frame
#str(df_stormdata)
#head(df_stormdata, 10)
#table(df_stormdata$FATALITIES)
#table(df_stormdata$INJURIES)
Looking at the structure of the data frame, two variables relate to population health: FATALITIES and INJURIES. We want to see how each contributes to the overall health impact, so we summarise both by EVTYPE and plot them in a combined graph. We first look at the summarised data.
sum_fat <- df_stormdata %>%
  group_by(EVTYPE) %>%
  summarise(total_number = sum(FATALITIES))
sum_inj <- df_stormdata %>%
  group_by(EVTYPE) %>%
  summarise(total_number = sum(INJURIES))
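As an illustration of the per-event summary, both casualty columns can also be totalled in a single call. This is a base R sketch on a small invented data frame (`toy` and its values are assumptions made for the example, not storm data), shown as an alternative to the two dplyr pipelines rather than as the method used in this analysis.

```r
# Toy stand-in for df_stormdata; the values are invented for illustration.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD"),
  FATALITIES = c(3, 1, 2, 0, 0),
  INJURIES   = c(10, 0, 5, 0, 2)
)

# Sum both casualty columns per event type in one call.
casualties <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                        data = toy, FUN = sum)
print(casualties)
```

The result has one row per event type, with the fatalities and injuries totals side by side.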
Many EVTYPEs have zero or only a few casualties. To make a good comparison, we plot only the EVTYPEs with the highest casualty totals: those with more than 100 fatalities and those with more than 1000 injuries.
high_cas_fat <- df_stormdata %>%
  group_by(EVTYPE) %>%
  summarise(total_number = sum(FATALITIES)) %>%
  filter(total_number > 100) %>%
  droplevels() %>%
  mutate(type_cas = "Fatalities")
high_cas_inj <- df_stormdata %>%
  group_by(EVTYPE) %>%
  summarise(total_number = sum(INJURIES)) %>%
  filter(total_number > 1000) %>%
  droplevels() %>%
  mutate(type_cas = "Injuries")
# call to droplevels() is to drop unused levels for the plot
# type_cas variable is introduced to combine the data frames for a panel plot
high_cas <- rbind(high_cas_fat, high_cas_inj)
# plot
plt <- ggplot(high_cas, aes(x = EVTYPE, y = total_number))
plt <- plt + theme(axis.text.x=element_text(angle=90,size=8,hjust=1,vjust=0.5))
plt + geom_bar(stat = "identity") + facet_grid(type_cas ~ ., scales = "free")
The first thing that stands out is the high peak for TORNADO in both panels, so this is by far the most health-threatening event. The peak also compresses the other bars, however. To get a clearer view of the other events, we draw the plot again without TORNADO.
high_cas <- high_cas %>% filter(EVTYPE != "TORNADO")
plt <- ggplot(high_cas, aes(x = EVTYPE, y = total_number))
plt <- plt + theme(axis.text.x=element_text(angle=90,size=8,hjust=1,vjust=0.5))
plt + geom_bar(stat = "identity") + facet_grid(type_cas ~ ., scales = "free")
It seems that the highest fatality and injury totals are both caused by the same 5 or 6 events, so we can focus on those for our final conclusions.
Economic consequences
# Some basic views of the data frame
#str(df_stormdata)
#head(df_stormdata, 10)
#levels(df_stormdata$PROPDMGEXP)
#levels(df_stormdata$CROPDMGEXP)
To get a picture of economic consequences of storms, 4 variables seem to be relevant: PROPDMG and CROPDMG containing the damage amount, plus PROPDMGEXP and CROPDMGEXP containing a multiplication factor. The idea is to multiply each amount by its multiplication factor, add the PROP and CROP amounts to a total damage for each observation and then summarise it by EVTYPE.
But there’s an issue with the multiplication factors. According to the documentation only the factors K, M and B (for thousand, million, billion) should exist, but both variables also contain a range of other values:
PROPDMGEXP: , -, ?, +, 0, 1, 2, 3, 4, 5, 6, 7, 8, B, h, H, K, m, M
CROPDMGEXP: , ?, 0, 2, B, k, K, m, M
It is not clear what these codes mean. To get an idea of the extent of these dubious cases, we count the number of times they occur.
In PROPDMGEXP: 328
In CROPDMGEXP: 49
These numbers of occurrences are very small compared to the entire data set (902297 rows). A practical solution is therefore to set the multiplication factor to 1 in those cases and use that for our damage calculation.
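The code behind these counts is not shown here. One way such a count could be obtained is sketched below on an invented vector of codes. The sketch assumes that blank entries are not counted as dubious, while lower-case variants such as k and m are. The function name `count_undocumented` is hypothetical.

```r
# Hypothetical helper: count entries that are neither blank nor one of the
# documented codes K, M, B. Treating blanks as "not dubious" is an assumption.
count_undocumented <- function(codes, documented = c("K", "M", "B")) {
  codes <- as.character(codes)
  sum(!is.na(codes) & codes != "" & !(codes %in% documented))
}

# Invented example values, not taken from the storm data.
toy_exp <- c("K", "", "M", "m", "?", "B", "0", "K")
n_dubious <- count_undocumented(toy_exp)  # counts "m", "?" and "0"
print(n_dubious)
```

Applied to the real data, the calls would be `count_undocumented(df_stormdata$PROPDMGEXP)` and `count_undocumented(df_stormdata$CROPDMGEXP)`.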
damage <- df_stormdata %>%
  select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
  mutate(prop_fctr = factor(PROPDMGEXP),
         prop_mplr = ifelse(prop_fctr == "K", 1000,
                     ifelse(prop_fctr == "M", 1000000,
                     ifelse(prop_fctr == "B", 1000000000,
                            1
                     ))),
         prop_dmg = PROPDMG * prop_mplr
  ) %>%
  mutate(crop_fctr = factor(CROPDMGEXP),
         crop_mplr = ifelse(crop_fctr == "K", 1000,
                     ifelse(crop_fctr == "M", 1000000,
                     ifelse(crop_fctr == "B", 1000000000,
                            1
                     ))),
         crop_dmg = CROPDMG * crop_mplr
  ) %>%
  mutate(tot_dmg = prop_dmg + crop_dmg) %>%
  select(EVTYPE, tot_dmg) %>%
  group_by(EVTYPE)
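The nested `ifelse()` calls map each exponent code to a multiplier, with 1 as the fallback for every other value. The same mapping can be sketched with a named lookup vector, which avoids repeating the logic for property and crop damage. The function name `exp_to_multiplier` is hypothetical and this is an alternative formulation, not the code used above.

```r
# Hypothetical helper: translate exponent codes into multipliers.
# Only the documented upper-case codes K, M, B are recognised; anything else
# (blank, lower case, digits, symbols) falls back to 1, as in the text above.
exp_to_multiplier <- function(code) {
  known <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(known[match(as.character(code), names(known))])
  mult[is.na(mult)] <- 1
  mult
}

# Worked example with invented amounts: 25 K, 2.5 M and an unknown code.
amounts <- c(25, 2.5, 10)
codes   <- c("K", "M", "?")
damage_usd <- amounts * exp_to_multiplier(codes)
print(damage_usd)
```

In the pipeline this would reduce each damage column to one line, for example `prop_dmg = PROPDMG * exp_to_multiplier(PROPDMGEXP)`.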
For the effect on population health, we calculate the total number of fatalities and the total number of injuries for all event types (EVTYPE) and show the top 6 of each.
high_cas_fat <- df_stormdata %>%
group_by(EVTYPE) %>%
summarise(total_fatalities = sum(FATALITIES)) %>%
top_n(n = 6, wt = total_fatalities) %>%
arrange(desc(total_fatalities))
high_cas_inj <- df_stormdata %>%
group_by(EVTYPE) %>%
summarise(total_injuries = sum(INJURIES)) %>%
top_n(n = 6, wt = total_injuries) %>%
arrange(desc(total_injuries))
cbind(high_cas_fat, high_cas_inj)
##           EVTYPE total_fatalities         EVTYPE total_injuries
## 1        TORNADO             5633        TORNADO          91346
## 2 EXCESSIVE HEAT             1903      TSTM WIND           6957
## 3    FLASH FLOOD              978          FLOOD           6789
## 4           HEAT              937 EXCESSIVE HEAT           6525
## 5      LIGHTNING              816      LIGHTNING           5230
## 6      TSTM WIND              504           HEAT           2100
From these figures we see that TORNADO has by far the highest impact on population health. The next most harmful event is EXCESSIVE HEAT, the second leading cause of fatalities. (For injuries it ranks 4th, but its total is of the same order of magnitude as the 2nd and 3rd causes, TSTM WIND and FLOOD.)
Lastly, we show the top 5 damage amounts.
summarise(damage, total_damage = sum(tot_dmg)) %>%
top_n(n = 5, wt = total_damage) %>%
arrange(desc(total_damage))
## # A tibble: 5 × 2
##              EVTYPE total_damage
##              <fctr>        <dbl>
## 1             FLOOD 150319678257
## 2 HURRICANE/TYPHOON  71913712800
## 3           TORNADO  57340614060
## 4       STORM SURGE  43323541000
## 5              HAIL  18752904943
FLOOD has the biggest economic consequences, with over 150 billion US dollars in damage, roughly twice the damage that HURRICANE/TYPHOON causes. TORNADO comes in third place, so in terms of economic damage it is not as prominent as in population health.