Synopsis

The data set has been analysed to determine the most significant weather events in terms of the impact on human health and the cost of damage to property and crops.

Both of these have been established for individual incidents and for the total devastation caused over the duration of the records in the data set. In both cases these provide different results (due to the relative infrequency of the most damaging weather events eg tsunamis).

Data Processing

Load data, reduce to variables of interest and remove main dataset from memory.

setwd("/Users/esteebie/Documents/DATA/Courses/4. JHU/5. Reproducible Research/Project 2 - storms")
storms <- read.csv("repdata_data_StormData.csv")
stms <- storms[,c(8,23:28)]
rm(storms)
summary(stms)
##                EVTYPE         FATALITIES          INJURIES        
##  HAIL             :288661   Min.   :  0.0000   Min.   :   0.0000  
##  TSTM WIND        :219940   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  THUNDERSTORM WIND: 82563   Median :  0.0000   Median :   0.0000  
##  TORNADO          : 60652   Mean   :  0.0168   Mean   :   0.1557  
##  FLASH FLOOD      : 54277   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  FLOOD            : 25326   Max.   :583.0000   Max.   :1700.0000  
##  (Other)          :170878                                         
##     PROPDMG          PROPDMGEXP        CROPDMG          CROPDMGEXP    
##  Min.   :   0.00          :465934   Min.   :  0.000          :618413  
##  1st Qu.:   0.00   K      :424665   1st Qu.:  0.000   K      :281832  
##  Median :   0.00   M      : 11330   Median :  0.000   M      :  1994  
##  Mean   :  12.06   0      :   216   Mean   :  1.527   k      :    21  
##  3rd Qu.:   0.50   B      :    40   3rd Qu.:  0.000   0      :    19  
##  Max.   :5000.00   5      :    28   Max.   :990.000   B      :     9  
##                    (Other):    84                     (Other):     9

Create new variables for costs using exponents. Recode for letter codes.

stms$prop_exp <- as.numeric(as.character(stms$PROPDMGEXP))
## Warning: NAs introduced by coercion
stms$prop_exp[stms$PROPDMGEXP %in% c("-","?","+","")] <- 0
stms$prop_exp[stms$PROPDMGEXP %in% c("H","h")] <- 2
stms$prop_exp[stms$PROPDMGEXP %in% c("K","k")] <- 3
stms$prop_exp[stms$PROPDMGEXP %in% c("M","m")] <- 6
stms$prop_exp[stms$PROPDMGEXP %in% c("B","b")] <- 9

stms$crop_exp <- as.numeric(as.character(stms$CROPDMGEXP))
## Warning: NAs introduced by coercion
stms$crop_exp[stms$CROPDMGEXP %in% c("-","?","+","")] <- 0
stms$crop_exp[stms$CROPDMGEXP %in% c("H","h")] <- 2
stms$crop_exp[stms$CROPDMGEXP %in% c("K","k")] <- 3
stms$crop_exp[stms$CROPDMGEXP %in% c("M","m")] <- 6
stms$crop_exp[stms$CROPDMGEXP %in% c("B","b")] <- 9

Recalculate damage using recoded powers.

stms$PROPDMG <- stms$PROPDMG * 10^stms$prop_exp
stms$CROPDMG <- stms$CROPDMG * 10^stms$crop_exp

Next simplify events into fewer principle groups.

ev_types <- length(unique(stms$EVTYPE))

Initially, the dataset contains 985

stms$EVTYPE <- gsub(".*HEAT.*","HEAT",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*AVALAN.*","AVALANCHE",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*TSTM.*","THUNDER STORM",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*THUNDERSTORM.*","THUNDER STORM",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*WIND.*","WIND",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*SNOW.*","SNOW",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*HAIL.*","HAIL",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*HURRICANE.*","HURRICANE",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*FIRE.*","FIRE",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*FOG.*","FOG",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*COLD.*","COLD",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*SURF.*","SURF",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*RAIN.*","RAIN",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*TYPHOON.*","TYPHOON",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*WINTER.*","COLD",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*TROPICAL.*","TROPICAL STORM",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*TORN.*","TORNADO",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*WATERSPOUT.*","WATERSPOUT",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*WET.*","RAIN",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*VOLC.*","VOLCANO",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*WND.*","WIND",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*FLOOD.*","FLOOD",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*SLEET.*","SLEET",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*DRY.*","DRY",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*WINTRY.*","COLD",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*WARM.*","HEAT",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*HOT.*","HEAT",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*HIGH.*","HEAT",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*LOW.*","COLD",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*ICE.*","ICE",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*FREEZ.*","COLD",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*LIGHTNING.*","LIGHTNING",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*LANDSLIDE.*","LANDSLIDE",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*DUST.*","DUST",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*ICY.*","ICE",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*BLIZZARD.*","SNOW",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*STREAM.*","FLOOD",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*RECORD.*","HEAT",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*MUD.*","MUDSLIDE",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*EROSION.*","EROSION",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*SMOKE.*","SMOKE",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*LANDSL.*","LANDSLIDE",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*SURGE.*","SURGE",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*SHOWER.*","RAIN",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*PRECIP.*","RAIN",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*HYPOTH.*","COLD",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*EXPOS.*","COLD",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*GUST.*","WIND",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*FUNNEL.*","FUNNEL CLOUD",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*FROST.*","FROST",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*COASTAL.*","COASTAL STORM",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*DAM.*","DAM FAILURE",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*MARINE.*","MARINE INCIDENT",stms$EVTYPE, ignore.case = TRUE)
ev_types <- length(unique(stms$EVTYPE))

Now the dataset contains 140

For analysis of fatalaties and injuries, create subset of data where at least one person was affected and drop the levels of all others.

Create summary tables for total and averages for each event type in new subset.

library(ggplot2)
stms_inj <- stms[stms$FATALITIES !=0 | stms$INJURIES !=0,]
stms_inj <- droplevels(stms_inj)
event_type <- sort(unique(stms_inj$EVTYPE))
fats_sum <- as.numeric(tapply(stms_inj$FATALITIES,stms_inj$EVTYPE,sum))
inj_sum <- as.numeric(tapply(stms_inj$INJURIES,stms_inj$EVTYPE,sum))
fats_av <- as.numeric(tapply(stms_inj$FATALITIES,stms_inj$EVTYPE,mean))
inj_av <- as.numeric(tapply(stms_inj$INJURIES,stms_inj$EVTYPE,mean))
df <- as.data.frame(cbind(fats_sum,fats_av,inj_sum,inj_av))
df <- cbind(event_type,df)

Next convert dataframes into tidy data for output using ggplot. Since fatalities are more significant than injuries, rank by number of fatalaties (first totals, then averages for second plot). Select only top 20 event types.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
df_fats <- df %>% arrange(desc(fats_sum))
df_fats_tidy <- gather(df_fats[1:20,],Statistic,Number,c(fats_sum,fats_av,inj_sum,inj_av))
df_fats_tidy$seq <- rep(1:20,4)

Prepare plot

plot1 <- ggplot(df_fats_tidy[df_fats_tidy$Statistic %in% c("inj_sum","fats_sum"),], aes(x=reorder(event_type,seq), y=Number, fill=Statistic)) +geom_bar(position=position_stack(reverse=TRUE), stat="identity") +theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +xlab("Event type") +ylab("Total people affected across all years") +scale_fill_discrete(name="Key",labels=c("Fatalities","Injuries"))

Do same for average fatalities/injuries

df_fats_av <- df %>% arrange(desc(fats_av))
df_fats_av_tidy <- gather(df_fats_av[1:20,],Statistic,Number,c(fats_sum,fats_av,inj_sum,inj_av))
df_fats_av_tidy$seq <- rep(1:20,4)

Prepare plot

plot2 <- ggplot(df_fats_av_tidy[df_fats_av_tidy$Statistic %in% c("inj_av","fats_av"),], aes(x=reorder(event_type,seq), y=Number, fill=Statistic)) +geom_bar(position=position_stack(reverse=TRUE), stat="identity") +theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +xlab("Event type") +ylab("Average number of people\n affected per incident") +scale_fill_discrete(name="Key",labels=c("Fatalities","Injuries"))

For economic damage, produce a similar subset and totals/averages.

stms_eco <- stms[stms$PROPDMG !=0 | stms$CROPDMG != 0,]
stms_eco <- droplevels(stms_eco)
event_type_eco <- sort(unique(stms_eco$EVTYPE))
prop_sum <- as.numeric(tapply(stms_eco$PROPDMG,stms_eco$EVTYPE,sum))
crop_sum <- as.numeric(tapply(stms_eco$CROPDMG,stms_eco$EVTYPE,sum))
prop_av <- as.numeric(tapply(stms_eco$PROPDMG,stms_eco$EVTYPE,mean))
crop_av <- as.numeric(tapply(stms_eco$CROPDMG,stms_eco$EVTYPE,mean))
df_eco <- as.data.frame(cbind(prop_sum,prop_av,crop_sum,crop_av))
df_eco <- cbind(event_type_eco,df_eco)
df_eco$total_sum <- df_eco$prop_sum + df_eco$crop_sum
df_eco$total_av <- df_eco$prop_av + df_eco$crop_av

Create tidy dataframe for total costs, order by total cost for crop and property and select top 20.

df_costs <- df_eco %>% arrange(desc(total_sum))
df_costs_tidy <- gather(df_costs[1:20,c(1,2,4)],Statistic,Number,c(prop_sum,crop_sum))
df_costs_tidy$seq <- rep(1:20,2)

Prepare plot

plot3 <- ggplot(df_costs_tidy, aes(x=reorder(event_type_eco,seq), y=Number, fill=factor(Statistic,levels=c("crop_sum","prop_sum")))) +geom_bar(stat="identity") +theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +xlab("Event type") +ylab("Total costs across all years /$") +scale_fill_discrete(name="Damage",labels=c("Crops","Property")) 

Create tidy dataframe for average costs, order by total average cost for crop and property and select top 20.

df_av_costs <- df_eco %>% arrange(desc(total_av))
df_av_costs_tidy <- gather(df_av_costs[1:20,c(1,3,5)],Statistic,Number,c(prop_av,crop_av))
df_av_costs_tidy$seq <- rep(1:20,2)
plot4 <- ggplot(df_av_costs_tidy, aes(x=reorder(event_type_eco,seq), y=Number, fill=factor(Statistic,levels=c("crop_av","prop_av")))) +geom_bar(stat="identity") +theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +xlab("Event type") +ylab("Average costs for each incident /$") +scale_fill_discrete(name="Damage",labels=c("Crops","Property")) 

Results

plot1

plot2

We can see that by far the greatest number of deaths and injuries over the period of this dataset is from tornadoes. Looking at the average number of people affected per incident however, we see that tsumanis are by far the most dangerous event, with hurricanes injuring the second highest number of people (though with relatively fewer fatalities).

This makes sense since tsunamis are so much rarer than tornadoes.

These plots are displayed in order of total damage. We see that over the period of the dataset, costs due to flooding (the majority of which is propery damage) are almost double those from anything else. Drought accounts for the most crop damage, but only appears 6th in total damage.

On a per-incident basis however, floods are not even in the top 10. Hurricanes cause double the amount of damage of the second biggest contributor (storm surges) which are themselves over twice as costly per event as anything else. The majority of this damage is to property, with draughts again responsible for the largest average cost to crop damage.

Summary

Although the aggregation of event types into a far smaller subset may have reduced the granularity of the analysis, it has enabled a clear picture to be established.

The greatest human cost over time is from tornadoes, whilst individually tsunamis are far more devastating.

In terms of the economic cost, flooding has taken the greatest toll over time, whilst an individual hurricane is likely to do more damage than any other event.

Further consideration of event types could result in even more aggregation, which might affect these broad conclusions. e.g the distinctions between tropical storms, typhoons, hurricanes or snow, cold and ice.