The data set has been analysed to determine the most significant weather events in terms of the impact on human health and the cost of damage to property and crops.
Both of these have been established for individual incidents and for the total devastation caused over the duration of the records in the data set. In both cases these provide different results (due to the relative infrequency of the most damaging weather events eg tsunamis).
Load data, reduce to variables of interest and remove main dataset from memory.
setwd("/Users/esteebie/Documents/DATA/Courses/4. JHU/5. Reproducible Research/Project 2 - storms")
storms <- read.csv("repdata_data_StormData.csv")
stms <- storms[,c(8,23:28)]
rm(storms)
summary(stms)
## EVTYPE FATALITIES INJURIES
## HAIL :288661 Min. : 0.0000 Min. : 0.0000
## TSTM WIND :219940 1st Qu.: 0.0000 1st Qu.: 0.0000
## THUNDERSTORM WIND: 82563 Median : 0.0000 Median : 0.0000
## TORNADO : 60652 Mean : 0.0168 Mean : 0.1557
## FLASH FLOOD : 54277 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## FLOOD : 25326 Max. :583.0000 Max. :1700.0000
## (Other) :170878
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 :465934 Min. : 0.000 :618413
## 1st Qu.: 0.00 K :424665 1st Qu.: 0.000 K :281832
## Median : 0.00 M : 11330 Median : 0.000 M : 1994
## Mean : 12.06 0 : 216 Mean : 1.527 k : 21
## 3rd Qu.: 0.50 B : 40 3rd Qu.: 0.000 0 : 19
## Max. :5000.00 5 : 28 Max. :990.000 B : 9
## (Other): 84 (Other): 9
Create new variables for costs using exponents. Recode for letter codes.
stms$prop_exp <- as.numeric(as.character(stms$PROPDMGEXP))
## Warning: NAs introduced by coercion
stms$prop_exp[stms$PROPDMGEXP %in% c("-","?","+","")] <- 0
stms$prop_exp[stms$PROPDMGEXP %in% c("H","h")] <- 2
stms$prop_exp[stms$PROPDMGEXP %in% c("K","k")] <- 3
stms$prop_exp[stms$PROPDMGEXP %in% c("M","m")] <- 6
stms$prop_exp[stms$PROPDMGEXP %in% c("B","b")] <- 9
stms$crop_exp <- as.numeric(as.character(stms$CROPDMGEXP))
## Warning: NAs introduced by coercion
stms$crop_exp[stms$CROPDMGEXP %in% c("-","?","+","")] <- 0
stms$crop_exp[stms$CROPDMGEXP %in% c("H","h")] <- 2
stms$crop_exp[stms$CROPDMGEXP %in% c("K","k")] <- 3
stms$crop_exp[stms$CROPDMGEXP %in% c("M","m")] <- 6
stms$crop_exp[stms$CROPDMGEXP %in% c("B","b")] <- 9
Recalculate damage using recoded powers.
stms$PROPDMG <- stms$PROPDMG * 10^stms$prop_exp
stms$CROPDMG <- stms$CROPDMG * 10^stms$crop_exp
Next simplify events into fewer principle groups.
ev_types <- length(unique(stms$EVTYPE))
Initially, the dataset contains 985
stms$EVTYPE <- gsub(".*HEAT.*","HEAT",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*AVALAN.*","AVALANCHE",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*TSTM.*","THUNDER STORM",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*THUNDERSTORM.*","THUNDER STORM",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*WIND.*","WIND",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*SNOW.*","SNOW",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*HAIL.*","HAIL",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*HURRICANE.*","HURRICANE",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*FIRE.*","FIRE",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*FOG.*","FOG",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*COLD.*","COLD",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*SURF.*","SURF",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*RAIN.*","RAIN",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*TYPHOON.*","TYPHOON",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*WINTER.*","COLD",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*TROPICAL.*","TROPICAL STORM",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*TORN.*","TORNADO",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*WATERSPOUT.*","WATERSPOUT",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*WET.*","RAIN",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*VOLC.*","VOLCANO",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*WND.*","WIND",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*FLOOD.*","FLOOD",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*SLEET.*","SLEET",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*DRY.*","DRY",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*WINTRY.*","COLD",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*WARM.*","HEAT",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*HOT.*","HEAT",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*HIGH.*","HEAT",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*LOW.*","COLD",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*ICE.*","ICE",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*FREEZ.*","COLD",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*LIGHTNING.*","LIGHTNING",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*LANDSLIDE.*","LANDSLIDE",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*DUST.*","DUST",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*ICY.*","ICE",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*BLIZZARD.*","SNOW",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*STREAM.*","FLOOD",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*RECORD.*","HEAT",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*MUD.*","MUDSLIDE",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*EROSION.*","EROSION",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*SMOKE.*","SMOKE",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*LANDSL.*","LANDSLIDE",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*SURGE.*","SURGE",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*SHOWER.*","RAIN",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*PRECIP.*","RAIN",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*HYPOTH.*","COLD",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*EXPOS.*","COLD",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*GUST.*","WIND",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*FUNNEL.*","FUNNEL CLOUD",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*FROST.*","FROST",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*COASTAL.*","COASTAL STORM",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*DAM.*","DAM FAILURE",stms$EVTYPE, ignore.case = TRUE)
stms$EVTYPE <- gsub(".*MARINE.*","MARINE INCIDENT",stms$EVTYPE, ignore.case = TRUE)
ev_types <- length(unique(stms$EVTYPE))
Now the dataset contains 140
For analysis of fatalaties and injuries, create subset of data where at least one person was affected and drop the levels of all others.
Create summary tables for total and averages for each event type in new subset.
library(ggplot2)
stms_inj <- stms[stms$FATALITIES !=0 | stms$INJURIES !=0,]
stms_inj <- droplevels(stms_inj)
event_type <- sort(unique(stms_inj$EVTYPE))
fats_sum <- as.numeric(tapply(stms_inj$FATALITIES,stms_inj$EVTYPE,sum))
inj_sum <- as.numeric(tapply(stms_inj$INJURIES,stms_inj$EVTYPE,sum))
fats_av <- as.numeric(tapply(stms_inj$FATALITIES,stms_inj$EVTYPE,mean))
inj_av <- as.numeric(tapply(stms_inj$INJURIES,stms_inj$EVTYPE,mean))
df <- as.data.frame(cbind(fats_sum,fats_av,inj_sum,inj_av))
df <- cbind(event_type,df)
Next convert dataframes into tidy data for output using ggplot. Since fatalities are more significant than injuries, rank by number of fatalaties (first totals, then averages for second plot). Select only top 20 event types.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
df_fats <- df %>% arrange(desc(fats_sum))
df_fats_tidy <- gather(df_fats[1:20,],Statistic,Number,c(fats_sum,fats_av,inj_sum,inj_av))
df_fats_tidy$seq <- rep(1:20,4)
Prepare plot
plot1 <- ggplot(df_fats_tidy[df_fats_tidy$Statistic %in% c("inj_sum","fats_sum"),], aes(x=reorder(event_type,seq), y=Number, fill=Statistic)) +geom_bar(position=position_stack(reverse=TRUE), stat="identity") +theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +xlab("Event type") +ylab("Total people affected across all years") +scale_fill_discrete(name="Key",labels=c("Fatalities","Injuries"))
Do same for average fatalities/injuries
df_fats_av <- df %>% arrange(desc(fats_av))
df_fats_av_tidy <- gather(df_fats_av[1:20,],Statistic,Number,c(fats_sum,fats_av,inj_sum,inj_av))
df_fats_av_tidy$seq <- rep(1:20,4)
Prepare plot
plot2 <- ggplot(df_fats_av_tidy[df_fats_av_tidy$Statistic %in% c("inj_av","fats_av"),], aes(x=reorder(event_type,seq), y=Number, fill=Statistic)) +geom_bar(position=position_stack(reverse=TRUE), stat="identity") +theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +xlab("Event type") +ylab("Average number of people\n affected per incident") +scale_fill_discrete(name="Key",labels=c("Fatalities","Injuries"))
For economic damage, produce a similar subset and totals/averages.
stms_eco <- stms[stms$PROPDMG !=0 | stms$CROPDMG != 0,]
stms_eco <- droplevels(stms_eco)
event_type_eco <- sort(unique(stms_eco$EVTYPE))
prop_sum <- as.numeric(tapply(stms_eco$PROPDMG,stms_eco$EVTYPE,sum))
crop_sum <- as.numeric(tapply(stms_eco$CROPDMG,stms_eco$EVTYPE,sum))
prop_av <- as.numeric(tapply(stms_eco$PROPDMG,stms_eco$EVTYPE,mean))
crop_av <- as.numeric(tapply(stms_eco$CROPDMG,stms_eco$EVTYPE,mean))
df_eco <- as.data.frame(cbind(prop_sum,prop_av,crop_sum,crop_av))
df_eco <- cbind(event_type_eco,df_eco)
df_eco$total_sum <- df_eco$prop_sum + df_eco$crop_sum
df_eco$total_av <- df_eco$prop_av + df_eco$crop_av
Create tidy dataframe for total costs, order by total cost for crop and property and select top 20.
df_costs <- df_eco %>% arrange(desc(total_sum))
df_costs_tidy <- gather(df_costs[1:20,c(1,2,4)],Statistic,Number,c(prop_sum,crop_sum))
df_costs_tidy$seq <- rep(1:20,2)
Prepare plot
plot3 <- ggplot(df_costs_tidy, aes(x=reorder(event_type_eco,seq), y=Number, fill=factor(Statistic,levels=c("crop_sum","prop_sum")))) +geom_bar(stat="identity") +theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +xlab("Event type") +ylab("Total costs across all years /$") +scale_fill_discrete(name="Damage",labels=c("Crops","Property"))
Create tidy dataframe for average costs, order by total average cost for crop and property and select top 20.
df_av_costs <- df_eco %>% arrange(desc(total_av))
df_av_costs_tidy <- gather(df_av_costs[1:20,c(1,3,5)],Statistic,Number,c(prop_av,crop_av))
df_av_costs_tidy$seq <- rep(1:20,2)
plot4 <- ggplot(df_av_costs_tidy, aes(x=reorder(event_type_eco,seq), y=Number, fill=factor(Statistic,levels=c("crop_av","prop_av")))) +geom_bar(stat="identity") +theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +xlab("Event type") +ylab("Average costs for each incident /$") +scale_fill_discrete(name="Damage",labels=c("Crops","Property"))
plot1
plot2
We can see that by far the greatest number of deaths and injuries over the period of this dataset is from tornadoes. Looking at the average number of people affected per incident however, we see that tsumanis are by far the most dangerous event, with hurricanes injuring the second highest number of people (though with relatively fewer fatalities).
This makes sense since tsunamis are so much rarer than tornadoes.
These plots are displayed in order of total damage. We see that over the period of the dataset, costs due to flooding (the majority of which is propery damage) are almost double those from anything else. Drought accounts for the most crop damage, but only appears 6th in total damage.
On a per-incident basis however, floods are not even in the top 10. Hurricanes cause double the amount of damage of the second biggest contributor (storm surges) which are themselves over twice as costly per event as anything else. The majority of this damage is to property, with draughts again responsible for the largest average cost to crop damage.
Although the aggregation of event types into a far smaller subset may have reduced the granularity of the analysis, it has enabled a clear picture to be established.
The greatest human cost over time is from tornadoes, whilst individually tsunamis are far more devastating.
In terms of the economic cost, flooding has taken the greatest toll over time, whilst an individual hurricane is likely to do more damage than any other event.
Further consideration of event types could result in even more aggregation, which might affect these broad conclusions. e.g the distinctions between tropical storms, typhoons, hurricanes or snow, cold and ice.